December 21, 2008

Olivier CARBONE

Talend Certification Program

In November, we launched the Talend Certification Program. The certification is awarded to individuals who pass a comprehensive online exam covering all aspects of the use of Talend Open Studio.

Clients trusting a systems integrator to implement a solution or seeking to hire an individual want to be reassured that the consultants are indeed experts in the technology. Talend certification gives them this level of assurance.

A number of technical staff members from several Talend Alliance Partners have already received certification, including consultants from Altic, Dmoon, Elosi, Lunexa, Micropole Univers, Napstec, Open Wide, Systech, and Umanis. And you? Learn more about how to get certified!

Fabrice Bonan, COO and co-founder of Talend, explained: “The certification exam was developed and fine-tuned by consultants who are exposed to the many different ways our solutions are used on a daily basis. It’s not an easy exam (you need an in-depth knowledge of Talend Open Studio to pass), but it is delivered in a fair and consistent way, and guarantees that certified individuals have mastered open source data integration.”

Exam Preparation

Read the User Guide and use Talend in a real data integration project. Talend offers two 3-day courses to help you build your knowledge and prepare for the exam. Refer to the Talend Training section for details and the schedule of public classes.

 

Taking the Exam

The exam lasts about half a day and must be taken online.

The exam requires you to complete a series of tasks. TIP: Since the test is timed, you may not have the luxury to rummage around trying to figure out an answer. Be prepared before you schedule the exam. Know Talend inside and out.

To gain your certification, you must pass the exam. Here is an extract of it:

  • My job contains two components on a green background (startable) and I checked the option Multithread. Which of my components runs first?
  • What are the criteria for choosing between the ETL and the ELT modes of Talend?
  • I have created a variable FileNameParam which contains the name of my parameter file PARAM.txt. How can I use it in my program so that it is ready to be deployed to production?
  • In a tJavaRow, how do I fill the column “name”?

Benefits of Talend certifications

Talend certification gives you a deeper understanding of the planning, design, and maintenance of Talend products and technologies. It can also bring greater self-confidence, better pay, and better opportunities for promotion. Talend Certified Partners also receive invitations to Talend conferences, technical training sessions, and special events.

Attaining a Talend Open Studio Certification provides proof of your ability to create ETL and data integration jobs using all the tools and features of Talend Open Studio. Talend certification can increase your earning and advancement potential. For these reasons, Talend provides an HTML snippet to embed the certification in your resume, your social network profile, or your website:
Click here to see my Diploma :)

Source: ocarbone.free.fr

by Olivier at December 21, 2008 06:18 PM

December 02, 2008

Olivier CARBONE

ETL Benchmarks Under the Creative Commons License

Manapps, a Systems Integrator, has just published this benchmark report.

This report summarizes a number of benchmarks of five leading data integration tools:

  • Talend Open Studio 2.4.1
  • Informatica 8.1.1
  • DataStage PX 7.5
  • DataStage Server 7.5
  • Pentaho Data Integrator 3.0.0

This document presents 11 test scenarios. For each scenario, the author shows how he built the integration process in each tool, and then presents detailed execution statistics.

Below is an extract of this document. The scenario is: “Reading X lines from Oracle input tables and writing to another Oracle output table (ELT mode) after some changes”.

[Screenshot: Talend Open Studio]

[Screenshot: Pentaho Data Integrator]

[Screenshot: DataStage Server]

[Screenshot: DataStage PX]

[Screenshot: Informatica]

Now, the execution times for 100,000 lines, 500,000 lines and 1,000,000 lines:

Benchmark Extract

 

This looks good for Talend Open Studio :) The benchmark is published under the Creative Commons License, so feel free to share it!

Download the Benchmarks - PDF

 

About Manapps

Manapps, a member of OmegahighTech, is a consulting and services company that has based its rapid success on a strong specialization in New Technologies, CRM (Customer Relationship Management), MDM (Master Data Management) and Business Intelligence.

Source: ocarbone.free.fr

by Olivier at December 02, 2008 10:39 PM

November 30, 2008

Olivier CARBONE

5′ video about Talend Open Studio 3.0

A new version of the 5-minute demo is out! After a delimited file is declared in the metadata, the video shows the design of a job that reads the data in that file. The tMap component and the Expression Builder are used to transform the data.

In a second step, a MySQL connection is declared and a join between the two sources is designed in the tMap.

In the last minute, the video shows the SQL Builder, and the data is written to a MySQL table.

5' Demo: discover TOS highlights in 5-mn video

Incidentally, rapid learning often relies on this kind of video. Fast and friendly, this medium is a good way to convey information to an expert team ;)

 

Tech News

First open source connectors for SAP

Version 3.0 of Talend Open Studio includes native connectors for SAP.
Extract data from SAP for business intelligence projects, migrate to and from SAP, and synchronize SAP with other applications.

Compiere connector available in the Ecosystem

Connect to the leading open source business software solution (ERP/CRM) with the Compiere connector, contributed by the community through the Talend Ecosystem.

New set of connectors for Google Apps

Discover in the Talend Ecosystem a new set of connectors for Google Apps. Check them out: gaEmailList, gaAccount, and a few others.

New connector for SAS

Extract data from SAS databases, and integrate SAS with other applications and databases in your information system.

Source: ocarbone.free.fr

by Olivier at November 30, 2008 04:20 PM

November 04, 2008

Sebastiao CORREIA

Datamining type


In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kinds of data. These data mining types help Talend Open Profiler choose the appropriate metrics for the column.

Nominal (sometimes also called “categorical”) means that the data can serve as a label. For example, the type of a column called “WEATHER” with the values “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to the nominal data mining type.

But data such as “52200”, “75014” can be nominal too, although they are represented by numbers. A column called “POSTAL_CODE” could have these values. It is clear to the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data makes no sense. In that case, the user should set the data mining type of this column to “nominal”, because Talend Open Profiler currently has no way to guess the correct type automatically.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.
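
As an illustration of why the type matters, here is a minimal Python sketch. It is not Talend Open Profiler code, and the sample values are made up. It shows that an average computed on postal codes yields a number that identifies no place, while a frequency count, which suits nominal data, stays meaningful.

```python
from collections import Counter
from statistics import mean

# Hypothetical POSTAL_CODE column: digits only, yet each value is a label.
postal_codes = ["52200", "75014", "75014", "13001"]

# Treated as "interval" data: the result is a number, but not a postal code
# that appears anywhere in the column.
numeric_mean = mean(int(code) for code in postal_codes)

# Treated as "nominal" data: frequencies are a meaningful summary.
frequencies = Counter(postal_codes)
most_common_code, occurrences = frequencies.most_common(1)[0]

print("mean of the codes:", numeric_mean)
print("most frequent code:", most_common_code, "seen", occurrences, "times")
```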

The “interval” data mining type is used for numerical data and time data. Differences between two values and averages can be computed on this kind of data. In databases, numerical quantities are sometimes stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. This is not implemented yet because there is currently no interface that lets the user specify the format of the data, but the feature is planned for a future release.

The other two data mining types are not standard ones. In data mining, one more commonly finds the types “ordinal” and “ratio”.

We left those two out because the indicators currently available in Talend Open Profiler do not need them. Instead we added two other types. The first, “unstructured text”, handles free-form textual data. For example, a column “COMMENT” which contains text is not nominal data, but we could still be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other”, which designates data that Talend Open Profiler does not yet know how to handle.

      

by scorreia at November 04, 2008 08:42 PM

October 28, 2008

Sebastiao CORREIA

How to compute a median in SQL


In Talend Open Profiler, we generate SQL queries to get statistical information. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless, this indicator is worth computing because it is more stable than the mean (average). By stable, I mean that it is less influenced by extreme values, whereas the average can vary a lot when extreme values exist.
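
To make the stability argument concrete, here is a small Python sketch with made-up salaries. Adding a single extreme value moves the mean a long way and the median only slightly.

```python
from statistics import mean, median

# Made-up salaries, for illustration only.
salaries = [2000, 2100, 2200, 2300, 2400]
with_outlier = salaries + [100000]

print("mean:  ", mean(salaries), "->", mean(with_outlier))
print("median:", median(salaries), "->", median(with_outlier))
```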

I found several ways to compute the median, depending on the database type. The simplest case is Oracle 10g, which provides a MEDIAN function, so the query is just:
SELECT MEDIAN(salary) FROM employee

But for other databases, things get trickier. Let’s take MySQL first. One way to compute the median is the following:
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT p, n) T

where p is the number of rows to skip and n the number of rows to keep: p=N/2-1 and n=2 when the number of non-null rows N is even, or p=(N-1)/2 and n=1 when N is odd.
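
The following Python sketch shows one way to derive the two values and check the result. It makes several assumptions: SQLite (through the standard sqlite3 module) stands in for the database, so the LIMIT n OFFSET p spelling is used; the helper names median_window and sql_median are hypothetical; and the row count is fetched with a separate query first. It is not the code generated by Talend Open Profiler.

```python
import sqlite3

def median_window(row_count):
    """Return (offset, count) selecting the middle row(s) of row_count sorted rows."""
    if row_count % 2 == 0:
        return row_count // 2 - 1, 2   # even: average the two middle rows
    return (row_count - 1) // 2, 1     # odd: take the single middle row

def sql_median(conn):
    # COUNT(salary) ignores NULLs, matching the WHERE clause below.
    (row_count,) = conn.execute("SELECT COUNT(salary) FROM employee").fetchone()
    if row_count == 0:
        return None
    offset, count = median_window(row_count)
    query = (
        "SELECT AVG(salary) FROM ("
        " SELECT salary FROM employee"
        " WHERE salary IS NOT NULL"
        " ORDER BY salary ASC"
        " LIMIT ? OFFSET ?) T"
    )
    return conn.execute(query, (count, offset)).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (salary REAL)")
rows = [(3000,), (1000,), (4000,), (2000,), (None,)]
conn.executemany("INSERT INTO employee VALUES (?)", rows)

even_result = sql_median(conn)   # four non-null rows
conn.execute("INSERT INTO employee VALUES (5000)")
odd_result = sql_median(conn)    # five non-null rows
print(even_result, odd_result)
```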

For PostgreSQL, the query is similar to the MySQL query and uses LIMIT too:
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT n OFFSET p) T

This query can also be used on MySQL, but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exist and we must use the PERCENTILE_CONT function:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employee

For DB2, the following query is used in Talend Open Profiler:
SELECT AVG(salary) FROM (
SELECT salary,
COUNT(*) OVER( ) total,
CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid,
CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next,
ROW_NUMBER() OVER ( ORDER BY salary) rn
FROM employee
) x
WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )
OR
( MOD(total,2) = 1 AND rn = next )

For Microsoft SQL Server, we used the TOP clause as follows:
SELECT AVG(CAST(salary AS NUMERIC)) FROM (
SELECT TOP n salary FROM (
SELECT TOP m salary FROM employee
WHERE salary IS NOT NULL ORDER BY salary ASC
) AS FOO
ORDER BY salary DESC
) AS BAR

where n and p are given as in the MySQL case (n rows are kept after p rows are skipped) and m=n+p.
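
The double TOP can be hard to picture, so here is a Python sketch that mimics it on an in-memory list. It assumes that n is the number of middle rows and p the number of rows before them. The function name median_by_double_top is hypothetical and no SQL Server is involved.

```python
def median_by_double_top(values):
    """Mimic: TOP n (descending) of the TOP m (ascending) rows, then AVG."""
    non_null = [v for v in values if v is not None]
    total = len(non_null)
    if total == 0:
        return None
    n = 2 if total % 2 == 0 else 1                               # middle rows to keep
    p = total // 2 - 1 if total % 2 == 0 else (total - 1) // 2   # rows before them
    m = n + p
    inner = sorted(non_null)[:m]             # SELECT TOP m ... ORDER BY salary ASC
    outer = sorted(inner, reverse=True)[:n]  # SELECT TOP n ... ORDER BY salary DESC
    return sum(outer) / len(outer)           # AVG(...)

print(median_by_double_top([3000, 1000, 4000, 2000, None]))
print(median_by_double_top([5, 1, 3]))
```

The inner step keeps everything up to and including the middle rows, and the outer step then keeps only the last n of them, which are the middle rows.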

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT AVG(CAST (salary AS NUMERIC)) FROM (
SELECT DISTINCT salary FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) STT
WHERE
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SOU
WHERE SOU.salary <= STT.salary)
AND
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SUR
WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.
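
The idea behind this query, as I read it, is that a value belongs to the median when at least half of the rows are less than or equal to it and at least half are greater than or equal to it. The UNION ALL doubles the rows so that the comparison with N avoids dividing by two. The Python sketch below restates that logic on a list. The function name median_by_counting is hypothetical, and NULL values are filtered out explicitly here.

```python
def median_by_counting(values):
    """Average the distinct values that have at least half the rows on each side."""
    non_null = [v for v in values if v is not None]
    total = len(non_null)
    candidates = {
        x
        for x in non_null
        # total <= 2 * count mirrors counting over the doubled (UNION ALL) rows.
        if total <= 2 * sum(1 for v in non_null if v <= x)
        and total <= 2 * sum(1 for v in non_null if v >= x)
    }
    if not candidates:
        return None
    return sum(candidates) / len(candidates)

print(median_by_counting([1000, 2000, 3000, 4000]))
print(median_by_counting([1, 2, 2, 3]))
```

Like the SQL version, this compares every row with every other row, so the cost grows quadratically with the number of rows.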

The median can also be computed in other ways, for example with temporary tables or cursors. But Talend Open Profiler must only use SELECT statements, because a data profiler may not have permission to create a table in a database, and the use of cursors is too complex for this tool.

      

by scorreia at October 28, 2008 09:46 PM

Copyright © 2006 - 2009 Talend. All rights reserved. Talend Contributor Agreement