Never use SQL again for ETL – Pentaho Data Integration reusable flows instead

Since Pentaho is free and PDI is so easy to use, our team no longer writes any stand-alone SQL as part of our extract, transform, and load (ETL) process. Many times we have decided that we only need to extract the data ‘one time’. Inevitably, we need to run the extract again, or run something very similar. Using PDI takes no extra effort, and the transformation flow we build in the tool doubles as documentation. We often have to rework data after a run because we find additional data or extra rules that need to be applied. Because we start with PDI, we have a simple way to hand the work to a different team member or to recall the steps that we took.

As a final benefit, we find that once our developers learn the basics, PDI makes the ETL work faster even the first time. Here is a sample of how PDI can be used to extract data from two sources, merge them, and add data to create a new dataset, while documenting any data scrubbing issues that remain. Of course, the data may come from any database or other data format, and the job can be run on demand or scheduled. For more information on Pentaho PDI, please visit http://infocenter.pentaho.com/help/index.jsp?topic=%2Fgetting_started_with_pdi%2Ftopic_introduction.html
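
As a rough illustration of the logic in the sample flow (two inputs, a merge, an added field, and a separate output for rows that still need scrubbing), here is a minimal Python sketch. It is not PDI code, since PDI flows are built graphically. The field names, the sample rows, and the `merge_sources` function are all hypothetical and exist only to show the shape of the work.

```python
# Illustrative only: the logic of a two-source merge flow in plain Python.
# Every name and value here is hypothetical; in PDI each part of this
# function would be a step on the transformation canvas instead.

def merge_sources(customers, orders):
    """Join orders to customers on customer_id.

    Returns (merged_rows, issue_rows). Orders with no matching customer
    are not dropped; they go to issue_rows with a note for later scrubbing.
    """
    by_id = {c["customer_id"]: c for c in customers}
    merged, issues = [], []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            # Document the scrubbing issue instead of silently losing the row.
            issues.append({**order, "issue": "unknown customer_id"})
            continue
        row = {**customer, **order}
        # Added data: a calculated field that exists in neither source.
        row["total"] = round(order["quantity"] * order["unit_price"], 2)
        merged.append(row)
    return merged, issues


# Hypothetical sample data standing in for the two sources.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "quantity": 3, "unit_price": 2.50},
    {"order_id": 11, "customer_id": 9, "quantity": 1, "unit_price": 4.00},
]

merged, issues = merge_sources(customers, orders)
```

In a PDI transformation, the lookup, the calculated field, and the issues output would each be a visible step, which is what lets the flow serve as its own documentation.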