ETL Performance Troubleshooting with Pentaho Data Integration

System monitoring, memory, partitioning, CPU balancing and disk arrays are all part of making the ETL performance of your Pentaho Data Integration environment fast and efficient. System monitoring is a critical first step and a catch-all for performance issues. As the amount of your data grows, processing it becomes more of a challenge and can significantly slow your system. Do you have the right processes and tools in place to quickly assess data access issues?

There are a few basic things to address at the start of the performance troubleshooting process. Making sure you have enough memory throughout the process demands attention; memory should be checked at regular intervals to confirm it is being used efficiently and that enough is allocated for the task. Disk partitioning is another area that can be problematic. Are your disks partitioned correctly to allow load balancing to occur properly and your recovery to work as planned in the event of a system outage? When it comes to storage, are your disk arrays functioning the way they should, and are you testing them regularly to ensure data is being stored in the manner you intended? It is vital to your organization and your clients to test your storage capabilities on a regular, monthly or bi-monthly basis so that the information being managed is not lost in the event of a hardware failure. As for the quick-check items, you should review the latest version, known problems, and patches for Pentaho Data Integration.

After you have performed the basic system level items, then it is typically beneficial to look at the specific functions and techniques that have been used for the extraction and integration program. Here are a few items to consider:

Hitting Database Too Many Times

Alternative: create a ‘cache’ table once, then consume that data repeatedly and eliminate the extra database hits. Of course, this only pays off when the run time of building the one-time ‘cache’ table is offset by the savings of the repeated executions.
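The cache-table pattern above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the table and column names (`orders`, `region_cache`) are made up for the example and are not part of PDI itself.

```python
import sqlite3

# Stand-in source database with some detail rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 20.0),
                              (3, 'east', 5.0);
""")

# The expensive aggregation runs ONCE into a 'cache' table...
conn.execute("""
    CREATE TABLE region_cache AS
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
""")

# ...and every repeated lookup reads the small cache instead of
# re-running the aggregation against the detail rows each time.
def region_total(region):
    row = conn.execute(
        "SELECT total FROM region_cache WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else 0.0

print(region_total("east"))   # 15.0
```

The trade-off is exactly the one noted above: the one-time cost of building `region_cache` must be smaller than the total cost of the repeated aggregations it replaces.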

Speed versus Dynamic

When to store versus when to calculate is an ongoing challenge when extracting data for analysis. Many factors can affect the decision: user needs, network limitations, and the amount of data. But there should not be a hard and fast rule throughout the process, since business needs change. Speed is a major factor in this decision, but not the only one. If more network resources are needed to generate data outputs, there is a cost associated with them, so the ROI and importance of the data must be considered.
If reports take 5 seconds to generate and you need to move fast, should you have temp storage?
Example: the total record count is high (millions) and/or the number of detail reports is very high (hundreds or thousands).
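The store-versus-calculate trade can be sketched as two options side by side. This is an illustrative example only, with invented table names (`detail`, `summary`) and a small SQLite dataset standing in for millions of rows.

```python
import sqlite3

# Stand-in detail table: 100,000 rows, 1,000 per 'day' bucket.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?)",
    [(d % 100, 1.0) for d in range(100_000)],
)

# Option A: every report recalculates its aggregate from the detail rows.
def report_calculated(day):
    return conn.execute(
        "SELECT SUM(amount) FROM detail WHERE day = ?", (day,)
    ).fetchone()[0]

# Option B: store the aggregate once; each report becomes a keyed lookup.
conn.execute(
    "CREATE TABLE summary AS "
    "SELECT day, SUM(amount) AS total FROM detail GROUP BY day"
)
def report_stored(day):
    return conn.execute(
        "SELECT total FROM summary WHERE day = ?", (day,)
    ).fetchone()[0]
```

With hundreds or thousands of reports over millions of detail rows, Option B amortizes the one-time summary build across every request; with a handful of reports over changing data, Option A stays simpler and always current.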

Pulling Data From Multiple Sources

When to merge and when to pull from separate databases is also a major consideration. Mixing databases slows queries, and pulling from multiple databases can require a lot more calculation, which also slows your process. Consider how many times you need to access each database, how much maintenance and management is required, and the ability of your network to process the workflow.
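One way to limit cross-database traffic is to pull each source once and join in the integration layer, rather than issuing per-row lookups across systems. A minimal sketch, with two in-memory SQLite databases (`crm`, `billing`) standing in for distinct source systems; all names here are invented for the example.

```python
import sqlite3

# Two separate stand-in sources.
crm = sqlite3.connect(":memory:")
crm.executescript(
    "CREATE TABLE customers (id INTEGER, name TEXT);"
    "INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');"
)
billing = sqlite3.connect(":memory:")
billing.executescript(
    "CREATE TABLE invoices (customer_id INTEGER, amount REAL);"
    "INSERT INTO invoices VALUES (1, 100.0), (1, 50.0), (2, 75.0);"
)

# Pull each source ONCE...
names = dict(crm.execute("SELECT id, name FROM customers"))
totals = {}
for cid, amount in billing.execute(
    "SELECT customer_id, amount FROM invoices"
):
    totals[cid] = totals.get(cid, 0.0) + amount

# ...then merge locally instead of querying across databases per row.
merged = {names[cid]: total for cid, total in totals.items()}
print(merged)   # {'Acme': 150.0, 'Globex': 75.0}
```

The same idea applies inside PDI: fewer, larger reads per source, with the merge done in the transformation, generally beats many small cross-database lookups.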

As you list options to consider while troubleshooting your PDI program, the items above are a good place to start.



Learn More

Three reasons measuring improves our results


     1. Clarify what is important

     2. Understand how we compare

     3. Justify financial rewards



So you want to know how you measure up. The old saying that you don’t know where you are going until you understand where you have been is a reality when it comes to you and your business. The great literary detective Sherlock Holmes states impatiently “Data, data, data; I cannot make bricks without clay!” This ‘brick making’ is just as true when it comes to your business and how you measure yourself or your organization. Having data that matters helps you know where you are and then chart a course to where you need to be.

Measurements provide management clarity

Top performers, you will find, have figured out what is important to their success and what data to measure, and they do so consistently, making it part of their daily routine. These high achievers either have an instinctive knack for choosing the right activities or focus deliberately on the key work that makes a difference. Inevitably they perform the work that gets results. They welcome measurements that are results oriented: their key performance indicators. When people see the consistent application of measurements, they can easily ascertain what managers expect of them. Of course, the old adage applies here: be careful what you wish for.

How do I measure up?

How do you compare to others? When the bottom line is not what it needs to be, top performers turn to those who are successful to learn what they are doing and how to emulate it. Borrowing ideas from successful people is never a bad thing; just give credit where credit is due. While there may be differences in regions, service lines and other factors, having comparable numbers is an objective way to compare and contrast. Simply knowing how others are performing shows possibilities and provides a basis for insightful questions. A shared set of numbers is, by definition, a form of collaboration: everyone is sharing their results. Having a tool to analyze and learn from facilitates improvement. As with any tool, a balance between positive and negative feedback is necessary. Objective information can actually increase the human factor, because it removes emotional and subjective feedback and keeps the focus on how to improve.

Financial results matter

Financial results – profits – in business it all boils down to the bottom line! As we consider our motivators, we cannot avoid the profits that drive so many of the financial considerations. While we all want a great environment and satisfying work, we must achieve target financial results to maintain jobs, security and standard of living. As we consider the desired outcomes – pay increases, better benefits and better equipment – we certainly make a stronger case when increased profits are part of the plan.

Getting started

Many argue that measuring your work is a required part of any management process to maximize the potential of your company. Your data is simply nuggets of gold that need to be mined and processed into good Business Intelligence, allowing you to make the quality decisions that drive your business goals.

Today, more businesses are using Big Data to mine that performance gold. For organizations with many outlets, repetitive operations or high-value work, there is almost always plenty of data to dig through. This high-volume situation is now called Big Data, a term that often implies analytics for large businesses only. Big Data has many meanings, including the sheer amount of data; others use it for the data that drives big results. We suggest not worrying about how much is ‘big’, but rather about what matters for your results. Get started. Many organizations continue to fine-tune and learn over months and even years. There are so many benefits to measuring and sharing that we suggest you pick a few key metrics and share them consistently and openly with your company. Watch and learn, then grow your measurements. As your organizational capabilities grow, increase your measurements to enable more learning and celebrate success.

Dashboards make it easy to share

One way to share key data is to create dashboards that enable relevant views such as company-wide, region, district, factory, store, city, department, line, product or team. A dashboard can be a chart in an email, spreadsheet, PDF, web page or a dynamic tool built by professionals. There are numerous tools available, such as Pentaho CTools with its open source dashboards. Make sure the data is succinct and accurate. Make a plan, but get started and share the data. You have bricks to make!

Learn More

One year old

We are delighted to announce the upcoming first birthday of the newest addition to the BandyWorks family. Vanessa Vinodh Kumar was born March 4, 2013. Mom and Dad are as proud as can be!

Learn More

One fast ETL tool – Pentaho PDI

Extracting, transforming and loading data is a necessary part of many Big Data jobs.

We use tools to do it faster and cheaper, moving the work along to the dashboards and analytics that get results.

Read The Full Blog Here

Learn More

Case Study: Get to the root of problems

The Latinos have coached accountability for two generations. Since the ’70s, the Reliability Center has guided clients to the source of problems to proactively eliminate errors and increase production.

BandyWorks Big Data team helps to manage the software to make this happen.

Read The Full Case Study Here

Learn More

Three results that define ‘Big’ in Big Data

Matthew Shoup of LinkedIn provides an insightful way to interpret the ‘Big’ in Big Data. No one cares about the ‘Big’ unless there are results. This approach puts the focus where it matters.

Read More From His Post Here

Learn More