Your analysts found some data that doesn’t quite look right in your data warehouse and ask you to investigate. You look into the data in more detail and find that the data is indeed incorrect. This is quite alarming since analysts and executives have been making business decisions based on this flawed data. In addition, as the data warehouse administrator you probably take a lot of pride in providing users with quality and reliable data. It is undesirable to run the risk of losing user confidence in the data. Now it is time to dissect the ETL process and determine what went wrong, but where to begin?
- Address the issue – It is important that you inform the people accessing this data that it is incorrect. Any analysis or decision making processes that use the bad data should be discontinued until the data is corrected.
- Reverse engineer – This will be much easier if you were the architect who designed the data warehouse. It will also be a simple process if detailed documentation exists for your data warehouse. Unfortunately this might not be the scenario you are facing.You will first need to trace the data back to your data staging area by reviewing the ETL package that moves data from data staging to the data warehouse.
From there you will need to review the ETL process that moves data from your source systems into your data staging area.
The next step will require you to review your source system data.
- Now ask questions – The next step is to analyze what you have found in your reverse engineering process.Is the source system data wrong? If this is the case you may need to investigate the possibility of data entry mistakes by users, or bugs in the source system. Have any updates or enhancements been made to the source system that could cause this error?
If the source system appears to be correct then you can conclude that the issue lies somewhere in the data warehouse build. Is your data staging area correct? In most cases your data staging area will be straight data pulls from the source system without any major data transformations, but this possibility should still be investigated.
Finally the last step, and the most complicated, is to begin sifting through the intricate ETL package and data quality process that populates your data warehouse. Hopefully this is where you will find your ETL error. If this investigation is unsuccessful however you may want to consider other external processes that alter your data warehouse. For example, is there a scheduled job that runs separately from the initial data warehouse build?
- How far back does the bad data exist? – If the bad data only started appearing in recent months or days then you should investigate what changed that may have caused this unexpected change. If the data has always been incorrect you can assume that the data warehouse architect simply made a mistake.
- Correct or move forward? – Depending on the type of data and the extent of the damage you may opt to correct the historical data or it may be more efficient to correct the data for the future and chalk the old data up to a lost cause. If you choose to only correct the data moving forward then I suggest truncating the field so that future analysis will not include the bad data.
If you are facing the task of debugging your data warehouse I hope that this has been helpful. It can be an overwhelming task (especially when others are not exactly happy) but take a deep breath and just begin the process while focusing on one step at a time.