The task of validating a new data warehouse is a necessary evil. It’s tedious and time-consuming, but if done properly it will result in a stable and robust data warehouse that is the basis for your business intelligence solution. Here are a few things to consider when taking on the task of validating a new data warehouse:
1. What data sources are driving the creation of your new data warehouse? You will likely have multiple source systems such as a point-of-sale system, a human resources/payroll system, an inventory system, or other industry-specific systems that will be included in your data warehouse. In many cases you will also have your current/old data warehouse and a data-staging database, or a reporting database. You may choose to pull history data from the latter sources if your main source systems clear data that is older than x number of days. This is what I experienced when working with a source system that lets data fall off after two months. The needed data was already archived in the current data warehouse, so those records were brought into the new data warehouse once and new records were added directly from the source system.
2. Do these source systems have data integrity issues or other areas of weakness? The opportunity to build a new data warehouse includes the opportunity to clean up dirty data and fix structural and design issues that didn’t quite correlate with your business processes previously. A data quality specialist will be able to cleanse the dirty data in preparation for your new data warehouse. When it is time to compare the data in your new data warehouse be sure to consider the fields and files/tables that you know have un-clean data. If you can identify these areas ahead of time then you will be prepared to give the new data warehouse special consideration when validating these areas.
3. Who is part of your validation team? Even if contractors developed your data warehouse for you, your in-house employees should always be the ones to validate the data. In my experience the in-house team was made up of the data team and a server administrator. The data team had experience with both the old data warehouse and the source systems and the server administrator was an expert in one of the source systems. The team also involved other employees when investigating information that was job-specific to that employee.
4. We’re a team, now what do we do? There are many different ways to document the validation progress, but I have found that a simple spreadsheet will do the trick. List all of the tables in your new data warehouse and assign a team member to a table based on prior knowledge and previous experience. You should place this document on a sharing site such as SharePoint, Google docs, or even a shared network folder to allow both your data validation team and the data warehouse developers to access it. As you find data integrity issues record them on the document in the corresponding table row. Your data warehouse developers will be able to view the issues and update the item as being fixed. This back and forth correspondence will continue until all of the tables have an “approved” status.
5. Now that you have a healthy level of confidence in your data warehouse, begin writing reports from the data. This will allow you to validate more complex and aggregated data that will challenge the integrity of your data even more. I suggest duplicating reports that you know are currently being used in your company and comparing the results.
Like I said, tedious, but necessary. When you have proven the data to be accurate your users will have confidence in the data which will improve user adoption and make your job much easier in the long run.