Data Quality Assurance – Will the Real John Smith Please Stand Up?

As humans we have the ability to scan multiple segments of data and make logical conclusions about the relationships that records have with each other. As an example, if we see the four records below we can easily conclude that all four records are indeed the same person.

Data Quality Assurance

Data quality assurance is the process of cleansing or scrubbing your data as it is extracted from your source systems and before it is inserted into your data warehouse. The ultimate goal is to remove “dirty” data such as duplicate or incomplete records. Multiple tools can and should be used to ensure that your data warehouse is as accurate and clean as possible.

If you data includes addresses you may choose to cross reference the city or province with the zip code or postal code. Many companies sell databases that include updated address information that can be used as a standard comparison for your address data.

Fuzzy logic can be used to merge duplicated records like the ones that appear in the example above. We can see that records one and four are a 100% match, so these can be easily merged. When comparing records one and two we see that two includes a middle initial and has an address discrepancy.

This is where your business rules will come into play. You may decide to disregard middle initials in your data cleansing process, or you may consider that the name field is one letter off and deduct 5% from the data match. Seeing that the address is one number off may also deduct the match score by another 5%. So the data quality score for records one and two comes to 90%. You will need to set your threshold to determine your business rules for merging two records. If your threshold is at 85% then these two records will be merged.

Now the question is to consider which record has the accurate data. This example is relatively simplistic because we do have two identical records. It is probably safe to say that the two 100% matching records have the most valid data. We can also look to see that the same birthday appears three out of four times, so we can safely conclude that the correct birthdate is 11/29/1964.

Data quality assurance is one of the most important things that will be used to construct your data warehouse and you shouldn’t underestimate the importance of investing both time and money into this portion of developing your business intelligence solution.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s