From Data Mart to Data Warehouse

Bottom-Up Data Warehousing

Have you been tasked with building a new data warehouse for your company? Is the existence of the data warehouse your only requirement? Are you having nightmares about the scale of this task and the resources that will be needed? If so, you may want to consider a bottom-up approach to building your data warehouse.

From Data Mart to Data Warehouse

Rather than spending months gathering all of your requirements and then building one massive solution (using standard waterfall development), a bottom-up approach begins with basic data marts designed for the functional areas of your business.

Develop Functional Data Marts

Data marts can be designed and developed in a relatively short amount of time, which means your business users can start benefiting from reporting solutions in a matter of weeks rather than months. You will also gain many of the benefits expected from an agile development style.

Document Exposed Requirements

One of these benefits is gathering data warehouse requirements through exploration. Users will be able to explore your reporting solution, and they will be excited to browse the data using self-serve tools. Be sure to pay close attention to the questions your users begin asking; these questions are sure to become requirements for your data warehouse solution.

Once users are comfortable, I recommend encouraging forward thinking by prompting "what if" questions. For example: "What if you could view sales trends on the same chart as product enhancement releases? Would that data be useful?" You may choose to do this in official requirements meetings or simply in short conversations as you deliver new reports.

Create Conformed Facts and Dimensions

This step should not happen until well after users have successfully adopted your reporting technologies. Quality documentation is key to accomplishing it. Each existing data mart will need to be cross-referenced to identify commonalities among the measures and dimensions in your various data marts.

Conformed dimensions and facts create the consistency that forms the foundation of your data warehouse. Ensure that each measure has an identical definition in every data mart where it appears.

Data Warehouse Development

Armed with your detailed requirements documents and data mart documentation, you are now ready to establish an official data warehouse. You have already done the heavy lifting with your data mart ETLs; now you simply build the structure that merges data from the multiple functional areas of your business. This provides the much sought-after "one version of the truth" that so many companies are seeking.


Thanks for reading!



Business Intelligence User Adoption: Baby Steps

In the past I’ve written a few posts that discuss dealing with the challenge of user adoption when introducing a new business intelligence solution.

Recently I have encountered some scenarios that have challenged my ego and my desires to produce solutions that I feel are extensible and efficient. I continue to learn more about the psychology that is involved when users are asked to abandon their old processes and adopt a new (and in my opinion better) way to accomplish their tasks. So here is my dilemma: I can either continue with my stubborn ideas of implementing best practice solutions or I can bend (can you do the limbo?) and design a process that caters to the users’ comfort levels.

Which one to choose? You would need a significant amount of power and influence within your organization to override users' preferences for their work tasks (spanning many departments throughout your organization) and replace those processes with your own design and plan. And even if you do have that kind of clout, an iron-fist approach will likely just cause resentment and rebellion, resulting in little to no user adoption.

Baby steps… You've done your research and you see multiple ways that tedious manual processes can be streamlined, but your users aren't quite ready to make that leap. If you are like me, this may be challenging. You may be thinking, "Really, you want me to implement a design that I don't think is the best solution?" My humble reply is yes, sometimes. I swallow hard as I write this because, admittedly, I am still having trouble accepting this concept myself.

Selling your solution to the users in your organization is as much a psychological process as it is a technical one. Consider this scenario: users currently present KPI trends and results to executives based on manual entry into spreadsheets. They may incorporate some charts and complex macros that make the spreadsheet look sophisticated, but ultimately they are still relying on manual entry. To me this method screams risk: human error, potentially lost data, no historical storage, and difficult data retrieval… Ahhhhh!

From my perspective, the best solution for generating these KPIs in a consistent and reliable manner is to pull the data from the source systems into the data warehouse and then generate automated reports or dashboards for users and executives. (Seems obvious, right?) However, the reality is that for most end users this would be a dramatic transition, even if it ultimately means less work. They may also worry that their jobs will no longer be necessary. So, even though it may go against everything you feel is "right" about a business intelligence design, I recommend taking baby steps here.

First, you may want to try a simple solution that utilizes your data warehouse while still allowing the users to remain in complete control of the data that is entered and reported on. If they are used to a spreadsheet as their data entry point, one possible solution is to provide a SharePoint list that resembles their current data entry format. Allow the users to continue their manual entry while becoming familiar with the SharePoint structure (or whatever collaboration BI tools you are using). You can then extract this data into your data warehouse and generate reports and dashboards while also maintaining history.

The lesson: Don't take away all of the users' control at once. Trust is a slow process, and ultimately you need your users to trust the data that lives in your data warehouse. When they are able to create useful reports by pulling data from the data warehouse (data that they essentially entered manually), they will slowly grow an affection for the structure and, hopefully, over time begin to ask you how other processes can be streamlined. You may be the BI guru, but ultimately your users will be the ones who set the pace.

Dimensional Modeling

When designing your business intelligence solution, your data model should coincide with the way your business analysts interpret your business structure. The goal is to hide the complex technicalities, and you'll do this by creating an OLAP model that allows your analysts to write intuitive, structured queries. In short, your analysts should be able to query your data warehouse in the same way they would ask business questions.

Business intelligence is used to measure aggregations (facts) across dimensions, and this is the ideal model for the BI environment. The dimensional model is designed around specific business process areas. For example, you may decide to track sales amount as a fact and use dimensions such as product, store number, or sales date to give that fact meaning.

Measures, or facts, are numeric data that correspond to the questions your analysts are asking. Each fact points to records in dimension tables that give the data meaning. Common facts include dollar values such as sales and revenue.

Dimensions provide the anchor when investigating a fact. An analyst may wish to view sales on a specific date for a specific group of customers, which requires a query that links the sales fact table with the customer and date dimensions. Without dimensions, the facts would simply be unstructured numbers. The date and customer dimensions allow filtering, grouping, and labeling to narrow down query results.
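To make this pattern concrete, here is a minimal sketch in Python that simulates the star-schema query described above with plain in-memory structures. Every table and column name here (fact_sales, dim_customer, dim_date, SalesAmount, and so on) is an illustrative assumption, not a real schema; in a warehouse this would be a SQL join.

```python
# Illustrative star schema: one fact table keyed into two dimensions.
# All names and values are made up for this sketch.

dim_customer = {
    1: {"Name": "Acme Corp", "Region": "West"},
    2: {"Name": "Globex", "Region": "East"},
}

dim_date = {
    20120105: {"Day": 5, "Month": 1, "Year": 2012},
    20120210: {"Day": 10, "Month": 2, "Year": 2012},
}

fact_sales = [
    {"CustomerID": 1, "DateKey": 20120105, "SalesAmount": 100.0},
    {"CustomerID": 1, "DateKey": 20120210, "SalesAmount": 250.0},
    {"CustomerID": 2, "DateKey": 20120105, "SalesAmount": 75.0},
]

def sales_by_region_and_year(facts, customers, dates):
    """Join each fact row to its dimensions and sum the measure."""
    totals = {}
    for row in facts:
        region = customers[row["CustomerID"]]["Region"]  # customer dimension
        year = dates[row["DateKey"]]["Year"]             # date dimension
        key = (region, year)
        totals[key] = totals.get(key, 0.0) + row["SalesAmount"]
    return totals

print(sales_by_region_and_year(fact_sales, dim_customer, dim_date))
# {('West', 2012): 350.0, ('East', 2012): 75.0}
```

Notice that the fact table holds only keys and the numeric measure; the dimensions supply every label used to filter and group.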

Consider a fact table, FactAccount, in which the measures of a customer account are tied to the customer dimension. Each measure has its own aggregation rule, such as sum, average, min, or max; the aggregation rules are defined by the ways the business chooses to compare the values of the measure. A monthly average balance attribute is one example, though your business may be interested in a yearly average rather than a monthly one. In addition, an account rating attribute would be calculated using predefined business rules for determining an account rating. It is important that these rules are clearly defined and communicated to your analysts so that the true meaning of the data is understood.

A dimension table such as DimCustomer holds the details of a customer: the attributes that describe what the business wants to know about that customer. The customer dimension ties to multiple fact tables to provide a clear picture of the customer's interactions with the company. For example, the CustomerID in the customer dimension links directly to the FactAccount table described above.

Dimensional hierarchies define a set of parent-child relationships between attributes within a dimension. A dimension hierarchy includes levels that roll up from child to parent, and a dimension can have more than one hierarchy. A date dimension, for example, can include hierarchies such as "Day > Month > Year", "Day > Week > Year", and "Day > Month > Quarter > Year". The specific hierarchies you include depend on your internal business needs.
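As a small sketch of how a rollup works, the function below maps a calendar day to its member at each level of an assumed Day > Month > Year hierarchy. The tuple encoding of hierarchy members is purely illustrative.

```python
from datetime import date

def rollup(day: date, level: str):
    """Return the hierarchy member that `day` belongs to at `level`."""
    if level == "Day":
        return (day.year, day.month, day.day)
    if level == "Month":
        return (day.year, day.month)  # a day rolls up to its month
    if level == "Year":
        return (day.year,)            # and a month rolls up to its year
    raise ValueError(f"unknown level: {level}")

print(rollup(date(2012, 3, 15), "Month"))  # (2012, 3)
print(rollup(date(2012, 3, 15), "Year"))   # (2012,)
```

Grouping facts by `rollup(day, "Month")` instead of `rollup(day, "Year")` is what lets the same data answer questions at different levels of the hierarchy.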

I hope this has been a helpful explanation of dimensional modeling. Have a great day!

Exactly what is a Slowly Changing Dimension (SCD)?

Just the sound of the phrase "slowly changing dimension" can be intimidating when uttered in a technical meeting, but you might be shocked at how simple this concept really is. In fact, it has nothing to do with shifts or alterations in the universe at all.

If you are involved in data warehousing, you should have a firm grasp of SCDs. They are necessary for structuring your data in a meaningful way that takes advantage of the capabilities of a data warehouse, specifically when you wish to store historical data rather than simply overwriting a record. So, let's break it down:

Slowly Changing Dimensions: Dimensions in your DW that include attributes that gradually change over time.

There are three types:

Type I – Overwrite old values: The old record is completely overwritten by the new record. Type I is used when you have no desire to store the historical attributes of the record.

Type II – Create an additional record: The old record is kept and a new record is inserted. Type II is used when you wish to store the historical attributes of the record.

Type III – Create new fields: New fields are added to the record to maintain both the historical and the current values of the specific attributes being tracked.

Below are examples of how to use each type. In this example the price of a product is changing.

Type I: The old values are overwritten

Product price in 2011

Product price in 2012

Type II: A new record is created.

Product price in 2011

Product price in 2012

As you can see in the above example, there are a few problems with the structure of the table. When using a Type II SCD you will need to add a couple of basic fields to effectively track the historical records. There are several possibilities, but the idea is to be able to differentiate between the two records. You may also notice that there are two identical IDs for this record; that is OK. This is still the same product, so a common identifier is necessary, and we can narrow down our selection criteria by choosing the active record. A more realistic Type II record would look like the one below.

Product price in 2012

Type III: New fields are inserted

Product price in 2011

Product price in 2012

There are many business scenarios where each of these SCD types is useful. The most common are Types I and II, because Type III can store only a limited amount of historical data.
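To illustrate the Type II pattern described above, here is a minimal Python sketch. The field names (ProductID, Price, IsActive, EffectiveDate, EndDate) are illustrative assumptions; a production warehouse would typically add a surrogate key as well, and this logic would live inside your ETL.

```python
from datetime import date

def apply_type2_change(history, product_id, new_price, change_date):
    """Expire the active record for the product and insert a new one."""
    for row in history:
        if row["ProductID"] == product_id and row["IsActive"]:
            row["IsActive"] = False       # keep the old row for history
            row["EndDate"] = change_date
    history.append({                      # the new, current version
        "ProductID": product_id,
        "Price": new_price,
        "IsActive": True,
        "EffectiveDate": change_date,
        "EndDate": None,
    })

history = [{"ProductID": 101, "Price": 9.99, "IsActive": True,
            "EffectiveDate": date(2011, 1, 1), "EndDate": None}]

apply_type2_change(history, 101, 12.49, date(2012, 1, 1))

print(len(history))                                    # 2
print([r["Price"] for r in history if r["IsActive"]])  # [12.49]
```

Both rows share the same ProductID, which is exactly the situation described above: the common identifier ties the versions together, and the IsActive flag (or the date range) selects the current record.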

I hope that this explanation has helped to simplify your understanding of how to use slowly changing dimensions in data warehousing.

Data Quality Assurance – Will the Real John Smith Please Stand Up?

As humans we have the ability to scan multiple segments of data and make logical conclusions about the relationships that records have with each other. For example, given four slightly different records for "John Smith", we can easily conclude that all four records are indeed the same person.

Data Quality Assurance

Data quality assurance is the process of cleansing or scrubbing your data as it is extracted from your source systems and before it is inserted into your data warehouse. The ultimate goal is to remove “dirty” data such as duplicate or incomplete records. Multiple tools can and should be used to ensure that your data warehouse is as accurate and clean as possible.

If your data includes addresses, you may choose to cross-reference the city or province with the zip code or postal code. Many companies sell databases of up-to-date address information that can be used as a standard of comparison for your address data.
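As a toy sketch of that cross-referencing step, the lookup below stands in for a purchased address database; the zip-to-city pairs are illustrative.

```python
# A stand-in for a purchased reference database of valid zip/city pairs.
valid_zip_city = {
    "10001": "New York",
    "60601": "Chicago",
}

def address_is_consistent(record):
    """Flag records whose city disagrees with the reference for its zip."""
    expected = valid_zip_city.get(record["zip"])
    return expected is not None and expected == record["city"]

print(address_is_consistent({"zip": "10001", "city": "New York"}))  # True
print(address_is_consistent({"zip": "10001", "city": "Chicago"}))   # False
```

Records that fail the check would be routed to a cleansing or manual-review step rather than loaded as-is.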

Fuzzy logic can be used to merge duplicate records like the John Smith records in this example. Records one and four are a 100% match, so they can easily be merged. Comparing records one and two, we see that record two includes a middle initial and has an address discrepancy.

This is where your business rules come into play. You may decide to disregard middle initials in your data cleansing process, or you may treat the name field as one letter off and deduct 5% from the match score. Seeing that the address is one number off may deduct another 5%, so the data quality score for records one and two comes to 90%. You will need to set a threshold that determines your business rules for merging two records: if your threshold is 85%, these two records will be merged.
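The scoring rules above can be sketched in a few lines of Python. The two records and the 5% deductions are illustrative, patterned on the John Smith example (record two has a middle initial and a street number one digit off):

```python
THRESHOLD = 85  # merge records that score at least 85%

def match_score(a, b):
    """Start at 100 and deduct 5 points per field discrepancy."""
    score = 100
    if a["name"] != b["name"]:
        score -= 5  # e.g. a middle initial present in only one record
    if a["address"] != b["address"]:
        score -= 5  # e.g. a street number one digit off
    if a["birthdate"] != b["birthdate"]:
        score -= 5
    return score

rec1 = {"name": "John Smith", "address": "123 Main St", "birthdate": "11/29/1964"}
rec2 = {"name": "John Q Smith", "address": "124 Main St", "birthdate": "11/29/1964"}

score = match_score(rec1, rec2)
print(score, score >= THRESHOLD)  # 90 True -> these records would be merged
```

Real fuzzy-matching tools compute graded string similarity rather than all-or-nothing field comparisons, but the threshold-and-merge decision works the same way.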

Now consider which record has the accurate data. This example is relatively simple because we have two identical records, so it is probably safe to say that the two 100%-matching records contain the most valid data. We can also see that the same birthday appears in three of the four records, so we can safely conclude that the correct birthdate is 11/29/1964.

Data quality assurance is one of the most important processes used to construct your data warehouse, and you shouldn't underestimate the importance of investing both time and money in this portion of your business intelligence solution.