Bad data is why many data warehousing projects fail to deliver results; in fact, data quality in data warehouses remains a significant challenge for many companies. The leading cause of bad data is the integration of data from multiple systems, yet that very integration is the foundation of any data warehousing project.
What Does Data Quality in a Data Warehouse Mean?
The purpose of the data warehouse is to build a unified layer that contains data from all relevant data sources throughout the organization. This means you need to integrate data from multiple systems and optimize it for analysis and business intelligence. The data warehouse does not generate any data of its own, so any data quality issues either exist within the source systems or arise from how data is interpreted differently across systems. The data warehousing team must take responsibility for identifying such issues, finding ways to improve data quality, or gaining business agreement that certain issues are acceptable. That last option may seem counterintuitive, but it is critical to balancing the cost of improving data quality against what the business is willing to spend.
If data quality is overlooked, though, data warehouse users will have inaccurate, incomplete data on their hands. This translates directly into unrepresentative data and erroneous analytics. These may be the very same analytics the C-suite uses for decision-making, and we all know how badly a single wrong decision can hurt a business.
4 Immediate Ways to Improve Data Quality in Your Data Warehouse
Reject the error: You have to decide whether accuracy or completeness matters more in any given instance. If accuracy is more important, you could reject any record containing the error, especially if fixing it requires more effort than your business is willing to put in.
Accept the error: If you value completeness over accuracy and consider the errors tolerable, you may choose to accept records containing them into your data warehouse and fix them later, once your team can find the correct values.
Fix the error: If your team can find the correct values, or a format change that would fix a specific error, at a cost you can bear, the choice is obvious.
Assign a default value: If completeness is very important yet the correct value cannot be found, you could assign a default value for each type of error to substitute for the erroneous data.
Regardless of the action you take, it’s critical that data warehouse users understand the implications of each action so they factor it into their analytics.
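As a minimal illustration of these four actions, the sketch below applies them to an incoming batch during loading. The pandas-based approach, column names, and thresholds are assumptions for illustration, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical incoming batch with an age column we want to validate.
incoming = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "customer_age": [34, 204, None, 51],   # 204 is out of range, None is missing
    "country": ["us", "US", None, "UK"],
})

# Reject the error: drop records whose age is impossible.
rejected = incoming[incoming["customer_age"] > 120]

# Accept the error: keep the remaining records, even those with a missing age.
accepted = incoming.drop(rejected.index).copy()

# Fix the error: normalize a correctable issue, e.g. inconsistent country codes.
accepted["country"] = accepted["country"].str.upper()

# Assign a default value: substitute a labeled default where the value is missing.
accepted["country"] = accepted["country"].fillna("Not Available")

print(f"Loaded {len(accepted)} records, rejected {len(rejected)}")
```

Whichever branch a record takes, the outcome should be logged and communicated so analysts know how the numbers were shaped.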
Approaches to Data Quality Management
We have seen how data quality is a key requirement in data warehousing, but in practical terms, fixing quality issues in a data warehouse is a complex process. This section covers approaches to implementing a data quality framework for a data warehouse, specifically:
- Understanding source data in the data warehouse
- Understanding causes of data quality errors
- Bringing together data from different sources to improve quality
- Adding value to data to increase its usefulness
Data Profiling: Understanding Source Data in the Data Warehouse
When trying to improve something, understanding it is the natural first step. The process of understanding existing data with respect to how you want it to be in its final form is called ‘data profiling’. This includes digging deep into source data and understanding the content, structure, and cardinalities. That’s how we identify where data quality processes need to be applied and which approach to choose.
Too often, data profiling has been relegated to the backroom of Extract-Transform-Load (ETL) processes, only brought forward to check for minor anomalies in data once the data warehouse design is complete and production data needs to be delivered. In reality, data profiling should be the very next step in your data warehousing project once you’ve gathered business requirements. At this stage, knowing how much work the source data requires before it becomes usable for analytics greatly influences both the design of your data warehouse and the time it takes to build.
When profiling your data during the data warehouse design process, focus on these 4 deliverables to get maximum value from the effort:
- The most basic deliverable is a “no-go” on the entire project. The source data you want to use to build your data warehouse may have too many errors or too much missing information for the data warehousing initiative to be at all viable for analytics. While this may be construed as a huge failure, it’s actually an extremely valuable outcome, because now your team can refocus their efforts elsewhere rather than spending weeks and months building a project only to find out that the end-result is an enormously flawed reporting system that is unusable for decision-making. Such surprises at the end are often career-shortening for business intelligence team leads.
- The second deliverable is a list of issues that already exist in source data which must be fixed before the project can move forward. The fixes are a major external dependency and must be managed well to ensure the success of your data warehouse. You might think that issues can be fixed later once data is written into the data warehouse, but then the issues will resurface every time you sync your operational systems with the data warehouse.
- The third deliverable is a list of data quality issues that are encountered while extracting data from multiple sources and writing to the data warehouse. A deep understanding of such issues will help you come up with data transformation logic and exception handling methods best suited to your business scenario. You will also be able to determine any manual processing that will be required to fix inconsistencies and factor that into the total time it takes to complete the data warehousing project.
- Lastly, focus on previously unanticipated business rules and on issues in foreign and primary key relationships and hierarchical structures. You will need to dig deeper to identify such intricate issues, but if left unchecked, they will permeate the data warehouse design and may blow out of proportion later.
Here are a few simple examples of issues that data profiling may help discover:
| Data Quality Issue | Example |
| --- | --- |
| Invalid value | Valid value can be “1” or “2”, but current value is “3” |
| Cultural rule conformity | Date = 1 Feb 2018 or 1-1-18 or 2-1-2018 |
| Value out of required range | Customer age = 204 |
| Verification | City and State do not correspond to ZIP code |
| Format inconsistency | Phone = +135432524 or (001)02325355 |
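The checks in the table above map naturally to simple profiling queries. Purely as an illustration, the sketch below (using pandas, with a hypothetical file and column names) shows how a few of these issues might be surfaced during profiling.

```python
import pandas as pd

# Hypothetical extract from a source system.
customers = pd.read_csv("source_customers.csv")

profile = {
    # Invalid value: status should only ever be "1" or "2".
    "invalid_status": int((~customers["status"].astype(str).isin(["1", "2"])).sum()),
    # Value out of required range: ages above 120 are implausible.
    "age_out_of_range": int((customers["age"] > 120).sum()),
    # Format inconsistency: phone numbers that do not fit a simple digits-only pattern.
    "bad_phone_format": int((~customers["phone"].astype(str)
                             .str.match(r"^\+?\d{10,15}$")).sum()),
    # Completeness: missing values per column.
    "null_counts": customers.isna().sum().to_dict(),
}

print(profile)
```

Counts like these, produced per column and per source, are the raw material for the deliverables described above.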
Data Quality: Understanding Causes of Data Quality Errors
Done with data profiling and ready to move towards improving data quality? Not so fast. Data profiling is an ongoing process of discovery. Establish a quality-oriented culture in your organization by rewarding people who find and report issues in data, much as Japanese manufacturers do in the automotive industry.
Once you have a fair idea of the quality of your data, start re-engineering processes to improve data quality while profiling goes on continuously at all levels of the organization, from front-line data entry operators to the executives at the highest level who use analytics. Changes to the source systems will be required, but you need to handle them delicately, balancing implementation at both technical and operational levels by involving both business and IT. Unless your organization already has a Master Data Management (MDM) system in place that contains master copies of all data, you’d want your data warehouse to eventually serve as your MDM. This means the data warehouse must clean, conform, de-duplicate, and persist numerous data sets from across the organization.
The first step to improving data quality, once data has been profiled, is a series of tests incorporated at various points in the data integration process. The tests could, for instance, apply business rules or mathematical operations to validate your data. We’ll cover this part in further detail below.
If a data set passes the tests, it’s clean and can be moved to the production data warehouse for modeling. If not, your data validation process should be able to:
- Create an error event record, and
- Either stop the process
- Or suspend erroneous data
- Or simply tag the data (a minimal sketch of this routing follows)
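As a rough illustration of that routing, and assuming a simple in-house error-event list rather than any particular tool, the logic might look like this:

```python
from datetime import datetime, timezone
from typing import Optional

error_events = []  # stands in for an error event table in the staging area

def handle_failed_record(record: dict, rule: str, action: str = "suspend") -> Optional[dict]:
    """Log an error event, then stop the process, suspend the record, or tag it."""
    error_events.append({
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "rule_violated": rule,
        "record": record,
    })
    if action == "stop":
        raise RuntimeError(f"Load stopped: rule '{rule}' failed")
    if action == "suspend":
        return None                      # hold the record back from the load
    record["quality_flag"] = rule        # tag: load it, but mark the violation
    return record
```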
Data Quality Tests
In terms of architecture, all data quality tests are structured similarly but differ in scope. Let’s take a look at the data quality categories defined by Jack Olson in his book “Data Quality: The Accuracy Dimension”:
- Column-level testing
Data is tested at a very granular level, within a single column. Data quality rules that can be applied at this stage include checking whether the value is null, comes from a fixed, finite list, falls within a specified range, fits field patterns specified in the database, is not part of an exclusion list, and passes a basic spell check.
- Structure-level testing
This type of testing checks data relationships across multiple columns. For instance, fields across columns may be checked to verify a hierarchy, like a one-to-many relationship. Foreign and primary key relationships are also checked. Every field of a specific column can be checked against another column to verify something like postal addresses.
- Business rule testing
Complex testing is done by creating business rules. This type of testing may involve, for example, checking an airline customer’s eligibility for Platinum Member status by verifying that their frequent flyer miles exceed 2 million and that they have been a regular member for at least 5 years. A sketch combining all three levels of tests follows this list.
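As a minimal sketch of these three levels (the file names, column names, and pandas-based checks are assumptions for illustration, not a prescribed implementation):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")   # assumed staging extract
orders = pd.read_csv("orders.csv")         # assumed staging extract

# Column-level: value must be non-null and drawn from a finite list.
assert customers["status"].notna().all()
assert customers["status"].isin(["active", "inactive", "prospect"]).all()

# Structure-level: every order must reference an existing customer (foreign key check).
assert orders["customer_id"].isin(customers["customer_id"]).all()

# Business rule: Platinum status requires more than 2,000,000 frequent flyer miles
# and at least 5 years as a regular member.
platinum = customers[customers["tier"] == "Platinum"]
assert ((platinum["frequent_flyer_miles"] > 2_000_000) &
        (platinum["membership_years"] >= 5)).all()
```

In a real pipeline, each failed check would create an error event rather than halt with an exception.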
With these measures in place, we can start taking action:
- Reject the data and exclude from the data warehouse if the errors are too severe to be fixed
- Accept data when errors are within tolerable limits, after conveying this to business users
- Fix the data when the error can be fixed as you go; for instance, when there are multiple versions of the same customer entity, you can designate one as the master record
- Assign a default value like “Not Available” when you cannot let a field remain blank
The action you must take will vary depending on the type of data you’re working with, and the decision is typically the responsibility of the business department that works with that specific type of data set.
Note that all of the measures we’ve discussed so far focus on improving the quality of existing data rather than addressing the root cause – which is often at the point where data is entered into transactional systems by front-line employees. If you’re truly invested in improving data quality, you will also need to implement rules that improve the quality of data being entered. For instance, in a financial institution, management may notice that customers’ social security numbers are often left blank or incorrectly entered. They could choose to make the field required, enforce a format specific to social security numbers (AAA-GG-SSSS), and disallow nonsensical entries like 999-99-9999.
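A minimal entry-time validation for that rule might look like the following sketch (the function name and exact rejection rules are illustrative assumptions):

```python
import re

# SSNs follow the AAA-GG-SSSS pattern; certain values are never issued.
SSN_PATTERN = re.compile(r"^(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")

def is_valid_ssn(value: str) -> bool:
    """Require the AAA-GG-SSSS format and reject nonsensical entries like 999-99-9999."""
    return bool(SSN_PATTERN.match(value))

assert is_valid_ssn("123-45-6789")
assert not is_valid_ssn("999-99-9999")   # area numbers in the 900s are never issued
assert not is_valid_ssn("")              # the field can no longer be left blank
```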
Data Integration: Bringing Together Data from Different Sources to Improve Quality
Data integration as a broader methodology is a separate topic, but in the context of data quality, it refers to integrating data about the same entity across different systems. For instance, information about a specific product may be found in your US database, but the same product may be sold in other countries too, meaning that records of the same product are spread across different regional databases. In every region, the product may be sold under a different name, under different branding, and with different patterns used to describe information in database records.
When building your data warehouse, you’d have to integrate all of these disparate pieces of information across multiple databases to form a master view that can be used for reporting. Let’s take a look at an example:
| Original Data | Data After Standardization |
| --- | --- |
| BMI Corp | BMI Corporation |
| BMI Inc | BMI Incorporated |
| BMI Co. | BMI Company |
| MR JOHN DEERE | Mr. John Deere |
| Mister Jonathan Deere | Mr. John Deere |
| Mister John DEERe | Mr. John Deere |
| #(222)0202020 ext120 | 222-020-2020 ext 120 |
| 2220202020 x120 | 222-020-2020 ext 120 |
| National Salt | National Salt Company |
| NSC | National Salt Company |
| N. Salt | National Salt Company |
| National S. | National Salt Company |
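A rough sketch of this kind of standardization, assuming simple rule tables rather than a full data quality tool, might look like this:

```python
import re

# Expansion rules for common legal-suffix abbreviations.
SUFFIXES = {"Corp": "Corporation", "Inc": "Incorporated", "Co.": "Company"}

def standardize_company(name: str) -> str:
    """Expand a trailing suffix abbreviation, e.g. 'BMI Corp' -> 'BMI Corporation'."""
    parts = name.split()
    if parts and parts[-1] in SUFFIXES:
        parts[-1] = SUFFIXES[parts[-1]]
    return " ".join(parts)

def standardize_phone(raw: str) -> str:
    """Normalize phone numbers like '2220202020 x120' to '222-020-2020 ext 120'."""
    match = re.match(r"\D*(\d{3})\D*(\d{3})\D*(\d{4})\s*(?:x|ext)\s*(\d+)",
                     raw, re.IGNORECASE)
    if not match:
        return raw  # leave unrecognized patterns for manual review
    area, middle, last, extension = match.groups()
    return f"{area}-{middle}-{last} ext {extension}"

print(standardize_company("BMI Corp"))        # BMI Corporation
print(standardize_phone("2220202020 x120"))   # 222-020-2020 ext 120
```

Real tools also handle titles, abbreviations such as “NSC”, and misspellings, typically by combining rule tables like these with reference data and fuzzy matching.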
Using our original customer and product example, integrating data in this way revolves around two important processes:
- Recognizing whether the same entity (a customer or a product) exists in both sources
- Combining the data from both sources to obtain a consolidated view of that entity
When trying to find whether two entities link, you could start with a common field that is likely to exist in the same pattern across systems. For a customer entity, this field could be the tax ID number. If the same tax ID number exists for customer records in different systems, you have just identified commonality in a very efficient manner. However, we are rarely so lucky as to have such simple solutions in the world of databases.
If you cannot find a common field, all available product information would have to be matched across tables to determine whether the same entity exists in both systems. Modern data quality management tools automate this sort of work, which used to take subject matter experts hours of scouring through rows and tables to find links. Let’s take this example further and see how product information could potentially be matched.
Say your US database contains brand, product description, and product identification number all in one field, in varying patterns. The UK database, for instance, records just the product description, but that too in varying patterns depending on who entered it. An automated data quality tool could determine commonality by:
- Parsing the product description from US and UK database into individual attributes and sorting by brand name
- Applying operations to brand names to make them consistent
- Fixing differences in how product attributes are recorded
- Using fuzzy logic to match product attributes across both databases
- Showing reports of products that match and linking them to a customer entity (a minimal fuzzy-matching sketch follows this list)
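Commercial tools implement this matching with far more sophisticated algorithms, but purely as an illustration of the idea, a naive sketch using Python’s standard-library difflib might look like this:

```python
from difflib import SequenceMatcher

# Hypothetical product descriptions from the two regional databases.
us_products = ["ACME SuperGlue 50ml tube", "Acme Super Glue Gel 20 ml"]
uk_products = ["Acme Super-Glue 50 ml", "Generic Adhesive 50ml"]

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio on lightly normalized descriptions."""
    normalize = lambda s: " ".join(s.lower().replace("-", " ").split())
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Report candidate matches above a tuned threshold for review or auto-linking.
for us_item in us_products:
    for uk_item in uk_products:
        score = similarity(us_item, uk_item)
        if score > 0.8:
            print(f"Candidate match ({score:.2f}): {us_item!r} <-> {uk_item!r}")
```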
Using data integration in this way has saved companies countless man-hours annually. The best way to go about this when building your data warehouse is to put a comprehensive, API-based data cleansing and matching solution between source and destination.
Data Augmentation: Adding Value to Data to Increase its Usefulness
So by now, we’ve conducted data profiling, looked at the causes of data quality errors and the actions to take, and seen how integrating data across multiple sources to determine commonality adds value. The natural last step to complete the data quality cycle is to look at ways to augment existing entity data with data from external sources, outside our own databases.
For entities like Customer, data augmentation like this is very common. Your marketing automation application could, for instance, contain valuable insights about customers that could be used to augment records in the data warehouse. The additional insight will help your business target product offerings better by providing deeper segmentation opportunities.
The table below gives an example of the kinds of data that can be obtained from external sources to augment the master record:
| Field | Example Value |
| --- | --- |
| State | CL |
| City | Move |
| ZIP | 65464 |
| ZIP+4 | 3234 |
| Delivery ID | 3 |
| Route ID | 4 |
| Address | 6546 House Way |
| House Number | 6546 |
| Street | House Way |
| Street Type | Way |
| County ID | 635 |
| County Name | Glow |
| District | 47 |
| Record Type | Personal |
| Latitude | 35.4685 |
| Longitude | 64.2334 |
| Census Group | 35632165 |
| Census Tract | 35 |
In the above example data, a business could potentially look at the address, postal code, and ZIP+4 code to determine whether the customer falls into a specific housing segment. For instance, houses in a particular region that share a ZIP+4 code may all have been built in the 80s and span around 2,500 square feet. This information could be used to target certain product offerings to all such customers.
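In practice, such augmentation often comes down to a join on address-derived keys. The sketch below is a rough illustration; the column names and the external housing-segment data set are assumptions.

```python
import pandas as pd

# Customer master records already in the warehouse.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "zip": ["65464", "65464"],
    "zip_plus4": ["3234", "9911"],
})

# External attributes purchased or licensed from a third-party provider.
housing_segments = pd.DataFrame({
    "zip": ["65464"],
    "zip_plus4": ["3234"],
    "built_decade": ["1980s"],
    "avg_square_feet": [2500],
})

# Augment the master record with the external attributes via a left join.
augmented = customers.merge(housing_segments, on=["zip", "zip_plus4"], how="left")
print(augmented)
```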
While the above was just one example, in reality, data augmentation using the address field to build correlations is common. Product data is another example of data that is used for augmentation purposes. Buying patterns, especially when building predictive models, can be determined using correlations between product data and other augmented data. Another prime example of data augmentation can be seen in the case of a manufacturer. As a manufacturer, you’d only know what and how much you’re selling to retailers or wholesalers – not the actual amount of product sold to the final customer. Research firms like Nielsen provide data of this sort, which manufacturers purchase to gain a better understanding of sales patterns so they can enhance their product offerings and strategy in terms of product inventory and delivery.
While not directly tied to data quality at its core, data augmentation should be the natural last step if you’re building an intensive data quality management framework within your organization.
Conclusion: Build Reliable and Accurate Analytics Capabilities with Data Quality
The purpose of any data warehousing initiative is to provide business intelligence, and that purpose is defeated if enough thought is not given to building a comprehensive data quality framework, with the end-result being inaccurate analytics, and therefore, bad decisions.
Use the data quality framework described here as a base to build your own processes. Your focus should be on adding value to existing data however you can, rather than merely fixing errors in production data and moving onwards.