Every year, organizations invest millions in data warehousing projects, only to see them falter because of bad data. Whether it’s inaccurate entries, inconsistent formats, or fragmented records across systems, the result is the same: flawed analytics and faulty decisions that can cost businesses dearly. Why does this happen? Let’s explore the role of good data quality in a data warehouse and why it is the key to better decision-making, efficiency, and growth.
Understanding Data Quality in Data Warehousing
The core purpose of a data warehouse is to build a unified layer that consolidates data from all relevant sources across the organization. It requires integrating data from multiple systems and optimizing it for analysis and business intelligence.
The data warehouse does not generate any data itself; it relies on data from source systems. This means that any data quality issues found in a data warehouse stem either from those source systems or from how data is interpreted differently across systems.
Integrating data from multiple systems is the leading cause of poor data quality in a data warehouse, yet this integration is also a fundamental part of any data warehousing project. To prevent or overcome this challenge, the data warehousing team must spend time assessing data quality and identifying issues in the source systems, then either devise effective strategies to fix them or negotiate with the business to determine which issues are acceptable given cost and feasibility. The latter may seem counterintuitive, but reaching a consensus on what can be accepted is crucial to balancing the cost of improving data quality against what the business is willing to invest.
On the flip side, if businesses overlook emerging data quality challenges, data warehouse users will end up with inaccurate, incomplete data, which directly leads to flawed analytics. These may be the very analytics the C-suite uses for decision-making, and we all know how badly a single wrong decision can potentially hurt a business.
4 Immediate Ways to Improve Data Quality in Your Data Warehouse
When it comes to maintaining data quality in a data warehouse, there are several strategies you can implement depending on whether you prioritize accuracy or completeness. Here are four immediate actions to consider:
1. Reject the Error:
You have to decide whether you want data accuracy or completeness in any given instance. If accuracy is more important, you could reject any record found with an error, particularly if fixing it requires more effort (time, money, or resources) than your business is willing to put in. Rejecting the error ensures that only high-quality, error-free data enters your warehouse.
2. Accept the Error:
If completeness is more critical, you might choose to allow records with minor errors into your data warehouse. In this case, you accept the risk of inaccurate data temporarily, with the intention of correcting the errors later when more resources are available or when your team can find the correct values.
3. Fix the Error:
If the errors are fixable within a reasonable time frame and cost, it’s often best to address them directly. Whether this involves correcting values or adjusting formats, fixing errors early is crucial for maintaining long-term data quality standards as it prevents the problems from snowballing into larger issues later on.
4. Assign Default Value:
If completeness is the higher priority and an error cannot be corrected, you can assign a default value for each type of error to substitute for the erroneous data. This ensures that missing or incorrect data doesn’t create gaps in your dataset while maintaining the integrity of the overall structure. (A minimal sketch of how these four strategies might be applied to incoming records follows below.)
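To make these four options concrete, here is a minimal Python sketch that routes each incoming record to one of them. The field names, error checks, and default values are hypothetical assumptions for illustration, not a prescription for any particular platform.

```python
# Hypothetical sketch of routing incoming records through the four strategies.
# Field names, error checks, and default values are illustrative assumptions.

ACCEPTABLE_ERRORS = {"missing_phone"}          # tolerable for now; fix later
DEFAULTS = {"country": "Unknown"}              # substitute when a value is missing

def classify(record):
    """Return the list of error codes found in a record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing_customer_id")   # critical and not fixable downstream
    if not record.get("phone"):
        errors.append("missing_phone")         # minor
    if record.get("country") is None:
        errors.append("missing_country")       # coverable by a default value
    return errors

def handle(record):
    errors = classify(record)
    if not errors:
        return ("load", record)                # clean record, load as-is
    if "missing_customer_id" in errors:
        return ("reject", record)              # 1. Reject the error
    if "missing_country" in errors:
        record["country"] = DEFAULTS["country"]
        errors.remove("missing_country")       # 4. Assign a default value
    if set(errors) <= ACCEPTABLE_ERRORS:
        return ("accept", record)              # 2. Accept the error (correct later)
    return ("fix", record)                     # 3. Queue the record for fixing

print(handle({"customer_id": "C-7", "phone": None, "country": None}))
# -> ('accept', {'customer_id': 'C-7', 'phone': None, 'country': 'Unknown'})
```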
No matter which approach you choose, it’s vital that all data warehouse users understand the potential implications of these decisions. The way you handle data will directly affect the accuracy and usefulness of your analytics, so make sure to communicate these strategies to everyone involved in the data management process. It’s also recommended to involve data stewards to ensure accountability for measuring data quality and managing its implications.
How to Manage Data Quality in a Data Warehouse? 4 Approaches to Data Quality Management
We have seen that data quality is a key requirement in data warehousing. In practice, however, fixing quality issues in a data warehouse is a complex process.
To improve data quality effectively, it’s essential to implement a data quality assessment framework that addresses the key aspects of your data quality process. Implementing such a framework involves several steps to ensure that the data you rely on is accurate, consistent, and actionable.
Let’s explore the key approaches to implementing a data quality framework for a data warehouse and the data quality metrics and dimensions you must focus on.
Data Profiling: Understanding Source Data in the Data Warehouse
When trying to improve something, understanding it is the natural first step. The process of understanding existing data with respect to how you want it to look in its final form is called ‘data profiling’. It involves digging deep into source data to understand its content, structure, and cardinalities. That is how you identify where data quality management tools and processes need to be applied and which approach to choose.
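As a simple illustration of what profiling output can look like, the sketch below computes basic statistics (data types, null counts, distinct values, numeric ranges) over a hypothetical source table using pandas; the columns and sample values are assumptions.

```python
# Minimal profiling sketch using pandas; the sample data is hypothetical.
import pandas as pd

source = pd.DataFrame({
    "customer_id": [101, 102, 103, None, 105],
    "age":         [34, 204, 28, 45, None],          # 204 is an out-of-range value
    "state":       ["TX", "tx", "CA", "CA", None],   # inconsistent casing
})

profile = pd.DataFrame({
    "dtype":      source.dtypes.astype(str),
    "null_count": source.isna().sum(),
    "distinct":   source.nunique(),
    "min":        source.min(numeric_only=True),
    "max":        source.max(numeric_only=True),
})
print(profile)   # one row of profile statistics per source column
```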
Too often, data profiling is relegated to the backroom of Extract-Transform-Load (ETL) processes, only brought forward to check for minor anomalies in data once data warehouse design is complete and production data needs to be delivered. In reality, data profiling should be the very next step in your data warehousing project once you’ve gathered business requirements. Profiling at this stage allows you to assess the amount of work the source data will require before it’s ready for analytics, and it will significantly impact the design and timeline of your data warehouse project.
When profiling your data during the data warehouse design process, focus on these four deliverables to get maximum value from the effort:
1. “No-Go” Decision:
The most basic deliverable is a “no-go” decision on the entire project. If the source data you want to use to build your data warehouse has too many errors or too much missing information, the initiative may not be viable for analytics at all. While this may be construed as a huge failure or setback, it’s actually an extremely valuable outcome, because your team can refocus its efforts elsewhere rather than spending weeks or months building a project only to find that the end result is an enormously flawed reporting system, unusable for decision-making. Such surprises at the end are often career-shortening for business intelligence team leads.
2. List of Data Issues to Fix:
The second deliverable is a list of issues that already exist in source data and must be fixed before the project can move forward. These fixes are a major external dependency and must be managed well to ensure the success of your data warehouse. You might think the issues can be fixed later, once data is written into the data warehouse, but every time you sync your operational systems with the data warehouse, the same issues will resurface.
3. Data Quality Issues During ETL:
The third deliverable is a list of data quality issues that are encountered while extracting data from multiple sources and writing to the data warehouse. Profiling data during the ETL process helps you identify these issues. A deep understanding of such issues will help you come up with data transformation logic and exception handling methods best suited to your business scenario. You will also be able to determine any manual processing that will be required to fix inconsistencies and factor that into the total time it takes to complete the data warehousing project.
4. Unanticipated Business Rules and Key Issues:
Profiling also helps uncover previously unanticipated complex business rules and data integrity issues, such as problems with foreign and primary key relationships and hierarchical structures. You will need to dig deeper to identify such intricate issues. If left unchecked, they will permeate the data warehouse design and may blow out of proportion later.
Some common quality issues that data profiling helps identify include:
| Data Quality Issue | Example |
| --- | --- |
| Invalid value | Valid values are “1” or “2”, but the current value is “3” |
| Cultural rule conformity | Date formats like 1 Feb 2018, 1-1-18, or 2-1-2018 |
| Value out of required range | Customer age = 204 |
| Verification | City and state do not correspond to the ZIP code |
| Format inconsistency | Phone numbers in different formats, e.g. +135432524 or (001)02325355 |
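The checks in the table can be expressed directly as validation rules. Below is a minimal sketch in Python; the field names, allowed values, and canonical phone format are assumptions chosen for illustration, not rules from any particular system.

```python
# Illustrative validation rules mirroring the table above; field names and
# formats are assumptions.
import re

VALID_STATUS = {"1", "2"}                             # invalid value check
PHONE_FORMAT = re.compile(r"^\+?\d{3}-\d{3}-\d{4}$")  # assumed canonical format

def validate(record):
    issues = []
    if record.get("status") not in VALID_STATUS:
        issues.append("invalid value")
    age = record.get("age")
    if age is not None and not 0 <= age <= 120:
        issues.append("value out of required range")
    if not PHONE_FORMAT.match(record.get("phone", "")):
        issues.append("format inconsistency")
    return issues

print(validate({"status": "3", "age": 204, "phone": "(001)02325355"}))
# -> ['invalid value', 'value out of required range', 'format inconsistency']
```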
Understanding Causes of Data Quality Errors
Done with data profiling and ready to move towards improving data quality? Not so fast.
Data profiling is an ongoing process of discovery. Establish a quality-oriented culture in your organization by rewarding people who find and report data issues, much like Japanese manufacturers do in the automotive industry.
Once you have a fair idea of the quality of your data, start re-engineering processes to improve it while profiling continues at all levels of the organization, from front-line data entry operators to the executives who consume the analytics. This will require changes to the source systems, which need to be handled delicately, balancing implementation at both the technical and operational levels by involving both business and IT. Unless your organization already has a Master Data Management (MDM) system that contains master copies of all data, you will want your data warehouse to eventually serve as your MDM. That means cleaning, persisting, conforming, and de-duplicating numerous data sets across the organization within the data warehouse.
The first step to improving data quality, once profiling is done, is a series of tests incorporated at key points in the data integration process. The tests can reference business rules, mathematical operations, or logic. We’ll cover this part in more detail later.
If a data set passes the tests, it’s clean and can be moved to the production data warehouse for modeling. If not, your data validation process should create an error event record and then do one of the following (a minimal sketch of this gate follows the list):
- Stop the process
- Suspend the erroneous data
- Tag the data for further review
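A minimal Python sketch of such a gate might look like the following; the error-event structure and the severity thresholds are assumptions chosen for illustration.

```python
# Hypothetical validation gate: on failure, record an error event and then
# stop, suspend, or tag the data. Thresholds and schema are assumptions.
from datetime import datetime, timezone

error_events = []   # stand-in for an error event table in the staging area

def run_gate(batch, tests):
    failures = [(rec, test.__name__) for rec in batch for test in tests if not test(rec)]
    if not failures:
        return ("load", batch)                       # clean: move to production

    for rec, test_name in failures:                  # create error event records
        error_events.append({
            "test": test_name,
            "record": rec,
            "ts": datetime.now(timezone.utc).isoformat(),
        })

    failure_rate = len(failures) / len(batch)
    if failure_rate > 0.5:
        return ("stop", [])                          # stop the process
    if failure_rate > 0.1:
        bad = {id(rec) for rec, _ in failures}
        return ("suspend", [rec for rec in batch if id(rec) not in bad])  # hold bad rows
    for rec, _ in failures:
        rec["needs_review"] = True                   # tag for further review
    return ("tagged", batch)

def has_customer_id(record):
    return bool(record.get("customer_id"))

print(run_gate([{"customer_id": "C-1"}, {"customer_id": None}], [has_customer_id]))
# -> ('suspend', [{'customer_id': 'C-1'}]) with one error event recorded
```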
Data Quality Tests
In terms of architecture, all data quality tests are structured similarly but differ in scope. Let’s take a look at the data quality categories defined by Jack Olson in his book “Data Quality: The Accuracy Dimension”:
- Column-level testing
Data is tested at a very granular level, within a single column. Rules that can be applied at this stage include checking whether a value is null, comes from a fixed, finite list, falls within a specified range, fits the field patterns specified in the database, is not part of an exclusion list, and passes a basic spell check.
- Structure-level testing
This type of testing checks data relationships across multiple columns. For instance, fields across columns may be checked to verify a hierarchy, such as a one-to-many relationship. Foreign and primary key relationships are also checked, and fields in one column can be validated against another column to verify something like a postal address.
- Business rule testing
Complex testing is done by creating business rules. For example, an airline might verify a customer’s eligibility for Platinum status by checking that their frequent flyer miles exceed 2 million and that they have been a regular member for at least 5 years. (A sketch of all three tiers follows this list.)
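To make the three tiers concrete, here is a small Python sketch. The column-level check tests membership in a fixed list, the structure-level check verifies a foreign key relationship, and the business-rule check encodes the Platinum-status example; the sample data and thresholds are assumptions.

```python
# Illustrative sketches of the three test tiers; data and rules are assumptions.

# Column-level: the value must come from a fixed, finite list.
VALID_MEMBERSHIP = {"Regular", "Gold", "Platinum"}
def column_test(row):
    return row["membership"] in VALID_MEMBERSHIP

# Structure-level: every order must reference an existing customer (FK check).
def structure_test(orders, customer_ids):
    return [o for o in orders if o["customer_id"] not in customer_ids]  # orphans

# Business rule: Platinum status requires > 2,000,000 miles and at least
# 5 years as a regular member.
def business_rule_test(row):
    if row["membership"] != "Platinum":
        return True
    return row["miles"] > 2_000_000 and row["years_as_member"] >= 5

customer = {"membership": "Platinum", "miles": 1_500_000, "years_as_member": 6}
print(column_test(customer))          # True  - value is in the allowed list
print(business_rule_test(customer))   # False - miles below the Platinum threshold
print(structure_test([{"customer_id": 7}], customer_ids={1, 2, 3}))  # orphan order
```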
With these measures in place, we can start taking action:
- Reject the data and exclude from the data warehouse if the errors are too severe to be fixed
- Accept data when errors are within tolerable limits, after conveying this to business users
- Fix the data when the error can be fixed as you go; for instance, when there are multiple versions of the same customer entity, you can designate one as the master record
- Assign a default value like “Not Available” when you cannot let a field remain blank
The action you must take will vary depending on the type of data you’re working with. This is typically the responsibility of the business department that works with a specific type of data set.
Note that all the measures we’ve discussed so far focus on improving the quality of existing data rather than addressing the root cause, which is often the point where data is entered into transactional systems by front-line employees. If you’re truly invested in improving data quality, you will also need to implement rules that improve the quality of data being entered. For instance, in a financial institution, management may notice that customers’ social security numbers are often left blank or entered incorrectly. They could make the field required, enforce the social security number format (AAA-GG-SSSS), and disallow nonsensical entries like 999-99-9999.
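A minimal sketch of such an entry rule, using a regular expression for the AAA-GG-SSSS layout and an assumed list of placeholder values to reject, might look like this (the check is purely syntactic and the rejection list is an assumption):

```python
# Hypothetical entry-point validation for a social security number field.
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")              # AAA-GG-SSSS layout
PLACEHOLDERS = {"999-99-9999", "000-00-0000", "123-45-6789"}  # assumed reject list

def is_valid_ssn(value):
    if not value:                       # the field is required; blanks are rejected
        return False
    if not SSN_PATTERN.match(value):
        return False
    return value not in PLACEHOLDERS    # disallow nonsensical entries

print(is_valid_ssn("999-99-9999"))      # False
print(is_valid_ssn("212-55-0143"))      # True (format-valid sample value)
```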
Data Integration: Bringing Together Data from Different Sources to Improve Quality
Data integration as a broader methodology is a different topic; in the context of data quality, it refers to integrating data about the same entity across different systems. For instance, information about a specific product may be found in your US database, but the same product may be sold in other countries too, meaning records of the same product are spread across regional databases. In each region, the product may be sold under a different name, under different branding, and with different patterns used to describe information in database records.
When building your data warehouse, you’d have to integrate all of these disparate pieces of information across multiple databases to form a master view that can be used for reporting. Use data governance practices to ensure consistent data formats and resolve discrepancies. Let’s take a look at an example:
| Original Data | Data After Standardization |
| --- | --- |
| BMI Corp | BMI Corporation |
| BMI Inc | BMI Incorporated |
| BMI Co. | BMI Company |
| MR JOHN DEERE | Mr. John Deere |
| Mister Jonathan Deere | Mr. John Deere |
| Mister John DEERe | Mr. John Deere |
| #(222)0202020 ext120 | 222-020-2020 ext 120 |
| 2220202020 x120 | 222-020-2020 ext 120 |
| National Salt | National Salt Company |
| NSC | National Salt Company |
| N. Salt | National Salt Company |
| National S. | National Salt Company |
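A standardization pass like the one shown in the table is typically driven by lookup tables and normalization rules. The sketch below is a minimal Python version; the abbreviation map and the target phone format are assumptions.

```python
# Minimal standardization sketch; mappings and formats are illustrative.
import re

SUFFIXES = {"corp": "Corporation", "inc": "Incorporated",
            "co.": "Company", "co": "Company"}

def standardize_company(name):
    words = name.split()
    last = words[-1].lower()
    if last in SUFFIXES:
        words[-1] = SUFFIXES[last]      # expand the abbreviated suffix
    return " ".join(words)

def standardize_phone(raw):
    # Strip extension markers, keep digits, then reformat as NNN-NNN-NNNN.
    digits = re.sub(r"\D", "", raw.split("ext")[0].split("x")[0])
    ext = re.search(r"(?:ext|x)\s*(\d+)", raw)
    formatted = f"{digits[:3]}-{digits[3:6]}-{digits[6:10]}"
    return formatted + (f" ext {ext.group(1)}" if ext else "")

print(standardize_company("BMI Corp"))        # BMI Corporation
print(standardize_phone("2220202020 x120"))   # 222-020-2020 ext 120
```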
Returning to our customer and product example, integrating data in this way revolves around two important processes:
- Recognizing whether the same entity (a customer or a product) exists in both sources
- Combining the data from both sources to obtain a consolidated, master view of that entity
When trying to find whether two entities are linked, you could start with a common field that is likely to exist in the same pattern across systems. For a customer entity, this field could be the tax ID number. If the same tax ID number exists for customer records in different systems, you have identified commonality very efficiently. However, we are rarely so lucky as to have such simple solutions in the world of databases.
If you cannot find a common field, all available product information has to be matched across tables to determine whether the same entity exists in both systems. Modern data quality management tools automate this sort of work, which used to take subject matter experts hours of scouring rows and tables to find links. Let’s take this example further and see how product information could potentially be matched.
Say your US database contains the brand, product description, and product identification number all in one field, in varying patterns. The UK database, for instance, records just the product description, but again in varying patterns depending on who entered it. An automated data quality tool could determine commonality by:
- Parsing the product descriptions from the US and UK databases into individual attributes and sorting by brand name
- Applying operations to brand names to make them consistent
- Fixing differences in how product attributes are recorded
- Using fuzzy logic to match product attributes across both databases
- Reporting the products that match and linking them to a single master entity (a rough sketch of the matching step follows this list)
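As a rough illustration of the matching step, the sketch below uses Python’s standard-library difflib to score similarity between normalized product descriptions from two databases; the normalization, threshold, and sample data are assumptions, and a production tool would use far more sophisticated parsing and matching.

```python
# Hypothetical fuzzy match between US and UK product descriptions.
from difflib import SequenceMatcher

def normalize(desc):
    # Lowercase, drop hyphens, collapse whitespace before comparing.
    return " ".join(desc.lower().replace("-", " ").split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

us_products = {"US-1001": "ACME SuperGlue 50ml Tube"}
uk_products = {"UK-77": "Acme super-glue tube 50 ml"}

MATCH_THRESHOLD = 0.8   # assumed cut-off; tune against manually reviewed samples
for us_id, us_desc in us_products.items():
    for uk_id, uk_desc in uk_products.items():
        score = similarity(us_desc, uk_desc)
        verdict = "likely match" if score >= MATCH_THRESHOLD else "needs review"
        print(f"{us_id} <-> {uk_id}: {score:.2f} ({verdict})")
```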
Using data integration in this way has saved companies countless hours of manual work. The best approach when building your data warehouse is to place a comprehensive, API-based data cleansing and matching solution between the source and destination systems.
Data Augmentation: Adding Value to Data to Increase its Usefulness
By now, we’ve conducted data profiling, looked at the causes of data quality errors and the actions to take, and seen how integrating data across multiple sources to determine commonality adds value. The natural last step in the data quality cycle is to look at ways to augment existing entity data with data from external sources, outside our own databases.
For entities like Customer, data augmentation like this is very common. Your marketing automation application could, for instance, contain valuable insights about customers that could be used to augment records in the data warehouse. The additional insight will help your business target product offerings better by providing deeper segmentation opportunities.
The table below gives an example of the kinds of data that can be obtained from external sources to augment the master record:
| Field | Example Value |
| --- | --- |
| State | CL |
| City | Move |
| ZIP | 65464 |
| ZIP+4 | 3234 |
| Delivery ID | 3 |
| Route ID | 4 |
| Address | 6546 House Way |
| House Number | 6546 |
| Street | House Way |
| Street Type | Way |
| County ID | 635 |
| County Name | Glow |
| District | 47 |
| Record Type | Personal |
| Latitude | 35.4685 |
| Longitude | 64.2334 |
| Census Group | 35632165 |
| Census Tract | 35 |
In the example data above, a business could look at the address and ZIP+4 code to determine whether a customer falls into a specific housing segment. For instance, houses with certain ZIP+4 codes in a particular region may have been built in the 1980s and span roughly 2,500 square feet. This information could be used to target certain product offerings to all such customers.
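For illustration, the sketch below joins warehouse customer records to an external reference table keyed on ZIP+4 and derives a simple housing-segment flag; the keys, fields, and segment rule are hypothetical assumptions.

```python
# Hypothetical augmentation join on ZIP+4; fields and rules are assumptions.
customers = [
    {"customer_id": "C-1", "zip": "65464", "zip4": "3234"},
    {"customer_id": "C-2", "zip": "65464", "zip4": "9911"},
]

# External reference data, e.g. purchased housing attributes keyed by ZIP+4.
housing = {
    ("65464", "3234"): {"built_decade": 1980, "sq_ft": 2500},
}

for cust in customers:
    extra = housing.get((cust["zip"], cust["zip4"]))
    if extra and extra["built_decade"] == 1980 and extra["sq_ft"] >= 2500:
        cust["segment"] = "1980s-2500sqft"     # falls in the target housing segment
    else:
        cust["segment"] = "unknown"            # no augmentation data available

print(customers)
# C-1 falls in the target segment; C-2 has no augmentation data available.
```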
While the above is just one example, data augmentation that uses the address field to build correlations is common in practice. Product data is another frequent source for augmentation: buying patterns, especially when building predictive models, can be determined by correlating product data with other augmented data. Another prime example is the case of a manufacturer. As a manufacturer, you only know what and how much you’re selling to retailers or wholesalers, not the actual amount of product sold to the final customer. Research firms like Nielsen provide data of this sort, which manufacturers purchase to gain a better understanding of sales patterns so they can enhance their product offerings and their inventory and delivery strategy.
While not directly tied to data quality at its core, data augmentation should be the natural last step if you’re building an intensive data quality management framework within your organization.
Conclusion: Build Reliable and Accurate Analytics Capabilities with Data Quality
The purpose of any data warehousing initiative is to provide business intelligence. That purpose is defeated if not enough thought is given to building a comprehensive data quality framework; the end result is inaccurate analytics and, therefore, bad decisions.
Use the data quality framework described here as a base for building your own processes. Your focus should be on adding value to existing data however you can, rather than merely fixing errors in production data and moving on. For any help improving the quality of your data, get in touch with our experts today.