Last Updated on
Data is everywhere, with total volume expected to exceed 44 trillion GBs by 2020, but rarely is it useful. Only 27% of organizations with data analysis initiatives in place report any significant success, while a mere 8% describe their efforts as “very successful”. These numbers are low not because businesses don’t have the necessary data – it’s because they don’t have quality data to work with, meaning that the data they have is, for all intents and purposes, useless.
Availability isn’t enough; you need to make data useful.
When building your data warehouse, ensuring that you have clean data is critical, or you risk facing the “garbage in, garbage out” phenomenon. But what exactly does data cleansing entail, and do you really need it in your specific data warehouse design?
If you’re integrating data from a single source, you may find that source records contain missing information, misspellings from when data was originally entered, invalid data, etc. These issues are ubiquitous in databases and are often transferred to analytics reporting – unless you implement data cleansing and data scrubbing techniques beforehand. The issue is magnified when you’re dealing with disparate data sources to build centralized data repositories like a data warehouse, a global information system, or a federated system of operational databases.
When dealing with a single source, you have to deal with inaccuracies in data. With disparate sources, you will also have to deal with differences in how data is stored and structured along with data redundancy issues across the multiple representations of data in each source.
Let’s consider an example. Say you have an entity named ‘John Smith’ with City field filled as ‘London’ in one database. Another database could have an additional column for ‘Country’, and when the two databases are merged, you won’t have any easy way to fill the City field for the former ‘John Smith’ record. After all, there are about 12 different Londons in the world.
This is one example; there could be myriad issues like concatenated names in one database and separate columns for FirstName and LastName in another, or separate columns for HouseNumber and StreetName, while another database has a single Address column.
One way to correct data quality is to manually scan and fix errors, but that quickly becomes impractical when you’re dealing with big data. The other route is to use data quality software built specifically to enable the fastest and most accurate data cleansing for your data warehouse.
Data Cleansing in the Data Warehouse
Your data warehouse needs to have consistent, accurate, deduplicated data available to feed downstream analytics applications and other systems across the enterprise. In a typical scenario, you will need a separate staging area where you import data from the source, and then transform and otherwise wrangle your data for standardization and cleansing. Modern data cleansing software supports in-memory processing, where source data is imported into temporary memory rather than a physical database. You can test your cleansing and deduplication configurations in temp memory and then export it back once you’re satisfied with the results.
When deciding upon a data cleansing approach for your data warehouse, ensure that your chosen method can:
- Handle inconsistencies and errors in both single source integrations and multiple source data integration
- Minimize hand-coding and manual effort of reviewing data to validate quality
- Be used for sources other than the ones defined initially, so that the approach still applies when you need to add more sources in your data warehouse later
- Work with schema-based data transformations that rely on metadata
- Identify linked records across sources for deduplication and enrichment purposes
- Provide workflow orchestration features to support and create a strong foundation for data quality rules applied at scale to cleanse enterprise data.
Data Cleansing in Single Source Integration
As discussed earlier, let’s segment data cleansing issues in the data warehouse into two broad data integration categories due to the unique data cleansing challenges each presents:
- Single source data integration
- Multiple source data integration
For both categories, we will further discuss data cleansing problems on schema and instance level. Schema-related issues refer to those where data can be cleansed by working on improving schema design, while instance-related issues refer to errors at a lower level, in the contents of table fields.
One of the biggest factors in ensuring higher data quality is the availability of constraints in your data schemas. If you have set up constraints in the system to control permissible data, you will have cut down data cleansing effort considerably. The fewer constraints your schema has, the greater the effort in data cleansing. One very effective way to do that is by integrating data quality software within your systems and applications and pre-defining business rules based on the constraints you’d like to implement.
Data Ladder’s API-based integration ensures that all input data is checked against your business rules in real-time, regardless of whether data is being entered into your CRM or being loaded into an SQL database before.
Here are a few examples that will help you understand schema-related data quality issues:
Birthdate = 20.15.90
Entered values for birthdate are outside or permitted range because there can’t be a 15th month.
Other examples could include referential integrity issues, where a field refers to a specific department ID but that department ID hasn’t been defined in the system. Or uniqueness issues where social security number of two employees is listed as same.
On the other hand, instance-related issues may result from misspellings (statename = Coloradoo), abbreviations used mistakenly during data entry (profession = DB prog), and incorrect references (employee = [name= “John”, deptID= “22”], where that employee’s actual deptID was 12).
As you can see, if these issues are not fixed, queries using this data will return incorrect information, potentially leading to business decisions being based on bad data.
Data Cleansing in Multiple Source Integration
Naturally, data quality issues increase manifold when you’re integrating data into your data warehouse from multiple sources. In addition to the general issues like misspellings and incorrect references in data entry, you must reconcile data across multiple representations and build ‘master’ records for your of vision of ‘Single Source of Truth’ to become reality. You also must deal with data redundancy issues, where data overlaps across systems.
With respect to schema design, mainly you will need to deal with structural and naming conflicts in different source systems. Structural conflicts could include different data types, integrity constraints, component structure, etc. In terms of instance-related issues, multiple sources will probably mean that the same attribute is represented differently in different systems. For example: MaritalStatus could be “Married” or “Single” in one system, while another could have a record for RelationshipStatus with different attributes altogether – but showing marital status too.
Data duplication is another very important issue here, particularly when you have multiple records in multiple systems that point to the same entity. Businesses have millions of records related to customers, vendors, and employees, often in different databases. These databases have many duplicate records, which results in decreased operational efficiency as well as missing information. Multiple records mean details of the same customer/entity are spread to multiple records. Deleting duplicates and merging the records increases the efficiency of the database while creating a single source of truth.
Customer Table (ABC Source System):
|11||Kristen Smith||2 Hurley Pl||South Fork, MN 48503||0|
|24||Christian Smith||Hurley St 2||S Fork MN||1|
Client Table (XYZ Source System):
|Cno||Last Name||First Name||Gender||Address||Phone/Fax|
|11||Smith||Christoph||M||23 Harley St, Chicago IL, 60633-2394||333-222-6542 / 333-222-6599|
|493||Smith||Kris L.||F||2 Hurley Place, South Fork MN, 48503-5998||444-555-6666|
Customers Table (Integrated after data cleansing for target data warehouse):
|1||Smith||Kristen L.||F||2 Hurley Place||South Fork||MN||48503-5998||444-555-6666||11||493|
|2||Smith||Christian||M||2 Hurley Place||South Fork||MN||48503-5998||24|
|3||Smith||Christoph||M||23 Harley Street||Chicago||IL||60633-2394||333-222-6542||333-222-6599||11|
Integrating data into your data warehouse revolves around two important processes:
- Recognizing whether the same entity exists in both sources
- Combining entity data to obtain a consolidated view of an entity table
To get a clean, complete view, you must merge related data while purging redundant duplicates, as shown in the example above. The process is commonly known as ‘merge purge’ – the process of combining data from two or more sources, identifying and/or combining duplicates, and eliminating (purging) unwanted records.
To use your data warehouse to its fullest potential, merge purge is critical. Data can be analyzed to find insights, increase efficiency, and discover problems when you are designing your data warehouse to be the ‘single source of truth’ across the enterprise. Merge purge allows you to choose how the data should be merged and purged. Pick the data survivorship rules that serve your purpose best and our merge purge tool will go through millions of records, combining them into complete golden records. Your original data is preserved in its original form, and a new dataset is created containing all the information.
Build Reliable and Accurate Analytics Capabilities into Your Data Warehouse
Data Ladder’s data quality software has been rated the fastest and most accurate platform for record linkage across multiple independent studies. Leverage our powerful, proprietary matching capabilities to detect and purge duplicates, or merge and survive data to build a ‘single source of truth’ using world-class fuzzy matching, intelligent parsing, and pattern recognition techniques.
“My favorite part of DataMatch is how easy and flexible it is to use. I can use data from virtually any data source including ODBC connections, CSV files, and JSON files. It does a great job with data cleansing making the matching process even more powerful.”
The unrivaled speed, accuracy and low cost of DataMatch Enterprise make matching and linking records from all your data repositories a breeze, thanks to the wide variety of integrations that DataMatch Enterprise provides out-of-the-box.
Enhance your cleansing strategy in the data warehouse by leveraging our native integration with the numerous data sources and advanced record linkage features to find data matches across all supported repositories, regardless of whether your data lives in social media platforms and legacy systems or traditional databases and flat files and even Big Data lakes. Data Ladder integrates with virtually all modern systems to help you get the most out of your data warehouse.
Get in touch with our solution architects to work out a plan to scrub, clean, and finally validate data using advanced automation techniques while building your data warehouse and ensure your business users get accurate analytics.