As we move through the industrial revolution of data, companies are beginning to realize the inadequacy of traditional data management tools in handling the complexities of modern data. Many have had to experience rude wake-up calls with failed migration or transformation initiatives caused by poor data, missing data quality management systems, and a reliance on outdated methods that are no longer effective.
At Data Ladder, we see data quality as a necessary, on-going process that must be integrated within and across systems and departments. Moreover, the control over data should be balanced out between IT and business users, for the latter is the true owner of customer data and therefore must be equipped with systems that can enable them to profile and clean data without relying on IT for data preparation.
Data profiling and data cleansing are the two fundamental functions or components of Data Ladder’s data quality management solution and the starting point of any data management initiative. Put simply, to fix your data, you must know what’s wrong with it.
This post covers everything you need to know about the difference between data profiling and data cleansing.
Let’s dig in.
Data Profiling vs Data Cleansing – What’s the Key Difference?
In a data quality system, data profiling is a powerful way to analyze millions of rows of data to identify errors, missing information, and any anomalies that may affect the quality of information. By profiling data, you get to see all the underlying problems with your data that you would otherwise not be able to see.
Data cleansing is the second step after profiling. Once you identify the flaws within your data, you can take the steps necessary to clean the flaws. For instance, in the profiling phase, you discover that more than 100 of your records have phone numbers that are missing country codes. You can then write a rule within your DQM platform to insert country codes in all phone numbers missing them.
The key difference between the two processes is simple – one check for errors and the other lets you clean up errors.
Data profiling and data cleansing aren’t new concepts. However, they have largely been limited to manual processes within data management systems. For instance, data profiling has always been done by IT and data experts using a combination of formulas and codes to identify basic-level errors. The mere profiling process would take weeks to accomplish and even then, critical errors would be missed. Data cleansing was another nightmare. It could take months to clean up a database, including removing duplicates (with a very low accuracy rate). While these methods may have worked for simple data structures, it’s next to impossible to apply the same methods on modern data formats.
Here’s an example of data profiling using Microsoft Visual Studio. IT managers would have to manually set up this workflow just to identify errors in a data source.
Imagine using this workflow for complex, inconsistent third-party data such as that of social media! Imagine having multiple data sources streaming in from multiple apps, platforms (online and offline), vendors, suppliers etc. Imagine recreating this workflow for every different type of data source. Not only is this time-consuming but ineffective in handling large amounts of data.
Using the same example, let’s assume the name was incorrectly entered at the data entry stage and is spelled as Johnnathan Smith in the CRM. If the team handling this data doesn’t run a profile check, they wouldn’t know it’s an error and would proceed to send out an email to the customer with the wrong spelling. I leave you to imagine the consequences of that.
Data profiling, therefore, helps the business identify this spelling error right at the beginning of the process. Data cleansing then lets the user correct this error by replacing it with the right name.
How Data Profiling Was Traditionally Done and Why It’s Important to Turn to Automation
Data profiling is a crucial part of data warehouse and business intelligence projects, where data quality issues in data sources are identified. Furthermore, data profiling allows users to uncover new requirements for a target system. But until recently, data profiling has been an exhausting process that involves manual tasks as:
- Creating data validation rules and formulas to tackle common data error issues
- Calculate statistics like min, max, count and sum
- Cross-column analysis to identify dependencies
- Identifying inconsistencies in the format
Depending on the volume of data and the nature of problems, this step can take days to get done. Once discoveries are made, new rules are added to the list of rules which are then enforced in a cleansing phase. For instance, after discovering that the most frequent pattern for a date is (YYYY/MM/DD) then this pattern is transformed into a rule (done via coding) for all numbers to be formatted accordingly. This rule later might need to be reversed if another data set (such as the format of a new CRM) requires dates to be (DD/MM/YYYY).
In its traditional setting, data profiling is non-trivial – it’s computationally complex to discover data issues, it’s virtually impossible to determine all the constraints that affect columns and lastly, it’s often performed on a large volume of data sets that may not fit into main memory, thus limiting the ability to process exponential numbers of data.
Various tools and algorithms have been introduced to tackle these challenges, such as the use of SQL queries to profile data in the DBMS. While these tools have improved the process considerably, they are still lacking in interactive profiling and handling the complexities of big data. This is where the need for automated self-service tools step in.
An automated self-service data profiling tool like Data Ladder’s DataMatch Enterprise performs complex computational processes using machine-learning technologies and fuzzy matching algorithms. With an easy to use point-and-click interface, business users can easily plug in their data source and let the software do all the computational analysis based on built-in business rules (created via historical learnings of the most common types of errors in a data source).
Use Case 1 – Profiling for Missing Values
Automated data profiling helps you get to the problem faster. For instance, if the profiling reveals you’ve got missing last names or phone numbers in the database, you can easily identify the cause of that problem. Does your webform make it mandatory to enter phone numbers with country codes? Do you have data entry protocols and standards in place? Automation allows you to focus on what matters most – solving the root cause of the problem rather than wasting time fixated on the problem.
Making the Case for Automated Data Cleansing
How do you clean messed up data like this?
In a traditional ETL or Excel environment, cleaning this level of dirty data would take you days. You’d have to run scripts, break down the data, move it around, handle clusters of data, tackling one problem at a time before you can transform this into a standardized format.
Now, if you use an automated data cleansing solution, you’d be able to transform this data by simply clicking on checkboxes.
- Normalize text style.
- Remove unwanted characters
- Remove accidental typos during data entry (these are hard to catch!)
- Clean up spaces between letters/words
- Transform nicknames into actual names (John instead of Johnny)
Try doing the same using Excel and it’d take you hours just to normalize letters.
Business intelligence workers spend between 50 – 90% of their time in prepping data for analysis! Not that cleaning or preparing data is not part of their job, but if they’re being bogged down by what could be achieved via automation, it’s a waste of their time. Data scientists and business intelligence workers need assistance in making sense of data and that comes in the form of a powerful automated solution.
Use Case 2 – Data Profiling and Data Cleansing in Action
P.S: Grab our free data profiling CHEAT SHEET and evaluate your data for basic inconsistencies.
Case Study – Lamb Financial
Lamb Financial Group is one of the fastest-growing insurance brokers in the country, providing financial services exclusively to non-profits and social service organizations nationwide. Based in New York, the organization works with carriers who specialize in non-profits and charities, providing Worker’s Compensation, Casualty, Professional Liability, Property, Auto, Directors and Officers, Group Health, and Disability Insurance.
With so many different data sources to look after, and accuracy being the most critical element of their business, the company’s resources would spend days in data cleansing and standardization. With more business and data streaming in, rising in complexity, the company could no longer afford to clean and standardize manually.
DataMatch Enterprise™ provided an easier data cleansing solution for the organization. They were able to quickly use our Wordsmith™ function, which is a standardization tool to create custom libraries settings. This allowed them to cleanse the data and find the best possible match. The software has now become an important part of their process when importing new data into their CRM.
Want a walkthrough of the product? Book a demo with us and let our experts help you get clean data!