Blog

Data quality management: What, why, how, and best practices

The most common data issue that enterprises deal with is quality. You have the right data applications deployed, the sources capture the kind of data you need, there is an entire system that uses and analyzes the collected data, and yet the results are unsatisfactory. On further analysis, you find gaps between data expectations and reality: the datasets are filled with blank fields, inconsistent abbreviations and formats, invalid patterns, duplicate records, and other such discrepancies.

To eliminate these problems, you must implement corrective measures that consistently validate and fix data quality issues. But to make the data quality dream a reality, it is necessary to understand the basics of data quality – its meaning, impact, and how to plan for improvement. For this reason, we are sharing with you a comprehensive guide that covers everything related to data quality management: what it means, how it can impact a business, how it can be managed, what it looks like across various verticals, and more.

This guide is divided into three parts:

  1. Data quality: What is it and why is it important?
  2. Data quality issues: What are they, where do they come from, and how do they impact business?
  3. Data quality management: What it means, its pillars and best practices, and some real-world examples in various industries.

Let’s get started.

Data quality

What is data quality?

The degree to which the data fulfills the requirements of an intended purpose.

Organizations store, manage, and use large volumes of data every day. If the data fails to serve its purpose, it is considered to be of poor quality. This definition of data quality implies that its meaning differs depending on the organization it belongs to and the purpose it serves.

For some companies, data completeness may be a better indicator of data quality than data accuracy. 

This leads businesses to define their own set of characteristics and requirements for maintaining the quality of data across the organization. There’s another way to define data quality:

The degree to which data is free from intolerable defects.

Data can never be a hundred percent accurate and defect-free. It is bound to have some errors, and that is acceptable. But having intolerable defects in your dataset – defects that damage the execution of critical processes – indicates poor data quality. You have to make sure that the data structure is as required and its contents are as free from defects as possible.

Why is data quality important?

Keeping data clean should be a collective effort between business users, IT staff, and data professionals. But oftentimes, poor data quality is perceived as just an IT glitch – the belief being that data becomes dirty when technical processes for capturing, storing, and transferring data do not work correctly. Although this can be the case, data needs the attention of all the right stakeholders to maintain its quality over time. For this reason, it becomes imperative to build a case for data quality in front of the necessary decision-makers, so that they can help enable it across all departments and levels.

Below, we have listed the most common benefits of data quality.  

01. Accurate decision making

Business leaders do not rely on assumptions anymore, but rather utilize business intelligence techniques to make better decisions. This is where good data quality can enable accurate decision-making, while poor data quality can skew data analysis results, leading businesses to base crucial decisions on incorrect forecasts. 

02. Operational efficiency

Data is part of every small and big operation at a company. Whether it is product, marketing, sales, or finance, using data efficiently in every area is key. Using quality data in these departments helps your team eliminate duplicate effort, reach accurate results quickly, and stay productive throughout the day.

03. Compliance

Data compliance standards (such as GDPR, HIPAA, and CCPA) require businesses to follow the principles of data minimization, purpose limitation, transparency, accuracy, security, storage limitation, and accountability. Conformance with such data quality standards is only possible with clean and reliable data.  

04. Financial operations

Businesses incur huge amounts of financial costs due to poor data quality. Operations such as making timely payments, preventing underpay and overpay incidents, eliminating incorrect transactions, and avoiding chances of fraud due to data duplication are only possible with clean and high-quality data. 

05. Customer personalization and loyalty

Offering personalized experiences to customers is one of the most effective ways to convince them to buy from your brand instead of a competitor. Companies utilize a ton of data to understand customer behavior and preferences. With accurate data, you can discover relevant buyers and offer them exactly what they are looking for – ensuring customer loyalty in the long run while making them feel like your brand understands them like no one else.

06. Competitive advantage

Almost every player in the market utilizes data to understand future market growth and possible opportunities to upsell and cross-sell. Feeding quality historical data into this analysis will help you build a competitive advantage in the market, convert more customers, and grow your market share.

07. Digitization

Digitization of crucial processes can help you eliminate manual effort, speed up processing time, and reduce human error. But with poor data quality, such expectations cannot be fulfilled. Rather, poor data quality will land you in a digital disaster where data migration and integration seem impossible due to varying database structures and inconsistent formats.

Data quality issues

A data quality issue is defined as: 

an intolerable defect in a dataset – one that negatively impacts the trustworthiness and reliability of that data.

Before we can move on to implementing corrective measures to validate, fix, and improve data quality, it is imperative to understand what is polluting the data in the first place. For this reason, we are first going to look at:

What are the most common data quality issues?

| No. | Data quality issue | Explanation | Example |
| --- | --- | --- | --- |
| 1 | Column duplication | Multiple columns are present that have the same logical meaning. | Product category is stored in two columns that logically mean the same thing: Category and Classification. |
| 2 | Record duplication | Multiple records are present for the same individual or entity. | Every time a customer interacts with your brand, a new row is created in the database rather than updating the existing one. |
| 3 | Invalid data | Data values are present in an incorrect format, pattern, data type, or size. | Customer Phone Numbers are present in varying formats: some are stored as flat 10 digits while others have hyphens, some are saved as strings while others as numbers, and so on. |
| 4 | Inaccurate data | Data values do not conform to reality. | Customer Name is incorrectly stored: Elizabeth is stored as Aliza, or Matt is stored as Mathew. |
| 5 | Incorrect formulae | Data values are calculated using incorrect formulae. | Customer Age is calculated from Date of Birth, but the formula used is incorrect. |
| 6 | Inconsistency | Data values that represent the same information vary across different datasets and sources. | The customer record stored in the CRM has a different Email Address than the one present in the accounts application. |
| 7 | Missing data | Data is missing or is filled with blank values. | The Job Title of most customers is missing from the dataset. |
| 8 | Outdated data | Data is not current and represents outdated information. | Customer Mailing Addresses are years old, leading to returned packages. |
| 9 | Unverified domain data | Data does not belong to a range of acceptable values. | Customer Country contains values outside the accepted list of country codes. |
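As a quick illustration, here is a minimal sketch of how a few of the issues above (record duplication, missing data, invalid data) might be flagged with Python and pandas; the table name, column names, and sample values are purely illustrative assumptions, not a prescription.

```python
import pandas as pd

# Hypothetical customer table with a duplicate record, a missing name,
# and an invalid phone number (all values are made up for illustration).
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "name": ["Elizabeth", "Matt", "Matt", None],
    "phone": ["5551234567", "555-987-6543", "555-987-6543", "not-a-phone"],
})

# Record duplication: more than one row for the same customer_id
duplicates = customers[customers.duplicated(subset="customer_id", keep=False)]

# Missing data: null values in a required field
missing_name = customers[customers["name"].isna()]

# Invalid data: phone values that are not exactly 10 digits once hyphens are removed
cleaned = customers["phone"].str.replace("-", "", regex=False)
invalid_phone = customers[~cleaned.str.fullmatch(r"\d{10}")]

print(duplicates, missing_name, invalid_phone, sep="\n\n")
```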

How do data quality issues enter the system?

There are multiple ways data quality errors can end up in your system. Let’s take a look at what they are. 

01. Lack of proper data modeling

This is the first and most significant reason behind data quality errors: the IT team does not spend the right amount of time or resources on data modeling while adopting new technology – whether it is a new web application, a database system, or an integration/migration between existing systems.

Data modeling helps to organize and give structure to your data assets and elements. Your data models can be susceptible to any of the following issues: 

a) Lack of hierarchical constraints:  This relates to when there are no appropriate relationship constraints within your data model. For example, you have a different set of fields for Existing Customers and New Customers, but you use a generic Customer model for both, rather than having Existing Customers and New Customers as subtypes of the supertype Customer.

b) Lack of relationship cardinality: This relates to when cardinality – the number of relations one entity can have with another – is not defined in the data model. For example, one Order can only have one Discount at a time.

c) Lack of referential integrity: This relates to when a record in one dataset refers to a record in another one that does not exist. For example, the Sales table refers to a list of Product IDs that are not present in the Products table. A minimal check for this is sketched below.
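A referential integrity gap like the one above can be detected with a simple anti-join. Below is a minimal sketch in Python with pandas; the table and column names are illustrative assumptions.

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2, 3]})
sales = pd.DataFrame({"order_id": [10, 11, 12], "product_id": [1, 4, 5]})

# Rows in Sales whose product_id has no matching record in Products
orphan_sales = sales[~sales["product_id"].isin(products["product_id"])]
print(orphan_sales)  # orders 11 and 12 reference products that do not exist
```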

02. Lack of unique identifiers

This relates to when there is no way to uniquely identify a record, leading you to store duplicate records for the same entity. Records are uniquely identified by storing attributes like Social Security Number for Customers, Manufacturer Part Number for Products, etc.  

03. Lack of validation constraints

This relates to when data values are not run through the required validation checks before being stored in the database – for example, checking that required fields are not missing, validating the pattern, data type, size, and format of data values, and ensuring that they belong to a range of acceptable values.
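As an illustration, here is a minimal sketch of what such pre-insert validation checks could look like in Python; the field names, phone pattern, and accepted country list are illustrative assumptions.

```python
import re

ACCEPTED_COUNTRIES = {"US", "CA", "GB"}  # hypothetical domain of acceptable values

def validate_customer(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    if not record.get("email"):                                   # required-field check
        errors.append("email is required")
    phone = record.get("phone", "")
    if phone and not re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone):   # pattern/format check
        errors.append("phone must match the pattern 999-999-9999")
    if record.get("country") not in ACCEPTED_COUNTRIES:           # domain/range check
        errors.append("country is outside the accepted list")
    return errors

print(validate_customer({"email": "", "phone": "55512", "country": "FR"}))
```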

04. Lack of integration quality

This relates to when your company has a central database that connects to multiple sources and integrates incoming data to represent a single source of information. If this setup lacks a central data quality engine for cleaning, standardizing, and merging data, then it can give rise to many data quality errors. 

05. Lack of data literacy skills

Despite all the right efforts being made to protect data and its quality across datasets, a lack of data literacy skills in an organization can still cause a lot of damage to your data. Employees often store wrong information as they don’t understand what certain attributes mean. Moreover, they are unaware of the consequences of their actions, such as what are the implications of updating data in a certain system or for a certain record. 

06. Data entry errors

Mistyping and misspelling are among the most common sources of data quality errors. Humans are estimated to make around 400 errors per 10,000 manual data entries. This shows that even with unique identifiers, validation checks, and integrity constraints in place, there is a chance that human error can creep in and degrade your data quality.

How do data quality issues relate to business risks?

To onboard relevant decision makers, it is important to educate them on how big and small data quality issues impact the business. A data flaw to business risk matrix, like the one shown below, can help you do just that.

| Problem | Issue | Business risk | Quantifier | Cost |
| --- | --- | --- | --- | --- |
| The data quality problem that resides in your dataset | The issue that can arise due to the data problem | The impact the issue can have on the business | A business measure that quantifies the impact | A periodic estimated cost incurred due to the business impact |

Example:

| Problem | Issue | Business risk | Quantifier | Cost |
| --- | --- | --- | --- | --- |
| Misspelled customer name and contact information | Duplicate records created for the same customer | Customer service: increased number of inbound calls | Increased staff time | $30,000.00 worth of additional staff time required |
| Misspelled customer name and contact information | Duplicate records created for the same customer | Customer service: decreased customer satisfaction | Order reduction, lost customers | ~500 fewer orders this year (compared to the estimate) |

Data quality management

We have covered the fundamentals of data quality, data quality issues, and how they relate to business risks. Now it’s time to look at the data quality management plan: how you can fix and consistently manage data quality over time and reap all the benefits it can bring to your business. Let’s begin.

What is data quality management?

Data quality management is defined as:

Implementing a systematic framework that continuously profiles data sources, verifies the quality of information, and executes a number of processes to eliminate data quality errors – in an effort to make data more accurate, correct, valid, complete, and reliable.

Since the requirements and characteristics of data quality are different for every organization, data quality management also differs between enterprises. The types of people you need to manage data quality, the metrics you need to measure it, the data quality processes you need to implement – everything depends on multiple factors, such as company size, dataset size, and the sources involved.

Here, we discuss the main pillars of data quality implementation and management that will give you a good idea about how to ensure data quality at your company for your specific requirements.

What are the 5 pillars of data quality management?

In this section, we look at the most important pillars of data quality management: people, measurement, processes, framework, and technology. 

01. People: Who is involved in data quality management?

It is a common belief that managing data quality across the organization only requires approvals and buy-in from decision makers. But the truth is, you also need data professionals appointed at different levels of seniority to ensure your investments in data quality initiatives pay off.

Here are some roles that are either responsible, accountable, consulted, or informed about data quality control in an organization: 

a) Chief Data Officer (CDO): A Chief Data Officer (CDO) is an executive-level position, solely responsible for designing strategies that enable data utilization, data quality monitoring, and data governance across the enterprise. 

b) Data steward: A data steward is the go-to person at a company for every matter related to data. They are completely hands-on with how the organization captures data, where it is stored, what it means for different departments, and how its quality is maintained throughout its lifecycle.

c) Data custodian: A data custodian is responsible for the structure of data fields – including database structures and models.

d) Data analyst: A data analyst is someone who is capable of taking raw data and converting it into meaningful insights – especially in specific domains. A main part of a data analyst’s role is to prepare, clean, and filter the required data.

e) Other teams: Teams such as sales and marketing, product, and business development are considered data consumers – they use data either in its raw form or once it has been converted into actionable insights.

02. Measurement: How is data quality measured?

The second most important aspect of data quality management is its measurement: the data characteristics and key performance indicators that validate the presence of data quality in organizational datasets. Depending on how your company uses data, these KPIs may differ. Below, we have listed the most important data quality dimensions and the quality metric each one represents:

[Figure: data quality dimensions and the metrics they represent]
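To make this concrete, here is a minimal sketch of how a few common dimensions (completeness, uniqueness, validity) can be turned into measurable scores with Python and pandas; the columns and the phone pattern are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", None, "b@example.com", "b@example.com"],
    "phone": ["555-123-4567", "5551234567", None, "555-987-6543"],
})

completeness = df.notna().mean().mean()              # share of non-null cells
uniqueness = 1 - df.duplicated().mean()              # share of non-duplicate rows
validity = df["phone"].str.fullmatch(r"\d{3}-\d{3}-\d{4}").fillna(False).mean()

print(f"completeness={completeness:.0%}  uniqueness={uniqueness:.0%}  phone validity={validity:.0%}")
```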

03. Process: What are data quality processes?

Since data has grown massively in the last few decades, it has become multi-variate and is measured across multiple dimensions. To find and fix data quality issues, you must implement a variety of data quality processes – each of which serves a different, valuable purpose. Let’s take a look at the most common data quality processes that companies use to improve their data quality.

a) Data profiling 

It is the process of understanding the current state of your data by uncovering hidden details about its structure and contents. A data profiling algorithm analyzes dataset columns and computes statistics for various dimensions, such as completeness, uniqueness, frequency distribution, and character and pattern analysis.
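As an example, a very small slice of what a profiling pass produces (per-column fill rate, distinct counts, and pattern frequencies) could be sketched in Python as follows; the columns and the pattern notation are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "zip":   ["68845", "68845", None, "ABC12"],
    "state": ["NE", "NE", "IA", "ne"],
})

def pattern_of(value) -> str:
    # Map characters to a pattern symbol: 9 for digits, A for letters, others kept as-is
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in str(value))

for column in df.columns:
    series = df[column]
    print(
        column,
        f"fill rate: {series.notna().mean():.0%}",
        f"distinct: {series.nunique()}",
        f"patterns: {series.dropna().map(pattern_of).value_counts().to_dict()}",
    )
```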

b) Data cleansing and standardization

It is the process of eliminating incorrect and invalid information present in a dataset to achieve a consistent and usable view across all data sources. It involves removing and replacing incorrect values, parsing longer columns, transforming letter cases and patterns, and merging columns, etc. 
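For example, a minimal standardization pass over names and phone numbers might look like the sketch below; the cleansing rules (trim, title case, one phone pattern) are illustrative assumptions rather than a fixed recipe.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["  matt JONES ", "ELIZABETH  SMITH"],
    "phone": ["5551234567", "555-987-6543"],
})

# Trim whitespace, collapse repeated spaces, and normalize letter case
df["name"] = df["name"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()

# Strip non-digits, then reformat every phone number into a single pattern
digits = df["phone"].str.replace(r"\D", "", regex=True)
df["phone"] = digits.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)

print(df)
```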

c) Data matching 

Also known as record linkage and entity resolution, it is the process of comparing two or more records and identifying whether they belong to the same entity. It involves mapping the same columns, selecting columns to match on, executing match algorithms, analyzing match scores, and tuning the match algorithms to get accurate results. 
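As a toy illustration of the idea, the sketch below compares two records with a weighted fuzzy similarity score using Python’s standard-library difflib; real matching engines use more sophisticated algorithms, and the field weights and 0.80 threshold here are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Normalized edit-based similarity between two strings (0.0 to 1.0)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = {"name": "Jon Smith",  "email": "jon.smith@example.com"}
record_b = {"name": "John Smith", "email": "jsmith@example.com"}

# Weighted match score across the columns selected for matching
score = 0.7 * similarity(record_a["name"], record_b["name"]) \
      + 0.3 * similarity(record_a["email"], record_b["email"])

print(f"match score: {score:.2f} ->", "same entity" if score >= 0.80 else "needs review")
```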

d) Data deduplication 

It is the process of eliminating multiple records that belong to the same entity and retaining only one record per entity. This includes analyzing the duplicate records in a group, marking records that are duplicates, and then deleting them from the dataset. 

e) Data merge and survivorship 

It is the process of building rules that merge duplicate records together through conditional selection and overwriting. This helps you prevent data loss and retain maximum information from duplicates. It involves defining rules for master record selection and overwriting, executing those rules, and tuning them to get accurate results.
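A minimal sketch of one possible survivorship rule is shown below: within a group of matched duplicates, treat the most recently updated record as the master and fill its blanks from the other records. The rule itself and the field names are illustrative assumptions, not the only valid approach.

```python
import pandas as pd

# A group of records already matched as the same customer (illustrative values)
dupes = pd.DataFrame({
    "name":       [None, "Matt Jones", "Matthew Jones"],
    "email":      ["matt@example.com", None, "matt.j@old-domain.com"],
    "updated_at": pd.to_datetime(["2024-06-01", "2023-01-01", "2022-03-15"]),
})

# Master record selection rule: the most recently updated record wins
ordered = dupes.sort_values("updated_at", ascending=False)
master = ordered.iloc[0].copy()

# Overwrite rule: only fill fields the master is missing, never lose data
for _, row in ordered.iloc[1:].iterrows():
    master = master.fillna(row)

print(master)  # master keeps its own values and inherits the missing name
```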

f) Data governance 

The term data governance usually refers to a collection of roles, policies, workflows, standards, and metrics that ensures efficient data usage and security and enables a company to reach its business objectives. It involves creating data roles and assigning permissions, designing workflows to verify information updates, ensuring data is safe from security risks, and so on.

g) Address verification 

It is the process of running addresses against an authoritative database – such as USPS in the US – and validating that the address is a mailable, accurate, and valid location within the country for delivering mail. 

04. Framework: What is a data quality framework?

Apart from data quality processes, another important aspect to consider while designing a data quality strategy is a data quality framework. The processes represent stand-alone techniques used to eliminate data quality issues from your datasets. A data quality framework, in contrast, is a systematic cycle that consistently monitors data quality, implements a variety of data quality processes (in a defined order), and ensures that quality does not deteriorate below defined thresholds. In other words, it defines the data quality management process flow.

A simple data quality framework consists of four stages: 

[Figure: data quality framework lifecycle – assess, design, execute, monitor]

a) Assess: This is the first step of the framework, where you assess two main components: what data quality means for your business and how the current data scores against it.

b) Design: The next step is to design the required business rules – selecting the data quality processes you need and tuning them to your data – as well as deciding the architectural design of the data quality functions.

c) Execute: The third stage of the cycle is where the execution happens. You have set the stage in the previous two steps; now it’s time to see how well the system actually performs.

d) Monitor: This is the last stage of the framework, where the results are monitored. You can use advanced data profiling techniques to generate detailed performance reports. A small sketch of how these stages come together follows below.
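To tie the four stages together, here is a deliberately small sketch of the feedback loop in Python; the dimension names, scores, and thresholds are illustrative assumptions you would replace with the rules defined in your own design stage.

```python
# Design: agreed thresholds per data quality dimension (illustrative values)
thresholds = {"completeness": 0.95, "uniqueness": 0.99, "validity": 0.97}

# Assess/Execute: scores computed from profiling after the cleansing run
scores = {"completeness": 0.96, "uniqueness": 0.93, "validity": 0.98}

# Monitor: flag any dimension that fell below its threshold
failures = {dim: score for dim, score in scores.items() if score < thresholds[dim]}

if failures:
    print("quality below threshold, feed back into the cycle:", failures)
else:
    print("data quality within agreed thresholds")
```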

05. Technology: What are data quality management tools?

Although data quality issues are complex in nature, many businesses still validate data quality manually – giving way to multiple errors. Adopting a technological solution to this problem is the best way to ensure your team’s productivity and the smooth implementation of a data quality framework. Many vendors package data quality functions in different offerings, such as:

a) Stand-alone, self-service data quality software

This type of data quality management software allows you to run a variety of data quality processes on your data. It usually comes with automated data quality management or batch processing features to clean, match, and merge large amounts of data at scheduled times each day. It is one of the quickest and safest ways to consolidate data records without losing any important information, since all processes are executed on a copy of the data and the final data view can be transferred to a destination source.

b) Data quality API or SDK

Some vendors expose the necessary data quality functions through APIs or SDKs. This helps you integrate data quality management features into your existing applications in real time or at runtime. Read more: Data quality API – functions, architecture, and benefits.

c) Data quality embedded in data management tools 

Some vendors embed data quality capabilities within centralized data management platforms so that everything is taken care of in the same data pipeline. Designing an end-to-end data management system with embedded data quality functions requires detailed planning and analysis, as well as the involvement of key stakeholders at every step of the process. Such systems are often packaged as master data management solutions.

Read more about how data quality management differs from master data management.

d) Custom in-house solutions

Despite the various data quality and master data management solutions present in the market, many businesses invest in developing an in-house solution for their custom data needs. Although this may sound very promising, businesses often end up wasting a great deal of resources – time and money – in the process. Such a solution may be easy to develop initially, but it is almost impossible to maintain over time.

To know more about this, you can read our whitepaper: Why in-house data quality projects fail. 

What are the best practices for data quality management?

Let’s take a quick look at the data quality best practices: 

a) Find out the relation between data and business performance, and the exact impact poor data quality has on your business goals and objectives.

b) Measure and maintain the definition of data quality by selecting a list of metrics that will help you and your teams stay on the same page about data quality and what it means for your organization.

c) Establish data roles and responsibilities across the organization to make people responsible for attaining and maintaining data quality – from top-level to operational staff.

d) Train and educate teams about data assets and their attributes, how to handle data, and the impact of their actions on the entire data ecosystem.

e) Continuously monitor the state of data through data profiling and uncover hidden details about its structure and content. 

f) Design and maintain data pipelines that execute a defined sequence of operations on incoming data to attain a single source of truth.

g) Perform root-cause analysis of data quality errors to understand where they are coming from and fix these issues at the source.

h) Utilize technology to attain and sustain data quality, because no process can be expected to perform well and deliver the best ROI unless it is automated and optimized using technology.

To learn more about each of these practices, read our detailed blog: 8 best practices to ensure data quality at enterprise-level.

Real-world examples of data quality management

In this final section of our guide, we will look at some data quality use cases and see how well-known brands use data cleansing and matching tools to manage the quality of their data.

01. Data quality management in retail

Buckle is a leading upscale retailer of denim, sportswear, outerwear, footwear, and accessories, with over 450 stores in 43 states. Buckle was dealing with the challenge of sorting through large amounts of data records from hundreds of stores. The main task at hand was eliminating all the duplicate information that had been loaded into their current iSeries DB2 system. They were looking for an efficient way to remove duplicate data, which accounted for approximately 10 million records.  

DataMatch Enterprise™ provided a usable and more efficient solution for Buckle. The company was able to run a large number of records through the deduplication process as a single project using one software tool, as opposed to using several different methods.

02. Data quality management in healthcare

St. John Associates provides placement and recruiting services in Cardiology, Emergency Medicine, Gastroenterology, Neurological Surgery, Neurology, Orthopedic Surgery, and other fields. With a growing database of recruitment candidates, St. John Associates needed a way to dedupe, clean and match records. After several years of performing this task manually, the company decided it was time to deploy a tool to reduce the time spent on cleaning records. 

With DataMatch Enterprise, St. John Associates was able to perform an initial data cleansing operation, finding, merging, and purging hundreds of thousands of records in a short period of time. DataMatch™ helped speed up the process of deduplication through fuzzy matching algorithms and made it easier to sort through data fields to find null information. It also eliminated the need for manual entry, enabling users to export changes and upload them as needed.

03. Data quality management in financial services

Bell Bank is one of the largest independently owned banks in the nation, with assets of more than $6 billion and business in all 50 states. As a large private bank, Bell Bank deals with many vendor partners and dozens of service lines – from mortgage to insurance, from retirement to wealth management, and many more. With information siloed away and stored in disparate data sources, the bank found it challenging to get a single, consolidated view of its customers; not to mention, the bank was also incurring unnecessary expenses as a result of sending multiple mailings to the same vendor or customer.

DataMatch Enterprise forms a critical part of the bank’s larger in-house data management solution, allowing them to easily group results and return a list of all customer records that are believed to belong to one entity. This consolidated view helps the bank truly understand each customer’s association with the bank and the steps it can take to further strengthen that association.

04. Data quality management in sales and marketing

TurnKey Auto Events conducts high-volume car buying campaigns for automotive dealers nationwide. They produce events that compel car buyers to attend and purchase vehicles. As a provider of sales leads for automotive vendors, TurnKey Marketing was looking to receive credit for additional sales procured with the various dealerships they partner with.

By being able to match sales with the multitude of potential prospects they speak to daily, they receive sales credit (and earn money) for each lead. Using DataMatch™, Data Ladder’s sophisticated data matching product, the company was able to match records from several sources. From there they were able to create a bird’s eye view of a potential car sale over time. 

05. Data quality management in education

West Virginia University is the state’s only research, doctoral degree-granting, land-grant university. The school offers nearly 200 degree programs at the undergraduate, graduate, doctoral, and professional levels. The university was tasked with assessing the long-term impacts of certain medical conditions on patients over an extended period of time. The data for the medical conditions and the current health records provided by the state exist in separate systems.

Using DataMatch™, Data Ladder’s flagship data cleansing product, the university was able to clean records from several systems containing the required information. From there they were able to create a unified view of the patient over time. 

The final word

Business leaders understand the importance of data – from routine operations to advanced business intelligence, it is utilized everywhere. But most teams working with data spend extra hours because of duplicate work, lack of data knowledge, and faulty results. And all these issues arise due to poor or no management of data quality.  

Investing in data quality tools, such as DataMatch Enterprise, will definitely help you get started with data quality management. DataMatch takes you through the different stages of data cleansing and matching: starting with importing data from various sources, it guides you through data profiling, cleansing, standardization, and deduplication. On top of that, its address verification module helps you verify addresses against the official USPS database.

DataMatch also offers scheduling features for batch processing records, or you can utilize its API to integrate data cleansing and matching functions into custom applications and get instant results.

Book a demo today or download a free trial to learn more about how we can help you to get the most out of your data. 
