
12 Most Common Data Quality Issues and Where They Come From

In early 2022, Unity Technologies faced an unexpected data quality issue that cost the company $110 million. Its ad targeting tool ingested flawed data, which disrupted the company’s machine learning models and sent its stock plummeting by 37%. Revenue took a massive hit, product launches were delayed, and investors quickly lost faith in the company’s ability to stay on course.

Unity’s story is far from unique.

Data quality issues continue to plague organizations of all shapes and sizes across all industries, disrupting decision-making, draining resources, and leading to preventable failures.

According to recent studies, 91% of businesses acknowledge that data quality impacts their operations, yet only a small fraction make it a priority. A 2022 State of Data Quality Report showed that only 23% of companies had made data quality part of their organizational ethos. What’s even more alarming is that, despite the clear risks, 41% of organizations still operate without a formal data quality strategy.

As we’ve seen with Unity (and numerous others), data quality issues—whether it’s duplicates, missing values, inaccurate data, or outdated information—pose real risks to revenue, customer trust, and operational efficiency. But the good news is that these issues are preventable!

By addressing common data quality challenges proactively, organizations can safeguard themselves from losing revenue and reputation, keep their operations and stakeholder trust intact, and unlock sustainable growth.

What’s important to remember here is that while organizations spend considerable time and resources designing data quality frameworks and fixing data quality issues, getting good results requires understanding the exact nature of these issues and how they end up in the system in the first place.

What Is a Data Quality Issue?

A data quality issue refers to the presence of an intolerable defect in a dataset, one serious enough to reduce the reliability and trustworthiness of that data.

Data stored across disparate sources is bound to contain data quality issues. These issues can be introduced into the system for a number of reasons, such as human error, incorrect data entry, outdated information, or a lack of data literacy in the organization. Since data fuels critical business functions, such issues can pose serious risks and cause real damage to the company.

The need to leverage high-quality data across all business processes is quite obvious. Leaders are investing in hiring data quality teams because they want to make people responsible for attaining and maintaining data quality. Additionally, they are designing complex data quality frameworks and adopting advanced technologies to ensure fast and accurate data quality management. All these efforts are done in the hopes of making the clean data dream come true.

But none of this is possible without understanding what is polluting the data in the first place and where exactly it is coming from. Without this insight, even the most sophisticated data management initiatives can fall short, leaving organizations vulnerable to the very problems they seek to eliminate.

Top 12 Data Quality Issues Faced by Companies


I recently went through some customer notes and gathered a list of some of the most common data quality problems that are present in organizational data. Let’s take a look at this list.

Issue#01: Lack of Record Uniqueness

The average organization uses about 112 SaaS applications these days. The sheer number and variety of applications used to capture, manage, store, and use data is a main reason behind poor data quality. It not only creates data silos but also complicates data pipelines. And the most common issue that arises in such situations is that you end up storing multiple records for the same entity.

For example, all interactions that a customer has with your brand during their buying journey are recorded somewhere in a database. These records may come from websites, landing page forms, social media advertising, sales records, billing records, marketing records, point-of-purchase records, and other such areas. If there’s no systematic way of resolving customer identities and consolidating the collected data (merging new information with existing records), you can end up with redundant data or multiple disjointed records for the same person throughout your datasets.

To fix this duplicate data, you will have to run advanced data matching algorithms that compare two or more records and calculate the likelihood of them belonging to the same entity.
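To make this concrete, here’s a minimal Python sketch of the idea using simple fuzzy string similarity from the standard library. The sample records and column names (name, email) are purely illustrative; a production matching engine would compare many more attributes and use blocking techniques to scale beyond a handful of records.

```python
# Minimal duplicate-detection sketch using fuzzy string similarity.
# Column names ("name", "email") are hypothetical; real pipelines compare
# many more attributes and use blocking to avoid all-pairs comparisons.
from difflib import SequenceMatcher
import pandas as pd

records = pd.DataFrame([
    {"id": 1, "name": "John A. Smith",  "email": "jsmith@example.com"},
    {"id": 2, "name": "Jon Smith",      "email": "jsmith@example.com"},
    {"id": 3, "name": "Maria Gonzalez", "email": "maria.g@example.com"},
])

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # assumed cutoff for "likely the same entity"

# Compare every pair of records and flag likely duplicates.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records.loc[i, "name"], records.loc[j, "name"])
        same_email = records.loc[i, "email"] == records.loc[j, "email"]
        if score >= THRESHOLD or same_email:
            print(f"Possible duplicate: ids {records.loc[i, 'id']} and "
                  f"{records.loc[j, 'id']} (name similarity {score:.2f})")
```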

How to build a unified, 360 customer view

Download this whitepaper to learn why it’s important to consolidate your data to get a 360 view.

Download

Issue#02: Lack of Relationship Constraints

Datasets often include multiple data assets that are interrelated, but when these relationships aren’t clearly defined or enforced, you can end up with a lot of incorrect and incomplete information.

Take, for example, a scenario where your customer portal contains records for New Businesses you won this year as well as Existing Customers who upgraded from last year.

While both will contain the basic customer details, there must be some data fields that only apply to New Businesses and some that are exclusive to Existing Customers.

You may handle both datasets with the same, generalized data model, but doing so can open the door to a lot of data quality issues, such as missing necessary information, ambiguous information, or incorrect data in customer records.

To avoid these pitfalls, you should always create specific data models and enforce relationships between them. By enforcing a parent/child (supertype/subtype) relationship between entities, you are making data capturing, updating, and understanding much easier for those who deal with this information.

Consider an ERD in which the basic customer fields are kept in a parent entity, separate from its child subtypes, New Business and Existing Customer, to ensure clarity, accuracy, and correct data.
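To illustrate the idea in code, here’s a small sketch of a supertype/subtype model using Python dataclasses. The entities and fields (contract_signed_on, previous_plan, and so on) are assumptions made for the example, not a prescribed schema.

```python
# Sketch of a supertype/subtype (parent/child) customer model.
# Entity and field names are illustrative, not a prescribed schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:                      # supertype: fields common to all customers
    customer_id: int
    name: str
    email: str

@dataclass
class NewBusiness(Customer):         # subtype: fields that only apply to new wins
    contract_signed_on: date
    acquisition_channel: str

@dataclass
class ExistingCustomer(Customer):    # subtype: fields that only apply to upgrades
    previous_plan: str
    upgraded_on: date

nb = NewBusiness(1, "Acme Corp", "ops@acme.example", date(2024, 3, 1), "referral")
ec = ExistingCustomer(2, "Globex", "it@globex.example", "basic", date(2024, 5, 10))
```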

Issue#03: Lack of Referential Integrity

Referential integrity means that data records accurately reference their corresponding counterparts in other tables. A lack of this integrity results in misalignment between data tables, which can then lead to a range of serious issues.

Let’s consider the example of a retail company to understand the issues the lack of referential integrity can cause.

A retail company probably stores its sales records in a Sales table, and each record mentions which product was sold when that sale was made. So you would expect to find Sales IDs as well as Product IDs in the Sales table. But if a Sales record refers to a Product ID that doesn’t exist in the Product table, your datasets lack referential integrity.

This can lead your teams to create inaccurate reports, ship incorrect products, or even process orders for non-existent products – resulting in lost time, wasted resources, and, of course, unhappy customers.
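A quick way to surface such violations is to look for Sales rows whose Product ID has no match in the Product table. Here’s a minimal pandas sketch; the table and column names mirror the example above but are assumptions.

```python
# Sketch: find Sales rows whose product_id has no match in the Product table.
# Table and column names follow the example in the text but are assumptions.
import pandas as pd

products = pd.DataFrame({"product_id": [101, 102, 103],
                         "name": ["Desk", "Chair", "Lamp"]})
sales = pd.DataFrame({"sale_id": [1, 2, 3],
                      "product_id": [101, 104, 103]})   # 104 does not exist

# Rows that violate referential integrity (orphaned foreign keys).
orphans = sales[~sales["product_id"].isin(products["product_id"])]
print(orphans)
```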

Issue#04: Lack of Relationship Cardinality

Relationship cardinality defines the limits of relationships between entities in a dataset. Normally, different types of relationships can be created between data objects, depending on how business transactions are allowed to happen at a company. Relationship cardinality essentially sets the rules on how many interactions can occur between two data objects.

For example, in customer data, the relationship or cardinality between different data objects, such as Customer, Purchase, Location, and Product might have the following constraints:

  • One Customer can only have one Location at a time.
  • One Customer can make many Purchases.
  • Many Customers can be from one Location.
  • Many Customers can buy many Products.

If cardinality rules are not well-defined, a number of data quality issues can arise, including confusion in data interpretation, broken reports, and inaccurate transaction records.
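Cardinality rules can also be checked programmatically. The sketch below tests the “one Customer can only have one Location at a time” rule from the list above; the column names and the is_current flag are assumptions made for illustration.

```python
# Sketch: verify a "one Customer has at most one current Location" rule.
# Column names and the is_current flag are illustrative assumptions.
import pandas as pd

customer_locations = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "location":    ["Berlin", "Munich", "Paris", "Madrid"],
    "is_current":  [True, True, True, True],
})

current = customer_locations[customer_locations["is_current"]]
violations = current.groupby("customer_id").size()
violations = violations[violations > 1]
print(violations)   # customer 1 has two "current" locations -> rule broken
```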

Issue#05: Lack of Attribute Uniqueness and Meaning

Data quality issues often stem from poorly managed dataset attributes or columns. Quite often, these problems arise when data models are not explicitly defined, and the resulting information ends up unusable.

Common attribute-related data quality challenges include:

  • Multiple columns with the same name but containing different information for a record.
  • Multiple columns with different names that technically mean the same thing and hence store the same information, bloating your records and introducing redundancy.
  • Ambiguous column titles that confuse data entry operators about what to store in the column.
  • Columns that are always left empty, either because they are deprecated or because there is no source for such information.
  • Columns that are never used and hence are stored unnecessarily.

All these scenarios illustrate poor management of dataset attributes and how they contribute to ongoing data quality problems, reducing the reliability and usefulness of the data.
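Basic column profiling can surface several of these problems automatically. Below is a rough pandas sketch that flags always-empty columns and column pairs storing identical values under different names; the sample columns (email, e_mail, legacy_code) are hypothetical.

```python
# Sketch: profile columns to spot always-empty and redundant attributes.
# The sample columns ("email", "e_mail", "legacy_code") are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "email":       ["a@x.com", "b@y.com", None],
    "e_mail":      ["a@x.com", "b@y.com", None],   # same content, different name
    "legacy_code": [None, None, None],             # never filled
})

# Columns that are entirely empty.
print("Always empty:", [c for c in df.columns if df[c].isna().all()])

# Column pairs whose values are identical (likely redundant attributes).
cols = list(df.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if df[cols[i]].equals(df[cols[j]]):
            print("Possible duplicate attributes:", cols[i], "and", cols[j])
```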

Issue#06: Lack of Validation Constraints

The greatest number of data quality issues result from a lack of validation constraints.

Validation constraints ensure that data values are valid and reasonable, as well as standardized and formatted according to the defined requirements.

Without proper validation, errors can creep into datasets. For example, a lack of validation checks for the Customer Name field could lead to the following errors:

  • Extra spaces in the name (leading, trailing, or double spaces in between).
  • Use of inappropriate symbols and characters.
  • Names that run too long.
  • Single-letter middle initials that are not capitalized or do not end with a period.
  • All letters of the first, middle, and last names being capitalized, rather than only the first letters.

Moreover, some fields may contain incorrect abbreviations and codes, or other values that do not belong to that particular attribute domain.

If validation rules are not defined in your data models and enforced consistently at data entry points, you will end up with a lot of validation errors in your dataset’s most critical and basic fields, such as a customer’s name.
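As a rough illustration, the following Python sketch applies a few of the Customer Name checks listed above. The allowed-character pattern, the length limit, and the formatting rules are assumptions; in practice, validation rules should come from your own data model and be enforced at every entry point.

```python
# Sketch of validation checks for a Customer Name field, loosely matching
# the error types listed above. The pattern and limits are assumptions.
import re

MAX_LENGTH = 60
ALLOWED = re.compile(r"^[A-Za-z][A-Za-z .'\-]*$")

def validate_name(name: str) -> list:
    errors = []
    if name != " ".join(name.split()):
        errors.append("leading/trailing/double spaces")
    if not ALLOWED.match(name.strip()):
        errors.append("inappropriate symbols or characters")
    if len(name) > MAX_LENGTH:
        errors.append("name too long")
    # Single-letter middle initials should be capitalized and end with a period.
    for part in name.split()[1:-1]:
        if len(part.rstrip(".")) == 1 and (not part[0].isupper() or not part.endswith(".")):
            errors.append(f"middle initial '{part}' not formatted as 'X.'")
    return errors

print(validate_name("john  a smith"))
# ['leading/trailing/double spaces', "middle initial 'a' not formatted as 'X.'"]
```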


Issue#07: Lack of Accurate Formulae and Calculations

Many fields in a dataset are derived or calculated from data in other fields. Formulae are designed, implemented, and automatically executed every time new data is entered or updated in the fields they depend on. Any error in a formula or calculation can leave incorrect information across the entire column of the dataset, invalidating that field for any intended purpose.

Examples of fields that are calculated from others include age calculated from birthdays, applicable discount calculated from number of products bought, or any other percentage calculation.

Any errors in formulae that calculate these could make entire data columns unusable for any business analysis, reporting, or operational purposes. Ensuring these calculations are accurate and reliable is essential to maintain data integrity.
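For example, deriving age from a birth date looks trivial, yet a single off-by-one in the formula corrupts the whole column. A minimal sketch of one correct version, with an explicit reference date so the calculation can be tested, might look like this:

```python
# Sketch: derive "age" from a birth date and make it testable with a fixed
# reference date. Function and field names are illustrative.
from datetime import date
from typing import Optional

def age_from_birthdate(birthdate: date, today: Optional[date] = None) -> int:
    today = today or date.today()
    # Subtract one year if this year's birthday hasn't occurred yet.
    return today.year - birthdate.year - (
        (today.month, today.day) < (birthdate.month, birthdate.day)
    )

print(age_from_birthdate(date(1990, 6, 15), today=date(2024, 6, 14)))  # 33
print(age_from_birthdate(date(1990, 6, 15), today=date(2024, 6, 15)))  # 34
```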

Issue#08: Lack of Consistency Across Sources

One of the most common challenges associated with data is maintaining a single definition of the same ‘thing’ across all nodes or data sources.

When organizations use multiple platforms for data management – such as a CRM and a separate billing application – data about the same customer might end up in the databases of both applications. This duplication makes it difficult to create a single, comprehensive, reliable view of data across multiple systems.

The task of maintaining a consistent – or simply, the same – view of customer information across all databases over time is difficult.

Inconsistent data can mess up the reporting across all functions and operations of your enterprise.

What’s important to note here is that data consistency does not only relate to the meanings of data values, but also applies to the way data is represented. For instance, if values are not applicable or are unavailable, all systems should use consistent terms or symbols to indicate missing data.
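One practical step is to normalize the many markers different systems use for “missing” into a single representation before data is compared or merged. The sketch below does this with pandas; the marker list and the sample CRM/billing tables are assumptions for illustration.

```python
# Sketch: normalize the different ways systems mark "missing" values so all
# sources use one consistent representation. Markers and tables are examples.
import numpy as np
import pandas as pd

MISSING_MARKERS = ["", "N/A", "n/a", "NA", "NULL", "null", "-", "unknown"]

crm     = pd.DataFrame({"customer": ["Acme", "Globex"], "phone": ["N/A", "555-0100"]})
billing = pd.DataFrame({"customer": ["Acme", "Globex"], "phone": ["-",   "555-0100"]})

def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace every known missing-value marker with a single NaN."""
    return df.replace(MISSING_MARKERS, np.nan)

crm, billing = normalize_missing(crm), normalize_missing(billing)
print(crm["phone"].isna().equals(billing["phone"].isna()))  # True: now consistent
```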

Issue#09: Lack of Data Completeness

Data completeness refers to the degree or percentage of necessary fields that are filled in a dataset. It can be measured both vertically (at the attribute level) and horizontally (at the record level). Usually, fields are marked mandatory/required to ensure the completeness of a dataset, since not all fields are necessary or equally critical.

To ensure completeness of a dataset, it is essential to categorize fields into three key types:

  • Required Fields: These cannot be left empty (must always be filled). An example of a required field could be a customer’s National ID.
  • Optional Fields: These are not mandatory, like a customer’s hobbies.
  • Context-Specific Fields: These become irrelevant in certain situations or based on the context of the record, such as a “Spouse’s Name” for an unmarried customer.

Issues pertaining to data completeness usually arise when a large number of fields are left blank for a large number of records. But having empty fields doesn’t necessarily mean you have incomplete data. The completeness of a dataset can only be gauged accurately by categorizing every field of the data model. Without defining the importance and necessity of each field, it becomes challenging to assess whether, or to what extent, your dataset is complete.
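Once fields are categorized, completeness can be measured both ways. The sketch below computes attribute-level and record-level completeness over the required fields only; which fields count as required here is an assumption for the example.

```python
# Sketch: measure completeness vertically (per attribute) and horizontally
# (per record), counting only the fields defined as required.
import pandas as pd

df = pd.DataFrame({
    "national_id": ["A1", None, "C3"],            # required
    "email":       ["a@x.com", "b@y.com", None],  # required
    "hobbies":     [None, "golf", None],          # optional
})
REQUIRED = ["national_id", "email"]               # assumed categorization

# Attribute-level completeness: % of records filled, per required column.
print((df[REQUIRED].notna().mean() * 100).round(1))

# Record-level completeness: % of required fields filled, per record.
print((df[REQUIRED].notna().mean(axis=1) * 100).round(1))
```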

Issue#10: Lack of Data Currency

Data ages very fast. Whether a customer changed their residential address, got a new email, or changed their last name due to marriage, such changes impact the currency of your dataset.

A lack of currency in your data means you’re relying on stale information (weeks or months old), which can lead to flawed decisions. To maintain the freshness of data, organizations should implement periodic review cycles or set age limits on key data attributes. This ensures all values are reviewed and updated within a given timeframe.
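A simple way to operationalize such a review cycle is to flag records whose last review date exceeds an age limit. The sketch below uses a hypothetical last_reviewed column and a 180-day threshold; both are assumptions for illustration.

```python
# Sketch: flag records whose last review date exceeds an age limit.
# The 180-day threshold, column names, and reference date are assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id":   [1, 2],
    "last_reviewed": pd.to_datetime(["2023-01-10", "2024-06-01"]),
})

AGE_LIMIT = pd.Timedelta(days=180)
as_of = pd.Timestamp("2024-07-01")   # fixed reference date for a repeatable example

stale = customers[as_of - customers["last_reviewed"] > AGE_LIMIT]
print(stale)   # records due for review and update
```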

Issue#11: Lack of Data Literacy Skills

Despite all the right efforts being made to protect data and its quality across datasets, a lack of data literacy skills in an organization can still cause a lot of damage and introduce significant data quality problems.

Employees may input wrong information because they don’t understand what certain attributes mean. Moreover, they may not be fully aware of the consequences of their actions on larger systems. A lack of this awareness can lead to records being incorrectly updated, resulting in further data degradation.

Such discrepancies can only be eliminated by designing data literacy plans and courses that introduce teams to organizational data and explain:

  • What it contains
  • What each data attribute means
  • What the acceptability criteria for its quality are
  • What the right and wrong ways of entering and manipulating data are
  • What data to use to achieve a given outcome

Issue#12: Mistyping and Other Human Errors

Mistyping and misspellings are among the most common sources of data quality errors. Research suggests that humans make roughly 400 errors per 10,000 data entries. This shows that even with unique identifiers, validation checks, and integrity constraints in place, human error can still intervene and erode data quality over time.

Using Self-Service Data Quality Tools

After reviewing the different types of data quality issues that commonly reside in a dataset, it’s clear that teams struggling to sustain acceptable levels of data quality throughout the organization need the right tools. This is where a comprehensive data quality monitoring solution becomes invaluable. An all-in-one, self-service tool that profiles data, performs various data cleansing activities, matches duplicate records, and outputs a single source of truth will ensure that your business operations are fueled only by accurate, reliable data.

DataMatch Enterprise is one such data quality management tool. It helps data teams rectify errors quickly, improve data accuracy, maintain high data quality, and focus on more important tasks. With the ability to profile, clean, match, merge, and purge millions of records in a matter of minutes, it can save data teams a lot of time and effort that is usually wasted on such tasks.

If you’re looking to elevate your data quality management, download a free trial today or book a demo with our experts to explore how DataMatch Enterprise can help.


Getting Started with DataMatch Enterprise

Download this guide to explore the vast library of features that DME offers and learn how you can achieve optimal results and get the most out of your data with DataMatch Enterprise.

Download

