According to O’Reilly’s report on The state of data quality 2020, 56% of organizations face at least four different types of data quality issues, while 71% face at least three different types. Organizations spend quite a lot of time and resources while designing data quality frameworks and fixing data quality issues. But to get good results, it is important for them to understand the exact nature of these issues and identify how do they end up in the system in the first place.
In this blog, we will look at some general data quality issues that reside in every dataset, and also highlight the common ways in which they can creep up in your database.
Let’s get started.
What is a data quality issue?
A data quality issue refers to the presence of an intolerable defect in a dataset, such that it reduces the reliability and trustworthiness of that data.
Data stored across disparate sources is bound to contain data quality issues. These issues can be introduced into the system due to a number of reasons, such as human error, incorrect data, outdated information, or a lack of data literacy skills in the organization. Since data fuels critical business functions, such issues can cause some serious risks and damage to the company.
The need to leverage quality data across all business processes is quite obvious. Leaders are investing in hiring data quality teams because they want to make people responsible for attaining and sustaining data quality. Moreover, complex data quality frameworks are designed and advanced technology is adopted to ensure fast and accurate data quality management. All these efforts are done in the hopes of making the clean data dream come true.
But none of this can be possible without understanding what is polluting the data in the first place and where exactly it is coming from.
Top 12 data quality issues faced by companies
I recently went through some customer notes and gathered a list of the top 12 data quality issues that are commonly present in a company’s organizational data. Let’s take a look at this list.
Issue#01: Lack of record uniqueness
An average organization with 200-500 employees uses about 123 SaaS applications these days. The vast number and variety of the applications used to capture, manage, store, and use data is the main reason behind poor data quality. And the most common issue that occurs in such situations is that you end up storing multiple records for the same entity.
For example, all interactions that a customer has with your brand during their buying journey are recorded somewhere in a database. These records may be coming from websites, landing page forms, social media advertising, sales records, billing records, marketing records, purchase point records and other such areas. If there’s no systematic way of identifying customer identities and merging new information with existing ones, you can end up with duplicates throughout your datasets.
And to fix duplication, you will have to run advanced data matching algorithms that compare two or more records and calculate the likelihood of them belonging to the same entity.
Issue#02: Lack of relationship constraints
A dataset often references multiple data assets. But when no relationship is defined and enforced amongst two or more distinct data assets, you can end up with a lot of incorrect and incomplete information.
Consider this scenario as an example: Your customer portal contains records for New businesses you won this year as well as Existing Customers that upgraded from last year. Apart from basic customer information, there are definitely some customer fields that are only applicable for a New Business and some that only work for a New Customer. You can handle both scenarios with the same, generalized data model, but it can open doors to a lot of data quality issues, such as missing necessary information, as well as ambiguous or incorrect information in customer records.
To handle such scenarios, you should always create specific data models and enforce relationships between them. By enforcing a parent/child (supertype/subtype) relationship between entities, you are making data capturing, updating, and understanding much easier for those who deal with this information. See the following ERD diagram as an example. The basic Customer fields are kept separate from its child subtypes, that is, New Business and Existing Customer.
Issue#03: Lack of referential integrity
Referential integrity means that data records are true to their referencing counterpart. To understand the issues produced due to a lack of referential integrity, let’s consider the example of a retail company. A retail company probably stores their sales records in a Sales table, and each record mentions which product was sold when that sale was made. So, you probably expect to find Sales IDs as well as Product IDs in the Sales table. But if a Sales record refers to Product IDs that don’t exist in the Product table, it’s obvious that your datasets lack referential integrity.
These issues can lead your teams to create incorrect reports, ship incorrect products, or ship products to customers that don’t exist, and so on.
Issue#04: Lack of relationship cardinality
Relationship cardinality refers to the maximum number of relations two entities can have with one another. Normally, different types of relationships can be created between data objects, depending on how business transactions are allowed to happen at a company.
Consider the following examples to understand cardinality between different data objects, such as Customer, Purchase, Location, and Product:
- One Customer can only have one Location at a time
- One Customer can make many Purchases
- Many Customers can be from one Location
- Many Customers can buy many Products
If cardinality constraints are not well-defined, it can potentially give rise to a number of data quality issues in your dataset.
Issue#05: Lack of attribute uniqueness and meaning
We often find issues related to dataset attributes or columns. Quite a lot of times data models are not explicitly defined and so, the resulting information is deemed unusable. Common issues found are:
- Multiple columns with the same name are present, containing different information for a record.
- Multiple columns with different names are present, which technically mean the same thing, and hence store the same information.
- Column titles are ambiguous and confuse the data entry operant about what to store in the column.
- Some columns are always left empty; either because they are deprecated or there is no source of getting such information.
- Some columns are never used and hence, are unnecessarily being stored.
All these scenarios depict how attributes are poorly managed within a dataset and are increasing the number of data quality issues.
Issue#06: Lack of validation constraints
The greatest number of data quality issues are a result of lack of validation constraints. Validation constraints ensure that data values are valid and reasonable, as well as standardized and formatted according to the defined requirements. For example, lack of validation constraints checks for Customer Name would lead to the following errors:
- Extra spaces in the name (either leading, trailing, or double spaces in between),
- Use of inappropriate symbols and characters,
- The length of name running too long,
- Single lettered middle names are not capitalized or do not end with a period,
- All letters of the first, middle, and last names are capitalized, rather than capitalizing the first letter only.
Moreover, some fields may contain incorrect abbreviations and codes, or other values that do not belong to the attribute domain. If these constraints are not defined in your data models and enforced on data entry points, you will end up with a lot of validation errors in your dataset’s most critical and basic fields, such as a customer’s name.
Issue#07: Lack of accurate formulae and calculations
Many fields in a dataset are derived or calculated from other fields. And so, the formulae are designed, implemented, and automatically executed every time new data is entered or updated in the depending fields. Any error present in the formulae or calculation can lead you to have incorrect information in the entire column of the dataset. This invalidates the field for use in any intended purpose.
Examples of fields that are calculated from others include age calculated from birthdays, applicable discount calculated from number of products bought, or any other percentage calculation.
Issue#08: Lack of consistency across sources
One of the most common challenges associated with data is maintaining one definition about the same ‘thing’ across all nodes or data sources. For example, if a company uses a CRM and a separate billing application, a customer’s record will end up in the databases of both applications. The task of maintaining a consistent – or simply, the same – view of customer information across all databases over time is difficult.
A lack of consistency can mess up the reporting across all functions and operations of your enterprise. Consistency does not only relate to the meanings of data values, but also, their representation; for instance, when values are not applicable or are unavailable, consistent terms must be used to represent the unavailability of data across all sources.
Issue#09: Lack of data completeness
Data completeness refers to necessary fields being present in your dataset. Completeness of a dataset can be calculated vertically (attribute-level) or horizontally (record-level). Usually, fields are marked mandatory/required to ensure completeness of a dataset, since not all fields are necessary.
You will usually find this data quality issue in your dataset where a large number of fields are left blank – for a large number of records. But emptiness doesn’t necessarily mean incompleteness. Completeness of dataset can only be gauged accurately by first categorizing every field of a data model as follows:
- Is the field required? Meaning, it cannot be left empty; for example, National ID of a customer.
- Is the field optional? Meaning, it does not necessarily need to be filled; for example, Hobbies field for a customer.
- Is the field inapplicable in certain cases? Meaning, it becomes irrelevant based on the context of the record, and should be left blank; for example, Spouse’s Name for a non-married customer.
Issue#10: Lack of data currency
Data ages very fast – whether a customer switched their residential address, an email address, or their last name changed due to their marital status. Such changes can impact the currency of your dataset and cause you to have weeks or months old data, leading you to present and base critical decisions on outdated information. To ensure currency of your dataset, you can set reminders to update data, or set limits to an attribute’s age, ensuring all values are subjected to review and update in a given time.
Issue#11: Lack of data literacy skills
Despite all the right efforts being made to protect data and its quality across datasets, a lack of data literacy skills in an organization can still cause a lot of damage to your data. Employees often store wrong information as they don’t understand what certain attributes mean. Moreover, they are unaware of the consequences of their actions, such as what are the implications of updating data in a certain system or for a certain record.
Such discrepancies can only be eliminated by creating and designing data literacy plans and courses that introduce teams to organizational data and explain:
- What it contains,
- What each data attribute means,
- What are the acceptability criteria for its quality,
- What is the wrong and right way for entering/manipulating data?
- What data to use to achieve a given outcome?
Issue#12: Mistyping and other human errors
Mistyping or misspellings are one of the most common sources of data quality errors. Humans are known to make at least 400 errors while doing 10,000 data entries. This shows that even with the presence of unique identifiers, validation checks, and integrity constraints, there is a chance that human error can intervene and make your data quality deteriorate.
Using self-service data quality tools
We just went through the different types of data quality issues that reside in a dataset. Teams struggling to sustain acceptable levels of data quality throughout the organization need the right tools. This is where a data quality management tool can come in handy. An all-in-one, self-service tool that profiles data, performs various data cleansing activities, matches duplicates, and outputs a single source of truth.
DataMatch Enterprise is one such tool that facilitates data teams in rectifying data quality errors with speed and accuracy, and allows them to focus on more important tasks. Data quality teams can profile, clean, match, merge, and purge millions of records in a matter of minutes, and save a lot of time and effort that is usually wasted on such tasks.
Getting Started with DataMatch Enterprise
Download this guide to find out the vast library of features that DME offers and how you can achieve optimal results and get the most out of your data with DataMatch Enterprise.Download