More than 70% of revenue leaders in the 2020 InsideView Alignment Report ranked data management as their highest priority, yet a Harvard Business Review study estimates that only 3% of companies' data meets basic quality standards.
There is a major gap between what companies want in terms of data quality and what they are doing to fix it.
The first step in any data management plan is to test the quality of your data and identify the core issues that lead to poor data quality. Here is a quick checklist to help IT managers, business managers, and decision-makers analyze the quality of their data, along with the tools and frameworks that can help make it accurate and reliable.
What is data quality and why does it matter? Before we delve into the checklist, here's a quick briefing on what data quality is and why it matters.
There is no single definition of data quality, and imposing one would limit the scope of data itself. There are, however, benchmarks that can be used to assess the state of your data. For instance, high-quality data would mean:
You must understand the metrics that will help you measure data quality. These could be as simple as the commonly cited data quality dimensions: accuracy, completeness, consistency, validity, timeliness, and uniqueness. But it is better to make them more specific to your use case. For example, the Date column in a dataset should contain formatted dates only, yet it could still hold values that are effectively garbage because they represent dates too old to be plausible. So you can define your own, more specific meaning of what accurate, complete, consistent, valid, timely, and unique mean for your company.
This is probably the most important information you need before you begin the data quality testing process. Metadata is the information that describes your data. It helps you understand the descriptive and structural definition of each field in your dataset, and hence measure its impact and quality.
Examples of metadata include the data's creation date and time, its purpose, its source, the process used to create it, the creator's name, and so on. Metadata lets you define why a field is captured in your dataset, its purpose, its acceptable value range, the appropriate channel and time for its creation, and so on, and use that information while testing and measuring data quality.
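A field's metadata can be captured as a small structured record and then reused during testing. Below is a minimal sketch; the schema (field names like `purpose`, `source`, `valid_range`) is illustrative, not a standard metadata format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Hypothetical metadata record for a single data field. The attribute
# names are assumptions for illustration, not an established schema.
@dataclass
class FieldMetadata:
    name: str
    purpose: str
    source: str
    created_on: date
    valid_range: Optional[Tuple[float, float]] = None  # (min, max) for numeric fields

    def in_range(self, value) -> bool:
        """Check a value against the acceptable range, if one is defined."""
        if self.valid_range is None:
            return True
        lo, hi = self.valid_range
        return lo <= value <= hi

age_meta = FieldMetadata(
    name="Age",
    purpose="Customer age at signup",
    source="CRM intake form",
    created_on=date(2020, 1, 15),
    valid_range=(0, 120),
)

print(age_meta.in_range(34))   # a plausible age passes
print(age_meta.in_range(-5))   # a negative age fails the range check
```

Because the acceptable range lives alongside the field's description, any test that reads the metadata stays in sync when the definition changes.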
Since data is captured from the real world, we can quickly validate its accuracy by comparing it against known truths. For example: does the Age column contain any negative values? Are required Name fields set to null? Do Address field values represent real addresses? Does the Date column contain correctly formatted dates?
This level of testing can be performed by generating a quick data profile of your dataset. It is a simple compare-and-label test: your dataset's values are compared against your defined validations and some known correct values, and classified as valid or invalid. Although it can be done manually, you can also use an automated tool that runs a quick profile test and shows you where your data stands against the defined validation rules.
But keep in mind that this level only tests the data itself, and not the metadata.
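The compare-and-label idea above can be sketched in a few lines. The rules and sample rows here are made up for illustration; a real profile would be driven by your own validation definitions.

```python
import re

# Illustrative validation rules, one predicate per column.
rules = {
    "Age":  lambda v: isinstance(v, int) and 0 <= v <= 120,
    "Name": lambda v: isinstance(v, str) and v.strip() != "",
    "Date": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v))),
}

# Made-up sample rows; the second row violates all three rules.
rows = [
    {"Age": 34, "Name": "Alice", "Date": "2021-03-04"},
    {"Age": -2, "Name": "",      "Date": "04/03/2021"},
]

def profile(rows, rules):
    """Label every value as valid/invalid and tally failures per field."""
    failures = {col: 0 for col in rules}
    for row in rows:
        for col, check in rules.items():
            if not check(row[col]):
                failures[col] += 1
    return failures

print(profile(rows, rules))  # {'Age': 1, 'Name': 1, 'Date': 1}
```

A profiling tool does essentially this at scale, then summarizes the failure counts per column as a data quality report.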
The next level of testing means computing the statistical distribution of each data attribute and validating that all values follow that distribution. This allows you to continuously check that the nature of new, incoming data matches the data already residing in your dataset.
Furthermore, for this type of testing, you can determine the median and average of each distribution and set minimum and maximum thresholds. For every new entry, you can compute the probability that the value belongs to the distribution; if that probability is high enough (roughly 95% or more), you can conclude that the data is valid and accurate.
You can also use an attribute's metadata to compute a distribution and test incoming data against it. For example, the Name field usually contains 7–15 characters. If a new Name entry has only 2 characters, it can be flagged as a potential error, since the new value does not conform to the expected distribution.
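A simple way to implement this check is to fit the mean and standard deviation of the observed values and flag new entries that fall outside roughly two standard deviations (which covers about 95% of a normal distribution). The name lengths below are invented sample data, and the two-sigma threshold is an assumption.

```python
import statistics

# Made-up historical name lengths used to fit the distribution.
name_lengths = [7, 9, 11, 8, 10, 12, 9, 13, 8, 11]

mean = statistics.mean(name_lengths)
stdev = statistics.stdev(name_lengths)

def conforms(value, mean, stdev, z=2.0):
    """True if value lies within z standard deviations of the mean."""
    return abs(value - mean) <= z * stdev

print(conforms(10, mean, stdev))  # a typical name length passes
print(conforms(2, mean, stdev))   # a 2-character name is flagged
```

The same pattern extends to any numeric attribute or metadata-derived statistic, such as field length or value frequency.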
This level means performing a holistic analysis to qualify the uniqueness of each record in your dataset. You need to go row by row and verify that every record represents a uniquely identifiable entity and that no duplicates are present. This is a more complex form of testing, since it can be difficult to assess a record's uniqueness in the absence of a unique key. For this purpose, advanced algorithms apply fuzzy matching techniques to determine probabilistic matches.
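To make the idea concrete, here is a minimal sketch of probabilistic duplicate detection using a plain string-similarity ratio from the standard library. Production fuzzy matching uses far more sophisticated algorithms; the 0.85 threshold and sample records are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Invented sample records; the first two describe the same person
# with spelling and abbreviation differences.
records = [
    "Jonathan Smith, 12 Oak St",
    "Jonathon Smith, 12 Oak Street",
    "Maria Lopez, 4 Elm Ave",
]

def find_duplicates(records, threshold=0.85):
    """Return (i, j, score) for record pairs above the similarity threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = SequenceMatcher(
                None, records[i].lower(), records[j].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs

print(find_duplicates(records))  # flags the first two records as a likely match
```

The pairwise comparison is quadratic in the number of records, which is why real tools add blocking or indexing steps before scoring candidate pairs.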
Level 3 testing is the same as level 2, but historical records, not just the current dataset, are used when computing row matches and field distributions. This ensures that changes in the data over time are also considered while validating values.
For example, yearly sales are expected to spike at the end of the year due to the holidays and to be comparatively slower in the seasons leading up to it, so you can draw incorrect conclusions about your data if you don't take time into account. At this level, you can also run tests to detect anomalies in your data by looking at the history of values in an attribute and classifying current values as normal or abnormal.
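The seasonal-sales example can be sketched as a history-aware anomaly check: compare the current month's value against the same month in previous years, so a December spike is judged against past Decembers rather than against November. All sales figures below are invented sample data.

```python
import statistics

# Invented monthly sales by year: index 0 is January, index 11 is December.
monthly_sales = {
    2018: [100, 95, 110, 105, 120, 115, 118, 122, 130, 140, 160, 210],
    2019: [105, 100, 112, 110, 125, 119, 121, 128, 135, 148, 168, 220],
    2020: [108, 103, 115, 112, 127, 123, 125, 131, 139, 151, 172, 226],
}

def is_anomalous(value, month_index, history, z=2.0):
    """Flag value if it deviates > z stdevs from past values of that month."""
    past = [year_values[month_index] for year_values in history.values()]
    mean, stdev = statistics.mean(past), statistics.stdev(past)
    return abs(value - mean) > z * stdev

# A December figure of 230 fits the usual holiday spike, while the
# same figure would be abnormal for June.
print(is_anomalous(230, 11, monthly_sales))  # False
print(is_anomalous(230, 5, monthly_sales))   # True
```

Without the month-by-month comparison, the December value would look like an outlier against the yearly average even though it is perfectly normal for the season.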
As data quality challenges grow more complex, modern problems require modern solutions. Data scientists and data analysts reportedly spend 80% of their time testing data quality and only 20% extracting business insights. Automated data quality testing tools use advanced algorithms to free you from the manual labor of testing datasets, and from maintaining coded solutions as data quality definitions evolve over time.
These tools are designed to be self-service and user-friendly so that anyone – business users, data analysts, IT managers – can generate quick data profiles as well as perform in-depth analysis of data quality through proprietary data matching techniques.
Normally, these tools offer one of two types of testing engines; very few provide both. Let's take a look at them.
Rules-based testing tools allow you to configure rules for validating datasets against your custom-defined data quality requirements. You can define rules for different dimensions of a data field, such as its length, allowed formats and data types, acceptable value ranges, required patterns, and so on. These tools quickly profile your data against the configured rules and produce a concise data quality summary report covering the results of the test.
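Such a rules configuration can be pictured as a declarative mapping from field names to constraints. The sketch below uses a made-up rule schema (`length`, `pattern`, `range` keys); it is not tied to any particular product.

```python
import re

# Hypothetical rules configuration, one entry per field.
field_rules = {
    "Phone": {"pattern": r"\d{3}-\d{3}-\d{4}"},
    "Zip":   {"length": 5, "pattern": r"\d{5}"},
    "Age":   {"range": (0, 120)},
}

def validate(field, value):
    """Apply every configured rule for a field; return the list of failed rules."""
    failures = []
    rules = field_rules.get(field, {})
    if "length" in rules and len(str(value)) != rules["length"]:
        failures.append("length")
    if "pattern" in rules and not re.fullmatch(rules["pattern"], str(value)):
        failures.append("pattern")
    if "range" in rules:
        lo, hi = rules["range"]
        if not lo <= value <= hi:
            failures.append("range")
    return failures

print(validate("Phone", "555-867-5309"))  # []
print(validate("Zip", "1234"))            # ['length', 'pattern']
print(validate("Age", 150))               # ['range']
```

Keeping the rules as data rather than code is what lets non-technical users adjust quality definitions without touching the validation engine.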
As new data enters your ecosystem, the overall quality of your data deteriorates. This is why you need to implement data quality checks at the data entry or data integration level. You want to make sure that new data introduced into the system is accurate and unique, and is not a duplicate of any entity already residing in your master record.
Most companies don't run data quality tests until they become critical for a data migration or a merger, but by then it is too late to salvage the problems caused by poor data. Test your data quality, define the criteria, and set benchmarks to drive improvement.
Luckily, you no longer have to test your data manually; most ML-based data quality testing solutions today let businesses do it in a few easy steps. The choice is between 2 minutes and 12 hours, and it doesn't have to be daunting. Best-in-class solutions like DataMatch Enterprise offer free trials you can take advantage of. All you have to do is plug in your data source and let the software guide you through the process. You'll be surprised how many hours of manual effort an automated solution saves your team while also delivering more accurate results than manual methods.