Real-world data will always have incomplete or missing values, especially it is gathered from several sources. Incomplete data can result in flawed reports and skewed conclusions in the research sector. For businesses, it translates to poor customer insights, inaccurate business intelligence and the loss of ROI.
Data completeness, therefore, is an essential component of the data quality framework and is closely related to validity and accuracy. If the data is missing, the information cannot be validated and if it’s not validated, it cannot be considered accurate.
Whether you’re a researcher with survey data, a business professional with CRM data, or an IT professional with organizational data, you must be able to identify missing and incomplete data values to determine the next course of action.
In this brief post, I’ll cover key concepts as:
- What does it mean for data to be missing
- Types of missing data
- How do you identify missing data
- The meaning of data completeness
- Types of data completeness
- How to measure completeness
- How DataMatch Enterprise is more effective than Python to identify missing data
What Does it Mean for Data to be Missing?
Missing data refer to rows or columns that have null, blank or incomplete values.
- Missing last names, phone numbers, email addresses in a CRM
- Missing age and employment years in an observational data set
- Missing income figures of employees in organizational data
The causes of missing data are plenty, but can be summarized to three common reasons:
- People’s unwillingness to provide information (such as income figures, sexual orientation etc)
- Data entry errors that are the result of poor data standards (web forms that do not have mandatory fields)
- Fields that are not relevant to the target audience (for instance a column requiring company name will be left empty if the majority of responders are retired)
- For researchers engaging in longitudinal studies, attrition rates (participants dropping out of the study) is also a leading cause of missing data.
In statistical research models, these causes result in four types of missing data that researchers must identify before attempting to make fixes. While statistical missing data types is not within the scope of this piece, I’ll still give you a basic run-through.
Note: For those interested in a deeper understanding of the types of missing data in healthcare research and longitudinal studies, the article, ‘Strategies for handling missing data in longitudinal studies,’ by the Journal of Statistical Computation and Simulation is a great resource to dig into.
Types of Missing Data:
In statistical research, missing data is categorized into four types:
Structural: Data that is missing because it should not logically exist.
Example: People who don’t have kids leave the [No.of Children] column empty. These people can be structurally excluded from the conclusion without impacting the outcome of a study because logically, they don’t need to fill that field.
Missing completely at random (MCAR): The columns with missing data do not have an interdependency.
Example: missing age values are not related to a student’s GPA scores in a study measuring K-12 students’ academic performance. Researchers can assume a median age (for example 16 to 18 for K-12) and move on with concluding a report without affecting is credibility.
Missing at random (MAR): As opposed to MCAR, where there is no connection between missing values and the study subject, MAR allows us to predict a pattern by using other information about the person.
Example: Referring to the example above, age is not related to GPA scores, but it does impact employment. Researchers can conclude that students above age 14 may have lower academic scores because of a part-time job that affects their study hours and mental capacity.
Missing not at random (MNAR): Slightly confusing, MNAR refers to data that is purposefully missing – meaning, people refusing to answer. For instance, in the study above, most 14-year-olds *may* refuse to fill their actual work hours. MNAR data is highly problematic because you can’t conclude or assume a specific reason and neither can you use standard missing data handling methods to extract a conclusive answer.
In the business world, the nature of missing or incomplete data is different. While you don’t have to deal with MAR, MNAR issues, you do have to deal with missing contact information (phone numbers, last names, emails, addresses, Zip codes etc).
How is Missing Data Identified?
Researchers working with statistical programs like SAS, SPSS, Stata have to use manual statistical procedures to identify, delete, and replace missing values. But there’s a problem.
Most of these programs automatically remove missing values from any analysis you run, which is why different variables have different amounts of missing data. Thus, researchers have to check a data set for missing data before they can decide what to remove. This process is manual and demands specific platform knowledge in addition to mastering Python – the language used to code algorithms to detect missing data.
Figuring out missing data is just the first step of the problem. You will have to perform additional coding to replace missing values with imputations (that is using mean, median to replace missing values). You will have to manually match columns, write codes, and keep reiterating the process until the desired outcome is reached. In an age when speed is the norm, you cannot afford to spend months identifying missing values and resolving them.
The same problem is applicable to business environments too. CRMs were meant to lighten the data management burden, but problems with missing data remains a significant challenge. Users have to manually identify missing values and clean them by exporting the data to an XLS format to make bulk changes. If it’s just basic contact information missing, it can be fixed with Excel, but what if the nature of missing data is beyond contact info? What if it’s firmographic or psychographic data?
How do you discover incomplete company names (such as BOSE [a brand] vs BOSE Corporation [a company]). Not so easy to do this manually!
The most important question to answer – how do you limit missing data problems and ensure data completeness!
What is Data Completeness and How Do You Measure It?
In the data quality framework, data completeness refers to the degree to which all data in a data set is available. A measure of data completeness is the percentage of missing data entries.
For instance, a column of 500 with 100 missing fields has a completeness degree of 80%. Depending on your business, 20% of missing entries can translate into losing hundreds of thousands of dollars in prospects and leads!
That said, data completeness is not about ensuring 100% of your fields are complete. It’s about determining what items of information are critical and what are optional. For instance, you would definitely need phone numbers, but you may not need fax numbers.
Completeness indicates the level of information and knowledge you have about your customer and how accurate this information is. For example, in contact data, children and the elderly may not have email addresses or some contacts may not have landline or work numbers.
Data completeness, therefore, does not imply that all data attributes must be present or populated, rather you will have to classify and choose which data sets are important to keep and which can be ignored.
How is Data Completeness Evaluated?
Traditionally, in the data warehouse, data completeness is evaluated through ETL testing that uses aggregate functions like (sum, max, min, count) to evaluate the average completeness of a column or record. Additionally, data profile validation is also performed through manual instructions using commands like:
to compare distinct values and the count of rows for each distinct value. Prior to running commands though, the user will need to determine the type of incompleteness they are dealing with and establish the kind of data quality problem affecting the data. For instance, if all phone numbers are missing city codes, it might be a data quality problem at the entry level. If over 50% of the audience are not providing an answer to a specific problem, it might be a MNAR (missing not at random) problem.
All this is great. But there’s an important step all users miss:
You cannot clean, remove or replace missing values if you don’t know what’s missing and how many percent of this missing data is tolerable.
And this is done via data profiling!
Data Profiling – The First Step in Checking for Data Completeness & Identifying Missing Values at Attribute Level
Data profiling is the process of evaluating your data set to identify a range of problems including:
- Missing values and records
- Data errors such as typos
- Abnormalities at attribute level (use of punctuation marks or negative spaces in fields)
- Standardization problems (inconsistency in formats NYC vs New York)
Data profiling is an exhaustive task.
Ideally, you want to be able to take a glance at customer data and easily or quickly identify the null values in the data. Now you could do this manually by reviewing each column independently or by running algorithms.
Python empowers the user to sort their data based on their perception and understanding. While this is ideal in terms of control and customization, it is not ideal if you’re pressed for time and accuracy.
You will need to consistently create and review codes. You will have to keep testing, measuring and analyzing results. Not only is this counter-productive, it beats the purpose of using data to achieve real-time goals.
Can your marketing teams wait 4 months for you to clean, dedupe and consolidate customer data from multiple sources?
Can your organization afford to lose ROI with every passing day that results in ROI loss? Not really.
Using an ML-Based Solution like DataMatch Enterprise vs Python Scripting or ETL Tools
Python scripting and ETL tools are ideal because it’s in your control. Using an automated tool does not mean you give up control or ditch the usefulness of these platforms. It simply means you’re speedening up the process and automating where needed.
A machine-learning enabled solution like Data Ladder’s DataMatch Enterprise – a data quality management solution that allows you to identify missing values and incomplete information without the need for codes or scripts.
With DME’s data profiling function, you can check the percentage of data completeness at attribute level (complex to do via ETL or coding).
For example, if 30 out of 100 records have missing values, they are automatically highlighted for you to review. You’re also given a health score for each column. In this case, your Last Name column has a health score of 70%. All this is done within just a matter of 10 minutes. You are saving up hours of manual reviews, testing of algorithms and writing/rewriting of definitions to just check your data for accuracy or completeness.
That’s not all.
Sometimes you’re also required to consolidate data and deduplicate redundant data. For instance, you discover that your customer’s information is stored in two separate data sources. One source contains everything you need – phone numbers, email addresses, location etc. The other source contains information you don’t necessarily need – age, gender, fax numbers etc. You want to combine the two, and either merge or purge this information to ensure completeness. Imagine the time and the processes you’ll have to go through to achieve this? The mere thought is exhausting.
The True Goal of Data Completeness
It’s easy to ignore missing values and proceed to compiling reports or performing activities knowing that you have missing data. The end result is tasks that deliver poor responses (such as an email marketing campaign that ignored missing last names and had to struggle with duplicates), reports that have misleading conclusions impacting policies and critical reforms, business plans that fail and errors that have legal implications.
All this said, the true goal of data completeness therefore, isn’t to have perfect, 100% data. It’s to ensure that the data essential to your purpose is valid, complete, accurate and usable. The tools you have at your disposal, such as DME, is technology that will help you get there.