Blog

What is Data Profiling: Scope, Techniques, and Challenges

Q: Scope of data profiling – Is it just data quality testing?

Data profiling is a systematic process that implements a number of algorithms that analyze and assess empirical details of a dataset, and output a summarized view of data structure and its values.

Written by Ehsan Elahi
November 22, 2021

It’s no secret that data is the currency that drives business growth and innovation today. Yet, despite billions spent on data management initiatives, more than 50% to 70% of the data organizations collect remains unutilized. This disconnect often results in an inability to capitalize on the full value of data assets.

This is where data profiling can be a game changer.

What is Data Profiling?

Data profiling is the process of uncovering hidden details about the structure, contents, and relationships of your datasets.

The use of these uncovered details depends on what you are trying to achieve with your data. For example, if your goal is to improve data quality, the data profiling process helps to identify data quality issues such as duplicates, inconsistencies, or inaccuracies. It allows you to assess how well your data aligns with key data quality dimensions like accuracy, completeness, and consistency and identify potential data cleansing and optimization opportunities.

Scope of Data Profiling – Is It Just Data Quality Testing?

Data profiling is mostly seen as just a requirement for ensuring data quality; when in reality, its application and usage extend far beyond that. At its core, data profiling is a systematic process that uses algorithms to analyze and assess empirical details of datasets, and output a summarized view of their patterns, structure, and value distributions. This information can then be used for multiple purposes – for example, highlighting potential to identify data quality issues, deciding areas of improvement, mapping to another data profile for a merging project, etc.

Let’s explore some basic contexts where data profiling is extensively used:

1. Reverse engineering data to find out missing metadata

In many cases where data is being captured since a long period of time, the metadata is usually missing or is incomplete. Metadata is critical because it provides details about each attribute of a dataset, such as its:

Definition: The purpose and meaning of the attribute being stored.
Data Type: The type of data (format) it is allowed to hold; e.g., string, number, date, etc.
Size: The maximum or minimum number of characters it can hold.
Domain: The space from which it derives its values; e.g., values of the column. For example, a country can only be derived from a list of actual countries existing in the world.

When metadata is missing, data profiling can reverse engineer the data by analyzing its values to infer the missing information, so that it can be used for other activities such as building enterprise data model, planning data migrations, renovating business processes, etc.

2. Analyzing anomalies

Before data can be used for any purpose, it must be confirmed that it is free from anomalies, otherwise it will create bias in analysis and skew results. Data profiling helps to perform this data quality analysis. It statistically analyzes a dataset and identify a range of values that fall within the acceptable range, and detect any outliers that may be present. Statistical analysis of a dataset examines the frequency distributions, variant values, percent of absent values, as well as relationships between columns of the same and different datasets.

3. Discovering implicit data rules

In the way organizations capture, store, and manipulate data, a library of data rules is implemented to ensure compliance to data standards. Sometimes these data quality rules are pretty obvious and intentional, but at other times, these rules can be completely unintentional and implicit within a business’s logic and processes.

Examples of such rules include integrity constraints or relational dependencies between attributes. These rules often go unnoticed but can significantly impact data validation and data analytics efforts if left unmanaged.

Data profiling can help you to extract these hidden rules, along with null values and embedded value dependencies that may otherwise go unnoticed. This process can also involve inter-table analysis in a data warehouse to discover hidden relationships between datasets and ensures that existing data adheres to business logic. This structure discovery process is critical in validating the integrity of data before using it in key initiatives such as data migrations, data rule validation, or advanced analytics. By identifying and formalizing these hidden rules, organizations can integrate them into data governance frameworks and improve the accuracy of their data lifecycle management.

Three Levels of Data Profiling

The process of data profiling happens in three levels. Depending on how the profiling output needs to be used, you can execute profiling at only one or a combination of levels. The computation complexity increases as the level increases (more on this in the next section).

At the first and initial level, a single column is analyzed by executing various statistical techniques. At the next level, the analysis of relationships between multiple columns within the same dataset takes place. And finally, at the third level, we analyze the relationships that exist between columns of different datasets or tables.

Let’s cover each level in more detail.

1. Column profiling

Column profiling assesses different characteristics that a column’s values represent and outputs insights about how it is structured – in terms of metadata as well as content. While profiling a column, frequency, statistical, and descriptive analysis is performed.

a. Frequency analysis

This relates to a number of techniques related to count and distribution of values in a column, such as:

Range analysis: assesses whether values of a column can be subjected to ordering, and if there is a well-defined range (minimum and maximum values) within which all values can be mapped.
Null analysis: logs the percentage of values that are null (empty) in the column.
Distinct count analysis: counts the number of distinct values that occur in the column.
Value distribution analysis: assesses how the values of a column are distributed within the defined range.
Uniqueness analysis: labels whether a value in a column occurs only once (is unique) or not.

b. Statistical analysis

This analysis is usually performed for numeric columns or the ones related to timestamps. It gives insights into an aggregated or summarized view of the column, such as:

Min/max value: identifies the minimum and maximum value of the column by ordering all values.
Mean: calculates the average value of the column.
Median: selects the middle value of the ordered column set.
Standard deviation: calculates the variation present within the value set of the column.

c. Descriptive analysis

Finally, descriptive analysis goes in more detail about the column’s content, instead of focusing on its structure and distribution. It involves:

Data type analysis: determines the data type and max size of character count it holds; e.g., string, number, date, etc.
Custom data type analysis: semantically analyzes values to see if an abstract or custom data type exists for the column; e.g., Address, or Phone Number, etc.
Pattern analysis: uncovers hidden patterns or formats used in column values.
Domain analysis: maps out the space from which values of the column are derived; e.g., values of the column Country can only be derived from a list of actual countries existing in the world.

2. Cross-column profiling

This type of analysis identifies dependencies or relationships that are present between multiple columns. Since it involves a larger amount of data, it is more resource intensive.

a. Primary key analysis

A primary key uniquely identifies every entity present within a dataset. For example, a column Social Security Number for a customer dataset uniquely identifies each customer; similarly, the column Product Manufacturer Number for products dataset uniquely identifies each product, and so on.

Oftentimes, datasets do not contain these uniquely identifying attributes or they are present, but most of its values are missing. In such cases, a combination of columns is selected and their values are examined to determine potential primary keys that uniquely identify each record.

b. Dependency analysis

This type of analysis identifies functional dependencies between multiple columns. These relationships are usually embedded within the attribute content. For example, there’s a relation between the two columns City and Country. If two rows within a dataset have the same City, then their corresponding Country values must be same as well.

This type of data profiling helps you to document all such relationships present within your dataset – either generic or specific to your organizational processes.

3. Cross-table profiling

The final level of data profiling is the most computationally complex, since it involves analyzing multiple columns across multiple tables. This is done to determine the relationships that may exist cross-tables, as well as how well these relationships are being maintained. It includes following techniques:

a. Foreign key analysis

During cross-table profiling, foreign keys are analyzed to understand how a column of one table is relating its records to another table. For example, an enterprise may save personal information of their employees in one table, and their employment details in another table. So, a foreign key must be present in the employee table that relates the job role of each individual to the list of available job roles and other related information, such as department, compensation details, and so on.

b. Orphaned records analysis

This analysis examines whether a foreign key relationship is being violated. Extending the previous example, the violation may happen when an employee’s personal record identifies their employment role using a foreign key that is not present in the job role table.

During cross-table profiling, all such orphaned records are determined so that the missing data can be updated and completed.

c. Duplicate columns

Many times, the same information is stored in multiple columns across multiple tables. Alternatively, different information is stored in multiple columns that are named the same. These similarities/differences are analyzed in columns across tables by evaluating column values and their intersections.

Challenges Encountered During Data Profiling

While data profiling is a significant consideration in any data-focused initiative, it could easily get out of hand depending on the scope and size of the analysis process. Following are some most commonly encountered challenges during data profiling:

1. System performance

The process of data profiling is computationally intensive as it involves large amount of column comparisons – within, between, and across tables as well. This requires a huge number of computational resources such as memory and disk space, as well as more time to complete and build output results. Thus, employing a system that can support complex computations is a serious challenge.

2. Limiting scope of results

Since data profile reports are generated by summarizing and aggregating data values, there must be a threshold that defines the level of summarization to be implemented. This helps in getting more meaningful and focused results.

For example, you may not want to know the values that only appeared once or twice in a column, but if it appeared more than ten times, it might add value to the summarization, and thus, should be included. So, the ability to limit or conditionalize what goes in and what does not into the final profile report is a challenging decision to make.

3. Deriving value from profiled reports

Analyzing datasets to understand their structure and content formation is only one side of the story. The generated data profiles must be analyzed to understand the next line of action. Experienced data professionals must be involved that can examine the reports and explain why data is the way it is, and what can be done to transform it as needed.

4. Self-service data profiling tools

Considering how computationally complex data profiling can get, it is a process usually expected to be performed by tech- or data-savvy professionals. The unavailability of self-service data profiling software tools is a common challenge faced.

A self-service data profiling tool that can output quick 360-view of data and identify basic anomalies, such as blank values, field data types, recurring patterns, and other descriptive statistics is a basic requirement for any data-driven initiative. Data Ladder’s DataMatch Enterprise is a fully powered data quality solution that offers data profiling as the first of many steps in correcting, optimizing and refining your data.

To know more about how our solution can help solve your data quality problems, sign up for a free trial today or set up a demo with one of our experts.

Try data matching today

No credit card required

"*" indicates required fields

Want to know more?

Check out DME resources

Oops! We could not locate your form.

BY FEATURE

BY USE CASE

BY INDUSTRY

OUR PRODUCTS

ABOUT US

CUSTOMERS

INSIGHTS

SUPPORT