Today, enterprises highly depend on data for growing their businesses and scaling their goals and expectations. Huge efforts are being invested in devising the perfect data strategy roadmap and using information as the most critical business asset; still, the results are not satisfying. An organization’s inability to understand their data well enough and then mapping it correctly to expected results is the most common reason behind failed data-driven initiatives.
This is where data profiling plays a big role.
What is data profiling?
The use of these uncovered details depends on what you are trying to achieve with your data. For example, if you want to improve data quality, then a data profile helps to identify potential data cleansing opportunities and assess how well your data is being maintained against data quality dimensions.
Scope of data profiling – Is it just data quality testing?
Data profiling is mostly seen as just a requirement for ensuring data quality; when in reality, its application and usage is far more than that. Data profiling is a systematic process that implements a number of algorithms that analyze and assess empirical details of a dataset, and output a summarized view of data structure and its values. This information can then be used for any intended purpose – for example, highlighting potential data quality issues, deciding areas of improvement, mapping to another data profile for a merging project, etc.
Let’s cover some basic contexts where data profiling is extensively used:
1. Reverse engineering data to find out missing metadata
In many cases where data is being captured since a long period of time, the metadata is usually missing or is incomplete. Metadata represents details about each attribute of a dataset, such as its:
- Definition: the purpose of attribute being stored,
- Data type: the type of data it is allowed to hold; e.g., string, number, date, etc.,
- Size: the maximum or minimum number of characters it can hold,
- Domain: the space from which it derives its values; e.g., values of the column Country can only be derived from a list of actual countries existing in the world.
In absence of such information, an attribute’s values are reverse engineered – analyzed and summarized – to uncover the missing metadata, so that it can be used for other activities such as building enterprise data model, planning data migrations, renovating business processes, etc.
2. Analyzing anomalies
Before data can be used for any purpose, it must be confirmed that it is free from anomalies, otherwise it will produce bias in the analyzed results. Data profiling helps to statistically analyze a dataset and identify a range of values that fall within the acceptable range, and detect any outliers that may be present. Statistical analysis of a dataset examines the frequency distributions, variant values, percent of absent values, as well as relationships between columns of the same and different datasets.
3. Discovering implicit data rules
In the way data is captured, stored, and manipulated, a library of data rules is implemented to ensure compliance to data standards. Sometimes these rules are pretty obvious and intentional, but at other times, these rules can be completely unintentional and implicit within a business’s logic and processes.
Examples of such rules include integrity constraints or relational dependencies between attributes. A data profile can help you to extract hidden rules so that they can intentionally be integrated within the data lifecycle.
Three levels of data profiling
The process of data profiling happens in three levels. Depending on how the profiling output needs to be used, you can execute profiling at only one or a combination of levels. The computation complexity increases as the level increases (more on this in the next section).
At the first and initial level, a single column is analyzed by executing various statistical techniques. At the next level, the analysis of relationships between multiple columns within the same dataset takes place. And finally, at the third level, we analyze the relationships that exist between columns of different datasets or tables.
Let’s cover each level in more detail.
1. Column profiling
Column profiling assesses different characteristics that a column’s values represent and outputs insights about how it is structured – in terms of metadata as well as content. While profiling a column, frequency, statistical, and descriptive analysis is performed.
a. Frequency analysis
This relates to a number of techniques related to count and distribution of values in a column, such as:
- Range analysis: assesses whether values of a column can be subjected to ordering, and if there is a well-defined range (minimum and maximum values) within which all values can be mapped.
- Null analysis: logs the percentage of values that are null (empty) in the column.
- Distinct count analysis: counts the number of distinct values that occur in the column.
- Value distribution analysis: assesses how the values of a column are distributed within the defined range.
- Uniqueness analysis: labels whether a value in a column occurs only once (is unique) or not.
b. Statistical analysis
This analysis is usually performed for numeric columns or the ones related to timestamps. It gives insights into an aggregated or summarized view of the column, such as:
- Min/max value: identifies the minimum and maximum value of the column by ordering all values.
- Mean: calculates the average value of the column.
- Median: selects the middle value of the ordered column set.
- Standard deviation: calculates the variation present within the value set of the column.
c. Descriptive analysis
Finally, descriptive analysis goes in more detail about the column’s content, instead of focusing on its structure and distribution. It involves:
- Data type analysis: determines the data type and max size of character count it holds; e.g., string, number, date, etc.
- Custom data type analysis: semantically analyzes values to see if an abstract or custom data type exists for the column; e.g., Address, or Phone Number, etc.
- Pattern analysis: uncovers hidden patterns or formats used in column values.
- Domain analysis: maps out the space from which values of the column are derived; e.g., values of the column Country can only be derived from a list of actual countries existing in the world.
2. Cross-column profiling
This type of analysis identifies dependencies or relationships that are present between multiple columns. Since it involves a larger amount of data, it is more resource intensive.
a. Primary key analysis
A primary key uniquely identifies every entity present within a dataset. For example, a column Social Security Number for a customer dataset uniquely identifies each customer; similarly, the column Product Manufacturer Number for products dataset uniquely identifies each product, and so on.
Oftentimes, datasets do not contain these uniquely identifying attributes or they are present, but most of its values are missing. In such cases, a combination of columns is selected and their values are examined to determine potential primary keys that uniquely identify each record.
b. Dependency analysis
This type of analysis identifies functional dependencies between multiple columns. These relationships are usually embedded within the attribute content. For example, there’s a relation between the two columns City and Country. If two rows within a dataset have the same City, then their corresponding Country values must be same as well.
This type of data profiling helps you to document all such relationships present within your dataset – either generic or specific to your organizational processes.
3. Cross-table profiling
The final level of data profiling is the most computationally complex, since it involves analyzing multiple columns across multiple tables. This is done to determine the relationships that may exist cross-tables, as well as how well these relationships are being maintained. It includes following techniques:
a. Foreign key analysis
During cross-table profiling, foreign keys are analyzed to understand how a column of one table is relating its records to another table. For example, an enterprise may save personal information of their employees in one table, and their employment details in another table. So, a foreign key must be present in the employee table that relates the job role of each individual to the list of available job roles and other related information, such as department, compensation details, and so on.
b. Orphaned records analysis
This analysis examines whether a foreign key relationship is being violated. Extending the previous example, the violation may happen when an employee’s personal record identifies their employment role using a foreign key that is not present in the job role table.
During cross-table profiling, all such orphaned records are determined so that the missing data can be updated and completed.
c. Duplicate columns
Many times, the same information is stored in multiple columns across multiple tables. Alternatively, different information is stored in multiple columns that are named the same. These similarities/differences are analyzed in columns across tables by evaluating column values and their intersections.
Challenges encountered during data profiling
While data profiling is a significant consideration in any data-focused initiative, it could easily get out of hand depending on the scope and size of the analysis process. Following are some most commonly encountered challenges during data profiling:
1. System performance
The process of data profiling is computationally intensive as it involves large amount of column comparisons – within, between, and across tables as well. This requires a huge number of computational resources such as memory and disk space, as well as more time to complete and build output results. Thus, employing a system that can support complex computations is a serious challenge.
2. Limiting scope of results
Since data profile reports are generated by summarizing and aggregating data values, there must be a threshold that defines the level of summarization to be implemented. This helps in getting more meaningful and focused results.
For example, you may not want to know the values that only appeared once or twice in a column, but if it appeared more than ten times, it might add value to the summarization, and thus, should be included. So, the ability to limit or conditionalize what goes in and what does not into the final profile report is a challenging decision to make.
3. Deriving value from profiled reports
Analyzing datasets to understand their structure and content formation is only one side of the story. The generated data profiles must be analyzed to understand the next line of action. Experienced data professionals must be involved that can examine the reports and explain why data is the way it is, and what can be done to transform it as needed.
4. Self-service data profiling tools
Considering how computationally complex data profiling can get, it is a process usually expected to be performed by tech- or data-savvy professionals. The unavailability of self-service data profiling software tools is a common challenge faced.
A self-service data profiling tool that can output quick 360-view of data and identify basic anomalies, such as blank values, field data types, recurring patterns, and other descriptive statistics is a basic requirement for any data-driven initiative. Data Ladder’s DataMatch Enterprise is a fully powered data quality solution that offers data profiling as the first of many steps in correcting, optimizing and refining your data.