Duplicate records in a company’s datasets are a serious roadblock to business success – 92% of organizations report suffering from the nightmare of data duplication. Nowadays, companies extensively utilize data processing techniques like data comparison, matching, and classification to overcome this problem. But several factors make data matching more complex than it first appears, and data leaders need to understand the ins and outs of this process before they can reap its benefits and maximize the accuracy of their results.
In this blog, we share a complete guide to modern data matching techniques:
- What is data matching?
- What are different types of data matching?
- What are the benefits of data matching?
- What are the use cases of data matching?
- How does data matching work?
- What are challenges to data matching?
- What are some of the top questions to ask while buying a data matching tool?
What is data matching?
Data matching is the process of comparing two or more records and computing the probability that they belong to the same entity.
Here, an entity could be anything that your organization stores data about. Usually, it refers to a customer, product, location, company, employee, etc.
The need for data matching arises since the same data can be represented in different ways. When hundreds of people are using, entering, and manipulating a company’s data applications, they are bound to store data with varying formats and even make typing errors or misspell certain words.
Consider the following customer records as an example. It is obvious that the rows highlighted in orange belong to the same customer, but there is only a slight chance that the ones highlighted in blue belong to the same one. This is where data matching techniques offer value: they help us make the right classification decision – which records are a match, a non-match, or a possible match?
| First Name | Last Name | Email Address | Phone | Address |
|---|---|---|---|---|
| Michael | Scott | [email protected] | +1(234)-5678 | 32 St. E Avenue, NY |
| Jim G. | Halpert | [email protected] | (654)-7890 | 12 St W Park, NJ |
| Mike | S | [email protected] | 12345678 | Street 32, East, New York |
| Jade | Halpert | [email protected] | 6557890 | Street 12, West Park New Jersey |
What are different types of data matching?
It is clear that records are matched based on data attributes present in datasets. Depending on the nature of attributes a dataset contains, there are two ways that matching can take place:
01. Exact data match or deterministic linkage
As the name suggests, exact data matching compares two fields from separate records character by character. It is often termed deterministic linkage since it yields one of two determined results: either the records are a match or they are not.
You can only use this approach when you have uniquely identifiable attributes in your dataset. A unique attribute is a data property that cannot be the same for two different entities. For example, a customer’s Social Security Number. If you have clean and standardized Social Security Numbers of all your customers, you can use exact matching techniques to find the ones belonging to the same entity.
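As a minimal sketch, deterministic linkage boils down to an equality check on a unique key. The records and the `ssn` field below are hypothetical:

```python
def exact_match(record_a, record_b, key="ssn"):
    """Deterministic linkage: two records match only if the unique
    key field is present and identical, character for character."""
    value_a, value_b = record_a.get(key), record_b.get(key)
    return value_a is not None and value_a == value_b

a = {"name": "Michael Scott", "ssn": "123-45-6789"}
b = {"name": "Mike S.",       "ssn": "123-45-6789"}
c = {"name": "Jim Halpert",   "ssn": "987-65-4321"}

print(exact_match(a, b))  # True  – same SSN, so same entity
print(exact_match(a, c))  # False – different SSNs
```

Because the comparison is binary, any typo in the key field breaks the link – which is why clean, standardized identifiers are a precondition for this approach.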
02. Fuzzy data match or probabilistic linkage
Fuzzy matching is used when your dataset does not contain uniquely identifying attributes and you must calculate the probability of two records belonging to the same entity – rather than a determined yes or no result. Probability is usually defined by a match score – where 0% represents a non-match, and 100% represents a full match. A value falling between 0 and 100 shows the match confidence level.
This approach works best with a combination of data attributes. For example, you can start by fuzzy matching on a customer’s name and pair it with a couple of other fields, such as date of birth and phone number. Fuzzy matching is more commonly used today since real-world data is full of variations and sometimes lacks necessary information.
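To illustrate the idea of a 0–100 match score, here is a rough sketch using Python’s standard-library `SequenceMatcher` as a stand-in for specialized fuzzy algorithms such as Levenshtein or Jaro-Winkler:

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Return a similarity score between 0 (non-match) and
    100 (full match) for two string values."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Identical strings score 100; completely different strings score 0;
# near-duplicates land somewhere in between.
print(match_score("Halpert", "Halpert"))  # -> 100.0
print(match_score("Halpert", "Halpirt"))  # some value between 0 and 100
```

In a real pipeline, the per-field scores would then be compared against a configured threshold to classify record pairs.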
Read our detailed guide on fuzzy matching to know more: Fuzzy matching 101: Cleaning and linking messy data.
What are the benefits of data matching?
Almost every business suffers from the dread of data duplication. With duplicate records overloading your system, crucial business processes are impacted – resulting in reduced operational efficiency and productivity. Here, we look at the top 8 benefits of cleaning and matching your data records.
a) Attain the golden record for your data assets that represents a complete and comprehensive view about everything related to your customers, products, and more.
b) Eliminate discrepancies present in your dataset, such as misspellings, missing information, varying formats, etc.
c) Prepare data for business intelligence so that your BI tool or team of analysts can generate reliable, high-quality insights.
d) Utilize resources better and enhance your team’s operational efficiency by providing them with accurate and complete data.
e) Enrich data gathered from external sources into internal datasets and uncover deeper insights.
f) Maximize the ROI of your sales and marketing campaigns by consolidating customer information and offering personalized experiences.
g) Conform to data compliance standards by keeping your data clean and deduplicated.
h) Prevent fraudulent activities since fraudsters take advantage of multiple records residing at a data source.
To know more about data matching benefits, read our blog: 8 benefits of data matching that can help you grow your business.
What are the use cases of data matching?
Data matching is used across every industry to resolve and eliminate duplicates; the exact nature of its use and application differs depending on the challenges encountered. Whether in government and the public sector, education, banking and finance, healthcare, or a sales and marketing department – data matching is used everywhere to find and remove duplicate records.
Read more about how data matching is applied in different industries.
How does data matching work?
The process of data matching is simple, but since it has a lot of moving parts, it can get overwhelming pretty quickly. Below, we walk through a simple 4-step process for matching data records, along with the details you need to take care of at each step to ensure maximum accuracy.
Step 01. Data preparation and selection
In the initial phase, data is prepared for the matching process. Datasets usually contain a number of data quality issues, such as empty values, misspellings, format and pattern variations, etc. To ensure smooth and accurate matching between records, data must be profiled, cleaned, and standardized.
01. Profiling data
Data profiling runs statistical algorithms on your datasets and uncovers hidden details about their structure and contents. A data profile report highlights the state of your data; this information is useful for identifying potential data cleansing opportunities and figuring out which attributes are good candidates for the matching process.
02. Cleaning and standardizing data
- Remove and replace empty values, leading/trailing spaces, specific characters and numbers, punctuation, etc.
- Parse aggregated or longer columns into smaller sub-components, such as Address field into Street Number, Street Name, City, etc.
- Transform letter cases (upper to lower or lower to upper) to ensure a consistent, standardized view.
- Merge same or similar columns together to avoid duplicated columns.
- Match and transform patterns of data values for consistency.
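The cleaning steps above can be sketched as a small standardization function; the regular expressions and the sample value are illustrative only:

```python
import re

def standardize(value: str) -> str:
    """Apply basic cleaning steps to one field value."""
    value = value.strip()                  # remove leading/trailing spaces
    value = re.sub(r"[^\w\s]", "", value)  # strip punctuation and symbols
    value = re.sub(r"\s+", " ", value)     # collapse repeated whitespace
    return value.lower()                   # consistent letter case

print(standardize("  Mr. O'Brien,  Jr. "))  # -> "mr obrien jr"
```

Running every matching attribute through such a function before comparison removes many of the trivial differences that would otherwise drag down match scores.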
03. Selecting data attributes
The final step in the data preparation phase is selection of data attributes. You need to filter attributes for two purposes:
- Select the data fields you want to keep in the final match results or the golden record; this helps eliminate excessive noise from your output, and
- Select the data fields that will be used to compare records for matching.
The first part is pretty straightforward, but the second is where most people get confused: how to know which attributes will provide the best match results? To answer this question, you must review the available attributes in terms of the following characteristics:
a) Intrinsicality: How intrinsic is this property to the data asset? The more intrinsic the property is, the less likely another entity will have the same value. For example, the exact dimensions of a handbag product.
b) Structural consistency: How stable is the property’s structure? Properties with higher structural stability are more likely to be chosen than the ones with lower stability. For example, email addresses have a similar pattern or structure – making them more stable, while open text fields are less stable.
c) Value consistency: How likely is it that this property will change for the entity? Frequently changing values should be avoided when choosing matching attributes. For example, a person’s office address may change often, but their date of birth remains the same.
d) Domain variance: How many distinct values can this property take across entities? Properties with a wide domain (high cardinality) discriminate between entities better than ones with only a few possible values. For example, Email Address has far more possible values than Gender.
e) Accuracy: Do the values represent the real-world truth? Some companies have a high fraction of inaccurate values in their dataset. Choosing a more accurate property will offer more accurate results.
f) Completeness: Does the property have missing values for some entities? Use a property that is more holistically complete across datasets.
Step 02. Data match configuration and execution
With your datasets cleaned and standardized and the matching attributes selected, it’s now time to configure the matching algorithm. The exact nature of these settings differs amongst vendors, but conceptually, the following configurations must be handled to ensure accurate results.
The upcoming sections highlight the five configurable parts of the matching process.
01. Comparing within and across datasets
The first configuration requires you to define which datasets must be compared with one another. There are three possible comparisons:
a) Within: This configuration compares data records within the same dataset only; for example, the first row of Dataset A is compared with all other rows of Dataset A, and so on.
b) Across: This configuration compares data records across datasets; for example, all rows of Dataset A are compared with all rows of Dataset B.
c) Both: This configuration compares records both within the connected datasets and across them; for example, Dataset A is compared with both Dataset A and Dataset B.
02. Blocking record comparisons
Data matching is a computationally intensive process. With millions of records in a dataset, comparing within and across datasets and then matching on multiple fields can put a strain on your system and take a lot of time to generate the first result. For this reason, blocking or indexing techniques are implemented to disqualify some records from the comparison process.
Comparisons are blocked by selecting an attribute that is highly likely to be the same for two records belonging to the same entity. Record pairs whose values for this attribute are too dissimilar are excluded from the comparison process.
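A minimal blocking sketch, assuming a hypothetical `zip` field as the blocking key, could look like this:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Group records by a blocking key and yield candidate pairs
    only within each block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)

customers = [
    {"name": "Michael Scott", "zip": "10001"},
    {"name": "Mike S.",       "zip": "10001"},
    {"name": "Jim Halpert",   "zip": "07001"},
]

# Only the two 10001 records form a candidate pair; the 07001
# record is never compared against them.
pairs = list(block_pairs(customers, key=lambda r: r["zip"]))
print(len(pairs))  # -> 1
```

Note that blocking trades recall for speed: a typo in the blocking key can prevent two true duplicates from ever being compared, so blocking keys should be cleaned and standardized first.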
03. Mapping fields across datasets
For comparisons made across datasets, it is important to map fields that describe the same information. Field mapping is important because disparate data sources have varying:
a) Data structures, for example, one source stores Customer Name as one field while the other stores it in three fields: First, Middle, and Last Name. Field parsing should be performed (preferably in the initial phase of data preparation) to divide a longer field into smaller sub-components.
b) Field titles, for example, one source names the address field as Residential Address, while the other names it as Address
04. Creating match definitions for multiple comparisons
Comparing a single field between records may not yield accurate results. For the best results, you may want to select a combination of fields for matching. Moreover, some data matching tools allow you to design multiple combinations computed with logical AND/OR operations. Consider the following example of customer data matching to understand how this works:
Your customer datasets lack unique identifiers, so you decide to match on a combination of fields. You can design three match definitions:
i) Definition 1 matches Name, Street Number, Street Name, and Zip Code
ii) Definition 2 matches Name and Phone Number
iii) Definition 3 matches Email Address
For two records to be a match, they need to match on all fields of Definition 1, OR all fields of Definition 2, OR all fields of Definition 3.
For every field comparison, you must configure basic settings, such as the match algorithm type, its weight, and threshold for classification. Let’s review what these settings are in more detail:
a) Selecting type of data match algorithm
Different data matching algorithms are used depending on the nature of the data being compared. For example, integers are compared differently than open-text string fields, so the entity matching algorithm used to compare names should be different from the one used to compare street numbers. The most common data match algorithms are listed below:
i) Exact: Does not allow any tolerance for difference and expects a full, character-by-character match.
ii) Fuzzy: Allows tolerance in difference and evaluates the likelihood of two strings being similar by using probabilistic matching. Fuzzy matching technology utilizes a number of industry leading techniques, such as Levenshtein Distance (or Edit Distance), Damerau-Levenshtein Distance, Jaro-Winkler Distance, Metaphone 3, Name Variant, etc.
iii) Phonetic: Useful for matching similar sounding strings as it matches words by their pronunciation. Phonetic match can be exact or fuzzy.
iv) Numeric: Used to run a probabilistic match on numeric fields.
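For concreteness, here is a compact implementation of the Levenshtein (edit) distance mentioned above – the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only
    the previous row of the matrix in memory."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                   # deletion
                curr[j - 1] + 1,               # insertion
                prev[j - 1] + (ch_a != ch_b),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```

Fuzzy matchers typically convert such a distance into a 0–100 similarity score, for example by normalizing against the length of the longer string.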
b) Assigning weights to matching attributes
Some attributes have a higher probability of being unique for an entity as compared to others. For example, it is likely for two customers to have the same name, but unlikely to have the same address as well. Comparing two records with different names but the same address should not compute a 50% match score.
This is where weights can be useful. Weight assignments help to prioritize attributes that are less likely to be the same for two entities and should have higher priority while calculating the match score. In the example we just discussed, assigning a higher weight to the address field makes more sense than the name field.
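A weighted match score can be sketched as a weighted average of per-field scores; the fields, scores, and weights below are illustrative:

```python
def weighted_score(field_scores, weights):
    """Combine per-field match scores (0-100) into one overall
    score, giving more influence to higher-weighted attributes."""
    total_weight = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total_weight

# The names match perfectly but the addresses barely do; because
# address carries three times the weight, the overall score stays low.
scores  = {"name": 100, "address": 10}
weights = {"name": 1,   "address": 3}
print(weighted_score(scores, weights))  # -> 32.5
```

With equal weights the same pair would score 55, illustrating how weighting keeps a shared common name from inflating the match score.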
c) Thresholding classification rule
A threshold value defines the classification level for the comparator. If the result of the comparison is equal to or higher than the defined level, the match is considered successful; any value below the level is a non-match. For example, if the threshold level is 70, then all match scores of 70% or above are a match and the ones below 70 are a non-match.
With all the configurations done, you are now ready to execute. It’s important to review your configurations to ensure no mistakes were made. A configured data matching example may look like this:
| Match Definition | Field | Algorithm | Weight | Threshold |
|---|---|---|---|---|
| Definition 3 | Email Address | Fuzzy | 100 | 70 |
Due to computational complexity, match algorithms can take a long time to execute and generate results. But there is data matching software on the market, such as DataMatch Enterprise, that uses a modern matching process to match 2 million records in less than 2 minutes.
Step 03. Results evaluation
Once the match results are computed, they will give you the following information:
- Does a record match any other record?
- What is the overall match score for the matching records?
- What are the individual match scores for every field?
- Which definitions did the records match on?
After result generation, you need to evaluate results and perform accuracy assessment.
01. Reviewing false positives and negatives
Even though the match algorithm classifies records as a match or a non-match based on the defined threshold level, there are still some cases where:
- Matching records are incorrectly classified as a non-match – a false negative,
- Non-matching records are incorrectly classified as a match – a false positive.
Such instances usually have a match score that lies within the gray area. For example, if the threshold value was set to 70, most false negatives and positives are encountered where the match scores lie between the 60-80 range. If the datasets under consideration have high variations, it is best to review these records and identify the ones that were misclassified.
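A review workflow can flag pairs whose scores fall near the threshold for manual inspection. The ±10 band below mirrors the 60-80 example and is an illustrative choice:

```python
def triage(scored_pairs, threshold=70, band=10):
    """Split scored record pairs into matches, non-matches, and a
    gray-area review queue within +/-band of the threshold."""
    match, review, non_match = [], [], []
    for pair, score in scored_pairs:
        if abs(score - threshold) <= band:
            review.append(pair)      # where false positives/negatives cluster
        elif score >= threshold:
            match.append(pair)
        else:
            non_match.append(pair)
    return match, review, non_match

scored = [(("r1", "r2"), 95), (("r3", "r4"), 72), (("r5", "r6"), 30)]
matched, to_review, unmatched = triage(scored)
print(len(matched), len(to_review), len(unmatched))  # -> 1 1 1
```

Only the gray-area queue needs human eyes; confidently scored pairs on either side can be processed automatically.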
02. Re-tuning data match configuration
After reviewing the match results, it is best to re-tune the match algorithms based on the findings. For example, if many records were misclassified as a match when they were a non-match, you can reduce the threshold level and re-execute the algorithm. Similarly, you can try to change the type of match algorithm or the weight of matching attributes to get more accurate results.
Step 04. Data merge and deduplicate
The last phase of the data match process is to eliminate the identified duplicates. There are two ways to resolve the duplicates:
- Merge duplicate records together to contain information in a single, golden record, or
- Select the most comprehensive record as the golden record and delete the rest of the duplicates.
Sometimes both approaches are implemented to remove duplicates and retain maximum information. You can also design rules to conditionally overwrite and merge fields.
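A simple survivorship rule – keep the first non-empty value seen for each field – can sketch the merge step; the records and email address are hypothetical:

```python
def golden_record(duplicates):
    """Merge duplicate records field by field, keeping the first
    non-empty value encountered for each field."""
    merged = {}
    for record in duplicates:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

dupes = [
    {"name": "Mike S",        "email": "",                  "phone": "12345678"},
    {"name": "Michael Scott", "email": "[email protected]", "phone": ""},
]
print(golden_record(dupes))
# -> {'name': 'Mike S', 'phone': '12345678', 'email': '[email protected]'}
```

In practice you might prefer rules such as "longest value wins" or "most recently updated source wins"; the conditional overwrite rules mentioned above generalize this idea.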
What are challenges to data matching?
Having seen how data matching works, it is obvious that some parts of the process can easily make it unpredictable and generate unexpected results. A few factors play a big role in the success of a data matching process, and if they are not handled properly, you can end up with unreliable results every time. Here are the top 6 challenges of data matching:
01. Data heterogeneity
Datasets stored across sources have systematic differences, which makes it difficult to compare them exactly. This is usually known as data heterogeneity. Although you can resolve these differences using data cleansing and standardization techniques, a lot of effort is required to ensure information is not lost and is only transformed into the required structure and format. Heterogeneous data is of two types:
a) Structural heterogeneity
This type of difference occurs when fields of different databases represent the same information in a structurally different manner. For example, one database stores a contact’s name as a single Contact Name field, while a second database stores it in multiple columns, such as Salutation, First Name, Middle Name, and Last Name.
b) Lexical heterogeneity
This type of difference occurs when fields of different databases are structurally the same, but they represent the same information in a syntactically different manner. For example, two or more databases can have the same Address field, but one stores the address value as 32 E St. 4, while the other has 32 East, 4th Street.
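One common way to neutralize lexical differences is to normalize values before comparing them – lower-casing, stripping punctuation, and expanding abbreviations. The tiny abbreviation table and token sorting below are illustrative simplifications, applied to a variant of the 32 E St. 4 example (ignoring ordinal suffixes like 4th):

```python
# Illustrative sample; real tools ship much larger abbreviation dictionaries.
ABBREVIATIONS = {"st": "street", "e": "east", "w": "west", "ave": "avenue"}

def normalize_address(address: str) -> str:
    """Lower-case, strip punctuation, expand abbreviations, and sort
    tokens so lexically different addresses compare equal."""
    cleaned = address.lower().replace(",", " ").replace(".", " ")
    tokens = (ABBREVIATIONS.get(t, t) for t in cleaned.split())
    return " ".join(sorted(tokens))

print(normalize_address("32 E St. 4") == normalize_address("32 East, 4 Street"))  # True
```

Both values normalize to the same token bag, so an exact or fuzzy comparison on the normalized form treats them as the same address.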
02. Matching attributes selection
We already discussed the characteristics to look for when choosing an attribute for the matching process: intrinsicality, structural and value consistency, domain variance, completeness, and accuracy. But the truth is, even when datasets have such attributes, they are often missing for most records or contain outdated information – yielding unreliable results.
Another challenging aspect of attribute selection is that many people believe the more attributes you use in your match definition, the more accurate the results. This is not true. The more fields two records must agree on, the less likely they are to be classified as a match – which means false negatives increase. The key here is to assign weights. Two people are more likely to have the same birthday, but less likely to also have the same email address; if both are used to match, assigning a higher priority to the more unique attribute can be helpful.
03. Computational complexity
In the matching process, every data record must be compared with every other record in the same dataset or across different datasets, so the computational complexity of these comparisons grows quadratically as the size of the database grows. This means your matching process must be carried out on a data system that can handle such a computational load. Moreover, blocking or indexing techniques are helpful to limit the number of comparisons, as discussed earlier.
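To make the quadratic growth concrete, here is the raw pair count for a naive all-pairs comparison within one dataset:

```python
def pair_count(n: int) -> int:
    """Number of pairwise comparisons for n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {pair_count(n):>13,} comparisons")
```

A tenfold increase in records yields roughly a hundredfold increase in comparisons – 100,000 records already require almost 5 billion of them – which is why blocking matters.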
04. Match algorithm configuration
The main part of the data matching process is data match configuration. Many people get overwhelmed by the number of details that need to be perfectly configured for a match process to generate accurate results. And let’s face it – there will always be some trade-off while fine-tuning such algorithms.
There are data deduplication tools in the market nowadays that come with predefined matching suggestions. Based on the nature of your data, they suggest a basic configuration for your matching process. However, you can always override the default configuration and tune the algorithm according to your custom requirements.
05. High-risk data availability
A data matching process uses the most unique attributes of an entity. Mostly, these are social security numbers, bank account numbers, credit card numbers, or MR numbers for patients, etc. But organizations are reluctant to share these identifiers with matching tools and usually keep this information encrypted in their databases.
With such identifiers out of the picture, it gets fairly difficult to match common attributes that have a high chance of repetition across entities. A way to overcome this challenge is to use data masking – this technique converts your data values using patterns, and masks/hides any sensitive or personally identifiable information present to ensure data compliance.
06. Human intervention to resolve false positives and negatives
With the advent of technology, most business processes are automated. Organizations do not expect their teams to be knee-deep in the data matching process resolving conflicts when they have already invested in a data matching tool. Using systems that allow more control and customization as well as offer features to design rules for automated conflict resolution can help minimize human intervention and free up your teams to focus on more important tasks.
Top questions to ask while buying a data matching tool
Having discussed various aspects of data matching, it is clear that any organization suffering from the data duplication nightmare needs a data matching tool. But with so many options on the market these days, it can be hard to understand which one fits you best. Here are some top questions to ask before buying a data matching tool:
- Can you connect and pull data from a variety of data sources?
- Can you prepare data for matching within the same tool, such as parsing, cleaning, and standardizing data values?
- Can you configure different settings of the match definition and algorithm?
- Can you match data using multiple match algorithms, such as phonetically matching Alizabeth and Elizabeth, fuzzy matching on Johnny and John, and so on?
- Can you easily interpret the computed match results or do you require technical expertise?
- Does the solution consistently generate accurate and reliable results?
- Can you merge or remove duplicates within the same tool?
- Can you efficiently scale the matching solution in case the data volume increases in the future?
Finding, matching, and merging duplicates are crucial for smooth business operation and intelligence. Companies that resolve duplicates present in their datasets are less likely to miss out on opportunities for business expansion, customer acquisition, product enhancement, and increased revenue. But it is difficult to find an all-in-one solution that takes care of all four phases of the data matching process: preparation, match configuration and execution, results evaluation, and merge and deduplication.
DataMatch Enterprise is one such tool that offers a range of modules and answers yes to all questions mentioned in the previous section. Our tool supports data coming in from different sources, allows field mapping, and suggests matching algorithms that work best for your data. Of course, you can override the default selection with your special requirements. You can also use DME to evaluate how different match algorithms perform with your dataset and choose the one that offers highest match accuracy.