Blog

A Quick Guide to Record Linkage Software

According to Lucidworks, a technology research specialist, about 7.5 sextillion gigabytes of data is generated worldwide every single day. Of this, companies analyze only about 10 percent, while the rest turns dark. Dark data comprises all the information that lurks unused. Unstructured, unformatted, and duplicated data records are some of the common reasons why this data cannot be monetized or used for analysis.

What is record linkage?

To achieve digital transformation and personalization, companies must bring the maximum amount of data into their analysis processes. Every business has multiple disparate systems in place to track and monitor customer behavior across various channels. These systems gather psychographic, demographic, behavioral, and device information at every touchpoint. To understand the complete customer story and extract value from it, it is necessary to integrate and merge this data so that each record represents a single entity. This is called record linkage.

Record linkage is the process of comparing records from two or more disparate data sources and identifying whether they refer to the same entity or individual. 

This process is pretty simple when you have standardized datasets that contain unique identifiers, but it is quite challenging when your datasets do not conform to a standardized format or lack uniquely identifying data attributes.

Why do you need record linkage?

Dark data piles up when information about the same individual or entity is scattered across different sources. Companies can extract more value from this data when records are considered together, rather than when considered individually. Following are some of the most important reasons for performing record linkage:

Types of record linkage methods

To link records, uniquely identifying attributes in the datasets are compared to decide whether two records belong to the same entity or individual. Based on the nature of these unique attributes, record linkage can be done in two ways:

01. Deterministic record linkage

This method is used when your datasets contain one or more attributes that uniquely identify the records. If two records have the same value for a unique attribute, they can be said to be a match and classified as the same entity. If your datasets have multiple uniquely identifying attributes, then comparisons can be performed on all of these columns, and records can be considered a match if they agree on a single attribute or on a set threshold of attributes. Data attributes such as social security number and national ID are good examples of uniquely identifying attributes that can be used for deterministic record linkage.

You can choose this method of record linkage when you have high quality data and simple SQL queries can help you make the matching decision.
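As a rough illustration, the sketch below performs a deterministic, exact-key join in Python. The datasets, field names, and the "ssn" key are hypothetical; in practice the same decision is often expressed as a simple SQL join on the unique identifier.

```python
# Minimal sketch of deterministic linkage: join two datasets on an exact unique key.
# The field names ("ssn", "name") are illustrative, not taken from any specific product.

def deterministic_link(dataset_a, dataset_b, key="ssn"):
    """Return pairs of records from both datasets that share the same unique key."""
    index_b = {record[key]: record for record in dataset_b if record.get(key)}
    matches = []
    for record_a in dataset_a:
        key_value = record_a.get(key)
        if key_value and key_value in index_b:
            matches.append((record_a, index_b[key_value]))
    return matches

crm = [{"ssn": "123-45-6789", "name": "Jane Doe"}]
billing = [{"ssn": "123-45-6789", "name": "Jane M. Doe"}]
print(deterministic_link(crm, billing))  # the two records link on the shared SSN
```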

02. Probabilistic record linkage

This method is used when your datasets do not contain exact uniquely identifying attributes, and you must leverage fuzzy matching techniques to link records. In this case, multiple attributes are assigned weights and considered together to classify records as matches or non-matches. An example of probabilistic record linkage is when datasets either don’t contain social security numbers or the values present are not accurate.

To link records, fuzzy matching is performed on multiple fields such as First and Last Name, Date of Birth, or Address, and based on the calculated weights, possible matches are computed. The next section goes into more detail about the probabilistic record linkage process.
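To make the idea concrete, here is a minimal Python sketch of a weighted fuzzy comparison. The field names, weights, and the use of Python's built-in SequenceMatcher as the similarity function are illustrative assumptions; production tools use tuned similarity functions and weights.

```python
from difflib import SequenceMatcher

# Illustrative weighted fuzzy comparison of two records; field names and
# weights are assumptions for the example, not a prescribed configuration.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "dob": 0.3}

def weighted_similarity(rec_a, rec_b):
    """Combine per-field string similarities into a single weighted score in [0, 1]."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        similarity = SequenceMatcher(None, rec_a[field].lower(), rec_b[field].lower()).ratio()
        score += weight * similarity
    return score

a = {"first_name": "Jon", "last_name": "Smith", "dob": "1985-02-11"}
b = {"first_name": "John", "last_name": "Smyth", "dob": "1985-02-11"}
print(round(weighted_similarity(a, b), 2))  # a high score suggests a probable match
```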

Record linkage process

Peter Christen describes a general approach for tackling a record linkage problem in the following five steps:

Step 01: Preprocessing

Record linkage results depend on the quality of input data. Data cleansing and standardization activities are carried out to ensure reliable data quality, such as identifying missing values and specifying accurate ones, removing garbage and invalid data, checking data accuracy and relevancy, and ensuring that data is up to date. As the process involves multiple datasets, the same data cleaning rules must be applied to ensure consistency across all data sources.
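The sketch below shows the flavor of such cleansing rules in Python; the specific rules (whitespace collapsing, case folding, null-token handling) are illustrative examples rather than a complete standardization pipeline.

```python
import re

# A minimal sketch of pre-linkage cleansing/standardization rules.
# The rules below are illustrative; real pipelines encode business-specific rules
# and apply the same rules to every source dataset.

NULL_TOKENS = {"", "n/a", "na", "null", "none", "unknown"}

def standardize_record(record):
    """Apply the same cleaning rules to every field of a record."""
    cleaned = {}
    for field, value in record.items():
        value = re.sub(r"\s+", " ", str(value)).strip().lower()  # collapse whitespace, lowercase
        cleaned[field] = None if value in NULL_TOKENS else value  # flag missing values explicitly
    return cleaned

print(standardize_record({"name": "  JANE   DOE ", "phone": "N/A"}))
# {'name': 'jane doe', 'phone': None}
```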

Step 02: Blocking / Indexing

To link records, each data record must be compared with every other record in that dataset. And because most organizations track user interactions in several different applications, a single record must also be compared against records in various other datasets. The computational complexity of these comparisons grows quadratically as the size of the database grows.

For this reason, blocking and indexing techniques are implemented to limit the number of comparisons. These techniques ensure that only similar records are compared, which reduces the overall number of comparisons. This is achieved by selecting a data field that has a very high probability of being the same across records that belong to the same entity. If the values of this field are completely dissimilar, the algorithm can simply block those records from being compared, reducing overall computational complexity.
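A minimal Python sketch of blocking is shown below, assuming hypothetical postcode and last-name fields as the blocking key; only records that share the key are ever paired for comparison.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of a simple blocking scheme: records are grouped by a blocking key
# (here, an assumed postcode field plus the first letter of the last name),
# and candidate pairs are generated only within each block.

def blocking_key(record):
    return (record.get("postcode"), (record.get("last_name") or "")[:1])

def candidate_pairs(records):
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        # Only records sharing the blocking key are ever compared.
        yield from combinations(block, 2)

people = [
    {"last_name": "smith", "postcode": "10001"},
    {"last_name": "smyth", "postcode": "10001"},
    {"last_name": "jones", "postcode": "94105"},
]
print(len(list(candidate_pairs(people))))  # 1 candidate pair instead of 3 without blocking
```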

Step 03: Field comparisons

For this step, data fields are first chosen that can be used for comparison. These are usually the attributes that uniquely identify the record, but if such attributes do not exist or are withheld for confidentiality reasons, a combination of fields is used for probabilistic comparisons.

Similarity functions, such as Jaro-Winkler and Levenshtein, are usually used to calculate the distance between two data values and assess how similar or dissimilar they are. For each record pair in a comparison, an agreement pattern is evaluated, and based on that pattern a classification decision is made as to whether the records belong to the same entity or not.
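For illustration, the following self-contained Python sketch implements the Levenshtein edit distance and normalizes it into a similarity score between 0 and 1; dedicated libraries also provide Jaro-Winkler and other functions, but a plain implementation keeps the example dependency-free.

```python
# A compact Levenshtein (edit distance) implementation, normalized to a 0-1
# similarity score for comparing field values.

def levenshtein(s1, s2):
    """Minimum number of single-character edits needed to turn s1 into s2."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]

def similarity(s1, s2):
    longest = max(len(s1), len(s2)) or 1
    return 1 - levenshtein(s1, s2) / longest

print(similarity("jonathan", "johnathan"))  # ~0.89: very likely the same name
```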

The Fellegi and Sunter framework gives the mathematical foundation and derives the probabilistic decision rule for classifying records as matches or nonmatches based on agreement pattern. Let’s take an example of two datasets A and B, and see how their records are compared to compute the agreement patterns.

In datasets A and B, consider records a and b such that a ∈ A and b ∈ B.

The data field to be compared – or simply, the matching variable – is present in both datasets A and B. If α and β are two matching variables from datasets A and B respectively, then the aim of the record linkage process is to find

  1. Set of matches M such that M = {(α(a), β(b)) | a = b}
  2. Set of nonmatches NM such that NM = {(α(a), β(b)) | a ≠ b}


In absence of a single unique identifier, multiple matching variables are used. Let’s say we choose three matching variables from A and B. These are:

  1. Last name (γ1)
  2. First name (γ2)
  3. Street name (γ3)


Note that γ=1 if a pair matches on that field, and γ=0 if it does not. For example, the agreement pattern for a pair could be (1,0,1), which represents that the pair agrees on last name, disagrees on first name, and agrees on street name.

As field comparisons are performed, these agreement patterns are computed so that they can be used to calculate the likelihood ratio of two records belonging to the same entity. This is done using classifying algorithms which are covered in the next step.
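The sketch below shows how such an agreement pattern might be computed for a candidate pair over the three matching variables above; the similarity function and the 0.85 agreement threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Sketch of computing an agreement pattern over the three matching variables
# used above (last name, first name, street name). A field "agrees" when its
# similarity exceeds a per-field threshold; the threshold here is illustrative.

FIELDS = ["last_name", "first_name", "street_name"]
THRESHOLD = 0.85

def agreement_pattern(rec_a, rec_b):
    """Return a tuple of 1s and 0s: 1 if the pair agrees on the field, else 0."""
    pattern = []
    for field in FIELDS:
        ratio = SequenceMatcher(None, rec_a[field], rec_b[field]).ratio()
        pattern.append(1 if ratio >= THRESHOLD else 0)
    return tuple(pattern)

a = {"last_name": "miller", "first_name": "anne", "street_name": "elm street"}
b = {"last_name": "miller", "first_name": "kate", "street_name": "elm street"}
print(agreement_pattern(a, b))  # (1, 0, 1): agrees on last name and street, disagrees on first name
```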

Step 04: Classification

The Fellegi and Sunter framework serves as the basis for many record linkage classifiers, such as Naive Bayes, Support Vector Machine (SVM), and Expectation Conditional Maximization. According to this framework, the agreement likelihood ratio R(γj) for a record pair j is the ratio of two conditional probabilities, m and u. Mathematically: R(γj) = m/u, where:

  1. The m-probability is the conditional probability that a record pair j has an agreement pattern γj given that it is a match (M), denoted as m = P(γj| M),
  2. The u-probability is the conditional probability that a record pair j has an agreement pattern γj given that it is not a match (NM), denoted as u = P(γj| NM).

Placing the values of m and u in above equation:

R(γj) = P(γj|M)/P(γj|NM) —– (i)

The likelihood ratio from the above equation gives a numerical value that represents how likely two records belong to the same entity. But for you to classify the records as a match or nonmatch, you need to run record linkage on a test dataset and compute a series of likelihood ratios. This data will allow you to determine the upper cutoff (W+) and a lower cutoff (W−) for classifying records as matches and nonmatches correctly.

These cutoff values depend heavily on the type of data your datasets contain. Once you have estimated the cutoff values for your dataset, you can calculate the likelihood ratio of a pair using the above formula and then map record pairs to three possible outcomes:

  1. The pair is a match if likelihood ratio R(γj) is greater than or equal to upper cutoff W+
  2. The pair is a nonmatch if likelihood ratio R(γj) is less than or equal to lower cutoff W-
  3. The pair belongs to the undecided cases for manual clerical reviews, if otherwise


Let’s expand equation (i) to accommodate multiple matching variables, and take the log on both sides to reduce skew in the data. Assuming the matching variables are conditionally independent, the overall likelihood ratio is the product of the per-field ratios, so taking the log turns it into a sum of field weights:

log R(γj) = log [P(γ1|M)/P(γ1|NM)] + log [P(γ2|M)/P(γ2|NM)] + log [P(γ3|M)/P(γ3|NM)] —– (ii)

where each field contributes log(m/u) when the pair agrees on it and log((1−m)/(1−u)) when it does not.
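Putting the pieces together, the following sketch applies this decision rule with assumed (not estimated) m- and u-probabilities and illustrative cutoff values; real implementations estimate these from training data, for example with an Expectation Maximization procedure.

```python
import math

# Illustrative per-field m- and u-probabilities and cutoffs; in practice these
# are estimated from labeled or training data, not hard-coded.
M_PROB = {"last_name": 0.95, "first_name": 0.90, "street_name": 0.85}
U_PROB = {"last_name": 0.02, "first_name": 0.05, "street_name": 0.10}
UPPER_CUTOFF = 5.0   # W+
LOWER_CUTOFF = -3.0  # W-

def log_likelihood_ratio(pattern, fields=("last_name", "first_name", "street_name")):
    """Sum per-field log weights: log(m/u) on agreement, log((1-m)/(1-u)) on disagreement."""
    total = 0.0
    for field, agrees in zip(fields, pattern):
        m, u = M_PROB[field], U_PROB[field]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total

def classify(pattern):
    weight = log_likelihood_ratio(pattern)
    if weight >= UPPER_CUTOFF:
        return "match"
    if weight <= LOWER_CUTOFF:
        return "nonmatch"
    return "clerical review"

print(classify((1, 0, 1)))  # the agreement pattern from the example above
```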

Step 05: Evaluation

Although the process of record linkage can add huge value to your business intelligence processes, the inaccuracy of record linkage algorithms can cost you big bucks too. This is why it is crucial to choose the method and data attributes that will ensure maximum linkage accuracy on your dataset.

There are four outcomes that are possible in a record linkage decision:

  1. Correctly classifying true matches as matches (A)
  2. Incorrectly classifying true matches as nonmatches (B)
  3. Correctly classifying true nonmatches as nonmatches (C)
  4. Incorrectly classifying true nonmatches as matches (D)

An efficient way of evaluating your record linkage method is to check the following four metrics:

  1. Sensitivity (equals to A/(A+B)), which measures how accurately the algorithm classifies true matches as matches.
  2. Specificity (equals to C/(C+D)), which measures how accurately the algorithm classifies true nonmatches as nonmatches.
  3. Positive Predictive Value (PPV = A/(A+D)), which measures the proportion of pairs classified as matches that are true matches.
  4. Negative Predictive Value (NPV = C/(B+C)), which measures the proportion of pairs classified as nonmatches that are true nonmatches.
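The sketch below computes these four metrics from the counts A, B, C, and D defined above; the counts in the usage example are hypothetical.

```python
# Compute the four evaluation metrics from the counts defined above:
# a = true matches classified as matches, b = true matches classified as nonmatches,
# c = true nonmatches classified as nonmatches, d = true nonmatches classified as matches.

def evaluate_linkage(a, b, c, d):
    return {
        "sensitivity": a / (a + b),  # how many true matches were found
        "specificity": c / (c + d),  # how many true nonmatches were kept apart
        "ppv": a / (a + d),          # how many predicted matches are real matches
        "npv": c / (b + c),          # how many predicted nonmatches are real nonmatches
    }

# Hypothetical counts from a labeled test dataset.
print(evaluate_linkage(a=950, b=50, c=8900, d=100))
```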

Challenges to overcome during record linkage

There are multiple challenges encountered while performing record linkage, such as:

Applications and use cases of record linkage

Different domains and industries use the record linkage process for various purposes. Listed below are some of these applications:

Adopting an automated, self-service record linkage solution

Self-service record linkage software allows business users and corporate professionals to easily link their data records without a steep learning curve or knowledge of programming languages.

Record linkage software, such as Data Ladder’s DataMatch Enterprise, is designed to help business users to:

  • Prepare data by profiling the data for errors and information consistency
  • Clean and normalize data according to defined business rules
  • Match multiple datasets using a combination of proprietary and established algorithms
  • Remove duplicates with an accuracy rate of 95 – 100%
  • Create golden records and obtain a single source of truth

and much more.

These tools are becoming the flagship solution to an age-old problem: reliance on complex IT processes to make sense of data. In an age when automation is the key to business success, businesses cannot afford to lose the bigger part of their big data to the dark.

Try data profiling today

No credit card required

"*" indicates required fields

Hidden
This field is for validation purposes and should be left unchanged.

Want to know more?

Check out DME resources

Merging Data from Multiple Sources – Challenges and Solutions
