According to Lucidworks, a technology research specialist, about 7.5 sextillion gigabytes of data is generated worldwide every single day. Out of this data, companies only analyze about 10 percent, while the rest turns dark. Dark data comprises of all information that lurks unused. Unstructured, unformatted, and duplicated data records are some common reasons why this data cannot be monetized or used for analysis.
To achieve digital transformation and personalization, companies must include maximum amount of data to their analysis processes. Every business has multiple disparate systems in place to track and monitor customer behavior across various channels. These systems gather psychographic, demographic, behavioral, and device information across every touchpoint. To understand the complete customer story and extract benefit, it is necessary to integrate and merge this data such that each record represents a single entity. This is called record linkage.
Record linkage is the process of comparing records from two or more disparate data sources and identifying whether they refer to the same entity or individual.
This method is used when your datasets contain one or more attributes that uniquely identify the records. This way, if two records have the same unique attribute, they can said to be a match and classified as the same entity. If your datasets have multiple attributes that uniquely identify a record, then comparisons can be performed based on all these columns. Records can be considered a match if they match on a single attribute or any set threshold value. Data attributes such as social security number and national ID are good examples of uniquely identifying attributes which can be used for deterministic record linkage.
You can choose this method of record linkage when you have high quality data and simple SQL queries can help you make the matching decision.
This method is used when your datasets do not contain exact uniquely identifying attributes, and you must leverage fuzzy matching techniques to link records. In this case, multiple attributes are assigned weights and considered together to classify records as matches or non-matches. An example of probabilistic record linkage is when datasets either don’t contain social security numbers or the values present are not accurate.
To link records, fuzzy matching is performed on multiple fields such as First and Last Name, Date of Birth, or Address, and based on the calculated weights, possible matches are computed. The next section goes in more detail about the probabilistic record linkage process.
Peter Christen describes a general approach for tackling a record linkage problem in the following five steps:
Record linkage results depend on the quality of input data. Data cleansing and standardization activities are carried out to ensure reliable data quality, such as identifying missing values and specifying accurate ones, removing garbage and invalid data, checking data accuracy and relevancy, and ensuring that data is up to date. As the process involves multiple datasets, the same data cleaning rules must be applied to ensure consistency across all data sources.
To link records, each data record must be compared with every other record in that dataset. As most organizations track user interactions with different data applications, a single record should be compared across various datasets as well. It is expected that the computational complexity of these comparisons grows quadratically as the size of database grows.
For this reason, blocking and indexing techniques are implemented to limit the number of comparisons. These techniques ensure that only similar records are compared, which reduces the overall number of comparisons. This is achieved by selecting a data field that has a very high probability of being the same across records if they belong to the same entity. And if these data values are completely dissimilar, the algorithm can simply block these records from being compared, and reduce overall computational complexity.
For this step, data fields are first chosen that can be used for comparison. These are usually the attributes that uniquely identify the record. But if such attributes do not exist or are not available due to confidentiality purposes, a combination of fields are used for probabilistic comparisons.
Similarity functions, such as Jaro Winkler and Levenshtein, are usually used to calculate the distance between two data values and assess how similar/dissimilar these values are. For each record in a comparison, agreement patterns are evaluated. Based on these patterns, classification decision is made whether these records belong to the same entity or not.
The Fellegi and Sunter framework gives the mathematical foundation and derives the probabilistic decision rule for classifying records as matches or nonmatches based on agreement pattern. Let’s take an example of two datasets A and B, and see how their records are compared to compute the agreement patterns.
In datasets A and B, consider records a and b such that a ∈ A and b ∈ B.
The data field to be compared – or simply, the matching variable – is present in both datasets A and B. If α and β are two matching variables from dataset A and B respectively, then the aim of the record linkage process is to find
In absence of a single unique identifier, multiple matching variables are used. Let’s say we choose three matching variables from A and B. These are:
Note that γ=0 if a pair matches on that field, and γ=1 if a pair does not match on that field. For example, the agreement pattern for a pair could be (1,0,1), which represents that the pair agrees on last name, disagrees on first name, and agrees on street name.
As field comparisons are performed, these agreement patterns are computed so that they can be used to calculate the likelihood ratio of two records belonging to the same entity. This is done using classifying algorithms which are covered in the next step.
The Fellegi and Sunter framework serves as the basis for many record linkage classifiers, such as Naive Bayes, Support Vector Machine (SVM), and Expectation Conditional Maximization, etc. According to this framework, the agreement likelihood ratio (R(γj)) that an algorithm will correctly classify a pair (j) of true matches as matches and would not incorrectly classify them as a nonmatch is: m/u. Mathematically: R(γj) = m/u, where:
Placing the values of m and u in above equation:
R(γj) = P(γj|M)/P(γj|NM) —– (i)
The likelihood ratio from the above equation gives a numerical value that represents how likely two records belong to the same entity. But for you to classify the records as a match or nonmatch, you need to run record linkage on a test dataset and compute a series of likelihood ratios. This data will allow you to determine the upper cutoff (W+) and a lower cutoff (W−) for classifying records as matches and nonmatches correctly.
These cutoff values highly depend on the type of data your datasets contain. And so, once you have estimated the cutoff values for your dataset, you can first calculate the likelihood ratio of a pair using the above formula and then easily map record pairs to three possible outcomes:
Let’s expand equation (i) to accommodate multiple matching variables, and take log on both sides to reduce skew in the data:
Although the process of record linkage can add huge value to your business intelligence processes, the inaccuracy of record linkage algorithms can cost you big bucks too. This is why it is crucial to choose the method and data attributes that will ensure maximum linkage accuracy on your dataset.
There are four outcomes that are possible in a record linkage decision:
An efficient way of evaluating your record linkage method is to check the following four metrics:
A self-service record linkage software allows business users and corporate professionals to easily link their data records without requiring a steep learning curve or programming languages.
A record linkage software, such as Data Ladder’s DataMatch Enterprise, is designed to help business users to:
and much more.
These tools are becoming the flagship solution to an age-old problem of reliance on complex IT processes to make sense of their data. In an age when automation is the key to business success, businesses cannot afford to lose the bigger part of their big data to the dark.
No credit card required
„*“ zeigt erforderliche Felder an
Oops! Wir konnten dein Formular nicht lokalisieren.
Was ist Adressstandardisierung? Unter Adressstandardisierung versteht man den Prozess der Aktualisierung und Implementierung eines Standards oder Formats für Ihre Adressdaten. Schlechte Adressdaten sind ein komplexes
Im Februar 2020 übergab Facebook einen anonymisierten Datensatz an Social Science One – mit dem Ziel, Erkenntnisse über die Kommunikation und das Verhalten in den
Was ist Adressstandardisierung? Unter Adressstandardisierung versteht man den Prozess der Aktualisierung und Implementierung eines Standards oder Formats für Ihre Adressdaten. Schlechte Adressdaten sind ein komplexes
Im Februar 2020 übergab Facebook einen anonymisierten Datensatz an Social Science One – mit dem Ziel, Erkenntnisse über die Kommunikation und das Verhalten in den
In jeder Art von datenreicher Umgebung ist es einfach, Muster zu finden; das ist es, was mittelmäßige Spieler tun. Der Schlüssel liegt in der Feststellung,