Record linkage (RL) is the process of finding records that refer to the same entity, such as a customer, across data sets. The records can describe people, firms, books, and so on. Record linkage can improve the quality and integrity of data and allows existing data sources to be reused. When working with data from diverse sources, whether from reviews, internal data, external data vendors, or web scrapes, we often want to link individuals or firms across the datasets. Unfortunately, we almost never start with perfectly clean data. Even in structured data, people make careless errors such as transposing letters in names, individual values are recorded incorrectly, and measurement error affects the results. Many things can go wrong before we get to dig into the data. Sometimes, perhaps even more annoyingly, different sources simply use different names for the same entity. These small differences make it hard to merge the data on unique identifiers alone. Record linkage has become an important discipline in computer science and in Big Data.
Regardless of which linkage method is used, the linkage procedure is typically preceded by a data cleaning stage. Data cleaning (sometimes called standardization or data cleansing) involves modifying, removing, or otherwise altering fields based on their values. The new values are expected to improve the quality of the data and so be more useful in the linkage process. There are two types of record linkage:
Deterministic record linkage generates links based on the number of individual identifiers that are equal across the available data sets. Two records are considered a match under a deterministic linkage procedure if all, or a specified number, of the identifiers are identical.
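A minimal sketch of deterministic matching, assuming records are plain dictionaries. The field names, the sample records, and the `min_equal` parameter are hypothetical and chosen only for illustration.

```python
def deterministic_match(rec_a, rec_b, identifiers, min_equal=None):
    """Return True if at least `min_equal` identifiers are exactly equal.

    By default every identifier must agree. Missing values (None) never
    count as agreement.
    """
    if min_equal is None:
        min_equal = len(identifiers)
    equal = sum(
        1
        for key in identifiers
        if rec_a.get(key) is not None and rec_a.get(key) == rec_b.get(key)
    )
    return equal >= min_equal


# Hypothetical records that differ only in the postal code.
a = {"first_name": "ann", "last_name": "lee", "dob": "1990-01-02", "zip": "10001"}
b = {"first_name": "ann", "last_name": "lee", "dob": "1990-01-02", "zip": "10002"}
keys = ["first_name", "last_name", "dob", "zip"]

strict = deterministic_match(a, b, keys)                # all four must agree
relaxed = deterministic_match(a, b, keys, min_equal=3)  # any three must agree
```

With the strict rule the pair is rejected because of the single differing field, while the relaxed rule accepts it. This illustrates how sensitive exact matching is to small data errors.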
Probabilistic record linkage, sometimes called fuzzy matching, takes a different approach to the record linkage problem: it considers a wider range of potential identifiers, computes a weight for each identifier based on its estimated ability to correctly identify a match or a non-match, and uses these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered matches, while pairs with probabilities below another threshold are considered non-matches; pairs that fall between the two thresholds are considered “possible matches” and can be dealt with accordingly, for example by manual review.
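One common way to realize this weighting is the Fellegi-Sunter approach, which the text above does not name explicitly; the sketch below uses it as an assumption. Each field has an m-probability (chance of agreement when the records truly match) and a u-probability (chance of agreement when they do not). The m/u values, field names, and thresholds are hypothetical, and the summed log weight stands in for the match likelihood.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only.
M_U = {
    "last_name": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "zip": (0.90, 0.05),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum the log2 agreement or disagreement weight of every field."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Apply two thresholds: match, non-match, or possible match in between."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


x = {"last_name": "lee", "dob": "1990-01-02", "zip": "10001"}
y = {"last_name": "lee", "dob": "1990-02-01", "zip": "10001"}  # dob differs
z = {"last_name": "ray", "dob": "1985-07-30", "zip": "94110"}  # all fields differ

labels = [classify(match_weight(x, other)) for other in (x, y, z)]
```

In practice the m and u values are estimated from the data rather than set by hand, and the thresholds are tuned to the acceptable rates of false links and missed links.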
Challenge in Record Linkage
A major challenge in record linkage is the lack of common entity identifiers across the different source systems to be matched. As a result, matching has to be carried out using attributes that contain partially identifying information, such as names, addresses, or dates of birth. However, such identifying information is often of low quality: it suffers from frequent typographical variations and human errors, it can change over time, and it may be only partly available in the sources to be matched. During the past decade, substantial advances have been made in different aspects of the record linkage process, particularly in improving the accuracy of data matching and in scaling data matching to very large systems that contain millions of records.
Data Quality and Data Cleansing
The process of data cleansing includes removing rejected, outdated, or incorrect data. Clean data is a critical element of correct information, reports, and analysis. Throughout an organization, people make business decisions based on the data provided to them. Data cleansing yields high-quality data, which helps organizations address fraud and comply with regulations. High-quality data about key business entities supports the growth of a successful enterprise.
By using data cleansing techniques, organizations can quickly match and identify duplicates in their data. Clean customer records enable effective sales and advertising and help the organization grow. Imagine contacting the same client multiple times just because of duplicate entries in the system: costly and time-consuming for the sales and support staff, difficult for the data analyst, cumbersome for the BI developer, and frustrating for the customer. Poor data quality also damages brand value and hurts the customer experience.
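As a rough illustration of how standardization exposes duplicates, the sketch below normalizes a name and an email address and groups records that share the same cleaned key. The field names and sample customers are hypothetical, and the name cleaning assumes ASCII input.

```python
import re
from collections import defaultdict


def standardize_name(name):
    """Lowercase, drop punctuation, and collapse repeated whitespace.

    Assumes ASCII names; accented letters would be stripped by this pattern.
    """
    name = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(name.split())


def find_duplicates(records):
    """Group record indices that share the same standardized name and email."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (standardize_name(rec["name"]), rec["email"].strip().lower())
        groups[key].append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical customer entries; the first two describe the same person.
customers = [
    {"name": "Ann  Lee", "email": "Ann.Lee@example.com "},
    {"name": "ann lee.", "email": "ann.lee@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]

duplicates = find_duplicates(customers)
```

Exact grouping on cleaned keys only catches duplicates that become identical after standardization; records that still differ, for example through typos, need the similarity-based comparison discussed in the following sections.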
Key Attribute selection in Record Linkage
This involves choosing the attributes that best distinguish between two similar individuals. For records about individuals, first name, last name, address, and email are the key attributes. The objective is to build, for each pair of records, a “comparison vector” of similarity scores, one per attribute. Similarity scores can simply be Boolean (match or non-match), or they can be real values computed with distance functions.
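A comparison vector for one pair of records might be built as in the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as one possible real-valued similarity function; that choice, the field names, and the sample records are assumptions made for illustration.

```python
from difflib import SequenceMatcher

FIELDS = ["first_name", "last_name", "address", "email"]


def similarity(a, b):
    """Real-valued similarity in [0, 1], ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a, rec_b, fields=FIELDS, boolean=False):
    """Return one score per field: exact-match flags or similarity ratios."""
    scores = []
    for field in fields:
        if boolean:
            same = rec_a[field].lower() == rec_b[field].lower()
            scores.append(1.0 if same else 0.0)
        else:
            scores.append(similarity(rec_a[field], rec_b[field]))
    return scores


# Hypothetical pair that differs only in the spelling of the first name.
p = {"first_name": "Jon", "last_name": "Smith",
     "address": "12 Oak St", "email": "jsmith@example.com"}
q = {"first_name": "John", "last_name": "Smith",
     "address": "12 Oak St", "email": "jsmith@example.com"}

bool_vector = comparison_vector(p, q, boolean=True)
real_vector = comparison_vector(p, q)
```

The Boolean vector treats "Jon" and "John" as a plain mismatch, while the real-valued vector records that the two spellings are close, which is the information a later scoring step can exploit.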
It is useful to develop and test the record linkage and data processing programs on small data samples before applying them to the whole data set, since data sets are usually large and processing them takes a lot of time and computation. Working on samples makes it easier to tune the algorithms and the linkage process, because turnaround time drops considerably during testing. It is important that the sample be representative of the actual data set.
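One simple way to draw such a development sample is uniform random sampling with a fixed seed, sketched below. The function name, the default fraction, and the seed are hypothetical. Uniform sampling is only an approximation of representativeness; rare groups may need stratified sampling instead.

```python
import random


def draw_sample(records, fraction=0.01, seed=42):
    """Return a reproducible uniform random sample of the records.

    A fixed seed makes test runs repeatable while algorithms are tuned.
    """
    rng = random.Random(seed)
    size = min(len(records), max(1, int(len(records) * fraction)))
    return rng.sample(records, size)


# Hypothetical data set of 1000 record ids.
all_records = list(range(1000))
sample = draw_sample(all_records, fraction=0.01)
```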
After constructing the vector of component-wise similarities for a pair of records, the next step is to compute the probability that the pair is a match. There are numerous methods for determining the likelihood of a match. Two simple methods are to use a weighted sum or an average of the component similarity scores. Another simple method is rule-based matching, although writing the rules by hand is difficult. The similarity scores themselves are usually produced by string-matching algorithms, such as edit distance and other fuzzy string matching algorithms.
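The weighted-average method combined with an edit-distance similarity can be sketched as follows. The field weights and sample records are hypothetical. Note that the result is a similarity score between 0 and 1, not a calibrated probability.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Scale edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def weighted_score(rec_a, rec_b, weights):
    """Weighted average of per-field edit similarities."""
    total = sum(weights.values())
    return sum(
        w * edit_similarity(rec_a[field], rec_b[field])
        for field, w in weights.items()
    ) / total


# Hypothetical weights: the last name counts most, the postal code least.
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "zip": 0.2}

r1 = {"first_name": "jon", "last_name": "smith", "zip": "10001"}
r2 = {"first_name": "john", "last_name": "smyth", "zip": "10001"}
score = weighted_score(r1, r2, WEIGHTS)
```

A pair would then be declared a match when its score exceeds a chosen cut-off; the cut-off itself has to be tuned on data with known matches.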
The quality of record linkage can be measured along the following dimensions:
- The number of record pairs linked correctly (true positives)
- The number of record pairs linked incorrectly (false positives, Type I error)
- The number of record pairs unlinked correctly (true negatives)
- The number of record pairs unlinked incorrectly (false negatives, Type II error)
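These four counts are commonly summarized as precision, recall, and F1. Those derived measures are standard definitions added here for illustration and are not part of the list above; the counts in the example are made up.

```python
def linkage_quality(tp, fp, tn, fn):
    """Derive summary measures from the four record-pair counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of links that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true matches found
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts from evaluating 1000 candidate pairs.
quality = linkage_quality(tp=80, fp=20, tn=890, fn=10)
```

Accuracy, (tp + tn) / total, is usually less informative for record linkage because true non-matches vastly outnumber true matches among all candidate pairs.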