92% of organizations claim that their data sources are full of duplicate records. What makes this worse is that each duplicate of a single entity often carries valuable information of its own. For example, customer names may be more complete in the first data source, while demographics are more accurate in the second, and contact information is more standardized in the third.
Today, organizations deal with multiple data source applications and depend on their teams to select the best entity record on the go. But this approach reduces their team’s productivity and efficiency – since every person who needs information about a customer must first evaluate all sources, and then build the best customer golden record that is usable for the intended purpose. Plus, human bias and error can lead to inconsistent results.
Matching data to find which records belong to the same entity is half the battle; the other half is deciding which record should survive as the main record. This blog covers everything you need to know about building survivorship rules and conditionally overwriting records to attain the golden record. Let’s begin.
What is data survivorship?
Data survivorship is defined as:
The process of identifying the best pieces of information (or attributes) across duplicate records of an entity and merging them to attain a single comprehensive view.
The single comprehensive view that is created after merging duplicate records of an entity is called the golden record since it contains all necessary information about that entity. Due to the increasing number of applications used to facilitate business processes, key properties about customers, products, payments, etc. are dispersed across sources and channels. This is where techniques like data survivorship can help figure out the best surviving view of your data assets.
Matching, deduplication, merging, and survivorship: What’s the difference?
Multiple data quality processes must be executed to attain a clean, standardized, and unique view of your organizational data. It is important to understand the role each process plays in achieving the desired results so that the most optimal data quality framework can be designed.
Data survivorship is often overlooked or folded into other processes, such as data matching (a.k.a. record linkage or entity resolution), data deduplication, or data merging. The truth is that these processes can share the same purpose (getting rid of duplicates), but each one generates results slightly differently. The table below shows the distinction:
| Process | What it does | Output |
|---|---|---|
| Data matching | Comparing two or more records and computing the probability of them belonging to the same entity. | Groups of matching records. |
| Data deduplication | Removing duplicates within or across datasets; it happens after data matching and can be performed to remove duplicate rows or duplicate columns. | A dataset that is free of duplicates: no multiple records for the same entity and no multiple properties representing the same information about an entity. |
| Data merging | Aggregating rows and columns of two or more datasets into one. | A dataset that contains more information than the input data sources, though not necessarily a complete, comprehensive view of the entities involved. |
| Data survivorship | Identifying the best pieces of information (or attributes) across duplicate records of an entity and merging them to attain a single comprehensive view. | A complete dataset in which each entity (for example, a customer) has one unique record carrying all the properties required for smooth operation. This resulting dataset is called the golden record. |
Common approaches for selecting the golden ‘surviving’ record
The data survivorship technique helps you choose the best golden record, or assemble one when no single record is as comprehensive as needed. This selection depends on a number of factors, such as the quality of your data sources as well as the quality of the data attributes stored in them. Four approaches are commonly used to select or assemble the golden record.
1. Manual review and selection
The most basic approach is to manually review the records belonging to the same entity and select the one that is most comprehensive in terms of data attributes. Although this approach yields accurate results, it is rarely feasible for organizations with multiple sources and millions of records. Plus, human bias and error can lead to inconsistent results.
2. Contending data sources
The second approach is pretty simple: retain the duplicate record that comes from a preferred data source. You can prioritize sources based on a quality score, computed by measuring each dataset against a list of data quality metrics. Scoring sources this way helps reveal which ones have the most complete, accurate, and standardized data values. Obviously, this approach breaks down when the duplicates belong to the same data source; the third approach may be useful for such cases.
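A minimal sketch of this approach: score each source by the completeness of its values (one of several possible quality metrics) and keep the duplicate from the best-scoring source. The field names and source labels here are hypothetical.

```python
from typing import Dict, List

def completeness(records: List[dict]) -> float:
    """Fraction of non-empty field values across a source's records.
    Real quality scores would also weigh accuracy and standardization."""
    cells = [v for r in records for v in r.values()]
    return sum(1 for v in cells if v not in (None, "")) / len(cells)

def pick_by_source(dupes: Dict[str, dict], scores: Dict[str, float]) -> dict:
    """Keep the duplicate that comes from the highest-scoring source."""
    best_source = max(dupes, key=lambda s: scores[s])
    return dupes[best_source]

# Hypothetical duplicates of one customer, tagged by source system.
crm = [{"name": "John Oneel", "phone": ""}]
erp = [{"name": "Jonathan O'neil", "phone": "5749864455"}]

scores = {"crm": completeness(crm), "erp": completeness(erp)}
golden = pick_by_source({"crm": crm[0], "erp": erp[0]}, scores)
```

Here the ERP source wins because all of its fields are populated, so its duplicate survives.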
3. Contending data attributes
The third approach is most commonly used: you decide which record to select as the golden one depending on the values of certain data attributes. For example, in a group of four matching records, you may want to retain the one that has the most complete customer name. Some organizations may use a combination of attributes for this selection.
Continuing the example, suppose three of the records share the same customer name. You then choose another attribute and draw comparisons: which duplicate record has a verified phone number? That may take you down to two records. Weighing in more attributes and chaining the rules in sequence can help determine the most comprehensive golden record for that customer.
4. Overwriting: Best of all worlds
The fourth and final approach goes one step further: instead of simple golden record selection from a group of duplicates, it allows you to assemble your own golden record by overwriting the most comprehensive values of data attributes onto the master record. It offers a way to retain the best information from all duplicate records and ensure that nothing is lost.
But this in itself can be quite challenging. For example, you may end up overwriting valuable data with empty values. For this reason, it requires you to design condition-based overwriting rules: overwrite when certain conditions are met, and skip overwriting when another set of conditions is met.
How to build survivorship rules to get the golden record?
Now that we know the most common approaches used to attain the golden record, we will see what the data survivorship process actually looks like.
1. Prerequisite: Groups of matching records
The first step of the process is to get matching or duplicate records. This may be a challenge depending on the information you store about your entities. For example, you may have attributes in your dataset that uniquely identify an entity – social security numbers for customers or manufacturing part numbers for products, and so on. In this case, you can just compare records based on this unique attribute and find the ones belonging to the same entity.
In the absence of unique attributes, you will have to run complex data matching algorithms to estimate the likelihood that two or more records belong to the same entity. Check out our definitive guide to data matching to understand how to compare records and identify duplicates. In summary, this step involves the following activities:
- Parsing, cleaning, and formatting dataset values to achieve a uniform and standardized view,
- Selecting data attributes to base match algorithms on,
- Mapping match fields across datasets,
- Configuring match algorithms based on the nature of data values,
- Executing match algorithms and assessing results,
- Retuning match algorithms to reduce false positives and negatives.
The output of this step is groups of matching records.
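The steps above can be sketched with Python's standard library alone: normalize the match field, then group records whose normalized values are similar. This is a naive O(n²) illustration with a hypothetical `name` field and an arbitrary similarity threshold; production tools use blocking/indexing and tuned algorithms to scale.

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Parse/clean step: lowercase and strip punctuation and spacing."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def match_groups(records, threshold=0.6):
    """Group records whose normalized names are similar enough.
    Each record is compared to the first member of every group."""
    groups = []
    for rec in records:
        for group in groups:
            rep = normalize(group[0]["name"])
            sim = SequenceMatcher(None, rep, normalize(rec["name"])).ratio()
            if sim >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])  # no group matched: start a new one
    return groups

records = [
    {"name": "John Oneel"},
    {"name": "Jonathan O'neil"},
    {"name": "Mary Smith"},
]
groups = match_groups(records)
```

The output mirrors this step's result: groups of matching records, with the two spellings of the same customer grouped together and the unrelated record on its own.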
2. Select best approach for golden record selection
Once you have the match results, it’s time to decide which approach will work best for golden record selection. To make this decision, you must assess the state of your data. During assessment, you are looking to find answers to questions like:
- Does any data source have an overall better data quality score?
  This can be true for a data source that has more complete, accurate, and standardized data values.
- Do certain attributes play a major role in the golden record?
  If so, select the golden record based on how complete, accurate, and standardized that specific attribute is compared to the other records in the match group.
- Do you want to retain information and get the best of all records?
  If so, select the golden record (either based on a source or an attribute) and then also run conditional overwriting rules to get the best of all matching records.
3. Create rules to select the best record as the golden record
For the sake of explanation, let’s say you want to compare data attributes to select the golden record. As discussed before, you can base the selection on a single attribute or choose a combination of attributes that reveal the best record. Moreover, the values can be compared with a set of operations, such as:
- The longest or the shortest value wins – in terms of number of characters.
- The maximum or the minimum value wins – used for numeric data or datetime values.
- The most frequently occurring value wins.
- The most recently updated value wins – this requires metadata information.
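These four comparators can each be written as a small function that takes the candidate values for one attribute and returns the surviving value. A minimal sketch (the `most_recent` variant assumes update timestamps are available as metadata):

```python
from collections import Counter
from datetime import datetime

def longest(values):
    """Longest value wins, measured in number of characters."""
    return max(values, key=len)

def shortest(values):
    """Shortest value wins."""
    return min(values, key=len)

def most_frequent(values):
    """Most frequently occurring value wins."""
    return Counter(values).most_common(1)[0][0]

def most_recent(values_with_ts):
    """Most recently updated value wins; expects (value, updated_at) pairs."""
    return max(values_with_ts, key=lambda vt: vt[1])[0]
```

Numeric or datetime attributes would use plain `max`/`min` for the maximum/minimum comparators.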
Once the set of data attributes and the comparator for each attribute are chosen, you can build a sequenced set of survivorship rules to attain the golden record. Let’s take a look at an example. A group of 4 matching records belonging to the same customer is shown in the table below.
| No. | First Name | Last Name | Address | State | Phone | Birthdate |
|---|---|---|---|---|---|---|
| 1 | John | Oneel | 123 Avenue Street | New | 5749864455 | 10/11/1988 |
| 2 | Jonathan | O’neil | 123 Av St. | New York | 9864455 | Oct-11 |
Here is a prioritized list of rules that compare data values to select the golden record.
| Data attributes to compare | Comparator | Result |
|---|---|---|
| First Name | Longest | Records 2, 3, and 4 remain |
| Last Name | Longest | Records 2 and 3 remain |
| Birthdate | Minimum (oldest) | Record 3 is selected as the golden record |
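A prioritized rule cascade like this can be sketched as a loop that narrows the candidate set one rule at a time until a single record survives. The record values below are hypothetical illustrations, not the article's actual records.

```python
def survivorship_cascade(group, rules):
    """Apply prioritized (attribute, comparator) rules in sequence,
    keeping only records whose value matches the winner at each step."""
    candidates = list(group)
    for attr, pick in rules:
        best = pick(r[attr] for r in candidates)
        candidates = [r for r in candidates if r[attr] == best]
        if len(candidates) == 1:
            break  # a unique survivor: the golden record
    return candidates[0]

# Hypothetical match group (illustrative values only).
group = [
    {"first": "John",     "last": "Oneel",  "birth": "1990-05-01"},
    {"first": "Jonathan", "last": "O'neil", "birth": "1990-02-02"},
    {"first": "Jonathan", "last": "O'neil", "birth": "1988-10-11"},
]
rules = [
    ("first", lambda vs: max(vs, key=len)),  # longest first name wins
    ("last",  lambda vs: max(vs, key=len)),  # longest last name wins
    ("birth", min),                          # oldest (minimum) date wins
]
golden = survivorship_cascade(group, rules)
```

The first rule eliminates the short "John" record, the second cannot break the tie, and the third selects the record with the oldest birthdate as the golden record.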
4. Build rules to conditionally overwrite the golden record
With the golden record selected, you can see that an accurate phone number and a more complete address were left behind in the discarded duplicates. This is where you need conditional overwriting rules to get the best out of your data.
To design overwriting rules, you need to find out four things:
- Which fields to overwrite?
- Which fields to overwrite from?
- When to overwrite them?
- When not to overwrite them?
The answer to the first question comes from reviewing the match groups, the selected golden record, and the duplicates left out. Suppose you noticed that the discarded duplicates had more accurate phone numbers and more complete Address and State values. In that case, you would choose to overwrite these three fields.
You can use the same set of operations that were discussed in the previous section to select fields to overwrite from (longest/shortest, maximum/minimum, etc.). For example, you can choose to overwrite the Address of the golden record from the longest Address in the duplicate records.
For conditional logic that decides when to overwrite and when not to, here are some common operations used:
- Overwrite to is empty
- Overwrite to is not empty
- Overwrite from is empty
- Overwrite from is not empty
- Record is flagged – You can flag records that require special attention.
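A minimal sketch of one such conditional rule: copy the longest value found among the duplicates onto the golden record, skip duplicates whose source value is empty or which are flagged, and keep the golden record's own value if it is already the best. The field names, the flag handling, and the "longest wins" choice are illustrative assumptions.

```python
def overwrite_field(golden, dupes, field, flagged=frozenset()):
    """Conditionally overwrite `field` on the golden record with the
    longest value among duplicates; skip empty or flagged sources."""
    candidates = [
        r[field] for i, r in enumerate(dupes)
        if r.get(field) and i not in flagged  # "do not overwrite when
    ]                                         #  overwrite from is empty"
    if not candidates:
        return golden                         # nothing safe to copy
    best = max(candidates, key=len)
    if len(best) > len(golden.get(field) or ""):
        golden[field] = best                  # only overwrite with a
    return golden                             # more complete value

golden = {"address": "123 Av St.", "phone": ""}
dupes = [
    {"address": "123 Avenue Street", "phone": ""},
    {"address": "", "phone": "5749864455"},
]
overwrite_field(golden, dupes, "address")
overwrite_field(golden, dupes, "phone")
```

After both calls, the golden record carries the fuller address from the first duplicate and the phone number from the second, without ever copying an empty value over existing data.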
Let’s design a prioritized list of overwriting rules for the same example used in the previous section:
| Overwrite to | Overwrite from | Overwrite when | Do not overwrite when | Result |
|---|---|---|---|---|
| Phone Number | Longest | | Overwrite from is empty | Phone Number of Record 4 is overwritten to Record 3 |
| Address | Longest | | Overwrite from is empty | Address of Record 1 is overwritten to Record 3 |
| State | Longest | | Overwrite from is empty | State of Record 2 is overwritten to Record 3 |
5. Execute and review results
Since survivorship rules have a lot of configurations, you may not get the best results on the first run. You will have to spend some time understanding which sets of rules work best with your data. Reconfiguring, re-executing, and reevaluating the results can help attain the perfect golden record for each customer. For example, here’s the golden record that we selected and assembled based on the survivorship and overwriting rules designed above:
| First Name | Last Name | Address | State | Phone | Birthdate |
|---|---|---|---|---|---|
| Jonathan | O’neal | 123 Avenue Street | New York | 15749864455 | 10/11/1988 |
Utilizing data survivorship tools to get the golden record
There is no doubt that data survivorship is a complex process. And if we weigh in the other processes that are crucial for its success (such as data cleansing, standardization, and matching), it can get overwhelming pretty quickly. This is where an all-in-one tool with a robust engine for cleaning, matching, and merging data can change the game for your organization.
Tools like DataMatch Enterprise not only allow you to make all the survivorship configurations discussed in this blog, but also to reuse them whenever needed. With unprecedented accuracy and speed, you can get your golden records from 2 million records in less than 13 minutes. This helps you easily go back and forth between your matching and merging configurations to get the best results.
Want to see our product in action? You can download the free trial today, or schedule a personalized session with our experts to understand how our product can help you get the most out of your data.