Blog

Why duplicates exist and how to get rid of them?

Q: Why duplicates exist and how to get rid of them?

Data deduplication happens when you store information about the same entity multiple times, instead of updating a single record. In this guide, we will discuss some basic concepts involving data duplication and a list of techniques and algorithms that are commonly used to fix it.

Written by Ehsan Elahi
November 15, 2021

According to Natik Ameen, Marketing Expert at Canz Marketing, duplicate data in the company’s CRM happens due to a range of reasons:

“from a human error to customers providing slightly different information at different points in time in the organizational database. For example, a consumer lists his name as Jonathan Smith on one form and Jon Smith on the other. The challenge is exacerbated by a growing database. It is often increasingly tough for administrators to keep track of DB and as well as track the relevant data. It gets more and more challenging to ensure that organization’s DB remains accurate”.

Data duplication happens when you store information about the same entity multiple times, instead of updating a single record. In this guide, we will discuss some basic concepts involving data duplication and a list of techniques and algorithms that are commonly used to fix it.

Why duplicates exist?

Data has multiple representations – meaning the same data can be represented in different ways. This is the primary reason why duplicate records exist in a database. Whether the records are merged from independent data sources, or entered in a single database over time, both lead to the complex problem of data duplication.

Ideally, every record in a database should represent a single, unique entity. But due to a number of reasons (the most common ones are mentioned below), we notice that an entity’s information spans multiple records.

1. Lack of unique identifiers

Having unique identifiers in your database is the best way to avoid storing duplicates. A unique identifier is a data field that is always unique for an entity (for example, Social Security Number (SSN) for customer data, Manufacturer Part Number (MPN) for product data, etc.). On every new data entry, you can check whether a record with the same unique identifier exists. And if it does, you can simply update or merge it, and avoid storing a new record for the same entity. But if your database does not contain such a unique identifier, the process of relating new incoming entities to existing ones becomes a complex task.

2. Lack of validation checks and integrity constraints

Even with the presence of unique identifiers, you can end up with duplicates in your database. This happens when the unique identifiers do not conform to valid patterns (for example, AAA-GG-SSSS for SSN), or do not have strict integrity constraints (for example, 11 char limit for SSN).

3. Data entry errors

The error rate of data entry is 400 per 10,000 entries, which is a significant number. So, even with the presence of unique identifiers, validation checks, and integrity constraints, there is a chance that human error can intervene and allow duplicates in your database.

Difficulty in data comparison – Data heterogeneity

To get rid of the duplicates in your database, you need to compare the records and evaluate which ones belong to the same entity. But when you compare data records (either in the same database or belonging to different databases), you will notice that they have some systematic differences, which make it difficult to exactly compare them. This is usually known as data heterogeneity.

Broadly speaking, you can classify heterogenous data as:

1. Structural heterogeneity

This type of difference occurs when fields of different databases represent the same information in a structurally different manner. For example, one database could be storing a contact’s name as Contact Name, while in a second database, it is stored in multiple columns such as, Salutation, First Name, Middle Name, and, Last Name.

2. Lexical heterogeneity

This type of difference occurs when fields of different databases are structurally the same, but they represent the same information in a syntonically different manner. For example, two or more databases can have the same Address field, but one can have an address value: 32 E St. 4, while the other can have 32 East, 4^th Street.

Process of data deduplication

Simply put, the process of deduplication involves:

Preparing data by standardizing fields across all databases
Mapping the fields that represent the same information
Choosing an appropriate field matching technique (depending on the nature of data), and then computing similarity between data fields

In upcoming sections, we will go in a bit more detail of the steps mentioned above.

1. Data preparation

The first step in the process is to ensure uniformity across all databases in terms of data structure and fields. This process reduces the structural heterogeneity across the databases – at least to some extent. It involves following two steps:

a. Data cleansing and standardization

This involves removing any errors or variations in formats and data types of values, or in the structure of databases. This can be achieved by:

Parsing long strings to identify important data components. An example of this is when you have the entire address in one field. Parsing the Address field will give you its subsequent data components, such as Street Name, Street Number, ZIP Code, City, State, and Country. Matching becomes quite easier on these parsed elements, as compared to matching on the entire field.
Transforming data values to achieve similar data types, naming conventions, etc. This can be done by converting data types (e.g., string to number), renaming column names, or merging fields together.
Standardizing patterns for all values contained in a data field, so that every value is expected to follow the specified pattern. An example of this is when you standardize the pattern of the Phone Number field to XXX-XXX-XXXX. Hence, making comparisons and matching easier and more accurate.

b. Data field mapping

Once the databases are standardized (as much as possible), the next step is to map fields that represent the same information. This is either done manually (for example, Address to Address, Phone Number to Phone Number, etc.), or running checks to identify which field’s values overlap with the other database’s fields. For smaller datasets, the former technique useful, but if you have large datasets where columns are named differently, the latter is quite helpful.

2. Computing similarity using field matching techniques

Once this is done, now the data is in relatively better shape to compare and identify duplicates. But misspellings, human typographical errors, and conventional variations still exist. This is why exact matching techniques are not useful here, and we need techniques that consider these aspects of data while computing scores to assess similarity between individual values, and hence, the entire record.

a. Character based similarity metrics

Since most typographical errors occur in strings, in this section we will look at the most common techniques for matching similarity between strings.

i. Edit distance

This algorithm calculates the distance between two strings, computed character by character. The distance is calculated by counting the number of edits required to transform the first string into the second string. And then a threshold is defined which classifies two strings as a match (if distance < threshold) or non-match (if distance > threshold). There are three types of edits allowed to calculate the distance: inserting a character into the string, deleting a character from a string, replacing a character with another in the string.

Normally, the count of one edit operation is considered to be ‘1’. But different models propose a different cost of each edit. For example, the Levenshtein distance considers the cost of each edit as 1, while Needleman and Wunsch explained that the cost of each edit depends on the nature of the edit (replacing O with 0 has a smaller cost, then replacing T with M).

ii. Affine gap distance

The edit distance algorithm does not work well with strings that have initials or short forms. For example, the edit distance might classify Jennifer Lily Stevens and Jen L. Stevens as a non-match. This is where the affine gap distance may be useful as it introduces two more edit operations called:

Open gap: it refers to adding a gap (or space) to a string where there was none.
Extend gap: it refers to adding a gap (or space) to a string where there was already a gap.

It is obvious that the cost of opening a gap (where there was none) is more than extending a gap (where there was already a gap). This variation of edit distance lets you calculate the similarity between shortened strings as well.

iii. Smith-Waterman distance

This is another variation of edit distance and affine gap distance. This model lowers the cost of mismatches found in the beginning or ending of strings, since the prefixes and suffixes commonly differ. For example, matching these two strings with S-W distance makes more sense: Dr. Jennifer Lily Stevens and Jennifer Lily Stevens, Doctor at Nat Medical Center.

iv. Jaro distance

Jaro introduced a formula for comparing the similarity between first and last names. The algorithm that calculates Jaro metric is as follows:

Compute the lengths of both strings to compare (S1 and S2).
Identify the number of characters that are common in both strings (C).
Compare each character of the first string to the corresponding character of the second, and calculate each nonmatching character as a transposition (T).
Evaluate Jaro metric as:
Jaro = 1/3 * [ (C/S1) + (C/S2) + ((C-(T/2))/C) ]

The lower the value of Jaro metric, the more similar two strings are.

v. N-gram distance.

This algorithm creates N-lettered substrings from the matching strings, and compares the substrings, rather than the entire word. Let’s take the words Guide and Guode as an example. To calculate 2-gram distance between them, following substrings are created:

Guide = {‘gu’, ‘ui’, ‘id’, ‘de’}
Guode = {‘gu’, ‘uo’, ‘od’, ‘de’}

The similarity is then calculated by assessing the number of substrings that are same. This evidently shows if the user meant to type the same word and if it is just a typographical error.

b. Token based similarity metrics

Token-based similarity metrics come into play when you want to compare strings that are rearranged differently, but mean the same thing. For example, first and last names are usually rearranged, such as Jennifer Stevens is the same as Stevens, Jennifer. But character-based comparison will not be effective for such scenarios. This is where we use token-based similarity metrics.

i. Atomic strings

The most common token-based similarity metric is atomic strings. In this algorithm, the entire string is divided into words delimited by punctuations, such as space, comma, full stop, etc. And then the words are compared to each other, rather than the entire string.

ii. WHIRL

Atomic strings algorithm does not assign any weights to the words during comparison. Because of this, Doctor Jennifer Stevens, and Amanda Tates, Doctor will be considered somewhat similar (since one word is a complete match). WHIRL fixes this problem by assigning relatively lower weights to commonly used words and computes similarity accordingly.

iii. N-grams with WHIRL

WHIRL did not consider misspellings in its similarity comparison algorithm. It was extended to include N-grams comparison technique, so that n-grams were compared instead of whole words (or tokens).

c. Phonetic Similarity Metrics

The character- and token-based algorithms were designed to compare strings that reflected similarity in their character composition. On the other hand, we have other cases where we need compare strings that may not look the same at all but have a very similar sound when they are pronounced. This is where phonetic similarity metrics come in handy. Let’s take a look at the most common techniques for computing phonetic similarity metrics.

i. Soundex

Soundex is commonly used to identify surnames that may be different in spelling, but are phonetically similar. This helps in catching any typographical or spelling mistakes that occurred while entering the data. But the algorithm mostly performs well with English surnames and is not a good choice for names of other origins.

Soundex algorithms computes a code for each string, and compares how similar the codes for two separate strings are. The Soundex code is computed as:

Keep the first letter of the name.
Ignore all occurrences of w and h.
The letters a, e, i, o, u, and y are not coded and are only kept temporarily (as they’ll be dropped completely in the latter step).
Replace following letters with these digits:
1. b, f, p, v → 1
2. c, g, j, k, q, s, x, z → 2
3. d, t → 3
4. l → 4
5. m, n → 5
6. r → 6
If two or more identical digits are present in the code, only keep the first occurrence and drop the rest.
Drop these letters: a, e, i, o, u, and y.
Keep the first letter (from step A.) and the first three digits created. If there are less than three digits, then append zeros.

For example, these two strings ‘Fairdale’ and ‘Faredayle’ output the Soundex code F634, since they are phonetically the same. Soundex proves to be 95.99% accurate while locating similarly sounding surnames.

ii. New York State Identification and Intelligence System (NYSIIS)

As the name suggests, this algorithm was devised for New York State Identification and Intelligence System in 1970 – which is now part of the New York State Division of Criminal Justice Services. Its accuracy rate is 98.72% (2.7% increase from that of Soundex) as it retains details about vowel position in the code (maps them to the letter A). Moreover, consonants are mapped to other alphabets and not to numbers, thus creating a complete alpha code – no numbers involved.

iii. Metaphone, Double Metaphone, and Metaphone 3

Lawrence Philips developed a better rendition of Soundex called Metaphone in 1990. It performed remarkably well as it considered the details of variations and inconsistences that are in English pronunciation and spelling. In his algorithms, he made use of 16 consonant sounds that are used in pronouncing a large library of English and non-English words.

Later on, Philips published a newer version called Double Metaphone, in which he also incorporated details of a number of languages – in addition to English. Finally, in 2009, he developed Metaphone 3, that proved to be 99% accurate for English words, other words familiar to Americans, and first and family names commonly used in the US.

d. Numeric similarity metrics

There are many methods for calculating string-based differences, but for numeric datasets, these methods are limited. Simple numeric differences are usually assessed by calculating how far values are from each other, but for complex computation, the distribution of numeric data can also be considered. Algorithms such as Cosine Similarity can also be used to locate numeric differences.

Which field matching technique to use?

As we just witnessed, the process of finding similarity between two data fields is quite complex. We reviewed multiple data matching techniques but noticed how each of them solves a specific data deduplication problem, and there’s not a single technique that promises to perform well for all data types and formats.

Selecting a matching technique highly depends on these factors:

Nature of your data – or the data type. For example, Jaro distance performs well for strings, but cosine similarity is broadly used for numeric datasets.
Kind of duplicates that are present in your dataset. For example, typos and spelling mistakes are better tracked using character-based similarity metrics, while differently formatted fields are better matched using token-based similarity metrics.
Domain of your data. For example, if you are matching English first or surnames, then Metaphone works well, but if non-English names are also involved in your dataset, then using Double Metaphone or Metaphone 3 makes more sense.

Automating the deduplication process

Understanding the internals of data matching techniques and choosing an appropriate one for your dataset is a difficult task. In many situations, one technique is not enough, and a combination of techniques are used to accurately dedupe data. For this reason, the need for digital tools is increasing. Tools that not only optimize time and effort, but also intelligently select data matching techniques depending on the nature of your data structure and values.

DataMatch Enterprise is one such tool that handles your entire data quality process from beginning till end. It offers a range of modules that support data coming in from different sources, enable field mapping, and suggest a combination of match definitions that are specific to your data. You can use the suggested matching fields and algorithms or override by selecting your own. The tool can also be used to assess match accuracy of different match techniques on your dataset, and conclude which algorithm performs well.

To know more, sign up for a free trial today or set up a demo with one of our experts, and start deduping your data!

Try data matching today

No credit card required

"*" indicates required fields

Want to know more?

Check out DME resources

Oops! We could not locate your form.

BY FEATURE

BY USE CASE

BY INDUSTRY

OUR PRODUCTS

ABOUT US

CUSTOMERS

COMPARE

INSIGHTS

SUPPORT