Dirty, unstructured data, a dozen or more name variations, and inconsistent field definitions across disparate sources: this can of worms is a familiar occupational hazard for any data analyst working on a project involving thousands of records. And the implications are anything but ordinary:
- Global financial institutions were fined $5.6 billion in 2020 for failing to meet compliance regulations
- A survey from Black Book Market Research found that poor patient matching led to a third of claims being denied in healthcare organizations
- Sales representatives lose 25% of their time due to bad prospect data
So, here’s the key question: Is there a better way of overcoming these problems?
Entity resolution tools can ingest data from multiple points and find non-exact matches at speed. Manually resolving entities with complex algorithms and techniques, by contrast, proves to be a far costlier (not to mention exhausting) endeavor. Research from Gartner has found that bad data quality costs companies $15 million every year, a burden felt especially by those with operations spanning multiple territories and business units.
This detailed guide will walk you through entity resolution, how it works, why manual entity resolution is problematic for enterprises, and why opting for entity resolution tools is optimal.
What is Entity Resolution?
The book Entity Resolution and Information Quality describes entity resolution (ER) as ‘determining when references to real-world entities are equivalent (refer to the same entity) or not equivalent (refer to different entities)’.
In other words, it is the process of identifying and linking records that refer to the same entity even when they are described differently, and of telling apart records that look alike but refer to different entities.
For example, it asks the question: are data entries ‘Jon Snow’ and ‘John Snowden’ the same person or are they two different people entirely?
This also applies to addresses, postal and zip codes, social security numbers, etc.
ER is done by assessing the similarity of multiple records and checking them against unique identifiers. These are fields that are least likely to change over time (such as social security numbers, dates of birth, postal codes, etc.).
For example, records for John Oneil, Johnathan O and Johny O’neal can all be matched to one entity through a shared unique identifier, the national ID number.
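The identifier-based matching described here can be sketched in a few lines of Python. This is only an illustration: the `national_id` field name, the ID values, and the `group_by_identifier` helper are all made up for the example and do not come from any particular tool.

```python
from collections import defaultdict

# Hypothetical records; the ID values are invented for illustration.
records = [
    {"name": "John Oneil", "national_id": "A-1001"},
    {"name": "Johnathan O", "national_id": "A-1001"},
    {"name": "Johny O'neal", "national_id": "A-1001"},
    {"name": "Jon Snow", "national_id": "B-2002"},
]

def group_by_identifier(rows, key="national_id"):
    """Group the names of records that share the same unique identifier."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["name"])
    return dict(groups)

entities = group_by_identifier(records)
# The three spelling variants end up under one identifier; 'Jon Snow' stays separate.
```

A grouping like this only works when the identifier is present and accurate in every record, which is rarely the case in practice.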
ER usually consists of linking and matching data across multiple records to find possible duplicates and removing the matched duplicates, which is why the term is often used interchangeably with related terms such as record linkage and deduplication.
How Entity Resolution Works in Practice
There are several steps involved in an ER activity. Let’s look at these in more detail.
The first step involves bringing all data from multiple sources under one centralized view. An enterprise often has data scattered across disparate databases, CRMs, Excel files and PDFs, and stored in different data formats, including strings, dates, or a mix of both.
For example, a large mortgage and financial services company can have a central database in MySQL, claims forms data in PDF and its homeowners list in Excel. Importing data from all these sources will help set the stage for linking records and finding duplicates.
In other cases, combining different sources into one can also mean changing the schema of the databases into one predefined schema for further processing.
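As a rough sketch of what mapping sources onto one predefined schema can look like, the Python below renames fields from two hypothetical sources into a shared set of field names. The source names, column names, and the `to_target_schema` helper are assumptions made for illustration only.

```python
# Hypothetical mappings from each source's column names to one target schema.
FIELD_MAPS = {
    "mysql_customers": {
        "cust_name": "full_name",
        "dob": "birth_date",
        "zip": "postal_code",
    },
    "excel_homeowners": {
        "Owner": "full_name",
        "Date of Birth": "birth_date",
        "ZIP Code": "postal_code",
    },
}

def to_target_schema(row, source):
    """Rename a source row's fields to the shared schema.

    Fields without a mapping are dropped, so information can be lost here.
    """
    mapping = FIELD_MAPS[source]
    return {target: row[src] for src, target in mapping.items() if src in row}

unified = [
    to_target_schema(
        {"cust_name": "Mike Rogers", "dob": "1980-01-02", "zip": "07030"},
        "mysql_customers",
    ),
    to_target_schema(
        {"Owner": "Michael Rogers", "ZIP Code": "07030"},
        "excel_homeowners",
    ),
]
```

Once both rows use the same field names, they can be compared field by field in the later matching steps.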
After the data sources are imported, the next step is to check their health and identify statistical anomalies such as missing or inaccurate data and casing issues (i.e., inconsistent lowercase and uppercase). Ideally, a data analyst will try to find potential problem areas that need to be fixed before doing any kind of cleaning and entity resolving.
Here a user may want to check whether the fields conform to regular expressions (RegEx) that define the expected string pattern for each data field. Based on this, the user can determine how many records are either unclean or don’t conform to a set encoding.
Doing so can help reveal crucial data statistics including but not limited to:
- Presence of null values e.g., missing email addresses in lead gen forms
- Number of records with leading and trailing spaces e.g. ‘ David Matthews ’
- Punctuation issues e.g. hotmail,com instead of hotmail.com
- Casing issues e.g. nEW yORK, dAVID mATTHEWS, MICROSOFT
- Presence of letters in numbers and vice versa e.g. TEL-516 570-9251 for contact number and NJ43 for state
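The checks listed above can be approximated with a small profiling function. The sketch below is a simplified assumption of how such a profile might be computed: the regular expressions, the `profile` function, and its counters are illustrative, and real patterns would need tuning per field and per country.

```python
import re

# Assumed patterns for illustration; real projects would tune these per field.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\d{3}[- ]\d{3}[- ]\d{4}$")

def profile(values, pattern=None, expect_title_case=False):
    """Count common data-quality issues in one column of string values."""
    stats = {"null": 0, "padded": 0, "bad_casing": 0, "invalid": 0}
    for value in values:
        if value is None or value.strip() == "":
            stats["null"] += 1
            continue
        if value != value.strip():
            stats["padded"] += 1  # leading or trailing spaces
        text = value.strip()
        if expect_title_case and not text.istitle():
            stats["bad_casing"] += 1
        if pattern is not None and not pattern.match(text):
            stats["invalid"] += 1  # does not conform to the expected pattern
    return stats

names = [" David Matthews ", "dAVID mATTHEWS", "MICROSOFT", None]
emails = ["john@hotmail,com", "jane@hotmail.com", ""]
phones = ["TEL-516 570-9251", "516 570-9251"]

name_stats = profile(names, expect_title_case=True)
email_stats = profile(emails, pattern=EMAIL_RE)
phone_stats = profile(phones, pattern=PHONE_RE)
```

Counts like these tell the analyst which columns need cleaning before any matching is attempted.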
Deduplication and Record Linking
Through matching, multiple records that potentially relate to the same entity are joined, and duplicates are removed using unique identifiers. The matching technique (exact, fuzzy, or phonetic) can vary depending on the type of field.
For names, for instance, exact match is often used where unique identifiers such as SSN or address are accurate in the entire dataset. If the unique identifiers are inaccurate or invalid, fuzzy matching proves to be a much more reliable form of matching to easily pair two similar records (e.g., John Snow and Jon Snowden).
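To make the contrast between exact and fuzzy matching concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. The similarity ratio it returns is just one possible score, and the 0.75 threshold is an arbitrary assumption chosen for the example, not a recommended value.

```python
from difflib import SequenceMatcher

def exact_match(a, b):
    """True only when the two values are identical after basic normalization."""
    return a.strip().lower() == b.strip().lower()

def fuzzy_score(a, b):
    """Similarity between 0 and 1, computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def is_fuzzy_match(a, b, threshold=0.75):
    """Treat two values as a candidate match when the score meets the threshold."""
    return fuzzy_score(a, b) >= threshold

# 'John Snow' and 'Jon Snowden' fail the exact test but pass the fuzzy one.
exact = exact_match("John Snow", "Jon Snowden")
fuzzy = is_fuzzy_match("John Snow", "Jon Snowden")
unrelated = is_fuzzy_match("John Snow", "Mary Smith")
```

Note that a fuzzy match only flags a candidate pair. Whether the two records really describe the same person still has to be confirmed with other fields.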
Deduplication and record linking are, in most cases, understood to be one and the same thing. However, a key difference is that the former is about detecting duplicates and consolidating them within the same dataset (i.e. normalizing the schema), while the latter is about matching the deduplicated data across other datasets or data sources.
Canonicalization is another key step in ER, where entities that have multiple representations are converted into a standard form. It involves taking the most complete information as the final record and leaving out outliers or noisy data that could distort it.
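One simple way to picture canonicalization is to merge a cluster of matched records into a single record, field by field. In the sketch below, the rule of keeping the longest non-empty value as a stand-in for "most complete" is an assumption made for illustration; real survivorship rules are usually more nuanced.

```python
def canonicalize(cluster):
    """Build one record from a cluster of matched records.

    For each field, keep the longest non-empty value as a simple proxy for
    'most complete'. This is an assumed rule, not a standard one.
    """
    fields = {key for record in cluster for key in record}
    golden = {}
    for field in sorted(fields):
        candidates = [
            record[field].strip()
            for record in cluster
            if record.get(field) and record[field].strip()
        ]
        golden[field] = max(candidates, key=len) if candidates else None
    return golden

cluster = [
    {"name": "Mike Rogers", "email": "", "phone": "516 570-9251"},
    {"name": "Michael Rogers", "email": "mike@example.com", "phone": None},
]
golden = canonicalize(cluster)
```

Here the merged record takes the fuller name and the email from the second record and the phone number from the first.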
When finding matches for an entity across hundreds or thousands of records, the number of potential pairings that must be compared can run into the thousands (if not millions). To avoid this problem, blocking is used to limit the potential pairings using specific business rules.
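A short sketch shows why blocking helps. Comparing every record with every other record requires n(n-1)/2 comparisons, while blocking compares only records that share a key. The blocking rule used below (first letter of the surname plus postal code) is a made-up business rule for illustration.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Assumed business rule: first letter of the surname plus postal code."""
    surname = record["name"].split()[-1]
    return (surname[0].upper(), record["postal_code"])

def candidate_pairs(records):
    """Yield only the pairs of record indexes that fall in the same block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"name": "Mike Rogers", "postal_code": "07030"},
    {"name": "Michael Rogers", "postal_code": "07030"},
    {"name": "Jon Snow", "postal_code": "10001"},
    {"name": "John Snowden", "postal_code": "10001"},
    {"name": "David Matthews", "postal_code": "60601"},
]

all_pairs = list(combinations(range(len(records)), 2))  # 10 pairs without blocking
blocked_pairs = list(candidate_pairs(records))          # 2 pairs with blocking
```

With five records the saving is small, but the gap widens quickly because the number of unblocked pairs grows quadratically with the number of records.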
Challenges of Entity Resolution
Despite the many approaches and techniques available for ER, it falls short on several fronts. These include:
1. ER Works Well Only If the Data Is Rich and Consistent
Perhaps the biggest problem of ER is that the accuracy of the matches is dependent on the data’s richness and consistency across datasets.
For instance, deterministic matching is quite straightforward. Say you have ‘Mike Rogers’ in database 1 and ‘Mike Rogers’ in database 2. Through simple record linking (or exact matching), we can easily identify that one is a duplicate of another.
Probabilistic matching, however, is another story. Here similar data records exist in the form of misspellings, abbreviations or nicknames (e.g. ‘Mike Rogers’ in database 1 and ‘Michael Rogers’ in database 2). A unique identifier (such as address, SSN, or birth date) may not be consistent across the databases, and any kind of exact or deterministic matching becomes nearly impossible, especially when dealing with large volumes of data.
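The sketch below gives a heavily simplified picture of scoring a pair of records when exact matching fails. It uses a tiny hand-written nickname table and fixed field weights, all of which are assumptions for illustration. Genuine probabilistic record linkage estimates such weights from the data rather than hard-coding them.

```python
# Tiny illustrative nickname table; real nickname libraries are far larger.
NICKNAMES = {"mike": "michael", "jon": "jonathan", "johnny": "john"}

# Assumed field weights out of 100; in practice these are estimated from data.
WEIGHTS = {"first_name": 30, "last_name": 40, "birth_date": 30}

def normalize_first_name(name):
    """Lowercase the name and expand a known nickname to its full form."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def match_score(a, b):
    """Sum the weights of the fields that agree; missing values earn no credit."""
    score = 0
    if normalize_first_name(a["first_name"]) == normalize_first_name(b["first_name"]):
        score += WEIGHTS["first_name"]
    if a["last_name"].strip().lower() == b["last_name"].strip().lower():
        score += WEIGHTS["last_name"]
    if a.get("birth_date") and a.get("birth_date") == b.get("birth_date"):
        score += WEIGHTS["birth_date"]
    return score

db1 = {"first_name": "Mike", "last_name": "Rogers", "birth_date": "1980-01-02"}
db2 = {"first_name": "Michael", "last_name": "Rogers", "birth_date": None}
score = match_score(db1, db2)  # names agree, birth date is missing in db2
```

A score between "clearly the same" and "clearly different" is exactly the kind of case that needs a threshold decision or manual review.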
2. ER Algorithms Don’t Scale Well
Big Data enterprise projects that deal with terabytes of data in the financial, government, or healthcare industry have too much information for traditional ER, record linking and deduplication to work properly. The business rules required to make the algorithms work would have to account for far larger data volumes to work consistently.
For example, the blocking technique (used to limit mismatched pairs when finding duplicates) depends on the quality of the record fields. If fields contain errors, missing values and variations, records can end up in the wrong blocks, which leads to more false negatives.
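The effect of a dirty blocking field can be shown with a tiny example. The rule of blocking on the postal code exactly as entered is an assumption chosen to keep the illustration short.

```python
def block_key(record):
    """Assumed rule: block on the postal code exactly as it was entered."""
    return record["postal_code"]

a = {"name": "Mike Rogers", "postal_code": "07030"}
b = {"name": "Mike Rogers", "postal_code": "O7030"}  # letter O typed instead of zero
c = {"name": "Mike Rogers", "postal_code": None}     # missing value

# Records are only compared when they land in the same block.
same_block_ab = block_key(a) == block_key(b)
same_block_ac = block_key(a) == block_key(c)
```

Both comparisons come out false, so the three records are never compared with one another even though the names are identical. Each missed pair is a false negative.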
3. Manual ER is Complex
It is not uncommon for enterprise companies or institutions dealing with large volumes of data to opt for ER projects in-house. The rationale is that they can make use of technical resources (software engineers, consultants, database administrators) without having to purchase any of the entity resolution tools available in the market.
There are a few problems with this. Firstly, entity resolution isn’t a subset of software development. Sure, there are publicly available algorithms and blocking techniques that might be useful. But in the grand scheme of things, the skills required are vastly different. The user will have to:
- Combine disparate unstructured and structured data sources
- Be aware of different types of encoding, nicknames, variations for matching accuracy
- Know how to entity resolve records for different use-cases
- Ensure different matching techniques complement one another for consistency
Finding one person who ticks all these boxes is unlikely, and even when it is possible, there is a risk that they leave the firm, which can put the entire project on hold.
4 Reasons Why Entity Resolution Tools Are Better
Entity resolution tools can provide many benefits that traditional ER can’t. These include:
1. Greater Match Accuracy
Dedicated entity resolution tools that have sophisticated fuzzy matching algorithms and entity resolving capabilities in place can give far better record linking and deduplication results than common ER algorithms.
When dealing with heterogeneous datasets, finding the similarity of two records can be exceptionally difficult due to the different types of entities, encoding, formatting issues and languages. Schema changes can also pose a problem. Healthcare organizations, for example, use both SQL and NoSQL-powered databases. Converting all of that data into a pre-defined schema through schema-matching and data exchange can be risky, as a lot of valuable information can be lost in the process.
Furthermore, a data analyst may have to use several string metrics, such as Levenshtein Distance, Jaro-Winkler Distance, Damerau-Levenshtein Distance and more, to do fuzzy matching effectively. Incorporating all of these manually to improve match accuracy can be problematic.
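To give a sense of what implementing even one of these metrics by hand involves, here is a plain Python version of Levenshtein distance using the standard dynamic-programming approach. The `similarity` helper that scales the distance to a 0 to 1 score is an added convenience for this example. Jaro-Winkler and Damerau-Levenshtein are not shown.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

distance = levenshtein("Oneil", "O'neal")  # one insertion plus one substitution
```

Each additional metric needs its own implementation, testing and tuning, and the scores then have to be combined sensibly across fields.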
Entity resolution tools, on the other hand, can seamlessly link records by employing a wide range of string metrics and other algorithms to give higher match results.
2. Lower Time-To-First Result
In most cases, time is critical for ER projects, especially in master data management (MDM) initiatives that require a single source of truth. The information relating to an entity can change within weeks or months, which can pose serious data quality risks.
Let’s say a B2B sales and marketing organization wants to run campaigns on its top-tier accounts. Ideally, it will want to make sure that its targeted prospects haven’t switched jobs, changed titles, or retired before wasting any marketing spend. In such cases, completing ER within a deadline is critical.
ER, if done manually, can take six months or more, by which time many records in databases may have become obsolete and inaccurate. Entity resolution tools, however, can take half as long, and more advanced tools can give a time-to-first result of 15 minutes.
3. Better Scalability
Entity resolution tools are far more adept at ingesting data from multiple points and running record linkage, deduplication and cleansing tasks at a much larger scale. Government databases such as those containing tax collection and census data store millions (if not trillions) of records. A government institution deciding to do ER for fraud prevention, for instance, would be restricted in using manual ER approaches and algorithms. A user would become inundated with the data that needs to be worked with, and any business rules for blocking techniques (used to minimize the number of comparisons) would be futile.
Entity resolution tools, however, can not only import data from various sources, but also ensure that their ER efficiency remains intact across large data volumes.
4. Lower Overall Cost
Entity resolution tools, particularly for enterprise-level applications, can require a sizable investment. Data professionals tasked with ER may be reluctant to consider them for this reason alone. They may reason that doing it manually would be far more cost-effective and improve their chances of promotion.
Although this may sound reasonable at first glance, the costs of project delays, poor matching accuracy, and labor resources may end up becoming higher than that of an ER tool.
How to Choose the Right Entity Resolution Software
Choosing the right entity resolution software is equally important. Entity resolution tools differ in their features, scope and value.
Import Disparate Data Sources
Enterprises can have data stored in a wide variety of formats and sources such as Excel, delimited files, web applications, databases and CRMs. Entity resolution software must be capable of importing data from disparate sources for the specific use-case.
DataMatch Enterprise’s Import module allows you to source data in various formats.
Profile and Cleanse Data at Scale
The right entity resolution software must also be able to profile and clean the data prior to any deduplication and record linkage efforts. DataMatch Enterprise, using pre-built patterns based on regular expressions, can determine valid and invalid records, null and distinct values, leading and trailing spaces, and more.
Once a profile is generated, the data can then be cleansed using various functionalities such as:
- Merge Fields
- Characters to Remove
- Characters to Replace
- Numbers to Remove and more
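For readers who want a feel for what such cleansing operations do, the sketch below implements rough Python equivalents. These helpers are hypothetical and named after the operations in the list for illustration only. They are not DataMatch Enterprise functions and do not reflect how that product works internally.

```python
def remove_characters(value, chars):
    """Delete every occurrence of the given characters."""
    return value.translate(str.maketrans("", "", chars))

def replace_characters(value, replacements):
    """Apply literal substring replacements in the order given."""
    for old, new in replacements.items():
        value = value.replace(old, new)
    return value

def remove_numbers(value):
    """Strip all digits, e.g. from a field that should contain only letters."""
    return "".join(ch for ch in value if not ch.isdigit())

def merge_fields(record, fields, separator=" "):
    """Join the non-empty values of several fields into one string."""
    return separator.join(record[f].strip() for f in fields if record.get(f))

phone = remove_characters("TEL-516 570-9251", "TEL-")
email = replace_characters("john@hotmail,com", {",": "."})
state = remove_numbers("NJ43")
full_name = merge_fields({"first": " David ", "last": "Matthews"}, ["first", "last"])
```

Applied to the problem values mentioned earlier in this guide, these helpers turn 'hotmail,com' into 'hotmail.com' and 'NJ43' into 'NJ'.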
Robust Matching Capabilities
There are many entity resolution tools that claim to provide a high match score. However, matching accuracy is linked to how sophisticated the algorithms used to match records within and across multiple datasets are. DataMatch Enterprise employs a range of match types (Exact, Fuzzy, Phonetic), uses string metrics to establish distance across entities, and makes use of domain-specific libraries (nicknames, addresses, phone numbers) to establish a higher-than-industry match score.
An independent study done by Curtin University found that DataMatch’s match accuracy surpassed that of other vendors, including IBM’s Quality Stage and SAS Dataflux.
As crucial as it is for enterprises to do ER, manually carrying out deduplication, record linkage and other ER tasks has serious limits when trying to match data across millions and trillions of records. With entity resolution software like DataMatch Enterprise, enterprises are in a far better position to realize their business goals from a scalability, cost, and results standpoint.