As data undergoes a paradigm shift, so do the systems, processes, and approaches involved. Legacy systems are dying. Batch ETL pipelines are slowly becoming obsolete. Ownership is moving away from IT to business. Functions like data matching and data preparation are evolving from manual, query-based, programmatic to automated, point-and-click, business-centric processes.
There is a modern approach to data matching – one that promises high accuracy rates, low false positives and only takes minutes instead of weeks and months.
This guide on modern data matching explains the processes involved, the different types of scenarios where this approach saves time while increasing efficiency and finally how users can tune matching parameters to get the best possible matches.
What is Data Matching?
Data matching compares data from multiple records to determine fields that refer to the same entity. Put simply, data matching enables the user to detect duplicate records or/and merge identical records. It is performed using established algorithms that are designed to compare different types of data such as strings, dates, and integers. With the nature of data turning increasingly complex, matching is no longer just about comparing two records – it involves key processes as data profiling, data completeness, data accuracy, and more.
The Basics – Deterministic and Probabilistic Matching
Known by multiple terms, record or data linkage, entity resolution, object identification or field matching; data matching is the task of identifying, matching and merging records that refer to the same entity from within or across several databases.
Traditionally, data matching was performed through running queries using complex algorithms and formulas to match records. Deterministic and probabilistic matching are the two most common data matching approaches that make the use of algorithms such as the Edit Distance, Soundex, Levenshtein distance to match strings and return a match result.
The deterministic method is simple. If you’ve got two data fields that have the exact properties, you can use this method to find matches. The condition is your data has to be squeaky clean and standardized. Unique identifiers like social security numbers, driver license, passport numbers need to be accurate. Easier said than done especially since there’s always a chance for user errors. Also, since these are confidential information, there is little chance companies can lay their hands on this information. They then resort to phone numbers, email addresses as unique identifiers. Then again, this data is always rife with errors, null values, and other data quality issues.
A good example of deterministic matching is when a bank matches consumer account numbers with their names and date of birth to confirm identities or when retail stores use invoice numbers to match with product numbers to confirm product sales.
Deterministic matching works well where rules are defined, the data is clean and you’re sure the unique identifiers are accurate.
But the data we have today hardly meets these rules.
Hence, the need for probabilistic matching.
Probabilistic matching uses a statistical approach in measuring the probability that two customer records represent the same individual. This methodology uses several fuzzy matching algorithms to determine a match, non-match or possible match. Like a deterministic match, probabilistic matching requires data is clean and standardize, but it doesn’t have to be ‘exact.’
If John is written as Johnny, a smart data matching solution should be able to detect this as a potential match based on certain business rules which are usually pre-defined in a commercial solution.
A good example of probabilistic matching is when a company has multiple email addresses and phone numbers (mobile/office/home) of the same individual with no unique identifiers to determine the identity. In this event, the company will have to run several algorithms to determine identities based on strings and integer values.
This is where fuzzy matching algorithms come in.
Rather than flagging records as a ‘match’ or ‘non-match’, fuzzy matching identifies the likelihood that two records are a true match based on whether they agree or disagree on the various identifiers. Here’s a list of the various fuzzy matching techniques that are in use today:
- Levenshtein Distance (or Edit Distance)
- Damerau-Levenshtein Distance
- Jaro-Winkler Distance
- Keyboard Distance
- Kullback-Leibler Distance
- Jaccard Index
- Metaphone 3
- Name Variant
- Syllable Alignment
You can read the following guide to know more about fuzzy matching and how it can help you with matching complex data.
Modern data matching uses both deterministic and probabilistic approaches depending on the nature of data and the type of match the organization needs to perform.
For both methods, a common requirement for all successful match projects is that of data quality which is achieved via data preparation.
Preparing Data for Data Matching
The nature of data today is anything but simple. One entity can have dozens of columns of data – phone numbers (home, office, mobile), email addresses (personal/professional), social media accounts, device IDs and many more. This data is hardly accurate.
These are summarized as under:
DataMatch Enterprise allows for the native integration of over 500 data sources including popular CRMs as Salesforce, HubSpot, and many more
Investigate the data for accuracy and completeness. Data profiling allows you to evaluate the health of your data and discovers rows and columns that have missing values, corrupt or incomplete information, null fields and much more.
|Data cleansing + standardization||Use patterns and regular expressions to clean, sort and optimize the data for matching. Transform semi-structured, dirty data into clean, standardized data.|
The Matching Process – Creating Definitions, Assigning Rules
In theory, the data matching process sounds simple – you compare two records, find the common information between both, and accomplish the match goal.
In practice, data matching is a complex process where you’re trying to determine that two records actually refer to the same entity.
Take for instance the table below.
How would you decide the two records belong to the same individual? Or that they can be labeled as duplicates?
|First Name||Last Name||Address||Phone|
A way of approaching this instance is saying that because the records share similarities in terms of Johnny being a popular nickname for John or that both last names start with D, hence these are likely to be duplicates. But intuition or guesswork is not the right way to go about it. Hence the need for matching solutions that can use multiple matching algorithms to determine whether two records are similar.
For starters, if the table has unique identifiers – which in this case, could be the phone number, then it’s easy to determine a match. But in the absence of unique identifiers, you’ll need to use either deterministic or probabilistic matching to determine the record similarity.
You start the matching process by identifying the attributes that are unlikely to change – for instance, surnames, date of birth, height, color etc. Next, you assign a match type (phonetic, exact, fuzzy) for each attribute. Names for example can be matched phonetically. Numbers and dates can be matched by similarity.
This is built right within DataMatch Enterprise.
You select your column of choice, choose a match type and begin the matching. You can also configure the match based on three critical settings: All, Between and Within.
- All: This will look for matches between all data sources integrated into the tool. But it does more than just look between them – it also looks for duplicates in *each* of them. Traditionally, it would take weeks to find for duplicates in just one single file, but with an automated solution, you can now look for matches within each file and between multiple files.
- Between: If you’re just looking for matches between sources and not within sources, you can configure the match setting to ‘between’. Indicate the number of data sources you want to match, and the solution will run matches between those sources without finding for duplicates within them.
- Within: If you specifically want to look for matches or duplicates within one data source, this configuration lets you scan rows and columns of that specific source.
Next, you can assign weights for each attribute, meaning, it determines the importance a field score has on the total match score. For instance, if the first letter of your records match, then the tool will add extra scoring. This is especially helpful if you want to match things like middle names (A vs Andrews) and nicknames, however, this should be used carefully as it can skew your results.
You must have a reliable, non-changeable field, such as a SSN number over an email address to perform reliable matching. Once the matching has been done, you can fine-tune it to get even more accurate results.
How to Tune Data Matching Results
To get the best out of the match, you can create your match rules. DME uses a unique Pattern Builder tool to create custom match rules.
Take the following scenario:
A company has two contact number columns – mobile and landline for each of its contact. For a year, this data has been duplicated with several hundred rows missing either a mobile or a landline number. The company opts for a fuzzy match within the data source to weed out duplicate contacts. After preparing, cleaning and transforming their data, the output is a neat consolidated list of unique surnames with unique mobile and landline numbers. Most companies would stop here.
But this is where it gets interesting.
There is a consolidated, unique list, but how can the company be sure that each contact has the right mobile and landline number?
Turns out, there were not sure.
Numbers in the mobile column did not have a country code.
Numbers in several– hundred landline column were mobile numbers.
Here’s where you’d start to tune data matching results.
With the surety that surname data is unique, it can be used as a unique identifier to match only the Mobile and Landline column. In this case, the country and area code prefix were the differentiators between mobile and landline. How do you fine-tune this match?
DME uses a pattern builder to create customized match configurations designed for complex match processes as in this instance. Using the pattern builder, the user can create an expression that assigns a prefix (country + area code) to all mobile numbers. The mobile column will be matched with this prefix and all numbers without the prefix will be assigned the prefix. Next, the mobile and the landline is matched to determine if there are any duplicates. If there are no duplicates, a final column with the updated mobile is created. Within this pattern builder, the user can define the range of numbers – for instance, any number less than 6 digits is an incomplete or inaccurate number.
As a final step, make a copy of the final result and compare your match records. Export the results to a spreadsheet and keep track of changes you’re making. DME makes this easier as the tool stores copies of all your match results, ensuring you do not lose any previous records that you want to reevaluate.
Match tuning helps you reduce false positives and negatives. Organizations do not have the capacity to manage excess false positives, hence, it’s necessary to define a match definition that is not broad or generic. For instance, using first and last names (which is often the easiest) to perform a match will result in high false positives because two distinct individuals can still have the same first and last name. Thus, a narrower definition such as a phone number or an email address is better as two users do not have the same number. In this case, the phone number must be a 100% accurate to be used as a match definition.
Some Essential Data Match Tuning Functions in DME
The DataMatch Enterprise is a self-service, automated, powerful data match tool that allows users to create custom match settings based on a range of settings including using alphanumeric characters, strings, digits, whitespace, delimiters and many more.
Here are some key functions that the tool uses to deliver highly accurate matches.
Type of Match: The tool allows for five types of data match:
- Exact: Will only match if the fields are the same
- Phonetic: Field will match when they sound alike (Bear and Bare)
- Numeric: Compares numerical values
- Fuzzy: This will match fields based on character coincidence (john and jhon) and return a score based on that coincidence and the order of the characters.
- Level: This will define the threshold of the match score. (For example, if you define last names as 70%, then all records of that group must match 70% or more).
Group IDs: This allows users to create fields for cross column search between two or more columns. For instance, you want to match the first name column of Record A in the last name column of Record B or within Record A itself.
Group Level: Matching score calculated for a group ID based on field level and weight.
Filter Editor: Filter columns using AND/OR options and determine the values you want to keep in a column. For instance, taking the scenario above, you might want to keep numbers that only begin with the area code of one country.
Merge Matches: What if you have two numbers for one column? Would you keep one and remove the other? Unless you know exactly which of them is the correct version, you can merge them, separated by a delimiter. This way you get to keep both results that you can eyeball on later.
Finally, DME allows you to export records in over a dozen formats. You can also choose to create a golden record after the matching that will hold the most accurate, consolidated version of your data.
Traditionally, a data match activity has three goals.
- Get data columns that match
- Discover columns that don’t match
- Get an output that holds true, accurate information
Today, these goals are expanded by time and match rates. Companies want a 100% accurate match rate in the shortest time possible with minimal use of resources. Essentially, this is a demand for automation. Next-gen solutions deliver on this demand by allowing users to consistently refine and optimize their data by cleaning, matching, and tuning the match for more focused, accurate results.
DME helps you achieve all of these goals.
- We are rated the highest in match rate accuracy beating SAS and IBM
- It takes just 45 minutes to match millions of rows of data
- Our tool is designed for business users to clean, match and transform data
- The user can define rules, create exceptions, optimize match configurations according to their data requirements
- It requires no programming language expertise and is a self-service tool for anyone interested in working around data quality and data matching.
Want to know how we can help you perform a customized data match exercise? Download the free trial and experience data matching the modern, better, more powerful way.