Catherine spelled as Cathy, Kath or Katharine; John entered in your system as Jon, Jonathan, or Jonny; or a Margaret who goes by Peggy when purchasing online — name variations cause significant problems in maintaining an accurate customer or vendor profile for organizations. In this article, we will take a look at how name matching software and techniques help businesses.
While seemingly a small issue, name variations result in duplicate records being created across disparate data sources. Your reps are spending way too much time trying to consolidate customer information and verifying if two customers are in fact the same person, and your analytics might show you a flawed view of your customers, affecting business decisions.
Here’s a small example of what goes wrong with a name mismatch.
Say you want to send a promotional email to your customers. You plug your database into your automated email platform and send it out. William Rogers is one of your customers, but when William receives the email, he is addressed as ‘Willy Rog.’ You may have just lost the trust, and therefore the ongoing business, of a customer.
How do you prevent such an accident from happening? What practical steps can you take to ensure that your database has the right information?
We will be answering all these questions by covering:
- What is Name Matching?
- Why Do Name Matching Problems Occur?
- 4 General Approaches to Resolving Name Matching Issues
- Challenges with Existing Approaches
- How Name Matching Software Can Help
- Zurich Insurance Case Study
Let’s dig in.
What is Name Matching?
In layman's terms, name matching means making sense of several variations of a name and matching them to one primary name. Taking the example above, William may be written as Will, Willy, Wils, and so on. The goal of name matching is to identify these variations and associate them with the correct name: William.
Sounds simple, right?
Not quite so.
In databases, names are often used as identifiers, which means your database may rely on a name to look up a record. More importantly, when you use marketing automation tools, you rely on the [name] token to personalize the email, leaving no room for errors. The negative consequences of wrong spellings can cost businesses hundreds of thousands of dollars.
So what can you do? Data experts implement algorithms or methods to resolve duplicate or inaccurate name strings, keeping just one true source. In the case of William, all the variations are consolidated under one main correct name, and the variations themselves are stored in a separate field for reference, if needed.
Why Do Name Matching Problems Occur?
Name variations happen for a number of reasons – the first and foremost being the behaviour or intention of the user. Some people may choose to give their nickname (a common problem with online businesses that require users to fill in forms). Some may choose to just give their initials, or some may simply type in a random name.
Regardless of your business size, type or industry, the cost of false or inaccurate data is always high. But if your organization works in law enforcement, homeland security, financial compliance or a similar data-sensitive industry, the stakes of name variation are higher still, and mismatches are a risk you cannot afford.
TransUnion, a popular consumer credit reporting agency, lost a massive class-action lawsuit for incorrectly flagging customers as criminals. Similarly, PayPal, a popular online financial transaction company, was fined for not preventing transactions to Iran, Cuba, and Sudan because its filter was not working properly.
Increasing variability and complexity in data types, data formats and data sources (mobile, social, device logs, etc.) have further complicated name matching challenges.
Some of the common issues with name matching are:
Typographical errors: Missing the ‘a’ in Angela changes the name to Angel. The problem with typos? Sometimes, we’re not even aware we’ve made one.
Phonetics: Is it Carl or Karl? Grey or Gray? These names sound alike but are spelled differently. If someone is entering the name over a call (a customer service agent, for example), neglecting to confirm the spelling produces an error that goes unnoticed.
Nicknames: This is a common problem. Sometimes nicknames replace the original name entirely. Someone may be in the habit of typing Mike instead of Michael, or Liz instead of Elizabeth.
Initials: Sometimes, with very long names, people tend to jot down just the initials. Mary Jane Thomas could be written as M.J Thomas. In this event, there is also a possibility that M.J Thomas may be mistaken for a man!
Foreign Names: This is super tricky! When it comes to foreign names, the chances of spelling errors are high. Asian names, particularly Vietnamese, Korean, and Chinese names, are difficult to tackle. For example, ‘Nguyen Thi…’ is a common start to Vietnamese women's names (Nguyen is a family name, and Thi is a common middle name for women). Some write it as Nugyen, some as Nguyen – the former being a misspelling. The same goes for Asian names that have been Americanized; for example, Farah is often written and pronounced as Farrah.
Because there are so many sources, processes, and people involved in the recording of names, it becomes difficult to ensure 100% accuracy. Thanks to modern technology, though, it is possible to significantly reduce, if not completely remove, inaccurate data.
You will need different name matching methods to solve different name matching challenges. Different approaches have been developed to address different challenges, and there is no one-size-fits-all solution.
Most of the frameworks described below are designed for specific challenges and require significant customization before you can develop and deploy them in an enterprise environment.
4 General Approaches to Resolving Name Matching Issues
The challenge of string matching has bothered enterprises and organizations for decades. Enterprises like Google and Amazon use several methods to overcome this challenge, while smaller enterprises with less capital are still struggling with the cost of maintaining a large database.
Here are some of the most common name matching approaches used in the industry.
The Common Key Method
Phonetic variation, a common name challenge, can be addressed with the Common Key Method. In this method, names are represented by a key or code based on their English pronunciation.
Soundex, a phonetic algorithm, is used to index names by sound. For example, SMITH and SCHMIDT both have S530 as their key. This may seem like a super-easy way to resolve name issues, but it is highly limited.
Soundex was designed around English pronunciation, so it works only on Latin-script names and interprets foreign-language names according to English phonetics. Double Metaphone, another phonetic algorithm, uses a primary and a secondary code for each name, which enables it to take into consideration other languages such as Slavic, Germanic, Spanish, French, Greek, Italian and even Chinese!
Double Metaphone will, therefore, encode Smith with a primary code of SM0 and a secondary code of XMT. When it reads Schmidt, it produces XMT as the primary code and SMT as the secondary code. Notice the shared XMT? That indicates similarity between similar-sounding names.
Despite being a popular method, the greatest challenge with Common Key algorithms is precision. It’s mostly guesswork (as in the case of Smith vs Schmidt), and although better and more advanced algorithms are being defined to handle phonetic differences, there will always be challenges when it comes to non-English names. In the case of Korean names, for example, both Soundex and Metaphone require the names to be converted to Latin characters before keys can be created for them. This process adds to the complexity of the task and increases the chances of error rather than reducing them.
Pros: Simple, fast & high recall value
Cons: Does not work as smoothly with non-Latin names. May compromise on precision.
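To make the Common Key Method concrete, here is a minimal sketch of the classic American Soundex rules in Python. The function name `soundex` and the decision to simply drop non-ASCII characters are assumptions made for illustration; a production system would normally rely on a tested library rather than this hand-rolled version.

```python
def soundex(name: str) -> str:
    """Return a 4-character American Soundex key (illustrative sketch)."""
    # Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    # Assumption: anything outside A-Z is ignored, which is exactly why
    # non-Latin names need transliteration before they can be keyed.
    letters_only = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters_only:
        return ""

    key = letters_only[0]                      # the first letter is kept as-is
    previous = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue                           # H and W do not separate equal codes
        digit = codes.get(ch, "")
        if digit and digit != previous:
            key += digit
        previous = digit                       # a vowel resets the previous code
    return (key + "000")[:4]                   # pad or truncate to 4 characters


print(soundex("Smith"), soundex("Schmidt"))    # S530 S530
print(soundex("Carl"), soundex("Karl"))        # C640 K640
```

Because Soundex always keeps the first letter, Carl and Karl end up with different keys even though they sound alike, while Smith and Schmidt share one. Both outcomes illustrate why a common key on its own is a blunt instrument.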
List or Dictionary Look-Up Method
The method is simple – list all possible variants of a name and match them to the main source.
This method works best for multi-cultural data as there are different derivations of a name – the cause of which could be cultural preferences, individuality, or simply a human error that wasn’t fixed.
Take for example the name Aiden. It’s also written as Aydin. Another common example is Ayesha also written as Aisha or Aiesha.
Although the list method is simple and easy to maintain, it is resource-intensive and falters when challenged with other variations like initials, nicknames, surnames, etc. Another drawback is that a name variation not in the list will not be found as a match which makes the list method inefficient for use in industries like homeland security, anti-money laundering, etc.
Pros: Easy to use
Cons: Resource intensive; has recall problems since new variants may not be captured; is slow as it scans through a large database to return a match.
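A minimal sketch of the list method in Python is shown below. The `VARIANTS` table is a tiny hypothetical list built from the examples in this article, and `canonical_name` is an illustrative helper name; a real list would be far larger and curated over time.

```python
# Hypothetical variant list: canonical name -> known variants.
VARIANTS = {
    "Aiden": ["Aydin"],
    "Ayesha": ["Aisha", "Aiesha"],
    "William": ["Will", "Willy", "Wils"],
}

# Invert the list once so every known spelling points at its canonical name.
LOOKUP = {}
for canonical, variants in VARIANTS.items():
    LOOKUP[canonical.lower()] = canonical
    for variant in variants:
        LOOKUP[variant.lower()] = canonical


def canonical_name(name):
    """Return the canonical name for a known spelling, or None if unlisted."""
    return LOOKUP.get(name.strip().lower())


print(canonical_name("Aisha"))   # Ayesha
print(canonical_name("Aysha"))   # None: this variant is not in the list
```

The second call shows the recall problem described above: a spelling that nobody added to the list is never matched, however close it looks.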
Edit Distance Method
The edit distance method compares two spellings character by character and assigns a weight to each edit needed to turn one into the other. “Carl” and “Karl” have an edit distance of 1, since the C is substituted with a K. The term “edit” in this method refers to the insert, delete, and substitute actions (and, in some variants, transposing two adjacent characters) required to make the strings match.
It makes use of two key factors:
1). The number of matching characters in the two strings
2). The number of edit operations it takes to turn one variant into another.
The drawback to this method is the same as with the other methods: accuracy is largely limited to English names. For non-English names, the name is first transliterated, and the edits are then computed on the result. The Vietnamese name “Hang,” for example, may be rendered as “Heng,” which is a Chinese surname. The two spellings differ only in the vowel and sound alike, so the method can treat them as the same name.
It’s therefore clear that the edit distance method not only loses out on language nuances, but also risks significant errors when non-Latin names are converted to English spellings.
Pros: Easy to execute
Cons: Does not work efficiently for non-Latin languages.
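The sketch below implements the Levenshtein distance, one common form of edit distance that counts insertions, deletions, and substitutions, each with a weight of 1. The function name `levenshtein` is an illustrative choice, and the uniform weights are an assumption; real systems often tune them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Carl", "Karl"))       # 1
print(levenshtein("Angela", "Angel"))    # 1
print(levenshtein("Nguyen", "Nugyen"))   # 2
```

Note that swapping two adjacent letters, as in Nguyen and Nugyen, costs two edits here. Variants that count a transposition as a single edit (such as Damerau-Levenshtein) would score it as 1.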
Rule-Based Method
This is an interesting method that relies on human knowledge. It is labor-intensive, but it incorporates real-world knowledge about names from different cultures and ethnicities. The benefit of this method is that there is no translation from a foreign language into English, so the cultural nuances of a language are kept intact.
The three drawbacks to this method?
- It relies on the extent of human knowledge.
- It requires a great deal of work to feed multiple name variations based on human knowledge alone.
- It is slow as it has to sift through millions of names to look for a good match.
Pros: Caters to foreign language names
Cons: Relies on human knowledge
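A method that relies on human knowledge can be pictured as a set of hand-written rules. The article does not spell out any specific rules, so the one below is purely a hypothetical example: it encodes the human insight that an initial can stand in for a full given name, as with Mary Jane Thomas and M.J Thomas. The names `tokens` and `initials_rule` are illustrative.

```python
import re


def tokens(name):
    """Lower-case a name and split it on whitespace and periods."""
    return [t for t in re.split(r"[\s.]+", name.lower()) if t]


def initials_rule(name_a, name_b):
    """Hypothetical rule: surnames must be equal, and each given name must
    match its counterpart either in full or as a single-letter initial."""
    a, b = tokens(name_a), tokens(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False                      # different (or missing) surname
    given_a, given_b = a[:-1], b[:-1]
    if len(given_a) != len(given_b):
        return False                      # this rule does not guess omitted names
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(given_a, given_b)
    )


print(initials_rule("Mary Jane Thomas", "M.J Thomas"))   # True
print(initials_rule("Mary Jane Thomas", "M.K Thomas"))   # False
```

Each new naming convention needs another rule like this one, which is where the labor-intensive nature of the method comes from.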
Making Use of the Hybrid Model
Hybrid models make use of two or more methods to achieve the highest recall and precision. It may make use of the Common Key Method’s high recall ability with the rule-based method’s human knowledge of names to achieve these goals.
With a hybrid model, the rules are generated from real data which means it doesn’t have to rely entirely on human knowledge and neither does it have to rely entirely on a translation. Furthermore, this method works perfectly well for cross-lingual name matching where users can simply type in a name in English and still get accurate results.
As a result, a hybrid model is fast to execute, offers high recall, and also addresses the non-Latin-to-Latin issue.
It’s important to mention, though, that developing a hybrid model to meet your data needs is not an easy task. You first have to identify the problem you’re having, the type of approach that will work with your specific data, and the high level of customization you will need to get the model to work on that data. Additionally, you will have to spend months testing, recording, updating and reviewing the effectiveness of different methods. This is an expensive endeavor, and one that will not help you overcome your current data challenges.
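As a rough illustration of the hybrid idea, the sketch below chains two techniques: a Soundex key groups candidate names together (favoring recall), and a string-similarity score from Python's standard `difflib` module then filters each group (favoring precision). This particular pairing, the `hybrid_matches` name, and the 0.7 threshold are all assumptions chosen for the example; it does not represent how any specific product builds its hybrid model.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Soundex digit for each consonant (vowels, H, W and Y have none).
_CODES = {letter: digit
          for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                 "4": "L", "5": "MN", "6": "R"}.items()
          for letter in letters}


def soundex(name):
    """Compact American Soundex, used here only to group candidates."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key, previous = letters[0], _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            key += digit
        previous = digit
    return (key + "000")[:4]


def hybrid_matches(names, threshold=0.7):
    """Step 1: group names by common key. Step 2: keep only pairs inside a
    group whose similarity ratio reaches the (assumed) threshold."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)

    pairs = []
    for group in groups.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                score = SequenceMatcher(None, first.lower(), second.lower()).ratio()
                if score >= threshold:
                    pairs.append((first, second, score))
    return pairs


for first, second, score in hybrid_matches(["Smith", "Smyth", "Schmidt", "Jones"]):
    print(first, second, round(score, 2))
```

Grouping first means the expensive pairwise comparison only runs inside small groups, and the similarity filter then removes pairs that merely share a key. Tuning the threshold against real data is the kind of testing effort described above.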
Challenges with Existing Approaches
If your organization is dealing with just a couple of hundred names in an Excel sheet, you can fix the name issues manually or use any of the algorithms described above. At enterprise scale, however, implementing any of these approaches can cost hundreds of thousands of dollars, take months if not years of testing and implementation, and require hiring a development team, which comes with its own caveats.
There are also additional challenges that may be difficult to overcome with existing approaches:
1. The problem with handling a variety of scripts: Most of the approaches deal primarily with Latin-based languages. With multi-cultural or multi-language scripts, these approaches perform very poorly. They can handle only one script at a time, so you can’t really use them to sort through multiple scripts simultaneously.
2. Issues with precision and recall: The common key method may have high recall, but poor precision. Because it’s just matching strings based on sounds or keys, it falters when it comes to high-variance data. The rule-based method may offer precision, but because it has to scan through the data against hand-built rules, it is very slow to return matches.
3. High computational resources: Sorting through a large-scale enterprise database requires substantial computational resources to keep run-times low. You should be able to retrieve a name or a match within seconds of a search. This need for near-instant results requires systems and resources that are costly, not to mention yearly maintenance and updates.
4. Lack of automated improvements: Over time, all these approaches need to be manually updated for improvements. Not only is this time-consuming and complex, but it also increases the challenge of precision and accuracy.
5. Hiring of the right kind of talent: Anyone can learn a language and set up a program for you. But you need more than just a Python developer to get this job done. You need a team that understands how to use a certain model to solve a specific problem – and that team doesn’t come cheap.
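The precision and recall trade-off mentioned in point 2 can be measured directly once you have a set of known true matches. The sketch below uses the standard definitions (precision is the share of returned matches that are correct, recall is the share of true matches that were returned). The function name `precision_recall` and the sample pairs are hypothetical.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for a set of predicted name matches."""
    # Treat pairs as unordered so ("Carl", "Karl") equals ("Karl", "Carl").
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical ground truth and the output of a loose, high-recall matcher.
true_pairs = [("Smith", "Smyth"), ("Carl", "Karl")]
loose_matcher = [("Smith", "Smyth"), ("Carl", "Karl"),
                 ("Smith", "Schmidt"), ("Carl", "Carol")]

print(precision_recall(loose_matcher, true_pairs))   # (0.5, 1.0)
```

Here the loose matcher finds every true match (recall of 1.0), but half of what it returns is wrong (precision of 0.5), which is the pattern described for the common key method.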
Name Matching Software: The Code-Free Approach
Although these algorithms may “sound” simple, their execution is hardly a matter of simplicity.
The constraints, the need for a team and computational resources and most importantly, the challenge in implementing an approach that works are difficult if not impossible to overcome. It costs hundreds of thousands of dollars, is a severe strain on business processes and still fails to capture the exponential rise of different data sources, types and formats.
This is where you would need a name matching software – a solution that is code-free, hassle-free, and works exceptionally well with rising data needs.
Modern name matching software solutions do more than just name matching. They clean data, remove duplication issues, clear off redundancy by implementing standardization, and help your organization bank on reliable and accurate data.
DataMatch Enterprise is one such one-stop solution that goes beyond name matching. Used by more than 4,000 organizations across 40 countries and recognized as the #1 data matching and data cleansing solution, it resolves modern data woes. The system implements a hybrid model for identifying and resolving variations in multiple data points.
Furthermore, it offers an API that integrates any of your data sources with the DataMatch Enterprise platform, where you can easily profile, clean, match and deduplicate your data.
Zurich Insurance – Case Study
Zurich Insurance, one of the largest insurance companies in Switzerland, was able to use DataMatch Enterprise to review its information and make sure that payments were processed correctly and without human error.
Their system did not have a hard edit function where payee names could be pre-populated, so those managing and entering information in the database could key in any type of information. Whenever a query was run against the main data warehouse, a long list of duplicate information would pop up.
The result? Vendor names were not being aggregated appropriately, causing massive headaches and operational inefficiency.
Using DataMatch Enterprise, the company was able to:
- Create accurate and confidential reports for the industry
- Fulfill data cleansing and fuzzy matching needs
- Process payments without human error
You can download the case study and discover how your organization can benefit from using an automated name matching software to clean and organize your data.
Conclusion – Your Organization Needs Data You Can Trust
Raw data is always error-prone. No matter what front-end systems you put in place, when a human being is filling in or giving information, there will always be issues with variations. If these issues are not resolved, they may turn into costly mistakes.
Because of bad data, organizations could be sued in class actions, lose customers, get bad reviews online, or even lose out to the competition.
An investment in name matching software & data cleansing solutions, therefore, is a necessity and not a luxury.