Identity resolution for government and public sector institutions

Match records across disparate systems, datasets, and jurisdictions and enhance operational efficiency with identity resolution solutions for government and public sector institutions

Duplicate and inaccurate data records are a universal problem. In government databases, however, they can cause real damage, and examples are everywhere.

Take, for example, one state that ran a test using an old program to analyze how many students from a specific city went on to post-secondary education in one year. Using the outdated system, the state found that 22% of 5,344 students in the city had gone on to higher education. When the state used Data Ladder’s identity resolution software, the number went up to 41%. This was just a small experiment conducted by the state to compare the accuracy of two systems. Imagine the consequences of using the flawed figure to obtain a grant or publish a study.


At Data Ladder, we have worked with the U.S. Department of Education, Department of Transportation, and Department of Treasury to ensure that government and public organizations can get a single view of truth through the implementation of various data cleansing and matching techniques.

Identity Resolution as the Stepping Stone to Data Quality Management

Data is the lifeline of any government organization. Here’s another example of how poor data quality can lead to inaccurate information resulting in actual damages.


In California, account records at the State Controller’s Office contained sick leave and vacation credits that were inaccurately recorded. Instead of 8 hours, some fields showed 80 or even 800. The system, being outdated, didn’t have the controls to prevent this error. The audit found 200,000 inexplicable hours of leave due to data errors. Those leave hours had a value of $6 million.


Data quality, therefore, isn’t just an attempt at modernizing processes – it is quite literally an attempt at preventing costly damages to government offices as well as citizens. And this is where identity resolution comes in.

In this whitepaper, you will be introduced to the concept of identity resolution, the cause of identity resolution challenges, the different identity resolution methods/solutions available, and what factors to look for when choosing a vendor or software solution for your institution.

The Challenge of Unique Identities


At the most basic level, data is manually collected and stored against multiple fields such as name, age, DOB, address, etc. Often, the data entry is performed by a human (either by a government officer or by the citizens themselves) which means there is always a chance of data errors. Some of the key causes of errors are:

Typos and short forms or abbreviations entered in a hurry.
Language barriers if names are non-English.
Cultural nuances in first names or surnames.
Interpretation of abbreviations.
Nicknames vs real names.
Changes in contact information that were not recorded or updated properly, resulting in duplicates.

All these circumstances create a significant challenge in obtaining a unified entity view. The fundamental objective is to achieve a single source of truth, or to have different instances in the same (or multiple) data sets point to the same real-world entity.


The expectation is that a unified data asset will lead to improved processes, accurate insights and analysis, and effective communication. The search for connections between records, matching, and unification of data records is a data consolidation effort.


Data consolidation or data integration has a fundamental challenge – how do you determine if two records refer to the same entity? In rare cases, you also have to be certain that two records do NOT point to the same individual. This distinction is difficult to determine in a database riddled with duplicate or inaccurate data.


This is where the need for identity resolution comes in – a method used to find connections between records. The goal of identity resolution is to empower organizations to search, match and identify data. The methodology is applied to scan large databases across multiple jurisdictions, states, and even countries.


In the absence of unique identifiers such as SSNs, Passport numbers or other similarly sensitive information, identity resolution is a methodology that can provide an accurate view of the entity.


Identity resolution takes into its stride data profiling, data analysis, and data standardization, the three critical steps in cleaning data before records are compared. The method is performed via the use of algorithms that parse, standardize, normalize, and finally compare data values.
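The parse, standardize, normalize, and compare sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TITLES` set, the function names, and the choice to strip accents and punctuation are hypothetical and do not describe how any particular product works.

```python
import re
import unicodedata

# Hypothetical list of honorifics to ignore; a real list would be much longer.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def normalize_name(raw):
    """Return a lowercase, punctuation-free, title-free version of a name."""
    # Normalize: decompose accented characters and drop the accent marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Standardize: lowercase, then turn anything that is not a letter into a space.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Parse: split into tokens and drop honorifics.
    tokens = [token for token in text.split() if token not in TITLES]
    return " ".join(tokens)


def same_name(a, b):
    """Compare two names only after both have been cleaned."""
    return normalize_name(a) == normalize_name(b)
```

For example, `normalize_name("Dr. Cynthia  SMITH")` returns `"cynthia smith"`, so it compares equal to a plain `"Cynthia Smith"` entry.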

Redundancy & Errors in Entities

For identity resolution, let’s consider an entity as a single, primary data concept that is defined by its attributes. Common entities are ‘customers’, ‘events’, ‘employers’, etc. In a perfect world, each of these entities will have unique identifiers that can differentiate them from other instances within the dataset.


However, in the real world, there are no unique names or dates of birth. There are unique identifiers such as SSNs, passport numbers, license numbers, etc., but those identifiers are often confidential and are not used when data is shared.


As a simple example, consider the [name] token of an entity. Any real-world object has some kind of name that is used for reference, but the name is not unique. Cynthia Smith may also be known as Dr. Cynthia or Dr. Smith, or Dr. C.N. Smith, each name with a context in a relevant database. Cynthia’s name in the Department of Public Safety may be Cynthia Naomi Smith, but in the hospital where Cynthia is working, her entry may simply be, Dr. C.N. Smith.


If the Department of Public Safety and the hospital ever have to share data for analysis or research, Cynthia’s record will not be standardized across the two systems. It may then be necessary to determine whether Cynthia Naomi Smith and Dr. C.N. Smith are the same person.

It is equally possible that they are NOT the same person. Being able to establish either outcome is critical, especially since there may be millions of similar instances. Although these examples sound simple, they become highly complex when databases hold billions of rows of records collected over decades. The name problem is just one of the challenges of data redundancy.


Identity resolution, therefore, searches for and compares the characteristics or attributes of an entity to determine the match. In the case of Cynthia Smith, for example, other attributes such as phone number, DOB, address, employer, and any other primary key are searched to determine the match.
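As a rough sketch of that idea, the snippet below compares the two Cynthia records on supporting attributes rather than on the name. The `Record` class, the field choices, and every sample value are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    dob: str      # ISO date string
    phone: str
    address: str


def digits_only(value):
    """Keep only the digits so differently formatted phone numbers compare equal."""
    return "".join(ch for ch in value if ch.isdigit())


def agreeing_attributes(a, b):
    """List the supporting attributes on which two records agree."""
    agree = []
    if a.dob == b.dob:
        agree.append("dob")
    if digits_only(a.phone) and digits_only(a.phone) == digits_only(b.phone):
        agree.append("phone")
    if a.address.strip().lower() == b.address.strip().lower():
        agree.append("address")
    return agree


# Made-up sample records for the two systems in the example.
dps = Record("Cynthia Naomi Smith", "1980-04-12", "(555) 010-1234", "12 Elm St")
hospital = Record("Dr. C.N. Smith", "1980-04-12", "555-010-1234", "12 elm st")

print(agreeing_attributes(dps, hospital))
```

The names differ, yet all three supporting attributes agree, which is the kind of evidence used to decide whether the two entries describe one person.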

For government organizations on a quest for a single source of truth, it’s not just about identifying records – it’s about establishing the relationship of identifiers or markers to the main entity with precision. And that’s where it gets a little slippery.


Determining what values to use as identifiers is a critical task, formally known as setting a ‘personally identifiable information’ (PII) parameter. Sometimes the PII parameter is set on a field the database already holds (registration numbers are good PIIs). Other times, if the PII is confidential (such as an SSN), regular attributes such as DOB, address, and zip code are combined to get a single PII, even though these units on their own would not qualify as PII. Here, it is essential to note that any identity resolution solution would first parse and standardize data before the matching process. Once done, it makes use of various techniques to carry out the match.
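Combining regular attributes into a single identifier might look like the following minimal sketch. The function name, the choice of fields, the five-digit zip truncation, and the `|` separator are all assumptions made for the example.

```python
def composite_key(dob, zip_code, address):
    """Build one identifier from attributes that are not unique on their own."""
    parts = [
        dob.strip(),
        zip_code.strip()[:5],              # keep the five-digit zip, drop any +4 suffix
        " ".join(address.lower().split()),  # lowercase and collapse repeated spaces
    ]
    return "|".join(parts)


print(composite_key("1980-04-12", "90210-1234", "12  Elm St"))
```

Each part is standardized before it is joined, so two records that differ only in spacing, letter case, or zip suffix produce the same key.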

Identity Resolution Techniques

There are two ways to perform an identity resolution process – get your in-house team of developers to build a custom solution, or use specialized data matching software. We’ll come back later to these two choices and see which one works best for your organization. For now, let’s discuss some straightforward techniques that are used in an identity resolution process.

Exact Match Technique: The simplest method, it uses a common identifier found on all records being sorted. This is the most basic form of identity resolution and makes use of fields such as account numbers or registration numbers that can be found across multiple databases. For records that do not require deep-level cleansing or matching, the exact match technique is the most efficient.
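A minimal sketch of an exact match, assuming both datasets are lists of dictionaries that share a `registration_no` field (a hypothetical field name, with made-up rows):

```python
def exact_match(left, right, key):
    """Pair rows from two datasets whose shared identifier is identical."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


# Invented sample data for two departments.
dmv = [
    {"registration_no": "R-1001", "name": "Cynthia Naomi Smith"},
    {"registration_no": "R-1002", "name": "Alex Doe"},
]
tax = [
    {"registration_no": "R-1001", "name": "C. N. Smith"},
    {"registration_no": "R-2000", "name": "Sam Lee"},
]

pairs = exact_match(dmv, tax, "registration_no")
print(len(pairs))
```

Only the row with `R-1001` appears in both lists, so one pair is returned. Any typo in the identifier itself would cause a miss, which is why this technique suits clean, shared keys.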

Deterministic Match Technique: Personally identifiable information (PII) such as phone numbers, emails, and names is used to identify the entity. The deterministic approach relies on the master ID that is set in the database to identify the entity. For this method to work, the master ID should remain unchanged.
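One way to picture the master ID is a rule that gives every record with the same normalized email the same ID. This is a simplified sketch, not a description of any vendor's implementation; the email rule, the field names, and the sample addresses are assumptions.

```python
def assign_master_ids(records):
    """Give records that share a normalized email address the same master ID."""
    master_by_email = {}
    linked = []
    for record in records:
        email = record["email"].strip().lower()
        # Reuse the existing ID for a known email, otherwise issue the next one.
        master_id = master_by_email.setdefault(email, len(master_by_email) + 1)
        linked.append({**record, "master_id": master_id})
    return linked


# Invented sample rows from three systems.
rows = [
    {"source": "licensing", "email": "C.Smith@example.org"},
    {"source": "benefits", "email": "c.smith@example.org "},
    {"source": "licensing", "email": "a.doe@example.org"},
]

for row in assign_master_ids(rows):
    print(row["source"], row["master_id"])
```

The first two rows receive master ID 1 and the third receives 2. The rule is all-or-nothing: if the email changes, the link is lost, which mirrors the requirement that the master ID stay unchanged.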

Probabilistic Match Technique: Often discussed alongside deterministic matching, probabilistic matching is similar, with one difference: it works on probability. If an entity has the same phone number and address across different data sets, then it is ‘probable’ that it is the same entity. Probabilistic matching isn’t as accurate as deterministic matching, but its main advantage is scale. Say you want to match information on a third or fourth data set that may not have the deterministic identifier; in that case, probabilistic matching works best.
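A simple way to illustrate the idea is a weighted score: each field that agrees adds evidence, and a pair is treated as a probable match once the total passes a threshold. The weights and the 0.6 threshold below are invented for the example; in practice they would be tuned or estimated from the data.

```python
# Hypothetical weights: how strongly agreement on each field suggests a match.
WEIGHTS = {"phone": 0.4, "address": 0.3, "dob": 0.2, "last_name": 0.1}


def match_score(a, b):
    """Sum the weights of the fields that are present in both records and equal."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a and value_b and value_a == value_b:
            score += weight
    return score


def is_probable_match(a, b, threshold=0.6):
    """Treat a pair as the same entity when enough evidence accumulates."""
    return match_score(a, b) >= threshold
```

Two records that share a phone number and an address score about 0.7 and pass the threshold even when no deterministic identifier is available. Sharing only a last name scores 0.1 and does not.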

Similarity Match Technique: While deterministic and probabilistic methods rely on the same data values across datasets, a similarity match takes in small variations that are missed by other matching techniques. It’s one of the most popular matching techniques because it is accurate, scalable, and can be applied to various complex datasets. For similarity matching to deliver results with precision, data cleaning and data standardization are highly recommended.
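The sketch below uses `SequenceMatcher` from Python's standard `difflib` module as a stand-in for a similarity measure. The source does not say which algorithm is used, and the 0.85 threshold is an arbitrary choice for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score between 0 and 1 for how alike two strings are."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_similar(a, b, threshold=0.85):
    """Accept small spelling variations that an exact comparison would reject."""
    return similarity(a, b) >= threshold


print(is_similar("Cynthia Smith", "Cynthia Smyth"))
print(is_similar("Cynthia Smith", "Robert Jones"))
```

The first pair differs by one letter and passes; the second pair does not. Cleaning the values first matters because stray titles or punctuation would lower the score.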

All of the above techniques can be used in combination or on their own for an identity resolution objective. The effectiveness of the technique will depend on the quality of data, the number of databases, and the type of datasets involved.

How Does Identity Resolution Benefit Government and Public Sector Institutes?

Unlike business databases, government databases have a direct impact on citizens’ lives as well as on the operations of the state. Identity resolution can be used by any department or public institution to perform key operations such as prevention of identity fraud and ensuring public safety, research and analysis, grant raising and funding, political analysis, and much more.

A quick walkthrough:


Public Safety and Prevention of Identity Fraud: One of the most critical advantages of identity resolution is public safety and prevention of identity fraud. Identity resolution can tap into millions of profiles and create a match of an individual based on various identifiers and data fields. For departments dealing with crime, identity fraud, immigration, and other public domains, identity resolution can be used to determine attributed identity (objects that link to an identity, such as ID cards and driver’s licenses), biographical identity (social activities recorded in a public or private domain), and biometric identity (DNA/fingerprints).


Ensuring Compliance Against Sanction Lists: Financial institutions must comply with a number of anti-money laundering (AML) rules and regulations to ensure that they do not carry out any business or trading with businesses blacklisted in sanction lists. Identity resolution can be used to screen customer records against sanction lists such as those of the UN Security Council, the US Treasury Department’s Office of Foreign Assets Control, and the EU to prevent the possibility of money laundering.
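Screening against a list can be sketched with `get_close_matches` from Python's standard `difflib` module. The company names and the 0.85 cutoff below are made up, and a real screening process would involve far more than name comparison.

```python
from difflib import get_close_matches


def screen(customers, watchlist, cutoff=0.85):
    """Map each customer name to the watchlist entry it closely resembles, if any."""
    by_lowercase = {name.lower(): name for name in watchlist}
    hits = {}
    for customer in customers:
        close = get_close_matches(customer.lower(), list(by_lowercase), n=1, cutoff=cutoff)
        if close:
            hits[customer] = by_lowercase[close[0]]
    return hits


# Invented names for illustration only.
watchlist = ["Acme Trading LLC", "Northwind Holdings"]
customers = ["ACME Tradng LLC", "Contoso Ltd"]

print(screen(customers, watchlist))
```

The misspelled "ACME Tradng LLC" is flagged against "Acme Trading LLC", while "Contoso Ltd" produces no hit.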


Cross-Jurisdictional Research: Identity resolution is very effective for public sector institutes, especially when data has to be analyzed to determine eligibility for a grant or for research. Often, institutions in education and healthcare need cross-jurisdictional data matching for research and funding purposes. When states share data, PII fields such as SSNs, ID numbers, etc. are removed. This is where states have to rely on other primary indicators such as phone numbers and addresses to get accurate information. Identity resolution works best when working with cross-jurisdictional data.


Delivering Personalized Citizen Engagements: Whether it’s a business or a government entity, there is an increasing need for delivering personalized engagement. Identity resolution allows the organization to connect with citizens in new and innovative ways. The point of concern here is citizen satisfaction with government services. Citizens want digital services that save them time, are convenient and easy to use, and ideally can be accessed right from their phone. In one survey, 73% of citizens said they would provide biometric data such as fingerprints and retinal scans just to get more personalized services. With this level of expectation, it is only natural for government institutes to seriously consider optimizing their data to serve citizens.


Political Analysis: If there is one area where identity resolution is used to its maximum potential, it is in politics, especially during election times. Multiple databases are used, including the National Election Pool’s exit poll, the U.S. Census’ Current Population Survey, the American Trends Panel, and state-level voter records. An analysis of all these databases gives a rich picture of a citizen’s vote history, including who they are likely to support in an upcoming election. In modern times, citizens’ online behavior data is also studied along with their voting data to determine the likelihood of their support for a political party. Identity resolution is used so effectively that analysts can determine which candidate a Hispanic female is more likely to vote for.

There is no limit to the potential of identity resolution. All that’s needed is a clean database and a powerful identity resolution tool. It’s important to remember that for identity resolution to be effective, data needs to be complete, standardized, and deduplicated.

Building a Solution In-house vs Using an Identity Resolution Software

Restricted budgets, talent constraints, and excessive overload of duties and tasks with limited staff are some of the key reasons why government and public organizations are slow on digital data growth.


Thankfully, identity resolution can now easily be performed via software solutions that are designed to handle large databases, providing data cleansing, data profiling, data standardization, and a lot more.


Of course, there is always the option of hiring data experts in-house versus using software, but here’s a quick breakdown of the cost of hiring in-house.

If you don’t build it
Annual cost of paying for a commercial solution: 15,000 USD

If you build it
Number of employees required: 6
Average employee salary + benefits: 120,000 USD
Weeks to build: 20
Days per month of maintenance: 3

Cost to build: $275,975+
Annual maintenance: $70,965
Saved annually: -$55,965
Years to savings: Never
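The source does not state how these figures are calculated. They are, however, reproduced by prorating the team's total annual salary over a 365.25-day year, which the sketch below assumes. The function and variable names are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumption inferred from the published figures, not stated in the source


def build_vs_buy(employees, salary, weeks_to_build, maintenance_days_per_month,
                 commercial_annual_cost):
    """Estimate in-house build and maintenance cost against a commercial license."""
    daily_team_cost = employees * salary / DAYS_PER_YEAR
    cost_to_build = daily_team_cost * weeks_to_build * 7
    annual_maintenance = daily_team_cost * maintenance_days_per_month * 12
    saved_annually = commercial_annual_cost - annual_maintenance
    # If maintenance alone costs more than the license, the build never pays back.
    years_to_savings = cost_to_build / saved_annually if saved_annually > 0 else None
    return {
        "cost_to_build": round(cost_to_build),
        "annual_maintenance": round(annual_maintenance),
        "saved_annually": round(saved_annually),
        "years_to_savings": years_to_savings,
    }


print(build_vs_buy(6, 120_000, 20, 3, 15_000))
```

With the inputs listed above, this yields a build cost of 275,975, annual maintenance of 70,965, and annual savings of -55,965. Because the maintenance cost alone exceeds the license fee, the years-to-savings value is `None`, matching "Never".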

And this is just the tip of the iceberg. For government organizations, even a paid solution may be considered costly since these organizations operate on a very restricted budget. This leaves no room for hiring in-house talent, especially since most technical talent is more attracted to private-sector jobs than to government.


You can read a complete analysis of why in-house data quality solutions fail in this extensive white paper.


That leaves us with a paid software solution. Since digital data management and data quality have been the central point of focus for over two decades now, there are plenty of automated solutions out there that can be used to accomplish identity resolution and data quality goals.

Having said that, it should be noted that not every solution provides optimal accuracy. The market standard accuracy rate of matching software solutions is 88% whereas Data Ladder’s matching accuracy is 96% – as proven in 15 independent studies conducted by different universities.

What Should You Look for In a Commercial Software Solution?


It goes without saying that match accuracy is the most important factor to look for when choosing a paid solution. Beyond match accuracy, other factors determine whether a solution suits your needs.

  1. Full-Spectrum Data Quality: Identity resolution is not just a data-matching process. Before you compare data, you have to ensure that it is clean. To this effect, there are three types of data cleansing functions that any market-competitive data solution must offer:
  • Data Profiling: Most of the time, data fields are left empty or have invalid, outdated information. Data profiling is a feature that lets you identify incomplete and invalid fields in your data set and prepare it for the cleansing process.
  • Data Standardization: Is New York written as NY, NYC, or N.Y.? This is a universal problem with most databases. To resolve it, you need to set automated standards for your data fields. The software should let you do this by automatically applying the standards you define.
  • Data Deduplication: This is the process where duplicates are identified and removed. Duplicates cause a significant problem for databases and are largely responsible for skewed statistics. Once data is deduped, you can get an accurate picture of your data.
  2. Integration with Systems: Be it Oracle or your own ERP, you would want to easily connect your database to the data-matching software. Once you connect your database to the software, you should be able to cleanse, dedupe, standardize, and enrich your data to create a 360-degree customer view without leaving the platform.
  3. Cross-Jurisdictional Matching Capability: As discussed before, government institutions often have to share data and conduct cross-database analysis. For this purpose, the software needs to have the capability to perform operations on multiple datasets without compromising precision and accuracy.

  4. Easy to Use: Some renowned data cleansing solutions require learning additional languages as well as trained specialists to run the software. Because government organizations have restricted time and budget, software that is not easy to use is counter-productive to identity resolution goals.

  5. Great Customer Support: Although most data quality management tools are easy to use, you do need customer support for the initial deployment and training. When choosing a data solution, it is recommended to check customer reviews and reach out to different platforms to get an idea of the effectiveness of their customer support.
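The three data cleansing functions listed under Full-Spectrum Data Quality can be pictured with a small Python sketch. Everything here is illustrative: the `CITY_STANDARD` mapping, the field names, and the sample rows are invented, and commercial tools offer far more than this.

```python
# Hypothetical mapping of known variants to one standard value.
CITY_STANDARD = {"ny": "New York", "nyc": "New York", "n.y": "New York",
                 "n.y.": "New York", "new york": "New York"}


def profile(rows, fields):
    """Data profiling: count empty or missing values for each field."""
    return {f: sum(1 for r in rows if not (r.get(f) or "").strip()) for f in fields}


def standardize_city(value):
    """Data standardization: map known variants to the standard spelling."""
    cleaned = value.strip()
    return CITY_STANDARD.get(cleaned.lower(), cleaned)


def deduplicate(rows, key_fields):
    """Data deduplication: keep the first row for each case-insensitive key."""
    seen, unique = set(), []
    for row in rows:
        key = tuple((row.get(f) or "").strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample rows.
rows = [
    {"name": "Cynthia Smith", "city": "NYC"},
    {"name": "cynthia smith", "city": "N.Y"},
    {"name": "Alex Doe", "city": ""},
]

print(profile(rows, ["name", "city"]))
cleaned = [{**r, "city": standardize_city(r["city"])} for r in rows]
print(len(deduplicate(cleaned, ["name", "city"])))
```

Profiling reports one empty city field. After standardization, the two Cynthia Smith rows share the same city value, so deduplication reduces three rows to two. Running deduplication before standardization would leave all three rows in place, which is why the order of the steps matters.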

Conclusion

As digital technology is taking over traditional methodologies, data is quickly becoming a central theme that concerns all facets of our digital society. The rapid evolution of technology accelerated by the internet, smartphones, and social media has enabled global connectivity on an unprecedented scale. This connectivity has also given rise to enormous volumes of data which has made data one of the most valuable assets of the modern world.

Identity resolution, therefore, is no longer a fancy concept – it is a requirement that must be fulfilled if institutes and government departments have the objective of making the most out of their data. However, there is more to identity resolution than just using matching techniques to identify data. It is an important part of a data quality management initiative – in fact, it is the first step to take if your goal is to implement a data quality management process in the coming years.

