Using Data Ladder’s record linkage in enterprise-level healthcare systems

Any research initiative to resolve pressing health service and policy issues requires access to accurate, complete, valid, and consolidated patient and organizational data.


Despite advances in technology, healthcare systems in leading countries such as the US, Canada, and Australia still struggle with data quality challenges. Record linkage, i.e., the linking of data sets to obtain a consolidated view of an individual or a population, remains one of the most significant challenges in the healthcare industry.


The purpose of this paper is to demonstrate how Data Ladder’s DataMatch solution can help federal healthcare and government-sponsored institutions link administrative and health records within and between states, in light of policies that prevent the use of unique identifiers to link data.

Meeting the Unique Identifier Challenge in Federal Record Linkage Projects


For decades, the US healthcare system has suffered billions of dollars in losses because it lacks a unique number or code, comparable to an SSN, assigned to each individual. As a result, the system has faced significant challenges in linking patient records to perform functions such as delivering quality care, conducting longitudinal studies, and improving care overall.

Hospitals fail roughly half the time when attempting to match records between patients and between facilities. That harms not just patients but also employees, staff, and vendors. It is not uncommon to hear of medical procedures or diagnoses performed on the wrong patient with the wrong data. The challenge is exacerbated when information is shared federally between states for research or longitudinal studies: hundreds of thousands of records are duplicated or contain messy data.

Worse, records come from multiple states and sources, and mostly lack unique identifiers due to privacy concerns, making record linkage challenging.

This is where Data Ladder can help.

Using Data Ladder’s Fuzzy Matching Capabilities to Perform Record Linkage

Record linkage is performed via two techniques – deterministic and probabilistic. Data sets that carry unique identifiers such as social security numbers (SSNs) and medical record numbers (MRNs) are linked using exact-matching algorithms that rely on the accuracy of the code or number to identify matches.

Data sets that lack unique identifiers, on the other hand, must rely on other variables such as age, address, and first and last names to link records. This is done using probabilistic matching, which relies on fuzzy matching algorithms to match records on multiple factors such as numeric similarity, phonetic similarity, and more. Fuzzy matching algorithms are designed to cater to the increasing variability of data.

Deterministic linkage is relatively simple, but because even unique identifiers are susceptible to human error and inevitable miscoding, probabilistic matching of variables such as gender, DOB, and names is carried out in conjunction with deterministic methods to match information.
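To make the distinction concrete, here is a minimal Python sketch of both approaches. The field names, weights, and the 0.85 threshold are illustrative assumptions, not DME internals:

```python
# A minimal sketch contrasting deterministic and probabilistic matching.
from difflib import SequenceMatcher

def deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    """Exact match on a unique identifier such as an SSN or MRN."""
    return rec_a.get("ssn") is not None and rec_a.get("ssn") == rec_b.get("ssn")

def probabilistic_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Fuzzy match on quasi-identifiers when no unique ID is available."""
    name_sim = SequenceMatcher(
        None,
        f"{rec_a['first']} {rec_a['last']}".lower(),
        f"{rec_b['first']} {rec_b['last']}".lower(),
    ).ratio()
    dob_match = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    # Weighted score across fields; real engines add phonetic and numeric measures.
    score = 0.7 * name_sim + 0.3 * dob_match
    return score >= threshold

a = {"ssn": None, "first": "Jon", "last": "Smith", "dob": "1980-03-14"}
b = {"ssn": None, "first": "John", "last": "Smyth", "dob": "1980-03-14"}
print(probabilistic_match(a, b))  # True: similar names plus identical DOB
```

Note how the pair above would never link deterministically (no SSN), yet the combination of a near-identical name and an identical DOB pushes it over the threshold.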

Regardless of the record linkage method, one condition is absolute – for records to be accurately matched, the data must be cleansed and prepared before it is put up for matching. Data Ladder’s DataMatch Enterprise (DME) is a one-stop solution that provides pre-processing functions such as:

Data profiling: identifying the types of errors in the data set and the ‘health percentage’ of each data column. For instance, profiling highlights the number of rows with missing last names in the LastName column, as well as fields containing stray spaces, punctuation marks, and so on.

Data cleansing and standardization: users can clean and transform data by simply clicking built-in standardization options. Additional rules can be created using the platform’s parsing tools, which help break data down into components. A rough sketch of the profiling step follows below.
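As an illustration of what the profiling step surfaces, the following pandas sketch computes a per-column ‘health percentage’ and flags values with stray spaces or punctuation. The column names and sample data are hypothetical:

```python
# A rough pandas sketch of per-column profiling.
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["Mary", "  mary ", "JOHN", None],
    "LastName":  ["Jones", "Jones", None, "Lee"],
})

# Profiling: percentage of non-missing values per column.
health = df.notna().mean().mul(100).round(1)
print(health)  # e.g. LastName 75.0 -> one row has a missing last name

# Flag values with stray punctuation or leading/trailing spaces.
messy = df["FirstName"].str.contains(r"^\s|\s$|[^\w\s]", regex=True, na=False)
print(df[messy])
```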

Once the quality of data is optimized, users can then initiate the record linkage process.

DME has been recorded as having the highest accuracy rate for inexact matching, at 95%, compared with the industry standard of 80–85%.

With DME, users can:

  1. Ensure Accuracy with Code-free Data Parsing

DME offers a code-free solution to data cleansing and record linkage. Once a data source is integrated into the platform, the user follows a sequence of functions: data profiling, then data cleansing and standardization, then address validation, and finally data matching and consolidation.

During the cleaning and standardization process, users can benefit from two tools exclusive to the DME platform – the Pattern Builder and the WordSmith.

The Pattern Builder:

Parsing is an important function in record linkage, especially when matching must be done on specific components, such as CityName to CityCode or CountryName to CountryCode. For this purpose, data must be broken down into components to make the match possible. DME allows the use of regular expressions to create rules specific to business requirements.

For instance, full names can be parsed into first, middle, last, and maiden names; dates of birth can be parsed into month, day, and year of birth; and addresses can be parsed into street, city, state, and ZIP code.

Or, if a user needs to match [Last Name] together with the year from a DOB column stored in MM/DD/YYYY format, the Pattern Builder can parse that single column into three new columns – one each for MM, DD, and YYYY – allowing the user to match on the YYYY component.
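The Pattern Builder itself is point-and-click, but a short Python sketch can mirror the effect of such a regex rule. The pattern and function below are illustrative assumptions, not DME’s internal implementation:

```python
# A minimal sketch of a regex-based parsing rule for MM/DD/YYYY dates.
import re

DOB_PATTERN = re.compile(r"^(?P<MM>\d{2})/(?P<DD>\d{2})/(?P<YYYY>\d{4})$")

def parse_dob(dob: str) -> dict:
    """Split an MM/DD/YYYY string into three components."""
    m = DOB_PATTERN.match(dob.strip())
    if not m:
        return {"MM": None, "DD": None, "YYYY": None}
    return m.groupdict()

print(parse_dob("03/14/1980"))  # {'MM': '03', 'DD': '14', 'YYYY': '1980'}
# The YYYY component can now be matched alongside [Last Name].
```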

The WordSmith:


Another tool exclusive to Data Ladder is the WordSmith.

When matching your data, the presence of noise (repetitive names or variant spellings of the same information) results in missed matches and false positives.

For instance, in the [HospitalName] column, you may see one hospital’s name written in three different ways:

  • Mount Sinai Hospital
  • Mt. Sinai Hosp
  • Mt. Sinai Hospital New York


You know these three names refer to the same hospital, but during data matching those matches would be missed. With WordSmith, you can identify such instances by setting the maximum number of words in a group to 2 or 3, then using the replace option to replace the three phrases with your preferred phrase or format.

The change is automatically applied to every instance in your data source, regardless of record size or frequency.
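Conceptually, the replacement WordSmith performs resembles mapping known variants to one canonical form, as in this illustrative Python sketch (the mapping itself is hypothetical):

```python
# A sketch of variant-to-canonical standardization for hospital names.
CANONICAL = {
    "mount sinai hospital": "Mount Sinai Hospital",
    "mt. sinai hosp": "Mount Sinai Hospital",
    "mt. sinai hospital new york": "Mount Sinai Hospital",
}

def standardize(name: str) -> str:
    """Replace a known variant with the preferred form; pass others through."""
    return CANONICAL.get(name.strip().lower(), name)

records = ["Mount Sinai Hospital", "Mt. Sinai Hosp", "Mt. Sinai Hospital New York"]
print([standardize(r) for r in records])
# ['Mount Sinai Hospital', 'Mount Sinai Hospital', 'Mount Sinai Hospital']
```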

The parsing feature can be used in hundreds of ways. Another use case is phone numbers that carry the area code as a prefix. You can separate area codes from phone numbers for standardization purposes: simply parse area codes into a separate column and, if desired, assign a replacement value so the new column lists the area name rather than the area code.

You can then delete the area codes from the original column. All of this can be done within WordSmith with a few simple point-and-click actions, and the changes are applied to your data in bulk.
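As a rough sketch, the same idea can be expressed in Python with a regular expression that splits a US-style phone number into area code and local number. The assumed format is illustrative:

```python
# A sketch of splitting an area code out of a US-style phone number.
import re

# Assumed format: optional parentheses around a 3-digit area code.
PHONE = re.compile(r"^\(?(?P<area>\d{3})\)?[ .-]?(?P<local>\d{3}[ .-]?\d{4})$")

def split_phone(raw):
    """Return (area_code, local_number), or (None, None) if unparseable."""
    m = PHONE.match(raw.strip())
    return (m["area"], m["local"]) if m else (None, None)

print(split_phone("(304) 555-0182"))  # ('304', '555-0182')
```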


  2. Data Cleansing & Standardization for Record Linkage

In a record linkage process, the objective of data cleansing is to improve the quality of the linkage itself. This includes minimizing the number of duplicate records, ensuring the consistency of data, and removing errors that degrade information quality. Without data cleansing, many truly matched records may never be discovered because the relevant fields are not sufficiently similar.


DME allows for easy, point-and-click data cleansing. Unlike ETL tools or Excel, no manual scripting is involved. With DME, users can:

  • Transform poor data by simply clicking checkboxes
  • Normalize text style
  • Remove unwanted characters
  • Remove accidental data-entry typos (these are hard to catch!)
  • Clean up spaces between letters and words
  • Transform nicknames into actual names (John instead of Johnny)
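As a rough sketch of what these cleansing steps amount to in plain Python (the nickname table is a tiny illustrative stand-in for DME’s built-in rules):

```python
# A sketch of the cleansing steps listed above.
import re

NICKNAMES = {"johnny": "John", "bill": "William", "liz": "Elizabeth"}

def clean_name(raw: str) -> str:
    value = raw.strip()                           # trim stray spaces
    value = re.sub(r"[^A-Za-z\s'-]", "", value)   # remove unwanted characters
    value = re.sub(r"\s+", " ", value)            # collapse runs of spaces
    value = value.title()                         # normalize text style
    return NICKNAMES.get(value.lower(), value)    # nickname -> actual name

print(clean_name("  johnNY!!  "))  # John
```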

Minor errors in data can result in false positives and false negatives during a data match operation. Researchers end up spending hours or days reviewing every false positive, while false negatives remain an undetected threat harming the integrity and quality of information. Hence, for a record linkage process to succeed, clean data is mandatory.

  3. Minimum Effort Required to Perform Record Linkage at Multiple Levels

There are two ways to perform record linkage.


The traditional way: record linkage is performed at multiple levels of complexity, so the person writing the code must possess considerable programming expertise and be familiar with linkage theory – talent that is expensive and beyond most federal budgets. Furthermore, manual coding often yields unsatisfactory match results, multiplying the workload.

The easy & accurate way: code-based processes are no longer effective for managing large amounts of complex data; ML-based solutions are a better, faster alternative. A tool like DataMatch Enterprise performs data matching at multiple levels, letting users dig deeper into their data.

DataMatch Enterprise allows users to:

  • Perform record linkage based on the cross-jurisdictional model used for health-related research in Australia, in which multiple states share and link data to enable research at the national or cross-jurisdictional level.
  • Run ongoing linkage, where a consolidated record is created, maintained over time, and re-used across multiple research projects. New records are cleansed, deduped, amended, and added to this record, improving linkage quality and reducing effort.
  • Handle large-scale record linkage (over 10 million records) across multiple data sets, linking between, across, or within data sets simultaneously (a generic blocking sketch follows this list).
  • Maintain a history of changes and record linkages, which researchers and regulatory personnel often require to determine how specific records were linked together.
  • Use multiple approaches to record linkage, such as linking all records in a data set to all other records in the system, or linking between and across multiple data sets.
  • Delete, amend, and update records as needed with historical changes intact; users can always return to a previous version, as each activity is stored in a local folder.
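For the large-scale case, a standard technique for keeping tens of millions of comparisons tractable is blocking: only records that share a block key are compared pairwise. The sketch below is a generic illustration of the technique, not DME’s internal algorithm:

```python
# A generic sketch of blocking for large-scale record linkage.
from collections import defaultdict
from itertools import combinations

def block_key(rec):
    # Illustrative key: first three letters of the last name plus birth year
    # (DOB assumed here to be in ISO YYYY-MM-DD format).
    return rec["last"][:3].lower() + rec["dob"][:4]

def candidate_pairs(records):
    """Yield only the record pairs that share a block key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)  # pairwise comparison within a block
```

Instead of the roughly 5 × 10^13 pairwise comparisons that 10 million records would otherwise require, only pairs within each small block are scored, which is what makes enterprise-scale linkage feasible.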

Conducting record linkage at large scale is complex and resource-intensive. With a tool like DME, users get flexible, scalable, reliable, and accurate enterprise-level linkage. As demand for database linkage grows, requests for a robust data matching solution are becoming more frequent. Over the years, Data Ladder has helped several healthcare organizations with data cleansing and large-scale record linkage.

Case Study: West Virginia University – When Disparate Data and Duplicates become Overwhelming

West Virginia University is one of the best examples of record linkage for large-scale research. The university was tasked with assessing the long-term impact of certain medical conditions on patients over extended periods.

The problem was to evaluate how previous conditions affected the long-term health and care of those patients; the challenge lay in examining and studying the data accurately and in a timely manner.

Unfortunately, the data lived in separate database systems, and researchers needed a way to link the database records so accurate data could be used in their work. With a restricted budget and a limited timeframe, performing this task manually would have been overwhelming, with no guarantee of accuracy or quality.

The team searched for a solution that could perform data matching at the required level of accuracy. After testing several tools, the team opted for DME for key reasons including:

An intuitive user interface built on a point-and-click design: users did not have to key in manual instructions, instead following an easy step-by-step process.

The ability to easily integrate multiple data sources without extra plugins or integrations, allowing the consolidation of multiple data sources that were later used for data matching.

Effective duplicate handling: duplicates were the biggest obstacle to using the information effectively. Manual methods only detected fields with exact matches and could not detect similar matches. With DME, the data was standardized, cleansed, and then matched for duplicates; once duplicates were removed, clean records were created to perform the needed record linkage.

Easy standardization and data transformation that only required clicking built-in options. Users who wanted specific business rules were guided by the support team to create regular expressions enabling a deeper breakdown of the data.

With DME, West Virginia University was able to derive accurate data for analysis and insights within a short amount of time. What would have taken the university months to achieve was completed in just a week.

Case Study: St. John’s Associates – When Manual Methods Fail & Integrity is at Risk

St. John’s Associates provides placement and recruiting services across a multitude of departments, ranging from gastroenterology to neurological surgery and a dozen others. The organization obtains talent profiles from multiple sources – social media, web forms, third-party vendors, etc. – making dirty data and duplication inevitable.

The company relied on manual record-cleansing methods such as ETL scripts to clean and dedupe its records. Over the years, though, the database grew exponentially, and the organization found it difficult to maintain data quality with traditional methods. It decided to invest in an automated tool that would allow it to clean records, saving hundreds of hours of manual effort. The organization specifically wanted a tool with an intuitive, code-free interface that would let users clean and match data as needed. DME became the tool of choice for its point-and-click interface and built-in rules that enable non-IT users to clean, standardize, and match data.

Conclusion

The healthcare industry is buckling under the weight of poor data and outdated technologies, with dangerous repercussions. A botched study caused by obsolete data may prevent necessary funding. A skewed report may downplay the urgency of a disease. A poor linkage may cause fatal errors.

Data impacts real life in ways we can never truly fathom.

This makes it even more important for your organization to reduce its reliance on traditional methods that no longer hold up to modern data structures, and instead opt for automated solutions that help you meet project deadlines without overwhelming your staff.

We know the stakes are high. We know the consequences of inaccurate records. We understand the implications of poor matching results. And because we know all this, Data Ladder’s teams work closely with you to help you optimize the quality of your data and ensure that your record linkage results meet the highest level of accuracy.

Want to know more?

Check out DME resources

Merging Data from Multiple Sources – Challenges and Solutions
