By Jack Bradford, Senior Product Specialist
Data preparation and matching is a growing profession that requires an in-depth analysis of the scope of a project, as well as an intimate knowledge of the data that will be processed.
The growing trend is to have a full-time position or even a department that is dedicated to data management. The main focus of this department is to continuously scrutinize the criteria that are used to refine the data quality.
- In most cases, we want to find two or more instances of the same record across one or more data sources
- Merge data into one record where different information exists for the same record in different systems
- De-duplicate data sources so that one master record exists
- Standardize the record information so that different systems may access the same data source with ease
So now that we know what we want, how do we get there? Before we can start a project, we need to understand the data that we will be using. The first step in understanding a data source is conducting a data profile. This profiling step consists of a few variables:
1. The following questions should be asked when attempting to understand what a data source comprises:
- Are we dealing with string data, dates, numeric, or irregular characters?
- Is the data complete? Are there many null values within the records?
- How “clean” is the data? Are there instances of leading or trailing spaces? What about non-printable characters?
- Are the fields standardized? Do the date-time fields contain dates only, or do they also include time?
2. Many times, the teams that work with the data on a daily basis should be consulted in order to understand certain nuances in the data.
3. How is the data collected? Many times we can gather valuable insight if we can see what or who is creating the records.
4. Wash, Rinse, Repeat! It is important to understand that the data is forever changing, and this analysis must be repeated over time in order to capture those changes.
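The profiling questions above can be sketched as a simple script. Below is a minimal, hypothetical example in Python (the sample records and field names are invented for illustration) that reports null counts, untrimmed values, non-printable characters, and the spread of value lengths per column. A mix of lengths in a date column, for example, is a clue that some rows carry a time component and some do not:

```python
# Hypothetical sample rows from a customer data source.
records = [
    {"name": "Ann Lee ",    "phone": "5551234567",     "created": "2021-03-04"},
    {"name": "Bob Cole",    "phone": None,             "created": "2021-03-04 09:15"},
    {"name": "carol\tDiaz", "phone": "5559876543 x22", "created": "2021-03-05"},
]

def profile(rows):
    """Summarize nulls, whitespace issues, and value lengths for each column."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        strings = [v for v in values if isinstance(v, str)]
        report[col] = {
            # How many null values are hiding in this column?
            "nulls": sum(1 for v in values if v is None),
            # Leading or trailing spaces that will break exact matches.
            "untrimmed": sum(1 for v in strings if v != v.strip()),
            # Non-printable characters such as tabs or control codes.
            "nonprintable": sum(1 for v in strings if any(not c.isprintable() for c in v)),
            # Mixed lengths can reveal inconsistent formats (date vs. date-time).
            "distinct_lengths": sorted({len(v) for v in strings}),
        }
    return report

for col, stats in profile(records).items():
    print(col, stats)
```

Running this over even a small sample quickly surfaces the kinds of issues the questions above are probing for; a real project would run the same checks over the full source and repeat them over time.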
These steps will usually ensure a moderate to intimate understanding of the data composition. I find many times that system users will use certain fields to meet their immediate needs. It is very possible that the form or field was not designed to receive this type of input, but the developer did not implement the proper error handling to prevent “dirty” data. A simple example of this is a telephone number field that should receive a 10-digit number; however, a user will input both the phone number and the extension in the same field. This one mistake may make it impossible to extract fuzzy numeric matches on the entire column.
This is why we cannot rely on documentation of what the system “should” do. Instead, we must consult users, profile the data, and analyze collection methods.
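Continuing the telephone example, one remedy is to parse the extension into its own field rather than discard the record. Here is a minimal sketch, under the assumption that users flag extensions with “x” or “ext” (the function and its rules are illustrative, not a complete phone parser):

```python
import re

def split_phone(raw):
    """Split a free-text phone entry into a 10-digit number and an optional extension."""
    if raw is None:
        return None, None
    digits = re.sub(r"\D", "", raw)  # keep digits only
    # Look for an explicit extension marker such as "x22" or "ext 22" at the end.
    m = re.search(r"(?:x|ext\.?)\s*(\d+)\s*$", raw, re.IGNORECASE)
    if m:
        ext = m.group(1)
        number = digits[: len(digits) - len(ext)]
    else:
        ext = None
        number = digits
    # Only accept a clean 10-digit number; anything else needs human review.
    return (number, ext) if len(number) == 10 else (None, ext)

print(split_phone("555-123-4567"))    # ('5551234567', None)
print(split_phone("5551234567 x22"))  # ('5551234567', '22')
print(split_phone("call reception"))  # (None, None)
```

Records where the parse fails (the `None` cases) are exactly the ones worth sending back to the source teams for clarification.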
Once we finally have a decent understanding of our data source, we will want to look at cleansing and standardization. Do we need to transform some of the data? Perhaps we need to remove or replace some strings or characters? Do we need to look for a specific pattern and parse the result into a new column? Most of the time when we are dealing with human entered data, the answer to these questions is YES!
Before we get into the “how,” let’s look at the why. Why do we want to cleanse and standardize our data? Why does it matter? The simple answer is match quality. When you run a match event, sometimes the smallest detail can cause major issues. Having a space after someone’s last name is technically an extra character. A comma instead of a period in an email address can result in lost revenue or higher expenses.
We now have the why, let’s look at the how. The following are a few real world examples of the standardization of data:
- Time is of the essence. System users will often abbreviate items that save them time when doing data entry. The problem is that these abbreviations almost always differ from user to user. So we may want to change WHT to WHITE, lg to Large, or 3rd to Third.
- We may have someone’s first and last name saved in the same field. We may need to parse the first and last name into separate fields.
- In addition, many times we will find that one user inputs Tim and another inputs Timothy. We need to apply a common-name tool in order to standardize these fields.
- ABC, Inc, ABC Incorporated and ABC company may all be the same business. The suffix in this case will make it nearly impossible to find a match. It is advisable to replace these values with something standardized or remove the suffix completely.
- TODAY THE USER FORGOT CAPS LOCK WAS ON. This may not cause system or matching issues, but if you are doing a professional mailing, you will want to ensure you use proper case.
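The standardization rules above can be sketched with a few small helpers. The lookup tables here are invented for illustration; a real project would source abbreviation, nickname, and suffix lists from reference data:

```python
import re

# Hypothetical lookup tables for this example.
ABBREVIATIONS = {"WHT": "WHITE", "LG": "LARGE", "3RD": "THIRD"}
NICKNAMES = {"TIM": "TIMOTHY", "BOB": "ROBERT"}
SUFFIXES = {"INC", "INCORPORATED", "COMPANY", "CO", "LLC"}

def standardize_value(value):
    """Trim, uppercase, and expand known abbreviations token by token."""
    tokens = value.strip().upper().split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def standardize_name(first):
    """Map common nicknames to a standard given name."""
    cleaned = first.strip().upper()
    return NICKNAMES.get(cleaned, cleaned)

def strip_company_suffix(name):
    """Remove trailing corporate suffixes so 'ABC, Inc' and 'ABC Incorporated' align."""
    tokens = re.sub(r"[,.]", "", name.strip().upper()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

print(standardize_value("  wht lg shirt "))      # WHITE LARGE SHIRT
print(standardize_name("Tim"))                   # TIMOTHY
print(strip_company_suffix("ABC, Inc"))          # ABC
print(strip_company_suffix("ABC Incorporated"))  # ABC
print("TODAY THE USER FORGOT".title())           # Today The User Forgot
```

Note that all of these transformations are applied to a working copy used for matching; whether the cleansed values should also replace the source data is a separate business decision.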
Now that my data is clean and standardized, I want to find matches and duplicates. This is the moment where our planning and thorough analysis will come in handy. When we begin to design a match event, we will want to know the following:
- I have multiple data sources: which columns should be mapped together? It is critical that like values are mapped together.
- Did we find a high level of empty fields? If so, we must allow for this in our match criteria.
- What fields will be good candidates for exact matching? We always recommend that there is at least one exact match in any match definition.
- For our fuzzy matching, what level of threshold should we define? How mismatched should we allow the data to be?
- Are we matching on addresses? We should have standardized our addresses through a verification system so that we can match street numbers and street names separately.
- Will one match definition with multiple criteria suffice? Sometimes we want to execute multiple definitions simultaneously.
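A match definition that combines one exact criterion with one fuzzy criterion, as recommended above, might look like the following sketch. The field names, the threshold value, and the similarity measure (Python’s `difflib` ratio, standing in for whatever fuzzy algorithm your tool uses) are all illustrative assumptions:

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Similarity ratio between 0.0 (no match) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """One exact criterion (zip) plus one fuzzy criterion (name) above a threshold."""
    # Allow for empty fields found during profiling: never match on missing data.
    if not rec_a["zip"] or not rec_b["zip"]:
        return False
    # The exact-match requirement.
    if rec_a["zip"] != rec_b["zip"]:
        return False
    # The fuzzy criterion with a tunable threshold.
    return fuzzy(rec_a["name"], rec_b["name"]) >= threshold

a = {"name": "Jonathan Smith", "zip": "30301"}
b = {"name": "Jonathon Smith", "zip": "30301"}
c = {"name": "Maria Gomez",    "zip": "30301"}

print(is_match(a, b))  # similar names, same zip
print(is_match(a, c))  # same zip, but dissimilar names
```

Lowering the `threshold` parameter is exactly the “loosening” tactic discussed next: more candidate pairs surface, including some false positives, which makes it easier to see where the criteria should be tightened.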
Fantastic! I have matches. Before you export and potentially end your current employment, let’s take a minute to analyze the results. I generally find that if my match results look too good to be true, they are. Many times we will create match criteria that are too “tight,” and we end up suppressing viable matches. I find that a best practice is to loosen a match definition so that false positives appear. By identifying these records, we can see where and how we will want to tighten the match criteria. It is easier to remove false positives than to guess which records are being suppressed.
I have reviewed my match results and I am confident that we are “golden.” Speaking of “golden,” we may want to merge some of the data in match groups in order to obtain a golden record. In data management terms, a Golden Record is the most complete record we can build from all of the information that is provided. It is at this point that we should consider what data we want to merge and which record in the group will be the survivor. In order to define the golden record, or master record, we should decide which column in the match group can be used to distinguish the desired record. A “create date” or “modified date” field is generally a good candidate. If no field jumps out at you, then you may want to use the most complete record.
Before moving on, you will want to consider overwriting certain fields that contain out-of-date information, such as addresses or phone numbers. The idea behind a golden record is to use the best and most complete information we have available between all of our data sources. Do not rush past this step!
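The survivorship logic described above can be sketched as follows, under the assumption that a “modified date” field selects the survivor and that the survivor’s missing fields are filled in from the other records in the match group. The match group and its field names are hypothetical:

```python
# A hypothetical match group: the same customer from three systems.
group = [
    {"id": 1, "name": "Tim Jones",     "email": None,       "phone": "5551234567", "modified": "2020-01-15"},
    {"id": 2, "name": "Timothy Jones", "email": "tj@x.com", "phone": None,         "modified": "2023-06-02"},
    {"id": 3, "name": "T. Jones",      "email": None,       "phone": "5559990000", "modified": "2019-11-30"},
]

def golden_record(match_group):
    """Survivor = most recently modified record; fill its gaps from the others."""
    # ISO-format date strings sort correctly as plain strings.
    ordered = sorted(match_group, key=lambda r: r["modified"], reverse=True)
    survivor = dict(ordered[0])  # copy so the source records stay untouched
    for donor in ordered[1:]:
        for field, value in donor.items():
            # Only fill fields the survivor is missing; never overwrite its data.
            if survivor.get(field) is None and value is not None:
                survivor[field] = value
    return survivor

print(golden_record(group))
```

Here the 2023 record survives, but it inherits the phone number from an older record, which is the “most complete record” idea in action. Deliberately overwriting a survivor’s stale field, as discussed above, would be an extra rule layered on top of this fill-the-gaps pass.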
Finally! We have our results. Just to review:
- We carefully planned our project by profiling our data and learning the composition of the data sources
- We cleansed and standardized our data in order to increase match quality and data professionalism
- We conducted a match event based on a variety of match criteria and definitions
- We created a master record that is comprised of the best and most complete data between all data sources
What now? Export. Now that we have finished our project, we need to decide what we will do with these results. Do you want a “clean” list where the matches are suppressed? Do you want a list of duplicates only that you may use for additional investigation? We find that the most common choice is to export a de-duplicated set of records. This would include the unique, non-matching records and, of course, the master record that we created in the merge and survivorship process.
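These export choices amount to simple filters over the matched records. In this sketch, each record carries a hypothetical match-group id (`None` meaning no match was found) and a flag marking the master record chosen during survivorship:

```python
# Hypothetical match results: group G1 has a master and a suppressed duplicate.
results = [
    {"id": 1, "name": "ABC",   "match_group": "G1", "is_master": True},
    {"id": 2, "name": "A.B.C", "match_group": "G1", "is_master": False},
    {"id": 3, "name": "XYZ",   "match_group": None, "is_master": False},
]

# De-duplicated export: unique records plus one master per match group.
deduped = [r for r in results if r["match_group"] is None or r["is_master"]]

# Duplicates-only export: matched records that did not survive, for review.
duplicates = [r for r in results if r["match_group"] is not None and not r["is_master"]]

print([r["id"] for r in deduped])     # [1, 3]
print([r["id"] for r in duplicates])  # [2]
```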
Stay tuned for future articles where we will explore each of these steps in more depth to help you reach data nirvana. For more help with our data management tools, please contact us.