Data Cleansing Techniques for Redundancies

Dealing with duplicate data requires a strategy for handling inconsistent data. The first step is to standardize addresses with data matching software. Second, use data entry programs that validate field formats, which prevents errors such as a name being entered into a phone number field. It is critical to find all records that contain exactly or approximately the same data in one or more fields. Review the sample below of five records, each containing six fields:

#  Name   Address           City         St  ZIP®        Phone
-  -----  ----------------  -----------  --  ----------  --------------
1  DAVIS  115 E 1ST ST      CLEBURNE     TX  76031-2407  (817) 458 9992
2  DAVIS  1 115 ST EAST     CLEBURNE     TX  76031
3  DAVIS  1 EAST 15TH       CLEBURNE DR  TX              817-458-9992
4  DAVIS  1 E FIFTEENTH ST  CLEBURNE     TX  76031       458-9992
5  DAVIS  ONE EAST 15TH ST  CLEBURNE     TX  76031       817-458-9991

 

All five of the above records refer to the same person at the same address, yet no two records are exactly alike. Consider the possible attempts to locate duplicates in the file:

BROWSE 1: Select records with the same address field. Finds none of the above records.

BROWSE 2: Select records with the same name and same five-digit ZIP. Misses records 1, 3, and 5.

BROWSE 3: Select records with the name “DAVIS”. Misses records 2 and 3 (while probably matching many other DAVIS records at other addresses).
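To make the first attempt concrete, here is a minimal Python sketch of an exact-match search on the address field. The record list is copied from the sample table; the function name exact_duplicates is hypothetical and is used only for illustration.

```python
from collections import defaultdict

# The five sample records as (record number, address) pairs,
# copied from the sample table.
RECORDS = [
    (1, "115 E 1ST ST"),
    (2, "1 115 ST EAST"),
    (3, "1 EAST 15TH"),
    (4, "1 E FIFTEENTH ST"),
    (5, "ONE EAST 15TH ST"),
]


def exact_duplicates(records):
    """Group record numbers by identical address text; keep groups of two or more."""
    groups = defaultdict(list)
    for number, address in records:
        groups[address].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


print(exact_duplicates(RECORDS))  # prints [] because no two addresses are identical
```

Because the comparison is character for character, every spelling variant of the address lands in its own group, so the search reports no duplicates at all.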

After address correction and field validation, the sample records become:

#  Name   Address       City      St  ZIP         Phone
-  -----  ------------  --------  --  ----------  ------------
1  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
2  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407
3  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
4  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  XXX-458-9992
5  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
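The phone column gives a feel for what field validation and standardization involve. The sketch below is an illustration only, not the algorithm of any particular product: the function name normalize_phone and its rules are assumptions inferred from the table (ten digits become AAA-EEE-NNNN, seven digits get the placeholder area code XXX, and an empty field stays empty). Address correction is not attempted here, since it depends on postal reference data.

```python
import re


def normalize_phone(raw):
    """Return a phone number as AAA-EEE-NNNN, or "" for an empty field.

    Assumption: a 7-digit number receives the placeholder area code "XXX",
    mirroring record 4 in the standardized table.
    """
    if not raw.strip():
        return ""
    # Field validation: letters suggest the wrong data was typed into the field.
    if re.search(r"[A-Za-z]", raw):
        raise ValueError(f"letters in phone field: {raw!r}")
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, and parentheses
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    if len(digits) == 7:
        return f"XXX-{digits[:3]}-{digits[3:]}"
    raise ValueError(f"unexpected phone format: {raw!r}")


print(normalize_phone("(817) 458 9992"))  # prints 817-458-9992
print(normalize_phone("458-9992"))        # prints XXX-458-9992
```

Rejecting letters outright is one way to catch a name entered into a phone number field at data entry time, before it reaches the duplicate search.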

 

Once standardization is complete, duplicate detection improves greatly and has a better chance of finding the correct group of duplicates. For example, selecting records with the same address, ZIP, and “soundex name” works perfectly on the above example.
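The Soundex-based selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions: soundex is a hand-written version of the standard American Soundex code, duplicate_groups is a hypothetical helper name, and the record list is copied from the standardized table.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


# (record number, name, address, ZIP) from the standardized table.
STANDARDIZED = [
    (1, "DAVIS", "115 E 1ST ST", "76031-2407"),
    (2, "DAVIS", "115 E 1ST ST", "76031-2407"),
    (3, "DAVIS", "115 E 1ST ST", "76031-2407"),
    (4, "DAVIS", "115 E 1ST ST", "76031-2407"),
    (5, "DAVIS", "115 E 1ST ST", "76031-2407"),
]


def duplicate_groups(records):
    """Group record numbers that share address, ZIP, and Soundex name."""
    groups = defaultdict(list)
    for number, name, address, zip_code in records:
        groups[(address, zip_code, soundex(name))].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


print(duplicate_groups(STANDARDIZED))  # prints [[1, 2, 3, 4, 5]]
```

Keying on the Soundex code rather than the raw name means that spelling variants such as DAVIS and DAVIES, which both encode to D120, would still fall into the same group.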

 

As you start your journey to address redundancies and duplications, Data Ladder is your partner and analytical expert.  We can bring simplicity and clarity to an otherwise muddled and complicated project. Have confidence that Data Ladder will help you resolve your data quality issues and measurably improve quality and financial performance. Contact us for more information and to get your free trial.
