Remove Duplicate Records and Rows
When reviewing or combining one or more lists or databases, duplicates are a common problem, this article will detail the steps needed to successfully clean your file. The purpose of deleting duplicate rows / records is to clean the underlying data set to achieve productivity improvements, save on duplicate mailings, and increase customer satisfaction. Deleting duplicates can be a time consuming and error prone task which is why duplicate removal software is an essential tool for database administration. DataMatch is our flagship product for deleting duplicate records. Try a free trial today or continue reading for more best practices on merge purge of databases.
Merging different databases with different sources (SQL server, MySQL, Excel, ODBC etc.) and combining into a common structure is the first step in the process. Usually duplicates are between databases, but sometimes duplicates are within a single file. DataMatch can import, combine, and export to the most common database formats. Additionally DataMatch will automap similar fields from different data sources together (Which can be customized and overwritten)
A key component of eliminating duplicates is the definition of what a duplicate is. The following best practices are key and are all included in DataMatch.
> Fuzzy logic identification of percent matches between records and setting minimum percent match thresholds by field
> Acronym identification for matching (Match International Business Machines to IBM)
> Cleaning and standardizing data prior to matching (Street to street, eliminating unnecessary syntax in phone numbers, etc.)
> Applying libraries for standardization, especially for first names (Jon, Jonathan, and John etc.) DataMatch includes over 300,000 standardization rules for names, addresses, phone numbers and company names.
One of the critical pieces of eliminating duplicates is survivorship. If you have duplicate records, which one should stay (survive) and which one should go? DataMatch allows customized settings for which merged data should survive
DataMatch allows customized settings for which merged data should survive
In this example there are two duplicate records. Each has some slightly different data in the notes field. You may prefer to keep all records, but often times a single master record must be chosen to maintain data quality.
With DataMatch you can choose which record survives by choosing what field to merge on, in this case Customer Number, and ascending or descending order. If ascending the first customer number would hold priority ‘1005643’, if descending the later customer number ‘1106789’ would have priority. Note you can always change which record is a master manually in DataMatch.
Unfortunately normal duplicate removal software routines can delete vital business data.
What if you want to keep both pieces of information in the same master record?
The best solution would be to keep all data that is different in a new field. DataMatch has this capability.
The result would be this
Note the alternate information is captured in a new field. The benefit is a single master record, with no vital data loss. (Old customer number kept for referencing, and critical customer comments, like interest in a new product, kept)
Try the free trial on your own data set!
Note DataMatch never deletes any information from the source files , all information is kept temporarily in memory where you can test different duplicate removal settings without consequence. Although you can overwrite your original source files if you choose.