Data Quality Management in DataMatch Enterprise
DME takes you through the data quality management journey in the following stages:
Data Import
Connect and load data from a number of different data sources, including local files, relational database servers, data stores at third-party applications, and many more.
Data Profiling
Assess the current state of your data and identify cleaning opportunities.
Data Cleansing and Standardization
Perform a variety of data cleansing operations and attain a standardized view across all imported data sources.
Data Match Configuration and Definition:
Configure and execute proprietary or industry-leading data matching algorithms, and finetune them according to your dataset requirements to get optimal results.
Data Match Results:
Assess the match results and their match confidence levels to flag false matches and determine the master record.
Data Merge and Survivorship:
Design merge and survivorship rules to overwrite all poor-quality data fields automatically and retrieve the golden record.
Data Export:
Export or load the profiled, cleaned, matched, and overwritten data back to the data source of your choice, at any point of the process.
Let’s take a deeper look at the features available under these modules.
1. Data Import
Files
Such as text files, CSV, excel sheets, etc.
Databases
Such as SQL Server, Oracle,Teradata, etc.
CRM
Such as Salesforce.
ODBC
Which allows you to connect with any other database system.
While importing data from the connected source, you can also select specific tables, as well as columns within those tables for
custom import. For SQL databases, DME also offers the option to write and run custom SQL scripts for importing data based on some criteria.
Once the data is imported, you can rename columns, sync sources to fetch updated data, and replace these files as well. All these data imports are maintained in your Data Library, always ready to be used subsequent steps of data quality management.
Want to see DME’s Data Import in action? Check out this video.
2. Data Profiling
Data profiling helps you to assess the current state of your imported data and then identify:
- What data needs to be cleansed and standardized?
- What can be used as match criteria?
When you generate a profile of the imported data, DME evaluates and displays the following information against the columns present in the file:
Filled:
Displays the percentage of rows that are non-blank, or have values in them.
Type:
Displays the identified data type for the column.
Length:
Displays the maximum number of characters this column holds.
Length:
Displays the most popular pattern that a column follows. This is evaluated by running the data values against a dictionary of pre-installed patterns in DME, but you can also create your custom patterns (covered in the upcoming section). Further, DME also shows the number of records that are valid and meet the criteria of the identified pattern.
Count:
Displays the total number of filled, null, and distinct values in the column.
Character Count:
Displays how many values in the column have numbers (and only numbers), letters (and only letters), numbers and letters, punctuation, leading spaces, and non-printable characters.
Statistical Values:
Displays the minimum, maximum, mean, median, mode, and extreme values for numerical and date-type columns.
Want to see DME’s Data Profiling in action? Check out this video.
3. Data Cleansing and Standardization
As disparate files have different data quality issues, DME allows you to standardize and transform your data through various cleansing operations. These options include:
Copy Field:
You can create a copy of any column present in your data. This option is helpful for preserving the original values of a column and performing new functions on the copied column only.
Type:
This feature allows you to run a column’s values against a dictionary of words (like Name, Address, ZIP code, etc.), and then parse them to identify new fields. One common example of this is when you want to parse your Address field into smaller components such as Street Number, Street Name, ZIP, City, and State. Or when you want to parse Full Name into First Name, Middle Name, and Last Name columns.
Update Column Case:
You can create a copy of any column present in your data. This option is helpful for preserving the original values of a column and performing new functions on the copied column only.
Pattern Builder:
DME possesses an extensive functionality for pattern building and recognizing. DME’s Pattern Builder comes with a number of prebuilt patterns (such as identify alphanumeric strings, delimiters, beginning of strings, etc.).
If needed, you can also create custom patterns by manually typing in the regular expression, or using DME’s visual regex designer tool. The pattern builder parses the column field according to the chosen pattern, and outputs the matched results in new columns. Furthermore, you can also format these new columns so that they all follow a standardized pattern.
A common example of this is when you have a US phone number field, but all numbers are following different patterns like: 231 456 7890, (231)4567890, or 231-456-7890. By using our Pattern Builder, you can recognize all values and standardize them to follow the same format such as +1 (231) 456-7890.
Want to see DME’s Pattern Builder in action? Check out this video.
Wordsmith
DME’s Wordsmith tool profiles all data values in a column to show the most repetitive words and their count. Once the wordsmith tool has fetched all words, you can choose to flag records where a certain word occurs, delete all instances of the word, or replace it with something else.
For example, you can run the Wordsmith tool on a Company Name column and then choose to flag, delete, or replace all instances of llc, inc, corp, svc, etc.
Want to see DME Wordsmith in action? Check out this video.
Merge Fields:
DME allows you to merge multiple fields into a single column with a defined delimiter.
Note: The changes made to the data files during the cleansing process are saved in a new file under the DME folder structure, and it does not affect the source files imported into this project.
Removals and Replacements:
To standardize data across all sources, DME allows you to perform the following functions on a column field:
- Remove or replace non-printable characters.
- Replace empty values.
- Remove leading or trailing spaces.
- Remove or replace certain characters.
- Remove spaces, letters, or numbers.
- Replace zeros with the letter O, or vice versa.
Want to see DME Data Cleansing and Standardization in action? Check out this video.
4. Data Match Configuration and Definitions
With DME, you can build match definitions according to the nature of your data. DME offers extensive list of options for configuring these definitions. You can choose to match between or within multiple data source files, or both. Let’s take a look at how data matching works in DME:
Field Mapping Between Data Sources:
Firstly, you need to map fields between all data source files that need to be matched. You can do that manually, or use DME’s Suggest Mapping feature to intelligently maps all fields. You can then use these mapped fields to build match definitions.
Field Mapping Between Data Sources:
In DME, you can create multiple match definitions, and each definition can hold multiple criteria. This structure helps you create various logical AND OR expressions, based on which the data records can be matched.
Multiple criteria within a definition are checked by using the logical AND, which means different records are said to be a match if they match on all criteria defined in that definition. Furthermore, multiple definitions are checked using the logical OR, meaning data records are said to be a match if they match even on a single definition.
Match Algorithms:
A field can be matched using the following matching algorithms:
- Exact: Does not allow any tolerance in difference and expects to find a full match of character to character.
- Fuzzy: Allows tolerance in difference and evaluates the likelihood of two strings being similar by using probabilistic matching.
- Phonetic: Useful for matching similar-sounding strings as it matches words by their pronunciation. The phonetic match can be exact or fuzzy.
- Numeric: Used to run a probabilistic match on numeric fields.
Furthermore, you can configure the following options for each of these algorithms:
- Fast Level
- Level: It defines the threshold for the comparator. If the results of the comparison are equal to higher than the defined level, the match would be considered successful. Any value below the level would be a non-match.
- Weight: By default, DME assigns equal priority to all matching criteria in a definition. But you can override this configuration and set your own custom weights to match criteria, and configure match thresholds. Fields assigned higher weights (or higher priority) will have more impact on the overall match confidence level as compared to fields having lower weights.
Want to see DME Data Match Configurations and Definitions in action? Check out this video.
5. Data Match Results
Based on the configured match configuration and definitions, DME computes and displays match results. In the results, DME clusters all matching records in groups and assigns a match score to each record – showing the match confidence level. So, a group contains all records that have matched on one or more match definitions that are configured for the project. In a group of matched results, you can select which record is the unique/master record and which ones are its duplicates.
By default, all of the results are sorted based on the match score, so the records with a higher match score show up first and the lower match scores show up at the bottom of the list.
In the match results, you need to make some important decisions before moving on to the next step. These include:
- Which data record is the unique or master record within a group?
- Which data record pairs are falsely matched and are not duplicates?
- Flag or select certain records based on the match definitions that they matched on (optional).
Want to see DME Data Match Results in action? Check out this video.
6. Data Merge and Survivorship
With DME’s Merge and Survivorship module, you can build rules and conditions that will automatically determine the master record from a match group as well as overwrite certain data field values. After obtaining the groups of matching records, this module helps you to automate the manual effort of selecting records that represent the correct view of your data, and then updating them to ensure any valuable information is not left behind
1. Determine Your Master Record: You can determine the master record by running sequenced checks or rules on the data fields of the matching records. Following options will help you configure rules for selecting the master record in a match group:
a. Default Selection: By default, the first record in a match group is selected as the master record.
b. Selection Based on a Data Source: This option selects the master record as the one that comes from the specified data source. If multiple data sources are selected or there are multiple records from the same data source in a match group, then the operation will be evaluated to make the final selection (coming up next).
c. Selection Based on an Operation: This option selects the master record from any of the “checked” data source files based on an operation on the data field. Following operations are available for evaluating data field values.
a. Longest: Selects the master record that has the highest number of characters in the chosen data field.
b. Shortest: Selects the master record that has the lowest number of characters in the chosen data field.
c. Maximum: Selects the master record that has the highest number value or latest date value in the chosen data field.
d. Minimum: Selects the master record that has the lowest number value or oldest date value in the chosen data field.
e. Most Popular: Selects the master record that has the most recurring value of the column in the chosen data field.
All activated rules are executed in the sequenced order to determine the master record. If by the end, multiple records still qualify for the defined selection rules, DME will choose the first record that appears in the match result group.
Let’s say that a match group has 10 records. Using the Determine Master Record feature, you can configure multiple rules where DME evaluates these records. It can start by checking which of these 10 records have the longest State? That might take you down to 8 records (because all 8 records have the same number of characters in State field). Next, it can look for the longest ZIP code, and that might take you down to 3 records.And finally, it can check the longest Industry, and you’re left with 1 record. If the checking reaches the end of rules and there is more than one record left in the group, DME will choose the first record as the master record of that group.
1. Overwrite Records: With this option, you can configure rules to overwrite one or more data values from a record in a match group to all other records in that group. Following options will help you configure overwriting rules:
a. Overwrite Data From a Specific Data Source: This option overwrites data field value from the selected data source to all other records in group.
b. Overwrite Data Based on a Conditional Operation: This option overwrites the data field values based on the configured operations and conditions.
Operation: You can define operations that decide which record’s value should be used to overwrite all other records’ values present in a match group.
a. Longest: Overwrites the longest data field value to all other values.
b. Shortest: Overwrites the shortest data field value to all other values.
c. Maximum: Overwrites the highest number or the latest date to all other values.
d. Minimum: Overwrites the lowest number or the oldest date to all other values.
e. From Master: Overwrites the value from the master record to all other values.
f. From the Best record
g. Merges All Values Into One: Merges All Values Into One: Merges all unique values in a comma separated list.
Condition: You can decide to overwrite or not to overwrite if:
For example, before finalizing your master records for all groups and exporting them, you want to have the correct values for Company Name in your master record. Now, it is possible that only some records in a group have the Company Name value, while others are empty. So, you can build a rule that checks if a record in a group has a longest Company Name value, then overwrite this to all other records in that group (one of which is the master record). This shows that even though you’ve chosen one record as the master record, you can still enrich its fields based on the data values residing in the other records in a group. Hence, getting the most out of your matches.
7. Data Export
DME offers dynamic export options at every stage of your data quality management journey. You can choose to export the profiled, cleaned, matched, or merged data whenever needed. As DME supports an extensive list of file and database types, the data can be exported to local files as well as uploaded to any database, such as Excel, Tableau, SQL Server, Oracle, Salesforce CRM, etc.
The final export (the one that happens at the end of the process) allows you to make a number of selections to ensure that only the required data is exported from the application. And while you are selecting the data through filters, you can also preview it in real-time within the application to review the output before export.
Following are the options that you can select from based on your choice:
- Which datasets to include in export?
- What to do with duplicates?
- Show all record and flag duplicates.
- Show all unique records and suppress duplicates.
- Show with a master record remaining for each group of duplicate records.
- Export duplicates only.
- Handle selected rows by showing them or suppressing them.
8. Scheduler
With DME Scheduler, you can schedule any data cleansing, matching, or overwriting task and configure it to be automatically triggered and executed. Scheduled tasks can be triggered on any date and time, or when a connected data source is updated.
DME also displays scheduled tasks in a calendar view that gives you a summary report of upcoming tasks configured for various projects.
Note: For the Scheduler to execute the created tasks, DME Agent must be running. You can set it up to run on Windows startup through settings.
9. DataMatch Enterprise API
All of DME features and functionality are exposed through its RESTful API. The API allows you to use any data quality management function in your custom or existing applications and set it up for real-time data profiling, cleansing, matching, or deduplication.
DME API can be efficiently utilized for:
- Displaying consolidated data by joining multiple independent data sources in real-time.
- Verifying and validating upcoming records to prevent unstandardized or unformatted data from getting into your system.
- Checking upcoming records for uniqueness and only entering them in a database if they are not a duplicate.
- Eliminating manual review and management by automatically categorizing matches and non-matches based on thresholds.