DataMatch Enterprise Documents: DataMatch Enterprise: Profiling Your Data Seamlessly

Today we`ll be analyzing the DataProfiler on Tab.2

The DataProfiler is a very important step that will help you determine what data needs to be cleansed and standardized as well as decide what may be used as match criteria.

Let’s take a look at the information provided on the screen, the upper window contains quite a few statistical values and the lower window contains a representative sample of our data. On the upper window beginning on the left side, you can see the column headers and the field name columns, to the right of that you can see a bar graph representation of the number of non-blank or filled fields in the corresponding column, this is useful when determining if the field is a good candidate for matching if a field has many empty values you may consider not using that field for matching.

Next, we can see the Field Type of columns, this is helpful to know when you might want to match on numeric or date fields for example. The length is also helpful in making sure certain fields have the number of characters you are expecting. For example, you have a column for a 10-digit phone number but you see a length of 13 then you know that at least one or more values have 13 characters and you have a data quality problem.

The next category is the Pattern Category, as you can see there are four columns in this category. Firstly, when you run the data profile the software will also try to categorize the data by using data dictionaries and patterns in order to determine what type of data populates the column. You can see the most popular pattern in the pattern column as well as the total number of fields in the column, the count of valid values, and the count of invalid values. This is very helpful in showing you the count of values the meet the criteria of the displayed pattern and more importantly the invalid values. At any time you can click on the fields above and immediately see a filtered set of data below based on the value you clicked.

We`d like to take you a few steps further regarding the Pattern Recognition feature.

In the lower window, you can click on the pattern matches the button and this will give you more details regarding pattern recognition. You`ll see all of the patterns that the column values matched., the filter data set on the right-hand side when you select one of the rows. Keep in mind that a column value may match more than one pattern.

Next, let`s look at the Pattern Options, here you can see a long list of predefined patterns that come preinstalled in DataMatch Enterprise you may disable certain patterns that you do not want to see also you can use our Pattern Builder to create your own pattern in order to categorize proprietary data. We will not cover custom pattern creation in this post since it`s outside the scope of what we want to cover.

Let`s go back to the Stats Details, moving on let’s look at the Count Category, here you can see the number of filled or un-blank fields, null or blank fields, and distinct fields. These values are helpful if you want to see how Distinct is valued compared to the total amount of filled fields and how many blanks you might have.

Next, we`re going to look at the Character Count Category, most of these columns should be analyzed on a case-by-case basis. For example, if you have a US Zip coded and we see the letters exist on some f the values, we can click this and see the offending data.

Another helpful feature is the Export Data bottom when you filter data by clicking a value in the above grid, you can easily export this filter data by clicking this button.

Next, we can see a column for Punctuation, many times punctuation can be removed in order to enhance data quality for matching purposes. Also leading spaces and non-printable characters can decrease valid matches in some instances.

Finally, we can see different Statical Values for our data, we can see the Min value in a column, the Max value, the Mean of Average value, the Medium or middle value, the Mode or most recurring value, and the extreme value. The extreme value refers to a value that is farthest to the relative range of the values.

One last thing we`d like to mention is the Profile History, by clicking the button we can see the historical profiles that we may have run for this dataset in a project. This can be helpful if your data changes over time by checking the historical data profiles you might be able to see certain trends that change in the data

In closing, we always recommend that you profile your data before you begin your project in order to to get the most out of your data.

How can we help?