
Guide to pattern matching: What it means and how to do it

Finding patterns is easy in any kind of data-rich environment; that’s what mediocre gamblers do. The key is in determining whether the patterns represent signal or noise.

Nate Silver

Anyone who works with data understands the importance of patterns. Whether you are holistically analyzing large datasets or drilling down to the most granular value, patterns are everywhere. They can be universal – like the pattern of a credit card number – or they can be unique to your business, for example the pattern used to display product information on your website.

When data is captured, it does not always follow the correct pattern. Companies need to implement different methods for matching, validating, and transforming patterns to get the data in the required shape and format.

In this blog, we will learn some important concepts related to pattern matching and validation, such as:

  1. What does pattern matching mean?
  2. How does pattern matching differ from string matching?
  3. How does pattern matching work?
  4. What are the most common reasons for matching and validating patterns?
  5. How can you transform your data into the pattern you need?

Let’s dive in.

What is pattern matching?

A pattern is perceived as something that is the opposite of mess or chaos. It is a repetitive model that can be identified across a large set of data values belonging to the same domain. Therefore, pattern matching can be defined as:

The process of searching for a specific sequence or placement of characters in a given set of data.

Pattern matching produces definitive results: the input string either contains the pattern (is valid) or it doesn’t (is invalid). In case the string doesn’t contain the required pattern, the matching process is often extended to pattern transformation where sub data elements are extracted from the input value and then reformatted to build the required pattern.

Pattern matching versus string matching

Before we discuss how pattern matching algorithms work, it is important to understand how they relate to string matching algorithms. The two concepts are often treated as the same thing, but they differ considerably in purpose and use. The table below highlights some of the key differences:

|            | Pattern matching | String matching |
| ---------- | ---------------- | --------------- |
| Comparison | It compares a string with a standard pattern that represents blocks or tokens of characters. | It compares two strings character by character. |
| Example    | Comparing [email protected] with [name]@[domain].[domain-extension]. | Comparing Elizabeth with Alizabeth. |
| Results    | Computes definitive results – either the pattern is found or is absent. | Computes exact matches (matching dust with dust) or fuzzy matches (matching dust with rust). |
| Uses       | Used to parse and extract values or transform values to follow standard patterns. | Used to correct misspellings, detect plagiarism, and identify values having similar meaning or character composition. |

How does pattern matching work?

Simply put, pattern matching algorithms work with regular expressions (or regex). To understand what a regular expression is, think of it as a language that helps you to define a pattern and share it with someone – or in our case, a computer program.

Regular expressions tell computer programs which pattern to look for in the data being tested. Sometimes, the program is intelligent enough to pick patterns from a set of data values and automatically generate a regex. Some programs or tools have a built-in regex library that contains commonly used patterns, such as credit card numbers, U.S. phone numbers, datetime formats, email addresses, etc.

Example of matching email address pattern

To see how a pattern matching algorithm works, let’s take the example of validating the pattern of email addresses. The first step is to define the regex that communicates the pattern of a valid email address. A sample pattern of a valid email address may look like this:

[name]@[domain].[domain-extension]

In the regex language, this pattern will be translated as:

^[\w-\.]+@([\w-]+\.)+[\w-]{2,3}$

Where,

  • ^ marks the beginning of the string and $ marks the end.
  • [\w-\.]+ matches one or more alphanumeric characters, underscores, hyphens, or full-stops (the name).
  • @ matches the literal @ symbol.
  • ([\w-]+\.)+ matches one or more words made of alphanumeric characters, underscores, or hyphens, each ending with a full-stop (the domain).
  • [\w-]{2,3} matches a word made of alphanumeric characters, underscores, or hyphens that is at least 2 and at most 3 characters long (the domain extension).

Below, you can see a number of test email addresses that are run through this regex pattern and the results produced.

| No. | Test                 | Result  | Reason for failure |
| --- | -------------------- | ------- | ------------------ |
| 1.  | [email protected]    | Valid   |                    |
| 2.  | pam.beesly_gmail.com | Invalid | Missing @ symbol. |
| 3.  | [email protected]    | Invalid | The domain has an unexpected full-stop. |
| 4.  | [email protected]    | Invalid | The domain extension has more than 3 characters (i.e., com4). |
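
The same four cases can be reproduced in a few lines of JavaScript. This is only a minimal sketch: the function name isValidEmail and the sample addresses are hypothetical stand-ins chosen to mirror the failure reasons listed above, and the character class is written as [\w.-] (hyphen last), which matches the same characters as the pattern described earlier.

```javascript
// Email pattern from this section. The hyphen is placed last in the first
// character class so that it is always read as a literal hyphen.
const EMAIL_PATTERN = /^[\w.-]+@([\w-]+\.)+[\w-]{2,3}$/;

// Hypothetical helper: true when the whole value follows the pattern.
function isValidEmail(value) {
  return EMAIL_PATTERN.test(value);
}

// Hypothetical sample inputs, one per case discussed above.
const samples = [
  "jane.doe@example.com",  // well-formed
  "jane.doe_example.com",  // missing @ symbol
  "jane.doe@example..com", // unexpected full-stop in the domain
  "jane.doe@example.com4", // extension longer than 3 characters
];

for (const sample of samples) {
  console.log(`${sample}: ${isValidEmail(sample) ? "Valid" : "Invalid"}`);
}
```

Only the first address is reported as valid. Note that the {2,3} limit also rejects real extensions longer than three characters, so this pattern is better suited to illustration than to production use.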

As the example shows, manually defining regexes is tedious and requires some expertise. Alternatively, you can opt for data standardization tools that offer visual regex designers (more on this in a later section).

Pattern matching use cases

Now that we know what pattern matching is and how the algorithm works, you may be wondering where exactly it is put to use. Pattern matching is one of the most fundamental concepts across different fields, such as computer programming, data science and analysis, natural language processing, and more.

If we specifically talk about pattern matching and validation in the data field, here are some of its most common applications:

1. Validating form submissions

As data pattern matching differentiates between valid and invalid information, it is mostly used to validate forms submitted on websites or other software applications. The regex is applied on the form fields as needed; some sample validations are given below:

  • A person’s name contains only letters and permitted symbols,
  • Email address follows the correct pattern,
  • Phone number contains only digits,
  • Credit card number is not more than 16 digits, and so on.
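
The checks above could be sketched in JavaScript as a small table of rules. Everything named here (FIELD_RULES, findInvalidFields, the field names, and the exact patterns) is an assumption made for illustration; real forms usually need looser or locale-specific rules.

```javascript
// Hypothetical rules, one per validation listed above.
const FIELD_RULES = {
  name: /^[A-Za-z][A-Za-z '.-]*$/,            // letters plus a few common symbols
  email: /^[\w.-]+@([\w-]+\.)+[\w-]{2,3}$/,   // email pattern used in this article
  phone: /^\d+$/,                             // digits only
  cardNumber: /^\d{1,16}$/,                   // at most 16 digits
};

// Returns the names of the fields whose values do not match their rule.
// Missing fields are treated as empty strings and therefore fail.
function findInvalidFields(form) {
  return Object.keys(FIELD_RULES).filter(
    (field) => !FIELD_RULES[field].test(String(form[field] ?? ""))
  );
}

// Hypothetical submission: the phone number contains a hyphen.
const submission = {
  name: "Mary-Anne O'Neil",
  email: "mary@example.com",
  phone: "555-0100",
  cardNumber: "4111111111111111",
};

console.log(findInvalidFields(submission)); // [ 'phone' ]
```

Keeping the rules in one object means a new field only needs a new entry rather than a new validation function.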

2. Performing search and replace operations

Pattern matching is also useful in applications that have find and replace features for textual information. Some basic applications only offer character-by-character matching (or string matching), while others also provide regex search and replace functionality, which lets you search for patterns in text documents and not just exact string matches.
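
As a rough illustration of regex search and replace, the sketch below rewrites dates from one layout to another using capture groups. The function name normalizeDates and the choice of date formats are assumptions made for the example.

```javascript
// Rewrites every MM/DD/YYYY date found in the text as YYYY-MM-DD.
// $1, $2 and $3 refer to the month, day and year capture groups.
function normalizeDates(text) {
  return text.replace(/\b(\d{2})\/(\d{2})\/(\d{4})\b/g, "$3-$1-$2");
}

const note = "Invoice sent 03/14/2024, payment due 04/13/2024.";
console.log(normalizeDates(note));
// Invoice sent 2024-03-14, payment due 2024-04-13.
```

A plain string search could only replace one specific date; the pattern replaces every value that has the same shape.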

3. Cleaning and standardizing datasets

You can try to validate information at data entry – such as form submissions, but due to the various limitations and restrictions encountered across systems, your organizational datasets can still end up with multiple representations of the same information. This is where it becomes imperative to clean and standardize datasets before they can be used for routine operations or BI.
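
To make the idea of multiple representations concrete, here is a minimal sketch that brings differently formatted phone numbers into one shape. It assumes 10-digit numbers and a NNN-NNN-NNNN target format; both the function name standardizePhone and the sample values are hypothetical.

```javascript
// Strips everything except digits, then rebuilds 10-digit numbers in one
// target format. Values with any other digit count return null so they can
// be reviewed instead of being guessed at.
function standardizePhone(raw) {
  const digits = raw.replace(/\D/g, "");
  const match = /^(\d{3})(\d{3})(\d{4})$/.exec(digits);
  return match ? `${match[1]}-${match[2]}-${match[3]}` : null;
}

// Hypothetical column with three representations of the same number
// and one incomplete value.
const phoneValues = ["(212) 555-0143", "212.555.0143", "212 555 0143", "555-0143"];
console.log(phoneValues.map(standardizePhone));
// [ '212-555-0143', '212-555-0143', '212-555-0143', null ]
```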

4. Parsing and extracting values

Since pattern matching looks for a specific sequence of characters in a given value, this process is also useful to match and extract value tokens that reside in extended forms of information. For example, you may want to extract the domains from a list of business email addresses to find out which company the person works at, or you can extract the city and country of residence from address fields that contain 3-4 lines of information.
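
The email domain example could look roughly like this in JavaScript. The helper name extractDomain and the sample addresses are hypothetical, and the sketch simply treats everything after the last @ as the domain.

```javascript
// Captures everything after the last "@" as the domain and lower-cases it.
// Returns null when the value contains no "@".
function extractDomain(email) {
  const match = /@([^@\s]+)$/.exec(email.trim());
  return match ? match[1].toLowerCase() : null;
}

const contacts = ["ana@Acme.com", "li@globex.co", "no-at-sign"];
console.log(contacts.map(extractDomain));
// [ 'acme.com', 'globex.co', null ]
```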

How to match patterns?

There are two approaches usually adopted by businesses while matching and validating patterns: one is to write in-house code scripts and the other is to use third-party software tools. Let’s discuss the implementation of both approaches.

1. Pattern matching using code

When it comes to cleaning and standardizing data, the default solution for many organizations is to create custom in-house applications and coding scripts for various standardization operations, including pattern matching and transformation. As interesting as it may sound, it can be quite challenging.


Let’s take a look at a JavaScript code snippet that validates email addresses.

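// Validates the value of a form field (for example, an <input> element)
// against an email pattern and reports the result.
function emailValidation(input)
{
  // Name, @, domain, then one or more extensions of 2-3 characters.
  const regex = /^\w+([.-]?\w+)*@\w+([.-]?\w+)*(\.\w{2,3})+$/;
  if (regex.test(input.value))
  {
    alert("Valid"); return true;
  }
  else
  {
    alert("Invalid"); return false;
  }
}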

Note that this code snippet only validates email addresses and does not transform them into a standardized pattern in case they are invalid. Moreover, it only validates the email address field, so to match different patterns, you need a similar code implementation for each. Finally, the regex that validates email addresses is still relatively easy to decode. For data fields that have complex patterns, regexes can become much longer and harder to read. For example, the following code snippet finds pattern matches for URLs.

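// Checks whether the value of a form field contains a URL.
function URLValidation(input)
{
  // The pattern is not anchored, so it matches a URL anywhere in the value.
  const regex = /[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)?/i;
  if (regex.test(input.value))
  {
    alert("Valid"); return true;
  }
  else
  {
    alert("Invalid"); return false;
  }
}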

2. Pattern matching using software tools

For the reasons mentioned above, maintaining custom applications can be very resource-intensive. It requires you to hire a team of in-house developers who are constantly approached by business users with requests to debug and update code functionality.

This is why many managers and senior data engineers lean towards the idea of adopting simple tools for building, matching, and transforming patterns that can be easily used by IT as well as non-IT staff.

Such pattern matchers are packaged with different features. The most common features are discussed below.

1. Visual pattern builders

A visual pattern building feature offers a drag-and-drop graphical user interface that can be used for creating patterns. As a user drops pattern blocks or tokens into the workspace, an equivalent regex is generated at the backend. This feature eliminates the need for technical expertise and encourages non-technical users to build patterns as well.
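
To give a feel for what generating a regex from blocks might involve, here is a minimal sketch in JavaScript. It is not how any particular tool works: the token names, TOKEN_TO_REGEX, escapeLiteral, and buildPattern are all hypothetical.

```javascript
// Hypothetical token vocabulary; a real tool defines its own blocks.
const TOKEN_TO_REGEX = {
  letters: "[A-Za-z]+",
  digits: "\\d+",
  word: "[\\w-]+",
};

// Escapes characters that have a special meaning in a regex.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Each block is either { token: "digits" } or { literal: "@" }.
// The blocks are joined in order and anchored to the whole value.
function buildPattern(blocks) {
  const body = blocks
    .map((block) => {
      if ("literal" in block) return escapeLiteral(block.literal);
      const fragment = TOKEN_TO_REGEX[block.token];
      if (!fragment) throw new Error(`Unknown token: ${block.token}`);
      return fragment;
    })
    .join("");
  return new RegExp(`^${body}$`);
}

const emailLike = buildPattern([
  { token: "word" },
  { literal: "@" },
  { token: "word" },
  { literal: "." },
  { token: "letters" },
]);

console.log(emailLike.source);                    // ^[\w-]+@[\w-]+\.[A-Za-z]+$
console.log(emailLike.test("sales@example.com")); // true
```

The user only arranges blocks; the escaping and anchoring details stay hidden inside the builder.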

[Screenshot: the visual pattern designer in DataMatch Enterprise]

2. Pattern matching by data type

Another useful feature of pattern matching tools is the ability to profile entire columns by their data type patterns. For example, you can profile the phone number column by the integer data type, and the fraction of values that contain other symbols and characters in addition to digits can be flagged as invalid. This gives a quick assessment of the standardization effort required to fix the invalid patterns.
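
A simplified version of this kind of profiling can be sketched as follows. The function name profileIntegerColumn, the shape of the returned report, and the sample values are assumptions made for the example.

```javascript
// Flags every value that contains anything other than digits and reports
// what share of the column is affected.
function profileIntegerColumn(values) {
  const invalid = values.filter((value) => !/^\d+$/.test(value));
  return {
    total: values.length,
    invalidCount: invalid.length,
    invalidShare: values.length ? invalid.length / values.length : 0,
    invalid,
  };
}

// Hypothetical phone number column.
const phoneColumn = ["2125550143", "212-555-0143", "2125550188", "+12125550101"];
const report = profileIntegerColumn(phoneColumn);

console.log(report.invalidShare); // 0.5
console.log(report.invalid);      // [ '212-555-0143', '+12125550101' ]
```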

[Screenshot: matching patterns by data type in DataMatch Enterprise]

3. Pattern matching using regex library

Many tools come with built-in regex libraries full of commonly used patterns, such as credit card numbers, U.S. phone numbers, datetime formats, email addresses, etc. Moreover, you can also create custom patterns (specialized for your business use) and save them in the library for reuse.
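
Conceptually, such a library is a set of named patterns that can be looked up and extended. The sketch below is a simplified illustration; the pattern names, the deliberately simple patterns, and the helper functions registerPattern and matchesPattern are all hypothetical.

```javascript
// Hypothetical starter library of named patterns.
const patternLibrary = new Map([
  ["us-phone", /^\d{3}-\d{3}-\d{4}$/],
  ["iso-date", /^\d{4}-\d{2}-\d{2}$/],
]);

// Saves a business-specific pattern under a reusable name.
function registerPattern(name, regex) {
  patternLibrary.set(name, regex);
}

// Looks a pattern up by name and tests a value against it.
function matchesPattern(name, value) {
  const regex = patternLibrary.get(name);
  if (!regex) throw new Error(`Unknown pattern: ${name}`);
  return regex.test(value);
}

// Hypothetical product code format: three capital letters, a hyphen, five digits.
registerPattern("sku", /^[A-Z]{3}-\d{5}$/);

console.log(matchesPattern("sku", "ABC-12345"));       // true
console.log(matchesPattern("iso-date", "14/03/2024")); // false
```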

[Screenshot: the regex library in DataMatch Enterprise]

4. Complete data cleansing and standardization package

One of the biggest benefits of such tools is that they usually come packaged with other data cleansing and standardization features that are critical for transforming your data into an acceptable shape and format. Once the pattern matching report shows which data values are valid and which are not, the next important step is to fix the invalid patterns.

This is why adopting an end-to-end system that takes care of various data quality management disciplines – including data profiling, cleansing, standardization, matching, and merging – can be a huge bonus.

[Screenshot: the data quality functions offered by DataMatch Enterprise]

Opting for a code-free solution that builds, matches, and transforms patterns

Although we mostly focused on pattern matching in this blog, the art of pattern transformation is just as interesting – yet challenging. For this reason, many organizations like to provide their teams with self-service data cleansing and standardization tools that are designed with pattern designing, matching, and transforming features. Adopting such tools can help your team execute complex data cleansing and standardizing techniques on millions of records in a matter of minutes.

DataMatch Enterprise is one such tool. It helps data teams rectify pattern errors with speed and accuracy, freeing them to focus on more important tasks. To know more about how DataMatch Enterprise can help, you can download a free trial today or book a demo with an expert.
