A recent survey shows that 24% of data teams use tools to find data quality issues, yet the issues they find are typically left unresolved. In other words, most data quality tools can catch problems and raise alerts when data quality deteriorates below an acceptable threshold, but they leave out an important aspect: automating the execution of data quality processes (whether based on time or on certain events) and resolving the issues automatically. The lack of such a strategy forces human intervention – someone has to trigger, oversee, and end data quality processes in the tool to fix the issues.
This is a big overhead, especially in organizations that produce huge amounts of data every day. For this reason, companies have to hire more staff and expend more resources solely to run the configured data quality checks on new incoming data regularly. Some organizations, however, do consider automating large-scale data quality verification for batch processing at certain times of the day or week.
In this blog, we are going to look at scheduled data quality validation and see how it compares to real-time data quality validation.
Data quality validation
Before we get into the two different ways of handling data quality validation, it is important to review what data quality validation actually encompasses.
Most data produced in organizations today has various forms of quality errors. For this reason, data leaders design data quality management frameworks or improvement plans that assess, identify, fix, and monitor data quality issues. In such a framework, a list of data quality processes is configured to be executed on new data so that any errors that arise are fixed in time. These processes usually include:
- Gathering input: fetching new data from disparate sources.
- Profiling data to highlight errors.
- Running data parsing, cleansing, and standardization techniques to achieve a consistent view.
- Matching records that belong to the same entity (exact matching on a unique identifier or fuzzy matching on a combination of fields).
- Merging matched records to remove redundant information and achieve a single source of truth.
- Loading output: storing the single source of truth to the destination source.
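The stages above can be sketched as a small pipeline. The field names, the email-based checks, and the choice of email as the match key are assumptions for illustration, not any particular tool's API:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile(records):
    """Profiling step: count missing and badly formatted values (hypothetical checks)."""
    issues = {"missing_email": 0, "bad_email": 0}
    for r in records:
        email = r.get("email", "").strip()
        if not email:
            issues["missing_email"] += 1
        elif not EMAIL_RE.match(email):
            issues["bad_email"] += 1
    return issues

def standardize(record):
    """Cleansing/standardization step: trim whitespace, normalize casing."""
    out = dict(record)
    out["name"] = out.get("name", "").strip().title()
    out["email"] = out.get("email", "").strip().lower()
    return out

def match_key(record):
    """Matching step: exact match on a unique identifier (email, for illustration)."""
    return record["email"]

def merge(records):
    """Merging step: collapse records sharing a key into a single source of truth."""
    merged = {}
    for r in map(standardize, records):
        merged.setdefault(match_key(r), {}).update(r)
    return list(merged.values())
```

Running `merge` over two spellings of the same customer collapses them into one standardized record; a real implementation would add fuzzy matching and survivorship rules on top.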
When to validate data quality?
The execution of these processes on new data can happen at two times: you can either schedule data quality validation to happen at a later time in the day or week (scheduled), or validate every incoming data stream immediately before it is stored in the database (real-time).
Let’s take a look at both in more detail.
Scheduling data quality validation for batch processing
Batch processing means running the same set of operations repetitively on a high volume of data at a scheduled time.
The concept of batch processing is fairly common in data processing. As data volumes increase exponentially, validating incoming data streams in real-time can be very challenging and limiting. This is why batch processing large amounts of data at a specific time of the day or week can be very efficient.
Following are some aspects to consider while scheduling data quality validation tasks through automated data quality management:
- What tasks to execute?
- In what order should the tasks be executed?
- What are the configured variables and definitions of the tasks to be executed (if applicable)?
- What are the locations of inputs and outputs?
- When to trigger the execution of tasks?
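These decisions can be captured in a simple job definition. The structure below is hypothetical – just one way to record the task list, order, task configuration, I/O locations, and trigger; the paths and field names are made up for illustration:

```python
# A hypothetical scheduled-job definition answering the five questions above.
quality_job = {
    "tasks": ["profile", "cleanse", "standardize", "match", "merge"],  # what, and in what order
    "task_config": {  # configured variables and definitions, where applicable
        "match": {"strategy": "fuzzy", "fields": ["name", "email"], "threshold": 0.85},
    },
    "input": "s3://raw-zone/customers/",       # hypothetical source location
    "output": "s3://curated-zone/customers/",  # hypothetical destination location
    "trigger": {"type": "cron", "expression": "0 2 * * *"},  # every day at 02:00
}

def validate_job(job):
    """Minimal sanity check before handing the job definition to a scheduler."""
    required = {"tasks", "input", "output", "trigger"}
    missing = required - job.keys()
    if missing:
        raise ValueError(f"job definition missing: {sorted(missing)}")
    return True
```

Validating the definition up front catches misconfigured jobs at deploy time rather than at 2 a.m. when the run silently fails.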
Scenario: Scheduling data quality validation for customer data
Depending on your data quality management framework, you can configure multiple tasks on any dataset. For example, you probably capture and store customer information at multiple locations in your organization; an analytics tool tracks website activity, a marketing automation tool saves email activity, accounting software stores billing transactions, a CRM maintains customer contact information, and so on. But to make this data usable, you probably need it to be:
- Free from data quality errors, such as formatting, misspelling, incompleteness, etc.
- Aggregated together to represent a single source of truth about each customer.
An efficient way of handling this scenario is to choose an automated approach where a background service performs the data quality validation tasks (mentioned above) at scheduled times. This will ensure that customer data is fetched, processed, and loaded to a destination source at the end of every day (for example), and the manual overhead of managing these processes is reduced.
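At its core, such a background service just computes the time until the next scheduled run and sleeps until then. A minimal sketch, assuming a nightly run at 2 a.m. and a hypothetical run_quality_checks() callable that wraps the fetch–process–load tasks:

```python
import time
from datetime import datetime, timedelta

def seconds_until(hour, now=None):
    """Seconds from 'now' until the next occurrence of the given hour of day."""
    now = now or datetime.now()
    nxt = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if nxt <= now:
        nxt += timedelta(days=1)  # that hour already passed today; run tomorrow
    return (nxt - now).total_seconds()

def run_forever(run_quality_checks, hour=2):
    """Sleep until the scheduled hour, run the checks, and repeat."""
    while True:
        time.sleep(seconds_until(hour))
        try:
            run_quality_checks()  # fetch, process, and load customer data
        except Exception as exc:
            print(f"quality run failed: {exc}")  # real alerting would go here
```

In production this loop would typically be replaced by cron or a workflow orchestrator, but the scheduling logic is the same.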
Pros and cons of scheduling data quality validation
Here are some benefits and challenges of scheduling data quality validation:
- One of the biggest benefits of batch processing data is effective resource utilization. You not only reduce or eliminate human intervention from execution, but also ensure that other resources (such as desktop or server computation power) are utilized at the best times – when they are idle and available.
- Another benefit is that it reduces the probability of human error and produces consistent results at regular time intervals. The same tasks, if handled by individuals, are prone to be late or inconsistent due to varied human judgment.
- Scheduled processing also enhances business efficiency and productivity as results are ready on time with minimal overhead and involvement.
- Scheduling data quality tasks to run in bulk at one time is simpler than designing a real-time validation architecture.
- You usually don’t need specialized hardware for running scheduled background services as there is no urgent need for fast processing and result generation.
- One of the biggest cons of delaying data quality validation is the downtime during which data sits unvalidated, waiting to be processed at the scheduled time.
- Tasks are scheduled to be executed during off hours – and if the scheduling service fails to trigger (due to a bug or glitch), the data can be left unprocessed until a human intervenes and triggers it manually.
- Some additional technical expertise may be required to design the scheduled jobs for appropriate hardware and power usage, as well as raising alerts for task completion and error notifications.
Implementing real-time data quality validation
Real-time data quality validation refers to verifying the quality of data before it is stored in the database.
To maintain a clean, standardized, and deduplicated view of data at all times, the data can be validated before it is committed to the database. This can be possible in two ways:
- Implementing data validation checks on all data entry tools; for example, website forms, CRMs, accounting software, etc.
- Deploying a central data quality firewall or engine that processes every incoming data stream and validates it before storing it in the database.
Although the first case is comparatively less technically complex, it can be challenging to synchronize data quality checks and fixes across disparate applications. For this reason, many organizations opt for the second option, implementing a data quality firewall within their data management architecture.
Some design a custom data quality firewall for their specific data quality requirements, while others utilize API services of third-party vendors and integrate them in their data architecture. Either way, the same outcome is achieved in both cases: you are able to validate the quality of data at the time of data entry or before it is stored in the database.
Scenario: Real-time data quality validation for customer data
In the same example mentioned above, you can choose to perform data quality checks on incoming customer data in real-time. When any customer record is changed or a new customer record is created in any connected application, the update is first sent to the central data quality engine. There, the change is verified against the configured data quality definition: required fields are not blank, values follow a standard format and pattern, a new customer record is not a likely duplicate of an existing one, and so on.
If data quality errors are found, a list of transformation rules is executed to clean the data. In some cases, a data steward may need to intervene and make decisions where data values are ambiguous and cannot be resolved confidently by the configured algorithms. For example, there could be a 60% probability that a new customer record is a duplicate, and someone may need to verify and resolve the issue manually.
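The checks in this scenario can be sketched as a single gatekeeping function. The required fields, the email pattern, and the 60% review threshold are assumptions for illustration, and the standard library's difflib stands in for a real fuzzy-matching engine:

```python
import re
from difflib import SequenceMatcher

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record, existing, review_threshold=0.6):
    """Return ('accept' | 'review' | 'reject', reasons) for an incoming record."""
    reasons = []
    # Required fields must not be blank.
    for field in ("name", "email"):
        if not record.get(field, "").strip():
            reasons.append(f"missing {field}")
    if reasons:
        return "reject", reasons
    # Values must follow the standard format and pattern.
    if not EMAIL_RE.match(record["email"].strip().lower()):
        return "reject", ["invalid email format"]
    # Flag possible duplicates for a data steward when name similarity is high.
    for other in existing:
        score = SequenceMatcher(None, record["name"].lower(), other["name"].lower()).ratio()
        if score >= review_threshold:
            reasons.append(f"possible duplicate of {other['name']} (score {score:.2f})")
    return ("review", reasons) if reasons else ("accept", [])
```

Records that come back as "review" are the ones routed to the steward's queue; everything else is either committed or rejected automatically.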
Pros and cons of instant data quality validation
Here are some benefits and challenges of instantly validating data quality:
- One of the biggest benefits of real-time data quality validation is that it keeps data in a reliable state at almost all times by validating and fixing data quality immediately after every update.
- Since the data quality firewall is implemented centrally, you can achieve consistent data quality across all enterprise-wide data stores.
- It can help you to implement custom workflows on top of your existing data management architecture. For example, you can route certain data to specific locations after cleaning or raise alerts in case something needs urgent attention.
- A data quality firewall that implements a front-end mechanism for data review by data stewards can also help to override default results in special cases, such as overriding incorrect decisions made by matching algorithms. Batch processing, on the other hand, eliminates human intervention entirely, allowing some false negatives or positives to fall through into your dataset.
- With this approach, you can enable multi-threaded processing, meaning the firewall can serve multiple requests from various applications at the same time.
- Deploying a central data quality engine is comparatively more complex technically. And since all data passes through it, it has high impact and leaves little room for error.
- This approach may require specialized hardware for fast computation and timely, accurate result generation.
- Implementing real-time data quality validation can demand more technical and domain expertise, as well as rethinking the entire data management architecture. This makes the implementation riskier and more complex.
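The multi-threaded serving mentioned above can be sketched with a thread pool. Here validate_record stands for any hypothetical per-record check; a real firewall would also handle queueing, ordering, and back-pressure:

```python
from concurrent.futures import ThreadPoolExecutor

def serve(requests, validate_record, workers=4):
    """Validate records arriving from several applications concurrently (a sketch)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with requests.
        return list(pool.map(validate_record, requests))
```

For CPU-bound matching algorithms, a process pool (or a service scaled horizontally) would be the more realistic choice; threads suffice when the work is I/O-bound.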
Which one to choose: scheduled or real-time data quality validation?
As always, the short answer to this question is: it depends.
Some of these dependent factors include:
- Your data quality rules and requirements,
- The frequency with which your business operations query new or updated data,
- The amount of effort, time, and cost you are willing to invest,
- The magnitude of impact your business can withstand during the implementation of either approach.
Best of both worlds
Sometimes, organizations use both approaches at the same time. This can happen in three ways:
- Either the data is divided between the two approaches (some is processed with the scheduled service while the other is processed in real-time),
- Each approach processes a different set of data quality functions on the same data (data cleansing and standardization is executed in real-time and complex techniques such as fuzzy matching, data deduplication, or merge purge are executed in batch at scheduled time), or
- The low impact scenarios (where accuracy is more important than speed) are handled with scheduled processing and high impact scenarios (where speed is more important than accuracy) can be handled with real-time validation.
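The second split – dividing data quality functions between the two approaches – can be expressed as a simple router. The classification of tasks below is an assumption for illustration:

```python
# Hypothetical routing: fast, cheap fixes run inline on every update,
# while expensive techniques are queued for the nightly batch run.
REALTIME_TASKS = {"cleanse", "standardize"}
BATCH_TASKS = {"fuzzy_match", "dedupe", "merge_purge"}

def route(task):
    """Decide whether a data quality task runs in real time or in the scheduled batch."""
    return "realtime" if task in REALTIME_TASKS else "batch"
```

Defaulting unknown tasks to the batch path keeps the real-time firewall's latency predictable.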
Due to the complex nature of data quality errors and their possible fixes, it has become imperative to adopt creative approaches. This ensures that minimal data quality errors fall through the system and most data is kept clean and standardized.
To execute creative approaches, you need creative tools and technologies that can support the execution of your plans. But most times, it is less likely that one tool or one vendor can serve your data quality needs (in all shapes and forms).
DataMatch Enterprise is an exceptional tool that offers its industry-leading and proprietary data quality functions in all forms:
- A desktop application with an intuitive UI,
- A scheduling service that processes data files in bulk at scheduled times, and
- A data quality firewall or API that exposes all functions for real-time processing.