Data Quality for Artificial Intelligence

Data quality is critical for AI success. For a system to be artificially intelligent, it needs to consume vast amounts of data to formulate theories, gain ‘experience’, and learn. But what happens if that data is incomplete – or just plain bad?
Even as new AI systems are being developed, the debate rages on about whether AI is a good thing. On one side are those who believe AI has the power to improve our daily lives in countless ways. On the other are those who warn of machines rebelling against humans and doing all manner of unspeakable things.

Whichever side you’re on, there’s no doubt that AI is gaining ground and more and more systems are becoming ‘aware’. And every one of them shares an inherent point of potential failure – the data the system consumes.
So how do you make sure your AI technology uses clean data?

Quality in equals quality out

Okay, I’ve twisted the old saying, but it’s true. The more high-quality data you feed into your AI system, the more insights (and value) you’ll get out of it. But note that there are two factors at play here – more data and quality data. More data is easy. It’s everywhere.
Your CRM or ERP platforms, data lakes, and datasets from limitless sources can all help to feed the insatiable monster that is AI. And the more data you can collect, the better your AI system is going to be. But only if it’s quality data. Because the more data you collect, the greater the chances of bad quality data creeping in. And that bad data will render your AI system useless. Let’s look at an example:

AI for parts management

XYZ Corp sells air conditioning equipment and parts from branches across the country. Depending on the weather, demand for certain parts can peak or drop. In warm weather, demand for cooling parts is at its highest, but during cold spells, it’s heating components that are in greatest demand. To keep track of what parts they need to stock, and at which branch they need to stock them, they build an AI system.
The system pulls in data for both short- and long-term weather forecasts. It matches that data against their customers’ addresses and forecasts which parts are likely to be needed. It then orders the parts to be sent to the branch nearest each customer so they always have stock.
If any of their data is wrong, they could end up in one of two scenarios: 1) over-stocking parts, leaving them with unused inventory that they have paid for, or 2) under-stocking parts, leaving them with unhappy customers who go elsewhere.

Check that address!

With the address data, there’s already a strong possibility that the data is bad. As we discussed in “Think Address Verification is an Option?”, 14.2% of Americans and 19.3% of businesses move each year. If that data is not kept accurate, in about five years over 50% of those customer addresses will be wrong.
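That back-of-the-envelope figure is easy to verify. A minimal sketch, assuming yearly moves compound independently and no re-verification happens in between (the `stale_fraction` helper is our own illustration, not part of any product):

```python
def stale_fraction(annual_move_rate: float, years: int) -> float:
    """Fraction of addresses expected to be out of date after `years`,
    assuming each year's moves are independent of previous years."""
    return 1 - (1 - annual_move_rate) ** years

# 14.2% of consumers move each year; after 5 years ~53.5% of
# unverified addresses will be stale.
print(round(stale_fraction(0.142, 5), 3))  # 0.535
```

The business move rate of 19.3% decays even faster: nearly two-thirds of business addresses would be wrong after five years.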
So they need address verification, but it’s completely unrealistic to expect them to export customer records, verify the addresses, and then import the results back into the AI system. It needs to be automated. Fortunately, DataMatch Enterprise Address Verification can connect to their CRM system and USPS data, verify addresses, and update the records using an API.
So now they have clean address data, all is good, right? Well, no. There are two more sources of data that are at risk of error.

Fat finger syndrome

XYZ Corp stocks or sells over 50,000 individual parts, from condensers to heating coils to relays to bolts. And every part needs to be entered into their parts management system. Unfortunately, ‘fat finger syndrome’ is as prevalent at XYZ Corp as anywhere else, and their parts system has 90,000 records!
You’ve seen it yourself. Data gets entered with small errors that are noticeable to the human eye but invisible to the system. Part number XYZ123 gets entered as XZY123 and a new part is created in the system. You and I know they’re actually the same, but the system sees a new part number. Or someone enters “condencer” instead of “condenser”. Maybe the parts buyer orders a faucet from the UK and copies the supplier’s description of the ‘tap’ from their website.
The last one is not bad data, per se, but it can still cause an error if someone searches for ‘faucet’.
Whatever the cause, errors creep in. And the more parts there are, the more potential there is for error. But using ProductMatch, XYZ Corp can clean their records, remove duplicates and end up with better parts data.
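To see why these near-misses are catchable at all, here is a minimal fuzzy-matching sketch. The part numbers and descriptions come from the example above; the use of Python’s `difflib` and the 0.8 similarity threshold are our own assumptions – a dedicated tool like ProductMatch uses far more sophisticated matching:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = ["XYZ123", "XZY123", "condenser", "condencer", "relay"]

# Compare every pair and flag likely duplicates for human review.
for i, a in enumerate(records):
    for b in records[i + 1:]:
        if similarity(a, b) >= 0.8:
            print(f"possible duplicate: {a!r} ~ {b!r}")
```

Running this flags the XYZ123/XZY123 and condenser/condencer pairs while leaving unrelated parts alone – the same principle, at scale, is how duplicate part records get consolidated.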

Whether the weather be hot…

We all know that, even with the greatest of care, weather forecasts can be inaccurate at times. So, there’s an inherent risk of bad data. And you probably think that’s the end of that. But it’s not.
The weather data can suffer from exactly the same fat finger syndrome as the parts data. For example, the temperature across five days of data might range from 45F to 58F, but someone enters 158F for December 25.
You’ve got a couple of options to catch this kind of error. You can build some sort of anomaly detection into the AI system, but you’ll need to be careful. Some spikes or troughs occur naturally, and you need to ensure the AI system knows the difference between a natural high and an error.
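To make the anomaly-detection option concrete, here is a minimal sketch using a median-based outlier rule. The 45–58F readings and the 158F outlier come from the example above; the modified z-score method and its 3.5 threshold are our own assumptions – real systems would also model seasonality so that natural spikes aren’t flagged:

```python
from statistics import median

def flag_outliers(temps_f: list[float], threshold: float = 3.5) -> list[float]:
    """Return readings whose modified z-score (based on the median
    absolute deviation) exceeds the threshold."""
    med = median(temps_f)
    mad = median(abs(t - med) for t in temps_f)
    if mad == 0:
        return []  # no spread at all; nothing stands out
    return [t for t in temps_f if 0.6745 * abs(t - med) / mad > threshold]

readings = [45, 52, 49, 58, 47, 158]  # 158F is a likely fat-finger entry
print(flag_outliers(readings))  # [158]
```

A rule like this catches the obviously impossible 158F while leaving a legitimately warm 58F day alone, which is exactly the natural-high-versus-error distinction the AI system needs.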
The other option is using the data profiling in DataMatch Enterprise. Using profiling, DataMatch Enterprise can flag these errors allowing you to review and correct the data before feeding it into the AI system.

Data quality for Artificial Intelligence

It’s easy to be complacent about data quality for AI, but we shouldn’t be. In our example, the only thing at risk is the wrong parts being ordered, but with some of the AI systems being developed today, the stakes are much higher. Some of that risk can be mitigated, especially when it comes to feeding the beast good-quality data.
At Data Ladder, our products such as DataMatch Enterprise and ProductMatch can help you make sure that the data you’re feeding your AI system is clean, quality data.

Get in touch TODAY for more information.
