Traditionally, data migration involved extracting, transforming, and loading data from relational databases to the central enterprise data warehouse where it could be put to use in applications.
The warehouse would hold millions of sales, purchase, and invoice records, and was considered the primary, single source of truth for the organization.
But times have changed.
Today, companies don’t just have a million records of bills and invoices; they have billions of interaction records. They don’t just have customer IDs, email addresses, or phone numbers; they have metadata: device logins, website sessions, cookies, image attributes, and so on. Companies are investing billions in big data projects, hoping data will turn into money.
As companies chase grand ambitions in big data, AI, and ML, their reliance on traditional data warehouses is diminishing. According to an article published in CIO, in 2018 nearly 96% of companies invested in some form of cloud migration.
Estimated at just $24.65 billion in 2010, the cloud market has already surpassed the $100 billion mark and is on track to reach $150 billion by the end of 2020.
– TechJury
Cloud migration, while just a buzzword in the early 2000s, quickly became mainstream as companies began dabbling in complex data that terabyte-sized enterprise data warehouses were no longer equipped to hold. ETL became a cumbersome process with its limitations in handling complex data. This gave rise to powerful data preparation solutions like DataMatch Enterprise, which allowed business users to integrate data from third-party sources, clean, prepare, dedupe, and standardize records, overcoming ETL challenges.
Challenges with ETL
A typical ETL system is effective when data is structured, batch-oriented, and regularly updated. With time-sensitive streaming data, however, ETL systems tend to falter unless they are tweaked through custom programming. Even then, an ETL system in a real-time environment will be challenged by the requirements of low latency and high availability.
The following image depicts the ETL process. As can be noted, ETL does not cater to the intricate nature of modern data. For instance, users cannot cross-match or enrich data, and they cannot weed out deeply embedded duplicates, which is a serious issue in the context of modern data.
[Figure: the ETL process (Extract, Transform, Load, Analyze)]
The ETL process itself became increasingly complex, especially as data formats multiplied and multiple scripts and APIs were required for each format to parse data. If compatible APIs or drivers were not available, ETL specialists had to hand-code the ETL process – a tedious task considering that a data source from an average company would contain millions of rows of data.
Although there are many commercially available ETL tools capable of handling complex data and overcoming ETL challenges, they still involve a significant learning curve and additional process implementation to make data usable.
Data Preparation as the First Step to a Cloud Migration
Data preparation – the process of collecting, cleaning, and organizing data for use – removes the restrictions of ETL. As companies adopt agile data management practices and move into flexible, centralized storage structures, data preparation comes as a much-needed reprieve from ETL.
That said, data preparation is still widely mistaken for a mere data cleansing activity. When companies say they are preparing their data, often all they do is extract it and load it into the cloud. They seldom spend time evaluating the quality of their data. Moreover, they fail to assess the value of their records and are in the dark when it comes to understanding their data.
A lack of data prep strategy could result in cloud migration failures that impact business goals. Data preparation then is not just a technical process. It’s also a business process that determines the success of your migration. It’s easily the difference between success and failure, between usable insights and unintelligible text, between a migration that goes ahead and a migration that goes back to the data center.
Avoid the Migration Trap
Plan for a successful transition to a new storage platform:
83% Of File Migration Projects Fail, Exceed Budget, Or Run Over Schedule
When planning a successful cloud migration, several different things need to happen in a specific order.
Step 1 – Profile or Assess the Data
The first step, and possibly the most critical, is to assess the quality of your data. It’s the perfect time to take a good, hard look at your data and make some assessments.
Questions you can ask to assess your data can include:
Do you have raw, untreated data?
Do you have technically correct but duplicated data?
Do you have clean usable data to work with?
Do you have siloed data from disparate sources?
Do you have a clear selection of the types of data you will need for this goal?
Do you need to integrate large data sources such as social media, and transactional or behavioral data to get a unified view of your customers?
Do you have a data preparation solution in place that can work on-premises and on the cloud?
Profiling will give you an idea of the health and quality of your data, making it easier to make informed data-cleansing decisions.
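As a rough illustration, some of these questions can be answered with a quick profiling pass over an exported file. The sketch below is a minimal Python example, assuming a hypothetical customers.csv export; it only counts missing values and exact duplicate values per column, whereas a dedicated profiling tool goes much further.

```python
import csv
from collections import Counter

# Minimal profiling sketch: count missing values and exact duplicate values
# per column in a hypothetical customers.csv export.
with open("customers.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for column in rows[0].keys():
    values = [row[column].strip() for row in rows]
    missing = sum(1 for v in values if not v)
    counts = Counter(v for v in values if v)
    duplicated = sum(c - 1 for c in counts.values() if c > 1)
    print(f"{column}: {missing} missing, {duplicated} duplicated values out of {len(values)} rows")
```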
Step 2 – Clean the Data
The biggest mistake companies make is to clean their data post-migration. This results in wasted time and effort and drives migration costs up dramatically.
Multiple research studies and analyses estimate that 10 to 25% of data is inaccurate – and this does not even take into account obsolete, outdated, or useless data stored on enterprise servers or data warehouses.
Suppose you have a mailing list of 100,000 contacts and 25% of the records are incomplete, with missing phone numbers, invalid addresses, and typo-ridden names. That’s 25,000 wasted mailers – not to mention printing, return mail, postage, and operational costs. If you carry this dirty data into your cloud migration, you’re essentially ballooning costs for no reason at all.
Data cleansing is not only about fixing format or content issues; it’s also largely about deduping data. Data duplication is a pain point in data sources and one that is extremely difficult to resolve. It can take months to match data from multiple sources and remove duplicates. Not only is the migration affected, but the company’s analysis and reporting also suffer significantly from duplicated data.
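To make the idea concrete, here is a minimal cleansing sketch in Python, again assuming a hypothetical customers.csv with email and phone columns. It trims whitespace, lower-cases email addresses, strips non-digits from phone numbers, and drops exact duplicates; deduplicating records that are not identical requires fuzzy matching, covered later.

```python
import csv
import re

def clean_record(row):
    # Trim whitespace, lower-case emails, keep only digits in phone numbers.
    row = {k: v.strip() for k, v in row.items()}
    row["email"] = row.get("email", "").lower()
    row["phone"] = re.sub(r"\D", "", row.get("phone", ""))
    return row

with open("customers.csv", newline="", encoding="utf-8") as f:
    records = [clean_record(r) for r in csv.DictReader(f)]

# Drop exact duplicates after cleaning; near-duplicates need fuzzy matching.
seen, unique = set(), []
for record in records:
    key = tuple(sorted(record.items()))
    if key not in seen:
        seen.add(key)
        unique.append(record)
print(f"{len(records) - len(unique)} exact duplicates removed")
```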
Step 3 – Purge the Data
A common mistake companies still make is moving everything to the cloud when not everything belongs there. Profiling and cleaning data will allow you to see which data sets need to be purged. Junk data, such as customer records from a decade ago that have never been updated and are now useless, or financial data that has aged past the seven-year retention rule, is still routinely migrated to the cloud. Companies end up paying to store data that is of no use to them.
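A simple way to act on this is an age-based purge rule applied before migration. The sketch below is only an illustration under assumptions: the field names are hypothetical, and the seven-year threshold is borrowed from the retention example above; real purge decisions should follow your organization’s retention policy.

```python
from datetime import datetime, timedelta

RETENTION_YEARS = 7  # assumption: threshold borrowed from the seven-year rule above

def should_purge(record, today=None):
    """Flag records whose last activity falls outside the retention window."""
    today = today or datetime.now()
    last_activity = datetime.strptime(record["last_activity"], "%Y-%m-%d")
    return today - last_activity > timedelta(days=365 * RETENTION_YEARS)

records = [
    {"id": 1, "last_activity": "2009-04-30"},  # dormant for over a decade
    {"id": 2, "last_activity": "2020-01-15"},
]
to_migrate = [r for r in records if not should_purge(r)]
to_purge = [r for r in records if should_purge(r)]
print(f"{len(to_migrate)} records to migrate, {len(to_purge)} to archive or delete")
```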
Step 4 – Standardize the Data
Data standardization is a key part of ensuring data quality. Unfortunately, data standardization is often left out of the discussion. You’ve cleansed and deduped the data. Next, you have to standardize it according to your organization’s data rules and map it according to the new system.
For instance, a customer name is usually a field that requires standardization. It may be represented in thousands of semi-structured forms. A data standardization tool will be able to parse the different components of the name (such as first name, last name, middle name, initials, etc.) and allow the user to rearrange those components into a suitable format.
Data standardization is rule-based and is guided by established standards (such as requiring all names to begin with an uppercase letter) that have been recognized as common over time. Standardizing data ensures data integrity and makes the mapping process easier.
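As a simple illustration of rule-based standardization, the sketch below parses a free-text name into components and applies a consistent casing rule. The formats handled here are assumptions for the example; a real standardization engine deals with far more variants (prefixes, suffixes, titles, and so on).

```python
def standardize_name(raw_name):
    """Parse a free-text name into components and apply an uppercase-first rule."""
    # Handle both "Last, First Middle" and "First Middle Last" forms.
    if "," in raw_name:
        last, _, rest = raw_name.partition(",")
        parts = rest.split() + [last.strip()]
    else:
        parts = raw_name.split()
    parts = [p.strip(".").capitalize() for p in parts if p.strip(".")]
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1] if len(parts) > 1 else "",
    }

print(standardize_name("SMITH, john a."))  # {'first': 'John', 'middle': 'A', 'last': 'Smith'}
print(standardize_name("jane DOE"))        # {'first': 'Jane', 'middle': '', 'last': 'Doe'}
```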
Step 5 – Verify and Validate the Data
Data preparation also involves verification and validation of data at its final stage. Once the data is profiled, cleansed, deduped, and standardized, it must be verified and validated.
Address records, for instance, require verification against an authority standard, such as the USPS in the US. This process ensures that addresses are complete and accurate, making it possible for companies to send out mailers with confidence.
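Full verification requires an authoritative reference such as USPS address data or a certified verification service, which cannot be reproduced in a few lines of code. The hypothetical sketch below only shows the kind of basic completeness and format checks worth running before records are even sent out for verification.

```python
import re

def basic_address_checks(address):
    """Basic completeness and format checks. This is NOT full verification:
    true validation means checking against an authority such as USPS data."""
    issues = []
    for field in ("street", "city", "state", "zip"):
        if not address.get(field, "").strip():
            issues.append(f"missing {field}")
    if address.get("zip") and not re.fullmatch(r"\d{5}(-\d{4})?", address["zip"]):
        issues.append("ZIP code not in 5-digit or ZIP+4 format")
    if address.get("state") and not re.fullmatch(r"[A-Za-z]{2}", address["state"]):
        issues.append("state is not a two-letter code")
    return issues

print(basic_address_checks({"street": "123 Main St", "city": "Fargo", "state": "ND", "zip": "5810"}))
# ['ZIP code not in 5-digit or ZIP+4 format']
```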
While this may not seem like a part of cloud migration, it is an essential step of quality assurance that is usually missed as companies scramble to get their migration done on time. It’s always advisable to spend time resolving data quality issues before migration because the migration process itself is not time-consuming – it’s the data preparation that takes up the most time. Companies do it the other way around; they spend more time moving the data and less time preparing the data.
How Data Quality Affects Customer Data – A Case Study
Bell Bank, a renowned bank with branches in all 50 states of the US, has a goal – to unify data from internal and external sources and get a singular customer view that can be used to create personalized services.
But before that’s possible, the bank has to deal with:
Disparate Data Sources: The bank’s external vendors and third-party services manage data using their own platforms and solutions. When the bank wants the 360-degree customer view, it has to retrieve this data from third-party sources, along with all its inconsistencies and variability.
In this specific case, the disparity stems from external associations, which is understandable. In most companies, though, internal teams use varying apps and systems to collect and store data, which makes the problem even more difficult to resolve. Even if the company uses one central platform, the lack of a defined process and of adherence to data quality rules still produces disparate information.
Duplicate Records: Duplication of data is the most dangerous data quality problem as it’s so difficult to identify. In the case of Bell Bank, the consolidation of customer data resulted in significant duplicates. It was bound to happen – when you consolidate data from multiple sources, chances are you’re storing multiple versions of a single entity’s records.
Picture five different versions of one entity (an individual), created by perhaps five different people and stored on different platforms. The entity has changed phone numbers, email addresses, and physical addresses over the five years they have been with the bank. This kind of duplicate is the most difficult to weed out because none of the values are deterministic.
Messy Data: Not only is this data duplicated, it is also severely messy. No standards or formatting rules have been applied. Abbreviations, stray punctuation, and misspelled names degrade the quality and affect the data’s usability. While this may seem like a small problem, companies end up spending millions hiring data analysts just to identify these errors and clean them up. Moreover, a messy database can result in data security problems, embarrassing mistakes, and annoyed customers.
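To show why duplicates like the ones described above are hard to catch with exact comparisons, here is a minimal fuzzy matching sketch using Python’s standard-library difflib. The records, field weights, and threshold are all assumptions for illustration; production matching tools use far more sophisticated and tunable algorithms.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical versions of the same customer collected from different systems.
records = [
    {"name": "Jonathan R. Smith", "email": "jon.smith@example.com"},
    {"name": "Jon Smith",         "email": "jsmith@example.com"},
    {"name": "Smith, Jonathan",   "email": "jon.smith@example.com"},
]

# Flag pairs whose weighted name/email similarity crosses a threshold.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = (0.6 * similarity(records[i]["name"], records[j]["name"])
                 + 0.4 * similarity(records[i]["email"], records[j]["email"]))
        if score > 0.7:
            print(f"Possible duplicate: record {i} and record {j} (score {score:.2f})")
```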
Replacing ETL with a Self-Service Data Preparation Solution
ETL’s limitations have prompted a rise in data preparation solutions, specifically in self-service solutions where business users, data analysts, and IT experts can all clean and prepare the data for migration.
While ETL used to be restricted to IT, self-service data preparation is designed for business users. This lowers the burden on IT staff, allowing them to focus on the technical matters of migration. Expecting IT to clean business records is like asking a technician to be an engineer; self-service tooling assists business users with data quality instead of making IT responsible for it. This is why modern data preparation tools like DataMatch Enterprise are dramatically different from ETL tools.
While handling complex data preparation processes such as data matching and deduplication, data integration, and consolidation, it also allows users to easily explore and assess their data. It opens a whole new world of possibilities where users can create multiple versions of the truth and bring forward master records that can be quickly mapped into the new system. All this is made possible in a DIY tool that does not mandate programming or database management knowledge.
DataMatch Enterprise follows a workflow that lets users perform key functions such as:
Data Integration: Integrate data from 150+ applications and derive the data sets you need for your analysis and reports. This means you can import data from social media sources like Facebook and Twitter, big data platforms like Hadoop, and CRM platforms like Salesforce and HubSpot. You simply integrate the platform with DME and make fixes as you go.
Data Profiling: Identify the flaws of your data visually and get an overview of your data health. You can see the number of misspelled names, typos, and other common data quality problems categorized into multiple columns.
Clean Data: Data cleaning is achieved by applying pre-defined rules to your data. With these rules, you can clean up messy data and standardize it according to established standards. For instance, if your data columns suffer from mixed-case issues, you can clean them up by applying an upper-case function with a simple check-box click.
Match Data: Match data within, between, or across multiple data sources using a combination of fuzzy matching algorithms and the software’s proprietary algorithm. Data matching is the very foundation of data preparation because it is the only way to remove duplicates from your data. If the tool you’re using does not have 100% data match accuracy, your data probably has thousands of hidden duplicates that cannot be easily detected.
Merge: Merge corrected and matched data into a single master file and make it your single source of truth before loading this new information into a new system or source.
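For a feel of what the merge step produces, here is a generic survivorship sketch: from a group of records already identified as the same entity, keep the most recently updated non-empty value for each field. This is only an illustration under assumed field names, not DataMatch Enterprise’s actual implementation.

```python
def build_master_record(matched_group):
    """Collapse a group of matched records into one master record by keeping
    the most recently updated non-empty value for each field."""
    master = {}
    # Assumes each record carries a 'last_updated' ISO date string.
    for record in sorted(matched_group, key=lambda r: r.get("last_updated", "")):
        for field, value in record.items():
            if field != "last_updated" and str(value).strip():
                master[field] = value  # newer records overwrite older values
    return master

group = [
    {"name": "Jon Smith", "phone": "", "email": "jsmith@example.com", "last_updated": "2017-03-01"},
    {"name": "Jonathan Smith", "phone": "701-555-0100", "email": "", "last_updated": "2020-06-15"},
]
print(build_master_record(group))
# {'name': 'Jonathan Smith', 'email': 'jsmith@example.com', 'phone': '701-555-0100'}
```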
Our native integration with modern CRMs and advanced record linkage features allow users to find data matches across all supported repositories, regardless of whether your data lives in social media platforms and legacy systems or traditional databases and flat files. Data Ladder integrates with virtually all modern systems to help you “Get the Most Out of Your Data”.
Conclusion
Cloud migration is the future, but one does not simply move data to the cloud. You have to make data quality a mandatory focus, and data preparation is a process that must be completed before you make the move.
Despite being in demand, cloud migration is complex, and the chances of failure are higher than the chances of success. Many organizations face the alarming prospect of going back to the data center after failing mid-way through a migration riddled with data problems. If a cloud migration strategy is to succeed, it must start with an organizational effort to review and optimize data quality.