As users seek to go beyond canned reports, dashboards, and spreadsheets to employ sophisticated visual analytics to drive decisions and actions, traditional processes for preparing data are under pressure. These include steps for data quality and profiling, data transformation, and other forms of enrichment.
To perform business-driven analytics, users want flexibility in data preparation; they don’t want to wait for long cycles of extraction, transformation, and loading (ETL) only to gain access to a limited selection in the data warehouse. In today’s brave new big data world full of Hadoop clusters and nontraditional data types, users want to explore it all with less restraint and more self-service.
The established world of ETL and data integration is thus in the midst of a shakeup. Innovators are coming out of Google, Facebook, and leading universities to launch new companies with the backing of top-flight venture capitalists (VCs). Traditional vendors have had to adjust quickly and introduce solutions that are geared more to ad hoc, on-the-fly data integration, transformation, quality improvement, and other preparation for analytical activities such as blending internal and external data to gain insights into competitive pricing. Newer solutions employ machine learning and other advanced analytics to enable users to learn about the data faster, with algorithms for finding relevant data relationships and anomalies.
As always happens in our industry, new technologies bring new terminology with them to draw distinctions from the old. Rather than ETL and data integration, the latest data preparation and integration technologies apply terms such as “data blending,” “data munging,” and “data wrangling.” Although the vendors use them somewhat differently, the terms generally stand for easier and faster data preparation and integration of a wider range of sources, usually through automated processes driven by advanced analytics.
Being able to integrate and prepare a wider variety of data types is a major distinction between the newer solutions and the old. Both inexperienced and expert analysts today increasingly want to blend views of disparate data types including geospatial, text, and demographic data with their more traditionally structured transactional data. These nonstandard data types are often voluminous, varied, and messy; to gain business value from them sooner, manual work must be replaced by automated methods.
Although experienced data scientists and analysts may still prefer to get their hands dirty and write code to analyze the data based on intimate knowledge of the sources, most users need automation to run queries and models against what could be petabytes of highly varied data.
To hide the complexity of selecting, blending, and accessing data sources, many of the newer tools provide graphical user interfaces of their own or the ability to embed icons in leading business intelligence and visual analytics solutions. Users can work with icons rather than code to perform data mashups, set filters, or create custom data blends for their immediate analytic needs. The tools are thus fueling the trend toward self-service data integration, taking tasks out of the hands of IT to enable business analysts and other nontechnical users to work on their own to develop variables, build models, or query sources to find data patterns and correlations.
Of course, much of this innovation is aimed at enabling organizations to gain more value out of the growing “lake” of data stored in Hadoop clusters. Organizations need tools that are geared to the “schema-on-read” data analysis style prevalent with Hadoop, where schema, transformation, and other steps are applied to data when it is accessed rather than as it enters the system, as is typical with traditional BI and data warehousing systems. Because no one vendor is entrenched as the market leader for data preparation on Hadoop, the new firms and their VC backers see a major opportunity.
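To make the schema-on-read idea concrete, here is a minimal sketch in plain Python rather than in any Hadoop tool. The sample records, the PRICING_SCHEMA mapping, and the read_with_schema function are all hypothetical names invented for illustration; the point is only that raw data is stored untouched and that field selection and type conversion happen when an analyst reads it.

```python
import json
from datetime import date

# Raw records land in storage exactly as they arrive; nothing is
# validated, cast, or dropped at write time. (Hypothetical sample data.)
RAW_LINES = [
    '{"sku": "A-100", "price": "19.99", "sold_on": "2015-03-01"}',
    '{"sku": "B-200", "price": "5", "sold_on": "2015-03-02", "region": "EMEA"}',
    '{"sku": "C-300", "sold_on": "2015-03-04"}',
]

# A schema the analyst chooses at query time: field name -> converter.
# (Hypothetical; a different analysis could apply a different schema
# to the same raw lines.)
PRICING_SCHEMA = {
    "sku": str,
    "price": float,
    "sold_on": date.fromisoformat,
}


def read_with_schema(lines, schema):
    """Parse each raw line and apply the schema only now, at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the load.
            row[field] = convert(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_LINES, PRICING_SCHEMA))
for row in rows:
    print(row)
```

In a schema-on-write system, the third record (no price) and the extra "region" field would have to be dealt with before loading. Here they are handled, or simply ignored, by whichever schema the reader applies.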
Many of the new data preparation software providers are led by technologists with deep experience in using Hadoop, MapReduce, Spark, and related Apache open source technologies. Offloading of ETL jobs to cheaper Hadoop systems has already been growing and will likely accelerate as Spark and commercial SQL-on-Hadoop options mature. Over time, these trends will make it easier for organizations to view Hadoop as an appropriate platform for a greater share of their data preparation, enrichment, and integration tasks.
Advances in data preparation and integration are having a major impact on BI, visual analytics, and data discovery. Here are three key points to consider when you’re evaluating tools for data preparation.
Ensure good data governance.
One of the potential dangers of breaking away from IT control and increasing users’ self-service with data preparation is that proper data governance can become more difficult. Data preparation and integration tools are increasingly providing data lineage tracking capabilities, which can be helpful for data governance. Users and IT should work together to set rules and ensure that they are followed.
Manage performance carefully.
Whether users rely on traditional ETL or newer data preparation and integration software, better performance is always a key goal and a high priority. Look carefully at how vendors are currently employing or planning to employ in-database and in-memory processing for data preparation analytics to improve performance.
Make it easier for users, not harder.
Many newer technologies are attempting a transition from Hadoop’s developer-oriented culture to the world of nontechnical users who generally do not want to code and are more focused on solving business problems. Graphical interfaces help, but they can also mask confusion. Ensure that users are properly trained and guided as they move toward self-service data preparation and integration.
New Opportunities, New Responsibilities
New technologies entering the market mean that these are exciting times for users who have been frustrated with traditional ETL and data integration and seek more flexibility and control. However, as Uncle Ben so famously said in Spider-Man, “With great power comes great responsibility.” Users and IT must adjust rules, practices, and their relationship to make effective use of new data preparation technologies and avoid potential pitfalls.