Apache Hadoop enterprise player Cloudera has launched a real-time query engine for Hadoop, dubbed Impala. Cloudera Enterprise will now effectively be the first big-data management solution that allows batch and real-time operations to be performed on unstructured or structured data all within the same scalable system.
The suggestion is that organizations will now be able to process data at petabyte scale and, on the same system, interact with that data in real time to deliver what Cloudera likes to call "speed-of-thought" insights.
NOTE: Cloudera Impala is an Apache-licensed, real-time query engine for data stored in HDFS (Hadoop Distributed File System) and HBase. Cloudera Enterprise RTQ (Real-time Query) provides the management and support needed to operate Cloudera Impala in production environments.
"Mainstream enterprise adoption of Hadoop will inevitably raise expectations," said Tony Baer, principal analyst for Ovum. "Enterprises have grown accustomed to interactive querying and on-the-spot analytics with their existing data warehousing and BI infrastructures and will expect no less of Hadoop. With a real-time query capability powered by its new Impala engine, Cloudera is striving to level the playing field in performance and accessibility with massively parallel SQL platforms."
The implication is that developers are going to have to learn new tricks to take advantage of the real-time big data crunching made possible by Impala. Impala queries are generally short-lived, with smaller, focused result sets. Additionally, Impala queries operate on data sets of any size in HDFS.
Impala is "especially well suited" to use cases where real-time queries and speed are essential. But while many developers will be familiar with Hive and Pig, Impala runs queries through its own daemons, which are spread across the cluster. Furthermore, Impala does not leverage MapReduce, which allows it to return results in real time.
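For developers who want a mental model of the "daemons spread across the cluster" idea, the following toy Python sketch may help. It is not Impala code and does not use any Impala API: the names `NODE_PARTITIONS`, `local_scan`, and `run_query` are hypothetical, the data is made up, and threads stand in for per-node daemons. It only illustrates the general scatter-gather pattern, in which each node filters and pre-aggregates its own local rows and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows stored on one node.
NODE_PARTITIONS = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 5)],
    [("de", 4), ("fr", 1)],
]


def local_scan(rows, min_value):
    """Work done by one simulated daemon: filter and pre-aggregate local rows."""
    partial = Counter()
    for key, value in rows:
        if value >= min_value:
            partial[key] += value
    return partial


def run_query(partitions, min_value=0):
    """Coordinator: send the same query fragment to every node, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda rows: local_scan(rows, min_value), partitions)
        total = Counter()
        for partial in partials:
            total.update(partial)  # adds the per-key sums together
    return dict(total)


if __name__ == "__main__":
    # Roughly: SELECT key, SUM(value) ... WHERE value >= 2 GROUP BY key
    print(run_query(NODE_PARTITIONS, min_value=2))
```

The point of the sketch is that each node works on its own data in parallel and only small partial results travel to the coordinator, rather than the query being staged as a batch job. How Impala actually plans and executes queries is more involved than this.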
"We have already seen high levels of interest in, and adoption of, Hadoop by enterprises for low-cost storage and transformational processing of large volumes of data, but have argued that for Hadoop to gain more adoption for analytic workloads we need to see analytic tools taking full advantage of Hadoop’s scalable parallel processing architecture," said Matt Aslett, research manager, data management and analytics, 451 Research. "Cloudera Enterprise RTQ and Cloudera Impala look to be a significant step in enabling enterprises to take advantage of existing SQL skills and tools to realize the potential of real-time analytics against large volumes of structured and unstructured data stored in Hadoop."
NOTE: Apache Hadoop started as an offline, batch processing system. Subsequently, Hadoop was extended to service more interactive online workloads. First among these was HBase, the distributed, tabular data store.
Cloudera Impala introduces what is essentially a scalable, distributed query engine to the Hadoop ecosystem. The technology was developed by Marcel Kornacker, lead architect of the Impala project, who previously helped build the query engine for the F1 project at Google.
"Apache Hadoop has already transformed the industry, unlocking value from Big Data for enterprises around the world," said Mike Olson, CEO of Cloudera. "Until now, enterprises had to limit the work they did with Hadoop because batch-mode processing using MapReduce was just too slow for some business problems. With today's release of Cloudera Enterprise Real-Time Query powered by Impala, we solve that problem. Cloudera Impala complements MapReduce and is the latest addition to our one hundred percent open source Big Data platform."
"You can now store all your data in Hadoop and use the same hardware to do both powerful analytics and run real-time queries using industry-standard tools and the SQL language," added Olson.
NOTE: Cloudera remains a leader in open-source contribution across Hadoop and supplemental projects such as Hive, Flume, Search, and Impala, and is the single biggest contributor to Hadoop-related projects with over 50 project committers, PMC members, and code contributors to Apache.