Why Hadoop and SQL?
Organizations seeking fast business results using Hadoop are often impeded by its complexity. While powerful for unstructured data analysis, Hadoop remains complex for traditional structured data analytics.
Integrating a fast, mature SQL query processing engine with Hadoop enables users to achieve greater project agility and streamline development by selecting appropriate tools for each step in their analytical processing. SQL processing complements MapReduce and other Hadoop development methods, simplifying the task of analyzing unstructured data and structured data together.
Cloudera, MapR, Pivotal Greenplum and Hortonworks are already pushing their own products and projects.
Cloudera has announced general availability of its Impala SQL query engine for Hadoop: Cloudera Impala 1.0, the first production-ready release, is now available.
Impala is Cloudera’s attempt to address the growing demand for interactive SQL analytics on Hadoop data.
Impala actually uses the same “nearly ANSI” version of SQL as the current standard bearer, Hive, but that technology (created by Facebook in 2009 as a data warehouse layer for Hadoop) is still too high-latency for real-time queries. Hortonworks, for its part, has launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases (http://hortonworks.com/blog/100x-faster-hive/).
The MapR-led Apache Drill project is cut from the same cloth as Impala (that is, a Google Dremel clone designed specifically for Hadoop). Drill is “a distributed system for interactive analysis of large-scale datasets.” Whereas an expert data scientist might use Hadoop to analyze years of marketplace usage data to find hidden customer behavior patterns, a less sophisticated analyst could tap Drill’s SQL-like functionality to answer specific questions, such as “What were the top 100 apps during the last quarter?” or “What time of day is most popular for app downloads?” Further, a Hadoop/MapReduce job could take minutes or hours to return results, whereas Drill produces query results in near real time, within seconds or less. And where Hadoop requires significant expertise from the user, Drill is aimed at regular business intelligence users.
Drill supports standard ANSI SQL:2003 and can be used by any SQL-based tool, such as Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, and others.
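To make the two analyst questions quoted above concrete, here is a minimal sketch of how they might look as plain SQL. It is not Drill-specific: Python's built-in sqlite3 module stands in for the SQL engine, and the `downloads` table, its columns, the sample rows and the quarter boundaries are all hypothetical, chosen only for illustration. The `substr` call used to extract the hour is SQLite-flavored and would likely be written differently on another engine.

```python
import sqlite3

# Assumption: SQLite is only a stand-in for an interactive SQL engine.
# The table name, columns and rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE downloads (app TEXT, downloaded_at TEXT)")
conn.executemany(
    "INSERT INTO downloads VALUES (?, ?)",
    [
        ("maps", "2013-01-15 09:12:00"),
        ("maps", "2013-02-03 20:45:00"),
        ("maps", "2013-03-21 20:05:00"),
        ("chat", "2013-01-30 20:30:00"),
        ("chat", "2013-03-02 08:10:00"),
        ("game", "2013-02-14 13:00:00"),
        ("game", "2013-04-02 20:00:00"),  # falls outside the quarter
    ],
)

# "What were the top 100 apps during the last quarter?"
# The quarter is assumed to be Q1 2013 for this sample data.
top_apps = conn.execute(
    """
    SELECT app, COUNT(*) AS n
    FROM downloads
    WHERE downloaded_at >= '2013-01-01' AND downloaded_at < '2013-04-01'
    GROUP BY app
    ORDER BY n DESC, app
    LIMIT 100
    """
).fetchall()

# "What time of day is most popular for app downloads?"
# Characters 12-13 of the timestamp string hold the hour.
busiest_hour = conn.execute(
    """
    SELECT substr(downloaded_at, 12, 2) AS hour, COUNT(*) AS n
    FROM downloads
    GROUP BY hour
    ORDER BY n DESC, hour
    LIMIT 1
    """
).fetchone()

print(top_apps)
print(busiest_hour)
```

Each question fits in a single declarative statement, which is the appeal for business intelligence users: the same request written as a MapReduce job would need custom mapper and reducer code.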
Drill release milestones:
- Prototype: Q1
- Alpha: Q2
- Beta: Q3
What takes the Pivotal HD Hadoop distribution to the next level is a major new component: HAWQ, a relational database that runs atop HDFS. HAWQ draws on 10 years of development on the Greenplum Database product, is fully compliant with SQL92 and SQL99, and also supports the SQL 2003 OLAP extensions. Pivotal's initial tests show that HAWQ is hundreds of times faster than Hive.