EMC’s Pivotal Initiative division made a big splash last week with the launch of its Pivotal HD distribution of Hadoop. This is not a normal Hadoop distribution, but one that takes the parallel guts of the Greenplum database and reworks them to transform the Hadoop Distributed File System (HDFS) into something that speaks perfectly fluent SQL.
As discussed previously, Pivotal HD will incorporate Project Hawq, an SQL database layer that rides atop HDFS rather than trying to replace it with a NoSQL data store.
The Hawq extensions to Hadoop’s HDFS turn it into a database, explained Josh Klahr, product manager for the Pivotal HD line at EMC. “Hawq really is a massively parallel processing, or MPP, database running in Hadoop, running on top of HDFS,” he said, “embedded, as one single system, one piece of converged infrastructure that can run and deliver all of the great things that Hadoop and HDFS have to offer as well as the scale and performance and queriability that you get from an MPP database.”
“It really is SQL-compliant, and I don’t use those terms lightly,” Klahr explained further. “It is not SQL-ish, it is not SQL-like. The Hawq allows you to write any SQL query and have it work on top of Hadoop. SQL-99, SQL-92, SQL-2011, SQL-2003, and I am sure there are some other years in there as well.”
The secret to Hawq, Sherry said, was that the database layer has dynamic pipelining, which is the combination of a bunch of different Greenplum technologies that have been built for the parallel relational database (a derivative of PostgreSQL) that Greenplum created when it was a standalone company a decade ago.
The dynamic pipeline is a job scheduler for queries (separate from the NameNode and the JobTracker in Hadoop) that can schedule queries in the most efficient way.
After the queries are executed on the appropriate data chunks under the direction of the Hadoop NameNode, the results stream back to the NameNode and from there to whatever bit of software issued the SQL query.
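The flow described above, where a query is farmed out to the data chunks and the partial results are streamed back and combined, can be sketched in a few lines of Python. This is only an illustration of the general scatter-gather pattern, not Hawq's actual scheduler: the sample chunks, the `run_on_chunk` and `scatter_gather` names, and the thread pool standing in for cluster nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data chunks, standing in for HDFS blocks spread across nodes.
CHUNKS = [
    [{"gender": "F"}, {"gender": "M"}, {"gender": "F"}],
    [{"gender": "M"}, {"gender": "M"}],
    [{"gender": "F"}],
]


def run_on_chunk(chunk):
    """Run the query fragment against one chunk and return a partial result."""
    return Counter(row["gender"] for row in chunk)


def scatter_gather(chunks):
    """Send the fragment to every chunk in parallel, then merge the partials."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        # pool.map yields partial results in chunk order as workers finish.
        for partial in pool.map(run_on_chunk, chunks):
            total += partial
    return dict(total)


if __name__ == "__main__":
    print(scatter_gather(CHUNKS))
```

The point of the sketch is that each chunk is scanned where it lives and only small partial results travel back to be merged, which is the property a parallel database scheduler tries to exploit.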
This dynamic pipeline is what makes Hawq perform 10X to 600X faster on SQL queries than something like Hive running atop HDFS. And these performance improvements are what turn Hadoop from a batch system into an interactive one.
Then, in a live demonstration, Klahr took a 60-node Hadoop cluster loaded with a retail establishment’s data, 1 billion rows in all, and sorted that customer information into two buckets, male and female. Using HDFS and the Hive data warehouse with its SQL-like HiveQL, this query took more than an hour. On the same cluster running Hawq, the sort took around 13 seconds on stage.
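The article does not show the query used in the demo, but a bucketing of customers by gender is the kind of thing a plain SQL `GROUP BY` expresses. The sketch below is a guess at its general shape, run against an in-memory SQLite database so it works anywhere; the `customers` table, the `gender` column, the sample rows, and the `bucket_by_gender` helper are all hypothetical, and SQLite here is only a stand-in for Hawq or Hive.

```python
import sqlite3

# Hypothetical schema and data: the real table and query are not in the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany(
    "INSERT INTO customers (gender) VALUES (?)",
    [("F",), ("M",), ("F",), ("M",), ("F",)],
)


def bucket_by_gender(connection):
    """Count customers in each gender bucket with a standard SQL GROUP BY."""
    query = (
        "SELECT gender, COUNT(*) "
        "FROM customers "
        "GROUP BY gender "
        "ORDER BY gender"
    )
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    print(bucket_by_gender(conn))
```

On five sample rows the answer is instant on any engine; the difference EMC is claiming only shows up at the billion-row scale of the demo.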
Here’s how the Hawq database services running atop Pivotal HD stack up to the HDFS-Hive combo on a variety of benchmarks on the same 60-node Hadoop cluster. Queries that might take hours or even more than a day in Hive can be done in minutes in Hawq:
How Hawq stacks up to Hive on a 60-node cluster doing various queries
Hive converts those HiveQL queries into MapReduce routines and runs them against data stored inside of HDFS, but the Project Impala database layer from commercial Hadoop distie Cloudera gets MapReduce out of the way and puts a database execution engine on each one of the Hadoop nodes. It then parallelizes the queries, much as Hawq is doing. But, as you can see, Greenplum knows a thing or two about parallel queries that Cloudera apparently has not (yet) learned:
Hawq outruns Cloudera’s Impala on SQL queries – at least when EMC runs the tests
“Something that may take you an hour in Impala may take you a minute in Hawq,” proclaims Klahr.
The other thing that EMC thinks it can do better than the HDFS-Impala combination is scale horizontally. Take a look at this comparison:
EMC says that its Hawq database for HDFS will scale better than Cloudera’s Impala