Spark and Hive Integration

There has been considerable excitement about Spark since it became an Apache top-level project. The reasons are clear: it can be up to a hundred times faster than Hadoop MapReduce for in-memory workloads, it runs on Hadoop 2's YARN cluster, it can read any existing Hadoop data, and it lets data workers rapidly iterate over data with machine learning and other data science techniques.

Hortonworks is heavily investing in Spark and by year-end plans to offer support for:

  1. Improved integration with Apache Hive
    Hortonworks is contributing to Spark to enable support for Hive 0.13 and, as the Hive community marches towards Hive 0.14, will contribute additional Hive innovations that Spark can leverage. This allows Spark SQL to use modern versions of Hive to access data for machine learning, modeling, and so on (a minimal sketch follows this list).
  2. Support for the ORC file format
    As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS, provides optimizations for both read performance and data compression, and is rapidly becoming the de facto storage format for Hive. Watch Jira issue SPARK-2883 for the state of native ORC support in Spark.
  3. Security
    The most common request we hear from enterprise customers is to provide authorization via integration with LDAP or Active Directory before granting access to the native Spark web user interface, and to ensure that Spark runs effectively on a secure Hadoop cluster.
  4. Operations
    Spark components and services can be managed by Ambari, so you can install, start, stop, and configure a Spark deployment for fine-tuning, all via the single interface used for every engine in your Hadoop cluster.
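
For example, once Spark is built with Hive support, Spark SQL can query Hive tables through a HiveContext. Here is a minimal sketch: the Users table, its columns, and the already-running SparkContext sc are assumptions for illustration, and the STORED AS ORC clause shows where the format from item 2 fits in:

// A sketch assuming a Spark build with Hive support and an existing SparkContext sc
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._  // brings the sql(...) method into scope

// Hypothetical table; STORED AS ORC stores it in the columnar format from item 2
sql("CREATE TABLE IF NOT EXISTS Users (userId INT, age DOUBLE, latitude DOUBLE, longitude DOUBLE) STORED AS ORC")
sql("SELECT COUNT(*) FROM Users").collect().foreach(println)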

The combination of running Hive both for analytical queries and for fetching the training dataset for logistic regression in Spark is mind-blowing. Spark uses the logistic regression algorithm implemented in the MLlib library. The algorithm applies the same operations repeatedly to the same dataset, so it benefits greatly from caching the input in RAM across iterations.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// sql comes from the HiveContext above (import hiveContext._)
val trainingDataTable = sql("SELECT e.action, u.age, u.latitude, u.longitude FROM Users u JOIN Events e ON u.userId = e.userId")

// Since sql returns an RDD of Rows, the query results can be used directly in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3))
  LabeledPoint(row.getDouble(0), features)
}
trainingData.cache() // the algorithm is iterative, so keep the input in RAM across iterations
val model = new LogisticRegressionWithSGD().run(trainingData)

In addition to Hive integration, you can run Hive itself on the Spark execution engine instead of Tez (HIVE-7292). You can then compare performance between Tez and Spark and choose the better engine for each query. Switching is only a matter of replacing set hive.execution.engine=tez; with set hive.execution.engine=spark; and vice versa, as shown below.
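
For instance, you can flip the engine between two runs of the same query from the Hive shell or beeline and compare the timings; the events table and the query in this sketch are hypothetical:

-- baseline run on Tez
set hive.execution.engine=tez;
SELECT action, COUNT(*) FROM events GROUP BY action;

-- the same query on the Spark engine
set hive.execution.engine=spark;
SELECT action, COUNT(*) FROM events GROUP BY action;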
