Apache Hive Optimizations

The following text presents a procedure I use to speed up analytical Hive queries.
Basically, I transfer a plain Hive table into a partitioned table stored in the ORC file format.

Partitions: Each table can have one or more partition keys, which determine how the data is stored. Each unique value of the partition keys defines a partition of the table. For example, all log data from "2014-09-16" with severity "Error" is one partition of the wlogs1 table. You can run a query only on the relevant partition of the table, thereby speeding up the analysis significantly.
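As an illustration of the idea (the table and column names here are simplified stand-ins, not the actual wlogs1 schema), a partitioned table might be declared and queried like this:

```sql
-- Hypothetical table: partition columns (date1, severity) are declared
-- separately from the data columns and become directories in HDFS.
CREATE TABLE logs_demo (logmessage STRING, machinename STRING)
PARTITIONED BY (date1 STRING, severity STRING);

-- Filtering on the partition keys lets Hive scan only the matching
-- partition's directory instead of the whole table.
SELECT count(*) FROM logs_demo
WHERE date1 = '2014-09-16' AND severity = 'Error';
```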

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
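Switching a table to ORC is just a matter of naming the storage format in its DDL. A minimal sketch (the table name and compression setting are illustrative, not taken from the article):

```sql
-- STORED AS ORC selects the columnar format; orc.compress picks the
-- codec (NONE, ZLIB, or SNAPPY).
CREATE TABLE wlogs_orc_demo (logmessage STRING)
PARTITIONED BY (date1 STRING, severity STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
```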



Apache Flume feeds data into the wlogs Hive table. After some time interval, I run transfer.bat, which consists of the following commands:

java -cp .;C:/hdp/hadoop/hadoop-;C:/hdp/hadoop/hadoop-* CleanLogNew store

call %HIVE_HOME%\bin\hive -f C:\hdp\script.hql

java -cp .;C:/hdp/hadoop/hadoop-;C:/hdp/hadoop/hadoop-* CleanLogNew delete

After that, I have the data in a Hive table named wlogs1 that is partitioned and stored in the ORC format.

The sources for script.hql and CleanLogNew.java are given below.

add jar C:/hdp/hadoop/hive-;
add jar C:/hdp/customlogs.jar;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;

INSERT INTO TABLE wlogs1 PARTITION (date1,severity)
select /*+ MAPJOIN(file) */ t1.*
from (select regexp_extract(INPUT__FILE__NAME,'[^/]+$',0) as filename, subsystem,
machinename, servername, threadid, userid, transactionid, throwable,
floor(tstamp/1000), messageid, logmessage, to_date(from_unixtime(floor(tstamp/1000))) as date1, severity
from wlogs where machinename is not null) t1
JOIN file ON (t1.filename=file.name);

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import java.net.URI;
import java.io.*;
import java.util.Calendar;
import java.text.SimpleDateFormat;

public class CleanLogNew {

   public static void main(String[] args) {
      SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
      Calendar cal1 = Calendar.getInstance();
      Configuration confHadoop = new Configuration();
      confHadoop.addResource(new Path("C:/hdp/hadoop/hadoop-"));
      try {
         FileSystem fs = FileSystem.get(URI.create("hdfs://win8blue:8020"), confHadoop);
         if (args.length == 0) return;
         if (args[0].equals("store")) {
            // Record which log files are complete before the Hive migration runs.
            FileStatus[] fstatus = fs.listStatus(new Path("/weblogic_logs"));
            fs.mkdirs(new Path("/for_migration"));
            FSDataOutputStream fos = fs.create(new Path("/for_migration/filenames"), true);
            PrintWriter out = new PrintWriter(fos);
            for (FileStatus fstat : fstatus) {
               // Skip directories and files Flume is still writing (.tmp).
               if (fstat.isDir() || fstat.getPath().getName().endsWith(".tmp")) continue;
               out.println(fstat.getPath().getName());
            }
            out.close();
         } else if (args[0].equals("delete")) {
            // Remove the log files that script.hql has already migrated.
            Path filenamesPath = new Path("/for_migration/filenames");
            if (!fs.exists(filenamesPath)) return;
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(filenamesPath)));
            String name;
            while ((name = in.readLine()) != null) {
               Path logFile = new Path("/weblogic_logs/" + name);
               if (fs.exists(logFile)) fs.delete(logFile, false);
            }
            in.close();
         }
      } catch (Exception ex) {
         ex.printStackTrace();
      }
   }
}

This entry was posted in Uncategorized. Bookmark the permalink.

