
Hive – A SQL-like Wrapper over Hadoop

This is a summary and review of the Hive paper [1].

1. Motivation

For companies like Facebook, the volume of data collected and analyzed for business intelligence (BI) is growing rapidly, and traditional data warehouse solutions become prohibitively expensive at this scale. Hadoop [2] is a popular open-source map-reduce [4] implementation that is widely used to store and process extremely large data sets on commodity hardware. However, the problem does not end there: the map-reduce programming model is very low-level and requires developers to write custom programs that are hard to maintain and reuse. In other words, doing business intelligence directly on top of Hadoop is hard. This is the background that motivates a new solution: Hive [1][3].

2. What is Hive

Hive is an open-source data warehousing solution built on top of Hadoop. On the front end, Hive supports a SQL-like query language called HiveQL; on the back end, HiveQL is compiled into map-reduce jobs and executed on Hadoop. In addition, HiveQL allows users to plug custom map-reduce scripts into queries. As in SQL, HiveQL supports tables containing primitive types (numbers, booleans, strings, etc.), collections (arrays, maps, etc.) and nested compositions of these.
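
To make this concrete, here is a minimal sketch (the table, its columns and the script name are hypothetical, not taken from the paper) of a HiveQL table mixing primitive and collection types, and of a query that plugs a custom script into the pipeline:

CREATE TABLE user_actions (
  userid BIGINT,                       -- primitive type
  action STRING,                       -- primitive type
  props MAP<STRING, STRING>,           -- collection type
  tags ARRAY<STRING>,                  -- collection type
  device STRUCT<os:STRING, ver:INT>    -- nested composition
);

-- Stream rows through a user-supplied script (my_reducer.py is hypothetical).
FROM user_actions a
SELECT TRANSFORM(a.userid, a.action)
USING 'python my_reducer.py'
AS (userid, action_count);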

3. Architecture

The figure below shows the system architecture of Hive. Hive is built on top of Hadoop and includes a Metastore that holds table schemas and statistics, which are useful for data exploration and query optimization.

[Figure: Hive system architecture]
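
Since the Metastore keeps table schemas and partition metadata, basic data exploration can be done directly from HiveQL. A minimal sketch, using the status_updates table from the example in Section 4:

SHOW TABLES;                     -- tables registered in the Metastore
DESCRIBE status_updates;         -- column names and types
SHOW PARTITIONS status_updates;  -- partitions such as ds='2009-03-20'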

4. HiveQL

Basically, HiveQL comprises a subset of SQL plus some useful extensions. While HiveQL has great advantages in manipulating huge data sets, some limitations emerge for various reasons:

  • Lack of inequality join operators. Join predicates only support the equality operator, so say bye-bye to ‘<’ and ‘>’.
  • Lack of “insert into”. You cannot insert into an existing table or data partition; only “insert overwrite” is supported, and an insert always overwrites the existing data in the whole table or partition, so be careful here (see the sketch after this list).
  • Lack of “update” and “delete”. As the paper argues, the marginal gain of “update” and “delete” would be offset by the added complexity of handling reader and writer concurrency, and I agree on this point.
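
To illustrate the “insert overwrite” semantics mentioned above, here is a minimal sketch (the daily_counts table is hypothetical; status_updates comes from the paper’s example). Since there is no “insert into”, new data goes into its own partition, and re-running the statement on an existing partition replaces that partition’s contents rather than appending to them:

INSERT OVERWRITE TABLE daily_counts PARTITION (ds='2009-03-21')
SELECT a.userid, COUNT(1)
FROM status_updates a
WHERE a.ds = '2009-03-21'
GROUP BY a.userid;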

Here is an example of HiveQL query:

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school

This query contains a single join followed by two different aggregations. Writing the query as a multi-table insert ensures that the join is performed only once. The query plan, which consists of 3 map-reduce jobs, is shown in the following figure:

[Figure: query plan with 3 map-reduce jobs]

5. Comments

Hive provides a way to perform business intelligence over huge data sets on top of the mature Hadoop map-reduce platform. The SQL-like HiveQL flattens the learning curve compared with writing low-level map-reduce programs. Thinking about the constraints, I can list the following:

  • While Hive brings convenience through a high-level SQL-like language, this comes at the cost of generality and expressive power. Imagine a task T that can be written as a map-reduce program but is hard or impossible to express in HiveQL.
  • I agree with the removal of “delete” and “update”, since supporting them would degrade performance; there are other convenient ways to update and delete data, so HiveQL can do without them. But I am less comfortable with the lack of inequality join operators, which are useful in many BI analyses.
  • Hive is not the end of BI solutions on Hadoop. It is definitely a huge step in pushing map-reduce platforms towards BI, but many advanced BI techniques, such as clustering, classification and prediction, still have a long way to go when facing huge data.

References

[1] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive – a petabyte scale data warehouse using Hadoop. ICDE 2010:996-1005

[2] Apache Hadoop. Available at http://wiki.apache.org/hadoop

[3] Hive wiki. Available at http://www.apache.org/hadoop/hive

[4] Hadoop Map-Reduce Tutorial. Available at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html