KDnuggets Home » News » 2012 » Mar » Software » Comparison of Hadoop Frameworks  ( < Prev | 12:n08 | Next > )

Comparison of Hadoop Frameworks


 
  
Open Source expert compares 6 Hadoop Frameworks: Hive, Pig, Scalding, Scoobi, Scrunch and Spark.


Date:

Hadoop Sami Badawi Blog, Mar 26, 2012

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. Several high-level Hadoop frameworks make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog

Pig
Created by Yahoo! Language: Pig Latin.
pig.apache.org/

Pig is a data flow language / ETL system. It works at a much higher level than direct Hadoop in Java. You are working with named tuples. ...
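To make "data flow over named tuples" concrete, here is a minimal sketch in plain Python, not Pig Latin. It mimics a load, filter, group and count pipeline over log lines. The record layout and the names `LogEntry`, `load` and `errors_per_host` are assumptions made up for this illustration.

```python
from collections import Counter, namedtuple

# Hypothetical log record layout; the field names are assumptions.
LogEntry = namedtuple("LogEntry", ["host", "status", "size"])


def load(lines):
    """Parse space-separated lines into named tuples (the load step)."""
    for line in lines:
        host, status, size = line.split()
        yield LogEntry(host, int(status), int(size))


def errors_per_host(lines):
    """Filter on status, then group by host and count."""
    entries = load(lines)
    failed = (e for e in entries if e.status >= 500)
    return Counter(e.host for e in failed)


sample = [
    "web1 200 512",
    "web1 500 0",
    "web2 503 0",
    "web1 502 0",
]
result = errors_per_host(sample)
```

Each step refers to fields by name (`e.status`, `e.host`) rather than by position, which is the convenience named tuples give you in a data flow system.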

Scalding
Created by Twitter. Language: Scala.
https://github.com/twitter/scalding

What sets Scalding apart from other Scala based frameworks is that you work with tuples with named fields. ...

Hive
Created by Facebook. Language: SQL.
hive.apache.org/

Hive works on tables made of named tuples with types. It does not check types at write time: you just copy files into the directory that represents a table. This makes writing to Hive very fast, but types are checked at read time instead. ...
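The write-fast, check-on-read behaviour can be sketched in plain Python. This is not Hive code. The `SCHEMA` list, the tab-separated file layout, and the choice to turn an unparseable value into `None` are all assumptions of this illustration.

```python
import os
import tempfile

# Hypothetical table schema: column name and the Python type applied at read time.
SCHEMA = [("host", str), ("status", int)]


def write_rows(table_dir, filename, raw_lines):
    """'Write' by dropping a file into the table directory; nothing is validated."""
    with open(os.path.join(table_dir, filename), "w") as f:
        f.write("\n".join(raw_lines) + "\n")


def read_rows(table_dir):
    """Apply the schema only when reading; unparseable values become None."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                row = {}
                for (column, cast), value in zip(SCHEMA, fields):
                    try:
                        row[column] = cast(value)
                    except ValueError:
                        row[column] = None
                rows.append(row)
    return rows


with tempfile.TemporaryDirectory() as table_dir:
    # The second line has a bad status, but the write still succeeds.
    write_rows(table_dir, "part-0.txt", ["web1\t200", "web2\toops"])
    rows = read_rows(table_dir)
```

The write path never looks at the data, so it is as cheap as a file copy. The cost is that a malformed record only surfaces when someone reads the table.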

...

Conclusion

I liked all of the Hadoop frameworks I tried, but there is a learning curve and I found problems with all of them.

Extract Transform Load
For ETL, Hive and Pig are my top picks. They are easy to use, well supported, and part of the Hadoop ecosystem. In both, it is simple to integrate prebuilt MapReduce classes into a data flow, and it is trivial to join data sources. This is hard to do in plain Hadoop.
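For a sense of what a join involves when no framework does it for you, here is a minimal in-memory hash join in plain Python. In Hive or Pig the same operation is a single JOIN statement. The function `inner_join` and the sample records are made up for this sketch, and a real MapReduce join also has to partition and shuffle both inputs by key across machines, which is not shown here.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Hash join: index the right side by key, then probe it with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        # Rows without a match on the right side are dropped (inner join).
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data sources: request log entries and a host inventory.
requests = [
    {"host": "web1", "status": 500},
    {"host": "web3", "status": 200},
]
hosts = [
    {"host": "web1", "dc": "east"},
    {"host": "web2", "dc": "west"},
]
joined = inner_join(requests, hosts, "host")
```

Only `web1` appears in both inputs, so the result holds one combined record.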

Cascalog is a serious contender for ETL if you like Lisp / Clojure.

Hive vs. Pig
I prefer Hive. It is based on SQL, so you can use your database intuition, and you can access it through JDBC.

Scala based Hadoop frameworks
They all made Hadoop programming look remarkably close to normal Scala programming.

For programming Hadoop, Scalding is my top pick since I like the named fields.

Both Scrunch and Scoobi are simple and powerful Scala-based Hadoop frameworks. They require Cloudera's Hadoop distribution, which is very popular.



 
