Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs. Peter Wayner tests different Hadoop offerings and reports.

ComputerWorld, By Peter Wayner, January 18, 2012

Hadoop InfoWorld - It's been a big year for Apache Hadoop, the open source project that helps you split your workload among a rack of computers. The buzzword is now well known to your boss but still just a vague and hazy concept for your boss's boss. That puts it in the sweet spot when there's plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won't be long before "Hadoop cluster" takes over the role that the words "crazy supercomputer" used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.

The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There's HDFS, a standard file system that can organize the data spread out around the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of routines for trying to learn something from said data; and ZooKeeper, a tool for keeping all of the balls in the air. At least a half-dozen or more other open source tools live in a stable orbit around Hadoop.

To get a feel for the excitement, I took four major collections out for a test-drive. ... Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.

Amazon Elastic MapReduce
It should be no surprise that Amazon, one of the pioneers of cloud computing, offers a mechanism for spinning up Hadoop clusters on its EC2 cloud. Elastic MapReduce is tightly integrated with all of Amazon's other elastic offerings, and it sits as another tab on the Amazon Web Services main page. You store your data in S3, then fire up a job to churn through it.

...

Cloudera CDH, Manager, and Enterprise
Cloudera is a startup that has collected Hadoop experts from all of the major companies using Hadoop. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with the names of people who learned Hadoop by building it.

...

IBM InfoSphere BigInsights
IBM bundles Hadoop into something it calls InfoSphere BigInsights. The word "Hadoop" is on the main page, but the advertising copy clearly suggests that this is a product to help people who want "deep insights" into "big data." It's a tool for data analysis that just happens to use Hadoop for all of the structure.

There are two tiers: basic and enterprise. The basic edition is available completely for free, but you can buy support if you like. The enterprise edition, available through a commercial license, includes a number of extra features like BigSheets, a spreadsheetlike tool for drilling down into the data sitting in the cluster.

...

MapR M3 and M5
Whereas Cloudera is run by folks who come from Hadoop strongholds such as Yahoo, MapR's corporate team is filled with people who hail from Google, EMC, Microsoft, and Cisco, companies with plenty of experience with big data sets, even if they're not steeped in Hadoop's traditional way of working with them.

The new talent is also bringing more sophistication to the stack. The MapR distribution of Hadoop includes a better version of the file system with snapshots, mirroring, and direct NFS access if you need it. MapR also offers a more resilient architecture that won't go down if the central controller locks up. MapR calls all of this "high availability" and charges for it.

Read more.