Hadoop: Elephants in the Cloud

Hadoop in the Cloud became a trending topic in 2013, with many new product and project announcements. This guest post explores 6 reasons why customers are seeing increased value in this model.

Guest blog by Hemanth Yamijala, Jan 7, 2014

In 2013, it became evident that Hadoop in the Cloud is a trending topic. There were many new product and project announcements related to running Hadoop in cloud environments. This post explores six of the reasons why customers are seeing increased value in this model.

1. Lowering the Cost of Innovation

Running Hadoop on the cloud makes sense for the same reasons as running any other software offering on the cloud. For companies still testing the waters with Hadoop, the low capacity investment in the cloud is a no-brainer. The cloud also makes sense for a quick, one-time use case involving big data computation. As early as 2007, the New York Times used Amazon EC2 instances and Hadoop for just one day to do a one-time conversion of TIFF documents to PDFs in a digitization effort. Procuring scalable compute resources on demand is appealing as well.

2. Procuring Large-Scale Resources Quickly

The point above about quick resource procurement needs some elaboration. Hadoop and the platforms that inspired it made the vision of linearly scalable storage and compute on commodity hardware a reality. Internet giants like Google, which always operated at web scale, knew that they would need to run on more and more hardware resources. They invested in building this hardware themselves.

In the enterprise, though, this was not necessarily an option. As the demand for analytics within enterprises grew, so did the need to expand the capacity of their Hadoop clusters. The data platform teams started hitting a bottleneck of a different kind. While the software itself had proven that it could scale linearly, IT policies meant that new hardware took anywhere from several weeks to several months to materialize in the cluster, stifling innovation and growth. In the cloud, by contrast, the same capacity can be procured on demand.

3. Handling Batch Workloads Efficiently

A fixed-capacity Hadoop cluster built on physical machines is always on, whether it is used or not, consuming power and leased space and incurring cost. Batch workloads need that capacity only while jobs are running, so the cloud, with its pay-as-you-use model, handles them more efficiently. Given predictable usage patterns, one can optimize even further by having clusters of suitable sizes available at the right time for jobs to run. Companies can schedule cloud-based clusters to be available only during the part of the day when the data needs to be crunched.
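To make the pay-as-you-use argument concrete, here is a minimal Python sketch that compares an always-on cluster with one that is scheduled for only a few hours a day. The node count, hourly rate, and daily job window are hypothetical numbers chosen for illustration, not real prices. The model also deliberately ignores storage, data transfer, and the different cost structure of owned hardware.

```python
def always_on_cost(nodes, hourly_rate, days=30):
    """Cost of a cluster that runs 24 hours a day, used or not."""
    return nodes * hourly_rate * 24 * days


def scheduled_cost(nodes, hourly_rate, hours_per_day, days=30):
    """Cost of a cluster that exists only while the batch jobs run."""
    return nodes * hourly_rate * hours_per_day * days


if __name__ == "__main__":
    nodes = 20          # hypothetical cluster size
    hourly_rate = 0.50  # hypothetical price per node-hour
    batch_window = 4    # hypothetical hours of batch processing per day

    fixed = always_on_cost(nodes, hourly_rate)
    on_demand = scheduled_cost(nodes, hourly_rate, batch_window)
    savings = 1 - on_demand / fixed

    print(f"always on: {fixed:.2f}")
    print(f"scheduled: {on_demand:.2f}")
    print(f"savings:   {savings:.0%}")
```

Under these assumed numbers the scheduled cluster costs one sixth of the always-on one, simply because it is billed for 4 of every 24 hours. The real ratio depends on how long the jobs actually run.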

4. Handling Variable Resource Requirements

Not all Hadoop jobs are created equal. Some require more compute resources, some require more memory, and others require a lot of I/O bandwidth. Cloud solutions already let the end user provision clusters with different types of machines for different types of workloads. Intuitively, this seems like a much easier solution to the problem of handling variable resource requirements. For example, with Amazon Elastic MapReduce, you can launch a cluster with m2.xlarge machines if your Hadoop jobs require more memory, and with c1.xlarge machines if your Hadoop jobs are compute intensive.
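The choice described above can be sketched as a small lookup. The family prefixes m2 and c1 come from the Elastic MapReduce example. The JobProfile type, its relative demand scores, the m1 fallback, and the selection rule are assumptions made purely for illustration, not an Amazon recommendation or API.

```python
from dataclasses import dataclass

# Family prefixes from the Elastic MapReduce example in the text.
HIGH_MEMORY_FAMILY = "m2"
HIGH_CPU_FAMILY = "c1"
# Assumed general-purpose fallback for jobs with no clear bottleneck.
DEFAULT_FAMILY = "m1"


@dataclass
class JobProfile:
    """Hypothetical description of a Hadoop job's resource needs."""
    name: str
    cpu_demand: float     # relative score, arbitrary units
    memory_demand: float  # relative score, arbitrary units


def pick_family(job: JobProfile) -> str:
    """Pick an instance family based on the job's dominant demand."""
    if job.memory_demand > job.cpu_demand:
        return HIGH_MEMORY_FAMILY
    if job.cpu_demand > job.memory_demand:
        return HIGH_CPU_FAMILY
    return DEFAULT_FAMILY


if __name__ == "__main__":
    jobs = [
        JobProfile("large in-memory join", cpu_demand=1.0, memory_demand=4.0),
        JobProfile("image transcoding", cpu_demand=5.0, memory_demand=1.0),
        JobProfile("log filtering", cpu_demand=1.0, memory_demand=1.0),
    ]
    for job in jobs:
        print(f"{job.name}: {pick_family(job)}")
```

A real decision would also weigh I/O bandwidth, price, and data volume. The point is only that in the cloud this choice can be made per cluster, rather than once for all jobs.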

5. Running Closer to the Data

As businesses move their services to the cloud, their data starts living in the cloud as well. Analytics thrives on data, typically large volumes of it, so it makes little sense for analytical platforms to sit outside the cloud. Doing so forces an inefficient, time-consuming migration of data from its source to the analytics clusters.

Running Hadoop clusters in the same cloud environment is an obvious solution to this problem. This is, in a way, applying Hadoop’s principle of data locality at the macro level.

6. Simplifying Hadoop Operations

As cluster consolidation happens in the enterprise, one thing that gets lost is the isolation of resources for different sets of users. As all user jobs get bunched up in a shared cluster, administrators of the cluster start to deal with multi-tenancy issues such as user jobs interfering with one another and varied security constraints.

The typical solution to this problem has been to enforce very restrictive cluster-level policies or limits that prevent users from doing anything harmful to other users' jobs. The problem with this approach is that legitimate use cases also go unserved. For instance, it is common for administrators to lock down the amount of memory Hadoop tasks can run with. If a user genuinely requires more memory, he or she has no support from the system.

Using the cloud, one can provision different types of clusters with different characteristics and configurations, each suitable for a particular set of jobs. This frees administrators from having to manage complicated policies for a multi-tenant environment, and it lets users pick the right configuration for their jobs.

In 2013, the Hadoop community released Hadoop 2.0, which includes a new resource management framework called Apache YARN (http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html). Using YARN, it is now possible to run a larger variety of workloads, such as real-time stream processing, in addition to traditional MapReduce workloads, making Hadoop more relevant for different types of data processing.

Amazon's Elastic MapReduce offers Hadoop 2.0 as an option for its users (http://aws.typepad.com/aws/2013/10/elastic-mapreduce-updates.html). Such a cloud offering allows enterprises to evaluate new versions of Hadoop without disrupting their existing, stable infrastructure. Hadoop 2.0 clusters could be launched alongside existing Hadoop 1.0 clusters, processing the same data stored in cloud storage like Amazon's S3, with minimal capacity investment.

If the new version proves suitable, clusters can be switched to it in a phased manner, easing migration. Thus, the cloud helps enterprises stay current without too much disruption to business continuity.

Bio: Hemanth Yamijala is a software developer, solutions architect and consultant on Big Data projects at ThoughtWorks, a global technology company. His primary area of interest is in building large-scale distributed systems to cater to the growing data needs of organizations.