YARN is All the Rage at Hadoop Summit 2014

Apache YARN, which enables much broader types of computation than MapReduce, is quickly becoming an integral part of Hadoop projects. We review best-practice considerations for running a YARN cluster.

By Daniel D. Gutierrez, June 2014

I’m back from Hadoop Summit 2014 and experiencing a significant amount of Hadoop overload. But in spite of the exhaustion, this annual conference in Silicon Valley has turned into one of my favorites. In a single venue, over a modest three days, I’m able to get a good read on where the big data industry is headed – very valuable for my position as a practicing data scientist.

One of the distinct messages I gleaned from the show was how much the leading Hadoop distributions are banking on YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling user applications onto them. YARN is a sub-project of Hadoop at the Apache Software Foundation, first introduced with Hadoop 2.0 last year, that separates the resource-management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. YARN hosts different types of applications, such as Apache Tez for interactive/batch workloads and Apache Storm for stream processing.

Many in the Hadoop ecosystem tout YARN as the next-generation compute framework for Apache Hadoop, and for good reason: it provides a generic data-processing engine beyond MapReduce and turns the Hadoop compute layer into a common resource-management platform that can host a multitude of applications. YARN provides a flexible resource-sharing model that makes it attractive for many services to co-exist on a single cluster without worrying about resource management, isolation, or multi-tenancy issues.

Making optimal use of a YARN cluster requires addressing a number of best-practices considerations:

  • The administrators managing a YARN cluster must determine how to go from configuring Map/Reduce slots to configuring resources and containers.
  • An administrator can configure a YARN cluster to optimally use resources depending on the kind of hardware and types of applications being run.
  • Operations teams now have to deal with a new range of metrics when managing YARN clusters.
  • Operations teams must focus on managing a cluster shared across numerous users, managing queues, and performing capacity allocation across different business units.
  • The YARN application developer has to understand how to write efficient applications that make the best use of YARN, including handling security and failures.
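The first point above – moving from fixed map/reduce slots to a resource-and-container model – can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only (it is not YARN's scheduler code, and the node sizes and container requests are invented numbers); it shows how a node's container capacity falls out of its memory and virtual-core budget, with memory requests rounded up to the scheduler's minimum allocation as YARN does:

```python
# Illustrative sketch (not YARN's actual code): estimate how many
# containers a node can host under YARN's resource model of
# memory + virtual cores, instead of fixed map/reduce slots.

def max_containers(node_mem_mb, node_vcores,
                   container_mem_mb, container_vcores,
                   min_alloc_mb=1024):
    """Round the memory request up to a multiple of the scheduler's
    minimum allocation, then take the tighter of the memory and
    vcore constraints."""
    granted_mb = -(-container_mem_mb // min_alloc_mb) * min_alloc_mb
    by_memory = node_mem_mb // granted_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)

# A hypothetical 48 GB / 12-vcore worker running 2 GB, 1-vcore containers:
print(max_containers(48 * 1024, 12, 2048, 1))  # -> 12 (vcores bind first)

# An 8 GB / 8-vcore worker with the same containers is memory-bound:
print(max_containers(8 * 1024, 8, 2048, 1))    # -> 4
```

The same arithmetic explains why an administrator tunes container sizes to the hardware: whichever resource runs out first determines cluster utilization.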

So far, YARN is faring well in deployments large and small, and stands apart from the batch-processing-only world of Hadoop 1.0. YARN does, however, have a single point of failure in the form of its master – the ResourceManager (RM). The RM keeps track of all the slaves, schedules work across the cluster, and handles all client interactions. Unanticipated events like node crashes, and planned events like upgrades, can reduce the availability of this central service and of YARN itself. At the Summit, Cloudera and Hortonworks teamed up to describe their recent work on the Highly Available Resource Manager (HARM) in YARN.

YARN and Spark Team Up

Another important presence at the Summit was Apache Spark, designed to let users build unified data-analytics pipelines that combine diverse processing types. Databricks demoed Spark by building a machine-learning pipeline with three stages: consuming JSON data from Hive, training a k-means clustering model, and applying the model to a live stream of Tweets – all to classify raw Tweets in real time. Typically this kind of pipeline might require a separate processing framework for each stage, but the demo showed how to leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming, performing all of the processing in a single, small program. This arrangement allows code and memory to be reused between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively on YARN.
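The core idea of that demo – train a model once, then reuse the same in-memory model object to classify a live stream from within one program – can be sketched without a cluster. The toy below is plain Python rather than the Spark/Shark/MLlib/Spark Streaming stack the demo actually used, and the points are invented; it only illustrates the batch-then-streaming reuse pattern:

```python
# Toy sketch of the unified-pipeline idea: fit k-means centroids on
# "historical" data, then reuse the same model to label a "stream".
# (Plain-Python stand-in for the Spark-based Databricks demo.)

def nearest_centroid(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def kmeans_step(points, centroids):
    """One Lloyd's-algorithm update: reassign points, recompute means."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest_centroid(p, centroids)].append(p)
    return [
        [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
        for i, c in enumerate(clusters)
    ]

# "Batch" stage: fit centroids on historical points.
history = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(5):
    centroids = kmeans_step(history, centroids)

# "Streaming" stage: the same in-memory model labels incoming points.
for point in [(0.1, 0.3), (4.8, 5.2)]:
    print(point, "->", nearest_centroid(point, centroids))
```

In the real demo, Spark's advantage is that both stages share one runtime and one cached dataset, so nothing has to be serialized between separate batch and streaming systems.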

Yahoo gave a great presentation about how they’ve been working with open source communities to bring MapReduce and additional applications onto YARN. They’re working to empower Spark applications via so-called Spark-on-YARN, which enables Spark clusters and applications to be deployed on existing Hadoop hardware without creating a separate cluster. Spark applications can then directly access Hadoop datasets in HDFS. In Spark-on-YARN, applications are launched either in standalone (cluster) mode, which executes the Spark master in a YARN container, or in client mode, which executes the Spark master within the user’s launcher environment. Spark-on-YARN has been enhanced to support authentication, secure HDFS access, the Hadoop distributed cache, and linking the YARN UI to the Spark UI.
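For readers who haven’t tried it, the two launch modes correspond to different `--master` settings in a Spark 1.x `spark-submit` invocation. The class name, jar, and resource numbers below are illustrative placeholders, not from the talk:

```shell
# Cluster ("standalone") mode: the Spark master runs inside a YARN container.
spark-submit --master yarn-cluster \
  --num-executors 4 --executor-memory 2g --executor-cores 2 \
  --class com.example.MyApp my-app.jar

# Client mode: the Spark master runs in the user's local launcher process,
# which is convenient for interactive use since output comes back to the shell.
spark-submit --master yarn-client \
  --num-executors 4 --executor-memory 2g --executor-cores 2 \
  --class com.example.MyApp my-app.jar
```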

The Future of YARN

From what I can tell, the future of YARN is destined to be exciting – its features are making YARN a first-class resource-management platform for enterprise Hadoop. There’s a long list of future promises for YARN, including: rolling upgrades with no or minimal service interruption; high availability; support for long-running services, with applications like Apache HBase and Apache Storm running natively on YARN without any changes; fine-grained isolation for multi-tenancy; powerful scheduling features like application priorities, preemption, and application-level SLAs; and usability tools like an application history server, client-submission web services, better queue management, and developer tools for easier application authoring.


Apache Hadoop YARN brings us a step closer to realizing the vision of Hadoop as a single grid running all data-processing applications. This year’s Hadoop Summit set the stage for the evolution of YARN as an important enabler of the Hadoop platform’s continued acceptance in the enterprise.

Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics. He is also a well-recognized Big Data journalist.