Big Data Use Case: ZooKeeper at Rubicon Project
What is the big idea with ZooKeeper? A summary of an excellent Big Data use case that uses Apache ZooKeeper in a Hadoop implementation.
By Daniel D. Gutierrez, May 27, 2014
This past week I was hot on the Meetup circuit, attending the latest edition of the LA Big Data Users Group, where I am a co-organizer. The feature for the evening was “What is the big idea with ZooKeeper,” presented by Jan Gelin of Rubicon Project. As Chief System Architect, Jan leads the company’s architecture group, which is responsible for developing and inventing scalable, critical building blocks and frameworks for the business software applications. Gelin’s talk went into considerable detail on how Rubicon deploys ZooKeeper for its Hadoop implementation. He even ran a three-node ZooKeeper ensemble on his laptop to demonstrate how sessions interact. Plus, he bravely wrote some Java code on the fly to show how a ZooKeeper Watcher operates. A video of the complete presentation is available.
Apache ZooKeeper Ties it All Together
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications, including Hadoop deployments. In any distributed cluster, it is important that all nodes be able to share configuration and state data in a reliable way. Hadoop relies on ZooKeeper to keep each of its distributed processes, including MapReduce and HBase, consistent across the cluster. ZooKeeper nodes store a shared hierarchical namespace of data registers in RAM, allowing clients to access it with high throughput and low latency. Hadoop clusters should be provisioned with an odd number of ZooKeeper nodes, typically either 3 or 5, to provide high availability and maintain a quorum.
ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
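To make the hierarchical namespace and the watch mechanism a little more concrete, here is a minimal in-memory sketch in Python. It is not the ZooKeeper client API and not the code Gelin wrote during the talk; the class `ToyZnodeTree` and its methods are hypothetical names invented for illustration. It assumes the classic ZooKeeper behavior in which a watch is a one-time trigger: it fires on the next change and must be set again to see later changes.

```python
class ToyZnodeTree:
    """In-memory stand-in for a hierarchical namespace (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        self._data = {"/": b""}   # path -> bytes stored at that node
        self._watches = {}        # path -> callbacks waiting for the next change

    def create(self, path, data=b""):
        # A node can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent {parent} does not exist")
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = data

    def get(self, path, watch=None):
        data = self._data[path]
        if watch is not None:
            # Register a one-shot callback for the next change to this path.
            self._watches.setdefault(path, []).append(watch)
        return data

    def set(self, path, data):
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data
        # Remove the watches before firing them, so each fires at most once.
        for callback in self._watches.pop(path, []):
            callback(path)

    def children(self, path):
        # Direct children only: names under the prefix that contain no further slash.
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._data
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ToyZnodeTree()
tree.create("/config")
tree.create("/config/db_url", b"host-a")

events = []
tree.get("/config/db_url", watch=events.append)  # read and set a watch
tree.set("/config/db_url", b"host-b")            # the watch fires once
tree.set("/config/db_url", b"host-c")            # no watch is registered any more
print(events)
```

A real client would talk to the ensemble over the network and deal with sessions, versions, and connection loss; the sketch only shows the path-like tree and the fire-once notification pattern.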
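The odd-number guidance follows from majority-quorum arithmetic: an ensemble stays available only while more than half of its members are reachable. The short sketch below works through that arithmetic; the function names are illustrative and not part of any ZooKeeper library.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 6):
    print(f"ensemble={n} quorum={quorum_size(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this rule, going from 3 to 4 servers (or from 5 to 6) raises the quorum size without allowing any additional failures, which is why ensembles are usually sized at 3 or 5.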
Rubicon Project is Truly Big Data
Rubicon Project (which recently went public, NYSE: RUBI) is a leading technology company that has developed software for automating the buying and selling of online advertising. The company processes a huge number of small-packet transactions in support of the advertising industry, but it is neither an ad company nor a publisher of ads. Rubicon is a MapR shop, so the talk was skewed toward that specific Hadoop distribution. The company performs over 90 billion real-time auctions per day on its global transaction platform, which translates to about 3.5 PB of data that needs to be managed and analyzed in a MapR cluster. When Gelin started working at Rubicon in 2010, the company had about 40 servers; today it has close to 3,000.
Rubicon Project utilizes a number of distributed computer systems that run Apache ZooKeeper as a core component of their infrastructure. Gelin’s presentation served as an introduction to how ZooKeeper works and offered tips to consider when deploying ZooKeeper in a distributed system such as Hadoop. In addition, he went through the internal components and some of the pain points.
Gelin started off by making some recommendations about the architecture for a 100-node Hadoop cluster based on Rubicon’s requirements, such as around 60 TB per day in the incoming stream. Low latency is a big requirement for bidders in the auction process. They use 3 ZooKeeper nodes to handle consistency.
For Rubicon Project, compatibility with Apache Hadoop was critical, as they had been using Hadoop for several years. However, in order to support their growing advertising platform, the company needed to move to a fault-tolerant, mission-critical Hadoop production system. They chose MapR because of its enterprise-grade features. MapR provides automated stateful failover of critical services from multiple failures, along with automated recovery of the job tracker, preventing service disruption from runaway jobs. This allows Rubicon Project to run Hadoop along with the rest of their enterprise infrastructure in a lights-out data center.
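For a sense of scale, the 60 TB per day figure can be turned into average rates with some back-of-the-envelope arithmetic. The sketch below is not from the talk; it assumes decimal units (1 TB = 10^12 bytes), a perfectly even spread across all 100 nodes, and no replication overhead, none of which the presentation summary confirms.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in the article; everything else is an assumption for illustration.
incoming_bytes_per_day = 60 * 10**12   # about 60 TB/day, decimal terabytes assumed
node_count = 100                       # size of the Hadoop cluster discussed

# Average rate for the whole cluster, in megabytes per second.
cluster_mb_per_s = incoming_bytes_per_day / SECONDS_PER_DAY / 10**6

# Average per-node figures, assuming a perfectly even spread and no replication.
node_mb_per_s = cluster_mb_per_s / node_count
node_gb_per_day = incoming_bytes_per_day / node_count / 10**9

print(f"cluster average: {cluster_mb_per_s:.1f} MB/s")
print(f"per node average: {node_mb_per_s:.2f} MB/s, {node_gb_per_day:.0f} GB/day")
```

These are daily averages only; peak rates during busy auction periods would be higher, and storage replication would multiply the bytes actually written.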
The Big Data Meetup ecosystem is an excellent way to enjoy technical talks by leading industry practitioners speaking from field-tested experience with robust technology infrastructures. It was enlightening to see such a detailed, behind-the-scenes perspective of Rubicon Project’s Hadoop architecture, which depends heavily on ZooKeeper.
Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics. He’s been involved with data science and Big Data since long before it came into vogue. He is also a recognized Big Data journalist and is working on a new machine-learning book due out later this year.