Interview: Ranjan Sinha, eBay on Advanced Hadoop Cluster Management through Predictive Modeling

We discuss the categorization of e-commerce analytics, the opportunities and challenges of Big Data, the Astro predictive model for Hadoop cluster management, and Apache Kylin.

Twitter Handle: @hey_anmol

Ranjan Sinha is head of data science engineering & technology at eBay Inc., where he leads projects on customer analytics and personalization. Earlier, as lead data scientist, he led multiple business-impacting projects in recommendations and personalization that have significantly enhanced consumers’ shopping experiences.

Prior to eBay, Dr. Sinha was a research academic at the University of Melbourne and holds a PhD in computer science from RMIT University, Australia. He has published over 30 works, including in top-tier venues such as IEEE Big Data, VLDB, and ACM SIGMOD.

He won the Sort Benchmark in both the JouleSort and PennySort categories and was among the Wall Street Journal’s top 12 Asia-Pacific young inventors. He regularly speaks on data science and big data technologies, and co-organizes the popular Bay Area Search Meetup.

Here is my interview with him:

Anmol Rajpurohit: Q1. What are the benefits of categorizing e-commerce analytics into the Online, Nearline, and Offline categories? Can you provide a use case for each of them?

Ranjan Sinha: The three levels of categorization are in reference to personalization platforms. I believe it is necessary to distinguish between preferences based on a customer's long-term behavior and active intents based on their recent behavior. For example, a customer may have a long-standing preference for collecting Native American artifacts but is currently looking to purchase (active intent) camping gear for the upcoming long weekend. It is imperative that we differentiate between our customers' recent intents and long-term preferences in order to provide relevant inventory.

Online analytics considers events occurring within the current session, while the user is engaging with the site, and rapidly derives insights from them so that action can be taken within that same session. Nearline analytics considers events that have occurred in the recent past and reside in the platform's online data repository. Offline analytics considers the vast majority of events, stored in an offline data repository such as Hadoop or Teradata; these typically include the nearline events plus history spanning several months or years.
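The three tiers can be pictured as a simple routing decision on event recency. The sketch below is illustrative only: the session test and the 24-hour nearline window are assumptions, not details of eBay's actual platform.

```python
from datetime import datetime, timedelta

# Assumed boundary between "nearline" and "offline"; the real recency
# windows used at eBay are not stated in the interview.
NEARLINE_WINDOW = timedelta(hours=24)

def analytics_tier(event_time, session_start, now):
    """Decide which analytics tier should process a customer event."""
    if event_time >= session_start:
        return "online"    # same-session: derive insight and act now
    if now - event_time <= NEARLINE_WINDOW:
        return "nearline"  # recent event held in the online data repository
    return "offline"       # historical event stored in Hadoop/Teradata

now = datetime(2015, 1, 10, 12, 0)
session_start = datetime(2015, 1, 10, 11, 30)
print(analytics_tier(datetime(2015, 1, 10, 11, 45), session_start, now))  # online
print(analytics_tier(datetime(2015, 1, 10, 2, 0), session_start, now))    # nearline
print(analytics_tier(datetime(2014, 6, 1), session_start, now))           # offline
```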

AR: Q2. The amount of data (and correspondingly, the number of Hadoop cluster nodes) at eBay has grown significantly from 2010 to 2014. How has this rapid growth of data impacted global customer optimization? What are the main opportunities and challenges?

RS: Both the amount of data and the number of Hadoop cluster nodes have experienced dramatic growth at eBay. The insights generated from Big Data processing on Hadoop have greatly helped us to better our customers’ experience across multiple domains. I was one of the first (in 2010) to use Hadoop to operationalize data science models across our major sites, which led to nearly a billion impressions per day surfacing on high-traffic pages such as Home, Search Results, and Product Listing, in addition to seasonal and gifting pages.

The Hadoop ecosystem is being used for a diverse range of workloads, such as data processing, system services, data science, business intelligence, analytical applications, and search. There is significant opportunity in using Big Data to continually derive insights at scale and speed to better our customers' experience. The challenge is ensuring that, as users, we also manage capacity and job resources in the shared Hadoop cluster judiciously.

AR: Q3. What was the major motivation behind developing Astro? Can you briefly describe the Astro Predictive Model and the results of performance assessment tests?

RS: The sheer growth in data volume and Hadoop cluster size makes it a significant challenge to diagnose and locate problems in a production-level cluster environment efficiently and within a short period of time. Oftentimes, distributed monitoring systems are not capable of detecting a problem well in advance when a large-scale Hadoop cluster starts to deteriorate in performance or nodes become unavailable. As a result, both the reliability and the throughput of the cluster drop significantly. The current rules-based approach, which uses only a handful of metrics, produces a large number of false positives; this is especially time-consuming given the manual intervention currently required to flag failed nodes to the scheduler. An alternative is to install Apache Ambari, but it does not provide prediction models and can only detect failures once the data nodes have already reached a bad state.

We address this problem by proposing a system called Astro, which consists of a predictive model and an extension to the Hadoop scheduler. The predictive model in Astro takes a rich set of cluster behavioral information collected by monitoring processes and models it using machine learning algorithms to predict future behavior of the cluster. The model detects anomalies in the cluster and also identifies a ranked set of metrics that have contributed the most to the problem. The Astro scheduler uses the prediction outcome and the list of metrics to decide whether to move and reduce workloads from the problematic cluster nodes, or to prevent additional workload allocations to them, in order to improve both the throughput and the reliability of the cluster.
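As a rough illustration of the predictive model's role, the sketch below scores a node's monitored metrics against historical baselines and ranks the metrics that contribute most to an anomaly. Astro itself uses machine learning models over a much richer set of behavioral information; the metric names, baseline values, and threshold here are invented for illustration.

```python
# Toy stand-in for an Astro-style anomaly check: flag metrics on a node
# that deviate strongly from their historical baselines, and rank them
# by deviation so the top contributors to the problem come first.

def score_node(metrics, baseline, threshold=3.0):
    """metrics: {name: current value}; baseline: {name: (mean, stdev)}.
    Returns [(metric, z_score), ...] for metrics beyond the threshold,
    ranked with the largest contributor first."""
    flagged = []
    for name, value in metrics.items():
        mean, stdev = baseline[name]
        z = abs(value - mean) / stdev
        if z >= threshold:
            flagged.append((name, z))
    return sorted(flagged, key=lambda item: -item[1])

# Illustrative baselines and node snapshots (not real eBay metrics).
baseline = {"disk_io_wait": (5.5, 1.0), "heartbeat_lag": (0.15, 0.05)}
healthy = {"disk_io_wait": 6.0, "heartbeat_lag": 0.2}
failing = {"disk_io_wait": 95.0, "heartbeat_lag": 4.0}
print(score_node(healthy, baseline))  # [] -- node looks fine
print(score_node(failing, baseline))  # disk_io_wait ranked ahead of heartbeat_lag
```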
The results demonstrate that the Astro scheduler improves the usage of cluster compute resources compared to traditional Hadoop, and that the runtime of representative applications is reduced by over 20% during the time of anomaly, thus improving cluster throughput. For further information on Astro, please refer to the paper titled “Astro: A predictive model for anomaly detection and feedback-based scheduling on Hadoop,” published at the 2014 IEEE International Conference on Big Data.
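The feedback loop into the scheduler can be sketched in the same spirit: nodes flagged by the predictive model are excluded from new workload allocations. This is a toy round-robin assigner under that assumption, not the actual Astro scheduler extension (which also moves and reduces existing workloads).

```python
# Toy feedback-based scheduling step: skip nodes flagged as anomalous
# when assigning new tasks. Node and task names are illustrative.

def schedule(tasks, nodes, anomalous):
    """Assign tasks round-robin over the nodes not flagged as anomalous."""
    healthy = [n for n in nodes if n not in anomalous]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

assignments = schedule(
    tasks=["t1", "t2", "t3"],
    nodes=["node1", "node2", "node3", "node4"],
    anomalous={"node4"},  # e.g. flagged by the predictive model
)
print(assignments)  # node4 receives no new work
```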

AR: Q4. What are the key capabilities of Apache Kylin? How does it compare against alternatives such as Hive?

RS: Apache Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It is an extremely fast OLAP engine at scale, designed to reduce query latency on Hadoop for tens of billions of rows of data. Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions. It is attractive for multi-dimensional analysis on large datasets (a billion or more rows), offering low query latency, and it integrates well with existing BI tools such as Tableau. Users can interact with Hadoop data via Kylin at sub-second latency, which is better than Hive queries on the same dataset; existing SQL-on-Hadoop solutions need to scan part or all of the data set to answer a user query, which can be inefficient.
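The reason a cube-based engine can avoid scanning at query time is that aggregates are pre-computed offline for each combination of dimensions, so a query reduces to a lookup. The toy sketch below illustrates that idea only; the dimensions, measure, and rows are invented, and real Kylin builds and serves its cubes on the Hadoop stack rather than in memory.

```python
from collections import defaultdict
from itertools import combinations

# Toy OLAP cube: pre-aggregate SUM(sales) for every combination of
# dimensions (every "cuboid"), so a query becomes a dictionary lookup
# instead of a scan. All names and values here are illustrative.
DIMENSIONS = ("category", "region")

rows = [
    {"category": "camping", "region": "US", "sales": 120},
    {"category": "camping", "region": "EU", "sales": 80},
    {"category": "artifacts", "region": "US", "sales": 200},
]

def build_cube(rows):
    """Offline build step: aggregate the measure for every cuboid."""
    cube = defaultdict(int)
    for row in rows:
        for r in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, r):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row["sales"]
    return cube

def query(cube, **filters):
    """Answer SELECT SUM(sales) ... WHERE <filters> by lookup, no scan."""
    dims = tuple(sorted(filters))
    return cube[(dims, tuple(filters[d] for d in dims))]

cube = build_cube(rows)
print(query(cube, category="camping"))               # 200
print(query(cube, category="camping", region="EU"))  # 80
print(query(cube))                                   # 400 (grand total)
```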

For further details, please refer to the following online resources:
  1. Home Page of Apache Kylin
  2. eBay Tech Blog on Apache Kylin

Second part of the interview