Interview: Nandu Jayakumar, Yahoo on How Yahoo is Harnessing Big Data

We discuss the major Big Data use cases at Yahoo, major challenges, trends in enterprise Big Data implementations, and the advantages of using Spark.

Nandu Jayakumar has been working with Big Data for over a decade now. He is passionate about databases and distributed systems. At Yahoo, he is currently building data applications that power digital advertising. He is also focused on advanced analytics that aim to improve user understanding at Yahoo. As a senior leader of Yahoo's well-regarded data team, he has built key pieces of Yahoo's data processing platforms and tools through their several iterations. These include data repositories, data pipelines, and reporting systems. In the past, he has contributed to open source projects, including Shark (part of the Apache Spark effort).

Nandu holds a Bachelor’s degree in Electronics Engineering from Bangalore University and a Master’s degree in Computer Science from Stanford University.

Here is my interview with him:

Anmol Rajpurohit (AR): Q1. What are the goals of the Big Data ecosystem at Yahoo? What kinds of business problems stand to benefit the most from Big Data (in other words, what are the most prominent use cases of Big Data at Yahoo)?

Nandu Jayakumar (NJ): Yahoo is focused on making the world's daily habits inspiring and entertaining. This focus on the experience users have on our apps and websites drives us toward creating highly personalized and optimized experiences. Big Data is critical to implementing this for our hundreds of millions of users.

Yahoo’s digital advertising business, much like ad tech in general, is very data intensive. Big Data technologies allow us to serve billions of relevant, targeted ads every day.

AR: Q2. According to you, what are the biggest challenges in working with Big Data today?

NJ: Over the last decade, the industry's ability to work with large amounts of data has advanced to the point where solutions like Hadoop and other cloud-based technologies have become commonplace. Using the right tools has allowed us to overcome the challenge that large scale imposes.

An area which needs significant work is the ease with which we can operate on this data. Today's tools and platforms are only accessible to a small number of experts. They are not user-friendly enough, or general enough, to meet the needs of many others.

The tool ecosystem is not as broad or mature when operating outside the realm of RDBMS implementations. The equivalents of commercial ETL or modeling tools do not compare well. Similarly, the ability to interface BI or statistical tools with good visualization is not as mature when we operate with large-scale data.

Another problem in the Big Data world is the proliferation of open source choices that solve very similar problems. This healthy, evolving ecosystem can be an advantage; the challenge is simply to choose the right tool for the job at hand.

AR: Q3. What are the major trends that you observe in Enterprise implementations of Big Data technology?

NJ: Trends include:
  • Datacenter agnostic computing (dealing with data problems that span multiple datacenters)
  • Cloud based solutions for data handling
  • Apache Spark suite
  • Mainstream adoption of stream-based data processing (including Apache Storm and Apache Kafka)

New NoSQL stores are showing up all the time, each targeting some specific subset of the data storage/query problem. Older RDBMS based tools are now trying to work well with Hadoop. Reporting solutions that are backed directly by data stored in Hadoop instead of an intermediate RDBMS are another trend.

AR: Q4. What are the key advantages of using Spark? What are the benefits of integrating Shark with Spark?

NJ: As far as Yahoo is concerned, core Spark offers us a user-friendly, high-level language and data model and, indeed, a clearly defined way of thinking about data manipulation.

Its performance, and potential for significant improvement, has also been a key reason for our adoption of Spark. We are excited about the maturing of the rest of the projects in the Spark ecosystem, and are eagerly awaiting better stability and reliability.

Shark brought SQL to Spark. Scala is great and Spark code is elegant, but SQL is accessible to many more programmers, and often results in clearer code. A SQL implementation also allows for ODBC/JDBC, and, hence, access to a large ecosystem of tooling built around those standards.
Most of the code associated with Shark is now part of Spark itself, and is available as Spark SQL. One of the original benefits of Shark was tight integration with the de facto standard for SQL on Hadoop, Hive. Spark SQL will soon match that integration, and go beyond it with fantastic integration of SQL relations within the Spark data model. Spark code can now incorporate SQL, where appropriate, seamlessly.
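To make the "SQL within Spark code" point concrete, here is a minimal sketch (not from the interview) of mixing a SQL query with ordinary Scala transformations, using the Spark 1.x-era `SQLContext` API that succeeded Shark. The `Ad` case class, table name, and sample data are illustrative assumptions, not anything Yahoo described:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for illustration only.
case class Ad(campaign: String, clicks: Long)

object SqlOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sql-on-spark").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables .toDF() on RDDs of case classes

    // Build a relation from plain Scala data and register it for SQL access.
    val ads = sc.parallelize(Seq(Ad("sports", 120L), Ad("news", 340L))).toDF()
    ads.registerTempTable("ads")

    // Query with SQL, then keep working on the result in Scala.
    val busy = sqlContext.sql("SELECT campaign FROM ads WHERE clicks > 200")
    busy.collect().foreach(println)

    sc.stop()
  }
}
```

The same `busy` result is an ordinary distributed collection, so it can be fed straight into further Spark transformations — which is the seamless SQL/Scala interleaving described above.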

Second part of the interview