Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020
Apache Spark is one of the hottest Big Data technologies of 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and his vision for Big Data in 2020.
Gregory Piatetsky: Q1. How did you start Apache Spark and what were some of the key decisions that enabled it to become one of the hottest trends in Big Data technology and the largest Apache Project?
Matei Zaharia: I initially started working on big data systems through Hadoop, back in 2007. I worked with early Hadoop users (e.g. Facebook and Yahoo!) and saw that they all had common needs in terms of new workloads they wanted to run on their data that MapReduce wasn't suited for (e.g. interactive queries and machine learning). I first began Spark in 2009 to tackle the machine learning use case, because machine learning researchers in our lab at UC Berkeley were trying to use MapReduce for their algorithms and finding it very inefficient. After starting with this use case, we quickly realized that Spark could be useful beyond machine learning, and focused on designing a general computing engine with great libraries for building complete data pipelines.
GP: Q2. What are the key things to know about Apache Spark?
MZ: Here are some lesser-known things:
- While Spark is known for in-memory computing, the engine is also really fast on disk, and quite a bit of work was done recently to optimize that. At Databricks, we used Spark to beat the world record for sorting on-disk data in 2014, using 10x fewer resources than MapReduce. Many users run Spark on petabyte-scale on-disk datasets.
- Many people ask whether Spark is a replacement for Hadoop. The short answer is that it is not: Spark is only a computing engine, while Hadoop is a complete stack of storage, cluster management and computing tools, and Spark can run well on Hadoop. However, we do see many deployments that are not on Hadoop, including deployments on NoSQL stores (e.g. Cassandra) and deployments directly against cloud storage (e.g. Amazon S3, Databricks Cloud). In this sense Spark is reaching a broader audience than Hadoop users.
- Most of the development activity in Apache Spark is now in the built-in libraries, including Spark SQL, Spark Streaming, MLlib and GraphX. Of these, the most popular are Spark Streaming and Spark SQL: roughly 50-60% of users use each of them.
GP: Q3. You work on several exciting projects at the forefront of Big Data and cloud computing - what is your vision for where Big Data and Cloud technology will be in 2020?
MZ: I also think that the cloud will play a big role, which is why Databricks started with a cloud product. The cloud provides a very low-cost way to store and manage data, and lets organizations focus on just the processing they want to do instead of operations / IT. It's where a lot of data is "born". And it makes it very easy to deploy and run new applications at the same site where the data lives, which is important for data processing because a lot of it is exploratory.
I think that by 2020 most data will be in either public clouds or cloud-like private environments.
GP: Q4. You are the CTO and co-founder of Databricks, which aims to help clients do Big Data processing using Spark. What does Databricks do that Apache Spark does not?
MZ: Databricks offers a cloud service that makes it easy to deploy Spark applications and work on data collaboratively in a team. The code you write is all Apache Spark code, and can therefore run on any Spark cluster. However, we provide tools to make it easy to run this code (e.g. schedule a production job and get an email if it doesn't run), and a UI for fast, productive data exploration (a Google-Docs-like notebook environment, publishable dashboards, etc).
GP: Q5. What are the next exciting projects you are working on?
MZ: In Databricks Cloud, we're building some pretty exciting new features that we plan to announce soon (stay tuned).
GP: Q6. InfoWorld says that one possible problem with Spark is that it is not a pure stream-processing engine, but a fast-batch operation working on small parts of incoming data ("micro-batching"), and that it is not as good as pure streaming engines such as Flink or Storm. What is your response?
MZ: While it is true that Spark uses a micro-batch execution model, I don't think this is a problem in practice, because the batches are as short as 0.5 seconds. In most applications of streaming big data, the latency to get the data in is much higher (e.g. you have sensors that send in data every 10 seconds, or something like that), or the analysis you want to do is over a longer window (e.g. you want to track events over the past 10 minutes), so it doesn't matter that you take a short amount of time to process it.
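The micro-batch idea described above can be sketched in a few lines of plain Python (this is a toy illustration, not Spark code; the event data and batch width are made up for the example): events arriving on a stream are grouped into consecutive fixed-width windows, and each window is then processed as a small batch job.

```python
BATCH_INTERVAL = 0.5  # seconds; mirrors the short batch interval mentioned above

def micro_batches(events, interval=BATCH_INTERVAL):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = {}
    for ts, value in events:
        batch_id = int(ts // interval)  # which 0.5 s window the event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]

# Sensor-style events: (arrival time in seconds, reading)
events = [(0.1, 3), (0.4, 5), (0.6, 2), (1.2, 7), (1.4, 1)]

for i, batch in enumerate(micro_batches(events)):
    print(f"batch {i}: sum = {sum(batch)}")
```

Because each window spans only half a second, the extra latency this adds is small compared to how quickly such data typically arrives or how wide the analysis windows are.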
The benefit of Spark's micro-batch model is that you get full fault-tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this, requiring application developers to worry about missing data or to treat the streaming results as potentially incorrect. Again, that can be okay for some applications (e.g. just basic monitoring), but it makes it hard to write more complex applications and reason about their results.
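A minimal plain-Python sketch (not Spark's actual implementation) of why deterministic batch recomputation yields exactly-once results: each batch's output is a pure function of its input, and committed results are checkpointed, so replaying after a simulated crash recomputes only uncommitted batches and produces the same total, with nothing dropped or double-counted. The function names and checkpoint structure here are invented for illustration.

```python
def process_batch(batch):
    """Pure, deterministic computation over one micro-batch."""
    return sum(batch)

def recover(batches, checkpoint):
    """Recompute only the batches that were not committed before the crash."""
    for i, batch in enumerate(batches):
        if i not in checkpoint:
            checkpoint[i] = process_batch(batch)
    return sum(checkpoint.values())

def run(batches, crash_after=None):
    """Process batches with checkpointing; optionally 'crash' and recover."""
    checkpoint = {}  # batch index -> committed result
    for i, batch in enumerate(batches):
        checkpoint[i] = process_batch(batch)
        if crash_after is not None and i == crash_after:
            # Simulate a node failure: committed results survive, the rest replay.
            return recover(batches, checkpoint)
    return sum(checkpoint.values())

batches = [[3, 5], [2], [7, 1]]
assert run(batches) == run(batches, crash_after=1)  # same result despite the crash
```

Determinism is the key property: since a replayed batch always produces the same output, recovery cannot change the answer, which is what lets the whole computation be treated as exactly-once.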
GP: Q7. What do you like to do when you are away from a computer and Big Data? Is there a book that you recently read and liked?
MZ: In this day and age, you're never far away from Big Data. But when I am, I like to read books, walk around, and occasionally try to cook things. I really liked The Martian by Andy Weir.