Apache Spark, a fast, general engine for Big Data processing, is one of the hottest Big Data technologies in 2015.
It was created by Matei Zaharia, a brilliant young researcher, when he was a graduate student at UC Berkeley around 2009. Since then, Spark has grown to become the most active Apache project, with hundreds of contributors and many production deployments.
I recently had a chance to ask Matei Zaharia a few questions about Spark and Big Data - see interview below.
Matei Zaharia was born in Romania, but his family later moved to Canada, where he graduated from U. of Waterloo with a medal for highest academic standing. He received numerous awards at programming contests, including a gold medal in ACM ICPC in 2005. He started the Spark project while a graduate student at UC Berkeley, where he received his PhD in 2013. He is currently an assistant professor at MIT.
He is also CTO of Databricks and serves as Spark's Vice President at Apache.
Gregory Piatetsky: Q1. How did you start Apache Spark and what were some of the key decisions that enabled it to become one of the hottest trends in Big Data technology and the largest Apache Project?
Matei Zaharia: I initially started working on big data systems through Hadoop, back in 2007. I worked with early Hadoop users (e.g. Facebook and Yahoo!) and saw that they all had common needs in terms of new workloads they wanted to run on their data that MapReduce wasn't suited for (e.g. interactive queries and machine learning). I first began Spark in 2009 to tackle the machine learning use case, because machine learning researchers in our lab at UC Berkeley were trying to use MapReduce for their algorithms and finding it very inefficient. After starting with this use case, we quickly realized that Spark could be useful beyond machine learning, and focused on designing a general computing engine with great libraries for building complete data pipelines.
I think several things helped the project get to where it is today. First, we worked from very early on to foster a great community, including mentoring external contributors and accepting patches from them, publishing free training materials, etc. This led to many new contributors to the project from both inside and outside UC Berkeley. Second, Spark provided advantages in several dimensions (speed and ease of use) that were unmatched by alternatives. Third, we've managed to scale the development process and keep the project evolving quickly, so we continue to see exciting ideas added in Spark. Some recent ones include DataFrames, machine learning pipelines, R support, and a huge range of new algorithms that we're getting parallel implementations for in MLlib (Apache Spark's scalable machine learning library).
GP: Q2. What are the key things to know about Apache Spark?
Matei Zaharia: Here are some lesser-known things:
- While Spark is known for in-memory computing, the engine is also really fast on disk, and quite a bit of work was done recently to optimize that. At Databricks, we used Spark to beat the world record for sorting on-disk data in 2014, using 10x fewer resources than MapReduce. Many users run Spark on petabyte-scale on-disk datasets.
- Many people ask whether Spark is a replacement for Hadoop. The short answer is that it is not: Spark is only a computing engine, while Hadoop is a complete stack of storage, cluster management and computing tools, and Spark can run well on Hadoop. However, we do see many deployments that are not on Hadoop, including deployments on NoSQL stores (e.g. Cassandra) and deployments directly against cloud storage (e.g. Amazon S3, Databricks Cloud). In this sense Spark is reaching a broader audience than Hadoop users.
- Most of the development activity in Apache Spark is now in the built-in libraries: Spark SQL, Spark Streaming, MLlib and GraphX. Of these, the most popular are Spark Streaming and Spark SQL, each used by roughly 50-60% of users.
GP: Q3. You work on several exciting projects at the forefront of Big Data and cloud computing - what is your vision for where Big Data and Cloud technology will be in 2020?
Matei Zaharia: To me, the most exciting question is the applications, especially beyond traditional data processing. We're already seeing some exciting scientific applications built on big data systems like Spark, including in genomics, neuroscience and image processing. Some of these could enable new industrial applications, where this type of data crunching could be done on a regular basis to process data from industrial machines or sensors, analyze medical scans or sequencing data, etc. I think that by 2020 we'll see several such applications in common use.
I also think that the cloud will play a big role, which is why Databricks started with a cloud product. The cloud provides a very low-cost way to store and manage data, and lets organizations focus on just the processing they want to do instead of operations / IT. It's where a lot of data is "born". And it makes it very easy to deploy and run new applications at the same site where the data lives, which is important for data processing because a lot of it is exploratory.
I think that by 2020 most data will be in either public clouds or cloud-like private environments.
GP: Q4. You are the CTO and a co-founder of Databricks, which aims to help clients do Big Data processing using Spark. What does Databricks do that Apache Spark does not do?
Matei Zaharia: Databricks offers a cloud service that makes it easy to deploy Spark applications and work on data collaboratively in a team. The code you write is all Apache Spark code, and can therefore run on any Spark cluster. However, we provide tools to make it easy to run this code (e.g. schedule a production job and get an email if it doesn't run), and a UI for fast, productive data exploration (a Google-Docs-like notebook environment, publishable dashboards, etc.).
GP: Q5. What are the next exciting projects you are working on?
Matei Zaharia: In Databricks Cloud, we're building some pretty exciting new features that we plan to announce soon (stay tuned around Spark Summit, June 15-17). One that I worked on closely was Jobs, our feature for deploying and monitoring Spark applications. Another area I'm involved in is Project Tungsten, an effort to let Spark leverage modern hardware advances (e.g. solid-state memory, vectorized instructions in CPUs, maybe even GPUs) that presents lots of potential opportunities throughout the engine.
GP: Q6. InfoWorld says that one possible problem with Spark is that it is not a pure stream-processing engine, but a fast-batch operation working on a small part of incoming data ("micro-batching"), and it is not as good as Apache Flink for streaming operations. Do you agree, and how do you compare Spark and Flink?
Matei Zaharia: While it is true that Spark uses a micro-batch execution model, I don't think this is a problem in practice, because the batches are as short as 0.5 seconds. In most applications of streaming big data, the latency to get the data in is much higher (e.g. you have sensors that send in data every 10 seconds, or something like that), or the analysis you want to do is over a longer window (e.g. you want to track events over the past 10 minutes), so it doesn't matter that you take a short amount of time to process it.
The benefit of Spark's micro-batch model is that you get full fault tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this, requiring application developers to worry about missing data or to treat the streaming results as potentially incorrect. Again, that can be okay for some applications (e.g. just basic monitoring), but it makes it hard to write more complex applications and reason about their results.
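To make the micro-batch idea concrete, here is a toy sketch in plain Python (not actual Spark Streaming code; the function name and event format are illustrative assumptions). It chops a stream of timestamped events into fixed 0.5-second batches, each of which can then be processed as a small, deterministic batch job:

```python
def micro_batches(events, batch_interval=0.5):
    """Group time-ordered (timestamp, value) events into consecutive
    batches of `batch_interval` seconds, measured from the first event.
    Returns a list of (batch_index, event_count) pairs."""
    if not events:
        return []
    start = events[0][0]
    batches = {}
    for ts, value in events:
        # Each event falls into the batch covering its arrival time.
        index = int((ts - start) // batch_interval)
        batches.setdefault(index, []).append(value)
    # Each batch can now be processed like a small batch job, e.g. a count.
    return [(i, len(vals)) for i, vals in sorted(batches.items())]

# Events arriving over ~1.2 seconds fall into three 0.5-second batches.
stream = [(0.0, "a"), (0.1, "b"), (0.6, "c"), (0.7, "d"), (1.1, "e")]
print(micro_batches(stream))  # [(0, 2), (1, 2), (2, 1)]
```

Because each batch is a deterministic computation over a known slice of input, a lost batch can simply be recomputed, which is the intuition behind the exactly-once guarantee described above.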
GP: Q7. What do you like to do when you are away from a computer and Big Data? Is there a book that you recently read and liked?
Matei Zaharia: In this day and age, you're never far away from Big Data. But when I am, I like to read books, walk around, and occasionally try to cook things. I really liked The Martian by Andy Weir.