7 Steps to Mastering Apache Spark 2.0

Looking for a comprehensive guide on going from zero to Apache Spark hero in steps? Look no further! Written by our friends at Databricks, this exclusive guide provides a solid foundation for those looking to master Apache Spark 2.0.

Step 6: Structured Streaming with Infinite DataFrames

For much of Spark’s short history, Spark Streaming has continued to evolve to simplify writing streaming applications. Today, developers need more than just a streaming programming model to transform elements in a stream. They need a streaming model that supports end-to-end applications that continuously react to data in real time. We call these continuous applications.

Continuous applications have many facets: interacting with both batch and real-time data, performing ETL, serving data to a dashboard from batch and stream, or doing online machine learning by combining a static dataset with real-time data. Currently, such facets are handled by separate applications rather than a single one.

Apache Spark 2.0 lays the foundation for a new higher-level API, Structured Streaming, for building continuous applications.


Fig 7. Traditional Streaming vs Structured Streaming

Central to Structured Streaming is the notion that you treat a stream of data as an unbounded table. As new data arrives from the stream, new rows are appended to the unbounded table:


Fig 8. Stream as an unbounded table

You can perform computations or issue SQL-like queries on your unbounded table as you would on a static table. In this scenario, developers can express their streaming computations just like batch computations, and Spark will automatically execute them incrementally as data arrives in the stream.
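To make the idea of incremental execution concrete, here is a minimal conceptual sketch in plain Python. It is not Spark's API or its implementation; the row layout, the `batch_count` function, and the `IncrementalCount` class are hypothetical names used only for illustration. It shows the same "count rows per action" query computed once over a bounded table and again incrementally, as micro-batches of rows are appended to an unbounded one.

```python
from collections import Counter


def batch_count(rows):
    """Batch query: count rows per action over a bounded table."""
    return Counter(row["action"] for row in rows)


class IncrementalCount:
    """The same query run incrementally: running state is kept and
    updated each time new rows are appended to the table."""

    def __init__(self):
        self.state = Counter()

    def process(self, new_rows):
        # Only the newly arrived rows are scanned, not the whole table.
        self.state.update(row["action"] for row in new_rows)
        return dict(self.state)


# Hypothetical event rows arriving in three micro-batches.
micro_batches = [
    [{"action": "open"}, {"action": "close"}],
    [{"action": "open"}],
    [{"action": "open"}, {"action": "close"}],
]

query = IncrementalCount()
for batch in micro_batches:
    result = query.process(batch)

# The bounded view of the same data, for comparison with the batch query.
all_rows = [row for batch in micro_batches for row in batch]
```

After the last micro-batch, `result` matches `batch_count(all_rows)`, which is the property the unbounded-table model relies on: the incremental answer equals what a batch query over all data seen so far would return.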


Fig 9. Similar code for streaming and batch

Because Structured Streaming is built on the DataFrame/Dataset API, a DataFrame- or SQL-based query for a batch DataFrame looks almost the same as its streaming counterpart, as the code in Fig 9 shows; only a minor change is needed. In the batch version, we read a static, bounded log file, whereas in the streaming version, we read from an unbounded stream. The code looks deceptively simple because the complexity is hidden from the developer and handled by the underlying model and execution engine, which are explained in the video talk.

After you take a deep dive into Structured Streaming in the video talk, also read the Structured Streaming Programming Model, which elaborates on the under-the-hood complexity of data integrity, fault tolerance, exactly-once semantics, window-based aggregation, and out-of-order data. As a developer or user, you need not worry about these details.
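Window-based aggregation over out-of-order data can be illustrated with a small plain-Python sketch. This is a conceptual illustration under stated assumptions, not how Spark implements it: the 10-second window size, the `(event_time, device)` tuple layout, and the function names are all hypothetical. The key point is that events are grouped by the time they occurred rather than the time they arrived, so a late event still lands in the correct window.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical tumbling-window size


def window_start(event_time):
    """Map an event time (in seconds) to the start of its window."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_window(counts, events):
    """Update per-window counts with a newly arrived batch of events."""
    for event_time, _device in events:
        counts[window_start(event_time)] += 1
    return counts


counts = defaultdict(int)

# First batch arrives in order.
count_by_window(counts, [(1, "a"), (4, "b"), (12, "a")])

# Second batch contains a late event: time 7 arrives after time 12.
count_by_window(counts, [(15, "b"), (7, "a")])

window_counts = dict(counts)
```

The late event at time 7 is counted in the window starting at 0, alongside the events at times 1 and 4. This sketch keeps the state for every window indefinitely; a production engine has to decide how long to retain old windows.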

Learn more about Structured Streaming directly from Spark committer Tathagata Das, and try the accompanying notebook to get some hands-on experience with your first Structured Streaming continuous application.

Structured Streaming API in Apache Spark 2.0: A new high-level API for streaming

Similarly, the Structured Streaming Programming Guide offers short examples on how to use supported sinks and sources:

Structured Streaming Programming Guide

Step 7: Machine Learning for Humans

At a human level, machine learning is all about applying statistical learning techniques and algorithms to a large data set to identify patterns, and from these patterns make probabilistic predictions. A simplified view of a model is a mathematical function f(x): given a large data set as input, f(x) is applied repeatedly to the data to produce an output with a prediction.

Fig 10. Model as a mathematical function
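The view of a model as a function f(x) can be sketched with a single-variable linear regression in plain Python. This is a minimal illustration and does not use MLlib; the `fit_line` helper and the toy data are hypothetical. Fitting learns a slope and an intercept from the data, and the returned function `f` is the model that makes predictions.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares
    and return the fitted model as a function f(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x

    def f(x):
        return slope * x + intercept

    return f


# Toy training data generated from y = 2x + 1 (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

f = fit_line(xs, ys)
prediction = f(5.0)  # apply the learned function to an unseen input
```

Because the toy data lies exactly on a line, the fitted function recovers it and predicts 11.0 for an input of 5.0. Real data is noisy, so predictions are estimates rather than exact values.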

For key terms of machine learning, Matthew Mayo’s Machine Learning Key Terms, Explained is a valuable reference for understanding some concepts discussed in the webinar linked below.

Machine Learning Pipelines

Apache Spark’s DataFrame-based MLlib provides a set of algorithms as models and utilities, allowing data scientists to build machine learning pipelines easily. Inspired by the scikit-learn project, MLlib pipelines allow developers to combine multiple algorithms into a single pipeline or workflow. Typically, running machine learning algorithms involves a sequence of tasks, including pre-processing, feature extraction, model fitting, and validation stages. In Spark 2.0, this pipeline can be persisted and reloaded across the languages Spark supports (see the blog link below).

Fig 11. Machine Learning pipeline
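The pipeline idea, chaining stages so that each one's output feeds the next, can be sketched in plain Python in the scikit-learn style of `fit` and `transform`. This is a simplified illustration, not the MLlib API; the `Tokenizer`, `VocabularyCounter`, and `Pipeline` classes here are hypothetical stand-ins written for this example, and the sketch covers only pre-processing and feature extraction.

```python
class Tokenizer:
    """Pre-processing stage: split each text into lowercase words."""

    def fit(self, rows):
        return self  # nothing to learn for this stage

    def transform(self, rows):
        return [text.lower().split() for text in rows]


class VocabularyCounter:
    """Feature-extraction stage: learn a vocabulary in fit(), then
    turn each word list into a vector of word counts."""

    def fit(self, rows):
        self.vocab = sorted({word for words in rows for word in words})
        return self

    def transform(self, rows):
        return [[words.count(word) for word in self.vocab] for words in rows]


class Pipeline:
    """Run stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training_texts = ["Spark is fast", "Spark is simple"]
pipeline = Pipeline([Tokenizer(), VocabularyCounter()]).fit(training_texts)

# The learned vocabulary is ["fast", "is", "simple", "spark"].
features = pipeline.transform(["Spark Spark fast"])
```

Treating the whole workflow as one object means the same fitted stages are applied, in the same order, to training data and to new data.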

In the webinar on Apache Spark MLlib, you will get a quick primer on machine learning, Spark MLlib, and an overview of some Spark machine learning use cases, along with how other common data science tools such as Python, pandas, scikit-learn and R integrate with MLlib.

Spark MLlib

Moreover, two accompanying notebooks for some hands-on experience and a blog on persisting machine learning models will give you insight into why and how machine learning plays a crucial role in advanced analytics.

  1. 2015 Median Home Price by State
  2. Population vs. Median Home Prices: Linear Regression with Single Variable
  3. Saving and Loading Machine Learning Models in Apache Spark 2.0

If you follow these steps, watch all the videos, read the blogs, and try out the accompanying notebooks, we believe that you will be on your way to mastering Spark 2.0.

Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Sameer Farooqui is a Technology Evangelist at Databricks where he helps developers use Apache Spark by hosting webinars, writing blogs and speaking at conferences and meetups.