Spark SQL for Real Time Analytics – Part Two
Apache Spark is the hottest topic in Big Data. Part 2 of this tutorial covers the basic concepts of Stream Processing for Real Time Analytics and for the next frontier: the Internet of Things (IoT).
By Sumit Pal and Ajit Jaokar (FutureText).
This article is part of the forthcoming Data Science for Internet of Things Practitioner course. If you want to be a Data Scientist for the Internet of Things, this intensive course is ideal for you. We cover complex areas like Sensor Fusion, Time Series, Deep Learning and others. We work with Apache Spark, the R language and leading IoT platforms. Please contact info@futuretext.com for more details.
Here is Part 1 of this tutorial: Spark SQL for Real Time Analytics – Part One
Introduction
In this part we discuss the basic concepts of Stream Processing and how Spark handles it. We also highlight patterns for Stream Processing and how they are implemented using SQL in Spark.
Today, massive amounts of data are generated as data streams: social media, gaming apps, devices, financial systems and sensors all produce reams and streams of data. These datasets have value for immediate analytics, but that value often decays exponentially with the passage of time. Batch processing does not work in these scenarios; instead, stream processing is used where immediate analysis of streaming data is needed.
The holy grail of Stream Processing: you can have only two out of three of Speed, Volume and Accuracy.
Stream analytics rests on two major concepts. First, because a stream is an infinite flow of data while the system has limited resources and only a finite amount of time, stream processing systems break the data flow into fragments of a given size, called windows; each window of data is processed as fast as possible before moving on to the next one. Second, unlike batch processing systems, stream processing systems cannot store an infinite amount of state. Not every data point is necessary, and approximations are accepted in order to get results in time.
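To make the approximation point concrete, here is a minimal sketch using Spark's countApproxDistinct, which estimates distinct counts with bounded memory (HyperLogLog) instead of tracking every value exactly. The socket source on localhost:9999 and the comma-separated event layout with the user id in the first field are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ApproxDistinctUsers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ApproxDistinctUsers").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Assumed source: comma-separated events on localhost:9999, user id first
    val userIds = ssc.socketTextStream("localhost", 9999).map(_.split(",")(0))

    // countApproxDistinct trades exactness for bounded memory, illustrating
    // that an approximate answer delivered in time is often good enough
    userIds.foreachRDD { rdd =>
      println(s"Approx. distinct users this batch: ${rdd.countApproxDistinct(0.05)}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```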
Stream processing flips the usual idea of data processing: the data flows (data is in motion) while the queries are static. In other words, the system is built knowing its objectives and the analytics to be performed; ad-hoc queries are not the norm. Most stream processing systems are built on an in-memory architecture to save the time of writing to and reading from disk before computing the metrics. For compute-intensive tasks, GPUs can be leveraged to improve throughput and latency.
A streaming data architecture must be able to ingest the stream and process discrete incoming events, so that applications acting in real time can make fast, high-value decisions.
Some simple examples of streaming analytics over flowing data are counts, histograms, quantiles, trending, fraud detection and outlier detection.
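Since this series is about Spark SQL, here is a hedged sketch of the simplest of these analytics, a count, expressed as a SQL query over each micro-batch of a stream. It follows the same pattern as Spark's SqlNetworkWordCount example; the socket source, the one-word-per-line layout and the temporary table name "words" are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed single-column schema for each word in the stream
case class Record(word: String)

object StreamingSqlCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSqlCounts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed source: whitespace-separated words on localhost:9999
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    words.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      // Register each micro-batch as a temporary table and query it with SQL
      rdd.map(w => Record(w)).toDF().registerTempTable("words")
      sqlContext.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```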
Key Streaming Concepts in Spark
DStream: a sequence of RDDs representing a stream of data
Transformations: modify data in one DStream to produce another DStream
Standard RDD operations: map, countByValue, reduce, join
Stateful operations: window, countByValueAndWindow
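The following minimal sketch ties these concepts together: a DStream is created from a socket source (an assumption for illustration), and standard RDD-style transformations are applied to every RDD it contains.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamBasics").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // each 1-second batch becomes one RDD

    // DStream[String]: a sequence of RDDs, one per batch interval
    val lines = ssc.socketTextStream("localhost", 9999)

    // Standard RDD-style transformations, applied to every RDD in the DStream
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print() // print the first few counts of each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```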
Window Operations
Spark Streaming provides windowed computations, which allow transformations to be applied over a sliding window of data. The figure illustrates this concept.
When the window slides over a DStream, the RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. A window operation needs to specify two parameters:
window length – The duration of the window
sliding interval – The interval at which the window operation is performed (2 in the figure)
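As a sketch of these two parameters in code, the example below counts words over a 30-second window that slides every 20 seconds, on top of a 10-second batch interval; both durations must be multiples of the batch interval. The socket source is again an illustrative assumption.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedWordCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedWordCounts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // base batch interval

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // window length = 30 seconds, sliding interval = 20 seconds;
    // both are multiples of the 10-second batch interval
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(20))

    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```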