Spark SQL for Real Time Analytics – Part Two
Apache Spark is the hottest topic in Big Data. Part 2 of this series covers the basic concepts of stream processing for real-time analytics and for the next frontier – the Internet of Things (IoT).
Why SQL for Streaming Analytics is useful
Before discussing why SQL is useful for streaming analytics, let us look at some common analytic computation patterns in streaming applications:
- Counting within windows (see the sketch after this list)
- Initial preprocessing – filling missing data, enriching, cleaning, filtering and transforming
- Monitoring and alerting on anomalies and outliers
- Detecting trends, sequences and correlations
- Joining multiple streams, or streams with static data
- Detecting event sequence patterns
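As a minimal sketch of the windowed-counting pattern (the window and slide durations are illustrative, and words is assumed to be a DStream[String] of incoming words, as built in Example 1 below), Spark Streaming supports windowed counting directly:

// Count words over a sliding window: a 30-second window recomputed every 10 seconds
val windowedCounts = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()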
If batch processing is already being done with plain RDDs or with the DataFrame API and SQL, and the same processing or analytics is needed for stream processing too, it makes sense to reuse the same code. This is one of the most elegant advantages of the Spark framework: one does not need separate code bases for batch, interactive and streaming analytics.
The same code can be reused for streaming, batch or interactive workloads, because Spark lets the data sets be molded so that the computation logic stays unchanged. For batch and interactive processing, RDDs are used, while for streaming a DStream encapsulates multiple RDDs within a given window. In both cases, the computation logic remains the same, as the sketch below illustrates.
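As a minimal sketch of this reuse (the names countWords, batchWords and streamingWords are illustrative, not from the examples below), the same function can be applied to a batch RDD directly and to a DStream through its transform method:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// One counting function, written once
def countWords(rdd: RDD[String]): RDD[(String, Long)] =
  rdd.map(w => (w, 1L)).reduceByKey(_ + _)

// Batch: apply it directly to an RDD
val batchCounts = countWords(batchWords) // batchWords: RDD[String]

// Streaming: apply the same function to every RDD inside the DStream
val streamCounts = streamingWords.transform(countWords _) // streamingWords: DStream[String]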
This works well because SQL is the most widely used and understood data processing language, familiar to a huge pool of developers and analysts. SQL is easily comprehensible, and nothing new needs to be learnt to use it on a new framework. The complexity of code in a streaming analytics pipeline can make it difficult to read, maintain, debug and troubleshoot; the ability to use SQL simplifies the whole process. In addition, streaming analytics engines can optimize the SQL execution plan, reusing years of research in database query planning and execution optimization.
Spark SQL on Streaming Data – A Few Examples
The best way to get started with SQL for streaming analytics is to use DataFrames. This involves creating a SQLContext from the SparkContext – the same SparkContext the StreamingContext is built on.
DStreams in Spark Streaming are composed of RDDs (as discussed in Part 1). The trick is to get at the RDDs using the foreachRDD method of DStream. Once an RDD is obtained, it can be transformed into a DataFrame, registered as a temporary table, and any SQL can then be executed against that DataFrame. The code in Example 1 below highlights this pattern.
The data in the DataFrame can be joined with external static data sets by creating another RDD, transforming it into its own DataFrame, and joining the two DataFrames using standard SQL. The most elegant part is that DataFrames created from one DStream can also be combined with DataFrames created from another DStream running asynchronously in another thread.
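As a minimal sketch of combining two streams (the hosts, ports and table names are hypothetical), the transformWith method of DStream pairs up the per-batch RDDs of two streams so they can be joined with SQL:

// Two independent socket streams
val streamA = ssc.socketTextStream("hostA", 8888)
val streamB = ssc.socketTextStream("hostB", 9999)

// Combine the two streams batch-by-batch and join them with SQL
val commonWords = streamA.transformWith(streamB, (a: RDD[String], b: RDD[String]) => {
  val sqlContext = SQLContext.getOrCreate(a.sparkContext)
  import sqlContext.implicits._
  a.toDF("word").registerTempTable("wordsA")
  b.toDF("word").registerTempTable("wordsB")
  sqlContext.sql("select a.word from wordsA a join wordsB b on a.word = b.word").rdd
})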
Let us take a look at a few patterns of using SQL for stream processing, with code examples.
Example 1 – Plain SQL on Streaming Network Data from an IoT Device
Let us take the ever-popular example of word counting, where we count words coming from a socket stream. Start up a command shell and type
nc -l -p 8888
(on Linux) and start typing any set of words with whitespace separators.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

val sparkConf = new SparkConf().setAppName("StreamingWordCountSQL")
val ssc = new StreamingContext(sparkConf, Seconds(5)) // 5-second batch interval
val lines = ssc.socketTextStream("localhost", 8888, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))

// Convert each RDD in the DStream to a DataFrame and execute SQL on it
words.foreachRDD((rdd: RDD[String], time: Time) => {
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // Convert the RDD[String] to a single-column DataFrame
  val wordsDF = rdd.toDF("word")
  // Register the DataFrame as a temporary table
  wordsDF.registerTempTable("countingwords")
  // Count the words using SQL
  val countDF =
    sqlContext.sql("select word, count(*) as WordCount from countingwords group by word")
  countDF.show()
})
ssc.start()
ssc.awaitTermination()
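A design note: the code above uses SQLContext.getOrCreate (available from Spark 1.5 onwards) rather than new SQLContext(...) inside foreachRDD, so that every micro-batch reuses a single, lazily instantiated context. This also matters in Example 2 below, where a temporary table registered on the driver side must be visible to the SQL executed per batch – temporary tables are scoped to one SQLContext instance.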
The same approach can be applied to any streaming data source – whether the data comes from a file stream or any other DStream. A file-based variant is sketched below.
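As a minimal sketch of the file-based variant (the directory path is hypothetical), Spark Streaming picks up new files dropped into a monitored directory and the rest of the pattern is unchanged:

val fileLines = ssc.textFileStream("/data/incoming")
fileLines.flatMap(_.split(" ")).foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF("word").registerTempTable("filewords")
  sqlContext.sql("select word, count(*) as WordCount from filewords group by word").show()
}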
Example 2 – Joining Streaming Data with External Lookup Tables
Say we have a lookup table in an Oracle database which holds a synonym list for each of the words appearing in the above stream. The code below shows how to join the word counts with this external lookup table to produce output of the form: Word, Synonym List, Count.
// This setup runs once on the driver, before the words.foreachRDD block above –
// it loads the static reference table
val sqlContext = SQLContext.getOrCreate(ssc.sparkContext)
val lookupDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:refdata/password@refdata.dev.com:1521:orcl",
  "dbtable" -> "wordLookup"))
lookupDF.registerTempTable("lk")

// Inside foreachRDD: aggregate the stream, then join with the static lookup table –
// show word, synonyms and count
val wordCountsDFWithLookup =
  sqlContext.sql("""select c.word, p.synonyms, c.total
                    from (select word, count(*) as total
                          from countingwords
                          group by word) c
                    join lk p on c.word = p.word
                    order by c.word""")
wordCountsDFWithLookup.show()
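Since the lookup table is static, it is worth calling lookupDF.cache() right after loading it; otherwise the Oracle table may be re-read on every micro-batch in which the join runs.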
Summary
With the oncoming deluge of IoT data, stream processing is becoming a necessity. Stream processing requires high-speed, complex calculations on large data volumes, where the volume of data is controlled with window and slide sizes. Analytic computations on streams can quickly become complex, and this is where SQL provides the right approach for expressing those calculations.
Bio: Sumit Pal is a big data, visualisation and data science consultant. He is also a software architect and big data enthusiast who builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQL Server development team), Oracle (OLAP development team) and Verizon (big data analytics team) in a career spanning 22 years. Currently, he works for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java and Python. Sumit blogs at sumitpal.wordpress.com/
Ajit Jaokar (@AjitJaokar) does research and consulting for Data Science and the Internet of Things. His work is based on his teaching at Oxford University and UPM (Technical University of Madrid) and covers IoT, Data Science, Smart Cities and Telecoms.
Contact info@futuretext.com