Big Data Developer Conference, Santa Clara: Day 3 Highlights

Highlights from the presentations and tutorials by Data Science leaders from VISA, Glassbeam and Unravel on day 3 of the Big Data Developer Conference, Santa Clara.

The Resilient Distributed Dataset (RDD) is the core data abstraction in Spark. An RDD is an immutable, partitioned, distributed, fault-tolerant collection of data elements that can be operated on in parallel. The presenter discussed common RDD transformations and actions. Transformations and cache operations are both lazy: nothing is computed until an action is called. He also talked about Spark SQL, Spark Streaming and machine learning before starting the hands-on learning session.
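The lazy behavior described above can be sketched without Spark at all. The snippet below is a toy model in plain Python, not the Spark API: the class name ToyRDD and its internals are hypothetical, and only the method names (parallelize, map, filter, cache, collect, count) are borrowed from Spark for familiarity. It assumes a single process and ignores partitioning and fault tolerance entirely.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT the Spark API).

    Transformations build a new ToyRDD and compute nothing.
    Actions (collect, count) are what actually run the pipeline.
    """

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning an iterable
        self._cache_requested = False  # set by cache(); does no work on its own
        self._cached = None            # filled in by the first action after cache()

    @classmethod
    def parallelize(cls, data):
        items = tuple(data)            # snapshot, so the source data cannot change
        return cls(lambda: iter(items))

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)  # reuse the materialized result
        if self._cache_requested:
            self._cached = list(self._compute())
            return iter(self._cached)
        return iter(self._compute())

    # Transformations: lazy, each returns a new ToyRDD.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._iterate() if predicate(x)))

    # cache() is lazy as well: it only marks this dataset for reuse.
    def cache(self):
        self._cache_requested = True
        return self

    # Actions: these trigger the computation.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every real invocation of double()


def double(x):
    calls.append(x)
    return x * 2


doubled = ToyRDD.parallelize([1, 2, 3, 4]).map(double).cache()
print(len(calls))         # 0: neither map nor cache has run anything yet
print(doubled.collect())  # [2, 4, 6, 8]: the action triggers the work
print(doubled.count())    # 4: answered from the cached result
print(len(calls))         # 4: double() was not called again for count()
```

In this sketch the call counter stays at zero until collect() runs, which mirrors the point that transformations and cache operations only describe work while actions perform it.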

Eric Chu of Unravel delivered a talk on managing big data applications and systems. Because so many modules are involved in running a big data application, it becomes very difficult to understand why an app is slow or has failed, how to allocate resources, and how to store and retain data. Development operations teams often rely on monitoring tools, but those tools focus on infrastructure and services and reveal little about the application itself. Ultimately, developers have to dig through logs to fix problems in their applications, which is painful because these logs are spread out, incomplete and very difficult for the average user to understand.

The Unravel management platform addresses this problem by providing various apps that help with automatic speedup, alerts and dashboards, smart resource planning, etc. With Unravel, one can clearly understand activity on the cluster, plan systems intelligently (since it is a single tool for multiple layers), and easily pinpoint and resolve bottlenecks.

Chris Fregly, author of Effective Spark, began with a brief overview of Spark and then talked about RDDs and the Spark API in detail. He made the case for Scala by citing some of its nice properties: it interoperates with Java, which all popular big data frameworks use; it supports closures; it is functional but not obscure; and it can be used interactively through the Spark REPL. Turning to the Spark execution model, he discussed optimization through improved shuffle and sort. Towards the end, he ran a demo querying Parquet with SQL to show how much faster it is than querying JSON with SQL.
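One commonly cited reason for the gap seen in a Parquet versus JSON comparison is layout: Parquet stores data by column, while JSON stores whole records. The sketch below is a toy illustration of that idea in plain Python, not the demo from the talk and not real Parquet or Spark code. The sample records and field names are made up, and real Parquet adds further benefits (typed binary encoding, compression) that are not modeled here.

```python
import json

# Hypothetical sample records; the field names are invented for illustration.
records = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 25.5},
    {"user": "c", "country": "US", "amount": 4.5},
]

# Row-oriented layout (like JSON lines): one serialized record per line.
json_lines = [json.dumps(record) for record in records]

# Column-oriented layout (the idea behind Parquet): one array per field.
columns = {name: [record[name] for record in records] for name in records[0]}


def total_amount_from_rows(lines):
    """Parse every record in full even though only 'amount' is needed."""
    return sum(json.loads(line)["amount"] for line in lines)


def total_amount_from_columns(cols):
    """Touch only the 'amount' column; the other fields are never read."""
    return sum(cols["amount"])


# Both layouts answer a query like SELECT SUM(amount) with the same result;
# the difference is how much data has to be read and parsed to get there.
print(total_amount_from_rows(json_lines))
print(total_amount_from_columns(columns))
```

Both functions return the same total, but the row-based version has to decode every field of every record, which is the overhead a columnar format avoids when a query reads only a few columns.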