7 Steps to Mastering Apache Spark 2.0

Looking for a comprehensive guide on going from zero to Apache Spark hero in steps? Look no further! Written by our friends at Databricks, this exclusive guide provides a solid foundation for those looking to master Apache Spark 2.0.

Step 3: Advanced Apache Spark Core

To understand how all Spark components interact, it’s essential to grasp Spark’s core architecture in detail. All the key terms and concepts defined above (and more) come to life when you hear and see them explained. In this Spark Summit training video, you can immerse yourself and take the journey into Spark’s core.

Step 4: DataFrames, Datasets and Spark SQL Essentials

In step 3, if you watched the linked video, you might have learned about Resilient Distributed Datasets (RDDs), because they form the core data abstraction in Spark and underpin all other higher-level data abstractions and APIs, including DataFrames and Datasets.

In Spark 2.0, DataFrames and Datasets, built upon RDDs, form the core high-level, structured distributed data abstraction across most libraries and components in Spark. A DataFrame organizes data into named columns and imposes a schema on how your data is organized, which in turn shapes how you process data, express a computation, or issue a query. Datasets go one step further and provide strict compile-time type safety, so certain types of errors are caught at compile time rather than at runtime.


Fig 4. Spectrum of Errors Types detected for DataFrames & Datasets
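The difference in when errors surface can be sketched as follows. This is a minimal illustration rather than code from the original guide: the `Person` case class mirrors the field names used later in this step, the sample rows and the object name `ErrorSpectrumSketch` are made up, and it assumes a local SparkSession in classic (non-Connect) mode, where DataFrame expressions are analyzed eagerly.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Hypothetical record type; field names follow the article's Person example
case class Person(fname: String, lname: String, age: Int, weight: Double)

object ErrorSpectrumSketch {
  // Returns (number of seniors, whether the column-name typo was caught at runtime)
  def demo(spark: SparkSession): (Long, Boolean) = {
    import spark.implicits._

    // Made-up sample rows
    val peopleDS = Seq(Person("Ann", "Lee", 36, 58.0), Person("Bob", "Ray", 61, 70.0)).toDS()
    val peopleDF = peopleDS.toDF()

    // Dataset: field access is checked by the Scala compiler.
    // peopleDS.filter(p => p.agee > 55)  // would not compile: agee is not a member of Person
    val seniorDS = peopleDS.filter(p => p.age > 55)

    // DataFrame: column names are strings, so the same typo only surfaces
    // when Spark analyzes the expression at runtime.
    val typoCaughtAtRuntime =
      try { peopleDF.where("agee > 55"); false }
      catch { case _: AnalysisException => true }

    (seniorDS.count(), typoCaughtAtRuntime)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("error-spectrum").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```

The commented-out line is left in on purpose: uncommenting it should stop the build, which is the compile-time end of the spectrum.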

Because your data has structure and types, Spark can understand how you express your computation, which typed columns or typed named fields you access in your data, and which domain-specific operations you use. As a result, Spark optimizes your code through Spark 2.0’s Catalyst optimizer and generates efficient bytecode through Project Tungsten.

DataFrames and Datasets offer high-level, domain-specific language APIs that make your code expressive and give you high-level operators such as filter, sum, count, avg, min, and max. Whether you express your computations in Spark SQL or in Python, Java, Scala, or R, the underlying code generated is identical because all execution planning goes through the same Catalyst optimizer.

For example, this high-level domain-specific code in Scala and its equivalent relational query in SQL will generate identical code. Consider a Dataset of Scala objects of type Person and a SQL table “person.”

// a Dataset of Person objects with fields fname, lname, age, weight
// access using object notation
val seniorDS = peopleDS.filter(p => p.age > 55)
// a DataFrame with named columns fname, lname, age, weight
// access using column-name notation
val seniorDF = peopleDF.where(peopleDF("age") > 55)
// equivalent Spark SQL code (assumes the data is registered as the table "person")
val seniorSqlDF = spark.sql("SELECT * FROM person WHERE age > 55")
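To try the three forms side by side, a self-contained version might look like the sketch below. It assumes a local SparkSession and uses made-up sample rows; the `Person` case class, the object name `SeniorQueries`, and the temp view registration are illustrative additions, not part of the original guide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

// Hypothetical record type matching the field names in the snippet above
case class Person(fname: String, lname: String, age: Int, weight: Double)

object SeniorQueries {
  // Returns the row counts of the Dataset, DataFrame, and SQL variants
  def seniorCounts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // Made-up sample rows
    val peopleDS = Seq(
      Person("Ann", "Lee", 61, 64.5),
      Person("Bob", "Ray", 42, 80.0),
      Person("Cy", "Moss", 58, 77.2)
    ).toDS()
    val peopleDF = peopleDS.toDF()
    peopleDF.createOrReplaceTempView("person") // makes the data queryable from SQL

    val seniorDS = peopleDS.filter(p => p.age > 55)
    val seniorDF = peopleDF.where(peopleDF("age") > 55)
    val seniorSqlDF = spark.sql("SELECT * FROM person WHERE age > 55")

    // explain() prints the physical plan, so the DataFrame and SQL
    // variants can be compared by eye.
    seniorDF.explain()
    seniorSqlDF.explain()

    // A few of the high-level aggregate operators mentioned earlier
    peopleDF.agg(avg("age"), min("age"), max("age")).show()

    (seniorDS.count(), seniorDF.count(), seniorSqlDF.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("senior-queries").getOrCreate()
    println(seniorCounts(spark))
    spark.stop()
  }
}
```

One caveat when reading the plans: the lambda in the typed `filter` is opaque to Catalyst, so the Dataset variant may show extra deserialization steps that the column-based and SQL variants do not.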

To get a whirlwind introduction to why structuring data in Spark is important and why DataFrames, Datasets, and Spark SQL provide an efficient way to program Spark, we urge you to watch this Spark Summit talk by Michael Armbrust, Spark PMC member and committer, in which he articulates the motivations and merits behind structure in Spark.

In addition, these two blog posts discuss DataFrames and Datasets, and how to use them to process structured data like JSON files and to issue Spark SQL queries.

  1. Introduction to Datasets in Apache Spark
  2. A Tale of Three APIs: RDDs, DataFrames, and Datasets

Step 5: Graph Processing with GraphFrames

Spark has a general-purpose, RDD-based graph processing library, GraphX, which is optimized for distributed computing and supports graph algorithms, but it has some challenges. It has no Java or Python APIs, and it’s based on low-level RDD APIs. Because of these challenges, it cannot take advantage of the recent performance gains and optimizations introduced in DataFrames through Project Tungsten and the Catalyst optimizer.

By contrast, the DataFrame-based GraphFrames library addresses all these challenges. It provides a library analogous to GraphX but with high-level, expressive, and declarative APIs in Java, Scala, and Python; the ability to issue powerful SQL-like queries using DataFrame APIs; support for saving and loading graphs; and the benefit of the underlying performance and query optimizations in Spark 2.0. Moreover, it integrates well with GraphX. That is, you can seamlessly convert a GraphFrame into an equivalent GraphX representation.
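As a rough sketch of the GraphX conversion and of saving and loading, the example below converts a small GraphFrame with `toGraphX` and round-trips its vertices and edges through Parquet. It assumes the graphframes package is on the classpath and a local SparkSession; the object name `GraphXAndStorage`, the second airport row, and the temporary directory are made up for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.graphx.Graph
import org.apache.spark.sql.{Row, SparkSession}
import org.graphframes.GraphFrame

object GraphXAndStorage {
  // Returns (GraphX vertex count, GraphX edge count, reloaded edge count)
  def demo(spark: SparkSession): (Long, Long, Long) = {
    // Made-up sample graph: two airports and one trip between them
    val vertices = spark.createDataFrame(Seq(
      ("JFK", "New York", "NY"),
      ("SEA", "Seattle", "WA")
    )).toDF("id", "city", "state")
    val edges = spark.createDataFrame(Seq(
      ("JFK", "SEA", 45, 1058923)
    )).toDF("src", "dst", "delay", "tripID")
    val airportGF = GraphFrame(vertices, edges)

    // Convert to GraphX; vertex and edge attributes are carried as Rows
    val gx: Graph[Row, Row] = airportGF.toGraphX

    // Saving and loading reuse the ordinary DataFrame readers and writers
    val dir = Files.createTempDirectory("airports").toString // hypothetical location
    airportGF.vertices.write.mode("overwrite").parquet(s"$dir/vertices")
    airportGF.edges.write.mode("overwrite").parquet(s"$dir/edges")
    val reloaded = GraphFrame(
      spark.read.parquet(s"$dir/vertices"),
      spark.read.parquet(s"$dir/edges")
    )

    (gx.vertices.count(), gx.edges.count(), reloaded.edges.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("graphx-storage").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```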

In the graph diagram below, which represents airports by code and city, all the vertices can be represented as rows of a DataFrame; likewise, all the edges can be represented as rows of a DataFrame, each with its own named and typed columns. Together, these DataFrames of vertices and edges make up a GraphFrame.


Fig 5. A graph of cities represented as GraphFrame

// requires the graphframes package on the classpath
import org.graphframes.GraphFrame
// create a vertices DataFrame
val vertices = spark.createDataFrame(List(("JFK", "New York", "NY"))).toDF("id", "city", "state")
// create an edges DataFrame
val edges = spark.createDataFrame(List(("JFK", "SEA", 45, 1058923))).toDF("src", "dst", "delay", "tripID")
// create a GraphFrame and use its APIs
val airportGF = GraphFrame(vertices, edges)
// filter all edges from the GraphFrame with delays greater than 30 mins
val delayDF = airportGF.edges.filter("delay > 30")
// using the PageRank algorithm, rank the airports by importance
val pageRanksGF = airportGF.pageRank.resetProbability(0.15).maxIter(5).run()

With GraphFrames you can express three kinds of powerful queries. First, simple SQL-type queries on vertices and edges, such as which trips are likely to have major delays. Second, graph-type queries, such as how many incoming and outgoing edges each vertex has. And finally, motif queries, in which you provide a structural pattern or path of vertices and edges and then find those patterns in your graph’s dataset.
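The three kinds of queries could be sketched as below. The example assumes the graphframes package is on the classpath and a local SparkSession; the object name `ThreeQueryKinds` and every row other than the JFK to SEA trip shown earlier are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object ThreeQueryKinds {
  // Returns (number of delayed trips, number of round-trip motif matches)
  def demo(spark: SparkSession): (Long, Long) = {
    // Made-up sample graph
    val vertices = spark.createDataFrame(Seq(
      ("JFK", "New York", "NY"),
      ("SEA", "Seattle", "WA"),
      ("SFO", "San Francisco", "CA")
    )).toDF("id", "city", "state")
    val edges = spark.createDataFrame(Seq(
      ("JFK", "SEA", 45, 1058923),
      ("SEA", "JFK", 10, 1058924),
      ("JFK", "SFO", 5, 1058925)
    )).toDF("src", "dst", "delay", "tripID")
    val airportGF = GraphFrame(vertices, edges)

    // 1. SQL-type query: trips delayed by more than 30 minutes
    val delayed = airportGF.edges.filter("delay > 30")

    // 2. Graph-type query: incoming and outgoing edge counts per vertex
    airportGF.inDegrees.show()
    airportGF.outDegrees.show()

    // 3. Motif query: pairs of airports with a trip in each direction
    val roundTrips = airportGF.find("(a)-[ab]->(b); (b)-[ba]->(a)")
    roundTrips.select("a.id", "b.id").show()

    (delayed.count(), roundTrips.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("three-queries").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```

Each query returns an ordinary DataFrame, so the results can be filtered, joined, or aggregated further with the same DataFrame operators.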

Additionally, GraphFrames supports all the graph algorithms available in GraphX. For example, you can find important vertices using PageRank, determine the shortest path from a source to a destination, perform a Breadth First Search (BFS), or find strongly connected components for exploring social connections.
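A minimal sketch of these algorithms on a tiny graph follows. It assumes the graphframes package is on the classpath and a local SparkSession; the object name `GraphAlgorithms`, the three-airport cycle, and the parameter values are made up for illustration and are not tuned recommendations.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphAlgorithms {
  // Returns (number of BFS paths found, number of distinct strongly connected components)
  def demo(spark: SparkSession): (Long, Long) = {
    // Made-up sample graph: a three-airport cycle JFK -> SEA -> SFO -> JFK
    val vertices = spark.createDataFrame(Seq(
      ("JFK", "New York"), ("SEA", "Seattle"), ("SFO", "San Francisco")
    )).toDF("id", "city")
    val edges = spark.createDataFrame(Seq(
      ("JFK", "SEA", 45), ("SEA", "SFO", 10), ("SFO", "JFK", 5)
    )).toDF("src", "dst", "delay")
    val g = GraphFrame(vertices, edges)

    // PageRank: the result is a GraphFrame whose vertices gain a "pagerank" column
    val ranks = g.pageRank.resetProbability(0.15).maxIter(5).run()
    ranks.vertices.select("id", "pagerank").show()

    // Shortest paths: hop counts from every vertex to the given landmark vertices
    g.shortestPaths.landmarks(Seq("SFO")).run().select("id", "distances").show()

    // Breadth First Search from JFK to SFO, bounded at three hops
    val paths = g.bfs.fromExpr("id = 'JFK'").toExpr("id = 'SFO'").maxPathLength(3).run()

    // Strongly connected components: each vertex gains a "component" column
    val scc = g.stronglyConnectedComponents.maxIter(10).run()

    (paths.count(), scc.select("component").distinct().count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("graph-algorithms").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```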

In the webinar linked below, Joseph Bradley, Spark committer, gives an illuminating introduction to graph processing with GraphFrames, its motivations and ease of use, and the benefits of its DataFrame-based API. Through a notebook demonstrated as part of the webinar, you’ll see how easily you can use GraphFrames to issue all the aforementioned types of queries and run the aforementioned algorithms.


GraphFrames: DataFrame-based API for Apache Spark

Complementing the above webinar, two instructive blog posts, with their accompanying notebooks, offer an introductory, hands-on experience with DataFrame-based GraphFrames.

  1. Introduction to GraphFrames
  2. On-time Flight Performance with GraphFrames for Apache Spark

Going forward with Apache Spark 2.0, many Spark components, including MLlib for machine learning and Streaming, are increasingly moving toward offering equivalent DataFrame APIs because of the performance gains, ease of use, and high-level abstraction and structure. Where necessary or appropriate for your use case, you may elect to use GraphFrames instead of GraphX. Below is a succinct summary and comparison of GraphX and GraphFrames.


Fig 6. Comparison chart

Finally, GraphFrames continues to get faster, and a Spark Summit talk by Ankur Dave shows specific optimizations. A newer version of the GraphFrames package, compatible with Spark 2.0, is available as a Spark package.