Introduction to Spark with Python

Get a handle on using Python with Spark with this hands-on data processing tutorial.

By Srini Kadamati, Data Scientist at

After lots of ground-breaking work led by the UC Berkeley AMP Lab, Spark was developed to utilize distributed, in-memory data structures to improve data processing speeds over Hadoop for most workloads. In this post, we're going to cover the architecture of Spark and basic transformations and actions using a real dataset. If you want to write and run your own Spark code, check out the interactive version of this post on Dataquest.

Resilient Distributed Datasets (RDD's)

The core data structure in Spark is an RDD, or a resilient distributed dataset. As the name suggests, an RDD is Spark's representation of a dataset that is distributed across the RAM, or memory, of lots of machines. An RDD object is essentially a collection of elements that you can use to hold lists of tuples, dictionaries, lists, etc. Similar to DataFrames in Pandas, you load a dataset into an RDD and then can run any of the methods accesible to that object.


While Spark is writen in Scala, a language that compiles down to bytecode for the JVM, the open source community has developed a wonderful toolkit called PySpark that allows you to interface with RDD's in Python. Thanks to a library called Py4J, Python can interface with JVM objects, in our case RDD's, and this library one of the tools that makes PySpark work.

To start off, we'll load the dataset containing all of the Daily Show guests into an RDD. We are using the TSV version of FiveThirtyEight's dataset. TSV files are separated, or delimited, by a tab character "\t" instead of a comma "," like in a CSV file.

raw_data = sc.textFile("daily_show.tsv")
 '1999\tactor\t1/11/99\tActing\tMichael J. Fox',
 '1999\tComedian\t1/12/99\tComedy\tSandra Bernhard',
 '1999\ttelevision actress\t1/13/99\tActing
  \tTracey Ullman',
 '1999\tfilm actress\t1/14/99\tActing\tGillian Anderson']


SparkContext is the object that manages the connection to the clusters in Spark and coordinates running processes on the clusters themselves. SparkContext connects to cluster managers, which manage the actual executors that run the specific computations. Here's a diagram from the Spark documentation to better visualize the architecture:

The SparkContext object is usually referenced as the variable sc. We then run:

raw_data = sc.textFile("daily_show.tsv")

to read the TSV dataset into an RDD object raw_data. The RDD object raw_data closely resembles a List of String objects, one object for each line in the dataset. We then use the take() method to print the first 5 elements of the RDD:


To explore the other methods an RDD object has access to, check out the PySpark documentation. take(n) will return the first n elements of the RDD.

Lazy Evaluation

One question you may have is if an RDD resembles a Python List, why not just use bracket notation to access elements in the RDD? Because RDD objects are distributed across lots of partitions, we can't rely on the standard implementation of a List and the RDD object was developed to specifically handle the distributed nature of the data. One advantage of the RDD abstraction is the ability to run Spark locally on your own computer. When running locally on your own computer, Spark simulates distributing your calculations over lots of machines by slicing your computer's memory into partitions, with no tweaking or changes to the code you wrote.

Another advantage of Spark's RDD implementation is the ability to lazily evaluate code, postponing running a calculation until absolutely necessary. In the code above, Spark didn't wait to load the TSV file into an RDD until raw_data.take(5) was run. When raw_data = sc.textFile("dail_show.tsv") was called, a pointer to the file was created, but only when raw_data.take(5) needed the file to run its logic was the text file actually read into raw_data. We will see more examples of this lazy evaluation in this lesson and in future lessons.