Introduction to Spark with Python

Get a handle on using Python with Spark with this hands-on data processing tutorial.


Spark borrowed heavily from Hadoop's Map-Reduce pattern, but is quite different in many ways. If you have experience with Hadoop and traditional Map-Reduce, read this great post by Cloudera on the difference. Don't worry if you have never worked with Map-Reduce or Hadoop before as we'll cover the concepts you need to know in this course.

The key idea to understand when working with Spark is data pipelining. Every operation or calculation in Spark is essentially a series of steps that can be chained together and run in succession to form a pipeline. Each step in the pipeline returns either a Python value (e.g. Integer), a Python data structure (e.g. Dictionary) or an RDD object. We'll start with the map() function.


The map(f) function applies the function f to every element in the RDD. Because RDDs are iterable, like most Python collections, Spark runs f on each element and returns a new RDD.

We'll walk through a map example so you can get a better sense of how it works. If you look carefully, raw_data is in a format that's hard to work with. Each element is currently a String, and we'd like to transform every element into a List so the data is more manageable. Traditionally, we would:

1. Use a `for` loop to iterate over the collection
2. Split each `String` on the delimiter
3. Store the result in a `List`
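The three traditional steps above can be sketched in plain Python, without Spark. The `raw_lines` list is a hypothetical stand-in for the dataset, holding two sample tab-delimited rows, and `split_lines` is a name made up for this illustration.

```python
# Hypothetical stand-in for the raw dataset: each element is one tab-delimited String.
raw_lines = [
    "1999\tactor\t1/11/99\tActing\tMichael J. Fox",
    "1999\tComedian\t1/12/99\tComedy\tSandra Bernhard",
]

split_lines = []
for line in raw_lines:            # 1. iterate over the collection
    fields = line.split("\t")     # 2. split each String on the delimiter
    split_lines.append(fields)    # 3. store the result in a List

print(split_lines[0])
```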

Let's walk through how we use map to do this in Spark.

In the below code block, we:

1. Call the RDD function `map()` to specify we want the enclosed logic to be applied to every line in our dataset
2. Write a lambda function to split each line using the tab delimiter "\t" and assign the resulting RDD to `daily_show`
3. Call the RDD function `take()` on `daily_show` to display the first 5 elements (or rows) of the resulting RDD

The map(f) function is known as a transformation step and either a named or lambda function f is required.

daily_show = raw_data.map(lambda line: line.split('\t'))
daily_show.take(5)
[['YEAR', 'GoogleKnowlege_Occupation', 'Show', 'Group', 
 ['1999', 'actor', '1/11/99', 'Acting', 'Michael J. Fox'],
 ['1999', 'Comedian', '1/12/99', 'Comedy', 
   'Sandra Bernhard'],
 ['1999', 'television actress', '1/13/99', 'Acting', 
   'Tracey Ullman'],
 ['1999', 'film actress', '1/14/99', 'Acting', 
   'Gillian Anderson']]
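
As noted above, map() accepts either a named function or a lambda. The sketch below compares the two forms using Python's built-in map() rather than Spark, an assumption made so the snippet runs without a Spark installation. The names `split_on_tab` and `sample` are hypothetical.

```python
def split_on_tab(line):
    """Named function: split one tab-delimited line into a list of fields."""
    return line.split("\t")

# Hypothetical one-row sample standing in for the dataset.
sample = ["1999\tactor\t1/11/99\tActing\tMichael J. Fox"]

with_named = list(map(split_on_tab, sample))                     # named function
with_lambda = list(map(lambda line: line.split("\t"), sample))   # lambda function

print(with_named == with_lambda)  # True: both forms produce the same result
```

With an RDD, either function would be passed to the RDD's own map() method in the same way.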

Python and Scala, friends forever

One of the wonderful features of PySpark is the ability to separate our logic, which we prefer to write in Python, from the actual data transformation. In the above code block, we wrote a lambda function in Python code:

raw_data.map(lambda line: line.split('\t'))

but got to take advantage of Scala when Spark actually ran the code over our RDD. This is the power of PySpark. Without learning any Scala, we get to harness the data processing performance gains from Spark's Scala architecture. Even better, when we ran:

daily_show.take(5)

the results were returned to us in Python-friendly notation.

Transformations and Actions

In Spark, there are two types of methods:

1. Transformations - map(), reduceByKey()
2. Actions - take(), reduce(), saveAsTextFile(), collect()

Transformations are lazy operations and always return a reference to an RDD object. The transformation, however, is not actually run until an action needs to use the resulting RDD from a transformation. Any function that returns an RDD is a transformation and any function that returns a value is an action. These concepts will become more clear as you work through this lesson and practice writing PySpark code.
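
To get a feel for lazy evaluation, here is a small analogy in plain Python. It is only an analogy and not Spark code: Python 3's built-in map() also defers work until its results are consumed, much as a transformation waits for an action. The names `calls`, `split_and_log`, and `lines` are hypothetical.

```python
calls = []  # records every line the function actually processes

def split_and_log(line):
    calls.append(line)
    return line.split("\t")

# Hypothetical sample lines standing in for the dataset.
lines = ["1999\tactor", "1999\tComedian", "1999\ttelevision actress"]

pipeline = map(split_and_log, lines)   # lazy: nothing has been split yet
print(len(calls))                      # 0

# Consuming two results forces the work for exactly two elements.
first_two = [next(pipeline), next(pipeline)]
print(len(calls))                      # 2
print(first_two)
```

Here, pulling values with next() plays the role that an action such as take() plays in Spark.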


You may be wondering why we couldn't just split each String in place instead of creating a new object, daily_show. In Python, we could have modified the collection element by element, in place, without returning and assigning the result to a new object.

RDD objects are immutable and their values can't be changed once the object is created. In Python, List objects and Dictionary objects are mutable, which means we can change the object's values, while Tuple objects are immutable. The only way to modify a Tuple object in Python is to create a new Tuple object with the necessary updates. Spark relies on the immutability of RDDs for speed gains; the mechanics of that are outside the scope of this lesson.
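
The difference between mutable Lists and immutable Tuples can be seen in a short plain-Python sketch. The variable names and values here are hypothetical and chosen only for illustration.

```python
guest_list = ["1999", "actor"]      # List: mutable
guest_list[1] = "Comedian"          # modified in place

guest_tuple = ("1999", "actor")     # Tuple: immutable
try:
    guest_tuple[1] = "Comedian"     # item assignment is not supported
    mutated = True
except TypeError:
    mutated = False

# The only way to "change" a Tuple is to build a new one.
updated_tuple = guest_tuple[:1] + ("Comedian",)

print(guest_list, guest_tuple, updated_tuple, mutated)
```

An RDD behaves like the Tuple in this respect: a step such as map() hands back a new RDD instead of altering the existing one.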


We would like to get a histogram, or a tally, of the number of guests in each year the show has been running. If daily_show were a List of Lists, we could write the following Python code to achieve this result:

tally = dict()
for line in daily_show:
  year = line[0]
  if year in tally:
    # Year already seen: increment its count
    tally[year] = tally[year] + 1
  else:
    # First time we see this year
    tally[year] = 1

The keys in tally will be unique Year values and the values will be the number of lines in the dataset that contained that value.

If we want to achieve this same result using Spark, we will have to use a Map step followed by a ReduceByKey step.

tally = daily_show.map(lambda x: (x[0], 1)).reduceByKey(lambda x,y: x+y)
print(tally)
PythonRDD[156] at RDD at PythonRDD.scala:43
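
To build intuition for what the Map and ReduceByKey steps compute, here is a plain-Python sketch of the same logic. It is an illustration only: it runs on a small hypothetical `rows` list on a single machine, whereas Spark distributes this work, and names such as `pairs`, `grouped`, and `tally_local` are made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample: each row is a list whose first field is the year.
rows = [["1999", "actor"], ["1999", "Comedian"], ["2000", "actor"]]

# Map step: turn every row into a (key, value) pair.
pairs = [(row[0], 1) for row in rows]

# ReduceByKey step: group values by key, then combine each group pairwise
# with the same function passed to reduceByKey (x + y).
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

tally_local = {key: reduce(lambda x, y: x + y, values)
               for key, values in grouped.items()}

print(tally_local)  # {'1999': 2, '2000': 1}
```

Unlike this sketch, printing the Spark result shows only an RDD reference, because both steps are transformations and no action has been called yet.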