Introduction to Spark with Python

Get a handle on using Python with Spark with this hands-on data processing tutorial.


You may notice that printing tally didn't return the histogram we were hoping for. Because of lazy evaluation, PySpark delayed executing the map and reduceByKey steps until we actually need their results. Before we use take() to preview the first few elements in tally, we'll walk through the code we just wrote.

tally = daily_show.map(lambda x: (x[0], 1)) \
                  .reduceByKey(lambda x, y: x+y)

During the map step, we used a lambda function to create a tuple consisting of:

  • key: x[0], the first value in the list
  • value: 1, the integer

Our high-level strategy was to create a tuple with the key representing the year and the value set to 1. After the map step, Spark will hold in memory a collection of tuples resembling the following:

('YEAR', 1)
('1991', 1)
('1991', 1)
('1991', 1)
('1991', 1)

and we'd like to reduce that down to:

('YEAR', 1)
('1991', 4)

reduceByKey(f) combines tuples that share the same key using the function f that we specify.
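
As a rough illustration of what reduceByKey(f) computes, here is a plain-Python sketch that runs locally without Spark. The helper name reduce_by_key is hypothetical and is not part of PySpark; it only mimics the per-key folding on a small in-memory list, whereas Spark performs this work across partitions.

```python
def reduce_by_key(pairs, f):
    """Local stand-in for reduceByKey: fold values that share a key using f."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = f(combined[key], value)
        else:
            combined[key] = value
    return list(combined.items())

# Rows shaped like the tutorial's data: the first field is the year.
rows = [['YEAR'], ['1991'], ['1991'], ['1991'], ['1991']]

# Map step: turn each row into a (key, 1) tuple.
pairs = [(row[0], 1) for row in rows]

# Reduce step: add up the 1s for each key.
tally_local = reduce_by_key(pairs, lambda x, y: x + y)
print(tally_local)
```

Because f is applied pairwise to the values of one key at a time, any function that combines two values into one (such as addition here) can be used.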

To see the results of these two steps, we'll use the take command, which forces lazy code to run immediately. Since tally is an RDD, we can't use Python's len function to find out how many elements are in the collection and will instead need to use the RDD count() function.

tally.take(tally.count())

[('YEAR', 1),
 ('2013', 166),
 ('2001', 157),
 ('2004', 164),
 ('2000', 169),
 ('2015', 100),
 ('2010', 165),
 ('2006', 161),
 ('2014', 163),
 ('2003', 166),
 ('2002', 159),
 ('2011', 163),
 ('2012', 164),
 ('2008', 164),
 ('2007', 141),
 ('2005', 162),
 ('1999', 166),
 ('2009', 163)]
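
Forcing evaluation with take(), as above, has a loose analogue in plain Python generators, sketched here without Spark. This is only an analogy and not how Spark is implemented: the names lazy_map and calls are made up for illustration, and itertools.islice plays the role that take() played above.

```python
from itertools import islice

calls = []  # records which elements have actually been processed

def lazy_map(f, items):
    """Generator: applies f only when a consumer asks for the next element."""
    for item in items:
        calls.append(item)
        yield f(item)

years = ['1991', '1991', '1992', '1993']
lazy_pairs = lazy_map(lambda y: (y, 1), years)

# Defining the pipeline did no work yet.
calls_before = len(calls)

# Asking for the first two elements forces just enough work to produce them.
first_two = list(islice(lazy_pairs, 2))
calls_after = len(calls)

print(calls_before, first_two, calls_after)
```

Like an RDD, the generator has no length, so calling len(lazy_pairs) would raise a TypeError.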


Unlike pandas, Spark knows nothing about column headers and didn't set them aside. We need a way to get rid of the element:

('YEAR', 1)

from our collection. While you may be tempted to try to find a way to remove this element from the RDD, recall that RDD objects are immutable and can't be changed once created. The only way to remove that tuple is to create a new RDD object without that tuple.

Spark comes with a function filter(f) that allows us to create a new RDD from an existing one containing only the elements meeting our criteria. Specify a function f that returns a Boolean value, True or False, and the resulting RDD will consist of the elements for which the function evaluated to True. Read more about the filter function over at Spark's documentation.

def filter_year(line):
    # Drop the header row; keep every data row.
    if line[0] == 'YEAR':
        return False
    else:
        return True

filtered_daily_show = daily_show.filter(lambda line: filter_year(line))
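
The same idea can be tried locally. The sketch below uses Python's built-in filter on an ordinary tuple as a stand-in for an RDD (the sample rows and the name keep_data_row are invented for illustration) and shows that filtering produces a new collection while leaving the original untouched.

```python
def keep_data_row(line):
    """Return False for the header row and True for every data row."""
    return line[0] != 'YEAR'

# Invented sample rows; a tuple stands in for the immutable RDD.
sample_rows = (
    ['YEAR', 'OCCUPATION'],
    ['1999', 'actor'],
    ['1999', 'comedian'],
)

# filter() builds a new sequence instead of modifying sample_rows.
filtered_rows = list(filter(keep_data_row, sample_rows))

print(len(sample_rows), len(filtered_rows))
print(filtered_rows)
```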

All together now

To flex Spark's muscles, we'll demonstrate how to chain together a series of data transformations into a pipeline and observe Spark managing everything in the background. Spark was written with this functionality in mind and is highly optimized for running tasks in succession. Previously, running lots of tasks in succession in Hadoop was incredibly time-consuming, since intermediate results needed to be written to disk and Hadoop wasn't aware of the full pipeline.

Thanks to Spark's aggressive use of memory (with disk only as a backup and for specific tasks) and its well-architected core, Spark is able to improve significantly on Hadoop's turnaround time. In the following code block, we'll filter out actors with no profession listed, lowercase each profession, generate a histogram of professions, and output the first 5 tuples in the histogram.

filtered_daily_show.filter(lambda line: line[1] != '') \
                   .map(lambda line: (line[1].lower(), 1)) \
                   .reduceByKey(lambda x, y: x+y) \
                   .take(5)

[('radio personality', 3),
 ('television writer', 1),
 ('american political figure', 2),
 ('former united states secretary of state', 6),
 ('mathematician', 1)]
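
For comparison, the same four steps can be sketched on a small in-memory list with the standard library. The rows here are invented for illustration, and collections.Counter stands in for the map and reduceByKey pair; unlike the Spark version, this runs eagerly on a single machine.

```python
from collections import Counter

# Invented sample rows: [year, profession].
guest_rows = [
    ['1999', 'Actor'],
    ['1999', ''],
    ['2000', 'actor'],
    ['2000', 'Comedian'],
    ['2001', 'Television Writer'],
]

# Step 1: drop rows with no profession listed.
with_profession = [row for row in guest_rows if row[1] != '']

# Step 2: lowercase each profession.
professions = [row[1].lower() for row in with_profession]

# Step 3: build the histogram of professions.
histogram = Counter(professions)

# Step 4: keep at most the first 5 (profession, count) tuples.
first_five = list(histogram.items())[:5]
print(first_five)
```

Each intermediate list here is fully built before the next step starts, which is the kind of intermediate work a chained Spark pipeline is designed to manage for us.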

Next steps

We hope this lesson has whetted your appetite for Spark and shown how PySpark lets us write the Python code we're familiar with while still taking advantage of distributed processing. When working with larger datasets, PySpark really shines, since it blurs the line between doing data science locally on your own computer and doing data science using large amounts of distributed computing on the internet (also referred to as the cloud).

If you enjoyed this post, check out part 2 on Dataquest where we learn more about transformations & actions in Spark.