How to Generate FiveThirtyEight Graphs in Python
By Alex Olteanu, Student Success Specialist at Dataquest.io
If you read data science articles, you may have already stumbled upon FiveThirtyEight's content. Naturally, you were impressed by their awesome visualizations. You wanted to make your own awesome visualizations and so asked Quora and Reddit how to do it. You received some answers, but they were rather vague. You still can't get the graphs done yourself.
In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.
We'll start here:
And, at the end of the tutorial, arrive here:
To follow along, you'll need at least some basic knowledge of Python. If you know the difference between methods and attributes, then you're good to go.
Introducing the dataset
We'll work with data describing the percentages of Bachelors conferred to women in the US from 1970 to 2011. We'll use a dataset compiled by data scientist Randal Olson, who collected the data from the National Center for Education Statistics.
If you want to follow along by writing code yourself, you can download the data from Randal's blog. To save yourself some time, you can skip downloading the file, and just pass the direct link to pandas' read_csv() function. In the following code cell, we:
- Import the pandas module.
- Assign the direct link to the dataset as a string to a variable named direct_link.
- Read in the data by using read_csv(), and assign the content to women_majors.
- Print information about the dataset by using the info() method. We're looking for the number of rows and columns, and checking for null values at the same time.
- Show the first five rows to better understand the structure of the dataset by using the head() method.
import pandas as pd
direct_link = 'http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv'
women_majors = pd.read_csv(direct_link)
print(women_majors.info())
women_majors.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 18 columns):
Year                             42 non-null int64
Agriculture                      42 non-null float64
Architecture                     42 non-null float64
Art and Performance              42 non-null float64
Biology                          42 non-null float64
Business                         42 non-null float64
Communications and Journalism    42 non-null float64
Computer Science                 42 non-null float64
Education                        42 non-null float64
Engineering                      42 non-null float64
English                          42 non-null float64
Foreign Languages                42 non-null float64
Health Professions               42 non-null float64
Math and Statistics              42 non-null float64
Physical Sciences                42 non-null float64
Psychology                       42 non-null float64
Public Administration            42 non-null float64
Social Sciences and History      42 non-null float64
dtypes: float64(17), int64(1)
memory usage: 6.0 KB
None
| | Year | Agriculture | Architecture | Art and Performance | Biology | Business | Communications and Journalism | Computer Science | Education | Engineering | English | Foreign Languages | Health Professions | Math and Statistics | Physical Sciences | Psychology | Public Administration | Social Sciences and History |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1970 | 4.229798 | 11.921005 | 59.7 | 29.088363 | 9.064439 | 35.3 | 13.6 | 74.535328 | 0.8 | 65.570923 | 73.8 | 77.1 | 38.0 | 13.8 | 44.4 | 68.4 | 36.8 |
1 | 1971 | 5.452797 | 12.003106 | 59.9 | 29.394403 | 9.503187 | 35.5 | 13.6 | 74.149204 | 1.0 | 64.556485 | 73.9 | 75.5 | 39.0 | 14.9 | 46.2 | 65.5 | 36.2 |
2 | 1972 | 7.420710 | 13.214594 | 60.4 | 29.810221 | 10.558962 | 36.6 | 14.9 | 73.554520 | 1.2 | 63.664263 | 74.6 | 76.9 | 40.2 | 14.8 | 47.6 | 62.6 | 36.1 |
3 | 1973 | 9.653602 | 14.791613 | 60.2 | 31.147915 | 12.804602 | 38.4 | 16.4 | 73.501814 | 1.6 | 62.941502 | 74.9 | 77.4 | 40.9 | 16.5 | 50.4 | 64.3 | 36.4 |
4 | 1974 | 14.074623 | 17.444688 | 61.9 | 32.996183 | 16.204850 | 40.5 | 18.9 | 73.336811 | 2.2 | 62.413412 | 75.3 | 77.9 | 41.8 | 18.2 | 52.6 | 66.1 | 37.3 |
Besides the Year column, every other column name indicates the subject of a Bachelor's degree. Every data point in the Bachelor columns represents the percentage of Bachelor's degrees conferred to women. Thus, every row describes the percentage for various Bachelors conferred to women in a given year.
As mentioned before, we have data from 1970 to 2011. To confirm the upper limit, let's print the last five rows of the dataset by using the tail() method:
women_majors.tail()
| | Year | Agriculture | Architecture | Art and Performance | Biology | Business | Communications and Journalism | Computer Science | Education | Engineering | English | Foreign Languages | Health Professions | Math and Statistics | Physical Sciences | Psychology | Public Administration | Social Sciences and History |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37 | 2007 | 47.605026 | 43.100459 | 61.4 | 59.411993 | 49.000459 | 62.5 | 17.6 | 78.721413 | 16.8 | 67.874923 | 70.2 | 85.4 | 44.1 | 40.7 | 77.1 | 82.1 | 49.3 |
38 | 2008 | 47.570834 | 42.711730 | 60.7 | 59.305765 | 48.888027 | 62.4 | 17.8 | 79.196327 | 16.5 | 67.594028 | 70.2 | 85.2 | 43.3 | 40.7 | 77.2 | 81.7 | 49.4 |
39 | 2009 | 48.667224 | 43.348921 | 61.0 | 58.489583 | 48.840474 | 62.8 | 18.1 | 79.532909 | 16.8 | 67.969792 | 69.3 | 85.1 | 43.3 | 40.7 | 77.1 | 82.0 | 49.4 |
40 | 2010 | 48.730042 | 42.066721 | 61.3 | 59.010255 | 48.757988 | 62.5 | 17.6 | 79.618625 | 17.2 | 67.928106 | 69.0 | 85.0 | 43.1 | 40.2 | 77.0 | 81.7 | 49.3 |
41 | 2011 | 50.037182 | 42.773438 | 61.2 | 58.742397 | 48.180418 | 62.2 | 18.2 | 79.432812 | 17.5 | 68.426730 | 69.5 | 84.8 | 43.1 | 40.1 | 76.7 | 81.9 | 49.2 |
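The checks we did by eye with info() and tail() can also be made programmatically. Below is a minimal sketch that does not download anything: the small DataFrame named sample is a hypothetical stand-in built from a few values copied from the tables above (the first two and last two rows, three majors only). On the real women_majors object the same calls should apply unchanged.

```python
import pandas as pd

# Hypothetical stand-in for women_majors: four rows and three majors,
# with values copied from the head() and tail() tables above.
sample = pd.DataFrame({
    'Year': [1970, 1971, 2010, 2011],
    'Agriculture': [4.229798, 5.452797, 48.730042, 50.037182],
    'Engineering': [0.8, 1.0, 17.2, 17.5],
    'Psychology': [44.4, 46.2, 77.0, 76.7],
})

# Number of rows and columns, as reported in the info() summary.
n_rows, n_cols = sample.shape

# Total count of null values across every column.
total_nulls = int(sample.isnull().sum().sum())

# First and last year covered by the data.
year_range = (int(sample['Year'].min()), int(sample['Year'].max()))

print('Rows:', n_rows, 'Columns:', n_cols)
print('Null values:', total_nulls)
print('Years:', year_range)
```

Replacing sample with women_majors should report 42 rows and 18 columns, matching the info() output shown earlier.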
The context of our FiveThirtyEight graph
Almost every FTE graph is part of an article. The graphs complement the text by illustrating a little story, or an interesting idea. We'll need to be mindful of this while replicating our FTE graph.
To avoid digressing from our main task in this tutorial, let's just pretend we've already written most of an article about the evolution of gender disparity in US education. We now need to create a graph to help readers visualize the evolution of gender disparity for Bachelors where the situation was really bad for women in 1970. We've already set a threshold of 20%, and now we want to graph the evolution for every Bachelor where the percentage of women graduates was less than 20% in 1970.
Let's first identify those specific Bachelors. In the following code cell, we will:
- Use .loc, a label-based indexer, to:
  - select the first row (the one that corresponds to 1970);
  - select the items in the first row only where the values are less than 20; the Year field will be checked as well, but will obviously not be included because 1970 is much greater than 20.
- Assign the resulting content to under_20.
under_20 = women_majors.loc[0, women_majors.loc[0] < 20]
under_20

Agriculture           4.229798
Architecture         11.921005
Business              9.064439
Computer Science     13.600000
Engineering           0.800000
Physical Sciences    13.800000
Name: 0, dtype: float64
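If the one-liner with .loc feels dense, it may help to split it into its two steps: building a boolean mask from the first row, then using that mask to pick columns. The sketch below does this on a hypothetical two-row stand-in named sample, with values copied from the first rows of the table shown earlier; the intermediate names first_row and mask are ours, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for women_majors (values copied from the table above).
sample = pd.DataFrame({
    'Year': [1970, 1971],
    'Agriculture': [4.229798, 5.452797],
    'Education': [74.535328, 74.149204],
    'Engineering': [0.8, 1.0],
    'Psychology': [44.4, 46.2],
})

# Step 1: select the row labeled 0 (the year 1970) as a Series
# whose index holds the column names.
first_row = sample.loc[0]

# Step 2: compare every value in that row with the 20% threshold.
# The result is a boolean Series; Year (1970) is checked too and is False.
mask = first_row < 20

# Step 3: pass the mask as the column selector of .loc,
# which keeps only the columns where the mask is True.
under_20_sample = sample.loc[0, mask]

print(mask)
print(under_20_sample)
```

Here only Agriculture and Engineering pass the filter, because the stand-in contains just four majors.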
Using matplotlib's default style
Let's begin working on our graph. We'll first take a peek at what we can build by default. In the following code block, we will:
- Run the Jupyter magic %matplotlib to enable Jupyter and matplotlib to work together effectively, and add inline to have our graphs displayed inside the notebook.
- Plot the graph by using the plot() method on women_majors. We pass in to plot() the following parameters:
  - x - specifies the column from women_majors to use for the x-axis;
  - y - specifies the columns from women_majors to use for the y-axis; we'll use the index labels of under_20, which are stored in the .index attribute of this object;
  - figsize - sets the size of the figure as a tuple with the format (width, height) in inches.
- Assign the plot object to a variable named under_20_graph, and print its type to show that pandas uses matplotlib objects under the hood.
%matplotlib inline
under_20_graph = women_majors.plot(x='Year', y=under_20.index, figsize=(12, 8))
print('Type:', type(under_20_graph))
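Because plot() hands back a matplotlib object, we can keep working on the graph with regular matplotlib methods. The following is a minimal sketch meant to run outside Jupyter as well, so it selects the non-interactive Agg backend instead of using the %matplotlib magic. The sample DataFrame is a hypothetical stand-in with values copied from the tables above, and the y-axis label text is our own choice.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required

import pandas as pd
from matplotlib.axes import Axes

# Hypothetical stand-in for women_majors (values copied from the tables above).
sample = pd.DataFrame({
    'Year': [1970, 1971, 2010, 2011],
    'Agriculture': [4.229798, 5.452797, 48.730042, 50.037182],
    'Engineering': [0.8, 1.0, 17.2, 17.5],
})

# Same call pattern as in the tutorial: x column, y columns, figure size.
sample_graph = sample.plot(x='Year', y=['Agriculture', 'Engineering'],
                           figsize=(12, 8))

# The returned object is a matplotlib Axes (or a subclass of it).
is_axes = isinstance(sample_graph, Axes)
print('Is a matplotlib Axes:', is_axes)

# Since it is an Axes, the usual matplotlib methods are available.
sample_graph.set_ylabel('Percentage of degrees conferred to women')

# The parent figure carries the size we passed through figsize.
fig = sample_graph.get_figure()
width, height = fig.get_size_inches()
print('Figure size in inches:', width, height)
```

The exact class name printed by type() can differ between matplotlib versions, which is why the sketch checks with isinstance() rather than comparing names.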
Using matplotlib's fivethirtyeight style
The graph above has certain characteristics, like the width and color of the spines, the font size of the y-axis label, the absence of a grid, etc. All of these characteristics make up matplotlib's default style.
As a brief aside, it's worth mentioning that we'll use a few technical terms about the parts of a graph throughout this post. If you feel lost at any point, you can refer to the legend below.
Source: Matplotlib.org
Besides the default style, matplotlib comes with several built-in styles that we can use readily. To see a list of the available styles, we will:
- Import the matplotlib.style module under the name style.
- Explore the content of matplotlib.style.available (a predefined variable of this module), which contains a list of all the available built-in styles.
import matplotlib.style as style
style.available
['seaborn-deep', 'seaborn-muted', 'bmh', 'seaborn-white', 'dark_background', 'seaborn-notebook', 'seaborn-darkgrid', 'grayscale', 'seaborn-paper', 'seaborn-talk', 'seaborn-bright', 'classic', 'seaborn-colorblind', 'seaborn-ticks', 'ggplot', 'seaborn', '_classic_test', 'fivethirtyeight', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-whitegrid', 'seaborn-pastel', 'seaborn-poster']
You might have already observed that there's a built-in style called fivethirtyeight. Let's use this style, and see where that leads. For that, we'll use the aptly named use() function from the same matplotlib.style module (which we imported under the name style). Then we'll generate our graph using the same code as earlier.
style.use('fivethirtyeight')
women_majors.plot(x='Year', y=under_20.index, figsize=(12, 8))
Wow, that's a major change! With respect to our first graph, we can see that this one has a different background color, it has grid lines, there are no spines whatsoever, the weight and the font size of the major tick labels are different, etc.
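One thing to keep in mind is that use() changes the style for every graph drawn afterwards in the same session. If we only want the style for a single graph, matplotlib also offers a context manager, style.context(), which applies a style temporarily. This is an alternative to the approach in this tutorial, not something the tutorial itself relies on. The sketch below plots a few Engineering values copied from the tables above and uses the axes.grid setting to show that the style is switched on inside the block and restored afterwards.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import matplotlib.style as style

# Remember whether grid lines are enabled before touching any style.
grid_before = plt.rcParams['axes.grid']

# Inside the with block, the fivethirtyeight settings are active.
with style.context('fivethirtyeight'):
    grid_inside = plt.rcParams['axes.grid']
    fig, ax = plt.subplots(figsize=(12, 8))
    # Engineering percentages for 1970, 1971, 2010 and 2011 (from the tables).
    ax.plot([1970, 1971, 2010, 2011], [0.8, 1.0, 17.2, 17.5])
    plt.close(fig)

# After the block, the previous settings are back.
grid_after = plt.rcParams['axes.grid']

print('Grid before:', grid_before)
print('Grid inside:', grid_inside)
print('Grid after:', grid_after)
```

For the rest of this tutorial we keep using style.use(), since we want every graph that follows to share the same look.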
You can read a technical description of the fivethirtyeight style here. It should also give you a good idea about what code runs under the hood when we use this style. The author of the style sheet, Cameron Davidson-Pilon, discusses some of the characteristics here.
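We can also look at the style's settings without leaving Python. The matplotlib.style module exposes a dictionary named library that maps each style name to the settings it overrides. A minimal sketch of how we might list them follows; the variable name fte_settings is our own, and the exact set of keys may vary a little between matplotlib versions.

```python
import matplotlib.style as style

# Look up the settings that the fivethirtyeight style overrides.
fte_settings = style.library['fivethirtyeight']

# Print each overridden setting next to its value, sorted by name.
for name in sorted(fte_settings):
    print(name, '=', fte_settings[name])
```

Settings such as axes.grid and axes.facecolor in this listing correspond to the grid lines and the background color we noticed in the graph.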
For more on generating FiveThirtyEight graphs in Python, see the rest of the original article here.
Bio: Alex Olteanu is a Student Success Specialist at Dataquest.io. He enjoys learning and sharing knowledge, and is getting ready for the new AI revolution.
Original. Reposted with permission.