How to Generate FiveThirtyEight Graphs in Python



By Alex Olteanu, Student Success Specialist at Dataquest.io

If you read data science articles, you may have already stumbled upon FiveThirtyEight's content. Naturally, you were impressed by their awesome visualizations. You wanted to make your own awesome visualizations and so asked Quora and Reddit how to do it. You received some answers, but they were rather vague. You still can't get the graphs done yourself.

In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.

We'll start here:

And, at the end of the tutorial, arrive here:


To follow along, you'll need at least some basic knowledge of Python. If you know the difference between methods and attributes, then you're good to go.

Introducing the dataset

We'll work with data describing the percentages of bachelor's degrees conferred to women in the US from 1970 to 2011. We'll use a dataset compiled by data scientist Randal Olson, who collected the data from the National Center for Education Statistics.

If you want to follow along by writing code yourself, you can download the data from Randal's blog. To save yourself some time, you can skip downloading the file, and just pass in the direct link to pandas' read_csv() function. In the following code cell, we:

Import the pandas module.
Assign the direct link to the dataset as a string to a variable named direct_link.
Read in the data by using read_csv(), and assign the content to women_majors.
Print information about the dataset by using the info() method. We're looking for the number of rows and columns, and checking for null values at the same time.
Show the first five rows by using the head() method, to better understand the structure of the dataset.

import pandas as pd

direct_link = 'http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv'
women_majors = pd.read_csv(direct_link)

print(women_majors.info())
women_majors.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 18 columns):
Year                             42 non-null int64
Agriculture                      42 non-null float64
Architecture                     42 non-null float64
Art and Performance              42 non-null float64
Biology                          42 non-null float64
Business                         42 non-null float64
Communications and Journalism    42 non-null float64
Computer Science                 42 non-null float64
Education                        42 non-null float64
Engineering                      42 non-null float64
English                          42 non-null float64
Foreign Languages                42 non-null float64
Health Professions               42 non-null float64
Math and Statistics              42 non-null float64
Physical Sciences                42 non-null float64
Psychology                       42 non-null float64
Public Administration            42 non-null float64
Social Sciences and History      42 non-null float64
dtypes: float64(17), int64(1)
memory usage: 6.0 KB
None

	YEAR	AGRICULTURE	ARCHITECTURE	ART AND PERFORMANCE	BIOLOGY	BUSINESS	COMMUNICATIONS AND JOURNALISM	COMPUTER SCIENCE	EDUCATION	ENGINEERING	ENGLISH	FOREIGN LANGUAGES	HEALTH PROFESSIONS	MATH AND STATISTICS	PHYSICAL SCIENCES	PSYCHOLOGY	PUBLIC ADMINISTRATION	SOCIAL SCIENCES AND HISTORY
0	1970	4.229798	11.921005	59.7	29.088363	9.064439	35.3	13.6	74.535328	0.8	65.570923	73.8	77.1	38.0	13.8	44.4	68.4	36.8
1	1971	5.452797	12.003106	59.9	29.394403	9.503187	35.5	13.6	74.149204	1.0	64.556485	73.9	75.5	39.0	14.9	46.2	65.5	36.2
2	1972	7.420710	13.214594	60.4	29.810221	10.558962	36.6	14.9	73.554520	1.2	63.664263	74.6	76.9	40.2	14.8	47.6	62.6	36.1
3	1973	9.653602	14.791613	60.2	31.147915	12.804602	38.4	16.4	73.501814	1.6	62.941502	74.9	77.4	40.9	16.5	50.4	64.3	36.4
4	1974	14.074623	17.444688	61.9	32.996183	16.204850	40.5	18.9	73.336811	2.2	62.413412	75.3	77.9	41.8	18.2	52.6	66.1	37.3

Apart from the Year column, every column name indicates the subject of a bachelor's degree. Every data point in those columns represents the percentage of degrees in that subject conferred to women. Thus, every row describes, for a given year, the percentage of degrees conferred to women in each subject.

As mentioned before, we have data from 1970 to 2011. To confirm the upper limit, let's print the last five rows of the dataset by using the tail() method:

women_majors.tail()

	YEAR	AGRICULTURE	ARCHITECTURE	ART AND PERFORMANCE	BIOLOGY	BUSINESS	COMMUNICATIONS AND JOURNALISM	COMPUTER SCIENCE	EDUCATION	ENGINEERING	ENGLISH	FOREIGN LANGUAGES	HEALTH PROFESSIONS	MATH AND STATISTICS	PHYSICAL SCIENCES	PSYCHOLOGY	PUBLIC ADMINISTRATION	SOCIAL SCIENCES AND HISTORY
37	2007	47.605026	43.100459	61.4	59.411993	49.000459	62.5	17.6	78.721413	16.8	67.874923	70.2	85.4	44.1	40.7	77.1	82.1	49.3
38	2008	47.570834	42.711730	60.7	59.305765	48.888027	62.4	17.8	79.196327	16.5	67.594028	70.2	85.2	43.3	40.7	77.2	81.7	49.4
39	2009	48.667224	43.348921	61.0	58.489583	48.840474	62.8	18.1	79.532909	16.8	67.969792	69.3	85.1	43.3	40.7	77.1	82.0	49.4
40	2010	48.730042	42.066721	61.3	59.010255	48.757988	62.5	17.6	79.618625	17.2	67.928106	69.0	85.0	43.1	40.2	77.0	81.7	49.3
41	2011	50.037182	42.773438	61.2	58.742397	48.180418	62.2	18.2	79.432812	17.5	68.426730	69.5	84.8	43.1	40.1	76.7	81.9	49.2
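The checks we have done by eye so far (row and column counts, null values, and the range of years) can also be expressed in code. The following is only a minimal sketch: sample is a made-up stand-in for women_majors that holds a few columns and rows copied from the output above, so that the snippet runs without downloading anything. On the real dataset, the same calls would be made on women_majors.

```python
import pandas as pd

# Illustrative stand-in for women_majors: two degree columns, with the
# first two and last two rows copied from the head() and tail() output.
sample = pd.DataFrame({
    'Year': [1970, 1971, 2010, 2011],
    'Agriculture': [4.229798, 5.452797, 48.730042, 50.037182],
    'Engineering': [0.8, 1.0, 17.2, 17.5],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape

# isnull() flags missing cells; summing twice counts them over the whole frame
total_nulls = int(sample.isnull().sum().sum())

# The smallest and largest values in Year give the period covered
first_year = int(sample['Year'].min())
last_year = int(sample['Year'].max())

print(f'{n_rows} rows, {n_cols} columns, {total_nulls} nulls')
print(f'Years covered: {first_year} to {last_year}')
```

Reading the limits with min() and max() avoids relying on the rows being sorted by year, which head() and tail() implicitly assume.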

The context of our FiveThirtyEight graph

Almost every FTE graph is part of an article. The graphs complement the text by illustrating a little story, or an interesting idea. We'll need to be mindful of this while replicating our FTE graph.

To avoid digressing from our main task in this tutorial, let's just pretend we've already written most of an article about the evolution of gender disparity in US education. We now need to create a graph to help readers visualize the evolution of gender disparity for Bachelors where the situation was really bad for women in 1970. We've already set a threshold of 20%, and now we want to graph the evolution for every Bachelor where the percentage of women graduates was less than 20% in 1970.

Let's first identify those specific Bachelors. In the following code cell, we will:

Use .loc, a label-based indexer, to:
- select the first row (the one that corresponds to 1970);
- select the items in the first row only where the values are less than 20; the Year field will be checked as well, but will obviously not be included because 1970 is much greater than 20.
Assign the resulting content to under_20.

under_20 = women_majors.loc[0, women_majors.loc[0] < 20]
under_20

Agriculture           4.229798
Architecture         11.921005
Business              9.064439
Computer Science     13.600000
Engineering           0.800000
Physical Sciences    13.800000
Name: 0, dtype: float64
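If the one-liner above feels dense, it may help to split it into its intermediate steps. The sketch below does that on a one-row stand-in for women_majors (the names sample, first_row, mask and under_20_sample are made up for illustration, and the values are the 1970 figures shown earlier).

```python
import pandas as pd

# One-row stand-in for women_majors, using 1970 values from the output above
sample = pd.DataFrame({
    'Year': [1970],
    'Agriculture': [4.229798],
    'Biology': [29.088363],
    'Engineering': [0.8],
    'Education': [74.535328],
})

# Step 1: select the row labeled 0; the result is a Series indexed by column name
first_row = sample.loc[0]

# Step 2: compare every value with 20; the result is a boolean Series
mask = first_row < 20

# Step 3: use the boolean Series as the column selector for row 0
under_20_sample = sample.loc[0, mask]

print(mask)
print(list(under_20_sample.index))
```

The Year entry of mask is False, which is why the year never shows up among the selected columns.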

Using matplotlib's default style

Let's begin working on our graph. We'll first take a peek at what we can build by default. In the following code block, we will:

Run the Jupyter magic %matplotlib to enable Jupyter and matplotlib to work together effectively, and add inline to have our graphs displayed inside the notebook.
Plot the graph by using the plot() method on women_majors. We pass the following parameters to plot():
- x - specifies the column from women_majors to use for the x-axis;
- y - specifies the columns from women_majors to use for the y-axis; we'll use the index labels of under_20 which are stored in the .index attribute of this object;
- figsize - sets the size of the figure as a tuple with the format (width, height) in inches.
Assign the plot object to a variable named under_20_graph, and print its type to show that pandas uses matplotlib objects under the hood.

%matplotlib inline
under_20_graph = women_majors.plot(x = 'Year', y = under_20.index, figsize = (12,8))
print('Type:', type(under_20_graph))
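Because the object returned by plot() is a matplotlib Axes, any matplotlib method can be called on it afterwards. Here is a minimal sketch of that idea, assuming a small stand-in DataFrame named sample (values copied from the output above) and the non-interactive Agg backend so that it also runs outside Jupyter. The y-axis label text is purely illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter

import pandas as pd
from matplotlib.axes import Axes

# Illustrative stand-in for women_majors, with values from the output above
sample = pd.DataFrame({
    'Year': [1970, 1971, 1972],
    'Agriculture': [4.229798, 5.452797, 7.420710],
    'Engineering': [0.8, 1.0, 1.2],
})

ax = sample.plot(x='Year', y=['Agriculture', 'Engineering'], figsize=(12, 8))

# pandas hands back a matplotlib Axes object
is_axes = isinstance(ax, Axes)

# ...so regular matplotlib methods work on it
ax.set_ylabel('% of degrees conferred to women')  # illustrative label
n_lines = len(ax.get_lines())                     # one line per y column
width, height = ax.figure.get_size_inches()       # the figsize we passed in

print(is_axes, n_lines, width, height)
```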

Using matplotlib's fivethirtyeight style

The graph above has certain characteristics, like the width and color of the spines, the font size of the y-axis label, the absence of a grid, etc. All of these characteristics make up matplotlib's default style.

As a brief aside, it's worth mentioning that we'll use a few technical terms about the parts of a graph throughout this post. If you feel lost at any point, you can refer to the legend below.

Source: Matplotlib.org

Besides the default style, matplotlib comes with several built-in styles that we can use readily. To see a list of the available styles, we will:

Import the matplotlib.style module under the name style.
Explore the content of matplotlib.style.available (a predefined variable of this module), which contains a list of all the available built-in styles.

import matplotlib.style as style
style.available

['seaborn-deep',
 'seaborn-muted',
 'bmh',
 'seaborn-white',
 'dark_background',
 'seaborn-notebook',
 'seaborn-darkgrid',
 'grayscale',
 'seaborn-paper',
 'seaborn-talk',
 'seaborn-bright',
 'classic',
 'seaborn-colorblind',
 'seaborn-ticks',
 'ggplot',
 'seaborn',
 '_classic_test',
 'fivethirtyeight',
 'seaborn-dark-palette',
 'seaborn-dark',
 'seaborn-whitegrid',
 'seaborn-pastel',
 'seaborn-poster']

You might have already observed that there's a built-in style called fivethirtyeight. Let's use this style, and see where that leads. For that, we'll use the aptly named use() function from the same matplotlib.style module (which we imported under the name style). Then we'll generate our graph using the same code as earlier.

style.use('fivethirtyeight')
women_majors.plot(x = 'Year', y = under_20.index, figsize = (12,8))

Wow, that's a major change! With respect to our first graph, we can see that this one has a different background color, it has grid lines, there are no spines whatsoever, the weight and the font size of the major tick labels are different, etc.
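Note that style.use() changes the style for every graph drawn afterwards in the same session. If you would rather style a single graph, matplotlib also provides style.context(), which applies a style only inside a with block. The sketch below is one possible way to use it, again on a made-up stand-in DataFrame named sample and with the Agg backend so it runs outside Jupyter. It reads the axes.grid setting before, inside and after the block to show that the change is temporary.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import matplotlib.style as style
import pandas as pd

# Illustrative stand-in for women_majors, with values from the output above
sample = pd.DataFrame({
    'Year': [1970, 1971, 1972],
    'Agriculture': [4.229798, 5.452797, 7.420710],
    'Engineering': [0.8, 1.0, 1.2],
})

grid_before = plt.rcParams['axes.grid']

# The fivethirtyeight settings are active only inside this block
with style.context('fivethirtyeight'):
    grid_inside = plt.rcParams['axes.grid']
    ax = sample.plot(x='Year', y=['Agriculture', 'Engineering'], figsize=(12, 8))

# Once the block ends, the previous settings are restored
grid_after = plt.rcParams['axes.grid']

print(grid_before, grid_inside, grid_after)
```

For the rest of this tutorial the global style.use() call is what we want, since every graph should share the same look.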

You can read a technical description of the fivethirtyeight style here - it should also give you a good idea about what code runs under the hood when we use this style. The author of the style sheet, Cameron Davidson-Pilon, discusses some of the characteristics here.
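You can also inspect the style from within Python. As a minimal sketch: matplotlib.style exposes a library dictionary that maps each style name to the settings it overrides, so printing the fivethirtyeight entry lists what changes when the style is applied. The variable name fte_settings is made up for illustration, and the exact settings listed may differ between matplotlib versions.

```python
import matplotlib.style as style

# style.library maps each style name to the rcParams that the style overrides
fte_settings = style.library['fivethirtyeight']

# Print every overridden setting together with its value
for name in sorted(fte_settings):
    print(f'{name}: {fte_settings[name]}')
```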

For more on generating FiveThirtyEight graphs in Python, see the rest of the original article here.

Bio: Alex Olteanu is a Student Success Specialist at Dataquest.io. He enjoys learning and sharing knowledge, and is getting ready for the new AI revolution.

Original. Reposted with permission.
