
How to Generate FiveThirtyEight Graphs in Python





By Alex Olteanu, Student Success Specialist at Dataquest.io

If you read data science articles, you may have already stumbled upon FiveThirtyEight's content. Naturally, you were impressed by their awesome visualizations. You wanted to make your own awesome visualizations and so asked Quora and Reddit how to do it. You received some answers, but they were rather vague. You still can't get the graphs done yourself.

In this post, we'll help you. Using Python's matplotlib and pandas, we'll see that it's rather easy to replicate the core parts of any FiveThirtyEight (FTE) visualization.

We'll start here:

[Figure: the graph produced with matplotlib's default style]

And, at the end of the tutorial, arrive here:

[Figure: the final FiveThirtyEight-style graph]

To follow along, you'll need at least some basic knowledge of Python. If you know the difference between methods and attributes, then you're good to go.

 

Introducing the dataset

 
We'll work with data describing the percentages of bachelor's degrees conferred to women in the US from 1970 to 2011. We'll use a dataset compiled by data scientist Randal Olson, who collected the data from the National Center for Education Statistics.

If you want to follow along by writing code yourself, you can download the data from Randal's blog. To save yourself some time, you can skip downloading the file, and just pass in the direct link to pandas' read_csv() function. In the following code cell, we:

  • Import the pandas module.
  • Assign the direct link to the dataset as a string to a variable named direct_link.
  • Read in the data by using read_csv(), and assign the content to women_majors.
  • Print information about the dataset by using the info() method. We're looking for the number of rows and columns, and checking for null values at the same time.
  • Show the first five rows by using the head() method, to better understand the structure of the dataset.
import pandas as pd

direct_link = 'http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv'
women_majors = pd.read_csv(direct_link)

print(women_majors.info())
women_majors.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 18 columns):
Year                             42 non-null int64
Agriculture                      42 non-null float64
Architecture                     42 non-null float64
Art and Performance              42 non-null float64
Biology                          42 non-null float64
Business                         42 non-null float64
Communications and Journalism    42 non-null float64
Computer Science                 42 non-null float64
Education                        42 non-null float64
Engineering                      42 non-null float64
English                          42 non-null float64
Foreign Languages                42 non-null float64
Health Professions               42 non-null float64
Math and Statistics              42 non-null float64
Physical Sciences                42 non-null float64
Psychology                       42 non-null float64
Public Administration            42 non-null float64
Social Sciences and History      42 non-null float64
dtypes: float64(17), int64(1)
memory usage: 6.0 KB
None


YEAR AGRICULTURE ARCHITECTURE ART AND PERFORMANCE BIOLOGY BUSINESS COMMUNICATIONS AND JOURNALISM COMPUTER SCIENCE EDUCATION ENGINEERING ENGLISH FOREIGN LANGUAGES HEALTH PROFESSIONS MATH AND STATISTICS PHYSICAL SCIENCES PSYCHOLOGY PUBLIC ADMINISTRATION SOCIAL SCIENCES AND HISTORY
0 1970 4.229798 11.921005 59.7 29.088363 9.064439 35.3 13.6 74.535328 0.8 65.570923 73.8 77.1 38.0 13.8 44.4 68.4 36.8
1 1971 5.452797 12.003106 59.9 29.394403 9.503187 35.5 13.6 74.149204 1.0 64.556485 73.9 75.5 39.0 14.9 46.2 65.5 36.2
2 1972 7.420710 13.214594 60.4 29.810221 10.558962 36.6 14.9 73.554520 1.2 63.664263 74.6 76.9 40.2 14.8 47.6 62.6 36.1
3 1973 9.653602 14.791613 60.2 31.147915 12.804602 38.4 16.4 73.501814 1.6 62.941502 74.9 77.4 40.9 16.5 50.4 64.3 36.4
4 1974 14.074623 17.444688 61.9 32.996183 16.204850 40.5 18.9 73.336811 2.2 62.413412 75.3 77.9 41.8 18.2 52.6 66.1 37.3


Besides the Year column, every other column name indicates the subject of a bachelor's degree. Every data point in these columns represents the percentage of bachelor's degrees in that subject conferred to women. Thus, every row describes, for a given year, the percentage of degrees conferred to women in each subject.

As mentioned before, we have data from 1970 to 2011. To confirm the upper limit, let's print the last five rows of the dataset by using the tail() method:

women_majors.tail()


YEAR AGRICULTURE ARCHITECTURE ART AND PERFORMANCE BIOLOGY BUSINESS COMMUNICATIONS AND JOURNALISM COMPUTER SCIENCE EDUCATION ENGINEERING ENGLISH FOREIGN LANGUAGES HEALTH PROFESSIONS MATH AND STATISTICS PHYSICAL SCIENCES PSYCHOLOGY PUBLIC ADMINISTRATION SOCIAL SCIENCES AND HISTORY
37 2007 47.605026 43.100459 61.4 59.411993 49.000459 62.5 17.6 78.721413 16.8 67.874923 70.2 85.4 44.1 40.7 77.1 82.1 49.3
38 2008 47.570834 42.711730 60.7 59.305765 48.888027 62.4 17.8 79.196327 16.5 67.594028 70.2 85.2 43.3 40.7 77.2 81.7 49.4
39 2009 48.667224 43.348921 61.0 58.489583 48.840474 62.8 18.1 79.532909 16.8 67.969792 69.3 85.1 43.3 40.7 77.1 82.0 49.4
40 2010 48.730042 42.066721 61.3 59.010255 48.757988 62.5 17.6 79.618625 17.2 67.928106 69.0 85.0 43.1 40.2 77.0 81.7 49.3
41 2011 50.037182 42.773438 61.2 58.742397 48.180418 62.2 18.2 79.432812 17.5 68.426730 69.5 84.8 43.1 40.1 76.7 81.9 49.2
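The checks we just did by eye with info() and tail() can also be written as code. The sketch below is only an illustration: summarize() is a hypothetical helper, not part of pandas, and the small sample frame is built from a few values copied from the output above so that the snippet runs without downloading the file.

```python
import pandas as pd

# A tiny sample copied from the rows printed above (1970, 1971, 2010, 2011).
# With the real data you would pass women_majors instead.
sample = pd.DataFrame({
    "Year": [1970, 1971, 2010, 2011],
    "Computer Science": [13.6, 13.6, 17.6, 18.2],
    "Engineering": [0.8, 1.0, 17.2, 17.5],
})

def summarize(df, year_col="Year"):
    """Hypothetical helper: collect the facts we read off info() and tail()."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "nulls": int(df.isnull().sum().sum()),      # total missing values
        "first_year": int(df[year_col].min()),
        "last_year": int(df[year_col].max()),
    }

summary = summarize(sample)
print(summary)
```

On the full dataset, the same call should report 42 rows, 18 columns, no null values, and a year range of 1970 to 2011, matching the output shown earlier.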


 

The context of our FiveThirtyEight graph

 
Almost every FTE graph is part of an article. The graphs complement the text by illustrating a little story, or an interesting idea. We'll need to be mindful of this while replicating our FTE graph.

To avoid digressing from our main task in this tutorial, let's just pretend we've already written most of an article about the evolution of gender disparity in US education. We now need to create a graph to help readers visualize that evolution for the bachelor's degrees in which women were most underrepresented in 1970. We've already set a threshold of 20%, and now we want to graph the evolution for every bachelor's degree in which women earned less than 20% of the degrees in 1970.

Let's first identify those specific degrees. In the following code cell, we will:

  • Use .loc, a label-based indexer, to:
    • select the first row (the one that corresponds to 1970);
    • select the items in the first row only where the values are less than 20; the Year field will be checked as well, but will obviously not be included because 1970 is much greater than 20.
  • Assign the resulting content to under_20.
under_20 = women_majors.loc[0, women_majors.loc[0] < 20]
under_20


Agriculture           4.229798
Architecture         11.921005
Business              9.064439
Computer Science     13.600000
Engineering           0.800000
Physical Sciences    13.800000
Name: 0, dtype: float64
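The selection above relies on 1970 being greater than 20 to keep Year out of the result. A slightly more defensive variant, sketched below, drops Year explicitly before comparing, so it would keep working with a different threshold. The one-row sample frame is an assumption made for illustration; its values are copied from the 1970 row printed earlier.

```python
import pandas as pd

# The 1970 row, copied from the head() output above.
sample = pd.DataFrame([{
    "Year": 1970, "Agriculture": 4.229798, "Architecture": 11.921005,
    "Art and Performance": 59.7, "Biology": 29.088363, "Business": 9.064439,
    "Communications and Journalism": 35.3, "Computer Science": 13.6,
    "Education": 74.535328, "Engineering": 0.8, "English": 65.570923,
    "Foreign Languages": 73.8, "Health Professions": 77.1,
    "Math and Statistics": 38.0, "Physical Sciences": 13.8,
    "Psychology": 44.4, "Public Administration": 68.4,
    "Social Sciences and History": 36.8,
}])

threshold = 20  # percent, the cutoff chosen in the text

# Take the first row as a Series and remove the Year label before filtering.
first_row = sample.loc[0].drop("Year")
under_20 = first_row[first_row < threshold]

print(under_20.index.tolist())
```

The resulting labels are the same six degrees listed above, and under_20.index can be passed to plot() in exactly the same way.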


 

Using matplotlib's default style

 
Let's begin working on our graph. We'll first take a peek at what we can build by default. In the following code block, we will:

  • Run the Jupyter magic %matplotlib to enable Jupyter and matplotlib to work together effectively, and add inline to have our graphs displayed inside the notebook.
  • Plot the graph by using the plot() method on women_majors. We pass the following parameters to plot():
    • x - specifies the column from women_majors to use for the x-axis;
    • y - specifies the columns from women_majors to use for the y-axis; we'll use the index labels of under_20 which are stored in the .index attribute of this object;
    • figsize - sets the size of the figure as a tuple with the format (width, height) in inches.
  • Assign the plot object to a variable named under_20_graph, and print its type to show that pandas uses matplotlib objects under the hood.
%matplotlib inline
under_20_graph = women_majors.plot(x = 'Year', y = under_20.index, figsize = (12,8))
print('Type:', type(under_20_graph))
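The %matplotlib inline magic only works inside Jupyter/IPython. If you are following along in a plain Python script, a rough equivalent is sketched below. It assumes a machine without a display, so it selects the non-interactive Agg backend, and it uses a small sample frame (values copied from the tables above) in place of the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no GUI available; render off-screen

import pandas as pd
from matplotlib.axes import Axes

# Small stand-in for women_majors, using values printed earlier.
sample = pd.DataFrame({
    "Year": [1970, 1971, 2010, 2011],
    "Computer Science": [13.6, 13.6, 17.6, 18.2],
    "Engineering": [0.8, 1.0, 17.2, 17.5],
})

# DataFrame.plot() returns the matplotlib Axes it drew on.
ax = sample.plot(x="Year", y=["Computer Science", "Engineering"], figsize=(12, 8))
print("Type:", type(ax))
print("Is a matplotlib Axes:", isinstance(ax, Axes))

# The parent Figure is reachable from the Axes, e.g. to check its size
# or to save it to disk.
fig = ax.get_figure()
print("Size in inches:", tuple(fig.get_size_inches()))
```

From here, fig.savefig() with a file name of your choice writes the graph to an image file, which replaces the inline display you get in a notebook.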


 

Using matplotlib's fivethirtyeight style

 
The graph above has certain characteristics, like the width and color of the spines, the font size of the y-axis label, the absence of a grid, etc. All of these characteristics make up matplotlib's default style.

As a short parenthesis, it's worth mentioning that we'll use a few technical terms about the parts of a graph throughout this post. If you feel lost at any point, you can refer to the legend below.

[Figure: anatomy of a matplotlib figure]

Source: Matplotlib.org

Besides the default style, matplotlib comes with several built-in styles that we can use readily. To see a list of the available styles, we will:

  • Import the matplotlib.style module under the name style.
  • Explore the content of style.available (a predefined variable of this module), which contains a list of all the available built-in styles.
import matplotlib.style as style
style.available


['seaborn-deep',
 'seaborn-muted',
 'bmh',
 'seaborn-white',
 'dark_background',
 'seaborn-notebook',
 'seaborn-darkgrid',
 'grayscale',
 'seaborn-paper',
 'seaborn-talk',
 'seaborn-bright',
 'classic',
 'seaborn-colorblind',
 'seaborn-ticks',
 'ggplot',
 'seaborn',
 '_classic_test',
 'fivethirtyeight',
 'seaborn-dark-palette',
 'seaborn-dark',
 'seaborn-whitegrid',
 'seaborn-pastel',
 'seaborn-poster']


You might have already observed that there's a built-in style called fivethirtyeight. Let's use this style, and see where that leads. For that, we'll use the aptly named use() function from the same matplotlib.style module (which we imported under the name style). Then we'll generate our graph using the same code as earlier.

style.use('fivethirtyeight')
women_majors.plot(x = 'Year', y = under_20.index, figsize = (12,8))


[Figure: the same graph drawn with the fivethirtyeight style]

Wow, that's a major change! With respect to our first graph, we can see that this one has a different background color, it has grid lines, there are no spines whatsoever, the weight and the font size of the major tick labels are different, etc.
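Note that style.use() changes matplotlib's settings for the rest of the session. If you would rather apply the style to a single graph, matplotlib also offers the style.context() context manager. The sketch below, meant to be run in a fresh session, uses it to compare a few settings (rcParams) with and without the style. The three keys inspected are just an illustrative selection, not a complete list of what the style changes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed

import matplotlib.pyplot as plt
import matplotlib.style as style

# A few settings related to the differences described above.
keys = ["axes.grid", "axes.facecolor", "lines.linewidth"]

before = {k: plt.rcParams[k] for k in keys}

# Inside the with-block the fivethirtyeight settings are active;
# they are restored automatically when the block ends.
with style.context("fivethirtyeight"):
    inside = {k: plt.rcParams[k] for k in keys}

after = {k: plt.rcParams[k] for k in keys}

for k in keys:
    print(k, ":", before[k], "->", inside[k])
print("Settings restored after the block:", after == before)
```

Any plotting code placed inside the with-block is drawn with the fivethirtyeight style, while graphs created afterwards go back to whatever style was active before.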

You can read a technical description of the fivethirtyeight style here - it should also give you a good idea about what code runs under the hood when we use this style. The author of the style sheet, Cameron Davidson-Pilon, discusses some of the characteristics here.

 
For more on generating FiveThirtyEight graphs in Python, see the rest of the original article here.

Bio: Alex Olteanu is a Student Success Specialist at Dataquest.io. He enjoys learning and sharing knowledge, and is getting ready for the new AI revolution.

Original. Reposted with permission.
