Data Visualization in Python: Matplotlib vs Seaborn
Seaborn and Matplotlib are two of Python's most powerful visualization libraries. Seaborn requires less syntax and ships with attractive default themes, while Matplotlib is more easily customizable because its underlying classes are directly accessible.
Python offers a variety of packages for plotting data. This tutorial will use the following packages to demonstrate Python's plotting capabilities:
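A minimal sketch of such an import cell. Only the Matplotlib PyPlot import and the `%matplotlib inline` magic are confirmed by the description that follows; pandas and NumPy are assumptions, added because the tutorial works with a dataframe.

```python
# Core plotting import described in this tutorial
import matplotlib.pyplot as plt

# Assumed companions: pandas for the dataframe, NumPy for numeric helpers
import numpy as np
import pandas as pd

# In a Jupyter/IPython session you would also run the magic below.
# It is not valid syntax in a plain .py script, hence the comment:
# %matplotlib inline
```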
In the above code chunk, we import the Matplotlib library's PyPlot module as plt. This makes it easier to execute commands, as we will see later in the tutorial.
PyPlot contains a range of commands required to create and edit plots.
%matplotlib inline is run so that each plot appears directly beneath the code chunk when it is executed. Otherwise, the user needs to call plt.show() every time a new plot is created. This functionality is exclusive to Jupyter Notebook/IPython. Matplotlib's highly customizable code structure makes it a great guide to other plotting libraries. Let's see how we can generate a scatter plot with Matplotlib.
A handy tip: when a plotting call is the last line of a notebook code chunk, the output includes a text representation of the object that the call returns, which can be visually unappealing. To suppress it, add a semicolon (;) at the end of the last line of the code chunk that generates the figure.
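A small sketch of where that stray text comes from, using made-up points rather than the tutorial's data. The plotting call returns an artist object, and a notebook echoes the last expression of a cell unless the line ends with a semicolon.

```python
import matplotlib.pyplot as plt

# plt.scatter() returns an artist object; a notebook prints the repr of the
# last expression in a cell, which is the text shown above the figure.
points = plt.scatter([1, 2, 3], [4, 5, 6])
print(type(points).__name__)  # PathCollection

# Ending the last line with a semicolon suppresses that echo in a notebook.
plt.scatter([1, 2, 3], [4, 5, 6]);
```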
The dataset used is the Bike Sharing Dataset from the UCI Machine Learning Repository.
Matplotlib: Scatter Plot
A scatter plot is one of the most informative and versatile plots in your arsenal. It can convey an array of information to the user without much work (as demonstrated below).
plt.scatter() gives us a scatter plot of the data we pass in as the initial arguments. temp is the x-axis and cnt is the y-axis. c determines the colors of the data points. Because we passed the string 'season', which is a column of the dataframe day, the colors correspond to the different seasons. This is a quick and easy way to group data visually.
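The call described above can be sketched as follows. The dataframe here is a randomly generated stand-in (an assumption, so the snippet runs without the UCI file); only the column names temp, cnt and season and the dataframe name day come from the tutorial.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bike sharing dataframe `day`
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "season": np.repeat([1, 2, 3, 4], 50),   # numeric season codes
    "temp": rng.uniform(0.05, 0.9, 200),     # normalized temperature
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

# With data=day, the strings are looked up as column names of the dataframe
points = plt.scatter("temp", "cnt", data=day, c="season")
plt.xlabel("Normalized temperature")
plt.ylabel("Total bike rentals");
```

Because c receives the numeric season codes, Matplotlib maps them through its default colormap, which gives one color per season but no legend.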
Let's see the information that it shows:
- There were more than 8000 bike rentals at some point in time.
- The normalized temperature has gone above 0.8.
- The number of bike rentals does not differ much with temperature or season.
- There is a positive linear relationship between bike rentals and normalized temperature.
This graph does indeed give us much information. However, it has no legend, which makes it difficult to decipher anything about the seasonal groups. A single plt.scatter() call made in this fashion does not produce a legend entry for each group. In the next section we will see how the above plot is hiding information and even misleading viewers.
Let's look at the same plot after thorough editing. The goal here is to produce a legend so that we can tell the groups apart.
plt.rcParams['figure.figsize'] = [15, 10] controls the size of the entire plot. This corresponds to a 15 × 10 plot (width × height, in inches).
fontdict is a dictionary that can be passed in as an argument when labeling the plot: fontdict for the title, fontdictx for the x-axis and fontdicty for the y-axis.
There are now four plt.scatter() function calls, one for each of the four seasons. This shows up again in the data argument, which has been subsetted to a single season in each call. The marker and color arguments set the marker to 'o' to represent each data point and give that marker its color.
plt.legend() is where we pass our arguments to make a legend. The first two arguments are handles, the actual plots to be represented in the legend, and labels, the names shown in the legend for each plot. scatterpoints is the number of marker points drawn in each legend entry for a scatter plot.
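bbox_to_anchor=(1, 0.7), loc=2, borderaxespad=1. These three arguments are used in tandem to set the location of the legend: loc=2 selects the legend's upper-left corner as its anchor point, bbox_to_anchor=(1, 0.7) places that corner at the given axes coordinates (here at the right edge of the plot area), and borderaxespad sets the padding between the axes and the legend border. See the Matplotlib legend documentation for the details of these arguments.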
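The edits described above can be sketched as follows. Several things here are assumptions made for illustration: the stand-in data, the season coding (1 to 4 as Spring, Summer, Autumn, Winter), the colors, the font settings and the label text. A loop stands in for the four separate plt.scatter() calls, and one fontdict is reused where the tutorial uses three.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bike sharing dataframe `day`
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "season": np.repeat([1, 2, 3, 4], 50),
    "temp": rng.uniform(0.05, 0.9, 200),
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

plt.rcParams["figure.figsize"] = [15, 10]  # width, height in inches

# Assumed season coding and colors, for illustration only
seasons = {1: ("Spring", "green"), 2: ("Summer", "red"),
           3: ("Autumn", "orange"), 4: ("Winter", "blue")}
fontdict = {"fontsize": 18, "weight": "bold"}

handles, labels = [], []
for code, (name, color) in seasons.items():
    subset = day[day["season"] == code]          # one season per call
    handle = plt.scatter("temp", "cnt", data=subset, marker="o", color=color)
    handles.append(handle)
    labels.append(name)

plt.title("Bike rentals by temperature and season", fontdict=fontdict)
plt.xlabel("Normalized temperature", fontdict=fontdict)
plt.ylabel("Total bike rentals", fontdict=fontdict)

# Upper-left corner of the legend (loc=2) sits at axes coordinates (1, 0.7)
legend = plt.legend(handles, labels, scatterpoints=1,
                    bbox_to_anchor=(1, 0.7), loc=2, borderaxespad=1.)
```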
Now we can distinguish the seasons to check for more underlying information. However, even after adding these extra layers, the plot can still hide information and be prone to misinterpretation.
The original plot:
- had data points overlapping each other.
- was cluttered.
- did not reveal any discernible seasonal differences in bike rentals.
- hid patterns such as bike rentals increasing in the spring and summer as temperatures rose.
The edited plot:
- shows an overall positive trend between total bike rentals and temperature.
- does not clearly show which season had the lowest temperatures in comparison.
Creating subplots is probably one of the most attractive and professional charting techniques in the industry. Subplots are necessary when a single plot is so overcrowded with information that it cannot be assessed in that state.
Faceting is the process of creating multiple plots of a graph that share the same axes. Faceting is one of the most versatile techniques of data visualization. Faceted plots can convey information in many dimensions and can reveal information that was previously hidden.
plt.figure() is used to create an empty plot canvas, as explained before. It is saved as fig. fig.add_subplot() is called four times, once for each season. Its arguments are nrows, ncols, index. For example, ax1 corresponds to the first plot of the figure (the index starts at 1 in the upper-left corner and increases to the right).
- The remaining function calls are either self-explanatory or have been previously covered.
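A minimal sketch of this faceting structure, assuming a 2 × 2 grid, a randomly generated stand-in for day, and a season coding of 1 to 4 as Spring, Summer, Autumn, Winter. A loop replaces the four explicit calls; the first iteration plays the role of ax1.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bike sharing dataframe `day`
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "season": np.repeat([1, 2, 3, 4], 50),
    "temp": rng.uniform(0.05, 0.9, 200),
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

season_names = {1: "Spring", 2: "Summer", 3: "Autumn", 4: "Winter"}  # assumed

fig = plt.figure(figsize=(15, 10))   # empty canvas
axes = []
for index, (code, name) in enumerate(season_names.items(), start=1):
    # Arguments are nrows, ncols, index; index 1 is the upper-left panel
    ax = fig.add_subplot(2, 2, index)
    ax.scatter("temp", "cnt", data=day[day["season"] == code])
    ax.set_title(name)
    axes.append(ax)
```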
Now we can analyze each group independently and, as we will see, more effectively. The first thing we should notice is that the relationship between temperature and bike rentals differs between seasons:
- Positive linear relationship in the Spring.
- Quadratic non-linear relationship in the Winter and Summer.
- Weak positive to no discernible relationship in Autumn.
However, there is again a chance of misleading viewers, and for less than obvious reasons. The axis ranges differ among the four plots, and most people will not realize that comparing panels drawn on different scales can lead to misleading insights. See below on how this issue can be fixed:
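One way to put every panel on the same scale is sketched below. It uses plt.subplots() with sharex and sharey, which is an assumption and not necessarily how the original grid was adjusted; the stand-in data and the season coding are also hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a different temperature range per season
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "season": np.repeat([1, 2, 3, 4], 50),
    "temp": rng.uniform(0.05, 0.5, 200) + np.repeat([0.0, 0.2, 0.4, 0.1], 50),
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

season_names = {1: "Spring", 2: "Summer", 3: "Autumn", 4: "Winter"}  # assumed

# sharex/sharey force all four panels onto identical axis ranges
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(15, 10))
for ax, (code, name) in zip(axes.flat, season_names.items()):
    ax.scatter("temp", "cnt", data=day[day["season"] == code])
    ax.set_title(name)
```

With shared axes, a season whose points sit on the left of its panel really is colder than one whose points sit on the right.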
This plot grid has now been adjusted so that every panel shares the same x-axis as Summer, because Summer has a wider temperature range. Interestingly, the data now shows us some new insights:
- Spring had the lowest temperatures.
- Fall/Autumn had the highest temperatures.
- The total number of bike rentals and temperature seem to have a quadratic relationship in the Summer and Autumn.
- Fewer bikes are rented in low temperatures regardless of season.
- There is a clear positive linear relationship between temperature and total bike rentals in the Spring.
- There seems to be a mild negative linear relationship between temperature and bike rentals in the Fall/Autumn.
Re-angling/juxtaposing the plots now shows another perspective:
- All seasons had over 8000 bike rentals at some point in time.
- There is a large clustering in Autumn and Spring compared to the other seasons.
- Winter and Summer had the most varied numbers of bike rentals.
Do not attempt to decipher a relationship between the variables from this angle. It can mislead you again: it now looks like there is a negative linear relationship between bike rentals and temperature in both Spring and Summer, and we saw before that this is not the case.
Real Python also has an intuitive tutorial on using Matplotlib.
The seaborn package is built on top of the Matplotlib library. It is used to create more attractive and informative statistical graphics. Although seaborn is a separate package, it can also be used to improve the look of Matplotlib graphics.
While matplotlib is great, we always want to do better. Run the code chunk below to import the seaborn library and create the previous plot and see what happens.
First we import the library with import seaborn as sns. The next line, sns.set(), loads seaborn's default theme and color palette into the session. Run the code below and watch the change in the chart area and the text.
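A minimal sketch of that step, using a few made-up points in place of the bike rentals plot. Printing one Matplotlib setting before and after the call shows that sns.set() works by changing Matplotlib's global defaults.

```python
import matplotlib.pyplot as plt
import seaborn as sns

before = plt.rcParams["axes.facecolor"]
sns.set()  # load seaborn's default theme and color palette
after = plt.rcParams["axes.facecolor"]
print(before, "->", after)

# Any Matplotlib plot created from here on picks up the seaborn look
plt.scatter([0.2, 0.5, 0.8], [2000, 5000, 7000]);
```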
Once we load seaborn into the session, every time a Matplotlib plot is executed, seaborn's default customizations are applied, as you see above. However, a problem that troubles many users is that the titles can overlap. Combined with Matplotlib's confusing naming convention for its titles, this becomes a nuisance. Nevertheless, the attractive visuals still make it usable for a data scientist's work.
To get the titles the way we want them and to have more customizability, we need to use the structure below. Note that this is only necessary if we use subtitles in our plots. Sometimes they are necessary, so it is better to have this structure on hand.
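The original structure is not reproduced here, so the following is only a sketch of one common arrangement. It assumes the overlap is between a figure-level title set with fig.suptitle() and the per-panel titles set with ax.set_title(); the title text and the 0.92 margin are illustrative choices.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Figure-level title, shared by all panels
main_title = fig.suptitle("Bike rentals vs. temperature", fontsize=18)

# Axes-level titles, one per panel
axes[0].set_title("Spring")
axes[1].set_title("Summer")

# Lay the panels out inside the bottom 92% of the figure so that the
# figure-level title keeps its own strip at the top
fig.tight_layout(rect=[0, 0, 1, 0.92])
```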
Going deeper into seaborn, we can recreate the above visualizations from the Bike Rentals dataset with fewer lines of code and similar syntax. Seaborn plots are still built with Matplotlib-style syntax, with relatively minor but noticeable syntactic differences.
For simplicity and better visuals, I am going to rename and relabel the 'season' column of the bike rentals dataset.
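A sketch of that renaming and relabeling step with pandas, on a tiny hand-made dataframe. The mapping from the numeric codes to season names is an assumption; check it against the dataset's documentation before relying on it.

```python
import pandas as pd

# Tiny hypothetical sample with the dataset's column names
day = pd.DataFrame({"season": [1, 2, 3, 4, 1],
                    "cnt": [1200, 4800, 7100, 3500, 1600]})

season_names = {1: "Spring", 2: "Summer", 3: "Autumn", 4: "Winter"}  # assumed

day = day.rename(columns={"season": "Season"})      # rename the column
day["Season"] = day["Season"].map(season_names)     # relabel the values

print(day["Season"].tolist())
# ['Spring', 'Summer', 'Autumn', 'Winter', 'Spring']
```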
Now that the 'Season' column is edited to our liking, we will continue onto creating a seaborn style visualization of the previous plots.
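A sketch of what a seaborn version of the grouped scatter plot can look like. It assumes sns.scatterplot() with the hue argument, which may differ from the function the original used, and it runs on a randomly generated stand-in for the relabeled dataframe.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the relabeled dataframe `day`
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "Season": np.repeat(["Spring", "Summer", "Autumn", "Winter"], 50),
    "temp": rng.uniform(0.05, 0.9, 200),
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

sns.set()  # seaborn's default theme

# hue colors the points by season and builds the legend automatically
ax = sns.scatterplot(data=day, x="temp", y="cnt", hue="Season")
ax.set(xlabel="Normalized temperature", ylabel="Total bike rentals",
       title="Bike rentals by temperature and season");
```

Compared with the Matplotlib version, the per-season subsetting and the explicit legend handles are no longer needed.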
The first noticeable difference is the default theme that seaborn presents when its default aesthetics are loaded into the session. The default theme, as you see directly above, is the result of sns.set_style('darkgrid') being applied in the background when sns.set() is called. As we will see, this is easily overridden to our liking with the readily available themes listed below:
The argument to sns.set_style() must be one of 'white', 'dark', 'whitegrid', 'darkgrid', 'ticks'. This controls the plot area, such as the background color, the grid and the presence of ticks.
The argument to sns.set_context() must be one of 'paper', 'notebook', 'talk', 'poster'. This scales the plot elements according to where the plot is to be read. 'poster', for example, gives enlarged elements and text, while 'talk' gives larger text and thicker lines than the default 'notebook'.
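A short sketch of both settings. sns.plotting_context() is used here only to inspect what each context would set, without applying it; the final two calls apply one arbitrary combination.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Inspect the base font size that each context would apply
sizes = {}
for context in ["paper", "notebook", "talk", "poster"]:
    sizes[context] = sns.plotting_context(context)["font.size"]
    print(context, sizes[context])

# Apply a style and a context to the session
sns.set_style("whitegrid")   # white background with grid lines
sns.set_context("poster")    # large elements, suited to a poster
```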
Now let's take a look at the same plot but with sns.set_context('paper', font_scale=2) and
Now we have finally recreated our previous matplotlib style plot with Seaborn using fewer lines of code and better resolution in my opinion. Let's take it one step further and facet the plot to finish:
To change the shape of the figures, change the aspect argument. aspect is the width-to-height ratio of each facet, so increasing it widens each facet (here, toward a more square shape). It works in tandem with height, so experiment with the size using both arguments.
To change the number of rows and columns, use the col_wrap argument. It works in tandem with the col argument: seaborn detects the number of categories in col and wraps the facets onto a new row after col_wrap columns.
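The faceting arguments described above can be sketched with sns.relplot(). The choice of relplot is an assumption (the original may have used another figure-level function), as are the stand-in data and the particular values of col_wrap, height and aspect.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the relabeled dataframe `day`
rng = np.random.default_rng(0)
day = pd.DataFrame({
    "Season": np.repeat(["Spring", "Summer", "Autumn", "Winter"], 50),
    "temp": rng.uniform(0.05, 0.9, 200),
})
day["cnt"] = (8000 * day["temp"] + rng.normal(0, 500, 200)).round()

# One facet per season; col_wrap=2 starts a new row after two columns.
# Each facet is `height` inches tall and `height * aspect` inches wide.
g = sns.relplot(data=day, x="temp", y="cnt",
                col="Season", col_wrap=2, height=4, aspect=1.5)
g.set_axis_labels("Normalized temperature", "Total bike rentals");
```

Figure-level functions like this one share the x and y axes across facets by default, so the panels are directly comparable.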
Note: Parts of this tutorial were used in a tutorial I prepared for the Victorian Institute of Technology