Creating Data Visualization in Matplotlib
Matplotlib is the most widely used data visualization library for Python. It is very powerful, but it has a steep learning curve. This overview covers a selection of plots useful for a wide range of data analysis problems and discusses how best to deploy each one so you can tell your data story.
By DataScience.com. Sponsored Post.
Experience with the specific topic: Novice
Professional experience: No industry experience
The reader should be familiar with basic data analysis concepts and have some experience with a programming language (Python is ideal but not required). The dataset used can be downloaded here. You will only need day.csv after unzipping the dataset.
Introduction to Data Visualization
Data visualization is a key part of any data science workflow, but it is frequently treated as an afterthought or an inconvenient extra step in reporting the results of an analysis. Taking such a stance is a mistake — as the cliché goes, a picture is worth a thousand words.
Data visualization should really be part of your workflow from the very beginning, as there is a lot of value and insight to be gained from just looking at your data. Summary statistics often don't tell the whole story; Anscombe's quartet is an unforgettable demonstration of this principle. Furthermore, the impact of an effective visualization is difficult to match with words and will go a long way toward ensuring that your work gets the recognition it deserves.
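To make the Anscombe's quartet point concrete, here is a minimal sketch that computes the usual summary statistics for the four datasets using only the standard library. The values are the commonly published quartet. The helper name pearson_r is my own, not part of any library.

```python
from math import sqrt
from statistics import mean

# The four x/y pairs of Anscombe's quartet (sets I-III share the same x values).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Mean of x, mean of y, and correlation for each dataset.
summary = {name: (mean(x), mean(y), pearson_r(x, y)) for name, (x, y) in anscombe.items()}

for name, (mx, my, r) in summary.items():
    print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four rows come out nearly identical, yet the datasets look completely different when plotted, which is exactly why looking at the data matters.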
In data visualization, there are three main types of variables:
- Quantitative: These are numerical data and represent a measurement. Quantitative variables can be discrete (e.g., units sold in 2016) or continuous (e.g., average units sold per person).
- Categorical: The values of these variables are names or labels. There is no inherent ordering to the labels. Examples of such variables are countries in a sales database and the names of products.
- Ordinal: Variables that can take on values that are ranked on an arbitrary numerical scale. The numerical index associated with each value has no meaning except to rank the values relative to each other. Examples include days of the week, levels of satisfaction (not satisfied, satisfied, very satisfied), and customer value (low, medium, high).
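The three variable types above map fairly naturally onto pandas data types. The sketch below is one way to represent them, assuming pandas is installed. The column names and values are made-up placeholders for illustration, not data from this tutorial.

```python
import pandas as pd

# Quantitative (discrete): plain numeric values that represent a measurement.
units_sold = pd.Series([12, 7, 30], name="units_sold")

# Categorical: labels with no inherent ordering.
country = pd.Series(["France", "Japan", "Brazil"], dtype="category", name="country")

# Ordinal: labels with a ranking; the order of `categories` defines the rank.
customer_value = pd.Series(
    pd.Categorical(
        ["medium", "low", "high"],
        categories=["low", "medium", "high"],
        ordered=True,
    ),
    name="customer_value",
)

# Sorting an ordered categorical respects the rank, not the alphabet.
ranked = list(customer_value.sort_values())
print(ranked)
```

Declaring the order explicitly is worthwhile because plotting and sorting functions will otherwise fall back to alphabetical order, which is rarely what you want for an ordinal variable.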
When visualizing data, the most important factor to keep in mind is the purpose of the visualization. This is what will guide you in choosing the best plot type. It could be that you are trying to compare two quantitative variables to each other. Maybe you want to check for differences between groups. Perhaps you are interested in the way a variable is distributed. Each of these goals is best served by different plots and using the wrong one could distort your interpretation of the data or the message that you are trying to convey. To that end, I have grouped the different plots we will cover by the situation that they are best suited for.
Another critical guiding principle is that simpler is almost always better. The most effective visualizations are often those that are easily digested, because the clarity of your thinking is reflected in the clarity of your work. Overly complicated visuals can also be misleading and hard to interpret, which might lead your audience to tune out your results. For these reasons, restrict your plots to two dimensions (unless a third is absolutely necessary), avoid visual noise (such as unnecessary tick marks, irrelevant annotations, and clashing colors), and make sure that everything is legible.
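As a rough illustration of cutting visual noise, the sketch below hides the top and right borders of a plot, removes the tick marks, and enlarges the tick labels slightly. The function name declutter and the specific choices (which borders to hide, the label size) are my own assumptions, so treat them as a starting point rather than a rule.

```python
import matplotlib.pyplot as plt


def declutter(ax):
    """Strip some common sources of visual noise from a matplotlib Axes."""
    # Hide the top and right borders; the left and bottom ones carry the scales.
    for side in ("top", "right"):
        ax.spines[side].set_visible(False)
    # Remove the tick marks but keep the tick labels, at a legible size.
    ax.tick_params(axis="both", length=0, labelsize=11)
    # Turn off grid lines.
    ax.grid(False)
    return ax
```

You would call it on any axes you have created, for example fig, ax = plt.subplots() followed by declutter(ax).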
Introduction to Matplotlib
Matplotlib is the leading visualization library in Python. It is powerful, flexible, and has a dizzying array of chart types for you to choose from. For new users, matplotlib often feels overwhelming. You could spend a long time tinkering with all of the options available, even if all you want to do is create a simple scatter plot.
This tutorial is intended to help you get up and running with matplotlib quickly. We will go over how to create the most commonly used plots, explain when you would want to use each one, and highlight the parameters that you are most likely to adjust. There are two main methods of interacting with matplotlib: the simpler pylab interface and the more complex pyplot one. We will focus on pyplot, even though it has the steeper learning curve, because it is the better way of accessing the full power of matplotlib. (The current matplotlib documentation also discourages the use of pylab, which is one more reason to learn pyplot.)
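To give a feel for what working with pyplot looks like, here is a minimal sketch of a simple scatter plot. The function name minimal_scatter, the axis labels, and the styling values are placeholders of my own; only the matplotlib calls themselves are real.

```python
import matplotlib.pyplot as plt


def minimal_scatter(x, y):
    """Create a basic scatter plot and return the figure and axes objects."""
    # plt.subplots() creates a figure and a single set of axes to draw on.
    fig, ax = plt.subplots(figsize=(6, 4))
    # s controls marker size; alpha adds transparency so overlapping points show.
    ax.scatter(x, y, s=20, alpha=0.7)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("A minimal scatter plot")
    return fig, ax
```

Calling fig, ax = minimal_scatter([1, 2, 3], [2, 4, 6]) and then plt.show() should display the figure. Keeping a reference to ax is what lets you adjust labels, limits, and styling afterwards.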
Example: Creating Visualizations in Matplotlib Using a Bikeshare System Dataset
For all examples shown, we will be using the daily version of the Capital Bikeshare System dataset from the UCI Machine Learning Repository. This dataset contains the daily count of bike rental checkouts in Washington, D.C.'s bikeshare program for 2011 and 2012. It also includes information about the weather and seasonal/temporal features for each day (such as whether it was a weekday).
Step 1: Identify Your Data
The object containing the dataset is called daily_data. This dataset contains a mix of categorical, quantitative, and ordinal variables. For this tutorial, only a subset of the available fields will be used, described and previewed below:
- dteday: Date of the record (YYYY-MM-DD format)
- weekday: Day of the week (0=Sunday, 6=Saturday)
- temp: Normalized temperature in Celsius
- windspeed: Normalized wind speed
- casual: Count of checkouts by casual/non-registered users
- registered: Count of checkouts by registered users
- cnt: Total checkouts
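One possible way to build daily_data with only the fields listed above is sketched below. It assumes pandas is installed and that day.csv sits in your working directory. The helper name load_daily_data and the FIELDS constant are hypothetical names of my own, not part of the dataset or of pandas.

```python
import pandas as pd

# The subset of columns used in this tutorial, in the order listed above.
FIELDS = ["dteday", "weekday", "temp", "windspeed", "casual", "registered", "cnt"]


def load_daily_data(path="day.csv"):
    """Read the daily bikeshare file and keep only the fields used here."""
    # usecols skips the unused columns; parse_dates turns dteday into datetimes.
    daily_data = pd.read_csv(path, usecols=FIELDS, parse_dates=["dteday"])
    # Reorder the columns so they match the field list.
    return daily_data[FIELDS]
```

With the file in place, daily_data = load_daily_data("day.csv") followed by daily_data.head() should give a quick preview of the first few rows.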