Annotated Heatmaps of a Correlation Matrix in 5 Simple Steps
A heatmap is a graphical representation of data in which values are represented as colors. That is, it uses color to communicate a value to the reader. It is a great tool for directing your audience to the areas that matter most when you have a large volume of data.
By Julia Kho, Data Scientist
In this article, I will guide you in creating your own annotated heatmap of a correlation matrix in 5 simple steps.
- Import Data
- Create Correlation Matrix
- Set Up Mask To Hide Upper Triangle
- Create Heatmap in Seaborn
- Export Heatmap
You can find the code from this article in my Jupyter Notebook located here.
1) Import Data
import pandas as pd
df = pd.read_csv("Highway1.csv", index_col=0)
This highway accidents data set contains the automobile accident rate, in accidents per million vehicle miles, along with several design variables. More information about the data set can be found here.
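As a minimal sketch of what index_col=0 does, the snippet below reads a tiny inline CSV rather than Highway1.csv. The values and the htype labels are made up for illustration and are not taken from the real data set.

```python
import io
import pandas as pd

# Hypothetical stand-in for Highway1.csv: the first (unnamed) column holds row labels.
csv_text = """,rate,len,htype
1,4.0,5.0,typeA
2,3.0,16.0,typeA
3,2.5,10.0,typeB
"""

# index_col=0 uses the first column as the row index instead of a data column.
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(demo_df.dtypes)  # rate and len are numeric, htype is text
print(demo_df.head())
```

Checking the dtypes right after loading is a quick way to see which columns will drop out of the correlation matrix in the next step.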
2) Create Correlation Matrix
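corr_matrix = df.corr(numeric_only=True)
We create the correlation matrix with .corr(). Notice that the htype column is not present in this matrix because it is not numeric. (The numeric_only=True argument requires pandas 1.5 or later; from pandas 2.0 on, calling .corr() without it raises an error when the data contains a non-numeric column.) To include htype in the correlation, we need to convert it into dummy variables and then recompute the matrix.
df_dummy = pd.get_dummies(df.htype)
df = pd.concat([df, df_dummy], axis=1)
corr_matrix = df.corr(numeric_only=True)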
In addition, note that the correlation matrix is symmetric: the upper triangle mirrors the lower triangle. Thus, there is no need for our heatmap to show the entire matrix. We’ll hide the upper triangle in the next step.
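To make the dummy-variable step and the symmetry claim concrete, here is a small sketch on made-up data. The column names and values are assumptions for illustration only, and dtype=float is passed so the indicator columns are numeric regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Made-up data: two numeric columns and one categorical column.
toy = pd.DataFrame({
    "rate": [4.0, 3.0, 2.5, 5.0, 3.5, 2.0],
    "len": [5.0, 16.0, 10.0, 4.0, 8.0, 12.0],
    "htype": ["a", "b", "a", "c", "b", "c"],
})

# One indicator column per category, then swap them in for the text column.
dummies = pd.get_dummies(toy["htype"], dtype=float)
toy_numeric = pd.concat([toy.drop(columns="htype"), dummies], axis=1)

toy_corr = toy_numeric.corr()

# The matrix equals its own transpose, so one triangle carries all the information.
print(np.allclose(toy_corr.to_numpy(), toy_corr.to_numpy().T))
print(toy_corr.round(2))
```

Unlike the article's code, this sketch drops the original text column before computing the correlation, which avoids relying on numeric_only.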
3) Set Up Mask To Hide Upper Triangle
import numpy as np
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
Let’s break the above code down. np.zeros_like() returns an array of zeros with the same shape and type as the given array. By passing in the correlation matrix, we get an array of zeros with the same shape as the matrix.
The dtype=bool parameter overrides the data type, so our array is an array of booleans, all False. (Older tutorials write np.bool, an alias that was deprecated in NumPy 1.20 and removed in 1.24, so the built-in bool is the safer choice.)
np.triu_indices_from(mask) returns the indices for the upper triangle of the array, including the diagonal.
Now, we set the upper triangle to True.
mask[np.triu_indices_from(mask)] = True
Now, we have a mask that we can use to generate our heatmap.
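The mask is easier to picture on a small array. The sketch below uses a 4x4 identity matrix as a stand-in for the correlation matrix (only its shape matters) and also shows np.triu as an equivalent one-call alternative.

```python
import numpy as np

# Stand-in for a 4x4 correlation matrix; only its shape matters for the mask.
demo_corr = np.eye(4)

demo_mask = np.zeros_like(demo_corr, dtype=bool)   # all False to start
demo_mask[np.triu_indices_from(demo_mask)] = True  # upper triangle, diagonal included

print(demo_mask.astype(int))
# [[1 1 1 1]
#  [0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]]

# np.triu builds the same mask in one call.
same_mask = np.triu(np.ones_like(demo_corr, dtype=bool))
print(np.array_equal(demo_mask, same_mask))  # True
```

Seaborn hides the cells where the mask is True, so this mask also hides the diagonal. To keep the diagonal visible, np.triu_indices_from(demo_mask, k=1) selects only the cells strictly above it.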
4) Create Heatmap in Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(11, 15))
heatmap = sns.heatmap(corr_matrix,
                      mask=mask,
                      square=True,
                      linewidths=.5,
                      cmap='coolwarm',
                      cbar_kws={'shrink': .4, 'ticks': [-1, -.5, 0, 0.5, 1]},
                      vmin=-1,
                      vmax=1,
                      annot=True,
                      annot_kws={'size': 12})
# Add the column names as labels
ax.set_yticklabels(corr_matrix.columns, rotation=0)
ax.set_xticklabels(corr_matrix.columns)
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})
To create our heatmap, we pass in our correlation matrix from step 2 and the mask we created in step 3, along with custom parameters to make our heatmap look nicer. Here’s a description of the parameters if you are interested in understanding what each line does.
# Make each cell square-shaped
square=True,
# Set the width of the lines that divide each cell to .5
linewidths=.5,
# Map data values to the coolwarm color space
cmap='coolwarm',
# Shrink the legend size and label tick marks at [-1, -.5, 0, 0.5, 1]
cbar_kws={'shrink': .4, 'ticks': [-1, -.5, 0, 0.5, 1]},
# Set min value for color bar
vmin=-1,
# Set max value for color bar
vmax=1,
# Turn on annotations for the correlation values
annot=True,
# Set annotations to size 12
annot_kws={'size': 12})
# Add column names to the x labels
ax.set_xticklabels(corr_matrix.columns)
# Add column names to the y labels and rotate text to 0 degrees
ax.set_yticklabels(corr_matrix.columns, rotation=0)
# Show tick marks on bottom and left of heatmap
sns.set_style({'xtick.bottom': True}, {'ytick.left': True})
5) Export Heatmap
Now that you have the heatmap, let’s export it.
heatmap.get_figure().savefig('heatmap.png', bbox_inches='tight')
If you have a very large heatmap that doesn’t export correctly, use bbox_inches='tight' to prevent your image from being cut off.
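Putting the five steps together, the following is a rough end-to-end sketch that runs without Highway1.csv. It assumes random stand-in data with hypothetical column names, a non-interactive matplotlib backend, and an output file name chosen for illustration; it is not the code from the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data (not the highway data set); column names are hypothetical.
rng = np.random.default_rng(0)
sketch_df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

# Correlation matrix, then a mask that is True on the upper triangle and diagonal.
sketch_corr = sketch_df.corr()
sketch_mask = np.triu(np.ones(sketch_corr.shape, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(sketch_corr, mask=sketch_mask, square=True, linewidths=0.5,
            cmap="coolwarm", vmin=-1, vmax=1, annot=True,
            annot_kws={"size": 12},
            cbar_kws={"shrink": 0.4, "ticks": [-1, -0.5, 0, 0.5, 1]}, ax=ax)

# Save through the Figure object; bbox_inches="tight" crops to the drawn content
# so labels at the edges are not clipped.
fig.savefig("sketch_heatmap.png", bbox_inches="tight")
plt.close(fig)
```

Fixing vmin and vmax at -1 and 1 keeps zero correlation at the midpoint of the diverging coolwarm palette, so the colors stay comparable between heatmaps.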
Thanks for reading! Feel free to share heatmaps that you’ve made with your data in the comments below.
Bio: Julia Kho is a Data Scientist passionate about creative problem solving and telling stories with data. She has previous experience in environmental consulting and working with spatial data.
Original. Reposted with permission.