KDnuggets Home » News » 2016 » Mar » Tutorials, Overviews » 4 Lessons for Brilliant Data Visualization ( 16:n10 )

4 Lessons for Brilliant Data Visualization

Get some pointers on data visualization from a noted expert in the field, and gain some insight into creating your own brilliant visualizations by following these 4 lessons.

By Dr. Randal Olson.

Last year, the vaccination debate was all the rage again. “Pro-vaxxers” were loudly proclaiming that everyone should get vaccinated and discussing the science behind it, and “anti-vaxxers” were casting their doubts and still refusing to get vaccinated for personal reasons. Around that time, The Wall Street Journal released a brilliant series of heat maps showing infection rates for various diseases over time, broken down by state. These heat maps easily demonstrated one of the most important facts in the vaccination debate: Time and time again, vaccines work.

WSJ Polio visualization

Today, I would like to revisit the WSJ’s heat maps through the lens of a data visualization practitioner. In particular, I would like to show how these heat maps can possibly be improved upon by reviewing some basic rules of data visualization, and trying out some other methods for displaying the data. Below, I’m going to walk through four major criticisms and show how addressing them can possibly improve the original work.

For the curious, I’ve released my notebook with the Python code used to generate the new visualizations.

Categorical color palettes should not be used to display continuous values

Perhaps one of the most straining issues with the original WSJ heat maps was their use of a custom categorical color palette to display the infection rates. The palette runs through most of the colors of the rainbow at seemingly-random intervals. It’s possible that they calculated the quantiles to determine the ranges for the color bins (as they should!), but that wasn’t indicated in their methodology.

In any case, it’s rarely a good idea to use multiple colors to display a single continuous variable. Here, all we want to do is use color to show the infection rates for each year. If we use more than one color, our readers have to constantly refer back to the legend to figure out what each color means, which is an unnecessary cognitive strain on our reader. Instead, we should use a single-color sequential palette, where lighter shades indicate lower values and darker shades indicate higher values. I’ve reworked the Polio heat map to do just that below.

Polio cases heatmap

One exception to this “rule,” of course, is diverging color palettes. If there is a clear divide in our continuous variable — for example, if we’re displaying gains and losses for a company — then it could be appropriate to use a diverging color palette with one color to represent gains (values >= $0) and another to represent losses (values <$0). Just for fun, I recreated the same chart above for Measles so we can compare it to the originals on WSJ.

Measles cases heatmap

Multi-hue color palettes should take color blindness into account

Color blindness is probably one of the most-overlooked issues in data visualization, and the WSJ heat maps are a great example. I ran the WSJ heat map above through a color blindness simulator for red-green color blindness — the most common form of color blindness — and below is the (low-resolution) result.

WSJ polio data visualization

Disastrous! Much of the color gradient is lost in some yellow/grey abyss, and the dark purple colors represent low values whereas the lighter yellow and dark grey colors represent higher values. This color palette survives better than most and the main message is still (mostly) communicated, but the WSJ color palette is certainly far from ideal here.

For comparison, I ran my rework from above through the same red-green color blindness simulator. As we can see, the simple sequential color palette is practically unaffected by this form of color blindness. Problem solved!

Polio cases heatmap

The main lesson here is that we should always run our color palette through a color blindness simulator before committing to it. Roughly 5% of our audience will experience our data visualizations through that lens.