4 Lessons for Brilliant Data Visualization
Get some pointers on data visualization from a noted expert in the field, and gain some insight into creating your own brilliant visualizations by following these 4 lessons.
Color can’t display specific values very well
One of the major drawbacks of heat maps is that they rely on color to communicate the specific values in each cell. While it’s not always important to display a precise value, there can sometimes be important trends hiding in these small differences. For that reason, I reworked the Polio heat map into a simple line chart below, where each light line is a state and the dark line is the median value between all the states for each year.
The above chart isn’t too useful, and the data is too messy to make much sense of the state-by-state trends. However, the decline in infection rates after the introduction of the vaccine is abundantly clear even in this case.
No post of mine is complete without small multiples, so let’s give that a try. Below, each state has its own chart, and all 50 states (+ D.C.) are put on the same time axis.
Each line tells its own story, and these are stories that were masked in the heat maps. Small multiples allow use to see specific state-by-state trends, for example, Polio outbreaks were already on the decline in South Dakota even before the introduction of the Polio vaccine. Meanwhile, Polio outbreaks were at their worst in New Hampshire just prior to the introduction of the Polio vaccine, which made short order of Polio immediately thereafter.
We should always ask ourselves when designing data visualizations: Do we care about the broader story, or the smaller stories? In this case we could go either way, but the direction we go depends on the story we want to tell.
Sometimes you can show too much data
Another fair criticism of all the data visualizations shown so far is that they show too much data. After all, the main message of the WSJ heat maps was simple: When introduced to human populations, vaccines work. There’s no need to show the state-by-state trends then; in fact, we may be overwhelming our reader by providing too much data that doesn’t get right to the point. For example, what happened with Polio in Utah, with the infection rate more than doubling after the introduction of the Polio vaccine? Or what about South Dakota, where Polio seems to have been mostly eliminated even before the vaccines were made available?
These outliers are distractions to the overall trend. We can overcome these distractions by applying a simple statistical analysis to the data, and show the overall trend with confidence bounds. Below, I’ve done just that by plotting the median Polio infection rate across all states (dark line) with bootstrapped 95% confidence intervals (shaded area).
By summarizing the data with some basic statistics, we’ve removed the distractions and gotten straight to the point: Overall in the U.S., Polio outbreaks were on the rise from the 1940s onward. Right at the introduction of the Polio vaccine in 1955, we immediately saw a decline in Polio outbreaks until it was practically eliminated in the 1960s.
Again, we should always consider our story when designing data visualizations. If we have one clear story that we want to communicate, we should consider reducing the amount of data we show to the point that we can effectively — and honestly — communicate our story. There’s no point in confusing our reader with unnecessary details, unless those details contain an important caveat.
Of course, at face value these charts only demonstrate correlatione: When vaccinations were introduced to the population, the prevalence of infectious disease decreased shortly thereafter. I believe it’s important to point out here that even though I want to focus on data visualization techniques in this post, the science behind vaccination is not up for debate, and these charts are in fact demonstrating a proven causal relationship. Please don’t waste your time typing out “correlation != causation” in the comments.
To wrap up, these are the lessons we’ve drawn from revisiting the popularized vaccine visualizations:
- Use sequential color schemes when presenting continuous values
- Consider color blindness before committing to a color scheme
- When presenting specific values is important, don’t use color to represent those values
- Only show enough data to effectively and honestly tell your story
If you liked what you saw in this post and want to learn more, check out my Python data visualization video course that I made in collaboration with O’Reilly. In just one hour, I will cover these topics and much more, which will provide you with a strong starting point for your career in data visualization.
Bio: Dr. Randy Olson is an artificial intelligence researcher at the University of Pennsylvania, where he specializes in machine learning and data visualization in biomedical domain. Randy writes about his data science and AI projects on his blog, where he regularly explores fun projects that make these topics more accessible to non-experts.
Original. Reposted with permission.