Sound Data Science: Avoiding the Most Pernicious Prediction Pitfall

Data science and predictive analytics can provide huge value, but they can mislead and backfire if not used with fail-safe measures. The author gives examples of such problems and provides guidelines to avoid them.

In this excerpt from the updated edition of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition, I show that, although data science and predictive analytics’ explosive popularity promises meteoric value, a common misapplication readily backfires. The number crunching only delivers if a fundamental—yet often omitted—failsafe is applied.

Prediction is booming. Data scientists have the “sexiest job of the 21st century” (as Professor Thomas Davenport and US Chief Data Scientist D.J. Patil declared in 2012). Fueled by the data tsunami, we’ve entered a golden age of predictive discoveries. A frenzy of analysis churns out a bonanza of colorful, valuable, and sometimes surprising insights:[i]

• People who “like” curly fries on Facebook are more intelligent.

• Typing with proper capitalization indicates creditworthiness.

• Users of the Chrome and Firefox browsers make better employees.

• Men who skip breakfast are at greater risk for coronary heart disease.

• Credit card holders who go to the dentist are better credit risks.

• High-crime neighborhoods demand more Uber rides.

Look like fun? Before you dive in, be warned: This spree of data exploration must be tamed with strict quality control. It’s easy to get it wrong, crash, and burn—or at least end up with egg on your face.

In 2012, a Seattle Times article led with an eye-catching predictive discovery: “An orange used car is least likely to be a lemon.”[ii] This insight came from a predictive analytics competition to detect which used cars are bad buys (lemons). While insights also emerged pertaining to other car attributes—such as make, model, year, trim level, and size—the apparent advantage of being orange caught the most attention. Responding to quizzical expressions, data wonks offered creative explanations, such as the idea that owners who select an unusual car color tend to have more of a “connection” to and take better care of their vehicle.

Examined alone, the “orange lemon” discovery appeared sound from a mathematical perspective. The specific result was that orange cars turn out to be lemons one third less often than average.

Put another way, if you buy a car that’s not orange, you increase your risk by 50% relative to buying an orange one.
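
The arithmetic linking "one third less often" to "50% more risk" can be checked with a short sketch. It assumes only the ratio quoted above; the average lemon rate cancels out, so none is needed. The variable names are illustrative, and treating "not orange" as equivalent to "average" is an approximation that holds only because orange cars are a tiny share of the data.

```python
from fractions import Fraction

# Orange cars are lemons one third less often than average,
# so their lemon rate is 2/3 of the average rate.
orange_vs_average = 1 - Fraction(1, 3)

# Flip the comparison: the average rate relative to the orange rate.
average_vs_orange = 1 / orange_vs_average

# Relative increase in risk when moving from an orange car to an average one.
relative_increase = average_vs_orange - 1

print(f"Average risk is {float(average_vs_orange):.2f}x the orange risk")
print(f"Relative increase: {float(relative_increase):.0%}")
```

The same gap looks smaller or larger depending on which rate serves as the baseline, which is why "one third less" and "50% more" describe a single result.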

Well-established statistics appeared to back up this “colorful” discovery. A formal assessment indicated it was statistically significant, meaning that the chances were slim this pattern would have appeared only by random chance. It seemed safe to assume the finding was sound. To be more specific, a standard mathematical test indicated there was less than a 1% chance this trend would show up in the data if orange cars weren’t actually more reliable.

But something had gone terribly wrong. The “orange car” insight later proved inconclusive. The statistical test had been applied in a flawed manner; the press had run with the finding prematurely. As data gets bigger, so does a potential pitfall in the application of common, established statistical methods.

The Little Gotcha of Big Data

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.

—Bertrand Russell

Big data brings big potential—but also big danger. With more data, a unique pitfall often dupes even the brightest of data scientists. This hidden hazard can undermine the process that evaluates for statistical significance, the gold standard of scientific soundness. And what a hazard it is! A bogus discovery can spell disaster. You may buy an orange car—or undergo an ineffective medical procedure—for no good reason. As the aphorisms tell us, bad information is worse than no information at all; misplaced confidence is seldom found again.

This peril seems paradoxical. If data is so valuable, why should we suffer from obtaining more and more of it? Statistics has long advised that having more examples is better. A longer list of cases provides the means to more scrupulously assess a trend. Can you imagine what the downside of more data might be? As you’ll see in a moment, it’s a thought-provoking, dramatic plot twist.

The fate of science—and sleeping well at night—depends on deterring the danger. The very notion of empirical discovery is at stake. To leverage the extraordinary opportunity of today’s data explosion, we need a surefire way to determine whether an observed trend is real, rather than a random artifact of the data. How can we reaffirm science’s trustworthy reputation?

Statistics approaches this challenge in a very particular way. It tells us the chances the observed trend could randomly appear even if the effect were not real. That is, it answers this question:[iii]

Question that statistics can answer: If orange cars were actually no more reliable than used cars in general, what would be the probability that this strong a trend—depicting orange cars as more reliable—would show in data anyway, just by random chance?

With any discovery in data, there’s always some possibility we’ve been Fooled by Randomness, as Nassim Taleb titled his compelling book. The book reveals the dangerous tendency people have to subscribe to unfounded explanations for their own successes and failures, rather than correctly attributing many happenings to sheer randomness. The scientific antidote to this failing is probability, which Taleb affectionately dubs “a branch of applied skepticism.”

Statistics is the resource we rely on to gauge probability. It answers the orange car question above by calculating the probability that what’s been observed in data would occur randomly if orange cars actually held no advantage. The calculation takes data size into account—in this case, there were 72,983 used cars varying across 15 colors, of which 415 were orange.[iv]

Calculated answer to the question: Under 0.68%

Looks like a safe bet. Common practice considers this risk acceptably remote, low enough to at least tentatively believe the data. But don’t buy an orange car just yet—or write about the finding in a newspaper for that matter.
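
For readers who want to see the shape of such a calculation, here is a minimal sketch of a one-sided test for equality of proportions, the method named in note [iv]. The car totals come from the article, but the lemon counts are hypothetical stand-ins chosen to match the "one third less often" ratio, and the function name is illustrative. The article does not say whether a continuity correction was used, so this sketch should not be expected to reproduce the 0.68% figure exactly.

```python
import math

def one_sided_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """P-value for H0: rate_a == rate_b versus H1: rate_a < rate_b (pooled z-test)."""
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    # Under the null hypothesis both groups share one rate, so pool them.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Lower-tail probability of the standard normal distribution at z.
    return 0.5 * math.erfc(-z / math.sqrt(2))

TOTAL_CARS = 72_983    # from the article
ORANGE_CARS = 415      # from the article
OTHER_CARS = TOTAL_CARS - ORANGE_CARS

# HYPOTHETICAL lemon counts, not taken from the article:
ORANGE_LEMONS = 33     # about 8% of orange cars
OTHER_LEMONS = 8_708   # about 12% of all other cars

p_value = one_sided_proportion_p_value(
    ORANGE_LEMONS, ORANGE_CARS, OTHER_LEMONS, OTHER_CARS
)
print(f"one-sided p-value: {p_value:.4f}")
```

With these assumed counts the p-value lands below 1%, the same order of magnitude as the published result. The sample of only 415 orange cars is what keeps the value from being far smaller.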

What Went Wrong: Accumulating Risk

In China when you’re one in a million, there are 1,300 people just like you.

—Bill Gates

So if there had only been a 1% long shot that we’d be misled by randomness, what went wrong?

The experimenters’ mistake was to not account for running many small risks, which had added up to one big one…
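
How small risks add up can be illustrated with a standard formula for the chance of at least one fluke across several tests. This is a sketch under assumptions the article does not state: the tests are treated as independent, each carries the same 1% risk, and the function name and test counts are illustrative (15 echoes the number of car colors mentioned earlier).

```python
def chance_of_false_discovery(num_tests, per_test_risk=0.01):
    """Probability that at least one of num_tests independent tests is fooled by chance."""
    # The chance that every test avoids a fluke is (1 - risk) ** num_tests;
    # the complement is the chance that at least one test is misled.
    return 1 - (1 - per_test_risk) ** num_tests

for num_tests in (1, 15, 100):
    risk = chance_of_false_discovery(num_tests)
    print(f"{num_tests:>3} tests -> {risk:.1%} chance of at least one fluke")
```

Under these assumptions, checking 15 candidate patterns raises the overall risk from 1% to roughly 14%, and checking 100 pushes it well past one half.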

The complete article is available as originally published in OR/MS Today.

[i] For more details on these findings, see the section on “Bizarre and Surprising Insights” within the Notes for my book, Predictive Analytics, available as a PDF online. For further reading on this article’s overall topic, see the section “Further Reading on Vast Search” within the same document.

[ii] This discovery was also featured by The Huffington Post, The New York Times, National Public Radio, The Wall Street Journal, and the New York Times Bestseller Big Data: A Revolution That Will Transform How We Live, Work, and Think.

[iii] The notion that orange cars have no advantage is called the null hypothesis. The probability the observed effect would occur in data if the null hypothesis were true is called the p-value. If the p-value is low enough—e.g., below 1% or 5%—then a researcher will typically reject the null hypothesis as too unlikely, and view this as support for the discovery, which is thereby considered statistically significant.

[iv] The applicable statistical method is a one-sided equality-of-proportions hypothesis test, which calculated the p-value as under 0.0068.

Originally published in OR/MS Today. Reposted with permission.