KDnuggets, December 2017 (17:n46)

4 Common Data Fallacies That You Need To Know


In this post you will find a list of common data fallacies that lead to incorrect conclusions and poor decision-making. You will also find resources and information so that you can always be reminded of these fallacies when you’re working with data.



By Simon Whittick, Geckoboard

As powerful as data can be, it can also be misleading. There are many tricks that data can play on us when we analyze it. These are commonly known as data fallacies -- myths and traps that lie within data. They ultimately lead to us drawing incorrect conclusions from data and making poor decisions.

The first step in avoiding these tricks is to be aware of them. That’s why we put together a guide to common data fallacies. We’ve also designed this poster for your workspace, so that you can always be reminded of these fallacies when you’re working with data. We’ve covered 15 in total, but I want to highlight four of the most common.

Data Dredging

This is also sometimes known as data fishing, data snooping, or p-hacking. It’s the practice of repeatedly testing new hypotheses against the same set of data, failing to acknowledge that most correlations will be the result of chance. Tests for statistical significance only work if you’ve defined your hypothesis upfront.

For example, this has been a big problem with clinical trials. Researchers have ‘data-dredged’ their results, repeatedly switching what they were testing for on the same set of results until they found a correlation between two variables, one that’s likely the result of chance. It helps explain why so many results published in scientific journals have subsequently been shown to be wrong. To avoid this, it’s now becoming standard practice to register clinical trials, stating in advance what the primary endpoint measure is.

[Image: Fallacy 1, Data Dredging]

To avoid falling for this fallacy, define your hypothesis upfront before analyzing data or testing for statistical significance.
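To see how easily chance produces “significant” results, here is a minimal simulation sketch. Everything in it is made up for illustration: the outcome is pure random noise, the 200 yes/no attributes are unrelated to it, and the function names (`two_sided_p_value`, `dredge`) are hypothetical. The p-value uses a simple normal approximation rather than a full t-test.

```python
import random
from statistics import NormalDist, mean, variance


def two_sided_p_value(group_a, group_b):
    """Approximate p-value for a difference in means (normal approximation)."""
    standard_error = (variance(group_a) / len(group_a)
                      + variance(group_b) / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_patients=400, n_attributes=200, alpha=0.05, seed=0):
    """Count attributes that look 'significant' although none has a real effect."""
    rng = random.Random(seed)
    # One fixed set of results: the outcome is pure noise.
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    false_positives = 0
    for _ in range(n_attributes):
        # Each new, unrelated yes/no attribute is tested against the same outcomes.
        has_attribute = [rng.random() < 0.5 for _ in range(n_patients)]
        with_attr = [y for y, flag in zip(outcome, has_attribute) if flag]
        without_attr = [y for y, flag in zip(outcome, has_attribute) if not flag]
        if two_sided_p_value(with_attr, without_attr) < alpha:
            false_positives += 1
    return false_positives


false_positives = dredge()
print(f"{false_positives} of 200 unrelated attributes passed p < 0.05")
```

With a 0.05 threshold you should expect roughly one in twenty of these meaningless comparisons to pass, which is why a hypothesis picked after looking at the results proves very little.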

False Causality

This can also be known as “cum hoc ergo propter hoc”, which is Latin for "with this, therefore because of this". That’s because this data fallacy is the false assumption that when two events occur together one must have caused the other. Correlation does not imply causation.

[Image: Fallacy 2, False Causality]

For example, global temperatures have steadily risen over the past 150 years and the number of pirates has declined at a comparable rate. No one would reasonably claim that the reduction in pirates caused global warming or that more pirates would reverse it.

But it’s not usually this clear-cut. Often, correlations between two things tempt us to believe that one caused the other. However, it’s often a coincidence or there’s a third factor causing both effects that you’re seeing. In our pirates and global warming example, the cause of both is industrialization. There are many more examples of false causality and Tyler Vigen does a great job of highlighting spurious correlation examples.

Never assume causation because of correlation alone – always gather more evidence and consider additional variables that might be causing both movements.
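The effect of a hidden third factor can be sketched with simulated data. The numbers below are invented, not real temperature or pirate counts: a made-up `industrialization` index drives both series, and the two series never influence each other. The helper names are hypothetical, and “controlling” for the third factor here simply means correlating what is left after a straight-line fit on it.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residuals(y, x):
    """What is left of y after removing a least-squares straight line in x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


rng = random.Random(42)
# Made-up index rising from 0 towards 1 over 150 "years".
industrialization = [year / 150 for year in range(150)]
# Both series depend only on the index plus independent noise.
temperature = [0.8 * i + rng.gauss(0, 0.05) for i in industrialization]
pirates = [1.0 - 0.9 * i + rng.gauss(0, 0.05) for i in industrialization]

raw = pearson(pirates, temperature)
controlled = pearson(residuals(pirates, industrialization),
                     residuals(temperature, industrialization))
print(f"raw correlation: {raw:.2f}")
print(f"after removing the shared factor: {controlled:.2f}")
```

The raw correlation is strongly negative, yet once the shared factor is removed there is little left to explain. In real data you rarely know the third factor in advance, so treat this as an illustration of the idea rather than a recipe.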

Overfitting

A more complex explanation will often describe your data better than a simple one. However, a simpler explanation is usually more representative of the underlying relationship. This is in essence what overfitting is -- it’s creating a model that’s overly tailored to the data you have and not representative of the general trend.

[Image: Fallacy 3, Overfitting]

When looking at data, you’ll want to understand what the underlying relationships are. To do this, you create a model that describes them mathematically. The problem is that a more complex model will fit your initial data better than a simple one. However, complex models tend to be very brittle: they work well for the data you already have, but try too hard to explain random variations. Therefore, as soon as you add more data, they break down. Simpler models are usually more robust and better at predicting future trends.
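Here is a small sketch of that brittleness, under stated assumptions: the true relationship is assumed to be a straight line (`2 * x + 1`) with random noise, the simple model is a least-squares line, and the complex model is a polynomial forced through every training point. All names and numbers are illustrative. Errors are averaged over many simulated datasets so that one lucky draw doesn’t decide the outcome.

```python
import random


def fit_line(xs, ys):
    """Simple model: least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Complex model: polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            weight = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    weight *= (x - xm) / (xj - xm)
            total += yj * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n_trials=200, seed=1):
    rng = random.Random(seed)
    train_x = [float(i) for i in range(10)]   # data you already have
    test_x = [i + 0.5 for i in range(9)]      # data you add later
    totals = {"line_train": 0.0, "line_test": 0.0,
              "poly_train": 0.0, "poly_test": 0.0}
    for _ in range(n_trials):
        # Assumed truth: y = 2x + 1, plus random variation.
        train_y = [2 * x + 1 + rng.gauss(0, 1) for x in train_x]
        test_y = [2 * x + 1 + rng.gauss(0, 1) for x in test_x]
        line = fit_line(train_x, train_y)
        poly = fit_interpolating_polynomial(train_x, train_y)
        totals["line_train"] += mse(line, train_x, train_y)
        totals["line_test"] += mse(line, test_x, test_y)
        totals["poly_train"] += mse(poly, train_x, train_y)
        totals["poly_test"] += mse(poly, test_x, test_y)
    return {name: total / n_trials for name, total in totals.items()}


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

The polynomial explains the original data perfectly, noise included, and pays for it on the new points, while the line makes small errors on both.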

For more specific examples of overfitting and how to remedy them, there’s a great overview and discussion here. Generally speaking, though, when first creating models, try to find the simplest possible hypothesis and avoid explaining random variations in your model.

Simpson’s Paradox

This is a statistical phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.

[Image: Fallacy 4, Simpson’s Paradox]

For example, in the 1970s, the University of California, Berkeley was accused of sexism because female applicants were less likely to be accepted than male ones. However, when trying to identify the source of the problem, they found that for individual subjects the acceptance rates were generally better for women than men. The paradox was caused by a difference in what subjects men and women were applying for. A greater proportion of the female applicants were applying to highly competitive subjects where acceptance rates were much lower for both genders. There are many more examples of this paradox, some of which are included in this video.

It’s important to be aware of this paradox so that you can identify when it is appearing in your data. When you do see it happening you need to get more context and go outside statistics to look for other variables which are causing it. In the above example this would be that women were applying for more competitive subjects than men.
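The arithmetic behind the reversal is easier to follow with numbers. The figures below are hypothetical and chosen to keep the sums simple; they are not the real Berkeley admissions data, and the two subject names are placeholders.

```python
# Hypothetical numbers, NOT the real Berkeley figures: (applicants, admitted).
applications = {
    "Subject A (less competitive)": {"men": (800, 480), "women": (100, 65)},
    "Subject B (highly competitive)": {"men": (200, 40), "women": (900, 225)},
}


def rate(applied, admitted):
    """Acceptance rate as a fraction."""
    return admitted / applied


def overall_rate(gender):
    """Acceptance rate for one gender with both subjects combined."""
    applied = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    return rate(applied, admitted)


for subject, groups in applications.items():
    print(subject, {g: f"{rate(*counts):.0%}" for g, counts in groups.items()})
print("Overall", {g: f"{overall_rate(g):.0%}" for g in ("men", "women")})
```

In this made-up example, women have the higher acceptance rate in each subject (65% against 60%, and 25% against 20%), yet the combined rate is 52% for men and 29% for women, because most of the women applied to the subject that rejects most applicants.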

Always be aware!

When analyzing data or running tests, be aware of these fallacies. When you’re working with data, take into consideration some of the below factors to reduce your chances of falling victim to data fallacies:

  • Ensure you have a hypothesis upfront before testing or analyzing data.
  • Question the data you’re looking at. How has it been gathered? Is there any potential bias or negative impact that the way it’s been gathered might have on your conclusions?
  • Consider what data or other variables you’re not seeing. Is there other research that might contradict what you’re seeing? Are there additional variables that aren’t being considered in your data?
  • Consider that you might not get the same results if you were to gather your data again. Random variation could be affecting your data.
  • Consider the shape of your data by visualizing it rather than solely relying on summary metrics.
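As a small illustration of the last point, here are two made-up samples that were picked by hand so that their mean and standard deviation match exactly. The names and values are purely illustrative, and a rough text histogram stands in for a proper chart.

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up samples, chosen so that their summary metrics are identical.
single_peak = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]


def text_histogram(values):
    """One row per integer value, one '#' per observation."""
    counts = Counter(values)
    return "\n".join(f"{v:2d} | {'#' * counts[v]}"
                     for v in range(min(values), max(values) + 1))


for name, sample in [("single_peak", single_peak), ("two_clusters", two_clusters)]:
    print(f"{name}: mean={mean(sample)}, std={pstdev(sample)}")
    print(text_histogram(sample))
```

Both samples report the same two summary numbers, but the histograms show one group centred around the middle and another split into two separate clusters with nothing in between.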

Bio: Simon Whittick is VP Marketing at Geckoboard, a live TV dashboard software. They’re on a mission to make data more approachable by providing educational content on data fundamentals that’s accessible to everyone, no matter their experience with data.

Want to understand more data fallacies?
Learn about all of them here.
