Surprising Random Correlations


An interesting demo showing how easy it is to find surprising correlations in real data. Is the German unemployment rate related to Apple's stock price? Is the 10-year Treasury rate related to the price of Red Winter Wheat? You will be surprised.



By Sameer Manek.

I've noticed people frequently misusing data: they find correlations between seemingly unrelated data sets and infer a relationship. While they'll generally volunteer that they haven't proven causality, they frequently claim that there must be some underlying relationship for the p-value to be so low.

I built a toy to try to show the error in this. Essentially, you can take almost any real-life data and infer a relationship, especially if you perform multiple tests. Here I take a number of data sets from Quandl and plot whichever pairs have very low p-values.
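The effect is easy to reproduce without any real data. The sketch below is not the demo's code: it swaps the Quandl series for synthetic, mutually independent random walks, and it uses a Fisher z approximation for the p-value so that it runs on the Python standard library alone (a library routine such as scipy.stats.pearsonr would be the usual choice). All function names and parameters are illustrative assumptions.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation.

    This assumes independent observations, which is exactly the
    assumption that trending time series violate.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def random_walk(n, rng):
    """Cumulative sum of independent standard-normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def count_significant_pairs(n_series=20, n_points=200, alpha=0.05, seed=0):
    """Test every pair of unrelated walks and count nominal 'discoveries'."""
    rng = random.Random(seed)
    walks = [random_walk(n_points, rng) for _ in range(n_series)]
    pairs = list(itertools.combinations(range(n_series), 2))
    hits = sum(
        1
        for i, j in pairs
        if approx_p_value(pearson_r(walks[i], walks[j]), n_points) < alpha
    )
    return hits, len(pairs)


hits, total = count_significant_pairs()
print(f"{hits} of {total} independent random-walk pairs have p < 0.05")
```

Every series here is generated independently of the others, so any pair that clears the threshold is a false positive. With wandering series like these, the share of "significant" pairs typically lands far above the 5% that the threshold nominally promises.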

Here are some examples (see live demo below).

Fig 1. German Unemployment Rate vs Apple Stock Price


Fig 2. 10-Year US Treasury rate vs No. 1 Hard Red Winter Wheat


The causes of these 'relationships' vary, but there are a few key factors that I think are generally worth checking. Problems with them don't invalidate the fitted slope or intercept, but they may call the test statistics (e.g., the p-value) into question.
  • Are the residuals normally distributed?
  • What if I detrend the data?
  • Are the residuals autocorrelated?
  • Do the residuals have constant variance?
  • Are there any points with a lot of leverage?
  • How many relationships did I test before finding this?
  • Do I need to apply a multiple testing correction?
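A few of the checks in this list can be sketched in a handful of lines. The example below is an illustration under stated assumptions, not the demo's implementation: it uses two synthetic, independent random walks in place of real series, the Durbin-Watson statistic as one common check for autocorrelated residuals, first differencing as one simple way to detrend, and a Bonferroni threshold as one simple multiple-testing correction. The count of 190 tests is a hypothetical figure, and the p-value again uses a Fisher z approximation to stay within the standard library.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (assumes i.i.d. data)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def ols_residuals(x, y):
    """Residuals of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def durbin_watson(resid):
    """Near 2: little lag-1 autocorrelation. Near 0: strong positive autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return num / sum(e * e for e in resid)


def first_difference(series):
    """Period-to-period changes; a simple way to remove a wandering trend."""
    return [b - a for a, b in zip(series, series[1:])]


def random_walk(n, rng):
    """Cumulative sum of independent standard-normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


rng = random.Random(1)
x = random_walk(300, rng)  # synthetic stand-ins for two unrelated series
y = random_walk(300, rng)
dx, dy = first_difference(x), first_difference(y)

dw_levels = durbin_watson(ols_residuals(x, y))
dw_diffs = durbin_watson(ols_residuals(dx, dy))
p_levels = approx_p_value(pearson_r(x, y), len(x))
p_diffs = approx_p_value(pearson_r(dx, dy), len(dx))

n_tests = 190  # hypothetical number of relationships screened
bonferroni_alpha = 0.05 / n_tests

print(f"levels:      DW = {dw_levels:.2f}, nominal p = {p_levels:.3g}")
print(f"differences: DW = {dw_diffs:.2f}, nominal p = {p_diffs:.3g}")
print(f"Bonferroni threshold for {n_tests} tests: {bonferroni_alpha:.2g}")
```

A Durbin-Watson value close to 0 on the raw levels signals that the residuals are strongly autocorrelated, so the nominal p-value cannot be taken at face value. After differencing, the statistic should sit near 2, and any p-value that remains should then be compared against the corrected threshold rather than 0.05.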

 
Bio: Sameer Manek is a student at Harvard Business School; he previously studied biomedical engineering at Johns Hopkins University. He is interested in data science, investing, design, books, and really just about everything else.

Original.

See the embedded demo below, hosted on shinyapps.io (it may be pretty slow).

Wait for it to load, then hit the "Another Relationship!" button.