Causation vs Correlation: Visualization, Statistics, and Intuition
This case study uses a stock price time series to explore visualizations of correlation vs. causation, along with some common statistical pitfalls and insights.
By Alex Jones, Jan 2015.
As someone who has a tendency to think in numbers, I love when success is quantifiable!
So, I looked into how my working at Cameron relates to the company's stock price. Alongside this analysis, I'll demo scaling and data manipulation to illustrate basic lessons of visualization, statistics, and analysis.
First, I pulled Stock Price over my first ~90 Days, which aligns perfectly with Days Worked.
Then I simply added a running count of days. How convenient!
Example:
Neat! Let’s graph Stock Price vs Days Worked.
Super! Obviously no relationship!
Not so fast… let’s regress Stock Price on Days Worked.
It's important to realize that while visualization is a powerful tool and an incredibly insightful way to ingest data, it's not the whole story.
Blasphemy! With an R squared of .88 (Days Worked accounts for about 88% of the variation in Stock Price) and a P Value that only shows up about 42 Decimal Places out, traditional statistics would say we are incredibly confident about the results!
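For anyone who wants to see where numbers like these come from, here is a minimal sketch of a one-variable least-squares fit using only the Python standard library. The prices are made up for illustration (they are not Cameron's actual prices), and `fit_line` is a hypothetical helper name, not the tool used for the analysis above.

```python
# Minimal ordinary least squares for one predictor, standard library only.
# The prices below are invented for illustration; they are NOT the real data.

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared correlation of x and y
    return intercept, slope, r_squared

days_worked = list(range(1, 11))  # running count of days
stock_price = [75.6, 75.5, 75.0, 74.9, 74.4, 74.3, 73.8, 73.7, 73.2, 73.1]  # hypothetical

intercept, slope, r_squared = fit_line(days_worked, stock_price)
print(f"StockPrice = {intercept:.2f} + ({slope:.4f}) * DaysWorked, R^2 = {r_squared:.3f}")
```

The sketch skips the P Value on purpose, since that needs a t-distribution, which the standard library does not provide.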
So what do all those numbers really say? Well, one interpretation would be:
StockPrice = $75.99 - $0.29672 × (NumberDaysAlexHasWorked)
That's a heck of a deal! I cost a little under 30 cents a day...
WRONG. That's per share. With ~197.45M Shares Outstanding, that means I cost $0.29672 × 197.45M = $58,587,364 per day.
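The per-day figure is just the slope times the share count. A quick sketch of that arithmetic, using the two numbers quoted above (the variable names are only illustrative):

```python
# Per-share slope from the regression equation, in dollars per day worked.
slope_per_share = 0.29672
# Roughly 197.45M shares outstanding, as quoted in the text.
shares_outstanding = 197_450_000

# Implied change in total market value per day worked.
cost_per_day = slope_per_share * shares_outstanding
print(f"${cost_per_day:,.0f} per day")
```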
Well, this is awkward...
Quick, let's perform some "Transformations" to get a "Better result".
First, let's Scale Stock Price from 0 (lowest) to 1 (highest price). To do so, we'll get the Minimum, Maximum value, and Spread.
Spread= Maximum - Minimum.
How do we scale every datapoint? Simply: Scaled Price = (Stock Price - Minimum) / Spread. Boom! Scaled.
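The scaling step can be sketched in a few lines of Python. The function name `min_max_scale` and the sample prices are assumptions for illustration, not part of the original analysis.

```python
def min_max_scale(values):
    """Scale values so the lowest maps to 0 and the highest maps to 1."""
    minimum = min(values)
    maximum = max(values)
    spread = maximum - minimum
    if spread == 0:
        # Every value is identical, so there is nothing to scale by.
        raise ValueError("spread is zero; cannot scale")
    return [(v - minimum) / spread for v in values]

prices = [75.6, 74.9, 73.1, 74.3]  # hypothetical prices
scaled = min_max_scale(prices)
print(scaled)
```

The guard for a zero spread matters in practice: a flat series would otherwise divide by zero.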
Let’s graph!
Great, no relationship! Phew!
Whoa... that doesn't seem right? Let’s also Scale Days Worked…
Ok, maybe there's a relationship...
See how the Orange line (Days Worked) currently starts at 0 and goes to 1? Let's flip that. How? Inverted Days Worked = 1 - Scaled Days Worked.
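As a small sketch of the flip, assuming a hypothetical five-day count that has already been scaled the same way as the price:

```python
days_worked = [1, 2, 3, 4, 5]  # hypothetical running count of days

# Scale to the 0..1 range: (value - minimum) / spread.
minimum = min(days_worked)
spread = max(days_worked) - minimum
scaled_days = [(d - minimum) / spread for d in days_worked]

# Flip the direction: the first day becomes 1 and the last day becomes 0.
inverted_days = [1 - s for s in scaled_days]
print(inverted_days)
```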
Graphed:
Holy Moly. I see it now.
Essentially, we have taken two vectors of differing relative magnitudes, scaled them to an equivalent range, and controlled for directionality, thereby enabling a linear depiction of the relationship and an intuitive visualization!
Sorry, that’s unnecessary.
Now what are the regression results?!
Sit down for this... The regression results, in absolute terms, are EXACTLY the same. Even though the equation (on the final graph) is apparently different, once we "undo" the transformations (get numbers to their original values)... they’ll be the same!
Why is that? Because we didn’t change the overall geometry. We applied the same linear transformation to every point, not just one point.
Picture our data as a cube.
If we turn, flip, invert, scale, zoom out, or angle the cube-- has the cube itself changed? Absolutely not. It's exactly the same!
We're simply looking at it from a different perspective; we're just finding that perfect angle to tell our story/visualization. That's powerful!
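The "same cube" claim can be checked numerically. The sketch below uses made-up prices and hypothetical helper names (`fit_line`, `min_max`). It fits the raw data, fits the scaled-and-inverted data, and then undoes the transformations to recover the original slope.

```python
# All numbers here are invented for illustration; they are NOT the real series.

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, (sxy * sxy) / (sxx * syy)

def min_max(values):
    """Return (scaled_values, minimum, spread)."""
    minimum = min(values)
    spread = max(values) - minimum
    return [(v - minimum) / spread for v in values], minimum, spread

days = list(range(1, 11))
prices = [75.6, 75.5, 75.0, 74.9, 74.4, 74.3, 73.8, 73.7, 73.2, 73.1]

# Fit on the original values.
_, slope_raw, r2_raw = fit_line(days, prices)

# Fit on the scaled price against the scaled, inverted day count.
scaled_prices, price_min, price_spread = min_max(prices)
scaled_days, day_min, day_spread = min_max(days)
inverted_days = [1 - d for d in scaled_days]
_, slope_scaled, r2_scaled = fit_line(inverted_days, scaled_prices)

# Undo the transformations: one unit of scaled price is price_spread dollars,
# and one unit of inverted days is -day_spread actual days.
slope_recovered = -slope_scaled * price_spread / day_spread

print(f"R^2 raw = {r2_raw:.6f}, R^2 transformed = {r2_scaled:.6f}")
print(f"slope raw = {slope_raw:.6f}, slope recovered = {slope_recovered:.6f}")
```

The transformed slope has the opposite sign and a different size, which is why the equation on the final graph looks different, but the R squared and the recovered slope match the original fit.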
Even with these findings, we must address Causation vs Correlation! Based on statistics-- "data driven" results, and the interpretation we proposed-- I'm the worst!
However, I bet you there's another variable impacting stock. So what else has happened in the time period?
Well, considering that the company is an oilfield services firm-- we understand the missing link is the price of oil (I certainly hope!).
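To see how a lurking variable can produce this kind of result, here is a toy sketch with entirely synthetic numbers. The oil prices and the rule `30 + 0.5 * oil` are invented assumptions, not a model of the real company. The stock series is built from oil alone, so Days Worked never enters the calculation, yet the two still correlate strongly because oil happened to fall over the same period.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

days_worked = list(range(1, 11))
oil_price = [90, 88, 89, 85, 83, 84, 80, 78, 79, 75]  # synthetic, drifting down

# Synthetic rule: stock depends ONLY on oil. Days worked plays no part.
stock_price = [30 + 0.5 * oil for oil in oil_price]

r_days = correlation(days_worked, stock_price)
r_oil = correlation(oil_price, stock_price)
print(f"corr(days worked, stock) = {r_days:.3f}")
print(f"corr(oil price, stock)   = {r_oil:.3f}")
```

A strong correlation with Days Worked shows up even though, by construction, it has no effect at all here.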
What we should realize is that these relationships aren't always obvious! In fact, visualizations can hide relationships!
Added Oil price:
Most importantly, this should exemplify one of the most exciting value potentials of "Big Data". Today, we have access to incredible amounts of information relative to the "Universal Variable" of time. With that to relate on, we can see how major indexes, markets, events, weather patterns, etc. interrelate!
As we move towards a smaller and more interconnected world, actively promote "Universal" data points!
This is an abridged version of a post
https://www.linkedin.com/pulse/causation-vs-correlation-alex-jones
Alex Jones is a Graduate Student at U. Texas McCombs School of Business.