
Causation vs Correlation: Visualization, Statistics, and Intuition


This case study uses a stock price time series to explore visualizations of correlation vs. causation, along with some common statistical pitfalls and insights.



By Alex Jones, Jan 2015.

xkcd correlation

As someone who has a tendency to think in numbers, I love it when success is quantifiable!

So, I looked into how my working at Cameron relates to the company's stock price. Alongside this analysis, I'll demo scaling and data manipulation to illustrate basic lessons of visualization, statistics, and analysis.

First, I pulled Stock Price over my first ~90 Days, which aligns perfectly with Days Worked.

Then I simply added a running count of days. How convenient!

Example:

stocks vs. days worked

Neat! Let’s graph Stock Price vs Days Worked.

Days vs stock prices

Super! Obviously no relationship!

Not so fast… let’s regress Stock Price on Days Worked.

It's important to realize that while visualization is a powerful tool and an incredibly insightful way to ingest data, it's not the whole story.

results confidence

Blasphemy! With an R squared of .88 and a p-value whose first nonzero digit is 42 decimal places out, traditional statistics would say we are incredibly confident about the results!

So what do all those numbers really say? Well, one interpretation would be:

StockPrice = $75.99 - $0.29672 × NumberDaysAlexHasWorked
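For anyone who wants to reproduce this kind of fit, here is a minimal sketch of an ordinary least squares line in plain Python. The `days` and `price` values are made up for illustration (a falling trend plus a small deterministic wiggle). They are not the real Cameron prices, and `fit_line` is a hypothetical helper, not the tool that produced the results above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical data: 90 working days and an invented, steadily falling price.
days = list(range(1, 91))
wiggle = [((d * 37) % 11 - 5) * 0.5 for d in days]  # stand-in for daily noise
price = [76.0 - 0.3 * d + w for d, w in zip(days, wiggle)]

intercept, slope, r_squared = fit_line(days, price)
print(f"price = {intercept:.2f} + ({slope:.4f}) * days, R^2 = {r_squared:.2f}")
```

The slope is the "dollars per share per day" number, which is exactly the figure that gets over-interpreted next.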

That's a heck of a deal! I cost a little under 30 cents a day...

WRONG. That's per share. With ~197.45M shares outstanding, that means I cost $58,587,364 per day.
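The arithmetic is easy to check. This short sketch just multiplies the per-share slope from the regression equation by the approximate share count quoted above; the variable names are my own.

```python
slope_per_share = 0.29672      # dollars per share per day, from the regression
shares_outstanding = 197.45e6  # approximate shares outstanding, as quoted above

cost_per_day = slope_per_share * shares_outstanding
print(f"${cost_per_day:,.0f} per day")  # $58,587,364 per day
```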

Well, this is awkward... 

Quick, let's perform some "Transformations" to get a "Better result".

First, let's scale Stock Price from 0 (lowest price) to 1 (highest price). To do so, we'll get the Minimum value, the Maximum value, and the Spread.

Spread= Maximum - Minimum.

How do we scale every datapoint? Simply: Scaled Price X = (Stock Price X - Minimum) / Spread. Boom! Scaled.
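As a rough sketch, the scaling step might look like this in Python. The function name `min_max_scale` and the sample prices are assumptions for illustration, not the actual data.

```python
def min_max_scale(values):
    """Rescale values so the smallest maps to 0 and the largest maps to 1."""
    minimum = min(values)
    maximum = max(values)
    spread = maximum - minimum
    if spread == 0:
        raise ValueError("all values are identical, so there is nothing to scale")
    return [(v - minimum) / spread for v in values]


# Hypothetical prices, not the real series.
prices = [70.0, 75.0, 60.0, 65.0]
scaled_prices = min_max_scale(prices)
print(scaled_prices)
```

The highest price (75.0) lands on 1, the lowest (60.0) lands on 0, and everything else falls in between in the same order as before.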

Let’s graph!

stock price vs days worked

Great, no relationship! Phew!

Whoa... that doesn't seem right. Let’s also scale Days Worked…

scaled days worked

Ok, maybe there's a relationship...

See how the orange line (Days Worked) currently starts at 0 and goes to 1? Let's flip that. How? Inverted Days Worked = 1 - Scaled Days Worked.
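Here is a small sketch of the flip, using a hypothetical five-day series so the numbers are easy to follow. The helper names are mine.

```python
def min_max_scale(values):
    """Rescale values to the range 0 (minimum) to 1 (maximum)."""
    minimum = min(values)
    spread = max(values) - minimum
    return [(v - minimum) / spread for v in values]


def invert(scaled_values):
    """Flip a 0-to-1 series so it runs from 1 down to 0 instead."""
    return [1 - s for s in scaled_values]


days_worked = [1, 2, 3, 4, 5]            # hypothetical short series
scaled_days = min_max_scale(days_worked)  # [0.0, 0.25, 0.5, 0.75, 1.0]
inverted_days = invert(scaled_days)       # [1.0, 0.75, 0.5, 0.25, 0.0]
print(inverted_days)
```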

Graphed:

graphed scaled values

Holy Moly. I see it now.

Essentially, we have taken two vectors of differing relative magnitudes, scaled them to an equivalent range, and controlled for directionality, thereby enabling a linear depiction of the relationship and an intuitive visualization!

Sorry, that’s unnecessary.

Now what are the regression results?!

Sit down for this... The regression results, in absolute terms, are EXACTLY the same. Even though the equation (on the final graph) is apparently different, once we "undo" the transformations (get numbers to their original values)... they’ll be the same!

Why is that? Because we didn’t change the overall geometry. We applied the same linear transformation to every point, not just one point.
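To see this numerically, here is a sketch that fits a line before and after scaling and inverting, then "undoes" the transformations. The data are invented (not the real prices) and the helper names are hypothetical. The R squared comes out identical, and the scaled slope converts back to the original slope.

```python
def fit_line(x, y):
    """Least squares fit of y on x; returns (slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def min_max_scale(values):
    """Returns (scaled values, minimum, spread) so the scaling can be undone."""
    minimum = min(values)
    spread = max(values) - minimum
    return [(v - minimum) / spread for v in values], minimum, spread


# Hypothetical data: a falling price with a small deterministic wiggle.
days = list(range(1, 91))
price = [76.0 - 0.3 * d + ((d * 37) % 11 - 5) * 0.5 for d in days]

slope_raw, r2_raw = fit_line(days, price)

scaled_price, price_min, price_spread = min_max_scale(price)
scaled_days, days_min, days_spread = min_max_scale(days)
inverted_days = [1 - s for s in scaled_days]

slope_scaled, r2_scaled = fit_line(inverted_days, scaled_price)

# Undo the transformations: one unit of scaled price is price_spread dollars,
# and one unit of inverted days is days_spread days in the opposite direction.
slope_recovered = -slope_scaled * price_spread / days_spread

print(f"R^2 raw    = {r2_raw:.6f}, slope raw       = {slope_raw:.6f}")
print(f"R^2 scaled = {r2_scaled:.6f}, slope recovered = {slope_recovered:.6f}")
```

The slope on the scaled graph looks different (it is positive and unitless), but that is only the change of perspective; converting back gives the same dollars per share per day.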

Picture our data as a cube.

data cube

If we turn, flip, invert, scale, zoom out, or angle the cube-- has the cube itself changed? Absolutely not. It's exactly the same!

We're simply looking at it from a different perspective, finding that perfect angle to tell our story through the visualization. That's powerful!

Even with these findings, we must address Causation vs Correlation! Based on statistics-- "data driven" results, and the interpretation we proposed-- I'm the worst!

However, I bet you there's another variable impacting stock. So what else has happened in the time period?

Well, considering that the company is an oilfield services firm-- we understand the missing link is the price of oil (I certainly hope!).

What we should realize is that these relationships aren't always obvious! In fact, visualizations can hide relationships!

Added Oil price:

added oil prices
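A toy simulation makes the "missing link" idea concrete. In this sketch every number is invented: oil falls over time, and the stock depends only on oil (plus a little noise), never on days worked. The raw correlation between days and stock is still very strong, but once the part explained by oil is removed from both series (a partial correlation), almost nothing is left.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """What is left of y after removing its best straight-line fit on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic series (assumptions, not market data).
days = list(range(1, 91))
oil = [100.0 - 0.4 * d + 3.0 * math.sin(d / 5) for d in days]
stock = [30.0 + 0.5 * o + ((d * 37) % 11 - 5) * 0.1
         for d, o in zip(days, oil)]

raw_r = pearson(days, stock)
partial_r = pearson(residuals(days, oil), residuals(stock, oil))

print(f"correlation of days and stock:            {raw_r:.3f}")
print(f"same correlation after controlling for oil: {partial_r:.3f}")
```

This is only a sketch of one way to check for a lurking variable. With real data the picture is rarely this clean, and controlling for oil would not rule out other drivers.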

Most importantly, this should exemplify one of the most exciting opportunities of "Big Data". Today, we have access to incredible amounts of information tied to the "Universal Variable" of time. With that to relate on, we can see how major indexes, markets, events, weather patterns, etc. interrelate!

As we move towards a smaller and more interconnected world, actively promote "Universal" data points!

This is an abridged version of a post: https://www.linkedin.com/pulse/causation-vs-correlation-alex-jones

Alex Jones is a Graduate Student at U. Texas McCombs School of Business.
