How to reduce Data Hoarding, get Better Visualizations and Decisions
Creating a hodge-podge of pretty pictures of every datapoint is a guaranteed way to destroy the value of a visualization. We examine how to reduce such data hoarding and improve decisions.
Welcome, Principal Components Analysis! What PCA does is try to “maximize variability” in our input data by “projecting” it linearly onto new super variables (principal components).
This is all witchcraft! Keep reading, it’ll start to click.
Some notes about Principal Components:
- Principal Components (Prins) are created in order of variability, so Prin1 will be most spread-out, then Prin2, etc.
- Prins are orthonormal. In other words, they’re uncorrelated (at right angles), which is important for linear regression! Think of two totally unrelated variables, like the number of movies you saw in 2015 and the amount you paid in taxes in 1983.
PSA: Multicollinearity is a statistical faux pas that impacts millions every year. To avoid years of ridicule, embarrassment, and professional exile,
don’t go into statistics… or, failing that, always check for multicollinearity.
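To make those two notes concrete, here’s a minimal numpy sketch on a synthetic blob (all the data is made up for illustration): the eigenvalues of the covariance matrix come out largest-first (Prin1 is the most spread out), and the component directions are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D "blob": correlated synthetic data, just for illustration
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 0.5, 0.1],
                                          [0.5, 1.0, 0.2],
                                          [0.1, 0.2, 0.3]])
X -= X.mean(axis=0)

# Eigendecomposition of the covariance matrix gives the principal components
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order...
order = np.argsort(eigvals)[::-1]        # ...so flip: Prin1 = most variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)               # decreasing: Prin1 is the most spread out
print(eigvecs.T @ eigvecs)   # ~identity matrix: components are orthonormal
```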
Ok, let’s look at our first Principal Component!
Wonderful! See how the long red line overlays the “length” of our blob— that’s our first Principal Component. The shorter yellow line (width) represents our second principal component.
Now let’s get our third and final Principal Component: the “height” of our blob.
Here we see that our third principal component is the “Flat” part of the disc. We’ll be able to drop that variable from our analysis.
How do we know we’ll be able to drop one? Well… U, V, W is the exact same data as X, Y, Z! It has just been rotated: each new variable is a linear combination of the old ones, so no information is lost.
Nothing like an information-preserving linear transformation to get the blood flowing!
Why yes, I am free this evening… How’d you know?
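If you want to convince yourself that no information is lost, here’s a small numpy sketch (synthetic data, purely for illustration): project onto the principal components, rotate back, and you recover the original data exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))   # made-up X, Y, Z data
X -= X.mean(axis=0)

# Principal axes from the covariance matrix
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# Project onto the principal components (X, Y, Z -> U, V, W)...
scores = X @ eigvecs
# ...and rotate back: the original data returns untouched
X_back = scores @ eigvecs.T
print(np.allclose(X, X_back))   # True
```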
Let’s look at a quick example of what Principal Components looks like on some International Risk Data (below is a sample).
Ok, so let’s run Principal Components Analysis and see how our principal components relate to the variability of the underlying data and how to interpret it.
In our output, we see Eigenvalue, Percentage, and Cumulative Percentage (with respect to total variability).
In other words, the first row shows that Principal component number 1 represents 54.671% of the total variability!
Wow, that basically means we’ve condensed a whole lot of “Information” into one variable. Ok, but we still have 15 of them… How many do we keep?
Well, there’s no hard-and-fast guideline, just a few “rules of thumb”: one says to retain enough Principal Components to keep at least 80% of the variability; another says to keep all those Principal Components with an eigenvalue greater than 1.
I prefer the intuition that comes with eigenvalues greater than 1. Why? Because when PCA is run on standardized data (the correlation matrix), the eigenvalues sum to the total number of incoming variables (15 in our case). So one way to interpret that would be: those Principal Components with eigenvalues greater than 1 are MORE informative than a typical original incoming variable.
By extension, those components with an eigenvalue LESS than 1 are NOT as informative as the original variables. But that’s just one nerd’s opinion.
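Here’s a quick numpy sketch of both rules of thumb on made-up data standing in for our 15 risk variables (the numbers are synthetic, not the actual risk data): with a correlation matrix, the eigenvalues sum to 15, and we can count the components each rule keeps.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 15                                 # 15 incoming variables, as in our example
X = rng.normal(size=(300, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)   # make two variables correlated

# With standardized data, the correlation-matrix eigenvalues sum to p
corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigvals.sum())                   # ~15.0

# Rule of thumb 1 (Kaiser): keep components with eigenvalue > 1
keep_kaiser = int((eigvals > 1).sum())

# Rule of thumb 2: keep enough components to cover 80% of total variability
cumulative = np.cumsum(eigvals) / eigvals.sum()
keep_80 = int(np.searchsorted(cumulative, 0.80)) + 1
print(keep_kaiser, keep_80)
```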
Ok sounds neat! How do we make sense of this stuff?
Let’s look at our loading matrix, which shows us how our incoming variables relate to our Principal components.
What we see in Prin1 is that our “largest” loadings (the variables most closely related to it) are Rule of Law, Regulatory Quality, Government Effectiveness, Control of Corruption, etc.
So how could we interpret this? Well, we’d say something like this:
Countries with a high value in Principal Component one are countries with a strong government, good regulations, etc. That makes sense! Those variables are intuitively related.
More importantly, we no longer have to look separately at the United Nations, CIA, IHS, Transparency International, and World Bank “Rule of Law / Government” factors (look at the original data: the data points are on numerous different scales, 0 to 100, 1 to 5, -3 to 3, etc.). We can look at one single value!
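To see how a loading matrix surfaces groups of related variables, here’s a numpy sketch on fabricated data: three variables driven by a hypothetical “governance” factor and two by an “economics” factor (the variable names are stand-ins, not the real risk data). Prin1’s loadings come out large on the governance trio.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
gov = rng.normal(size=n)    # latent "governance" factor (hypothetical)
econ = rng.normal(size=n)   # latent "economics" factor (hypothetical)

# Five observed variables: three driven by governance, two by economics
X = np.column_stack([
    gov + 0.3 * rng.normal(size=n),    # stand-in for Rule of Law
    gov + 0.3 * rng.normal(size=n),    # stand-in for Regulatory Quality
    gov + 0.3 * rng.normal(size=n),    # stand-in for Control of Corruption
    econ + 0.3 * rng.normal(size=n),   # stand-in for an economic indicator
    econ + 0.3 * rng.normal(size=n),   # stand-in for another economic indicator
])

corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvector scaled by sqrt(eigenvalue), i.e. the correlation of
# each incoming variable with each principal component
loadings = eigvecs * np.sqrt(eigvals)
print(np.round(np.abs(loadings[:, 0]), 2))   # Prin1: big on the first three
```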
So which would you rather have: a data-hoard with 15 different graphs, gauges, and mashups of risk data or one intelligent number?
Wonderful! That’s progress— it’s comprehensible, intelligent, and intuitive.
How do I try “PCA” for myself? There are a ton of open-source (free) software tools that will calculate all this for you— You just need to interpret it!
I’d recommend starting with a tool like KNIME, Weka, or RapidMiner Community Edition. Check out the tools in “Top 10 Data Analysis Tools for Business.”
Most importantly, they’ll help prevent data hoarding and “information overload” by creating super-intelligent metrics, leaving the “computing” to the computers and the decisions and action plans to people.
PCA’s Cousin: Linear Discriminant Analysis
If you think this is cool (term used loosely) and you have a specific question in mind, try out Linear Discriminant Analysis. It’s a lot like PCA but it projects variables into new variables based on a specific goal (rather than the “spread/ variability” of the data).
For example, if you wanted to predict “Buyers”-Blue vs “Non-Buyers”-Red, Linear Discriminant Analysis (LDA) would help you do that more directly than PCA.
In the example below, if you used PCA, your first Principal Component would run from the top-left corner to the bottom-right (between the red and the blue).
However, Linear Discriminant Analysis would go, “Hey wait, instead of maximizing for variability, let’s optimize to separate out our ‘classes’ of buyers/non-buyers!”
What does that actually mean? Basically, instead of looking for “flat” parts, we’d rotate the data in a way that keeps “buyers” and “non-buyers” from overlapping (i.e., separates them).
Essentially, we’d have a new variable similar to a “likelihood to buy” or an “attractiveness score”. In our case, those low on the LDA score (those who are Red) would be really unlikely to buy, and we could show that in one single number!
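Here’s a numpy sketch of Fisher’s two-class LDA on synthetic “buyers vs. non-buyers” data (everything here is made up for illustration): PCA’s top direction follows the overall spread of the blob, while the LDA direction separates the two classes and yields a one-number score.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical buyers (blue) vs. non-buyers (red): two long, parallel blobs
base = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
buyers = base[:100] + np.array([0.0, 1.0])
nonbuyers = base[100:] + np.array([0.0, -1.0])
X = np.vstack([buyers, nonbuyers])

# PCA direction: maximizes overall spread (runs along the blobs' long axis)
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pca_dir = vecs[:, -1]

# Fisher's LDA direction: maximizes class separation instead
mean_b, mean_n = buyers.mean(axis=0), nonbuyers.mean(axis=0)
Sw = np.cov(buyers, rowvar=False) + np.cov(nonbuyers, rowvar=False)
lda_dir = np.linalg.solve(Sw, mean_b - mean_n)
lda_dir /= np.linalg.norm(lda_dir)

# Projecting onto the LDA direction gives a one-number "likelihood to buy"
score_b = buyers @ lda_dir
score_n = nonbuyers @ lda_dir
print(pca_dir)   # mostly along x: follows the spread, ignores the classes
print(lda_dir)   # mostly along y: points across the buyer/non-buyer gap
```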
Moreover, you can do really outstanding things if you move beyond “linear” boundaries to hyperplanes or radial basis (kernel) functions. To give a little flavor, here are a few visuals of the idea.
What a wonderful mathematical cliff hanger!
Simplicity and intuition enhance intelligence, while data hoarding limits action.
Thanks for reading!