Advanced Data Analytics for Business Leaders Explained

A business-level explanation of the most important data analytics and machine learning methods, including neural networks, deep learning, clustering, ensemble methods, SVM, and when to use which models.



By Alex Jones, Sept 2014.

This is the second part of the post; here is the first part: Data Analytics for Business Leaders Explained

Neural Networks


Example: R's neuralnet package
Although a somewhat older method, Neural Networks have only recently come into widespread use, thanks to an exponential decrease in the cost of computing. Neural networks take in input data and create "hidden layers" capable of learning and self-improving through trial and error -- heck, their logic resembles that of the human brain! IBM's Watson, image recognition, and handwriting software are often driven by neural networks, and they are well suited for tasks that involve huge amounts of information but are poorly suited to traditional programming. The computational complexity can be immense, and interpreting the "importance/influence" of any given variable is difficult.
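To make the "hidden layers" idea concrete, here is a minimal sketch of fitting a small neural network with R's neuralnet package; the simulated customer data, column names, and hidden-layer sizes are assumptions made up purely for illustration.

    # A rough sketch with R's neuralnet package; the customer data below is
    # simulated purely for illustration.
    library(neuralnet)

    set.seed(42)
    customers <- data.frame(age = runif(200, 18, 70),
                            visits = rpois(200, 5))
    # Toy target: did the customer buy? (1 = yes, 0 = no)
    customers$bought <- as.integer(customers$age < 40 & customers$visits > 3)

    # Two hidden layers (4 and 2 neurons) -- the "hidden layers" described above
    nn <- neuralnet(bought ~ age + visits, data = customers,
                    hidden = c(4, 2), linear.output = FALSE)

    # Predicted purchase probabilities for the first few customers
    head(compute(nn, customers[, c("age", "visits")])$net.result)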



Limitations
It's not easy to interpret which variables are influential, it works best with large amounts of information, and it can become very computationally complex.

Deep Learning

Deep Learning is the stuff of sci-fi movies and artificial intelligence in which computers parse through unstructured and unlabeled data (including images, videos, language, speech, etc) and eventually make sense of the information.

Recently, Google built an artificial neural network across 16,000 CPU cores and had it "watch" YouTube videos for 3 days. Without guiding or limiting the model to particular videos, they found it was able to recognize cats. While this may serve as testament to the massive number of cat videos on YouTube, the finding itself is extraordinary.

Deep Learning of images

Essentially, rather than being told what "features" or "characteristics" to look for, the model identified repetitive shapes and recognized patterns within an otherwise convoluted and messy context.

While humans are incredible at these sorts of tasks, Deep Learning methods are catching up... and in some cases beating us. In 2011, a contest in Germany pitted humans against computers at accurately recognizing street signs (many of which were blurry and dark). The top algorithm won with 99.4% accuracy, while humans achieved 98.5%. Want to learn more about our future robot overlords (er, artificial intelligence)? Check out this article.

Limitations
The computational complexity and data requirements are massive, but it represents the progressive evolution of data science and that is exciting.

Clustering


Example: R's kmeans() function (stats package)

Clustering has been around for decades. However, it has greatly evolved since the days of hierarchical agglomerative clustering (quite a mouthful).

Clustering is a lot like K-nearest neighbor, except clustering looks for "groups" of neighbors and is often used to identify things like customer segments.
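To make that concrete, here is a minimal k-means sketch using the kmeans() function that ships with R; the two spending columns and the choice of three segments are assumptions made up for the example.

    # A minimal k-means sketch with base R's kmeans(); simulated customer data.
    set.seed(7)
    customers <- data.frame(
      annual_spend  = c(rnorm(50, 200, 30), rnorm(50, 800, 60), rnorm(50, 1500, 100)),
      yearly_visits = c(rnorm(50, 2, 1),    rnorm(50, 10, 2),   rnorm(50, 25, 4)))

    # Ask for 3 segments; nstart = 25 reruns the algorithm from 25 random
    # starting points and keeps the best result (see the "random start"
    # caveat under Limitations below).
    segments <- kmeans(scale(customers), centers = 3, nstart = 25)

    table(segments$cluster)                                   # segment sizes
    aggregate(customers, by = list(segment = segments$cluster), FUN = mean)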

Below is a visual of a handful of the more popular clustering algorithms and how they would cluster the respective dataset. As you can see, some algorithms have difficulty discerning data that isn't well separated from other clusters while others are perfectly capable of identifying such relationships.



Limitations
Depending on the algorithm you use, it can be challenging to interpret what you're looking at. Also, most methods begin with a "random" starting point, which means you can run the algorithm multiple times and get different (although likely very similar) results.

One of the advantages of open-source software (R, KNIME, RapidMiner -- and other tools listed in Top 10 Data Analysis Tools for Business) is that new methods are often released quickly, whereas development in traditional software can be a long process. Finally, complexity is a function of the number of columns, so very "wide" datasets can take some time.

Holy suspenders, Batman, we've become nerds! +5 WoW points

Ensemble


Ensembles are a little unique in that they're a combination of existing models. Ensembles are phenomenal, intuitive, and relatively quick. Essentially, they take multiple models and combine their predictions into prediction-ultron (related to Google Ultron). Ensembles allow you to balance out models' weaknesses/biases and improve predictive performance.

For this post, we'll look at an Ensemble Regression Decision Tree. Essentially, this randomizes the development of decision trees (to create some variability), then at each "leaf" node (aka the end of the decision tree/flow chart) it fits a regression line!
In other words, rather than simply saying everything in leaf node number one is a buy, this model would use a regression equation within that node to say something like "there's a 92% chance that Bob will buy, a 99% chance that Ron will buy, an 82% chance that Tod will buy, etc."



But wait, there's more! Not only is this pretty neat, an Ensemble Regression Decision Tree creates X number of trees (let's say 50), then makes 50 predictions for each point.
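The closest off-the-shelf relative in R is a random forest of regression trees -- not necessarily the exact model pictured, but a stand-in sketch that shows the "many randomized trees, averaged" idea; the randomForest package and the simulated data are assumptions for illustration.

    # A stand-in sketch: 50 randomized regression trees whose predictions are
    # averaged (randomForest package); the data is simulated for illustration.
    library(randomForest)

    set.seed(1)
    train <- data.frame(age = runif(300, 18, 70),
                        income = runif(300, 20, 150))
    # Toy target: a "likelihood to buy" score between 0 and 1 (plus a little noise)
    train$buy <- plogis(-4 + 0.05 * train$age + 0.03 * train$income) + rnorm(300, 0, 0.05)

    forest <- randomForest(buy ~ age + income, data = train, ntree = 50)

    new_customers <- data.frame(age = c(25, 60), income = c(40, 120))
    predict(forest, new_customers)                      # the averaged prediction
    # predict.all = TRUE exposes the 50 individual tree predictions being averaged
    predict(forest, new_customers, predict.all = TRUE)$individual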

That's not all! Oftentimes simply averaging (or other aggregation methods such as a committee vote, majority vote, confidence-weighted averaging, etc.) across multiple models can improve predictive power by a few percentage points (which translates into real money).
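As a tiny illustration of that averaging idea -- continuing with the simulated train and new_customers data from the sketch above, and using an arbitrary pair of models -- you might blend a linear model with a single decision tree:

    # Blending two different models by simple averaging (rpart ships with R)
    library(rpart)

    lm_fit   <- lm(buy ~ age + income, data = train)      # linear model
    tree_fit <- rpart(buy ~ age + income, data = train)   # a single decision tree

    # The "committee" prediction: average the two models' predictions
    (predict(lm_fit, new_customers) + predict(tree_fit, new_customers)) / 2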

I'm not done yet! Finally, since each one of the decision trees/algorithms is independent, ensembles are wonderfully parallelizable. What does that mean? Basically, they're pretty darn fast.

Limitations
Good luck interpreting it. Although the model eventually spits out a single averaged/aggregated prediction, it is hard to tell what is driving that prediction.

Ensemble II: Adaboost


Example: Adaboost in Weka
Adaboost is one of the more famous (and rightfully so, it is quite powerful) algorithms that implement an ensemble method. Essentially, it fits a simple classifier (shown below with a line), then aggregates each prediction into a single model (the bottom box). What makes Adaboost interesting is that it builds on "weak" classifiers -- models only slightly better than guessing -- and has each new one focus on the observations the previous ones got wrong. In essence, take a bunch of weak individuals and combine them into one strong super-model! *Insert Clever Optimus Prime Reference* *Nerd Dreams*
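The example above points at Weka; for R users, a roughly equivalent sketch with the adabag package (an assumption on my part, not the Weka workflow shown) might look like this, with simulated data.

    # A rough R take on AdaBoost using the adabag package; simulated data.
    library(adabag)

    set.seed(3)
    d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
    d$class <- factor(ifelse(d$x1 + d$x2 + rnorm(300, 0, 0.5) > 0, "buy", "pass"))

    # mfinal = 25 weak tree classifiers, each paying more attention to the
    # observations the previous ones got wrong
    boosted <- boosting(class ~ x1 + x2, data = d, mfinal = 25)

    predict(boosted, newdata = d)$confusion   # the combined "strong" classifier's results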



Limitations
Although there's no perfect algorithm, Adaboost is one of those pretty-darn-good ones that does well without too much extra work. Its main weakness is sensitivity to noisy data and outliers, since it keeps focusing on the hardest (and sometimes mislabeled) examples.

SVM: Support Vector Machine


Support Vector Machines are black magic/ voodoo. I say that because it often appears counter-intuitive if you try to understand it first from a mathematical perspective. Rather, this is best explained by reddit. In the most basic sense, SVM separates data into two classes (think buyers - Blue, non-buyers - Red). Take a look at the visual below.



Looks good to me? Both sides are separated! That's true. But notice how the line leans a little bit, with red/non-customers close to it on the bottom and blue/customers close to it on the top. Keep that in mind. Now, let's say we run a marketing campaign and predict purchases using this line. We'll most likely get something like this:



Pretty good. Essentially, we just got that red customer wrong. In other words, although that customer didn't buy, we predicted that they would. No big deal, right? Maybe... but we can do better!

So let's see what this SVM thing is all about. Hold onto your suspenders. We're going to project the points into a higher dimensional space that maximizes the distance between the two groups. Confused? Don't worry about it. Keep reading.

Ultimately, we'll end up with a line like this. Rather than "tilting", this line tries to maximize the Margin/ space (that weird yellowish colored area around the line) between the two groups.



Ok now let's see what happens when we predict new customers.



Witchcraft! Boom, accuracy improved. Below is a visual that's a little more accurate in illustrating what is actually going on and how the whole "projecting into higher dimensions" works. Rather than "lines", SVM works with hyperplanes, and there's a whole slew of "kernel functions" that add to the complexity. But again, for our purposes, not really important.
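For the curious, here is a minimal sketch of fitting an SVM in R with the svm() function from the e1071 package; the simulated buyer data and the radial kernel are assumptions for illustration.

    # A minimal SVM sketch with the e1071 package; simulated buyer data.
    library(e1071)

    set.seed(9)
    d <- data.frame(
      spend  = c(rnorm(100, 2), rnorm(100, 5)),
      visits = c(rnorm(100, 2), rnorm(100, 5)),
      buyer  = factor(rep(c("no", "yes"), each = 100)))

    # The radial kernel is one of those "kernel functions": it handles the
    # projection into a higher-dimensional space for us.
    fit <- svm(buyer ~ spend + visits, data = d, kernel = "radial", cost = 1)

    table(predicted = predict(fit, d), actual = d$buyer)   # in-sample confusion matrix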



Limitations
SVM can be a computational nightmare. So why use SVM? Because it's incredibly accurate and "learns" rapidly from additional input data. Sadly, it is very hard to interpret and challenging to tune.

When to use different models?


Below is a decent reference table for determining which algorithm will likely be useful. However, think for a moment about the problem at hand, reflect on your goals versus the methods available, and realize that most of the time data can be manipulated, massaged, encoded, and reformatted to fit into one of the major techniques. Having said that, reach out for support -- data cleansing, prepping, and modeling can be daunting!



At a macro-level, you may be thinking that these don't seem that novel or revolutionary. I agree. It's all just math.

What has changed is the sheer size and scale of data, the scope of metrics being collected, and the ability to analyze it with relatively inexpensive, scalable, and fast computing.

Over time, more industries will fundamentally change or be disrupted as companies begin to leverage analytics, enhance efficiency, and allow data to drive decisions. Simply put, the competitive environment will necessitate data science capabilities.

My hope is that this will help develop intuition and trust for data science among business leaders, ultimately allowing the nerds/data scientists to focus less on how to present complex topics and businesses to begin realizing value!

Thanks for reading!

Here is the original post.

Alex Jones is a Graduate Student at U. Texas McCombs School of Business.
