Data Analytics for Business Leaders Explained
Tags: Alex Jones, Algorithms, Business Leader, Classification, Data Analytics, Decision Trees, Linear Programming, Regression
Learn about a variety of different approaches to data analytics and their advantages and limitations from a business leader's perspective in part 1 of this post on data analytics techniques.
By Alex Jones, Sept 2014.
Gartner's 2014 Hype Cycle (below) shows the relative expectations of various technologies, including Big Data, Data Science, In-Memory Databases, and Prescriptive Analytics. In my mind, this illustrates that it's time to stop tossing around buzzwords and start realizing value.
To give some perspective, it is important to realize that analytics is an evolution of skills and capabilities. Consider where you would put your organization.
With that little bit of context, let's take a moment to cut the buzzwords and get into the nuts and bolts of data science techniques.
I have held this particular post in draft for some time now, simply because generalizing the complexities of mathematical models, computational efficiency, and sophisticated techniques inevitably sacrifices some accuracy. With that said, this isn't meant to teach or guide a data scientist; it is meant to help business leaders understand the analytics opportunity and techniques, while also providing a reference for data scientists and technical leaders as they try to distill immensely complex subject areas into comprehensible, bite-size pieces.
Let's begin!
Linear Programming & Non-Linear Programming:
Example: Solver/SolverTable within Excel
Linear programming is an optimization method that allows users to maximize (or minimize) an objective function (a metric defined by an equation). In the graph below, the objective function is to maximize profits, given the trade-off of manufacturing tables or chairs within a certain number of production hours. Although the example below is quite rudimentary, linear programming allows for many constraints/factors and is incredibly fast, because it simply draws a number of "lines" to represent each constraint (green lines) and then identifies the peak "feasible" point (the optimum).
As you might expect, nonlinear programming doesn't require linear constraints. However, nonlinear programming is known to be much more computationally challenging, as the program runs through each potential point (or uses an approximation parameter/gradient).
Non-Linear
Limitations: Requires users to "know" the constraints, the influential variables, and their relative impacts.
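To make the tables-and-chairs idea concrete, here is a minimal sketch in Python using SciPy's linear programming solver. Every number in it (profit per unit, hours, wood) is made up purely for illustration, not taken from the example in the graph:

```python
from scipy.optimize import linprog

# Hypothetical numbers: $30 profit per table, $20 per chair.
# linprog minimizes, so we negate the profits to maximize.
objective = [-30, -20]

# Constraints: hours (2 per table, 1 per chair, 40 available)
# and wood (1 unit per table, 2 per chair, 50 available).
A_ub = [[2, 1],
        [1, 2]]
b_ub = [40, 50]

result = linprog(objective, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, None), (0, None)])  # can't build negative units
tables, chairs = result.x
print(f"Make {tables:.0f} tables and {chairs:.0f} chairs "
      f"for ${-result.fun:.0f} profit")
```

The solver checks the corner points of the feasible region (where the constraint "lines" intersect) and returns the best one, which is exactly why linear programming scales so well.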
Monte Carlo Simulations:
Example: @Risk
An engineering and MBA classic, Monte Carlo simulation allows users to designate randomization functions and distributions to represent unknowns. This is used to simulate problems that are not deterministic (meaning they can't be solved directly), ultimately estimating both a cone of uncertainty and the most probable outcome.
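A bare-bones sketch of the technique in Python: the line items, distributions, and dollar figures below are all hypothetical, but they show how random draws over the unknowns produce a cone of uncertainty:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims = 100_000

# Hypothetical project-cost model with three uncertain inputs (all in $k)
labor    = rng.normal(100, 15, n_sims)          # mean 100, sd 15
material = rng.triangular(40, 55, 90, n_sims)   # min / most-likely / max
overrun  = rng.uniform(0.0, 0.2, n_sims)        # 0-20% overrun factor

total = (labor + material) * (1 + overrun)

# The median is the most probable outcome; the 5th/95th percentiles
# bound the "cone of uncertainty"
p5, p50, p95 = np.percentile(total, [5, 50, 95])
print(f"Median cost ${p50:.0f}k, 90% interval [${p5:.0f}k, ${p95:.0f}k]")
```

Note that the output is only as good as the three input distributions, which is precisely the limitation discussed next.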
Limitations:
Since Monte Carlo runs simulations based on estimates and user-defined inputs, the output has ample opportunity for human error. Monte Carlo simulations can get pretty darn complex, but don't mistake complexity for accuracy. Instead, with Monte Carlo we must apply a veil of reason and constantly work to eliminate error by testing and benchmarking against real-world outcomes. Furthermore, whenever possible we should derive our estimates from historical data.
With that said, there are certain realms where Monte Carlo is phenomenal and best suited. As with any technique, I'm simply urging caution. Take a look at some of Monte Carlo's variations, such as Markov Chain Monte Carlo, but that's for another post.
Regression
Example: StatTools
The classic. Good ole regression: fitting a line to a set of points (y = ax + b). Regression provides insights into the relative importance of variables and the drivers of a given outcome. Today, regression takes many forms: linear, logistic, polynomial, MARS, etc. One of the major differences is the "loss function." Most people are familiar with SSE, the sum of squared errors, but there are many more exciting options! Below is an image of a few of the flavors.
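As a quick sketch of the y = ax + b case with the familiar SSE loss, here's an ordinary least-squares fit in Python on synthetic data (the true slope of 2 and intercept of 1 are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the "true" relationship is y = 2x + 1, plus noise
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 0.5, 200)

# Ordinary least squares: minimize the sum of squared errors (SSE)
A = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

sse = np.sum((a * x + b - y) ** 2)
print(f"fit: y = {a:.2f}x + {b:.2f}, SSE = {sse:.1f}")
```

Swapping the loss function (absolute error, Huber, etc.) changes how much influence outliers get, which is one reason the "flavors" below exist.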
Limitations
The key limitations concern the input data, which must be independent and well chosen, and the interpretation of the output. Regression can be deceptively confidence-boosting, particularly on large datasets. If you're feeling extra nerdy, check out this article on the limitations of the p-value (shocking and saddening, I know).
Decision Trees
Example: Numerous
Decision trees are easy to interpret and often output a great visual. They work well in situations where they are predicting a binary outcome, for instance, buy or not buy (1 = buy, 0 = not buy) based on certain characteristics of a consumer/customer. As we progress, the examples I use will focus on marketing because it is relatively easy to follow. However, these models are all greedy data-mongers; they don't care what functional area or industry the data comes from! Below is an elementary example of a decision tree.
Limitations
Decision trees aren't always good with datasets that are dynamically changing. In other words, when what's happening or going to happen doesn't match what happened in the past. Also, they have a tendency to "overfit" the data. That's where your data scientists come in; they're well aware of these problems and are able to "tune," adjust, reconfigure, and test against a holdout dataset.
What's a holdout set? Great question! Essentially, by randomly splitting the data or using cross-validation, analysts can build a model with one set of data and then get the accuracy stats with another set.
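The holdout idea fits in a few lines of Python with scikit-learn. The buy/not-buy dataset here is synthetic, and `max_depth=4` is an arbitrary illustrative setting, just one of the knobs a data scientist would tune to fight overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic buy/not-buy data: 1,000 "customers", 5 characteristic columns
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Hold out 30% of the data: fit on one set, score on the other
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Limiting tree depth is one simple guard against overfitting
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print(f"train accuracy:   {tree.score(X_train, y_train):.2f}")
print(f"holdout accuracy: {tree.score(X_test, y_test):.2f}")
```

A large gap between the train score and the holdout score is the classic symptom of overfitting.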
Another concern is that interpretation is limited because variables exist at different "steps" in the decision tree and errors propagate forward. In other words, mistakes made at the beginning can impact the entire model!
You're still reading?! I'm impressed.
Classification
Example: knn package in R
Although there are a ton of classification algorithms, we'll focus on K-Nearest Neighbors (KNN), simply as a means to convey the logic. Let's say we have a dataset of buyers and non-buyers with lots of characteristic columns (things like age, gender, income, etc.).
Technically speaking, it would be more accurate to describe our data as input/training class-labeled vectors in a multidimensional feature space, but life's too short for that many two-dollar words in one sentence.
We'll stick with Buyers and Non-Buyers with lots of columns. So let's say we have a new list of "prospects": we have the columns of characteristics, but we don't know if they'll become customers. Well, KNN can help predict! In the visual below, let's say that customers are blue squares and non-customers are red triangles.
Then along comes "Green-dot-man." Will he be a customer or not? Well, in this case, what we would predict depends on a few things.
First, how big is "K"? In other words, how many nearby points are we going to consider? If we look at K=3, then we would look at the points inside the solid-line circle and see there are 2 red/non-customers and 1 blue/customer, so we'd predict Green-dot-man is a non-customer. However, if we look at K=5, we'd look within the dotted-line circle and find 3 blue/customers and 2 red/non-customers, so we'd predict he is a customer.
What can we do? Well, we could weight by distance. In other words, we could consider the points that are closest (aka most like Green-dot-man) more heavily than the points further away. In that case, it would likely be a toss-up. However, that is informative too, as our model would give us a "probability" of being a customer. For things like mail campaigns, that is highly relevant!
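A toy version of the picture above, sketched in Python with scikit-learn: the six points and Green-dot-man's coordinates are invented so that K=3 and K=5 disagree, just as in the description:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Blue squares = customers (1), red triangles = non-customers (0),
# with two characteristic columns. Coordinates are made up for the demo.
X = np.array([[2.0, 3.0], [1.5, 2.0], [3.0, 4.0],    # customers
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])   # non-customers
y = np.array([1, 1, 1, 0, 0, 0])

green_dot = np.array([[3.0, 4.5]])

# Majority vote flips depending on K
for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k}: predicted class {knn.predict(green_dot)[0]}")

# Weighting neighbors by distance instead of counting equal votes
knn_w = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
print("weighted probability of customer:",
      round(knn_w.predict_proba(green_dot)[0][1], 2))
```

With `weights="distance"`, closer neighbors count for more, and the probability output is exactly the kind of score you'd rank a mailing list by.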
So when is this a good option? Well, let's think about Amazon for a second. Currently, Amazon recommends products that are "associated" with the product you are looking at or your browsing history. However, that's a pretty loose model.
Rather, a K-Nearest Neighbors model is likely to find that handful of weirdos just like you: those guys that also buy red silk suspenders, rent movies at 9pm on Friday nights, look at pocket protectors that are dishwasher safe, and write shamelessly about their adventures following the purchase of a three-wolves t-shirt. Those recommendations will drive tons of sales! Talk about product discovery!
Limitations
The true downside is that KNN calculates the distance between each Green-dot-man (new point) and every other point. That's a lot of math. Fortunately, there are binning, parallelization, and generalization strategies that can speed up the process.
Here is the original post.
Alex Jones is a graduate student at the University of Texas McCombs School of Business.