3 Ways to Test the Accuracy of Your Predictive Models

Three leading analytics experts, Karl Rexer, John Elder, and Dean Abbott, explain three different methods for testing the accuracy of predictive models: lift charts, randomization testing, and bootstrap sampling.

Plotting Success, Victoria Garment, Jan 29, 2014

In data mining, data scientists use algorithms to identify previously unrecognized patterns and trends hidden within vast amounts of structured and unstructured information. These patterns are used to create predictive models that try to forecast future behavior.

... There are many different tests to determine if the predictive models you create are accurate, meaningful representations that will prove valuable to your organization, but which are best? To find out, we spoke to three top data mining experts. Here, we reveal the tests they use to measure their own results, and what makes each test so effective.

Compare Predictive Model Performance Against Random Results With Lift Charts and Decile Tables

Karl Rexer is the founder of Rexer Analytics, a small consulting firm that provides predictive modeling and other analytic solutions to clients such as Deutsche Bank, CVS, Redbox and ADT Security Services. The measures his firm uses to create predictive models are often binary: people either convert on a website or don't, bank customers close their accounts or leave them open, etc.

Rexer's firm creates models that help clients determine how likely people are to complete a binary behavior. These models are created with algorithms that typically use historical data pulled from the client's data warehouse to characterize behaviors and identify patterns.

To test the strength of these models, Rexer's firm frequently uses lift charts and decile tables, which measure the performance of the model against random guessing, or what the results would be if you didn't use any model.

... data can also be plotted on a lift chart to create a visual representation of model performance. If no model were used and leads were contacted randomly, the result would be a straight diagonal line (represented in red in the chart below). Contacting the first decile, or the first 10 percent of leads, would yield 10 percent of sales, contacting 20 percent of leads would yield 20 percent of sales, and so on.

Lift Chart

In this chart, the predictive model is represented by the curved blue line. The red X marks the lift of the first decile above the random model. The lift of the first decile is 4.0: the model captures 40 percent of total sales in the first decile, four times the 10 percent that random selection would capture. This indicates that if one were to select the top 10 percent of leads with the highest model scores, one would obtain 40 percent of total sales, which is substantially better than random.
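The arithmetic behind a decile table can be sketched in a few lines of Python. This is a minimal illustration, not Rexer Analytics' actual tooling: the function name decile_table, the synthetic scores, and the simulated outcomes are all assumptions made for the example. Leads are ranked by model score, and for each cumulative decile the share of sales captured is divided by the share of leads contacted to give the cumulative lift.

```python
import random


def decile_table(scores, outcomes, n_bins=10):
    """Return rows of (bin, cumulative share of leads,
    cumulative share of sales, cumulative lift)."""
    # Rank leads from highest to lowest model score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_sales = sum(outcome for _, outcome in ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)           # leads contacted so far
        captured = sum(outcome for _, outcome in ranked[:cutoff])
        share_leads = cutoff / n
        share_sales = captured / total_sales
        # Random contact would capture share_leads of sales, so lift is the ratio.
        rows.append((b, share_leads, share_sales, share_sales / share_leads))
    return rows


# Synthetic data (an assumption for illustration): higher scores convert more often.
rng = random.Random(0)
scores = [rng.random() for _ in range(10_000)]
outcomes = [1 if rng.random() < 0.2 * s ** 3 else 0 for s in scores]

table = decile_table(scores, outcomes)
for b, leads, sales, lift in table:
    print(f"decile {b:2d}: {leads:4.0%} of leads, {sales:6.1%} of sales, lift {lift:.2f}")
```

Plotting the share of sales against the share of leads from these rows gives the curved model line of a lift chart, and the diagonal where the two shares are equal gives the random baseline. The last row always has a lift of 1.0, because contacting every lead captures every sale with or without a model.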

Here is the rest of the explanation of lift charts and decile tables.

Evaluate the Validity of Your Discovery With Target Shuffling

John Elder is the founder of data mining and predictive analytics services firm Elder Research. He tests the statistical accuracy of his data mining results through a process called target shuffling [also known as Randomization Testing]. It's a method Elder says is particularly useful for identifying false positives: cases where two events or variables that occur together are perceived to have a cause-and-effect relationship when the link is merely coincidental.

"The more variables you have, the easier it becomes to 'oversearch' and identify (false) patterns among them," Elder says. He calls this the "vast search effect."

As an example, he points to the Redskins Rule, where for over 70 years, if the Washington Redskins won their last home football game, the incumbent party would win the presidential election. "There's no real relationship between these two things," Elder says, "but for generations, they just happened to line up."

Histogram comparing the success of a model to that of shuffled models

In the histogram pictured above, the model scored in the high 20s. Only 0.04 percent of the random, shuffled models performed better, meaning the model is significant to that level (and would meet the criteria of a publishable result in any journal).
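The shuffling idea can be sketched as follows. This is a hedged illustration rather than Elder Research's procedure: the "model" here is simply a search for the candidate variable with the strongest absolute Pearson correlation to the target, and the names pearson, best_abs_correlation, and target_shuffle_p_value, along with the pure-noise data, are assumptions made for the example. The same search is repeated on many shuffled copies of the target, and the share of shuffled runs that match or beat the real result estimates how often chance alone would do as well.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def best_abs_correlation(features, target):
    """Search every candidate variable and keep the strongest |r| found."""
    return max(abs(pearson(f, target)) for f in features)


def target_shuffle_p_value(features, target, n_shuffles=200, seed=0):
    """Share of shuffled-target searches that match or beat the real search."""
    rng = random.Random(seed)
    observed = best_abs_correlation(features, target)
    shuffled = list(target)
    beats = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between inputs and target
        if best_abs_correlation(features, shuffled) >= observed:
            beats += 1
    return observed, beats / n_shuffles


# Synthetic data (an assumption): 20 candidate variables of pure noise, 30 rows.
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(20)]
target = [rng.gauss(0, 1) for _ in range(30)]

observed, p_value = target_shuffle_p_value(features, target)
print(f"best |r| found by searching: {observed:.2f}")
print(f"share of shuffled runs doing at least as well: {p_value:.2%}")
```

Because every variable in this data is noise, the best correlation found by searching is the kind of pattern the vast search effect produces, and the shuffled runs show how often a search of the same size finds something just as strong by chance. The key design point is that each shuffled run repeats the whole search, not just the final fit.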

Here is the rest of the explanation of reshuffling.

Test Predictive Model Consistency With Bootstrap Sampling

Like Elder, Dean Abbott, president of Abbott Analytics, Inc., says predictive analytics and data mining types typically don't use the traditional statistical tests taught in college statistics classes to assess models. Instead, they assess them with data.

"One big reason for this is that everything passes statistical tests with significance," he says. "If you have a million records, everything looks like it's good."

According to Abbott, there's a difference between statistical significance and what he calls operational significance. "You can have a model that is statistically significant, but it doesn't mean that it's generating enough revenue to be interesting," he explains.

"You might come up with a model for a marketing campaign that winds up generating $40,000 in additional revenue, but that's not enough to even cover the cost of the modeler who built it."

This, Abbott says, is where many data miners fall short. He gives the example of comparing two different models, where you're looking at the lift in the third decile for each. In one model, the lift is 3.1, while in the other it is 3.05.

"Data miners would typically say, 'Ah! The 3.1 model is higher, let me go with that,'" he says. "Well, that's true, but is it [operationally] significant? And this is where I like to use other methods to find out if one is truly better than the other. It's a more intuitive way to get there."
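One way to check whether a small gap in lift is real is to bootstrap it. The sketch below is an assumption-laden illustration, not Abbott's exact procedure: it treats "lift in the third decile" as cumulative lift through the top 30 percent of records, uses a simple percentile interval, and runs on synthetic data with two hypothetical models. The names lift_at and bootstrap_lift_difference are invented for the example. Records are resampled with replacement many times, both models are scored on each resample, and the spread of the difference in lift shows whether one model is consistently ahead.

```python
import random


def lift_at(scores, outcomes, fraction=0.3):
    """Cumulative lift when contacting the top `fraction` of records by score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:cutoff]) / cutoff
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate


def bootstrap_lift_difference(scores_a, scores_b, outcomes, n_boot=300, seed=0):
    """Approximate 95 percent percentile interval for lift(A) - lift(B)."""
    rng = random.Random(seed)
    n = len(outcomes)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        y = [outcomes[i] for i in idx]
        if sum(y) == 0:
            continue  # lift is undefined when a resample has no positives
        lift_a = lift_at([scores_a[i] for i in idx], y)
        lift_b = lift_at([scores_b[i] for i in idx], y)
        diffs.append(lift_a - lift_b)
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[int(0.975 * len(diffs)) - 1]
    return lower, upper


# Synthetic data (an assumption): two models that are noisy views of the same signal.
rng = random.Random(2)
truth = [rng.random() for _ in range(2_000)]
outcomes = [1 if rng.random() < 0.3 * p else 0 for p in truth]
model_a = [p + rng.gauss(0, 0.30) for p in truth]
model_b = [p + rng.gauss(0, 0.35) for p in truth]

lower, upper = bootstrap_lift_difference(model_a, model_b, outcomes)
print(f"lift A: {lift_at(model_a, outcomes):.2f}, lift B: {lift_at(model_b, outcomes):.2f}")
print(f"approx. 95% interval for the difference: [{lower:.2f}, {upper:.2f}]")
```

If the interval comfortably includes zero, the data does not show that the model with the higher lift is reliably better. Whether a difference that does hold up is large enough to matter is the separate question of operational significance, which depends on the revenue at stake rather than on the resampling.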

Read more about these tests on the Plotting Success blog by Software Advice, a company that reviews business intelligence software.