The Top Predictive Analytics Pitfalls to Avoid

Predictive modelling and machine learning contribute significantly to business, but they can be highly sensitive to the data they are built on and to changes in that data. That makes it important to use sound techniques and avoid common pitfalls when building data science models.



By Robin Davies, Principa.

Predictive Analytics can yield amazing results. The lift achieved by basing future decisions on observed patterns in historical events can far outweigh anything achieved by relying on gut feel or anecdote. There are numerous examples of this lift across industries; in a recent test in the retail sector, applying stable predictive models gave us a five-fold increase in take-up of the product compared with a random sample. Let’s face it: there would not be so much focus on Predictive Analytics, and Machine Learning in particular, if it were not yielding impressive results.

Read about the lessons we learned while using Machine Learning to predict the 2015 Rugby World Cup results

But predictive models are not bulletproof. They can be a bit like racehorses: somewhat sensitive to changes, with a propensity to leave the rider on the ground wondering what on earth just happened.

The commoditising of Machine Learning is making data science more accessible to non-data-scientists than ever before. With this in mind, my colleague and I sat and pondered, and we devised the following list of top predictive analytics pitfalls to avoid in order to keep your models performing as expected:

  • Making incorrect assumptions about the underlying training data.
    Rushing in and making too many assumptions about the underlying training data can often lead to egg on the proverbial face. Take time to understand the data: the trends in the distributions, the missing values, the outliers, and so on (the first sketch after this list shows the basic checks).
  • Working with low volumes.
    Low volumes are the data scientist’s unhappy place – they lead to statistically weak, unstable and unreliable models.
  • The over-fitting chestnut.
    In other words, creating a model with so many branches that it seems to discriminate the target variable better, but falls over in the real world because it has fitted the noise in the training data rather than the underlying signal (see the cross-validation sketch after this list).
  • Bias in the training data.
    For example, you only offered a certain product to the Millennials. So, guess what? The Millennials are going to come through strongly in the model.
  • Including test data in the training data.
    There have been a few epic fails where the test data has been included in the training data – giving the impression that the model will perform fantastically, when in reality the model is broken. In the predictive analytics world, if the results look too good to be true, spend more time on your validations and get a second opinion to check over your work (a leakage sketch follows this list).
  • Not being creative with the provided data.
    Predictive models can be significantly improved by creating clever characteristics, or features, that better explain the trends in the data. Too often data scientists work only with what has been provided and do not spend enough time deriving more creative features from the underlying data – features that can strengthen a model in ways an improved algorithm cannot (a small feature-derivation sketch follows this list).
  • Expecting machines to understand business.
    Machines cannot (yet) figure out what the business problem is or how best to tackle it. This is not always straightforward and can require careful thought, including thorough discussions with the business stakeholders.
  • Using the wrong metric to measure the performance of a model.
    For example, out of 10,000 cases only two are fraudulent and 9,998 are not. If the performance metric used in model training is plain accuracy, the model will simply maximise accuracy: predicting all 10,000 cases as non-fraud gives an accuracy of 99.98%, which looks amazing but serves no purpose in identifying fraud – it merely classifies 99.98% of the non-fraud instances correctly. For rare-event modelling (of which fraud is a good example), alternative metrics and approaches need to be applied (this is worked through in a sketch after this list).
  • Using plain linear models on non-linear interactions.
    This commonly happens when, for example, a binary classifier is needed and logistic regression is chosen as the preferred method, when in reality the relationship between the features and the target is not linear. Tree-based models or support vector machines work better in such cases (a small comparison follows this list). Not knowing which methods are applicable to which problems results in poor models and poor predictions.
  • Forgetting about outliers.
    Outliers usually deserve special attention or should be ignored entirely; some modelling methods are extremely sensitive to outliers, and forgetting to remove or cater for them can cause poor performance in your model.
  • Performing regularisation without standardisation.
    Many practitioners do not realise that applying regularisation to a model’s features without first standardising the data, so that everything is on the same scale, is a problem: the regularisation will be biased, because it penalises features on smaller scales more heavily. Imagine one feature on a scale of 3,000 – 10,000, another on a scale of 0 – 1, and another on a scale of -9,999 to 9,999 – the penalty hits the 0 – 1 feature hardest, because that feature needs a large coefficient to have any effect (a sketch follows this list).
  • Not taking into account the real-time scoring environment.
    Practitioners can sometimes get distracted by building the perfect model, only to find at deployment that it is so complex it cannot be integrated into the operational system.
  • Using characteristics that will not be available in the future, due to operational reasons.
    One may identify a very predictive characteristic (like gender), but due to regulations the field cannot be used in modelling, or the capturing of the field has been suspended and it will not be available for use in the model going forward.
  • Not considering the real-world implications and possible fallout of applying effective predictive analytics.
    American retailer Target made headlines four years ago when New York Times reporter Charles Duhigg brought to the public’s attention the now famous incident of Target’s analytics models predicting a teenager’s pregnancy before her father knew.  As some have pointed out, just because you can, doesn’t mean you should.
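
A few of the pitfalls above are easiest to see in a few lines of code. The sketches that follow are minimal illustrations in Python with pandas and scikit-learn, using made-up data and invented column names; they are not taken from any of our production models. First, the basic checks that guard against incorrect assumptions about the training data: distributions, missing values and outliers.

```python
import numpy as np
import pandas as pd

# A toy training set with a missing value and an extreme income baked in.
df = pd.DataFrame({
    "income":    [32_000, 45_000, 51_000, np.nan, 48_000, 1_200_000],
    "age":       [23, 35, 41, 29, 52, 38],
    "purchased": [0, 1, 1, 0, 1, 1],
})

print(df.describe())        # distributions: mean, spread, min/max per column
print(df.isna().sum())      # missing values per column

# A simple inter-quartile-range rule to flag candidate outliers in income.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)])
```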
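
On the over-fitting chestnut: an unpruned tree will fit its own training data almost perfectly, but cross-validation exposes how much of that is noise. A minimal sketch, assuming synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                               # ten features, mostly noise
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)   # only feature 0 matters

deep = DecisionTreeClassifier(max_depth=None, random_state=0)    # grows many branches
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)

deep.fit(X, y)
print("deep tree, accuracy on its own training data:", deep.score(X, y))   # essentially 1.0
print("deep tree, 5-fold CV accuracy:   ", cross_val_score(deep, X, y, cv=5).mean())
print("shallow tree, 5-fold CV accuracy:", cross_val_score(shallow, X, y, cv=5).mean())
```

The seemingly better discrimination of the deep tree evaporates on data it has not seen.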
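
On including test data in the training data: the leaky model below scores its "test" set almost perfectly because it has already memorised those rows. A sketch on a synthetic dataset with a 1-nearest-neighbour classifier, chosen because it makes the leakage obvious:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2_000, n_features=20, flip_y=0.1, random_state=0)

# Correct: carve the test set out before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
honest = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("honest test accuracy:", honest.score(X_test, y_test))

# The pitfall: the 'test' rows were also part of the training data.
leaky = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("leaky test accuracy: ", leaky.score(X_test, y_test))   # near-perfect, and meaningless
```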
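
On being creative with the provided data: derived characteristics such as totals, averages, counts and recency often explain behaviour better than the raw columns do. A small sketch on hypothetical transaction-level data (the column names are invented for illustration):

```python
import pandas as pd

# Raw transaction-level data, invented for this example.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount":      [120.0, 80.0, 200.0, 15.0, 25.0, 500.0],
    "days_ago":    [3, 40, 90, 5, 10, 200],
})

# Roll the raw rows up into customer-level characteristics.
features = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_transactions=("amount", "size"),
    days_since_last=("days_ago", "min"),
)
print(features)
```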
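
On using the wrong metric: the fraud example above, worked through in code. A model that never predicts fraud scores 99.98% on accuracy but zero on recall, which is why rare-event problems need metrics that look at the rare class directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10,000 cases: 2 fraudulent, 9,998 legitimate.
y_true = np.zeros(10_000, dtype=int)
y_true[:2] = 1

# A "model" that simply calls everything non-fraud.
y_pred = np.zeros(10_000, dtype=int)

print("accuracy: ", accuracy_score(y_true, y_pred))                    # 0.9998 - looks amazing
print("recall:   ", recall_score(y_true, y_pred))                      # 0.0 - catches no fraud at all
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0 - never a correct fraud call
```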
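
On using plain linear models for non-linear interactions: in the sketch below the target depends on the product of two features, so a logistic regression does no better than chance while a tree ensemble picks the pattern up. Synthetic data, purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1_000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # target driven by an interaction, not either feature alone

linear = LogisticRegression()
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("logistic regression, CV accuracy:", cross_val_score(linear, X, y, cv=5).mean())  # around 0.5
print("random forest, CV accuracy:      ", cross_val_score(forest, X, y, cv=5).mean())  # close to 1.0
```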
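
Finally, on regularisation without standardisation: the sketch below puts three features on roughly the scales mentioned above. Without scaling, a ridge penalty shrinks the 0 – 1 feature hardest, because that feature needs a large coefficient to have any effect; standardising first removes the bias. Synthetic regression data, again purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=3, n_informative=3, noise=10, random_state=0)
X[:, 0] = X[:, 0] * 1_000 + 6_500                                      # roughly 3,000 - 10,000
X[:, 1] = (X[:, 1] - X[:, 1].min()) / (X[:, 1].max() - X[:, 1].min())  # 0 - 1
X[:, 2] = X[:, 2] * 3_000                                              # roughly -9,999 to 9,999

ols = LinearRegression().fit(X, y)        # no penalty: a fair baseline
ridge_raw = Ridge(alpha=10).fit(X, y)     # penalty applied to unscaled features

# Penalise on standardised features, then convert back to original units for comparison.
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=10)).fit(X, y)
coef_back = ridge_scaled.named_steps["ridge"].coef_ / ridge_scaled.named_steps["standardscaler"].scale_

print("0-1 feature coefficient, no penalty:         ", ols.coef_[1])
print("0-1 feature coefficient, ridge on raw data:  ", ridge_raw.coef_[1])  # shrunk hard
print("0-1 feature coefficient, ridge after scaling:", coef_back[1])        # close to the unpenalised value
```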

We’ve collectively learned some valuable lessons using predictive analytics over the years. Why not take advantage? If you’d like some further guidance in navigating the predictive analytics field, drop us a line and get in touch! We’d be happy to meet up for an informal chat over some coffee, share some knowledge and learn about your predictive analytics projects and plans.

Original. Reposted with permission.

Bio: Robin Davies is the Head of Decision Analytics at Principa. Robin’s team has a winning track record using descriptive, predictive and prescriptive analytical techniques within the financial services, marketing and loyalty sectors.
