Careful! Looking at your model results too much can cause information leakage

We all are aware of the issue of overfitting, which is essentially where the model you build replicates the training data results so perfectly its fitted to the training data and does not generalise to better represent the population the data comes to, with catastrophic results when you feed in new data and get very odd results.

comments

By Paul May, Consultant Data Scientist

It’s always to use as much data as you can when building machine learning models. I think we all are aware of the issue of overfitting, which is essentially where the model you build replicates the training data results so perfectly its fitted to the training data and does not generalise to better represent the population the data comes to, with catastrophic results when you feed in new data and get very odd results.

Photo by Jacek Dylag on Unsplash

What can you do to prevent this?

Look online and you will find reams and reams of articles about overfitting, what it is and how to prevent it. These have been done to a great degree so I won’t repeat them all here but usually these can be controlled most easily by segregating the available data you have before you start.

What data segregation's are there?

The most basic way is to take the whole set of available data and split it into three different samples (with no duplication) and given the following labels:

Training
Validation
Test

Training, Test and Validation sets of data each carry out a specific purposeand will usually not be of the same size. Indeed some of the most used values are a 3:1:1 split (or 60% training, 20% validation and 20% test)

Training Data

This is the meat of the data. This is what is used to train and create the models you use. Naturally the more data you have in training the better you can expect the model to generalise to it. This generally has around 60% of all the available data in it.

Validation Data

This is the set of data that is used for tuning and improving your model. Whenever you train a model you use it against the validation set in order to generate some predictions and score its performance. You can then tweak the model, re-train it and run it against this data again to see if an improvement has been made. This can be a very iterative step. You want a reasonable size of data here to cover the parameter space the model will be exposed to, but not too much that your model does not have enough data to train on. This is generally around 20% of all the available data.

Test Data

This data is put aside at the start of the project and is not touched until you are happy with your model. You then apply it to this data to see if it really has learnt to generalise well or if it has been overfitted. Similar to the validation data set you want enough to cover what you want to test and get confidence your model is working correctly, but you do not want too much taken away from what can be trained on. This is generally around 20% again of the available data.

You mentioned covering the parameter space?

I did. For example, if you are going to run the model against flow rates of water in different diameter of pipes, you want to make sure you have data in validation (and in fact all three sets) to make sure the model can learn this ability and be tested against it. If you only train on data for one pipe diameter there’s a very low chance it’ll learn that there is a relationship between water flow and pipe diameter, resulting in poor results for the other diameters.

Keeping enough data around for training is therefore important.

Photo by T L on Unsplash

Conserving Data

If 40% of the data is being used just to tune and do the final scoring of your model you may need a lot of data for a complicated model that needs a lot of heavily varying features. To increase the amount of data available for training data scientists will often use cross-validation.

Cross-validation is where instead of the data in the train and validation sets being fixed they are instead pooled and then systematically split up into segments (or “folds”) and then assigned to each of these groups. For example, for a four fold cross-validation the data put aside for test and validation is split into four equal groups and each of those groups takes a turn being the validation set with the other three being pooled into the training set.

Ultimately this means you have got a prediction against each data point and also used it for training the model (but not at the same time) and are leveraging more of the data. Cross-validation is also used this way to tune the hyper-parameters of specific model you are using in order to improve the accuracy.

So what is the problem?

Well you are still only using 80% of the data and there is always an urge to take a peek of your current models results using the test set. That is not usually a problem, but only if you control how you use the data you have got.

Some people will train and tune the model comparing the accuracy of the validation set against the test set and stop adjusting the model once the accuracy of the test set no longer improves but gets worse. Usually in these situations the validation set will keep improving even as the test set gets worse and is nicely illustrating the model becoming overfitted.

The issue is where you use the test set to check your model too much as slowly but surely you are doing to the test set what is happening to the validation set described above. For poor data or a very complicated model you may end up running it against the test set multiple times and without realising you are overfitting because you are using the information you see and making changes to the model.

For example, you get your final model and run it against the test set and while the results are okay you notice that it could probably be improved if you change a parameter. You change it and the model is improved, but you have used information about the test set to feed into the model. Information has been leaked between the groups.

Information Leakage

Information leakage is when information from a sealed system leaks out somewhere it shouldn’t. This is a term often used in cryptography where a secure system gives clues out to an eavesdropper about what is insider it and this may be used to eventually break the security.

So here the conduit is the data scientist, who inadvertently is using their own knowledge and skill and slowly adjusting a model based on how they see it performing against the test set.

I should add that this is not generally a big problem, but is one I’ve seen happen once or twice. Also my approach is not the only way of controlling it, but I give my opinion below.

What can be done?

This is something that is easily controlled once you are aware it might be a problem. The easiest solution is to make sure you aren’t using the final test set many times and restrict what you change based on the results.

For model tuning it may be beneficial to actually take another small set out of the train and validation data pool and use that as a proxy for the test set. This way you can still tune your model and see how the improvements to the model are coming along (and stopping as this additional test set gets worse) and also have the benefit of retaining a final test set to show off the final results.

While this sounds like more data is being reserved and taken away from training a cross-validation approach can help to mitigate this. For example, you could take 10 to 20% of the 80% and you’re still training and validating on the remaining 60 to 70% (which is similar to the non cross-validation approach).

Key Points

Photo by Laurenz Kleinheider on Unsplash

When building a complicated model that takes a lot of tuning or adjustments be aware that:

Testing against the final reserved test set can lead to overfitting
Restrict how much you modify the model based on your final test set results
You can be a conduit for over-fitting your model
Cross-validation can help you to use the most of the data you have
If fitting a lot of hyper-parameters an extra test set can be useful

Bio: Paul May is a Consultant Data Scientist, working to extract value out of both available data sets but also in combination with other novel sources to increase their effectiveness. He employs both machine learning techniques and statistical analysis methods to manipulate and present data to a wide variety of sources.

Original. Reposted with permission.

Related: