Dealing with Data Leakage
Target leakage and data leakage represent challenging problems in machine learning. Be prepared to recognize and avoid these potentially messy problems.
By Susan Currie Sivek, Ph.D., Senior Data Science Journalist
You’re studying for an upcoming exam. The exam is open-book, so you’re using your reference materials as you review, and you’re doing great.
But when you show up on test day, suddenly you’re told the exam isn’t open-book anymore. It doesn’t go so well.
This sounds like an academic overachiever’s anxiety dream, but it’s similar to what’s happening when target leakage occurs in a machine learning model. Say you build a model that’s intended to predict a certain outcome, and you train it with information that helps the model make its prediction. That model may perform well ... perhaps suspiciously well. But if some of that information won’t be available to the model at the actual time it has to make its prediction, its real performance will be lower. That’s the result of target leakage — a data scientist’s anxiety dream!
I recently heard one of our in-house Alteryx experts call target leakage the toughest problem in machine learning. But how does it happen, and how can you avoid this issue with your models? And how does it relate to “data leakage” more generally?
Target leakage occurs when a model is trained with data that it will not have available at the time of prediction. The model does well when it is initially trained and tested, but when it’s put into production, the lack of that now-missing data causes the model to perform poorly. Just like you studying with your books, then taking the exam without them, the model is missing helpful information that improved its performance during training.
Here are some scenarios that represent target leakage:
- Including the outcome to be predicted as a feature in the dataset used to train the model (this may sound silly, but it could happen; for example, duplicating and renaming your target variable field, then forgetting about that duplication, could lead you to inadvertently use the extra version of the target as a predictor);
- Including a feature representing the number of years a student attended a college in a model predicting whether the student would accept an offer of admission to that college;
- Including a feature representing the number of months of a subscription in a model predicting whether a potential customer would subscribe or not;
- Including a feature representing whether a fire-related insurance claim was approved in a model predicting fires in homes with a certain type of siding; and
- Including information from other datasets that introduces details not otherwise available to the model at the time of prediction.
In all of these cases, information that can’t be known at the time of prediction was included when the model was built. We can’t know how many months a customer will subscribe when we are still trying to figure out if they’ll subscribe in the first place. (Can we build a model that could try to predict how many months a subscriber will subscribe? Sure. But we’d base that on our data about known subscribers, not the entire pool of those who may or may not subscribe.) Similarly, if we’ve told a model that people in homes built with certain materials have made fire insurance claims, we’re introducing knowledge from after the fires have occurred into our model trying to predict the fires.
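The first scenario in the list above, a forgotten copy of the target, is one of the few that can be checked mechanically. Here is a minimal sketch using pandas; the column names and values are made up for illustration, and the check only catches exact copies, not rescaled or recoded versions of the target.

```python
import pandas as pd

# Hypothetical toy data: "fire_flag_copy" is a forgotten duplicate of the target.
df = pd.DataFrame({
    "siding_type": ["vinyl", "wood", "wood", "brick"],
    "had_fire": [0, 1, 1, 0],
    "fire_flag_copy": [0, 1, 1, 0],
})
target = "had_fire"

# Flag any other column whose values (and dtype) exactly match the target.
duplicates = [
    col for col in df.columns
    if col != target and df[col].equals(df[target])
]
print(duplicates)  # ['fire_flag_copy']
```

The other scenarios (subscription months, approved claims) can't be caught this way; they take knowledge of when each feature actually becomes available.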
Even seemingly innocent details like file size or timestamps can unintentionally be proxies for a target variable. For example, a 2013 Kaggle competition had to be paused and the dataset revamped because of this kind of issue. The team that discovered (and diligently reported) the leakage enjoyed a brief stint at the top of the leaderboard!
What results from data leakage is overfitting to your training data. Your model can be very good at predicting with that extra knowledge — excelling on the open-book exam — but not so good when that information isn’t provided at prediction time.
Another form of data leakage is sometimes called “train-test contamination.” This problem may not specifically involve your target variable, but it does affect model performance. It’s another way we might inadvertently add knowledge about future data into our training data, resulting in performance metrics that look better than they would in production. (By the way, if you look for more reading on this topic, be forewarned that “data leakage” is also a term sometimes used by cybersecurity folks to talk about data breaches.)
A common cause of train-test contamination is preprocessing your dataset in its entirety before splitting it into training and test sets or before running cross-validation.
For example, normalizing data requires using the numerical range of each variable in the dataset. Normalizing the entire dataset as a whole provides that “knowledge” to the model when it’s evaluated. However, a model that is put into production won’t have that knowledge, and so won’t perform as well when it is used for prediction. Similarly, standardizing the full dataset would inappropriately inform the model about the mean and standard deviation of the entire dataset. Imputing missing values also uses summary statistics about your dataset (e.g., median, mean).
All of these clues can help the model perform better on your training and test data than it will when it is eventually introduced to brand-new data. This article provides an in-depth exploration of this kind of data leakage, including code to demonstrate.
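To make the normalization example concrete, here is a small sketch with NumPy. The numbers and the `min_max_scale` helper are invented for illustration; the point is only that the range computed on the full dataset is shaped by rows the model is not supposed to have seen.

```python
import numpy as np

# Hypothetical toy feature: the largest value happens to fall in the test rows.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
train, test = values[:4], values[4:]

def min_max_scale(x, lo, hi):
    """Rescale x to the [0, 1] range defined by lo and hi."""
    return (x - lo) / (hi - lo)

# Leaky: the range is computed on the whole dataset, test rows included.
leaky_train = min_max_scale(train, values.min(), values.max())

# Clean: the range is computed on the training rows only.
clean_train = min_max_scale(train, train.min(), train.max())

print(leaky_train.max())  # about 0.03, squeezed by the test-set value of 100
print(clean_train.max())  # 1.0
```

The training features end up looking very different depending on whether the test rows were allowed to influence the scaling, which is exactly the "knowledge" that a production model won't have.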
Another issue can emerge if you’re using k-fold cross-validation to evaluate your model. As long as your dataset includes only one observation from each individual person/source, this type of leakage should not be an issue for you. However, if you have multiple observations (i.e., rows of data) from each person or source in your dataset, all of those observations from the same source need to be grouped together when the subsets or “folds” of your data are created for training and testing the model.
For example, you may end up using training data from person A to predict an outcome for test data from person A, if observations from person A end up included in both the training group and the test group. The model will seem to perform better on the test set — which again includes person A — because it already knows something about person A from the training set. But in production, it won’t have that advantage of prior exposure. For more elaboration on this issue (sometimes called “group leakage”), check out this article.
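If you work in Python, scikit-learn's `GroupKFold` is one way to keep each person's rows together. The sketch below uses made-up data (four people, two rows each) and simply checks that no person shows up on both sides of any fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 8 rows from 4 people, two rows per person.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array(["A", "A", "B", "B", "C", "C", "D", "D"])

cv = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # People who appear in both the training rows and the test rows.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("test people:", sorted(set(groups[test_idx])), "| shared with train:", shared)
```

A plain `KFold` on the same rows offers no such guarantee, because it splits by row position without looking at who each row belongs to.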
Dealing with Data Leakage
When a faucet drips in your house, you know it by the sound and the puddles. These types of leakage can be much harder to detect, but there are still preventive maintenance steps and repairs you can use to address them.
Unusually good model performance may be a sign of leakage. If your model is performing shockingly, remarkably well, resist the temptation to pat yourself on the back and ship it. That performance might be the result of garden-variety overfitting, but it may also be reflecting target or data leakage.
To try to stave off data leakage in the first place, you can do thorough exploratory data analysis (EDA) and look for features that have especially high correlations with your outcome variable. It’s worth looking closely at those relationships to check whether any highly correlated feature could be leaking information about the target before you use it in the model. This review can be challenging if you have a high-dimensional dataset with many features, so using a tool like the Pearson Correlation Tool in Alteryx Designer and filtering and/or visualizing its output could be helpful.
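Outside of Designer, the same kind of screen can be sketched in a few lines of pandas. Everything here is an assumption for illustration: the data is synthetic, the column names echo the subscription example from earlier, and the 0.6 cutoff is arbitrary. A high correlation is a prompt to investigate, not proof of leakage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
subscribed = rng.integers(0, 2, size=n)  # synthetic target

df = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "site_visits": rng.poisson(3, size=n) + subscribed,
    # Leaky by construction: only nonzero for people who did subscribe.
    "months_subscribed": subscribed * rng.integers(1, 25, size=n),
    "subscribed": subscribed,
})

# Absolute Pearson correlation of every feature with the target.
corr = (
    df.drop(columns="subscribed")
    .corrwith(df["subscribed"])
    .abs()
    .sort_values(ascending=False)
)

THRESHOLD = 0.6  # arbitrary cutoff chosen for this sketch
suspects = corr[corr > THRESHOLD].index.tolist()
print(corr.round(2))
print("features to review:", suspects)
```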
To avoid train-test contamination, be sure to split your data into training/test/holdout sets before applying any transformations such as normalization. Fit each transformation on the training set only, then train the model on the transformed training set. Follow up by applying the transformations to the test/holdout sets with the same parameters learned from the training set, then test your model’s performance.
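In scikit-learn, one common way to follow this order is to split first and then wrap the preprocessing and the model in a pipeline. The sketch below uses synthetic data and a logistic regression purely as placeholders; the check at the end confirms that the scaler's mean came from the training rows alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# 1. Split before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Fitting the pipeline fits the scaler on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 3. Scoring reuses the training-set mean and standard deviation on the test rows.
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
print(test_accuracy)
```

Passing the same pipeline to a cross-validation helper such as `cross_val_score` refits the scaler inside each fold, so the same protection carries over.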
Additionally, be sure you fully understand all the features in your dataset. This Kaggle example shows how a feature named “expenditures” in a credit card application approval dataset could cause target leakage if used in a model to predict approved applications. “Expenditures” as a feature name could mean a lot of things, but in this case, it referred to how much credit card users spent on their cards. That feature implied that they were indeed approved for the cards and informed the model’s prediction of approvals inappropriately, since expenditure information would not be available at the time of prediction.
Finally, checking your features’ relative importance and reviewing other interpretability tools could help you catch leakage. In the credit card application approval example above, “expenditures” might have looked like a really important feature in a prediction of credit card approvals. If in doubt, you can try removing a feature to see how your model’s performance changes, and then determine whether the model suddenly performs at a more realistic level.
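The "remove it and see" check can be sketched as a quick comparison of cross-validated scores. The data below is synthetic and only loosely modeled on the credit card example: `expenditures` is built to be nonzero only for approved applicants, which is an assumption made for this illustration rather than a description of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
income = rng.normal(size=n)
approved = (income + rng.normal(size=n) > 0).astype(int)

# Leaky by construction: spending exists only after a card is approved.
expenditures = approved * rng.uniform(1.0, 5.0, size=n)

X_with = np.column_stack([income, expenditures])
X_without = income.reshape(-1, 1)

clf = LogisticRegression(max_iter=1000)
score_with = cross_val_score(clf, X_with, approved, cv=5).mean()
score_without = cross_val_score(clf, X_without, approved, cv=5).mean()

print(f"accuracy with expenditures:    {score_with:.2f}")
print(f"accuracy without expenditures: {score_without:.2f}")
```

A large drop to a more believable score after removing one feature is a strong hint that the feature deserves a closer look.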
Leakproofing Your Models
I hope this overview has provided you some new plumbing skills to help you avoid these leaky situations! With careful EDA and thorough knowledge of your dataset, as well as correct preprocessing and cross-validation setup, you should be able to keep your targets nicely contained and your datasets free from contamination.
Originally published on the Alteryx Community Data Science Blog.
Bio: Susan Currie Sivek, Ph.D. is the senior data science journalist for the Alteryx Community, where she explores data science concepts with a global audience. She is also the host of the Data Science Mixer podcast. Her background in academia and social science informs her approach to investigating data and communicating complex ideas — with a dash of creativity from her training in journalism.
Original. Reposted with permission.