Adversarial Validation, Explained

This post proposes and outlines adversarial validation, a method for selecting training examples most similar to test examples and using them as a validation set, and provides a practical scenario for its usefulness.

In this second article on adversarial validation we get to the meat of the matter: what we can do when train and test sets differ. Will we be able to make a better validation set?

The problem with training examples being different from test examples is that validation won’t be any good for comparing models. That’s because validation examples originate in the training set.

We can see this effect when using Numerai data, which comes from financial time series. We first tried logistic regression and got the following validation scores:

AUC: 52.67%, accuracy: 52.74%

MinMaxScaler + LR
AUC: 53.52%, accuracy: 52.48%

What about a more expressive model, like logistic regression with polynomial features (that is, feature interactions)? They’re easy to create with scikit-learn:

from sklearn.pipeline import make_pipeline

poly_scaled_lr = make_pipeline( PolynomialFeatures(), MinMaxScaler(), LogisticRegression())

This pipeline looked much better in validation than plain logistic regression, and also better thanMinMaxScaler + LR combo:

PolynomialFeatures + MinMaxScaler + LR
AUC: 53.62%, accuracy: 53.04%

So that’s a no-brainer, right? Here are the actual leaderboard scores (from the earlier round of the tournament, using AUC):

# AUC 0.51706 / LR
# AUC 0.52781 / MinMaxScaler + LR
# AUC 0.51784 / PolynomialFeatures + MinMaxScaler + LR

After all, poly features do about as well as plain LR. Scaler + LR seems to be the best option.

We couldn’t tell that from validation, so it appears that we can’t trust it for selecting models and their parameters.


We’d like to have a validation set representative of the Numerai test set. To that end, we’ll take care to select examples for the validation set which are the most similar to the test set.

Specifically, we’ll run the distinguishing classifier in cross-validation mode, to get predictions for all training examples. Then we’ll see which training examples are misclassified as test and use them for validation.

To be more precise, we’ll choose a number of misclassified examples that the model was most certain about. It means that they look like test examples but in reality are training examples.

Numerai after PCA

Numerai data after PCA. Training set in red, test in turquoise. Quite regular, shaped like a sphere...

Numerai image 2

Or maybe like a cube? Anyway, sets look difficult to separate.

UPDATE: Now you can create 3D visualizations of your own data sets. Visit and upload a CSV or libsvm-formatted file.

First, let’s try training a classifier to tell train from test, just like we did with the Santander data. Mechanics are the same, but instead of 0.5, we get 0.87 AUC, meaning that the model is able to classify the examples pretty well (at least in terms of AUC, which measures ordering/ranking).

By the way, there are only about 50 training examples that random forest misclassifies as test examples (assigning probability greater than 0.5). We work with what we have and mostly care about the order, though.

Cross-validation provides predictions for all the training points. Now we’d like to sort the training points by their estimated probability of being test examples.

i = predictions.argsort()

train['p'] = predictions
train_sorted = train.iloc[i]

Validation and predictions, take two

We did the ascending sort, so for validation we take a desired number of examples from the end:

val_size = 5000

train = data.iloc[:-val_size]
val = data.iloc[-val_size:]

The current evaluation metric for the competition is log loss. We’re not using a scaler with LR anymore because the data is already scaled. We only scale after creating poly features.

AUC: 52.54%, accuracy: 51.96%, log loss: 69.22%

Pipeline(steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])
AUC: 52.57%, accuracy: 51.76%, log loss: 69.58%

Let us note that differences between models in validation are pretty slim. Even so, the order is correct - we would choose the right model from the validation scores. Here’s the summary of results achieved for the two models:


# 0.6922 / LR
# 0.6958 / PolynomialFeatures + MinMaxScaler + LR

Public leaderboard:

# 0.6910 / LR
# 0.6923 / PolynomialFeatures + MinMaxScaler + LR

And the private leaderboard at the end of the May round:

# 0.6916 / LR
# 0.6954 / PolynomialFeatures + MinMaxScaler + LR

As you can see, our improved validation scores translate closely into the private leaderboard scores.


Bio: Zygmunt Zając likes fresh air, holding hands, and long walks on the beach. He runs, the most popular machine learning blog in the whole wide world. Besides a variety of entertaining articles, FastML now has a machine learning job board and a tool for visualizing datasets in 3D.

Originals: Part 1. Part 2. Reposted with permission.