How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part 2
In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and trees ensembles.
By Gabriel Moreira, CI&T.
In the first part of this series, I introduced the Outbrain Click Prediction machine learning competition. That post described some preliminary and important data science tasks like exploratory data analysis and feature engineering performed for the competition, using a Spark cluster deployed on Google Dataproc.
In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and trees ensembles.
Prediction task
In this competition, Kagglers were required to rank recommended ads by decreasing predicted likelihood of being clicked. Sponsored search advertising, contextual advertising, display advertising, and real-time bidding auctions have all relied heavily on the ability of learned models to predict ad click-through rates (CTR) accurately, quickly, and reliably.
The evaluation metric was MAP@12 (Mean Average Precision at 12), which measures ranking quality. In other words, given the set of ads presented during a user visit, this metric assesses whether your model ranked the clicked ad above the other ads. The MAP@12 metric is shown below, where |U| is the total number of user visits (display_ids) to a landing page with ads, n is the number of ads on that page, and P(k) is the precision at cutoff k.
Competition evaluation metric — Mean Average Precision at 12
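With the definitions above, the metric can be written as:

MAP@12 = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{k=1}^{\min(12,\,n)} P(k)

Since exactly one ad is clicked per display_id, the inner sum reduces to 1/rank of the clicked ad when it appears among the top 12 predictions, and 0 otherwise.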
This competition made available a very large relational database in CSV format, whose data model was shown in the first post. The main tables were clicks_train.csv (~87 million rows) and clicks_test.csv (~32 million rows), representing the train set and the test set, respectively.
The train set (display_id, ad_id, clicked) contained information about which ads (ad_id) were recommended to a user on a given [landing] page visit (display_id), and which of those ads was actually clicked by the user. display_id and ad_id are foreign keys that allow joining other tables for more context about the user visit, the landing page, and the pages referred to by the ads.
The test set (display_id, ad_id) contained the same attributes, except the clicked flag, as expected.
The expected solution was a CSV file with predictions for the test set: the first column was the display_id and the second column was a ranked list of ad_ids, separated by a space character, as shown below.
display_id,ad_id
16874594,66758 150083 162754 170392 172888 180797
16874595,8846 30609 143982
16874596,11430 57197 132820 153260 173005 288385 289122 289915
...
Cross-validation
It is always critical to verify whether a machine learning model can generalize beyond the training dataset. Cross-Validation (CV) is an important step to ensure such generalization.
It was necessary to hold out a percentage of the samples from clicks_train.csv as a validation set (about 30%), keeping the remaining samples as the train set. Machine learning models were trained on the train set and their accuracy was evaluated on the validation set by comparing the predictions with the ground-truth labels (clicks).
As we optimize CV model accuracy by testing different feature engineering approaches, algorithms, and hyperparameter settings, we expect our score on the competition Leaderboard (LB), which is computed on the test set, to improve accordingly.
For most Kaggle competitions, there is a daily submission limit (two in this contest), and CV techniques allow unlimited evaluation of approaches.
For this competition, I used a CV method named holdout, which simply consists of setting aside a fixed validation set from the train set. Another common evaluation strategy is k-fold cross-validation.
Time split is also an important aspect when evaluating machine learning models based on human behavior. Many factors might affect users’ interests across time, like recency, search trends and real-world events, among others.
In the figure below, we observe the fraction of events across time (15 days) in the train and test sets. About half of the test data (clicks_test.csv) was sampled from the same days as the train set (in-time), with the other half sampled from the two days immediately after the train period, representing future predictions (out-of-time).
Fraction of train and test set events across days. From joconnor EDA kernel
Based on that observation, my CV strategy was to sample the validation set according to the same time distribution as the test set. Below you can see how that stratified sampling can be done using a Spark SQL DataFrame (on a cluster deployed on Google Dataproc). For the validation set, I sampled 20% of the events of each day, plus all events from the last two days (11 and 12).
Stratified sampling based on event days using Spark Dataframe (Python)
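A minimal PySpark sketch of that day-stratified split follows. The DataFrame and column names here (events, day_event, clicks_train) are illustrative assumptions, not the original code, and assume the event day index has already been joined to the clicks.

from pyspark.sql import functions as F

# Assumptions: `events` has one row per display_id with an integer day_event
# column (1..12 for the train period), and clicks_train is the clicks_train.csv
# DataFrame. Names are illustrative.

# 20% of the display_ids from each of the first ten days (in-time)...
in_time_ids = (events
               .filter(F.col('day_event') <= 10)
               .sampleBy('day_event', fractions={d: 0.2 for d in range(1, 11)}, seed=42)
               .select('display_id'))

# ...plus all display_ids from the last two days (out-of-time)
out_of_time_ids = (events
                   .filter(F.col('day_event') >= 11)
                   .select('display_id'))

valid_display_ids = in_time_ids.union(out_of_time_ids).distinct()

# Split clicks_train into validation and train sets by display_id
valid_set = clicks_train.join(valid_display_ids, 'display_id', 'inner')
train_set = clicks_train.join(valid_display_ids, 'display_id', 'left_anti')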
That careful sampling of the validation set paid off during the competition, because my CV score matched the public leaderboard score up to the fourth decimal digit. Thus, I was confident in optimizing my models using the CV score, and only submitted when I got a significant improvement on it.
Baseline predictions
Up to this point, I had put a lot of hard work into exploratory data analysis, feature engineering, and the cross-validation strategy. In my experience, those tasks usually take about 60%-80% of the effort in machine learning projects, and if you get them wrong, they compromise the maximum accuracy the models can achieve.
Now, let’s switch gears and talk about the predictive algorithms used in this journey. My strategy, focused on personal learning, was to evaluate models from the simplest up to the state of the art, in order to assess their upper-bound accuracy for this challenge.
My Approach #1 was already described in the first post of this series: simply using the historical CTR to rank ads, with no machine learning involved. That was a popular baseline among competitors, providing an LB score (MAP@12) of 0.637. Just for reference, the official competition baseline was 0.485, obtained by ranking the ads solely by their ids (a random-like approach).
The Kaggle community is very generous in sharing insights and approaches, even during competitions. One of the Kagglers shared a data leak he had discovered. The leak, based on the page_views.csv dataset, revealed the actually clicked ads for about 4% of the user visits (display_ids) in the test set. For a machine learning competition, sharing the data leak was fair play, and it created a new baseline for competitors. My Approach #2 was to take the Approach #1 predictions and adjust only the ranking of the leaked clicked ads, putting them on top of the other ads. My LB score then jumped to 0.65317, and I used the data leak in all my remaining submissions, as did the other competitors.
Most ads had too few views (fewer than 10) to compute a statistically significant CTR. So my intuition was that prior knowledge of the click probability for other categorical values would have some predictive power for unseen events. Thus, I computed the average CTR not only for ad ids, but also for other categorical values, that is, P(click | category), and for some paired combinations: P(click | category1, category2).
The categorical fields whose average CTR presented the highest predictive accuracy on the CV score were ad_document_id, ad_source_id, ad_publisher_id, ad_advertiser_id, ad_campaign_id, document attributes (category_ids, topics_ids, entities_ids), and their combinations with event_country_id, which modeled regional user preferences. A sketch of this computation is shown below.
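As an illustration (not the original code), such conditional CTRs can be computed with simple group-by aggregations on the joined train set. The DataFrame name train_joined and its columns are assumptions.

from pyspark.sql import functions as F

# Assumption: train_joined is clicks_train joined with the ad, document and
# event metadata, so each row carries the clicked flag plus categorical
# columns such as ad_advertiser_id and event_country_id.

def category_ctr(df, category_cols):
    # P(click | category): views, distinct ads and average CTR per categorical value
    return (df.groupBy(*category_cols)
              .agg(F.count('*').alias('views'),
                   F.countDistinct('ad_id').alias('distinct_ads'),
                   F.avg('clicked').alias('ctr')))

advertiser_ctr = category_ctr(train_joined, ['ad_advertiser_id'])
advertiser_country_ctr = category_ctr(train_joined, ['ad_advertiser_id', 'event_country_id'])

The views and distinct_ads counts computed here also feed the confidence metric described next.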
The CTR confidence was a metric I tailored during this competition to measure how much a categorical CTR could be trusted to predict clicks for a specific ad.
CTR Confidence metric
In this equation, d is the number of distinct ads sharing a categorical value, and v is the total view count of those ads. A high d means that a categorical value is too generic to be used for a specific ad (e.g., "topic=politics"), leading to a lower CTR confidence.
A high v indicates that ads with that categorical value had many views, increasing the statistical significance of the CTR. For example, the CTR of a specific campaign (a categorical value) is more specific to an ad than the historical advertiser CTR (which has a higher d); on the other hand, the campaign might not have enough views (v < 10, for example) for statistical significance, leading to lower confidence in the campaign CTR than in the advertiser's historical CTR.
A log transformation is also applied to smooth the function for very popular categories/ads. Finally, we normalize the metric to the range 0.0-1.0 by dividing by m, which represents the reference number of average views per distinct ad (v/d) at which confidence reaches its maximum (1.0). I used m = 100,000 for this competition.
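In Python, the idea can be sketched as follows; the specific functional form here (log-smoothed average views per distinct ad, normalized by m and capped at 1.0) is just one possible implementation of the description above, not the exact equation.

import math

def ctr_confidence(views, distinct_ads, m=100000):
    # Average views per distinct ad sharing the categorical value (v/d),
    # log-smoothed and normalized by the reference value m, capped at 1.0.
    # One possible form of the metric described above, not the exact equation.
    avg_views = views / max(distinct_ads, 1)
    return min(1.0, math.log(1 + avg_views) / math.log(1 + m))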
Approach #3 consisted of a weighted average of all selected category CTRs, weighted by their CTR confidence, as sketched below. That approach improved my LB score to 0.65498.
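The blending itself is straightforward; a minimal sketch (illustrative, not the competition code):

def weighted_ctr(ctr_conf_pairs):
    # Confidence-weighted average of the category CTRs selected for an ad.
    # ctr_conf_pairs: list of (ctr, confidence) tuples, e.g. the ad CTR,
    # campaign CTR and advertiser CTR, each with its CTR confidence.
    total_conf = sum(conf for _, conf in ctr_conf_pairs)
    if total_conf == 0.0:
        return 0.0
    return sum(ctr * conf for ctr, conf in ctr_conf_pairs) / total_conf

Ads within a display_id are then ranked by this blended score.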
After manually stress-testing those baseline predictors, it was time to start exploring machine learning algorithms to tackle this problem.