How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part 2
In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and trees ensembles.
Machine learning models
In this section I present the first machine learning models I tried for this challenge: Collaborative Filtering and Trees Ensembles.
Collaborative Filtering — ALS Matrix Factorization
Collaborative Filtering (CF) is probably the most popular technique for recommender systems. CF allow the modeling of users interests by collecting preferences or taste information from many other users (collaborating) to provide personalized recommendations (filtering). Thus, this method was a natural first candidate to be assessed.
Alternating Least Squares (ALS) Matrix Factorization can be applied as a model-based CF for a large matrix of users and items. We used Spark ALS implementation, which features distributed processing across a cluster and supports implicit feedback data (e.g. views, clicks, purchases, likes and shares) as well as explicit feedback data (e.g. movie or books ratings).
The first step was to build a sparse utility matrix of users and documents (content pages referred by ads). The matrix was filled with views count of each document by each user. Log transformation was necessary to smooth views count because some users (or bots) had seen the same web page many times during the 15 days of competition data.
After many attemps tuning Spark’s ALS Matrix Factorization hyperparameters, the best model (presented below) got a Cross-Validation (CV) MAP score of 0.56116, much lower than previous baselines. Thus, this model was not used in my final ensembled solution.
A possible reason for such bad result was the users’ cold-start — the average page views by user was only 2.5, from a collection of more than 2 million pages, making it difficult for a CF approach to infer users preferences and complete such a large and sparse matrix.
Spark ALS model training (Python)
Gradient Boosted Decision Trees
Standard CF uses only the utility matrix between users and items (documents). For this competition, there were a vast amount of attributes related to the visit context, the landing page and the ads. Thus, we tried a Learning To Rank approach to take advantage of such rich side information.
The machine learning model employed was Gradient Boosted Decision Trees(GBDT), which is an ensemble of many Decision Trees (weak learners). Similarly to Random Forests (RF), each individual tree is trained with a subsample of the training instances (a.k.a. bagging) and a subsample of the attributes (feature bagging) in order to get individual models with different perspectives of the data. Differently than RF, GBDT assigns more weight to training instances mistakenly classified by previous trees, improving model accuracy and making it a more robust classifier.
I’ve tested two GBDT frameworks, whose implementation was very fast and memory-efficient for large datasets: XGBoost and LightGBM. In our context, they allowed to focus in optimizing the ads ranking for a given user page visit (display_id), instead of trying to predict whether individual ads were clicked or not. XGBoost with ranking objective optimized learning based on MAP metric (the official competition evaluation metric), while LightGBM optimized upon a different ranking metric named NDCG.
The attributes used in XGBoost models, described in more details in the first post, were One-Hot Encoded (OHE) categorical ids, average CTR by categories and their confidences, contextual documents similarity (cosine similarity between landing page profiles — categories, topics, entities — and ad documents profiles), users preference similarity (cosine similarity between user profile and ad documents profile). Such number of attributes, specially OHE categorical ids, lead to very sparse feature vectors, with dimensionality of more than 126,000 positions and only 40 non-zero features.
XGBoost hyperparameters tuning is tricky and this tuning guide was very useful for me. The best XGBoost model (Approach #4) got a LB score of 0.66821 (hyperparameters shown below) — a large jump over baseline models. That took about three hours to train on a single server with 32 CPUs and 208 GB RAM (n1-highmem-32 machine type on Google GCE).
Training of the best XGBoost model, using its Python API
For LightGBM models, I used as input data only numerical attributes (CTR and similarities), ignoring OHE categorical fields. Training on such data was pretty fast, under 10 minutes. Surprisingly, I could get with LightGBM (Approach #5) a better model than with XGBoost (LB score of 0.67073). My hypothesis is that the high-dimensional categorical features encoded with OHE made it harder to get a random set of very predictive bagged features for trees building. In fact, some competitors reported single GBDT models with scores above 0.68, using the original categorical ids as provided (no OHE or feature hashing), probably because thousand of trees in the GBDT ensemble were able to ignore the noise and model such ids representation.
I used the LightGBM command line interface for training and predicting the models, and the best hyperparameters I could get for my data are presented below for reference.
More on part III
Up to this point, my LB score had increased significantly over the baseline, by using some basic statistics and trees ensembles. In the next post of this series, I will describe the most powerful machine models to tackle the Click Prediction problem and the ensemble techniques that took me up to the 19th position on the leaderboard (top 2%)
Bio: Gabriel Moreira is a scientist passionate about solving problems with data. He is a Doctoral student at Instituto Tecnológico de Aeronáutica - ITA, researching on Recommender Systems and Deep Learning. Currently, he is Lead Data Scientist at CI&T, leading the team in the solution of customer's challenging problems using machine learning on big and unstructured data.
Original. Reposted with permission.