How to Rank 10% in Your First Kaggle Competition
This post presents a pathway to achieving success in Kaggle competitions as a beginner. The path generalizes beyond competitions, however. Read on for insight into succeeding at any data science project.
Home Depot Search Relevance
In this section I will share my solution to the Home Depot Search Relevance competition and what I learned from top teams after the competition.
The task in this competition is to predict how relevant a result is for a search term on Home Depot's website. The relevance is the average score from three human evaluators and ranges from 1 to 3, so it's a regression task. The dataset contains search terms, product titles / descriptions, and some attributes like brand, size, and color. The evaluation metric is RMSE.
This is much like the Crowdflower Search Results Relevance competition. The difference is that Crowdflower used Quadratic Weighted Kappa, which complicated the final cutoff of the regression scores. Also, no attributes were provided in Crowdflower.
There were several quite good EDAs by the time I joined the competition, especially this one. I learned that:
- Many search terms / products appeared several times.
- Text similarities are great features.
- Many products don’t have attribute features. Would this be a problem?
- Product ID seems to have strong predictive power. However the overlap of product ID between the training set and the testing set is not very high. Would this contribute to overfitting?
You can find how I did preprocessing and feature engineering on GitHub. I’ll only give a brief summary here:
- Use typo dictionary posted in the forum to correct typos in search terms.
- Count the attributes and identify the frequent, easily exploited ones.
- Join the training set with the testing set. This is important because otherwise you’ll have to do feature transformation twice.
- Do stemming and tokenization on all the text fields. Some normalization (of digits and units) and synonym substitutions are performed manually.
- *Attribute Features
    - Whether the product contains a certain attribute (brand, size, color, weight, indoor/outdoor, energy star certified …)
    - Whether a certain attribute matches with the search term
- Meta Features
    - Length of each text field
    - Whether the product contains attribute fields
    - Brand (encoded as integers)
    - Product ID
- Whether search term appears in product title / description / attributes
- Count and ratio of search term’s appearance in product title / description / attributes
- *Whether the i-th word of search term appears in product title / description / attributes
- Text similarities between search term and product title/description/attributes
- Latent Semantic Indexing: by performing SVD on the matrix obtained from BOW/TF-IDF vectorization, we get a latent representation of the different search term / product groups. This enables the model to distinguish between groups and assign different weights to features, which mitigates (to an extent) the issues of dependent data and of products lacking some features.
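The join-then-transform step and the LSI features can be sketched with pandas and scikit-learn. The column names and tiny example rows below are hypothetical stand-ins for the actual competition files:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical rows; the real files have columns like these.
train = pd.DataFrame({"search_term": ["angle bracket", "deck screws"],
                      "product_title": ["simpson strong-tie angle", "deck screws 3 in"]})
test = pd.DataFrame({"search_term": ["wood glue"],
                     "product_title": ["titebond wood glue 16 oz"]})

# Join the training set with the testing set so every transformation
# is fitted and applied exactly once.
all_df = pd.concat([train, test], ignore_index=True)

# BOW/TF-IDF vectorization followed by SVD yields a latent (LSI) representation.
text = all_df["search_term"] + " " + all_df["product_title"]
tfidf = TfidfVectorizer().fit_transform(text)
svd = TruncatedSVD(n_components=2, random_state=0)
lsi = svd.fit_transform(tfidf)  # one low-dimensional row per record

# Split back into train/test parts by position.
lsi_train, lsi_test = lsi[:len(train)], lsi[len(train):]
```

In a real pipeline you would tune `n_components` and likely build separate LSI spaces for titles, descriptions, and attributes.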
Note that the features listed above with * are the last batch I added. The problem is that the model trained on data including these features performed worse than the previous ones. At first I thought the increase in the number of features required re-tuning of model parameters. However, after spending much CPU time on grid search, I still could not beat the old model. I suspect it was the feature correlation issue mentioned above. I actually knew a solution that might work: combine models trained on different versions of the features by stacking. Unfortunately I didn't have enough time to try it. As a matter of fact, most top teams regard the ensembling of models trained with different preprocessing and feature engineering pipelines as a key to success.
At first I was using RandomForestRegressor to build my model. Then I tried XGBoost, and it turned out to be more than twice as fast as scikit-learn. From then on, my daily routine was basically running grid search on my workstation while working on features on my laptop.
The dataset in this competition is not trivial to validate: it's not i.i.d., and many records are dependent. Many times I used better features / parameters only to end up with worse LB scores. As many accomplished Kagglers have repeatedly stated, you have to trust your own CV score in such a situation. I therefore switched from 5-fold to 10-fold cross validation and ignored the LB score in subsequent attempts.
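That workflow, grid search scored by 10-fold CV, can be sketched with scikit-learn. The RandomForestRegressor, parameter grid, and toy data below are stand-ins so the snippet runs anywhere; you would swap in XGBoost's XGBRegressor for the speedup mentioned above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Toy regression data standing in for the engineered feature matrix.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X[:, 0] * 2 + rng.rand(100) * 0.1

# 10-fold CV: with dependent data, trust this score over the LB.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=cv,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best params and CV RMSE
```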
My final model is an ensemble consisting of 4 base models:
The stacker is also a
The problem is that all my base models were highly correlated (the lowest pairwise correlation was 0.9). I thought of adding linear regression, SVM regression, and XGBRegressor with a linear booster to the ensemble, but these models had RMSE scores about 0.02 higher than the 4 models I finally used (a gap worth hundreds of places on the leaderboard). So I decided against more models, although they would have brought much more diversity.
The good news is that, despite base models being highly correlated, stacking still bumps up my score a lot. What’s more, my CV score and LB score are in complete sync after I started stacking.
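The post doesn't show the stacking code, but a standard out-of-fold stacking procedure can be sketched as follows. The two base models, the Ridge stacker, and the synthetic data are stand-ins for illustration, not the actual ensemble:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=8, noise=0.3, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

base_models = [RandomForestRegressor(n_estimators=50, random_state=0),
               GradientBoostingRegressor(random_state=0)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X_train), len(base_models)))       # out-of-fold predictions
test_meta = np.zeros((len(X_test), len(base_models)))  # averaged test predictions

for j, model in enumerate(base_models):
    fold_preds = []
    for tr_idx, va_idx in kf.split(X_train):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        oof[va_idx, j] = model.predict(X_train[va_idx])
        fold_preds.append(model.predict(X_test))
    test_meta[:, j] = np.mean(fold_preds, axis=0)

# The stacker trains only on out-of-fold predictions, to avoid leakage.
stacker = Ridge()
stacker.fit(oof, y_train)
final_pred = stacker.predict(test_meta)
```

Each base model contributes one meta-feature column; the stacker learns how to weight them.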
During the last two days of the competition, I did one more thing: using 20 or so different random seeds, I generated the ensemble repeatedly and took a weighted average of the runs as the final submission. This is actually a kind of bagging. It makes sense in theory: in stacking, 80% of the data is used to train the base models in each iteration, whereas 100% of the data is used to train the stacker, so the stacker is less clean. Running multiple times with different seeds ensures that a different 80% of the data is used each time, reducing the risk of information leak. Yet this only bought me an increase of 0.0004, which might just be due to randomness.
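A simplified sketch of that seed-averaging idea: here a single subsampled GradientBoostingRegressor on synthetic data stands in for the full stacking pipeline, so the seed controls which 80% of rows each tree sees:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=120, n_features=6, noise=0.5, random_state=0)
X_train, X_test, y_train = X[:100], X[100:], y[:100]

# Rerun the whole pipeline with different seeds and average the submissions.
preds = []
for seed in range(20):
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

final_submission = np.mean(preds, axis=0)
```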
After the competition, I found out that my best single model scored 0.46378 on the private leaderboard, whereas my best stacking ensemble scored 0.45849. That was the difference between 174th place and 98th place. In other words, feature engineering and model tuning got me into the top 10%, whereas stacking got me into the top 5%.
There’s much to learn from the solutions shared by top teams:
- There’s a pattern in the product title. For example, whether a product is accompanied by a certain accessory will be indicated by With/Without XXX at the end of the title.
- Use external data. For example, use WordNet or the Reddit Comments Dataset to learn synonyms and hypernyms.
- Some features are based on letters instead of words. At first I was rather confused by this, but it makes perfect sense if you consider it. For example, the team that won 3rd place took the number of matched letters into consideration when computing text similarity. They argued that longer words are more specific and thus more likely to be assigned high relevance scores by humans. They also used char-by-char sequence comparison (difflib.SequenceMatcher) to measure visual similarity, which they claimed to be important for humans.
- POS-tag words, find the head of each phrase, and use them when computing various distance metrics.
- Extract the top-ranking trigrams from the TF-IDF of the product title / description fields and compute the ratio of search-term words that appear in these trigrams, and vice versa. This is like computing latent indexes from another point of view.
- Some novel distance metrics like Word Mover’s Distance.
- Apart from SVD, some used NMF.
- Generate pairwise polynomial interactions between top-ranking features.
- For CV, construct splits in which product IDs do not overlap between the training and validation sets, and splits in which they do. Then we can combine these with the corresponding ratio to approximate the impact of the public/private LB split in our local CV.
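The non-overlapping part of that splitting scheme can be sketched with scikit-learn's GroupKFold (a plain KFold gives the overlapping counterpart); the toy product IDs below are made up:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy rows; product IDs repeat, so records are not independent.
product_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

# Splits where product IDs never overlap between train and validation,
# mimicking products unseen in training.
gkf = GroupKFold(n_splits=4)
for tr_idx, va_idx in gkf.split(X, y, groups=product_ids):
    # Every group (product ID) lands entirely on one side of the split.
    assert set(product_ids[tr_idx]).isdisjoint(product_ids[va_idx])
```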
- It was a good call to start doing ensembles early in the competition. As it turned out, I was still playing with features during the very last days.
- It’s a high priority to build a pipeline capable of automatic model training and recording of the best parameters.
- Features matter the most! I didn’t spend enough time on features in this competition.
- If possible, spend some time to manually inspect raw data for patterns.
Several issues I encountered in this competition are of high research value.
- How to do reliable CV with dependent data.
- How to quantify the trade-off between diversity and accuracy in ensemble learning.
- How to deal with feature interactions that harm the model’s performance, and how to determine whether new features are effective in such situations.
- Choose a competition you’re interested in. It would be better if you already have some insights about the problem domain.
- Following my approach or somebody else’s, start exploring, understanding and modeling data.
- Learn from forum and scripts. See how others interpret data and construct features.
- Find winner interviews / blog posts from previous competitions. They’re extremely helpful, especially if they’re from competitions that share similarities with the one you’re working on.
- Start ensembling after you have reached a pretty good score (e.g. top 10% ~ 20%) or you feel that there isn’t much room for new features (which, sadly, always turns out to be false).
- If you think you may have a chance to win the prize, try teaming up!
- Don’t give up until the end of the competition. At least try something new every day.
- Learn from the solutions shared by top teams after the competition. Reflect on your approaches. If possible, spend some time verifying what you learn.
- Get some rest!
Bio: Linghao Zhang is a senior year Computer Science student at Fudan University and Data Mining Engineer at Strikingly. His interests include machine learning, data mining, natural language processing, knowledge graphs, and big data analytics.