How to win a Kaggle competition based on an NLP task if you are not an NLP expert

Here is how we got one of the best results in a Kaggle challenge remarkable for a number of interesting findings and controversies among the participants.

The number of such “neighbors” turned out to be a very strong magic feature. It can be seen in the following diagram. It shows the split between duplicates and non-duplicates in the training set among the records with a particular number of “common neighbors”.

So, around 80% of the records that score zero in this feature are duplicates, and those that have one “neighbor” have less than 40% chance of being duplicates.
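A minimal sketch of how such a feature could be computed, assuming every record is a pair of question identifiers and every pair is treated as an edge of an undirected graph. The toy pairs and the function names are hypothetical and serve only as an illustration; this is not the exact code from our solution.

```python
from collections import defaultdict

def build_neighbors(pairs):
    """Map each question to the set of questions it is paired with."""
    neighbors = defaultdict(set)
    for q1, q2 in pairs:
        neighbors[q1].add(q2)
        neighbors[q2].add(q1)
    return neighbors

def common_neighbor_counts(pairs):
    """For every record, count the questions paired with both of its questions."""
    neighbors = build_neighbors(pairs)
    return [len(neighbors[q1] & neighbors[q2]) for q1, q2 in pairs]

# Hypothetical toy data: each tuple is one record (first question id, second question id).
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
counts = common_neighbor_counts(pairs)
print(counts)  # [1, 1, 1, 0]
```

In a real setting the graph would probably be built from the train and test pairs together, since both contribute edges.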

2. Another structural characteristic is the frequency of a question. Let’s calculate the number of incident edges for every node of the graph (in other words, how often each question occurs in the data set). This way every record gets the number of occurrences of each of its two questions. We can use the minimum or maximum of these two counts, their average, or the difference between them. Such features also turn out to be very strong and improve the model’s performance.
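As a rough illustration, the frequency features could be derived as follows. The feature names and the toy pairs are assumptions made for this sketch.

```python
from collections import Counter

def frequency_features(pairs):
    """Return min, max, mean and absolute difference of question frequencies per record."""
    freq = Counter()
    for q1, q2 in pairs:
        freq[q1] += 1
        freq[q2] += 1
    features = []
    for q1, q2 in pairs:
        f1, f2 = freq[q1], freq[q2]
        features.append({
            "freq_min": min(f1, f2),
            "freq_max": max(f1, f2),
            "freq_mean": (f1 + f2) / 2,
            "freq_diff": abs(f1 - f2),
        })
    return features

# Hypothetical toy data: question "c" occurs three times, "d" only once.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
features = frequency_features(pairs)
print(features[3])  # {'freq_min': 1, 'freq_max': 3, 'freq_mean': 2.0, 'freq_diff': 2}
```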

These features were referred to as “magic” ones during the competition since they turned out to be quite strong. Many participants were surprised to find out that useful information can also be hidden in the structure of the data set. It also wasn’t clear from the beginning whether such features would be useful for the competition organizer’s business. Moreover, certain NLP models (TF-IDF, for example) implicitly use a question’s number of occurrences. This means that such models can improve their quality of prediction based solely on the specific characteristics of a particular data set.

Of course, there are many other ways of finding useful information in the structure of a data set, but these were the most powerful ones in this particular Kaggle competition.

Our solution to the Quora Question Pairs Kaggle competition

Deep learning

Given the specifics of the task, it is worth mentioning that both participants and organizers placed high hopes on deep learning. It is true that in many cases deep learning models have shown better results than models using hundreds of handcrafted features (Quora had already been using one such model). This is why we started working on our solution with deep learning models.

Word vectors (Embeddings)

Modern deep learning models are deep neural networks that take raw data as input (here, the texts of the questions) and produce the necessary features themselves. The problem is that neural networks work with sets of numbers (vectors) rather than with raw text. For example, the words “dog” and “puppy” have similar meanings, but that means nothing to a computer: the words have different lengths and letters, so it won’t see any similarity. To solve our task, however, it is important to understand whether words are similar or not. The word2vec approach can help with that. Its essence can be described in one quote: “You shall know a word by the company it keeps” (Firth, J. R., 1957). Word2vec turns words into vectors in such a way that words used in similar contexts get similar vectors. Using word2vec, we can transform raw text into a sequence of vectors that can easily be fed into a neural network.

It is worth mentioning that it is hard to train word2vec and other embeddings yourself, since that requires a text corpus the size of Wikipedia. That is why the majority of competition participants use pre-trained models.
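To make the idea of “similar vectors” more concrete, here is a small sketch that compares words by the cosine similarity of their vectors. The three-dimensional vectors below are made up for illustration only. Real pre-trained embeddings usually have hundreds of dimensions and would be loaded from a file.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any real embedding model.
toy_vectors = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

sim_dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])
print(sim_dog_puppy > sim_dog_car)  # True
```

With these toy values, “dog” and “puppy” point in almost the same direction, while “dog” and “car” do not, even though none of the words share a common spelling pattern.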

Neural Networks

Siamese neural networks are well suited for such a task. They are used when we need to determine how similar two objects are. Such networks have two inputs that are processed by identical sub-networks with shared weights in order to extract the features. At the next level, either those features are used to calculate a similarity metric, or all the features are combined and passed to a fully connected layer. After our experiments, we chose the following neural network architecture:

Important remarks:

  1. Apart from having two inputs for the compared questions, our neural network had a third input for handcrafted features, which is not very typical for deep learning models. We added it because of the “magic features”, which had a strong influence in this data set.
  2. We used two neural networks in our solution. In the first one, we used an LSTM to extract information from the questions. In the other one, we used a couple of convolutional layers followed by Global Max Pooling.
  3. The architecture we chose is not very “deep”. This is because deep neural networks need A LOT of data to work properly. Our data set was not that big, and its labeling was rather noisy, so there was a high chance of overfitting, especially when using an LSTM.
  4. Some more ideas worth experimenting with:
    1. LSTM with attention
    2. Character-Aware Neural Network
    3. Triplet Neural Network
    4. Target encoding
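The remarks above can be illustrated with a heavily simplified forward pass. This is only a sketch under stated assumptions: a per-word linear projection stands in for the LSTM or convolutional layers, all weights are hypothetical and untrained, and a single logistic unit plays the role of the fully connected layers. It shows two points: both questions go through the same encoder with the same weights, and the handcrafted features enter as a third input.

```python
import math

def encode(word_vectors, shared_weights):
    """Shared encoder: project every word vector, then apply global max pooling."""
    projected = [
        [sum(w * x for w, x in zip(row, vec)) for row in shared_weights]
        for vec in word_vectors
    ]
    # Global max pooling: keep the largest value of each output channel across all words.
    return [max(channel) for channel in zip(*projected)]

def predict_duplicate(q1_vectors, q2_vectors, handcrafted, shared_weights, head_weights, bias):
    """Encode both questions with the same weights, append handcrafted features, score."""
    merged = (
        encode(q1_vectors, shared_weights)
        + encode(q2_vectors, shared_weights)
        + handcrafted
    )
    logit = sum(w * x for w, x in zip(head_weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical, untrained parameters and inputs (2-d word vectors, 2 output channels).
shared_weights = [[1.0, 0.0], [0.0, 1.0]]
head_weights = [0.5, 0.5, 0.5, 0.5, 1.0]
question_a = [[0.2, 0.7], [0.9, 0.1]]  # two words
question_b = [[0.8, 0.3]]              # one word
magic_features = [2.0]                 # e.g. the number of common neighbors

probability = predict_duplicate(
    question_a, question_b, magic_features, shared_weights, head_weights, 0.0
)
```

Because of the pooling step, questions of different lengths are mapped to vectors of the same size, which is what allows them to be concatenated with the handcrafted features.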

After the experiments with neural networks were over, it became evident that they were not enough. In addition to the data set being too small and too noisily labeled for deep learning models to show their full power, there were problems with transforming the text into vectors. The data set contained a lot of mathematical formulas, rare abbreviations, and spelling mistakes. As a result, part of the information about a question was lost, which made the job of the neural networks much harder.

Gradient boosting

Finally, it was time for a favorite tool of data science competition participants: gradient boosting. It has proved to be very strong and robust to labeling noise.

We used the following features for gradient boosting:

  • Length of questions, number of words, number of words except for stopwords
  • The number of capital letters, question marks, quotes etc…
  • Indicators for questions starting with “Are”, “Can”, “How” etc…
  • Similarity measures on word embeddings (Word2Vec, FastText, GloVe)
  • Word Mover’s Distance
  • Similarity measures on bag of character n-grams (including TF-IDF)
  • Jaccard similarity, Canberra and Chebyshev distances
  • Abhishek’s and Mephistopheles’s shared features
  • PageRank
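A few of the simpler features from this list can be sketched in plain Python. The feature names and the example questions are assumptions made for this illustration; the embedding-based features, Word Mover’s Distance and PageRank are left out because they need external models or the full question graph.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def text_features(q1, q2):
    """Compute a handful of simple pairwise features for two question strings."""
    words1, words2 = set(q1.lower().split()), set(q2.lower().split())
    return {
        "len_diff": abs(len(q1) - len(q2)),
        "word_count_diff": abs(len(q1.split()) - len(q2.split())),
        "question_marks": q1.count("?") + q2.count("?"),
        "same_first_word": int(q1.split()[:1] == q2.split()[:1]),
        "word_jaccard": jaccard(words1, words2),
        "char_3gram_jaccard": jaccard(char_ngrams(q1), char_ngrams(q2)),
    }

# Hypothetical example pair.
features = text_features("How do I learn Python?", "How can I learn Python?")
```

Character n-grams are more tolerant of spelling mistakes than whole words, which matters for a data set with many typos.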

During this competition we also got to test the new gradient boosting library LightGBM. It turned out to be as precise as the older XGBoost, or even better, while being much faster. So many participants used LightGBM in their final models.

We also used out-of-fold neural network predictions as boosting features. All that was left was to balance the difference in target distribution between the train and test sets, pick the model’s hyperparameters, and validate the results. Such a model was more than enough to get a silver medal on Kaggle.
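For readers unfamiliar with out-of-fold predictions, here is a minimal sketch of the idea: every training row is predicted by a model that never saw that row during training, so the prediction can safely be used as a feature for the next model. The `fit` and `predict` callables and the trivial “mean” model are hypothetical stand-ins for a real neural network, and the folds are assigned by index for simplicity (in practice they would usually be shuffled or stratified).

```python
def out_of_fold_predictions(rows, targets, fit, predict, n_folds=5):
    """Predict every row with a model that was trained without that row's fold."""
    oof = [None] * len(rows)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(rows)) if i % n_folds != fold]
        valid_idx = [i for i in range(len(rows)) if i % n_folds == fold]
        model = fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
        for i in valid_idx:
            oof[i] = predict(model, rows[i])
    return oof

# Hypothetical stand-in model: it just predicts the mean target of its training data.
def fit_mean(train_rows, train_targets):
    return sum(train_targets) / len(train_targets)

def predict_mean(model, row):
    return model

rows = list(range(6))
targets = [1, 0, 0, 1, 0, 0]
oof = out_of_fold_predictions(rows, targets, fit_mean, predict_mean, n_folds=3)
print(oof)  # [0.0, 0.5, 0.5, 0.0, 0.5, 0.5]
```

Note that rows 0 and 3, the only positive examples, receive a prediction of 0.0: the model that scored them was trained only on the other folds, so their own labels could not leak into the feature.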

Quora Question Pairs Kaggle competition winning recipe

Here is a short guide to what you should have done to win this competition:

  • The more the better
    We used only around 70 handcrafted features and 3 models in our solution. The winners, however, used 1000+ features and combined hundreds (up to a thousand) of models. It is often the case that the more different models you combine in your solution, the higher your chances of winning a Kaggle competition.
  • Advanced graph features
    The graph features we’ve already mentioned were not the only way to use the data set’s structure. The winners managed to find many more such features and used them in their models.
  • Local rescaling
    The main idea is that the whole data set can be divided into a number of smaller data sets. Each of them has a different distribution of duplicates, so these data sets need to be balanced in different ways.
  • Prediction postprocessing
    One more way of improving the result was to correct the predictions the model had already made. For example, we could use transitivity: if question B is a duplicate of question A, and question C is a duplicate of question B, then questions A and C are also duplicates.
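The transitivity idea from the last item can be sketched with a union-find structure: questions connected by confident duplicate predictions are merged into clusters, and any pair inside one cluster has its prediction raised. The threshold of 0.9 and the rule of raising the prediction to that threshold are assumptions of this sketch, not the winners’ exact procedure.

```python
def find(parent, x):
    """Return the cluster representative of x, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def apply_transitivity(pairs, predictions, threshold=0.9):
    """Raise predictions for pairs linked through a chain of confident duplicates."""
    parent = {}
    # Step 1: merge questions that the model confidently calls duplicates.
    for (q1, q2), p in zip(pairs, predictions):
        if p >= threshold:
            parent[find(parent, q1)] = find(parent, q2)
    # Step 2: any pair inside the same cluster gets at least the threshold.
    adjusted = []
    for (q1, q2), p in zip(pairs, predictions):
        same_cluster = find(parent, q1) == find(parent, q2)
        adjusted.append(max(p, threshold) if same_cluster else p)
    return adjusted

# Hypothetical predictions: A~B and B~C are confident, A~C is not.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
predictions = [0.95, 0.92, 0.40, 0.10]
adjusted = apply_transitivity(pairs, predictions)
print(adjusted)  # [0.95, 0.92, 0.9, 0.1]
```

With noisy labels, a single wrong confident prediction can merge two unrelated clusters, so a high threshold is the safer choice.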

There is a solution by Alex that is worth mentioning. It uses just one model, a convolutional neural network (its architecture is quite similar to ours). The model shows good accuracy and at the same time is computationally efficient compared with other solutions. This model is the most suitable for real-life use cases and is worth attention.


Kaggle competitions are always a great place to practice and learn something new. However, the best solution on Kaggle is not guaranteed to be the best solution to a business problem. The example of the Quora Question Pairs Kaggle competition illustrates how important it is to be careful and considerate when preparing training data. If the data set characteristics we’ve described are simulated and not typical of the whole Quora question base, then the participants’ solutions are not applicable in real life. Still, the participants have shown that clues for finding duplicate questions can be found not only in the texts themselves. Good luck with your experiments and have fun!

Original. Reposted with permission.

Bio: Artem Farazei is a Data Scientist at InData Labs.