How to win a Kaggle competition based on an NLP task if you are not an NLP expert

Here is how we got one of the best results in a Kaggle challenge remarkable for a number of interesting findings and controversies among the participants.

By Artem Farazei, InData Labs.

Apart from working on client projects, the InData Labs data science team is keen on taking part in top-notch data science competitions, such as those hosted on Kaggle.

The team recently achieved one of the best results in the Quora Question Pairs challenge on Kaggle. The challenge is remarkable for a number of interesting findings and controversies among the participants, so let’s dig deeper into the details of the competition and work out a winning formula for data science and machine learning Kaggle competitions.

About Quora Question Pairs Kaggle competition

Quora is a Q&A site where anyone can ask questions and get answers. Quora’s audience is quite diverse: people use it for studying, for professional advice, and whenever they have second thoughts about almost anything. Over 100 million people visit Quora every month, so it’s no surprise that many of them ask similarly worded questions. Multiple questions with the same intent force seekers to spend more time finding the best answer and make writers feel they need to answer several versions of the same question. That’s why the goal of the competition was to predict which of the provided question pairs contained two questions with the same meaning.

Participants were given a training data set containing more than 404 thousand question pairs. A closer look at the examples shows that the task is very complicated even for humans.

The first three question pairs were marked by Quora as duplicates, whereas pairs 4–6 were marked as non-duplicates. As you can see, questions from duplicate pairs can look totally different, while non-duplicate questions may differ by just one word. This is one of the main peculiarities of the data set that makes the task pretty difficult for NLP technologies.

Unique characteristics of the data set

Right after the start of the Kaggle competition, participants began sharing interesting findings about the data set. Some of the findings could really skew the final results of the competition; others were just amusing. Here are the most notable ones:

Incorrect labeling

As acknowledged by the organizers: “The ground truth labels on this data set should be taken to be ‘informed’ but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.” In reality, there were many cases where question pairs carried inaccurate or ambiguous labels, and some of the compared questions were only partially included. Here are a couple of examples:

A lot of questions about India

Although the training set contains questions on many different topics, it is striking how many of them are about India. NLP models trained on such data could start assigning a lot of weight to words specific to questions about India, which would skew their answers on questions unrelated to India. That could cause bias if models trained on this data were applied to other data. However, it didn’t affect us, since the test data set also included a lot of questions about India.

Training and test data sets were not equally distributed

This characteristic of the data set is associated with the ID of each question in the training set. IDs are auxiliary information, but in machine learning competitions they often carry non-obvious useful signal. For example, let’s suppose older questions have smaller IDs. This way we can see how the number of duplicates changes over time.

As we can see in the diagram, the share of duplicates decreases over time. Unfortunately, unlike the training set, the test set does not contain question IDs, so we can’t use this information directly, although some contestants tried to restore the question IDs for the test set. This characteristic matters because it is common practice to put older data in training sets and newer data in test sets. All of this suggests that the share of duplicates in the test set is lower than in the training set.
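The trend the diagram shows can be computed directly from the training set: treat the smaller question ID in each pair as a rough proxy for age, bin the pairs by that ID, and look at the duplicate share per bin. Here is a minimal sketch on toy data (the records and bin size are illustrative, not the competition’s actual figures):

```python
# Each record: (qid1, qid2, is_duplicate). The real training set has
# ~404k such rows; this toy list just illustrates the computation.
pairs = [
    (1, 2, 1), (3, 4, 1), (5, 6, 0), (7, 8, 1),
    (101, 102, 1), (103, 104, 0), (105, 106, 0), (107, 108, 0),
]

def duplicate_rate_by_id(pairs, bin_size=100):
    """Share of duplicates per bin of min(qid1, qid2) — a rough time proxy."""
    bins = {}
    for qid1, qid2, is_dup in pairs:
        b = min(qid1, qid2) // bin_size
        total, dups = bins.get(b, (0, 0))
        bins[b] = (total + 1, dups + is_dup)
    return {b: dups / total for b, (total, dups) in sorted(bins.items())}

print(duplicate_rate_by_id(pairs))  # here: {0: 0.75, 1: 0.25} — rate falls with ID
```

On the real data the same per-bin computation produces the downward trend shown in the diagram.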

It is very important to know in advance whether the duplicate distribution differs between the test and training data sets, since the quality metric used in this competition (log loss) is very sensitive to such distribution shifts. For example, when the model processes the test data and is unsure, it will probably mark the pair as a non-duplicate, since it has learned that non-duplicates prevailed in the training set. And if the distribution in the test data set is different, the model will most likely be wrong.
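A common remedy, widely shared by participants in this competition, is to re-weight a model’s predicted probabilities from the training prior towards an assumed test prior. A sketch (the 37% training rate is approximately correct for this data set; the 17% test rate is an illustrative assumption):

```python
def rebalance(p, train_rate, test_rate):
    """Shift a predicted duplicate probability p, calibrated to the
    training prior train_rate, towards an assumed test prior test_rate."""
    a = test_rate / train_rate              # scaling for the positive class
    b = (1 - test_rate) / (1 - train_rate)  # scaling for the negative class
    return a * p / (a * p + b * (1 - p))

# A model that is unsure (p = 0.5) on train-like data should lean towards
# "non-duplicate" if the test set is believed to hold fewer duplicates.
print(rebalance(0.5, 0.37, 0.17))  # noticeably below 0.5
```

With equal priors the function returns `p` unchanged, so it only intervenes when the two distributions actually differ.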

Magic features

Contest organizers expected participants to rely primarily on NLP features in their solutions. But it turned out that the structure of the data set itself contained valuable features.

Two of them turned out to be decisive.

1. In order to use this valuable information, it is convenient to represent the data set as a graph. This can be done in a number of ways. For example, let’s build a graph where every record is represented by two nodes connected with an edge, each node corresponding to one question from the data set. Let’s imagine the data set contains only seven records:

The graph will look the following way:

Now we can calculate the number of “common neighbors” for every question pair in the data set. “Common neighbors” are the questions that are adjacent in the graph to both questions of a record. For example, the first record in our data set has two such “neighbors”:
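The construction above can be sketched in a few lines of Python. The seven records below are hypothetical (the article’s own example pairs aren’t reproduced here), but they are chosen so that the first pair has two common neighbors, matching the situation described:

```python
from collections import defaultdict

# Toy data set of seven records, each a pair of question IDs.
records = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (4, 5), (5, 6)]

# Build an adjacency map: each record adds an edge between its two questions.
neighbors = defaultdict(set)
for q1, q2 in records:
    neighbors[q1].add(q2)
    neighbors[q2].add(q1)

def common_neighbors(q1, q2):
    """Questions adjacent to both members of a pair (excluding the pair itself)."""
    return (neighbors[q1] & neighbors[q2]) - {q1, q2}

print(sorted(common_neighbors(1, 2)))  # [3, 4] — two common neighbors
```

The size of this set per pair becomes a numeric feature: pairs of questions that share many graph neighbors are much more likely to be duplicates.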