Deep Learning For Chatbots, Part 2 – Implementing A Retrieval-Based Model In TensorFlow

Check out part 2 of this tutorial on building chatbots with deep neural networks. This part gets practical, and using Python and TensorFlow to implement.

Editor's note: The first part of this tutorial can be found here.

The Code and data for this tutorial is on Github.

Retrieval-based Bots

In this post we’ll implement a retrieval-based bot. Retrieval-based models have a repository of pre-defined responses they can use, which is unlike generative models that can generate responses they’ve never seen before. A bit more formally, the input to a retrieval-based model is a context c (the conversation up to this point) and a potential response r. The model outputs is a score for the response. To find a good response you would calculate the score for multiple responses and choose the one with the highest score.

But why would you want to build a retrieval-based model if you can build a generative model? Generative models seem more flexible because they don’t need this repository of predefined responses, right?

The problem is that generative models don’t work well in practice. At least not yet. Because they have so much freedom in how they can respond, generative models tend to make grammatical mistakes and produce irrelevant, generic or inconsistent responses. They also need huge amounts of training data and are hard to optimize. The vast majority of production systems today are retrieval-based, or a combination of retrieval-based and generative. Google’s Smart Reply is a good example. Generative models are an active area of research, but we’re not quite there yet. If you want to build a conversational agent today your best bet is most likely a retrieval-based model.

The Ubuntu Dialog Corpus

In this post we’ll work with the Ubuntu Dialog Corpus (papergithub). The Ubuntu Dialog Corpus (UDC) is one of the largest public dialog datasets available. It’s based on chat logs from the Ubuntu channels on a public IRC network. The paper goes into detail on how exactly the corpus was created, so I won’t repeat that here. However, it’s important to understand what kind of data we’re working with, so let’s do some exploration first.

The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0). Each example consists of a context, the conversation up to this point, and an utterance, a response to the context. A positive label means that an utterance was an actual response to a context, and a negative label means that the utterance wasn’t – it was picked randomly from somewhere in the corpus. Here is some sample data.

Head of Ubuntu Dialog Corpus Training Set

Note that the dataset generation script has already done a bunch of preprocessing for us – it has tokenizedstemmed, and lemmatized the output using the NLTK tool. The script also replaced entities like names, locations, organizations, URLs, and system paths with special tokens. This preprocessing isn’t strictly necessary, but it’s likely to improve performance by a few percent. The average context is 86 words long and the average utterance is 17 words long. Check out the Jupyter notebook to see the data analysis.

The data set comes with test and validations sets. The format of these is different from that of the training data. Each record in the test/validation set consists of a context, a ground truth utterance (the real response) and 9 incorrect utterances calleddistractors. The goal of the model is to assign the highest score to the true utterance, and lower scores to wrong utterances.

Ubuntu Dialog Corpus Test Head

The are various ways to evaluate how well our model does. A commonly used metric is recall@k. Recall@k means that we let the model pick the k best responses out of the 10 possible responses (1 true and 9 distractors). If the correct one is among the picked ones we mark that test example as correct. So, a larger k means that the task becomes easier. If we set k=10 we get a recall of 100% because we only have 10 responses to pick from. If we set k=1 the model has only one chance to pick the right response.

At this point you may be wondering how the 9 distractors were chosen. In this data set the 9 distractors were picked at random. However, in the real world you may have millions of possible responses and you don’t know which one is correct. You can’t possibly evaluate a million potential responses to pick the one with the highest score – that’d be too expensive. Google’s Smart Reply uses clustering techniques to come up with a set of possible responses to choose from first. Or, if you only have a few hundred potential responses in total you could just evaluate all of them.


Before starting with fancy Neural Network models let’s build some simple baseline models to help us understand what kind of performance we can expect. We’ll use the following function to evaluate our recall@k metric:

def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct/num_examples

Here, y is a list of our predictions sorted by score in descending order, and y_test is the actual label. For example, a y of [0,3,1,2,5,6,4,7,8,9] Would mean that the utterance number 0 got the highest score, and utterance 9 got the lowest score. Remember that we have 10 utterances for each test example, and the first one (index 0) is always the correct one because the utterance column comes before the distractor columns in our data.

Intuitively, a completely random predictor should get a score of 10% for recall@1, a score of 20% for recall@2, and so on. Let’s see if that’s the case.

# Random Predictor
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)
# Evaluate Random predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
y_test = np.zeros(len(y_random))
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))

Recall @ (1, 10): 0.0937632
Recall @ (2, 10): 0.194503
Recall @ (5, 10): 0.49297
Recall @ (10, 10): 1

Great, seems to work. Of course we don’t just want a random predictor. Another baseline that was discussed in the original paper is a tf-idf predictor. tf-idf stands for “term frequency – inverse document” frequency and it measures how important a word in a document is relative to the whole corpus. Without going into too much detail (you can find many tutorials about tf-idf on the web), documents that have similar content will have similar tf-idf vectors. Intuitively, if a context and a response have similar words they are more likely to be a correct pair. At least more likely than random. Many libraries out there (such as scikit-learn) come with built-in tf-idf functions, so it’s very easy to use. Let’s build a tf-idf predictor and see how well it performs.

class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
    def train(self, data):,data.Utterance.values))
    def predict(self, context, utterances):
        # Convert context and utterances into tfidf vector
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result =, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]

# Evaluate TFIDF predictor
pred = TFIDFPredictor()
y = [pred.predict(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))

Recall @ (1, 10): 0.495032
Recall @ (2, 10): 0.596882
Recall @ (5, 10): 0.766121
Recall @ (10, 10): 1

We can see that the tf-idf model performs significantly better than the random model. It’s far from perfect though. The assumptions we made aren’t that great. First of all, a response doesn’t necessarily need to be similar to the context to be correct. Secondly, tf-idf ignores word order, which can be an important signal. With a Neural Network model we can do a bit better.