Deep Learning For Chatbots, Part 2 – Implementing A Retrieval-Based Model In TensorFlow

Check out part 2 of this tutorial on building chatbots with deep neural networks. This part gets practical and implements a retrieval-based model using Python and TensorFlow.

Dual Encoder LSTM

The Deep Learning model we will build in this post is called a Dual Encoder LSTM network. This type of network is just one of many we could apply to this problem and it’s not necessarily the best one. You can come up with all kinds of Deep Learning architectures that haven’t been tried yet – it’s an active research area. For example, the seq2seq model often used in Machine Translation would probably do well on this task. We are going with the Dual Encoder because it has been reported to give decent performance on this data set. This means we know what to expect and can check our implementation against the reported results. Applying other models to this problem would be an interesting project.

The Dual Encoder LSTM we’ll build looks like this (paper):

Dual Encoder RNN

It roughly works as follows:

  1. Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford’s GloVe vectors and are fine-tuned during training (Side note: This is optional and not shown in the picture. I found that initializing the word embeddings with GloVe did not make a big difference in terms of model performance).
  2. Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the “meaning” of the context and response (c and r in the picture). We can choose how large these vectors should be, but let’s say we pick 256 dimensions.
  3. We multiply c with a matrix M to “predict” a response r'. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.
  4. We measure the similarity of the predicted response r' and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score. We then apply a sigmoid function to convert that score into a probability. Note that steps 3 and 4 are combined in the figure.
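To make steps 3 and 4 concrete, here is a minimal sketch of the scoring step in plain Python. It is not the TensorFlow implementation: the vectors c and r would normally come from the RNN, and M would be learned, so random values stand in for all three here. The helper names sigmoid and score_pair are made up for this illustration.

```python
import math
import random


def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_pair(c, M, r):
    """Return sigmoid(c^T M r) for a context encoding c and a response encoding r."""
    dim = len(c)
    # Step 3: "predict" a response r' = c^T M (one entry per column of M).
    r_pred = [sum(c[i] * M[i][j] for i in range(dim)) for j in range(dim)]
    # Step 4: dot product of the predicted and actual response, then a sigmoid.
    logit = sum(p * q for p, q in zip(r_pred, r))
    return sigmoid(logit)


random.seed(0)
dim = 4  # the post uses 256; 4 keeps the example readable

# Assumption: random stand-ins for the RNN encodings and the learned matrix.
c = [random.gauss(0, 1) for _ in range(dim)]
r = [random.gauss(0, 1) for _ in range(dim)]
M = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

probability = score_pair(c, M, r)
print(probability)
```

With M set to the identity matrix, the score reduces to a sigmoid over the plain dot product of c and r, which is a handy sanity check.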

To train the network, we also need a loss (cost) function. We’ll use the binary cross-entropy loss common for classification problems. Let’s call our true label for a context-response pair y. This can be either 1 (actual response) or 0 (incorrect response). Let’s call our predicted probability from step 4 above y'. Then, the cross-entropy loss is calculated as L = −y * ln(y') − (1 − y) * ln(1 − y'). The intuition behind this formula is simple. If y = 1 we are left with L = −ln(y'), which penalizes a prediction far away from 1, and if y = 0 we are left with L = −ln(1 − y'), which penalizes a prediction far away from 0.
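A few numbers help build intuition for the loss. The sketch below evaluates the binary cross-entropy for a single example, with y' denoting the predicted probability in both logarithms. The function name binary_cross_entropy is chosen for this illustration only.

```python
import math


def binary_cross_entropy(y, y_pred):
    """L = -y * ln(y') - (1 - y) * ln(1 - y') for a single example."""
    return -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)


# True response (y = 1) scored with high probability: small loss, about 0.105.
confident_right = binary_cross_entropy(1, 0.9)
# True response (y = 1) scored with low probability: large loss, about 2.303.
confident_wrong = binary_cross_entropy(1, 0.1)
# Incorrect response (y = 0) scored with low probability: small loss again.
negative_right = binary_cross_entropy(0, 0.1)

print(confident_right, confident_wrong, negative_right)
```

Note that the loss is undefined for predictions of exactly 0 or 1, so practical implementations usually work on the pre-sigmoid score or clip the probability.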

For our implementation we’ll use a combination of numpy, pandas, TensorFlow and TF Learn (a collection of high-level convenience functions for TensorFlow).

Data Preprocessing

The dataset originally comes in CSV format. We could work directly with CSVs, but it’s better to convert our data into TensorFlow’s own Example format. (Quick side note: There’s also tf.SequenceExample but it doesn’t seem to be supported by tf.learn yet). The main benefit of this format is that it allows us to load tensors directly from the input files and let TensorFlow handle all the shuffling, batching and queuing of inputs. As part of the preprocessing we also create a vocabulary. This means we map each word to an integer number, e.g. “cat” may become 2631. The TFRecord files we will generate store these integer numbers instead of the word strings. We will also save the vocabulary so that we can map back from integers to words later on.
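The word-to-integer mapping can be sketched in a few lines of plain Python. This is only an illustration, not the preprocessing script itself: the function names are made up, whitespace splitting stands in for real tokenization, and reserving id 0 for unknown words is an assumption.

```python
def build_vocabulary(texts):
    """Assign each distinct word an integer id (assumption: id 0 means "unknown")."""
    vocab = {"<UNK>": 0}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map a text to a list of word ids, using 0 for out-of-vocabulary words."""
    return [vocab.get(word, 0) for word in text.split()]


def decode(ids, vocab):
    """Map word ids back to words using the saved vocabulary."""
    inverse = {word_id: word for word, word_id in vocab.items()}
    return " ".join(inverse[word_id] for word_id in ids)


vocab = build_vocabulary(["my cat sleeps", "my dog barks"])
ids = encode("my cat barks loudly", vocab)
print(ids)                 # [1, 2, 5, 0]
print(decode(ids, vocab))  # my cat barks <UNK>
```

Keeping the vocabulary around is what makes the round trip possible: without it, the integer sequences stored in the data files could not be turned back into readable text.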

Each Example contains the following fields:

  • context: A sequence of word ids representing the context text, e.g. [231, 2190, 737, 0, 912]
  • context_len: The length of the context, e.g. 5 for the above example
  • utterance: A sequence of word ids representing the utterance (response)
  • utterance_len: The length of the utterance
  • label: Only in the training data. 0 or 1.
  • distractor_[N]: Only in the test/validation data. N ranges from 0 to 8. A sequence of word ids representing the distractor utterance.
  • distractor_[N]_len: Only in the test/validation data. N ranges from 0 to 8. The length of the distractor utterance.
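To visualize the two record layouts, the sketch below builds them as plain Python dictionaries. The real files store these fields in Example records, so the dictionaries and the helper names make_train_record and make_eval_record are purely illustrative.

```python
def make_train_record(context_ids, utterance_ids, label):
    """Training layout: context, utterance, their lengths, and a 0/1 label."""
    return {
        "context": context_ids,
        "context_len": len(context_ids),
        "utterance": utterance_ids,
        "utterance_len": len(utterance_ids),
        "label": label,
    }


def make_eval_record(context_ids, utterance_ids, distractors):
    """Test/validation layout: no label, but 9 distractors numbered 0 to 8."""
    record = {
        "context": context_ids,
        "context_len": len(context_ids),
        "utterance": utterance_ids,
        "utterance_len": len(utterance_ids),
    }
    for n, distractor_ids in enumerate(distractors):
        record["distractor_%d" % n] = distractor_ids
        record["distractor_%d_len" % n] = len(distractor_ids)
    return record


train_record = make_train_record([231, 2190, 737, 0, 912], [14, 7], label=1)
eval_record = make_eval_record([231, 2190, 737, 0, 912], [14, 7],
                               [[n + 1, n + 2] for n in range(9)])
print(sorted(eval_record))
```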

The preprocessing is done by the Python script, which generates 3 files: train.tfrecords, validation.tfrecords and test.tfrecords. You can run the script yourself or download the data files here.

Creating An Input Function

In order to use Tensorflow’s built-in support for training and evaluation we need to create an input function – a function that returns batches of our input data. In fact, because our training and test data have different formats, we need different input functions for them. The input function should return a batch of features and labels (if available). Something along the lines of:

def input_fn():
  # TODO Load and preprocess data here
  return batched_features, labels

Because we need different input functions during training and evaluation, and because we hate code duplication, we create a wrapper called create_input_fn that creates an input function for the appropriate mode. It also takes a few other parameters. Here’s the definition we’re using:

def create_input_fn(mode, input_files, batch_size, num_epochs=None):
  def input_fn():
    # TODO Load and preprocess data here
    return batched_features, labels
  return input_fn

The complete code can be found in On a high level, the function does the following:

  1. Create a feature definition that describes the fields in our Example file
  2. Read records from the input_files with tf.TFRecordReader
  3. Parse the records according to the feature definition
  4. Extract the training labels
  5. Batch multiple examples and training labels
  6. Return the batched examples and training labels
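The steps above can be mimicked in plain Python to show the shape of the data that comes out. This is a stand-in, not the TensorFlow version: it reads from an in-memory list of dictionaries instead of files via tf.TFRecordReader, it does no shuffling or queuing, and the name create_in_memory_input_fn is hypothetical.

```python
def create_in_memory_input_fn(records, batch_size):
    """Return an input_fn that yields (batched_features, labels) pairs."""
    # Step 1: the "feature definition" is just the list of fields we expect.
    feature_names = ["context", "context_len", "utterance", "utterance_len"]

    def input_fn():
        # Step 2: "read" records, one batch-sized slice at a time.
        for start in range(0, len(records), batch_size):
            chunk = records[start:start + batch_size]
            # Steps 3 and 5: pick out each field and group it across the batch.
            batched_features = {
                name: [record[name] for record in chunk] for name in feature_names
            }
            # Step 4: the training labels are kept separate from the features.
            labels = [record["label"] for record in chunk]
            # Step 6: hand back one batch of features and labels.
            yield batched_features, labels

    return input_fn


records = [
    {"context": [1, 2, 3], "context_len": 3, "utterance": [4], "utterance_len": 1, "label": 1},
    {"context": [5, 6], "context_len": 2, "utterance": [7, 8], "utterance_len": 2, "label": 0},
    {"context": [9], "context_len": 1, "utterance": [2], "utterance_len": 1, "label": 1},
]
batches = list(create_in_memory_input_fn(records, batch_size=2)())
print(len(batches), batches[0][1])  # 2 [1, 0]
```

The closure pattern is the same as in create_input_fn: the outer function captures the configuration, and the inner function takes no arguments, which is what the training and evaluation code expects to call.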

Defining Evaluation Metrics

We already mentioned that we want to use the recall@k metric to evaluate our model. Luckily, Tensorflow already comes with many standard evaluation metrics that we can use, including recall@k. To use these metrics we need to create a dictionary that maps from a metric name to a function that takes the predictions and label as arguments:

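import functools
import tensorflow as tf

def create_evaluation_metrics():
  eval_metrics = {}
  for k in [1, 2, 5, 10]:
    # Fix k so that each metric only takes (predictions, labels)
    eval_metrics["recall_at_%d" % k] = functools.partial(
        tf.contrib.metrics.streaming_sparse_recall_at_k, k=k)
  return eval_metrics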

Above, we use functools.partial to convert a function that takes 3 arguments to one that only takes 2 arguments. Don’t let the name streaming_sparse_recall_at_k confuse you. Streaming just means that the metric is accumulated over multiple batches, and sparse refers to the format of our labels.

This brings us to an important point: What exactly is the format of our predictions during evaluation? During training, we predict the probability of the example being correct. But during evaluation our goal is to score the utterance and 9 distractors and pick the best one – we don’t simply predict correct/incorrect. This means that during evaluation each example should result in a vector of 10 scores, e.g. [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11], where the scores correspond to the true response and the 9 distractors respectively. Each utterance is scored independently, so the probabilities don’t need to add up to 1. Because the true response is always element 0 in the array, the label for each example is 0. The example above would be counted as classified incorrectly by recall@1 because the third distractor got a probability of 0.45 while the true response only got 0.34. It would be scored as correct by recall@2, however.
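The example can be checked with a small plain-Python version of recall@k. This is a simplified, non-streaming stand-in for the TensorFlow metric (it looks at one list of examples rather than accumulating over batches), and recall_at_k is a name made up for this sketch. It also shows the functools.partial trick of fixing k so that each metric only takes predictions and labels.

```python
import functools


def recall_at_k(predictions, labels, k):
    """Fraction of examples whose true index is among the k highest scores."""
    hits = 0
    for scores, label in zip(predictions, labels):
        # Indices of the k largest scores, best first.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# One evaluation example: the true response is element 0, followed by 9 distractors.
predictions = [[0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]]
labels = [0]

# Fix k with functools.partial so every metric has the signature (predictions, labels).
metrics = {"recall_at_%d" % k: functools.partial(recall_at_k, k=k) for k in [1, 2, 5, 10]}
results = {name: metric(predictions, labels) for name, metric in metrics.items()}
print(results)
# {'recall_at_1': 0.0, 'recall_at_2': 1.0, 'recall_at_5': 1.0, 'recall_at_10': 1.0}
```

With 10 candidates per example, recall@10 is always 1, so it serves mainly as a sanity check on the evaluation pipeline.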