Deep Learning For Chatbots, Part 2 – Implementing A Retrieval-Based Model In TensorFlow

Check out part 2 of this tutorial on building chatbots with deep neural networks. This part gets practical, and using Python and TensorFlow to implement.

Boilerplate Training Code

Before writing the actual neural network code I like to write the boilerplate code for training and evaluating the model. That’s because, as long as you adhere to the right interfaces, it’s easy to swap out what kind of network you are using. Let’s assume we have a model function model_fn that takes as inputs our batched features, labels and mode (train or evaluation) and returns the predictions. Then we can write general-purpose code to train our model as follows:

estimator = tf.contrib.learn.Estimator(
input_fn_train = udc_inputs.create_input_fn(
input_fn_eval = udc_inputs.create_input_fn(
eval_metrics = udc_metrics.create_evaluation_metrics()
# We need to subclass theis manually for now. The next TF version will
# have support ValidationMonitors with metrics built-in.
# It's already on the master branch.
class EvaluationMonitor(tf.contrib.learn.monitors.EveryN):
def every_n_step_end(self, step, outputs):
eval_monitor = EvaluationMonitor(every_n_steps=FLAGS.eval_every), steps=None, monitors=[eval_monitor])

Here we create an estimator for our model_fn, two input functions for training and evaluation data, and our evaluation metrics dictionary. We also define a monitor that evaluates our model every FLAGS.eval_every steps during training. Finally, we train the model. The training runs indefinitely, but Tensorflow automatically saves checkpoint files in MODEL_DIR, so you can stop the training at any time. A more fancy technique would be to use early stopping, which means you automatically stop training when a validation set metric stops improving (i.e. you are starting to overfit). You can see the full code in

Two things I want to mention briefly is the usage of FLAGS. This is a way to give command line parameters to the program (similar to Python’s argparse). hparams is a custom object we create in that holds hyperparameters, nobs we can tweak, of our model. This hparams object is given to the model when we instantiate it.

Creating The Model

Now that we have set up the boilerplate code around inputs, parsing, evaluation and training it’s time to write code for our Dual LSTM neural network. Because we have different formats of training and evaluation data I’ve written a create_model_fnwrapper that takes care of bringing the data into the right format for us. It takes amodel_impl argument, which is a function that actually makes predictions. In our case it’s the Dual Encoder LSTM we described above, but we could easily swap it out for some other neural network. Let’s see what that looks like:

def dual_encoder_model(
  # Initialize embedidngs randomly or with pre-trained vectors if available
  embeddings_W = get_embeddings(hparams)
  # Embed the context and the utterance
  context_embedded = tf.nn.embedding_lookup(
      embeddings_W, context, name="embed_context")
  utterance_embedded = tf.nn.embedding_lookup(
      embeddings_W, utterance, name="embed_utterance")
  # Build the RNN
  with tf.variable_scope("rnn") as vs:
    # We use an LSTM Cell
    cell = tf.nn.rnn_cell.LSTMCell(
    # Run the utterance and context through the RNN
    rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
        tf.concat(0, [context_embedded, utterance_embedded]),
        sequence_length=tf.concat(0, [context_len, utterance_len]),
    encoding_context, encoding_utterance = tf.split(0, 2, rnn_states.h)
  with tf.variable_scope("prediction") as vs:
    M = tf.get_variable("M",
      shape=[hparams.rnn_dim, hparams.rnn_dim],
    # "Predict" a  response: c * M
    generated_response = tf.matmul(encoding_context, M)
    generated_response = tf.expand_dims(generated_response, 2)
    encoding_utterance = tf.expand_dims(encoding_utterance, 2)
    # Dot product between generated response and actual response
    # (c * M) * r
    logits = tf.batch_matmul(generated_response, encoding_utterance, True)
    logits = tf.squeeze(logits, [2])
    # Apply sigmoid to convert logits to probabilities
    probs = tf.sigmoid(logits)
    # Calculate the binary cross-entropy loss
    losses = tf.nn.sigmoid_cross_entropy_with_logits(logits, tf.to_float(targets))
  # Mean loss across the batch of examples
  mean_loss = tf.reduce_mean(losses, name="mean_loss")
  return probs, mean_loss

The full code is in Given this, we can now instantiate our model function in the main routine in that we defined earlier.

model_fn = udc_model.create_model_fn(

That’s it! We can now run python and it should start training our networks, occasionally evaluating recall on our validation data (you can choose how often you want to evaluate using the --eval_every switch). To get a complete list of all available command line flags that we defined using tf.flags and hparamsyou can run python --help.

INFO:tensorflow:training step 20200, loss = 0.36895 (0.330 sec/batch).
INFO:tensorflow:Step 20201: mean_loss:0 = 0.385877
INFO:tensorflow:training step 20300, loss = 0.25251 (0.338 sec/batch).
INFO:tensorflow:Step 20301: mean_loss:0 = 0.405653
INFO:tensorflow:Results after 270 steps (0.248 sec/batch): recall_at_1 = 0.507581018519, recall_at_2 = 0.689699074074, recall_at_5 = 0.913020833333, recall_at_10 = 1.0, loss = 0.5383

Evaluating The Model

After you’ve trained the model you can evaluate it on the test set using python --model_dir=$MODEL_DIR_FROM_TRAINING, e.g.python --model_dir=~/github/chatbot-retrieval/runs/1467389151. This will run the recall@k evaluation metrics on the test set instead of the validation set. Note that you must call udc_test.pywith the same parameters you used during training. So, if you trained with --embedding_size=128 you need to call the test script with the same.

After training for about 20,000 steps (around an hour on a fast GPU) our model gets the following results on the test set:

recall_at_1 = 0.507581018519
recall_at_2 = 0.689699074074
recall_at_5 = 0.913020833333

While recall@1 is close to our TFIDF model, recall@2 and recall@5 are significantly better, suggesting that our neural network assigns higher scores to the correct answers. The original paper reported 0.550.72 and 0.92 for recall@1, recall@2, and recall@5 respectively, but I haven’t been able to reproduce scores quite as high. Perhaps additional data preprocessing or hyperparameter optimization may bump scores up a bit more.

Making Predictions

You can modify and run to get probability scores for unseen data. For example python --model_dir=./runs/1467576365/ outputs:

Context: Example context
Response 1: 0.44806
Response 2: 0.481638

You could imagine feeding in 100 potential responses to a context and then picking the one with the highest score.


In this post we’ve implemented a retrieval-based neural network model that can assign scores to potential responses given a conversation context. There is still a lot of room for improvement, however. One can imagine that other neural networks do better on this task than a dual LSTM encoder. There is also a lot of room for hyperparameter optimization, or improvements to the preprocessing step. The Code and data for this tutorial is on Github, so check it out.

Original. Reposted with permission.