Deep Learning For Chatbots, Part 2 – Implementing A Retrieval-Based Model In TensorFlow
Check out part 2 of this tutorial on building chatbots with deep neural networks. This part gets practical: we use Python and TensorFlow to implement the model.
Dual Encoder LSTM
The Deep Learning model we will build in this post is called a Dual Encoder LSTM network. This type of network is just one of many we could apply to this problem, and it's not necessarily the best one. You can come up with all kinds of Deep Learning architectures that haven't been tried yet – it's an active research area. For example, the seq2seq model often used in Machine Translation would probably do well on this task. The reason we are going for the Dual Encoder is that it has been reported to give decent performance on this data set. This means we know what to expect and can be sure that our implementation is correct. Applying other models to this problem would be an interesting project.
The Dual Encoder LSTM we’ll build looks like this (paper):
It roughly works as follows:
- Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford's GloVe vectors and are fine-tuned during training (Side note: This is optional and not shown in the picture. I found that initializing the word embeddings with GloVe did not make a big difference in terms of model performance).
- Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the "meaning" of the context and response (`c` and `r` in the picture). We can choose how large these vectors should be, but let's say we pick 256 dimensions.
- We multiply `c` with a matrix `M` to "predict" a response `r'`. If `c` is a 256-dimensional vector, then `M` is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix `M` is learned during training.
- We measure the similarity of the predicted response `r'` and the actual response `r` by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score. We then apply a sigmoid function to convert that score into a probability. Note that steps 3 and 4 are combined in the figure.
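The scoring steps above can be sketched numerically. This is an illustrative stand-in, not the actual model: the vectors `c` and `r` here are random placeholders, whereas in the real network they come from the shared LSTM encoder and `M` is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256  # dimensionality of the encoded context/response vectors

c = rng.standard_normal(dim)          # encoded context (stand-in for the LSTM output)
r = rng.standard_normal(dim)          # encoded candidate response
M = rng.standard_normal((dim, dim))   # prediction matrix, learned during training

r_pred = M @ c                        # step 3: "predicted" response r' = Mc
score = (r_pred @ r) / dim            # step 4: dot-product similarity
                                      # (scaled here only to keep the toy sigmoid in range)
prob = 1.0 / (1.0 + np.exp(-score))   # sigmoid turns the score into a probability
```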
To train the network, we also need a loss (cost) function. We'll use the binary cross-entropy loss common for classification problems. Let's call our true label for a context-response pair `y`. This can be either 1 (actual response) or 0 (incorrect response). Let's call our predicted probability from step 4 above `y'`. Then, the cross-entropy loss is calculated as `L = −y * ln(y') − (1 − y) * ln(1 − y')`. The intuition behind this formula is simple. If `y = 1` we are left with `L = −ln(y')`, which penalizes a prediction far away from 1, and if `y = 0` we are left with `L = −ln(1 − y')`, which penalizes a prediction far away from 0.
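As a quick sanity check of this intuition, the loss can be computed directly in plain Python (`y_pred` stands in for y'):

```python
import math

def binary_cross_entropy(y, y_pred):
    # L = -y * ln(y') - (1 - y) * ln(1 - y')
    return -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)

# For a true response (y = 1), a confident correct prediction is cheap...
print(binary_cross_entropy(1, 0.9))  # ~0.105
# ...while a confident wrong prediction is heavily penalized.
print(binary_cross_entropy(1, 0.1))  # ~2.303
# For an incorrect response (y = 0), the penalty mirrors this.
print(binary_cross_entropy(0, 0.9))  # ~2.303
```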
The dataset originally comes in CSV format. We could work directly with the CSVs, but it's better to convert our data into TensorFlow's proprietary Example format. (Quick side note: There's also `tf.SequenceExample`, but it doesn't seem to be supported by tf.learn yet.) The main benefit of this format is that it allows us to load tensors directly from the input files and let TensorFlow handle all the shuffling, batching and queuing of inputs. As part of the preprocessing we also create a vocabulary. This means we map each word to an integer id, so a word like "cat" becomes a number. The `TFRecord` files we will generate store these integer ids instead of the word strings. We will also save the vocabulary so that we can map back from integers to words later on.
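The word-to-id mapping works like this toy vocabulary (the actual vocabulary and ids are built from the corpus by the preprocessing script; the words and numbers here are made up):

```python
# Toy vocabulary: map words to integer ids and back.
vocab = {"UNK": 0, "the": 1, "cat": 2, "sat": 3}
inverse_vocab = {i: w for w, i in vocab.items()}

def encode(words):
    # Unknown words fall back to the UNK id.
    return [vocab.get(w, vocab["UNK"]) for w in words]

ids = encode(["the", "cat", "sat"])         # -> [1, 2, 3]
restored = [inverse_vocab[i] for i in ids]  # -> ["the", "cat", "sat"]
```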
Each `Example` contains the following fields:
- `context`: A sequence of word ids representing the context text, e.g. `[231, 2190, 737, 0, 912]`
- `context_len`: The length of the context, e.g. `5` for the above example
- `utterance`: A sequence of word ids representing the utterance (response)
- `utterance_len`: The length of the utterance
- `label`: Only in the training data. Either 0 or 1.
- `distractor_[N]`: Only in the test/validation data. N ranges from 0 to 8. A sequence of word ids representing the distractor utterance.
- `distractor_[N]_len`: Only in the test/validation data. N ranges from 0 to 8. The length of the distractor utterance.
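Putting the schema together, one training record and one test/validation record look roughly like the following, shown as plain Python dicts for readability (the real files store serialized `Example` protos, and all word ids here are made up):

```python
train_example = {
    "context": [231, 2190, 737, 0, 912],
    "context_len": 5,
    "utterance": [45, 12, 33],
    "utterance_len": 3,
    "label": 1,  # only present in the training data
}

test_example = {
    "context": [231, 2190, 737, 0, 912],
    "context_len": 5,
    "utterance": [45, 12, 33],
    "utterance_len": 3,
}
# 9 distractors, numbered 0 through 8, only in the test/validation data.
for n in range(9):
    test_example["distractor_%d" % n] = [7, 81, 4]
    test_example["distractor_%d_len" % n] = 3
```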
The preprocessing is done by the `prepare_data.py` Python script, which generates 3 files: `train.tfrecords`, `validation.tfrecords` and `test.tfrecords`. You can run the script yourself or download the data files here.
Creating An Input Function
In order to use Tensorflow’s built-in support for training and evaluation we need to create an input function – a function that returns batches of our input data. In fact, because our training and test data have different formats, we need different input functions for them. The input function should return a batch of features and labels (if available). Something along the lines of:
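A toy version of such an input function, returning one hard-coded batch (in the real implementation the features and labels are tensors read from the `TFRecord` files, and all values here are made up):

```python
def input_fn():
    # A batch of 2 examples; every feature is batched along the first dimension.
    features = {
        "context": [[231, 2190, 737], [45, 12, 0]],
        "context_len": [3, 2],
        "utterance": [[912, 5, 0], [7, 81, 4]],
        "utterance_len": [2, 3],
    }
    labels = [1, 0]  # 1 = actual response, 0 = incorrect response
    return features, labels

features, labels = input_fn()
```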
Because we need different input functions during training and evaluation and because we hate code duplication we create a wrapper called
create_input_fn that creates an input function for the appropriate mode. It also takes a few other parameters. Here’s the definition we’re using:
The complete code can be found in
udc_inputs.py. On a high level, the function does the following:
- Create a feature definition that describes the fields in our `Example`
- Read records from the `tfrecords` file
- Parse the records according to the feature definition
- Extract the training labels
- Batch multiple examples and training labels
- Return the batched examples and training labels
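Stripped of the TensorFlow plumbing, those steps can be sketched in plain Python. Everything here is illustrative: `records` stands in for the already-parsed file contents, and in the real `create_input_fn` TensorFlow reads, parses, and batches the records itself.

```python
def create_input_fn(mode, records, batch_size):
    # Step 1: the feature definition depends on the mode, because only
    # the evaluation data carries the distractor fields.
    feature_names = ["context", "context_len", "utterance", "utterance_len"]
    if mode == "eval":
        for n in range(9):
            feature_names += ["distractor_%d" % n, "distractor_%d_len" % n]

    def input_fn():
        batch = records[:batch_size]              # steps 2-3: read and parse records
        labels = [r.get("label") for r in batch]  # step 4: extract labels (None in eval)
        features = {                              # step 5: batch examples per feature
            name: [r[name] for r in batch] for name in feature_names
        }
        return features, labels                   # step 6: return both
    return input_fn
```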
Defining Evaluation Metrics
We already mentioned that we want to use the `recall@k` metric to evaluate our model. Luckily, TensorFlow already comes with many standard evaluation metrics that we can use, including `recall@k`. To use these metrics we need to create a dictionary that maps from a metric name to a function that takes the predictions and labels as arguments:
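The pattern looks like the following sketch. The real code uses TensorFlow's built-in `streaming_sparse_recall_at_k`; here a plain-Python `recall_at_k` stands in so the role of `functools.partial` is clear:

```python
import functools

def recall_at_k(labels, predictions, k):
    """Fraction of examples whose true candidate appears in the top-k scores.
    labels: index of the true candidate per example (always 0 in our data).
    predictions: one list of candidate scores per example."""
    hits = 0
    for label, scores in zip(labels, predictions):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# partial fixes k, turning the 3-argument metric into the 2-argument
# (labels, predictions) function the metrics dictionary expects.
eval_metrics = {
    "recall_at_%d" % k: functools.partial(recall_at_k, k=k)
    for k in [1, 2, 5, 10]
}
```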
Above, we use `functools.partial` to convert a function that takes 3 arguments into one that only takes 2 arguments. Don't let the name `streaming_sparse_recall_at_k` confuse you: streaming just means that the metric is accumulated over multiple batches, and sparse refers to the format of our labels.
This brings us to an important point: what exactly is the format of our predictions during evaluation? During training, we predict the probability of the example being correct. But during evaluation our goal is to score the utterance and 9 distractors and pick the best one – we don't simply predict correct/incorrect. This means that during evaluation each example should result in a vector of 10 scores, e.g. `[0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]`, where the scores correspond to the true response and the 9 distractors respectively. Each utterance is scored independently, so the probabilities don't need to add up to 1. Because the true response is always element 0 in the array, the label for each example is 0. The example above would be counted as classified incorrectly by `recall@1` because the third distractor got a probability of `0.45` while the true response only got `0.34`. It would be scored as correct by `recall@2`.
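This worked example can be checked directly: rank the 10 scores and see where the true response (index 0) lands.

```python
scores = [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]

# Candidate indices sorted by score, best first.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

rank_of_true = ranked.index(0) + 1  # 1-based rank of the true response
print(ranked[:2])    # [3, 0] - the third distractor first, the true response second
print(rank_of_true)  # 2 -> wrong under recall@1, correct under recall@2
```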