A Survey of Available Corpora for Building Data-driven Dialogue Systems

This post is a summary of Serban et al., "A Survey of Available Corpora for Building Data-Driven Dialogue Systems," which is of increasing relevance given the recent state of conversational AI.

Incorporating external knowledge

Chatbots may rely on more than just dialogue corpora for training. When building a goal-driven dialogue system for movies, Dodge et al. (“Evaluating prerequisite qualities for learning end-to-end dialog systems”) identify four tasks that a working dialogue system should be able to perform: question answering, recommendation, question answering with recommendation, and casual conversation. They use four different subsets of data to train models for these tasks: “a QA dataset from the Open Movie Database (OMDb) of 116k examples with accompanying movie and actor metadata in the form of knowledge triples; a recommendation dataset from MovieLens with 110k users and 1M questions; a combined recommendation and QA dataset with 1M conversations of 6 turns each; and a discussion dataset from Reddit’s movie subreddit.”
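To make "knowledge triples" concrete, here is a minimal sketch of how movie metadata in (subject, relation, object) form could back a simple question-answering lookup. The relation names, the triples, and the helper functions are all illustrative assumptions, not the format or method used by Dodge et al.

```python
from collections import defaultdict

# Hypothetical toy triples in (subject, relation, object) form.
# The relation names are made up for illustration, not taken from the dataset.
TRIPLES = [
    ("Blade Runner", "directed_by", "Ridley Scott"),
    ("Blade Runner", "starred_actors", "Harrison Ford"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "directed_by", "Ridley Scott"),
    ("Alien", "starred_actors", "Sigourney Weaver"),
]


def build_index(triples):
    """Index objects by (subject, relation) so a query is a dictionary access."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def answer(index, subject, relation):
    """Return every known object for the query, or an empty list if none is known."""
    return index.get((subject, relation), [])


INDEX = build_index(TRIPLES)
print(answer(INDEX, "Blade Runner", "directed_by"))
print(answer(INDEX, "Alien", "release_year"))
```

A real system would still need to map a natural-language question onto a (subject, relation) pair, which is the hard part that the learned models address.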


Using external information is usually of great importance to dialogue systems, especially goal-driven ones. This could include structured information – such as bus or train timetables for answering questions about public transport – typically contained in relational databases or similar. It’s also possible to take advantage of structured external knowledge from general natural language processing databases and tools. Some good sources include:

  • WordNet, with lexical relationships between words for over a hundred thousand words,
  • VerbNet, with lexical relationships between verbs, and
  • FrameNet, which contains ‘word senses’ for over ten thousand words.

Tools include part-of-speech taggers, word category classifiers, word embedding models, named entity recognition models, semantic role labelling models, semantic similarity models, and sentiment analysis models.

If you’re building a new application and don’t have ready access to large corpora for training, you may also be able to transfer learning from related datasets to bootstrap the learning process. “Indeed, in several branches of machine learning, and in particular in deep learning, the use of related datasets in pre-training the model is an effective method of scaling up to complex environments.”

An example of this approach in action is the work of Forgues et al. on dialogue act classification (classifying a user utterance as one of k dialogue acts).

They created an utterance-level representation by combining the word embeddings of each word, for example, by summing the word embeddings or taking the maximum w.r.t. each dimension. These utterance-level representations, together with word counts, were then given as inputs to a linear classifier to classify the dialogue acts. Thus, Forgues et al. showed that by leveraging another, substantially larger, corpus they were able to improve performance on their original task.
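The feature construction described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the three-dimensional embeddings, the vocabulary, and the hand-set weights are all hypothetical, and in practice the embeddings would be pre-trained on the larger corpus and the classifier weights learned from labelled dialogue acts.

```python
from collections import Counter

# Hypothetical 3-dimensional embeddings; real ones would be pre-trained
# on a much larger corpus than the dialogue act dataset itself.
EMBEDDINGS = {
    "what": [0.9, 0.1, 0.0],
    "time": [0.2, 0.8, 0.1],
    "is": [0.1, 0.1, 0.1],
    "it": [0.0, 0.2, 0.1],
    "thanks": [0.0, 0.1, 0.9],
}
VOCAB = sorted(EMBEDDINGS)
DIM = 3


def utterance_features(utterance, pooling="sum"):
    """Pool the word embeddings (sum or per-dimension max), then append word counts."""
    words = utterance.lower().split()
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        pooled = [0.0] * DIM
    elif pooling == "sum":
        pooled = [sum(dim) for dim in zip(*vectors)]
    elif pooling == "max":
        pooled = [max(dim) for dim in zip(*vectors)]
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    counts = Counter(words)
    return pooled + [float(counts[w]) for w in VOCAB]


def predict(features, weights, biases):
    """Linear classifier: pick the dialogue act with the highest w . x + b."""
    scores = {
        act: sum(w * x for w, x in zip(weights[act], features)) + biases[act]
        for act in weights
    }
    return max(scores, key=scores.get)


# Hand-set weights purely for illustration; a real classifier would learn them.
WEIGHTS = {
    "question": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "thanking": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
BIASES = {"question": 0.0, "thanking": 0.0}

print(predict(utterance_features("what time is it", "max"), WEIGHTS, BIASES))
```

The point of the design is that the pooled embedding carries information learned elsewhere, so the classifier can generalise to words that are rare in the small labelled set.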

What were you saying?

Tracking the state of a conversation is a whole sub-genre of its own, which goes by the name of dialogue state tracking (DST) and has its own benchmark series, the Dialog State Tracking Challenge (DSTC). It is framed as a classification problem: given the current input to the dialogue state tracker plus any relevant external knowledge from other sources (e.g. the timetable information from our previous example), the goal is to output a probability distribution over a set of predefined hypotheses, plus a special ‘REST’ hypothesis which captures the probability that none of the others are correct. For example, the system may believe with high confidence that the user has requested timetable information for the current day. DST models include both statistical approaches and hand-crafted systems. “More sophisticated models take a dynamic Bayesian approach by modeling the latent dialogue state and observed tracker outputs in a directed graphical model… Non-bayesian data-driven models have also been proposed.”
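As a rough sketch of the output a tracker produces, the snippet below turns raw scores for a handful of hypotheses into a probability distribution that also includes the ‘REST’ hypothesis. The softmax normalisation, the fixed REST score, and the hypothesis names are assumptions made for illustration; actual trackers compute these probabilities in many different ways.

```python
import math


def track_state(scores, rest_score=0.0):
    """Turn raw hypothesis scores into a distribution that includes REST.

    `scores` maps hypothesis names to unnormalised scores. `rest_score` is an
    assumed fixed score for "none of the listed hypotheses is correct".
    """
    all_scores = dict(scores)
    all_scores["REST"] = rest_score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(all_scores.values())
    exps = {h: math.exp(s - top) for h, s in all_scores.items()}
    total = sum(exps.values())
    return {h: e / total for h, e in exps.items()}


# Hypothetical scores for two timetable hypotheses.
belief = track_state({"timetable_today": 2.5, "timetable_tomorrow": 0.5})
for hypothesis, probability in sorted(belief.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.3f}")
```

With no hypotheses at all, the whole probability mass falls on REST, which matches its role as the catch-all.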

Longer term memories

We recently looked at Memory Networks and Neural Turing Machines, which can store some part of their input in a memory and use it to perform a variety of tasks.

Although none of these models are explicitly designed to address dialogue problems, the extension by Kumar et al. to Dynamic Memory Networks specifically differentiates between episodic and semantic memory. In this case, the episodic memory is the same as the memory used in the traditional Memory Networks paper which is extracted from the input, while the semantic memory refers to knowledge sources that are fixed for all inputs. The model is shown to work for a variety of NLP tasks, and it is not difficult to envision an application to dialogue utterance generation where the semantic memory is the desired external knowledge source.
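The episodic/semantic distinction can be illustrated with a toy soft-attention read over two memories. This is only a sketch of the idea, not the Dynamic Memory Network architecture: the two-dimensional vectors, the dot-product attention, and the function names are all assumptions, and both memories are assumed to be non-empty.

```python
import math


def attention_read(query, memory):
    """Soft read: softmax over dot-product similarities, then a weighted sum of slots."""
    sims = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    read = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]
    return weights, read


# Semantic memory: hypothetical external knowledge vectors, fixed for all inputs.
SEMANTIC_MEMORY = [[1.0, 0.0], [0.0, 1.0]]


def build_context(query, episodic_memory):
    """Concatenate a read from the per-input episodic memory with a semantic read."""
    _, episodic_read = attention_read(query, episodic_memory)
    _, semantic_read = attention_read(query, SEMANTIC_MEMORY)
    return episodic_read + semantic_read


# Episodic memory: vectors extracted from the current input (made up here).
episodic = [[0.5, 0.5], [2.0, 0.0]]
print(build_context([1.0, 0.0], episodic))
```

In a dialogue setting, the episodic slots would come from the conversation so far, while the semantic slots would encode the external knowledge source.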


As if all of the above wasn’t hard enough, you may also want your bot to exhibit some kind of consistent personality. In fact, say the authors, “attaining human-level performance with dialogue agents may well require personalization.”

We see personalization of dialogue systems as an important task, which so far has been mostly untouched.

The Last Word

There’s plenty more detail and several additional topics in the original paper, which I have skipped over. If this topic interests you, it’s well worth checking out. I’ll leave you with the following closing thought:

There is strong evidence that over the next few years, dialogue research will quickly move towards large-scale data-driven model approaches, in particular in the form of end-to-end trainable systems as is the case for other language-related applications such as speech recognition, machine translation and information retrieval… While in many domains data scarcity poses important challenges, several potential extensions, such as transfer learning and incorporation of external knowledge, may provide scalable solutions.

Bio: Adrian Colyer was CTO of SpringSource, then CTO for Apps at VMware and subsequently Pivotal. He is now a Venture Partner at Accel Partners in London, working with early stage and startup companies across Europe. If you’re working on an interesting technology-related business he would love to hear from you: you can reach him at acolyer at accel dot com.

Original. Reposted with permission.