A Survey of Available Corpora for Building Data-driven Dialogue Systems

This post is a summary of Serban, et al. "A Survey of Available Corpora for Building Data-Driven Dialogue Systems," which is of increasing relevance given the recent state of conversational AI.

A survey of available corpora for building data-driven dialogue systems
Serban et al. 2015

Bear with me, it’s more interesting than it sounds :). Yes, this (46-page) paper does include a catalogue of data sets with dialogues from different domains, but it also includes a high level survey of techniques that are used in building dialogue systems (aka chatbots). In particular, it focuses on data-driven systems, i.e. those that incorporate some kind of learning from data.

… a wide range of data driven machine learning methods have been shown to be effective in natural language processing, including tasks relevant for dialogue such as dialogue policy learning, dialogue state tracking, and natural language generation.

This particular paper is focused on corpus-based learning where you have been able to build up, or have access to, a data set on which you can train your models. If you want to build a defensible machine learning based business, having access to quality sources of data that your competitors don’t is a good start. Out of scope is training dialogue systems through live interaction with humans – but there are some references to follow on this so I may well return to that topic later on in this mini-series.

Anatomy of a dialogue system

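The standard architecture for a dialogue system is a pipeline of components: a speech recogniser, a natural language interpreter, a dialogue state tracker, a response selection component, a natural language generator, and a speech synthesiser.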
Natural language interpretation and generation are core NLP problems with applications well beyond dialogue systems. For building chatbots, where we assume written input and output, the speech recogniser and synthesiser can be left out. I had naively assumed that if you had a good working system that can deal with textual inputs and outputs, it would be a simple matter of bolting a speech-to-text recogniser in front of the system in order to build a voice-driven assistant. It turns out it’s not quite as simple as that, since the way we speak and the way we write have important differences:

The distinction between spoken and written dialogues is important, since the distribution of utterances changes dramatically according to the nature of the interaction…. Spoken dialogues tend to be more colloquial, use shorter words and phrases and are generally less well-formed, as the user is speaking in a train-of-thought manner. Conversely, in written communication, users have the ability to reflect on what they are writing before they send a message. Written dialogues can also contain spelling errors or abbreviations, which are generally not transcribed in spoken dialogues.

Even written dialogue – for example for movies and plays, and in fictional novels – has apparent distinctions from real speech. Which leads to this wonderful observation: “Nevertheless, recent studies have found that spoken language in movies resembles human spoken language.” As an occasional movie watcher, I had never thought to question that, or that a study might be necessary to demonstrate it!!

Anyway, I digress. Within dialogue systems we can distinguish between goal driven systems – such as travel assistants or technical support services – where the aim is to accomplish some goal or task, and non-goal driven systems such as language learning tools or computer game characters. Most startups building chatbots will be building goal driven systems.

Initial work on goal driven dialogue systems primarily used rule-based systems… with the distinction that machine learning techniques have been heavily used to classify the intention (or need) of the user, as well as to bridge the gap between text and speech. Research in this area started to take off during the mid 90s, when researchers began to formulate dialogue as a sequential decision making problem based on Markov decision processes.

Commercial systems to date are highly domain specific and heavily based on hand-crafted features. “In particular, the datasets are usually constrained to a very small task.”

Discriminative models and supervised learning

Discriminative models, which use supervised learning to predict labels, can be used in many parts of a dialogue system, for example to predict the intent of a user in a dialogue, conditioned on what they have said. Here the intent is the label, and the utterances it is conditioned on are called conditioning variables or inputs.

Discriminative models can be similarly applied in all parts of the dialogue system, including speech recognition, natural language understanding, state tracking, and response selection.

One popular approach is to learn a probabilistic model of the labels; another is to use maximum margin classifiers such as support vector machines. Discriminative models may be trained independently and then ‘plugged in’ to fully deployed dialogue systems.
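To make the "probabilistic model of the labels" idea concrete, here is a minimal sketch of an intent classifier: softmax (multinomial logistic) regression over bag-of-words features, written in plain Python. The training utterances, the intent names and the hyperparameters are all invented for illustration and do not come from the paper.

```python
import math
from collections import defaultdict

# Toy labelled utterances; the sentences and intent names are made up.
TRAIN = [
    ("book a flight to paris", "book_flight"),
    ("i need a flight to rome tomorrow", "book_flight"),
    ("find me a flight to berlin", "book_flight"),
    ("cancel my booking", "cancel"),
    ("please cancel the reservation", "cancel"),
    ("i want to cancel my trip", "cancel"),
    ("what is the weather in paris", "weather"),
    ("will it rain in rome tomorrow", "weather"),
    ("weather forecast for berlin", "weather"),
]

def features(utterance):
    """Bag-of-words counts: the conditioning variables (inputs)."""
    counts = defaultdict(float)
    for token in utterance.lower().split():
        counts[token] += 1.0
    counts["<bias>"] = 1.0
    return counts

def predict_proba(weights, labels, feats):
    """Softmax over per-label linear scores, i.e. P(intent | utterance)."""
    scores = {y: sum(weights[y][f] * v for f, v in feats.items()) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def train(data, epochs=200, lr=0.1):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({y for _, y in data})
    weights = {y: defaultdict(float) for y in labels}
    for _ in range(epochs):
        for utterance, gold in data:
            feats = features(utterance)
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Gradient term: (indicator of gold label - predicted probability).
                error = (1.0 if y == gold else 0.0) - probs[y]
                for f, v in feats.items():
                    weights[y][f] += lr * error * v
    return weights, labels

def predict_intent(weights, labels, utterance):
    """Return the most probable intent and the full distribution."""
    probs = predict_proba(weights, labels, features(utterance))
    return max(probs, key=probs.get), probs

if __name__ == "__main__":
    weights, labels = train(TRAIN)
    print(predict_intent(weights, labels, "please cancel my reservation"))
```

A model like this can be trained on its own and then plugged in as the natural language understanding step. A maximum margin classifier would take the same features and labels but optimise a different objective.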

Answering back

When it comes to choosing what your chatbot is going to say (e.g., in response to a user message) there are again two broad distinctions. The simpler approach is to select deterministically from a fixed set of possible responses (which may of course use parameter substitution):

The model maps the output of the dialogue tracker or natural language understanding modules together with the dialogue history (e.g. previous tracker outputs and previous system actions) and external knowledge (e.g. a database, which can be queried by the system) to a response action.
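As a rough sketch of that mapping, the snippet below picks a response action deterministically from the tracker state, the dialogue history and a small lookup table standing in for the external database, then fills a fixed template by parameter substitution. The action names, templates, slot names and flight data are all hypothetical.

```python
# Hypothetical fixed responses, keyed by system action. Slots in braces are
# filled in by parameter substitution.
TEMPLATES = {
    "ask_destination": "Where would you like to fly to?",
    "ask_date": "What day do you want to fly to {destination}?",
    "offer_flight": "I found a flight to {destination} on {date} with {airline} for {price} EUR.",
    "no_flight": "Sorry, I could not find a flight to {destination} on {date}.",
}

# Stand-in for external knowledge: (destination, date) -> flight record.
FLIGHTS = {
    ("Paris", "Friday"): {"airline": "ExampleAir", "price": 120},
}

def select_action(state, database):
    """Deterministically map the tracked state and database to an action."""
    if "destination" not in state:
        return "ask_destination", {}
    if "date" not in state:
        return "ask_date", {"destination": state["destination"]}
    slots = dict(state)
    flight = database.get((state["destination"], state["date"]))
    if flight is None:
        return "no_flight", slots
    slots.update(flight)
    return "offer_flight", slots

def respond(state, history, database=FLIGHTS):
    """Choose an action, render its template, and record it in the history."""
    action, slots = select_action(state, database)
    text = TEMPLATES[action].format(**slots)
    # Use the dialogue history: apologise if we are repeating ourselves.
    if history and history[-1] == action:
        text = "Sorry, I did not catch that. " + text
    history.append(action)
    return text

if __name__ == "__main__":
    history = []
    print(respond({}, history))
    print(respond({"destination": "Paris"}, history))
    print(respond({"destination": "Paris", "date": "Friday"}, history))
```

Every sentence the system can say is written down in advance, so nothing is generated word by word.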

This approach effectively bypasses the natural language generation part of the system. The fixed responses may have been crafted up-front by the system designers, but there are also systems that effectively search through a database of dialogues and pick the responses from there that have the most similar context:

…the dialogue history and tracker outputs are usually projected into an Euclidean space (e.g. using TF-IDF bag-of-words representations) and a desirable response region is found (e.g. a point in the Euclidean space). The optimal response is then found by projecting all potential responses into the same Euclidean space, and the response closest to the desirable response region is selected.
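A simplified version of this retrieval idea can be sketched in a few lines. It is a simplification of what the quote describes: the code below embeds the stored contexts and the incoming message as TF-IDF vectors and returns the response paired with the most similar context (by cosine similarity), rather than embedding the candidate responses themselves. The tiny corpus is invented, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter

# Invented (context, response) pairs standing in for a dialogue corpus.
CORPUS = [
    ("my wifi keeps dropping every few minutes", "Have you tried restarting the router?"),
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("the screen is black after the update", "Hold the power button for ten seconds to force a reboot."),
    ("i cannot install the printer driver", "Which operating system are you using?"),
]

def tokenize(text):
    return text.lower().split()

def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter(token for doc in documents for token in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; terms never seen in the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {t: count * idf[t] for t, count in tf.items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, corpus=CORPUS):
    """Return (similarity, response) for the stored context closest to the query."""
    idf = build_idf([context for context, _ in corpus])
    query_vec = tfidf(query, idf)
    scored = [(cosine(query_vec, tfidf(context, idf)), response)
              for context, response in corpus]
    return max(scored, key=lambda pair: pair[0])

if __name__ == "__main__":
    print(retrieve("i forgot my password"))
```

A real system would compute the vectors once up front and would usually set a minimum similarity below which it falls back to a default reply.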

More complex chatbots generate their own responses. Using a method known as beam search, they can generate highly probable responses. The approach is similar to that used in the sequence-to-sequence machine translation paper we looked at recently. Short (single request-response) conversations are simpler than those that need to be able to handle multiple interactive turns.
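Beam search itself is easy to sketch. In the toy example below, a hand-written table of next-token probabilities stands in for a trained model (a real generator would get these probabilities from a neural network conditioned on the dialogue so far). The table, the token names and the default parameters are all assumptions made for illustration.

```python
import math

# Hypothetical bigram model: P(next token | previous token).
MODEL = {
    "<s>": {"i": 0.6, "sure": 0.4},
    "i": {"can": 0.55, "am": 0.45},
    "can": {"help": 0.9, "</s>": 0.1},
    "am": {"here": 0.6, "</s>": 0.4},
    "help": {"</s>": 1.0},
    "here": {"</s>": 1.0},
    "sure": {"</s>": 1.0},
}

def beam_search(model, beam_width=2, max_len=6):
    """Keep the beam_width best partial responses at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = []
        for logp, tokens in beams:
            for token, p in model[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Prune to the beam width; completed responses leave the beam.
        beams = []
        for logp, tokens in candidates[:beam_width]:
            if tokens[-1] == "</s>":
                finished.append((logp, tokens))
            else:
                beams.append((logp, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start/end markers and convert back to probabilities.
    return [(math.exp(logp), " ".join(tokens[1:-1])) for logp, tokens in finished]

if __name__ == "__main__":
    print(beam_search(MODEL, beam_width=1))  # greedy decoding
    print(beam_search(MODEL, beam_width=2))
```

On this toy table, greedy decoding (a beam width of 1) commits to "i" first and ends up with "i can help" at probability 0.297, while a beam width of 2 also keeps "sure" alive and finds it to be the more probable complete response at 0.4.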

For example, an interactive system might require steps of clarification from the user before being able to offer pertinent information. Indeed, this is a common scenario: many dialogues between humans, as well as between humans and machines, yield significant ambiguity, which can usually be resolved over the course of the dialogue. This phase is sometimes referred to as the grounding process. To tackle such behaviors, it is crucial to have access to dialogue corpora with long interactions, which include clarifications and confirmations which are ubiquitous in human conversations. The need for such long-term interactions is confirmed by recent empirical results, which show that longer interactions help generate appropriate responses.