Three Impactful Machine Learning Topics at ICML 2016

This post covers three notable tutorial sessions from the recent ICML 2016 conference held in New York. Check out some innovative ideas on Deep Residual Networks, Memory Networks for Language Understanding, and Non-Convex Optimization.

Memory Networks for Language Understanding

Jason Weston motivated building an end-to-end dialog agent. He detailed a simple model that makes headway toward this goal: Memory Networks. He provided a means of testing this model against a set of toy benchmarks, which he described as an escalating sequence of tasks. He showed a revised memory network model that learns end-to-end without explicitly supervised attention. He gave real-world datasets where memory networks do well and where they do poorly. He described a way to scale efficiently to large datasets. He presented two revisions: one using key-value pairs and another learning from textual feedback. Finally, he posed questions to motivate future research.

First, Jason introduced a set of beliefs describing an ideal dialog agent. It should use all its knowledge to perform complex tasks. It should converse at length and understand the motives underlying the dialog. It should be able to grow its capabilities while conversing. It should learn end-to-end.

Next, he introduced Memory Networks (MemNNs). Memory Networks combine inputs with attention on memories to provide reasoned outputs. He limited the first iteration’s scope to be as simple as possible. It consists of a recurrent controller module that accepts an initial query. To start, its memory is loaded with a set of facts. The query and facts are bag-of-words vectors. The controller predicts an attention vector (trained with a supervision signal) to choose a fact. It reads the chosen memory to update its hidden state. After several repetitions, or hops, it formulates an output. The output ranks possible responses from a dictionary of words. Error signals back-propagate through the network via the output and the supervised attention episodes.
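The read loop described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it scores memories by raw word overlap instead of learned embeddings, it has no trained output layer, and it skips memories that were already read. The function names (bag_of_words, read_memory) and the example facts are hypothetical, not taken from the slides.

```python
from collections import Counter

def bag_of_words(sentence):
    # Represent a sentence as word counts (an untrained stand-in for an embedding).
    return Counter(sentence.lower().split())

def dot(a, b):
    # Dot product of two bag-of-words vectors.
    return sum(count * b[word] for word, count in a.items())

def read_memory(query, facts, hops=2):
    # Hard attention: on each hop, pick the single best-matching unread fact.
    memory = [bag_of_words(fact) for fact in facts]
    state = bag_of_words(query)
    chosen = []
    for _ in range(hops):
        scores = [-1 if i in chosen else dot(state, m)
                  for i, m in enumerate(memory)]
        best = max(range(len(memory)), key=scores.__getitem__)
        chosen.append(best)
        # Update the controller state with the memory that was read.
        state = state + memory[best]
    return [facts[i] for i in chosen]

facts = [
    "john picked up the ball",
    "bob went to the office",
    "john went to the kitchen",
]
supporting = read_memory("where is the ball", facts)
print(supporting)
```

The first hop matches the fact about the ball; the second hop, now carrying "john" in its state, reaches the fact about the kitchen. A trained model would learn the scoring and would rank answer words from the final state.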

[Figure: Memory Networks, from slide 9.]

He described a set of toy benchmarks of increasing complexity. Each benchmark consists of a set of short stories. Each story is a sequence of statements about an evolving situation. The model should read a single story and answer one or more questions about it. Within a benchmark, the stories test the same skill. Across the different benchmarks, the skills get more difficult.

John was in the bedroom.

Bob was in the office.

John went to the kitchen.

Bob travelled back home.

Where is John? A: kitchen

(Example from slide 11.)

The benchmarks are:

  • Factoid question/answer with single supporting fact
  • Factoid QA with two supporting facts
  • Factoid QA with three supporting facts
  • Two argument relations: subject versus object
  • Three argument relations
  • Yes/no questions
  • Counting
  • Lists/sets
  • Simple negation
  • Indefinite knowledge
  • Basic coreference
  • Conjunction
  • Compound coreference
  • Time manipulation
  • Basic deduction
  • Basic induction
  • Positional reasoning
  • Reasoning about size
  • Path finding
  • Reasoning about agent’s motivation

A revised model, the End-to-end Memory Network (MemN2N), learns without attention supervision. It uses soft attention (a probability vector) to read the memory. Thus, it is fully differentiable and can learn from output supervision alone. The newer model still fails on some toy benchmark tasks. Yet it succeeds on several real-world benchmarks, such as children’s books and news question sets.
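A soft-attention read can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the memory vectors are hand-written placeholders rather than learned embeddings, and the same vectors serve for both matching and reading, whereas MemN2N learns separate input and output embeddings. The names soft_read and hop are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax: turns scores into a probability vector.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_read(state, memories):
    # Attention weights come from the match between the state and each memory.
    scores = [sum(s * m for s, m in zip(state, mem)) for mem in memories]
    weights = softmax(scores)
    # The read vector is the attention-weighted sum of all memories.
    read = [sum(w * mem[d] for w, mem in zip(weights, memories))
            for d in range(len(state))]
    return weights, read

def hop(state, memories):
    # One hop: add the read vector to the controller state.
    weights, read = soft_read(state, memories)
    return [s + r for s, r in zip(state, read)], weights

memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
state, weights = hop([0.0, 2.0, 0.0], memories)
print(weights, state)
```

Because every step is a smooth function of the inputs, gradients from the output can flow back through the attention weights, which is why no attention labels are needed.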

Another revision, the Key-Value Memory Network, splits each memory cell into two parts. The first part is a lookup key used to match the incoming state vector. The second is a value combined with attention to produce the read value. Key-Value MemNNs closely match state-of-the-art on some real-world question-answering datasets.

Finally, a third revision learns only through textual feedback. It learns to predict the response of a “teacher” agent that provides feedback in words. Mismatches between predicted and actual feedback provide a training signal to the model.
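The key-value split described above can be illustrated with a short sketch. The assumptions are loud ones: keys and values are hand-picked vectors rather than learned embeddings, and key_value_read is a hypothetical name. The point is only that keys decide where to attend while values decide what is returned, so the two can have different sizes.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def key_value_read(state, keys, values):
    # Match the state against the keys to get attention weights...
    scores = [sum(s * k for s, k in zip(state, key)) for key in keys]
    weights = softmax(scores)
    # ...then return the attention-weighted sum of the values.
    width = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values))
            for d in range(width)]

keys = [[1.0, 0.0], [0.0, 1.0]]               # two-dimensional lookup keys
values = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # three-dimensional values
read = key_value_read([0.0, 3.0], keys, values)
print(read)
```

Here the state matches the second key, so the read vector is dominated by the second value.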

[Figure: Memory Networks, from slide 86.]

Explore the papers, code and datasets in slide 87. Find questions for future research in slide 10, slide 83 and slide 88.

Non-Convex Optimization

Anima Anandkumar covered methods that achieve guaranteed global optimization for non-convex problems. Machine learning problems are optimization problems, often non-convex. However, non-convex problems can have an exponential number of critical points. Among these, saddle points impede the progress of gradient descent and Newton’s method. She detailed conditions that define different types of critical points. She gave algorithms that escape the saddle points of well-behaved functions to find local optima. Such well-behaved functions are twice-differentiable and have non-degenerate saddle points. Stochastic gradient descent and Hessian-based methods can escape these saddle points efficiently. She showed how higher-order critical points impede the progress of these algorithms. She detailed specific problems for which global optima can be reached: matrix eigen-analysis and orthogonal tensor decomposition.
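A toy example shows why noise helps at a saddle point. The function below is an assumption chosen for illustration, not one from the tutorial: f(x, y) = x^2 + y^4/4 - y^2/2 has a non-degenerate saddle at (0, 0), where the Hessian has eigenvalues 2 and -1, and local minima at (0, 1) and (0, -1). Plain gradient descent started exactly on the saddle never moves, while adding a little Gaussian noise to each gradient, a crude stand-in for stochastic gradient descent, lets it slide off toward a minimum. The step size, noise level and function names are all hypothetical.

```python
import random

def gradient(x, y):
    # Gradient of f(x, y) = x**2 + y**4 / 4 - y**2 / 2.
    return 2.0 * x, y ** 3 - y

def descend(x, y, steps=500, rate=0.1, noise=0.0, seed=0):
    # Gradient descent with optional Gaussian noise added to each gradient.
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= rate * (gx + rng.gauss(0.0, noise))
        y -= rate * (gy + rng.gauss(0.0, noise))
    return x, y

stuck = descend(0.0, 0.0)                # the gradient is zero, so nothing moves
escaped = descend(0.0, 0.0, noise=0.01)  # noise pushes the iterate off the saddle
print(stuck, escaped)
```

The noisy run ends close to one of the two minima; which one depends on the random seed.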


She showed that tensor decomposition can replace popular machine learning methods that use maximum likelihood: document topic modeling, convolutional dictionary models, fast text embeddings and neural network training. She gave steps for future research on slide 87. If these methods interest you further, read this detailed post at offconvex from her research group.
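For the orthogonal tensor decomposition mentioned in this section, a tensor power iteration can be sketched as follows. This is a minimal sketch under stated assumptions: the tensor is built by hand from three orthonormal vectors with weights 3, 2 and 1, it is small enough for plain nested loops, and the function names are hypothetical. It is the tensor analogue of the matrix power method and recovers one component at a time.

```python
import math

def build_tensor(weights, vectors):
    # T = sum_i weight_i * (v_i outer v_i outer v_i), with orthonormal v_i.
    n = len(vectors[0])
    T = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for lam, v in zip(weights, vectors):
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    T[i][j][k] += lam * v[i] * v[j] * v[k]
    return T

def power_iteration(T, u, steps=20):
    # Repeat u <- T(I, u, u) / ||T(I, u, u)|| until it settles on a component.
    n = len(u)
    norm = 0.0
    for _ in range(steps):
        w = [sum(T[i][j][k] * u[j] * u[k]
                 for j in range(n) for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    return u, norm  # recovered component and its weight

r = 1.0 / math.sqrt(2.0)
vectors = [[r, r, 0.0], [r, -r, 0.0], [0.0, 0.0, 1.0]]
T = build_tensor([3.0, 2.0, 1.0], vectors)
component, weight = power_iteration(T, [1.0, 0.2, 0.1])
print(component, weight)
```

From this starting point the iteration converges to the first component. In practice the recovered component would be subtracted from the tensor (deflation) and the procedure repeated from new random starts to find the rest.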

Bio: Robert Dionne is a software developer working on backend services and deep learning infrastructure. If you’re interested in learning more about conversational interfaces, follow him on Medium and Twitter. And if you’re looking to create a conversational interface for your app, service, or company, check out

Original. Reposted with permission.