Explainable Visual Reasoning: How MIT Builds Neural Networks that can Explain Themselves

New MIT research attempts to close the gap between state-of-the-art performance and interpretable models in computer vision tasks.

Source: https://datainnovation.org/2019/03/improving-visual-reasoning-in-ai-systems/



Interpretability is one of the biggest challenges facing deep learning solutions in the real world. Even a basic deep learning model can contain dozens of hidden layers and millions of neurons interacting with each other. Additionally, the structure of a network can change as it builds up new knowledge. Tracing and explaining the specifics of how a deep neural network arrives at a particular decision has proven to be almost impossible in many scenarios. In 2018, researchers from the MIT Lincoln Laboratory’s Intelligence and Decision Technologies Group published a paper in which they proposed a method that can perform complex, human-like reasoning about images in an interpretable manner.


Interpretability vs. Accuracy

One of the reasons interpretability is so challenging in deep learning models is that it often comes at the cost of accuracy. In many contexts, the trade-off between accuracy and interpretability is one of the most important balances to strike in deep neural network architectures.

Many deep learning techniques are complex in nature and, although they are very accurate in many scenarios, they can become incredibly difficult to interpret. Plotting some of the best-known deep learning models on a chart of accuracy against interpretability illustrates this trade-off.


Transparency by Design Network

To validate some of their ideas about how to improve interpretability, the MIT researchers focused on a visual question answering (VQA) scenario in which a model must be capable of performing complex spatial reasoning over an image. When confronted with a question such as “What color is the cube to the right of the large metal sphere?”, a model must identify which sphere is the large metal one, understand what it means for an object to be to the right of another, and apply this concept spatially to the attended sphere. The MIT team proposed an approach based on modular neural networks in which the visual reasoning task is decomposed into a small set of primitives that facilitate the interpretability of the model.
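
To make the decomposition concrete, here is a minimal Python sketch of how that question could be expressed as a chain of primitives. The scene representation, the `attend`, `relate_right`, and `query` helpers, and the attribute names are all hypothetical simplifications for illustration. TbD-Net itself operates on learned image features and attention heatmaps, not on symbolic object lists.

```python
# Toy symbolic scene: each object has attributes and an x position.
# This representation is an assumption made for illustration only.
SCENE = [
    {"shape": "sphere", "size": "large", "material": "metal", "color": "gray", "x": 2},
    {"shape": "sphere", "size": "small", "material": "rubber", "color": "red", "x": 5},
    {"shape": "cube", "size": "small", "material": "rubber", "color": "blue", "x": 6},
    {"shape": "cube", "size": "large", "material": "metal", "color": "green", "x": 0},
]


def attend(scene, attended, key, value):
    """Keep only attended objects whose attribute matches (stand-in for Attention)."""
    return {i for i in attended if scene[i][key] == value}


def relate_right(scene, attended):
    """Attend to objects to the right of an attended object (stand-in for Relate)."""
    if not attended:
        return set()
    leftmost = min(scene[i]["x"] for i in attended)
    return {i for i, obj in enumerate(scene) if obj["x"] > leftmost}


def query(scene, attended, key):
    """Read an attribute from the single attended object (stand-in for Query)."""
    (index,) = attended  # raises ValueError unless exactly one object is attended
    return scene[index][key]


def answer_color_question(scene):
    """'What color is the cube to the right of the large metal sphere?'"""
    mask = set(range(len(scene)))  # analogous to an all-ones initial attention
    mask = attend(scene, mask, "size", "large")
    mask = attend(scene, mask, "material", "metal")
    mask = attend(scene, mask, "shape", "sphere")
    mask = relate_right(scene, mask)
    mask = attend(scene, mask, "shape", "cube")
    return query(scene, mask, "color")


print(answer_color_question(SCENE))
```

Each intermediate `mask` can be inspected on its own, which is the property that makes this style of model easier to audit than a single monolithic network.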

Called the Transparency by Design network (TbD-Net), the MIT model builds on the idea that breaking a complex chain of reasoning into a series of smaller subproblems, each of which can be solved independently and then composed, is a powerful and intuitive means of reasoning. Specifically, TbD-Net abstracts any visual reasoning task into seven highly specialized visual reasoning primitives (And and Or are listed together below):

  • Attention: The Attention module takes as input image features and a previous attention to refine (or an all-one tensor if it is the first Attention in the network) and outputs a heatmap of dimension 1 x H x W corresponding to the objects of interest.
  • And-Or: These two modules combine two attention masks using set intersection (And) and set union (Or), respectively.
  • Relate: This module attends to a region that has some spatial relation to another region. For example, in the question “What color is the cube to the right of the small sphere?”, the network should determine the position of the small sphere using a series of Attention modules, then use a Relate module to attend to the region that is spatially to the right.
  • Query: This module extracts a feature from an attended region of an image. For example, a Query module would determine the color of an object of interest.
  • Same: This module attends to a region, extracts a relevant property from that region, and attends to every other region in the image that shares that property. As an example, when answering the question “Is anything the same color as the small cube?”, the network should localize the small cube via Attention modules, then use a Same module to determine its color and output an attention mask localizing all other objects sharing that color.
  • Compare: This module compares the properties output by two Query modules and produces a feature map which encodes whether the properties are the same. This module would be used to answer the question “Are the cube and the metal sphere the same size?”.
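
The And and Or modules are the simplest of these to illustrate. The following is a minimal sketch, assuming attention masks are H x W grids of values in [0, 1] and that intersection and union are realized as the element-wise minimum and maximum, which is a common choice for soft masks. The helper names `combine`, `and_module`, and `or_module` are hypothetical.

```python
def combine(mask_a, mask_b, op):
    """Apply op element-wise to two masks of identical shape."""
    same_rows = len(mask_a) == len(mask_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)):
        raise ValueError("masks must have the same shape")
    return [
        [op(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


def and_module(mask_a, mask_b):
    """Soft intersection: a cell stays hot only if it is hot in both masks."""
    return combine(mask_a, mask_b, min)


def or_module(mask_a, mask_b):
    """Soft union: a cell is hot if it is hot in either mask."""
    return combine(mask_a, mask_b, max)


# Two made-up 2 x 3 masks, e.g. "metal things" and "spheres".
metal = [[0.9, 0.1, 0.8],
         [0.0, 0.0, 0.1]]
spheres = [[0.9, 0.7, 0.1],
           [0.0, 0.1, 0.0]]

print(and_module(metal, spheres))  # high only where both masks agree
print(or_module(metal, spheres))   # high where either mask is high
```

Because neither operation has parameters, nothing needs to be learned here, and the output of each is itself a mask that can be visualized like any other.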

The following figure illustrates how the different visual reasoning primitives can be assembled to answer complex questions about an image. As you can see, every step of the model can be explained with a visual representation. The TbD-Net team called these visual representations “attention masks”.

Source: https://arxiv.org/abs/1803.05268



The key innovation that makes TbD-Net more interpretable than traditional modular neural networks is that it explicitly composes visual attention masks to arrive at an answer. Each module’s output is depicted visually in what the group calls an “attention mask.” The attention mask shows heat-map blobs over objects in the image that the module is identifying as its answer. These visualizations let the human analyst see how a module is interpreting the image.
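
As a rough illustration of what inspecting an attention mask can look like, the sketch below renders a small grid of attention values as text, with denser characters standing for stronger attention. The `render_mask` function, the character ramp, and the sample values are all made up for this example. The paper's figures overlay heatmaps on the actual image instead.

```python
def render_mask(mask, ramp=" .:*#"):
    """Map each attention value in [0, 1] to a character; denser means stronger."""
    lines = []
    for row in mask:
        chars = []
        for value in row:
            clipped = min(max(value, 0.0), 1.0)  # guard against out-of-range values
            index = min(int(clipped * len(ramp)), len(ramp) - 1)
            chars.append(ramp[index])
        lines.append("".join(chars))
    return "\n".join(lines)


# Hypothetical 3 x 4 attention mask with a blob in the lower middle.
mask = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.7, 0.9, 0.3],
    [0.0, 0.5, 1.0, 0.3],
]
print(render_mask(mask))
```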

Source: https://arxiv.org/abs/1803.05268


The MIT team tested TbD-Net using the CLEVR dataset generator to produce a dataset consisting of 70,000 training images and 700,000 questions, along with test and validation sets of 15,000 images and 150,000 questions each. The initial model achieved 98.7 percent test accuracy on the dataset, which, according to the researchers, far outperforms other neural module network–based approaches.
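
To put those figures in perspective, here is a quick back-of-the-envelope calculation. It assumes the reported accuracy is measured over all 150,000 test questions, which the article does not state explicitly, and the variable names are only for illustration.

```python
train_images, train_questions = 70_000, 700_000
test_questions = 150_000
accuracy_per_mille = 987  # 98.7 percent, kept as an integer to avoid float rounding

questions_per_image = train_questions // train_images
correct = test_questions * accuracy_per_mille // 1000
wrong = test_questions - correct

print(questions_per_image, correct, wrong)
```

Under that assumption, the model sees about ten questions per training image and misses fewer than two thousand of the test questions.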

TbD-Net showed that it is possible to build neural networks that are both highly performant and readily interpretable. The key advantage of the model is that it decomposes a visual reasoning problem into a series of primitives that produce “attention masks” which can be easily interpreted by human analysts. While TbD-Net is focused on visual reasoning systems, the ideas behind the transparency by design approach could be extended to other neural network models.

Original. Reposted with permission.