Teaching AI to See Like a Human

DeepMind Generative Query Networks can infer knowledge as they navigate a visual environment.

I recently started a new newsletter focus on AI education and already has over 50,000 subscribers. TheSequence is a no-BS (meaning no hype, no news etc) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:



An image is worth a thousand words says the old wisdom quote that captures the importance of visual analysis in the learning process of human species. Every time we are presented with a visual scene, our brains make thousands of inferences about the objects in it and their contextual nature. For instance, if we see a person sitting down, we will infer that there is a chair underneath him. Visual inferences works even when we can’t see the object. For instance, if we see a closet in a bedroom, we would assume there are clothing items inside even when we can’t see them. As a cognitive skill, visual inference is foundational to other abilities such as memory, planning or imagination.

These visual cognitive tasks are effortless to the human brain but are incredibly hard to replicate in artificial intelligence(AI) agents. Today, visual recognition models relied on large datasets of labeled images in order to do basic tasks such as object recognition. Producing those labeled datasets is a labor-intensive process and, more often than not, it fails to capture many contextual aspects of the scene such as light position, the perspective of the viewer or the relationships between objects. The next frontier in AI visual recognition models is to tech AI agents to learn from a visual scene performing inferences similar to how humans do. This is the subject of a 2018 research paper published by DeepMind in the Science Magazine.

Under the title “Natural Scene Representation and Rendering” , DeepMind introduces the concept of Generative Query Networks(GQN), a model that enables the creation of AI agents that can learn from their surroundings as they navigate a visual scene. The GQN model is composed of two neural networks: a representation network and a generation network. The representation network takes as input the agent’s observations and produces a neural scene representation, which encodes information about the underlying scene. Each additional observation accumulates further evidence about the contents of the scene in the same representation. The generation network then predicts the scene from an arbitrary query viewpoint, using stochastic latent variables to create variability in its outputs where necessary. The two networks are trained jointly, in an end-to-end fashion, to maximize the likelihood of generating the ground-truth image that would be observed from the query viewpoint.

Source: Science


The main contribution of the GQN model is based on the fact that the representation network is unaware of which viewpoints the generation network will be queried to predict. Consequently, the representation network will produce scene representations that contain all information (e.g., object identities, positions, colors, counts, and room layout) necessary for the generator network to make accurate predictions. During the training process, the generator network learns about the relationships between typical objects in an environment which allows it to receive the description from the representation network and fill in the details. For instance, the representation network will succinctly represent ‘blue cube’ as a small set of numbers and the generation network will know how that manifests itself as pixels from a particular viewpoint. The following figure illustrates a GQN model analyzing a spatial room with certain objects.

Source: Science


To test the effectiveness of GQNs, the DeepMind team conducted a series of experiments using generated 3D environments with multiple objects in arbitrary positions, colors, shapes and textures. The experiments highlighted some remarkable properties of GQNs. For instance, GQN models shown they were able to account for uncertainly in a visual scene when some of its contents were not truly visible. For instance, the GQN model below expresses its uncertainty through the variability of its predictions, which gradually reduces as it moves around the maze.

Source: Science


Similarly, GQN model were able to “imagine” unobserved scenes from new viewpoints. As shown in the following image, when given a scene representation and new camera viewpoints, a GQN model was able to generate sharp images without any prior specification of the laws of perspective.

Source: Science


GQNs are a very novel technique that still has many limitations compared to traditional deep learning image analysis methods. However, GQNs showed that is possible for AI agents to perceive, interpret, and represent synthetic scenes without any human labeling of the contents of these scenes which represents a major breakthrough in the evolution of image analysis methods.

Original. Reposted with permission.