Multimodal Grounded Learning with Vision and Language

How to give AI models human-like capabilities: to communicate, to ground, and to learn from language.



Hi everyone! My name is Bogdan Ponomar, and I’m CEO at AI HOUSE, an AI community in Ukraine that brings together talent through dozens of initiatives, primarily educational. We are part of the Roosh tech ecosystem.

In August 2022, we launched a new educational project, “AI for Ukraine”: a series of workshops and lectures held by international artificial intelligence experts to support the development of Ukraine’s tech community. Leading international AI/ML experts participated in this charity project, and we decided to share synopses of the most insightful and engaging AI for Ukraine sessions.

The first synopsis in the series was devoted to the lecture by Yoshua Bengio on "Bridging the gap between current deep learning and human higher-level cognitive abilities". You can read it here.

The next topic I’m inviting you to delve into is "Multimodal Grounded Learning with Vision and Language" delivered by Anna Rohrbach, Research Scientist at UC Berkeley.

Let's start by acknowledging that humans use a variety of modalities, most notably vision and language, to perceive their environment and interact with one another. Describing what we observe to one another is a fundamental human ability. We have a shared reality, which is essential for mutual understanding because we ground concepts in the world around us.

In most cases, language is also used by humans to transfer knowledge about new things.

Thus, we can learn through language alone. In her lecture, Anna Rohrbach discussed how to give AI models similar capabilities: communication, grounding, and learning from language. Although classical vision and language tasks, notably visual captioning, have advanced significantly, AI models still sometimes struggle. One of the most challenging aspects of multimodal learning is precisely grounding, that is, accurately mapping language concepts onto visual observations. A lack of proper grounding may lead models to exhibit bias or to hallucinate.

Furthermore, even models that are able to communicate and ground may still require human "advice," or learn how to act more like humans. Language is increasingly being used to improve visual models by allowing zero-shot capabilities, improving generalization, minimizing bias and so forth. Anna Rohrbach is particularly interested in developing models that may use language advice to improve their behavior. In the talk, Anna covered how she and other researchers in the field try to achieve the aforementioned capabilities and the challenges as well as exciting opportunities that lie ahead.

AI is already everywhere, impacting all aspects of our lives from healthcare and assistive technology to self-driving and smart homes. Going forward, interactions with AI will become more human-like: we will communicate with AI more and build trust in it.

In our multimodal world, the two primary modes of interaction are vision and language. Human-to-human interaction takes the following forms:

  • Communication about what you see (e.g., Look, a blue jay is sitting on a branch)
  • Grounding concepts to the common reality
  • Learning from language (e.g., Did you know, blue jays like eating acorns?)

These three points are the key priorities for human-AI interaction. Researchers approach this with visual captioning, which includes assisting visually impaired users and generating explanations (for example, a blue jay is a bird that is blue above and white below, with a large crest and a black necklace). The explanations should be faithful and introspective.




The task of visual grounding includes the localization of nouns or short phrases (e.g., a black beak, a blue wing). Recently, there’s been much interest in scaling it up. 

Models often fail to align the two modalities the way we expect. We expect them to relate the right words to the image, but in practice this is not always the case. For example, the system may say “A bird is sitting on a branch” when there is no branch in the picture. This is a failure of grounding, and it can sometimes even harm the user.
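
A grounding failure like the missing branch can be illustrated with a very small check. The sketch below is only an illustration, not a method from the lecture: it assumes we already have a set of objects detected in the image and a fixed object vocabulary (both hypothetical inputs), and it flags vocabulary words that the caption mentions but the image does not contain.

```python
def hallucinated_objects(caption, detected_objects, vocabulary):
    """Return vocabulary objects that the caption mentions but the image lacks."""
    # Lowercase each word and strip surrounding punctuation.
    words = {w.strip(".,!?\"'").lower() for w in caption.split()}
    mentioned = words & set(vocabulary)
    return sorted(mentioned - set(detected_objects))


# Hypothetical inputs: a tiny object vocabulary and the output of some detector.
vocabulary = {"bird", "branch", "fence", "water"}
detected = {"bird", "fence"}

print(hallucinated_objects("A bird is sitting on a branch.", detected, vocabulary))
# prints ['branch']
```

A real system would need synonyms, plurals, and a reliable detector; the point is only that grounding can be checked by comparing what is said with what is seen.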

Anna suggests using language as a source of knowledge for AI models, as we humans do. We could leverage language for zero-shot learning. This approach is not new: for instance, a model can learn about a new species from a list of its named attributes. Recently, the approach has been pushed forward by the arrival of large models.

Another way to achieve grounding is advisable learning. We learn not only from experience or by doing things; we can also learn by reading and listening to other people. Similarly, language advice can be used to correct undesired behavior in AI models. For example, given a picture of a bird flying over water, the machine might be confused about the type of bird, so we can tell it to look at the bird, not the water. That way, the machine will not mistake it for a duck, for example.

In short, we need to build AI models that can communicate, are grounded, and learn from language.

Anna Rohrbach structured the lecture around three topics:

  • Communication needs grounding
  • Grounded communication with advisability
  • Learning from language for improving visual robustness and transfer


Communication Needs Grounding


Take a picture of a person riding a snowboard: the machine will most likely refer to the person as “he”. Captioning models mention gender even when humans would not. Models not only capture the imbalance in the data (there is more data about men), they amplify it. For example, for a picture of a woman sitting at a desk with a laptop, the model still describes the person as a “man”. The model does not even look at the person; it sees a monitor and, based on learned correlations, assumes a man. In the example of a man holding a tennis racket, the model gets the gender right, but by looking at the racket rather than the person. In other words, the model does not attend to the person when describing their gender.

Anna and other researchers work on overcoming this problem by shifting the model’s attention to the person. To do this, they combine the usual caption correctness loss with two losses focused on gender words: a confidence loss and an appearance confusion loss. The result is fairer model behavior, with a similarly low error rate for both men and women.

The Baseline-FT and UpWeight models produced the caption “A man and a dog are in the snow”.

The Equalizer model is more careful about gender mistakes: when it could not clearly recognize the gender in the picture, it produced “A person walking a dog on a leash”.
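
To make the two gender-focused losses more concrete, here is a minimal sketch. The article does not give their formulas, so everything below is an assumption for illustration: the appearance confusion term is taken to penalize the gap between the probability mass on man-words and woman-words when the person is not visible, and the confidence term is taken to penalize probability on the wrong gender relative to the right one when the person is visible. The word lists and function names are hypothetical.

```python
# Hypothetical gender word lists; a real model would use a larger vocabulary.
MAN_WORDS = {"man", "he", "boy"}
WOMAN_WORDS = {"woman", "she", "girl"}


def gender_mass(word_probs):
    """Sum the predicted probability assigned to man-words and to woman-words."""
    man = sum(p for w, p in word_probs.items() if w in MAN_WORDS)
    woman = sum(p for w, p in word_probs.items() if w in WOMAN_WORDS)
    return man, woman


def appearance_confusion(word_probs):
    """Assumed form: without visual evidence, both genders should be equally likely."""
    man, woman = gender_mass(word_probs)
    return abs(man - woman)


def confidence_penalty(word_probs, true_gender, eps=1e-6):
    """Assumed form: with visual evidence, mass on the wrong gender is penalized."""
    man, woman = gender_mass(word_probs)
    wrong, right = (man, woman) if true_gender == "woman" else (woman, man)
    return wrong / (right + eps)


# Person hidden from the model: guessing "man" from context alone is penalized.
print(appearance_confusion({"man": 0.6, "woman": 0.1, "person": 0.3}))
# Person visible and labeled as a woman: confident "woman" gives a small penalty.
print(confidence_penalty({"man": 0.1, "woman": 0.8, "person": 0.1}, "woman"))
```

Under these assumptions, a neutral word such as “person” costs nothing in the confusion term, which matches the behavior described above.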

A lack of grounding may lead to inappropriate, harmful, or offensive outputs.


Grounded Communication with Advisability


A self-driving car can tell you that it is slowing down because it is about to turn left. This contains a description (the car slows down) and an explanation of the action (because it is about to turn left). Anna and her colleagues ran experiments on the Berkeley DeepDrive eXplanation (BDD-X) dataset at UC Berkeley. The model tried to predict the vehicle’s future ego-motion.

They introduced an explanation generator that produces natural language explanations of the rationale behind the driving model’s actions. In attention alignment, the key idea is to align the vehicle controller and the textual justifier so that they look at the same input regions. The key results were that weakly aligned attention worked best and that treating the explanation as an additional loss made the model explainable without a loss in performance. The remaining problem was that the system neither attended to pedestrians nor reacted to their presence.
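
The idea of aligning two attention maps can be sketched as a loss term. The article only says that the controller and the justifier should look at the same input regions; using a KL divergence between the two attention distributions is an assumption made here for illustration, and the function names are hypothetical. Attention weights are assumed to be non-negative with a positive sum.

```python
import math


def normalize(weights):
    """Scale non-negative attention weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same regions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(controller_attention, justifier_attention):
    """Small when both modules attend to the same input regions."""
    return kl_divergence(normalize(controller_attention),
                         normalize(justifier_attention))


# Attention over three image regions (toy numbers).
print(alignment_loss([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]))  # identical maps
print(alignment_loss([3.0, 1.0, 0.5], [0.5, 1.0, 3.0]))  # mismatched maps
```

Adding such a term to the training objective with a small weight would correspond to a weak form of alignment, as opposed to forcing both modules to share a single attention map.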


Advisable Driving Model


How can advice make the model recognize pedestrians? The key is to go from observation to action, not vice versa: the model learns to summarize its visual observations in natural language and then predicts an appropriate action in response. In the CARLA simulator, the researchers studied the behavior of a non-explainable model, an explainable model, and an advisable model. Humans trusted the advisable model the most.

Therefore, incorporating language advice in deep models leads to better-performing, more interpretable models that gain higher human trust. 


Learning from Language for Improving Visual Robustness


Large Vision Language Models


CLIP is a pre-trained model that learns to match images to captions. It can recognize and ground many high-level concepts, but not fine-grained classes such as dog breeds.
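
Matching images to captions can be used for zero-shot classification: embed the image, embed one caption per class, and pick the caption with the highest similarity. The sketch below shows only this matching step. The three-dimensional embeddings are made-up toy numbers; in practice they would come from the model’s image and text encoders, which are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the best-matching caption and all similarity scores."""
    scores = {caption: cosine(image_embedding, emb)
              for caption, emb in caption_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for encoder outputs (assumed values).
captions = {
    "a photo of a bird": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
}
image = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image, captions)
print(label)  # prints a photo of a bird
```

Fine-grained classes are hard in this setup because their caption embeddings sit very close together, so small errors in the image embedding change the winner.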

The researchers found that models suffer from considerable contextual bias, that is, difficulty separating a concept from its context. They introduced GALS (Guiding Visual Attention with Language Specification), which improves model learning with prompts (e.g., “photo of a bird”). It turns a high-level language task specification into spatial attention, which guides a CNN away from bias.
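
One way to picture “guiding a CNN away from bias” is a penalty on whatever the classifier looks at outside the region that the language prompt points to. The article does not describe the actual GALS objective, so the form below is an assumption for illustration: the language-derived attention map has values in [0, 1], and the classifier’s saliency is penalized wherever that attention is low.

```python
def attention_guidance_penalty(model_saliency, language_attention):
    """Penalize classifier saliency that falls outside the language-derived region.

    Both arguments are 2D grids (lists of rows) of the same shape.
    language_attention values are assumed to lie in [0, 1].
    """
    penalty = 0.0
    for sal_row, att_row in zip(model_saliency, language_attention):
        for saliency, attention in zip(sal_row, att_row):
            penalty += (1.0 - attention) * saliency ** 2
    return penalty


# Toy 2x2 image: the left column is the bird, the right column is water.
language_attention = [[1.0, 0.0],
                      [1.0, 0.0]]
looks_at_bird = [[0.5, 0.0],
                 [0.5, 0.0]]
looks_at_water = [[0.0, 0.5],
                  [0.0, 0.5]]

print(attention_guidance_penalty(looks_at_bird, language_attention))   # prints 0.0
print(attention_guidance_penalty(looks_at_water, language_attention))  # prints 0.5
```

A classifier that relies on the water background is penalized, while one that relies on the bird is not, which is the behavior the prompt “photo of a bird” is meant to encourage.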


Learning from Language for Improving Visual Transfer


Large-scale V+L models work well for high-level concepts, but not so for fine-grained ones. The idea is that a lot of general knowledge is captured in external resources.


Zero-Shot Learning


Traditional class-level transfer aims to generalize to unseen object classes. Work in that setting explored how to associate seen and unseen classes via auxiliary information, such as embeddings or attributes. The newer task-level transfer aims to generalize to unseen datasets or tasks, and there the modeling of external knowledge has not yet been explored.
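
Class-level transfer through attributes can be sketched in a few lines. This is a simplified illustration, not the method from the lecture: it assumes a hypothetical attribute predictor has already produced a score per attribute for an image, and that each unseen class is described only by a list of attributes. The class with the best average attribute score wins, even though no image of that class was seen in training.

```python
def predict_unseen_class(predicted_attributes, class_attributes):
    """Pick the class whose attribute list best matches the predicted scores."""
    def score(signature):
        # Average predicted score over the attributes that define the class.
        return sum(predicted_attributes.get(a, 0.0) for a in signature) / len(signature)

    return max(class_attributes, key=lambda name: score(class_attributes[name]))


# Attribute descriptions of two classes never seen during training.
class_attributes = {
    "blue jay": ["blue", "crest", "black necklace"],
    "duck": ["webbed feet", "flat bill"],
}

# Hypothetical output of an attribute predictor for one image.
predicted = {"blue": 0.9, "crest": 0.8, "black necklace": 0.7,
             "webbed feet": 0.1, "flat bill": 0.2}

print(predict_unseen_class(predicted, class_attributes))  # prints blue jay
```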

External knowledge can take the form of explanations. Humans leverage prior (structured) knowledge. Can the same be done for AI? The researchers used K-LITE: Knowledge-augmented Language-Image Training and Evaluation. Knowledge helps to improve performance on fine-grained concepts, such as sashimi, a Japanese specialty that consists of fresh, raw fish or meat that has been thinly sliced and is frequently consumed with soy sauce. However, it hurts performance when knowledge coverage is low or when the knowledge contains spurious words. The researchers found that learning from language can be further enhanced by drawing on external knowledge expressed in language.
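
The idea of enriching a class name with external knowledge can be shown with a small prompt builder. This is a sketch under stated assumptions: the prompt template and the `knowledge` dictionary are hypothetical stand-ins for an external knowledge source, and the fallback to a plain prompt reflects the low-coverage case mentioned above.

```python
def build_prompt(class_name, knowledge):
    """Build a text prompt, appending a definition when one is available."""
    prompt = f"a photo of {class_name}"
    definition = knowledge.get(class_name)
    if definition:
        # Knowledge is available: enrich the prompt with it.
        prompt += f", {definition}"
    return prompt


# Hypothetical external knowledge, keyed by class name.
knowledge = {
    "sashimi": "a Japanese specialty of thinly sliced fresh raw fish or meat",
}

print(build_prompt("sashimi", knowledge))
# prints a photo of sashimi, a Japanese specialty of thinly sliced fresh raw fish or meat
print(build_prompt("ramen", knowledge))
# prints a photo of ramen
```

The enriched prompt would then be fed to the text encoder in place of the bare class name, giving the model more to match against for rare or fine-grained concepts.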


What is Next? 


The next step is AI models that can learn from language. However, one possible limitation is that human supervision is hard to scale. Anna predicts that large-scale pre-trained models will be introduced that handle open-ended scenarios with arbitrary concepts, complex relationships, and world knowledge. Her long-term vision is that foundational grounded models will be compositional and structured, so that AI systems will communicate, be better grounded, and be able to learn from language. They will also be sample-efficient and require less human supervision.

Bohdan Ponomar is CEO at the AI HOUSE community. He is creating a leading AI ecosystem for students and experts to build world-class AI ventures in Ukraine.