Advancing Multimodal Vision-Language Learning
Aishwarya Agrawal
Assistant Professor @ UdeM and Mila
Research Scientist @ Google DeepMind
Multimodal AI Research (MAIR) Lab
Oscar Mañas
Saba Ahmadi
Le Zhang
Sarvjeet Singh Ghotra
Rabiul Awal
Kanishk Jain
Qian Yang
Shravan Nayak
Ankur Sikarwar
Rocktim Jyoti Das
(joining in Fall 2024)
Hanan Gani
(joining in Fall 2024)
Vision-Language Tasks
“A group of young people playing a game of Frisbee.”
Image Captioning
Q: “What is the mustache made of?”
A: “bananas”
Visual Question Answering
Vision-Language Tasks
Image Retrieval: “Grey haired man in black and yellow tie.”
Image Generation: “Grey haired man in black and yellow tie.”
Applications of vision-language systems?
What kind of wine is this?
Photo and question are from vizwiz.org
Applications of vision-language systems?
Have you seen my keys?
Photo credits images.google.com
Vision-Language Progress
DeepMind’s Flamingo
Vision-Language Progress
OpenAI’s DALL·E 2
Vision-Language Challenges
Visio-linguistic Compositional Reasoning
Which caption correctly matches the image?
Caption 1: “The dog is on the left and the cat is on the right”
Caption 2: “The dog is on the right and the cat is on the left”
Contrastive VL Models Struggle
The CLIP loss and pretraining data are too coarse to learn compositional relationships from.
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
Our Method Overview
Hard Negative Caption Generation
Original image-text pair: “A black(adj) dog(noun) and a brown(adj) cat(noun) sitting(verb) on grass(noun)”
Augmented hard negative captions:
Swap Object (Relation): “grass and a black cat sitting on a black dog”
Change Attribute (Attribute): “A brown dog and a black cat sitting on grass”
Change Verb (Action): “A black dog and a brown cat playing on grass”
Change Object (Object): “A black cat and a brown cat sitting on grass”
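As a rough illustration of this kind of rule-based augmentation (not necessarily the exact pipeline used here), the sketch below tags parts of speech with spaCy and perturbs one slot at a time; the substitution lists are made-up placeholders.

```python
# Minimal sketch of rule-based hard-negative caption generation (illustrative only).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical substitution lists; a real pipeline would use larger vocabularies.
COLOR_SWAPS = {"black": "brown", "brown": "black", "white": "red", "red": "white"}
VERB_SWAPS = {"sitting": "playing", "standing": "running"}

def change_attribute(caption: str) -> str:
    """Swap one adjective (attribute hard negative)."""
    doc = nlp(caption)
    tokens = [t.text for t in doc]
    adj_ids = [i for i, t in enumerate(doc) if t.pos_ == "ADJ" and t.text.lower() in COLOR_SWAPS]
    if adj_ids:
        i = random.choice(adj_ids)
        tokens[i] = COLOR_SWAPS[doc[i].text.lower()]
    return " ".join(tokens)

def change_verb(caption: str) -> str:
    """Swap one verb (action hard negative)."""
    doc = nlp(caption)
    tokens = [t.text for t in doc]
    verb_ids = [i for i, t in enumerate(doc) if t.pos_ == "VERB" and t.text.lower() in VERB_SWAPS]
    if verb_ids:
        i = random.choice(verb_ids)
        tokens[i] = VERB_SWAPS[doc[i].text.lower()]
    return " ".join(tokens)

def swap_objects(caption: str) -> str:
    """Exchange two noun mentions (relation hard negative)."""
    doc = nlp(caption)
    tokens = [t.text for t in doc]
    noun_ids = [i for i, t in enumerate(doc) if t.pos_ == "NOUN"]
    if len(noun_ids) >= 2:
        i, j = random.sample(noun_ids, 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

print(change_attribute("A black dog and a brown cat sitting on grass"))
```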
Losses
(a) Image-text Contrast: pull matched image-text pairs closer and push unmatched pairs away.
(b) Intra-Modal Contrast: push hard negative captions away from the corresponding positive captions.
(c) Cross-Modal Rank: maintain a minimum similarity gap between the positive image-text pair and the hard negative image-text pair, using an adaptive threshold.
Adaptive Threshold
The threshold grows adaptively, providing a stronger supervision signal as training progresses.
The CMR loss stabilizes after the initial steps while the total loss keeps decreasing.
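A minimal PyTorch-style sketch of how the three objectives and the adaptive threshold could fit together, assuming L2-normalized image, positive-caption, and hard-negative-caption embeddings; the loss weights, margin schedule, and the hinge form of the intra-modal term are illustrative simplifications, not the paper's exact formulation.

```python
# Illustrative sketch of the three objectives (not the paper's exact implementation).
# img, txt, txt_neg: L2-normalized embeddings of shape (B, D).
import torch
import torch.nn.functional as F

def itc_loss(img, txt, tau=0.07):
    """(a) Image-text contrast: pull matched pairs together, push unmatched pairs apart."""
    logits = img @ txt.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def imc_loss(txt, txt_neg, margin=0.8):
    """(b) Intra-modal contrast (simplified as a hinge): push each hard-negative caption
    away from its corresponding positive caption in text-embedding space."""
    sim = (txt * txt_neg).sum(-1)                     # cosine similarity per pair
    return F.relu(sim - margin).mean()

def cmr_loss(img, txt, txt_neg, threshold):
    """(c) Cross-modal rank: keep the positive image-text pair above the hard-negative
    image-text pair by at least `threshold`."""
    pos_sim = (img * txt).sum(-1)
    neg_sim = (img * txt_neg).sum(-1)
    return F.relu(threshold - (pos_sim - neg_sim)).mean()

def total_loss(img, txt, txt_neg, step, max_threshold=0.2, warmup_steps=1000):
    # The threshold grows adaptively, giving a stronger supervision signal over training.
    threshold = max_threshold * min(step / warmup_steps, 1.0)
    return itc_loss(img, txt) + imc_loss(txt, txt_neg) + cmr_loss(img, txt, txt_neg, threshold)
```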
Results
We outperform previous state-of-the-art methods across several compositional benchmarks
Conclusions
VisMin: Visual Minimal-Change Understanding
Aishwarya Agrawal
Saba Ahmadi*
Rabiul Awal*
Le Zhang*
(*equal contributions, under submission at NeurIPS)
Data Generation Framework
75% of data filtered out
A large-scale synthetic dataset (64K) for model training, covering object, attribute, counting, and spatial relationship changes.
Data Generation Framework
An additional 83% of data filtered out
A high-quality, human-verified dataset (2K) for model benchmarking, covering object, attribute, counting, and spatial relationships.
Dataset key features
Dataset distribution
Comparison of benchmarks offering visual hard negatives
Benchmarking
The task involves two settings: a text score and an image score (defined below).
Image-text Matching Tasks
Caption 1: “A man pouring white wine into a couple of wine glasses.”
Caption 2: “A man pouring red wine into a couple of wine glasses.”
Image 1, Image 2
Foundation models (e.g., CLIP):
Text score: Sim(Image 1, Caption 1) > Sim(Image 1, Caption 2)
Image score: Sim(Image 1, Caption 1) > Sim(Image 2, Caption 1)
Multimodal LLMs (e.g., GPT-4V):
Text score prompt: “Does this image best match: {Caption 1 or Caption 2}?”
Image score prompt: “Which image better aligns with the description: ‘{C}’? The first or the second image?”
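A rough sketch of how these two scores can be computed with an off-the-shelf contrastive VLM via the Hugging Face CLIP interface; the checkpoint name and image paths are placeholders.

```python
# Minimal sketch of text-score / image-score computation with CLIP (Hugging Face transformers).
# Checkpoint and image paths are placeholders; the comparisons follow the definitions above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

captions = [
    "A man pouring white wine into a couple of wine glasses.",
    "A man pouring red wine into a couple of wine glasses.",
]
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarities between every image and every caption, shape (2 images, 2 captions).
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
sim = img_emb @ txt_emb.t()

text_score_ok = sim[0, 0] > sim[0, 1]   # Image 1 should prefer Caption 1 over Caption 2
image_score_ok = sim[0, 0] > sim[1, 0]  # Caption 1 should prefer Image 1 over Image 2
print(text_score_ok.item(), image_score_ok.item())
```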
Benchmarking Foundational VLMs on VisMin
Object understanding in VLMs is very good, while attribute understanding is more challenging. Scaling model size also helps.
Benchmarking Foundational VLMs on VisMin
Models fail significantly on spatial relationships and counting categories. Increasing model size doesn’t help in understanding spatial relationships.
Benchmarking Multimodal LLMs on VisMin
Open-source MLLMs struggle more on the image score.
GPT-4V has slightly better spatial relationship understanding, but it is still limited.
Finetuning VLMs on VisMin Training Set
Strong boost in performance for both the foundational VLM and the multimodal LLM.
The multimodal LLM benefits more from our training set, particularly for spatial relationship understanding.
Finetuning VLMs on VisMin Training Set
We evaluate our finetuned models on multiple OOD benchmarks.
Across the board, the VisMin training set brings notable gains → our training set improves fine-grained understanding in general.
Finetuning VLMs on VisMin Training Set
Finetuning on our data improves performance on standard image-text retrieval for CLIP (recall with ViT-L/14 on COCO).
We also evaluate on standard VL tasks: MMMU and POPE.
Finetuning VLMs on VisMin Training Set
Larger models see better gains from our training set.
Fine-tuning on CLIP variants; circle radius reflects the number of model parameters.
Conclusions
Webpage link: https://rabiul.me/vismin/
Learning to decompose complex questions into simpler sub-questions
(work in progress)
Aishwarya Agrawal
Qian Yang
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Question: Think about the magnetic force between the magnets in each pair. Which of the following statements is true?
Options: A: The magnetic force is stronger in Pair 2.
B: The magnetic force is stronger in Pair 1
C: The strength of the magnetic force is the same in both pairs.
Label: B: The magnetic force is stronger in Pair 1
VLM (LLaVA): C: The strength of the magnetic force is the same in both pairs.
VLMs exhibit limitations in handling complex Vision-Language tasks.
[Figure: two pairs of bar magnets (N and S poles), separated by gaps of 30 mm and 16 mm.]
Adaptive Decomposition for Complex Vision-Language Tasks
How to create an adaptive and reliable decomposition approach tailored to specific VLMs?
Qian Yang
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Question: Which of the following statements is true?
Options:
A: The magnetic force is stronger in Pair 2.
B: The magnetic force is stronger in Pair 1
C: The strength of the magnetic force is the same in both pairs.
VLM answer: C: The strength of the magnetic force is the same in both pairs. Confidence: 45% → Unreliable! (Justifier)
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Question: Which of the following statements is true?
Options: A: The magnetic force is stronger in Pair 2. B: The magnetic force is stronger in Pair 1. C: The strength of the magnetic force is the same in both pairs.
Sub-Question 1 (LLM): Are the magnets in each pair the same size and shape?
Sub-Answer 1 (VLM): Yes. Confidence: 95% → Reliable! (Justifier)
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Original Question: Which of the following …
Sub-Question 1 (LLM): Are the magnets in each pair the same size and shape?
Sub-Answer 1 (VLM): Yes. Confidence: 95%
Sub-Question 2 (LLM): Are the magnets in each pair attracted or repelled when placed close to each other?
Sub-Answer 2 (VLM): Attracted. Confidence: 93%
Sub-Question 3 (LLM): Are the magnets in Pair 1 closer to each other than the magnets in Pair 2?
Sub-Answer 3 (VLM): Yes. Confidence: 89%
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Sub-Question 1: Are the magnets in each pair the same size and shape?
Sub-Answer 1: Yes. Confidence 95%
Sub-Question 2: Are the magnets in each pair attracted or repelled when placed close to each other?
Sub-Answer 2: Attracted. Confidence 93%
Sub-Question 3: Are the magnets in Pair 1 closer to each other than the magnets in Pair 2?
Sub-Answer 3: Yes. Confidence: 89%
Question: Think about the magnetic force between the magnets in each pair. Which of the following statements is true? Options: A: The magnetic force is stronger in Pair 2. B: The magnetic force is stronger in Pair 1. C: The strength of the magnetic force is the same in both pairs.
VLM (LLaVA): B : The magnetic force is stronger in Pair 1.
[Figure: two pairs of bar magnets (N and S poles), separated by gaps of 30 mm and 16 mm.]
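A high-level sketch of the adaptive decomposition loop illustrated above; `vlm_answer`, `judge_confidence`, and `llm_propose_subquestion` are hypothetical helpers standing in for the VLM, the justifier, and the LLM decomposer (this is work in progress, so the sketch is only indicative).

```python
# High-level sketch of adaptive question decomposition (not the exact algorithm).
# vlm_answer, judge_confidence, and llm_propose_subquestion are hypothetical helper functions.

def answer_with_adaptive_decomposition(image, question, options,
                                        threshold=0.9, max_subquestions=5):
    # 1. Try to answer the original question directly.
    answer = vlm_answer(image, question, options)
    if judge_confidence(image, question, answer) >= threshold:
        return answer  # reliable enough, no decomposition needed

    # 2. Otherwise, iteratively gather reliable sub-answers as extra context.
    context = []
    for _ in range(max_subquestions):
        sub_q = llm_propose_subquestion(question, options, context)
        if sub_q is None:
            break  # the LLM decides no further decomposition is useful
        sub_a = vlm_answer(image, sub_q)
        conf = judge_confidence(image, sub_q, sub_a)
        if conf >= threshold:
            context.append((sub_q, sub_a, conf))  # keep only reliable sub-answers

    # 3. Re-ask the original question, conditioned on the accumulated sub-QA context.
    return vlm_answer(image, question, options, context=context)
```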
Adaptive Decomposition for Complex Vision-Language Tasks
Qian Yang
Dataset | LLaVA | LLaVA + Adaptive Decomp
SNLI-VE | 55.0 | 59.4 (+4.4)
ScienceQA | 59.0 | 64.0 (+5.0)
A-OKVQA | 67.3 | 73.9 (+6.6)
Vision-Language Challenges
Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas
Benno Krojer
Aishwarya Agrawal
VQA Accuracy
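For context, the standard VQA accuracy metric scores a candidate answer by exact string matching against the 10 human answers: Acc(answer) = min(1, (# humans who gave that exact answer) / 3), so an answer counts as fully correct if at least three annotators provided it.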
VQA Accuracy failure modes
34.25%
Multiple answers
Q: “What are the sticks for?”
A: “balance”, “pushing”, “skating”
27.75%
Over- or under-specifying and verbosity
Q: “Where is the hydrant?”
A: “on the right”, “right”
21.0%
Synonym
Q: “What is the setting of this picture?”
A: “field”, “plains”, “grassland”
18.0%
Broad/bad question or generic response
Q: “How many sheep are there?”
A: “many”
LLM-Assisted VQA Evaluation (LAVE)
You are given a question, a set of gold-standard reference answers written by experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer. Give the rationale before rating.
THIS IS VERY IMPORTANT: A binary question should only be answered with 'yes' or 'no', otherwise the candidate answer is incorrect.
Task description
+
Question: What's the weather like?
Reference answers: sunny, clear, bright, sunny, sunny
Candidate answer: cloudy
Output: The candidate answer is incorrect because it contradicts the reference answers that suggest clear weather. So rating=1
Demonstrations
+
Test example
Question: Which window is slightly open?
Reference answers: right, right one, one on right, one in back, right window, yes, right, yes, second one
Candidate answer: the right window
Output:
Language Model (frozen ❄)
Completion: The candidate answer is correct because the right window is slightly open. So rating=3
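A minimal sketch of how a LAVE-style prompt could be assembled and its rating parsed; the task description and demonstration are copied from this slide, while `query_llm` is a hypothetical call to the frozen instruction-tuned LLM, and the rating-to-score mapping is an assumption.

```python
# Minimal sketch of LAVE-style prompting (query_llm is a hypothetical LLM call).
import re

TASK_DESCRIPTION = (
    "You are given a question, a set of gold-standard reference answers written by experts, "
    "and a candidate answer. Please rate the accuracy of the candidate answer for the question "
    "considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or "
    "irrelevant answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a "
    "correct answer. Give the rationale before rating.\n"
    "THIS IS VERY IMPORTANT: A binary question should only be answered with 'yes' or 'no', "
    "otherwise the candidate answer is incorrect."
)

DEMONSTRATION = (
    "Question: What's the weather like?\n"
    "Reference answers: sunny, clear, bright, sunny, sunny\n"
    "Candidate answer: cloudy\n"
    "Output: The candidate answer is incorrect because it contradicts the reference answers "
    "that suggest clear weather. So rating=1"
)

def build_prompt(question, references, candidate):
    test_example = (
        f"Question: {question}\n"
        f"Reference answers: {', '.join(references)}\n"
        f"Candidate answer: {candidate}\n"
        "Output:"
    )
    return "\n\n".join([TASK_DESCRIPTION, DEMONSTRATION, test_example])

def lave_score(question, references, candidate):
    completion = query_llm(build_prompt(question, references, candidate))  # frozen LLM
    match = re.search(r"rating\s*=\s*([123])", completion)
    rating = int(match.group(1)) if match else 1
    return (rating - 1) / 2  # map the 1-3 rating to a score in [0, 1] (assumed mapping)
```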
Correlation with human judgment
Qualitative examples
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi, Aishwarya Agrawal
EACL 2024
What is the image captioning task?
Input: an image → Model → Output: “A boy smiling while jumping on a skateboard.”
Example from Microsoft COCO: Common Objects in Context (Lin et al.)
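Just to make the task concrete, an off-the-shelf captioner can be run in a couple of lines with the Hugging Face `pipeline`; the checkpoint and image path below are placeholders, not the model behind this example.

```python
# Minimal image-captioning example with an off-the-shelf model (checkpoint is illustrative).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("skateboard.jpg"))  # e.g. [{'generated_text': 'a boy jumping on a skateboard'}]
```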
Robust Automatic Evaluation
Reference-free metrics [Hessel et al., EMNLP 2021]: a vision encoder embeds the image, a text encoder embeds the candidate caption (e.g., “A boy smiling while jumping on a skateboard.”), and their similarity score is used as the metric.
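For reference, CLIPScore computes the score of a caption C for an image I directly from CLIP embeddings, with no reference captions: CLIPScore(I, C) = 2.5 · max(cos(E_I, E_C), 0).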
Robust Automatic Evaluation
Saba Ahmadi
Recently proposed reference-free image-captioning metrics are not robust enough!
Captions | CLIPScore | UMIC |
The title of the book is topology. | 0.62 | 0.19 |
The title of the book is muffin. | 0.74 | 0.62 |
Figure credits: Saba Ahmadi
Robust Automatic Evaluation
Saba Ahmadi
Recently proposed reference-free image-captioning metrics are not robust enough!
Figure credits: Saba Ahmadi
Captions | CLIPScore | UMIC |
Small Object: There is a knife. | 0.62 | 0.36 |
Big Object: There is a pizza. | 0.72 | 0.34 |
Shuffled Small Object: A there knife is. | 0.63 | 0.19 |
Shuffled Big Object: A there pizza is. | 0.74 | 0.18 |
Vision-Language Challenges
Motivation: Western-centric bias
Datasets are sourced from the web: COCO, LAION, CC-12M, WIT, COYO
Motivation: Western-centric bias
ImageNet and COCO primarily consist of images from North America and Europe. Many benchmarks built upon these inherit the same biases.
Source: Does Object Recognition Work for Everyone? de Vries et al., 2019
Benchmarking geo-diverse cultural understanding in VLMs
Shravan Nayak
Kanishk Jain
Rabiul Awal
Karolina Stańczak
Lisa Anne Hendricks
Sjoerd van Steenkiste
Siva Reddy
Aishwarya Agrawal
Culture from VL perspective
Culture is a multifaceted concept that may refer to, but is not limited to, beliefs, practices, symbols, and norms found in human society.
Vision-Language perspective
Visible Cultural Manifestations
Invisible Cultural Manifestations
Cross-Cultural Analysis: The Science and Art of Comparing the World's Modern Societies and Their Cultures. Michael Minkov.
Visible cultural manifestations
Food
Clothing
Drinks
Rituals
Invisible cultural manifestations
Societal norms
Legal frameworks
Beliefs
Values
Invisible cultural manifestations can't be seen directly, but their impact is often visible in people's actions and behaviors.
From a VL perspective, culture means not only identifying and describing visual concepts but also understanding the underlying values and socio-cultural contexts they reflect.
Values
Norms
Beliefs
Image Source Left to Right: Trip Savy, India Times, Getty Images
CANDLE Dataset
Size: 60K clusters, 1.1M raw sentences
Built by filtering C4; consists of Cultural Commonsense Knowledge (CCSK) assertions.
Nguyen et al., Extracting Cultural Commonsense Knowledge at Scale
Pairing assertions with images from web
Cultural Repository of Images and Text (CRIT)
People in Brazil wear either black or purple to represent grief and loss.
The festival features Nigerian 'voodoo spirits' walking the streets.
Cultural Repository of Images and Text (CRIT)
Pav Bhaji is an Indian fast food dish
The lion dance is an important tradition in Chinese culture
Cultural Repository of Images and Text (CRIT)
Janmashtami is known and celebrated by different names across India
Mexican Horchata is a refreshing and slightly sweetened rice milk drink
Cultural Repository of Images and Text
CulturalVQA
CRIT images + metadata → culturally knowledgeable human annotators → CulturalVQA
Collecting Questions
We ran several studies on MTurk to collect culturally nuanced questions for images.
Your task is to ask a question about the cultural concept depicted in the image that someone from your culture will be able to answer easily, but someone who is not familiar with your culture will not be able to answer.
(Questions must also satisfy additional conditions.)
Collecting Answers
Answers should be culturally precise (e.g., “Holiday” → “Christmas”).
“What food item is shown in the image?” — Indian Bread → Naan
“What is the green substance in the image?” — Dhaniya Patta / Coriander leaves
Challenges
CulturalVQA Benchmark
Our efforts are focused on collecting data from the following countries
Examples
What does the bride-to-be add to the groom-to-be's drink in the image, when he asks for her family's blessing to marry?
Ans: Salt
Turkey
What is around the bride and groom's neck?
Ans: Varmala
India
Examples
Canada
What do the feathers on his head mean?
Ans: Chief
Nigeria
The beaded headgear is from which culture?
Ans: Edo culture
CulturalVQA: Statistics
Evaluation Metric
We leverage the LAVE metric, which uses the in-context learning capabilities of instruction-tuned LLMs for VQA evaluation by formulating it as an answer-rating task.
Improving Automatic VQA Evaluation Using Large Language Models; Mañas et al.
Evaluation Metric
When is the object shown in the image usually used?
Ground Truth: praying or religious offerings
Model: Prayer time
String Matching: judges the answer incorrect
LAVE: judges the answer correct
Evaluation Metric
What does the food kept in baskets represent in the image?
Ground Truth: Prasadam
Model: Prasad
String Matching: judges the answer incorrect
LAVE: judges the answer correct
Benchmarking VLMs on CulturalVQA
Models perform better for North America than for Africa and Asia.
Closed-source models are strictly better than open-source models.
The gap between closed-source and open-source increases as we go beyond North America!
Benchmarking SOTA VLMs on CulturalVQA
GPT-4 outperforms all models, with Gemini being second best.
InternVL 1.6 is the best open-source model (larger size plus more data).
GPT-4 Failure Cases
Question: What type of Indian bread is this in the picture?
Pred: Khakhra | GT: Naan
Question: What is the local name for the white-colored food in the image in East Africa?
Pred: Ugali | GT: Kawunga
Question: For what occasion is the flower typically displayed in front of the capital building?
Pred: Tulip Festival | GT: Remembrance Day
Question: The beaded headgear is from which culture?
Pred: Benin Kingdom | GT: Edo culture
Conclusions
Vision-Language Challenges
Thanks! Questions?