1 of 109

Advancing Multimodal Vision-Language Learning

Aishwarya Agrawal

Assistant Professor @ UdeM and Mila

Research Scientist @ Google DeepMind

2 of 109

Multimodal AI Research (MAIR) Lab

Oscar Mañas

Saba Ahmadi

Le Zhang

Sarvjeet Singh Ghotra

Rabiul Awal

Kanishk Jain

Qian Yang

Shravan Nayak

Ankur Sikarwar

Rocktim Jyoti Das

(joining in Fall 2024)

Hanan Gani

(joining in Fall 2024)


3 of 109

Vision-Language Tasks


“A group of young people playing a game of Frisbee.”

Image Captioning

Q: “What is the mustache made of?”

A: “bananas”

Visual Question Answering

4 of 109

Vision-Language Tasks


Image Retrieval

“Grey-haired man in black and yellow tie.”

Image Generation

“Grey-haired man in black and yellow tie.”

5 of 109

Applications of vision-language systems?

  • Aid to visually impaired users


What kind of wine is this?

Photo and question are from vizwiz.org

6 of 109

Applications of vision-language systems?

  • Aid to visually impaired users
  • Online shopping and organizing photos


7 of 109

Applications of vision-language systems?

  • Aid to visually impaired users
  • Online shopping and organizing photos
  • Grounded virtual assistants


Have you seen my keys?

Photo credits images.google.com

8 of 109

Vision-Language Progress


DeepMind’s Flamingo

Link

9 of 109

Vision-Language Progress


OpenAI’s DALL·E 2

Link

10 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding


11 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding


12 of 109

Visio-Linguistic Compositional Reasoning

Which caption correctly matches the image?

Caption 1: “The dog is on the left and the cat is on the right.”

Caption 2: “The dog is on the right and the cat is on the left.”

13 of 109

Visio-Linguistic Compositional Reasoning

Which caption correctly matches the image?

Caption 1: “The dog is on the left and the cat is on the right.”

Caption 2: “The dog is on the right and the cat is on the left.”

14 of 109

Contrastive VL Models Struggle

CLIP’s contrastive loss and pretraining data are too coarse to learn compositional relationships from.

15 of 109

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang

Rabiul Awal

Aishwarya Agrawal

16 of 109

Our Method Overview

17 of 109

Our Method Overview

18 of 109

Our Method Overview

19 of 109

Hard Negative Caption Generation

Original image-text pair: “A black(adj) dog(noun) and a brown(adj) cat(noun) sitting(verb) on grass(noun)”

Augmented hard-negative captions:

  • Swap Object (Relation): “Grass and a black cat sitting on a black dog”
  • Change Attribute (Attribute): “A brown dog and a black cat sitting on grass”
  • Change Verb (Action): “A black dog and a brown cat playing on grass”
  • Change Object (Object): “A black cat and a brown cat sitting on grass”
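To make the perturbation types concrete, below is a minimal sketch of rule-based hard-negative generation driven by POS tags (using spaCy); the swap vocabularies and perturbation functions are illustrative assumptions, not the paper’s exact pipeline.

```python
# Minimal sketch of rule-based hard-negative caption generation (illustrative,
# not the exact pipeline from the paper). Requires spaCy and its small English
# model: pip install spacy && python -m spacy download en_core_web_sm
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Toy swap vocabularies (assumptions for illustration).
ATTRIBUTE_SWAPS = {"black": "brown", "brown": "black"}
VERB_SWAPS = {"sitting": "playing", "playing": "sitting"}
OBJECT_SWAPS = {"dog": "cat", "cat": "dog"}

def swap_objects(doc):
    """Swap two nouns to corrupt the relation between objects."""
    nouns = [t.i for t in doc if t.pos_ == "NOUN"]
    if len(nouns) < 2:
        return None
    i, j = random.sample(nouns, 2)
    words = [t.text for t in doc]
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def change_attribute(doc):
    """Replace adjectives with contrasting ones."""
    return " ".join(ATTRIBUTE_SWAPS.get(t.text, t.text) if t.pos_ == "ADJ" else t.text
                    for t in doc)

def change_verb(doc):
    """Replace the verb to corrupt the action."""
    return " ".join(VERB_SWAPS.get(t.text, t.text) if t.pos_ == "VERB" else t.text
                    for t in doc)

def change_object(doc):
    """Replace the first swappable object noun with a different one."""
    for t in doc:
        if t.pos_ == "NOUN" and t.text in OBJECT_SWAPS:
            return " ".join(OBJECT_SWAPS[tok.text] if tok.i == t.i else tok.text
                            for tok in doc)
    return None

caption = "A black dog and a brown cat sitting on grass"
doc = nlp(caption)
for perturb in (swap_objects, change_attribute, change_verb, change_object):
    print(perturb.__name__, "->", perturb(doc))
```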

20 of 109

Losses

[Diagram: an image encoder and a text encoder compute a similarity score for each image-text pair.]

(a) Image-text Contrast: pull matched image-text pairs closer and push unmatched pairs away.

21 of 109

Losses

(a) Image-text Contrast: pull matched image-text pairs closer and push unmatched pairs away.

(b) Intra-Modal Contrast: compute text-text similarity between positive and hard-negative captions; push hard-negative captions away from the corresponding positive captions.

22 of 109

Losses

(a) Image-text Contrast: pull matched image-text pairs closer and push unmatched pairs away.

(b) Intra-Modal Contrast: push hard-negative captions away from the corresponding positive captions.

(c) Cross-Modal Rank: maintain a minimum similarity gap between the positive image-text pair and the hard-negative image-text pair, enforced with an adaptive threshold.
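A minimal PyTorch-style sketch of the three objectives, assuming precomputed, L2-normalized image and text embeddings; the exact loss formulations and the margin schedule are simplified assumptions, not the paper’s equations.

```python
import torch
import torch.nn.functional as F

def image_text_contrast(img, txt, temperature=0.07):
    """(a) CLIP-style InfoNCE: pull matched image-text pairs together,
    push unmatched pairs in the batch apart. img, txt: [B, D]."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def intra_modal_contrast(txt_pos, txt_neg, margin=0.2):
    """(b) Push each hard-negative caption away from its positive caption
    in the text embedding space (simple margin on cosine similarity)."""
    sim = (txt_pos * txt_neg).sum(dim=-1)          # [B]
    return F.relu(sim - (1.0 - margin)).mean()

def cross_modal_rank(img, txt_pos, txt_neg, threshold):
    """(c) Keep the matched image-text similarity at least `threshold`
    above the similarity with the hard-negative caption."""
    pos = (img * txt_pos).sum(dim=-1)              # [B]
    neg = (img * txt_neg).sum(dim=-1)              # [B]
    return F.relu(neg - pos + threshold).mean()

def adaptive_threshold(step, max_threshold=0.2, warmup_steps=1000):
    """Grow the margin as training progresses, giving a stronger supervision
    signal later on (the exact schedule here is an assumption)."""
    return max_threshold * min(step / warmup_steps, 1.0)

# Toy usage with random embeddings standing in for encoder outputs.
B, D = 8, 512
img = F.normalize(torch.randn(B, D), dim=-1)
txt_pos = F.normalize(torch.randn(B, D), dim=-1)   # positive captions
txt_neg = F.normalize(torch.randn(B, D), dim=-1)   # hard-negative captions
loss = (image_text_contrast(img, txt_pos)
        + intra_modal_contrast(txt_pos, txt_neg)
        + cross_modal_rank(img, txt_pos, txt_neg, adaptive_threshold(step=500)))
```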

23 of 109

Adaptive Threshold

The threshold grows adaptively, providing a stronger supervision signal as training progresses.

The CMR loss stabilizes after the initial steps while the total loss keeps decreasing.

24 of 109

Results

We outperform previous state-of-the-art methods across several compositional benchmarks

25 of 109

Conclusions

  • CLIP-like models struggle with visio-linguistic compositional reasoning
  • Automatically generating hard-negative captions and finetuning CLIP with them is promising
  • Explicitly encouraging disparity between positive and hard-negative representations helps even further

26 of 109

VisMin: Visual Minimal-Change Understanding


Aishwarya Agrawal

Saba Ahmadi*

Rabiul Awal*

Le Zhang*

(*equal contributions, under submission at NeurIPS)

27 of 109

28 of 109

Data Generation Framework

29 of 109

Data Generation Framework

30 of 109

Data Generation Framework

75% of data filtered out

31 of 109


Spatial Relationship

Object

Attribute

Counting

A large-scale synthetic dataset for model training (64K) => object, attribute, counting, and spatial relationship

32 of 109

Data Generation Framework

Additional 83% of data filtered out

33 of 109


Spatial Relationship

Object

Attribute

Counting

High-quality, human-verified data for model benchmarking (2K) => object, attribute, counting, and spatial relationship

34 of 109

Dataset key features

  • Visual Minimal Changes: Targeted change in one aspect: object, attribute, counting, or spatial relation.
  • Visual Complexity: COCO and diffusion-generated images, edited by a detailed pipeline for targeted changes.
  • Caption Complexity: Blend of human-written and LLM-generated captions.
  • Human-Approved: Captions are coherent and images look natural.

35 of 109

Dataset distribution

36 of 109

Comparison of benchmarks offering visual hard negatives

Table columns: Visual Minimal HN; Visual Complexity; Textual Complexity; Human-approved Captions; Human-approved Images; Size. (A separate marker indicates a criterion that holds for only a subset of the benchmark.)

37 of 109

Benchmarking

The task involves two settings:

  • selecting the correct caption for a given image from two minimally different captions, and
  • selecting the correct image for a given caption from two minimally different images.

A man pouring white wine into a couple of wine glasses.

A man pouring red wine into a couple of wine glasses.

Caption 1

Caption 2

Image 1

Image 2

38 of 109

Image-text Matching Tasks

A man pouring white wine into a couple of wine glasses.

A man pouring red wine into a couple of wine glasses.

Caption 1

Caption 2

Image 1

Image 2

39 of 109

Image-text Matching Tasks

A man pouring white wine into a couple of wine glasses.

A man pouring red wine into a couple of wine glasses.

Caption 1

Caption 2

Image 1

Image 2

Text score

Sim(Image1, Caption1) > Sim(Image1, Caption2)

Image score

Sim(Image1, Caption1) > Sim(Image2, Caption1)

Foundation Models e.g. CLIP
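For a contrastive model such as CLIP, the two scores can be computed directly from the image-text similarity matrix; a minimal sketch using the Hugging Face transformers CLIP API (checkpoint name and image paths are placeholders).

```python
# Sketch: text score and image score for a CLIP-style model on a VisMin-style
# example (two images, two minimally different captions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["A man pouring white wine into a couple of wine glasses.",
            "A man pouring red wine into a couple of wine glasses."]
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]  # placeholder files

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sim = model(**inputs).logits_per_image   # sim[i, j] = Sim(Image i+1, Caption j+1)

# Text score: given Image 1, the matching caption must be ranked first.
text_score = bool(sim[0, 0] > sim[0, 1])
# Image score: given Caption 1, the matching image must be ranked first.
image_score = bool(sim[0, 0] > sim[1, 0])
print("text score:", text_score, "image score:", image_score)
```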

40 of 109

Image-text Matching Tasks

A man pouring white wine into a couple of wine glasses.

A man pouring red wine into a couple of wine glasses.

Caption 1

Caption 2

Image 1

Image 2

Text score

Sim(Image1, Caption1) > Sim(Image1, Caption2)

Image score

Sim(Image1, Caption1) > Sim(Image2, Caption1)

Text score

Does this image best match: {Caption1 or Caption2}

Image score

Which image better aligns with the description: ‘{C}’? The first or the second image?

Foundation Models e.g. CLIP

Multimodal large language models, e.g. GPT-4V

41 of 109

Benchmarking Foundational VLMs on VisMin

Object understanding in VLMs is very good, while attribute understanding is more challenging. Scaling model size also helps.

42 of 109

Benchmarking Foundational VLMs on VisMin

Models fail significantly on the spatial relationship and counting categories. Increasing model size does not help with understanding spatial relationships.

43 of 109

Benchmarking Multimodal LLMs on VisMin

Open-source MLLMs struggle more on the image score.

GPT-4V has slightly better spatial relationship understanding, but it is still limited.

44 of 109

Finetuning VLMs on VisMin Training Set

Strong boost in performance for both the foundational VLM and the multimodal LLM.

The multimodal LLM benefits more from our training set, particularly for spatial relation understanding.

45 of 109

Finetuning VLMs on VisMin Training Set

We evaluate our finetuned model on multiple OOD benchmarks.

Across the board, the VisMin training set brings notable gains → our training set improves fine-grained understanding in general.

46 of 109

Finetuning VLMs on VisMin Training Set

Finetuning on our data improves performance on standard image-text retrieval for CLIP (Recall with ViT-L/14 on COCO).

We also evaluate on standard VL tasks: MMMU and POPE.

47 of 109

Finetuning VLMs on VisMin Training Set

Larger models gain more from our training set.

Fine-tuning on CLIP variants; circle radius reflects the number of model parameters.

48 of 109

Conclusions

  • A new benchmark VisMin for fine-grained visual understanding evaluation.
  • An automatic pipeline for generating visual minimal-change pairs.
  • VLMs are good at object/attribute understanding, but poor at counting/spatial relation understanding.
  • Finetuning VLMs on our visual minimal-change data helps!
    • Improvement observed everywhere except CLIP on spatial relations
  • Our dataset shows potential as a robust training resource for VLMs.

49 of 109

50 of 109

Learning to decompose complex questions into simpler sub-questions


(work in progress)

Aishwarya Agrawal

Qian Yang

51 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Qian Yang

Question: Think about the magnetic force between the magnets in each pair. Which of the following statements is true?

Options: A: The magnetic force is stronger in Pair 2.

B: The magnetic force is stronger in Pair 1

C: The strength of the magnetic force is the same in both pairs.

Label: B: The magnetic force is stronger in Pair 1

VLM (LLaVA): C: The strength of the magnetic force is the same in both pairs.

VLMs exhibit limitations in handling complex Vision-Language tasks.

[Figure: two pairs of bar magnets (S/N poles), Pair 1 and Pair 2, with gaps of 30 mm and 16 mm.]

52 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

How to create an adaptive and reliable decomposition approach tailored to specific VLMs?

Qian Yang

  • Previous works focus on pre-decomposition.
  • Overlooking VLM feedback in pre-decomposition: crucial feedback from VLMs is ignored, preventing adaptive adjustments based on VLM responses.
  • Absence of reliability checks: unreliable intermediate answers can bias the reasoning process, affecting the integrity of the conclusions.

53 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Qian Yang

  • The Justifier assesses the confidence level of an answer and determines its reliability.

Question: Which of the following statements is true?

Options:

A: The magnetic force is stronger in Pair 2.

B: The magnetic force is stronger in Pair 1

C: The strength of the magnetic force is the same in both pairs.

VLM

C: The strength of the magnetic force is the same in both pairs.

Confidence: 45%.

Unreliable!

Justifier

  • The VLM generates the initial answer.

54 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Qian Yang

  • The LLM generates the first sub-question based on the original question.

Question: Which of the following statements is true?

Options: A: The magnetic force is stronger in Pair 2. B: The magnetic force is stronger in Pair 1 C: The strength of the magnetic force is the same in both pairs.

Sub-Question 1: Are the magnets in each pair the same size and shape?

VLM

LLM

Sub-Answer 1: Yes

Confidence: 95%.

Reliable!

Justifier

  • The VLM answers the first sub-question conditioned on the image.
  • The Justifier validates the reliability of the sub-answer.

55 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Original Question: Which of the following …

Sub-Question 1: Are the magnets in each pair the same size and shape?

Sub-Answer 1: Yes. Confidence 95%

Qian Yang

  • The LLM adaptively generates subsequent sub-questions conditioned on the original question, all previous sub-QAs, and their reliability.

Sub-Question 2: Are the magnets in each pair attracted or repelled when placed close to each other?

VLM

LLM

Sub-Answer 2: Attracted.

Confidence: 93%

LLM

Sub-Question 3: Are the magnets in Pair 1 closer to each other than the magnets in Pair 2?

VLM

Sub-Question 1: Are the magnets in each pair the same size and shape?

Sub-Answer 1: Yes. Confidence 95%

  • The VLM answers each sub-question conditioned on the image and the previous reliable sub-QAs.

Sub-Answer 3: Yes. Confidence: 89%
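A schematic sketch of the decomposition loop described above; vlm_answer, llm_next_subquestion, and justifier_confidence are hypothetical stand-ins for the VLM, LLM, and Justifier calls, and the 0.8 reliability cutoff is an assumed value.

```python
# Schematic sketch of adaptive decomposition (work in progress on these slides).
from typing import Callable, List, Tuple

def adaptive_decomposition(
    image,
    question: str,
    vlm_answer: Callable[[object, str, List[Tuple[str, str]]], str],
    llm_next_subquestion: Callable[[str, List[Tuple[str, str, float]]], str],
    justifier_confidence: Callable[[str, str], float],
    confidence_threshold: float = 0.8,   # assumed cutoff for "reliable"
    max_steps: int = 3,
) -> str:
    # 1. The VLM tries the original question first.
    answer = vlm_answer(image, question, [])
    if justifier_confidence(question, answer) >= confidence_threshold:
        return answer  # already reliable, no decomposition needed

    reliable_qas: List[Tuple[str, str]] = []
    history: List[Tuple[str, str, float]] = []
    for _ in range(max_steps):
        # 2. The LLM proposes the next sub-question, conditioned on the
        #    original question and all previous sub-QAs with their reliability.
        sub_q = llm_next_subquestion(question, history)
        # 3. The VLM answers the sub-question, conditioned on the image and
        #    the previously validated sub-QAs.
        sub_a = vlm_answer(image, sub_q, reliable_qas)
        conf = justifier_confidence(sub_q, sub_a)
        history.append((sub_q, sub_a, conf))
        if conf >= confidence_threshold:
            reliable_qas.append((sub_q, sub_a))

    # 4. Final answer to the original question, grounded in reliable sub-QAs.
    return vlm_answer(image, question, reliable_qas)
```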

56 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Qian Yang

Sub-Question 1: Are the magnets in each pair the same size and shape?

Sub-Answer 1: Yes. Confidence 95%

Sub-Question 2: Are the magnets in each pair attracted or repelled when placed close to each other?

Sub-Answer 2: Attracted. Confidence 93%

Sub-Question 3: Are the magnets in Pair 1 closer to each other than the magnets in Pair 2?

Sub-Answer 3: Yes. Confidence: 89%

Question: Think about the magnetic force between the magnets in each pair. Which of the following statements is true? Options: A: The magnetic force is stronger in Pair 2. B: The magnetic force is stronger in Pair 1. C: The strength of the magnetic force is the same in both pairs.

VLM (LLaVA): B : The magnetic force is stronger in Pair 1.

[Figure: two pairs of bar magnets (S/N poles), Pair 1 and Pair 2, with gaps of 30 mm and 16 mm.]

57 of 109

Adaptive Decomposition for Complex Vision-Language Tasks

Qian Yang

  • Performance enhancement: at least a +4% improvement on the SNLI-VE, ScienceQA, and A-OKVQA benchmarks.

Dataset      LLaVA    LLaVA + Adaptive Decomp
SNLI-VE      55.0     59.4 (+4.4%)
ScienceQA    59.0     64.0 (+5.0%)
A-OKVQA      67.3     73.9 (+6.6%)

58 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding


59 of 109

Improving

Automatic VQA Evaluation

Using Large Language Models

Oscar Mañas

Benno Krojer

Aishwarya Agrawal

60 of 109

VQA Accuracy

  • VQA is traditionally evaluated with VQA Accuracy
    • Based on exact string match (EM)
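For reference, VQA Accuracy credits an answer by how many of the human annotators gave exactly the same string; a minimal sketch (the official metric additionally averages over 9-annotator subsets, omitted here for brevity).

```python
# Sketch of the standard VQA Accuracy computation (exact string match against
# the human reference answers).
def vqa_accuracy(candidate: str, reference_answers: list[str]) -> float:
    cand = candidate.strip().lower()
    matches = sum(ref.strip().lower() == cand for ref in reference_answers)
    return min(matches / 3.0, 1.0)   # 3 or more agreeing humans => full credit

# A correct but differently phrased answer gets zero credit:
print(vqa_accuracy("the right window", ["right", "right one", "on the right"]))  # 0.0
```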

61 of 109

VQA Accuracy failure modes

34.25%

Multiple answers

Q: “What are the sticks for?”

A: “balance”, “pushing”, “skating”

27.75%

Over- or under- specifying and verbosity

Q: “Where is the hydrant?”

A: “on the right”, “right”

21.0%

Synonym

Q: “What is the setting of this picture?”

A: “field”, “plains”, “grassland”

18.0%

Broad/bad question or generic response

Q: “How many sheep are there?”

A: “many”

62 of 109

LLM-Assisted VQA Evaluation (LAVE)

Task description:

You are given a question, a set of gold-standard reference answers written by experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer. Give the rationale before rating.

THIS IS VERY IMPORTANT: A binary question should only be answered with 'yes' or 'no', otherwise the candidate answer is incorrect.

+ Demonstrations:

Question: What's the weather like?
Reference answers: sunny, clear, bright, sunny, sunny
Candidate answer: cloudy
Output: The candidate answer is incorrect because it contradicts the reference answers that suggest clear weather. So rating=1

+ Test example:

Question: Which window is slightly open?
Reference answers: right, right one, one on right, one in back, right window, yes, right, yes, second one
Candidate answer: the right window
Output:

→ Language Model →

Completion: The candidate answer is correct because the right window is slightly open. So rating=3
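A minimal sketch of how a LAVE-style judgment can be produced: assemble the prompt from the task description, demonstrations, and test example, send it to an LLM, then parse the 1-3 rating from the completion. The llm_complete call and the rating-to-score mapping shown here are assumptions for illustration.

```python
import re
from typing import List

TASK_DESCRIPTION = (
    "You are given a question, a set of gold-standard reference answers written "
    "by experts, and a candidate answer. Please rate the accuracy of the candidate "
    "answer for the question considering the reference answers. Use a scale of 1-3, "
    "with 1 indicating an incorrect or irrelevant answer, 2 indicating an ambiguous "
    "or incomplete answer, and 3 indicating a correct answer. "
    "Give the rationale before rating."
)

def build_lave_prompt(question: str, references: List[str], candidate: str,
                      demonstrations: List[str]) -> str:
    """Assemble task description + in-context demonstrations + test example."""
    parts = [TASK_DESCRIPTION, *demonstrations]
    parts.append(f"Question: {question}\n"
                 f"Reference answers: {', '.join(references)}\n"
                 f"Candidate answer: {candidate}\n"
                 f"Output:")
    return "\n\n".join(parts)

def lave_score(completion: str) -> float:
    """Parse 'rating=k' from the LLM completion and map 1/2/3 to 0/0.5/1
    (the mapping to [0, 1] is an assumption of this sketch)."""
    match = re.search(r"rating\s*=\s*([123])", completion)
    if match is None:
        return 0.0
    return (int(match.group(1)) - 1) / 2.0

# Example usage with a hypothetical LLM call:
# completion = llm_complete(build_lave_prompt(q, refs, cand, demos))
# score = lave_score(completion)
```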

63 of 109

Correlation with human judgment

64 of 109

Qualitative examples

65 of 109

An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics


Aishwarya

Agrawal

Saba

Ahmadi

EACL 2024

66 of 109

What is the image captioning task?


Model

Input:

A boy smiling while jumping on a skateboard.

Output:

Example from Microsoft COCO: Common Objects in Context (Tsung-Yi et al.)

67 of 109

Robust Automatic Evaluation

  • Evaluating image captioning is difficult!

[Hessel et al. EMNLP 2021]

  • Existing metrics rely on n-gram matches between candidate and reference captions.
  • Recently, reference-free metrics have been proposed: CLIPScore, UMIC.

68 of 109

Reference-free metrics


[Diagram: a vision encoder embeds the image and a text encoder embeds the candidate caption (“A boy smiling while jumping on a skateboard.”); the metric is their similarity score.]
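Concretely, CLIPScore (Hessel et al., 2021) is a rescaled cosine similarity between CLIP’s image and caption embeddings, with no reference captions involved; a minimal sketch with the Hugging Face CLIP API (checkpoint name and image path are placeholders).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, caption: str) -> float:
    """CLIPScore = 2.5 * max(cos(image_emb, text_emb), 0); reference-free."""
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * max((img @ txt.t()).item(), 0.0)

# score = clipscore(Image.open("skateboard.jpg"),
#                   "A boy smiling while jumping on a skateboard.")
```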

69 of 109

Robust Automatic Evaluation

Saba Ahmadi

Recently proposed reference-free image-captioning metrics are not robust enough!

  • They fail to recognize fine-grained differences between correct and incorrect captions.

Captions                              CLIPScore   UMIC
The title of the book is topology.    0.62        0.19
The title of the book is muffin.      0.74        0.62

Figure credits: Saba Ahmadi

  • They have poor understanding of negation.
  • They are biased by the length of the captions.

70 of 109

Robust Automatic Evaluation

Saba Ahmadi

Recently proposed reference-free image-captioning metrics are not robust enough!

  • CLIPScore is more sensitive (than UMIC) to the number and size of objects mentioned in the caption.

Figure credits: Saba Ahmadi

Captions                           CLIPScore   UMIC
Small Object: There is a knife.    0.62        0.36
Big Object: There is a pizza.      0.72        0.34

71 of 109

Robust Automatic Evaluation

Saba Ahmadi

Recently proposed reference-free image-captioning metrics are not robust enough!

  • CLIPScore is more sensitive (than UMIC) to the number and size of objects mentioned in the caption.

Figure credits: Saba Ahmadi

  • CLIPScore is indifferent to the sentence structure.

Captions                                   CLIPScore   UMIC
Small Object: There is a knife.            0.62        0.36
Big Object: There is a pizza.              0.72        0.34
Shuffled Small Object: A there knife is.   0.63        0.19
Shuffled Big Object: A there pizza is.     0.74        0.18

72 of 109

Robust Automatic Evaluation

Saba Ahmadi

Recently proposed reference-free image-captioning metrics are not robust enough!

  • CLIPScore is more sensitive (than UMIC) to the number and size of objects mentioned in the caption.

Figure credits: Saba Ahmadi

  • CLIPScore is indifferent to the sentence structure.

Captions                                   CLIPScore   UMIC
Small Object: There is a knife.            0.62        0.36
Big Object: There is a pizza.              0.72        0.34
Shuffled Small Object: A there knife is.   0.63        0.19
Shuffled Big Object: A there pizza is.     0.74        0.18

73 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding


74 of 109

Motivation: Western-centric bias


Web

COCO, LAION, CC-12M, WIT, CoYo

75 of 109

Motivation: Western-centric bias

ImageNet and COCO primarily consist of images from North America and Europe. Many benchmarks built upon these inherit the same biases.

Source: Does Object Recognition Work for Everyone? de Vries et al., 2019

76 of 109

Benchmarking geo-diverse cultural understanding in VLMs

Shravan Nayak

Kanishk Jain

Rabiul Awal

Karolina Stańczak

Lisa Anne Hendricks

Sjoerd van Steenkiste

Siva Reddy

Aishwarya Agrawal

77 of 109

Culture from VL perspective

Culture is a multifaceted concept that may refer to, but is not limited to, beliefs, practices, symbols, and norms found in human society.

Vision-Language perspective

Visible Cultural Manifestations

Invisible Cultural Manifestations

Cross-Cultural Analysis: The Science and Art of Comparing the World's Modern Societies and Their Cultures. Michael Minkov.

78 of 109

Visible cultural manifestations

Food

Clothing

Drinks

Rituals

Cross-Cultural Analysis: The Science and Art of Comparing the World's Modern Societies and Their Cultures. Michael Minkov.

79 of 109

Invisible cultural manifestations

Societal norms

Legal frameworks

Beliefs

Values

Cross-Cultural Analysis: The Science and Art of Comparing the World's Modern Societies and Their Cultures. Michael Minkov.

80 of 109

Invisible cultural manifestations can't be seen directly, but their impact is often visible in people's actions and behaviors.

From a VL perspective, culture means not only identifying and describing visual concepts but also understanding the underlying values and socio-cultural contexts they reflect

Values

Norms

Beliefs

Image Source Left to Right: Trip Savy, India Times, Getty Images

81 of 109

CANDLE Dataset


Size:

60k clusters

1.1 M raw sentences

Domains:

  • Geography
  • Religion
  • Occupations
  • States of the U.S.A.

Nguyen et al., Extracting Cultural Commonsense Knowledge at Scale

Built by filtering C4

Consists of Cultural Commonsense Knowledge (CCSK) assertions.

82 of 109

Pairing assertions with images from web


83 of 109

Cultural Repository of Images and Text (CRIT)

People in Brazil wear either black or purple to represent grief and loss.

The festival features Nigerian 'voodoo spirits' walking the streets.

84 of 109

Cultural Repository of Images and Text (CRIT)

Pav Bhaji is an Indian fast food dish

The lion dance is an important tradition in Chinese culture

85 of 109

Cultural Repository of Images and Text (CRIT)

Janmashtami is known and celebrated by different names across India

Mexican Horchata is a refreshing and slightly sweetened rice milk drink

86 of 109

Cultural Repository of Images and Text

87 of 109

CulturalVQA

  • A systematic way to evaluate understanding of cultural nuances in VLMs → Visual Question Answering
  • Solicit question and answer annotations from participants familiar with the culture
  • Choose annotators from different cultures/countries

Culturally knowledgeable human annotators

CRIT Images +

Metadata

CulturalVQA

88 of 109

Collecting Questions

We ran several studies on MTurk to collect culturally nuanced questions for images.

Additional conditions that questions must satisfy

  • The question must require an understanding of your culture to answer it correctly.
  • The question must require looking at the image to answer it correctly.
  • The question must elicit a single correct answer.
  • Do not ask a question based on stereotypes, i.e., oversimplified beliefs about your cultural group.

Your task is to ask a question about the cultural concept depicted in the image that someone from your culture will be able to answer easily, but someone who is not familiar with your culture will not be able to answer.

89 of 109

Collecting Answers

  1. Be culturally precise: Your answer should be the precise term people from your culture would use. Your answer should not be generic.

     “Holiday” -> “Christmas”

  2. Use English: Your answer should be written in English. Use regional terms only if there is no direct English translation for that term.

     “Dhaniya patta” -> “Coriander leaves”
     “Naan” -> “Naan” (instead of “Indian bread”)

90 of 109

Culturally precise

Indian Bread

Naan

What food item is shown in the image?

Dhaniya Patta

Coriander leaves

What is the green substance in the image?

91 of 109

Challenges

  • Difficulty in sourcing annotators
    • Most annotators are from the US and India; African countries have few to no annotators available.
    • We had to pool annotators from multiple sources: MTurk, Masakhane (African regions), and universities in Montreal.
  • Adhering to instructions
    • Language barrier: annotators from non-English-speaking countries have a hard time referring to images in their questions.

92 of 109

CulturalVQA Benchmark

Our efforts are focused on collecting data from the following countries

93 of 109

Examples

What does the bride-to-be add to the groom-to-be's drink in the image, when he asks for her family's blessing to marry?

Ans: Salt

Turkey

What is around the bride and groom's neck?

Ans: Varmala

India

94 of 109

Examples

Canada

What do the feathers on his head mean?

Ans: Chief

Nigeria

The beaded headgear is from which culture?

Ans: Edo culture

95 of 109

CulturalVQA: Statistics

96 of 109

CulturalVQA: Statistics

97 of 109

CulturalVQA: Statistics

98 of 109

Evaluation Metric

We leverage the LAVE metric, which utilizes the in-context learning capabilities of instruction-tuned LLMs for VQA evaluation by formulating it as an answer-rating task.

Improving Automatic VQA Evaluation Using Large Language Models; Mañas et al.

99 of 109

Evaluation Metric

When is the object shown in the image usually used?

Ground Truth: praying or religious offerings

Model: Prayer time

String Matching: ✗ (no exact match, marked incorrect)

LAVE: ✓ (judged correct)

100 of 109

Evaluation Metric

What does the food kept in baskets represent in the image?

Ground Truth: Prasadam

Model: Prasad

String Matching: ✗ (no exact match, marked incorrect)

LAVE: ✓ (judged correct)

101 of 109

Benchmarking VLMs on CulturalVQA

102 of 109

Models perform better for North America than for Africa and Asia.

Closed-source models are strictly better than open-source models.

The gap between closed-source and open-source models increases as we go beyond North America!

103 of 109

Benchmarking SOTA VLMs on CulturalVQA

GPT-4 outperforms all models, with Gemini being second best.

InternVL 1.6 is the best open-source model (larger size plus more data).

104 of 109

Question: What type of Indian bread is this in the picture?

Pred: Khakhra

GT: Naan

Question: What is the local name for the white-colored food in the image in East Africa?

Pred: Ugali

GT: Kawunga

Question: For what occasion is the flower typically displayed in front of the capital building?

Pred: Tulip Festival

GT: Remembrance Day

Question: The beaded headgear is from which culture?

Pred: Benin Kingdom

GT: Edo culture

GPT4 Failure Cases

105 of 109

Conclusions

  • A large-scale repository of culturally relevant image-text pairs.
  • CulturalVQA – a new task to evaluate VLMs’ geo-diverse cultural understanding.
  • Comprehensive evaluation of both open-source and closed-source VLMs on the CulturalVQA benchmark, showing interesting findings.

106 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning
    • VLMs are still poor at visio-linguistic compositional reasoning
    • Fine-tuning with high-quality visual and textual hard-negatives is promising
    • Explicitly encouraging disparity between positive and hard-negative representations helps even further


107 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation
    • VQA: Using LLMs for reference-based evaluation is promising!
    • Image Captioning: Reference-free evaluation is not robust enough!


108 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding
    • Need to go beyond traditional recognition problems now that VLMs are quite strong
    • Evaluating cultural understanding is not easy
    • Although current VLMs have decent cultural understanding for North America, they perform poorly for Africa and Asia.


109 of 109

Vision-Language Challenges

  • Visio-linguistic compositional reasoning

  • Robust automatic evaluation

  • Cultural and geo-diverse understanding


Thanks! Questions?