
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Subramanyam Penamakuri1, Manish Gupta2,

Mithun Das Gupta2 and Anand Mishra1

1Indian Institute of Technology Jodhpur

2Microsoft

VQA is a well-studied problem!

One question in the context of a single image! [Agrawal et al., 2015]

Proposed task: Retrieval-based Visual Question Answering (RetVQA)

One question in the context of multiple images!

Q: Do the rose and sunflower share the same color?

Step 1: Retrieval!

Step 2: Question Answering!

A: No, the rose and the sunflower do not share the same color.

Proposed task: RetVQA, formal definition

Given a question Q and a pool I of N heterogeneous images, generate an answer A to Q, where only some of the images in I are relevant.
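The task definition above can be sketched as a two-stage interface: select the relevant subset of the pool, then answer from that subset alone. This is a minimal toy illustration, not the paper's method; the functions `retrieve` and `answer`, the tag-overlap heuristic, and the image dictionaries are all hypothetical stand-ins for the learned models described in the following slides.

```python
def retrieve(question: str, pool: list) -> list:
    """Stand-in retriever: keep images whose tags overlap the question words."""
    words = set(question.lower().split())
    return [img for img in pool if words & set(img["tags"])]

def answer(question: str, relevant: list) -> str:
    """Stand-in answerer: the real system uses a generative model instead."""
    colors = {img["tags"][-1] for img in relevant}  # hypothetical color attribute
    return "yes" if len(colors) == 1 else "no"

# Toy pool: only two of the three images are relevant to the question.
pool = [
    {"id": 1, "tags": ["rose", "red"]},
    {"id": 2, "tags": ["dog", "brown"]},
    {"id": 3, "tags": ["sunflower", "yellow"]},
]
question = "do the rose and sunflower share the same color ?"
relevant = retrieve(question, pool)
print([img["id"] for img in relevant])  # the irrelevant image is dropped
print(answer(question, relevant))
```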

Proposed framework: Relevance Encoder + Multi-Image BART (MI-BART)

Stage 1: Relevance Encoding!

Goal: learn the relevance between the question and each image, i.e., whether an image is relevant or irrelevant with respect to the question.

[Diagram: the question "Do the rose and sunflower share the same color?" is encoded with BERT [Devlin et al., 2019]; each image in the multi-image context (1..N) is encoded with Faster R-CNN [Ren et al., 2015b]; both feed the Relevance Encoder.]

The Relevance Encoder is pre-trained on MS-COCO using two objectives:

  1. Image-Text Matching (ITM)
  2. Masked Language Modeling (MLM)

It is then fine-tuned on our proposed RetVQA dataset with the ITM objective!
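In the spirit of the ITM objective, relevance can be read as the probability that a question-image pair matches. The sketch below is illustrative only: the embeddings are made-up numbers and the scoring rule (sigmoid of the best question-region dot product) is a toy stand-in for the transformer cross-encoder over BERT text features and Faster R-CNN region features.

```python
import math

def itm_score(question_vec, region_vecs):
    """Relevance = sigmoid of the best question-region dot product."""
    best = max(sum(q * r for q, r in zip(question_vec, region))
               for region in region_vecs)
    return 1.0 / (1.0 + math.exp(-best))

q = [1.0, 0.0, 1.0]                                  # toy question embedding
relevant_img = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0]]    # one region matches well
irrelevant_img = [[-0.5, 0.2, -0.4]]                 # no region matches
print(itm_score(q, relevant_img) > itm_score(q, irrelevant_img))  # True
```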

Stage 2: Retrieval!

Goal: retrieve the relevant images from the pool and discard the irrelevant ones.

[Diagram: the Relevance Encoder scores each of the N images in the pool against the question; the top-k highest-scoring images (1..K) form the retrieved context.]
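The top-k step itself is simple to state precisely: rank the pool by relevance score and keep the k best. The scores below are made-up numbers; in the framework they come from the fine-tuned ITM head of the Relevance Encoder.

```python
def top_k(scores, k):
    """Return indices of the k highest-scoring images, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.12, 0.91, 0.07, 0.88, 0.33]  # one relevance score per pooled image
print(top_k(scores, k=2))                # → [1, 3]
```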

Stage 3: Question Answering!

Goal: answer the question from the retrieved relevant images.

[Diagram: the question and the K retrieved images, arranged as a [CLS] ... [SEP] sequence, are fed to the MI-BART encoder (6 transformer encoder layers); the MI-BART decoder (6 transformer decoder layers) then generates the answer token by token: "No, the rose and the sunflower do not share the same color."]
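The token-by-token generation shown on the slides follows the standard autoregressive loop: each step conditions on the tokens emitted so far (and, in the real model, on the encoder output). Below is a minimal sketch under that assumption; `TOY_LM` is a toy lookup table standing in for the 6-layer transformer decoder, and the encoder context is omitted for brevity.

```python
TOY_LM = {
    (): "No",
    ("No",): ",",
    ("No", ","): "the",
    ("No", ",", "the"): "rose",
    ("No", ",", "the", "rose"): "<eos>",
}

def generate(next_token, max_len=10):
    """Greedy autoregressive decoding until <eos> or max_len."""
    tokens = []
    for _ in range(max_len):
        tok = next_token(tuple(tokens))
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(generate(TOY_LM.get))  # → ['No', ',', 'the', 'rose']
```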

Results: Datasets

  • RetVQA (proposed dataset)
    • Derived from Visual Genome [Krishna et al., 2017]

    • Largest dataset (418K questions) that needs ≥ 2 images to answer a question.

    • Two types of answers:
      • Open-set generative, e.g., "The colors of the rose and the sunflower are red and yellow respectively."
      • Binary generative, e.g., "No, the rose and sunflower do not share the same color."

    • Five types of questions: color, shape, count, object-attributes, and relation-based.

  • WebQA [image-only segment] [Chang et al., 2022]

Results: Metrics

  • Accuracy (A): does the generated answer contain the correct answer?

  • Fluency (F): P(generated answer | ground-truth answer) under BART, normalised by P(ground-truth answer | ground-truth answer) [Yuan et al., 2021].
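The fluency metric can be made concrete as a ratio of sequence scores: the generated answer's likelihood under a reference-conditioned model, divided by the likelihood the ground-truth answer assigns to itself (so an exact match scores 1.0). The scorer below is a toy stand-in, not BART: `seq_logprob` simply rewards token overlap with the reference, which is enough to show the normalisation mechanics.

```python
import math

def seq_logprob(hypothesis, reference):
    """Toy scorer: tokens that appear in the reference are 'likely'."""
    ref = set(reference.split())
    return sum(math.log(0.9) if w in ref else math.log(0.1)
               for w in hypothesis.split())

def fluency(generated, ground_truth):
    """Normalised likelihood ratio, as in the slide's definition."""
    return math.exp(seq_logprob(generated, ground_truth)
                    - seq_logprob(ground_truth, ground_truth))

gt = "no the rose and sunflower do not share the same color"
good = "no the rose and sunflower do not share the same color"
bad = "yes they are both red flowers"
print(fluency(good, gt))       # 1.0 for an exact match
print(fluency(bad, gt) < 1.0)  # True
```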

Results: Quantitative Results


Results: Qualitative Results

Question: How many birds are pictured?
GT: Four birds are pictured.
VLP Answer: Two birds are pictured.
MI-BART Answer (Ours): Four birds are pictured.

Question: What else eats same thing as brown horse?
GT: Sheep eats same thing as brown horse.
VLP Answer: Zebra eats same thing as brown horse.
MI-BART Answer (Ours): Sheep eats same thing as brown horse.

Results: Qualitative Results

Question: What else eats same thing as brown horse?
GT Answer: sheep eats same thing as brown horse
MI-BART: sheep eats same thing as brown horse

[Visualization: the generated answer token by token, "sheep eats same thing as brown horse", over the retrieved context.]

Summary

  • Introduced RetVQA (Retrieval-based Visual Question Answering), a task distinct from retrieval-augmented generation.

  • Proposed a unified Multi-Image BART (MI-BART) model that answers the question from the images retrieved by our relevance encoder.

  • Achieves an accuracy of 76.5% and a fluency of 79.3% on the new RetVQA dataset.

  • Outperforms the state of the art by 4.9% in accuracy and 11.8% in fluency on the image segment of the WebQA dataset.

Thank You!

Visit our project page for Dataset and Code.