
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Subramanyam Penamakuri1, Manish Gupta2,

Mithun Das Gupta2 and Anand Mishra1

1Indian Institute of Technology Jodhpur

2Microsoft

VQA is a well-studied problem!

One question in the context of a single image! [Agrawal et al., 2015]

Proposed task: Retrieval-based Visual Question Answering (RetVQA)

One question in the context of multiple images!

Q: Do the rose and sunflower share the same color?

Step 1: Retrieval!

Step 2: Question Answering!

A: No, the rose and the sunflower do not share the same color.

Proposed task: RetVQA, formal definition

Given a question Q and a pool I of N heterogeneous images, generate an answer A to Q, where only some of the images in I are relevant.
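The task definition above can be sketched as a two-stage interface: select the relevant subset of the pool, then answer from that subset alone. This is a minimal toy illustration, not the paper's method; the functions `retrieve` and `answer`, the tag-overlap heuristic, and the image dictionaries are all hypothetical stand-ins for the learned models described in the following slides.

```python
def retrieve(question: str, pool: list) -> list:
    """Stand-in retriever: keep images whose tags overlap the question words."""
    words = set(question.lower().split())
    return [img for img in pool if words & set(img["tags"])]

def answer(question: str, relevant: list) -> str:
    """Stand-in answerer: the real system uses a generative model instead."""
    colors = {img["tags"][-1] for img in relevant}  # hypothetical color attribute
    return "yes" if len(colors) == 1 else "no"

# Toy pool: only two of the three images are relevant to the question.
pool = [
    {"id": 1, "tags": ["rose", "red"]},
    {"id": 2, "tags": ["dog", "brown"]},
    {"id": 3, "tags": ["sunflower", "yellow"]},
]
question = "do the rose and sunflower share the same color ?"
relevant = retrieve(question, pool)
print([img["id"] for img in relevant])  # the irrelevant image is dropped
print(answer(question, relevant))
```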

Proposed framework: Relevance Encoder + Multi-Image BART (MI-BART)

Stage 1: Relevance Encoding!

Goal: learn the relevance between the question and each image, i.e., whether an image is relevant or irrelevant with respect to the question.

[Diagram: the question "Do the rose and sunflower share the same color?" is encoded with BERT [Devlin et al., 2019]; each image in the multi-image context (1..N) is encoded with Faster R-CNN [Ren et al., 2015b]; both feed the Relevance Encoder.]

The Relevance Encoder is pre-trained on MS-COCO using two objectives:

  1. Image-Text Matching (ITM)
  2. Masked Language Modeling (MLM)

It is then fine-tuned on our proposed RetVQA dataset with the ITM objective!
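In the spirit of the ITM objective, relevance can be read as the probability that a question-image pair matches. The sketch below is illustrative only: the embeddings are made-up numbers and the scoring rule (sigmoid of the best question-region dot product) is a toy stand-in for the transformer cross-encoder over BERT text features and Faster R-CNN region features.

```python
import math

def itm_score(question_vec, region_vecs):
    """Relevance = sigmoid of the best question-region dot product."""
    best = max(sum(q * r for q, r in zip(question_vec, region))
               for region in region_vecs)
    return 1.0 / (1.0 + math.exp(-best))

q = [1.0, 0.0, 1.0]                                  # toy question embedding
relevant_img = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0]]    # one region matches well
irrelevant_img = [[-0.5, 0.2, -0.4]]                 # no region matches
print(itm_score(q, relevant_img) > itm_score(q, irrelevant_img))  # True
```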

Stage 2: Retrieval!

Goal: retrieve the relevant images from the pool and discard the irrelevant ones.

[Diagram: the Relevance Encoder scores each of the N images in the pool against the question; the top-k highest-scoring images (1..K) form the retrieved context.]
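The top-k step itself is simple to state precisely: rank the pool by relevance score and keep the k best. The scores below are made-up numbers; in the framework they come from the fine-tuned ITM head of the Relevance Encoder.

```python
def top_k(scores, k):
    """Return indices of the k highest-scoring images, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.12, 0.91, 0.07, 0.88, 0.33]  # one relevance score per pooled image
print(top_k(scores, k=2))                # → [1, 3]
```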

Stage 3: Question Answering!

Goal: answer the question from the retrieved relevant images.

[Diagram: the question and the K retrieved images, arranged as a [CLS] ... [SEP] sequence, are fed to the MI-BART encoder (6 transformer encoder layers); the MI-BART decoder (6 transformer decoder layers) then generates the answer token by token: "No, the rose and the sunflower do not share the same color."]
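The token-by-token generation shown on the slides follows the standard autoregressive loop: each step conditions on the tokens emitted so far (and, in the real model, on the encoder output). Below is a minimal sketch under that assumption; `TOY_LM` is a toy lookup table standing in for the 6-layer transformer decoder, and the encoder context is omitted for brevity.

```python
TOY_LM = {
    (): "No",
    ("No",): ",",
    ("No", ","): "the",
    ("No", ",", "the"): "rose",
    ("No", ",", "the", "rose"): "<eos>",
}

def generate(next_token, max_len=10):
    """Greedy autoregressive decoding until <eos> or max_len."""
    tokens = []
    for _ in range(max_len):
        tok = next_token(tuple(tokens))
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(generate(TOY_LM.get))  # → ['No', ',', 'the', 'rose']
```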

Results: Datasets

  • RetVQA (proposed dataset)
    • Derived from Visual Genome [Krishna et al., 2017]

    • Largest dataset (418K questions) that needs ≥ 2 images to answer a question.

    • Two types of answers:
      • Open-set generative, e.g., "The colors of the rose and the sunflower are red and yellow respectively."
      • Binary generative, e.g., "No, the rose and sunflower do not share the same color."

    • Five types of questions: color, shape, count, object-attributes, and relation-based.

  • WebQA [image-only segment] [Chang et al., 2022]

Results: Metrics

  • Accuracy (A): does the generated answer contain the correct answer?

  • Fluency (F): P(generated answer | ground-truth answer) under BART, normalised by P(ground-truth answer | ground-truth answer) [Yuan et al., 2021].
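The fluency metric can be made concrete as a ratio of sequence scores: the generated answer's likelihood under a reference-conditioned model, divided by the likelihood the ground-truth answer assigns to itself (so an exact match scores 1.0). The scorer below is a toy stand-in, not BART: `seq_logprob` simply rewards token overlap with the reference, which is enough to show the normalisation mechanics.

```python
import math

def seq_logprob(hypothesis, reference):
    """Toy scorer: tokens that appear in the reference are 'likely'."""
    ref = set(reference.split())
    return sum(math.log(0.9) if w in ref else math.log(0.1)
               for w in hypothesis.split())

def fluency(generated, ground_truth):
    """Normalised likelihood ratio, as in the slide's definition."""
    return math.exp(seq_logprob(generated, ground_truth)
                    - seq_logprob(ground_truth, ground_truth))

gt = "no the rose and sunflower do not share the same color"
good = "no the rose and sunflower do not share the same color"
bad = "yes they are both red flowers"
print(fluency(good, gt))       # 1.0 for an exact match
print(fluency(bad, gt) < 1.0)  # True
```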

Results: Quantitative Results


Results: Qualitative Results

Question: How many birds are pictured?
GT: Four birds are pictured.
VLP Answer: Two birds are pictured.
MI-BART Answer (Ours): Four birds are pictured.

Question: What else eats same thing as brown horse?
GT: Sheep eats same thing as brown horse.
VLP Answer: Zebra eats same thing as brown horse.
MI-BART Answer (Ours): Sheep eats same thing as brown horse.

Results: Qualitative Results

Question: What else eats same thing as brown horse?
GT Answer: sheep eats same thing as brown horse
MI-BART: sheep eats same thing as brown horse

[Visualization: the generated answer token by token, "sheep eats same thing as brown horse", over the retrieved context.]

Summary

  • Introduced RetVQA (Retrieval-based Visual Question Answering), a task distinct from retrieval-augmented generation.

  • Proposed a unified Multi-Image BART (MI-BART) model that answers the question from the images retrieved by our relevance encoder.

  • Achieves an accuracy of 76.5% and a fluency of 79.3% on the new RetVQA dataset.

  • Outperforms the state of the art by 4.9% in accuracy and 11.8% in fluency on the image segment of the WebQA dataset.

Thank You!

Visit our project page for Dataset and Code.