
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Penamakuri1, Manish Gupta2, Mithun Das Gupta2 and Anand Mishra1

1Indian Institute of Technology Jodhpur, 2Microsoft

AI India Track @ AI-ML Systems 2023, Bengaluru

VQA is a well-studied problem!

One question to be answered in the context of one image! [Agrawal et al. 2015]

Proposed task: Retrieval-based Visual Question Answering (RetVQA)

One question to be answered in the context of multiple images!

Q: Do the rose and sunflower share the same color?

Step 1: Retrieval!

Step 2: Question Answering!

A: Yes, the rose and the sunflower share the same color.

Proposed task: RetVQA (Formal Definition)

Given a question (Q) and a set of N heterogeneous images (I), generate an answer (A) to the question, where only some images from I are relevant.
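The formal definition above can be sketched as a minimal pipeline interface. This is an illustrative sketch, not the paper's implementation: `score_relevance` and `generate_answer` are hypothetical stand-ins for the relevance encoder and MI-BART described on the following slides.

```python
# Hypothetical sketch of the RetVQA task: given a question and a pool of N
# images (only some relevant), retrieve evidence, then generate an answer.
def retvqa(question, image_pool, score_relevance, generate_answer, k=2):
    # Stages 1-2: score every (question, image) pair, keep the top-k images.
    ranked = sorted(image_pool,
                    key=lambda img: score_relevance(question, img),
                    reverse=True)
    retrieved = ranked[:k]
    # Stage 3: generate a free-form answer from the question + retrieved images.
    return generate_answer(question, retrieved)
```

The point of the signature is that answering is conditioned only on the retrieved subset, not on the full pool.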

Proposed framework: Relevance Encoder + Multi-Image-BART (MI-BART)

[Diagram: the question "Do the rose and sunflower share the same color?" alongside a multi-image context of N images.]

Stage 1: Relevance Encoding!

Goal: To learn the relevance between the question and the image, i.e., to learn whether an image is relevant or irrelevant with respect to the question.
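A minimal sketch of such a question-image relevance scorer, assuming precomputed features: a BERT vector for the question and per-region Faster R-CNN features for the image, as on the next slide. Fusing by concatenation and scoring with a single linear layer plus sigmoid is an illustrative stand-in, not the paper's exact architecture.

```python
import math
import random

D_TXT, D_IMG = 768, 2048  # typical BERT / Faster R-CNN feature dimensions
random.seed(0)
W = [random.gauss(0, 0.01) for _ in range(D_TXT + D_IMG)]  # toy weights
B = 0.0

def relevance_score(q_feat, region_feats):
    # Mean-pool the per-region image features into one image vector.
    n = len(region_feats)
    img_feat = [sum(r[d] for r in region_feats) / n for d in range(D_IMG)]
    fused = q_feat + img_feat                  # concatenate question + image
    logit = sum(f * w for f, w in zip(fused, W)) + B
    return 1.0 / (1.0 + math.exp(-logit))      # P(image relevant to question)
```

In practice the fusion would be a trained cross-modal transformer; the sketch only shows the input/output contract: one question-image pair in, one relevance probability out.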

Stage 2: Retrieval!

Goal: To retrieve the relevant images from the pool and discard the irrelevant ones.

[Diagram: each of the N images in the multi-image context is paired with the question and scored by the Relevance Encoder (BERT features for the question, Faster R-CNN features for the image); the top-k scoring images (1..K) form the retrieved context.]
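The top-k selection above is a standard operation; a sketch, assuming the relevance scores are already available as `(image_id, score)` pairs:

```python
import heapq

# Stage-2 retriever sketch: keep the k highest-scoring images, discard the rest.
def top_k_images(scores, k):
    # heapq.nlargest is O(N log k), which suits scoring a large image pool.
    return [img for img, s in heapq.nlargest(k, scores, key=lambda p: p[1])]
```
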

Stage 3: Question Answering!

Goal: To answer the question from the retrieved relevant images (1..K).

[Diagram: the question "Do the rose and sunflower share the same color?" and the K retrieved images are packed into a single "[CLS] question [SEP] image features" sequence for the MI-BART encoder (6 transformer encoder layers); the MI-BART decoder (6 transformer decoder layers) then generates the answer autoregressively, one token at a time: "Yes", ",", "the", "rose", "and", ..., "color".]
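The decoder's token-by-token generation shown above can be mimicked with a toy greedy decoder. The bigram table below stands in for MI-BART's learned decoder and is invented purely for illustration; the slide's "..." between "and" and "color" is kept abbreviated.

```python
# Toy autoregressive (token-by-token) greedy decoding, as in the slide build.
NEXT = {  # most-likely next token under a hypothetical model
    "<s>": "Yes", "Yes": ",", ",": "the", "the": "rose",
    "rose": "and", "and": "color", "color": "</s>",
}

def greedy_decode(start="<s>", max_len=10):
    tokens, cur = [], start
    while len(tokens) < max_len:
        cur = NEXT[cur]        # pick the most likely continuation
        if cur == "</s>":      # stop at the end-of-sequence token
            break
        tokens.append(cur)
    return tokens
```

Each step conditions on the previously emitted token, which is exactly the loop the animation builds were stepping through.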

Results: Datasets

  • RetVQA (proposed dataset)
    • Derived from Visual Genome [Krishna et al.]

    • Largest dataset (418K questions) requiring ≥ 2 images to answer a question.

    • Two answer types: open-set generative and binary generative.

    • Five question types:
      • Color
      • Shape
      • Count
      • Object attributes
      • Relation-based


Results: Quantitative Results

Results: Qualitative Results

Question: How many birds are pictured?
Context: [image pool]
GT: Four birds are pictured.
VLP Answer: Two birds are pictured.
MI-BART Answer (Ours): Four birds are pictured.

Question: What else eats same thing as brown horse?
Context: [image pool]
GT: Sheep eats same thing as brown horse.
VLP Answer: Zebra eats same thing as brown horse.
MI-BART Answer (Ours): Sheep eats same thing as brown horse.

Results: Qualitative Results

Question: What else eats same thing as brown horse?
Context: [image pool]
GT Answer: sheep eats same thing as brown horse
MI-BART: sheep eats same thing as brown horse

[Visualization over the generated tokens: "sheep eats same thing as brown horse".]

Summary

  • RetVQA (Retrieval-based VQA)
    • Different from retrieval-augmented generation

  • Proposed a unified Multi-Image BART (MI-BART) model that answers the question from the images retrieved by our relevance encoder.

  • Achieves 76.5% accuracy and 79.3% fluency on the new RetVQA dataset.

  • Outperforms the state of the art by 4.9% in accuracy and 11.8% in fluency on the image segment of the WebQA dataset.


Thank You!

Visit our project page for Dataset and Code.