Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Abhirama Penamakuri¹, Manish Gupta², Mithun Das Gupta² and Anand Mishra¹
¹Indian Institute of Technology Jodhpur, ²Microsoft
AI India Track @ AI-ML Systems 2023, Bengaluru
VQA is a well studied problem!
[Agrawal et al. 2015]
1 Question to be answered in the context of “1” image!
Proposed task: Retrieval-based Visual Question Answering (RetVQA)
1 Question to be answered in the context of “Multiple” images!
Do the rose and sunflower share the same color?
Step 1: Retrieval!
Step 2: Question Answering!
Do the rose and sunflower share the same color?
A: Yes, the rose and the sunflower share the same color.
Proposed task: RetVQA - Formal Definition
Given a question Q and a set I of N heterogeneous images, generate an answer A to Q, where only some of the images in I are relevant to Q.
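One way to make the definition concrete is to write a task instance as a small record. This is only an illustrative sketch: the class name, field names, and image identifiers are assumptions and do not come from the released dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetVQAExample:
    """One task instance. All field names here are hypothetical."""
    question: str            # Q
    image_ids: List[str]     # the pool I of N heterogeneous images
    relevant_ids: List[str]  # the images in I that support the answer
    answer: str              # A, a generated free-form answer

    def __post_init__(self) -> None:
        # Every relevant image must come from the pool.
        missing = set(self.relevant_ids) - set(self.image_ids)
        if missing:
            raise ValueError(f"relevant images not in pool: {sorted(missing)}")


# Toy instance built from the running example; the image ids are made up.
example = RetVQAExample(
    question="Do the rose and sunflower share the same color?",
    image_ids=["img_1", "img_2", "img_3", "img_4"],
    relevant_ids=["img_1", "img_3"],
    answer="Yes, the rose and the sunflower share the same color.",
)
```

At inference time only `question` and `image_ids` would be visible to a model; `relevant_ids` and `answer` play the role of supervision.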
Do the rose and sunflower share the same color?
.
.
.
Multi
Image
Context
1
2
N
Proposed framework: (Relevance encoder + Multi-Image-BART)
[Figure: the question "Do the rose and sunflower share the same color?" alongside a multi-image context of images 1, 2, …, N]
Stage 1: Relevance Encoding!
Goal: To learn whether each image in the pool is relevant or irrelevant to the question.
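A minimal sketch of relevance scoring as binary classification over (question, image) pairs. The framework diagram labels BERT and Faster RCNN; here both are replaced by hand-written toy vectors, and the dot-product scorer and the binary cross-entropy objective are assumptions for illustration, not the authors' stated design.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relevance_score(q_vec: Sequence[float], img_vec: Sequence[float],
                    bias: float = 0.0) -> float:
    """Score in (0, 1): how relevant an image is to a question."""
    if len(q_vec) != len(img_vec):
        raise ValueError("question and image vectors must have the same length")
    logit = sum(q * v for q, v in zip(q_vec, img_vec)) + bias
    return sigmoid(logit)


def bce_loss(score: float, label: int) -> float:
    """Binary cross-entropy for one pair; label 1 means relevant."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(score + eps)
             + (1 - label) * math.log(1.0 - score + eps))


# Toy embeddings (hypothetical): one image aligned with the question, one opposed.
question_vec = [1.0, 0.5, -0.5]
rose_vec = [2.0, 1.0, -1.0]
car_vec = [-2.0, -1.0, 1.0]

rose_score = relevance_score(question_vec, rose_vec)
car_score = relevance_score(question_vec, car_vec)
```

With these toy vectors the aligned image scores above 0.5 and the opposed one below it, which is the separation a trained relevance encoder is meant to produce.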
[Figure: the question and each image of the multi-image context (1, 2, …, N) are fed to a Relevance Encoder; diagram labels: BERT, Faster RCNN]
Stage 2: Retriever!
Goal: To retrieve the relevant images from the pool and discard the irrelevant ones.
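The retrieval step can be sketched as a top-k selection over per-image relevance scores (the framework diagram labels this step "Top-k"). The function name, the image ids, and the scores are hypothetical, and breaking ties by image id is an added assumption made only to keep the output deterministic.

```python
from typing import Dict, List


def retrieve_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k highest-scoring images, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by descending score; ties are broken by image id.
    ranked = sorted(scores, key=lambda image_id: (-scores[image_id], image_id))
    return ranked[:k]


# Made-up relevance scores for a pool of N = 4 images.
pool_scores = {"img_1": 0.91, "img_2": 0.08, "img_3": 0.77, "img_4": 0.12}
retrieved = retrieve_top_k(pool_scores, k=2)
```

Here `retrieved` holds the two images that form the retrieved context passed to the question answering stage.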
[Figure: the Relevance Encoder outputs for images 1, 2, …, N pass through a Top-k selection, yielding the retrieved context of images 1, …, K]
Stage 3: Question Answering!
Goal: To answer the question from the retrieved relevant images.
[Figure: the MI-BART Encoder (6 transformer encoder layers) receives [CLS] and [SEP] tokens, the question "Do the rose and sunflower share the same color?", and the retrieved images 1, …, K; the MI-BART Decoder (6 transformer decoder layers) generates the answer one token at a time: "Yes", ",", "the", "rose", "and", …, "color"]
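The token-by-token generation shown in the slides can be sketched as a decoding loop. Greedy decoding is an assumption here (the slides do not say which decoding strategy is used), and `toy_scores` is a hypothetical stand-in for the MI-BART decoder that simply prefers the next token of a fixed answer.

```python
from typing import Callable, Dict, List

EOS = "</s>"  # end-of-sequence marker (assumed name)


def greedy_decode(next_token_scores: Callable[[List[str]], Dict[str, float]],
                  max_len: int = 20) -> List[str]:
    """Repeatedly append the highest-scoring next token until EOS or max_len."""
    tokens: List[str] = []
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


TARGET = ["Yes", ",", "the", "rose", "and", "the", "sunflower",
          "share", "the", "same", "color", "."]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Toy decoder: score 1.0 for the next token of TARGET, 0.0 otherwise."""
    nxt = TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS
    scores = {tok: 0.0 for tok in set(TARGET) | {EOS}}
    scores[nxt] = 1.0
    return scores


answer_tokens = greedy_decode(toy_scores)
```

In a real model the scores would depend on the encoded question and retrieved images, not only on the prefix length.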
Results: Datasets
Results: Quantitative Results
Results: Qualitative Results
Question: How many birds are pictured?
Context: …
GT: Four birds are pictured.
VLP Answer: Two birds are pictured.
MI-BART Answer (Ours): Four birds are pictured.

Question: What else eats same thing as brown horse?
Context: …
GT: Sheep eats same thing as brown horse.
VLP Answer: Zebra eats same thing as brown horse.
MI-BART Answer (Ours): Sheep eats same thing as brown horse.
Question: What else eats same thing as brown horse?
GT Answer: sheep eats same thing as brown horse
MI-BART: sheep eats same thing as brown horse
[Figure: context images with one panel per generated answer token: sheep, eats, same, thing, as, brown, horse]
Summary
Thank You!
Visit our project page for Dataset and Code.