Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
Abhirama Subramanyam Penamakuri¹, Manish Gupta², Mithun Das Gupta² and Anand Mishra¹
¹Indian Institute of Technology Jodhpur
²Microsoft
VQA is a well studied problem!
[Agrawal et al. 2015]
1 Question in the context of “1” image!
Proposed task: Retrieval-based Visual Question Answering (RetVQA)
1 Question in the context of “Multiple” images!
Do the rose and sunflower share the same color?
Step 1: Retrieval!
Step 2: Question Answering!
A: No, the rose and the sunflower do not share the same color.
Proposed task: RetVQA: Formal Definition
Given a question (Q) and a set of N heterogeneous images (I), generate an answer (A) to Q, where only some of the images in I are relevant to Q.
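As an illustration of the task definition, one RetVQA instance can be sketched as a small record. This is only a sketch: the class name `RetVQAExample`, its field names, and the image file names are hypothetical and are not taken from the released dataset format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetVQAExample:
    """One RetVQA instance (hypothetical layout, for illustration only)."""
    question: str                # Q
    image_pool: List[str]        # I: identifiers of the N heterogeneous images
    relevant_indices: List[int]  # positions in image_pool that are relevant to Q
    answer: str                  # A

    def relevant_images(self) -> List[str]:
        """Return the subset of the pool that is relevant to the question."""
        return [self.image_pool[i] for i in self.relevant_indices]


# Hypothetical pool: only two of the four images matter for this question.
example = RetVQAExample(
    question="Do the rose and sunflower share the same color?",
    image_pool=["rose.jpg", "car.jpg", "sunflower.jpg", "dog.jpg"],
    relevant_indices=[0, 2],
    answer="No, the rose and the sunflower do not share the same color.",
)
```

At inference time only `question` and `image_pool` would be given; the relevant subset has to be found by the model.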
Proposed framework: (Relevance encoder + Multi-Image-BART)
[Figure: the question “Do the rose and sunflower share the same color?” and a multi-image context of N images (1, 2, ..., N).]
Stage 1: Relevance Encoding!
Goal: To learn the relevance between the question and the image, i.e., to learn whether an image is relevant or irrelevant with respect to the question.
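A minimal sketch of what a question-image relevance score could look like, assuming the question and the image have already been turned into fixed-length feature vectors. The function `relevance_score`, the single linear layer, and the toy vectors are illustrative assumptions; the actual Relevance Encoder in the slides is a learned model and is not reproduced here.

```python
import math
from typing import Sequence


def relevance_score(question_vec: Sequence[float],
                    image_vec: Sequence[float],
                    weights: Sequence[float],
                    bias: float = 0.0) -> float:
    """Map a (question, image) pair to a relevance probability in (0, 1).

    Hypothetical scorer: concatenate the two feature vectors, apply one
    linear layer, and squash the result with a sigmoid.
    """
    joint = list(question_vec) + list(image_vec)
    if len(joint) != len(weights):
        raise ValueError("weights must match the joint feature length")
    logit = sum(x * w for x, w in zip(joint, weights)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Toy usage: a score near 1 means "relevant", near 0 means "irrelevant".
score = relevance_score([1.0], [1.0], [2.0, 2.0])
```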
BERT [Devlin et al., 2019] encodes the question.
Faster RCNN [Ren et al., 2015b] extracts features from each image.
Relevance Encoder is pre-trained on MS-COCO using two objectives.
Relevance Encoder is fine-tuned on our proposed RetVQA dataset with the image-text matching (ITM) objective!
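The slides do not spell out the loss used for ITM fine-tuning. A common choice for image-text matching is binary cross-entropy over relevant/irrelevant labels, and the sketch below assumes that choice; `itm_loss` and its arguments are hypothetical names.

```python
import math
from typing import Sequence


def itm_loss(scores: Sequence[float],
             labels: Sequence[int],
             eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between relevance scores and 0/1 labels.

    scores: predicted probabilities that each image matches the question.
    labels: 1 if the image is relevant to the question, 0 otherwise.
    Assumption: ITM is treated as binary classification.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and aligned")
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)


# Confident, correct predictions give a lower loss than confident, wrong ones.
good = itm_loss([0.9, 0.1], [1, 0])
bad = itm_loss([0.1, 0.9], [1, 0])
```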
Stage 2: Retriever!
To retrieve relevant images from the pool and discard the irrelevant ones!
[Figure: the Relevance Encoder scores each of the N images against the question; the Top-k images (1, ..., K) form the Retrieved Context.]
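The retrieval step can be sketched as ranking the pool by relevance score and keeping the K best images. The helper `retrieve_top_k` and the example scores are hypothetical; how K is chosen is not stated in the slides.

```python
from typing import List, Sequence


def retrieve_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring images, best first.

    scores[i] is assumed to be the relevance of image i to the question.
    Ties keep their original pool order because Python's sort is stable.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of four images; images 1 and 3 score highest.
retrieved = retrieve_top_k([0.1, 0.9, 0.3, 0.8], k=2)
```

The selected indices identify the images that are passed on to the question answering stage; everything else in the pool is discarded.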
Stage 3: Question Answering!
Goal: To answer the question from the retrieved relevant images.
MI-BART Encoder (6 transformer encoder layers)
MI-BART Decoder (6 transformer decoder layers)
[Figure: the question and the K retrieved images, together with [CLS] and [SEP] tokens, are fed to the MI-BART encoder; the MI-BART decoder generates the answer one token at a time: “No”, “,”, “the”, “rose”, “and”, …, “color”, “.”]
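Token-by-token answer generation can be sketched as an autoregressive loop. The slides do not say which decoding strategy MI-BART uses, so greedy decoding is an assumption here, and `toy_next_token` is a hypothetical stand-in that replays a fixed answer instead of running a trained decoder over the encoder output.

```python
from typing import Callable, List, Sequence


def greedy_decode(next_token: Callable[[Sequence[str]], str],
                  eos: str = "</s>",
                  max_len: int = 32) -> List[str]:
    """Generate tokens one at a time until end-of-sequence or max_len.

    next_token maps the tokens generated so far to the next token. In a
    real model it would also condition on the encoded question and images.
    """
    tokens: List[str] = []
    for _ in range(max_len):
        tok = next_token(tokens)
        if tok == eos:
            break
        tokens.append(tok)
    return tokens


# Hypothetical stand-in for the decoder: replays the answer from the slides.
CANNED = ["No", ",", "the", "rose", "and", "the", "sunflower",
          "do", "not", "share", "the", "same", "color", "."]


def toy_next_token(prefix: Sequence[str]) -> str:
    return CANNED[len(prefix)] if len(prefix) < len(CANNED) else "</s>"


answer_tokens = greedy_decode(toy_next_token)
```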
Results: Datasets
“The colors of the rose and the sunflower are red and yellow respectively.”
“No, the rose and sunflower do not share the same color.”
Results: Metrics
Results: Quantitative Results
Results: Qualitative Results
Question: How many birds are pictured?
Context: …
GT: Four birds are pictured.
VLP Answer: Two birds are pictured.
MI-BART Answer (Ours): Four birds are pictured.
Question: What else eats same thing as brown horse?
Context: …
GT: Sheep eats same thing as brown horse.
VLP Answer: Zebra eats same thing as brown horse.
MI-BART Answer (Ours): Sheep eats same thing as brown horse.
[Figure: per-word visualization of the MI-BART answer to “What else eats same thing as brown horse?”: sheep, eats, same, thing, as, brown, horse. The GT answer and the MI-BART answer are both “sheep eats same thing as brown horse”.]
Summary
Thank You!
Visit our project page for Dataset and Code.