1 of 31

Visual Question Answering

Yash Vadi

Uday Karan Kapur

21st Feb 2024

2 of 31

Overview

  • What is VQA?
  • Datasets for VQA tasks
  • Methods
  • Paper 1: Revisiting Visual Question Answering Baselines
  • Paper 2: Deep Modular Co-Attention Networks for Visual Question Answering

3 of 31

What is VQA?

  • Why VQA, and why is it harder than image captioning?
    • Given an image, can a machine answer questions about it in natural language?
    • Image captions tend to be generic; VQA demands an answer to a specific query.
  • Questions in VQA may require:
    • Descriptive properties of objects
    • Counting
    • Common-sense knowledge

4 of 31

Datasets

  • VQA: Human-annotated dataset with open-ended and multiple-choice answers; each question is answered by 10 human annotators.
  • COCO-QA: Built on COCO images, which are annotated with questions about objects, scenes, and relationships; the question-answer pairs are generated automatically from the image captions.
  • Visual7W: Associates images with "who," "what," "where," "when," "why," "how," and "which" questions, aiming to cover a diverse range of visual reasoning tasks through natural language prompts.
  • …and others.

5 of 31

Methods

  • Generation
    • Autoregressively generate the answer with a generative model conditioned on both the image and the question
  • Classification
    • Classify the correct answer from a set of predefined options
  • Joint-embedding learning (a minimal sketch follows this list)
    • Image encoder + bag-of-words
    • Image encoder + LSTM
    • Image encoder and text encoder + multimodal bilinear pooling
  • Attention-based
    • Stacked attention, co-attention, self-attention
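A minimal PyTorch sketch of the joint-embedding route (pooled image features + LSTM question encoder, fused by concatenation and fed to a classifier over a fixed answer set). The layer sizes and feature dimensions are illustrative assumptions, not taken from a specific paper:

    import torch
    import torch.nn as nn

    class JointEmbeddingVQA(nn.Module):
        """Joint-embedding baseline: pooled image features + LSTM question encoder,
        fused by concatenation and classified over a fixed answer vocabulary."""
        def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=300, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.img_proj = nn.Linear(img_dim, hid_dim)
            self.classifier = nn.Sequential(
                nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, num_answers),
            )

        def forward(self, img_feats, question_tokens):
            # img_feats: (B, img_dim) pooled CNN features; question_tokens: (B, T) word ids
            _, (h, _) = self.lstm(self.embed(question_tokens))
            q = h[-1]                                # final hidden state as question vector
            v = torch.relu(self.img_proj(img_feats))
            return self.classifier(torch.cat([v, q], dim=-1))  # answer logits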

6 of 31

Revisiting Visual Question Answering Baselines by Jabri et al.

7 of 31

Revisiting Visual Question Answering Baselines

  • Instead of predicting an answer class given the question and image, the model predicts the correctness of an entire image-question-answer triplet (a sketch follows this list).
  • Questioned the significance of the reasoning-based methods of the time.
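A minimal sketch of the triplet-scoring idea, assuming precomputed image features and fixed-length text features for the question and a candidate answer; the hidden size and feature choices are illustrative, in the spirit of the paper rather than its exact configuration:

    import torch
    import torch.nn as nn

    class TripletScorer(nn.Module):
        """Scores an (image, question, answer) triplet as correct or not,
        instead of classifying the answer directly."""
        def __init__(self, img_dim=2048, txt_dim=300, hid_dim=8192):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(img_dim + 2 * txt_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, 1),   # logit for "this answer is correct"
            )

        def forward(self, img_feats, q_feats, a_feats):
            x = torch.cat([img_feats, q_feats, a_feats], dim=-1)
            return self.mlp(x).squeeze(-1)

At test time, every candidate answer is scored against the image-question pair and the highest-scoring one is selected.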

8 of 31

Results

  • MLP(A): Answer only
  • MLP(A, Q): Answer + Question
  • MLP(A, Q, I): Answer + Question + Image
  • MLP(A, I): Answer + Image

9 of 31

10 of 31

Ablations and Error Analysis

  • The approach is dataset-independent and can handle an arbitrary number of answer classes.
  • Shape, color, count: poor performance; the model mostly learns the dataset's biases.
  • Action: the model does learn something.
    • Image features transfer well to the action-recognition task.
  • Causality: “why” questions can most of the time be answered from text alone, using common-sense reasoning.

11 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

12 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

  • Before we dive into the paper, let’s look at co-attention.

How many puppies are in the image?

How many puppies can you see in the image?

  • The two questions above are equivalent, and their meaning is already apparent from the first three words.
  • Models that pay attention to those first three words will be more robust to variations in wording.

13 of 31

Co-Attention

  • The co-attention mechanism learns textual attention over the question jointly with visual attention over the image (a sketch follows this list).
  • In other words, it also addresses the problem of which words to attend to.
  • Initially introduced in “Hierarchical Question-Image Co-Attention for Visual Question Answering” by Lu et al. at Virginia Tech and the Georgia Institute of Technology.
  • Improved state-of-the-art performance on the VQA and COCO-QA datasets.
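A rough sketch of parallel co-attention in the spirit of Lu et al.: an affinity matrix between question words and image regions yields attention over words and over regions simultaneously. Dimensions, initialization, and the projection size k are illustrative assumptions:

    import torch
    import torch.nn as nn

    class ParallelCoAttention(nn.Module):
        def __init__(self, q_dim, v_dim, k=256):
            super().__init__()
            self.W_b = nn.Parameter(0.01 * torch.randn(q_dim, v_dim))  # affinity weights
            self.W_q = nn.Linear(q_dim, k)
            self.W_v = nn.Linear(v_dim, k)
            self.w_hq = nn.Linear(k, 1)
            self.w_hv = nn.Linear(k, 1)

        def forward(self, Q, V):
            # Q: (B, T, q_dim) word features; V: (B, R, v_dim) region features
            C = torch.tanh(Q @ self.W_b @ V.transpose(1, 2))                 # (B, T, R) affinity
            H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))  # (B, R, k)
            H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                  # (B, T, k)
            a_v = torch.softmax(self.w_hv(H_v), dim=1)   # visual attention over regions
            a_q = torch.softmax(self.w_hq(H_q), dim=1)   # textual attention over words
            return (a_v * V).sum(dim=1), (a_q * Q).sum(dim=1)  # attended image, question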

14 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

  • Proposes a novel Modular Co-Attention Network (MCAN).
  • Consists of Modular Co-Attention (MCA) Layers, which are built from 2 basic attention units:
    • Self Attention Unit
    • Guided Attention Unit
  • These units are inspired by the Transformer architecture.

15 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Self Attention and Guided Attention units

  • Self-Attention (SA) Unit:
    • Takes a group of input features and outputs the attended output features.
    • Similar to the encoder layer in Transformers.
  • Guided-Attention (GA) Unit:
    • Takes two groups of input features (X and Y) and outputs the attended output features of X guided by Y.
    • Similar to the cross-attention (encoder-decoder attention) layer in Transformers.
  • These units can be combined in different topologies to build the MCA layer (a sketch of both units follows this list).
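A minimal PyTorch sketch of the two units using nn.MultiheadAttention. The sizes (d_model = 512, 8 heads, 2048-wide feed-forward) are assumed defaults in the spirit of the paper; dropout and other details are omitted:

    import torch.nn as nn

    class GuidedAttention(nn.Module):
        """GA unit: X attends to Y (queries from X, keys/values from Y),
        followed by a feed-forward block, each with residual + LayerNorm."""
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, X, Y):
            attn_out, _ = self.attn(query=X, key=Y, value=Y)  # X guided by Y
            X = self.norm1(X + attn_out)
            return self.norm2(X + self.ffn(X))

    class SelfAttention(GuidedAttention):
        """SA unit: the special case where X attends to itself."""
        def forward(self, X):
            return super().forward(X, X)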

Credits: Yu et al. (2019)

16 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Compositions

  • ID(Y)-GA(X,Y) (part a) passes the question features (Y) directly to the output; the interaction between image features (X) and question features is modeled with GA(X, Y).
  • SA(Y)-GA(X,Y) (part b) passes the question features through a self-attention layer, and its output is used to model the interaction with the image features via GA(X, Y).
  • SA(Y)-SGA(X,Y) (part c) builds on SA(Y)-GA(X,Y) by also passing the image features through a self-attention layer before modeling their interaction with the question features via GA(X, Y) (a sketch of this composition follows this list).
  • Other combinations such as GA(X,Y)-GA(Y,X) and SGA(X,Y)-SGA(Y,X) were explored but not reported, as their performance was not competitive.
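Reusing the SA and GA sketches from the previous slide, the SA(Y)-SGA(X,Y) composition (part c) might look like the following; this is an illustrative composition, not the authors' exact code:

    class MCALayer(nn.Module):
        """SA(Y)-SGA(X,Y): self-attention on the question, then self-attention
        on the image followed by guided attention of the image on the question."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.sa_y = SelfAttention(d_model, n_heads)
            self.sa_x = SelfAttention(d_model, n_heads)
            self.ga_xy = GuidedAttention(d_model, n_heads)

        def forward(self, X, Y):
            Y = self.sa_y(Y)                  # attended question features
            X = self.ga_xy(self.sa_x(X), Y)   # image features guided by the question
            return X, Y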

Credits: Yu et al. (2019)

17 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

18 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Question and Image Representation

  • Image Features:
    • Intermediate region features extracted from a Faster R-CNN pre-trained on the Visual Genome dataset.
  • Question Features:
    • The question is tokenized and truncated to a maximum of 14 words.
    • Each word is mapped to a vector using 300-D GloVe embeddings (a preprocessing sketch follows this list).
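A small preprocessing sketch for the question branch. The GloVe lookup table is assumed to be available as a plain dict from word to 300-D vector (the `glove` argument is hypothetical), and tokenization is simplified:

    import re
    import numpy as np

    def encode_question(question, glove, max_len=14, dim=300):
        """Tokenize, truncate to max_len words, and map each word to its
        300-D GloVe vector (zeros for unknown words and padding)."""
        tokens = re.findall(r"[a-z]+", question.lower())[:max_len]
        feats = np.zeros((max_len, dim), dtype=np.float32)
        for i, tok in enumerate(tokens):
            feats[i] = glove.get(tok, np.zeros(dim, dtype=np.float32))
        return feats  # (14, 300) question representation fed to the question encoder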

19 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

20 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Deep Co-Attention Learning

  • Stacking:
    • Stacks L MCA layers in depth and finally outputs X(L) and Y(L) as the attended image and question features.
    • Each intermediate self-attention output for the question (Y) is used to compute the corresponding intermediate guided-attention output for the image (X).
  • Encoder-Decoder:
    • Inspired by the Transformer model.
    • Only the question features from the last layer (Y(L)) are used by the guided-attention units in every image layer (a sketch follows this list).
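A sketch of the encoder-decoder strategy, reusing the SA/GA unit sketches from earlier; the layer count and sizes are assumed defaults:

    import torch.nn as nn

    class DeepCoAttentionED(nn.Module):
        """Encoder-decoder co-attention: a stack of SA layers encodes the question;
        only the final question features Y(L) guide every image layer, each of
        which applies SA on the image followed by GA guided by Y(L)."""
        def __init__(self, num_layers=6, d_model=512, n_heads=8):
            super().__init__()
            self.q_layers = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(num_layers)])
            self.v_sa = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(num_layers)])
            self.v_ga = nn.ModuleList([GuidedAttention(d_model, n_heads) for _ in range(num_layers)])

        def forward(self, X, Y):
            for layer in self.q_layers:        # encoder: question self-attention stack
                Y = layer(Y)
            for sa, ga in zip(self.v_sa, self.v_ga):
                X = ga(sa(X), Y)               # decoder: each image layer guided by Y(L)
            return X, Y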

Credits: Yu et al. (2019)

21 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

22 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Multimodal Fusion and Output Classifier

  • The attended outputs are passed through a two-layer MLP that reduces the dimensionality of the features and computes attention weights with a softmax.
  • Attended features are obtained by aggregating the original features with these attention weights, and are then combined by a linear projection followed by LayerNorm for stabilization.
  • The fused feature vector trains an N-way classifier with cross-entropy loss, where N is the number of most frequent answers in the training set (a sketch follows this list).
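A sketch of the attention-based fusion and classifier described above; the fused dimension is an assumed value and the answer-vocabulary size is left as a parameter:

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        """Two-layer MLP -> softmax attention weights over each feature set; the
        attended vectors are projected, summed, LayerNorm-ed, and classified."""
        def __init__(self, num_answers, d_model=512, d_fused=1024):
            super().__init__()
            def att_mlp():
                return nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
            self.att_x, self.att_y = att_mlp(), att_mlp()
            self.proj_x = nn.Linear(d_model, d_fused)
            self.proj_y = nn.Linear(d_model, d_fused)
            self.norm = nn.LayerNorm(d_fused)
            self.classifier = nn.Linear(d_fused, num_answers)

        def pool(self, feats, att):
            w = torch.softmax(att(feats), dim=1)   # (B, N, 1) attention weights
            return (w * feats).sum(dim=1)          # attended feature vector

        def forward(self, X, Y):
            fused = self.norm(self.proj_x(self.pool(X, self.att_x)) +
                              self.proj_y(self.pool(Y, self.att_y)))
            return self.classifier(fused)          # logits over the answer set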

23 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • VQA-v2 Dataset
    • Contains human-annotated question-answer pairs for images from the COCO dataset, with 3 questions per image and 10 answers per question.
    • The dataset is split into three parts: train (80k images, 444k QA pairs), val (40k images, 214k QA pairs), and test (80k images, 448k QA pairs).
    • Two test subsets, test-dev and test-standard, are used to evaluate model performance. Results consist of three per-type accuracies (Yes/No, Number, and Other) and an overall accuracy (the standard per-answer accuracy metric is sketched below).
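For reference, per-answer accuracy on these splits is usually computed with the standard VQA metric; the official evaluation averages this over subsets of 9 of the 10 annotators, but the commonly quoted form is:

    def vqa_accuracy(predicted, human_answers):
        """An answer counts as fully correct if at least 3 of the 10 annotators
        gave it; otherwise credit scales with the number of matching annotators."""
        return min(human_answers.count(predicted) / 3.0, 1.0)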

24 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (MCA variants)
    • Verifies that modeling self-attention improves VQA performance, since SA(Y)-GA(X,Y) outperforms ID(Y)-GA(X,Y).
    • Moreover, SA(Y)-SGA(X,Y) also outperforms SA(Y)-GA(X,Y).
    • This implies that modeling self-attention over the image features is meaningful.

Credits: Yu et al. (2019)

25 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (Stacking vs Encoder-Decoder)
    • SA(Y)-SGA(X,Y) is used as the default MCA layer.
    • Shows that increasing L (the number of layers) improves performance.
    • Performance saturates when L > 6, which can be explained by unstable gradients during training.
    • Encoder-Decoder outperforms Stacking, because the self-attention learned by early SA(Y) units is less accurate than that of later SA(Y) units.
    • MCAN is much more parameter-efficient than other models, with MCANed-2 (27M parameters) reporting 66.2% accuracy, BAN-4 (45M) 65.8%, and MFH (116M) 65.7%.

Credits: Yu et al. (2019)

26 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (Question Representation)
    • The MCANed-6 model is used, with SA(Y)-SGA(X,Y) as the MCA layer.
    • Experiments with different question representations.
    • Using GloVe word embeddings outperforms random initialization.
    • Fine-tuning the GloVe embeddings slightly improves performance further.

Credits: Yu et al. (2019)

27 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (MCA vs Depth)

Credits: Yu et al. (2019)

28 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Qualitative Analysis

Credits: Yu et al. (2019)

29 of 31

Credits: Yu et al. (2019)

30 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Comparison with SOTA
    • MCANed-6 is used for comparison with existing SOTA models.
    • Outperforms BAN by 1.1 points and BAN+Counter (BAN with a counting module) by 0.6 points.
    • MCAN does not use auxiliary information such as bounding-box coordinates.
Credits: Yu et al. (2019)

31 of 31

Thank you!