1 of 31

Visual Question Answering

Yash Vadi

Uday Karan Kapur

21st Feb 2024

2 of 31

Overview

  • What is VQA?
  • Datasets for VQA tasks
  • Methods
  • Paper 1: Revisiting Visual Question Answering Baselines
  • Paper 2: Deep Modular Co-Attention Networks for Visual Question Answering

3 of 31

What is VQA?

  • Why VQA, and why is it harder than image captioning?
    • Given an image, can a machine answer questions about it in natural language?
    • Image captions tend to be generic; VQA demands an answer to a specific query.
  • Questions in VQA may require:
    • Descriptive properties of objects
    • Counting
    • Common-sense knowledge

4 of 31

Datasets

  • VQA: Human-annotated dataset with open-ended and multiple-choice answers; each question is answered by 10 human annotators.
  • COCO-QA: Built on COCO images, which are annotated with questions about objects, scenes, and relationships; the question-answer pairs are generated automatically from the image captions.
  • Visual7W: Associates images with "who," "what," "where," "when," "why," "how," and "which" questions, aiming to cover a diverse range of visual reasoning tasks through natural language prompts.
  • …and others.

5 of 31

Methods

  • Generation
    • Autoregressively generate the answer with a generative model conditioned on both the image and the question
  • Classification
    • Classify the correct answer from a set of predefined options
  • Joint-embedding learning (a minimal sketch follows this list)
    • Image encoder + bag-of-words
    • Image encoder + LSTM
    • Image encoder and text encoder + multimodal bilinear pooling
  • Attention-based
    • Stacked attention, co-attention, self-attention
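A minimal PyTorch sketch of the joint-embedding route (pooled image features + LSTM question encoder, fused by concatenation and fed to a classifier over a fixed answer set). The layer sizes and feature dimensions are illustrative assumptions, not taken from a specific paper:

    import torch
    import torch.nn as nn

    class JointEmbeddingVQA(nn.Module):
        """Joint-embedding baseline: pooled image features + LSTM question encoder,
        fused by concatenation and classified over a fixed answer vocabulary."""
        def __init__(self, vocab_size, num_answers, img_dim=2048, emb_dim=300, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.img_proj = nn.Linear(img_dim, hid_dim)
            self.classifier = nn.Sequential(
                nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, num_answers),
            )

        def forward(self, img_feats, question_tokens):
            # img_feats: (B, img_dim) pooled CNN features; question_tokens: (B, T) word ids
            _, (h, _) = self.lstm(self.embed(question_tokens))
            q = h[-1]                                # final hidden state as question vector
            v = torch.relu(self.img_proj(img_feats))
            return self.classifier(torch.cat([v, q], dim=-1))  # answer logits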

6 of 31

Revisiting Visual Question Answering Baselines by Jabri et al.

7 of 31

Revisiting Visual Question Answering Baselines

  • Instead of predicting an answer class given the question and image, the model predicts the correctness of an entire image-question-answer triplet (a sketch follows this list).
  • Questioned the significance of the reasoning-based methods of the time.
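A minimal sketch of the triplet-scoring idea, assuming precomputed image features and fixed-length text features for the question and a candidate answer; the hidden size and feature choices are illustrative, in the spirit of the paper rather than its exact configuration:

    import torch
    import torch.nn as nn

    class TripletScorer(nn.Module):
        """Scores an (image, question, answer) triplet as correct or not,
        instead of classifying the answer directly."""
        def __init__(self, img_dim=2048, txt_dim=300, hid_dim=8192):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(img_dim + 2 * txt_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, 1),   # logit for "this answer is correct"
            )

        def forward(self, img_feats, q_feats, a_feats):
            x = torch.cat([img_feats, q_feats, a_feats], dim=-1)
            return self.mlp(x).squeeze(-1)

At test time, every candidate answer is scored against the image-question pair and the highest-scoring one is selected.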

8 of 31

Results

  • MLP(A): Answer only
  • MLP(A, Q): Answer + Question
  • MLP(A, Q, I): Answer + Question + Image
  • MLP(A, I): Answer + Image

9 of 31

10 of 31

Ablations and Error Analysis

  • The approach is dataset-independent and can handle an arbitrary number of answer classes.
  • Shape, color, count: poor performance; the model mostly learns the dataset's biases.
  • Action: the model does learn something.
    • Image features transfer well to the action-recognition task.
  • Causality: “why” questions can most of the time be answered from text alone, using common-sense reasoning.

11 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

12 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

  • Before we dive into the paper, let’s look at co-attention.

How many puppies are in the image?

How many puppies can you see in the image?

  • The two questions above are equivalent, and their meaning is already apparent from the first three words.
  • Models that pay attention to those first three words will be more robust to variations in wording.

13 of 31

Co-Attention

  • The co-attention mechanism learns textual attention over the question jointly with visual attention over the image (a sketch follows this list).
  • In other words, it also addresses the problem of which words to attend to.
  • Initially introduced in “Hierarchical Question-Image Co-Attention for Visual Question Answering” by Lu et al. at Virginia Tech and the Georgia Institute of Technology.
  • Improved state-of-the-art performance on the VQA and COCO-QA datasets.
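A rough sketch of parallel co-attention in the spirit of Lu et al.: an affinity matrix between question words and image regions yields attention over words and over regions simultaneously. Dimensions, initialization, and the projection size k are illustrative assumptions:

    import torch
    import torch.nn as nn

    class ParallelCoAttention(nn.Module):
        def __init__(self, q_dim, v_dim, k=256):
            super().__init__()
            self.W_b = nn.Parameter(0.01 * torch.randn(q_dim, v_dim))  # affinity weights
            self.W_q = nn.Linear(q_dim, k)
            self.W_v = nn.Linear(v_dim, k)
            self.w_hq = nn.Linear(k, 1)
            self.w_hv = nn.Linear(k, 1)

        def forward(self, Q, V):
            # Q: (B, T, q_dim) word features; V: (B, R, v_dim) region features
            C = torch.tanh(Q @ self.W_b @ V.transpose(1, 2))                 # (B, T, R) affinity
            H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))  # (B, R, k)
            H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                  # (B, T, k)
            a_v = torch.softmax(self.w_hv(H_v), dim=1)   # visual attention over regions
            a_q = torch.softmax(self.w_hq(H_q), dim=1)   # textual attention over words
            return (a_v * V).sum(dim=1), (a_q * Q).sum(dim=1)  # attended image, question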

14 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

  • Proposes a novel Modular Co-Attention Network (MCAN).
  • Consists of Modular Co-Attention (MCA) Layers, which are built from 2 basic attention units:
    • Self Attention Unit
    • Guided Attention Unit
  • These units are inspired by the Transformer architecture.

15 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Self Attention and Guided Attention units

  • Self-Attention (SA) Unit:
    • Takes a group of input features and outputs the attended output features.
    • Similar to the encoder layer in Transformers.
  • Guided-Attention (GA) Unit:
    • Takes two groups of input features (X and Y) and outputs the attended output features of X guided by Y.
    • Similar to the cross-attention (encoder-decoder attention) layer in Transformers.
  • These units can be combined in different topologies to build the MCA layer (a sketch of both units follows this list).
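A minimal PyTorch sketch of the two units using nn.MultiheadAttention. The sizes (d_model = 512, 8 heads, 2048-wide feed-forward) are assumed defaults in the spirit of the paper; dropout and other details are omitted:

    import torch.nn as nn

    class GuidedAttention(nn.Module):
        """GA unit: X attends to Y (queries from X, keys/values from Y),
        followed by a feed-forward block, each with residual + LayerNorm."""
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, X, Y):
            attn_out, _ = self.attn(query=X, key=Y, value=Y)  # X guided by Y
            X = self.norm1(X + attn_out)
            return self.norm2(X + self.ffn(X))

    class SelfAttention(GuidedAttention):
        """SA unit: the special case where X attends to itself."""
        def forward(self, X):
            return super().forward(X, X)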

Credits: Yu et al. (2019)

16 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Compositions

  • ID(Y)-GA(X,Y) (part a) passes the question features (Y) directly to the output; the interaction between image features (X) and question features is modeled with GA(X, Y).
  • SA(Y)-GA(X,Y) (part b) passes the question features through a self-attention layer, and its output is used to model the interaction with the image features via GA(X, Y).
  • SA(Y)-SGA(X,Y) (part c) builds on SA(Y)-GA(X,Y) by also passing the image features through a self-attention layer before modeling their interaction with the question features via GA(X, Y) (a sketch of this composition follows this list).
  • Other combinations such as GA(X,Y)-GA(Y,X) and SGA(X,Y)-SGA(Y,X) were explored but not reported, as their performance was not competitive.
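Reusing the SA and GA sketches from the previous slide, the SA(Y)-SGA(X,Y) composition (part c) might look like the following; this is an illustrative composition, not the authors' exact code:

    class MCALayer(nn.Module):
        """SA(Y)-SGA(X,Y): self-attention on the question, then self-attention
        on the image followed by guided attention of the image on the question."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.sa_y = SelfAttention(d_model, n_heads)
            self.sa_x = SelfAttention(d_model, n_heads)
            self.ga_xy = GuidedAttention(d_model, n_heads)

        def forward(self, X, Y):
            Y = self.sa_y(Y)                  # attended question features
            X = self.ga_xy(self.sa_x(X), Y)   # image features guided by the question
            return X, Y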

Credits: Yu et al. (2019)

17 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

18 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Question and Image Representation

  • Image Features:
    • Intermediate region features extracted from a Faster R-CNN pre-trained on the Visual Genome dataset.
  • Question Features:
    • The question is tokenized and truncated to a maximum of 14 words.
    • Each word is mapped to a vector using 300-D GloVe embeddings (a preprocessing sketch follows this list).
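A small preprocessing sketch for the question branch. The GloVe lookup table is assumed to be available as a plain dict from word to 300-D vector (the `glove` argument is hypothetical), and tokenization is simplified:

    import re
    import numpy as np

    def encode_question(question, glove, max_len=14, dim=300):
        """Tokenize, truncate to max_len words, and map each word to its
        300-D GloVe vector (zeros for unknown words and padding)."""
        tokens = re.findall(r"[a-z]+", question.lower())[:max_len]
        feats = np.zeros((max_len, dim), dtype=np.float32)
        for i, tok in enumerate(tokens):
            feats[i] = glove.get(tok, np.zeros(dim, dtype=np.float32))
        return feats  # (14, 300) question representation fed to the question encoder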

19 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

20 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Deep Co-Attention Learning

  • Stacking:
    • Stacks L MCA layers in depth and finally outputs X(L) and Y(L) as the attended image and question features.
    • Each intermediate self-attention output for the question (Y) is used to compute the corresponding intermediate guided-attention output for the image (X).
  • Encoder-Decoder:
    • Inspired by the Transformer model.
    • Only the question features from the last layer (Y(L)) are used by the guided-attention units in every image layer (a sketch follows this list).
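A sketch of the encoder-decoder strategy, reusing the SA/GA unit sketches from earlier; the layer count and sizes are assumed defaults:

    import torch.nn as nn

    class DeepCoAttentionED(nn.Module):
        """Encoder-decoder co-attention: a stack of SA layers encodes the question;
        only the final question features Y(L) guide every image layer, each of
        which applies SA on the image followed by GA guided by Y(L)."""
        def __init__(self, num_layers=6, d_model=512, n_heads=8):
            super().__init__()
            self.q_layers = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(num_layers)])
            self.v_sa = nn.ModuleList([SelfAttention(d_model, n_heads) for _ in range(num_layers)])
            self.v_ga = nn.ModuleList([GuidedAttention(d_model, n_heads) for _ in range(num_layers)])

        def forward(self, X, Y):
            for layer in self.q_layers:        # encoder: question self-attention stack
                Y = layer(Y)
            for sa, ga in zip(self.v_sa, self.v_ga):
                X = ga(sa(X), Y)               # decoder: each image layer guided by Y(L)
            return X, Y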

Credits: Yu et al. (2019)

21 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

MCA Network (MCAN) Architecture

Credits: Yu et al. (2019)

22 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Multimodal Fusion and Output Classifier

  • The attended outputs are passed through a two-layer MLP that reduces the dimensionality of the features and computes attention weights with a softmax.
  • Attended features are obtained by aggregating the original features with these attention weights, and are then combined by a linear projection followed by LayerNorm for stabilization.
  • The fused feature vector trains an N-way classifier with cross-entropy loss, where N is the number of most frequent answers in the training set (a sketch follows this list).
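A sketch of the attention-based fusion and classifier described above; the fused dimension is an assumed value and the answer-vocabulary size is left as a parameter:

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        """Two-layer MLP -> softmax attention weights over each feature set; the
        attended vectors are projected, summed, LayerNorm-ed, and classified."""
        def __init__(self, num_answers, d_model=512, d_fused=1024):
            super().__init__()
            def att_mlp():
                return nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
            self.att_x, self.att_y = att_mlp(), att_mlp()
            self.proj_x = nn.Linear(d_model, d_fused)
            self.proj_y = nn.Linear(d_model, d_fused)
            self.norm = nn.LayerNorm(d_fused)
            self.classifier = nn.Linear(d_fused, num_answers)

        def pool(self, feats, att):
            w = torch.softmax(att(feats), dim=1)   # (B, N, 1) attention weights
            return (w * feats).sum(dim=1)          # attended feature vector

        def forward(self, X, Y):
            fused = self.norm(self.proj_x(self.pool(X, self.att_x)) +
                              self.proj_y(self.pool(Y, self.att_y)))
            return self.classifier(fused)          # logits over the answer set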

23 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • VQA-v2 Dataset
    • Contains human-annotated question-answer pairs for images from the COCO dataset, with 3 questions per image and 10 answers per question.
    • The dataset is split into three parts: train (80k images, 444k QA pairs), val (40k images, 214k QA pairs), and test (80k images, 448k QA pairs).
    • Two test subsets, test-dev and test-standard, are used to evaluate model performance. Results consist of three per-type accuracies (Yes/No, Number, and Other) and an overall accuracy (the standard per-answer accuracy metric is sketched below).
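For reference, per-answer accuracy on these splits is usually computed with the standard VQA metric; the official evaluation averages this over subsets of 9 of the 10 annotators, but the commonly quoted form is:

    def vqa_accuracy(predicted, human_answers):
        """An answer counts as fully correct if at least 3 of the 10 annotators
        gave it; otherwise credit scales with the number of matching annotators."""
        return min(human_answers.count(predicted) / 3.0, 1.0)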

24 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (MCA variants)
    • Verifies that modeling self-attention improves VQA performance, since SA(Y)-GA(X,Y) outperforms ID(Y)-GA(X,Y).
    • Moreover, SA(Y)-SGA(X,Y) also outperforms SA(Y)-GA(X,Y).
    • This implies that modeling self-attention over the image features is meaningful.

Credits: Yu et al. (2019)

25 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (Stacking vs Encoder-Decoder)
    • SA(Y)-SGA(X,Y) is used as the default MCA layer.
    • Shows that increasing L (the number of layers) improves performance.
    • Performance saturates when L > 6, which can be explained by unstable gradients during training.
    • Encoder-Decoder outperforms Stacking, because the self-attention learned by early SA(Y) units is less accurate than that of later SA(Y) units.
    • MCAN is much more parameter-efficient than other models, with MCANed-2 (27M parameters) reporting 66.2% accuracy, BAN-4 (45M) 65.8%, and MFH (116M) 65.7%.

Credits: Yu et al. (2019)

26 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (Question Representation)
    • The MCANed-6 model is used, with SA(Y)-SGA(X,Y) as the MCA layer.
    • Experiments with different question representations.
    • Using GloVe word embeddings outperforms random initialization.
    • Fine-tuning the GloVe embeddings slightly improves performance further.

Credits: Yu et al. (2019)

27 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Ablation Studies (MCA vs Depth)

Credits: Yu et al. (2019)

28 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Qualitative Analysis

Credits: Yu et al. (2019)

29 of 31

Credits: Yu et al. (2019)

30 of 31

Deep Modular Co-Attention Networks for VQA by Yu et al. (2019)

Experiments

  • Comparison with SOTA
    • MCANed-6 is used for comparison with existing SOTA models.
    • Outperforms BAN by 1.1 points and BAN+Counter (BAN with a counting module) by 0.6 points.
    • MCAN does not use auxiliary information such as bounding-box coordinates.
Credits: Yu et al. (2019)

31 of 31

Thank you!