1 of 26

Beyond statistical learning in vision-language

Rishika Bhagwatkar

Shravan Nayak

2 of 26

Overview


A vision-language model achieves good performance on in-distribution (IID) data but bad performance on out-of-distribution (OOD) data: current models are far from being robust enough for practical use.

3 of 26

Overview


  • VLMs
  • Adversarial Challenges and Benchmarks in VQA
  • Enhancing Model Robustness and Generalization

4 of 26

Adversarial Challenges and Benchmarks in VQA


Prior VQA benchmarks study robustness with respect to:

  • Sensitivity to visual content manipulation (Agarwal et al. Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing. In CVPR, 2020)
  • Answer distribution shift (Agrawal et al. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. In CVPR, 2018)
  • Linguistic variations in the input question (Shah et al. Cycle-Consistency for Robust Visual Question Answering. In CVPR, 2019)
  • Reasoning capabilities (Gokhale et al. VQA-LOL: Visual Question Answering Under the Lens of Logic. In ECCV, 2020)

Limitations:

  • Focused on a single type of robustness
  • Based on VQAv2 images/questions
  • Data collection is static

5 of 26

Adversarial Challenges and Benchmarks in VQA


Adversarial VQA benchmark:

  • Human-and-model-in-the-loop data collection
  • Covers different domains, including web images from Conceptual Captions, user-generated images from Fakeddit, and movie images from VCR
  • Used for evaluation (EVAL), it exposes model weaknesses 😈; used for training (TRAIN) as data augmentation, it makes the model more robust on other VQA tasks 😇

Li et al. Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models. Aug 2021.

6 of 26

Enhancing Robustness and Generalization


How to develop methods to enhance robustness and generalization?

Two levers: the training data and the training objectives.

  • Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision, Teney et al. April 2021
    • Train on pairs of counterfactual or contrastive examples.
    • The counterfactual examples are used to orient the model’s gradient in a way that aligns with the differences between paired examples.

  • Unshuffling Data for Improved Generalization, Teney et al. Nov 2021
    • Partitioning the training data into well-defined, non-i.i.d. subsets, which are treated as separate training environments.
    • Injection of task-relevant information through the strategic partitioning of training data.
  • Large-Scale Adversarial Training for Vision-and-Language Representation Learning, Gan et al. Oct 2020

  • Separating Skills and Concepts for Novel Visual Question Answering, Whitehead et al. July 2021

Will be discussed in detail :)

7 of 26

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

NeurIPS 2020 Spotlight


8 of 26

Introduction


Adversarial training is used to combat adversarial attacks in order to create robust neural networks, and it can also lead to better generalizability.

Research Question: Can we apply similar adversarial training techniques to V+L problems to improve model performance?

They propose VILLA (Vision-and-Language Large-scale Adversarial training), consisting of two stages:

  • Task-agnostic adversarial pre-training
  • Task-specific adversarial finetuning

9 of 26

Approach


  • Perform adversarial training at the embedding level of each modality.
    • For text, perturbations are added to the word embeddings.
    • For images, perturbations are added to the extracted image-region features.
  • Adopt the “free” adversarial training strategy.

Two-stage adversarial training: APT + AFT

Adversarial Pre-training (APT):

  • Masked Language Modeling
  • Image-Text Matching

Adversarial Finetuning (AFT):

  • Finetuning the pre-trained weights on downstream tasks

10 of 26

Free AT


  • Update both perturbations and the network in one pass.
  • Replay every mini-batch m times to simulate PGD training.

VILLA is m times computationally heavier than UNITER, where m is the number of adversarial training steps.

m is set to 3 (following prior work on free adversarial training for language models) and is not ablated for multimodal AT :(
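Below is a minimal PyTorch-style sketch of this free replay loop, assuming a model that takes perturbed image-region features and word embeddings and returns answer logits; the interface, the L_inf projection, and hyper-parameters such as eps and step_size are illustrative assumptions rather than the authors' implementation (the paper also perturbs the modalities separately; here both are perturbed at once for brevity).

    import torch
    import torch.nn.functional as F

    def free_at_step(model, optimizer, img_feats, txt_embeds, labels,
                     m=3, eps=1.0, step_size=1e-3):
        """One "free" adversarial training step: the same mini-batch is replayed
        m times, and each backward pass is reused to update both the perturbations
        and the model parameters (hypothetical sketch)."""
        # Perturbations live in the embedding space of each modality.
        delta_img = torch.zeros_like(img_feats, requires_grad=True)
        delta_txt = torch.zeros_like(txt_embeds, requires_grad=True)

        for _ in range(m):
            optimizer.zero_grad()
            logits = model(img_feats + delta_img, txt_embeds + delta_txt)
            loss = F.cross_entropy(logits, labels)
            loss.backward()  # one pass gives gradients for params and perturbations

            with torch.no_grad():
                # Ascent step on the perturbations, projected into an L_inf ball.
                delta_img += step_size * delta_img.grad.sign()
                delta_img.clamp_(-eps, eps)
                delta_txt += step_size * delta_txt.grad.sign()
                delta_txt.clamp_(-eps, eps)
                delta_img.grad.zero_()
                delta_txt.grad.zero_()

            optimizer.step()  # descent step on the model parameters
        return loss.item()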

11 of 26

Free Multimodal AT


The training objective combines:

  • A cross-entropy loss on clean data
  • A label-preserving AT loss, whose inner maximization is solved with PGD
  • A finer-grained regularization term, which advocates that the confidence levels of clean and perturbed predictions should stay close
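Putting the pieces together, the objective sketched on this slide can be written roughly as below; the notation (f_theta for the model, alpha for the regularization weight, a single perturbation delta standing in for the per-modality perturbations) is assumed for illustration and not copied from the paper:

\[
\min_{\theta}\;
\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\Big[
\underbrace{\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x),\,y\big)}_{\text{clean cross-entropy}}
\;+\;
\underbrace{\max_{\|\delta\|\le\epsilon}\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x+\delta),\,y\big)}_{\text{label-preserving AT loss (inner max via PGD)}}
\;+\;
\alpha\,
\underbrace{\max_{\|\delta\|\le\epsilon}\mathrm{KL}\big(f_{\theta}(x+\delta)\,\big\|\,f_{\theta}(x)\big)}_{\text{finer-grained regularizer: keep confidences close}}
\Big]
\]

Here x stands for the (image-region, word-embedding) input pair, and the perturbation is added in embedding space, as on the previous slides.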

There is no ablation studying the individual components separately :(

12 of 26

Ablation Study


Downstream task evaluation: VILLA applied to UNITER/LXMERT outperforms the base models on all evaluation tasks.

13 of 26

Ablation Study


Pre-training vs. finetuning: understanding the effect of AT at the pre-training and finetuning stages.

  • VILLA-pre brings a +0.51 gain
  • VILLA-fine brings a +0.82 gain
  • Combining both brings a +1.12 gain

14 of 26

Ablation Study


Image vs. text modality: understanding the effect of adversarial examples in the different modalities.

Adding perturbations to a single modality already yields a significant improvement.

15 of 26

Reflections


  • Strengths:
    • Innovative use of free AT for multimodal learning.
    • Improved generalization performance on several downstream tasks.
  • Limitations:
    • Zero-shot scenarios are never considered.
    • Adversarial robustness is not tested.

16 of 26

Separating Skills and Concepts for Novel Visual Question Answering


17 of 26

Motivation


(Figure: examples contrasting the training data with the test data.)

18 of 26

Motivation

  • Existing models do not generalise to OOD data
  • Too much dependence on statistical priors
  • What if we could disentangle concept and skill?
    • Extract the visual concept referred to by the question
    • Determine what information we need to extract from the concept


19 of 26

Motivation


20 of 26

Method

We need the following to achieve this disentanglement:

  • Models should be able to understand what concept the question is referring to

How do we teach this? What data will we use?

  • Models should clearly understand the skill needed to answer the question

How do we teach this? A skill signal is needed

Answer: self-supervised learning.

21 of 26

Disentangling Concepts


Positive samples:

  • Same concept
  • Different types of questions
  • Different scenes

Negative samples:

  • Different concept
  • Similar questions
  • Similar scenes

Generating hard negatives and diverse positives reduces the learning of spurious correlations.
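As a concrete (hypothetical) illustration of this contrastive setup, the sketch below scores an anchor question-image encoding against same-concept positives and hard different-concept negatives with an InfoNCE-style loss; the encoder producing these vectors and the sampling strategy are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def concept_contrastive_loss(anchor, positives, negatives, temperature=0.1):
        """Pull together encodings that mention the same concept (but differ in
        question type and scene); push apart hard negatives that look similar
        but mention a different concept.

        anchor:    (d,)   encoding of the reference example
        positives: (P, d) same-concept examples
        negatives: (N, d) different-concept (hard) examples
        """
        anchor = F.normalize(anchor, dim=-1)
        positives = F.normalize(positives, dim=-1)
        negatives = F.normalize(negatives, dim=-1)

        pos_sim = positives @ anchor / temperature   # (P,)
        neg_sim = negatives @ anchor / temperature   # (N,)

        # Each positive is scored against all hard negatives (InfoNCE-style).
        logits = torch.cat([pos_sim.unsqueeze(1),
                            neg_sim.unsqueeze(0).expand(len(pos_sim), -1)], dim=1)
        targets = torch.zeros(len(pos_sim), dtype=torch.long)  # positive is class 0
        return F.cross_entropy(logits, targets)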

22 of 26

Disentangling Skills

  • Mask out the concept from questions
  • Encode these masked questions
  • Compute semantic similarity


How many zebras are visible?
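A hedged sketch of this skill signal: mask the concept word out of each question, encode the masked questions, and use cosine similarity as a proxy for "same skill". The encoder, tokenizer, and [MASK] convention are assumptions; the paper's actual encoder and masking scheme may differ.

    import torch
    import torch.nn.functional as F

    def skill_similarity(question_a, question_b, concepts, encoder, tokenizer):
        """Similarity between two questions after masking out concept words,
        e.g. "How many zebras are visible?" -> "How many [MASK] are visible?"."""
        def mask_concepts(question):
            return " ".join("[MASK]" if w.lower().strip("?") in concepts else w
                            for w in question.split())

        masked = [mask_concepts(question_a), mask_concepts(question_b)]
        inputs = tokenizer(masked, return_tensors="pt", padding=True)
        with torch.no_grad():
            # Mean-pool token embeddings into one vector per masked question.
            hidden = encoder(**inputs).last_hidden_state.mean(dim=1)
        a, b = F.normalize(hidden, dim=-1)
        return (a * b).sum().item()  # cosine similarity of the skill encodings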

23 of 26

Training and Results

  • Multimodal Transformer model where text can attend to the images
  • Train in a multitask fashion
    • VQA -> Concept -> Skill
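A rough sketch of how the three objectives could be combined in one training step; the loss-computing methods on `model` and the weights are hypothetical, for illustration only.

    def multitask_step(model, optimizer, batch, w_concept=1.0, w_skill=1.0):
        """One multitask update: standard VQA answer loss plus the concept and
        skill losses from the previous slides (hypothetical interface)."""
        optimizer.zero_grad()
        vqa_loss = model.vqa_loss(batch)          # answer prediction
        concept_loss = model.concept_loss(batch)  # contrastive concept grounding
        skill_loss = model.skill_loss(batch)      # masked-question skill matching
        loss = vqa_loss + w_concept * concept_loss + w_skill * skill_loss
        loss.backward()
        optimizer.step()
        return loss.item()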


24 of 26

Results

Generalization to novel concepts

Perform several ablations to showcase the efficacy of each component


25 of 26

Reflections

  • An interesting way to look at the problem of VQA that makes sense intuitively
    • Very strong intuition and motivation
    • Identifies the requirements and builds components to meet them
  • Using SSL was very clever, and creating hard negatives was a simple but helpful technique


26 of 26

Reflections

  • Inductive bias
    • We need the model to learn this naturally in some way
    • Teaching it explicitly might limit transfer to other tasks (e.g., external-knowledge VQA)
  • Limited experiments in some categories
    • Pre-trained models are not used
    • No results on other datasets
    • Too much overfitting to the dataset?
  • The performance improvement over the base model on VQA-CP is negligible
    • The model still learns spurious correlations


The major thing that has changed today is LLMs. Do we need to rethink how we look at the problem?