Beyond statistical learning in vision-language
Rishika Bhagwatkar
Shravan Nayak
Overview
Vision-Language model
In-distribution (IID) data → good performance :)
Out-of-distribution (OOD) data → bad performance :(
Far from being robust enough for practical use
Overview
VLMs
Adversarial Challenges and Benchmarks in VQA
Enhancing Model Robustness and Generalization
Adversarial Challenges and Benchmarks in VQA
Prior VQA benchmarks study robustness with respect to:
Limitations:
Adversarial Challenges and Benchmarks in VQA
Adversarial VQA
Human-and-model-in-the-loop data collection
Different domains including web images from Conceptual Captions, user-generated images from Fakeddit, and movie images from VCR
Model in the loop (evaluation + training): weaknesses exposed 😈
+ adversarial data augmentation: the model becomes more robust on other VQA tasks 😇
Li et al. Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models. Aug 2021.
Enhancing Robustness and Generalization
How to develop methods to enhance robustness and generalization?
Data
Training Objectives
Will be discussed in detail :)
Large-Scale Adversarial Training for Vision and Language Representation Learning
NeurIPS 2020 Spotlight
Introduction
Adversarial training is used to combat adversarial attacks and build robust neural networks.
Better generalizability
Research Question: Can we apply similar adversarial training techniques to V+L problems to improve model performance?
They propose VILLA (Vision-and-Language Large-scale Adversarial training), which consists of two stages
Approach
Two-stage Adversarial Training: APT + AFT
Adversarial Pre-training (APT):
Adversarial Finetuning (AFT):
Free AT
VILLA is m times computationally heavier than UNITER, where m is the number of adversarial training steps.
m is set to 3 (following prior work on free adversarial training for language models) and is not ablated for multimodal AT :(
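The free adversarial training idea can be sketched for a toy softmax classifier (a hypothetical minimal version, not the authors' implementation; all names are illustrative): each mini-batch is replayed m times, and the input gradient comes from the same backward pass that produces the weight gradient, so the perturbation update is nearly free.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def free_at_epoch(W, X, Y, m=3, lr=0.1, eps=0.1, batch_size=8):
    """One epoch of 'free' adversarial training on a toy softmax
    classifier: the weight gradient and the input gradient come from
    the same backward pass, so the perturbation update costs almost
    nothing extra."""
    delta = np.zeros((batch_size, X.shape[1]))
    for s in range(0, len(X), batch_size):
        x, y = X[s:s + batch_size], Y[s:s + batch_size]
        d = delta[:len(x)]
        for _ in range(m):                      # replay the same batch m times
            p = softmax((x + d) @ W)
            p[np.arange(len(y)), y] -= 1.0      # dL/dlogits (up to 1/N)
            g_W = (x + d).T @ p / len(y)        # gradient w.r.t. weights
            g_x = p @ W.T / len(y)              # gradient w.r.t. inputs: free!
            W = W - lr * g_W                    # descent step on the weights
            d = np.clip(d + eps * np.sign(g_x), -eps, eps)  # ascent on delta
        delta[:len(x)] = d
    return W
```

The key design point is that a standard PGD-based AT loop would need K extra backward passes per weight update; here the m replays amortize that cost.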
Free Multimodal AT
Cross-entropy loss on clean data
Label-preserving AT loss
Inner maximization is solved by PGD
Finer-grained regularization term
Encourages the prediction confidences on clean and adversarial inputs to be close
There is no ablation studying the individual loss components separately :(
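The three-term objective above can be sketched for a toy softmax classifier (a minimal illustration under simplifying assumptions: VILLA actually perturbs image-region and word embeddings of a transformer, and the exact regularizer differs; all names here are illustrative).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(W, x, y):
    """Mean cross-entropy of a linear softmax classifier."""
    p = softmax(x @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def input_grad(W, x, y):
    """Gradient of the cross-entropy w.r.t. the inputs x."""
    p = softmax(x @ W)
    p[np.arange(len(y)), y] -= 1.0
    return (p @ W.T) / len(y)

def pgd_perturbation(W, x, y, eps=0.1, alpha=0.03, steps=3):
    """Inner maximization: projected gradient ascent on the loss,
    keeping the perturbation inside an L-inf ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = input_grad(W, x + delta, y)
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    return delta

def kl(p, q):
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean()

def villa_style_loss(W, x, y, eps=0.1):
    delta = pgd_perturbation(W, x, y, eps)
    l_clean = cross_entropy(W, x, y)           # CE on clean data
    l_adv = cross_entropy(W, x + delta, y)     # label-preserving AT loss
    # finer-grained regularizer: keep clean and adversarial prediction
    # confidences close (symmetrized KL here for simplicity)
    p_clean, p_adv = softmax(x @ W), softmax((x + delta) @ W)
    l_reg = kl(p_adv, p_clean) + kl(p_clean, p_adv)
    return l_clean + l_adv + l_reg
```

Because the inner maximization only increases the loss inside the epsilon-ball, the adversarial cross-entropy term is at least as large as the clean one.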
Ablation Study
Downstream Task Evaluation: VILLA applied to UNITER/LXMERT outperforms the base models on all evaluation tasks.
Ablation Study
Pre-training vs Finetuning: understanding the effects of AT at the pre-training and fine-tuning stages.
VILLA-pre brings a +0.51 gain
VILLA-fine brings a +0.82 gain
Combining both brings a +1.12 gain
Ablation Study
Image vs Text Modality: understanding the effects of adversarial examples in different modalities.
Adding perturbations to a single modality already yields a significant improvement
Reflections
Separating Skills and Concepts for Novel Visual Question Answering
Motivation
Training data
Test data
Motivation
Motivation
Method
We need the following to achieve this disentanglement
How do we teach this? What data will we use?
How do we teach this? Skill signal is needed
Self-Supervised Learning
Disentangling Concepts
Positive samples:
Negative samples:
Generating hard negatives and diverse positives reduces the learning of spurious correlations
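The contrastive setup can be sketched with a generic InfoNCE-style loss (an illustrative sketch, not the paper's exact formulation; `tau` and the toy vectors below are assumptions): negatives that lie close to the anchor (hard negatives) contribute a larger loss, pushing the model to ground the concept itself rather than spurious correlates.

```python
import numpy as np

def info_nce(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull representations of
    positives toward the anchor, push negatives away."""
    def cos(a, b):  # cosine similarity between anchor and a batch
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    pos = np.exp(cos(anchor, positives) / tau).sum()
    neg = np.exp(cos(anchor, negatives) / tau).sum()
    return -np.log(pos / (pos + neg))
```

With a hard negative (nearly collinear with the anchor) the denominator grows, so the loss is larger than with an easy, far-away negative; this is why hard-negative mining gives a stronger training signal.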
Disentangling Skills
How many zebras are visible?
Training and Results
Results
Generalization to novel concepts
They perform several ablations to showcase the efficacy of each component
Reflections
Reflections
The major thing that has changed today is LLMs. Do we need to rethink how we look at the problem?