1 of 34

Beyond statistical learning in vision-language

Aditya Sharma and Aman Dalmia

2 of 34

Classical approach to VQA

Cadene, R., Dancette, C., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32.

3 of 34

Problems with VQA models - Ignoring the image


4 of 34

Can we use adversarial regularization to overcome language priors?

Ramakrishnan, S., Agrawal, A., & Lee, S. (2018). Overcoming language priors in visual question answering with adversarial regularization. Advances in Neural Information Processing Systems, 31.
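A minimal numeric sketch of the core idea (all logits and the trade-off weight are toy values, not the paper's): a question-only adversary tries to predict the answer from the question alone, and the question encoder is trained against it (via gradient reversal in the paper), so the encoder effectively sees the VQA loss minus the adversary's loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[target])

# Toy logits over 3 candidate answers (hypothetical numbers).
vqa_logits = [2.0, 0.5, -1.0]     # fused image + question model
q_only_logits = [1.5, 0.2, -0.5]  # question-only adversary
answer = 0                        # ground-truth answer index
lam = 0.5                         # hypothetical trade-off weight

# The adversary minimizes its own cross-entropy; through gradient
# reversal the question encoder maximizes it, so the encoder's
# effective objective is:
encoder_loss = cross_entropy(vqa_logits, answer) - lam * cross_entropy(q_only_logits, answer)
```

Intuition: if the question alone predicts the answer well, the second term is small in magnitude but its gradient pushes the question embedding to carry less answer-revealing bias.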

5 of 34

Can we use adversarial regularization to overcome language priors?


6 of 34

RUBi: Reducing Unimodal Biases for VQA

Cadene, R., Dancette, C., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32.

7 of 34

RUBi: Reducing Unimodal Biases for VQA


8 of 34

RUBi: Reducing Unimodal Biases for VQA


9 of 34

RUBi: Reducing Unimodal Biases for VQA

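RUBi's training-time mechanism can be sketched in a few lines (toy numbers, not the paper's implementation): a question-only branch produces logits whose sigmoid masks the base model's logits, shrinking the loss, and hence the gradients, on examples the question alone already answers; the question-only branch also receives its own classification loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

# Toy logits over 3 candidate answers (hypothetical numbers).
fused_logits = [2.0, 0.5, -1.0]    # base (image + question) model
q_only_logits = [3.0, -1.0, -2.0]  # question-only branch
answer = 0

# RUBi: mask the base logits with the sigmoid of the question-only
# logits, apply cross-entropy to the masked logits, and add a
# separate cross-entropy on the question-only branch itself.
masked = [f * sigmoid(q) for f, q in zip(fused_logits, q_only_logits)]
rubi_loss = cross_entropy(masked, answer) + cross_entropy(q_only_logits, answer)
```

At test time the question-only branch and the mask are discarded, and the base model's logits are used alone.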

10 of 34

Problems with VQA models - Linguistic Diversity

Kant, Y., Moudgil, A., Batra, D., Parikh, D., & Agrawal, H. (2021). Contrast and classify: Training robust VQA models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1604-1613).

Are models robust to linguistic variations?

Are models capable of handling linguistic diversity?

Can models cope with different linguistic forms reliably?

11 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


Contrastive loss

Encourages representations to be robust to linguistic variations

Cross-entropy loss

Preserves the discriminative power of representations
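The contrastive term can be illustrated with an InfoNCE/NT-Xent-style toy computation (the 2-D embeddings and temperature are hypothetical): a question and its rephrasing are pulled together relative to an unrelated question, while the usual answer cross-entropy (not shown here) preserves discriminative power.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy 2-D question embeddings (hypothetical values).
anchor   = [1.0, 0.1]   # original question
positive = [0.9, 0.2]   # its rephrasing
negative = [-0.8, 1.0]  # unrelated question
tau = 0.5               # temperature (hypothetical)

# InfoNCE-style term: the anchor should be closer to its rephrasing
# than to the unrelated question.
pos_sim = math.exp(cosine(anchor, positive) / tau)
neg_sim = math.exp(cosine(anchor, negative) / tau)
contrastive_loss = -math.log(pos_sim / (pos_sim + neg_sim))
```

The loss is already small here because the anchor and its rephrasing nearly coincide; with a real batch, every other question in the batch would contribute a negative term to the denominator.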

12 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


13 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


14 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss

15 of 34

Thoughts: How would I solve this problem today?

Using Chain-of-Thought to help the model think and reason about why it is making the prediction

What animals are in this picture?

  • The animals are pulling a cart
  • They have a long mane
  • They have four legs and a long nose
  • They are too tall to be donkeys
  • Therefore the answer is horses
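One hypothetical way to operationalize this: format the question with an instruction to list observations before committing to an answer (the prompt wording is entirely illustrative, not from any cited paper).

```python
question = "What animals are in this picture?"

# Hypothetical chain-of-thought prompt mirroring the steps above:
# observe first, then conclude.
prompt = (
    f"Question: {question}\n"
    "Before answering, list what you observe in the image "
    "(what the animals are doing, their size, mane, and legs), "
    "then end with 'Therefore the answer is ...'.\n"
    "Observations:"
)
```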

16 of 34

Cycle-Consistency for Robust Visual Question Answering - Shah et al., 2019

VQA Models are amazing

Image Source: https://arxiv.org/pdf/1612.00837.pdf

17 of 34

VQA Models are brittle

Image Source: https://arxiv.org/pdf/1902.05660.pdf

Predictions from Pythia VQA 2018 Challenge Winner

18 of 34

Evaluating and benchmarking models for robustness


  • Can we train VQA models to be robust to linguistic variations with minimal supervision?

  • How do we evaluate VQA models for linguistic robustness?

19 of 34

Cycle consistent training scheme

Diagram: image I and question Q are fed to the VQA Model, which predicts an answer A`. A Visual Question Generation module takes I and A` and generates a rephrased question Q`. The same VQA Model then answers Q` on image I, producing A``.

20 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the VQA Loss between the predicted answer A` and the ground-truth answer A.

21 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the Question Consistency Loss between the generated question Q` and the original question Q.

22 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the VQA Loss on A` and the Answer Consistency Loss between A`` (the answer to the rephrased question) and the ground-truth A.

23 of 34

Cycle consistent training scheme

Diagram: the full training scheme combines the VQA Loss, the Question Consistency Loss, and the Answer Consistency Loss.
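Putting the three terms together, the full objective can be sketched as a weighted sum (the loss values and weights here are hypothetical, not the paper's):

```python
# Per-example losses for one training step (toy values).
l_vqa = 0.9   # VQA loss: A` vs. the ground-truth answer A
l_qc  = 1.4   # question consistency: generated Q` vs. original Q
l_ac  = 0.7   # answer consistency: A`` (answer to Q`) vs. ground-truth A

# Hypothetical weights on the consistency terms.
lambda_q, lambda_a = 1.0, 0.5

total_loss = l_vqa + lambda_q * l_qc + lambda_a * l_ac
```

In the paper, a gating mechanism additionally filters out low-quality generated rephrasings so that noisy Q` do not contribute to the consistency terms.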

24 of 34

VQA Rephrasing

  • Evaluation only
  • Based on the VQA v2 validation set
  • Human-collected

25 of 34

VQA Rephrasing

  • 40K images
  • 160K questions
  • 4 rephrasings per question

26 of 34

Does cycle-consistent training make models robust?

New SOTA

27 of 34

Does cycle-consistent training make models robust?

New SOTA

Previous SOTA

28 of 34

Does cycle-consistent training improve VQA performance?

New SOTA

Previous SOTA

29 of 34

Does cycle-consistent training improve question generation?

30 of 34

Do cycle-consistent models better predict their failures?

31 of 34

Conclusion

  • Proposes a cycle-consistent training scheme to learn robust VQA models without extra rephrasing supervision
  • Proposes a VQA rephrasings dataset to evaluate the robustness of VQA models to linguistic variations
  • Improves the state of the art in VQA, VQG, and failure prediction

32 of 34

Thoughts / Insights

  • Yes, it is possible to train more robust models without additional annotated data.
  • Cycle consistency isn’t a new approach; it has been widely used in machine translation, object tracking, etc.
  • I wonder why the cycle-consistency approach isn’t more popular these days.

33 of 34

Questions?

34 of 34

Thank you!