1 of 34

Beyond statistical learning in vision-language

Aditya Sharma and Aman Dalmia

2 of 34

Classical approach to VQA

Cadene, R., Dancette, C., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32.

3 of 34

Problems with VQA models - Ignoring the image


4 of 34

Can we use adversarial regularization to overcome language priors?

Ramakrishnan, S., Agrawal, A., & Lee, S. (2018). Overcoming language priors in visual question answering with adversarial regularization. Advances in Neural Information Processing Systems, 31.
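A minimal numeric sketch of the core idea (all logits and the trade-off weight are toy values, not the paper's): a question-only adversary tries to predict the answer from the question alone, and the question encoder is trained against it (via gradient reversal in the paper), so the encoder effectively sees the VQA loss minus the adversary's loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class."""
    return -math.log(softmax(logits)[target])

# Toy logits over 3 candidate answers (hypothetical numbers).
vqa_logits = [2.0, 0.5, -1.0]     # fused image + question model
q_only_logits = [1.5, 0.2, -0.5]  # question-only adversary
answer = 0                        # ground-truth answer index
lam = 0.5                         # hypothetical trade-off weight

# The adversary minimizes its own cross-entropy; through gradient
# reversal the question encoder maximizes it, so the encoder's
# effective objective is:
encoder_loss = cross_entropy(vqa_logits, answer) - lam * cross_entropy(q_only_logits, answer)
```

Intuition: if the question alone predicts the answer well, the second term is small in magnitude but its gradient pushes the question embedding to carry less answer-revealing bias.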

5 of 34

Can we use adversarial regularization to overcome language priors?


6 of 34

RUBi: Reducing Unimodal Biases for VQA

Cadene, R., Dancette, C., Cord, M., & Parikh, D. (2019). RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems, 32.

7 of 34

RUBi: Reducing Unimodal Biases for VQA


8 of 34

RUBi: Reducing Unimodal Biases for VQA


9 of 34

RUBi: Reducing Unimodal Biases for VQA

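RUBi's training-time mechanism can be sketched in a few lines (toy numbers, not the paper's implementation): a question-only branch produces logits whose sigmoid masks the base model's logits, shrinking the loss, and hence the gradients, on examples the question alone already answers; the question-only branch also receives its own classification loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

# Toy logits over 3 candidate answers (hypothetical numbers).
fused_logits = [2.0, 0.5, -1.0]    # base (image + question) model
q_only_logits = [3.0, -1.0, -2.0]  # question-only branch
answer = 0

# RUBi: mask the base logits with the sigmoid of the question-only
# logits, apply cross-entropy to the masked logits, and add a
# separate cross-entropy on the question-only branch itself.
masked = [f * sigmoid(q) for f, q in zip(fused_logits, q_only_logits)]
rubi_loss = cross_entropy(masked, answer) + cross_entropy(q_only_logits, answer)
```

At test time the question-only branch and the mask are discarded, and the base model's logits are used alone.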

10 of 34

Problems with VQA models - Linguistic Diversity

Kant, Y., Moudgil, A., Batra, D., Parikh, D., & Agrawal, H. (2021). Contrast and classify: Training robust VQA models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1604-1613).

Are models robust to linguistic variations?

Are models capable of handling linguistic diversity?

Can models cope with different linguistic forms reliably?

11 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


Contrastive loss

Encourages representations to be robust to linguistic variations

Cross-entropy loss

Preserves the discriminative power of representations
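The contrastive term can be illustrated with an InfoNCE/NT-Xent-style toy computation (the 2-D embeddings and temperature are hypothetical): a question and its rephrasing are pulled together relative to an unrelated question, while the usual answer cross-entropy (not shown here) preserves discriminative power.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy 2-D question embeddings (hypothetical values).
anchor   = [1.0, 0.1]   # original question
positive = [0.9, 0.2]   # its rephrasing
negative = [-0.8, 1.0]  # unrelated question
tau = 0.5               # temperature (hypothetical)

# InfoNCE-style term: the anchor should be closer to its rephrasing
# than to the unrelated question.
pos_sim = math.exp(cosine(anchor, positive) / tau)
neg_sim = math.exp(cosine(anchor, negative) / tau)
contrastive_loss = -math.log(pos_sim / (pos_sim + neg_sim))
```

The loss is already small here because the anchor and its rephrasing nearly coincide; with a real batch, every other question in the batch would contribute a negative term to the denominator.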

12 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


13 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss


14 of 34

ConClaT: Contrastive Learning + Cross Entropy Loss

15 of 34

Thoughts: How would I solve this problem today?

Using Chain-of-Thought to help the model think and reason about why it is making the prediction

What animals are in this picture?

  • The animals are pulling a cart
  • They have a long mane
  • They have four legs and a long nose
  • They are too tall to be donkeys
  • Therefore the answer is horses
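One hypothetical way to operationalize this: format the question with an instruction to list observations before committing to an answer (the prompt wording is entirely illustrative, not from any cited paper).

```python
question = "What animals are in this picture?"

# Hypothetical chain-of-thought prompt mirroring the steps above:
# observe first, then conclude.
prompt = (
    f"Question: {question}\n"
    "Before answering, list what you observe in the image "
    "(what the animals are doing, their size, mane, and legs), "
    "then end with 'Therefore the answer is ...'.\n"
    "Observations:"
)
```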

16 of 34

Cycle-Consistency for Robust Visual Question Answering - Shah et al., 2019

VQA Models are amazing

Image Source: https://arxiv.org/pdf/1612.00837.pdf

17 of 34

VQA Models are brittle

Image Source: https://arxiv.org/pdf/1902.05660.pdf

Predictions from Pythia VQA 2018 Challenge Winner

18 of 34

Evaluating and benchmarking models for robustness


  • Can we train VQA models to be robust to linguistic variations with minimal supervision?

  • How do we evaluate VQA models for linguistic robustness?

19 of 34

Cycle consistent training scheme

Diagram: image I and question Q are fed to the VQA Model, which predicts an answer A`. A Visual Question Generation module takes I and A` and generates a rephrased question Q`. The same VQA Model then answers Q` on image I, producing A``.

20 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the VQA Loss between the predicted answer A` and the ground-truth answer A.

21 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the Question Consistency Loss between the generated question Q` and the original question Q.

22 of 34

Cycle consistent training scheme

Diagram: the same cycle, highlighting the VQA Loss on A` and the Answer Consistency Loss between A`` (the answer to the rephrased question) and the ground-truth A.

23 of 34

Cycle consistent training scheme

Diagram: the full training scheme combines the VQA Loss, the Question Consistency Loss, and the Answer Consistency Loss.
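Putting the three terms together, the full objective can be sketched as a weighted sum (the loss values and weights here are hypothetical, not the paper's):

```python
# Per-example losses for one training step (toy values).
l_vqa = 0.9   # VQA loss: A` vs. the ground-truth answer A
l_qc  = 1.4   # question consistency: generated Q` vs. original Q
l_ac  = 0.7   # answer consistency: A`` (answer to Q`) vs. ground-truth A

# Hypothetical weights on the consistency terms.
lambda_q, lambda_a = 1.0, 0.5

total_loss = l_vqa + lambda_q * l_qc + lambda_a * l_ac
```

In the paper, a gating mechanism additionally filters out low-quality generated rephrasings so that noisy Q` do not contribute to the consistency terms.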

24 of 34

VQA Rephrasing

  • Evaluation only
  • Based on the VQA v2 validation set
  • Human-collected

25 of 34

VQA Rephrasing

  • 40K images
  • 160K questions
  • 4 rephrasings per question

26 of 34

Does cycle-consistent training make models robust?

New SOTA

27 of 34

Does cycle-consistent training make models robust?

New SOTA

Previous SOTA

28 of 34

Does cycle-consistent training improve VQA performance?

New SOTA

Previous SOTA

29 of 34

Does cycle-consistent training improve question generation?

30 of 34

Do cycle-consistent models better predict their failures?

31 of 34

Conclusion

  • Proposes a cycle-consistent training scheme to learn robust VQA models without extra rephrasing supervision
  • Proposes a VQA rephrasings dataset to evaluate the robustness of VQA models to linguistic variations
  • Improves the state of the art in VQA, VQG, and failure prediction

32 of 34

Thoughts / Insights

  • Yes, it is possible to train more robust models without additional annotated data.
  • Cycle consistency isn’t a new approach; it has been widely used in machine translation, object tracking, etc.
  • I wonder why the cycle-consistency approach isn’t more popular these days.

33 of 34

Questions?

34 of 34

Thank you!