1 of 22

Vision-Language Models

Fardin Ayar

[1] Gan, Zhe, et al. "Vision-language pre-training: Basics, recent advances, and future trends." Foundations and Trends® in Computer Graphics and Vision 14.3–4 (2022): 163-352.

[2] Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

[3] Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).

[4] Li, Chunyuan, et al. "Multimodal foundation models: From specialists to general-purpose assistants." Foundations and Trends® in Computer Graphics and Vision 16.1-2 (2024): 1-214.

2 of 22

Recap: Transformers

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

3 of 22

Recap: ViT

Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." International Conference on Learning Representations. 2021.

4 of 22

Image-Text Pretraining: Contrastive Target

5 of 22

CLIP

Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

6 of 22

CLIP

  • CLIP is trained from scratch on a dataset of 400 million (image, text) pairs.
  • Its vision encoder is a standard ResNet-50 (later also ViT), nothing exotic, so anyone can use it.
  • All models are trained for 32 epochs with a very large minibatch size of 32,768.
  • There is a lot of work on using CLIP for downstream tasks, especially in other domains (e.g., video understanding); numerous papers on CLIP applications appeared at CVPR 2022–2025.
  • CLIP is the standard image encoder for MLLMs (multimodal large language models).
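The contrastive target behind the bullets above can be sketched as a symmetric InfoNCE loss over a batch of matched (image, text) embedding pairs. This is a minimal NumPy sketch of the idea, not CLIP's actual implementation; the function name, temperature value, and embedding shapes are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each array is a matched pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (batch, batch) similarity logits; the diagonal holds matched pairs.
    logits = image_emb @ text_emb.T / temperature

    def xent(l):
        # Cross-entropy with targets on the diagonal (each image's
        # correct caption, or vice versa), computed in log-space.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With a batch size of 32,768, each image sees one positive caption and 32,767 in-batch negatives, which is why the very large minibatch matters for this objective.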

7 of 22

Compositional Understanding of VLMs

Zeng, Yunan, et al. "Investigating compositional challenges in vision-language models for visual grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Kamath, Amita, Jack Hessel, and Kai-Wei Chang. "What's 'up' with vision-language models? Investigating their struggle with spatial reasoning." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

8 of 22

Context Optimization or Prompt Tuning

Zhou, Kaiyang, et al. "Learning to prompt for vision-language models." International Journal of Computer Vision 130.9 (2022): 2337-2348.

9 of 22

Visual Prompts!

Bahng, Hyojin, et al. "Exploring visual prompts for adapting large-scale models." arXiv preprint arXiv:2203.17274 (2022).

10 of 22

CLIP-Adapter

Gao, Peng, et al. "Clip-adapter: Better vision-language models with feature adapters." International Journal of Computer Vision 132.2 (2024): 581-595.

11 of 22

I can’t believe there’s no images!

Gu, Sophia, Christopher Clark, and Aniruddha Kembhavi. "I can't believe there's no images! Learning visual tasks using only language supervision." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

12 of 22

Image-Text Pretraining: Generative Target

13 of 22

CoCa: Contrastive Captioners

Yu, Jiahui, et al. "CoCa: Contrastive captioners are image-text foundation models." Transactions on Machine Learning Research (2022).

14 of 22

BLIP: Unified Vision-Language Understanding and Generation

Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International conference on machine learning. PMLR, 2022.

15 of 22

BLIP: Advanced Tasks

Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International conference on machine learning. PMLR, 2022.

16 of 22

Multimodal Large Language Models

17 of 22

Recap: Instruction Tuning

Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).

18 of 22

Recap: Instruction Tuning

Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).

19 of 22

Visual Instruction Tuning (LLaVA)

Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2023): 34892-34916.

20 of 22

Visual Instruction Tuning (LLaVA)

Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2023): 34892-34916.

21 of 22

Multimodal Large Language Models

Caffagni, Davide, et al. "The revolution of multimodal large language models: a survey." arXiv preprint arXiv:2402.12451 (2024).

22 of 22

InstructBLIP

Dai, Wenliang, et al. "InstructBLIP: Towards general-purpose vision-language models with instruction tuning." Advances in Neural Information Processing Systems 36 (2023).