Vision-Language Models
Fardin Ayar
[1] Gan, Zhe, et al. "Vision-language pre-training: Basics, recent advances, and future trends." Foundations and Trends® in Computer Graphics and Vision 14.3–4 (2022): 163-352.
[2] Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[3] Awais, Muhammad, et al. "Foundational models defining a new era in vision: A survey and outlook." arXiv preprint arXiv:2307.13721 (2023).
[4] Li, Chunyuan, et al. "Multimodal foundation models: From specialists to general-purpose assistants." Foundations and Trends® in Computer Graphics and Vision 16.1-2 (2024): 1-214.
Recap: Transformers
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
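A minimal stdlib-only sketch of the core operation from "Attention Is All You Need", scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Matrices are plain nested lists here; a real implementation would use batched tensors and multiple heads.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q, K, V: lists of vectors (lists of floats); returns one output vector per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs:
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

The query is closer to the first key, so the output is a convex combination of the two value rows weighted toward the first.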
Recap: ViT
Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." International Conference on Learning Representations. 2020.
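A minimal sketch of how ViT tokenizes an image ("an image is worth 16x16 words"): split an H × W image into non-overlapping P × P patches and flatten each patch into a vector, which is then linearly projected into a token embedding (projection omitted here). Pure stdlib; pixels are nested lists.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened patch_size x patch_size patches."""
    H, W = len(image), len(image[0])
    assert H % patch_size == 0 and W % patch_size == 0
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            # Flatten one patch in row-major order.
            patch = [image[i + di][j + dj]
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            patches.append(patch)
    return patches

# A 4x4 "image" with 2x2 patches -> 4 patches, each flattened to length 4:
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)
print(len(tokens), len(tokens[0]))  # 4 4
```

With ViT's defaults (224 × 224 images, 16 × 16 patches) this yields 14 × 14 = 196 tokens per image.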
Image-Text Pretraining: Contrastive Target
CLIP
Zhang, Jingyi, et al. "Vision-language models for vision tasks: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
CLIP
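A minimal stdlib sketch of CLIP's symmetric contrastive objective: given N image and N text embeddings whose matched pairs lie on the diagonal, compute temperature-scaled cosine similarities and average the image-to-text and text-to-image cross-entropy losses. The toy 2D embeddings are illustrative, not from any real encoder.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    # -log softmax(logits)[target], computed stably.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = [normalize(v) for v in img_emb]
    txt = [normalize(v) for v in txt_emb]
    # logits[i][j]: cosine similarity of image i with text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(ii, tj)) / temperature for tj in txt]
              for ii in img]
    n = len(logits)
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy_row([logits[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Perfectly aligned pairs give a near-zero loss; mismatched pairs a large one.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

In CLIP the temperature is a learned parameter and the batch (the set of negatives) is very large; both matter for the quality of the learned embedding space.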
Compositional Understanding of VLMs
Zeng, Yunan, et al. "Investigating compositional challenges in vision-language models for visual grounding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
What’s “up” with vision-language models? Investigating their struggle with spatial reasoning
Context Optimization or Prompt Tuning
Zhou, Kaiyang, et al. "Learning to prompt for vision-language models." International Journal of Computer Vision 130.9 (2022): 2337-2348.
Visual Prompts!
Bahng, Hyojin, et al. "Exploring visual prompts for adapting large-scale models." arXiv preprint arXiv:2203.17274 (2022).
CLIP-Adapter
Gao, Peng, et al. "Clip-adapter: Better vision-language models with feature adapters." International Journal of Computer Vision 132.2 (2024): 581-595.
I can’t believe there’s no images!
Gu, Sophia, Christopher Clark, and Aniruddha Kembhavi. "I can't believe there's no images! learning visual tasks using only language supervision." Proceedings of the IEEE/CVF international conference on computer vision. 2023.
Image-Text Pretraining: Generative Target
CoCa: Contrastive Captioners
Yu, Jiahui, et al. "CoCa: Contrastive captioners are image-text foundation models." Transactions on Machine Learning Research (2022).
BLIP: Unified Vision-Language Understanding and Generation
Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International conference on machine learning. PMLR, 2022.
BLIP: Advanced Tasks
Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International conference on machine learning. PMLR, 2022.
Multimodal Large Language Models
Recap: Instruction Tuning
Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
Recap: Instruction Tuning
Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
Visual Instruction Tuning (LLaVA)
Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2023): 34892-34916.
Visual Instruction Tuning (LLaVA)
Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2023): 34892-34916.
Multimodal Large Language Models
Caffagni, Davide, et al. "The revolution of multimodal large language models: a survey." arXiv preprint arXiv:2402.12451 (2024).
InstructBLIP
Dai, Wenliang, et al. "InstructBLIP: Towards general-purpose vision-language models with instruction tuning." Advances in Neural Information Processing Systems 36 (2023).