1 of 13

Augmenting Language Models with Vision Capabilities [UPDATE]

Yash Vadi

Uday Karan Kapur

2 of 13

Problem Statement

A vision-language model is trained to understand visual input, such as images or videos, and to generate natural language descriptions of it. These models require significant compute because they must process visual and textual information simultaneously. In many scenarios, however, resource constraints make training such a model from scratch impractical or infeasible.

3 of 13

Research Question

Can we impart vision capabilities to an existing language model, without fully retraining it on a very large dataset, while preserving its reasoning capabilities?

4 of 13

LLaVA-1.5

LLaVA-1.5 integrates a CLIP visual encoder with a LLaMA language decoder. CLIP encodes images and text into a shared space, while LLaMA excels at text comprehension. This combination lets the model take mixed image and text inputs and generate text grounded in both modalities. LLaVA-1.5 finishes full training in ∼1 day on a single node with 8 A100 GPUs.
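As a rough illustration of how an LLaVA-style pipeline wires these pieces together (a sketch under assumptions, not the authors' code; the Hugging Face-style interfaces and dimensions below are assumed, and LLaVA-1.5's projector is a small MLP rather than a single linear layer):

```python
# Illustrative sketch of an LLaVA-style forward pass (not the official implementation).
# Image patches are encoded by CLIP, projected into the LLM embedding space,
# and prepended to the text token embeddings before the language decoder runs.
import torch
import torch.nn as nn

class LLaVAStyleModel(nn.Module):
    def __init__(self, clip_vision, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.clip_vision = clip_vision          # frozen CLIP vision tower
        self.llm = llm                          # LLaMA-style decoder
        # LLaVA-1.5 uses a small MLP projector; a two-layer MLP is shown here.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, pixel_values, input_ids):
        with torch.no_grad():                   # vision tower stays frozen
            patches = self.clip_vision(pixel_values).last_hidden_state
        image_tokens = self.projector(patches)          # (B, N_patches, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```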

5 of 13

CALM

6 of 13

Proposed Model Architecture

  • CLIP is the auxiliary model; a pre-trained LLM is the base model to be augmented with vision capabilities.
  • Intermediate hidden states of the LLM attend, via cross-attention, to image features projected into the language embedding space.
  • The cross-attention output is added as a residual connection to the input of the next LLM layer (see the adapter sketch below).
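A minimal sketch of one such cross-attention adapter, assuming PyTorch and illustrative dimensions; the exact layer placement, hidden sizes, and normalization used in the project are not specified here:

```python
# Sketch of a CALM-style cross-attention adapter (illustrative, not the exact code).
# The LLM's intermediate hidden states attend to projected image features, and the
# result is added back as a residual before the next LLM layer.
import torch
import torch.nn as nn

class VisionCrossAttentionAdapter(nn.Module):
    def __init__(self, llm_dim=2048, vision_dim=1024, n_heads=8):
        super().__init__()
        # Project CLIP image features into the language model's hidden size.
        self.img_proj = nn.Linear(vision_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, hidden_states, image_feats):
        # hidden_states: (B, T, llm_dim) output of an intermediate LLM layer
        # image_feats:   (B, N_patches, vision_dim) from the frozen CLIP encoder
        img = self.img_proj(image_feats)
        attn_out, _ = self.cross_attn(query=self.norm(hidden_states),
                                      key=img, value=img)
        # Residual connection: this sum becomes the input of the next LLM layer.
        return hidden_states + attn_out
```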

7 of 13

Experiments

  • Auxiliary vision model: CLIP encoder
  • Base language models: GPT-2 (small, ~120M) and Stable LM 2 Zephyr 1.6B (see the loading sketch below)
  • Dataset: 23k high-quality image-caption pairs from LLaVA-Instruct-150K
  • Compute: Google Colab and the Calcul Québec cluster (single node with an 8 GB V100 GPU)
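A rough loading sketch for this setup; the Hugging Face checkpoint names below (`openai/clip-vit-large-patch14`, `gpt2`, `stabilityai/stablelm-2-zephyr-1_6b`) are assumptions about which public weights correspond to the models listed above:

```python
# Illustrative setup: a frozen CLIP vision encoder (auxiliary model) plus a small
# base LLM to be augmented. Checkpoint names are assumed, not confirmed.
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision.requires_grad_(False)  # the auxiliary vision model stays frozen

# Proof-of-concept base LLM; swap "gpt2" for "stabilityai/stablelm-2-zephyr-1_6b"
# for the 1.6B run.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_lm = AutoModelForCausalLM.from_pretrained("gpt2")
```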

8 of 13

Update

  • Base code setup
  • Proof of concept on the small GPT-2 model
  • Small training run on the target LLM (1.6B)

9 of 13

10 of 13

Prompt: “Question: What do you see happening in this image? Answer:”

Generation: 4 people sitting at a dining table, enjoying a meal together. The table is set with plates, cups, and wine glasses. The dining area is well-lit…

11 of 13

Prompt: “Question: What colour shirt is the person wearing? Answer:”

Generation: A blue shirt is visible on the boy's body in the scene.
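For illustration, a hedged sketch of how such a VQA-style prompt might be run through the augmented model, reusing the tokenizer and image processor from the setup sketch above; `augmented_model` and its `generate` signature are hypothetical placeholders, not the project's actual interface:

```python
# Hypothetical inference call: the image is encoded by the frozen CLIP tower, and the
# "Question: ... Answer:" template prompts the augmented LLM. Interface is illustrative.
from PIL import Image

image = Image.open("example.jpg")  # placeholder path
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
prompt = "Question: What colour shirt is the person wearing? Answer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = augmented_model.generate(  # hypothetical wrapper: base LLM + adapters
    input_ids=input_ids, pixel_values=pixel_values, max_new_tokens=64
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```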

12 of 13

Next Steps

By next presentation,

  • Train the model on a comparatively large-scale dataset.
  • Benchmark the method on zero-shot VQA tasks.
  • Run ablations on different interconnections between the vision and language models.
  • And then…

13 of 13

Thank You!