Augmenting Language Models with Vision Capabilities [UPDATE]
Yash Vadi
Uday Karan Kapur
Problem Statement
Training a vision-language model means building a model that can understand visual input, such as images or videos, and generate natural-language descriptions of it. These models demand significant compute because visual and textual information must be processed jointly. In many scenarios, however, training such a model from scratch is impractical or infeasible due to resource constraints.
Research Question
Can we impart vision capabilities to an existing language model without fully retraining it on a very large dataset, while maintaining its reasoning capabilities?
LLaVA-1.5
LLaVA-1.5 integrates a CLIP visual encoder with a LLaMA language decoder. CLIP encodes images and text into a shared embedding space, while LLaMA excels at text comprehension and generation. The combination lets the model accept interleaved image and text inputs and generate text conditioned on both modalities. LLaVA-1.5 finishes full training in ∼1 day on a single node with 8 A100 GPUs.
Ref: Improved Baselines with Visual Instruction Tuning (Oct 2023)
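To make the recipe concrete, here is a minimal PyTorch sketch of this kind of integration: a frozen CLIP-style vision encoder, a small trainable projector, and a LLaMA-style decoder that consumes the projected image tokens. The class, argument names, and dimensions are illustrative assumptions, not the actual LLaVA-1.5 code.

import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Sketch: frozen CLIP vision tower + trainable MLP projector + LLaMA decoder."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. CLIP ViT-L/14, kept frozen
        self.language_model = language_model      # e.g. a LLaMA/Vicuna decoder
        # LLaVA-1.5 maps vision features into the LM embedding space with a 2-layer MLP.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, text_embeds):
        # Encode image patches without updating the vision tower.
        with torch.no_grad():
            patch_features = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        image_tokens = self.projector(patch_features)            # (B, N, text_dim)
        # Prepend projected image tokens to the text embeddings and decode as usual.
        fused = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)

In this setup only the projector (and, during instruction tuning, the language model) receives gradient updates while the vision tower stays frozen, which is what keeps the training budget near a single 8-GPU node.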
CALM
CALM (Composition to Augment Language Models) keeps both an anchor LLM and an augmenting model frozen and trains only a small set of cross-attention layers between them, so new capabilities can be added without retraining either base model.
Ref: LLM Augmented LLMs: Expanding Capabilities through Composition (Jan 2024)
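A minimal PyTorch sketch of one such composition block is below; the projection, layer placement, and head count are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class CompositionBlock(nn.Module):
    """Sketch of a CALM-style bridge: the anchor model's hidden states attend
    over the augmenting model's hidden states through a newly added
    cross-attention layer; both base models stay frozen and only this block
    (and its projection) is trained."""

    def __init__(self, anchor_dim, aug_dim, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(aug_dim, anchor_dim)   # match the anchor model's width
        self.cross_attn = nn.MultiheadAttention(anchor_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(anchor_dim)

    def forward(self, anchor_hidden, aug_hidden):
        aug = self.proj(aug_hidden)                  # (B, T_aug, anchor_dim)
        attended, _ = self.cross_attn(anchor_hidden, aug, aug)
        # Residual connection: the anchor representation is enriched, not overwritten.
        return self.norm(anchor_hidden + attended)

Because only these bridge parameters are trained on a comparatively small composition dataset, this style of approach is attractive when full retraining is off the table.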
Proposed Model Architecture
Experiments
Update
Prompt: “Question: What do you see happening in this image? Answer:”
Generation: 4 people sitting at a dining table, enjoying a meal together. The table is set with plates, cups, and wine glasses. The dining area is well-lit…
Prompt: “Question: What colour shirt is the person wearing? Answer:”
Generation: A blue shirt is visible on the boy's body in the scene.
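For context, these qualitative examples follow a simple image-plus-question prompting setup. The rough sketch below shows how similar generations could be produced with an off-the-shelf LLaVA-1.5 checkpoint through Hugging Face Transformers as a point of comparison; the checkpoint id, prompt template, and image path are assumptions, not the exact setup used for the outputs above.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # assumed public checkpoint for comparison
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")       # placeholder image path
prompt = "USER: <image>\nQuestion: What do you see happening in this image? Answer: ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))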
Next Steps
By next presentation,
Thank You!