Parameter-efficient LLM-based vision-language models
Team 11: Omid Reza Heidari, Li Gu 2024-03-13
LLM-based vision-language models
What are Multi-Modal LLMs?
What are their key features?
LLM-based Vision-Language Methods
Vision Encoder
Pre-trained LLM
Mapping Network
What country is this? Why do you think so?
It is the United States. I think so because the flag is the United States flag.
Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY
Why parameter efficiency?
Fine-tuning the entire model has several limitations:
LLM-based Vision-Language Methods
Approach | Example 1 | Example 2 | Example 3
Fine-tune the entire language model | Dai et al. 2022 | Gao et al. 2022 |
Insert and train adapter layers in the language model | MAGMA | Flamingo |
Learn the vision encoder from scratch | Frozen | |
Only learn the mapping network | MAPL | BLIP-2 | MiniGPT-4
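All four rows share the same high-level pattern: a vision encoder produces visual features, some trainable component maps them into the LLM's input space, and the LLM generates text conditioned on that visual prefix; the rows differ mainly in which pieces are trained. The sketch below is an illustrative PyTorch skeleton of this pattern (module names and shapes are placeholders, not any specific paper's implementation).

```python
import torch
import torch.nn as nn

class VisionLanguageWrapper(nn.Module):
    """Illustrative skeleton: vision encoder -> mapping module -> LLM.

    In the parameter-efficient variants (e.g. MAPL, BLIP-2) the vision encoder
    and the LLM are frozen and only the mapping module is trained.
    """

    def __init__(self, vision_encoder, mapper, llm, llm_embed):
        super().__init__()
        self.vision_encoder = vision_encoder.requires_grad_(False)  # frozen
        self.llm = llm.requires_grad_(False)                        # frozen
        self.llm_embed = llm_embed        # the LLM's token-embedding layer
        self.mapper = mapper              # the only trainable component

    def forward(self, images, input_ids):
        with torch.no_grad():
            visual_feats = self.vision_encoder(images)   # [B, L_v, D_vision]
        visual_prefix = self.mapper(visual_feats)        # [B, L_p, D_llm]
        text_embeds = self.llm_embed(input_ids)          # [B, L_t, D_llm]
        inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)
        # Assumes a HuggingFace-style decoder that accepts `inputs_embeds`
        return self.llm(inputs_embeds=inputs_embeds).logits   # [B, L_p + L_t, vocab]
```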
Drawbacks of other methods
MAPL - Architecture
A dog running in a beach.
Image: https://unsplash.com/photos/dog-running-on-beach-during-daytime-yihlaRCCvd4
Figure: image → Vision Encoder → Mapping Network (visual prefix); caption → LM Tokenizer → LM Embedder; the frozen LM self-attention consumes the visual prefix together with the caption tokens.
MAPL - Mapping Network
Figure: the mapping network. The L_i visual features (dimension D_i) are down-projected by fully connected (FC) layers to dimension D_h, concatenated with L_o learned constant embeddings (dimension D_h), and passed through a Transformer encoder; the encoder outputs at the constant-embedding positions (L_o tokens, dimension D_h) are up-projected by FC layers to L_o tokens of dimension D_o.
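A rough PyTorch sketch of a mapping network with this shape; the dimensions, layer counts, and activation choices below are placeholders, so treat it as an illustration of the figure rather than the exact MAPL module.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Down-project visual features, mix with learned constants, up-project to LM space."""

    def __init__(self, d_in=1024, d_hidden=256, d_out=4096, l_out=32,
                 num_layers=4, num_heads=8):
        super().__init__()
        # FC layers: L_i visual features from D_i down to the hidden size D_h
        self.down = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                                  nn.Linear(d_hidden, d_hidden))
        # L_o learned constant embeddings whose outputs become the prefix tokens
        self.constants = nn.Parameter(torch.randn(l_out, d_hidden))
        layer = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # FC layers: the L_o outputs from D_h up to the LM embedding size D_o
        self.up = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_out))

    def forward(self, visual_feats):                       # [B, L_i, D_i]
        x = self.down(visual_feats)                        # [B, L_i, D_h]
        const = self.constants.expand(x.size(0), -1, -1)   # [B, L_o, D_h]
        h = self.encoder(torch.cat([x, const], dim=1))     # [B, L_i + L_o, D_h]
        return self.up(h[:, -self.constants.size(0):])     # [B, L_o, D_o]
```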
MAPL - Training
Training: only the mapping network is trained from scratch, by minimizing the negative log-likelihood of the reference captions under the frozen LM, conditioned on the corresponding images (see the sketch below).
After training:
Zero-shot transfer: captioning unseen images
Few-shot transfer: unseen VQA tasks
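A hedged sketch of one optimization step under that objective, reusing the hypothetical VisionLanguageWrapper skeleton from earlier (the optimizer is assumed to hold only the mapping network's parameters):

```python
import torch.nn.functional as F

def mapl_style_step(model, optimizer, images, caption_ids):
    """Minimize the NLL of the reference caption given the mapped visual prefix."""
    logits = model(images, caption_ids)                    # [B, L_p + L_t, vocab]
    prefix_len = logits.size(1) - caption_ids.size(1)
    # Position i predicts token i+1, so caption token t is predicted at prefix_len + t - 1
    pred = logits[:, prefix_len - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),   # negative log-likelihood
                           caption_ids.reshape(-1))
    loss.backward()           # gradients reach only the mapping network (rest is frozen)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```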
MAPL - Experiments
Evaluation
- Image captioning
  - Karpathy test split of COCO; validation splits of Conceptual Captions, TextCaps, and VizWiz-Captions
  - Metrics: BLEU@4, ROUGE-L, METEOR, CIDEr, SPICE
- VQA (no VQA-specific training; evaluated via zero-/few-shot transfer)
  - Validation splits of VQAv2, OK-VQA, TextVQA, and VizWiz-VQA
  - Metric: VQA-Accuracy
Training settings
- Domain-agnostic training: a filtered version of Conceptual Captions (CC-clean), 398K image-text pairs
- In-domain training: the target dataset's own training split
- Each setting is run with 100% and with 1% of the training data
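For reference, VQA-Accuracy scores a prediction against the answers from multiple human annotators; a simplified sketch is below (the official metric additionally normalizes answers and averages over annotator subsets):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-Accuracy: fully correct if at least 3 annotators gave the answer."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "united states" -> accuracy 1.0
```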
MAPL - Image Captioning - Domain-agnostic
Model | Trainable Parameters | Training Examples | CC B@4 | CC CIDEr | COCO B@4 | COCO CIDEr | TextCaps B@4 | TextCaps CIDEr | VizWiz-Caps B@4 | VizWiz-Caps CIDEr | Overall B@4 | Overall CIDEr
Existing methods (reported results):
ClipCap CC3M | 43M | 3.3M | - | 71.82 | - | - | - | - | - | - | - | -
VLKD CC3M | 406M | 3.3M | - | - | 18.2 | 61.1 | - | - | - | - | - | -
Trained on 100% of CC-clean:
MAPL-blind CC-clean | 3.4M | 374K | 0.35 | 5.05 | 2.75 | 5.75 | 1.35 | 2.15 | 1.5 | 1.8 | 1.47 | 3.69
Frozen CC-clean | 40.3M | 374K | 2.45 | 22.6 | 5.25 | 13.9 | 2.65 | 4.6 | 2.05 | 2.65 | 3.1 | 10.94
MAPL CC-clean | 3.4M | 374K | 6.75 | 79.75 | 12.3 | 54.3 | 5.8 | 22.95 | 4.95 | 20.95 | 7.45 | 44.49
Trained on 1% of CC-clean:
Frozen CC-clean | 40.3M | 3.7K | 0.75 | 6.55 | 3.05 | 5.25 | 1.7 | 1.65 | 1.5 | 1.4 | 1.75 | 3.71
MAPL CC-clean | 3.4M | 3.7K | 1.75 | 19.65 | 5.8 | 17.85 | 2.7 | 5.4 | 2.15 | 4.85 | 3.1 | 11.94
MAPL - Image Captioning - in-domain
Model | Trainable Parameters | Training Examples | CC B@4 | CC CIDEr | COCO B@4 | COCO CIDEr | TextCaps B@4 | TextCaps CIDEr | VizWiz-Caps B@4 | VizWiz-Caps CIDEr | Overall B@4 | Overall CIDEr
Trained on 100% of the in-domain data:
Frozen∗ COCO | 40.3M | 414K | 0.65 | 9.05 | 20.05 | 61.35 | 6.95 | 11.75 | 5.45 | 6.20 | 8.28 | 22.09
Frozen TextCaps | 40.3M | 103K | 0.2 | 3.55 | 4.05 | 6.7 | 8.85 | 46.95 | 4.4 | 5.25 | 4.38 | 8.11
Frozen VizWiz | 40.3M | 110K | 0.25 | 4.4 | 3.75 | 6.05 | 4.1 | 5.65 | 19 | 76.85 | 6.78 | 23.24
ClipCap COCO | 43M | 414K | - | - | 33.53 | 113.08 | - | - | - | - | - | -
MAPL COCO | 3.4M | 414K | 2.25 | 34.5 | 36.45 | 125.2 | 16.6 | 41.4 | 18 | 41.35 | 18.33 | 60.61
MAPL TextCaps | 3.4M | 103K | 0.90 | 13.05 | 9.8 | 28.65 | 18.35 | 62.55 | 11.2 | 31.85 | 10.06 | 34.03
MAPL VizWiz | 3.4M | 110K | 0.90 | 18.8 | 13.55 | 48.35 | 11.35 | 31.2 | 34.7 | 141.3 | 15.13 | 59.91
Trained on 1% of the in-domain data:
Frozen COCO | 40.3M | 4.1K | 0.25 | 3.6 | 6.2 | 12.8 | 2.8 | 3.15 | 2.85 | 2.3 | 3.03 | 5.46
Frozen TextCaps | 40.3M | 1K | 0.1 | 2.6 | 1.65 | 2.8 | 3.65 | 5 | 2 | 2.25 | 1.85 | 3.16
Frozen VizWiz | 40.3M | 1.1K | 0.2 | 3.4 | 2.9 | 3.2 | 3.35 | 3.45 | 12.7 | 40.55 | 4.79 | 12.65
MAPL COCO | 3.4M | 4.1K | 0.8 | 12.1 | 19.65 | 65.9 | 7 | 12.85 | 6.2 | 9.6 | 8.41 | 25.11
MAPL TextCaps | 3.4M | 1K | 0.3 | 3.9 | 4.1 | 8.05 | 8.35 | 16.9 | 5 | 7.25 | 4.44 | 9.03
MAPL VizWiz | 3.4M | 1.1K | 0.2 | 3.9 | 2.95 | 4.8 | 3.45 | 5.05 | 18.4 | 71.1 | 6.25 | 21.21
MAPL - VQA - Domain-agnostic
Model | Trainable Parameters | Training Examples | VQAv2 0-shot | VQAv2 4-shot | VQAv2 8-shot | OK-VQA 0-shot | OK-VQA 4-shot | OK-VQA 8-shot | TextVQA 0-shot | TextVQA 4-shot | TextVQA 8-shot | VizWiz-QA 0-shot | VizWiz-QA 4-shot | VizWiz-QA 8-shot | Overall 0-shot | Overall 4-shot | Overall 8-shot
Existing methods (reported results):
Frozen | 40.3M | 3.3M | 29.5 | 38.2 | - | 5.9 | 12.6 | - | - | - | - | - | - | - | - | - | -
MAGMA CC12M | 243M | 3.8M | 36.9 | 45.4 | - | 13.9 | 23.4 | - | - | - | - | 5.6 | 10.6 | - | - | - | -
VLKD CC3M | 406M | 3.3M | 38.6 | - | - | 10.5 | - | - | - | - | - | - | - | - | - | - | -
LiMBeR-CLIP | 12.6M | 3.3M | 33.33 | 40.34 | - | - | - | - | - | - | - | - | - | - | - | - | -
Flamingo | 10.2B | > 2.1B | - | - | - | 57.6 | 57.4 | 57.5 | 35 | 36.5 | 37.3 | - | - | - | - | - | -
Trained on 100% of CC-clean:
MAPL-blind CC-clean | 3.4M | 374K | 20.62 | 35.01 | 35.11 | 4.84 | 14.68 | 14.28 | 3.68 | 5.43 | 5.82 | 3.18 | 8.65 | 9.55 | 8.08 | 15.94 | 16.19
Frozen CC-clean | 40.3M | 374K | 25.98 | 37.80 | 38.52 | 5.51 | 18.86 | 19.91 | 5.11 | 6.15 | 6.30 | 4.33 | 11.28 | 16.68 | 10.23 | 18.52 | 20.35
MAPL CC-clean | 3.4M | 374K | 33.54 | 45.13 | 45.21 | 13.84 | 24.25 | 23.93 | 8.26 | 8.88 | 8.77 | 11.72 | 18.46 | 19.52 | 16.84 | 24.18 | 24.36
Trained on 1% of CC-clean:
Frozen CC-clean | 40.3M | 3.7K | 26.22 | 36.69 | 37.41 | 5.5 | 18.76 | 20.51 | 5.71 | 7.19 | 7.53 | 3.83 | 11.71 | 16.66 | 10.31 | 18.58 | 20.53
MAPL CC-clean | 3.4M | 3.7K | 30.80 | 37.37 | 37.95 | 8.77 | 18.18 | 19.15 | 6.40 | 7.07 | 7.74 | 5.68 | 9.26 | 10.58 | 12.91 | 17.97 | 18.85
(Overall = average over the four VQA datasets.)
MAPL - VQA - in-domain
Model | Trainable Parameters | Training Examples | VQAv2 0-shot | VQAv2 4-shot | VQAv2 8-shot | OK-VQA 0-shot | OK-VQA 4-shot | OK-VQA 8-shot | TextVQA 0-shot | TextVQA 4-shot | TextVQA 8-shot | VizWiz-QA 0-shot | VizWiz-QA 4-shot | VizWiz-QA 8-shot | Overall 0-shot | Overall 4-shot | Overall 8-shot
Trained on 100% of the in-domain data:
PICa | 0 | 0 | 20.61 | 46.86 | 47.80 | 11.84 | 31.28 | 33.07 | - | - | - | - | - | - | - | - | -
Frozen∗ COCO | 40.3M | 414K | 32.09 | 38.9 | 39.42 | 9.81 | 20.72 | 21.83 | 7.54 | 6.82 | 6.74 | 5.87 | 12.07 | 17.35 | 13.82 | 19.63 | 21.33
Frozen TextCaps | 40.3M | 103K | 32.49 | 37.39 | 38.03 | 11.34 | 19.87 | 20.82 | 8.83 | 7.33 | 7.51 | 6.25 | 12.26 | 16.86 | 14.73 | 19.21 | 20.8
Frozen VizWiz | 40.3M | 110K | 26.93 | 37.38 | 37.91 | 5.85 | 19.12 | 20.64 | 6.38 | 7.44 | 7.47 | 5.57 | 13.06 | 18.06 | 11.18 | 19.25 | 21.02
MAPL COCO | 3.4M | 414K | 43.51 | 48.75 | 48.44 | 18.27 | 31.13 | 31.63 | 10.99 | 11.1 | 11.08 | 14.05 | 17.72 | 19.18 | 21.7 | 27.17 | 27.58
MAPL TextCaps | 3.4M | 103K | 38.83 | 43.34 | 43.43 | 16.33 | 25.07 | 25.92 | 22.27 | 19.53 | 19.75 | 12.31 | 16.69 | 18.18 | 22.43 | 26.15 | 26.82
MAPL VizWiz | 3.4M | 110K | 32.8 | 42.94 | 43.2 | 11.7 | 24.91 | 25.73 | 9.27 | 10.36 | 10.23 | 10.42 | 20.63 | 23.10 | 16.05 | 24.71 | 25.56
Trained on 1% of the in-domain data:
Frozen COCO | 40.3M | 4.1K | 30.18 | 37.23 | 37.89 | 9.33 | 19.6 | 20.71 | 7.43 | 7.65 | 7.67 | 4.37 | 12 | 16.48 | 12.83 | 19.12 | 20.69
Frozen TextCaps | 40.3M | 1K | 32.09 | 36.72 | 37.25 | 10.75 | 18.85 | 19.51 | 8.17 | 7.57 | 7.28 | 5.39 | 11.79 | 16.20 | 14.1 | 18.73 | 20.06
Frozen VizWiz | 40.3M | 1.1K | 29.6 | 37.3 | 37.87 | 7.57 | 19.36 | 20.6 | 7.16 | 7.17 | 7.25 | 4.53 | 12.51 | 17.56 | 12.22 | 19.08 | 20.82
MAPL COCO | 3.4M | 4.1K | 37.69 | 40.42 | 40.84 | 13.92 | 21.66 | 22.41 | 8.3 | 6.96 | 6.84 | 6.94 | 10.72 | 12.43 | 16.71 | 19.94 | 20.63
MAPL TextCaps | 3.4M | 1K | 33.57 | 36.7 | 36.87 | 12.46 | 17.75 | 18.21 | 9.34 | 8.29 | 8.62 | 6.54 | 9.58 | 11.62 | 15.48 | 18 | 18.83
MAPL VizWiz | 3.4M | 1.1K | 31.88 | 36.81 | 37.04 | 9.59 | 17.64 | 17.64 | 7.25 | 5.99 | 6.04 | 4.73 | 9.48 | 11.33 | 13.36 | 17.48 | 18.01
(Overall = average over the four VQA datasets.)
Drawbacks of MAPL
The MAPL model has several disadvantages; two of them are described below:
BLIP-2: Bootstrapping Language-Image Pre-training
with Frozen Image Encoders and Large Language Models
ICML 2023
Overview
Inference: align the visual representation with the language model
Key challenge: since the LLM has never seen images during its pre-training, how can the pure vision representation be “transformed” into a format the LLM can effectively use?
Inference: Image Encoder + Q-Former
The Query Transformer (Q-Former) acts as a “filter”: a fixed set of learned queries distills the pure visual features into a language-informative visual feature.
Pure visual feature: [H×W, D]
Learned queries: [N, D]
Query output (language-informative visual feature): [N, D]
Inference: Linear projection + LLM
A fully connected layer projects the Q-Former's query output into the LLM's input embedding space; the projected tokens are prepended to the text prompt (e.g., “What is in the image?”) and the frozen LLM generates the answer.
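Conceptually, the inference glue code looks like the sketch below; the HuggingFace-style generate/tokenizer calls are assumptions for illustration, not BLIP-2's actual API.

```python
import torch

@torch.no_grad()
def blip2_style_generate(image_encoder, qformer, proj, llm, llm_embed, tokenizer,
                         image, prompt="What is in the image?"):
    """Hypothetical inference path: image -> frozen encoder -> Q-Former -> FC -> frozen LLM."""
    image_feats = image_encoder(image)                   # [1, H*W, D_vision]
    query_out = qformer(image_feats)                     # [1, N, D_qformer]
    visual_tokens = proj(query_out)                      # [1, N, D_llm]  (fully connected layer)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = llm_embed(prompt_ids)                # [1, L_t, D_llm]
    inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=30)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```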
Pre-training stage 1: Vision & Language representation learning
Q: How do we learn a Q-Former that can comprehend both visual and textual information and extract language-informative visual features?
A: Pre-train it on three vision-language proxy tasks using image-text pairs:
Image-text contrastive learning
Image-text matching
Image-grounded text generation
Q: What is the architecture of the Q-Former, and how are the different proxy tasks performed?
The Q-Former takes learned query embeddings and text embeddings as input; the queries interact with the frozen image features through cross-attention, and different self-attention masks between the queries and the text implement the three proxy tasks (as sketched below).
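One way to make the “different proxy tasks via different query-text interactions” idea concrete is the attention-mask sketch below, a simplified rendering of the masking scheme described in the BLIP-2 paper (True means the row position may attend to the column position):

```python
import torch

def qformer_attention_mask(num_queries: int, num_text: int, task: str) -> torch.Tensor:
    """Self-attention mask over the concatenated [queries | text] sequence."""
    n = num_queries + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    q, t = slice(0, num_queries), slice(num_queries, n)
    mask[q, q] = True                          # queries always attend to each other
    if task == "contrastive":                  # uni-modal: queries and text stay separate
        mask[t, t] = True
    elif task == "matching":                   # bi-directional: everything attends to everything
        mask[:, :] = True
    elif task == "generation":                 # text sees queries, causal mask within text
        mask[t, q] = True
        mask[t, t] = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    return mask
```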
Pre-training stage 2: Vision-to-Language generative learning
A two-stage pre-training strategy:
Stage 1: image-text pair losses (contrastive, matching, image-grounded generation)
Stage 2: language generation loss with the frozen LLM
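The split of what gets optimized at each stage can be summarized by the hypothetical helper below (the image encoder and the LLM stay frozen throughout; qformer and proj are placeholder module names):

```python
def trainable_parameters(stage: int, qformer, proj):
    """Return the parameters optimized at each BLIP-2 pre-training stage."""
    if stage == 1:
        # Stage 1: only the Q-Former, trained with the image-text pair losses
        return list(qformer.parameters())
    # Stage 2: Q-Former plus the projection layer, trained with the LLM's generation loss
    return list(qformer.parameters()) + list(proj.parameters())
```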
Pre-training Dataset & Benchmarks
Pre-training Data
Experiments
Results: zero-shot VQA
Results: without Q-Former pre-training (stage 1)
These ablations demonstrate that decoupling the end-to-end training into two stages is crucial for state-of-the-art results.
Limitations
Summary
Parameter-efficient LLM-based vision-language models leverage LLMs' strong capabilities (e.g., zero-shot and in-context learning) for vision-language tasks while making minimal changes to the model's architecture and parameters.
The two adapter-free approaches, MAPL and BLIP-2, reduce the need for large numbers of trainable parameters, GPU resources, and large-scale pre-training datasets.
The next research directions may include:
Comparison: Flamingo vs. MAPL vs. BLIP-2
Feature | Flamingo | MAPL | BLIP-2
Model capabilities | zero-shot; few-shot | zero-shot; few-shot | zero-shot
Model architecture | Perceiver Resampler; a cross-attention layer for each LLM block | Mapping network: 4 transformer blocks | Q-Former: BERT-base, 12 blocks
Trainable parameters | > 1.4B | 3.4M | 104M
Pre-training dataset size | 2B | ~400K | ~129M