Parameter-efficient LLM-based vision-language models
Team 11: Omid Reza Heidari, Li Gu 2024-03-13
LLM-based vision-language models
What are Multi-Modal LLMs?
What are their key features?
LLM-based Vision-Language Methods
Vision Encoder
Pre-trained LLM
Mapping Network
What country is this? Why do you think so?
It is the United States. I think so because the flag is the United States flag.
Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY
Why parameter efficiency?
Fine-tuning the entire model has several limitations:
LLM-based Vision-Language Methods
Approach | Example 1 | Example 2 | Example 3
Fine-tune the entire language model | Dai et al. 2022 | Gao et al. 2022 |
Insert and train adapter layers in the language model | MAGMA | Flamingo |
Learn the vision encoder from scratch | Frozen | |
Only learn the mapping network | MAPL | BLIP-2 | MiniGPT-4
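All four rows share the same high-level pattern: a vision encoder produces visual features, some trainable component maps them into the LLM's input space, and the LLM generates text conditioned on that visual prefix; the rows differ mainly in which pieces are trained. The sketch below is an illustrative PyTorch skeleton of this pattern (module names and shapes are placeholders, not any specific paper's implementation).

```python
import torch
import torch.nn as nn

class VisionLanguageWrapper(nn.Module):
    """Illustrative skeleton: vision encoder -> mapping module -> LLM.

    In the parameter-efficient variants (e.g. MAPL, BLIP-2) the vision encoder
    and the LLM are frozen and only the mapping module is trained.
    """

    def __init__(self, vision_encoder, mapper, llm, llm_embed):
        super().__init__()
        self.vision_encoder = vision_encoder.requires_grad_(False)  # frozen
        self.llm = llm.requires_grad_(False)                        # frozen
        self.llm_embed = llm_embed        # the LLM's token-embedding layer
        self.mapper = mapper              # the only trainable component

    def forward(self, images, input_ids):
        with torch.no_grad():
            visual_feats = self.vision_encoder(images)   # [B, L_v, D_vision]
        visual_prefix = self.mapper(visual_feats)        # [B, L_p, D_llm]
        text_embeds = self.llm_embed(input_ids)          # [B, L_t, D_llm]
        inputs_embeds = torch.cat([visual_prefix, text_embeds], dim=1)
        # Assumes a HuggingFace-style decoder that accepts `inputs_embeds`
        return self.llm(inputs_embeds=inputs_embeds).logits   # [B, L_p + L_t, vocab]
```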
Drawbacks of other methods
MAPL - Architecture
A dog running in a beach.
Image: https://unsplash.com/photos/dog-running-on-beach-during-daytime-yihlaRCCvd4
Figure: image → Vision Encoder → Mapping Network (visual prefix); caption → LM Tokenizer → LM Embedder; the frozen LM self-attention consumes the visual prefix together with the caption tokens.
MAPL - Mapping Network
Figure: the mapping network. The L_i visual features (dimension D_i) are down-projected by fully connected (FC) layers to dimension D_h, concatenated with L_o learned constant embeddings (dimension D_h), and passed through a Transformer encoder; the encoder outputs at the constant-embedding positions (L_o tokens, dimension D_h) are up-projected by FC layers to L_o tokens of dimension D_o.
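A rough PyTorch sketch of a mapping network with this shape; the dimensions, layer counts, and activation choices below are placeholders, so treat it as an illustration of the figure rather than the exact MAPL module.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Down-project visual features, mix with learned constants, up-project to LM space."""

    def __init__(self, d_in=1024, d_hidden=256, d_out=4096, l_out=32,
                 num_layers=4, num_heads=8):
        super().__init__()
        # FC layers: L_i visual features from D_i down to the hidden size D_h
        self.down = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                                  nn.Linear(d_hidden, d_hidden))
        # L_o learned constant embeddings whose outputs become the prefix tokens
        self.constants = nn.Parameter(torch.randn(l_out, d_hidden))
        layer = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # FC layers: the L_o outputs from D_h up to the LM embedding size D_o
        self.up = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.GELU(),
                                nn.Linear(d_hidden, d_out))

    def forward(self, visual_feats):                       # [B, L_i, D_i]
        x = self.down(visual_feats)                        # [B, L_i, D_h]
        const = self.constants.expand(x.size(0), -1, -1)   # [B, L_o, D_h]
        h = self.encoder(torch.cat([x, const], dim=1))     # [B, L_i + L_o, D_h]
        return self.up(h[:, -self.constants.size(0):])     # [B, L_o, D_o]
```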
MAPL - Training
Training: only the mapping network is trained from scratch, by minimizing the negative log-likelihood of the reference captions under the frozen LM, conditioned on the corresponding images (see the sketch below).
After training:
Zero-shot transfer: captioning unseen images
Few-shot transfer: unseen VQA tasks
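A hedged sketch of one optimization step under that objective, reusing the hypothetical VisionLanguageWrapper skeleton from earlier (the optimizer is assumed to hold only the mapping network's parameters):

```python
import torch.nn.functional as F

def mapl_style_step(model, optimizer, images, caption_ids):
    """Minimize the NLL of the reference caption given the mapped visual prefix."""
    logits = model(images, caption_ids)                    # [B, L_p + L_t, vocab]
    prefix_len = logits.size(1) - caption_ids.size(1)
    # Position i predicts token i+1, so caption token t is predicted at prefix_len + t - 1
    pred = logits[:, prefix_len - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),   # negative log-likelihood
                           caption_ids.reshape(-1))
    loss.backward()           # gradients reach only the mapping network (rest is frozen)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```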
MAPL - Experiments
Evaluation
- Image captioning
  - Karpathy test split of COCO; validation splits of Conceptual Captions, TextCaps, and VizWiz-Captions
  - Metrics: BLEU@4, ROUGE-L, METEOR, CIDEr, SPICE
- VQA (no VQA-specific training; evaluated via zero-/few-shot transfer)
  - Validation splits of VQAv2, OK-VQA, TextVQA, and VizWiz-VQA
  - Metric: VQA-Accuracy
Training settings
- Domain-agnostic training: a filtered version of Conceptual Captions (CC-clean), 398K image-text pairs
- In-domain training: the target dataset's own training split
- Each setting is run with 100% and with 1% of the training data
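For reference, VQA-Accuracy scores a prediction against the answers from multiple human annotators; a simplified sketch is below (the official metric additionally normalizes answers and averages over annotator subsets):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-Accuracy: fully correct if at least 3 annotators gave the answer."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "united states" -> accuracy 1.0
```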
MAPL - Image Captioning - Domain-agnostic
Model | Trainable Parameters | Training Examples | CC B@4 | CC CIDEr | COCO B@4 | COCO CIDEr | TextCaps B@4 | TextCaps CIDEr | VizWiz-Caps B@4 | VizWiz-Caps CIDEr | Overall B@4 | Overall CIDEr
Existing methods (reported results):
ClipCap CC3M | 43M | 3.3M | - | 71.82 | - | - | - | - | - | - | - | -
VLKD CC3M | 406M | 3.3M | - | - | 18.2 | 61.1 | - | - | - | - | - | -
Trained on 100% of CC-clean:
MAPL-blind CC-clean | 3.4M | 374K | 0.35 | 5.05 | 2.75 | 5.75 | 1.35 | 2.15 | 1.5 | 1.8 | 1.47 | 3.69
Frozen CC-clean | 40.3M | 374K | 2.45 | 22.6 | 5.25 | 13.9 | 2.65 | 4.6 | 2.05 | 2.65 | 3.1 | 10.94
MAPL CC-clean | 3.4M | 374K | 6.75 | 79.75 | 12.3 | 54.3 | 5.8 | 22.95 | 4.95 | 20.95 | 7.45 | 44.49
Trained on 1% of CC-clean:
Frozen CC-clean | 40.3M | 3.7K | 0.75 | 6.55 | 3.05 | 5.25 | 1.7 | 1.65 | 1.5 | 1.4 | 1.75 | 3.71
MAPL CC-clean | 3.4M | 3.7K | 1.75 | 19.65 | 5.8 | 17.85 | 2.7 | 5.4 | 2.15 | 4.85 | 3.1 | 11.94
MAPL - Image Captioning - in-domain
Model | Trainable Parameters | Training Examples | CC B@4 | CC CIDEr | COCO B@4 | COCO CIDEr | TextCaps B@4 | TextCaps CIDEr | VizWiz-Caps B@4 | VizWiz-Caps CIDEr | Overall B@4 | Overall CIDEr
Trained on 100% of the in-domain data:
Frozen∗ COCO | 40.3M | 414K | 0.65 | 9.05 | 20.05 | 61.35 | 6.95 | 11.75 | 5.45 | 6.20 | 8.28 | 22.09
Frozen TextCaps | 40.3M | 103K | 0.2 | 3.55 | 4.05 | 6.7 | 8.85 | 46.95 | 4.4 | 5.25 | 4.38 | 8.11
Frozen VizWiz | 40.3M | 110K | 0.25 | 4.4 | 3.75 | 6.05 | 4.1 | 5.65 | 19 | 76.85 | 6.78 | 23.24
ClipCap COCO | 43M | 414K | - | - | 33.53 | 113.08 | - | - | - | - | - | -
MAPL COCO | 3.4M | 414K | 2.25 | 34.5 | 36.45 | 125.2 | 16.6 | 41.4 | 18 | 41.35 | 18.33 | 60.61
MAPL TextCaps | 3.4M | 103K | 0.90 | 13.05 | 9.8 | 28.65 | 18.35 | 62.55 | 11.2 | 31.85 | 10.06 | 34.03
MAPL VizWiz | 3.4M | 110K | 0.90 | 18.8 | 13.55 | 48.35 | 11.35 | 31.2 | 34.7 | 141.3 | 15.13 | 59.91
Trained on 1% of the in-domain data:
Frozen COCO | 40.3M | 4.1K | 0.25 | 3.6 | 6.2 | 12.8 | 2.8 | 3.15 | 2.85 | 2.3 | 3.03 | 5.46
Frozen TextCaps | 40.3M | 1K | 0.1 | 2.6 | 1.65 | 2.8 | 3.65 | 5 | 2 | 2.25 | 1.85 | 3.16
Frozen VizWiz | 40.3M | 1.1K | 0.2 | 3.4 | 2.9 | 3.2 | 3.35 | 3.45 | 12.7 | 40.55 | 4.79 | 12.65
MAPL COCO | 3.4M | 4.1K | 0.8 | 12.1 | 19.65 | 65.9 | 7 | 12.85 | 6.2 | 9.6 | 8.41 | 25.11
MAPL TextCaps | 3.4M | 1K | 0.3 | 3.9 | 4.1 | 8.05 | 8.35 | 16.9 | 5 | 7.25 | 4.44 | 9.03
MAPL VizWiz | 3.4M | 1.1K | 0.2 | 3.9 | 2.95 | 4.8 | 3.45 | 5.05 | 18.4 | 71.1 | 6.25 | 21.21
MAPL - VQA - Domain-agnostic
Model | Trainable Parameters | Training Examples | VQAv2 0-shot | VQAv2 4-shot | VQAv2 8-shot | OK-VQA 0-shot | OK-VQA 4-shot | OK-VQA 8-shot | TextVQA 0-shot | TextVQA 4-shot | TextVQA 8-shot | VizWiz-QA 0-shot | VizWiz-QA 4-shot | VizWiz-QA 8-shot | Overall 0-shot | Overall 4-shot | Overall 8-shot
Existing methods (reported results):
Frozen | 40.3M | 3.3M | 29.5 | 38.2 | - | 5.9 | 12.6 | - | - | - | - | - | - | - | - | - | -
MAGMA CC12M | 243M | 3.8M | 36.9 | 45.4 | - | 13.9 | 23.4 | - | - | - | - | 5.6 | 10.6 | - | - | - | -
VLKD CC3M | 406M | 3.3M | 38.6 | - | - | 10.5 | - | - | - | - | - | - | - | - | - | - | -
LiMBeR-CLIP | 12.6M | 3.3M | 33.33 | 40.34 | - | - | - | - | - | - | - | - | - | - | - | - | -
Flamingo | 10.2B | > 2.1B | - | - | - | 57.6 | 57.4 | 57.5 | 35 | 36.5 | 37.3 | - | - | - | - | - | -
Trained on 100% of CC-clean:
MAPL-blind CC-clean | 3.4M | 374K | 20.62 | 35.01 | 35.11 | 4.84 | 14.68 | 14.28 | 3.68 | 5.43 | 5.82 | 3.18 | 8.65 | 9.55 | 8.08 | 15.94 | 16.19
Frozen CC-clean | 40.3M | 374K | 25.98 | 37.80 | 38.52 | 5.51 | 18.86 | 19.91 | 5.11 | 6.15 | 6.30 | 4.33 | 11.28 | 16.68 | 10.23 | 18.52 | 20.35
MAPL CC-clean | 3.4M | 374K | 33.54 | 45.13 | 45.21 | 13.84 | 24.25 | 23.93 | 8.26 | 8.88 | 8.77 | 11.72 | 18.46 | 19.52 | 16.84 | 24.18 | 24.36
Trained on 1% of CC-clean:
Frozen CC-clean | 40.3M | 3.7K | 26.22 | 36.69 | 37.41 | 5.5 | 18.76 | 20.51 | 5.71 | 7.19 | 7.53 | 3.83 | 11.71 | 16.66 | 10.31 | 18.58 | 20.53
MAPL CC-clean | 3.4M | 3.7K | 30.80 | 37.37 | 37.95 | 8.77 | 18.18 | 19.15 | 6.40 | 7.07 | 7.74 | 5.68 | 9.26 | 10.58 | 12.91 | 17.97 | 18.85
(Overall = average over the four VQA datasets.)
MAPL - VQA - in-domain
Model | Trainable Parameters | Training Examples | VQAv2 0-shot | VQAv2 4-shot | VQAv2 8-shot | OK-VQA 0-shot | OK-VQA 4-shot | OK-VQA 8-shot | TextVQA 0-shot | TextVQA 4-shot | TextVQA 8-shot | VizWiz-QA 0-shot | VizWiz-QA 4-shot | VizWiz-QA 8-shot | Overall 0-shot | Overall 4-shot | Overall 8-shot
Trained on 100% of the in-domain data:
PICa | 0 | 0 | 20.61 | 46.86 | 47.80 | 11.84 | 31.28 | 33.07 | - | - | - | - | - | - | - | - | -
Frozen∗ COCO | 40.3M | 414K | 32.09 | 38.9 | 39.42 | 9.81 | 20.72 | 21.83 | 7.54 | 6.82 | 6.74 | 5.87 | 12.07 | 17.35 | 13.82 | 19.63 | 21.33
Frozen TextCaps | 40.3M | 103K | 32.49 | 37.39 | 38.03 | 11.34 | 19.87 | 20.82 | 8.83 | 7.33 | 7.51 | 6.25 | 12.26 | 16.86 | 14.73 | 19.21 | 20.8
Frozen VizWiz | 40.3M | 110K | 26.93 | 37.38 | 37.91 | 5.85 | 19.12 | 20.64 | 6.38 | 7.44 | 7.47 | 5.57 | 13.06 | 18.06 | 11.18 | 19.25 | 21.02
MAPL COCO | 3.4M | 414K | 43.51 | 48.75 | 48.44 | 18.27 | 31.13 | 31.63 | 10.99 | 11.1 | 11.08 | 14.05 | 17.72 | 19.18 | 21.7 | 27.17 | 27.58
MAPL TextCaps | 3.4M | 103K | 38.83 | 43.34 | 43.43 | 16.33 | 25.07 | 25.92 | 22.27 | 19.53 | 19.75 | 12.31 | 16.69 | 18.18 | 22.43 | 26.15 | 26.82
MAPL VizWiz | 3.4M | 110K | 32.8 | 42.94 | 43.2 | 11.7 | 24.91 | 25.73 | 9.27 | 10.36 | 10.23 | 10.42 | 20.63 | 23.10 | 16.05 | 24.71 | 25.56
Trained on 1% of the in-domain data:
Frozen COCO | 40.3M | 4.1K | 30.18 | 37.23 | 37.89 | 9.33 | 19.6 | 20.71 | 7.43 | 7.65 | 7.67 | 4.37 | 12 | 16.48 | 12.83 | 19.12 | 20.69
Frozen TextCaps | 40.3M | 1K | 32.09 | 36.72 | 37.25 | 10.75 | 18.85 | 19.51 | 8.17 | 7.57 | 7.28 | 5.39 | 11.79 | 16.20 | 14.1 | 18.73 | 20.06
Frozen VizWiz | 40.3M | 1.1K | 29.6 | 37.3 | 37.87 | 7.57 | 19.36 | 20.6 | 7.16 | 7.17 | 7.25 | 4.53 | 12.51 | 17.56 | 12.22 | 19.08 | 20.82
MAPL COCO | 3.4M | 4.1K | 37.69 | 40.42 | 40.84 | 13.92 | 21.66 | 22.41 | 8.3 | 6.96 | 6.84 | 6.94 | 10.72 | 12.43 | 16.71 | 19.94 | 20.63
MAPL TextCaps | 3.4M | 1K | 33.57 | 36.7 | 36.87 | 12.46 | 17.75 | 18.21 | 9.34 | 8.29 | 8.62 | 6.54 | 9.58 | 11.62 | 15.48 | 18 | 18.83
MAPL VizWiz | 3.4M | 1.1K | 31.88 | 36.81 | 37.04 | 9.59 | 17.64 | 17.64 | 7.25 | 5.99 | 6.04 | 4.73 | 9.48 | 11.33 | 13.36 | 17.48 | 18.01
(Overall = average over the four VQA datasets.)
Drawbacks of MAPL
The MAPL model has several disadvantages; two of them are described below:
BLIP-2: Bootstrapping Language-Image Pre-training
with Frozen Image Encoders and Large Language Models
ICML 2023
Overview
Inference: align the visual representation with the language model
Key challenge: since the LLM has never seen images during its pre-training, how can the pure vision representation be “transformed” into a format the LLM can effectively use?
Inference: Image Encoder + Q-Former
The Query Transformer (Q-Former) acts as a “filter”: a fixed set of learned queries distills the pure visual features into a language-informative visual feature.
Pure visual feature: [H×W, D]
Learned queries: [N, D]
Query output (language-informative visual feature): [N, D]
Inference: Linear projection + LLM
A fully connected layer projects the Q-Former's query output into the LLM's input embedding space; the projected tokens are prepended to the text prompt (e.g., “What is in the image?”) and the frozen LLM generates the answer.
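Conceptually, the inference glue code looks like the sketch below; the HuggingFace-style generate/tokenizer calls are assumptions for illustration, not BLIP-2's actual API.

```python
import torch

@torch.no_grad()
def blip2_style_generate(image_encoder, qformer, proj, llm, llm_embed, tokenizer,
                         image, prompt="What is in the image?"):
    """Hypothetical inference path: image -> frozen encoder -> Q-Former -> FC -> frozen LLM."""
    image_feats = image_encoder(image)                   # [1, H*W, D_vision]
    query_out = qformer(image_feats)                     # [1, N, D_qformer]
    visual_tokens = proj(query_out)                      # [1, N, D_llm]  (fully connected layer)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = llm_embed(prompt_ids)                # [1, L_t, D_llm]
    inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
    output_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=30)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```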
Pre-training stage 1: Vision & Language representation learning
Q: How do we learn a Q-Former that can comprehend both visual and textual information and extract language-informative visual features?
A: Pre-train it on three vision-language proxy tasks using image-text pairs:
Image-text contrastive learning
Image-text matching
Image-grounded text generation
Q: What is the architecture of the Q-Former, and how are the different proxy tasks performed?
The Q-Former takes learned query embeddings and text embeddings as input; the queries interact with the frozen image features through cross-attention, and different self-attention masks between the queries and the text implement the three proxy tasks (as sketched below).
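One way to make the “different proxy tasks via different query-text interactions” idea concrete is the attention-mask sketch below, a simplified rendering of the masking scheme described in the BLIP-2 paper (True means the row position may attend to the column position):

```python
import torch

def qformer_attention_mask(num_queries: int, num_text: int, task: str) -> torch.Tensor:
    """Self-attention mask over the concatenated [queries | text] sequence."""
    n = num_queries + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    q, t = slice(0, num_queries), slice(num_queries, n)
    mask[q, q] = True                          # queries always attend to each other
    if task == "contrastive":                  # uni-modal: queries and text stay separate
        mask[t, t] = True
    elif task == "matching":                   # bi-directional: everything attends to everything
        mask[:, :] = True
    elif task == "generation":                 # text sees queries, causal mask within text
        mask[t, q] = True
        mask[t, t] = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    return mask
```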
Pre-training stage 2: Vision-to-Language generative learning
A two-stage pre-training strategy:
Stage 1: image-text pair losses (contrastive, matching, image-grounded generation)
Stage 2: language generation loss with the frozen LLM
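The split of what gets optimized at each stage can be summarized by the hypothetical helper below (the image encoder and the LLM stay frozen throughout; qformer and proj are placeholder module names):

```python
def trainable_parameters(stage: int, qformer, proj):
    """Return the parameters optimized at each BLIP-2 pre-training stage."""
    if stage == 1:
        # Stage 1: only the Q-Former, trained with the image-text pair losses
        return list(qformer.parameters())
    # Stage 2: Q-Former plus the projection layer, trained with the LLM's generation loss
    return list(qformer.parameters()) + list(proj.parameters())
```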
Pre-training Dataset & Benchmarks
Pre-training Data
Experiments
Results: zero-shot VQA
Results: without Q-Former pre-training (stage 1)
These ablations demonstrate that decoupling the end-to-end training into two stages is crucial for state-of-the-art results.
Limitations
Summary
Parameter-efficient LLM-based vision-language models leverage LLMs' strong capabilities (e.g., zero-shot and in-context learning) for vision-language tasks while making minimal changes to the model's architecture and parameters.
The two adapter-free approaches, MAPL and BLIP-2, reduce the need for large numbers of trainable parameters, GPU resources, and large-scale pre-training datasets.
The next research directions may include:
Comparison: Flamingo vs. MAPL vs. BLIP-2
Feature | Flamingo | MAPL | BLIP-2
Model capabilities | zero-shot; few-shot | zero-shot; few-shot | zero-shot
Model architecture | Perceiver Resampler; a cross-attention layer for each LLM block | Mapping network: 4 transformer blocks | Q-Former: BERT-base, 12 blocks
Trainable parameters | > 1.4B | 3.4M | 104M
Pre-training dataset size | 2B | ~400K | ~129M