1 of 49

Parameter-efficient LLM-based vision-language models

Team 11: Omid Reza Heidari, Li Gu 2024-03-13

2 of 49

LLM-based vision-language models

What are Multi-Modal LLMs?

What are their key features?

  • Integrated Understanding
  • Generative Capabilities
  • Adaptability

3 of 49

LLM based Vision-language Methods

[Diagram: image → Vision Encoder → Mapping Network → Pre-trained LLM.
Prompt: "What country is this? Why do you think so?"
Output: "It is the United States. I think so because the flag is the United States flag."]

Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY

4 of 49

Why parameter efficiency?

Fine-tuning the entire model has several limitations:

  • It can degrade the LLM's existing capabilities
  • Training a huge number of parameters requires a correspondingly large dataset
  • It demands a great deal of GPU memory and GPU hours

5 of 49

LLM based Vision-language Methods

Approach | 1st example | 2nd example | 3rd example
Finetune the entire language model | Dai et al. 2022 | Gao et al. 2022 | -

[Diagram: image → Vision Encoder → Mapping Network → Pre-trained LLM, answering "What country is this? Why do you think so?"]

Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY

6 of 49

LLM based Vision-language Methods

Approach | 1st example | 2nd example | 3rd example
Finetune the entire language model | Dai et al. 2022 | Gao et al. 2022 | -
Insert and Train Adapter Layers in the Language Model | MAGMA | Flamingo | -

[Diagram: image → Vision Encoder → Mapping Network → Pre-trained LLM, now with an Adapter Layer inserted into the LLM]

Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY

7 of 49

LLM based Vision-language Methods

Approach | 1st example | 2nd example | 3rd example
Finetune the entire language model | Dai et al. 2022 | Gao et al. 2022 | -
Insert and Train Adapter Layers in the Language Model | MAGMA | Flamingo | -
Learn Vision Encoder from Scratch | Frozen | - | -

[Diagram: image → Vision Encoder → Mapping Network → Pre-trained LLM]

Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY

8 of 49

LLM based Vision-language Methods

Approach | 1st example | 2nd example | 3rd example
Finetune the entire language model | Dai et al. 2022 | Gao et al. 2022 | -
Insert and Train Adapter Layers in the Language Model | MAGMA | Flamingo | -
Learn Vision Encoder from Scratch | Frozen | - | -
Only Learn the Mapping Network | MAPL | BLIP-2 | MiniGPT-4

[Diagram: image → Vision Encoder → Mapping Network → Pre-trained LLM]

Image: https://unsplash.com/photos/a-city-street-with-a-roller-coaster-in-the-background-oQ0MOFu7NOY

9 of 49

Drawbacks of other methods

  • Number of parameters
  • Overfitting
  • Resources

10 of 49

MAPL - Architecture

[Diagram: the image is encoded by the Vision Encoder and passed through the Mapping Network; the caption "A dog running in a beach." is tokenized by the LM Tokenizer and embedded by the LM Embedder; the visual tokens and text embeddings are fed together into the LM self-attention, which generates the caption token by token.]

Image: https://unsplash.com/photos/dog-running-on-beach-during-daytime-yihlaRCCvd4

11 of 49

MAPL - Mapping Network

[Diagram: Li visual features of dimension Di are each projected by a shared FC layer to dimension Dh and concatenated with Lo learned constant embeddings (dimension Dh); a Transformer Encoder processes the sequence; the Lo outputs (dimension Dh) are each projected by a second shared FC layer, giving Lo tokens of dimension Do.]

12 of 49

MAPL - Mapping Network

  • Decouple the transformer hidden size Dh from the input dimension Di and the output dimension Do
  • Share parameters across the FC projection layers
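As a concrete sketch, the mapping network above can be written in a few lines of PyTorch. This is our own minimal reading of the slide, not the official MAPL code; the hidden size Dh=256, the 4 layers, and the 4-head encoder are illustrative assumptions:

```python
# Sketch of the MAPL-style mapping network: shared FC in, learned constant
# embeddings, a Transformer Encoder, and a shared FC out.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, d_in, d_out, d_hidden=256, n_out_tokens=32, n_layers=4):
        super().__init__()
        # Shared FC: one projection applied to every visual feature (Di -> Dh)
        self.proj_in = nn.Linear(d_in, d_hidden)
        # Lo learned constant embeddings that will become the output tokens
        self.const = nn.Parameter(torch.randn(n_out_tokens, d_hidden))
        layer = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Shared FC: project each of the Lo outputs to the LM size (Dh -> Do)
        self.proj_out = nn.Linear(d_hidden, d_out)

    def forward(self, visual_feats):             # [B, Li, Di]
        b = visual_feats.size(0)
        x = self.proj_in(visual_feats)           # [B, Li, Dh]
        q = self.const.expand(b, -1, -1)         # [B, Lo, Dh]
        h = self.encoder(torch.cat([x, q], 1))   # [B, Li+Lo, Dh]
        return self.proj_out(h[:, -q.size(1):])  # [B, Lo, Do]

mapper = MappingNetwork(d_in=1024, d_out=768)
out = mapper(torch.randn(2, 257, 1024))
print(out.shape)  # torch.Size([2, 32, 768])
```

Note how Dh (256) is smaller than both Di (1024) and Do (768), which is exactly the decoupling the first bullet describes.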

13 of 49

MAPL - Training

Only the mapping network is trained, from scratch, by minimizing the negative log-likelihood of the reference captions under the LM, conditioned on the corresponding images.

After training:

  • Zero-shot transfer: captioning unseen images
  • Few-shot transfer: unseen VQA
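A minimal sketch of this objective, assuming the LM returns logits over the concatenated [visual tokens; caption tokens] sequence (names and shapes are illustrative, not the official code):

```python
# Caption NLL under a frozen LM conditioned on mapped visual tokens.
import torch
import torch.nn.functional as F

def caption_nll(lm_logits, caption_ids, n_visual_tokens):
    """lm_logits: [B, n_visual_tokens + T, V]; caption_ids: [B, T]."""
    # The logit at position t predicts token t+1, so drop the last step
    # and shift past the visual prefix by one.
    logits = lm_logits[:, n_visual_tokens - 1 : -1, :]  # [B, T, V]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids.reshape(-1))

B, Lv, T, V = 2, 32, 10, 100
loss = caption_nll(torch.randn(B, Lv + T, V), torch.randint(0, V, (B, T)), Lv)
# Only the mapping network's parameters would receive this gradient;
# the vision encoder and the LM stay frozen.
```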

14 of 49

MAPL - Experiments

Evaluation

  • Image Captioning
    • Karpathy test split of COCO; validation splits of Conceptual Captions, TextCaps, and VizWiz-Captions
    • Metrics: BLEU@4, ROUGE-L, METEOR, CIDEr, SPICE
  • VQA (the model is not actually trained for VQA)
    • Validation splits of VQAv2, OK-VQA, TextVQA, and VizWiz-VQA
    • Metric: VQA-Accuracy

Training Setting

  • Domain-agnostic training: a filtered version of CC (CC-clean), 398k image-text pairs
  • In-domain training
  • Each setting is trained on 100% and on 1% of the data

15 of 49

MAPL - Image Captioning - Domain-agnostic

Model (training data) | Trainable Params | Training Examples | CC B@4 / CIDEr | COCO B@4 / CIDEr | TextCaps B@4 / CIDEr | VizWiz-Caps B@4 / CIDEr | Overall B@4 / CIDEr

Existing methods:
ClipCap (CC3M) | 43M | 3.3M | - / 71.82 | - / - | - / - | - / - | - / -
VLKD (CC3M) | 406M | 3.3M | - / - | 18.2 / 61.1 | - / - | - / - | - / -

Trained on 100% of CC-clean:
MAPL-blind (CC-clean) | 3.4M | 374K | 0.35 / 5.05 | 2.75 / 5.75 | 1.35 / 2.15 | 1.5 / 1.8 | 1.47 / 3.69
Frozen (CC-clean) | 40.3M | 374K | 2.45 / 22.6 | 5.25 / 13.9 | 2.65 / 4.6 | 2.05 / 2.65 | 3.1 / 10.94
MAPL (CC-clean) | 3.4M | 374K | 6.75 / 79.75 | 12.3 / 54.3 | 5.8 / 22.95 | 4.95 / 20.95 | 7.45 / 44.49

Trained on 1% of CC-clean (results on the next slide):
Frozen (CC-clean) | 40.3M | 3.7K
MAPL (CC-clean) | 3.4M | 3.7K

16 of 49

MAPL - Image Captioning - Domain-agnostic

Model (training data) | Trainable Params | Training Examples | CC B@4 / CIDEr | COCO B@4 / CIDEr | TextCaps B@4 / CIDEr | VizWiz-Caps B@4 / CIDEr | Overall B@4 / CIDEr

Existing methods: ClipCap (CC3M, 43M params, 3.3M examples); VLKD (CC3M, 406M params, 3.3M examples); MAPL-blind and Frozen on 100% of CC-clean (previous slide)

Trained on 100% of CC-clean:
MAPL (CC-clean) | 3.4M | 374K | 6.75 / 79.75 | 12.3 / 54.3 | 5.8 / 22.95 | 4.95 / 20.95 | 7.45 / 44.49

Trained on 1% of CC-clean:
Frozen (CC-clean) | 40.3M | 3.7K | 0.75 / 6.55 | 3.05 / 5.25 | 1.7 / 1.65 | 1.5 / 1.4 | 1.75 / 3.71
MAPL (CC-clean) | 3.4M | 3.7K | 1.75 / 19.65 | 5.8 / 17.85 | 2.7 / 5.4 | 2.15 / 4.85 | 3.1 / 11.94

17 of 49

MAPL - Image Captioning - in-domain

Model (training data) | Trainable Params | Training Examples | CC B@4 / CIDEr | COCO B@4 / CIDEr | TextCaps B@4 / CIDEr | VizWiz-Caps B@4 / CIDEr | Overall B@4 / CIDEr

Trained on 100% of the in-domain data:
Frozen* (COCO) | 40.3M | 414K | 0.65 / 9.05 | 20.05 / 61.35 | 6.95 / 11.75 | 5.45 / 6.20 | 8.28 / 22.09
Frozen (TextCaps) | 40.3M | 103K | 0.2 / 3.55 | 4.05 / 6.7 | 8.85 / 46.95 | 4.4 / 5.25 | 4.38 / 8.11
Frozen (VizWiz) | 40.3M | 110K | 0.25 / 4.4 | 3.75 / 6.05 | 4.1 / 5.65 | 19 / 76.85 | 6.78 / 23.24
ClipCap (COCO) | 43M | 414K | - / - | 33.53 / 113.08 | - / - | - / - | - / -
MAPL (COCO) | 3.4M | 414K | 2.25 / 34.5 | 36.45 / 125.2 | 16.6 / 41.4 | 18 / 41.35 | 18.33 / 60.61
MAPL (TextCaps) | 3.4M | 103K | 0.90 / 13.05 | 9.8 / 28.65 | 18.35 / 62.55 | 11.2 / 31.85 | 10.06 / 34.03
MAPL (VizWiz) | 3.4M | 110K | 0.90 / 18.8 | 13.55 / 48.35 | 11.35 / 31.2 | 34.7 / 141.3 | 15.13 / 59.91

Trained on 1% of the in-domain data (results on the next slide):
Frozen (COCO) | 40.3M | 4.1K
Frozen (TextCaps) | 40.3M | 1K
Frozen (VizWiz) | 40.3M | 1.1K
MAPL (COCO) | 3.4M | 4.1K
MAPL (TextCaps) | 3.4M | 1K
MAPL (VizWiz) | 3.4M | 1.1K

18 of 49

MAPL - Image Captioning - in-domain

Model (training data) | Trainable Params | Training Examples | CC B@4 / CIDEr | COCO B@4 / CIDEr | TextCaps B@4 / CIDEr | VizWiz-Caps B@4 / CIDEr | Overall B@4 / CIDEr

Trained on 100% of the in-domain data (full results on the previous slide):
MAPL (COCO) | 3.4M | 414K | 2.25 / 34.5 | 36.45 / 125.2 | 16.6 / 41.4 | 18 / 41.35 | 18.33 / 60.61
MAPL (TextCaps) | 3.4M | 103K | 0.90 / 13.05 | 9.8 / 28.65 | 18.35 / 62.55 | 11.2 / 31.85 | 10.06 / 34.03
MAPL (VizWiz) | 3.4M | 110K | 0.90 / 18.8 | 13.55 / 48.35 | 11.35 / 31.2 | 34.7 / 141.3 | 15.13 / 59.91

Trained on 1% of the in-domain data:
Frozen (COCO) | 40.3M | 4.1K | 0.25 / 3.6 | 6.2 / 12.8 | 2.8 / 3.15 | 2.85 / 2.3 | 3.03 / 5.46
Frozen (TextCaps) | 40.3M | 1K | 0.1 / 2.6 | 1.65 / 2.8 | 3.65 / 5 | 2 / 2.25 | 1.85 / 3.16
Frozen (VizWiz) | 40.3M | 1.1K | 0.2 / 3.4 | 2.9 / 3.2 | 3.35 / 3.45 | 12.7 / 40.55 | 4.79 / 12.65
MAPL (COCO) | 3.4M | 4.1K | 0.8 / 12.1 | 19.65 / 65.9 | 7 / 12.85 | 6.2 / 9.6 | 8.41 / 25.11
MAPL (TextCaps) | 3.4M | 1K | 0.3 / 3.9 | 4.1 / 8.05 | 8.35 / 16.9 | 5 / 7.25 | 4.44 / 9.03
MAPL (VizWiz) | 3.4M | 1.1K | 0.2 / 3.9 | 2.95 / 4.8 | 3.45 / 5.05 | 18.4 / 71.1 | 6.25 / 21.21

19 of 49

MAPL - VQA - Domain-agnostic

Model (training data) | Trainable Params | Training Examples | VQAv2 0/4/8-shot | OK-VQA 0/4/8-shot | TextVQA 0/4/8-shot | VizWiz-QA 0/4/8-shot | Overall 0/4/8-shot

Existing methods:
Frozen | 40.3M | 3.3M | 29.5 / 38.2 / - | 5.9 / 12.6 / - | - | - | -
MAGMA (CC12M) | 243M | 3.8M | 36.9 / 45.4 / - | 13.9 / 23.4 / - | - | 5.6 / 10.6 / - | -
VLKD (CC3M) | 406M | 3.3M | 38.6 / - / - | 10.5 / - / - | - | - | -
LiMBeR-CLIP | 12.6M | 3.3M | 33.33 / 40.34 / - | - | - | - | -
Flamingo | 10.2B | >2.1B | - | 57.6 / 57.4 / 57.5 | 35 / 36.5 / 37.3 | - | -

Trained on 100% of CC-clean:
MAPL-blind (CC-clean) | 3.4M | 374K | 20.62 / 35.01 / 35.11 | 4.84 / 14.68 / 14.28 | 3.68 / 5.43 / 5.82 | 3.18 / 8.65 / 9.55 | 8.08 / 15.94 / 16.19
Frozen (CC-clean) | 40.3M | 374K | 25.98 / 37.80 / 38.52 | 5.51 / 18.86 / 19.91 | 5.11 / 6.15 / 6.30 | 4.33 / 11.28 / 16.68 | 10.23 / 18.52 / 20.35
MAPL (CC-clean) | 3.4M | 374K | 33.54 / 45.13 / 45.21 | 13.84 / 24.25 / 23.93 | 8.26 / 8.88 / 8.77 | 11.72 / 18.46 / 19.52 | 16.84 / 24.18 / 24.36

Trained on 1% of CC-clean (results on the next slide):
Frozen (CC-clean) | 40.3M | 3.7K
MAPL (CC-clean) | 3.4M | 3.7K

20 of 49

MAPL - VQA - Domain-agnostic

Model (training data) | Trainable Params | Training Examples | VQAv2 0/4/8-shot | OK-VQA 0/4/8-shot | TextVQA 0/4/8-shot | VizWiz-QA 0/4/8-shot | Overall 0/4/8-shot

Existing methods: Frozen (40.3M params, 3.3M examples); MAGMA (CC12M, 243M, 3.8M); VLKD (CC3M, 406M, 3.3M); LiMBeR-CLIP (12.6M, 3.3M); Flamingo (10.2B, >2.1B); results on the previous slide

Trained on 100% of CC-clean:
MAPL (CC-clean) | 3.4M | 374K | 33.54 / 45.13 / 45.21 | 13.84 / 24.25 / 23.93 | 8.26 / 8.88 / 8.77 | 11.72 / 18.46 / 19.52 | 16.84 / 24.18 / 24.36

Trained on 1% of CC-clean:
Frozen (CC-clean) | 40.3M | 3.7K | 26.22 / 36.69 / 37.41 | 5.5 / 18.76 / 20.51 | 5.71 / 7.19 / 7.53 | 3.83 / 11.71 / 16.66 | 10.31 / 18.58 / 20.53
MAPL (CC-clean) | 3.4M | 3.7K | 30.80 / 37.37 / 37.95 | 8.77 / 18.18 / 19.15 | 6.40 / 7.07 / 7.74 | 5.68 / 9.26 / 10.58 | 12.91 / 17.97 / 18.85

21 of 49

MAPL - VQA - in-domain

Model (training data) | Trainable Params | Training Examples | VQAv2 0/4/8-shot | OK-VQA 0/4/8-shot | TextVQA 0/4/8-shot | VizWiz-QA 0/4/8-shot | Overall 0/4/8-shot

PICa | 0 | 0 | 20.61 / 46.86 / 47.80 | 11.84 / 31.28 / 33.07 | - | - | -

Trained on 100% of the in-domain data:
Frozen* (COCO) | 40.3M | 414K | 32.09 / 38.9 / 39.42 | 9.81 / 20.72 / 21.83 | 7.54 / 6.82 / 6.74 | 5.87 / 12.07 / 17.35 | 13.82 / 19.63 / 21.33
Frozen (TextCaps) | 40.3M | 103K | 32.49 / 37.39 / 38.03 | 11.34 / 19.87 / 20.82 | 8.83 / 7.33 / 7.51 | 6.25 / 12.26 / 16.86 | 14.73 / 19.21 / 20.8
Frozen (VizWiz) | 40.3M | 110K | 26.93 / 37.38 / 37.91 | 5.85 / 19.12 / 20.64 | 6.38 / 7.44 / 7.47 | 5.57 / 13.06 / 18.06 | 11.18 / 19.25 / 21.02
MAPL (COCO) | 3.4M | 414K | 43.51 / 48.75 / 48.44 | 18.27 / 31.13 / 31.63 | 10.99 / 11.1 / 11.08 | 14.05 / 17.72 / 19.18 | 21.7 / 27.17 / 27.58
MAPL (TextCaps) | 3.4M | 103K | 38.83 / 43.34 / 43.43 | 16.33 / 25.07 / 25.92 | 22.27 / 19.53 / 19.75 | 12.31 / 16.69 / 18.18 | 22.43 / 26.15 / 26.82
MAPL (VizWiz) | 3.4M | 110K | 32.8 / 42.94 / 43.2 | 11.7 / 24.91 / 25.73 | 9.27 / 10.36 / 10.23 | 10.42 / 20.63 / 23.10 | 16.05 / 24.71 / 25.56

Trained on 1% of the in-domain data (results on the next slide):
Frozen (COCO) | 40.3M | 4.1K
Frozen (TextCaps) | 40.3M | 1K
Frozen (VizWiz) | 40.3M | 1.1K
MAPL (COCO) | 3.4M | 4.1K
MAPL (TextCaps) | 3.4M | 1K
MAPL (VizWiz) | 3.4M | 1.1K

22 of 49

MAPL - VQA - in-domain

Model (training data) | Trainable Params | Training Examples | VQAv2 0/4/8-shot | OK-VQA 0/4/8-shot | TextVQA 0/4/8-shot | VizWiz-QA 0/4/8-shot | Overall 0/4/8-shot

PICa | 0 | 0 | 20.61 / 46.86 / 47.80 | 11.84 / 31.28 / 33.07 | - | - | -

Trained on 100% of the in-domain data (full results on the previous slide):
MAPL (COCO) | 3.4M | 414K | 43.51 / 48.75 / 48.44 | 18.27 / 31.13 / 31.63 | 10.99 / 11.1 / 11.08 | 14.05 / 17.72 / 19.18 | 21.7 / 27.17 / 27.58
MAPL (TextCaps) | 3.4M | 103K | 38.83 / 43.34 / 43.43 | 16.33 / 25.07 / 25.92 | 22.27 / 19.53 / 19.75 | 12.31 / 16.69 / 18.18 | 22.43 / 26.15 / 26.82
MAPL (VizWiz) | 3.4M | 110K | 32.8 / 42.94 / 43.2 | 11.7 / 24.91 / 25.73 | 9.27 / 10.36 / 10.23 | 10.42 / 20.63 / 23.10 | 16.05 / 24.71 / 25.56

Trained on 1% of the in-domain data:
Frozen (COCO) | 40.3M | 4.1K | 30.18 / 37.23 / 37.89 | 9.33 / 19.6 / 20.71 | 7.43 / 7.65 / 7.67 | 4.37 / 12 / 16.48 | 12.83 / 19.12 / 20.69
Frozen (TextCaps) | 40.3M | 1K | 32.09 / 36.72 / 37.25 | 10.75 / 18.85 / 19.51 | 8.17 / 7.57 / 7.28 | 5.39 / 11.79 / 16.20 | 14.1 / 18.73 / 20.06
Frozen (VizWiz) | 40.3M | 1.1K | 29.6 / 37.3 / 37.87 | 7.57 / 19.36 / 20.6 | 7.16 / 7.17 / 7.25 | 4.53 / 12.51 / 17.56 | 12.22 / 19.08 / 20.82
MAPL (COCO) | 3.4M | 4.1K | 37.69 / 40.42 / 40.84 | 13.92 / 21.66 / 22.41 | 8.3 / 6.96 / 6.84 | 6.94 / 10.72 / 12.43 | 16.71 / 19.94 / 20.63
MAPL (TextCaps) | 3.4M | 1K | 33.57 / 36.7 / 36.87 | 12.46 / 17.75 / 18.21 | 9.34 / 8.29 / 8.62 | 6.54 / 9.58 / 11.62 | 15.48 / 18 / 18.83
MAPL (VizWiz) | 3.4M | 1.1K | 31.88 / 36.81 / 37.04 | 9.59 / 17.64 / 17.64 | 7.25 / 5.99 / 6.04 | 4.73 / 9.48 / 11.33 | 13.36 / 17.48 / 18.01

23 of 49

Drawbacks of MAPL

MAPL has several disadvantages; two of them are listed below:

  • High GPU usage: even though the LM is not updated, gradients must still be backpropagated through it to reach the mapping network
  • The LM must be open source (its gradients must be accessible), and open-source LMs generally perform much worse than closed-source ones

24 of 49

BLIP-2: Bootstrapping Language-Image Pre-training

with Frozen Image Encoders and Large Language Models

ICML 2023


25 of 49

Overview

  • Propose a parameter-efficient multimodal pre-training method that bridges the vision-language modality gap
  • Introduce an adapter-free mapping-network architecture, the Q-Former, along with a two-stage pre-training strategy
  • Outperform Flamingo-80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters

26 of 49

Inference: Aligning vision with language

Key challenge: since the LLM has never seen images during its pre-training, how do we "transform" the pure vision representation into a format the LLM can effectively use?

27 of 49

Inference: Image Encoder + Q-former

Querying Transformer (Q-Former) = "filter"

  • Extract a language-informative visual representation from the frozen image encoder by conditioning the learned queries on the image features

[Figure: pure visual feature → Q-Former → query output (language-informative visual feature)]

28 of 49

Inference: Image Encoder + Q-former

Querying Transformer (Q-Former) = "filter"

  • Extract a language-informative visual representation from the frozen image encoder by conditioning the learned queries on the image features

Learned queries

  • Learnable parameters shared across the dataset
  • A collection of language-related "filtering" criteria that guide the Q-Former

[Figure: pure visual feature → Q-Former → query output (language-informative visual feature)]

29 of 49

Inference: Image Encoder + Q-former

Querying Transformer (Q-Former)

  • Extract a language-informative visual representation from the frozen image encoder by conditioning the learned queries on the image features
  • Compress the resolution-dependent visual features ([HxW, D]) into a fixed number of query outputs ([N, D])

[Figure: pure visual feature [HxW, D] → Q-Former → query output (language-informative visual feature) [N, D]]
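A minimal sketch of this compression step (our illustration, not the BLIP-2 code; N=32 and the single cross-attention layer are assumptions):

```python
# N learned queries cross-attend to a variable-length grid of frozen
# image features and always come out as a fixed [N, D] tensor.
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    def __init__(self, d=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # learned queries
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, image_feats):                 # [B, H*W, D], H*W varies
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out                                  # always [B, n_queries, D]

comp = QueryCompressor()
print(comp(torch.randn(2, 16 * 16, 768)).shape)  # torch.Size([2, 32, 768])
print(comp(torch.randn(2, 24 * 24, 768)).shape)  # torch.Size([2, 32, 768])
```

Whatever the input resolution, the output has the same fixed shape, which is what lets the LLM consume images of any size.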

30 of 49

Inference: Linear projection + LLM

Fully connected layer

  • Linearly project the language-informative visual features into the text embedding space
  • The projected features serve as soft visual prompts for the LLM, prepended to the prompt (e.g. "What is in the image?")
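These two steps can be sketched as follows (the dimensions are illustrative assumptions, not BLIP-2's actual sizes):

```python
# Project query outputs into the LLM's embedding space and prepend them
# to the text-prompt embeddings as soft visual prompts.
import torch
import torch.nn as nn

d_qformer, d_llm = 768, 2048
proj = nn.Linear(d_qformer, d_llm)

query_out = torch.randn(1, 32, d_qformer)   # from the Q-Former
prompt_emb = torch.randn(1, 7, d_llm)       # embeddings of "What is in the image?"
llm_input = torch.cat([proj(query_out), prompt_emb], dim=1)
print(llm_input.shape)  # torch.Size([1, 39, 2048])
```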

31 of 49

Pre-training stage 1: Vision & Language representation learning

Q: How to learn a Q-former that can comprehend both visual and textual information to extract language-informative visual features?

[Figure: the Q-Former trained with three proxy tasks: image-text matching, image-text contrastive learning, and image-grounded text generation]

32 of 49

Pre-training stage 1: Vision & Language representation learning

Q: How to learn a Q-former that can comprehend both visual and textual information to extract language-informative visual features?

A: Pre-train on three vision-language proxy tasks over image-text pairs: image-text matching, image-text contrastive learning, and image-grounded text generation

33 of 49

Pre-training stage 1: Vision & Language representation learning

Q: What is the architecture of Q-former? How to perform different proxy tasks?

  • Two transformer submodules (image and text) share the same self-attention layers; the learned queries and the text interact through these shared layers
  • Cross-attention layers condition the queries on the frozen image features

34 of 49

Pre-training stage 1: Vision & Language representation learning

Q: What is the architecture of Q-former? How to perform different proxy tasks?

  • For each proxy task, use different self-attention masks to control the interaction between queries and text.

[Figure: self-attention masking between the query embeddings and the text embeddings]

35 of 49

Pre-training stage 1: Vision & Language representation learning

Q: What is the architecture of Q-former? How to perform different proxy tasks?

  • For each proxy task, use different self-attention masks to control the interaction between queries and text.


36 of 49

Pre-training stage 1: Vision & Language representation learning

Q: What is the architecture of Q-former? How to perform different proxy tasks?

  • For each proxy task, use different self-attention masks to control the interaction between queries and text.


37 of 49

Pre-training stage 1: Vision & Language representation learning

Q: What is the architecture of Q-former? How to perform different proxy tasks?

  • For each proxy task, use a different self-attention mask to control the interaction between queries and text
  • For each image-text pair, combine the three proxy losses into one total loss
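A sketch of how such masks could be built (our illustration of the masking scheme, assuming queries occupy the first positions; True means attention is blocked):

```python
# Per-task self-attention masks over a [queries; text] sequence.
import torch

def qformer_mask(task, n_q, n_t):
    n = n_q + n_t
    mask = torch.zeros(n, n, dtype=torch.bool)
    if task == "itc":    # image-text contrastive: no cross-modal attention
        mask[:n_q, n_q:] = True
        mask[n_q:, :n_q] = True
    elif task == "itm":  # image-text matching: fully bi-directional
        pass
    elif task == "itg":  # image-grounded generation: queries cannot see text;
        mask[:n_q, n_q:] = True          # text attends to all queries and,
        causal = torch.triu(torch.ones(n_t, n_t, dtype=torch.bool), 1)
        mask[n_q:, n_q:] = causal        # causally, to previous text tokens
    return mask

m = qformer_mask("itg", n_q=2, n_t=3)
```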

38 of 49

Pre-training stage 2: Vision-to-Language generative learning

  • Handles both encoder-decoder and decoder-only LLMs
  • End-to-end training with the language-modeling loss, updating only the Q-Former and the FC layer

39 of 49

A two-stage pre-training strategy

Stage 1: image-text pair losses
Stage 2: language-generation loss

  • Update the Q-Former in Stage 1; update both the Q-Former and the FC layer in Stage 2
  • Both Flamingo and MAPL pre-train only with the Stage-2 objective

40 of 49

Pre-training Dataset & Benchmarks

Pre-training Data

  • 129M images in total: COCO, Visual Genome, CC3M, CC12M, SBU, and a subset of LAION-400M; by contrast, Flamingo used around 2B images
  • Synthetic captions are created for the web images (still noisy)
  • No sequences of text interleaved with images and/or videos, unlike Flamingo

41 of 49

Pre-training Dataset & Benchmarks

Pre-training Data

  • 129M images in total: COCO, Visual Genome, CC3M, CC12M, SBU, and a subset of LAION-400M; by contrast, Flamingo used around 2B images
  • Synthetic captions are created for the web images (noisy)
  • No sequences of text interleaved with images and/or videos, unlike Flamingo

Experiments

  • Instructed Zero-shot Image-to-Text Generation: Zero-shot VQA
  • Fine-tuned Image Captioning, VQA, image-text retrieval
    • Fine-tuned for each specific downstream task
    • Update both Q-former and ViT

42 of 49

Results: zero-shot VQA

  • Outperforms Flamingo-80B by 8.7% on VQAv2 with 54x fewer trainable parameters

43 of 49

Results: zero-shot VQA

  • Underperforms Flamingo-80B on open-knowledge VQA due to its smaller LLM (BLIP-2's 11B FlanT5 vs. Flamingo's 70B Chinchilla)

44 of 49

Results: Without Q-former pretraining

Demonstrate that decoupling the end-to-end training into two stages is crucial for state-of-the-art results.

45 of 49

Limitations

  • Cannot perform in-context learning given few-shot examples
    • No interleaved image-text sequences in the pre-training datasets

46 of 49

Limitations

  • Cannot perform in-context learning given few-shot examples
    • No interleaved image-text sequences in the pre-training datasets
  • Generates only short sentences (6.5 words on average) that cover few objects
    • Relying on more capable LLMs (e.g. Vicuna) helps
    • Noisy, short image-caption pairs are not sufficient
    • Manually annotated, more detailed image-description datasets lead to more natural language generation (e.g. MiniGPT-4)

47 of 49

Limitations

  • Cannot perform in-context learning given few-shot examples
    • No interleaved image-text sequences in the pre-training datasets
  • Generates only short sentences (6.5 words on average) that cover few objects
    • Relying on more capable LLMs (e.g. Vicuna) helps
    • Noisy, short image-caption pairs are not sufficient
    • Manually annotated, more detailed image-description datasets lead to more natural language generation (e.g. MiniGPT-4)
  • Cannot perceive local visual information in the image (e.g. charts, posters)
    • ViT-CLIP captures only global information
    • The high-resolution image features are compressed by the Q-Former into a small, fixed-size embedding (32 tokens by default)
    • Infusing local visual features (e.g. from object detection or segmentation) into the Q-Former could help

48 of 49

Summary

Parameter-efficient LLM-based vision-language models leverage LLMs' strong capabilities (e.g. zero-shot and in-context learning) for vision-language tasks while making minimal changes to the model's architecture and parameters.

Two adapter-free approaches, MAPL and BLIP-2, reduce the need for large numbers of trainable parameters, GPU resources, and pre-training data.

The next research directions may include:

  • Train a mapping network in a black-box setting (only access LLM’s API)
  • Enable models to comprehend local information in a high-resolution image

49 of 49

MAPL vs. Flamingo vs. BLIP-2

Feature | Flamingo | MAPL | BLIP-2
Model Capabilities | zero-shot; few-shot | zero-shot; few-shot | zero-shot
Model Architecture | Perceiver; a cross-attention layer for each LLM block | 4 transformer blocks | BERT-base: 12 blocks
Trainable Parameters | >1.4B | 3.4M | 104M
Dataset Size | 2B | around 400K | around 129M