
(I) Contrastive NLP

(II) contrastive long-tail LM

(III) patternless prompts

Nils Rethmeier

CopeNLU, DFKI

27.09.21

(I) Contrastive NLP - a primer1

Self-supervised contrastive pretraining

  • Many self-supervised methods
  • Contrastive learning rather than MLM or AR-LM
  • Data-efficient pretraining, zero-shot boosting; some methods are domain- (end-task-) specific

Supervised contrastive learning

  • Naturally task-specific
  • Few-shot boosting, naturally multimodal, strong for distillation; also common sense, classification, etc.

Unified self-supervised and supervised contrastive learning

  • CLESS2 - contrastive, learning- and data-efficient self-supervision
  • Small contrastive LMs for long-tail, zero-shot and few-shot learning

  1. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives, Rethmeier, Augenstein, 2021
  2. Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, Rethmeier, Augenstein, 2020

(I) Contrastive NLP - a primer1

Basic principle: Noise Contrastive Estimation, NCE (Gutmann, 2010)

  • undersamples the softmax normalization: use fewer 0’s (only K sampled negatives)

Method:

  • Given a text xi, build augmentations ai such that:
  • ai+ has a “similar or same meaning” as xi, e.g. xi with a synonym
  • ai+ := another representation (view) of xi
  • ai- has a “different meaning” than xi, e.g. another text xj
  • s(xi, ai±) is a similarity function
  • we learn s(xi, ai+) = 1 (match) and s(xi, ai-) = 0 (mismatch)

Example: binary NCE = BCE with undersampling

xi = “I like cookies”

ai+ = “I like all cookies”

ai- = “I like cakes” (a hard negative)

ai- = “The car is green” (an easy negative)
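
A minimal sketch of this binary NCE objective (BCE over one positive and K sampled negatives). This is illustrative only, not the primer's reference code; the random encode stub below stands in for any real text encoder.

  import torch
  import torch.nn.functional as F

  def binary_nce_loss(encode, x_i, a_pos, a_negs):
      """x_i: anchor text, a_pos: one positive view, a_negs: list of K negatives."""
      z_x = encode(x_i)                                  # (d,)
      z_pos = encode(a_pos)                              # (d,)
      z_negs = torch.stack([encode(a) for a in a_negs])  # (K, d)
      # s(xi, ai) as a dot product, squashed by the sigmoid inside BCE-with-logits
      pos_logit = (z_x * z_pos).sum().unsqueeze(0)       # target 1: a match
      neg_logits = z_negs @ z_x                          # targets 0: K mismatches
      logits = torch.cat([pos_logit, neg_logits])
      targets = torch.cat([torch.ones(1), torch.zeros(len(a_negs))])
      return F.binary_cross_entropy_with_logits(logits, targets)

  def encode(text, d=64):  # placeholder encoder; in practice a CNN or Transformer
      torch.manual_seed(abs(hash(text)) % (2 ** 31))
      return torch.randn(d)

  loss = binary_nce_loss(encode, "I like cookies", "I like all cookies",
                         ["I like cakes", "The car is green"])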

Sampling effects (xi, ai+, ai- in embedding space):

  • the positive ai+ is attracted towards xi
  • hard and easy negatives ai- are repelled from xi
  • this trains a ‘margin’ around xi - essentially clustering (see the sketch below)
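
The “undersampled softmax” view is often written as a softmax over one positive and K sampled negatives (an InfoNCE-style loss). A small sketch under that assumption; the temperature value is an assumed hyperparameter, not from the slides.

  import torch
  import torch.nn.functional as F

  def info_nce_loss(z_x, z_pos, z_negs, temperature=0.1):
      """z_x: (d,) anchor, z_pos: (d,) positive, z_negs: (K, d) sampled negatives."""
      z_x, z_pos = F.normalize(z_x, dim=0), F.normalize(z_pos, dim=0)
      z_negs = F.normalize(z_negs, dim=1)
      sims = torch.cat([(z_x * z_pos).sum().unsqueeze(0),  # index 0: the positive
                        z_negs @ z_x]) / temperature       # indices 1..K: negatives
      # cross-entropy towards index 0 attracts the positive and repels the negatives,
      # i.e. the softmax is normalized over only K+1 terms instead of everything
      return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

  loss = info_nce_loss(torch.randn(64), torch.randn(64), torch.randn(5, 64))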

(I) Contrastive NLP - a primer1

In a Transformer: self-supervised (binary NCE = BCE with undersampling)

xi = “I like cookies”

ai+ = “I like all cookies”

ai- = “I like cakes” (a hard negative)

ai- = “The car is green” (easy negative)

  • xi and an augmentation ai are fed jointly into *BERT, which predicts a single binary class (sketched below)
  • s(xi, ai+) = 1, i.e. a match
  • s(xi, ai-) = 0, i.e. a mismatch; a hard negative “wants to be a 1”
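
A sketch of this cross-encoder scoring with a Hugging Face checkpoint; “bert-base-uncased” and the single-logit head are assumed stand-ins for whatever *BERT variant is used, and the head only gives meaningful scores after fine-tuning on match/mismatch pairs.

  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=1)  # one binary match/mismatch logit

  def s(x_i, a_i):
      # every pair costs one full *BERT forward pass over "[CLS] x_i [SEP] a_i [SEP]"
      enc = tok(x_i, a_i, return_tensors="pt")
      return torch.sigmoid(model(**enc).logits).item()

  s("I like cookies", "I like all cookies")  # trained towards 1 (match)
  s("I like cookies", "The car is green")    # trained towards 0 (mismatch)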

(I) Contrastive NLP - a primer1

In a Transformer: supervised

xi = “[CLS] positive sentiment [SEP] I like cookies [SEP]”

ai- = “[CLS] negative sentiment [SEP] I hate cookies [SEP]”

  • the label text is part of the input; *BERT again predicts a single binary class
  • s(xi, ai-) = 0, i.e. a mismatch; a hard negative “wants to be a 1”

Problems:

  1. How to augment xi into ai+?
  2. Each s(xi, ai±) needs a full *BERT pass:
    1. does not scale well to many labels
    2. *BERT fine-tuning time increases by a factor of K

(II) contrastive long-tail LM2

In a Transformer: supervised, now with separate text and label encoders

xi = “I like cookies”

ai+ = label yi+ = “positive sentiment”, ai- = label yi- = “negative sentiment”

  • a text encoder encodes xi; a separate label encoder encodes the label texts
  • we learn s(xi, ai+) = 1 (match) and s(xi, ai-) = 0 (mismatch; a hard negative “wants to be a 1”)

Solution to the problems above:

  1. Encode text and label(-texts) separately (see the sketch below)
    1. encode the text once (the heavy operation)
    2. encode the K negative and P positive labels (cheap)
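
A sketch of that solution: encode the text once with a heavy encoder, encode the P positive and K negative label texts with a separate label encoder, then score every label against the one text embedding. The encoder modules are placeholders, not the paper's architecture.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class DualEncoder(nn.Module):
      def __init__(self, text_encoder, label_encoder):
          super().__init__()
          self.text_encoder = text_encoder    # heavy: run once per text
          self.label_encoder = label_encoder  # cheap: run once per label text

      def forward(self, text, pos_labels, neg_labels):
          z_text = self.text_encoder(text)                        # (d,)
          z_labels = self.label_encoder(pos_labels + neg_labels)  # (P+K, d)
          logits = z_labels @ z_text           # s(xi, ai) for every label at once
          targets = torch.cat([torch.ones(len(pos_labels)),   # P matches
                               torch.zeros(len(neg_labels))]) # K mismatches
          return F.binary_cross_entropy_with_logits(logits, targets)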

(C)LESS: data-efficient text encoder

Definition: data-efficient

  • sample efficiency -- aka (self-)supervised few-shot learning
  • label efficiency -- aka zero-shot to few-shot learning
  • long-tail efficiency ← a potential iceberg problem

Figure: label (tag) frequency distribution for tag prediction on StatsExchange online answers. The majority classes are what we typically evaluate; the rare/minority classes hold only 20% of the labels but make up 80% of the classes, and top-k measures ignore this long tail.

CLESS1: a contrastive, self-supervised text CNN

1. Data-Efficient Pretraining via Contrastive Self-Supervision, Rethmeier, Augenstein

Step 1: Given a text, sample some of its words as positive (pseudo-)labels; from another text in the batch, sample words as negative labels.
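
A sketch of that sampling step; the counts n_pos and n_neg are illustrative assumptions, not the paper's settings.

  import random

  def sample_pseudo_labels(text, other_text, n_pos=3, n_neg=5, seed=None):
      """Positives: words of the text itself; negatives: words from another batch text."""
      rng = random.Random(seed)
      words, other_words = text.split(), other_text.split()
      positives = rng.sample(words, k=min(n_pos, len(words)))
      # avoid negatives that also occur in the anchor text
      candidates = [w for w in other_words if w not in set(words)] or other_words
      negatives = rng.choices(candidates, k=n_neg)
      return positives, negatives

  pos, neg = sample_pseudo_labels("I like all cookies", "The car is green", seed=0)
  # pos -> pseudo-labels with target 1, neg -> pseudo-labels with target 0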

Step 2: Encode the text once (1x); encode the n sampled labels.

Step 3: Match the 1 text encoding with the n label encodings.

Step 4: Classify match vs. mismatch.

  • → learns a non-linear similarity function (see the sketch below)
  • the similarity function's complexity is determined by the matcher sub-net
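
A sketch of such a matcher sub-net as a small MLP over the concatenated text and label embeddings; the concatenation and the hidden size are assumptions, only meant to show where the non-linear similarity lives.

  import torch
  import torch.nn as nn

  class Matcher(nn.Module):
      """Non-linear s(text, label): one match/mismatch logit per label encoding."""
      def __init__(self, d, hidden=256):
          super().__init__()
          self.net = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

      def forward(self, z_text, z_labels):        # z_text: (d,), z_labels: (n, d)
          pairs = torch.cat([z_text.expand_as(z_labels), z_labels], dim=-1)
          return self.net(pairs).squeeze(-1)      # (n,) logits

  logits = Matcher(d=64)(torch.randn(64), torch.randn(8, 64))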

Step 5: Apply dropout (see the sketch below)

  • on the text → input robustness
  • on the labels → label-noise robustness, reduces label overfitting = a soft “label smoothing”
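
A small sketch of dropout on both sides of the matcher; the dropout rates are assumptions for illustration.

  import torch
  import torch.nn as nn

  text_dropout = nn.Dropout(p=0.1)   # on the text embedding  -> input robustness
  label_dropout = nn.Dropout(p=0.5)  # on the label embeddings -> label-noise robustness,
                                     # less label overfitting, a soft "label smoothing"

  z_text = text_dropout(torch.randn(64))        # embedded input text
  z_labels = label_dropout(torch.randn(8, 64))  # embedded (pseudo-)labels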

Step 6: Supervision labels are also words, aka text-to-text. CLESS uses dense-to-dense text prediction to minimize information loss: the pretrained matcher network is reused for the real labels → zero-shot learning, etc.

Results (task: long-tailed tag/label prediction on StatsExchange)

  • RQ1, X-data efficient pretraining: CLESS pretrains fast and data-efficiently
  • RQ2, Y-data (label) efficiency via pretraining: CLESS also learns from few labels
  • RQ3, long-tail efficient learning: CLESS learns the long tail well

(III) patternless prompts (same as contrastive learning)

Discrete prompt engineering: aka input feature engineering3

  • Change the prompt text → to change embeddings → to change the generation
    • learn to ask in such a way as to evoke a desired reaction
    • aka Neuro-Linguistic Programming (NLP)
    • Benefit: good few-shot learning

Prompt search / optimization:

  • Searching for prompt patterns is effortful, or memory-inefficient per task4

Patternless prompts: remove the pattern (trigger) or the verbalizer (label):

  1. xi = <input text [MASK]>, yi = <label text>
    • i.e. pattern = [MASK], verbalizer = label text
  2. Only fine-tune the model biases (BitFit) - see the sketch below

Prompts are still great for ‘priming’, i.e. zero-shot generation in very large (‘chonk’) models like GPT-3.

Example: xi = “I like cookies [MASK]” → *BERT predicts “positive” at the [MASK], aka the label, a label text or description.

3. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing - outdated survey on Prompting
4. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models, Logan, Riedel, arxiv.org/pdf/2106.13353
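
A sketch of the BitFit idea (train only the bias terms, freeze everything else); the checkpoint name is an assumed example, not the slide's exact setup.

  from transformers import AutoModelForMaskedLM

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  for name, param in model.named_parameters():
      param.requires_grad = name.endswith(".bias")  # freeze all weights, keep biases trainable

  trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
  total = sum(p.numel() for p in model.parameters())
  print(f"training {trainable / total:.2%} of the parameters")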

(III) patternless prompts: compared to contrastive NLP

Patternless prompts (*BERT):

  xi = “I like cookies [MASK]” → predict the verbalizer / label text “positive” at the [MASK]

Contrastive learning (NLP)2 (text encoder + label encoder):

  xi = “I like cookies”
  ai+ = label yi+ = “positive sentiment”, ai- = label yi- = “negative sentiment”
  learn s(xi, ai+) = 1 and s(xi, ai-) = 0; here the label-text embedding plays the role of the “verbalizer”

Comparison:

  • Contrastive learning (NLP)2: unifies pretraining and fine-tuning; good zero-shot, few-shot and long-tail learning
  • Patternless prompts: only (mostly) fine-tuning; good few-shot learning