
(I) Contrastive NLP

(II) contrastive long-tail LM

(III) patternless prompts

Nils Rethmeier

CopeNLU, DFKI

27.09.21

(I) Contrastive NLP - a primer1

Self-supervised contrastive pretraining

  • Many self-supervised methods
  • Contrastive learning rather than MLM or AR-LM
  • Data-efficient pretraining, zero-shot boosting; some methods are domain- (end-task-) specific

Supervised contrastive learning

  • Naturally task-specific
  • Few-shot boosting, naturally multimodal, strong for distillation; also common sense, classification, etc.

Unified self-supervised and supervised contrastive learning

  • CLESS2 - contrastive, learning- and data-efficient self-supervision
  • Small contrastive LMs for long-tail, zero-shot and few-shot learning

  1. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives, Rethmeier, Augenstein, 2021
  2. Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, Rethmeier, Augenstein, 2020

(I) Contrastive NLP - a primer1

Basic principle: Noise Contrastive Estimation, NCE (Gutmann, 2010)

  • undersamples the softmax normalization: use fewer 0’s (only K sampled negatives)

Method:

  • Given a text xi, build augmentations ai such that:
  • ai+ has a “similar or same meaning” as xi, e.g. xi with a synonym
  • ai+ := another representation (view) of xi
  • ai- has a “different meaning” than xi, e.g. another text xj
  • s(xi, ai±) is a similarity function
  • we learn s(xi, ai+) = 1 (match) and s(xi, ai-) = 0 (mismatch)

Example: binary NCE = BCE with undersampling

xi = “I like cookies”

ai+ = “I like all cookies”

ai- = “I like cakes” (a hard negative)

ai- = “The car is green” (an easy negative)
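
A minimal sketch of this binary NCE objective (BCE over one positive and K sampled negatives). This is illustrative only, not the primer's reference code; the random encode stub below stands in for any real text encoder.

  import torch
  import torch.nn.functional as F

  def binary_nce_loss(encode, x_i, a_pos, a_negs):
      """x_i: anchor text, a_pos: one positive view, a_negs: list of K negatives."""
      z_x = encode(x_i)                                  # (d,)
      z_pos = encode(a_pos)                              # (d,)
      z_negs = torch.stack([encode(a) for a in a_negs])  # (K, d)
      # s(xi, ai) as a dot product, squashed by the sigmoid inside BCE-with-logits
      pos_logit = (z_x * z_pos).sum().unsqueeze(0)       # target 1: a match
      neg_logits = z_negs @ z_x                          # targets 0: K mismatches
      logits = torch.cat([pos_logit, neg_logits])
      targets = torch.cat([torch.ones(1), torch.zeros(len(a_negs))])
      return F.binary_cross_entropy_with_logits(logits, targets)

  def encode(text, d=64):  # placeholder encoder; in practice a CNN or Transformer
      torch.manual_seed(abs(hash(text)) % (2 ** 31))
      return torch.randn(d)

  loss = binary_nce_loss(encode, "I like cookies", "I like all cookies",
                         ["I like cakes", "The car is green"])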

Sampling effects (xi, ai+, ai- in embedding space):

  • the positive ai+ is attracted towards xi
  • hard and easy negatives ai- are repelled from xi
  • this trains a ‘margin’ around xi - essentially clustering (see the sketch below)
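
The “undersampled softmax” view is often written as a softmax over one positive and K sampled negatives (an InfoNCE-style loss). A small sketch under that assumption; the temperature value is an assumed hyperparameter, not from the slides.

  import torch
  import torch.nn.functional as F

  def info_nce_loss(z_x, z_pos, z_negs, temperature=0.1):
      """z_x: (d,) anchor, z_pos: (d,) positive, z_negs: (K, d) sampled negatives."""
      z_x, z_pos = F.normalize(z_x, dim=0), F.normalize(z_pos, dim=0)
      z_negs = F.normalize(z_negs, dim=1)
      sims = torch.cat([(z_x * z_pos).sum().unsqueeze(0),  # index 0: the positive
                        z_negs @ z_x]) / temperature       # indices 1..K: negatives
      # cross-entropy towards index 0 attracts the positive and repels the negatives,
      # i.e. the softmax is normalized over only K+1 terms instead of everything
      return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

  loss = info_nce_loss(torch.randn(64), torch.randn(64), torch.randn(5, 64))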

(I) Contrastive NLP - a primer1

In a Transformer: self-supervised (binary NCE = BCE with undersampling)

xi = “I like cookies”

ai+ = “I like all cookies”

ai- = “I like cakes” (a hard negative)

ai- = “The car is green” (easy negative)

  • xi and an augmentation ai are fed jointly into *BERT, which predicts a single binary class (sketched below)
  • s(xi, ai+) = 1, i.e. a match
  • s(xi, ai-) = 0, i.e. a mismatch; a hard negative “wants to be a 1”
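
A sketch of this cross-encoder scoring with a Hugging Face checkpoint; “bert-base-uncased” and the single-logit head are assumed stand-ins for whatever *BERT variant is used, and the head only gives meaningful scores after fine-tuning on match/mismatch pairs.

  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=1)  # one binary match/mismatch logit

  def s(x_i, a_i):
      # every pair costs one full *BERT forward pass over "[CLS] x_i [SEP] a_i [SEP]"
      enc = tok(x_i, a_i, return_tensors="pt")
      return torch.sigmoid(model(**enc).logits).item()

  s("I like cookies", "I like all cookies")  # trained towards 1 (match)
  s("I like cookies", "The car is green")    # trained towards 0 (mismatch)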

(I) Contrastive NLP - a primer1

In a Transformer: supervised

xi = “[CLS] positive sentiment [SEP] I like cookies [SEP]”

ai- = “[CLS] negative sentiment [SEP] I hate cookies [SEP]”

  • the label text is part of the input; *BERT again predicts a single binary class
  • s(xi, ai-) = 0, i.e. a mismatch; a hard negative “wants to be a 1”

Problems:

  1. How to augment xi into ai+?
  2. Each s(xi, ai±) needs a full *BERT pass:
    1. does not scale well to many labels
    2. *BERT fine-tuning time increases by a factor of K

(II) contrastive long-tail LM2

In a Transformer: supervised, now with separate text and label encoders

xi = “I like cookies”

ai+ = label yi+ = “positive sentiment”, ai- = label yi- = “negative sentiment”

  • a text encoder encodes xi; a separate label encoder encodes the label texts
  • we learn s(xi, ai+) = 1 (match) and s(xi, ai-) = 0 (mismatch; a hard negative “wants to be a 1”)

Solution to the problems above:

  1. Encode text and label(-texts) separately (see the sketch below)
    1. encode the text once (the heavy operation)
    2. encode the K negative and P positive labels (cheap)
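
A sketch of that solution: encode the text once with a heavy encoder, encode the P positive and K negative label texts with a separate label encoder, then score every label against the one text embedding. The encoder modules are placeholders, not the paper's architecture.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class DualEncoder(nn.Module):
      def __init__(self, text_encoder, label_encoder):
          super().__init__()
          self.text_encoder = text_encoder    # heavy: run once per text
          self.label_encoder = label_encoder  # cheap: run once per label text

      def forward(self, text, pos_labels, neg_labels):
          z_text = self.text_encoder(text)                        # (d,)
          z_labels = self.label_encoder(pos_labels + neg_labels)  # (P+K, d)
          logits = z_labels @ z_text           # s(xi, ai) for every label at once
          targets = torch.cat([torch.ones(len(pos_labels)),   # P matches
                               torch.zeros(len(neg_labels))]) # K mismatches
          return F.binary_cross_entropy_with_logits(logits, targets)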

(C)LESS: data-efficient text encoder

Definition: data-efficient

  • sample efficiency -- aka (self-)supervised few-shot learning
  • label efficiency -- aka zero-shot to few-shot learning
  • long-tail efficiency ← a potential iceberg problem

Figure: label (tag) frequency distribution for tag prediction on StatsExchange online answers. The majority classes are what we typically evaluate; the rare/minority classes hold only 20% of the labels but make up 80% of the classes, and top-k measures ignore this long tail.

CLESS1: a contrastive, self-supervised text CNN

1. Data-Efficient Pretraining via Contrastive Self-Supervision, Rethmeier, Augenstein

Step 1: Given a text, sample some of its words as positive (pseudo-)labels; from another text in the batch, sample words as negative labels.
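
A sketch of that sampling step; the counts n_pos and n_neg are illustrative assumptions, not the paper's settings.

  import random

  def sample_pseudo_labels(text, other_text, n_pos=3, n_neg=5, seed=None):
      """Positives: words of the text itself; negatives: words from another batch text."""
      rng = random.Random(seed)
      words, other_words = text.split(), other_text.split()
      positives = rng.sample(words, k=min(n_pos, len(words)))
      # avoid negatives that also occur in the anchor text
      candidates = [w for w in other_words if w not in set(words)] or other_words
      negatives = rng.choices(candidates, k=n_neg)
      return positives, negatives

  pos, neg = sample_pseudo_labels("I like all cookies", "The car is green", seed=0)
  # pos -> pseudo-labels with target 1, neg -> pseudo-labels with target 0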

Step 2: Encode the text once (1x); encode the n sampled labels.

Step 3: Match the 1 text encoding with the n label encodings.

Step 4: Classify match vs. mismatch.

  • → learns a non-linear similarity function (see the sketch below)
  • the similarity function's complexity is determined by the matcher sub-net
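
A sketch of such a matcher sub-net as a small MLP over the concatenated text and label embeddings; the concatenation and the hidden size are assumptions, only meant to show where the non-linear similarity lives.

  import torch
  import torch.nn as nn

  class Matcher(nn.Module):
      """Non-linear s(text, label): one match/mismatch logit per label encoding."""
      def __init__(self, d, hidden=256):
          super().__init__()
          self.net = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

      def forward(self, z_text, z_labels):        # z_text: (d,), z_labels: (n, d)
          pairs = torch.cat([z_text.expand_as(z_labels), z_labels], dim=-1)
          return self.net(pairs).squeeze(-1)      # (n,) logits

  logits = Matcher(d=64)(torch.randn(64), torch.randn(8, 64))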

Step 5: Apply dropout (see the sketch below)

  • on the text → input robustness
  • on the labels → label-noise robustness, reduces label overfitting = a soft “label smoothing”
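
A small sketch of dropout on both sides of the matcher; the dropout rates are assumptions for illustration.

  import torch
  import torch.nn as nn

  text_dropout = nn.Dropout(p=0.1)   # on the text embedding  -> input robustness
  label_dropout = nn.Dropout(p=0.5)  # on the label embeddings -> label-noise robustness,
                                     # less label overfitting, a soft "label smoothing"

  z_text = text_dropout(torch.randn(64))        # embedded input text
  z_labels = label_dropout(torch.randn(8, 64))  # embedded (pseudo-)labels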

Step 6: Supervision labels are also words, aka text-to-text. CLESS uses dense-to-dense text prediction to minimize information loss: the pretrained matcher network is reused for the real labels → zero-shot learning, etc.

Results (task: long-tailed tag/label prediction on StatsExchange)

  • RQ1, X-data efficient pretraining: CLESS pretrains fast and data-efficiently
  • RQ2, Y-data (label) efficiency via pretraining: CLESS also learns from few labels
  • RQ3, long-tail efficient learning: CLESS learns the long tail well

(III) patternless prompts (same as contrastive learning)

Discrete prompt engineering: aka input feature engineering3

  • Change the prompt text → to change embeddings → to change the generation
    • learn to ask in such a way as to evoke a desired reaction
    • aka Neuro-Linguistic Programming (NLP)
    • Benefit: good few-shot learning

Prompt search / optimization:

  • Searching for prompt patterns is effortful, or memory-inefficient per task4

Patternless prompts: remove the pattern (trigger) or the verbalizer (label):

  1. xi = <input text [MASK]>, yi = <label text>
    • i.e. pattern = [MASK], verbalizer = label text
  2. Only fine-tune the model biases (BitFit) - see the sketch below

Prompts are still great for ‘priming’, i.e. zero-shot generation in very large (‘chonk’) models like GPT-3.

Example: xi = “I like cookies [MASK]” → *BERT predicts “positive” at the [MASK], aka the label, a label text or description.

3. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing - outdated survey on Prompting
4. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models, Logan, Riedel, arxiv.org/pdf/2106.13353
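
A sketch of the BitFit idea (train only the bias terms, freeze everything else); the checkpoint name is an assumed example, not the slide's exact setup.

  from transformers import AutoModelForMaskedLM

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
  for name, param in model.named_parameters():
      param.requires_grad = name.endswith(".bias")  # freeze all weights, keep biases trainable

  trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
  total = sum(p.numel() for p in model.parameters())
  print(f"training {trainable / total:.2%} of the parameters")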

(III) patternless prompts: compared to contrastive NLP

Patternless prompts (*BERT):

  xi = “I like cookies [MASK]” → predict the verbalizer / label text “positive” at the [MASK]

Contrastive learning (NLP)2 (text encoder + label encoder):

  xi = “I like cookies”
  ai+ = label yi+ = “positive sentiment”, ai- = label yi- = “negative sentiment”
  learn s(xi, ai+) = 1 and s(xi, ai-) = 0; here the label-text embedding plays the role of the “verbalizer”

Comparison:

  • Contrastive learning (NLP)2: unifies pretraining and fine-tuning; good zero-shot, few-shot and long-tail learning
  • Patternless prompts: only (mostly) fine-tuning; good few-shot learning