(I) Contrastive NLP
(II) contrastive long-tail LM
(III) patternless prompts
Nils Rethmeier
CopeNLU, DFKI
27.09.21
(I) Contrastive NLP - a primer1
Self-supervised contrastive pretraining
Supervised contrastive learning
Unified self-supervised and supervised contrastive learning
(I) Contrastive NLP - a primer1
Basic principle: Noise Contrastive Estimation, NCE (Gutmann, 2010)
Method: score each pair s(xi , ai) and classify match vs. mismatch
Example: Binary NCE = BCE with undersampling
xi = “I like cookies”
ai+ = “I like all cookies”
ai- = “I like cakes” (a hard negative)
ai- = “The car is green” (easy neg.)
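In the notation above, “Binary NCE = BCE with undersampling” means the per-example loss is a binary cross-entropy over the one positive and k sampled negatives; written out (my phrasing of the standard binary-NCE simplification, not copied from the slide):

\ell_i = -\log \sigma\big(s(x_i, a_i^{+})\big) - \sum_{j=1}^{k} \log\Big(1 - \sigma\big(s(x_i, a_{i,j}^{-})\big)\Big)

where \sigma is the sigmoid and s(\cdot, \cdot) is a learned similarity score: target 1 for the positive pair, target 0 for each sampled negative.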
Sampling effects: attract (xi , ai+), repel (xi , ai-) in embedding space
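A minimal PyTorch sketch of this attract/repel objective (my illustration; the toy encoder and random token ids are placeholders): pairs are scored by a dot product and trained with BCE, which pulls (xi, ai+) together and pushes the sampled ai- away.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "encoder": in practice a text encoder (CNN, Transformer, ...) producing sentence vectors.
emb = torch.nn.EmbeddingBag(1000, 64, mode="mean")

def encode(token_ids):                   # token_ids: (batch, seq_len) of toy vocabulary ids
    return emb(token_ids)                # (batch, 64) sentence embeddings

x_i   = torch.randint(0, 1000, (1, 5))   # stands in for "I like cookies"
a_pos = torch.randint(0, 1000, (1, 5))   # "I like all cookies"
a_neg = torch.randint(0, 1000, (2, 5))   # "I like cakes", "The car is green"

anchor = encode(x_i)                                 # (1, 64)
cands  = torch.cat([encode(a_pos), encode(a_neg)])   # (3, 64): 1 positive + 2 sampled negatives

scores  = cands @ anchor.t()                         # s(xi, ai) as dot products, shape (3, 1)
targets = torch.tensor([[1.0], [0.0], [0.0]])        # match = 1, mismatch = 0

loss = F.binary_cross_entropy_with_logits(scores, targets)   # binary NCE = BCE over the samples
loss.backward()   # gradient attracts (xi, ai+) and repels (xi, ai-) in embedding space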
In a Transformer: self-supervised
Feed a sentence pair into *BERT with a single binary class head:
s(xi , ai+) = 1, i.e. a match
s(xi , ai-) = 0, i.e. a mismatch (a hard negative “wants to be a 1”)
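A hedged sketch of this cross-encoder setup with Hugging Face transformers (my illustration; “*BERT” here means any BERT-style checkpoint, and bert-base-uncased is just an assumption). The same single-logit head also serves the supervised slide below, where only the paired sequences change.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"    # assumption: any BERT-style checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)  # single binary class head

x_i = "I like cookies"
pairs = [
    ("I like all cookies", 1.0),   # ai+       -> match
    ("I like cakes",       0.0),   # hard ai-  -> mismatch, but "wants to be a 1"
    ("The car is green",   0.0),   # easy ai-  -> mismatch
]
cands, targets = zip(*pairs)

enc = tok([x_i] * len(cands), list(cands), padding=True, return_tensors="pt")  # "[CLS] xi [SEP] ai [SEP]"
s = model(**enc).logits.squeeze(-1)            # one s(xi, ai) logit per pair
loss = torch.nn.functional.binary_cross_entropy_with_logits(s, torch.tensor(targets))
loss.backward()   # self-supervised: the pairs and targets come from raw text, not human labels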
In a Transformer: supervised
xi = “[CLS] positive sentiment [SEP] I like cookies [SEP]”
ai- = “[CLS] negative sentiment [SEP] I hate cookies [SEP]”
Again a single binary head in *BERT: s(xi , ai-) = 0, i.e. a mismatch (the label texts differ; this hard negative “wants to be a 1”)
Problems: the label text is folded into the input, so each candidate label requires its own *BERT forward pass over the pair.
(II) contrastive long-tail LM2
Basic principle: Noise Contrastive Estimation, NCE (Gutmann, 2010)
Supervised, now with separate text and label encoders:
xi = “I like cookies”
ai- = label yi- = “negative sentiment”, ai+ = label yi+ = “positive sentiment”
The text encoder embeds xi, the label encoder embeds the label texts; s(xi , ai+) = 1, i.e. a match, and s(xi , ai-) = 0, i.e. a mismatch (the hard negative “wants to be a 1”)
Solution to the problems above: with a label encoder, the text is encoded once and scored against many encoded label texts
(C)LESS: data-efficient text encoder
Definition: data-efficient
Majority classes: what we typically evaluate
Rare/minority classes: top-k measures ignore these, though they are 20% of labels but 80% of classes
Task: tag prediction for StatsExchange online answers
CLESS1, contrastive, self-supervised text CNN
1 Data-Efficient Pretraining via Contrastive Self-Supervision, Rethmeier, Augenstein
Given a text, sample some of its words as positive (pseudo-)labels, and words from another text in the batch as negatives (sketched in code below)
Encode the text 1x
Encode n labels
Match the 1 text encoding with the n label encodings
Classify match vs. mismatch
Dropout
→ input robustness
→ label noise robustness
→ reduces label overfit
= a soft “label smoothing”
Supervision labels are also words, aka text-to-text. CLESS uses dense-to-dense text prediction to minimize info loss.
Reuse the pretrained matcher network for real labels → zero-shot learning etc.
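Putting these steps together, a rough self-contained PyTorch sketch of the CLESS idea as I read the slides; the encoder sizes, the word-level pseudo-label sampling, and the concatenation matcher head are my simplifications, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 5000, 128

class TextCNN(nn.Module):
    """Toy stand-in for the CLESS text CNN encoder."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.conv = nn.Conv1d(DIM, DIM, kernel_size=3, padding=1)
    def forward(self, ids):                        # ids: (batch, seq_len)
        h = self.emb(ids).transpose(1, 2)          # (batch, DIM, seq_len)
        return F.relu(self.conv(h)).max(dim=2).values   # (batch, DIM) pooled text encoding

class LabelEncoder(nn.Module):
    """Labels are also words (text-to-text): embed and pool the label tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB, DIM, mode="mean")
    def forward(self, label_ids):                  # (n_labels, label_len)
        return self.emb(label_ids)                 # (n_labels, DIM)

class Matcher(nn.Module):
    """Classify match vs. mismatch between one text encoding and n label encodings."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))
    def forward(self, text_vec, label_vecs):       # (DIM,), (n, DIM)
        pairs = torch.cat([text_vec.expand_as(label_vecs), label_vecs], dim=1)
        return self.head(pairs).squeeze(-1)        # (n,) match logits

text_enc, label_enc, matcher = TextCNN(), LabelEncoder(), Matcher()
drop = nn.Dropout(0.1)    # dropout -> input/label-noise robustness, a soft "label smoothing"

# Self-supervised pretraining step on a toy batch of two texts.
texts = torch.randint(0, VOCAB, (2, 20))
text_vecs = drop(text_enc(texts))                 # encode each text once (1x)

# Pseudo-labels for text 0: its own words are positives,
# words sampled from the other text in the batch are negatives.
pos = texts[0, torch.randperm(20)[:4]].unsqueeze(1)   # (4, 1) positive "label" words
neg = texts[1, torch.randperm(20)[:4]].unsqueeze(1)   # (4, 1) negative "label" words
label_vecs = drop(label_enc(torch.cat([pos, neg])))   # encode the n = 8 labels

logits  = matcher(text_vecs[0], label_vecs)           # match 1 text encoding with n label encodings
targets = torch.cat([torch.ones(4), torch.zeros(4)])
loss = F.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
# Fine-tuning / zero-shot: keep the pretrained matcher, but feed real label texts
# (e.g. StatsExchange tags) through the label encoder instead of sampled words.

Because labels enter as text through the label encoder, unseen label texts can be scored at inference time, which is what the zero-shot and long-tail results below rely on.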
Results: RQ1, X-data efficient pretraining
CLESS: Pretrains fast and data-efficiently
Task: long-tailed tag (label) prediction on StatsExchange
Results: RQ2, Y-data (label) efficiency via pretraining
Learns from few labels
Results: RQ3, Long-tail efficient learning
Learns the long-tail well
(III) patternless prompts (same as contrastive learning)
Discrete prompt engineering: aka input feature engineering3
Prompt search/ optimization:
Prompt: remove pattern (trigger) or verbalizer (label):
Prompts are still great for ‘priming’, i.e. zero-shot generation in chonk models like GPT-3
xi = “I like cookies [MASK]”
*BERT fills [MASK] with “positive”, the verbalizer, aka a label text or description
3. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing - outdated survey on Prompting
4. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models, Logan, Riedel, arxiv.org/pdf/2106.13353
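A hedged code sketch (mine) of this patternless prompt with Hugging Face transformers: the input is just the text plus [MASK], no hand-crafted pattern, and the [MASK] logits of the verbalizer words act as class scores. The checkpoint name and single-wordpiece verbalizers are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-uncased"                    # assumption: any *BERT MLM checkpoint
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name)

x_i = f"I like cookies {tok.mask_token}"      # patternless prompt: the text plus [MASK], no trigger pattern
verbalizers = ["positive", "negative"]        # label words (assumed to be single wordpieces)

enc = tok(x_i, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**enc).logits                # (1, seq_len, vocab_size)

mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0].item()
verb_ids = tok.convert_tokens_to_ids(verbalizers)
scores = logits[0, mask_pos, verb_ids]        # [MASK] logits for the verbalizer words
pred = verbalizers[scores.argmax().item()]    # e.g. "positive"

This is the point of the comparison on the next slide: the MLM vocabulary head plays the role that the label encoder plays in the contrastive setup.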
(III) patternless prompts: compared to contrastive NLP
Patternless prompts: xi = “I like cookies [MASK]”; *BERT fills [MASK] with the verbalizer word “positive”.
Contrastive learning (NLP)2: xi = “I like cookies”; a text encoder embeds the text, a label encoder embeds the label texts ai+ = “positive sentiment” and ai- = “negative sentiment”; s(xi , ai+) = 1, s(xi , ai-) = 0 (a mismatch). The label-text embedding plays the role of the “Verbalizer”.
2. Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, Rethmeier, Augenstein, 2020
Contrastive learning (NLP): unifies pretraining and fine-tuning; good zero-shot, few-shot, and long-tail learning.
Patternless prompts: only (mostly) fine-tuning; good few-shot learning.