1 of 34

Insights from NLP research

2 of 34


LMs are trained to predict missing words

[Diagram: a language model predicts the [MASK] token in “The quick brown fox [MASK] jumps”]
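The fill-in-the-blank objective can be illustrated with a toy, count-based stand-in for a real language model (the tiny corpus, the candidate list, and the `fill_mask` helper below are all invented for illustration):

```python
from collections import Counter

# Toy "masked LM": pick the candidate word whose (left, word) and
# (word, right) neighbor pairs were seen most often in a tiny corpus.
corpus = [
    "the quick brown fox quickly jumps over the lazy dog",
    "a brown fox quickly jumps over a log",
    "the quick brown fox gracefully jumps over the fence",
]

pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        if i > 0:
            pair_counts[(words[i - 1], w)] += 1   # (left neighbor, word)
        if i < len(words) - 1:
            pair_counts[(w, words[i + 1])] += 1   # (word, right neighbor)

def fill_mask(left, right, candidates):
    """Score each candidate by how often it co-occurred with both neighbors."""
    return max(candidates,
               key=lambda w: pair_counts[(left, w)] + pair_counts[(w, right)])

print(fill_mask("fox", "jumps", ["quickly", "banana", "gracefully"]))  # → quickly
```

A real masked LM scores every vocabulary word with a neural network; the counting here only mimics the interface.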

3 of 34

Language models are everywhere


Small language models

https://gluebenchmark.com/

https://super.gluebenchmark.com/

https://crfm.stanford.edu/helm/lite/latest/

https://paperswithcode.com/dataset/mmlu

https://paperswithcode.com/dataset/big-bench

4 of 34

Language models are everywhere


Sentiment

Question Answering

Summarization

Coreference Resolution

5 of 34

Transformer

[Example: the model completes “Harry never thought he ???” as “Harry never thought he would”]

6 of 34

BERT: Workflow

1. Self-attention with bi-directional context
2. Masked language modeling (MLM)

[Diagram: the current word w_t is replaced by [MASK] and predicted from context words w_t-2, w_t-1, w_t+1, w_t+2 on both sides]
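The MLM corruption step can be sketched as follows (the 15% rate and the 80/10/10 mask/random/keep split follow the BERT paper, but `mask_tokens` itself is an illustrative helper, not BERT's actual code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: each selected position becomes [MASK] 80%
    of the time, a random word 10%, and stays unchanged 10%.
    Returns (corrupted tokens, labels with None at unselected positions)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)               # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)              # no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens)
```

During training, the loss is computed only at positions where a label is present.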

7 of 34

GPT2: Workflow

1. Self-attention with uni-directional context
2. Causal language modeling (CLM)

[Diagram: the current word w_t is predicted from left-context words w_t-2, w_t-1 only]
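The uni-directional context is typically enforced with a lower-triangular attention mask; a minimal sketch (the `causal_mask` helper name is illustrative):

```python
def causal_mask(n):
    """Lower-triangular attention mask: position i may attend only to
    positions 0..i (uni-directional context, as in GPT-2)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```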

8 of 34

Fine-tuning: tune a pretrained language model on a task

9 of 34

Downsides of full fine-tuning

10 of 34

Model Interpretability?

[Diagram: a classifier probes BERT’s internal representations (“BERTology”)]

11 of 34

Hierarchy of Linguistic Info - Setting

  • Conneau et al., ACL’18: build a diagnostic classifier to predict whether a linguistic property is encoded in a given sentence representation.
  • Features:
    • Surface – Sentence Length, Word Content
    • Syntactic – Bigram shift, Tree depth, Top constituent
    • Semantic – Tense, Subject Number, Object Number, Coordination Inversion and Semantic Odd Man Out.

[Diagram: a BERT layer feeds a simple classifier that predicts sentence length; high prediction accuracy suggests the representation captures the sentence-length feature]
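The probing setup can be sketched end to end with a stand-in encoder (the `encode` stub, the toy data, and the tiny logistic-regression probe are all invented for illustration; a real probe sits on frozen BERT-layer activations):

```python
import math

# Toy probing setup: a frozen "encoder" produces sentence representations
# (here just fabricated 2-d features); a simple linear probe is trained on
# top to predict a property (sentence length: short=0, long=1). If the
# probe does well, the representation plausibly encodes that property.

def encode(sentence):
    # stand-in for a frozen BERT layer; NOT a real encoder
    words = sentence.split()
    return [len(words) / 10.0, sum(map(len, words)) / 50.0]

data = [("a dog runs", 0), ("the cat sat", 0),
        ("the quick brown fox jumps over the lazy dog today", 1),
        ("she said she would come to the party next week", 1)]

w, b = [0.0, 0.0], 0.0
for _ in range(500):                     # plain logistic-regression SGD
    for sent, y in data:
        x = encode(sent)
        p = 1 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b)))
        g = p - y                        # gradient of the log loss w.r.t. logit
        w = [w[0] - 0.5*g*x[0], w[1] - 0.5*g*x[1]]
        b -= 0.5 * g

def probe(sentence):
    x = encode(sentence)
    return int(1 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b))) > 0.5)
```

The probe's accuracy, not the encoder's, is the quantity of interest: only the linear layer is trained.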

12 of 34

[Figure: probing results grouped into Surface, Syntactic, and Semantic features]

BERT composes a hierarchy of linguistic signals, ranging from surface to semantic features

13 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


14 of 34

T5 (Text-to-Text Transfer Transformer): Workflow

  • Text-to-text transformer
  • Encoder-decoder model
  • Reformulates all tasks (during pretraining and fine-tuning) in a text-to-text format

15 of 34

T5: Workflow, Encoder

  • Original text: Thank you for inviting me to your party last week
  • Input text: Thank you for inviting me to your party <Y> week (token masking)
  • Input text: Thank you <X> me to your party <Y> week
    • <X> replaces the span “for inviting” (span masking)
  • Input text: Thank you me to your party week (token dropping)
  • Input text: party me your to. last you inviting week Thanks (deshuffling)
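The span-masking variant can be sketched as a small function (an illustrative sketch only; T5's actual pipeline samples spans at random and ends the target with a final sentinel):

```python
def span_corrupt(words, spans):
    """Replace each (start, end) span with a sentinel token and collect the
    dropped words as the target, T5-style."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    inp, target, prev = [], [], 0
    for sent, (start, end) in zip(sentinels, spans):
        inp += words[prev:start] + [sent]     # sentinel stands in for the span
        target += [sent] + words[start:end]   # target pairs sentinel with span
        prev = end
    inp += words[prev:]
    return " ".join(inp), " ".join(target)

words = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(words, [(2, 4), (8, 9)])
# inp: "Thank you <X> me to your party <Y> week"
# tgt: "<X> for inviting <Y> last"
```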

16 of 34

T5: Attention masking patterns

  • Fully-visible masking: Translate English to German: That is good. Target: Das ist gut.
    • the representation of “good” can attend to the entire sequence
  • Causal masking: Translate English to German: That is good. Target: Das ist gut.
    • the representation of “good” can only attend to “Translate English to German: That is”
  • Causal masking with prefix: Translate English to German: That is good. Target: Das ist gut.
    • the prefix is fully visible; only the target “Das ist gut.” is decoded causally
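The three attention patterns (fully-visible, causal, causal with a prefix, following the T5 paper's terminology) can be sketched as boolean mask matrices; `attention_masks` and its parameters are illustrative, not T5's API:

```python
def attention_masks(n_prefix, n_total):
    """Three masking patterns over n_total positions, of which the first
    n_prefix form the input/prefix:
      fully_visible: every position sees every position (encoder-style)
      causal:        position i sees positions 0..i only
      prefix_lm:     full visibility inside the prefix, causal afterwards
    """
    fully_visible = [[1] * n_total for _ in range(n_total)]
    causal = [[1 if j <= i else 0 for j in range(n_total)]
              for i in range(n_total)]
    prefix_lm = [[1 if (j < n_prefix or j <= i) else 0 for j in range(n_total)]
                 for i in range(n_total)]
    return fully_visible, causal, prefix_lm

fv, causal, prefix = attention_masks(n_prefix=2, n_total=4)
```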

17 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


18 of 34

Zero-shot vs. One-shot vs. Few-shot prompting
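The difference between the three regimes is purely in how many demonstrations the prompt contains; a minimal prompt-builder sketch (the format and the `build_prompt` helper are illustrative, not a standard API):

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: zero-shot when `examples` is empty, one-shot with
    one demonstration, few-shot with several."""
    lines = [instruction]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")   # model completes from here
    return "\n\n".join(lines)

zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt("Translate English to French.",
                        [("sea otter", "loutre de mer"), ("cheese", "fromage")],
                        "peppermint")
```

No weights are updated in any of the three settings; the demonstrations live entirely in the context window.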


19 of 34

GPT-3 Prompting


20 of 34

Prompt-tuning

21 of 34

Hard vs. Soft Prompts

  • Hard prompt: manually handcrafted text prompts with discrete input tokens
    • we directly change the discrete input tokens, which are not differentiable

  • Soft prompt: a trainable tensor concatenated with the input-token embeddings and optimized via backpropagation to improve performance on a target task.
    • cannot be viewed and edited in text
    • lack of interpretability

  • Example hard prompts for translation:
    • Translate the English sentence {english_sentence} into German: {german_translation}
    • English: {english_sentence} | German: {german_translation}
    • From English to German: {english_sentence} -> {german_translation}
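A shape-level sketch of the soft-prompt idea (plain Python lists stand in for embedding tensors; `prepend_soft_prompt` and its random initialization are illustrative):

```python
import random

def prepend_soft_prompt(input_embeddings, n_virtual, dim, seed=0):
    """Soft prompting: prepend n_virtual trainable vectors (randomly
    initialized here) to the frozen input embeddings. Only these virtual-token
    vectors would be updated during training; they correspond to no real
    words, which is why soft prompts cannot be read or edited as text."""
    rng = random.Random(seed)
    soft = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(n_virtual)]
    return soft + input_embeddings

seq = [[0.1] * 4, [0.2] * 4, [0.3] * 4]   # embeddings for 3 real tokens
full = prepend_soft_prompt(seq, n_virtual=2, dim=4)
# len(full) == 5: 2 virtual tokens + 3 real tokens
```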

22 of 34

Soft Prompt-tuning vs. Adapters

[Diagram: soft prompt tuning prepends trainable vectors to the input; Low-Rank Adaptation (LoRA) adds trainable low-rank matrices inside each layer]
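LoRA's forward pass can be sketched with plain lists (illustrative only: real LoRA trains A and B by gradient descent and typically scales the update by alpha/r):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA: keep the frozen weight W (d_out x d_in) and add a trainable
    low-rank update, so y = W x + alpha * B (A x). Only A (r x d_in) and
    B (d_out x r) are trained, far fewer parameters than W."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    return [w + alpha * ba
            for w, ba in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# frozen W = identity; rank-1 update routes input dim 0 into output dim 1
y = lora_forward([2, 3], [[1, 0], [0, 1]], [[1, 0]], [[0], [1]])
# y == [2.0, 5.0]
```

Because the update has rank r, adapting a d_out x d_in layer costs only r*(d_in + d_out) trainable parameters.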

23 of 34

Prompt-tuning

  • Prompt-tuning refers to techniques that vary the input prompt to achieve better modeling results.
  • Idea: convert data into natural language prompts
  • works better in few-shot, one-shot, or zero-shot settings

24 of 34

Prompt-tuning

  • Prompt template: manually designed natural language input for a task

25 of 34

Prompt-tuning

  • Prompt template: manually designed natural language input for a task
  • PLM: a pretrained language model performs language modeling (masked LM or auto-regressive LM) over the prompt

26 of 34

Prompt-tuning

  • Verbalizer: mapping from vocabulary words to task labels
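Template and verbalizer fit together as follows (a toy sketch; `mlm_scores` is a fabricated stand-in for a real masked LM's predictions):

```python
# Toy prompt-based classification: fill a template, let the (stub) masked LM
# score vocabulary words for the [MASK] slot, then map words to labels
# with a verbalizer.

TEMPLATE = "{sentence} It was [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}

def mlm_scores(prompt):
    # hypothetical scores a masked LM might assign to candidate words
    return {"great": 0.7, "terrible": 0.3} if "loved" in prompt else \
           {"great": 0.2, "terrible": 0.8}

def classify(sentence):
    prompt = TEMPLATE.format(sentence=sentence)
    scores = mlm_scores(prompt)
    best_word = max(scores, key=scores.get)   # most likely [MASK] filler
    return VERBALIZER[best_word]

print(classify("I loved this movie."))   # → positive
```

The task is thus solved with the pretraining objective itself; no classification head is added.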

27 of 34

Limits of prompting for harder tasks?

  • Ask GPT-3: What are some great financial investments with no risk at all?
    • “Conspiracy” prompt: Buy gold and silver, and invest in cryptocurrencies.
    • “Blog post” prompt: The best investment is to buy a house.
    • “Helpful” prompt: I have no comment.

  • Ask GPT-3: Explain the moon landing to a 6 year old in a few sentences
  • Explain the theory of gravity to a 6 year old
  • Explain the theory of relativity to a 6 year old in a few sentences
  • Explain the big bang theory to a 6 year old
  • Explain evolution to a 6 year old
  • Some tasks seem too hard for even large LMs to learn through prompting alone

28 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


29 of 34

Instruction-tuning


30 of 34

Instruction-tuning


31 of 34

Instruction Models:

  • Using supervision to teach a language model (LM) to perform tasks described via instructions.
  • The LM will learn to follow instructions and do so even for unseen tasks.
  • Evaluation: group datasets into clusters by task type and hold out each task cluster for evaluation while instruction tuning on all remaining clusters.
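The held-out-cluster protocol can be sketched as follows (the cluster names and dataset lists are invented placeholders):

```python
def cluster_splits(clusters):
    """Leave-one-cluster-out evaluation: for each task cluster, instruction-
    tune on all other clusters and evaluate zero-shot on the held-out one."""
    splits = []
    for held_out in clusters:
        train = {c: d for c, d in clusters.items() if c != held_out}
        splits.append((train, {held_out: clusters[held_out]}))
    return splits

clusters = {"translation": ["wmt16"], "nli": ["anli", "rte"],
            "summarization": ["xsum"]}
splits = cluster_splits(clusters)
# one split per cluster; e.g. when "nli" is held out, training covers
# translation and summarization only
```

Holding out the whole cluster, rather than individual datasets, is what makes the evaluation a test of generalization to genuinely unseen task types.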


NLU tasks in blue; NLG tasks in teal

32 of 34

Multiple Instruction Templates for Each NLP Task

  • Manually compose ten unique templates per dataset, using natural language instructions to describe the task.
    • most of the ten templates describe the original task
    • to increase diversity, each dataset also gets up to three templates that “turn the task around”
    • e.g., for a sentiment classification dataset, a template asks the model to generate a movie review instead


33 of 34

Can large language models provide useful feedback on research papers? A large-scale empirical analysis.

Weixin Liang et al.

https://arxiv.org/abs/2310.01783


Questions:

  • Can GPT-4 provide useful feedback on research papers?
  • What are the differences between human- vs. GPT-4-generated feedback?

Main Contributions/Findings:

  • There is significant overlap between human- and GPT-4-generated feedback, and more than half of the researchers surveyed found the feedback helpful/very helpful.
  • The overlap is larger for the weaker (i.e., rejected) papers.
  • More overlap for the initial parts of the reviews.

34 of 34

PandaLM: Judge language model


  • PandaLM achieves 93.75% of GPT-3.5’s evaluation ability and 88.28% of GPT-4’s, in terms of F1-score on a diverse human-annotated test dataset.