New abilities in big language models
Jason Wei
Google Brain
Some motivation about big language models
Research question: What abilities emerge by scaling up language models?
What can big models do that small models can’t?
Talk outline: two new abilities of scale
Finetuned language models are zero-shot learners.
Jason Wei
Maarten Bosma
Vincent Zhao
Adams Yu
Kelvin Guu
Andrew Dai
Quoc Le
Nan Du
Brian Lester
Summary
Finetuned language models are zero-shot learners (ICLR 2022).
{J. Wei, M. Bosma, V. Zhao, K. Guu}, A. Yu, B. Lester, N. Du, A. Dai, & Q. Le.
FLAN demo
FLAN outline
Motivation
Pretraining objective (language modeling) vs. downstream inference (NLP task); existing ways to connect them:
- Few-shot prompting
- Prompt tuning
- ...
Example: “This movie sucks.” This movie review is {negative, positive}.
Can we use “a little bit” of supervision to teach the model to perform many NLP tasks zero-shot, i.e., with no task-specific examples at inference time?
Instruction tuning
“Instruction tuning”—finetuning a language model on a collection of tasks described via instructions—improves the zero-shot performance of language models on unseen tasks.
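As a rough sketch of the data side of instruction tuning (the template strings and field names below are illustrative placeholders, not FLAN's actual code), each raw labeled example is rendered into an (instruction, target) pair via a randomly chosen template:

```python
import random

# Hypothetical instruction templates for a sentiment dataset; FLAN
# composes several such natural-language templates per dataset.
TEMPLATES = [
    "Is the sentiment of the following review positive or negative?\n{text}",
    "{text}\nDid the reviewer like the movie? Answer positive or negative.",
]

def to_instruction_example(example):
    """Render a raw {text, label} example as an (instruction, target) pair."""
    template = random.choice(TEMPLATES)
    return template.format(text=example["text"]), example["label"]

prompt, target = to_instruction_example(
    {"text": "This movie sucks.", "label": "negative"})
```

The finetuning loss is then the ordinary language-modeling loss on the target given the rendered instruction.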
NLP tasks and datasets
Templates
We generate many natural instruction templates for each task
Evaluation splits
We evaluate on “unseen” / “zero-shot” tasks where no datasets from that task were seen during instruction tuning.
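The split can be sketched as holding out an entire task cluster at a time (the cluster and dataset names below are illustrative placeholders, not the full FLAN mixture):

```python
# Illustrative task clusters; FLAN groups datasets by task type and, for
# each evaluation cluster, instruction-tunes only on the other clusters.
CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "translation": ["wmt16_en_de"],
}

def train_eval_split(held_out_cluster):
    """Datasets to instruction-tune on vs. evaluate on, with one cluster held out."""
    eval_datasets = CLUSTERS[held_out_cluster]
    train_datasets = [d for cluster, datasets in CLUSTERS.items()
                      if cluster != held_out_cluster for d in datasets]
    return train_datasets, eval_datasets
```

Holding out whole clusters (rather than individual datasets) is what makes the evaluation genuinely zero-shot at the task level.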
Classification with “options”
For classification tasks, we teach FLAN to return one of several “options”
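One way to realize this, sketched below with a stand-in scoring function in place of the LM's log-likelihood, is to append the options to the prompt and pick the highest-scoring one:

```python
def format_with_options(question, options):
    """Append an OPTIONS suffix so the model knows the choice set."""
    return question + "\nOPTIONS:\n" + "\n".join(f"- {o}" for o in options)

def classify(score_fn, question, options):
    """Return the option score_fn rates highest for this prompt.
    score_fn is a stand-in for the LM's log-probability of a continuation."""
    prompt = format_with_options(question, options)
    return max(options, key=lambda option: score_fn(prompt, option))
```

With a real model, score_fn would compute the log-likelihood of each option string as a continuation of the prompt.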
FLAN Training details
Summary of results
FLAN almost always outperforms its base model, LaMDA-PT
On 20 of 25 tasks, zero-shot FLAN outperforms zero-shot GPT-3
On 10 tasks, zero-shot FLAN even outperforms few-shot GPT-3
Results: NLI, reading comprehension, closed-book QA, translation
Ablation study: number of instruction tuning clusters
Adding additional task clusters to instruction tuning improves zero-shot performance on held-out task clusters.
Ablation study: scaling laws
Instruction tuning improves performance on unseen tasks only at sufficient model scale.
Ablation: templates per task
Curiously, more templates per dataset did not help much.
Ablation: natural language instructions
Natural language instructions are crucial to successful zero-shot learning.
Conditions compared (avg. zero-shot performance on 4 task clusters):
- FT: no instruction, Eval: instruction
- FT: task name, Eval: instruction
- FT: task name, Eval: task name
- FT: instruction, Eval: instruction (= FLAN)
FLAN Summary
Questions about FLAN?
Chain of thought prompting elicits reasoning in large language models.
Jason Wei
Maarten Bosma
Quoc Le
Xuezhi Wang
Dale Schuurmans
Ed Chi
Denny Zhou
Motivation
Chain of thought prompting elicits reasoning in large language models (2022).
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, & D. Zhou.
“type 1”: tasks a pretrained language model already handles when prompted directly:
- “This movie sucks!” → Sentiment = negative
- “Novak Djokovic wins the 2021 French Open.” → Topic = Tennis
- “Thank you!” → Translation: 谢谢!
“type 2”: tasks that standard prompting does not solve:
- Take the last letters of the words in "Elon Musk" and concatenate them. → model output: “uk” (incorrect; the answer is “nk”)
- The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
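The cafeteria problem has a short worked check (plain arithmetic, not a model output):

```python
# 23 apples to start, 20 used for lunch, 6 bought afterwards.
apples = 23 - 20 + 6  # 3 + 6 = 9
```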
Chain of thought reasoning
Inspiration (from meditation): consider your inner monologue, which comprises a continuous stream of thoughts (imagery, sounds, sensations, emotions, language).
this project is really cool … wow I am sleepy … is it lunchtime yet … where is Maarten
Chain of thought = directed / intentional stream of thought
Example of chain of thought
Question: Shawn has 5 toys. He gets 2 more each from his mom and dad. How many toys does he have now?
Chain of thought: Shawn started with 5 toys. 2 toys each from his mom and dad is 4 more toys. The final answer is 5+4=9.
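In chain-of-thought prompting, worked exemplars like this one are concatenated ahead of a new question; a minimal prompt builder (the formatting details here are illustrative, not the paper's exact prompts):

```python
# One worked exemplar: question plus its chain of thought and answer.
EXEMPLAR = (
    "Q: Shawn has 5 toys. He gets 2 more each from his mom and dad. "
    "How many toys does he have now?\n"
    "A: Shawn started with 5 toys. 2 toys each from his mom and dad is "
    "4 more toys. The final answer is 5 + 4 = 9.\n\n"
)

def build_cot_prompt(question, exemplars=(EXEMPLAR,)):
    """Prepend worked exemplars, then leave the new answer open-ended."""
    return "".join(exemplars) + f"Q: {question}\nA:"
```

The model, conditioned on this prompt, tends to produce its own chain of thought before the final answer.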
GPT-3 demo
Last Letter Concatenation
Q: Take the last letters of the words in "Elon Musk" and concatenate them.
A: The last letter of “Elon” is “n”. The last letter of “Musk” is “k”. Concatenating them is “nk”. So the answer is nk.
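The task itself is easy to verify programmatically; a reference solver matching the chain of thought above:

```python
def last_letter_concat(phrase):
    """Concatenate the last letter of each word: the last letter of
    "Elon" is "n", of "Musk" is "k", so "Elon Musk" -> "nk"."""
    return "".join(word[-1] for word in phrase.split())
```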
Reverse List
Q: Reverse the sequence "glasses, pen, alarm, license".
A: First is glasses. Second is pen. Third is alarm. Fourth is license. Now to reverse, change the order to: Fourth is license. Third is alarm. Second is pen. First is glasses. So the answer is "license, alarm, pen, glasses".
Coin Flip
Q: A coin is heads up. Maybelle flips the coin. Shalonda does not flip the coin. Is the coin still heads up?
A: The coin was flipped by Maybelle. So the coin was flipped 1 time, which is an odd number. The coin started heads up, so after an odd number of flips, it will be tails up. So the answer is no.
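Both symbolic tasks have simple reference solvers that mirror the reasoning in the chains of thought:

```python
def reverse_sequence(items):
    """Reverse List: return the items in reverse order."""
    return list(reversed(items))

def still_heads_up(flips):
    """Coin Flip: the coin starts heads up; it stays heads up iff the
    number of actual flips is even (the parity argument above)."""
    return sum(flips) % 2 == 0
```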
Why chain of thought?
Variable computation.
Human-readable reasoning.
Natural language is broadly applicable.
Works with just prompting!
Things language models do:
- next word prediction, etc. (2019 and before)
- few-shot learning (GPT-3, 2020)
- instruction tuning enables zero-shot, chain-of-thought prompting (this talk)
- ??? (what gets me up in the morning)
Thanks.
jasonwei@google.com
Backup slides
Further analysis: prompt tuning
FLAN responds better to continuous inputs from prompt tuning than the base LM does.
Further analysis: few-shot prompting
Few-shot prompting is complementary to instruction tuning and further improves performance.
Example of few-shot prompt:
Does the following review have a positive or negative opinion of the movie?
<review>
Negative.
Does the following review have a positive or negative opinion of the movie?
<review>
Positive.
Does the following review have a positive or negative opinion of the movie?
<review>
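A prompt like the one above can be assembled mechanically; a minimal builder (the question wording follows the slide, the helper itself is hypothetical):

```python
QUESTION = ("Does the following review have a positive or negative "
            "opinion of the movie?")

def few_shot_prompt(exemplars, query):
    """Repeat (review, label) blocks, then leave the final label blank."""
    blocks = [f"{QUESTION}\n{review}\n{label}." for review, label in exemplars]
    blocks.append(f"{QUESTION}\n{query}\n")
    return "\n\n".join(blocks)
```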
Input | T0pp output
Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change to past tense. | Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners."
Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change the verb to eat. | eat
Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change to passive voice. | Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners."
Recommend activities to do on a sunny weekend in Mountain View. | Mountain View, California
Generate utterances with the intent "get COVID vaccine". | A nurse is giving a child a COVID vaccine.
Further analysis: data contamination
We do not find evidence that overlap between evaluation examples and the pretraining data affects FLAN's performance.