1 of 43

New abilities in big language models

Jason Wei

Google Brain


2 of 43

Some motivation for studying big language models

  • I care about general methods.


  • Every neural network has a model scale…

Research question: What abilities emerge by scaling up language models?

What can big models do that small models can’t?

3 of 43

Talk outline: two new abilities of scale

  1. Language models follow instructions.
    1. Finetuned language models are zero-shot learners (ICLR 2022). {J. Wei, M. Bosma, V. Zhao, K. Guu}, A. Yu, B. Lester, N. Du, A. Dai, & Q. Le.
  2. Language models do chain of thought reasoning.
    • Chain of thought prompting elicits reasoning in large language models (2022). J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, & D. Zhou.


4 of 43

Finetuned language models are zero-shot learners.


Jason Wei

Maarten Bosma

Vincent Zhao

Adams Yu

Kelvin Guu

Andrew Dai

Quoc Le

Nan Du

Brian Lester

5 of 43

Summary


Finetuned language models are zero-shot learners (ICLR 2022).

{J. Wei, M. Bosma, V. Zhao, K. Guu}, A. Yu, B. Lester, N. Du, A. Dai, & Q. Le.

6 of 43

FLAN demo


7 of 43

FLAN outline

  • Background and motivation
  • Training FLAN & experimental setup
  • Results on various tasks, ablation studies


8 of 43

Motivation


Pretraining objective: language modeling.

Downstream inference: an NLP task, reached via few-shot prompting, prompt tuning, ...

Example prompt: “This movie sucks.” This movie review is {negative, positive}.


9 of 43

Can we use “a little bit” of supervision to teach the model to perform many NLP tasks?

i.e., zero-shot!

Instruction tuning


10 of 43

“Instruction tuning”—finetuning a language model on a collection of tasks described via instructions—improves the zero-shot performance of language models on unseen tasks.


11 of 43

NLP tasks and datasets

  • 62 NLP datasets
  • 12 “task clusters”


12 of 43

Templates

We generate many natural-language instruction templates for each task.
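To make the idea concrete, here is a minimal sketch of what multiple instruction templates for a single task might look like. The template strings, the `instantiate` helper, and the NLI example below are all illustrative, not FLAN's actual templates.

```python
# Illustrative instruction templates for one task (NLI); each renders the
# same example with different natural-language phrasing.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?",
    "Read the following and determine if the hypothesis follows:\n"
    "Premise: {premise}\nHypothesis: {hypothesis}",
]

def instantiate(example: dict) -> list[str]:
    """Render every template for one dataset example."""
    return [t.format(**example) for t in TEMPLATES]

prompts = instantiate({
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outdoors.",
})
```

Each dataset example thus yields several differently-worded training prompts, which is what exposes the model to instructions rather than a single fixed format.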


13 of 43

Evaluation splits

We evaluate on “unseen” / “zero-shot” tasks where no datasets from that task were seen during instruction tuning.


14 of 43

Classification with “options”

For classification tasks, we teach FLAN to return one of several “options”
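A minimal sketch of the idea, assuming a simple suffix format (the `with_options` helper and the exact "OPTIONS" layout are illustrative, not FLAN's verbatim format):

```python
# Sketch: append an "OPTIONS" suffix so the model's answer space is
# constrained to the valid class labels for this task.
def with_options(prompt: str, options: list[str]) -> str:
    return prompt + "\n\nOPTIONS:\n" + "\n".join(f"- {o}" for o in options)

p = with_options(
    'Movie review: "This movie sucks."\n'
    "Is the sentiment of this review positive or negative?",
    ["positive", "negative"],
)
```

Listing the options in the prompt tells the model which output strings count as answers, instead of leaving the label vocabulary implicit.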


15 of 43

FLAN Training details

  • 137B parameter pretrained checkpoint (LaMDA-PT)
  • Instruction tune for 30k steps on 62 datasets spanning 12 task clusters


16 of 43

Summary of results

  • 25 datasets spanning NLI, reading comprehension, closed-book QA, commonsense reasoning, coreference resolution, and translation
  • Baselines: LaMDA-PT, GPT-3 175B


FLAN almost always outperforms LaMDA-PT

On 20 of 25 tasks, zero-shot FLAN outperforms zero-shot GPT-3

On 10 tasks, zero-shot FLAN even outperforms few-shot GPT-3


17 of 43

Results: NLI, reading comprehension, closed-book QA, translation


18 of 43

Ablation study: number of instruction tuning clusters

Adding additional task clusters to instruction tuning improves zero-shot performance on held-out task clusters.


19 of 43

Ablation study: scaling laws

Performance on unseen tasks improves only with sufficient model scale.


20 of 43

Ablation: templates per task

Curiously, more templates per dataset did not help much.


21 of 43

Ablation: natural language instructions

Natural language instructions are crucial to successful zero-shot learning.

Conditions compared (avg. zero-shot performance on 4 task clusters):

  • FT: No instruction / Eval: Instruction
  • FT: Task name / Eval: Instruction
  • FT: Task name / Eval: Task name
  • FT: Instruction / Eval: Instruction (= FLAN)


22 of 43


23 of 43

FLAN Summary

  • Finetuning a language model on a collection of tasks allows it to follow instructions for a new task.
  • This instruction-tuned language model has better zero-shot performance.
  • Number of instruction tuning clusters and model scale are crucial.


24 of 43

Questions about FLAN?


25 of 43

Chain of thought prompting elicits reasoning in large language models.


Jason Wei

Maarten Bosma

Quoc Le

Xuezhi Wang

Dale Schuurmans

Ed Chi

Denny Zhou

26 of 43

Motivation


“Type 1” tasks — input → Language Model → output:

  • “This movie sucks!” → Sentiment = negative
  • “Novak Djokovic wins the 2021 French Open.” → Topic = Tennis
  • “Thank you!” → Translation: 谢谢!

“Type 2” tasks:

  • Take the last letters of the words in "Elon Musk" and concatenate them. → “uk”
  • The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

27 of 43

Chain of thought reasoning

Inspiration (from meditation): consider your inner monologue, which comprises a continuous stream of thoughts (imagery, sounds, sensations, emotions, language).


this project is really cool … wow I am sleepy … is it lunchtime yet … where is Maarten

Chain of thought = directed / intentional stream of thought


28 of 43

Example of chain of thought

Question: Shawn has 5 toys. He gets 2 more each from his mom and dad. How many toys does he have now?


Chain of thought: Shawn started with 5 toys. 2 toys each from his mom and dad is 4 more toys. The final answer is 5+4=9.
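In prompting terms, a chain-of-thought exemplar is just a question whose answer spells out the intermediate reasoning before the final answer. A minimal sketch of assembling such a prompt (the `cot_prompt` helper and the single-exemplar format are illustrative; the paper uses several exemplars):

```python
# Sketch: a few-shot chain-of-thought prompt concatenates worked
# exemplars, then appends the new question with an open "A:".
EXEMPLAR = (
    "Q: Shawn has 5 toys. He gets 2 more each from his mom and dad. "
    "How many toys does he have now?\n"
    "A: Shawn started with 5 toys. 2 toys each from his mom and dad is "
    "4 more toys. The final answer is 5 + 4 = 9.\n"
)

def cot_prompt(question: str) -> str:
    return EXEMPLAR + "\nQ: " + question + "\nA:"

prompt = cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
```

The model then continues from "A:", and the exemplar's reasoning style encourages it to produce its own intermediate steps before the answer.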


29 of 43


30 of 43

GPT-3 demo


31 of 43


32 of 43


Last Letter Concatenation

Q: Take the last letters of the words in "Elon Musk" and concatenate them.

A: The last letter of “Elon” is “n”. The last letter of “Musk” is “k”. Concatenating them is “nk”. So the answer is nk.
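The task itself is trivial to compute symbolically, which makes it easy to check a model's chain-of-thought answer programmatically. A minimal sketch (the `last_letter_concat` name is mine, not from the paper):

```python
# Sketch: reference implementation of the last-letter-concatenation
# task, for verifying model outputs.
def last_letter_concat(phrase: str) -> str:
    return "".join(word[-1] for word in phrase.split())

answer = last_letter_concat("Elon Musk")  # "n" + "k"
```

The exemplar's answer "nk" matches this reference, whereas the "uk" in the earlier motivation slide is the kind of wrong guess the task is designed to catch.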


33 of 43


Reverse List

Q: Reverse the sequence "glasses, pen, alarm, license".

A: First is glasses. Second is pen. Third is alarm. Fourth is license. Now to reverse, change the order to: Fourth is license. Third is alarm. Second is pen. First is glasses. So the answer is "license, alarm, pen, glasses".

Coin Flip

Q: A coin is heads up. Maybelle flips the coin. Shalonda does not flip the coin. Is the coin still heads up?

A: The coin was flipped by Maybelle. So the coin was flipped 1 time, which is an odd number. The coin started heads up, so after an odd number of flips, it will be tails up. So the answer is no.
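The coin-flip task reduces to flip-count parity, which is exactly the reasoning the chain of thought spells out in words. A minimal sketch (the `still_heads_up` helper is mine, for illustration):

```python
# Sketch: the coin starts heads up; it ends heads up iff the number of
# flips is even.
def still_heads_up(flips: list[bool]) -> bool:
    # Each True entry means that person flipped the coin.
    return sum(flips) % 2 == 0

# Maybelle flips, Shalonda does not: 1 flip (odd), so not heads up.
result = still_heads_up([True, False])
```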


34 of 43


35 of 43

Why chain of thought?

  • Variable computation.
  • Human-readable reasoning.
  • Natural language is broadly applicable.
  • Works with just prompting!


36 of 43


37 of 43

Things language models do, in order of emergence:

  • next word prediction, etc. (2019 and before)
  • few-shot learning (GPT-3, 2020)
  • instruction tuning enables zero-shot; chain of thought prompting (this talk)
  • ??? (what gets me up in the morning)

38 of 43

Thanks.

jasonwei@google.com


39 of 43

Backup slides


40 of 43

Further analysis: prompt tuning

FLAN responds better to the continuous inputs from prompt tuning than the base LM does.


41 of 43

Further analysis: few-shot prompting

Few-shot prompting is a complementary way of improving performance on top of instruction tuning.

Example of few-shot prompt:


Does the following review have a positive or negative opinion of the movie?

<review>

Negative.

Does the following review have a positive or negative opinion of the movie?

<review>

Positive.

Does the following review have a positive or negative opinion of the movie?

<review>
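The prompt above can be assembled mechanically. A minimal sketch, with the `few_shot_prompt` helper and its exact joining format being illustrative assumptions (the `<review>` placeholders stand in for real review text, as on the slide):

```python
# Sketch: a few-shot prompt concatenates labeled exemplars, then the
# unlabeled query, each preceded by the same instruction.
def few_shot_prompt(
    instruction: str,
    exemplars: list[tuple[str, str]],
    query: str,
) -> str:
    parts = [f"{instruction}\n{x}\n{y}" for x, y in exemplars]
    parts.append(f"{instruction}\n{query}")  # query left unanswered
    return "\n\n".join(parts)

prompt = few_shot_prompt(
    "Does the following review have a positive or negative opinion "
    "of the movie?",
    [("<review>", "Negative."), ("<review>", "Positive.")],
    "<review>",
)
```

The model completes the final, unanswered instance, with the labeled exemplars establishing the expected output format.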


42 of 43


Input → T0pp Output:

  • Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change to past tense.
    → Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners."
  • Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change the verb to eat.
    → eat
  • Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners." Change to passive voice.
    → Jason Wei is reading the paper "Finetuned Language Models are Zero-Shot Learners."
  • Recommend activities to do on a sunny weekend in Mountain View.
    → Mountain View, California
  • Generate utterances with the intent "get COVID vaccine".
    → A nurse is giving a child a COVID vaccine.

43 of 43

Further analysis: data contamination

We do not find evidence that overlap between evaluation examples and the pretraining data affects FLAN's performance.
