1 of 37

Scaling unlocks emergent abilities in language models

Jason Wei

Google Brain

2 of 37

Outline

  • Emergent abilities of large language models.
    • Inverse scaling can become U-shaped.
  • Chain-of-thought prompting elicits reasoning in large language models.
    • Challenging BIG-Bench tasks and whether chain-of-thought can solve them.
    • Language models are multilingual chain-of-thought reasoners.
    • Self-consistency improves chain-of-thought reasoning in language models.
  • Feel free to interrupt anytime with questions :)

4 of 37

Predictable gains as a result of scaling

Emergent abilities of large language models (TMLR ‘22).

J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, & W. Fedus.

5 of 37

Emergence in science

  • Emergence: “a qualitative change that arises from quantitative changes”

6 of 37

Definition: emergent abilities in large language models

An ability is emergent if it is not present in smaller models but is present in larger models (a toy version of this criterion is sketched below).

  • How to measure the “size” of the model?
    • Training FLOPs
    • Number of model parameters
    • Training dataset size
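
To make the definition concrete, here is a minimal sketch (mine, not from the paper) of how one might flag a task as emergent from a scaling curve; the tolerance and margin thresholds are illustrative assumptions.

```python
# Toy emergence check (illustrative; thresholds are assumptions, not
# from the paper). A task is flagged "emergent" if accuracy is near the
# random-chance baseline at the smallest scale but clearly above it at
# the largest scale.

def is_emergent(accuracies, random_baseline, tol=0.02, margin=0.10):
    """accuracies: task accuracy at each model scale, sorted by
    ascending scale (e.g., training FLOPs)."""
    small, large = accuracies[0], accuracies[-1]
    near_random_when_small = abs(small - random_baseline) <= tol
    above_random_when_large = large >= random_baseline + margin
    return near_random_when_small and above_random_when_large

# Example: a 4-way multiple-choice task (random baseline = 0.25).
print(is_emergent([0.25, 0.26, 0.24, 0.40, 0.62], random_baseline=0.25))  # True
```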

7 of 37

Emergence in few-shot prompting

> A few-shot prompted task is emergent if it performs at random-chance accuracy for small models and above-random accuracy for large models.

8 of 37

Emergence in few-shot prompting

9 of 37

Emergence in few-shot prompting

10 of 37

Emergence in few-shot prompting

Input (English): The 1931 Malay census was an alarm bell.

Target (IPA): ðə 1931 ˈmeɪleɪ ˈsɛnsəs wɑz ən əˈlɑrm bɛl.
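
As a concrete illustration, here is a minimal sketch of how a few-shot prompt for this transliteration task might be assembled; the prompt format and the exemplar pair are assumptions for illustration, and only the final query is from the slide.

```python
# Assembling a k-shot prompt (format is an assumption; the exemplar
# below is hypothetical, not taken from the benchmark).

exemplars = [
    ("English: The dog ran.", "IPA: ðə dɔɡ ræn."),  # hypothetical pair
]
query = "English: The 1931 Malay census was an alarm bell."

prompt = "\n\n".join(f"{src}\n{tgt}" for src, tgt in exemplars)
prompt += f"\n\n{query}\nIPA:"
print(prompt)  # feed this to a text-completion model
```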

11 of 37

Inverse scaling can become U-shaped

Small language model → “glib” (correct)

Medium language model → “gold” (incorrect: inverse scaling)

Large language model → “glib” (correct again: U-shaped scaling)

Inverse scaling can become U-shaped, 2022.

J. Wei, Y. Tay, & Q. Le.

12 of 37

Emergent prompting techniques

A prompting technique is emergent if it hurts performance relative to the baseline for small models but improves on the baseline for large models.

(Example: RLHF hurts performance for small models but helps performance for large models.)

> Later: chain-of-thought prompting as an emergent prompting technique.

14 of 37

Emergence: better data

Training smaller models on better data can also lead to emergence, even when larger models trained on worse data do not show the behavior.

(Example: PaLM 62B exhibits emergent abilities on tasks where the larger LaMDA and GPT-3 models do not.)

15 of 37

Emergence: better data

Better (in-domain) data makes a big difference when compute, model parameters, and dataset size are fixed

(Setup: small BERT models pre-trained from scratch, task is subject-verb agreement)

16 of 37

Emergence: finetuning for desired behaviors

Desired behaviors can be induced in smaller models via finetuning and RLHF.

(Figure: small RLHF models outperform a large model that uses weaker techniques.)

17 of 37

Emergence: measure of model “scale”

What’s the right x-axis for emergence?

Scale can be viewed through training FLOPs, model parameters, or WikiText-103 perplexity.

18 of 37

Emergence: surpassing finetuning

Sociological change in the AI community: finetuned task-specific models are being outperformed by few-shot prompted large models.

19 of 37

Summary of emergence:

  • Emergent abilities can only be observed in large models.
    • Their emergence cannot be predicted from scaling plots of smaller models alone.

Reflection:

  • A framing for viewing abilities that are not intentionally built in.
    • Subtext: a case for why we should keep scaling, since these abilities are hard to find otherwise.
  • Tension between emergence (task-general; bigger models) and many production tasks (task-specific; compute constraints; in-domain data).
  • Not much work yet on predicting future emergence.
    • Why? Perhaps it is too hard, or the answers are too task-specific; Anthropic may be working on it.

20 of 37

Ed Chi

Quoc Le

Olivier Bousquet

Dale Schuurmans

Fei Xia

Aakanksha Chowdhery

Sharan Narang

Maarten Bosma

Nathan Scales

Brian Ichter

Le Hou

Nathanael Schärli

Denny Zhou

Xuezhi Wang

Jason Wei

21 of 37

CoT paper

Motivation:

  • Enable language models to do more complicated tasks.
  • Guide them with “metadata” (i.e., a reasoning process); a sketch follows below.
  • Prompts are manually composed (prompt engineering helps).
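
A minimal sketch of the idea, assuming a generic text-completion API (`call_model` is a placeholder, not a real function): each few-shot exemplar includes intermediate reasoning before the final answer, which nudges the model to emit its own reasoning for the test question. The exemplar text is the paper's running example.

```python
# Chain-of-thought prompting sketch. The exemplar includes the reasoning
# steps ("Roger started with 5 balls...") rather than just the answer.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

# completion = call_model(cot_prompt("Janet's ducks lay 16 eggs per day ..."))
```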

Chain-of-thought prompting elicits reasoning in large language models (NeurIPS ‘22).

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, & D. Zhou.

22 of 37

CoT demo

23 of 37

CoT paper

(Figure: GSM8K and StrategyQA results, comparing CoT prompting against the finetuned SOTA at the time, with human performance for reference.)

Real model output #1:

Q: Can you hide a basketball in a sand cat's ear?

A: A basketball is about 30 cm in diameter. A sand cat's ear is about 2 cm in diameter. Thus, a basketball would not fit in a sand cat's ear. So the answer is no.

24 of 37

CoT on BIG-Bench: Benchmark

BIG-Bench Hard (BBH):

  • 23 challenging tasks from the BIG-Bench benchmark on which no model had beaten the average human rater

Challenging BIG-Bench tasks and whether chain-of-thought can solve them.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. Chi, D. Zhou, & J. Wei.

25 of 37

CoT on BIG-Bench: Result summary

Detail: better formatting (answer options, task description) already beats the prior best, but the model is still much lower than the average human rater.

CoT prompting improves performance by +16.7% and passes the average human rater on a majority of tasks.

26 of 37

CoT on BIG-Bench: Scaling

  • CoT requires sufficient model scale for a positive delta.
  • In aggregate, the threshold is around davinci-002 / PaLM 62B.

27 of 37

CoT on BIG-Bench: Emergence

  • No-CoT performance is flat, i.e., hasn’t unlocked emergence _yet_ ;)
  • CoT unlocks emergent performance

28 of 37

Multilingual chain-of-thought prompting

  • Manually translated 250 examples from GSM8K into 10 languages.
  • Prompt the model with Bengali math problems and Bengali reasoning.
  • This input is highly improbable (Bengali is 0.01% of the pre-training data).
  • Expectation: performance correlates with a language’s frequency in the pre-training data.
  • Underrepresented languages did surprisingly well.
  • Implication: a nice demonstration of the model’s compositionality.

Language models are multilingual chain-of-thought reasoners (ICLR ‘23).

F. Shi*, M. Suzgun*, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, & J. Wei. (*Equal contribution.)

29 of 37

Why does scaling up improve chain-of-thought?

31 of 37

Self-consistency: majority vote

Prompt with an example chain of thought:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 per egg. How much does she make every day?

A:

Sample decodes from the language model with diverse reasoning paths:

1. She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day. The answer is $18.

2. She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes muffins, so she has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18. The answer is $18.

3. This means she uses 3 + 4 = 7 eggs every day. So in total she sells 7 * $2 = $14 per day. The answer is $14.

Majority vote on the answers: the answer is $18.
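
A minimal sketch of the procedure, under assumptions: `sample_completion` stands in for a stochastic, temperature-sampled call to the model, and the answer-extraction regex assumes the “The answer is ...” format used in the prompts above.

```python
# Self-consistency sketch: sample several reasoning paths, extract each
# final answer, and return the most common one (majority vote).

import re
from collections import Counter

def extract_answer(completion: str) -> str | None:
    m = re.search(r"The answer is \$?(\d+(?:\.\d+)?)", completion)
    return m.group(1) if m else None

def self_consistency(prompt: str, sample_completion, n: int = 40) -> str:
    answers = []
    for _ in range(n):  # each call samples a different reasoning path
        ans = extract_answer(sample_completion(prompt, temperature=0.7))
        if ans is not None:
            answers.append(ans)
    return Counter(answers).most_common(1)[0][0]  # majority vote
```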

Self-consistency improves chain-of-thought reasoning in language models (ICLR ‘23).

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, & D. Zhou.

32 of 37

Self-consistency: results

A simple trick, but a big performance delta.

33 of 37

Self-consistency: emergence

Self-consistency doesn’t work for small models, but can help a lot for large models

34 of 37

Chain-of-thought: Discussion

  • A framework for “more-complicated” prompting.
    • What’s the best way to get a language model to do a task? Few-shot prompting is, loosely, thinking by analogy from machine learning on (x, y) pairs.
  • Limitation: few-shot CoT is task-specific and requires manual prompt engineering.
  • Given the explosion of tasks solved by LMs, we should be more open-minded about what tasks will be solved in the next 1-2 years.

35 of 37

Conclusions of talk

  • Language models acquire emergent abilities as they get scaled up (emergent abilities survey).
  • The ability of language models to do multi-step reasoning emerges with scale, unlocking new tasks (chain-of-thought and follow-up work).
  • There are reasons to believe that language models will continue to get bigger and better.
    • Even more new abilities may emerge :)

36 of 37

Looking forward (just my personal interests)

  • Scaling
  • Better prompting and characterization of language model abilities
  • Applied work (therapy, creative writing, science)
  • Benchmarks
  • Compute-efficient methods for better language models

37 of 37

Thanks.