1 of 37

Scaling unlocks emergent abilities in language models

Jason Wei

Google Brain

2 of 37

Outline

  • Emergent abilities of large language models.
    • Inverse scaling can become U-shaped.
  • Chain-of-thought prompting elicits reasoning in large language models.
    • Challenging BIG-Bench tasks and whether chain-of-thought can solve them.
    • Language models are multilingual chain-of-thought reasoners.
    • Self-consistency improves chain-of-thought reasoning in language models.
  • Feel free to interrupt anytime with questions :)

4 of 37

Predictable gains as a result of scaling

Emergent abilities of large language models (TMLR ‘22).

J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, & W. Fedus.

5 of 37

Emergence in science

  • Emergence: “a qualitative change that arises from quantitative changes”

6 of 37

Definition: emergent abilities in large language models

An ability is emergent if it is not present in smaller models but is present in larger models (a toy version of this criterion is sketched below).

  • How to measure the “size” of the model?
    • Training FLOPs
    • Number of model parameters
    • Training dataset size
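
To make the definition concrete, here is a minimal sketch (mine, not from the paper) of how one might flag a task as emergent from a scaling curve; the tolerance and margin thresholds are illustrative assumptions.

```python
# Toy emergence check (illustrative; thresholds are assumptions, not
# from the paper). A task is flagged "emergent" if accuracy is near the
# random-chance baseline at the smallest scale but clearly above it at
# the largest scale.

def is_emergent(accuracies, random_baseline, tol=0.02, margin=0.10):
    """accuracies: task accuracy at each model scale, sorted by
    ascending scale (e.g., training FLOPs)."""
    small, large = accuracies[0], accuracies[-1]
    near_random_when_small = abs(small - random_baseline) <= tol
    above_random_when_large = large >= random_baseline + margin
    return near_random_when_small and above_random_when_large

# Example: a 4-way multiple-choice task (random baseline = 0.25).
print(is_emergent([0.25, 0.26, 0.24, 0.40, 0.62], random_baseline=0.25))  # True
```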

7 of 37

Emergence in few-shot prompting

> A few-shot prompted task is emergent if it performs at random-chance accuracy for small models and above-random accuracy for large models.

8 of 37

Emergence in few-shot prompting

9 of 37

Emergence in few-shot prompting

10 of 37

Emergence in few-shot prompting

Input (English): The 1931 Malay census was an alarm bell.

Target (IPA): ðə 1931 ˈmeɪleɪ ˈsɛnsəs wɑz ən əˈlɑrm bɛl.
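
As a concrete illustration, here is a minimal sketch of how a few-shot prompt for this transliteration task might be assembled; the prompt format and the exemplar pair are assumptions for illustration, and only the final query is from the slide.

```python
# Assembling a k-shot prompt (format is an assumption; the exemplar
# below is hypothetical, not taken from the benchmark).

exemplars = [
    ("English: The dog ran.", "IPA: ðə dɔɡ ræn."),  # hypothetical pair
]
query = "English: The 1931 Malay census was an alarm bell."

prompt = "\n\n".join(f"{src}\n{tgt}" for src, tgt in exemplars)
prompt += f"\n\n{query}\nIPA:"
print(prompt)  # feed this to a text-completion model
```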

11 of 37

Inverse scaling can become U-shaped

Small language model → “glib” (correct)

Medium language model → “gold” (incorrect: inverse scaling)

Large language model → “glib” (correct again: U-shaped scaling)

Inverse scaling can become U-shaped, 2022.

J. Wei, Y. Tay, & Q. Le.

12 of 37

Emergent prompting techniques

A prompting technique is emergent if it hurts performance relative to the baseline for small models but improves on the baseline for large models.

(Example: RLHF hurts performance for small models but helps performance for large models.)

> Later: chain-of-thought prompting as an emergent prompting technique.

14 of 37

Emergence: better data

Training smaller models on better data can also lead to emergence, even when larger models trained on worse data do not show the behavior.

(Example: PaLM 62B exhibits emergent abilities on tasks where the larger LaMDA and GPT-3 models do not.)

15 of 37

Emergence: better data

Better (in-domain) data makes a big difference when compute, model parameters, and dataset size are fixed

(Setup: small BERT models pre-trained from scratch, task is subject-verb agreement)

16 of 37

Emergence: finetuning for desired behaviors

Desired behaviors can be induced in smaller models via finetuning and RLHF.

(Figure: small RLHF models outperform a large model that uses weaker techniques.)

17 of 37

Emergence: measure of model “scale”

What’s the right x-axis for emergence?

Scale can be viewed through training FLOPs, model parameters, or WikiText-103 perplexity.

18 of 37

Emergence: surpassing finetuning

Sociological change in the AI community: finetuned task-specific models are being outperformed by few-shot prompted large models.

19 of 37

Summary of emergence:

  • Emergent abilities can only be observed in large models.
    • Their emergence cannot be predicted from scaling plots of smaller models alone.

Reflection:

  • A framing for viewing abilities that are not intentionally built in.
    • Subtext: a case for why we should keep scaling, since these abilities are hard to find otherwise.
  • Tension between emergence (task-general; bigger models) and many production tasks (task-specific; compute constraints; in-domain data).
  • Not much work yet on predicting future emergence.
    • Why? Perhaps it is too hard, or the answers are too task-specific; Anthropic may be working on it.

20 of 37

Ed Chi

Quoc Le

Olivier Bousquet

Dale Schuurmans

Fei Xia

Aakanksha Chowdhery

Sharan Narang

Maarten Bosma

Nathan Scales

Brian Ichter

Le Hou

Nathanael Schärli

Denny Zhou

Xuezhi Wang

Jason Wei

21 of 37

CoT paper

Motivation:

  • Enable language models to do more complicated tasks.
  • Guide them with “metadata” (i.e., a reasoning process); a sketch follows below.
  • Prompts are manually composed (prompt engineering helps).
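
A minimal sketch of the idea, assuming a generic text-completion API (`call_model` is a placeholder, not a real function): each few-shot exemplar includes intermediate reasoning before the final answer, which nudges the model to emit its own reasoning for the test question. The exemplar text is the paper's running example.

```python
# Chain-of-thought prompting sketch. The exemplar includes the reasoning
# steps ("Roger started with 5 balls...") rather than just the answer.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

# completion = call_model(cot_prompt("Janet's ducks lay 16 eggs per day ..."))
```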

Chain-of-thought prompting elicits reasoning in large language models (NeurIPS ‘22).

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, & D. Zhou.

22 of 37

CoT demo

23 of 37

CoT paper

(Figure: GSM8K and StrategyQA results, comparing CoT prompting against the finetuned SOTA at the time, with human performance for reference.)

Real model output #1:

Q: Can you hide a basketball in a sand cat's ear?

A: A basketball is about 30 cm in diameter. A sand cat's ear is about 2 cm in diameter. Thus, a basketball would not fit in a sand cat's ear. So the answer is no.

24 of 37

CoT on BIG-Bench: Benchmark

BIG-Bench Hard (BBH):

  • 23 challenging tasks from the BIG-Bench benchmark on which no model had beaten the average human rater

Challenging BIG-Bench tasks and whether chain-of-thought can solve them.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. Chi, D. Zhou, & J. Wei.

25 of 37

CoT on BIG-Bench: Result summary

Detail: better formatting (answer options, task description) already beats the prior best, but the model is still much lower than the average human rater.

CoT prompting improves performance by +16.7% and passes the average human rater on a majority of tasks.

26 of 37

CoT on BIG-Bench: Scaling

  • CoT requires sufficient model scale for a positive delta.
  • In aggregate, the threshold is around davinci-002 / PaLM 62B.

27 of 37

CoT on BIG-Bench: Emergence

  • No-CoT performance is flat, i.e., hasn’t unlocked emergence _yet_ ;)
  • CoT unlocks emergent performance

28 of 37

Multilingual chain-of-thought prompting

  • Manually translated 250 examples from GSM8K into 10 languages.
  • Prompt the model with Bengali math problems and Bengali reasoning.
  • This input is highly improbable (Bengali is 0.01% of the pre-training data).
  • Expectation: performance correlates with a language’s frequency in the pre-training data.
  • Underrepresented languages did surprisingly well.
  • Implication: a nice demonstration of the model’s compositionality.

Language models are multilingual chain-of-thought reasoners (ICLR ‘23).

F. Shi*, M. Suzgun*, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, & J. Wei. (*Equal contribution.)

29 of 37

Why does scaling up improve chain-of-thought?

31 of 37

Self-consistency: majority vote

Prompt with an example chain of thought:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 per egg. How much does she make every day?

A:

Sample decodes from the language model with diverse reasoning paths:

1. She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day. The answer is $18.

2. She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes muffins, so she has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18. The answer is $18.

3. This means she uses 3 + 4 = 7 eggs every day. So in total she sells 7 * $2 = $14 per day. The answer is $14.

Majority vote on the answers: the answer is $18.
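
A minimal sketch of the procedure, under assumptions: `sample_completion` stands in for a stochastic, temperature-sampled call to the model, and the answer-extraction regex assumes the “The answer is ...” format used in the prompts above.

```python
# Self-consistency sketch: sample several reasoning paths, extract each
# final answer, and return the most common one (majority vote).

import re
from collections import Counter

def extract_answer(completion: str) -> str | None:
    m = re.search(r"The answer is \$?(\d+(?:\.\d+)?)", completion)
    return m.group(1) if m else None

def self_consistency(prompt: str, sample_completion, n: int = 40) -> str:
    answers = []
    for _ in range(n):  # each call samples a different reasoning path
        ans = extract_answer(sample_completion(prompt, temperature=0.7))
        if ans is not None:
            answers.append(ans)
    return Counter(answers).most_common(1)[0][0]  # majority vote
```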

Self-consistency improves chain-of-thought reasoning in language models (ICLR ‘23).

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, & D. Zhou.

32 of 37

Self-consistency: results

A simple trick, but a big performance delta.

33 of 37

Self-consistency: emergence

Self-consistency doesn’t work for small models, but can help a lot for large models

34 of 37

Chain-of-thought: Discussion

  • A framework for “more-complicated” prompting.
    • What’s the best way to get a language model to do a task? Few-shot prompting is, loosely, thinking by analogy from machine learning on (x, y) pairs.
  • Limitation: few-shot CoT is task-specific and requires manual prompt engineering.
  • Given the explosion of tasks solved by LMs, we should be more open-minded about what tasks will be solved in the next 1-2 years.

35 of 37

Conclusions of talk

  • Language models acquire emergent abilities as they get scaled up (emergent abilities survey).
  • The ability of language models to do multi-step reasoning emerges with scale, unlocking new tasks (chain-of-thought and follow-up work).
  • There are reasons to believe that language models will continue to get bigger and better.
    • Even more new abilities may emerge :)

36 of 37

Looking forward (just my personal interests)

  • Scaling
  • Better prompting and characterization of language model abilities
  • Applied work (therapy, creative writing, science)
  • Benchmarks
  • Compute-efficient methods for better language models

37 of 37

Thanks.