Scaling unlocks emergent abilities in language models
Jason Wei
Google Brain
Outline
Predictable gains as a result of scaling
Emergent abilities of large language models (TMLR ‘22).
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, & W. Fedus.
Emergence in science
Definition: emergent abilities in large language models
An ability is emergent if it is not present in smaller models but is present in larger models.
Emergence in few-shot prompting
> A few-shot prompted task is emergent if it achieves random accuracy for small models and above-random accuracy for large models.
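The definition above can be sketched as a small check: a task is flagged as emergent when every model except the largest scores at (or near) the random-guess baseline, while the largest model scores clearly above it. This is a toy illustration, not code from the paper; the accuracy numbers and tolerance are hypothetical.

```python
# Toy sketch (not from the paper): classify a few-shot task as "emergent"
# given accuracy at several model scales and the task's random-guess baseline.
# The tolerance `tol` and the curve below are hypothetical.

def is_emergent(acc_by_scale, random_baseline, tol=0.02):
    """acc_by_scale: iterable of (training_flops, accuracy) pairs.

    Emergent = all smaller models score at roughly the random baseline,
    while the largest model scores clearly above it.
    """
    accs = [acc for _, acc in sorted(acc_by_scale)]
    small_at_random = all(a <= random_baseline + tol for a in accs[:-1])
    large_above = accs[-1] > random_baseline + tol
    return small_at_random and large_above

# Hypothetical curve for a 4-way multiple-choice task (baseline 0.25)
curve = [(1e20, 0.24), (1e21, 0.26), (1e22, 0.25), (1e23, 0.55)]
print(is_emergent(curve, random_baseline=0.25))  # True
```

A smoothly improving curve (e.g. 0.24 → 0.25 → 0.26 → 0.26) would not qualify: the gains are predictable, not a phase change.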
Emergence in few-shot prompting
Emergence in few-shot prompting
Emergence in few-shot prompting
Input (English): The 1931 Malay census was an alarm bell.
Target (IPA): ðə 1931 ˈmeɪleɪ ˈsɛnsəs wɑz ən əˈlɑrm bɛl.
BIG-Bench (Srivastava et al., 2022).
Inverse scaling can become U-shaped
Small language model → “glib”
Inverse scaling can become U-shaped, 2022.
J. Wei, Y. Tay, & Q. Le.
Medium language model → “gold”
Large language model → “glib”
Emergent prompting techniques
A prompting technique is emergent if it hurts performance relative to the baseline for small models but improves performance over the baseline for large models.
RLHF hurts performance
RLHF helps performance
> later: chain-of-thought prompting as an emergent prompting technique
Emergence: better data
Emergence can also appear in smaller models trained on better data, even when larger models trained on worse data do not exhibit the ability.
PaLM 62B
LaMDA and GPT-3
Emergence: better data
Better (in-domain) data makes a big difference when compute, model parameters, and dataset size are fixed
(Setup: small BERT models pre-trained from scratch, task is subject-verb agreement)
Emergence: finetuning for desired behaviors
Desired behaviors can be induced in smaller models via finetuning and RLHF
Small RLHF models
Large model with weaker techniques
Emergence: measure of model “scale”
What’s the right x-axis for emergence?
Emergence can be plotted against training FLOPs, model parameters, or WikiText-103 perplexity
Emergence: surpassing finetuning
Sociological change in the AI community: finetuned task-specific models are outperformed by few-shot prompted large models
Summary of emergence:
Reflection:
Ed Chi
Quoc Le
Olivier Bousquet
Dale Schuurmans
Fei Xia
Aakanksha Chowdhery
Sharan Narang
Maarten Bosma
Nathan Scales
Brian Ichter
Le Hou
Nathanael Schärli
Denny Zhou
Xuezhi Wang
Jason Wei
Google I/O 2022
Video: https://twitter.com/Google/status/1525188695875366912
CoT paper
Motivation:
Chain-of-thought prompting elicits reasoning in large language models (NeurIPS ‘22).
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, & D. Zhou.
CoT demo
GSM8K
StrategyQA
CoT paper
Finetuned SOTA at the time
Finetuned SOTA
Human
Real model output #1
Q: Can you hide a basketball in a sand cat's ear?
A: A basketball is about 30 cm in diameter. A sand cat's ear is about 2 cm in diameter. Thus, a basketball would not fit in a sand cat's ear. So the answer is no.
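Chain-of-thought prompting works by including intermediate reasoning steps in the few-shot exemplars, so the model produces a reasoning chain before its answer. A minimal sketch of assembling such a prompt, using the sand-cat example above as the exemplar (the helper function and formatting are illustrative, not the paper's exact prompt):

```python
# Minimal sketch of building a chain-of-thought few-shot prompt: each
# exemplar pairs a question with reasoning that ends in the answer.
# `build_cot_prompt` is a hypothetical helper, not from the paper.

exemplars = [
    ("Can you hide a basketball in a sand cat's ear?",
     "A basketball is about 30 cm in diameter. A sand cat's ear is about "
     "2 cm in diameter. Thus, a basketball would not fit in a sand cat's "
     "ear. So the answer is no."),
]

def build_cot_prompt(exemplars, question):
    # Format each exemplar as "Q: ...\nA: <reasoning + answer>", then
    # append the new question with an empty answer slot for the model.
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_cot_prompt(exemplars, "Can a swarm of bees lift a bowling ball?"))
```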
CoT on BIG-Bench: Benchmark
BIG-Bench Hard (BBH):
Challenging BIG-Bench tasks and whether chain-of-thought can solve them.
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. Chi, D. Zhou, and J. Wei.
CoT on BIG-Bench: Result summary
Detail: better formatting (options, task description) already beats prior best
Model much lower than average human rater
CoT prompting improves performance by +16.7% on average and surpasses the average human rater on a majority of tasks
CoT on BIG-Bench: Scaling
CoT on BIG-Bench: Emergence
Multilingual chain-of-thought prompting
Language models are multilingual chain-of-thought reasoners (ICLR ‘23).
{F. Shi, M. Suzgun}, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, & J. Wei.
Why does scaling up improve chain-of-thought?
Self-consistency: majority vote
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 per egg. How much does she make every day?
A:
Prompt with example chain of thought
Language model
She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day.
Sample decode with diverse reasoning paths
She eats 3 for breakfast, so she has 16 - 3 = 13 left. Then she bakes muffins, so she has 13 - 4 = 9 eggs left. So she has 9 eggs * $2 = $18.
This means she uses 3 + 4 = 7 eggs every day. So in total she sells 7 * $2 = $14 per day.
The answer is $18.
The answer is $14.
The answer is $18.
Majority vote on the answers
The answer is $18.
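The procedure above can be sketched in a few lines: sample several chain-of-thought completions at nonzero temperature, extract the final answer from each, and take the majority. This is a sketch under stated assumptions; `sample_completion` is a hypothetical stand-in for a language-model call, and the answer-extraction regex is illustrative.

```python
# Sketch of self-consistency: sample diverse reasoning paths, pull out the
# final answer from each, and majority-vote. `sample_completion` is a
# hypothetical stand-in for sampling from a language model.
import re
from collections import Counter

def extract_answer(completion):
    """Pull the final number out of a 'The answer is $X.' style completion."""
    m = re.search(r"answer is \$?([\d,.]+)", completion)
    return m.group(1).rstrip(".") if m else None

def self_consistency(sample_completion, prompt, n=40):
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0]  # majority answer

# The three sampled paths from the slide converge on 18 by majority vote:
paths = iter([
    "She has 16 - 3 - 4 = 9 eggs left. So she makes $2 * 9 = $18 per day. The answer is $18.",
    "This means she uses 3 + 4 = 7 eggs every day. So she sells 7 * $2 = $14 per day. The answer is $14.",
    "She eats 3, bakes with 4, so 16 - 3 - 4 = 9 eggs. 9 * $2 = $18. The answer is $18.",
])
print(self_consistency(lambda p: next(paths), "Janet's ducks lay 16 eggs...", n=3))  # 18
```

Greedy decoding would commit to a single reasoning path; sampling plus voting marginalizes over paths, which is why the gains are largest when the model can produce several plausible chains.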
Self-consistency improves chain-of-thought reasoning in language models (ICLR ‘23).
X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, & D. Zhou.
Self-consistency: results
Simple trick but big performance delta
Self-consistency: emergence
Self-consistency doesn’t work for small models, but can help a lot for large models
Doesn’t work
Increases performance by a lot
Chain-of-thought: Discussion
Conclusions of talk
Looking forward (just my personal interests)
Thanks.