1 of 53

Irina Rish

Canada Excellence Research Chair in Autonomous AI

University of Montreal

Mila - Quebec AI Institute

Scaling Laws, Emergent Behaviors, and AI Democratization

2 of 53

AGI

Building AGI as an Ultimate Goal of AI Field

3 of 53

AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI

“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)

4 of 53

  • “Classical” i.i.d. generalization
  • Out-of-distribution (OoD) generalization
  • Transfer learning
  • Meta-learning
  • Continual learning
  • etc.

Generalization: ML Holy Grail

5 of 53

Neural Scaling Revolution:

One path to solve them all?

6 of 53

AI & Scaling

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

The Bitter Lesson (Rich Sutton, March 13, 2019)

7 of 53

Examples

Computer chess: “the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.”

Computer go: “learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research”

Speech recognition, computer vision: “the statistical methods won out over the human-knowledge-based methods… more computation, together with learning on huge training sets.”

8 of 53

Why Is This So Important?

“Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other.”

9 of 53

“We have to learn the bitter lesson that building in how we think we think does not work in the long run.

We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.”

11 of 53

Large-Scale Pretrained Models

(a.k.a. Foundation Models)

“Train one model on a huge amount of data and adapt it to many applications.

We call such a model a foundation model.”

CRFM: Stanford’s Center for Research on Foundation Models

“On the Opportunities and Risks of Foundation Models”

Application example: healthcare

12 of 53

Successes of Large-Scale Models

  • GPT-3: natural language model (May 2020)
  • CLIP: contrastive image-text model (Jan 2021)
  • DALL-E: text-to-image generation (Jan 2021)
  • Copilot/Codex: code generation (Sept 2021)
  • Stable Diffusion: text-to-image generation (Aug 2022)
  • GPT-4, ChatGPT, LLaMA, etc. (2022+)

13 of 53

Scaling Laws as “Investment Tools” for AI

An example:

Vision Transformers are outperformed by convnets in lower-data regimes, but overtake them as the amount of training data grows: https://arxiv.org/pdf/2010.11929.pdf

14 of 53

History of Neural Scaling Laws

Notation used:

15 of 53

First work to model the scaling of a multilayer neural network’s performance as a function of dataset size via a power law (M2), with x = dataset size and y = test error.

Corinna Cortes, Lawrence D Jackel, Sara A Solla, Vladimir Vapnik, and John S Denker. Learning curves: Asymptotic values and rate of convergence. NeurIPS 1994.
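As a minimal sketch of what fitting such a power law looks like in practice (the functional form and all constants below are illustrative, not taken from the cited paper):

```python
# Sketch: fitting a power law y = c + a * x**(-b) to a synthetic learning
# curve (test error vs. dataset size). All constants are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # c is the irreducible ("asymptotic") error floor
    return c + a * np.power(x, -b)

x = np.logspace(2, 6, 20)               # dataset sizes 1e2 .. 1e6
y = power_law(x, a=5.0, b=0.3, c=0.05)  # noiseless ground-truth curve

params, _ = curve_fit(power_law, x, y, p0=[1.0, 0.5, 0.0])
a_hat, b_hat, c_hat = params
print(b_hat, c_hat)  # recovers the exponent b ≈ 0.3 and floor c ≈ 0.05
```

On a log-log plot, y − c is a straight line with slope −b, which is why learning curves are usually inspected in log-log coordinates.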

16 of 53

Showed that data-size dependent scaling laws given by M2 (power laws) hold over many orders of magnitude.

Joel Hestness et al. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409, December 2017.

17 of 53

Applied M2 (power laws) to model-size dependent scaling laws, i.e. when x = number of parameters.

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. CoRR, abs/1909.12673, 2019.

18 of 53

Showed that M2 (power law) also applies when x = compute, in addition to x = data and x = model size.

This paper brought “neural” scaling laws into the mainstream, as it appeared in the context of GPT-3’s training.

Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, January 2020.

19 of 53

Sharp Transitions in GPT-3 Performance with Increasing Number of Parameters

22 of 53

Broken Neural Scaling Laws (BNSL)

Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger

arxiv.org/abs/2210.14891

ICLR 2023 Conference Paper

23 of 53

Broken Neural Scaling Laws:

A Universal Functional Form for Neural Scaling Laws?

Ethan Caballero et al., 2022

https://arxiv.org/abs/2210.14891
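For reference, the smoothly broken power law form proposed in the paper can be sketched as follows; the parameter values below are illustrative only, not fits from the paper:

```python
# Sketch of the BNSL (smoothly broken power law) functional form from
# arXiv:2210.14891; parameter values below are illustrative, not fitted.
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """a: limiting value as x -> infinity (e.g. irreducible error);
    b, c0: scale and exponent of the initial power-law segment;
    breaks: list of (c_i, d_i, f_i), where d_i is the break location,
    c_i the change in log-log slope, f_i the smoothness of the break."""
    y = b * np.power(x, -c0)
    for c_i, d_i, f_i in breaks:
        y = y * np.power(1.0 + np.power(x / d_i, 1.0 / f_i), -c_i * f_i)
    return a + y

# One break at x = 1e4: the log-log slope is -0.2 before it, -(0.2 + 0.4) after.
x = np.logspace(0, 8, 9)
y = bnsl(x, a=0.1, b=2.0, c0=0.2, breaks=[(0.4, 1e4, 0.5)])
```

With zero breaks this reduces to an ordinary power law plus a constant; each additional (c_i, d_i, f_i) triple adds one smooth change of slope, which is what lets a single functional form cover monotonic, non-monotonic, and sharply transitioning scaling curves.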

24 of 53

BNSL accurately fits and extrapolates a very wide range of scaling behaviors

  • Settings: zero-shot, prompted, and fine-tuned; downstream and upstream
  • Tasks: Large-Scale Vision, Language, Audio, Video, Diffusion, Generative Modeling, Multimodal Learning, Contrastive Learning, AI Alignment, AI Capabilities, Robotics, Out-Of-Distribution Generalization, Continual Learning, Transfer Learning, Uncertainty Estimation / Calibration, Out-Of-Distribution Detection, Adversarial Robustness, Distillation, Sparsity, Retrieval, Quantization, Pruning, Fairness, Molecules, Computer Programming/Coding, Math Word Problems, Arithmetic, Double Descent, “Emergent” “Phase Transitions”, Supervised Learning, Unsupervised / Self-Supervised Learning, & Reinforcement Learning (Single Agent & Multi-Agent)
  • Architectures: ResNet, Transformer, MLP-Mixer, MLP, Graph Neural Network, U-Net, Ensemble, Sparsely-Gated Mixture-of-Experts, Sparse Pruned Model
  • X-axes: Compute, Dataset Size, Number of Model Parameters, Number of Training Steps, Input (e.g. Context) Size, & Upstream Performance
  • Y-axes: prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, FID score

25 of 53

Functional forms from previous work that we compare to:

26 of 53

Percentage of tasks by domain where each functional form is the best for extrapolation of scaling behavior

27 of 53

Empirical Results

In all plots, black points are the points used for fitting a BNSL, green points are the held-out points used for evaluating extrapolation of BNSL fit to the black points, and a red line is the BNSL that has been fit to the black points. 100% of the plots contain green point(s) for evaluating extrapolation.

Except when stated otherwise, each plot contains the region(s) surrounding (or neighboring) a single break of a BNSL fit to black points that are smaller (along the x-axis) than the green points.
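The fit-then-extrapolate protocol described above can be sketched as follows; for brevity a plain power law stands in for BNSL, and the data and split points are synthetic:

```python
# Sketch of the evaluation protocol above: fit only the smaller-x ("black")
# points, then measure extrapolation error on held-out larger-x ("green")
# points. A plain power law stands in for BNSL; the data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    return c + a * np.power(x, -b)

x = np.logspace(2, 7, 24)
y = power_law(x, a=3.0, b=0.25, c=0.02)   # illustrative ground truth

fit_x, fit_y = x[:16], y[:16]     # "black" points (smaller x), used for fitting
held_x, held_y = x[16:], y[16:]   # "green" points (larger x), held out

params, _ = curve_fit(power_law, fit_x, fit_y, p0=[1.0, 0.5, 0.0])
pred = power_law(held_x, *params)
rmsle = np.sqrt(np.mean((np.log10(pred) - np.log10(held_y)) ** 2))
print(rmsle)  # extrapolation error; near zero on this noiseless curve
```

Evaluating only on points beyond the largest fitted x is what distinguishes extrapolation quality from mere goodness of fit.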

28 of 53

BNSL accurately extrapolates the scaling behavior of

Non-Monotonic Scaling (e.g. Double Descent)

29 of 53

BNSL accurately extrapolates the scaling behavior of

Inflection Points (e.g. Four Digit Addition)

30 of 53

BNSL accurately extrapolates to scales that are over an order of magnitude away

31 of 53

BNSL accurately extrapolates the scaling behavior of

Sparse Models

32 of 53

BNSL accurately extrapolates the scaling behavior of

Distillation

33 of 53

BNSL accurately extrapolates the scaling behavior of

Diffusion Generative Models of Images

34 of 53

BNSL accurately extrapolates the scaling behavior of

AI Alignment (even downstream)

35 of 53

BNSL accurately extrapolates the scaling behavior of

Reinforcement Learning (Single- and Multi-Agent)

36 of 53

BNSL accurately extrapolates the scaling behavior of

Computer Coding / Programming

37 of 53

Perhaps variants of smoothly broken power laws (BNSL is one such variant) are the “true” functional form of the scaling behavior of many (all?) quantities that involve artificial neural networks?

38 of 53

Training FoMo in Practice

“We think the most benefits will go to whoever has the biggest computer.” Greg Brockman, OpenAI’s CTO, Financial Times

Most compute is owned by AI companies (Google, OpenAI, etc.), not academia and nonprofit research; this “compute gap” continues to widen

We need to “democratize AI”!

39 of 53

Supercomputers: Summit and Frontier

40 of 53

Growing Collaborative Network

41 of 53

Irina Rish (UdeM/Mila/LAION)

Jenia Jitsev (Juelich/LAION) and collaborators

Scalable Foundation Models for Transferable Generalist AI

INCITE 2023 / Summit

42 of 53

Our 2022-2023 INCITE Project

43 of 53

Ongoing Projects

  • Language models: pretraining and continual learning
  • Aligned multimodal language-vision models
  • Time-series transformers
  • Multimodal “generalist” agent

Ultimate goal: an interactive, continually learning “Open ChatX” model

44 of 53

Open-Source Language Models and Data

45 of 53

Aligning Large Models with Human Values?

huggingface.co/spaces/EleutherAI/magma

github.com/Aleph-Alpha/magma


46 of 53

Our Vision

Our long-term, overarching goal is to build a broad international collaboration, united by the objective of developing foundation models that are increasingly powerful while remaining safe, robust, and aligned with human values.

Such models are meant to serve as the foundation for numerous AI applications of great societal value, from industry to healthcare to scientific discovery.

We aim to avoid accumulation of the most advanced AI technology in a small set of large companies, while jointly advancing the field of open AI (democratization of AI).

Obtaining access to large-scale computational resources would greatly facilitate the development of open AI research worldwide, and would help ensure a collaborative, collective solution to the challenge of making future AI systems not only highly advanced but maximally beneficial for the whole of humanity.

47 of 53

Neural Scaling Laws course at UdeM & Mila

Workshop series on “Neural Scaling Laws: Towards Maximally Beneficial AGI”

4th workshop: ICML (Hawaii), July 2023

Resources


51 of 53

Open Questions to Neuroscience

  • Biological networks scaled from single-cell organisms to humans
  • What are the “scaling laws” and “scaling algorithms” we can borrow from nature?
  • Phase transitions (“from quantity to quality”): what can AI researchers, working on scaling AI while keeping it safe and steerable, learn from nature about the emergence of novel behaviors?

52 of 53

Organization of the Brain Network

Stam CJ. 2014 Modern network science of neurological disorders. Nat. Rev. Neurosci. 15.

Brain topology as a combination of three different types of networks:

  • a locally connected network
  • a random network
  • a scale-free network
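A minimal sketch of these three network types, using networkx; the generator parameters (n, k, p, m) are illustrative, not taken from Stam (2014):

```python
# Sketch of the three network types mentioned above, using networkx;
# generator parameters (n, k, p, m) are illustrative, not from Stam (2014).
import networkx as nx

n = 200
local_net = nx.watts_strogatz_graph(n, k=4, p=0.0)    # ring lattice: purely local links
random_net = nx.erdos_renyi_graph(n, p=0.02)          # Erdos-Renyi random graph
scalefree_net = nx.barabasi_albert_graph(n, m=2)      # preferential attachment -> scale-free

# The lattice has high clustering; the random graph has short paths but low
# clustering; the scale-free graph has a heavy-tailed degree distribution.
print(nx.average_clustering(local_net))               # 0.5 for a k=4 ring lattice
print(max(d for _, d in scalefree_net.degree()))      # hub degree well above the average
```

Combining these regimes (e.g. rewiring a lattice with small probability p) is the classic route to small-world topologies often attributed to brain networks.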

53 of 53

Thank you!