1 of 53

Irina Rish

Canada Excellence Research Chair in Autonomous AI

University of Montreal

Mila - Quebec AI Institute

Scaling Laws, Emergent Behaviors, and AI Democratization

2 of 53

AGI

Building AGI as an Ultimate Goal of AI Field

3 of 53

AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI

“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)

4 of 53

  • “Classical” i.i.d. generalization
  • Out-of-distribution (OoD) generalization
  • Transfer learning
  • Meta-learning
  • Continual learning
  • etc.

Generalization: ML Holy Grail

5 of 53

Neural Scaling Revolution:

One path to solve them all?

6 of 53

AI & Scaling

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

The Bitter Lesson (Rich Sutton, March 13, 2019)

7 of 53

Examples

Computer chess: “the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.”

Computer go: “learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research”

Speech recognition, computer vision: “the statistical methods won out over the human-knowledge-based methods… more computation, together with learning on huge training sets.”

8 of 53

Why Is This So Important?

“Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other.”

9 of 53

“We have to learn the bitter lesson that building in how we think we think does not work in the long run.

We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.”

11 of 53

Large-Scale Pretrained Models

(a.k.a. Foundation Models)

“Train one model on a huge amount of data and adapt it to many applications.

We call such a model a foundation model.”

CRFM: Stanford’s Center for Research on Foundation Models

“On the Opportunities and Risks of Foundation Models”

Application example: healthcare

12 of 53

Successes of Large-Scale Models

  • GPT-3: natural language model (May 2020)
  • CLIP: contrastive image-text model (Jan 2021)
  • DALL-E: text-to-image generation (Jan 2021)
  • Copilot/Codex: code generation (Sept 2021)
  • Stable Diffusion: text-to-image generation (Aug 2022)
  • GPT-4, ChatGPT, LLaMA, etc. (2022+)

13 of 53

Scaling Laws as “Investment Tools” for AI

An example:

Vision Transformers are outperformed by convnets in lower-data regimes, but overtake them as the amount of training data grows: https://arxiv.org/pdf/2010.11929.pdf

14 of 53

History of Neural Scaling Laws

Notation used:

15 of 53

First work to model the scaling of a multilayer neural network’s performance as a function of dataset size via a power law (M2), with x = dataset size and y = test error.

Corinna Cortes, Lawrence D Jackel, Sara A Solla, Vladimir Vapnik, and John S Denker. Learning curves: Asymptotic values and rate of convergence. NeurIPS 1994.
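As a minimal sketch of what fitting such a power law looks like in practice (the functional form and all constants below are illustrative, not taken from the cited paper):

```python
# Sketch: fitting a power law y = c + a * x**(-b) to a synthetic learning
# curve (test error vs. dataset size). All constants are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # c is the irreducible ("asymptotic") error floor
    return c + a * np.power(x, -b)

x = np.logspace(2, 6, 20)               # dataset sizes 1e2 .. 1e6
y = power_law(x, a=5.0, b=0.3, c=0.05)  # noiseless ground-truth curve

params, _ = curve_fit(power_law, x, y, p0=[1.0, 0.5, 0.0])
a_hat, b_hat, c_hat = params
print(b_hat, c_hat)  # recovers the exponent b ≈ 0.3 and floor c ≈ 0.05
```

On a log-log plot, y − c is a straight line with slope −b, which is why learning curves are usually inspected in log-log coordinates.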

16 of 53

Showed that data-size dependent scaling laws given by M2 (power laws) hold over many orders of magnitude.

Joel Hestness et al. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409, December 2017.

17 of 53

Applied M2 (power laws) to model-size dependent scaling laws, i.e. when x = number of parameters.

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. CoRR, abs/1909.12673, 2019.

18 of 53

Showed that M2 (power law) also applies when x = compute, in addition to x = data and x = model size.

This paper brought “neural” scaling laws into the mainstream, as it appeared in the context of GPT-3’s training.

Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, January 2020.

19 of 53

Sharp Transitions in GPT-3 Performance with Increasing Number of Parameters

22 of 53

Broken Neural Scaling Laws (BNSL)

Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger

arxiv.org/abs/2210.14891

ICLR 2023 Conference Paper

23 of 53

Broken Neural Scaling Laws:

A Universal Functional Form for Neural Scaling Laws?

Ethan Caballero et al., 2022

https://arxiv.org/abs/2210.14891
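For reference, the smoothly broken power law form proposed in the paper can be sketched as follows; the parameter values below are illustrative only, not fits from the paper:

```python
# Sketch of the BNSL (smoothly broken power law) functional form from
# arXiv:2210.14891; parameter values below are illustrative, not fitted.
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """a: limiting value as x -> infinity (e.g. irreducible error);
    b, c0: scale and exponent of the initial power-law segment;
    breaks: list of (c_i, d_i, f_i), where d_i is the break location,
    c_i the change in log-log slope, f_i the smoothness of the break."""
    y = b * np.power(x, -c0)
    for c_i, d_i, f_i in breaks:
        y = y * np.power(1.0 + np.power(x / d_i, 1.0 / f_i), -c_i * f_i)
    return a + y

# One break at x = 1e4: the log-log slope is -0.2 before it, -(0.2 + 0.4) after.
x = np.logspace(0, 8, 9)
y = bnsl(x, a=0.1, b=2.0, c0=0.2, breaks=[(0.4, 1e4, 0.5)])
```

With zero breaks this reduces to an ordinary power law plus a constant; each additional (c_i, d_i, f_i) triple adds one smooth change of slope, which is what lets a single functional form cover monotonic, non-monotonic, and sharply transitioning scaling curves.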

24 of 53

BNSL accurately fits and extrapolates a very wide range of scaling behaviors

  • Settings: zero-shot, prompted, and fine-tuned; downstream and upstream
  • Tasks: Large-Scale Vision, Language, Audio, Video, Diffusion, Generative Modeling, Multimodal Learning, Contrastive Learning, AI Alignment, AI Capabilities, Robotics, Out-Of-Distribution Generalization, Continual Learning, Transfer Learning, Uncertainty Estimation / Calibration, Out-Of-Distribution Detection, Adversarial Robustness, Distillation, Sparsity, Retrieval, Quantization, Pruning, Fairness, Molecules, Computer Programming/Coding, Math Word Problems, Arithmetic, Double Descent, “Emergent” “Phase Transitions”, Supervised Learning, Unsupervised / Self-Supervised Learning, & Reinforcement Learning (Single Agent & Multi-Agent)
  • Architectures: ResNet, Transformer, MLP-Mixer, MLP, Graph Neural Network, U-Net, Ensemble, Sparsely-Gated Mixture-of-Experts, Sparse Pruned Model
  • X-axes: Compute, Dataset Size, Number of Model Parameters, Number of Training Steps, Input (e.g. Context) Size, & Upstream Performance
  • Y-axes: prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, FID score

25 of 53

Functional forms from previous work that we compare to:

26 of 53

Percentage of tasks by domain where each functional form is the best for extrapolation of scaling behavior

27 of 53

Empirical Results

In all plots, black points are the points used for fitting a BNSL, green points are the held-out points used for evaluating extrapolation of BNSL fit to the black points, and a red line is the BNSL that has been fit to the black points. 100% of the plots contain green point(s) for evaluating extrapolation.

Except when stated otherwise, each plot contains the region(s) surrounding (or neighboring) a single break of a BNSL fit to black points that are smaller (along the x-axis) than the green points.
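The fit-then-extrapolate protocol described above can be sketched as follows; for brevity a plain power law stands in for BNSL, and the data and split points are synthetic:

```python
# Sketch of the evaluation protocol above: fit only the smaller-x ("black")
# points, then measure extrapolation error on held-out larger-x ("green")
# points. A plain power law stands in for BNSL; the data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    return c + a * np.power(x, -b)

x = np.logspace(2, 7, 24)
y = power_law(x, a=3.0, b=0.25, c=0.02)   # illustrative ground truth

fit_x, fit_y = x[:16], y[:16]     # "black" points (smaller x), used for fitting
held_x, held_y = x[16:], y[16:]   # "green" points (larger x), held out

params, _ = curve_fit(power_law, fit_x, fit_y, p0=[1.0, 0.5, 0.0])
pred = power_law(held_x, *params)
rmsle = np.sqrt(np.mean((np.log10(pred) - np.log10(held_y)) ** 2))
print(rmsle)  # extrapolation error; near zero on this noiseless curve
```

Evaluating only on points beyond the largest fitted x is what distinguishes extrapolation quality from mere goodness of fit.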

28 of 53

BNSL accurately extrapolates the scaling behavior of

Non-Monotonic Scaling (e.g. Double Descent)

29 of 53

BNSL accurately extrapolates the scaling behavior of

Inflection Points (e.g. Four Digit Addition)

30 of 53

BNSL accurately extrapolates to scales that are over an order of magnitude away

31 of 53

BNSL accurately extrapolates the scaling behavior of

Sparse Models

32 of 53

BNSL accurately extrapolates the scaling behavior of

Distillation

33 of 53

BNSL accurately extrapolates the scaling behavior of

Diffusion Generative Models of Images

34 of 53

BNSL accurately extrapolates the scaling behavior of

AI Alignment (even downstream)

35 of 53

BNSL accurately extrapolates the scaling behavior of

Reinforcement Learning (Single- and Multi-Agent)

36 of 53

BNSL accurately extrapolates the scaling behavior of

Computer Coding / Programming

37 of 53

Perhaps variants of smoothly broken power laws (BNSL is one such variant) are the “true” functional form of the scaling behavior of many (all?) quantities that involve artificial neural networks?

38 of 53

Training FoMo in Practice

“We think the most benefits will go to whoever has the biggest computer.” Greg Brockman, OpenAI’s CTO, Financial Times

Most compute is owned by AI companies (Google, OpenAI, etc.), not academia and nonprofit research; this “compute gap” continues to widen

We need to “democratize AI”!

39 of 53

Supercomputers: Summit and Frontier

40 of 53

Growing Collaborative Network

41 of 53

Irina Rish (UdeM/Mila/LAION)

Jenia Jitsev (Juelich/LAION) and collaborators

Scalable Foundation Models for Transferable Generalist AI

INCITE 2023 / Summit

42 of 53

Our 2022-2023 INCITE Project

43 of 53

Ongoing Projects

  • Language models: pretraining and continual learning
  • Aligned multimodal language-vision models
  • Time-series transformers
  • Multimodal “generalist” agent

Ultimate goal: an interactive, continually learning “Open ChatX” model

44 of 53

Open-Source Language Models and Data

45 of 53

Aligning Large Models with Human Values?

huggingface.co/spaces/EleutherAI/magma

github.com/Aleph-Alpha/magma


46 of 53

Our Vision

Our long-term, overarching goal is to build a broad international collaboration, united by the objective of developing foundation models that are increasingly powerful while remaining safe, robust, and aligned with human values.

Such models are meant to serve as the foundation for numerous AI applications of great societal value, from industry to healthcare to scientific discovery.

We aim to avoid accumulation of the most advanced AI technology in a small set of large companies, while jointly advancing the field of open AI (democratization of AI).

Obtaining access to large-scale computational resources would greatly facilitate the development of open AI research worldwide, and would help ensure a collaborative, collective solution to the challenge of making future AI systems not only highly advanced but maximally beneficial for the whole of humanity.

47 of 53

Neural Scaling Laws course at UdeM & Mila

Workshop series on “Neural Scaling Laws: Towards Maximally Beneficial AGI”

4th workshop: ICML (Hawaii), July 2023

Resources


51 of 53

Open Questions to Neuroscience

  • Biological networks scaled from single-cell organisms to humans
  • What are the “scaling laws” and “scaling algorithms” we can borrow from nature?
  • Phase transitions (“from quantity to quality”): what can AI researchers, working on scaling AI while keeping it safe and steerable, learn from nature about the emergence of novel behaviors?

52 of 53

Organization of the Brain Network

Stam CJ. 2014 Modern network science of neurological disorders. Nat. Rev. Neurosci. 15.

Brain topology as a combination of three different types of networks:

  • a locally connected network
  • a random network
  • a scale-free network
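A minimal sketch of these three network types, using networkx; the generator parameters (n, k, p, m) are illustrative, not taken from Stam (2014):

```python
# Sketch of the three network types mentioned above, using networkx;
# generator parameters (n, k, p, m) are illustrative, not from Stam (2014).
import networkx as nx

n = 200
local_net = nx.watts_strogatz_graph(n, k=4, p=0.0)    # ring lattice: purely local links
random_net = nx.erdos_renyi_graph(n, p=0.02)          # Erdos-Renyi random graph
scalefree_net = nx.barabasi_albert_graph(n, m=2)      # preferential attachment -> scale-free

# The lattice has high clustering; the random graph has short paths but low
# clustering; the scale-free graph has a heavy-tailed degree distribution.
print(nx.average_clustering(local_net))               # 0.5 for a k=4 ring lattice
print(max(d for _, d in scalefree_net.degree()))      # hub degree well above the average
```

Combining these regimes (e.g. rewiring a lattice with small probability p) is the classic route to small-world topologies often attributed to brain networks.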

53 of 53

Thank you!