Irina Rish
Canada Excellence Research Chair in Autonomous AI
University of Montreal
Mila - Quebec AI Institute
Scaling Laws, Emergent Behaviors, and AI Democratization
AGI
Building AGI as an Ultimate Goal of AI Field
AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI
“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)
Generalization: ML Holy Grail
Neural Scaling Revolution:
One path to solve them all?
AI & Scaling
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
The Bitter Lesson (Rich Sutton, March 13, 2019)
Examples
Computer chess: “the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.”
Computer go: “learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research”
Speech recognition, computer vision: “the statistical methods won out over the human-knowledge-based methods… more computation, together with learning on huge training sets.”
Why Is This So Important?
“Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other.”
We have to learn the bitter lesson that building in how we think we think does not work in the long run.
We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
Large-Scale Pretrained Models
(a.k.a. Foundation Models)
“Train one model on a huge amount of data and adapt it to many applications.
We call such a model a foundation model.”
CRFM: Stanford’s Center for Research on Foundation Models
“On the Opportunities and Risks of Foundation Models”
Application example: healthcare
Successes of Large-Scale Models
Scaling Laws as “Investment Tools” for AI
An example:
Vision Transformers are dominated by convnets in lower-data regimes, but outperform them as training data grows: https://arxiv.org/pdf/2010.11929.pdf
History of Neural Scaling Laws
Notation: “M2” denotes the power-law functional form used in the works below.
First to model the scaling of a multilayer neural network’s performance as a function of dataset size via a power law (M2), where x = dataset size and y = test error.
Corinna Cortes, Lawrence D Jackel, Sara A Solla, Vladimir Vapnik, and John S Denker. Learning curves: Asymptotic values and rate of convergence. NeurIPS 1994.
Showed that data-size dependent scaling laws given by M2 (power laws) hold over many orders of magnitude.
Joel Hestness et al. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409, December 2017.
Applied M2 (power laws) to model-size dependent scaling laws, i.e. when x = number of parameters.
Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. CoRR, abs/1909.12673, 2019.
Showed that M2 (power law) applies when x = compute, besides x = data and x = model.
This paper brought “neural” scaling laws into the mainstream, as it arose in the context of GPT-3 training.
Jared Kaplan et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, January 2020.
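A power-law scaling law of this kind can be recovered from data with a short log-log regression. The sketch below uses synthetic data and illustrative names; nothing in it comes from the cited papers.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x^(-b) by linear regression in log-log space.

    A power law (the M2 form) is a straight line on a log-log plot:
    log y = log a - b * log x.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(intercept), -slope

# Synthetic learning curve: test error falling as a power of dataset size.
x = np.array([1e3, 1e4, 1e5, 1e6])
y = 2.0 * x ** -0.35
a, b = fit_power_law(x, y)  # recovers a ≈ 2.0, b ≈ 0.35
```

The same fit applies whether x is dataset size, parameter count, or compute; only the interpretation of the exponent changes.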
Sharp Transitions in GPT-3 Performance with Increasing Number of Parameters
Wei et al, Emergent Abilities of Large Language Models
Broken Neural Scaling Laws (BNSL)
Ethan Caballero
Kshitij Gupta,
Irina Rish,
David Krueger
ICLR 2023 Conference Paper
Broken Neural Scaling Laws:
A Universal Functional Form for Neural Scaling Laws?
Ethan Caballero et al, 2022
https://arxiv.org/abs/2210.14891
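The BNSL functional form from the paper is short enough to write down directly. The implementation below is a minimal sketch with illustrative parameter values; the function name and the numbers are ours, not the paper's.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Broken Neural Scaling Law (Caballero et al., 2022):

        y = a + b * x^(-c0) * prod_i (1 + (x / d_i)^(1/f_i))^(-c_i * f_i)

    Each (c_i, d_i, f_i) triple adds one smooth "break": the local
    power-law exponent shifts by c_i around scale d_i, with the
    sharpness of the transition controlled by f_i.
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One break at d = 1e4: the exponent steepens from 0.2 to 0.5 past it.
x = np.logspace(2, 7, 6)
y = bnsl(x, a=0.1, b=5.0, c0=0.2, breaks=[(0.3, 1e4, 0.5)])
```

With zero breaks this reduces to an ordinary power law with an offset, which is one reason the family nests the simpler functional forms it is compared against.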
BNSL accurately fits and extrapolates a very wide range of scaling behaviors
Functional forms from previous work that we compare to:
Percentage of tasks by domain where each functional form is the best for extrapolation of scaling behavior
Empirical Results
In all plots, black points are the points used for fitting a BNSL, green points are the held-out points used for evaluating extrapolation of BNSL fit to the black points, and a red line is the BNSL that has been fit to the black points. 100% of the plots contain green point(s) for evaluating extrapolation.
Except when stated otherwise, each plot contains the region(s) surrounding (or neighboring) a single break of a BNSL fit to black points that are smaller (along the x-axis) than the green points.
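This fit-on-black, extrapolate-to-green protocol is easy to mimic on a toy curve. The example below, with all numbers illustrative and a plain power law as the baseline fit, shows why a single power law fails to extrapolate past a break.

```python
import numpy as np

# Toy scaling curve with one smooth break around x = 1e5.
x = np.logspace(2, 9, 15)
y = 4.0 * x ** -0.1 * (1.0 + (x / 1e5) ** 2) ** -0.1

# "Black" points (small scales) are used for fitting;
# "green" points (larger scales) are held out for extrapolation.
black = x < 3e4
green = ~black

# Baseline: fit a plain power law to the black points (log-log regression).
slope, intercept = np.polyfit(np.log(x[black]), np.log(y[black]), deg=1)
pred = np.exp(intercept) * x ** slope

# Extrapolation error on the held-out green points: a single power law
# cannot model the slope change, so its error grows with scale.
err = np.abs(np.log(pred[green]) - np.log(y[green]))
```

Fitting a BNSL whose fitted points straddle the break, as described above, is what lets it track the slope change that a plain power law misses.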
BNSL accurately extrapolates the scaling behavior of
Non-Monotonic Scaling (e.g. Double Descent)
BNSL accurately extrapolates the scaling behavior of
Inflection Points (e.g. Four Digit Addition)
BNSL accurately extrapolates to scales that are over an order of magnitude away
BNSL accurately extrapolates the scaling behavior of
Sparse Models
BNSL accurately extrapolates the scaling behavior of
Distillation
BNSL accurately extrapolates the scaling behavior of
Diffusion Generative Models of Images
BNSL accurately extrapolates the scaling behavior of
AI Alignment (even downstream)
BNSL accurately extrapolates the scaling behavior of
Reinforcement Learning (Single- and Multi-Agent)
BNSL accurately extrapolates the scaling behavior of
Computer Coding / Programming
Perhaps variants of smoothly broken power laws (BNSL is one such variant) are the “true” functional form of the scaling behavior of many (all?) things that involve artificial neural networks?
Training FoMo in Practice
“We think the most benefits will go to whoever has the biggest computer.” – Greg Brockman, OpenAI’s CTO, Financial Times
Most compute is owned by AI companies (Google, OpenAI, etc.), not by academia and nonprofit research; this “compute gap” continues to widen
We need to “democratize AI”!
Supercomputers: Summit and Frontier
Growing Collaborative Network
Irina Rish (UdeM/Mila/LAION)
Jenia Jitsev (Juelich/LAION) and collaborators
Scalable Foundation Models for Transferable Generalist AI
INCITE 2023 / Summit
Our 2022-2023 INCITE Project
Ongoing Projects
Language Models: Pretraining and Continual Learning
Aligned Multimodal Language-Vision Models:
Time-series Transformers
Multimodal “Generalist” Agent
Ultimate goal:
Interactive, Continually Learning “Open ChatX” model
Open-Source Language Models and Data
Aligning Large Models with Human Values?
huggingface.co/spaces/EleutherAI/magma github.com/Aleph-Alpha/magma
Our Vision
Our long-term, overarching goal is to develop a wide international collaboration, united by the objective of building foundation models that are increasingly powerful while remaining safe, robust, and aligned with human values.
Such models aim to serve as the foundation for numerous AI applications, from industry to healthcare to scientific discovery, i.e., AI-powered applications of great societal value.
We aim to prevent the most advanced AI technology from accumulating in a small set of large companies, while jointly advancing open AI research (the democratization of AI).
Access to large-scale computational resources would greatly facilitate open AI research worldwide and help ensure a collaborative, collective solution to the challenge of making future AI systems not only highly advanced but also maximally beneficial for the whole of humanity.
Neural Scaling Laws course at UdeM & Mila
Workshop series on “Neural Scaling Laws: Towards Maximally Beneficial AGI”
4th workshop: ICML (Hawaii), July 2023
Resources
Open Questions to Neuroscience
Organization of the Brain Network
Stam CJ. 2014 Modern network science of neurological disorders. Nat. Rev. Neurosci. 15.
Brain topology as a combination of three different types of networks:
Thank you!