Irina Rish
University of Montreal
Mila - Quebec AI Institute
Lag-Llama:
Towards Time-Series Foundation Models
Canada Excellence Research Chair
in Autonomous AI
CERC-AAI Lab: irina-lab.ai
Outline
The Holy Grail of AI: Generalization
AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI
“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)
High-Level Objective of CERC in Autonomous AI
Bring AI to a fundamentally new level:
Artificial General Intelligence (AGI):
“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)
It’s All About (Further) Generalization
Recent Advances in AI @ Scale:
One path to Solve Them All?
Foundation Models: Jump Towards AGI?
“Train one model on a huge amount of data and adapt it to many applications.
We call such a model a foundation model.”
CRFM: Stanford’s Center for Research on Foundation Models
“On the Opportunities and Risks of Foundation Models”
Application example: healthcare
GPT-3: Generative Pre-trained Transformer (3rd gen)
GPT-3: The New Mighty Language Model from OpenAI, (towardsdatascience.com, May 31, 2020)
Language model learns:
P(next word | previous words)
Task-Agnostic, few-shot learner: no task-specific datasets needed to “fine-tune”
Produces more fluent and human-like text outputs than its (smaller) predecessors
~175B parameters
~45TB of training data
“Cambrian Explosion” of Large-Scale Models
AI & Scaling
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
The Bitter Lesson (Rich Sutton, March 13, 2019)
Scaling Laws as “Investment Tools” for AI
An example:
image transformers are dominated by convnets in lower-data regimes but outperform them with more data: https://arxiv.org/pdf/2010.11929.pdf
Neural Scaling Laws: Kaplan et al
Jared Kaplan et al, Scaling Laws for Neural Language Models, 2020.
“Old” GPT Scaling Laws
(NOT compute-optimal!):
Data/Model = 2/5
Chinchilla Scaling Laws: Data/Model = 50/50
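For reference, the fitted forms behind these two slides can be summarized as follows (exponent values are the approximate published ones, quoted from memory, so treat them as indicative):

\[
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},\ \alpha_N \approx 0.076; \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},\ \alpha_D \approx 0.095; \qquad
L(C_{\min}) \approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C},\ \alpha_C \approx 0.050
\]
\[
\text{Chinchilla (compute-optimal): } N_{\mathrm{opt}} \propto C^{\,0.5},\quad D_{\mathrm{opt}} \propto C^{\,0.5}\ \ (\text{roughly 20 training tokens per parameter})
\]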
Broken Neural Scaling Laws:
A Universal Functional Form for Neural Scaling Laws?
Ethan Caballero et al, 2022
https://arxiv.org/abs/2210.14891
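For reference, the “smoothly broken power law” proposed in that paper has roughly the following functional form (quoted from memory; see the arXiv link above for the exact definition):

\[
y = a + b\,x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}
\]

where x is the scaled quantity (data, parameters, or compute), y is the performance metric, and each “break” i smoothly switches the local power-law exponent between regimes.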
Training Foundation Models
“We think the most benefits will go to whoever has the biggest computer.” – Greg Brockman, OpenAI’s CTO, Financial Times
Most compute is owned by AI companies (Google, OpenAI, etc), not academia & nonprofit research; this “compute gap” continues to widen.
We need to “democratize AI”!
Open Foundation Models on Supercomputers
5.9M V100 GPU hrs on Summit
Growing International Collaboration
nolano.org
Farama
Ongoing CERC-AAI Lab Projects
Language Models: Pretraining and Continual Learning
Aligned Multimodal Language-Vision Models
Time-series Transformers
Compression/Distillation & Fast Inference
Generalist Agents: Open Gato & Open Ada
LLM 4 Psychology & Psychology 4 LLMs
Kshitij Gupta
Daniel Kaplan
Quentin Anthony
Arjun Ashok
Benjamin Thérien
Tejas Vaidhya
Adam Ibrahim
Andrew Williams
Alexis Roger
LLM from Scratch: RPJ-INCITE 3B and 7B (May 2023)
Multimodal Alignment
???
https://www.irina-lab.ai/blog/lag-llama
Use Cases
Univariate Probabilistic Forecasting
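Concretely, univariate probabilistic forecasting means producing a full predictive distribution over the horizon rather than a point forecast; in the autoregressive setting (notation mine, not from the slides):

\[
p\left(y_{t+1:t+H} \mid y_{1:t}\right) = \prod_{h=1}^{H} p\left(y_{t+h} \mid y_{1:t+h-1};\,\theta\right)
\]

so forecast quantiles and intervals are obtained by sampling trajectories from the model.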
Use Cases
Lag-Llama Architecture
Tokenization
Lag-Llama Tokenization
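A minimal sketch of lag-based tokenization, as described in the Lag-Llama blog post linked above: each position in the context becomes a vector of past values of the series taken at a fixed set of lag indices. The lag set and function below are illustrative placeholders, not the exact ones used in the released code (which derives its lags from standard time-series frequencies and adds covariates).

import numpy as np

def lag_tokens(series, lags=(1, 2, 3, 7, 14, 30), context_length=32):
    """Turn a univariate series into tokens: one vector of lagged values per time step."""
    series = np.asarray(series, dtype=np.float32)
    max_lag = max(lags)
    tokens = [[series[t - lag] for lag in lags] for t in range(max_lag, len(series))]
    return np.array(tokens)[-context_length:]  # shape: (context_length, len(lags))

# example: a series of length 200 yields a (32, 6) token matrix
print(lag_tokens(np.random.randn(200)).shape)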
Pre-training Data Diversity
Good Zero-Shot Forecasting on Some Datasets
Fine-tuning Significantly Improves Forecasting
Neural Scaling Laws: Preliminary Results
Thank you!
Predicting Network’s Behavior at Scale
Neural Scaling Laws: Kaplan et al
Jared Kaplan et al, Scaling Laws for Neural Language Models, 2020.
Scale and Inductive Biases
Brief History of Neural Scaling Laws
1994 – Cortes et al., "Learning Curves: Asymptotic Values and Rate of Convergence," NeurIPS 1994. First to observe power-law scaling of ANNs: x = dataset size, y = test error.
2017 – Hestness et al., "Deep Learning Scaling is Predictable, Empirically," Dec 2017. Showed that data-size-dependent scaling laws given by power laws hold over many orders of magnitude.
2019 – Rosenfeld et al., "A Constructive Prediction of the Generalization Error Across Scales," 2019. Applied power laws to model-size-dependent scaling laws, i.e., when x = number of parameters.
2020 – Kaplan et al., "Scaling Laws for Neural Language Models," 2020. Showed that the power law also applies when x = compute, besides x = data and x = model; this paper brought "neural" scaling laws to the mainstream, as it was in the context of GPT-3 training.
Should Pre-training be Continual?
Standard pre-training:
multiple datasets available at once; mixed into one dataset
(or sampled uniformly into each minibatch)
Example: A Generalist Agent
“Old” GPT Scaling Laws
(NOT compute-optimal!):
Data/Model = 2/5
Chinchilla Scaling Laws: Data/Model = 50/50
LLaMA-13B outperforms GPT-3 (175B) on most benchmarks
LLaMA-65B is competitive w/ best models: Chinchilla-70B, PaLM-540B
Beyond Chinchilla compute-optimal, towards “inference-optimal”
More Data Needed - No Saturation in Sight!
More Complex Scaling Behavior:
“Phase Transitions”, Emergent Phenomena
Predicting Network’s Behavior at Scale
Transition from Memorization to Generalization
Notsawo et al., Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok. 2023
Initial stage / “confusion” (t0 to t1)
Memorization (t1 to t2)
Comprehension (t3 to t4)
Generalization (“grokking”) at t4+
Generalization (“grokking”) point t4 follows an empirical power law given the training data fraction r
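Schematically, the reported relationship can be written as a power law in the training data fraction r (the exact fitted form and exponent are in the paper and not reproduced here):

\[
t_4(r) \;\approx\; c\, r^{-\alpha}, \qquad \alpha > 0,
\]

i.e., in these experiments, runs trained on larger data fractions tend to reach the grokking point earlier.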
Spectral Signature of Loss Predicts Grokking
Notsawo et al., Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok. 2023
[Figure: spectral signature of the training loss for a run with no grokking (after 10k steps) vs. a run that groks]
Predicting Network’s Behavior at Scale
Broken Neural Scaling Laws:
A Universal Functional Form for Neural Scaling Laws?
Ethan Caballero et al, 2022
https://arxiv.org/abs/2210.14891
BNSL accurately fits and extrapolates a very wide range of scaling behaviors
BNSL accurately extrapolates the scaling behavior of:
Sparse Models
Distillation
Diffusion Models
Alignment (Elo score)
Reinforcement Learning
Coding
Video
Non-Monotonic Scaling (e.g. Double Descent)
Inflection Points (e.g. Four Digit Addition)
“Old” GPT Scaling Laws (NOT compute-optimal!):
Data/Model = 2/5
Chinchilla Scaling Laws: Data/Model = 50/50
Chinchilla Scaling Laws
“Chinchilla's wild implications”, AI Alignment Forum, July 30th, 2022
“Data, not size, is the currently active constraint on language modeling performance”
More Data Needed - And No Saturation in Sight!
Why Continual Pretraining?
Open Questions about Continual Pretraining
A model trained on datasets D(1), ..., D(t) will continue to be trained on D(t+1).
Open questions:
Problem Setup
No Distribution Shift: SlimPajama (300B subset) split into three 100B-token parts
Weak Distribution Shift: Pile (English, 300B) -> SlimPajama subset (English, 300B)
Strong Distribution Shift: Pile (English, 300B) -> German, 200B
[Diagram: training setups built from the pre-training data (Pile) and new data: joint training on Pile + new data (“Joint: upper bound?”), continual pre-training on the new data, and a lower bound.]
Results: Continual vs Joint Pre-Training
Compute savings
continual pre-training on new data vs joint re-training
Test loss: approximately the same!
(averaged across both datasets over the last 100 iterations, sampled every 10 iterations)
Same (or Better!) Avg Evals Performance!
Average performance across benchmarks: MMLU, MathQA, Reading Comprehension, World Knowledge, Commonsense Reasoning
Linear Warmup and Cosine Decay Schedule
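A minimal sketch of the linear-warmup-plus-cosine-decay schedule named on this slide (the hyperparameter values are placeholders, not the ones used in these experiments); re-warming in continual pre-training corresponds to restarting this schedule when a new dataset arrives.

import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# lr_at(2000) == max_lr at the end of warmup, lr_at(100_000) == min_lr at the end of training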
Infinite LR schedules: A Better Approach?
SlimPajama split into 3 equal parts and trained continually
Infinite LR schedules can:
Key Insights
LR schedule
Replay
Continual pre-training can greatly reduce compute cost and human effort, without degrading the model’s performance.
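A hedged sketch of the “Replay” ingredient above: each minibatch mixes a small fraction of examples from the previous pre-training data into the new data. The 5% fraction and the helper name are illustrative, not the exact values used in the experiments.

import random

def mixed_batch(new_data, old_data, batch_size=8, replay_frac=0.05):
    """Sample a minibatch that replays a small fraction of old pre-training data."""
    n_replay = max(1, int(batch_size * replay_frac)) if old_data else 0
    batch = random.sample(old_data, n_replay) if n_replay else []
    batch += random.sample(new_data, batch_size - n_replay)
    random.shuffle(batch)
    return batch

# e.g. mixed_batch(list(range(1000, 2000)), list(range(1000))) -> 7 new + 1 replayed example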
Control:
Model: size N, architecture,...
Data: size D, diversity,...
No control:
Downstream task(s) complexity (T)
Relative Scaling Laws?
Scaling laws:
not just L(N,D), but rather L(N,...,D,...| T) ?
Capacity vs Complexity Trade-off
10 rotated MNIST tasks
1000 different videogames
Downstream Task Complexity
Model Capacity (Capabilities)
(grow w/ model & data size)
Neuroscience for AI: Neuroscience-inspired AI Algorithms
AI for Neuroscience: Modeling Brain and Behavior
Long-Standing Goal:
Discover Universal Laws Underlying Intelligence
Larval zebrafish calcium imaging data
(Nature Methods, 2013), M. Ahrens (Janelia Farm/HHMI)
Brain: Non-equilibrium Stochastic Dynamical System
Abrevaya et al. Effective Latent Differential Equation Models via Attention and Multiple Shooting. NeurIPS workshop, 2023.
Ramezanian-Panahi et al. Generative Models of Brain Dynamics. Frontiers in Artificial Intelligence 2022.
Abrevaya et al. Learning Brain Dynamics With Coupled Low-Dimensional Nonlinear Oscillators and Deep Recurrent Networks. Neural Computation, 2021.
Human functional MRI
Organization of the Brain Network
Stam CJ. 2014 Modern network science of neurological disorders. Nat. Rev. Neurosci. 15.
Brain topology as a combination of three different types of networks:
both at developmental and evolutionary scales
http://www.skyhunter.com/marcs/GentleSeduction.html
The Gentle Seduction by Marc Stiegler
Post-Singularity scenarios different civilizations experienced:
“Some had died in a frenzy, as the builders of new technologies indulged an orgy of inventions, releasing just one that destroyed them all.
Others had died in despair, as fear-filled leaders beat down the innovators, strangling them, putting the future beyond their grasp.
The fear-ridden species settled into a long slide of despair that ended with degenerate descendants no longer able to dream.
Only those who knew caution without fear, only those marked by her elemental form of prudence, made it through. Only humanity had survived.”
Thank you!
Adding Time to ANNs: Spiking Neural Networks
1. Maass, Wolfgang (1997). "Networks of spiking neurons: The third generation of neural network models". Neural Networks. 10 (9)
Real neurons: spiking dynamics
Artificial neuron: static function
Spike-time coding:
Early visual system
Auditory system
Phase coding in hippocampus
STDP
Richer state space:
neuronal activations,
synaptic weights
+ temporal information
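To make the “richer state space” point concrete, here is a minimal leaky integrate-and-fire simulation; this is a generic textbook abstraction of spiking dynamics, not code tied to any specific project mentioned here.

import numpy as np

def lif_spikes(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: membrane potential decays, integrates input,
    and emits a spike (then resets) whenever it crosses the threshold."""
    v, spikes = 0.0, []
    for i_t in input_current:
        v += dt * (-v / tau + i_t)      # leaky integration of the input current
        if v >= v_thresh:               # threshold crossing -> spike
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

# constant drive produces a regular spike train whose rate encodes the input,
# while spike *timing* carries information a static activation function cannot
print(lif_spikes(np.full(100, 0.08)).sum())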
CL: Growing a Set of “Principal Components”:
Emerging “Compositional” Generalization?
Infinite stream of changing environments and “tasks”
f1(x) f2(x) f3(x) … fn(x) …
Assumption: future tasks are “well-approximable” in some finite functional “basis”
{ h1(x) , … , hk(x) }
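In symbols (my notation, not from the slide), the assumption is that every future task function is well approximated using a shared finite set of learned components:

\[
f_t(x) \;\approx\; g_t\big(h_1(x), \dots, h_k(x)\big) \quad \text{for all } t,
\qquad \text{e.g. } f_t(x) \approx \sum_{j=1}^{k} \alpha_{t,j}\, h_j(x),
\]

so continual learning amounts to growing and reusing the basis {h1(x), ..., hk(x)} rather than relearning each f_t from scratch.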
Brief History: CL in Neural Networks
1949 – Sensitivity-Stability (Hebb)
1987 – Stability-Plasticity (Carpenter & Grossberg)
1990 – Catastrophic forgetting (McCloskey & Cohen; Ratcliff)
1991 – Semi-distributed representations (French)
1993 – Pre-training (McRae & Hetherington)
1995 – Rehearsal (replay) & pseudo-rehearsal (Robins); Lifelong robot learning (Thrun & Mitchell)
1997 – Pseudo-recurrent (“dual”) networks a la neocortex/hippocampus (French)
1999 – Survey on “Catastrophic forgetting in connectionist networks” (French)
2013 – “An empirical investigation of catastrophic forgetting in gradient-based neural networks” (Goodfellow et al.)
Recent work on CL
Canada Excellence Research Chair in Autonomous AI
CERC-AAI