1 of 83

Irina Rish

University of Montreal

Mila - Quebec AI Institute

Lag-Llama:

Towards Time-Series Foundation Models

Canada Excellence Research Chair

in Autonomous AI

CERC-AAI Lab: irina-lab.ai

2 of 83

3 of 83

Outline

  • Overview of CERC Program
  • “Scale is All You Need”
  • Lag-Llama

4 of 83

The Holy Grail of AI: Generalization

AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI

“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)

5 of 83

High-Level Objective of CERC in Autonomous AI

Bring AI to a fundamentally new level:

  1. from narrow to broad (general)
  2. from human-dependent to autonomous

Artificial General Intelligence (AGI):

“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)

6 of 83

It’s All About (Further) Generalization

  • OoD Generalization
  • Domain Adaptation
  • Meta-learning
  • Transfer learning
  • Continual learning

7 of 83

Recent Advances in AI @ Scale:

One Path to Solve Them All?

8 of 83

Foundation Models: Jump Towards AGI?

“Train one model on a huge amount of data and adapt it to many applications.

We call such a model a foundation model.”

CRFM: Stanford’s Center for Research on Foundation Models

“On the Opportunities and Risks of Foundation Models”

Application example: healthcare

9 of 83

GPT-3: Generative Pre-trained Transformer (3rd gen)

GPT-3: The New Mighty Language Model from OpenAI (towardsdatascience.com, May 31, 2020)

Language model learns:

P(next word | previous words)
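Concretely, the quantity being learned is the next-token conditional in the standard autoregressive factorization of sequence probability (standard notation, not specific to GPT-3):

\[
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
\]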

Task-agnostic few-shot learner: no task-specific datasets needed for “fine-tuning”

Produces more fluent and human-like text outputs than its (smaller) predecessors

OpenAI GPT-3 Now Open to Public via API [FREE]

~175B parameters

~45TB of training data

10 of 83

“Cambrian Explosion” of Large-Scale Models

  • GPT-3: natural language model (May 2020)
  • CLIP: image to text (Jan 2021)
  • DALL-E: text to image (Jan 2021)
  • Copilot/Codex: code generation (Sept 2021)
  • StableDiffusion: text to image (Aug 2022)
  • GPT-4, ChatGPT, LLaMA, etc. (2023+)

11 of 83

AI & Scaling

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

The Bitter Lesson (Rich Sutton, March 13, 2019)

12 of 83

Scaling Laws as “Investment Tools” for AI

An example:

Vision Transformers are dominated by convnets in lower-data regimes, but outperform them as more data becomes available: https://arxiv.org/pdf/2010.11929.pdf

13 of 83

Neural Scaling Laws: Kaplan et al

Jared Kaplan et al, Scaling Laws for Neural Language Models, 2020.
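The headline result of the paper: test loss falls off as a power law in model size N, dataset size D, and compute C, when the other factors are not the bottleneck:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]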

14 of 83

“Old” GPT Scaling Laws

(NOT compute-optimal!):

Data/Model = 2/5

Chinchilla Scaling Laws: Data/Model = 50/50
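As a rough worked example of what these ratios mean (numbers are my approximations from the two papers): Kaplan et al. allocate extra compute mostly to model size, roughly N ∝ C^0.73 and D ∝ C^0.27, while Chinchilla scales both about equally, N ∝ C^0.5 and D ∝ C^0.5. For a 10x larger compute budget:

\[
\text{Kaplan: } N \times 10^{0.73} \approx 5.4,\; D \times 10^{0.27} \approx 1.9
\qquad
\text{Chinchilla: } N \times 3.2,\; D \times 3.2
\]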

15 of 83

Broken Neural Scaling Laws:

A Universal Functional Form for Neural Scaling Laws?

Ethan Caballero et al, 2022

https://arxiv.org/abs/2210.14891
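For reference, the functional form proposed in the paper is (to the best of my reading of the arXiv version) a smoothly broken power law:

\[
y = a + b\,x^{-c_0} \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}
\]

where x is the scaled quantity (data, parameters, compute, ...), y is the performance metric, and each break i has its own location d_i, sharpness f_i, and slope change c_i.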

16 of 83

Training Foundation Models

“We think the most benefits will go to whoever has the biggest computer.” Greg Brockman, OpenAI’s CTO, Financial Times

Most compute is owned by AI companies (Google, OpenAI, etc), not academia & nonprofit research; this “compute gap” continues to widen.

We need to “democratize AI”!

17 of 83

Open Foundation Models on Supercomputers

5.9M V100 GPU hrs on Summit

18 of 83

Growing International Collaboration

nolano.org

Farama

19 of 83

Ongoing CERC-AAI Lab Projects

Language Models: Pretraining and Continual Learning

Aligned Multimodal Language-Vision Models

Time-series Transformers

Compression/Distillation & Fast Inference

Generalist Agents: Open Gato & Open Ada

LLM 4 Psychology & Psychology 4 LLMs

Kshitij Gupta

Daniel Kaplan

Quentin Anthony

Arjun Ashok

Benjamin Thérien

Tejas Vaidhya

Adam Ibrahim

Andrew Williams

Alexis Roger

20 of 83

LLM from Scratch: RPJ-INCITE 3B and 7B (May 2023)

21 of 83

22 of 83

23 of 83

Multimodal Alignment

???

24 of 83

https://www.irina-lab.ai/blog/lag-llama

25 of 83

Use Cases

26 of 83

Univariate Probabilistic Forecasting
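To make the setting concrete, here is a minimal sketch (not the Lag-Llama code; the array names and numbers are made up) of how a univariate probabilistic forecast is typically represented and summarized: the model emits sample paths over the forecast horizon, and we report quantiles rather than a single point prediction.

```python
import numpy as np

# Hypothetical output of a probabilistic forecaster:
# 100 sampled future trajectories over a 24-step horizon.
num_samples, horizon = 100, 24
rng = np.random.default_rng(0)
sample_paths = rng.normal(loc=10.0, scale=2.0, size=(num_samples, horizon))

# Summarize the predictive distribution per future time step.
median_forecast = np.median(sample_paths, axis=0)             # point summary
lower, upper = np.quantile(sample_paths, [0.1, 0.9], axis=0)  # 80% interval

for t in range(3):  # print the first few steps of the forecast
    print(f"t+{t+1}: median={median_forecast[t]:.2f}, "
          f"80% interval=[{lower[t]:.2f}, {upper[t]:.2f}]")
```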

27 of 83

Use Cases

28 of 83

Lag-Llama Architecture

29 of 83

Tokenization

30 of 83

Lag-Llama Tokenization
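As a rough illustration of the idea (a minimal sketch under my reading of the approach, not the actual Lag-Llama code, which also rescales values and adds covariates): each time step is turned into a token by collecting the series values at a set of past lags, so the model sees a vector of lagged values rather than a single scalar.

```python
import numpy as np

def lag_tokenize(series: np.ndarray, lags: list[int]) -> np.ndarray:
    """Build one token per time step from the values at the given lags.

    Returns an array of shape (num_tokens, len(lags)); the first max(lags)
    steps are skipped because their lagged values would fall off the series.
    """
    max_lag = max(lags)
    tokens = [
        [series[t - lag] for lag in lags]   # lagged values form the token
        for t in range(max_lag, len(series))
    ]
    return np.asarray(tokens)

# Toy example: a smooth series with lags chosen by hand (illustrative only).
series = np.sin(np.arange(200) / 10.0)
tokens = lag_tokenize(series, lags=[1, 2, 3, 24])
print(tokens.shape)  # (176, 4): one 4-dimensional token per time step
```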

31 of 83

32 of 83

33 of 83

Pre-training Data Diversity

34 of 83

35 of 83

36 of 83

37 of 83

Good Zero-Shot Forecasting on Some Datasets

38 of 83

Fine-tuning Significantly Improves Forecasting

39 of 83

40 of 83

41 of 83

Neural Scaling Laws: Preliminary Results

42 of 83

Thank you!

43 of 83

Predicting Network’s Behavior at Scale

  • “Closed-box” predictions (empirical scaling laws)
  • “Open-box” predictions (observing learning dynamics)

44 of 83

Neural Scaling Laws: Kaplan et al

Jared Kaplan et al, Scaling Laws for Neural Language Models, 2020.

45 of 83

Scale and Inductive Biases

46 of 83

Brief History of Neural Scaling Laws

1994: Cortes et al., “Learning curves: Asymptotic values and rate of convergence,” NeurIPS 1994. First to observe power-law scaling of ANNs, with x = dataset size and y = test error.

2017: Hestness et al., “Deep Learning Scaling is Predictable, Empirically,” Dec 2017. Showed that data-size-dependent power-law scaling holds over many orders of magnitude.

2019: Rosenfeld et al., “A Constructive Prediction of the Generalization Error Across Scales,” 2019. Applied power laws to model-size-dependent scaling, i.e., x = number of parameters.

2020: Kaplan et al., “Scaling Laws for Neural Language Models,” 2020. Showed that the power law also applies when x = compute, besides x = data and x = model, and brought “neural” scaling laws to the mainstream in the context of GPT-3 training.

47 of 83

Should Pre-training be Continual?

Standard pre-training:

multiple datasets available at once; mixed into one dataset

(or, sampled uniformly into each minibatch)
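A minimal sketch (dataset names and sizes are mine) of what “sampled uniformly into each minibatch” means in practice: each example in a batch is drawn from one of the available datasets with equal probability.

```python
import random

# Toy "datasets": in practice these would be token streams from different corpora.
datasets = {
    "web":   [f"web_doc_{i}" for i in range(1000)],
    "code":  [f"code_file_{i}" for i in range(1000)],
    "books": [f"book_chunk_{i}" for i in range(1000)],
}

def sample_minibatch(batch_size: int = 8) -> list[str]:
    """Draw each example from a uniformly chosen dataset (standard mixing)."""
    batch = []
    for _ in range(batch_size):
        name = random.choice(list(datasets))       # pick a dataset uniformly
        batch.append(random.choice(datasets[name]))
    return batch

print(sample_minibatch())
```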

48 of 83

“Old” GPT Scaling Laws

(NOT compute-optimal!):

Data/Model = 2/5

Chinchilla Scaling Laws: Data/Model = 50/50

49 of 83

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks

LLaMA-65B is competitive w/ best models: Chinchilla-70B, PaLM-540B

Beyond Chinchilla compute-optimal, towards “inference-optimal”

50 of 83

More Data Needed - No Saturation in Sight!

51 of 83

More Complex Scaling Behavior:

“Phase Transitions”, Emergent Phenomena

  • 3-SAT, CSPs, NP-hard problems
  • Random graphs
  • Universal Laws of Robustness
  • GPT-3 on Arithmetic task
  • Grokking


52 of 83

Predicting Network’s Behavior at Scale

  • “Open-box” predictions (observing learning dynamics)
  • “Closed-box” predictions (empirical scaling laws)

53 of 83

Transition from Memorization to Generalization

Initial stage/”confusion” (t0 to t1)

Memorization (t1-t2)

Comprehension (t3-t4)

Generalization (“grokking”) at t4+


The generalization (“grokking”) point t4 follows an empirical power law given the training data fraction r.

54 of 83

Spectral Signature of Loss Predicts Grokking

[Figure: comparison of a run with no grokking (after 10k steps) vs. a run that groks]

55 of 83

Predicting Network’s Behavior at Scale

  • “Open-box” predictions (observing learning dynamics)
  • “Closed-box” predictions (empirical scaling laws)

56 of 83

Broken Neural Scaling Laws:

A Universal Functional Form for Neural Scaling Laws?

Ethan Caballero et al, 2022

https://arxiv.org/abs/2210.14891

57 of 83

[Figure: BNSL fits across domains: sparse models, distillation, diffusion models, alignment (Elo score), reinforcement learning, coding, video]

58 of 83

BNSL accurately fits and extrapolates a very wide range of scaling behaviors

  • Settings: zero-shot, prompted, and fine-tuned; downstream and upstream
  • Tasks: Large-Scale Vision, Language, Audio, Video, Diffusion, Generative Modeling, Multimodal Learning, Contrastive Learning, AI Alignment, AI Capabilities, Robotics, Out-Of-Distribution Generalization, Continual Learning, Transfer Learning, Uncertainty Estimation / Calibration, Out-Of-Distribution Detection, Adversarial Robustness, Distillation, Sparsity, Retrieval, Quantization, Pruning, Fairness, Molecules, Computer Programming/Coding, Math Word Problems, Arithmetic, Double Descent, “Emergent” “Phase Transitions”, Supervised Learning, Unsupervised / Self-Supervised Learning, & Reinforcement Learning (Single Agent & Multi-Agent)
  • Architectures: ResNet, Transformer, MLP-Mixer, MLP, Graph Neural Network, U-Net, Ensemble, Sparsely-Gated Mixture-of-Experts, Sparse Pruned Model
  • X-axes: Compute, Dataset Size, Number of Model Parameters, Number of Training Steps, Input (e.g. Context) Size, & Upstream Performance
  • Y-axes: prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, FID score

59 of 83

BNSL accurately extrapolates the scaling behavior of:

  • Non-monotonic scaling (e.g., double descent)
  • Inflection points (e.g., four-digit addition)

60 of 83

61 of 83

“Old” GPT Scaling Laws (NOT compute-optimal!):

Data/Model = 2/5

Chinchilla Scaling Laws: Data/Model = 50/50

Chinchilla Scaling Laws

“Chinchilla’s wild implications”, AI Alignment Forum, July 30th, 2022

“Data, not size, is the currently active constraint on language modeling performance”

62 of 83

More Data Needed - And No Saturation in Sight!

63 of 83

Why Continual Pretraining?

  • We want to keep accumulating knowledge from new datasets, and to maintain, extend, and improve foundation models over time

  • The Chinchilla and LLaMA examples suggest increasing the data-to-model-size ratio even further (until saturation?)

  • However, the standard pretraining approach (training from scratch whenever a new dataset becomes available, by merging all datasets into one) wastes the compute and human effort invested in creating the previous models

64 of 83

Open Questions about Continual Pretraining

A model trained on datasets D(1),...,D(t) will continue to be trained on D(t+1).

Open questions:

  • Would this lead to a model of similar quality to one trained jointly on D = {D(1),...,D(t+1)}?
  • How should we train the model continually so that its final performance is maximized?
  • More specifically: what learning rate (LR) schedule should we use? Would replay be useful at scale? Would any CL algorithms be useful? How would they scale (we should keep the bitter lesson in mind!)?

65 of 83

66 of 83

Problem Setup

No Distribution Shift: SlimPajama (300B subset) split into 3 100B parts

Weak Distribution Shift: Pile (English, 300B) -> SlimPajama subset (English, 300B)

Strong Distribution Shift: Pile (English, 300B) -> German, 200B

[Diagram: pretraining data (Pile) followed by new data; continual pretraining on the new data is compared against joint training on Pile + new data (“Joint: upper bound?”) and against a lower-bound baseline]

67 of 83

Results: Continual vs Joint Pre-Training

  • Compute savings: continual pre-training on new data vs. joint re-training
  • Test loss: approximately the same (averaged across both datasets, over the last 100 iterations sampled every 10 iterations)
  • Same (or better!) average evals performance, averaged across benchmarks: MMLU, MathQA, Reading Comprehension, World Knowledge, Commonsense Reasoning

68 of 83

Linear Warmup and Cosine Decay Schedule
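A minimal sketch of this schedule (the parameter names and values are my own placeholders, not the actual training configuration): the LR ramps up linearly for a fixed number of warmup steps, then decays to a minimum along a cosine curve over the remaining step budget.

```python
import math

def warmup_cosine_lr(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
                     warmup_steps: int = 2000, total_steps: int = 100_000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                          # clamp after budget
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))    # 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine

print(warmup_cosine_lr(0), warmup_cosine_lr(2000), warmup_cosine_lr(100_000))
```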

69 of 83

Infinite LR schedules: A Better Approach?

SlimPajama split into 3 equal parts and trained continually

Infinite LR schedules can:

  • Add more tokens / converge earlier without pre-determining the token budget
  • Help prevent forgetting by avoiding re-warming the LR (see the sketch below)
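A minimal sketch of one possible “infinite” schedule, based on my reading of the idea (constants are placeholders, and the actual schedules likely include further phases): warm up once, cool down to a constant plateau, and then hold that LR indefinitely across tasks, so no re-warming is needed when new data arrives.

```python
def infinite_lr(step: int, max_lr: float = 3e-4, const_lr: float = 1e-4,
                warmup_steps: int = 2000, cooldown_steps: int = 8000) -> float:
    """Warmup -> cooldown to a plateau -> constant LR forever after."""
    if step < warmup_steps:                                   # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + cooldown_steps:                  # linear cooldown
        frac = (step - warmup_steps) / cooldown_steps
        return max_lr + frac * (const_lr - max_lr)
    return const_lr                                           # constant phase

# The schedule never depends on a total token budget:
print(infinite_lr(1_000), infinite_lr(10_000), infinite_lr(10_000_000))
```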

70 of 83

Key Insights

LR schedule

  • Re-warming is necessary for adaptation during continual pre-training
    • Warm-up duration does not have a significant impact
  • Infinite LR schedules which keep LR constant across tasks are a promising way to circumvent optimization difficulties due to re-warming

Replay

  • A small amount of replay (5%) can help significantly mitigate forgetting (see the sketch after this slide)

Continual pre-training can greatly reduce compute cost and human effort, without degrading the model’s performance.
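A minimal sketch of what 5% replay could look like when building batches (dataset names and the exact mixing mechanism are my assumptions): most examples come from the new dataset, but a small fraction of each batch is drawn from the previously seen data.

```python
import random

old_data = [f"pile_doc_{i}" for i in range(10_000)]       # previously seen corpus
new_data = [f"german_doc_{i}" for i in range(10_000)]     # new corpus being learned

def batch_with_replay(batch_size: int = 32, replay_frac: float = 0.05) -> list[str]:
    """Mix a small replay fraction of old data into each new-data batch."""
    n_replay = max(1, round(replay_frac * batch_size))     # e.g. ~5% of the batch
    batch = random.sample(new_data, batch_size - n_replay)
    batch += random.sample(old_data, n_replay)
    random.shuffle(batch)
    return batch

batch = batch_with_replay()
print(sum(x.startswith("pile") for x in batch), "replay examples out of", len(batch))
```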

71 of 83

Relative Scaling Laws?

Control: model (size N, architecture, ...) and data (size D, diversity, ...)

No control: downstream task(s) complexity (T)

Scaling laws: not just L(N, D), but rather L(N, ..., D, ... | T)?

72 of 83

Capacity vs Complexity Trade-off

Downstream task complexity (e.g., 10 rotated MNIST tasks vs. 1000 different videogames)

Model capacity / capabilities (grow with model & data size)

73 of 83


Neuroscience for AI: Neuroscience-inspired AI Algorithms

AI for Neuroscience: Modeling Brain and Behavior

Long-Standing Goal:

Discover Universal Laws Underlying Intelligence

74 of 83

Brain: Non-equilibrium Stochastic Dynamical System

Larval zebrafish calcium imaging data (Nature Methods, 2013), M. Ahrens (Janelia Farm/HHMI)

Human functional MRI

75 of 83

Organization of the Brain Network

Stam CJ. 2014 Modern network science of neurological disorders. Nat. Rev. Neurosci. 15.

Brain topology as a combination of three different types of networks:

  • a locally connected network
  • a random network
  • a scale-free network

76 of 83

  • Evolution scaled biological networks from single cells to brains
  • What are the “scaling algorithms” and “scaling laws”?
  • Emergence and transitions in biological networks, at both developmental and evolutionary scales
  • What can AI researchers, working on scaling AI while keeping it safe and steerable, learn from nature about the emergence of novel behaviors?

77 of 83

http://www.skyhunter.com/marcs/GentleSeduction.html

The Gentle Seduction by Marc Stiegler

Post-Singularity scenarios experienced by different civilizations:

“Some had died in a frenzy, as the builders of new technologies indulged an orgy of inventions, releasing just one that destroyed them all.

Others had died in despair, as fear-filled leaders beat down the innovators, strangling them, putting the future beyond their grasp.

The fear-ridden species settled into a long slide of despair that ended with degenerate descendants no longer able to dream.

Only those who knew caution without fear, only those marked by her elemental form of prudence, made it through. Only humanity had survived.”

78 of 83

Thank you!

79 of 83

80 of 83

Adding Time to ANNs: Spiking Neural Networks

1. Maass, Wolfgang (1997). "Networks of spiking neurons: The third generation of neural network models".  Neural Networks. 10 (9)

  • Often called the 3rd generation ANNs, after multi-layer perceptrons and nonlinear deep nets

  • In SNNs, neurons fire only when the membrane potential reaches a threshold (see the sketch at the end of this slide)

Real neurons: spiking dynamics

Artificial neuron: static function

  • Is information propagated via rate (frequency) coding and/or temporal coding?

Spike-time coding: early visual system, auditory system, phase coding in the hippocampus, STDP

Richer state space: neuronal activations, synaptic weights + temporal information
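To make the threshold-and-fire dynamics concrete, here is a minimal leaky integrate-and-fire (LIF) sketch (a textbook toy model with parameters I chose for illustration, not any particular SNN implementation): the membrane potential leaks toward rest, integrates input current, and emits a spike and resets whenever it crosses the threshold.

```python
import numpy as np

def simulate_lif(input_current: np.ndarray, dt: float = 1.0, tau: float = 20.0,
                 v_rest: float = 0.0, v_thresh: float = 1.0, v_reset: float = 0.0):
    """Leaky integrate-and-fire neuron: integrate input, spike at threshold, reset."""
    v = v_rest
    voltages, spikes = [], []
    for i_t in input_current:
        # Leaky integration of the membrane potential.
        v += dt / tau * (-(v - v_rest) + i_t)
        if v >= v_thresh:          # threshold crossed: emit a spike and reset
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
        voltages.append(v)
    return np.array(voltages), np.array(spikes)

# Constant input current strong enough to make the neuron spike periodically.
volts, spks = simulate_lif(np.full(200, 1.5))
print("number of spikes:", spks.sum())
```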

81 of 83

CL: Growing a Set of “Principal Components”:

Emerging “Compositional” Generalization?

Infinite stream of changing environments and “tasks”

f1(x), f2(x), f3(x), …, fn(x), …

Assumption: future tasks are “well-approximable” in some finite functional “basis”

{ h1(x), …, hk(x) }
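One way to write this assumption down (my notation, just to make the statement concrete; for instance, as a linear combination):

\[
f_i(x) \approx \sum_{j=1}^{k} \alpha_{ij}\, h_j(x) \quad \text{for some coefficients } \alpha_{ij},
\]

so continual learning amounts to growing and refining the basis {h_1(x), ..., h_k(x)} while fitting the coefficients for each new task.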

82 of 83

Brief History: CL in Neural Networks

1949: Sensitivity-stability (Hebb)

1987: Stability-plasticity dilemma (Carpenter & Grossberg)

1990: Catastrophic forgetting (McCloskey & Cohen; Ratcliff)

1991: Semi-distributed representations (French)

1993: Rehearsal (replay) & pseudo-rehearsal (Robins); pre-training (McRae & Hetherington)

1995: Lifelong robot learning (Thrun & Mitchell)

1997: Pseudo-recurrent (“dual”) networks a la neocortex/hippocampus (French)

1999: Survey “Catastrophic forgetting in connectionist networks” (French)

2013: “An empirical investigation of catastrophic forgetting in gradient-based neural networks” (Goodfellow et al.)

Recent work on CL

83 of 83

Canada Excellence Research Chair in Autonomous AI

CERC-AAI