A first-person retrospective (and prospective) on developing fast weight programmers
Kazuki Irie (kirie@g.harvard.edu)
@ Google Zürich Pi Group, July 23, 2025
Disclaimer:
This is not a standard research talk!
Instead, I'll be walking through the chain of thoughts that has driven my research on "fast weight programmers" (from 2020 to the present),
which, I believe, has some interesting parallels to some of the research conducted at the Pi Group.
Fast weight programmers (FWPs)
General-purpose sequence processing models in which one net generates the weights of another net
→ A context-dependent weight matrix serves as STM; an RNN with 2D matrix-form states
Historically many proposals for "fast weights" (without the 'P'!) (timeline: 1943, 1981, 1982, 1987),
but no end-to-end learnable weight-modification model until 1991. Revival due to the Transformer connection.
I got introduced to FWPs in May 2020 by Jürgen Schmidhuber and Imanol Schlag.
Proto-FWPs (2020)
How to generate the weights of NNs using NNs?
→ The main challenge: high dimensionality of weight matrices ("usual" data: O(N) → weight matrix: O(N^2))

Directly parameterize the WM as Transformer layers: a "Transformer learning rule"
Cf. Google's HyperTransformer (ICML 2022)

Modify the WM in compressed space
Borrowing an idea from Evolution Strategies: use the DCT as a compression function
Still very expensive to get the final dense matrix :(
Irie, Schmidhuber, ICLR 2021 Workshop "Training and Generating Neural Networks in Compressed Weight Space"
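To make the compressed-weight-space idea concrete, here is a minimal sketch (not the exact setup of the workshop paper): a small block of low-frequency DCT coefficients, which would be produced or updated by the slow system, is expanded into a dense weight matrix via the inverse DCT. All names and sizes below are illustrative.

```python
# Minimal sketch: decode a dense weight matrix from a few DCT coefficients.
import numpy as np
from scipy.fft import idctn

d_out, d_in = 64, 64       # size of the dense ("fast") weight matrix
n_coeff = 8                # keep only an 8x8 block of low-frequency coefficients

coeffs = np.random.randn(n_coeff, n_coeff) * 0.1   # would be generated/evolved

# Place the coefficients in the low-frequency corner and invert the DCT.
full_spectrum = np.zeros((d_out, d_in))
full_spectrum[:n_coeff, :n_coeff] = coeffs
W_fast = idctn(full_spectrum, norm='ortho')        # dense d_out x d_in matrix
# (As noted above: recovering the full dense matrix is the expensive step
#  when the target layer is large.)

x = np.random.randn(d_in)
y = W_fast @ x             # use the decoded matrix as an ordinary linear layer
```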
Another possibility: outer products
Generate two vectors instead of one matrix!

Connection to attention:
Ba's equivalence result (NeurIPS 2016)
→ converting Schmidhuber's recurrent FWP (1993) into an attention mechanism
→ Realization: this applies directly to self-attention if it is unnormalized (later in our ICML 2021)
Linear Transformers (ICML 2020)
→ derivation for the more general "linear attention" case (involving an extra normalization variable); very nice CUDA implementation
Duality between FWP and attention
Drop-in replacement for the self-attention in the Transformer architecture:
"Transformer" form: "unnormalized attention" over query, key, value projections (Vaswani et al., N(eur)IPS 2017)
"Fast Weight Programmer" form (Schmidhuber 1991; Neural Computation 1992)
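A minimal numerical check of this duality (sizes and projection names are illustrative, and the feature map / normalization of full linear attention is omitted): the unnormalized-attention output and the outer-product fast-weight output coincide.

```python
# Numerical check of the FWP <-> unnormalized-attention duality.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 10, 4
X = rng.normal(size=(T, d_model))                     # input sequence
W_q, W_k, W_v = (rng.normal(size=(d_head, d_model)) for _ in range(3))

Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T             # query/key/value projections

# "Transformer" form: unnormalized attention over all past steps.
y_attn = np.stack([sum((Q[t] @ K[s]) * V[s] for s in range(t + 1))
                   for t in range(T)])

# "Fast weight programmer" form: purely additive outer-product updates.
W_fast = np.zeros((d_head, d_head))                   # fixed-size 2D state
y_fwp = []
for t in range(T):
    W_fast += np.outer(V[t], K[t])                    # write: Hebbian outer product
    y_fwp.append(W_fast @ Q[t])                       # read: query the fast weights
y_fwp = np.stack(y_fwp)

assert np.allclose(y_attn, y_fwp)                     # the two forms coincide
```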
Better Update Rule: Good Old Delta Rule
Instead of using the purely additive rule, also generate a dynamic 'learning rate' (Widrow & Hoff 1960; Rescorla & Wagner 1972).
"Fast delta": use the good old delta rule instead of the purely Hebbian rule. The slow net uses self-invented training patterns (input/target/learning rate) to "train" the fast net on the fly.
Natural as an associative memory manipulation, as in the tensor-product representation.
Schlag*, Irie*, Schmidhuber, ICML 2021
Side note: Key-Value Associative Memory
Memory primitives:
- Read operation = "retrieve the value associated with my query"
- Write operation = "store a key-value pair"
(diagram: read from / write to the "memory")
DeltaNet
The slow net generates self-invented training patterns (input/target/learning rate) to optimize the weights of the fast net using the delta rule.
(diagram: Slow Net → update rule / programming instruction → Fast Net; outer product, self-generated learning rate, currently stored value to be replaced)
Schlag*, Irie*, Schmidhuber, ICML 2021 "Linear Transformers Are Secretly Fast Weight Programmers"
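A minimal sketch of this delta-rule update (dimensions and the sigmoid for the learning rate are illustrative; the ICML 2021 model additionally applies a kernel map and normalization to keys and queries):

```python
# Minimal sketch of DeltaNet's delta-rule fast-weight update.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 8, 16, 4
X = rng.normal(size=(T, d_model))

# Slow-net projections: they generate the "training patterns" for the fast net.
W_q, W_k, W_v = (rng.normal(size=(d_head, d_model)) for _ in range(3))
w_beta = rng.normal(size=d_model)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_fast = np.zeros((d_head, d_head))   # fast weights = key-value associative memory
outputs = []
for t in range(T):
    q, k, v = W_q @ X[t], W_k @ X[t], W_v @ X[t]
    beta = sigmoid(w_beta @ X[t])     # self-generated learning rate in (0, 1)
    v_old = W_fast @ k                # currently stored value for this key
    W_fast += beta * np.outer(v - v_old, k)   # delta rule: replace old value by new
    outputs.append(W_fast @ q)        # read with the query
outputs = np.stack(outputs)
```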
Further "immediate" extensions
Using more powerful fast/slow net architectures (fast net: …; slow net: …)
+ use FWPs as a general-purpose sequence model for reinforcement learning
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021 "Going Beyond Linear Transformers with Recurrent Fast Weight Programmers"
Limited expressivity as non-recurrent models:
- certain regular languages
- certain algorithmic tasks (code execution with variable-state manipulation)
Irie, Csordás, Schmidhuber, EMNLP 2023

Recurrent DeltaNet (2021)
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021
Important Note
We were not concerned with efficiency (i.e., parallelizable training over the sequence dimension) at that time (2021).
Our training used the recurrent form and "backward weight recomputation" to manage memory during training.
Today: use the parallel form, in a chunk-wise fashion, also applicable to DeltaNet, thanks to: (NeurIPS 2024)
What's next?? (2021-)
What I didn't do:
"Add micro-steps within each step!" (i.e., apply more than one delta-rule update per step)
I also wanted to overcome the "rank-1 update limitation" but overlooked the concrete implication for expressivity.
→ I couldn't do it in the end → Now: DeltaProduct (2024)!

"Make the fast net deeper &, instead of using the delta rule, directly optimize the local self-invented loss by backprop"
→ I couldn't do it (was pushing other extensions).
→ Now: test-time regression & MesaNet (2025)!

So what did I do instead???
Can we build a neural net that learns to modify & improve itself? (recursively)
Why not more meta-levels? Why a self-referential loop? meta-meta-...-metalearning
Fast weight programmer = net generating the weights of another net
DeltaNet → DeltaDeltaNet → DeltaDelta…DeltaNet??
Collapse all meta-levels into a single self-referential loop…
Schlag*, Irie*, Schmidhuber, ICML 2021
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021
Irie, Schlag, Csordás, Schmidhuber, ICML 2022
Self-Referential Weight Matrix
(diagram: input → weight matrix → output; the weight matrix also modifies itself)
(recursively self-modifying programs)
Irie, Schlag, Csordás, Schmidhuber, ICML 2022 "A Modern Self-Referential Weight Matrix That Learns to Modify Itself"
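A heavily simplified sketch of the self-referential idea: a single matrix produces its own output, key, query, and learning rate, then modifies itself with the delta rule. The ICML 2022 parameterization differs in details (e.g., the nonlinearities applied to keys/queries and the exact block layout); everything below is illustrative.

```python
# Simplified sketch of a self-referential weight matrix: W produces its output
# AND the ingredients of its own delta-rule update.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out = 8, 6, 4
X = rng.normal(size=(T, d_in)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W maps an input vector to [output; key; query; learning rate].
W = rng.normal(size=(d_out + 2 * d_in + 1, d_in)) * 0.1

for t in range(T):
    out = W @ X[t]
    y = out[:d_out]                                   # model output
    k = out[d_out:d_out + d_in]                       # self-generated key
    q = out[d_out + d_in:d_out + 2 * d_in]            # self-generated query
    beta = sigmoid(out[-1])                           # self-generated learning rate
    v_new = W @ q                                     # self-generated target value
    v_old = W @ k                                     # value currently stored for k
    W = W + beta * np.outer(v_new - v_old, k)         # W modifies itself (delta rule)
```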
- Sanity check on standard few-shot image classification (Mini-ImageNet)
- Sequential multi-task adaptation (Mini-ImageNet + Omniglot)
- Multi-task reinforcement learning (ProcGen)
(diagram: as the dataset changes, a common initial weight matrix yields self-modified, task/episode-specific weight matrices)
Irie, Schlag, Csordás, Schmidhuber, ICML 2022
Does this improve expressivity?
Regular language recognition tasks requiring "state tracking"
Both recurrence & self-reference help.
(NB: the DeltaNet result here is outdated; see ICLR 2025.)
Irie, Csordás, Schmidhuber, EMNLP 2023 "Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions"
Other follow-ups
- How to incentivize an SRWM to become better than its current self? (by comparing to its future self!)
  Irie, Schmidhuber, ICLR 2023 Workshop "Accelerating Neural Self-Improvement via Bootstrapping"
- How to make an SRWM learn & self-adapt continually? (by augmenting the meta-objectives!)
  Irie, Csordás, Schmidhuber, TMLR 2025 "Metalearning Continual Learning Algorithms"
What else?? (2022)
Generalizing the duality result
Deriving the FWP-Attention Duality
Update rule: purely additive Hebbian ("synaptic modulation")
Ba et al., NIPS 2016; Katharopoulos et al., ICML 2020
A more general formulation
These systems are "equivalent" (for arbitrary v and k):
- Linear layer: fixed-size memory (independent of T)
- Key-value attention memory layer: "unlimited" memory (grows with T)
NB: this is unnormalized dot attention (cf. standard softmax attention).
Irie*, Csordás*, Schmidhuber, ICML 2022 "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention"
Application to a Linear Layer Trained by Gradient Descent
Forward computation: $y = W x$
Backward computation (gradient descent) to update $W$: an outer product,
$W_t = W_{t-1} - \eta_t \, e_t \, x_t^\top$,
for some error function (with error signal $e_t$) and learning rate $\eta_t$ at step $t$.
NB: $t$ now denotes the training iteration!
We can directly apply the duality from the previous slide to this update.
This yields another duality… (a classic ML exercise; classic result: Mark A. Aizerman, 1964)
Linear layer trained by gradient descent ↔ key/value-attention memory storing the entire training experience
(Store: the training patterns as key/value pairs; Compute: the prediction as attention over them)
"Nothing is forgotten": key/value slots corresponding to all training datapoints are never "lost"...
→ intriguing in light of catastrophic forgetting
→ connection to some findings in neuroscience ("silent engrams")
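A minimal numerical check of this duality under illustrative assumptions (squared loss, plain SGD): the trained linear layer's prediction equals the initial layer's prediction plus unnormalized attention over all stored training patterns.

```python
# Duality check: linear layer trained by GD == init + attention over training patterns.
import numpy as np

rng = np.random.default_rng(0)
steps, d_in, d_out = 50, 5, 3
W0 = rng.normal(size=(d_out, d_in))
W = W0.copy()

keys, values = [], []                       # key = training input, value = -lr * error
for t in range(steps):
    x_t = rng.normal(size=d_in)
    target = rng.normal(size=d_out)
    lr = 0.05
    e_t = W @ x_t - target                  # error signal for a squared loss
    W -= lr * np.outer(e_t, x_t)            # standard gradient-descent update
    keys.append(x_t)
    values.append(-lr * e_t)

x_test = rng.normal(size=d_in)
y_trained = W @ x_test                      # "weight" form
y_dual = W0 @ x_test + sum((x_test @ k) * v
                           for k, v in zip(keys, values))   # "attention" form
assert np.allclose(y_trained, y_dual)       # nothing about training is "forgotten"
```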
Many Consequences… (incl. misconceptions)
Irie*, Csordás*, Schmidhuber, ICML 2022
(paper snapshots: Science 2015, PNAS 2017, Neuron 2025)
What else?? (2022)
Continuous-time FWP
Inspired by Oja (1982)
Helped by the successes & practicalities of Neural ODEs
Irie, Faccio, Schmidhuber, NeurIPS 2022 "Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules"
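As a rough illustration only (not the paper's exact formulation, which uses proper ODE solvers and slow-net-generated signals), here is a forward-Euler integration of a continuous-time, delta-style learning rule for the fast weights; the `signals` function and all constants are placeholders.

```python
# Crude sketch: Euler integration of a continuous-time delta-style learning rule
# dW/dt = beta(t) * (v(t) - W(t) k(t)) k(t)^T.
import numpy as np

d = 4
dt, n_steps = 0.01, 500
W = np.zeros((d, d))

def signals(t):
    # Placeholder key/value/learning-rate trajectories (would come from a slow net).
    k = np.array([np.sin(t), np.cos(t), np.sin(2 * t), 1.0])
    v = np.array([np.cos(t), np.sin(3 * t), 1.0, np.cos(2 * t)])
    beta = 0.5
    return k, v, beta

for step in range(n_steps):
    t = step * dt
    k, v, beta = signals(t)
    dW = beta * np.outer(v - W @ k, k)    # continuous-time delta rule
    W = W + dt * dW                       # forward-Euler step
```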
Image Generation (Fast Weight Painters)
Motivation & idea: treat images as weight matrices; use an FWP as an image generator in a GAN.
Sampled Images
64x64 images: CelebA, MetFaces, AFHQ-Cat/Dog/Wild, LSUN-Church
(figure: samples from StyleGAN-2 vs. the Fast Weight Painter)
Irie, Schmidhuber, ICLR 2023 "Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules"
Generation Steps (cont'd)
Low-rank scenario: 16 drawing steps to draw 64x64 images
(figure: emergent symmetry and other regularities in the drawing steps)
Irie, Schmidhuber, ICLR 2023 "Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules"
Anecdote: Real-time recurrent learning for FWPs?
Real-Time Recurrent Learning (RTRL, 1989): untruncated gradients! Time/space complexity: O(D^4) / O(D^3)
Backpropagation Through Time (BPTT): truncated gradients! Time/space complexity: O(D^2) / O(T*D)
Yes, we can, but only for a 1-layer model and a certain limited update rule…
Irie, Gopalakrishnan, Schmidhuber, ICLR 2024 "Exploring the Promise and Limits of Real-Time Recurrent Learning"
Outlook:
Current architecture research is driven by training parallelism + efficient inference.
Interestingly, the expressivity limitations of such models are unclear "again" (after Grazzi et al.'s results on DeltaNet).
But I also like investigating other non-parallelizable but interesting forms of sequential computation (e.g., other self-modification rules)...
"Sequential-parallel duality"
Thank you for listening
Work done at the University of Lugano, Switzerland
Collaborators: Róbert Csordás, Imanol Schlag, Jürgen Schmidhuber
The main ideas/results presented in this talk are from:

Imanol Schlag*, Kazuki Irie*, Jürgen Schmidhuber
Linear Transformers Are Secretly Fast Weight Programmers
ICML 2021, https://arxiv.org/abs/2102.11174

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
NeurIPS 2021, https://arxiv.org/abs/2106.06295

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
A Modern Self-Referential Weight Matrix That Learns to Modify Itself
ICML 2022, https://arxiv.org/abs/2202.05780

Kazuki Irie*, Róbert Csordás*, Jürgen Schmidhuber
The Dual Form of Neural Networks Revisited
ICML 2022, https://arxiv.org/abs/2202.05798
Use formal languages to evaluate computational capabilities:
Meta-continual learning with self-referential neural nets:
Extensions of fast weight programmers:
Training and Generating Neural Networks in Compressed Weight Space
Kazuki Irie, Jürgen Schmidhuber
ICLR 2021 Workshop on Neural Compression, https://arxiv.org/abs/2112.15545

Accelerating Neural Self-Improvement via Bootstrapping
Kazuki Irie, Jürgen Schmidhuber
ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, https://arxiv.org/abs/2305.01547

Key-Value Memory in the Brain
Samuel J. Gershman, Ila Fiete, Kazuki Irie
Neuron 2025, https://arxiv.org/abs/2501.02950