1 of 76

@ Google Zürich Pi Group, July 23, 2025

Kazuki Irie
kirie@g.harvard.edu

A first-person retrospective (and prospective) on developing fast weight programmers

2 of 76

Disclaimer:

This is not a standard research talk!

Instead, I’ll walk through the chain of thoughts that has driven my research on “fast weight programmers” (from 2020 to the present), which, I believe, has some interesting parallels to research conducted at the Pi Group.

2

3 of 76

Fast weight programmers (FWPs)

General-purpose sequence processing models in which one net generates the weights of another net

→ The context-dependent weight matrix serves as short-term memory (STM);

An RNN with 2D matrix-form states

3

4 of 76

Fast weight programmers (FWPs)

General-purpose sequence processing models in which one net generates the weights of another net

→ Context-dependent weight matrix serves as STM;

An RNN with 2D matrix-form states

1943

4

5 of 76

Fast weight programmers (FWPs)

General-purpose sequence processing models in which one net generates the weights of another net

→ Context-dependent weight matrix serves as STM;

An RNN with 2D matrix-form states

Historically, many proposals for “fast weights” (without the ‘P’!)

(Timeline: 1943, 1981, 1982, 1987)

5

6 of 76

Fast weight programmers (FWPs)

General-purpose sequence processing models in which one net generates the weights of another net

→ Context-dependent weight matrix serves as STM;

An RNN with 2D matrix-form states

Historically, many proposals for “fast weights” (without the ‘P’!)

but no end-to-end learnable weight modification model until 1991.

1943

6

7 of 76

Fast weight programmers (FWPs)

General-purpose sequence processing models in which one net generates the weights of another net

→ Context-dependent weight matrix serves as STM;

An RNN with 2D matrix-form states

Historically, many proposals for “fast weights” (without the ‘P’!)

but no end-to-end learnable weight modification model until 1991. Revival due to the Transformer connection.

I was introduced to FWPs in May 2020 by:

1943

7

Jürgen Schmidhuber

Imanol Schlag

8 of 76

How to generate weights of NNs using NNs

→ The main challenge: high dimensionality of weight matrices

Proto-FWPs (2020)

“Usual” data: O(N) → weight matrix (WM): O(N^2)

8

9 of 76

How to generate weights of NNs using NNs

→ The main challenge: high dimensionality of weight matrices

Proto-FWPs (2020)

“Usual” Data: O(N) → WM: O(N^2)

Directly parameterize the WM with Transformer layers: a “Transformer learning rule”

  • Tested on a visual question answering benchmark
  • Severe overfitting, no good results → Unpublished :(

Cf. Google’s HyperTransformer

(ICML 2022)

9

10 of 76

How to generate weights of NNs using NNs

→ The main challenge: high dimensionality of weight matrices

Proto-FWPs (2020)

“Usual” Data: O(N) → WM: O(N^2)

Directly parameterize the WM with Transformer layers: a “Transformer learning rule”

  • Tested on a visual question answering benchmark
  • Severe overfitting, no good results → Unpublished :(

Cf. Google’s HyperTransformer

(ICML 2022)

Modify WM in Compressed Space

Borrowing an idea from Evolution Strategies:

Use DCT as a compression function

Still very expensive to get the final dense matrix :(

10

Irie, Schmidhuber, ICLR 2021 Workshop, “Training and Generating Neural Networks in Compressed Weight Space”
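For intuition, here is a minimal sketch of the compressed-weight-space idea (assuming SciPy’s inverse DCT; the dimensions and the paper’s exact parameterization are not reproduced here): a small grid of generated DCT coefficients is expanded into a full dense weight matrix, which is cheap to parameterize but still costly to densify.

import numpy as np
from scipy.fft import idctn  # assumption: SciPy is available

def weights_from_dct_coeffs(coeffs, out_shape):
    # Place the small coefficient grid in the low-frequency corner and inverse-DCT it.
    full = np.zeros(out_shape)
    full[:coeffs.shape[0], :coeffs.shape[1]] = coeffs
    return idctn(full, norm="ortho")  # dense weight matrix of shape out_shape

rng = np.random.default_rng(0)
coeffs = rng.standard_normal((8, 8))               # e.g. only 64 generated numbers...
W = weights_from_dct_coeffs(coeffs, (256, 256))    # ...expanded into a 256x256 matrix
print(W.shape)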

11 of 76

Another possibility: Outer-product

Generate two vectors instead of one matrix!
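A minimal sketch of the outer-product idea (toy dimensions, no feature map): the slow net emits two O(N) vectors, a key k and a value v, instead of an O(N^2) matrix; writing is a rank-1 outer-product update and reading is a matrix-vector product with a query q.

import numpy as np

d = 4
rng = np.random.default_rng(1)
k, v, q = rng.standard_normal((3, d))

W = np.zeros((d, d))                # fast weight matrix = the short-term memory
W += np.outer(v, k)                 # write: rank-1 (Hebbian-style) update
y = W @ q                           # read with a query q
print(np.allclose(y, v * (k @ q)))  # True: the readout is v scaled by the score k·q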

11

12 of 76

Another possibility: Outer-product

Generate two vectors instead of one matrix!

Connection to Attention

12

13 of 76

Ba’s equivalence result (NeurIPS 2016)

Converting Schmidhuber’s recurrent FWP (1993) to an attention mechanism

Another possibility: Outer-product

Generate two vectors instead of one matrix!

Connection to Attention

13

14 of 76

Ba’s equivalence result (NeurIPS 2016)

Converting Schmidhuber’s recurrent FWP (1993) to an attention mechanism

Realization: this is directly applicable to self-attention, provided the attention is unnormalized

Another possibility: Outer-product

Generate two vectors instead of one matrix!

Connection to Attention

(later in our ICML 2021)

14

15 of 76

Ba’s equivalence result (NeurIPS 2016)

Converting Schmidhuber’s recurrent FWP (1993) to an attention mechanism

Realization: this is directly applicable to self-attention, provided the attention is unnormalized

Another possibility: Outer-product

Generate two vectors instead of one matrix!

Linear transformers (ICML 2020)

→ Derivation for the more general “linear attention” case (involving an extra normalization variable). Very nice CUDA implementation.

Connection to Attention

(later in our ICML 2021)

15

16 of 76

Duality between FWP and attention

Drop-in replacement for the self-attention in the transformer architecture

“Transformer” Form

“Fast Weight Programmer” Form

Schmidhuber 1991, Neural Computation 1992

Vaswani et al. N(eur)IPS 2017

“unnormalized attention”

Query, key, value projections
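A toy numerical check of this duality (a sketch with random projections and no feature map or normalization): causal unnormalized attention over all past key/value pairs produces exactly the same outputs as the additive fast-weight recurrence with a 2D matrix-valued state.

import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
K, V, Q = rng.standard_normal((3, T, d))   # key, value, query projections of the inputs

# "Transformer" form: y_t = sum_{i <= t} v_i (k_i · q_t)
scores = np.tril(Q @ K.T)                  # causal, unnormalized attention matrix
y_attn = scores @ V

# "Fast weight programmer" form: one recurrent pass over the sequence
W = np.zeros((d, d))
y_fwp = np.empty((T, d))
for t in range(T):
    W += np.outer(V[t], K[t])              # purely additive (Hebbian) update
    y_fwp[t] = W @ Q[t]                    # query the fast weight matrix

print(np.allclose(y_attn, y_fwp))          # True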

16

17 of 76

17

Better Update Rule: Good Old Delta Rule

Instead of using the purely additive rule…

Also generate dynamic ‘learning rate’

Widrow & Hoff 1960, Rescorla & Wagner 1972

Schlag*, Irie*, Schmidhuber, ICML 2021
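For reference, the two update rules side by side, roughly in the ICML 2021 notation (with φ a feature map applied to the keys; details such as the sigmoid on the learning rate are omitted here):

Purely additive (Hebbian):  W_t = W_{t-1} + v_t φ(k_t)^T
Delta rule:                 W_t = W_{t-1} + β_t (v_t - W_{t-1} φ(k_t)) φ(k_t)^T

where β_t ∈ (0, 1) is the self-generated, input-dependent learning rate.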

18 of 76

Natural as an associative memory manipulation,

as in the tensor-product representation

18

Better Update Rule: Good Old Delta Rule

Instead of using the purely additive rule…

Also generate dynamic ‘learning rate’

Widrow & Hoff 1960, Rescorla & Wagner 1972

Schlag*, Irie*, Schmidhuber, ICML 2021

19 of 76

“Fast delta”: using the good old delta rule instead of the purely Hebbian rule. The slow net uses self-invented training patterns (input/target/learning rate) to “train” the fast net on the fly

Natural as an associative memory manipulation,

as in the tensor-product representation

19

Better Update Rule: Good Old Delta Rule

Instead of using the purely additive rule…

Also generate dynamic ‘learning rate’

Widrow & Hoff 1960, Rescorla & Wagner 1972

Schlag*, Irie*, Schmidhuber, ICML 2021

20 of 76

Side note: Key-Value Associative Memory

Memory Primitives

Read operation = “retrieve the value associated with my query”

Write operation = “store a key-value pair”

20


22 of 76

(Diagram: the slow net generates the programming instruction / update rule for the fast net: an outer product with a self-generated learning rate, replacing the currently stored value.)

The slow net generates self-invented training patterns (input/target/learning rate) to optimize the weights of the fast net using delta rule.

DeltaNet

Schlag*, Irie*, Schmidhuber, ICML 2021, “Linear Transformers Are Secretly Fast Weight Programmers”
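A minimal single-head sketch of a DeltaNet-style step (the dimensions, initialization, ELU+1 feature map, and key normalization below are illustrative simplifications, not the exact configuration from the paper):

import numpy as np

def phi(x):                                  # a simple positive feature map (ELU + 1)
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
Wq, Wk, Wv = rng.standard_normal((3, d_head, d_model)) * 0.1   # slow (trained) projections
Wb = rng.standard_normal((1, d_model)) * 0.1                   # learning-rate projection

def deltanet_step(W_fast, x):
    q, k = phi(Wq @ x), phi(Wk @ x)
    k = k / k.sum()                                   # simple key normalization
    v = Wv @ x
    beta = 1.0 / (1.0 + np.exp(-(Wb @ x)[0]))         # self-generated learning rate in (0, 1)
    v_bar = W_fast @ k                                # currently stored value for this key
    W_fast = W_fast + beta * np.outer(v - v_bar, k)   # delta-rule write
    return W_fast, W_fast @ q                         # read with the query

W_fast = np.zeros((d_head, d_head))
for x in rng.standard_normal((5, d_model)):           # a toy sequence of five inputs
    W_fast, y = deltanet_step(W_fast, x)
print(y.shape)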

23 of 76

Further “immediate” extensions

Using more powerful fast/slow net architectures.

Fast net:

  • Deeper feedforward NN → Delta-MLP
  • Recurrent NN → Delta-RNN
  • LSTM → Delta-LSTM
  • DeltaNet → Delta-DeltaNet

Slow net:

  • Recurrent NN (fast net output fed back to the slow net input) → Recurrent DeltaNet

+ use FWPs as a general-purpose sequence model for reinforcement learning

23


Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021, “Going Beyond Linear Transformers with Recurrent Fast Weight Programmers”

24 of 76

Limited Expressivity as non-recurrent models

  • Certain regular languages
  • Certain algorithmic tasks: code execution with variable-state manipulation

24

Irie, Csordás, Schmidhuber EMNLP 2023

25 of 76

Recurrent DeltaNet (2021)

25

Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021

26 of 76

Important Note

We were not concerned with efficiency (i.e., parallelizable training over the sequence dimension) at that time (2021).

Our training used the recurrent form and “backward weight recomputation” to manage memory during training.

Today: use the parallel form, in a chunk-wise fashion, also applicable to DeltaNet, thanks to:

NeurIPS 2024
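For intuition, here is what the chunk-wise idea looks like in the simple purely additive (linear-attention) case; note that this sketch does not implement the DeltaNet case, which is exactly what the cited NeurIPS 2024 work makes possible (via a reparameterization of the delta-rule recurrence):

import numpy as np

rng = np.random.default_rng(0)
T, C, d = 12, 4, 3                                   # sequence length, chunk size, head dim
K, V, Q = rng.standard_normal((3, T, d))

# Reference: fully recurrent (sequential) computation.
W = np.zeros((d, d)); y_ref = np.empty((T, d))
for t in range(T):
    W += np.outer(V[t], K[t]); y_ref[t] = W @ Q[t]

# Chunk-wise: past chunks enter via the carried state S, the current chunk via attention.
S = np.zeros((d, d)); y_chunk = np.empty((T, d))
for s in range(0, T, C):
    Kc, Vc, Qc = K[s:s+C], V[s:s+C], Q[s:s+C]
    intra = np.tril(Qc @ Kc.T) @ Vc                  # causal contributions from inside the chunk
    y_chunk[s:s+C] = Qc @ S.T + intra                # plus contributions from all past chunks
    S += Vc.T @ Kc                                   # update the carried state once per chunk

print(np.allclose(y_ref, y_chunk))                   # True; only T/C sequential steps remain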

26


27 of 76

What’s next?? (2021-)

27

28 of 76

What’s next?? (2021-)

What I didn’t do

28

29 of 76

“Add micro-steps within each step!” (i.e., apply more than one delta rule update per step)

I also wanted to overcome the “rank-1 update limitation” but overlooked the concrete implications for expressivity
→ I couldn’t do it in the end
→ Now: DeltaProduct (2024)!
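My (rough) understanding of the micro-step idea, as a sketch: generate several (key, value, learning rate) triples per token and apply that many delta-rule updates, so the per-token state transition becomes a product of generalized-Householder-like factors rather than a single rank-1 update. The vectors and numbers below are placeholders, not the real model’s learned projections.

import numpy as np

def delta_update(W, k, v, beta):
    return W + beta * np.outer(v - W @ k, k)

rng = np.random.default_rng(0)
d, n_h = 4, 3                                        # head dim, micro-steps per token
W = np.zeros((d, d))

# For one token, the slow net would emit n_h (key, value, learning-rate) triples.
Ks, Vs = rng.standard_normal((2, n_h, d))
Ks /= np.linalg.norm(Ks, axis=1, keepdims=True)      # unit-norm keys
betas = np.full(n_h, 0.5)

for j in range(n_h):                                 # n_h rank-1 delta updates within one step
    W = delta_update(W, Ks[j], Vs[j], betas[j])
print(W.shape)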

29

30 of 76

“Add micro-steps within each step!” (i.e., apply more than one delta rule update per step)

I also wanted to overcome the “rank-1 update limitation” but overlooked the concrete implications for expressivity
→ I couldn’t do it in the end
→ Now: DeltaProduct (2024)!

“Make the fast net deeper &, instead of using the delta rule, directly optimize the local self-invented loss by backprop”
→ I couldn’t do it (was pushing other extensions).
→ Now: Test-time regression & MesaNet (2025)!
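To make “directly optimize the local self-invented loss” concrete, here is the simplest closed-form (ridge-regression) variant as a sketch; this only conveys the flavor of test-time regression / mesa-style layers, not the exact MesaNet computation:

import numpy as np

rng = np.random.default_rng(0)
T, d, lam = 8, 4, 1e-2
K, V = rng.standard_normal((2, T, d))                # stored key/value pairs
q = rng.standard_normal(d)

# Sufficient statistics (each can be maintained recurrently with one rank-1 update per step).
KV = V.T @ K                                         # sum_t v_t k_t^T
KK = K.T @ K + lam * np.eye(d)                       # sum_t k_t k_t^T + ridge term

W_fast = KV @ np.linalg.inv(KK)      # minimizer of sum_t ||W k_t - v_t||^2 + lam ||W||^2
y = W_fast @ q                       # read out with a query
print(y.shape)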

30

31 of 76

“Add micro-steps within each step!” (i.e., apply more than one delta rule update per step)

I also wanted to overcome the “rank-1 update limitation” but overlooked the concrete implications for expressivity
→ I couldn’t do it in the end
→ Now: DeltaProduct (2024)!

“Make the fast net deeper &, instead of using the delta rule, directly optimize the local self-invented loss by backprop”
→ I couldn’t do it (was pushing other extensions).
→ Now: Test-time regression & MesaNet (2025)!

So what did I do instead???

31

32 of 76

Can we build a neural net that learns to modify & improve itself? (Recursively)

  • Following the idea of (another) Schmidhuber 1992 paper, ‘Steps towards “self-referential” learning’: the “original” Self-Referential Weight Matrix
  • A natural follow-up for FWPs: a slow weight matrix embeds the learning algorithm for the fast net; why not also continually modify/improve the slow net weights over time?

32

33 of 76

Fast weight programmer = Net generating weights of another net

DeltaNet

Why not more meta-levels?

Why self-referential loop? meta-meta-...metalearning

33

Schlag*, Irie*, Schmidhuber, ICML 2021

Irie, Schlag, Csordás, Schmidhuber, ICML 2022

34 of 76

Fast weight programmer = Net generating weights of another net

DeltaNet

DeltaDeltaNet

Why not more meta-levels?

Why self-referential loop? meta-meta-...metalearning

34

Schlag*, Irie*, Schmidhuber, ICML 2021

Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021

Irie, Schlag, Csordás, Schmidhuber, ICML 2022

35 of 76

Fast weight programmer = Net generating weights of another net

DeltaNet

DeltaDeltaNet

DeltaDeltaDeltaNet??

Why not more meta-levels?

Why self-referential loop? meta-meta-...metalearning

35

Schlag*, Irie*, Schmidhuber, ICML 2021

Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021

Irie, Schlag, Csordás, Schmidhuber, ICML 2022

36 of 76

Why self-referential loop? meta-meta-...metalearning

DeltaNet

DeltaDeltaNet

DeltaDeltaDeltaNet??

Collapse all meta-levels into a single self-referential loop…

36

Schlag*, Irie*, Schmidhuber, ICML 2021

Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021

Irie, Schlag, Csordás, Schmidhuber, ICML 2022

37 of 76

Self-Referential Weight Matrix

(Diagram: input → weight matrix → output, where the same weight matrix also generates its own updates)

  • Self-modifications based on key/value associations
  • Learns to train itself (at runtime) using self-generated training patterns/learning rates

(recursively self-modifying programs)

Irie, Schlag, Csordás, Schmidhuber, ICML 2022, “A Modern Self-Referential Weight Matrix That Learns to Modify Itself”
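Roughly how such a self-referential step can look, as a sketch (the dimensions, the softmax feature map, and the sigmoid on the learning rate only approximate the ICML 2022 parameterization): a single matrix W produces, from its input, an output plus its own query, key, and learning rate, and then applies a delta-rule update to itself. There is no separate slow net.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_out = 6, 5
# Rows of W produce: the output (d_out), a query (d_in), a key (d_in), a learning rate (1).
W = rng.standard_normal((d_out + 2 * d_in + 1, d_in)) * 0.1

def srwm_step(W, x):
    out = W @ softmax(x)                                 # everything is generated by W itself
    y, q, k, b = np.split(out, [d_out, d_out + d_in, d_out + 2 * d_in])
    beta = 1.0 / (1.0 + np.exp(-b[0]))
    v_new, v_old = W @ softmax(q), W @ softmax(k)        # self-generated "target" and current value
    W = W + beta * np.outer(v_new - v_old, softmax(k))   # delta-rule self-modification
    return W, y

for x in rng.standard_normal((5, d_in)):
    W, y = srwm_step(W, x)
print(y.shape)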

38 of 76

Sanity check on the standard few-shot image classification (Mini-ImageNet)

Sequential Multi-task Adaptation (Mini-ImageNet + Omniglot)

Multi-task Reinforcement Learning (ProcGen)

(Figure: as the dataset changes, a common initial weight matrix self-modifies into task/episode-specific weight matrices.)

38

Irie, Schlag, Csordás, Schmidhuber ICML 2022

39 of 76

Does this improve Expressivity?

Regular language recognition tasks requiring “state tracking”

Both Recurrence & Self-Reference help

39

Irie, Csordás, Schmidhuber, EMNLP 2023, “Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions”

40 of 76

Does this improve Expressivity?

Regular language recognition tasks requiring “state tracking”

Both Recurrence & Self-Reference help

The DeltaNet result is OUTDATED!

ICLR 2025

40

Irie, Csordás, Schmidhuber, EMNLP 2023, “Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions”

41 of 76

Other follow-ups

- How to incentivize an SRWM to become better than its current self? (By comparing to its future self!)

41

Irie, Schmidhuber, ICLR 2023 Workshop, “Accelerating Neural Self-Improvement via Bootstrapping”

42 of 76

Other follow-ups

- How to incentivize an SRWM to become better than its current self? (By comparing to its future self!)

- How to make an SRWM learn & self-adapt continually? (By augmenting the meta-objectives!)

42

Irie, Schmidhuber, ICLR 2023 Workshop, “Accelerating Neural Self-Improvement via Bootstrapping”

Irie, Csordás, Schmidhuber, TMLR 2025, “Metalearning Continual Learning Algorithms”

43 of 76

What else?? (2022)

43

44 of 76

What else?? (2022)

Generalizing the duality result

44

45 of 76

Deriving the FWP-Attention Duality

Ba et al., NIPS 2016; Katharopoulos et al., ICML 2020

51 of 76

Deriving the FWP-Attention Duality

Update rule: purely additive Hebbian

“Synaptic modulation”

Ba et al., NIPS 2016; Katharopoulos et al., ICML 2020

52 of 76

A more general formulation

These systems are “equivalent”

Arbitrary v and k

Linear layer

Irie*, Csordás*, Schmidhuber ICML 2022

53 of 76

A more general formulation

These systems are “equivalent”

NB: Unnormalised dot attention
Standard attention:

“Unlimited memory” (grows with T)

Fixed-size memory! (independent of T)

Arbitrary v and k

Linear layer

Key-value memory attention layer

Irie*, Csordás*, Schmidhuber, ICML 2022, “The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”

54 of 76

Application to a Linear Layer Trained by Gradient Descent:

Forward computation:

54

55 of 76

Application to a Linear Layer Trained by Gradient Descent:

Forward computation:

Backward computation (gradient descent) to update W: an outer product

W_t = W_{t-1} + η_t e_t x_t^T

for some error signal e_t (from the loss gradient) and learning rate η_t at step t

t now denotes the training iteration!

55

57 of 76

Application to a Linear Layer Trained by Gradient Descent:

Forward computation:

Backward computation (gradient descent) to update W: an outer product

W_t = W_{t-1} + η_t e_t x_t^T

for some error signal e_t (from the loss gradient) and learning rate η_t at step t

We can directly apply the duality from the previous slide to:

t now denotes the training iteration!
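A toy numerical check of this application (the squared-error signal, shapes, and constant learning rate are illustrative assumptions): the layer obtained after T gradient-descent steps makes exactly the same prediction as the initial layer plus unnormalized dot attention over the stored training inputs and error signals.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, T = 5, 3, 20
W0 = rng.standard_normal((d_out, d_in))
xs = rng.standard_normal((T, d_in))                  # training inputs
targets = rng.standard_normal((T, d_out))            # training targets
lrs = np.full(T, 0.1)                                # per-step learning rates

# "Primal" form: run gradient descent on a squared error, keep only the final W_T.
W = W0.copy()
errs = []                                            # e_t = -(gradient of the loss w.r.t. y_t)
for x, tgt, lr in zip(xs, targets, lrs):
    e = tgt - W @ x                                  # for the loss 0.5 * ||W x - tgt||^2
    errs.append(e)
    W = W + lr * np.outer(e, x)                      # rank-1 (outer-product) update

# "Dual" form: W_0 plus attention over the stored training patterns (x_t, lr_t * e_t).
x_test = rng.standard_normal(d_in)
scores = xs @ x_test                                 # unnormalized dot attention k_t · q
y_dual = W0 @ x_test + (np.array(errs) * lrs[:, None]).T @ scores
print(np.allclose(W @ x_test, y_dual))               # True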

57

58 of 76

This yields another duality…

58

Linear layer trained by gradient descent

Store:

Compute:

59 of 76

This yields another duality…

59

Linear layer trained by gradient descent

Store:

Compute:

Key/value-attention memory �storing entire training experience

60 of 76

This yields another duality…

Classic ML exercise!

  • Matrix version of the Perceptron/Kernel Machine duality!
  • All the above duality results are simple generalizations/applications

Classic result (1964)

60

Linear layer trained by gradient descent

Store:

Compute:

Key/value-attention memory �storing entire training experience

Mark A. Aizerman

61 of 76

“Nothing is forgotten”: key/value slots corresponding to all training datapoints are never “lost”...

→ intriguing in light of catastrophic forgetting

→ connection to some findings on neuroscience (“silent engrams”)

Many Consequences… (incl. misconceptions)

61

Irie*, Csordás*, Schmidhuber ICML 2022

Science 2015

PNAS 2017

Neuron 2025

62 of 76

What else?? (2022)

62

63 of 76

Continuous-time FWP

63

Irie, Faccio, Schmidhuber, NeurIPS 2022, “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”

65 of 76

Continuous-time FWP

Inspired by Oja (1982)

Helped by the successes & practicality of Neural ODEs

65

Irie, Faccio, Schmidhuber, NeurIPS 2022, “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”
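A toy sketch of the continuous-time view (just a forward-Euler integration of a delta-rule-like weight ODE; the learning rules studied in the paper, e.g. the Oja-inspired ones, and the use of proper ODE solvers are not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
d, steps, dt = 4, 200, 0.01
W = np.zeros((d, d))

def controls(t):
    # Continuous-time control signals; in the real model these come from a (slow) net.
    k = np.array([np.sin(t), np.cos(t), np.sin(2 * t), 1.0])
    v = np.array([np.cos(t), 1.0, np.sin(t), np.cos(3 * t)])
    return k / np.linalg.norm(k), v, 0.5              # key, value, learning-rate signal

# dW/dt = beta(t) * (v(t) - W k(t)) k(t)^T, integrated with forward Euler.
for i in range(steps):
    k, v, beta = controls(i * dt)
    W = W + dt * beta * np.outer(v - W @ k, k)
print(W.shape)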

66 of 76

Image Generation (Fast Weight Painters)

Motivation:

  • More applications to show general-purpose applicability
  • An application with better visualization

Idea: treat images as weight matrices

Use an FWP as an image generator in a GAN.
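The core idea in miniature (a sketch, not the GAN setup or the trained generator): the image is the fast weight matrix, painted as a short sequence of learning-rule updates, here 16 random rank-1 delta-rule “strokes” on a 64x64 canvas.

import numpy as np

rng = np.random.default_rng(0)
H = 64
canvas = np.zeros((H, H))                            # the "fast weight matrix" = the image

for step in range(16):                               # in the real model, a net emits these vectors
    k = rng.standard_normal(H)
    k /= np.linalg.norm(k)
    v = rng.standard_normal(H)
    canvas = canvas + 0.5 * np.outer(v - canvas @ k, k)   # one delta-rule "drawing step"

print(canvas.shape)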

66

67 of 76

Sampled Images

64x64 images: CelebA, MetFaces, AFHQ-Cat/Dog/Wild, LSUN-Church

StyleGAN-2

Fast Weight Painter

67

Irie, Schmidhuber, ICLR 2023, “Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules”

68 of 76

Generation Steps (cont’d)

Low-rank scenario: 16 drawing steps to draw 64x64 images

Symmetry

Other regularities

68

Irie, Schmidhuber, ICLR 2023, “Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules”

69 of 76

Anecdote: Real-time recurrent learning for FWPs?

69

Real-Time Recurrent Learning (RTRL): untruncated gradients! Time/space complexity: O(D^4)/O(D^3)

Backpropagation Through Time (BPTT): truncated gradients! Time/space complexity: O(D^2)/O(T*D)

70 of 76

Anecdote: Real-time recurrent learning for FWPs?

1989

70

Irie, Gopalakrishnan, Schmidhuber, ICLR 2024, “Exploring the Promise and Limits of Real-Time Recurrent Learning”

71 of 76

Anecdote: Real-time recurrent learning for FWPs?

Yes, we can, but only for a 1-layer model, and only for a certain limited update rule

1989

71

Irie, Gopalakrishnan, Schmidhuber, ICLR 2024, “Exploring the Promise and Limits of Real-Time Recurrent Learning”

72 of 76

Outlook:

Current architecture research is driven by training parallelism + efficient inference.

Interestingly, the limits of such models in terms of expressivity are “again” unclear (after Grazzi et al.’s results on DeltaNet).

But I also like investigating other non-parallelizable but interesting forms of sequential computation (e.g., other self-modification rules)...

72

“Sequential parallel duality”

73 of 76

Thank you for listening

  • All our code is public (links in the corresponding papers)
  • The main results presented are based on joint work with my former colleagues (work done at the University of Lugano, Switzerland), with special thanks to:

73

Róbert Csordás

Imanol Schlag

Jürgen Schmidhuber

74 of 76

Imanol Schlag*, Kazuki Irie*, Jürgen Schmidhuber

Linear Transformers Are Secretly Fast Weight Programmers

ICML 2021, https://arxiv.org/abs/2102.11174

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

NeurIPS 2021, https://arxiv.org/abs/2106.06295

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

A Modern Self-Referential Weight Matrix That Learns to Modify Itself
ICML 2022, https://arxiv.org/abs/2202.05780

Kazuki Irie*, Róbert Csordás*, Jürgen Schmidhuber
The Dual Form of Neural Networks Revisited

ICML 2022, https://arxiv.org/abs/2202.05798

(The main ideas/results presented in this talk are from the papers listed above.)

74

75 of 76

Use formal languages to evaluate computational capabilities:

  • Irie, Csordás, Schmidhuber. EMNLP 2023. Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions. https://arxiv.org/abs/2310.16076

Meta-continual learning with self-referential neural nets:

  • Irie, Csordás, Schmidhuber. TMLR 2025. Metalearning Continual Learning Algorithms. https://arxiv.org/abs/2312.00276

Extensions of fast weight programmers:

  • Neural ODE: Irie, Faccio, Schmidhuber. NeurIPS 2022. Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules. https://arxiv.org/abs/2206.01649
  • Image generation: Irie, Schmidhuber. ICLR 2023. Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules. https://arxiv.org/abs/2210.06184

75

76 of 76

Training and Generating Neural Networks in Compressed Weight Space

Kazuki Irie, Jürgen Schmidhuber https://arxiv.org/abs/2112.15545

ICLR 2021 Workshop on Neural Compression

Accelerating Neural Self-Improvement via Bootstrapping

Kazuki Irie, Jürgen Schmidhuber https://arxiv.org/abs/2305.01547

ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models

Key-value memory in the brain
Samuel J. Gershman, Ila Fiete, Kazuki Irie

Neuron 2025 https://arxiv.org/abs/2501.02950

76