A first-person retrospective (and prospective) on developing fast weight programmers
Kazuki Irie (kirie@g.harvard.edu)
@ Google Zürich Pi Group, July 23, 2025
Disclaimer:
This is not a standard research talk!
Instead, I'll be walking through the chain of thoughts that has driven my research on "fast weight programmers" (from 2020 to the present),
which, I believe, has some interesting parallels to some of the research conducted at the Pi Group.
Fast weight programmers (FWPs)
General-purpose sequence processing models in which one net generates the weights of another net
→ A context-dependent weight matrix serves as STM; an RNN with 2D matrix-form states
Historically many proposals for "fast weights" (without the 'P'!) (timeline: 1943, 1981, 1982, 1987),
but no end-to-end learnable weight-modification model until 1991. Revival due to the Transformer connection.
I got introduced to FWPs in May 2020 by Jürgen Schmidhuber and Imanol Schlag.
Proto-FWPs (2020)
How to generate the weights of NNs using NNs?
→ The main challenge: high dimensionality of weight matrices ("usual" data: O(N) → weight matrix: O(N^2))

Directly parameterize the WM as Transformer layers: a "Transformer learning rule"
Cf. Google's HyperTransformer (ICML 2022)

Modify the WM in compressed space
Borrowing an idea from Evolution Strategies: use the DCT as a compression function
Still very expensive to get the final dense matrix :(
Irie, Schmidhuber, ICLR 2021 Workshop "Training and Generating Neural Networks in Compressed Weight Space"
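To make the compressed-weight-space idea concrete, here is a minimal sketch (not the exact setup of the workshop paper): a small block of low-frequency DCT coefficients, which would be produced or updated by the slow system, is expanded into a dense weight matrix via the inverse DCT. All names and sizes below are illustrative.

```python
# Minimal sketch: decode a dense weight matrix from a few DCT coefficients.
import numpy as np
from scipy.fft import idctn

d_out, d_in = 64, 64       # size of the dense ("fast") weight matrix
n_coeff = 8                # keep only an 8x8 block of low-frequency coefficients

coeffs = np.random.randn(n_coeff, n_coeff) * 0.1   # would be generated/evolved

# Place the coefficients in the low-frequency corner and invert the DCT.
full_spectrum = np.zeros((d_out, d_in))
full_spectrum[:n_coeff, :n_coeff] = coeffs
W_fast = idctn(full_spectrum, norm='ortho')        # dense d_out x d_in matrix
# (As noted above: recovering the full dense matrix is the expensive step
#  when the target layer is large.)

x = np.random.randn(d_in)
y = W_fast @ x             # use the decoded matrix as an ordinary linear layer
```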
Another possibility: outer products
Generate two vectors instead of one matrix!

Connection to attention:
Ba's equivalence result (NeurIPS 2016)
→ converting Schmidhuber's recurrent FWP (1993) into an attention mechanism
→ Realization: this applies directly to self-attention if it is unnormalized (later in our ICML 2021)
Linear Transformers (ICML 2020)
→ derivation for the more general "linear attention" case (involving an extra normalization variable); very nice CUDA implementation
Duality between FWP and attention
Drop-in replacement for the self-attention in the Transformer architecture:
"Transformer" form: "unnormalized attention" over query, key, value projections (Vaswani et al., N(eur)IPS 2017)
"Fast Weight Programmer" form (Schmidhuber 1991; Neural Computation 1992)
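A minimal numerical check of this duality (sizes and projection names are illustrative, and the feature map / normalization of full linear attention is omitted): the unnormalized-attention output and the outer-product fast-weight output coincide.

```python
# Numerical check of the FWP <-> unnormalized-attention duality.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 10, 4
X = rng.normal(size=(T, d_model))                     # input sequence
W_q, W_k, W_v = (rng.normal(size=(d_head, d_model)) for _ in range(3))

Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T             # query/key/value projections

# "Transformer" form: unnormalized attention over all past steps.
y_attn = np.stack([sum((Q[t] @ K[s]) * V[s] for s in range(t + 1))
                   for t in range(T)])

# "Fast weight programmer" form: purely additive outer-product updates.
W_fast = np.zeros((d_head, d_head))                   # fixed-size 2D state
y_fwp = []
for t in range(T):
    W_fast += np.outer(V[t], K[t])                    # write: Hebbian outer product
    y_fwp.append(W_fast @ Q[t])                       # read: query the fast weights
y_fwp = np.stack(y_fwp)

assert np.allclose(y_attn, y_fwp)                     # the two forms coincide
```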
Better Update Rule: Good Old Delta Rule
Instead of using the purely additive rule, also generate a dynamic 'learning rate' (Widrow & Hoff 1960; Rescorla & Wagner 1972).
"Fast delta": use the good old delta rule instead of the purely Hebbian rule. The slow net uses self-invented training patterns (input/target/learning rate) to "train" the fast net on the fly.
Natural as an associative memory manipulation, as in the tensor-product representation.
Schlag*, Irie*, Schmidhuber, ICML 2021
Side note: Key-Value Associative Memory
Memory primitives:
- Read operation = "retrieve the value associated with my query"
- Write operation = "store a key-value pair"
(diagram: read from / write to the "memory")
DeltaNet
The slow net generates self-invented training patterns (input/target/learning rate) to optimize the weights of the fast net using the delta rule.
(diagram: Slow Net → update rule / programming instruction → Fast Net; outer product, self-generated learning rate, currently stored value to be replaced)
Schlag*, Irie*, Schmidhuber, ICML 2021 "Linear Transformers Are Secretly Fast Weight Programmers"
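A minimal sketch of this delta-rule update (dimensions and the sigmoid for the learning rate are illustrative; the ICML 2021 model additionally applies a kernel map and normalization to keys and queries):

```python
# Minimal sketch of DeltaNet's delta-rule fast-weight update.
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 8, 16, 4
X = rng.normal(size=(T, d_model))

# Slow-net projections: they generate the "training patterns" for the fast net.
W_q, W_k, W_v = (rng.normal(size=(d_head, d_model)) for _ in range(3))
w_beta = rng.normal(size=d_model)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_fast = np.zeros((d_head, d_head))   # fast weights = key-value associative memory
outputs = []
for t in range(T):
    q, k, v = W_q @ X[t], W_k @ X[t], W_v @ X[t]
    beta = sigmoid(w_beta @ X[t])     # self-generated learning rate in (0, 1)
    v_old = W_fast @ k                # currently stored value for this key
    W_fast += beta * np.outer(v - v_old, k)   # delta rule: replace old value by new
    outputs.append(W_fast @ q)        # read with the query
outputs = np.stack(outputs)
```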
Further "immediate" extensions
Using more powerful fast/slow net architectures (fast net: …; slow net: …)
+ use FWPs as a general-purpose sequence model for reinforcement learning
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021 "Going Beyond Linear Transformers with Recurrent Fast Weight Programmers"
Limited expressivity as non-recurrent models:
- certain regular languages
- certain algorithmic tasks (code execution with variable-state manipulation)
Irie, Csordás, Schmidhuber, EMNLP 2023

Recurrent DeltaNet (2021)
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021
Important Note
We were not concerned with efficiency (i.e., parallelizable training over the sequence dimension) at that time (2021).
Our training used the recurrent form and "backward weight recomputation" to manage memory during training.
Today: use the parallel form, in a chunk-wise fashion, also applicable to DeltaNet, thanks to: (NeurIPS 2024)
What's next?? (2021-)
What I didn't do:
"Add micro-steps within each step!" (i.e., apply more than one delta-rule update per step)
I also wanted to overcome the "rank-1 update limitation" but overlooked the concrete implication for expressivity.
→ I couldn't do it in the end → Now: DeltaProduct (2024)!

"Make the fast net deeper &, instead of using the delta rule, directly optimize the local self-invented loss by backprop"
→ I couldn't do it (was pushing other extensions).
→ Now: test-time regression & MesaNet (2025)!

So what did I do instead???
Can we build a neural net that learns to modify & improve itself? (recursively)
Why not more meta-levels? Why a self-referential loop? meta-meta-...-metalearning
Fast weight programmer = net generating the weights of another net
DeltaNet → DeltaDeltaNet → DeltaDelta…DeltaNet??
Collapse all meta-levels into a single self-referential loop…
Schlag*, Irie*, Schmidhuber, ICML 2021
Irie*, Schlag*, Csordás, Schmidhuber, NeurIPS 2021
Irie, Schlag, Csordás, Schmidhuber, ICML 2022
Self-Referential Weight Matrix
(diagram: input → weight matrix → output; the weight matrix also modifies itself)
(recursively self-modifying programs)
Irie, Schlag, Csordás, Schmidhuber, ICML 2022 "A Modern Self-Referential Weight Matrix That Learns to Modify Itself"
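A heavily simplified sketch of the self-referential idea: a single matrix produces its own output, key, query, and learning rate, then modifies itself with the delta rule. The ICML 2022 parameterization differs in details (e.g., the nonlinearities applied to keys/queries and the exact block layout); everything below is illustrative.

```python
# Simplified sketch of a self-referential weight matrix: W produces its output
# AND the ingredients of its own delta-rule update.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out = 8, 6, 4
X = rng.normal(size=(T, d_in)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W maps an input vector to [output; key; query; learning rate].
W = rng.normal(size=(d_out + 2 * d_in + 1, d_in)) * 0.1

for t in range(T):
    out = W @ X[t]
    y = out[:d_out]                                   # model output
    k = out[d_out:d_out + d_in]                       # self-generated key
    q = out[d_out + d_in:d_out + 2 * d_in]            # self-generated query
    beta = sigmoid(out[-1])                           # self-generated learning rate
    v_new = W @ q                                     # self-generated target value
    v_old = W @ k                                     # value currently stored for k
    W = W + beta * np.outer(v_new - v_old, k)         # W modifies itself (delta rule)
```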
- Sanity check on standard few-shot image classification (Mini-ImageNet)
- Sequential multi-task adaptation (Mini-ImageNet + Omniglot)
- Multi-task reinforcement learning (ProcGen)
(diagram: as the dataset changes, a common initial weight matrix yields self-modified, task/episode-specific weight matrices)
Irie, Schlag, Csordás, Schmidhuber, ICML 2022
Does this improve expressivity?
Regular language recognition tasks requiring "state tracking"
Both recurrence & self-reference help.
(NB: the DeltaNet result here is outdated; see ICLR 2025.)
Irie, Csordás, Schmidhuber, EMNLP 2023 "Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions"
Other follow-ups
- How to incentivize an SRWM to become better than its current self? (by comparing to its future self!)
  Irie, Schmidhuber, ICLR 2023 Workshop "Accelerating Neural Self-Improvement via Bootstrapping"
- How to make an SRWM learn & self-adapt continually? (by augmenting the meta-objectives!)
  Irie, Csordás, Schmidhuber, TMLR 2025 "Metalearning Continual Learning Algorithms"
What else?? (2022)
Generalizing the duality result
Deriving the FWP-Attention Duality
Update rule: purely additive Hebbian ("synaptic modulation")
Ba et al., NIPS 2016; Katharopoulos et al., ICML 2020
A more general formulation
These systems are "equivalent" (for arbitrary v and k):
- Linear layer: fixed-size memory (independent of T)
- Key-value attention memory layer: "unlimited" memory (grows with T)
NB: this is unnormalized dot attention (cf. standard softmax attention).
Irie*, Csordás*, Schmidhuber, ICML 2022 "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention"
Application to a Linear Layer Trained by Gradient Descent
Forward computation: $y = W x$
Backward computation (gradient descent) to update $W$: an outer product,
$W_t = W_{t-1} - \eta_t \, e_t \, x_t^\top$,
for some error function (with error signal $e_t$) and learning rate $\eta_t$ at step $t$.
NB: $t$ now denotes the training iteration!
We can directly apply the duality from the previous slide to this update.
This yields another duality… (a classic ML exercise; classic result: Mark A. Aizerman, 1964)
Linear layer trained by gradient descent ↔ key/value-attention memory storing the entire training experience
(Store: the training patterns as key/value pairs; Compute: the prediction as attention over them)
"Nothing is forgotten": key/value slots corresponding to all training datapoints are never "lost"...
→ intriguing in light of catastrophic forgetting
→ connection to some findings in neuroscience ("silent engrams")
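A minimal numerical check of this duality under illustrative assumptions (squared loss, plain SGD): the trained linear layer's prediction equals the initial layer's prediction plus unnormalized attention over all stored training patterns.

```python
# Duality check: linear layer trained by GD == init + attention over training patterns.
import numpy as np

rng = np.random.default_rng(0)
steps, d_in, d_out = 50, 5, 3
W0 = rng.normal(size=(d_out, d_in))
W = W0.copy()

keys, values = [], []                       # key = training input, value = -lr * error
for t in range(steps):
    x_t = rng.normal(size=d_in)
    target = rng.normal(size=d_out)
    lr = 0.05
    e_t = W @ x_t - target                  # error signal for a squared loss
    W -= lr * np.outer(e_t, x_t)            # standard gradient-descent update
    keys.append(x_t)
    values.append(-lr * e_t)

x_test = rng.normal(size=d_in)
y_trained = W @ x_test                      # "weight" form
y_dual = W0 @ x_test + sum((x_test @ k) * v
                           for k, v in zip(keys, values))   # "attention" form
assert np.allclose(y_trained, y_dual)       # nothing about training is "forgotten"
```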
Many Consequences… (incl. misconceptions)
Irie*, Csordás*, Schmidhuber, ICML 2022
(paper snapshots: Science 2015, PNAS 2017, Neuron 2025)
What else?? (2022)
Continuous-time FWP
Inspired by Oja (1982)
Helped by the successes & practicalities of Neural ODEs
Irie, Faccio, Schmidhuber, NeurIPS 2022 "Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules"
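As a rough illustration only (not the paper's exact formulation, which uses proper ODE solvers and slow-net-generated signals), here is a forward-Euler integration of a continuous-time, delta-style learning rule for the fast weights; the `signals` function and all constants are placeholders.

```python
# Crude sketch: Euler integration of a continuous-time delta-style learning rule
# dW/dt = beta(t) * (v(t) - W(t) k(t)) k(t)^T.
import numpy as np

d = 4
dt, n_steps = 0.01, 500
W = np.zeros((d, d))

def signals(t):
    # Placeholder key/value/learning-rate trajectories (would come from a slow net).
    k = np.array([np.sin(t), np.cos(t), np.sin(2 * t), 1.0])
    v = np.array([np.cos(t), np.sin(3 * t), 1.0, np.cos(2 * t)])
    beta = 0.5
    return k, v, beta

for step in range(n_steps):
    t = step * dt
    k, v, beta = signals(t)
    dW = beta * np.outer(v - W @ k, k)    # continuous-time delta rule
    W = W + dt * dW                       # forward-Euler step
```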
Image Generation (Fast Weight Painters)
Motivation & idea: treat images as weight matrices; use an FWP as an image generator in a GAN.
Sampled Images
64x64 images: CelebA, MetFaces, AFHQ-Cat/Dog/Wild, LSUN-Church
(figure: samples from StyleGAN-2 vs. the Fast Weight Painter)
Irie, Schmidhuber, ICLR 2023 "Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules"
Generation Steps (cont'd)
Low-rank scenario: 16 drawing steps to draw 64x64 images
(figure: emergent symmetry and other regularities in the drawing steps)
Irie, Schmidhuber, ICLR 2023 "Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules"
Anecdote: Real-time recurrent learning for FWPs?
Real-Time Recurrent Learning (RTRL, 1989): untruncated gradients! Time/space complexity: O(D^4) / O(D^3)
Backpropagation Through Time (BPTT): truncated gradients! Time/space complexity: O(D^2) / O(T*D)
Yes, we can, but only for a 1-layer model and a certain limited update rule…
Irie, Gopalakrishnan, Schmidhuber, ICLR 2024 "Exploring the Promise and Limits of Real-Time Recurrent Learning"
Outlook:
Current architecture research is driven by training parallelism + efficient inference.
Interestingly, the expressivity limitations of such models are unclear "again" (after Grazzi et al.'s results on DeltaNet).
But I also like investigating other non-parallelizable but interesting forms of sequential computation (e.g., other self-modification rules)...
"Sequential-parallel duality"
Thank you for listening
Work done at the University of Lugano, Switzerland
Collaborators: Róbert Csordás, Imanol Schlag, Jürgen Schmidhuber
The main ideas/results presented in this talk are from:

Imanol Schlag*, Kazuki Irie*, Jürgen Schmidhuber
Linear Transformers Are Secretly Fast Weight Programmers
ICML 2021, https://arxiv.org/abs/2102.11174

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
NeurIPS 2021, https://arxiv.org/abs/2106.06295

Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber
A Modern Self-Referential Weight Matrix That Learns to Modify Itself
ICML 2022, https://arxiv.org/abs/2202.05780

Kazuki Irie*, Róbert Csordás*, Jürgen Schmidhuber
The Dual Form of Neural Networks Revisited
ICML 2022, https://arxiv.org/abs/2202.05798
Use formal languages to evaluate computational capabilities:
Meta-continual learning with self-referential neural nets:
Extensions of fast weight programmers:
Training and Generating Neural Networks in Compressed Weight Space
Kazuki Irie, Jürgen Schmidhuber
ICLR 2021 Workshop on Neural Compression, https://arxiv.org/abs/2112.15545

Accelerating Neural Self-Improvement via Bootstrapping
Kazuki Irie, Jürgen Schmidhuber
ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, https://arxiv.org/abs/2305.01547

Key-Value Memory in the Brain
Samuel J. Gershman, Ila Fiete, Kazuki Irie
Neuron 2025, https://arxiv.org/abs/2501.02950