1 of 15

Parameter-efficient Language Modeling with Dual Recurrence

2025.6.4

2 of 15

1. Depth-wise Recurrence Models

3 of 15

1. Depth-wise Recurrence Models

  • Input Injection → path independence (see "Path Independent Equilibrium Models Can Better Exploit Test-Time Computation")
  • Implementation:
    • Addition: Input + hidden_state
    • Concatenation: [Input; hidden_state] (performs better on large-scale tasks and models)
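The two injection variants above can be sketched as follows. This is a minimal illustration, not the actual RWKV implementation: `InjectedBlock` and its single `nn.Linear` stand in for a full RWKV block, and the depth-loop count is arbitrary.

```python
import torch
import torch.nn as nn

class InjectedBlock(nn.Module):
    """Hypothetical depth-recurrent block with input injection.

    mode="add":    hidden = block(input_emb + hidden)
    mode="concat": hidden = block(proj([input_emb; hidden]))
    """
    def __init__(self, dim: int, mode: str = "add"):
        super().__init__()
        self.mode = mode
        self.block = nn.Linear(dim, dim)          # stand-in for a full RWKV block
        if mode == "concat":
            self.proj = nn.Linear(2 * dim, dim)   # fold the doubled width back to dim

    def forward(self, x_emb: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        if self.mode == "add":
            z = x_emb + h                                  # addition injection
        else:
            z = self.proj(torch.cat([x_emb, h], dim=-1))   # concatenation injection
        return self.block(z)

# Depth-wise recurrence: loop the same block, re-injecting the input each step.
x = torch.randn(2, 8, 64)        # (batch, seq, dim)
h = torch.zeros_like(x)
blk = InjectedBlock(64, mode="concat")
for _ in range(4):               # 4 depth loops (count chosen arbitrarily here)
    h = blk(x, h)
print(h.shape)                   # torch.Size([2, 8, 64])
```

Re-injecting `x` at every loop is what makes the iteration path-independent: the fixed point depends on the input, not on how many loops were run to reach it.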

4 of 15

2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise

Original RWKV Arch.

RWKV-Depth-Recurrence Arch.

  • Model Arch.
  • Metric:
    • Sharing granularity

Addition Injection

#TODO: Concatenation Injection

5 of 15

2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise

  • Training settings:
    • Loss:
    • Dataset: Minipile (provided by RWKV team)
    • PytorchLightning Settings:
      • Optimizer: FusedAdam (β1 = 0.9, β2 = 0.99, eps = 1e-18)
      • Weight Decay: 0.001
      • WarmUp: 10 steps
      • Parameter-group learning-rate scaling (1x/2x)
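The settings above can be sketched roughly as below. This is an assumption-laden sketch, not the training script: `torch.optim.Adam` stands in for apex's `FusedAdam`, and which parameters fall into the 1x vs. 2x groups is made up here for illustration.

```python
import torch

# Stand-in model; in the real setup this would be the RWKV depth-recurrent model.
model = torch.nn.Linear(64, 64)
base_lr = 6e-4

# 1x/2x parameter groups (the actual group assignment is an assumption).
param_groups = [
    {"params": [model.weight], "lr": base_lr},       # 1x group
    {"params": [model.bias],   "lr": 2 * base_lr},   # 2x group
]

# torch.optim.Adam in place of FusedAdam, with the slide's betas/eps/weight decay.
opt = torch.optim.Adam(param_groups, betas=(0.9, 0.99), eps=1e-18,
                       weight_decay=0.001)

warmup_steps = 10
def warmup_scale(step: int) -> float:
    # Linear warm-up over the first 10 steps, then a constant multiplier of 1.
    return min(1.0, (step + 1) / warmup_steps)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_scale)
```

The tiny `eps = 1e-18` (vs. the usual 1e-8) is taken directly from the slide; it effectively removes the denominator floor in Adam.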

Next-token prediction

Avoiding numerical explosion

  • Toy trial:

6 of 15

3. Test on MMLU Benchmark

8-layer groups, 400 loops total:

  • Recommended prompt: 3222/14042 (22.945%)
  • Default lm_eval prompt: 3222/14042 (22.945%, identical)

16 layers:

  • Recommended prompt: 3232/14042 (23.017%)

7 of 15

4. Future Works

  • Joint dataset training with context-length scaling
  • Training on RWKV-World-v3 Dataset (3.1T tokens): Goose-World/RWKV-World-v3 at main
  • Self-adaptive criterion for the early-exit scheme → loss function
  • Baseline: same FLOPs, same params
  • Training Strategies:
    • Randomized looping-iteration counts?
    • Loss function?
    • Tokenizer & datasets?
    • Different tasks? (simple user-defined / algorithmic extrapolation)

Massive multiplication (datasets) → Chaoyi.

8 of 15

5. About paper?

  1. Dual Recurrence
  2. Why RWKV not Transformer?

Transformers risk KV-cache explosion on long contexts…

  • Application scenarios:

Sufficiently verticalized (100% accuracy in one specific scenario)

  • RWKV is weaker at reasoning but strong on long contexts…

9 of 15

6. Adaptive Reasoning?

10 of 15

Previous trials

ICL in Math function regression:

Task: Linear Regression

Settings:

embed_dim = 64

head_size = 2

layers = 3

epochs = 200

lr = 6e-4

optim = Adam(0.9, 0.99, 1e-18)
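The linear-regression ICL task in this trial can be sketched as below. The sampling scheme (Gaussian tasks and inputs, noiseless targets) is an assumption based on common ICL-regression setups; the slide only fixes the model hyperparameters, not the data generator.

```python
import torch

def make_icl_batch(batch: int, n_points: int, dim: int):
    """Generate toy in-context linear-regression sequences.

    Each sequence carries its own regression weight w; the model sees
    interleaved (x_1, y_1, ..., x_k) and must predict y_k in context.
    """
    w = torch.randn(batch, dim, 1)        # one regression task per sequence
    x = torch.randn(batch, n_points, dim) # context inputs
    y = (x @ w).squeeze(-1)               # noiseless targets y_i = w^T x_i
    return x, y, w

# dim=64 matches the slide's embed_dim; batch/n_points are arbitrary here.
x, y, w = make_icl_batch(batch=32, n_points=20, dim=64)
```

Training then minimizes the pointwise prediction error at every in-context position, which is where the "tot loss" and "pointwise loss" curves on the next slide come from.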

11 of 15

Previous trials

tot loss

pointwise loss

12 of 15

The RWKV architecture is more robust to noise

13 of 15

Instability in looped training

14 of 15

Revisiting Dual Recurrence and ICL

In-context Learning (ICL): the network's ability to learn from its context

  • Observed behavior
    • Makes accurate next-token predictions on sequences like [A, B, ..., A], i.e., outputs B
    • Makes accurate next-token predictions on sequences like [A*, B*, ..., A], i.e., outputs B, where A* relates to A and B* relates to B along some dimension (in an abstract space)
    • Reflects a model's long-sequence recall/memory capability
  • Connection to deep networks / large models
    • Memory capacity and representational power
    • Adaptive depth recurrence
    • Short-term & long-term memory???

15 of 15

Revisiting Dual Recurrence(DR) and ICL

  • DR's memory capacity and representational power
    • Seq-wise: short-term memory
    • Depth-wise: long-term memory
    • (a hierarchical memory mechanism)
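The two recurrences and their memory roles can be sketched in one minimal module. This is an illustrative toy, not RWKV-v7: `nn.GRUCell` stands in for an RWKV time-mixing block (an assumption), and the depth loop simply reuses the same weights with input injection.

```python
import torch
import torch.nn as nn

class DualRecurrence(nn.Module):
    """Toy dual recurrence: seq-wise state + weight-shared depth loop."""
    def __init__(self, dim: int, depth_loops: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # stand-in for an RWKV block
        self.depth_loops = depth_loops     # depth-wise loops, weights shared

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        s = x.new_zeros(B, D)              # seq-wise state: short-term memory
        outs = []
        for t in range(T):                 # seq-wise recurrence over tokens
            h = x[:, t]
            for _ in range(self.depth_loops):
                # Depth-wise recurrence: iterate the shared cell,
                # re-injecting the token input at every loop.
                h = self.cell(x[:, t] + h, s)
            s = h                          # carry the refined state forward
            outs.append(h)
        return torch.stack(outs, dim=1)

m = DualRecurrence(dim=32, depth_loops=3)
out = m(torch.randn(4, 5, 32))            # (batch, seq, dim) in, same shape out
```

The depth loop repeatedly refines each token's representation with fixed parameters (the "long-term" axis), while the carried state `s` propagates information along the sequence (the "short-term" axis).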