1 of 15

Parameter-efficient Language Modeling with Dual Recurrence

2025.6.4

2 of 15

1. Depth-wise Recurrence Models

3 of 15

1. Depth-wise Recurrence Models

  • Input Injection → path independence (see "Path Independent Equilibrium Models Can Better Exploit Test-Time Computation")
  • Implementation:
    • Addition: Input + hidden_state
    • Concatenation: [Input; hidden_state] (performs better on large-scale tasks and models)
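The two injection variants above can be sketched as follows. This is a minimal illustration, not the actual RWKV implementation: `InjectedBlock` and its single `nn.Linear` stand in for a full RWKV block, and the depth-loop count is arbitrary.

```python
import torch
import torch.nn as nn

class InjectedBlock(nn.Module):
    """Hypothetical depth-recurrent block with input injection.

    mode="add":    hidden = block(input_emb + hidden)
    mode="concat": hidden = block(proj([input_emb; hidden]))
    """
    def __init__(self, dim: int, mode: str = "add"):
        super().__init__()
        self.mode = mode
        self.block = nn.Linear(dim, dim)          # stand-in for a full RWKV block
        if mode == "concat":
            self.proj = nn.Linear(2 * dim, dim)   # fold the doubled width back to dim

    def forward(self, x_emb: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        if self.mode == "add":
            z = x_emb + h                                  # addition injection
        else:
            z = self.proj(torch.cat([x_emb, h], dim=-1))   # concatenation injection
        return self.block(z)

# Depth-wise recurrence: loop the same block, re-injecting the input each step.
x = torch.randn(2, 8, 64)        # (batch, seq, dim)
h = torch.zeros_like(x)
blk = InjectedBlock(64, mode="concat")
for _ in range(4):               # 4 depth loops (count chosen arbitrarily here)
    h = blk(x, h)
print(h.shape)                   # torch.Size([2, 8, 64])
```

Re-injecting `x` at every loop is what makes the iteration path-independent: the fixed point depends on the input, not on how many loops were run to reach it.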

4 of 15

2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise

Original RWKV Arch.

RWKV-Depth-Recurrence Arch.

  • Model Arch.
  • Metric:
    • Sharing granularity

Addition Injection

#TODO: Concatenation Injection

5 of 15

2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise

  • Training settings:
    • Loss:
    • Dataset: Minipile (provided by RWKV team)
    • PytorchLightning Settings:
      • Optimizer: FusedAdam (β1 = 0.9, β2 = 0.99, eps = 1e-18)
      • Weight Decay: 0.001
      • WarmUp: 10 steps
      • Parameter-group learning-rate scaling (1x/2x)
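The settings above can be sketched roughly as below. This is an assumption-laden sketch, not the training script: `torch.optim.Adam` stands in for apex's `FusedAdam`, and which parameters fall into the 1x vs. 2x groups is made up here for illustration.

```python
import torch

# Stand-in model; in the real setup this would be the RWKV depth-recurrent model.
model = torch.nn.Linear(64, 64)
base_lr = 6e-4

# 1x/2x parameter groups (the actual group assignment is an assumption).
param_groups = [
    {"params": [model.weight], "lr": base_lr},       # 1x group
    {"params": [model.bias],   "lr": 2 * base_lr},   # 2x group
]

# torch.optim.Adam in place of FusedAdam, with the slide's betas/eps/weight decay.
opt = torch.optim.Adam(param_groups, betas=(0.9, 0.99), eps=1e-18,
                       weight_decay=0.001)

warmup_steps = 10
def warmup_scale(step: int) -> float:
    # Linear warm-up over the first 10 steps, then a constant multiplier of 1.
    return min(1.0, (step + 1) / warmup_steps)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_scale)
```

The tiny `eps = 1e-18` (vs. the usual 1e-8) is taken directly from the slide; it effectively removes the denominator floor in Adam.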

Next-token prediction

Avoiding numerical explosion

  • Toy trial:

6 of 15

3. Test on MMLU Benchmark

8-layer groups, 400 loops total:

  • Recommended prompt: 3222/14042 (22.945%)
  • Default lm_eval prompt: 3222/14042 (22.945%, identical)

16 layers:

  • Recommended prompt: 3232/14042 (23.017%)

7 of 15

4. Future Works

  • Joint dataset training with context-length scaling
  • Training on RWKV-World-v3 Dataset (3.1T tokens): Goose-World/RWKV-World-v3 at main
  • Self-adaptive criterion for the early-exit scheme → loss function
  • Baseline: same FLOPs, same params
  • Training Strategies:
    • Randomized looping-iteration counts?
    • Loss function?
    • Tokenizer & datasets?
    • Different tasks? (simple user-defined / algorithmic extrapolation)

Massive multiplication (datasets) → Chaoyi.

8 of 15

5. About paper?

  1. Dual Recurrence
  2. Why RWKV not Transformer?

Transformers risk KV-cache explosion on long contexts…

  • Application scenarios:

Sufficiently verticalized (100% accuracy in one specific scenario)

  • RWKV is weaker at reasoning but strong on long contexts…

9 of 15

6. Adaptive Reasoning?

10 of 15

Previous trials

ICL in Math function regression:

Task: Linear Regression

Settings:

embed_dim = 64

head_size = 2

layers = 3

epochs = 200

lr = 6e-4

optim = Adam(0.9, 0.99, 1e-18)
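The linear-regression ICL task in this trial can be sketched as below. The sampling scheme (Gaussian tasks and inputs, noiseless targets) is an assumption based on common ICL-regression setups; the slide only fixes the model hyperparameters, not the data generator.

```python
import torch

def make_icl_batch(batch: int, n_points: int, dim: int):
    """Generate toy in-context linear-regression sequences.

    Each sequence carries its own regression weight w; the model sees
    interleaved (x_1, y_1, ..., x_k) and must predict y_k in context.
    """
    w = torch.randn(batch, dim, 1)        # one regression task per sequence
    x = torch.randn(batch, n_points, dim) # context inputs
    y = (x @ w).squeeze(-1)               # noiseless targets y_i = w^T x_i
    return x, y, w

# dim=64 matches the slide's embed_dim; batch/n_points are arbitrary here.
x, y, w = make_icl_batch(batch=32, n_points=20, dim=64)
```

Training then minimizes the pointwise prediction error at every in-context position, which is where the "tot loss" and "pointwise loss" curves on the next slide come from.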

11 of 15

Previous trials

tot loss

pointwise loss

12 of 15

The RWKV architecture is more robust to noise

13 of 15

Instability in looped training

14 of 15

Revisiting Dual Recurrence and ICL

In-context Learning (ICL): the network's ability to learn from its context

  • Observed behavior
    • Makes accurate next-token predictions on sequences like [A, B, ..., A], i.e., outputs B
    • Makes accurate next-token predictions on sequences like [A*, B*, ..., A], i.e., outputs B, where A* relates to A and B* relates to B along some dimension (in an abstract space)
    • Reflects a model's long-sequence recall/memory capability
  • Connection to deep networks / large models
    • Memory capacity and representational power
    • Adaptive depth recurrence
    • Short-term & long-term memory???

15 of 15

Revisiting Dual Recurrence(DR) and ICL

  • DR's memory capacity and representational power
    • Seq-wise: short-term memory
    • Depth-wise: long-term memory
    • (a hierarchical memory mechanism)
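The two recurrences and their memory roles can be sketched in one minimal module. This is an illustrative toy, not RWKV-v7: `nn.GRUCell` stands in for an RWKV time-mixing block (an assumption), and the depth loop simply reuses the same weights with input injection.

```python
import torch
import torch.nn as nn

class DualRecurrence(nn.Module):
    """Toy dual recurrence: seq-wise state + weight-shared depth loop."""
    def __init__(self, dim: int, depth_loops: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)   # stand-in for an RWKV block
        self.depth_loops = depth_loops     # depth-wise loops, weights shared

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        s = x.new_zeros(B, D)              # seq-wise state: short-term memory
        outs = []
        for t in range(T):                 # seq-wise recurrence over tokens
            h = x[:, t]
            for _ in range(self.depth_loops):
                # Depth-wise recurrence: iterate the shared cell,
                # re-injecting the token input at every loop.
                h = self.cell(x[:, t] + h, s)
            s = h                          # carry the refined state forward
            outs.append(h)
        return torch.stack(outs, dim=1)

m = DualRecurrence(dim=32, depth_loops=3)
out = m(torch.randn(4, 5, 32))            # (batch, seq, dim) in, same shape out
```

The depth loop repeatedly refines each token's representation with fixed parameters (the "long-term" axis), while the carried state `s` propagates information along the sequence (the "short-term" axis).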