Parameter-efficient Language Modeling with Dual Recurrence
2025.6.4
1. Depth-wise Recurrence Models
2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise
Original RWKV Arch.
RWKV-Depth-Recurrence Arch.
Addition Injection
#TODO: Concatenation Injection
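The injection variants above can be sketched as follows. This is a minimal, hypothetical illustration of depth-wise recurrence with addition injection, not the actual RWKV-v7 code: `LoopedBlockStack` and the simple MLP block stand in for a group of RWKV layers, and only the injection logic is the point.

```python
import torch
import torch.nn as nn

class LoopedBlockStack(nn.Module):
    """Depth-wise recurrence: reuse one group of layers for several loops.

    Hypothetical sketch (names are illustrative). The original embedding
    is re-injected by addition at the start of every loop.
    """
    def __init__(self, dim: int, group_layers: int, loops: int):
        super().__init__()
        # One shared group of layers, applied `loops` times.
        self.group = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
             for _ in range(group_layers)]
        )
        self.loops = loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x          # original embedding, re-injected at every loop
        h = x
        for _ in range(self.loops):
            h = h + x0  # addition injection
            for layer in self.group:
                h = h + layer(h)  # residual connection within the group
        return h

y = LoopedBlockStack(dim=64, group_layers=2, loops=4)(torch.randn(1, 8, 64))
```

Concatenation injection would instead concatenate `x0` with `h` along the feature dimension and project back down, at the cost of an extra projection matrix per loop.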
2. Dual Recurrence in RWKV-v7: Depth-wise & Seq-wise
Next-token-prediction
Avoiding numerical explosion
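One common way to keep activations bounded across many depth-wise loops is to re-normalize the hidden state between iterations. A hedged sketch (`loop_norm` is an illustrative name, not taken from the RWKV-v7 codebase):

```python
import torch
import torch.nn as nn

dim, loops = 64, 400
loop_norm = nn.LayerNorm(dim)   # shared normalization applied between loops

h = torch.randn(2, 8, dim)
for _ in range(loops):
    h = h * 1.1        # stand-in for a block whose output magnitude grows
    h = loop_norm(h)   # keeps the hidden state ~O(1) across all 400 loops

assert torch.isfinite(h).all()  # no overflow even after 400 iterations
```

Without the normalization step, the same loop multiplies activations by 1.1^400 and overflows to inf.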
3. Test on MMLU Benchmark
8-layer groups, 400 loops total
Recommended Prompt: 3222/14042 22.945%
Default lm_eval Prompt: 3222/14042 22.945% (same as the recommended prompt)
16 layers
Recommended Prompt: 3232/14042 23.017%
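The reported percentages are just correct/total over the 14042 MMLU questions, which can be verified directly:

```python
# Verify the reported MMLU accuracies (correct answers / 14042 questions).
def acc(correct: int, total: int = 14042) -> float:
    return round(100 * correct / total, 3)

print(acc(3222))  # 8-layer groups: 22.945
print(acc(3232))  # 16 layers:     23.017
```

Both configurations sit near the 25% random-choice baseline for 4-option MMLU, so the depth-wise recurrence variants are not yet extracting benchmark-level knowledge.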
4. Future Works
Massive Multiplication (datasets) -> Chaoyi.
5. About paper?
Risk: KV-cache explosion…
Sufficiently verticalized (reach 100% in one specific scenario)
6. Adaptive Reasoning?
Previous trials
ICL in Math function regression:
Task: Linear Regression
Settings:
embed_dim = 64
head_size = 2
layers = 3
epochs = 200
lr = 6e-4
optim = Adam(betas=(0.9, 0.99), eps=1e-18)
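The linear-regression ICL task with the settings above can be sketched as below. The data format (x, y pairs laid out as one sequence, a fresh random weight vector per sequence) follows the usual ICL-regression setup and is an assumption, not the actual training code:

```python
import torch

def sample_batch(batch: int, n_points: int, x_dim: int):
    """Sample in-context linear-regression tasks (illustrative sketch).

    Each sequence gets its own hidden weight vector w; the model sees
    (x1, y1, ..., xk) in context and must predict yk at every position.
    """
    w = torch.randn(batch, x_dim, 1)        # one linear task per sequence
    xs = torch.randn(batch, n_points, x_dim)
    ys = xs @ w                             # (batch, n_points, 1)
    return xs, ys

xs, ys = sample_batch(batch=32, n_points=16, x_dim=8)
```

With embed_dim = 64, each (x, y) pair is embedded into the model's 64-dimensional stream, and training minimizes the prediction error over all in-context positions.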
Previous trials
tot loss
pointwise loss
The RWKV arch. is more noise-robust
Instability in looped training
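The two loss curves above can be made precise: "tot loss" is the MSE averaged over all in-context positions, while "pointwise loss" is the MSE at each position separately, showing how error drops as more examples accumulate in context. A hedged sketch with dummy tensors:

```python
import torch

# Dummy predictions/targets standing in for model outputs on the
# ICL regression task (32 sequences, 16 in-context positions).
preds = torch.randn(32, 16, 1)
ys = torch.randn(32, 16, 1)

# Pointwise loss: one MSE value per context position, shape (16,).
pointwise = ((preds - ys) ** 2).mean(dim=(0, 2))

# Total loss: the average over all positions (a scalar).
tot = pointwise.mean()
```

For a model that actually does ICL, `pointwise` should decrease along the position axis; instability in looped training shows up as spikes in these curves.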
Revisiting Dual Recurrence (DR) and ICL
In-context Learning: learning from context (within the network)