1 of 8

Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

2 of 8

Why Reward Models Are a Bottleneck

  • Best-of-N sampling improves LLM reasoning
  • Existing reward models are large (billions of parameters) and text-based
  • → High cost, slow inference, data-hungry

3 of 8

Key Insight

  • LLM hidden states already encode correctness
  • Signals are linearly separable
  • Verified via PCA + LDA
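The probing idea above can be sketched with scikit-learn. The synthetic features below stand in for real LLM hidden states of correct vs. incorrect completions (the data, dimensions, and component counts here are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hidden states of correct vs. incorrect completions
# (hypothetical data; the paper probes real LLM activations).
rng = np.random.default_rng(0)
d = 256                                        # hidden dimension (assumed)
correct = rng.normal(loc=0.5, size=(500, d))
incorrect = rng.normal(loc=-0.5, size=(500, d))
X = np.vstack([correct, incorrect])
y = np.array([1] * 500 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Reduce dimensionality with PCA, then fit a linear discriminant:
# high held-out accuracy indicates the labels are (near-)linearly separable.
pca = PCA(n_components=32).fit(X_tr)
lda = LinearDiscriminantAnalysis().fit(pca.transform(X_tr), y_tr)
print(f"probe accuracy: {lda.score(pca.transform(X_te), y_te):.2f}")
```

A linear probe scoring well above chance on held-out data is the evidence that correctness is linearly decodable from the representation.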

4 of 8

SWIFT: Simple Weighted Intrinsic Feedback

  • Token-level reward modeling on hidden states
  • Two linear heads: reward + gating
  • Weighted aggregation → path-level score
  • O(L·d) parameters (extremely lightweight)
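The two-head design above can be sketched in numpy. All names and shapes here are illustrative assumptions, not the paper's code: a reward head scores each token's hidden state, a gating head softmax-weights the tokens, and the gated average gives the path-level score used for best-of-N selection:

```python
import numpy as np

def swift_score(hidden, w_reward, w_gate):
    """hidden: (T, d) token hidden states -> scalar path-level score."""
    token_rewards = hidden @ w_reward        # (T,) per-token rewards
    gate_logits = hidden @ w_gate            # (T,) per-token gate logits
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()                     # softmax over tokens
    return float(gates @ token_rewards)      # weighted aggregation

# Hypothetical best-of-N selection over 8 sampled candidates.
rng = np.random.default_rng(0)
d, T = 64, 10                                # hidden size, tokens (assumed)
w_r, w_g = rng.normal(size=d), rng.normal(size=d)
candidates = [rng.normal(size=(T, d)) for _ in range(8)]
best = max(range(8), key=lambda i: swift_score(candidates[i], w_r, w_g))
print("selected candidate:", best)
```

With one reward vector and one gate vector per probed layer, the parameter count stays at O(L·d), which is where the efficiency claim comes from.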

5 of 8

Results: Accuracy & Generalization

  • +12.7% accuracy over EurusRM-7B (MATH)
  • Consistent gains across reasoning benchmarks
  • Strong cross-dataset generalization
  • Works beyond reasoning (helpfulness & safety)

6 of 8

Efficiency

  • <0.005% of the parameters of baseline RMs
  • Orders-of-magnitude lower latency and FLOPs
  • Trains on minimal data and scales well

7 of 8

Why Does SWIFT Work?

  • Token-level gating highlights decisive tokens
  • Emphasizes reasoning steps & conclusion markers
  • Down-weights boilerplate text
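The down-weighting behavior can be illustrated with a toy example (an illustrative construction, not the paper's analysis): a token whose hidden state aligns with the gate direction absorbs most of the softmax mass, while near-orthogonal filler tokens contribute little:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
w_gate = rng.normal(size=d)
w_gate /= np.linalg.norm(w_gate)             # unit gate direction

filler = rng.normal(scale=0.1, size=(9, d))  # "boilerplate" tokens
decisive = 5.0 * w_gate                      # "conclusion" token (aligned)
hidden = np.vstack([filler, decisive])       # (10, d) token states

logits = hidden @ w_gate
gates = np.exp(logits - logits.max())
gates /= gates.sum()                         # softmax over tokens
print(f"gate weight on decisive token: {gates[-1]:.2f}")
```

Inspecting these gate weights on real generations is what surfaces the reasoning-step and conclusion-marker pattern the slide describes.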

8 of 8

Many Thanks!

Jizhou Guo’s homepage: https://aster2024.github.io/