1 of 8

Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

2 of 8

Why Reward Models Are a Bottleneck

  • Best-of-N sampling improves LLM reasoning
  • Existing reward models are large (billions of parameters) and text-based
  • → High cost, slow inference, data-hungry

3 of 8

Key Insight

  • LLM hidden states already encode correctness
  • Signals are linearly separable
  • Verified via PCA + LDA
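The probing idea above can be sketched with scikit-learn. The synthetic features below stand in for real LLM hidden states of correct vs. incorrect completions (the data, dimensions, and component counts here are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hidden states of correct vs. incorrect completions
# (hypothetical data; the paper probes real LLM activations).
rng = np.random.default_rng(0)
d = 256                                        # hidden dimension (assumed)
correct = rng.normal(loc=0.5, size=(500, d))
incorrect = rng.normal(loc=-0.5, size=(500, d))
X = np.vstack([correct, incorrect])
y = np.array([1] * 500 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Reduce dimensionality with PCA, then fit a linear discriminant:
# high held-out accuracy indicates the labels are (near-)linearly separable.
pca = PCA(n_components=32).fit(X_tr)
lda = LinearDiscriminantAnalysis().fit(pca.transform(X_tr), y_tr)
print(f"probe accuracy: {lda.score(pca.transform(X_te), y_te):.2f}")
```

A linear probe scoring well above chance on held-out data is the evidence that correctness is linearly decodable from the representation.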

4 of 8

SWIFT: Simple Weighted Intrinsic Feedback

  • Token-level reward modeling on hidden states
  • Two linear heads: reward + gating
  • Weighted aggregation → path-level score
  • O(L·d) parameters (extremely lightweight)
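The two-head design above can be sketched in numpy. All names and shapes here are illustrative assumptions, not the paper's code: a reward head scores each token's hidden state, a gating head softmax-weights the tokens, and the gated average gives the path-level score used for best-of-N selection:

```python
import numpy as np

def swift_score(hidden, w_reward, w_gate):
    """hidden: (T, d) token hidden states -> scalar path-level score."""
    token_rewards = hidden @ w_reward        # (T,) per-token rewards
    gate_logits = hidden @ w_gate            # (T,) per-token gate logits
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()                     # softmax over tokens
    return float(gates @ token_rewards)      # weighted aggregation

# Hypothetical best-of-N selection over 8 sampled candidates.
rng = np.random.default_rng(0)
d, T = 64, 10                                # hidden size, tokens (assumed)
w_r, w_g = rng.normal(size=d), rng.normal(size=d)
candidates = [rng.normal(size=(T, d)) for _ in range(8)]
best = max(range(8), key=lambda i: swift_score(candidates[i], w_r, w_g))
print("selected candidate:", best)
```

With one reward vector and one gate vector per probed layer, the parameter count stays at O(L·d), which is where the efficiency claim comes from.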

5 of 8

Results: Accuracy & Generalization

  • +12.7% accuracy over EurusRM-7B (MATH)
  • Consistent gains across reasoning benchmarks
  • Strong cross-dataset generalization
  • Works beyond reasoning (helpfulness & safety)

6 of 8

Efficiency

  • <0.005% of the parameters of baseline RMs
  • Orders-of-magnitude lower latency and FLOPs
  • Trains on minimal data and scales well

7 of 8

Why Does SWIFT Work?

  • Token-level gating highlights decisive tokens
  • Emphasizes reasoning steps & conclusion markers
  • Down-weights boilerplate text
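The down-weighting behavior can be illustrated with a toy example (an illustrative construction, not the paper's analysis): a token whose hidden state aligns with the gate direction absorbs most of the softmax mass, while near-orthogonal filler tokens contribute little:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
w_gate = rng.normal(size=d)
w_gate /= np.linalg.norm(w_gate)             # unit gate direction

filler = rng.normal(scale=0.1, size=(9, d))  # "boilerplate" tokens
decisive = 5.0 * w_gate                      # "conclusion" token (aligned)
hidden = np.vstack([filler, decisive])       # (10, d) token states

logits = hidden @ w_gate
gates = np.exp(logits - logits.max())
gates /= gates.sum()                         # softmax over tokens
print(f"gate weight on decisive token: {gates[-1]:.2f}")
```

Inspecting these gate weights on real generations is what surfaces the reasoning-step and conclusion-marker pattern the slide describes.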

8 of 8

Many Thanks!

Jizhou Guo’s homepage: https://aster2024.github.io/