1 of 35

Unlocking the Efficiency of LLM Inference: A Comprehensive Survey of Speculative Decoding

Heming Xia1, Zhe Yang2, Qingxiu Dong2, Peiyi Wang2, Yongqi Li1, Tao Ge3, Tianyu Liu4, Wenjie Li1, Zhifang Sui2

1Department of Computing, The Hong Kong Polytechnic University

2National Key Laboratory for Multimedia Information Processing, Peking University

3Microsoft Research Asia 4Alibaba Group

2 of 35

Autoregressive Decoding

  • Training:
    • Teacher Forcing

  • Inference:
    • Generate token-by-token

Reliable, but too slow!

3 of 35

How to accelerate LLM inference losslessly?

🤔

4 of 35

Insight-1: Not all tokens need LLM to generate

Compressing Context to Enhance Inference Efficiency of Large Language Models. Li et al. EMNLP 2023.

5 of 35

Insight-2: LLM inference is memory-bound

Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon et al. SOSP 2023.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao et al. NeurIPS 2022.

Memory layout when serving an LLM with 13B parameters on NVIDIA A100. The parameters (gray) persist in GPU memory throughout serving. The memory for the KV cache (red) is (de)allocated per serving request. A small amount of memory (yellow) is used ephemerally for activation.
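
To see why the KV cache dominates the dynamic memory, a back-of-the-envelope size estimate helps (the dimensions below are illustrative assumptions for a ~13B-class decoder-only model, not the configuration of any specific checkpoint):

```python
# Rough KV-cache footprint for one sequence in fp16.
# All dimensions are illustrative assumptions, not a real model config.
num_layers = 40
num_kv_heads = 40
head_dim = 128
bytes_per_value = 2          # fp16
seq_len = 2048

# Each token stores a key and a value vector in every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_total = kv_bytes_per_token * seq_len

print(f"~{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_bytes_total / 1024**3:.2f} GiB for a {seq_len}-token sequence")
```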

6 of 35

Insight-2: LLM inference is memory-bound

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.

Xia et al. EMNLP 2023 Findings.

  • Inefficiency:
    • Most of the time is spent re-loading LLM parameters from HBM into the GPU's on-chip cache, rather than on arithmetic computation (see the back-of-the-envelope estimate below)
  • Under-utilization of GPUs:
    • The bottleneck of online inference is latency (e.g., ~1 s) rather than GPU memory

Latency and peak GPU memory utilization of T5-Large on WMT14 EN-DE.

Latency and peak GPU memory utilization of T5-Large on CNN-DM.
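
A quick sanity check of the memory-bound claim (illustrative numbers, not measurements; peak bandwidth and achievable utilization vary by GPU):

```python
# At small batch sizes every decoding step must stream all model weights
# from HBM, so per-step latency is roughly bytes_moved / memory_bandwidth.
params = 13e9                # 13B-parameter model (assumption)
bytes_per_param = 2          # fp16 weights
hbm_bandwidth = 2.0e12       # ~2 TB/s peak for an A100-class GPU (assumption)

weight_bytes = params * bytes_per_param
step_latency_s = weight_bytes / hbm_bandwidth   # lower bound; ignores compute and KV-cache traffic
print(f"~{step_latency_s * 1e3:.0f} ms per decoding step just to re-read the weights")
```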

7 of 35

How about using a small model to do the easy part?

8 of 35

Speculative Decoding

Small LM: Make a guess!

Target LLM: Let’s verify!

Teacher-forcing Recap:

9 of 35

Speculative Decoding

Verification: all tokens after the bifurcation position are discarded.
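
Put together, the draft-then-verify loop looks roughly like the sketch below (a minimal pseudo-PyTorch illustration with greedy verification, assuming Hugging Face-style causal LMs whose forward pass returns `.logits`; it is not the implementation of any particular paper, and it recomputes the full prefix instead of reusing a KV cache for brevity):

```python
import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, input_ids, k=5, max_new_tokens=128):
    """Minimal draft-then-verify loop with greedy (lossless) verification."""
    seq = input_ids                                            # shape (1, prompt_len)
    while seq.shape[-1] - input_ids.shape[-1] < max_new_tokens:
        # 1) Draft: the small LM proposes k tokens autoregressively.
        draft = seq
        for _ in range(k):
            next_tok = draft_model(draft).logits[:, -1, :].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)
        drafted = draft[:, seq.shape[-1]:]                     # (1, k) proposed tokens

        # 2) Verify: a single forward pass of the target LLM over the drafted block.
        logits = target_model(draft).logits                    # (1, len(draft), vocab)
        target_top1 = logits[:, seq.shape[-1] - 1:-1, :].argmax(-1)  # target's choices at the k positions

        # 3) Accept the longest matching prefix, discard everything after the
        #    bifurcation, and append one "bonus" token from the target there.
        matches = (drafted == target_top1).long()[0]
        n_accept = int(matches.cumprod(0).sum())
        bonus = logits[:, seq.shape[-1] - 1 + n_accept, :].argmax(-1, keepdim=True)
        seq = torch.cat([seq, drafted[:, :n_accept], bonus], dim=-1)
    return seq
```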

10 of 35

Latency Comparison

11 of 35

Latency Comparison

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

12 of 35

Timeline of Speculative Decoding

21 related papers have been released in 2024!

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.

Xia et al. arXiv 2024.

13 of 35

Key Facets of Speculative Decoding

  • Drafting:
    • Drafting efficiency
    • Speculation accuracy
  • Verification:
    • Quality of final outputs
    • Token acceptance rate (see the back-of-the-envelope model below)

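As a back-of-the-envelope model of how these facets trade off, assume each drafted token is accepted independently with probability α (real drafts only approximate this): the expected number of tokens emitted per target forward pass is (1 − α^(k+1)) / (1 − α).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-verify cycle, assuming each of the k
    drafted tokens is accepted i.i.d. with probability alpha, plus one token
    from the target at the first rejection (or after full acceptance)."""
    if alpha >= 1.0:
        return k + 1
    # 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Example: 80% per-token acceptance and k = 5 drafted tokens
# yield ~3.7 tokens per target forward pass instead of 1.
print(round(expected_tokens_per_step(0.8, 5), 2))   # 3.69
```
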
14 of 35

Drafting – Efficient Speculation

  • Challenges:
    • Drafting efficiency & Speculation accuracy
  • Methods:
    • Independent Drafting:
      • Small LM, Non-Auto LM, Retrieved Docs…
    • Self-Drafting:
      • Lightweight Heads, Early-Exiting / Layer Skipping, Mask-Predict…

15 of 35

Blockwise Decoding & Medusa: FFN Heads

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

16 of 35

SpecDec: Non-Auto LM

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.

Xia et al. EMNLP 2023 Findings.

17 of 35

EAGLE: Auto-Regression Heads

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Li et al. arXiv 2024.

18 of 35

SpS: Small LM from the Same Model Series

Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al. Technical Report.

Source: https://huggingface.co/blog/assisted-generation.
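
The blog linked above exposes this idea in `transformers` as assisted generation; a minimal usage sketch (the model names are placeholders for a target LLM and a much smaller drafter sharing its tokenizer, and the `assistant_model` argument requires a reasonably recent `transformers` release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: any target LLM plus a small drafter with the same tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto")

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt").to(target.device)
# Passing the small model as assistant_model turns on draft-then-verify generation.
outputs = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```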

19 of 35

Lookahead Decoding: Mask-Predict & N-gram

Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding. Fu et al. arXiv 2024.

20 of 35

LLMA: Selected Text Span from the Input Prompt

Inference with Reference: Lossless Acceleration of Large Language Models. Yang et al. Technical Report.
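
The core idea can be sketched as a simple n-gram match against the prompt: when the last few generated tokens also occur in the input, the following span is copied as the draft and handed to the LLM for verification (a simplified illustration, not the authors' implementation; `ngram` and `span_len` are hypothetical knobs):

```python
def draft_from_prompt(prompt_ids, generated_ids, ngram=4, span_len=8):
    """Propose a draft by copying the prompt span that continues the last
    `ngram` generated tokens (simplified reference-based drafting)."""
    suffix = generated_ids[-ngram:]
    for start in range(len(prompt_ids) - ngram):
        if prompt_ids[start:start + ngram] == suffix:
            copy_from = start + ngram
            return prompt_ids[copy_from:copy_from + span_len]  # draft to be verified
    return []  # no match: fall back to normal autoregressive decoding

# Example: the generated text ends with tokens that also appear in the prompt,
# so the prompt tokens that follow them are proposed as the draft.
prompt = [5, 6, 7, 8, 9, 10, 11, 12]
generated = [1, 2, 7, 8]
print(draft_from_prompt(prompt, generated, ngram=2, span_len=3))   # -> [9, 10, 11]
```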

21 of 35

REST: Retrieval as Drafting

REST: Retrieval-Based Speculative Decoding. He et al. arXiv 2023.

22 of 35

Drafting – Summary

  • Accuracy & Efficiency Tradeoff:
    • Model Scale & Architecture
    • Behavior Alignment
  • The Ease of Deployment:
    • Distributed Inference
    • Tuning-free
    • Plug and Play

23 of 35

Verification – Quality Control

  • Greedy Decoding:
    • Identical to the LLM's greedy results
  • Nucleus Sampling:
    • Preserves the LLM's output distribution
  • Token Tree Verification:
    • Verification of multiple draft sequences in parallel

24 of 35

Greedy Decoding

Exact match of the drafted tokens with the Top-1 tokens of the target LLM
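
In code, this acceptance rule reduces to a prefix match against the target's argmax predictions (a minimal sketch; `drafted` and `target_logits` are assumed to be aligned position by position):

```python
import torch

def greedy_verify(drafted: torch.Tensor, target_logits: torch.Tensor) -> int:
    """Number of drafted tokens to accept under greedy (lossless) verification.

    drafted:       (k,) token ids proposed by the drafter
    target_logits: (k, vocab) target-LLM logits at the corresponding positions
    """
    target_top1 = target_logits.argmax(dim=-1)     # the target's own greedy choices
    matches = (drafted == target_top1).long()
    return int(matches.cumprod(dim=0).sum())       # length of the matching prefix
```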

25 of 35

Greedy Decoding

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

26 of 35

Nucleus Sampling

Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al. Technical Report.
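
For sampling, verification follows the modified rejection-sampling rule of speculative sampling, which provably leaves the target distribution unchanged; a minimal per-token sketch, assuming `p_target` and `q_draft` are the two models' probability vectors at the current position:

```python
import torch

def accept_or_resample(drafted_token: int,
                       p_target: torch.Tensor,
                       q_draft: torch.Tensor) -> tuple[bool, int]:
    """Speculative-sampling verification for a single drafted token.

    Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the residual distribution max(p - q, 0), renormalized.
    The resulting tokens are distributed exactly as samples from p_target.
    """
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return True, drafted_token
    residual = torch.clamp(p_target - q_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, num_samples=1))
```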

27 of 35

Token Tree Verification

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.
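
Tree verification flattens several candidate continuations into one sequence and applies a tree-structured attention mask so that each candidate token only attends to its own ancestors; a minimal mask-construction sketch (parent indices are a simplified representation, not Medusa's exact data structure):

```python
import torch

def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a boolean attention mask for a flattened candidate token tree.

    parents[i] is the index of token i's parent in the flattened tree,
    or -1 if token i hangs directly off the verified prefix.
    mask[i, j] is True iff token i may attend to token j (an ancestor or itself).
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking every ancestor
            mask[i, j] = True
            j = parents[j]
    return mask

# Two candidate branches sharing a common first token:
#   prefix -> t0 -> t1        (branch A)
#   prefix -> t0 -> t2 -> t3  (branch B)
print(build_tree_attention_mask([-1, 0, 0, 2]).int())
```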

28 of 35

Token Tree Verification

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

29 of 35

Summary

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.

Xia et al. arXiv 2024.

30 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

31 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

32 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

33 of 35

Summary & Future Directions

  • Speculation Accuracy
    • Behavior Alignment and Beyond
  • Batched Inference
    • Throughput & Latency
  • Application Scenarios
    • Controllable Text Generation, e.g., Speculative Contrastive Decoding
    • Multimodal Scenarios, etc.

34 of 35

Contributions are welcome!

35 of 35

Thanks!