1 of 35

Unlocking the Efficiency of LLM Inference: A Comprehensive Survey of Speculative Decoding

Heming Xia1, Zhe Yang2, Qingxiu Dong2, Peiyi Wang2, Yongqi Li1, Tao Ge3, Tianyu Liu4, Wenjie Li1, Zhifang Sui2

1Department of Computing, The Hong Kong Polytechnic University

2National Key Laboratory for Multimedia Information Processing, Peking University

3Microsoft Research Asia 4Alibaba Group

2 of 35

Autoregressive Decoding

  • Training:
    • Teacher Forcing

  • Inference:
    • Generate token-by-token

Reliable, but too slow!

3 of 35

How to accelerate LLM inference losslessly?

🤔

4 of 35

Insight-1: Not all tokens need LLM to generate

Compressing Context to Enhance Inference Efficiency of Large Language Models. Li et al. EMNLP 2023.

5 of 35

Insight-2: LLM inference is memory-bound

Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon et al. SOSP 2023.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao et al. NeurIPS 2022.

Memory layout when serving an LLM with 13B parameters on NVIDIA A100. The parameters (gray) persist in GPU memory throughout serving. The memory for the KV cache (red) is (de)allocated per serving request. A small amount of memory (yellow) is used ephemerally for activation.
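
To see why the KV cache dominates the dynamic memory, a back-of-the-envelope size estimate helps (the dimensions below are illustrative assumptions for a ~13B-class decoder-only model, not the configuration of any specific checkpoint):

```python
# Rough KV-cache footprint for one sequence in fp16.
# All dimensions are illustrative assumptions, not a real model config.
num_layers = 40
num_kv_heads = 40
head_dim = 128
bytes_per_value = 2          # fp16
seq_len = 2048

# Each token stores a key and a value vector in every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_total = kv_bytes_per_token * seq_len

print(f"~{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_bytes_total / 1024**3:.2f} GiB for a {seq_len}-token sequence")
```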

6 of 35

Insight-2: LLM inference is memory-bound

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.

Xia et al. EMNLP 2023 Findings.

  • Inefficiency:
    • Most of the time is spent re-loading LLM parameters from HBM into the GPU's on-chip cache, rather than on arithmetic computation (see the back-of-the-envelope estimate below)
  • Under-utilization of GPUs:
    • The bottleneck of online inference is latency (e.g., ~1 s) rather than GPU memory

Latency and peak GPU memory utilization of T5-Large on WMT14 EN-DE.

Latency and peak GPU memory utilization of T5-Large on CNN-DM.
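
A quick sanity check of the memory-bound claim (illustrative numbers, not measurements; peak bandwidth and achievable utilization vary by GPU):

```python
# At small batch sizes every decoding step must stream all model weights
# from HBM, so per-step latency is roughly bytes_moved / memory_bandwidth.
params = 13e9                # 13B-parameter model (assumption)
bytes_per_param = 2          # fp16 weights
hbm_bandwidth = 2.0e12       # ~2 TB/s peak for an A100-class GPU (assumption)

weight_bytes = params * bytes_per_param
step_latency_s = weight_bytes / hbm_bandwidth   # lower bound; ignores compute and KV-cache traffic
print(f"~{step_latency_s * 1e3:.0f} ms per decoding step just to re-read the weights")
```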

7 of 35

How about using a small model to do the easy part?

8 of 35

Speculative Decoding

Small LM: Make a guess!

Target LLM: Let’s verify!

Teacher-forcing Recap:

9 of 35

Speculative Decoding

Verification: all tokens after the bifurcation position are discarded.
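
Put together, the draft-then-verify loop looks roughly like the sketch below (a minimal pseudo-PyTorch illustration with greedy verification, assuming Hugging Face-style causal LMs whose forward pass returns `.logits`; it is not the implementation of any particular paper, and it recomputes the full prefix instead of reusing a KV cache for brevity):

```python
import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, input_ids, k=5, max_new_tokens=128):
    """Minimal draft-then-verify loop with greedy (lossless) verification."""
    seq = input_ids                                            # shape (1, prompt_len)
    while seq.shape[-1] - input_ids.shape[-1] < max_new_tokens:
        # 1) Draft: the small LM proposes k tokens autoregressively.
        draft = seq
        for _ in range(k):
            next_tok = draft_model(draft).logits[:, -1, :].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)
        drafted = draft[:, seq.shape[-1]:]                     # (1, k) proposed tokens

        # 2) Verify: a single forward pass of the target LLM over the drafted block.
        logits = target_model(draft).logits                    # (1, len(draft), vocab)
        target_top1 = logits[:, seq.shape[-1] - 1:-1, :].argmax(-1)  # target's choices at the k positions

        # 3) Accept the longest matching prefix, discard everything after the
        #    bifurcation, and append one "bonus" token from the target there.
        matches = (drafted == target_top1).long()[0]
        n_accept = int(matches.cumprod(0).sum())
        bonus = logits[:, seq.shape[-1] - 1 + n_accept, :].argmax(-1, keepdim=True)
        seq = torch.cat([seq, drafted[:, :n_accept], bonus], dim=-1)
    return seq
```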

10 of 35

Latency Comparison

11 of 35

Latency Comparison

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

12 of 35

Timeline of Speculative Decoding

21 related papers have been released in 2024!

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.

Xia et al. arXiv 2024.

13 of 35

Key Facets of Speculative Decoding

  • Drafting:
    • Drafting efficiency
    • Speculation accuracy
  • Verification:
    • Quality of final outputs
    • Token acceptance rate (see the back-of-the-envelope model below)

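As a back-of-the-envelope model of how these facets trade off, assume each drafted token is accepted independently with probability α (real drafts only approximate this): the expected number of tokens emitted per target forward pass is (1 − α^(k+1)) / (1 − α).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-verify cycle, assuming each of the k
    drafted tokens is accepted i.i.d. with probability alpha, plus one token
    from the target at the first rejection (or after full acceptance)."""
    if alpha >= 1.0:
        return k + 1
    # 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Example: 80% per-token acceptance and k = 5 drafted tokens
# yield ~3.7 tokens per target forward pass instead of 1.
print(round(expected_tokens_per_step(0.8, 5), 2))   # 3.69
```
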
14 of 35

Drafting – Efficient Speculation

  • Challenges:
    • Drafting efficiency & Speculation accuracy
  • Methods:
    • Independent Drafting:
      • Small LM, Non-Auto LM, Retrieved Docs…
    • Self-Drafting:
      • Lightweight Heads, Early-Exiting / Layer Skipping, Mask-Predict…

15 of 35

Blockwise Decoding & Medusa: FFN Heads

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

16 of 35

SpecDec: Non-Auto LM

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.

Xia et al. EMNLP 2023 Findings.

17 of 35

EAGLE: Auto-Regression Heads

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Li et al. arXiv 2024.

18 of 35

SpS: Small LM from the Same Model Series

Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al. Technical Report.

Source: https://huggingface.co/blog/assisted-generation.
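
The blog linked above exposes this idea in `transformers` as assisted generation; a minimal usage sketch (the model names are placeholders for a target LLM and a much smaller drafter sharing its tokenizer, and the `assistant_model` argument requires a reasonably recent `transformers` release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: any target LLM plus a small drafter with the same tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto")

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt").to(target.device)
# Passing the small model as assistant_model turns on draft-then-verify generation.
outputs = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```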

19 of 35

Lookahead Decoding: Mask-Predict & N-gram

Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding. Fu et al. arXiv 2024.

20 of 35

LLMA: Selected Text Span from the Input Prompt

Inference with Reference: Lossless Acceleration of Large Language Models. Yang et al. Technical Report.
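
The core idea can be sketched as a simple n-gram match against the prompt: when the last few generated tokens also occur in the input, the following span is copied as the draft and handed to the LLM for verification (a simplified illustration, not the authors' implementation; `ngram` and `span_len` are hypothetical knobs):

```python
def draft_from_prompt(prompt_ids, generated_ids, ngram=4, span_len=8):
    """Propose a draft by copying the prompt span that continues the last
    `ngram` generated tokens (simplified reference-based drafting)."""
    suffix = generated_ids[-ngram:]
    for start in range(len(prompt_ids) - ngram):
        if prompt_ids[start:start + ngram] == suffix:
            copy_from = start + ngram
            return prompt_ids[copy_from:copy_from + span_len]  # draft to be verified
    return []  # no match: fall back to normal autoregressive decoding

# Example: the generated text ends with tokens that also appear in the prompt,
# so the prompt tokens that follow them are proposed as the draft.
prompt = [5, 6, 7, 8, 9, 10, 11, 12]
generated = [1, 2, 7, 8]
print(draft_from_prompt(prompt, generated, ngram=2, span_len=3))   # -> [9, 10, 11]
```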

21 of 35

REST: Retrieval as Drafting

REST: Retrieval-Based Speculative Decoding. He et al. arXiv 2023.

22 of 35

Drafting – Summary

  • Accuracy & Efficiency Tradeoff:
    • Model Scale & Architecture
    • Behavior Alignment
  • The Ease of Deployment:
    • Distributed Inference
    • Tuning-free
    • Plug and Play

23 of 35

Verification – Quality Control

  • Greedy Decoding:
    • Identical to the LLM's greedy results
  • Nucleus Sampling:
    • Preserves the LLM's output distribution
  • Token Tree Verification:
    • Verification of multiple draft sequences in parallel

24 of 35

Greedy Decoding

Exact match of the drafted tokens with the Top-1 tokens of the target LLM
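
In code, this acceptance rule reduces to a prefix match against the target's argmax predictions (a minimal sketch; `drafted` and `target_logits` are assumed to be aligned position by position):

```python
import torch

def greedy_verify(drafted: torch.Tensor, target_logits: torch.Tensor) -> int:
    """Number of drafted tokens to accept under greedy (lossless) verification.

    drafted:       (k,) token ids proposed by the drafter
    target_logits: (k, vocab) target-LLM logits at the corresponding positions
    """
    target_top1 = target_logits.argmax(dim=-1)     # the target's own greedy choices
    matches = (drafted == target_top1).long()
    return int(matches.cumprod(dim=0).sum())       # length of the matching prefix
```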

25 of 35

Greedy Decoding

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

26 of 35

Nucleus Sampling

Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al. Technical Report.
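
For sampling, verification follows the modified rejection-sampling rule of speculative sampling, which provably leaves the target distribution unchanged; a minimal per-token sketch, assuming `p_target` and `q_draft` are the two models' probability vectors at the current position:

```python
import torch

def accept_or_resample(drafted_token: int,
                       p_target: torch.Tensor,
                       q_draft: torch.Tensor) -> tuple[bool, int]:
    """Speculative-sampling verification for a single drafted token.

    Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the residual distribution max(p - q, 0), renormalized.
    The resulting tokens are distributed exactly as samples from p_target.
    """
    p, q = p_target[drafted_token], q_draft[drafted_token]
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return True, drafted_token
    residual = torch.clamp(p_target - q_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, num_samples=1))
```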

27 of 35

Token Tree Verification

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.
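
Tree verification flattens several candidate continuations into one sequence and applies a tree-structured attention mask so that each candidate token only attends to its own ancestors; a minimal mask-construction sketch (parent indices are a simplified representation, not Medusa's exact data structure):

```python
import torch

def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a boolean attention mask for a flattened candidate token tree.

    parents[i] is the index of token i's parent in the flattened tree,
    or -1 if token i hangs directly off the verified prefix.
    mask[i, j] is True iff token i may attend to token j (an ancestor or itself).
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking every ancestor
            mask[i, j] = True
            j = parents[j]
    return mask

# Two candidate branches sharing a common first token:
#   prefix -> t0 -> t1        (branch A)
#   prefix -> t0 -> t2 -> t3  (branch B)
print(build_tree_attention_mask([-1, 0, 0, 2]).int())
```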

28 of 35

Token Tree Verification

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al. arXiv 2024.

29 of 35

Summary

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.

Xia et al. arXiv 2024.

30 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

31 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

32 of 35

Spec-Bench

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.

Xia et al. Blog 2024.

33 of 35

Summary & Future Directions

  • Speculation Accuracy
    • Behavior Alignment and Beyond
  • Batched Inference
    • Throughput & Latency
  • Application Scenarios
    • Controllable Text Generation, e.g., Speculative Contrastive Decoding
    • Multimodal Scenarios, etc.

34 of 35

Contributions are welcome!

35 of 35

Thanks!