Unlocking the Efficiency of LLM Inference: A Comprehensive Survey of Speculative Decoding
Heming Xia1, Zhe Yang2, Qingxiu Dong2, Peiyi Wang2, Yongqi Li1, Tao Ge3, Tianyu Liu4, Wenjie Li1, Zhifang Sui2
1Department of Computing, The Hong Kong Polytechnic University
2National Key Laboratory for Multimedia Information Processing, Peking University
3Microsoft Research Asia 4Alibaba Group
Autoregressive Decoding
Reliable, but too slow!
How to accelerate LLM inference losslessly?
🤔
Insight-1: Not all tokens need the LLM to generate them
Compressing Context to Enhance Inference Efficiency of Large Language Models. Li et al., EMNLP 2023.
Insight-2: LLM inference is memory-bound
Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon et al., SOSP 2023.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao et al., NeurIPS 2022.
Memory layout when serving an LLM with 13B parameters on NVIDIA A100. The parameters (gray) persist in GPU memory throughout serving. The memory for the KV cache (red) is (de)allocated per serving request. A small amount of memory (yellow) is used ephemerally for activation.
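A quick back-of-envelope check makes the insight concrete (a sketch with illustrative, assumed numbers, not measurements): at small batch sizes every decoding step must stream all model weights from GPU memory, so per-token latency is bounded by bandwidth rather than compute.

```python
# Illustrative numbers only (assumed, not measured).
params = 13e9              # 13B-parameter model
bytes_per_param = 2        # fp16/bf16 weights
hbm_bandwidth = 2.0e12     # ~2 TB/s HBM bandwidth on an NVIDIA A100

weight_bytes = params * bytes_per_param      # ~26 GB streamed per token
floor = weight_bytes / hbm_bandwidth         # latency floor per decoding step
print(f"~{floor * 1e3:.0f} ms/token from weight traffic alone")
# => ~13 ms/token at batch size 1: the compute units sit mostly idle,
#    which is exactly the headroom speculative decoding exploits.
```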
Insight-2: LLM inference is memory-bound
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.
Xia et al., EMNLP 2023 Findings.
Latency and peak GPU memory utilization of T5-Large on WMT14 EN-DE.
Latency and peak GPU memory utilization of T5-Large on CNN-DM.
How about using a small model to do the easy part?
Speculative Decoding
Small LM: Make a guess!
Target LLM: Let’s verify!
Teacher-forcing Recap:
Speculative Decoding
Verification: all tokens after the bifurcation position are discarded.
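A minimal sketch of one draft-then-verify step with greedy (top-1) verification; `draft_model` and `target_model` are hypothetical interfaces that map a 1-D token tensor to per-position next-token logits.

```python
import torch

def speculate_step(target_model, draft_model, prefix, gamma=5):
    """One draft-then-verify step with greedy verification (sketch)."""
    # 1) Draft: the small LM guesses gamma tokens autoregressively.
    draft = prefix.clone()
    for _ in range(gamma):
        logits = draft_model(draft)              # assumed: [seq_len, vocab]
        draft = torch.cat([draft, logits[-1].argmax().view(1)])

    # 2) Verify: one parallel forward pass of the target LLM scores
    #    every drafted position at once (exactly teacher forcing).
    target_top1 = target_model(draft)[len(prefix) - 1:].argmax(-1)

    # 3) Accept the longest drafted prefix that matches the target's
    #    top-1 tokens; everything after the bifurcation is discarded.
    drafted = draft[len(prefix):]
    n = 0
    while n < gamma and drafted[n] == target_top1[n]:
        n += 1

    # The target's own token at the bifurcation comes for free, so each
    # step emits between 1 and gamma + 1 tokens per target-LLM call.
    return torch.cat([drafted[:n], target_top1[n].view(1)])
```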
Latency Comparison
Latency Comparison
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al., arXiv 2024.
Timeline of Speculative Decoding
21 related papers have been released in 2024!
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.
Xia et al., arXiv 2024.
Key Facets of Speculative Decoding
Drafting – Efficient Speculation
Blockwise Decoding & Medusa: FFN Heads
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al., arXiv 2024.
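A sketch of the idea (sizes are assumptions, and the paper's residual wiring is simplified away): K lightweight FFN heads sit on the backbone's last hidden state, and head k predicts the token k+1 positions ahead, so one backbone forward pass yields K draft tokens.

```python
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Medusa-style drafting heads, minimal sketch (the paper's heads
    add a residual connection and are trained with the backbone frozen)."""

    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        # Head k is a small FFN predicting the token k+1 positions ahead.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden):              # [batch, hidden_size]
        # One forward pass -> K extra draft-token distributions.
        return [head(last_hidden) for head in self.heads]
```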
SpecDec: Non-Autoregressive LM
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.
Xia et al., EMNLP 2023 Findings.
EAGLE: Auto-Regression Heads
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Li et al., arXiv 2024.
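A heavily simplified sketch of EAGLE's drafting idea: autoregression happens in feature space. From the target LLM's current hidden feature and the embedding of the token just sampled, a small head predicts the next feature, and the target model's own LM head turns drafted features into token distributions. The fusion layer and the stand-in transformer block below are assumptions, not EAGLE's exact architecture.

```python
import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    """EAGLE-style feature-level autoregression, heavily simplified."""

    def __init__(self, hidden_size, nhead=8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.block = nn.TransformerEncoderLayer(  # stand-in for EAGLE's
            d_model=hidden_size, nhead=nhead,     # single decoder layer
            batch_first=True)

    def forward(self, features, token_embeds):
        # features, token_embeds: [batch, seq, hidden] (embeds shifted by one)
        fused = self.fuse(torch.cat([features, token_embeds], dim=-1))
        return self.block(fused)                  # predicted next features
```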
SpS: Small LM from the same model family
Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al., Technical Report.
Source: https://huggingface.co/blog/assisted-generation.
Lookahead Decoding: Mask-Predict & N-gram
Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding. Fu et al., arXiv 2024.
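Conceptually, drafting becomes a cache lookup. A minimal sketch, where the assumed `ngram_pool` layout stands in for the pool that Lookahead's parallel Jacobi-style iterations populate as a by-product of decoding:

```python
def lookup_ngram_drafts(ngram_pool, last_token, max_drafts=3):
    """Lookahead-style drafting as a cache lookup, minimal sketch.

    `ngram_pool` (an assumed layout) maps a token id to n-gram tails
    harvested from the model's own trajectories; matching tails become
    drafts the LLM verifies in the same decoding step.
    """
    return ngram_pool.get(last_token, [])[:max_drafts]

# e.g. pool = {42: [[7, 13, 99], [7, 55, 2]]} -> last token 42 yields
# two 3-token drafts for the LLM to verify in one forward pass.
```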
LLMA: Selected Text Span from the Input Prompt
Inference with Reference: Lossless Acceleration of Large Language Models. Yang et al., Technical Report.
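A minimal sketch of the matching step (parameter names are assumptions): if the most recent generated tokens also occur in the prompt, the tokens that follow that occurrence are copied as a draft for parallel verification. This pays off in tasks like summarization and retrieval-augmented generation, where the output overlaps heavily with the input.

```python
def match_prompt_span(prompt_ids, generated_ids, match_len=2, copy_len=8):
    """LLMA-style drafting from the prompt, minimal sketch."""
    suffix = generated_ids[-match_len:]
    for i in range(len(prompt_ids) - match_len):
        if prompt_ids[i:i + match_len] == suffix:
            # Copy the span that follows the match as the draft.
            return prompt_ids[i + match_len:i + match_len + copy_len]
    return []  # no match: fall back to plain autoregressive decoding
```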
REST: Retrieval as Drafting
REST: Retrieval-Based Speculative Decoding. He et al., arXiv 2023.
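REST replaces the draft model with retrieval over a static corpus. A toy sketch of the idea (REST itself matches context suffixes against an exact-match index over a large datastore; the dictionary index below is a stand-in for illustration):

```python
from collections import defaultdict

def build_index(corpus_ids, context_len=2, tail_len=8):
    """REST-style retrieval drafting, toy sketch of the datastore."""
    index = defaultdict(list)
    for i in range(len(corpus_ids) - context_len - tail_len):
        key = tuple(corpus_ids[i:i + context_len])
        index[key].append(corpus_ids[i + context_len:i + context_len + tail_len])
    return index

def retrieve_drafts(index, generated_ids, context_len=2):
    # The current suffix retrieves stored continuations as drafts --
    # no draft model and no extra training required.
    return index.get(tuple(generated_ids[-context_len:]), [])
```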
Drafting – Summary
Verification – Quality Control
Greedy Decoding
Exact match of the drafted tokens with the top-1 tokens of the target LLM
Greedy Decoding
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al., arXiv 2024.
Nucleus Sampling
Accelerating Large Language Model Decoding with Speculative Sampling. Chen et al., Technical Report.
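The acceptance rule from speculative sampling, as a minimal sketch: accept the drafted token x with probability min(1, p(x)/q(x)), where p and q are the target and draft distributions; on rejection, resample from the normalized residual max(p − q, 0). This makes the output distribution exactly the target LLM's, so nucleus or temperature sampling stays lossless.

```python
import torch

def verify_sampled_token(p, q, x):
    """Speculative-sampling acceptance rule, minimal sketch.

    p: target-LLM distribution over the vocabulary at this position,
    q: draft distribution, x: the token id the draft model sampled.
    """
    # Accept x with probability min(1, p(x) / q(x)).
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0);
    # overall the emitted token is distributed exactly according to p.
    residual = torch.clamp(p - q, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item(), False
```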
Token Tree Verification
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al., arXiv 2024.
Token Tree Verification
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai et al., arXiv 2024.
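The trick is an attention mask shaped like the candidate tree, so every branch is scored in one forward pass without branches attending to each other. A minimal sketch (position ids and appending the tree after the accepted prefix are simplified away):

```python
import torch

def tree_attention_mask(parents):
    """Attention mask behind token tree verification, minimal sketch.

    `parents[i]` is the tree parent of candidate token i (-1 for the
    root). Each node may attend only to its ancestors, so all branches
    share one forward pass without leaking across branches.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # enable attention to every ancestor
            mask[i, j] = True
            j = parents[j]
    return mask

# Two 2-token candidates sharing one root: root->A->B and root->C->D.
print(tree_attention_mask([-1, 0, 1, 0, 3]).int())
```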
Summary
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding.
Xia et al., arXiv 2024.
Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.
Xia et al., Blog 2024.
Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.
Xia et al., Blog 2024.
Spec-Bench
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding.
Xia et al., Blog 2024.
Summary & Future Directions
Contributions are welcome!
Thanks!