Decoding in Autoregressive LLMs
April 2, 2026 · Apoorv Saxena · CDS 102, IISc
1
Inception
About Me
Apoorv Saxena
Research & Background
Open Source
This Year
Agenda
1 Basic Decoding
Greedy · Sampling · Temperature · Top-k / Top-p / Min-p
2 Speed
Speculative Decoding · Prompt Lookup Decoding
3 A Sneak Peek
Towards Diffusion Language Models
Feel free to interrupt and ask questions!
What is a Language Model?
Modern Neural Language Models
Token ≠ Word
What is Autoregressive Generation?
This architecture underlies GPT, LLaMA, Claude, Gemini, Mistral...
Key constraint
From Input to Output: The Decoding Pipeline
1
Forward Pass
Run the full model on current tokens
Output: one logit score per vocabulary item
(~50,000 scores per step)
2
Softmax
Convert logits to a probability distribution
P(x) = exp(logit_x) / Σ_i exp(logit_i)
All probabilities sum to 1.0
3
Decode
Choose one token from the distribution
Append it — go back to Step 1
The choice of strategy in Step 3 is what decoding is all about
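The three-step loop above can be sketched in a few lines of NumPy. Here `model` is a hypothetical callable standing in for a real LLM: it maps the current token sequence to one logit per vocabulary item. Step 3 uses the greedy choice for simplicity.

```python
import numpy as np

def softmax(logits):
    # Step 2: convert logits to a probability distribution
    z = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def generate(model, tokens, n_steps):
    # model(tokens) -> one logit per vocabulary item (hypothetical interface)
    for _ in range(n_steps):
        logits = model(tokens)                # Step 1: forward pass
        probs = softmax(logits)               # Step 2: softmax
        next_token = int(np.argmax(probs))    # Step 3: decode (greedy here)
        tokens = tokens + [next_token]        # append and go back to Step 1
    return tokens
```

Swapping out the `argmax` line for a different selection rule gives every strategy in Part 1.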
Part 1
Basic Decoding Strategies
Greedy Decoding
The simplest strategy: always pick the highest-probability token
next_token = argmax P(x | context)
Properties
Problems
Sampling
Instead of always picking the max, draw randomly from the distribution
next_token ~ P(x | context)
Properties
Problem
Solution
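A minimal sampling step, sketched with NumPy. The three-token distribution is a made-up example; over many draws the sampling frequencies track the distribution, unlike greedy decoding, which would return token 0 every time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(probs):
    # Draw a token index at random, weighted by the model's distribution
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.7, 0.2, 0.1])   # toy distribution
counts = np.bincount([sample(probs) for _ in range(10_000)], minlength=3)
```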
Temperature Scaling
Scale logits before softmax
P(x) ∝ exp( logit_x / T )
Effect of Temperature T
Practical range
Real-world example
T = 0.7
DeepSeek R1 default
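Temperature is a one-line change to the softmax. A quick sketch with toy logits, showing that T < 1 sharpens the distribution and T > 1 flattens it:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    z = logits / T        # scale logits before the softmax
    z -= z.max()          # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])                 # toy logits
sharp = softmax_with_temperature(logits, 0.5)      # T < 1: sharper
base  = softmax_with_temperature(logits, 1.0)      # plain softmax
flat  = softmax_with_temperature(logits, 2.0)      # T > 1: flatter
```

As T → 0 this approaches greedy decoding; as T → ∞ it approaches a uniform distribution.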
Top-k Sampling
Restrict sampling to only the top k most probable tokens
keep top k tokens by probability → renormalize → sample
Intuition
Problem: k is fixed, but distributions vary
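The keep-renormalize-sample recipe, sketched in NumPy on a toy four-token distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    # Keep the k highest-probability tokens, zero out everything else
    keep = np.argpartition(probs, -k)[-k:]
    q = np.zeros_like(probs)
    q[keep] = probs[keep]
    q /= q.sum()              # renormalize over the survivors
    return int(rng.choice(len(q), p=q))

probs = np.array([0.5, 0.3, 0.15, 0.05])   # toy distribution
draws = {top_k_sample(probs, 2) for _ in range(1_000)}
```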
Top-p (Nucleus) Sampling
Adaptively select the smallest set of tokens with cumulative probability ≥ p
keep fewest tokens where Σ P(x_i) ≥ p → renormalize → sample
The nucleus changes size dynamically
Directly addresses the fixed-k problem
Limitation
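A sketch of nucleus selection: sort by probability, take the smallest prefix whose cumulative mass reaches p, then renormalize. The peaked toy distribution shows the nucleus collapsing to a single token:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p):
    order = np.argsort(probs)[::-1]            # tokens, most probable first
    csum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches p: the nucleus
    cutoff = int(np.searchsorted(csum, p)) + 1
    q = np.zeros_like(probs)
    q[order[:cutoff]] = probs[order[:cutoff]]
    q /= q.sum()                               # renormalize over the nucleus
    return int(rng.choice(len(q), p=q))

# A peaked distribution keeps a tiny nucleus; a flat one would keep many tokens
peaked = np.array([0.90, 0.05, 0.03, 0.02])
draws = {top_p_sample(peaked, 0.9) for _ in range(1_000)}
```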
Min-p Sampling
A newer, adaptive approach: relative probability threshold
discard token x if P(x) < min_p × P(top_token)
Key idea: threshold is relative to the top token's probability
Properties
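The relative-threshold rule in a few lines of NumPy. The toy distribution below models a confident step: with a top token at 0.8 and min_p = 0.1, the cutoff is 0.08 and the long tail is pruned while the one plausible alternative survives:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_p_sample(probs, min_p):
    # Threshold scales with the model's confidence in its top token
    threshold = min_p * probs.max()
    q = np.where(probs >= threshold, probs, 0.0)
    q /= q.sum()              # renormalize over the survivors
    return int(rng.choice(len(q), p=q))

confident = np.array([0.80, 0.10, 0.06, 0.04])   # toy confident step
draws = {min_p_sample(confident, 0.1) for _ in range(1_000)}
```

When the model is uncertain (flat distribution), the same min_p keeps many more tokens, which is exactly the adaptivity top-k lacks.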
Part 2
Making AR Models Faster
The Autoregressive Bottleneck
Every token requires a full forward pass
Why is this slow?
100 tokens = 100 passes
The key insight for speedups
Speculative Decoding — Motivation
Core idea: draft many tokens cheaply, verify them all at once
Step 1 — Draft
Step 2 — Verify
Step 3 — Accept or Reject
2–3×
typical speedup
Key guarantee
Speculative Decoding — Why It Works
The verification step is essentially free
The acceptance/rejection rule
Trade-offs
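A simplified, single-sequence sketch of the accept/reject rule, following the standard speculative-sampling formulation: accept drafted token x with probability min(1, p_target(x)/p_draft(x)); on the first rejection, resample from the residual distribution max(0, p_target − p_draft). The distributions here are toy arrays standing in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_tokens, p_draft, p_target):
    """p_draft[i], p_target[i]: distributions both models assign at step i.
    Returns the accepted prefix, with one corrected token on rejection."""
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept x with probability min(1, p_target(x) / p_draft(x))
        if rng.random() < min(1.0, p_target[i][x] / p_draft[i][x]):
            out.append(x)                       # target model agrees: keep it
        else:
            # Resample from the residual so the output distribution
            # still matches the target model exactly
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            break                               # later drafts are now invalid
    return out

# Toy case: target fully agrees at step 0, fully disagrees at step 1
p_draft  = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
p_target = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
result = verify([0, 0], p_draft, p_target)
```

This accept/reject construction is what makes the key guarantee hold: the final token stream is distributed exactly as if the target model had decoded alone.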
Prompt Lookup Decoding
Can we do speculative decoding without a draft model?
Key idea: use the prompt itself as the draft
Properties
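A minimal sketch of the n-gram matching idea (parameter names are illustrative, not any library's API): find the most recent earlier occurrence of the last n-gram in the context, and propose whatever followed it as the draft.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    # Use the most recent n-gram as a search pattern over the context itself
    pattern = tokens[-ngram_size:]
    # Scan backwards for an earlier occurrence of the same n-gram
    for i in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[i:i + ngram_size] == pattern:
            # Draft = whatever followed the match last time
            draft = tokens[i + ngram_size : i + ngram_size + num_draft]
            if draft:
                return draft
    return []   # no match: fall back to ordinary one-token decoding

# Repetitive contexts (code, quoted text, summaries) give useful drafts
tokens = [1, 2, 3, 4, 1, 2, 3]
draft = prompt_lookup_draft(tokens, ngram_size=3, num_draft=4)
```

The drafted tokens still go through the same verify step as in speculative decoding, so correctness is preserved even when the lookup guesses wrong.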
Prompt Lookup Decoding — Results
All these methods accept one constraint: one token at a time.
What if we could generate all tokens in parallel through iterative refinement?
Recap & Key Takeaways
Decoding basics
Controlling quality
Controlling speed
The big picture
Thank you
apoorv@inceptionlabs.ai