1 of 26

Decoding in Autoregressive LLMs

April 2, 2026 · Apoorv Saxena · CDS 102, IISc

1

Inception

2 of 26

About Me

Apoorv Saxena

  • Member of Technical Staff, Inception AI

Research & Background

  • Previously: Research Scientist at Adobe Research (2022–2025)
  • Focus: document QA, grounding and attribution, LLM inference
  • PhD from IISc (2018–2022), worked on knowledge graph embeddings

Open Source

  • Creator of Prompt Lookup Decoding (n-gram speculation)
  • A speculative decoding algorithm now built into every major LLM inference framework (HuggingFace, vLLM, TensorRT-LLM, …), delivering 2–4× speedups with zero model changes

This Year

  • Now at Inception AI — building diffusion language models
  • Last year I covered decoding here; this year we go further

3 of 26

Agenda

1 Basic Decoding

Greedy · Sampling · Temperature · Top-k / Top-p / Min-p

2 Speed

Speculative Decoding · Prompt Lookup Decoding

3 A Sneak Peek

Towards Diffusion Language Models

Feel free to interrupt and ask questions!

4 of 26

What is a Language Model?

  • A language model assigns probabilities to sequences of words
  • P("The cat sat on the mat") > P("The cat sat on the spacecraft") > P("The cat faslkdf alskfaslkf")

Modern Neural Language Models

  • Neural network that predicts: P(next token | all previous tokens)
  • Trained on massive text corpora to learn these distributions
  • Core operation: input a sequence of tokens → output probability distribution over vocabulary (~50k tokens)

Token ≠ Word

  • Text is first broken into sub-word tokens by a tokenizer
  • "unbelievable" → ["un", "believ", "able"] (3 tokens, not 1 word)

5 of 26

What is Autoregressive Generation?

  • "Auto" = self, "Regressive" = predicting from previous values
  • The model generates one token at a time, strictly left to right
  • Each token is conditioned on all previously generated tokens:

P(x_1, …, x_T) = ∏_t P(x_t | x_1, …, x_{t-1})

This architecture underlies GPT, LLaMA, Claude, Gemini, Mistral...

Key constraint

  • Inherently sequential: cannot generate token t without token t-1
  • Every forward pass produces exactly one new token
  • This is the key constraint we will explore in today's lecture (next class: going beyond AR)

7 of 26

From Input to Output: The Decoding Pipeline

1 Forward Pass

Run the full model on current tokens

Output: one logit score per vocabulary item (~50,000 scores per step)

2 Softmax

Convert logits to a probability distribution

P(x) = exp(logit_x) / Σ_i exp(logit_i)

All probabilities sum to 1.0

3 Decode

Choose one token from the distribution

Append it — go back to Step 1

The choice of strategy in Step 3 is what decoding is all about
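The three-step loop can be sketched in a few lines of NumPy. This is an illustrative toy, not a real inference engine: `model` stands in for a network that maps a token sequence to logits, and `decode_fn` is whatever Step-3 strategy you plug in.

```python
import numpy as np

def softmax(logits):
    # Step 2: convert logits to a probability distribution (numerically stable)
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokens, n_new, decode_fn):
    # Step 1: forward pass -> one logit per vocabulary item
    # Step 3: decode_fn picks one token id; append it and loop
    for _ in range(n_new):
        logits = model(tokens)
        probs = softmax(logits)
        tokens = tokens + [decode_fn(probs)]
    return tokens
```

Every strategy in Part 1 is just a different `decode_fn` dropped into this loop.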

8 of 26

Part 1

Basic Decoding Strategies

9 of 26

Greedy Decoding

The simplest strategy: always pick the highest-probability token

next_token = argmax P(x | context)

Properties

  • Deterministic — always produces the same output for the same input
  • Fast — no sampling overhead

Problems

  • Repetition: the model gets stuck in loops
  • "The cat sat on the cat sat on the cat sat on the..."
  • Locally optimal is not globally optimal
  • A better sequence may have required a less likely token early on
  • No diversity — same input always yields the same output
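As a minimal sketch (NumPy; `probs` is the distribution produced by the softmax step):

```python
import numpy as np

def greedy_step(probs):
    # Deterministic: always pick the single highest-probability token
    return int(np.argmax(probs))

probs = np.array([0.1, 0.7, 0.2])
greedy_step(probs)  # token 1, every single time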

10 of 26

Sampling

Instead of always picking the max, draw randomly from the distribution

next_token ~ P(x | context)

Properties

  • Stochastic — different outputs on each run
  • Escapes repetition loops naturally
  • Produces diverse, creative text

Problem

  • Low-probability tokens can still get selected
  • "The astronaut walked on the... zxqrtf"
  • The full distribution includes many nonsensical tokens

Solution

  • Control the distribution before sampling — next slides cover how
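A rough sketch of pure sampling (NumPy; the distribution is illustrative):

```python
import numpy as np

def sample_step(probs, rng):
    # Stochastic: draw a token id in proportion to its probability
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.90, 0.05])
# Mostly token 1, but tokens 0 and 2 can still be drawn:
# this tail is exactly what top-k / top-p / min-p trim away.
```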

11 of 26

Temperature Scaling

Scale logits before softmax

P(x) ∝ exp( logit_x / T )

Effect of Temperature T

  • T < 1 → sharper distribution → more focused
  • High-logit tokens become even more dominant
  • T > 1 → flatter distribution → more random
  • Differences between tokens are compressed
  • T → 0 → becomes greedy decoding
  • T = 1 → use the raw model distribution

Practical range

  • 0.7 – 1.0 for factual / analytical tasks
  • 1.0 – 1.3 for creative writing

Real-world example: T = 0.7 is DeepSeek R1's default

  • Temperature is one of the oldest tricks in the book
  • Still the first knob practitioners reach for
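A quick sketch of the scaling itself (NumPy; the logits are made up):

```python
import numpy as np

def temperature_probs(logits, T):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
# temperature_probs(logits, 0.5) puts more mass on the top token
# than temperature_probs(logits, 2.0) does
```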

12 of 26

Top-k Sampling

Restrict sampling to only the top k most probable tokens

  • Set all other probabilities to zero, renormalize, then sample

keep top k tokens by probability → renormalize → sample

Intuition

  • Prevents the model from ever choosing clearly bad tokens
  • Common choice: k = 50 (used in the original GPT-2 release)

Problem: k is fixed, but distributions vary

  • Sometimes the distribution is very peaked — top 5 tokens cover 99% probability
  • → k=50 still allows 45 unlikely tokens
  • Sometimes the distribution is spread out — top 50 only covers 60%
  • → k=50 might cut off valid options

  • The right k depends on context — a fixed k is always a compromise
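The keep/renormalize/sample recipe is a few lines (NumPy; a toy 4-token vocabulary):

```python
import numpy as np

def top_k_filter(probs, k):
    # Zero out everything outside the k most probable tokens, renormalize
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# top_k_filter(np.array([0.5, 0.3, 0.1, 0.1]), 2) keeps only
# the first two tokens, rescaled to sum to 1
```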

13 of 26

Top-p (Nucleus) Sampling

Adaptively select the smallest set of tokens with cumulative probability ≥ p

keep fewest tokens where Σ P(x_i) ≥ p → renormalize → sample

The nucleus changes size dynamically

  • When model is confident: nucleus is small (few tokens cover ≥ p)
  • When model is uncertain: nucleus is larger (need many tokens to reach p)

Directly addresses the fixed-k problem

  • Common choice: p = 0.9 or 0.95
  • Widely used in modern LLM inference — most APIs default to top-p

Limitation

  • Can still include very low-probability tokens when the distribution is flat
  • The threshold is absolute, not relative to the top token
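A sketch of the nucleus selection (NumPy; real implementations sort once and work on the sorted array, but the logic is the same):

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p
    order = np.argsort(probs)[::-1]             # most probable first
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1  # tokens needed to reach p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()
```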

14 of 26

Min-p Sampling

A newer, adaptive approach: relative probability threshold

discard token x if P(x) < min_p × P(top_token)

Key idea: threshold is relative to the top token's probability

  • If top token has P=0.60 and min_p=0.10 → discard anything below 0.06
  • If top token has P=0.05 and min_p=0.10 → discard anything below 0.005

Properties

  • Naturally adapts: strict when model is confident, permissive when uncertain
  • Computationally simple — just one comparison per token
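The one-comparison-per-token rule, sketched in NumPy (toy distribution):

```python
import numpy as np

def min_p_filter(probs, min_p):
    # Threshold scales with the model's confidence in its top token
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# With probs = [0.6, 0.25, 0.1, 0.05] and min_p = 0.1, the threshold is
# 0.06, so only the 0.05 token is discarded
```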

15 of 26

Min-p Sampling

16 of 26

Part 2

Making AR Models Faster

17 of 26

The Autoregressive Bottleneck

Every token requires a full forward pass

  • "The" takes as long to generate as "photosynthesis"
  • No parallelism possible: token t requires token t-1

Why is this slow?

  • Modern LLMs: 100B+ parameters per forward pass
  • Inference is memory-bandwidth bound

100 tokens = 100 passes

The key insight for speedups

  • Transformers compute ALL positions in parallel
  • One pass gives P(·|x_1..t) for every t at once
  • → Verify k draft tokens for the cost of generating 1!

18 of 26

Speculative Decoding — Motivation

Core idea: draft many tokens cheaply, verify them all at once

Step 1 — Draft

  • Small, fast draft model generates k tokens cheaply
  • Draft model: ~10× smaller, ~10× faster

Step 2 — Verify

  • Feed all k drafts to the large target model in ONE pass
  • Get verification probabilities for all positions at once

Step 3 — Accept or Reject

  • Accept drafts that match target distribution
  • Reject first mismatch; fall back to target model there

Typical speedup: 2–3×

Key guarantee

  • Identical output distribution to the target model alone
  • No quality loss — mathematically equivalent

20 of 26

Speculative Decoding — Why It Works

The verification step is essentially free

  • Transformer attention is parallelizable across sequence positions
  • Given a draft sequence of k tokens, one forward pass gives us P(·|x_{1..t}) for all t
  • So verifying k tokens costs the same as generating 1 token

The acceptance/rejection rule

  • If P_target(x_t) / P_draft(x_t) ≥ 1: always accept
  • Otherwise: accept with probability P_target / P_draft
  • On rejection: sample from a corrected distribution — guarantees target distribution

Trade-offs

  • Requires running two models → more GPU memory
  • Benefits tasks where draft tokens are likely correct: code, summarization, QA
  • Less benefit on open-ended generation where the draft is often wrong
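The accept/reject rule for one drafted token can be sketched as below. This is a scalar illustration: a real implementation vectorizes it across all k draft positions, and `p_target`/`p_draft` are the two models' probabilities for the drafted token.

```python
import numpy as np

def accept_draft(p_target, p_draft, rng):
    # Accept with probability min(1, p_target / p_draft):
    # certain acceptance whenever the target likes the token
    # at least as much as the draft did
    return rng.random() < min(1.0, p_target / p_draft)

def corrected_distribution(target_probs, draft_probs):
    # On rejection, resample from max(0, p_target - p_draft), renormalized;
    # this correction is what makes the combined scheme match the
    # target model's output distribution exactly
    residual = np.maximum(target_probs - draft_probs, 0.0)
    return residual / residual.sum()
```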

21 of 26

Prompt Lookup Decoding

Can we do speculative decoding without a draft model?

  • Speculative decoding requires a separate draft model → extra VRAM, operational complexity
  • Observation: for many tasks, the output contains phrases from the input
  • Summarization, document QA, code editing — output heavily overlaps with prompt

Key idea: use the prompt itself as the draft

  • Take the last n generated tokens (an n-gram)
  • Search the prompt for a matching n-gram
  • If found: the continuation in the prompt becomes the draft
  • Pass to the target model for verification — same accept/reject as speculative decoding

Properties

  • Zero extra model, zero extra VRAM — just a string search (O(n))
  • Works whenever the output reuses input phrasing
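The n-gram lookup itself fits in a few lines. A sketch over token-id lists, with illustrative defaults (`ngram_size`, `max_draft`); production implementations typically try several n-gram sizes, longest first:

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, max_draft=10):
    # Match the last n generated tokens against the prompt; if found,
    # the tokens that follow the match become the draft
    pattern = generated_ids[-ngram_size:]
    for i in range(len(prompt_ids) - ngram_size):
        if prompt_ids[i:i + ngram_size] == pattern:
            start = i + ngram_size
            return prompt_ids[start:start + max_draft]
    return []  # no match: fall back to normal one-token decoding
```

For example, with prompt `[1, 2, 3, 4, 5, 6, 7]` and generated tokens ending in `[2, 3, 4]`, the lookup proposes `[5, 6, 7]` as the draft.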

23 of 26

Prompt Lookup Decoding — Results

24 of 26

All these methods accept one constraint: one token at a time.

What if we could generate all tokens in parallel through iterative refinement?

25 of 26

Recap & Key Takeaways

Decoding basics

  • LMs output distributions; decoding turns that into text
  • Autoregressive: one token at a time, left to right

Controlling quality

  • Greedy: fast, deterministic, prone to repetition
  • Temperature / top-k / top-p / min-p: diverse, controllable
  • These compose: use temperature + top-p together

Controlling speed

  • Speculative decoding: draft + verify → 2–3× speedup
  • Prompt Lookup Decoding: use prompt as draft — no extra model

The big picture

  • All these methods work within the AR constraint
  • Next lecture (Apr 7): what if we changed the generation paradigm entirely?
  • → Discrete Diffusion Language Models

26 of 26

Thank you

apoorv@inceptionlabs.ai
