1 of 26

Decoding in Autoregressive LLMs

April 2, 2026 · Apoorv Saxena · CDS 102, IISc

1

Inception

2 of 26

About Me

Apoorv Saxena

  • Member of Technical Staff, Inception AI

Research & Background

  • Previously: Research Scientist at Adobe Research (2022–2025)
  • Focus: document QA, grounding and attribution, LLM inference
  • PhD from IISc (2018–2022), worked on knowledge graph embeddings

Open Source

  • Creator of Prompt Lookup Decoding (n-gram speculation)
  • A speculative decoding algorithm now built into every major LLM inference framework (HuggingFace, vLLM, TensorRT-LLM, …), delivering 2–4× speedups with zero model changes

This Year

  • Now at Inception AI — building diffusion language models
  • Last year I covered decoding here; this year we go further

3 of 26

Agenda

1 Basic Decoding

Greedy · Sampling · Temperature · Top-k / Top-p / Min-p

2 Speed

Speculative Decoding · Prompt Lookup Decoding

3 A Sneak Peek

Towards Diffusion Language Models

Feel free to interrupt and ask questions!

4 of 26

What is a Language Model?

  • A language model assigns probabilities to sequences of words
  • P("The cat sat on the mat") > P("The cat sat on the spacecraft") > P("The cat faslkdf alskfaslkf")

Modern Neural Language Models

  • Neural network that predicts: P(next token | all previous tokens)
  • Trained on massive text corpora to learn these distributions
  • Core operation: input a sequence of tokens → output probability distribution over vocabulary (~50k tokens)

Token ≠ Word

  • Text is first broken into sub-word tokens by a tokenizer
  • "unbelievable" → ["un", "believ", "able"] (3 tokens, not 1 word)

5 of 26

What is Autoregressive Generation?

  • "Auto" = self, "Regressive" = predicting from previous values
  • The model generates one token at a time, strictly left to right
  • Each token is conditioned on all previously generated tokens:

P(x_1, …, x_T) = ∏_t P(x_t | x_1, …, x_{t-1})

This architecture underlies GPT, LLaMA, Claude, Gemini, Mistral...

Key constraint

  • Inherently sequential: cannot generate token t without token t-1
  • Every forward pass produces exactly one new token
  • This is the key constraint we will explore in today's lecture (next class: going beyond AR)

7 of 26

From Input to Output: The Decoding Pipeline

1 Forward Pass

Run the full model on current tokens

Output: one logit score per vocabulary item (~50,000 scores per step)

2 Softmax

Convert logits to a probability distribution

P(x) = exp(logit_x) / Σ_i exp(logit_i)

All probabilities sum to 1.0

3 Decode

Choose one token from the distribution

Append it — go back to Step 1

The choice of strategy in Step 3 is what decoding is all about
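The three-step loop can be sketched in a few lines of NumPy. This is an illustrative toy, not a real inference engine: `model` stands in for a network that maps a token sequence to logits, and `decode_fn` is whatever Step-3 strategy you plug in.

```python
import numpy as np

def softmax(logits):
    # Step 2: convert logits to a probability distribution (numerically stable)
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokens, n_new, decode_fn):
    # Step 1: forward pass -> one logit per vocabulary item
    # Step 3: decode_fn picks one token id; append it and loop
    for _ in range(n_new):
        logits = model(tokens)
        probs = softmax(logits)
        tokens = tokens + [decode_fn(probs)]
    return tokens
```

Every strategy in Part 1 is just a different `decode_fn` dropped into this loop.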

8 of 26

Part 1

Basic Decoding Strategies

9 of 26

Greedy Decoding

The simplest strategy: always pick the highest-probability token

next_token = argmax P(x | context)

Properties

  • Deterministic — always produces the same output for the same input
  • Fast — no sampling overhead

Problems

  • Repetition: the model gets stuck in loops
  • "The cat sat on the cat sat on the cat sat on the..."
  • Locally optimal is not globally optimal
  • A better sequence may have required a less likely token early on
  • No diversity — same input always yields the same output
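As a minimal sketch (NumPy; `probs` is the distribution produced by the softmax step):

```python
import numpy as np

def greedy_step(probs):
    # Deterministic: always pick the single highest-probability token
    return int(np.argmax(probs))

probs = np.array([0.1, 0.7, 0.2])
greedy_step(probs)  # token 1, every single time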

10 of 26

Sampling

Instead of always picking the max, draw randomly from the distribution

next_token ~ P(x | context)

Properties

  • Stochastic — different outputs on each run
  • Escapes repetition loops naturally
  • Produces diverse, creative text

Problem

  • Low-probability tokens can still get selected
  • "The astronaut walked on the... zxqrtf"
  • The full distribution includes many nonsensical tokens

Solution

  • Control the distribution before sampling — next slides cover how
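A rough sketch of pure sampling (NumPy; the distribution is illustrative):

```python
import numpy as np

def sample_step(probs, rng):
    # Stochastic: draw a token id in proportion to its probability
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.90, 0.05])
# Mostly token 1, but tokens 0 and 2 can still be drawn:
# this tail is exactly what top-k / top-p / min-p trim away.
```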

11 of 26

Temperature Scaling

Scale logits before softmax

P(x) ∝ exp( logit_x / T )

Effect of Temperature T

  • T < 1 → sharper distribution → more focused
  • High-logit tokens become even more dominant
  • T > 1 → flatter distribution → more random
  • Differences between tokens are compressed
  • T → 0 → becomes greedy decoding
  • T = 1 → use the raw model distribution

Practical range

  • 0.7 – 1.0 for factual / analytical tasks
  • 1.0 – 1.3 for creative writing

Real-world example: T = 0.7 is DeepSeek R1's default

  • Temperature is one of the oldest tricks in the book
  • Still the first knob practitioners reach for
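A quick sketch of the scaling itself (NumPy; the logits are made up):

```python
import numpy as np

def temperature_probs(logits, T):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
# temperature_probs(logits, 0.5) puts more mass on the top token
# than temperature_probs(logits, 2.0) does
```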

12 of 26

Top-k Sampling

Restrict sampling to only the top k most probable tokens

  • Set all other probabilities to zero, renormalize, then sample

keep top k tokens by probability → renormalize → sample

Intuition

  • Prevents the model from ever choosing clearly bad tokens
  • Common choice: k = 50 (used in the original GPT-2 release)

Problem: k is fixed, but distributions vary

  • Sometimes the distribution is very peaked — top 5 tokens cover 99% probability
  • → k=50 still allows 45 unlikely tokens
  • Sometimes the distribution is spread out — top 50 only covers 60%
  • → k=50 might cut off valid options

  • The right k depends on context — a fixed k is always a compromise
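The keep/renormalize/sample recipe is a few lines (NumPy; a toy 4-token vocabulary):

```python
import numpy as np

def top_k_filter(probs, k):
    # Zero out everything outside the k most probable tokens, renormalize
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# top_k_filter(np.array([0.5, 0.3, 0.1, 0.1]), 2) keeps only
# the first two tokens, rescaled to sum to 1
```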

13 of 26

Top-p (Nucleus) Sampling

Adaptively select the smallest set of tokens with cumulative probability ≥ p

keep fewest tokens where Σ P(x_i) ≥ p → renormalize → sample

The nucleus changes size dynamically

  • When model is confident: nucleus is small (few tokens cover ≥ p)
  • When model is uncertain: nucleus is larger (need many tokens to reach p)

Directly addresses the fixed-k problem

  • Common choice: p = 0.9 or 0.95
  • Widely used in modern LLM inference — most APIs default to top-p

Limitation

  • Can still include very low-probability tokens when the distribution is flat
  • The threshold is absolute, not relative to the top token
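A sketch of the nucleus selection (NumPy; real implementations sort once and work on the sorted array, but the logic is the same):

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p
    order = np.argsort(probs)[::-1]             # most probable first
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1  # tokens needed to reach p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()
```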

14 of 26

Min-p Sampling

A newer, adaptive approach: relative probability threshold

discard token x if P(x) < min_p × P(top_token)

Key idea: threshold is relative to the top token's probability

  • If top token has P=0.60 and min_p=0.10 → discard anything below 0.06
  • If top token has P=0.05 and min_p=0.10 → discard anything below 0.005

Properties

  • Naturally adapts: strict when model is confident, permissive when uncertain
  • Computationally simple — just one comparison per token
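The one-comparison-per-token rule, sketched in NumPy (toy distribution):

```python
import numpy as np

def min_p_filter(probs, min_p):
    # Threshold scales with the model's confidence in its top token
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

# With probs = [0.6, 0.25, 0.1, 0.05] and min_p = 0.1, the threshold is
# 0.06, so only the 0.05 token is discarded
```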

15 of 26

Min-p Sampling

16 of 26

Part 2

Making AR Models Faster

17 of 26

The Autoregressive Bottleneck

Every token requires a full forward pass

  • "The" takes as long to generate as "photosynthesis"
  • No parallelism possible: token t requires token t-1

Why is this slow?

  • Modern LLMs: 100B+ parameters per forward pass
  • Inference is memory-bandwidth bound

100 tokens = 100 passes

The key insight for speedups

  • Transformers compute ALL positions in parallel
  • One pass gives P(·|x_1..t) for every t at once
  • → Verify k draft tokens for the cost of generating 1!

18 of 26

Speculative Decoding — Motivation

Core idea: draft many tokens cheaply, verify them all at once

Step 1 — Draft

  • Small, fast draft model generates k tokens cheaply
  • Draft model: ~10× smaller, ~10× faster

Step 2 — Verify

  • Feed all k drafts to the large target model in ONE pass
  • Get verification probabilities for all positions at once

Step 3 — Accept or Reject

  • Accept drafts that match target distribution
  • Reject first mismatch; fall back to target model there

Typical speedup: 2–3×

Key guarantee

  • Identical output distribution to the target model alone
  • No quality loss — mathematically equivalent

20 of 26

Speculative Decoding — Why It Works

The verification step is essentially free

  • Transformer attention is parallelizable across sequence positions
  • Given a draft sequence of k tokens, one forward pass gives us P(·|x_{1..t}) for all t
  • So verifying k tokens costs the same as generating 1 token

The acceptance/rejection rule

  • If P_target(x_t) / P_draft(x_t) ≥ 1: always accept
  • Otherwise: accept with probability P_target / P_draft
  • On rejection: sample from a corrected distribution — guarantees target distribution

Trade-offs

  • Requires running two models → more GPU memory
  • Benefits tasks where draft tokens are likely correct: code, summarization, QA
  • Less benefit on open-ended generation where the draft is often wrong
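The accept/reject rule for one drafted token can be sketched as below. This is a scalar illustration: a real implementation vectorizes it across all k draft positions, and `p_target`/`p_draft` are the two models' probabilities for the drafted token.

```python
import numpy as np

def accept_draft(p_target, p_draft, rng):
    # Accept with probability min(1, p_target / p_draft):
    # certain acceptance whenever the target likes the token
    # at least as much as the draft did
    return rng.random() < min(1.0, p_target / p_draft)

def corrected_distribution(target_probs, draft_probs):
    # On rejection, resample from max(0, p_target - p_draft), renormalized;
    # this correction is what makes the combined scheme match the
    # target model's output distribution exactly
    residual = np.maximum(target_probs - draft_probs, 0.0)
    return residual / residual.sum()
```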

21 of 26

Prompt Lookup Decoding

Can we do speculative decoding without a draft model?

  • Speculative decoding requires a separate draft model → extra VRAM, operational complexity
  • Observation: for many tasks, the output contains phrases from the input
  • Summarization, document QA, code editing — output heavily overlaps with prompt

Key idea: use the prompt itself as the draft

  • Take the last n generated tokens (an n-gram)
  • Search the prompt for a matching n-gram
  • If found: the continuation in the prompt becomes the draft
  • Pass to the target model for verification — same accept/reject as speculative decoding

Properties

  • Zero extra model, zero extra VRAM — just a string search (O(n))
  • Works whenever the output reuses input phrasing
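The n-gram lookup itself fits in a few lines. A sketch over token-id lists, with illustrative defaults (`ngram_size`, `max_draft`); production implementations typically try several n-gram sizes, longest first:

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, max_draft=10):
    # Match the last n generated tokens against the prompt; if found,
    # the tokens that follow the match become the draft
    pattern = generated_ids[-ngram_size:]
    for i in range(len(prompt_ids) - ngram_size):
        if prompt_ids[i:i + ngram_size] == pattern:
            start = i + ngram_size
            return prompt_ids[start:start + max_draft]
    return []  # no match: fall back to normal one-token decoding
```

For example, with prompt `[1, 2, 3, 4, 5, 6, 7]` and generated tokens ending in `[2, 3, 4]`, the lookup proposes `[5, 6, 7]` as the draft.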

23 of 26

Prompt Lookup Decoding — Results

24 of 26

All these methods accept one constraint: one token at a time.

What if we could generate all tokens in parallel through iterative refinement?

25 of 26

Recap & Key Takeaways

Decoding basics

  • LMs output distributions; decoding turns that into text
  • Autoregressive: one token at a time, left to right

Controlling quality

  • Greedy: fast, deterministic, prone to repetition
  • Temperature / top-k / top-p / min-p: diverse, controllable
  • These compose: use temperature + top-p together

Controlling speed

  • Speculative decoding: draft + verify → 2–3× speedup
  • Prompt Lookup Decoding: use prompt as draft — no extra model

The big picture

  • All these methods work within the AR constraint
  • Next lecture (Apr 7): what if we changed the generation paradigm entirely?
  • → Discrete Diffusion Language Models

26 of 26

Thank you

apoorv@inceptionlabs.ai
