1 of 60

Diffusion models

Ruba Haroun

Senior Research Engineer, Google DeepMind

2 of 60

I. Generative modelling

II. Iterative refinement

III. Diffusion models

IV. Guidance

V. Other topics

VI. Examples

3 of 60

I. Generative modelling

4 of 60

Generative modelling: the probabilistic perspective

x ~ p(x)


Explicit: autoregression, flows, VAEs, …

Implicit: GANs, …

5 of 60

Conditional generative models

[Diagram: a spectrum from sparsely conditioned to densely conditioned: class labels (y = “cat”), bounding boxes, segmentation, grayscale image (colorisation)]

Control model output using a control signal: p(x|c) vs. p(x)

6 of 60

II. Iterative refinement

7 of 60

Two approaches to iterative refinement

Autoregression: step-by-step

Turn everything into a 1D sequence, generate it one step at a time

Diffusion: iterative denoising

Gradually add noise until all information is destroyed, then learn to invert this procedure step by step

8 of 60

Autoregression: step-by-step

chain rule of probability

p(x) = Πi p(xi|x<i)

  • factorise p(x) into sequential conditionals
  • p(xi|x<i) are simple scalar distributions
  • use the same model for all of them
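A minimal sketch of sampling from this factorisation, assuming a hypothetical `predict_conditional` model that maps the prefix x<i to a categorical distribution over the next element (names and the toy uniform model are illustrative, not from the slides):

```python
import random

def sample_autoregressive(predict_conditional, length):
    """Sample x one element at a time using p(x_i | x_{<i})."""
    x = []
    for _ in range(length):
        probs = predict_conditional(x)  # p(x_i | x_{<i}) over a discrete alphabet
        x.append(random.choices(range(len(probs)), weights=probs)[0])
    return x

# Toy "model": always uniform over 4 possible values.
uniform = lambda prefix: [0.25, 0.25, 0.25, 0.25]
sample = sample_autoregressive(uniform, length=8)
```

The same function f is reused for every conditional; only the prefix it conditions on grows.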

9 of 60

Autoregression in pixel space: PixelRNN & PixelCNN

van den Oord et al. ‘Pixel Recurrent Neural Networks’ (2016)

van den Oord et al. ‘Conditional Image Generation with PixelCNN Decoders’ (2016)

10 of 60

Autoregression in amplitude space: WaveNet & SampleRNN

van den Oord et al. ‘WaveNet: a Generative Model for Raw Audio’ (2016)

Mehri et al. ‘SampleRNN: An Unconditional End-to-End Neural Audio Generation Model’ (2016)

11 of 60

III. Diffusion models

12 of 60

Diffusion: iterative denoising

13 of 60

Diffusion: forward process

[Diagram: training data x0 is gradually corrupted into noisy data xt and finally Gaussian noise, adding a small amount of noise δ at each step]

14 of 60

Diffusion: forward process

[Diagram: training data x0 is gradually corrupted into noisy data xt and finally Gaussian noise, adding a small amount of noise δ at each step]

xt = x0 + σ(t)·ε

15 of 60

Diffusion: forward process

[Diagram: at each step the signal is also scaled by a factor γ in addition to adding noise δ; training data x0 becomes noisy data xt and finally Gaussian noise xT]

xt = α(t)·x0 + σ(t)·ε

16 of 60

Diffusion: forward process

  • σ(t) is the noise schedule
  • controls the rate of corruption over the course of the process
  • Several choices for α(t):
    • variance-preserving (VP): α = √(1 - σ²)
    • variance-exploding (VE): α = 1
    • rectified flow, flow matching (RF): α = 1 - σ
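The forward corruption xt = α(t)·x0 + σ(t)·ε with the three α(t) choices above can be sketched as follows (a toy NumPy version; the function name `corrupt` is illustrative, not from the slides):

```python
import numpy as np

def corrupt(x0, sigma, schedule="vp"):
    """Apply one forward-process corruption x_t = α·x0 + σ·ε.

    `schedule` picks α from σ as on the slide:
      "vp" (variance-preserving): α = sqrt(1 - σ²)
      "ve" (variance-exploding):  α = 1
      "rf" (rectified flow):      α = 1 - σ
    """
    eps = np.random.randn(*x0.shape)  # ε ~ N(0, I)
    if schedule == "vp":
        alpha = np.sqrt(1.0 - sigma**2)
    elif schedule == "ve":
        alpha = 1.0
    elif schedule == "rf":
        alpha = 1.0 - sigma
    else:
        raise ValueError(schedule)
    return alpha * x0 + sigma * eps, eps
```

At σ = 0 all three schedules return x0 unchanged; as σ grows, the signal is progressively drowned in noise.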

17 of 60

Diffusion: backward process

[Diagram: the backward process inverts the forward corruption, going from Gaussian noise xT through noisy data xt back to training data x0]

18 of 60

Diffusion: backward process

[Diagram: a noisy sample xt and the clean data x0]

19 of 60

Diffusion: backward process

[Diagram: from xt, the model predicts x̂0, an estimate of x0]

20 of 60

Diffusion: backward process

[Diagram: the prediction x̂0 shown between xt and the clean data x0]

21 of 60

Diffusion: backward process

[Diagram: xt takes a small step towards the prediction x̂0]

22 of 60

Diffusion: backward process

[Diagram: after the small step, some noise ξ is added back, yielding xt-1]

23 of 60

Diffusion: backward process

[Diagram: the procedure repeats: a new prediction x̂0 from xt-1]

24 of 60

Diffusion: backward process

[Diagram: the procedure repeats at xt-1]

25 of 60

Diffusion: backward process

[Diagram: another small step plus noise ξ yields xt-2]

26 of 60

Diffusion: predict ε instead of x0?

  • xt is a linear combination of x0 and ε
  • Diffusion models predict x0 from xt

  • … but predicting ε is also an option
  • Given xt, we can convert a prediction for ε into one for x0, and vice versa
  • Other linear combinations of ε and x0 are also possible:
    • v-prediction: v = α(t)·ε - σ(t)·x0
    • Flow matching: ε - x0

Salimans & Ho ‘Progressive Distillation [..]’ (2022)

Lipman et al. ‘Flow matching [..]’ (2022)

xt = α(t)·x0 + σ(t)·ε
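Because xt = α(t)·x0 + σ(t)·ε is linear, converting between the prediction targets is a one-liner each; a sketch (helper names are hypothetical):

```python
def eps_to_x0(xt, eps_hat, alpha, sigma):
    """Recover an x0 prediction from an ε prediction via x_t = α·x0 + σ·ε."""
    return (xt - sigma * eps_hat) / alpha

def x0_to_eps(xt, x0_hat, alpha, sigma):
    """Recover an ε prediction from an x0 prediction."""
    return (xt - alpha * x0_hat) / sigma

def to_v(x0, eps, alpha, sigma):
    """The v-prediction target from the slide: v = α·ε - σ·x0."""
    return alpha * eps - sigma * x0
```

Round-tripping through xt recovers the original x0 and ε exactly, which is what makes the choice of target a training-time decision rather than a modelling constraint.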

27 of 60

Diffusion training: summary

For each training example x0:

  • Sample a random time step t
  • Corrupt x0 to get xt: xt = α(t)·x0 + σ(t)·ε
  • Predict x0 (or ε) from xt: x̂0 = f(xt, t) or ε̂ = f(xt, t)
  • Minimise the squared prediction error: min (x̂0 - x0)² or min (ε̂ - ε)²

  • Repeat
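The four steps above can be sketched as a single function (a toy version; the linear schedule σ(t) = t, the VP choice for α, and the name `training_step` are assumptions, not from the slides):

```python
import numpy as np

def training_step(f, x0, rng):
    """One diffusion training step for a hypothetical x0-predicting model `f`."""
    t = rng.uniform(0.01, 0.99)          # sample a random time step
    sigma = t                            # simple linear noise schedule (assumption)
    alpha = np.sqrt(1.0 - sigma**2)      # variance-preserving α
    eps = rng.standard_normal(x0.shape)
    xt = alpha * x0 + sigma * eps        # corrupt x0 to get xt
    x0_hat = f(xt, t)                    # predict x0 from xt
    return np.mean((x0_hat - x0) ** 2)   # squared prediction error
```

In practice the loss would be backpropagated into f; here we just return it.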

28 of 60

Diffusion sampling: summary

At each sampling time step t:

  • Predict x0 (or ε) from xt: x̂0 = f(xt, t) or ε̂ = f(xt, t)
  • Take a small step in the predicted direction to partially denoise: xt → xt-1

  • Optionally add back some noise (only in some algorithms)
  • Repeat
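A minimal deterministic sampler along these lines (a sketch, not any specific published algorithm; it assumes the variance-exploding choice α = 1 with σ(t) = t, so xt = x0 + t·ε and (xt - x̂0)/t is the noise direction):

```python
import numpy as np

def sample(f, shape, n_steps=50, seed=0):
    """Iteratively denoise from pure noise using a hypothetical x0-predictor f."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = rng.standard_normal(shape) * ts[0]   # start from Gaussian noise at σ = 1
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = f(x, t)                     # predict x0 from xt
        d = (x - x0_hat) / t                 # estimated noise direction ε̂
        x = x + (t_next - t) * d             # small step towards x̂0
    return x
```

Stochastic samplers would add a noise term ξ after each step; this deterministic variant omits it.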

29 of 60

IV. Guidance

30 of 60

Guidance: a cheat code for diffusion models

  • Guidance enables trading off sample diversity for quality
  • It allows diffusion models to punch well above their weight

https://sander.ai/2022/05/26/guidance.html
https://sander.ai/2023/08/28/geometry.html

31 of 60

Diffusion: classifier guidance

[Diagram: a noisy sample xt and the clean data x0]

32 of 60

Diffusion: classifier guidance

[Diagram: from xt, the model predicts x̂0]

33 of 60

Diffusion: classifier guidance

[Diagram: calculate the classifier gradient ∇x log p(c=‘bunny’|xt) at xt]

34 of 60

Diffusion: classifier guidance

[Diagram: the denoising direction towards x̂0 is combined with ∇x log p(c=‘bunny’|xt)]

35 of 60

Classifier guidance: the Bayesian perspective

  • Conditional score = unconditional score + classifier gradient w.r.t. the input
  • This turns an unconditional model into a conditional one
  • Without retraining!

36 of 60

Diffusion: classifier guidance

[Diagram: calculate the classifier gradient ∇x log p(c=‘bunny’|xt) at xt]

37 of 60

Diffusion: classifier guidance

[Diagram: the classifier gradient is scaled to γ·∇x log p(c=‘bunny’|xt)]

38 of 60

Diffusion: classifier guidance

[Diagram: the denoising direction towards x̂0 is combined with γ·∇x log p(c=‘bunny’|xt)]

39 of 60

Diffusion: classifier guidance

[Diagram: after combining directions and taking a step, noise ξ is added, yielding xt-1]

40 of 60

Classifier guidance: the Bayesian perspective

  • Scale factor γ acts as an (inverse) temperature
  • “Sharpening” of p(c|x)
  • Different from sharpening p(x) or p(x|c)!
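As a sketch, the combination described above reduces to one line; both the unconditional score model and the classifier gradient are hypothetical callables evaluated at (xt, t):

```python
def classifier_guided_score(score_uncond, classifier_grad, gamma):
    """Guided score = unconditional score + γ·∇x log p(c|x_t).

    γ = 1 corresponds to the plain Bayesian combination;
    γ > 1 sharpens p(c|x), trading diversity for condition adherence.
    """
    def guided(xt, t):
        return score_uncond(xt, t) + gamma * classifier_grad(xt, t)
    return guided
```

The guided score drops into any sampler in place of the original score; nothing about the diffusion model itself is retrained.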

41 of 60

Diffusion: classifier-free guidance

[Diagram: from xt, the model makes an unconditional prediction x̂0]

42 of 60

Diffusion: classifier-free guidance

[Diagram: the model also makes a conditional prediction x̂0|c, alongside the unconditional x̂0]

43 of 60

Diffusion: classifier-free guidance

[Diagram: calculate the difference δ = x̂0|c - x̂0]

44 of 60

Diffusion: classifier-free guidance

[Diagram: the difference is amplified to γ·δ]

45 of 60

Diffusion: classifier-free guidance

[Diagram: xt takes a small step towards the amplified prediction]

46 of 60

Diffusion: classifier-free guidance

[Diagram: the step plus added noise ξ yields xt-1]
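On the predictions themselves, classifier-free guidance is a single line (a sketch; γ = 1 recovers plain conditional sampling, and larger γ amplifies the difference δ between the conditional and unconditional predictions):

```python
def cfg_prediction(x0_uncond, x0_cond, gamma):
    """Classifier-free guidance: x̂0 + γ·(x̂0|c - x̂0).

    Equivalently (1 - γ)·x̂0 + γ·x̂0|c, i.e. extrapolation past the
    conditional prediction when γ > 1.
    """
    return x0_uncond + gamma * (x0_cond - x0_uncond)
```

Both predictions typically come from the same network, trained with the condition randomly dropped, so no separate classifier is needed.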

47 of 60

The power of classifier-free guidance
A stained glass window of a panda eating bamboo

Nichol et al. ‘GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models’ (2021)

https://sander.ai/2022/05/26/guidance.html

48 of 60

The power of classifier-free guidance
A cozy living room with a painting of a corgi on the wall […]

Nichol et al. ‘GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models’ (2021)

https://sander.ai/2022/05/26/guidance.html

49 of 60

V. Other topics

50 of 60

Latent diffusion

  • Visual perception works differently at fine scales and at large scales
  • Fine-scale perception abstracts away texture
  • It is not necessary to model all possible realisations of particular textures
    ⇒ use an adversarial autoencoder to learn to “paint with textures”
  • Diffusion in latent space is usually a simpler, less memory-hungry task

https://sander.ai/2020/09/01/typicality.html
Rombach et al. ‘High-Resolution Image Synthesis with Latent Diffusion Models’ (2021)

51 of 60

[Diagram: two-stage training. Training stage 1: encoder, bottleneck latents, decoder, trained to map the input to a reconstruction with regression + ℒperceptual + ℒadversarial losses. Training stage 2: an iterative generator (AR or diffusion) is trained on the latents produced by the encoder. Sampling: the iterative generator produces latents, which the decoder turns into the output.]

52 of 60

[Diagram: pixels (h=256, w=256, c=3) vs. latents (h=32, w=32, c=8)]

53 of 60

‘EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling’,�Kouzelis, Kakogeorgiou, Gidaris, Komodakis, arXiv, 2025.

54 of 60

VI. Examples

55 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

56 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

57 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

58 of 60

Image generation at scale: Nano Banana
https://deepmind.google/models/gemini-image/flash/

Prompt: Make woman underwater, and remove the couch and wallpaper

59 of 60

Image generation at scale: Nano Banana
https://deepmind.google/models/gemini-image/flash/

Prompt 1: Remove the door mirror.

60 of 60

Thank you!