1 of 60

Diffusion models

Ruba Haroun

Senior Research Engineer, Google DeepMind

2 of 60

I. Generative modelling

II. Iterative refinement

III. Diffusion models

IV. Guidance

V. Other topics

VI. Examples

3 of 60

I. Generative modelling

4 of 60

Generative modelling: the probabilistic perspective

x ~ p(x)


Explicit: autoregression, flows, VAEs, …

Implicit: GANs, …

5 of 60

Conditional generative models

[Diagram: a spectrum from sparsely conditioned to densely conditioned: class labels (y = “cat”), bounding boxes, segmentation, grayscale image (colorisation)]

Control model output using a control signal: p(x|c) vs. p(x)

6 of 60

II. Iterative refinement

7 of 60

Two approaches to iterative refinement

Autoregression: step-by-step

Turn everything into a 1D sequence, generate it one step at a time

Diffusion: iterative denoising

Gradually add noise until all information is destroyed, then learn to invert this procedure step by step

8 of 60

Autoregression: step-by-step

chain rule of probability

p(x) = Πi p(xi|x<i)

  • factorise p(x) into sequential conditionals
  • p(xi|x<i) are simple scalar distributions
  • use the same model for all of them
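A minimal sketch of sampling from this factorisation, assuming a hypothetical `predict_conditional` model that maps the prefix x<i to a categorical distribution over the next element (names and the toy uniform model are illustrative, not from the slides):

```python
import random

def sample_autoregressive(predict_conditional, length):
    """Sample x one element at a time using p(x_i | x_{<i})."""
    x = []
    for _ in range(length):
        probs = predict_conditional(x)  # p(x_i | x_{<i}) over a discrete alphabet
        x.append(random.choices(range(len(probs)), weights=probs)[0])
    return x

# Toy "model": always uniform over 4 possible values.
uniform = lambda prefix: [0.25, 0.25, 0.25, 0.25]
sample = sample_autoregressive(uniform, length=8)
```

The same function f is reused for every conditional; only the prefix it conditions on grows.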

9 of 60

Autoregression in pixel space: PixelRNN & PixelCNN

van den Oord et al. ‘Pixel Recurrent Neural Networks’ (2016)

van den Oord et al. ‘Conditional Image Generation with PixelCNN Decoders’ (2016)

10 of 60

Autoregression in amplitude space: WaveNet & SampleRNN

van den Oord et al. ‘WaveNet: a Generative Model for Raw Audio’ (2016)

Mehri et al. ‘SampleRNN: An Unconditional End-to-End Neural Audio Generation Model’ (2016)

11 of 60

III. Diffusion models

12 of 60

Diffusion: iterative denoising

13 of 60

Diffusion: forward process

[Diagram: training data x0 is gradually corrupted into noisy data xt and finally Gaussian noise, adding a small amount of noise δ at each step]

14 of 60

Diffusion: forward process

[Diagram: training data x0 is gradually corrupted into noisy data xt and finally Gaussian noise, adding a small amount of noise δ at each step]

xt = x0 + σ(t)·ε

15 of 60

Diffusion: forward process

[Diagram: at each step the signal is also scaled by a factor γ in addition to adding noise δ; training data x0 becomes noisy data xt and finally Gaussian noise xT]

xt = α(t)·x0 + σ(t)·ε

16 of 60

Diffusion: forward process

  • σ(t) is the noise schedule
  • controls the rate of corruption over the course of the process
  • Several choices for α(t):
    • variance-preserving (VP): α = √(1 - σ²)
    • variance-exploding (VE): α = 1
    • rectified flow, flow matching (RF): α = 1 - σ
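The forward corruption xt = α(t)·x0 + σ(t)·ε with the three α(t) choices above can be sketched as follows (a toy NumPy version; the function name `corrupt` is illustrative, not from the slides):

```python
import numpy as np

def corrupt(x0, sigma, schedule="vp"):
    """Apply one forward-process corruption x_t = α·x0 + σ·ε.

    `schedule` picks α from σ as on the slide:
      "vp" (variance-preserving): α = sqrt(1 - σ²)
      "ve" (variance-exploding):  α = 1
      "rf" (rectified flow):      α = 1 - σ
    """
    eps = np.random.randn(*x0.shape)  # ε ~ N(0, I)
    if schedule == "vp":
        alpha = np.sqrt(1.0 - sigma**2)
    elif schedule == "ve":
        alpha = 1.0
    elif schedule == "rf":
        alpha = 1.0 - sigma
    else:
        raise ValueError(schedule)
    return alpha * x0 + sigma * eps, eps
```

At σ = 0 all three schedules return x0 unchanged; as σ grows, the signal is progressively drowned in noise.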

17 of 60

Diffusion: backward process

[Diagram: the backward process inverts the forward corruption, going from Gaussian noise xT through noisy data xt back to training data x0]

18 of 60

Diffusion: backward process

[Diagram: a noisy sample xt and the clean data x0]

19 of 60

Diffusion: backward process

[Diagram: from xt, the model predicts x̂0, an estimate of x0]

20 of 60

Diffusion: backward process

[Diagram: the prediction x̂0 shown between xt and the clean data x0]

21 of 60

Diffusion: backward process

[Diagram: xt takes a small step towards the prediction x̂0]

22 of 60

Diffusion: backward process

[Diagram: after the small step, some noise ξ is added back, yielding xt-1]

23 of 60

Diffusion: backward process

[Diagram: the procedure repeats: a new prediction x̂0 from xt-1]

24 of 60

Diffusion: backward process

[Diagram: the procedure repeats at xt-1]

25 of 60

Diffusion: backward process

[Diagram: another small step plus noise ξ yields xt-2]

26 of 60

Diffusion: predict ε instead of x0?

  • xt is a linear combination of x0 and ε
  • Diffusion models predict x0 from xt

  • … but predicting ε is also an option
  • Given xt, we can convert a prediction for ε into one for x0, and vice versa
  • Other linear combinations of ε and x0 are also possible:
    • v-prediction: v = α(t)·ε - σ(t)·x0
    • Flow matching: ε - x0

Salimans & Ho ‘Progressive Distillation [..]’ (2022)

Lipman et al. ‘Flow matching [..]’ (2022)

xt = α(t)·x0 + σ(t)·ε
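Because xt = α(t)·x0 + σ(t)·ε is linear, converting between the prediction targets is a one-liner each; a sketch (helper names are hypothetical):

```python
def eps_to_x0(xt, eps_hat, alpha, sigma):
    """Recover an x0 prediction from an ε prediction via x_t = α·x0 + σ·ε."""
    return (xt - sigma * eps_hat) / alpha

def x0_to_eps(xt, x0_hat, alpha, sigma):
    """Recover an ε prediction from an x0 prediction."""
    return (xt - alpha * x0_hat) / sigma

def to_v(x0, eps, alpha, sigma):
    """The v-prediction target from the slide: v = α·ε - σ·x0."""
    return alpha * eps - sigma * x0
```

Round-tripping through xt recovers the original x0 and ε exactly, which is what makes the choice of target a training-time decision rather than a modelling constraint.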

27 of 60

Diffusion training: summary

For each training example x0:

  • Sample a random time step t
  • Corrupt x0 to get xt: xt = α(t)·x0 + σ(t)·ε
  • Predict x0 (or ε) from xt: x̂0 = f(xt, t) or ε̂ = f(xt, t)
  • Minimise the squared prediction error: min (x̂0 - x0)² or min (ε̂ - ε)²

  • Repeat
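The four steps above can be sketched as a single function (a toy version; the linear schedule σ(t) = t, the VP choice for α, and the name `training_step` are assumptions, not from the slides):

```python
import numpy as np

def training_step(f, x0, rng):
    """One diffusion training step for a hypothetical x0-predicting model `f`."""
    t = rng.uniform(0.01, 0.99)          # sample a random time step
    sigma = t                            # simple linear noise schedule (assumption)
    alpha = np.sqrt(1.0 - sigma**2)      # variance-preserving α
    eps = rng.standard_normal(x0.shape)
    xt = alpha * x0 + sigma * eps        # corrupt x0 to get xt
    x0_hat = f(xt, t)                    # predict x0 from xt
    return np.mean((x0_hat - x0) ** 2)   # squared prediction error
```

In practice the loss would be backpropagated into f; here we just return it.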

28 of 60

Diffusion sampling: summary

At each sampling time step t:

  • Predict x0 (or ε) from xt: x̂0 = f(xt, t) or ε̂ = f(xt, t)
  • Take a small step in the predicted direction to partially denoise: xt → xt-1

  • Optionally add back some noise (only in some algorithms)
  • Repeat
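A minimal deterministic sampler along these lines (a sketch, not any specific published algorithm; it assumes the variance-exploding choice α = 1 with σ(t) = t, so xt = x0 + t·ε and (xt - x̂0)/t is the noise direction):

```python
import numpy as np

def sample(f, shape, n_steps=50, seed=0):
    """Iteratively denoise from pure noise using a hypothetical x0-predictor f."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = rng.standard_normal(shape) * ts[0]   # start from Gaussian noise at σ = 1
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = f(x, t)                     # predict x0 from xt
        d = (x - x0_hat) / t                 # estimated noise direction ε̂
        x = x + (t_next - t) * d             # small step towards x̂0
    return x
```

Stochastic samplers would add a noise term ξ after each step; this deterministic variant omits it.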

29 of 60

IV. Guidance

30 of 60

Guidance: a cheat code for diffusion models

  • Guidance enables trading off sample diversity for quality
  • It allows diffusion models to punch well above their weight

https://sander.ai/2022/05/26/guidance.html
https://sander.ai/2023/08/28/geometry.html

31 of 60

Diffusion: classifier guidance

[Diagram: a noisy sample xt and the clean data x0]

32 of 60

Diffusion: classifier guidance

[Diagram: from xt, the model predicts x̂0]

33 of 60

Diffusion: classifier guidance

[Diagram: calculate the classifier gradient ∇x log p(c=‘bunny’|xt) at xt]

34 of 60

Diffusion: classifier guidance

[Diagram: the denoising direction towards x̂0 is combined with ∇x log p(c=‘bunny’|xt)]

35 of 60

Classifier guidance: the Bayesian perspective

  • Conditional score = unconditional score + classifier gradient w.r.t. the input
  • This turns an unconditional model into a conditional one
  • Without retraining!

36 of 60

Diffusion: classifier guidance

[Diagram: calculate the classifier gradient ∇x log p(c=‘bunny’|xt) at xt]

37 of 60

Diffusion: classifier guidance

[Diagram: the classifier gradient is scaled to γ·∇x log p(c=‘bunny’|xt)]

38 of 60

Diffusion: classifier guidance

[Diagram: the denoising direction towards x̂0 is combined with γ·∇x log p(c=‘bunny’|xt)]

39 of 60

Diffusion: classifier guidance

[Diagram: after combining directions and taking a step, noise ξ is added, yielding xt-1]

40 of 60

Classifier guidance: the Bayesian perspective

  • Scale factor γ acts as an (inverse) temperature
  • “Sharpening” of p(c|x)
  • Different from sharpening p(x) or p(x|c)!
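As a sketch, the combination described above reduces to one line; both the unconditional score model and the classifier gradient are hypothetical callables evaluated at (xt, t):

```python
def classifier_guided_score(score_uncond, classifier_grad, gamma):
    """Guided score = unconditional score + γ·∇x log p(c|x_t).

    γ = 1 corresponds to the plain Bayesian combination;
    γ > 1 sharpens p(c|x), trading diversity for condition adherence.
    """
    def guided(xt, t):
        return score_uncond(xt, t) + gamma * classifier_grad(xt, t)
    return guided
```

The guided score drops into any sampler in place of the original score; nothing about the diffusion model itself is retrained.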

41 of 60

Diffusion: classifier-free guidance

[Diagram: from xt, the model makes an unconditional prediction x̂0]

42 of 60

Diffusion: classifier-free guidance

[Diagram: the model also makes a conditional prediction x̂0|c, alongside the unconditional x̂0]

43 of 60

Diffusion: classifier-free guidance

[Diagram: calculate the difference δ = x̂0|c - x̂0]

44 of 60

Diffusion: classifier-free guidance

[Diagram: the difference is amplified to γ·δ]

45 of 60

Diffusion: classifier-free guidance

[Diagram: xt takes a small step towards the amplified prediction]

46 of 60

Diffusion: classifier-free guidance

[Diagram: the step plus added noise ξ yields xt-1]
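On the predictions themselves, classifier-free guidance is a single line (a sketch; γ = 1 recovers plain conditional sampling, and larger γ amplifies the difference δ between the conditional and unconditional predictions):

```python
def cfg_prediction(x0_uncond, x0_cond, gamma):
    """Classifier-free guidance: x̂0 + γ·(x̂0|c - x̂0).

    Equivalently (1 - γ)·x̂0 + γ·x̂0|c, i.e. extrapolation past the
    conditional prediction when γ > 1.
    """
    return x0_uncond + gamma * (x0_cond - x0_uncond)
```

Both predictions typically come from the same network, trained with the condition randomly dropped, so no separate classifier is needed.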

47 of 60

The power of classifier-free guidance
A stained glass window of a panda eating bamboo

Nichol et al. ‘GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models’ (2021)

https://sander.ai/2022/05/26/guidance.html

48 of 60

The power of classifier-free guidance
A cozy living room with a painting of a corgi on the wall […]

Nichol et al. ‘GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models’ (2021)

https://sander.ai/2022/05/26/guidance.html

49 of 60

V. Other topics

50 of 60

Latent diffusion

  • Visual perception works differently at fine scales and at large scales
  • Fine-scale perception abstracts away texture
  • It is not necessary to model all possible realisations of particular textures
    ⇒ use an adversarial autoencoder to learn to “paint with textures”
  • Diffusion in latent space is usually a simpler, less memory-hungry task

https://sander.ai/2020/09/01/typicality.html
Rombach et al. ‘High-Resolution Image Synthesis with Latent Diffusion Models’ (2021)

51 of 60

[Diagram: two-stage training. Training stage 1: encoder, bottleneck latents, decoder, trained to map the input to a reconstruction with regression + ℒperceptual + ℒadversarial losses. Training stage 2: an iterative generator (AR or diffusion) is trained on the latents produced by the encoder. Sampling: the iterative generator produces latents, which the decoder turns into the output.]

52 of 60

[Diagram: pixels (h=256, w=256, c=3) vs. latents (h=32, w=32, c=8)]

53 of 60

‘EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling’,�Kouzelis, Kakogeorgiou, Gidaris, Komodakis, arXiv, 2025.

54 of 60

VI. Examples

55 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

56 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

57 of 60

Image generation at scale: Imagen 4
https://deepmind.google/models/imagen/

58 of 60

Image generation at scale: Nano Banana
https://deepmind.google/models/gemini-image/flash/

Prompt: Make woman underwater, and remove the couch and wallpaper

59 of 60

Image generation at scale: Nano Banana
https://deepmind.google/models/gemini-image/flash/

Prompt 1: Remove the door mirror.

60 of 60

Thank you!