1 of 34

Generative modeling

DS-GA 3001 - Intro to Computer Vision - 4/24/2023

Alberto Bietti (alberto@bietti.me)

2 of 34

Supervised vs unsupervised learning

Supervised learning:

  • Given labeled data (x_i, y_i)
  • Goal: predict y for a new x

Learn by minimizing:

    min_θ (1/n) Σ_i ℓ(f_θ(x_i), y_i)

f_θ: neural network (CNN, transformer, ...)

[Figure: example images with predicted labels “dog” and “cat”]

3 of 34

Supervised vs unsupervised learning

Unsupervised learning

  • Given unlabeled data x_1, …, x_n
  • Estimate / discover properties of the underlying data distribution p(x)
  • Possible goals: clustering, compression, generation/sampling

4 of 34

“Generative AI”

Source: https://www.blueshadow.art/midjourney-prompt-commands/

5 of 34

“Generative AI”: conditional generative models

(DALL-E, Stable Diffusion, Midjourney, etc.)

Typical goal: conditional generation from text descriptions

  • Data: pairs (x, caption)
  • Learn conditional generative model p(x | caption) instead of just p(x)
  • Usually replace “caption” by its vector representation given by a text-image model (e.g. CLIP)

6 of 34

Density estimation

Estimate the density p(x) from samples x_1, …, x_n ∼ p

    p̂_n = (1/n) Σ_i δ_{x_i} (empirical distribution)

7 of 34

Density estimation: failure modes

Overfitting / memorization

Underfitting / oversmoothing

Mode collapse

8 of 34

How to estimate high-dimensional distributions?

  • Explicit density models (Gaussian mixture, auto-regressive model, VAE, normalizing flows)
    • Explicit form of the density p_θ(x)
    • Train using maximum likelihood (or approximations)
    • Sampling using specific model structure, or appropriate samplers

  • Implicit models (Generative Adversarial Networks)
    • p_θ is defined implicitly as the distribution of x = g_θ(z) for z ∼ N(0, I)
    • Train by minimizing a distance D(p_θ, p̂_n) to the empirical data distribution
    • D is a metric on distributions (e.g. Wasserstein, MMD, KL, …)
    • Often requires adversarial (min-max) optimization

We will focus on the most recently successful family: score-based and diffusion models

9 of 34

Warmup: energy-based models (EBMs)

    p_θ(x) = exp(−E_θ(x)) / Z_θ

E_θ: energy function

Z_θ = ∫ exp(−E_θ(x)) dx: normalizing constant

10 of 34

Sampling from EBMs with Langevin dynamics

  • How can we get new samples if we know E_θ(x) but not Z_θ?
  • Langevin dynamics: iterate the following

    x_{t+1} = x_t + ε ∇_x log p_θ(x_t) + √(2ε) z_t,  z_t ∼ N(0, I)

  • Similar to gradient descent optimization of E_θ (note ∇_x log p_θ = −∇_x E_θ), with some added noise!
  • For small ε and large t, we can show that x_t is an approximate sample from p_θ
  • But: very slow in the presence of low-density regions
    • Jumping across modes is difficult!
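The Langevin iteration can be sketched in a few lines of NumPy. This is a toy sketch, not production code: it assumes the score function is known in closed form (here the standard Gaussian score −x, so the stationary distribution can be checked).

```python
import numpy as np

def langevin_sample(score, x0, eps=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics:
    x <- x + eps * score(x) + sqrt(2 * eps) * z,  z ~ N(0, I)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
    return x

# Toy target with a known score: p = N(0, 1), so score(x) = -x.
# Run 2000 parallel chains; their empirical mean and std should be near 0 and 1.
samples = langevin_sample(lambda x: -x, x0=np.zeros(2000))
print(samples.mean(), samples.std())
```

Note the step size trade-off from the slide: a smaller `eps` reduces the discretization bias but requires more steps to mix, and no step size helps the chain jump between well-separated modes.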

11 of 34

Langevin dynamics example

Mixture of Gaussians (source: https://yang-song.net/blog/2021/score/)

12 of 34

Training EBMs

  • How to learn the energy function E_θ?
  • If the normalizing constant Z_θ is known/tractable → easy via maximum likelihood!
  • In general, the normalizing constant may be intractable!
  • Possible solution: approximate gradient ascent on the log-likelihood

    ∇_θ log p_θ(x) = −∇_θ E_θ(x) + E_{x′∼p_θ}[∇_θ E_θ(x′)]

First term can be computed exactly, second term may need sampling from p_θ (e.g. with Langevin dynamics)
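This gradient can be checked on a toy model where everything is tractable. The sketch below is illustrative and assumes a quadratic energy E_θ(x) = θx², so the model is Gaussian and exact model samples are available (in general the second term would require Langevin sampling).

```python
import numpy as np

# Toy EBM: E_theta(x) = theta * x^2, so p_theta = N(0, 1/(2*theta)).
# Log-likelihood gradient (first term exact, second term via model samples):
#   d/dtheta log p_theta(x) = -x^2 + E_{x' ~ p_theta}[x'^2]
rng = np.random.default_rng(0)
data = rng.standard_normal(50_000)      # data ~ N(0, 1), so theta* = 0.5
theta, lr = 1.0, 0.05
for _ in range(500):
    model_samples = rng.standard_normal(2000) / np.sqrt(2 * theta)
    grad = -np.mean(data**2) + np.mean(model_samples**2)
    theta += lr * grad                   # gradient *ascent* on log-likelihood
print(theta)  # converges close to 0.5
```

At the fixed point the model's second moment matches the data's, which is exactly the balance the two gradient terms express.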

13 of 34

Dropping normalization: score-based models

  • Sampling with Langevin dynamics only needs the score function:

    s(x) = ∇_x log p(x) = −∇_x E(x)  (no Z!)

Q: can we learn the score function directly, without worrying about Z_θ?

A: Yes! Use score matching (Hyvärinen, 2005). Learn s_θ by minimizing:

    E_{p(x)} ‖s_θ(x) − ∇_x log p(x)‖²

(with some tricks, this can be rewritten to be independent of the unknown ∇_x log p)

14 of 34

Density vs score

Source: https://yang-song.net/blog/2021/score/

15 of 34

Score-based generative modeling so far

Source: https://yang-song.net/blog/2021/score/

16 of 34

Issues with “vanilla” score-based models

Difficult to estimate scores in low-density regions!

Related to difficulties of sampling

17 of 34

Adding noise

Solution: add noise to increase density everywhere

→ score matching is easier

→ Langevin sampling is easier

But: data is now noisy!

18 of 34

Denoising score matching: “predicting noise”

  • Consider noisy samples x̃ = x + σz, z ∼ N(0, I)
  • Their distribution is:

    p_σ(x̃) = ∫ p(x) N(x̃; x, σ²I) dx

  • Vincent (2010) shows:

    ∇_x̃ log p_σ(x̃) = E[(x − x̃)/σ² | x̃]

  • Score matching on p_σ ⇔ learning a “denoiser” that predicts the noise from x̃
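This identity can be verified numerically in one dimension. The sketch below is a toy example with Gaussian data, where the denoising score matching fit has a closed-form least-squares solution and the true noisy score is known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(200_000)                     # clean 1-D data ~ N(0, 1)
x_tilde = x + sigma * rng.standard_normal(x.shape)   # noisy samples

# Denoising score matching for a linear model s_theta(x) = theta * x:
#   minimize E || s_theta(x_tilde) - (x - x_tilde) / sigma^2 ||^2
# The least-squares minimizer is available in closed form:
target = (x - x_tilde) / sigma**2
theta = (x_tilde @ target) / (x_tilde @ x_tilde)

# The noisy density is N(0, 1 + sigma^2), whose true score has slope
# -1 / (1 + sigma^2) = -0.8; the fit should recover this:
print(theta)
```

The regression target (x − x̃)/σ² is exactly "predicting the (scaled) noise", which is why the fitted score model matches the score of the smoothed density.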

19 of 34

Repeating for multiple noise levels

Recipe for score-based generative model (Song & Ermon, 2019)

  • Consider multiple increasing scales of Gaussian noise σ_1 < σ_2 < … < σ_L
  • Estimate score functions s_θ(x, σ_i) with neural networks for each scale

  • Sample using annealed Langevin dynamics: start at the largest scale, and gradually decrease the scale and the learning rate

Related: “Denoising Diffusion Probabilistic Models” (Ho et al, 2020), but different motivations
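The annealed sampling recipe can be sketched as follows. This is a toy illustration, not the paper's exact algorithm: the scores at each noise level are given in closed form (Gaussian data), and the step-size schedule ε_i ∝ σ_i² follows Song & Ermon (2019).

```python
import numpy as np

def annealed_langevin(scores, sigmas, n_chains=2000, steps_per_level=500,
                      eps0=1e-3, rng=None):
    """Annealed Langevin dynamics: run Langevin at each noise scale, from the
    largest sigma to the smallest, with step size eps_i = eps0 * (sigma_i / sigma_min)^2."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = sigmas[0] * rng.standard_normal(n_chains)   # init from large-scale noise
    for sigma, score in zip(sigmas, scores):
        eps = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_chains)
            x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

# Toy setup where the noisy scores are known exactly: data ~ N(0, 1), so the
# sigma-smoothed density is N(0, 1 + sigma^2) with score -x / (1 + sigma^2).
sigmas = [2.0, 1.0, 0.5, 0.1]
scores = [lambda x, s=s: -x / (1 + s**2) for s in sigmas]
samples = annealed_langevin(scores, sigmas)
```

Each level starts from the samples of the previous, noisier level, so the chain never has to cross a low-density region at small σ, which is exactly the failure mode the annealing is designed to avoid.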

20 of 34

Illustration: Gaussian Mixture

Source: https://yang-song.net/blog/2021/score/

21 of 34

Illustration: CelebA and Cifar10

Source: https://yang-song.net/blog/2021/score/

22 of 34

Which neural networks for s_θ(x, σ)?

U-nets

  • Image-to-image
  • Very successful for segmentation
  • Different ways to incorporate scale
    • Conditional instance normalization [Song]
    • Sinusoidal embeddings [Ho]
  • [Ho] includes self-attention

Can also use transformers (Peebles & Xie, 2023)
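The sinusoidal embedding of the noise level / timestep can be sketched as below; in an actual U-net this vector would be passed through an MLP and added to intermediate feature maps (not shown here).

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal embedding of timestep t, in the style of Ho et al. (2020):
    sin and cos at dim/2 geometrically spaced frequencies, concatenated."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.arange(10), dim=64)
print(emb.shape)  # (10, 64)
```

The geometric frequency spacing lets the network resolve both coarse and fine differences in noise level from a single fixed-size vector, with no learned parameters in the embedding itself.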

23 of 34

Continuous-time formulation (Song et al, 2021)

Q: can we avoid Langevin and directly sample across scales?

A: yes, by considering an infinite number of noise scales!

  • Adding noise ⇔ diffusion process (similar to heat equation)
  • We can reverse the diffusion process to get back the data distribution!

24 of 34

Forward and reverse processes

  • Forward diffusion process SDE (stochastic differential equation):

    dx = f(x, t) dt + g(t) dw

g(t) controls how much noise is added at each infinitesimal step, similar to the noise scales previously

25 of 34

Forward and reverse processes

  • Backward data-generating process (starting from pure noise):

    dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄

Resembles annealed Langevin, but each step depends on the score function at the current noise level.

26 of 34

Discretization and sampling

  • Sampling now corresponds to discretization of the backward process
  • Discretization ⇔ numerical solvers for the SDE
    • Euler-Maruyama
    • Runge-Kutta
  • Different choices of g(t), f(x, t) can lead to different performance
  • Can be combined with Langevin samplers (“predictor-corrector”)
  • There exists a deterministic ODE (“probability flow ODE”) which has the same solution → noise-free sampling

See also: (Karras et al, 2022)
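An Euler-Maruyama discretization of the reverse process can be sketched as follows. This is a toy illustration under stated assumptions: a variance-exploding schedule (f = 0), a geometric σ(t) chosen for the example, and exact scores for Gaussian data so the result can be checked.

```python
import numpy as np

def reverse_sde_sample(score, sigma, n_chains=4000, n_steps=2000, rng=None):
    """Euler-Maruyama discretization of the reverse-time SDE
        dx = -g(t)^2 * score(x, t) dt + g(t) dw,
    integrated from t = 1 down to t = 0, where g(t)^2 = d[sigma(t)^2]/dt
    (variance-exploding schedule, approximated by finite differences)."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = 1.0 / n_steps
    x = sigma(1.0) * rng.standard_normal(n_chains)  # start from wide noise
    for i in range(n_steps, 0, -1):
        t = i * dt
        g2 = (sigma(t) ** 2 - sigma(t - dt) ** 2) / dt
        x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(n_chains)
    return x

# Toy setup with exact scores: data ~ N(0, 1) and a geometric noise schedule,
# so the perturbed density at time t is N(0, 1 + sigma(t)^2).
sigma = lambda t: 0.01 * (10.0 / 0.01) ** t
score = lambda x, t: -x / (1 + sigma(t) ** 2)
samples = reverse_sde_sample(score, sigma)
```

Swapping Euler-Maruyama for a higher-order solver, or interleaving Langevin corrector steps, changes only this loop, which is why the continuous-time view makes sampler design so modular.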

27 of 34

Controllable / guided generation

  • We may want to only sample relevant images for a given class or prompt y
  • To sample from p(x | y), we can use conditional scores ∇_x log p(x | y)
  • Denoising score matching no longer works directly, but we may use Bayes' rule:

    ∇_x log p(x | y) = ∇_x log p(x) + ∇_x log p(y | x)

  • p(y | x) is a classifier of noisy data, which should be learned separately
  • Recent methods use “classifier-free guidance” (Ho & Salimans, 2022)
  • Often, improved quality thanks to reduced variability in the distribution
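The classifier-free guidance rule can be illustrated in one dimension. This is a hedged toy sketch: the Gaussian conditional/unconditional scores and the guidance weight w are illustrative assumptions, chosen so the guided distribution is known in closed form.

```python
import numpy as np

# Classifier-free guidance: blend conditional and unconditional scores,
#   s_guided(x) = s_uncond(x) + w * (s_cond(x) - s_uncond(x)),  with w > 1.
# Toy Gaussians: unconditional p(x) = N(0, 5), conditional p(x | y) = N(2, 1).
s_uncond = lambda x: -x / 5.0
s_cond = lambda x: -(x - 2.0)
w = 2.0
s_guided = lambda x: s_uncond(x) + w * (s_cond(x) - s_uncond(x))

# The guided score is still Gaussian, with precision (1 - w)/5 + w = 1.8,
# i.e. mean 2w/1.8 ~ 2.22 and variance 1/1.8 ~ 0.56 (< 1): guidance both
# shifts samples toward the class and *reduces* their variability.
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
eps = 1e-2
for _ in range(3000):
    x = x + eps * s_guided(x) + np.sqrt(2 * eps) * rng.standard_normal(4000)
print(x.mean(), x.std())
```

The shrunken variance in this toy case mirrors the slide's remark: guidance often trades diversity for sample quality.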

28 of 34

Examples: class conditional

29 of 34

Examples: inpainting, colorization

30 of 34

Examples: text-to-image (DALL-E 2, Imagen)

31 of 34

Examples: edges + text

32 of 34

Examples: text-to-video (Imagen video, make-a-video)

Imagen Video (Google), Make-a-video (Meta)

33 of 34

Examples: 3D

34 of 34

Final remarks

  • Diffusion models are now dominating generative modeling in vision
  • Future applications: Long videos? Movies? Music?
  • Many issues in practice:
    • Data privacy? Trained on lots of images from the web
    • Security? Deepfakes
    • Are the generated images really new?