1 of 34

Generative modeling

DS-GA 3001 - Intro to Computer Vision - 4/24/2023

Alberto Bietti (alberto@bietti.me)

2 of 34

Supervised vs unsupervised learning

Supervised learning:

  • Given labeled data (x_i, y_i)
  • Goal: predict y for a new x

Learn by minimizing:

    min_θ (1/n) Σ_i ℓ(f_θ(x_i), y_i)

f_θ: neural network (CNN, transformer, ...)

[Figure: example images with predicted labels “dog” and “cat”]

3 of 34

Supervised vs unsupervised learning

Unsupervised learning

  • Given unlabeled data x_1, …, x_n
  • Estimate / discover properties of the underlying data distribution p(x)
  • Possible goals: clustering, compression, generation/sampling

4 of 34

“Generative AI”

Source: https://www.blueshadow.art/midjourney-prompt-commands/

5 of 34

“Generative AI”: conditional generative models

(DALL-E, Stable Diffusion, Midjourney, etc.)

Typical goal: conditional generation from text descriptions

  • Data: pairs (x, caption)
  • Learn conditional generative model p(x | caption) instead of just p(x)
  • Usually replace “caption” by its vector representation given by a text-image model (e.g. CLIP)

6 of 34

Density estimation

Estimate the density p(x) from samples x_1, …, x_n ∼ p

    p̂_n = (1/n) Σ_i δ_{x_i} (empirical distribution)

7 of 34

Density estimation: failure modes

Overfitting / memorization

Underfitting / oversmoothing

Mode collapse

8 of 34

How to estimate high-dimensional distributions?

  • Explicit density models (Gaussian mixture, auto-regressive model, VAE, normalizing flows)
    • Explicit form of the density p_θ(x)
    • Train using maximum likelihood (or approximations)
    • Sampling using specific model structure, or appropriate samplers

  • Implicit models (Generative Adversarial Networks)
    • p_θ is defined implicitly as the distribution of x = g_θ(z) for z ∼ N(0, I)
    • Train by minimizing a distance D(p_θ, p̂_n) to the empirical data distribution
    • D is a metric on distributions (e.g. Wasserstein, MMD, KL, …)
    • Often requires adversarial (min-max) optimization

We will focus on the most recently successful family: score-based and diffusion models

9 of 34

Warmup: energy-based models (EBMs)

    p_θ(x) = exp(−E_θ(x)) / Z_θ

E_θ: energy function

Z_θ = ∫ exp(−E_θ(x)) dx: normalizing constant

10 of 34

Sampling from EBMs with Langevin dynamics

  • How can we get new samples if we know E_θ(x) but not Z_θ?
  • Langevin dynamics: iterate the following

    x_{t+1} = x_t + ε ∇_x log p_θ(x_t) + √(2ε) z_t,  z_t ∼ N(0, I)

  • Similar to gradient descent optimization of E_θ (note ∇_x log p_θ = −∇_x E_θ), with some added noise!
  • For small ε and large t, we can show that x_t is an approximate sample from p_θ
  • But: very slow in the presence of low-density regions
    • Jumping across modes is difficult!
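The Langevin iteration can be sketched in a few lines of NumPy. This is a toy sketch, not production code: it assumes the score function is known in closed form (here the standard Gaussian score −x, so the stationary distribution can be checked).

```python
import numpy as np

def langevin_sample(score, x0, eps=1e-2, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics:
    x <- x + eps * score(x) + sqrt(2 * eps) * z,  z ~ N(0, I)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
    return x

# Toy target with a known score: p = N(0, 1), so score(x) = -x.
# Run 2000 parallel chains; their empirical mean and std should be near 0 and 1.
samples = langevin_sample(lambda x: -x, x0=np.zeros(2000))
print(samples.mean(), samples.std())
```

Note the step size trade-off from the slide: a smaller `eps` reduces the discretization bias but requires more steps to mix, and no step size helps the chain jump between well-separated modes.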

11 of 34

Langevin dynamics example

Mixture of Gaussians (source: https://yang-song.net/blog/2021/score/)

12 of 34

Training EBMs

  • How to learn the energy function E_θ?
  • If the normalizing constant Z_θ is known/tractable → easy via maximum likelihood!
  • In general, the normalizing constant may be intractable!
  • Possible solution: approximate gradient ascent on the log-likelihood

    ∇_θ log p_θ(x) = −∇_θ E_θ(x) + E_{x′∼p_θ}[∇_θ E_θ(x′)]

First term can be computed exactly, second term may need sampling from p_θ (e.g. with Langevin dynamics)
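This gradient can be checked on a toy model where everything is tractable. The sketch below is illustrative and assumes a quadratic energy E_θ(x) = θx², so the model is Gaussian and exact model samples are available (in general the second term would require Langevin sampling).

```python
import numpy as np

# Toy EBM: E_theta(x) = theta * x^2, so p_theta = N(0, 1/(2*theta)).
# Log-likelihood gradient (first term exact, second term via model samples):
#   d/dtheta log p_theta(x) = -x^2 + E_{x' ~ p_theta}[x'^2]
rng = np.random.default_rng(0)
data = rng.standard_normal(50_000)      # data ~ N(0, 1), so theta* = 0.5
theta, lr = 1.0, 0.05
for _ in range(500):
    model_samples = rng.standard_normal(2000) / np.sqrt(2 * theta)
    grad = -np.mean(data**2) + np.mean(model_samples**2)
    theta += lr * grad                   # gradient *ascent* on log-likelihood
print(theta)  # converges close to 0.5
```

At the fixed point the model's second moment matches the data's, which is exactly the balance the two gradient terms express.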

13 of 34

Dropping normalization: score-based models

  • Sampling with Langevin dynamics only needs the score function:

    s(x) = ∇_x log p(x) = −∇_x E(x)  (no Z!)

Q: can we learn the score function directly, without worrying about Z_θ?

A: Yes! Use score matching (Hyvärinen, 2005). Learn s_θ by minimizing:

    E_{p(x)} ‖s_θ(x) − ∇_x log p(x)‖²

(with some tricks, this can be rewritten to be independent of the unknown ∇_x log p)

14 of 34

Density vs score

Source: https://yang-song.net/blog/2021/score/

15 of 34

Score-based generative modeling so far

Source: https://yang-song.net/blog/2021/score/

16 of 34

Issues with “vanilla” score-based models

Difficult to estimate scores in low-density regions!

Related to difficulties of sampling

17 of 34

Adding noise

Solution: add noise to increase density everywhere

→ score matching is easier

→ Langevin sampling is easier

But: data is now noisy!

18 of 34

Denoising score matching: “predicting noise”

  • Consider noisy samples x̃ = x + σz, z ∼ N(0, I)
  • Their distribution is:

    p_σ(x̃) = ∫ p(x) N(x̃; x, σ²I) dx

  • Vincent (2010) shows:

    ∇_x̃ log p_σ(x̃) = E[(x − x̃)/σ² | x̃]

  • Score matching on p_σ ⇔ learning a “denoiser” that predicts the noise from x̃
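This identity can be verified numerically in one dimension. The sketch below is a toy example with Gaussian data, where the denoising score matching fit has a closed-form least-squares solution and the true noisy score is known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(200_000)                     # clean 1-D data ~ N(0, 1)
x_tilde = x + sigma * rng.standard_normal(x.shape)   # noisy samples

# Denoising score matching for a linear model s_theta(x) = theta * x:
#   minimize E || s_theta(x_tilde) - (x - x_tilde) / sigma^2 ||^2
# The least-squares minimizer is available in closed form:
target = (x - x_tilde) / sigma**2
theta = (x_tilde @ target) / (x_tilde @ x_tilde)

# The noisy density is N(0, 1 + sigma^2), whose true score has slope
# -1 / (1 + sigma^2) = -0.8; the fit should recover this:
print(theta)
```

The regression target (x − x̃)/σ² is exactly "predicting the (scaled) noise", which is why the fitted score model matches the score of the smoothed density.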

19 of 34

Repeating for multiple noise levels

Recipe for score-based generative model (Song & Ermon, 2019)

  • Consider multiple increasing scales of Gaussian noise σ_1 < σ_2 < … < σ_L
  • Estimate score functions s_θ(x, σ_i) with neural networks for each scale

  • Sample using annealed Langevin dynamics: start at the largest scale, and gradually decrease the scale and the learning rate

Related: “Denoising Diffusion Probabilistic Models” (Ho et al, 2020), but different motivations
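The annealed sampling recipe can be sketched as follows. This is a toy illustration, not the paper's exact algorithm: the scores at each noise level are given in closed form (Gaussian data), and the step-size schedule ε_i ∝ σ_i² follows Song & Ermon (2019).

```python
import numpy as np

def annealed_langevin(scores, sigmas, n_chains=2000, steps_per_level=500,
                      eps0=1e-3, rng=None):
    """Annealed Langevin dynamics: run Langevin at each noise scale, from the
    largest sigma to the smallest, with step size eps_i = eps0 * (sigma_i / sigma_min)^2."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = sigmas[0] * rng.standard_normal(n_chains)   # init from large-scale noise
    for sigma, score in zip(sigmas, scores):
        eps = eps0 * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_chains)
            x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

# Toy setup where the noisy scores are known exactly: data ~ N(0, 1), so the
# sigma-smoothed density is N(0, 1 + sigma^2) with score -x / (1 + sigma^2).
sigmas = [2.0, 1.0, 0.5, 0.1]
scores = [lambda x, s=s: -x / (1 + s**2) for s in sigmas]
samples = annealed_langevin(scores, sigmas)
```

Each level starts from the samples of the previous, noisier level, so the chain never has to cross a low-density region at small σ, which is exactly the failure mode the annealing is designed to avoid.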

20 of 34

Illustration: Gaussian Mixture

Source: https://yang-song.net/blog/2021/score/

21 of 34

Illustration: CelebA and Cifar10

Source: https://yang-song.net/blog/2021/score/

22 of 34

Which neural networks for s_θ(x, σ)?

U-nets

  • Image-to-image
  • Very successful for segmentation
  • Different ways to incorporate scale
    • Conditional instance normalization [Song]
    • Sinusoidal embeddings [Ho]
  • [Ho] includes self-attention

Can also use transformers (Peebles & Xie, 2023)
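The sinusoidal embedding of the noise level / timestep can be sketched as below; in an actual U-net this vector would be passed through an MLP and added to intermediate feature maps (not shown here).

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal embedding of timestep t, in the style of Ho et al. (2020):
    sin and cos at dim/2 geometrically spaced frequencies, concatenated."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.arange(10), dim=64)
print(emb.shape)  # (10, 64)
```

The geometric frequency spacing lets the network resolve both coarse and fine differences in noise level from a single fixed-size vector, with no learned parameters in the embedding itself.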

23 of 34

Continuous-time formulation (Song et al, 2021)

Q: can we avoid Langevin and directly sample across scales?

A: yes, by considering an infinite number of noise scales!

  • Adding noise ⇔ diffusion process (similar to heat equation)
  • We can reverse the diffusion process to get back the data distribution!

24 of 34

Forward and reverse processes

  • Forward diffusion process SDE (stochastic differential equation):

    dx = f(x, t) dt + g(t) dw

g(t) controls how much noise is added at each infinitesimal step, similar to the noise scales previously

25 of 34

Forward and reverse processes

  • Backward data-generating process (starting from pure noise):

    dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄

Resembles annealed Langevin, but each step depends on the score function at the current noise level.

26 of 34

Discretization and sampling

  • Sampling now corresponds to discretization of the backward process
  • Discretization ⇔ numerical solvers for the SDE
    • Euler-Maruyama
    • Runge-Kutta
  • Different choices of g(t), f(x, t) can lead to different performance
  • Can be combined with Langevin samplers (“predictor-corrector”)
  • There exists a deterministic ODE (“probability flow ODE”) which has the same solution → noise-free sampling

See also: (Karras et al, 2022)
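An Euler-Maruyama discretization of the reverse process can be sketched as follows. This is a toy illustration under stated assumptions: a variance-exploding schedule (f = 0), a geometric σ(t) chosen for the example, and exact scores for Gaussian data so the result can be checked.

```python
import numpy as np

def reverse_sde_sample(score, sigma, n_chains=4000, n_steps=2000, rng=None):
    """Euler-Maruyama discretization of the reverse-time SDE
        dx = -g(t)^2 * score(x, t) dt + g(t) dw,
    integrated from t = 1 down to t = 0, where g(t)^2 = d[sigma(t)^2]/dt
    (variance-exploding schedule, approximated by finite differences)."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = 1.0 / n_steps
    x = sigma(1.0) * rng.standard_normal(n_chains)  # start from wide noise
    for i in range(n_steps, 0, -1):
        t = i * dt
        g2 = (sigma(t) ** 2 - sigma(t - dt) ** 2) / dt
        x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(n_chains)
    return x

# Toy setup with exact scores: data ~ N(0, 1) and a geometric noise schedule,
# so the perturbed density at time t is N(0, 1 + sigma(t)^2).
sigma = lambda t: 0.01 * (10.0 / 0.01) ** t
score = lambda x, t: -x / (1 + sigma(t) ** 2)
samples = reverse_sde_sample(score, sigma)
```

Swapping Euler-Maruyama for a higher-order solver, or interleaving Langevin corrector steps, changes only this loop, which is why the continuous-time view makes sampler design so modular.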

27 of 34

Controllable / guided generation

  • We may want to only sample relevant images for a given class or prompt y
  • To sample from p(x | y), we can use conditional scores ∇_x log p(x | y)
  • Denoising score matching no longer works directly, but we may use Bayes' rule:

    ∇_x log p(x | y) = ∇_x log p(x) + ∇_x log p(y | x)

  • p(y | x) is a classifier of noisy data, which should be learned separately
  • Recent methods use “classifier-free guidance” (Ho & Salimans, 2022)
  • Often, improved quality thanks to reduced variability in the distribution
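The classifier-free guidance rule can be illustrated in one dimension. This is a hedged toy sketch: the Gaussian conditional/unconditional scores and the guidance weight w are illustrative assumptions, chosen so the guided distribution is known in closed form.

```python
import numpy as np

# Classifier-free guidance: blend conditional and unconditional scores,
#   s_guided(x) = s_uncond(x) + w * (s_cond(x) - s_uncond(x)),  with w > 1.
# Toy Gaussians: unconditional p(x) = N(0, 5), conditional p(x | y) = N(2, 1).
s_uncond = lambda x: -x / 5.0
s_cond = lambda x: -(x - 2.0)
w = 2.0
s_guided = lambda x: s_uncond(x) + w * (s_cond(x) - s_uncond(x))

# The guided score is still Gaussian, with precision (1 - w)/5 + w = 1.8,
# i.e. mean 2w/1.8 ~ 2.22 and variance 1/1.8 ~ 0.56 (< 1): guidance both
# shifts samples toward the class and *reduces* their variability.
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
eps = 1e-2
for _ in range(3000):
    x = x + eps * s_guided(x) + np.sqrt(2 * eps) * rng.standard_normal(4000)
print(x.mean(), x.std())
```

The shrunken variance in this toy case mirrors the slide's remark: guidance often trades diversity for sample quality.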

28 of 34

Examples: class conditional

29 of 34

Examples: inpainting, colorization

30 of 34

Examples: text-to-image (DALL-E 2, Imagen)

31 of 34

Examples: edges + text

32 of 34

Examples: text-to-video (Imagen video, make-a-video)

Imagen Video (Google), Make-a-video (Meta)

33 of 34

Examples: 3D

34 of 34

Final remarks

  • Diffusion models are now dominating generative modeling in vision
  • Future applications: Long videos? Movies? Music?
  • Many issues in practice:
    • Data privacy? Trained on lots of images from the web
    • Security? Deepfakes
    • Are the generated images really new?