Generative modeling
DS-GA 3001 - Intro to Computer Vision - 4/24/2023
Alberto Bietti (alberto@bietti.me)
Supervised vs unsupervised learning
Supervised learning: given labeled pairs (x_i, y_i), learn a predictor f_θ
Learn by minimizing: min_θ (1/n) Σ_i ℓ(f_θ(x_i), y_i)
f_θ: neural network (CNN, transformer, ...)
(figure: example images with labels “dog” / “cat”)
Supervised vs unsupervised learning
Unsupervised learning: learn structure from unlabeled samples x_1, ..., x_n (no labels y_i)
“Generative AI”
Source: https://www.blueshadow.art/midjourney-prompt-commands/
“Generative AI”: conditional generative models
(DALL-E, Stable Diffusion, Midjourney, etc.)
Typical goal: conditional generation from text descriptions
Density estimation
Estimate the data distribution p from i.i.d. samples x_1, ..., x_n ~ p
p̂_n = (1/n) Σ_i δ_{x_i} (empirical distribution)
Density estimation: failure modes
Overfitting / memorization: the model reproduces training samples
Underfitting / oversmoothing: the model blurs out details of the distribution
Mode collapse: the model covers only some modes of the distribution
How to estimate high-dimensional distributions?
We will focus on the most recently successful family: score-based and diffusion models
Warmup: energy-based models (EBMs)
p_θ(x) = exp(-E_θ(x)) / Z_θ
E_θ(x): energy function (e.g., a neural network)
Z_θ = ∫ exp(-E_θ(x)) dx: normalizing constant (usually intractable)
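As a minimal sketch of the definitions above (a toy 1D example of my own, not from the slides): with quadratic energy E(x) = x²/2, the EBM density is the standard Gaussian and the normalizing constant Z = √(2π), which we can recover numerically.

```python
import numpy as np

# Toy 1D energy-based model (my own example): E(x) = x^2 / 2, so that
# p(x) = exp(-E(x)) / Z is the standard Gaussian and Z = sqrt(2 * pi).
def energy(x):
    return 0.5 * x ** 2

xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
unnormalized = np.exp(-energy(xs))       # exp(-E(x)), known up to Z
Z = unnormalized.sum() * dx              # numerical estimate of the normalizing constant
density = unnormalized / Z               # now integrates to ~1

print(Z)  # ~ sqrt(2 * pi) ≈ 2.5066
```

In high dimensions this integral is exactly what becomes intractable, which motivates the sampling and score-based approaches below.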
Sampling from EBMs with Langevin dynamics
x_{t+1} = x_t + ε ∇_x log p_θ(x_t) + √(2ε) z_t,  z_t ~ N(0, I)
Note: ∇_x log p_θ(x) = -∇_x E_θ(x) does not involve Z_θ
Langevin dynamics example
Mixture of Gaussians (source: https://yang-song.net/blog/2021/score/)
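A runnable sketch of the Langevin update (my own toy example, with a target whose score is known in closed form rather than a learned energy): for a standard Gaussian target, ∇_x log p(x) = -x, and iterating the update drives samples toward N(0, 1).

```python
import numpy as np

# Unadjusted Langevin dynamics targeting a standard Gaussian (toy example):
# the score is known exactly, grad_x log p(x) = -x.
rng = np.random.default_rng(0)

def score(x):
    return -x

eps = 0.01                                # step size
x = rng.normal(size=5000) * 5.0           # initialize far from the target
for _ in range(2000):
    z = rng.normal(size=x.shape)
    x = x + eps * score(x) + np.sqrt(2 * eps) * z

# Samples are now approximately N(0, 1).
print(x.mean(), x.std())
```

With a small step size and enough iterations, the chain's stationary distribution approaches the target; in a real EBM the score -∇_x E_θ(x) replaces the closed-form score.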
Training EBMs
Maximum likelihood: ∇_θ log p_θ(x) = -∇_θ E_θ(x) + E_{x'~p_θ}[∇_θ E_θ(x')]
First term can be computed exactly; second term may need sampling from p_θ (e.g., with Langevin dynamics)
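A sketch of this two-term gradient on a toy model of my own (not from the slides): with E_θ(x) = (x - θ)²/2 we have p_θ = N(θ, 1), so exact model samples stand in for the MCMC step, and gradient ascent recovers the data mean.

```python
import numpy as np

# Toy maximum-likelihood EBM training (my own example):
# E_theta(x) = (x - theta)^2 / 2, so p_theta = N(theta, 1), and
# grad_theta log p_theta(x) = -grad_theta E(x) + E_{x'~p_theta}[grad_theta E(x')].
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10000)   # true mean = 3

def grad_E(x, theta):            # d E_theta(x) / d theta
    return -(x - theta)

theta, lr = 0.0, 0.5
for _ in range(200):
    # Exact sampler used as a stand-in for Langevin/MCMC model samples.
    model_samples = rng.normal(loc=theta, scale=1.0, size=10000)
    grad = -grad_E(data, theta).mean() + grad_E(model_samples, theta).mean()
    theta = theta + lr * grad    # gradient ascent on the log-likelihood

print(theta)  # close to the true mean 3.0
```

In practice the model samples come from an MCMC chain such as Langevin dynamics, which is exactly what makes EBM training expensive.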
Dropping normalization: score-based models
Q: can we learn the score function s_θ(x) ≈ ∇_x log p(x) directly, without worrying about Z_θ?
A: Yes! Use score matching (Hyvärinen, 2005). Learn by minimizing:
J(θ) = E_{p(x)}[ ½ ||s_θ(x) - ∇_x log p(x)||² ]
(with integration by parts, this can be rewritten to be independent of the unknown ∇_x log p(x))
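A sketch of the rewritten objective in 1D (my own toy parametrization, not from the slides): for a Gaussian model score s_v(x) = -x/v, the integration-by-parts form J(v) = E[ s'(x) + ½ s(x)² ] uses no information about the true score, yet its minimizer recovers the data's second moment.

```python
import numpy as np

# Score matching via Hyvarinen's trick (toy 1D example): model score
# s_v(x) = -x / v, a hypothetical parametrization for illustration.
# J(v) = E[ ds/dx + 0.5 * s(x)^2 ] involves only data samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=100000)   # E[x^2] = 4

def J(v):
    s = -data / v                # model score at the data points
    ds = -1.0 / v                # derivative of the score (constant in x)
    return np.mean(ds + 0.5 * s ** 2)

grid = np.linspace(1.0, 8.0, 701)
v_hat = grid[np.argmin([J(v) for v in grid])]
print(v_hat)  # close to the true second moment 4.0
```

Analytically, J(v) = -1/v + E[x²]/(2v²) is minimized at v = E[x²], matching what the grid search finds.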
Density vs score
Source: https://yang-song.net/blog/2021/score/
Score-based generative modeling so far
Source: https://yang-song.net/blog/2021/score/
Issues with “vanilla” score-based models
Difficult to estimate scores in low-density regions, where few training samples fall!
Related difficulty for sampling: Langevin dynamics mixes slowly and can weight modes incorrectly when scores are inaccurate between them
Adding noise
Solution: add noise to increase density everywhere
→ score matching is easier
→ Langevin sampling is easier
But: data is now noisy!
Denoising score matching: “predicting noise”
Perturb the data: x̃ = x + σ z, z ~ N(0, I); match the score of the corruption kernel:
∇_x̃ log q_σ(x̃ | x) = -(x̃ - x)/σ² = -z/σ, so the network learns to predict the noise (up to scaling)
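The “predicting noise” identity can be checked directly (a small sketch with my own toy data): for Gaussian corruption, the conditional score target is exactly the negated, rescaled noise.

```python
import numpy as np

# Denoising score matching target (toy check): for x_noisy = x + sigma * z,
# grad log q_sigma(x_noisy | x) = -(x_noisy - x) / sigma^2 = -z / sigma,
# so matching this target is equivalent to predicting the noise z.
rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=1000)          # clean data
z = rng.normal(size=1000)          # Gaussian noise
x_noisy = x + sigma * z

target = -(x_noisy - x) / sigma ** 2
print(np.allclose(target, -z / sigma))  # the two forms agree
```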
Repeating for multiple noise levels σ_1 > σ_2 > ... > σ_L: train a noise-conditional score network s_θ(x, σ) jointly across levels
Recipe for score-based generative model (Song & Ermon, 2019): train s_θ(x, σ) with denoising score matching, then sample with annealed Langevin dynamics, from the largest noise level down to the smallest
Related: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020), derived from a different (variational) motivation but closely connected
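A runnable sketch of annealed Langevin sampling (my own toy setup, not the paper's network): for data ~ N(0, 1), the σ-perturbed density is N(0, 1 + σ²), so the noise-conditional score is known in closed form and plays the role of the trained s_θ(x, σ).

```python
import numpy as np

# Annealed Langevin sampling across decreasing noise levels (toy example):
# the closed-form score(x, sigma) = -x / (1 + sigma^2) stands in for a
# trained noise-conditional score network.
rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, 10)     # geometric noise schedule

x = rng.normal(size=5000) * 10.0          # init at the largest noise scale
for sigma in sigmas:
    eps = 0.2 * sigma ** 2                # step size shrinks with the noise level
    for _ in range(100):
        z = rng.normal(size=x.shape)
        x = x + eps * (-x / (1 + sigma ** 2)) + np.sqrt(2 * eps) * z

print(x.mean(), x.std())  # approximately N(0, 1)
```

The large-σ levels let chains move freely across the space; the small-σ levels refine samples toward the data distribution.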
Illustration: Gaussian Mixture
Source: https://yang-song.net/blog/2021/score/
Illustration: CelebA and CIFAR-10
Source: https://yang-song.net/blog/2021/score/
Which neural networks for s_θ(x, σ)?
U-Nets: convolutional encoder-decoder architectures with skip connections
Can also use transformers (Peebles & Xie, 2023)
Continuous-time formulation (Song et al, 2021)
Q: can we avoid Langevin and directly sample across scales?
A: yes, by considering an infinite number of noise scales!
Forward and reverse processes
Forward SDE: dx = f(x, t) dt + g(t) dw, progressively turning data into noise
g(t) controls how much noise is added at each infinitesimal step, playing the role of the discrete noise scales σ_i
Forward and reverse processes
Reverse SDE (Anderson, 1982): dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄, run backward in time
Resembles annealed Langevin, but each step depends on the score function at the current noise level.
Discretization and sampling
See also: (Karras et al., 2022)
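A sketch of the simplest discretization, Euler–Maruyama on the reverse SDE (my own toy example, with f = 0 and a closed-form score standing in for a trained network): for the variance-exploding choice σ(t) = t, i.e. g(t) = √(2t), and data ~ N(2, 0.25), the perturbed density is N(2, 0.25 + t²).

```python
import numpy as np

# Euler-Maruyama on the reverse-time SDE (toy example): VE forward process
# x_t = x_0 + sigma(t) * z with sigma(t) = t, so g(t)^2 = 2t.  For data
# ~ N(2, 0.25) the exact score of p_t is known in closed form.
rng = np.random.default_rng(0)

def score(x, t):
    return -(x - 2.0) / (0.25 + t ** 2)

T, n_steps = 5.0, 2000
dt = T / n_steps
x = rng.normal(size=5000) * np.sqrt(0.25 + T ** 2) + 2.0  # sample p_T
for i in range(n_steps):
    t = T - i * dt                      # integrate backward from T to ~0
    g2 = 2.0 * t                        # g(t)^2 for sigma(t) = t
    z = rng.normal(size=x.shape)
    x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * z

print(x.mean(), x.std())  # approximately the data distribution N(2, 0.5^2)
```

Finer discretizations (more steps) and higher-order solvers, as studied by Karras et al. (2022), reduce the remaining discretization error.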
Controllable / guided generation
Examples: class conditional
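Class-conditional generation typically relies on the guidance identity ∇_x log p_t(x | y) = ∇_x log p_t(x) + ∇_x log p_t(y | x) (classifier guidance), which we can verify numerically on a toy two-class model of my own (not from the slides).

```python
import numpy as np

# Numerical check of the classifier-guidance identity on a toy model:
# class y=0 gives x ~ N(-2, 1), class y=1 gives x ~ N(+2, 1), equal priors.
def log_gauss(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def log_p_x(x):                       # marginal: mixture over the two classes
    return np.logaddexp(log_gauss(x, -2.0), log_gauss(x, 2.0)) + np.log(0.5)

def log_p_y1_given_x(x):              # "classifier" log-probability for y=1
    return log_gauss(x, 2.0) + np.log(0.5) - log_p_x(x)

def log_p_x_given_y1(x):              # conditional density for class y=1
    return log_gauss(x, 2.0)

def grad(f, x, h=1e-5):               # central finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
lhs = grad(log_p_x_given_y1, x)
rhs = grad(log_p_x, x) + grad(log_p_y1_given_x, x)
print(lhs, rhs)  # the two sides agree
```

In a diffusion model, the unconditional score comes from s_θ and the classifier term from a separate (noise-aware) classifier, steering Langevin or reverse-SDE sampling toward the desired class.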
Examples: inpainting, colorization
Examples: text-to-image (DALL-E 2, Imagen)
Examples: edges + text
Source: https://arxiv.org/abs/2302.05543
Examples: text-to-video (Imagen video, make-a-video)
Imagen Video (Google), Make-a-video (Meta)
Examples: 3D
Source: https://dreamfusion3d.github.io/
Final remarks