Stable Diffusion Session
Camilo Fosco
April 11, 2023
6.8300/6.8301: Advances in Computer Vision
MIT CSAIL
What is Stable Diffusion?
Stable Diffusion
kneeling cat knight, portrait, finely detailed armor, intricate design, silver, silk, cinematic lighting, 4k
History behind Stable Diffusion
Massive impact
DALL·E 1 - January 2021
DALL·E 2 - April 2022
Stable Diffusion - August 2022
Incredible ability to combine unrelated concepts
A Tortoise/Ladybird hybrid
Understanding Stable Diffusion
1. Diffusion Models
2. Latent Diffusion Models
3. Conditioning Diffusion Models
What are Generative Models?
Discriminative Models
p(Y|X)
Generative Models
p(X)
Learns: Decision boundary
Learns: Probability distribution of the data
The key: map known distribution to target distribution
Multiple ways to build generative models!
Variational Autoencoders: approximate the data distribution by reconstructing from a learned latent-space distribution
Generative Adversarial Networks: approximate the data distribution by learning to fool an adversary
Diffusion Models: approximate the data distribution by learning to gradually denoise
The forward diffusion process
What if we invert this process?
Reverse Diffusion Process
Diffusion in action over an image
The forward process
Let $q(x_0)$ be a data distribution (e.g. the distribution of real images) and $x_0 \sim q(x_0)$ be a sample from that distribution. The forward process takes the form of a Markov chain and produces corrupted samples $x_1, \ldots, x_T$ (also called latents) from $x_0$ by repeatedly sampling from successive conditional Gaussian distributions $q(x_t \mid x_{t-1})$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
The forward process
Each step of the noising process adds Gaussian noise with variance $\beta_t$.
The distribution $q(x_t \mid x_{t-1})$ takes the following form:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

With small $\beta_t$, sampling from this distribution essentially gives us $x_{t-1}$ with a little Gaussian noise added.
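To make the step concrete, here is a minimal PyTorch sketch of one forward noising step; the image shapes and the constant beta are illustrative placeholders, not a real schedule (schedules are discussed next).

```python
import torch

# One forward diffusion step: x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
def forward_step(x_prev, beta_t):
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Illustrative usage: random stand-in "images" in [-1, 1] and a constant beta.
x = torch.rand(4, 3, 64, 64) * 2 - 1
beta = torch.tensor(0.02)
for _ in range(10):
    x = forward_step(x, beta)    # x gets a little noisier each step
```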
The forward process
Why do we need small $\beta_t$?
Why do we need different betas?
Importantly, the variance $\beta_t$ is different for each time step $t$. During this forward process, we want $\beta_t$ to start small and increase over time.
*Feller, 1949
The forward process
Typically, $\beta_t$ is treated as a hyperparameter and follows a fixed schedule during our training run.
In practice, with vanilla diffusion, $T$ is on the order of 1000. In theory, running the process for infinitely many steps gives us a sample of pure Gaussian noise, that is:

$$q(x_T \mid x_0) \to \mathcal{N}(x_T;\ \mathbf{0},\ \mathbf{I}) \quad \text{as } T \to \infty$$
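A minimal sketch of the standard linear schedule from Ho et al. (2020), precomputing the cumulative products that later slides rely on; the endpoint values are the paper's, and other schedules (e.g. cosine) are also common.

```python
import torch

T = 1000
# Linear beta schedule from Ho et al. (2020): beta_1 = 1e-4 up to beta_T = 0.02.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

# After T steps almost no signal remains: sqrt(alpha_bar_T) ~ 0.006,
# so x_T is essentially a sample of pure Gaussian noise.
print(alpha_bars[-1].sqrt())
```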
Estimate the reverse process
What if we could sample from $q(x_{t-1} \mid x_t)$ for each $t$? This would allow us to repeatedly sample a slightly less noisy image from a noisy one, until we reach a sample from our initial data distribution.
Problem: it’s intractable. We’d need access to the entire distribution of images.
Estimate the reverse process
Solution: estimate it with a model!
For small enough $\beta_t$, $q(x_{t-1} \mid x_t)$ will also be Gaussian. We take advantage of this property to estimate it with $p_\theta(x_{t-1} \mid x_t)$, modeling the mean and covariance of a Gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

And our joint distribution (trajectory) can be written as:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

Here $\mu_\theta$ is a model with parameters $\theta$ that outputs the mean of the reverse step given $x_t$ and $t$.
How can we learn this model?
We could try to maximize the likelihood of the training data, $p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$. But this requires integrating over all possible trajectories, which is intractable.
Instead, we optimize a lower bound called the Evidence Lower Bound (ELBO), as in other latent-variable models (e.g. VAEs):

$$\mathbb{E}_{q}\!\left[-\log p_\theta(x_0)\right] \;\le\; \mathbb{E}_{q}\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\right] = L_{\mathrm{VLB}}$$
How can we learn this model?
Ho et al. (2020) show that we can derive a simpler and more efficient objective by taking into account the following: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the mean of the reverse step can be written in terms of the noise $\epsilon$ that was added to $x_0$:

$$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$$

This means that we need a model to output the following mean:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

Which is just predicting the added noise, $\epsilon$!
And after some simplification, we can get to the following simple objective:

$$L_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$

with $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
Estimate the noise at time $t$!
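As a concrete illustration, here is a minimal PyTorch sketch of $L_{\mathrm{simple}}$; `model` stands in for any noise predictor $\epsilon_\theta(x_t, t)$ (the `model(x_t, t)` signature is an assumption), and the schedule is the linear one from the earlier sketch.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def l_simple(model, x0):
    """L_simple: sample t and eps, build x_t in closed form, regress the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                          # uniform timestep per sample
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps
    return F.mse_loss(model(x_t, t), eps)                  # || eps - eps_theta(x_t, t) ||^2
```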
Intuition: Diffusion in 2D
Summing up
Diffusion models are designed to generate samples from a data distribution by gradually denoising a sample from an N-dimensional Gaussian distribution.
How? We train a denoising UNet, $\epsilon_\theta$, to predict the noise of image $x_t$ at time step $t$ using the following objective:

$$L_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
Training: learn to predict noise!
Sample an image $x_0$, a timestep $t$, and noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
Apply noise according to $t$: compute $x_t$ with $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ (as explained in the previous slide, we usually precompute an entire dataset of these)
Predict the noise with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
Compute the loss by comparing $\hat{\epsilon}$ to $\epsilon$, then update the parameters $\theta$ (see the training-step sketch below)
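Putting the steps together, a minimal training-loop sketch; it reuses the `l_simple` helper and schedule from the previous snippet, and the tiny convolutional stand-in model is purely illustrative (a real noise predictor is a UNet, sketched later).

```python
import torch

# Stand-in noise predictor for illustration only: it ignores t. A real model is a
# UNet with a timestep embedding.
class TinyEps(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

model = TinyEps()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Toy "dataloader": random images in [-1, 1]; in practice this is a real image dataset.
dataloader = [torch.rand(8, 3, 32, 32) * 2 - 1 for _ in range(10)]

for x0 in dataloader:
    loss = l_simple(model, x0)    # sample t and eps, noise x0 into x_t, regress eps (see above)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```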
Inference: predict noise, remove noise, repeat!
Start from pure Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$
Predict the noise with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
Remove noise from $x_t$ following the schedule to get $x_{t-1}$, then add Gaussian noise modulated by $\sigma_t$ to approximate sampling from the Gaussian representing the reverse process
Intuition: this perturbation is needed to avoid falling into local minima while moving closer to the target manifold
Replace $x_t$ with $x_{t-1}$, repeat $T$ times (see the sampling sketch below)
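A minimal DDPM-style sampling loop under the same assumptions as the earlier sketches (`model` as $\epsilon_\theta(x_t, t)$, the linear `betas`/`T` defined above); the update rule is the standard ancestral-sampling one from Ho et al. (2020).

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)):
    """DDPM ancestral sampling: start from pure noise and denoise for T steps."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(x, t_batch)                          # predict the noise at step t
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            sigma_t = betas[t].sqrt()                        # one common choice for sigma_t
            x = mean + sigma_t * torch.randn_like(x)         # perturb, except at the last step
        else:
            x = mean
    return x
```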
Latent Diffusion
Denoising the latent space
Regular denoising: apply noise to the image, then the UNet noise predictor predicts the noise directly in pixel space.
Latent-space denoising: an encoder first compresses the image into latents; noise is applied to the latents, and a UNet latent noise predictor performs the noise prediction on the latents.
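A sketch of moving between pixel and latent space with the pretrained Stable Diffusion VAE from the `diffusers` library; the checkpoint name and the 0.18215 scaling factor are the ones commonly used for SD v1 and are stated here as assumptions.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

x = torch.rand(1, 3, 512, 512) * 2 - 1                      # image in [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample() * 0.18215        # 1x4x64x64 latents (8x downsampling)
    # ...the noising / denoising now happens on z instead of x...
    recon = vae.decode(z / 0.18215).sample                   # back to pixel space
```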
The U-Net architecture
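Below is a heavily simplified PyTorch sketch of the idea: an encoder path, a bottleneck where the timestep embedding is injected, and a decoder path with a skip connection. Channel sizes and embedding details are illustrative assumptions; the real Stable Diffusion UNet also contains ResNet blocks and (cross-)attention layers.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=64):
    """Sinusoidal embedding of the integer timestep t, as used in DDPM."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    angles = t[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class TinyUNet(nn.Module):
    """Toy two-level UNet with one skip connection and a timestep embedding."""
    def __init__(self, ch=32, t_dim=64):
        super().__init__()
        self.t_proj = nn.Linear(t_dim, ch)
        self.enc = nn.Conv2d(3, ch, 3, padding=1)                      # full resolution
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)          # downsample by 2
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)                     # bottleneck
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)   # upsample by 2
        self.dec = nn.Conv2d(2 * ch, 3, 3, padding=1)                  # skip-concat -> noise

    def forward(self, x_t, t):
        temb = self.t_proj(timestep_embedding(t))[:, :, None, None]
        h1 = torch.relu(self.enc(x_t))
        h2 = torch.relu(self.down(h1))
        h2 = torch.relu(self.mid(h2 + temb))          # inject timestep information
        u = torch.relu(self.up(h2))
        return self.dec(torch.cat([u, h1], dim=1))    # skip connection from the encoder
```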
Ok, but how do I control this model?
Control through conditioning
Text Conditioning
The prompt ("A dog in a field") is passed through a text encoder, and the resulting embedding conditions the UNet latent noise predictor. The UNet takes the noisy compressed image (the latents) and the noise amount (the timestep) as inputs, and produces the noise prediction. The conditioning forces the generation to stay close to the embedding for "a dog in a field".
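A sketch of the text-conditioning path with the CLIP text encoder used by SD v1, via Hugging Face `transformers`; the checkpoint name and the 77-token context length are standard but listed here as assumptions.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a dog in a field"], padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state   # 1 x 77 x 768

# In Stable Diffusion this embedding feeds the UNet's cross-attention layers,
# e.g. unet(noisy_latents, t, encoder_hidden_states=text_emb) with a diffusers UNet2DConditionModel.
```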
Building a good text encoder: CLIP
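CLIP is trained contrastively to align image and text embeddings. As a quick illustration of that alignment, the pretrained model can score how well captions match an image (the gray placeholder image is just a stand-in):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224), color="gray")             # stand-in image
inputs = processor(text=["a dog in a field", "a cat knight in armor"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)   # caption-match probabilities
```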
So What is Stable Diffusion Exactly?
Stable Diffusion during Inference
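Putting all the pieces together (VAE, UNet, text encoder, scheduler), inference can be run in a few lines with the `diffusers` pipeline; the SD v1.5 checkpoint name, step count, and guidance scale below are common defaults given as assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("kneeling cat knight, portrait, finely detailed armor, "
          "intricate design, silver, silk, cinematic lighting, 4k")
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat_knight.png")
```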
How can you further control it?
Multiple techniques to control its outputs:
Teach new concepts
Control output properties
Textual Inversion
Dreambooth
Paper: https://arxiv.org/abs/2208.12242
Code: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion
More info: https://dreambooth.github.io/ , https://stable-diffusion-art.com/dreambooth/
Dedicated training for the decoder network (D) on the specific concept (a handful of images of the subject).
Prior preservation: images generated by the original network are also used during training, which ensures that the original information about "dog" isn't overwritten.
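A minimal sketch of the prior-preservation objective; the `unet(x_t, t, cond)` signature, the batch layout, and the weight `lam` are illustrative assumptions rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def dreambooth_loss(unet, instance_batch, prior_batch, lam=1.0):
    """Fine-tune on the new subject while preserving the class prior.
    instance_batch: (x_t, t, eps, cond) built from the subject photos, conditioned
        on an instance prompt with a rare identifier token, e.g. "a sks dog".
    prior_batch: the same tuple built from images *generated by the original model*
        for the plain class prompt ("a dog")."""
    x_i, t_i, eps_i, c_i = instance_batch
    x_p, t_p, eps_p, c_p = prior_batch
    instance_loss = F.mse_loss(unet(x_i, t_i, c_i), eps_i)   # learn the new concept
    prior_loss = F.mse_loss(unet(x_p, t_p, c_p), eps_p)      # keep "dog" intact
    return instance_loss + lam * prior_loss
```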
Low Rank Adaptation (LoRA)
Paper: https://arxiv.org/abs/2106.09685
Code: https://github.com/cloneofsimo/lora
More info: https://aituts.com/stable-diffusion-lora/, https://huggingface.co/docs/diffusers/training/lora
LoRA add-on: the frozen weight matrix is augmented with a low-rank update $BA$, where matrix $B$ is initialized with zeros, matrix $A$ has a random Gaussian initialization, and the rank satisfies $r \ll d$.
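A minimal sketch of a LoRA-wrapped linear layer; the rank, the alpha/r scaling convention, and the 0.01 init scale are illustrative choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer W plus a trainable low-rank update (alpha / r) * B A."""
    def __init__(self, base: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # random Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zeros: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=4)   # r << d = 768
```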
ControlNet
Composable Diffusion
"mystical trees" AND "A magical pond" AND "Dark"
"mystical trees" AND "A magical pond" AND NOT "Dark"
Q&A
Thank you!
For additional questions, reach me at camilolu@mit.edu
Camilo Fosco