1 of 41

Stable Diffusion Session

Camilo Fosco

April 11, 2023

6.8300/6.8301: Advances in Computer Vision

MIT CSAIL

2 of 41

3 of 41

What is Stable Diffusion?

  • Generative Model that can produce images from text
  • Trained on billions of images from a large dataset of text + image pairs, LAION-5B
  • Takes in noise and a prompt, outputs an image following the prompt

Stable Diffusion

kneeling cat knight, portrait, finely detailed armor, intricate design, silver, silk, cinematic lighting, 4k

4 of 41

History behind Stable Diffusion

  • Built through a collaboration between the CompVis lab at LMU, RunwayML and Stability AI
    • Stability supported the construction of LAION and many other projects in the space
  • Development led by Patrick Esser (Runway) and Robin Rombach (CompVis)
  • Trained on 4000 A100 GPUs for close to a month.
  • Estimated training cost: $600k

5 of 41

Massive impact

  • Strong performance
  • Finer control, editing capabilities, compositionality
  • Completely open source

DALL-E 1 - May 2021

DALL-E 2 - July 2022

Stable Diffusion - August 2022

6 of 41

Incredible ability to combine unrelated concepts

A Tortoise/Ladybird hybrid

7 of 41

Understanding Stable Diffusion

  1. Diffusion Models
  2. Latent Diffusion Models
  3. Conditioning Diffusion Models

8 of 41

What are Generative Models?

  • Models that capture the joint probability p(X) of the data distribution. This differs from discriminative models, which learn the conditional probability p(Y|X).
  • They can generate samples from that distribution, usually by mapping a known distribution (e.g. gaussian noise) to the data.

Discriminative Models: p(Y|X)

  • Learns: a decision boundary
  • Examples: linear regression models, decision trees, neural nets with a classification head

Generative Models: p(X)

  • Learns: the probability distribution of the data
  • Examples: Generative Adversarial Networks, Variational Autoencoders, Diffusion Models

9 of 41

The key: map known distribution to target distribution

  • Generative models can be seen as distribution transformers
  • If we can find a model G that maps a gaussian distribution to the distribution of natural images, we have a generative model for images!

10 of 41

Multiple ways to build generative models!

  • Variational Autoencoders: approximate the data distribution by reconstructing from a learned latent space distribution
  • Generative Adversarial Networks: approximate the data distribution by learning to fool an adversary
  • Diffusion Models: approximate the data distribution by learning to gradually denoise

11 of 41

The forward diffusion process

  • Start from an image
  • Progressively add gaussian noise to it over T timesteps
  • If repeated enough times, the image will approximate a sample from gaussian noise.

12 of 41

What if we invert this process?

Reverse Diffusion Process

  • Start from random noise
  • Progressively denoise the image
  • Continue until you get a sample of your image space

13 of 41

Diffusion in action over an image

14 of 41

The forward process

Let $q(x)$ be a data distribution (e.g. the distribution of real images) and $x_0 \sim q(x)$ be a sample from that distribution. The forward process takes the form of a Markov chain and produces corrupted samples (also called latents) $x_1, \dots, x_T$ from $x_0$ by repeatedly sampling from successive conditional gaussian distributions $q(x_t \mid x_{t-1})$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

15 of 41

The forward process

Each step of the noising process adds gaussian noise with variance $\beta_t$.

The distribution $q(x_t \mid x_{t-1})$ takes the following form:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)$$

With small $\beta_t$, sampling from this essentially gives us $x_{t-1}$ with a little gaussian noise.
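
To make this concrete, here is a minimal PyTorch sketch of how a single forward step could be sampled from the gaussian above; the tensor shapes and names are illustrative, not the deck's actual implementation:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)                               # epsilon ~ N(0, I)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Toy example: lightly corrupt a random "image" with a small beta_t.
x_prev = torch.rand(3, 64, 64)
x_t = forward_step(x_prev, beta_t=1e-4)
```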

16 of 41

The forward process

Why do we need small $\beta_t$?

  • Intuition: if $\beta_t$ is small, learning to undo the forward process won't be too difficult.
  • In the limit of infinitesimally small $\beta_t$, a step in the true reverse process can be modeled with a unimodal gaussian (same as the forward process)*

Why do we need different betas?

  • Intuition: during the forward process, larger betas ensure convergence to pure gaussian noise. During the reverse process, starting with large variance and gradually reducing it acts as a form of annealing that helps converge to an adequate sample.

Importantly, the variance is different for each time step t: during the forward process, we want $\beta_t$ to start small and increase over time.

*Feller, 1949

17 of 41

The forward process

Typically, $\beta_t$ is treated as a hyperparameter and follows a fixed schedule during training.

In practice, with vanilla diffusion, T is on the order of 1000. In theory, running the process for infinitely many steps gives us a sample of pure gaussian noise, that is:

$$q(x_T \mid x_0) \approx \mathcal{N}(x_T;\; 0,\; \mathbf{I})$$
18 of 41

Estimate the reverse process

What if we could sample from $q(x_{t-1} \mid x_t)$ for each t? This would allow us to repeatedly sample a slightly less noisy image from a noisy one, until we reach a sample of our initial data distribution.

Problem: it’s intractable. We’d need access to the entire distribution of images.

19 of 41

Estimate the reverse process

Solution: estimate it with a model!

For small enough $\beta_t$, $q(x_{t-1} \mid x_t)$ will be gaussian. We take advantage of this property to estimate it by modeling the mean and covariance of a gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

And our joint distribution (trajectory) can be written as:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

Here $\mu_\theta$ is a model with parameters $\theta$ that outputs the mean of the reverse step given $x_t$ and t.

20 of 41

How can we learn this model?

We could try to maximize the likelihood of the training data, $p_\theta(x_0)$. But:

  • This is not tractable, as we would need to compute $p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$, i.e. integrate over all possible trajectories

Instead, we optimize a lower bound called the Evidence Lower Bound (ELBO), as in other latent-variable models (e.g. VAEs):

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \;\le\; \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L$$

21 of 41

How can we learn this model?

Ho et al., 2020, show that we can derive a simpler and more efficient objective by taking into account the following:

  1. Any step of the forward process can be sampled directly, without running through the chain:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\right), \quad \text{with } \alpha_t = 1 - \beta_t, \;\; \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

  • If we condition on $x_0$, $q(x_{t-1} \mid x_t, x_0)$ is tractable (just a gaussian)
  • We can fix the variance to $\sigma_t^2 \mathbf{I}$ and just predict the mean of the reverse process

This means that we need a model to output the following mean:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

Which is just predicting the added noise, $\epsilon$!

And after some simplification, we get to the following simple objective (estimate the noise at time t!):

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
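
A minimal PyTorch sketch of this objective, assuming a placeholder `model(x_t, t)` that stands in for the UNet noise predictor and the `alphas_cumprod` tensor defined in the schedule sketch above:

```python
import torch
import torch.nn.functional as F

def sample_x_t(x0, t, alphas_cumprod):
    """Sample x_t directly from x_0: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)            # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

def simple_loss(model, x0, t, alphas_cumprod):
    """L_simple: mean squared error between the true noise and the predicted noise."""
    x_t, eps = sample_x_t(x0, t, alphas_cumprod)
    return F.mse_loss(model(x_t, t), eps)
```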

22 of 41

Intuition: Diffusion in 2D

  • The forward process adds noise and destroys structure
  • The reverse process attempts to recover the original structure
    • It needs to know about the properties and structure of the original distribution, e.g. what constitutes a “natural image”
  • We predict the noise $\epsilon$ in order to take a small step in the direction of the original distribution (drift term)

23 of 41

Summing up

Diffusion models are designed to generate samples from a data distribution by gradually denoising a sample from an N-dimensional gaussian distribution.

How? We train a denoising UNet, $\epsilon_\theta$, to predict the noise of image $x_t$ at time step $t$ using the following objective:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

24 of 41

25 of 41

Training: learn to predict noise!

Given a training image $x_0$:

  • Sample a timestep t and noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
  • Apply noise according to t: compute $x_t$ with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
  • Predict the noise with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  • Compute the loss by comparing $\hat{\epsilon}$ to $\epsilon$, then update the parameters
  • In practice, an entire dataset of these noised samples is often precomputed
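
Put together, one training iteration could look like the following sketch; `model` and `optimizer` are placeholders, and the shapes and names are illustrative rather than the deck's actual code:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alphas_cumprod):
    """One iteration: pick random timesteps, noise the images, predict the noise, regress to it."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # one random timestep per image
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps         # apply noise according to t
    loss = F.mse_loss(model(x_t, t), eps)                        # compare predicted noise to eps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```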

26 of 41

Inference: predict noise, remove noise, repeat!

  • Start from gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$
  • Predict the noise at step t with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  • Remove noise from $x_t$ following the schedule to get $x_{t-1}$
  • Apply gaussian noise modulated by $\sigma_t$ to approximate sampling from the gaussian representing the reverse process
    • Intuition: this perturbation is needed to avoid falling into local minima while moving closer to the target manifold
  • Replace $x_t$ with $x_{t-1}$, repeat T times
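
A hedged sketch of the full sampling loop (ancestral DDPM sampling with $\sigma_t^2 = \beta_t$; `model` is again a placeholder noise predictor, not the deck's implementation):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Start from pure noise, then repeatedly predict, remove, and re-inject a little noise."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                   # predict the noise at step t
        mean = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)      # sigma_t * z perturbation
        else:
            x = mean                                              # final step: no extra noise
    return x
```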

27 of 41

Latent Diffusion

  • Introduced by Rombach et al. in late 2021
  • Proposes a denoising process that operates on the latent space of pretrained autoencoders instead of the pixels themselves
  • Much more efficient, good balance between sample quality and model efficiency
  • Cross-attention layers in the denoising UNet for conditioning

28 of 41

Denoising the latent space

(Diagram) Regular denoising: apply noise to the image pixels and predict it with a UNet noise predictor. Latent space denoising: an encoder first compresses the image into latents; noise is applied to the latents, and a UNet latent noise predictor performs noise prediction on the latents.
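
The change relative to pixel-space diffusion is small: the image is first compressed by a pretrained autoencoder, and the exact same noise-prediction objective is applied to the latents. A hedged sketch, where `vae_encode` and `unet` are placeholders for the pretrained encoder and the latent-space UNet:

```python
import torch
import torch.nn.functional as F

def latent_noise_prediction_loss(vae_encode, unet, image, t, alphas_cumprod):
    """Same diffusion objective as before, but on compressed latents instead of pixels."""
    z0 = vae_encode(image)                                   # e.g. 3x512x512 pixels -> 4x64x64 latents
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t]
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps     # apply noise in latent space
    eps_hat = unet(z_t, torch.tensor([t]))                   # noise prediction on the latents
    return F.mse_loss(eps_hat, eps)                          # much cheaper than pixel-space denoising
```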

29 of 41

The U-Net architecture

30 of 41

Ok, but how do I control this model?

Control through conditioning

31 of 41

Text Conditioning

  • There are multiple strategies to achieve this conditioning; one of them is CLIP guidance
  • Main idea of CLIP guidance: feed a representation of the input text to the model and force the image output to be “close” to the text in some hybrid space

(Diagram) The prompt "A dog in a field" is passed through a text encoder; the UNet latent noise predictor takes the noisy compressed image, the noise amount, and the text embedding as input and outputs a noise prediction, with the result forced to be close to the embedding for "a dog in a field".

32 of 41

Building a good text encoder: CLIP

  • Strong text embedding trained to match images to their textual descriptions
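
For reference, Stable Diffusion v1 uses the text encoder of OpenAI's CLIP ViT-L/14. A hedged sketch of obtaining the per-token embeddings that the UNet attends to, using the Hugging Face transformers library (the checkpoint id is the commonly used one, shown for illustration):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ["a dog in a field"]
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    # One 768-d embedding per token; the UNet's cross-attention layers attend to these.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state   # shape: (1, 77, 768)
```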

33 of 41

So What is Stable Diffusion Exactly?

  • Stable Diffusion is an evolution of Latent Diffusion with some guidance tricks.
  • Trained on a subset of LAION-5B
    • First trained on LAION-2B-en and LAION-High-Resolution, then on LAION-Aesthetics
  • Receives noise and text conditioning as input, outputs a 512x512 image

  • Large dataset, CLIP encoding, guidance tricks

34 of 41

Stable Diffusion during Inference

  1. Generate initial noise latents (sample a 4x64x64 gaussian noise tensor)
  2. Encode the user prompt with CLIP
  3. Repeat N times:
    1. Feed the latents + text embedding to the UNet
    2. Use the predicted noise to update the latents according to the schedule
  4. Feed the final latents to the decoder (similar to a super-resolution network)
  5. Obtain a 512x512 image
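
In practice, all of these steps are wrapped by libraries such as Hugging Face diffusers; a minimal sketch, where the checkpoint id and the sampler settings are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "kneeling cat knight, portrait, finely detailed armor, cinematic lighting, 4k"
# num_inference_steps is the N above; guidance_scale controls how strongly the prompt is followed.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]   # 512x512 PIL image
image.save("cat_knight.png")
```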

35 of 41

How can you further control it?

Multiple techniques to control its outputs:

  • Textual Inversion to make it reproduce a specific trained concept
  • Dreambooth to train a specific concept into the network without perturbing it
  • LoRA to introduce concepts without fine tuning the full network
  • ControlNet to force the network to follow additional constraints, such as a specific pose
  • Composable Diffusion to make it follow compositions

Teach new concepts

Control output properties

36 of 41

Textual Inversion

  • Create an embedding from a few samples of a concept
  • Use that concept in prompts to generate new images with the concept
  • The concept can be an object, a style, a scene, etc.

37 of 41

Dreambooth

  • Finetune your diffusion model with a few images from a concept
  • Maintain information about the class with a Prior Preservation Loss

(Diagram) The network is fine-tuned on the specific concept, while images generated by the original network are used via the prior preservation loss, which ensures that the original information about "dog" isn't overwritten; the decoder network (D) also receives dedicated training.

38 of 41

Low Rank Adaptation (LoRA)

  • Idea: add a few trainable parameters to the model and train on your new concept while the main model is frozen.
    • Trainable parameters come in the form of smaller matrices A and B that process the block’s input x and add their output to the original product Wx
  • Additional parameters learn to adapt layer outputs towards the desired results

(Diagram) LoRA add-on: matrix A (random gaussian initialization) and matrix B (initialized with zeros), with rank r << d.
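
A minimal sketch of a LoRA-adapted linear layer, assuming we wrap an existing frozen nn.Linear; the initialization and scaling follow the common convention (A gaussian, B zeros, output scaled by alpha / r) rather than any specific implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes W x + (alpha / r) * B(A(x)); only the small matrices A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the original weights W
        self.A = nn.Linear(base.in_features, r, bias=False)   # r << d
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=1.0 / r)           # A: random gaussian initialization
        nn.init.zeros_(self.B.weight)                         # B: zeros, so the add-on starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```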

39 of 41

ControlNet

  • Adds trainable copies of existing blocks that learn to follow an additional control image, like an edge map or a human pose
  • Needs a dataset of [image, control] pairs

40 of 41

Composable Diffusion

  • Adds the ability to use Conjunction (AND) and Negation (NOT) operators to prompts
  • Combines the noise predictions from copies of the diffusion model, each prompted with one of the concepts, to merge them (see the sketch below)
  • From an MIT lab!

"mystical trees" AND "A magical pond" AND "Dark"

"mystical trees" AND "A magical pond" AND NOT "Dark"

41 of 41

Q&A

Thank you!

For additional questions, reach me at camilolu@mit.edu

Camilo Fosco