1 of 41

Stable Diffusion Session

Camilo Fosco

April 11, 2023

6.8300/6.8301: Advances in Computer Vision

MIT CSAIL

2 of 41

3 of 41

What is Stable Diffusion?

  • Generative Model that can produce images from text
  • Trained on billions of images from a large dataset of text + image pairs, LAION-5B
  • Takes in noise and a prompt, outputs an image following the prompt

Stable Diffusion

kneeling cat knight, portrait, finely detailed armor, intricate design, silver, silk, cinematic lighting, 4k

4 of 41

History behind Stable Diffusion

  • Built through a collaboration between the CompVis lab at LMU, RunwayML and Stability AI
    • Stability supported the construction of LAION and many other projects in the space
  • Development led by Patrick Esser (Runway) and Robin Rombach (CompVis)
  • Trained on 4000 A100 GPUs for close to a month.
  • Estimated training cost: $600k

5 of 41

Massive impact

  • Strong performance
  • Finer control, editing capabilities, compositionality
  • Completely open source

DALL-E 1 - May 2021

DALL-E 2 - July 2022

Stable Diffusion - August 2022

6 of 41

Incredible ability to combine unrelated concepts

A Tortoise/Ladybird hybrid

7 of 41

Understanding Stable Diffusion

  1. Diffusion Models
  2. Latent Diffusion Models
  3. Conditioning Diffusion Models

8 of 41

What are Generative Models?

  • Models that capture the joint probability p(X) of the data distribution. This differs from discriminative models, which learn the conditional probability p(Y|X).
  • They can generate samples from that distribution, usually by mapping a known distribution (e.g. gaussian noise) to the data.

Discriminative Models: p(Y|X)

  • Learns: a decision boundary
  • Examples: linear regression models, decision trees, neural nets with a classification head

Generative Models: p(X)

  • Learns: the probability distribution of the data
  • Examples: Generative Adversarial Networks, Variational Autoencoders, Diffusion Models

9 of 41

The key: map known distribution to target distribution

  • Generative models can be seen as distribution transformers
  • If we can find a model G that maps a gaussian distribution to the distribution of natural images, we have a generative model for images!

10 of 41

Multiple ways to build generative models!

  • Variational Autoencoders: approximate the data distribution by reconstructing from a learned latent space distribution
  • Generative Adversarial Networks: approximate the data distribution by learning to fool an adversary
  • Diffusion Models: approximate the data distribution by learning to gradually denoise

11 of 41

The forward diffusion process

  • Start from an image
  • Progressively add gaussian noise to it over T timesteps
  • If repeated enough times, the image will approximate a sample from gaussian noise.

12 of 41

What if we invert this process?

Reverse Diffusion Process

  • Start from random noise
  • Progressively denoise the image
  • Continue until you get a sample of your image space

13 of 41

Diffusion in action over an image

14 of 41

The forward process

Let $q(x)$ be a data distribution (e.g. the distribution of real images) and $x_0 \sim q(x)$ be a sample from that distribution. The forward process takes the form of a Markov chain and produces corrupted samples (also called latents) $x_1, \dots, x_T$ from $x_0$ by repeatedly sampling from successive conditional gaussian distributions $q(x_t \mid x_{t-1})$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

15 of 41

The forward process

Each step of the noising process adds gaussian noise with variance $\beta_t$.

The distribution $q(x_t \mid x_{t-1})$ takes the following form:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)$$

With small $\beta_t$, sampling from this essentially gives us $x_{t-1}$ with a little gaussian noise.
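
To make this concrete, here is a minimal PyTorch sketch of how a single forward step could be sampled from the gaussian above; the tensor shapes and names are illustrative, not the deck's actual implementation:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)                               # epsilon ~ N(0, I)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Toy example: lightly corrupt a random "image" with a small beta_t.
x_prev = torch.rand(3, 64, 64)
x_t = forward_step(x_prev, beta_t=1e-4)
```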

16 of 41

The forward process

Why do we need small $\beta_t$?

  • Intuition: if $\beta_t$ is small, learning to undo the forward process won't be too difficult.
  • In the limit of infinitesimally small $\beta_t$, a step in the true reverse process can be modeled with a unimodal gaussian (same as the forward process)*

Why do we need different betas?

  • Intuition: during the forward process, larger betas ensure convergence to pure gaussian noise. During the reverse process, starting with large variance and gradually reducing it acts as a form of annealing that helps converge to an adequate sample.

Importantly, the variance is different for each time step t: during the forward process, we want $\beta_t$ to start small and increase over time.

*Feller, 1949

17 of 41

The forward process

Typically, $\beta_t$ is treated as a hyperparameter and follows a fixed schedule during training.

In practice, with vanilla diffusion, T is on the order of 1000. In theory, running the process for infinitely many steps gives us a sample of pure gaussian noise, that is:

$$q(x_T \mid x_0) \approx \mathcal{N}(x_T;\; 0,\; \mathbf{I})$$
18 of 41

Estimate the reverse process

What if we could sample from $q(x_{t-1} \mid x_t)$ for each t? This would allow us to repeatedly sample a slightly less noisy image from a noisy one, until we reach a sample of our initial data distribution.

Problem: it’s intractable. We’d need access to the entire distribution of images.

19 of 41

Estimate the reverse process

Solution: estimate it with a model!

For small enough $\beta_t$, $q(x_{t-1} \mid x_t)$ will be gaussian. We take advantage of this property to estimate it by modeling the mean and covariance of a gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

And our joint distribution (trajectory) can be written as:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

Here $\mu_\theta$ is a model with parameters $\theta$ that outputs the mean of the reverse step given $x_t$ and t.

20 of 41

How can we learn this model?

We could try to maximize the likelihood of the training data, $p_\theta(x_0)$. But:

  • This is not tractable, as we would need to compute $p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$, i.e. integrate over all possible trajectories

Instead, we optimize a lower bound called the Evidence Lower Bound (ELBO), as in other latent-variable models (e.g. VAEs):

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \;\le\; \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L$$

21 of 41

How can we learn this model?

Ho et al., 2020, show that we can derive a simpler and more efficient objective by taking into account the following:

  1. Any step of the forward process can be sampled directly, without running through the chain:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\right), \quad \text{with } \alpha_t = 1 - \beta_t, \;\; \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

  • If we condition on $x_0$, $q(x_{t-1} \mid x_t, x_0)$ is tractable (just a gaussian)
  • We can fix the variance to $\sigma_t^2 \mathbf{I}$ and just predict the mean of the reverse process

This means that we need a model to output the following mean:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

Which is just predicting the added noise, $\epsilon$!

And after some simplification, we get to the following simple objective (estimate the noise at time t!):

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.
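
A minimal PyTorch sketch of this objective, assuming a placeholder `model(x_t, t)` that stands in for the UNet noise predictor and the `alphas_cumprod` tensor defined in the schedule sketch above:

```python
import torch
import torch.nn.functional as F

def sample_x_t(x0, t, alphas_cumprod):
    """Sample x_t directly from x_0: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)            # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

def simple_loss(model, x0, t, alphas_cumprod):
    """L_simple: mean squared error between the true noise and the predicted noise."""
    x_t, eps = sample_x_t(x0, t, alphas_cumprod)
    return F.mse_loss(model(x_t, t), eps)
```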

22 of 41

Intuition: Diffusion in 2D

  • The forward process adds noise and destroys structure
  • The reverse process attempts to recover the original structure
    • It needs to know about the properties and structure of the original distribution, e.g. what constitutes a “natural image”
  • We predict the noise $\epsilon$ in order to take a small step in the direction of the original distribution (drift term)

23 of 41

Summing up

Diffusion models are designed to generate samples from a data distribution by gradually denoising a sample from an N-dimensional gaussian distribution.

How? We train a denoising UNet, $\epsilon_\theta$, to predict the noise of image $x_t$ at time step $t$ using the following objective:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

24 of 41

25 of 41

Training: learn to predict noise!

Given a training image $x_0$:

  • Sample a timestep t and noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
  • Apply noise according to t: compute $x_t$ with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
  • Predict the noise with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  • Compute the loss by comparing $\hat{\epsilon}$ to $\epsilon$, then update the parameters
  • In practice, an entire dataset of these noised samples is often precomputed
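
Put together, one training iteration could look like the following sketch; `model` and `optimizer` are placeholders, and the shapes and names are illustrative rather than the deck's actual code:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alphas_cumprod):
    """One iteration: pick random timesteps, noise the images, predict the noise, regress to it."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)    # one random timestep per image
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps         # apply noise according to t
    loss = F.mse_loss(model(x_t, t), eps)                        # compare predicted noise to eps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```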

26 of 41

Inference: predict noise, remove noise, repeat!

  • Start from gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$
  • Predict the noise at step t with the UNet noise predictor: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
  • Remove noise from $x_t$ following the schedule to get $x_{t-1}$
  • Apply gaussian noise modulated by $\sigma_t$ to approximate sampling from the gaussian representing the reverse process
    • Intuition: this perturbation is needed to avoid falling into local minima while moving closer to the target manifold
  • Replace $x_t$ with $x_{t-1}$, repeat T times
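
A hedged sketch of the full sampling loop (ancestral DDPM sampling with $\sigma_t^2 = \beta_t$; `model` is again a placeholder noise predictor, not the deck's implementation):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Start from pure noise, then repeatedly predict, remove, and re-inject a little noise."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                   # predict the noise at step t
        mean = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)      # sigma_t * z perturbation
        else:
            x = mean                                              # final step: no extra noise
    return x
```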

27 of 41

Latent Diffusion

  • Introduced by Rombach et al. in late 2021
  • Proposes a denoising process that operates on the latent space of pretrained autoencoders instead of the pixels themselves
  • Much more efficient, good balance between sample quality and model efficiency
  • Cross-attention layers in the denoising UNet for conditioning

28 of 41

Denoising the latent space

(Diagram) Regular denoising: apply noise to the image pixels and predict it with a UNet noise predictor. Latent space denoising: an encoder first compresses the image into latents; noise is applied to the latents, and a UNet latent noise predictor performs noise prediction on the latents.
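
The change relative to pixel-space diffusion is small: the image is first compressed by a pretrained autoencoder, and the exact same noise-prediction objective is applied to the latents. A hedged sketch, where `vae_encode` and `unet` are placeholders for the pretrained encoder and the latent-space UNet:

```python
import torch
import torch.nn.functional as F

def latent_noise_prediction_loss(vae_encode, unet, image, t, alphas_cumprod):
    """Same diffusion objective as before, but on compressed latents instead of pixels."""
    z0 = vae_encode(image)                                   # e.g. 3x512x512 pixels -> 4x64x64 latents
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t]
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps     # apply noise in latent space
    eps_hat = unet(z_t, torch.tensor([t]))                   # noise prediction on the latents
    return F.mse_loss(eps_hat, eps)                          # much cheaper than pixel-space denoising
```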

29 of 41

The U-Net architecture

30 of 41

Ok, but how do I control this model?

Control through conditioning

31 of 41

Text Conditioning

  • There are multiple strategies to achieve this conditioning; one of them is CLIP guidance
  • Main idea of CLIP guidance: feed a representation of the input text to the model and force the image output to be “close” to the text in some hybrid space

(Diagram) The prompt "A dog in a field" is passed through a text encoder; the UNet latent noise predictor takes the noisy compressed image, the noise amount, and the text embedding as input and outputs a noise prediction, with the result forced to be close to the embedding for "a dog in a field".

32 of 41

Building a good text encoder: CLIP

  • Strong text embedding trained to match images to their textual descriptions
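
For reference, Stable Diffusion v1 uses the text encoder of OpenAI's CLIP ViT-L/14. A hedged sketch of obtaining the per-token embeddings that the UNet attends to, using the Hugging Face transformers library (the checkpoint id is the commonly used one, shown for illustration):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ["a dog in a field"]
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    # One 768-d embedding per token; the UNet's cross-attention layers attend to these.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state   # shape: (1, 77, 768)
```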

33 of 41

So What is Stable Diffusion Exactly?

  • Stable Diffusion is an evolution of Latent Diffusion with some guidance tricks.
  • Trained on a subset of LAION-5B
    • First trained on LAION-2B-en and LAION-High-Resolution, then on LAION-Aesthetics
  • Receives noise and text conditioning as input, outputs a 512x512 image

  • Large dataset, CLIP encoding, guidance tricks

34 of 41

Stable Diffusion during Inference

  1. Generate initial noise latents (sample a 4x64x64 gaussian noise tensor)
  2. Encode the user prompt with CLIP
  3. Repeat N times:
    1. Feed the latents + text embedding to the UNet
    2. Use the predicted noise to update the latents according to the schedule
  4. Feed the final latents to the decoder (similar to a super-resolution network)
  5. Obtain a 512x512 image
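
In practice, all of these steps are wrapped by libraries such as Hugging Face diffusers; a minimal sketch, where the checkpoint id and the sampler settings are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "kneeling cat knight, portrait, finely detailed armor, cinematic lighting, 4k"
# num_inference_steps is the N above; guidance_scale controls how strongly the prompt is followed.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]   # 512x512 PIL image
image.save("cat_knight.png")
```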

35 of 41

How can you further control it?

Multiple techniques to control its outputs:

  • Textual Inversion to make it reproduce a specific trained concept
  • Dreambooth to train a specific concept into the network without perturbing it
  • LoRA to introduce concepts without fine tuning the full network
  • ControlNet to force the network to follow additional constraints, such as a specific pose
  • Composable Diffusion to make it follow compositions

Teach new concepts

Control output properties

36 of 41

Textual Inversion

  • Create an embedding from a few samples of a concept
  • Use that concept in prompts to generate new images with the concept
  • The concept can be an object, a style, a scene, etc.

37 of 41

Dreambooth

  • Finetune your diffusion model with a few images from a concept
  • Maintain information about the class with a Prior Preservation Loss

(Diagram) The network is fine-tuned on the specific concept, while images generated by the original network are used via the prior preservation loss, which ensures that the original information about "dog" isn't overwritten; the decoder network (D) also receives dedicated training.

38 of 41

Low Rank Adaptation (LoRA)

  • Idea: add a few trainable parameters to the model and train on your new concept while the main model is frozen.
    • Trainable parameters come in the form of smaller matrices A and B that process the block’s input x and add their output to the original product Wx
  • Additional parameters learn to adapt layer outputs towards the desired results

(Diagram) LoRA add-on: matrix A (random gaussian initialization) and matrix B (initialized with zeros), with rank r << d.
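
A minimal sketch of a LoRA-adapted linear layer, assuming we wrap an existing frozen nn.Linear; the initialization and scaling follow the common convention (A gaussian, B zeros, output scaled by alpha / r) rather than any specific implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes W x + (alpha / r) * B(A(x)); only the small matrices A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the original weights W
        self.A = nn.Linear(base.in_features, r, bias=False)   # r << d
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=1.0 / r)           # A: random gaussian initialization
        nn.init.zeros_(self.B.weight)                         # B: zeros, so the add-on starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```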

39 of 41

ControlNet

  • Adds trainable copies of existing blocks that learn to follow an additional control image, like an edge map or a human pose
  • Needs a dataset of [image, control] pairs

40 of 41

Composable Diffusion

  • Adds the ability to use Conjunction (AND) and Negation (NOT) operators to prompts
  • Combines the noise predictions from copies of the diffusion model, each prompted with one of the concepts, to merge them (see the sketch below)
  • From an MIT lab!

"mystical trees" AND "A magical pond" AND "Dark"

"mystical trees" AND "A magical pond" AND NOT "Dark"

41 of 41

Q&A

Thank you!

For additional questions, reach me at camilolu@mit.edu

Camilo Fosco