1 of 31

[WIP 🚧…] Image Generation and Editing

- Literature Survey

6/6/2023

Ran Ding


2 of 31

Scope and context

  • Literature survey conducted in late May 2023 on image generation and editing based on deep generative models.
    • The survey is a bit heavy on diffusion-based models.
    • Some non-diffusion systems are included (e.g. Parti, MaskGIT, MUSE).
    • A lot of GAN-based work is omitted.
  • This deck is long…
    • I inserted a number of outline/intro slides, with background in green. They contain all the links to papers/resources, so you can also just browse them without reading all the detail slides.
    • We have a team with diverse backgrounds, so please feel free to leave any questions or suggestions about things to add/remove/clarify 🙏
    • It’s a large body of work to cover, so please read with a grain of salt; corrections/feedback are much appreciated.
  • Thanks to this list of awesome people for reviewing this deck and providing valuable feedback.
    • tba, tba


3 of 31

Outline

  1. Diffusion models
    1. Core methods
    2. [Optional] Score-matching, unified with diffusion models
    3. Improvements
  2. Image generation systems
    • Diffusion
    • Autoregressive
    • Masked (non-diffusion, non-autoregressive)
  3. Image editing
  4. Tracing, fingerprinting
  5. Other references
    • Image backbones
    • VAE, VQ-VAE, VQ-GAN
    • Transformers, Vision Transformers, Text Encoders
    • Multimodal: CLIP, ViLT, Flamingo and BLIP series
  6. Codebases
  7. Questions, takeaways, musings


4 of 31

Diffusion models, core methods

Context

  • Two foundational papers that many other papers build upon.

Key papers

  • [1a] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
    • Proposed the original diffusion model for generative modeling.
  • [1b] [DDPM] Denoising Diffusion Probabilistic Models (06/2020)
    • Landmark paper; introduced a few key improvements (problem setup, model architecture, loss function, etc.) that are widely adopted by later papers.

Other resources


5 of 31

Generative model

  • What are generative models, when are they useful?
    • Learns the data distribution and data generation process
      • Useful for anomaly detection (rare/no labels), compression, clustering (and other unsupervised learning problems).
    • And of course, generating new data samples!
      • Generate text, images, videos etc

  • Why is generative modeling hard?
    • “Can you look at a painting and recognize it as being the Mona Lisa? You probably can. That's discriminative modeling. Can you paint the Mona Lisa yourself? You probably can't. That's generative modeling.” - Ian Goodfellow

  • Deep generative models
    • GAN
      • Unstable adversarial training procedure. Lack of sample diversity.
    • VAE
      • Stable training. Lack of sample quality.
    • Diffusion
      • Stable training. Seems to do well in both quality and diversity.
      • Note: diffusion can be viewed as many (100–1000) VAEs chained together, making the problem more incremental (and arguably more learnable).


6 of 31

Diffusion model, intuitions

  • Observation 1:
    • For generative models, we often need an “easy” distribution to start from / sample from, in order to generate and model data distributions that are highly complex (e.g. real-world images).
    • Diffusion is a process that relates a complex distribution to a simple one.
  • Observation 2:
    • At small time scales, diffusion is reversible: both the forward and reverse steps are Gaussian.
    • The ability to “learn” the (probabilistic) reversal step comes from the extra spatial information
      • i.e. surrounding data points tell you what a plausible arrangement could have been at the previous time step.


7 of 31

Diffusion model, formulation

  1. We have a fixed forward diffusion process that converts the data distribution to a known distribution (e.g. Gaussian).
    • We have design choices about the forward process (e.g. what eventual distribution to converge to, and how to evolve towards it (the noise schedule), etc.).
  2. And a learned time-reversal process that converts random samples from this known distribution (e.g. Gaussian) into samples that follow the data distribution (e.g. real images).
    • We have design choices about how to parameterize the learned process/model (model architecture, optimization objectives, etc.).


Forward: q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) · x_{t-1}, β_t · I)
  (the sqrt(1 − β_t) factor decays the mean towards the origin; β_t · I adds a small amount of noise)

Reverse: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
  (with learned mean and variance functions μ_θ and Σ_θ)
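As a concrete illustration of the fixed forward process, here is a minimal PyTorch sketch that samples x_t directly from x_0 via the closed form q(x_t | x_0) = N(sqrt(ᾱ_t)·x_0, (1 − ᾱ_t)·I). All names (`betas`, `alpha_bar`, `q_sample`) are illustrative, not from any particular codebase.

```python
# Minimal sketch (PyTorch) of the fixed forward process, sampling x_t directly from x_0
# via the closed form q(x_t | x_0) = N( sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I ).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule (one common design choice)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in a single step instead of iterating t noising steps."""
    a = alpha_bar[t].view(-1, 1, 1, 1)       # broadcast over (B, C, H, W)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# Usage: x0 scaled to [-1, 1], shape (B, C, H, W); t is a batch of integer timesteps.
x0 = torch.rand(4, 3, 64, 64) * 2.0 - 1.0
t = torch.randint(0, T, (4,))
xt = q_sample(x0, t, torch.randn_like(x0))
```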
8 of 31

Learning objective

  • The forward process q(x_t | x_{t-1}) is known; the reverse q(x_{t-1} | x_t) is unknown. But… if we know the original image x_0, we can easily work out the reverse trajectory q(x_{t-1} | x_t, x_0) (i.e. figure out how much noise was added to x_0).
    • Lots of algebra involved, but it is a simple Gaussian in the end.
  • We do know the original image x_0 during the forward process (we started from it), so we can use q(x_{t-1} | x_t, x_0) as the ground truth to train a model p_θ(x_{t-1} | x_t) that guesses/generates the reverse trajectory without knowing x_0 beforehand.
  • The learning objective is: for every time step t, we’d like p to be very close to q.
    • We minimize the difference by minimizing the KL divergence (usually denoted D_KL(q||p); it measures how different two distributions q and p are).
  • So the overall learning objective is a sum of D_KL terms covering all time steps t.
    • D_KL between two Gaussian distributions has a simple closed form; with fixed variances it reduces to the squared difference between their means scaled by the inverse covariance (see the sketch below).
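A minimal sketch of the closed-form KL between two diagonal Gaussians, which is what each per-timestep D_KL(q || p_θ) term reduces to. The function name and log-variance parameterization are illustrative.

```python
# Minimal sketch: KL divergence between two diagonal Gaussians.
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Elementwise D_KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) )."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )

# With both variances fixed and equal, this collapses to a scaled squared difference of the means,
# which is why the training loss can be written in terms of mean (or noise) prediction errors.
```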


9 of 31

Denoising Diffusion Probabilistic Models, aka DDPM

  • Landmark paper that made diffusion really work, via a few key improvements that are widely adopted by later papers.
  • Key improvement details
    • 1. Predicting the noise instead of predicting the image
    • 2. A simpler loss objective that prioritizes image generation quality over log likelihood (see the sketch below)
    • 3. A better image model (U-Net with attention, more details on the next slide)
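A minimal sketch of the simplified training objective, L_simple = E_{t, x_0, ε}[ ||ε − ε_θ(x_t, t)||² ]: sample a random timestep, noise the image with the forward process, and regress the model output onto the added noise. `model` stands in for any (x_t, t) → predicted-noise network; all names are illustrative.

```python
# Minimal sketch of the simplified DDPM objective: predict the noise, not the image.
import torch
import torch.nn.functional as F

def ddpm_simple_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)                                                # eps ~ N(0, I)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise     # closed-form forward noising, as in the earlier sketch
    return F.mse_loss(model(xt, t), noise)            # L_simple = E || eps - eps_theta(x_t, t) ||^2
```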


10 of 31

DDPM, model details

  • Uses a U-Net to model p_θ
    • Given a noised image, predict the noise; i.e. both input and output have the image dimensions.
  • Enhancements
    • Self-attention is added (at the 16x16 resolution, between the convolution blocks).
    • Diffusion time t is specified by adding a Transformer sinusoidal position embedding into each residual block (see the sketch below).
  • Code
    • The Annotated Diffusion Model


Ref: U-Net- Convolutional Networks for Biomedical Image Segmentation
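A minimal sketch of the sinusoidal timestep embedding mentioned above: the diffusion time t is mapped to a vector using the standard Transformer position-embedding recipe, and that vector is then added (typically after a small MLP) into each residual block. The function name is illustrative; real implementations, e.g. in The Annotated Diffusion Model, differ slightly in details.

```python
# Minimal sketch of a sinusoidal timestep embedding (dim assumed even).
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """t: (B,) integer timesteps -> (B, dim) embedding of sines and cosines at geometric frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```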

11 of 31

Side note: reconstruction error (distortion) vs time step

  • Going from t=T to t=0
    • We can stop at some t, directly take a guess at the original image x_0, and see how much we got wrong (“distortion”).
    • We also count, cumulatively from T down to t, how much of the loss terms L_t we have accumulated (“rate”, in information-theory parlance).
  • Takeaways
    • The last few time steps (say t = 0–10) account for a large part of the overall loss, but they don’t contribute much to reducing the distortion of the reconstruction (image quality).
    • This also calls into question whether we are wasting a lot of model capacity on modeling details of the data distribution that are unimportant for generating realistic-looking images (which is part of what motivated the Latent Diffusion Model (LDM)).


12 of 31

[Optional, but really interesting] Score-based models, unified with diffusion

Context

  • Score-based models turned out to be equivalent to diffusion models (a somewhat accidental, concurrent development), and they provide an alternative view of diffusion.
  • Score-based formulation helps derive very important methods for diffusion based image generation, e.g.
    • Classifier- and Classifier-Free Guidance
    • Accelerated sampling methods based on ODE solvers
  • This section is not strictly needed for understanding most papers, read for curiosity but feel free to skip.

Key papers

The author of the following papers wrote a phenomenal overview: https://yang-song.net/blog/2021/score


13 of 31

Langevin dynamics

Langevin dynamics is an iterative sampling procedure that draws samples from a distribution p(x) without knowing p(x) itself.

To sample x from p(x), you actually don’t need to know p(x). You only need access to ∇_x log p(x), the gradient of the log density w.r.t. x, which is called the score function.

To generate a sample, repeat the following update (x_0 can come from any reasonable prior, e.g. uniform; ε is a small constant step size; z_i is standard Gaussian noise):

x_{i+1} = x_i + ε · ∇_x log p(x_i) + sqrt(2ε) · z_i,   i = 0, 1, …, K−1

When ε → 0 and K → ∞, x_K converges to a sample from p(x).
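A minimal sketch of this sampling loop; `score_fn` stands in for the (true or learned) score ∇_x log p(x), and all names are illustrative.

```python
# Minimal sketch of the Langevin update: x_{i+1} = x_i + eps * score(x_i) + sqrt(2 * eps) * z_i.
import torch

def langevin_sample(score_fn, x0: torch.Tensor, eps: float = 1e-3, K: int = 1000) -> torch.Tensor:
    x = x0                                     # x0 from any reasonable prior, e.g. uniform noise
    for _ in range(K):
        z = torch.randn_like(x)
        x = x + eps * score_fn(x) + (2.0 * eps) ** 0.5 * z
    return x                                   # as eps -> 0 and K -> inf, x approaches a sample from p(x)
```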


14 of 31

What’s the big deal about Langevin dynamics

Working with the score instead of p(x) itself lets us use arbitrary functions (e.g. neural networks) to describe (parameterize) a probability density function p(x), which otherwise carries a tricky constraint (it always needs to integrate to 1).

We parameterize a network (a real-valued function f_θ) and model the probability density function (PDF) as

p_θ(x) = e^{f_θ(x)} / Z_θ

The normalizing constant Z_θ depends on the parameters θ and, crucially, must satisfy ∫ p_θ(x) dx = 1. For a general form of f, Z_θ is usually intractable, making it infeasible to directly model and optimize p. The usual ways around this are to restrict the form of f (or p), or to use approximations.

Note that the score function, the gradient of log p w.r.t. x, does not depend on Z: ∇_x log p_θ(x) = ∇_x f_θ(x) − ∇_x log Z_θ = ∇_x f_θ(x). As a result we have a lot of modeling freedom with f (and p).

We can parameterize the score as a neural network s_θ and learn it by minimizing the Fisher divergence E_{p(x)}[ ||∇_x log p(x) − s_θ(x)||² ]. s_θ models the score (i.e. the gradient of log p w.r.t. x) and is called a score-based model; learning s_θ is called “score matching”.
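The Fisher divergence above contains the unknown true score, so in practice one often trains with denoising score matching instead: perturb x with Gaussian noise of scale σ and regress s_θ onto the score of the perturbation kernel, ∇ log q_σ(x̃ | x) = −(x̃ − x)/σ². A minimal sketch under that substitution; `score_model` is an illustrative name.

```python
# Minimal sketch of a denoising score matching loss at a single noise scale sigma.
import torch

def denoising_score_matching_loss(score_model, x: torch.Tensor, sigma: float) -> torch.Tensor:
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / (sigma ** 2)                      # score of q_sigma(x_tilde | x) at x_tilde
    pred = score_model(x_tilde, sigma)                  # s_theta(x_tilde, sigma)
    return ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```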


15 of 31

Score-based generative modeling


16 of 31

Problems with applying Langevin dynamics naively

For high-dimensional data, most of the space contains very few data points (“the data lives on a low-dimensional manifold”). So over most of the space, the true score ∇_x log p(x) is poorly determined and the estimated score s_θ is inaccurate.

Similarly, during the iterative sampling, if the initial point falls near one mode it is almost impossible for it to move to another mode, causing inaccurate sampling.


17 of 31

Annealed Langevin dynamics

Step 1: add a small amount of noise to the data distribution; this makes score learning accurate (w.r.t. the noised distribution).

We apply multiple scales of Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row).

This step corresponds to the forward diffusion process


18 of 31

Annealed Langevin dynamics

Step 2: perform Langevin sampling at progressively less noisy distributions (see the sketch below).

This step corresponds to the reverse diffusion process.
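A minimal sketch of annealed Langevin sampling, assuming noise scales sorted from largest to smallest, warm-starting each level from the previous one, and the commonly used step-size rescaling ε_i ∝ σ_i². The schedule and all names are illustrative.

```python
# Minimal sketch of annealed Langevin sampling over decreasing noise levels.
import torch

def annealed_langevin_sample(score_model, shape, sigmas, steps_per_level=100, base_eps=2e-5):
    x = torch.rand(shape)                                   # start from a simple prior
    for sigma in sigmas:                                    # sigmas sorted from large to small
        eps = base_eps * (sigma / sigmas[-1]) ** 2          # smaller steps at smaller noise levels
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + eps * score_model(x, sigma) + (2.0 * eps) ** 0.5 * z
    return x
```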


19 of 31

Score-based models are equivalent to Diffusion models

The score function is proportional to the noise-prediction function in diffusion: s_θ(x_t, t) = −ε_θ(x_t, t) / sqrt(1 − ᾱ_t)

The overall optimization objectives are also equivalent


(Figure: the score-matching training objective and the diffusion training objective, shown side by side.)
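A minimal sketch of this correspondence: a trained DDPM noise predictor can be read as a score model for the noised marginals via s_θ(x_t, t) = −ε_θ(x_t, t) / sqrt(1 − ᾱ_t). The names `eps_model` and `alpha_bar` are illustrative (alpha_bar as in the forward-process sketch).

```python
# Minimal sketch: convert a noise prediction into a score estimate.
import torch

def score_from_eps(eps_model, xt: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    sigma_t = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)   # std of q(x_t | x_0)
    return -eps_model(xt, t) / sigma_t
```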

20 of 31

Diffusion model improvements

In this section we summarize a few papers, each of which introduced new and important techniques. Here is a quick overview (a small classifier-free guidance sketch follows the list):

  • Model improvements
    • 2021-02 [Improved DDPM] Improved Denoising Diffusion Probabilistic Models
    • Guidance
      • 2021-06 [Classifier Guided Diffusion] Diffusion Models Beat GANs on Image Synthesis
      • 2021-12 Classifier-Free Diffusion Guidance
    • Cascading
      • 2021-12 Cascaded Diffusion Models for High Fidelity Image Generation
    • Latent Diffusion
      • 2022-04 [LDM] High-Resolution Image Synthesis with Latent Diffusion Models
  • Faster sample generation
    • Better solvers
      • E.g. 2022-10 [DDIM] Denoising Diffusion Implicit Models
      • Many others
    • Common things: float16, freeze models, compilation
    • Time scheduler
    • Token merging
    • Require training
      • Guidance distillation
      • Progressive distillation
      • Architecture Distillation


21 of 31

Improved DDPM


22 of 31

Guidance


23 of 31

Cascading


24 of 31

Latent Diffusion


25 of 31

Faster generation


26 of 31

Image generation systems

Diffusion based


2021-04 [SR3] Image Super-Resolution via Iterative Refinement

2021-10 Palette Image-to-Image Diffusion Models

2022-03 [GLIDE] Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

2022-04 [DALLE2 unCLIP] Hierarchical Text-Conditional Image Generation with CLIP Latents

2022-05 [Imagen] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Autoregressive

2021-02 [DALL-E] Zero-Shot Text-to-Image Generation

2022-06 [Parti] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Masked (non-diffusion, non-autoregressive)

2022-02 MaskGIT- Masked Generative Image Transformer

2023-01 Muse- Text-To-Image Generation via Masked Generative Transformers


27 of 31

Design choices

Elucidating the Design Space of Diffusion-Based Generative Models


28 of 31

Image editing (to be broken down into sections)

2021-08 SDEdit- Guided Image Synthesis and Editing with Stochastic Differential Equations

2021-11 Blended Diffusion for Text-driven Editing of Natural Images

2022-06 Blended Latent Diffusion

2022-08 [Textual Inversion] An Image is Worth One Word- Personalizing Text-to-Image Generation using Textual Inversion

2022-08 DreamBooth- Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

2022-08 Prompt-to-Prompt Image Editing with Cross Attention Control

2022-10 DiffEdit- Diffusion-based semantic image editing with mask guidance

2022-10 Imagic- Text-Based Real Image Editing with Diffusion Models

2022-11 InstructPix2Pix- Learning to Follow Image Editing Instructions

2022-11 Null-text Inversion for Editing Real Images using Guided Diffusion Models

2022-12 Imagen Editor and EditBench- Advancing and Evaluating Text-Guided Image Inpainting

2023-02 [ControlNet] Adding Conditional Control to Text-to-Image Diffusion Models

2023-05 BLIP-Diffusion- Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

2023-05 Drag Your GAN- Interactive Point-based Manipulation on the Generative Image Manifold

2022-11 Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

2022-11 EDICT- Exact Diffusion Inversion via Coupled Transformations

2023-06 StyleDrop- Text-to-Image Generation in Any Style

Other

2022-03 GAN Inversion- A Survey


29 of 31

Comparison / summary (add table)


30 of 31

[To be updated] Tracing, fingerprinting

2023-05 Tree-Ring Watermarks- Fingerprints for Diffusion Images that are Invisible and Robust


31 of 31

Other references

Image backbone networks

U-Net

2015-05 U-Net- Convolutional Networks for Biomedical Image Segmentation

ResNet

2015-12 Deep Residual Learning for Image Recognition

BigGAN

2018-09 [BigGAN] Large scale GAN training for high fidelity natural image synthesis

VAE, VQ-VAE and VQ-GAN

VAE

2013-12 Auto-Encoding Variational Bayes

An Introduction to Variational Autoencoders

VQVAE

2017-11 [VQVAE] Neural Discrete Representation Learning

2019-06 [VQVAE2] Generating Diverse High-Fidelity Images with VQ-VAE-2

VQGAN

2020-12 [VQGAN] Taming Transformers for High-Resolution Image Synthesis

Misc

DETR

2020-05 [DETR] End-to-End Object Detection with Transformers

2020-10 Deformable DETR- Deformable Transformers for End-to-End Object Detection

MAE

2021-12 [MAE] Masked Autoencoders Are Scalable Vision Learners

Transformers, Vision Transformers, Language Models/Encoders

Transformer

2017-12 Attention Is All You Need

http://nlp.seas.harvard.edu/annotated-transformer/

Vision Transformer (ViT)

2021-06 [ViT] An Image is Worth 16x16 Words- Transformers for Image Recognition at Scale

UniLM

2019-05 Unified Language Model Pre-training for Natural Language Understanding and Generation

T5

2019-10 [T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Multimodal: CLIP, ViLT, Flamingo and BLIP series

2021-02 [CLIP] Learning Transferable Visual Models From Natural Language Supervision

2021-02 ViLT- Vision-and-Language Transformer Without Convolution or Region Supervision

2022-04 Flamingo- a Visual Language Model for Few-Shot Learning

2022-02 BLIP- Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

2023-01 BLIP-2- Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

2023-05 InstructBLIP- Towards General-purpose Vision-Language Models with Instruction Tuning
