1 of 31

[WIP 🚧…] Image Generation and Editing

- Literature Survey

6/6/2023

Ran Ding


2 of 31

Scope and context

  • Literature survey conducted in late May 2023 on image generation and editing based on deep generative models.
    • The survey is a bit heavy on diffusion-based models.
    • Some non-diffusion systems are included (e.g. Parti, MaskGIT, MUSE).
    • A lot of GAN-based work is omitted.
  • This deck is long…
    • I inserted a number of outline/intro slides, with background in green. They contain all the links to papers/resources, so you can also just browse them without reading all the detail slides.
    • We have a team with diverse backgrounds, so please feel free to leave any questions or suggestions about things to add/remove/clarify 🙏
    • It’s a large body of work to cover, so please read with a grain of salt; corrections/feedback are much appreciated.
  • Thanks to this list of awesome people for reviewing this deck and providing valuable feedback.
    • tba, tba


3 of 31

Outline

  1. Diffusion models
    1. Core methods
    2. [Optional] Score-matching, unified with diffusion models
    3. Improvements
  2. Image generation systems
    • Diffusion
    • Autoregressive
    • Masked (non-diffusion, non-autoregressive)
  3. Image editing
  4. Tracing, fingerprinting
  5. Other references
    • Image backbones
    • VAE, VQ-VAE, VQ-GAN
    • Transformers, Vision Transformers, Text Encoders
    • Multimodal: CLIP, ViLT, Flamingo and BLIP series
  6. Codebases
  7. Questions, takeaways, musings


4 of 31

Diffusion models, core methods

Context

  • Two foundational papers that many other papers build upon.

Key papers

  • [1a] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
    • Proposed the original diffusion model for generative modeling.
  • [1b] [DDPM] Denoising Diffusion Probabilistic Models (06/2020)
    • Landmark paper; introduced a few key improvements (problem setup, model architecture, loss function, etc.) that are widely adopted by later papers.

Other resources


5 of 31

Generative model

  • What are generative models, when are they useful?
    • Learns the data distribution and data generation process
      • Useful for anomaly detection (rare/no labels), compression, clustering (and other unsupervised learning problems).
    • And of course, generating new data samples!
      • Generate text, images, videos etc

  • Why is generative modeling hard?
    • “Can you look at a painting and recognize it as being the Mona Lisa? You probably can. That's discriminative modeling. Can you paint the Mona Lisa yourself? You probably can't. That's generative modeling.” - Ian Goodfellow

  • Deep generative models
    • GAN
      • Unstable adversarial training procedure. Lack of sample diversity.
    • VAE
      • Stable training. Lack of sample quality.
    • Diffusion
      • Stable training. Seems to do well in both quality and diversity.
      • Note: diffusion can be viewed as many (100–1000) VAEs chained together, making the problem more incremental (and arguably more learnable).


6 of 31

Diffusion model, intuitions

  • Observation 1:
    • For generative models, we often need an “easy” distribution to start from / sample from, in order to generate and model data distributions that are highly complex (e.g. real-world images).
    • Diffusion is a process that relates a complex distribution to a simple one.
  • Observation 2:
    • At small time scales, diffusion is reversible: both the forward and reverse steps are Gaussian.
    • The ability to “learn” the (probabilistic) reversal step comes from the extra spatial information
      • i.e. surrounding data points tell you what a plausible arrangement could have been at the previous time step.


7 of 31

Diffusion model, formulation

  1. We have a fixed forward diffusion process that converts the data distribution to a known distribution (e.g. Gaussian).
    • We have design choices about the forward process (e.g. what eventual distribution to converge to, and how to evolve towards it (the noise schedule), etc.).
  2. And a learned time-reversal process that converts random samples from this known distribution (e.g. Gaussian) into samples that follow the data distribution (e.g. real images).
    • We have design choices about how to parameterize the learned process/model (model architecture, optimization objectives, etc.).


Forward: q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) · x_{t-1}, β_t · I)
  (the sqrt(1 − β_t) factor decays the mean towards the origin; β_t · I adds a small amount of noise)

Reverse: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
  (with learned mean and variance functions μ_θ and Σ_θ)
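As a concrete illustration of the fixed forward process, here is a minimal PyTorch sketch that samples x_t directly from x_0 via the closed form q(x_t | x_0) = N(sqrt(ᾱ_t)·x_0, (1 − ᾱ_t)·I). All names (`betas`, `alpha_bar`, `q_sample`) are illustrative, not from any particular codebase.

```python
# Minimal sketch (PyTorch) of the fixed forward process, sampling x_t directly from x_0
# via the closed form q(x_t | x_0) = N( sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I ).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule (one common design choice)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in a single step instead of iterating t noising steps."""
    a = alpha_bar[t].view(-1, 1, 1, 1)       # broadcast over (B, C, H, W)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# Usage: x0 scaled to [-1, 1], shape (B, C, H, W); t is a batch of integer timesteps.
x0 = torch.rand(4, 3, 64, 64) * 2.0 - 1.0
t = torch.randint(0, T, (4,))
xt = q_sample(x0, t, torch.randn_like(x0))
```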
8 of 31

Learning objective

  • The forward process q(x_t | x_{t-1}) is known; the reverse q(x_{t-1} | x_t) is unknown. But… if we know the original image x_0, we can easily work out the reverse trajectory q(x_{t-1} | x_t, x_0) (i.e. figure out how much noise was added to x_0).
    • Lots of algebra involved, but it is a simple Gaussian in the end.
  • We do know the original image x_0 during the forward process (we started from it), so we can use q(x_{t-1} | x_t, x_0) as the ground truth to train a model p_θ(x_{t-1} | x_t) that guesses/generates the reverse trajectory without knowing x_0 beforehand.
  • The learning objective is: for every time step t, we’d like p to be very close to q.
    • We minimize the difference by minimizing the KL divergence (usually denoted D_KL(q||p); it measures how different two distributions q and p are).
  • So the overall learning objective is a sum of D_KL terms covering all time steps t.
    • D_KL between two Gaussian distributions has a simple closed form; with fixed variances it reduces to the squared difference between their means scaled by the inverse covariance (see the sketch below).
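A minimal sketch of the closed-form KL between two diagonal Gaussians, which is what each per-timestep D_KL(q || p_θ) term reduces to. The function name and log-variance parameterization are illustrative.

```python
# Minimal sketch: KL divergence between two diagonal Gaussians.
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Elementwise D_KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) )."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )

# With both variances fixed and equal, this collapses to a scaled squared difference of the means,
# which is why the training loss can be written in terms of mean (or noise) prediction errors.
```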


9 of 31

Denoising Diffusion Probabilistic Models, aka DDPM

  • Landmark paper that made diffusion really work, via a few key improvements that are widely adopted by later papers.
  • Key improvement details
    • 1. Predicting the noise instead of predicting the image
    • 2. A simpler loss objective that prioritizes image generation quality over log likelihood (see the sketch below)
    • 3. A better image model (U-Net with attention, more details on the next slide)
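A minimal sketch of the simplified training objective, L_simple = E_{t, x_0, ε}[ ||ε − ε_θ(x_t, t)||² ]: sample a random timestep, noise the image with the forward process, and regress the model output onto the added noise. `model` stands in for any (x_t, t) → predicted-noise network; all names are illustrative.

```python
# Minimal sketch of the simplified DDPM objective: predict the noise, not the image.
import torch
import torch.nn.functional as F

def ddpm_simple_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)                                                # eps ~ N(0, I)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise     # closed-form forward noising, as in the earlier sketch
    return F.mse_loss(model(xt, t), noise)            # L_simple = E || eps - eps_theta(x_t, t) ||^2
```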


10 of 31

DDPM, model details

  • Uses a U-Net to model p_θ
    • Given a noised image, predict the noise; i.e. both input and output have the image dimensions.
  • Enhancements
    • Self-attention is added (at the 16x16 resolution, between the convolution blocks).
    • Diffusion time t is specified by adding a Transformer sinusoidal position embedding into each residual block (see the sketch below).
  • Code
    • The Annotated Diffusion Model


Ref: U-Net- Convolutional Networks for Biomedical Image Segmentation
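A minimal sketch of the sinusoidal timestep embedding mentioned above: the diffusion time t is mapped to a vector using the standard Transformer position-embedding recipe, and that vector is then added (typically after a small MLP) into each residual block. The function name is illustrative; real implementations, e.g. in The Annotated Diffusion Model, differ slightly in details.

```python
# Minimal sketch of a sinusoidal timestep embedding (dim assumed even).
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """t: (B,) integer timesteps -> (B, dim) embedding of sines and cosines at geometric frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```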

11 of 31

Side note: reconstruction error (distortion) vs time step

  • Going from t=T to t=0
    • We can stop at some t, directly take a guess at the original image x_0, and see how much we got wrong (“distortion”).
    • We also count, cumulatively from T down to t, how much of the loss terms L_t we have accumulated (“rate”, in information-theory parlance).
  • Takeaways
    • The last few time steps (say t = 0–10) account for a large part of the overall loss, but they don’t contribute much to reducing the distortion of the reconstruction (image quality).
    • This also calls into question whether we are wasting a lot of model capacity on modeling details of the data distribution that are unimportant for generating realistic-looking images (which is part of what motivated the Latent Diffusion Model (LDM)).


12 of 31

[Optional, but really interesting] Score-based models, unified with diffusion

Context

  • Score-based models turned out to be equivalent to diffusion models (a somewhat accidental, concurrent development), and they provide an alternative view of diffusion.
  • Score-based formulation helps derive very important methods for diffusion based image generation, e.g.
    • Classifier- and Classifier-Free Guidance
    • Accelerated sampling methods based on ODE solvers
  • This section is not strictly needed for understanding most papers, read for curiosity but feel free to skip.

Key papers

The author of the following papers wrote a phenomenal overview: https://yang-song.net/blog/2021/score


13 of 31

Langevin dynamics

Langevin dynamics is an iterative sampling procedure that draws samples from a distribution p(x) without knowing p(x) itself.

To sample x from p(x), you actually don’t need to know p(x). You only need access to ∇_x log p(x), the gradient of the log density w.r.t. x, which is called the score function.

To generate a sample, repeat the following update (x_0 can come from any reasonable prior, e.g. uniform; ε is a small constant step size; z_i is standard Gaussian noise):

x_{i+1} = x_i + ε · ∇_x log p(x_i) + sqrt(2ε) · z_i,   i = 0, 1, …, K−1

When ε → 0 and K → ∞, x_K converges to a sample from p(x).
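A minimal sketch of this sampling loop; `score_fn` stands in for the (true or learned) score ∇_x log p(x), and all names are illustrative.

```python
# Minimal sketch of the Langevin update: x_{i+1} = x_i + eps * score(x_i) + sqrt(2 * eps) * z_i.
import torch

def langevin_sample(score_fn, x0: torch.Tensor, eps: float = 1e-3, K: int = 1000) -> torch.Tensor:
    x = x0                                     # x0 from any reasonable prior, e.g. uniform noise
    for _ in range(K):
        z = torch.randn_like(x)
        x = x + eps * score_fn(x) + (2.0 * eps) ** 0.5 * z
    return x                                   # as eps -> 0 and K -> inf, x approaches a sample from p(x)
```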


14 of 31

What’s the big deal about Langevin dynamics

Working with the score instead of p(x) itself lets us use arbitrary functions (e.g. neural networks) to describe (parameterize) a probability density function p(x), which otherwise carries a tricky constraint (it always needs to integrate to 1).

We parameterize a network (a real-valued function f_θ) and model the probability density function (PDF) as

p_θ(x) = e^{f_θ(x)} / Z_θ

The normalizing constant Z_θ depends on the parameters θ and, crucially, must satisfy ∫ p_θ(x) dx = 1. For a general form of f, Z_θ is usually intractable, making it infeasible to directly model and optimize p. The usual ways around this are to restrict the form of f (or p), or to use approximations.

Note that the score function, the gradient of log p w.r.t. x, does not depend on Z: ∇_x log p_θ(x) = ∇_x f_θ(x) − ∇_x log Z_θ = ∇_x f_θ(x). As a result we have a lot of modeling freedom with f (and p).

We can parameterize the score as a neural network s_θ and learn it by minimizing the Fisher divergence E_{p(x)}[ ||∇_x log p(x) − s_θ(x)||² ]. s_θ models the score (i.e. the gradient of log p w.r.t. x) and is called a score-based model; learning s_θ is called “score matching”.
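The Fisher divergence above contains the unknown true score, so in practice one often trains with denoising score matching instead: perturb x with Gaussian noise of scale σ and regress s_θ onto the score of the perturbation kernel, ∇ log q_σ(x̃ | x) = −(x̃ − x)/σ². A minimal sketch under that substitution; `score_model` is an illustrative name.

```python
# Minimal sketch of a denoising score matching loss at a single noise scale sigma.
import torch

def denoising_score_matching_loss(score_model, x: torch.Tensor, sigma: float) -> torch.Tensor:
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / (sigma ** 2)                      # score of q_sigma(x_tilde | x) at x_tilde
    pred = score_model(x_tilde, sigma)                  # s_theta(x_tilde, sigma)
    return ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```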


15 of 31

Score-based generative modeling


16 of 31

Problems with applying Langevin dynamics naively

For high-dimensional data, most of the space contains very few data points (“the data lives on a low-dimensional manifold”). So over most of the space, the true score ∇_x log p(x) is poorly determined and the estimated score s_θ is inaccurate.

Similarly, during the iterative sampling, if the initial point falls near one mode it is almost impossible for it to move to another mode, causing inaccurate sampling.


17 of 31

Annealed Langevin dynamics

Step 1: add a small amount of noise to the data distribution; this makes score learning accurate (w.r.t. the noised distribution).

We apply multiple scales of Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row).

This step corresponds to the forward diffusion process


18 of 31

Annealed Langevin dynamics

Step 2: perform Langevin sampling at progressively less noisy distributions (see the sketch below).

This step corresponds to the reverse diffusion process.
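A minimal sketch of annealed Langevin sampling, assuming noise scales sorted from largest to smallest, warm-starting each level from the previous one, and the commonly used step-size rescaling ε_i ∝ σ_i². The schedule and all names are illustrative.

```python
# Minimal sketch of annealed Langevin sampling over decreasing noise levels.
import torch

def annealed_langevin_sample(score_model, shape, sigmas, steps_per_level=100, base_eps=2e-5):
    x = torch.rand(shape)                                   # start from a simple prior
    for sigma in sigmas:                                    # sigmas sorted from large to small
        eps = base_eps * (sigma / sigmas[-1]) ** 2          # smaller steps at smaller noise levels
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + eps * score_model(x, sigma) + (2.0 * eps) ** 0.5 * z
    return x
```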


19 of 31

Score-based models are equivalent to Diffusion models

The score function is proportional to the noise-prediction function in diffusion: s_θ(x_t, t) = −ε_θ(x_t, t) / sqrt(1 − ᾱ_t)

The overall optimization objectives are also equivalent


(Figure: the score-matching training objective and the diffusion training objective, shown side by side.)
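A minimal sketch of this correspondence: a trained DDPM noise predictor can be read as a score model for the noised marginals via s_θ(x_t, t) = −ε_θ(x_t, t) / sqrt(1 − ᾱ_t). The names `eps_model` and `alpha_bar` are illustrative (alpha_bar as in the forward-process sketch).

```python
# Minimal sketch: convert a noise prediction into a score estimate.
import torch

def score_from_eps(eps_model, xt: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    sigma_t = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)   # std of q(x_t | x_0)
    return -eps_model(xt, t) / sigma_t
```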

20 of 31

Diffusion model improvements

In this section we summarize a few papers, each of which introduced new and important techniques. Here is a quick overview (a small classifier-free guidance sketch follows the list):

  • Model improvements
    • 2021-02 [Improved DDPM] Improved Denoising Diffusion Probabilistic Models
    • Guidance
      • 2021-06 [Classifier Guided Diffusion] Diffusion Models Beat GANs on Image Synthesis
      • 2021-12 Classifier-Free Diffusion Guidance
    • Cascading
      • 2021-12 Cascaded Diffusion Models for High Fidelity Image Generation
    • Latent Diffusion
      • 2022-04 [LDM] High-Resolution Image Synthesis with Latent Diffusion Models
  • Faster sample generation
    • Better solvers
      • E.g. 2022-10 [DDIM] Denoising Diffusion Implicit Models
      • Many others
    • Common things: float16, freeze models, compilation
    • Time scheduler
    • Token merging
    • Require training
      • Guidance distillation
      • Progressive distillation
      • Architecture Distillation


21 of 31

Improved DDPM


22 of 31

Guidance


23 of 31

Cascading


24 of 31

Latent Diffusion


25 of 31

Faster generation


26 of 31

Image generation systems

Diffusion based


2021-04 [SR3] Image Super-Resolution via Iterative Refinement

2021-10 Palette Image-to-Image Diffusion Models

2022-03 [GLIDE] Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

2022-04 [DALLE2 unCLIP] Hierarchical Text-Conditional Image Generation with CLIP Latents

2022-05 [Imagen] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Autoregressive

2021-02 [DALL-E] Zero-Shot Text-to-Image Generation

2022-06 [Parti] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Masked (non-diffusion, non-autoregressive)

2022-02 MaskGIT- Masked Generative Image Transformer

2023-01 Muse- Text-To-Image Generation via Masked Generative Transformers


27 of 31

Design choices

Elucidating the Design Space of Diffusion-Based Generative Models


28 of 31

Image editing (to be broken down into sections)

2021-08 SDEdit- Guided Image Synthesis and Editing with Stochastic Differential Equations

2021-11 Blended Diffusion for Text-driven Editing of Natural Images

2022-06 Blended Latent Diffusion

2022-08 [Textual Inversion] An Image is Worth One Word- Personalizing Text-to-Image Generation using Textual Inversion

2022-08 DreamBooth- Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

2022-08 Prompt-to-Prompt Image Editing with Cross Attention Control

2022-10 DiffEdit- Diffusion-based semantic image editing with mask guidance

2022-10 Imagic- Text-Based Real Image Editing with Diffusion Models

2022-11 InstructPix2Pix- Learning to Follow Image Editing Instructions

2022-11 Null-text Inversion for Editing Real Images using Guided Diffusion Models

2022-12 Imagen Editor and EditBench- Advancing and Evaluating Text-Guided Image Inpainting

2023-02 [ControlNet] Adding Conditional Control to Text-to-Image Diffusion Models

2023-05 BLIP-Diffusion- Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

2023-05 Drag Your GAN- Interactive Point-based Manipulation on the Generative Image Manifold

2022-11 Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

2022-11 EDICT- Exact Diffusion Inversion via Coupled Transformations

2023-06 StyleDrop- Text-to-Image Generation in Any Style

Other

2022-03 GAN Inversion- A Survey


29 of 31

Comparison / summary (add table)


30 of 31

[To be updated] Tracing, fingerprinting

2023-05 Tree-Ring Watermarks- Fingerprints for Diffusion Images that are Invisible and Robust


31 of 31

Other references

Image backbone networks

U-Net

2015-05 U-Net- Convolutional Networks for Biomedical Image Segmentation

ResNet

2015-12 Deep Residual Learning for Image Recognition

BigGAN

2018-09 [BigGAN] Large scale GAN training for high fidelity natural image synthesis

VAE, VQ-VAE and VQ-GAN

VAE

2013-12 Auto-Encoding Variational Bayes

An Introduction to Variational Autoencoders

VQVAE

2017-11 [VQVAE] Neural Discrete Representation Learning

2019-06 [VQVAE2] Generating Diverse High-Fidelity Images with VQ-VAE-2

VQGAN

2020-12 [VQGAN] Taming Transformers for High-Resolution Image Synthesis

Misc

DETR

2020-05 [DETR] End-to-End Object Detection with Transformers

2020-10 Deformable DETR- Deformable Transformers for End-to-End Object Detection

MAE

2021-12 [MAE] Masked Autoencoders Are Scalable Vision Learners

Transformers, Vision Transformers, Language Models/Encoders

Transformer

2017-12 Attention Is All You Need

http://nlp.seas.harvard.edu/annotated-transformer/

Vision Transformer (ViT)

2021-06 [ViT] An Image is Worth 16x16 Words- Transformers for Image Recognition at Scale

UniLM

2019-05 Unified Language Model Pre-training for Natural Language Understanding and Generation

T5

2019-10 [T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Multimodal: CLIP, ViLT, Flamingo and BLIP series

2021-02 [CLIP] Learning Transferable Visual Models From Natural Language Supervision

2021-02 ViLT- Vision-and-Language Transformer Without Convolution or Region Supervision

2022-04 Flamingo- a Visual Language Model for Few-Shot Learning

2022-02 BLIP- Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

2023-01 BLIP-2- Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

2023-05 InstructBLIP- Towards General-purpose Vision-Language Models with Instruction Tuning
