[WIP 🚧] Image Generation and Editing
- Literature Survey
6/6/2023
Ran Ding
Scope and context
Outline
Diffusion models, core methods
Context
Key papers
Other resources
Generative model
Diffusion model, intuitions
Diffusion model, formulation
Forward: q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big)
- \sqrt{1-\beta_t}\, x_{t-1}: decay towards the origin
- \beta_t \mathbf{I}: add small noise
Reverse: p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big), with learned mean and variance functions \mu_\theta, \Sigma_\theta
Learning objective: maximize the variational lower bound on \log p_\theta(x_0); in DDPM this simplifies to a weighted noise-prediction MSE
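To make the formulation concrete, here is a minimal training sketch under the standard DDPM noise-prediction parameterization. A toy MLP on 2-D points stands in for the U-Net used by real models; all names are illustrative, not taken from any paper’s code.

```python
# Minimal DDPM training sketch: sample t, noise x_0 forward in closed
# form, and regress the added noise (the simplified L_simple objective).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # DDPM linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{\alpha}_t

def q_sample(x0, t, noise):
    # Closed-form forward process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    ab = alphas_bar[t].view(-1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Toy denoiser predicting the noise from (x_t, t); real models use U-Nets.
eps_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))

def loss_fn(x0):
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    t_in = (t.float() / T).unsqueeze(-1)         # crude time embedding
    pred = eps_mlp(torch.cat([x_t, t_in], dim=-1))
    return ((pred - noise) ** 2).mean()          # L_simple

x0 = torch.randn(64, 2)                          # stand-in 2-D "data"
print(loss_fn(x0))
```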
Denoising Diffusion Probabilistic Models, aka DDPM
DDPM, model details
10
Ref: U-Net- Convolutional Networks for Biomedical Image Segmentation
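The reverse process can then be run as ancestral sampling (Algorithm 2 in the DDPM paper), sketched below with the σ_t² = β_t variance choice; eps_model(x, t) is any trained noise predictor (the toy MLP above would need a thin wrapper to match this signature).

```python
# Minimal DDPM ancestral sampling sketch: start from pure noise and
# iteratively denoise using the learned noise predictor.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z           # sigma_t^2 = beta_t variant
    return x
```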
Side note: [figure] reconstruction error (distortion) vs. time step, from t=T to t=0
[Optional, but really interesting] Score-based models, unified with diffusion
Context
Key papers
The author of the following papers wrote a phenomenal overview: https://yang-song.net/blog/2021/score
Langevin dynamics
Langevin dynamics is an iterative sampling procedure: it draws samples from a distribution p(x) without ever evaluating p(x) itself.
To sample x from p(x), you actually don’t need to know p(x). You only need access to \nabla_x \log p(x).
This gradient of \log p w.r.t. x is called the score function.
To generate a sample, repeat the following update (x_0 can come from any reasonable prior, e.g. uniform; \epsilon is a small constant; z_i is standard Gaussian):
x_{i+1} = x_i + \epsilon\, \nabla_x \log p(x_i) + \sqrt{2\epsilon}\, z_i, \quad i = 0, 1, \dots, K-1
When \epsilon \to 0 and K \to \infty, x_K converges to a sample from p(x)
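A runnable toy version of this loop, for a distribution whose score we can write down analytically (a standard 2-D Gaussian, whose score is simply -x); the step size and K here are arbitrary choices:

```python
# Unadjusted Langevin dynamics on N(0, I): only the score is used,
# never the density itself.
import numpy as np

def score(x):
    return -x            # \nabla_x log N(x; 0, I) = -x

rng = np.random.default_rng(0)
eps, K = 1e-2, 5000
x = rng.uniform(-4.0, 4.0, size=2)   # x_0 from a uniform prior
for _ in range(K):
    z = rng.standard_normal(2)
    x = x + eps * score(x) + np.sqrt(2 * eps) * z
print(x)                 # approximately a draw from N(0, I)
```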
What’s the big deal about Langevin dynamics?
Using the gradient instead of p(x) itself lets us use arbitrary functions (e.g. neural networks) to parameterize a probability density p(x), despite its tricky constraint (it always needs to integrate to 1).
We parameterize a network (a real-valued function f_\theta) and model the probability density function (PDF) as
p_\theta(x) = e^{f_\theta(x)} / Z_\theta, \quad Z_\theta = \int e^{f_\theta(x)}\, dx
The normalizing constant Z_\theta depends on the parameters \theta and, crucially, is whatever makes \int p_\theta(x)\, dx = 1. For a general form of f, Z_\theta is often intractable, making it infeasible to model and optimize p directly. The usual ways around this are to restrict the form of f (or p), or to use approximations.
Note that the score function s(x) = \nabla_x \log p_\theta(x) = \nabla_x f_\theta(x), however, does not depend on Z_\theta. As a result we have lots of modeling freedom with f (and p).
We can parameterize s as a neural network s_\theta and learn it by minimizing the Fisher divergence
\mathbb{E}_{p(x)}\big[\lVert \nabla_x \log p(x) - s_\theta(x) \rVert_2^2\big]
A model s_\theta of the score (i.e. the gradient of \log p w.r.t. x) is called a score-based model, and learning it is called “score matching”.
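The Fisher divergence above still contains the unknown true score; in practice one typically uses denoising score matching (Vincent, 2011), where the score of the noise-perturbed conditional is analytic. A minimal single-noise-scale sketch with a toy MLP (NCSN-style models condition on multiple noise scales):

```python
# Denoising score matching sketch: perturb data with Gaussian noise of
# scale sigma; the regression target for the score is then -z / sigma.
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 2))
sigma = 0.1

def dsm_loss(x):
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    target = -z / sigma          # score of q(x_tilde | x)
    return ((score_net(x_tilde) - target) ** 2).mean()

x = torch.randn(64, 2)           # stand-in 2-D "data"
print(dsm_loss(x))
```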
Score-based generative modeling
Problems with applying Langevin dynamics naively
For high-dimensional data, most of the space contains very few data points (the data “lives on a low-dimensional manifold”). In those low-density regions the true score \nabla_x \log p(x) is estimated from almost no samples, so the learned score s_\theta is also inaccurate there.
Similarly, during the iterative sampling, if the initial sample lands in one mode it is almost impossible for it to travel to another mode, causing inaccurate sampling (the relative weights of the modes are not respected)
Annealed Langevin dynamics
Step 1: Add noise to the data distribution; this makes score learning accurate (w.r.t. the noised distribution)
We apply multiple scales of Gaussian noise to perturb the data distribution and jointly estimate the score functions for all of them. [Figure: perturbed distributions (first row) and their estimated scores (second row)]
This step corresponds to the forward diffusion process
Annealed Langevin dynamics
Step 2: Run Langevin sampling on progressively less noisy distributions
This step corresponds to the reverse diffusion process.
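Putting the two steps together, here is a toy sketch of annealed Langevin sampling (after Song & Ermon, 2019). The noise-conditional score model is faked with the analytic score of N(0, (1 + σ²)I) so the snippet runs end to end; the geometric noise ladder and per-scale step size follow the paper’s recipe:

```python
# Annealed Langevin dynamics: run Langevin sampling at a sequence of
# noise scales, from large (easy to mix between modes) to small (close
# to the true data distribution).
import numpy as np

def score_net(x, sigma):
    # Placeholder: analytic score of the sigma-perturbed N(0, I).
    return -x / (1.0 + sigma**2)

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, num=10)   # large -> small noise scales
eps0, steps_per_scale = 2e-5, 100

x = rng.uniform(-8.0, 8.0, size=2)          # uninformative starting point
for sigma in sigmas:
    eps = eps0 * (sigma / sigmas[-1]) ** 2  # step size scaled per noise level
    for _ in range(steps_per_scale):
        z = rng.standard_normal(2)
        x = x + eps * score_net(x, sigma) + np.sqrt(2 * eps) * z
print(x)
```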
Score-based models are equivalent to diffusion models
The score function is proportional to the noise-prediction function in diffusion
The overall optimization objectives are also equivalent
[Figure: score-based vs. diffusion formulation, side by side]
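To make the proportionality concrete, here is the standard identity for the DDPM forward kernel (a well-known derivation, not reproduced from the slides):

```latex
% q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\big),
% i.e. x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ \epsilon \sim \mathcal{N}(0, \mathbf{I})
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1 - \bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}
\quad\Longrightarrow\quad
s_\theta(x_t, t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
```

So a noise predictor is just a rescaled score model, and the score-matching loss matches the DDPM noise-prediction loss up to a per-timestep weighting.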
Diffusion model improvements
In this section we summarize a few papers, each of which introduced new and important techniques. A quick overview:
Improved DDPM
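Among its contributions, Improved DDPM (Nichol & Dhariwal, 2021) learns the reverse-process variance and replaces the linear β schedule with a cosine schedule. A minimal sketch of that schedule, transcribed from the paper’s formula:

```python
# Cosine noise schedule: abar_t = f(t)/f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2).
# The small offset s keeps beta_t from being vanishingly small near t = 0.
import math
import torch

def cosine_alphas_bar(T, s=0.008):
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return f / f[0]                                       # \bar{\alpha}_t, t = 0..T

abar = cosine_alphas_bar(1000)
betas = (1 - abar[1:] / abar[:-1]).clamp(max=0.999)       # recover beta_t, clipped as in the paper
```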
Guidance
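A central technique here is classifier-free guidance (Ho & Salimans, 2022): train with the condition randomly dropped, then at sampling time extrapolate from the unconditional toward the conditional prediction. A minimal sketch, where eps_model(x, t, cond) is a hypothetical conditional noise predictor that accepts cond=None for the unconditional branch:

```python
# Classifier-free guidance: blend unconditional and conditional noise
# predictions with guidance weight w (w = 1 recovers plain conditioning;
# w > 1 strengthens adherence to the condition at some cost in diversity).
def guided_eps(eps_model, x, t, cond, w=7.5):
    eps_uncond = eps_model(x, t, None)
    eps_cond = eps_model(x, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```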
Cascading
Latent Diffusion
Faster generation
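A representative fast-sampling method is DDIM (Song et al., 2020): a deterministic (η = 0) non-Markovian update that stays consistent when timesteps are skipped, so a 1000-step model can sample in a few dozen steps. A minimal sketch, assuming the alphas_bar schedule and a trained eps_model as in the earlier DDPM sketches:

```python
# One deterministic DDIM step; t_prev may be many steps below t,
# which is what makes sampling faster.
import torch

@torch.no_grad()
def ddim_step(eps_model, x, t, t_prev, alphas_bar):
    eps = eps_model(x, t)
    # Predict x_0, then jump directly to the marginal at t_prev.
    x0_pred = (x - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
    return alphas_bar[t_prev].sqrt() * x0_pred + (1 - alphas_bar[t_prev]).sqrt() * eps
```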
Image generation systems
Diffusion based
2021-02 [DALL-E] Zero-Shot Text-to-Image Generation
2021-04 [SR3] Image Super-Resolution via Iterative Refinement
2021-11 Palette- Image-to-Image Diffusion Models
2021-12 [GLIDE] Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
2022-04 [DALL-E 2, unCLIP] Hierarchical Text-Conditional Image Generation with CLIP Latents
2022-05 [Imagen] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Autoregressive
2022-06 [Parti] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Masked (non-diffusion, non-autoregressive)
2022-02 MaskGIT- Masked Generative Image Transformer
2023-01 Muse- Text-To-Image Generation via Masked Generative Transformers
Design choices
Elucidating the Design Space of Diffusion-Based Generative Models
Image editing (to break down into sections)
2021-08 SDEdit- Guided Image Synthesis and Editing with Stochastic Differential Equations
2021-11 Blended Diffusion for Text-driven Editing of Natural Images
2022-06 Blended Latent Diffusion
2022-08 [Textual Inversion] An Image is Worth One Word- Personalizing Text-to-Image Generation using Textual Inversion
2022-08 DreamBooth- Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
2022-08 Prompt-to-Prompt Image Editing with Cross Attention Control
2022-10 DiffEdit- Diffusion-based semantic image editing with mask guidance
2022-10 Imagic- Text-Based Real Image Editing with Diffusion Models
2022-11 InstructPix2Pix- Learning to Follow Image Editing Instructions
2022-11 Null-text Inversion for Editing Real Images using Guided Diffusion Models
2022-11 Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
2022-11 EDICT- Exact Diffusion Inversion via Coupled Transformations
2022-12 Imagen Editor and EditBench- Advancing and Evaluating Text-Guided Image Inpainting
2023-02 [ControlNet] Adding Conditional Control to Text-to-Image Diffusion Models
2023-05 BLIP-Diffusion- Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
2023-05 Drag Your GAN- Interactive Point-based Manipulation on the Generative Image Manifold
2023-06 StyleDrop- Text-to-Image Generation in Any Style
Other
2022-03 GAN Inversion- A Survey
Comparison / summary (add table)
[To be updated] Tracking, fingerprinting
2023-05 Tree-Ring Watermarks- Fingerprints for Diffusion Images that are Invisible and Robust
Other references
Image backbone networks
U-Net
2015-05 U-Net- Convolutional Networks for Biomedical Image Segmentation
ResNet
2015-12 Deep Residual Learning for Image Recognition
BigGAN
2018-09 [BigGAN] Large scale GAN training for high fidelity natural image synthesis
VAE, VQ-VAE and VQ-GAN
VAE
2013-12 Auto-Encoding Variational Bayes
An Introduction to Variational Autoencoders
VQVAE
2017-11 [VQVAE] Neural Discrete Representation Learning
2019-06 [VQVAE2] Generating Diverse High-Fidelity Images with VQ-VAE-2
VQGAN
2020-12 [VQGAN] Taming Transformers for High-Resolution Image Synthesis
Misc
DETR
2020-05 [DETR] End-to-End Object Detection with Transformers
2020-10 Deformable DETR- Deformable Transformers for End-to-End Object Detection
MAE
2021-11 [MAE] Masked Autoencoders Are Scalable Vision Learners
Transformers, Vision Transformers, Language Models/Encoders
Transformer
2017-06 Attention Is All You Need
http://nlp.seas.harvard.edu/annotated-transformer/
Vision Transformer (ViT)
2020-10 [ViT] An Image is Worth 16x16 Words- Transformers for Image Recognition at Scale
UniLM
2019-05 Unified Language Model Pre-training for Natural Language Understanding and Generation
T5
2019-10 [T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Multimodal: CLIP, ViLT, Flamingo and BLIP series
2021-02 [CLIP] Learning Transferable Visual Models From Natural Language Supervision
2021-02 ViLT- Vision-and-Language Transformer Without Convolution or Region Supervision
2022-04 Flamingo- a Visual Language Model for Few-Shot Learning
2022-02 BLIP- Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
2023-01 BLIP-2- Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
2023-05 InstructBLIP- Towards General-purpose Vision-Language Models with Instruction Tuning