WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE TEXT-TO-IMAGE DIFFUSION MODELS
Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville
Diffusion models
State-of-the-art (SOTA) generative models for image generation
Neural networks that model the diffusion process
Forward diffusion process: starts from an image and gradually adds noise until only Gaussian noise remains
Reverse diffusion process: starts from Gaussian noise and gradually denoises it to recover an image
Train a neural network to predict the score function at a given timestep; the score is proportional to the added noise, so the network effectively learns to predict that noise (ε-prediction)
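A minimal sketch of this ε-prediction training step (assuming a generic PyTorch noise predictor `model(x_t, t)` and a precomputed cumulative noise schedule `alpha_bar`; names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alpha_bar):
    """One denoising-diffusion training step with the epsilon-prediction loss."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)                                    # the noise we add
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                  # forward diffusion q(x_t | x_0)
    eps_pred = model(x_t, t)                                      # network predicts the added noise
    return F.mse_loss(eps_pred, eps)                              # simple DDPM objective
```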
Inference Process
[Figure: reverse diffusion trajectory. Starting from pure Gaussian noise at t = 1000, a U-Net predicts the noise at each timestep (t = 1000, 200, 100, 20, 10, …, 0), gradually denoising until the final image at t = 0.]
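A minimal sketch of this iterative denoising loop (plain DDPM ancestral sampling with the same illustrative `model` and schedules; practical systems use far fewer steps with more efficient samplers):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, alpha, alpha_bar):
    """Iteratively denoise pure Gaussian noise back into an image."""
    x = torch.randn(shape)                                    # start from pure Gaussian noise
    for t in reversed(range(len(alpha_bar))):                 # t = T-1, ..., 0
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                               # U-Net predicts the noise
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alpha[t].sqrt()                # posterior mean (remove noise)
        if t > 0:
            x = x + (1 - alpha[t]).sqrt() * torch.randn_like(x)  # re-inject sampling noise
    return x
```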
How do we get the image that we want?
We use classifier-free guidance.
[Figure: the neural network takes the noised image x, the timestep t, and a condition c (e.g., a text embedding) and predicts the noise; w is the guidance scale.]
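The standard classifier-free guidance mix, as a sketch (the model is queried twice per step; `null_cond` stands for the learned "empty" condition, a detail assumed here):

```python
def cfg_noise(model, x_t, t, cond, null_cond, w):
    """Blend conditional and unconditional noise predictions with guidance scale w."""
    eps_cond = model(x_t, t, cond)         # prediction given the condition c
    eps_uncond = model(x_t, t, null_cond)  # prediction with the condition dropped
    # w = 1 recovers the plain conditional model; w > 1 pushes samples toward c.
    return eps_uncond + w * (eps_cond - eps_uncond)
```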
Training is costly (and so is inference)
BigGAN-deep - generative adversarial network (256 × 256) = 128 V100 days
ADM-G (4360K) - diffusion model in image space (256 × 256) = 962 V100 days [source]
Latent Diffusion Models
Models that run the diffusion process in a lower-dimensional latent space
Addresses the problem of high compute: uses perceptual compression (VQ-GAN based) to reduce the computational burden of diffusion models
[Figure: LDM pipeline: VQ-GAN-based encoder → U-Net (conditioned via a text encoder) → VQ-GAN-based decoder.]
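A minimal sketch of that pipeline (hypothetical `text_encoder`, `unet`, and `vqgan_decoder` callables standing in for the real components; `sampler` is a denoising loop like the one above, extended to pass a condition through; the VQ-GAN encoder is only needed during training):

```python
def ldm_generate(prompt, text_encoder, unet, vqgan_decoder, sampler, latent_shape):
    """Latent diffusion: run the (costly) diffusion process in the small
    VQ-GAN latent space, then decode back to pixel space once."""
    c = text_encoder(prompt)                 # text conditioning
    z = sampler(unet, latent_shape, cond=c)  # denoise a random latent
    return vqgan_decoder(z)                  # map the latent back to an image
```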
Reduction in Training Time
BigGAN-deep - generative adversarial network (256 × 256) = 128 V100 days
ADM-G (4360K) - diffusion model in image space (256 × 256) = 962 V100 days
LDM - diffusion model in latent space (256 × 256) = 79 V100 days [source]
How can we go even further?
Reducing the compute budget via spatial compression is limited by how much we can compress without degradation; input size matters for CNNs
Solution: compress the conditioning latent instead of relying only on pre-trained language encodings
Result: a novel three-stage architecture achieving a 42:1 spatial compression ratio
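A rough sanity check on that ratio (assuming a 1024 × 1024 input and a 24 × 24 final latent; these numbers are not stated on this slide but are consistent with the semantic-compressor sketch later): 1024 / 24 ≈ 42.7, i.e., about 42:1 per spatial dimension.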
Method
Three-stage architecture
Stages
Stage A - compression of images into the latent space of a VQ-GAN (f4)
Stage B - latent diffusion process conditioned on the outputs of a semantic compressor and text embeddings
Stage C - text-conditional LDM trained in the strongly compressed space of the semantic compressor
During Inference
Stage C → Stage B → Stage A
During Inference (Stage C)
[Figure: a text-conditioned LDM denoises random noise, guided by text embeddings, into the compressed semantic latent over 60 steps.]
During Inference (Stage B)
[Figure: a denoising U-Net, conditioned on the compressed semantic latent and the text embedding, turns a random latent into a denoised latent over 12 steps.]
During Inference (Stage A)
[Figure: the VQ-GAN decoder maps the denoised latent to the final image.]
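Putting the three inference stages together, as a sketch (module names are hypothetical; the step counts follow the slides above, and `sampler` is a generic conditioned denoising loop):

```python
def wuerstchen_generate(prompt, text_encoder, stage_c, stage_b, vqgan_decoder,
                        sampler, semantic_shape, latent_shape):
    """Würstchen inference order: Stage C -> Stage B -> Stage A."""
    c_text = text_encoder(prompt)
    # Stage C: text-conditioned LDM generates the strongly compressed semantic latent.
    semantic = sampler(stage_c, semantic_shape, cond=c_text, steps=60)
    # Stage B: denoising U-Net produces a VQ-GAN latent, conditioned on the
    # compressed semantic latent and the text embedding.
    latent = sampler(stage_b, latent_shape, cond=(semantic, c_text), steps=12)
    # Stage A: the VQ-GAN decoder maps the latent to the final image.
    return vqgan_decoder(latent)
```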
During Training
Stage A → Stage B → Stage C
During Training (Stage A)
Trained with a reconstruction objective (VQ-GAN)
During Training (Stage B)
Trained with a diffusion loss
During Training (Stage C)
Text-conditional LDM trained with a diffusion loss in the semantic compressor's latent space
Architecture
Semantic Compressor
[Figure: a 1024 × 1024 image is resized to 768 × 768 and fed to the semantic compressor (SC), yielding the compressed semantic latent.]
Any feature extractor can be used; EfficientNet-V2 is used here (initialized pretrained on ImageNet)
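A minimal sketch of such a semantic compressor (using torchvision's EfficientNet-V2-S as the feature extractor; the resize follows the slide, while the output projection and channel count are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

class SemanticCompressor(nn.Module):
    """Resize the image and extract a strongly compressed semantic latent."""
    def __init__(self, out_channels=16):              # latent channel count: illustrative
        super().__init__()
        # Feature extractor initialized from ImageNet-pretrained weights.
        backbone = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.DEFAULT)
        self.features = backbone.features             # drop the classifier head
        self.proj = nn.Conv2d(1280, out_channels, 1)  # project 1280 features down

    def forward(self, x):                             # x: (B, 3, 1024, 1024)
        x = F.interpolate(x, size=(768, 768), mode="bicubic")
        return self.proj(self.features(x))            # e.g. (B, 16, 24, 24)
```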
Stage C - LDM
[Figure: the denoiser (16 ConvNeXt blocks) maps a noised latent, the timestep t, and the text condition to a denoised latent.]
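A minimal sketch of that denoiser layout (a stack of 16 ConvNeXt-style blocks; how the timestep and text condition are injected is simplified to added embeddings, an assumption rather than the paper's exact mechanism):

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.GroupNorm(1, dim)                         # layer-norm-like
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))      # pointwise MLP
    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))                # residual block

class StageCDenoiser(nn.Module):
    def __init__(self, latent_ch=16, dim=1024, text_dim=1024, n_blocks=16):
        super().__init__()
        self.inp = nn.Conv2d(latent_ch, dim, 1)
        self.blocks = nn.ModuleList(ConvNeXtBlock(dim) for _ in range(n_blocks))
        self.t_emb = nn.Linear(1, dim)         # toy timestep embedding (illustrative)
        self.c_emb = nn.Linear(text_dim, dim)  # toy text-condition projection
        self.out = nn.Conv2d(dim, latent_ch, 1)

    def forward(self, x, t, c):                # x: noised latent, t: (B, 1), c: (B, text_dim)
        h = self.inp(x) + self.t_emb(t)[..., None, None] + self.c_emb(c)[..., None, None]
        for blk in self.blocks:
            h = blk(h)
        return self.out(h)                     # de-noised latent
```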
Experiments & Evaluation
Experiments
Baseline:
Evaluation Metrics: automated image-quality scores and human preference (see Results)
Dataset: COCO-30K for automated evaluation
Results
Automated Text-to-Image Evaluation
[Figure: image quality scores; comparison of different models on COCO-30K.]
[Figures: Collage 1 and Collage 2, qualitative samples.]
Human Preference Evaluation
[Figure: human preference study results.]
Thank you