1 of 38

WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE TEXT-TO-IMAGE DIFFUSION MODELS

Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville

2 of 38

Diffusion models

State-of-the-art (SOTA) generative models for image generation

3 of 38

Neural networks that model the diffusion process

Forward diffusion process

Starts from an image and gradually adds noise until only Gaussian noise remains

Reverse diffusion process

Starts from Gaussian noise and gradually denoises it to recover an image

4 of 38

Learning the reverse diffusion process

Train a neural network to predict the score function at a given timestep (proportional to the added noise)
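The training objective above can be sketched as follows. The network here is a zero-predicting stand-in (a real model would be a U-Net), and the linear beta schedule is the common DDPM default, not something stated on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM setup: linear beta schedule and cumulative alpha-bar terms.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    """Stand-in for the trained network eps_theta(x_t, t); a real model
    would be a U-Net predicting the noise that was added."""
    return np.zeros_like(x_t)

def training_loss(x0):
    """One training step: noise the image in closed form, then score the
    network on recovering the added noise (MSE)."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    # Forward process in closed form: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((predict_noise(x_t, t) - eps) ** 2)

loss = training_loss(rng.standard_normal((8, 8)))
```

Sampling a random timestep per example is what lets a single network learn the whole noise-level range.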

5 of 38

Inference Process

[Figure: the reverse diffusion trajectory from pure Gaussian noise (t = 1000) through intermediate timesteps (t = 200, 100, 20, 10) to the final image (t = 0); at each step the predicted noise is removed.]

6 of 38

Inference Process

The noise at each timestep is predicted by a U-Net.
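The inference loop can be sketched with standard DDPM ancestral sampling; the noise predictor below is a placeholder for the trained U-Net, and the schedule is the usual linear default rather than anything slide-specific:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Stand-in for the U-Net eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def sample(shape):
    """Ancestral sampling: start from pure Gaussian noise at t = T-1 and
    remove the predicted noise step by step down to t = 0."""
    x = rng.standard_normal(shape)           # pure Gaussian noise
    for t in range(T - 1, -1, -1):
        eps = predict_noise(x, t)
        # Posterior mean of x_{t-1} given the predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                            # re-inject noise except at the end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

img = sample((8, 8))
```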

7 of 38

How do we get the image that we want?

We use classifier-free guidance.

[Diagram: the neural network takes the noised image x, the timestep t, and the condition c, and predicts the noise; w is the guidance scale.]
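Classifier-free guidance combines a conditional and an unconditional prediction; a minimal sketch, with a toy stand-in for the noise predictor:

```python
import numpy as np

def predict_noise(x, t, c):
    """Toy stand-in for the conditional noise predictor eps_theta(x, t, c);
    c=None means the unconditional (null-prompt) branch."""
    bias = 0.0 if c is None else 0.1
    return np.full_like(x, bias)

def guided_noise(x, t, c, w):
    """Classifier-free guidance: push the prediction away from the
    unconditional branch and toward the conditional one:
    eps = eps(x, t, None) + w * (eps(x, t, c) - eps(x, t, None))"""
    eps_uncond = predict_noise(x, t, None)
    eps_cond = predict_noise(x, t, c)
    return eps_uncond + w * (eps_cond - eps_uncond)

x = np.zeros((4, 4))
eps = guided_noise(x, t=10, c="a cat", w=3.0)
```

With w = 1 this reduces to the plain conditional prediction; larger w trades diversity for adherence to the prompt.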

8 of 38

Training is costly (and so is inference)

BigGAN-deep - Generative adversarial network (256 x 256) = 128 V100 days

ADM-G (4360K) - Diffusion model in Image-space (256 x 256) = 962 V100 days [source]

9 of 38

Latent Diffusion models

Models that run the diffusion process in a lower-dimensional latent space

10 of 38

Addresses the problem of high compute

Uses perceptual compression to reduce the computational burden of diffusion models: a VQ-GAN-based autoencoder maps images into a compact latent space.

11 of 38

Addresses the problem of high compute

Pipeline: text encoder + VQ-GAN-based encoder → U-Net denoiser (in latent space) → VQ-GAN-based decoder
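The compute saving can be seen in a toy version of the pipeline; the average-pool "encoder", nearest-neighbour "decoder", and identity "denoiser" are placeholders for the VQ-GAN and U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the VQ-GAN autoencoder: 8x spatial downsampling via
# average pooling, nearest-neighbour upsampling as the "decoder".
def encode(img):            # (H, W) -> (H/8, W/8)
    h, w = img.shape
    return img.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def decode(lat):            # (h, w) -> (8h, 8w)
    return np.repeat(np.repeat(lat, 8, axis=0), 8, axis=1)

def denoise(lat, text):     # stand-in for the text-conditioned U-Net
    return lat

# Diffusion runs on the 8x-smaller latent: each denoiser call touches
# 64x fewer spatial positions than image-space diffusion would.
img = rng.standard_normal((256, 256))
lat = encode(img)
out = decode(denoise(lat, "a photo of a cat"))
```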

12 of 38

Reduction in Training time

BigGAN-deep - Generative adversarial network (256 x 256) = 128 V100 days

ADM-G (4360K) - Diffusion model in Image-space (256 x 256) = 962 V100 days

LDM - Diffusion model in latent space (256 x 256) = 79 V100 days [source]

13 of 38

How can we go even further?

Reducing the compute budget through spatial compression is limited by how much we can compress without degrading reconstructions; input size matters for CNNs.

Solution

Condition the diffusion process on a strongly compressed image latent instead of using only pre-trained language encodings

Result

A novel three-stage architecture achieving a 42:1 compression ratio

14 of 38

Method

Three stage architecture

15 of 38

Stages

Stage A: compresses images into the latent space of a VQ-GAN (f4)

Stage B: a latent diffusion process conditioned on the outputs of a Semantic Compressor and on text embeddings

Stage C: a text-conditional LDM trained in the strongly compressed space of the Semantic Compressor

16 of 38

During Inference

Stage C → Stage B → Stage A

17 of 38

During inference (Stage C)

[Diagram: random noise + text embeddings → text-conditioned LDM]

18 of 38

During inference (Stage C)

Over 60 sampling steps, the text-conditioned LDM maps random noise, guided by the text embeddings, to the compressed semantic latent.

19 of 38

During Inference (Stage B)

[Diagram: compressed semantic latent + text embedding + random latent → denoising U-Net]

20 of 38

During Inference (Stage B)

Over 12 sampling steps, the denoising U-Net, conditioned on the compressed semantic latent and the text embedding, turns a random latent into the denoised latent.

21 of 38

During Inference (Stage A)

The VQ-GAN decoder maps the denoised latent to the final image.
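The three inference stages can be sketched end to end. Everything here is a placeholder: the tensor shapes, the 0.99-scaling "denoising steps", and the pooling decoder are illustrative assumptions, not the paper's actual dimensions or networks; only the stage order and step counts (60 and 12) come from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def stage_c(text_emb, steps=60):
    """Text-conditioned LDM: random noise -> compressed semantic latent."""
    x = rng.standard_normal((16, 24, 24))
    for _ in range(steps):
        x = x * 0.99                        # placeholder for a denoising step
    return x

def stage_b(semantic, text_emb, steps=12):
    """Denoising U-Net conditioned on the semantic latent and the text:
    random latent -> denoised VQ-GAN latent."""
    x = rng.standard_normal((4, 256, 256))
    for _ in range(steps):
        x = x * 0.99                        # placeholder for a denoising step
    return x

def stage_a(latent):
    """VQ-GAN decoder stand-in: latent -> image (4x upsampling for f4)."""
    return np.repeat(np.repeat(latent.mean(axis=0), 4, axis=0), 4, axis=1)

text_emb = rng.standard_normal((77, 1024))
semantic = stage_c(text_emb)              # 60 steps
latent = stage_b(semantic, text_emb)      # 12 steps
image = stage_a(latent)                   # single decoder pass
```

Note the asymmetry: most of the sampling work (60 steps) happens in the tiniest space, which is where the efficiency comes from.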

22 of 38

During Training

Stage A → Stage B → Stage C

23 of 38

During Training (Stage A)

Reconstruction objective

25 of 38

During Training (Stage B)

Stage B is trained with the standard diffusion loss.
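The diffusion loss here is presumably the standard noise-prediction objective, written with a condition c standing in for the semantic latent and text embedding:

```latex
\mathcal{L} \;=\; \mathbb{E}_{x_0,\;\epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \left\| \epsilon \;-\; \epsilon_\theta\!\left(
\sqrt{\bar\alpha_t}\, x_0 \,+\, \sqrt{1 - \bar\alpha_t}\,\epsilon,\; t,\; c
\right) \right\|_2^2 \right]
```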

26 of 38

During Training (Stage C)

27 of 38

Architecture

28 of 38

Semantic Compressor

Images (1024 × 1024) are resized to 768 × 768 and passed through the Semantic Compressor (SC) to obtain the compressed semantic representation. Any feature extractor can serve as the SC; here an EfficientNet-V2, initialized with ImageNet-pretrained weights, is used.
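The resize-then-compress step can be sketched as below; the nearest-neighbour resize and the 32x average-pooling "feature extractor" are crude placeholders for the real interpolation and the EfficientNet-V2 backbone:

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbour resize (a real pipeline would interpolate)."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def semantic_compressor(img):
    """Placeholder for the EfficientNet-V2 feature extractor: 32x average
    pooling stands in for its stack of downsampling stages."""
    h, w = img.shape
    return img.reshape(h // 32, 32, w // 32, 32).mean(axis=(1, 3))

img = np.random.default_rng(0).standard_normal((1024, 1024))
small = resize_nn(img, 768, 768)         # 1024 x 1024 -> 768 x 768
semantic = semantic_compressor(small)    # strongly compressed representation
```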

29 of 38

Stage C - LDM

The denoiser consists of 16 ConvNeXt blocks; it takes the noised latent, the timestep t, and the text condition as inputs and outputs the de-noised latent.

30 of 38

Experiments & Evaluation

31 of 38

Experiments

Baseline:

  • Trained a 1B-parameter U-Net-based LDM on the SD 2.1 first stage and text-conditioning model
  • Evaluated against GALIP, SD 1.4, SD 2.1, SDXL, and DF-GAN
  • Inference time

Evaluation metrics:

  • FID and IS (inconsistent with human judgment)
  • PickScore as the primary automated metric

Datasets:

  • COCO-30K
  • Localized Narratives COCO-5K
  • Parti-prompts

32 of 38

Results

33 of 38

Automated Text-to-Image Evaluation

  • PickScore

[Chart: image quality score (PickScore) across models]

34 of 38

Automated Text-to-Image Evaluation

  • FID and IS scores
  • Efficiency: inference time, and quality with fewer training samples

Comparison of different models on COCO-30K

35 of 38

Collage 1

Collage 2

36 of 38

Human Preference Evaluation

  • Compared SD 2.1 to Würstchen
  • 30,000 images: prompts from the MS-COCO validation set
  • 1,633 images: Parti-prompts

37 of 38

Human Preference Evaluation

38 of 38

Thank you
