
Lecture 9:

Generative Models II

Sookyung Kim

Spring 2025


Taxonomy of Generative Models

  • Generative models
    • Explicit density
      • Tractable density: Fully Visible Belief Nets (PixelRNN/CNN)
      • Approximate density
        • Variational: Variational Autoencoders
        • Stochastic: Boltzmann Machine
    • Implicit density
      • Direct: Generative Adversarial Networks (GAN) ← Today
      • Stochastic: Generative Stochastic Networks (GSN)

Ian Goodfellow, Tutorial on Generative Adversarial Networks https://arxiv.org/abs/1701.00160


Generative Adversarial Networks (GANs)


Generative Adversarial Networks: Implementation


Generative Adversarial Networks: Applications

Obvious (2018), built on Robbie Barrat's GAN code

Sold for $432,500 at auction!


Generative Adversarial Networks

Recall that our problem is sampling images from a complex, high-dimensional training distribution p_data(x).

  • Instead of modeling p_data(x) explicitly, GANs use an additional network to tell whether a generated image lies within the data distribution or not.
  • The training signal (Real/Fake) does not require human labeling (unsupervised).
    • This signal trains both networks, the Generator (G) and the Discriminator (D).

[Diagram: latent z → Generator → G(z) (fake image); real image x; Discriminator D(·) → Real/Fake]


Generative Adversarial Networks: Discriminator

  • Input: a real image from the training data, or a fake image produced by the Generator (G)
  • Output: the probability that the input is a real image (via sigmoid)
  • Objective: real vs. fake binary classification (cross-entropy), written out below
    • Note that there is no (-) sign at the front, so we maximize this rather than minimize it.
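Written out (the standard form from Goodfellow et al., 2014), the Discriminator's objective is:

$$\max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$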

[Diagram: latent z → Generator → G(z); real image x; Discriminator D(·) → p(real)]

For a real image x, the discriminator wants to output a high p(real): D(x) → 1.

For an image generated by G, the discriminator wants to output a low p(real): D(G(z)) → 0.


Generative Adversarial Networks: Generator

  • Input: a random vector z sampled from the prior p(z)
  • Output: a realistic image (resembling the training data)
  • Objective: fooling the Discriminator (written out below)
    • G tries to make it hard for D to distinguish real from fake by generating realistic images.
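In the original minimax formulation, the Generator's objective is the flip side of the Discriminator's second term:

$$\min_{\theta_g} \; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$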

[Diagram: latent z → Generator → G(z) → Discriminator D(·) → p(real); real image x]

The generator plays no role when D takes a real example.

When G generates an image, its goal is to raise D's p(real) output by producing a realistic image: D(G(z)) → 1.


Generative Adversarial Networks: Objective Function

  • Putting it all together, training is done jointly as a minimax game!
    • One player tries to maximize the objective function, while the other tries to minimize it.
  • Alternately solve:
    • Gradient ascent for the Discriminator
    • Gradient descent for the Generator
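For reference, the full minimax objective (Goodfellow et al., 2014):

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$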


Generative Adversarial Networks: Overall Algorithm
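A minimal PyTorch-style sketch of the alternating updates, assuming a generator G, a discriminator D with sigmoid output, and their optimizers (all names are illustrative, not the paper's reference code). The generator update uses the non-saturating loss discussed on the next slide.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim, k=1):
    """One GAN training iteration: k discriminator steps, then one generator step."""
    batch = real.size(0)

    for _ in range(k):
        # Discriminator: gradient ascent on log D(x) + log(1 - D(G(z)))
        # (implemented as descent on the equivalent binary cross-entropy).
        z = torch.randn(batch, z_dim, device=real.device)
        fake = G(z).detach()                      # block gradients into G here
        d_real, d_fake = D(real), D(fake)
        loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
               + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: non-saturating loss, i.e. maximize log D(G(z)).
    z = torch.randn(batch, z_dim, device=real.device)
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```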


Generative Adversarial Networks: Practical Concerns

  • The Generator suffers from a vanishing gradient problem.
    • At the beginning of training, the Generator works poorly.
    • So the Discriminator likely assigns a low probability, D(G(z)) close to 0.
    • The gradient of log(1 − D(G(z))) is nearly flat when D(G(z)) is low! 😨
    • This gradient is the only signal we have to improve G. 😭
    • So training G makes little progress. 😱
  • So, what's the solution?
    • We solve the problem below instead (the "non-saturating" loss; see the formula after this list).
    • With this, the gradient is larger when D(G(z)) is small. 😀
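Written out, the original and non-saturating generator losses are:

$$\text{original: } \min_{\theta_g} \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big] \qquad \text{non-saturating: } \max_{\theta_g} \mathbb{E}_{z}\big[\log D(G(z))\big]$$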

[Plot: generator loss as a function of D(G(z)); log(1 − D(G(z))) is flat near D(G(z)) = 0, while −log D(G(z)) is steep there]


Generative Adversarial Networks: Implementation

  • One thing still missing: what model do we use for G and D?
    • The original GAN paper used fully-connected networks (!) for most experiments.
    • It tried a deconv (G) and conv (D) combination for the CIFAR-10 experiment.

[Figures: samples from the FC model trained on CIFAR-10; samples from the deconv-conv model trained on CIFAR-10; closest training examples]


Deep Convolutional GAN (DCGAN)

  • Problems with the vanilla GAN:
    • Unstable to train
    • Black-box method: no explanation or understanding of learned features
    • No evaluation criteria for generated images: is the model really generating new images, or only making small modifications to memorized ones?
  • Suggestions after scrutinizing the Generator and Discriminator architectures (a generator sketch follows this list):
    • Replace any pooling layers with strided convolutions in the Discriminator and fractional-strided convolutions (deconvolutions) in the Generator.
    • Use batch normalization in both the Generator and the Discriminator.
    • Remove fully connected hidden layers for faster convergence.
    • Use ReLU activation in the Generator for all layers except the output, which uses Tanh.
    • Use Leaky ReLU activation in the Discriminator for all layers.
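As an illustration of these guidelines, here is a minimal PyTorch sketch of a DCGAN-style generator. It is a sketch only: channel widths are assumptions, and the 4×4 kernels follow the common PyTorch implementation rather than the 5×5 kernels shown in the architecture figure.

```python
import torch.nn as nn

# DCGAN-style generator: project z with a transposed conv, then upsample with
# fractional-strided convolutions; BatchNorm + ReLU everywhere except the
# Tanh output. No pooling, no fully connected hidden layers.
class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),   # 1x1 -> 4x4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),  # 4x4 -> 8x8
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),  # 8x8 -> 16x16
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),      # 16x16 -> 32x32
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False),           # 32x32 -> 64x64
            nn.Tanh(),                                                # pixels in [-1, 1]
        )

    def forward(self, z):          # z: (batch, z_dim, 1, 1)
        return self.net(z)
```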


Deep Convolutional GAN (DCGAN): Architecture

[Architecture: Generator: four 5×5 deconv (fractional-strided) layers, stride 2, progressively upsampling the projected z to the output image. Discriminator: four 5×5 conv layers, stride 2, progressively downsampling the input image to a real/fake score]


Deep Convolutional GAN (DCGAN): Interpretability

  • Vector arithmetic in z space

[Figure: generated samples for z1, z2, z3, and for their average (z1 + z2 + z3) / 3]


Deep Convolutional GAN (DCGAN): Interpretability

  • Walking in the latent space

[Figure: two rows of generated images from latent-space interpolations]

“Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom.” -- Fig. 4 in the paper


Wasserstein GAN


Limitation of GANs: Unstable Training

  • Recall that the GAN objective function is defined by how well the Discriminator distinguishes real vs. fake, not by the quality of the generated images.
  • So it is hard to tell from the loss whether the Generator is well trained or not.
  • This follows from the obvious fact that training the Generator (creating an image from nothing) is much harder than training the Discriminator (just grading).


Limitation of GANs: Unstable Training

  • Mathematically, it can be shown that minimizing the GAN objective with an optimal discriminator is equivalent to minimizing the JS divergence.
    • JS divergence is a symmetrized form of KL divergence (see below).
  • However, KL divergence has a ≈ 0 gradient when P and Q barely overlap.
  • Thus, when the Generator does not work well (e.g., at the initial stage), it learns almost nothing from the Discriminator's signal.
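With $M = \tfrac{1}{2}(P + Q)$:

$$\text{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\text{KL}(P \,\|\, M) + \tfrac{1}{2}\,\text{KL}(Q \,\|\, M)$$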


Limitation of GANs: Mode Collapse

  • In most cases, the true image space is highly multimodal.
    • Ex) MNIST: 10 modes, one for each digit 0, 1, …, 9
  • Mode collapse: the phenomenon where the Generator concentrates on producing samples lying on only a few modes instead of covering the whole data space.


Wasserstein GAN

  • Instead of JS divergence, WGAN proposes to minimize the Earth Mover (EM) distance.
  • Deriving from its definition (math omitted), we get the objective function below, similar to the GAN's:
    • Unlike the Discriminator in the original GAN, which outputs [0, 1] through a sigmoid activation, f_w has no output bound.
    • It outputs larger values for more real-looking inputs and smaller values for more fake-looking ones.
    • Due to a mathematical condition of the EM distance, f_w must satisfy the 1-Lipschitz continuity condition. → The weights w are clipped to (-c, c) for some hyperparameter c.
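Written out (from the WGAN paper), with the critic $f_w$ constrained to be 1-Lipschitz:

$$\min_G \max_{w \,:\, f_w \text{ 1-Lipschitz}} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[f_w(x)\big] - \mathbb{E}_{z \sim p(z)}\big[f_w(G(z))\big]$$

In code, the clipping step after each critic update is one line (PyTorch sketch, assuming a critic module f and threshold c):

```python
for p in f.parameters():
    p.data.clamp_(-c, c)   # enforce w in (-c, c)
```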


Wasserstein GAN

[Diagram: GAN: z → Generator → G(z), real image x, Discriminator D(·) → p(real). WGAN: z → Generator → G(z), real image x, Critic f(·) → score(real), which the Critic maximizes for real inputs]


Wasserstein GAN: Results

  • The objective function better reflects the quality of generated images.
  • According to the authors, no mode collapse was observed with WGAN!
    • No explicit experiment was conducted, though… 😏


Wasserstein GAN: Limitations

  • Model performance is very sensitive to the hyperparameter c.
    • The weight clipping behaves as a weight regularizer.
    • It reduces the capacity of the model f, limiting its ability to model complex patterns.

[Figure: gradients explode or vanish depending on the clipping threshold c]


Wasserstein GAN with Gradient Penalty (WGAN-GP)

  • Instead of clipping the weights, WGAN-GP penalizes the model if the gradient norm moves away from its target value of 1 (see the sketch below).
    • This uses the fact that a differentiable function is 1-Lipschitz if and only if its gradients have norm at most 1 everywhere.
  • With a more stable training algorithm, WGAN-GP opened the path to stronger GAN models on large-scale image datasets.
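The penalty term is $\lambda\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2\big]$, where $\hat{x}$ is sampled along straight lines between real and fake samples. A minimal PyTorch sketch (the critic f and the batch shapes are assumptions):

```python
import torch

def gradient_penalty(f, real, fake, lam=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 on random
    interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mix
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = f(x_hat)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,            # the penalty itself must be differentiable
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
```

This term is added to the critic loss in place of weight clipping.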


Wasserstein GAN with Gradient Penalty (WGAN-GP)

[Figure: WGAN with weight clipping vs. WGAN-GP]


GANs for Image-to-Image Translation


Image Translation

  • The task of transforming images from one domain so they take on the style or characteristics of images from another domain, while maintaining their semantics.


Pix2pix

  • Problem setting: each training example is a pair of images (x, y), where we transform domain X → domain Y.
    • Training set: {(x1, y1), (x2, y2), …, (xN, yN)}
  • Pix2pix main idea:
    • The Generator G translates the image x into the style of domain Y, to fool the Discriminator.
    • The Discriminator D classifies whether the input image is real or fake.
    • Unlike a regular GAN, the Discriminator takes a pair of images, one from the source domain and the other from the target domain.

[Figure: a training pair x ∈ X, y ∈ Y]


Pix2pix: Objective Function

Adversarial Loss:
  • D tries to maximize it by assigning high scores (≈1) to real images y.
  • D tries to maximize it by assigning low scores (≈0) to images generated from x.
  • G tries to minimize it by fooling D into assigning high scores (≈1) to images generated from x.

Reconstruction Loss:
  • G tries to minimize it by creating an image as similar as possible to the ground-truth pair y.
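Written out (Isola et al., 2017; the noise input z is omitted for simplicity):

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]$$

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\|y - G(x)\|_1\big] \qquad G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)$$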


Pix2pix: Implementation Details

  • Model architectures (a discriminator sketch follows this list)
    • Generator: U-Net (https://arxiv.org/pdf/1505.04597.pdf)
      • Encoder-decoder (DeconvNet) with skip connections to reduce information loss
    • Discriminator: PatchGAN
      • Classifies real vs. fake on N × N image patches, instead of the entire image.
      • The loss is back-propagated patch by patch, providing detailed feedback to the Generator.
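A minimal PyTorch sketch of a PatchGAN-style discriminator, assuming the pix2pix setting where the source and target images are concatenated along the channel axis (widths and depth are assumptions, loosely following the 70 × 70 variant):

```python
import torch.nn as nn

# PatchGAN discriminator: instead of one scalar for the whole image, it outputs
# a grid of real/fake logits, one per receptive-field patch.
class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=6, ch=64):        # 6 = source + target, concatenated
        super().__init__()
        def block(cin, cout, stride):
            return [nn.Conv2d(cin, cout, 4, stride, 1),
                    nn.BatchNorm2d(cout),
                    nn.LeakyReLU(0.2, True)]
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, True),
            *block(ch, ch * 2, 2),
            *block(ch * 2, ch * 4, 2),
            *block(ch * 4, ch * 8, 1),
            nn.Conv2d(ch * 8, 1, 4, 1, 1),     # one logit per patch
        )

    def forward(self, x):                      # x: concat(source, target) on channels
        return self.net(x)                     # (batch, 1, H', W') patch logit map
```

Each cell of the output map is trained with the usual real/fake loss, which is how the loss is back-propagated patch by patch.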


Pix2pix: Examples

[Example figures: colorization; apps built on pix2pix; aerial photos to/from Google Maps]


Pix2pix: Summary

  • Contributions
    • Achieved photo-realistic image-to-image mapping.
      • Mainly thanks to the adversarial loss.
    • Proposed PatchGAN to optimize performance.
    • Applicable to various image-to-image translation tasks.
  • Limitations
    • Requires paired images for training.
      • This is a serious limitation, as most style transfer tasks do not have paired data.
      • E.g., photos → Monet-style
  • https://arxiv.org/pdf/1611.07004v3.pdf


CycleGAN

  • Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks:
    • Overcomes the main drawback of Pix2pix: it no longer requires paired images for training.
  • Main idea:
    • In this problem setting, we don't have paired images (x, y); we just have a set of images {x1, x2, …, xM} from domain X and another set {y1, y2, …, yN} from domain Y.
    • Since we do not have paired images, we rely on 2 losses:
      • When we transfer an image x into the style of Y, we'd like the resulting image G(x) to look like a target-domain (Y) image. → Adversarial Loss
      • When we transfer G(x) back to domain X using another Generator F, we'd like F(G(x)) to be the same as the original x. → Cycle Consistency Loss
      • We apply this idea symmetrically, both for x→y and y→x.
    • Without the cycle consistency loss, the generated images would lose their semantics, merely adapting to the target domain (e.g., creating a map-like image from an aerial photo, but with incorrect roads).


CycleGAN

[Diagram: Domain X (horse) ↔ Domain Y (zebra). Forward: x → Gx→y → Dy (Real vs. Fake, Adversarial Loss), then Gy→x(Gx→y(x)) compared to x (Cycle-consistency Loss). Backward: y → Gy→x → Dx (Real vs. Fake, Adversarial Loss), then Gx→y(Gy→x(y)) compared to y (Cycle-consistency Loss). Legend: real images from the training set vs. images generated by the model]


CycleGAN: Objective Function

Adversarial Loss (same as the original GAN):
  • Dy tries to maximize it by assigning high scores (≈1) to real images y.
  • Dy tries to maximize it by assigning low scores (≈0) to images generated from x.
  • Gx→y tries to minimize it by fooling Dy into assigning high scores (≈1) to images generated from x.

Cycle-Consistency Loss:
  • Both generators try to minimize it by reconstructing an image as similar as possible to the original image x.

This covers x→y→x. There is a mirrored version for y→x→y, and the overall loss is the sum of both.
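Written out (Zhu et al., 2017), for the x→y direction plus the cycle term:

$$\mathcal{L}_{GAN}(G_{x \to y}, D_y) = \mathbb{E}_{y}\big[\log D_y(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_y(G_{x \to y}(x))\big)\big]$$

$$\mathcal{L}_{cyc} = \mathbb{E}_{x}\big[\|G_{y \to x}(G_{x \to y}(x)) - x\|_1\big] + \mathbb{E}_{y}\big[\|G_{x \to y}(G_{y \to x}(y)) - y\|_1\big]$$

$$\mathcal{L} = \mathcal{L}_{GAN}(G_{x \to y}, D_y) + \mathcal{L}_{GAN}(G_{y \to x}, D_x) + \lambda\,\mathcal{L}_{cyc}$$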


CycleGAN: Implementation Details

  • Model architectures
    • Generator: ResNet
      • Thanks to the residual connections, information loss is minimized.
    • Discriminator: 70 × 70 PatchGAN (same as pix2pix)
    • Loss: also tried LSGAN (Least-Squares GAN)
      • Instead of cross-entropy, the loss is defined as a squared loss (written out below).
      • This is known to be effective in avoiding vanishing gradients.
    • Identity loss (for some tasks):
      • Feed y to Gx→y and encourage G to leave it unchanged (and vice versa).
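Written out, the LSGAN losses (with 1/0 labels for real/fake) and the identity loss are:

$$\min_{D} \; \tfrac{1}{2}\,\mathbb{E}_{y}\big[(D(y) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(G(x))^2\big] \qquad \min_{G} \; \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(G(x)) - 1)^2\big]$$

$$\mathcal{L}_{identity} = \mathbb{E}_{y}\big[\|G_{x \to y}(y) - y\|_1\big] + \mathbb{E}_{x}\big[\|G_{y \to x}(x) - x\|_1\big]$$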


CycleGAN: Examples


CycleGAN: Examples

Smartphone photos → DSLR photos

Monet → photos

Photo → Monet, Van Gogh, Cézanne, Ukiyo-e


CycleGAN: Summary

  • Contributions
    • Achieved style transfer without paired images.
    • Produced high-resolution images.
    • Works well for style transfer.
  • Limitations
    • Training is slow due to the large networks.
    • Poor performance when the shape needs to change (e.g., cat → dog, apple → orange).
  • https://arxiv.org/pdf/1703.10593.pdf


DiscoGAN

  • Discovery-GAN
  • Exactly the same idea as CycleGAN, developed concurrently by SK T-Brain.
    • Same idea, same loss function.
  • The main difference is in the experimental settings:
    • Focuses more on dynamic shape changes, rather than style transfer.
    • Mainly due to architectural and dataset differences.


DiscoGAN: Implementation Details

  • Model architectures
    • Generator: encoder-decoder structure
      • Similar to DeconvNet.
      • Suffers from information loss → hard to create high-resolution images.
        • Produced 64 × 64 images only (as opposed to 512 × 512 in CycleGAN)
      • At the cost of photo-realism, it becomes more flexible with shape changes.
    • Discriminator: DCGAN
    • Loss: almost equivalent to CycleGAN's
      • Used L2 loss for cycle-consistency, instead of L1.
  • Overall, it focuses more on domain transfer with larger shape changes.
    • Faster training thanks to the simpler network structure.
    • Because of this, the output resolution is lower than CycleGAN's.


DiscoGAN: Examples

Chair to Car

Car to Face

Handbag to Shoes

Shoes to Handbag
