
Lecture 9:

Generative Models II

Sookyung Kim

Spring 2025


Taxonomy of Generative Models

  • Generative models
    • Explicit density
      • Tractable density: Fully Visible Belief Nets (PixelRNN/CNN)
      • Approximate density
        • Variational: Variational Autoencoders
        • Stochastic: Boltzmann Machine
    • Implicit density
      • Direct: Generative Adversarial Networks (GAN) ← Today
      • Stochastic: Generative Stochastic Networks (GSN)

Ian Goodfellow, Tutorial on Generative Adversarial Networks https://arxiv.org/abs/1701.00160


Generative Adversarial Networks (GANs)


Generative Adversarial Networks: Implementation


Generative Adversarial Networks: Applications

Obvious (2018), built on Robbie Barrat's GAN code

Sold for $432,500 at auction!


Generative Adversarial Networks

Recall that our problem is sampling images from a complex, high-dimensional training distribution p_data(x).

  • Instead of modeling p_data(x) explicitly, GANs use an additional network to tell whether a generated image lies within the data distribution or not.
  • The training signal (Real/Fake) does not require human labeling (unsupervised).
    • This signal trains both networks, the Generator (G) and the Discriminator (D).

[Diagram: latent z → Generator → G(z) (fake image); real image x; Discriminator D(·) → Real/Fake]


Generative Adversarial Networks: Discriminator

  • Input: a real image from the training data, or a fake image produced by the Generator (G)
  • Output: the probability that the input is a real image (via sigmoid)
  • Objective: real vs. fake binary classification (cross-entropy), written out below
    • Note that there is no (-) sign at the front, so we maximize this rather than minimize it.
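Written out (the standard form from Goodfellow et al., 2014), the Discriminator's objective is:

$$\max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$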

[Diagram: latent z → Generator → G(z); real image x; Discriminator D(·) → p(real)]

For a real image x, the discriminator wants to output a high p(real): D(x) → 1.

For an image generated by G, the discriminator wants to output a low p(real): D(G(z)) → 0.


Generative Adversarial Networks: Generator

  • Input: a random vector z sampled from the prior p(z)
  • Output: a realistic image (resembling the training data)
  • Objective: fooling the Discriminator (written out below)
    • G tries to make it hard for D to distinguish real from fake by generating realistic images.
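In the original minimax formulation, the Generator's objective is the flip side of the Discriminator's second term:

$$\min_{\theta_g} \; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$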

[Diagram: latent z → Generator → G(z) → Discriminator D(·) → p(real); real image x]

The generator plays no role when D takes a real example.

When G generates an image, its goal is to raise D's p(real) output by producing a realistic image: D(G(z)) → 1.


Generative Adversarial Networks: Objective Function

  • Putting it all together, training is done jointly as a minimax game!
    • One player tries to maximize the objective function, while the other tries to minimize it.
  • Alternately solve:
    • Gradient ascent for the Discriminator
    • Gradient descent for the Generator
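For reference, the full minimax objective (Goodfellow et al., 2014):

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$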


Generative Adversarial Networks: Overall Algorithm
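A minimal PyTorch-style sketch of the alternating updates, assuming a generator G, a discriminator D with sigmoid output, and their optimizers (all names are illustrative, not the paper's reference code). The generator update uses the non-saturating loss discussed on the next slide.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim, k=1):
    """One GAN training iteration: k discriminator steps, then one generator step."""
    batch = real.size(0)

    for _ in range(k):
        # Discriminator: gradient ascent on log D(x) + log(1 - D(G(z)))
        # (implemented as descent on the equivalent binary cross-entropy).
        z = torch.randn(batch, z_dim, device=real.device)
        fake = G(z).detach()                      # block gradients into G here
        d_real, d_fake = D(real), D(fake)
        loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
               + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: non-saturating loss, i.e. maximize log D(G(z)).
    z = torch.randn(batch, z_dim, device=real.device)
    d_fake = D(G(z))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```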


Generative Adversarial Networks: Practical Concerns

  • The Generator suffers from a vanishing gradient problem.
    • At the beginning of training, the Generator works poorly.
    • So the Discriminator likely assigns a low probability, D(G(z)) close to 0.
    • The gradient of log(1 − D(G(z))) is nearly flat when D(G(z)) is low! 😨
    • This gradient is the only signal we have to improve G. 😭
    • So training G makes little progress. 😱
  • So, what's the solution?
    • We solve the problem below instead (the "non-saturating" loss; see the formula after this list).
    • With this, the gradient is larger when D(G(z)) is small. 😀
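Written out, the original and non-saturating generator losses are:

$$\text{original: } \min_{\theta_g} \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big] \qquad \text{non-saturating: } \max_{\theta_g} \mathbb{E}_{z}\big[\log D(G(z))\big]$$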

[Plot: generator loss as a function of D(G(z)); log(1 − D(G(z))) is flat near D(G(z)) = 0, while −log D(G(z)) is steep there]


Generative Adversarial Networks: Implementation

  • One thing still missing: what model do we use for G and D?
    • The original GAN paper used fully-connected networks (!) for most experiments.
    • It tried a deconv (G) and conv (D) combination for the CIFAR-10 experiment.

[Figures: samples from the FC model trained on CIFAR-10; samples from the deconv-conv model trained on CIFAR-10; closest training examples]


Deep Convolutional GAN (DCGAN)

  • Problems with the vanilla GAN:
    • Unstable to train
    • Black-box method: no explanation or understanding of learned features
    • No evaluation criteria for generated images: is the model really generating new images, or only making small modifications to memorized ones?
  • Suggestions after scrutinizing the Generator and Discriminator architectures (a generator sketch follows this list):
    • Replace any pooling layers with strided convolutions in the Discriminator and fractional-strided convolutions (deconvolutions) in the Generator.
    • Use batch normalization in both the Generator and the Discriminator.
    • Remove fully connected hidden layers for faster convergence.
    • Use ReLU activation in the Generator for all layers except the output, which uses Tanh.
    • Use Leaky ReLU activation in the Discriminator for all layers.
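As an illustration of these guidelines, here is a minimal PyTorch sketch of a DCGAN-style generator. It is a sketch only: channel widths are assumptions, and the 4×4 kernels follow the common PyTorch implementation rather than the 5×5 kernels shown in the architecture figure.

```python
import torch.nn as nn

# DCGAN-style generator: project z with a transposed conv, then upsample with
# fractional-strided convolutions; BatchNorm + ReLU everywhere except the
# Tanh output. No pooling, no fully connected hidden layers.
class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),   # 1x1 -> 4x4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),  # 4x4 -> 8x8
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),  # 8x8 -> 16x16
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),      # 16x16 -> 32x32
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False),           # 32x32 -> 64x64
            nn.Tanh(),                                                # pixels in [-1, 1]
        )

    def forward(self, z):          # z: (batch, z_dim, 1, 1)
        return self.net(z)
```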


Deep Convolutional GAN (DCGAN): Architecture

[Architecture: Generator: four 5×5 deconv (fractional-strided) layers, stride 2, progressively upsampling the projected z to the output image. Discriminator: four 5×5 conv layers, stride 2, progressively downsampling the input image to a real/fake score]


Deep Convolutional GAN (DCGAN): Interpretability

  • Vector arithmetic in z space

[Figure: generated samples for z1, z2, z3, and for their average (z1 + z2 + z3) / 3]


Deep Convolutional GAN (DCGAN): Interpretability

  • Walking in the latent space

[Figure: two rows of generated images from latent-space interpolations]

“Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom.” -- Fig. 4 in the paper


Wasserstein GAN


Limitation of GANs: Unstable Training

  • Recall that the GAN objective function is defined by how well the Discriminator distinguishes real vs. fake, not by the quality of the generated images.
  • So it is hard to tell from the loss whether the Generator is well trained or not.
  • This follows from the obvious fact that training the Generator (creating an image from nothing) is much harder than training the Discriminator (just grading).


Limitation of GANs: Unstable Training

  • Mathematically, it can be shown that minimizing the GAN objective with an optimal discriminator is equivalent to minimizing the JS divergence.
    • JS divergence is a symmetrized form of KL divergence (see below).
  • However, KL divergence has a ≈ 0 gradient when P and Q barely overlap.
  • Thus, when the Generator does not work well (e.g., at the initial stage), it learns almost nothing from the Discriminator's signal.
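With $M = \tfrac{1}{2}(P + Q)$:

$$\text{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\text{KL}(P \,\|\, M) + \tfrac{1}{2}\,\text{KL}(Q \,\|\, M)$$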


Limitation of GANs: Mode Collapse

  • In most cases, the true image space is highly multimodal.
    • Ex) MNIST: 10 modes, one for each digit 0, 1, …, 9
  • Mode collapse: the phenomenon where the Generator concentrates on producing samples lying on only a few modes instead of covering the whole data space.


Wasserstein GAN

  • Instead of JS divergence, WGAN proposes to minimize the Earth Mover (EM) distance.
  • Deriving from its definition (math omitted), we get the objective function below, similar to the GAN's:
    • Unlike the Discriminator in the original GAN, which outputs [0, 1] through a sigmoid activation, f_w has no output bound.
    • It outputs larger values for more real-looking inputs and smaller values for more fake-looking ones.
    • Due to a mathematical condition of the EM distance, f_w must satisfy the 1-Lipschitz continuity condition. → The weights w are clipped to (-c, c) for some hyperparameter c.
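Written out (from the WGAN paper), with the critic $f_w$ constrained to be 1-Lipschitz:

$$\min_G \max_{w \,:\, f_w \text{ 1-Lipschitz}} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[f_w(x)\big] - \mathbb{E}_{z \sim p(z)}\big[f_w(G(z))\big]$$

In code, the clipping step after each critic update is one line (PyTorch sketch, assuming a critic module f and threshold c):

```python
for p in f.parameters():
    p.data.clamp_(-c, c)   # enforce w in (-c, c)
```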


Wasserstein GAN

[Diagram: GAN: z → Generator → G(z), real image x, Discriminator D(·) → p(real). WGAN: z → Generator → G(z), real image x, Critic f(·) → score(real), which the Critic maximizes for real inputs]


Wasserstein GAN: Results

  • The objective function better reflects the quality of generated images.
  • According to the authors, no mode collapse was observed with WGAN!
    • No explicit experiment was conducted, though… 😏


Wasserstein GAN: Limitations

  • Model performance is very sensitive to the hyperparameter c.
    • The weight clipping behaves as a weight regularizer.
    • It reduces the capacity of the model f, limiting its ability to model complex patterns.

[Figure: gradients explode or vanish depending on the clipping threshold c]


Wasserstein GAN with Gradient Penalty (WGAN-GP)

  • Instead of clipping the weights, WGAN-GP penalizes the model if the gradient norm moves away from its target value of 1 (see the sketch below).
    • This uses the fact that a differentiable function is 1-Lipschitz if and only if its gradients have norm at most 1 everywhere.
  • With a more stable training algorithm, WGAN-GP opened the path to stronger GAN models on large-scale image datasets.
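The penalty term is $\lambda\,\mathbb{E}_{\hat{x}}\big[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2\big]$, where $\hat{x}$ is sampled along straight lines between real and fake samples. A minimal PyTorch sketch (the critic f and the batch shapes are assumptions):

```python
import torch

def gradient_penalty(f, real, fake, lam=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 on random
    interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mix
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = f(x_hat)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,            # the penalty itself must be differentiable
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()
```

This term is added to the critic loss in place of weight clipping.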


Wasserstein GAN with Gradient Penalty (WGAN-GP)

[Figure: WGAN with weight clipping vs. WGAN-GP]


GANs for Image-to-Image Translation


Image Translation

  • The task of transforming images from one domain so they take on the style or characteristics of images from another domain, while maintaining their semantics.


Pix2pix

  • Problem setting: each training example is a pair of images (x, y), where we transform domain X → domain Y.
    • Training set: {(x1, y1), (x2, y2), …, (xN, yN)}
  • Pix2pix main idea:
    • The Generator G translates the image x into the style of domain Y, to fool the Discriminator.
    • The Discriminator D classifies whether the input image is real or fake.
    • Unlike a regular GAN, the Discriminator takes a pair of images, one from the source domain and the other from the target domain.

[Figure: a training pair x ∈ X, y ∈ Y]


Pix2pix: Objective Function

Adversarial Loss:
  • D tries to maximize it by assigning high scores (≈1) to real images y.
  • D tries to maximize it by assigning low scores (≈0) to images generated from x.
  • G tries to minimize it by fooling D into assigning high scores (≈1) to images generated from x.

Reconstruction Loss:
  • G tries to minimize it by creating an image as similar as possible to the ground-truth pair y.
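Written out (Isola et al., 2017; the noise input z is omitted for simplicity):

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]$$

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\|y - G(x)\|_1\big] \qquad G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)$$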


Pix2pix: Implementation Details

  • Model architectures (a discriminator sketch follows this list)
    • Generator: U-Net (https://arxiv.org/pdf/1505.04597.pdf)
      • Encoder-decoder (DeconvNet) with skip connections to reduce information loss
    • Discriminator: PatchGAN
      • Classifies real vs. fake on N × N image patches, instead of the entire image.
      • The loss is back-propagated patch by patch, providing detailed feedback to the Generator.
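A minimal PyTorch sketch of a PatchGAN-style discriminator, assuming the pix2pix setting where the source and target images are concatenated along the channel axis (widths and depth are assumptions, loosely following the 70 × 70 variant):

```python
import torch.nn as nn

# PatchGAN discriminator: instead of one scalar for the whole image, it outputs
# a grid of real/fake logits, one per receptive-field patch.
class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=6, ch=64):        # 6 = source + target, concatenated
        super().__init__()
        def block(cin, cout, stride):
            return [nn.Conv2d(cin, cout, 4, stride, 1),
                    nn.BatchNorm2d(cout),
                    nn.LeakyReLU(0.2, True)]
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 4, 2, 1), nn.LeakyReLU(0.2, True),
            *block(ch, ch * 2, 2),
            *block(ch * 2, ch * 4, 2),
            *block(ch * 4, ch * 8, 1),
            nn.Conv2d(ch * 8, 1, 4, 1, 1),     # one logit per patch
        )

    def forward(self, x):                      # x: concat(source, target) on channels
        return self.net(x)                     # (batch, 1, H', W') patch logit map
```

Each cell of the output map is trained with the usual real/fake loss, which is how the loss is back-propagated patch by patch.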


Pix2pix: Examples

[Example figures: colorization; apps built on pix2pix; aerial photos to/from Google Maps]


Pix2pix: Summary

  • Contributions
    • Achieved photo-realistic image-to-image mapping.
      • Mainly thanks to the adversarial loss.
    • Proposed PatchGAN to optimize performance.
    • Applicable to various image-to-image translation tasks.
  • Limitations
    • Requires paired images for training.
      • This is a serious limitation, as most style transfer tasks do not have paired data.
      • E.g., photos → Monet-style
  • https://arxiv.org/pdf/1611.07004v3.pdf


CycleGAN

  • Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks:
    • Overcomes the main drawback of Pix2pix: it no longer requires paired images for training.
  • Main idea:
    • In this problem setting, we don't have paired images (x, y); we just have a set of images {x1, x2, …, xM} from domain X and another set {y1, y2, …, yN} from domain Y.
    • Since we do not have paired images, we rely on 2 losses:
      • When we transfer an image x into the style of Y, we'd like the resulting image G(x) to look like a target-domain (Y) image. → Adversarial Loss
      • When we transfer G(x) back to domain X using another Generator F, we'd like F(G(x)) to be the same as the original x. → Cycle Consistency Loss
      • We apply this idea symmetrically, both for x→y and y→x.
    • Without the cycle consistency loss, the generated images would lose their semantics, merely adapting to the target domain (e.g., creating a map-like image from an aerial photo, but with incorrect roads).


CycleGAN

[Diagram: Domain X (horse) ↔ Domain Y (zebra). Forward: x → Gx→y → Dy (Real vs. Fake, Adversarial Loss), then Gy→x(Gx→y(x)) compared to x (Cycle-consistency Loss). Backward: y → Gy→x → Dx (Real vs. Fake, Adversarial Loss), then Gx→y(Gy→x(y)) compared to y (Cycle-consistency Loss). Legend: real images from the training set vs. images generated by the model]


CycleGAN: Objective Function

Adversarial Loss (same as the original GAN):
  • Dy tries to maximize it by assigning high scores (≈1) to real images y.
  • Dy tries to maximize it by assigning low scores (≈0) to images generated from x.
  • Gx→y tries to minimize it by fooling Dy into assigning high scores (≈1) to images generated from x.

Cycle-Consistency Loss:
  • Both generators try to minimize it by reconstructing an image as similar as possible to the original image x.

This covers x→y→x. There is a mirrored version for y→x→y, and the overall loss is the sum of both.
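Written out (Zhu et al., 2017), for the x→y direction plus the cycle term:

$$\mathcal{L}_{GAN}(G_{x \to y}, D_y) = \mathbb{E}_{y}\big[\log D_y(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_y(G_{x \to y}(x))\big)\big]$$

$$\mathcal{L}_{cyc} = \mathbb{E}_{x}\big[\|G_{y \to x}(G_{x \to y}(x)) - x\|_1\big] + \mathbb{E}_{y}\big[\|G_{x \to y}(G_{y \to x}(y)) - y\|_1\big]$$

$$\mathcal{L} = \mathcal{L}_{GAN}(G_{x \to y}, D_y) + \mathcal{L}_{GAN}(G_{y \to x}, D_x) + \lambda\,\mathcal{L}_{cyc}$$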


CycleGAN: Implementation Details

  • Model architectures
    • Generator: ResNet
      • Thanks to the residual connections, information loss is minimized.
    • Discriminator: 70 × 70 PatchGAN (same as pix2pix)
    • Loss: also tried LSGAN (Least-Squares GAN)
      • Instead of cross-entropy, the loss is defined as a squared loss (written out below).
      • This is known to be effective in avoiding vanishing gradients.
    • Identity loss (for some tasks):
      • Feed y to Gx→y and encourage G to leave it unchanged (and vice versa).
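Written out, the LSGAN losses (with 1/0 labels for real/fake) and the identity loss are:

$$\min_{D} \; \tfrac{1}{2}\,\mathbb{E}_{y}\big[(D(y) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{x}\big[D(G(x))^2\big] \qquad \min_{G} \; \tfrac{1}{2}\,\mathbb{E}_{x}\big[(D(G(x)) - 1)^2\big]$$

$$\mathcal{L}_{identity} = \mathbb{E}_{y}\big[\|G_{x \to y}(y) - y\|_1\big] + \mathbb{E}_{x}\big[\|G_{y \to x}(x) - x\|_1\big]$$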


CycleGAN: Examples


CycleGAN: Examples

Smartphone photos → DSLR photos

Monet → photos

Photo → Monet, Van Gogh, Cézanne, Ukiyo-e


CycleGAN: Summary

  • Contributions
    • Achieved style transfer without paired images.
    • Produced high-resolution images.
    • Works well for style transfer.
  • Limitations
    • Training is slow due to the large networks.
    • Poor performance when the shape needs to change (e.g., cat → dog, apple → orange).
  • https://arxiv.org/pdf/1703.10593.pdf


DiscoGAN

  • Discovery-GAN
  • Exactly the same idea as CycleGAN, developed concurrently by SK T-Brain.
    • Same idea, same loss function.
  • The main difference is in the experimental settings:
    • Focuses more on dynamic shape changes, rather than style transfer.
    • Mainly due to architectural and dataset differences.


DiscoGAN: Implementation Details

  • Model architectures
    • Generator: encoder-decoder structure
      • Similar to DeconvNet.
      • Suffers from information loss → hard to create high-resolution images.
        • Produced 64 × 64 images only (as opposed to 512 × 512 in CycleGAN)
      • At the cost of photo-realism, it becomes more flexible with shape changes.
    • Discriminator: DCGAN
    • Loss: almost equivalent to CycleGAN's
      • Used L2 loss for cycle-consistency, instead of L1.
  • Overall, it focuses more on domain transfer with larger shape changes.
    • Faster training thanks to the simpler network structure.
    • Because of this, the output resolution is lower than CycleGAN's.


DiscoGAN: Examples

Chair to Car

Car to Face

Handbag to Shoes

Shoes to Handbag
