1 of 111

Generative Adversarial Networks

Amnon Geifman & Adam Yaari

2 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

3 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

4 of 111

Supervised Vs. Unsupervised Learning

Supervised:

  • Input Data: Data (x), Labels (y)
  • Goal: Learn y = f(x)
  • Examples: Classification, regression, object detection, semantic segmentation, etc.

Unsupervised:

5 of 111

Supervised Vs. Unsupervised Learning

6 of 111

Supervised Vs. Unsupervised Learning

7 of 111

Supervised Vs. Unsupervised Learning

Supervised:

  • Input Data: Data (x), Labels (y)
  • Goal: Learn y = f(x)
  • Examples: Classification, regression, object detection, semantic segmentation, etc.

Unsupervised:

  • Input Data: Data (x)
  • Goal: Learn the underlying hidden structure of the data
  • Examples: Clustering, feature learning, density estimation, dimensionality reduction, etc.

8 of 111

Supervised Vs. Unsupervised Learning

Clustering:

9 of 111

Supervised Vs. Unsupervised Learning

Feature mapping:

10 of 111

Supervised Vs. Unsupervised Learning

Dimensionality reduction:

11 of 111

What does that have to do with generative models?

  • Generative models are unsupervised* algorithms (they require no labels).
  • Their goal is to generate more data of the same kind without memorizing the training set.

Meaning: given training data, generate new samples from the same distribution.

* Modified versions of GANs are also used for semi-supervised and reinforcement learning.

12 of 111

What do Generative Models do?

Generative models attempt to estimate the density function of the training data.

  • Explicit estimation: explicitly define and solve for p(x).
  • Implicit estimation: learn a model that can sample from p(x) without explicitly defining it.

13 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

14 of 111

Pixel RNN (van den Oord et al. 2016)

Explicit tractable density function:

van den Oord et al., "Pixel Recurrent Neural Networks", 2016.

15 of 111

Pixel RNN (van den Oord et al. 2016)

Explicit tractable density function:

Use the chain rule to decompose the likelihood of an image x into a product of 1-d conditional distributions, and maximize the likelihood of the training data:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

But what do these 1-d conditional distributions look like, and how do we sample from them? (A schematic sampling loop is sketched below.)
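To make the chain-rule factorization concrete, here is a minimal sampling sketch (ours, not from the paper): `cond_dist` is a hypothetical stand-in for the learned network that returns the 256-way distribution p(x_i | x_1, ..., x_{i-1}) for the next pixel.

```python
import numpy as np

def sample_image(cond_dist, h=28, w=28):
    # Sample pixels one at a time, starting from the top-left, using
    # p(x) = prod_i p(x_i | x_1, ..., x_{i-1}).
    pixels = []
    for _ in range(h * w):
        probs = cond_dist(np.array(pixels))           # distribution over values 0..255
        pixels.append(np.random.choice(256, p=probs))
    return np.array(pixels, dtype=np.uint8).reshape(h, w)

# Toy stand-in for the learned conditional (uniform over 0..255):
uniform = lambda prefix: np.full(256, 1.0 / 256)
img = sample_image(uniform)
```

The loop already hints at the main drawback discussed on the next slide: generating one image costs h·w sequential network evaluations.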

16 of 111

Pixel RNN (van den Oord et al. 2016)

Generation starts from the top-left pixel and proceeds sequentially; each conditional distribution is modeled via an RNN (LSTM).

Pros:

  • The likelihood p(x) is computed directly.
  • Good evaluation metric via the explicit likelihood over the training data.
  • Relatively high-quality samples.

Cons:

  • Pixel-wise sequential generation => very slow

17 of 111

Pixel CNN (van den Oord et al. 2016)

An improvement over PixelRNN. Each pixel still has an explicit distribution over the values 0-255, but the conditioning context is modeled via a CNN instead of an RNN.

Pros:

  • Significantly faster to train than PixelRNN (convolutions over the known context can be parallelized).

Cons:

  • Generation is still sequential and slow...

18 of 111

Pixel CNN (van den Oord et al. 2016)

Some results:

19 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

Kingma & Welling, 2013

20 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Explicit intractable density function:

Kingma and Welling, "Auto-Encoding Variational Bayes", 2013.

21 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Explicit intractable density function:

[Graphical model: latent variable z → observed variable x]

22 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

23 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Some results:

24 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Pros:

  • Principled approach to generative modeling.
  • Allows inference of p(z|x), which is useful for additional tasks.

Cons:

  • Maximizes only a lower bound on the likelihood (a weaker evaluation than PixelRNN/CNN).
  • Samples are blurrier and of lower quality compared to the state of the art (GANs).

25 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

26 of 111

Generative Stochastic Networks

27 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

Kingma & Welling, 2013

Goodfellow et al. 2014

28 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

29 of 111

Finally… what are GANs?

Let’s give up on explicitly modeling the density, and just gain the ability to sample.

Generative Adversarial Networks:

learn to approximate the data's distribution, instead of exposing it directly.

Goodfellow et al. "Generative Adversarial Networks", 2014.

30 of 111

Some motivation...

https://hothardware.com/news/nvidia-neural-network-generates-photorealistic-faces-disturbing-results

31 of 111

Some motivation...

Bedrooms generation

32 of 111

Some motivation...

Vector Arithmetic

33 of 111

Some motivation...

Super resolution

34 of 111

Some motivation...

3D reconstruction

35 of 111

Some motivation...

Face completion

36 of 111

Some motivation...

Assisted painting

37 of 111

Some motivation...

Style transfer/coloring

38 of 111

GAN’s Architecture

An adversarial game between two differentiable players:

  • Discriminator:

Trained to distinguish between samples from p_data and p_model.

  • Generator:

Tries to fool the discriminator by generating samples from random noise, i.e. to make p_model an approximation of p_data.

Goodfellow et al. "Generative Adversarial Networks", 2014.

39 of 111

GAN’s Architecture

[Architecture diagram: noise z → Generator → fake sample; real data x and fake samples → Discriminator → real/fake]

Goodfellow et al. "Generative Adversarial Networks", 2014.

40 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

41 of 111

Deep Convolutional GAN (Radford et al. 2016)

Goodfellow's original paper used a multi-layer perceptron for both the generator and the discriminator; DCGAN replaces both with convolutional networks.

Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", 2016.

42 of 111

Deep Convolutional GAN (Radford et al. 2016)

Original GAN

DCGAN

Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", 2016.

43 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

44 of 111

GAN’s Objective

Given a training set of real examples x, random noise z sampled from a Gaussian distribution, a discriminator D, and a generator G, play a zero-sum (minimax) game between the generator and the discriminator:

\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

The discriminator tries to maximize this objective while the generator tries to minimize it.

  • \log D(x): discriminator score over a real image
  • \log(1 - D(G(z))): discriminator score over a fake image

Goodfellow et al. "Generative Adversarial Networks", 2014.

Ideally D(real image) -> 1 and D(fake image) -> 0.

45 of 111

GAN’s Algorithm

For every m generator updates, the discriminator is updated m·k times, i.e. k discriminator steps per generator step (a minimal sketch of one such iteration is given below).

Goodfellow et al. "Generative Adversarial Networks", 2014.
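As a hedged illustration (not the paper's exact pseudocode), here is a PyTorch-style sketch of one such iteration; G, D, their optimizers and z_dim are assumed to be defined elsewhere, and the generator uses the non-saturating loss that is common in practice.

```python
import torch
import torch.nn as nn

def gan_step(G, D, real, opt_G, opt_D, k=1, z_dim=100):
    """One iteration: k discriminator updates, then one generator update."""
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    for _ in range(k):                          # discriminator: D(real) -> 1, D(fake) -> 0
        z = torch.randn(real.size(0), z_dim)
        fake = G(z).detach()
        loss_D = bce(D(real), ones) + bce(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    z = torch.randn(real.size(0), z_dim)        # generator: push D(G(z)) -> 1
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```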

46 of 111

GAN’s Learning

Goodfellow et al. (2014) showed that minimizing the minimax objective over D and G amounts to minimizing the Jensen-Shannon divergence between the data and model distributions*.

* Given infinite capacity for D and G.

Always read the fine print!

Goodfellow et al. "Generative Adversarial Networks", 2014.

47 of 111

GAN’s Objective In Practice

Goodfellow et al. "Generative Adversarial Networks", 2014.

48 of 111

Nash-Equilibrium in GANs

Zhao, LeCun et al. (2016) have shown that, given a slightly different objective function,

L_D(x, z) = D(x) + [m - D(G(z))]^+    and    L_G(z) = D(G(z))

where

  • m is a positive margin
  • D is a non-negative energy function
  • G(z) is a generated image
  • [\,\cdot\,]^+ is the ReLU function,

there exists a Nash equilibrium in which G produces samples that are indistinguishable from the real data*.

* Given infinite capacity for D and G.

Zhao, LeCun et al., "Energy-based Generative Adversarial Network", 2016.

49 of 111

Energy Based GAN

50 of 111

Nash-Equilibrium in GANs

  • The results were measured via the Inception score (widely used but highly criticised).
  • The quality of the results seems just as poor.

Our criticism: a Nash equilibrium may exist in theory, but it does not seem to be found by EBGAN (Energy-Based GAN) in practice.

Zhao, LeCun et al., "Energy-based Generative Adversarial Network", 2016.

51 of 111

Semi-Supervised Learning

GANs are now used with an extremely wide variety of objectives, tasks, architectures, etc.

52 of 111

So, it all seems great!

53 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

54 of 111

Do GANs really learn the distribution?

What we know so far:

  1. GANs generate real-looking images for many image types.
  2. GANs do not memorize the training data.
  3. The support of the latent variable Z is infinite (normal distribution).

55 of 111

The Birthday Paradox

Suppose there are k people in a room. How large must k be before there is a high likelihood that two people share a birthday?

Let p(k) be the probability that no two people have the same birthday:

p(23) = (365/365) · (364/365) · ... · (343/365) ≈ 0.493

p(70) = (365/365) · (364/365) · ... · (296/365) < 0.001

Overall, a discrete distribution with support size N is likely to produce duplicates in a sample of size about sqrt(N) (a short computation follows below).

Arora, Zhang, "Do GANs actually learn the distribution? An empirical study", 2017.
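The numbers above can be checked with a few lines of Python (a sanity check of ours, not from the paper):

```python
from math import prod

def p_no_collision(k, n=365):
    # Probability that k independent uniform draws from n values are all distinct.
    return prod((n - i) / n for i in range(k))

print(round(p_no_collision(23), 3))   # ~0.493
print(p_no_collision(70) < 1e-3)      # True
# For a support of size N = 1000, duplicates already appear with probability
# ~0.63 in a sample of 45 draws:
print(round(1 - p_no_collision(45, n=1000), 2))
```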

56 of 111

The Birthday Paradox

Thus, if we find duplicate images after sampling s samples from the generator, then the distribution is likely to have a support of size about s².

Overall, a discrete distribution of support N is likely to have duplicates in a sample size about sqrt(N).

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

57 of 111

The Birthday Paradox

Thus, if samples of size s contain duplicate images with good probability, then the distribution is likely to have a support of size about s².

Sample (and support) size needed to find duplicates (w.p. > 50%):

Dataset \ Architecture    DCGAN           MIX+DCGAN       ALI (BiGAN)         Stacked GAN
CelebA                    400 (16,000)    400 (16,000)    1000 (1,000,000)    -
CIFAR-10                  -               -               -                   < 500 (< 250,000)

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

58 of 111

Diversity test results

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

59 of 111

Diversity test results

Diversity as a function of the discriminator’s number of parameters:

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

60 of 111

61 of 111

First Part: “GANs are awesome!”

62 of 111

Second Part: GANs are far from being great!

63 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

64 of 111

GANs are notoriously hard to train

Vanishing/exploding gradients

Discriminator domination

Sensitivity to learning rate

Mode collapse

Batch normalization

65 of 111

There are two current research directions

Understanding and improving the training dynamics

Understanding the gap between theory and practice

Improved Techniques for Training GANs, Salimans et al. 2016

Wasserstein GAN, Arjovsky et al. 2017

66 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

67 of 111

Improved Techniques for Training GANs (Salimans et al. 2016)

  • The idea is to go over the main known problems and try to cure them one by one.
  • The authors propose a heuristic/technique for each problem.
  • They also introduce the Inception score and a semi-supervised learning scheme.

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

68 of 111

Problem #1: overtraining of the discriminator

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

69 of 111

Problem #1: overtraining of the discriminator

Feature matching

  • Instead of using the discriminator's output, we use an intermediate layer of the discriminator.
  • If f(x) denotes the activations of that intermediate layer, the generator's loss becomes (see the sketch below):

\lVert \mathbb{E}_{x \sim p_{data}} f(x) - \mathbb{E}_{z \sim p_z} f(G(z)) \rVert_2^2

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
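A minimal sketch of the feature-matching generator loss, assuming a hypothetical function f that returns the chosen intermediate-layer activations of the discriminator (shape: batch x features):

```python
import torch

def feature_matching_loss(f, real, fake):
    # || E_x f(x) - E_z f(G(z)) ||_2^2 : match the mean intermediate features
    # of a real batch and a generated batch instead of the discriminator output.
    return (f(real).mean(dim=0) - f(fake).mean(dim=0)).pow(2).sum()
```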

70 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

71 of 111

Problem #1: overtraining of the discriminator

Minibatch discrimination

We allow the discriminator to look at a whole batch of examples rather than a single example, so it can detect a lack of diversity.

Problem #2: mode collapse

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

72 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

73 of 111

Problem #1: overtraining of the discriminator

Historical averaging

  • We modify each player's cost to include a penalty for moving too far from the average of its previous weights (a sketch of the penalty follows below).

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

  • For example, if one player minimizes xy w.r.t. x and the other minimizes -xy w.r.t. y, then (0,0) is the Nash equilibrium, but simultaneous gradient updates tend to orbit around it rather than converge.
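A sketch of the historical-averaging penalty, assuming it is simply added to a player's cost at every step; the class and its names are hypothetical.

```python
import torch

class HistoricalAverage:
    """Maintain a running average of a player's parameters and return the
    penalty || theta - (1/t) * sum_i theta_i ||^2 to be added to its cost."""
    def __init__(self, params):
        self.avg = [p.detach().clone() for p in params]
        self.t = 1

    def penalty(self, params):
        return sum(((p - a) ** 2).sum() for p, a in zip(params, self.avg))

    def update(self, params):
        # avg_t = ((t - 1) * avg_{t-1} + theta_t) / t
        self.t += 1
        for a, p in zip(self.avg, params):
            a.mul_((self.t - 1) / self.t).add_(p.detach() / self.t)
```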

74 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Problem #4: batch normalization

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

75 of 111

Problem #1: overtraining of the discriminator

Virtual batch normalization

  • To prevent correlation between the outputs of examples within a batch, we use a more sophisticated form of batch normalization.
  • Instead of normalizing with the statistics of the current batch, we normalize with a reference batch fixed in advance (a simplified sketch follows below).

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Problem #4: batch normalization

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
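A simplified sketch of virtual batch normalization: each example is normalized with the statistics of a reference batch fixed at the start of training (the paper actually combines the reference batch with the current example; the class here is a hypothetical approximation).

```python
import torch

class VirtualBatchNorm(torch.nn.Module):
    """Normalize with the mean/std of a fixed reference batch instead of the
    statistics of the current minibatch, decorrelating samples within a batch."""
    def __init__(self, ref_batch, eps=1e-5):
        super().__init__()
        self.register_buffer("mu", ref_batch.mean(dim=0, keepdim=True))
        self.register_buffer("sigma", ref_batch.std(dim=0, keepdim=True))
        self.eps = eps

    def forward(self, x):
        return (x - self.mu) / (self.sigma + self.eps)
```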

76 of 111

Results of the improved training

Feature matching

Minibatch discrimination

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

Basic GAN

77 of 111

DCGAN

DCGAN + feature matching

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

78 of 111

DCGAN

DCGAN + minibatch discrimination

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

79 of 111

DCGAN

DCGAN + Minibatch discrimination + Feature matching

80 of 111

DCGAN

DCGAN + Minibatch discrimination + Feature matching

81 of 111

There is still room for improvement

DCGAN

Improved DCGAN

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

82 of 111

There are two current research directions

Understanding and improving the training dynamics

Understanding the gap between theory and practice

Improved Techniques for Training GANs, Salimans et al. 2016

Wasserstein GAN, Arjovsky et al. 2017

83 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

84 of 111

Understanding the gap between theory and practice

Some questions we want to ask:

Why does GAN training often fail to converge to the Nash equilibrium?

Why does mode collapse happen?

What theoretical guarantees do we have on the objective?

85 of 111

Divergence/distance: a tool to analyze GANs

  • A divergence is a measure of “similarity” between distributions.
  • It is not always symmetric, and hence not necessarily a metric.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

86 of 111

List of divergences

  • A divergence is a measure of “similarity” between distributions.
  • It is not always symmetric, and hence not necessarily a metric.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

87 of 111

GAN as divergence minimizer

Claim (Goodfellow et al. 2014): in the regular GAN, given the optimal discriminator, the generator minimizes the Jensen-Shannon (JS) divergence between the data and model distributions (a short derivation is sketched below).

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.
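A short sketch of why the claim holds (following the original GAN paper): for a fixed generator the optimal discriminator has a closed form, and substituting it back turns the generator's objective into the JS divergence up to constants.

```latex
% Optimal discriminator for a fixed generator:
D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}

% Substituting D^{*} into the objective:
C(G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log D^{*}(x)\right]
     + \mathbb{E}_{x \sim p_{g}}\!\left[\log\!\left(1 - D^{*}(x)\right)\right]
     = -\log 4 + 2\,\mathrm{JS}\!\left(p_{data} \,\|\, p_{g}\right)
```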

88 of 111

Example of the weakness of JS

  • The JS divergence is not continuous when the distributions are supported on low-dimensional manifolds.
  • Let Z ~ U[0,1] and let P_0 be the distribution of the random variable (0, Z).
  • The parametric distribution we want to estimate is P_θ, the distribution of the random variable (θ, Z).
  • The JS divergence in this case is log 2 for every θ ≠ 0 and 0 for θ = 0, so it provides no usable gradient (see the comparison below).

We need a better distance measure.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.
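For the parallel-lines example above, the distances compare as follows (from the WGAN paper): only the Wasserstein distance varies smoothly with θ and therefore provides a useful gradient.

```latex
W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|, \qquad
\mathrm{JS}(\mathbb{P}_0, \mathbb{P}_\theta) =
  \begin{cases} \log 2 & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}, \qquad
\mathrm{KL}(\mathbb{P}_\theta \,\|\, \mathbb{P}_0) =
  \begin{cases} +\infty & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}
```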

89 of 111

A better choice: the EM distance

  • Consider one distribution P as a pile of earth and another distribution Q as the target.
  • The EM distance is the minimum total cost (mass times distance) the earth mover must pay to transform P into Q.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

90 of 111

A better choice: the EM distance

If there are several ways to move the mass, choose the plan with the smallest total cost.

[Figure: several transport plans moving pile P onto target Q]

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

91 of 111

A better choice: the EM distance

In the discrete case a moving plan is a matrix γ whose entry γ(x_d, x_g) is the amount of earth moved from position x_d to position x_g. The cost of a plan is

B(\gamma) = \sum_{x_d, x_g} \gamma(x_d, x_g)\, \lVert x_d - x_g \rVert

and the earth mover (Wasserstein) distance is the cost of the best plan (a tiny numeric example follows below):

\min_{\gamma \in \Pi} B(\gamma)
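A tiny numeric illustration of the discrete definition (ours, not from the paper); for one-dimensional distributions SciPy can compute the optimal transport cost directly:

```python
import numpy as np
from scipy.stats import wasserstein_distance

positions = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.5, 0.0])   # pile of earth P
q = np.array([0.0, 0.5, 0.5])   # target Q

# Optimal plan: shift each half-unit of mass one step to the right, total cost 1.0
print(wasserstein_distance(positions, positions, u_weights=p, v_weights=q))  # 1.0
```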

92 of 111

A better choice: the EM distance

In the continuous case, the earth mover distance (Wasserstein distance) is defined as

W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right],

the infimum over all joint distributions γ whose marginals are P_r and P_g.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

93 of 111

EM distance is better than JS

Theorem: Let P_θ denote the distribution of the random variable g_θ(Z). Then, under mild regularity conditions on g_θ, the Wasserstein distance W(P_r, P_θ) is continuous in θ and differentiable almost everywhere.

However, the infimum in the definition above is highly intractable to compute directly.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

94 of 111

Dual function principle

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

95 of 111

The Kantorovich-Rubinstein duality

  • The Wasserstein distance can be written in the dual form

W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)]

  • If we instead optimize over K-Lipschitz functions, the objective becomes K · W(P_r, P_θ).
  • We replace f by a parametric function (a DNN).

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

96 of 111

WGAN objective

  • The objective is now

\min_\theta \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]

  • f_w is a parametric function; constraining its weights to a compact set enforces (a form of) the Lipschitz condition.
  • g_θ is a parametric function such that g_θ(z) ~ P_θ.
  • Z is a noise random variable (e.g. Gaussian).
  • f_w is analogous to the discriminator, but in this model it is called the critic.
  • We enforce the constraint by clipping the value of each weight to a fixed range.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

97 of 111

WGAN algorithm
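The algorithm alternates several critic updates with one generator update and enforces the weight constraint by clipping after every critic step. A minimal PyTorch-style sketch (G, critic and their optimizers are assumed to be defined elsewhere; the paper uses RMSProp with n_critic = 5 and clipping constant c = 0.01):

```python
import torch

def wgan_step(G, critic, real, opt_G, opt_C, n_critic=5, clip=0.01, z_dim=100):
    for _ in range(n_critic):                      # critic: maximize E[f(real)] - E[f(fake)]
        z = torch.randn(real.size(0), z_dim)
        loss_C = critic(G(z).detach()).mean() - critic(real).mean()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()
        for p in critic.parameters():              # enforce the weight constraint by clipping
            p.data.clamp_(-clip, clip)

    z = torch.randn(real.size(0), z_dim)           # generator: maximize E[f(g_theta(z))]
    loss_G = -critic(G(z)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_C.item(), loss_G.item()
```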

98 of 111

WGAN results: JS with a regular GAN

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

99 of 111

WGAN results

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

100 of 111

WGAN results

  • For the first time, a correlation between sample quality and the distance estimate (critic loss) was observed.
  • No mode collapse was observed in the Wasserstein GAN experiments.
  • The results do not depend on batch normalization.
  • However, the weights still sometimes fail to converge.
  • Weight clipping seems to be problematic.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

101 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

102 of 111

Improved Training of Wasserstein GANs

  • Weight clipping is not a natural choice to enforce lipschitz
  • It can be proved that the optimal critics has a gradient of norm 1 almost everywhere
  • By the two observation above a it will be natural to regularize the gradient

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.

103 of 111

Regularization

The new loss is the WGAN loss plus a gradient penalty term:

L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]

where \hat{x} is sampled uniformly along straight lines between pairs of real and generated samples (a code sketch follows below).

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.
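A sketch of the gradient-penalty term for image tensors of shape (B, C, H, W); `critic` is a hypothetical module, and lambda = 10 is the value used in the paper.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Interpolate between real and fake samples and penalize critic gradients
    # whose norm deviates from 1 at those interpolated points.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```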

104 of 111

Improved Training of Wasserstein GANs results

WGAN

WGAN-GP

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.

105 of 111

Improving the improved training of Wasserstein GANs

106 of 111

Improving the improved training of Wasserstein GANs

WGAN-GP

WGAN-CT

107 of 111

Improving the improved training of Wasserstein GANs

WGAN-GP

108 of 111

Improving the improved training of Wasserstein GANs

109 of 111

GANs today

Goodfellow et al. 2014

Super resolution

3-D depth from single view

Image generation

Image inpainting

Style transfer

Text to image

Image to text

110 of 111

Take home message

  • GANs are a powerful tool.
  • GAN training is still unstable (even with Wasserstein GAN).
  • GANs can be improved both by advancing the theory and by improving the training procedure.

111 of 111