1 of 111

Generative Adversarial Networks

Amnon Geifman & Adam Yaari

2 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

3 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

4 of 111

Supervised Vs. Unsupervised Learning

Supervised:

  • Input Data: Data (x), Labels (y)
  • Goal: Learn y = f(x)
  • Examples: Classification, regression, object detection, semantic segmentation, etc.

Unsupervised:

5 of 111

Supervised Vs. Unsupervised Learning

6 of 111

Supervised Vs. Unsupervised Learning

7 of 111

Supervised Vs. Unsupervised Learning

Supervised:

  • Input Data: Data (x), Labels (y)
  • Goal: Learn y = f(x)
  • Examples: Classification, regression, object detection, semantic segmentation, etc.

Unsupervised:

  • Input Data: Data (x)
  • Goal: Learn the underlying hidden structure of the data
  • Examples: Clustering, feature learning, density estimation, dimensionality reduction, etc.

8 of 111

Supervised Vs. Unsupervised Learning

Clustering:

9 of 111

Supervised Vs. Unsupervised Learning

Feature mapping:

10 of 111

Supervised Vs. Unsupervised Learning

Dimensionality reduction:

11 of 111

What does that have to do with generative models?

  • Generative models are unsupervised* algorithms (they require no labels).
  • Their goal is to generate more data of the same kind without memorizing the training set.

Meaning: given training data, generate new samples from the same distribution.

* Modified versions of GANs are also used for semi-supervised and reinforcement learning.

12 of 111

What do Generative Models do?

Generative models attempt to estimate the density function of the training data.

  • Explicit estimation: explicitly define and solve for p(x).
  • Implicit estimation: learn a model that can sample from p(x) without explicitly defining it.

13 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

14 of 111

Pixel RNN (van den Oord et al. 2016)

Explicit tractable density function:

van den Oord et al., "Pixel Recurrent Neural Networks", 2016.

15 of 111

Pixel RNN (van den Oord et al. 2016)

Explicit tractable density function:

Use the chain rule to decompose the likelihood of an image x into a product of 1-d conditional distributions, and maximize the likelihood of the training data:

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

But what do these 1-d conditional distributions look like, and how do we sample from them? (A schematic sampling loop is sketched below.)
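To make the chain-rule factorization concrete, here is a minimal sampling sketch (ours, not from the paper): `cond_dist` is a hypothetical stand-in for the learned network that returns the 256-way distribution p(x_i | x_1, ..., x_{i-1}) for the next pixel.

```python
import numpy as np

def sample_image(cond_dist, h=28, w=28):
    # Sample pixels one at a time, starting from the top-left, using
    # p(x) = prod_i p(x_i | x_1, ..., x_{i-1}).
    pixels = []
    for _ in range(h * w):
        probs = cond_dist(np.array(pixels))           # distribution over values 0..255
        pixels.append(np.random.choice(256, p=probs))
    return np.array(pixels, dtype=np.uint8).reshape(h, w)

# Toy stand-in for the learned conditional (uniform over 0..255):
uniform = lambda prefix: np.full(256, 1.0 / 256)
img = sample_image(uniform)
```

The loop already hints at the main drawback discussed on the next slide: generating one image costs h·w sequential network evaluations.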

16 of 111

Pixel RNN (van den Oord et al. 2016)

Generation starts from the top-left pixel and proceeds sequentially; each conditional distribution is modeled via an RNN (LSTM).

Pros:

  • The likelihood p(x) is computed directly.
  • Good evaluation metric via the explicit likelihood over the training data.
  • Relatively high-quality samples.

Cons:

  • Pixel-wise sequential generation => very slow

17 of 111

Pixel CNN (van den Oord et al. 2016)

An improvement over PixelRNN. Each pixel still has an explicit distribution over the values 0-255, but the conditioning context is modeled via a CNN instead of an RNN.

Pros:

  • Significantly faster to train than PixelRNN (convolutions over the known context can be parallelized).

Cons:

  • Generation is still sequential and slow...

18 of 111

Pixel CNN (van den Oord et al. 2016)

Some results:

19 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

Kingma & Welling, 2013

20 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Explicit intractable density function:

Kingma and Welling, "Auto-Encoding Variational Bayes", 2013.

21 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Explicit intractable density function:

[Graphical model: latent variable z → observed variable x]

22 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

23 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Some results:

24 of 111

Variational Autoencoder (Kingma & Welling, ICLR 2014)

Pros:

  • Principled approach to generative modeling.
  • Allows inference of p(z|x), which is useful for additional tasks.

Cons:

  • Maximizes only a lower bound on the likelihood (a weaker evaluation than PixelRNN/CNN).
  • Samples are blurrier and of lower quality compared to the state of the art (GANs).

25 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

26 of 111

Generative Stochastic Networks

27 of 111

Taxonomy of Generative Models

Figure copyright from Ian Goodfellow, Tutorial on GAN, NIPS 2016

van den Oord et al. 2016

Kingma & Welling, 2013

Goodfellow et al. 2014

28 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

29 of 111

Finally… what are GANs?

Let’s give up on explicitly modeling the density, and just gain the ability to sample.

Generative Adversarial Networks:

learn to approximate the data's distribution, instead of exposing it directly.

Goodfellow et al. "Generative Adversarial Networks", 2014.

30 of 111

Some motivation...

https://hothardware.com/news/nvidia-neural-network-generates-photorealistic-faces-disturbing-results

31 of 111

Some motivation...

Bedrooms generation

32 of 111

Some motivation...

Vector Arithmetic

33 of 111

Some motivation...

Super resolution

34 of 111

Some motivation...

3D reconstruction

35 of 111

Some motivation...

Face completion

36 of 111

Some motivation...

Assisted painting

37 of 111

Some motivation...

Style transfer/coloring

38 of 111

GAN’s Architecture

An adversarial game between two differentiable players:

  • Discriminator:

Trained to distinguish between samples from p_data and p_model.

  • Generator:

Tries to fool the discriminator by generating samples from random noise, i.e. to make p_model an approximation of p_data.

Goodfellow et al. "Generative Adversarial Networks", 2014.

39 of 111

GAN’s Architecture

[Architecture diagram: noise z → Generator → fake sample; real data x and fake samples → Discriminator → real/fake]

Goodfellow et al. "Generative Adversarial Networks", 2014.

40 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

41 of 111

Deep Convolutional GAN (Radford et al. 2016)

Goodfellow's original paper used a multi-layer perceptron for both the generator and the discriminator; DCGAN replaces both with convolutional networks.

Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", 2016.

42 of 111

Deep Convolutional GAN (Radford et al. 2016)

Original GAN

DCGAN

Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", 2016.

43 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

44 of 111

GAN’s Objective

Given a training set of real examples x, random noise z sampled from a Gaussian distribution, a discriminator D, and a generator G, play a zero-sum (minimax) game between the generator and the discriminator:

\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

The discriminator tries to maximize this objective while the generator tries to minimize it.

  • \log D(x): discriminator score over a real image
  • \log(1 - D(G(z))): discriminator score over a fake image

Goodfellow et al. "Generative Adversarial Networks", 2014.

Ideally D(real image) -> 1 and D(fake image) -> 0.

45 of 111

GAN’s Algorithm

For every m generator updates, the discriminator is updated m·k times, i.e. k discriminator steps per generator step (a minimal sketch of one such iteration is given below).

Goodfellow et al. "Generative Adversarial Networks", 2014.
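As a hedged illustration (not the paper's exact pseudocode), here is a PyTorch-style sketch of one such iteration; G, D, their optimizers and z_dim are assumed to be defined elsewhere, and the generator uses the non-saturating loss that is common in practice.

```python
import torch
import torch.nn as nn

def gan_step(G, D, real, opt_G, opt_D, k=1, z_dim=100):
    """One iteration: k discriminator updates, then one generator update."""
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    for _ in range(k):                          # discriminator: D(real) -> 1, D(fake) -> 0
        z = torch.randn(real.size(0), z_dim)
        fake = G(z).detach()
        loss_D = bce(D(real), ones) + bce(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    z = torch.randn(real.size(0), z_dim)        # generator: push D(G(z)) -> 1
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```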

46 of 111

GAN’s Learning

Goodfellow et al. (2014) showed that minimizing the minimax objective over D and G amounts to minimizing the Jensen-Shannon divergence between the data and model distributions*.

* Given infinite capacity for D and G.

Always read the fine print!

Goodfellow et al. "Generative Adversarial Networks", 2014.

47 of 111

GAN’s Objective In Practice

Goodfellow et al. "Generative Adversarial Networks", 2014.

48 of 111

Nash-Equilibrium in GANs

Zhao, LeCun et al. (2016) have shown that, given a slightly different objective function,

L_D(x, z) = D(x) + [m - D(G(z))]^+    and    L_G(z) = D(G(z))

where

  • m is a positive margin
  • D is a non-negative energy function
  • G(z) is a generated image
  • [\,\cdot\,]^+ is the ReLU function,

there exists a Nash equilibrium in which G produces samples that are indistinguishable from the real data*.

* Given infinite capacity for D and G.

Zhao, LeCun et al., "Energy-based Generative Adversarial Network", 2016.

49 of 111

Energy Based GAN

50 of 111

Nash-Equilibrium in GANs

  • The results were measured via the Inception score (widely used but highly criticised).
  • The quality of the results seems just as poor.

Our criticism: a Nash equilibrium may exist in theory, but it does not seem to be found by EBGAN (Energy-Based GAN) in practice.

Zhao, LeCun et al., "Energy-based Generative Adversarial Network", 2016.

51 of 111

Semi-Supervised Learning

GANs are now used with an extremely wide variety of objectives, tasks, architectures, etc.

52 of 111

So, it all seems great!

53 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

54 of 111

Do GANs really learn the distribution?

What we know so far:

  1. GANs generate real-looking images for many image types.
  2. GANs do not memorize the training data.
  3. The support of the latent variable Z is infinite (normal distribution).

55 of 111

The Birthday Paradox

Suppose there are k people in a room. How large must k be before there is a high likelihood that two people share a birthday?

Let p(k) be the probability that no two people have the same birthday:

p(23) = (365/365) · (364/365) · ... · (343/365) ≈ 0.493

p(70) = (365/365) · (364/365) · ... · (296/365) < 0.001

Overall, a discrete distribution with support size N is likely to produce duplicates in a sample of size about sqrt(N) (a short computation follows below).

Arora, Zhang, "Do GANs actually learn the distribution? An empirical study", 2017.
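The numbers above can be checked with a few lines of Python (a sanity check of ours, not from the paper):

```python
from math import prod

def p_no_collision(k, n=365):
    # Probability that k independent uniform draws from n values are all distinct.
    return prod((n - i) / n for i in range(k))

print(round(p_no_collision(23), 3))   # ~0.493
print(p_no_collision(70) < 1e-3)      # True
# For a support of size N = 1000, duplicates already appear with probability
# ~0.63 in a sample of 45 draws:
print(round(1 - p_no_collision(45, n=1000), 2))
```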

56 of 111

The Birthday Paradox

Thus, if we find duplicate images after sampling s samples from the generator, then the distribution is likely to have a support of size about s².

Overall, a discrete distribution of support N is likely to have duplicates in a sample size about sqrt(N).

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

57 of 111

The Birthday Paradox

Thus, if samples of size s contain duplicate images with good probability, then the distribution is likely to have a support of size about s².

Sample (and support) size needed to find duplicates (w.p. > 50%):

Dataset \ Architecture    DCGAN           MIX+DCGAN       ALI (BiGAN)         Stacked GAN
CelebA                    400 (16,000)    400 (16,000)    1000 (1,000,000)    -
CIFAR-10                  -               -               -                   < 500 (< 250,000)

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

58 of 111

Diversity test results

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

59 of 111

Diversity test results

Diversity as a function of the discriminator’s number of parameters:

Arora, Zhang "Do GANs actually learn the distribution? An empirical study”, 2017.

60 of 111

61 of 111

First Part: “GANs are awesome!”

62 of 111

Second Part: GANs are far from being great!

63 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

64 of 111

GANs are notoriously hard to train

Vanishing/exploding gradients

Discriminator domination

Sensitivity to learning rate

Mode collapse

Batch normalization

65 of 111

There are two current research directions

Understanding and improving the training dynamics

Understanding the gap between theory and practice

Improved Techniques for Training GANs, Salimans et al. 2016

Wasserstein GAN, Arjovsky et al. 2017

66 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

67 of 111

Improved Techniques for Training GANs (Salimans et al. 2016)

  • The idea is to go over the main known problems and try to cure them one by one.
  • The authors propose a heuristic/technique for each problem.
  • They also introduce the Inception score and a semi-supervised learning scheme.

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

68 of 111

Problem #1: overtraining of the discriminator

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

69 of 111

Problem #1: overtraining of the discriminator

Feature matching

  • Instead of using the discriminator's output, we use an intermediate layer of the discriminator.
  • If f(x) denotes the activations of that intermediate layer, the generator's loss becomes (see the sketch below):

\lVert \mathbb{E}_{x \sim p_{data}} f(x) - \mathbb{E}_{z \sim p_z} f(G(z)) \rVert_2^2

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
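A minimal sketch of the feature-matching generator loss, assuming a hypothetical function f that returns the chosen intermediate-layer activations of the discriminator (shape: batch x features):

```python
import torch

def feature_matching_loss(f, real, fake):
    # || E_x f(x) - E_z f(G(z)) ||_2^2 : match the mean intermediate features
    # of a real batch and a generated batch instead of the discriminator output.
    return (f(real).mean(dim=0) - f(fake).mean(dim=0)).pow(2).sum()
```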

70 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

71 of 111

Problem #1: overtraining of the discriminator

Minibatch discrimination

We allow the discriminator to look at a whole batch of examples rather than a single example, so it can detect a lack of diversity.

Problem #2: mode collapse

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

72 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

73 of 111

Problem #1: overtraining of the discriminator

Historical averaging

  • We modify each player's cost to include a penalty for moving too far from the average of its previous weights (a sketch of the penalty follows below).

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

  • For example, if one player minimizes xy w.r.t. x and the other minimizes -xy w.r.t. y, then (0,0) is the Nash equilibrium, but simultaneous gradient updates tend to orbit around it rather than converge.
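A sketch of the historical-averaging penalty, assuming it is simply added to a player's cost at every step; the class and its names are hypothetical.

```python
import torch

class HistoricalAverage:
    """Maintain a running average of a player's parameters and return the
    penalty || theta - (1/t) * sum_i theta_i ||^2 to be added to its cost."""
    def __init__(self, params):
        self.avg = [p.detach().clone() for p in params]
        self.t = 1

    def penalty(self, params):
        return sum(((p - a) ** 2).sum() for p, a in zip(params, self.avg))

    def update(self, params):
        # avg_t = ((t - 1) * avg_{t-1} + theta_t) / t
        self.t += 1
        for a, p in zip(self.avg, params):
            a.mul_((self.t - 1) / self.t).add_(p.detach() / self.t)
```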

74 of 111

Problem #1: overtraining of the discriminator

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Problem #4: batch normalization

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

75 of 111

Problem #1: overtraining of the discriminator

Virtual batch normalization

  • To prevent correlation between the outputs of examples within a batch, we use a more sophisticated form of batch normalization.
  • Instead of normalizing with the statistics of the current batch, we normalize with a reference batch fixed in advance (a simplified sketch follows below).

Problem #2: mode collapse

Problem #3: local minimum of the equilibrium

Problem #4: batch normalization

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
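A simplified sketch of virtual batch normalization: each example is normalized with the statistics of a reference batch fixed at the start of training (the paper actually combines the reference batch with the current example; the class here is a hypothetical approximation).

```python
import torch

class VirtualBatchNorm(torch.nn.Module):
    """Normalize with the mean/std of a fixed reference batch instead of the
    statistics of the current minibatch, decorrelating samples within a batch."""
    def __init__(self, ref_batch, eps=1e-5):
        super().__init__()
        self.register_buffer("mu", ref_batch.mean(dim=0, keepdim=True))
        self.register_buffer("sigma", ref_batch.std(dim=0, keepdim=True))
        self.eps = eps

    def forward(self, x):
        return (x - self.mu) / (self.sigma + self.eps)
```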

76 of 111

Results of the improved training

Feature matching

Minibatch discrimination

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

Basic GAN

77 of 111

DCGAN

DCGAN + feature matching

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

78 of 111

DCGAN

DCGAN + minibatch discrimination

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

79 of 111

DCGAN

DCGAN + Minibatch discrimination + Feature matching

80 of 111

DCGAN

DCGAN + Minibatch discrimination + Feature matching

81 of 111

There is still room for improvement

DCGAN

Improved DCGAN

Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.

82 of 111

There are two current research directions

Understanding and improving the training dynamics

Understanding the gap between theory and practice

Improved Techniques for Training GANs, Salimans et al. 2016

Wasserstein GAN, Arjovsky et al. 2017

83 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

84 of 111

Understanding the gap between theory and practice

Some questions we want to ask:

Why does GAN training often fail to converge to the Nash equilibrium?

Why does mode collapse happen?

What theoretical guarantees do we have on the objective?

85 of 111

Divergence/distance: a tool to analyze GANs

  • A divergence is a measure of “similarity” between distributions.
  • It is not always symmetric, and hence not necessarily a metric.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

86 of 111

List of divergences

  • A divergence is a measure of “similarity” between distributions.
  • It is not always symmetric, and hence not necessarily a metric.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

87 of 111

GAN as divergence minimizer

Claim (Goodfellow et al. 2014): in the regular GAN, given the optimal discriminator, the generator minimizes the Jensen-Shannon (JS) divergence between the data and model distributions (a short derivation is sketched below).

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.
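A short sketch of why the claim holds (following the original GAN paper): for a fixed generator the optimal discriminator has a closed form, and substituting it back turns the generator's objective into the JS divergence up to constants.

```latex
% Optimal discriminator for a fixed generator:
D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}

% Substituting D^{*} into the objective:
C(G) = \mathbb{E}_{x \sim p_{data}}\!\left[\log D^{*}(x)\right]
     + \mathbb{E}_{x \sim p_{g}}\!\left[\log\!\left(1 - D^{*}(x)\right)\right]
     = -\log 4 + 2\,\mathrm{JS}\!\left(p_{data} \,\|\, p_{g}\right)
```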

88 of 111

Example of the weakness of JS

  • The JS divergence is not continuous when the distributions are supported on low-dimensional manifolds.
  • Let Z ~ U[0,1] and let P_0 be the distribution of the random variable (0, Z).
  • The parametric distribution we want to estimate is P_θ, the distribution of the random variable (θ, Z).
  • The JS divergence in this case is log 2 for every θ ≠ 0 and 0 for θ = 0, so it provides no usable gradient (see the comparison below).

We need a better distance measure.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.
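For the parallel-lines example above, the distances compare as follows (from the WGAN paper): only the Wasserstein distance varies smoothly with θ and therefore provides a useful gradient.

```latex
W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|, \qquad
\mathrm{JS}(\mathbb{P}_0, \mathbb{P}_\theta) =
  \begin{cases} \log 2 & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}, \qquad
\mathrm{KL}(\mathbb{P}_\theta \,\|\, \mathbb{P}_0) =
  \begin{cases} +\infty & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}
```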

89 of 111

A better choice: the EM distance

  • Consider one distribution P as a pile of earth and another distribution Q as the target.
  • The EM distance is the minimum total cost (mass times distance) the earth mover must pay to transform P into Q.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

90 of 111

A better choice: the EM distance

If there are several ways to move the mass, choose the plan with the smallest total cost.

[Figure: several transport plans moving pile P onto target Q]

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

91 of 111

A better choice: the EM distance

In the discrete case a moving plan is a matrix γ whose entry γ(x_d, x_g) is the amount of earth moved from position x_d to position x_g. The cost of a plan is

B(\gamma) = \sum_{x_d, x_g} \gamma(x_d, x_g)\, \lVert x_d - x_g \rVert

and the earth mover (Wasserstein) distance is the cost of the best plan (a tiny numeric example follows below):

\min_{\gamma \in \Pi} B(\gamma)
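A tiny numeric illustration of the discrete definition (ours, not from the paper); for one-dimensional distributions SciPy can compute the optimal transport cost directly:

```python
import numpy as np
from scipy.stats import wasserstein_distance

positions = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.5, 0.0])   # pile of earth P
q = np.array([0.0, 0.5, 0.5])   # target Q

# Optimal plan: shift each half-unit of mass one step to the right, total cost 1.0
print(wasserstein_distance(positions, positions, u_weights=p, v_weights=q))  # 1.0
```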

92 of 111

A better choice: the EM distance

In the continuous case, the earth mover distance (Wasserstein distance) is defined as

W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right],

the infimum over all joint distributions γ whose marginals are P_r and P_g.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

93 of 111

EM distance is better than JS

Theorem: Let P_θ denote the distribution of the random variable g_θ(Z). Then, under mild regularity conditions on g_θ, the Wasserstein distance W(P_r, P_θ) is continuous in θ and differentiable almost everywhere.

However, the infimum in the definition above is highly intractable to compute directly.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

94 of 111

Dual function principle

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

95 of 111

The Kantorovich-Rubinstein duality

  • The Wasserstein distance can be written in the dual form

W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)]

  • If we instead optimize over K-Lipschitz functions, the objective becomes K · W(P_r, P_θ).
  • We replace f by a parametric function (a DNN).

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

96 of 111

WGAN objective

  • The objective is now

\min_\theta \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]

  • f_w is a parametric function; constraining its weights to a compact set enforces (a form of) the Lipschitz condition.
  • g_θ is a parametric function such that g_θ(z) ~ P_θ.
  • Z is a noise random variable (e.g. Gaussian).
  • f_w is analogous to the discriminator, but in this model it is called the critic.
  • We enforce the constraint by clipping the value of each weight to a fixed range.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

97 of 111

WGAN algorithm
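The algorithm alternates several critic updates with one generator update and enforces the weight constraint by clipping after every critic step. A minimal PyTorch-style sketch (G, critic and their optimizers are assumed to be defined elsewhere; the paper uses RMSProp with n_critic = 5 and clipping constant c = 0.01):

```python
import torch

def wgan_step(G, critic, real, opt_G, opt_C, n_critic=5, clip=0.01, z_dim=100):
    for _ in range(n_critic):                      # critic: maximize E[f(real)] - E[f(fake)]
        z = torch.randn(real.size(0), z_dim)
        loss_C = critic(G(z).detach()).mean() - critic(real).mean()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()
        for p in critic.parameters():              # enforce the weight constraint by clipping
            p.data.clamp_(-clip, clip)

    z = torch.randn(real.size(0), z_dim)           # generator: maximize E[f(g_theta(z))]
    loss_G = -critic(G(z)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_C.item(), loss_G.item()
```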

98 of 111

WGAN results: JS with a regular GAN

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

99 of 111

WGAN results

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

100 of 111

WGAN results

  • For the first time, a correlation between sample quality and the distance estimate (critic loss) was observed.
  • No mode collapse was observed in the Wasserstein GAN experiments.
  • The results do not depend on batch normalization.
  • However, the weights still sometimes fail to converge.
  • Weight clipping seems to be problematic.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International Conference on Machine Learning. 2017.

101 of 111

Table of contents

First lecture:

  • Generative models intro
  • GANs intro & examples
  • Deep Convolutional GAN
  • GAN’s objective function & training
  • GANs and distribution

Second lecture:

  • Current issues with GANs
  • Training dynamics improvements
  • Wasserstein GAN
  • Improvements of Wasserstein GAN

102 of 111

Improved Training of Wasserstein GANs

  • Weight clipping is not a natural choice to enforce lipschitz
  • It can be proved that the optimal critics has a gradient of norm 1 almost everywhere
  • By the two observation above a it will be natural to regularize the gradient

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.

103 of 111

Regularization

The new loss is the WGAN loss plus a gradient penalty term:

L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]

where \hat{x} is sampled uniformly along straight lines between pairs of real and generated samples (a code sketch follows below).

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.
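A sketch of the gradient-penalty term for image tensors of shape (B, C, H, W); `critic` is a hypothetical module, and lambda = 10 is the value used in the paper.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Interpolate between real and fake samples and penalize critic gradients
    # whose norm deviates from 1 at those interpolated points.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```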

104 of 111

Improved Training of Wasserstein GANs results

WGAN

WGAN-GP

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Neural Information Processing Systems. 2017.

105 of 111

Improving the improved training of Wasserstein GANs

106 of 111

Improving the improved training of Wasserstein GANs

WGAN-GP

WGAN-CT

107 of 111

Improving the improved training of Wasserstein GANs

WGAN-GP

108 of 111

Improving the improved training of Wasserstein GANs

109 of 111

GANs today

Goodfellow et al. 2014

Super resolution

3-D depth from single view

Image generation

Image inpainting

Style transfer

Text to image

Image to text

110 of 111

Take home message

  • GANs are a powerful tool.
  • GAN training is still unstable (even with Wasserstein GAN).
  • GANs can be improved both by advancing the theory and by improving the training procedure.

111 of 111