1 of 67

    • Variational Autoencoders
    • Generative Adversarial Networks
    • Some Architectures (DCGAN, CGAN, ACGAN, SRGAN, InfoGAN, ...)

Ahmad Kalhor-University of Tehran


Chapter 7

Variational Autoencoders and Generative Adversarial Networks

2 of 67

Deep Generative Models are DNN architectures that generate new, synthetic instances of data which can pass for real data. They are widely used in image, video, and voice generation.

Applications: art, animation & entertainment, design & recommender systems, data augmentation, simulators, problem solvers, image/video enhancement and repairing, super-resolution, denoising, compression, social robots, autonomous driving systems, ...

DGMs usually use two neural networks, pitting one alongside the other (a cooperative mechanism) or against it (a competitive or adversarial mechanism).

Popular DGMs include:

  1. Variational Auto-Encoders (VAEs)
  2. Generative Adversarial Networks (GANs)
  3. Deep Autoregressive Models*

Deep Generative Models (DGMs)

*DARNs (Deep AutoRegressive Networks) are sequential generative models and are therefore often compared to other generative networks such as GANs or VAEs; however, they are also sequence models and show promise in traditional sequence challenges such as language processing and audio generation.

3 of 67

1. Variational autoencoder

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.[*]

[*] Diederik P. Kingma and Max Welling, "Auto-Encoding Variational Bayes," submitted 20 Dec 2013 (v1), last revised 10 Dec 2022.

Input: 1-D, 2-D, or 3-D signals (image, video, time series, text, ...)

Encoder (Decoder): MLP, convolution, LSTM, self-attention, ...

Latent Space: Gaussian Distribution Space.

Output: a reconstruction or regenerated form of the input.

Autoencoders have found wide use in dimensionality reduction, object detection, image classification, and image denoising. Variational Autoencoders (VAEs) can be regarded as enhanced autoencoders in which a Bayesian approach is used to learn the probability distribution of the input data and to generate new data.

4 of 67

VAEs versus autoencoders


5 of 67

Training of VAEs

During the training of autoencoders, we would like to utilize the unlabeled data and try to minimize the following quadratic loss function:

$$\mathcal{L}(x, \hat{x}) = \|x - \hat{x}\|^2 = \|x - d(e(x))\|^2$$

where $e(\cdot)$ denotes the encoder and $d(\cdot)$ the decoder. The above equation tries to minimize the distance between the original input and the reconstructed image, as shown in the figure.
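A minimal sketch of this objective (a hypothetical tiny autoencoder on flattened 28x28 inputs; layer sizes are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

# Hypothetical tiny autoencoder: encoder e(.) and decoder d(.)
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

x = torch.rand(32, 784)            # a batch of unlabeled inputs
x_hat = decoder(encoder(x))        # reconstruction d(e(x))
loss = ((x - x_hat) ** 2).mean()   # quadratic loss ||x - d(e(x))||^2
opt.zero_grad()
loss.backward()
opt.step()
```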

 

 

6 of 67

 

 

7 of 67

Loss Function for VAEs

$$\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$

(a reconstruction loss plus a KL regularization loss)

 

8 of 67

Evidence Lower Bound (ELBO) loss:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \mathrm{ELBO}(\theta, \phi)$$

Maximizing the ELBO is equivalent to minimizing the VAE loss above.

9 of 67

In practical cases where empirical data are available:

A typical KL loss:

For VAEs, the KL loss is the sum of the KL divergences between each latent component $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ and the standard normal,

$$\mathcal{L}_{KL} = \frac{1}{2}\sum_{i}\left(\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2\right),$$

which is minimized when $\mu_i = 0$ and $\sigma_i = 1$.

A typical reconstruction loss:

$$\mathcal{L}_{rec} = \|x - \hat{x}\|^2$$
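A minimal PyTorch sketch of these two terms, assuming the encoder outputs mu and log_var (the log-variance) of the latent Gaussian; the function name and signature are illustrative:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """ELBO-style VAE loss: reconstruction term plus KL term."""
    # Reconstruction: squared error between input and reconstruction
    rec = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims:
    # 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var)
    return rec + kl
```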

10 of 67

Understanding Variational Autoencoders (VAEs), Joseph Rocca, 24 Sep 2019

11 of 67

Generating samples from a VAE

Assume a trained VAE. Generating a sample:
  1. Draw a sample z from the prior p(z).
  2. Run z through the decoder to get p(x|z); it consists of the parameters of a Bernoulli or multinomial distribution.
  3. Obtain a sample from this distribution.
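A minimal sketch of this procedure, assuming a trained decoder whose outputs are Bernoulli pixel probabilities (the decoder below is an untrained stand-in with illustrative sizes):

```python
import torch
import torch.nn as nn

latent_dim = 16
# Stand-in for a trained VAE decoder mapping z to Bernoulli parameters
decoder = nn.Sequential(nn.Linear(latent_dim, 784), nn.Sigmoid())

z = torch.randn(1, latent_dim)  # 1. draw a sample z ~ p(z) = N(0, I)
probs = decoder(z)              # 2. run z through the decoder to get p(x|z)
x = torch.bernoulli(probs)      # 3. obtain a sample from this distribution
```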

12 of 67

The problem with standard autoencoders for generation

  • The fundamental problem with autoencoders, for generation, is that the latent space they convert their inputs to and where their encoded vectors lie, may not be continuous, or allow easy interpolation.

For example, training an autoencoder on the MNIST dataset, and visualizing the encodings from a 2D latent space reveals the formation of distinct clusters.

This makes sense, as distinct encodings for each image type makes it far easier for the decoder to decode them. This is fine if you’re just replicating the same images.

But when you're building a generative model, you don't want to replicate the image you put in. You want to randomly sample from the latent space, or generate variations on an input image, from a continuous latent space.

If the space has discontinuities (e.g., gaps between clusters) and you sample/generate a variation from there, the decoder will simply generate an unrealistic output, because it has no idea how to deal with that region of the latent space: during training, it never saw encoded vectors coming from that region.

13 of 67

Variational Autoencoders

  • Variational Autoencoders (VAEs) have one fundamentally unique property that separates them from vanilla autoencoders, and it is this property that makes them so useful for generative modeling: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation.
  • They achieve this by doing something that seems rather surprising at first: making the encoder output not a single encoding vector of size n, but two vectors of size n: a vector of means, μ, and a vector of standard deviations, σ.

14 of 67

  • These form the parameters of a vector of random variables of length n, with the i-th elements of μ and σ being the mean and standard deviation of the i-th random variable, X_i, from which we sample to obtain the sampled encoding that we pass onward to the decoder (a code sketch follows below):

  • Stochastically generating encoding vectors
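In practice this sampling step is usually implemented with the reparameterization trick, z = μ + σ ⊙ ε with ε ~ N(0, I), which keeps the sample differentiable with respect to μ and σ; a minimal sketch with illustrative names:

```python
import torch

def sample_encoding(mu, log_var):
    """Sample z ~ N(mu, sigma^2) element-wise via the reparameterization
    trick, so gradients can flow back through mu and log_var."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(sigma)   # eps ~ N(0, I)
    return mu + sigma * eps         # stochastic encoding passed to the decoder

mu, log_var = torch.zeros(1, 8), torch.zeros(1, 8)
z1 = sample_encoding(mu, log_var)   # differs from...
z2 = sample_encoding(mu, log_var)   # ...this, despite identical (mu, sigma)
```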

15 of 67

Reconstruction (AEs) and Stochastic Generation (VAEs)

  • This stochastic generation means that, even for the same input, while the mean and standard deviation remain the same, the actual encoding will vary somewhat on every single pass, simply due to sampling.

16 of 67

Some Notes

  1. Intuitively, the mean vector controls where the encoding of an input is centered, while the standard deviation controls the "area": how far from the mean the encoding can vary.
  2. As encodings are generated at random from anywhere inside the "circle" (the distribution), the decoder learns that not only does a single point in latent space refer to a sample of that class, but all nearby points do too.
  3. This allows the decoder to decode not just single, specific encodings in the latent space (which would leave the decodable latent space discontinuous), but ones that vary slightly too, since the decoder is exposed to a range of variations of the encoding of the same input during training.

17 of 67

  1. The model is now exposed to a certain degree of local variation by varying the encoding of one sample, resulting in latent spaces that are smooth on a local scale, that is, for similar samples.
  2. Ideally, we want overlap between samples that are not very similar too, in order to interpolate between classes.
  3. However, since there are no limits on the values the vectors μ and σ can take, the encoder can learn to generate very different μ for different classes, clustering them apart, and to minimize σ, making sure the encodings themselves do not vary much for the same sample (that is, less uncertainty for the decoder).
  4. This allows the decoder to efficiently reconstruct the training data.

18 of 67

  • What we ideally want are encodings, all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.

19 of 67

2. Generative Adversarial Networks (GANs)
Introductory guide to Generative Adversarial Networks (GANs) and their promise!, Faizan Shaikh, June 15, 2017

  • Introduction

Neural Networks have made great progress.

  1. They now recognize images and voice at levels comparable to humans.
  2. They are also able to understand natural language with good accuracy.

Let us see a few examples where we need human creativity (at least as of now):

  • Train an artificial author that can write articles and explain data science concepts to a community in a very simple manner by learning from past articles on Analytics Vidhya.

  • You may not be able to buy a painting by a famous painter; it might be too expensive. Can you create an artificial painter that can paint like any famous artist by learning from his/her past collections?

20 of 67

GANs: A mechanism to generate the desired patterns.

  • Yann LeCun, a prominent figure in the deep learning domain, said in his Quora session:

"(GANs), and the variations that are now being proposed, is the most interesting idea in the last 10 years in ML, in my opinion."

It seems we should develop a mechanism to generate fake patterns that are highly similar to real patterns.

Ian Goodfellow introduced GANs (in 2014) for exactly this purpose.

21 of 67

But what is a GAN?

Let us take an analogy to explain the concept:

If you want to get better at something, say chess, what would you do? You would compete with an opponent better than you. Then you would analyze what you did wrong and what he/she did right, and think about what you could do to beat him/her in the next game.

You would repeat this step until you defeat the opponent. This concept can be incorporated to build better models. So, simply put, to get a powerful hero (viz. the generator), we need a more powerful opponent (viz. the discriminator)!

Another analogy from real life

A slightly more realistic analogy is the relation between a forger and an investigator.

The task of the forger is to create fraudulent imitations of original paintings by famous artists. If the created piece can pass as the original, the forger gets a lot of money in exchange for it.

On the other hand, an art investigator's task is to catch the forgers who create these fraudulent pieces. How does he do it? He knows the properties that set the original artist apart and the kind of painting the artist would have created. He evaluates the piece in hand against this knowledge to check whether it is real or not.

This contest of forger versus investigator goes on, and ultimately produces world-class investigators (and, unfortunately, world-class forgers); a battle between good and evil.

22 of 67

How do GANs work?

As we saw, there are two main components of a GAN – Generator Neural Network and Discriminator Neural Network.

23 of 67

About the above Figure:

  1. The Generator Network takes random input and tries to generate a sample of data.
  2. In the above image, we can see that the generator G(z) takes an input z from p(z), where z is a sample from the probability distribution p(z).
  3. It then generates data which is fed into the discriminator network D(x).
  4. The task of the Discriminator Network is to take input either from the real data or from the generator and try to predict whether the input is real or generated.
  5. It takes an input x from pdata(x), where pdata(x) is our real data distribution. D(x) then solves a binary classification problem using a sigmoid function, giving output in the range 0 to 1.

  • Let us define the notation we will be using to formalize our GAN:
  • pdata(x) → the distribution of the real data
  • x → a sample from pdata(x)
  • p(z) → the distribution of the generator's input noise
  • z → a sample from p(z)
  • G(z) → the generator network
  • D(x) → the discriminator network

24 of 67

Now the training of the GAN is done (as we saw above) as a

"fight between the generator and the discriminator."

  • In our value function V(D, G), the first term is the (log-)likelihood that data from the real distribution pdata(x) passes through the discriminator (also known as the best-case scenario). The discriminator tries to maximize this toward 1.

  • The second term concerns data from the random input p(z): the generator turns it into a fake sample, which is then passed through the discriminator to identify its fakeness (aka the worst-case scenario). In this term, the discriminator tries to drive D(G(z)) to 0 (i.e., the probability that generated data is classified as real should be 0).

This can be represented mathematically as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$$
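A sketch of how these two terms are commonly implemented with binary cross-entropy (labels: 1 = real, 0 = fake). Note that the generator loss shown is the widely used non-saturating variant (maximize log D(G(z))) rather than the literal minimization of log(1 − D(G(z))):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """D maximizes log D(x) + log(1 - D(G(z))): with BCE this means
    predicting 1 on real samples and 0 on generated ones."""
    real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_loss(d_fake):
    """G tries to make D output 1 on its samples (non-saturating form)."""
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```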

25 of 67

So, overall, the discriminator tries to maximize our function V.

The task of the generator is exactly the opposite:

it tries to minimize the function V so that the difference between real and fake data becomes minimal.

This, in other words, is a cat-and-mouse game between the generator and the discriminator!

  • Note: This method of training a GAN is taken from game theory and is called the minimax game.

26 of 67

Pass 1: Train the discriminator and freeze the generator (freezing means setting training to false; the network only does a forward pass, and no backpropagation is applied).


27 of 67

Pass 2: Train generator and freeze discriminator.


28 of 67

Steps to train a GAN

  • Step 1: Define the problem. Do you want to generate fake images or fake text? Completely define the problem and collect data for it.
  • Step 2: Define the architecture of the GAN. How should your GAN look? Should both your generator and discriminator be multilayer perceptrons, or convolutional neural networks? This step depends on the problem you are trying to solve.
  • Step 3: Train the discriminator on real data for n epochs. Take the data you want to generate fakes of and train the discriminator to correctly predict it as real. Here n can be any natural number.
  • Step 4: Generate fake inputs with the generator and train the discriminator on the fake data, letting it correctly predict them as fake.
  • Step 5: Train the generator with the output of the discriminator. Now that the discriminator is trained, you can use its predictions as an objective for training the generator: train the generator to fool the discriminator.
  • Step 6: Repeat steps 3 to 5 for a few epochs.
  • Step 7: Check manually whether the fake data seems legit. If it seems appropriate, stop training; otherwise go to step 3. This is a bit of a manual task, as hand-evaluating the data is the best way to check its fakeness. When this step is over, you can evaluate whether the GAN is performing well enough.

29 of 67

GAN training can be expressed as the following pseudocode:
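A minimal sketch of this alternating training scheme (toy fully-connected networks; all sizes and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())    # toy generator
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # toy discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.rand(64, 784)   # stand-in for a batch of real data
    z = torch.randn(64, 100)

    # Pass 1: train D, freeze G (detach() blocks gradients into G)
    d_loss = (F.binary_cross_entropy(D(real), torch.ones(64, 1)) +
              F.binary_cross_entropy(D(G(z).detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Pass 2: train G, freeze D (only G's optimizer takes a step)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```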

30 of 67

We will try to generate digits by training a GAN.

After training for 100 epochs, I got the following generated images

31 of 67

Applications of GAN

Predicting the next frame in a video: train a GAN on video sequences and let it predict what will occur next.

32 of 67

Increasing the resolution of an image: generate a high-resolution photo from a comparatively low-resolution one.

 

(Figure: super-resolution results comparing previous attempts with SRGAN.)

33 of 67

Image-to-image translation: generate an image from another image. For example, on the left, given the labels of a street scene, you can generate a realistic-looking photo with a GAN. On the right, given a simple drawing of a handbag, you get a realistic-looking handbag.

34 of 67

Text-to-image generation: just tell your GAN what you want to see and get a realistic photo of the target.

35 of 67

Interactive image generation (iGAN): draw simple strokes and let the GAN draw an impressive picture for you.

36 of 67

A General Plan to generate solutions for challenging problems

(Figure: a conditional GAN scheme. The problem is fed in as the condition; the generator G(Cond, z) maps noise z ~ Pz together with the condition to a candidate solution, while the discriminator receives (problem, solution) pairs, either real (x ~ Px) or generated, and outputs real/fake (1, 0).)

37 of 67

Challenges with GANs

Problem with Counting: GANs fail to differentiate how many of a particular object should occur at a location. As we can see below, the GAN generates more eyes in the head than are naturally present.

38 of 67

Problems with Perspective: GANs fail to adapt to 3D objects. They do not understand perspective, i.e., the difference between front view and back view. As we can see below, the GAN gives a flat (2D) representation of a 3D object.

39 of 67

Problems with Global Structure: as with the problem with perspective, GANs do not understand holistic structure. For example, in the bottom-left image, it gives a generated image of a "quadruple cow", i.e., a cow standing on its hind legs and simultaneously on all four legs. That is definitely not possible in real life!

40 of 67

Some Extensions on GANs

  1. Bidirectional GAN
  2. Context-Conditional GAN
  3. Context Encoder
  4. Coupled GANs
  5. CycleGAN
  6. DualGAN
  7. LSGAN
  8. Pix2Pix
  9. PixelDA
  10. Wasserstein GAN GP
  11. Adversarial Autoencoder
  12. DCGAN
  13. BGAN
  14. SRGAN
  15. CGAN
  16. Auxiliary Classifier GAN
  17. Semi-Supervised GAN
  18. StackGAN
  19. DiscoGAN
  20. Flow-based GANs
  21. InfoGAN
  22. Wasserstein GAN

41 of 67

(1) Deep Convolutional GANs (DCGANs), https://arxiv.org/abs/1511.06434 (Jan 2016)

In this article, we will see how a neural net maps from random noise to an image matrix and how using Convolutional Layers in the generator network produces better results.

The original DCGAN architecture ("Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks") has four convolutional layers in the discriminator and four fractionally-strided convolutional layers in the generator.

42 of 67

(DCGAN) Generator

This network takes in a 100x1 noise vector, denoted z, and maps it to the G(z) output, which is 64x64x3.

What makes this architecture especially interesting is the way the first layer expands the random noise: the network goes from 100x1 to 1024x4x4! This layer is denoted "project and reshape". Following this layer, classical (transposed) convolutional layers are applied. In the diagram above, the spatial parameter N (height/width) goes from 4 to 8 to 16 to 32; there does not appear to be any padding, the kernel filter parameter F is 5x5, and the stride S is 2. The standard convolution output-size relation, $O = \frac{W - F + 2P}{S} + 1$ (which for a fractionally-strided/transposed convolution inverts to $W = S(O - 1) + F - 2P$), may be useful for designing your own convolutional layers for customized output sizes.

We see that the network goes from

100x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 64x64x3
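A minimal PyTorch sketch following these layer sizes. Kernel size 4 with stride 2 and padding 1 is a common choice that exactly doubles the spatial size at each step; the paper's figure shows 5x5 filters, so treat this as an approximation of the original architecture:

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        # "Project and reshape": 100 -> 1024 x 4 x 4
        self.project = nn.Linear(z_dim, 1024 * 4 * 4)
        self.net = nn.Sequential(
            nn.BatchNorm2d(1024), nn.ReLU(True),
            nn.ConvTranspose2d(1024, 512, 4, stride=2, padding=1),  # 4 -> 8
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),   # 8 -> 16
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),   # 16 -> 32
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),     # 32 -> 64
            nn.Tanh(),  # image values in [-1, 1]
        )

    def forward(self, z):
        h = self.project(z).view(-1, 1024, 4, 4)
        return self.net(h)

fake = DCGANGenerator()(torch.randn(16, 100))  # -> (16, 3, 64, 64)
```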

43 of 67

Above is the output from the network presented in the paper, which the authors report came after 5 epochs of training. Pretty impressive stuff.

44 of 67

(2) Boundary-Seeking Generative Adversarial Networks (BGAN), https://arxiv.org/abs/1702.08431 (2017)

  • We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator.
  • The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs).
  • We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation.
  • In addition, the boundary-seeking objective extends to continuous data, which can be used to improve the stability of training; we demonstrate this on CelebA, Large-scale Scene Understanding (LSUN) bedrooms, and ImageNet without conditioning.

45 of 67

Training a GAN with different generator loss functions and 5 updates for the generator for every update of the discriminator.

Over-optimizing the generator can lead to instability and poorer results depending on the generator objective function.

Samples for GAN and GAN with the proxy loss are quite poor at 50 discriminator epochs (250 generator epochs), while BGAN is noticeably better.

At 100 epochs, these models have improved, though they are still considerably behind BGAN.

46 of 67

47 of 67

(3) Conditional GANs (cGANs)

Mehdi Mirza, Simon Osindero

(Submitted on 6 Nov 2014)

1. These GANs use extra label information, which results in better-quality images, and they make it possible to control what the generated images will look like. cGANs learn to produce better images by exploiting the information fed to the model.

2. This also gives the end user a mechanism for controlling the generator's output (see the sketch below).
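A minimal sketch of this conditioning mechanism for MNIST-sized images. Embedding the label and concatenating it with the noise vector is one common implementation; the layer sizes and names here are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy MLP generator: the class label is embedded and concatenated
    with the noise vector z, so the label controls the output."""
    def __init__(self, z_dim=100, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(True),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = self.embed(labels)                  # (batch, n_classes)
        return self.net(torch.cat([z, cond], 1))   # condition steers G

g = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.tensor([0, 1, 2, 3])  # ask for specific digit classes
imgs = g(z, labels)                  # -> (4, 784)
```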

48 of 67

Results testing Conditional GANs on MNIST

The results of Conditional GANs are very impressive. They allow much greater control over the final output of the generator.

49 of 67

AC-GAN: Auxiliary Classifier GANs

  • By adding an auxiliary classifier to the discriminator of a GAN, the discriminator produces not only a probability distribution over sources (real vs. fake) but also a probability distribution over the class labels.
  • Source: Augustus Odena, Christopher Olah, Jonathon Shlens. Conditional Image Synthesis with Auxiliary Classifier GANs. 2016.

Sample generated images from the CIFAR-10 dataset.

50 of 67

Sample generated images from the ImageNet dataset.

In the AC-GAN paper, 100 different GAN models each handle 10 classes from the ImageNet dataset, which consists of 1,000 different object categories.

51 of 67

(4) Super-Resolution GAN (SRGAN)

  • Super-resolution is a task concerned with upscaling images from low-resolution sizes such as 90 x 90 to high-resolution sizes such as 360 x 360.

Source: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi.

52 of 67

In this example, 90 x 90 to 360 x 360 is denoted as an up-scaling factor of 4x.

  • These networks learn a mapping from the low-resolution patch through a series of convolutional, fully-connected, or transposed convolutional layers into the high-resolution patch.

53 of 67

For example, this network could take a 64 x 64 low-resolution patch, convolve over it a few times until the feature map is something like 16 x 16 x 512, flatten it into a vector, apply a couple of fully-connected layers, reshape the result, and finally up-sample it into a 256 x 256 high-resolution patch through transposed convolutional layers, as in the sketch below.
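A minimal sketch of exactly this pipeline; the quoted sizes (64x64 input, 16 x 16 x 512 feature map, 256x256 output) come from the text above, while everything else (channel counts, kernel sizes) is illustrative:

```python
import torch
import torch.nn as nn

class SRNet(nn.Module):
    """64x64 patch -> convs -> 16x16x512 -> flatten -> FC -> reshape
    -> transposed convs -> 256x256 patch."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 128, 3, stride=2, padding=1), nn.ReLU(True),    # 64 -> 32
            nn.Conv2d(128, 512, 3, stride=2, padding=1), nn.ReLU(True),  # 32 -> 16
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 16 * 512, 1024), nn.ReLU(True),
            nn.Linear(1024, 16 * 16 * 64), nn.ReLU(True),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(True),  # 16 -> 32
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),  # 32 -> 64
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(True),  # 64 -> 128
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh(),       # 128 -> 256
        )

    def forward(self, x):
        h = self.down(x).flatten(1)          # (batch, 16*16*512)
        h = self.fc(h).view(-1, 64, 16, 16)  # reshape back to a feature map
        return self.up(h)

out = SRNet()(torch.randn(1, 3, 64, 64))  # -> (1, 3, 256, 256)
```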

54 of 67

(5) StackGAN

  • The authors of this paper propose a solution to the problem of synthesizing high-quality images from text descriptions in computer vision. They propose Stacked Generative Adversarial Networks (StackGAN) to generate 256x256 photo-realistic images conditioned on text descriptions. They decompose the hard problem into more manageable sub-problems through a sketch-refinement process.

The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details.

"Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks," Han Zhang et al.

55 of 67

The architecture of the proposed StackGAN.

56 of 67

Comparison

57 of 67

Comparison

58 of 67

(6) Discovering Cross-Domain Relations with GANs (DiscoGAN)

  • The authors of this paper propose a method based on generative adversarial networks that learns to discover relations between different domains (without any extra labels). Using the discovered relations, the network transfers style from one domain to another.

(Figure: a generator G maps shoes to a fake bag conditioned on the shoes, and a discriminator D judges the result against real bags.)

59 of 67

(7) Flow-Based GANs (Oct 13, 2018, by Lilian Weng)

Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:

  1. Generative adversarial networks: GAN provides a smart solution to modeling data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples produced by the generator model. The two models are trained as if they are playing a minimax game.
  2. Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
  3. Flow-based generative models: a flow-based generative model is constructed from a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(x), and therefore the loss function is simply the negative log-likelihood (see the formula below).
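For reference, the exact-likelihood criterion of a flow: if $x = f(z)$ with $f$ invertible and $z \sim p_Z$, the change-of-variables formula gives the likelihood, and training minimizes the negative log-likelihood:

$$\log p(x) = \log p_Z\big(f^{-1}(x)\big) + \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|, \qquad \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p\big(x^{(i)}\big)$$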

60 of 67

Types of Generative Models

61 of 67

Toward using both maximum likelihood and adversarial training

  1. Implicit models such as generative adversarial networks (GANs) often generate better samples than explicit models trained by maximum likelihood.

  • However, we know that methods based on maximum likelihood explicitly learn the probability density function of the input data.

  • To bridge this gap, we propose Flow-GANs, a generative adversarial network for which we can perform exact likelihood evaluation, thus supporting both adversarial and maximum likelihood training.

62 of 67

63 of 67

(8) InfoGAN
Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, NIPS 2016

  • InfoGAN is an information-theoretic extension to the GAN that is able to learn disentangled representations in an unsupervised manner. InfoGANs are used when your dataset is very complex, when you would like to train a cGAN but the dataset is not labelled, and when you would like to see the most important features of your images.

64 of 67

65 of 67

66 of 67

67 of 67

Thank you