Ahmad Kalhor-University of Tehran
Chapter 7
Variational Autoencoders and Generative Adversarial Networks
Deep Generative Models are DNN architectures that generate new, synthetic instances of data that can pass for real data. They are widely used in image, video, and voice generation.
Applications: Art, Animation & Entertainment, Design & Recommender Systems, Data Augmentation, Simulators, Problem Solvers, Image/Video Enhancement and Repairing, Super-Resolution, Denoising, Compression, Social Robots and Autonomous Driving Systems, ...
DGMs usually use two neural networks, pitting one alongside the other (a cooperative mechanism) or against the other (a competitive or adversarial mechanism).
Deep Generative Models (DGMs)
Popular DGMs include variational autoencoders (VAEs), generative adversarial networks (GANs), autoregressive models such as DARNs, and flow-based models.
*DARNs (Deep AutoRegressive Networks) are generative sequential models and are therefore often compared to other generative networks such as GANs and VAEs; since they are also sequence models, they show promise in traditional sequence tasks such as language processing and audio generation.
1. Variational autoencoder
In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods.[*]
[*] Auto-Encoding Variational Bayes, submitted 20 Dec 2013 (v1), last revised 10 Dec 2022.
Input: one-, two-, or three-dimensional signals (image, video, time series, text, ...)
Encoder/Decoder: MLP, convolution, LSTM, self-attention, ...
Latent space: a Gaussian distribution space.
Output: a reconstruction (regeneration) of the input.
Autoencoders have found wide application in dimensionality reduction, object detection, image classification, and image denoising. Variational autoencoders (VAEs) can be regarded as enhanced autoencoders in which a Bayesian approach is used to learn the probability distribution of the input data and to generate new data samples.
VAEs versus autoencoder
Training of VAEs
During the training of autoencoders, we utilize unlabeled data and try to minimize the following quadratic loss function:

L = ‖x − x̂‖², where x̂ = D(E(x))

This loss minimizes the distance between the original input x and the reconstructed image x̂, as shown in the figure.
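The quadratic loss above can be sketched numerically. This is a minimal toy example, assuming hypothetical linear encoder/decoder weights purely for illustration; a real autoencoder would use learned nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear encoder/decoder weights for illustration only:
# a 6-D input compressed to a 2-D latent code and reconstructed back.
W_enc = rng.normal(size=(2, 6))   # encoder E: R^6 -> R^2
W_dec = rng.normal(size=(6, 2))   # decoder D: R^2 -> R^6

def reconstruct(x):
    """x_hat = D(E(x)) for a toy linear autoencoder."""
    z = W_enc @ x          # latent code E(x)
    return W_dec @ z       # reconstruction D(z)

def quadratic_loss(x, x_hat):
    """L = ||x - x_hat||^2, the quadratic reconstruction loss."""
    return float(np.sum((x - x_hat) ** 2))

x = rng.normal(size=6)
loss = quadratic_loss(x, reconstruct(x))
print(loss)  # a non-negative scalar
```

The loss is zero only when the reconstruction matches the input exactly, which is what the training procedure drives toward.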
Loss Function for VAEs
Evidence Lower Bound (ELBO) loss
VAEs are trained by maximizing the evidence lower bound:
ELBO = E q(z|x)[ log p(x|z) ] − KL( q(z|x) ‖ p(z) )
In practical cases where empirical data are available, the expectation is estimated over mini-batches of training samples.
A typical KL Loss
For VAEs, the KL loss is the sum over latent components of the KL divergences between each component zi ~ N(μi, σi²) and the standard normal: KL = ½ Σi (σi² + μi² − 1 − log σi²). It is minimized (to zero) when every μi = 0 and σi = 1.
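The closed-form KL divergence between a diagonal Gaussian and the standard normal is short enough to verify directly. A minimal sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Sum over components of KL( N(mu_i, sigma_i^2) || N(0, 1) ),
    using the closed form 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))

# The loss vanishes exactly when every mu_i = 0 and sigma_i = 1 ...
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
# ... and grows as the encoder's distribution drifts from the prior.
print(kl_to_standard_normal([1.0, -0.5], [2.0, 0.3]) > 0)  # True
```

This is the term that pulls every latent component toward N(0, 1) during VAE training, keeping the latent space continuous.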
A typical Reconstruction Loss
Understanding Variational Autoencoders (VAEs), Joseph Rocca, 24 Sep 2019
Generating samples from a VAE
Assume a trained VAE. Generating a sample:
1. Draw a sample from p(ε).
2. Run z through the decoder to get p(x|z); it consists of the parameters of a Bernoulli or multinomial distribution.
3. Obtain a sample from this distribution.
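The three steps above can be sketched with a toy decoder. The decoder weights here are hypothetical placeholders (a trained VAE would supply them); the Bernoulli output case is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoder for illustration: maps a 2-D latent z to
# per-pixel Bernoulli parameters for a 4-pixel "image".
W = rng.normal(size=(4, 2))
b = rng.normal(size=4)

def decoder(z):
    """p(x|z): Bernoulli parameters via a sigmoid output layer."""
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))

# Step 1: draw a latent sample from the standard normal prior.
z = rng.standard_normal(2)
# Step 2: run z through the decoder to get p(x|z).
probs = decoder(z)
# Step 3: obtain a sample from this Bernoulli distribution.
x = rng.binomial(1, probs)
print(probs)  # values in (0, 1)
print(x)      # binary pixels
```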
The problem with standard autoencoders for generation
For example, training an autoencoder on the MNIST dataset, and visualizing the encodings from a 2D latent space reveals the formation of distinct clusters.
This makes sense, as distinct encodings for each image type make it far easier for the decoder to decode them. This is fine if you are just replicating the same images.
But when you are building a generative model, you don't want to replicate the same image you put in. You want to sample randomly from the latent space, or generate variations on an input image, from a continuous latent space.
If the space has discontinuities (e.g., gaps between clusters) and you sample or generate a variation from there, the decoder will simply produce an unrealistic output, because it has no idea how to deal with that region of the latent space: during training, it never saw encoded vectors from there.
Variational Autoencoders
Reconstruction(AEs) and Stochastic Generation(VAEs)
Some Notes
2. Generative Adversarial Networks (GANs)
Introductory Guide to Generative Adversarial Networks (GANs) and Their Promise!, Faizan Shaikh, June 15, 2017
Neural Networks have made great progress.
Let us see a few examples where we need human creativity (at least as of now):
GANs: A mechanism to generate the desired patterns.
“(GANs), and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” — Yann LeCun
It seems we should develop a mechanism to generate fake patterns that are very similar to real patterns.
Ian Goodfellow introduced GANs for this purpose in 2014.
But what is a GAN?
Let us take an analogy to explain the concept:
If you want to get better at something, say chess, what would you do? You would compete with an opponent better than you. Then you would analyze what you did wrong and what he/she did right, and think about what you could do to beat him/her in the next game.
You would repeat this step until you defeat the opponent. This concept can be incorporated into building better models. So, simply put, to get a powerful hero (viz. the generator), we need a more powerful opponent (viz. the discriminator)!
Another analogy from real life
A slightly more realistic analogy is the relation between a forger and an investigator.
The task of the forger is to create fraudulent imitations of original paintings by famous artists. If the created piece can pass as the original, the forger gets a lot of money in exchange for it.
On the other hand, an art investigator's task is to catch the forgers who create these fraudulent pieces. How does he do it? He knows which properties set the original artist apart and what kind of painting the artist would have created. He compares this knowledge with the piece in hand to check whether it is real.
This contest of forger versus investigator goes on, and it ultimately produces world-class investigators (and, unfortunately, world-class forgers); a battle between good and evil.
How do GANs work?
As we saw, there are two main components of a GAN – Generator Neural Network and Discriminator Neural Network.
About the above figure:
1. The generator network takes a random noise sample z from p(z) as input.
2. It then generates data G(z), which is fed into the discriminator network D(x).
3. The discriminator takes an input x from pdata(x), where pdata(x) is our real data distribution. D(x) then solves a binary classification problem using a sigmoid function, giving an output in the range 0 to 1.
Now the training of a GAN is done (as we saw above) as a “fight between the generator and the discriminator.”
The first term is the entropy of the data from the real distribution (pdata(x)) passing through the discriminator (the best-case scenario). The discriminator tries to maximize this output toward 1.
The second term is the entropy of the data from the random input (p(z)) passing through the generator, which generates a fake sample G(z) that is then passed through the discriminator to identify its fakeness (the worst-case scenario). For this term, the discriminator tries to push the output toward 0 (i.e., the log probability that the generated data is fake goes to 0).
This method of training a GAN is taken from game theory and is called the minimax game. It can be represented mathematically as:

min over G, max over D of V(D, G) = Ex~pdata(x)[ log D(x) ] + Ez~p(z)[ log(1 − D(G(z))) ]
So overall, the discriminator is trying to maximize our function V. On the other hand, the task of the generator is exactly the opposite: it tries to minimize the function V so that the difference between real and fake data is minimal. In other words, this is a cat-and-mouse game between the generator and the discriminator!
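The value function V(D, G) can be estimated by Monte Carlo over samples. This is a minimal sketch with hypothetical 1-D players (a logistic discriminator and an affine generator, with made-up parameters), not the networks from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy 1-D players for illustration (hypothetical parameters):
# discriminator D(x) = sigmoid(w*x + c), generator G(z) = a*z + b.
def D(x, w=2.0, c=0.0):
    return sigmoid(w * x + c)

def G(z, a=1.0, b=-1.0):
    return a * z + b

def value_V(real_x, z, eps=1e-12):
    """Monte Carlo estimate of
    V(D, G) = E_{x~pdata}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]."""
    term_real = np.mean(np.log(D(real_x) + eps))
    term_fake = np.mean(np.log(1.0 - D(G(z)) + eps))
    return float(term_real + term_fake)

real_x = rng.normal(loc=1.0, scale=0.5, size=1000)   # "real" data
z = rng.standard_normal(1000)                        # noise input
v = value_V(real_x, z)
print(v)  # the quantity D maximizes and G minimizes
```

Since D outputs values strictly between 0 and 1, both log terms are negative; the discriminator pushes V up toward 0 while the generator pushes it down.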
Pass 1: Train the discriminator and freeze the generator (freezing means setting training to false; the network does only a forward pass, and no backpropagation is applied). The discriminator sees real samples x and fake samples G(z) from the frozen generator.
Pass 2: Train the generator and freeze the discriminator. The generator is updated so that its samples G(z) better fool the frozen discriminator.
Steps to train a GAN
A pseudocode of GAN training can be thought of as follows:
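The two alternating passes can be made concrete in a minimal runnable sketch. Everything here is a toy assumption for illustration: real data is 1-D Gaussian, the generator is G(z) = a·z + b, the discriminator is D(x) = sigmoid(w·x + c), gradients are derived by hand, and the generator uses the common non-saturating loss (maximize log D(G(z))) rather than minimizing log(1 − D(G(z))).

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy 1-D GAN: real data ~ N(3, 0.5). Generator G(z) = a*z + b,
# discriminator D(x) = sigmoid(w*x + c). Initial parameters and
# learning rate are illustrative choices.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.05

for step in range(2000):
    x_real = rng.normal(3.0, 0.5, size=64)
    z = rng.standard_normal(64)

    # Pass 1: update the discriminator, generator frozen.
    # Minimize -mean(log D(x_real)) - mean(log(1 - D(G(z)))).
    x_fake = a * z + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    grad_w = np.mean(-(1 - d_real) * x_real) + np.mean(d_fake * x_fake)
    grad_c = np.mean(-(1 - d_real)) + np.mean(d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # Pass 2: update the generator, discriminator frozen.
    # Non-saturating loss: minimize -mean(log D(G(z))).
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    grad_a = np.mean(-(1 - d_fake) * w * z)
    grad_b = np.mean(-(1 - d_fake) * w)
    a -= lr * grad_a
    b -= lr * grad_b

# After training, generated samples drift toward the real mean (~3).
samples = a * rng.standard_normal(5000) + b
print(samples.mean())
```

A real GAN replaces the hand-derived gradients with backpropagation through two deep networks, but the alternating freeze/update structure is exactly this.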
We will try to generate digits by training a GAN
After training for 100 epochs, I got the following generated images
Applications of GAN
Predicting the next frame in a video: train a GAN on video sequences and let it predict what occurs next.
Increasing the resolution of an image: generate a high-resolution photo from a comparatively low-resolution one.
Previous Attempts
SRGAN
Image-to-image translation: generate an image from another image. For example, given labels of a street scene on the left, you can generate a realistic-looking photo with a GAN. On the right, you give a simple drawing of a handbag and get a realistic-looking photo of a handbag.
Text-to-image generation: just tell your GAN what you want to see and get a realistic photo of the target.
Interactive image generation (iGAN): draw simple strokes and let the GAN draw an impressive picture for you.
A General Plan to generate solutions for challenging problems
[Figure: a conditional GAN scheme in which the problem serves as the condition. The generator G(Cond, z) maps a noise sample z ~ Pz and the problem condition to a candidate solution; the discriminator receives (problem, solution) pairs, either real pairs x ~ Px or generated ones, and outputs (1, 0) for real versus fake.]
Challenges with GANs
Problems with counting: GANs fail to differentiate how many of a particular object should occur at a location. As we can see below, it generates more eyes in the head than are naturally present.
Problems with perspective: GANs fail to adapt to 3D objects. They do not understand perspective, i.e., the difference between front view and back view. As we can see below, they give a flat (2D) rendering of a 3D object.
Problems with global structures: as with perspective, GANs do not understand holistic structure. For example, in the bottom-left image, it gives a generated image of a “quadruple cow”, i.e., a cow standing on its hind legs and simultaneously on all four legs. That is definitely not possible in real life!
Some Extensions on GANs
(1) Deep Convolutional GANs (DCGANs)
https://arxiv.org/abs/1511.06434 (Jan 2016)
In this article, we will see how a neural network maps random noise to an image matrix, and how using convolutional layers in the generator network produces better results.
The original DCGAN architecture (“Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”) has four convolutional layers for the discriminator and four fractionally-strided convolution layers for the generator.
(DCGAN) Generator
This network takes a 100x1 noise vector, denoted z, and maps it to the G(z) output, which is 64x64x3.
This architecture is especially interesting in the way the first layer expands the random noise: the network goes from 100x1 to 1024x4x4. This layer is denoted “project and reshape”. Following it, classical convolutional layers are applied. In the diagram above, the spatial parameter N (height/width) goes from 4 to 8 to 16 to 32, the kernel filter parameter F is 5x5, and the stride S is 2. The output size of a fractionally-strided (transposed) convolution, O = S·(N − 1) + F − 2P + output padding, may be useful for designing your own convolutional layers for customized output sizes.
We see that the network goes from
100x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 64x64x3
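The spatial sizes in this chain can be checked with the transposed-convolution output formula. The padding P = 2 and output padding 1 are assumed settings (one choice that makes a 5x5 kernel at stride 2 exactly double the resolution); the paper's exact padding configuration may differ.

```python
def tconv_out(i, f=5, s=2, p=2, op=1):
    """Spatial output size of a transposed (fractionally-strided)
    convolution: O = S*(I - 1) + F - 2P + output_padding."""
    return s * (i - 1) + f - 2 * p + op

# With f=5, s=2, p=2, op=1, each layer doubles the resolution,
# reproducing the DCGAN generator's spatial sizes: 4 -> 8 -> 16 -> 32 -> 64.
sizes = [4]
for _ in range(4):
    sizes.append(tconv_out(sizes[-1]))
print(sizes)  # [4, 8, 16, 32, 64]
```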
Above is the output from the network presented in the paper, citing that this came after 5 epochs of training. Pretty impressive stuff.
(2) Boundary-Seeking Generative Adversarial Networks (BGAN)
https://arxiv.org/abs/1702.08431 (2017)
Training a GAN with different generator loss functions and 5 updates for the generator for every update of the discriminator.
Over-optimizing the generator can lead to instability and poorer results depending on the generator objective function.
Samples for GAN and GAN with the proxy loss are quite poor at 50 discriminator epochs (250 generator epochs), while BGAN is noticeably better.
At 100 epochs, these models have improved, though are still considerably behind BGAN.
(3) Conditional GANs (cGANs)
(Submitted on 6 Nov 2014)
1. These GANs use extra label information, which results in better-quality images and the ability to control what the generated images will look like. cGANs learn to produce better images by exploiting the information fed to the model.
2. This gives the end user a mechanism for controlling the generator output.
Results testing Conditional GANs on MNIST
The results of Conditional GANs are very impressive. They allow for much greater control over the final output from the generator
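The conditioning mechanism in a cGAN is usually simple concatenation: both networks see the label alongside their ordinary input. A minimal sketch (the dimensions, 100-D noise and 784-D flattened MNIST images, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def one_hot(label, num_classes=10):
    """Encode a class label as a one-hot vector."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Generator input: noise z concatenated with the one-hot label y.
z = rng.standard_normal(100)
y = one_hot(7)                       # e.g., "generate a 7"
gen_input = np.concatenate([z, y])   # shape (110,)

# Discriminator input: the (real or fake) sample x with the same y,
# so D judges "is this a realistic image of THIS class?".
x = rng.random(784)                  # a flattened 28x28 image
disc_input = np.concatenate([x, y])  # shape (794,)
print(gen_input.shape, disc_input.shape)
```

Because both players condition on the same label, the generator is rewarded only for samples that look real for that specific class, which is what gives the user control over the output.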
AC-GAN (Auxiliary Classifier GANs)
The sample generated images from CIFAR-10 dataset.
The sample generated images from ImageNet dataset.
In the AC-GAN paper, 100 different GAN models each handle 10 different classes from the ImageNet dataset, which consists of 1,000 different object categories.
(4) Super Resolution GAN (SR GAN)
Source: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi.
In this example, 90 x 90 to 360 x 360 is denoted as an up-scaling factor of 4x.
For example, this network could take a 64x64 low-resolution patch, convolve over it a couple of times so that the feature map is something like 16x16x512, flatten it into a vector, apply a couple of fully-connected layers, reshape it, and finally up-sample it into a 256x256 high-resolution patch through transposed convolutional layers.
(5) StackGAN
The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details.
Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, Han Zhang, …..
The architecture of the proposed StackGAN.
Comparison
(6) Discovering Cross-Domain Relations with GANs (DiscoGAN)
[Figure: in DiscoGAN, a generator G maps shoes to a fake bag conditioned on the shoes, and a discriminator D judges the generated bags against real bags.]
(7) Flow-Based GANs
Oct 13, 2018, by Lilian Weng
Here is a quick summary of the difference between GAN, VAE, and flow-based generative models: GANs learn an implicit data distribution through adversarial training of a generator and a discriminator; VAEs optimize a variational lower bound (the ELBO) on the data log-likelihood; flow-based models learn an invertible transform of a simple distribution, so the exact log-likelihood can be computed and maximized directly.
Types of Generative Models
Toward using both maximum likelihood and adversarial training
(8) InfoGANs
Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, NIPS 2016
Thank you