Mimicking Reality:
An Overview of Generative Models
Francesco Vaselli
francesco.vaselli@cern.ch
Fifth ML-INFN Hackathon: Advanced Level
This lecture in one slide
Common building blocks!
[Diagram: common building blocks of generative models: latent space, generative model, synthetic data]
A common strategy for defining the loss
Figure stolen from the OpenAI blog post (latent space)
How a good idea can change a whole field
Generative adversarial networks:
a game-theory-inspired training dynamic
Discriminator D:
estimates the probability that a given sample comes from the real dataset
Generator G:
outputs synthetic samples given a noise variable as input (brings in stochasticity)
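As an illustration, a minimal PyTorch sketch of the adversarial game (the toy MLP architectures, noise_dim and learning rates are arbitrary choices, not from the lecture):

import torch
import torch.nn as nn

noise_dim, data_dim = 16, 2   # toy dimensions, arbitrary

# Generator G: noise z -> synthetic sample
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Discriminator D: sample -> probability of coming from the real dataset
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(real_batch):
    n = real_batch.size(0)
    z = torch.randn(n, noise_dim)        # the noise input brings in stochasticity
    fake = G(z)

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0
    loss_D = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake.detach()), torch.zeros(n, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: try to fool D, i.e. push D(G(z)) -> 1
    loss_G = bce(D(fake), torch.ones(n, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()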
Unstable training dynamics
Example of mode collapse taken from arXiv:1611.02163
Example of vanishing gradients from https://arxiv.org/pdf/1701.04862
Training is not easy; the process is known to be slow and unstable:
When the discriminator is perfect:
L(x) -> 0, so the gradient vanishes and the generator gets no weight update!
Wasserstein GAN as an improvement to training
The Wasserstein distance (Earth Mover's distance) is a measure of the distance between two probability distributions.
Idea: the loss function is configured to measure the Wasserstein distance between the real and fake data distributions!
The "discriminator" is no longer a direct critic. Instead, it is trained to learn a Lipschitz-continuous function f_w that helps compute the Wasserstein distance!
Caveat: it is not trivial to enforce the Lipschitz constraint
Figure taken from arXiv:1701.07875
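A minimal sketch of the WGAN recipe using the simple weight-clipping scheme of the original paper (the critic here outputs an unbounded score, no sigmoid; the clip value, noise_dim and toy networks are illustrative assumptions):

import torch

clip_value = 0.01   # weight-clipping range from the original WGAN recipe

def critic_step(critic, G, real_batch, opt_C, noise_dim=16):
    z = torch.randn(real_batch.size(0), noise_dim)
    fake = G(z).detach()
    # Maximise E[f_w(real)] - E[f_w(fake)], i.e. minimise its negative
    loss_C = -(critic(real_batch).mean() - critic(fake).mean())
    opt_C.zero_grad(); loss_C.backward(); opt_C.step()
    # Crude way to keep f_w (approximately) Lipschitz: clip the critic weights
    for p in critic.parameters():
        p.data.clamp_(-clip_value, clip_value)
    return loss_C.item()

def generator_step(critic, G, opt_G, batch_size, noise_dim=16):
    z = torch.randn(batch_size, noise_dim)
    loss_G = -critic(G(z)).mean()   # push the critic score of generated samples up
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_G.item()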
From Adversarial to Autoencoding
Autoencoders can be a great starting point
Encode x into a low-dimensional representation!
Encoder network: translates the original high-dimensional input into the latent low-dimensional code. The input size is larger than the output size.
Decoder network: recovers the data from the code
Why can’t I use this for generation?
Reconstruction Loss
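A minimal sketch of a plain autoencoder trained only on reconstruction loss (the layer sizes and latent_dim are arbitrary choices):

import torch
import torch.nn as nn

data_dim, latent_dim = 784, 8   # e.g. flattened 28x28 images; sizes are arbitrary

# Encoder: high-dimensional input -> low-dimensional latent code
encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
# Decoder: latent code -> reconstruction of the input
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
mse = nn.MSELoss()

def train_step(x):
    z = encoder(x)          # compress
    x_hat = decoder(z)      # reconstruct
    loss = mse(x_hat, x)    # reconstruction loss only: nothing constrains how z is organised
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()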
An irregular latent space is useless for generation!
Problem:
The autoencoder is trained solely to encode and decode with as little reconstruction loss as possible, no matter how the latent space is organised.
Meaningless points in latent space!
Random/useless samples!
A graphical model shows how we can regularize the latent space!
We want to map the input into a distribution!
Probabilistic encoder: learns to model a conditional Gaussian distribution q(z|x) given x
Probabilistic decoder: learns to model the mean of the likelihood distribution p(x|z) given z
[Diagram labels: probabilistic encoder (NN); likelihood (Gaussian with fixed covariance); posterior (Gaussian)]
A single loss is not enough!
16
We need the NN output (the approximate posterior) to match the true posterior.
We can use the Kullback-Leibler divergence to quantify the distance between these two distributions (NN output vs posterior):
D_KL(X || Y) measures how much information is lost if the distribution Y is used to represent X.
Total loss = reconstruction term + KL term
= negative Evidence Lower Bound (ELBO)
The "lower bound" part of the name comes from the fact that the KL divergence is always non-negative, so the negative loss is a lower bound of log p(x).
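Written out in the standard VAE notation (encoder q_phi(z|x), decoder p_theta(x|z), prior p(z); this is the usual form of the objective, not taken verbatim from the slide):

\log p_\theta(x) = \mathrm{ELBO} + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) \ge \mathrm{ELBO}

L_{\mathrm{VAE}} = -\mathrm{ELBO} = -\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)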
To train we need a small trick!
The expectation term in the loss function requires generating samples from the approximate posterior q(z|x).
Sampling is a stochastic process and therefore we cannot backpropagate the gradient! To make it trainable, the reparameterization trick is introduced: z = mu + sigma * eps, with eps ~ N(0, I).
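A minimal sketch of the trick (assuming the encoder outputs mu and log_var, the usual convention):

import torch

def reparameterize(mu, log_var):
    # Sample z ~ N(mu, sigma^2) in a differentiable way: the randomness is moved
    # into eps ~ N(0, I), so gradients can flow through mu and log_var,
    # which are the deterministic NN outputs.
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)   # stochastic, but independent of the parameters
    return mu + eps * std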
Finally, a variational autoencoder!
We can produce new, original data samples by sampling latent codes z (with arbitrary mean and sigma) and decoding them!
Known issues affecting latent space and samples
In essence, the very features that make VAEs effective for smooth interpolation and latent space exploration also contribute to blurry, overly smooth generated samples.
Images stolen from Enoch Kan's blog post
Wait, I really liked the idea of learning a pdf!
Why limit ourselves to just one sample?
[Figure comparing GAN and FLOW samples, from "Why I Stopped Using GAN — ECCV 2020"]
The basic idea: change of variables
We define an invertible transform f such that x = f(z), with z drawn from a simple base distribution.
The two pdfs are related by the change of variables formula.
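Written out (the standard change-of-variables formula, with x = f(z) and base density p_z):

p_x(x) = p_z(z)\left|\det\frac{\partial f}{\partial z}\right|^{-1} = p_z\big(f^{-1}(x)\big)\left|\det\frac{\partial f^{-1}}{\partial x}\right|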
The basic idea: image generation
The basic idea: complex transforms
from Lilian Weng
We need just a few building blocks!
Task: learn the f(z) that sends the base distribution into the (unknown) data distribution.
Pieces: an invertible transform f and its Jacobian determinant (see the next slides).
The usage is straightforward
Density evaluation
Sampling new data
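A toy sketch of both operations with a single learnable affine transform x = a*z + b (a deliberately minimal 1D flow, chosen here only for illustration):

import torch
from torch.distributions import Normal

base = Normal(0.0, 1.0)                       # simple base density p_z
log_a = torch.zeros(1, requires_grad=True)    # parameters of f(z) = a*z + b
b = torch.zeros(1, requires_grad=True)

def log_prob(x):
    # Density evaluation: invert the flow and apply the change of variables
    a = torch.exp(log_a)
    z = (x - b) / a
    return base.log_prob(z) - torch.log(a)    # log p_z(f^{-1}(x)) - log|df/dz|

def sample(n):
    # Sampling: draw z from the base density and push it through f
    z = base.sample((n,))
    return torch.exp(log_a) * z + b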
The loss follows from the change of variables!
Invertible transform
Jacobian for volume correction
where θ are the parameters of f(z)
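The resulting loss is the negative log-likelihood of the training data under the flow (standard form, with base density p_z):

L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\log p_z\big(f_\theta^{-1}(x_i)\big) + \log\left|\det\frac{\partial f_\theta^{-1}}{\partial x}\Big|_{x_i}\right|\right]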
Splines can be a smart choice for f(z)
Expressive
Admit analytical inverse,
fast to invert AND evaluate
We use ML to learn the optimal placement of the points and derivatives
Just one of the possible choices!
Linear transformations (Affine)
are also used a lot:
f(z)= Wz+b
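For the affine case the volume correction has a simple closed form (a standard result, shown here only for illustration, assuming W is invertible):

f(z) = Wz + b \;\Rightarrow\; \log p_x(x) = \log p_z\big(W^{-1}(x - b)\big) - \log\left|\det W\right|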
Normalizing Flows are powerful GMs!
Efficient to sample from
Efficient to evaluate
Highly expressive
Useful latent representation
Straightforward to train
Normalizing Flows are flawed GMs!
Computation of the Jacobian is hard
Not defined to work in a discrete context!
Coupling layers are a way of addressing Jacobian complexity
from Jason Yu
the Jacobian becomes triangular, so its determinant is just the product of the diagonal entries!
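A minimal sketch of an affine coupling layer in the spirit of RealNVP (the half-half split and the small conditioner MLP are illustrative choices):

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Transform only the second half of the input, conditioned on the first half.
    # Because z1 is left untouched, the Jacobian is (block-)triangular and its
    # log-determinant is just the sum of the predicted log-scales.
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        x2 = z2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)   # cheap: no full determinant needed
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        z2 = (x2 - t) * torch.exp(-log_s)
        return torch.cat([x1, z2], dim=1)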
Dequantization can be used on discrete variables
from arXiv:2001.11235
Apply a Gaussian smearing, converting discrete data into a continuum
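A minimal sketch of the idea (Gaussian noise of width sigma is assumed here; adding uniform noise in [0, 1) is another common choice):

import torch

def dequantize(x_int, sigma=0.1):
    # Smear integer-valued data into a continuous variable the flow can model
    return x_int.float() + sigma * torch.randn_like(x_int.float())

def requantize(x_cont):
    # Map continuous flow samples back to the original discrete values
    return x_cont.round().long()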
Increasing Realism step-by-step
A Markov-chain approach to generation!
Diffusion Models define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise.
Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality
We need to learn a model to approximate these conditional probabilities in order to run the reverse diffusion process.
Output of NN
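A minimal sketch of the forward (noising) process in its closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps (the linear beta schedule and T = 1000 steps are illustrative defaults, not from the lecture):

import torch

T = 1000                                        # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

def q_sample(x0, t):
    # Sample x_t ~ q(x_t | x_0): a progressively noisier version of the data
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps   # in the common DDPM parameterisation the NN is trained to predict eps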
Model in action, pros and cons
Pros: Diffusion models are both analytically tractable (can be analytically evaluated and cheaply fit data) and flexible (can fit arbitrary structures in data)
Cons: Diffusion models rely on a long Markov chain of diffusion steps to generate samples, so they can be quite expensive in terms of time and compute. New methods have been proposed to make the process much faster, but sampling is still slower than for GANs.
How do I do that??!
An illustration of an avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its centre. The therapist, a spoon, scribbles notes
DALL·E 3
Just so you know: text conditioning
A CLIP (Contrastive Language–Image Pre-training) model learns to match the latent representation of an image with that of its associated text label
This latent is then given as input to a diffusion decoder
Conclusions
Generative models are a powerful tool at our disposal
Different models have specific advantages and drawbacks
Widespread adoption in many Physics use-cases, with convincing results
No readily available implementations for our problems: we need to experiment
This lecture in one slide
Citations:
Thanks Lilian Weng!
@article{weng2021diffusion,
title = "What are diffusion models?",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2021",
month = "Jul",
url = "https://lilianweng.github.io/posts/2021-07-11-diffusion-models/"
}
@article{weng2017gan,
title = "From GAN to WGAN",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2017",
url = "https://lilianweng.github.io/posts/2017-08-20-gan/"
}
@article{weng2018VAE,
title = "From Autoencoder to Beta-VAE",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2018",
url = "https://lilianweng.github.io/posts/2018-08-12-vae/"
}
@article{weng2018flow,
title = "Flow-based Deep Generative Models",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2018",
url = "https://lilianweng.github.io/posts/2018-10-13-flow-models/"
}