1 of 40

Mimicking Reality:

An Overview of Generative Models

Francesco Vaselli

francesco.vaselli@cern.ch

Fifth ML-INFN Hackathon: Advanced Level

5 of 40

This lecture in one slide

Scheme stolen from Lilian Weng's blog post

6 of 40

Common building blocks!

Diagram: latent space → generative model → synthetic data

7 of 40

A common strategy for defining the loss

Figure stolen from the OpenAI blog post (latent space)

8 of 40

How a good idea can change a whole field


9 of 40

Generative adversarial networks: a game-theory-inspired training dynamic

Discriminator D: estimates the probability that a given sample comes from the real dataset.

Generator G: outputs synthetic samples given a noise variable as input (this brings in the stochasticity).
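A minimal sketch of one adversarial training step (my own illustration, not from the slides), assuming PyTorch modules generator and discriminator, where the discriminator returns one logit per sample:

    import torch
    import torch.nn.functional as F

    def gan_training_step(generator, discriminator, opt_g, opt_d, real, noise_dim=64):
        batch = real.size(0)
        ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

        # Update D: push D(real) -> 1 and D(G(z)) -> 0
        fake = generator(torch.randn(batch, noise_dim)).detach()   # no gradients into G here
        d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), ones)
                  + F.binary_cross_entropy_with_logits(discriminator(fake), zeros))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Update G: fool D, i.e. push D(G(z)) -> 1 (non-saturating loss)
        g_loss = F.binary_cross_entropy_with_logits(
            discriminator(generator(torch.randn(batch, noise_dim))), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()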

10 of 40

Unstable training dynamics

Example of mode collapse taken from arXiv:1611.02163

Example of vanishing gradients from https://arxiv.org/pdf/1701.04862

Training is not easy; the process is known to be slow and unstable:

  • Difficult to reach equilibrium
  • Mode collapse
  • Vanishing gradients: when the discriminator is perfect, D(x_True) -> 1 and D(x_Gen) -> 0, so the generator loss L(x) -> 0 and there is no weight update!
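A tiny numeric check of the vanishing-gradient argument (my own sketch): when the discriminator confidently rejects a fake, its logit is very negative, and the original "saturating" generator loss log(1 - D(G(z))) passes almost no gradient back to the generator:

    import torch

    a = torch.tensor([-8.0], requires_grad=True)  # discriminator logit for a fake: D(G(z)) = sigmoid(a) ~ 3e-4
    loss = torch.log(1.0 - torch.sigmoid(a))      # original minimax generator objective
    loss.backward()
    print(a.grad)                                 # ~ -3e-4: essentially no update reaches the generator

The commonly used non-saturating loss, -log(D(G(z))), keeps this gradient close to -1 in the same regime, which is why it is the default in practice.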

11 of 40

Wasserstein GAN as an improvement to training

The Wasserstein distance is a measure of the distance between two probability distributions (the Earth Mover’s distance).

Idea: the loss function is configured to measure the Wasserstein distance between the real and generated data distributions!

The “discriminator” is not a direct critic anymore. Instead, it is trained to learn a Lipschitz-continuous function f_w that helps compute the Wasserstein distance!

Caveat: it is not trivial to enforce the Lipschitz constraint (see the sketch below).

Figure taken from arXiv:1701.07875
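A sketch of the critic update with weight clipping as the (crude) original way to enforce the Lipschitz constraint; critic and generator are assumed PyTorch modules, and the critic outputs an unbounded score rather than a probability:

    import torch

    def wgan_critic_step(critic, generator, opt_c, real, noise_dim=64, clip=0.01):
        fake = generator(torch.randn(real.size(0), noise_dim)).detach()
        # The critic maximises E[f_w(real)] - E[f_w(fake)], so we minimise the negative
        loss = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad()
        loss.backward()
        opt_c.step()
        # Weight clipping keeps f_w (roughly) Lipschitz; the gradient penalty (WGAN-GP) is the usual modern fix
        with torch.no_grad():
            for p in critic.parameters():
                p.clamp_(-clip, clip)
        return loss.item()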

12 of 40

From Adversarial to Autoencoding


13 of 40

Autoencoders can be a great starting point

Encode x into a low-dimensional representation!

Encoder network: translates the original high-dimensional input into the latent low-dimensional code. The input size is larger than the output size.

Decoder network: recovers the data from the code.

Why can’t I use this for generation?

Reconstruction Loss
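A minimal (hypothetical) autoencoder in PyTorch, just to make the bottleneck and the reconstruction loss concrete; the layer sizes are arbitrary:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AutoEncoder(nn.Module):
        def __init__(self, in_dim=784, latent_dim=16):
            super().__init__()
            # Encoder: high-dimensional input -> low-dimensional latent code
            self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            # Decoder: latent code -> reconstruction of the input
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    x = torch.rand(32, 784)               # a dummy batch
    model = AutoEncoder()
    reco_loss = F.mse_loss(model(x), x)   # the reconstruction loss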

14 of 40

An irregular latent space is useless for generation!

Problem:

The autoencoder is solely trained to encode and decode with as little reconstruction loss as possible, no matter how the latent space is organised.

Meaningless points in the latent space!

Random/useless samples!

Figure from Joseph Rocca’s blog post

15 of 40

A graphical model shows how we can regularize the latent space!

We want to map the input into a distribution!

Probabilistic encoder: learns to model a conditional Gaussian distribution over z given x (the approximate posterior).

Probabilistic decoder: learns to model the mean of the likelihood distribution given z.

Figure labels: Probabilistic Encoder (NN); Likelihood (Gaussian with fixed covariance); Posterior (Gaussian)
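In code, the probabilistic encoder is just a network with two output heads, one for the mean and one for the log-variance of the Gaussian over z; a minimal sketch with arbitrary sizes:

    import torch
    import torch.nn as nn

    class ProbEncoder(nn.Module):
        def __init__(self, in_dim=784, latent_dim=16):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
            self.mu_head = nn.Linear(128, latent_dim)       # mean of the Gaussian q(z|x)
            self.logvar_head = nn.Linear(128, latent_dim)   # log-variance of q(z|x)

        def forward(self, x):
            h = self.backbone(x)
            return self.mu_head(h), self.logvar_head(h)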

16 of 40

A single loss is not enough!

We need to map the NN output onto the posterior.

We can use the Kullback-Leibler divergence to quantify the distance between these two distributions (NN vs posterior): KL(X || Y) measures how much information is lost if the distribution Y is used to represent X.

Total loss = RECO + KL = negative Evidence Lower Bound (ELBO)

The “lower bound” part of the name comes from the fact that the KL divergence is always non-negative, and thus the negative of the loss is a lower bound of log(p(x)).
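For a Gaussian encoder and a standard-normal prior the KL term has a closed form, so the total loss can be written directly; a sketch assuming the mu/logvar heads above and a decoder that returns the reconstructed x:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_reco, mu, logvar):
        # Reconstruction term (Gaussian likelihood with fixed covariance <-> MSE up to constants)
        reco = F.mse_loss(x_reco, x, reduction="sum")
        # KL( N(mu, sigma^2) || N(0, I) ), closed form, summed over the latent dimensions
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return reco + kl      # = negative ELBO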

17 of 40

To train we need a small trick!

The expectation term in the loss function requires generating samples from the approximate posterior q(z|x).

Sampling is a stochastic process, and therefore we cannot backpropagate the gradient through it! To make the model trainable, the reparameterization trick is introduced:
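In code the trick is one line (a minimal sketch): sample a fixed noise eps ~ N(0, I) and write z as a deterministic, differentiable function of mu and sigma, so the gradient can flow through them:

    import torch

    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)   # sigma
        eps = torch.randn_like(std)     # the randomness lives here, outside the computational graph
        return mu + eps * std           # z = mu + sigma * eps, differentiable w.r.t. mu and logvar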

18 of 40

Finally, a variational autoencoder!

We can produce new, original data samples by decoding latent codes z drawn with arbitrary mean and sigma (see the sketch after the list)!

  • Latent Space Representation: meaningful, smooth latent space representations, great for tasks like anomaly detection or data compression.
  • Flexibility
  • Interpolation: VAEs can interpolate between data points in a meaningful way in the latent space.
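Generation then needs only the decoder: draw z from the prior (or from any chosen mean and sigma) and decode it. A sketch assuming a decoder module like the one sketched earlier:

    import torch

    def sample_vae(decoder, n=16, latent_dim=16, mu=0.0, sigma=1.0):
        z = mu + sigma * torch.randn(n, latent_dim)   # draw latent codes
        with torch.no_grad():
            return decoder(z)                         # new synthetic samples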

19 of 40

Known issues affecting latent space and samples


In essence, the very features that make VAEs effective for smooth interpolation and latent space exploration also contribute to:

  • Mode Collapse: They can suffer from mode collapse, where the model generates a limited variety of outputs.
  • Blurriness in Outputs: VAEs often produce blurrier results, especially in image generation tasks.
  • Complex Training: need to balance loss terms
  • Limited Expressiveness: assumption of a Gaussian prior limits the expressiveness of the model, impacting the variety of generated data.

Images stolen from Enoch Kan’s blog post

20 of 40

Wait, I really liked the idea of learning a pdf!


21 of 40

Why limit ourselves to just one sample?

Figure panels: GAN vs. FLOW (from “Why I Stopped Using GAN — ECCV 2020”)

22 of 40

The basic idea: change of variables

We define a transform f such that x = f(z), with z drawn from a simple base distribution p_z(z).

The two pdfs are then related by the change of variables formula: p_x(x) = p_z(z) |det(df/dz)|^(-1) = p_z(f^(-1)(x)) |det(df^(-1)/dx)|

23 of 40

The basic idea: image generation


24 of 40

The basic idea: complex transforms

Figure from Lilian Weng

25 of 40

We need just a few building blocks!

Task: learn the f(z) that sends the base distribution into the (unknown) data distribution.

Pieces:

  • A base distribution, typically a Gaussian

  • A function f(z), called the flow, that is invertible and differentiable, with a tractable Jacobian

26 of 40

The usage is straightforward

Density evaluation: invert the flow, evaluate the base density at z = f^(-1)(x), and correct with the Jacobian determinant.

Sampling new data (see the sketch below):

  • Sample z from the base distribution (Gaussian, trivial)

  • Compute x = f(z) (fast)
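Both operations in a toy one-dimensional example (my own illustration) with the simplest possible flow, x = f(z) = a*z + b:

    import torch

    a, b = torch.tensor(2.0), torch.tensor(1.0)   # parameters of a toy affine flow x = a*z + b
    base = torch.distributions.Normal(0.0, 1.0)   # the base distribution p_z

    # Sampling new data: draw z from the Gaussian, push it through f (fast)
    z = base.sample((5,))
    x = a * z + b

    # Density evaluation: invert f and correct with log|det Jacobian| (= log|a| here)
    z_back = (x - b) / a
    log_px = base.log_prob(z_back) - torch.log(torch.abs(a))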

27 of 40

The loss follows from the change of variables!

The loss is the negative log-likelihood of the data: -log p_x(x) = -log p_z(f^(-1)(x)) - log |det(df^(-1)/dx)|

The first term goes through the invertible transform, the second is the Jacobian for volume correction; the parameters of f(z) are the weights of the neural network we optimize.
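In training code this is just the negative log-likelihood of the batch. A sketch assuming a flow object with an inverse(x) method returning both z = f^(-1)(x) and the log|det Jacobian| of that inverse; this API is my assumption, and real libraries (e.g. nflows or Zuko) differ in the details:

    import torch

    def flow_nll_loss(flow, x):
        base = torch.distributions.Normal(0.0, 1.0)
        z, log_det = flow.inverse(x)                       # z = f^(-1)(x) and log|det df^(-1)/dx|
        log_px = base.log_prob(z).sum(dim=-1) + log_det    # change of variables
        return -log_px.mean()                              # minimise the NLL over the batch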

28 of 40

Splines can be a smart choice for f(z)

Splines are expressive, admit an analytical inverse, and are fast to invert AND evaluate. We use ML to learn the optimal placement of the knot points and derivatives.

Splines are just one of the possible choices! Linear (affine) transformations are also used a lot: f(z) = Wz + b

29 of 40

Normalizing Flows are powerful GMs!

  • Efficient to sample from
  • Efficient to evaluate
  • Highly expressive
  • Useful latent representation
  • Straightforward to train

30 of 40

Normalizing Flows are flawed GMs!

  • Computation of the Jacobian is hard
  • Not defined to work in a discrete context!

31 of 40

Coupling layers are a way of addressing jacobian complexity

Figure from Jason Yu

The Jacobian becomes triangular!
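A sketch of an affine (RealNVP-style) coupling layer, my own minimal implementation: half of the inputs pass through unchanged, the other half is scaled and shifted by functions of the first half, so the Jacobian is triangular and its log-determinant is just the sum of the predicted scales:

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.half = dim // 2
            # small network predicting scale s and shift t for the second half, given the first half
            self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                     nn.Linear(64, 2 * (dim - self.half)))

        def forward(self, z):                  # z -> x (generative direction)
            z1, z2 = z[:, :self.half], z[:, self.half:]
            s, t = self.net(z1).chunk(2, dim=-1)
            x2 = z2 * torch.exp(s) + t         # only the second half is transformed
            log_det = s.sum(dim=-1)            # triangular Jacobian: log|det| = sum of scales
            return torch.cat([z1, x2], dim=-1), log_det

        def inverse(self, x):                  # x -> z, analytic, no need to invert the network
            x1, x2 = x[:, :self.half], x[:, self.half:]
            s, t = self.net(x1).chunk(2, dim=-1)
            z2 = (x2 - t) * torch.exp(-s)
            return torch.cat([x1, z2], dim=-1), -s.sum(dim=-1)

In practice several coupling layers are stacked, permuting which half is transformed at each step so that every coordinate eventually gets updated.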

32 of 40

Dequantization can be used on discrete variables

Figure from arXiv:2001.11235

Apply a Gaussian smearing, converting the discrete data into a continuum.
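A one-line sketch of the idea for an integer-valued feature (the 0.1 smearing width is an arbitrary choice of mine):

    import torch

    n_true = torch.randint(0, 10, (1000,)).float()     # a discrete (integer) feature
    n_cont = n_true + 0.1 * torch.randn_like(n_true)   # Gaussian smearing: discrete -> continuous
    # the flow is trained on n_cont; at generation time its samples are rounded back to integers
    n_rounded = torch.round(n_cont)                    # recovers n_true (almost) everywhere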

33 of 40

Increasing Realism step-by-step


34 of 40

A Markov-chain approach to generation!

Diffusion models define a Markov chain of diffusion steps that slowly add random noise to the data, and then learn to reverse the diffusion process to construct the desired data samples from the noise.

Unlike VAE or flow models, diffusion models are learned with a fixed procedure, and the latent variable has high dimensionality (the same as the original data).

We need to learn a model to approximate these conditional probabilities in order to run the reverse diffusion process.

Output of NN
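A heavily simplified DDPM-style training step (my own sketch, not from the slides): noise the data according to a fixed schedule and train a network to predict that noise; model(x_t, t) is an assumed denoising network taking the noisy sample and the timestep:

    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)            # fixed forward-diffusion noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product, one value per timestep

    def diffusion_training_step(model, x0):
        t = torch.randint(0, T, (x0.size(0),))       # a random timestep for each sample
        eps = torch.randn_like(x0)                   # the noise the network must predict
        a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # closed-form forward process
        return F.mse_loss(model(x_t, t), eps)        # the NN output is the predicted noise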

35 of 40

Model in action, pros and cons

Pros: diffusion models are both analytically tractable (they can be analytically evaluated and cheaply fit to data) and flexible (they can fit arbitrary structure in the data).

Cons: diffusion models rely on a long Markov chain of diffusion steps to generate samples, so they can be quite expensive in terms of time and compute. New methods have been proposed to make the process much faster, but sampling is still slower than for GANs.

36 of 40

How do I do that??!


An illustration of an avocado sitting in a therapist’s chair, saying ‘I just feel so empty inside’ with a pit-sized hole in its centre. The therapist, a spoon, scribble notes

DALL·E 3

37 of 40

Just so you know: text conditioning

A CLIP (Contrastive Language-Image Pre-training) model learns to match a latent representation of the image with the associated text label.

This latent input is then given to a diffusion decoder.

38 of 40

Conclusions

Generative models are a powerful tool at our disposal.

Different models have specific advantages and drawbacks.

Widespread adoption in many Physics use-cases, with convincing results:

  • See tomorrow's hands-on about generative models
  • See Alessandra and Andrea's talks on Thursday

There are no readily available implementations for our specific problems, so we need to experiment.

39 of 40

This lecture in one slide

Scheme stolen from Lilian Weng's blog post

40 of 40

Citations:

Thanks Lilian Weng!

@article{weng2021diffusion,
  title   = "What are diffusion models?",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2021",
  month   = "Jul",
  url     = "https://lilianweng.github.io/posts/2021-07-11-diffusion-models/"
}

@article{weng2017gan,
  title   = "From GAN to WGAN",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2017",
  url     = "https://lilianweng.github.io/posts/2017-08-20-gan/"
}

@article{weng2018VAE,
  title   = "From Autoencoder to Beta-VAE",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2018",
  url     = "https://lilianweng.github.io/posts/2018-08-12-vae/"
}

@article{weng2018flow,
  title   = "Flow-based Deep Generative Models",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2018",
  url     = "https://lilianweng.github.io/posts/2018-10-13-flow-models/"
}