
Lecture 8:

Generative Models I

Sookyung Kim

Spring 2025


Era of Generative Models


Supervised vs. Unsupervised Learning

Supervised Learning

  • Data: (x, y)
    • x is data, y is label.
  • Goal: function approximation
    • Learning a function to map x → y.
  • Examples
    • Classification
    • Regression
    • Object detection
    • Semantic segmentation
    • Image captioning

Unsupervised Learning

  • Data: x
    • Just data, no labels!
  • Goal: learning the underlying hidden structure of the data
  • Examples
    • Clustering
    • Dimension reduction
    • Density estimation


Taxonomy of Generative Models

Generative models

  • Explicit density
    • Tractable density: Fully Visible Belief Nets (PixelRNN/CNN)
    • Approximate density
      • Variational: Variational Autoencoders
      • Stochastic: Boltzmann Machine
  • Implicit density
    • Direct: Generative Adversarial Networks (GAN)
    • Stochastic: Generative Stochastic Networks (GSN)

Ian Goodfellow, Tutorial on Generative Adversarial Networks https://arxiv.org/abs/1701.00160

Course coverage: PixelRNN/CNN and Variational Autoencoders (this lecture, Lecture 8), Generative Adversarial Networks (Lecture 9), and Stable Diffusion (DDPM) (Lecture 10).


Generative Modeling

  • Given training data,
    • Step 1: Assuming there is a probability distribution which has generated the data, learn the probability distribution pmodel(x).
    • Step 2: Then, sample new data x from the distribution, pmodel(x).
  • Explicit density estimation: explicitly define and solve for pmodel(x).
  • Implicit density estimation: learn a model that samples from pmodel(x) without explicitly defining it.


Generative Modeling

Why generative models?

  • Realistic samples for training data
  • Improving the quality of data: Super-resolution, Colorization, ...
  • Learn useful features for downstream tasks, e.g., classification.
  • Getting insights from high-dimensional data (physics, medical imaging, etc.)
  • Modeling physical world for simulation and planning (robotics and reinforcement learning applications)


PixelRNN & PixelCNN


Pixel-by-pixel Image Generation

  • Explicit density model:
    • Assumes that (plausible) images are sampled from an unknown probability distribution p(x).
    • Our goal is to model p(x) such that the images in the training data are likely samples from it.
  • Suppose we define an order of pixels in an image.
    • Any order may be fine. We generate the full image in this order.

[Figure: three example pixel orderings over a small grid, numbering the pixels 1, 2, 3, …, e.g., raster-scan order, diagonal order, and another arbitrary but fixed order.]


Pixel-by-pixel Image Generation

  • Task: Given a partially generated image, we model the probability distribution of the next pixel's values.
    • This is a classification problem with 256 possible values, per channel per pixel.
  • Starting from an empty image, we iteratively generate it pixel by pixel.
    • This is a stochastic process!
    • Each pixel value is sampled from the predicted distribution, rather than picking the most probable value (see the sampling sketch below).
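For a single channel of a single pixel, the sampling step can be sketched as follows (PyTorch; the tensor names are illustrative, not from the lecture):

```python
import torch

# Hypothetical 256-way logits predicted by the model for one channel of one pixel.
logits = torch.randn(256)
probs = torch.softmax(logits, dim=0)             # probability over values 0..255
value = torch.multinomial(probs, num_samples=1)  # sample, rather than take the argmax
```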


Pixel-by-pixel Image Generation

  • Use the chain rule to decompose the likelihood of an image x into a product of 1-D distributions.
    • Likelihood of an image x = product of conditional probabilities of each pixel, given all previously generated pixels:

      p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

  • With 3 channels, each channel is also generated sequentially:

      p(x_i \mid x_{<i}) = p(x_{i,R} \mid x_{<i}) \, p(x_{i,G} \mid x_{<i}, x_{i,R}) \, p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})

  • The complex relationships between pixels are usually modeled with a neural network.
  • Very slow due to the sequential generation.


RNN (Review)


Pixel-by-pixel Image Generation (An RNN setting)

  • Input: the masked ground truth (pixels generated so far).
  • Output: probability distribution of the next pixel's RGB values.
  • Compute the loss against the ground truth and backpropagate.

[Figure: an unrolled RNN with hidden states h0 → h1 → h2 → h3 → h4, each step applying the same function fW.]


PixelRNN (1): Row LSTM

[Figure: Row LSTM. The input image xt is processed by an input-to-state convolution W, and the previous hidden state ht-1 by a state-to-state convolution U, sweeping over the image row by row.]

Note: this process is actually done in parallel, within the same row!


  • Receptive field?
    • The receptive field is triangular; it does not cover all of the previously generated pixels.
    • That is, it does not use all available context.


PixelRNN (2): Diagonal BiLSTM

[Figure: Diagonal BiLSTM. The input image xt is processed by an input-to-state convolution W, and the previous hidden state ht-1 by a state-to-state convolution U.]

  • Input-to-state is a 1×1 convolution.
  • State-to-state is a 2×1 convolution (see below).
  • The entire diagonal is processed in parallel, each cell relying on its adjacent cells.


PixelRNN (2): Diagonal BiLSTM

  • State-to-state: a 2×1 convolution applied after shifting the rows by one pixel (an implementation trick).
  • The receptive field is global: all previously generated pixels are used at each step.


PixelCNN

  • Instead of RNNs, the model consists of stacked convolutional layers (see the masked-convolution sketch below).
  • Generates the target pixel using previously generated nearby pixels.
  • Training can be done in parallel, since all pixel values are known at training time.
    • Faster than PixelRNN!
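PixelCNN enforces this dependency on previously generated pixels with masked convolutions. A minimal sketch in PyTorch follows (the class and variable names are illustrative and assume a single-channel image; this is a sketch, not the lecture's reference code):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each output pixel only sees
    pixels above it and to its left (mask 'A' also hides the center pixel)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                      # (out, in, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0   # center row, right part
        mask[:, :, kH // 2 + 1:, :] = 0                          # rows below the center
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask                            # zero out "future" pixels
        return super().forward(x)

# Stacked masked convolutions: training is parallel over all pixels,
# since the ground-truth image is fully known at training time.
model = nn.Sequential(
    MaskedConv2d('A', 1, 64, kernel_size=7, padding=3), nn.ReLU(),
    MaskedConv2d('B', 64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),                           # 256-way logits per pixel
)
```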


Pixel Recursive Super Resolution

  • Task: enlarging a low-resolution photograph to recover a plausible corresponding high-resolution image.
  • Underspecified: many plausible high-resolution images may match the given low-resolution one.
  • Need to consider texture, edges, viewpoints, illumination, occlusion, etc.
  • Notations:
    • x: low-resolution input image, with L pixels.
    • y: high-resolution predicted image, with M pixels.
    • y*: high-resolution ground-truth image, with M pixels (L ≪ M).


Pixel Recursive Super Resolution

[Figure: the prior network Bi(y<i), a PixelCNN over the high-resolution pixels generated so far, and the conditioning network Ai(x), a CNN over the entire low-resolution image, each produce logits for pixel i; the two sets of logits are added and passed through a softmax. Sample the target pixel, then continue to the next one (i+1).]

  • Bi(y<i): maps from the high-resolution image generated so far to a probability distribution over K (256) values for the i-th pixel.
    • Captures sequential dependencies between pixels.
  • Ai(x): maps from the entire low-resolution image to a probability distribution over K (256) values for the i-th pixel.
    • Captures the global structure of the low-resolution image.


Pixel Recursive Super Resolution

  • The output of the model gives a probability distribution over K (= 256) possible values for the target pixel. With Ai(x) and Bi(y<i) each producing a logit vector of size K, the distribution is

      p(y_i = k \mid x, y_{<i}) = \mathrm{softmax}_k\big(A_i(x) + B_i(y_{<i})\big)

  • Minimizes the cross-entropy loss between this distribution and the ground-truth pixel value.
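Written out, the training objective is the negative log-likelihood of the ground-truth pixel values under this distribution, i.e., a per-pixel cross-entropy (a standard form, stated here for reference; the exact equation on the slide is not preserved in this text):

\[
\mathcal{L} \;=\; -\sum_{i=1}^{M} \log p\big(y_i = y^{*}_{i} \,\big|\, x,\, y^{*}_{<i}\big)
\]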


Pixel Recursive Super Resolution: Result

  • 4 instances of super resolution from the left.
    • We get various output images, because PixelCNN is a stochastic process (sampling each pixel from the estimated distribution).


Pixel Recursive Super Resolution: Result

  • Super Resolution for Hurricane Data

Kim, Sookyung, et al. "Resolution reconstruction of climate data with pixel recursive model." 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017.


Autoencoders


Autoencoders

  • Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
  • Trained with a reconstruction loss:
    • Features are trained to reconstruct the original data (the input itself).
  • Input can be anything (no assumption about the data).
  • z is usually smaller than x, to force the model to capture meaningful factors of variation in the data (a minimal sketch follows the figure below).

[Figure: x → Encoder h(x) → z → Decoder g(z) → reconstructed x.]
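A minimal autoencoder sketch in PyTorch (the layer sizes and variable names are illustrative assumptions, not from the lecture): the encoder h maps x to a lower-dimensional z, the decoder g maps z back, and training minimizes the reconstruction error between x and g(h(x)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # h(x)
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # g(z)

x = torch.rand(16, 784)            # a batch of flattened images
z = encoder(x)                     # bottleneck: z is much smaller than x
x_hat = decoder(z)
loss = F.mse_loss(x_hat, x)        # reconstruction loss
loss.backward()
```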


Autoencoders

  • The main purpose of autoencoders is representation learning.
    • Embedding
    • Manifold learning
    • Feature extraction
  • Extracted features may be used to train other supervised models.
  • Once trained, the decoder is no longer used.
    • In other words, we temporarily attach the decoder g to train the encoder h.


Denoising Autoencoders (DAE)

  • Adding random noise to the input encourages representations that are robust to small perturbations of the input. → Better end-to-end classification accuracy.
  • Noise examples (a small corruption sketch follows the figure below):
    • Zeroing out random pixels
    • Gaussian noise
    • Salt-and-pepper noise (random white and black points)
  • We may generate multiple corrupted versions of each x.

[Figure: the clean input x is corrupted by random noise q(x’|x) to give x’; the encoder h(x’) maps x’ to z, and the decoder g(z) reconstructs the clean x.]
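A sketch of two of the corruptions listed above (PyTorch; the function names and noise fractions are illustrative assumptions). Note that the reconstruction target is still the clean x, while the encoder receives the corrupted x’.

```python
import torch

def zero_out(x, p=0.3):
    """Randomly zero out a fraction p of the pixels."""
    return x * (torch.rand_like(x) > p).float()

def salt_and_pepper(x, p=0.1):
    """Set a fraction p of the pixels to 0 (pepper) or 1 (salt) at random."""
    noisy = x.clone()
    r = torch.rand_like(x)
    noisy[r < p / 2] = 0.0                 # pepper
    noisy[(r >= p / 2) & (r < p)] = 1.0    # salt
    return noisy

x = torch.rand(16, 784)                    # clean batch, values in [0, 1]
x_corrupt = salt_and_pepper(zero_out(x))   # input to the encoder
# The reconstruction loss is computed between g(h(x_corrupt)) and the clean x.
```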


Denoising Autoencoders

Input data x lies closely on a low-dimensional manifold.

Corruption q maps x to farther away from this manifold.

The model g(h(x’)) learns how to map x’ back onto the manifold.


Variational Autoencoders


Variational Autoencoders

  • Autoencoders are mainly used to embed the input x into a learned manifold space.
  • Can we use the decoder g as a generator?
  • No! 😢 We do not have a z vector unless we encode it from an existing image x.

[Figure: the autoencoder x → Encoder h(x) → z → Decoder g(z), shown next to the generative-modeling recipe from earlier: Step 1, density estimation of pmodel(x); Step 2, sampling from pmodel(x).]

  • Hmm… but this is reminiscent of the goal of generative modeling from the beginning of the lecture!


Variational Autoencoders

  • What we want with a generative model:
    • We believe plausible images lie on a low-dimensional subspace (or manifold) z of the entire pixel space (ℝ^{m×n}).
      • Embedding models learn how to map images x to this space z.
    • Conversely, we'd like to estimate the probability distribution of z, and to sample from it.
      • We want this distribution to resemble that of the training images.
      • When we sample from this distribution, we want semantically reasonable images to be generated!

[Figure: a generator gθ(z) maps points z in the latent space to images x.]


Variational Autoencoders: First try

  • Suppose a set of images is generated from the latent distribution z:
  • As we can't integrate over all possible z, approximate it with the training samples:
  • We'd like to model g so that it likely generates the dataset we have.
    • Likelihood and log-likelihood of the dataset (with N examples):
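The equations these bullets point to are not preserved in this text; a sketch of the standard forms (with z_m denoting sampled latent codes, an assumption on my part) is:

\[
p(x) \;=\; \int p\big(x \mid g_\theta(z)\big)\, p(z)\, dz
\;\approx\; \frac{1}{M} \sum_{m=1}^{M} p\big(x \mid g_\theta(z_m)\big),
\qquad z_m \sim p(z)
\]

\[
\mathcal{L}(\theta) \;=\; \prod_{i=1}^{N} p(x_i),
\qquad
\log \mathcal{L}(\theta) \;=\; \sum_{i=1}^{N} \log p(x_i)
\]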


Variational Autoencoders: First try (cont’d)

  • Assuming p(x | gθ(z)) = N(x | gθ(z), σ²I), we try to maximize the likelihood (MLE):
  • To maximize this, we need to train g such that xi ≈ gθ(zi) for the training examples (i = 1, …, N), in terms of squared loss.
    • Q. Is the squared loss a good indicator of image semantics?

No! We've seen that it's not the case.
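For reference, the link between the Gaussian assumption and the squared loss is the standard expansion (not reproduced on the slide itself):

\[
\log p\big(x_i \mid g_\theta(z_i)\big)
= \log \mathcal{N}\big(x_i \mid g_\theta(z_i), \sigma^2 I\big)
= -\frac{\lVert x_i - g_\theta(z_i) \rVert^2}{2\sigma^2} + \text{const.}
\]

so maximizing the likelihood is equivalent to minimizing the squared error between xi and gθ(zi).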


Variational Autoencoders: Main Idea

  • Okay, is there any better way?
  • Variational Autoencoders suggest modeling z itself instead of p(x | gθ(z))!
    • 1) Sample z from p(z|x), which semantically distinguishes the images observed in x.
    • 2) Since we don't know p, we approximate it with another distribution qф(z|x) whose form we know. (Variational inference)
  • What's the difference from the first try?
    • In the first try, p(z) was arbitrary. We just assumed there is 'some' latent variable z.
    • Now, we explicitly force p(z) to reflect the observed examples x, by modeling p(z|x).
    • Also, we assume the form of qф, for parametric optimization.


Variational Autoencoders: Derivation

  • Derivation steps annotated on the slide: apply Bayes' rule → multiply by 1 (i.e., by qф(z|x)/qф(z|x)) → organize terms → rewrite using the definitions of expectation and KL divergence.
  • The resulting three terms:
    • Reconstruction (likelihood of the data x). As we model p(x|z) by the generator g, this term is the same as in the first try!
    • We enforce that the encoder qф(z|x) embeds the data x close to a prior distribution p(z) we assume, e.g., a Gaussian.
    • We don't know p(z|x). As KL divergence ≥ 0, the first two terms are a lower bound of log p(x).
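Reconstructed in standard notation (the slide's own equations are not preserved in this text; q is shorthand for qф(z|x)), the derivation reads:

\[
\log p(x)
= \mathbb{E}_{z \sim q}\big[\log p(x)\big]
= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}\right]
\qquad \text{(Bayes' rule)}
\]

\[
= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}
\cdot \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\right]
\qquad \text{(multiplied by 1)}
\]

\[
= \underbrace{\mathbb{E}_{q}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
\;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{match the prior}}
\;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big)}_{\ge\, 0,\ \text{intractable}}
\]

\[
\Rightarrow\quad
\log p(x) \;\ge\;
\mathbb{E}_{q}\big[\log p(x \mid z)\big]
- D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
\qquad \text{(the variational lower bound)}
\]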


Variational Autoencoders: Overall Structure

  • VAE is a generative model:
    • Once trained, the encoder is no longer used.
    • In other words, we temporarily attach the encoder q to train the generator g.
  • The latent space z is modeled such that
    • Actual examples xi mapped by the encoder q are semantically well-distinguished, so that they can be generated back by the generator g.
    • At the same time, the embeddings z follow a Gaussian distribution.


Variational Autoencoders: Overall Structure

[Figure: the encoder qф(z|x) maps x to the parameters μz|x and Σz|x of a Gaussian; z is sampled from N(μz|x, Σz|x) and fed to the generator gθ(x|z), which reconstructs x. The KL divergence between the two Gaussians acts like a regularizer, and the reconstruction loss is computed between x and its reconstruction.]
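A minimal VAE sketch in PyTorch, mirroring this structure (layer sizes and names are illustrative assumptions, not the lecture's reference implementation): the encoder predicts μ and log σ², z is sampled via the reparameterization trick, and the loss combines reconstruction with the KL divergence between N(μ, σ²I) and the prior N(0, I).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 400), nn.ReLU())
        self.mu = nn.Linear(400, z_dim)            # mean of q(z|x)
        self.logvar = nn.Linear(400, z_dim)        # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(),
                                 nn.Linear(400, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')     # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl
```

Once trained, new images are generated by sampling z ~ N(0, I) and running only the decoder, as described on the surrounding slides.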


Variational Autoencoders: Overall Structure

  • Q. Hmm, it is still strange. Given z ~ N(0, I), does it represent the complicated relationships between images well? Don't we need a more complex prior?
  • A. Interestingly, a simple prior is enough!
    • This is because we use a deep neural network for the generator g.
    • Lower layers in g learn the complex manifold of the image space.


Variational Autoencoders: Examples

  • Learned MNIST manifold: [figure: generated digits 0–9 laid out over the 2-D latent space]
  • Learned facial expression manifold: [figure: one latent direction varies the amount of smile (less smile → more smile), another varies gaze direction (gaze left → gaze right)]


Variational Autoencoders: Summary

  • A principled approach to generative models
    • Probabilistic modeling of traditional autoencoders → allows data generation.
    • To optimize an intractable density, we derive and optimize a variational lower bound.
  • Pros:
    • Interpretable latent space
    • Allows inference of q(z|x)
    • Useful feature representations for other tasks
  • Cons:
    • Optimizes a lower bound of the likelihood (an approximation)
    • Results are not as good as PixelRNN/PixelCNN (which use the exact likelihood)
    • Samples are blurrier and of lower quality than those from GANs
