
Lecture 8:

Generative Models I

Sookyung Kim

Spring 2025


Era of Generative Models


Supervised vs. Unsupervised Learning

Supervised Learning

  • Data: (x, y)
    • x is data, y is label.
  • Goal: function approximation
    • Learning a function to map x → y.
  • Examples
    • Classification
    • Regression
    • Object detection
    • Semantic segmentation
    • Image captioning

Unsupervised Learning

  • Data: x
    • Just data, no labels!
  • Goal: learning the underlying hidden structure of the data
  • Examples
    • Clustering
    • Dimension reduction
    • Density estimation


Taxonomy of Generative Models

Generative models

  • Explicit density
    • Tractable density: Fully Visible Belief Nets (PixelRNN/CNN)
    • Approximate density
      • Variational: Variational Autoencoders
      • Stochastic: Boltzmann Machine
  • Implicit density
    • Direct: Generative Adversarial Networks (GAN)
    • Stochastic: Generative Stochastic Networks (GSN)

Ian Goodfellow, Tutorial on Generative Adversarial Networks https://arxiv.org/abs/1701.00160

Course coverage: PixelRNN/CNN and Variational Autoencoders (this lecture, Lecture 8), Generative Adversarial Networks (Lecture 9), and Stable Diffusion (DDPM) (Lecture 10).


Generative Modeling

  • Given training data,
    • Step 1: Assuming there is a probability distribution which has generated the data, learn the probability distribution pmodel(x).
    • Step 2: Then, sample new data x from the distribution, pmodel(x).
  • Explicit density estimation: explicitly define and solve for pmodel(x).
  • Implicit density estimation: learn a model that samples from pmodel(x) without explicitly defining it.


Generative Modeling

Why generative models?

  • Realistic samples for training data
  • Improving the quality of data: Super-resolution, Colorization, ...
  • Learn useful features for downstream tasks, e.g., classification.
  • Getting insights from high-dimensional data (physics, medical imaging, etc.)
  • Modeling physical world for simulation and planning (robotics and reinforcement learning applications)


PixelRNN & PixelCNN


Pixel-by-pixel Image Generation

  • Explicit density model:
    • Assumes that (plausible) images are sampled from an unknown probability distribution p(x).
    • Our goal is to model p(x) such that the images in the training data are likely samples from it.
  • Suppose we define an order of pixels in an image.
    • Any order may be fine. We generate the full image in this order.

[Figure: three example pixel orderings over a small grid, numbering the pixels 1, 2, 3, …, e.g., raster-scan order, diagonal order, and another arbitrary but fixed order.]


Pixel-by-pixel Image Generation

  • Task: Given a partially generated image, we model the probability distribution of the next pixel's values.
    • This is a classification problem with 256 possible values, per channel per pixel.
  • Starting from an empty image, we iteratively generate it pixel by pixel.
    • This is a stochastic process!
    • Each pixel value is sampled from the predicted distribution, rather than picking the most probable value (see the sampling sketch below).
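For a single channel of a single pixel, the sampling step can be sketched as follows (PyTorch; the tensor names are illustrative, not from the lecture):

```python
import torch

# Hypothetical 256-way logits predicted by the model for one channel of one pixel.
logits = torch.randn(256)
probs = torch.softmax(logits, dim=0)             # probability over values 0..255
value = torch.multinomial(probs, num_samples=1)  # sample, rather than take the argmax
```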


Pixel-by-pixel Image Generation

  • Use the chain rule to decompose the likelihood of an image x into a product of 1-D distributions.
    • Likelihood of an image x = product of conditional probabilities of each pixel, given all previously generated pixels:

      p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})

  • With 3 channels, each channel is also generated sequentially:

      p(x_i \mid x_{<i}) = p(x_{i,R} \mid x_{<i}) \, p(x_{i,G} \mid x_{<i}, x_{i,R}) \, p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})

  • The complex relationships between pixels are usually modeled with a neural network.
  • Very slow due to the sequential generation.


RNN (Review)


Pixel-by-pixel Image Generation (An RNN setting)

  • Input: the masked ground truth (pixels generated so far).
  • Output: probability distribution of the next pixel's RGB values.
  • Compute the loss against the ground truth and backpropagate.

[Figure: an unrolled RNN with hidden states h0 → h1 → h2 → h3 → h4, each step applying the same function fW.]


PixelRNN (1): Row LSTM

[Figure: Row LSTM. The input image xt is processed by an input-to-state convolution W, and the previous hidden state ht-1 by a state-to-state convolution U, sweeping over the image row by row.]

Note: this process is actually done in parallel, within the same row!


  • Receptive field?
    • The receptive field is triangular; it does not cover all of the previously generated pixels.
    • That is, it does not use all available context.


PixelRNN (2): Diagonal BiLSTM

[Figure: Diagonal BiLSTM. The input image xt is processed by an input-to-state convolution W, and the previous hidden state ht-1 by a state-to-state convolution U.]

  • Input-to-state is a 1×1 convolution.
  • State-to-state is a 2×1 convolution (see below).
  • The entire diagonal is processed in parallel, each cell relying on its adjacent cells.


PixelRNN (2): Diagonal BiLSTM

  • State-to-state: a 2×1 convolution applied after shifting the rows by one pixel (an implementation trick).
  • The receptive field is global: all previously generated pixels are used at each step.


PixelCNN

  • Instead of RNNs, the model consists of stacked convolutional layers (see the masked-convolution sketch below).
  • Generates the target pixel using previously generated nearby pixels.
  • Training can be done in parallel, since all pixel values are known at training time.
    • Faster than PixelRNN!
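PixelCNN enforces this dependency on previously generated pixels with masked convolutions. A minimal sketch in PyTorch follows (the class and variable names are illustrative and assume a single-channel image; this is a sketch, not the lecture's reference code):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each output pixel only sees
    pixels above it and to its left (mask 'A' also hides the center pixel)."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                      # (out, in, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0   # center row, right part
        mask[:, :, kH // 2 + 1:, :] = 0                          # rows below the center
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask                            # zero out "future" pixels
        return super().forward(x)

# Stacked masked convolutions: training is parallel over all pixels,
# since the ground-truth image is fully known at training time.
model = nn.Sequential(
    MaskedConv2d('A', 1, 64, kernel_size=7, padding=3), nn.ReLU(),
    MaskedConv2d('B', 64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),                           # 256-way logits per pixel
)
```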


Pixel Recursive Super Resolution

  • Task: enlarging a low-resolution photograph to recover a plausible corresponding high-resolution image.
  • Underspecified: many plausible high-resolution images may match the given low-resolution one.
  • Need to consider texture, edges, viewpoints, illumination, occlusion, etc.
  • Notations:
    • x: low-resolution input image, with L pixels.
    • y: high-resolution predicted image, with M pixels.
    • y*: high-resolution ground-truth image, with M pixels (L ≪ M).


Pixel Recursive Super Resolution

[Figure: the prior network Bi(y<i), a PixelCNN over the high-resolution pixels generated so far, and the conditioning network Ai(x), a CNN over the entire low-resolution image, each produce logits for pixel i; the two sets of logits are added and passed through a softmax. Sample the target pixel, then continue to the next one (i+1).]

  • Bi(y<i): maps from the high-resolution image generated so far to a probability distribution over K (256) values for the i-th pixel.
    • Captures sequential dependencies between pixels.
  • Ai(x): maps from the entire low-resolution image to a probability distribution over K (256) values for the i-th pixel.
    • Captures the global structure of the low-resolution image.


Pixel Recursive Super Resolution

  • The output of the model gives a probability distribution over K (= 256) possible values for the target pixel. With Ai(x) and Bi(y<i) each producing a logit vector of size K, the distribution is

      p(y_i = k \mid x, y_{<i}) = \mathrm{softmax}_k\big(A_i(x) + B_i(y_{<i})\big)

  • Minimizes the cross-entropy loss between this distribution and the ground-truth pixel value.
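Written out, the training objective is the negative log-likelihood of the ground-truth pixel values under this distribution, i.e., a per-pixel cross-entropy (a standard form, stated here for reference; the exact equation on the slide is not preserved in this text):

\[
\mathcal{L} \;=\; -\sum_{i=1}^{M} \log p\big(y_i = y^{*}_{i} \,\big|\, x,\, y^{*}_{<i}\big)
\]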


Pixel Recursive Super Resolution: Result

  • 4 instances of super resolution from the left.
    • We get various output images, because PixelCNN is a stochastic process (sampling each pixel from the estimated distribution).


Pixel Recursive Super Resolution: Result

  • Super Resolution for Hurricane Data

Kim, Sookyung, et al. "Resolution reconstruction of climate data with pixel recursive model." 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017.


Autoencoders


Autoencoders

  • Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
  • Trained with a reconstruction loss:
    • Features are trained to reconstruct the original data (the input itself).
  • Input can be anything (no assumption about the data).
  • z is usually smaller than x, to force the model to capture meaningful factors of variation in the data (a minimal sketch follows the figure below).

[Figure: x → Encoder h(x) → z → Decoder g(z) → reconstructed x.]
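A minimal autoencoder sketch in PyTorch (the layer sizes and variable names are illustrative assumptions, not from the lecture): the encoder h maps x to a lower-dimensional z, the decoder g maps z back, and training minimizes the reconstruction error between x and g(h(x)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # h(x)
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # g(z)

x = torch.rand(16, 784)            # a batch of flattened images
z = encoder(x)                     # bottleneck: z is much smaller than x
x_hat = decoder(z)
loss = F.mse_loss(x_hat, x)        # reconstruction loss
loss.backward()
```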


Autoencoders

  • The main purpose of autoencoders is representation learning.
    • Embedding
    • Manifold learning
    • Feature extraction
  • Extracted features may be used to train other supervised models.
  • Once trained, the decoder is no longer used.
    • In other words, we temporarily attach the decoder g to train the encoder h.


Denoising Autoencoders (DAE)

  • Adding random noise to the input encourages representations that are robust to small perturbations of the input. → Better end-to-end classification accuracy.
  • Noise examples (a small corruption sketch follows the figure below):
    • Zeroing out random pixels
    • Gaussian noise
    • Salt-and-pepper noise (random white and black points)
  • We may generate multiple corrupted versions of each x.

[Figure: the clean input x is corrupted by random noise q(x’|x) to give x’; the encoder h(x’) maps x’ to z, and the decoder g(z) reconstructs the clean x.]
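A sketch of two of the corruptions listed above (PyTorch; the function names and noise fractions are illustrative assumptions). Note that the reconstruction target is still the clean x, while the encoder receives the corrupted x’.

```python
import torch

def zero_out(x, p=0.3):
    """Randomly zero out a fraction p of the pixels."""
    return x * (torch.rand_like(x) > p).float()

def salt_and_pepper(x, p=0.1):
    """Set a fraction p of the pixels to 0 (pepper) or 1 (salt) at random."""
    noisy = x.clone()
    r = torch.rand_like(x)
    noisy[r < p / 2] = 0.0                 # pepper
    noisy[(r >= p / 2) & (r < p)] = 1.0    # salt
    return noisy

x = torch.rand(16, 784)                    # clean batch, values in [0, 1]
x_corrupt = salt_and_pepper(zero_out(x))   # input to the encoder
# The reconstruction loss is computed between g(h(x_corrupt)) and the clean x.
```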


Denoising Autoencoders

Input data x lies closely on a low-dimensional manifold.

Corruption q maps x to farther away from this manifold.

The model g(h(x’)) learns how to map x’ back onto the manifold.


Variational Autoencoders


Variational Autoencoders

  • Autoencoders are mainly used to embed the input x into a learned manifold space.
  • Can we use the decoder g as a generator?
  • No! 😢 We do not have a z vector unless we encode it from an existing image x.

[Figure: the autoencoder x → Encoder h(x) → z → Decoder g(z), shown next to the generative-modeling recipe from earlier: Step 1, density estimation of pmodel(x); Step 2, sampling from pmodel(x).]

  • Hmm… but this is reminiscent of the goal of generative modeling from the beginning of the lecture!


Variational Autoencoders

  • What we want with a generative model:
    • We believe plausible images lie on a low-dimensional subspace (or manifold) z of the entire pixel space (ℝ^{m×n}).
      • Embedding models learn how to map images x to this space z.
    • Conversely, we'd like to estimate the probability distribution of z, and to sample from it.
      • We want this distribution to resemble that of the training images.
      • When we sample from this distribution, we want semantically reasonable images to be generated!

[Figure: a generator gθ(z) maps points z in the latent space to images x.]


Variational Autoencoders: First try

  • Suppose a set of images is generated from the latent distribution z:
  • As we can't integrate over all possible z, approximate it with the training samples:
  • We'd like to model g so that it likely generates the dataset we have.
    • Likelihood and log-likelihood of the dataset (with N examples):
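The equations these bullets point to are not preserved in this text; a sketch of the standard forms (with z_m denoting sampled latent codes, an assumption on my part) is:

\[
p(x) \;=\; \int p\big(x \mid g_\theta(z)\big)\, p(z)\, dz
\;\approx\; \frac{1}{M} \sum_{m=1}^{M} p\big(x \mid g_\theta(z_m)\big),
\qquad z_m \sim p(z)
\]

\[
\mathcal{L}(\theta) \;=\; \prod_{i=1}^{N} p(x_i),
\qquad
\log \mathcal{L}(\theta) \;=\; \sum_{i=1}^{N} \log p(x_i)
\]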


Variational Autoencoders: First try (cont’d)

  • Assuming p(x | gθ(z)) = N(x | gθ(z), σ²I), we try to maximize the likelihood (MLE):
  • To maximize this, we need to train g such that xi ≈ gθ(zi) for the training examples (i = 1, …, N), in terms of squared loss.
    • Q. Is the squared loss a good indicator of image semantics?

No! We've seen that it's not the case.
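For reference, the link between the Gaussian assumption and the squared loss is the standard expansion (not reproduced on the slide itself):

\[
\log p\big(x_i \mid g_\theta(z_i)\big)
= \log \mathcal{N}\big(x_i \mid g_\theta(z_i), \sigma^2 I\big)
= -\frac{\lVert x_i - g_\theta(z_i) \rVert^2}{2\sigma^2} + \text{const.}
\]

so maximizing the likelihood is equivalent to minimizing the squared error between xi and gθ(zi).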


Variational Autoencoders: Main Idea

  • Okay, is there any better way?
  • Variational Autoencoders suggest modeling z itself instead of p(x | gθ(z))!
    • 1) Sample z from p(z|x), which semantically distinguishes the images observed in x.
    • 2) Since we don't know p, we approximate it with another distribution qф(z|x) whose form we know. (Variational inference)
  • What's the difference from the first try?
    • In the first try, p(z) was arbitrary. We just assumed there is 'some' latent variable z.
    • Now, we explicitly force p(z) to reflect the observed examples x, by modeling p(z|x).
    • Also, we assume the form of qф, for parametric optimization.


Variational Autoencoders: Derivation

  • Derivation steps annotated on the slide: apply Bayes' rule → multiply by 1 (i.e., by qф(z|x)/qф(z|x)) → organize terms → rewrite using the definitions of expectation and KL divergence.
  • The resulting three terms:
    • Reconstruction (likelihood of the data x). As we model p(x|z) by the generator g, this term is the same as in the first try!
    • We enforce that the encoder qф(z|x) embeds the data x close to a prior distribution p(z) we assume, e.g., a Gaussian.
    • We don't know p(z|x). As KL divergence ≥ 0, the first two terms are a lower bound of log p(x).
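Reconstructed in standard notation (the slide's own equations are not preserved in this text; q is shorthand for qф(z|x)), the derivation reads:

\[
\log p(x)
= \mathbb{E}_{z \sim q}\big[\log p(x)\big]
= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}\right]
\qquad \text{(Bayes' rule)}
\]

\[
= \mathbb{E}_{q}\!\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}
\cdot \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\right]
\qquad \text{(multiplied by 1)}
\]

\[
= \underbrace{\mathbb{E}_{q}\big[\log p(x \mid z)\big]}_{\text{reconstruction}}
\;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{match the prior}}
\;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big)}_{\ge\, 0,\ \text{intractable}}
\]

\[
\Rightarrow\quad
\log p(x) \;\ge\;
\mathbb{E}_{q}\big[\log p(x \mid z)\big]
- D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
\qquad \text{(the variational lower bound)}
\]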


Variational Autoencoders: Overall Structure

  • VAE is a generative model:
    • Once trained, the encoder is no longer used.
    • In other words, we temporarily attach the encoder q to train the generator g.
  • The latent space z is modeled such that
    • Actual examples xi mapped by the encoder q are semantically well-distinguished, so that they can be generated back by the generator g.
    • At the same time, the embeddings z follow a Gaussian distribution.


Variational Autoencoders: Overall Structure

[Figure: the encoder qф(z|x) maps x to the parameters μz|x and Σz|x of a Gaussian; z is sampled from N(μz|x, Σz|x) and fed to the generator gθ(x|z), which reconstructs x. The KL divergence between the two Gaussians acts like a regularizer, and the reconstruction loss is computed between x and its reconstruction.]
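A minimal VAE sketch in PyTorch, mirroring this structure (layer sizes and names are illustrative assumptions, not the lecture's reference implementation): the encoder predicts μ and log σ², z is sampled via the reparameterization trick, and the loss combines reconstruction with the KL divergence between N(μ, σ²I) and the prior N(0, I).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 400), nn.ReLU())
        self.mu = nn.Linear(400, z_dim)            # mean of q(z|x)
        self.logvar = nn.Linear(400, z_dim)        # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(),
                                 nn.Linear(400, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')     # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl
```

Once trained, new images are generated by sampling z ~ N(0, I) and running only the decoder, as described on the surrounding slides.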


Variational Autoencoders: Overall Structure

  • Q. Hmm, it is still strange. Given z ~ N(0, I), does it represent the complicated relationships between images well? Don't we need a more complex prior?
  • A. Interestingly, a simple prior is enough!
    • This is because we use a deep neural network for the generator g.
    • Lower layers in g learn the complex manifold of the image space.


Variational Autoencoders: Examples

  • Learned MNIST manifold: [figure: generated digits 0–9 laid out over the 2-D latent space]
  • Learned facial expression manifold: [figure: one latent direction varies the amount of smile (less smile → more smile), another varies gaze direction (gaze left → gaze right)]


Variational Autoencoders: Summary

  • A principled approach to generative models
    • Probabilistic modeling of traditional autoencoders → allows data generation.
    • To optimize an intractable density, we derive and optimize a variational lower bound.
  • Pros:
    • Interpretable latent space
    • Allows inference of q(z|x)
    • Useful feature representations for other tasks
  • Cons:
    • Optimizes a lower bound of the likelihood (an approximation)
    • Results are not as good as PixelRNN/PixelCNN (which use the exact likelihood)
    • Samples are blurrier and of lower quality than those from GANs
