1 of 142

Lecture 14

Image Synthesis

6.8300/1 Advances in Computer Vision

Spring 2024

Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković

2 of 142

14. Image Synthesis

  • Image synthesis
    • Variational Autoencoders
    • Generative Adversarial Networks
  • Structured prediction
    • Image-to-image GANs
  • Domain mapping

3 of 142

Announcements

  • Pset5 due Tuesday, 04/02
  • Pset6 out Thursday, 04/04
  • Project proposal due Thursday, 04/04

4 of 142

Analysis

“Duck”

image x

label y


5 of 142

Analysis

“A large duck standing by the river”

image x

caption y


Image Captioner

6 of 142

Analysis

“positive”

sentence x

sentiment y

“A statuesque duck gazing gracefully over the water”


Sentiment Classifier

7 of 142

Analysis

“Duck”

label y

image x


8 of 142

Synthesis

“Duck”

label y

image x


9 of 142

Synthesis

Generator

image x

“Fish”

label y

10 of 142

Synthesis

Photo

User sketch


Translator

11 of 142

Synthesis

Photo

User sketch


12 of 142

Image synthesis via generative modeling

In vision, this is usually what we are interested in!

Model of high-dimensional structured data

X is high-dimensional!

13 of 142


Deep nets are data transformers

  • Deep nets transform datapoints, layer by layer
  • Each layer is a different representation of the data

Embedding

Data

14 of 142


Embedding

Data

Deep nets are data transformers

  • Deep nets transform datapoints, layer by layer
  • Each layer is a different representation of the data

15 of 142

Generative modeling vs Representation learning

Representation learning:

mapping data to abstract representations

(analysis)


Embedding

Data

Generative modeling

Representation learning

Generative modeling:

mapping abstract representations to data (synthesis)

16 of 142

  1. Image synthesis
  2. Representation learning
  3. Data translation

What can you do with generative models?

17 of 142

Image synthesis

  1. Image synthesis
  2. Representation learning
  3. Data translation

18 of 142

Procedural graphics

[Anders Scheil]

19 of 142

20 of 142

Image synthesis from “noise”

Generator

21 of 142

Learning a generative model

22 of 142

Learning a density model

23 of 142

Case study #1: Fitting a Gaussian to data

fig from [Goodfellow, 2016]

Max likelihood objective

Considering only Gaussian fits

24 of 142

Case study #1: Fitting a Gaussian to data

“max likelihood”
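For the Gaussian family, the max likelihood objective has a closed-form solution: the MLE mean and covariance are just the sample statistics. A minimal numpy sketch (toy 1-D data and parameters are illustrative assumptions):

```python
import numpy as np

def fit_gaussian_mle(x):
    # MLE for a Gaussian: sample mean and (biased, 1/N) sample covariance
    mu = x.mean(axis=0)
    centered = x - mu
    sigma = centered.T @ centered / len(x)   # note 1/N, not 1/(N-1)
    return mu, sigma

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=(10000, 1))  # true mean 3, variance 4
mu, sigma = fit_gaussian_mle(data)
```

With enough samples the fit recovers the true mean and variance; the same objective, with a deep net in place of the Gaussian, is what the next case study optimizes with SGD.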

25 of 142

Case study #2: learning a deep generative model

SGD

Deep net

Usually max likelihood

26 of 142

SGD

Deep net

Usually max likelihood

Models that provide a sampler but no density are called implicit generative models

Case study #2: learning a deep generative model

27 of 142

Deep generative models are distribution transformers

Prior distribution

Target distribution

28 of 142

Gaussian noise

Deep generative models are distribution transformers

Synthesized image

29 of 142

Deep generative models are distribution transformers

Gaussian noise

Synthesized image
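The "distribution transformer" picture can be sketched directly: draw z from a Gaussian prior and push it through a network. The network here is random and untrained (a stand-in assumption; a trained generator would produce images):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, weights):
    # A tiny two-layer net mapping prior samples to data space
    h = np.tanh(z @ weights[0])   # hidden layer
    return h @ weights[1]         # output "image" vector

weights = [rng.normal(size=(8, 32)), rng.normal(size=(32, 64))]
z = rng.normal(size=(5, 8))       # 5 draws from the Gaussian prior
x = generator(z, weights)         # 5 synthesized 64-dim samples
```

Because the map is deterministic, the full distribution of x is induced entirely by the prior over z; training only changes what that induced distribution looks like.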

30 of 142

Generative Adversarial Networks (GANs)

Gaussian noise

Synthesized image

31 of 142

Generator

[Goodfellow et al., 2014]

G tries to synthesize fake images that fool D

D tries to identify the fakes

Discriminator

real or fake?

32 of 142

[Goodfellow et al., 2014]

fake (0.9)

real (0.1)

33 of 142

G tries to synthesize fake images that fool D:

real or fake?

[Goodfellow et al., 2014]

34 of 142

G tries to synthesize fake images that fool the best D:

real or fake?

[Goodfellow et al., 2014]

35 of 142

  • Training: iterate between training D and G with backprop.
  • Global optimum when G reproduces data distribution.

Training

G tries to synthesize fake images that fool D

D tries to identify the fakes

real or fake?

[Goodfellow et al., 2014]
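The two objectives being alternated can be written in a few lines. This sketch assumes the non-saturating generator loss common in practice; `d_real` and `d_fake` are D's probability-of-real outputs:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # D is trained to output 1 on real images and 0 on fakes
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def g_loss(d_fake):
    # Non-saturating G loss: G is trained to make D output 1 on its fakes
    return -np.mean(np.log(d_fake))

# Training alternates: one SGD step on d_loss (updating D only), then one on
# g_loss (updating G only). At the global optimum G reproduces the data
# distribution and the best D can do is output 0.5 everywhere:
balanced = d_loss(np.array([0.5]), np.array([0.5]))   # = log 4
```

The `balanced` value corresponds to the -log 4 optimum of the value function derived in the proof that follows.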

36 of 142

GANs are implicit generative models

“generative model” of the data x

Noise distribution

Samples from a perfectly optimized, sufficiently expressive GAN are samples from the data distribution

GAN

Data distribution

37 of 142

Proof

38 of 142

Proof

p_g = p_data is the unique global minimizer of the GAN objective.
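The argument is the standard one from [Goodfellow et al., 2014], reconstructed here in LaTeX:

```latex
% GAN objective (value function):
\min_G \max_D V(G,D) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\big(1 - D(G(z))\big)\right]

% Step 1: for fixed G (with sample distribution p_g), the pointwise-optimal
% discriminator is
D_G^\ast(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

% Step 2: substituting D_G^\ast back into V gives
C(G) = \max_D V(G,D) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\big\|\, p_g\right)

% Since JSD >= 0 with equality iff the two distributions are equal,
% p_g = p_data is the unique global minimizer, with C(G^\ast) = -\log 4.
```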

39 of 142

Samples from BigGAN

[Brock et al. 2018]

40 of 142

Generative Adversarial Network

Deep nets G and D

Alternating SGD on G and D

41 of 142

Latent space

(Gaussian)

Data space

(Natural image manifold)

[BigGAN, Brock et al. 2018]

42 of 142

Generative adversarial networks are representation learners

Images generated by walking along two latent dimensions of BigGAN

[BigGAN, Brock et al. 2018]

43 of 142

Generative models organize the manifold of natural images

image space

latent space

44 of 142

Representation learning

  1. Image synthesis
  2. Representation learning
  3. Data translation


Embedding

Data

Generative modeling

Representation learning

45 of 142

Autoencoder → Generative model

46 of 142

Variational Autoencoders (VAEs)

Prior distribution

Target distribution

[Kingma & Welling, 2014; Rezende, Mohamed, Wierstra 2014]

47 of 142

Mixture of Gaussians

Target distribution

48 of 142

Variational Autoencoders (VAEs)

Target distribution

Density model:

Sampling:

[Kingma & Welling, 2014; Rezende, Mohamed, Wierstra 2014]
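In standard VAE notation, with decoder network G_θ, the density model and sampler the slide refers to are:

```latex
% Density model: an infinite mixture of Gaussians, one component per latent z,
% with the decoder G_\theta placing each component's mean:
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad
p(z) = \mathcal{N}(z;\, 0,\, I), \qquad
p_\theta(x \mid z) = \mathcal{N}\big(x;\, G_\theta(z),\, \sigma^2 I\big)

% Sampling (ancestral): draw a latent, decode, add observation noise:
z \sim \mathcal{N}(0, I), \qquad
x = G_\theta(z) + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
```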

49 of 142

Variational Autoencoder (VAE)

50 of 142

Current model of target distribution

In order to optimize our model, we need to measure the likelihood it assigns to each datapoint x

51 of 142

52 of 142

Current model of target distribution

In order to optimize our model, we need to measure the likelihood it assigns to each datapoint x

53 of 142

Current model of target distribution

If only we knew z*, we wouldn’t need the integral…

54 of 142

Current model of target distribution

If only we knew z*, we wouldn’t need the integral…

So, we simply try to predict z* for the given x!

Technical note: for the continuous math to actually work out, z* ~ E(x) needs to be a distribution (typically set to Gaussian), but here we (incorrectly) treat it as deterministic for simplicity.

55 of 142

Current model of target distribution

If only we knew z*, we wouldn’t need the integral…

So, we simply try to predict z* for the given x!

(assuming unit Gaussian prior, isotropic Gaussian likelihood model)
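Under those assumptions, maximizing the joint likelihood of x and the predicted code z* works out to an autoencoder objective with a penalty on the code (E is the encoder, D here is the decoder):

```latex
% Treating the code as deterministic, z^\ast = E(x), and expanding
% \log p(x, z^\ast):
\log p_\theta(x, z^\ast) = \log p_\theta(x \mid z^\ast) + \log p(z^\ast)
  = -\tfrac{1}{2\sigma^2} \lVert x - D(z^\ast) \rVert^2
    - \tfrac{1}{2} \lVert z^\ast \rVert^2 + \text{const}

% i.e., minimize a reconstruction error plus a term keeping codes
% near the unit Gaussian prior:
\mathcal{L}(x) = \tfrac{1}{2\sigma^2} \lVert x - D(E(x)) \rVert^2
  + \tfrac{1}{2} \lVert E(x) \rVert^2
```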

56 of 142

Autoencoder!

57 of 142

58 of 142

Autoencoder!

59 of 142

Classical Autoencoder

60 of 142

Variational Autoencoder

61 of 142

Variational Autoencoder
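A minimal numpy forward pass of the VAE objective (toy linear encoder/decoder and dimensions are illustrative assumptions): the encoder predicts a Gaussian q(z|x), a z is drawn with the reparameterization trick, and the loss is reconstruction error plus the closed-form KL to the unit-Gaussian prior:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(16, 8))   # shared encoder layer
W_mu  = rng.normal(scale=0.1, size=(8, 4))    # head predicting mu
W_lv  = rng.normal(scale=0.1, size=(8, 4))    # head predicting log-variance
W_dec = rng.normal(scale=0.1, size=(4, 16))   # decoder (linear, for brevity)

def vae_loss(x):
    # Encoder predicts q(z|x) = N(mu, diag(exp(logvar)))
    e = np.tanh(x @ W_enc)
    mu, logvar = e @ W_mu, e @ W_lv
    # Reparameterization trick: sample z as a differentiable function of (mu, logvar)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    x_hat = z @ W_dec
    recon = np.sum((x - x_hat) ** 2, axis=1).mean()
    # KL( q(z|x) || N(0, I) ) in closed form
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1).mean()
    return recon + kl

x = rng.normal(size=(32, 16))
loss = vae_loss(x)
```

Training would run SGD on this loss over the encoder and decoder weights jointly; the KL term is what keeps the latent space matched to the prior so that decoding fresh z ~ N(0, I) yields samples.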

62 of 142

VAEs

Pros: Cheap to sample, good coverage

Cons: Blurry samples (in practice)

GANs

Pros: Cheap to sample, fast to train, require little data

Cons: No likelihoods, bad coverage (mode collapse), finicky to train (minimax)

[adapted from slide by David Duvenaud]

Other deep generative models:

Autoregressive models, Normalizing flows, Energy-based models

63 of 142

Data Translation

  1. Image synthesis
  2. Representation learning
  3. Data translation

64 of 142

Data translation problems (“structured prediction”)

“this small bird has a pink breast and crown…”

65 of 142

Structured prediction

In vision, this is usually what we are interested in!

Model joint distribution of high-dimensional data

66 of 142

Deep learning in 2012

Use a hypothesis space that can model complex structure

(e.g., a CNN, nearest-neighbor)

67 of 142

[Slide credit: Andrew Ng]

68 of 142

(Colors represent one-hot codes)

[Photo credit: Fredo Durand]

69 of 142

Convolutional neural net

Stochastic gradient descent

Semantic Segmentation
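The CNN + SGD recipe for segmentation bottoms out in a per-pixel cross-entropy against the one-hot label map. A minimal numpy sketch of that loss (shapes are illustrative assumptions):

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    # logits: (H, W, C) per-pixel class scores; labels: (H, W) integer class ids
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick out the log-probability of the correct class at every pixel
    return -log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels].mean()

logits = np.zeros((4, 4, 3))            # uniform prediction over 3 classes
labels = np.zeros((4, 4), dtype=int)
loss = pixel_cross_entropy(logits, labels)   # uniform prediction gives log(3)
```

This is an unstructured objective: each pixel is penalized independently, which is exactly the limitation the structured (GAN) objectives below address.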

70 of 142

Convolutional neural net

Stochastic gradient descent

Sat2Map

71 of 142

Input

Deep net output

72 of 142

Structured prediction

Use an objective that can model structure! (e.g., a graphical model, a GAN, etc)

73 of 142

Generator

74 of 142

G tries to synthesize fake images that fool D

D tries to identify the fakes

Generator

Discriminator

real or fake?

75 of 142

fake (0.9)

real (0.1)

76 of 142

G tries to synthesize fake images that fool D:

real or fake?

77 of 142

G tries to synthesize fake images that fool the best D:

real or fake?

78 of 142

Loss Function

G’s perspective: D is a loss function.

Rather than being hand-designed, it is learned and highly structured.

79 of 142

real or fake?

80 of 142

real!

81 of 142

real or fake pair ?

82 of 142

real or fake pair ?

83 of 142

fake pair

84 of 142

real pair

85 of 142

real or fake pair ?

86 of 142

Training Details: Loss function

Conditional GAN

87 of 142

Training Details: Loss function

Conditional GAN

Stable training + fast convergence

[c.f. Pathak et al. CVPR 2016]
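The conditional-GAN objective pairs the adversarial term with a weighted L1 regression term (λ = 100 in the pix2pix paper). A sketch, where `d_fake` is the discriminator's probability-of-real on the (input, output) pair:

```python
import numpy as np

def g_total_loss(d_fake, y_fake, y_real, lam=100.0):
    gan = -np.mean(np.log(d_fake))            # fool D on (input, output) pairs
    l1 = np.mean(np.abs(y_fake - y_real))     # stay close to the ground truth
    return gan + lam * l1

# A generator that fools D completely and matches the target exactly:
perfect = g_total_loss(np.array([1.0]), np.ones(8), np.ones(8))   # = 0
```

The L1 term anchors low-frequency structure to the target, while the GAN term supplies the learned, structured part of the loss; this combination is what the slide credits for stable training and fast convergence.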


88 of 142

Input

Output

Groundtruth

89 of 142

Input

Output

Groundtruth

Data from [maps.google.com]

90 of 142

[Slide credit: Andrew Ng]

91 of 142

Performance

Amount of data

Why structured objectives

(cartoon)

Deep learning

Older learning algorithms

92 of 142

Performance

Amount of data

DL w/ unstructured objective

(e.g., least-squares regression)

Why structured objectives

(cartoon)

Older learning algorithms

93 of 142

Input

Unstructured prediction (L1)

94 of 142

Structured Prediction (cGAN)

Input

95 of 142

Training data

[HED, Xie & Tu, 2015]

96 of 142

#edges2cats [Chris Hesse]

97 of 142

98 of 142

Ivy Tasi @ivymyt

Vitaly Vidmirov @vvid

99 of 142

Leveraging pretrained models for efficient data translation

100 of 142

With enough data, deep learning can solve pretty much anything

Deep Learning

101 of 142

This is a “dax”.

Which of the below symbols are also daxes?

Few-shot Learning

[Lake, Salakhutdinov, Tenenbaum, 2015]

102 of 142

[Lake, Salakhutdinov, Tenenbaum, 2015]

Which of these is an example of the same concept as the item in the box?

Few-shot Learning

103 of 142

Representations

(encoders)

Models (decoders)

Deep learning

The point of deep learning is to enable learning with little data

104 of 142

Foundation models

[Blind Orion Searching for the Rising Sun by Nicolas Poussin, 1658]

“If I have seen further it is by standing on the shoulders of Giants”

— Newton

https://arxiv.org/pdf/2108.07258.pdf

[Bommasani et al. 2021]

105 of 142

1. Learn foundation model encoders and decoders

for each domain

CLIP

StyleGAN

GPT-3

SimCLR

BERT

2. Plug them together to translate between modalities (may require finetuning)

Image

caption

106 of 142

Tons of data

Learn foundation models

Learner

CLIP

GPT

AlexNet

SimCLR

BERT

DALL-E

VQGAN

StyleGAN

BigGAN

SimCLR

WaveNet

Use/adapt foundations to solve new problems

Little or no data

Adaptor

App

107 of 142

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

108 of 142

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

2. Adaptor: Linear classifier on top of image encodings
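The linear-probe adaptor freezes the image encoder and fits only a linear classifier on its embeddings. In this sketch the "embeddings" are synthetic clusters (an assumption so it runs standalone; real CLIP would supply them), and the classifier is a one-vs-all least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 200, 32, 3
labels = rng.integers(0, c, size=n)
centers = rng.normal(size=(c, d))
emb = centers[labels] + 0.1 * rng.normal(size=(n, d))   # frozen "encoder" output

Y = np.eye(c)[labels]                               # one-hot targets
W, *_ = np.linalg.lstsq(emb, Y, rcond=None)         # linear classifier only
pred = (emb @ W).argmax(axis=1)
accuracy = (pred == labels).mean()
```

Only W is learned; the point is that good frozen representations make the downstream problem nearly linear, so very little labeled data is needed.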

109 of 142

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

2. Adaptor: Just ask
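"Just asking" means zero-shot classification: embed one prompt per class ("A photo of a banana", ...), embed the image, and predict the class with the most similar text embedding. The embeddings below are random unit vectors standing in for CLIP's encoders (an assumption so the sketch runs standalone):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

text_emb = normalize(rng.normal(size=(3, 512)))                    # 3 class prompts
image_emb = normalize(text_emb[1] + 0.01 * rng.normal(size=512))   # resembles class 1

sims = text_emb @ image_emb       # cosine similarities (all vectors unit norm)
pred = int(np.argmax(sims))       # zero-shot prediction
```

No parameters are trained at all; changing the task is just changing the prompts, which is what makes "a sketch of a banana" vs "a photo of a banana" possible.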

110 of 142

CLIP

[Radford et al., 2021]

https://arxiv.org/pdf/2103.00020.pdf

2. Adaptor: Just ask

111 of 142

New capabilities by just asking

“A sketch of a banana”

“A photo of a banana”

112 of 142

New capabilities by plugging pretrained models together: CLIP+GAN

“A Monet painting of the MIT Dome”

To maximize this

INPUT:

OUTPUT:

GAN Generator

Optimize this
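The CLIP+GAN loop above can be sketched as: freeze the generator and the CLIP encoder, and run gradient ascent on the latent z to maximize similarity between the generated image's embedding and the target text embedding. Both "models" here are frozen random linear maps (pure stand-in assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))   # frozen toy "generator + CLIP image encoder"
t = rng.normal(size=16)
t /= np.linalg.norm(t)         # frozen target text embedding (unit norm)

def score_and_grad(z):
    # Cosine similarity between the embedded "image" A z and the text target t,
    # plus its exact gradient with respect to z
    e = z @ A
    ne = np.linalg.norm(e)
    f = e @ t / ne
    de = t / ne - (e @ t) * e / ne**3
    return f, A @ de

z = rng.normal(size=8)
before, _ = score_and_grad(z)
for _ in range(300):           # optimize z only; the models stay frozen
    _, g = score_and_grad(z)
    z += 0.05 * g
after, _ = score_and_grad(z)
```

The real system works the same way, except the gradient flows through a GAN generator and CLIP via backprop rather than this closed form.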

113 of 142

New capabilities by plugging pretrained models together: CLIP+GAN

"What is the answer to the ultimate question of life, the universe, and everything?”

To maximize this

INPUT:

OUTPUT:

Image

Generator

Optimize this

Code: https://colab.research.google.com/drive/1_4PQqzM_0KKytCzWtn-ZPi4cCa5bwK2F?usp=sharing

114 of 142

Tons of data

Learn foundation models

Learner

CLIP

GPT

AlexNet

SimCLR

BERT

DALL-E

VQGAN

StyleGAN

BigGAN

SimCLR

WaveNet

Use/adapt foundations to solve new problems

Little or no data

Adaptor

VQGAN

CLIP

Text-to-image

115 of 142

DALL-E

Image Decoder

“A wide-eyed cat on the lookout for food”

INPUT:

Latent space Transformer

[Ramesh et al. 2021]

https://arxiv.org/pdf/2102.12092.pdf

https://openai.com/blog/dall-e/

116 of 142

Text-to-image translation

“An illustration of a baby daikon radish in a tutu walking a dog”

Image Decoder

OUTPUT:

Latent space Transformer

INPUT:

117 of 142

New capabilities by just asking: product design

118 of 142

New capabilities by just asking: image translation

119 of 142

120 of 142

121 of 142

DALL-E

[Ramesh et al. 2021]

https://arxiv.org/pdf/2102.12092.pdf

https://openai.com/blog/dall-e/

122 of 142

DALL-E 2

123 of 142

DALL-E 3

Prompt: a painting of water lilies in a new art style no human has ever seen before

ChatGPT: Here’s a painting of water lilies in a new art style, envisioned as a blend of organic beauty and futuristic technology. This unique interpretation features metallic water lilies floating on a liquid glass surface under a neon sky, creating a scene that harmoniously combines elements of nature with advanced technology.

https://chat.openai.com

124 of 142

Domain mapping

[Cartoon: The Computer as a Communication Device, Licklider & Taylor 1968]

  1. Image synthesis
  2. Structured prediction
  3. Domain mapping

[Includes slides from Jun-Yan Zhu, Taesung Park]

125 of 142

Unpaired data

 

 

Paired data

 

126 of 142

real or fake pair ?

127 of 142

real or fake pair ?

No input-output pairs!

128 of 142

real or fake?

Usually loss functions check if output matches a target instance

GAN loss checks if output is part of an admissible set

129 of 142

Real!

130 of 142

Real too!

Nothing to force output to correspond to input

131 of 142

 

 

[Zhu*, Park* et al. 2017], [Yi et al. 2017], [Kim et al. 2017]

CycleGAN, or there and back aGAN

132 of 142

 

 

CycleGAN, or there and back aGAN

133 of 142

Cycle Consistency Loss

 

 

 

 

134 of 142

 

 

 

 

 

 

Cycle Consistency Loss
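The cycle consistency loss can be written in one line: with G mapping X→Y and F mapping Y→X, penalize failing to come back to where you started. A minimal sketch:

```python
import numpy as np

def cycle_loss(G, F, x, y):
    # F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

x, y = np.ones((4, 3)), np.zeros((4, 3))
identity = lambda a: a
loss = cycle_loss(identity, identity, x, y)   # perfectly cycle-consistent: 0
```

Added to the two GAN losses, this term is what forces the output to correspond to the input even though no paired examples exist.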

 

 

135 of 142

Paired translation
Training data: paired. Objective: regression error. (Input → Result)

Unpaired translation
Training data: unpaired. Objective: cycle-consistency error. (Input → Result)

[“pix2pix”, Isola, Zhu, Zhou, Efros, 2017]

136 of 142

137 of 142

138 of 142

Cezanne

Ukiyo-e

Monet

Input

Van Gogh

139 of 142

Gaussian

Target distribution

GANs

140 of 142

Horses

Zebras

CycleGAN

141 of 142

What would it look like if…?

142 of 142

What would it look like if…?