Lecture 14
Image Synthesis
6.8300/1 Advances in Computer Vision
Spring 2024
Sara Beery, Kaiming He, Vincent Sitzmann, Mina Konaković Luković
14. Image Synthesis
Announcements
Analysis
“Duck”
image x
label y
Analysis
“A large duck standing by the river”
image x
caption y
Image Captioner
Analysis
“positive”
sentence x
sentiment y
“A statuesque duck gazing gracefully over the water”
Sentiment Classifier
Analysis
“Duck”
label y
image x
Synthesis
“Duck”
label y
image x
Synthesis
Generator
image x
“Fish”
label y
Synthesis
Photo
User sketch
Translator
Synthesis
Photo
User sketch
Image synthesis via generative modeling
In vision, this is usually what we are interested in!
Model of high-dimensional structured data
X is high-dimensional!
z
Deep nets are data transformers
Embedding
Data
Generative modeling vs Representation learning
Representation learning:
mapping data to abstract representations
(analysis)
z
Embedding
Data
Generative modeling
Representation learning
Generative modeling:
mapping abstract representations to data (synthesis)
What can you do with generative models?
Image synthesis
[Images: https://ganbreeder.app/]
Procedural graphics
[Anders Scheil]
Image synthesis from “noise”
Generator
Learning a generative model
[figs modified from: http://introtodeeplearning.com/materials/2019_6S191_L4.pdf]
Learning a density model
Case study #1: Fitting a Gaussian to data
fig from [Goodfellow, 2016]
Max likelihood objective
Considering only Gaussian fits
“max likelihood”
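As a concrete illustration of the max likelihood objective (not from the slides): restricted to Gaussian fits, maximum likelihood has a closed form, the sample mean and the (1/N) sample variance. A minimal numpy sketch:

```python
import numpy as np

def fit_gaussian_mle(x):
    """Closed-form maximum-likelihood estimates for a 1-D Gaussian."""
    mu = x.mean()                     # MLE of the mean
    sigma2 = ((x - mu) ** 2).mean()   # MLE of the variance (the biased, 1/N form)
    return mu, sigma2

# Fit to samples drawn from a known Gaussian; the estimates recover its parameters.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=10_000)
mu_hat, var_hat = fit_gaussian_mle(data)
```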
Case study #2: learning a deep generative model
SGD
Deep net
Usually max likelihood
Models that provide a sampler but no density are called implicit generative models
Deep generative models are distribution transformers
Prior distribution (Gaussian noise)
Target distribution (Synthesized image)
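A toy illustration of the distribution-transformer idea (the fixed tanh nonlinearity here is an arbitrary stand-in for a trained generator): pushing Gaussian samples through a deterministic map yields samples from a different distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(50_000)   # prior: standard Gaussian noise

def generator(z):
    # A fixed nonlinearity standing in for a trained deep net: it squashes
    # Gaussian mass toward the two ends of (-1, 1), giving a bimodal output.
    return np.tanh(3.0 * z)

x = generator(z)  # samples from the transformed ("target") distribution
```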
Generative Adversarial Networks (GANs)
Gaussian noise
Synthesized image
Generator
[Goodfellow et al., 2014]
G tries to synthesize fake images that fool D
D tries to identify the fakes
Discriminator
real or fake?
fake (0.9)
real (0.1)
G tries to synthesize fake images that fool D:
real or fake?
G tries to synthesize fake images that fool the best D:
real or fake?
Training
G tries to synthesize fake images that fool D
D tries to identify the fakes
real or fake?
[Goodfellow et al., 2014]
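The minimax game can be sketched numerically. This toy helper (illustrative, not from the slides) computes the discriminator loss and the commonly used non-saturating generator loss from discriminator scores:

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """d_real, d_fake: discriminator outputs in (0, 1) on real / generated batches."""
    # D maximizes  E[log D(x)] + E[log(1 - D(G(z)))]  -> minimize the negation.
    d_loss = -(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean())
    # Non-saturating G loss: G maximizes  E[log D(G(z))].
    g_loss = -np.log(d_fake + eps).mean()
    return d_loss, g_loss

# A confident, correct discriminator: low D loss, high G loss.
d_loss, g_loss = gan_losses(np.array([0.95, 0.9]), np.array([0.05, 0.1]))
```

At the equilibrium where D outputs 0.5 everywhere, the D loss equals 2 log 2, matching the −log 4 value in the theory.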
GANs are implicit generative models
“generative model” of the data x
Noise distribution
Samples from a perfectly optimized, sufficiently expressive GAN are samples from the data distribution
GAN
Data distribution
Proof
p_g = p_data is the unique global minimizer of the GAN objective.
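The argument, following Goodfellow et al. (2014):

```latex
V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]
        + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% For fixed G, pointwise maximization over D gives the optimal discriminator:
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

% Substituting D^*_G back into V:
\max_D V(D, G) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)

% The Jensen--Shannon divergence is nonnegative and zero iff the two
% distributions are equal, so p_g = p_{\text{data}} is the unique minimizer.
```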
Samples from BigGAN
[Brock et al. 2018]
Generative Adversarial Network
Deep nets G and D
Alternating SGD on G and D
Latent space
(Gaussian)
Data space
(Natural image manifold)
[BigGAN, Brock et al. 2018]
Generative adversarial networks are representation learners
Images generated by walking along two latent dimensions of BigGAN
[BigGAN, Brock et al. 2018]
Generative models organize the manifold of natural images
image space
latent space
Representation learning
z
Embedding
Data
Generative modeling
Representation learning
Autoencoder → Generative model
Variational Autoencoders (VAEs)
Prior distribution
Target distribution
[Kingma & Welling, 2014; Rezende, Mohamed, Wierstra 2014]
Mixture of Gaussians
Target distribution
Variational Autoencoders (VAEs)
Target distribution
Density model: p(x) = ∫ p(x|z) p(z) dz
Sampling: z ~ p(z), then x ~ p(x|z)
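Sampling from such a latent-variable model is ancestral: first draw the latent from the prior, then draw the observation given that latent. A numpy sketch with a toy two-component mixture of Gaussians (all parameters made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# p(z): a discrete latent choosing the mixture component.
weights = np.array([0.3, 0.7])
# p(x|z): a Gaussian per component.
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

def sample(n):
    z = rng.choice(len(weights), size=n, p=weights)  # z ~ p(z)
    return rng.normal(means[z], stds[z])             # x ~ p(x|z)

x = sample(100_000)
```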
[Kingma & Welling, 2014; Rezende, Mohamed, Wierstra 2014]
Variational Autoencoder (VAE)
Current model of target distribution
In order to optimize our model, we need to measure the likelihood it assigns to each datapoint x.
If only we knew z*, we wouldn’t need the integral…
So, we simply try to predict z* for the given x!
Technical note: for the continuous math to actually work out, z* ~ E(x) needs to be a distribution (typically set to Gaussian), but here we (incorrectly) treat it as deterministic for simplicity.
(assuming unit Gaussian prior, isotropic Gaussian likelihood model)
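Under exactly these assumptions (unit Gaussian prior, isotropic Gaussian likelihood, diagonal-Gaussian encoder), the negative ELBO reduces to a squared-error reconstruction term plus a closed-form KL term. A minimal numpy sketch; the function name and the absence of likelihood-scale constants are illustrative choices:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO: reconstruction error + KL to the unit-Gaussian prior.

    mu, logvar parameterize the encoder's diagonal Gaussian q(z|x).
    """
    # Reconstruction: up to constants, -log p(x|z) for an isotropic
    # Gaussian likelihood is the squared error.
    recon = ((x - x_recon) ** 2).sum()
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()
    return recon + kl

# Perfect reconstruction with an encoder matching the prior: zero loss.
loss = vae_loss(np.zeros(4), np.zeros(4), np.zeros(2), np.zeros(2))
```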
Autoencoder!
Classical Autoencoder
Variational Autoencoder
VAEs
Pros: Cheap to sample, good coverage
Cons: Blurry samples (in practice)
GANs
Pros: Cheap to sample, fast to train, require little data
Cons: No likelihoods, bad coverage (mode collapse), finicky to train (minimax)
[adapted from slide by David Duvenaud]
Other deep generative models:
Autoregressive models, Normalizing flows, Energy-based models
Data Translation
Data translation problems (“structured prediction”)
“this small bird has a pink breast and crown…”
Structured prediction
In vision, this is usually what we are interested in!
Model joint distribution of high-dimensional data
Deep learning in 2012
Use a hypothesis space that can model complex structure
(e.g., a CNN, nearest-neighbor)
[Slide credit: Andrew Ng]
(Colors represent one-hot codes)
[Photo credit: Fredo Durand]
Convolutional neural net
Stochastic gradient descent
Semantic Segmentation
…
Convolutional neural net
Stochastic gradient descent
Sat2Map
…
Input
Deep net output
Structured prediction
Use an objective that can model structure! (e.g., a graphical model, a GAN, etc.)
Generator
G tries to synthesize fake images that fool D
D tries to identify the fakes
Generator
Discriminator
real or fake?
fake (0.9)
real (0.1)
G tries to synthesize fake images that fool D:
real or fake?
G tries to synthesize fake images that fool the best D:
real or fake?
Loss Function
G’s perspective: D is a loss function.
Rather than being hand-designed, it is learned and highly structured.
real or fake?
real!
real or fake pair ?
real or fake pair ?
fake pair
real pair
real or fake pair ?
Training Details: Loss function
Conditional GAN
Stable training + fast convergence
[c.f. Pathak et al. CVPR 2016]
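The pix2pix generator objective combines the cGAN term with an L1 regression term weighted by λ (λ = 100 in the paper). A numpy sketch; the helper and its toy inputs are illustrative:

```python
import numpy as np

def pix2pix_g_loss(d_fake, output, target, lam=100.0, eps=1e-8):
    """Generator objective from pix2pix: cGAN term + lambda * L1.

    d_fake: discriminator scores in (0, 1) on the generated (input, output) pairs.
    output, target: the generator's image and the ground-truth image.
    """
    gan_term = -np.log(d_fake + eps).mean()    # fool the discriminator
    l1_term = np.abs(output - target).mean()   # stay close to the ground truth
    return gan_term + lam * l1_term

# With a perfect reconstruction, only the cGAN term remains.
loss = pix2pix_g_loss(np.array([0.5]), np.zeros(3), np.zeros(3))
```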
Input
Output
Ground truth
Input
Output
Ground truth
Data from [maps.google.com]
[Slide credit: Andrew Ng]
Why structured objectives
(cartoon)
[Plot: Performance vs. Amount of data, comparing deep learning, DL w/ unstructured objective (e.g., least-squares regression), and older learning algorithms]
Input
Unstructured prediction (L1)
Structured Prediction (cGAN)
Input
Training data
…
[HED, Xie & Tu, 2015]
#edges2cats [Chris Hesse]
Ivy Tasi @ivymyt
Vitaly Vidmirov @vvid
Leveraging pretrained models for efficient data translation
With enough data, deep learning can solve pretty much anything
Deep Learning
This is a “dax”.
Which of the below symbols are also daxes?
Few-shot Learning
[Lake, Salakhutdinov, Tenenbaum, 2015]
Which of these is an example of the same concept as the item in the box?
Few-shot Learning
Representations
(encoders)
Models (decoders)
Deep learning
The point of deep learning is to enable learning with little data
Foundation models
[Blind Orion Searching for the Rising Sun by Nicolas Poussin, 1658]
“If I have seen further it is by standing on the shoulders of Giants”
— Newton
https://arxiv.org/pdf/2108.07258.pdf
[Bommasani et al. 2021]
1. Learn foundation model encoders and decoders
for each domain
CLIP
StyleGAN
GPT-3
SimCLR
BERT
2. Plug them together to translate between modalities (may require finetuning)
Image
caption
Tons of data
Learn foundation models
Learner
CLIP
GPT
AlexNet
SimCLR
BERT
DALL-E
VQGAN
StyleGAN
BigGAN
WaveNet
Use/adapt foundations to solve new problems
Little or no data
Adaptor
App
CLIP
[Radford et al., 2021]
https://arxiv.org/pdf/2103.00020.pdf
CLIP
2. Adaptor: Linear classifier on top of image encodings
CLIP
2. Adaptor: Just ask
CLIP
New capabilities by just asking
“A sketch of a banana”
“A photo of a banana”
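"Just ask" amounts to: embed one text prompt per class, embed the image, and predict the class whose prompt is most cosine-similar to the image. A toy sketch with made-up vectors in place of CLIP's encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot classification: return the index of the prompt
    whose embedding has the highest cosine similarity with the image."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = normalize(text_embs) @ normalize(image_emb)
    return int(np.argmax(sims))

# Toy embeddings (made up): the image lies closest to prompt 1.
prompts = ["A photo of a banana", "A sketch of a banana"]
image_emb = np.array([0.1, 0.9, 0.0])
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.2]])
pred = zero_shot_classify(image_emb, text_embs)
```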
New capabilities by plugging pretrained models together: CLIP+GAN
“A Monet painting of the MIT Dome”
To maximize this
INPUT:
OUTPUT:
GAN Generator
Optimize this
New capabilities by plugging pretrained models together: CLIP+GAN
"What is the answer to the ultimate question of life, the universe, and everything?”
To maximize this
INPUT:
OUTPUT:
Image
Generator
Optimize this
Code: https://colab.research.google.com/drive/1_4PQqzM_0KKytCzWtn-ZPi4cCa5bwK2F?usp=sharing
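The linked notebook runs this loop with real networks. The toy below mimics it with a linear "generator" and a random target vector standing in for CLIP's text embedding (all names and shapes are illustrative), ascending the cosine similarity with finite-difference gradients instead of autograd:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))   # toy generator: latent z (4-d) -> "image" embedding (8-d)
target = rng.standard_normal(8)   # stand-in for the text prompt's embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(z):
    # What CLIP would rate: similarity between the generated image and the prompt.
    return cosine(W @ z, target)

z = rng.standard_normal(4)
init_sim = score(z)

# Gradient ascent on the similarity via central finite differences.
lr, h = 0.1, 1e-5
for _ in range(300):
    grad = np.zeros_like(z)
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = h
        grad[i] = (score(z + dz) - score(z - dz)) / (2 * h)
    z += lr * grad

final_sim = score(z)
```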
Tons of data
Learn foundation models
Learner
CLIP
GPT
AlexNet
SimCLR
BERT
DALL-E
VQGAN
StyleGAN
BigGAN
WaveNet
Use/adapt foundations to solve new problems
Little or no data
Adaptor
VQGAN
CLIP
Text-to-image
DALL-E
Image Decoder
“A wide-eyed cat on the lookout for food”
INPUT:
Latent space Transformer
[Ramesh et al. 2021]
https://arxiv.org/pdf/2102.12092.pdf
https://openai.com/blog/dall-e/
Text-to-image translation
“An illustration of a baby daikon radish in a tutu walking a dog”
Image Decoder
OUTPUT:
Latent space Transformer
INPUT:
New capabilities by just asking: product design
New capabilities by just asking: image translation
DALL-E
[Ramesh et al. 2021]
https://arxiv.org/pdf/2102.12092.pdf
https://openai.com/blog/dall-e/
DALL-E 2
DALL-E 3
Prompt: a painting of water lilies in a new art style no human has ever seen before
ChatGPT: Here’s a painting of water lilies in a new art style, envisioned as a blend of organic beauty and futuristic technology. This unique interpretation features metallic water lilies floating on a liquid glass surface under a neon sky, creating a scene that harmoniously combines elements of nature with advanced technology.
https://chat.openai.com
Domain mapping
[Cartoon: The Computer as a Communication Device, Licklider & Taylor 1968]
[Includes slides from Jun-Yan Zhu, Taesung Park]
Unpaired data
Paired data
real or fake pair ?
real or fake pair ?
No input-output pairs!
real or fake?
Usually loss functions check if output matches a target instance
GAN loss checks if output is part of an admissible set
Real!
Real too!
Nothing to force output to correspond to input
[Zhu*, Park* et al. 2017], [Yi et al. 2017], [Kim et al. 2017]
CycleGAN, or there and back aGAN
Cycle Consistency Loss
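The cycle-consistency loss penalizes round trips that fail to return the input: ||F(G(x)) − x||₁ plus ||G(F(y)) − y||₁. A numpy sketch with toy mappings in place of the two trained translators:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L_cyc = E||F(G(x)) - x||_1 + E||G(F(y)) - y||_1 (CycleGAN)."""
    forward = np.abs(F(G(x)) - x).mean()   # x -> G(x) -> F(G(x)) should return x
    backward = np.abs(G(F(y)) - y).mean()  # y -> F(y) -> G(F(y)) should return y
    return forward + backward

# With toy mappings that are exact inverses, the cycle loss vanishes.
G = lambda x: x + 1.0   # stand-in for horses -> zebras
F = lambda y: y - 1.0   # stand-in for zebras -> horses
loss = cycle_consistency_loss(np.arange(4.0), np.arange(4.0), G, F)
```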
Paired translation
…
Training data
regression error
Objective
cycle-consistency error
Objective
Input
Result
…
…
Training data
Result
Input
[“pix2pix”, Isola, Zhu, Zhou, Efros, 2017]
Unpaired translation
Cezanne
Ukiyo-e
Monet
Input
Van Gogh
Gaussian
Target distribution
GANs
Horses
Zebras
CycleGAN
What would it look like if…?