Autoregressive Conditional Generation using Transformers
Yen-Chi Cheng, Paritosh Mittal
Maneesh K. Singh, Shubham Tulsiani
Outline
Recap
Problem Formulation
Conditional Inputs
Full 3D Shapes
e.g.
Recap of SDF and T-SDF
ShapeNet: Training Data Examples
airplane, table, rifle, sofa, bench, speaker, chair, car, cabinet, phone, lamp, display, watercraft
Methodology
Model
Learning a Discrete Representation of SDFs with VQ-VAE
[Figure: VQ-VAE pipeline. Input: SDF (D x H x W) → Encoder → latent grid (Dz x Hz x Wz) → Quantization via codebook lookup (codebook entries 0 … k-1) → grid of code indices (Dz x Hz x Wz) → Decoder → Recon: SDF (D x H x W).]
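To make the quantization step concrete, here is a minimal sketch of a codebook lookup with a straight-through gradient, assuming the encoder features have been flattened to an (N, d) tensor and the codebook is a (k, d) PyTorch tensor; the names and shapes are illustrative, not the exact implementation used here.

```python
import torch

def quantize(z_e, codebook):
    """Map encoder features z_e (N, d) to their nearest codebook entries.

    codebook: (k, d) tensor of embeddings. Returns the code indices (N,)
    and the quantized vectors z_q (N, d).
    """
    dist = torch.cdist(z_e, codebook)      # (N, k) pairwise L2 distances
    indices = dist.argmin(dim=1)           # nearest code per feature
    z_q = codebook[indices]                # codebook lookup
    # straight-through estimator: values come from z_q, gradients flow to z_e
    z_q = z_e + (z_q - z_e).detach()
    return indices, z_q
```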
Autoregressive Generation with Transformer
[Figure: Autoregressive generation with a Transformer. Input: SDF (D x H x W) → Encoder → Quantization → grid of code indices. The Transformer predicts the next code index from the previous ones (e.g., 0, 3, 7, 2 → ?), and the completed index grid is passed through the Decoder to give Generation: SDF (D x H x W).]
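A hedged sketch of the sampling loop this figure implies, assuming a transformer that maps a prefix of code indices to logits over the k codebook entries; the `transformer` callable, the start-token convention, and the sequence length are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_codes(transformer, seq_len=512, device="cpu"):
    """Sample a flattened latent grid one code index at a time."""
    # assumption: index 0 is reserved as a start token
    tokens = torch.zeros(1, 1, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = transformer(tokens)             # (1, t, k) logits over the codebook
        probs = logits[:, -1].softmax(dim=-1)    # distribution over the next index
        nxt = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]  # drop the start token; decode via codebook lookup + Decoder
```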
Autoregressive Generation with Partial Input
[Figure: Autoregressive generation with partial input. The partial SDF (D x H x W) is encoded and quantized; observed positions keep their code indices (e.g., 0, ?, ?, 2, ?), the Transformer fills in the missing indices, and the Decoder produces the completed SDF (D x H x W).]
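A rough sketch of the completion setting above, assuming the latent grid is flattened to a sequence in which observed positions keep the indices encoded from the partial input and missing positions are filled in left to right; the mask convention and the `transformer` interface are assumptions.

```python
import torch

@torch.no_grad()
def complete_codes(transformer, codes, observed):
    """codes: (1, T) code indices from the partial shape; observed: (1, T) bool mask."""
    codes = codes.clone()
    for t in range(codes.shape[1]):
        if observed[0, t]:
            continue                          # keep indices from the partial input
        # assumption: position 0 is observed (otherwise prepend a start token)
        logits = transformer(codes[:, :t])    # predict position t from the prefix
        probs = logits[:, -1].softmax(dim=-1)
        codes[0, t] = torch.multinomial(probs, num_samples=1).item()
    return codes                              # decode with the VQ-VAE decoder
```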
Model - Summary
VQ-VAE on ShapeNet (Chair)
Input
Reconstruction
k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8
VQ-VAE on ShapeNet (All)
Input
Reconstruction
k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8
VQ-VAE with Patch Input
[Figure: P-VQ-VAE. The input x (64x64x64) is unfolded into an (8x8x8) grid of 8x8x8 patches x_c (512x8x8x8). Each patch is encoded independently (Enc → z, 512xdx1x1x1), quantized by codebook lookup (z_q, 512xdx1x1x1; k: codebook size, d: code dimension), folded back into a dx8x8x8 grid, and decoded jointly (Dec) into the reconstruction x_rec (1x64x64x64).]
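A small sketch of the unfold/fold bookkeeping implied by the figure, assuming a 64^3 SDF split into an 8x8x8 grid of 8^3 patches; the reshapes are illustrative rather than the exact implementation.

```python
import torch

def unfold_patches(x, p=8):
    """x: (1, 64, 64, 64) SDF -> (512, 1, p, p, p) patches encoded independently."""
    _, D, H, W = x.shape
    x = x.reshape(1, D // p, p, H // p, p, W // p, p)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)         # bring the three grid indices together
    return x.reshape(-1, 1, p, p, p)           # (8*8*8, 1, 8, 8, 8)

def fold_codes(z_q, grid=8):
    """z_q: (512, d, 1, 1, 1) quantized patch codes -> (1, d, 8, 8, 8) for joint decoding."""
    d = z_q.shape[1]
    z = z_q.reshape(grid, grid, grid, d)
    return z.permute(3, 0, 1, 2).unsqueeze(0)  # (1, d, 8, 8, 8)
```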
P-VQ-VAE on ShapeNet (Chair)
Input
Reconstruction
k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8
P-VQ-VAE on ShapeNet (All)
Input
Reconstruction
k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8
Comparison on ShapeNet (Chair): P-VQ-VAE vs. VQ-VAE
VQ-VAE
P-VQ-VAE
Input
Comparison on ShapeNet (All): P-VQ-VAE vs. VQ-VAE
VQ-VAE
P-VQ-VAE
Input
Quantitative Results: IoU
| Model | IoU |
| VQ-VAE - Chair | 0.8099 |
| VQ-VAE - All | 0.8145 |
| P-VQ-VAE - Chair | 0.7112 |
| P-VQ-VAE - All | 0.7222 |
IoU for VQ-VAE and P-VQ-VAE
IoU Across All Categories
| Model | All | watercraft | chair | lamp | cabinet | table | bench | display | speaker | phone | car | rifle | sofa | airplane |
| VQ-VAE (All) | 0.8145 | 0.7483 | 0.8004 | 0.6848 | 0.8884 | 0.6866 | 0.8262 | 0.8049 | 0.7774 | 0.8100 | 0.8979 | 0.6209 | 0.8423 | 0.7753 |
| P-VQ-VAE (All) | 0.7222 | 0.6889 | 0.7244 | 0.6092 | 0.8289 | 0.5977 | 0.7604 | 0.7368 | 0.7234 | 0.7412 | 0.8357 | 0.5427 | 0.7512 | 0.6854 |
Autoregressive Generation
What we want
Start Simple!!
What is Autoregressive Generation?
[Figure: Autoregressive generation. A Transformer (encoder-decoder) models p(x_i | x_{<i}): given the previous tokens x_0, x_1, x_2, … (e.g., input indices 0, 3, 77, 2), it predicts the next token as output (e.g., 91).]
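Written out, the factorization behind next-token prediction is the standard chain rule over the token sequence:

\[ p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid x_{<i}\right), \qquad x_{<i} = (x_1, \dots, x_{i-1}) \]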
Sequential Generation (Next-Token Prediction)
[Figure: Encoder-only Transformer for next-token prediction (47.8M parameters, 12 encoder layers, 12 attention heads). Input tokens x_0 … x_{n-1} are embedded via codebook look-up, added to positional encodings p_1 … p_{n-1}, and the model predicts x'_1 … x'_n.]
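A hedged PyTorch sketch of this encoder-only setup with a causal mask, so each position only attends to earlier tokens; the layer and head counts follow the slide, while the embedding width, sequence length, and loss wiring are assumptions.

```python
import torch
import torch.nn as nn

class NextTokenTransformer(nn.Module):
    def __init__(self, k=512, d_model=768, n_layers=12, n_heads=12, seq_len=512):
        super().__init__()
        self.tok = nn.Embedding(k, d_model)        # codebook look-up (token embeddings)
        self.pos = nn.Embedding(seq_len, d_model)  # positional encodings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, k)          # logits over the k codebook indices

    def forward(self, idx):                        # idx: (B, T) code indices
        T = idx.shape[1]
        h = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        return self.head(self.encoder(h, mask=causal))  # (B, T, k)

# training step: position t predicts token t+1
# logits = model(codes[:, :-1])
# loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), codes[:, 1:].reshape(-1))
```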
Results on Sequential Generation
(Conditioning) The first 100 elements of the latent code (100/512) are given as input
Results on Sequential Generation
Autoregressive vs. Random Selection
100 Original + 412 random
250 Original + 262 random
100 Original + 412 Autoregressive
Results on Sequential Generation
Results on Sequential Unconditional Generation
Results on Sequential Unconditional Generation
Random Order Generation
[Figure: Random order generation. Instead of the fixed order x_1, x_2, …, x_k, tokens are generated in a random permutation of positions, e.g., x_3, x_9, x_36, x_91, x_42, x_2, ….]
Random Order Generation
[Figure: Encoder-decoder Transformer for random order generation (63.8M parameters; Encoder Network: 8 layers, 8 attention heads; Decoder Network: 8 layers, 8 attention heads). Observed tokens x_a, x_b, x_c are embedded via codebook look-up, combined with their position encodings p_a, p_b, p_c, and encoded; the decoder is queried with a target position p_l and predicts x'_l.]
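One possible reading of the figure in code, assuming the encoder consumes observed token-plus-position embeddings and the decoder is queried with target-position embeddings only; apart from the layer and head counts, all sizes are assumptions.

```python
import torch
import torch.nn as nn

class RandomOrderTransformer(nn.Module):
    def __init__(self, k=512, d_model=512, n_heads=8, n_layers=8, seq_len=512):
        super().__init__()
        self.tok = nn.Embedding(k, d_model)        # codebook look-up
        self.pos = nn.Embedding(seq_len, d_model)  # position encoding
        self.transformer = nn.Transformer(d_model, n_heads,
                                          num_encoder_layers=n_layers,
                                          num_decoder_layers=n_layers,
                                          batch_first=True)
        self.head = nn.Linear(d_model, k)

    def forward(self, obs_idx, obs_pos, query_pos):
        """obs_idx, obs_pos: (B, S) observed tokens and their positions;
        query_pos: (B, Q) positions whose tokens should be predicted."""
        src = self.tok(obs_idx) + self.pos(obs_pos)  # encoder input: token + position
        tgt = self.pos(query_pos)                    # decoder queries: positions only
        h = self.transformer(src, tgt)
        return self.head(h)                          # (B, Q, k) logits
```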
Results: Random Order Generation
Context: 64
Results: Random Order Generation (Varying Context)
Context: 0
Context: 32
Context: 64
Results Placeholder For Random Order Generation
Recon. P-VQ-VAE
Transformer Generation
Examples of Visualization Applications
Input from different views
Transformer Generation
Recon. from P-VQ-VAE
Image Conditional Generation
Image Conditional Generation
Image Conditional Generation - Model
[Figure: Image-conditional model. A 256x256 image (randomly sampled from one view) is encoded by ResNet-18 into a c x 8 x 8 feature map, lifted with a Linear layer and ConvT3D to c x 8 x 8 x 8, and mapped to (n_tokens) x 8 x 8 x 8 conditioning features.]
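A sketch of this conditioning path, assuming a standard torchvision ResNet-18 trunk whose 8x8 feature map is projected to c channels and lifted to an 8x8x8 grid with a transposed 3D convolution; the channel sizes and the exact lifting op are illustrative.

```python
import torch
import torch.nn as nn
import torchvision

class ImageConditioner(nn.Module):
    def __init__(self, c=256, n_tokens=512):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, 8, 8) for a 256x256 image
        self.proj = nn.Linear(512, c)                               # -> c x 8 x 8
        self.lift = nn.ConvTranspose3d(c, n_tokens, kernel_size=(8, 1, 1))  # expand depth 1 -> 8

    def forward(self, img):                                         # img: (B, 3, 256, 256)
        f = self.cnn(img)                                           # (B, 512, 8, 8)
        f = self.proj(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # (B, c, 8, 8)
        f = f.unsqueeze(2)                                          # (B, c, 1, 8, 8)
        return self.lift(f)                                         # (B, n_tokens, 8, 8, 8)
```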
Image Conditional Generation
Image Conditional Generation - Results
Image Condition
Transformer Generation
Image Condition
Transformer Generation
Image Conditional Generation - Results
Image Condition
Transformer Generation
Future Plan
Questions
Thanks
Appendix
Results on Autoregressive Generation
IoU Curve
[Plot: IoU curve. AutoEncoder IoU reference values: 0.1937, 0.4328, 0.767, 0.81.]
Note
Demo
VQ-VAE on ShapeNet (All)
Input
Reconstruction
k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8
VQ-VAE: Architectures
[Figure: VQ-VAE architecture]
Encoder: Conv → ResBlock → Down → ResBlock → Down → ResBlock → Down → ResBlock → Attn → ResBlock → Attn → ResBlock → Norm-Activ → Conv
Vector Quantizer (between Encoder and Decoder)
Decoder: Conv → ResBlock → Attn → ResBlock → ResBlock → Attn → Up → ResBlock → Up → ResBlock → Up → ResBlock → Norm-Activ → Conv
ResBlock: Norm → Activ → Conv → Norm → Activ → Drop → Conv
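A minimal 3D ResBlock matching the Norm → Activ → Conv → Norm → Activ → Drop → Conv listing above, with a residual skip; the GroupNorm/SiLU choices are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Norm -> Activ -> Conv -> Norm -> Activ -> Drop -> Conv, plus a residual skip."""
    def __init__(self, channels, dropout=0.1):      # channels assumed divisible by 8
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Dropout(dropout),
            nn.Conv3d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)
```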
Results on Sequential Generation (Varying Context)
Context: 0
Context: 50
Context: 100
IoU is not a good metric (Optional)
Image Conditional Generation - Experiment
Input
P-VQ-VAE Recon.
Output
Random Order Generation
[Figure: Random order generation. Instead of the fixed order x_1, x_2, …, x_k, tokens are generated in a random permutation of positions, e.g., x_3, x_9, x_36, x_91, x_42, x_2, ….]
import torch

# permute the 512-token sequence x into a random generation order
random_positions = torch.randperm(512)
x_permuted = x[random_positions]
Random Order Generation
[Figure: Encoder-decoder Transformer for random order generation (Encoder Network: 8 layers, 12 attention heads; Decoder Network: 8 layers, 12 attention heads). Observed tokens x_a, x_b, x_c are embedded via codebook look-up, combined with their position encodings p_a, p_b, p_c, and encoded; the decoder is queried with target positions p_k, p_l, p_m and predicts x'_k, x'_l, x'_m.]
Objective: Generate 3D shapes from partial input
*https://unsplash.com/photos/EPy0gBJzzZU
an armchair in the shape of an avocado
Image taken from the original paper [1]
Example taken from original Dall-E work [2]
Image is a visualization from ShapeNet [3]
Autoregressive models
What is Autoregressive Generation?
[Figure: Autoregressive generation. The Transformer models p(x_i | x_{<i}): given the previous tokens x_0, x_1, x_2, …, it predicts the next token x_i.]
VQ-VAE / VQ-VAE-2
VAEs regularize the latent space; VQ-VAEs instead use discrete latent variables.
*https://arxiv.org/pdf/1906.00446.pdf
Image taken from the original paper*
VQ step
*https://arxiv.org/pdf/1711.00937.pdf
Image taken from the original paper*
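For reference, the training objective from the cited VQ-VAE paper, where z_e(x) is the encoder output, e the selected codebook entry, sg[·] the stop-gradient operator, and β the commitment weight:

\[ \mathcal{L} = \log p\big(x \mid z_q(x)\big) + \big\lVert \mathrm{sg}[z_e(x)] - e \big\rVert_2^2 + \beta \big\lVert z_e(x) - \mathrm{sg}[e] \big\rVert_2^2 \]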
Taming Transformers for High-Resolution Image Synthesis
*https://arxiv.org/pdf/2012.09841.pdf
Image taken from the original paper*
Taming Transformers for High-Resolution Image Synthesis
*https://arxiv.org/pdf/2012.09841.pdf
Image taken from the original paper*