1 of 68


Autoregressive Conditional Generation using Transformers

Yen-Chi Cheng, Paritosh Mittal

Maneesh K. Singh, Shubham Tulsiani

2 of 68

Outline

  • Recap
  • Methodology
  • Future Plan


3 of 68

Recap


4 of 68

Problem Formulation

  • Given conditional inputs, we aim to generate full 3D shapes with high quality and diversity


[Figure] Conditional Inputs → Full 3D Shapes

e.g.

  • A leg of a chair
  • An image from a certain view of the chair
  • Text description
  • ...

5 of 68

Recap of SDF and T-SDF


  • An SDF is a function that takes a 3D point (x, y, z) as input and returns the signed distance to the closest surface
  • T-SDF: SDF values truncated at a threshold
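A minimal sketch of the truncation step, assuming the T value of 0.2 listed on the VQ-VAE slides is the truncation threshold (the exact implementation may differ):

    import torch

    def truncate_sdf(sdf: torch.Tensor, threshold: float = 0.2) -> torch.Tensor:
        # Clamp signed distances to [-threshold, threshold] to obtain the T-SDF
        return sdf.clamp(min=-threshold, max=threshold)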

6 of 68

ShapeNet: Training Data Examples


Categories: airplane, table, rifle, sofa, bench, speaker, chair, car, cabinet, phone, lamp, display, watercraft

7 of 68

Methodology


8 of 68

Model

  • Learning Discrete Compact Representation of the SDFs with VQ-VAE
  • Autoregressive Generation with Transformer
  • Conditional Generation
    • Sequential Modeling
    • Random Order Modeling
    • Image conditioning
    • Texts (TBD)


9 of 68

Learning Discrete Representation of the SDFs with VQ-VAE


[Figure] Encoder: input SDF (D x H x W) → latent grid (Dz x Hz x Wz). Quantization: each latent vector is replaced by its nearest entry in a codebook of k codes (codebook look-up by index). Decoder: quantized grid (Dz x Hz x Wz) → reconstructed SDF (D x H x W).

10 of 68

Autoregressive Generation with Transformer


[Figure] The Encoder and Quantization turn the input SDF (D x H x W) into a grid of code indices; a Transformer predicts the next index from the previous ones (e.g. 0, 3, 7, 2 → 33), and the Decoder maps the completed grid back to a generated SDF (D x H x W).

11 of 68

Autoregressive Generation with Partial Input


[Figure] With a partial input, only some code indices are observed; the Transformer predicts the missing indices ('?') conditioned on the observed ones (e.g. 0, ?, ?, 2, ?), and the Decoder generates the full SDF (D x H x W).

12 of 68

Model - Summary

  • Learning Discrete Compact Representation of the SDFs with VQ-VAE
  • Autoregressive Generation with Transformer


13 of 68

VQ-VAE on ShapeNet (Chair)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

14 of 68

VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

15 of 68

VQ-VAE with Patch Input

  • Problem: the VQ-VAE encodes the whole shape jointly, so when only a partial input is given, the entries of the latent code grid for the missing regions are unknown


[Figure] A fully observed shape yields a complete grid of code indices, while a partial input leaves some entries of the grid unknown ('?').

  • We propose P-VQ-VAE to handle partial inputs

16 of 68

P-VQ-VAE

[Figure] P-VQ-VAE pipeline:
x (64x64x64) → unfold → x_c: an 8x8x8 grid of 8x8x8 patches (512x8x8x8) → Enc (each patch encoded independently) → z (512 x d x 1x1x1) → codebook look-up → z_q (512 x d x 1x1x1) → fold → latent grid (d x 8x8x8) → Dec (decoded jointly) → x_rec (1x64x64x64)
(k: codebook size, d: code dimension)
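A minimal sketch of this forward pass (patch-wise encoding, joint decoding); the encoder, codebook, and decoder modules and the exact reshaping here are illustrative assumptions, not the actual implementation:

    import torch

    def p_vqvae_forward(x, encoder, codebook, decoder, patch=8):
        # x: (B, 1, 64, 64, 64) truncated SDF
        B, _, D, H, W = x.shape
        g = D // patch                                   # 8x8x8 grid of patches
        # unfold into non-overlapping 8^3 patches -> (B*512, 1, 8, 8, 8)
        x_c = (x.unfold(2, patch, patch).unfold(3, patch, patch).unfold(4, patch, patch)
                 .reshape(B, 1, g * g * g, patch, patch, patch)
                 .permute(0, 2, 1, 3, 4, 5)
                 .reshape(-1, 1, patch, patch, patch))
        z = encoder(x_c)                                 # (B*512, d, 1, 1, 1): patches encoded independently
        z_q, indices = codebook(z)                       # nearest-codebook-entry look-up per patch
        d = z_q.shape[1]
        # fold the 512 codes back into a (d, 8, 8, 8) latent grid
        z_grid = z_q.reshape(B, g, g, g, d).permute(0, 4, 1, 2, 3)
        x_rec = decoder(z_grid)                          # (B, 1, 64, 64, 64): decoded jointly
        return x_rec, indices

Encoding each patch independently means an observed patch gets the same code whether or not the rest of the shape is present, which appears to be the motivation for handling partial inputs.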

17 of 68

P-VQ-VAE on ShapeNet (Chair)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

18 of 68

P-VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

19 of 68

Comparison on ShapeNet (Chair): P-VQ-VAE vs. VQ-VAE


VQ-VAE

P-VQ-VAE

Input

20 of 68

Comparison on ShapeNet (All): P-VQ-VAE vs. VQ-VAE


VQ-VAE

P-VQ-VAE

Input

21 of 68

Quantitative Results: IoU

Model               IoU
VQ-VAE (Chair)      0.8099
VQ-VAE (All)        0.8145
P-VQ-VAE (Chair)    0.7112
P-VQ-VAE (All)      0.7222

IoU for VQ-VAE and P-VQ-VAE

22 of 68

IoU Across All Categories

Category      VQ-VAE (All)   P-VQ-VAE (All)
All           0.8145         0.7222
watercraft    0.7483         0.6889
chair         0.8004         0.7244
lamp          0.6848         0.6092
cabinet       0.8884         0.8289
table         0.6866         0.5977
bench         0.8262         0.7604
display       0.8049         0.7368
speaker       0.7774         0.7234
phone         0.8100         0.7412
car           0.8979         0.8357
rifle         0.6209         0.5427
sofa          0.8423         0.7512
airplane      0.7753         0.6854

23 of 68

Autoregressive Generation


24 of 68

What we want

  • A model that can take tokens in a random order and generate the remaining tokens

Start Simple!!

  • Experiments with sequential modeling, to check and assess the performance of autoregressive models


25 of 68

What is Autoregressive Generation?

25

[Figure] The Transformer predicts each token x_i from the previously generated tokens x_<i (e.g. input indices 0, 3, 77, 2 → output 91), between the VQ-VAE Encoder and Decoder.
26 of 68

Sequential generation (Next token Prediction)


[Figure] Next-token prediction with an encoder-only Transformer (47.8M parameters, 12 encoder layers, 12 attention heads): each input token x_0 ... x_{n-1} is embedded via codebook look-up, added to its positional encoding, and the model predicts the shifted sequence x'_1 ... x'_n.
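A minimal sketch of sequential (next-token) sampling, assuming a Transformer `model` that maps a prefix of code indices to logits over the codebook; the function and argument names are illustrative, not the actual code:

    import torch

    @torch.no_grad()
    def generate_sequential(model, prefix, total_len=512, temperature=1.0):
        # prefix: (B, T0) long tensor of conditioning code indices
        tokens = prefix
        while tokens.shape[1] < total_len:
            logits = model(tokens)                                 # (B, T, codebook_size)
            probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)   # sample the next code index
            tokens = torch.cat([tokens, next_token], dim=1)
        return tokens                                              # (B, 512) indices for the VQ-VAE decoder

For the conditioning experiments on the next slides, `prefix` would be the first 100 of the 512 latent codes.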

27 of 68

Results on Sequential Generation


  • Using VQ-VAE (Only Chair)

(Conditioning) The first 100 elements of the latent code (100/512) are given as input

28 of 68

Results on Sequential Generation


Autoregressive vs. Random Selection

100 Original + 412 random

250 Original + 262 random

100 Original + 412 Autoregressive

  • Using VQ-VAE (Only Chair)

29 of 68

Results on Sequential Generation


  • Using VQ-VAE (All Categories)

30 of 68

Results on Sequential Unconditional Generation


  • Using VQ-VAE (All Categories)

31 of 68

Results on Sequential Unconditional Generation


  • Using P-VQ-VAE (All Categories)

32 of 68

Random Order Generation

  • A model that can take tokens in a random order and generate the remaining tokens


[Figure] The ordered token sequence x1 ... xk is permuted into a random order (e.g. x3, x9, x36, x91, x42, x2, ...).

33 of 68

Random Order Generation


[Figure] Encoder-decoder Transformer (63.8M parameters; 8 layers and 8 attention heads in the Encoder, 8 layers and 8 attention heads in the Decoder). The Encoder Network takes the observed tokens xa, xb, xc, each embedded via codebook look-up and added to its position encoding pa, pb, pc; the Decoder Network is queried with a target position pl and predicts the token x'l at that position.
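A minimal sketch of this random-order setup: the encoder sees the observed (token, position) pairs and the decoder is queried with target positions. All module choices and sizes here are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RandomOrderModel(nn.Module):
        def __init__(self, num_codes=512, seq_len=512, d_model=256):
            super().__init__()
            self.tok_emb = nn.Embedding(num_codes, d_model)    # codebook look-up
            self.pos_emb = nn.Embedding(seq_len, d_model)      # learned position encoding
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=8,
                num_encoder_layers=8, num_decoder_layers=8,
                batch_first=True)
            self.head = nn.Linear(d_model, num_codes)

        def forward(self, obs_tokens, obs_pos, query_pos):
            # obs_tokens, obs_pos: (B, T_obs); query_pos: (B, T_query)
            src = self.tok_emb(obs_tokens) + self.pos_emb(obs_pos)
            tgt = self.pos_emb(query_pos)                      # queries carry only position information
            out = self.transformer(src=src, tgt=tgt)
            return self.head(out)                              # (B, T_query, num_codes) logits

The observed/query split can be produced by permuting the 512 positions with torch.randperm, as in the snippet on the later random-order slide.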

34 of 68

Results on Random Order Generation


  • Using P-VQ-VAE

Context: 64

35 of 68

Results on Random Order Generation (Varying Context)

  • Using P-VQ-VAE (Only Chair)

Context: 0

Context: 32

Context: 64

36 of 68

Additional Results on Random Order Generation


[Figure] Recon. from P-VQ-VAE vs. Transformer Generation (two sets of examples).

37 of 68

Examples of a Visualization Application


Input from different views

Transformer Generation

Recon. from P-VQ-VAE

38 of 68

Image Conditional Generation


39 of 68

Image Conditional Generation


40 of 68

Image Conditional Generation - Model

[Figure] Image encoder for conditioning: a 256x256 rendering (randomly sampled from one view) is passed through ResNet-18 to give a c x 8 x 8 feature map, which a LinearLayer and a ConvT3D lift to a c x 8 x 8 x 8 volume and to (ntokens) x 8 x 8 x 8 conditioning features.
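A minimal sketch of such an image-conditioning branch; how the 2D features become a 3D grid (the LinearLayer / ConvT3D step) is an assumption based on the figure labels, not the actual implementation:

    import torch
    import torch.nn as nn
    import torchvision

    class ImageConditioner(nn.Module):
        def __init__(self, c=256, n_tokens=512):
            super().__init__()
            backbone = torchvision.models.resnet18(weights=None)
            self.features = nn.Sequential(*list(backbone.children())[:-2])      # -> (B, 512, 8, 8)
            self.proj = nn.Linear(512, c)                                        # per-location projection
            self.lift = nn.ConvTranspose3d(c, n_tokens, kernel_size=(8, 1, 1))   # depth 1 -> 8

        def forward(self, img):                        # img: (B, 3, 256, 256), one random view
            f = self.features(img)                     # (B, 512, 8, 8)
            f = self.proj(f.permute(0, 2, 3, 1))       # (B, 8, 8, c)
            f = f.permute(0, 3, 1, 2).unsqueeze(2)     # (B, c, 1, 8, 8)
            return self.lift(f)                        # (B, n_tokens, 8, 8, 8) conditioning features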

41 of 68

Image Conditional Generation

  • Data: rendered images from 3D-R2N2
  • Examples

42 of 68

Image Conditional Generation - Results

Image Condition

Transformer Generation

Image Condition

Transformer Generation

43 of 68

Image Conditional Generation - Results

Img Condition

Transformer Generation

44 of 68

Future Plan

  • Improve the random order generation across different categories
  • Experiment with conditional generation across different tasks (text, ...)
  • Add other datasets (e.g. ABC) and see if this helps improve the generation


45 of 68

Questions


46 of 68

Thanks


47 of 68

Appendix


48 of 68

Results on Autoregressive Generation

[Figure] IoU curve (AutoEncoder IoU shown for reference; values on the plot: 0.1937, 0.4328, 0.767, 0.81).

49 of 68

Note


50 of 68

Demo


51 of 68

VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

52 of 68

VQ-VAE: Architectures

  • Arch_v2

[Figure] Encoder: Conv → ResBlock → Down → ResBlock → Down → ResBlock → Down → ResBlock → Attn → ResBlock → Attn → ResBlock → Norm-Activ → Conv → Vector Quantizer

Decoder: Conv → ResBlock → Attn → ResBlock → ResBlock → Attn → Up → ResBlock → Up → ResBlock → Up → ResBlock → Norm-Activ → Conv

ResBlock: Norm → Activ → Conv → Norm → Activ → Drop → Conv

53 of 68

Results on Sequential Generation (Vary Context)


  • Using P-VQ-VAE (Only Chair)

Context: 0

Context: 50

Context: 100

54 of 68

IoU is not a good metric (Optional)


55 of 68

Image Conditional Generation - Experiment

Input

P-VQ-VAE Recon.

Output

56 of 68

Random Order Generation


[Figure] The ordered token sequence x1 ... xk is permuted into a random order (e.g. x3, x9, x36, x91, x42, x2, ...).

    # Permute the flattened sequence of 512 code indices into a random order
    random_positions = torch.randperm(512)
    x_permuted = x[random_positions]

57 of 68

Random Order Generation


[Figure] Encoder-decoder Transformer (8 layers and 12 attention heads in the Encoder, 8 layers and 12 attention heads in the Decoder). The Encoder Network takes the observed tokens xa, xb, xc with their position encodings pa, pb, pc (codebook look-up plus position encoding); the Decoder is queried with target positions pk, pl, pm and predicts the tokens x'k, x'l, x'm.

58 of 68


59 of 68

Objective: Generate 3D shapes from partial input

  • Leg of chair
  • Occlusion (back of chair)
  • Ambiguous input


60 of 68


61 of 68


*https://unsplash.com/photos/EPy0gBJzzZU

62 of 68


an armchair in the shape of an avocado

Image taken from the original paper [1]

Example taken from original Dall-E work [2]

Image is a visualization from ShapeNet [3]

63 of 68

Autoregressive models


64 of 68

What is Autoregressive Generation?


[Figure] The Transformer predicts each token x_i from the previous tokens x_<i.

65 of 68

VQ-VAE / VQ-VAE-2

VAEs regularize the latent space. VQ-VAEs propose to use discrete latent variables

  • Discrete representations are more common across different modalities
  • Discretization reduces the amount of information stored in the latent space; however, the most important information is retained


*https://arxiv.org/pdf/1906.00446.pdf

Image taken from the original paper*

66 of 68

VQ step

  • The encoder gives an output of dim (n, h, w, d); reshape this to (n*h*w, d)
  • For each vector, find the closest of the k codebook vectors (argmin over distances)
  • Reshape the resulting indices to (n, h, w, 1)
  • Replace the latent variables with the corresponding codebook entries, giving (n, h, w, d) (see the sketch below)
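A minimal sketch of this look-up step; shapes follow the slide (encoder output of dim (n, h, w, d), codebook of k vectors of dimension d), while the function itself is illustrative:

    import torch

    def vector_quantize(z_e, codebook):
        n, h, w, d = z_e.shape                         # encoder output
        flat = z_e.reshape(-1, d)                      # (n*h*w, d)
        dists = torch.cdist(flat, codebook)            # (n*h*w, k) distances to codebook vectors
        indices = dists.argmin(dim=1)                  # closest codebook entry per latent vector
        z_q = codebook[indices].reshape(n, h, w, d)    # replace latents with codebook entries
        return z_q, indices.reshape(n, h, w, 1)        # quantized latents and index map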


*https://arxiv.org/pdf/1711.00937.pdf

Image taken from the original paper*

67 of 68

Taming Transformers for High-Resolution Image Synthesis


*https://arxiv.org/pdf/2012.09841.pdf

Image taken from the original paper*

68 of 68

Taming Transformers for High-Resolution Image Synthesis


*https://arxiv.org/pdf/2012.09841.pdf

Image taken from the original paper*