1 of 68


Autoregressive Conditional Generation using Transformers

Yen-Chi Cheng, Paritosh Mittal

Maneesh K. Singh, Shubham Tulsiani

2 of 68

Outline

  • Recap
  • Methodology
  • Future Plan


3 of 68

Recap


4 of 68

Problem Formulation

  • Given conditional inputs, we aim to generate full 3D shapes with high quality and diversity


[Figure] Conditional Inputs → Full 3D Shapes

e.g.

  • A leg of a chair
  • An image from a certain view of the chair
  • Text description
  • ...

5 of 68

Recap of SDF and T-SDF


  • An SDF is a function that takes a 3D point (x, y, z) as input and returns the signed distance to the closest surface
  • T-SDF: SDF values truncated at a threshold
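A minimal sketch of the truncation step, assuming the T value of 0.2 listed on the VQ-VAE slides is the truncation threshold (the exact implementation may differ):

    import torch

    def truncate_sdf(sdf: torch.Tensor, threshold: float = 0.2) -> torch.Tensor:
        # Clamp signed distances to [-threshold, threshold] to obtain the T-SDF
        return sdf.clamp(min=-threshold, max=threshold)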

6 of 68

ShapeNet: Training Data Examples


Categories: airplane, table, rifle, sofa, bench, speaker, chair, car, cabinet, phone, lamp, display, watercraft

7 of 68

Methodology


8 of 68

Model

  • Learning Discrete Compact Representation of the SDFs with VQ-VAE
  • Autoregressive Generation with Transformer
  • Conditional Generation
    • Sequential Modeling
    • Random Order Modeling
    • Image conditioning
    • Texts (TBD)


9 of 68

Learning Discrete Representation of the SDFs with VQ-VAE


[Figure] Encoder: input SDF (D x H x W) → latent grid (Dz x Hz x Wz). Quantization: each latent vector is replaced by its nearest entry in a codebook of k codes (codebook look-up by index). Decoder: quantized grid (Dz x Hz x Wz) → reconstructed SDF (D x H x W).

10 of 68

Autoregressive Generation with Transformer


[Figure] The Encoder and Quantization turn the input SDF (D x H x W) into a grid of code indices; a Transformer predicts the next index from the previous ones (e.g. 0, 3, 7, 2 → 33), and the Decoder maps the completed grid back to a generated SDF (D x H x W).

11 of 68

Autoregressive Generation with Partial Input


[Figure] With a partial input, only some code indices are observed; the Transformer predicts the missing indices ('?') conditioned on the observed ones (e.g. 0, ?, ?, 2, ?), and the Decoder generates the full SDF (D x H x W).

12 of 68

Model - Summary

  • Learning Discrete Compact Representation of the SDFs with VQ-VAE
  • Autoregressive Generation with Transformer


13 of 68

VQ-VAE on ShapeNet (Chair)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

14 of 68

VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

15 of 68

VQ-VAE with Patch Input

  • Problem: the VQ-VAE encodes the whole shape jointly, so when only a partial input is given, the entries of the latent code grid for the missing regions are unknown


[Figure] A fully observed shape yields a complete grid of code indices, while a partial input leaves some entries of the grid unknown ('?').

  • We propose P-VQ-VAE to handle partial inputs

16 of 68

P-VQ-VAE

[Figure] P-VQ-VAE pipeline:
x (64x64x64) → unfold → x_c: an 8x8x8 grid of 8x8x8 patches (512x8x8x8) → Enc (each patch encoded independently) → z (512 x d x 1x1x1) → codebook look-up → z_q (512 x d x 1x1x1) → fold → latent grid (d x 8x8x8) → Dec (decoded jointly) → x_rec (1x64x64x64)
(k: codebook size, d: code dimension)
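A minimal sketch of this forward pass (patch-wise encoding, joint decoding); the encoder, codebook, and decoder modules and the exact reshaping here are illustrative assumptions, not the actual implementation:

    import torch

    def p_vqvae_forward(x, encoder, codebook, decoder, patch=8):
        # x: (B, 1, 64, 64, 64) truncated SDF
        B, _, D, H, W = x.shape
        g = D // patch                                   # 8x8x8 grid of patches
        # unfold into non-overlapping 8^3 patches -> (B*512, 1, 8, 8, 8)
        x_c = (x.unfold(2, patch, patch).unfold(3, patch, patch).unfold(4, patch, patch)
                 .reshape(B, 1, g * g * g, patch, patch, patch)
                 .permute(0, 2, 1, 3, 4, 5)
                 .reshape(-1, 1, patch, patch, patch))
        z = encoder(x_c)                                 # (B*512, d, 1, 1, 1): patches encoded independently
        z_q, indices = codebook(z)                       # nearest-codebook-entry look-up per patch
        d = z_q.shape[1]
        # fold the 512 codes back into a (d, 8, 8, 8) latent grid
        z_grid = z_q.reshape(B, g, g, g, d).permute(0, 4, 1, 2, 3)
        x_rec = decoder(z_grid)                          # (B, 1, 64, 64, 64): decoded jointly
        return x_rec, indices

Encoding each patch independently means an observed patch gets the same code whether or not the rest of the shape is present, which appears to be the motivation for handling partial inputs.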

17 of 68

P-VQ-VAE on ShapeNet (Chair)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

18 of 68

P-VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

19 of 68

Comparison on ShapeNet (Chair): P-VQ-VAE vs. VQ-VAE


VQ-VAE

P-VQ-VAE

Input

20 of 68

Comparison on ShapeNet (All): P-VQ-VAE vs. VQ-VAE


VQ-VAE

P-VQ-VAE

Input

21 of 68

Quantitative Results: IoU

Model               IoU
VQ-VAE (Chair)      0.8099
VQ-VAE (All)        0.8145
P-VQ-VAE (Chair)    0.7112
P-VQ-VAE (All)      0.7222

IoU for VQ-VAE and P-VQ-VAE

22 of 68

IoU Across All Categories

Category      VQ-VAE (All)   P-VQ-VAE (All)
All           0.8145         0.7222
watercraft    0.7483         0.6889
chair         0.8004         0.7244
lamp          0.6848         0.6092
cabinet       0.8884         0.8289
table         0.6866         0.5977
bench         0.8262         0.7604
display       0.8049         0.7368
speaker       0.7774         0.7234
phone         0.8100         0.7412
car           0.8979         0.8357
rifle         0.6209         0.5427
sofa          0.8423         0.7512
airplane      0.7753         0.6854

23 of 68

Autoregressive Generation


24 of 68

What we want

  • A model that can take tokens in a random order and generate the remaining tokens

Start Simple!!

  • Experiments with sequential modeling, to check and assess the performance of autoregressive models


25 of 68

What is Autoregressive Generation?

25

[Figure] The Transformer predicts each token x_i from the previously generated tokens x_<i (e.g. input indices 0, 3, 77, 2 → output 91), between the VQ-VAE Encoder and Decoder.
26 of 68

Sequential generation (Next token Prediction)


[Figure] Next-token prediction with an encoder-only Transformer (47.8M parameters, 12 encoder layers, 12 attention heads): each input token x_0 ... x_{n-1} is embedded via codebook look-up, added to its positional encoding, and the model predicts the shifted sequence x'_1 ... x'_n.
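A minimal sketch of sequential (next-token) sampling, assuming a Transformer `model` that maps a prefix of code indices to logits over the codebook; the function and argument names are illustrative, not the actual code:

    import torch

    @torch.no_grad()
    def generate_sequential(model, prefix, total_len=512, temperature=1.0):
        # prefix: (B, T0) long tensor of conditioning code indices
        tokens = prefix
        while tokens.shape[1] < total_len:
            logits = model(tokens)                                 # (B, T, codebook_size)
            probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)   # sample the next code index
            tokens = torch.cat([tokens, next_token], dim=1)
        return tokens                                              # (B, 512) indices for the VQ-VAE decoder

For the conditioning experiments on the next slides, `prefix` would be the first 100 of the 512 latent codes.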

27 of 68

Results on Sequential Generation


  • Using VQ-VAE (Only Chair)

(Conditioning) The first 100 elements of the latent code (100/512) are given as input

28 of 68

Results on Sequential Generation


Autoregressive vs. Random Selection

100 Original + 412 random

250 Original + 262 random

100 Original + 412 Autoregressive

  • Using VQ-VAE (Only Chair)

29 of 68

Results on Sequential Generation


  • Using VQ-VAE (All Categories)

30 of 68

Results on Sequential Unconditional Generation


  • Using VQ-VAE (All Categories)

31 of 68

Results on Sequential Unconditional Generation


  • Using P-VQ-VAE (All Categories)

32 of 68

Random Order Generation

  • A model that can take tokens in a random order and generate the remaining tokens


[Figure] The ordered token sequence x1 ... xk is permuted into a random order (e.g. x3, x9, x36, x91, x42, x2, ...).

33 of 68

Random Order Generation


[Figure] Encoder-decoder Transformer (63.8M parameters; 8 layers and 8 attention heads in the Encoder, 8 layers and 8 attention heads in the Decoder). The Encoder Network takes the observed tokens xa, xb, xc, each embedded via codebook look-up and added to its position encoding pa, pb, pc; the Decoder Network is queried with a target position pl and predicts the token x'l at that position.
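A minimal sketch of this random-order setup: the encoder sees the observed (token, position) pairs and the decoder is queried with target positions. All module choices and sizes here are illustrative assumptions:

    import torch
    import torch.nn as nn

    class RandomOrderModel(nn.Module):
        def __init__(self, num_codes=512, seq_len=512, d_model=256):
            super().__init__()
            self.tok_emb = nn.Embedding(num_codes, d_model)    # codebook look-up
            self.pos_emb = nn.Embedding(seq_len, d_model)      # learned position encoding
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=8,
                num_encoder_layers=8, num_decoder_layers=8,
                batch_first=True)
            self.head = nn.Linear(d_model, num_codes)

        def forward(self, obs_tokens, obs_pos, query_pos):
            # obs_tokens, obs_pos: (B, T_obs); query_pos: (B, T_query)
            src = self.tok_emb(obs_tokens) + self.pos_emb(obs_pos)
            tgt = self.pos_emb(query_pos)                      # queries carry only position information
            out = self.transformer(src=src, tgt=tgt)
            return self.head(out)                              # (B, T_query, num_codes) logits

The observed/query split can be produced by permuting the 512 positions with torch.randperm, as in the snippet on the later random-order slide.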

34 of 68

Results on Random Order Generation


  • Using P-VQ-VAE

Context: 64

35 of 68

Results on Random Order Generation (Varying Context)

  • Using P-VQ-VAE (Only Chair)

Context: 0

Context: 32

Context: 64

36 of 68

Additional Results on Random Order Generation


[Figure] Recon. from P-VQ-VAE vs. Transformer Generation (two sets of examples).

37 of 68

Examples of a Visualization Application


Input from different views

Transformer Generation

Recon. from P-VQ-VAE

38 of 68

Image Conditional Generation


39 of 68

Image Conditional Generation


40 of 68

Image Conditional Generation - Model

[Figure] Image encoder for conditioning: a 256x256 rendering (randomly sampled from one view) is passed through ResNet-18 to give a c x 8 x 8 feature map, which a LinearLayer and a ConvT3D lift to a c x 8 x 8 x 8 volume and to (ntokens) x 8 x 8 x 8 conditioning features.
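A minimal sketch of such an image-conditioning branch; how the 2D features become a 3D grid (the LinearLayer / ConvT3D step) is an assumption based on the figure labels, not the actual implementation:

    import torch
    import torch.nn as nn
    import torchvision

    class ImageConditioner(nn.Module):
        def __init__(self, c=256, n_tokens=512):
            super().__init__()
            backbone = torchvision.models.resnet18(weights=None)
            self.features = nn.Sequential(*list(backbone.children())[:-2])      # -> (B, 512, 8, 8)
            self.proj = nn.Linear(512, c)                                        # per-location projection
            self.lift = nn.ConvTranspose3d(c, n_tokens, kernel_size=(8, 1, 1))   # depth 1 -> 8

        def forward(self, img):                        # img: (B, 3, 256, 256), one random view
            f = self.features(img)                     # (B, 512, 8, 8)
            f = self.proj(f.permute(0, 2, 3, 1))       # (B, 8, 8, c)
            f = f.permute(0, 3, 1, 2).unsqueeze(2)     # (B, c, 1, 8, 8)
            return self.lift(f)                        # (B, n_tokens, 8, 8, 8) conditioning features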

41 of 68

Image Conditional Generation

  • Data: rendered images from 3D-R2N2
  • Examples

42 of 68

Image Conditional Generation - Results

Image Condition

Transformer Generation

Image Condition

Transformer Generation

43 of 68

Image Conditional Generation - Results

Img Condition

Transformer Generation

44 of 68

Future Plan

  • Improve the random order generation across different categories
  • Experiment with conditional generation across different tasks (text, ...)
  • Add other datasets (e.g. ABC) and see if this helps improve the generation


45 of 68

Questions


46 of 68

Thanks


47 of 68

Appendix


48 of 68

Results on Autoregressive Generation

[Figure] IoU curve (AutoEncoder IoU shown for reference; values on the plot: 0.1937, 0.4328, 0.767, 0.81).

49 of 68

Note


50 of 68

Demo


51 of 68

VQ-VAE on ShapeNet (All)


Input

Reconstruction

k: 512, d: 256, T: 0.2, z_dim: 256x8x8x8

52 of 68

VQ-VAE: Architectures

  • Arch_v2

[Figure] Encoder: Conv → ResBlock → Down → ResBlock → Down → ResBlock → Down → ResBlock → Attn → ResBlock → Attn → ResBlock → Norm-Activ → Conv → Vector Quantizer

Decoder: Conv → ResBlock → Attn → ResBlock → ResBlock → Attn → Up → ResBlock → Up → ResBlock → Up → ResBlock → Norm-Activ → Conv

ResBlock: Norm → Activ → Conv → Norm → Activ → Drop → Conv

53 of 68

Results on Sequential Generation (Vary Context)


  • Using P-VQ-VAE (Only Chair)

Context: 0

Context: 50

Context: 100

54 of 68

IoU is not a good metric (Optional)


55 of 68

Image Conditional Generation - Experiment

Input

P-VQ-VAE Recon.

Output

56 of 68

Random Order Generation


[Figure] The ordered token sequence x1 ... xk is permuted into a random order (e.g. x3, x9, x36, x91, x42, x2, ...).

    # Permute the flattened sequence of 512 code indices into a random order
    random_positions = torch.randperm(512)
    x_permuted = x[random_positions]

57 of 68

Random Order Generation


[Figure] Encoder-decoder Transformer (8 layers and 12 attention heads in the Encoder, 8 layers and 12 attention heads in the Decoder). The Encoder Network takes the observed tokens xa, xb, xc with their position encodings pa, pb, pc (codebook look-up plus position encoding); the Decoder is queried with target positions pk, pl, pm and predicts the tokens x'k, x'l, x'm.

58 of 68


59 of 68

Objective: Generate 3D shapes from partial input

  • Leg of chair
  • Occlusion (back of chair)
  • Ambiguous input


60 of 68


61 of 68


*https://unsplash.com/photos/EPy0gBJzzZU

62 of 68


an armchair in the shape of an avocado

Image taken from the original paper [1]

Example taken from original Dall-E work [2]

Image is a visualization from ShapeNet [3]

63 of 68

Autoregressive models


64 of 68

What is Autoregressive Generation?


[Figure] The Transformer predicts each token x_i from the previous tokens x_<i.

65 of 68

VQ-VAE / VQ-VAE-2

VAEs regularize the latent space. VQ-VAEs propose to use discrete latent variables

  • Discrete representations are more common across different modalities
  • Discretization reduces the amount of information stored in the latent space; however, the most important information is retained


*https://arxiv.org/pdf/1906.00446.pdf

Image taken from the original paper*

66 of 68

VQ step

  • The encoder gives an output of dim (n, h, w, d); reshape this to (n*h*w, d)
  • For each vector, find the closest of the k codebook vectors (argmin over distances)
  • Reshape the resulting indices to (n, h, w, 1)
  • Replace the latent variables with the corresponding codebook entries, giving (n, h, w, d) (see the sketch below)
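A minimal sketch of this look-up step; shapes follow the slide (encoder output of dim (n, h, w, d), codebook of k vectors of dimension d), while the function itself is illustrative:

    import torch

    def vector_quantize(z_e, codebook):
        n, h, w, d = z_e.shape                         # encoder output
        flat = z_e.reshape(-1, d)                      # (n*h*w, d)
        dists = torch.cdist(flat, codebook)            # (n*h*w, k) distances to codebook vectors
        indices = dists.argmin(dim=1)                  # closest codebook entry per latent vector
        z_q = codebook[indices].reshape(n, h, w, d)    # replace latents with codebook entries
        return z_q, indices.reshape(n, h, w, 1)        # quantized latents and index map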


*https://arxiv.org/pdf/1711.00937.pdf

Image taken from the original paper*

67 of 68

Taming Transformers for High-Resolution Image Synthesis


*https://arxiv.org/pdf/2012.09841.pdf

Image taken from the original paper*

68 of 68

Taming Transformers for High-Resolution Image Synthesis


*https://arxiv.org/pdf/2012.09841.pdf

Image taken from the original paper*