1 of 58

3D Vision & 3D Reasoning

16-824 Visual Learning and Recognition

Spring 2019

Chen-Hsuan Lin

2 of 58

We've learned how various convolutional neural networks operate on 2D image data.

classification

object detection / localization

semantic segmentation

…

3 of 58

Our 3D world


How do we, as humans, perceive and understand the visual world?

4 of 58

Our 3D world

The world is not composed of a bunch of 2D pixels!

5 of 58

Our 3D world


https://www.turbosquid.com/3d-models/roman-colloseum-ruins-3d-model-1196429

We perceive the world in 2D -- but the physical world is in 3D.

6 of 58

Applications


autonomous driving

robot navigation

computer graphics

virtual reality

medical imaging

augmented reality

http://fortune.com/2015/12/04/2016-the-year-of-virtual-reality/

7 of 58

Applications


https://www.wired.com/story/facebook-oculus-codec-avatars-vr/

Example: VR avatars (from Facebook Reality Labs Pittsburgh)

8 of 58

How do we learn to perceive and understand the 3D physical world?

9 of 58


Traditional approaches -- multiple view geometry

  • Structure from motion (SfM)
  • Simultaneous localization and mapping (SLAM)
  • Multi-view stereo (MVS)

10 of 58


Given observations from multiple viewpoints,

Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.

11 of 58

we can solve for the 3D structure (and the camera matrices).

Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.

12 of 58


The key is to establish correspondences between observations.

Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.

13 of 58


Traditional approaches -- multiple view geometry

  • Structure from motion (SfM)
  • Simultaneous localization and mapping (SLAM)
  • Multi-view stereo (MVS)

Modern approaches -- learning (data-driven)

  • Deep neural networks

Not necessarily in conflict -- the two can complement each other!

14 of 58

Tasks

3D synthesis (generative):

  • 3D reconstruction
  • depth estimation
  • scene / shape completion
  • …

3D analysis (discriminative):

  • classification
  • scene / object part segmentation
  • 3D object localization
  • …

15 of 58

Datasets


scenes

objects

Chang et al. “ShapeNet: An Information-Rich 3D Model Repository.” arXiv 2015.

Dai et al. “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes.” CVPR 2017.


17 of 58

Datasets

scenes:

  • indoor scenes: NYU Depth v2, SUN RGB-D, ScanNet, Matterport3D, SUNCG, …
  • outdoor scenes (street scenes): KITTI, Cityscapes, TorontoCity, ApolloCar3D, …

objects: IKEA 3D models, PASCAL 3D+, ModelNet, ShapeNet, Pix3D, …

New 3D datasets are being constructed each year!

18 of 58

3D representations

voxels:

  • 3D analog of conventional CNNs
  • expensive (memory & computation)
  • limited resolution

point clouds:

  • compact, sparse representation
  • lack spatial structure

meshes:

  • structured with dense surfaces
  • irregular structures (graph CNN)
  • hard to analyze / synthesize


20 of 58

Volumetric prediction


Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.

3D convolution

3D transposed convolution

We can learn a shape embedding via a 3D voxel autoencoder.

loss: per-voxel binary classification
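A minimal sketch of such a voxel autoencoder in PyTorch -- layer sizes here are illustrative assumptions, not the paper's exact TL-embedding architecture:

```python
import torch
import torch.nn as nn

class VoxelAutoencoder(nn.Module):
    """Sketch of a 3D voxel autoencoder (hypothetical sizes)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        # encoder: 32^3 occupancy grid -> latent vector, via strided 3D convolutions
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 32^3 -> 16^3
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16^3 -> 8^3
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8^3 -> 4^3
            nn.Flatten(),
            nn.Linear(64 * 4 * 4 * 4, latent_dim),
        )
        # decoder: latent vector -> 32^3 occupancy logits, via 3D transposed convolutions
        self.fc = nn.Linear(latent_dim, 64 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 4^3 -> 8^3
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8^3 -> 16^3
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),              # 16^3 -> 32^3
        )

    def forward(self, vox):                       # vox: (B, 1, 32, 32, 32)
        z = self.encoder(vox)                     # the learned shape embedding
        out = self.fc(z).view(-1, 64, 4, 4, 4)
        return self.decoder(out)                  # per-voxel occupancy logits

# per-voxel binary classification loss
model = VoxelAutoencoder()
vox = (torch.rand(2, 1, 32, 32, 32) > 0.5).float()
loss = nn.BCEWithLogitsLoss()(model(vox), vox)
```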

21 of 58

Volumetric prediction

Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.

We can also learn image embeddings that match the shape embeddings.

22 of 58

Volumetric prediction


Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.

Then we can predict a 3D object shape from an image.

23 of 58

Multi-view 2D supervision


Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.

Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.

We’ve assumed we have ground-truth 3D shapes as supervision.

How about learning just from multiple 2D observations?

24 of 58

Multi-view 2D supervision


Key idea: 2D projections of 3D shapes should be consistent across viewpoints.

Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.

Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.
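A toy sketch of such a projection-consistency loss, under the simplifying assumption of orthographic cameras aligned with the volume axes (the papers above handle general perspective cameras with differentiable sampling; all names here are hypothetical):

```python
import torch

def project_silhouette(vox_prob, dim):
    """Differentiable orthographic silhouette: probability that a ray
    along axis `dim` hits at least one occupied voxel."""
    # vox_prob: (B, D, H, W) occupancy probabilities in [0, 1]
    return 1.0 - torch.prod(1.0 - vox_prob, dim=dim)

def multiview_loss(vox_prob, masks, dims):
    """Compare projections against observed 2D masks from several (axis-aligned) views."""
    loss = 0.0
    for mask, dim in zip(masks, dims):
        pred = project_silhouette(vox_prob, dim).clamp(1e-6, 1 - 1e-6)
        loss = loss + torch.nn.functional.binary_cross_entropy(pred, mask)
    return loss / len(masks)

# usage: 3 axis-aligned views of a 32^3 predicted volume
vox = torch.rand(1, 32, 32, 32, requires_grad=True)
masks = [torch.ones(1, 32, 32)] * 3
multiview_loss(vox, masks, dims=[1, 2, 3]).backward()
```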

25 of 58

Multi-view 2D supervision

(results: input 2D images → reconstructed 3D shapes, at 32 × 32 × 32 voxel resolution)

Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.

Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.

26 of 58

However, 3D CNNs for voxels are really expensive…

Think back to how long you suffered training AlexNet and VGG-16 in your assignments. Now imagine everything being 3D instead of 2D.

High complexity and low tractable data resolution. How could we possibly improve?

27 of 58

Octree-based 3D CNNs


Riegler et al. “OctNet: Learning Deep 3D Representations at High Resolutions.” CVPR 2017.

Tatarchenko et al. “Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs.” ICCV 2017.

Wang et al. “O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis.” SIGGRAPH 2017.

Key idea: hierarchically encode/decode granularity of occupied voxels.
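A toy sketch of the decode side of this idea (module names hypothetical): following the spirit of Octree Generating Networks, each cell is classified as empty / filled / mixed, and only "mixed" cells are subdivided and refined further.

```python
import torch
import torch.nn as nn

def decode_octree(features, predict, refine, depth=0, max_depth=3):
    """Sketch of octree-style decoding: subdivide only where needed."""
    # features: (num_cells, C); predict -> per-cell logits over {empty, filled, mixed}
    labels = predict(features).argmax(dim=-1)
    out = []
    for f, label in zip(features, labels):
        if label.item() == 2 and depth < max_depth:   # "mixed": split into 8 children
            children = refine(f).view(8, -1)          # (8, C) child features
            out += decode_octree(children, predict, refine, depth + 1, max_depth)
        else:
            out.append((depth, label.item()))         # leaf: empty (0) or filled (1)
    return out

C = 32
predict = nn.Linear(C, 3)       # empty / filled / mixed classifier
refine = nn.Linear(C, 8 * C)    # parent feature -> 8 child features
leaves = decode_octree(torch.randn(1, C), predict, refine)
```

Computation is spent only near occupied surfaces, which is why much higher resolutions become tractable.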

28 of 58

Octree-based 3D CNNs

Conventional 3D CNNs are tractable up to resolution 32³ -- with octrees, up to 256³.

Riegler et al. “OctNet: Learning Deep 3D Representations at High Resolutions.” CVPR 2017.

Tatarchenko et al. “Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs.” ICCV 2017.

Wang et al. “O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis.” SIGGRAPH 2017.

29 of 58

Even with octree structures, we're still limited to rasterized shapes…

Can we learn on 3D data with real-valued precision?

30 of 58

3D representations

point clouds: unordered sets of N points with 3 coordinates each (an N × 3 array)

  • compact, sparse representation
  • lack spatial structure

31 of 58

Learning on point clouds


Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.

We can pass each 3D point through a multi-layer perceptron and max-pool out a global high-level feature.
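A minimal PointNet-style classifier sketch (layer widths roughly follow the paper's common configuration; T-Nets and batch norm are omitted):

```python
import torch
import torch.nn as nn

class PointNetClassifier(nn.Module):
    """Minimal PointNet-style classifier (sketch, not the full architecture)."""
    def __init__(self, num_classes=40):
        super().__init__()
        # shared per-point MLP, implemented as 1x1 convolutions over the point axis
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pts):                           # pts: (B, N, 3)
        feat = self.point_mlp(pts.transpose(1, 2))    # (B, 1024, N) per-point features
        global_feat = feat.max(dim=2).values          # symmetric max-pool: order-invariant
        return self.head(global_feat)

logits = PointNetClassifier()(torch.randn(2, 1024, 3))  # works for any number of points
```

The max-pool is the key design choice: as a symmetric function, it makes the network invariant to the ordering of input points.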

32 of 58

Learning on point clouds


Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.

We can also use a hybrid feature (the global feature concatenated with per-point features) for point-level tasks.

33 of 58

Learning on point clouds


Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.

Additional spatial transformer modules (T-Nets) can optionally be included.

34 of 58

Learning on point clouds

Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.

(part segmentation results, on both complete and incomplete input point clouds)

35 of 58

Spatial structure in point clouds


Global pooling is a naïve way to obtain high-level features.

What have we learned from convolutional neural networks?

36 of 58

Hierarchical local features


Qi et al. “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.” NeurIPS 2017.

Key idea: group points in local neighborhoods together -- just like spatial pooling.

(requires search for nearest neighbors -- adds to time complexity.)
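A sketch of one such set-abstraction step, using k-nearest-neighbor grouping for simplicity (PointNet++ itself uses farthest-point sampling plus ball query, and applies a small PointNet to each group; all names here are hypothetical):

```python
import torch

def knn_group_pool(xyz, feats, centroids_idx, k=16):
    """Group each centroid's k nearest neighbors and max-pool their features."""
    # xyz: (B, N, 3) point coordinates; feats: (B, N, C) per-point features
    centers = xyz.gather(1, centroids_idx[..., None].expand(-1, -1, 3))  # (B, M, 3)
    dists = torch.cdist(centers, xyz)                                    # (B, M, N)
    nn_idx = dists.topk(k, largest=False).indices                        # (B, M, k)
    B, M, _ = nn_idx.shape
    C = feats.shape[-1]
    grouped = feats.gather(1, nn_idx.reshape(B, M * k, 1).expand(-1, -1, C))
    grouped = grouped.reshape(B, M, k, C)
    # max-pool within each local neighborhood -- analogous to spatial pooling
    return centers, grouped.max(dim=2).values  # (B, M, 3), (B, M, C)

# usage: pool features of 1024 points onto 128 sampled centroids
xyz, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 64)
idx = torch.randint(0, 1024, (2, 128))  # stand-in for farthest-point sampling
centers, local_feats = knn_group_pool(xyz, feats, idx)
```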

37 of 58

Hierarchical local features


Klokov et al. “Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models.” ICCV 2017.

Key idea: construct a K-d tree as the feature hierarchy.

(requires sorting of points -- adds to time complexity.)

38 of 58

Hierarchical local features


Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.

Key idea: construct dynamic graphs of spatial structures.

(requires search for nearest neighbors -- adds to time complexity.)
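A sketch of an EdgeConv-style layer (hypothetical helper; the paper's layers use shared MLPs with batch norm). The neighbor graph is recomputed from the current features at every layer, which is what makes the graph "dynamic":

```python
import torch
import torch.nn as nn

def edge_conv(feats, mlp, k=8):
    """EdgeConv sketch: edge features h([x_i, x_j - x_i]) max-aggregated over kNN."""
    # feats: (B, N, C); kNN graph built in the *current* feature space
    idx = torch.cdist(feats, feats).topk(k, largest=False).indices        # (B, N, k)
    B, N, C = feats.shape
    neighbors = feats.gather(1, idx.reshape(B, N * k, 1).expand(-1, -1, C))
    neighbors = neighbors.reshape(B, N, k, C)
    center = feats.unsqueeze(2).expand(-1, -1, k, -1)
    edge_feats = torch.cat([center, neighbors - center], dim=-1)          # (B, N, k, 2C)
    return mlp(edge_feats).max(dim=2).values                              # (B, N, C_out)

mlp = nn.Sequential(nn.Linear(2 * 64, 128), nn.ReLU())
out = edge_conv(torch.randn(2, 1024, 64), mlp)
```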

39 of 58

Hierarchical local features


Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.

distance in Euclidean space

40 of 58

Hierarchical local features


Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.

distance in feature space

41 of 58

Point cloud generation

A simple way to predict 3D shapes as point clouds.

Fan et al. “A Point Set Generation Network for 3D Object Reconstruction from a Single Image.” CVPR 2017.

Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.

loss: Chamfer distance (against the ground-truth point set)

42 of 58

Point cloud generation

Chamfer distance: from each point in one set, the distance to the nearest point in the other set, and vice versa.

Fan et al. “A Point Set Generation Network for 3D Object Reconstruction from a Single Image.” CVPR 2017.

Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.

$$ d_{\mathrm{CD}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2 $$
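A direct PyTorch sketch of this loss (averaging instead of summing over points, a common normalization choice):

```python
import torch

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between point sets s1: (B, N, 3), s2: (B, M, 3)."""
    d = torch.cdist(s1, s2) ** 2                # (B, N, M) squared pairwise distances
    return d.min(dim=2).values.mean(dim=1) + d.min(dim=1).values.mean(dim=1)

pred = torch.randn(4, 1024, 3, requires_grad=True)
gt = torch.randn(4, 2048, 3)
chamfer_distance(pred, gt).mean().backward()    # differentiable -> usable as a training loss
```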


44 of 58

3D representations

meshes: N × 3 vertices + M × 3 triangular faces (vertex indices) define the mesh topology

  • structured with dense surfaces
  • irregular structures (graph CNN)
  • hard to analyze / synthesize

45 of 58

Mesh generation


We can take advantage of continuity in MLPs to learn a surface manifold.

(fully-connected layers and ReLU are continuous functions)

Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.

loss: Chamfer distance (against the ground-truth point set)
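A sketch of the patch decoder idea (hypothetical sizes): because the MLP is a continuous function of its 2D input samples, nearby UV samples map to nearby 3D points, so densely sampling the UV square traces out a surface patch.

```python
import torch
import torch.nn as nn

class SurfacePatchDecoder(nn.Module):
    """AtlasNet-style patch decoder (sketch): map 2D UV samples, conditioned
    on a latent shape code, to points on a 3D surface patch."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3),  # continuous map: UV square -> 3D surface
        )

    def forward(self, uv, code):                   # uv: (B, N, 2), code: (B, latent_dim)
        code = code.unsqueeze(1).expand(-1, uv.shape[1], -1)
        return self.mlp(torch.cat([uv, code], dim=-1))   # (B, N, 3)

# sample the patch as densely as desired at test time
decoder = SurfacePatchDecoder()
points = decoder(torch.rand(2, 5000, 2), torch.randn(2, 128))
```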

46 of 58

Mesh generation


Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.

single-image 3D reconstruction

shape interpolation

47 of 58

Mesh generation


How about learning from image collections? No 3D supervision!

Kanazawa et al. “Learning Category-Specific Mesh Reconstruction from Image Collections.” ECCV 2018.

(known mesh topology)

48 of 58

Mesh generation


Kanazawa et al. “Learning Category-Specific Mesh Reconstruction from Image Collections.” ECCV 2018.

single-image 3D reconstruction

texture transfer

49 of 58

Learning on mesh manifolds

Meshes are a type of graph -- we can work in non-Euclidean domains.

Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.

spatial domain

spectral domain

50 of 58

Learning on mesh manifolds


Spatial domain: we can extend familiar operations to non-Euclidean graphs.

Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.

graph Laplacian

heat kernel descriptors

non-Euclidean convolution

51 of 58

Learning on mesh manifolds


Spectral domain: Fourier analysis on spatial signals.

Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.

The convolution theorem:
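In symbols (a standard result; the graph analog follows Bronstein et al.'s formulation, with the Laplacian eigenbasis playing the role of the Fourier basis):

```latex
% convolution theorem: convolution in space = multiplication in frequency
\mathcal{F}\{f * g\} \;=\; \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}
\qquad\Longrightarrow\qquad
f * g \;=\; \mathcal{F}^{-1}\!\big(\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\big)

% graph analog: with Laplacian eigendecomposition L = \Phi \Lambda \Phi^{\top},
% spectral filtering of a graph signal f by a filter \hat{g} is
f *_{\mathcal{G}} g \;=\; \Phi \,\operatorname{diag}(\hat{g})\, \Phi^{\top} f
```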

52 of 58

We've seen enough 3D objects for now…

Let’s look at some real-world scene data.

53 of 58

Indoor scenes


Semantic segmentation from RGB-D data.

Qi et al. “3D Graph Neural Networks for RGBD Semantic Segmentation.” ICCV 2017.

54 of 58

Indoor scenes


Scene completion + semantic segmentation from depth scans.

Song et al. “Semantic Scene Completion from a Single Depth Image.” CVPR 2017.

Dai et al. “ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans.” CVPR 2018.

55 of 58

Indoor scenes


Semantic SLAM with deep networks is also a new trend!

Tateno et al. “CNN-SLAM: Real-time Dense Monocular SLAM with Learned Depth Prediction.” CVPR 2017.

56 of 58

Outdoor scenes


Zhou et al. “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection.” CVPR 2018.

Qi et al. “Frustum PointNets for 3D Object Detection from RGB-D Data.” CVPR 2018.

3D object localization from LiDAR data. (LiDAR scans are point clouds!)

57 of 58

Outdoor scenes


Zhou et al. “Unsupervised Learning of Depth and Ego-Motion from Video.” CVPR 2017.

Mahjourian et al. “Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints.” CVPR 2018.

Finally, going unsupervised -- learning depth from RGB videos, supervised only by a photometric reprojection loss.
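A sketch of the core view-synthesis loss (simplified: the real systems add multi-scale losses, depth smoothness, and masking for occlusions and moving objects; all function names here are hypothetical):

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, pose, K):
    """Reproject target pixels into the source frame using predicted depth and
    ego-motion, warp the source image, and compare photometrically.
    target/source: (B,3,H,W), depth: (B,1,H,W), pose: (B,3,4) [R|t], K: (B,3,3)."""
    B, _, H, W = target.shape
    # pixel grid in homogeneous coordinates
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1).expand(B, -1, -1)
    # back-project to 3D, transform by ego-motion, project into the source camera
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)
    cam = pose[:, :, :3] @ cam + pose[:, :, 3:]
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # normalize to [-1, 1] for grid_sample and warp the source frame
    uv = uv.view(B, 2, H, W).permute(0, 2, 3, 1)
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1, uv[..., 1] / (H - 1) * 2 - 1], -1)
    warped = F.grid_sample(source, grid, align_corners=True)
    return (warped - target).abs().mean()   # L1 photometric error
```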
