3D Vision & 3D Reasoning
16-824 Visual Learning and Recognition
Spring 2019
Chen-Hsuan Lin
We’ve learned how various convolutional neural networks operate on 2D image data:
classification
object detection / localization
semantic segmentation
……
Our 3D world
How do we, as humans, perceive and understand the visual world?
Our 3D world
The world is not comprised of a bunch of 2D pixels!
Our 3D world
https://www.turbosquid.com/3d-models/roman-colloseum-ruins-3d-model-1196429
We perceive the world in 2D -- but the physical world is in 3D.
Applications
autonomous driving
robot navigation
computer graphics
virtual reality
medical imaging
augmented reality
http://fortune.com/2015/12/04/2016-the-year-of-virtual-reality/
Applications
https://www.wired.com/story/facebook-oculus-codec-avatars-vr/
Example: VR avatars (from Facebook Reality Labs Pittsburgh)
How do we learn to perceive and understand the 3D physical world?
Traditional approaches -- multiple view geometry
Given observations from multiple viewpoints,
Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.
we can solve for the 3D structure (and the camera matrices).
Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.
The key is to establish correspondences between observations.
Choi et al. “A Large Dataset of Object Scans.” arXiv 2016.
Traditional approaches -- multiple view geometry
Modern approaches -- learning (data-driven)
Not necessarily a conflict -- the two can complement each other!
Tasks
3D analysis (discriminative): classification, scene / object part segmentation, 3D object localization, ……
3D synthesis (generative): 3D reconstruction, depth estimation, scene / shape completion, ……
Datasets
objects -- Chang et al. “ShapeNet: An Information-Rich 3D Model Repository.” arXiv 2015.
scenes -- Dai et al. “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes.” CVPR 2017.
Datasets
scenes -- indoor scenes: NYU depth v2, SUN RGB-D, ScanNet, Matterport3D, SUNCG, ……
scenes -- outdoor scenes (street scenes): KITTI, CityScapes, TorontoCity, ApolloCar3D, ……
objects: IKEA 3D models, PASCAL 3D+, ModelNet, ShapeNet, Pix3D, ……
New 3D datasets are being constructed each year!
3D representations
voxels: 3D analog of conventional CNNs; expensive (memory & computation); limited resolution
point clouds: compact; sparse representation; lack spatial structure
meshes: structured with dense surfaces; irregular structures (graph CNN); hard to analyze / synthesize
Volumetric prediction
Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.
We can learn a shape embedding via a 3D voxel autoencoder: 3D convolutions in the encoder, 3D transposed convolutions in the decoder.
loss: per-voxel binary classification
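A minimal sketch of such a 3D voxel autoencoder in PyTorch (layer sizes and the latent dimension are illustrative assumptions, not the architecture from Girdhar et al.):

```python
import torch
import torch.nn as nn

class VoxelAutoencoder(nn.Module):
    """3D conv encoder -> latent shape embedding -> 3D transposed-conv decoder."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                                   # 1 x 32^3 -> latent
            nn.Conv3d(1, 16, 4, stride=2, padding=1), nn.ReLU(),        # -> 16^3
            nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),       # -> 8^3
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),       # -> 4^3
            nn.Flatten(), nn.Linear(64 * 4 ** 3, latent_dim),
        )
        self.decoder = nn.Sequential(                                   # latent -> 1 x 32^3 logits
            nn.Linear(latent_dim, 64 * 4 ** 3), nn.Unflatten(1, (64, 4, 4, 4)),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, voxels):                      # voxels: (B, 1, 32, 32, 32) occupancy in {0, 1}
        return self.decoder(self.encoder(voxels))

voxels = torch.rand(4, 1, 32, 32, 32).round()
logits = VoxelAutoencoder()(voxels)
# per-voxel binary classification loss
loss = nn.functional.binary_cross_entropy_with_logits(logits, voxels)
```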
Volumetric prediction
Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.
We can also learn image embeddings that match the shape embeddings.
Volumetric prediction
Girdhar et al. “Learning a Predictable and Generative Vector Representation for Objects.” ECCV 2016.
Then we can predict a 3D object shape from an image.
Multi-view 2D supervision
Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.
Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.
We’ve assumed we have ground-truth 3D shapes as supervision.
How about learning just from multiple 2D observations?
Multi-view 2D supervision
Key idea: 2D projections of 3D shapes should be consistent across viewpoints.
Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.
Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.
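A toy sketch of the 2D-supervision idea, under the simplifying assumptions of an axis-aligned orthographic view and a max-projection along the viewing axis (the cited papers instead use differentiable perspective projection / ray consistency over multiple calibrated views):

```python
import torch
import torch.nn.functional as F

def silhouette_loss(pred_voxels, gt_masks):
    """pred_voxels: (B, D, H, W) occupancy probabilities; gt_masks: (B, H, W) observed silhouettes."""
    projected = pred_voxels.max(dim=1).values                 # max over depth -> predicted silhouette
    return F.binary_cross_entropy(projected.clamp(1e-6, 1 - 1e-6), gt_masks)

loss = silhouette_loss(torch.rand(4, 32, 32, 32), torch.rand(4, 32, 32).round())
# With multiple viewpoints, the volume would first be resampled into each camera frame
# (the differentiable transformation handled by the cited papers), and the loss summed over views.
```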
Multi-view 2D supervision
(input 2D images and the reconstructed 3D shapes, at 32 × 32 × 32 voxel resolution)
Yan et al. “Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision.” NeurIPS 2016.
Tulsiani et al. “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency.” CVPR 2017.
However, 3D CNNs for voxels are really expensive……
Think back to how long you suffered training AlexNet and VGG-16 for your assignments. Now imagine everything being 3D instead of 2D.
Training has high complexity, and the tractable data resolution is low.
How could we possibly improve?
Octree-based 3D CNNs
Riegler et al. “OctNet: Learning Deep 3D Representations at High Resolutions.” CVPR 2017.
Tatarchenko et al. “Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs.” ICCV 2017.
Wang et al. “O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis.” SIGGRAPH 2017.
Key idea: hierarchically encode/decode granularity of occupied voxels.
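A rough sketch of the octree idea as a data structure (the learned per-cell prediction head is replaced here by a hypothetical `occupancy_fn`): cells that are clearly empty or clearly full stop early, and only mixed cells near the surface are subdivided further.

```python
import numpy as np

def octree_subdivide(occupancy_fn, center, size, depth, max_depth, cells):
    """Recursively refine only ambiguous cells. occupancy_fn(center, size) -> occupied fraction."""
    occ = occupancy_fn(center, size)
    if depth == max_depth or occ < 0.05 or occ > 0.95:
        cells.append((center, size, occ))          # leaf: one value covers the whole cell
        return
    half = size / 2                                 # mixed cell: split into 8 children and recurse
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child = center + half * np.array([dx, dy, dz])
                octree_subdivide(occupancy_fn, child, half, depth + 1, max_depth, cells)

def sphere_occupancy(center, size, n=64):           # stand-in predictor: fraction inside a unit sphere
    pts = center + (np.random.rand(n, 3) - 0.5) * size
    return float((np.linalg.norm(pts, axis=1) < 1.0).mean())

cells = []
octree_subdivide(sphere_occupancy, np.zeros(3), 4.0, 0, 6, cells)
# uniform interior/exterior regions stay coarse; only cells crossing the surface reach full depth
```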
Octree-based 3D CNNs
Conventional 3D CNNs are tractable up to a resolution of 32³. With octrees -- up to 256³.
Riegler et al. “OctNet: Learning Deep 3D Representations at High Resolutions.” CVPR 2017.
Tatarchenko et al. “Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs.” ICCV 2017.
Wang et al. “O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis.” SIGGRAPH 2017.
Even with octree structures, we’re still limited to rasterized shapes……
Can we learn on 3D data with real-valued precision?
3D representations
voxels: 3D analog of conventional CNNs; expensive (memory & computation); limited resolution
meshes: structured with dense surfaces; irregular structures (graph CNN); hard to analyze / synthesize
point clouds: compact; sparse representation; lack spatial structure
(a point cloud is an unordered set of N points, stored as an N × 3 array of coordinates)
Learning on point clouds
Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.
We can pass each 3D point through a multi-layer perceptron and max-pool out a global high-level feature.
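A minimal PointNet-style classifier sketch (no T-Net alignment, no batch norm; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class PointNetClassifier(nn.Module):
    """Shared per-point MLP followed by a symmetric max-pool over points."""
    def __init__(self, num_classes=40):
        super().__init__()
        # the shared MLP is applied independently to every point (implemented as 1D convs)
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):            # points: (B, N, 3)
        x = points.transpose(1, 2)        # (B, 3, N)
        x = self.point_mlp(x)             # (B, 1024, N) per-point features
        x = x.max(dim=2).values           # (B, 1024) order-invariant global feature
        return self.classifier(x)

logits = PointNetClassifier()(torch.rand(8, 1024, 3))   # 8 clouds of 1024 points each
```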
Learning on point clouds
Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.
We can also use a hybrid feature for point-level tasks.
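A sketch of the hybrid-feature idea for segmentation: tile the global max-pooled feature, concatenate it with each per-point feature, and run another shared MLP (dimensions here are illustrative, not the exact PointNet configuration):

```python
import torch
import torch.nn as nn

point_feat = torch.rand(8, 64, 1024)                        # (B, C, N) per-point features
global_feat = point_feat.max(dim=2, keepdim=True).values    # (B, C, 1) global feature
hybrid = torch.cat([point_feat, global_feat.expand(-1, -1, 1024)], dim=1)   # (B, 2C, N)

seg_head = nn.Sequential(nn.Conv1d(128, 128, 1), nn.ReLU(), nn.Conv1d(128, 50, 1))
per_point_logits = seg_head(hybrid)                         # (B, 50, N): one prediction per point
```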
Learning on point clouds
Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.
Additional spatial transformer (T-Net) modules can optionally be included to align the input points and intermediate features.
Learning on point clouds
Qi et al. “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.” CVPR 2017.
(shape part segmentation results, on complete and on incomplete input point clouds)
Spatial structure in point clouds
Global pooling is a naïve way to obtain high-level features.
What have we learned from convolutional neural networks?
Hierarchical local features
Qi et al. “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.” NeurIPS 2017.
Key idea: group points in local neighborhoods together -- just like spatial pooling.
(requires search for nearest neighbors -- adds to time complexity.)
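A rough sketch of one such set-abstraction step, under simplifying assumptions: farthest-point sampling is replaced by random centroid sampling, and grouping is a plain k-nearest-neighbor query; a small shared MLP is then max-pooled within each local group.

```python
import torch

def group_and_pool(points, feats, num_centroids=128, k=16, mlp=None):
    """points: (B, N, 3); feats: (B, N, C) -> centroids (B, M, 3), pooled local features (B, M, C')."""
    B, N, _ = points.shape
    idx = torch.randint(N, (B, num_centroids))                          # random stand-in for FPS
    centroids = torch.gather(points, 1, idx[..., None].expand(-1, -1, 3))
    dists = torch.cdist(centroids, points)                              # (B, M, N)
    knn = dists.topk(k, largest=False).indices                          # (B, M, k) local neighborhoods
    gathered = torch.gather(feats[:, None].expand(-1, num_centroids, -1, -1),
                            2, knn[..., None].expand(-1, -1, -1, feats.shape[-1]))
    local = mlp(gathered) if mlp is not None else gathered              # per-point MLP within groups
    return centroids, local.max(dim=2).values                           # pool within each group

pts = torch.rand(2, 2048, 3)
centroids, local_feats = group_and_pool(pts, pts)    # using raw coordinates as the initial features
```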
Hierarchical local features
Klokov et al. “Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models.” ICCV 2017.
Key idea: construct a K-d tree as the feature hierarchy.
(requires sorting of points -- adds to time complexity.)
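A tiny sketch of the kd-tree hierarchy itself (only the tree construction; the learned bottom-up feature aggregation of Kd-Networks is omitted):

```python
import numpy as np

def build_kdtree(points, leaf_size=1):
    """Recursively split the point set along the axis of largest spread."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = int(np.argmax(points.max(0) - points.min(0)))   # widest axis
    order = np.argsort(points[:, axis])                     # the sorting step noted above
    mid = len(points) // 2
    return {"axis": axis,
            "left": build_kdtree(points[order[:mid]], leaf_size),
            "right": build_kdtree(points[order[mid:]], leaf_size)}

tree = build_kdtree(np.random.rand(1024, 3))
```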
Hierarchical local features
Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.
Key idea: construct dynamic graphs of spatial structures.
(requires search for nearest neighbors -- adds to time complexity.)
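A minimal EdgeConv-style layer sketch (layer sizes are illustrative): recompute the k nearest neighbors from the current features (so the graph is dynamic and, in deeper layers, defined in feature space rather than Euclidean space), form edge features [x_i, x_j − x_i], apply a shared MLP, and max over neighbors.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                                     # x: (B, N, C)
        # k nearest neighbors in the current feature space (drop index 0 = the point itself)
        knn = torch.cdist(x, x).topk(self.k + 1, largest=False).indices[..., 1:]     # (B, N, k)
        neighbors = torch.gather(x[:, None].expand(-1, x.shape[1], -1, -1), 2,
                                 knn[..., None].expand(-1, -1, -1, x.shape[-1]))     # (B, N, k, C)
        center = x[:, :, None].expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)                       # (B, N, k, 2C)
        return self.mlp(edge).max(dim=2).values                                      # (B, N, out_dim)

out = EdgeConv(3, 64)(torch.rand(2, 1024, 3))
```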
Hierarchical local features
Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.
distance in Euclidean space
Hierarchical local features
Wang et al. “Dynamic Graph CNN for Learning on Point Clouds.” arXiv 2018.
distance in feature space
Point cloud generation
A simple way to predict 3D shapes as point clouds.
Fan et al. “A Point Set Generation Network for 3D Object Reconstruction from a Single Image.” CVPR 2017.
Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.
loss: Chamfer distance
ground truth
Point cloud generation
Chamfer distance: for each point in one set, the distance to the nearest point in the other set, accumulated in both directions.
Fan et al. “A Point Set Generation Network for 3D Object Reconstruction from a Single Image.” CVPR 2017.
Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.
$d_{\mathrm{CD}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 \;+\; \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2$
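A direct (quadratic-memory) implementation of this loss as a sketch; practical pipelines typically use custom kernels for large point sets.

```python
import torch

def chamfer_distance(s1, s2):
    """s1: (B, N, 3), s2: (B, M, 3) -> Chamfer distance averaged over the batch."""
    d = torch.cdist(s1, s2) ** 2                   # (B, N, M) squared pairwise distances
    term1 = d.min(dim=2).values.mean(dim=1)        # each point of s1 -> nearest point of s2
    term2 = d.min(dim=1).values.mean(dim=1)        # each point of s2 -> nearest point of s1
    return (term1 + term2).mean()                  # mean over points/batch (sums in the formula above)

loss = chamfer_distance(torch.rand(4, 1024, 3), torch.rand(4, 1024, 3))
```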
3D representations
meshes: structured with dense surfaces; irregular structures (graph CNN); hard to analyze / synthesize
mesh topology (triangular): vertices stored as an N × 3 array of 3D coordinates, faces as an M × 3 array of vertex indices
Mesh generation
We can take advantage of continuity in MLPs to learn a surface manifold.
(fully-connected layers and ReLU are continuous functions)
Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.
loss: Chamfer distance
ground truth
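A rough AtlasNet-style decoder sketch (layer sizes are illustrative): an MLP maps 2D points sampled from the unit square, concatenated with a shape latent code, to 3D surface points; because the MLP is continuous, the sampled 2D grid induces a continuous surface patch that can be triangulated into a mesh.

```python
import torch
import torch.nn as nn

class SurfacePatchDecoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3),                    # output: a 3D point on the predicted surface
        )

    def forward(self, uv, latent):                # uv: (B, N, 2), latent: (B, latent_dim)
        z = latent[:, None].expand(-1, uv.shape[1], -1)
        return self.mlp(torch.cat([uv, z], dim=-1))     # (B, N, 3)

decoder = SurfacePatchDecoder()
uv = torch.rand(4, 2500, 2)                       # sample 2D points from the unit square
points = decoder(uv, torch.rand(4, 128))          # trained with Chamfer distance to ground truth
```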
Mesh generation
Groueix et al. “AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.” CVPR 2018.
single-image 3D reconstruction
shape interpolation
Mesh generation
How about learning from image collections? No 3D supervision!
Kanazawa et al. “Learning Category-Specific Mesh Reconstruction from Image Collections.” ECCV 2018.
(known mesh topology)
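A sketch of just the deformable-template part of this idea (not the full model, which also predicts texture and camera pose): with a fixed mesh topology, the network only has to predict per-vertex offsets from a learned mean shape, and supervision comes from 2D cues such as silhouettes and keypoints rather than 3D labels. The vertex count, feature dimension, and head below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemplateDeformer(nn.Module):
    def __init__(self, num_vertices=642, feat_dim=256):
        super().__init__()
        self.mean_shape = nn.Parameter(torch.rand(num_vertices, 3))   # learned template vertices
        self.offset_head = nn.Linear(feat_dim, num_vertices * 3)      # image feature -> vertex offsets

    def forward(self, image_feature):                                  # (B, feat_dim)
        offsets = self.offset_head(image_feature).view(-1, self.mean_shape.shape[0], 3)
        return self.mean_shape + offsets                               # (B, V, 3) deformed vertices

verts = TemplateDeformer()(torch.rand(2, 256))
```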
Mesh generation
Kanazawa et al. “Learning Category-Specific Mesh Reconstruction from Image Collections.” ECCV 2018.
single-image 3D reconstruction
texture transfer
Learning on mesh manifolds
Meshes are one type of graph -- we can work in non-Euclidean domains.
Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.
spatial domain
spectral domain
Learning on mesh manifolds
Spatial domain: we can extend familiar operations to non-Euclidean graphs.
Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.
graph Laplacian
heat kernel descriptors
non-Euclidean convolution
Learning on mesh manifolds
Spectral domain: Fourier analysis on spatial signals.
Bronstein et al. “Geometric Deep Learning: Going Beyond Euclidean Data.” arXiv 2016.
The convolution theorem: convolution in the spatial domain corresponds to pointwise multiplication in the spectral domain.
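As a standard reminder (the textbook statement, not anything specific to the cited survey): writing the graph Laplacian eigendecomposition as $L = U \Lambda U^\top$, the graph Fourier transform of a signal $f$ is $\hat{f} = U^\top f$, and spectral convolution of $f$ with a filter $g$ is defined as

$$ f \ast_G g \;=\; U \big( (U^\top g) \odot (U^\top f) \big), $$

i.e. transform to the spectral domain, multiply pointwise, and transform back.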
We’ve seen enough 3D objects for now……
Let’s look at some real-world scene data.
Indoor scenes
Semantic segmentation from RGB-D data.
Qi et al. “3D Graph Neural Networks for RGBD Semantic Segmentation.” ICCV 2017.
Indoor scenes
Scene completion + semantic segmentation from depth scans.
Song et al. “Semantic Scene Completion from a Single Depth Image.” CVPR 2017.
Dai et al. “ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans.” CVPR 2018.
Indoor scenes
Semantic SLAM with deep networks is also a new trend!
Tateno et al. “CNN-SLAM: Real-time Dense Monocular SLAM with Learned Depth Prediction.” CVPR 2017.
Outdoor scenes
Zhou et al. “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection.” CVPR 2018.
Qi et al. “Frustum PointNets for 3D Object Detection from RGB-D Data.” CVPR 2018.
3D object localization from LiDAR data. (They are point clouds!)
Outdoor scenes
Zhou et al. “Unsupervised Learning of Depth and Ego-Motion from Video.” CVPR 2017.
Mahjourian et al. “Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints.” CVPR 2018.
Finally, going unsupervised -- learning depth from RGB videos.
loss: photometric error between the target frame and the source frame warped using the predicted depth and camera motion
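A heavily simplified sketch of this view-synthesis supervision (assuming known intrinsics K and ignoring multi-scale losses, explainability masking, and smoothness terms): the predicted depth and relative pose warp the source frame into the target view, and the photometric difference is the training loss.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, pose, K):
    """target, source: (B, 3, H, W); depth: (B, 1, H, W); pose: (B, 4, 4); K: (B, 3, 3)."""
    B, _, H, W = target.shape
    # back-project target pixels to 3D using the predicted depth
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1).to(target)
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)                  # (B, 3, H*W)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=target.device)], 1)
    # transform into the source camera frame and project with K
    src = K @ (pose @ cam_h)[:, :3]                                       # (B, 3, H*W)
    uv = src[:, :2] / src[:, 2:].clamp(min=1e-6)
    # normalize to [-1, 1] and sample the source image at the projected locations
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], -1).view(B, H, W, 2)
    warped = F.grid_sample(source, grid, align_corners=True)
    return (warped - target).abs().mean()                                 # L1 photometric error
```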