Learning for Structure from Motion
1
Shubham
Tzu-Hsuan (Betty)
Moneish
Structure from Motion (SfM)
2
Set of images
3D structure of an object or scene
Images: 3dvision.princeton.edu/courses/SFMedu/
Applications
3
Geosciences
Virtual reality and games
Image based Localization
Structure-from-Motion Pipeline
4
Feature Extraction
Feature Matching
Camera Pose Estimation
Triangulation
Bundle Adjustment
https://demuc.de/tutorials/cvpr2017/
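As a rough illustration of the triangulation stage in the pipeline above, the sketch below recovers a single 3D point from two views with the linear (DLT) method. The function names `triangulate_dlt` and `project`, and the toy cameras, are assumptions made for illustration; they are not taken from the cited tutorial.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2: 3x4 projection matrices; x1, x2: 2D image coordinates (u, v).
    """
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy setup (hypothetical): identity intrinsics, second camera shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free observations the estimate matches the true point up to numerical precision. With real feature matches, the linear solution would normally serve only as an initial guess that bundle adjustment then refines.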
Dataset - Co3D v2
9
HLOC Challenges
10
Number of images | Average Rotation Error | Number of poses estimated |
50 | 0.056 | 50 |
20 | 0.665 | 14 |
10 | Does not converge | 0 |
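The table above reports an average rotation error but does not say how it is computed or in which unit. One common choice, assumed here, is the geodesic angle between the predicted and ground-truth rotation matrices. The helper names `rotation_error_deg` and `rot_z` are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_gt
    # trace(R) = 1 + 2 cos(theta) for a rotation by angle theta.
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (test helper)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Two rotations about the same axis that differ by 20 degrees.
err = rotation_error_deg(rot_z(30.0), rot_z(10.0))
```

Averaging this quantity over all estimated poses would give a single number per experiment, which is one plausible reading of the "Average Rotation Error" column.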
Challenges
11
Requires a lot of images
Requires good overlap between images
Sparse View SfM
12
Set of Few Images
Good 3D points / Camera poses
Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]
Solving SfM
13
Data-driven priors!
Wide baselines
Sparse View
https://www.researchgate.net/figure/Short-or-wide-baseline-stereo-with-the-matching-algorithms_fig3_281393350
Sparse-view 3D pose estimation
14
Camera pose
Set of Few Images + camera poses
Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]
Scene Representation Transformers
15
Overview
[Diagram labels: Multi-View Images; Camera Poses; Query Pose; SRT; Generated View]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Scene Representation Transformers
View specific features
16
[Diagram labels: Multi-View Images; Camera Poses; CNN; Scene level encoding]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Scene Representation Transformers
17
[Diagram labels: Multi-View Images; Camera Poses; 2D positional encoding; CNN; Encoder Transformer; Latent Scene Representation]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
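The slide names a 2D positional encoding but does not spell out its form. A minimal sketch, assuming a sinusoidal encoding over normalized image-grid coordinates, is shown below. The function name `positional_encoding_2d` and the `num_freqs` parameter are hypothetical, and the encoding actually used in the cited work may differ.

```python
import numpy as np

def positional_encoding_2d(height, width, num_freqs=4):
    """Sinusoidal encoding of normalized (row, col) grid positions.

    Returns an array of shape (height, width, 4 * num_freqs).
    """
    ys = np.linspace(0.0, 1.0, height)
    xs = np.linspace(0.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    # Frequencies double at every octave: pi, 2*pi, 4*pi, ...
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    feats = []
    for coord in (yy, xx):
        angles = coord[..., None] * freqs  # (height, width, num_freqs)
        feats.append(np.sin(angles))
        feats.append(np.cos(angles))
    return np.concatenate(feats, axis=-1)

# Encoding for an 8x8 grid of patch positions.
pe = positional_encoding_2d(8, 8)
```

Each patch feature would then be combined with the encoding at its grid location, for example by concatenation, before entering the encoder.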
Scene Representation Transformers
18
[Diagram labels: Multi-View Images; Camera Poses; 2D positional encoding; CNN; Encoder Transformer; Latent Scene Representation; Query Rays; Decoder Transformer; Generated Views]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Baseline
19
[Diagram labels: Multi-View Images; Camera Poses; 2D positional encoding; CNN; Encoder Transformer; Latent Scene Representation; Query Pose; Decoder Transformer; Generated Views]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Baseline
20
[Diagram labels: Multi-View Images; Camera Poses; CNN; Encoder Transformer; Latent Scene Representation; Query Rays; Decoder Transformer; Generated Views]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Baseline
21
[Diagram labels: Multi-View Images; Camera Poses; CNN; Encoder Transformer; Latent Scene Representation; Query Image; CNN; Decoder Transformer; Estimated pose]
Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022
Results
22
R Acc @ 15 Degrees
Category | Test (3 scene images) |
Hydrant | 82.5 |
Teddybear | 70 |
Motorcycle | 86 |
Bench | 92.0 |
~250 Scenes
Input Views
Query image
Wide baseline!
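The "R Acc @ 15 Degrees" metric is not defined on the slide. A minimal sketch, assuming it is the percentage of predicted rotations whose angular error is below 15 degrees, could look like this. The function name `rotation_accuracy` and the error values are made up for illustration.

```python
import numpy as np

def rotation_accuracy(errors_deg, threshold_deg=15.0):
    """Percentage of rotation errors (in degrees) below the threshold."""
    errors_deg = np.asarray(errors_deg, dtype=float)
    return 100.0 * np.mean(errors_deg < threshold_deg)

# Made-up per-image rotation errors in degrees, not results from the slides.
errors = [3.2, 14.9, 15.1, 40.0]
acc = rotation_accuracy(errors)
```

Here two of the four errors fall below the threshold, so the accuracy is 50 percent.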
Baseline SRT
23
Naive way of encoding camera poses!
Encoder-only architecture!
[Diagram labels: (Pose encoding, Patch 1); (Pose encoding, Patch 2); (Pose encoding, Patch 3)]
Sub-optimal!
Baseline ++
24
[Diagram labels, input branch: Input Images; Camera Poses Known; CNN; (Rays 1, Patch 1); (Rays 2, Patch 2); (Rays 3, Patch 3)]
[Diagram labels, query branch: Query Images; Camera Poses Unknown; CNN; (Rays (I) 1, Patch 1); (Rays (I) 2, Patch 2); (Rays (I) 3, Patch 3)]
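Baseline ++ pairs each image patch with rays. The slides do not specify the ray parameterization, so the sketch below assumes a pinhole camera with intrinsics `K` and world-to-camera extrinsics `(R, t)`, and represents each patch by the origin and unit direction of the ray through its center. The function name `patch_rays` and all numeric values are hypothetical.

```python
import numpy as np

def patch_rays(K, R, t, image_size, patch_size):
    """Ray origin and unit direction (world frame) for every patch center.

    Assumes world-to-camera extrinsics: x_cam = R @ X_world + t.
    Returns an array of shape (num_patches, 6): [origin (3), direction (3)].
    """
    h, w = image_size
    # Pixel coordinates of the patch centers.
    us = np.arange(patch_size / 2.0, w, patch_size)
    vs = np.arange(patch_size / 2.0, h, patch_size)
    uu, vv = np.meshgrid(us, vs)
    pix = np.stack([uu.ravel(), vv.ravel(), np.ones(uu.size)], axis=0)  # 3 x N
    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs = R.T @ (np.linalg.inv(K) @ pix)
    dirs = (dirs / np.linalg.norm(dirs, axis=0)).T  # N x 3
    # Camera center in world coordinates, shared by all rays of this image.
    origin = -R.T @ t
    origins = np.broadcast_to(origin, dirs.shape)
    return np.concatenate([origins, dirs], axis=1)

# Toy camera: 64x64 image, 16x16 patches, identity pose.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
rays = patch_rays(K, np.eye(3), np.zeros(3), image_size=(64, 64), patch_size=16)
```

Calling the same function with an identity pose, as in the toy example, is one way to produce placeholder rays for images whose pose is unknown. That is our reading of the "Rays (I)" label, not something the slides state.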
Baseline ++
25
[Diagram labels: Input patches; Rays; mask(known); Query patches; Rays (I); mask(unknown); Transformer Encoder; 2-Layer MLP; Predicted Rays; Rays Query; Rays Input; Predicted Cameras; Camera Poses]
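The slide lists patches, rays, and a known/unknown mask as encoder inputs, without saying how they are combined. One possible way to assemble the token sequence, assuming simple concatenation of patch features, ray encodings, and a binary flag, is sketched below. The function name `build_tokens`, the feature sizes, and the random placeholder data are all hypothetical.

```python
import numpy as np

def build_tokens(patch_feats, rays, pose_known):
    """Concatenate patch features, ray encodings, and a known/unknown flag.

    patch_feats: (num_images, num_patches, feat_dim)
    rays:        (num_images, num_patches, ray_dim)
    pose_known:  (num_images,) boolean, True where the camera pose is given
    Returns tokens of shape (num_images * num_patches, feat_dim + ray_dim + 1).
    """
    n_img, n_patch, _ = patch_feats.shape
    # Broadcast the per-image flag to every patch of that image.
    flag = np.broadcast_to(
        pose_known[:, None, None].astype(patch_feats.dtype),
        (n_img, n_patch, 1),
    )
    tokens = np.concatenate([patch_feats, rays, flag], axis=-1)
    # Flatten images and patches into one sequence for the encoder.
    return tokens.reshape(n_img * n_patch, -1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 16, 32))  # 6 images, 16 patches, 32-dim features
rays = rng.normal(size=(6, 16, 6))    # placeholder ray encodings
known = np.array([True, True, True, False, False, False])
tokens = build_tokens(feats, rays, known)
```

A transformer encoder would consume `tokens`, and a small MLP head on the output tokens of the unknown-pose images would regress their rays, from which cameras could then be recovered.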
Qualitative Results
26
Input Views (3 known [green], 3 unknown [blue])
Input Views (3 known, 3 unknown)
Quantitative Results
27
R Acc @ 15 Degrees (Query Images / Total Images)
Category | Test 1/6 | Test 2/6 | Test 3/6 | Test 4/6 | Test 5/6 |
Hydrant | 95.9 | 99.0 | 63.9 | 43.4 | 35.5 |
Teddybear | 98.9 | 93.2 | 56.1 | 42.9 | 27.0 |
Motorcycle | 98.0 | 90.0 | 68.7 | 55.5 | 36.8 |
Bench | 92.0 | 87.0 | 60.0 | 35.5 | 26.8 |
~250 Scenes
Data-driven priors!
Next Steps
28
Takeaways
Data priors are useful
Leverage priors more explicitly in deep learning
29
Thank you
30