1 of 30

Learning for Structure from Motion

Shubham

Tzu-Hsuan(Betty)

Moneish

2 of 30

Structure from Motion (SfM)

Set of images

3D structure of an object or scene.

Images: 3dvision.princeton.edu/courses/SFMedu/

3 of 30

Applications

Geosciences

Virtual reality and games

Image-based localization

4 of 30

Structure-from-Motion Pipeline

Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

https://demuc.de/tutorials/cvpr2017/

5 of 30

Structure-from-Motion Pipeline

Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

https://demuc.de/tutorials/cvpr2017/

6 of 30

Structure-from-Motion Pipeline

Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

?

https://demuc.de/tutorials/cvpr2017/

7 of 30

Structure-from-Motion Pipeline

Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

https://demuc.de/tutorials/cvpr2017/

8 of 30

Structure-from-Motion Pipeline

Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

https://demuc.de/tutorials/cvpr2017/
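To make the Triangulation step of the pipeline concrete, here is a minimal sketch of two-view triangulation with the midpoint method: the 3D point is taken as the midpoint of the shortest segment between two viewing rays. This is an illustration only. The function name and the ray parameterisation (camera centre plus direction) are assumptions made here, and production SfM pipelines typically use other solvers followed by bundle adjustment.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint triangulation of two rays c1 + s*d1 and c2 + t*d2.

    c1, c2: camera centres; d1, d2: ray directions (3-element sequences).
    Returns the midpoint of the shortest segment joining the two rays.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # approaches 0 as the rays become parallel
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# Two cameras one unit apart, both observing the same 3D point.
point = triangulate_midpoint(
    [0.0, 0.0, 0.0], [0.5, 0.0, 2.0],
    [1.0, 0.0, 0.0], [-0.5, 0.0, 2.0],
)
```

The guard on `denom` reflects that two nearly parallel rays do not constrain depth, so the sketch refuses to return a point in that case.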

9 of 30

Dataset - Co3D v2

10 of 30

HLOC Challenges

Number of images	Average Rotation Error	Number of poses estimated
50	0.056	50
20	0.665	14
10	Does not converge	0

Here we chose one category, “hydrant”, from the Co3D v2 dataset and ran the SfM pipeline that we built with the hloc library. (The blue cameras are the ground-truth cameras and the red ones are the predictions from the SfM pipeline.) We compared the average rotation error and the number of estimated poses for different numbers of input images.

With 50 input images, the average rotation error is very low and the pipeline recovers a camera pose for every input image.

With 20 images, fewer than half of the original 50, the rotation error increases significantly and poses are recovered for only 14 of the images.

With 10 input images, the SfM pipeline fails to converge and recovers no camera poses.
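For reference, a rotation error of this kind is commonly measured as the geodesic angle between a ground-truth and a predicted rotation matrix. The sketch below assumes that definition, reports degrees, and assumes the predicted poses have already been aligned to the ground-truth coordinate frame. The slide does not state the exact metric or its units, so this is an illustration, not a reproduction of the numbers in the table.

```python
import math


def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the element-wise inner product of the matrices.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for rounding noise
    return math.degrees(math.acos(cos_angle))


def average_rotation_error(gt_rotations, pred_rotations):
    """Mean geodesic error over the cameras that received a pose estimate."""
    errors = [rotation_error_deg(a, b) for a, b in zip(gt_rotations, pred_rotations)]
    return sum(errors) / len(errors)


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


example_error = rotation_error_deg(rot_z(0.0), rot_z(30.0))
```

Only cameras with an estimated pose enter the average, which is why the number of estimated poses is reported alongside the error.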

11 of 30

Challenges

Requires a lot of images

Requires good overlap between images

12 of 30

Sparse View SfM

Set of Few Images

Good 3D points / Camera poses

Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]

13 of 30

Solving SfM

Data-driven priors!

Wide baselines

Sparse View

https://www.researchgate.net/figure/Short-or-wide-baseline-stereo-with-the-matching-algorithms_fig3_281393350

14 of 30

Sparse-view 3D pose estimation

Camera pose

Set of Few Images + camera poses

Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]

15 of 30

Scene Representation Transformers

Overview

Diagram: Multi-View Images, Camera Poses, and a Query Pose go into SRT; the output is the Generated View.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

16 of 30

Scene Representation Transformers

View-specific features

Diagram: Multi-View Images and Camera Poses go into a CNN, which produces view-specific features.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

17 of 30

Scene Representation Transformers

Scene-level encoding

Diagram: Multi-View Images and Camera Poses go through the CNN and, with a 2D positional encoding, the Encoder Transformer; the output is the Latent Scene Representation.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

18 of 30

Scene Representation Transformers

Diagram: Multi-View Images and Camera Poses go through the CNN and, with a 2D positional encoding, the Encoder Transformer to give the Latent Scene Representation; the Decoder Transformer takes it together with the Query Rays and outputs the Generated Views.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

19 of 30

Baseline

Diagram: Multi-View Images and Camera Poses go through the CNN and, with a 2D positional encoding, the Encoder Transformer to give the Latent Scene Representation; the Decoder Transformer takes it together with the Query Pose and outputs the Generated Views.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

20 of 30

Baseline

Diagram: Multi-View Images and Camera Poses go through the CNN and the Encoder Transformer to give the Latent Scene Representation; the Decoder Transformer takes it together with the Query Rays and outputs the Generated Views.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

21 of 30

Baseline

Diagram: Multi-View Images and Camera Poses go through the CNN and the Encoder Transformer to give the Latent Scene Representation; a Query Image goes through a CNN, and the Decoder Transformer outputs the Estimated pose.

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

22 of 30

Results

R Acc @ 15 Degrees
Category	Test (3 scene images)
Hydrant	82.5
Teddybear	70
Motorcycle	86
Bench	92.0

~250 Scenes

Input Views

Query image

Wide baseline!
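The "R Acc @ 15 Degrees" metric in the table can be read as the percentage of predicted rotations whose angular error falls below 15 degrees. A minimal sketch under that reading follows. The strict inequality and the input format (a flat list of per-query errors in degrees) are assumptions, since the slide does not spell out the protocol.

```python
def rotation_accuracy(errors_deg, threshold_deg=15.0):
    """Percentage of rotation errors (in degrees) below the threshold."""
    if not errors_deg:
        raise ValueError("need at least one rotation error")
    hits = sum(1 for err in errors_deg if err < threshold_deg)
    return 100.0 * hits / len(errors_deg)


# Hypothetical per-query errors, for illustration only (not results from the slides).
example_accuracy = rotation_accuracy([3.0, 10.0, 14.9, 20.0])
```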

23 of 30

Baseline SRT

Naive way of encoding camera poses!

Encoder-only architecture!

Diagram: Patch 1, Patch 2, and Patch 3, each paired with a pose encoding.

Sub-optimal!

24 of 30

Baseline ++

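Input Images (camera poses known): the images go through a CNN, and Patch 1, Patch 2, and Patch 3 are paired with Rays 1, Rays 2, and Rays 3.

Query Images (camera poses unknown): the images go through a CNN, and Patch 1, Patch 2, and Patch 3 are paired with Rays (I) 1, Rays (I) 2, and Rays (I) 3.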

25 of 30

Baseline ++

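Diagram: Input patches with Rays and mask(known), and Query patches with Rays (I) and mask(unknown), feed the Transformer Encoder, followed by a 2-layer MLP that outputs Predicted Rays (Rays Input, Rays Query); the Predicted Cameras (Camera Poses) follow from the predicted rays.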
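One way to picture the per-patch ray inputs is sketched below: each patch centre is turned into a ray (origin plus unit direction) from a pinhole camera, and every token carries a flag saying whether its camera pose is known. Reading "Rays (I)" as rays computed with an identity camera for unknown-pose images is an interpretation made here, not something the slides state. The function names, the 6-number ray layout, and the pinhole convention `x_cam = R x_world + t` are likewise assumptions.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def patch_ray(u, v, focal, cx, cy, R=IDENTITY, t=(0.0, 0.0, 0.0)):
    """World-frame ray (origin, unit direction) through pixel (u, v).

    Assumes a pinhole camera with x_cam = R @ x_world + t.
    """
    d_cam = [(u - cx) / focal, (v - cy) / focal, 1.0]
    # Direction in world frame is R^T d_cam; camera centre is -R^T t.
    d_world = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    origin = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d_world))
    return origin, [x / norm for x in d_world]


def ray_tokens(patch_centers, focal, cx, cy, pose=None):
    """Return one (ray, known_flag) pair per patch centre.

    pose=(R, t) marks a known camera (flag 1). pose=None marks an unknown
    camera: rays use the identity pose (assumed meaning of "Rays (I)"), flag 0.
    """
    known = pose is not None
    R, t = pose if known else (IDENTITY, (0.0, 0.0, 0.0))
    tokens = []
    for u, v in patch_centers:
        origin, direction = patch_ray(u, v, focal, cx, cy, R, t)
        tokens.append((origin + direction, 1 if known else 0))
    return tokens


query_tokens = ray_tokens([(32.0, 32.0)], focal=100.0, cx=32.0, cy=32.0)
input_tokens = ray_tokens([(32.0, 32.0)], focal=100.0, cx=32.0, cy=32.0,
                          pose=(IDENTITY, (0.0, 0.0, -2.0)))
```

In a model, these ray vectors and flags would be combined with the CNN patch features before the Transformer Encoder. How that combination is done is not specified in the slides.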

26 of 30

Qualitative Results

Input Views (3 known [green], 3 unknown [blue])

Input Views (3 known, 3 unknown)

27 of 30

Quantitative Results

R Acc @ 15 Degrees (Query Images / Total Images)
Category	Test 1/6	Test 2/6	Test 3/6	Test 4/6	Test 5/6
Hydrant	95.9	99.0	63.9	43.4	35.5
Teddybear	98.9	93.2	56.1	42.9	27.0
Motorcycle	98.0	90.0	68.7	55.5	36.8
Bench	92.0	87.0	60.0	35.5	26.8

~250 Scenes

Data-driven priors!

28 of 30

Next Steps

Run on all 41 categories and evaluate.
Experiment with a randomised number of masked input cameras.
Compare with other pose regression models such as RelPose.
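The randomised-masking experiment could be set up along the following lines: for each training example, draw how many of the input cameras keep their poses and mask the rest. This is a hypothetical sketch. The function name, the uniform sampling scheme, and the rule that at least one pose stays known and at least one is masked are all assumptions.

```python
import random


def sample_camera_mask(num_images, min_known=1, rng=random):
    """Return a list of booleans: True = pose given, False = pose masked.

    The number of known cameras is drawn uniformly from
    [min_known, num_images - 1], so at least one camera is always masked.
    """
    if num_images < 2 or not 1 <= min_known < num_images:
        raise ValueError("need at least one known and one masked camera")
    num_known = rng.randint(min_known, num_images - 1)
    known = set(rng.sample(range(num_images), num_known))
    return [i in known for i in range(num_images)]


# With 6 images this covers the 1/6 to 5/6 query settings from the results table.
mask = sample_camera_mask(6, rng=random.Random(0))
```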

29 of 30

Takeaways

Data-driven priors are useful

Leverage priors more explicitly in deep learning

30 of 30

Thank you
