1 of 30

Learning for Structure from Motion


Shubham

Tzu-Hsuan (Betty)

Moneish

2 of 30

Structure from Motion (SfM)

Input: Set of images

Output: 3D structure of an object or scene

Images: 3dvision.princeton.edu/courses/SFMedu/

3 of 30

Applications

Geosciences

Virtual reality and games

Image-based localization

4 of 30

Structure-from-Motion Pipeline


Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

https://demuc.de/tutorials/cvpr2017/
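
To make the triangulation stage concrete, here is a minimal sketch of two-view triangulation with the linear DLT method. This is one standard way to do it, not necessarily what any particular SfM system uses. The intrinsics, camera placement, and test point are hypothetical values chosen for illustration.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views with the linear DLT method.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel observations.
    """
    # Each observation contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical two-camera setup: shared intrinsics K, second camera shifted along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free observations, `X_est` recovers `X_true` up to floating-point error. With real matches, the result is usually refined afterwards by bundle adjustment.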

6 of 30

Structure-from-Motion Pipeline


Feature Extraction

Feature Matching

Camera Pose Estimation

Triangulation

Bundle Adjustment

?


9 of 30

Dataset - Co3D v2


10 of 30

HLOC Challenges

Number of images    Average Rotation Error    Number of poses estimated
50                  0.056                     50
20                  0.665                     14
10                  Does not converge         0
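
The slide does not define "Average Rotation Error". A common choice, assumed here, is the geodesic angle between each estimated rotation $\hat{R}_i$ and its ground truth $R_i$, averaged over the $N$ estimated poses. The symbols are introduced for illustration only.

```latex
\theta(\hat{R}, R) = \arccos\!\left(\frac{\operatorname{tr}(\hat{R}^{\top} R) - 1}{2}\right),
\qquad
\bar{\theta} = \frac{1}{N} \sum_{i=1}^{N} \theta(\hat{R}_i, R_i)
```

Under this definition the average covers only the poses that were actually estimated, so it should be read together with the "Number of poses estimated" column.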

11 of 30

Challenges

Requires a lot of images

Requires good overlap between images

12 of 30

Sparse View SfM

Input: Set of few images

Output: Good 3D points / camera poses

Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]

13 of 30

Solving SfM

Data-driven priors!

Wide baselines

Sparse view

https://www.researchgate.net/figure/Short-or-wide-baseline-stereo-with-the-matching-algorithms_fig3_281393350

14 of 30

Sparse-view 3D pose estimation

Camera pose

Set of few images + camera poses

Images: [Zhang J., Yang G., Tulsiani S. and Ramanan D., NeurIPS 2021]

15 of 30

Scene Representation Transformers

Overview

[Diagram: Multi-View Images + Camera Poses + Query Pose → SRT → Generated View]

Sajjadi et al., “Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations”, CVPR 2022

16 of 30

Scene Representation Transformers

View-specific features

[Diagram: Multi-View Images + Camera Poses → CNN]

17 of 30

Scene Representation Transformers

Scene-level encoding

[Diagram: Multi-View Images + Camera Poses → CNN (+ 2D positional encoding) → Encoder Transformer → Latent Scene Representation]

18 of 30

Scene Representation Transformers

[Diagram: Multi-View Images + Camera Poses → CNN (+ 2D positional encoding) → Encoder Transformer → Latent Scene Representation]

[Diagram: Latent Scene Representation + Query Rays → Decoder Transformer → Generated Views]
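
The decoder step, where query rays read from the latent scene representation, can be illustrated with a minimal single-head cross-attention sketch. This is not the SRT implementation: the dimensions, random weights, and stand-in inputs are assumptions, and a real decoder stacks several such layers with MLPs and normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, latents, Wq, Wk, Wv):
    """Single-head cross-attention: each query token attends over the latent set."""
    Q = queries @ Wq          # (num_queries, d)
    K = latents @ Wk          # (num_latents, d)
    V = latents @ Wv          # (num_latents, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (num_queries, num_latents)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(12, d))     # stand-in for the latent scene representation
query_rays = rng.normal(size=(5, d))   # stand-in for encoded query rays
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = cross_attention(query_rays, latents, Wq, Wk, Wv)
```

Because the latents are treated as an unordered set, shuffling their order leaves `out` unchanged, which is the "set-latent" property named in the paper title.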

19 of 30

Baseline

[Diagram: Multi-View Images + Camera Poses → CNN (+ 2D positional encoding) → Encoder Transformer → Latent Scene Representation]

[Diagram: Latent Scene Representation + Query Pose → Decoder Transformer → Generated Views]

20 of 30

Baseline

[Diagram: Multi-View Images + Camera Poses → CNN → Encoder Transformer → Latent Scene Representation]

[Diagram: Latent Scene Representation + Query Rays → Decoder Transformer → Generated Views]

21 of 30

Baseline

[Diagram: Multi-View Images + Camera Poses → CNN → Encoder Transformer → Latent Scene Representation]

[Diagram: Query Image → CNN → Decoder Transformer (with Latent Scene Representation) → Estimated pose]

22 of 30

Results

R Acc @ 15 Degrees

Category      Test (3 scene images)
Hydrant       82.5
Teddybear     70
Motorcycle    86
Bench         92.0

~250 Scenes

[Figure: Input Views and Query image]

Wide baseline!
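
For reference, a metric like "R Acc @ 15 Degrees" can be computed as sketched below, assuming it means the percentage of predicted rotations whose geodesic error to the ground truth is under 15 degrees. The slides do not spell out the exact protocol (for example absolute versus pairwise relative rotations), so the function names and toy data here are illustrative only.

```python
import numpy as np

def rotation_angle_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rotation_accuracy(preds, gts, threshold_deg=15.0):
    """Percentage of predictions whose rotation error is below the threshold."""
    errors = np.array([rotation_angle_deg(p, g) for p, g in zip(preds, gts)])
    return 100.0 * np.mean(errors < threshold_deg)

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (helper for the toy data)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Toy data: ground truth is the identity; predictions are off by 5, 10, 20 and 40 degrees.
gts = [np.eye(3)] * 4
preds = [rot_z(5), rot_z(10), rot_z(20), rot_z(40)]
acc = rotation_accuracy(preds, gts)  # two of the four errors fall below 15 degrees
```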

23 of 30

Baseline SRT

Naive way of encoding camera poses!

Encoder-only architecture!

[Diagram: Pose encoding + Patch 1, Pose encoding + Patch 2, Pose encoding + Patch 3]

Sub-optimal!

24 of 30

Baseline ++

[Diagram: Input Images (Camera Poses Known) → CNN → (Rays 1, Patch 1), (Rays 2, Patch 2), (Rays 3, Patch 3)]

[Diagram: Query Images (Camera Poses Unknown) → CNN → (Rays (I) 1, Patch 1), (Rays (I) 2, Patch 2), (Rays (I) 3, Patch 3)]
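
A per-patch ray can be derived from a camera pose with the pinhole model, as in the sketch below. Two things are assumptions rather than facts from the slides: that "Rays (I)" means placeholder rays from an identity camera for images whose pose is unknown, and that one ray is taken per patch centre. The intrinsics, image size, and pose values are hypothetical.

```python
import numpy as np

def patch_rays(K, R_c2w, t_c2w, height, width, patch):
    """Return one ray (origin, unit direction) per patch centre, in world coordinates.

    K: 3x3 intrinsics; R_c2w, t_c2w: camera-to-world rotation and camera centre.
    """
    # Pixel coordinates of patch centres.
    us = np.arange(patch / 2.0, width, patch)
    vs = np.arange(patch / 2.0, height, patch)
    u, v = np.meshgrid(us, vs)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Back-project through the pinhole model, rotate into the world frame, normalise.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ R_c2w.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t_c2w, dirs_world.shape)
    return origins, dirs_world

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])

# Known pose (hypothetical values): 90-degree rotation about the y axis, centre at (1, 0, 0).
R_known = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
o_known, d_known = patch_rays(K, R_known, np.array([1.0, 0.0, 0.0]), 64, 64, 16)

# Unknown pose: placeholder rays from an identity camera at the origin
# (assumed meaning of "Rays (I)").
o_id, d_id = patch_rays(K, np.eye(3), np.zeros(3), 64, 64, 16)
```

Each patch token can then be paired with its ray, so known-pose and unknown-pose images share the same input format.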

25 of 30

Baseline ++

[Diagram: Input tokens = Rays + mask(known) + Input patches; Query tokens = Rays (I) + mask(unknown) + Query patches]

[Diagram: Tokens → Transformer Encoder → 2-Layer MLP → Predicted Rays (Rays Input, Rays Query) → Predicted Cameras / Camera Poses]
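
The slides do not say how cameras are recovered from the predicted rays. One plausible option, swapped in here purely as an illustration, is the Kabsch (orthogonal Procrustes) method: find the rotation that best aligns identity-camera ray directions with the predicted directions. The noisy "predicted" rays below are synthetic stand-ins, not network outputs.

```python
import numpy as np

def rotation_from_rays(dirs_identity, dirs_predicted):
    """Find the rotation R that best maps identity-camera directions onto predicted ones.

    Solves min_R sum ||R a_i - b_i||^2 with the Kabsch / orthogonal Procrustes method.
    """
    H = dirs_identity.T @ dirs_predicted          # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (a proper rotation).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

rng = np.random.default_rng(0)
dirs_identity = rng.normal(size=(16, 3))
dirs_identity /= np.linalg.norm(dirs_identity, axis=1, keepdims=True)

# Synthetic ground truth: 30-degree rotation about the y axis.
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])

# Noisy stand-in for the rays a network might predict.
dirs_predicted = dirs_identity @ R_true.T + 0.01 * rng.normal(size=(16, 3))
R_est = rotation_from_rays(dirs_identity, dirs_predicted)
```

The camera centre could be estimated separately, for example as the point closest to all predicted ray lines. That step is omitted here.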

26 of 30

Qualitative Results

Input views (3 known [green], 3 unknown [blue])

27 of 30

Quantitative Results

R Acc @ 15 Degrees (Query Images / Total Images)

Category      Test 1/6    Test 2/6    Test 3/6    Test 4/6    Test 5/6
Hydrant       95.9        99.0        63.9        43.4        35.5
Teddybear     98.9        93.2        56.1        42.9        27.0
Motorcycle    98.0        90.0        68.7        55.5        36.8
Bench         92.0        87.0        60.0        35.5        26.8

~250 Scenes

Data-driven priors!

28 of 30

Next Steps

  • Run and evaluate on all 41 categories.
  • Experiment with a randomised number of masked input cameras.
  • Compare with other pose regression models such as RelPose.
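
The randomised masking experiment could be set up along these lines. This is a sketch under assumptions: the function name, the `min_known` parameter, and the rule that at least one view stays known and at least one is masked are illustrative choices, not the authors' protocol.

```python
import numpy as np

def sample_known_mask(num_views, rng, min_known=1):
    """Randomly mark each view as known (True) or masked/unknown (False).

    At least `min_known` views stay known and at least one view is masked.
    """
    num_known = rng.integers(min_known, num_views)  # upper bound is exclusive
    known = np.zeros(num_views, dtype=bool)
    known[rng.choice(num_views, size=num_known, replace=False)] = True
    return known

rng = np.random.default_rng(0)
masks = [sample_known_mask(6, rng) for _ in range(100)]
```

With six views per scene, each sampled mask leaves between one and five cameras unknown.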

29 of 30

Takeaways

Data priors are useful

Leverage priors more explicitly in deep learning

30 of 30

Thank you
