1 of 20

VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Visual Geometry Group, University of Oxford; Meta AI
CVPR 2024

Samuel Chua

16 September 2025

2 of 20

Contents

Background
Related Works
Motivation
Method
Experiments
Conclusion

3 of 20

Background – Structure-from-Motion (SfM)

Recover a 3D point cloud (structure) and camera parameters (motion) from multiple unconstrained 2D images

Detect keypoints and compute feature descriptors (e.g. SIFT, ORB)
Match features across image pairs via nearest-neighbor descriptor search
Verify candidate pairs with two-view geometry: robustly fit (RANSAC) an essential or fundamental matrix, or a homography
Triangulate 3D points
Refine cameras and points with Bundle Adjustment (BA)
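
The two-view verification step can be illustrated with a minimal pure-Python sketch. It builds an essential matrix E = [t]_x R from an assumed, known relative pose and evaluates the epipolar constraint x2^T E x1 ≈ 0 for a candidate match in normalized image coordinates. All function names, the pose, and the test point are illustrative assumptions; a real pipeline would estimate E from the matches with RANSAC rather than from a known pose.

```python
def skew(t):
    """Return the 3x3 cross-product matrix [t]_x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_from_pose(R, t):
    """E = [t]_x R for a relative pose with X2 = R X1 + t."""
    return matmul(skew(t), R)


def project(R, t, X):
    """Project 3D point X into a camera with pose (R, t); returns (x, y, 1)."""
    Xc = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    return [Xc[0] / Xc[2], Xc[1] / Xc[2], 1.0]


def epipolar_residual(E, x1, x2):
    """Algebraic residual x2^T E x1 for homogeneous normalized points."""
    Ex1 = [sum(E[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))


# Hypothetical setup: camera 2 is camera 1 shifted along the x-axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R, t = IDENTITY, [1.0, 0.0, 0.0]
X = [0.5, -0.2, 4.0]                      # a 3D point in front of both cameras

x1 = project(IDENTITY, [0.0, 0.0, 0.0], X)
x2 = project(R, t, X)
x2_bad = [x2[0], x2[1] + 0.1, 1.0]        # a wrong match, off the epipolar line

E = essential_from_pose(R, t)
good = epipolar_residual(E, x1, x2)       # consistent match: residual near 0
bad = epipolar_residual(E, x1, x2_bad)    # inconsistent match: clearly non-zero
```

Thresholding the absolute residual (or, more commonly, a geometric distance such as the Sampson error) is what separates inlier from outlier matches inside a RANSAC loop.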

4 of 20

Related Works

COLMAP

Incremental SfM

Deep Learning approaches

Better keypoint detection and feature matching (SuperPoint, SuperGlue)

5 of 20

Related Works

PixSfM

Builds on COLMAP output (tracks + structure).
Refines with feature-metric keypoint adjustment and feature-metric bundle adjustment.
Improves accuracy but still dependent on COLMAP’s initialization.

Detector-free SfM

Uses detector-free matches instead of handcrafted or learned keypoints.
Builds a coarse SfM model using COLMAP.
Reduces reliance on explicit detectors, but is still tied to classical SfM frameworks.

6 of 20

Motivation

Classical SfM

Modular & Incremental
Not fully differentiable end to end

Fully differentiable SfM pipelines

Deep neural networks to regress camera poses and depths
Suffer from limited scalability

7 of 20

Method - VGGSfM

Point Tracker

Initial Camera Estimator

Triangulator

Bundle-Adjustment

11 of 20

Method - VGGSfM

Point Tracker

Initial Camera Estimator

Triangulator

Bundle-Adjustment

Goal: minimise the reprojection error with a second-order Levenberg-Marquardt (LM) optimizer
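
For reference, bundle adjustment in its standard textbook form can be written as below. The symbols are assumptions chosen for illustration, not notation taken from these slides: P_i is camera i, X_j is 3D point j, y_ij is the observed 2D track point of X_j in image i, v_ij ∈ {0, 1} marks visibility, and π is the projection function. The second line is the generic damped LM update, with stacked residuals r, Jacobian J, and damping factor λ.

```latex
\min_{\{P_i\},\,\{X_j\}} \;
  \sum_{i=1}^{N_I} \sum_{j=1}^{N_T}
  v_{ij}\, \bigl\| \pi(P_i, X_j) - y_{ij} \bigr\|^2

\bigl( J^\top J + \lambda I \bigr)\, \delta = -\, J^\top r
```

A small λ makes the step behave like Gauss-Newton, while a large λ makes it behave like a short gradient-descent step.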

12 of 20

Method - VGGSfM

Point Tracker

Initial Camera Estimator

Triangulator

Bundle-Adjustment

- Point loss: ϵ-thresholded pseudo-Huber loss between the ground-truth 3D points and the bundle-adjusted 3D points

- Camera loss: compare both the predicted initial camera poses and the bundle-adjusted camera poses to the ground-truth camera annotations

- Track loss: maximise the likelihood of each ground-truth track point under a probabilistic track-point estimate, a 2D Gaussian whose mean and variance are the model's predictions
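
Two of these loss terms can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the pseudo-Huber form used is the common δ²(√(1 + (r/δ)²) − 1) with the threshold called `eps`, the Gaussian is assumed isotropic with a single variance, and the likelihood is turned into a loss by taking its negative log. All function names are hypothetical.

```python
import math


def pseudo_huber(residual, eps):
    """Pseudo-Huber penalty: ~quadratic for |r| << eps, ~linear for |r| >> eps."""
    return eps ** 2 * (math.sqrt(1.0 + (residual / eps) ** 2) - 1.0)


def point_loss(pred_point, gt_point, eps):
    """Pseudo-Huber loss on the Euclidean distance between two 3D points."""
    dist = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_point, gt_point)))
    return pseudo_huber(dist, eps)


def gaussian_nll_2d(gt_xy, mean_xy, var):
    """Negative log-likelihood of gt_xy under an isotropic 2D Gaussian.

    Assumes covariance var * I, so the density is
    exp(-||d||^2 / (2 var)) / (2 pi var).
    """
    dx = gt_xy[0] - mean_xy[0]
    dy = gt_xy[1] - mean_xy[1]
    return math.log(2.0 * math.pi * var) + (dx * dx + dy * dy) / (2.0 * var)


# Hypothetical values: a slightly-off 3D point and a slightly-off track point.
loss_3d = point_loss([1.0, 2.0, 3.1], [1.0, 2.0, 3.0], eps=1.0)
nll_near = gaussian_nll_2d((10.0, 5.0), (10.2, 5.0), var=1.0)
nll_far = gaussian_nll_2d((10.0, 5.0), (14.0, 5.0), var=1.0)
```

The pseudo-Huber form keeps the loss smooth while limiting the influence of grossly wrong points, and the variance term in the Gaussian lets a prediction with low confidence incur a smaller penalty for the same pixel error.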

13 of 20

Experiments

CO3Dv2 → large-scale, category-diverse collection of object-centric videos.

IMC Phototourism → large-scale, unstructured internet photos.

ETH3D → high-precision indoor/outdoor multi-view benchmark.

20 of 20

Conclusion

VGGSfM

Fully differentiable, end-to-end SfM pipeline

Cannot yet compete with established pipelines in all application domains.

Currently lacks the capability to process thousands of images as in traditional SfM frameworks.