VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Visual Geometry Group, University of Oxford; Meta AI
CVPR 2024

Samuel Chua

16 September 2025

Contents

  • Background
  • Related Works
  • Motivation
  • Method
  • Experiments
  • Conclusion

Background – Structure-from-Motion (SfM)

  • Recover a 3D point cloud (structure) and camera parameters (motion) from multiple unconstrained 2D images
    • Detect keypoints and compute feature descriptors (e.g. SIFT, ORB)
    • Find candidate image pairs via nearest-neighbor search over the descriptors
    • Verify candidate pairs with RANSAC, fitting two-view epipolar geometry (essential or fundamental matrix) or a homography
    • Triangulate 3D points
    • Refine cameras and points with Bundle Adjustment (BA)
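
To make the triangulation step concrete, here is a minimal sketch of linear (DLT) triangulation of one 3D point from two views. The function names `triangulate_dlt` and `project`, the intrinsics `K`, and the two camera matrices are hypothetical values chosen for illustration; they do not come from any particular SfM library.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]


def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single point seen in two views."""
    # Each observation (u, v) gives two linear constraints on the
    # homogeneous 3D point: u * P[2] - P[0] = 0 and v * P[2] - P[1] = 0.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]


# Hypothetical setup: shared intrinsics, second camera shifted along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, -0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free observations the estimate coincides with the synthetic point. With noisy matches, a linear estimate like this is usually only a starting point that bundle adjustment then refines.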

Related Works

  • COLMAP
    • Incremental SfM
  • Deep Learning approaches
    • Better keypoint detection and feature matching (SuperPoint, SuperGlue)

  • PixSfM
    • Builds on COLMAP output (tracks + structure).
    • Refines with feature-metric keypoint adjustment and feature-metric bundle adjustment.
    • Improves accuracy but still depends on COLMAP’s initialization.
  • Detector-free SfM
    • Uses detector-free matches instead of handcrafted or learned keypoints.
    • Builds a coarse SfM model using COLMAP
    • Reduces reliance on explicit detectors, but still tied to classical SfM frameworks

Motivation

  • Classical SfM
    • Modular & Incremental
    • Entirely non-differentiable
  • Fully-differentiable SfM pipelines
    • Use deep neural networks to regress camera poses and depths
    • Suffer from limited scalability

Method - VGGSfM

  • Point Tracker
  • Initial Camera Estimator
  • Triangulator
  • Bundle-Adjustment

Method - VGGSfM

Bundle-Adjustment

Goal: minimise the reprojection loss with a second-order Levenberg-Marquardt (LM) optimizer
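
As a rough illustration of what minimising a reprojection loss with LM involves, the sketch below refines a single 3D point against two fixed cameras. It is a toy, structure-only problem under stated assumptions, not the pipeline's implementation: full bundle adjustment also optimises the camera parameters, and the names `reprojection_residuals`, `numerical_jacobian`, and `levenberg_marquardt`, the damping schedule, and the camera setup are all hypothetical.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]


def reprojection_residuals(X, cameras, observations):
    """Stack the 2D reprojection errors of X over all cameras."""
    return np.concatenate(
        [project(P, X) - x for P, x in zip(cameras, observations)]
    )


def numerical_jacobian(f, X, h=1e-6):
    """Central-difference Jacobian of f at X (adequate for a toy problem)."""
    r0 = f(X)
    J = np.zeros((r0.size, X.size))
    for k in range(X.size):
        dX = np.zeros_like(X)
        dX[k] = h
        J[:, k] = (f(X + dX) - f(X - dX)) / (2.0 * h)
    return J


def levenberg_marquardt(f, X0, n_iters=100, lam=1e-3):
    """Minimise 0.5 * ||f(X)||^2 with a basic damped Gauss-Newton loop."""
    X = X0.astype(float)
    cost = 0.5 * np.sum(f(X) ** 2)
    for _ in range(n_iters):
        r = f(X)
        J = numerical_jacobian(f, X)
        # Damped normal equations: (J^T J + lam * I) step = -J^T r
        step = np.linalg.solve(J.T @ J + lam * np.eye(X.size), -J.T @ r)
        new_cost = 0.5 * np.sum(f(X + step) ** 2)
        if new_cost < cost:
            # Accept the step and trust the quadratic model more.
            X, cost, lam = X + step, new_cost, lam * 0.1
        else:
            # Reject the step and increase the damping.
            lam *= 10.0
    return X, cost


# Hypothetical setup: two cameras with shared intrinsics and a 1-unit baseline.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [
    K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
    K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])]),
]
X_true = np.array([0.5, -0.2, 4.0])
observations = [project(P, X_true) for P in cameras]

f = lambda X: reprojection_residuals(X, cameras, observations)
X_refined, final_cost = levenberg_marquardt(f, np.array([0.0, 0.0, 3.0]))
```

The damping term `lam` interpolates between Gauss-Newton (small `lam`) and gradient descent (large `lam`), which is why LM tolerates a rough initialisation better than plain Gauss-Newton.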

Method - VGGSfM

Training losses:

  • Point cloud: ϵ-thresholded pseudo-Huber loss between the ground-truth 3D points and the BA-refined 3D points
  • Cameras: compare both the predicted initial pose and the bundle-adjusted pose with the ground-truth camera annotation
  • Tracks: likelihood of each ground-truth track point under a probabilistic track-point estimate, a 2D Gaussian parameterised by the predicted mean and variance
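
Two of the loss terms above can be sketched in a few lines. This is a hedged illustration, not the paper's exact formulation: the standard pseudo-Huber form is used, the ϵ-thresholding is assumed here to be a simple clamp at `eps`, the Gaussian is assumed to have a diagonal covariance, and the function names and default values are hypothetical.

```python
import math


def pseudo_huber(residual_norm, delta=1.0):
    """Standard pseudo-Huber: quadratic near zero, linear for large residuals."""
    return delta ** 2 * (math.sqrt(1.0 + (residual_norm / delta) ** 2) - 1.0)


def thresholded_pseudo_huber(p_pred, p_gt, delta=1.0, eps=10.0):
    """Pseudo-Huber on the point distance, clamped at eps (assumed reading)."""
    dist = math.dist(p_pred, p_gt)
    return min(pseudo_huber(dist, delta), eps)


def gaussian_nll_2d(y_gt, mean, var):
    """Negative log-likelihood of a 2D point under an axis-aligned Gaussian.

    `mean` and `var` hold the predicted per-axis mean and variance.
    """
    nll = 0.0
    for y, m, v in zip(y_gt, mean, var):
        nll += 0.5 * (math.log(2.0 * math.pi * v) + (y - m) ** 2 / v)
    return nll


# A point-cloud outlier contributes at most eps to the loss.
outlier_loss = thresholded_pseudo_huber((0.0, 0.0, 0.0), (100.0, 0.0, 0.0))

# A confident (low-variance) but wrong track prediction is penalised more
# than an uncertain one with the same error.
confident = gaussian_nll_2d((3.0, 0.0), (0.0, 0.0), (0.5, 0.5))
uncertain = gaussian_nll_2d((3.0, 0.0), (0.0, 0.0), (4.0, 4.0))
```

Maximising the track-point likelihood is equivalent to minimising this negative log-likelihood, so the predicted variance acts as a learned per-point confidence.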

Experiments

  • CO3Dv2: large-scale, category-diverse collection of object-centric videos.
  • IMC Phototourism: large-scale, unstructured internet photo collections.
  • ETH3D: high-precision indoor/outdoor multi-view benchmark.

Conclusion

  • VGGSfM
    • Fully differentiable, end-to-end SfM pipeline
  • Cannot yet compete with established pipelines in all application domains
    • Currently cannot process thousands of images, as traditional SfM frameworks can