1 of 26

Visual Geometry Grounded Transformer
VGG Group, Meta (CVPR 2025 Best Paper)

Siyu Li

Sep 18, 2025

2 of 26

Contents

  • Recap
  • Motivation
  • Algorithm
  • Experiments

3 of 26

Recap: 3D Reconstruction


4 of 26

Recap: Structure from Motion


  • The traditional SfM pipeline mainly includes:
    • Feature extraction: visual embedding of image patches
    • Matching: match points across images by similarity of their embeddings
      • Geometric verification (post-matching): filter out incorrect matches using geometric constraints
    • Image registration: estimate camera poses from the matched points
      • Triangulation (initialization for BA): compute 3D coordinates from the camera poses and matched points
    • Bundle adjustment: globally refine the point map and cameras by minimizing reprojection error, usually requiring iterations to filter outliers
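The triangulation step above can be sketched with the classic linear (DLT) method; a minimal, noise-free sketch in NumPy, with toy intrinsics and poses chosen purely for illustration:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two 3x4 projection matrices
    and matched pixel coordinates (DLT / linear least squares)."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector of A = homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: identity pose and a 1-unit baseline along x.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_est, X_true, atol=1e-6))  # noise-free case recovers the point
```

With noisy matches the linear estimate is only an initialization, which is exactly why BA then refines it on reprojection error.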

5 of 26

Recap: Comparison of Reconstructions


| Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
|---|---|---|---|---|---|---|
| COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| PixSfM | SIFT, CNN | KNN, queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| Dust3R | Shared ViT encoder (2 images) | Cross-attention decoder | – | Point map head | – | Yes |
| VGGSfM | Tracking points prediction | – | – | Camera transformer | Triangulator transformer | Essential |
| VGGT | DINO | Global/frame attention, tracking head | – | Camera head | DPT head | [Optional] post-optimization |

(The table may not be exact; it lists which component of each method arguably plays a role similar to the corresponding stage of the SfM pipeline, which I find helpful for understanding these algorithms.)

6 of 26

Motivation


7 of 26

Motivation: Why feedforward?


8 of 26

Motivation: Why feedforward?


  • Hence, optimization-based SfM must repeatedly filter outliers, running iterations and RANSAC over and over, and errors may still accumulate.

  • Explicitly optimizing the reprojection error is nonlinear, sensitive, and slow, so people turn to feedforward models that amortize this optimization in expectation over the image distribution.
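To see why this optimization is nonlinear and iterative, here is a toy Gauss-Newton refinement of a single 3D point against fixed cameras; a hypothetical stand-in for the inner loop of bundle adjustment, not anyone's actual implementation:

```python
import numpy as np

def project(P, X):
    """Pinhole projection with 3x4 matrix P. The division by depth is the
    nonlinearity that makes reprojection error hard to optimize directly."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def refine_point(X0, cams, obs, iters=10):
    """Gauss-Newton refinement of one 3D point against fixed cameras."""
    X = X0.astype(float).copy()
    eps = 1e-6
    for _ in range(iters):
        r = np.concatenate([project(P, X) - z for P, z in zip(cams, obs)])
        # Numerical Jacobian of the stacked residuals w.r.t. X.
        J = np.zeros((len(r), 3))
        for j in range(3):
            dX = np.zeros(3); dX[j] = eps
            r2 = np.concatenate([project(P, X + dX) - z for P, z in zip(cams, obs)])
            J[:, j] = (r2 - r) / eps
        X -= np.linalg.lstsq(J, r, rcond=None)[0]  # Gauss-Newton step
    return X

# Three toy cameras translated along x, observing one point.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
cams = [K @ np.hstack([np.eye(3), np.array([[t], [0.], [0.]])]) for t in (0., -1., -2.)]
X_true = np.array([0.3, -0.1, 5.0])
obs = [project(P, X_true) for P in cams]
X_hat = refine_point(np.array([0., 0., 3.]), cams, obs)
```

Real BA repeats this jointly over all points and all cameras, interleaved with RANSAC-style outlier filtering; a feedforward network replaces the whole loop with one forward pass.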

9 of 26

Motivation: VGGT


  • Shared backbone to solve all 3D tasks.

  • Learning to predict interrelated 3D attributes enhances overall accuracy.

  • Deriving the point maps from the separately predicted depth maps and cameras gives better accuracy than the dedicated point-map head.
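Deriving a point map from depth and camera is just per-pixel unprojection; a minimal sketch, assuming a pinhole camera with world-to-camera extrinsics (R, t):

```python
import numpy as np

def depth_to_pointmap(depth, K, R, t):
    """Unproject a depth map to a world-frame point map.
    Convention (an assumption here): x_cam = R @ x_world + t."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T      # camera-frame rays with z = 1
    pts_cam = rays * depth[..., None]    # scale each ray by its depth
    return (pts_cam - t) @ R             # world frame: R^T (x_cam - t)

# Toy example: flat plane 2 units in front of an identity-pose camera.
K = np.array([[100., 0., 2.], [0., 100., 2.], [0., 0., 1.]])
depth = np.full((4, 4), 2.0)
pts = depth_to_pointmap(depth, K, np.eye(3), np.zeros(3))
print(pts.shape)  # (4, 4, 3); every point has z = 2
```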

10 of 26

Contribution of VGGT


  • A large feedforward transformer
    • Input: one or more images of a scene
    • Output: camera extrinsics, depth maps, 3D point maps, point tracks

  • Fast, high-quality predictions, which can improve further with BA

11 of 26

The Algorithm of VGGT

Visual Geometry Grounded Transformer


12 of 26

Overall Pipeline


[Pipeline figure: input images → DINO patch tokens, concatenated with a randomly initialized camera token → alternating Frame Attention / Global Attention layers → Camera Head (cameras) and DPT heads (depth maps, point maps, tracks)]

13 of 26

Feature: DINOv2/v3


  • Self-supervised training yields strong feature similarity at local, global, and semantic levels.

  • It effectively learns low-level vision priors such as neighbor similarity and texture similarity, and even 3D consistency.

14 of 26

Feature: DINOv2/v3


15 of 26

Global/Frame Attention


  • Global Attention
    • Ensures scene-level coherence

  • Frame-wise Attention
    • Eliminates frame index embedding
      • For permutation equivariance
      • For flexible input length

  • By default there are 24 layers each of global and frame attention (48 transformer layers in total, a depth similar to DINOv2/v3, so it's fast!)
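The alternating design and its permutation equivariance can be sketched in NumPy; a toy single-head attention with identity projections, not the actual VGGT block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy)."""
    scores = tokens @ tokens.swapaxes(-1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def alternating_block(x):
    """One frame-attention + global-attention pair, as in the alternating
    design. x has shape (S frames, N tokens per frame, d)."""
    S, N, d = x.shape
    x = self_attention(x)                       # frame-wise: within each frame
    x = self_attention(x.reshape(1, S * N, d))  # global: across all frames
    return x.reshape(S, N, d)

x = np.random.default_rng(0).normal(size=(3, 5, 8))  # 3 frames, 5 tokens each
y = alternating_block(x)

# With no frame-index embedding, permuting the input frames permutes the
# output the same way (permutation equivariance), for any number of frames.
perm = [2, 0, 1]
y_perm = alternating_block(x[perm])
print(np.allclose(y_perm, y[perm]))  # True
```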


16 of 26

Multi-task heads


17 of 26

Training Objectives


  • Camera loss

  • Depth loss

  • Point map loss

  • Track loss
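As a sketch of how these terms combine: the multi-task objective is a weighted sum, with the track term down-weighted by a factor λ (the exact weighting is my reading of the paper and worth double-checking):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{camera}}
  \;+\; \mathcal{L}_{\text{depth}}
  \;+\; \mathcal{L}_{\text{pmap}}
  \;+\; \lambda\,\mathcal{L}_{\text{track}}
```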

18 of 26

Recap: Comparison of Reconstructions


| Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
|---|---|---|---|---|---|---|
| COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| PixSfM | SIFT, CNN | KNN, queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| Dust3R | Shared ViT encoder (2 images) | Cross-attention decoder | – | Point map head | – | Yes |
| VGGSfM | Tracking points prediction | – | – | Camera transformer | Triangulator transformer | Essential |
| VGGT | DINOv2/v3 | Global/frame attention, tracking head | – | Camera head | DPT head | [Optional] post-optimization |

VGGT predicts everything at once from intermediate features, acting much "like" a feedforward solver of the reprojection-error optimization.

19 of 26

Experiments


20 of 26

Alternative Attention Ablation

21 of 26

Multi-Task Ablation

22 of 26

VGGT is accurate

[Results: with known ground-truth cameras vs. with unknown cameras]

23 of 26

VGGT is fast

24 of 26

VGGT is low-cost

25 of 26

VGGT is zero-shot for 3D

Even though VGGT wasn't fine-tuned on some 3D tasks, it still performs well!

26 of 26

VGGT is zero-shot for 3D

Even though VGGT wasn't fine-tuned on some 3D tasks, it still performs well!