1 of 26

Visual Geometry Grounded Transformer
VGG Group, Meta (CVPR 2025 Best Paper)

Siyu Li

Sep 18, 2025

2 of 26

Contents

  • Recap
  • Motivation
  • Algorithm
  • Experiments

3 of 26

Recap: 3D Reconstruction


4 of 26

Recap: Structure from Motion


  • The traditional SfM pipeline mainly includes:
    • Feature extraction: visual embedding of image patches
    • Matching: match points across images by similarity of their embeddings
      • Geometric verification (post-matching): filter out incorrect matches using geometric constraints
    • Image registration: estimate camera poses from the matched points
      • Triangulation (initialization for BA): compute 3D coordinates from the camera poses and matched points
    • Bundle adjustment: globally refine the point map and cameras by minimizing reprojection error, usually requiring iterations to filter outliers
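The triangulation step above can be sketched with the classic linear (DLT) method; a minimal, noise-free sketch in NumPy, with toy intrinsics and poses chosen purely for illustration:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two 3x4 projection matrices
    and matched pixel coordinates (DLT / linear least squares)."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector of A = homogeneous 3D point
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: identity pose and a 1-unit baseline along x.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_est, X_true, atol=1e-6))  # noise-free case recovers the point
```

With noisy matches the linear estimate is only an initialization, which is exactly why BA then refines it on reprojection error.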

5 of 26

Recap: Comparison of Reconstructions


| Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
|---|---|---|---|---|---|---|
| COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| PixSfM | SIFT, CNN | KNN, queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| Dust3R | Shared ViT encoder (2 images) | Cross-attention decoder | – | Point map head | – | Yes |
| VGGSfM | Tracking points prediction | – | – | Camera transformer | Triangulator transformer | Essential |
| VGGT | DINO | Global/frame attention, tracking head | – | Camera head | DPT head | [Optional] post-optimization |

(The table may not be exact; it lists which component of each method arguably plays a role similar to the corresponding stage of the SfM pipeline, which I find helpful for understanding these algorithms.)

6 of 26

Motivation


7 of 26

Motivation: Why feedforward?


8 of 26

Motivation: Why feedforward?


  • Hence, optimization-based SfM must repeatedly filter outliers, running iterations and RANSAC over and over, and errors may still accumulate.

  • Explicitly optimizing the reprojection error is nonlinear, sensitive, and slow, so people turn to feedforward models that amortize this optimization in expectation over the image distribution.
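To see why this optimization is nonlinear and iterative, here is a toy Gauss-Newton refinement of a single 3D point against fixed cameras; a hypothetical stand-in for the inner loop of bundle adjustment, not anyone's actual implementation:

```python
import numpy as np

def project(P, X):
    """Pinhole projection with 3x4 matrix P. The division by depth is the
    nonlinearity that makes reprojection error hard to optimize directly."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def refine_point(X0, cams, obs, iters=10):
    """Gauss-Newton refinement of one 3D point against fixed cameras."""
    X = X0.astype(float).copy()
    eps = 1e-6
    for _ in range(iters):
        r = np.concatenate([project(P, X) - z for P, z in zip(cams, obs)])
        # Numerical Jacobian of the stacked residuals w.r.t. X.
        J = np.zeros((len(r), 3))
        for j in range(3):
            dX = np.zeros(3); dX[j] = eps
            r2 = np.concatenate([project(P, X + dX) - z for P, z in zip(cams, obs)])
            J[:, j] = (r2 - r) / eps
        X -= np.linalg.lstsq(J, r, rcond=None)[0]  # Gauss-Newton step
    return X

# Three toy cameras translated along x, observing one point.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
cams = [K @ np.hstack([np.eye(3), np.array([[t], [0.], [0.]])]) for t in (0., -1., -2.)]
X_true = np.array([0.3, -0.1, 5.0])
obs = [project(P, X_true) for P in cams]
X_hat = refine_point(np.array([0., 0., 3.]), cams, obs)
```

Real BA repeats this jointly over all points and all cameras, interleaved with RANSAC-style outlier filtering; a feedforward network replaces the whole loop with one forward pass.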

9 of 26

Motivation: VGGT


  • Shared backbone to solve all 3D tasks.

  • Learning to predict interrelated 3D attributes enhances overall accuracy.

  • Deriving the point maps from the separately predicted depth maps and cameras gives better accuracy than the dedicated point-map head.
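Deriving a point map from depth and camera is just per-pixel unprojection; a minimal sketch, assuming a pinhole camera with world-to-camera extrinsics (R, t):

```python
import numpy as np

def depth_to_pointmap(depth, K, R, t):
    """Unproject a depth map to a world-frame point map.
    Convention (an assumption here): x_cam = R @ x_world + t."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T      # camera-frame rays with z = 1
    pts_cam = rays * depth[..., None]    # scale each ray by its depth
    return (pts_cam - t) @ R             # world frame: R^T (x_cam - t)

# Toy example: flat plane 2 units in front of an identity-pose camera.
K = np.array([[100., 0., 2.], [0., 100., 2.], [0., 0., 1.]])
depth = np.full((4, 4), 2.0)
pts = depth_to_pointmap(depth, K, np.eye(3), np.zeros(3))
print(pts.shape)  # (4, 4, 3); every point has z = 2
```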

10 of 26

Contribution of VGGT


  • A large feedforward transformer
    • Input: one or more images of a scene
    • Output: camera extrinsics, depth maps, 3D point maps, point tracks

  • Fast, high-quality predictions, which can improve further with BA

11 of 26

The Algorithm of VGGT

Visual Geometry Grounded Transformer


12 of 26

Overall Pipeline


[Pipeline figure: input images → DINO patch tokens, concatenated with a randomly initialized camera token → alternating Frame Attention / Global Attention layers → Camera Head (cameras) and DPT heads (depth maps, point maps, tracks)]

13 of 26

Feature: DINOv2/v3


  • Self-supervised training yields strong feature similarity at local, global, and semantic levels.

  • It effectively learns low-level vision priors such as neighbor similarity and texture similarity, and even 3D consistency.

14 of 26

Feature: DINOv2/v3


15 of 26

Global/Frame Attention


  • Global Attention
    • Ensures scene-level coherence

  • Frame-wise Attention
    • Eliminates frame index embedding
      • For permutation equivariance
      • For flexible input length

  • By default there are 24 layers each of global and frame attention (48 transformer layers in total, a depth similar to DINOv2/v3, so it's fast!)
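The alternating design and its permutation equivariance can be sketched in NumPy; a toy single-head attention with identity projections, not the actual VGGT block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy)."""
    scores = tokens @ tokens.swapaxes(-1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def alternating_block(x):
    """One frame-attention + global-attention pair, as in the alternating
    design. x has shape (S frames, N tokens per frame, d)."""
    S, N, d = x.shape
    x = self_attention(x)                       # frame-wise: within each frame
    x = self_attention(x.reshape(1, S * N, d))  # global: across all frames
    return x.reshape(S, N, d)

x = np.random.default_rng(0).normal(size=(3, 5, 8))  # 3 frames, 5 tokens each
y = alternating_block(x)

# With no frame-index embedding, permuting the input frames permutes the
# output the same way (permutation equivariance), for any number of frames.
perm = [2, 0, 1]
y_perm = alternating_block(x[perm])
print(np.allclose(y_perm, y[perm]))  # True
```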


16 of 26

Multi-task heads


17 of 26

Training Objectives


  • Camera loss

  • Depth loss

  • Point map loss

  • Track loss
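As a sketch of how these terms combine: the multi-task objective is a weighted sum, with the track term down-weighted by a factor λ (the exact weighting is my reading of the paper and worth double-checking):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{camera}}
  \;+\; \mathcal{L}_{\text{depth}}
  \;+\; \mathcal{L}_{\text{pmap}}
  \;+\; \lambda\,\mathcal{L}_{\text{track}}
```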

18 of 26

Recap: Comparison of Reconstructions


| Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
|---|---|---|---|---|---|---|
| COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| PixSfM | SIFT, CNN | KNN, queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
| Dust3R | Shared ViT encoder (2 images) | Cross-attention decoder | – | Point map head | – | Yes |
| VGGSfM | Tracking points prediction | – | – | Camera transformer | Triangulator transformer | Essential |
| VGGT | DINOv2/v3 | Global/frame attention, tracking head | – | Camera head | DPT head | [Optional] post-optimization |

VGGT predicts everything at once from intermediate features, acting much "like" a feedforward solver of the reprojection-error optimization.

19 of 26

Experiments


20 of 26

Alternative Attention Ablation

21 of 26

Multi-Task Ablation

22 of 26

VGGT is accurate

[Results: with known ground-truth cameras vs. with unknown cameras]

23 of 26

VGGT is fast

24 of 26

VGGT is low-cost

25 of 26

VGGT is zero-shot for 3D

Even though VGGT wasn't fine-tuned on some 3D tasks, it still performs well!

26 of 26

VGGT is zero-shot for 3D

Even though VGGT wasn't fine-tuned on some 3D tasks, it still performs well!