Visual Geometry Grounded Transformer (VGG Group, Meta) (CVPR 2025 Best Paper)
Siyu Li
Sep/18/25
Contents
Recap: 3D Reconstruction
Recap: Structure from Motion
Recap: Comparison of Reconstructions
Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
PixSfM | SIFT, CNN | KNN, Queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
DUSt3R | Shared ViT Encoder (2 images) | Cross-attention Decoder | Point map Head | Yes | | |
VGGSfM | Tracking points Prediction | Camera Transformer | Triangulator Transformer | Essential | | |
VGGT | DINO | Global/Frame Attention, Tracking Head | Camera Head | DPT Head | [Optional] post optimization | |
This table may not be exact. It lists which component of each method plays a role similar to the corresponding stage of the classical SfM pipeline, which I think helps in understanding the algorithms.
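To make the "KNN" matching cell of the classical pipeline concrete, here is a minimal sketch of nearest-neighbor descriptor matching. The ratio test, the threshold `0.8`, and the function name `knn_match` are illustrative assumptions, not details taken from the slides or from COLMAP's code.

```python
import math

def knn_match(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b.

    A match is kept only if the best distance is clearly smaller than the
    second-best one (ratio test, an assumption for this sketch).
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Sort candidates in desc_b by Euclidean distance to da.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs two neighbors
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches

# Toy 2-D "descriptors"; real SIFT descriptors are 128-D.
desc_a = [[0.0, 0.0], [10.0, 10.0]]
desc_b = [[10.0, 11.0], [0.0, 1.0], [50.0, 50.0]]
print(knn_match(desc_a, desc_b))
```

In the classical pipeline these putative matches would then go to the geometric verification stage (RANSAC) listed in the next column.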
Motivation
Motivation: Why feedforward?
Motivation: Why feedforward?
Motivation: VGGT
Contribution of VGGT
The Algorithm of VGGT
Overall Pipeline
[Figure: pipeline diagram. Input → DINO → Concat with randomly initialized camera token → Frame Attention / Global Attention → Camera Head (Cameras); DPT (Depth maps, Point maps, Tracks)]
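The "Concat" step with the randomly initialized camera token can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are plain Python lists, one camera token is shared by all frames, and the names `init_camera_token`, `concat_tokens`, and `split_outputs` are hypothetical, not from the VGGT code.

```python
import random

def init_camera_token(dim, seed=0):
    """Randomly initialized camera token (it would be learned during training)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def concat_tokens(frames_patch_tokens, camera_token):
    """Prepend the camera token to the patch tokens of every frame."""
    return [[list(camera_token)] + [list(p) for p in patches]
            for patches in frames_patch_tokens]

def split_outputs(frame_tokens):
    """After the attention layers: position 0 feeds the camera head,
    the remaining patch tokens feed the dense (DPT) head."""
    camera_out = [tokens[0] for tokens in frame_tokens]
    patch_out = [tokens[1:] for tokens in frame_tokens]
    return camera_out, patch_out

# Two frames, three patch tokens each, feature dimension 4 (toy sizes).
patches = [[[float(f + p)] * 4 for p in range(3)] for f in range(2)]
tokens = concat_tokens(patches, init_camera_token(dim=4))
camera_out, patch_out = split_outputs(tokens)
print(len(tokens[0]), len(camera_out), len(patch_out[0]))
```

Here the attention layers are skipped entirely, so `split_outputs` simply undoes `concat_tokens`. In the real model the tokens would pass through the Frame/Global Attention stack in between.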
Feature: DINOv2/v3
Feature: DINOv2/v3
Global/Frame Attention
Attention: 48 transformer layers in total, built from Frame Attention and Global Attention blocks. This depth is similar to that of DINOv2/v3, so it is fast!
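The difference between the two attention types can be sketched in a few lines. This is a minimal illustration, assuming single-head attention with identity Q/K/V projections and no residuals, MLPs, or normalization, and assuming every frame has the same number of tokens. The function names and the frame-then-global order per block are choices made for this sketch, not the VGGT implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return out

def frame_attention(frames):
    """Each frame attends only to its own tokens."""
    return [self_attention(f) for f in frames]

def global_attention(frames):
    """All tokens of all frames attend to each other, then are split back."""
    flat = [t for f in frames for t in f]
    out = self_attention(flat)
    n = len(frames[0])  # assumes equal token count per frame
    return [out[i * n:(i + 1) * n] for i in range(len(frames))]

def alternating_attention(frames, num_blocks=2):
    """Alternate frame-wise and global attention for num_blocks rounds."""
    for _ in range(num_blocks):
        frames = frame_attention(frames)
        frames = global_attention(frames)
    return frames

frames = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
out = alternating_attention(frames)
print(len(out), len(out[0]), len(out[0][0]))
```

The token layout never changes. Only the grouping over which attention is computed switches between "per frame" and "all frames", which is why the two block types can be stacked freely.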
Multi-task heads
Training Objectives
Recap: Comparison of Reconstructions
Algorithm | Feature | Matching | Geometry Verification | Image Registration | Triangulation | Bundle Adjustment |
COLMAP | SIFT | KNN | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
PixSfM | SIFT, CNN | KNN, Queried | RANSAC, bidirectional homography | RANSAC, P3P/PnP | RANSAC, triangulation | Essential |
DUSt3R | Shared ViT Encoder (2 images) | Cross-attention Decoder | Point map Head | Yes | | |
VGGSfM | Tracking points Prediction | Camera Transformer | Triangulator Transformer | Essential | | |
VGGT | DINOv2/v3 | Global/Frame Attention, Tracking Head | Camera Head | DPT Head | [Optional] post optimization | |
VGGT predicts everything at once from shared intermediate features, acting "like" a feed-forward solver for the optimization of the reprojection error.
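For reference, the reprojection error that classical SfM minimizes in bundle adjustment can be sketched as below. This assumes a pinhole camera with zero skew and no lens distortion. The function names `project` and `reprojection_error` are hypothetical, and the snippet only evaluates the error; it does not optimize it.

```python
def project(point, K, R, t):
    """Project a 3D world point into the image with a pinhole camera.

    K is the 3x3 intrinsic matrix (zero skew assumed), R the 3x3 rotation,
    t the translation, so that X_cam = R @ X_world + t.
    """
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    u = K[0][0] * xc[0] / xc[2] + K[0][2]
    v = K[1][1] * xc[1] / xc[2] + K[1][2]
    return (u, v)

def reprojection_error(points, observations, K, R, t):
    """Sum of squared pixel distances between projections and observations."""
    total = 0.0
    for X, (u_obs, v_obs) in zip(points, observations):
        u, v = project(X, K, R, t)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
print(project([1.0, 0.0, 2.0], K, R, t))
print(reprojection_error([[1.0, 0.0, 2.0]], [(103.0, 44.0)], K, R, t))
```

An optimization-based method iterates over cameras and points to drive this quantity down, whereas a feed-forward model outputs cameras, depth, and point maps in a single pass.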
Experiments
Alternating Attention Ablation
Multi-Task Ablation
VGGT is accurate
Known G.T. Cameras
Unknown Cameras
VGGT is fast
VGGT is low-cost
VGGT is zero-shot for 3D
Even though VGGT was not fine-tuned on some 3D tasks, it still works well on them!