| Dataset | License | Samples / Size | Images / Resolution | Scenes / Env. | Primary Tasks | Annotations | Notes |
|---|---|---|---|---|---|---|---|
| CO3Dv2 | CC BY-NC 4.0 (non-commercial) (DUSt3R GitHub) | ~38k object-centric video sequences (CO3D GitHub); ~5.5 TB for the full dataset | ~6 million frames (~19k videos in v1; v2 roughly quadruples that) (CO3D GitHub, Papers With Code); images cropped to a uniform size per sequence | 50 common object categories; real "in-the-wild" captures with varied indoor/outdoor backgrounds | Category-specific 3D reconstruction; novel view synthesis | Per-frame camera poses and object segmentation masks; per-sequence 3D point clouds (from COLMAP-based reconstruction) | v2 release roughly quadruples the v1 frame count |
| ARKitScenes | CC BY-NC-SA 4.0 (Apple license, research use only) (DUSt3R GitHub) | 5,047 RGB-D scans of 1,661 unique indoor scenes (ARKitScenes GitHub); ~623 GB for the low-res 3DOD subset; the full raw dataset is larger | Multi-frame RGB-D per scan (low-res RGB + depth for every frame); a subset of 2,257 scans also has high-res RGB and depth. iPhone RGB up to ~1920×1440; ARKit depth is low resolution | Indoor environments (homes, offices) with furniture; each capture combines device RGB-D with stationary laser-scanner data | 3D object detection (furniture); RGB-D depth upsampling | 3D oriented bounding boxes for ~17 furniture categories; per-frame camera poses and a reconstructed surface mesh for each scene | Largest real indoor RGB-D dataset and the first captured with mobile LiDAR; pairs mobile RGB-D data with high-quality laser-scanned depth for many scenes |
| ScanNet++ | Custom terms (non-commercial research use) (DUSt3R GitHub) | 1,000+ high-fidelity 3D indoor scenes in the v2 release (ScanNet++ website); sub-millimeter LiDAR scans plus 33 MP DSLR imagery (TB-scale) | Each scene has a set of 33 MP DSLR RGB images plus a corresponding iPhone RGB-D sequence; sub-millimeter laser-scan resolution | Indoor spaces (rooms, buildings) with rich detail; captured with coupled laser scanner, DSLR, and iPhone rigs | 3D semantic understanding (long-tail semantic segmentation of scenes); high-quality novel-view-synthesis evaluation | Dense 3D reconstructions with per-vertex semantic labels over many object classes; calibrated multi-sensor data (poses and intrinsics for cameras and LiDAR) | Access requires applying for an account (approval needed); benchmark tasks include 3D object detection and segmentation on this data |
| BlendedMVS | CC BY 4.0 (DUSt3R GitHub) | 17k+ rendered images with depth (arXiv:1911.10127); ~156 GB high-res image set + 27.5 GB low-res set + 9.4 GB textured meshes (BlendedMVS GitHub) | High-res images at 2048×1536 (156 GB); low-res 768×576 versions also provided (27.5 GB); multi-view renderings of ~100 scenes (cities, buildings, sculptures, small objects) | Synthetic (photorealistic) scenes ranging from outdoor cityscapes to indoor objects; images are rendered from textured 3D meshes and blended with real-image backgrounds | Generalized multi-view stereo (MVS): training and testing depth prediction with ground truth across diverse scenes | Per-pixel depth map for each image (rendered from the mesh); known camera parameters and neighbor-view pairing information; full textured 3D mesh of each scene as ground-truth geometry | Built by reconstructing meshes from real images and re-rendering them; ambient lighting from the original photos is "blended" in to improve realism |
| Waymo Open | Custom (Waymo Dataset non-commercial license) (DUSt3R GitHub) | ~1,950 driving segments (20 s each at 10 Hz) ≈ 390k frames (Papers With Code); ~2 TB for the initial release (~1 GB per 20 s segment) | ~390k LiDAR sweeps and ~1.95 million camera images from 5 synchronized cameras; high-resolution RGB (~1920×1280 per camera) | Autonomous-driving scenes (urban and suburban streets) captured under varied weather and lighting conditions | 3D and 2D object detection and tracking for autonomous driving | 3D bounding boxes with tracking IDs for LiDAR data; 2D bounding boxes with tracking IDs for camera images | |
| Habitat-Sim | Various (HM3D, Replica, Gibson, etc., each under its own terms; generally research-only) (CroCo GitHub) | ~1.9 million synthetic image pairs, generated on the fly from metadata (no fixed download size; images are 256×256) | ~3.8 million images in total (two per pair) at 256×256 with a 60° field of view; all rendered with the Habitat simulator | ~2,000 virtual indoor scenes (3D scans or CAD models) from multiple sources, e.g. 1,097 ScanNet rooms and 800 HM3D houses; photorealistic indoor spaces simulated in Habitat | Pre-training for 3D vision: image pairs with known pose differences for relative pose estimation, depth prediction, and feature matching | Ground-truth camera extrinsics for each image (exact relative pose between the two views); ground-truth depth maps are available from the simulator | DUSt3R uses Habitat to synthesize effectively unlimited training pairs; images are small (256 px) but cover diverse indoor layouts from Replica, Matterport3D (HM3D), and others |
| MegaDepth | CC BY 4.0 for the dataset content (original web photos keep their own licenses) (MegaDepth project page) | 196 scenes (landmark locations worldwide); ~199 GB of images + depth maps plus ~667 GB of SfM models | ~130k crowd-sourced Flickr photos at varying (often multi-megapixel) original resolutions; depth maps are provided at each image's original resolution | Outdoor landmarks and some indoor ones: "in-the-wild" photo collections of famous buildings, monuments, etc. | Single-view (monocular) depth estimation from Internet photos; multi-view 3D reconstruction benchmarking via sparse SfM | Dense depth maps for ~100k images (ground truth from multi-view stereo); ~30k additional images with ordinal (relative) depth labels; SfM point clouds and camera poses for all 196 scenes | Generated by running COLMAP SfM/MVS on Internet photo collections |
| StaticThings3D | Custom (derived from FlyingThings3D; research-only) (Scene Flow Datasets page) | Training: thousands of synthetic multi-view sequences (no official count; derived from FlyingThings3D). Test: 600 sequences × 10 views (6,000 evaluation images) (Robust MVD benchmark) | ~960×540 px (approx., matching the FlyingThings3D "cleanpass" renders); synthetic RGB renderings with ground-truth depth; each sequence has ~10 views from different viewpoints | Fully synthetic random scenes of everyday 3D objects on random backgrounds (the original FlyingThings3D scenes made static); not real-world, intended to simulate diverse object arrangements | Multi-view depth estimation and stereo reconstruction (robust MVS training) | Ground-truth depth (pixel-accurate disparity) for every image and a known camera pose for each view; the original dataset also provides optical flow for moving objects, but here all objects are static | A re-rendered, static version of the FlyingThings3D scene-flow dataset: object motion is removed so each sequence shows a static scene from multiple angles; used for zero-shot robustness studies |
| WildRGB-D | TBD (not explicitly stated; likely free for research use) | ~20,000 RGB-D video sequences (360° object scans) covering ~8,500 distinct objects across 46 categories (WildRGB-D paper); likely many TB in total (millions of iPhone-captured frames) | Videos of varying length (dozens to hundreds of frames each), captured on iPhone; RGB at roughly 1080p with per-frame ARKit LiDAR depth; every video has object masks and full 360° coverage | Real-world, object-centric scenes with cluttered backgrounds; three capture setups: (i) a single object per video, (ii) multiple objects together, (iii) an object held by a static hand; all filmed in diverse "in-the-wild" environments | Novel view synthesis (from RGB-D or RGB input); camera pose estimation from RGB-D sequences; 6-DoF object pose estimation; object surface reconstruction | Object segmentation masks for every frame, metric-scale camera trajectories, and an aggregated 3D object point cloud fused from the RGB-D video | Large-scale RGB-D object dataset captured with iPhone LiDAR; enables learning 3D object reconstruction from real videos; covers common categories (e.g. furniture, toys, vehicles) with full 360° coverage |
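
Most rows above supervise 3D learning through the same trio of annotations: a per-pixel depth map, camera intrinsics, and a camera pose. As a rough illustration of how those pieces combine into the point clouds these datasets ultimately provide, here is a minimal NumPy sketch that back-projects a depth map into 3D and optionally moves it into world coordinates. The depth array, intrinsics matrix, and pose below are hypothetical placeholders (each dataset stores these in its own file format), not any dataset's actual loader.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world=None):
    """Back-project a metric depth map into a 3D point cloud.

    depth:        (H, W) array of metric depths (0 = invalid).
    K:            (3, 3) camera intrinsics matrix.
    cam_to_world: optional (4, 4) camera-to-world pose; if given,
                  points are returned in world coordinates.
    Returns an (N, 3) array of valid 3D points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                  # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                  # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts = pts[pts[:, 2] > 0]                         # drop invalid (zero) depths
    if cam_to_world is not None:
        # Apply rotation and translation of the camera-to-world transform.
        pts = pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts

# Hypothetical stand-ins for dataset-provided values:
depth = np.full((1440, 1920), 2.0)                   # placeholder depth map (2 m everywhere)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 720.0],
              [0.0, 0.0, 1.0]])                      # placeholder intrinsics
pose = np.eye(4)                                     # placeholder camera-to-world pose
points = backproject_depth(depth, K, pose)
print(points.shape)                                  # (H*W, 3) valid 3D points
```

The same recipe applies whether the depth comes from ARKit LiDAR (ARKitScenes, WildRGB-D), a simulator (Habitat-Sim, StaticThings3D, BlendedMVS), or multi-view stereo (MegaDepth); only the file formats and the intrinsics/pose conventions differ.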