| Dataset | License | Samples / Size | Images / Resolution | Scenes / Env. | Primary Tasks | Annotations | Notes |
|---|---|---|---|---|---|---|---|
| CO3Dv2 | CC BY-NC 4.0 (non-commercial) (DUSt3R GitHub) | ~38k object-centric video sequences (CO3D GitHub); ~5.5 TB for the full dataset | ~6 million frames (~19k videos in v1; v2 roughly quadruples that) (CO3D GitHub, Papers With Code); images cropped to a uniform size per sequence | 50 common object categories; real "in-the-wild" captures with varied indoor/outdoor backgrounds | Category-specific 3D reconstruction; novel view synthesis | Per-frame camera poses and object segmentation masks; per-sequence 3D point clouds (from COLMAP-based reconstruction) | v2 release roughly quadruples the v1 frame count |
| ARKitScenes | CC BY-NC-SA 4.0 (Apple license, research use only) (DUSt3R GitHub) | 5,047 RGB-D scans of 1,661 unique indoor scenes (ARKitScenes GitHub); ~623 GB for the low-res 3DOD subset; the full raw dataset is larger | Multi-frame RGB-D per scan (low-res RGB + depth for every frame); a subset of 2,257 scans also has high-res RGB and depth. iPhone RGB up to ~1920×1440; ARKit depth is low resolution | Indoor environments (homes, offices) with furniture; each capture combines device RGB-D with stationary laser-scanner data | 3D object detection (furniture); RGB-D depth upsampling | 3D oriented bounding boxes for ~17 furniture categories; per-frame camera poses and a reconstructed surface mesh for each scene | Largest real indoor RGB-D dataset and the first captured with mobile LiDAR; pairs mobile RGB-D data with high-quality laser-scanned depth for many scenes |
| ScanNet++ | Custom terms (non-commercial research use) (DUSt3R GitHub) | 1,000+ high-fidelity 3D indoor scenes in the v2 release (ScanNet++ website); sub-millimeter LiDAR scans plus 33 MP DSLR imagery (TB-scale) | Each scene has a set of 33 MP DSLR RGB images plus a corresponding iPhone RGB-D sequence; sub-millimeter laser-scan resolution | Indoor spaces (rooms, buildings) with rich detail; captured with coupled laser scanner, DSLR, and iPhone rigs | 3D semantic understanding (long-tail semantic segmentation of scenes); high-quality novel-view-synthesis evaluation | Dense 3D reconstructions with per-vertex semantic labels over many object classes; calibrated multi-sensor data (poses and intrinsics for cameras and LiDAR) | Access requires applying for an account (approval needed); benchmark tasks include 3D object detection and segmentation on this data |
| BlendedMVS | CC BY 4.0 (DUSt3R GitHub) | 17k+ rendered images with depth (arXiv:1911.10127); ~156 GB high-res image set + 27.5 GB low-res set + 9.4 GB textured meshes (BlendedMVS GitHub) | High-res images at 2048×1536 (156 GB); low-res 768×576 versions also provided (27.5 GB); multi-view renderings of ~100 scenes (cities, buildings, sculptures, small objects) | Synthetic (photorealistic) scenes ranging from outdoor cityscapes to indoor objects; images are rendered from textured 3D meshes and blended with real-image backgrounds | Generalized multi-view stereo (MVS): training and testing depth prediction with ground truth across diverse scenes | Per-pixel depth map for each image (rendered from the mesh); known camera parameters and neighbor-view pairing information; full textured 3D mesh of each scene as ground-truth geometry | Built by reconstructing meshes from real images and re-rendering them; ambient lighting from the original photos is "blended" in to improve realism |
| Waymo Open | Custom (Waymo Dataset non-commercial license) (DUSt3R GitHub) | ~1,950 driving segments (20 s each at 10 Hz) ≈ 390k frames (Papers With Code); ~2 TB for the initial release (~1 GB per 20 s segment) | ~390k LiDAR sweeps and ~1.95 million camera images from 5 synchronized cameras; high-resolution RGB (~1920×1280 per camera) | Autonomous-driving scenes (urban and suburban streets) captured under varied weather and lighting conditions | 3D and 2D object detection and tracking for autonomous driving | 3D bounding boxes with tracking IDs for LiDAR data; 2D bounding boxes with tracking IDs for camera images | |
| Habitat-Sim | Various (HM3D, Replica, Gibson, etc., each under its own terms; generally research-only) (CroCo GitHub) | ~1.9 million synthetic image pairs, generated on the fly from metadata (no fixed download size; images are 256×256) | ~3.8 million images in total (two per pair) at 256×256 with a 60° field of view; all rendered with the Habitat simulator | ~2,000 virtual indoor scenes (3D scans or CAD models) from multiple sources, e.g. 1,097 ScanNet rooms and 800 HM3D houses; photorealistic indoor spaces simulated in Habitat | Pre-training for 3D vision: image pairs with known pose differences for relative pose estimation, depth prediction, and feature matching | Ground-truth camera extrinsics for each image (exact relative pose between the two views); ground-truth depth maps are available from the simulator | DUSt3R uses Habitat to synthesize effectively unlimited training pairs; images are small (256 px) but cover diverse indoor layouts from Replica, Matterport3D (HM3D), and others |
| MegaDepth | CC BY 4.0 for the dataset content (original web photos keep their own licenses) (MegaDepth project page) | 196 scenes (landmark locations worldwide); ~199 GB of images + depth maps plus ~667 GB of SfM models | ~130k crowd-sourced Flickr photos at varying (often multi-megapixel) original resolutions; depth maps are provided at each image's original resolution | Outdoor landmarks and some indoor ones: "in-the-wild" photo collections of famous buildings, monuments, etc. | Single-view (monocular) depth estimation from Internet photos; multi-view 3D reconstruction benchmarking via sparse SfM | Dense depth maps for ~100k images (ground truth from multi-view stereo); ~30k additional images with ordinal (relative) depth labels; SfM point clouds and camera poses for all 196 scenes | Generated by running COLMAP SfM/MVS on Internet photo collections |
| StaticThings3D | Custom (derived from FlyingThings3D; research-only) (Scene Flow Datasets page) | Training: thousands of synthetic multi-view sequences (no official count; derived from FlyingThings3D). Test: 600 sequences × 10 views (6,000 evaluation images) (Robust MVD benchmark) | ~960×540 px (approx., matching the FlyingThings3D "cleanpass" renders); synthetic RGB renderings with ground-truth depth; each sequence has ~10 views from different viewpoints | Fully synthetic random scenes of everyday 3D objects on random backgrounds (the original FlyingThings3D scenes made static); not real-world, intended to simulate diverse object arrangements | Multi-view depth estimation and stereo reconstruction (robust MVS training) | Ground-truth depth (pixel-accurate disparity) for every image and a known camera pose for each view; the original dataset also provides optical flow for moving objects, but here all objects are static | A re-rendered, static version of the FlyingThings3D scene-flow dataset: object motion is removed so each sequence shows a static scene from multiple angles; used for zero-shot robustness studies |
| WildRGB-D | TBD (not explicitly stated; likely free for research use) | ~20,000 RGB-D video sequences (360° object scans) covering ~8,500 distinct objects across 46 categories (WildRGB-D paper); likely many TB in total (millions of iPhone-captured frames) | Videos of varying length (dozens to hundreds of frames each), captured on iPhone; RGB at roughly 1080p with per-frame ARKit LiDAR depth; every video has object masks and full 360° coverage | Real-world, object-centric scenes with cluttered backgrounds; three capture setups: (i) a single object per video, (ii) multiple objects together, (iii) an object held by a static hand; all filmed in diverse "in-the-wild" environments | Novel view synthesis (from RGB-D or RGB input); camera pose estimation from RGB-D sequences; 6-DoF object pose estimation; object surface reconstruction | Object segmentation masks for every frame, metric-scale camera trajectories, and an aggregated 3D object point cloud fused from the RGB-D video | Large-scale RGB-D object dataset captured with iPhone LiDAR; enables learning 3D object reconstruction from real videos; covers common categories (e.g. furniture, toys, vehicles) with full 360° coverage |
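
Most rows above supervise 3D learning through the same trio of annotations: a per-pixel depth map, camera intrinsics, and a camera pose. As a rough illustration of how those pieces combine into the point clouds these datasets ultimately provide, here is a minimal NumPy sketch that back-projects a depth map into 3D and optionally moves it into world coordinates. The depth array, intrinsics matrix, and pose below are hypothetical placeholders (each dataset stores these in its own file format), not any dataset's actual loader.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world=None):
    """Back-project a metric depth map into a 3D point cloud.

    depth:        (H, W) array of metric depths (0 = invalid).
    K:            (3, 3) camera intrinsics matrix.
    cam_to_world: optional (4, 4) camera-to-world pose; if given,
                  points are returned in world coordinates.
    Returns an (N, 3) array of valid 3D points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                  # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                  # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts = pts[pts[:, 2] > 0]                         # drop invalid (zero) depths
    if cam_to_world is not None:
        # Apply rotation and translation of the camera-to-world transform.
        pts = pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts

# Hypothetical stand-ins for dataset-provided values:
depth = np.full((1440, 1920), 2.0)                   # placeholder depth map (2 m everywhere)
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 720.0],
              [0.0, 0.0, 1.0]])                      # placeholder intrinsics
pose = np.eye(4)                                     # placeholder camera-to-world pose
points = backproject_depth(depth, K, pose)
print(points.shape)                                  # (H*W, 3) valid 3D points
```

The same recipe applies whether the depth comes from ARKit LiDAR (ARKitScenes, WildRGB-D), a simulator (Habitat-Sim, StaticThings3D, BlendedMVS), or multi-view stereo (MegaDepth); only the file formats and the intrinsics/pose conventions differ.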