Language

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| Wikitext | https://huggingface.co/datasets/wikitext | "The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia." | ~100 million | It seems this dataset is mainly used as a benchmark, which is why we are moving to Open WebText. | |
| The Pile | https://huggingface.co/datasets/EleutherAI/pile | "The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together." | ~1 trillion (per the Hugging Face data card) | | |
| Open WebText | https://skylion007.github.io/OpenWebTextCorpus/ | "An open source effort to reproduce OpenAI's WebText dataset" | | | |
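The "N Tokens" figures above are approximate and depend on the tokenizer. As a minimal sketch of how such a count is produced, assuming simple whitespace tokenization (the `count_tokens` helper below is illustrative, not part of any of these datasets):

```python
# Illustrative sketch: estimating a corpus token count with whitespace
# tokenization. Real figures (e.g. WikiText's "over 100 million tokens")
# come from each dataset's own tokenization scheme.

def count_tokens(lines):
    """Total whitespace-delimited token count over an iterable of text lines."""
    return sum(len(line.split()) for line in lines)

sample = [
    "The WikiText language modeling dataset is a collection",
    "of verified Good and Featured articles on Wikipedia.",
]
print(count_tokens(sample))  # 16
```

Counts from a subword tokenizer (BPE, SentencePiece) will differ, typically running higher than whitespace counts.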
VQA

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| A-OKVQA | https://okvqa.allenai.org/download.html, https://allenai.org/project/a-okvqa/home | "82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing 4,437,570 answers for training and 2,143,540 answers for validation (10 per question)" | | Note: the first link points to OK-VQA (the predecessor dataset); the second is the A-OKVQA project page. | |
Captions

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| Conceptual Captions | https://ai.google.com/research/ConceptualCaptions/download, https://github.com/google-research-datasets/conceptual-captions | "The Training split consists of 3,318,333 image-URL/caption pairs, with a total number of 51,201 total token types in the captions (i.e., total vocabulary). The average number of tokens per captions is 10.3 (standard deviation of 4.5), while the median is 9.0 tokens per caption. The Validation split consists of 15,840 image-URL/caption pairs, with similar statistics." | | | |
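Conceptual Captions is distributed as TSV files of caption/image-URL pairs. A minimal parsing sketch, assuming a two-column caption<TAB>URL layout (an assumption; verify against the files you actually download):

```python
import csv
import io

# Sketch: parsing a Conceptual Captions-style TSV, assuming each row is
# "caption<TAB>image URL" (an assumption -- check the downloaded files).

def read_pairs(tsv_text):
    """Yield (caption, url) pairs, skipping malformed rows."""
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) == 2:
            yield row[0].strip(), row[1].strip()

sample = "a dog runs in the park\thttp://example.com/1.jpg\n"
print(list(read_pairs(sample)))
# [('a dog runs in the park', 'http://example.com/1.jpg')]
```

The images themselves are not distributed; they must be fetched from the URLs, and some fraction of links will have rotted, so the usable pair count is lower than the published 3,318,333.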
RL/Control
Note: This is from the original datasheet. I think we are currently using MuJoCo and Atari only. The original sheet is this one: NEKO Dataset Analysis
Each environment entry below gives: tasks, episodes, approx tokens, sample weight, and agent used; the open-source repo; and additional information / similar available datasets.
**DM Lab** (254 tasks, 16.4M episodes, ~194B tokens, 9.35% sample weight; agent: IMPALA)
Repo: [DM Lab](https://github.com/deepmind/lab)
Notes: Appendix F.5 of the Gato paper mentions that they trained an IMPALA agent on a set of 18 parent DM Lab levels. "Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills." We don't have much information on the definition of those 18 "parent levels" and the 237 "handcrafted levels", but there are a lot of different levels here: https://github.com/deepmind/lab/tree/master/game_scripts/levels. See also this paper, which claims SOTA with an IMPALA agent on DMLab-30: https://arxiv.org/pdf/1809.04474v1.pdf
**ALE Atari** (51 tasks, 63.4K episodes, ~1.26B tokens, 9.50% sample weight; agent: Muesli, trained for 200M steps per environment)
Repo: [ALE Atari](https://github.com/mgbellemare/Arcade-Learning-Environment)
Similar available datasets: [RL Unplugged](https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged), which is sourced from [batch_rl](https://github.com/google-research/batch_rl) and generated from DQN replay (may need filtering; check the methodology in [CQL-scale-generalizes](https://openreview.net/forum?id=4-k7kUavAj) and [multi-game-dt](https://arxiv.org/abs/2205.15241)). There are also filtered variants: [d4rl-atari](https://github.com/takuseno/d4rl-atari)
**ALE Atari Extended** (28 tasks, 28.4K episodes, ~565M tokens, 10.00% sample weight; agent: Muesli, trained for 200M steps per environment)
Repo: [ALE Atari](https://github.com/mgbellemare/Arcade-Learning-Environment)
**Sokoban** (1 task, 27.2K episodes, ~298M tokens, 1.33% sample weight; agent: Muesli)
Repo: [Sokoban](https://github.com/mpSchrader/gym-sokoban)
Notes: They use a Muesli agent to collect training data.
**Baby AI** (46 tasks, 4.61M episodes, ~22.8B tokens, 9.06% sample weight; agent: built-in BabyAI bot, 100,000 episodes per level)
Repo: [Baby AI](https://github.com/mila-iqia/babyai)
Notes: The babyai repo is no longer maintained; development has moved to https://github.com/Farama-Foundation/Minigrid
**DM Control Suite** (30 tasks, 395K episodes, ~22.5B tokens, 4.62% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
Notes: In Appendix F.4 of the Gato paper, the authors mention that "for each task in the control suite, they collect two disjoint sets of data, one using only state features and another using only pixels". They use a D4PG agent to collect data from tasks with state features, and an MPO-based agent to collect data with pixels. They also collect data for randomized versions of the control suite tasks with a D4PG agent, randomizing the actuator gear, joint range, stiffness, damping, geom size, and density, sampled from a small interval and a large interval. There are some SOTA agents here: https://paperswithcode.com/dataset/deepmind-control-suite
Similar available datasets: [RL Unplugged](https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged) provides some datasets; specifically, they say most DM Control data is generated with D4PG, or with V-MPO for manipulator insert ball/peg.
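The token counts for these control datasets come from Gato's discretization of continuous observations and actions. A rough sketch of that scheme as we read it from the paper (mu-law companding with mu=100 and M=256, clipping to [-1, 1], then 1024 uniform bins); treat the exact constants as assumptions, not a reference implementation:

```python
import math

# Sketch of Gato-style continuous-value tokenization: mu-law companding
# followed by uniform binning. Constants (mu=100, M=256, 1024 bins) follow
# our reading of the Gato paper and are assumptions.

MU, M, BINS = 100.0, 256.0, 1024

def mu_law(x):
    """Compand x so a wide dynamic range maps mostly into [-1, 1]."""
    return math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)

def tokenize(x):
    """Map a continuous value to a discrete token id in [0, BINS)."""
    y = max(-1.0, min(1.0, mu_law(x)))                 # compand, then clip
    return min(BINS - 1, int((y + 1.0) / 2.0 * BINS))  # uniform bins

print(tokenize(0.0))  # 512 (midpoint)
print(tokenize(1e6))  # 1023 (clipped into the top bin)
```

Under this scheme each scalar in an observation or action vector becomes one token, which is how episode counts translate into the "Approx Tokens" column.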
**DM Control Suite Pixels** (28 tasks, 485K episodes, ~35.5B tokens, 7.07% sample weight; agent: MPO)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**DM Control Suite Random Small** (26 tasks, 10.6M episodes, ~313B tokens, 3.04% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**DM Control Suite Random Large** (26 tasks, 26.1M episodes, ~791B tokens, 3.04% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**Meta-World** (45 tasks, 94.6K episodes, ~3.39B tokens, 8.96% sample weight; agent: MPO)
Repo: [Meta-World](https://github.com/Farama-Foundation/Metaworld)
Notes: Appendix F.9 of the Gato paper mentions that they collected data from all train and test tasks in the MT50 mode by training an MPO agent with unlimited environment seeds and access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.
**Procgen Benchmark** (16 tasks, 1.6M episodes, ~4.46B tokens, 5.34% sample weight; agent: R2D2)
Repo: [Procgen](https://github.com/openai/procgen)
Notes: Appendix F.6 of the Gato paper mentions that they trained an R2D2 agent on the 16 environments at the hard difficulty setting, except for maze and heist, which they set to easy. OpenRL Benchmark has some results here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reportlist
**RGB Stacking Simulator** (1 task, 387K episodes, ~24.4B tokens, 1.33% sample weight)
Repo: [RGB Stacking](https://github.com/deepmind/rgb_stacking)
Notes: The repo contains specialist agents.
**RGB Stacking real robot** (1 task, 15.7K episodes, ~980M tokens, 1.33% sample weight)
**Modular RL** (38 tasks, 843K episodes, ~69.6B tokens, 8.23% sample weight; agent: D4PG, 140M steps total with 30 random seeds)
Repo: [Modular RL](https://github.com/huangwl18/modular-rl)
Notes: Appendix F.7 of the Gato paper mentions that the authors trained a D4PG agent on each variant for a total of 140M actor steps with 30 random seeds per variant.
**DM Manipulation Playground** (4 tasks, 286K episodes, ~6.58B tokens, 1.68% sample weight)
Notes: The Gato paper mentions it contains 4 tasks on a simulated Kinova Jaco arm, but I can't find any specific repo or source for the "DM Manipulation Playground". Searching for "jaco" in the DM Control Suite repo yields multiple results, so maybe it is included in the DM Control Suite repo?
**Playroom** (1 task, 829K episodes, ~118B tokens, 1.33% sample weight)
Notes: The word "Playroom" appears only once in the Gato paper. I found a reference to a "Playroom" environment in a repo from Google Research: https://github.com/google-research/google-research/tree/master/playrooms
**Total** (596 tasks, 85.21% sample weight)
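The sample weights above can be used directly as mixture probabilities when interleaving these datasets during training. A small sanity-check sketch, with the weights copied from the table (they cover only the RL/control slice of the full Gato mixture, hence the ~85.21% total):

```python
import random

# Sample weights copied from the table above (percent of the full
# training mixture; RL/control datasets only).
weights = {
    "DM Lab": 9.35, "ALE Atari": 9.50, "ALE Atari Extended": 10.00,
    "Sokoban": 1.33, "Baby AI": 9.06, "DM Control Suite": 4.62,
    "DM Control Suite Pixels": 7.07, "DM Control Suite Random Small": 3.04,
    "DM Control Suite Random Large": 3.04, "Meta-World": 8.96,
    "Procgen Benchmark": 5.34, "RGB Stacking Simulator": 1.33,
    "RGB Stacking real robot": 1.33, "Modular RL": 8.23,
    "DM Manipulation Playground": 1.68, "Playroom": 1.33,
}

total = sum(weights.values())
print(round(total, 2))  # 85.21 -- matches the total row above

# Renormalize within RL/control and sample one source dataset per step.
names = list(weights)
rng = random.Random(0)
pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
print(pick)
```

`random.choices` normalizes the weights internally, so renormalizing to probabilities is optional; the remaining ~14.79% of the full mixture would go to the text, VQA, and caption datasets in the sections above.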