Language

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| Wikitext | https://huggingface.co/datasets/wikitext | "The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia." | ~100 million | It seems this dataset is mainly used as a benchmark, which is why we are moving to Open WebText. | |
| The Pile | https://huggingface.co/datasets/EleutherAI/pile | "The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together." | ~1 trillion (per the Hugging Face data card) | | |
| Open WebText | https://skylion007.github.io/OpenWebTextCorpus/ | "An open source effort to reproduce OpenAI's WebText dataset" | | | |
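The "N Tokens" figures above are approximate and depend on the tokenizer. As a minimal sketch of how such a count is produced, assuming simple whitespace tokenization (the `count_tokens` helper below is illustrative, not part of any of these datasets):

```python
# Illustrative sketch: estimating a corpus token count with whitespace
# tokenization. Real figures (e.g. WikiText's "over 100 million tokens")
# come from each dataset's own tokenization scheme.

def count_tokens(lines):
    """Total whitespace-delimited token count over an iterable of text lines."""
    return sum(len(line.split()) for line in lines)

sample = [
    "The WikiText language modeling dataset is a collection",
    "of verified Good and Featured articles on Wikipedia.",
]
print(count_tokens(sample))  # 16
```

Counts from a subword tokenizer (BPE, SentencePiece) will differ, typically running higher than whitespace counts.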
VQA

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| A-OKVQA | https://okvqa.allenai.org/download.html, https://allenai.org/project/a-okvqa/home | "82,783 MS COCO training images, 40,504 MS COCO validation images and 81,434 MS COCO testing images 443,757 questions for training, 214,354 questions for validation and 447,793 questions for testing 4,437,570 answers for training and 2,143,540 answers for validation (10 per question)" | | Note: the first link points to OK-VQA (the predecessor dataset); the second is the A-OKVQA project page. | |
Captions

| Dataset Name | Link | Description | N Tokens | Notes | Lead |
| --- | --- | --- | --- | --- | --- |
| Conceptual Captions | https://ai.google.com/research/ConceptualCaptions/download, https://github.com/google-research-datasets/conceptual-captions | "The Training split consists of 3,318,333 image-URL/caption pairs, with a total number of 51,201 total token types in the captions (i.e., total vocabulary). The average number of tokens per captions is 10.3 (standard deviation of 4.5), while the median is 9.0 tokens per caption. The Validation split consists of 15,840 image-URL/caption pairs, with similar statistics." | | | |
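Conceptual Captions is distributed as TSV files of caption/image-URL pairs. A minimal parsing sketch, assuming a two-column caption<TAB>URL layout (an assumption; verify against the files you actually download):

```python
import csv
import io

# Sketch: parsing a Conceptual Captions-style TSV, assuming each row is
# "caption<TAB>image URL" (an assumption -- check the downloaded files).

def read_pairs(tsv_text):
    """Yield (caption, url) pairs, skipping malformed rows."""
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) == 2:
            yield row[0].strip(), row[1].strip()

sample = "a dog runs in the park\thttp://example.com/1.jpg\n"
print(list(read_pairs(sample)))
# [('a dog runs in the park', 'http://example.com/1.jpg')]
```

The images themselves are not distributed; they must be fetched from the URLs, and some fraction of links will have rotted, so the usable pair count is lower than the published 3,318,333.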
RL/Control
Note: This is from the original datasheet. I think we are currently using MuJoCo and Atari only. The original sheet is this one: NEKO Dataset Analysis
Each environment entry below gives: tasks, episodes, approx tokens, sample weight, and agent used; the open-source repo; and additional information / similar available datasets.
**DM Lab** (254 tasks, 16.4M episodes, ~194B tokens, 9.35% sample weight; agent: IMPALA)
Repo: [DM Lab](https://github.com/deepmind/lab)
Notes: Appendix F.5 of the Gato paper mentions that they trained an IMPALA agent on a set of 18 parent DM Lab levels. "Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills." We don't have much information on the definition of those 18 "parent levels" and the 237 "handcrafted levels", but there are a lot of different levels here: https://github.com/deepmind/lab/tree/master/game_scripts/levels. See also this paper, which claims SOTA with an IMPALA agent on DMLab-30: https://arxiv.org/pdf/1809.04474v1.pdf
**ALE Atari** (51 tasks, 63.4K episodes, ~1.26B tokens, 9.50% sample weight; agent: Muesli, trained for 200M steps per environment)
Repo: [ALE Atari](https://github.com/mgbellemare/Arcade-Learning-Environment)
Similar available datasets: [RL Unplugged](https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged), which is sourced from [batch_rl](https://github.com/google-research/batch_rl) and generated from DQN replay (may need filtering; check the methodology in [CQL-scale-generalizes](https://openreview.net/forum?id=4-k7kUavAj) and [multi-game-dt](https://arxiv.org/abs/2205.15241)). There are also filtered variants: [d4rl-atari](https://github.com/takuseno/d4rl-atari)
**ALE Atari Extended** (28 tasks, 28.4K episodes, ~565M tokens, 10.00% sample weight; agent: Muesli, trained for 200M steps per environment)
Repo: [ALE Atari](https://github.com/mgbellemare/Arcade-Learning-Environment)
**Sokoban** (1 task, 27.2K episodes, ~298M tokens, 1.33% sample weight; agent: Muesli)
Repo: [Sokoban](https://github.com/mpSchrader/gym-sokoban)
Notes: They use a Muesli agent to collect training data.
**Baby AI** (46 tasks, 4.61M episodes, ~22.8B tokens, 9.06% sample weight; agent: built-in BabyAI bot, 100,000 episodes per level)
Repo: [Baby AI](https://github.com/mila-iqia/babyai)
Notes: The babyai repo is no longer maintained; development has moved to https://github.com/Farama-Foundation/Minigrid
**DM Control Suite** (30 tasks, 395K episodes, ~22.5B tokens, 4.62% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
Notes: In Appendix F.4 of the Gato paper, the authors mention that "for each task in the control suite, they collect two disjoint sets of data, one using only state features and another using only pixels". They use a D4PG agent to collect data from tasks with state features, and an MPO-based agent to collect data with pixels. They also collect data for randomized versions of the control suite tasks with a D4PG agent, randomizing the actuator gear, joint range, stiffness, damping, geom size, and density, sampled from a small interval and a large interval. There are some SOTA agents here: https://paperswithcode.com/dataset/deepmind-control-suite
Similar available datasets: [RL Unplugged](https://github.com/deepmind/deepmind-research/tree/master/rl_unplugged) provides some datasets; specifically, they say most DM Control data is generated with D4PG, or with V-MPO for manipulator insert ball/peg.
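The token counts for these control datasets come from Gato's discretization of continuous observations and actions. A rough sketch of that scheme as we read it from the paper (mu-law companding with mu=100 and M=256, clipping to [-1, 1], then 1024 uniform bins); treat the exact constants as assumptions, not a reference implementation:

```python
import math

# Sketch of Gato-style continuous-value tokenization: mu-law companding
# followed by uniform binning. Constants (mu=100, M=256, 1024 bins) follow
# our reading of the Gato paper and are assumptions.

MU, M, BINS = 100.0, 256.0, 1024

def mu_law(x):
    """Compand x so a wide dynamic range maps mostly into [-1, 1]."""
    return math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)

def tokenize(x):
    """Map a continuous value to a discrete token id in [0, BINS)."""
    y = max(-1.0, min(1.0, mu_law(x)))                 # compand, then clip
    return min(BINS - 1, int((y + 1.0) / 2.0 * BINS))  # uniform bins

print(tokenize(0.0))  # 512 (midpoint)
print(tokenize(1e6))  # 1023 (clipped into the top bin)
```

Under this scheme each scalar in an observation or action vector becomes one token, which is how episode counts translate into the "Approx Tokens" column.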
**DM Control Suite Pixels** (28 tasks, 485K episodes, ~35.5B tokens, 7.07% sample weight; agent: MPO)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**DM Control Suite Random Small** (26 tasks, 10.6M episodes, ~313B tokens, 3.04% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**DM Control Suite Random Large** (26 tasks, 26.1M episodes, ~791B tokens, 3.04% sample weight; agent: D4PG)
Repo: [DM Control](https://github.com/deepmind/dm_control)
**Meta-World** (45 tasks, 94.6K episodes, ~3.39B tokens, 8.96% sample weight; agent: MPO)
Repo: [Meta-World](https://github.com/Farama-Foundation/Metaworld)
Notes: Appendix F.9 of the Gato paper mentions that they collected data from all train and test tasks in the MT50 mode by training an MPO agent with unlimited environment seeds and access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.
**Procgen Benchmark** (16 tasks, 1.6M episodes, ~4.46B tokens, 5.34% sample weight; agent: R2D2)
Repo: [Procgen](https://github.com/openai/procgen)
Notes: Appendix F.6 of the Gato paper mentions that they trained an R2D2 agent on the 16 environments at the hard difficulty setting, except for maze and heist, which they set to easy. OpenRL Benchmark has some results here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reportlist
**RGB Stacking Simulator** (1 task, 387K episodes, ~24.4B tokens, 1.33% sample weight)
Repo: [RGB Stacking](https://github.com/deepmind/rgb_stacking)
Notes: The repo contains specialist agents.
**RGB Stacking real robot** (1 task, 15.7K episodes, ~980M tokens, 1.33% sample weight)
**Modular RL** (38 tasks, 843K episodes, ~69.6B tokens, 8.23% sample weight; agent: D4PG, 140M steps total with 30 random seeds)
Repo: [Modular RL](https://github.com/huangwl18/modular-rl)
Notes: Appendix F.7 of the Gato paper mentions that the authors trained a D4PG agent on each variant for a total of 140M actor steps with 30 random seeds per variant.
**DM Manipulation Playground** (4 tasks, 286K episodes, ~6.58B tokens, 1.68% sample weight)
Notes: The Gato paper mentions it contains 4 tasks on a simulated Kinova Jaco arm, but I can't find any specific repo or source for the "DM Manipulation Playground". Searching for "jaco" in the DM Control Suite repo yields multiple results, so maybe it is included in the DM Control Suite repo?
**Playroom** (1 task, 829K episodes, ~118B tokens, 1.33% sample weight)
Notes: The word "Playroom" appears only once in the Gato paper. I found a reference to a "Playroom" environment in a repo from Google Research: https://github.com/google-research/google-research/tree/master/playrooms
**Total** (596 tasks, 85.21% sample weight)
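The sample weights above can be used directly as mixture probabilities when interleaving these datasets during training. A small sanity-check sketch, with the weights copied from the table (they cover only the RL/control slice of the full Gato mixture, hence the ~85.21% total):

```python
import random

# Sample weights copied from the table above (percent of the full
# training mixture; RL/control datasets only).
weights = {
    "DM Lab": 9.35, "ALE Atari": 9.50, "ALE Atari Extended": 10.00,
    "Sokoban": 1.33, "Baby AI": 9.06, "DM Control Suite": 4.62,
    "DM Control Suite Pixels": 7.07, "DM Control Suite Random Small": 3.04,
    "DM Control Suite Random Large": 3.04, "Meta-World": 8.96,
    "Procgen Benchmark": 5.34, "RGB Stacking Simulator": 1.33,
    "RGB Stacking real robot": 1.33, "Modular RL": 8.23,
    "DM Manipulation Playground": 1.68, "Playroom": 1.33,
}

total = sum(weights.values())
print(round(total, 2))  # 85.21 -- matches the total row above

# Renormalize within RL/control and sample one source dataset per step.
names = list(weights)
rng = random.Random(0)
pick = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
print(pick)
```

`random.choices` normalizes the weights internally, so renormalizing to probabilities is optional; the remaining ~14.79% of the full mixture would go to the text, VQA, and caption datasets in the sections above.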