(53) Permalink: https://lifearchitect.ai/datasets-table/
Paper: What's in my AI? https://lifearchitect.ai/whats-in-my-ai/
The Memo: https://lifearchitect.ai/memo

Columns: Dataset | Lab | Tokens: total tokens (T), sorted descending ▼ | Total: total size (GB, uncompressed; roughly 4 bytes per token) | Web (CC/C4): web data (GB, uncompressed) | Other: other data (GB, uncompressed) | ALQual: rates quality of data | Announced | Public? (🟢 = yes, 🟡 = partial, 🔴 = no) | Model example | Paper | Project page | Notes

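The "roughly 4 bytes per token" note on the Total column is the usual rule of thumb for uncompressed text; it is a heuristic, not a measured property of any particular dataset. A minimal sketch of the conversion (the constant comes from the column header's assumption; real ratios vary by tokenizer, language, and data type):

```python
BYTES_PER_TOKEN = 4  # heuristic from the column header, not a measured value

def tokens_t_to_gb(tokens_trillions: float) -> float:
    """Approximate uncompressed size in GB for a token count given in trillions."""
    return tokens_trillions * 1e12 * BYTES_PER_TOKEN / 1e9  # == tokens_trillions * 4,000

def gb_to_tokens_t(gigabytes: float) -> float:
    """Approximate token count (trillions) for an uncompressed size in GB."""
    return gigabytes * 1e9 / BYTES_PER_TOKEN / 1e12

print(tokens_t_to_gb(3.0))     # Dolma: 3T tokens -> ~12,000 GB (table lists 11,519 GB)
print(gb_to_tokens_t(44_000))  # FineWeb: ~44,000 GB -> ~11T tokens (table lists 15T; ratios vary)
```
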
Cosmos | NVIDIA | Tokens: 9,000T | Total: 2,000,000 GB | Other: 2,000,000 GB | ALQual: ★★★★☆ | Announced: Jan/2025 | Public: 🔴 | Model example: Cosmos
Paper: https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai
Project page: https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development
Notes: 9Qa (quadrillion) tokens = 9,000T tokens. https://lifearchitect.ai/cosmos/

DeepSeek-R2 | DeepSeek-AI | Tokens: 1,300T | Total: 5,400,000 GB | Web (CC/C4): 2,700,000 GB | Other: 2,700,000 GB | ALQual: ★★☆☆☆ | Announced: Apr/2025 | Public: 🔴 | Model example: DeepSeek-R2
Paper: https://www.jiuyangongshe.com/a/1h4gq724su0
Project page: https://www.jiuyangongshe.com/a/1h4gq724su0
Notes: 5.2 petabytes (PB) = 1.3Qa = 1,300T tokens = 1,300,000B tokens. "Constructed a 5.2PB high-quality corpus covering vertical domains such as finance, law, and patents."

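The PB-to-token arithmetic in the note can be checked directly; the 4-bytes-per-token figure is the same heuristic used in the size column, not something stated by the source.

```python
corpus_bytes = 5.2e15        # 5.2 PB (decimal petabytes), as claimed
bytes_per_token = 4          # assumed heuristic, ~4 bytes of text per token
tokens = corpus_bytes / bytes_per_token
print(tokens / 1e12)         # ~1,300 (trillions of tokens), i.e. ~1.3 quadrillion (1.3Qa)
```
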
DCLM-Pool | International | Tokens: 240T | Total: 1,000,000 GB | Web (CC/C4): 1,000,000 GB | Other: 0 GB | ALQual: ★☆☆☆☆ | Announced: Jun/2024 | Public: 🟡 | Model example: DCLM-Baseline 7B
Paper: https://arxiv.org/abs/2406.11794
Project page: https://www.datacomp.ai/dclm/
Notes: All CC from 2008-2022, new extraction using the resiliparse framework. https://x.com/Vaishaal/status/1803198069888229817/photo/1 "DCLM-POOL contains 200B documents (370TB after gzip compression)"

GPT-5 dataset | OpenAI | Tokens: 70T | Total: 281,000 GB | Web (CC/C4): 81,000 GB | Other: 200,000 GB | ALQual: ★★★★☆ | Announced: Aug/2024 | Public: 🔴 | Model example: GPT-5
Paper: https://lifearchitect.ai/whats-in-gpt-5/
Notes: My analysis: https://lifearchitect.ai/whats-in-gpt-5/

Qwen3 | Alibaba | Tokens: 36T | Total: 134,000 GB | Web (CC/C4): 61,500 GB | Other: 61,500 GB | ALQual: ★★★☆☆ | Announced: Apr/2025 | Public: 🔴 | Model example: Qwen3
Project page: https://qwenlm.github.io/blog/qwen3/
Notes: Largest publicly revealed text dataset and 'tokens seen' of any model to Apr/2025.

Llama 4 | Meta AI | Tokens: 30T | Total: 125,000 GB | Web (CC/C4): 60,000 GB | Other: 65,000 GB | ALQual: ★★★☆☆ | Announced: Apr/2025 | Public: 🔴 | Model example: Llama 4
Project page: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
Notes: Alan: Meta AI Llama 4 Scout 109B was trained to 40T tokens, and Meta has publicly stated that the "overall data mixture for training consisted of more than 30 trillion tokens... and includes diverse text, image, and video datasets" (5/Apr/2025). It may be inferred that: (a) the Llama 4 models were limited to a 30T-token dataset; (b) the Llama 4 dataset was roughly 60% text (18T tokens), 20% video, and 20% images; (c) Llama 4 Scout 'saw' the 30T-token dataset for roughly 1.3 epochs (40T tokens).

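A quick check of the inferences in the note above; the 60/20/20 modality split in (b) is Alan's assumption, not a Meta-published figure.

```python
dataset_tokens_t = 30     # "more than 30 trillion tokens" (Meta)
tokens_seen_t = 40        # Llama 4 Scout training budget

text_share = 0.60         # assumed split: 60% text, 20% video, 20% images
print(dataset_tokens_t * text_share)               # 18.0 -> ~18T text tokens (point b)
print(round(tokens_seen_t / dataset_tokens_t, 2))  # 1.33 -> ~1.3 epochs over the dataset (point c)
```
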
RedPajama-Data-v2 | Together AI | Tokens: 30T | Total: 125,000 GB | Web (CC/C4): 125,000 GB | Other: 0 GB | ALQual: ★★☆☆☆ | Announced: Oct/2023 | Public: 🟢
Project page: https://together.ai/blog/redpajama-data-v2

Multimodal Universe | Cambridge | Tokens: 27.5T | Total: 100,000 GB | Other: 100,000 GB | ALQual: ★★★☆☆ | Announced: Dec/2024 | Public: 🟢

Piper monorepo | Google | Tokens: 37.9T | Total: 86,000 GB | Web (CC/C4): 0 GB | Other: 86,000 GB | ALQual: ★★★☆☆ | Announced: Jun/2023 | Public: 🔴 | Model example: DIDACT
Notes: Piper PDF from 2016: https://dl.acm.org/doi/pdf/10.1145/2854146

MNBVC (Massive Never-ending BT Vast Chinese corpus) | MNBVC.253874 | Tokens: 30T | Total: 40,000 GB | Web (CC/C4): 40,000 GB | ALQual: ★☆☆☆☆ | Announced: Oct/2023 | Public: 🟢
Paper: https://github.com/esbatmop/MNBVC
Project page: https://mnbvc.253874.net/
Notes: Chinese only.

AuroraGPT | Argonne National Laboratory | Tokens: 20T | Total: 80,000 GB | Other: 80,000 GB | ALQual: ★★★★☆ | Announced: Jul/2024 | Public: 🔴 | Model example: AuroraGPT-7B-A-S
Paper: https://lifearchitect.ai/auroragpt/
Project page: https://www.youtubetranscript.com/?v=1K-hi-QjJiQ&t=604
Notes: "The mandate of the data team: To accumulate on the order of 20 plus trillion tokens of high quality scientific text and structured data with strong quality control, deduplication."

Claude-3.5 dataset | Anthropic | Tokens: 20T | Total: 80,000 GB | Web (CC/C4): 20,000 GB | Other: 60,000 GB | ALQual: ★★★★★ | Announced: Jun/2024 | Public: 🔴 | Model example: Claude 3.5 Sonnet
Project page: https://www.anthropic.com/news/claude-3-5-sonnet
Notes: Michael Gerstenhaber, head of product at Anthropic, says the company's new Claude 3.5 Sonnet model is larger than its predecessor but draws much of its new competence from innovations in training. For example, the model was given feedback designed to improve its logical reasoning skills. https://archive.md/iH4vg & Michael Gerstenhaber, product lead at Anthropic, says that the improvements are the result of architectural tweaks and new training data, including AI-generated data. Which data specifically? Gerstenhaber wouldn't disclose, but he implied that Claude 3.5 Sonnet draws much of its strength from these training sets. https://techcrunch.com/2024/06/20/anthropic-claims-its-latest-model-is-best-in-class/

FineWeb | HF | Tokens: 15T | Total: 44,000 GB | Web (CC/C4): 44,000 GB | Other: 0 GB | ALQual: ★★☆☆☆ | Announced: Apr/2024 | Public: 🟢
Project page: https://huggingface.co/datasets/HuggingFaceFW/fineweb
Notes: 15T tokens. Much bigger than FineWeb2. "FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb"

GPT-4 dataset | OpenAI | Tokens: 13T | Total: 40,000 GB | ALQual: ★★★★☆ | Announced: Mar/2023 | Public: 🔴 | Model example: GPT-4
Notes: My estimate: https://lifearchitect.ai/gpt-4/#dataset

HPLT v2.0 (cleaned) | HPLT | Tokens: 10.1T | Total: 40,000 GB | Web (CC/C4): 40,000 GB | Other: 0 GB | ALQual: ★★☆☆☆ | Announced: Oct/2024 | Public: 🟢
Paper: https://arxiv.org/abs/2403.14009
Project page: https://hplt-project.org/datasets/v2.0
Notes: CC + archive.org, 193 languages. 15TB compressed, est. 40TB uncompressed.

FineWeb-Edu-score-2 | HF | Tokens: 5.4T | Total: 31,500 GB | Web (CC/C4): 31,500 GB | Other: 0 GB | ALQual: ★★★☆☆ | Announced: May/2024 | Public: 🟢
Project page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2

CulturaX | UOregon | Tokens: 6.3T | Total: 27,000 GB | Web (CC/C4): 27,000 GB | Other: 0 GB | ALQual: ★★★☆☆ | Announced: Sep/2023 | Public: 🟢
Paper: https://arxiv.org/abs/2309.09400
Project page: https://huggingface.co/datasets/uonlp/CulturaX

HPLT (High Performance Language Technologies) | Helsinki | Tokens: 5.6T | Total: 50,100 GB | Web (CC/C4): 50,100 GB | Other: 0 GB | ALQual: ★★☆☆☆ | Announced: Mar/2024 | Public: 🟢
Paper: https://arxiv.org/abs/2403.14009
Notes: "In total, after de-duplication, we release a collection of 5.25 billion documents (approximately corresponding to web pages), totaling 50.1 TB of uncompressed texts and approximately 5.6 trillion whitespace-separated word tokens"

RefinedWeb | TII | Tokens: 5T | Total: 23,240 GB | Web (CC/C4): 23,240 GB | Other: 0 GB | ALQual: ★★☆☆☆ | Announced: Jun/2023 | Public: 🔴 | Model example: Falcon
Paper: https://arxiv.org/pdf/2306.01116.pdf

MassiveText ML | DeepMind | Tokens: 5T | Total: 20,000 GB | Web (CC/C4): 4,544 GB | Other: 15,655 GB | ALQual: ★★★★☆ | Announced: Dec/2021 | Public: 🔴 | Model example: Retro

Matrix | International | Tokens: 4.69T | Total: 21,600 GB | Web (CC/C4): 2,465 GB | Other: 2,112 GB | ALQual: ★★★☆☆ | Announced: May/2024 | Public: 🟢 | Model example: MAP-Neo
Paper: https://arxiv.org/pdf/2405.19327
Project page: https://cdn-uploads.huggingface.co/production/uploads/654907a4a1faff97850c4eff/1FWMF_t_Mhy0UQmu65Bb1.png
Notes: Combines RedPajama, Dolma, CulturaX, Amber, SlimPajama, Falcon, CulturaY.

FineWeb2 | HF | Tokens: 4T | Total: 32,000 GB | Web (CC/C4): 32,000 GB | ALQual: ★★☆☆☆ | Announced: Dec/2024 | Public: 🟢
Paper: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Project page: https://github.com/huggingface/fineweb-2
Notes: 3T words ≈ 4T tokens. Much smaller than FineWeb (15T tokens). 1,893 language-script pairs; of these, 486 have more than 1MB of text data, and 80 have more than 1GB of filtered data.

Cultura-Y | UOregon | Tokens: 4T | Total: 16,000 GB | ALQual: ★★★☆☆ | Announced: Mar/2024 | Public: 🟢 | Model example: Vistral-7B-Chat
Project page: https://www.ontocord.ai/blog/cultura-y

The Well | Cambridge | Tokens: 4.1T | Total: 15,000 GB | Other: 15,000 GB | ALQual: ★★★☆☆ | Announced: Dec/2024 | Public: 🟢
Project page: https://github.com/PolymathicAI/the_well

DCLM-Baseline | International | Tokens: 4T | Total: 13,000 GB | Web (CC/C4): 13,000 GB | ALQual: ★★★☆☆ | Announced: Jun/2024 | Public: 🟢 | Model example: DCLM-Baseline 7B
Paper: https://arxiv.org/abs/2406.11794
Project page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
Notes: All CC from 2008-2022, new extraction using the resiliparse framework. https://x.com/Vaishaal/status/1803198069888229817/photo/1 "DCLM-POOL contains 200B documents (370TB after gzip compression)"

PaLM 2 dataset | Google | Tokens: 3.6T | Total: 13,000 GB | ALQual: ★★★★☆ | Announced: May/2023 | Public: 🔴 | Model example: PaLM 2
Notes: My estimate: https://lifearchitect.ai/bard/#dataset

Dolma | AI2 | Tokens: 3T | Total: 11,519 GB | Web (CC/C4): 9,832.4 GB | Other: 1,686.6 GB | ALQual: ★★★☆☆ | Announced: Aug/2023 | Public: 🟢 | Model example: OLMo
Paper: https://arxiv.org/abs/2402.00159

Infiniset | Google | Tokens: 2.8T | Total: 12,616 GB | Web (CC/C4): 1,569 GB | Other: 11,047 GB | ALQual: ★★★★☆ | Announced: May/2021 | Public: 🔴 | Model example: LaMDA
Notes: My calcs: https://lifearchitect.ai/bard/#dataset & It's not clear why Google chose to use 1.5TB of the ~14TB history within Wikipedia.

MADLAD-400 | Google | Tokens: 3T | Total: 12,000 GB | Web (CC/C4): 12,000 GB | ALQual: ★★☆☆☆ | Announced: Sep/2023 | Public: 🟢 | Model example: MADLAD400-8B
Paper: https://arxiv.org/abs/2309.04662
Project page: https://huggingface.co/datasets/allenai/MADLAD-400

MassiveText EN | DeepMind | Tokens: 2.35T | Total: 10,550 GB | Web (CC/C4): 5,173 GB | Other: 5,376.5 GB | ALQual: ★★★★☆ | Announced: Dec/2021 | Public: 🔴 | Model example: Chinchilla, Gopher

Common Pile v0.1 | EleutherAI | Tokens: 2.2T | Total: 8,000 GB | Web (CC/C4): 4,000 GB | Other: 4,000 GB | ALQual: ★★★☆☆ | Announced: Jun/2025 | Public: 🟢 | Model example: Comma v0.1-2T
Paper: https://arxiv.org/abs/2506.05209
Project page: https://huggingface.co/blog/stellaathena/common-pile
Notes: Openly licensed data.

Pleias Common Corpus | PleIAS | Tokens: 2T | Total: 10,000 GB | Web (CC/C4): 10,000 GB | ALQual: ★★★☆☆ | Announced: Nov/2024 | Public: 🟢
Paper: https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open
Project page: https://huggingface.co/datasets/PleIAs/common_corpus
Notes: Lame, just CC again but no copyright (apparently). 2T is nowhere near the largest, despite the claim of 'Releasing the largest multilingual open pretraining dataset'.

InternLM | Shanghai AI | Tokens: 1.6T | Total: 5,100 GB | Web (CC/C4): 3,616 GB | Other: 1,199 GB | ALQual: ★★☆☆☆ | Announced: Jun/2023 | Public: 🟡 | Model example: InternLM
Notes: Chinese/English. My rough estimates only, multiplying tokens (billions) by 3 to get GB.

Stability New Pile | Stability AI | Tokens: 1.5T | Total: 5,000 GB | ALQual: ★★★☆☆ | Announced: Apr/2023 | Public: 🔴 | Model example: StableLM
Notes: Announced but not detailed.

FineWeb-Edu 1.3T | HF | Tokens: 1.3T | Total: 8,840 GB | Web (CC/C4): 8,840 GB | ALQual: ★★★★☆ | Announced: May/2024 | Public: 🟢
Project page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

Zyda | Zyphra | Tokens: 1.3T | Total: 8,840 GB | ALQual: ★★★☆☆ | Announced: Jun/2024 | Public: 🟢
Project page: https://www.zyphra.com/zyda
Notes: Combination of Dolma, FineWeb, Pile, RefinedWeb, and SlimPajama.

LLaMA | Meta AI | Tokens: 1.2T | Total: 4,083 GB | Web (CC/C4): 4,083 GB | Other: 666 GB | ALQual: ★★☆☆☆ | Announced: Feb/2023 | Public: 🟡 | Model example: LLaMA, Alpaca

RedPajama | Together AI | Tokens: 1.2T | Total: 4,033 GB | Web (CC/C4): 3,510 GB | Other: 524 GB | ALQual: ★★★☆☆ | Announced: Apr/2023 | Public: 🟢 | Model example: MPT
Paper: https://arxiv.org/abs/2411.12372v1

The Stack v2 | BigCode | Tokens: 0.9T | Total: 67,500 GB | Other: 67,500 GB | ALQual: ★★☆☆☆ | Announced: Feb/2024 | Public: 🟢 | Model example: StarCoder 2
Paper: https://arxiv.org/abs/2402.19173
Project page: https://huggingface.co/datasets/bigcode/the-stack-v2
Notes: "The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens)."

SlimPajama | Cerebras | Tokens: 0.627T | Total: 2,685 GB | Web (CC/C4): 706 GB | Other: 145 GB | ALQual: ★★★☆☆ | Announced: Jun/2023 | Public: 🟢
Paper: https://huggingface.co/datasets/cerebras/SlimPajama-627B
Project page: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
Notes: The dataset consists of 59,166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama.

Common Corpus | PleIAS | Tokens: 0.65T | Total: 2,000 GB | Other: 2,000 GB | ALQual: ★★☆☆☆ | Announced: Mar/2024 | Public: 🟢
Project page: https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613
Notes: 500B words ≈ 650B tokens.

ROOTS | BigScience | Tokens: 0.341T | Total: 1,600 GB | ALQual: ★★☆☆☆ | Announced: Mar/2023 | Public: 🟢 | Model example: BLOOM
Paper: https://arxiv.org/abs/2303.03915

The Pile v1 | EleutherAI | Tokens: 0.247T | Total: 825 GB | Web (CC/C4): 227 GB | Other: 629.71 GB | ALQual: ★★★★☆ | Announced: Dec/2020 | Public: 🟢 | Model example: GPT-Neo, GPT-J
Notes: Some dupes in my older calcs. Deliberately excludes US Congressional minutes (slavery) and Literotica (sex).

Institutional Books 1.0 | Harvard University's Institutional Data Initiative | Tokens: 0.242T | Total: 825 GB | Web (CC/C4): 0 GB | Other: 825 GB | ALQual: ★★★★☆ | Announced: Jun/2025 | Public: 🟢
Paper: https://arxiv.org/abs/2506.08300
Project page: https://huggingface.co/datasets/institutional/institutional-books-1.0
Notes: "This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available." Ref: https://archive.md/xhJvc

StarCoder dataset (The Stack 1.2 subset) | BigCode | Tokens: 0.25T | Total: 783 GB | Other: 783 GB | ALQual: ★★★☆☆ | Announced: May/2023 | Public: 🟢
Paper: https://arxiv.org/abs/2305.06161
Project page: https://huggingface.co/datasets/bigcode/starcoderdata
Notes: Contains 783GB of code in 86 programming languages, including 54GB of GitHub Issues, 13GB of Jupyter notebooks (scripts and text-code pairs), and 32GB of GitHub commits; approximately 250 billion tokens.

The Stack v1 | BigCode | Tokens: 0.2T | Total: 6,400 GB | Other: 6,400 GB | ALQual: ★★☆☆☆ | Announced: Nov/2022 | Public: 🟢 | Model example: Megatron-LM fork
Paper: https://arxiv.org/abs/2211.15533
Project page: https://huggingface.co/datasets/bigcode/the-stack-dedup
Notes: Various dedupes down to 2.7TB and 1.5TB.

GPT-3 dataset | OpenAI | Tokens: 0.499T | Total: 753 GB | Web (CC/C4): 620 GB | Other: 133.4 GB | ALQual: ★★★☆☆ | Announced: May/2020 | Public: 🔴 | Model example: GPT-3

RoBERTa dataset | Meta AI | Total: 161 GB | Web (CC/C4): 145 GB | Other: 16 GB | ALQual: ★★★☆☆ | Announced: Jul/2019 | Public: 🟡 | Model example: RoBERTa, Megatron-11B

YouTube-Commons | PleIAS | Tokens: 0.03T | Total: 110 GB | Other: 110 GB | ALQual: ★★★★☆ | Announced: Apr/2024 | Public: 🟢
Paper: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Project page: https://huggingface.co/datasets/PleIAs/YouTube-Commons
Notes: 286 parquet files × 385MB each.

Cosmopedia v2 | HF | Tokens: 0.028T | Total: 103 GB | Other: 103 GB | ALQual: ★★★★★ | Announced: Jul/2024 | Public: 🔴 | Model example: SmolLM
Project page: https://huggingface.co/blog/smollm
Notes: "Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 30 million textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct-v0.1. Most of the samples are generated by prompting the model to generate content on specific topics using a web page referred to as a 'seed sample', as shown in Figure 1. We use web samples to increase diversity and expand the range of prompts."

Cosmopedia v0.1 | HF | Tokens: 0.025T | Total: 92 GB | Other: 92 GB | ALQual: ★★★★★ | Announced: Mar/2024 | Public: 🟢
Project page: https://huggingface.co/blog/cosmopedia
Notes: Replication of phi-1.5; very high quality synthetic data.

GPT-2 dataset | OpenAI | Total: 40 GB | Web (CC/C4): 40 GB | ALQual: ★★☆☆☆ | Announced: Feb/2019 | Public: 🟡 | Model example: GPT-2
Notes: Popular web.

GPT-1 dataset | OpenAI | Total: 4.6 GB | Other: 4.6 GB | ALQual: ★☆☆☆☆ | Announced: Jun/2018 | Public: 🟡 | Model example: GPT-1
Notes: Books.

About this sheet