(53 datasets) Permalink: https://lifearchitect.ai/datasets-table/ · Paper: What's in my AI? https://lifearchitect.ai/whats-in-my-ai/ · The Memo: https://lifearchitect.ai/memo
The rough token ↔ GB ↔ word conversions used in the size columns are sketched below the table.

# | Dataset | Lab | Total tokens (T) ▼ | Total size (GB, uncompressed, ~4× tokens) | Web (CC/C4) data (GB, uncompressed) | Other data (GB, uncompressed) | ALQual (data quality rating) | Announced | Public? | Model example | Paper | Project page | Notes
---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | Cosmos | NVIDIA | 9000 | 2000000 | 2000000 | ★★★★☆ | Jan/2025 | 🔴 | Cosmos | https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai | https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development | 9Qa (quadrillion) tokens = 9,000T tokens. https://lifearchitect.ai/cosmos/ | |||||
4 | DeepSeek-R2 | DeepSeek-AI | 1300 | 5400000 | 2700000 | 2700000 | ★★☆☆☆ | Apr/2025 | 🔴 | DeepSeek-R2 | https://www.jiuyangongshe.com/a/1h4gq724su0 | https://www.jiuyangongshe.com/a/1h4gq724su0 | 5.2 petabytes (PB) = 1.3Qa = 1,300T tokens = 1,300,000B tokens. "Constructed a 5.2PB high-quality corpus covering vertical domains such as finance, law, and patents." | ||||
5 | DCLM-Pool | International | 240 | 1000000 | 1000000 | 0 | ★☆☆☆☆ | Jun/2024 | 🟡 | DCLM-Baseline 7B | https://arxiv.org/abs/2406.11794 | https://www.datacomp.ai/dclm/ | All CC from 2008-2022, new extraction using resiliparse framework. https://x.com/Vaishaal/status/1803198069888229817/photo/1 "DCLM-POOL contains 200B documents (370TB after gzip compression)" | ||||
6 | GPT-5 dataset | OpenAI | 70 | 281000 | 81000 | 200000 | ★★★★☆ | Aug/2024 | 🔴 | GPT-5 | https://lifearchitect.ai/whats-in-gpt-5/ | My analysis: https://lifearchitect.ai/whats-in-gpt-5/ | |||||
7 | Qwen3 | Alibaba | 36 | 134000 | 61500 | 61500 | ★★★☆☆ | Apr/2025 | 🔴 | Qwen3 | https://qwenlm.github.io/blog/qwen3/ | Largest publicly-revealed text dataset and 'tokens seen' of any model to Apr/2025. | |||||
8 | Llama 4 | Meta AI | 30 | 125000 | 60000 | 65000 | ★★★☆☆ | Apr/2025 | 🔴 | Llama 4 | https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E | Alan: Meta AI Llama 4 Scout 109B was trained to 40T tokens, and Meta has publicly stated that this "overall data mixture for training consisted of more than 30 trillion tokens... and includes diverse text, image, and video datasets" (5/Apr/2025). It may be inferred that: a. The Llama 4 models were limited to a 30T token dataset. b. The Llama 4 dataset was roughly 60% text (18T tokens), 20% video, 20% images. c. Llama 4 Scout ‘saw’ the 30T token dataset for 1.3 epochs (40T tokens). | |||||
9 | RedPajama-Data-v2 | Together AI | 30 | 125000 | 125000 | 0 | ★★☆☆☆ | Oct/2023 | 🟢 | https://together.ai/blog/redpajama-data-v2 | |||||||
10 | Multimodal Universe | Cambridge | 27.5 | 100000 | 100000 | ★★★☆☆ | Dec/2024 | 🟢 | |||||||||
11 | Piper monorepo | Google | 37.9 | 86000 | 0 | 86000 | ★★★☆☆ | Jun/2023 | 🔴 | DIDACT | Piper PDF from 2016: https://dl.acm.org/doi/pdf/10.1145/2854146 |||||||
12 | MNBVC (Massive Never-ending BT Vast Chinese corpus) | MNBVC.253874 | 30 | 40000 | 40000 | ★☆☆☆☆ | Oct/2023 | 🟢 | https://github.com/esbatmop/MNBVC | https://mnbvc.253874.net/ | Chinese only. ||||||
13 | AuroraGPT | Argonne National Laboratory | 20 | 80000 | 80000 | ★★★★☆ | Jul/2024 | 🔴 | AuroraGPT-7B-A-S | https://lifearchitect.ai/auroragpt/ | https://www.youtubetranscript.com/?v=1K-hi-QjJiQ&t=604 | "The mandate of the data team: To accumulate on the order of 20 plus trillion tokens of high quality scientific text and structured data with strong quality control, deduplication." | |||||
14 | Claude-3.5 dataset | Anthropic | 20 | 80000 | 20000 | 60000 | ★★★★★ | Jun/2024 | 🔴 | Claude 3.5 Sonnet | https://www.anthropic.com/news/claude-3-5-sonnet | Michael Gerstenhaber, head of product at Anthropic, says the company’s new Claude 3.5 Sonnet model is larger than its predecessor but draws much of its new competence from innovations in training. For example, the model was given feedback designed to improve its logical reasoning skills. https://archive.md/iH4vg & Michael Gerstenhaber, product lead at Anthropic, says that the improvements are the result of architectural tweaks and new training data, including AI-generated data. Which data specifically? Gerstenhaber wouldn’t disclose, but he implied that Claude 3.5 Sonnet draws much of its strength from these training sets. https://techcrunch.com/2024/06/20/anthropic-claims-its-latest-model-is-best-in-class/ | |||||
15 | FineWeb | HF | 15 | 44000 | 44000 | 0 | ★★☆☆☆ | Apr/2024 | 🟢 | https://huggingface.co/datasets/HuggingFaceFW/fineweb | 15T tokens. Much bigger than FineWeb2. "FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb" | ||||||
16 | GPT-4 dataset | OpenAI | 13 | 40000 | ★★★★☆ | Mar/2023 | 🔴 | GPT-4 | My estimate: https://lifearchitect.ai/gpt-4/#dataset | ||||||||
17 | HPLT v.2.0 (cleaned) | HPLT | 10.1 | 40000 | 40000 | 0 | ★★☆☆☆ | Oct/2024 | 🟢 | https://arxiv.org/abs/2403.14009 | https://hplt-project.org/datasets/v2.0 | CC + archive.org, 193 languages. 15TB compressed, est 40TB uncompressed | |||||
18 | FineWeb-Edu-score-2 | HF | 5.4 | 31500 | 31500 | 0 | ★★★☆☆ | May/2024 | 🟢 | https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2 | |||||||
19 | CulturaX | UOregon | 6.3 | 27000 | 27000 | 0 | ★★★☆☆ | Sep/2023 | 🟢 | https://arxiv.org/abs/2309.09400 | https://huggingface.co/datasets/uonlp/CulturaX | ||||||
20 | HPLT (High Performance Language Technologies) | Helsinki | 5.6 | 50100 | 50100 | 0 | ★★☆☆☆ | Mar/2024 | 🟢 | https://arxiv.org/abs/2403.14009 | "In total, after de-duplication, we release a collection of 5.25 billion documents (approximately corresponding to web pages), totaling 50.1 TB of uncompressed texts and approximately 5.6 trillion whitespace-separated word tokens" | ||||||
21 | RefinedWeb | TII | 5 | 23240 | 23240 | 0 | ★★☆☆☆ | Jun/2023 | 🔴 | Falcon | https://arxiv.org/pdf/2306.01116.pdf | ||||||
22 | MassiveText ML | DeepMind | 5 | 20000 | 4544 | 15655 | ★★★★☆ | Dec/2021 | 🔴 | Retro | |||||||
23 | Matrix | International | 4.69 | 21600 | 2465 | 2112 | ★★★☆☆ | May/2024 | 🟢 | MAP-Neo | https://arxiv.org/pdf/2405.19327 | https://cdn-uploads.huggingface.co/production/uploads/654907a4a1faff97850c4eff/1FWMF_t_Mhy0UQmu65Bb1.png | Combines RedPajama, Dolma, CulturaX, Amber, SlimPajama, Falcon, CulturaY ||||
24 | FineWeb2 | HF | 4 | 32000 | 32000 | ★★☆☆☆ | Dec/2024 | 🟢 | https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 | https://github.com/huggingface/fineweb-2 | 3T words=4T tokens. Much smaller than FineWeb (15T tokens). 1,893 language-script pairs. Of these, 486 have more than 1MB of text data, and 80 have more than 1GB of filtered data. | ||||||
25 | Cultura-Y | UOregon | 4 | 16000 | ★★★☆☆ | Mar/2024 | 🟢 | Vistral-7B-Chat | https://www.ontocord.ai/blog/cultura-y | ||||||||
26 | The Well | Cambridge | 4.1 | 15000 | 15000 | ★★★☆☆ | Dec/2024 | 🟢 | https://github.com/PolymathicAI/the_well | ||||||||
27 | DCLM-Baseline | International | 4 | 13000 | 13000 | ★★★☆☆ | Jun/2024 | 🟢 | DCLM-Baseline 7B | https://arxiv.org/abs/2406.11794 | https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0 | All CC from 2008-2022, new extraction using resiliparse framework. https://x.com/Vaishaal/status/1803198069888229817/photo/1 "DCLM-POOL contains 200B documents (370TB after gzip compression)" |||||
28 | PaLM 2 dataset | Google | 3.6 | 13000 | ★★★★☆ | May/2023 | 🔴 | PaLM 2 | My estimate: https://lifearchitect.ai/bard/#dataset |||||||||
29 | Dolma | AI2 | 3 | 11519 | 9832.4 | 1686.6 | ★★★☆☆ | Aug/2023 | 🟢 | OLMo | https://arxiv.org/abs/2402.00159 | ||||||
30 | Infiniset | Google | 2.8 | 12616 | 1569 | 11047 | ★★★★☆ | May/2021 | 🔴 | LaMDA | My calcs: https://lifearchitect.ai/bard/#dataset. It's not clear why Google chose to use only 1.5TB of the ~14TB of history within Wikipedia |||||||
31 | MADLAD-400 | Google | 3 | 12000 | 12000 | ★★☆☆☆ | Sep/2023 | 🟢 | MADLAD400-8B | https://arxiv.org/abs/2309.04662 | https://huggingface.co/datasets/allenai/MADLAD-400 |||||||
32 | MassiveText EN | DeepMind | 2.35 | 10550 | 5173 | 5376.5 | ★★★★☆ | Dec/2021 | 🔴 | Chinchilla, Gopher | |||||||
33 | Common Pile v0.1 | EleutherAI | 2.2 | 8000 | 4000 | 4000 | ★★★☆☆ | Jun/2025 | 🟢 | Comma v0.1-2T | https://arxiv.org/abs/2506.05209 | https://huggingface.co/blog/stellaathena/common-pile | Openly licensed data. | ||||
34 | Pleias Common Corpus | PleIAS | 2 | 10000 | 10000 | ★★★☆☆ | Nov/2024 | 🟢 | https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open | https://huggingface.co/datasets/PleIAs/common_corpus | Lame, just CC again but no copyright (apparently). 2T is nowhere near the largest despite the claim of 'Releasing the largest multilingual open pretraining dataset'. | ||||||
35 | InternLM | Shanghai AI | 1.6 | 5100 | 3616 | 1199 | ★★☆☆☆ | Jun/2023 | 🟡 | InternLM | Chinese/English. My rough estimates only, obtained by multiplying tokens (billions) by 3 to get GB ||||||
36 | Stability New Pile | Stability AI | 1.5 | 5000 | ★★★☆☆ | Apr/2023 | 🔴 | StableLM | Announced but not detailed | ||||||||
37 | FineWeb-Edu 1.3T | HF | 1.3 | 8840 | 8840 | ★★★★☆ | May/2024 | 🟢 | https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu | ||||||||
38 | Zyda | Zyphra | 1.3 | 8840 | ★★★☆☆ | Jun/2024 | 🟢 | https://www.zyphra.com/zyda | Combination of Dolma, Fineweb, Pile, RefinedWeb, and SlimPajama | ||||||||
39 | LLaMA | Meta AI | 1.2 | 4083 | 4083 | 666 | ★★☆☆☆ | Feb/2023 | 🟡 | LLaMA, Alpaca | |||||||
40 | RedPajama | Together AI | 1.2 | 4033 | 3510 | 524 | ★★★☆☆ | Apr/2023 | 🟢 | MPT | https://arxiv.org/abs/2411.12372v1 | ||||||
41 | The Stack v2 | BigCode | 0.9 | 67500 | 67500 | ★★☆☆☆ | Feb/2024 | 🟢 | StarCoder 2 | https://arxiv.org/abs/2402.19173 | https://huggingface.co/datasets/bigcode/the-stack-v2 | "The Stack v2 is ten times larger than its predecessor, yielding a raw dataset of 67.5 TB. Through extensive cleaning, filtering, and subsampling of the source code, along with the incorporation of other high-quality code-related datasets, we created a training set of approximately 3TB (900B+ tokens)." | |||||
42 | SlimPajama | Cerebras | 0.627 | 2685 | 706 | 145 | ★★★☆☆ | Jun/2023 | 🟢 | https://huggingface.co/datasets/cerebras/SlimPajama-627B | https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama | The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. | |||||
43 | Common Corpus | PleIAS | 0.65 | 2000 | 2000 | ★★☆☆☆ | Mar/2024 | 🟢 | https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613 | 500B words ≈ 650B tokens. |||||||
44 | ROOTS | BigScience | 0.341 | 1600 | ★★☆☆☆ | Mar/2023 | 🟢 | BLOOM | https://arxiv.org/abs/2303.03915 ||||||||
45 | The Pile v1 | EleutherAI | 0.247 | 825 | 227 | 629.71 | ★★★★☆ | Dec/2020 | 🟢 | GPT-Neo, GPT-J | Some dupes in my older calcs. Deliberately excludes the US Congressional Record (racist content) and Literotica (sexual content). ||||||
46 | Institutional Books 1.0 | Harvard University’s Institutional Data Initiative | 0.242 | 825 | 0 | 825 | ★★★★☆ | Jun/2025 | 🟢 | https://arxiv.org/abs/2506.08300 | https://huggingface.co/datasets/institutional/institutional-books-1.0 | "This analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available." Ref: https://archive.md/xhJvc | |||||
47 | StarCoder dataset (The Stack 1.2 subset) | BigCode | 0.25 | 783 | 783 | ★★★☆☆ | May/2023 | 🟢 | https://huggingface.co/datasets/bigcode/starcoderdata | https://arxiv.org/abs/2305.06161 | It contains 783GB of code in 86 programming languages, and includes 54GB of GitHub Issues, 13GB of Jupyter notebooks in scripts and text-code pairs, and 32GB of GitHub commits, which is approximately 250 billion tokens. ||||||
48 | The Stack v1 | BigCode | 0.2 | 6400 | 6400 | ★★☆☆☆ | Nov/2022 | 🟢 | Megatron-LM fork | https://arxiv.org/abs/2211.15533 | https://huggingface.co/datasets/bigcode/the-stack-dedup | Various dedupes down to 2.7TB and 1.5TB. | |||||
49 | GPT-3 dataset | OpenAI | 0.499 | 753 | 620 | 133.4 | ★★★☆☆ | May/2020 | 🔴 | GPT-3 | |||||||
50 | RoBERTa dataset | Meta AI | 161 | 145 | 16 | ★★★☆☆ | Jul/2019 | 🟡 | RoBERTa, Megatron-11B | ||||||||
51 | YouTube-Commons | PleIAS | 0.03 | 110 | 110 | ★★★★☆ | Apr/2024 | 🟢 | https://huggingface.co/datasets/HuggingFaceTB/cosmopedia | https://huggingface.co/datasets/PleIAs/YouTube-Commons | 286 parquet files × ~385MB each (≈110GB total) ||||||
52 | Cosmopedia v2 | HF | 0.028 | 103 | 103 | ★★★★★ | Jul/2024 | 🔴 | SmolLM | https://huggingface.co/blog/smollm | "Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 30 million textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct-v0.1. Most of the samples are generated by prompting the model to generate content on specific topics using a web page referred to as a "seed sample", as shown in Figure 1. We use web samples to increase diversity and expand the range of prompts." | ||||||
53 | Cosmopedia v0.1 | HF | 0.025 | 92 | 92 | ★★★★★ | Mar/2024 | 🟢 | https://huggingface.co/blog/cosmopedia | Replication of phi-1.5, very high quality synthetic data | |||||||
54 | GPT-2 dataset | OpenAI | 40 | 40 | ★★☆☆☆ | Feb/2019 | 🟡 | GPT-2 | Popular web | ||||||||
55 | GPT-1 dataset | OpenAI | 4.6 | 4.6 | ★☆☆☆☆ | Jun/2018 | 🟡 | GPT-1 | Books | ||||||||
About this sheet
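The size and token columns above rely on rough unit conversions: the "~4× tokens" note in the Total size header, FineWeb2's "3T words=4T tokens", and InternLM's "multiplying tokens (billions) by 3 to get GB". The Python sketch below makes those conversions explicit; the helper names are illustrative, and the 3-4 bytes-per-token and 0.75 words-per-token ratios are assumptions lifted from those notes, not measured values. Real ratios vary by tokenizer and language, and video/image token counts (e.g. Cosmos) do not follow them at all.

```python
# Rough unit conversions implied by the table's headers and notes.
# Assumptions: ~3-4 bytes of uncompressed text per token ("~4x tokens" header,
# InternLM note) and ~0.75 words per token (FineWeb2 note). Estimates only.

BYTES_PER_TOKEN = 4.0   # upper-bound assumption from the Total size header
WORDS_PER_TOKEN = 0.75  # rule of thumb consistent with "3T words = 4T tokens"

def tokens_T_to_gb(tokens_trillions: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Trillions of tokens -> approximate uncompressed gigabytes of text."""
    return tokens_trillions * 1e12 * bytes_per_token / 1e9

def gb_to_tokens_T(gigabytes: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Uncompressed gigabytes of text -> approximate trillions of tokens."""
    return gigabytes * 1e9 / bytes_per_token / 1e12

def tokens_T_to_words_T(tokens_trillions: float) -> float:
    """Trillions of tokens -> approximate trillions of whitespace words."""
    return tokens_trillions * WORDS_PER_TOKEN

if __name__ == "__main__":
    # GPT-4 row: 13T tokens -> ~39,000-52,000 GB depending on the ratio used.
    print(f"13T tokens ≈ {tokens_T_to_gb(13, 3):,.0f}-{tokens_T_to_gb(13, 4):,.0f} GB uncompressed")
    # DeepSeek-R2 row: 5.2 PB (5,200,000 GB) -> ~1,300T tokens at 4 bytes/token.
    print(f"5,200,000 GB ≈ {gb_to_tokens_T(5_200_000):,.0f}T tokens")
    # FineWeb2 row: 4T tokens ≈ 3T words.
    print(f"4T tokens ≈ {tokens_T_to_words_T(4):.0f}T words")
```

Checked against the GPT-4 row (13T tokens, 40,000 GB), the 3 bytes-per-token figure reproduces the table's value; 4 bytes per token gives the upper bound.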