Each entry below lists: Name, Link, Data Types, Start Date, Description, Process, Impact and Outcomes, Location, Region, and Contact info and other notes.

Name: Common Voice
Link: https://commonvoice.mozilla.org/en
Data Types: Voice data
Start Date: 2019
Description: Common Voice is a crowdsourced open voice dataset that can be used to train AI-driven voice applications. The initiative aims to broaden access to voice data for non-English languages and other groups typically underrepresented in voice datasets. The initiative is led by the Mozilla Foundation.
Process: Users seeking to develop AI applications can request a language on the Common Voice website. From there, the team releases a request for submissions for that language. All voice data goes through a validation process before publishing. Decisions about the data commons are made by a group of linguists, community actors, and other stakeholders.
Impact and Outcomes: As part of its programmatic work, and with the support of the Gates Foundation, the Foreign, Commonwealth and Development Office (FCDO), and GIZ, the team awarded eight grants to projects using the Kiswahili language and voice technology for the public good.
Location: San Francisco
Region: GLOBAL
Contact info and other notes: https://commonvoice.mozilla.org/en/partner

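The released corpora can be pulled programmatically. Below is a minimal sketch using the Hugging Face `datasets` library; the exact dataset id and release ("mozilla-foundation/common_voice_17_0") and the Kiswahili config code ("sw") are assumptions to verify on the hub, and the corpus is gated, so you must accept Mozilla's terms and authenticate (e.g. `huggingface-cli login`) first.

```python
# Hypothetical example: stream a Common Voice split without downloading it all.
# Dataset id/release and language config are assumptions -- check the hub.
from datasets import load_dataset

cv = load_dataset(
    "mozilla-foundation/common_voice_17_0",  # assumed release id
    "sw",                                    # Kiswahili config (assumed code)
    split="validation",
    streaming=True,
)

for sample in cv.take(3):
    # Each record pairs a decoded waveform with its validated transcript.
    print(sample["sentence"], "|", sample["audio"]["sampling_rate"], "Hz")
```
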
Name: MLCommons
Link: https://mlcommons.org/datasets/
Data Types: Voice data, image data, synthetic mobility data
Start Date: 2018
Description: MLCommons is an AI benchmarking organization that provides open datasets for machine learning research and testing. Founded by representatives from institutions including Baidu, Google, Harvard, Stanford, and UC Berkeley, it aims to support engineers in building AI technologies for the public good. The Multilingual Spoken Words corpus, one of MLCommons' datasets, comprises over 23 million spoken-word samples across 50 languages, including 340,000 unique keywords and over 6,000 hours of audio in the Opus format. The Cognata dataset consists of 100,000 synthetic data frames depicting urban and highway driving scenarios under various conditions; these frames are captured through a virtual sensor array, including multiple cameras and lidar, and are accompanied by metadata and annotations.
Process: MLCommons has several working groups focused on benchmarking and advancing research. Its Datasets Working Group focuses on building public datasets for machine learning purposes.
Impact and Outcomes: The datasets have been used in several papers and conference presentations: https://mlcommons.org/research/
Location: San Francisco
Region: GLOBAL
Contact info and other notes: support@mlcommons.org

Name: All of Us Research Program
Link: https://allofus.nih.gov/
Data Types: Biomedical data
Start Date: 2015
Description: The All of Us Research Program is an initiative from the National Institutes of Health (NIH) that grew out of the NIH Precision Medicine Initiative Working Group of the Advisory Committee to the Director. The initiative aims to crowdsource health data from 1 million people across the United States to improve needs-based healthcare and precision medicine. The team provides tiered access to the crowdsourced data for health research purposes (including AI health research).
Process: Participants provide access to their own health data. The team removes all personally identifiable information before granting access to specific researchers and organizations. Participants receive information about their own health based on the data they provided before the research is published. All participants must consent to the initiative accessing their health records, and they are also asked to complete health surveys.
Impact and Outcomes: Currently, 945 institutions have access to the data for research purposes and there are around 13,000 active projects: https://www.researchallofus.org/research-projects-directory/
Location: Washington, DC / United States
Region: NA
Contact info and other notes: https://allofus.nih.gov/news-events/announcements/nih-launches-largest-precision-nutrition-research-effort-its-kind

Name: Nightingale Open Science
Link: https://www.ngsci.org/
Data Types: Health data, images
Start Date: 2022
Description: Nightingale Open Science is an open data platform that combines data from health researchers and organizations. Hosted by the Chicago Booth Center for Applied Artificial Intelligence, the initiative aims to bring together a variety of health data sources to help accelerate clinical research and computational medicine.
Process: All supplied data is de-identified ahead of publishing. Researchers can only access the data within the platform environment, and access is granted only to those who will use the data for non-commercial, academic purposes.
Impact and Outcomes: The team hosts various challenges to help advance research on specific health-related topics: https://www.ngsci.org/updates
Location: Chicago
Region: NA
Contact info and other notes: info@ngsci.org

Name: Health Data Nexus
Link: https://healthdatanexus.ai/
Data Types: Health data
Start Date: 2023
Description: Health Data Nexus is an initiative by the Temerty Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM) at the University of Toronto that seeks to democratize access to health data for AI medical research.
Process: Those interested in providing access to their health research data need a clear research plan and Research Ethics Board approval. They must also complete a data access agreement prior to publishing, and the team provides templated metadata that must be followed.
Impact and Outcomes: The platform is now being accessed by users in 5 countries: https://www.medrxiv.org/content/10.1101/2024.08.23.24312060v1.full.pdf
Location: Toronto
Region: NA
Contact info and other notes: january.adams@utoronto.ca

Name: IndiaAI Datasets Platform
Link: https://indiaai.gov.in/datasets
Data Types: N/A
Start Date: 2025
Description: The IndiaAI Datasets Platform aims to accelerate access to non-personal data for AI researchers and startups. The platform will be hosted by the National eGovernance Division and will include private and public datasets.
Impact and Outcomes: To be launched in January 2025: https://www.ndtvprofit.com/technology/governments-indiaai-datasets-platform-to-be-operational-by-january-2025
Location: India
Region: APAC

Name: National Cancer Institute (NCI) Cancer Research Data Commons - Data Readiness Challenge
Link: https://datacommons.cancer.gov/news/nci-crdc-artificial-intelligence-data-readiness-aidr-challenge
Data Types: Health data, open research, genomic data, images
Start Date: 2024
Description: The NCI Cancer Research Data Commons provides access to NCI-funded research data with the goal of advancing cancer research. The platform includes six data commons: genomic data, proteomic data, imaging data, integrated canine data, cancer data, and clinical and translational data. In October 2024, the team announced a new challenge for March 2025 that will ask researchers to evaluate the AI readiness of the existing datasets on the data commons and how to make future datasets AI-ready. This initiative is part of an ongoing program on making the platform's data relevant for AI research.
Location: United States
Region: NA
Contact info and other notes: https://www.cancer.gov/research/infrastructure/artificial-intelligence

Name: The Lacuna Fund
Link: https://lacunafund.org/apply/
Data Types: Agriculture, language, health and climate data
Start Date: 2020
Description: The Lacuna Fund advances the development of labelled datasets about low- and middle-income countries that can be used for machine learning. The team accepts proposals for the creation of datasets across four domains: agriculture, language, health, and climate.
Process: The Lacuna Fund issues an RFP for the development of a dataset on a specific topic. It accepts applications from nonprofits, research organizations, or for-profit social enterprises operating in the region of interest, and applications are reviewed by a panel of experts who select the grantee.
Location: Global
Region: GLOBAL

Name: INSIGHT
Link: https://www.insight.hdrhub.org/about-insight
Data Types: Eye images
Start Date: 2015
Description: INSIGHT is a repository of 35 million eye images made available for research purposes only. By providing image data, it allows researchers to use advanced analytics and AI on anonymized patient records. The initiative is led by Moorfields Eye Hospital NHS Foundation Trust in collaboration with University Hospitals Birmingham NHS Foundation Trust.
Process: The initiative is run by a data trust advisory board consisting of experts and community/patient representatives. The team also works with a patient representative group to determine how to ensure the initiative provides value to the NHS and patients.
Impact and Outcomes: The research team has published several resources using the data: https://www.insight.hdrhub.org/resources
Location: United Kingdom
Region: EMEA
Contact info and other notes: enquiries@insight.hdrhub.org

Name: Environmental Impact Data Collaborative (EIDC)
Link: https://mdi.georgetown.edu/eidc/
Data Types: Public and private data, e.g. EPA Air Quality Index, tree canopy data, Census Bureau Community Resilience Estimates, Low-Income Energy Affordability Data
Start Date: 2023
Description: The EIDC provides access to data for climate change-related research and policymaking. The team focuses on four main domains (Environmental Justice, Federal Spending, Energy, and Air and Water) and supplies cloud-based tools to analyze the data. The initiative is led by Georgetown University's Massive Data Institute.
Process: All data is housed on a cloud-based platform where users can analyze and upload their own data.
Impact and Outcomes: AI research developed using the platform: https://mdi.georgetown.edu/news/automated-pipeline-to-extract-wetland-damage-data-from-us-army-corps-of-engineers-notices/
Location: Washington, DC
Region: NA
Contact info and other notes: https://georgetown.app.box.com/s/5jw7nx9x28gqyigc143nxfb4ihizgai0

Name: Language Data Commons of Australia
Link: https://www.ldaca.edu.au/
Data Types: Language data
Start Date: 2021
Description: The Language Data Commons of Australia is a partnership between the Australian Research Data Commons and the School of Languages and Cultures at The University of Queensland that seeks to make Australian language data available for both "academic and non-academic uses".
Impact and Outcomes: Case study of language data being applied at Appen: https://www.ldaca.edu.au/resources/general-resources/case-studies/data-mangement-appen/
Location: Australia
Region: APAC

Name: UC Irvine Machine Learning Repository
Link: https://archive.ics.uci.edu/about
Data Types: Images, tabular data, text, time-series data
Start Date: 1987
Description: The UC Irvine Machine Learning Repository is a collection of datasets that can be used to test and train machine learning models. It includes a range of datasets applicable to a variety of machine learning tasks.
Process: Individuals can submit datasets to the repository provided they do not include any PII. All datasets are reviewed by the team prior to publishing and are assigned a Digital Object Identifier (DOI). All published datasets are licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0).
Location: California
Region: GLOBAL
Contact info and other notes: ml-repository@ics.uci.edu

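Since every published dataset carries a DOI and a stable repository id, it can be fetched directly. A minimal sketch using the `ucimlrepo` helper package; the numeric id below (53, assumed to be the Iris dataset) should be checked against the repository.

```python
# pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

iris = fetch_ucirepo(id=53)     # fetch data + metadata by repository id (assumed: Iris)
X = iris.data.features         # pandas DataFrame of feature columns
y = iris.data.targets          # pandas DataFrame of target column(s)
print(iris.metadata.name, X.shape)
```
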
Name: Papers with Code
Link: https://paperswithcode.com/about
Data Types: Images, text, video, audio, tables, graphics, time-series data
Start Date: 2018
Description: Papers with Code is a digital commons of machine learning papers, datasets, and code. All content on the platform is licensed under CC-BY-SA (similar to Wikipedia), users can make direct edits to the content, and the platform is managed by the community.
Process: The data included on the platform comes either from the community, from online data platforms with open licenses, or from the internal team (primarily based at Meta AI Research).
Location: California
Region: GLOBAL
Contact info and other notes: hello@paperswithcode.com

Name: AIDA Data Hub
Link: https://datahub.aida.scilifelab.se/about/
Data Types: Medical images
Start Date: 2019
Description: AIDA Data Hub is an open data repository in Sweden for medical research, led by the Linköping University Center for Medical Image Science and Visualization (CMIV). The platform provides the infrastructure for researchers to share their own data and research following FAIR principles, with DOIs assigned to datasets.
Process: Only members can access the platform. Any Swedish company working in medical imaging or AI can become a member at a cost. The team also offers a data science platform that provides a tool for managing sensitive data.
Impact and Outcomes: In May 2024, PredictMe (an AI startup) was added to the initiative board, becoming the first commercial customer of the repository: https://datahub.aida.scilifelab.se/news/2024-05-29-predictme-first-sme-on-board/
Location: Sweden
Region: EMEA
Contact info and other notes: aida-data-director@nbis.se

Name: CLARIN
Link: https://www.clarin.eu/
Data Types: Language data
Start Date: 2012
Description: CLARIN provides the infrastructure to combine language data from across European institutions for research purposes. The initiative is governed by the CLARIN European Research Infrastructure Consortium and a General Assembly of representatives from participating institutions' ministries.
Location: Europe
Region: EMEA
Contact info and other notes: clarin@clarin.eu

Name: The Common Pile v0.1 (The Pile v2)
Link: https://arxiv.org/pdf/2506.05209
Data Types: Public domain books
Start Date: 2025
Description: EleutherAI released the Common Pile v0.1, a successor to The Pile dataset that addresses the copyright controversy surrounding its predecessor. The Common Pile v0.1 is an eight-terabyte dataset of entirely openly licensed and public domain text. It includes research publications, books, government texts, online discussion forums, and more, all of which meet the Open Definition 2.1 standards for open licensing.
Impact and Outcomes: EleutherAI released the Comma v0.1-1T and Comma v0.1-2T models, which were trained on the Common Pile. According to their report, these models achieved performance comparable to models trained on unlicensed data.
Location: Global
Region: GLOBAL
Contact info and other notes: https://openfuture.eu/wp-content/uploads/2024/04/240404Towards_a_Books_Data_Commons_for_AI_Training.pdf, https://venturebeat.com/ai/one-of-the-worlds-largest-ai-training-datasets-is-about-to-get-bigger-and-substantially-better/

Name: HathiTrust Research Center
Link: https://www.hathitrust.org/about/research-center/
Data Types: Books
Start Date: 2008
Description: The HathiTrust Research Center at Indiana University and the University of Illinois at Urbana-Champaign provides access to its digital library of books for research purposes. Through the HathiTrust Digital Library, it provides access to several data analytics tools. The initiative has, however, been criticized in the past for copyright infringement.
Impact and Outcomes: HathiTrust's data commons has been used for research in fields ranging from musicology to computational linguistics. For instance, researchers at the University of Illinois Urbana-Champaign used the HathiTrust dataset in a 2025 study to demonstrate how LLMs can identify metadata such as keywords and abstracts from theses and dissertations.
Note: HathiTrust is suspending funding for the HathiTrust Research Center after 2026.
Location: United States
Region: NA
Contact info and other notes: https://openfuture.eu/wp-content/uploads/2024/04/240404Towards_a_Books_Data_Commons_for_AI_Training.pdf

Name: Gaia-X Data Space
Link: https://www.bmwk.de/Redaktion/EN/Dossier/gaia-x.html
Data Types: Various types of data across sectors
Start Date: 2021
Description: Gaia-X is a European initiative, originating from Germany, to develop a sovereign, interoperable data infrastructure that enables secure access to data, respects data sovereignty, and fosters digital innovation within an open, transparent ecosystem.
Process: Gaia-X works to create a federated data ecosystem that connects both centralized and decentralized infrastructures. It supports secure and transparent data exchange, fosters innovation, and establishes open interfaces, standards, and an innovation platform. National Gaia-X hubs provide local governance and coordination, while the Gaia-X Federation Services facilitate secure data exchange and control.
Impact and Outcomes: Key milestones include establishing the Gaia-X Association and national hubs, developing federation services, creating certified service labels, and running workshops for knowledge transfer. Despite slow progress and challenges in achieving full digital sovereignty, SMEs are already utilizing Gaia-X data spaces to share data for innovation.
Location: Brussels
Region: EMEA
Contact info and other notes: Gaia-X enables the development and use of AI by addressing many of the challenges that AI faces. It also supports projects that deal with AI, including OpenGPT-X, an EU-specific, multilingual GenAI model. https://gaia-x-hub.de/wp-content/uploads/2023/12/GX-WP-AI.pdf

Name: Catena-X Data Space
Link: https://catena-x.net/en/
Data Types: Automotive industry-related data
Start Date: 2021
Description: Catena-X is a collaborative, open data ecosystem for the automotive industry, supporting secure, standardized access to data across stakeholders, including manufacturers, suppliers, and service providers.
Process: Harmonizes partner data; enables traceability, carbon footprint analysis, and circular economy initiatives; develops digital behavior twins for predictive insights; establishes shared data exchange standards; and collaborates with Gaia-X.
Impact and Outcomes: Expected to strengthen supply chain resilience, improve manufacturing efficiency, and support sustainable practices across the automotive sector. The joint venture Cofinity-X and the open source platform Tractus-X have spun off from Catena-X, which recently expanded to the North American market.
Location: Germany
Region: EMEA
Contact info and other notes: https://catena-x.net/en/kontakt. Catena-X enables the development and use of AI by addressing many of the challenges that AI faces in the automotive industry. Through federated learning, Catena-X allows companies to train AI models collaboratively without compromising data privacy. https://www.dlr.de/en/ki/research-transfer/projects/catena-x

Name: Mobilithek
Link: https://bmdv.bund.de/SharedDocs/EN/Articles/DG/mobilithek.html
Data Types: Mobility data
Start Date: 2022
Description: Mobilithek is Germany's central platform for sharing digital mobility data; it primarily provides access to publicly available, governmental, and/or legally required data. It offers real-time updates, timetables, and related data, and operates alongside the Mobility Data Space, replacing the Mobility Data Marketplace and integrating mCLOUD.
Process: Integrates real-time traffic updates and public transit schedules, establishes data exchange standards compliant with International Data Spaces technology, and provides a user-friendly web portal for data access. Data from the Mobility Data Marketplace (MDM) and mCLOUD migrated to Mobilithek. Continuous updates and collaborations with mobility providers ensure accurate data, alongside regular platform maintenance and engagement with national and European stakeholders.
Impact and Outcomes: Enables data-driven mobility planning, facilitates intermodal travel, improves road safety and traffic efficiency, fosters innovation, promotes open access to data, and drives sustainable transportation. Expected outcomes include cost reduction, increased use of sustainable transportation, and improved European data exchange for seamless mobility services.
Location: Germany
Region: EMEA
Contact info and other notes: Mobilithek enables the development and use of AI by addressing many of the challenges that AI faces in the mobility industry. It also supports projects that deal with AI.

Name: Mobility Data Space
Link: https://mobility-dataspace.eu/
Data Types: Mobility data
Start Date: 2021
Description: An open platform for the secure exchange of real-time traffic and mobility data, connecting public and private sources while ensuring data sovereignty and traceability.
Process: Offers a secure, decentralized data space for traffic data, a metadata catalog, and Connector-as-a-Service for data transfer. Facilitates collaboration via events and working groups.
Impact and Outcomes: Supports efficient, environmentally friendly mobility through access to data, enabling optimized logistics and transportation while maintaining data control for providers. Expected outcomes include reduced fuel use and emissions and improved logistics.
Location: Germany
Region: EMEA
Contact info and other notes: https://mobility-dataspace.eu/contact. The Mobility Data Space enables the development and use of AI by addressing many of the challenges that AI faces in the mobility industry. It also supports projects that deal with AI.

Name: Health Data Hub
Link: https://www.health-data-hub.fr/
Data Types: Health data, medical/administrative data
Start Date: 2019
Description: A French platform providing secure, centralized access to health data to support public-interest research. It is intended to improve healthcare quality by enabling projects on topics like drug side effects and disease prediction.
Process: Acts as a single access point, coordinating with CNIL and supporting data access via a secure platform. Collaborates with the European Health Data Space (EHDS) for cross-border health research and broader access to data.
Impact and Outcomes: Enables healthcare research that benefits patient care, supports regulatory compliance, and fosters collaboration across Europe, enhancing France's role in health data management.
Location: France
Region: EMEA
Contact info and other notes: https://www.health-data-hub.fr/contact. The Health Data Hub aims to support AI applications in healthcare, accelerating projects that use large health datasets for predictive algorithms, disease monitoring, and improving patient outcomes. It collaborates with the European Health Data Space to enable broader AI-driven health research across countries.

Name: The Public Sector Data Space
Link: https://cbddo.gov.tr/en/projects/psector-data-space/
Data Types: Big data, micro data
Start Date: 2024
Description: The Public Sector Data Space is an initiative by the Digital Transformation Office and the Turkish Statistical Institute that aims to adapt the European Union's concept of a "data space" to the Turkish context. Toward that end, the team is developing a platform and supporting infrastructure for researchers and policymakers to access pools of data for data analytics and AI application development. The platform is currently in development.
Location: Turkey
Region: EMEA
Contact info and other notes: https://unstats.un.org/unsd/statcom/groups/NetEconStat/Meetings/DataStrategySprintSecondWebinar/Session3-4-DataStrategy-and-Governance-in-Turkey.pdf

Name: The Data Dam
Link: https://www.prnewswire.com/news-releases/korea-kicks-off-data-dam-301325717.html
Data Types: Language, vision, land and environment, agriculture, livestock and fishing, safety, healthcare, and autonomous driving data
Start Date: 2021
Description: The Data Dam is an initiative by Korea's Ministry of Science and ICT, developed as part of Korea's Digital New Deal, that seeks to collect public and private data and publish it for re-use across several domains. One of its main purposes is to produce data to fuel AI applications and improve the quality of AI output.
Location: South Korea
Region: APAC

Name: Common Corpus
Link: https://huggingface.co/collections/PleIAs/openculture-65d46e3ea3980fdcd66a5613
Data Types: Words, books, newspapers
Start Date: 2024
Description: Common Corpus is one of the largest public-domain datasets for LLM training, coordinated by Pleias (a technology company) in collaboration with HuggingFace, Occiglot, Eleuther, and Nomic AI. The dataset includes public domain books and newspapers in several languages from national libraries and archives, along with other sources. It also includes language data in English, French, Dutch, Spanish, German, and Italian.
Impact and Outcomes: Common Corpus data was used to train Pleias 1.0, a model specializing in retrieval-augmented generation for legal, administrative, and financial use cases. It was also used to train Pleias Nano, which specializes in scientific publications. Both models were released in December 2024 and are said to be among the first fully EU AI Act-compliant models.
Location: Global
Region: GLOBAL

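For orientation, a sketch of streaming the corpus from Hugging Face. The repository id used below ("PleIAs/common_corpus") and the record schema are assumptions, so inspect the keys before relying on any field.

```python
from datasets import load_dataset

# Assumed repository id -- verify on the Hugging Face hub before use.
cc = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
for doc in cc.take(2):
    print(sorted(doc.keys()))   # inspect the schema rather than assuming fields
```
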
Name: PD12M
Link: https://huggingface.co/datasets/Spawning/PD12M
Data Types: Image data, synthetic captions, metadata (image dimensions, MIME type, licensing information, embeddings)
Start Date: 2024
Description: Public Domain 12M (PD12M) is a dataset of 12.4 million public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. The dataset was developed by Source.Plus (a platform by Spawning AI) and uses a commons-based governance model.
Process: The dataset was sourced from galleries, libraries, archives, museums, Wikimedia Commons, and iNaturalist. Images were pre-filtered to exclude copyrighted content and inappropriate materials. Synthetic captions were generated, and additional filtering ensured quality and safety. Images and metadata are hosted on dedicated cloud storage to maintain dataset integrity and support reproducibility.
Location: Global
Region: GLOBAL

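Because the images are hosted separately from the metadata, a typical workflow streams the metadata rows and fetches image bytes on demand. A sketch along those lines; the `url` and `caption` field names are assumptions taken from the dataset card and should be confirmed against the printed schema.

```python
import io
import requests
from datasets import load_dataset
from PIL import Image

pd12m = load_dataset("Spawning/PD12M", split="train", streaming=True)
row = next(iter(pd12m))
print(sorted(row.keys()))       # confirm the schema first

# Assuming a `url` column pointing at the hosted image and a `caption` column:
img = Image.open(io.BytesIO(requests.get(row["url"], timeout=30).content))
print(img.size, row.get("caption"))
```
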
Name: Dolma
Link: https://github.com/allenai/dolma
Data Types: Academic publications, books, web content, code, encyclopedia content
Start Date: 2023
Description: Dolma is a 3-trillion-token open dataset created by the Allen Institute for AI (AI2), made up of data sources such as web pages, academic publications, code, books, and encyclopedic materials. The dataset is accompanied by a toolkit for sourcing datasets for training purposes and aims to support transparency, risk mitigation, and reproducibility for responsible AI development.
Process: Dolma was built using text data from various sources, including academic papers, books, web pages, and code. The dataset is shared under AI2's ImpACT license, which requires users to disclose their intended use, report any datasets created from it, and follow ethical guidelines to prevent harmful applications.
Impact and Outcomes: Dolma was used to train OLMo, an open-source AI model by AI2. OLMo is a set of language models that are fully open, meaning the data, code, and evaluations used to create them are available to the public. By using datasets like Dolma, OLMo has been trained to perform tasks such as knowledge recall and commonsense reasoning.
Location: Global
Region: GLOBAL

Name: BigScience
Link: https://bigscience.huggingface.co/
Data Types: Language data
Start Date: 2021
Description: BigScience is a collaboration by HuggingFace, GENCI, and IDRIS. Launched as a research workshop, the initiative focused on co-developing a multilingual neural network language model with researchers from institutions across the globe.
Location: Global
Region: GLOBAL

Name: The Health Passbook
Link: https://gpai.ai/projects/data-governance/DG08%20-%20The%20Role%20of%20Government%20as%20a%20Provider%20of%20Data%20for%20Artificial%20Intelligence%20-%20Interim%20Report.pdf
Data Types: Health data
Start Date: 2014
Description: The Health Passbook is an initiative by Taiwan's National Health Insurance Administration that provides access to health data for AI developers and researchers. Launched in 2014, it lets users provide access to data from their "health passbook" (i.e., personal health records or test results) to third-party organizations, and it has been expanded over time. The initiative grew rapidly during the COVID-19 pandemic, as many health passbook users chose to share their vaccination records and COVID-19 test results.
Process: The Health Passbook uses a data stewardship model to provide access to data. The government of Taiwan provides the infrastructure to accelerate access to data but does not share the data itself. Users can only provide access to their data after their identities have been authenticated, and they must provide consent every time they choose to provide access. Organizations can access the data through a software development kit; these organizations include government agencies, organizations contracted by the National Health Insurance Administration, and others.
Impact and Outcomes: In 2022, approximately 30,000 users provided access to their data through the Health Passbook, which is quite low relative to the number of users who could opt in. In addition, 149 organizations applied to access the software development kit and 349 mobile applications were submitted (e.g., Lydia.ai's AI Health Index application, which helps users understand insurance offerings). Having a national data protection law in addition to sector-specific data laws helped advance this initiative.
Location: Taiwan
Region: APAC

Name: The Aclimate Agricultural Data Platform
Link: https://www.aclimate.org/
Data Types: Agricultural, environmental, and government data
Start Date: Unknown
Description: The Aclimate Agricultural Data Platform is a data commons initiative developed by the Colombian government, the International Centre for Tropical Agriculture (a research organization), and farmers' collectives to accelerate the development of AI applications for farmers. The initiative aims to provide farmers with insights about weather patterns, climate change, and their impact on agriculture so they can rapidly adjust their practices as needed.
Process: Farmers can access data through the Aclimate Platform, which uses machine learning to provide insights from pools of data. The platform models scenarios for crops based on alternative weather patterns. In 2022, the team added the Melisa AI chatbot, which aims to increase access to data and data-driven insights from the platform. The organizers used a contractual framework to generate access to data, but this occurred in the context of an existing data protection law.
Impact and Outcomes: To help mitigate the risk of unequal access, the team deployed the platform using technologies that do not require a strong internet connection (e.g., text messaging). In 2014, rice farmers used insights from this data commons to rapidly adjust their farming practices and avoid around $3.5 million USD in costs.
Location: Colombia
Region: LATAM

Name: OpenStreetMap
Link: https://osmfoundation.org/
Data Types: Geographic data
Start Date: 2004
Description: OpenStreetMap is a free, editable map of the world, collaboratively created by a global community of nearly 5 million registered users. The project aims to provide free and accessible geospatial data for everyone, including individuals, organizations, and governments. Supported by the OpenStreetMap Foundation, the initiative fosters the growth, development, and distribution of open geospatial data.
Process: Data is contributed by volunteers worldwide using free tools and software. The OpenStreetMap Foundation oversees critical functions like hosting servers, managing funds, and organizing conferences (e.g., State of the Map). The platform supports crowd-sourced data editing, validation, and integration.
Impact and Outcomes: OpenStreetMap data is used by a variety of stakeholders, including governments, non-profits, and businesses, to support navigation systems, disaster management efforts, urban planning, and other applications. The platform's open geospatial data has been incorporated into research and projects across multiple sectors.
Location: United Kingdom
Region: GLOBAL

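OpenStreetMap data can be queried without downloading the full planet file, for example through the public Overpass API. A minimal sketch; the endpoint and its fair-use limits are documented by the Overpass project, and the bounding box below is an arbitrary example.

```python
import requests

OVERPASS = "https://overpass-api.de/api/interpreter"
# Find drinking-water amenities inside an example bounding box
# (south, west, north, east).
query = """
[out:json][timeout:25];
node["amenity"="drinking_water"](47.36,8.52,47.38,8.56);
out body;
"""
resp = requests.post(OVERPASS, data={"data": query}, timeout=60)
for node in resp.json()["elements"][:5]:
    print(node["lat"], node["lon"], node.get("tags", {}))
```
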
Name: Humanitarian OpenStreetMap Team (HOT)
Link: https://www.hotosm.org/
Data Types: Geospatial data, satellite imagery, mapping data
Start Date: 2009
Description: The Humanitarian OpenStreetMap Team (HOT) creates and utilizes open geospatial data for applications in disaster management, climate resilience, public health, migration, and other humanitarian fields. It uses tools like the Tasking Manager, which coordinates mapping projects, and fAIr, an AI-assisted mapping service.
Process: HOT organizes mapping efforts using the Tasking Manager, a platform that segments mapping projects into smaller, assignable tasks. Data accuracy is supported through validation by experienced contributors. The fAIr tool incorporates AI to identify map features from satellite imagery, streamlining data collection processes. HOT collaborates with local mapping groups and partners to align data and tools with on-the-ground needs.
Impact and Outcomes: HOT has supported disaster response by mapping areas affected by events such as the Haiti earthquake and Typhoon Haiyan. An AI-assisted mapping pilot in Tanzania and Uganda increased mapping productivity, resulting in 18 million building footprints added to OpenStreetMap.
Location: Washington, DC
Region: GLOBAL

Name: Wikidata
Link: https://www.wikidata.org/wiki/Wikidata:Main_Page
Data Types: Structured data, including items, properties, and statements across various fields such as biographies, geography, and scientific information
Start Date: 2012
Description: Wikidata is an open and collaborative knowledge base that stores structured data for use by Wikimedia projects and other platforms. It acts as a central repository for data, which can be accessed and edited in multiple languages. Each data item has a unique identifier and is described through properties and values. Data from Wikidata is available under a public domain license, allowing for reuse in various contexts.
Process: Wikidata organizes information into items identified by unique codes prefixed with "Q" (e.g., Q42). Each item is described through statements consisting of properties (e.g., "P69" for "educated at") and values. Data is contributed by human editors and automated bots. Edits made in any language are immediately reflected across all languages. Information is maintained and updated by contributors globally.
Impact and Outcomes: Wikidata supports Wikimedia projects like Wikipedia by reducing duplication of effort and centralizing structured data. The database is also used by external applications and services for data retrieval and integration. As of 2024, Wikidata contained over 100 million data items, facilitating broader access to structured and linked data.
Location: Global
Region: GLOBAL

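The item/property model described above maps directly onto SPARQL. A minimal query against the public endpoint, reusing the entry's own examples (item Q42, property P69); the User-Agent string is a placeholder you should replace per Wikimedia's API etiquette.

```python
import requests

SPARQL = "https://query.wikidata.org/sparql"
# Where was Douglas Adams (Q42) educated (P69)?
query = """
SELECT ?schoolLabel WHERE {
  wd:Q42 wdt:P69 ?school .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(
    SPARQL,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "data-commons-survey/0.1 (example contact)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["schoolLabel"]["value"])
```
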
Name: Google's DataGemma
Link: https://blog.google/technology/ai/google-datagemma-ai-llm/
Data Types: Statistical data, real-world numerical data
Start Date: 2024
Description: DataGemma is an open model developed by Google to address the issue of inaccuracies in large language model outputs, often referred to as hallucinations. It integrates real-world statistical information from Google's Data Commons into language models to improve factual accuracy and reasoning. Data Commons serves as a repository of publicly available data sourced from organizations such as the United Nations, the World Health Organization, and the Centers for Disease Control and Prevention. The model utilizes structured data for enhanced language model responses.
Process: DataGemma employs two methodologies to enhance the factuality of language model outputs. The Retrieval-Interleaved Generation (RIG) approach queries trusted sources during response generation, enabling fact-checking against Data Commons data. Retrieval-Augmented Generation (RAG) retrieves contextually relevant data before response generation to reduce inaccuracies. These processes aim to connect statistical data with model outputs, improving reliability.
Impact and Outcomes: Initial studies suggest that integrating DataGemma into language models improves accuracy when handling numerical and statistical data. This integration is intended to minimize errors in applications requiring precise information, such as research and decision-making processes.
Location: Mountain View, California
Region: NA

Name: OpenML
Link: https://www.openml.org/
Data Types: Machine learning datasets, metadata, algorithm flows
Start Date: 2013
Description: OpenML is a collaborative platform designed for sharing datasets, algorithms, and machine learning experiments. The platform standardizes data formatting and metadata, enabling researchers to work with datasets directly in their machine learning environments. It records detailed experiment information, including data, models, pipelines, and settings, supporting reproducibility and transparency. OpenML organizes its resources into datasets, tasks, flows, and runs, providing a structured approach to managing machine learning projects.
Process: Datasets are uniformly formatted and accessible through APIs. Tasks combine datasets with machine learning problems and evaluation methods. Flows describe machine learning algorithms, including their configurations and dependencies, while runs document the outcomes of experiments, including model performance and evaluation metrics. OpenML facilitates integration with machine learning tools, allowing for the seamless sharing and retrieval of data and results.
Impact and Outcomes: OpenML supports research by providing a repository of datasets and experimental results that can be reused and compared. It allows researchers to evaluate models across various tasks and datasets, fostering collaboration and the development of reproducible workflows. The platform has enabled the creation of numerous machine learning studies and comparisons across different environments.
Location: Global
Region: GLOBAL

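The dataset/task/flow/run structure is exposed through the `openml` Python client. A sketch retrieving one dataset; the numeric id (61, assumed to be the classic Iris table) is an example to verify on the site.

```python
import openml  # pip install openml

dataset = openml.datasets.get_dataset(61)   # assumed id: the classic Iris table
X, y, categorical, names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(dataset.name, X.shape, list(y.unique()))
```
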
Name: Therapeutics Data Commons
Link: https://tdcommons.ai/
Data Types: Biomedical data, machine learning tasks, molecular data
Start Date: 2022
Description: Therapeutics Data Commons (TDC) is a resource designed to support research in drug discovery and development by providing curated machine learning tasks, datasets, and benchmarks. TDC covers various therapeutic modalities, including small molecules, antibodies, peptides, and gene editing therapies. It includes datasets and tools for tasks such as target discovery, activity modeling, efficacy and safety evaluations, and manufacturing processes.
Process: TDC organizes machine learning tasks and datasets into structured categories for therapeutic research. It provides tools for model evaluation, dataset splitting, data processing, and molecule generation. The datasets are standardized and accessible through a Python library, enabling integration into existing workflows. Model evaluation is supported through metrics to assess performance and out-of-distribution generalization.
Impact and Outcomes: TDC enables researchers to benchmark AI models and explore machine learning applications in therapeutic science. Its datasets and tools support the development of models aimed at identifying drug targets, optimizing therapeutic properties, and designing molecules for drug discovery.
Location: Global
Region: GLOBAL

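The Python library mentioned above wraps each task family in its own class. A sketch loading one ADME benchmark; the dataset name "Caco2_Wang" is an example to check against the current catalog.

```python
from tdc.single_pred import ADME  # pip install PyTDC

data = ADME(name="Caco2_Wang")    # example benchmark name (verify in the catalog)
split = data.get_split()          # dict of train/valid/test pandas DataFrames
print(split["train"].head())      # columns: Drug_ID, Drug (SMILES), Y
```
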
Name: Arxiver
Link: https://huggingface.co/datasets/neuralwork/arxiver
Data Types: Open research
Start Date: 2024
Description: Arxiver is a collection of around 63,000 research papers published on arXiv. Developed by Neuralwork (a technology and research company), Arxiver includes papers, URLs, authors, abstracts, and other information for papers published between January and October 2023. The dataset aims to support AI applications that provide natural language responses to text queries and summarize information.
Process: The dataset was created using Meta AI's text recognition model (Nougat). The team subsequently cleaned the data using several processing steps.
Impact and Outcomes: Arxiver was cited as one of the top liked datasets of 2024 in Hugging Face's Open Source AI year in review.
Location: Global
Region: GLOBAL

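Because the repository id appears in the link above, the dataset can be loaded directly. A small sketch; the `abstract` field name is an assumption taken from the dataset card.

```python
from datasets import load_dataset

papers = load_dataset("neuralwork/arxiver", split="train")
# Naive keyword scan over abstracts (field name assumed; check the card).
hits = papers.filter(lambda p: "data commons" in p["abstract"].lower())
print(len(hits), "matching papers")
```
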
Name: SmolTalk
Link: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
Data Types: Synthetic data
Start Date: 2024
Description: SmolTalk is a collection of new and existing publicly available synthetic data that can be used for LLM training and fine-tuning. The dataset was developed to support text editing, writing, and summarizing capabilities within AI applications, and it also aims to enhance mathematics and coding capabilities.
Process: The new data was developed using Distilabel (an approach for creating synthetic data).
Impact and Outcomes: SmolTalk was used to develop the SmolLM2-Instruct AI models, and it was cited as one of the top liked datasets of 2024 in Hugging Face's Open Source AI year in review.
Location: Global
Region: GLOBAL

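The records follow the common chat format of role/content message lists, which is what instruction-tuning pipelines consume. A sketch; the "all" config name and the `messages` field are assumptions taken from the dataset card.

```python
from datasets import load_dataset

talk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
sample = next(iter(talk))
for turn in sample["messages"]:          # assumed chat-format field
    print(turn["role"], ">", turn["content"][:80])
```
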
Name: FinePersonas
Link: https://huggingface.co/datasets/argilla/FinePersonas-v0.1
Data Types: Personas
Start Date: 2024
Description: FinePersonas is a collection of 21 million persona descriptions that can be used to generate synthetic data. The personas describe individual characteristics such as personal goals and occupation. The dataset can help integrate more personas into AI applications and generate text responses that reflect a broader range of perspectives.
Impact and Outcomes: FinePersonas was cited as one of the top liked datasets of 2024 in Hugging Face's Open Source AI year in review.
Location: Global
Region: GLOBAL

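A persona description is typically spliced into a generation prompt to steer synthetic data toward that perspective. A sketch of that pattern; the `persona` field name is an assumption from the dataset card, and the prompt text is a made-up example.

```python
from datasets import load_dataset

personas = load_dataset("argilla/FinePersonas-v0.1", split="train", streaming=True)
persona = next(iter(personas))["persona"]   # assumed field name

# Seed a synthetic-data prompt with the persona; any LLM client could consume it.
prompt = (
    f"You are the following person:\n{persona}\n\n"
    "Write a short forum post asking for advice about learning Python."
)
print(prompt)
```
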
Name: TxT360
Link: https://huggingface.co/datasets/LLM360/TxT360
Data Types: Common Crawl data, Wikipedia, open research
Start Date: 2024
Description: TxT360 combines 99 Common Crawl snapshots with other data sources for AI training. A key feature of TxT360 is the removal of duplicative data across these sources to support the weighting process.
Impact and Outcomes: TxT360 was cited as one of the top liked datasets of 2024 in Hugging Face's Open Source AI year in review.
Location: Global
Region: GLOBAL

Name: FineVideo
Link: https://huggingface.co/datasets/HuggingFaceFV/finevideo
Data Types: Video data, closed captioning
Start Date: 2024
Description: FineVideo includes around 40,000 openly licensed videos across domains such as education, science and technology, and politics. All videos were uploaded to YouTube under Creative Commons Attribution (CC-BY) licenses. The dataset includes video metadata, information from the videos' YouTube pages, and video transcripts supplied by YouTube Commons. The effort aims to support the training of new AI models. The videos were sourced from a diverse range of accounts and may include biased information.
Process: To access the dataset, users must agree to terms and conditions, including abiding by the video licenses. Users can, however, view specific data using the FineVideo Space.
Impact and Outcomes: FineVideo was cited as one of the top liked datasets of 2024 in Hugging Face's Open Source AI year in review.
Location: Global
Region: GLOBAL

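Given the dataset's size and gated access, streaming after accepting the terms is the practical route. A sketch that only inspects the schema rather than assuming field names:

```python
from datasets import load_dataset

# Gated dataset: accept the terms on the dataset page and authenticate
# (e.g. `huggingface-cli login`) before this will succeed.
videos = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
sample = next(iter(videos))
print(sorted(sample.keys()))   # inspect available fields (video bytes, metadata, ...)
```
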
Name: Open Humans
Link: https://www.openhumans.org/
Data Types: Health data, genetic data, activity data, GPS data, glucose monitor data, social media data, custom data files (personal and community-driven)
Start Date: 2015
Description: Open Humans is an online platform that allows individuals to access and share personal data (e.g., health, genetic, and social media data). The platform aims to support academic, participant-led, and citizen science research while enabling controlled access to data and user choice. Members can import, manage, and share data selectively with various projects, encouraging open data exploration and collaborative research.
Process: Members contribute personal data and opt in to share it with specific projects. Projects may request data access permissions and are subject to community review. For example, the Imputer project uses genetic data from Open Humans members to provide users with more comprehensive genetic information by estimating their genotype at additional locations not covered by typical direct-to-consumer genetic tests. The platform also provides tools for data analysis and visualization for individual and research use.
Impact and Outcomes: Over 10,500 members have joined, with 40 tools and activities operating on the platform. The platform has facilitated numerous research and citizen science projects through its data-sharing framework.
Location: USA
Region: GLOBAL

Name: LAION
Link: https://laion.ai/blog/laion-5b/
Data Types: Images with associated text descriptions, language data, and metadata for organization and content safety
Start Date: 2022
Description: LAION (Large-scale Artificial Intelligence Open Network) is a non-profit organization that creates and shares large datasets to support AI research. The LAION-5B dataset includes nearly 6 billion image-text pairs that can be used to train AI models for tasks like image recognition and language processing. It is publicly available, allowing researchers to freely access the dataset for AI development and research.
Process: LAION collects its data from the internet through a large web-scraping project called Common Crawl. It gathers images and their descriptions from webpages by looking at text that describes the images. In efforts to improve the quality of the data, LAION removes duplicate images and filters out irrelevant or low-quality image-description pairs using AI models that check whether the image and its description are related.
Impact and Outcomes: The dataset enabled research on and training of AI models, including CLIP, DALL-E, and GLIDE. CLIP (Contrastive Language-Image Pretraining) is an AI model that understands images and text together; it is trained to associate images with their descriptions, allowing it to perform tasks like image classification and generating captions for images. GLIDE (Guided Language to Image Diffusion for Generation and Editing) is another model that generates high-quality images from text prompts.
Location: Germany
Region: EMEA

Name: SpeakLeash
Link: https://huggingface.co/speakleash
Data Types: Text datasets for the Polish language
Start Date: 2023
Description: SpeakLeash, also known as Spichlerz, is an open-source project focused on building and cataloging datasets for developing Polish Large Language Models (LLMs). These datasets aim to advance natural language generation and processing capabilities in Polish and provide researchers with tools for AI research. SpeakLeash collaborates with international organizations such as BigScience and EleutherAI to integrate Polish datasets into global AI development initiatives.
Process: SpeakLeash collects, curates, and annotates text datasets in Polish, providing information on licensing and key dataset attributes. The project collaborates with global AI initiatives and uses open-source tools to refine the datasets, making them suitable for training and fine-tuning transformer-based models like GPT.
Impact and Outcomes: SpeakLeash has contributed to the development of PLLuM (Polish Large Language Universal Model), providing a resource for AI research and development in Poland. It has also supported various applications, from generating Polish-language content to research in fields like medicine and science.
Location: Poland
Region: EMEA

Name: The Public Interest Corpus (formerly the Public-Interest Book Training Commons)
Link: https://www.authorsalliance.org/2024/12/05/developing-a-public-interest-training-commons-of-books/
Data Types: Books, text data
Start Date: 2024
Description: The Public Interest Corpus (formerly known as the Public-Interest Book Training Commons) is an initiative by Authors Alliance, in collaboration with Northeastern University Library and supported by the Mellon Foundation. The project, currently in development, aims to create a plan for establishing a large-scale, structured book dataset for AI training. The goal is to determine whether to create a new organization or identify an existing one to manage the creation and stewardship of this dataset.
Process: The Authors Alliance team engages with authors to work toward ethical, collaborative use of copyrighted work. In the current planning phase, members of the development team are working on potential partnerships, legal and policy challenges, and business model ideas, as per a meeting in March 2025. The Public Interest Corpus will likely implement different business models for commercial and noncommercial use.
Location: United States
Region: NA

Name: Open Climate Fix
Link: https://openclimatefix.org/
Data Types: Climate data, satellite imagery
Start Date: 2019
Description: Open Climate Fix is a nonprofit organization that develops AI-driven solutions for energy grid management and climate forecasting. The initiative focuses on improving the integration of renewable energy sources into electricity grids by making energy and weather datasets more accessible. The project publishes datasets on numerical weather predictions, satellite imagery, and photovoltaic (PV) power generation to support AI applications in climate and energy research.
Process: Open Climate Fix collects and shares large-scale data about weather, energy production, and cloud patterns. This data, including weather forecasts, satellite images, and solar energy production information, is made available on platforms like Hugging Face, where anyone can access it. The initiative works with energy companies and research groups to develop AI models that improve solar power forecasting and predict changes in the weather.
Location: United Kingdom
Region: EMEA

Name: Open for Good
Link: https://openforgood.info/
Data Types: Localized data, AI training data
Description: Open for Good is an initiative developed by the Open for Good Alliance, which includes organizations like the Mozilla Foundation and Digital India Foundation. It aims to improve access to localized AI training data in regions like Africa and Asia, where there is a shortage of data about local languages, cultures, and contexts. The initiative works to collect, curate, and share publicly available datasets to address gaps in AI data representation for these regions. Open for Good also facilitates collaboration among organizations and the sharing of knowledge about AI data collection and usage.
Process: Open for Good brings together organizations to make AI training data accessible to researchers, developers, and organizations working with AI in regions like Africa and Asia. The initiative aims to improve how easily these users can find and use this data, while also supporting the maintenance of these datasets over time. Open for Good also facilitates discussions on how to collect and use AI data, in efforts to ensure the data is fair, unbiased, and relevant to local communities. The goal is to create a shared collection of AI training data that is available to those involved in AI development and research.
Location: Global
Region: GLOBAL

Name: GainForest.Earth
Link: https://gainforest.earth/
Data Types: Environmental data, remote sensing data
Start Date: 2017
Description: GainForest.Earth is a decentralized science nonprofit that archives global biodiversity and environmental data. The initiative facilitates community-owned data commons for biodiversity, aiming to enable local and Indigenous communities to collect, manage, and share environmental data. The data is designed to support conservation efforts and AI-driven environmental analysis.
Process: Local and Indigenous communities collect biodiversity and environmental data through AI-guided storytelling and decentralized governance. Data can be self-hosted or managed by the GainForest Data Council, which determines pricing models for data contributions. GainForest.Earth collaborates with global nature organizations and hosts hackathons to develop AI tools and data visualizations for conservation.
Location: Switzerland
Region: EMEA

Name: Posmo
Link: posmo.coop
Data Types: Geospatial data, GPS sensor data, mobility data
Start Date: 2020
Description: Posmo is a data cooperative that collects and manages mobility data, which includes GPS and sensor-based movement information about how people travel through cities. This data is used in urban planning and transportation research. Contributors share their mobility data, which is governed collectively rather than by a single company. Machine learning is used to analyze travel patterns, and anonymized data (information processed to remove or obscure personally identifiable details, ensuring that individual contributors cannot be identified) is provided to researchers, policymakers, and organizations working on transportation and climate-related projects. The data is also used in AI applications, such as improving traffic models and public transit planning.
Process: Users contribute mobility data through the Posmo One and Posmo Project apps, which collect GPS and sensor data and analyze mobility patterns in real time using machine learning. Before data is shared with public agencies, researchers, and policymakers, identifiable details are removed to protect privacy. Data use is reviewed by the Ethics Council, which sets guidelines for how the data can be accessed and applied. Organizations can participate in data collection projects and access Posmo's shared data pool for research and urban planning.
Impact and Outcomes: Posmo collects mobility data for urban planning, transportation research, and climate adaptation. A pilot project with the City of Zurich gathers anonymized data from residents to analyze cycling patterns and environmental factors. Researchers, government agencies, and policymakers use this data to study travel behavior and develop transportation policies. The VelObserver project, launched in 2022, collects resident assessments of Zurich's cycling infrastructure through a digital platform; the data identifies weak points in the cycling network and is shared with city administrations, and discussions are ongoing to expand the project to other municipalities. The Data Donation for Public Benefits project (2022-2023) studies how voluntary mobility data sharing supports urban planning and climate policies: residents submit anonymized data through the Posmo Project app, and the project is managed with partners including the Risk Dialogue Foundation and the University of Zurich.
Location: Switzerland
Region: EMEA

Name: Institutional Books 1.0
Link: https://www.institutionaldatainitiative.org/institutional-books
Data Types: Public domain books
Start Date: 2025
Description: The Institutional Data Initiative (IDI), an initiative at Harvard Law School Library, published a dataset of almost one million public domain books for AI training. The digitization of these books began in 2006 as part of the Google Books Project. Of the 1,075,000 books that were scanned for this project, 983,000 books are public domain and are published as the Institutional Books 1.0 dataset. This dataset, supported by Microsoft and OpenAI, can be used for training or evaluating LLMs, especially for multi-language processing or tasks that may involve historical language. With the release of the dataset, the team published a report documenting their data collection and processing methods.
Location: United States
Region: NA
Contact info and other notes: contact@institutionaldatainitiative.org

Name: FineWeb 2 and FineWeb-Edu
Link: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Data Types: Common Crawl textual data
Start Date: 2024
Description: FineWeb 2 is a successor to the FineWeb dataset and is publicly available on Hugging Face. FineWeb 2 offers pretraining data in over 1,000 languages, resulting in performance increases for some languages. FineWeb-Edu is a dataset of educational web pages filtered from the FineWeb dataset.
Process: The data comes from 96 Common Crawl snapshots. In FineWeb 2, the data is deduplicated per language, whereas FineWeb's data was deduplicated per Common Crawl snapshot.
Impact and Outcomes: According to a report by the development team, models performed better for 11 out of 14 tested languages when trained on FineWeb 2 than when trained on other multilingual training datasets.
Location: Global
Region: GLOBAL

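Each language lives in its own config, so training pipelines usually stream one config at a time. A sketch; the config naming scheme (ISO 639-3 plus script, e.g. "swh_Latn" for Kiswahili) is an assumption to verify on the dataset card.

```python
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="swh_Latn",        # assumed per-language config code
    split="train",
    streaming=True,
)
for doc in fw.take(2):
    print(doc["text"][:120].replace("\n", " "))
```
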
Name: European Open Science Cloud
Link: https://eosc.eu/eosc-about/
Data Types: Multi-domain research data
Start Date: 2025
Description: The European Open Science Cloud (EOSC) is a platform initiated by the European Commission that makes research data from across disciplines accessible and re-usable. The goal is to provide a widely used network of FAIR data. The EOSC will also offer tools and services to support scientific research and machine-ready data.
Process: In 2024, the EOSC became a co-programmed European Partnership, which came with nearly 500 million euros in funding and a long-term partnership with the European Commission and its partners. The EOSC now operates under the EOSC Tripartite Governance model, consisting of the European Commission, the EOSC Association, and the EOSC Steering Board. The EOSC Association represents the European research community and is made up of participating members and observers; the EOSC Steering Board consists of EU member states and associated countries.
Impact and Outcomes: The AI4EOSC Platform is an initiative funded by the European Union's Horizon Europe program that seeks to develop AI applications that use the EOSC. Using its data and AI services, researchers across the EU can train and develop machine learning models. For instance, AI4OS-LLM, a tool released by AI4EOSC, allows users and developers from EU research institutions to deploy their own LLMs and access text generation, summarization, translation, and chatbot features without sharing private data.
Location: European Union
Region: EMEA

Name: Medical Imaging & Data Resource Center (MIDRC) Data Commons
Link: https://data.midrc.org/
Data Types: Medical data, medical imaging data
Start Date: 2024
Description: The Medical Imaging & Data Resource Center (MIDRC) facilitates and curates a data commons of medical imaging data, patient demographics and outcomes, and other clinical data. The MIDRC Data Commons, a subset of the MIDRC data, provides researchers with AI-ready data that can be easily searched, filtered, and freely accessed via an online portal. MIDRC also offers an AI reliability tool to support researchers in the creation of AI/ML models.
Process: MIDRC accepts contributions of image data from academic medical centers, community hospitals, private practices, and more through two data intake portals: the American College of Radiology Clinical Imaging Research Registry and the Radiological Society of North America. These data collection pathways manage the organization, de-identification, and transfer of the data. Once the data is on the MIDRC website, it is free to use by registered users under the data use agreement.
Impact and Outcomes: MIDRC was selected in 2024 by the National Science Foundation to participate in a two-year pilot program called the National Artificial Intelligence Research Resource (NAIRR) pilot. This initiative seeks to advance ethical AI for research and discovery.
Location: USA
Region: NA
Contact info and other notes: https://www.midrc.org/midrc-contact

Name: Data Vatika - Bhashini
Link: https://bhashini.gov.in/vatika
Data Types: Language data
Start Date: 2022
Description: Data Vatika is a hub of high-quality language datasets that can be used to train AI models. Data Vatika is part of the Bhashini project, an initiative by India's Ministry of Electronics and Information Technology to improve data accessibility, AI-powered translation services, and open source AI model development. Part of this project includes BhashaDaan, a crowdsourced repository of content meant to expand the available data for all 22 of India's official languages. BhashaDaan allows citizens to donate or validate voice/speech contributions, audio transcriptions, translations, or transcriptions of scanned text images. This project increases the available data for lower-resource languages that have less content on which AI models can train.
Impact and Outcomes: Bhashini has developed several AI-based services, including Chitaanuvaad, an open-source platform for AI-powered video translation, and Lekhaanuvaad, which offers AI document translation and digitization.
Location: India
Region: APAC

Name: Norwegian Colossal Corpus
Link: https://aclanthology.org/2022.lrec-1.410/
Data Types: Textual data for the Norwegian language
Start Date: 2022
Description: The Norwegian Colossal Corpus comprises vast amounts of publicly available textual data in Norway's two official written languages, Norwegian Bokmål and Norwegian Nynorsk. The corpus includes out-of-copyright books and newspapers from the National Library of Norway, public documents, online newspapers, and Wikipedia. The corpus can serve as training data for Norwegian language models, greatly expanding the relatively limited existing Norwegian textual data.
Process: The National Library of Norway has worked on digitization projects since 2006. This textual data, such as digitized newspapers and books, made up a large portion of the Norwegian Colossal Corpus.
Impact and Outcomes: The Norwegian Colossal Corpus has been used as part of the training data for generative AI models like Sweden's GPT-SW3 and the Barcelona Supercomputing Center's ALIA-40b.
Location: Norway
Region: EMEA

Name: Royal Spanish Academy's Data Bank
Link: https://www.rae.es/banco-de-datos
Data Types: Multimodal language data (text, audio, video)
Start Date: 2025
Description: The Royal Spanish Academy has developed a collection of Spanish language corpora that can be used for AI training. These include the CREA (Corpus of Contemporary Spanish) of written and oral data, the CORPES XXI of 21st-century Spanish, the CDH (Historical Dictionary of the Spanish Language) with text from across nine centuries, and the CORDE (Corpus Diacrónico del Español). These corpora include linguistic annotations that can help models learn grammatical structure. Additionally, the datasets represent Spanish from all Spanish-speaking countries, making them a robust form of AI training data.
Process: The digitization of these texts began in 1995. The texts are openly accessible via the Royal Spanish Academy's web tools, and the vast majority of materials are public domain.
Location: Spain
Region: EMEA
Contact info and other notes: corpus@rae.es

Name: OpenPLACSP
Link: https://contrataciondelestado.es/wps/portal/!ut/p/b1/jY_LDoIwEEW_hS_o0FfaZSmW1hCVkKJ0Y1gYg-GxMX6_1bAFnd3NOTczgwJqU8CUEo4JQxcUpu7V37tnP0_d8MmBX7NaCJWlCoB4AYpXlZY5YJAkCm0U6O6otbEYRE0iKHPvuY2xwN_-KmZLH1ZGwX_9DeHH_WcUtlfgRWBE02bfnHjtCgBnTV76lEXOF2Hjh4Odxxsaw2CMdA-qkuQNtiCnSw!!/dl4/d5/L2dJQSEvUUt3QS80SmtFL1o2X0sxQzhBQjFBMEdBUjUwUUpJR1FDMTRKSDY3/
Data Types: Public procurement data
Start Date: 2021
Description: Spain's Public Sector Procurement Platform (Plataforma de Contratación del Sector Público, or PLACSP) provides access to a network of open datasets relating to contracting bodies (e.g., governments looking to purchase goods or services) and their tenders. Many of these open datasets are accessible on the Ministry of Finance's Open Data Portal. OpenPLACSP is a tool to help manipulate and aggregate the various open data sources. This data helps governments, public bodies, and their tenders more effectively navigate public procurement, and the open data can be used for AI tools such as Tendios.
Process: The open datasets come from information published on the Public Sector Procurement Platform. For example, OpenPLACSP provides a dataset of tenders based on the tenders published in the contracting profiles on the public sector contracting platform (PLACSP). Other data includes contracting profiles of the contracting bodies hosted on the platform and preliminary market inquiries published in the contracting profiles. This network of open data allows AI platforms such as Tendios to provide analyses of bidding trends, streamlined exploration of public procurement data, and chatbots trained on the data.
Location: Spain
Region: EMEA

Name: Data Foundry Scotland
Link: https://data.nls.uk/
Data Types: Digitized library collections
Start Date: 2019
Description: Data Foundry Scotland is an open data platform from the National Library of Scotland that makes its digital collections available in machine learning-ready formats. The data includes sources like digitized archival books, newspapers, and historical military lists. The platform provides metadata and quality assurance, with special attention given to cultural heritage data.
Process: There are future plans to create APIs for accessing Data Foundry datasets.
Impact and Outcomes: The Data Foundry is used for various projects, including an upcoming text and data mining platform.
Location: Scotland
Region: EMEA
Contact info and other notes: digital.scholarship@nls.uk

Name: AI4Culture
Link: https://ai4culture.eu/
Data Types: Image data, translations, transcriptions, speech data
Start Date: 2023
Description: AI4Culture, a public platform developed under the Digital Europe Programme of the European Union, offers both AI tools and open datasets for training AI. Components on the platform are interoperable with the common European data space for cultural heritage. The open datasets include verified translations, transcriptions of scanned handwritten documents, European artwork classification data, and 950,000 hours of speech data. The platform accepts contributions of datasets for review, and its tools and data can be used for AI-generated translations of cultural heritage metadata, multilingual subtitle generation, and multilingual text recognition in scanned documents.
Process: The project ran from April 2023 to March 2025 and was co-funded by the European Union. The platform remains openly accessible even though the project period has ended.
Location: European Union
Region: EMEA

Name: EUCAIM
Link: https://cancerimage.eu/what-we-do/
Data Types: Cancer-related imaging
Start Date: 2023
Description: Cancer Image Europe (EUCAIM) is a platform for sharing de-identified cancer-related medical imaging. The platform engages clinical data providers, researchers, and industry to create an Atlas of Cancer Images for the development of AI tools.
Process: EUCAIM combines and builds on existing cancer image repositories of the AI4HI initiative.
Impact and Outcomes: At least 50 AI algorithms, tools, and prediction models will be deployed within the infrastructure by 2026.
Location: European Union
Region: EMEA