Big Science

Data Governance

Information for Data Custodians

What is Big Science?

The "Big Science" project is a year-long workshop in which over 900 researchers from 60 countries work together to better understand Large Language Models (LLMs), a family of AI systems trained on large amounts of data to learn statistical properties of language.

Our goal is to improve the scientific understanding of the capabilities and limitations of LLMs by creating both a transparent multilingual language corpus and a large language model open to the scientific community.

Why LLMs? LLMs have dramatically changed the state of the art in AI and are now adopted in many contemporary language technologies, from internet search to translation to story-writing. However, the increasing cost and computational demands of training these models have driven work on them out of the reach of all but a few of the largest technology companies.

Although these models have a significant impact on modern society, the few companies capable of advancing LLMs are private, necessarily operating without public input or visibility. Such companies cannot detail or give access to the data used to train these models – meaning that the people affected by these models cannot know what the models have learned.

How can we use your help? We are hoping you will join us to provide text data for the LLM to learn from, and/or to help make the data accessible once the model is publicly available at the end of May.

With your help, we can create the first international data network with diverse data available for both the LLM and researchers to learn from.

Data Custodians

Individuals and institutions involved in making the data available are Data Custodians. A Data Custodian can be either a Data Host or a Data Provider (or both).

Data Providers

Individuals or institutions who have text, image, or audio data that they can make available are Data Providers. Data Providers can share either public/open-domain data, or data that they have the rights to. Currently, our Data Providers are internet archival institutions, private companies, non-profit organizations, and national libraries.

To be a Data Provider, you simply need to have text data that can be accessed in a controlled setting by researchers who agree to licensing terms.

Data Hosts

Data Hosts make the data available for analysis. A Data Host may simply be an organization with a server where the data can live. Depending on the data sources and the Data Provider, this data may be subject to additional access controls, such as restricting access to individuals who hold the appropriate access code and have signed the required agreements.

No data in the governance structure will be available for an individual to download directly. Data Hosts and Providers remain free to do whatever they’d like with the data outside of the governance structure; within it, for the purposes of this experiment, data can only be shared in ways that are trackable.

Examples of Data Use

Data within the governance structure is intended for data exploration as well as for training, fine-tuning, and evaluation of the Large Language Model. If the Data Host and Data Provider agree, further data derivatives may also be produced.

Data Exploration in Data Governance includes examining:

Statistics of words
Topics covered in the data
Risks of personal identification
Relationship between inputs and outputs
Naturalness of the language
Biases and stereotypes
Targeted excerpts

Number of examples containing a given word
Cluster/topic exemplars

Exploration results are presented as aggregated statistics and limited excerpts through a visualization tool.
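As a rough illustration of what aggregated statistics and limited excerpts could look like in practice, here is a minimal Python sketch. It is not the project's tooling: the toy corpus and the function names (`word_statistics`, `examples_containing`, `limited_excerpts`) are hypothetical, and the excerpt limits are arbitrary values chosen for the example.

```python
import re
from collections import Counter

# Hypothetical toy corpus; in the governance structure the real data
# would stay with the Data Host and would not be downloadable.
EXAMPLES = [
    "The library archives public web pages.",
    "Researchers study language models.",
    "The archive hosts public language data.",
]

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_statistics(examples):
    """Count how often each word occurs across all examples."""
    counts = Counter()
    for text in examples:
        counts.update(tokenize(text))
    return counts

def examples_containing(examples, word):
    """Return the number of examples that contain the given word."""
    word = word.lower()
    return sum(1 for text in examples if word in tokenize(text))

def limited_excerpts(examples, word, max_excerpts=2, window=3):
    """Return at most max_excerpts short snippets around the given word."""
    word = word.lower()
    excerpts = []
    for text in examples:
        if len(excerpts) >= max_excerpts:
            break  # never expose more than the allowed number of excerpts
        tokens = tokenize(text)
        if word in tokens:
            i = tokens.index(word)
            excerpts.append(" ".join(tokens[max(0, i - window): i + window + 1]))
    return excerpts
```

A researcher-facing tool built along these lines would return only the counts and the capped snippets, never the underlying documents, which matches the rule that data cannot be downloaded directly.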

Further Information

Email: bigscience-data-governance@googlegroups.com

Data Governance Presentations

Slide deck from Data Governance Update, Big Science Episode #1

Slide deck from Data Governance Update, Big Science Episode #2

Example License

Example DATA PROVIDER-HOST AGREEMENT

Academic Paper

Accepted to FAccT 2022

Articles and Websites

The BigScience Workshop

10 January 2022: Inside Big Science, the quest to build a powerful open language model

6 January 2022: Behind HuggingFace's Big Science Project that crowdsources research on large language models

14 July 2021: NLP needs to be open. 500+ researchers are trying to make it happen | VentureBeat

18 January 2021: The race to understand the thrilling, dangerous world of language AI | MIT Technology Review

What is Data Governance?

A core part of this project focuses on creating an experimental Data Governance structure for the text data used to train and evaluate the model.

There is a wealth of language sources to draw from to meet the needs of this technology. However, how to give due consideration to the ethical concerns of using this data remains an open problem. These concerns include transparency (the ability to examine the data), anonymization (no personally identifying information), and consent (from those who hold the rights to the data).

To this end, we are exploring the feasibility of working with a small network of organizations – hopefully you – who are themselves interested in working on aspects of ethical data governance and can help develop tools and protocols for data collection, hosting, and management. Together, these organizations serve as “Data Custodians” for the Big Science project.

Below, we sketch a high-level overview of the entities involved in Data Governance. This includes the Data Modelers (the BigScience participants), the legal scholars and rights advocates working with us, the Data Rights holders, and the Data Custodians. Together, we coordinate via an umbrella “Data Stewardship Organization”.