Big Science

Data Governance

Information for Data Custodians

What is Big Science?

The "Big Science" project is a year-long workshop bringing together over 900 researchers from 60 countries to better understand Large Language Models (LLMs), a family of AI systems that are trained on considerable amounts of data to learn statistical properties of language.

Our goal is to improve the scientific understanding of the capabilities and limitations of LLMs by creating both a transparent multilingual language corpus and a large language model open to the scientific community.

Why LLMs? LLMs have dramatically changed the state of the art in AI and are now adopted in many contemporary language technologies, from internet search to translation to story-writing. However, the increasing cost and computational demands of training these models have driven work on them out of the reach of all but a few of the largest technology companies.

Although these models have a significant impact on modern society, the few companies capable of advancing LLMs are private, necessarily operating without public input or visibility. Such companies cannot detail or give access to the data used to train these models, meaning that those affected by the models are not able to know what the models have learned.

How can we use your help? We are hoping you will join us to provide text data for the LLM to learn from, and/or to help make the data accessible once the model is publicly available at the end of May.

With your help, we can create the first international data network with diverse data available for both the LLM and researchers to learn from.

Data Custodians

Individuals and institutions involved in making the data available are Data Custodians. A Data Custodian can be either a Data Host or a Data Provider (or both).

Data Providers

Individuals or institutions who have text, image, or audio data that they can make available are Data Providers. Data Providers can share either public/open-domain data, or data that they have the rights to. Currently, our Data Providers are internet archival institutions, private companies, non-profit organizations, and national libraries.

To be a Data Provider, you simply need to have text data that can be accessed in a controlled setting by researchers who agree to licensing terms.

Data Hosts

Data Hosts make the data available for analysis. A Data Host may simply be an organization with a server where the data can live. Depending on the data sources and the Data Provider, this data may be subject to additional access controls, such as exclusivity to individuals with the appropriate access code and signed agreements.

No data in the governance structure will be available for an individual to download directly. Data Hosts and Providers remain free to use the data as they wish outside of the governance structure; for the purposes of this experiment, however, data is shared within the structure only if its use can be tracked.

Examples of Data Use

Data within the governance structure is intended for data exploration as well as training, fine-tuning, and evaluation of the Large Language Model. If the Data Host and Data Provider agree, further data derivatives may also be produced.

Data Exploration in Data Governance includes examining:

  • Statistics of words
  • Topics covered in the data
  • Risks of personal identification
  • Relationship between inputs and outputs
  • Naturalness of the language
  • Biases and stereotypes
  • Targeted excerpts
    • Number of examples containing a given word
    • Cluster/topic exemplars
  • Aggregated statistics and limited excerpts through a visualization tool
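
As a rough illustration of what a few of the items above could look like in practice, the following Python sketch computes word statistics, counts the examples containing a given word, and produces a limited excerpt. It is a minimal sketch, not the project's actual tooling: the function names, the simple tokenizer, and the sample texts are all hypothetical and were chosen only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_statistics(documents):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def examples_containing(documents, word):
    """Return the number of documents that contain the given word at least once."""
    target = word.lower()
    return sum(1 for doc in documents if target in set(tokenize(doc)))


def limited_excerpt(document, word, window=3):
    """Return a few tokens around the first occurrence of the word, or None if absent."""
    tokens = tokenize(document)
    target = word.lower()
    if target not in tokens:
        return None
    i = tokens.index(target)
    return " ".join(tokens[max(0, i - window): i + window + 1])


# Hypothetical sample texts standing in for data held by a Data Host.
documents = [
    "The library holds many books.",
    "Books and newspapers are archived by the library.",
    "Audio recordings are stored separately.",
]

stats = word_statistics(documents)
print(stats.most_common(3))
print(examples_containing(documents, "library"))
print(limited_excerpt(documents[1], "archived", window=2))
```

Only aggregate counts and a short window of text are returned here, which mirrors the idea that researchers explore the data without downloading it directly.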

Further Information

What is Data Governance?

A core part of this project focuses on creating an experimental Data Governance structure for the text data used to train and evaluate the model.

There is a wealth of language sources to draw from to meet the needs of this technology. However, how to give due consideration to ethical concerns in using the data remains an open problem. These concerns include transparency (ability to examine the data), anonymization (no personally identifying information), and consent (from those who hold the rights to the data).

To this end, we are exploring the feasibility of working with a small network of organizations – hopefully you – who are themselves interested in working on aspects of ethical data governance and can help develop tools and protocols for data collection, hosting, and management. Together, these organizations serve as “Data Custodians” for the Big Science project.

Below, we sketch a high-level overview of the entities involved in Data Governance. This includes the Data Modelers (the BigScience participants), the legal scholars and rights advocates working with us, the Data Rights holders, and the Data Custodians. Together, we coordinate via an umbrella “Data Stewardship Organization”.