1 of 6

طائر المظلة

長耳垂傘鳥

Data Stewardship Organization

(“umbrellabird.ai”)

Regional Partner

🌎

Regional Partner

🌍

Regional Partner

🌏

Regional Partner

🌏

Regional Partner

🌍

BigScience

🤖

Interface for documentation & tools
Defines/approves ethical standards
Decides whether to steward a dataset
Can sometimes provide compute for non-experimental (final) training
Engages stakeholders

Gathers dataset to be stewarded
Decides whether to host a dataset
Applies ethical standards
Collaborates on tools
Can sometimes provide compute for final training

Sources data
Defines formats
Creates tools and documentation
Provides technical support
Suggests ethical standards
Connects RPs to necessary parties to apply the ethical standards

Committee with 🤗, 🤖, 🌎🌍🌏, other stakeholders

Further described in this doc.

🌸

2 of 6

طائر المظلة

長耳垂傘鳥

Data Stewardship Organization

(“umbrellabird.ai”)

Regional Partner

🌎

Regional Partner

🌍

Regional Partner

🌏

Regional Partner

🌏

Regional Partner

🌍

Interface for documentation & tools
Defines/approves ethical standards
Decides whether to steward a dataset
Can sometimes provide compute for non-experimental (final) training
Engages stakeholders

Gathers dataset to be stewarded
Decides whether to host a dataset
Applies ethical standards
Collaborates on tools
Can sometimes provide compute for final training

Committee with 🤗, 🤖, 🌎🌍🌏, other stakeholders

Further described in this doc.

Data Source

(e.g., Le Monde)

3 of 6

Data Stewardship Organization - What is it?

It’s a bird! It’s a plane! It’s... a data stewardship organization!

Similar work in the US has construed these as “Data Trusts”

See this article detailing more about this.

But a “Trust” may not be the legal entity we want. Could also be:

Non-Profit, 501(c)(3) ← a Trust can be this too
LLC
Public Benefit Corporation
Other model

Key questions: Who makes the decisions, and how are they made? Who takes on liability in case of issues? What does the DSO “hold”/own?

4 of 6

Ethical Distinctions that BigScience is Already Adopting

Principles

Licensing and Attribution. “Right to legal controls”
Anonymity/Privacy. “Right to privacy”
Benevolence. “Right to just treatment”
Autonomy. “Right to autonomy”. This includes:

Consent.
Contestation.

Inclusion/Representativeness. “Goal of diverse data”

Parties

People represented in data
People affected by models
People involved in dataset creation

Datasets

5 of 6

Ethical Distinctions that BigScience is Already Adopting

Licensing and Attribution: People represented in data have a “Right to controls”
Anonymity/Privacy: People represented in data have a “Right to privacy”.
Benevolence: People affected by models trained on datasets and people involved in dataset creation have a “Right to just treatment” (non-malicious use and equitable treatment, respectively).

Autonomy: People in all aspects of datasets have a “Right to autonomy”. This includes:

Consent: People involved in dataset creation should have informed consent
Contestation: People represented in data can have their data removed or anonymized

Inclusion/Representativeness: Datasets aim to maximally represent the diversity of human language uses. What this means is further refined by the other ethical considerations we assert.

6 of 6

Ethical Distinctions that BigScience is Already Adopting

Licensing and Attribution: Datasets abide by the licenses of the individual instances within the data. For example, if a dataset contains a poem under a Creative Commons license that requires author attribution when the poem is used, that requirement will be recorded as metadata for that instance. This might be categorized as a “Right to controls”.
Anonymity/Privacy: Individuals represented in datasets can be harmfully targeted, e.g., by their governments, based on their political beliefs, gender, or sexual orientation. Datasets must not infringe on individuals’ privacy in this way without informed consent from the individual. This is the “Right to privacy”.
Benevolence: A dataset will not be supported when a primary use of a model trained on it would be for malicious purposes (e.g., hate speech generation).

This is related to, but different from, the dual-use issue, where a dataset can be used for both “good” and “bad” purposes. In these cases, whether to make the dataset available can be considered with respect to the other ethical considerations defined here.

Autonomy: All people involved have a “right to autonomy”. This includes:

Consent: Informed consent from data creators/collectors/controllers, and from those who are uniquely represented (PII) in the dataset. See the distinction on data roles in the doc on data stakeholders.
Contestation: Individuals with data in the dataset will be able to request that their data be removed or anonymized. They should also be able to find out relatively easily whether they are in the data. This is related to the “Right to privacy” or “Right to anonymity”.

Inclusion/Representativeness: Datasets aim to reflect the diversity of human language uses. What this means is further refined by the other ethical considerations we assert, such as the right to anonymity/privacy above.

An axis that we particularly want to focus on is geographical diversity. See the related doc on Diversity Criteria.
Part of inclusiveness is the “Right to participate” and access to the culture embodied in the datasets, including education (balanced against other rights). This means that regional orgs should be able to use the datasets they gather, especially those from their own region, to educate and to preserve their culture. An LLM encodes that culture in the type of language it might output, and the regional groups should be able to use the LLMs to exercise this right.
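
As a rough illustration of two of the principles above (per-instance license metadata under Licensing and Attribution, and removal/anonymization requests under Contestation), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Instance` record, its field names, the `handle_contestation` helper, and the `[REDACTED]` placeholder are hypothetical and are not part of any BigScience tooling. Real anonymization would need far more than replacing the text.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    """One item in a dataset, with license metadata attached (hypothetical schema)."""
    instance_id: str
    text: str
    license: str                       # e.g. "CC-BY-4.0" (illustrative value)
    attribution: Optional[str] = None  # author credit, if the license requires it
    subject_id: Optional[str] = None   # hypothetical key for the person represented


def handle_contestation(dataset: List[Instance], subject_id: str, action: str) -> List[Instance]:
    """Return a new dataset with one person's request applied.

    action is "remove" (drop their instances) or "anonymize"
    (placeholder redaction: blank the text and unlink the person).
    """
    if action not in ("remove", "anonymize"):
        raise ValueError(f"unknown action: {action}")
    result = []
    for item in dataset:
        if item.subject_id != subject_id:
            result.append(item)  # not this person's data: keep unchanged
        elif action == "anonymize":
            # License and attribution metadata are kept; only the link
            # to the represented person and the text itself are cleared.
            result.append(replace(item, text="[REDACTED]", subject_id=None))
        # action == "remove": the instance is simply not copied over
    return result


dataset = [
    Instance("1", "a poem", "CC-BY-4.0", attribution="A. Poet", subject_id="p1"),
    Instance("2", "a note", "CC0-1.0", subject_id="p2"),
]
after_removal = handle_contestation(dataset, "p1", "remove")
after_anonymize = handle_contestation(dataset, "p1", "anonymize")
```

Keeping the license and attribution on each instance, rather than on the dataset as a whole, is what lets a mixed-license dataset honor each item's terms individually.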