1 of 6

طائر المظلة

長耳垂傘鳥

Data Stewardship Organization

(“umbrellabird.ai”)

Regional Partner

🌎

Regional Partner

🌍

Regional Partner

🌏

Regional Partner

🌏

Regional Partner

🌍

BigScience

🤖

  • Interface for documentation & tools
  • Defines/approves ethical standards
  • Decides on whether to steward a dataset
  • Can sometimes provide compute for non-experimental (final) training
  • Engages stakeholders
  • Gathers dataset to be stewarded
  • Decides whether to host a dataset
  • Applies ethical standards
  • Collaborates on tools
  • Can sometimes provide compute for final training
  • Sources data
  • Defines formats
  • Creates tools and documentation
  • Provides technical support
  • Suggests ethical standards
  • Connects RPs to necessary parties to apply the ethical standards

Committee with 🤗, 🤖, 🌎🌍🌏, other stakeholders

2 of 6

طائر المظلة

長耳垂傘鳥

Data Stewardship Organization

(“umbrellabird.ai”)

Regional Partner

🌎

Regional Partner

🌍

Regional Partner

🌏

Regional Partner

🌏

Regional Partner

🌍

  • Interface for documentation & tools
  • Defines/approves ethical standards
  • Decides on whether to steward a dataset
  • Can sometimes provide compute for non-experimental (final) training
  • Engages stakeholders
  • Gathers dataset to be stewarded
  • Decides whether to host a dataset
  • Applies ethical standards
  • Collaborates on tools
  • Can sometimes provide compute for final training

Committee with 🤗, 🤖, 🌎🌍🌏, other stakeholders

Data Source

(e.g., Le Monde)

3 of 6

Data Stewardship Organization - What is it?

It’s a bird! It’s a plane! It’s... a data stewardship organization!

  • Similar work in the US has construed these as “Data Trusts”
  • But a “Trust” may not be the legal entity we want. Could also be:
    • Non-Profit, 501(c)(3) ← a Trust can be this too
    • LLC
    • Public Benefit Corporation
    • Other model
  • Key questions: Who makes the decisions, and how are they made? Who takes on liability in case of issues? What does the DSO “hold”/own?

4 of 6

Ethical Distinctions that BigScience is Already Adopting

Principles

  • Licensing and Attribution. “Right to legal controls”
  • Anonymity/Privacy. “Right to privacy”
  • Benevolence. “Right to just treatment”
  • Autonomy. “Right to autonomy”. This includes:
    • Consent.
    • Contestation.
  • Inclusion/Representativeness. “Goal of diverse data”

Parties

  • People represented in data
  • People affected by models
  • People involved in dataset creation
  • Datasets

5 of 6

Ethical Distinctions that BigScience is Already Adopting

  • Licensing and Attribution: People represented in data have a “Right to controls”
  • Anonymity/Privacy: People represented in data have a “Right to privacy”.
  • Benevolence: People affected by models trained on datasets and people involved in dataset creation have a “Right to just treatment” (non-malicious use and equitable treatment, respectively).

  • Autonomy: People in all aspects of datasets have a “Right to autonomy”. This includes:
    • Consent: People involved in dataset creation should have informed consent
    • Contestation: People represented in data can have their data removed or anonymized
  • Inclusion/Representativeness: Datasets aim to maximally represent the diversity of human language uses. What this means is further refined by the other ethical considerations we assert.

6 of 6

Ethical Distinctions that BigScience is Already Adopting

  • Licensing and Attribution: Datasets abide by the licenses of the individual instances within the data. For example, if a dataset contains a poem under a Creative Commons license that requires author attribution, that requirement will be recorded as metadata for that instance. This might be categorized as a “Right to controls”.
  • Anonymity/Privacy: Individuals represented in datasets can be harmfully targeted, e.g., by their governments, based on their political beliefs, gender, or sexual orientation. Datasets must not infringe on individuals’ privacy in this way without their informed consent. This is the “Right to privacy”.
  • Benevolence: A dataset will not be supported when a primary use of a model trained on it would be for malicious purposes (e.g., hate speech generation).
    • This is related to, but different from, the dual-use issue, where a dataset can be used for both “good” and “bad” things. In these cases, whether to make the dataset available can be considered with respect to the other ethical considerations defined here.
  • Autonomy: All people involved have a “right to autonomy”. This includes:
    • Consent: Informed consent from data creators/collectors/controllers, and from those who are uniquely represented (PII) in the dataset. See the distinction on data roles in the doc on data stakeholders.
    • Contestation: Individuals with data in the dataset will be able to request that their data be removed or anonymized. They should also have a relatively easy way to find out whether they are in the data. This is related to the “Right to privacy” or “Right to anonymity”.
  • Inclusion/Representativeness: Datasets aim to reflect the diversity of human language uses. What this means is further refined by the other ethical considerations we assert, such as the right to anonymity/privacy above.
    • An axis that we particularly want to focus on is geographical diversity. See the related doc on Diversity Criteria.
    • Part of inclusiveness is the “Right to participate” and access to the culture embodied in the datasets, including education (balanced against other rights). This means that regional orgs should be able to use the datasets they gather, especially those from their own region, to educate and to preserve their culture. An LLM encodes that culture in the type of language it might output, and the regional groups should be able to use the LLMs to exercise this right.
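
As a rough illustration of how two of the principles above might look in practice, the sketch below records license and attribution as per-instance metadata and applies a contestation request (remove or anonymize). Everything in it is an assumption made for illustration: the `Instance` schema, the field names, the "CC-BY" prefix check, and the `[REDACTED]` placeholder are hypothetical and are not part of any BigScience tooling described in these slides. Real anonymization would need far more than a placeholder swap.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    """One dataset item with its metadata (hypothetical schema)."""
    instance_id: str
    text: str
    license: str                        # assumed SPDX-style id, e.g. "CC-BY-4.0"
    attribution: Optional[str] = None   # author to credit, if the license asks for it
    subject_id: Optional[str] = None    # person represented in the item, if known


def missing_attribution(instance: Instance) -> bool:
    """True when the license looks like it requires attribution but none is recorded.

    Assumption: attribution-requiring licenses are identified by a "CC-BY" prefix.
    """
    return instance.license.startswith("CC-BY") and not instance.attribution


def contest(dataset: List[Instance], subject_id: str, action: str) -> List[Instance]:
    """Apply a contestation request for one person: 'remove' or 'anonymize'."""
    if action not in ("remove", "anonymize"):
        raise ValueError(f"unknown action: {action}")
    result = []
    for inst in dataset:
        if inst.subject_id != subject_id:
            result.append(inst)  # not about this person: keep unchanged
        elif action == "anonymize":
            # Placeholder redaction only; a stand-in for a real anonymization step.
            result.append(replace(inst, text="[REDACTED]", subject_id=None))
        # action == "remove": drop the instance entirely
    return result
```

A steward could run `missing_attribution` over incoming instances before deciding whether to host a dataset, and call `contest` when an individual asks for their data to be removed or anonymized.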