Should I disclose my dataset?

Legal and ethical considerations for researchers dealing with court documents

Raysa Benatti (1, 2)

(1) University of Tübingen, Germany

(2) Institute of Computing, University of Campinas, Brazil

ACM Conference on Fairness, Accountability, and Transparency (FAccT)

Rio de Janeiro, Brazil, June 2024

Implications Tutorial

Outline

  • Background and previous iterations
  • Context, goals, and key concepts
      • reproducibility
      • publicity / secrecy of documents
      • personal and sensitive data
  • Relevant legal statements (BR/EU)
  • Data disclosure: restrictions, risk assessment and mitigation
  • Guidelines of good practices

Background

  • Master's research: revealing gender biases in court decisions with natural language processing (University of Campinas, 2020-2023), which triggered the concern: can we disclose everything?

  • Publication of "Should I disclose my dataset? Caveats between reproducibility and individual data rights" at NLLP 2022

Authors: Raysa Benatti, Camila Villarroel, Sandra Avila, Esther Colombini, and Fabiana Severi

(Institute of Computing @ University of Campinas and Law School of Ribeirão Preto @ University of São Paulo)

Context

As a researcher, can I really disclose my dataset of court documents?

We try to provide guidelines built on an example of interest: cases related to gender-based violence (GBV) brought to Brazilian courts

Why?

  • Human rights interest
  • Much sensitive information whose disclosure might cause significant harm

  • Large court system, much data
  • NLLP community
  • Civil law-based system

Goals

In the context of computational research with court documents,

  • to present legal and ethical considerations on data disclosure by researchers;

  • to provide guidelines for researchers to help them decide on data disclosure;

  • to discuss how to preserve both reproducibility of computational research and individual data rights.

Key concepts: Reproducibility

  • Scientific soundness and accountability
  • Allows for community scrutiny, fraud prevention, fraud detection
  • Public interest

Resource sharing:

  • builds connections inside and between communities
  • creates research possibilities
  • strengthens networks

Key concepts: Reproducibility

Critical quality of modern research; "science publicity"

  • Empirical research / Computer Science / Machine Learning

    • culture of openness of resources
    • expected standards
  • Data quality assessment

Key concepts: Reproducibility

⚠️ legal constraints

⚠️ ethical issues

⚠️ the burden of (not) sharing is not the same for different groups/contexts

Key concepts: Publicity / Secrecy

Access to information: fundamental right in a democratic environment; publicity by default, secrecy as exception

Brazilian context

  • democratization (1980s), Access to Information Act (LAI), Federal Constitution

  • Court decisions are, by default, public documents
  • Publicity of essential data regarding legal cases (National Council of Justice regulations)
  • Digitalization of systems: increased availability and access

Key concepts: Publicity / Secrecy

When is secrecy justified?

  • intimacy
  • social interest (e.g. national security)
  • explicit legal provision (e.g. family disputes, crimes against sexual dignity)
  • decided by court

⚠️ (online) availability ≠ non-secrecy (systems are not perfect!)

⚠️ publicity status can change over time

Key concepts: Publicity / Secrecy

So, if it’s public anyway, it’s OK to share it, right?

⚠️

"Despite the intersection between motivations supporting publicity and reproducibility, the justice system has different obligations and prerogatives than research institutions. When disclosing a court decision, the state complies with a legal duty to publicize and acts by itself; it claims the rights and responsibilities carried by such a publicization. If another person or entity, for instance, a researcher or research agency, extracts and discloses the same record, s/he creates another point of access, claiming responsibility over the content (even if unwittingly)."

Key concepts: Publicity / Secrecy

So, if it’s public anyway, it’s OK to share it, right?

⚠️

"(...) in research settings, the data might not be shared on its own; instead, it is often made available in the context of an experimental pipeline, with annotation, modifications, associated code, and/or results from models learned from them. In that case, disclosing the data is more than merely indexing it; it also publicizes it from a specific perspective. It makes sense that whoever is in charge of disclosing it is also legally and ethically responsible. Thus, when seeking reproducibility, researchers must account for (...) [such] boundaries, being wary about emulating publicity-guided acts from the public administration."

Guidelines of good practices I

If the data comes from cases under secrecy, then it should not be disclosed unless mitigation measures are in place (more on that later).

Otherwise, check whether other restrictions apply.

Key concepts: Personal and Sensitive Data

Personal data: "information regarding identified or identifiable natural person" (LAI)

Personal data is sensitive if it refers to racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, health or sex life, or personal genetic or biometric information (General Data Protection Regulation (GDPR); General Data Protection Act (LGPD))

Personal data: processing restrictions

LAI

LGPD

Personal data: processing restrictions

Statistical and scientific research is usually a legal exception to these restrictions

LAI: anonymization must be guaranteed

⚠️ unless: "public and general interest"

LGPD: anonymization is optional, although recommended

Risk assessment and mitigation

⚠️ risks can exist regardless of legal restrictions

legal restrictions impose a risk on their own (liability)

researcher/institution as controller (GDPR/LGPD)

  • violation of privacy and intimacy of minors and other vulnerable groups, victims, witnesses, defendants
  • exposure of confidential/sensitive information
  • exposure of any information that could jeopardize the safety or integrity of subject(s) involved in a legal case

Risk assessment and mitigation

  • Anonymization: data is no longer personal
      • ⚠️ technical obstacles
  • Disclosure on demand: safe and traceable
      • ⚠️ assumes good faith
      • ⚠️ extra layer for reproducibility
  • Not disclosing at all: escape from the burden of responsibility over the dataset disclosure; choice of privacy over publicity
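
To make the "technical obstacles" of anonymization concrete, here is a minimal sketch of name-based pseudonymization. It is only an illustration, not a recommended tool: the function `pseudonymize`, the placeholder format, and the excerpt (an invented sentence, not taken from any real case) are all assumptions made for this example.

```python
import re


def pseudonymize(text, known_names):
    """Replace each known name with a placeholder such as [PERSON_1].

    Hypothetical helper for illustration only: it masks direct
    identifiers that are listed in advance and nothing else.
    """
    result = text
    # Longest names first, so a full name is masked before any shorter
    # name that happens to be contained in it.
    ordered = sorted(known_names, key=len, reverse=True)
    for index, name in enumerate(ordered, start=1):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags=re.IGNORECASE)
        result = pattern.sub(f"[PERSON_{index}]", result)
    return result


# Invented excerpt, used only to show the limitation.
excerpt = "Maria da Silva, a nurse at the only clinic in Vila Nova, reported the defendant."
masked = pseudonymize(excerpt, ["Maria da Silva"])
print(masked)
# prints: [PERSON_1], a nurse at the only clinic in Vila Nova, reported the defendant.
```

The name is gone, but the occupation and the place may still identify the person. Indirect identifiers like these are one reason why masking names alone does not guarantee that the data is no longer personal.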

Guidelines: summary

Is the data provided from cases under secrecy?

  • Yes, or not possible to determine: disclose it only with mitigation measures.
  • No: does it contain sensitive personal data?
      • No: not illegal for researchers to disclose; disclosure without mitigation should ideally be preceded by an analysis of the specific context and a risk-benefit assessment.
      • Yes, or not possible to determine: is disclosure essential for the research?
          • Yes: not illegal for researchers to disclose (in this case, disclosure without mitigation might be ethically debatable).
          • No: disclose it only with mitigation measures.
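
The decision flow above can also be read as a small function. The sketch below is one possible encoding, not an official tool: the names `Answer`, `disclosure_guideline`, and the outcome strings are hypothetical. The slide does not state explicitly where the two answers to the last question lead, so those targets are an assumption ("Yes" is read as "not illegal, but ethically debatable without mitigation", and "No" as "disclose only with mitigation measures").

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "not possible to determine"


MITIGATE = "disclose only with mitigation measures"
NOT_ILLEGAL = "not illegal to disclose; analyze context and risk-benefit first"
NOT_ILLEGAL_DEBATABLE = (
    "not illegal to disclose; disclosure without mitigation "
    "might be ethically debatable"
)


def disclosure_guideline(under_secrecy, has_sensitive_data, disclosure_essential):
    """Walk the three questions of the summary flow, in order.

    under_secrecy and has_sensitive_data are Answer values;
    disclosure_essential is a plain bool.
    """
    # Question 1: secrecy. "Unknown" is treated like "yes".
    if under_secrecy is not Answer.NO:
        return MITIGATE
    # Question 2: sensitive personal data. "Unknown" is treated like "yes".
    if has_sensitive_data is Answer.NO:
        return NOT_ILLEGAL
    # Question 3: is disclosure essential for the research?
    # Assumption: the targets of these two branches are inferred from the slide.
    if disclosure_essential:
        return NOT_ILLEGAL_DEBATABLE
    return MITIGATE


print(disclosure_guideline(Answer.NO, Answer.UNKNOWN, False))
```

A result of "not illegal" is not a recommendation to disclose: the risk assessment discussed earlier still applies.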

For the future

  • Guidelines and recommendations from official data protection entities
  • Anonymization tools
  • Institutional data repositories
  • …what are your thoughts?

Thank you!

raysabenatti.com

original paper

(full list of references and legal statements)