Should I disclose my dataset?
Legal and ethical considerations for researchers dealing with court documents
Raysa Benatti1, 2
1 University of Tübingen, Germany
2 Institute of Computing, University of Campinas, Brazil
ACM Conference on Fairness, Accountability, and Transparency (FAccT)
Rio de Janeiro, Brazil, Jun/2024
Implications Tutorial
Outline
Background
triggered concerns: can we disclose everything?
Authors: Raysa Benatti, Camila Villarroel, Sandra Avila, Esther Colombini, and Fabiana Severi
(Institute of Computing @ University of Campinas and Law School of Ribeirão Preto @ University of São Paulo)
Context
As a researcher, can I really disclose my dataset of court documents?
We try to provide guidelines built up from an example of interest: gender-based violence (GBV)-related cases brought to Brazilian courts
Why?
Goals
In the context of computational research with court documents,
Key concepts: Reproducibility
resource sharing
builds connections inside and between communities
creates research possibilities
strengthens networks
Key concepts: Reproducibility
Critical quality of modern research; "science publicity"
Key concepts: Reproducibility
⚠️ legal constraints
⚠️ ethical issues
⚠️ the burden of (not) sharing is not the same for different groups/contexts
Key concepts: Publicity / Secrecy
Access to information: fundamental right in a democratic environment; publicity by default, secrecy as exception
Brazilian context
Key concepts: Publicity / Secrecy
When is secrecy justified?
⚠️ (online) availability ≠ non-secrecy (systems are not perfect!)
⚠️ publicity status can change over time
Key concepts: Publicity / Secrecy
So, if it’s public anyway, it’s OK to share it, right?
⚠️
"Despite the intersection between motivations supporting publicity and reproducibility, the justice system has different obligations and prerogatives than research institutions. When disclosing a court decision, the state complies with a legal duty to publicize and acts by itself; it claims the rights and responsibilities carried by such a publicization. If another person or entity --- for instance, a researcher or research agency --- extracts and discloses the same record, s/he creates another point of access, claiming responsibility over the content (even if unwittingly)."
Key concepts: Publicity / Secrecy
So, if it’s public anyway, it’s OK to share it, right?
⚠️
"(...) in research settings, the data might not be shared on its own; instead, it is often made available in the context of an experimental pipeline, with annotation, modifications, associated code, and/or results from models learned from them. In that case, disclosing the data is more than merely indexing it; it also publicizes it from a specific perspective. It makes sense that whoever is in charge of disclosing it is also legally and ethically responsible. Thus, when seeking reproducibility, researchers must account for (...) [such] boundaries, being wary about emulating publicity-guided acts from the public administration."
Guidelines of good practices I
if the data comes from cases under secrecy,
then it should not be disclosed,
unless mitigation measures are in place (more on that later)
otherwise,
check if other restrictions apply
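The rule above can be sketched as a small function. This is only an illustration of the slide's wording, not legal advice; the names `Step` and `first_guideline` and the two boolean inputs are hypothetical and do not come from the original paper.

```python
from enum import Enum


class Step(Enum):
    """Possible outcomes of the first guideline (labels are illustrative)."""
    DO_NOT_DISCLOSE = "do not disclose"
    DISCLOSE_WITH_MITIGATION = "disclose only with mitigation measures"
    CHECK_OTHER_RESTRICTIONS = "check if other restrictions apply"


def first_guideline(under_secrecy: bool, mitigation_in_place: bool) -> Step:
    """Apply the secrecy rule: secrecy blocks disclosure unless mitigated."""
    if under_secrecy:
        if mitigation_in_place:
            return Step.DISCLOSE_WITH_MITIGATION
        return Step.DO_NOT_DISCLOSE
    # Not under secrecy: disclosure is not cleared yet, other checks follow.
    return Step.CHECK_OTHER_RESTRICTIONS
```

Note that the non-secrecy branch does not return "disclose": it only hands the decision over to the remaining checks (personal and sensitive data).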
Key concepts: Personal and Sensitive Data
Personal data: "information regarding [an] identified or identifiable natural person" (LAI, the Brazilian Access to Information Act)
Personal data is sensitive if it refers to racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, health or sex life, or personal genetic or biometric information (General Data Protection Regulation (GDPR); General Data Protection Act (LGPD))
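As a rough sketch, the sensitive categories listed above can be used to flag fields in a dataset schema before deciding on disclosure. The category strings mirror the slide; the function names, the idea of tagging each field with categories, and the example field names are assumptions made for illustration only.

```python
# Categories of sensitive personal data as listed on the slide (GDPR / LGPD).
SENSITIVE_CATEGORIES = frozenset({
    "racial or ethnic origin",
    "political opinions",
    "religious or philosophical beliefs",
    "trade union membership",
    "health or sex life",
    "genetic or biometric information",
})


def is_sensitive(field_categories):
    """Return True if any category tagged on a field is a sensitive one."""
    return any(category in SENSITIVE_CATEGORIES for category in field_categories)


def sensitive_fields(schema):
    """Return the sorted names of fields tagged with a sensitive category.

    `schema` maps a field name to the list of categories a human annotator
    assigned to it (a hypothetical convention, not part of the source).
    """
    return sorted(name for name, cats in schema.items() if is_sensitive(cats))


# Hypothetical schema for a dataset of court decisions.
example_schema = {
    "victim_medical_report": ["health or sex life"],
    "court_district": [],
    "defendant_religion": ["religious or philosophical beliefs"],
}
flagged = sensitive_fields(example_schema)
```

The tagging itself remains a manual, context-dependent judgment; the code only aggregates it.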
Personal data: processing restrictions
LAI
LGPD
Personal data: processing restrictions
Statistical and scientific research is usually legally exempt from these restrictions
LAI: anonymization must be guaranteed
⚠️ unless: "public and general interest"
LGPD: anonymization optional although recommended
Risk assessment and mitigation
⚠️ risks can exist regardless of legal restrictions
legal restrictions impose a risk on their own (liability)
researcher/institution as controller (GDPR/LGPD)
Risk assessment and mitigation
Anonymization
data no longer personal
⚠️ technical obstacles
Disclosure on demand
safe and traceable
⚠️ assumes good faith
⚠️ extra layer for reproducibility
Not disclosing at all
avoids the burden of responsibility for disclosing the dataset; chooses privacy over publicity
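To make the "technical obstacles" warning under anonymization concrete, here is a minimal masking sketch. It assumes direct identifiers are limited to CPF numbers (the Brazilian taxpayer ID, conventionally written as 000.000.000-00) and to a list of names already known to the researcher; `mask_identifiers` and the placeholder tags are hypothetical and are not the method used in the original paper.

```python
import re

# Formatted CPF numbers such as 123.456.789-09 (unformatted digits are not covered).
CPF_PATTERN = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")


def mask_identifiers(text, known_names):
    """Replace formatted CPF numbers and the given names with placeholders."""
    masked = CPF_PATTERN.sub("[CPF]", text)
    for index, name in enumerate(known_names, start=1):
        # re.escape keeps characters in the name from being read as regex syntax.
        masked = re.sub(re.escape(name), f"[PERSON_{index}]", masked,
                        flags=re.IGNORECASE)
    return masked


sample = "Maria Silva, CPF 123.456.789-09, filed a report."
masked_sample = mask_identifiers(sample, ["Maria Silva"])
```

Masking of this kind leaves indirect identifiers (addresses, dates, family relationships, names missing from the list) untouched, so on its own it does not guarantee that the data is no longer personal.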
Guidelines: summary
Does the data come from cases under secrecy?
Yes OR not possible to determine
No
Disclose it only with mitigation measures
Does it contain sensitive personal data?
No
Not illegal for researchers to disclose; disclosure without mitigation should ideally be preceded by an analysis of specific context and risk-benefit assessment
Yes OR not possible to determine
Is disclosure essential for the research?
Yes
(in this case, disclosure without mitigation might be ethically debatable)
No
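The summary flow can be sketched as a decision function. The arrows of the original chart were lost in this text version, so the targets of the last question are an assumption: "Yes" is taken to lead to the "not illegal" outcome with the ethical caveat, and "No" to disclosure only with mitigation. All names below are hypothetical, and `None` stands for "not possible to determine".

```python
from typing import Optional

MITIGATION_ONLY = "disclose only with mitigation measures"
NOT_ILLEGAL = ("not illegal to disclose; analyze context and assess "
               "risk-benefit before disclosing without mitigation")
NOT_ILLEGAL_DEBATABLE = NOT_ILLEGAL + " (ethically debatable)"


def disclosure_guideline(under_secrecy: Optional[bool],
                         has_sensitive_data: Optional[bool],
                         disclosure_essential: bool) -> str:
    """Walk the three questions of the summary; None means undeterminable."""
    # Question 1: secrecy. Undeterminable is treated like "Yes".
    if under_secrecy is None or under_secrecy:
        return MITIGATION_ONLY
    # Question 2: sensitive personal data. Only a definite "No" exits here.
    if has_sensitive_data is False:
        return NOT_ILLEGAL
    # Question 3: is disclosure essential for the research?
    # ASSUMPTION: branch targets reconstructed from the flattened chart.
    if disclosure_essential:
        return NOT_ILLEGAL_DEBATABLE
    return MITIGATION_ONLY
```

Treating "not possible to determine" the same as "Yes" follows the labels in the summary and keeps the function on the cautious side whenever information is missing.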
For the future
Thank you!
raysabenatti.com
original paper
(full list of references and legal statements)