1 of 39

What: Workshops covering research data practices and software

When: 12pm - 1:15pm on the dates listed below

Where: Zoom (all dates) / PCL Scholars Lab (select dates)

More info: https://guides.lib.utexas.edu/data-and-donuts

  • Join us for our Data & Donuts workshops in the Spring 2025 semester! Use QR code for online schedule.

Day  Date         Workshop
Fri  January 31   Open Source GIS: From QGIS to Python
Tue  February 11  Intro to Python for Data Management
Wed  February 12  Python for Data Retrieval and Visualization
Thu  February 13  Making Beautiful Plots in R’s ggplot2
Fri  February 14  Where and How to Publish Research Data
Fri  February 28  How to Share Sensitive (Human) Data

  • Sign up to receive Data & Donuts workshop event notifications at

2 of 39

  • Join us on Friday, April 4th from 10:00am to 12:00pm
  • In-person in the PCL Scholars Lab and virtual on Zoom
  • Hear about research computing services offered by
    • TACC
    • OVPR
    • Enterprise Technology
    • The UT Libraries

3 of 39

How to Share Sensitive (Human) Data

Data & Donuts

February 28, 2025

Meryl Brodsky, Communication & Information Librarian

Bryan Gee, Open Research Coordinator for Data and Software

https://bit.ly/4bjkwF4

4 of 39

What is Sensitive Data? (Poll #1)

5 of 39

What is sensitive data?

Sensitive data refers to any information that, if disclosed, could potentially harm an individual or organization. Examples include:

    • Personally identifiable information (PII)
    • Health records
    • Financial information

Legal Protections include:

  • Health Insurance Portability and Accountability Act (HIPAA) - protects medical/healthcare data
  • Family Educational Rights and Privacy Act (FERPA) - protects educational records data, such as grades
  • Fair Credit Reporting Act (FCRA) - protects consumer credit information

6 of 39

Identifiers

Direct

Personally Identifiable Information (PII), such as name, address, social security number, and phone number.

Indirect

Zip code, birthdate, education level, race and ethnicity, medical diagnosis, occupation.

7 of 39

Communities / Indigenous Data

  • Was my data collected from or within a particular community?
  • How would public release of this data (even if de-identified) impact the community?
  • Who in the community can advise me on data sharing and community impacts?
  • Is there an agreement (e.g., data use agreement, memorandum of understanding) about the data collected from the community?
  • Are there laws and regulations that protect the community?

Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., ... Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. https://doi.org/10.5334/dsj-2020-043

8 of 39

What is the IRB? (Poll #2)

9 of 39

Institutional Review Board (IRB)

  • Used for research involving human subjects
  • Provides specialized training & guidance as required by federal & state laws
  • 3 principles
    • Beneficence: To maximize benefits for science, humanity, and research participants and to avoid or minimize risk or harm.
    • Respect: To protect the autonomy and privacy rights of participants.
    • Justice: To ensure the fair distribution among persons and groups of the costs and benefits of research.

Institutional Review Board at UT

https://research.utexas.edu/resources/human-subjects

10 of 39

What is Informed Consent? (Poll #3)

11 of 39

Informed Consent

  • Purpose of the research
  • Who is behind the project with full contact info
  • What is involved in participation
  • Benefits and risks of participation
  • How to withdraw from the study
  • Data use – during research, analysis, publishing AND for subsequent sharing
  • Strategies to ensure confidentiality of data

UT IRB Standard English Informed Consent Template

12 of 39

Why should data be shared? (Poll #4)

13 of 39

Why should data be shared?

  • To advance science and increase innovation
  • To facilitate reproducibility and provide transparency in the research process
  • To allow others to re-use data for different purposes
  • To return value to communities
  • To comply with funder mandates
  • To comply with publisher mandates

14 of 39

What are examples of data re-use?

Thesis

Interview transcripts re-used from a Prostitution Diversion Program in Baltimore.

Methodology Development

Contrast methodologies from three studies that looked at postnatal care referral behavior.

Teaching

Students critique the qualitative design and methods used in the research study.

Qualitative Data Repository https://qdr.syr.edu

15 of 39

Sharing research data (or not)

Wilkinson et al. (2016; Scientific Data)

  • Increasing requirement to share data associated with journal articles (‘open data’)
  • Can create conflicts when there are potential or definitive risks associated with public sharing of data

16 of 39

Working with sensitive research data

  • Participants
  • Institutions/agencies/organizations
  • Researchers
  • Journals
  • Funders

17 of 39

Keep data secure while working with it

UT has provisions for storing university data based on a data classification scheme:

  • Published Data: Publicly available data with no requirement for confidentiality, integrity, or availability
  • Controlled Data: Data that are not publicly available but are releasable in accordance with the Texas Public Information Act (e.g., contents of specific e-mail, place of birth, salary)
  • Confidential Data: Data protected specifically by federal or state law or by University of Texas rules and regulations, as well as data not otherwise protected by a known civil statute or regulation but which must be protected due to contractual agreements requiring confidentiality, integrity, or availability considerations

18 of 39

Responsible use of AI

  • AI services are well known to use input data for training purposes.
  • Free versions of AI services are not acceptable for anything other than Published Data.
  • The Enterprise (university contract) version of ChatGPT and the Microsoft 365 (university contract) version of Copilot are acceptable only for some forms of Controlled Data.

19 of 39

Do journal/funder policies require all data to be shared? (poll)

20 of 39

Publishing sensitive data

“NIH expects that researchers will take steps to maximize scientific data sharing, but may acknowledge in Plans that certain factors (i.e., ethical, legal, or technical) may necessitate limiting sharing to some extent. Foreseeable limitations should be described in DMS Plans.”

"Data should not be shared in any way that could compromise participant anonymity or privacy, and data should not be shared if that would require the authors to break any laws or licensing agreements. […] For clinical data (Individual Participant Data) we request that you use controlled access repositories, such as clinicalstudydatarequest.com, the YODA project, or Vivli."

21 of 39

Publishing sensitive data

  • informed consent does not permit/expressly limits sharing and/or reuse
  • existing consent (not expressly informed) does not permit/expressly limits sharing and/or reuse
  • privacy or safety of research participants would be compromised or place them at greater risk of re-identification or suffering harm, and protective measures such as de-identification would be insufficient
  • explicit federal, state, local, or Tribal law, regulation, or policy prohibits disclosure
  • restrictions imposed by existing or anticipated agreements (e.g., with third parties, HIPAA-covered entities that provide data with specific use agreements)

22 of 39

Best practices for publishing restricted access data

  • Consult with the journal (if appropriate) - some journals are inflexible or have their own definitions
  • Determine the most appropriate custodian - often a researcher or research group is not the best choice
  • Determine the storage location, ideally a permanent one
  • Provide express details on who can access the data, when they can access it, and how they can access it

23 of 39

Mechanisms for restricting access

  1. Available upon request to corresponding author
  2. Available upon request to third party (e.g., national health system, tribal group)
  3. Available from repository via self-attestation
  4. Available from repository via application

24 of 39

25 of 39

Examples of data availability statements

Author-controlled: “The datasets generated and analysed during the current study are not publicly available due to risks of individual privacy but are available from the corresponding author upon reasonable request.”

-Dyer et al. (2025, BMJ Open) [how women with pre-existing diabetes can be better supported during the inter-pregnancy interval]

Third-party-controlled: “In compliance with all regulatory and legal requirements, data were stored, accessed, and analysed with the National Safe Haven maintained by NHS National Services Scotland. Outputs were subject to disclosure checks with members of the Electronic Data Research and Innovation (eDRIS) team of the Information and Statistics Division, Scotland, prior to release to the research team for inclusion in this manuscript. Any future access to this dataset would require application to the Scottish Government.”

-Maxwell et al. (2024, European Journal of Cancer Research) [To characterise cancer diagnosis in Scottish primary care in 2018/19 and draw comparisons with diagnostic activity in 2014.]

Third-party-controlled: “Data are available from the American Cancer Society by following the ACS Data Access Procedures (https://www.cancer.org/research/population-science/research-collaboration.html) for researchers who meet the criteria for access to confidential data. Please email cohort.data@cancer.org to inquire about access.”

-Rees-Punia et al. (2025, BMJ Open) [data collection and management methods for the Cancer Prevention Study-3 (CPS-3) Accelerometry Substudy, a nested cohort of device-based physical activity and sedentary time data.]

26 of 39

Downsides of restricted access

  1. Restricted data have restricted discoverability and reuse potential
  2. Restricted data can be more work to manage
  3. Restricted data can create more obstacles for reusers
  4. Restricted data may contravene participants’ wishes

27 of 39

When to share openly

  1. Participants explicitly consent to share scientific data openly without restrictions.
  2. Scientific data are de-identified and institutional review has determined that they pose very low risk when shared and used, including any risks posed by the presence of information that can allow inferences to be made about a participant’s identity when combined with other information.

Re-identification is an intrinsic risk to participants, regardless of whether any information that could be revealed about them is considered to be 'sensitive.'

28 of 39

Forms of sensitive data

Geospatial

2D images

3D images

Date        Gender
2024/03/09  M
2024/07/13  F
2022/01/28  M
2021/12/13  M
2023/02/01  F
2020/11/15  M

Text

29 of 39

How to share openly

Minimum reproducible dataset

  1. Remove/Redact
  2. Recode
  3. Generalize
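The first step, removing or redacting, can be sketched in a few lines of Python. This is a minimal illustration only: the field names in `DIRECT_IDENTIFIERS` and the sample record are hypothetical, and a real dataset needs its own reviewed list of direct identifiers.

```python
# Hypothetical field names; replace with the direct identifiers in your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "ssn", "phone"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed direct-identifier fields."""
    return [
        {field: value for field, value in row.items() if field not in identifiers}
        for row in records
    ]


# Made-up record for illustration only.
records = [
    {"name": "Participant A", "phone": "555-0100", "age": 45, "response": "yes"},
]
shared = remove_direct_identifiers(records)
print(shared)
```

Dropping fields removes direct identifiers only; indirect identifiers such as age remain in the output and still need recoding or generalizing.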

30 of 39

Examples of deidentifying data for sharing: recoding

Converting DOB to age

Date        Recoded variable
2024/03/09  0
2024/07/13  0
2022/01/28  3
2021/12/13  3
2023/02/01  2
2020/11/15  4

Converting two dates to duration

Treatment start  Treatment end  Days between
2024/03/09       2024/08/29     173 days
2024/07/13       2025/02/13     215 days
2022/01/28       2022/11/21     297 days
2021/12/13       2023/01/10     393 days
2023/02/01       2023/12/15     317 days
2020/11/15       2022/08/03     626 days
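Both recodings above can be sketched with Python's standard `datetime` module. The function names are illustrative, and the reference date used for computing age (`REFERENCE`, set here to the workshop date) is an assumption; the slide does not state which date was used.

```python
from datetime import date


def age_in_years(dob, reference):
    """Completed years between a date of birth and a reference date."""
    # Subtract one year if the birthday has not yet occurred in the reference year.
    before_birthday = (reference.month, reference.day) < (dob.month, dob.day)
    return reference.year - dob.year - int(before_birthday)


def days_between(start, end):
    """Number of days from start to end."""
    return (end - start).days


REFERENCE = date(2025, 2, 28)  # assumption: ages computed as of the workshop date

print(age_in_years(date(2022, 1, 28), REFERENCE))          # 3
print(age_in_years(date(2024, 7, 13), REFERENCE))          # 0
print(days_between(date(2024, 3, 9), date(2024, 8, 29)))   # 173
print(days_between(date(2020, 11, 15), date(2022, 8, 3)))  # 626
```

Only the recoded values (age, duration) would be shared; the original dates stay in the restricted copy of the data.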

31 of 39

Examples of deidentifying data for sharing: generalizing

Binning age

Age  Generalized age
45   40-49
36   30-39
29   20-29
31   30-39
42   40-49
27   20-29

Binning age, capping outlier

Age  Generalized age
22   20+
21   20+
19   <20
20   20+
42   20+
19   <20

Binning income, capping outlier

Income      Generalized income
54,000      50-59k
62,000      60-69k
85,000      >80k
72,000      70-79k
79,000      70-79k
10,800,000  >80k
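Binning and capping can be sketched as one small Python function. This is an illustration under stated assumptions: `bin_value` is a hypothetical helper, income is expressed in thousands, and the open-ended top bin is labeled in the "20+" style (so incomes of 80 thousand and above become "80+").

```python
def bin_value(value, width, top=None):
    """Return a range label; values at or above `top` share one open-ended bin."""
    if top is not None and value >= top:
        return f"{top}+"
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def under_or_over(age, cutoff=20):
    """Two-category generalization: below the cutoff, or at/above it."""
    return f"<{cutoff}" if age < cutoff else f"{cutoff}+"


print(bin_value(45, 10))              # 40-49
print(bin_value(29, 10))              # 20-29
print(under_or_over(19))              # <20
print(under_or_over(42))              # 20+
print(bin_value(54, 10, top=80))      # 50-59  (income in thousands)
print(bin_value(10_800, 10, top=80))  # 80+    (outlier absorbed by the cap)
```

The cap is what hides the outlier: without `top`, an income of 10,800 thousand would fall in its own bin and remain identifiable.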

32 of 39

Examples of deidentifying data for sharing: date shifting

Adding 150 days to each date

Date        Recoded date
2024/03/09  2024/08/06
2024/07/13  2024/12/10
2022/01/28  2022/06/27
2021/12/13  2022/05/12
2023/02/01  2023/07/01
2020/11/15  2021/04/14
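Date shifting as shown here can be sketched with the standard `datetime` module. The function name `shift_date` is illustrative; the 150-day offset comes from the example above.

```python
from datetime import date, timedelta


def shift_date(original, offset_days):
    """Return the date moved forward by a fixed number of days."""
    return original + timedelta(days=offset_days)


OFFSET_DAYS = 150  # the offset used in the example above

print(shift_date(date(2024, 3, 9), OFFSET_DAYS))    # 2024-08-06
print(shift_date(date(2021, 12, 13), OFFSET_DAYS))  # 2022-05-12

# Applying the same offset to two dates preserves the interval between them.
start, end = date(2024, 3, 9), date(2024, 8, 29)
assert (shift_date(end, OFFSET_DAYS) - shift_date(start, OFFSET_DAYS)) == (end - start)
```

Because every date moves by the same amount, durations and orderings survive the shift, while anyone who learns the offset can recover the original dates.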

33 of 39

What could be sensitive data? (poll)

34 of 39

Examples of uncommon outliers

  • Geographic regions with low population (e.g., 01003)
  • Unique job titles (e.g., Open Research Coordinator for Data and Software)
  • Years of employment
  • Advanced degrees
  • Household composition (people and non-people)
  • Consumer habits

35 of 39

Beware of jigsaw-ing and predictive variables

Jigsaw-ing

  • Connection of dataset to original source
  • Interpolation between dataset and related project materials
  • Interpolation between multiple datasets

Predictive variables

  • Some variables/identifiers can imply certain other attributes that are not listed
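One rough way to look for jigsaw risk before sharing is to count how many records share each combination of indirect identifiers; a combination held by a single record is a candidate for re-identification. The sketch below is illustrative only: `rare_combinations`, the field names, and the sample records are all made up, and passing such a check does not by itself establish that a dataset is safe to share.

```python
from collections import Counter


def rare_combinations(records, fields, min_count=2):
    """Return combinations of field values shared by fewer than min_count records."""
    counts = Counter(tuple(row[field] for field in fields) for row in records)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Made-up records for illustration only.
records = [
    {"zip": "78712", "age_group": "20-29", "occupation": "student"},
    {"zip": "78712", "age_group": "20-29", "occupation": "student"},
    {"zip": "01003", "age_group": "40-49", "occupation": "librarian"},
]

flagged = rare_combinations(records, ["zip", "age_group", "occupation"])
print(flagged)  # {('01003', '40-49', 'librarian'): 1}
```

Flagged combinations can then be generalized further (wider bins, coarser geography) or removed before release.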

36 of 39

Considerations for repository selection

General rules of thumb

  • Retention policy
  • Infrastructure and security
  • Storage capacity
  • Required attributes (e.g., PIDs, metadata)
  • Cost

Sensitive-data specific

  • Restricted/controlled access functionality
  • Terms of access and reuse
  • Data ownership and sovereignty
  • Data type restrictions

37 of 39

Examples of data repositories for human data

38 of 39

Examples of generalist data repositories

*These are not formally endorsed by NIH or any other federal agency

39 of 39

Need help?

Meryl Brodsky

Liaison Librarian for Communication

meryl.brodsky@austin.utexas.edu

Bryan Gee

Open Research Coordinator for Data and Software

bryan.gee@austin.utexas.edu