What: Workshops covering research data practices and software
When: 12pm - 1:15pm on the dates listed below
Where: Zoom (all dates) / PCL Scholars Lab (select dates)
Workshop | Date | Day |
Open Source GIS: From QGIS to Python | January 31 | Fri |
Intro to Python for Data Management | February 11 | Tue |
Python for Data Retrieval and Visualization | February 12 | Wed |
Making Beautiful Plots in R’s ggplot2 | February 13 | Thu |
Where and How to Publish Research Data | February 14 | Fri |
How to Share Sensitive (Human) Data | February 28 | Fri |
How to Share Sensitive (Human) Data
Data & Donuts
February 28, 2025
Meryl Brodsky, Communication & Information Librarian
Bryan Gee, Open Research Coordinator for Data and Software
https://bit.ly/4bjkwF4
What is Sensitive Data? (Poll #1)
What is sensitive data?
Sensitive data refers to any information that, if disclosed, could potentially harm an individual or organization. Examples include:
Legal Protections include:
Identifiers
Direct
Personally Identifiable Information (PII), such as name, address, social security number, and phone number.
Indirect
Zip code, birthdate, education level, race, ethnicity, medical diagnosis, and occupation.
Communities / Indigenous Data
Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., ... Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. https://doi.org/10.5334/dsj-2020-043
What is the IRB? (Poll #2)
Institutional Review Board (IRB)
Institutional Review Board at UT
https://research.utexas.edu/resources/human-subjects
What is Informed Consent? (Poll #3)
Informed Consent
UT IRB Standard English Informed Consent Template
Why should data be shared? (Poll #4)
Why should data be shared?
What are examples of data re-use?
Thesis
Interview transcripts re-used from a Prostitution Diversion Program in Baltimore.
Methodology Development
Contrast methodologies from three studies that looked at postnatal care referral behavior.
Teaching
Students critique the qualitative design and methods used in the research study.
Qualitative Data Repository https://qdr.syr.edu
Sharing research data (or not)
Wilkinson et al. (2016; Scientific Data)
Working with sensitive research data
Keep data secure while working with it
UT has provisions for storing university data based on a data classification scheme:
Responsible use of AI
Do journal/funder policies require all data to be shared? (poll)
Publishing sensitive data
“NIH expects that researchers will take steps to maximize scientific data sharing, but may acknowledge in Plans that certain factors (i.e., ethical, legal, or technical) may necessitate limiting sharing to some extent. Foreseeable limitations should be described in DMS Plans.”
"Data should not be shared in any way that could compromise participant anonymity or privacy, and data should not be shared if that would require the authors to break any laws or licensing agreements. […] For clinical data (Individual Participant Data) we request that you use controlled access repositories, such as clinicalstudydatarequest.com, the YODA project, or Vivli."
Publishing sensitive data
Best practices for publishing restricted access data
Mechanisms for restricting access
Examples of data availability statements
Author-controlled: “The datasets generated and analysed during the current study are not publicly available due to risks of individual privacy but are available from the corresponding author upon reasonable request.”
-Dyer et al. (2025, BMJ Open) [how women with pre-existing diabetes can be better supported during the inter-pregnancy interval]
Third-party-controlled: “In compliance with all regulatory and legal requirements, data were stored, accessed, and analysed with the National Safe Haven maintained by NHS National Services Scotland. Outputs were subject to disclosure checks with members of the Electronic Data Research and Innovation (eDRIS) team of the Information and Statistics Division, Scotland, prior to release to the research team for inclusion in this manuscript. Any future access to this dataset would require application to the Scottish Government.”
-Maxwell et al. (2024, European Journal of Cancer Research) [To characterise cancer diagnosis in Scottish primary care in 2018/19 and draw comparisons with diagnostic activity in 2014.]
Third-party-controlled: “Data are available from the American Cancer Society by following the ACS Data Access Procedures (https://www.cancer.org/research/population-science/research-collaboration.html) for researchers who meet the criteria for access to confidential data. Please email cohort.data@cancer.org to inquire about access.”
-Rees-Punia et al. (2025, BMJ Open) [data collection and management methods for the Cancer Prevention Study-3 (CPS-3) Accelerometry Substudy, a nested cohort of device-based physical activity and sedentary time data.]
Downsides of restricted access
When to share openly
Re-identification is an intrinsic risk to participants, regardless of whether any information that could be revealed about them is considered 'sensitive.'
Forms of sensitive data
Geospatial
2D images
3D images
Date | Gender |
2024/03/09 | M |
2024/07/13 | F |
2022/01/28 | M |
2021/12/13 | M |
2023/02/01 | F |
2020/11/15 | M |
Text
How to share openly
Minimum reproducible dataset
Examples of deidentifying data for sharing: recoding
Date | Recoded variable |
2024/03/09 | 0 |
2024/07/13 | 0 |
2022/01/28 | 3 |
2021/12/13 | 3 |
2023/02/01 | 2 |
2020/11/15 | 4 |
Converting DOB to age
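A minimal Python sketch of this kind of recoding, using a few dates of birth from the table. The reference date (here the workshop date, 2025/02/28) and the function name age_in_years are assumptions made for illustration; in practice the reference date would be whatever date is meaningful for the study, such as the date of data collection.

```python
from datetime import date

def age_in_years(dob: date, on: date) -> int:
    """Completed years between a date of birth and a reference date."""
    years = on.year - dob.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years

# Assumed reference date (not stated in the slides).
REFERENCE_DATE = date(2025, 2, 28)

dobs = [date(2024, 3, 9), date(2022, 1, 28), date(2023, 2, 1)]
ages = [age_in_years(d, REFERENCE_DATE) for d in dobs]
print(ages)  # [0, 3, 2]
```

Only the recoded age would be kept in the shared file; the original date of birth column would be dropped.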
Treatment start | Treatment end | Days between |
2024/03/09 | 2024/08/29 | 173 days |
2024/07/13 | 2025/02/13 | 215 days |
2022/01/28 | 2022/11/21 | 297 days |
2021/12/13 | 2023/01/10 | 393 days |
2023/02/01 | 2023/12/15 | 317 days |
2020/11/15 | 2022/08/03 | 626 days |
Converting two dates to duration
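The same conversion can be sketched in Python with the standard library. The helper name days_between is hypothetical; the date pairs are taken from the table above.

```python
from datetime import date

def days_between(start: date, end: date) -> int:
    """Number of days from start to end (subtracting dates gives a timedelta)."""
    return (end - start).days

treatments = [
    (date(2024, 3, 9), date(2024, 8, 29)),
    (date(2024, 7, 13), date(2025, 2, 13)),
    (date(2021, 12, 13), date(2023, 1, 10)),
]
durations = [days_between(start, end) for start, end in treatments]
print(durations)  # [173, 215, 393]
```

Sharing only the duration keeps the analytically useful quantity while removing both calendar dates.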
Examples of deidentifying data for sharing: generalizing
Age | Generalized age |
45 | 40-49 |
36 | 30-39 |
29 | 20-29 |
31 | 30-39 |
42 | 40-49 |
27 | 20-29 |
Binning age
Age | Generalized age |
22 | 20+ |
21 | 20+ |
19 | <20 |
20 | 20+ |
42 | 20+ |
19 | <20 |
Binning age, capping outlier
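Both age tables can be reproduced with two small functions. This is a sketch only: the names decade_bin and threshold_bin, and the default cutoff of 20, are assumptions chosen to match the tables rather than a prescribed method.

```python
def decade_bin(age: int) -> str:
    """Generalize an age to its ten-year bin, e.g. 45 -> '40-49'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def threshold_bin(age: int, cutoff: int = 20) -> str:
    """Collapse ages into two groups so an unusual age (e.g. 42) no longer stands out."""
    return f"{cutoff}+" if age >= cutoff else f"<{cutoff}"

print([decade_bin(a) for a in [45, 36, 29]])         # ['40-49', '30-39', '20-29']
print([threshold_bin(a) for a in [22, 19, 20, 42]])  # ['20+', '<20', '20+', '20+']
```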
Income | Generalized income |
54,000 | 50-59k |
62,000 | 60-69k |
85,000 | >80k |
72,000 | 70-79k |
79,000 | 70-79k |
10,800,000 | >80k |
Binning income, capping outlier
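A hedged Python sketch of binning with a capped top category. The function name income_bin, the 10,000 bin width, and the 80,000 cap are assumptions read off the table. The sketch labels the top bin "80k+" rather than ">80k" so that an income of exactly 80,000 is also covered; that labeling is a choice made here, not something the slides specify.

```python
def income_bin(income: int, cap: int = 80_000, width: int = 10_000) -> str:
    """Bin an income into fixed-width ranges, with everything at or above cap in one bin."""
    if income >= cap:
        # Top-coding: extreme values such as 10,800,000 become indistinguishable
        # from any other income in the top bin.
        return f"{cap // 1000}k+"
    low = (income // width) * width
    return f"{low // 1000}-{(low + width - 1) // 1000}k"

incomes = [54_000, 62_000, 85_000, 72_000, 79_000, 10_800_000]
print([income_bin(i) for i in incomes])
# ['50-59k', '60-69k', '80k+', '70-79k', '70-79k', '80k+']
```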
Examples of deidentifying data for sharing: date shifting
Date | Recoded date |
2024/03/09 | 2024/08/06 |
2024/07/13 | 2024/12/10 |
2022/01/28 | 2022/06/27 |
2021/12/13 | 2022/05/12 |
2023/02/01 | 2023/07/01 |
2020/11/15 | 2021/04/14 |
Adding 150 days to each date
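Date shifting with a fixed offset can be sketched as follows. The names SHIFT and shift_date are hypothetical; the 150-day offset and the input dates come from the table above.

```python
from datetime import date, timedelta

# Offset taken from the table; in a real project this value would be kept private.
SHIFT = timedelta(days=150)

def shift_date(d: date, offset: timedelta = SHIFT) -> date:
    """Return the date moved forward by the given offset."""
    return d + offset

dates = [date(2024, 3, 9), date(2023, 2, 1), date(2020, 11, 15)]
shifted = [shift_date(d).strftime("%Y/%m/%d") for d in dates]
print(shifted)  # ['2024/08/06', '2023/07/01', '2021/04/14']
```

Because every date moves by the same amount, the intervals between dates are preserved. Anyone who learns the offset can reverse the shift, so it should not be published alongside the data.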
What could be sensitive data? (poll)
Examples of uncommon outliers
Beware of jigsaw-ing and predictive variables
Jigsaw-ing
Predictive variables
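One rough way to screen for jigsaw-ing risk is to count how many records share each combination of indirect identifiers; a combination that occurs only once may point to a single person. The sketch below is illustrative only: the records are made up, and the field names and helper functions are hypothetical, not part of any tool named in this workshop.

```python
from collections import Counter

# Made-up records for illustration; not real data.
records = [
    {"zip": "78701", "age_bin": "30-39", "occupation": "teacher"},
    {"zip": "78701", "age_bin": "30-39", "occupation": "teacher"},
    {"zip": "78701", "age_bin": "40-49", "occupation": "nurse"},
    {"zip": "78705", "age_bin": "20-29", "occupation": "student"},
    {"zip": "78705", "age_bin": "20-29", "occupation": "student"},
]

INDIRECT_IDENTIFIERS = ("zip", "age_bin", "occupation")

def combination_counts(rows, keys):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)

def unique_combinations(rows, keys):
    """Return combinations that appear exactly once and may single out a person."""
    return [combo for combo, n in combination_counts(rows, keys).items() if n == 1]

print(unique_combinations(records, INDIRECT_IDENTIFIERS))
# [('78701', '40-49', 'nurse')]
```

Records flagged this way are candidates for further generalizing or for removal of one of the contributing variables.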
Considerations for repository selection
General rules of thumb
Sensitive-data specific
Terms of access and reuse
Examples of data repositories for human data
Examples of generalist data repositories
*These are not formally endorsed by NIH or any other federal agency
Need help?