1 of 39

What: Workshops covering research data practices and software

When: 12pm - 1:15pm on the dates listed below

Where: Zoom (all dates) / PCL Scholars Lab (select dates)

More info: https://guides.lib.utexas.edu/data-and-donuts

Join us for our Data & Donuts workshops in the Spring 2025 semester! Use QR code for online schedule.

Day	Date	Workshop
Fri	January 31	Open Source GIS: From QGIS to Python
Tue	February 11	Intro to Python for Data Management
Wed	February 12	Python for Data Retrieval and Visualization
Thu	February 13	Making Beautiful Plots in R’s ggplot2
Fri	February 14	Where and How to Publish Research Data
Fri	February 28	How to Share Sensitive (Human) Data

Sign up to receive Data & Donuts workshop event notifications at https://utlists.utexas.edu/sympa/subscribe/research-data-services

2 of 39

Join us on Friday, April 4th from 10:00am to 12:00pm
In-person in the PCL Scholars Lab and virtual on Zoom
Hear about research computing services offered by

TACC
OVPR
Enterprise Technology
The UT Libraries

https://bit.ly/utrcs2025form

3 of 39

How to Share Sensitive (Human) Data

Data & Donuts

February 28, 2025

Meryl Brodsky, Communication & Information Librarian

Bryan Gee, Open Research Coordinator for Data and Software

https://bit.ly/4bjkwF4

4 of 39

What is Sensitive Data? (Poll #1)

5 of 39

What is sensitive data?

Sensitive data refers to any information that, if disclosed, could potentially harm an individual or organization. Examples include:

Personally identifiable information (PII)
Health records
Financial information

Legal Protections include:

Health Insurance Portability and Accountability Act (HIPAA) - protects medical/healthcare data
Family Educational Rights and Privacy Act (FERPA) - protects educational records data, such as grades
Fair Credit Reporting Act (FCRA) - protects consumer credit information

Sensitive data include any information that could be used to harm research participants. By harm, I mean legal or financial problems, and emotional or reputational damage.

Examples include:

Personally Identifiable Information or PII, such as Social Security number, date and place of birth, email address, phone number, gender, race, and address information. It also includes biometric data such as a facial image, fingerprints, or voice signature. PII can be used to commit identity theft or other crimes.

Other examples of sensitive data that may be the focus of a research study are:

Sexual behaviors

Mental health information

Criminal behaviors, such as drug use

Information about minors or other vulnerable populations

I’ve also worked with students who were collecting political opinions from people who lived in a country where differing with the leading party could be dangerous.

Before sharing human participant data, you need to assess its risk to the study participants. They should not suffer harm as a result of participating in your research project.

Also, there are legal protections that you must adhere to, such as HIPAA, which protects medical information; FERPA, which protects educational records such as grades; and the FCRA, which protects consumer credit information.

6 of 39

Identifiers

Direct

Personally Identifiable Information (PII), such as name, address, social security number, and phone number.

Indirect

ZIP code, birthdate, education level, race, ethnicity, medical diagnosis, and occupation.

Identifiers refer to the type of information.

Direct Identifiers are generally unique to an individual or a small group of individuals. Direct identifiers usually need to be removed from a data set before its release.

Indirect identifiers can be linked with information from other sources (not in your research study), such as social media, administrative data, or other public datasets, in ways that may result in identification of an individual. On their own, though, they are okay. For example, if you were looking at a collection of people who had breast cancer, this in itself would not be revealing. If you were studying people of a certain age group who were treated at a specific hospital for stage 2 breast cancer during a specific time period, these separate pieces of information used together could reveal individuals.

So, you may need to remove both direct identifiers AND indirect identifiers. Removing identifiers during data analysis, rather than waiting until the study is complete, is another way to keep the data secure. Keep documentation of changes and codes in a secure, encrypted file.

Let me also point out here that when you are planning your research, you should think about the data you need and collect only that data. This is a data minimization strategy. Some data elements may not be necessary. For example, you may not need someone's exact age; a 5-year age range may suffice. Also, unless you're doing a geographic study, you won’t need their address.

Some studies pose a greater disclosure risk, such as those with small samples from geographically specific areas, sensitive topics, or protected subjects such as children, or those with multiple demographic variables, which you often find in education research.

Researchers have both a legal and an ethical obligation to ensure that confidentiality is maintained.

In the medical field, this goes back to Hippocrates, who spoke on patient confidentiality, though HIPAA, prompted by the advent of electronic health records, didn't become law until 1996.

7 of 39

Communities / Indigenous Data

Was my data collected from or within a particular community?
How would public release of this data (even if de-identified) impact the community?
Who in the community can advise me on data sharing and community impacts?
Is there an agreement (e.g., data use agreement, memorandum of understanding) about the data collected from the community?
Are there laws and regulations that protect the community?

Carroll, S. R., Garba, I., Figueroa-Rodríguez, O. L., Holbrook, J., Lovett, R., Materechera, S., Hudson, M. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal, 19(1), 43. https://doi.org/10.5334/dsj-2020-043

You may not be able to legally share the data you collect from and within communities, or you may make the decision not to share for ethical reasons.

Communities have been harmed by sharing data, by not collecting data, and by not sharing data collected about the community with the community.

There is a long history of this happening with Indigenous or Native American tribes, even recently. For example, during the pandemic, communities needed to apply for funding, but data from these communities had not even been collected because the sample sizes were deemed too small.

There have been legal disputes, such as the Havasupai Tribe versus Arizona State University in 2004, in connection with a diabetes research project in which data were used for unauthorized purposes such as investigations of population origin, population migration, and schizophrenia.

James R, Tsosie R, Sahota P, Parker M, Dillard D, Sylvester I, Lewis J, Klejka J, Muzquiz L, Olsen P, Whitener R, Burke W; Kiana Group. Exploring pathways to trust: a tribal perspective on data sharing. Genet Med. 2014 Nov;16(11):820-6. doi: 10.1038/gim.2014.47. Epub 2014 May 15. PMID: 24830328; PMCID: PMC4224626.

You must go the extra mile in working with an Indigenous community to get approval to either share or protect the data. The CARE Principles apply when working with Indigenous data; they have to do with Indigenous Peoples owning their own data. The Library can help you find information on this. https://static1.squarespace.com/static/5d3799de845604000199cd24/t/640792a43ba5c11a1073bbc8/1678217895508/TheCAREPrinciples.pdf

Indigenous Data Sovereignty & Governance

https://nni.arizona.edu/our-work/research-policy-analysis/indigenous-data-sovereignty-governance

8 of 39

What is the IRB? (Poll #2)

9 of 39

Institutional Review Board (IRB)

Used for research involving human subjects
Provides specialized training & guidance as required by federal & state laws
3 principles
Beneficence: To maximize benefits for science, humanity, and research participants and to avoid or minimize risk or harm.
Respect: To protect the autonomy and privacy rights of participants.
Justice: To ensure the fair distribution among persons and groups of the costs and benefits of research

Institutional Review Board at UT

https://research.utexas.edu/resources/human-subjects

IRBs came about as a direct result of historical mistreatment of humans in the course of medical research.

In 1947, the Nuremberg Code was developed as a result of Nazi experimentation on humans.

In 1972, it was publicly disclosed that the government-supported Tuskegee Syphilis Study had been going on for 40 years, continuing even after antibiotics became available.

In 1974, the National Research Act (Public Law 93-348) was put into place. It mandates that an institutional review board, or human subjects committee, must be established by any university that receives federal funding for biomedical or behavioral research.

In 1979, the commission established by that act, the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, published recommendations known as the Belmont Report, which served as the basis for revised federal regulations on the Protection of Human Subjects (45 CFR 46).

The Belmont Report laid out three general ethical principles that should govern human subjects research:

Beneficence: To maximize benefits for science, humanity, and research participants and to avoid or minimize risk or harm.
Respect: To protect the autonomy and privacy rights of participants.
Justice: To ensure the fair distribution among persons and groups of the costs and benefits of research

Institutional review boards (IRBs), or research ethics committees, provide protection for human research participants through advance and independent review of the ethical acceptability of proposals for human research. Since the inception of IRBs (50+ years ago), the research landscape has grown and evolved, as has the system of IRB review and oversight. However, inconsistencies in IRB review and in the application of federal regulations have led to dissatisfaction with the IRB system. Some complain that IRB review is time-consuming and burdensome without clear evidence of effectiveness at protecting human subjects. You need to comply with IRB guidelines, but because the system is imperfect, you also have to go beyond the IRB and do the best that you possibly can for your subjects and their data. If people are willing to participate in your research, you are ethically obligated to protect them.

https://pmc.ncbi.nlm.nih.gov/articles/PMC4631034/

10 of 39

What is Informed Consent? (Poll #3)

11 of 39

Informed Consent

Purpose of the research
Who is behind the project with full contact info
What is involved in participation
Benefits and risks of participation
How to withdraw from the study
Data use – during research, analysis, publishing AND for subsequent sharing
Strategies to ensure confidentiality of data

UT IRB Standard English Informed Consent Template

An informed consent document is required when you are working with human subjects.

It basically is your contract with the participants, and explains what the research is, who you are, how to get out of the study and what's going to happen to the data.

It should communicate the purpose of the research, and who is behind the project, along with contact info.

It should talk about what is involved in participation such as how long it will take, and whether there are any benefits and risks of participating.

You then have to say how to withdraw from the study.

Finally, you must address the data, including the security and privacy measures for your research.

Then the participants must sign or initial the document.

Act with data minimization in mind: Only generate and use data that is relevant for the purpose of your research.

Use safe and secure file storage and sharing.

Informed consent is a baseline of what is required. You can go beyond the minimum and be very clear about what your plans are for the data. Many respondents don’t read the informed consent, and it can be hard to imagine all the ways that this research may be re-used. However, it's still a good idea to build a robust informed consent, noting the data’s future use, such as analysis, publishing, sharing.

You may have breezed through these when taking an online survey. I always read the ones that I sign.

If your intention is to destroy confidential data once your analysis is done and only share aggregated data, you can say that in your informed consent.

UT offers this template, but feel free to add to it if you feel it's not what you want to say.

12 of 39

Why should data be shared? (Poll #4)

13 of 39

Why should data be shared?

To advance science and increase innovation
To facilitate reproducibility and provide transparency in the research process
To allow others to re-use data for different purposes
To return value to communities
To comply with funder mandates
To comply with publisher mandates

Sharing data seems obvious today, but not so long ago, it wasn't easy to share data, nor was it desirable.

Researchers often had to travel to another campus or institution if they wanted to see data from previously published experiments.

Even within institutions, different departments might be competing for the same grants, so sharing research data could seem like you’re giving it away for free and you might not feel good about sharing it with a competitor.

Today, the funders require data sharing. In exchange for funding your research, they expect to be able to find and use the data that you produced.

The government, as a funding agency, sees data sharing as a way to reduce redundancies and increase innovation in this country.

It also means that others can see your research process, which serves as a check that your research was done properly. Other researchers may also wish to use your data.

The community you worked with may also want to use the data for their own purposes.

Last, publishers are requiring data be made available when they publish an article summarizing your research. This also adds legitimacy to your claims and gives you a second citable product.

14 of 39

What are examples of data re-use?

Thesis

Interview transcripts re-used from a Prostitution Diversion Program in Baltimore.

Methodology Development

Contrast methodologies from three studies that looked at postnatal care referral behavior.

Teaching

Students critique the qualitative design and methods used in the research study.

Qualitative Data Repository https://qdr.syr.edu

You might be wondering well, aren't these two things at odds? We have to share data and protect participants. The answer is yes, sometimes these two things are at odds, but there are things you can do to fulfill your obligations both to the funding agency or the publisher and to your participants. Bryan will be talking more about that.

You might also be thinking, well, how useful is my data anyway? You'd be surprised how often data gets reused.

These are all examples from the Qualitative Data Repository of data being reused. It does happen.

https://qdr.syr.edu

More recently, the rapid development of the COVID-19 vaccine and the things we learned about how COVID-19 is transmitted are cited as examples of global data sharing.

I’ve also helped graduate students find existing data sets when either they couldn’t afford to purchase data, or they needed to find data quickly. Plus, even though we are at UT, an R1 research institution, not all researchers have that benefit. So sharing data can also help researchers at less well-funded institutions.

Watson, C. Rise of the preprint: how rapid data sharing during COVID-19 has changed science forever. Nat Med 28, 2–5 (2022). https://doi.org/10.1038/s41591-021-01654-6

Now I'll turn it over to Bryan.

15 of 39

Sharing research data (or not)

Wilkinson et al. (2016; Scientific Data)

Increasing requirement to share data associated with journal articles (‘open data’)
Can create conflicts when there are potential or definitive risks associated with public sharing of data

16 of 39

Working with sensitive research data

Participants
Institutions/agencies/organizations
Researchers
Journals
Funders

17 of 39

Keep data secure while working with it

UT has provisions for storing university data based on a data classification scheme:

Published Data: The data is publicly available, and such data have no requirement for confidentiality, integrity, or availability
Controlled Data: Data not publicly available, and data releasable in accordance with the Texas Public Information Act (e.g., contents of specific e-mail, place of birth, salary, etc.)
Confidential Data: Protected specifically by federal or state law or protected by University of Texas rules and regulations or data not otherwise protected by a known civil statute or regulation, but which must be protected due to contractual agreements requiring confidentiality, integrity, or availability considerations

18 of 39

Responsible use of AI

AI services are well known to use input data for training purposes.
Free versions of AI services are not acceptable for anything other than Published Data.
The Enterprise (university contract) version of ChatGPT and the Microsoft365 (university contract) version of Copilot are acceptable only for some forms of Controlled Data.

19 of 39

Do journal/funder policies require all data to be shared? (poll)

20 of 39

Publishing sensitive data

“NIH expects that researchers will take steps to maximize scientific data sharing, but may acknowledge in Plans that certain factors (i.e., ethical, legal, or technical) may necessitate limiting sharing to some extent. Foreseeable limitations should be described in DMS Plans.”

"Data should not be shared in any way that could compromise participant anonymity or privacy, and data should not be shared if that would require the authors to break any laws or licensing agreements. […] For clinical data (Individual Participant Data) we request that you use controlled access repositories, such as clinicalstudydatarequest.com, the YODA project, or Vivli."

21 of 39

Publishing sensitive data

informed consent does not permit/expressly limits sharing and/or reuse
existing consent (not expressly informed) does not permit/expressly limits sharing and/or reuse
privacy or safety of research participants would be compromised or place them at greater risk of re-identification or suffering harm, and protective measures such as de-identification would be insufficient
explicit federal, state, local, or Tribal law, regulation, or policy prohibits disclosure
restrictions imposed by existing or anticipated agreements (e.g., with third parties, HIPAA-covered entities that provide data with specific use agreements)

22 of 39

Best practices for publishing restricted access data

Consult with journal (if appropriate) - some journals are inflexible or have their own definitions
Determine most appropriate custodian - often a researcher or research group is not the best
Determine storage location, ideally a permanent one
Provide express details on who can access, when they can access, and how they can access

23 of 39

Mechanisms for restricting access

Available upon request to corresponding author
Available upon request to third party (e.g., national health system, tribal group)
Available from repository via self-attestation
Available from repository via application

24 of 39

25 of 39

Examples of data availability statements

Author-controlled: “The datasets generated and analysed during the current study are not publicly available due to risks of individual privacy but are available from the corresponding author upon reasonable request.”

-Dyer et al. (2025, BMJ Open) [how women with pre-existing diabetes can be better supported during the inter-pregnancy interval]

Third-party-controlled: “In compliance with all regulatory and legal requirements, data were stored, accessed, and analysed with the National Safe Haven maintained by NHS National Services Scotland. Outputs were subject to disclosure checks with members of the Electronic Data Research and Innovation (eDRIS) team of the Information and Statistics Division, Scotland, prior to release to the research team for inclusion in this manuscript. Any future access to this dataset would require application to the Scottish Government.”

-Maxwell et al. (2024, European Journal of Cancer Research) [To characterise cancer diagnosis in Scottish primary care in 2018/19 and draw comparisons with diagnostic activity in 2014.]

Third-party-controlled: “Data are available from the American Cancer Society by following the ACS Data Access Procedures (https://www.cancer.org/research/population-science/research-collaboration.html) for researchers who meet the criteria for access to confidential data. Please email cohort.data@cancer.org to inquire about access.”

-Rees-Punia et al. (2025, BMJ Open) [data collection and management methods for the Cancer Prevention Study-3 (CPS-3) Accelerometry Substudy, a nested cohort of device-based physical activity and sedentary time data.]

26 of 39

Downsides of restricted access

Restricted data have restricted discoverability and reuse potential
Restricted data can be more work to manage
Restricted data can create more obstacles for reusers
Restricted data may contravene participants’ wishes

27 of 39

When to share openly

Participants explicitly consent to share scientific data openly without restrictions.
Scientific data are de-identified and institutional review has determined that they pose very low risk when shared and used, including any risks posed by the presence of information that can allow inferences to be made about a participant’s identity when combined with other information.

Re-identification is an intrinsic risk to participants regardless of whether any information that could be revealed about them is considered 'sensitive.'

28 of 39

Forms of sensitive data

Geospatial

2D images

3D images

Date	Gender
2024/03/09	M
2024/07/13	F
2022/01/28	M
2021/12/13	M
2023/02/01	F
2020/11/15	M

Text

29 of 39

How to share openly

Minimum reproducible dataset

Remove/Redact
Recode
Generalize
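As a rough illustration of the "Remove/Redact" step, the sketch below drops direct-identifier fields from a list of records before sharing. The field names and sample records are hypothetical and are not from any real dataset; the set of identifiers to remove would need to match your own data and your IRB-approved plan.

```python
# Hypothetical field names for direct identifiers; adjust to your own dataset.
DIRECT_IDENTIFIERS = {"name", "address", "ssn", "phone"}


def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with direct-identifier fields dropped."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up example records for illustration only.
participants = [
    {"name": "Participant A", "phone": "555-0100", "age": 45, "diagnosis": "asthma"},
    {"name": "Participant B", "phone": "555-0101", "age": 36, "diagnosis": "diabetes"},
]

shareable = remove_direct_identifiers(participants)
print(shareable)
```

Removing direct identifiers alone does not address indirect identifiers such as age or diagnosis, which may still need recoding or generalizing.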

30 of 39

Examples of deidentifying data for sharing: recoding

Date	Recoded variable
2024/03/09	0
2024/07/13	0
2022/01/28	3
2021/12/13	4
2023/02/01	2
2020/11/15	4

Converting DOB to age

Treatment start	Treatment end	Days between
2024/03/09	2024/08/29	173 days
2024/07/13	2025/02/13	215 days
2022/01/28	2022/11/21	297 days
2021/12/13	2023/01/10	393 days
2023/02/01	2023/12/15	317 days
2020/11/15	2022/08/03	626 days

Converting two dates to duration
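The duration recoding in the table above can be sketched in a few lines of Python. This is a minimal example that assumes dates are stored as YYYY/MM/DD strings, as on the slide; the function names are illustrative only.

```python
from datetime import datetime


def parse_date(text):
    """Parse a YYYY/MM/DD string into a date object."""
    return datetime.strptime(text, "%Y/%m/%d").date()


def days_between(start, end):
    """Replace two exact dates with the number of days between them."""
    return (parse_date(end) - parse_date(start)).days


# Treatment start and end dates from the slide's table.
treatments = [
    ("2024/03/09", "2024/08/29"),
    ("2024/07/13", "2025/02/13"),
    ("2022/01/28", "2022/11/21"),
    ("2021/12/13", "2023/01/10"),
    ("2023/02/01", "2023/12/15"),
    ("2020/11/15", "2022/08/03"),
]

durations = [days_between(start, end) for start, end in treatments]
print(durations)  # [173, 215, 297, 393, 317, 626]
```

Only the duration column would be shared; the original start and end dates stay in the secured working copy.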

31 of 39

Examples of deidentifying data for sharing: generalizing

Age	Generalized age
45	40-49
36	30-39
29	20-29
31	30-39
42	40-49
27	20-29

Binning age

Age	Generalized age
22	20+
21	20+
19	<20
20	20+
42	20+
19	<20

Binning age, capping outlier

Income	Generalized income
54,000	50-59k
62,000	60-69k
85,000	>80k
72,000	70-79k
79,000	70-79k
10,800,000	>80k

Binning income, capping outlier
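The binning and capping shown in these tables could be implemented along the following lines. This is a sketch under stated assumptions: the bin widths, the age threshold of 20, and the income cap of 80,000 are taken from the slide's examples, and treating an income of exactly 80,000 as part of the top category is my own assumption, since the slide does not show that case.

```python
def bin_age(age, width=10):
    """Generalize an exact age to a range such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def split_age(age, threshold=20):
    """Collapse ages into two groups so an outlier (e.g., 42) is not distinguishable."""
    return f"{threshold}+" if age >= threshold else f"<{threshold}"


def bin_income(income, width=10_000, cap=80_000):
    """Bin income into ranges and top-code everything at or above the cap (assumption)."""
    if income >= cap:
        return f">{cap // 1000}k"
    lower = (income // width) * width
    return f"{lower // 1000}-{(lower + width - 1) // 1000}k"


ages = [bin_age(a) for a in [45, 36, 29, 31, 42, 27]]
groups = [split_age(a) for a in [22, 21, 19, 20, 42, 19]]
incomes = [bin_income(i) for i in [54_000, 62_000, 85_000, 72_000, 79_000, 10_800_000]]

print(ages)     # ['40-49', '30-39', '20-29', '30-39', '40-49', '20-29']
print(groups)   # ['20+', '20+', '<20', '20+', '20+', '<20']
print(incomes)  # ['50-59k', '60-69k', '>80k', '70-79k', '70-79k', '>80k']
```

Wider bins and lower caps reduce re-identification risk but also reduce the analytic detail available to reusers, so the right settings depend on the dataset.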

32 of 39

Examples of deidentifying data for sharing: date shifting

Date	Recoded date
2024/03/09	2024/08/06
2024/07/13	2024/12/10
2022/01/28	2022/06/27
2021/12/13	2022/05/12
2023/02/01	2023/07/01
2020/11/15	2021/04/14

Adding 150 days to each date
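A minimal sketch of the date shift in this table, assuming YYYY/MM/DD strings and the same fixed 150-day offset used on the slide. The function name is illustrative only.

```python
from datetime import datetime, timedelta


def shift_date(text, days=150):
    """Shift a YYYY/MM/DD date by a fixed number of days."""
    shifted = datetime.strptime(text, "%Y/%m/%d") + timedelta(days=days)
    return shifted.strftime("%Y/%m/%d")


# Original dates from the slide's table.
original = [
    "2024/03/09",
    "2024/07/13",
    "2022/01/28",
    "2021/12/13",
    "2023/02/01",
    "2020/11/15",
]

recoded = [shift_date(d) for d in original]
print(recoded)
# ['2024/08/06', '2024/12/10', '2022/06/27', '2022/05/12', '2023/07/01', '2021/04/14']
```

Because every date moves by the same amount, the intervals between dates are preserved. The offset itself would need to be documented in a secure file rather than published with the data, since anyone who knows it can reverse the shift.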

33 of 39

What could be sensitive data? (poll)

34 of 39

Examples of uncommon outliers

Geographic regions with low population (e.g., 01003)
Unique job titles (e.g., Open Research Coordinator for Data and Software)
Years of employment
Advanced degrees
Household composition (people and non-people)
Consumer habits

35 of 39

Beware of jigsaw-ing and predictive variables

Jigsaw-ing

Connection of dataset to original source
Interpolation between dataset and related project materials
Interpolation between multiple datasets

Predictive variables

Some variables/identifiers can imply certain other attributes that are not listed
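One way to look for jigsaw risk before sharing is to count how many records share each combination of indirect identifiers and flag the small groups. This technique is not described in the slides; it is a common small-cell check (sometimes described as a k-anonymity check), and the field names, sample records, and threshold k below are all hypothetical.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return combinations of indirect-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up records for illustration only.
records = [
    {"zip": "78701", "age_range": "30-39", "occupation": "teacher"},
    {"zip": "78701", "age_range": "30-39", "occupation": "teacher"},
    {"zip": "78701", "age_range": "30-39", "occupation": "teacher"},
    {"zip": "01003", "age_range": "40-49", "occupation": "librarian"},
]

risky = small_groups(records, ["zip", "age_range", "occupation"], k=3)
print(risky)  # {('01003', '40-49', 'librarian'): 1}
```

A flagged combination suggests that further generalizing or removing one of those variables may be needed. The check only covers the dataset itself, so it cannot detect links to outside sources or related project materials.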

36 of 39

Considerations for repository selection

General rules of thumb

Retention policy
Infrastructure and security
Storage capacity
Required attributes (e.g., PIDs, metadata)
Cost

Sensitive-data specific

Restricted/controlled access functionality

Terms of access and reuse

Data ownership and sovereignty
Data type restrictions

37 of 39

Examples of data repositories for human data

38 of 39

Examples of generalist data repositories

*These are not formally endorsed by NIH or any other federal agency

39 of 39

Need help?

Meryl Brodsky

Liaison Librarian for Communication

meryl.brodsky@austin.utexas.edu

Bryan Gee

Open Research Coordinator for Data and Software

bryan.gee@austin.utexas.edu