1 of 21

Evaluating the Impact of De-Identification on Social and Behavioral Research Data

Sayuri Modi, Melodie Galamo, Bhargavi Alluri, Azwa Bajwah, Dakshita Pal

2 of 21

Project Description

Social and behavioral researchers often collect sensitive data about people. Publishing research data is beneficial for replication, meta-analysis, and public research. To prevent harms and privacy violations to research participants, data must be de-identified.

Principled approaches to de-identification, such as differential privacy and k-anonymity, can help ensure that data meets certain standards of privacy. However, researchers understandably have concerns that this gain in privacy will be unacceptably offset by a loss of data utility or fairness.

As a first step towards addressing researchers’ concerns, we aim to establish a baseline understanding of how existing de-identification tools impact data utility.

3 of 21

Goals

01  To understand how existing de-identification tools impact the utility of real research data

02  To be able to use the ARX and sdcMicro tools to analyze data and calculate risk factors

03  To form hypotheses about how these tools could be better designed to meet the needs of social and behavioral researchers

4 of 21

sdcMicro vs. ARX

  • sdcMicro is an R package for anonymizing microdata.
  • ARX is comprehensive open-source software for anonymizing sensitive personal data. It supports a wide variety of privacy and risk models, methods for transforming data, and methods for analyzing the usefulness of output data.

We anonymize the same dataset with both tools and compare the results to understand how each tool treats the data differently.

5 of 21

Raw Data Set

Reducing Crime and Violence: Experimental Evidence from Cognitive Behavioral Therapy in Liberia

The data records attributes of 999 criminally engaged men from Liberia and surrounding regions.

Notable variables include age, country of birth, city of birth, neighbourhood, tribe, religion, and level of education.

6 of 21

Importance of De-Identification

  • We want to ensure the data remains anonymous and cannot be traced back to a person
    • If you have someone's name, email, and phone number, you can find out who that person is!
  • What if you only had the first letter of their name, the last digit of their number, and the first two letters of their email?
    • It is much harder to find the person because so many people fit!
  • De-identification is the process of removing or regrouping demographic and personal information about participants, broadening the data so that each record could describe many people.
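
The contrast between full and partial identifiers can be sketched in a few lines of Python. This is only an illustration: the directory, names, phone numbers, and email addresses below are invented, and the helper functions are hypothetical rather than part of any tool used in the project.

```python
# Hypothetical directory; every name, number, and email here is invented.
people = [
    {"name": "Alice", "phone": "555-0101", "email": "alice@example.org"},
    {"name": "Alan", "phone": "555-0111", "email": "alan@example.org"},
    {"name": "Alma", "phone": "555-0131", "email": "alma@example.org"},
    {"name": "Bob", "phone": "555-0101", "email": "bob@example.org"},
]


def matches_full(person, name, phone, email):
    """True when all three complete identifiers match."""
    return (person["name"], person["phone"], person["email"]) == (name, phone, email)


def matches_masked(person, first_letter, last_digit, email_prefix):
    """True when only the partial identifiers match."""
    return (
        person["name"][0] == first_letter
        and person["phone"][-1] == last_digit
        and person["email"][:2] == email_prefix
    )


# Complete identifiers single out exactly one person.
full_hits = [p for p in people if matches_full(p, "Alice", "555-0101", "alice@example.org")]

# Partial identifiers fit several people, so no single person stands out.
masked_hits = [p for p in people if matches_masked(p, "A", "1", "al")]

print(len(full_hits), len(masked_hits))
```

In this toy directory the complete identifiers match one person, while the partial ones match three.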

7 of 21

Methodologies (general)

K-anonymization

  • Applies changes to the data set
    • Liberia -> L******
    • 27 -> [20-30]
  • Ensures each record shares its quasi-identifier values with at least k-1 other records
    • K-value = minimum number of records with the same quasi-identifier values

Generalizing (Hierarchy)

  • First step to de-identifying.
  • Classify each attribute as Insensitive, Sensitive, Quasi-Identifier, or Identifier (removed)
  • Use intervals, ordering, masking, and priorities to make data less specific
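
The two transformations on this slide, and the k-value they lead to, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how ARX or sdcMicro implement k-anonymity: the function names, the fixed interval width of 10, and the five sample records are all made up for the example.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with an interval, e.g. 27 -> [20-30]."""
    low = (age // width) * width
    return f"[{low}-{low + width}]"


def mask(value, keep=1):
    """Keep the first `keep` characters and mask the rest, e.g. Liberia -> L******."""
    return value[:keep] + "*" * (len(value) - keep)


def k_value(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Invented sample records; not taken from the Liberia data set.
raw = [
    {"age": 27, "country": "Liberia"},
    {"age": 23, "country": "Liberia"},
    {"age": 29, "country": "Liberia"},
    {"age": 34, "country": "Liberia"},
    {"age": 38, "country": "Liberia"},
]

generalized = [
    {"age": generalize_age(r["age"]), "country": mask(r["country"])} for r in raw
]

quasi_identifiers = ["age", "country"]
k_raw = k_value(raw, quasi_identifiers)          # every exact age is unique
k_gen = k_value(generalized, quasi_identifiers)  # ages now fall into two intervals

print(k_raw, k_gen)
```

Before generalization every record is unique (k = 1). After generalization the records fall into two groups of sizes 3 and 2, so the sample is 2-anonymous.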

8 of 21

sdcMicro

01

9 of 21

Results

10 of 21

Results

11 of 21

Results

12 of 21

ARX

02

13 of 21

Results:

K-Anonymization

  • K-value of 4!

Generalized:

  • Country
  • Age
  • County
  • Neighbourhood
  • Tribe
  • Religion

14 of 21

15 of 21

16 of 21

17 of 21

Comparison

By cross-examining the outputs, we saw risk/utility percentages similar to those the original tools reported!
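
One way such a cross-check could be done independently of either tool is to recompute a simple re-identification risk from the anonymized output. The sketch below assumes a common definition in which a record's risk is 1 divided by the number of records sharing its quasi-identifier values. It does not reproduce the exact risk or utility metrics that ARX or sdcMicro report, and the function name and sample records are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Per-record risk = 1 / (size of the record's quasi-identifier group)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    per_record = [1 / sizes[key] for key in keys]
    return {
        "highest": max(per_record),
        "average": sum(per_record) / len(per_record),
    }


# Invented anonymized output: one group of 4 records and one group of 8.
anonymized = (
    [{"age": "[20-30]", "country": "L******"}] * 4
    + [{"age": "[30-40]", "country": "L******"}] * 8
)

risk = reidentification_risk(anonymized, ["age", "country"])
print(f"highest risk: {risk['highest']:.0%}, average risk: {risk['average']:.1%}")
```

With a smallest group of 4 records, the highest risk is 1/4, which corresponds to a k-value of 4. The percentages computed this way can then be compared against the figures each tool displays.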

18 of 21

Final Thoughts

03

19 of 21

Challenges Faced

  • Researching and understanding de-identification techniques
  • Learning curve on de-identification tools
    • Researching how to use them
    • Limitations on which techniques could be used
  • Cross-examining data collected from each tool

20 of 21

Takeaways

  • Learned the importance of de-identification
  • Learned how to implement various security and privacy techniques
  • Collaborated on the use of each tool and technique
  • Learned more about the research process, specifically how to investigate a topic further before asking a research question

21 of 21

Thank you!

Special thanks to graduate students Wentao and Emma!