1 of 20

Data Anonymization

Ondřej Sotolář, Spring 2021

2 of 20

Outline

  1. Background
  2. Legal aspects
  3. Threats
  4. Methods
  5. Tools

3 of 20

Location Anonymization

  • Zang, Hui, and Jean Bolot. "Anonymization of location data does not work: A large-scale measurement study." Proceedings of the 17th annual international conference on Mobile computing and networking. 2011.

4 of 20

Anonymization primer – motivation

Example faculty employee salary data (the salary column, shown in red in the slides, is sensitive):

  Name            Position    Salary (CZK)
  Jan Kocourek    dean        100 000
  Anna Nováková   postdoc      70 000
  Jiří Nový       postdoc      68 000
  Pavel Ryšavý    lecturer     52 000
  Ondřej Dobrý    lecturer     51 000
  Tomáš Marný     lecturer     50 000

Naive anonymization attempts (neither is good):

a) names removed:

  Name   Position    Salary (CZK)
  –      dean        100 000
  –      postdoc      70 000
  –      postdoc      68 000
  –      lecturer     52 000
  –      lecturer     51 000
  –      lecturer     50 000

b) names and positions removed:

  Name   Position    Salary (CZK)
  –      –           100 000
  –      –            70 000
  –      –            68 000
  –      –            52 000
  –      –            51 000
  –      –            50 000

There is only one dean, so that row is re-identified immediately, and background knowledge about positions or approximate salaries still links the remaining rows to individuals.

A few attacks:

  • AOL 2006
  • Netflix 2007
  • Cambridge Analytica 2018
  • Avast 2020
  • WhatsApp 2021

5 of 20

6 of 20

Anonymization primer – definitions

  • Goal
    • meet legal requirements (GDPR, HIPAA, local laws, etc.)
    • protect sensitive or confidential information
  • Data types
    • tabular, text, location, image
  • Definitions
    • personal information (GDPR)
      • any information relating to an identified or identifiable natural person
      • should be protected against identification by all means reasonably likely to be used [1]
    • identifier, quasi-identifier, sensitive attribute
    • re-identification risks: singling out, linkability, inference [1]
    • re-identification datasets
      • assume the attacker has access to everything publicly available (individuals or organizations might hold even more)
  • Methods
    • randomize, generalize, mask, delete
  • Evaluate
    • anonymization quality / data utility

7 of 20

Anonymization primer – tabular data

Traditional approach:

  • k-anonymity [2]
    • each record is indistinguishable from at least k−1 other records on its quasi-identifiers (a sketch follows after the example table)
  • l-diversity [3]
    • additionally requires diverse sensitive values within each quasi-identifier group
  • t-closeness [4]
    • the distribution of a sensitive attribute within each group must stay close to its overall distribution
  • and others (δ-presence, k-map, etc.)

Different approach: Differential privacy (ε) [5]

  • add calibrated random noise
    • the raw data is never released directly
    • data is accessible only through a view (API) that adds noise on the fly (a sketch follows)

All methods allow measuring data utility!
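
As an illustration of the differential-privacy idea, a minimal Python sketch over the toy salary values from slide 4; the Laplace sampler, the clipping bounds, and the ε value are illustrative choices made for this sketch, not part of any cited tool:

    import math
    import random

    # Toy salary data from slide 4 (values in CZK).
    salaries = [100_000, 70_000, 68_000, 52_000, 51_000, 50_000]

    def laplace_noise(scale):
        # Sample from Laplace(0, scale) via inverse-transform sampling.
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def dp_mean(values, lower, upper, epsilon):
        # Answer a mean query with epsilon-differential privacy.
        # Values are clipped to [lower, upper]; with n fixed and public,
        # the sensitivity of the clipped mean is (upper - lower) / n.
        clipped = [min(max(v, lower), upper) for v in values]
        true_mean = sum(clipped) / len(clipped)
        sensitivity = (upper - lower) / len(clipped)
        return true_mean + laplace_noise(sensitivity / epsilon)

    # The analyst only ever sees noisy answers, never the raw table.
    print(dp_mean(salaries, lower=30_000, upper=120_000, epsilon=1.0))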

Example: generalized (k = 3) view of the salary table from slide 4:

  Name   Position       Salary (CZK)
  –      academic       68 000 – 100 000
  –      academic       68 000 – 100 000
  –      academic       68 000 – 100 000
  –      non-academic   30 000 – 52 000
  –      non-academic   30 000 – 52 000
  –      non-academic   30 000 – 52 000
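
The table above can be produced by coarsening the quasi-identifiers; a minimal Python sketch, where the position-to-category mapping and the salary bands simply mirror this toy example rather than implementing a general k-anonymization algorithm:

    from collections import Counter

    records = [
        ("Jan Kocourek",  "dean",     100_000),
        ("Anna Nováková", "postdoc",   70_000),
        ("Jiří Nový",     "postdoc",   68_000),
        ("Pavel Ryšavý",  "lecturer",  52_000),
        ("Ondřej Dobrý",  "lecturer",  51_000),
        ("Tomáš Marný",   "lecturer",  50_000),
    ]

    # Illustrative generalization hierarchy matching the slide's example.
    CATEGORY = {"dean": "academic", "postdoc": "academic", "lecturer": "non-academic"}
    SALARY_BAND = {"academic": "68 000 - 100 000", "non-academic": "30 000 - 52 000"}

    def generalize(rows):
        # Drop the identifier (name) and coarsen the quasi-identifiers.
        return [(CATEGORY[pos], SALARY_BAND[CATEGORY[pos]]) for _, pos, _ in rows]

    def is_k_anonymous(rows, k):
        # Every combination of generalized quasi-identifier values
        # must occur at least k times.
        return all(count >= k for count in Counter(rows).values())

    generalized = generalize(records)
    print(generalized)
    print(is_k_anonymous(generalized, k=3))  # True: each group has 3 records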

8 of 20

Anonymization primer – location, image, text

  • Location
    • some tasks are impossible while keeping utility (the 2 most frequently visited locations alone identify ~50% of users)
  • Image
    • manual or automatic recognition of identifying content is frequently used
  • Text
    • generally hard
    • easier on domain-specific data
    • frequently required in domains:
      • legal documents
      • medical records
      • data for sociological research
      • advertising
      • social network mining
    • tasks:
      • classify sensitive documents
      • recognize identifiers, quasi-identifiers and sensitive data in text
      • anonymize and evaluate quality and utility

9 of 20

Text anonymization - examples

  • Original
    • Petr Konečný, narozen v Brně 4.12.1980, je rektorem Masarykovy Univerzity. Jeho plat činí 153 000 kč. Bydlí v Brně na Táborské 105.
    • (English: Petr Konečný, born in Brno on 4 Dec 1980, is the rector of Masaryk University. His salary is 153 000 CZK. He lives in Brno at Táborská 105.)
  • NER (+ coref)
    • recognized entities in the original: Petr [NAME], Konečný [SURNAME], Brně [LOCATION], 4.12.1980 [DATE], rektorem [OCCUPATION], Masarykovy Univerzity [ORGANIZATION], 153 000 kč [MONEY], Brně [LOCATION], Táborské 105 [LOCATION]
  • a) Masking (see the sketch below)
    • [NAME] [SURNAME], narozen v [LOCATION] [DATE], je [OCCUPATION] [ORGANIZATION]. Jeho plat činí [MONEY]. Bydlí v [LOCATION] na [LOCATION].
  • b) Generalization
    • Muž, narozen v ČR 1970-1990, je vedoucím pracovníkem společnosti. Jeho plat činí 150 000-200 000 kč. Bydlí v jihomoravském kraji.
    • (English: A man, born in the Czech Republic between 1970 and 1990, is a senior manager of an organization. His salary is 150 000-200 000 CZK. He lives in the South Moravian Region.)
  • c) Randomize / Add noise
    • Petr Nový, narozen v Hodoníně 1.1.1963, je rektorem Masarykovy Univerzity. Jeho plat činí 113 000 kč. Bydlí v Hodoníně na Lipové 23.
    • (English: the name, birthplace, date of birth, salary, and address are replaced with plausible random values.)
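
A minimal sketch of the masking step (variant a) in Python; the entity dictionary stands in for the output of a NER step and is a hypothetical example, not the author's tool:

    # Hypothetical NER output: surface form -> entity label.
    entities = {
        "Petr": "NAME",
        "Konečný": "SURNAME",
        "Brně": "LOCATION",
        "4.12.1980": "DATE",
        "rektorem": "OCCUPATION",
        "Masarykovy Univerzity": "ORGANIZATION",
        "153 000 kč": "MONEY",
        "Táborské 105": "LOCATION",
    }

    text = ("Petr Konečný, narozen v Brně 4.12.1980, je rektorem Masarykovy "
            "Univerzity. Jeho plat činí 153 000 kč. Bydlí v Brně na Táborské 105.")

    def mask(text, entities):
        # Replace longer surface forms first so shorter ones cannot clobber them.
        for surface, label in sorted(entities.items(), key=lambda e: -len(e[0])):
            text = text.replace(surface, f"[{label}]")
        return text

    print(mask(text, entities))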

10 of 20

Named Entity Recognition for sensitive data

  • Publicly available labeled corpora of natural (non-synthetic) text are scarce – a problem

Selected solutions:

  1. IberLEF 2019: MEDDOCAN track [6]
    • Spanish (synthetic) medical records
    • various approaches (LSTM, BERT, etc.); each of the top 10 papers exceeds 0.95 F1
  2. Štefánik, M. [8]
    • Czech contracts
    • ELMo + LSTM + rules: 0.65 F1
  3. Sotolář, O. [12]
    • Czech Messenger conversations
    • NameTag 1 (CRF) + rules: 0.43 F1
    • human baseline: 0.88 F1

11 of 20

MEDDOCAN 2019 Corpus

  • synthetic data
  • 1000 records
  • 495k words

Others:

  • i2b2, n2c2 [7]

12 of 20

Semantic similarity

  • C-Sanitized: Sánchez, D., Batet, M., 2016 – medical records [9]
    • recognize:
      • an a priori defined set of sensitive terms C
      • terms/phrases semantically related to C, found via information content (IC) and pointwise mutual information (PMI)
        • probabilities estimated from web-search hit counts
    • generalize:
      • knowledge base: WordNet, SNOMED-CT (medical)
    • evaluate:
      • dataset of Wikipedia articles on the topic
      • data utility = compare the total information content (Σ IC) of a document before and after sanitization
  • Hassan, F., Sánchez, D., 2019 – recognition improvement [10]
    • w2v embeddings of noun, verb, and adjective phrases (e.g. “Masaryk University”)
    • find phrases similar to C by cosine similarity (a sketch follows below)
    • no comment on measuring utility
  • N-Sanitized: Iwendi, C., et al., 2020 [11]
    • C-Sanitized with negative sentences (e.g. “HIV-negative”) removed
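
A minimal sketch of the embedding-similarity detection idea from [10]; the vectors, the threshold, and the function names are placeholders, while a real system would use word2vec phrase vectors trained on the corpus:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    # Hypothetical phrase embeddings (in practice: w2v vectors for noun,
    # verb, and adjective phrases extracted from the corpus).
    embeddings = {
        "HIV": [0.9, 0.1, 0.0],
        "antiretroviral therapy": [0.8, 0.2, 0.1],
        "football match": [0.1, 0.9, 0.2],
    }

    sensitive_terms = ["HIV"]   # the a priori defined set C
    THRESHOLD = 0.8             # illustrative cut-off

    def related_to_c(phrase):
        # A phrase is treated as sensitive if it is close to any term in C.
        return any(cosine(embeddings[phrase], embeddings[c]) >= THRESHOLD
                   for c in sensitive_terms)

    for phrase in embeddings:
        print(phrase, related_to_c(phrase))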

13 of 20

My implementation of a tool for Czech text

Purpose:

  • protect the privacy of third parties in the online conversations of research subjects
    • annotators/researchers should not be able to identify participants

Users:

  • annotators/researchers from the IRTIS team, FI + FSS
    • project FUTURE
      • use of technologies by adolescents
      • data collected from research subjects’ smartphones
        • little prior knowledge of the data
        • currently Messenger & WhatsApp conversations

14 of 20

Evaluation corpus

Source: Messenger conversations – 1.5 M words

Proposed classes (only 1004 annotated):

  • Name
    • surname + 0–n first names (possibly non-adjacent)
    • my mistake: first names appearing alone were not included
    • nicknames were not included
  • Id
  • Location
  • Contact Information
  • Missing: HPI

Measuring IAA: synthetic data

  • problem: each annotator only sees their own files, so agreement cannot be measured directly
  • solution: randomly inject generated full names, phone numbers, etc. into every file (a sketch follows below)
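
A minimal sketch of the synthetic-data trick for IAA; the FAKE_NAMES list, fake_phone, and inject are toy placeholders invented for this illustration, not taken from the thesis:

    import random

    # Toy generators; a real run would draw from larger Czech name lists.
    FAKE_NAMES = ["Petr Novák", "Jana Svobodová", "Tomáš Dvořák", "Eva Černá"]

    def fake_phone(rng):
        return "+420 " + " ".join(f"{rng.randint(0, 999):03d}" for _ in range(3))

    def inject(messages, n_fakes, seed=0):
        # Insert identical generated entities into an annotator's file so that
        # agreement on them can be compared across annotators afterwards.
        rng = random.Random(seed)          # same seed -> same fakes for everyone
        gold = [rng.choice(FAKE_NAMES) for _ in range(n_fakes)]
        gold += [fake_phone(rng) for _ in range(n_fakes)]
        augmented = list(messages)
        for fake in gold:
            augmented.insert(rng.randrange(len(augmented) + 1), f"napiš mi: {fake}")
        return augmented, gold

    msgs, gold = inject(["ahoj, jak se máš?", "ok, zítra v 8"], n_fakes=2)
    print(gold)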

15 of 20

Entity Recognition

Solution:

  • a subset of NameTag 1 (CRF) entity classes + hand-written rules
  • combined by composition (a minimal sketch follows below)
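
A minimal sketch of the composition idea; nametag_entities() is a placeholder standing in for a call to the real NameTag 1 recognizer (it is not NameTag's actual API), and PHONE_RE is just one example rule:

    import re

    PHONE_RE = re.compile(r"(?:\+420\s?)?\d{3}\s?\d{3}\s?\d{3}")

    def nametag_entities(text):
        # Placeholder for the NameTag 1 recognizer restricted to the
        # personal-data subset of entity classes (names, locations, ...).
        return []  # [(start, end, label), ...]

    def rule_entities(text):
        # Hand-written rules for entities NameTag does not cover well,
        # e.g. phone numbers.
        return [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]

    def recognize(text):
        # Composition: union of both recognizers; on overlap keep the longer span.
        spans = sorted(nametag_entities(text) + rule_entities(text),
                       key=lambda s: (s[0], -(s[1] - s[0])))
        result = []
        for span in spans:
            if not result or span[0] >= result[-1][1]:
                result.append(span)
        return result

    print(recognize("Zavolej mi na 777 123 456."))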

16 of 20

Recognition results

  • NameTag 1 on CNEC reaches ~0.82 F1 (NameTag 2, at ~0.84, is the state of the art for Czech)
  • a thorough error analysis is still missing
  • untagged nicknames were observed
    • “tome” for Tomáš, etc.
  • utility was deemed acceptable w.r.t. the PDE content of the text (≈ 1/1000)

17 of 20

Sensitive data replacements

  • Gazetteer replacement by salted hash
    • POS tags used to retain word forms (MorphoDiTa)
  • Generalization
    • reduce the precision of numerals, contact information, URLs, and locations (a sketch follows below)
  • Randomization
    • keeps the surface form
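
A minimal sketch of the precision-reduction (generalization) step; the regexes, the rounding granularity, and the URL handling are illustrative choices for this sketch, not the tool's actual rules:

    import re
    from urllib.parse import urlsplit

    NUM_RE = re.compile(r"\b\d{2,}\b")
    URL_RE = re.compile(r"https?://\S+")

    def generalize_number(match):
        digits = match.group(0)
        # Keep the leading digit, zero out the rest (e.g. 1534 -> 1000).
        return digits[0] + "0" * (len(digits) - 1)

    def generalize_url(match):
        # Keep only the scheme and host, drop the path and query.
        parts = urlsplit(match.group(0))
        return f"{parts.scheme}://{parts.netloc}/..."

    def generalize(text):
        text = URL_RE.sub(generalize_url, text)
        text = NUM_RE.sub(generalize_number, text)
        return text

    print(generalize("Sejdeme se v 18:30 na https://example.com/user/12345?ref=x"))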

Problem:

  • Missing evaluation of utility

18 of 20

Replacement by hash from gazetteers

Advantages:

  • keep track of entities across documents (a sketch follows below)
  • reasonably secure without knowledge of the original gazetteer and the salt

Versions:

  • Version 1:

  • Version 2:
    • split names into male/female lists
    • reduce first names to those appearing in the (name-day) calendar only
    • reduce surnames to the n most common single-word surnames
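
A minimal sketch of the salted-hash replacement; the SALT value, the gazetteer contents, and the choice of SHA-256 are illustrative assumptions, while the key property shown is that the same original name always maps to the same substitute across documents:

    import hashlib

    SALT = b"keep-this-secret"

    # Replacement gazetteer, e.g. the n most common single-word surnames
    # or calendar first names (Version 2 above).
    GAZETTEER = ["Novák", "Svoboda", "Dvořák", "Černý", "Procházka"]

    def replacement(original):
        # Deterministically map an original name to a gazetteer entry.
        digest = hashlib.sha256(SALT + original.lower().encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(GAZETTEER)
        return GAZETTEER[index]

    # The same person gets the same replacement in every document,
    # so conversations remain linkable after anonymization.
    print(replacement("Konečný"), replacement("Konečný"), replacement("Nováková"))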

19 of 20

Possible improvements

  • Recognition
    • augment CNEC with the following and retrain:
      • my corpus
      • automatically labeled nickname forms of names (“tome” for Tomáš)
      • Wikipedia
    • leverage the actual data format: the message sender's name is known
    • add a thorough error analysis
    • try coreference resolution on the recognized entities
    • try newer NER methods: embeddings, BERT
  • Replacement
    • evaluate utility
      • compare utility before/after anonymization
      • derive generalizations from a treebank
      • measure information content from search-engine hits
      • human evaluation: Bayesian testing

20 of 20

Sources

  1. ARTICLE 29 DATA PROTECTION WORKING PARTY (an independent European advisory body on data protection and privacy). Opinion 05/2014 on Anonymisation Techniques. 2014.
  2. SAMARATI, Pierangela; SWEENEY, Latanya. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. 1998.
  3. MACHANAVAJJHALA, Ashwin, et al. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007, 1.1: 3-es.
  4. LI, Ninghui; LI, Tiancheng; VENKATASUBRAMANIAN, Suresh. t-closeness: Privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007. p. 106-115.
  5. DWORK, Cynthia. Differential privacy. In: 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006). 2006.
  6. http://ceur-ws.org/Vol-2421/
  7. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
  8. https://www.gaussalgo.com/kyberneticka-bezpecnost/revealing-sensitive-documents-with-ner-practical-case-study
  9. SÁNCHEZ, David; BATET, Montserrat. C‐sanitized: A privacy model for document redaction and sanitization. Journal of the Association for Information Science and Technology, 2016, 67.1: 148-163.
  10. HASSAN, Fadi, et al. Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings. In: 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). IEEE, 2019. p. 358-365.
  11. IWENDI, Celestine, et al. N-Sanitization: A semantic privacy-preserving framework for unstructured medical datasets. Computer Communications, 2020, 161: 160-171.
  12. https://is.muni.cz/th/b4tkt/