Data Anonymization
Ondřej Sotolář, Spring 2021
Outline
Location Anonymization
Anonymization primer – motivation
Example faculty employee salary data (red is sensitive):
Name | Position | Salary |
Jan Kocourek | dean | 100 000 |
Anna Nováková | postdoc | 70 000 |
Jiří Nový | postdoc | 68 000 |
Pavel Ryšavý | lecturer | 52 000 |
Ondřej Dobrý | lecturer | 51 000 |
Tomáš Marný | lecturer | 50 000 |
Naive anonymization attempts (not good):
Name | Position | Salary |
| dean | 100 000 |
| postdoc | 70 000 |
| postdoc | 68 000 |
| lecturer | 52 000 |
| lecturer | 51 000 |
| lecturer | 50 000 |
Name | Position | Salary |
| | 100 000 |
| | 70 000 |
| | 68 000 |
| | 52 000 |
| | 51 000 |
| | 50 000 |
A few attacks:
Anonymization primer – definitions
Anonymization primer – tabular data
Traditional approach:
Different approach: Differential privacy (ε) [5]
All methods allow measuring data utility!
Name | Position | Salary |
| academic | 68 000 – 100 000 |
| academic | 68 000 – 100 000 |
| academic | 68 000 – 100 000 |
| non-academic | 30 000 – 52 000 |
| non-academic | 30 000 – 52 000 |
| non-academic | 30 000 – 52 000 |
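A minimal sketch of how the generalized table above could be produced and checked, assuming it illustrates generalization in the k-anonymity sense (every combination of published values is shared by at least k rows). The position-to-category mapping and the salary bands are read off the tables in these slides; the names `generalize` and `k_anonymity` are hypothetical and not taken from any library.

```python
from collections import Counter

# Raw records without names: (position, salary), as in the example table.
raw = [
    ("dean", 100_000),
    ("postdoc", 70_000),
    ("postdoc", 68_000),
    ("lecturer", 52_000),
    ("lecturer", 51_000),
    ("lecturer", 50_000),
]

# Assumed generalization hierarchy, inferred from the generalized table.
CATEGORY = {"dean": "academic", "postdoc": "academic", "lecturer": "non-academic"}
SALARY_BANDS = [(30_000, 52_000), (68_000, 100_000)]


def generalize(position, salary):
    """Map a record to its coarser category and the salary band containing it."""
    band = next(b for b in SALARY_BANDS if b[0] <= salary <= b[1])
    return CATEGORY[position], band


def k_anonymity(records):
    """Return the size of the smallest group of identical records."""
    return min(Counter(records).values())


generalized = [generalize(position, salary) for position, salary in raw]

# Removing names alone leaves the dean unique by position.
print(k_anonymity([position for position, _ in raw]))
# After generalization every published row is shared by three people.
print(k_anonymity(generalized))
```

The smallest group size is only one possible privacy measure; the width of the salary bands gives a rough sense of how much utility the generalization costs.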
Anonymization primer – location, image, text
Text anonymization – examples
Named Entity Recognition for sensitive data
Selected solutions:
MEDDOCAN 2019 Corpus
Others:
Semantic similarity
My implementation of a tool for Czech text
Purpose:
Users:
Evaluation corpus
Source: Messenger conversations – 1.5 M words
Proposed classes (only 1004 annotated):
Measuring IAA: Synthetic data
Entity Recognition
Solution:
(NameTag 1 subset + rules)
+ composition
Recognition results
Sensitive data replacements
Problem:
Replacement by hash from gazetteers
Advantages:
Versions:
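A minimal sketch of what replacement by hash from gazetteers could look like, assuming the replacement is chosen by hashing the original string and using the digest as an index into a gazetteer of the same entity class. The gazetteer contents, the `salt` parameter, and the function name `replace_from_gazetteer` are hypothetical illustrations, not the tool's actual interface.

```python
import hashlib

# Hypothetical gazetteer of first names; a real one would be much larger.
FIRST_NAMES = ["Adam", "Eva", "Karel", "Marie", "Petr", "Lucie"]


def replace_from_gazetteer(original, gazetteer, salt=""):
    """Pick a gazetteer entry deterministically from a hash of the original.

    The same input always maps to the same replacement, so repeated
    mentions of one entity stay consistent across the text.
    """
    # Lowercase first so that "JAN" and "Jan" map to the same replacement.
    data = (salt + original.lower()).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Interpret the first 8 bytes as an integer and wrap it to the list size.
    index = int.from_bytes(digest[:8], "big") % len(gazetteer)
    return gazetteer[index]


replacement = replace_from_gazetteer("Jan", FIRST_NAMES, salt="corpus-secret")
print(replacement)
```

In this sketch a secret salt keeps the mapping from being reproduced by someone who only knows the gazetteer. With a small gazetteer, different originals will often collide on the same replacement, and an original may occasionally map to itself.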
Possible improvements
Sources