Pseudonymization of learner corpora
Elena Volodina
Språkbanken Text, University of Gothenburg, Sweden
Why learner data?
SLA research
L1 identification
Essay grading/assess.
Writing support tools
Error detect/corr
Lexical profiling
Grammar profiling
Pseudonymization
Bias detection
…etc
Group labels on the slide: methodological issues, NLP/ICALL, ICALL, theory building
Why personal data?
| L1 | gender | L2 level | other langs | educ. backgr | L2 country residence | Task info | … |
SLA research | x | x | x | x | x | x | x | x |
L1 identification | x | ? | x | ? | x | | x | |
Essay grading/assess. | x | ? | x | ? | x | | x | |
Writing support tools | x | ? | x | ? | x | | x | |
Error detect/corr | x | ? | x | ? | ? | | ? | |
Lexical profiling | ? | | x | | | | x | |
Grammar profiling | ? | | x | | | | x | |
Pseudonymization | x | x | x | x | x | x | x | x |
Bias detection | x | x | x | ? | x | x | x | x |
…etc | | | | | | | | |
Group labels on the slide: NLP/ICALL, ICALL, theory building, methodological issues
Minimizing risks
Outline
Introduction to the SweLL project
SweLL: Swedish Learner Language
SweLL – research infrastructure for Swedish as a Second Language
https://spraakbanken.gu.se/en/projects/swell
(2017-2020)
SweLL promises (main)
https://spraakbanken.gu.se/en/projects/swell/l2korp
BIG GOAL
and IT ALL STARTS WITH...
Challenges of learner data collection and availability
challenges
Risks of sharing personal data
General Data Protection Regulation (GDPR)
Personal data is...
...any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
GDPR, Article 4
GDPR
Anonymization vs pseudonymization
https://www.linkedin.com/pulse/anonymization-does-work-big-data-due-lack-protection-direct-lafever
Sweeney (2000)
http://ggs685.pbworks.com/w/file/fetch/94376315/Latanya.pdf
87% of the US population could be uniquely identified from 5-digit ZIP code, gender, and date of birth alone
→
À la Sweeney (2000)?
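The Sweeney-style question can be asked of learner metadata too: how many learners become unique once a few metadata fields are combined? The following is a minimal sketch, not part of the SweLL tooling; the records, field names, and values are all invented for illustration.

```python
from collections import Counter

# Hypothetical learner metadata; every field name and value here is made up.
records = [
    {"l1": "Polish", "gender": "F", "l2_level": "B1", "residence_years": 2},
    {"l1": "Polish", "gender": "F", "l2_level": "B1", "residence_years": 2},
    {"l1": "Arabic", "gender": "M", "l2_level": "A2", "residence_years": 1},
    {"l1": "Luxembourgish", "gender": "F", "l2_level": "B2", "residence_years": 3},
]

def unique_share(rows, fields):
    """Return the fraction of rows whose combination of `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Combining more fields makes more learners unique, and therefore easier to single out.
share_all = unique_share(records, ["l1", "gender", "l2_level", "residence_years"])
share_gender = unique_share(records, ["gender"])
```

On this toy data, combining all four fields singles out half of the learners, while gender alone singles out only one of the four.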
PII-1
PII-2
...
→
Can we identify a person?
Learner corpus: text information + personal metadata + task metadata
Other sources: public statistics/records + research data + accidental context
PII → Personally Identifiable Information
→
Pseudonymization
Data protection (Danezis et al. 2014)
https://www.enisa.europa.eu/publications/privacy-and-data-protection-by-desig
Pseudonymization
Pseudonymization of learner corpora
No proper guidelines on pseudonymization
Lack of systematic studies
Stemle et al. (2019)
all names → Adam, Eva or Sin (+ keeping suffixes)
geo → village<priv>
hard-coded; risky passages deleted
@name, @place, @something
Megyesi et al. (2018)
SweLL approach (principles for manual pseudonymization)
Principles (domain-specific, of course)
Clear mark-up of pseudonymized segments
15 head categories, 40 subcategories, morph. markers
Categories (domain-specific, of course)
SVALA - SweLL annotation tool
Dan Rosén,
research engineer
Arild Matsson,
research engineer
Samir Ali Mohammed,
systems developer
Volodina et al. (2020)
SweLL automatic pseudonymizer service in SVALA
Why automate?
SVALA - pseudonymizer: facts
Git Repo: https://github.com/SamirYousuf/LR_project
Pseudonymizer demo
DEMO
SVALA pseudonymizer – hands-on demo
Example essay (translated into English, with mock learner errors)
I live in Stockholm on apartement . Jag är 29 år gammal . I live with my boyfriend . His name is Cezary . Apartement mine has a pattio and tree room . I enjoy there in Stockhulm but a lot of time to goto shop , fortifive minut . I have the buss and the Stockholm train . I lived in Danmark bifore , in Odense . It was less than Stockholm . I enjoy their too becaus I had more friends . I think it is hard to have friends here . But I enjoy better job here . In Odense jobbe I only on one website . In Stockholm I work on many website . I am webdevelooper . But Stockholm is closser to Luxembourg than Odense . It is important how one lives because I am not in my country . I mess my mother and my father but I live her with my boyfriend .
https://tinyurl.com/y3m8uqjs (Slides 38 + 39)
A link to this presentation
https://tinyurl.com/y3m8uqjs
How does it work?
KNOW-HOW
I live in Stockholm on apartement . I am 29 years old . I live with my boyfriend . His name is Cezary .
I live in Stockholm on apartement . Jag är 29 år gammal . I live with my boyfriend . His name is Cezary .
I live in @city on apartement . Jag är @age år gammal . I live with my boyfriend . His name is @name .
I live in Gothenburg on apartement . Jag är 31 år gammal . I live with my boyfriend . His name is Johan .
ORIGINAL
DETECTION
LABELING
PSEUDONYMIZATION
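The detection, labeling, and pseudonymization steps can be sketched as a small token-level pipeline. This is only an illustration under stated assumptions, not the SVALA pseudonymizer itself: the gazetteers, the age rule, and the function names are hypothetical, and detection and labeling are merged into one function for brevity.

```python
import random

# Hypothetical gazetteers; a real system would need far larger lists or a trained model.
CITIES = {"Stockholm", "Gothenburg", "Odense", "Malmö"}
NAMES = {"Cezary", "Johan", "Erik"}

def detect_and_label(tokens):
    """Return one label per token: '@city', '@name', '@age', or None."""
    labels = []
    for i, tok in enumerate(tokens):
        if tok in CITIES:
            labels.append("@city")
        elif tok in NAMES:
            labels.append("@name")
        elif tok.isdecimal() and i + 1 < len(tokens) and tokens[i + 1] in {"år", "years"}:
            labels.append("@age")  # a number directly followed by "år"/"years"
        else:
            labels.append(None)
    return labels

def pseudonymize(tokens, labels, rng):
    """Replace each labeled token with a different value of the same category."""
    out = []
    for tok, label in zip(tokens, labels):
        if label == "@city":
            out.append(rng.choice(sorted(CITIES - {tok})))
        elif label == "@name":
            out.append(rng.choice(sorted(NAMES - {tok})))
        elif label == "@age":
            out.append(str(int(tok) + rng.choice([-2, -1, 1, 2])))
        else:
            out.append(tok)
    return out

text = "I live in Stockholm on apartement . Jag är 29 år gammal . His name is Cezary ."
tokens = text.split()
labels = detect_and_label(tokens)
labeled = " ".join(lab or tok for tok, lab in zip(tokens, labels))
pseudonymized = " ".join(pseudonymize(tokens, labels, random.Random(0)))
```

Here `labeled` holds the intermediate placeholder form and `pseudonymized` the final text. The sketch replaces each occurrence independently; a usable system would also need to map repeated mentions of the same entity to the same replacement.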
Data for evaluation
TOTAL: 285 essays, ≈55,000 tokens, several levels / genres / topics
Results
Accuracy
Depends on
Topic/genre specifics
Evaluative texts / Investigative texts:
Tag statistics
Detection - (non-)capitalizing
Detection - misspellings
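One way to catch misspelled place names such as "Stockhulm" is approximate string matching against a gazetteer. The sketch below uses Python's standard `difflib`; the gazetteer, the `match_place` helper, and the 0.8 cutoff are assumptions chosen for illustration, not the method used in SVALA.

```python
import difflib

# Hypothetical gazetteer of place names; a real list would be much larger.
GAZETTEER = ["Stockholm", "Gothenburg", "Odense", "Denmark", "Luxembourg"]

def match_place(token, cutoff=0.8):
    """Return the closest gazetteer entry for a possibly misspelled token, or None."""
    hits = difflib.get_close_matches(token, GAZETTEER, n=1, cutoff=cutoff)
    return hits[0] if hits else None

stockhulm = match_place("Stockhulm")    # misspelled city from the example essay
danmark = match_place("Danmark")        # misspelled or code-switched country name
ordinary = match_place("apartement")    # ordinary misspelled word, not a place
```

The cutoff trades recall for precision: a lower value catches heavier misspellings but also starts matching ordinary words against place names.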
Detection - heavily unstructured PIIs
Detection (sensitive)
Detection - heavily unstructured PIIs
Labeling - ambiguity
Pseudonymizing - linguistic constraints
Pseudonymizing - grammar constraints
I had vacation on Cuba → I had vacation on Portugal
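The Cuba → Portugal example goes wrong because an island is replaced by a mainland country, so the preposition no longer fits. One possible mitigation, sketched here under the assumption that places carry a subtype in the gazetteer, is to draw replacements only from the same subtype. The `PLACES` table and both helper functions are hypothetical.

```python
import random

# Hypothetical typed gazetteer: the subtype decides which preposition a name takes.
PLACES = {
    "island": ["Cuba", "Crete", "Malta"],
    "country": ["Portugal", "Poland", "Germany"],
}

def subtype_of(place):
    """Return the subtype a place is listed under, or None if it is unknown."""
    for subtype, names in PLACES.items():
        if place in names:
            return subtype
    return None

def replace_place(place, rng):
    """Pick a different place of the same subtype so the surrounding grammar still fits."""
    subtype = subtype_of(place)
    if subtype is None:
        return place  # unknown place: leave it for manual handling
    candidates = [p for p in PLACES[subtype] if p != place]
    return rng.choice(candidates)

new_island = replace_place("Cuba", random.Random(1))
new_country = replace_place("Portugal", random.Random(1))
```

With this constraint, "on Cuba" can only become "on Crete" or "on Malta", never "on Portugal".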
Pseudonymizing - projecting errors?
Stockhulm → Gothinburg?
Pseudonymizing - projecting grammar features?
I ate at Frank’s house → I ate at Harry house
(Think Swedish: at Lars house → At Sven house)
How to evaluate the pseudonymization step?
Visions and agenda
Visions
Agenda
Data for shared tasks
Datasets (e.g. Medlock 2006)
| Tokens (Medlock) | Documents (Medlock) | Tokens (SweLL) | Documents (SweLL) |
Training set | 666,138 | | | |
Development/validation set | 6,026 | | | |
Test/evaluation/holdout set | 31,926 | | | |
Total | 704,090 | 2,500 emails | 211,563 | 668 essays |
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.380.8037&rep=rep1&type=pdf
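To make the split proportions in the table explicit, the three token counts from its first column (which appears to be the Medlock 2006 data) can be turned into percentages. This is plain arithmetic on the figures shown above; the variable names are illustrative.

```python
# Token counts copied from the first column of the datasets table.
split_tokens = {"training": 666_138, "development": 6_026, "test": 31_926}

total = sum(split_tokens.values())

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_tokens.items()}
```

The three counts add up to the table's total, and the training set accounts for the vast majority of the tokens.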
Visions: Pseudonymization on the fly
Imagine a scenario:
Write essays online
Automatic pseudonymization
Correction of pseudonymization suggestions
Automatic analysis of an essay
Correction reports to improve the pseudonymization algorithm
Online consent
Online metadata
Upload essay to a database
Informed consent
Personal metadata
Account + ID
Essay
Automatic pseudonymization
Manual correction
Approval of pseudonymization
Essay analysis
Add to a corpus
Thank you!
Link to the presentation:
https://tinyurl.com/y3m8uqjs