1 of 65

Pseudonymization of learner corpora

Elena Volodina

Språkbanken Text, University of Gothenburg, Sweden

2 of 65

Why learner data?


SLA research
L1 identification
Essay grading/assess.
Writing support tools
Error detect/corr
Lexical profiling
Grammar profiling
Pseudonymization
Bias detection
…etc

Diagram labels: NLP/ICALL; ICALL; theory building; methodological issues

3 of 65

Why personal data?

	L1	gender	L2 level	other langs	educ. backgr	L2 country residence	Task info	…
SLA research	x	x	x	x	x	x	x	x
L1 identification	x	?	x	?	x		x
Essay grading/assess.	x	?	x	?	x		x
Writing support tools	x	?	x	?	x		x
Error detect/corr	x	?	x	?	?		?
Lexical profiling	?		x				x
Grammar profiling	?		x				x
Pseudonymization	x	x	x	x	x	x	x	x
Bias detection	x	x	x	?	x	x	x	x
…etc

Diagram labels: NLP/ICALL; ICALL; theory building; methodological issues

4 of 65

Minimizing risks

	L1	gender	L2 level	other langs	educ. backgr	L2 country residence	Task info	…
SLA research	x	x	x	x	x	x	x	x
L1 identification	x	?	x	?	x		x
Essay grading/assess.	x	?	x	?	x		x
Writing support tools	x	?	x	?	x		x
Error detect/corr	x	?	x	?	?		?
Lexical profiling	?		x				x
Grammar profiling	?		x				x
Pseudonymization	x	x	x	x	x	x	x	x
Bias detection	x	x	x	?	x	x	x	x
…etc

Diagram labels: NLP/ICALL; ICALL; theory building; methodological issues

5 of 65

Outline

Introduction to SweLL project

Challenges of learner data sharing in the age of GDPR

Pseudonymization of the data - status and prospects (+demo)

Visions & agenda

6 of 65

Introduction to the SweLL project

SweLL

7 of 65

SweLL –

research infrastructure for

Swedish as a Second Language

Swedish Learner Language

https://spraakbanken.gu.se/en/projects/swell

(2017-2020)

8 of 65

SweLL promises (main)

Deliver a well-annotated (gold standard) corpus of L2 essays

600 essays at different levels of proficiency
Incl manual correction annotation (& manually checked linguistic annotation)
Make available for research (and public?)

→

9 of 65

SweLL promises (main)

Set a platform (and workflow) for

Continuous collection of new essays
Manual correction annotation
Automatic linguistic annotation

→

10 of 65

SweLL promises (main)

Set a platform for browsing L2 essays

in concordance fashion (+parallel view)
In full text fashion

https://spraakbanken.gu.se/en/projects/swell/l2korp

11 of 65

BIG GOAL

… to empower second language learners / immigrants / teachers

… to support Second Language Acquisition (SLA) research

… to establish the field of “Computational SLA” (in Sweden) and make it attractive (for researchers, funders, PhD students, etc)

… to make a change!

and IT ALL STARTS WITH...

12 of 65

Challenges of learner data collection and availability

challenges

13 of 65

	L1	gender	L2 level	other langs	educ. backgr	L2 country residence	Task info	…
SLA research	x	x	x	x	x	x	x	x
L1 identification	x	?	x	?	x		x
Essay grading/assess.	x	?	x	?	x		x
Writing support tools	x	?	x	?	x		x
Error detect/corr	x	?	x	?	?		?
Lexical profiling	?		x				x
Grammar profiling	?		x				x
Pseudonymization	x	x	x	x	x	x	x	x
Bias detection	x	x	x	?	x	x	x	x
…etc

Diagram labels: NLP/ICALL; ICALL; theory building; methodological issues

14 of 65

Risks of sharing personal data

Personal data is the most valuable commodity of the 21st century

We need to protect those who share their personal data for research (data subjects) -- out of ethics

We are obliged to protect our subjects -- by law

15 of 65

General Data Protection Regulation (GDPR)

Restricts use of digital data containing personal information

Hands back ownership of personal data to data subjects

Imposes data protection by design and by default (Art 25)

16 of 65

Personal data is...

...any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an on-line identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

GDPR, Article 4

17 of 65

GDPR

Anonymous data is outside the scope of GDPR (Recital 26)

No data is truly anonymous (e.g. Rocher et al. 2019)

Pseudonymization is recognized as a way to reduce risks of re-identification of data subjects (Recital 28)

What are we supposed to pseudonymize?

18 of 65

Anonymization vs pseudonymization

https://www.linkedin.com/pulse/anonymization-does-work-big-data-due-lack-protection-direct-lafever

19 of 65

20 of 65

Sweeney (2000)

http://ggs685.pbworks.com/w/file/fetch/94376315/Latanya.pdf

87% of US population could be identified uniquely

→
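A minimal sketch of the idea behind Sweeney's finding, which was reported for the combination of ZIP code, gender and date of birth: count how many records become unique once a few harmless-looking attributes are combined. The function name `unique_fraction`, the attribute names (borrowed from the metadata table earlier in the deck) and the toy records are illustrative assumptions, not real learner data.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Share of records whose combination of `keys` values occurs exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)

# Invented toy metadata (hypothetical learners, not taken from any corpus).
learners = [
    {"L1": "Polish", "gender": "F", "L2_level": "B1"},
    {"L1": "Polish", "gender": "F", "L2_level": "B1"},
    {"L1": "Arabic", "gender": "M", "L2_level": "A2"},
    {"L1": "Arabic", "gender": "M", "L2_level": "B1"},
]

print(unique_fraction(learners, ["L1"]))                        # one attribute
print(unique_fraction(learners, ["L1", "gender", "L2_level"]))  # three attributes
```

In the toy data no learner is unique by L1 alone, while half of them are unique once all three attributes are combined. The same effect is why each extra metadata column raises the re-identification risk.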

21 of 65

À la Sweeney (2000)?

http://ggs685.pbworks.com/w/file/fetch/94376315/Latanya.pdf

Learner corpus: text information + personal metadata + task metadata

Other sources: public statistics/records + research data + accidental context

PII-1, PII-2, ... → Can we identify a person?

PII → Personally Identifiable Information

→

22 of 65

23 of 65

Pseudonymization

pseudonymization

24 of 65

Data protection (Danezis et al. 2014)

https://www.enisa.europa.eu/publications/privacy-and-data-protection-by-desig

Encryption

Data minimization

Authentication

Anonymization

Pseudonymization

Privacy Preserving Data Mining

De-identification

...

25 of 65

Pseudonymization

detection & neutralization of personally identifiable information (PII)

predominantly manually handled

automatic approaches are NER-based or rule-based

most studies within medical domain

26 of 65

Pseudonymization of learner corpora

No proper guidelines on pseudonymization

Lack of systematic studies

27 of 65

Stemle et al. (2019)

https://bia.unibz.it/bitstream/handle/10863/13346/stemle_etal-postprint.pdf?sequence=2&isAllowed=y

(Section 2.2)

28 of 65

Stemle et al. (2019)

CzeSL (Rosen 2017):

all names → Adam, Eva or Sin (+ keeping suffixes)

geo → village<priv>

CroLTec (Preradovic et al. 2015):

hard-coded; risky passages deleted

ASK (Tenfjord et al. 2006):

@name, @place, @something

… etc.

https://bia.unibz.it/bitstream/handle/10863/13346/stemle_etal-postprint.pdf?sequence=2&isAllowed=y

(Section 2.2)

29 of 65

Megyesi et al. (2018)

SweLL approach (principles for manual pseudonymization)

http://www.ep.liu.se/ecp/152/006/ecp18152006.pdf

30 of 65

Principles (domain-specific, of course)

Clear mark-up of pseudonymized segments

15 head categories, 40 subcategories, morph. markers

names, geo names, institutions, transportation, age, date, miscellaneous, mark/sensitive

Guidelines: https://spraakbanken.github.io/swell-project/Anonymization_guidelines

31 of 65

Categories (domain-specific, of course)

32 of 65

SVALA - SweLL annotation tool

Parallel text
Visualized diff
Semi-automatic word alignment
Annotation on source–target links

Dan Rosén,

research engineer

Arild Matsson,

research engineer

Samir Ali Mohammed,

systems developer

33 of 65

34 of 65

Volodina et al. (2020)

SweLL automatic pseudonymizer service in SVALA

Why automate?

To speed up annotation work
To boost essay collection
GDPR
Ethical reasons
Ultimate goal - online collection of essays

35 of 65

SVALA - pseudonymizer: facts

Rule-based vs Rules+POS

regular expressions
minor spelling correction (Levenshtein distance)

Python
IN: text; OUT: JSON / text

Three steps:

detection
placeholder labelling
pseudonymization

Git Repo: https://github.com/SamirYousuf/LR_project
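The "minor spelling correction" step can be illustrated as follows: a token counts as a city mention if it lies within a small edit distance of a gazetteer entry. This is a sketch under assumptions, not the code from the repository above. The `CITIES` list, the `match_city` helper and the threshold of one edit are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

# Hypothetical gazetteer; the word lists of the real tool are not shown in the slides.
CITIES = ["Stockholm", "Gothenburg", "Odense", "Luxembourg"]

def match_city(token: str, max_dist: int = 1):
    """Return the closest gazetteer entry within `max_dist` edits, else None."""
    scored = [(levenshtein(token.lower(), city.lower()), city) for city in CITIES]
    dist, best = min(scored)
    return best if dist <= max_dist else None

print(match_city("Stockhulm"))   # misspelled city from the example essay
print(match_city("apartement"))  # ordinary misspelled word, no city nearby
```

Comparing in lower case also catches non-capitalized mentions. The threshold is a trade-off: a larger `max_dist` recovers more misspellings but starts matching ordinary words.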

36 of 65

Pseudonymizer demo

DEMO

37 of 65

SVALA pseudonymizer – hands-on demo

https://spraakbanken.gu.se/swell/dev/

38 of 65

Example essay (English translation with mock learner errors)

I live in Stockholm on apartement . Jag är 29 år gammal . I live with my boyfriend . His name is Cezary . Apartement mine has a pattio and tree room . I enjoy there in Stockhulm but a lot of time to goto shop , fortifive minut . I have the buss and the Stockholm train . I lived in Danmark bifore , in Odense . It was less than Stockholm . I enjoy their too becaus I had more friends . I think it is hard to have friends here . But I enjoy better job here . In Odense jobbe I only on one website . In Stockholm I work on many website . I am webdevelooper . But Stockholm is closser to Luxembourg than Odense . It is important how one lives because I am not in my country . I mess my mother and my father but I live her with my boyfriend .

https://tinyurl.com/y3m8uqjs (Slides 38 + 39)

39 of 65

A link to this presentation

https://tinyurl.com/y3m8uqjs

40 of 65

How does it work?

KNOW-HOW

41 of 65

ORIGINAL

I live in Stockholm on apartement . I am 29 years old . I live with my boyfriend . His name is Cezary .

DETECTION

I live in Stockholm on apartement . Jag är 29 år gammal . I live with my boyfriend . His name is Cezary .

LABELING

I live in @city on apartement . Jag är @age år gammal . I live with my boyfriend . His name is @name .

PSEUDONYMIZATION

I live in Gothenburg on apartement . Jag är 31 år gammal . I live with my boyfriend . His name is Johan .
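The stages above can be sketched as three small functions. This is an illustrative toy, not the SVALA implementation: the gazetteer, the surrogate lists, the age rule (a number followed by "år" or "years") and all function names are assumptions made for this example.

```python
import random
import re

# Hypothetical resources standing in for the tool's rules and word lists.
GAZETTEER = {"Stockholm": "city", "Odense": "city", "Cezary": "name"}
SURROGATES = {"city": ["Gothenburg", "Malmö", "Uppsala"], "name": ["Johan", "Erik"]}
AGE_UNITS = {"år", "years"}
NUMBER = re.compile(r"[0-9]{1,3}")

def detect(tokens):
    """Step 1: indices of tokens that look like PII."""
    hits = []
    for i, tok in enumerate(tokens):
        is_age = (NUMBER.fullmatch(tok) is not None
                  and i + 1 < len(tokens) and tokens[i + 1] in AGE_UNITS)
        if tok in GAZETTEER or is_age:
            hits.append(i)
    return hits

def label(tokens, hits):
    """Step 2: map each detected index to a placeholder category."""
    return {i: GAZETTEER.get(tokens[i], "age") for i in hits}

def placeholders(tokens, labels):
    """Render the labelled text, e.g. 'Stockholm' -> '@city'."""
    return ["@" + labels[i] if i in labels else t for i, t in enumerate(tokens)]

def pseudonymize(tokens, labels, rng):
    """Step 3: swap each labelled token for a surrogate of the same category."""
    out = list(tokens)
    for i, lab in labels.items():
        if lab == "age":
            out[i] = str(int(tokens[i]) + rng.choice([-2, -1, 1, 2]))
        else:
            out[i] = rng.choice([s for s in SURROGATES[lab] if s != tokens[i]])
    return out

tokens = "I live in Stockholm on apartement . Jag är 29 år gammal . His name is Cezary .".split()
labels = label(tokens, detect(tokens))
print(" ".join(placeholders(tokens, labels)))
print(" ".join(pseudonymize(tokens, labels, random.Random(0))))
```

Keeping the intermediate placeholder form separate from the final surrogate text means a human can inspect or correct the labels before any replacement is made.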

42 of 65

Data for evaluation


TOTAL: 285 essays, ≈55,000 tokens, several levels / genres / topics

43 of 65

Results

44 of 65

Accuracy

89% average, but
low accuracy for surname, place, date_digits

Depends on

lack of capitalization
misspellings
ambiguity (can belong to several domains)
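An average figure can hide weak categories, so a per-label breakdown is useful. The sketch below shows one way to compute it. The slides do not state how accuracy was defined, so the definition used here (share of gold PII tokens that received the correct label) is an assumption, and the toy labels are invented.

```python
from collections import defaultdict

def per_label_accuracy(gold, predicted):
    """Accuracy per gold label; `None` means 'not PII' and is skipped as gold."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, predicted):
        if g is None:
            continue
        total[g] += 1
        correct[g] += int(g == p)
    return {lab: correct[lab] / total[lab] for lab in total}

def macro_average(scores):
    """Unweighted mean over labels, so rare labels count as much as frequent ones."""
    return sum(scores.values()) / len(scores)

# Invented toy annotations, one entry per token.
gold      = ["city", "city",    "surname", "surname", "firstname", None]
predicted = ["city", "country", "surname", None,      "firstname", None]

scores = per_label_accuracy(gold, predicted)
print(scores)
print(round(macro_average(scores), 3))
```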

45 of 65

Topic/genre specifics

Evaluative texts / Investigative texts:

film/book reviews
response to an article

Switch off pseudonymization?
“Listen” to personal pronouns, e.g. In Vietnam, we... ?

46 of 65

Tag statistics

Most used → city, country, firstname, year

Most confused → city-country, place-city/country

47 of 65

Detection - (non-)capitalizing

48 of 65

Detection - misspellings

49 of 65

Detection - heavily unstructured PIIs

50 of 65

Detection (sensitive)

51 of 65

Detection - heavily unstructured PIIs

52 of 65

Labeling - ambiguity

53 of 65

Pseudonymizing - linguistic constraints

54 of 65

Pseudonymizing - grammar constraints

I had vacation on Cuba → I had vacation on Portugal

55 of 65

Pseudonymizing - projecting errors?

Stockhulm → Gothinburg?

56 of 65

Pseudonymizing - projecting grammar features?

I ate at Frank’s house → I ate at Harry house

(Think Swedish: at Lars house → At Sven house)
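One possible answer to the question of projecting surface features is to copy the possessive marker and the casing of the original token onto the surrogate. The helper below is a hypothetical sketch that only handles an explicit apostrophe-s suffix. It does not solve the Swedish case in the parenthesis, where a genitive without an apostrophe has to be inferred from context.

```python
def project_form(original: str, surrogate: str) -> str:
    """Carry over a possessive suffix and the casing of `original` to `surrogate`."""
    suffix = ""
    for marker in ("'s", "’s"):
        if original.lower().endswith(marker):
            suffix = original[-len(marker):]  # keep the suffix exactly as written
            break
    if original.islower():
        surrogate = surrogate.lower()         # learner did not capitalize
    elif original.isupper():
        surrogate = surrogate.upper()
    return surrogate + suffix

print(project_form("Frank’s", "Harry"))
print(project_form("stockholm", "Gothenburg"))
```

Whether such projection is desirable is exactly the open question raised here: it keeps the learner's grammar and capitalization behaviour intact, but it also means the surrogate text is no longer "clean".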

57 of 65

How to evaluate the pseudonymization step?

Should it read as an original?
Should it preserve learner errors?
Should it avoid adding new errors?

58 of 65

Visions and agenda

Visions

Agenda

59 of 65

Agenda

Can we agree on the categories in the community?

Can we agree on the procedure -- detection - labeling - pseudonymization -- and convertibility of formats?

Shared tasks -- detection, labeling, pseudonymization, readability, sensitive markup, de-identification, risk assessment, etc.

How to evaluate the pseudonymization step?

60 of 65

Data for shared tasks

How much do we need?

size of the data sets

How to ensure access?

reduce personal metadata?
only task metadata?

How to “manipulate” the original data so that we (legally) can use it for shared tasks?

keep original strings, but randomly insert them into different essays?
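The last idea could look roughly like this: pool all annotated PII strings per label and redistribute them at random over the same labelled slots in the whole collection. This is a sketch under assumptions. The data layout (tokens as `(text, label)` pairs) and the function name `shuffle_pii` are hypothetical, and a plain shuffle does not guarantee that a string leaves its original essay. Whether the result is legally sufficient remains the open question asked above.

```python
import random
from collections import defaultdict

def shuffle_pii(essays, rng):
    """Permute PII strings across essays, separately for each label.

    `essays` is a list of essays; each essay is a list of (text, label)
    pairs, with label None for non-PII tokens. Returns new essays and
    leaves the input untouched.
    """
    pools = defaultdict(list)
    for essay in essays:
        for text, lab in essay:
            if lab is not None:
                pools[lab].append(text)
    for pool in pools.values():
        rng.shuffle(pool)

    shuffled = []
    for essay in essays:
        new_essay = []
        for text, lab in essay:
            if lab is None:
                new_essay.append((text, lab))
            else:
                new_essay.append((pools[lab].pop(), lab))  # draw from the shuffled pool
        shuffled.append(new_essay)
    return shuffled

essays = [
    [("I", None), ("live", None), ("in", None), ("Stockholm", "city")],
    [("I", None), ("lived", None), ("in", None), ("Odense", "city"), ("Cezary", "name")],
]
for essay in shuffle_pii(essays, random.Random(1)):
    print(" ".join(text for text, _ in essay))
```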

61 of 65

Datasets (e.g. Medlock 2006)

	Tokens (M)	Documents (M)	Tokens (SweLL)	Documents (SweLL)
Training set	666,138
Development/ validation set	6,026
Test/evaluation/ holdout set	31,926
Total	704,090	2,500 emails	211,563	668 essays

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.380.8037&rep=rep1&type=pdf

62 of 65

Visions: Pseudonymization on the fly

Imagine a scenario:

Write essays online

Automatic pseudo

Correction of pseudo-suggestions

Automatic analysis of an essay

Corr-reports to improve pseudo-algorithm

Online consent

Online metadata

Upload essay to a database

63 of 65

Informed consent

Personal metadata

Account + ID

Essay

Automatic pseudo

Manual correction

Approval of pseudo

Essay analysis

Add to a corpus

64 of 65

Informed consent

Personal metadata

Account + ID

Essay

Essay analysis

Add to a corpus

Automatic pseudo

Manual correction

Approval of pseudo

Reports with

Orig string:

Auto-label

(if any)

Corrected label

(if any)

We may want to look into the full context, too, though
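A report of this kind could be stored as a small record and aggregated to show which labels the automatic step confuses most often. The class and field names below are assumptions derived from the three report items above, and the sample reports are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CorrectionReport:
    orig_string: str
    auto_label: Optional[str]       # None: the automatic step missed the string
    corrected_label: Optional[str]  # None: the human rejected a false alarm

def confusion_counts(reports):
    """Count (auto_label, corrected_label) pairs where the human changed the label."""
    return Counter((r.auto_label, r.corrected_label)
                   for r in reports if r.auto_label != r.corrected_label)

# Invented sample reports.
reports = [
    CorrectionReport("Odense", "country", "city"),
    CorrectionReport("Danmark", None, "country"),
    CorrectionReport("Stockholm", "city", "city"),
    CorrectionReport("Luxembourg", "city", "country"),
]

for (auto, corrected), n in confusion_counts(reports).most_common():
    print(auto, "->", corrected, n)
```

Such a record holds only the isolated string. Adding the surrounding context, as the note above suggests, would make the reports more informative but would also move more personal data into the feedback channel.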

65 of 65

Thank you!

Link to the presentation:

https://tinyurl.com/y3m8uqjs