1 of 65

Pseudonymization of learner corpora

Elena Volodina

Språkbanken Text, University of Gothenburg, Sweden

2 of 65

Why learner data?

SLA research

L1 identification

Essay grading/assess.

Writing support tools

Error detect/corr

Lexical profiling

Grammar profiling

Pseudonymization

Bias detection

…etc

methodological issues

NLP/ICALL

ICALL

theory building

3 of 65

Why personal data?

Metadata: L1, gender, L2 level, other langs, educ. backgr, L2 country residence, Task info

SLA research: x x x x x x x x
L1 identification: x ? x ? x x
Essay grading/assess.: x ? x ? x x
Writing support tools: x ? x ? x x
Error detect/corr: x ? x ? ? ?
Lexical profiling: ? x x
Grammar profiling: ? x x
Pseudonymization: x x x x x x x x
Bias detection: x x x ? x x x x
…etc

Task groups: NLP/ICALL, ICALL, theory building, methodological issues

4 of 65

Minimizing risks

(Same table of tasks and metadata types as on slide 3.)

5 of 65

Outline

  • Introduction to the SweLL project
  • Challenges of learner data sharing in the age of GDPR
  • Pseudonymization of the data: status and prospects (+ demo)
  • Visions & agenda

6 of 65

Introduction to the SweLL project

SweLL

7 of 65

SweLL – research infrastructure for Swedish as a Second Language

Swedish Learner Language

https://spraakbanken.gu.se/en/projects/swell

(2017-2020)

8 of 65

SweLL promises (main)

  1. Deliver a well-annotated (gold standard) corpus of L2 essays
      • 600 essays at different levels of proficiency
      • Incl. manual correction annotation (& manually checked linguistic annotation)
      • Make available for research (and the public?)

9 of 65

SweLL promises (main)

  2. Set up a platform (and workflow) for
      • Continuous collection of new essays
      • Manual correction annotation
      • Automatic linguistic annotation

10 of 65

SweLL promises (main)

  • Set up a platform for browsing L2 essays
      • in concordance fashion (+ parallel view)
      • in full-text fashion

https://spraakbanken.gu.se/en/projects/swell/l2korp

11 of 65

BIG GOAL

  • … to empower second language learners / immigrants / teachers

  • … to support Second Language Acquisition (SLA) research

  • … to establish the field of “Computational SLA” (in Sweden) and make it attractive (for researchers, funders, PhD students, etc.)

  • … to make a change!

and IT ALL STARTS WITH...

12 of 65

Challenges of learner data collection and availability

challenges

13 of 65

(Same table of tasks and metadata types as on slide 3.)

14 of 65

Risks of sharing personal data

  • Personal data is the most valuable commodity of the 21st century

  • We need to protect those who share their personal data for research (data subjects) -- out of ethics

  • We are obliged to protect our subjects -- by law

15 of 65

General Data Protection Regulation (GDPR)

  • Restricts use of digital data containing personal information

  • Hands back ownership of personal data to data subjects

  • Imposes data protection by design and by default (Art 25)

16 of 65

Personal data is...

...any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

GDPR, Article 4(1)

17 of 65

GDPR

  • Anonymous data is outside the scope of GDPR (Recital 26)

    • No data is truly anonymous (e.g. Rocher et al. 2019)

  • Pseudonymization is recognized as a way to reduce risks of re-identification of data subjects (Recital 28)

    • What are we supposed to pseudonymize?

18 of 65

Anonymization vs pseudonymization

https://www.linkedin.com/pulse/anonymization-does-work-big-data-due-lack-protection-direct-lafever

19 of 65

20 of 65

Sweeney (2000)

http://ggs685.pbworks.com/w/file/fetch/94376315/Latanya.pdf

87% of the US population could be uniquely identified from ZIP code, gender, and date of birth
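
What "uniquely identified" means here can be illustrated with a toy count. The sketch below is only an illustration: the records and the `unique_fraction` helper are invented for this example and are not Sweeney's data or method.

```python
from collections import Counter

# Hypothetical toy records: (ZIP code, gender, date of birth).
records = [
    ("41255", "F", "1990-03-14"),
    ("41255", "F", "1990-03-14"),
    ("41255", "M", "1985-11-02"),
    ("41301", "F", "1978-06-30"),
    ("41301", "M", "1978-06-30"),
]


def unique_fraction(rows):
    """Share of rows whose combination of attributes occurs exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


print(f"{unique_fraction(records):.0%} of the toy records are unique")
```

No single field identifies anyone in this toy set; it is the combination that singles out three of the five records.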

21 of 65

À la Sweeney (2000)?

http://ggs685.pbworks.com/w/file/fetch/94376315/Latanya.pdf

Learner corpus: text information + personal metadata + task metadata

Other sources: public statistics/records + research data + accidental context

PII-1, PII-2, ... → Can we identify a person?

PII → Personally Identifiable Information

22 of 65

23 of 65

Pseudonymization

pseudonymization

24 of 65

Data protection (Danezis et al. 2014)

https://www.enisa.europa.eu/publications/privacy-and-data-protection-by-desig

  • Encryption
  • Data minimization
  • Authentication

  • Anonymization
  • Pseudonymization
  • Privacy Preserving Data Mining
  • De-identification
  • ...

25 of 65

Pseudonymization

  • detection & neutralization of personally identifiable information (PII)

  • predominantly manually handled

  • automatic approaches are NER-based or rule-based

  • most studies are in the medical domain

26 of 65

Pseudonymization of learner corpora

No proper guidelines on pseudonymization

Lack of systematic studies

27 of 65

Stemle et al. (2019)

28 of 65

Stemle et al. (2019)

  • CzeSL (Rosen 2017):
    • all names → Adam, Eva or Sin (+ keeping suffixes)
    • geo → village<priv>

  • CroLTec (Preradovic et al. 2015):
    • hard-coded; risky passages deleted

  • ASK (Tenfjord et al. 2006):
    • @name, @place, @something

  • … etc.

29 of 65

Megyesi et al. (2018)

SweLL approach (principles for manual pseudonymization)

30 of 65

Principles (domain-specific, of course)

Clear mark-up of pseudonymized segments

15 head categories, 40 subcategories, morph. markers

    • names, geo names, institutions, transportation, age, date, miscellaneous, mark/sensitive

31 of 65

Categories (domain-specific, of course)

32 of 65

SVALA - SweLL annotation tool

  • Parallel text
  • Visualized diff
  • Semi-automatic word alignment
  • Annotation on source–target links

Dan Rosén,

research engineer

Arild Matsson,

research engineer

Samir Ali Mohammed,

systems developer

33 of 65

34 of 65

Volodina et al. (2020)

SweLL automatic pseudonymizer service in SVALA

Why automate?

    • To speed up annotation work
    • To boost essay collection
    • GDPR
    • Ethical reasons
    • Ultimate goal - online collection of essays

35 of 65

SVALA - pseudonymizer: facts

  • Rule-based vs Rules+POS
    • regular expressions
    • minor spelling correction (Levenshtein distance)
  • Python
  • IN: text; OUT: JSON / text

  • Three steps:
    • detection
    • placeholder labelling
    • pseudonymization
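
The three steps can be sketched roughly as follows. This is not the SVALA code: the tiny gazetteers, the label names, the replacement values, and the age shift of two years are all assumptions made for illustration.

```python
import json
import re

# Hypothetical mini-gazetteers; a real system would need far larger lists.
CITIES = {"Stockholm", "Odense", "Gothenburg"}
FIRST_NAMES = {"Cezary", "Johan"}

# Hypothetical replacement values, one per label.
REPLACEMENTS = {"city": "Gothenburg", "firstname": "Johan"}

TOKEN = re.compile(r"\w+|[^\w\s]")
AGE = re.compile(r"^\d{1,3}$")


def detect_and_label(tokens):
    """Steps 1-2: find candidate PII tokens and give each a placeholder label."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok in CITIES:
            spans.append({"index": i, "text": tok, "label": "city"})
        elif tok in FIRST_NAMES:
            spans.append({"index": i, "text": tok, "label": "firstname"})
        elif AGE.match(tok) and tokens[i + 1 : i + 2] == ["years"]:
            spans.append({"index": i, "text": tok, "label": "age"})
    return spans


def pseudonymize(tokens, spans):
    """Step 3: swap each labelled token for a replacement of the same type."""
    out = list(tokens)
    for span in spans:
        if span["label"] == "age":
            out[span["index"]] = str(int(span["text"]) + 2)  # arbitrary shift
        else:
            out[span["index"]] = REPLACEMENTS[span["label"]]
    return out


text = "I live in Stockholm . I am 29 years old . His name is Cezary ."
tokens = TOKEN.findall(text)
spans = detect_and_label(tokens)
print(json.dumps(spans, indent=2))          # JSON output
print(" ".join(pseudonymize(tokens, spans)))  # text output
```

Keeping detection/labelling separate from replacement means the labelled spans can be inspected or corrected by hand before any text is changed.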

36 of 65

Pseudonymizer demo

DEMO

37 of 65

SVALA pseudonymizer – hands-on demo

38 of 65

Example essay (translated into English, with mock learner errors)

I live in Stockholm on apartement . Jag är 29 år gammal . I live with my boyfriend . His name is Cezary . Apartement mine has a pattio and tree room . I enjoy there in Stockhulm but a lot of time to goto shop , fortifive minut . I have the buss and the Stockholm train . I lived in Danmark bifore , in Odense . It was less than Stockholm . I enjoy their too becaus I had more friends . I think it is hard to have friends here . But I enjoy better job here . In Odense jobbe I only on one website . In Stockholm I work on many website . I am webdevelooper . But Stockholm is closser to Luxembourg than Odense . It is important how one lives because I am not in my country . I mess my mother and my father but I live her with my boyfriend .

https://tinyurl.com/y3m8uqjs (Slides 38 + 39)

39 of 65

A link to this presentation

https://tinyurl.com/y3m8uqjs

40 of 65

How does it work?

KNOW-HOW

41 of 65

ORIGINAL: I live in Stockholm on apartement . I am 29 years old . I live with my boyfriend . His name is Cezary .

DETECTION: I live in Stockholm on apartement . I am 29 years old . I live with my boyfriend . His name is Cezary .

LABELING: I live in @city on apartement . I am @age years old . I live with my boyfriend . His name is @name .

PSEUDONYMIZATION: I live in Gothenburg on apartement . I am 31 years old . I live with my boyfriend . His name is Johan .

42 of 65

Data for evaluation

TOTAL: 285 essays, ≈55,000 tokens, several levels / genres / topics

43 of 65

Results

44 of 65

Accuracy

  • 89% average, but
  • low accuracy for surname, place, date_digits

Depends on

  • lack of capitalization
  • misspellings
  • ambiguity (can belong to several domains)

45 of 65

Topic/genre specifics

Evaluative texts / Investigative texts:

  • film/book reviews
  • response to an article

  • Switch off pseudonymization?
  • “Listen” to personal pronouns, e.g. In Vietnam, we... ?
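
One way to "listen" to personal pronouns could be a simple cue check per sentence. The sketch below is a toy heuristic under stated assumptions: the English cue list and the `looks_personal` helper are hypothetical, and Swedish essays would need Swedish pronoun forms.

```python
import re

# Hypothetical cue list of first-person forms.
FIRST_PERSON = {"i", "we", "me", "us", "my", "our"}


def looks_personal(sentence):
    """True if the sentence contains a first-person pronoun."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(word in FIRST_PERSON for word in words)


for sentence in ["In Vietnam, we eat a lot of rice.",
                 "The film is set in Vietnam."]:
    action = "pseudonymize" if looks_personal(sentence) else "leave as is"
    print(f"{action}: {sentence}")
```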

46 of 65

Tag statistics

  • Most used → city, country, firstname, year

  • Most confused → city-country, place-city/country
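
Per-tag accuracy and confusion counts of this kind can be computed from pairs of gold and predicted labels. The pairs below are invented for illustration and are not the SweLL evaluation data.

```python
from collections import Counter

# Hypothetical (gold label, predicted label) pairs for detected tokens.
pairs = [
    ("city", "city"), ("city", "country"), ("country", "country"),
    ("place", "city"), ("firstname", "firstname"), ("year", "year"),
    ("city", "city"), ("place", "country"),
]


def per_label_accuracy(pairs):
    """Fraction of correctly predicted tokens for each gold label."""
    total, correct = Counter(), Counter()
    for gold, pred in pairs:
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}


def confusions(pairs):
    """Count each (gold, predicted) pair where the labels differ."""
    return Counter((gold, pred) for gold, pred in pairs if gold != pred)


acc = per_label_accuracy(pairs)
for label, value in sorted(acc.items()):
    print(f"{label:10s} {value:.2f}")
print(confusions(pairs).most_common())
```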

47 of 65

Detection - (non-)capitalizing

48 of 65

Detection - misspellings
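
A misspelled or non-capitalized name can still be matched against a word list if a small edit distance is tolerated. The sketch below assumes a hypothetical three-city gazetteer and a threshold of one edit; the actual lists and thresholds used in SVALA are not given here.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


CITIES = ["Stockholm", "Odense", "Luxembourg"]  # hypothetical gazetteer


def fuzzy_city(token, max_distance=1):
    """Return the closest gazetteer entry within max_distance, else None."""
    best = min(CITIES, key=lambda city: levenshtein(token.lower(), city.lower()))
    if levenshtein(token.lower(), best.lower()) <= max_distance:
        return best
    return None


for token in ["Stockhulm", "stockholm", "Danmark"]:
    print(token, "->", fuzzy_city(token))
```

A larger threshold catches more misspellings but also raises the risk of matching ordinary words, so the value of `max_distance` is a trade-off rather than a given.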

49 of 65

Detection - heavily unstructured PIIs

50 of 65

Detection (sensitive)

51 of 65

Detection - heavily unstructured PIIs

52 of 65

Labeling - ambiguity

53 of 65

Pseudonymizing - linguistic constraints

54 of 65

Pseudonymizing - grammar constraints

I had vacation on Cuba → I had vacation on Portugal

55 of 65

Pseudonymizing - projecting errors?

Stockhulm → Gothinburg?

56 of 65

Pseudonymizing - projecting grammar features?

I ate at Frank’s house → I ate at Harry house

(Think Swedish: at Lars house → At Sven house)
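
For the English case, carrying the genitive marker over to the replacement name could look like the sketch below. The `project_genitive` helper is hypothetical. The Swedish case is harder, since the genitive -s has no apostrophe and cannot be told apart from a name-final s (as in Lars) by a pattern like this.

```python
import re

# Matches a name followed by an English genitive 's (straight or curly apostrophe).
GENITIVE = re.compile(r"^(?P<stem>.+?)(?P<suffix>['’]s)$")


def project_genitive(original, replacement):
    """Copy an English genitive 's from the original name onto the replacement."""
    match = GENITIVE.match(original)
    if match:
        return replacement + match.group("suffix")
    return replacement


print(project_genitive("Frank’s", "Harry"))
print(project_genitive("Frank", "Harry"))
```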

57 of 65

How to evaluate the pseudonymization step?

  • Should it read as an original?
  • Should it preserve learner errors?
  • Should it avoid adding new errors?

58 of 65

Visions and agenda

Visions

Agenda

59 of 65

Agenda

  • Can we agree on the categories in the community?

  • Can we agree on the procedure -- detection - labeling - pseudonymization -- and convertibility of formats?

  • Shared tasks -- detection, labeling, pseudonymization, readability, sensitive markup, de-identification, risk assessment, etc.

  • How to evaluate the pseudonymization step?

60 of 65

Data for shared tasks

  • How much do we need?
    • size of the data sets

  • How to ensure access?
    • reduce personal metadata?
    • only task metadata?

  • How to “manipulate” the original data so that we (legally) can use it for shared tasks?
    • keep original strings, but randomly insert them into different essays?
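
The last idea, keeping original strings but moving them between essays, could be sketched as below. The essay structure, the labels, and the `shuffle_pii` helper are assumptions for illustration; whether such shuffling is legally sufficient is exactly the open question on this slide.

```python
import random

# Hypothetical essays, already tokenized, with (token index, label) PII spans.
essays = [
    {"tokens": ["I", "live", "in", "Stockholm"], "pii": [(3, "city")]},
    {"tokens": ["I", "lived", "in", "Odense"], "pii": [(3, "city")]},
    {"tokens": ["He", "is", "from", "Luxembourg"], "pii": [(3, "country")]},
    {"tokens": ["She", "is", "from", "Danmark"], "pii": [(3, "country")]},
]


def shuffle_pii(essays, seed=0):
    """Redistribute original PII strings among slots that carry the same label."""
    rng = random.Random(seed)
    by_label = {}
    for essay in essays:
        for index, label in essay["pii"]:
            by_label.setdefault(label, []).append(essay["tokens"][index])
    for strings in by_label.values():
        rng.shuffle(strings)
    result = []
    for essay in essays:
        tokens = list(essay["tokens"])  # copy, so the input stays unchanged
        for index, label in essay["pii"]:
            tokens[index] = by_label[label].pop()
        result.append({"tokens": tokens, "pii": list(essay["pii"])})
    return result


for essay in shuffle_pii(essays, seed=1):
    print(" ".join(essay["tokens"]))
```

A plain shuffle can leave a string in its original essay, so a real procedure would probably need an extra check for that.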

61 of 65

Datasets (e.g. Medlock 2006)

                               Tokens (M)   Documents (M)   Tokens (SweLL)   Documents (SweLL)
Training set                   666,138
Development/validation set     6,026
Test/evaluation/holdout set    31,926
Total                          704,090      2,500 emails    211,563          668 essays

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.380.8037&rep=rep1&type=pdf
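
To get a feel for the proportions, the Medlock (2006) token counts can be turned into split shares. Applying those shares to the 211,563 SweLL tokens is purely a hypothetical calculation; nothing here says SweLL would be split this way.

```python
# Token counts for the Medlock (2006) data set, as listed on this slide.
medlock_tokens = {"training": 666_138, "development": 6_026, "test": 31_926}
total = sum(medlock_tokens.values())  # equals the listed total of 704,090

shares = {name: count / total for name, count in medlock_tokens.items()}

# Hypothetical: apply the same shares to the SweLL token count.
swell_tokens = 211_563
swell_split = {name: round(share * swell_tokens) for name, share in shares.items()}

for name in medlock_tokens:
    print(f"{name:12s} {shares[name]:6.1%}  ~{swell_split[name]:,} SweLL tokens")
```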

62 of 65

Visions: Pseudonymization on the fly

Imagine the scenario:

Write essays online

Automatic pseudo

Correction of pseudo-suggestions

Automatic analysis of an essay

Corr-reports to improve pseudo-algorithm

Online consent

Online metadata

Upload essay to a database

63 of 65

Informed consent

Personal metadata

Account + ID

Essay

Automatic pseudo

Manual correction

Approval of pseudo

Essay analysis

Add to a corpus

64 of 65

Informed consent

Personal metadata

Account + ID

Essay

Essay analysis

Add to a corpus

Automatic pseudo

Manual correction

Approval of pseudo

Reports with

  • Orig string:
    • Auto-label (if any)
    • Corrected label (if any)

  • We may want to look into the full context, too, though
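
Such a report could be a small record per corrected string. The sketch below is one possible shape: the `PseudoReport` class, its field names, and the four outcome names are hypothetical and not part of any existing SVALA format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PseudoReport:
    """One report entry; field names are hypothetical."""
    orig_string: str
    auto_label: Optional[str] = None       # label from the automatic step, if any
    corrected_label: Optional[str] = None  # label after manual correction, if any

    def kind(self) -> str:
        """Classify the entry by comparing automatic and corrected labels."""
        if self.auto_label == self.corrected_label:
            return "confirmed"
        if self.auto_label is None:
            return "missed"
        if self.corrected_label is None:
            return "false alarm"
        return "relabelled"


reports = [
    PseudoReport("Stockhulm", None, "city"),
    PseudoReport("Odense", "city", "city"),
    PseudoReport("Danmark", "city", "country"),
    PseudoReport("Apartement", "place", None),
]
for report in reports:
    print(report.kind(), json.dumps(asdict(report)))
```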

65 of 65

Thank you!

Link to the presentation:

https://tinyurl.com/y3m8uqjs