1 of 45

Expert judgments versus crowdsourcing in ordering multi-word expressions

David Alfter (University of Gothenburg), Therese Lindström Tiedemann (University of Helsinki) and Elena Volodina (University of Gothenburg)

2 of 45

Problem definition (1)

We need:

more data

to validate data

to annotate data

...

money to employ annotators

… who are (often) experts in ...

… who (often) are too busy

… whose annotations (often) lack consistency

...

3 of 45

Is there an alternative to expert annotations?

4 of 45

Crowdsourcing as an alternative?

  • What do we know about it?
    • open call, non-experts, cheap labor, easy to get, etc. → A myth?

  • Do non-experts REALLY perform as well as experts?
  • Are crowdsourced annotations REALLY reliable?

  • Does it REALLY take less time? or money?

5 of 45

Object of study

  • Content-wise:
    • multi-word expressions (MWEs)
    • from a second language (L2) learning point of view
    • coursebook texts as a source of MWEs
  • Design-wise:
    • comparative judgment setting: BEST-WORST SCALING algorithm
    • expert annotations (traditional approach)

6 of 45

Problem definition (2)

(to tell the truth, it all starts with very applied needs)

  • Need to validate target level projections from course books

Coursebooks

7 of 45

Problem definition (2)

(to tell the truth, it all starts with very applied needs)

  • Need to find a method to “stratify” unseen vocabulary into levels appropriate for learners at different stages of linguistic proficiency
    • language independent and cheap approach

8 of 45

English Vocabulary Profile

  • Lexicographers' work
  • Language testers
  • Large learner corpus (exam essays)

9 of 45

So we want ...

  • to validate crowdsourcing as a method to support graded list creation for any language (in L2 context)

RQ1: Who can be the crowd?

i.e. whether L2 speakers (or a mixed group) can be used for the task

RQ2: Can we elicit reliable annotations with crowdsourcing?

i.e. which method works well

10 of 45

Design: participants

  • Non-experts: L2 speakers / L2 learners (27) → 3 crowdsourcing projects
  • Experts: L2 professionals (teachers, assessors, researchers) (23) → 3 crowdsourcing projects
  • CEFR experts (3) → 3 direct annotation projects
  • Others (20) → 3 crowdsourcing projects

11 of 45

Design: multi-word expressions

(COCTAILL data, SPARV-annotated, automatically identified MWEs)

  • Group 1 (Interjections): 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
  • Group 2 (Verbs): 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
  • Group 3 (Adverbs): 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)

12 of 45

Design: comparative vs direct

  • Comparative: best-worst scaling
  • Direct: labeling

13 of 45

Practicalities

  • pyBossa (installation at Språkbanken)
  • professional and personal networks used to recruit participants
  • small gifts to stimulate participation
  • guidelines/instructions
  • 5 votes per item for non-experts
  • 3 votes per item for experts
  • NOTE: we asked participants to use intuitive judgments; no professional reasoning was needed
  • Assumption: perceived difficulty/complexity is aligned with level progression

14 of 45

Crowdsourcing Methodology (Implementation)

15 of 45

Crowdsourcing Methodology

[Figure: a crowdsourcing task presents four items: A, B, C, D]

16 of 45

Crowdsourcing Methodology

[Figure: the four items A, B, C, D shown in two panels, illustrating the best-worst selection]

17 of 45

Crowdsourcing methodology

[Figure: a best-worst judgment over the four items A, B, C, D]

Discovered relations (5 of 6):

B < A
B < D
B < C
A < C
D < C

Unknown relation:

A - D
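To spell out the bookkeeping behind these relations: marking one of the four items as easiest ("best") and one as most difficult ("worst") resolves five of the six item pairs; only the pair of unmarked items remains unknown. A minimal Python sketch of this derivation (the function and data below are illustrative, not the project's actual code):

```python
from itertools import combinations

def relations_from_bws(items, best, worst):
    """Derive the pairwise 'easier than' relations implied by a single
    best-worst judgment: 'best' is easier than every other item and
    'worst' is harder than every other item; all remaining pairs stay
    unresolved by this one task."""
    relations = set()
    for a, b in combinations(items, 2):
        if best not in (a, b) and worst not in (a, b):
            continue  # e.g. the pair A-D on this slide stays unknown
        easier = best if best in (a, b) else (a if b == worst else b)
        harder = worst if worst in (a, b) else (a if b == best else b)
        relations.add((easier, harder))
    return relations

# One task over A-D with B judged easiest and C judged hardest:
# yields the five relations B<A, B<D, B<C, A<C, D<C (in arbitrary set order).
print(relations_from_bws(["A", "B", "C", "D"], best="B", worst="C"))
```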

18 of 45

Redundancy-reducing combinatorial algorithm

Four items per task and 60 expressions

→ 1,770 possible pairwise relations

→ 487,635 possible four-item combinations

Redundancy-reducing combinatorial algorithm

→ 326 tasks
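The counts on this slide follow directly from basic combinatorics over the 60 expressions; a short check using only the Python standard library:

```python
from math import comb

n_items, task_size = 60, 4

pairwise_relations = comb(n_items, 2)      # 1,770 item pairs to be ordered
possible_tasks = comb(n_items, task_size)  # 487,635 distinct four-item tasks

# Each task resolves at most comb(4, 2) = 6 pairs, so covering every pair
# at least once needs no fewer than ceil(1770 / 6) = 295 tasks.
print(pairwise_relations, possible_tasks, -(-pairwise_relations // comb(task_size, 2)))
```

The 326 tasks produced by the redundancy-reducing algorithm are thus not far above the 295-task lower bound for touching every pair at least once.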

19 of 45

Evaluation methodology

  • Linear scale projection
  • Principal component analysis (PCA)
  • Clustering (see the PCA/clustering sketch below)
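The PCA and clustering steps only appear as result plots later in the deck; as a rough, hypothetical sketch of how such an evaluation could be run (random stand-in data, and the 5-cluster choice simply mirrors the five CEFR levels rather than the authors' actual pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical input: one row per MWE, one column per annotator group,
# each cell holding that group's average best-worst position for the item.
rng = np.random.default_rng(0)
scores = rng.uniform(1, 60, size=(60, 4))  # 60 expressions, 4 groups

# PCA: if the groups broadly agree, a single component (perceived difficulty)
# should account for most of the variance.
pca = PCA(n_components=2)
projected = pca.fit_transform(scores)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Clustering the projected items into five groups, loosely mirroring A1-C1.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(projected)
print(labels[:10])
```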

20 of 45

Linear scale projection

[Figure: items B, A, D, C placed on a linear scale with example positions 2, 1, 3, 2]

21 of 45

Linear scale projection

[Figure: items B, A, D, C placed on a linear scale with example positions 2, 1, 3, 2]

Average over all values

E.g. B = (1 + 1 + 3 + 2 + 2 + 1) / 6
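A minimal sketch of this averaging step (the positions for B reuse the slide's example; the other entries and variable names are illustrative):

```python
from statistics import mean

# Per-task positions collected for each expression during crowdsourcing
# (1 = judged easiest within a task). Only B reuses the slide's numbers;
# the other entries are made up.
observed_positions = {
    "B": [1, 1, 3, 2, 2, 1],   # (1 + 1 + 3 + 2 + 2 + 1) / 6 ≈ 1.67
    "A": [2, 2, 1, 3],
    "C": [4, 3, 4, 4],
}

item_scores = {item: mean(pos) for item, pos in observed_positions.items()}
ranking = sorted(item_scores, key=item_scores.get)  # lowest average = easiest

print(item_scores["B"])  # 1.666...
print(ranking)           # ['B', 'A', 'C']
```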

22 of 45

Results

23 of 45

Agreement between groups (Spearman rank coefficient)

24 of 45

Inter-annotator Agreement in the direct ranking experiment → Expert judgment not ideal

25 of 45

Inter-annotator Agreement in the direct ranking experiment → Expert judgment not ideal

  • Expert judgment not ideal
  • Cf. studies on essay evaluation (Díez-Bedmar, 2012; Snow et al., 2008; Carlsen, 2012)
  • De-contextualised
  • Disagreement > tolerance 1 = peripheral items? → Future research

26 of 45

Agreement between Crowdsourced-ranking & Direct ranking by experts (Spearman rank coefficient)

27 of 45

Agreement between Crowdsourced-ranking & Direct ranking by experts (Spearman rank coefficient)

  • Expert 1 is the most consistent.
  • All experts are most consistent for interjections etc. (group 1)
  • Expert scoring is more subjective & cognitively demanding (Lesterhuis et al., 2017)
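The agreement figures on these slides are Spearman rank correlation coefficients between two orderings of the same expressions; computing one is a one-liner with SciPy (the rankings below are made-up toy data, not the study's results):

```python
from scipy.stats import spearmanr

# Toy example: positions 1-6 assigned to the same six expressions by the
# crowdsourced comparative ranking and by one expert's direct labelling.
crowd_rank  = [1, 2, 3, 4, 5, 6]
expert_rank = [2, 1, 3, 5, 4, 6]

rho, p_value = spearmanr(crowd_rank, expert_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```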

28 of 45

Number of votes

29 of 45

Number of votes in crowdsourcing?

  • High correlation even with only 1 annotator (when sub-sampling votes; see the sketch below).
  • Lowest among non-experts
  • Highest with the mixed crowd
  • Better to have 2 voters than only 1, but with a mixed crowd it makes no difference.
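The sub-sampling referred to above can be simulated by re-ranking the items from only k randomly chosen votes per item and correlating the result with the full-vote ranking; a hedged sketch under that assumption (toy data and function names, not the project's code):

```python
import random
from statistics import mean
from scipy.stats import spearmanr

def ranking_from_k_votes(votes_per_item, k, seed=0):
    """Rank items (easiest first) using only k randomly sampled votes per item."""
    rng = random.Random(seed)
    scores = {item: mean(rng.sample(votes, min(k, len(votes))))
              for item, votes in votes_per_item.items()}
    return sorted(scores, key=scores.get)

# Toy vote data: per-task positions recorded for three expressions.
votes = {"mwe_a": [1, 2, 1, 2, 1], "mwe_b": [3, 4, 3, 3, 4], "mwe_c": [2, 2, 3, 2, 2]}

full_ranking = ranking_from_k_votes(votes, k=5)   # all votes
one_vote     = ranking_from_k_votes(votes, k=1)   # sub-sampled to 1 vote per item

rho, _ = spearmanr([full_ranking.index(i) for i in votes],
                   [one_vote.index(i) for i in votes])
print(f"correlation between full and 1-vote ranking: {rho:.2f}")
```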

30 of 45

Discussion

  • Who can be the crowd, with regard to the background of the crowdsourcers?
  • How can reliable annotations be achieved, with regard to design, number of answers and number of contributors?

31 of 45

  • Who can be the crowd?
  • Mixed crowds agree very well in comparative ranking (best-worst scaling).

(cf. previous work with explicit level judgment of essays, e.g. Díez-Bedmar, 2012)

Future research to confirm our findings:

Similar experiments with other languages, other types of problems (e.g. texts), and other sub-problems (e.g. single vocabulary items instead of MWEs)

32 of 45

(2) Reliable annotations: how?

  • The design of the annotation task influences the results.
    • The more traditional method (direct expert judgment) → less reliable results than crowdsourced comparative judgment
    • Experts do not agree with themselves between tasks.
    • Comparative judgment → homogeneous results across all groups of crowdsourcers (expert / non-expert background)

33 of 45

Discussion: number of votes

  • On a linear scale from 1–60: no significant difference with 3 votes instead of only 2 (mixed crowd).
  • Mixed background is the most stable.

34 of 45

Some other lessons

35 of 45

Future work

  • Same method → same results when applied to other “problems” (single words, essays)?
  • How to order unlabelled expressions, e.g. unknown vocabulary?
  • Clustering? Known anchor points?
  • Identification of core and periphery vocabulary based on clustering?

36 of 45

Thank you!

37 of 45

References

Borin, Lars, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, and Anne Schumacher. 2016. Sparv: Språkbanken's corpus annotation pipeline infrastructure. In The Sixth Swedish Language Technology Conference (SLTC), Umeå University, pages 17-18.

Brezina, Vaclav and Dana Gablasova. 2015. Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics 36(1):1-22.

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Press Syndicate of the University of Cambridge.

Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Companion Volume with New Descriptors. www.coe.int/lang-cefr. Accessed 18.06.2020.

Díez-Bedmar, María Belén. 2012. The use of the Common European Framework of Reference for Languages to evaluate compositions in the English exam section of the University Admission Examination. Revista de Educación 357:55-79.

François, Thomas, Nuria Gala, Patrick Watrin, and Cédrick Fairon. 2014. FLELex: a graded lexical resource for French foreign learners. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 3766-3773.

François, Thomas, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 213-219.

38 of 45

References

Kilgarriff, Adam, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen, Saussan Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi, and Elena Volodina. 2014. Corpus-based vocabulary lists for language learners for nine languages. Language Resources and Evaluation 48(1):121-163.

Lesterhuis, Marije, San Verhavert, Liesje Coertjens, Vincent Donche, and Sven De Maeyer. 2017. Comparative judgement as a promising alternative to score competences. In Innovative practices for higher education assessment and measurement, pages 119-138. IGI Global.

Louviere, Jordan J, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.

Tack, Anaïs, Thomas François, Piet Desmet, and Cédrick Fairon. 2018. NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 137-146.

Volodina, Elena, Ildikó Pilán, Stian Rødven Eide, and Hannes Heidarsson. 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. In Proceedings of the third workshop on NLP for computer-assisted language learning, pages 128-144.

West, Michael Philip. 1953. A general service list of English words: with semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longmans, Green.

39 of 45

Principal Component

Interjections

40 of 45

Principal Component

Verbs

41 of 45

Principal Component

Adverbs

42 of 45

Clustering results


43 of 45

Clustering - interjections

44 of 45

Clustering - verbs

45 of 45

Clustering - adverbs