Expert judgments versus crowdsourcing in ordering multi-word expressions
David Alfter (University of Gothenburg), Therese Lindström Tiedemann (University of Helsinki) and Elena Volodina (University of Gothenburg)
Problem definition (1)
We need:
more data
to validate data
to annotate data
...
money to employ annotators
… who are (often) experts in ...
… who (often) are too busy
… whose annotations (often) lack consistency
...
Is there an alternative to expert annotations?
Crowdsourcing as an alternative?
Object of study
Problem definition (2)
(to tell the truth, it all starts from very applied needs)
Coursebooks
Problem definition (2)
(to tell the truth, it all starts from very applied needs)
English Vocabulary Profile
So we want ...
RQ1: Who can be the crowd?
i.e. whether L2 speakers (or a mixed group) can be used for the task
RQ2: Can we elicit reliable annotations with crowdsourcing?
i.e. which method works well
Design: participants
Non-experts: L2 speakers / L2 learners (27)
Experts: L2 professionals (teachers, assessors, researchers) (23); CEFR experts (3)
Others (20)
3 crowdsourcing projects
3 direct annotation projects
Design: multi-word expressions
(COCTAILL data, SPARV-annotated, automatically identified MWEs)
Group 1. Interjections: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Group 2. Verbs: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Group 3. Adverbs: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Design: comparative vs direct
Comparative: best-worst scaling
Direct: direct labeling
Practicalities:
no need for professional reasoning
aligned with level progression
Crowdsourcing Methodology (Implementation)
[Task illustration: four expressions, labelled A, B, C, D, are shown per best-worst task]
Discovered relations (5 of 6): B < A, B < D, B < C, A < C, D < C
Unknown relation: A vs D
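One best-worst judgment over four items resolves five of the six pairwise relations, as above. A minimal Python sketch of this induction step (the helper name `induced_relations` is ours, not from the paper):

```python
from itertools import combinations

def induced_relations(items, best, worst):
    """Derive the pairwise relations implied by one best-worst judgment.

    Choosing `best` (highest level) and `worst` (lowest level) among four
    items tells us: worst < every other item, and every other item < best.
    Only the relation between the two middle items stays unknown.
    """
    known = set()
    for a in items:
        if a != worst:
            known.add((worst, a))   # worst is below every other item
        if a not in (worst, best):
            known.add((a, best))    # best is above every other item
    return known

# Judgment from the slide: C picked as best, B as worst
rels = induced_relations(["A", "B", "C", "D"], best="C", worst="B")
assert len(rels) == 5               # 5 of the 6 possible pairs resolved
all_pairs = {frozenset(p) for p in combinations("ABCD", 2)}
resolved = {frozenset(p) for p in rels}
assert all_pairs - resolved == {frozenset({"A", "D"})}  # only A vs D unknown
```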
Redundancy-reducing combinatorial algorithm
Four items per task and 60 expressions:
→ 1,770 possible pairwise relations
→ 487,635 possible 4-item combinations
The redundancy-reducing combinatorial algorithm reduces this to 326 tasks.
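The counts above follow from simple binomials; a quick sanity check using Python's stdlib `math.comb` (the actual 326-task schedule comes from the authors' algorithm and is not reproduced here):

```python
from math import comb

n_items = 60      # expressions per part-of-speech group
task_size = 4     # items shown per best-worst task

n_pairs = comb(n_items, 2)                    # pairwise relations among 60 items
n_candidate_tasks = comb(n_items, task_size)  # distinct 4-item tasks
assert n_pairs == 1770
assert n_candidate_tasks == 487635

# Each 4-item task contains comb(4, 2) = 6 pairs, so touching every pair
# at least once needs at least ceil(1770 / 6) tasks; the reported
# algorithm selects 326, comfortably above this floor.
min_tasks = -(-n_pairs // comb(task_size, 2))  # ceiling division
assert min_tasks == 295
```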
Evaluation methodology
Linear scale projection
[Illustration: the four items of a task (A, B, C, D) projected onto a linear scale, each receiving a numeric position]
Average over all values an item received across tasks
E.g. B = (1 + 1 + 3 + 2 + 2 + 1) / 6 ≈ 1.67
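The per-item scale value is the mean of its projected positions over all tasks it appeared in; a sketch using the slide's example numbers for item B (the helper name `scale_value` is ours):

```python
def scale_value(positions):
    """Average an item's projected scale positions over all its tasks."""
    return sum(positions) / len(positions)

# Positions item B received across the six tasks it appeared in (slide example)
b_positions = [1, 1, 3, 2, 2, 1]
b_value = scale_value(b_positions)
assert abs(b_value - 10 / 6) < 1e-9   # B lands at roughly 1.67 on the scale
```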
Results
Agreement between groups (Spearman rank coefficient)
Inter-annotator agreement in the direct ranking experiment → expert judgment not ideal
Agreement between crowdsourced ranking and direct ranking by experts (Spearman rank coefficient)
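The agreement measure used throughout is Spearman's rank coefficient; a minimal from-scratch sketch assuming tie-free rankings (the helper names are ours):

```python
def ranks(values):
    """Rank positions (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def spearman_rho(x, y):
    """Spearman's rho for tie-free paired lists: 1 - 6*sum(d^2)/(n*(n^2-1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly agreeing rankings give rho = 1; fully reversed give -1
assert spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
assert spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0
```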
Number of votes in crowdsourcing?
Discussion
(cf. previous work with explicit level judgments of essays, e.g. Díez-Bedmar, 2012)
Future research to confirm our findings:
(1) Similar experiments: other languages, other types of problems (e.g. texts), other sub-problems (e.g. single vocabulary items instead of MWEs)
(2) Reliable annotation: how?
Discussion: number of votes
Some other lessons
Future work
Thank you!
References
•Borin, Lars, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, and Anne Schumacher. 2016. Sparv: Språkbanken's corpus annotation pipeline infrastructure. In The Sixth Swedish Language Technology Conference (SLTC), Umeå University, pages 17-18.
•Brezina, Václav and Dana Gablasova. 2015. Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics 36(1):1-22.
•Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Press Syndicate of the University of Cambridge.
•Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Companion Volume with New Descriptors. www.coe.int/lang-cefr. Accessed 18.06.2020.
•Díez-Bedmar, María Belén. 2012. The use of the Common European Framework of Reference for Languages to evaluate compositions in the English exam section of the university admission examination. Revista de Educación 357:55-79.
•François, Thomas, Nuria Gala, Patrick Watrin, and Cédrick Fairon. 2014. FLELex: a graded lexical resource for French foreign learners. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 3766-3773.
•François, Thomas, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 213-219.
•Kilgarriff, Adam, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen, Saussan Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi, and Elena Volodina. 2014. Corpus-based vocabulary lists for language learners for nine languages. Language Resources and Evaluation 48(1):121-163.
•Lesterhuis, Marije, San Verhavert, Liesje Coertjens, Vincent Donche, and Sven De Maeyer. 2017. Comparative judgement as a promising alternative to score competences. In Innovative practices for higher education assessment and measurement, pages 119-138. IGI Global.
•Louviere, Jordan J, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.
•Tack, Anaïs, Thomas François, Piet Desmet, and Cédrick Fairon. 2018. NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 137-146.
•Volodina, Elena, Ildikó Pilán, Stian Rødven Eide, and Hannes Heidarsson. 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. In Proceedings of the third workshop on NLP for computer-assisted language learning, pages 128-144.
•West, Michael Philip. 1953. A general service list of English words: with semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longmans, Green.
[PCA plots: principal components for interjections, verbs, and adverbs]
Clustering results
[Cluster plots for interjections, verbs, and adverbs]