Expert judgments versus crowdsourcing in ordering multi-word expressions
David Alfter (University of Gothenburg), Therese Lindström Tiedemann (University of Helsinki) and Elena Volodina (University of Gothenburg)
Problem definition (1)
We need:
more data
to validate data
to annotate data
...
money to employ annotators
… who are (often) experts in ...
… who (often) are too busy
… whose annotations (often) lack consistency
...
Is there an alternative to expert annotations?
Crowdsourcing as an alternative?
Object of study
Problem definition (2)
(to tell the truth, it all starts from very applied needs)
Coursebooks
Problem definition (2)
(to tell the truth, it all starts from very applied needs)
English Vocabulary Profile
So we want ...
RQ1: Who can be the crowd?
i.e. whether L2 speakers (or a mixed group) can be used for the task
RQ2: Can we elicit reliable annotations with crowdsourcing?
i.e. which method works well
Design: participants
Non-experts: L2 speakers / L2 learners (27)
Experts: L2 professionals (teachers, assessors, researchers) (23); CEFR experts (3)
Others (20)
3 crowdsourcing projects
3 direct annotation projects
Design: multi-word expressions
(COCTAILL data, SPARV-annotated, automatically identified MWEs)
Group 1. Interjections: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Group 2. Verbs: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Group 3. Adverbs: 60 items from coursebooks, 12 per level across 5 levels (A1, A2, B1, B2, C1)
Design: comparative vs direct
Comparative: best-worst scaling
Direct: direct labeling
Practicalities:
no need for professional reasoning
aligned with level progression
Crowdsourcing Methodology (Implementation)
[Task illustration: four expressions, labelled A, B, C, D, are shown per best-worst task]
Discovered relations (5 of 6): B < A, B < D, B < C, A < C, D < C
Unknown relation: A vs D
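One best-worst judgment over four items resolves five of the six pairwise relations, as above. A minimal Python sketch of this induction step (the helper name `induced_relations` is ours, not from the paper):

```python
from itertools import combinations

def induced_relations(items, best, worst):
    """Derive the pairwise relations implied by one best-worst judgment.

    Choosing `best` (highest level) and `worst` (lowest level) among four
    items tells us: worst < every other item, and every other item < best.
    Only the relation between the two middle items stays unknown.
    """
    known = set()
    for a in items:
        if a != worst:
            known.add((worst, a))   # worst is below every other item
        if a not in (worst, best):
            known.add((a, best))    # best is above every other item
    return known

# Judgment from the slide: C picked as best, B as worst
rels = induced_relations(["A", "B", "C", "D"], best="C", worst="B")
assert len(rels) == 5               # 5 of the 6 possible pairs resolved
all_pairs = {frozenset(p) for p in combinations("ABCD", 2)}
resolved = {frozenset(p) for p in rels}
assert all_pairs - resolved == {frozenset({"A", "D"})}  # only A vs D unknown
```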
Redundancy-reducing combinatorial algorithm
Four items per task and 60 expressions:
→ 1,770 possible pairwise relations
→ 487,635 possible 4-item combinations
The redundancy-reducing combinatorial algorithm reduces this to 326 tasks.
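The counts above follow from simple binomials; a quick sanity check using Python's stdlib `math.comb` (the actual 326-task schedule comes from the authors' algorithm and is not reproduced here):

```python
from math import comb

n_items = 60      # expressions per part-of-speech group
task_size = 4     # items shown per best-worst task

n_pairs = comb(n_items, 2)                    # pairwise relations among 60 items
n_candidate_tasks = comb(n_items, task_size)  # distinct 4-item tasks
assert n_pairs == 1770
assert n_candidate_tasks == 487635

# Each 4-item task contains comb(4, 2) = 6 pairs, so touching every pair
# at least once needs at least ceil(1770 / 6) tasks; the reported
# algorithm selects 326, comfortably above this floor.
min_tasks = -(-n_pairs // comb(task_size, 2))  # ceiling division
assert min_tasks == 295
```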
Evaluation methodology
Linear scale projection
[Illustration: the four items of a task (A, B, C, D) projected onto a linear scale, each receiving a numeric position]
Average over all values an item received across tasks
E.g. B = (1 + 1 + 3 + 2 + 2 + 1) / 6 ≈ 1.67
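The per-item scale value is the mean of its projected positions over all tasks it appeared in; a sketch using the slide's example numbers for item B (the helper name `scale_value` is ours):

```python
def scale_value(positions):
    """Average an item's projected scale positions over all its tasks."""
    return sum(positions) / len(positions)

# Positions item B received across the six tasks it appeared in (slide example)
b_positions = [1, 1, 3, 2, 2, 1]
b_value = scale_value(b_positions)
assert abs(b_value - 10 / 6) < 1e-9   # B lands at roughly 1.67 on the scale
```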
Results
Agreement between groups (Spearman rank coefficient)
Inter-annotator agreement in the direct ranking experiment → expert judgment not ideal
Agreement between crowdsourced ranking and direct ranking by experts (Spearman rank coefficient)
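The agreement measure used throughout is Spearman's rank coefficient; a minimal from-scratch sketch assuming tie-free rankings (the helper names are ours):

```python
def ranks(values):
    """Rank positions (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r

def spearman_rho(x, y):
    """Spearman's rho for tie-free paired lists: 1 - 6*sum(d^2)/(n*(n^2-1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly agreeing rankings give rho = 1; fully reversed give -1
assert spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
assert spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0
```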
Number of votes in crowdsourcing?
Discussion
(cf. previous work with explicit level judgments of essays, e.g. Díez-Bedmar, 2012)
Future research to confirm our findings:
(1) Similar experiments: other languages, other types of problems (e.g. texts), other sub-problems (e.g. single vocabulary items instead of MWEs)
(2) Reliable annotation: how?
Discussion: number of votes
Some other lessons
Future work
Thank you!
References
•Borin, Lars, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, and Anne Schumacher. 2016. Sparv: Språkbanken's corpus annotation pipeline infrastructure. In The Sixth Swedish Language Technology Conference (SLTC), Umeå University, pages 17-18.
•Brezina, Václav and Dana Gablasova. 2015. Is there a core general vocabulary? Introducing the New General Service List. Applied Linguistics 36(1):1-22.
•Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Press Syndicate of the University of Cambridge.
•Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Companion Volume with New Descriptors. www.coe.int/lang-cefr. Accessed 18.06.2020.
•Díez-Bedmar, María Belén. 2012. The use of the Common European Framework of Reference for Languages to evaluate compositions in the English exam section of the university admission examination. Revista de Educación 357:55-79.
•François, Thomas, Nuria Gala, Patrick Watrin, and Cédrick Fairon. 2014. FLELex: a graded lexical resource for French foreign learners. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 3766-3773.
•François, Thomas, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 213-219.
•Kilgarriff, Adam, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen, Saussan Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi, and Elena Volodina. 2014. Corpus-based vocabulary lists for language learners for nine languages. Language Resources and Evaluation 48(1):121-163.
•Lesterhuis, Marije, San Verhavert, Liesje Coertjens, Vincent Donche, and Sven De Maeyer. 2017. Comparative judgement as a promising alternative to score competences. In Innovative practices for higher education assessment and measurement, pages 119-138. IGI Global.
•Louviere, Jordan J, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.
•Tack, Anaïs, Thomas François, Piet Desmet, and Cédrick Fairon. 2018. NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 137-146.
•Volodina, Elena, Ildikó Pilán, Stian Rødven Eide, and Hannes Heidarsson. 2014. You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. In Proceedings of the third workshop on NLP for computer-assisted language learning, pages 128-144.
•West, Michael Philip. 1953. A general service list of English words: with semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longmans, Green.
[PCA plots: principal components for interjections, verbs, and adverbs]
Clustering results
[Cluster plots for interjections, verbs, and adverbs]