Expert judgments versus crowdsourcing in ordering multi-word expressions

David Alfter (University of Gothenburg), Therese Lindström Tiedemann (University of Helsinki) and Elena Volodina (University of Gothenburg)

Problem definition (1)

We need:

  • more data
  • to validate data
  • to annotate data
  • money to employ annotators
    • … who (often) are experts in ...
    • … who (often) are too busy
    • … whose annotations (often) lack consistency


Is there an alternative to expert annotations?

Crowdsourcing as an alternative?

  • What do we know about it?
    • open call, non-experts, cheap labor, easy to get, etc. → A myth?

  • Do non-experts REALLY perform as well as experts?
  • Are crowdsourced annotations REALLY reliable?

  • Does it REALLY take less time? Or less money?

Object of study

  • Content-wise:
    • multi-word expressions (MWEs)
    • from a second language (L2) learning point of view
    • coursebook texts as a source of MWEs
  • Design-wise:
    • comparative judgment setting -- BEST-WORST SCALING algorithm
    • expert annotations (traditional approach)

Problem definition (2)

(to tell you the truth, it all starts with very applied needs)

  • Need to validate target level projections from course books


  • Need to find a method to “stratify” unseen vocabulary into appropriate levels for learners at different levels of linguistic proficiency
    • language independent and cheap approach

English Vocabulary Profile

  • Lexicographer work
  • Language testers
  • Large learner corpus (exam essays)

So we want ...

  • to validate crowdsourcing as a method to support graded list creation for any language (in L2 context)

RQ1: Who can be the crowd?

i.e. whether L2 speakers (or a mixed group) can be used for the task

RQ2: Can we elicit reliable annotations with crowdsourcing?

i.e. which method works well

Design: participants


  • L2 speakers / L2 learners
  • L2 professionals: teachers, assessors, researchers
  • CEFR experts

  • 3 crowdsourcing projects
  • 3 direct annotation projects

Design: multi-word expressions

(COCTAILL data, SPARV-annotated, automatically identified MWEs)

Group 1. Interjections

60 items from coursebooks: 12 per level across 5 levels (A1, A2, B1, B2, C1)

Group 2. Verbs

60 items from coursebooks: 12 per level across 5 levels (A1, A2, B1, B2, C1)

Group 3. Adverbs

60 items from coursebooks: 12 per level across 5 levels (A1, A2, B1, B2, C1)

Design: comparative vs direct


Best-worst scaling

Direct labeling

  • pyBossa (installation at Språkbanken)
  • professional and personal networks to recruit
  • small gifts to stimulate participation
  • guidelines/instructions
  • 5 votes per item for non-experts
  • 3 votes per item for experts
  • NOTE: we asked participants to use intuitive judgments; no professional reasoning was needed
  • Assumption: perceived difficulty/complexity is aligned with level progression

Crowdsourcing Methodology (Implementation)

Crowdsourcing methodology









Discovered relation (5/6):

B < A

B < D

B < C

A < C

D < C

Unknown relation:

A - D
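The relations above can be reproduced with a minimal sketch, assuming that one best-worst vote over four items names one item as lowest and one as highest on the scale. The function name `relations_from_vote` and the parameter names `lowest` and `highest` are illustrative and do not come from the slides.

```python
from itertools import combinations


def relations_from_vote(items, lowest, highest):
    """Return (known, unknown) pairwise relations implied by one best-worst vote.

    Each known relation is a tuple (x, y) meaning x < y.
    Unknown relations are unordered pairs (frozensets).
    """
    known = set()
    for item in items:
        if item != lowest:
            known.add((lowest, item))   # the lowest item is below every other item
        if item not in (lowest, highest):
            known.add((item, highest))  # every remaining item is below the highest
    all_pairs = {frozenset(pair) for pair in combinations(items, 2)}
    unknown = all_pairs - {frozenset(pair) for pair in known}
    return known, unknown


# The example from the slide: B is chosen as lowest, C as highest.
known, unknown = relations_from_vote(["A", "B", "C", "D"], lowest="B", highest="C")
```

With four items there are six pairs in total; one vote fixes five of them, and only the pair of the two unchosen items (here A and D) stays open.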

Redundancy-reducing combinatorial algorithm

Four items per task and 60 expressions

→ 1,770 possible relations

→ 487,635 possible combinations.

redundancy-reducing combinatorial algorithm

→ 326 tasks
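The counts above follow from binomial coefficients, and the idea of reducing redundancy can be illustrated with a simple greedy cover. This is only a sketch under our own assumptions: the function `greedy_tasks` is hypothetical, it is not the algorithm used in the study, and the number of tasks it produces need not equal 326. It assumes the goal is for every pair of expressions to appear together in at least one four-item task.

```python
from itertools import combinations
from math import comb

N_ITEMS = 60
TASK_SIZE = 4

n_relations = comb(N_ITEMS, 2)             # pairwise relations among 60 expressions
n_combinations = comb(N_ITEMS, TASK_SIZE)  # all possible four-item tasks


def greedy_tasks(n_items=N_ITEMS, task_size=TASK_SIZE):
    """Greedily build tasks until every pair of items has shared a task."""
    uncovered = set(combinations(range(n_items), 2))
    tasks = []
    while uncovered:
        # Seed the task with a pair that has not been shown together yet.
        task = list(min(uncovered))
        while len(task) < task_size:
            def gain(candidate):
                # Number of still-uncovered pairs the candidate would add.
                return sum(
                    (min(candidate, member), max(candidate, member)) in uncovered
                    for member in task
                )
            best = max((c for c in range(n_items) if c not in task), key=gain)
            task.append(best)
        for pair in combinations(sorted(task), 2):
            uncovered.discard(pair)
        tasks.append(tuple(task))
    return tasks


tasks = greedy_tasks()
```

Since one four-item task contains six pairs, no cover of this kind can use fewer than 1,770 / 6 = 295 tasks, which puts the reported 326 tasks close to that bound.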

Evaluation methodology

  • Linear scale projection
  • Principal component analysis (PCA)
  • Clustering

Linear scale projection









Average over all values

E.g. B = (1 + 1 + 3 + 2 + 2 + 1) / 6
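The averaging step can be written out as a short sketch. It assumes that each item has collected one numeric value per relevant judgment and that its position on the linear scale is the plain mean of those values; the function name `linear_scale_score` is illustrative, and only the values for B are taken from the slide.

```python
def linear_scale_score(values):
    """Average all values collected for one item."""
    return sum(values) / len(values)


# Values for item B as given in the example above.
b_values = [1, 1, 3, 2, 2, 1]
score_b = linear_scale_score(b_values)  # 10 / 6
```

Sorting all items by this score would then give their order on the linear scale.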

Agreement between groups (Spearman rank coefficient)
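As a reminder of how the coefficient is obtained, here is a minimal sketch of Spearman's rank correlation for two rankings of the same items. It assumes ranks 1..n without ties and uses the textbook formula 1 - 6 Σd² / (n(n² - 1)). The function name and the two small rankings are hypothetical and are not data from the study.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings of the same items.

    Both arguments map item -> rank (1..n).
    """
    if set(ranks_a) != set(ranks_b):
        raise ValueError("both rankings must cover the same items")
    n = len(ranks_a)
    sum_d2 = sum((ranks_a[item] - ranks_b[item]) ** 2 for item in ranks_a)
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical rankings of four items by two groups.
crowd_ranks = {"A": 2, "B": 1, "C": 4, "D": 3}
expert_ranks = {"A": 3, "B": 1, "C": 4, "D": 2}
rho = spearman_rho(crowd_ranks, expert_ranks)
```

A value of 1 means the two groups order the items identically, and -1 means the orders are exactly reversed. Real data with tied ranks would need a tie-aware implementation instead of this shortcut formula.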

Inter-annotator Agreement in the direct ranking experiment → Expert judgment not ideal

  • Expert judgment not ideal
  • Cf. studies on essay evaluation (Diez-Bedmar, 2012; Snow et al., 2008; Carlsen, 2012)
  • De-contextualised
  • >Tolerance 1 = peripheral items? → Future research

Agreement between Crowdsourced-ranking & Direct ranking by experts (Spearman rank coefficient)

  • Expert 1 is most consistent.
  • All experts are most consistent for interjections etc. (group 1).
  • Expert scoring is more subjective and cognitively demanding (Lesterhuis et al., 2017).

Number of votes

Number of votes in crowdsourcing?

  • High correlation even with only 1 annotator (when sub-sampling).
  • Correlation is lowest among non-experts.
  • Correlation is highest with a mixed crowd.
  • Better to have 2 voters than only 1, but with a mixed crowd it makes no difference.

  • Who can be the crowd, with regard to the background of crowdsourcers?
  • How can reliable annotations be achieved with regard to design, number of answers and number of contributors?

  • Who can be the crowd?
  • Mixed crowds agree very well in comparative ranking (best-worst scaling).

(cf. previous work with explicit level judgment of essays, e.g. Diez-Bedmar, 2012)

Future research to confirm our findings:

Similar experiments – other languages, other types of problems (e.g. texts), other sub-problems (e.g. single vocabulary items instead of MWE)

(2) Reliable annotation - how?

  • Design of annotation task influences the results.
    • More traditional method (expert (direct) judgment) → less reliable results than crowdsourced comparative judgment
    • Experts do not agree with themselves between tasks.
    • Comparative judgment - homogeneous results between all the different groups of crowdsourcers (expert / non-expert background)

Discussion number of votes

  • On a linear scale from 1–60: insignificant difference if there were 3 votes instead of only 2 (mixed crowd).
  • Mixed background most stable.

Some other lessons

Future work

  • Does the same method give the same results when applied to other “problems” (single words, essays)?
  • How to order unlabelled expressions, e.g. unknown vocabulary.
  • Clustering? Known anchor-points?
  • Identification of core and periphery vocabulary based on clustering?

Thank you!

Principal Component


Clustering results

Clustering - interjections

Clustering - verbs

Clustering - adverbs