WMT16 Metric Task Tracks

Colors:
- Blue (and green) cells are distributed in per-language-pair packages, because they include hybrid systems and are very big.
- Yellow (and green) cells are distributed as one package, for segment-level participants only.
- So if you download the blue packages, you will have everything.
Language pairs: each cell specifies the domains included (WMT news task, WMT IT task, HUME medical task).
Columns (one row per track): Track short name; New?; Task long description; Texts; Systems; Hybrids?; one cell per language pair (cs2en, de2en, ro2en, fi2en, ru2en, tr2en, en2cs, en2de, en2ro, en2fi, en2ru, en2tr, en2bg, en2es, en2eu, en2nl, en2pl, en2pt); Training data (optional); Input; Output; Golden data; Evaluation; New in 2016.
RRsysNews (modified)
  Texts: newstest2016
  Systems: news task systems + tuning task systems (en<->cs)
  Hybrids: yes
  Language pairs: cs2en T3+F3; de2en T3+F1; ro2en T3+F1; fi2en T3+F1; ru2en T3+F2; tr2en T3+F2; en2cs T4+F4; en2de T4+F5; en2ro T4+F6; en2fi T4+F6; en2ru T4+F2; en2tr T4+F6
  Training data (optional): past years of the metrics tasks
  Input: system outputs + references of the whole test set
  Output: your metric score for the test set
  Golden data: TrueSkill interpretation of the RR judgements
  Evaluation: Pearson correlation of your metric score against the TrueSkill score, for the real primary submissions only (not across the ~10,000 additional synthetic systems)
  New in 2016: you will get ~10,000 MT systems to score, not just the ~20 per language pair; these systems will be generated automatically by randomly taking translation candidates (along with the corresponding manual judgements) of each sentence
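A minimal sketch of how such synthetic "hybrid" systems can be built by sentence-level sampling from the real systems (Python; the function name and data layout are illustrative assumptions, not the official generation script):

    import random

    def make_hybrid_systems(system_outputs, n_hybrids=10000, seed=16):
        """Build synthetic systems: for every sentence, take the candidate
        translation of a randomly chosen real system.

        system_outputs: list of systems, each a list of sentence strings
        (system_outputs[s][i] = translation of sentence i by system s).
        Hypothetical layout; returns n_hybrids systems in the same format.
        """
        rng = random.Random(seed)
        n_sents = len(system_outputs[0])
        hybrids = []
        for _ in range(n_hybrids):
            # The manual judgement collected for each picked candidate
            # travels with it, so hybrids come with judgements "for free".
            hybrids.append([rng.choice(system_outputs)[i]
                            for i in range(n_sents)])
        return hybrids

Sampling whole existing candidates (rather than generating new text) means every sentence of a hybrid already has real manual judgements attached.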
RRsysIT (new)
  Texts: it-test2016
  Systems: IT task systems
  Hybrids: yes
  Language pairs (the seven IT-task en2X pairs, in column order): en2cs T6; en2de T6; en2bg T6; en2es T6+F7; en2eu T6; en2nl T6+F7; en2pt T6+F7
  Training data (optional): as above
  Input: as above
  Output: as above
  Golden data: as above
  Evaluation: as above
  New in 2016: the standard track, but on a brand new domain, plus the 10k synthetic systems for confidence estimation
DAsysNews (new)
  Texts: newstest2016
  Systems: as RRsysNews
  Hybrids: yes
  Language pairs: cs2en T3+F3+T5; de2en T3+F1+T5; ro2en T3+F1+T5; fi2en T3+F1+T5; ru2en T3+F2+T5; tr2en T3+F2+T5; en2cs no; en2de no; en2ro no; en2fi no; en2ru T4+F2+T5; en2tr no
  Training data (optional): none, since the language pairs Yvette has DA data for (es-en, en-es) are not in the translation task
  Input: as above
  Output: your metric score for the test set
  Golden data: WMT16 DA judgements
  Evaluation: Pearson correlation of your metric score against the DA sys-level score, for the real primary submissions only (not across the ~10,000 additional synthetic systems)
  New in 2016: you will get ~10,000 MT systems to score, not just the ~20 per language pair; generated as above
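Both system-level tracks are scored the same way in the end: a plain Pearson correlation between your per-system metric scores and the gold system-level scores (TrueSkill for RRsysNews/RRsysIT, DA for DAsysNews), restricted to the real primary submissions. A minimal sketch, assuming scores are kept in dicts keyed by system name (an illustrative layout):

    from statistics import mean

    def pearson(xs, ys):
        """Plain Pearson correlation coefficient of two equal-length lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def sys_level_correlation(metric_scores, gold_scores, primary_systems):
        """Correlate metric vs. gold system-level scores over the primary
        submissions only; the ~10k synthetic systems are scored by the
        metric but excluded from this correlation."""
        xs = [metric_scores[s] for s in primary_systems]
        ys = [gold_scores[s] for s in primary_systems]
        return pearson(xs, ys)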
RRsegNews (unchanged)
  Texts: newstest2016
  Systems: as RRsysNews
  Hybrids: no
  Language pairs: cs2en T7; de2en T7; ro2en T7; fi2en T7; ru2en T7; tr2en T7; en2cs T8; en2de T8; en2ro T8; en2fi T8; en2ru T8; en2tr T8
  Training data (optional): past years of the metrics tasks
  Input: system outputs + reference for each sentence
  Output: your metric score for the sentence
  Golden data: the set of simulated pairwise judgements as collected in the RR judgements
  Evaluation: the WMT14 variant of Kendall's tau between your metric scores and the manual pairwise judgements
  New in 2016: no change
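A sketch of that Kendall's tau variant, under the WMT14 conventions as we read them: pairs where the humans tied are discarded beforehand, and a metric tie on a human-decided pair counts as discordant. The input format is an illustrative assumption:

    def kendall_tau_wmt14(pairs):
        """pairs: iterable of (metric_score_better, metric_score_worse)
        tuples, one per human pairwise judgement in which 'better' beat
        'worse' (human ties already discarded).
        tau = (concordant - discordant) / (concordant + discordant)."""
        concordant = discordant = 0
        for better, worse in pairs:
            if better > worse:
                concordant += 1
            else:  # metric disagrees with the human, or ties: penalized
                discordant += 1
        return (concordant - discordant) / (concordant + discordant)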
DAsegNews (new)
  Texts: newstest2016
  Systems: 2016 news task systems (excluding tuning task systems)
  Hybrids: no
  Language pairs: cs2en T7+F8; de2en T7+F8; ro2en T7+F8; fi2en T7+F8; ru2en T7+F8; tr2en T7+F8; en2ru T8+F9
  Training data (optional): new devsets prepared by Yvette (a random sample of the WMT'15 cs-en, de-en, fi-en and ru-en data, 500 translations per language pair)
  Input: as above (500 translations per language pair)
  Output: as above
  Golden data: DA judgements for candidate translations (no relative comparison, only absolute judgements); candidates will be sampled so that each candidate in the set gets a reliable average absolute score (scores are standardized per human assessor)
  Evaluation: Pearson correlation of your metric score against the DA seg-level score across all annotated sentences (500 per language pair) of all primary submissions
  New in 2016: brand new
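A sketch of how such seg-level gold scores can be derived from the raw judgements: each assessor's raw scores are standardized to z-scores (to remove individual scoring scales), then the standardized scores are averaged per candidate translation. The data layout and names here are illustrative assumptions:

    from collections import defaultdict
    from statistics import mean, pstdev

    def da_segment_scores(judgements):
        """judgements: list of (assessor_id, segment_id, raw_score) triples
        (hypothetical layout). Returns one gold score per segment."""
        # Per-assessor mean and standard deviation.
        by_assessor = defaultdict(list)
        for assessor, _, score in judgements:
            by_assessor[assessor].append(score)
        stats = {a: (mean(s), pstdev(s)) for a, s in by_assessor.items()}

        # Standardize per assessor, then average per candidate segment.
        by_segment = defaultdict(list)
        for assessor, segment, score in judgements:
            m, sd = stats[assessor]
            by_segment[segment].append((score - m) / sd if sd else 0.0)
        return {seg: mean(scores) for seg, scores in by_segment.items()}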
HUMEseg (new)
  Texts: himltest
  Systems: HimL year 1 systems
  Hybrids: no
  Language pairs (the four HimL en2X pairs, in column order): en2cs T9+F10; en2de T9+F10; en2ro T9+F10; en2pl T9+F10
  Training data (optional): none
  Input: as above
  Output: as above
  Golden data: HUME manual annotation, collapsed automatically into one score per segment
  Evaluation: Pearson correlation of your metric score against the HUME aggregate seg-level score for all annotated sentences of all systems
  New in 2016: brand new
Each 'yes' indicates that we will provide some systems producing translations of the given test set (row) and language pair (column).
The expected output from participants is described on the metrics task web page.