Subtask 1 - Lay Summarisation

1. Average metric scores will be calculated independently for the lay summaries in the test set of each dataset (PLOS and eLife).

PLOS

Submission  Relevance                             Readability     Factuality
            Rouge1   Rouge2   RougeL   BERTScore   FKGL    DCRS    BARTScore   FactCC
s1          44.31    44.31    44.31    44.31       12.89   11.87   75.33       41.32
s2          42.12    13.21    42.23    83.53       14.12   12.43   69.32       42.65
s3          41.42    12.11    39.22    81.52       12.43   11.2    74.23       48.53
s4          39.09    11.23    39.6     82.88       11.24   11.67   73.23       45.65

eLife

Submission  Relevance                             Readability     Factuality
            Rouge1   Rouge2   RougeL   BERTScore   FKGL    DCRS    BARTScore   FactCC
s1          45.23    12.21    42.65    85.64       10.21   10.42   74.32       50.34
s2          47.54    14.63    44.64    84.21       11.53   10.89   70.23       47.53
s3          42.32    13.54    40.64    80.53       9.76    9.32    71.34       55.23
s4          38.53    10.53    39.54    82.88       9.53    10.02   64.23       42.12
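
As an illustration of step 1, the per-dataset averaging could be sketched as below. This is a minimal Python sketch: the score-dictionary layout and the variable names (plos_scores, elife_scores) are assumptions, not the official evaluation code.

```python
# Minimal sketch of step 1: average each metric over all lay summaries in one
# dataset's test set. The data layout (a list of per-summary score dicts) and
# the variable names are assumptions, not the official evaluation code.
from statistics import mean

METRICS = ["Rouge1", "Rouge2", "RougeL", "BERTScore",
           "FKGL", "DCRS", "BARTScore", "FactCC"]

def average_metrics(per_summary_scores):
    """per_summary_scores: one dict of metric scores per lay summary in the test set."""
    return {m: mean(s[m] for s in per_summary_scores) for m in METRICS}

# Hypothetical usage, one call per dataset:
# plos_avg  = average_metrics(plos_scores)
# elife_avg = average_metrics(elife_scores)
```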

2. To obtain the leaderboard, the metric scores of the two datasets will be averaged. These averaged scores are what will be visible to participants on CodaLab during the test phase.

Submission  Relevance                             Readability     Factuality
            Rouge1   Rouge2   RougeL   BERTScore   FKGL    DCRS    BARTScore   FactCC
s1          44.77    28.26    43.48    64.975      11.55   11.145  74.825      45.83
s2          44.83    13.92    43.435   83.87       12.825  11.66   69.775      45.09
s3          41.87    12.825   39.93    81.025      11.095  10.26   72.785      51.88
s4          38.81    10.88    39.57    82.88       10.385  10.845  68.73       43.885

Note that the aim is to maximise all Relevance and Factuality scores and to minimise both Readability scores (lower FKGL and DCRS values indicate more readable text).
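
Step 2 then reduces to averaging the two per-dataset values for each metric. A minimal sketch, reusing the hypothetical METRICS list and per-dataset dictionaries from the sketch above:

```python
# Sketch of step 2: the leaderboard value for each metric is the mean of its
# PLOS and eLife averages (METRICS as in the previous sketch).
def leaderboard_scores(plos_avg, elife_avg):
    return {m: (plos_avg[m] + elife_avg[m]) / 2 for m in METRICS}

# e.g. s2's leaderboard Rouge1 = (42.12 + 47.54) / 2 = 44.83, matching the table above.
```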
3. After the test phase is complete, we will compute an average score for each individual aspect (i.e., Relevance, Readability, Factuality).

a) To do this, we will first normalise the metric values (using min-max normalisation) so that they share a common value range (0-1).

normalised score_i = (score_i - min(score)) / (max(score) - min(score))

       Rouge1   Rouge2   RougeL   BERTScore   FKGL     DCRS     BARTScore   FactCC
min    38.81    10.88    39.57    64.975      10.385   10.26    68.73       43.885
max    44.83    28.26    43.48    83.87       12.825   11.66    74.825      51.88

Submission  Relevance                                                   Readability                     Factuality
            Rouge1         Rouge2         RougeL         BERTScore      FKGL           DCRS           BARTScore      FactCC
s1          0.9900332226   1              1              0              0.4774590164   0.6321428571   1              0.2432770482
s2          1              0.1749136939   0.9884910486   1              1              1              0.1714520098   0.1507191995
s3          0.5083056478   0.1119102417   0.09207161125  0.8494310664   0.2909836066   0              0.6652994258   1
s4          0              0              0              0.9476051866   0              0.4178571429   0              0
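
As a sketch of the min-max normalisation in step 3a, applied to one metric's leaderboard scores across all submissions (the function and variable names are illustrative, not the official code):

```python
# Sketch of step 3a: min-max normalise one metric's leaderboard scores across submissions.
def min_max_normalise(scores):
    """scores: dict mapping submission id -> leaderboard score for a single metric."""
    lo, hi = min(scores.values()), max(scores.values())
    return {sub: (value - lo) / (hi - lo) for sub, value in scores.items()}

# e.g. Rouge1: min_max_normalise({"s1": 44.77, "s2": 44.83, "s3": 41.87, "s4": 38.81})
# gives s1 = 0.9900..., s2 = 1.0, s3 = 0.5083..., s4 = 0.0, as in the table above.
```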

b) Then we will compute an aspect-level score by averaging across the normalised scores of the relevant metrics.

Submission  Relevance      Readability    Factuality
s1          0.7475083056   0.5548009368   0.6216385241
s2          0.7908511856   1              0.1610856047
s3          0.3904296418   0.1454918033   0.8326497129
s4          0.2369012966   0.2089285714   0
In this case, s2 is best for Relevance, and s3 is best for Readability and Factuality.
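
Step 3b can be sketched as a plain average over each aspect's normalised metrics; the aspect-to-metric grouping below simply mirrors the table headers above.

```python
# Sketch of step 3b: average the normalised scores of the metrics belonging to each aspect.
ASPECTS = {
    "Relevance":   ["Rouge1", "Rouge2", "RougeL", "BERTScore"],
    "Readability": ["FKGL", "DCRS"],
    "Factuality":  ["BARTScore", "FactCC"],
}

def aspect_scores(normalised):
    """normalised: dict mapping metric name -> normalised score for one submission."""
    return {aspect: sum(normalised[m] for m in metrics) / len(metrics)
            for aspect, metrics in ASPECTS.items()}

# e.g. for s1: Readability = (0.4774590164 + 0.6321428571) / 2 = 0.5548..., as in the table above.
```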

4. To determine the best system across all three aspects, we simply calculate each submission's cumulative rank across the aspects (lowest == best).

Rankings

Submission  Relevance   Readability   Factuality   Cumulative Rank
s1          2           3             2            7
s2          1           4             3            8
s3          3           1             1            5
s4          4           2             4            10

In this case, s3 obtains the lowest cumulative rank, so it would be considered the best overall system.
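
Finally, a sketch of the cumulative ranking in step 4: Relevance and Factuality are ranked in descending order (higher is better) and Readability in ascending order (lower is better). Tie handling is ignored for simplicity, and the code is illustrative rather than the official procedure.

```python
# Sketch of step 4: rank the submissions within each aspect, then sum the ranks.
def cumulative_ranks(aspect_table):
    """aspect_table: dict mapping submission id -> {aspect name: aspect-level score}."""
    totals = {sub: 0 for sub in aspect_table}
    for aspect in ASPECTS:  # the ASPECTS mapping from the previous sketch
        higher_is_better = aspect != "Readability"
        ordered = sorted(aspect_table,
                         key=lambda sub: aspect_table[sub][aspect],
                         reverse=higher_is_better)
        for rank, sub in enumerate(ordered, start=1):
            totals[sub] += rank
    return totals

# For the worked example above this gives {"s1": 7, "s2": 8, "s3": 5, "s4": 10},
# so s3, with the lowest cumulative rank, is the best overall system.
```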