1 of 9

Anya Belz, University of Brighton, UK

A SYSTEMATIC REVIEW OF REPRODUCIBILITY RESEARCH IN NLP

@anyabelz

Joint work with Shubham Agarwal, Anastasia Shimorina and Ehud Reiter

Belz, 13 Years of Comparative and Competitive Evaluation in NLG, SwissText’20

2 of 9

REPRODUCIBILITY CRISIS

  • “An experimental result is not fully established unless it can be independently reproduced.” (Association for Computing Machinery)
  • Pedersen (2008): “reviewers and readers accept highly empirical results on faith. [...] to the point where we seem to have given up on the idea of being able to reproduce results”
  • Baker (2016): >50% of scientists report failing to reproduce own results at least once, 70% failing to reproduce someone else’s results
  • Mieskes et al. (2019): 24.9% of NLP researchers report failing to reproduce conclusions of own results, 56.7% to reproduce another team’s

Belz, Agarwal, Shimorina & Reiter: Systematic Review of Reproducibility, EACL 2021

3 of 9

ADDRESSING REPRODUCIBILITY

  • Increasing activity in NLP/ML around reproducibility: workshops, shared tasks, themes, checklists (e.g. ACL-IJCNLP’21)
  • Strong focus on ensuring systems can be rerun/recreated exactly
  • Increasing sharing of data, code and other information needed to ensure ‘sameness of system’
  • Have self-reported estimates of experience of reproducibility, but don’t have an overall estimate of how reproducible NLP results are
  • One aim in this paper: provide overview of differences between scores in original vs. reproduction studies in NLP

4 of 9

TALKING ABOUT REPRODUCIBILITY

  • Many different ‘R-terms’ in use, often with incompatible definitions: reproducibility, repeatability, replicability, recreation, re-run, robustness, repetition, replication, reproduction, generalisability, …
  • ACM switched definitions of ‘reproducible’ and ‘replicable’ when asked by NISO to “harmonize its terminology and definitions with those used in the broader scientific research community”
  • Different accounts tie definitions to different dimensions of sameness:
    • E.g. Whitaker (2017): data, code
    • Dimensions in other definitions: team, method, system, author-supplied artifacts, etc.
    • Could add these but would run out of terms quickly

5 of 9

TALKING ABOUT REPRODUCIBILITY

  • International Vocabulary of Metrology (VIM) makes dimensions of sameness/difference part of definitions, distinguishing:
    • repeatability: the precision of measurements of the same or similar object obtained under the same conditions
    • reproducibility: the precision of measurements of the same or similar object obtained under different conditions
  • Using term ‘reproduction study’ to refer to any study that tests repeatability or reproducibility, we structure review into:
    1. Reproduction under same conditions
    2. Reproduction under (deliberately) varied conditions (see paper)
    3. Multi-test/multi-lab studies (see paper)

6 of 9

SYSTEMATIC REVIEW METHOD

  1. Search ACL Anthology for titles containing ‘reproduc’ or ‘replica’, capitalised or not → 47 papers
  2. Exclude papers not about reproducibility in our sense → 35
  3. Manually search non-ACL NLP/ML sources → 60
  4. Manually search other fields → 67
  5. Identify papers with scores from reproductions under same conditions → 18
  6. Identify papers with scores from reproductions under (deliberately) varied conditions → 9
  7. Use papers from 5, 6 in numerical analyses; remainder in literature review
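Step 1’s title filter can be sketched in a few lines of Python; the titles below are made-up stand-ins for ACL Anthology entries, not papers from the review:

```python
import re

# Hypothetical titles standing in for ACL Anthology entries (step 1).
titles = [
    "A Study of Reproducibility in Parsing",
    "Replicability of Text Classification Baselines",
    "Attention Mechanisms for Sequence Labelling",
    "REPRODUCING Published Sentiment Analysis Results",
]

# Keep titles containing 'reproduc' or 'replica', capitalised or not.
pattern = re.compile(r"reproduc|replica", re.IGNORECASE)
matches = [t for t in titles if pattern.search(t)]
print(len(matches))  # 3 of the 4 example titles match
```

Case-insensitive substring matching over titles is all step 1 requires; steps 2–4 were manual.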

7 of 9

REPRODUCTION UNDER SAME CONDITIONS

  • 18 reproduction papers × 1.88 original papers on average = 34 paper pairs, and a total of 549 score pairs
  • 36 score pairs: reproduction did not produce scores, e.g. because resource limits were reached, or code didn't work
  • Remaining 513 score pairs:
    • 77 score pairs (14.03%): scores exactly the same
    • 436 score pairs (85.97%): scores different, of which:
      • in 178 score pairs (40.8%), reproduction score was better
      • in 258 score pairs (59.2%), reproduction score was worse
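The tally above can be sketched as follows, on made-up (original, reproduction) score pairs and assuming, as the slide does, that a higher score is better:

```python
# Hypothetical (original score, reproduction score) pairs; the review
# compared 549 real pairs. Assumes higher scores are better.
pairs = [(0.75, 0.75), (0.80, 0.82), (0.90, 0.87), (0.66, 0.61), (0.50, 0.50)]

same   = sum(1 for orig, repro in pairs if repro == orig)
better = sum(1 for orig, repro in pairs if repro > orig)
worse  = sum(1 for orig, repro in pairs if repro < orig)
print(same, better, worse)  # 2 1 2
```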

8 of 9

REPRODUCTION UNDER SAME CONDITIONS

  • Also looked at size/direction of differences in terms of percentage increase/decrease for each score pair
  • Excluded score pairs where one or both scores were 0, and 4 outliers
  • Of the remaining differences:
    • 40% were > ±1%
    • 25% were > ±5%
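A minimal sketch of that computation, assuming percentage change is taken relative to the original score (the formula is not stated on the slide) and using made-up score pairs:

```python
# Hypothetical (original, reproduction) score pairs; pairs with a zero
# score are excluded, as in the review (outlier removal is omitted here).
pairs = [(0.75, 0.75), (0.80, 0.82), (0.90, 0.87), (0.50, 0.56)]

diffs = [(repro - orig) / orig * 100.0
         for orig, repro in pairs if orig != 0 and repro != 0]
over_1 = sum(1 for d in diffs if abs(d) > 1)   # differences beyond +/-1%
over_5 = sum(1 for d in diffs if abs(d) > 5)   # differences beyond +/-5%
print(over_1, over_5)  # 3 1
```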

9 of 9

CONCLUSION

  • Main findings:
    • Very little agreement about how to define even basic concepts in reproducibility, let alone how to test for it
    • Just 14% of reproduction scores are the same as the original score
    • Where scores are different, 60% are worse, in 3 out of 5 cases by margins > ±1%
  • Questions for future work:
    • Why are reproduction scores more often worse?
    • How similar are results from multiple reproductions of the same work?
    • What exactly are the dimensions of sameness/difference we want to control in reproducibility tests in NLP?
