1 of 10

A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction

Max White, Alla Rozovskaya

Queens College, City University of New York

2 of 10

Grammatical Error Correction (GEC)

Original Sentence:

If any problems faced , i 'm went to my parents and say the problem , there are solve the problem .

Corrected:

If any problems arise , I go to my parents and tell them the problem , and they solve the problem .

3 of 10

The BEA 2019 Shared Task on GEC

Unrestricted Track: any resources allowed

Restricted Track: limited to publicly available learner corpora

Low Resource Track: significantly limited use of annotated data

4 of 10

Contributions

  • We conduct a fair comparison of two methods of generating synthetic data
    • Inverted Spellchecker
    • Patterns+POS
  • We identify the error types on which each method is strongest
  • We show that the Patterns+POS method outperforms the Inverted Spellchecker method

5 of 10

The Inverted Spellchecker Method (UEDIN-MS)

Original Sentence:

I started with Kopi last year and I played one game with him and that was it .

Noisified:

I started with Kopi last year any I player one game with him and than wad it .

Word    Confusion set
and     ans an ad sand rand land hand band wand end aid ant add any
was     saw wad as wars wast wads wags wasp wash ways wan war wag gas mas
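The substitution step illustrated on this slide can be sketched roughly as follows. This is a minimal illustration, not the UEDIN-MS implementation: the confusion sets are copied from the slide rather than produced by a spellchecker, and the function name `noisify`, the `error_rate` parameter, and its default value are assumptions made for the example.

```python
import random

# Confusion sets copied from the slide. In the actual method these would be
# proposed by a spellchecker; here they are hard-coded for illustration.
CONFUSION_SETS = {
    "and": ["ans", "an", "ad", "sand", "rand", "land", "hand", "band",
            "wand", "end", "aid", "ant", "add", "any"],
    "was": ["saw", "wad", "as", "wars", "wast", "wads", "wags", "wasp",
            "wash", "ways", "wan", "war", "wag", "gas", "mas"],
}


def noisify(tokens, confusion_sets, error_rate=0.15, rng=None):
    """Replace tokens with a random member of their confusion set.

    error_rate is a hypothetical per-token substitution probability;
    the slides do not state the value used in the experiments.
    """
    if rng is None:
        rng = random.Random()
    noised = []
    for tok in tokens:
        candidates = confusion_sets.get(tok)
        if candidates and rng.random() < error_rate:
            noised.append(rng.choice(candidates))
        else:
            noised.append(tok)  # no confusion set, or token left clean
    return noised


if __name__ == "__main__":
    clean = ("I started with Kopi last year and I played one game "
             "with him and that was it .").split()
    print(" ".join(noisify(clean, CONFUSION_SETS, error_rate=0.5)))
```

The sketch covers word substitution only, and the (noisy, clean) pair it yields is what would serve as a synthetic training example.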

6 of 10

The Patterns+POS Method (Kakao&Brain)

Original Sentence:

I started with Kopi last year and I played one game with him and that was it .

Noisified:

I started in Kopi last year and I play one game with him , that is it .

Token     Error patterns
with      with [removed] in on w/
played    played play plays
was       was is [removed] got 's
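A rough sketch of how such error patterns could be applied to a clean sentence is shown below. It is not the Kakao&Brain implementation: the pattern table is copied from the slide, `[removed]` is interpreted as deleting the token, and the uniform sampling over options as well as the names `apply_patterns` and `ERROR_PATTERNS` are assumptions made for the example.

```python
import random

REMOVED = "[removed]"  # marker from the slide, read here as "delete the token"

# Pattern table copied from the slide. Each token maps to the forms it may
# be rewritten as, including itself (meaning: leave unchanged).
ERROR_PATTERNS = {
    "with": ["with", REMOVED, "in", "on", "w/"],
    "played": ["played", "play", "plays"],
    "was": ["was", "is", REMOVED, "got", "'s"],
}


def apply_patterns(tokens, patterns, rng=None):
    """Rewrite each token that has an entry in the pattern table.

    Options are sampled uniformly, which is an assumption; a realistic
    system would more likely weight them by how often each error occurs.
    """
    if rng is None:
        rng = random.Random()
    noised = []
    for tok in tokens:
        options = patterns.get(tok)
        if options is None:
            noised.append(tok)  # no pattern for this token
            continue
        choice = rng.choice(options)
        if choice != REMOVED:
            noised.append(choice)  # REMOVED drops the token entirely
    return noised


if __name__ == "__main__":
    clean = ("I started with Kopi last year and I played one game "
             "with him and that was it .").split()
    print(" ".join(apply_patterns(clean, ERROR_PATTERNS)))
```

Unlike the substitution sketch for the Inverted Spellchecker, the output here can be shorter than the input, because a pattern may delete a token.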

7 of 10

Fair Comparison of Methods

  • Train CNN models with identical hyperparameters (Chollampatt and Ng, 2018)
  • Generate synthetic data using a uniform amount and source of native English data
  • Evaluate on two GEC datasets:
    • W&I+LOCNESS dev (new data set for the BEA 2019 task)
    • FCE test

8 of 10

Results

                         W&I+L dev                FCE test
Noising Method           P      R      F0.5       P      R      F0.5
Inverted Spellchecker    31.30  16.24  26.41      35.31  19.48  30.37
Patterns+POS             42.96  20.00  34.94      41.55  19.94  34.15
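For readers unfamiliar with the metric, F0.5 is the weighted harmonic mean of precision and recall that weights precision more heavily than recall. The short sketch below recomputes it from the reported P and R values using the standard F-beta formula; the function name `f_beta` and the `REPORTED` dictionary are introduced only for this example.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (P, R) pairs as reported in the results table, in percent.
REPORTED = {
    ("Inverted Spellchecker", "W&I+L dev"): (31.30, 16.24),
    ("Inverted Spellchecker", "FCE test"): (35.31, 19.48),
    ("Patterns+POS", "W&I+L dev"): (42.96, 20.00),
    ("Patterns+POS", "FCE test"): (41.55, 19.94),
}

if __name__ == "__main__":
    for (method, dataset), (p, r) in REPORTED.items():
        print(f"{method:22s} {dataset:10s} F0.5 = {f_beta(p, r):.2f}")
```

The recomputed scores agree with the reported F0.5 column to within 0.01; a residual difference of that size is what one would expect from P and R themselves being rounded to two decimals.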

9 of 10

Error-Type Analysis

Inverted Spellchecker Method Strengths:

  • Spelling, Orthographical Errors

Patterns+POS Method Strengths:

  • Verb Tense, Verb Form, Noun Number

10 of 10

Conclusion

  • We performed a fair comparison of two methods of generating synthetic data for GEC
  • Our experiments indicate that the Patterns+POS method outperforms the Inverted Spellchecker method in F0.5 on both W&I+LOCNESS dev and FCE test