1 of 21

Fact Aware Multi-Task Learning for Text Coherence Modeling

Tushar Abhishek, Daksh Rawat, Manish Gupta, Vasudeva Varma

tushar.abhishek@research.iiit.ac.in, daksh.rawat@students.iiit.ac.in, manish.gupta@iiit.ac.in, vv@iiit.ac.in

2 of 21

What is textual coherence?

  • Information flow doesn’t break while transitioning from one sentence to the next
  • Helps the reader understand the text/paragraph as a whole rather than as a series of disparate sentences.
  • Vital for the success of NLG systems (summarization, question answering, question generation, …)

Coherent Text

  1. John went to his favorite music store to buy a piano.
  2. He had frequented the store for many years.
  3. He was excited that he could finally buy a piano.
  4. He arrived just as the store was closing for the day.

Incoherent Text

  1. John went to his favorite music store to buy a piano.
  2. It was a store John had frequented for many years.
  3. He was excited that he could finally buy a piano.
  4. It was closing just as John arrived.

3 of 21

Previous work

  • EGRID: builds an entity grid, a matrix that tracks entity mentions over sentences; a random forest classifier is trained over features extracted from the grid.
  • CNN-Egrid: a local coherence model that employs a CNN operating over the entity grid representation.
  • PARSEQ: three stacked LSTMs to represent sentence, paragraph, and document.
  • Hierarchical LSTM: similar to PARSEQ, but uses attentional BiLSTMs.
  • Local Coherence Discriminator (LCD-L): max-pools the hidden states of a language model to get sentence representations; representations of two consecutive sentences are concatenated and fed to a dense layer.
  • LCD_BERT: same as LCD-L, but uses averaged BERT embeddings (instead of GloVe) as the sentence representations.
  • LCD_RoBERTa: same as LCD-L, but uses RoBERTa embeddings.
  • LC (Local Coherence): sentences are encoded with a recurrent or recursive layer; a filter of weights is applied over each window of sentence vectors to extract scores, which are aggregated into an overall document coherence score.

4 of 21

Previous work

  • Coh+GR: extends the Hierarchical LSTM by training it to predict word-level labels indicating the predicted grammatical role (GR) type, along with the document-level coherence score.
  • Coh+GR_BERT: similar to Coh+GR, but BERT embeddings are used instead of GloVe as input to the BiLSTMs.
  • Coh+SOX: same as Coh+GR, except that for each word we only predict subject (S), object (O), and 'other' (X) roles.
  • Seq2Seq: trains two LSTM generative language models and scores coherence using the difference between the conditional log-likelihood of a sentence given its preceding/succeeding context and the marginal log-likelihood of that sentence.
  • Unified: uses a combination of LSTMs and CNNs.
  • Inc-lex-Coh: extracts sentence representations using a pretrained language model and combines the semantic centroid vector with a semantic similarity vector to obtain the coherence output.
  • Avg-XLNET-Doc: encodes the text content at the document level and averages the encoded representations.
  • Avg-RoBERTa-Doc: same as Avg-XLNET-Doc, but uses RoBERTa embeddings instead of XLNET.

5 of 21

Proposed Transformer-based architectures

  1. Vanilla Transformers
  2. Fact-aware Transformers
  3. Fact-aware Multi-Task Learning (MTL) Transformers

RoBERTa (Liu et al., 2019) for short sequences (less than 512 tokens)

Longformer (Beltagy et al., 2020) for very long sequences (up to 2048 tokens)

6 of 21

1. Vanilla Transformer

  • Feed the input text directly to a Transformer model.
  • Use the task-specific output vector for training on the different tasks (a minimal sketch follows).
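A minimal sketch of this setup, assuming the HuggingFace transformers library; the pooling choice (first-token vector) and head sizes are illustrative assumptions, not necessarily the paper's exact configuration:

```python
# Vanilla Transformer coherence model: encode the document with
# RoBERTa and score it with a small task-specific head.
# Sketch only; hyperparameters are illustrative.
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class VanillaCoherenceModel(nn.Module):
    def __init__(self, num_outputs=1):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size
        # Task-specific head on top of the <s> (CLS-like) token vector.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_outputs))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        doc_vec = out.last_hidden_state[:, 0]  # first-token representation
        return self.head(doc_vec)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["John went to his favorite music store to buy a piano."],
                  return_tensors="pt", truncation=True, max_length=512)
score = VanillaCoherenceModel()(batch["input_ids"], batch["attention_mask"])
```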

7 of 21

2. Fact-Aware Transformers

  • Input = text + facts
    • Extract facts (<subject, verb, object> triples) using MinIE (Gashteovski et al., 2017).
  • 3 modules (see the wiring sketch after this list)
    • Document encoder: encodes text
    • Fact encoder
      • Encodes each fact separately.
      • Shares weights with document encoder
    • Fact-aware document encoder
      • Encodes output from document and fact encoders
      • Outputs final document representation.
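A rough wiring sketch of the three modules. The fusion step (a small Transformer over the document vector plus one vector per fact) is our assumption of how the encoder outputs might be combined; the paper's exact fusion layer may differ:

```python
# Fact-aware encoder sketch: the document and fact encoders share
# weights (a single RoBERTa instance); a small Transformer fuses the
# document vector with one vector per extracted fact.
import torch
import torch.nn as nn
from transformers import RobertaModel

class FactAwareEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared_encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.shared_encoder.config.hidden_size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def encode(self, input_ids, attention_mask):
        out = self.shared_encoder(input_ids=input_ids,
                                  attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # first-token vector

    def forward(self, doc_inputs, fact_inputs_list):
        # doc_inputs / each fact input: dicts with input_ids, attention_mask.
        doc_vec = self.encode(**doc_inputs)                  # (1, H)
        fact_vecs = [self.encode(**f) for f in fact_inputs_list]
        seq = torch.stack([doc_vec] + fact_vecs, dim=1)      # (1, 1+F, H)
        fused = self.fusion(seq)
        return fused[:, 0]  # fact-aware document representation
```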

8 of 21

3. Fact-Aware Multi-Task Learning (MTL) Transformers

  • Extension of the fact-aware Transformer-based method
  • Multi-task learning
    • Train for text coherence and Natural Language Inference (NLI) together

9 of 21

Combining the coherence-specific loss and the NLI-based task loss
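A minimal sketch of the joint objective, assuming a simple weighted sum of the two losses; the mixing weight `lam` is an illustrative assumption:

```python
# Joint MTL objective sketch: coherence loss plus weighted NLI loss.
import torch.nn.functional as F

def mtl_loss(coh_logits, coh_labels, nli_logits, nli_labels, lam=0.5):
    # Coherence-specific loss (cross entropy here, e.g. for the 3-way
    # classification task; sentence ordering uses a ranking loss instead).
    l_coh = F.cross_entropy(coh_logits, coh_labels)
    # NLI auxiliary loss: categorical cross entropy over
    # {entailment, neutral, contradiction}.
    l_nli = F.cross_entropy(nli_logits, nli_labels)
    return l_coh + lam * l_nli
```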

10 of 21

Text coherence evaluation tasks

  1. Sentence ordering:
    • Positive: every original document is assumed to be coherent.
    • Negatives: 20 random permutations (each different from the original document) of the sentences in the document; see the sampling sketch after this list.
    • Goal: rank the original document higher than the permuted ones.
  2. 3-way classification:
    • Classify a document into one of three coherence levels (high, medium, and low).
  3. Essay score prediction:
    • Automatically assign a score to a given essay.
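A sketch of how the sentence-ordering negatives described above could be sampled (the helper name is ours):

```python
# Sample k distinct permutations of a document's sentences, each
# different from the original order.
import math
import random

def make_negatives(sentences, k=20, seed=0):
    # A document with n sentences has at most n! - 1 permutations
    # that differ from the original.
    k = min(k, math.factorial(len(sentences)) - 1)
    rng = random.Random(seed)
    original, negatives = tuple(sentences), set()
    while len(negatives) < k:
        perm = sentences[:]
        rng.shuffle(perm)
        if tuple(perm) != original:
            negatives.add(tuple(perm))
    return [list(p) for p in negatives]
```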

11 of 21

Text coherence datasets

  • Wall Street Journal (WSJ) Dataset:
    • Part of the Penn Treebank (Elsner and Charniak, 2008; Nguyen and Joty, 2017)
    • Contains long articles without any constraint on style.
    • Used for sentence ordering task
    • Used sections 00-13 for training and 14-24 for testing (documents with just 1 sentence are removed).

Split | Document count | Avg. sentence count | Avg. word count | Synthetic document count
Train | 1376           | 21.0                | 529.8           | 29720
Test  | 1090           | 21.9                | 564.3           | 21800

12 of 21

Text coherence datasets

  • Grammarly Corpus of Discourse Coherence (GCDC) Dataset (Lai and Tetreault, 2018)
    • Contains texts from four domains: Yahoo online forum posts, emails from Hillary Clinton’s office, emails from Enron and Yelp business reviews
    • Each document is annotated with a coherence score.
    • Used for 3-way classification

Domain  | Document count | Avg. sentence count | Avg. word count | Low / Medium / High coherence (%)
Yahoo   | 1200           | 7.5                 | 162.1           | 46.6 / 17.4 / 37.0
Clinton | 1200           | 6.6                 | 189.0           | 28.2 / 20.6 / 51.1
Enron   | 1200           | 7.7                 | 196.2           | 29.9 / 19.4 / 50.7
Yelp    | 1200           | 7.5                 | 183.1           | 27.1 / 21.8 / 51.1

13 of 21

Text coherence datasets

  • Automated Student Assessment Prize (ASAP) dataset
    • Taken from the Kaggle competition
    • The essays are associated with scores given by humans
    • Essays are categorized into eight prompts based on essay topic and genre.
    • Used for Essay Score Prediction task

Prompt | Essay count | Genre         | Avg. word count | Range of scores
1      | 1783        | argumentative | 350             | 2-12
2      | 1800        | argumentative | 350             | 2-12
3      | 1726        | response      | 150             | 0-3
4      | 1772        | response      | 150             | 0-3
5      | 1805        | response      | 150             | 0-4
6      | 1800        | response      | 150             | 0-4
7      | 1569        | narrative     | 250             | 0-30
8      | 723         | narrative     | 650             | 0-60

14 of 21

Experimental setup

  • For all tasks except sentence ordering, we pass the document representation obtained from the proposed models to a dense layer with ReLU activation, which is then connected to a task-specific output layer.
  • Reported results are the mean of 10 runs with different random seeds.
  • For MTL based Transformers, categorical cross entropy loss was used for NLI.
  • For Longformer, we fixed maximum sequence length to 2048. For RoBERTa, we fixed it to 512.
  • For the sentence ordering task, we apply a Siamese network (Bromley et al., 1993), i.e., a twin neural network approach, with each of our proposed architectures (a minimal sketch follows).
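A minimal sketch of the Siamese ranking step, assuming a margin ranking loss over the scores of the original and permuted documents; the margin value is illustrative:

```python
# Siamese ranking step: one model (shared weights) scores both the
# original and the permuted document; the original should score higher.
import torch
import torch.nn as nn

margin_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_step(model, pos_batch, neg_batch):
    s_pos = model(**pos_batch).squeeze(-1)   # scores for original docs
    s_neg = model(**neg_batch).squeeze(-1)   # scores for permuted docs
    target = torch.ones_like(s_pos)          # "s_pos should exceed s_neg"
    return margin_loss(s_pos, s_neg, target)
```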

15 of 21

Results: Sentence ordering on WSJ

  • Pairwise ranking accuracy (PRA); see the computation sketch after the table
    • Fact-aware Transformer > Vanilla Transformer
    • Fact-aware MTL model > other variants

Models                                   | PRA

Baselines
LC [Li and Hovy, 2014]                   | 74.10
PARSEQ [Lai and Tetreault, 2018]         | 74.10
Seq2Seq [Li and Jurafsky, 2017]          | 86.95
CNN-Egrid [Mohiuddin et al., 2018]       | 88.69
Unified (ELMo) [Moon et al., 2019]       | 93.19
Coh+GR [Farag and Yannakoudakis, 2019]   | 93.20
LCD-L [Xu et al., 2019]                  | 95.49
Coh+GR_BERT [Farag et al., 2020]         | 96.10
LCD_BERT [Farag et al., 2020]            | 97.10

Ours
Vanilla Transformer                      | 97.34
Fact-aware Transformer                   | 97.81
Fact-aware MTL Transformer               | 98.22
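PRA is the fraction of (original, permuted) document pairs for which the original receives the higher coherence score; a minimal sketch:

```python
# Pairwise ranking accuracy over (score_original, score_permuted) pairs.
def pairwise_ranking_accuracy(pairs):
    wins = sum(1 for s_pos, s_neg in pairs if s_pos > s_neg)
    return wins / len(pairs)
```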

16 of 21

Results: 3-way classification on GCDC

  • 3-way classification accuracy
  • Fact-aware model > vanilla model across all the domains
    • transitions of facts is important
  • Classifying documents of medium level coherence is hard
    • Less samples.

Model                                             | Yahoo | Clinton | Enron | Yelp | Average

Baselines
Flesch-Kincaid grade level [Kincaid et al., 1975] | 43.5  | 56.0    | 52.5  | 55.0 | 51.8
Coh+SOX [Farag and Yannakoudakis, 2019]           | 50.5  | 58.5    | 51.0  | -    | 53.3
Hierarchical LSTM [Farag and Yannakoudakis, 2019] | 55.0  | 59.0    | 50.5  | -    | 54.8
PARSEQ [Lai and Tetreault, 2018]                  | 54.9  | 60.2    | 53.2  | 54.4 | 55.7
LC [Li and Hovy, 2014]                            | 53.5  | 61.0    | 54.4  | -    | 56.3
PARSEQ (all) [Lai and Tetreault, 2018]            | 58.5  | 61.0    | 53.9  | 56.5 | 57.5
Coh+GR [Farag and Yannakoudakis, 2019]            | 56.0  | 62.0    | 56.0  | -    | 58.0
Incremental-lex-coh [Jeon et al., 2020]           | 57.3  | 61.3    | 54.5  | 59.0 | 58.1
Avg-RoBERTa-Doc [Jeon et al., 2020]               | 60.0  | 65.3    | 55.0  | 58.8 | 59.8
Avg-XLNET-Doc [Jeon et al., 2020]                 | 60.5  | 65.9    | 56.9  | 59.0 | 60.6

Ours
Vanilla Transformer (all)                         | 58.1  | 63.9    | 55.3  | 57.6 | 58.7
Fact-aware Transformer                            | 59.2  | 67.2    | 56.3  | 58.5 | 60.3
Fact-aware MTL Transformer                        | 60.7  | 67.4    | 56.4  | 59.0 | 60.8

17 of 21

Results: Automated Essay Scoring (ASAP)

  • Quadratic weighted kappa (QWK) measure; see the computation sketch after the table
  • EASE
    • Ranked third amongst 154 participants in the ASAP competition
    • Uses hand-crafted features with SVR and Bayesian Linear Ridge Regression (BLRR)
  • Constraint MTL: constrained multi-task pairwise-preference learning approach
  • Attention-based RCNN: uses a hierarchical sentence-document model to represent essays, with an attention mechanism to learn the relative importance of words and sentences
  • SkipFlow: models the similarity between multiple hidden states of an LSTM over time within a bounded window

Models                                    | P1    | P2    | P3    | P4    | P5    | P6    | P7    | P8    | Average

Baselines
CohLSTM [Mesgar et al., 2018]             | 0.669 | 0.634 | 0.591 | 0.710 | 0.639 | 0.716 | 0.729 | 0.641 | 0.666
EASE (SVR)                                | 0.781 | 0.630 | 0.621 | 0.749 | 0.782 | 0.771 | 0.727 | 0.534 | 0.699
EASE (BLRR)                               | 0.761 | 0.606 | 0.621 | 0.742 | 0.784 | 0.775 | 0.730 | 0.617 | 0.705
EASE+CohLSTM [Mesgar et al., 2018]        | 0.784 | 0.654 | 0.663 | 0.788 | 0.793 | 0.794 | 0.756 | 0.646 | 0.735
Constraint MTL [Cummins et al., 2016]     | 0.816 | 0.667 | 0.654 | 0.783 | 0.801 | 0.778 | 0.787 | 0.692 | 0.747
Attention-based RCNN [Dong et al., 2017]  | 0.822 | 0.682 | 0.672 | 0.814 | 0.803 | 0.811 | 0.801 | 0.705 | 0.764
SkipFlow [Tay et al., 2018]               | 0.832 | 0.684 | 0.695 | 0.788 | 0.815 | 0.810 | 0.800 | 0.697 | 0.765

Ours
Longformer                                | 0.824 | 0.660 | 0.693 | 0.820 | 0.795 | 0.810 | 0.817 | 0.701 | 0.765
Longformer + Fact-aware MTL Transformer   | 0.822 | 0.674 | 0.696 | 0.821 | 0.798 | 0.812 | 0.822 | 0.699 | 0.768
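QWK is Cohen's kappa with quadratic weights; a minimal sketch using scikit-learn:

```python
# Quadratic weighted kappa between human and predicted essay scores.
from sklearn.metrics import cohen_kappa_score

def qwk(human_scores, predicted_scores):
    return cohen_kappa_score(human_scores, predicted_scores,
                             weights="quadratic")
```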

18 of 21

Qualitative analysis

  • Similar to (Li and Jurafsky, 2017), we examine the coherence scores assigned to some artificial miniature discourses.
  • Scores range from 1 to 3.

Lexical coherence

Text inputs                                        | Vanilla | MTL
Mary ate some apples. She likes apples.            | 1.45    | 2.00
Mary ate some apples. She likes pears.             | 1.37    | 1.84
Mary ate some apples. She likes Paris.             | 1.26    | 1.52
Pinochet was arrested. His arrest was unexpected.  | 1.81    | 2.76
Pinochet was arrested. His death was unexpected.   | 1.67    | 1.56

19 of 21

Qualitative analysis: Temporal order

Text inputs | Vanilla | Ours
Washington was unanimously elected president in the first two national elections. He oversaw the creation of a strong, well-financed national government. | 1.93 | 2.79
Washington oversaw the creation of a strong, well-financed national government. He was unanimously elected president in the first two national elections. | 1.88 | 2.36

Qualitative analysis: Centering/Referential coherence

Text inputs | Vanilla | Ours
John went to his favorite music store to buy a piano. He had frequented the store for many years. He was excited that he could finally buy a piano. He arrived just as the store was closing for the day. | 2.38 | 2.86
John went to his favorite music store to buy a piano. It was a store John had frequented for many years. He was excited that he could finally buy a piano. It was closing just as John arrived. | 2.45 | 2.67

20 of 21

Take-aways

  • Proposed a fact-aware MTL model for text coherence assessment.
    • Text+Facts
    • Coherence+MTL
  • Works for synthetic data (WSJ) as well as real-world data (GCDC).
  • Improves automated essay scoring.

  • Future: Text coherence in an open domain setting?

21 of 21

THANK YOU