1 of 21

Fact Aware Multi-Task Learning for Text Coherence Modeling

Tushar Abhishek, Daksh Rawat, Manish Gupta, Vasudeva Varma

tushar.abhishek@research.iiit.ac.in, daksh.rawat@students.iiit.ac.in, manish.gupta@iiit.ac.in, vv@iiit.ac.in

2 of 21

What is textual coherence?

  • Information flow doesn’t break while transitioning from one sentence to the next
  • Helps the reader understand the text/paragraph as a whole rather than as a series of disparate sentences.
  • Vital for the success of NLG systems (summarization, question answering, question generation, …)

Coherent Text

  1. John went to his favorite music store to buy a piano.
  2. He had frequented the store for many years.
  3. He was excited that he could finally buy a piano.
  4. He arrived just as the store was closing for the day.

Incoherent Text

  1. John went to his favorite music store to buy a piano.
  2. It was a store John had frequented for many years.
  3. He was excited that he could finally buy a piano.
  4. It was closing just as John arrived.

3 of 21

Previous work

  • EGRID: builds an entity grid, a matrix that tracks entity mentions over sentences; a random forest classifier is trained over features extracted from the grid.
  • CNN-Egrid: a local coherence model that employs a CNN operating over the entity grid representation.
  • PARSEQ: three stacked LSTMs to represent sentence, paragraph, and document.
  • Hierarchical LSTM: similar to PARSEQ, but uses attentional BiLSTMs.
  • Local Coherence Discriminator (LCD-L): max-pools the hidden states of a language model to get sentence representations; representations of two consecutive sentences are concatenated and fed to a dense layer.
  • LCD_BERT: same as LCD-L, but uses averaged BERT embeddings (instead of GloVe) as the sentence representations.
  • LCD_RoBERTa: same as LCD-L, but uses RoBERTa embeddings.
  • LC (Local Coherence): sentences are encoded with a recurrent or recursive layer; a filter of weights is applied over each window of sentence vectors to extract scores, which are aggregated into an overall document coherence score.

4 of 21

Previous work

  • Coh+GR: extends the Hierarchical LSTM by training it to predict word-level labels indicating the predicted grammatical role (GR) type, along with the document-level coherence score.
  • Coh+GR_BERT: similar to Coh+GR, but BERT embeddings are used instead of GloVe as input to the BiLSTMs.
  • Coh+SOX: same as Coh+GR, except that for each word we only predict subject (S), object (O), and 'other' (X) roles.
  • Seq2Seq: trains two LSTM generative language models and scores coherence using the difference between the conditional log-likelihood of a sentence given its preceding/succeeding context and the marginal log-likelihood of that sentence.
  • Unified: uses a combination of LSTMs and CNNs.
  • Inc-lex-Coh: extracts sentence representations using a pretrained language model and combines the semantic centroid vector with a semantic similarity vector to obtain the coherence output.
  • Avg-XLNET-Doc: encodes the text content at the document level and averages the encoded representations.
  • Avg-RoBERTa-Doc: same as Avg-XLNET-Doc, but uses RoBERTa embeddings instead of XLNET.

5 of 21

Proposed Transformer-based architectures

  1. Vanilla Transformers
  2. Fact-aware Transformers
  3. Fact-aware Multi-Task Learning (MTL) Transformers

RoBERTa (Liu et al., 2019) for short sequences (less than 512 tokens)

Longformer (Beltagy et al., 2020) for very long sequences (up to 2048 tokens)

6 of 21

1. Vanilla Transformer

  • Feed the input text directly to a Transformer model.
  • Use the task-specific output vector for training on the different tasks (a minimal sketch follows).
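A minimal sketch of this setup, assuming the HuggingFace transformers library; the pooling choice (first-token vector) and head sizes are illustrative assumptions, not necessarily the paper's exact configuration:

```python
# Vanilla Transformer coherence model: encode the document with
# RoBERTa and score it with a small task-specific head.
# Sketch only; hyperparameters are illustrative.
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class VanillaCoherenceModel(nn.Module):
    def __init__(self, num_outputs=1):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size
        # Task-specific head on top of the <s> (CLS-like) token vector.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_outputs))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        doc_vec = out.last_hidden_state[:, 0]  # first-token representation
        return self.head(doc_vec)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["John went to his favorite music store to buy a piano."],
                  return_tensors="pt", truncation=True, max_length=512)
score = VanillaCoherenceModel()(batch["input_ids"], batch["attention_mask"])
```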

7 of 21

2. Fact-Aware Transformers

  • Input = text + facts
    • Extract facts (<subject, verb, object> triples) using MinIE (Gashteovski et al., 2017).
  • 3 modules (see the wiring sketch after this list)
    • Document encoder: encodes text
    • Fact encoder
      • Encodes each fact separately.
      • Shares weights with document encoder
    • Fact-aware document encoder
      • Encodes output from document and fact encoders
      • Outputs final document representation.
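A rough wiring sketch of the three modules. The fusion step (a small Transformer over the document vector plus one vector per fact) is our assumption of how the encoder outputs might be combined; the paper's exact fusion layer may differ:

```python
# Fact-aware encoder sketch: the document and fact encoders share
# weights (a single RoBERTa instance); a small Transformer fuses the
# document vector with one vector per extracted fact.
import torch
import torch.nn as nn
from transformers import RobertaModel

class FactAwareEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared_encoder = RobertaModel.from_pretrained("roberta-base")
        hidden = self.shared_encoder.config.hidden_size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def encode(self, input_ids, attention_mask):
        out = self.shared_encoder(input_ids=input_ids,
                                  attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # first-token vector

    def forward(self, doc_inputs, fact_inputs_list):
        # doc_inputs / each fact input: dicts with input_ids, attention_mask.
        doc_vec = self.encode(**doc_inputs)                  # (1, H)
        fact_vecs = [self.encode(**f) for f in fact_inputs_list]
        seq = torch.stack([doc_vec] + fact_vecs, dim=1)      # (1, 1+F, H)
        fused = self.fusion(seq)
        return fused[:, 0]  # fact-aware document representation
```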

8 of 21

3. Fact-Aware Multi-Task Learning (MTL) Transformers

  • Extension of the fact-aware Transformer-based method
  • Multi-task learning
    • Train for text coherence and Natural Language Inference (NLI) together

9 of 21

Combining the coherence-specific loss and the NLI-based task loss
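A minimal sketch of the joint objective, assuming a simple weighted sum of the two losses; the mixing weight `lam` is an illustrative assumption:

```python
# Joint MTL objective sketch: coherence loss plus weighted NLI loss.
import torch.nn.functional as F

def mtl_loss(coh_logits, coh_labels, nli_logits, nli_labels, lam=0.5):
    # Coherence-specific loss (cross entropy here, e.g. for the 3-way
    # classification task; sentence ordering uses a ranking loss instead).
    l_coh = F.cross_entropy(coh_logits, coh_labels)
    # NLI auxiliary loss: categorical cross entropy over
    # {entailment, neutral, contradiction}.
    l_nli = F.cross_entropy(nli_logits, nli_labels)
    return l_coh + lam * l_nli
```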

10 of 21

Text coherence evaluation tasks

  1. Sentence ordering:
    • Positive: every original document is assumed to be coherent.
    • Negatives: 20 random permutations (each different from the original document) of the sentences in the document; see the sampling sketch after this list.
    • Goal: rank the original document higher than the permuted ones.
  2. 3-way classification:
    • Classify a document into one of three coherence levels (high, medium, and low).
  3. Essay score prediction:
    • Automatically assign a score to a given essay.
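A sketch of how the sentence-ordering negatives described above could be sampled (the helper name is ours):

```python
# Sample k distinct permutations of a document's sentences, each
# different from the original order.
import math
import random

def make_negatives(sentences, k=20, seed=0):
    # A document with n sentences has at most n! - 1 permutations
    # that differ from the original.
    k = min(k, math.factorial(len(sentences)) - 1)
    rng = random.Random(seed)
    original, negatives = tuple(sentences), set()
    while len(negatives) < k:
        perm = sentences[:]
        rng.shuffle(perm)
        if tuple(perm) != original:
            negatives.add(tuple(perm))
    return [list(p) for p in negatives]
```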

11 of 21

Text coherence datasets

  • Wall Street Journal (WSJ) Dataset:
    • Part of the Penn Treebank (Elsner and Charniak, 2008; Nguyen and Joty, 2017)
    • Contains long articles without any constraint on style.
    • Used for sentence ordering task
    • Used sections 00-13 for training and 14-24 for testing (documents with just 1 sentence are removed).

Split | Document count | Avg. sentence count | Avg. word count | Synthetic document count
Train | 1376           | 21.0                | 529.8           | 29720
Test  | 1090           | 21.9                | 564.3           | 21800

12 of 21

Text coherence datasets

  • Grammarly Corpus of Discourse Coherence (GCDC) Dataset (Lai and Tetreault, 2018)
    • Contains texts from four domains: Yahoo online forum posts, emails from Hillary Clinton’s office, emails from Enron and Yelp business reviews
    • Each document is annotated with a coherence score.
    • Used for 3-way classification

Domain  | Document count | Avg. sentence count | Avg. word count | Low / Medium / High coherence (%)
Yahoo   | 1200           | 7.5                 | 162.1           | 46.6 / 17.4 / 37.0
Clinton | 1200           | 6.6                 | 189.0           | 28.2 / 20.6 / 51.1
Enron   | 1200           | 7.7                 | 196.2           | 29.9 / 19.4 / 50.7
Yelp    | 1200           | 7.5                 | 183.1           | 27.1 / 21.8 / 51.1

13 of 21

Text coherence datasets

  • Automated Student Assessment Prize (ASAP) dataset
    • Taken from the Kaggle competition
    • The essays are associated with scores given by humans
    • Essays are categorized into eight prompts based on essay topic and genre.
    • Used for Essay Score Prediction task

Prompt | Essay count | Genre         | Avg. word count | Range of scores
1      | 1783        | argumentative | 350             | 2-12
2      | 1800        | argumentative | 350             | 2-12
3      | 1726        | response      | 150             | 0-3
4      | 1772        | response      | 150             | 0-3
5      | 1805        | response      | 150             | 0-4
6      | 1800        | response      | 150             | 0-4
7      | 1569        | narrative     | 250             | 0-30
8      | 723         | narrative     | 650             | 0-60

14 of 21

Experimental setup

  • For all tasks except sentence ordering, we pass the document representation obtained from the proposed models to a dense layer with ReLU activation, which is then connected to a task-specific output layer.
  • Reported results are the mean of 10 runs with different random seeds.
  • For MTL based Transformers, categorical cross entropy loss was used for NLI.
  • For Longformer, we fixed maximum sequence length to 2048. For RoBERTa, we fixed it to 512.
  • For the sentence ordering task, we apply a Siamese network (Bromley et al., 1993), i.e., a twin neural network approach, with each of our proposed architectures (a minimal sketch follows).
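A minimal sketch of the Siamese ranking step, assuming a margin ranking loss over the scores of the original and permuted documents; the margin value is illustrative:

```python
# Siamese ranking step: one model (shared weights) scores both the
# original and the permuted document; the original should score higher.
import torch
import torch.nn as nn

margin_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_step(model, pos_batch, neg_batch):
    s_pos = model(**pos_batch).squeeze(-1)   # scores for original docs
    s_neg = model(**neg_batch).squeeze(-1)   # scores for permuted docs
    target = torch.ones_like(s_pos)          # "s_pos should exceed s_neg"
    return margin_loss(s_pos, s_neg, target)
```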

15 of 21

Results: Sentence ordering on WSJ

  • Pairwise ranking accuracy (PRA); see the computation sketch after the table
    • Fact-aware Transformer > Vanilla Transformer
    • Fact-aware MTL model > other variants

Models                                   | PRA

Baselines
LC [Li and Hovy, 2014]                   | 74.10
PARSEQ [Lai and Tetreault, 2018]         | 74.10
Seq2Seq [Li and Jurafsky, 2017]          | 86.95
CNN-Egrid [Mohiuddin et al., 2018]       | 88.69
Unified (ELMo) [Moon et al., 2019]       | 93.19
Coh+GR [Farag and Yannakoudakis, 2019]   | 93.20
LCD-L [Xu et al., 2019]                  | 95.49
Coh+GR_BERT [Farag et al., 2020]         | 96.10
LCD_BERT [Farag et al., 2020]            | 97.10

Ours
Vanilla Transformer                      | 97.34
Fact-aware Transformer                   | 97.81
Fact-aware MTL Transformer               | 98.22
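PRA is the fraction of (original, permuted) document pairs for which the original receives the higher coherence score; a minimal sketch:

```python
# Pairwise ranking accuracy over (score_original, score_permuted) pairs.
def pairwise_ranking_accuracy(pairs):
    wins = sum(1 for s_pos, s_neg in pairs if s_pos > s_neg)
    return wins / len(pairs)
```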

16 of 21

Results: 3-way classification on GCDC

  • 3-way classification accuracy
  • Fact-aware model > vanilla model across all the domains
    • transitions of facts is important
  • Classifying documents of medium level coherence is hard
    • Less samples.

Model                                             | Yahoo | Clinton | Enron | Yelp | Average

Baselines
Flesch-Kincaid grade level [Kincaid et al., 1975] | 43.5  | 56.0    | 52.5  | 55.0 | 51.8
Coh+SOX [Farag and Yannakoudakis, 2019]           | 50.5  | 58.5    | 51.0  | -    | 53.3
Hierarchical LSTM [Farag and Yannakoudakis, 2019] | 55.0  | 59.0    | 50.5  | -    | 54.8
PARSEQ [Lai and Tetreault, 2018]                  | 54.9  | 60.2    | 53.2  | 54.4 | 55.7
LC [Li and Hovy, 2014]                            | 53.5  | 61.0    | 54.4  | -    | 56.3
PARSEQ (all) [Lai and Tetreault, 2018]            | 58.5  | 61.0    | 53.9  | 56.5 | 57.5
Coh+GR [Farag and Yannakoudakis, 2019]            | 56.0  | 62.0    | 56.0  | -    | 58.0
Incremental-lex-coh [Jeon et al., 2020]           | 57.3  | 61.3    | 54.5  | 59.0 | 58.1
Avg-RoBERTa-Doc [Jeon et al., 2020]               | 60.0  | 65.3    | 55.0  | 58.8 | 59.8
Avg-XLNET-Doc [Jeon et al., 2020]                 | 60.5  | 65.9    | 56.9  | 59.0 | 60.6

Ours
Vanilla Transformer (all)                         | 58.1  | 63.9    | 55.3  | 57.6 | 58.7
Fact-aware Transformer                            | 59.2  | 67.2    | 56.3  | 58.5 | 60.3
Fact-aware MTL Transformer                        | 60.7  | 67.4    | 56.4  | 59.0 | 60.8

17 of 21

Results: Automated Essay Scoring (ASAP)

  • Quadratic weighted kappa (QWK) measure; see the computation sketch after the table
  • EASE
    • Ranked third amongst 154 participants in the ASAP competition
    • Uses hand-crafted features with SVR and Bayesian Linear Ridge Regression (BLRR)
  • Constraint MTL: constrained multi-task pairwise-preference learning approach
  • Attention-based RCNN: uses a hierarchical sentence-document model to represent essays, with an attention mechanism to learn the relative importance of words and sentences
  • SkipFlow: models the similarity between multiple hidden states of an LSTM over time within a bounded window

Models                                    | P1    | P2    | P3    | P4    | P5    | P6    | P7    | P8    | Average

Baselines
CohLSTM [Mesgar et al., 2018]             | 0.669 | 0.634 | 0.591 | 0.710 | 0.639 | 0.716 | 0.729 | 0.641 | 0.666
EASE (SVR)                                | 0.781 | 0.630 | 0.621 | 0.749 | 0.782 | 0.771 | 0.727 | 0.534 | 0.699
EASE (BLRR)                               | 0.761 | 0.606 | 0.621 | 0.742 | 0.784 | 0.775 | 0.730 | 0.617 | 0.705
EASE+CohLSTM [Mesgar et al., 2018]        | 0.784 | 0.654 | 0.663 | 0.788 | 0.793 | 0.794 | 0.756 | 0.646 | 0.735
Constraint MTL [Cummins et al., 2016]     | 0.816 | 0.667 | 0.654 | 0.783 | 0.801 | 0.778 | 0.787 | 0.692 | 0.747
Attention-based RCNN [Dong et al., 2017]  | 0.822 | 0.682 | 0.672 | 0.814 | 0.803 | 0.811 | 0.801 | 0.705 | 0.764
SkipFlow [Tay et al., 2018]               | 0.832 | 0.684 | 0.695 | 0.788 | 0.815 | 0.810 | 0.800 | 0.697 | 0.765

Ours
Longformer                                | 0.824 | 0.660 | 0.693 | 0.820 | 0.795 | 0.810 | 0.817 | 0.701 | 0.765
Longformer + Fact-aware MTL Transformer   | 0.822 | 0.674 | 0.696 | 0.821 | 0.798 | 0.812 | 0.822 | 0.699 | 0.768
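QWK is Cohen's kappa with quadratic weights; a minimal sketch using scikit-learn:

```python
# Quadratic weighted kappa between human and predicted essay scores.
from sklearn.metrics import cohen_kappa_score

def qwk(human_scores, predicted_scores):
    return cohen_kappa_score(human_scores, predicted_scores,
                             weights="quadratic")
```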

18 of 21

Qualitative analysis

  • Similar to (Li and Jurafsky, 2017), we examine the coherence scores assigned to some artificial miniature discourses.
  • Scores range from 1 to 3.

Lexical coherence

Text inputs                                        | Vanilla | MTL
Mary ate some apples. She likes apples.            | 1.45    | 2.00
Mary ate some apples. She likes pears.             | 1.37    | 1.84
Mary ate some apples. She likes Paris.             | 1.26    | 1.52
Pinochet was arrested. His arrest was unexpected.  | 1.81    | 2.76
Pinochet was arrested. His death was unexpected.   | 1.67    | 1.56

19 of 21

Qualitative analysis: Temporal order

Text inputs | Vanilla | Ours
Washington was unanimously elected president in the first two national elections. He oversaw the creation of a strong, well-financed national government. | 1.93 | 2.79
Washington oversaw the creation of a strong, well-financed national government. He was unanimously elected president in the first two national elections. | 1.88 | 2.36

Qualitative analysis: Centering/Referential coherence

Text inputs | Vanilla | Ours
John went to his favorite music store to buy a piano. He had frequented the store for many years. He was excited that he could finally buy a piano. He arrived just as the store was closing for the day. | 2.38 | 2.86
John went to his favorite music store to buy a piano. It was a store John had frequented for many years. He was excited that he could finally buy a piano. It was closing just as John arrived. | 2.45 | 2.67

20 of 21

Take-aways

  • Proposed a fact-aware MTL model for text coherence assessment.
    • Text+Facts
    • Coherence+MTL
  • Works for synthetic data (WSJ) as well as real-world data (GCDC).
  • Improves automated essay scoring.

  • Future: Text coherence in an open domain setting?

21 of 21

THANK YOU