Conversation Modeling to Predict Derailment
Jiaqing Yuan and Munindar P. Singh
ICWSM 2023
Presenter: Ali Behrouz
Motivations
Why do we want to predict derailment?
Online social platforms provide great opportunities for users to discuss constructively.
Antisocial behaviors, such as personal attacks, damage healthy online communities.
Detection after occurrence: the damage is already done!
Goal: provide an early warning of a conversation's potential derailment.
Challenges
What makes it hard to predict derailment?
Dynamics: There are complex dynamics at the levels of both the utterance and the conversation.
Length: The number of utterances that will occur in a conversation is unknown.
Complexity: The total length of a tokenized conversation produced by concatenating all utterances can exceed the maximum input length limit of deep learning methods.
Limitations of Existing Methods
What do existing methods miss for effectively predicting derailment?
Hand-crafted features to model a conversation.
Limited to the first 80 tokens of each utterance!
Solely rely on textual semantics and disregard information such as conversational structure.
Research Questions
What are the questions they aim to answer for derailment prediction?
Is it effective to leverage pretrained language models for conversation modeling tasks, and in what way?
How can we leverage information inherent in a conversation, such as the distance from each utterance to the derailing utterance, to enhance prediction?
Does conversation structure matter for derailment prediction, and how do we integrate it into the model?
Leverage a pretrained language model to design a hierarchical transformer model that encodes utterance- and conversation-level information.
Use a multitask learning scheme, with the distance from each utterance to the derailing utterance as an auxiliary training objective.
Take advantage of the inherent utterance structure, as captured by the “reply-to” attribute of each utterance.
Problem Formulation
Notations and setup.
Each conversation is a sequence of utterances: C = (u_1, u_2, ..., u_n).
Each utterance is a sequence of words: u_i = (w_{i,1}, w_{i,2}, ..., w_{i,m_i}).
A data sample can be represented as a tuple: (C, y), where y is the binary derailment label.
Task: predicting the possibility of derailment for ongoing, so-far-civil conversations.
How likely a civil conversation is to lead to a personal attack as it develops.
Label: positive (derails) or negative (stays civil).
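A worked toy instance of this formulation (the notation here is a reconstruction, not verbatim from the paper):

% A three-utterance conversation whose final utterance contains a
% personal attack, so the derailment label is positive (y = 1).
\[
  C = (u_1, u_2, u_3), \qquad
  u_i = (w_{i,1}, \ldots, w_{i,m_i}), \qquad
  (C, y) \ \text{with} \ y = 1.
\]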
Utterance-Level Encoder
How to encode utterances?
RoBERTa-base improves over BERT by employing dynamic masking.
Append special tokens [CLS] at the front and [SEP] at the end.
Add a pretrained positional embedding to each token (explained later!).
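A minimal sketch of this utterance encoding with Hugging Face Transformers (the pooling and truncation choices are assumptions, not the authors' code):

# Encode each utterance with roberta-base and take the first-token
# embedding (RoBERTa's <s>, the counterpart of [CLS]) as the utterance vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_utterance(text: str) -> torch.Tensor:
    # The tokenizer adds the special tokens and position ids automatically.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)

utterance_vec = encode_utterance("I think this edit improves the article.")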
Conversation-Level Encoder
Given utterance encodings, how to encode conversations?
Use Transformer layers to encode each conversation.
One fully connected linear layer for the binary classification head.
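A minimal sketch of the conversation-level encoder (layer count, pooling, and head sizes are illustrative assumptions):

# A Transformer encoder over the sequence of utterance vectors, followed
# by one fully connected linear layer as the binary classification head.
import torch
import torch.nn as nn

class ConversationEncoder(nn.Module):
    def __init__(self, d_model: int = 768, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # derails vs. stays civil

    def forward(self, utterance_vecs: torch.Tensor) -> torch.Tensor:
        # utterance_vecs: (batch, num_utterances, d_model)
        encoded = self.encoder(utterance_vecs)
        pooled = encoded.mean(dim=1)  # mean pooling is an assumption
        return self.head(pooled)

logits = ConversationEncoder()(torch.randn(4, 6, 768))  # 4 conversations, 6 utterances each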
Multitask Training with Distance to Derailment
Given utterance encodings, how to provide (almost) real-time warning for derailment?
Existing methods apply a static training strategy, where the model is trained only with full sequences up to the derailing utterance.
The distance from each civil utterance to the derailing utterance could provide additional cues for the model to learn.
Given a sample ((u_1, ..., u_n), y):
Replace it with the prefix sub-samples ((u_1), y, d_1), ((u_1, u_2), y, d_2), ..., ((u_1, ..., u_{n-1}), y, d_{n-1}), where d_k = n − k is the distance to the derailing utterance u_n.
Train the model for the regression task, where the targets are distances to derailment.
Loss function (a weighted combination of the two objectives): L = L_cls + λ · L_reg.
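A hedged sketch of this multitask setup (the prefix expansion follows the slide; the loss weighting λ is an assumed hyperparameter):

# Expand one conversation into prefix sub-samples labeled with both the
# derailment label and the distance to the derailing utterance, then
# combine a classification loss with a distance-regression loss.
import torch
import torch.nn as nn

def prefix_samples(utterances: list, label: int):
    # For (u_1, ..., u_n), yield ((u_1, ..., u_k), label, d_k) with d_k = n - k.
    n = len(utterances)
    for k in range(1, n):
        yield utterances[:k], label, n - k

cls_loss = nn.CrossEntropyLoss()
reg_loss = nn.MSELoss()
lam = 0.5  # assumed task-weighting hyperparameter

def multitask_loss(cls_logits, reg_pred, label, distance):
    # cls_logits: (batch, 2); label: (batch,); reg_pred, distance: (batch,)
    return cls_loss(cls_logits, label) + lam * reg_loss(reg_pred, distance)

# Example: a 4-utterance derailing conversation yields three sub-samples.
for prefix, y, d in prefix_samples(["u1", "u2", "u3", "u4"], label=1):
    print(len(prefix), y, d)  # -> (1, 1, 3), (2, 1, 2), (3, 1, 1)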
Conversation Structure Pretraining
How can we use the information provided by the “reply-to” relation between utterances?
Pretraining on the tree structure induced by the “reply-to” links.
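A minimal sketch of recovering the conversation tree from the “reply-to” attribute (the field names "id" and "reply_to" are assumptions for illustration):

# Build the tree induced by "reply-to" links: map each utterance id to
# the ids of its direct replies.
from collections import defaultdict

def build_reply_tree(utterances):
    children = defaultdict(list)
    for u in utterances:
        if u.get("reply_to") is not None:
            children[u["reply_to"]].append(u["id"])
    return dict(children)

conv = [
    {"id": "a", "reply_to": None},  # root utterance
    {"id": "b", "reply_to": "a"},
    {"id": "c", "reply_to": "a"},
    {"id": "d", "reply_to": "b"},
]
print(build_reply_tree(conv))  # {'a': ['b', 'c'], 'b': ['d']}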
Experimental Setup
What are the used datasets?
Wikipedia Talk Pages (WTP):
Labelled by a classifier that assigns each utterance a toxicity score from 0 to 1.
Civil Conversations: all utterances have a toxicity score below 0.4.
Toxic Conversations: the first two utterances are civil, but a later comment has a toxicity score above 0.6 (see the sketch after this list).
Reddit ChangeMyView (CMV):
Labelled by the actions of the moderators.
Toxic Conversations: conversations that end with a comment deleted by the moderators.
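A hedged sketch of the WTP labeling rule above (the 0.4 and 0.6 thresholds come from the slide; the function shape is an assumption):

# Label a conversation from its per-utterance toxicity scores: civil if
# every utterance scores below 0.4; toxic if the first two are civil but
# some later utterance scores above 0.6.
def label_wtp(toxicity_scores):
    if all(s < 0.4 for s in toxicity_scores):
        return "civil"
    if all(s < 0.4 for s in toxicity_scores[:2]) and any(
        s > 0.6 for s in toxicity_scores[2:]
    ):
        return "toxic"
    return None  # neither rule applies; assumed to be excluded

print(label_wtp([0.1, 0.2, 0.15]))  # civil
print(label_wtp([0.1, 0.2, 0.7]))   # toxic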
Results
Which architecture is better?
The full architecture of the proposed model outperforms the baselines in both accuracy and precision.
The superior performance over Hierarchical-Base shows the importance of the regression task.
The superior performance over Hierarchical-Multi shows the importance of structural information.
Together, these results show the importance of using pretrained language models and dynamic training.
Research Questions (Recall)
What are the questions they aim to answer for derailment prediction?
Is it effective to leverage pretrained language models for conversation modeling tasks, and in what way?
How can we leverage information inherent in a conversation, such as the distance from each utterance to the derailing utterance, to enhance prediction?
Does conversation structure matter for derailment prediction, and how do we integrate it into the model?
Leverage a pretrained language model to design a hierarchical transformer model that encodes utterance- and conversation-level information.
Use a multitask learning scheme, with the distance from each utterance to the derailing utterance as an auxiliary training objective.
Take advantage of the inherent utterance structure, as captured by the “reply-to” attribute of each utterance.
The results answer all three questions in the positive.
Results
Recall the loss function: L = L_cls + λ · L_reg.
On the WTP dataset, performance peaks at a certain point: the model learns better as more utterances are observed.
Results
How Early is the Warning?
The table reports the distance, in utterances, between the warning and the derailment.
Around 80% of warnings are issued when fewer than five utterances have been seen by the model.
Future Work
What are the possible improvements?
Transformers are computationally expensive, which limits the conversation length that can be processed.
The history of each user is ignored!
Coming back to sequential encoders!
Encode each user based on their historical actions.
Positional encoding is limited!
Using Graph Neural Networks to encode the structure of the conversation.
Unmentioned tables!
Make sure to cite all figures and tables!
Thank You!