1 of 58

B. Tech. Project

TOWARDS TARGET-AWARE TWITTER STANCE DETECTION

17CS10006: Ayush Kaushal

Prof. Niloy Ganguly (Supervisor)

2 of 58

Introduction

NAACL-HLT 2021

Introduction

Spurious Cues

Target-Aware Stance

Conclusion

Part of the work in this thesis will appear in the main track of NAACL-HLT 2021.

tWT–WT: A Dataset to Assert the Role of Target Entities for Detecting Stance of Tweets

- Ayush Kaushal, Avirup Saha and Niloy Ganguly

Preprint

  • Target-oblivious classifiers can deliver impressive performance.
  • Datasets contain sentiment-stance and lexical spurious cues.
  • Proposed the target-aware tWT–WT (targeted WT–WT) dataset.

Abstract

3 of 58

Introduction

Stance Detection


* Text portion of the Tweet example taken from SemEval 2016 task 6 dataset

4 of 58

Introduction


Applications of Stance Detection

Analysing Debates

Sentiment Analysis

Detecting Fake News

Verifying Rumours

5 of 58

Introduction

Stance Detection Systems


6 of 58

Introduction

Spurious Cues in Datasets


Example:

Visual Question Answering[1]

Q. What is the colour of sky?

Ans. Blue

Cue: Generic truth

Q. Does the man have legs in the air?

Ans. Yes

Cue: Nature of questions annotators ask.

[1] Y. Goyal et al. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR 2017


7 of 58

Introduction

Spurious Cues in Datasets


8 of 58

Introduction

Role of Targets in Detecting Stance


* The text portion of the annotated example is taken from WT-WT dataset.


9 of 58

Introduction

Targets as free-form sentences


* Text portions of Tweet and Target are taken from RumourEval 2017 dataset


10 of 58

Introduction

Variants of Twitter Stance Detection

  • We considered at least one dataset of each type


Stance Detection

  • Targets are Fixed Entities
  • Test and train on same targets

Multi-target

  • Targets are a pair of Fixed Entities
  • Test and train on same targets

Cross-target

  • Targets are Fixed Entities
  • Test and train on different targets

Rumour Stance

  • Targets are free-form rumour claims
  • Test and train on different claims


11 of 58

Introduction

Demonstrating Spurious Cues in Twitter Stance Detection Datasets.

Creating new dataset benchmarks for target-aware stance detection.

Investigating the datasets for spurious cues.

Re-evaluating systems for target-aware stance detection.

Contributions


12 of 58

Spurious Cues in Twitter Stance Detection Datasets.

13 of 58

Spurious Cues in Datasets

Overview:

  • Demonstrating Spurious Cues:
    • Impressive performance of target oblivious models.
    • Small performance gap between target-oblivious and target-aware models.
  • Nature of dataset biases:
    • Sentiment Correlations
    • Lexical correlations
    • Others: Tweet-length, Opinion


14 of 58

Spurious Cues in Datasets

Datasets Considered - 3/6

Will-They-Won’t-They (WT-WT)

01

> Cross Target

> Financial Domain (M&A)

> 50k+ Tweet-target pairs

SemEval 2016

Task-6

02

> Vanilla Stance Detection

> Various Domains - politics, movements, policy

> 4.1k Tweet-target pairs

M-T Multitarget

03

> Multi-target Stance

> Political domain

> 4.4k Tweet-target pairs


15 of 58

Spurious Cues in Datasets

Datasets Considered - 6/6

RumourEval 2017

04

> Rumour Stance Detection

> Disaster Domain Threads

> 5.5k Tweet-target pairs

RumourEval 2019

05

> Rumour Stance Detection

> Disaster Domain Threads

> Twitter + Reddit

> 8.5k Tweet-target pairs

Encryption Debate

06

> Vanilla Stance Detection

> Encryption Debate

> 3k Tweet-target pairs


16 of 58

Spurious Cues in Datasets

Very few examples of tweets with different targets.

Dataset        % of tweets with different targets
WT-WT          2%
SemEval16      0%
Rumour2017     0%
Rumour2019     0%
Multi-target   0.9%
Encryption     0%


17 of 58

Spurious Cues in Datasets

Obtaining the Datasets

Some of the datasets release only the tweet IDs:

> Scraped using the Twitter API* and Tweepy**

Dataset        Tweets scraped
WT-WT          45865 / 50210
Multi-target   2688 / 4413
Encryption     1634 / 2522

* developer.twitter.com

** tweepy.org


18 of 58

Spurious Cues in Datasets

  • Word De-contraction: Example - Don’t -> Do not

  • Removing URL, Emoji (😃😄😆😍), Punctuations

  • Word Segmenting
    • #ClimateChange -> [‘#’, ‘Climate’, ‘Change’]

  • Text normalizing - Lowercasing and Username normalization
    • @BarackObama -> @USER

  • Trimmed sentences to 99 tokens

Preprocessing

* Libraries used - ekphrasis, nltk
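The steps above can be sketched with plain regular expressions. This is illustrative only: the actual pipeline uses ekphrasis and nltk, and the contraction list here is a tiny sample.

```python
import re

# Illustrative regex version of the preprocessing steps above.
# The real pipeline uses ekphrasis/nltk; this contraction list is a tiny sample.
CONTRACTIONS = {"don't": "do not", "won't": "will not", "can't": "can not"}

def preprocess(tweet, max_tokens=99):
    text = tweet
    for short, full in CONTRACTIONS.items():          # word de-contraction
        text = re.sub(short, full, text, flags=re.IGNORECASE)
    text = re.sub(r"https?://\S+", "", text)          # remove URLs
    text = re.sub(r"@\w+", "@USER", text)             # username normalization
    # Hashtag segmentation on CamelCase: #ClimateChange -> # Climate Change
    text = re.sub(r"#(\w+)",
                  lambda m: "# " + re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1)),
                  text)
    text = text.encode("ascii", "ignore").decode()    # drop emoji
    text = re.sub(r"[^\w\s@#]", " ", text)            # drop punctuation
    return " ".join(text.lower().split()[:max_tokens])  # lowercase + trim
```

For example, `preprocess("Don't miss @BarackObama on #ClimateChange! https://t.co/x")` yields `"do not miss @user on # climate change"`.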


19 of 58

Spurious Cues in Datasets

Setting up the Experiments

> Each datapoint is a tuple: (Tweet, Target, Stance)

> The target-oblivious model classifies using only the tweet.

> The target-aware model receives both the tweet and the target as input.

> Target-aware models should significantly outperform target-oblivious ones.


Images shown in this slide have been taken from Shutterstock.

20 of 58

Spurious Cues in Datasets

Target Aware Bert Model


This picture of Bert is taken from Sesame Street show after which Bert has been named.

21 of 58

Spurious Cues in Datasets

Target Oblivious Bert Model


This picture of Bert is taken from Sesame Street show after which Bert has been named.
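The only difference between the two baselines is the input they see. Schematically (the actual models feed these through the HuggingFace tokenizer rather than raw strings):

```python
def build_input(tweet, target=None):
    """Schematic BERT input. Target-oblivious: single segment.
    Target-aware: the standard two-segment sentence-pair encoding.
    (Illustrative only; real inputs come from the tokenizer.)"""
    if target is None:                               # target-oblivious
        return f"[CLS] {tweet} [SEP]"
    return f"[CLS] {tweet} [SEP] {target} [SEP]"     # target-aware
```

The sentence-pair form lets BERT attend jointly over the tweet and the target, which is exactly what the target-oblivious baseline is denied.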

22 of 58

Spurious Cues in Datasets


Domain-Specificity: Twitter

23 of 58

Spurious Cues in Datasets

Evaluation Metrics

01 Accuracy

> Fraction of labels correctly predicted.

02 Weighted Average F1

> Weighted average of per-class F1, with weights proportional to the number of examples in that class.

03 Macro Averaged F1

> F1 score is the harmonic mean of precision and recall.

> Macro F1 is a simple average of F1 across all the classes.

04 Human upper bound

> Provided for some datasets; used for comparison purposes only.
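The three automatic metrics can be computed from scratch as follows; this minimal sketch should agree with standard library implementations such as scikit-learn's `f1_score`.

```python
from collections import Counter

def f1(y_true, y_pred, cls):
    # Per-class F1: harmonic mean of precision and recall for one class.
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    fp = sum(p == cls != t for t, p in zip(y_true, y_pred))
    fn = sum(t == cls != p for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    # Simple (unweighted) average of per-class F1.
    classes = set(y_true)
    return sum(f1(y_true, y_pred, c) for c in classes) / len(classes)

def weighted_f1(y_true, y_pred):
    # Average of per-class F1 weighted by class frequency.
    counts = Counter(y_true)
    return sum(n / len(y_true) * f1(y_true, y_pred, c) for c, n in counts.items())
```

On skewed class distributions (as in these datasets) macro F1 is the stricter metric, since rare classes count as much as frequent ones.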


24 of 58

Spurious Cues in Datasets

Results (part 1): WT-WT Dataset

Observations:

> Target oblivious Bert performs near or above human bounds.

> Little performance gains from considering targets.


25 of 58

Spurious Cues in Datasets

Results (part 2): WT-WT Dataset

Similar Observations:

> Target oblivious Bert performs near human bounds.

> Out-of-Domain (OOD): Massive performance drop.


26 of 58

Spurious Cues in Datasets

Results (part 3): SE16 and M-T Datasets

> Target oblivious Bert consistently gives > ⅔ accuracy.

> Performs well on all metrics, very close to target aware.


27 of 58

Spurious Cues in Datasets

Results (part 4)

Skewed class distributions:

  • Desired metric: Macro-F1

Target Oblivious:

> Above ⅔ accuracy score

> Impressive Macro-F1

> Performs near target aware


28 of 58

Spurious Cues in Datasets

Visualizing the Results


This plot was drawn using Matplotlib and Seaborn Libraries.

29 of 58

Spurious Cues in Datasets

Dataset Analysis:

  • Picked WT-WT dataset
    • Most recent
    • Largest dataset
  • Dataset details:
    • Targets are 5 merger-and-acquisition (M&A) events.
    • Stance class
      • Support - Tweet supports that the merger will happen
      • Refute - Tweet refutes that the merger will happen
      • Comment - Tweet neither supports nor refutes.
      • Unrelated - Tweet does not talk about the merger.
  • Analysis:
    • Lexicon-choice associated with stance
    • Sentiment-stance correlations


The Image shown in this slide is taken from VanillaLaw

30 of 58

Spurious Cues in Datasets

Dataset Analysis: Lexical Choice

  • Pointwise mutual information[1]

  • Practical considerations -
    • Stop-word removal.
    • Emphasis on highly discriminative word-class correlations:
      • Apply add-100 smoothing.

[1] Gururangan et al. Annotation artifacts in natural language inference data. NAACL 2018
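A minimal re-implementation of the word-class PMI statistic described above. The smoothing convention and tokenization here are assumptions; the thesis follows Gururangan et al., and the slide's setting corresponds to k=100.

```python
import math
from collections import Counter

def pmi(docs, labels, k=0):
    """PMI(word, class) = log [ p(word, class) / (p(word) * p(class)) ],
    with add-k smoothing of the joint count (k=0 means no smoothing).
    Illustrative re-implementation, not the thesis code."""
    joint, word_ct, class_ct, total = Counter(), Counter(), Counter(), 0
    for doc, lab in zip(docs, labels):
        for w in set(doc.split()):          # word presence, not frequency
            joint[(w, lab)] += 1
            word_ct[w] += 1
            class_ct[lab] += 1
            total += 1
    n_cells = len(word_ct) * len(set(labels))
    scores = {}
    for (w, c), n in joint.items():
        p_joint = (n + k) / (total + k * n_cells)
        scores[(w, c)] = math.log(p_joint / ((word_ct[w] / total) * (class_ct[c] / total)))
    return scores
```

Large k pushes the ranking toward words that are both frequent and strongly class-associated, which is the point of the add-100 step.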


31 of 58

Spurious Cues in Datasets

Dataset Analysis: Lexical Choice

Top 5 stance-wise lexicons according to PMI, along with the percentage of tweets of that stance class containing the word


Support              Refute            Comment              Unrelated
approves (3.3%)      urges (3.0%)      ceo (3.7%)           stocks (3.4%)
approve (5.1%)       blocked (5.5%)    healthcare (11.8%)   size (2.6%)
billion (26.2%)      sues (4.3%)       mean (2.3%)          merge (11.3%)
shareholder (0.7%)   blocks (4.8%)     merger (29.3%)       bid (19.0%)
close (6.4%)         block (21.8%)     trial (3.4%)         agreement (16.7%)

32 of 58

Spurious Cues in Datasets

Dataset Analysis: Sentiment and Stance

  • Used XLNet sentiment classifier:
    • Sentiment Range: [0, 1]
    • 0 -> most negative
    • 1 -> most positive

  • Observations:
    • Support, Refute -> Extremes
    • Comment, Unrelated -> Neutral

Class       Sentiment
Support     0.23
Refute      0.64
Comment     0.49
Unrelated   0.48

[1] Yang et al. XLNet: Generalized autoregressive pretraining for language understanding. NeurIPS 2019


33 of 58

Spurious Cues in Datasets

Dataset Analysis

  • Sentiment and lexicons can be spurious cues.

  • Various cues in other datasets:
    • RumourEval 2019
      • ‘?’ appears in 75% of ‘query’-stance tweets but only 11% of the rest.
      • 75% of ‘deny’-stance tweets have a sentiment score < 0.1.
    • SemEval 2016
      • 91.4% of tweets without opinion have ‘None’ stance.
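Cues like these are easy to quantify; a hypothetical helper (not from the thesis) that measures how predictive a surface feature is of a class:

```python
def cue_rate(tweets, labels, cue, cls):
    """Fraction of class-`cls` tweets containing `cue`, vs. the rest.
    A large gap (e.g. 75% vs 11% for '?' in 'query' tweets) signals a
    spurious cue a classifier can exploit without reading the target."""
    in_cls = [cue in t for t, l in zip(tweets, labels) if l == cls]
    rest = [cue in t for t, l in zip(tweets, labels) if l != cls]
    return sum(in_cls) / len(in_cls), sum(rest) / len(rest)
```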


34 of 58

Spurious Cues in Datasets

Dataset Analysis: Length Correlation

Much weaker length correlation than reported in previous work.[1]


[1] Gururangan et. al. Annotation artifacts in natural language inference data. NAACL 2018

35 of 58

Spurious Cues in Datasets

Dataset Analysis: Length Correlation

  • Tweets with unrelated stance are somewhat longer.
  • Two peaks due to tweet length constraints.


36 of 58

Spurious Cues in Datasets

Dataset Analysis: Length Correlation

  • Query stance has a peculiar distribution over the tweet length.


37 of 58

Towards Target Aware Twitter Stance Detection

38 of 58

Target Aware Stance Detection

Overview

  • Motivations from previous section:
    • Presence of spurious cues - sentiment and lexicons
    • These cues aid target oblivious models

  • Overview of this section:
    • Dataset creation process
    • Re-evaluating stance detection systems


39 of 58

Target Aware Stance Detection

Dataset Creation Method

  • Augment WT-WT dataset:
    • Largest and most recent
    • Annotated by experts

  • Reasoning:
    • Aim to handle the sentiment and lexicon correlations
    • Target-oblivious models will fail if the stance varies with the target.


40 of 58

Target Aware Stance Detection

Augmenting procedure - part 1

  • Remove the sentiment-stance correlation
    • Create negated targets for refute and support stance class.
      • <buyer> buys <target> → <buyer> not buys <target>

Result: nearly the same sentiment score for each class.

Class       Sentiment
Support     0.44
Refute      0.44
Comment     0.49
Unrelated   0.48
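The negation step can be sketched as follows, assuming (as the construction implies) that negating the target flips support to refute and vice versa. The target template and helper names are illustrative, not the thesis code.

```python
def negate(target):
    # '<buyer> buys <target>' -> '<buyer> not buys <target>'
    # (insert 'not' after the first word, as in the slide's template)
    first, rest = target.split(" ", 1)
    return f"{first} not {rest}"

FLIP = {"support": "refute", "refute": "support"}

def augment(tweet, target, stance):
    # Negated-target copy for support/refute examples; the label flips with it.
    if stance in FLIP:
        return (tweet, negate(target), FLIP[stance])
    return (tweet, target, stance)
```

Because every support example gains a refute twin with the same tweet text (and vice versa), tweet-level sentiment no longer predicts the stance label.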


41 of 58

Target Aware Stance Detection

Augmenting procedure - part 2 and 3

  • Address the lexicon-stance associations
    • For each tweet with only one labelled target, label one other target randomly with ‘unrelated’ stance.
  • Balance class distributions:
    • Create negated targets for 50% ‘comment’ & ‘unrelated’ stance



42 of 58

Target Aware Stance Detection

Targeted WT–WT Dataset Statistics

  • 111596 tweet-target pairs
  • At least 10000 data points for each merger target
  • Balanced class ratio for support : refute : comment : unrelated of 1:1:3:5

  • Similarly augment the SemEval 2016 and Multi-target datasets.


43 of 58

Target Aware Stance Detection

Maximum Accuracy of Target Oblivious Classifiers

Theorem: The maximum possible accuracy for any deterministic target-oblivious stance classifier is:

    Acc_max = [ Σi count(ti) · maxj p(sj | ti) ] / [ Σi count(ti) ]

where:

  • T = {t1, t2, . . . tn} -> set of tweets; S = {s1, s2, . . . sm} -> set of stances.
  • count(ti) -> number of targets labelled for ti
  • p(sj | ti) -> fraction of the targets of ti labelled with stance sj


Dataset          Max. accuracy
Targeted WT-WT   0.722
Targeted SE16    0.551
Targeted M-T     0.506
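The bound follows from the fact that a tweet-only classifier must output one stance per tweet, so at best it matches the majority stance over that tweet's targets. A sketch under that reading:

```python
def max_oblivious_accuracy(stances_per_tweet):
    """Upper bound on accuracy for any deterministic target-oblivious
    classifier. `stances_per_tweet` maps each tweet to the list of gold
    stances, one per labelled target of that tweet."""
    total = correct = 0
    for stances in stances_per_tweet.values():
        total += len(stances)                                   # count(ti)
        correct += max(stances.count(s) for s in set(stances))  # best single guess
    return correct / total
```

For example, a tweet labelled support for one target and unrelated for another contributes at most 1 of 2 correct, whatever the classifier outputs.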

44 of 58

Target Aware Stance Detection

Experiments with Targeted datasets

Baselines:

  • Target Oblivious Bert: Same as Before
  • Target Aware Bert: Same as Before
  • SiamNet: Siamese networks with Bert
  • TAN: Target-specific Attention Networks with Bert

Metrics:

  • Same as the non-targeted counterparts.


45 of 58

Target Aware Stance Detection

SiamNet + Bert
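A toy version of the Siamese idea, with a bag-of-words overlap standing in for the shared BERT encoder. This is purely illustrative: the real model compares learned encodings of tweet and target, not token counts.

```python
import math
from collections import Counter

def encode(text):
    # Stand-in for the shared encoder: a bag-of-words vector.
    return Counter(text.lower().split())

def similarity(tweet, target):
    # Cosine similarity between the two encodings; in SiamNet this
    # tweet-target comparison feeds the stance classifier.
    u, v = encode(tweet), encode(target)
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

The key design point is the shared encoder: both inputs are mapped into the same space, so the comparison itself is what carries the target information.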


46 of 58

Target Aware Stance Detection

TAN + Bert


47 of 58

Target Aware Stance Detection

Experiments with Targeted WT–WT (Part-1)

Observations:

  • Target Oblivious Bert performs very poorly.
  • Target Aware Bert performs the best with a lot of scope for improvement.


48 of 58

Target Aware Stance Detection

Experiments with Targeted WT–WT (Part-2)

Observations:

  • Target Oblivious Bert performs very poorly.
  • Target Aware Bert performs the best with a lot of scope for improvement.


49 of 58

Target Aware Stance Detection

Experiments (Part-3)

Observations:

> Target Oblivious Bert performs poorly.

> Target Aware Bert performs the best.

> SiamNet comes very close to Target Aware Bert.

> TAN performs very poorly.


50 of 58

Target Aware Stance Detection

Experiments Overview


51 of 58

Conclusions and Future Work

52 of 58

Conclusions & Future Work

Conclusions:

  • Empirically demonstrated spurious cues in Twitter stance detection datasets that inflate the performance of models.

  • Investigated the datasets for spurious cues, finding sentiment-stance and lexicon-stance correlations. Useful for future dataset creation.

  • Proposed an augmentation method for removing spurious cues, creating the largest stance detection dataset.

  • Re-evaluated systems to show the usefulness of the new datasets. Room for future work on stance detection systems.


53 of 58

Conclusions & Future Work

Future Work:

  • Explainable Stance Detection Systems.

  • Analysis of Multi-lingual stance datasets.

  • Target-aware Stance Detection Systems, reasoning about the target entities.


54 of 58

Conclusions & Future Work

Future Work: Visualization

Target Aware trained on Targeted-WTWT


55 of 58

Conclusions & Future Work

Future Work: Visualization

Target Aware Bert trained on WTWT


56 of 58

Conclusions & Future Work

Code and Trained models

  • Links to Trained model on respective repositories.

  • Detailed Readme and Environment Configurations.


The pictures for Octocat, Pytorch logo and Huggingface logo are taken from their respective GitHub organizations.

57 of 58

Conclusions & Future Work

Leaderboard and Dataset

wtwtv2-dataset.github.io/


The leaderboard website is inspired by Squad, HotPotQA and HoVer dataset leaderboards.

58 of 58

Thank you

Slide Template partial credit: SlidesCarnival