
Understanding The Relation Between Noise And Bias In Annotated Datasets

Abhishek Anand, Anweasha Saha, Prathyusha Naresh Kumar, Ashwin Rao, Zihao He, Negar Mokhberian

Information Sciences Institute


Table of Contents

  1. Motivation
  2. Datasets
  3. Two Regimes of Classification
  4. Model and Performance
  5. Uncertainty in Machine Learning Predictions
  6. Results - Dataset Cartography
  7. Results - Single Ground Truth Model
  8. Findings from Single Ground Truth Model
  9. Results - Multi-Annotator Model
  10. Findings from Multi-Annotator Model
  11. References


Motivation

  • Bias in Annotation: In subjective tasks, differences among annotators, stemming from their diverse backgrounds and perspectives, introduce bias into annotations, especially in sensitive domains such as hate speech recognition.
  • Misinterpretation of Bias as Noise: Models often treat minority votes as outliers and perceive them as noise, which leads to biased predictions that favor the majority vote.
  • In this project, we explore whether perspectivist classification models can effectively use the valuable insights carried by instances that noise-detection techniques flag as noisy.

Datasets

Toxicity or hate speech datasets:

| | SBIC [1] | Kennedy [2] | Agree To Disagree [3] |
|---|---|---|---|
| # Annotators | 307 | 7,912 | 819 |
| # Annotations per annotator | 479.3 ± 829.6 | 17.1 ± 3.8 | 63.7 ± 139 |
| # Unique texts | 45,318 | 39,565 | 10,440 |
| # Annotations per text | 3.2 ± 1.2 | 2.3 ± 1.0 | 5 |
| # Labels | 2 | 3 | 2 |

Methods

  • Dataset Cartography [4] summarizes the training dynamics of each sample as:
    • Confidence: the mean of the probabilities assigned to the gold label across epochs.
    • Variability: the standard deviation of the probabilities assigned to the gold label across epochs.
  • Multi-annotator models predict labels from diverse annotator perspectives: they take an instance and an annotator ID as input and learn to predict the label each annotator would assign to each instance. We use DisCo [6] for our experiments; a minimal sketch follows this list.
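As a rough illustration of the multi-annotator setup (a sketch only, not the actual DisCo architecture; the encoder choice, embedding size, and all names here are our own assumptions), the model conditions each prediction on both the text and a learned annotator embedding:

```python
import torch
import torch.nn as nn

class MultiAnnotatorClassifier(nn.Module):
    """Minimal multi-annotator model: predicts the label a specific
    annotator would assign to an instance (illustrative, not DisCo itself)."""

    def __init__(self, text_encoder, hidden_dim, num_annotators, num_labels,
                 annotator_dim=64):
        super().__init__()
        self.text_encoder = text_encoder  # e.g., a RoBERTa-style encoder
        # One learned embedding per annotator captures their perspective.
        self.annotator_emb = nn.Embedding(num_annotators, annotator_dim)
        self.classifier = nn.Linear(hidden_dim + annotator_dim, num_labels)

    def forward(self, input_ids, attention_mask, annotator_ids):
        # Pooled ([CLS]-token) representation of the instance text.
        text_repr = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Condition the prediction on who is annotating.
        ann_repr = self.annotator_emb(annotator_ids)
        return self.classifier(torch.cat([text_repr, ann_repr], dim=-1))
```

Training on (instance, annotator, label) triples rather than on aggregated majority labels is what lets such a model retain minority perspectives instead of discarding them as noise.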

Two Regimes of Classification

[Figure illustrating the two regimes of classification [6]]

Model and Performance

Model: we trained two models for each dataset.

| | Majority Label Model | Multi-annotator Model |
|---|---|---|
| Model | RoBERTa-base [7] | DisCo [6] |
| Epochs | 5 | 5 |
| Learning rate | 5e-5 | 2e-3 |
| Batch size | 32 | 200 |

Performance:

| Dataset | F1 score (majority) | F1 score (multi-annotator) |
|---|---|---|
| Agree To Disagree | 0.78 | 0.78 |
| Kennedy | 0.68 | 0.75 |
| SBIC | 0.80 | 0.78 |
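The majority-label configuration above corresponds to a standard fine-tuning setup; a minimal sketch with Hugging Face Transformers (dataset loading and preprocessing omitted; the output directory and dataset names are our own placeholders):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# RoBERTa-base fine-tuned on the aggregated (majority) labels.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # 3 for the Kennedy dataset

args = TrainingArguments(
    output_dir="majority-label-model",
    num_train_epochs=5,              # epochs from the table above
    learning_rate=5e-5,              # majority-label learning rate
    per_device_train_batch_size=32,  # batch size from the table above
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```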

Uncertainty in Machine Learning Predictions

  • Dataset Cartography (Swayamdipta et al., 2020) [4] summarizes training dynamics for all samples as (see the sketch after this list):
    • Confidence: mean of the probabilities assigned to the gold label across epochs.
    • Variability: standard deviation of the probabilities assigned to the gold label across epochs.
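A minimal sketch of how these two statistics can be computed from per-epoch gold-label probabilities (the function and variable names are ours; Swayamdipta et al. provide their own implementation):

```python
import numpy as np

def cartography_stats(gold_label_probs):
    """gold_label_probs: array of shape (num_epochs, num_samples), where
    entry (e, i) is the probability the model assigned to sample i's
    gold label at the end of epoch e."""
    probs = np.asarray(gold_label_probs)
    confidence = probs.mean(axis=0)   # mean across epochs, per sample
    variability = probs.std(axis=0)   # std-dev across epochs, per sample
    return confidence, variability

# Toy usage: 5 epochs, 3 samples.
probs = np.array([
    [0.90, 0.40, 0.60],
    [0.92, 0.50, 0.30],
    [0.95, 0.45, 0.70],
    [0.97, 0.55, 0.20],
    [0.96, 0.50, 0.80],
])
conf, var = cartography_stats(probs)
# High confidence + low variability  -> "easy-to-learn";
# low confidence + low variability   -> "hard-to-learn" (possibly mislabeled);
# high variability                   -> "ambiguous".
```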

Results - Dataset Cartography

[Data maps (confidence vs. variability) for Agree To Disagree, Kennedy, and SBIC]

1st Regime of Classification

Results - Single Ground Truth Model

Agree To Disagree (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.44 | 0.0 |

Kennedy (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.45 | 0.0 |

SBIC (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.37 | 0.0 |
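The correlations above can be computed as in the sketch below, assuming the agreement factor for an instance is the fraction of its annotators who chose the majority label (our assumed definition) and that cartography confidences are aligned with instances by index:

```python
from scipy.stats import pearsonr

def agreement_factor(labels):
    """Fraction of annotators who chose the instance's majority label
    (our assumed definition of the agreement factor)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Toy data: per-instance annotator labels and cartography confidences.
per_instance_labels = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
confidences = [0.95, 0.61, 0.58, 0.47]

agreement = [agreement_factor(labels) for labels in per_instance_labels]
r, p = pearsonr(agreement, confidences)
print(f"Pearson's r = {r:.2f}, p-value = {p:.3g}")
```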

Findings from Single Ground Truth Model

  • There is a correlation between human disagreement on an instance and the model's uncertainty/confidence when classifying that instance.
  • For instances with more disagreement among human labelers, the single-GT model also shows lower confidence.

2nd Regime of Classification

Research Question: Does modeling the task on individual annotator votes yield better confidence for the low-confidence instances?

Results - Multi-Annotator Model

Agree To Disagree (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.46 | 0.0 |

SBIC (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.43 | 0.0 |

Kennedy (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.50 | 0.0 |

This dataset is especially challenging because the average number of annotations per annotator is only ~17.

High confidence in the Single Ground Truth model but low confidence in the Multi-Annotator model:

| Text | Annotation |
|---|---|
| I thought you were going to win this in a landslide | Offensive |
| crossing my fingers for you from Germany. Even though I am not an American. | Offensive |
| boy F**K you | Not Offensive |
| Fragility at its finest | Offensive |

Mislabeled annotations!
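Instances like these can be surfaced automatically; a sketch assuming per-instance cartography confidences from both training runs, aligned by index (the 0.8 and 0.5 thresholds are illustrative, not from the study):

```python
import numpy as np

# Cartography confidence per instance from each training run,
# aligned by instance index (toy values).
conf_single_gt = np.array([0.96, 0.91, 0.88, 0.90])
conf_multi_annotator = np.array([0.35, 0.42, 0.30, 0.44])
texts = [
    "I thought you were going to win this in a landslide",
    "crossing my fingers for you from Germany. Even though I am not an American.",
    "boy F**K you",
    "Fragility at its finest",
]

# Candidate mislabeled instances: the single-GT model is confident in the
# aggregated label, while the multi-annotator model is not.
suspects = np.where((conf_single_gt > 0.8) & (conf_multi_annotator < 0.5))[0]
for i in suspects:
    print(texts[i])
```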

Findings from Multi-Annotator Model

  • As with the single gold label model, we see a significant correlation between annotator agreement and model confidence in all datasets, with confidence decreasing as disagreement among annotators grows.
  • For samples with low confidence under the single ground truth model, the multi-annotator model often shows high confidence for labels that disagree with the majority, thus learning valuable information from samples that majority-vote aggregation discards.
  • A high number of annotations per annotator is necessary to model different perspectives effectively.

Website: https://anweasha.github.io/DataFirst/

References

  1. Social Bias Frames: Reasoning about Social and Power Implications of Language (Sap et al., ACL 2020)
  2. Constructing Interval Variables via Faceted Rasch Measurement and Multitask Deep Learning: A Hate Speech Application (Kennedy et al., 2020)
  3. Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement (Leonardelli et al., EMNLP 2021)
  4. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020)
  5. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations (Mostafazadeh Davani et al., TACL 2022)
  6. Disagreement Matters: Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo (Weerasooriya et al., Findings 2023)
  7. RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)