
Understanding The Relation Between Noise And Bias In Annotated Datasets

Abhishek Anand, Anweasha Saha, Prathyusha Naresh Kumar, Ashwin Rao, Zihao He, Negar Mokhberian

Information Sciences Institute


Table of Contents

  1. Motivation
  2. Datasets
  3. Two Regimes of Classification
  4. Model and Performance
  5. Uncertainty in Machine Learning Predictions
  6. Results - Dataset Cartography
  7. Results - Single Ground Truth Model
  8. Findings from Single Ground Truth Model
  9. Results - Multi-Annotator Model
  10. Findings from Multi-Annotator Model
  11. References


Motivation

  • Bias in Annotation: In subjective tasks, differences among annotators, stemming from their diverse backgrounds and perspectives, introduce bias into annotations, especially in sensitive domains such as hate speech recognition.
  • Misinterpretation of Bias as Noise: Models often treat minority votes as outliers and perceive them as noise, which leads to biased predictions that favor the majority vote.
  • In this project, we explore whether perspectivist classification models can effectively use the valuable insights carried by instances that noise-detection techniques flag as noisy.

Datasets

Toxicity or hate speech datasets:

| | SBIC [1] | Kennedy [2] | Agree To Disagree [3] |
|---|---|---|---|
| # Annotators | 307 | 7,912 | 819 |
| # Annotations per annotator | 479.3 ± 829.6 | 17.1 ± 3.8 | 63.7 ± 139 |
| # Unique texts | 45,318 | 39,565 | 10,440 |
| # Annotations per text | 3.2 ± 1.2 | 2.3 ± 1.0 | 5 |
| # Labels | 2 | 3 | 2 |

Methods

  • Dataset Cartography [4] summarizes the training dynamics of each sample as:
    • Confidence: the mean of the probabilities assigned to the gold label across epochs.
    • Variability: the standard deviation of the probabilities assigned to the gold label across epochs.
  • Multi-annotator models predict labels from diverse annotator perspectives: they take an instance and an annotator ID as input and learn to predict the label each annotator would assign to each instance. We use DisCo [6] for our experiments; a minimal sketch follows this list.
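As a rough illustration of the multi-annotator setup (a sketch only, not the actual DisCo architecture; the encoder choice, embedding size, and all names here are our own assumptions), the model conditions each prediction on both the text and a learned annotator embedding:

```python
import torch
import torch.nn as nn

class MultiAnnotatorClassifier(nn.Module):
    """Minimal multi-annotator model: predicts the label a specific
    annotator would assign to an instance (illustrative, not DisCo itself)."""

    def __init__(self, text_encoder, hidden_dim, num_annotators, num_labels,
                 annotator_dim=64):
        super().__init__()
        self.text_encoder = text_encoder  # e.g., a RoBERTa-style encoder
        # One learned embedding per annotator captures their perspective.
        self.annotator_emb = nn.Embedding(num_annotators, annotator_dim)
        self.classifier = nn.Linear(hidden_dim + annotator_dim, num_labels)

    def forward(self, input_ids, attention_mask, annotator_ids):
        # Pooled ([CLS]-token) representation of the instance text.
        text_repr = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Condition the prediction on who is annotating.
        ann_repr = self.annotator_emb(annotator_ids)
        return self.classifier(torch.cat([text_repr, ann_repr], dim=-1))
```

Training on (instance, annotator, label) triples rather than on aggregated majority labels is what lets such a model retain minority perspectives instead of discarding them as noise.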

Two Regimes of Classification

[Figure illustrating the two regimes of classification [6]]

Model and Performance

Model: we trained two models for each dataset.

| | Majority Label Model | Multi-annotator Model |
|---|---|---|
| Model | RoBERTa-base [7] | DisCo [6] |
| Epochs | 5 | 5 |
| Learning rate | 5e-5 | 2e-3 |
| Batch size | 32 | 200 |

Performance:

| Dataset | F1 score (majority) | F1 score (multi-annotator) |
|---|---|---|
| Agree To Disagree | 0.78 | 0.78 |
| Kennedy | 0.68 | 0.75 |
| SBIC | 0.80 | 0.78 |
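The majority-label configuration above corresponds to a standard fine-tuning setup; a minimal sketch with Hugging Face Transformers (dataset loading and preprocessing omitted; the output directory and dataset names are our own placeholders):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# RoBERTa-base fine-tuned on the aggregated (majority) labels.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # 3 for the Kennedy dataset

args = TrainingArguments(
    output_dir="majority-label-model",
    num_train_epochs=5,              # epochs from the table above
    learning_rate=5e-5,              # majority-label learning rate
    per_device_train_batch_size=32,  # batch size from the table above
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```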

Uncertainty in Machine Learning Predictions

  • Dataset Cartography (Swayamdipta et al., 2020) [4] summarizes training dynamics for all samples as (see the sketch after this list):
    • Confidence: mean of the probabilities assigned to the gold label across epochs.
    • Variability: standard deviation of the probabilities assigned to the gold label across epochs.
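A minimal sketch of how these two statistics can be computed from per-epoch gold-label probabilities (the function and variable names are ours; Swayamdipta et al. provide their own implementation):

```python
import numpy as np

def cartography_stats(gold_label_probs):
    """gold_label_probs: array of shape (num_epochs, num_samples), where
    entry (e, i) is the probability the model assigned to sample i's
    gold label at the end of epoch e."""
    probs = np.asarray(gold_label_probs)
    confidence = probs.mean(axis=0)   # mean across epochs, per sample
    variability = probs.std(axis=0)   # std-dev across epochs, per sample
    return confidence, variability

# Toy usage: 5 epochs, 3 samples.
probs = np.array([
    [0.90, 0.40, 0.60],
    [0.92, 0.50, 0.30],
    [0.95, 0.45, 0.70],
    [0.97, 0.55, 0.20],
    [0.96, 0.50, 0.80],
])
conf, var = cartography_stats(probs)
# High confidence + low variability  -> "easy-to-learn";
# low confidence + low variability   -> "hard-to-learn" (possibly mislabeled);
# high variability                   -> "ambiguous".
```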

Results - Dataset Cartography

[Data maps (confidence vs. variability) for Agree To Disagree, Kennedy, and SBIC]

1st Regime of Classification

Results - Single Ground Truth Model

Agree To Disagree (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.44 | 0.0 |

Kennedy (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.45 | 0.0 |

SBIC (correlation between annotator agreement and model confidence):

| Pearson's r | p-value |
|---|---|
| 0.37 | 0.0 |
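The correlations above can be computed as in the sketch below, assuming the agreement factor for an instance is the fraction of its annotators who chose the majority label (our assumed definition) and that cartography confidences are aligned with instances by index:

```python
from scipy.stats import pearsonr

def agreement_factor(labels):
    """Fraction of annotators who chose the instance's majority label
    (our assumed definition of the agreement factor)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)

# Toy data: per-instance annotator labels and cartography confidences.
per_instance_labels = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
confidences = [0.95, 0.61, 0.58, 0.47]

agreement = [agreement_factor(labels) for labels in per_instance_labels]
r, p = pearsonr(agreement, confidences)
print(f"Pearson's r = {r:.2f}, p-value = {p:.3g}")
```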

Findings from Single Ground Truth Model

  • There is a correlation between human disagreement on an instance and the model's uncertainty/confidence when classifying that instance.
  • For instances with more disagreement among human labelers, the single-GT model also shows lower confidence.

2nd Regime of Classification

Research Question: Does modeling the task on individual annotator votes yield better confidence for the low-confidence instances?

Results - Multi-Annotator Model

Agree To Disagree (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.46 | 0.0 |

SBIC (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.43 | 0.0 |

Kennedy (correlation between agreement factor and confidence):

| Pearson's r | p-value |
|---|---|
| 0.50 | 0.0 |

This dataset is especially challenging because the average number of annotations per annotator is only ~17.

High confidence in the Single Ground Truth model but low confidence in the Multi-Annotator model:

| Text | Annotation |
|---|---|
| I thought you were going to win this in a landslide | Offensive |
| crossing my fingers for you from Germany. Even though I am not an American. | Offensive |
| boy F**K you | Not Offensive |
| Fragility at its finest | Offensive |

Mislabeled annotations!
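Instances like these can be surfaced automatically; a sketch assuming per-instance cartography confidences from both training runs, aligned by index (the 0.8 and 0.5 thresholds are illustrative, not from the study):

```python
import numpy as np

# Cartography confidence per instance from each training run,
# aligned by instance index (toy values).
conf_single_gt = np.array([0.96, 0.91, 0.88, 0.90])
conf_multi_annotator = np.array([0.35, 0.42, 0.30, 0.44])
texts = [
    "I thought you were going to win this in a landslide",
    "crossing my fingers for you from Germany. Even though I am not an American.",
    "boy F**K you",
    "Fragility at its finest",
]

# Candidate mislabeled instances: the single-GT model is confident in the
# aggregated label, while the multi-annotator model is not.
suspects = np.where((conf_single_gt > 0.8) & (conf_multi_annotator < 0.5))[0]
for i in suspects:
    print(texts[i])
```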

Findings from Multi-Annotator Model

  • As with the single gold label model, we see a significant correlation between annotator agreement and model confidence in all datasets, with confidence decreasing as disagreement among annotators grows.
  • For samples with low confidence under the single ground truth model, the multi-annotator model often shows high confidence for labels that disagree with the majority, thus learning valuable information from samples that majority-vote aggregation discards.
  • A high number of annotations per annotator is necessary to model different perspectives effectively.

Website: https://anweasha.github.io/DataFirst/

References

  1. Social Bias Frames: Reasoning about Social and Power Implications of Language (Sap et al., ACL 2020)
  2. Constructing Interval Variables via Faceted Rasch Measurement and Multitask Deep Learning: A Hate Speech Application (Kennedy et al., 2020)
  3. Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement (Leonardelli et al., EMNLP 2021)
  4. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020)
  5. Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations (Mostafazadeh Davani et al., TACL 2022)
  6. Disagreement Matters: Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo (Weerasooriya et al., Findings 2023)
  7. RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019)