1 of 81

Constructing interval variables via faceted Rasch measurement and multitask deep learning

Debiased, explainable, interval measurement of hate speech

November 2020

2 of 81

Research Team

  • Chris Kennedy (Lead) – Postdoc at Harvard Medical School, Biostatistics PhD
  • Claudia von Vacano (PI) – Policy, Organizations, Measurement & Evaluation PhD
  • Geoff Bacon – Linguistics PhD
  • Alexander Sahn – Political Science PhD candidate
  • Aniket Kesari – Law PhD and postdoc at D-Lab
  • Renata Barreto – Law PhD student

And with special thanks to:

  • Professor Mark Wilson
    • Graduate School of Education
    • Berkeley Evaluation & Assessment Research Center

3 of 81

Binary vs. interval variables

Question: What's the temperature today?

4 of 81

Scientific goals of our method

  • Create an outcome variable that is precise (interval) and has minimal bias from the humans who labeled the data
  • item response theory

  • Use machine learning to predict that outcome measure in a scalable way, also with minimal human bias, and with a clear explanation of what led to the predicted score
  • deep learning

5 of 81

Categorical, ordinal, and interval variables

  • Categorical / nominal variables
    • Variable value is a code for different qualitative labels
    • Can be seen as a way of encoding multiple mutually exclusive binary variables
    • E.g. color: red (1), blue (2), or green (3)
    • Alive: yes (1), no (0)
  • Ordinal variables
    • Values have an ordering from lower to higher on some variable
    • We cannot take differences between values because the exact distance between values is unknown
    • E.g. Likert scales: strongly disagree (0), disagree (1), neutral (2), agree (3), strongly agree (4)
    • Or disease severity: mild (1), moderate (2), severe (3)
  • Interval variables
    • Continuous variable in which differences between values are meaningful
    • I.e. magnitude or scale of units is constant across the range of the variable
    • A "ruler" that measures the location on an abstract continuum of a variable

6 of 81

Applicable to two types of supervised outcomes to measure

  • Complex outcome variable currently measured as a human-reviewed binary or ordinal variable for convenience, but that could be decomposed into multiple constituent components

  • Existing outcome variables measured as an index of multiple components rated by human reviewers, but not yet using item response theory

7 of 81

8 of 81

9 of 81

Standard approach is limited, and not considered measurement

Machine learning model

[Comment someone makes on Twitter]

Hey AI, is that comment hate speech?

Research team, social media platform, or judge/jury

I estimate 37% probability of being hate speech.

10 of 81

Our method measures hate speech as an interval variable, and explains why

Our machine learning model

[Comment someone makes on Twitter]

Hey AI, where do you place this comment on your hate speech scale?

Research team, social media platform, or judge/jury

I estimate the comment at 2.5 (+/- 0.3) on the hate speech scale - an extremely hateful comment. My reasoning is that this comment appears to have strongly negative sentiment (75% certainty), likely threatens violence (85% certainty), includes an identity group target (99%), and is likely humiliating to the target group (92%).

11 of 81

How does our method work? Details to be described

  • The core task is to decompose a single-question outcome (e.g. "Is this comment hate speech?") into a series (say 5) of ordinal components (respect, dehumanization, insult, etc.)
  • Recruit human labelers to review observations on those components (online survey)
    • Batches of comments should be created in an overlapping fashion so that the labelers are densely linked in a single network
  • Apply item response theory to aggregate those components into a continuous scale
    • Simultaneously estimate the bias of each labeler and eliminate its influence from the scale
    • Estimate the randomness in each labeler and remove labelers with inconsistent labels
  • Use deep learning in a multitask architecture to predict each component (ordinal classification) using the human labeled data, also incorporating the bias of labelers
    • The deep learning component predictions are then transformed to a continuous scale through IRT
  • The result is a debiased, explainable, efficient prediction machine for measuring the construct of interest on a continuous, interval scale (with std. errors)

12 of 81

Standard machine learning approach

  • binary definition of hate speech (yes or no) - qualitative
  • probability prediction: Pr(Y = 1 | X)
  • no sense of magnitude: how extreme is the hate speech?
  • biased by the interpretation of the humans that labeled data
  • no explanation
  • not generalizable to future time periods when our sensitivity to hate speech may change

13 of 81

New approach:

  • continuous hate speech scale (roughly -5.0 to +5.0)
  • magnitude is incorporated - true quantitative measurement
  • regression prediction: E[Y | X]
  • prediction can be explained by intermediate components
  • debiased from how humans labeled the data
  • generalizes beyond the specific components measured, comments analyzed, and raters who labeled data

14 of 81

Review our scientific contribution

  • We develop a method for integrating the measurement benefits of item response theory with the scalable, accurate prediction provided by deep learning
  • Our method makes five contributions:
    • Realistic granularity: Outcomes can be measured as interval variables on a continuous scale, rather than simplistic yes/no binaries
    • Labeler debiasing: we estimate the survey interpretation bias of individual labelers (a "fixed effect") and eliminate that bias from the estimation of the continuous outcome
    • Sample efficiency: we can achieve greater predictive accuracy for a given sample size because our ordinal components become supervised latent variables in a multitask neural architecture
    • Explainability: we can explain the predicted continuous score of any observation by examining the predictions of the individual components
    • Labeler quality: we show how item response theory can estimate the quality of labelers' responses, allowing low-quality labelers to be removed or down-weighted
  • In sum, our method stands to drastically change how we measure outcomes and conduct machine learning with big data

15 of 81

Comparison to related work

16 of 81

Agenda for Talk

  • Theorize construct (reference set, components)
  • Collect comments (web APIs)
  • Label components (crowdsourcing)
  • Scale (faceted Rasch IRT)
  • Predict (deep learning for NLP)

17 of 81

Our method applies to any human-rated data used for supervised classification or regression

Examples: Text

  • Hate speech
  • Toxic language / bullying
  • Sentiment
  • Essay grading
  • Conference abstract or article review

Examples: Images

  • Radiological image review (e.g. CT severity index for acute pancreatitis)
  • Grading of agricultural produce
  • Satellite image rating for development
  • Pornography detection
  • Artist identification of paintings
  • Microscopy analysis of liver biopsy

Also: time-series, like ECG classification. Other ideas from you?

18 of 81

Theory development

19 of 81

  • EDUC 274A (K. Draney) - Fundamentals of Measurement
  • EDUC 274B (M. Wilson) - Statistics of IRT

20 of 81

Construct Map: theoretical levels of hate speech

Qualitative ordered levels; these do not reflect interval values on the final hate speech scale

21 of 81

Reference set: empirical grounding of theory

  • 10+ comments for each of our theorized levels
  • Forms an empirical lattice that constrains the theory
  • Prompts introspection and debate, leading to improved understanding of how we truly theorize our construct and its associated levels
  • Leads to confirmatory analysis, not exploratory

22 of 81

Components of hate speech

23 of 81

Survey details

  • Initial screen on identity group targets
    • Major identity groups: race/ethnicity, gender, religion, sexual orientation, disability, age
    • One follow-up question for sub-identity group for each major group
  • Hate speech scale questions (~10)
  • Participant demographics
    • Gender, education, race, age, income, religion, sexual orientation, political ideology
  • Free response feedback (optional)

24 of 81

Comment Collection

25 of 81

Stream comments

Reddit: Most recently published comments on any post in /r/all.

Twitter: Most recent tweets from their streaming API.

YouTube: Search for videos around major US cities, take all comments on them.

26 of 81

Class imbalance, statistical power, & budget limits

  • Binarized hate speech is < 1% of general internet content
  • If we had a yes/no outcome for hate speech, what proportion of hate speech would we want in the training data?
  • For an 8-level hate speech construct, we want a mostly even distribution over each level (~12.5% each)
  • Our labeling budget is finite, so we want to avoid spending a ton of money on imbalanced training data

27 of 81

Sample comments

We’ve collected over 75 million comments, but we only want to annotate 50k.

Over-sample comments with identity groups, and stratify on estimated hatefulness.

[Diagram: allocation of the 50k annotation budget into samples of 20k, 20k, and 10k comments]

28 of 81

Comment batch creation

29 of 81

Augment comments

Perspective API: Trained NLP models from Jigsaw for detecting various kinds of abusive language. We use their identity attack and threat models.

Word embeddings help us answer “How relevant is this comment to the identity groups we’re looking for?”

30 of 81

Bin comments

We use the metadata added in step 2 to bin the comments into 5 bins (a code sketch follows the list):

  • Not relevant (does not appear to target identity groups)
  • Relevant and low on hate scale
  • Relevant and neutral on hate scale
  • Relevant and high
  • Relevant and very high
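
A minimal sketch of this binning step in Python, assuming hypothetical metadata columns `identity_prob` and `hate_est` from the augmentation step; the 0.5 relevance cutoff and the hate-scale thresholds are illustrative, not the project's actual cut-points:

```python
import pandas as pd

def assign_bin(comments: pd.DataFrame) -> pd.Series:
    """Assign each comment to one of the 5 sampling bins.

    Assumes hypothetical metadata columns from the augmentation step:
      identity_prob - estimated Pr(comment targets an identity group)
      hate_est      - estimated position on the hate scale
    All thresholds below are illustrative only.
    """
    bins = pd.Series("not_relevant", index=comments.index)
    relevant = comments["identity_prob"] >= 0.5
    bins[relevant & (comments["hate_est"] < -1)] = "relevant_low"
    bins[relevant & comments["hate_est"].between(-1, 1, inclusive="left")] = "relevant_neutral"
    bins[relevant & comments["hate_est"].between(1, 2, inclusive="left")] = "relevant_high"
    bins[relevant & (comments["hate_est"] >= 2)] = "relevant_very_high"
    return bins

# Example: sample within bins to help fill the 50k labeling budget.
# comments["bin"] = assign_bin(comments)
# sample = comments.groupby("bin", group_keys=False).apply(
#     lambda g: g.sample(min(len(g), 10_000), random_state=1))
```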

31 of 81

Stratification: maximize power without eliminating any cells

Hypothesis dimension (columns): E[ hate score | X ]
Relevance dimension (rows): Pr[ identity groups = 1 | X ]

                      Positive   Neutral   Low Hate   High Hate
Identity groups          7,500     5,000     18,300      14,200
No identity groups       5,000 (across all cells)

Total labeling budget: 50,000 comments

Comments downloaded: 75 million

32 of 81

Sampling design for human review of comments

33 of 81

Naive annotation plan can lead to distinct networks with disjoint subsets

  • Batches of 5 distinct comments
  • Each batch rated by 3 labelers
  • Each labeler rates only one batch
  • We cannot tell whether a batch is more hateful or its set of raters is more lenient in their rating - we can't calibrate across batches
  • Allowing workers to label comments randomly, like on Figure 8's system, would likely also lead to disjoint subsets
    • But maybe one could get lucky and not have any disjoint subsets?

[Diagram: Batch 1, Batch 2, and Batch 3, each rated by a disjoint trio of raters (R1-R3, R4-R6, R7-R9), forming three disconnected networks]

34 of 81

Overlapping reviews lead to a single linked network of raters + comments

[Diagram: bipartite network of 7 comments and 5 labelers / annotators (Raters A-E); each review forms an edge between a rater and a comment]

  • Here is an example with 7 comments reviewed by 5 raters; every rater reviews 3 comments
  • Each review creates a link (or connection, edge) between the rater and the comment.

[Second panel: unfolded version of the same network]
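
One way to check a batch design before fielding it is to build the bipartite rater-comment graph and confirm that it forms a single connected component. A quick sketch with networkx, using a made-up assignment like the example above:

```python
import networkx as nx

# Hypothetical assignment: which comments each rater reviews.
assignments = {
    "rater_A": [1, 2, 3],
    "rater_B": [3, 4, 5],
    "rater_C": [5, 6, 7],
    "rater_D": [7, 1, 4],
    "rater_E": [2, 6, 5],
}

G = nx.Graph()
for rater, comments in assignments.items():
    for c in comments:
        G.add_edge(rater, f"comment_{c}")  # each review is an edge

# A single connected component means every rater can be calibrated
# against every other rater through chains of shared comments.
print(nx.is_connected(G))                 # True for this design
print(nx.number_connected_components(G))  # 1
```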

35 of 81

Densely linked network for human labeler debiasing

36 of 81

Scaling

37 of 81

Overview of item response theory scaling

  • Item response theory analyzes the patterns in the ordinal survey responses (components of hate speech) to create a continuous latent variable (hate speech scale)
  • That continuous hate speech score best explains the combined ratings on the survey instrument for each comment, after correcting for reviewer bias.
  • While doing that, IRT simultaneously estimates:
    • Where each survey item falls on the hate speech scale (where it is most informative)
    • Where each response option for each item falls on the hate speech scale
    • The bias (or "severity") of each annotator
  • This estimation is through maximum likelihood
    • We use joint maximum likelihood, but marginal or conditional maximum likelihood are options
  • It provides statistical diagnostics to evaluate the results
    • Reliability is the primary metric, ranging from 0 to 1. Our scale has a reliability of 0.94.
      • Interpretation: similar to R², it is the proportion of variance accounted for by the model (see the sketch after this list)
    • It also generates fit statistics for each reviewer, which can identify reviewers who are selecting randomly
    • Fit statistics for each survey item tell us how well the item fits into the scale
  • Readings: Wilson (2004) Constructing Measures (Ch. 5 - 7), Wright & Masters (1982) Rating Scale Analysis
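
As a rough illustration of the reliability metric, here is a sketch of the standard Rasch separation reliability (true variance over observed variance), assuming we already have estimated comment scores and their standard errors:

```python
import numpy as np

def separation_reliability(scores: np.ndarray, std_errors: np.ndarray) -> float:
    """Rasch separation reliability: the proportion of observed variance
    that is 'true' variance rather than measurement error.
    Analogous to R^2: (observed var - mean error var) / observed var.
    """
    observed_var = np.var(scores)
    error_var = np.mean(std_errors ** 2)
    return (observed_var - error_var) / observed_var

# Example with simulated values:
rng = np.random.default_rng(0)
scores = rng.normal(0, 1.5, size=1000)  # estimated hate scores
std_errors = np.full(1000, 0.35)        # per-comment standard errors
print(round(separation_reliability(scores, std_errors), 2))  # ~0.95
```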

38 of 81

Item response theory estimation goal (slightly simplified)

Predict probability of response option R on item I for comment C by annotator A

Based on the many-facet Rasch "subtraction formula" (adjacent-category logits):

log[ Pr(response = R) / Pr(response = R - 1) ] = theta_C - delta_I - alpha_A - tau_R

where:

  theta_C = hate score for comment C (latent variable of interest, random effect)
  delta_I = hate score (difficulty) of item I (fixed effect)
  alpha_A = annotator A's bias, aka severity (fixed effect)
  tau_R   = hate score (step threshold) of response option R (fixed effect)

See formula 1 in manuscript for the more technical version
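
To make the formula concrete, a small sketch that turns those subtraction terms into response-option probabilities under the adjacent-categories (partial credit) form; the parameter values are invented for illustration:

```python
import numpy as np

def response_probs(theta_c, delta_i, alpha_a, taus):
    """Many-facet Rasch model: probability of each response option.

    theta_c : hate score of comment C
    delta_i : difficulty of item I
    alpha_a : severity (bias) of annotator A
    taus    : step thresholds for response options 2..K
    """
    # Cumulative sum of adjacent-category logits; option 1 is the reference.
    steps = theta_c - delta_i - alpha_a - np.asarray(taus)
    logits = np.cumsum(np.concatenate([[0.0], steps]))
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Illustrative values: a fairly hateful comment, neutral annotator.
# Returns probabilities for the 5 response options of one item.
print(response_probs(theta_c=1.5, delta_i=0.5, alpha_a=0.0,
                     taus=[-1.0, -0.3, 0.4, 1.2]))
```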

39 of 81

Estimation methods for IRT

(Add in highlights on JML, MML, CML, non-parametric)

40 of 81

Scaling results from item response theory

[Figure: comments placed along the estimated scale, with regions labeled supportive, counterspeech, neutral, somewhat hateful, very hateful, and most hateful]

Reliability: 0.94!

41 of 81

With Thurstonian thresholds (v3)

42 of 81

Item fit statistics (v3)

[Figure: item fit statistics for the respect, dehumanize, violence, genocide, and binary hate speech items]

43 of 81

With Thurstonian thresholds (v4)

44 of 81

Improved fit statistics (v4)

45 of 81

Disordered item step thresholds (Rasch-Andrich)

  • None

Andrich & Pedler. (2019). “Modelling ordinal assessments: fit is not sufficient”. In:

Communications in Statistics-Theory and Methods 48.12, pp. 2932–2947

46 of 81

Revised scale with 6 items

Reliability: 0.92

[Figure: item locations for the six retained items: sentiment, respect, insult, humiliate, status, and attack-defend]

47 of 81

Revised scale with 6 items

48 of 81

Example scaling results (trigger warning)

49 of 81

Distribution across social media platforms

50 of 81

We have created a measure of our construct.

Can we predict it ("auto-grade") with machine learning on raw text?

51 of 81

Short Circuit (1986)

52 of 81

[Diagram: raw comment text → deep NLP encoder (BERT, ALBERT, RoBERTa, T5, USE) → language representation → fully connected layers (latent variables related to hate speech) → binary hate speech status]

Current best practice in supervised NLP
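
For concreteness, the standard approach amounts to something like the following sketch with the Hugging Face transformers library; the checkpoint is a generic placeholder and its classification head is untrained here, so the printed probability is illustrative only:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: any BERT-family encoder with a binary head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = not hate speech, 1 = hate speech

inputs = tokenizer("example comment text", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
prob_hate = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Pr(hate speech) = {prob_hate:.2f}")  # e.g. 0.37 - no sense of magnitude
```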

53 of 81

Learning to rate: neural architecture for predicting a continuous score with multiple intermediate outcomes (multitask), labeler bias adjustment, and IRT activation

[Diagram: raw comment text → deep NLP encoder (BERT, ALBERT, RoBERTa, T5, USE) → language representation → fully connected hidden layers → intermediate ordinal outcomes, the ratings on the hate scale items (1. Sentiment, 2. Respect, 3. Insult, 4. Humiliate, 5. Status, 6. Dehumanize, 7. Violence, 8. Genocide, 9. Attack-Defend; loss: ordinal cross-entropy) → item response theory layer (non-linear activation function), with estimated labeler bias ("fixed effect") as an input → continuous hate score, the final outcome (loss: squared error)]
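
A condensed PyTorch sketch of this multitask architecture; the encoder, the layer sizes, and the final linear stand-in for the IRT activation are schematic assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class MultitaskHateScale(nn.Module):
    """Encoder -> shared hidden layers -> one ordinal head per scale item,
    with labeler severity concatenated before the heads, and a final layer
    standing in for the IRT step that yields the continuous hate score."""

    def __init__(self, encoder: nn.Module, enc_dim: int,
                 n_items: int = 9, n_options: int = 5, hidden: int = 256):
        super().__init__()
        self.encoder = encoder  # assumed to return a pooled (batch, enc_dim) embedding
        self.shared = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # +1 for the labeler severity scalar concatenated onto the hidden layer
        self.item_heads = nn.ModuleList(
            [nn.Linear(hidden + 1, n_options) for _ in range(n_items)])
        # Schematic stand-in for the IRT activation mapping items -> score
        self.to_score = nn.Linear(n_items * n_options, 1)

    def forward(self, text_inputs, severity):
        h = self.shared(self.encoder(text_inputs))           # language representation
        h = torch.cat([h, severity.unsqueeze(-1)], dim=-1)   # labeler bias input
        item_logits = [head(h) for head in self.item_heads]  # 9 ordinal outputs
        score = self.to_score(torch.cat(item_logits, dim=-1)).squeeze(-1)
        return item_logits, score  # losses: ordinal cross-entropy + squared error
```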

54 of 81

Correlation of items suggests benefit from multitask approach

55 of 81

Ordinal classification with labeler bias adjustment

Item wording: "This comment calls for using violence against the group(s) you previously identified."

Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree

[Diagram: final hidden layer, with estimated labeler bias ("fixed effect") concatenated on → proportional odds latent variable → predicted probabilities for the violence item, shown using only text (no bias adjustment) and with bias adjustment; loss: ordinal cross-entropy]

(See Cao et al. 2019, rank-consistent ordinal regression)
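
A sketch of that rank-consistent (CORAL-style) ordinal head with the labeler severity concatenated onto the final hidden layer; the dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Rank-consistent ordinal output (Cao et al. 2019): one shared weight
    vector with K-1 bias terms, so the cumulative probabilities P(y > k)
    all move together along a single proportional-odds latent variable."""

    def __init__(self, hidden_dim: int, n_options: int = 5):
        super().__init__()
        # +1: labeler severity scalar concatenated onto the hidden layer.
        self.weight = nn.Linear(hidden_dim + 1, 1, bias=False)
        self.biases = nn.Parameter(torch.zeros(n_options - 1))

    def forward(self, hidden, severity):
        x = torch.cat([hidden, severity.unsqueeze(-1)], dim=-1)
        latent = self.weight(x)                    # proportional-odds latent variable
        return torch.sigmoid(latent + self.biases) # P(y > k) for k = 1..K-1

head = OrdinalHead(hidden_dim=256)
hidden = torch.randn(2, 256)         # final hidden layer (batch of 2)
severity = torch.tensor([0.0, 1.2])  # neutral vs. severe labeler
print(head(hidden, severity))        # shape (2, 4): P(y>1) .. P(y>4)
```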

56 of 81

Quadratic weighted kappa loss: cost matrix

[Figure: 5x5 cost matrix with actual ratings as rows and predicted ratings as columns; the penalty grows with the squared distance from the diagonal]

57 of 81

Quadratic weighted kappa example (actual response = option 2):

Option              1        2        3        4        5
Predicted prob     12%      18%      35%      20%      15%
Distance            1        0        1        2        3
Weight (d/4)^2     0.0625   0        0.0625   0.25     0.5625
Loss contribution  0.0075   0        0.02187  0.05     0.08438

Total loss = 0.16375

Compare to NLL: -log(0.18) = 1.715
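
A small function that reproduces the numbers above: the expected quadratic-weighted-kappa cost of a predicted distribution (options 0-indexed, so the actual response "Disagree" is index 1):

```python
import numpy as np

def qwk_loss(probs: np.ndarray, actual: int, n_options: int = 5) -> float:
    """Expected quadratic-weighted-kappa cost of a predicted distribution.

    Each option k is penalized by ((k - actual) / (K - 1))^2, weighted by
    its predicted probability."""
    k = np.arange(n_options)
    weights = ((k - actual) / (n_options - 1)) ** 2
    return float(np.sum(probs * weights))

probs = np.array([0.12, 0.18, 0.35, 0.20, 0.15])
print(qwk_loss(probs, actual=1))  # 0.16375, matching the table
print(-np.log(probs[1]))          # 1.715, the NLL of the same prediction
```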

58 of 81

Labeler bias as an auxiliary input

  • During deep learning, each observation (comment text plus the set of ratings for a given comment) will have the estimated labeler bias (severity) as an auxiliary input
  • Labeler bias is a value on the hate speech scale: centered around 0 and within (-3, +3)
  • We include this scalar value as another latent variable in the final hidden layer
  • Those values are then inputs into the latent hidden value for each item's ordinal prediction
    • (Which is evaluated with quadratic weighted kappa loss)
  • The effect of the bias input is that the neural network can adjust its probability predictions for each item based on whether the rater for that observation was more or less severe.
    • Ex.: based on the text of a comment, the network might predict "strongly agree" for the genocide item
    • But if it knows the rater is severe, it should shift its prediction down, e.g. to "agree" or even "disagree"

59 of 81

Categorical classification with labeler bias adjustment

Item wording: "This comment calls for using violence against the group(s) you previously identified."

Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree

[Diagram: final hidden layer, with estimated labeler bias ("fixed effect") concatenated on → softmax activation → predicted probabilities for the violence item, shown using only text (no bias adjustment) and with bias adjustment; loss: categorical cross-entropy]

60 of 81

Rasch scaling of the deep learning model

  • Trained model (autograder) predicts a probability distribution for each item's response
    • Given the raw text of the comment, and the fixed rater severity (set to 0 presumably)
    • Note: this differs from human labels where we do not know the underlying probability distribution
  • We take the highest probability response for each item
    • Similar to how a person would select the best response option
    • Could we instead leverage the estimated probability distribution?
      • Ex: take expected value of item response (weighted average)
  • Run partial credit scaling (only a single rater now)
    • Anchor item difficulty and item step parameters to those from Faceted Rasch model
    • We use Facets software currently, but in theory could use Conquest, TAM, etc.
      • Would require transforming the PCM parameterization used for anchoring
  • We now have scaled results and a look-up table mapping total score to measure
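
A sketch of the response-selection step, contrasting the highest-probability rule we use with the expected-value alternative raised above:

```python
import numpy as np

probs = np.array([0.12, 0.18, 0.35, 0.20, 0.15])  # predicted distribution over 5 options

# Rule we use: take the highest-probability response, like a human would.
argmax_response = int(np.argmax(probs)) + 1  # -> option 3 ("Neutral")

# Alternative: expected value of the item response (weighted average),
# which preserves information from the full distribution.
expected_response = float(np.sum(probs * np.arange(1, 6)))  # -> 3.08

print(argmax_response, round(expected_response, 2))
```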

61 of 81

Facets item fit statistics from deep learning ratings

62 of 81

Statistics of ordinal classification

(Add in some here)

63 of 81

Results

64 of 81

Future work

  • Partnerships - interest from Facebook, Google, Pinterest, Blizzard, et al.
  • Causal inference (interrupted time series, randomized interventions, user accounts)
  • Listening to victims: collect stories and experiences of hate speech
  • Focus on genocide in developing countries (Sri Lanka, Myanmar, India, Brazil)
  • Improved labeling: incorporate message context
  • New platforms: Facebook, Instagram, Wikipedia, game chats (Blizzard)
  • New languages
  • New constructs: toxicity, sentiment
  • New data types: images, audio, video
  • Other applications: automated essay grading, surgical skill evaluation, etc.
  • Exploring a possible patent application

65 of 81

Concluding inspirational quotation

66 of 81

Comments, questions, feedback?

hatespeech.berkeley.edu

ck37@berkeley.edu

cvacano@berkeley.edu

67 of 81

Appendix

68 of 81

Crowdsource worker quality analysis

69 of 81

Crowdsource worker quality: identity rate

70 of 81

Worker quality: mean-squared statistic vs. identity rate

71 of 81

Worker quality: mean-squared statistic vs. identity rate

72 of 81

73 of 81

Scaled reference set - initial

74 of 81

Scaled reference set - revised

75 of 81

Estimating thresholds for theorized levels

76 of 81

Distribution across social media platforms

77 of 81

Insufficiency of a single binary hate item

78 of 81

Implementation diagram

79 of 81

Technical implementation: Google serverless functions

[Diagram: raters recruited via Amazon Mechanical Turk complete the labeling instrument in Qualtrics; the survey calls a pool of Google Cloud serverless functions to reserve a comment batch at the start of a session and mark it complete at the end, reading comment batches from and writing ratings to a Cloud SQL database]
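
A minimal sketch of what one pooled serverless function might look like, using Google's Python functions-framework; the db helper module, table names, and SQL are hypothetical stand-ins for our actual implementation:

```python
import functions_framework
from flask import Request, jsonify

# Hypothetical helper that connects to the Cloud SQL database;
# `db` is not a real module, it stands in for our SQL layer.
from db import get_connection

@functions_framework.http
def reserve_comment_batch(request: Request):
    """Called by the Qualtrics survey when a rater starts a session:
    atomically reserve the next unassigned comment batch for that rater."""
    rater_id = request.args.get("rater_id")
    with get_connection() as conn:
        row = conn.execute(
            "UPDATE batches SET reserved_by = %s "
            "WHERE id = (SELECT id FROM batches WHERE reserved_by IS NULL "
            "            ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED) "
            "RETURNING id, comment_ids", (rater_id,)).fetchone()
    if row is None:
        return jsonify({"error": "no batches available"}), 404
    return jsonify({"batch_id": row[0], "comment_ids": row[1]})
```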

80 of 81

Learning to rate: labeler bias as auxiliary input (violence item)

Item wording: "This comment calls for using violence against the group(s) you previously identified."

Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree

[Diagram: raw comment text → deep NLP encoder (USE, XLNet, RoBERTa, ULMFiT) → language representation → fully connected hidden layers → final hidden layer, with estimated labeler bias ("fixed effect") concatenated on → proportional odds latent variable → output for the violence item; loss: quadratic weighted kappa]

81 of 81

Learning to rate: labeler bias as auxiliary input (violence item)

Item wording: "This comment calls for using violence against the group(s) you previously identified."

Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree

[Diagram: raw comment text → deep NLP encoder (USE, XLNet, RoBERTa, ULMFiT) → language representation → fully connected hidden layers → final hidden layer, with estimated labeler bias ("fixed effect") concatenated on → softmax activation → output for the violence item; loss: categorical cross-entropy]