Constructing interval variables via faceted Rasch measurement and multitask deep learning
Debiased, explainable, interval measurement of hate speech
November 2020
Research Team
And with special thanks to:
Binary vs. interval variables
Question: What's the temperature today?
Scientific goals of our method
Categorical, ordinal, and interval variables
Applicable to two types of supervised outcomes to measure
The standard approach is limited, and is not considered measurement
Machine learning model
[Comment someone makes on Twitter]
Hey AI, is that comment hate speech?
Research team, social media platform, or judge/jury
I estimate 37% probability of being hate speech.
Our method measures hate speech as an interval variable, and explains why
Our machine learning model
[Comment someone makes on Twitter]
Hey AI, where do you place this comment on your hate speech scale?
Research team, social media platform, or judge/jury
I estimate the comment at 2.5 (+/- 0.3) on the hate speech scale - an extremely hateful comment. My reasoning is that this comment appears to have strongly negative sentiment (75% certainty), likely threatens violence (85% certainty), includes an identity group target (99%), and is likely humiliating to the target group (92%).
Agenda for Talk
How does our method work? Details to be described
Standard machine learning approach
New approach
Review our scientific contribution
Comparison to related work
Our method applies to any human-rated data used for supervised classification or regression
| Examples: Text | Examples: Images |
| --- | --- |
| Hate speech | Radiological image review (e.g. CT severity index for acute pancreatitis) |
| Toxic language / bullying | Grading of agricultural produce |
| Sentiment | Satellite image rating for development |
| Essay grading | Pornography detection |
| Conference abstract or article review | Artist identification of paintings |
| | Microscopy analysis of liver biopsy |
Also: time-series, like ECG classification. Other ideas from you?
Theory development
Construct Map: theoretical levels of hate speech
A qualitative ordered value; it does not reflect an interval value on the final hate speech scale
Reference set: empirical grounding of theory
Components of hate speech
Survey details
Comment Collection
Stream comments
Reddit: Most recently published comments on any post in /r/all.
Twitter: Most recent tweets from their streaming API.
YouTube: Search for videos around major US cities, take all comments on them.
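As a concrete illustration of the Reddit stream, a minimal sketch using the praw library; the credentials are placeholders and the collect_comments helper is invented here, so this is not the project's actual collection or storage pipeline:

```python
# Minimal sketch of streaming recently published comments on any post in /r/all with praw.
# Credentials are placeholders; downstream storage is omitted.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="comment-collector (sketch)",
)

def collect_comments(limit=1000):
    """Yield the most recently published comments on any post in /r/all."""
    for i, comment in enumerate(reddit.subreddit("all").stream.comments(skip_existing=True)):
        yield {"id": comment.id, "body": comment.body, "created_utc": comment.created_utc}
        if i + 1 >= limit:
            break
```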
Class imbalance, statistical power, & budget limits
Sample comments
We’ve collected over 75 million comments, but we only want to annotate 50k.
Over-sample comments with identity groups, and stratify on estimated hatefulness.
Allocation: 20k / 20k / 10k
Comment batch creation
Augment comments
Perspective API: Trained NLP models from Jigsaw for detecting various kinds of abusive language. We use their identity attack and threat models.
Word embeddings help us answer “How relevant is this comment to the identity groups we’re looking for?”
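A minimal sketch of one way to score that relevance question with off-the-shelf sentence embeddings; the embedding model name and the identity-term list are illustrative assumptions, not the exact resources used in the pipeline:

```python
# Sketch: score "how relevant is this comment to the identity groups we're looking for?"
# via cosine similarity between comment embeddings and identity-term embeddings.
# The model and the identity-term list are placeholders for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
identity_terms = ["women", "Black people", "immigrants", "Muslims", "gay people"]

def identity_relevance(comments):
    """Return, for each comment, its maximum cosine similarity to any identity term."""
    term_emb = model.encode(identity_terms, normalize_embeddings=True)
    comment_emb = model.encode(comments, normalize_embeddings=True)
    sims = comment_emb @ term_emb.T        # cosine similarity (embeddings are unit-norm)
    return sims.max(axis=1)                # one relevance score per comment
```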
Bin comments
We use the metadata added in step 2 to bin the comments into 5 bins:
Stratification: maximize power without eliminating any cells
| | Positive | Neutral | Low Hate | High Hate |
| --- | --- | --- | --- | --- |
| Identity groups | 7,500 | 5,000 | 18,300 | 14,200 |
| No identity groups | 5,000 (single bin, not split by hatefulness) | | | |
Hypothesis dimension: E[ hate score | X ]
Relevance dimension: Pr[ identity groups = 1 | X ]
Total labeling budget: 50,000 comments
Comments downloaded: 75 million
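A toy sketch of the binning and stratified sampling step. It assumes each comment already carries the Perspective API scores and identity-relevance metadata from step 2; the column names and bin cutoffs are invented, and only the five-bin structure and target counts come from the table above:

```python
# Sketch: bin downloaded comments into the 5 strata above and sample the labeling budget.
# Column names and thresholds are illustrative; the target counts match the table.
import pandas as pd

TARGETS = {"positive": 7_500, "neutral": 5_000, "low_hate": 18_300,
           "high_hate": 14_200, "no_identity": 5_000}

def assign_bin(row, relevance_cutoff=0.3, low_cutoff=0.2, high_cutoff=0.6):
    if row["identity_relevance"] < relevance_cutoff:
        return "no_identity"
    hate = max(row["identity_attack"], row["threat"])   # Perspective API model scores
    if hate >= high_cutoff:
        return "high_hate"
    if hate >= low_cutoff:
        return "low_hate"
    return "positive" if row["sentiment"] > 0 else "neutral"

def stratified_sample(comments: pd.DataFrame, seed=0) -> pd.DataFrame:
    """Stratify on estimated hatefulness and identity relevance, then sample each bin."""
    comments = comments.assign(bin=comments.apply(assign_bin, axis=1))
    parts = [grp.sample(n=min(TARGETS[name], len(grp)), random_state=seed)
             for name, grp in comments.groupby("bin")]
    return pd.concat(parts)
```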
Sampling design for human review of comments
Naive annotation plan can lead to distinct networks with disjoint subsets
[Diagram: Batch 1, Batch 2, and Batch 3 each reviewed by a disjoint subset of raters R1-R9, producing three disconnected rater-comment networks]
Overlapping reviews lead to a single linked network of raters + comments
[Diagram: comments 1-7 and Raters A-E; each comment is reviewed by several raters and each rater reviews several comments, linking all labelers / annotators and comments into one network]
Unfolded version of the same network
Densely linked network for human labeler debiasing
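A small sketch of one way to produce such a densely linked design: give every comment to several raters and rotate assignments through the rater pool so that raters share comments. The batch size and raters-per-comment values are placeholders, not the study's actual design parameters:

```python
# Sketch: assign each comment to k raters, rotating through the rater pool so that
# raters overlap across comments and the rater-comment graph stays connected.
from itertools import cycle

def assign_raters(comment_ids, rater_ids, raters_per_comment=3):
    """Return {comment_id: [rater_id, ...]} using a rotating, overlapping assignment."""
    pool = cycle(rater_ids)
    assignment = {}
    for cid in comment_ids:
        assignment[cid] = [next(pool) for _ in range(raters_per_comment)]
    return assignment

# Example: 7 comments, 5 raters, 3 ratings per comment -> every rater shares comments
# with several other raters, yielding one connected network.
print(assign_raters(range(1, 8), ["A", "B", "C", "D", "E"]))
```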
Scaling
Overview of item response theory scaling
Item response theory estimation goal (slightly simplified)
Predict probability of response option R on item I for comment C by annotator A
Based on the subtraction formula:
(hate score for comment C) − (hate score for item I) − (annotator A's bias, a.k.a. severity) − (hate score for response option R)
See formula 1 in manuscript for the more technical version
Item, annotator, and response-option terms: fixed effects
Comment hate score: the latent variable of interest (random effect)
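As a numeric illustration of that subtraction, a small sketch of an Andrich-style rating scale facets model on the logit scale; the parameter values below are invented, and formula 1 in the manuscript gives the exact parameterization:

```python
# Numeric sketch of a many-facet rating scale model: category probabilities depend on
# (comment hate score) - (item score) - (annotator severity) - (response-option thresholds).
# Parameter values are invented for illustration only.
import numpy as np

def category_probs(theta_c, delta_i, beta_a, taus):
    """P(response = k) for k = 0..K under an Andrich-style rating scale facets model."""
    logits = theta_c - delta_i - beta_a - np.asarray(taus)   # one term per step threshold
    cum = np.concatenate([[0.0], np.cumsum(logits)])         # category 0 has an empty sum
    expcum = np.exp(cum - cum.max())                         # stabilize before normalizing
    return expcum / expcum.sum()

# A hateful comment (theta = 2.0), a moderately "hard" item, a lenient annotator
# (beta = -0.5), and four step thresholds for a 5-point agreement scale:
print(category_probs(theta_c=2.0, delta_i=0.5, beta_a=-0.5, taus=[-1.5, -0.5, 0.5, 1.5]))
```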
Estimation methods for IRT
Joint maximum likelihood (JML), marginal maximum likelihood (MML), conditional maximum likelihood (CML), and non-parametric approaches
Scaling results from item response theory
Scale levels, from least to most hateful: Supportive, Counterspeech, Neutral, Somewhat hateful, Very hateful, Most hateful
Reliability: 0.94!
With Thurstonian thresholds (v3)
Item fit statistics (v3)
Respect
Dehumanize
Violence
Genocide
Hate speech (binary)
With Thurstonian thresholds (v4)
Improved fit statistics (v4)
Disordered item step thresholds (Rasch-Andrich)
Andrich & Pedler. (2019). “Modelling ordinal assessments: fit is not sufficient”. In:
Communications in Statistics-Theory and Methods 48.12, pp. 2932–2947
Revised scale with 6 items
Reliability: 0.92
Sentiment
Respect
Insult
Humiliate
Status
Attack-Defend
Revised scale with 6 items
Example scaling results (trigger warning)
Distribution across social media platforms
We have created a measure of our construct.
Can we predict it ("auto-grade") with machine learning on raw text?
Short Circuit (1986)
Current best practice in supervised NLP
Raw comment text → Deep NLP (BERT, ALBERT, RoBERTa, T5, USE) → language representation → fully connected layers (latent variables related to hate speech) → binary hate speech status
Learning to rate
Neural architecture for predicting a continuous score with multiple intermediate outcomes (multitask), labeler bias adjustment, and IRT activation:
Raw comment text → Deep NLP (BERT, ALBERT, RoBERTa, T5, USE) → language representation → fully connected hidden layers, with the estimated labeler bias ("fixed effect") as an auxiliary input
→ Intermediate ordinal outcomes: ratings on the hate scale items (1. Sentiment, 2. Respect, 3. Insult, 4. Humiliate, 5. Status, 6. Dehumanize, 7. Violence, 8. Genocide, 9. Attack-Defend); loss: ordinal cross-entropy
→ Item Response Theory as the non-linear activation function
→ Final outcome: continuous hate score; loss: squared-error
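A minimal PyTorch sketch of this multitask architecture. The encoder is replaced by a precomputed text embedding, layer sizes are placeholders, and the IRT scoring step is reduced to a learned linear map for illustration:

```python
# Sketch of the multitask "learning to rate" architecture: shared text representation,
# labeler-bias input, one ordinal head per survey item, and a continuous hate score head.
# Sizes, the encoder, and the IRT activation are simplified placeholders.
import torch
import torch.nn as nn

ITEMS = ["sentiment", "respect", "insult", "humiliate", "status",
         "dehumanize", "violence", "genocide", "attack_defend"]

class LearningToRate(nn.Module):
    def __init__(self, text_dim=768, hidden_dim=256, n_categories=5):
        super().__init__()
        self.hidden = nn.Sequential(                 # fully connected hidden layers;
            nn.Linear(text_dim + 1, hidden_dim),     # +1 for the labeler bias input
            nn.ReLU(),
        )
        # One ordinal output per item (trained with an ordinal cross-entropy loss).
        self.item_heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, n_categories) for name in ITEMS}
        )
        # Stand-in for the IRT activation mapping item ratings to a continuous score.
        self.irt_score = nn.Linear(len(ITEMS) * n_categories, 1)

    def forward(self, text_embedding, labeler_bias):
        h = self.hidden(torch.cat([text_embedding, labeler_bias], dim=-1))
        item_logits = {name: head(h) for name, head in self.item_heads.items()}
        stacked = torch.cat([item_logits[name].softmax(dim=-1) for name in ITEMS], dim=-1)
        hate_score = self.irt_score(stacked)         # continuous score, squared-error loss
        return item_logits, hate_score

# Usage: a batch of 2 precomputed text embeddings plus each rater's estimated bias.
model = LearningToRate()
logits, score = model(torch.randn(2, 768), torch.zeros(2, 1))
```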
Correlation of items suggests benefit from multitask approach
Ordinal classification with labeler bias adjustment
Output: Violence item. Wording: "This comment calls for using violence against the group(s) you previously identified."
Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree
Final hidden layer, with the estimated labeler bias ("fixed effect") concatenated onto it → proportional odds latent variable → predicted ordinal probabilities
(See Cao et al. 2019, rank-consistent ordinal regression)
Loss: ordinal cross-entropy
Compare: predicted probabilities using only text (no bias adjustment) vs. predicted probabilities with bias adjustment
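A sketch of a rank-consistent ordinal head in the spirit of Cao et al. 2019, with the labeler bias concatenated onto the final hidden layer; the dimensions, names, and label coding are illustrative assumptions:

```python
# Sketch of a CORAL-style ordinal head (Cao et al. 2019): one shared weight vector and
# K-1 cutpoints give rank-consistent P(rating > k) probabilities. The labeler bias is
# concatenated onto the final hidden layer before the head.
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    def __init__(self, hidden_dim=256, n_categories=5):
        super().__init__()
        self.shared = nn.Linear(hidden_dim + 1, 1, bias=False)   # +1 for labeler bias
        self.cutpoints = nn.Parameter(torch.zeros(n_categories - 1))

    def forward(self, final_hidden, labeler_bias):
        z = self.shared(torch.cat([final_hidden, labeler_bias], dim=-1))  # latent variable
        return torch.sigmoid(z + self.cutpoints)   # P(rating > k), one per threshold

def ordinal_loss(exceed_probs, labels, n_categories=5):
    """Binary cross-entropy over the K-1 'rating exceeds threshold k' indicators."""
    targets = torch.stack(
        [(labels > k).float() for k in range(n_categories - 1)], dim=-1)
    return nn.functional.binary_cross_entropy(exceed_probs, targets)

head = OrdinalHead()
probs = head(torch.randn(4, 256), torch.zeros(4, 1))     # 4 comments, zero (average) bias
loss = ordinal_loss(probs, torch.tensor([0, 2, 4, 3]))   # ratings coded 0..4
```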
Quadratic weighted kappa loss: cost matrix of weights between the predicted and actual categories
Quadratic weighted kappa example (true response: option 2, "Disagree"):
| | 1. Strongly disagree | 2. Disagree | 3. Neutral | 4. Agree | 5. Strongly agree |
| --- | --- | --- | --- | --- | --- |
| Predicted prob | 12% | 18% | 35% | 20% | 15% |
| Distance from true | 1 | 0 | 1 | 2 | 3 |
| Weight = (distance / 4)² | 0.0625 | 0 | 0.0625 | 0.25 | 0.5625 |
| Loss contribution | 0.0075 | 0 | 0.02187 | 0.05 | 0.08438 |
Total loss = 0.16375
Compare to NLL: -log(0.18) = 1.715
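The worked example can be reproduced in a few lines; this sketch treats the quadratic weighted kappa loss for a single comment as the probability-weighted sum of squared normalized distances (the production loss may differ in detail):

```python
# Sketch: quadratic-weight cost of a predicted distribution versus the true category.
# Weights are (distance / (K-1))^2; the loss is the probability-weighted sum of weights.
import numpy as np

def qwk_loss(pred_probs, true_idx):
    pred_probs = np.asarray(pred_probs, dtype=float)
    k = len(pred_probs)
    distance = np.abs(np.arange(k) - true_idx)
    weights = (distance / (k - 1)) ** 2
    return float(np.sum(pred_probs * weights))

# Reproduces the table above: the true category is option 2 ("Disagree").
probs = [0.12, 0.18, 0.35, 0.20, 0.15]
print(qwk_loss(probs, true_idx=1))   # 0.16375
print(-np.log(probs[1]))             # 1.715, the NLL on the same prediction
```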
Labeler bias as an auxiliary input
Categorical classification with labeler bias adjustment
Output: Violence item. Wording: "This comment calls for using violence against the group(s) you previously identified."
Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree
Final hidden layer, with the estimated labeler bias ("fixed effect") concatenated onto it → softmax activation → predicted category probabilities
Loss: categorical cross-entropy
Compare: predicted probabilities using only text (no bias adjustment) vs. predicted probabilities with bias adjustment
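One way to obtain the two sets of predicted probabilities being compared is to evaluate the same trained head twice: once with the rater's estimated bias and once with a zero (average rater) input. A hypothetical sketch:

```python
# Sketch: the same classification head evaluated with and without the labeler-bias input.
# Feeding zero (an "average" rater) gives the unadjusted prediction; feeding the rater's
# estimated severity gives the bias-adjusted prediction. Names and sizes are placeholders.
import torch
import torch.nn as nn

head = nn.Linear(256 + 1, 5)          # final hidden layer (+1 bias input) -> 5 categories

def predict(final_hidden, labeler_bias):
    logits = head(torch.cat([final_hidden, labeler_bias], dim=-1))
    return logits.softmax(dim=-1)     # categorical (softmax) probabilities

h = torch.randn(1, 256)                            # final hidden layer for one comment
p_unadjusted = predict(h, torch.zeros(1, 1))       # only text, no bias adjustment
p_adjusted = predict(h, torch.tensor([[0.8]]))     # adjusted for a severe rater (+0.8)
```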
Rasch scaling of the deep learning model
Facets item fit statistics from deep learning ratings
Statistics of ordinal classification
(Add in some here)
Results
Future work
Concluding inspirational quotation
Appendix
Crowdsource worker quality analysis
Crowdsource worker quality: identity rate
Worker quality: mean-squared statistic vs. identity rate
Worker quality: mean-squared statistic vs. identity rate
Scaled reference set - initial
Scaled reference set - revised
Estimating thresholds for theorized levels
Distribution across social media platforms
Insufficiency of a single binary hate item
Implementation diagram
Technical implementation: Google serverless functions
Labeling instrument: Qualtrics survey. Rater recruitment: Amazon Mechanical Turk.
Google Cloud: SQL database storing comment batches and ratings; a serverless functions pool handles "reserve comment batch" and "complete comment batch" requests.
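A toy sketch of the logic behind the "reserve comment batch" and "complete comment batch" functions, using SQLite as a stand-in for the Cloud SQL database; the table and column names are invented:

```python
# Toy sketch of the batch workflow: a rater's session claims an available comment batch,
# then marks it complete once the ratings are stored. SQLite stands in for the Cloud SQL
# database; the schema here is invented for illustration.
import sqlite3

def reserve_comment_batch(conn: sqlite3.Connection, worker_id: str):
    """Claim the next available batch for a worker and return its id (None if none left)."""
    with conn:  # a single transaction approximates the claim step
        row = conn.execute(
            "SELECT batch_id FROM comment_batches WHERE status = 'available' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE comment_batches SET status = 'reserved', worker_id = ? WHERE batch_id = ?",
            (worker_id, row[0]),
        )
        return row[0]

def complete_comment_batch(conn: sqlite3.Connection, batch_id: int):
    """Mark a batch complete once its ratings have been stored."""
    with conn:
        conn.execute(
            "UPDATE comment_batches SET status = 'complete' WHERE batch_id = ?", (batch_id,)
        )
```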
Learning to rate
Labeler bias as auxiliary input (violence item)
Output: Violence item. Wording: "This comment calls for using violence against the group(s) you previously identified."
Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree
Raw comment text → Deep NLP (USE, XLNet, RoBERTa, ULMFiT) → language representation → fully connected hidden layers → final hidden layer, with the estimated labeler bias ("fixed effect") concatenated onto it → proportional odds latent variable
Loss: quadratic weighted kappa
Learning to rate
Labeler bias as auxiliary input (violence item)
Output: Violence item. Wording: "This comment calls for using violence against the group(s) you previously identified."
Response options: 1. Strongly disagree, 2. Disagree, 3. Neutral, 4. Agree, 5. Strongly agree
Raw comment text → Deep NLP (USE, XLNet, RoBERTa, ULMFiT) → language representation → fully connected hidden layers → final hidden layer, with the estimated labeler bias ("fixed effect") concatenated onto it → softmax activation
Loss: categorical cross-entropy