1 of 30

New methods, old problems

Ethics and Bias in Natural Language Processing

Ben Batorsky, PhD

Presented as part of the NLP Summit in 2021

(https://www.nlpsummit.org/nlp-2021/)

2 of 30

About me

Data Scientist, focused on NLP

PhD, Policy Analysis

https://benbatorsky.com/

Part of Ciox Health, a health information management company representing three-fifths of US hospitals

Focused on putting actionable health information and insights in the hands of researchers using AI + NLP technology

3 of 30

The “old days” of machine learning

1950s: Machine translation (MT) built on extensive rule engines

Into the era of neural models

https://towardsdatascience.com/a-logistic-regression-from-scratch-3824468b1f88

Lots of hand-engineering work

4 of 30

2013: The advent of “generalized” word embeddings

Can instead “swap in” pre-trained embeddings (see the sketch below)

Circa 2000s Language Model

A Neural Probabilistic Language Model (https://papers.nips.cc/paper/1839-a-neural-probabilistic-language-model.pdf)
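To make the “swap in” idea concrete, here is a minimal sketch (mine, not from the talk) that loads pre-trained GloVe vectors through gensim's downloader and averages them into document features; the model name and the averaging step are illustrative choices, not a prescribed recipe.

```python
# Minimal sketch: swap pre-trained word embeddings in as features.
# "glove-wiki-gigaword-100" is one of gensim's downloadable vector sets.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # 100-dim GloVe vectors

def embed_document(text):
    """Average the pre-trained vectors of in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

# These features can feed any downstream classifier in place of
# hand-engineered ones.
features = embed_document("the patient was discharged yesterday")
print(features.shape)  # (100,)
```

The embedding table is reused as-is, so whatever the pre-training corpus encoded comes along with it, which is the thread the rest of this talk pulls on.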

5 of 30

Word embeddings used across settings and languages

med2vec - Embeddings based on medical records

6 of 30

Large neural Language Models are the SOTA...and they keep getting larger

Vaswani, Ashish, et al. "Attention is all you need." https://arxiv.org/pdf/1706.03762.pdf

On the Dangers of Stochastic Parrots (Bender 2021)

7 of 30

Neural model performance on Machine Translation

8 of 30

But remember: Garbage in, garbage out

The Internet

9 of 30

Where does the data for these methods come from?

  • Word2vec (Mikolov 2013)
    • Google News dataset
  • BERT (Devlin 2018)
    • English Wikipedia
    • BooksCorpus (a large corpus of unpublished books)
  • GPT-3 (Brown 2020)
    • CommonCrawl
    • WebText
    • Wikipedia
    • Book corpora

10 of 30

Whose perspectives are represented in Wikipedia and web text?

  • Wikipedia
    • 2018 survey across all language editions of Wikipedia: 90% of editors are male (1)
    • 41% of biographies nominated for deletion are about women (2)
      • Only 17% of published biographies are about women
  • Google News
    • Men more often the “face” of articles, across topics and outlets (1)
    • Newsrooms dominated by white men (3)
  • General internet users
    • 67% of Reddit users are male, 70% are white (4)
    • Inclusion methodology for CommonCrawl likely to skew towards “dominant” viewpoints (5)
  1. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Countering_systemic_bias
  2. Ms. Categorized: Gender, notability, and inequality on Wikipedia
  3. Men outnumber women in US newsrooms. It's no different among fact-checkers. – Poynter
  4. Reddit news users more likely to be male, young and digital in their news preferences
  5. On the Dangers of Stochastic Parrots (Bender 2021)

11 of 30

What are the impacts of flawed AI systems?

  • COMPAS model to predict risk of reoffending
  • 2016: ProPublica investigation
    • Black defendants were 77% more likely than white defendants to be flagged as high risk of violent reoffending
    • 48% of white defendants who DID reoffend were labelled low risk (vs. 28% of black defendants)
  • What is represented in the training set for this model?

12 of 30

GPT-2 producing problematic passages from race/gender/orientation seeds

13 of 30

Ethics and bias (what we’ll be talking about)

  • Ethics - “a set of moral principles : a theory or system of moral values” (1)
    • “Principled AI” - set of principles for AI usage and development (2)
      • Focusing here on “Fairness and non-discrimination”: design in favor of inclusivity
  • Bias
    • Generally: Disproportionate weight in favor/against a particular thing/idea
    • Statistical bias: “expected value of the results differs from the true underlying quantitative parameter being estimated” (Wikipedia)
  • Will discuss some examples and how to support fair use and reduce bias
  1. https://www.merriam-webster.com/dictionary/ethic
  2. Principled Artificial Intelligence: Mapping Consensus in Ethical and Rights-based Approaches to Principles for AI

14 of 30

(Some) Types of bias in ML

  • Historical Bias
    • Already existing bias and socio-technical issues in the world represented in data
    • Example: Incarceration rates of different populations as a product of institutional bias
  • Representation Bias
    • Results from the way we define and sample from a population
    • Example: ImageNet contains certain types of people doing certain activities, which will push models towards those representations
  • Aggregation Bias
    • Drawing potentially false conclusions about some subgroups based on other subgroups.
    • Example: Same disease, different populations, different trajectories

15 of 30

Gender/racial bias in word embeddings

Word vector similarity between gender words and occupations

Word embeddings quantify 100 years of gender and ethnic stereotypes (https://www.pnas.org/content/pnas/115/16/E3635.full.pdf)

Racial/ethnic bias in the “most similar” occupations

16 of 30

Word embeddings encode history

Word embeddings quantify 100 years of gender and ethnic stereotypes (https://www.pnas.org/content/pnas/115/16/E3635.full.pdf)

17 of 30

Try it yourself
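One way to try it yourself with off-the-shelf tools, in the spirit of the gender-occupation analysis on the previous slides: compare how close occupation words sit to gendered pronouns in pre-trained vectors. This is my own sketch; the GloVe model name and the word list are assumptions.

```python
# Sketch: how do occupation words associate with gendered pronouns
# in pre-trained embeddings?
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

occupations = ["nurse", "engineer", "housekeeper", "programmer", "teacher"]
for occ in occupations:
    she = vectors.similarity(occ, "she")
    he = vectors.similarity(occ, "he")
    # Positive gap = closer to "she", negative gap = closer to "he"
    print(f"{occ:12s} she-he similarity gap: {she - he:+.3f}")
```

Swapping in a different pre-trained model, or vectors trained on your own corpus, makes the same comparison across sources immediate.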

18 of 30

(Some) Definitions of fairness

  • Group-level fairness
    • Statistical parity: each group receives positive predictions at the same rate (see the sketch below)
  • Individual-level fairness
    • Similar individuals receive similar outcomes
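To ground the group-level definition, here is a tiny sketch (toy data, my own) that checks statistical parity by comparing positive-prediction rates across groups.

```python
# Sketch: statistical (demographic) parity check on toy predictions.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])                # model decisions
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # protected attribute

rates = {g: y_pred[group == g].mean() for g in np.unique(group)}
print(rates)                                      # positive rate per group
print(max(rates.values()) - min(rates.values()))  # parity gap; 0 = parity
```

Individual-level fairness is harder to operationalize, since it needs a task-appropriate notion of which individuals count as “similar”.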

19 of 30

Language under-representation: ~90% of languages have almost no labelled data available

Availability of data by language class (see table)

20 of 30

Which perspectives are represented in labels?

Locations of Mechanical Turk labellers

http://turktools.net/crowdsourcing/

21 of 30

Upshot: Variable performance across languages; multi-language models generally perform worse than monolingual ones

22 of 30

Fairness: Representation that better matches the speaker distribution

“Technology cannot be accessible if it is only available for English speakers with a standard accent.” (Sebastian Ruder https://ruder.io/nlp-beyond-english/)

“The handful of languages on which NLP systems are trained and tested are often related...leading to a typological echo-chamber. As a result, a vast majority of typologically diverse linguistic phenomena are never seen by our NLP systems” (Joshi et al. 2020, [2004.09095] The State and Fate of Linguistic Diversity and Inclusion in the NLP World)

23 of 30

General strategies for addressing bias in prediction

  • Multi-accuracy targets
    • Breaking down metrics by different groups (see the sketch after this list)
    • But how do you determine which groups?
      • Likely groups most at-risk have limited representation
  • Affirmative action
    • Ranking within subgroups
      • Example: Top 5% of high school class
      • More detail: Top 5% of high school class stratified by education level of mother
    • Individual pairing
      • Select pairs of individuals with similar traits
      • Predict the outcome pair-wise, rather than individually
  • Issue across all of these: Intersectionality
    • Difficult/impossible to ensure fairness across all intersections of groups
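A sketch of the first strategy above, multi-accuracy targets: compute the same metric separately per group and inspect the gaps. The labels and group assignments are made up for illustration.

```python
# Sketch: break overall accuracy down by subgroup ("multi-accuracy targets").
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
group = np.array(["a", "a", "b", "b", "a", "b", "a", "b"])

overall = (y_true == y_pred).mean()
by_group = {g: (y_true[group == g] == y_pred[group == g]).mean()
            for g in np.unique(group)}
print(f"overall accuracy: {overall:.2f}")
print("per-group accuracy:", by_group)  # large gaps flag under-served groups
```

The open question from the bullets still applies: the groups most at risk may be too small in the data for these per-group numbers to be reliable.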

24 of 30

Some methods for monitoring/addressing bias

  • StereoSet (https://stereoset.mit.edu/)
    • Three scores
      • Language model score
        • How good the model is at ranking “meaningful” associations over “meaningless” ones (see the sketch after this list)
          • “My housekeeper is a Mexican” vs “My housekeeper is a round”
      • Stereotype score
        • How often “stereotype” constructions are preferred over “anti-stereotype” constructions
      • Idealized CAT score
        • Combines the above two measures into a single score (0 = worst, 100 = ideal)
  • Deon (https://deon.drivendata.org/)
    • Ethics checklist
    • Five domains: Data collection, data storage, analysis, modelling, deployment
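A rough sketch of the intuition behind StereoSet's language model score, reusing the housekeeper example from above (my own code, not StereoSet's official evaluation; GPT-2 via Hugging Face transformers is an assumed model choice):

```python
# Sketch: which continuation does a causal LM prefer?
# Illustrates the spirit of StereoSet's LM score, not its official scoring.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_nll(text):
    """Average negative log-likelihood per token (lower = more preferred)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

meaningful = "My housekeeper is a Mexican"    # example from the slide
meaningless = "My housekeeper is a round"
print(mean_nll(meaningful), mean_nll(meaningless))
# A reasonable LM should give the meaningful sentence the lower loss; the
# stereotype score makes the same comparison between stereotype and
# anti-stereotype continuations.
```

On the Deon side, the checklist can be rendered from the command line (per its documentation, e.g. `deon -o ETHICS.md`) and kept under version control with the project.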

25 of 30

De-biasing word embeddings

Equalize: adjust “equalize pairs” (e.g., he/she) so they differ only along the bias direction, by the same amount (Bolukbasi et al. 2016; see the sketch below)
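A minimal sketch of the neutralize and equalize steps for a single bias direction g (for example, the difference between the “he” and “she” vectors), assuming unit-length inputs; this is a simplified rendering of Bolukbasi et al.'s hard debiasing, not their released code.

```python
# Sketch of 1-D hard debiasing: neutralize gender-neutral words and
# equalize pairs along a single bias direction g.
import numpy as np

def neutralize(v, g):
    """Remove the component of v that lies along the bias direction g."""
    g = g / np.linalg.norm(g)
    return v - np.dot(v, g) * g

def equalize(w1, w2, g):
    """Make a pair (e.g. "he"/"she") differ only along g, by equal amounts.
    Assumes w1 and w2 are unit-length."""
    g = g / np.linalg.norm(g)
    mu = (w1 + w2) / 2
    mu_orth = mu - np.dot(mu, g) * g                 # shared, bias-free part
    scale = np.sqrt(max(0.0, 1 - np.linalg.norm(mu_orth) ** 2))
    w1_new = mu_orth + scale * np.sign(np.dot(w1 - mu, g)) * g
    w2_new = mu_orth + scale * np.sign(np.dot(w2 - mu, g)) * g
    return w1_new, w2_new
```

After these two steps, both members of an equalize pair sit equidistant from every neutralized word, which is exactly the property the next slide puts under scrutiny.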

26 of 30

We fixed it!

“To measure racial discrimination by people, we must create controlled circumstances in the real world where only race differs. For an algorithm, we can create equally controlled circumstances just by feeding it the right data and observing its behavior.”

27 of 30

We...fixed it?

  • Applying Bolukbasi’s method
    • X axis = Original bias measure
      • In Bolukbasi: Projection on gender direction
      • Negative = more female
      • Positive = more male
    • Y axis = Number of “male”-associated neighbors
  • Original not much different from debiased
    • Bias-neighbor correlation (see the sketch after this list)
      • Original: 0.75
      • Debiased: 0.61
  • Bias coded in all words, not just the ones selected for correction
    • Bias may just now be in a different direction
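For a feel of how this check works, here is a simplified sketch in the spirit of the bias-neighbor analysis (Gonen and Goldberg's “lipstick on a pig” critique): correlate each word's projection on the gender direction with the share of its nearest neighbors that lean “male”. The word list, model name, and neighbor count are illustrative, and this is not the paper's exact protocol.

```python
# Sketch: does a word's gender projection predict how "male" its
# neighbourhood is? Run on original and debiased vectors to compare.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
g = vectors["he"] - vectors["she"]
g = g / np.linalg.norm(g)

words = ["nurse", "engineer", "teacher", "doctor", "housekeeper",
         "programmer", "lawyer", "receptionist", "scientist", "librarian"]

def gender_projection(w):
    v = vectors[w]
    return np.dot(v / np.linalg.norm(v), g)  # > 0 leans "he", < 0 leans "she"

def male_neighbour_share(w, k=20):
    neighbours = [n for n, _ in vectors.most_similar(w, topn=k)]
    return np.mean([gender_projection(n) > 0 for n in neighbours])

bias = [gender_projection(w) for w in words]
neighbour_bias = [male_neighbour_share(w) for w in words]
print(np.corrcoef(bias, neighbour_bias)[0, 1])  # analogue of the correlations above
```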

28 of 30

Some questions worth asking in development and usage

  • What is the “right” data (as opposed to what is the “right now” data)?
    • Deon checklist provides some useful questions (https://deon.drivendata.org)
  • Similarly, what is the “right” method?
    • StereoSet provides scores on LMs in terms of “embedded stereotypes”
    • But is a large LM necessary? Word embeddings are extremely cheap to train on custom data
  • Who might be helped versus who might be harmed by this approach?
    • Is there a way to adjust the approach/product to reduce harm/increase accessibility?
  • And finally: Is this a worthwhile application?

29 of 30

Thank you!

More resources

30 of 30

Is it worth it?

  • Using pre-trained components introduces complexity and reduces transparency
  • Bender and Gebru article (2021) posed several questions
    • Are ever larger LMs inevitable or necessary?
    • What costs are associated with this research direction and what should we consider before pursuing it?
  • Who is taking responsibility for assessing these costs?