1 of 30

New methods, old problems

Ethics and Bias in Natural Language Processing

Ben Batorsky, PhD

Presented as part of the NLP Summit in 2021

(https://www.nlpsummit.org/nlp-2021/)

2 of 30

About me

Data Scientist, focused on NLP

PhD, Policy Analysis

https://benbatorsky.com/

Part of Ciox Health, a health information management company representing three-fifths of US hospitals

Focused on putting actionable health information and insights in the hands of researchers using AI + NLP technology

3 of 30

The “old days” of machine learning

1950s: Machine translation (MT) built on extensive rule engines

Into the era of neural models

https://towardsdatascience.com/a-logistic-regression-from-scratch-3824468b1f88

Lots of hand-engineering work

4 of 30

2013: The advent of “generalized” word embeddings

Can instead “swap in” pre-trained embeddings (see the sketch below)

Circa 2000s Language Model

A Neural Probabilistic Language Model (https://papers.nips.cc/paper/1839-a-neural-probabilistic-language-model.pdf)
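To make the “swap in” idea concrete, here is a minimal sketch (mine, not from the talk) that loads pre-trained GloVe vectors through gensim's downloader and averages them into document features; the model name and the averaging step are illustrative choices, not a prescribed recipe.

```python
# Minimal sketch: swap pre-trained word embeddings in as features.
# "glove-wiki-gigaword-100" is one of gensim's downloadable vector sets.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # 100-dim GloVe vectors

def embed_document(text):
    """Average the pre-trained vectors of in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

# These features can feed any downstream classifier in place of
# hand-engineered ones.
features = embed_document("the patient was discharged yesterday")
print(features.shape)  # (100,)
```

The embedding table is reused as-is, so whatever the pre-training corpus encoded comes along with it, which is the thread the rest of this talk pulls on.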

5 of 30

Word embeddings used across settings and languages

med2vec - Embeddings based on medical records

6 of 30

Large neural Language Models are the SOTA...and they keep getting larger

Vaswani, Ashish, et al. "Attention is all you need." https://arxiv.org/pdf/1706.03762.pdf

On the Dangers of Stochastic Parrots (Bender 2021)

7 of 30

Neural model performance on Machine Translation

8 of 30

But remember: Garbage in, garbage out

The Internet

9 of 30

Where does the data for these methods come from?

  • Word2vec (Mikolov 2013)
    • Google News dataset
  • BERT (Devlin 2018)
    • English Wikipedia
    • BooksCorpus (a large corpus of unpublished books)
  • GPT-3 (Brown 2020)
    • CommonCrawl
    • WebText
    • Wikipedia
    • Book corpora

10 of 30

Whose perspectives are represented in Wikipedia and web text?

  • Wikipedia
    • 2018 survey across all language editions of Wikipedia: 90% of editors are male (1)
    • 41% of biographies nominated for deletion are about women (2)
      • Only 17% of published biographies are about women
  • Google News
    • Men more often the “face” of articles, across topics and outlets (1)
    • Newsrooms dominated by white men (3)
  • General internet users
    • 67% of Reddit users are male, 70% are white (4)
    • Inclusion methodology for CommonCrawl likely to skew towards “dominant” viewpoints (5)
  1. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Countering_systemic_bias
  2. Ms. Categorized: Gender, notability, and inequality on Wikipedia
  3. Men outnumber women in US newsrooms. It's no different among fact-checkers. – Poynter
  4. Reddit news users more likely to be male, young and digital in their news preferences
  5. On the Dangers of Stochastic Parrots (Bender 2021)

11 of 30

What are the impacts of flawed AI systems?

  • COMPAS model to predict risk of reoffending
  • 2016: ProPublica investigation
    • Black defendants were 77% more likely than white defendants to be flagged as high risk of violent reoffending
    • 48% of white defendants who DID reoffend were labelled low risk (vs. 28% of black defendants)
  • What is represented in the training set for this model?

12 of 30

GPT-2 producing problematic passages from race/gender/orientation seeds

13 of 30

Ethics and bias (what we’ll be talking about)

  • Ethics - “a set of moral principles : a theory or system of moral values” (1)
    • “Principled AI” - set of principles for AI usage and development (2)
      • Focusing here on “Fairness and non-discrimination”: design in favor of inclusivity
  • Bias
    • Generally: Disproportionate weight in favor/against a particular thing/idea
    • Statistical bias: “expected value of the results differs from the true underlying quantitative parameter being estimated” (Wikipedia)
  • Will discuss some examples and how to support fair use and reduce bias
  1. https://www.merriam-webster.com/dictionary/ethic
  2. Principled Artificial Intelligence: Mapping Consensus in Ethical and Rights-based Approaches to Principles for AI

14 of 30

(Some) Types of bias in ML

  • Historical Bias
    • Already existing bias and socio-technical issues in the world represented in data
    • Example: Incarceration rates of different populations as a product of institutional bias
  • Representation Bias
    • Results from the way we define and sample from a population
    • Example: ImageNet contains certain types of people doing certain activities, which will push models towards those representations
  • Aggregation Bias
    • Drawing potentially false conclusions about some subgroups based on other subgroups.
    • Example: Same disease, different populations, different trajectories

15 of 30

Gender/racial bias in word embeddings

Word vector similarity between gender words and occupations

Word embeddings quantify 100 years of gender and ethnic stereotypes (https://www.pnas.org/content/pnas/115/16/E3635.full.pdf)

Racial/ethnic bias in the “most similar” occupations

16 of 30

Word embeddings encode history

Word embeddings quantify 100 years of gender and ethnic stereotypes (https://www.pnas.org/content/pnas/115/16/E3635.full.pdf)

17 of 30

Try it yourself
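One way to try it yourself with off-the-shelf tools, in the spirit of the gender-occupation analysis on the previous slides: compare how close occupation words sit to gendered pronouns in pre-trained vectors. This is my own sketch; the GloVe model name and the word list are assumptions.

```python
# Sketch: how do occupation words associate with gendered pronouns
# in pre-trained embeddings?
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

occupations = ["nurse", "engineer", "housekeeper", "programmer", "teacher"]
for occ in occupations:
    she = vectors.similarity(occ, "she")
    he = vectors.similarity(occ, "he")
    # Positive gap = closer to "she", negative gap = closer to "he"
    print(f"{occ:12s} she-he similarity gap: {she - he:+.3f}")
```

Swapping in a different pre-trained model, or vectors trained on your own corpus, makes the same comparison across sources immediate.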

18 of 30

(Some) Definitions of fairness

  • Group-level fairness
    • Statistical parity: each group receives positive predictions at the same rate (see the sketch below)
  • Individual-level fairness
    • Similar individuals receive similar outcomes
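To ground the group-level definition, here is a tiny sketch (toy data, my own) that checks statistical parity by comparing positive-prediction rates across groups.

```python
# Sketch: statistical (demographic) parity check on toy predictions.
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])                # model decisions
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # protected attribute

rates = {g: y_pred[group == g].mean() for g in np.unique(group)}
print(rates)                                      # positive rate per group
print(max(rates.values()) - min(rates.values()))  # parity gap; 0 = parity
```

Individual-level fairness is harder to operationalize, since it needs a task-appropriate notion of which individuals count as “similar”.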

19 of 30

Language under-representation: ~90% of languages have almost no labelled data available

Availability of data by language class (see table)

20 of 30

Which perspectives are represented in labels?

Locations of Mechanical Turk labellers

http://turktools.net/crowdsourcing/

21 of 30

Upshot: Variable performance across languages; multi-language models generally perform worse than monolingual ones

22 of 30

Fairness: Representation that better matches the speaker distribution

“Technology cannot be accessible if it is only available for English speakers with a standard accent.” (Sebastian Ruder https://ruder.io/nlp-beyond-english/)

“The handful of languages on which NLP systems are trained and tested are often related...leading to a typological echo-chamber. As a result, a vast majority of typologically diverse linguistic phenomena are never seen by our NLP systems” (Joshi et al. 2020, [2004.09095] The State and Fate of Linguistic Diversity and Inclusion in the NLP World)

23 of 30

General strategies for addressing bias in prediction

  • Multi-accuracy targets
    • Breaking down metrics by different groups (see the sketch after this list)
    • But how do you determine which groups?
      • Likely groups most at-risk have limited representation
  • Affirmative action
    • Ranking within subgroups
      • Example: Top 5% of high school class
      • More detail: Top 5% of high school class stratified by education level of mother
    • Individual pairing
      • Select pairs of individuals with similar traits
      • Predict the outcome pair-wise, rather than individually
  • Issue across all of these: Intersectionality
    • Difficult/impossible to ensure fairness across all intersections of groups
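A sketch of the first strategy above, multi-accuracy targets: compute the same metric separately per group and inspect the gaps. The labels and group assignments are made up for illustration.

```python
# Sketch: break overall accuracy down by subgroup ("multi-accuracy targets").
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
group = np.array(["a", "a", "b", "b", "a", "b", "a", "b"])

overall = (y_true == y_pred).mean()
by_group = {g: (y_true[group == g] == y_pred[group == g]).mean()
            for g in np.unique(group)}
print(f"overall accuracy: {overall:.2f}")
print("per-group accuracy:", by_group)  # large gaps flag under-served groups
```

The open question from the bullets still applies: the groups most at risk may be too small in the data for these per-group numbers to be reliable.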

24 of 30

Some methods for monitoring/addressing bias

  • StereoSet (https://stereoset.mit.edu/)
    • Three scores
      • Language model score
        • How good the model is at ranking “meaningful” associations over “meaningless” ones (see the sketch after this list)
          • “My housekeeper is a Mexican” vs “My housekeeper is a round”
      • Stereotype score
        • How often “stereotype” constructions are preferred over “anti-stereotype” constructions
      • Idealized CAT score
        • Combines the above two measures into a single score (0 = worst, 100 = ideal)
  • Deon (https://deon.drivendata.org/)
    • Ethics checklist
    • Five domains: Data collection, data storage, analysis, modelling, deployment
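A rough sketch of the intuition behind StereoSet's language model score, reusing the housekeeper example from above (my own code, not StereoSet's official evaluation; GPT-2 via Hugging Face transformers is an assumed model choice):

```python
# Sketch: which continuation does a causal LM prefer?
# Illustrates the spirit of StereoSet's LM score, not its official scoring.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_nll(text):
    """Average negative log-likelihood per token (lower = more preferred)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

meaningful = "My housekeeper is a Mexican"    # example from the slide
meaningless = "My housekeeper is a round"
print(mean_nll(meaningful), mean_nll(meaningless))
# A reasonable LM should give the meaningful sentence the lower loss; the
# stereotype score makes the same comparison between stereotype and
# anti-stereotype continuations.
```

On the Deon side, the checklist can be rendered from the command line (per its documentation, e.g. `deon -o ETHICS.md`) and kept under version control with the project.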

25 of 30

De-biasing word embeddings

Equalize: adjust “equalize pairs” (e.g., he/she) so they differ only along the bias direction, by the same amount (Bolukbasi et al. 2016; see the sketch below)
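A minimal sketch of the neutralize and equalize steps for a single bias direction g (for example, the difference between the “he” and “she” vectors), assuming unit-length inputs; this is a simplified rendering of Bolukbasi et al.'s hard debiasing, not their released code.

```python
# Sketch of 1-D hard debiasing: neutralize gender-neutral words and
# equalize pairs along a single bias direction g.
import numpy as np

def neutralize(v, g):
    """Remove the component of v that lies along the bias direction g."""
    g = g / np.linalg.norm(g)
    return v - np.dot(v, g) * g

def equalize(w1, w2, g):
    """Make a pair (e.g. "he"/"she") differ only along g, by equal amounts.
    Assumes w1 and w2 are unit-length."""
    g = g / np.linalg.norm(g)
    mu = (w1 + w2) / 2
    mu_orth = mu - np.dot(mu, g) * g                 # shared, bias-free part
    scale = np.sqrt(max(0.0, 1 - np.linalg.norm(mu_orth) ** 2))
    w1_new = mu_orth + scale * np.sign(np.dot(w1 - mu, g)) * g
    w2_new = mu_orth + scale * np.sign(np.dot(w2 - mu, g)) * g
    return w1_new, w2_new
```

After these two steps, both members of an equalize pair sit equidistant from every neutralized word, which is exactly the property the next slide puts under scrutiny.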

26 of 30

We fixed it!

“To measure racial discrimination by people, we must create controlled circumstances in the real world where only race differs. For an algorithm, we can create equally controlled circumstances just by feeding it the right data and observing its behavior.”

27 of 30

We...fixed it?

  • Applying Bolukbasi’s method
    • X axis = Original bias measure
      • In Bolukbasi: Projection on gender direction
      • Negative = more female
      • Positive = more male
    • Y axis = Number of “male”-associated neighbors
  • Original not much different from debiased
    • Bias-neighbor correlation (see the sketch after this list)
      • Original: 0.75
      • Debiased: 0.61
  • Bias coded in all words, not just the ones selected for correction
    • Bias may just now be in a different direction
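For a feel of how this check works, here is a simplified sketch in the spirit of the bias-neighbor analysis (Gonen and Goldberg's “lipstick on a pig” critique): correlate each word's projection on the gender direction with the share of its nearest neighbors that lean “male”. The word list, model name, and neighbor count are illustrative, and this is not the paper's exact protocol.

```python
# Sketch: does a word's gender projection predict how "male" its
# neighbourhood is? Run on original and debiased vectors to compare.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
g = vectors["he"] - vectors["she"]
g = g / np.linalg.norm(g)

words = ["nurse", "engineer", "teacher", "doctor", "housekeeper",
         "programmer", "lawyer", "receptionist", "scientist", "librarian"]

def gender_projection(w):
    v = vectors[w]
    return np.dot(v / np.linalg.norm(v), g)  # > 0 leans "he", < 0 leans "she"

def male_neighbour_share(w, k=20):
    neighbours = [n for n, _ in vectors.most_similar(w, topn=k)]
    return np.mean([gender_projection(n) > 0 for n in neighbours])

bias = [gender_projection(w) for w in words]
neighbour_bias = [male_neighbour_share(w) for w in words]
print(np.corrcoef(bias, neighbour_bias)[0, 1])  # analogue of the correlations above
```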

28 of 30

Some questions worth asking in development and usage

  • What is the “right” data (as opposed to what is the “right now” data)?
    • Deon checklist provides some useful questions (https://deon.drivendata.org)
  • Similarly, what is the “right” method?
    • StereoSet provides scores on LMs in terms of “embedded stereotypes”
    • But is a large LM necessary? Word embeddings are extremely cheap to train on custom data
  • Who might be helped versus who might be harmed by this approach?
    • Is there a way to adjust the approach/product to reduce harm/increase accessibility?
  • And finally: Is this a worthwhile application?

29 of 30

Thank you!

More resources

30 of 30

Is it worth it?

  • Using pre-trained components introduces complexity and reduces transparency
  • Bender and Gebru article (2021) posed several questions
    • Are ever larger LMs inevitable or necessary?
    • What costs are associated with this research direction and what should we consider before pursuing it?
  • Who is taking responsibility for assessing these costs?