1 of 29

Fighting Bias with Bias

Challenges and Opportunities for Artificial Intelligence in Healthcare

Keith Harrigian

Johns Hopkins University

2 of 29

About Me

  • PhD Candidate in Computer Science at Johns Hopkins University
  • Research Areas
    • Natural Language Processing (NLP) for healthcare
    • Robustness, domain adaptation, and generalization
  • Other Pursuits
    • Data science at Netflix, Unforged, Warner Media, and True Fit
    • Behavioral neuroscience research (goal-oriented human movement)

3 of 29

“In the realm of healthcare, artificial intelligence serves as a powerful antidote to bias, paving the way for a future where every individual receives unbiased and equal treatment.”

– ChatGPT

4 of 29

Transformative AI is Here: Now What?

Rapid Progress of AI

  • Improved modeling architectures
  • Improved computational resources

Proceed With Caution

  • Endless opportunities to leverage AI in the fight against healthcare disparities
  • Awareness of limitations matters

“Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today.” Wang et al. arXiv. 2023.

5 of 29

Agenda

Review: AI, Bias, and Healthcare

Case Study: Characterizing Stigmatizing Language in Medical Records

Open Dialogue: Bringing AI to the Alzheimer’s Association

6 of 29

Review

AI, Bias, and Healthcare

7 of 29

Terminology

Statistical Bias

Systematic error in the outcome of a study due to dataset curation or modeling decisions.

Examples

  • A dataset is not representative of the population it intends to study
  • A model does not properly characterize the behavior of its target population

Social Bias

Human-held prejudices and predispositions regarding groups, attributes, or circumstances.

Examples

  • A dataset of transplant decisions made using knowledge of a patient’s income or race
  • A language model that disproportionately associates high-paying jobs with men

8 of 29

Sources of Bias

  • “Garbage In, Garbage Out”
  • Bias in AI is inevitable
    • Dynamic standards
    • New domains
  • An awareness of system shortcomings goes a long way

“Biases in AI Systems.” Srinivasan and Chander. Communications of the ACM. 2021.

9 of 29

Distribution Shift

What Happens

  • Data distributions can change between training a model and deploying it
  • Types of shift
    • Prior Shift: p(y) ≠ p′(y)
    • Covariate Shift: p(x) ≠ p′(x)
    • Concept Shift: p(y | x) ≠ p′(y | x)

Possible Solutions

  • Domain adaptation (requires target data)
  • Domain generalization (may compromise within-domain performance)

Example: Language models trained on out-of-distribution data require adaptation to their target distribution.

“An Eye on Clinical BERT: Investigating Language Model Generalization for Diabetic Eye Disease Phenotyping.” Harrigian et al. Under Review. 2023.
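As a concrete, hypothetical illustration of these shift types (not part of the cited work), the sketch below shows one simple way to check a deployment sample against a training sample for prior and covariate shift; the helper names and usage are assumptions.

```python
import numpy as np
from scipy import stats

def check_prior_shift(y_train, y_deploy):
    """Chi-squared test comparing label distributions p(y) vs. p'(y)."""
    labels = np.union1d(y_train, y_deploy)
    train_counts = np.array([np.sum(y_train == lbl) for lbl in labels])
    deploy_counts = np.array([np.sum(y_deploy == lbl) for lbl in labels])
    # Expected deployment counts if the training label distribution still held
    expected = train_counts / train_counts.sum() * deploy_counts.sum()
    return stats.chisquare(deploy_counts, f_exp=expected)

def check_covariate_shift(x_train, x_deploy):
    """Kolmogorov-Smirnov test comparing a single feature's p(x) vs. p'(x)."""
    return stats.ks_2samp(x_train, x_deploy)

# A low p-value from either test suggests the corresponding shift is present.
```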

10 of 29

Distribution Shift

Example: Words that were frequently used by individuals with depression began to be used by the general population after the onset of COVID-19 to reflect pandemic-specific phenomena.

Term: Panic
  • 2019 neighborhood (Emotion, i.e., Fear): rage, meltdown, anxiety, anger, barrage, migraine, phobia, outrage, manic, rush, asthma
  • 2020 neighborhood (Panic Buying, Misinformation): hysteria, chaos, fear, misinformation, confusion, frenzy, paranoia, mayhem, insanity, fearmongering

Term: Cuts
  • 2019 neighborhood (Physical): cut, jumps, runs, cutting, pulls, moves, bounces, falls, turns, burns, drags, dips, breaks, bursts, rips, goes, bumps
  • 2020 neighborhood (Economic): cut, cutting, subsidies, budgets, deductions, revenues, checks, payments, breaks, deals, figures, loans, deposits, gains

Term: Isolated
  • 2019 neighborhood (Feeling Detached): unpleasant, unstable, detached, unsafe, populated, invasive, unknown, confined, endangered, absent, vulnerable
  • 2020 neighborhood (Quarantine): quarantined, isolating, separated, enclosed, insulated, infectious, confined, active, populated, autonomous, vulnerable, detached

Term: Strain
  • 2019 neighborhood (Discomfort/Pressure): inflammation, deficiency, dose, stress, pressure, calcium, medication, concentration, tissue, nausea, receptors, doses
  • 2020 neighborhood (Virus): disease, illness, infections, symptom, mutation, virus, outbreak, pneumonia, infection, strains, influenza, epidemic

Term: Vulnerable
  • 2019 neighborhood (Emotion): susceptible, dangerous, prone, unstable, aggressive, hostile, disruptive, detrimental, receptive, fragile, damaging
  • 2020 neighborhood (At-risk Populations): susceptible, dangerous, immunocompromised, infectious, isolating, elderly, disadvantaged, contagious, tolerant, likely, isolated

“The Problem of Semantic Shift in Longitudinal Monitoring of Social Media.” Harrigian et al. WebSci. 2022.

11 of 29

Group Imbalance

What Happens

  • Traditional machine learning models are trained to minimize average predictive error within their training dataset
  • If a training dataset is made up of multiple groups, the model is encouraged to do better on the larger groups

Possible Solutions

  • Distributionally Robust Optimization
  • Multi-Task Learning
  • Resampling

Example: A logistic regression classifier trained with standard empirical risk minimization (ERM) compromises minority-group performance in favor of increasing majority-group performance, as sketched below.
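A minimal synthetic sketch of this effect, using resampling-style reweighting as the mitigation; the data, the 10% minority split, and the group-dependent label rule are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
g = (rng.random(1000) < 0.1).astype(int)              # 10% minority group
# The label depends on feature 1 with *opposite* sign in the minority group
y = (X[:, 0] + np.where(g == 1, -X[:, 1], X[:, 1]) > 0).astype(int)

# ERM: every example weighted equally, so the majority group dominates the loss
erm = LogisticRegression().fit(X, y)

# Mitigation: reweight examples so each group contributes equally to the loss
weights = np.where(g == 1, (g == 0).sum() / (g == 1).sum(), 1.0)
reweighted = LogisticRegression().fit(X, y, sample_weight=weights)

for name, model in [("ERM", erm), ("Reweighted", reweighted)]:
    for grp in (0, 1):
        acc = model.score(X[g == grp], y[g == grp])
        print(f"{name} accuracy, group {grp}: {acc:.2f}")
```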

12 of 29

Spurious Correlations

What Happens

  • Machine learning models tend to prefer simpler solutions over more complex alternatives
  • Non-causal correlations can be used erroneously as “shortcuts”

Possible Solutions

  • Causally-informed Models
  • Adversarial Learning

Example: Models will learn non-causal relationships between spurious (unstable) attributes and outcomes.

[Figure: causal diagram relating a spurious attribute A (hospital bed), features X (vitals), and outcome Y (mortality).]
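A small synthetic sketch of the shortcut problem (the 95% train-time correlation is an assumption for illustration): the spurious attribute a predicts y almost perfectly during training, so the model leans on it and degrades once the correlation breaks at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, spurious_corr):
    """y is caused by x; attribute a merely co-occurs with y at rate `spurious_corr`."""
    x = rng.normal(size=n)
    y = (x + 0.5 * rng.normal(size=n) > 0).astype(int)
    a = np.where(rng.random(n) < spurious_corr, y, 1 - y)  # no causal link to y
    return np.column_stack([x, a]), y

X_train, y_train = make_data(5000, spurious_corr=0.95)  # shortcut looks reliable
X_test, y_test = make_data(5000, spurious_corr=0.50)    # shortcut breaks

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))
print("weights [x, a]:", model.coef_[0])  # large weight on `a` = learned shortcut
```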

13 of 29

The State of AI Bias Research

Defensive Tactics

  • Measure, identify, and protect against social and statistical bias in algorithmic healthcare tools

Offensive Tactics

  • Measure, identify, and address instances of social bias in our healthcare system

Goal: Improved Health Equity

14 of 29

Case Study

Characterizing Stigmatizing Language in Medical Records

15 of 29

Collaborators

Aya Zirikly

Brant Chee

Yahan Li

Mark Dredze

Anne R. Links

Alya Ahamad

Somnath Saha

Mary Catherine Beach

16 of 29

Problem Context

Black patients are significantly more likely than white patients to experience discrimination in the healthcare system (12.3% vs. 2.3%)

Patients who experience discrimination have:

  • Lower levels of adherence to treatment plans
  • Lower trust in healthcare providers
  • Increased likelihood of delaying care or avoiding screening for chronic conditions

Healthcare providers who read notes containing stigmatizing language are more likely to formulate a less aggressive treatment plan

The 21st Century Cures Act mandates that EHRs be readily available to all patients

17 of 29

Stigmatizing Language

Stigmatizing language assigns negative labels, stereotypes, and judgment to certain groups of people.

Often recognized in discussions of mental health and addiction

  • “Addict”
  • “Substance Abuse”
  • “Crazy”
  • “Junkie”

More generally, stigmatizing language reflects an implicit bias

  • Often expressed unconsciously
  • In the EHR, more commonly covert

18 of 29

Stigmatizing Language Taxonomy

Disbelief

  • Definition: Insinuates doubt about a patient’s stated testimony.
  • Examples: “adamant he doesn’t smoke”; “claims to see a therapist”

Difficult

  • Definition: Describes the patient’s perspective as inflexible, difficult, or entrenched, typically with respect to their intentions.
  • Examples: “insists on being admitted”; “adamantly opposed to limiting fruit intake”

Exclude

  • Definition: The word or phrase is not used to characterize the patient or describe the patient’s behavior; it may refer to a medical condition, a treatment, or another person or context.
  • Examples: “patient’s friend insisted she go to the hospital”; “test claims submitted to insurance”

Task: Credibility and Obstinacy

19 of 29

Stigmatizing Language Taxonomy

Negative

  • Definition: Patient is not following, is unlikely to follow, or is questionably following medical advice.
  • Examples: “adherence to therapeutic medication is unclear”; “mother declines vaccines”; “struggles with medication and follow-up compliance”

Neutral

  • Definition: Not used to indicate whether the patient is following or rejecting medical advice; often generically describes a future or hypothetical plan.
  • Examples: “discussed medication compliance”; “school refuses to provide adequate accommodations”; “feels that her parents’ health has declined”

Positive

  • Definition: Patient is following medical advice.
  • Examples: “continues to be compliant with aspirin regimen”; “reports excellent adherence”

Task: Compliance

20 of 29

Stigmatizing Language Taxonomy

Negative

  • Definition: Patient’s demeanor is cast in a negative light; insinuates the patient is not being forthright.
  • Examples: “concern for secondary gain”; “unwilling to meet with case manager”

Neutral

  • Definition: Negation of negative descriptors; insinuates the patient was expected to have a negative demeanor.
  • Examples: “not combative or belligerent”; “dad seems angry with patient at times”

Positive

  • Definition: Patient’s demeanor or behavior is described in a positive light; patient is easy to interact with.
  • Examples: “lovely 80 year old woman”; “well-groomed and holds good eye contact”

Exclude

  • Definition: Patient self-description or a description of another individual.
  • Examples: “does not want providers to think she’s malingering”; “reports feeling angry”

Task: Descriptors

21 of 29

Overview of System Structure

Pipeline: Clinical Notes → Anchor Extraction → Machine Learning Classifier → Stigma Labels

Example

  • Note: “Despite my best advice, the patient remains adamant about leaving the hospital today. Social services is aware of the situation.”
  • Anchor keyword: “adamant”
  • Extracted context: “my best advice the patient remains adamant about leaving the hospital today social”
  • Candidate labels: Disbelief, Difficult, Exclude
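A minimal sketch of the anchor-extraction step; the anchor list and window size here are assumptions (the real system uses a curated keyword taxonomy).

```python
import re

ANCHORS = ["adamant", "claims", "insists", "compliance", "combative"]  # hypothetical subset

def extract_anchors(note, window=6):
    """Return (anchor, context) pairs: each matched keyword plus `window` tokens per side."""
    tokens = re.findall(r"[a-z']+", note.lower())
    spans = []
    for i, tok in enumerate(tokens):
        if any(tok.startswith(anchor) for anchor in ANCHORS):
            context = tokens[max(0, i - window): i + window + 1]
            spans.append((tok, " ".join(context)))
    return spans

note = ("Despite my best advice, the patient remains adamant about leaving "
        "the hospital today. Social services is aware of the situation.")
print(extract_anchors(note))
# [('adamant', 'my best advice the patient remains adamant about leaving the hospital today social')]
```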

22 of 29

Data

Johns Hopkins University (Private)

English-language progress notes

Five clinical specialties are represented: internal medicine, emergency medicine, pediatrics, OB-GYN, and general surgery (Baltimore, MD)

5,201 labeled instances

MIMIC-IV (Public)

De-identified, English discharge notes

Patients admitted to the emergency department or an intensive care unit at Beth Israel Deaconess Medical Center (Boston, MA)

5,043 labeled instances

23 of 29

Model Performance and Keyword Grounding Limitation

Figure 1: Model accuracy on the Credibility task. BERT models maximize performance at the cost of interpretability.

Figure 2: Projection of embeddings for a subset of keywords (classes: Exclude, Negative, Neutral). Labels cluster globally, but keywords cluster locally.

24 of 29

Domain Transfer Performance

What happened?

  • MIMIC frequently contains references to a patient’s family, not the patient (ICU-related shift)
  • MIMIC contains more psychiatry exams in which the patient describes their mental wellbeing in a negative manner
  • The distribution of labels conditioned on each keyword changed between datasets

Figure 3: Macro F1 score when training and testing on different distributions. There is a consistent loss in performance when transferring between datasets.
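The final point above, conditional label shift, can be checked directly by comparing p(label | keyword) across datasets. A minimal sketch with hypothetical labeled instances standing in for the two corpora:

```python
from collections import Counter, defaultdict

def label_dist_by_keyword(examples):
    """Estimate p(label | keyword) from (keyword, label) pairs."""
    counts = defaultdict(Counter)
    for keyword, label in examples:
        counts[keyword][label] += 1
    return {k: {lbl: n / sum(c.values()) for lbl, n in c.items()}
            for k, c in counts.items()}

# Hypothetical instances standing in for the JHU and MIMIC-IV datasets
jhu   = [("adamant", "Difficult"), ("adamant", "Disbelief"), ("claims", "Disbelief")]
mimic = [("adamant", "Exclude"), ("adamant", "Difficult"), ("claims", "Exclude")]

for name, data in [("JHU", jhu), ("MIMIC", mimic)]:
    print(name, label_dist_by_keyword(data))
```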

25 of 29

Recap of Biases

System Design

  • Reliance on keywords to ground model predictions limits generalization to rare and/or new forms of stigmatizing language

Sample Selection

  • The patient population (demographics, clinical specialty, etc.) on which the model was trained dictates how it will perform at test time

26 of 29

The Opportunities Ahead

  • Document prevalence of stigmatizing language amongst different patient populations
  • How do changes in medical education curricula regarding bias manifest in clinical notes?
  • Provider-specific “report cards” to surface implicit bias
  • “Autocorrect” for the EHR and doctor-to-patient messaging systems
  • Augmented training objectives for clinical language models

“… there is a suspicion that the patient is not adhering to their medication regimen consistently.”

“Characterization of Stigmatizing Language in Medical Records.” Harrigian et al. ACL. 2023.

27 of 29

Open Dialogue

Bringing AI to the Alzheimer’s Association

28 of 29

Areas of Discussion

  • Interpretability: What is the model doing?
  • Benchmarking: Will this model work for our population?
  • Data Sharing: How do we safely facilitate research?
  • Adversarial Data Analysis: Is our data biased?
  • Regulation: What does the future look like?
  • Emerging Research: What’s on the association’s docket?

29 of 29

Thank you

Email: kharrigian@jhu.edu

Learn More: kharrigian.github.io