1 of 15

Fair Machine Learning Models for Integrating EMRs and Neighbor Information for Better Disease Screening and Risk Factor Identification

Yang Dai

Department of Biomedical Engineering

Center for Bioinformatics and Quantitative Biology (CBQB)

University of Illinois Chicago

UIC AI Ecosystem Continuing Symposium 1, 9.13.2024, UIC

2 of 15

Machine learning predictive models for disease using multi-omics data

Machine Learning

Disease prediction

Biomarker discovery

Biology

Characteristics of the data:

  1. Obtained from well-matched samples (case vs. control)
  2. Equal size for each class
  3. Dominant features driving class labels have been eliminated
  4. Sample size can be small → ML models focus on reducing the dimension

Neither ML generalization nor bias is typically evaluated
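A minimal sketch of such a pipeline on synthetic data (hypothetical shapes: 60 well-matched samples, 5,000 omics features), where dimension reduction precedes classification, using scikit-learn:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))   # small sample, high-dimensional features
y = np.array([0, 1] * 30)         # balanced case/control labels

# Reduce 5,000 features to 10 components before classifying
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```

Because the PCA step sits inside the cross-validated pipeline, the dimension reduction is refit on each training fold, avoiding information leakage into the test folds.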

3 of 15

ML Modeling for Wellbeing: better disease screening and risk factor identification; recommendation of changes in screening, health policy, and lifestyle

Machine Learning

Fairness (Race/Gender)

Bias Mitigation

Other social determinants of health (SDOH)

Risk prediction

  • Prenatal Depression?
  • Lung cancer?
  • Acute pancreatitis?
  • Cardiometabolic disorder…

Risk Factors (bio/socio-economic/environmental/behavioral)

  • High-crime area
  • Lack of access to good food
  • Heavy metal exposure
  • ….

4 of 15

ML Modeling for Wellbeing: better disease screening and risk factor identification; recommendation of changes in screening, health policy, and lifestyle

Machine Learning

EMRs

Neighborhood Information

Fairness (Race/Gender)

Bias Mitigation

Modeling

Dietary Records

Risk prediction

  • Prenatal Depression?
  • Lung cancer?
  • Acute pancreatitis?
  • Cardiometabolic disorder…

Risk Factors (bio/socio-economic/environmental/behavioral)

  • High-crime area
  • Lack of access to good food
  • Heavy metal exposure
  • ….

5 of 15

Case study 1: Prenatal depression

Hypothesis: Community-level information could improve prenatal depression (PND) prediction

Overall Patient distribution

6 of 15

Biases in prediction for different racial/ethnic groups

Prenatal depression: data from UI Health

(2414 patients, 56 EMR features)

Huang Y, Alvernaz S, Kim SJ, Maki P, Dai Y, Bernabé BP. Predicting prenatal depression and assessing model bias using machine learning models. Biological Psychiatry Global Open Science, August 2024.

7 of 15

Feature/risk importance is race/ethnicity-specific

Huang et al., Biological Psychiatry Global Open Science, 2024
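The idea of group-specific feature importance can be sketched with synthetic data in which a different feature drives risk in each group (an illustrative setup, not the study's actual pipeline), computing permutation importance separately per subgroup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, n)          # two hypothetical subgroups
x = rng.normal(size=(n, 2))
# Feature 0 drives risk in group 0, feature 1 in group 1
logit = np.where(group == 0, 2 * x[:, 0], 2 * x[:, 1])
y = (logit + rng.normal(size=n) > 0).astype(int)

X = np.column_stack([x, group])        # include the group indicator as a feature
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance evaluated within each subgroup separately
importance_by_group = {}
for g in (0, 1):
    mask = group == g
    result = permutation_importance(model, X[mask], y[mask],
                                    n_repeats=20, random_state=0)
    importance_by_group[g] = result.importances_mean
```

Evaluating importance within each stratum, rather than pooled, is what exposes the group-specific risk structure.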

8 of 15

Proposed new training procedure (nested cross-validation):

  • Outer loop: K = 10; inner loop: K = 5
  • CASE and CONTROL samples are stratified by racial/ethnic group (NHW, NHB, H, Other) in each split
  • Grid search for hyperparameter tuning in the inner loop
  • Get the best model
  • Validate the result on the held-out outer fold
  • Stratify the data (NHW) based on where the patient lived

[Figure: nested cross-validation scheme with race/ethnicity-stratified CASE and CONTROL splits]
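A minimal sketch of this nested scheme with scikit-learn on synthetic data (note: `StratifiedKFold` below stratifies on the outcome only; the group- and residence-aware stratification described above would require custom fold construction):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # K = 5
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10

# Grid search tunes hyperparameters inside each outer training fold;
# the outer folds give an unbiased estimate of generalization.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]},
                      cv=inner, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
```

Keeping model selection strictly inside the outer training folds is what prevents the hyperparameter search from leaking test information into the performance estimate.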

9 of 15

Integration of Community-Level Information with EMRs Enhances ML Model Fairness While Maintaining Moderate Predictive Performance

[Figure: panels A–D]

10 of 15

Motivation: Lung cancer exhibits health disparities by race and gender; existing risk factors cannot fully explain the disease incidence

Spatial distributions of lung cancer risk, homicide rate, and racial/ethnic composition of the city of Chicago.

Case study 2: Lung cancer risk prediction

Kim SJ, Kery C, An J, Rineer J, Bobashev G, Matthews AK. Racial/Ethnic disparities in exposure to neighborhood violence and lung cancer risk in Chicago. Soc Sci Med. 2024 Jan;340:116448.

11 of 15

Objectives: Develop ML models to identify multi-level risk factors for lung cancer

Summary of data: 14 variables

Patient data: Age over 40; UI Hospital

Case study 2: Lung cancer risk prediction

Type            | Variable name    | Values     | Definition
----------------|------------------|------------|------------------------------------------
Categorical (8) | BMI              | 1, 2, 3, 4 | Body Mass Index category
                | Male             | 0, 1       | Gender
                | Neversmoker      | 0, 1       | Smoking behavior
                | White, Black,    | 0, 1 each  | Individual-level race/ethnicity
                | Asian, Hisp,     |            | (only White, Black, and Hisp are
                | Other            |            | used in the model)
Continuous (6)  | age              |            | Age
                | homiciderate1519 |            | Neighborhood homicide rate (2015–2019)
                | ppov             |            | % poverty
                | pwhite           |            | % white
                | pblack           |            | % black
                | phisp            |            | % Hispanic
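The note that only the White, Black, and Hisp indicators enter the model can be sketched as follows (synthetic rows; dropping the Asian and Other columns leaves them as the reference category, avoiding collinearity with an intercept):

```python
import pandas as pd

# Hypothetical patient rows with an individual-level race/ethnicity field
df = pd.DataFrame({"race": ["White", "Black", "Hisp", "Asian", "Other", "White"]})

dummies = pd.get_dummies(df["race"])              # one 0/1 column per category
model_cols = dummies[["White", "Black", "Hisp"]]  # only these enter the model
```

With a full set of 0/1 indicators, the columns would sum to one in every row and be perfectly collinear; keeping three of five makes the combined Asian/Other group the baseline against which the coefficients are interpreted.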

12 of 15

Performance shows racial and gender bias
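Gaps of this kind can be quantified by computing a metric per group rather than overall; a model can look adequate on average while under-performing for one group. A sketch on synthetic scores (illustrative data, not the study's results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 1000
group = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
# Scores are informative for group 0 but nearly random for group 1
score = np.where(group == 0,
                 y + rng.normal(scale=0.5, size=n),
                 0.1 * y + rng.normal(scale=1.0, size=n))

# Per-group AUC exposes the disparity that the pooled AUC hides
auc_by_group = {g: roc_auc_score(y[group == g], score[group == g])
                for g in (0, 1)}
gap = auc_by_group[0] - auc_by_group[1]
```

The same per-group pattern applies to other metrics (sensitivity, calibration, false-positive rate), which is how fairness criteria such as equalized odds are checked in practice.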

13 of 15

Risk factor ranking

Blue: positive association; red: negative association

Question: how does this ranking change across the different race/gender groups?

14 of 15

Lessons learned from our studies and future directions

Learned

  • ML models usually result in biased performance across groups
  • Designing a “good” training and testing procedure is important
  • Feature importance needs to be examined by race/ethnicity, gender, and age
  • Bias mitigation is not straightforward; it can result in the elimination of true risk factors

Moving forward for better ML strategies and procedures

  • Bias detection in data and bias mitigation
  • Fair models
  • Incorporation of spatial information
  • Risk factor detection at multiple levels; evaluation of their interrelationships
  • Risk factor evaluation in a race-specific manner
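As one illustration of why mitigation is delicate: a common pre-processing approach is reweighing, which weights each (group, outcome) cell so that group and outcome become independent under the weights, but by the same token it can down-weight strata where a true risk factor is concentrated. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
group = rng.integers(0, 2, n)
# Outcome rate is correlated with group in the raw data (70% vs. 30%)
y = (rng.random(n) < np.where(group == 0, 0.7, 0.3)).astype(int)

# Reweighing: weight each (group, outcome) cell by expected / observed count
w = np.empty(n)
for g in (0, 1):
    for c in (0, 1):
        mask = (group == g) & (y == c)
        expected = n * (group == g).mean() * (y == c).mean()
        w[mask] = expected / mask.sum()

# Under these weights the positive rate is equal across groups
rate = {g: np.average(y[group == g], weights=w[group == g]) for g in (0, 1)}
```

If the group-outcome correlation partly reflects a genuine risk factor (e.g., differential environmental exposure), equalizing the weighted rates suppresses exactly that signal, which is the caution raised in the lessons above.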

15 of 15

Acknowledgments

  • Principal Investigators/Collaborators
    • Beatriz Peñalver Bernabé (PhD): Department of Biomedical Engineering
    • Sage Kim (PhD): Division of Health Policy and Administration, School of Public Health
    • Pauline Maki (PhD): Depts of Psychiatry, Psychology, and Obstetrics & Gynecology

  • Ph.D. Students
    • Yongchao Huang
    • Tina Khajeh
    • Suzanne Alvernaz
    • Evgenia Karayeva

  • Funding: R21HD110779, U54MD012523, K12HD10137, R01MD014839