1 of 15

Fair Machine Learning Models for Integrating EMRs and Neighbor Information for Better Disease Screening and Risk Factor Identification

Yang Dai

Department of Biomedical Engineering

Center for Bioinformatics and Quantitative Biology (CBQB)

University of Illinois Chicago

UIC AI Ecosystem Continuing Symposium 1, 9.13.2024, UIC

2 of 15

Machine learning predictive models for disease using multi-omics data

Machine Learning

Disease prediction

Biomarker discovery

Biology

Characteristics of the data:

  1. Obtained from well-matched samples (case vs. control)
  2. Equal size for each class
  3. Dominant features driving class labels have been eliminated
  4. Sample size can be small → ML models focus on reducing the dimension

Neither ML generalization nor bias is typically evaluated
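A minimal sketch of such a pipeline on synthetic data (hypothetical shapes: 60 well-matched samples, 5,000 omics features), where dimension reduction precedes classification, using scikit-learn:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))   # small sample, high-dimensional features
y = np.array([0, 1] * 30)         # balanced case/control labels

# Reduce 5,000 features to 10 components before classifying
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```

Because the PCA step sits inside the cross-validated pipeline, the dimension reduction is refit on each training fold, avoiding information leakage into the test folds.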

3 of 15

ML Modeling for Wellbeing: better disease screening and risk factor identification; recommendation of changes in screening, health policy, and lifestyle

Machine Learning

Fairness (Race/Gender)

Bias Mitigation

Other social determinants of health (SDOH)

Risk prediction

  • Prenatal Depression?
  • Lung cancer?
  • Acute pancreatitis?
  • Cardiometabolic disorder…

Risk Factors (bio/socio-economic/environmental/behavioral)

  • High-crime area
  • Lack of access to good food
  • Heavy metal exposure
  • ….

4 of 15

ML Modeling for Wellbeing: better disease screening and risk factor identification; recommendation of changes in screening, health policy, and lifestyle

Machine Learning

EMRs

Neighborhood Information

Fairness (Race/Gender)

Bias Mitigation

Modeling

Dietary Records

Risk prediction

  • Prenatal Depression?
  • Lung cancer?
  • Acute pancreatitis?
  • Cardiometabolic disorder…

Risk Factors (bio/socio-economic/environmental/behavioral)

  • High-crime area
  • Lack of access to good food
  • Heavy metal exposure
  • ….

5 of 15

Case study 1: Prenatal depression

Hypothesis: Community-level information could improve prenatal depression (PND) prediction

Overall Patient distribution

6 of 15

Biases in prediction for different racial/ethnic groups

Prenatal depression: data from UI Health

(2414 patients, 56 EMR features)

Huang Y, Alvernaz S, Kim SJ, Maki P, Dai Y, Bernabé BP. Predicting prenatal depression and assessing model bias using machine learning models. Biological Psychiatry Global Open Science, August 2024.

7 of 15

Feature/risk importance is race/ethnicity-specific

Huang et al., Biological Psychiatry Global Open Science, 2024
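The idea of group-specific feature importance can be sketched with synthetic data in which a different feature drives risk in each group (an illustrative setup, not the study's actual pipeline), computing permutation importance separately per subgroup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, n)          # two hypothetical subgroups
x = rng.normal(size=(n, 2))
# Feature 0 drives risk in group 0, feature 1 in group 1
logit = np.where(group == 0, 2 * x[:, 0], 2 * x[:, 1])
y = (logit + rng.normal(size=n) > 0).astype(int)

X = np.column_stack([x, group])        # include the group indicator as a feature
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance evaluated within each subgroup separately
importance_by_group = {}
for g in (0, 1):
    mask = group == g
    result = permutation_importance(model, X[mask], y[mask],
                                    n_repeats=20, random_state=0)
    importance_by_group[g] = result.importances_mean
```

Evaluating importance within each stratum, rather than pooled, is what exposes the group-specific risk structure.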

8 of 15

Proposed new training procedure (nested cross-validation):

  • Outer loop: K = 10; inner loop: K = 5
  • CASE and CONTROL samples are stratified by racial/ethnic group (NHW, NHB, H, Other) in each split
  • Grid search for hyperparameter tuning in the inner loop
  • Get the best model
  • Validate the result on the held-out outer fold
  • Stratify the data (NHW) based on where the patient lived

[Figure: nested cross-validation scheme with race/ethnicity-stratified CASE and CONTROL splits]
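A minimal sketch of this nested scheme with scikit-learn on synthetic data (note: `StratifiedKFold` below stratifies on the outcome only; the group- and residence-aware stratification described above would require custom fold construction):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # K = 5
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # K = 10

# Grid search tunes hyperparameters inside each outer training fold;
# the outer folds give an unbiased estimate of generalization.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]},
                      cv=inner, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
```

Keeping model selection strictly inside the outer training folds is what prevents the hyperparameter search from leaking test information into the performance estimate.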

9 of 15

Integration of Community-Level Information with EMRs Enhances ML Model Fairness While Maintaining Moderate Predictive Performance

[Figure: panels A–D]

10 of 15

Motivation: Lung cancer exhibits health disparities by race and gender; existing risk factors cannot fully explain the disease incidence

Spatial distributions of lung cancer risk, homicide rate, and racial/ethnic composition of the city of Chicago.

Case study 2: Lung cancer risk prediction

Kim SJ, Kery C, An J, Rineer J, Bobashev G, Matthews AK. Racial/Ethnic disparities in exposure to neighborhood violence and lung cancer risk in Chicago. Soc Sci Med. 2024 Jan;340:116448.

11 of 15

Objectives: Develop ML models to identify multi-level risk factors for lung cancer

Summary of data: 14 variables

Patient data: Age over 40; UI Hospital

Case study 2: Lung cancer risk prediction

Type            | Variable name    | Values     | Definition
----------------|------------------|------------|------------------------------------------
Categorical (8) | BMI              | 1, 2, 3, 4 | Body Mass Index category
                | Male             | 0, 1       | Gender
                | Neversmoker      | 0, 1       | Smoking behavior
                | White, Black,    | 0, 1 each  | Individual-level race/ethnicity
                | Asian, Hisp,     |            | (only White, Black, and Hisp are
                | Other            |            | used in the model)
Continuous (6)  | age              |            | Age
                | homiciderate1519 |            | Neighborhood homicide rate (2015–2019)
                | ppov             |            | % poverty
                | pwhite           |            | % white
                | pblack           |            | % black
                | phisp            |            | % Hispanic
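The note that only the White, Black, and Hisp indicators enter the model can be sketched as follows (synthetic rows; dropping the Asian and Other columns leaves them as the reference category, avoiding collinearity with an intercept):

```python
import pandas as pd

# Hypothetical patient rows with an individual-level race/ethnicity field
df = pd.DataFrame({"race": ["White", "Black", "Hisp", "Asian", "Other", "White"]})

dummies = pd.get_dummies(df["race"])              # one 0/1 column per category
model_cols = dummies[["White", "Black", "Hisp"]]  # only these enter the model
```

With a full set of 0/1 indicators, the columns would sum to one in every row and be perfectly collinear; keeping three of five makes the combined Asian/Other group the baseline against which the coefficients are interpreted.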

12 of 15

Performance shows racial and gender bias
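Gaps of this kind can be quantified by computing a metric per group rather than overall; a model can look adequate on average while under-performing for one group. A sketch on synthetic scores (illustrative data, not the study's results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 1000
group = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
# Scores are informative for group 0 but nearly random for group 1
score = np.where(group == 0,
                 y + rng.normal(scale=0.5, size=n),
                 0.1 * y + rng.normal(scale=1.0, size=n))

# Per-group AUC exposes the disparity that the pooled AUC hides
auc_by_group = {g: roc_auc_score(y[group == g], score[group == g])
                for g in (0, 1)}
gap = auc_by_group[0] - auc_by_group[1]
```

The same per-group pattern applies to other metrics (sensitivity, calibration, false-positive rate), which is how fairness criteria such as equalized odds are checked in practice.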

13 of 15

Risk factor ranking

Blue: positive association; red: negative association

Question: how does this ranking change across the different race/gender groups?

14 of 15

Lessons learned from our studies and future directions

Learned

  • ML models usually result in biased performance across groups
  • Designing a “good” training and testing procedure is important
  • Feature importance needs to be examined by race/ethnicity, gender, and age
  • Bias mitigation is not straightforward; it can result in the elimination of true risk factors

Moving forward for better ML strategies and procedures

  • Bias detection in data and bias mitigation
  • Fair models
  • Incorporation of spatial information
  • Risk factor detection at multiple levels; evaluation of their interrelationships
  • Risk factor evaluation in a race-specific manner
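As one illustration of why mitigation is delicate: a common pre-processing approach is reweighing, which weights each (group, outcome) cell so that group and outcome become independent under the weights, but by the same token it can down-weight strata where a true risk factor is concentrated. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
group = rng.integers(0, 2, n)
# Outcome rate is correlated with group in the raw data (70% vs. 30%)
y = (rng.random(n) < np.where(group == 0, 0.7, 0.3)).astype(int)

# Reweighing: weight each (group, outcome) cell by expected / observed count
w = np.empty(n)
for g in (0, 1):
    for c in (0, 1):
        mask = (group == g) & (y == c)
        expected = n * (group == g).mean() * (y == c).mean()
        w[mask] = expected / mask.sum()

# Under these weights the positive rate is equal across groups
rate = {g: np.average(y[group == g], weights=w[group == g]) for g in (0, 1)}
```

If the group-outcome correlation partly reflects a genuine risk factor (e.g., differential environmental exposure), equalizing the weighted rates suppresses exactly that signal, which is the caution raised in the lessons above.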

15 of 15

Acknowledgments

  • Principal Investigators/Collaborators
    • Beatriz Peñalver Bernabé (PhD): Department of Biomedical Engineering
    • Sage Kim (PhD): Division of Health Policy and Administration, School of Public Health
    • Pauline Maki (PhD): Depts of Psychiatry, Psychology, and Obstetrics & Gynecology

  • Ph.D. Students
    • Yongchao Huang
    • Tina Khajeh
    • Suzanne Alvernaz
    • Evgenia Karayeva

  • Funding: R21HD110779, U54MD012523, K12HD10137, R01MD014839