1 of 23

Heavy Metals Exposure and Diabetes Risk Prediction using Machine Learning Approach

Sagar Shrestha, Felix Twum, Jennifer L. Lemacks, Sermin Aras

April 3rd, 2025

Susan A. Siltanen Graduate Student Research Symposium

The University of Southern Mississippi

2 of 23

Introduction

  • Diabetes is a major global health concern
  • 38.4 million people, or 11.6% of the U.S. population, have diabetes (CDC, 2024)
  • Total Annual Cost of Diabetes (ADA, 2023)
    • $412.9 billion total cost in 2022
      1. $306.6 billion in direct medical costs
      2. $106.3 billion in indirect costs

3 of 23

Risk factors

Traditional Risk

  • Genetics
  • Diet
  • Lifestyle

Environmental factors

  • Heavy metal exposure

4 of 23

Motivation of the Study

  • Explore environmental influence
  • Early detection
  • Reduce healthcare costs
  • Improve quality of life

5 of 23

Formative Study - OLS Regression Analysis

  • Manganese: Negative association with HbA1c (β = -0.0125, p < 0.001). Potential protective effect.
  • Selenium: Positive association with HbA1c (β = 0.0021, p < 0.001). Possible link to increased risk.
  • Lead: Marginal association with HbA1c (p = 0.075).
  • Cadmium & Mercury: Not significant predictors of HbA1c.
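
As a rough illustration of how a slope coefficient (β) like those above is estimated, the sketch below fits a single-predictor ordinary least squares line in plain Python. The function name `ols_fit` and the `metal` and `hba1c` values are hypothetical and are not NHANES data. The formative study presumably adjusted for other covariates, which this sketch does not attempt.

```python
from statistics import mean


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variance, so the slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration only (NOT from the study).
metal = [8.0, 9.5, 10.0, 11.5, 13.0]   # blood metal concentration
hba1c = [5.9, 5.8, 5.7, 5.6, 5.4]      # HbA1c in percent

intercept, slope = ols_fit(metal, hba1c)
print(f"beta = {slope:.4f}, intercept = {intercept:.4f}")
```

A negative slope, as in this toy data, means higher exposure goes with lower HbA1c, which is the direction reported for manganese. Judging significance would also require standard errors and p-values, which are omitted here.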

6 of 23

Objectives

  • Investigate the association between heavy metal exposure and diabetes
  • Determine the most effective ML model for predicting diabetes risk

7 of 23

Literature review

  • Few studies have used machine learning tools to predict diabetes risk from heavy metal exposure
  • Use of dietary copper (Cu), urinary cadmium (Cd) and urinary mercury (Hg) for improved prediction accuracy (Zhao et al., 2023).
  • Urinary antimony as a significant predictor of diabetic retinopathy risk (Gui et al., 2024).

8 of 23

Data Preprocessing

9 of 23

Dataset

Data Source: NHANES 2011-2018, initially consisting of 51,122 responses and 330 variables.

Dataset Size: 13,667 survey responses

10 of 23

Dataset Histogram

11 of 23

Dataset Histogram

12 of 23

Data Filtering

Data Imbalance:

  • 87.38% without diabetes, 12.62% with diabetes.
  • Imbalance can pose challenges in modeling.

Feature Standardization:

  • Standardized nonbinary columns.
  • Ensures fair comparison and accurate analysis.
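
The standardization step can be sketched as a z-score transform that skips binary indicator columns. The column names (`age`, `blood_lead`, `female`), the sample rows, and the helper `standardize_nonbinary` are hypothetical. Dividing by the population standard deviation is an assumption, since the slides do not say which variant was used.

```python
from statistics import mean, pstdev


def standardize_nonbinary(rows, columns):
    """Z-score each listed column unless all of its values are 0 or 1."""
    result = [dict(r) for r in rows]  # copy so the input stays unchanged
    for col in columns:
        values = [r[col] for r in rows]
        if set(values) <= {0, 1}:
            continue  # binary indicator: leave as is
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant column: nothing to scale
        for r in result:
            r[col] = (r[col] - mu) / sigma
    return result


# Hypothetical rows for illustration only (NOT from NHANES).
rows = [
    {"age": 30, "blood_lead": 0.8, "female": 0},
    {"age": 50, "blood_lead": 1.4, "female": 1},
    {"age": 70, "blood_lead": 2.3, "female": 1},
]
scaled = standardize_nonbinary(rows, ["age", "blood_lead", "female"])
for r in scaled:
    print(r)
```

After the transform, each nonbinary column has mean 0 and unit variance, so features measured in different units contribute on a comparable scale.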

13 of 23

Key Variables

  • Demographics (age, sex, BMI, etc.)
  • Blood heavy metal concentrations (Lead, Mercury, Cadmium, Manganese)
  • Selenium in blood
  • Weight-related variables (weight, height, weight in last year, etc.)

14 of 23

Downsampling and Stratified K-Fold Cross-Validation

  • Downsampling to prevent the model from being biased towards the majority class.
  • Stratified K-fold cross-validation divides the data into K folds.
  • Each fold maintains the same proportion of classes as the original dataset.
  • Provides a more accurate and unbiased evaluation of the model's ability to perform on all classes.
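
The two steps above can be sketched in plain Python as follows. The helper names `downsample` and `stratified_folds`, the toy label vector, and the choice to downsample before splitting are all assumptions for illustration. The slides do not state the order used in the study or the value of K.

```python
import random
from collections import defaultdict


def downsample(labels, seed=0):
    """Return sorted indices with every class cut to the minority-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(ix) for ix in by_class.values())
    kept = []
    for ix in by_class.values():
        kept.extend(rng.sample(ix, n_min))  # random subset of each class
    return sorted(kept)


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for ix in by_class.values():
        rng.shuffle(ix)
        for pos, i in enumerate(ix):
            folds[pos % k].append(i)
    return folds


# Hypothetical labels with roughly the reported imbalance (1 = diabetes).
labels = [1] * 13 + [0] * 87
kept = downsample(labels)
balanced = [labels[i] for i in kept]

folds = stratified_folds(balanced, k=5)
for n, fold in enumerate(folds):
    positives = sum(balanced[i] for i in fold)
    print(f"fold {n}: size={len(fold)}, positives={positives}")
```

Each fold serves once as the held-out set while the remaining folds are used for training. Because every class is dealt out evenly, each fold keeps the class ratio of the data it was built from.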

15 of 23

Principal Component Analysis

  • Too many features in our weight-related data
  • Increases complexity, overfitting risk, and computational cost
  • PCA to reduce the number of features by transforming them into a smaller set of principal components
  • Understand the core factors influencing weight, streamline analysis
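
To make the idea concrete, the sketch below extracts only the first principal component from a few correlated columns, using power iteration on the covariance matrix. The function `first_principal_component` and the weight values are hypothetical. A full PCA would also return the remaining components and their explained variance, and would usually be run with a numerical library rather than plain Python.

```python
from statistics import mean


def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [mean(r[j] for r in rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("data has no variance along the start vector")
        v = [x / norm for x in w]
    # Project each centered row onto the component direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical weight-related columns (NOT from NHANES):
# current weight, weight one year ago, weight ten years ago (kg).
weights = [
    [70.0, 69.0, 65.0],
    [82.0, 80.5, 76.0],
    [95.0, 96.0, 90.0],
    [60.0, 61.0, 58.0],
]
direction, scores = first_principal_component(weights)
print("direction:", [round(x, 3) for x in direction])
print("scores:", [round(s, 2) for s in scores])
```

When the columns move together, as these do, a single score per participant captures most of their shared variation and can replace the original columns as a model input.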

16 of 23

ML Models

CatBoost

LightGBM

Random Forest

Feedforward Neural Network (FNN)

17 of 23

Results for Risk Assessment

Metric       Random Forest   CatBoost   LightGBM   FNN
Accuracy     0.7166          0.7285     0.7163     0.668
Precision    0.6999          0.707      0.6985     0.6539
Recall       0.7596          0.7811     0.7626     0.7159
F1           0.7283          0.7421     0.729      0.6832
F2           0.7467          0.765      0.7487     0.7024
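
For reference, the metrics reported above can all be derived from a confusion matrix, as the sketch below shows. The counts are hypothetical and are not the study's confusion matrix, and `classification_metrics` is an illustrative helper. F2 is the F-beta score with beta = 2, which weights recall more heavily than precision. That suits screening settings, where a missed case is usually costlier than a false alarm.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and F2 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_beta(beta):
        # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
        denom = beta ** 2 * precision + recall
        return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }


# Hypothetical confusion counts for a balanced test set (NOT from the study).
metrics = classification_metrics(tp=78, fp=32, fn=22, tn=68)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If the reported scores are averages over cross-validation folds, the reported F1 and F2 need not exactly equal the values computed from the averaged precision and recall.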

18 of 23

Feature importance for CatBoost

  1. Lead in blood (8.254) “Most significant heavy metal predictor”
  2. Selenium in blood (5.688)
  3. Cadmium in blood (4.842)
  4. Mercury in blood (4.759)
  5. Manganese in blood (4.232)

19 of 23

Feature importance for CatBoost

20 of 23

Conclusions

Heavy metal exposure, particularly lead, is strongly associated with diabetes risk.

CatBoost is the most effective model for predicting diabetes based on environmental and physiological factors.

These findings underscore the need for stricter environmental regulations and further research into heavy metal toxicity and diseases.

21 of 23

Future Work

Exploring additional environmental and genetic factors.

Implementing deep learning models to improve predictive accuracy.

Conducting longitudinal studies to establish causal relationships.

22 of 23

References

Centers for Disease Control and Prevention. (2024, May 15). A report card: Diabetes in the United States infographic. https://www.cdc.gov/diabetes/communication-resources/diabetes-statistics.html

American Diabetes Association. (2023). Annual Report 2023. Retrieved from https://diabetes.org/sites/default/files/2024-06/ADA_2023_AnnualReport.pdf

Zhao, M., Wan, J., Qin, W., Huang, X., Chen, G., & Zhao, X. (2023). A machine learning-based diagnosis modeling of type 2 diabetes mellitus with environmental metal exposure. Computer Methods and Programs in Biomedicine, 107537. https://doi.org/10.1016/j.cmpb.2023.107537

Gui, Y., Gui, S., Wang, X., Li, Y., Xu, Y., & Zhang, J. (2024). Exploring the relationship between heavy metals and diabetic retinopathy: a machine learning modeling approach. Scientific Reports, 14, 13049.

Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey (NHANES), 2011-2018. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. Available from: https://www.cdc.gov/nchs/nhanes/index.htm

23 of 23

Thanks