Predicting Obesity Risk in South America Using Supervised Machine Learning Models
Qirui Zhang
Introduction & Significance
Global Obesity Overview:
- In 2020, approximately 14% of the global population was obese.
- This is projected to rise to 24% by 2035.
Impact of Obesity:
- Leads to various chronic diseases such as cardiovascular diseases, type 2 diabetes, and certain cancers.
- Significant economic burden, costing around $2 trillion annually, which is about 2.8% of global GDP.
Research Goals and Regional Focus:
- Develop a machine learning model capable of predicting obesity risk.
- Special focus on unique challenges faced in Latin America.
This study explores three questions:
1. Which has a greater impact on the development of obesity: poor dietary habits or lack of exercise?
2. Does the use of technological devices increase the risk of developing obesity?
3. Does vegetable consumption have a significant impact on the risk of obesity?
Research Question:
How can machine learning models integrate lifestyle factors, dietary factors, and physical condition to predict the risk of obesity in Latin America, using data from individuals in Mexico, Peru, and Colombia?
Methods
Data Source & Study Design
Results will inform region-specific public-health strategies
Measures (Outcome & Exposures)
Dietary pattern: high-calorie food and vegetable frequency, meals/day, snacking
Physical activity, electronic-device use, water intake, alcohol, smoking, calorie monitoring, transport mode
Age, gender, family obesity history
Height, Weight
Statistical Analysis
Synthetic augmentation: CTGAN (1 000 epochs) → +5 000 records → merged n = 7 111 (height/weight withheld)
Variable selection: stepwise AIC logistic regression
Model comparison: 12 classifiers (LDA, QDA, KNN with k = 7, Tree, RF, Bagging, GBM, LASSO, XGBoost, SVM, MLP, etc.)
70/30 train-test split + 5-fold CV
Metrics: Accuracy, Sensitivity, Specificity, Precision, F1, AUC
Bayesian hierarchical logistic model to quantify lifestyle effects → report ORs ± 95 % CI, ROC curves, variable importance
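The evaluation protocol listed above (70/30 split, 5-fold CV, KNN with k = 7, and the six metrics) can be sketched roughly as follows. This is a minimal standard-library illustration, not the study's actual pipeline: the toy data generator and every function name (make_toy_data, knn_score, classification_metrics, cv_accuracy) are hypothetical, and the printed numbers have no relation to the reported results.

```python
import math
import random


def make_toy_data(n=300, seed=1):
    """Hypothetical stand-in data: two overlapping 2-D Gaussian classes."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 1.5 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


def train_test_split(rows, test_frac=0.30, seed=0):
    """Shuffle a copy of the rows, then hold out test_frac of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def knn_score(train, x, k=7):
    """Share of the k nearest training points (Euclidean) labelled 1."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(label for _, label in nearest) / k


def classification_metrics(y_true, scores, threshold=0.5):
    """Confusion-matrix metrics at a threshold, plus rank-based AUC."""
    y_pred = [int(s >= threshold) for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")

    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # AUC = P(random positive outscores random negative); ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "auc": ratio(wins, len(pos) * len(neg)),
    }


def cv_accuracy(rows, n_folds=5, k=7):
    """Mean held-out accuracy over n_folds interleaved folds."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    accs = []
    for i, held_out in enumerate(folds):
        fit = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores = [knn_score(fit, x, k) for x, _ in held_out]
        labels = [label for _, label in held_out]
        accs.append(classification_metrics(labels, scores)["accuracy"])
    return sum(accs) / len(accs)


train, test = train_test_split(make_toy_data())
test_scores = [knn_score(train, x) for x, _ in test]
test_metrics = classification_metrics([y for _, y in test], test_scores)
print("5-fold CV accuracy (train):", round(cv_accuracy(train), 3))
print({name: round(value, 3) for name, value in test_metrics.items()})
```

With k = 7 and a 0.5 threshold, the score reduces to a majority vote among the seven neighbours, while the raw vote share still gives a ranking from which AUC can be computed.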
Results
Conclusions
Superior predictive performance: K-Nearest Neighbor (AUC = 0.873; 73.4 % accuracy) and ensemble learners (Boosting, Random Forest) captured the multifactorial nature of obesity better than traditional single-model approaches.
Diet > Physical Activity: Frequent high-calorie intake exerted the strongest effect on obesity, corroborating Gardner et al. (2021) that diet is the dominant driver of excess weight.
Technology-related sedentary behavior: 1–3 h/day of electronic‐device use markedly increased obesity risk, consistent with Chatterjee (2019) and the growing screen-time literature.
Weak vegetable-protective signal: Contrary to many cohort studies, moderate-to-high vegetable consumption showed uncertain or slightly positive associations with obesity; cultural meal patterns or measurement imprecision may mask expected benefits.
Methodological contribution: Combining CTGAN data augmentation with Bayesian hierarchical modelling provided granular, uncertainty-aware estimates of behavioral effects in a Latin-American context.
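As an illustration of how an uncertainty-aware effect estimate of this kind is typically summarized, the sketch below turns posterior draws of a log-odds coefficient into an odds ratio with a 95% equal-tailed interval. It assumes the Bayesian logistic model yields MCMC draws per coefficient; the draws here are simulated placeholders, and the function name odds_ratio_summary is hypothetical, so the printed values are not results from the study.

```python
import math
import random


def odds_ratio_summary(log_odds_draws, level=0.95):
    """Posterior median OR and equal-tailed interval from log-odds draws."""
    ors = sorted(math.exp(b) for b in log_odds_draws)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(ors) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return ors[lo] + (ors[hi] - ors[lo]) * (pos - lo)

    tail = (1.0 - level) / 2.0
    return quantile(0.5), quantile(tail), quantile(1.0 - tail)


# Simulated draws standing in for MCMC output; NOT results from the study.
rng = random.Random(42)
draws = [rng.gauss(0.8, 0.2) for _ in range(4000)]
median_or, lower, upper = odds_ratio_summary(draws)
print(f"OR = {median_or:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

An interval that lies entirely above 1 would indicate higher odds of obesity for that behavior; an interval spanning 1, as described for vegetable consumption, would indicate an uncertain direction.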
Limitations & Recommendations
Cross-sectional Design: Cannot establish causality; identifies associations only.
Selection Bias: Data collected through social media and email may favor participants with greater technological access and higher literacy levels.
Information Bias: Relies on self-reported data, which may be inaccurate due to social desirability effects.
Data Scope and Representativeness: Covers three countries but may not fully represent all demographics.
Missing Key Variables: Lacks important variables such as detailed family medical history and psychological factors affecting dietary behavior.
Prioritize diet: tax or cap high-calorie foods, and emphasize dietary reforms over stand-alone exercise campaigns
Reduce screen time
Add nutrition lessons in schools and workplaces
Limit advertising for high-calorie snacks
References
[1] NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in underweight and obesity from 1990 to 2022: a pooled analysis of 3663 population-representative studies with 222 million children, adolescents, and adults. Lancet. 2024;403(10431):1027-1050.
[2] Ferreira SRG, Macotela Y, Velloso LA, Mori MA. Determinants of obesity in Latin America. Nat Metab. 2024;6(3):409-432.
[3] Ohanyan H, Portengen L, Huss A, et al. Machine learning approaches to characterize the obesogenic urban exposome. Environ Int. 2022;158:107015.
[4] Gouveia N, Kephart JL, Dronova I, et al. Ambient fine particulate matter in Latin American cities: Levels, population exposure, and associated urban factors. Sci Total Environ. 2021;772:145035.
[5] Palechor FM, Manotas AH. Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data Brief. 2019;25:104344.
Thank You