Nested ensemble machine learning to predict heart attack risk in patients with chest pain
Chris Kennedy
Kaiser Permanente Division of Research
PhD candidate in biostatistics at UC Berkeley
Context for why I'm here
My background / plans
Dissertation chapters
Methods interests
Returning to the chest pain project
Collaborators
Caveat: preliminary results
Please do not cite or share results.
Preprint in the works and planned for release by July.
Scientific question (exploratory)
Among patients who present to KP with chest pain, what is their probability of having a heart attack or other major adverse cardiac event (MACE) in the next 2 months?
This risk estimate could support improved resource allocation/workup and patient outcomes/discharge, based on a low-risk cut-off of < 0.5% (or 1% or 2%).
Topics to cover
Data structure
Observations: 116,764
Outcome: major adverse cardiac event (MACE) in 60 days (1.88% positive)
Covariates (65):
Missing data
ggplot, kableExtra
Correlation structure of missingness
superheat::superheat()
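One way to quantify the correlation structure of missingness is to turn each covariate into a 0/1 "is missing" indicator and correlate the indicators pairwise. The sketch below is an assumption about how such a matrix could be built, not the code behind the heatmap; the function name and the example values are hypothetical.

```python
from math import sqrt

def missingness_phi(col_a, col_b):
    """Phi coefficient: Pearson correlation of two 0/1 missingness indicators."""
    a = [value is None for value in col_a]
    b = [value is None for value in col_b]
    n11 = sum(x and y for x, y in zip(a, b))          # both missing
    n10 = sum(x and not y for x, y in zip(a, b))      # only col_a missing
    n01 = sum(y and not x for x, y in zip(a, b))      # only col_b missing
    n00 = sum(not x and not y for x, y in zip(a, b))  # both observed
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Undefined when either column is never (or always) missing.
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")

# Hypothetical values, invented for illustration only.
troponin_3hr = [0.02, None, None, 0.01, None, 0.03]
pulse = [72, None, None, 80, 65, 90]
phi = missingness_phi(troponin_3hr, pulse)
print(f"missingness correlation: {phi:.2f}")
```

Computing this for every pair of covariates gives the matrix that a heatmap would display.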
Generalized low rank models for imputation
Hyperparameter tuning for GLRM: 5 dimensions
These can be optimized using a training/test split (e.g. 80/20) or cross-validation
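A minimal sketch of the two evaluation schemes mentioned above. It splits row indices; whether rows or individual cells were held out for the GLRM is not stated here, so treat that as an assumption. Function names, the seed, and the fold count are hypothetical.

```python
import random

def train_test_split(n_rows, test_fraction=0.2, seed=1):
    """Shuffle row indices once and hold out a fixed fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_splits(n_rows, n_folds=5, seed=1):
    """Yield (train, test) index lists; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# 80/20 split over the 116,764 observations.
train, test = train_test_split(116_764)
print(len(train), len(test))
```

Each candidate hyperparameter setting would be fit on `train` and scored on `test` (or averaged over the folds), keeping the setting with the lowest held-out error.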
First round of hyperparameter tuning: 300,000
Default hyperparameters with 19 components: 787,000
Third round of hyperparameter tuning: 2,000
Optimal settings for missing value imputation
Imputation error comparison: GLRM vs. Median/Mode
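The median/mode baseline can be sketched as follows. The error metric is not named here, so the RMSE on deliberately masked values is an assumption, and all names and example values are hypothetical.

```python
from math import sqrt
from statistics import median, mode

def impute_median(values):
    """Fill missing numeric entries (None) with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Fill missing categorical entries (None) with the most common observed level."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def rmse_on_masked(truth, imputed, masked_positions):
    """Root mean squared error, evaluated only where values were masked."""
    errors = [(truth[i] - imputed[i]) ** 2 for i in masked_positions]
    return sqrt(sum(errors) / len(errors))

# Hypothetical pulse values; position 2 is masked on purpose so the truth is known.
truth = [62.0, 70.0, 78.0, 80.0, 110.0]
masked = [62.0, 70.0, None, 80.0, 110.0]
baseline_error = rmse_on_masked(truth, impute_median(masked), [2])
print(baseline_error)
```

The same masked positions would be scored for the GLRM imputations, so that both methods are compared on identical held-out cells.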
GLRM: examining cumulative variance explained
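A sketch of the cumulative-variance calculation, assuming each component comes with a variance (as in PCA). How the GLRM reported per-component variance is not stated here, and the input numbers below are invented.

```python
from itertools import accumulate

def cumulative_variance_explained(component_variances):
    """Running share of total variance captured by the first k components."""
    total = sum(component_variances)
    return [running / total for running in accumulate(component_variances)]

def components_needed(component_variances, target=0.9):
    """Smallest number of components whose cumulative share reaches the target."""
    shares = cumulative_variance_explained(component_variances)
    for k, share in enumerate(shares, start=1):
        if share >= target:
            return k
    return len(shares)

# Hypothetical component variances, largest first.
variances = [4.0, 3.0, 2.0, 1.0]
shares = cumulative_variance_explained(variances)
print([round(s, 2) for s in shares])
```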
Machine Learning
Decision tree baseline (6 leaf nodes)
Given the class imbalance, it was critical to re-weight the loss matrix by the inverse class proportions.
Alternatively, one could use observation weights, but then the plots cannot accurately show the proportion of the dataset in each leaf node.
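To illustrate the re-weighting: with a 1.88% event rate, an unweighted tree labels almost every node negative. The sketch below builds a 2x2 loss matrix whose false-negative cost is the inverse class ratio and shows the effect on a node's predicted class. The row/column layout and function names are assumptions for illustration, not the rpart call used for the tree.

```python
def inverse_proportion_loss(prevalence):
    """Loss matrix: rows are the true class (0, 1), columns the predicted class."""
    cost_false_positive = 1.0
    cost_false_negative = (1.0 - prevalence) / prevalence
    return [[0.0, cost_false_positive],
            [cost_false_negative, 0.0]]

def predicted_class(node_event_rate, loss):
    """Pick the class with the lower expected loss within a node."""
    loss_if_predict_negative = node_event_rate * loss[1][0]
    loss_if_predict_positive = (1.0 - node_event_rate) * loss[0][1]
    return 1 if loss_if_predict_positive < loss_if_predict_negative else 0

weighted = inverse_proportion_loss(0.0188)  # MACE prevalence
unweighted = [[0.0, 1.0], [1.0, 0.0]]

# A hypothetical node in which 5% of patients have the outcome.
print(predicted_class(0.05, unweighted), predicted_class(0.05, weighted))
```

With these costs a node is labeled positive whenever its event rate exceeds the overall prevalence, rather than 50%.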
Performance metrics
precrec, cvAUC
Discrimination: PR-AUC
Baseline (outcome prevalence): 1.88%
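For reference, a precision-recall summary can be sketched as average precision: the mean of the precision values at each true positive in the ranked list. This is a swapped-in estimator that may differ from what precrec reports, and it does not treat tied scores specially. A classifier with no skill scores about the outcome prevalence (1.88% in this dataset). The example labels and scores are made up.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive, scanning from highest score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` patients
    return total / n_positive

# Hypothetical outcomes and predicted risks.
pr_labels = [1, 0, 1, 0, 0]
pr_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
ap = average_precision(pr_labels, pr_scores)
no_skill = sum(pr_labels) / len(pr_labels)
print(f"average precision {ap:.3f} vs. no-skill baseline {no_skill:.3f}")
```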
Discrimination: ROC-AUC
ROC Curve
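ROC-AUC equals the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, counting ties as one half. The sketch below computes it directly from that definition; it is quadratic in the sample size, so it is meant for intuition rather than for 116,764 rows. Example data are made up.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes and predicted risks.
roc_labels = [1, 0, 1, 0, 0]
roc_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(roc_labels, roc_scores)
print(f"ROC-AUC: {auc:.3f}")
```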
SuperLearner weight distribution
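In its usual configuration, SuperLearner combines the base learners as a convex combination: non-negative weights that sum to one. A minimal sketch of how such weights turn learner predictions into an ensemble prediction follows; the learner names, weights, and predictions are hypothetical, not results from this analysis.

```python
def ensemble_prediction(learner_predictions, weights):
    """Weighted average of base-learner predictions under convex weights."""
    if set(learner_predictions) != set(weights):
        raise ValueError("weights and predictions must cover the same learners")
    if any(w < 0 for w in weights.values()) or abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(weights[name] * learner_predictions[name] for name in weights)

# Hypothetical weights and one patient's predicted risk from each learner.
example_weights = {"tree": 0.1, "random_forest": 0.5, "xgboost": 0.4}
example_predictions = {"tree": 0.02, "random_forest": 0.04, "xgboost": 0.01}
risk = ensemble_prediction(example_predictions, example_weights)
print(f"ensemble risk: {risk:.3f}")
```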
Random Forest plateau analysis
Calibration
Calibration: overall
Calibration: zoomed in and exponential scale
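Calibration plots like these are commonly built by grouping patients into equal-sized bins of predicted risk and comparing the mean prediction with the observed event rate in each bin. The sketch below assumes that approach; the binning actually used is not stated here, and the example data are made up.

```python
def calibration_table(labels, predictions, n_bins=10):
    """Return (mean predicted risk, observed event rate, count) per risk bin."""
    ranked = sorted(zip(predictions, labels))
    n = len(ranked)
    rows = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than observations
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_predicted, observed_rate, len(chunk)))
    return rows

# Hypothetical predictions and outcomes, two bins.
cal_predictions = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
cal_labels = [0, 0, 0, 1, 1, 1, 1, 0]
table = calibration_table(cal_labels, cal_predictions, n_bins=2)
for mean_predicted, observed_rate, count in table:
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={count}")
```

A well-calibrated model has points near the diagonal; zooming in on the lowest bins matters most when the decision threshold is a risk of 2% or less.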
Variable importance
Vimp: Decision Tree
vip, rpart
Vimp: OLS
vip
Vimp: Random Forest
vip, ranger
Vimp: XGBoost
vip, xgboost
Accumulated local effect plots
ALE: troponin peak, age, EDACS
iml
ALE: troponin 3-hour, HEART, pulse
iml
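For intuition, a first-order ALE curve can be sketched as follows: split the feature into quantile bins, average the change in prediction when each patient's value is moved from the lower to the upper edge of their own bin, accumulate those averages, and center the curve. This is a simplified illustration, not the iml implementation; the model, feature names, and data are hypothetical.

```python
def ale_1d(predict, rows, feature, n_bins=4):
    """Return (upper bin edges, centered accumulated local effects)."""
    values = sorted(row[feature] for row in rows)
    last = len(values) - 1
    # Bin edges at empirical quantiles, with duplicates removed.
    edges = sorted({values[round(q * last / n_bins)] for q in range(n_bins + 1)})
    if len(edges) < 2:
        raise ValueError("feature needs at least two distinct values")
    effects, counts = [], []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # The first bin is closed on the left so the minimum value is included.
        in_bin = [r for r in rows
                  if lo < r[feature] <= hi or (i == 0 and r[feature] == lo)]
        diffs = [predict({**r, feature: hi}) - predict({**r, feature: lo})
                 for r in in_bin]
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(in_bin))
    accumulated, running = [], 0.0
    for effect in effects:
        running += effect
        accumulated.append(running)
    # Center so the count-weighted mean effect is zero.
    center = sum(a * c for a, c in zip(accumulated, counts)) / sum(counts)
    return edges[1:], [a - center for a in accumulated]

# Hypothetical linear risk score and patients.
def toy_model(row):
    return 2.0 * row["age"] + 0.5 * row["pulse"]

patients = [{"age": a, "pulse": p}
            for a, p in [(40, 70), (50, 90), (60, 65), (70, 80), (80, 100)]]
ale_edges, ale_values = ale_1d(toy_model, patients, "age")
print(list(zip(ale_edges, ale_values)))
```

Because differences are taken only within each patient's own bin, the curve reflects the feature's local effect without extrapolating to unrealistic covariate combinations, which is the main advantage over partial dependence when covariates are correlated.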
Exploratory data analysis
EDA: EDACS
EDACS: inefficient resolution of score
EDA: age
EDA: BMI
Correlation (Pearson)
ck37r::
vim_corr()
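A sketch of ranking covariates by their Pearson correlation with the outcome. What `vim_corr()` computes is not described here, so treating it as a correlation-based ranking is an assumption; the covariate values below are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_by_correlation(covariates, outcome):
    """Sort covariates by absolute correlation with the outcome, strongest first."""
    scores = {name: pearson(values, outcome) for name, values in covariates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical covariates and a binary outcome.
example_covariates = {"age": [40, 50, 60, 70], "pulse": [80, 60, 90, 70]}
example_outcome = [0, 0, 1, 1]
ranking = rank_by_correlation(example_covariates, example_outcome)
for name, r in ranking:
    print(f"{name}: {r:.2f}")
```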
Thanks - any questions, comments, or feedback?
Twitter: @c3k
Website: ck37.com
GitHub: github.com/ck37