One Model Does Not Fit All: Demographic Disparities in Machine Learning-Based Prediction of 30-Day Readmission Following Acute Myocardial Infarction
Background: Machine learning (ML) models are increasingly used to predict clinical outcomes such as 30-day readmission following acute myocardial infarction (AMI). However, most models are trained on heterogeneous populations and rarely account for disparities across demographic subgroups, potentially exacerbating inequities in care. A model that underperforms in certain populations may produce biased predictions, leading to misinformed clinical decisions and widening existing health disparities.
Objective: To evaluate the performance of a generalized XGBoost model for predicting 30-day AMI readmission and to assess how performance varies across demographic and clinical subgroups.
Methods: We used a cohort of electronic health records from Vanderbilt University Medical Center comprising patients hospitalized with an AMI between 2007 and 2016. Predictors included demographic variables (N=3) and clinical characteristics (N=108). The outcome was readmission within 30 days of the incident hospitalization. Missing values were imputed using K-Nearest Neighbors imputation. After standard preprocessing and a stratified 80/20 train-test split, we trained an XGBoost classifier and evaluated performance using AUROC, specificity, and sensitivity. We then stratified the data by age group, sex, race, comorbidity category (Charlson), and length of stay to examine subgroup-specific model performance.
Results: The cohort included 6,179 patients, of whom 10.5% had a 30-day readmission. The overall XGBoost model achieved high accuracy (89.5%) and specificity (99.5%) but demonstrated poor sensitivity (3.8%) and a modest AUROC of 0.6050.
Subgroup analysis identified major performance differences, especially for younger patients (age <40), older patients (age >80), patients with a greater comorbidity burden (Charlson severe), and patients with a medium length of stay (LOS). Based on AUROC, the following subgroups performed better than the overall model: females, age 40-60, age 60-80, Charlson mild, LOS short, and LOS long.
Conclusions: Material differences in performance metrics were seen across demographic and clinical subgroups. This highlights the risk of deploying generalized models in diverse populations and underscores the need for subgroup-specific or fairness-aware modeling to improve both accuracy and equity in clinical predictions.
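The subgroup evaluation described in the Methods can be illustrated with a minimal, dependency-free sketch. This is not the study's code: the function names (`auroc`, `confusion_rates`, `subgroup_metrics`, `accuracy_from_rates`), the 0.5 decision threshold, and the toy data are all assumptions made for illustration, and the abstract does not state which threshold was used. The sketch computes sensitivity, specificity, accuracy, and a rank-based AUROC separately for each subgroup, and also shows how accuracy follows from sensitivity, specificity, and outcome prevalence.

```python
from collections import defaultdict


def auroc(y_true, y_score):
    """Rank-based AUROC: share of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when a group contains only one class.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_rates(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "accuracy": (tp + tn) / len(pairs),
    }


def subgroup_metrics(y_true, y_score, groups, threshold=0.5):
    """Evaluate one set of model scores separately within each subgroup.

    The threshold of 0.5 is an assumed default, not a value from the study.
    """
    by_group = defaultdict(list)
    for y, s, g in zip(y_true, y_score, groups):
        by_group[g].append((y, s))
    results = {}
    for g, rows in by_group.items():
        ys = [y for y, _ in rows]
        ss = [s for _, s in rows]
        preds = [int(s >= threshold) for s in ss]
        metrics = confusion_rates(ys, preds)
        metrics["auroc"] = auroc(ys, ss)
        metrics["n"] = len(rows)
        results[g] = metrics
    return results


def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


if __name__ == "__main__":
    # Toy data (hypothetical, for illustration only).
    y_true = [1, 0, 0, 1, 1, 0, 0, 0]
    y_score = [0.9, 0.2, 0.4, 0.6, 0.3, 0.1, 0.4, 0.2]
    groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
    for name, m in subgroup_metrics(y_true, y_score, groups).items():
        print(name, m)
    # Rates reported in the abstract: sensitivity 3.8%, specificity 99.5%,
    # readmission prevalence 10.5%.
    print(accuracy_from_rates(0.038, 0.995, 0.105))
```

With the reported rates, `accuracy_from_rates(0.038, 0.995, 0.105)` gives about 0.8945, matching the reported 89.5% accuracy. That is also the accuracy of predicting "no readmission" for every patient in a cohort with 10.5% prevalence, which is why sensitivity and AUROC are the more informative metrics to compare across subgroups here.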
Shah, Jaini (Dartmouth College, Monroe, New York, United States)
Matheny, Michael (Vanderbilt University Medical Center, Nashville, Tennessee, United States)
Reeves, Ruth (Vanderbilt University Medical Center, Nashville, Tennessee, United States)
Brown, Jeremiah (The Dartmouth Institute, Lebanon, New Hampshire, United States)
Ricket, Iben (Dartmouth College, Monroe, New York, United States)
Author Disclosures:
Jaini Shah: No Answer
Michael Matheny: DO have relevant financial relationships; Advisor: PCORI Methods Committee: Active (exists now)
Ruth Reeves: No Answer
Jeremiah Brown: DO NOT have relevant financial relationships
Iben Ricket: DO NOT have relevant financial relationships