American Heart Association

Final ID: TH961

Beyond Traditional Risk Factors: A Data-Driven Approach to Coronary Artery Disease (CAD) Prediction

Abstract Body: Background: Coronary artery disease (CAD) remains the leading cause of death worldwide and accounted for over 370,000 deaths in the United States in 2022. Traditional tools such as the Framingham Risk Score (c-statistic 0.65–0.75) have limited predictive accuracy and often fail to capture nonlinear interactions among risk factors. Improved early risk stratification could reduce mortality by up to 30%. Advances in machine learning (ML) offer new opportunities to integrate demographic, behavioral, and physiological variables for more precise prediction. This study aims to enhance CAD risk detection using feature engineering, synthetic data balancing, and optimized ML models.
Methods: We analyzed a publicly available cardiovascular dataset of 70,000 adults containing 11 predictors, including age, height, weight, systolic and diastolic blood pressure, cholesterol, glucose, smoking, alcohol, and physical activity. Age was converted from days to years, and implausible outliers were excluded. New features such as body mass index (BMI), pulse pressure (sbp_hi – dbp_lo), and categorical age bins were engineered to improve clinical interpretability. To address class imbalance, we applied the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTENC), generating a balanced dataset of 1 million records. Two ML models, XGBoost and HistGradientBoostingClassifier, were trained using RandomizedSearchCV with stratified k-fold cross-validation. Model performance was evaluated via accuracy, F1-score, and confusion matrices.
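The feature-engineering step described in the Methods might look roughly like the following minimal sketch. It is illustrative only and not the authors' code: the `Record` type, the field units (centimeters, kilograms, mmHg), the plausibility thresholds, and the age-bin boundaries are all assumptions, since the abstract does not state them.

```python
from dataclasses import dataclass

DAYS_PER_YEAR = 365.25  # assumption: average year length used for the conversion


@dataclass
class Record:
    """Hypothetical raw record; field names and units are assumed."""
    age_days: int
    height_cm: float
    weight_kg: float
    sbp_hi: float  # systolic blood pressure, mmHg
    dbp_lo: float  # diastolic blood pressure, mmHg


def is_plausible(r: Record) -> bool:
    """Outlier filter with illustrative thresholds (not taken from the abstract)."""
    return (
        120 <= r.height_cm <= 220
        and 30 <= r.weight_kg <= 250
        and 60 <= r.sbp_hi <= 260
        and 30 <= r.dbp_lo <= 160
        and r.sbp_hi > r.dbp_lo  # systolic must exceed diastolic
    )


def age_bin(age_years: float) -> str:
    """Categorical age bin; the cut points are assumed for illustration."""
    if age_years < 40:
        return "<40"
    if age_years < 50:
        return "40-49"
    if age_years < 60:
        return "50-59"
    return "60+"


def engineer(r: Record) -> dict:
    """Derive age in years, BMI, pulse pressure, and an age bin from one record."""
    age_years = r.age_days / DAYS_PER_YEAR
    height_m = r.height_cm / 100.0
    return {
        "age_years": age_years,
        "bmi": r.weight_kg / height_m ** 2,      # kg / m^2
        "pulse_pressure": r.sbp_hi - r.dbp_lo,   # mmHg
        "age_bin": age_bin(age_years),
    }
```

In a pipeline of this shape, records failing `is_plausible` would be dropped first, and `engineer` would then be applied to the remaining records before any oversampling or model fitting.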
Results: XGBoost achieved the highest overall accuracy (90.94%) and F1-score (0.91) across both classes, outperforming HistGradientBoostingClassifier (accuracy 80.68%). Incorporating engineered variables improved interpretability, while synthetic oversampling effectively mitigated class imbalance. The optimized XGBoost model demonstrated a 6% performance improvement over baseline implementations, underscoring the importance of robust preprocessing and hyperparameter tuning.
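For readers less familiar with the reported metrics, the sketch below shows how accuracy and a per-class F1-score follow from a binary confusion matrix. The function names and the toy labels are hypothetical and chosen for illustration; they are not the study's data or code.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only (1 = CAD, 0 = no CAD).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
acc = accuracy(tn, fp, fn, tp)
f1_positive = f1(tp, fp, fn)   # F1 for class 1
f1_negative = f1(tn, fn, fp)   # F1 for class 0: roles of the two classes swapped
```

Reporting F1 for each class separately, as the abstract does, guards against a model that scores well on accuracy while performing poorly on one class.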
Conclusions: Feature-engineered and balanced ML models significantly improve CAD prediction accuracy compared with traditional methods. These data-driven approaches can support clinical decision-making by identifying high-risk individuals earlier and more precisely. Future work will integrate genetic, socioeconomic, and wearable device data to refine personalized prevention strategies and expand this framework to other chronic disease domains.

Gupta, Isheeta (Washington University in St. Louis, Saint Louis, Missouri, United States)
Verma, Mallikarjun (Washington University in St. Louis, Saint Louis, Missouri, United States)

Meeting Info:

EPI-Lifestyle Scientific Sessions 2026

2026

Boston, Massachusetts

Session Info:

Poster Session 3

Thursday, 03/19/2026, 05:00PM - 07:00PM

Poster Session
