Predicting post-stroke all-cause dementia incidence using machine learning models and electronic health record data
Abstract Body: Introduction: All-cause dementia remains a significant public health concern, and stroke is recognized as a key risk factor. Few studies have applied machine learning (ML) models to predict cognitive impairment and dementia, and none have focused specifically on post-stroke dementia risk prediction. This study aims to compare the performance of ML approaches and traditional biostatistical methods for predicting the one-year incidence of post-stroke all-cause dementia using electronic health record (EHR) data. Methods: We analyzed de-identified data extracted from the TriNetX network, covering 60 healthcare organizations. The study included patients aged 20 years or older who experienced their first stroke (of any type) in 2018 (baseline). We excluded those with a history of dementia, those lacking data 3 years after stroke onset, and those without relevant health data within the 3 years preceding the stroke. We developed four models: logistic regression (LR) with backward selection, two regularized LR models (LASSO and Ridge regression), and Random Forest (RF). The primary outcome was the incidence of all-cause dementia within one year post-stroke. Covariates included demographics, comorbidities, medications, laboratory measures, and vital signs. Model performance was evaluated using accuracy and the area under the receiver operating characteristic (ROC) curve (AUC). Results: The final cohort comprised 55,888 adults, of whom 8% developed all-cause dementia within the subsequent year. The sample was 48.4% female; 8.7% were aged 20-44, 37.2% were aged 45-64, and 54.0% were aged 65+. About 64% were non-Hispanic White. Among those who developed dementia, 49.7% were female and 80.5% were aged 65+. They also had slightly higher systolic blood pressure, lower BMI, and higher rates of comorbidities and medication use (Table 1).
Performance metrics for the models were as follows: LR with backward selection (accuracy: 92.07%; AUC: 0.8033), LASSO regression (92.09%; 0.8000), Ridge regression (92.04%; 0.8026), and RF (92.20%; 0.7828) (Table 2). Conclusion: This study demonstrated the feasibility of using ML models to predict post-stroke all-cause dementia incidence from EHR data. All four models performed similarly, with accuracy of about 92% and AUC between 0.78 and 0.80; the RF model achieved the highest accuracy (92.20%) and LR with backward selection the highest AUC (0.8033). ML approaches can learn from EHR data to identify individuals at higher risk of post-stroke dementia, potentially enabling targeted interventions and improved patient care.
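The model comparison described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses scikit-learn, a synthetic dataset with roughly 8% positive cases in place of the TriNetX cohort, and illustrative hyperparameters. Backward selection is approximated with cross-validated `SequentialFeatureSelector`, which may differ from the selection criterion used in the study, and the helper names `build_models` and `evaluate` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_models(n_selected=5, seed=0):
    """Return the four model families named in the abstract (settings are assumptions)."""
    # A very large C makes the L2 penalty negligible, approximating unpenalized LR.
    plain_lr = LogisticRegression(C=1e6, max_iter=2000)
    return {
        "LR (backward selection)": make_pipeline(
            StandardScaler(),
            # Backward elimination driven by cross-validated AUC (an assumed criterion).
            SequentialFeatureSelector(
                plain_lr,
                n_features_to_select=n_selected,
                direction="backward",
                scoring="roc_auc",
                cv=3,
            ),
            LogisticRegression(C=1e6, max_iter=2000),
        ),
        # L1-penalized logistic regression (LASSO).
        "LASSO": make_pipeline(
            StandardScaler(),
            LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
        ),
        # L2-penalized logistic regression (Ridge); L2 is the scikit-learn default.
        "Ridge": make_pipeline(
            StandardScaler(),
            LogisticRegression(C=0.1, max_iter=2000),
        ),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
    }


def evaluate(models, X_train, X_test, y_train, y_test):
    """Fit each model and report test-set accuracy and ROC AUC."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the outcome
        results[name] = {
            "accuracy": accuracy_score(y_test, model.predict(X_test)),
            "auc": roc_auc_score(y_test, prob),
        }
    return results


# Synthetic stand-in for the cohort (assumption): about 8% of samples are positive.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, weights=[0.92], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = evaluate(build_models(), X_train, X_test, y_train, y_test)
for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.4f}, AUC={metrics['auc']:.4f}")
```

Reporting both metrics side by side mirrors the abstract, where the model with the highest accuracy (RF) is not the one with the highest AUC (LR with backward selection).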
Ding, Xueting (University of California, Irvine, Irvine, California, United States)
Dai, Jiahui (University of California, Irvine, Irvine, California, United States)
Meng, Yang (University of California, Irvine, Irvine, California, United States)
Xiang, Liner (University of California, Irvine, Irvine, California, United States)
Albala, Bruce (University of California, Irvine, Irvine, California, United States)
Castro, Megan (University of California, Irvine, Irvine, California, United States)
Kurzman, Alissa (University of California, Irvine, Irvine, California, United States)
Gutierrez, Desiree (University of California, Irvine, Irvine, California, United States)
Boden-Albala, Bernadette (University of California, Irvine, Irvine, California, United States)
Author Disclosures:
Xueting Ding: DO NOT have relevant financial relationships
Jiahui Dai: DO NOT have relevant financial relationships
Yang Meng: DO NOT have relevant financial relationships
Liner Xiang: No Answer
Bruce Albala: DO NOT have relevant financial relationships
Megan Castro: No Answer
Alissa Kurzman: DO NOT have relevant financial relationships
Desiree Gutierrez: DO NOT have relevant financial relationships
Bernadette Boden-Albala: DO NOT have relevant financial relationships