Scalable Phenotyping of Heart Failure Across Multicenter, Non-Interoperable Health Systems Using Retrieval-Augmented Generation and Large Language Models
Background: While identifying patient characteristics is critical to all electronic health record (EHR)-based research, multicenter studies are impeded by differences in data structures, such that tools do not generalize across EHRs. Large language models (LLMs) can be optimized with Retrieval-Augmented Generation (RAG) to enable EHR-structure-agnostic queries for cohort characterization with minimal a priori knowledge of EHR structure. We develop and validate a tabular RAG model to extract clinical characteristics across multiple domains among patients with heart failure (HF) in 2 distinct health system EHRs.
Methods: Our approach employs a novel RAG architecture that combines information retrieval with a generative text model (Llama2-13b) to enhance data extraction from medical records. The retrieval step identifies data relevant to the query for a clinical feature, and the generative model then synthesizes the output in an interpretable form. We evaluated this model on 1000 HF patients from the Yale New Haven Health System and 1000 deidentified records from Beth Israel Deaconess Medical Center (MIMIC-IV). Clinical knowledge-based queries extracted categorical features (demographics, conditions, and medications) and continuous features (vital signs and labs) from patient records [A]. We tested the RAG's performance against manually extracted variables from the tables.
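The retrieve-then-generate flow described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the retriever, so simple token overlap stands in for it, the table and field names are hypothetical, and the call to the generative model is omitted (only the prompt that would be passed to it is built).

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def serialize_row(table, row):
    """Flatten one tabular record into a text chunk (hypothetical format)."""
    fields = ", ".join(f"{key}={value}" for key, value in row.items())
    return f"{table}: {fields}"

def retrieve(query, chunks, k=2):
    """Rank chunks by token overlap with the query and keep the top k.
    Token overlap is a stand-in; the abstract does not name the retriever."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Assemble the text that would be sent to the generative model."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"

# Hypothetical rows from differently structured EHR tables.
rows = [
    ("medications", {"drug": "metoprolol", "class": "beta blocker"}),
    ("labs", {"test": "creatinine", "value": 1.2, "unit": "mg/dL"}),
    ("vitals", {"measure": "heart rate", "value": 72}),
]
chunks = [serialize_row(table, row) for table, row in rows]
query = "Is the patient on a beta blocker?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Because the rows are serialized to text before retrieval, the query does not need to know which table or column holds the answer, which is the property the abstract calls EHR-structure agnostic.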
Results: The RAG model performed robustly across key variables in both cohorts, with overall extraction accuracy of 81% for the Yale cohort and 82.9% for the MIMIC cohort. For categorical variables such as myocardial infarction, peripheral arterial disease, and medications (beta blockers, ACE inhibitors), Cohen's kappa values indicated strong agreement with ground truth (Yale: 0.8, 0.76, 1.0, and 0.82; MIMIC: 0.66, 0.83, 0.94, and 0.95). Continuous variables such as creatinine, heart rate, and systolic blood pressure showed high correlations (Yale: 0.99, 0.90, and 0.92; MIMIC: 1.0, 0.87, and 0.51) [B]. No statistically significant difference was found between ground truth and extracted values for any categorical variable (McNemar's p-value > 0.05).
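For readers unfamiliar with the agreement statistics reported above, the sketch below shows how Cohen's kappa and an exact two-sided McNemar test can be computed for one binary variable. The labels are made-up toy data, not study data, and the abstract does not state which variant of McNemar's test was used, so the exact binomial form here is an assumption.

```python
from math import comb

def cohens_kappa(truth, pred):
    """Chance-corrected agreement between two label lists of equal length."""
    n = len(truth)
    labels = set(truth) | set(pred)
    observed = sum(t == p for t, p in zip(truth, pred)) / n
    expected = sum((truth.count(label) / n) * (pred.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def mcnemar_exact_p(truth, pred):
    """Two-sided exact McNemar p-value from the discordant pairs of binary labels."""
    b = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    c = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy example: 1 = condition present, 0 = absent (hypothetical labels).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(truth, pred)
p_value = mcnemar_exact_p(truth, pred)
print(round(kappa, 3), p_value)  # prints 0.583 1.0
```

In this toy case the two lists agree on 8 of 10 records, giving a kappa of about 0.58, and the single disagreement in each direction yields a McNemar p-value of 1.0, meaning no evidence of systematic over- or under-extraction.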
Conclusion: LLM-optimized RAGs can accurately extract clinical information across multiple EHRs with varying data architectures. This introduces the potential for phenotype extraction at scale, with applications in federated multicenter research, spanning clinical trials and electronic clinical quality assessment.
Vasisht Shankar, Sumukh (Yale University, New Haven, Connecticut, United States)
Thangaraj, Phyllis (Yale University, New Haven, Connecticut, United States)
Adejumo, Philip (Yale University, New Haven, Connecticut, United States)
Khera, Rohan (Yale School of Medicine, New Haven, Connecticut, United States)
Author Disclosures:
Sumukh Vasisht Shankar: DO NOT have relevant financial relationships
Phyllis Thangaraj: DO NOT have relevant financial relationships
Philip Adejumo: DO NOT have relevant financial relationships
Rohan Khera: DO have relevant financial relationships; Research Funding (PI or named investigator): Bristol-Myers Squibb: Active (exists now); Ownership Interest: Ensight-AI, Inc: Active (exists now); Ownership Interest: Evidence2Health LLC: Active (exists now); Research Funding (PI or named investigator): BridgeBio: Active (exists now); Research Funding (PI or named investigator): Novo Nordisk: Active (exists now)