The enhanced potential of electrocardiogram interpretation empowered by a vision–language foundation model
Background: Electrocardiography (ECG) is a cornerstone of cardiovascular disease (CVD) diagnosis, but it is limited by low spatial resolution and susceptibility to artifacts. Recent vision-language foundation models, such as CLIP, offer a way to enhance ECG interpretation by aligning multimodal data. This study introduces ECGCLIP, a model that supports diagnosis of a broad spectrum of CVD by integrating ECG waveforms with clinical annotations.

Methods: ECGCLIP was trained on 5 million ECG-image/report pairs from multicenter datasets, annotated by experienced physicians. Using a self-supervised contrastive learning framework, the model aligned ECG signals with their textual interpretations. Performance was evaluated on 45 ECG tasks (e.g., arrhythmias, conduction disorders) and 29 echocardiography tasks (e.g., valvular diseases, heart failure) across internal and external validation cohorts, by comparing the area under the precision-recall curve (PRAUC) under varying data regimes.

Results: ECGCLIP improved PRAUC by up to 0.5873 (AAI pacing: 0.0511 → 0.6384) and by 0.5253 for rare conditions (e.g., Wolff-Parkinson-White syndrome at 1% data). For critical conditions such as ST-elevation myocardial infarction (STEMI), PRAUC rose by 0.2170 (0.1156 → 0.3326), an area where conventional ECG-based ischemia detection is limited. The model also improved detection of valvular diseases (e.g., mitral stenosis: Δ+0.211) and heart failure (LVEF < 40%: Δ+0.139), with 79% generalizability retention for tricuspid regurgitation in external validation. With only 1% of the training data, ECGCLIP performed comparably to full-data baselines (e.g., sinus rhythm PRAUC: 0.9747 vs. 0.9877). Smaller gains were also seen in low-incidence conditions (e.g., hyperkalemia: Δ+0.0169), suggesting benefit in sparse-data scenarios. External validation showed an overall PRAUC improvement of 0.1565, with consistent gains in ventricular pre-excitation (Δ+0.5253) but smaller gains for structural anomalies (e.g., mitral stenosis: external Δ+0.100 vs. internal Δ+0.211), likely reflecting that ECG detects structural disease only through its functional sequelae.

Conclusion: ECGCLIP establishes a new paradigm for ECG interpretation by leveraging vision-language foundation models. It demonstrates high data efficiency, enabling accurate diagnosis of diverse cardiac conditions, including rare and critical diseases, with minimal labeled data. The model's robustness across diseases supports its potential for deployment in resource-limited settings, improving the accessibility and precision of CVD care.
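
The contrastive alignment described in Methods can be illustrated with a minimal sketch. The abstract does not specify ECGCLIP's loss function or encoders, so everything here is an assumption: the sketch uses the symmetric InfoNCE objective popularized by CLIP, a hypothetical function name `clip_style_loss`, a temperature of 0.07, and random vectors standing in for ECG-image and report embeddings.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(ecg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row i of ecg_emb and row i of text_emb come from the same
    ECG/report pair; every other row in the batch acts as a negative.
    """
    ecg = [_normalize(v) for v in ecg_emb]
    txt = [_normalize(v) for v in text_emb]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
        for e in ecg
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the ECG-to-text and text-to-ECG directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy data: "matched" embeddings are noisy copies of the ECG embeddings;
# "shuffled" breaks the pairing by rotating the batch by one position.
random.seed(0)
ecg = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]
matched = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in ecg]
shuffled = matched[1:] + matched[:1]

aligned_loss = clip_style_loss(ecg, matched)
misaligned_loss = clip_style_loss(ecg, shuffled)
print(f"aligned: {aligned_loss:.4f}  misaligned: {misaligned_loss:.4f}")
```

In this toy setup the loss is low when each ECG embedding sits next to its own report embedding and high when the pairing is broken, which is the signal a contrastive framework of this kind would use to pull matching ECG/report pairs together during training.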
Yu, Ziqing (Zhongshan Hospital Fudan University, Shanghai, China)
Liang, Yixiu (Zhongshan Hospital Fudan University, Shanghai, China)
Su, Yangang (Zhongshan Hospital Fudan University, Shanghai, China)