Echocardiogram Video Vision Foundation Model (Echo-Vision-FM): A Pre-training and Fine-tuning Framework for Automated Echocardiogram Analysis | AHA Conference Repository

American Heart Association

Final ID: 48

Echocardiogram Video Vision Foundation Model (Echo-Vision-FM): A Pre-training and Fine-tuning Framework for Automated Echocardiogram Analysis

Abstract Body: Introduction: Echocardiography is essential for cardiac assessment, yet interpretation requires substantial expertise and suffers from inter-observer variability. Current AI approaches often rely on limited labeled data or process individual frames rather than full videos.

Objective: To develop Echo-Vision-FM (Echocardiogram Video Vision Foundation Model), a self-supervised video foundation model designed to enhance automated echocardiographic analysis across diverse clinical tasks.

Methods: We pre-trained a video encoder with masked auto-encoding on 525,328 unlabeled echocardiogram videos from the MIMIC-IV-ECHO dataset. We developed a novel Spatial-Temporal Fusion Network (STF-Net) to integrate spatial and temporal correlations from the learned representations. The model was evaluated on three downstream tasks: (1) estimation of cardiac morphological measurements (left ventricular ejection fraction, LVEF; left ventricular end-systolic volume, LVESV; left ventricular end-diastolic volume, LVEDV), (2) heart function diagnosis (LVEF classification), and (3) aortic stenosis severity assessment. Performance was validated on external datasets including EchoNet-Dynamic (n=10,030), CAMUS (n=500), and TMED-2 (n=599).
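To illustrate the masking step behind masked auto-encoding on video, here is a minimal sketch of one common strategy, tube masking, in which the same spatial patches are hidden in every frame so the encoder cannot copy them from neighboring frames. The abstract does not state which masking scheme, patch grid, or mask ratio was used; the function name `tube_mask`, the 14x14 grid, the 8 frames, and the 0.9 ratio are all assumptions chosen for illustration.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Build a per-frame boolean mask over patch positions (True = hidden).

    Hypothetical helper: the mask ratio and grid size are illustrative
    assumptions, not values reported in the abstract.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Pick the spatial positions to hide once ...
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked_positions for p in range(num_patches)]
    # ... and repeat the same pattern for every frame (a "tube" through time).
    return [list(frame_mask) for _ in range(num_frames)]


mask = tube_mask(num_frames=8, grid_h=14, grid_w=14)
visible_per_frame = sum(not hidden for hidden in mask[0])
```

In a setup like this, only the visible patches would be passed to the encoder, and a lightweight decoder would be trained to reconstruct the hidden ones.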

Results: Echo-Vision-FM achieved superior performance across all tasks. For LVEF estimation, the model attained a mean absolute error (MAE) of 3.87% and R-squared of 0.825 on EchoNet-Dynamic, and an MAE of 4.42% with R-squared of 0.658 on CAMUS, outperforming state-of-the-art segmentation-based and supervised methods. For heart function diagnosis using an LVEF threshold of 50%, the model achieved an accuracy of 0.905, an AUC of 0.931, and an F1 score of 0.941 on EchoNet-Dynamic. For aortic stenosis classification, the AUC reached 0.849. Adding STF-Net consistently improved performance across all tasks compared with the encoder alone (all p<0.05). Linear probing experiments demonstrated strong transferability of the learned representations even with limited labeled data.
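As a rough guide to how the reported regression and classification metrics are defined, the sketch below computes LVEF from end-diastolic and end-systolic volumes, then MAE, R-squared, accuracy, and F1 on a small set of made-up values. The numbers are toy data, not study results. Treating LVEF below 50% as the positive class is an assumption, since the abstract does not say which class was counted as positive.

```python
def lvef_percent(edv_ml, esv_ml):
    """Standard ejection fraction formula: (EDV - ESV) / EDV, in percent."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_and_f1(y_true, y_pred, threshold=50.0):
    """Binarize LVEF at `threshold`; positive = LVEF below it (an assumption)."""
    true_pos = [t < threshold for t in y_true]
    pred_pos = [p < threshold for p in y_pred]
    tp = sum(t and p for t, p in zip(true_pos, pred_pos))
    fp = sum((not t) and p for t, p in zip(true_pos, pred_pos))
    fn = sum(t and (not p) for t, p in zip(true_pos, pred_pos))
    correct = sum(t == p for t, p in zip(true_pos, pred_pos))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(y_true), f1


# Toy LVEF values in percent (illustrative only).
reference = [60.0, 55.0, 35.0, 45.0, 65.0]
predicted = [58.0, 52.0, 40.0, 51.0, 63.0]

toy_mae = mae(reference, predicted)
toy_r2 = r_squared(reference, predicted)
toy_acc, toy_f1 = accuracy_and_f1(reference, predicted)
```

AUC is left out here because it depends on continuous classifier scores rather than thresholded labels.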

Conclusions: Echo-Vision-FM represents the first echocardiogram video foundation model trained exclusively on publicly available data, demonstrating robust performance across multiple institutions and clinical scenarios. This self-supervised approach addresses the annotation bottleneck in medical AI while providing a scalable, end-to-end solution for comprehensive cardiac assessment with significant potential for clinical implementation.

Ye, Jiancheng (Weill Cornell Medicine, New York, New York, United States)
Zhang, Ziyang (Northwestern University, Chicago, Illinois, United States)
Wu, Qinxin (Zhejiang University, Zhejiang, China)
Ding, Sirui (University of California San Francisco, San Francisco, California, United States)

Author Disclosures: