American Heart Association



Final ID: 48

Echocardiogram Video Vision Foundation Model (Echo-Vision-FM): A Pre-training and Fine-tuning Framework for Automated Echocardiogram Analysis

Abstract Body: Introduction: Echocardiography is essential for cardiac assessment, yet interpretation requires substantial expertise and suffers from inter-observer variability. Current AI approaches often rely on limited labeled data or process individual frames rather than full videos.

Objective: To develop Echo-Vision-FM (Echocardiogram Video Vision Foundation Model), a self-supervised video foundation model designed to enhance automated echocardiographic analysis across diverse clinical tasks.

Methods: We pre-trained a video encoder using masked auto-encoding on 525,328 unlabeled echocardiogram videos from the MIMIC-IV-ECHO dataset. A novel Spatial-Temporal Fusion Network (STF-Net) was developed to integrate spatial and temporal correlations from learned representations. The model was evaluated on three downstream tasks: (1) cardiac morphological value estimation (LVEF, LVESV, LVEDV), (2) heart function diagnosis (LVEF classification), and (3) aortic stenosis severity assessment. Performance was validated on external datasets including EchoNet-Dynamic (n=10,030), CAMUS (n=500), and TMED-2 (n=599).
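The masked auto-encoding step in the Methods can be illustrated with a minimal sketch. The abstract does not state the masking scheme, the mask ratio, or the patch grid, so everything below is an assumption made for illustration: the "tube" masking pattern (the same spatial patches hidden in every frame, as commonly used in video masked auto-encoders), the 0.9 ratio, the 14 x 14 grid, and the hypothetical helper names `tube_mask` and `visible_tokens`. It is not the authors' implementation.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = hidden).

    Assumption: tube masking, i.e. one random spatial pattern is drawn
    and repeated for every temporal position, so a hidden patch cannot be
    recovered by copying it from a neighbouring frame.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [i in masked for i in range(num_patches)]
    # Copy the same spatial pattern for each frame.
    return [list(frame_mask) for _ in range(num_frames)]


def visible_tokens(mask):
    """List (frame_index, patch_index) pairs the encoder would see."""
    return [
        (t, p)
        for t, row in enumerate(mask)
        for p, hidden in enumerate(row)
        if not hidden
    ]


# Illustrative sizes only: 8 temporal positions, 14 x 14 patch grid.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible = visible_tokens(mask)
print(len(visible), "of", 8 * 14 * 14, "tokens are visible to the encoder")
```

In a setup like this, the encoder processes only the visible tokens and a lightweight decoder is trained to reconstruct the hidden patches, which is what allows pre-training on unlabeled videos.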

Results: Echo-Vision-FM achieved superior performance across all tasks. For LVEF estimation, the model attained an MAE of 3.87% and an R-squared of 0.825 on EchoNet-Dynamic, and an MAE of 4.42% with an R-squared of 0.658 on CAMUS, outperforming state-of-the-art segmentation-based and supervised methods. For heart function diagnosis using an LVEF threshold of 50%, the model achieved an accuracy of 0.905, an AUC of 0.931, and an F1 score of 0.941 on EchoNet-Dynamic. For aortic stenosis classification, the AUC reached 0.849. Adding STF-Net consistently improved performance across all tasks compared with the encoder alone (all p<0.05). Linear probing experiments demonstrated strong transferability of the learned representations even with limited labeled data.
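For readers less familiar with the reported metrics, the sketch below shows how MAE, R-squared, accuracy, and F1 are conventionally computed for LVEF. The numbers are made-up toy values, not study data. Two points are assumptions not stated in the abstract: that LVEF below 50% is treated as the positive class, and that predicted labels are obtained by thresholding the predicted LVEF (the study may instead use a separate classification head). AUC is omitted because it needs continuous classifier scores.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as LVEF (percentage points)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def binarize(lvef_values, threshold=50.0):
    """Assumption: LVEF below the threshold is the positive class."""
    return [v < threshold for v in lvef_values]


def accuracy_and_f1(labels, preds):
    """Accuracy and F1 from boolean reference and predicted labels."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    tn = sum(1 for l, p in zip(labels, preds) if not l and not p)
    accuracy = (tp + tn) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1


# Toy values for illustration only (not from any dataset).
lvef_true = [60.0, 55.0, 35.0, 45.0, 65.0, 30.0]
lvef_pred = [58.0, 52.0, 40.0, 51.0, 63.0, 33.0]

acc, f1 = accuracy_and_f1(binarize(lvef_true), binarize(lvef_pred))
print("MAE:", mae(lvef_true, lvef_pred))
print("R-squared:", r_squared(lvef_true, lvef_pred))
print("Accuracy:", acc, "F1:", f1)
```

Note that MAE for LVEF is expressed in percentage points of ejection fraction, so the reported 3.87% is an absolute difference rather than a relative error.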

Conclusions: Echo-Vision-FM represents the first echocardiogram video foundation model trained exclusively on publicly available data, demonstrating robust performance across multiple institutions and clinical scenarios. This self-supervised approach addresses the annotation bottleneck in medical AI while providing a scalable, end-to-end solution for comprehensive cardiac assessment with significant potential for clinical implementation.
  • Ye, Jiancheng (Weill Cornell Medicine, New York, New York, United States)
  • Zhang, Ziyang (Northwestern University, Chicago, Illinois, United States)
  • Wu, Qinxin (Zhejiang University, Zhejiang, China)
  • Ding, Sirui (University of California San Francisco, San Francisco, California, United States)
Meeting Info:

EPI-Lifestyle Scientific Sessions 2026

2026

Boston, Massachusetts

Session Info:

Innovations and Methods in Big Data

Friday, 03/20/2026, 09:00AM - 10:00AM

Oral Abstract Session

