
American Heart Association



Final ID: 48

Echocardiogram Video Vision Foundation Model (Echo-Vision-FM): A Pre-training and Fine-tuning Framework for Automated Echocardiogram Analysis

Abstract Body: Introduction: Echocardiography is essential for cardiac assessment, yet interpretation requires substantial expertise and suffers from inter-observer variability. Current AI approaches often rely on limited labeled data or process individual frames rather than full videos.

Objective: To develop Echo-Vision-FM (Echocardiogram Video Vision Foundation Model), a self-supervised video foundation model designed to enhance automated echocardiographic analysis across diverse clinical tasks.

Methods: We pre-trained a video encoder using masked auto-encoding on 525,328 unlabeled echocardiogram videos from the MIMIC-IV-ECHO dataset. A novel Spatial-Temporal Fusion Network (STF-Net) was developed to integrate spatial and temporal correlations from learned representations. The model was evaluated on three downstream tasks: (1) cardiac morphological value estimation (LVEF, LVESV, LVEDV), (2) heart function diagnosis (LVEF classification), and (3) aortic stenosis severity assessment. Performance was validated on external datasets including EchoNet-Dynamic (n=10,030), CAMUS (n=500), and TMED-2 (n=599).
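The masked auto-encoding objective described in the Methods can be illustrated with a minimal NumPy sketch. The abstract does not state the masking strategy, masking ratio, or patch layout, so the "tube" masking scheme (the same spatial patches hidden in every frame), the 0.9 ratio in the usage note, and the function names `tube_mask` and `masked_mse` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Return a boolean mask of shape (num_frames, grid_h, grid_w).

    True marks a hidden patch. The same spatial positions are hidden in
    every frame (an assumed "tube" scheme; the abstract does not specify one).
    """
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    hidden = np.zeros(num_patches, dtype=bool)
    # Pick the hidden spatial positions without replacement.
    hidden[rng.choice(num_patches, size=num_masked, replace=False)] = True
    spatial = hidden.reshape(grid_h, grid_w)
    # Repeat the spatial pattern along the time axis.
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()


def masked_mse(target, reconstruction, mask):
    """Mean squared error computed over hidden patches only.

    target, reconstruction: arrays of shape (frames, grid_h, grid_w, patch_dim).
    mask: boolean array of shape (frames, grid_h, grid_w).
    """
    squared_error = (target - reconstruction) ** 2
    return float(squared_error[mask].mean())
```

For example, `tube_mask(8, 14, 14, 0.9, np.random.default_rng(0))` hides the same 176 of 196 patch positions in each of 8 frames; during pre-training an encoder would see only the visible patches and a decoder would be scored with `masked_mse` on the hidden ones.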

Results: Echo-Vision-FM achieved superior performance across all tasks. For LVEF estimation, the model attained an MAE of 3.87% and an R-squared of 0.825 on EchoNet-Dynamic, and an MAE of 4.42% and an R-squared of 0.658 on CAMUS, outperforming state-of-the-art segmentation-based and supervised methods. For heart function diagnosis using an LVEF threshold of 50%, the model achieved an accuracy of 0.905, an AUC of 0.931, and an F1 score of 0.941 on EchoNet-Dynamic. For aortic stenosis classification, the AUC reached 0.849. Adding STF-Net consistently improved performance across all tasks compared with the encoder alone (all p<0.05). Linear probing experiments demonstrated strong transferability of the learned representations even with limited labeled data.
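The regression and classification metrics reported above can be computed from paired reference and predicted LVEF values along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the positive class is assumed to be LVEF at or above the 50% threshold (the abstract does not say which class is treated as positive), and AUC is omitted because it requires continuous classifier scores.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, R-squared) for reference vs. predicted LVEF, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(mae), float(1.0 - ss_res / ss_tot)


def threshold_metrics(y_true, y_pred, threshold=50.0):
    """Return (accuracy, F1) after binarizing LVEF at `threshold`.

    Assumption: LVEF >= threshold is the positive class.
    """
    true_pos_class = np.asarray(y_true, dtype=float) >= threshold
    pred_pos_class = np.asarray(y_pred, dtype=float) >= threshold
    tp = int(np.sum(true_pos_class & pred_pos_class))
    fp = int(np.sum(~true_pos_class & pred_pos_class))
    fn = int(np.sum(true_pos_class & ~pred_pos_class))
    accuracy = float(np.mean(true_pos_class == pred_pos_class))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return accuracy, f1
```

Calling `regression_metrics(reference_lvef, predicted_lvef)` and `threshold_metrics(reference_lvef, predicted_lvef)` on a held-out set would yield numbers comparable in kind to the MAE, R-squared, accuracy, and F1 values quoted in the Results.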

Conclusions: Echo-Vision-FM represents the first echocardiogram video foundation model trained exclusively on publicly available data, demonstrating robust performance across multiple institutions and clinical scenarios. This self-supervised approach addresses the annotation bottleneck in medical AI while providing a scalable, end-to-end solution for comprehensive cardiac assessment with significant potential for clinical implementation.
  • Ye, Jiancheng (Weill Cornell Medicine, New York, New York, United States)
  • Zhang, Ziyang (Northwestern University, Chicago, Illinois, United States)
  • Wu, Qinxin (Zhejiang University, Zhejiang, China)
  • Ding, Sirui (University of California San Francisco, San Francisco, California, United States)
Meeting Info:

EPI-Lifestyle Scientific Sessions 2026

2026

Boston, Massachusetts

Session Info:

Innovations and Methods in Big Data

Friday, 03/20/2026, 09:00AM - 10:00AM

Oral Abstract Session

