American Heart Association

Final ID: MP1414

Feasibility and Utility of Large Language Models Based on DeepSeek-R1 and ChatGPT-4o for the Interpretation of Cardiac Magnetic Resonance Reports: A Real-World Pilot Study

Background: Large language models (LLMs) are a promising tool for translating cardiac magnetic resonance (CMR) reports into language that is more accessible to patients, owing to their advanced natural-language understanding capabilities. This study aimed to assess the feasibility and utility of two LLMs (DeepSeek-R1 and ChatGPT-4o) in interpreting CMR reports and to compare their performance across specific dimensions.
Methods: In this prospective pilot study, 110 patients undergoing CMR at Fuwai Hospital (Beijing, China) between March and April 2025 were consecutively enrolled. Each structured original CMR report was randomly assigned to one of the two pre-trained LLMs to generate an interpreted report (LLM-report). Structured Likert-scale questionnaires were developed to assess the comprehensibility and quality of the LLM-reports: patients rated LLM-report comprehensibility across four dimensions, while CMR radiologists rated LLM-report quality across five dimensions. The reliability and validity of the questionnaire were tested using Cronbach's α, the Kaiser-Meyer-Olkin (KMO) measure, Bartlett's test of sphericity, and exploratory factor analysis. Bonferroni correction was applied to adjust for multiple comparisons.
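As a rough illustration of the psychometric checks named above (not the authors' actual analysis code), the sketch below computes Cronbach's α, the KMO measure, Bartlett's test of sphericity, and an exploratory factor analysis on a hypothetical matrix of Likert responses, using the pingouin and factor_analyzer Python packages; the item names, response distribution, and factor count are illustrative assumptions.

```python
# Sketch of the questionnaire reliability/validity checks described above.
# Assumptions: a DataFrame of Likert responses (one row per respondent,
# one column per item); item names and data are hypothetical.
import numpy as np
import pandas as pd
import pingouin as pg
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

rng = np.random.default_rng(0)
# Hypothetical responses: 100 patients x 4 Likert items (1-5 scale).
items = pd.DataFrame(
    rng.integers(1, 6, size=(100, 4)),
    columns=["clarity", "terminology", "structure", "usefulness"],  # illustrative
)

# Internal consistency (the abstract reports Cronbach's alpha = 0.849).
alpha, ci = pg.cronbach_alpha(data=items)

# Sampling adequacy and sphericity (the abstract reports KMO = 0.803).
chi2, bartlett_p = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)

# Exploratory factor analysis; the single-factor solution is an assumption.
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(items)
_, _, cumulative_var = fa.get_factor_variance()  # cumulative variance explained

print(f"Cronbach's alpha = {alpha:.3f}")
print(f"KMO = {kmo_total:.3f}, Bartlett p = {bartlett_p:.3g}")
print(f"Cumulative variance explained = {cumulative_var[-1]:.1%}")
```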
Results: Ultimately, 100 LLM-reports were analyzed (mean age 48.82 ± 12.57 years; 72% male; hypertrophic cardiomyopathy 34%, dilated cardiomyopathy 33%, coronary artery disease 33%), with 50 interpreted by DeepSeek-R1 and 50 by ChatGPT-4o. The questionnaire demonstrated excellent internal reliability and construct validity, with a Cronbach's α of 0.849, a KMO value of 0.803, and a cumulative variance explained of 68.3%. Compared with the original reports, LLM-reports significantly improved scores across all four patient-rated dimensions (all p < 0.013, the Bonferroni-adjusted threshold of 0.05/4). However, no significant differences were observed between reports interpreted by DeepSeek-R1 and ChatGPT-4o on any of the four dimensions (all p > 0.013). In the radiologist quality assessment, no significant differences were seen between the two models across all five dimensions (all p > 0.01, threshold 0.05/5).
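The between-model comparison could look like the following sketch, with Bonferroni-adjusted thresholds (0.05/4 ≈ 0.013 for the patient ratings, 0.05/5 = 0.01 for the radiologist ratings) applied to per-dimension tests. The abstract does not name the test statistic; the Mann-Whitney U test used here is an assumption suited to Likert scores from two independent groups, and the dimension names and scores are placeholders.

```python
# Sketch of a Bonferroni-corrected between-model comparison.
# Assumption: Mann-Whitney U per dimension (the abstract does not state the test).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
dimensions = ["d1", "d2", "d3", "d4"]  # placeholder dimension names
alpha_adj = 0.05 / len(dimensions)     # Bonferroni: 0.05 / 4 = 0.0125 (~0.013)

for dim in dimensions:
    deepseek = rng.integers(3, 6, size=50)  # hypothetical Likert scores, n=50
    chatgpt = rng.integers(3, 6, size=50)   # hypothetical Likert scores, n=50
    stat, p = mannwhitneyu(deepseek, chatgpt, alternative="two-sided")
    verdict = "significant" if p < alpha_adj else "not significant"
    print(f"{dim}: U={stat:.0f}, p={p:.3f} -> {verdict} at alpha={alpha_adj:.4f}")
```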
Conclusion: This pilot study represents the first direct comparison of two LLMs in interpreting structured CMR reports that integrates both patient feedback and radiologist evaluation. The findings suggest that LLMs are feasible and useful for translating CMR reports into patient-accessible language, with comparable performance and quality between DeepSeek-R1 and ChatGPT-4o.
  • Sa, Fen (Department of Magnetic Resonance Imaging, Beijing, China)
  • Author Disclosures:
    Fen Sa: No relevant financial relationships
Meeting Info:

Scientific Sessions 2025
New Orleans, Louisiana