Performance of a Popular Large Language Model in Answering Cardiovascular Related Queries: A Systematic Review and Pooled-Analysis AHA Conference Repository

American Heart Association

125

Final ID: Su3127


Background: The integration of large language models (LLMs) such as ChatGPT into healthcare can have significant implications for patient education and clinical decision-making.

Aims: This systematic review and pooled analysis evaluates the accuracy of ChatGPT-3.5 and ChatGPT-4 in answering simple queries across cardiovascular (CV) medicine disciplines.

Methods: Literature searches were conducted in PubMed, Embase, and Cochrane Central in May 2024. Keywords included “ChatGPT”, “LLMs”, and “Chat-based artificial intelligence models”. Cross-sectional, peer-reviewed studies published in 2023 and 2024 that investigated ChatGPT’s performance on CV medicine-related queries (Table/Figure) were included, and their data were extracted. Within each study, the answers to all queries were evaluated by expert physicians in the corresponding field; we did not readjudicate them. For the pooled analysis, each answer was assigned a standardized binary grade of “accurate” or “inaccurate”.
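The pooling step described above can be sketched as follows. This is a minimal illustration, assuming each study contributes a count of answers graded accurate and a total number of queries, and that pooling is a simple sum across studies. The abstract does not state the exact pooling model, and the `StudyResult` type, the `pooled_accuracy` function, and the example counts are hypothetical placeholders, not data from the included studies.

```python
from dataclasses import dataclass


@dataclass
class StudyResult:
    """Binary-graded results from one study (hypothetical structure)."""
    label: str
    accurate: int  # answers graded "accurate"
    total: int     # all graded answers in the study


def pooled_accuracy(studies):
    """Return the pooled proportion of accurate answers across studies."""
    accurate = sum(s.accurate for s in studies)
    total = sum(s.total for s in studies)
    if total == 0:
        raise ValueError("no graded answers to pool")
    return accurate / total


# Placeholder counts for illustration only; not taken from the review.
studies = [
    StudyResult("study_a", accurate=18, total=20),
    StudyResult("study_b", accurate=27, total=35),
]
print(f"Pooled accuracy: {pooled_accuracy(studies):.3f}")
```

Summing counts weights each study by its number of queries, so larger studies influence the pooled estimate more than smaller ones.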

Results: Of 127 identified and screened peer-reviewed studies, 14 studies comprising 542 CV-related queries were included. Pooled analysis showed an overall accuracy of 84.5% (458/542; 95% CI, 81.5% to 87.6%). Stratification by model (ChatGPT-4 vs. ChatGPT-3.5) did not show a significant difference in accuracy (p=0.32). Likewise, accuracy did not differ significantly between answers from 2023 and 2024 (p=0.07). Accuracy was statistically comparable across topics, except for cardio-oncology, which showed significantly lower accuracy at 68% (p=0.02). Detailed performance per topic is shown in the table and figure.
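The overall accuracy and its confidence interval can be checked from the reported counts. The sketch below assumes a normal-approximation (Wald) interval for a single proportion. The abstract does not state which interval method was used, so this is an assumption, and `wald_interval` is a hypothetical helper name.

```python
import math


def wald_interval(successes, total, z=1.959964):
    """Proportion with a normal-approximation (Wald) confidence interval.

    z defaults to the two-sided 95% critical value.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    # Clamp to [0, 1] because the Wald interval can overshoot near the bounds.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts reported in the Results: 458 accurate answers out of 542 queries.
p, lower, upper = wald_interval(458, 542)
print(f"Accuracy {p:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

Under this assumption the interval comes out at roughly 81.5% to 87.5%, close to the reported interval. The small difference in the upper bound may reflect rounding or a different interval method.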

Conclusion: ChatGPT demonstrated consistently high accuracy in answering CV-related queries, with no significant differences across model versions or years. These results support the potential use of online chat-based LLMs as an informational tool in cardiology.

Kassab, Joseph (Cleveland Clinic Foundation, Cleveland, Ohio, United States)
El Hajjar, Abdel Hadi (Cleveland Clinic, Cleveland, Ohio, United States)
Haroun, Elio (Cleveland Clinic, Cleveland, Ohio, United States)
Kanj, Mohamed (Cleveland Clinic, Cleveland, Ohio, United States)
Sarraju, Ashish (Cleveland Clinic, Cleveland, Ohio, United States)
Xlaffinx, Xlukex (Cleveland Clinic Foundation, Cleveland, Ohio, United States)
Kapadia, Samir (Cleveland Clinic, Orae, Ohio, United States)
Harb, Serge (Cleveland Clinic, Cleveland, Ohio, United States)

Author Disclosures:

Joseph Kassab: