AHA Conference Repository

American Heart Association

Final ID: Su3130

Evaluating the Clinical Reasoning of GPT-4, Grok, and Gemini in Different Fields of Cardiology

Abstract Body:
Introduction: The use of large language models (LLMs) in clinical practice seems increasingly feasible, although comparative studies are needed before their use can be standardized. GPT-4, Grok and Gemini are three commonly used LLMs. While these LLMs may exhibit some level of clinical reasoning, their performance could differ from one model to another and across areas of cardiology.
Aim: To evaluate the clinical reasoning of GPT-4, Grok and Gemini in distinct fields of cardiology.
Methods: A total of 12 cases were selected from the AHA Circulation Journal, representing six cardiology areas: General Cardiology (GC), Interventional Cardiology (IC), Cardiac Imaging (CI), Electrophysiology (EP), Heart Failure (HF) and Congenital Heart Disease (CHD). Each area included two cases. We provided only the case presentation to the LLMs, so that they were unaware of the diagnosis and management. Four questions were asked with the following prompts: (1) What are the diagnostic steps? (2) What is the most likely diagnosis? (3) What is the differential diagnosis? (4) What is the appropriate management? The responses were systematically evaluated with a digital rubric through an international collaboration of 12 cardiologists from various institutions. Each cardiologist evaluated two cases in their area of expertise and was aware of the diagnosis and management. One-way ANOVA was used for analysis.
Results: GPT-4 had the best overall performance by a slight margin (Table 1) and the highest score in GC and CHD; Grok excelled in determining the management and was superior in CI, HF and IC; Gemini had the highest performance in EP. However, these findings were not statistically significant (Table 2).
Conclusions: This is one of the first studies comparing the clinical reasoning of LLMs in cardiology. Our results suggest that GPT-4 has the best overall performance and that Grok is superior in indicating the management. More studies comparing LLMs are required to further standardize their use.
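
The one-way ANOVA named in the Methods can be illustrated with a minimal Python sketch. It assumes one list of rubric scores per LLM and computes the F statistic by hand. The function name `one_way_anova_f`, the labels "LLM A/B/C" and the scores are hypothetical placeholders, not study data, and the abstract does not say which software or scoring scale the authors used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists.

    Each inner list holds the scores of one group (here, one LLM).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder scores for illustration only (NOT data from the study).
placeholder_scores = {
    "LLM A": [1, 2, 3],
    "LLM B": [2, 3, 4],
    "LLM C": [3, 4, 5],
}

f_stat, df_b, df_w = one_way_anova_f(list(placeholder_scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The F statistic is then compared against the F distribution with the reported degrees of freedom to obtain a p-value. A large p-value corresponds to the kind of non-significant difference between models described in the Results.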

Reyes-Rivera, Jonathan (Facultad de Medicina Autonoma de San Luis Potosi, San Luis Potosi, Mexico)
Chi, Gerald (Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States)
Angulo, Stephanie (Instituto Nacional de Cardiologia, Ciudad de Mexico, Mexico)
Moore, Michelle (Emory University, Atlanta, Georgia, United States)
Lopez-Quijano, Juan M. (Facultad de Medicina Autonoma de San Luis Potosi, San Luis Potosi, Mexico)
Samman, Abdallah (CENA RESEARCH INSTITUTE, Houston, Texas, United States)
Gordillo-Moscoso, Antonio A. (Facultad de Medicina Autonoma de San Luis Potosi, San Luis Potosi, Mexico)
Ali, Asif (UT Health Science Center Houston, Houston, Texas, United States)
Castro Molina, Alberto (Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States)
Romero-Lorenzo, Marco (CENA RESEARCH INSTITUTE, Houston, Texas, United States)
Ali, Sajid (CENA RESEARCH INSTITUTE, Houston, Texas, United States)
Gibson, Charles (Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States)
Saucedo, Jorge (Medical College of Wisconsin, Milwaukee, Wisconsin, United States)
Calandrelli, Matias (Hospital de la Santa Creu, Barcelona, Spain)
García Cruz, Edgar (Instituto Nacional de Cardiologia, Ciudad de Mexico, Mexico)
Bahit, Cecilia (Hospital Italiano de Buenos Aires, Buenos Aires, Argentina)

Author Disclosures:

Jonathan Reyes-Rivera:

DO NOT have relevant financial relationships

Gerald Chi:

DO have relevant financial relationships

Stephanie Angulo:

No Answer

Michelle Moore:

No Answer

Juan M. Lopez-Quijano:

DO NOT have relevant financial relationships

Abdallah Samman:

DO NOT have relevant financial relationships

Antonio A. Gordillo-Moscoso:

DO NOT have relevant financial relationships

Asif Ali:

DO have relevant financial relationships

Alberto Castro Molina:

DO NOT have relevant financial relationships

Marco Romero-Lorenzo:

DO NOT have relevant financial relationships

Sajid Ali:

No Answer

Charles Gibson:

No Answer

Jorge Saucedo:

DO NOT have relevant financial relationships

Matias Calandrelli:

DO NOT have relevant financial relationships

Edgar García Cruz:

No Answer

Cecilia Bahit:

No Answer

Meeting Info:

Scientific Sessions 2024

2024

Chicago, Illinois

Session Info:

LLMs Friend or Foe?

Sunday, 11/17/2024, 03:15 PM - 04:15 PM

Abstract Poster Session

More abstracts from these authors:

Enhanced External Counterpulsation as a Novel Treatment for Heart Transplant Candidates with Ischemic Heart Failure with Reduced Ejection Fraction

Ali Asif, Farooqui Sami, Khalid Emad

Shifting trends and study characteristics associated with randomization, blinding, and data monitoring committee oversight of cardiovascular trials: analysis of ClinicalTrials.gov listings from 2000 to 2023

Chi Gerald, Bahit Maria, Castro Molina Alberto, Korjian Serge, Vitarello Clara, Nara Paul, Shaunik Alka, Gibson Charles