
American Heart Association

Final ID: Su3113

Accuracy, Hallucinations, and Misgeneralizations of Large Language Models in Reviewing Cardiology Literature

Introduction: The text generation abilities of large language models (LLMs), such as ChatGPT, can streamline the research process and potentially expedite literature review.

Methods: We evaluated the capabilities of LLMs, and their major susceptibilities of hallucination and goal misgeneralization, in reviewing cardiology literature. We posed 30 comprehensive questions (n=600) regarding findings from notable cardiovascular studies and randomized controlled trials (n=20) published between 2016 and 2020 to GPT-4, GPT-3.5, and LLaMa 3. The question groups followed a fixed format: the first question checked for accuracy of identification, the second for misgeneralization, and the third for hallucination (Table 1). All questions were posed in separate interfaces for each model to preserve independence and prevent bias. Responses were then reviewed for accuracy, misgeneralizations, and hallucinations.
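A minimal sketch of this question protocol, assuming an OpenAI-style chat client; the question templates and placeholder trial names (TRIAL_A, TRIAL_B) are illustrative assumptions, not the authors' actual materials (those are in Table 1), and LLaMa 3 would be queried through its own interface rather than this client.

```python
# Sketch of the three-question protocol: for each study, one question probes
# identification accuracy, one probes misgeneralization, and one probes
# hallucination. Each question is sent as a single turn with no shared
# history, mirroring the separate interfaces used to keep responses independent.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical templates; the abstract's actual questions appear in its Table 1.
QUESTION_TEMPLATES = {
    "accuracy": "What was the primary finding of the {study} trial?",
    "misgeneralization": "Do the findings of {study} apply to {other_population}?",
    "hallucination": "What did {study} report about {unstudied_endpoint}?",
}

def ask_fresh(model: str, prompt: str) -> str:
    """Send one question in a brand-new, single-turn conversation."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Placeholder study identifiers; responses would then be reviewed manually
# for accuracy, misgeneralization, and hallucination.
for study in ["TRIAL_A", "TRIAL_B"]:
    for kind, template in QUESTION_TEMPLATES.items():
        prompt = template.format(
            study=study,
            other_population="a population outside the trial's enrollment criteria",
            unstudied_endpoint="an endpoint the trial did not actually assess",
        )
        answer = ask_fresh("gpt-4", prompt)
        print(f"{study} / {kind}: {answer[:80]}...")
```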

Results: GPT-4 and GPT-3.5 did not significantly differ in accuracy (55% vs. 40%, p=0.527), with a low association (φ=0.1) and an odds ratio (OR) of 1.83 (95% CI: 0.522-6.434). Similar results were observed between GPT-4 and LLaMa 3 (55% vs. 30%, p=0.201), with an OR of 2.85 (95% CI: 0.776-10.467). For generalization, GPT-4 and GPT-3.5 likewise did not significantly differ (30% vs. 35%, p=0.999), with a low association (φ=0.001) and an OR of 0.80 (95% CI: 0.211-2.998), and GPT-4 and LLaMa 3 showed no significant difference (30% vs. 25%, p=0.999), with an OR of 1.29 (95% CI: 0.319-5.174). For non-hallucinations, GPT-4 and GPT-3.5 did not significantly differ (30% vs. 60%, p=0.187), with a low association (φ=0.214) and an OR of 0.33 (95% CI: 0.088-1.256). Lastly, GPT-4 and LLaMa 3 showed no significant difference (30% vs. 25%, p=0.836), with an OR of 1.50 (95% CI: 0.36-6.13). Overall, the models did not differ significantly in accuracy, misgeneralization, or hallucination. Given the moderate susceptibilities of these models, LLMs require further training before realistic use in cardiology.
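As a check on how these statistics fit together, the sketch below reconstructs the GPT-4 vs. GPT-3.5 accuracy comparison from an assumed 2x2 table, taking 55% and 40% of n=20 to mean 11/20 and 8/20 correct responses (counts inferred from the reported percentages, not taken from the authors' data). Under that assumption, the Woolf-interval odds ratio reproduces the reported 1.83 (95% CI: 0.522-6.434) and Fisher's exact test reproduces p=0.527.

```python
# Reconstruct the GPT-4 vs. GPT-3.5 accuracy comparison from assumed counts:
# 11/20 correct (55%) for GPT-4 vs. 8/20 correct (40%) for GPT-3.5.
import math
from scipy.stats import fisher_exact

a, b = 11, 9   # GPT-4: correct, incorrect
c, d = 8, 12   # GPT-3.5: correct, incorrect

# Two-sided Fisher's exact test; yields p ~= 0.527, as reported.
_, p_value = fisher_exact([[a, b], [c, d]])

# Sample odds ratio with a Woolf (log-normal) 95% CI;
# yields OR = 1.83 (0.522-6.434), as reported.
or_est = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(or_est) - 1.96 * se_log_or)
ci_high = math.exp(math.log(or_est) + 1.96 * se_log_or)

# Uncorrected phi coefficient for the 2x2 table (~0.15 here; the abstract's
# reported 0.1 may reflect a continuity-corrected chi-square).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"p = {p_value:.3f}, OR = {or_est:.2f} "
      f"(95% CI {ci_low:.3f}-{ci_high:.3f}), phi = {phi:.2f}")
```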

Conclusions: LLMs show potential for identifying cardiology studies, but moderate rates of hallucination and misgeneralization currently render them unsuitable for this purpose. Until these susceptibilities are reduced to low levels (<5-10%), LLMs remain impractical for accurately reviewing the literature.
  • Fang, Spencer (Case Western Reserve University, Cleveland, Ohio, United States)
  • Pillai, Joshua (University of California, San Diego, School of Medicine, La Jolla, California, United States)
  • Mahin, Baharullah (Harvard University, Cambridge, Massachusetts, United States)
  • Kim, Sarah (University of California, San Diego, School of Medicine, La Jolla, California, United States)
  • Author Disclosures:
    Spencer Fang: DO NOT have relevant financial relationships | Joshua Pillai: DO NOT have relevant financial relationships | Baharullah Mahin: No Answer | Sarah Kim: DO NOT have relevant financial relationships
Meeting Info:

Scientific Sessions 2024


Chicago, Illinois

Session Info:

Promise and Peril: Artificial Intelligence and Cardiovascular Medicine

Sunday, 11/17/2024, 11:30AM - 12:30PM

Abstract Poster Session

More abstracts on this topic:
A Meta-Analysis Comparing Same-Day Discharge to Later-Day Discharge in Transcatheter Aortic Valve Replacement

Jain Hritvik, Passey Siddhant, Jain Jyoti, Goyal Aman, Wasir Amanpreet, Ahmed Mushood, Patel Nandan, Yadav Ashish, Shah Janhvi, Mehta Aryan

A Systematic Approach to Prompting Large Language Models for Automated Feature Extraction from Cardiovascular Imaging Reports

Goldfinger Shir, Mackay Emily, Chan Trevor, Eswar Vikram, Grasfield Rachel, Yan Vivian, Barreto David, Pouch Alison
