American Heart Association

Final ID: MP1256

Performance of Large Language Models in Analyzing Common Hypertension Scenarios in Clinical Practice

Background/Objective: Hypertension is the most prevalent chronic disease in primary care and a leading cause of cardiovascular morbidity and mortality. Despite existing guidelines, therapeutic inertia and suboptimal control persist. Large language models (LLMs) could be a valuable aid to clinical decision-making, yet their reliability for guideline-driven tasks remains unverified. This study evaluated the accuracy and safety of hypertension management recommendations generated by three LLMs compared with expert responses.
Methods: Fifty-one clinical vignettes representing 17 core hypertension management concepts were constructed by hypertension experts. Each case was submitted to three LLMs (GPT-4, Gemini, MedLM), and a hypertension expert wrote the “gold standard” answer for each case. Three blinded expert reviewers rated each response on a 4-point accuracy scale and a binary safety scale (safe/unsafe), and attempted to identify the source of the response (LLM vs. expert). Ratings were analyzed using mean scores, percentages of accurate and safe responses, and inter-rater agreement.
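The analysis described above (percentage of accurate responses and inter-rater agreement) can be illustrated with a minimal Python sketch. Several things here are assumptions, not details from the abstract: the ICC form (ICC(2,1), two-way random effects, absolute agreement, single rater), the cutoff of 3 or higher on the 4-point scale for counting a rating as "accurate", the function names, and the toy ratings, which are invented for illustration and are not study data.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated response, one column per rater.
    The abstract does not say which ICC form was used; this is an assumption.
    """
    n = len(ratings)       # number of rated responses
    k = len(ratings[0])    # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def percent_at_or_above(ratings, threshold):
    """Percentage of individual ratings at or above a cutoff (hypothetical rule)."""
    flat = [x for row in ratings for x in row]
    return 100 * sum(x >= threshold for x in flat) / len(flat)


# Toy 4-point accuracy ratings: 5 responses x 3 raters. NOT study data.
TOY_RATINGS = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 1, 2],
    [4, 3, 4],
    [1, 2, 1],
]

if __name__ == "__main__":
    print("ICC(2,1):", round(icc2_1(TOY_RATINGS), 3))
    print("Percent rated >= 3:", percent_at_or_above(TOY_RATINGS, 3))
```

In this sketch, each row is one rated response and each column is one reviewer, so the same functions could be applied separately to the ratings for each response source.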
Results: GPT-4 had the highest accuracy (83%) and safety (86%) scores among LLMs but remained inferior to expert responses (92% accuracy, 93% safety). Gemini and MedLM performed significantly worse (accuracy: 64% and 35%; safety: 73% and 39%, respectively). GPT-4 generated the most guideline-concordant responses (46%) among the three LLMs (Gemini 35%, MedLM 14%), but this remained lower than for expert responses (68%). Evaluators misidentified LLM responses as expert-written in 10% to 25% of cases, particularly with GPT-4. Inter-rater reliability for accuracy ratings was highest for expert-generated responses (ICC 0.81), with progressively lower agreement for GPT-4 (0.76), Gemini (0.70), and MedLM (0.68). A similar pattern was observed for safety and source discrimination ratings. Agreement was strongest for safety assessments and weakest for source discrimination.
Conclusion: Among the three LLMs tested, GPT-4 showed the closest agreement with expert decisions and therefore the greatest potential for supporting hypertension management. However, current LLM versions frequently produce inaccurate or unsafe recommendations and remain inferior to expert judgment. Human-in-the-loop supervision remains essential when deploying LLMs for clinical decision-making.
  • Zand, Jaleh  ( Mayo Clinic , Rochester , Minnesota , United States )
  • Miao, Jing  ( Mayo Clinic , Rochester , Minnesota , United States )
  • Hommos, Musab  ( Mayo Clinic , Scottsdale , Arizona , United States )
  • Schwartz, Gary  ( Mayo Clinic , Rochester , Minnesota , United States )
  • Taler, Sandra  ( Mayo Clinic , Minneapolis , Minnesota , United States )
  • Nejat, Peyman  ( Mayo Clinic , Rochester , Minnesota , United States )
  • Cheungpasitporn, Wisit  ( Mayo Clinic , Rochester , Minnesota , United States )
  • Zoghby, Ziad  ( Mayo Clinic , Rochester , Minnesota , United States )
Author Disclosures:
    Jaleh Zand: No Answer | Jing Miao: DO NOT have relevant financial relationships | Musab Hommos: DO have relevant financial relationships; Individual Stocks/Stock Options: Akebia: Active (exists now); Research Funding (PI or named investigator): AstraZeneca: Past (completed); Individual Stocks/Stock Options: SeaStart Medical: Active (exists now); Individual Stocks/Stock Options: Iovance: Active (exists now) | Gary Schwartz: DO NOT have relevant financial relationships | Sandra Taler: DO NOT have relevant financial relationships | Peyman Nejat: DO NOT have relevant financial relationships | Wisit Cheungpasitporn: No Answer | Ziad Zoghby: DO have relevant financial relationships; Consultant: BMJ Publishing Group Limited: Active (exists now); Advisor: Chronisense Medical Ltd.: Past (completed)
Meeting Info:

Scientific Sessions 2025

2025

New Orleans, Louisiana

Session Info:

Change is in the Air! New Discoveries in Hypertension Treatment

Sunday, 11/09/2025 , 09:15AM - 10:25AM

Moderated Digital Poster Session
