Performance Benchmarking of Smaller Language Models Against GPT-4 for Predicting Reasons for Oral Anticoagulation Nonprescription in Atrial Fibrillation

AHA Conference Repository

American Heart Association

Final ID: MP465

Abstract Body:
Background:
Oral anticoagulation (OAC) reduces stroke risk in atrial fibrillation (AF), yet nonprescription rates approach 50% and the reasons remain poorly characterized. Proprietary large language models (LLMs) such as GPT-4 can identify documented reasons for OAC nonprescription from clinical notes, but their cost and privacy constraints are barriers to widespread deployment. We investigated whether smaller, open-source LLMs (Gemma-2-9B-IT, Phi-4K) can achieve comparable performance.
Hypothesis:
Open-source LLMs can match the performance of GPT-4 when used with prompt-augmentation techniques such as chain-of-thought (CoT) prompting.
Methods:
We identified all patient encounters with clinician-billed ICD-10 AF diagnosis codes at Stanford Health Care from January 1, 2015 through December 31, 2023. Three reviewers annotated 10% of AF-related note excerpts to identify reasons for OAC nonprescription. We developed zero-shot prompts for GPT-4, Gemma-2-9B-IT, and Phi-4K, plus CoT prompts for the two open-source models (Graphic 1). Performance was assessed using weighted macro-F1 scores.
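As an illustration of the evaluation metric, the sketch below computes per-category F1 and then averages it across categories, both unweighted (macro) and weighted by category support. It is a minimal example under stated assumptions, not the study's evaluation code: the abstract does not define "weighted macro-F1" precisely, so both averages are shown, and the helper names, category subset, and toy annotations are hypothetical.

```python
from typing import List, Set

# Hypothetical subset of categories, named after reasons listed in the abstract.
LABELS = ["antiplatelet use", "perceived contraindication", "low AF burden"]

# Toy data (invented for illustration): each excerpt may carry zero or more reasons.
GOLD: List[Set[str]] = [
    {"antiplatelet use"},
    {"perceived contraindication"},
    {"low AF burden"},
    {"antiplatelet use", "low AF burden"},
    set(),
]
PRED: List[Set[str]] = [
    {"antiplatelet use"},
    {"perceived contraindication"},
    {"antiplatelet use"},
    {"antiplatelet use", "low AF burden"},
    set(),
]


def f1_for_label(gold: List[Set[str]], pred: List[Set[str]], label: str) -> float:
    """Binary F1 for one category, treating each excerpt as present/absent."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-category F1 (every category counts equally)."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


def weighted_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Mean of per-category F1, weighted by how often each category occurs in gold."""
    support = {lab: sum(1 for g in gold if lab in g) for lab in labels}
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(f1_for_label(gold, pred, lab) * support[lab] for lab in labels) / total


if __name__ == "__main__":
    for lab in LABELS:
        print(f"{lab}: F1 = {f1_for_label(GOLD, PRED, lab):.2f}")
    print(f"macro-F1    = {macro_f1(GOLD, PRED, LABELS):.2f}")
    print(f"weighted F1 = {weighted_f1(GOLD, PRED, LABELS):.2f}")
```

The two averages diverge when categories are imbalanced: the unweighted mean gives rare categories the same influence as common ones, while the support-weighted mean is dominated by frequent categories.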
Results:
Of 35,737 AF encounters, 7,712 (21.6%) lacked active OAC prescriptions. From 9,143 associated notes, we extracted 21,573 AF/OAC-related excerpts, of which 10% (911 notes, 2,175 excerpts) were manually annotated. Reasons for nonprescription appeared in 497 (54.6%) notes, most commonly antiplatelet use (18.6%), perceived contraindication (14.7%), and low AF burden (13.9%). Gemma-2-9B-IT with CoT achieved the highest average macro-F1 score (0.81), versus GPT-4 (0.80), Gemma-2-9B-IT (0.76), Phi-4-14B (0.71), and Phi-4-14B with CoT (0.68). Gemma-2-9B-IT with CoT outperformed the other models in four categories (perceived contraindication, low stroke risk, low AF burden, already on OAC), GPT-4 performed best for patient preference and antiplatelet alternatives, and Gemma-2-9B-IT without CoT performed best for history of AF ablation (Graphic 2).
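The reported proportions can be checked directly from the counts given above. The short sketch below does only that arithmetic; the helper name `pct` is hypothetical, and every count is taken from the Results text.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as reported in the Results section.
AF_ENCOUNTERS = 35_737
ENCOUNTERS_WITHOUT_OAC = 7_712
ANNOTATED_NOTES = 911
NOTES_WITH_REASON = 497

share_without_oac = pct(ENCOUNTERS_WITHOUT_OAC, AF_ENCOUNTERS)
share_notes_with_reason = pct(NOTES_WITH_REASON, ANNOTATED_NOTES)

if __name__ == "__main__":
    print(f"Encounters lacking active OAC prescription: {share_without_oac}%")
    print(f"Annotated notes with a documented reason:   {share_notes_with_reason}%")
```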
Conclusions:
Gemma-2-9B-IT, an open-source LLM, categorized reasons for OAC nonprescription with performance comparable to that of GPT-4. This demonstrates that much smaller, freely available, and privacy-preserving LLMs can identify barriers to guideline-directed AF care and can be deployed across health systems to help reduce gaps in OAC prescribing.

Somani, Sulaiman (Stanford Health Care, Menlo Park, California, United States)
Kim, Dale (Stanford University, Highlands Ranch, Colorado, United States)
Perez Guerrero, Eduardo (Stanford University, Stanford, California, United States)
Ngo, Summer (Stanford University, Highlands Ranch, Colorado, United States)
Nguyen, Minh (Stanford University, Highlands Ranch, Colorado, United States)
Sandhu, Alexander (Stanford University, Millbrae, California, United States)
Alsentzer, Emily (Stanford University, Highlands Ranch, Colorado, United States)
Hernandez-Boussard, Tina (Stanford University, Highlands Ranch, Colorado, United States)
Rodriguez, Fatima (Stanford University, Palo Alto, California, United States)

Author Disclosures:

Sulaiman Somani: