Performance Benchmarking of Smaller Language Models Against GPT-4 for Predicting Reasons for Oral Anticoagulation Nonprescription in Atrial Fibrillation
Background: Oral anticoagulation (OAC) reduces stroke risk in atrial fibrillation (AF), yet nonprescription rates approach 50% and the reasons are poorly characterized. Proprietary large language models (LLMs) such as GPT-4 can identify documented reasons for OAC nonprescription from clinical notes, but cost and privacy concerns are barriers to widespread deployment. We investigated whether smaller, open-source LLMs (Gemma-2-9B-IT, Phi-4K) can achieve comparable performance.
Hypothesis: Open-source LLMs can match the performance of GPT-4 when augmented with techniques such as chain-of-thought (CoT) prompting.
Methods: We identified all patient encounters with clinician-billed ICD-10 AF diagnosis codes at Stanford Health Care from January 1, 2015 through December 31, 2023. Three reviewers annotated 10% of AF-related note excerpts to identify OAC nonprescription reasons. We developed zero-shot prompts for GPT-4, Gemma-2-9B-IT, and Phi-4K, plus CoT prompts for the open-source models (Graphic 1). Performance was assessed using weighted macro-F1 scores.
Results: Of 35,737 AF encounters, 7,712 (21.6%) lacked active OAC prescriptions. From 9,143 associated notes, we extracted 21,573 AF/OAC-related excerpts, of which 10% (911 notes, 2,175 excerpts) were manually annotated. Reasons for nonprescription appeared in 497 (54.6%) notes, most commonly antiplatelet use (18.6%), perceived contraindication (14.7%), and low AF burden (13.9%). Gemma-2-9B-IT with CoT achieved the highest average macro-F1 score (0.81), versus GPT-4 (0.80), Gemma-2-9B-IT (0.76), Phi-4-14B (0.71), and Phi-4-14B with CoT (0.68). Gemma-2-9B-IT with CoT outperformed the other models in four categories (perceived contraindication, low stroke risk, low AF burden, already on OAC), while GPT-4 performed best for patient preference and antiplatelet alternatives, and Gemma-2-9B-IT without CoT performed best for history of AF ablation (Graphic 2).
Conclusions: Gemma-2-9B-IT, an open-source LLM, categorized OAC nonprescription reasons with performance comparable to GPT-4. This demonstrates that much smaller, freely available, and privacy-preserving LLMs can identify barriers to guideline-directed AF care and be deployed across health systems to help reduce care gaps in OAC prescriptions.
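The evaluation metric named in the Methods can be illustrated with a small sketch. The abstract does not specify how "weighted macro-F1" was computed, so the code below shows two common readings: an unweighted macro average of per-category F1 scores and a support-weighted average. The label names, the toy annotations, and the function names are hypothetical and are not taken from the study data or the authors' code.

```python
from collections import Counter


def per_label_f1(gold, pred, labels):
    """Return F1 per label; each excerpt carries a set of reason labels."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, gold):
    """Mean of per-label F1 scores weighted by each label's gold support."""
    support = Counter(label for g in gold for label in g)
    total = sum(support[label] for label in scores)
    if total == 0:
        return 0.0
    return sum(scores[label] * support[label] for label in scores) / total


# Hypothetical labels and toy annotations for five excerpts (not study data).
labels = ["antiplatelet_use", "perceived_contraindication", "low_af_burden"]
gold = [
    {"antiplatelet_use"},
    {"perceived_contraindication"},
    {"low_af_burden"},
    {"antiplatelet_use", "low_af_burden"},
    set(),  # excerpt with no documented reason
]
pred = [
    {"antiplatelet_use"},
    {"antiplatelet_use"},
    {"low_af_burden"},
    {"low_af_burden"},
    set(),
]

scores = per_label_f1(gold, pred, labels)
macro = macro_f1(scores)
weighted = weighted_f1(scores, gold)
print(scores, macro, weighted)
```

In this toy example the two averages differ because the category with the lowest F1 also has the fewest annotated excerpts, which is why the choice of averaging matters when category frequencies are uneven.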
Somani, Sulaiman (Stanford Health Care, Menlo Park, California, United States)
Kim, Dale (Stanford University, Highlands Ranch, Colorado, United States)
Perez Guerrero, Eduardo (Stanford University, Stanford, California, United States)
Ngo, Summer (Stanford University, Highlands Ranch, Colorado, United States)
Nguyen, Minh (Stanford University, Highlands Ranch, Colorado, United States)
Sandhu, Alexander (Stanford University, Millbrae, California, United States)
Alsentzer, Emily (Stanford University, Highlands Ranch, Colorado, United States)
Hernandez-Boussard, Tina (Stanford University, Highlands Ranch, Colorado, United States)
Rodriguez, Fatima (Stanford University, Palo Alto, California, United States)
Author Disclosures:
Sulaiman Somani: DO NOT have relevant financial relationships
Dale Kim: DO NOT have relevant financial relationships
Eduardo Perez Guerrero: DO NOT have relevant financial relationships
Summer Ngo: DO NOT have relevant financial relationships
Minh Nguyen: DO NOT have relevant financial relationships
Alexander Sandhu: DO have relevant financial relationships; Consultant: Reprieve Cardiovascular: Active (exists now); Consultant: Clearly: Active (exists now); Research Funding (PI or named investigator): Novo Nordisk: Active (exists now); Research Funding (PI or named investigator): Novartis: Active (exists now); Research Funding (PI or named investigator): Bayer: Active (exists now); Research Funding (PI or named investigator): AstraZeneca: Active (exists now)
Emily Alsentzer: DO have relevant financial relationships; Consultant: Fourier Health: Active (exists now)
Tina Hernandez-Boussard: No Answer
Fatima Rodriguez: DO have relevant financial relationships; Consultant: HealthPals: Past (completed); Consultant: Cleerly Health: Active (exists now); Consultant: Amgen: Active (exists now); Consultant: iRhythm: Active (exists now); Consultant: HeartFlow: Active (exists now); Consultant: Arrowhead Pharmaceuticals: Active (exists now); Consultant: Edwards: Active (exists now); Consultant: Inclusive Health: Active (exists now); Consultant: Esperion Therapeutics: Past (completed); Consultant: Kento Health: Active (exists now); Consultant: Movano Health: Active (exists now); Consultant: Novo Nordisk: Past (completed); Consultant: Novartis: Active (exists now)