Use of Large Language Models to Optimize Clinical Text Analysis for In-Hospital Cardiac Arrest Identification
Abstract Body: Introduction/Background In-hospital cardiac arrest (IHCA) affects approximately 300,000 patients annually in the US. While individual care teams can readily identify IHCA at the bedside, subsequent reporting of these events often lacks consistency and poses a major challenge for hospitals across the US. Accurate and timely reporting of IHCA events is crucial for facilitating quality improvement (QI) initiatives such as CPR quality review, optimizing team performance, and benchmarking IHCA outcomes.
Hypothesis Large language models (LLMs) can identify IHCA events at the patient-encounter level with high performance by analyzing targeted note types from the inpatient record.
Methods Adult (≥18 years old) inpatient encounters at the Hospital of the University of Pennsylvania from 06/2018 to 03/2022 with a reported clinical emergency were identified from a QI database. Discharge summaries and notes associated with the encounters of interest were extracted from the EHR and de-identified using a natural language processing tool. Notes were reviewed by research staff to ascertain true IHCA events; positive IHCA was defined as loss of pulses followed by delivery of CPR. An LLM, GPT-4 (32K-Chat), deployed on the Penn Medicine Microsoft Azure Databricks environment, was used to compare zero- and few-shot prompting methods for identifying IHCA. Model performance was measured using accuracy, precision, recall, and F1 scores.
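As an illustration only, the sketch below shows how zero- and few-shot Chain-of-Thought classification of a de-identified note could be framed against an Azure-hosted GPT-4 deployment. The prompt wording, deployment name, endpoint, and helper function are assumptions for exposition; they are not the study's actual implementation.

```python
# Hedged sketch: prompt text, deployment name, and endpoint are placeholders,
# not the prompts or configuration used in this study.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-endpoint>.openai.azure.com/",  # placeholder
    api_key="<api-key>",                                          # placeholder
    api_version="2024-02-01",
)

SYSTEM_PROMPT = (
    "You are reviewing a de-identified inpatient note. Decide whether the patient "
    "experienced an in-hospital cardiac arrest (IHCA), defined as loss of pulses "
    "followed by delivery of CPR. Think step by step, then answer 'IHCA+' or 'IHCA-'."
)

def classify_note(note_text: str, examples: list[tuple[str, str]] | None = None) -> str:
    """Classify one note as IHCA+/IHCA- using zero-shot (no examples) or few-shot CoT."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Few-shot variant: prepend labeled example notes (e.g., three for three-shot).
    for example_note, label in examples or []:
        messages.append({"role": "user", "content": example_note})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": note_text})

    response = client.chat.completions.create(
        model="gpt-4-32k",   # assumed deployment name for GPT-4 32K-Chat
        messages=messages,
        temperature=0,       # deterministic output for classification
    )
    return response.choices[0].message.content
```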
Results Manual chart review determined 96.8% and 3.2% of cases to be IHCA+ and IHCA-, respectively. A balanced pilot dataset of positive and negative cases was created, and accuracy was assessed across strategically sampled note types. Using zero-shot learning with Chain-of-Thought (CoT) prompting, the LLM achieved an accuracy of 86%, precision of 77%, recall of 100%, and F1-score of 87%. Using three-shot learning with CoT, the LLM achieved an accuracy of 90%, precision of 83%, recall of 100%, and F1-score of 91%.
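For reference, the reported metrics could be computed from binary encounter-level labels as sketched below; the label arrays here are hypothetical placeholders, not the study data.

```python
# Hedged sketch: y_true / y_pred are illustrative, not the study's labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]   # 1 = IHCA+, 0 = IHCA- (hypothetical gold labels)
y_pred = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # hypothetical LLM predictions

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")
```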
Conclusions This study demonstrates the potential efficacy of leveraging LLMs to automatically classify IHCA from strategically sampled note types using both zero- and few-shot learning approaches. In future work, we will implement temporal sampling of note types and refine model parameters such as output token size and temperature. These findings suggest that such technologies can be effective in real-world clinical settings, providing a scalable approach to supporting quality improvement efforts and improving patient outcomes.
Kaviyarasu, Aarthi (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Vurgun, Ugurcan (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Hwang, Sy (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Acevedo, Ana (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Abella, Benjamin (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Mowery, Danielle (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Mitchell, Oscar (University of Pennsylvania, Philadelphia, Pennsylvania, United States)
Author Disclosures:
Aarthi Kaviyarasu: DO NOT have relevant financial relationships
Ugurcan Vurgun: DO NOT have relevant financial relationships
Sy Hwang: No Answer
Ana Acevedo: DO NOT have relevant financial relationships
Benjamin Abella: DO have relevant financial relationships; Research Funding (PI or named investigator): Becton Dickinson: Active (exists now); Ownership Interest: Neuroptics: Active (exists now); Speaker: Stryker: Active (exists now); Speaker: Zoll: Active (exists now); Advisor: MDAlly: Active (exists now); Advisor: Neuroptics: Active (exists now)
Danielle Mowery: DO NOT have relevant financial relationships
Oscar Mitchell: No Answer