American Heart Association

Final ID: MP1527

Hybrid NLP Model Accurately Extracts Data from Tetralogy of Fallot Cardiac MR Reports

Introduction:
Utilization of large language models (LLMs) for named entity recognition from free-text medical reports is rapidly expanding. However, concerns about protected health information (PHI) restrict the adoption of commercial LLMs. While simpler natural language processing methods, such as regular expressions (RegEx), offer an accessible solution, they often fail when linguistic variability increases. Particularly in rare conditions with limited labeled datasets, advanced machine learning models like Bidirectional Encoder Representations from Transformers (BERT) are difficult to train. Thus, there is a need for secure, efficient, and accurate data extraction methods.
Research Questions:
We hypothesized that a hybrid approach combining simple RegEx with few-shot prompts on an on-premises LLM would maximize accuracy and efficiency while maintaining PHI compliance.
Methods:
We retrospectively analyzed cardiovascular magnetic resonance (CMR) reports from 183 patients. Custom RegEx rules and few-shot LLM prompts were independently applied across all reports. A hybrid extraction approach integrated both methods by selectively using LLM results in areas of poor RegEx performance. Ground truths were manually verified by a clinical expert. Performance was evaluated using Coverage, Precision, Recall, and F1-score metrics.
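The hybrid routing described above can be illustrated with a minimal Python sketch. The abstract does not publish the actual RegEx rules, prompts, or selection criterion, so everything named here is an assumption made for illustration: the field names, the two patterns, the `llm_extract` callable (a stand-in for a few-shot prompt sent to an on-premises model), and the per-field routing set `llm_fields`.

```python
import re
from typing import Callable, Dict, Optional, Set

# Hypothetical rules: the abstract does not list the real patterns or fields.
REGEX_RULES: Dict[str, "re.Pattern[str]"] = {
    "lvef_percent": re.compile(
        r"LV\s*EF[^0-9]{0,20}(\d{1,3}(?:\.\d+)?)\s*%", re.IGNORECASE
    ),
    "rvedvi_ml_m2": re.compile(
        r"RV\s*EDVi?[^0-9]{0,20}(\d{1,3}(?:\.\d+)?)\s*ml/m2", re.IGNORECASE
    ),
}


def regex_extract(field: str, report: str) -> Optional[str]:
    """Return the first captured value for `field`, or None if no match."""
    match = REGEX_RULES[field].search(report)
    return match.group(1) if match else None


def hybrid_extract(
    report: str,
    llm_extract: Callable[[str, str], Optional[str]],
    llm_fields: Set[str],
) -> Dict[str, Optional[str]]:
    """Use the LLM only for fields flagged as weak for RegEx (assumed rule)."""
    results: Dict[str, Optional[str]] = {}
    for field in REGEX_RULES:
        if field in llm_fields:
            # Field where RegEx performed poorly: route to the LLM.
            results[field] = llm_extract(field, report)
        else:
            # Deterministic path: cheap and needs no model call.
            results[field] = regex_extract(field, report)
    return results
```

Because the LLM is called only for the flagged fields, the number of model calls per report drops from one per field to one per flagged field, which is the kind of saving the reported reduction in computational time would be consistent with. How the flagged fields were chosen in the study is not stated in the abstract.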
Results:
A manual review of 430 CMR reports (3/2005-12/2024) identified a median proportion of missing values of 3.95% (IQR 2.79–5.12) across 13 clinical metrics. The baseline RegEx extraction alone achieved a completeness of 90.7%, whereas the standalone few-shot LLM approach reached 91.9%. Combining RegEx with targeted few-shot LLM prompts, the hybrid method significantly improved data completeness to 99.8%. In terms of accuracy, the hybrid approach attained an F1 score of 97.5%±3.6, clearly outperforming RegEx alone (85.2%±22.2) and the standalone LLM (86.0%±15.1). Pairwise comparisons confirmed differences were significant (p<0.001) with large effect sizes (Cohen’s d >1.0). Additionally, the hybrid approach reduced computational time by approximately 75% compared to the LLM-only method.
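For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute coverage, precision, recall, and F1 for a single extracted field. The abstract does not define how true positives, false positives, and false negatives were counted, so the convention here is an assumption: a correct non-empty value is a true positive, an incorrect non-empty value is a false positive, and an empty extraction where a ground-truth value exists is a false negative. The function name `extraction_metrics` is hypothetical.

```python
from typing import Dict, Optional, Sequence


def extraction_metrics(
    preds: Sequence[Optional[str]], truths: Sequence[Optional[str]]
) -> Dict[str, float]:
    """Coverage, precision, recall, and F1 under the assumed counting rule."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")

    tp = fp = fn = 0
    for pred, truth in zip(preds, truths):
        if pred is not None and pred == truth:
            tp += 1  # extracted and correct
        elif pred is not None:
            fp += 1  # extracted but wrong (or nothing to extract)
        elif truth is not None:
            fn += 1  # value present in the report but not extracted

    covered = sum(pred is not None for pred in preds)
    coverage = covered / len(preds) if preds else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "coverage": coverage,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Under this convention, the per-field F1 values could be averaged across the 13 clinical metrics to give a mean and standard deviation, although the abstract does not state that this is how the reported figures were aggregated.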
Conclusion:
A hybrid NLP method combining deterministic RegEx and targeted LLM prompts significantly enhances data extraction accuracy from legacy clinical free-text reports. This approach addresses PHI security concerns and effectively reallocates annotation resources toward predictive modeling, thereby advancing clinical research and quality improvement.
  • Akbasli, Izzet Turkalp (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Baloglu, Orkun (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Liou, Wilson (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Latifi, Samir (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Marino, Bradley (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Albahra, Samer (Cleveland Clinic, Beachwood, Ohio, United States)
  • Tandon, Animesh (Cleveland Clinic Childrens, Cleveland, Ohio, United States)
  • Author Disclosures:
    Izzet Turkalp Akbasli: DO NOT have relevant financial relationships | Orkun Baloglu: No Answer | Wilson Liou: DO NOT have relevant financial relationships | Samir Latifi: DO NOT have relevant financial relationships | Bradley Marino: DO NOT have relevant financial relationships | Samer Albahra: DO NOT have relevant financial relationships | Animesh Tandon: DO have relevant financial relationships ; Individual Stocks/Stock Options:Amazon:Active (exists now) ; Individual Stocks/Stock Options:Nvidia:Active (exists now) ; Individual Stocks/Stock Options:Alphabet:Active (exists now)
Meeting Info:

Scientific Sessions 2025

2025

New Orleans, Louisiana

Session Info:

Transforming Healthcare with Large Language Models and NLP: From Unstructured Data to Clinical Insight

Sunday, 11/09/2025, 11:50AM - 01:00PM

Moderated Digital Poster Session
