
American Heart Association



Final ID: MP1541

Comparative Evaluation of US- and China-developed Large Language Models for Bilingual Coronary Heart Disease Patient Education

Background: Coronary heart disease (CHD) remains the leading global cause of mortality and imposes substantial clinical and socioeconomic burdens. Patient education significantly improves health outcomes but is limited by resource constraints. Large language models (LLMs) offer scalable solutions for patient counselling, but their performance may vary by the language of the prompts and the region of model development. This study performs the first systematic bilingual (English and Chinese) evaluation of leading US- and China-developed LLMs for CHD education.
Methods: We assessed six widely used LLMs, three US-developed (GPT-4o, OpenAI o1, Gemini 1.5) and three China-developed (DeepSeek-R1, Doubao, ERNIE Bot 3.5), using 30 bilingual CHD-related questions covering prevention, diagnosis, and management. Each response was independently rated by three cardiologists on four dimensions: correctness, comprehensiveness, understandability, and empathy. Scores were analyzed using cumulative-link mixed models (CLMM) adjusting for question and rater variability, with pairwise contrasts to identify specific differences.
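As a rough check on the scale of this design, the sketch below enumerates the rating grid it implies. It assumes a fully crossed layout (every model answers every question once in each language, and every cardiologist scores every response on every dimension); the abstract does not state this explicitly, and the variable names are illustrative only.

```python
from itertools import product

# Factors named in the Methods paragraph.
MODELS = ["GPT-4o", "OpenAI o1", "Gemini 1.5",
          "DeepSeek-R1", "Doubao", "ERNIE Bot 3.5"]
LANGUAGES = ["English", "Chinese"]
DIMENSIONS = ["correctness", "comprehensiveness",
              "understandability", "empathy"]
N_QUESTIONS = 30
N_RATERS = 3

# Assumption: fully crossed design, one response per
# (model, language, question) combination.
responses = list(product(MODELS, LANGUAGES, range(N_QUESTIONS)))

# Assumption: each rater scores each response on each dimension.
ratings = list(product(responses, range(N_RATERS), DIMENSIONS))

n_responses = len(responses)
n_ratings = len(ratings)
print(n_responses, n_ratings)
```

Under these assumptions the grid contains 360 responses and 4320 individual ratings.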
Results: A total of 360 bilingual responses (4320 individual ratings) demonstrated high inter-rater reliability (Fleiss' κ=0.821). GPT-4o and OpenAI o1 achieved the highest overall scores. Correctness emerged as the strongest evaluation dimension, while empathy showed the greatest variability among models. Significant language-by-model interactions indicated that, surprisingly, China-developed models (DeepSeek-R1, ERNIE Bot 3.5) performed better when responding in English rather than Chinese. Pairwise contrasts showed that OpenAI o1 outperformed GPT-4o in correctness (English OR 8.07, 95% CI 2.11–30.88; Chinese OR 2.78, 95% CI 0.94–8.25, an interval that includes 1) and understandability (English OR 12.21, 95% CI 2.37–62.91; Chinese OR 7.07, 95% CI 2.45–20.43). Gemini exhibited exceptional empathy specifically in Chinese (OR 31.69, 95% CI 14.50–69.24).
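The reported odds ratios can be read more easily on the log scale. The following is a minimal sketch, assuming the intervals are Wald-type (symmetric around the log odds ratio), which the abstract does not confirm; the helper name `log_or_summary` and the dictionary layout are hypothetical. It recovers an approximate standard error from each interval and flags whether the interval excludes 1.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_or_summary(odds_ratio, ci_low, ci_high, z=Z_95):
    """Summarize a reported OR and CI on the log scale.

    Assumption: the CI is symmetric on the log scale (Wald-type),
    so its half-width divided by z approximates the standard error.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return {
        "log_or": math.log(odds_ratio),
        "se": se,
        "excludes_one": ci_low > 1.0 or ci_high < 1.0,
    }


# OpenAI o1 vs GPT-4o contrasts as reported in the Results paragraph.
contrasts = {
    ("correctness", "English"): (8.07, 2.11, 30.88),
    ("correctness", "Chinese"): (2.78, 0.94, 8.25),
    ("understandability", "English"): (12.21, 2.37, 62.91),
    ("understandability", "Chinese"): (7.07, 2.45, 20.43),
}

summaries = {key: log_or_summary(*vals) for key, vals in contrasts.items()}

for (dimension, language), s in summaries.items():
    print(f"{dimension:18s} {language:8s} "
          f"log(OR)={s['log_or']:.2f} SE={s['se']:.2f} "
          f"excludes 1: {s['excludes_one']}")
```

Of the four contrasts, only the Chinese correctness interval (0.94–8.25) contains 1; the other three exclude it.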
Conclusions: Our comprehensive bilingual analysis identifies substantial performance variability influenced by model architecture and language specificity. Models with chain-of-thought reasoning, such as OpenAI o1 and DeepSeek-R1, hold significant promise for clinical application. Performance declines observed with Chinese prompts in China-developed models may reflect a lack of high-quality Chinese medical corpora. Establishing robust multilingual medical databases is essential for improving cross-linguistic reliability and safety.
  • Liu, Kaiyuan (Fudan University, Zhongshan Hospital, Shanghai, China)
  • Author Disclosures:
    Kaiyuan Liu: No relevant financial relationships
Meeting Info:

Scientific Sessions 2025

2025

New Orleans, Louisiana
