Trusting AI with Your Heart: How Well Do Public Large Language Models Detect Atrial Fibrillation?

DGK Herztage 2025. Clin Res Cardiol (2025). https://doi.org/10.1007/s00392-025-02737-x

Nikola Cenikj (München)1, A. Steger (München)1, A. Müller (München)1, A. Bollinger (München)1, C. Zou (München)1, F. V. Hahn (München)1, J. Kehrer (München)1, I. Rattka (München)1, K.-L. Laugwitz (München)1, E. Martens (München)1, M. Rattka (München)1

1Klinikum rechts der Isar der Technischen Universität München, Klinik und Poliklinik für Innere Medizin I, München, Germany


Background:
Atrial fibrillation (AF) is the most common cardiac arrhythmia and is linked to serious complications such as stroke. Early diagnosis is essential, as timely treatment can reduce these risks. The growing availability of wearable ECG devices and advances in AI have made AF detection more accessible. Increasingly, patients consult large language models (LLMs) such as ChatGPT, Gemini, or LLaMA for medical input, including ECG analysis. These models often suggest they can interpret ECGs, and clinicians are increasingly presented with AI-assessed tracings; some patients even trust these results without clinical validation. This study investigates whether publicly available LLMs can reliably detect AF on standard 12-lead ECGs and how they compare with a domain-specific model.

Objective:
The study compares several publicly available LLMs with a purpose-built ECG classifier to evaluate their effectiveness in identifying AF from ECGs. It aims to determine whether general-purpose LLMs can match the diagnostic performance of models developed specifically for cardiology.

Methods:
The analyzed ECGs were obtained from two sources. The first, the KORA cohort from southern Germany, included 6,254 ECGs with 136 AF cases confirmed by cardiologists. The second was a public dataset with 45,152 ECGs, from which atrial flutter cases were excluded, leaving 37,090 traces with 1,780 AF cases. The domain-specific model, ECG-CNN-CLS, is a transformer-based classifier pretrained on large ECG datasets and fine-tuned to identify various rhythms, including AF. It was compared with three LLaMA models (standard, 4-bit quantized, and biomedical), Gemini 1.5 Flash, Gemini 2.0, and GPT-4o. The LLaMA models, which can run offline, were tested on both datasets; Gemini and GPT-4o were tested only on the public dataset. All models received identical prompts and ECG plots to ensure consistency.
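The evaluation protocol above can be sketched as follows. This is a minimal illustration only, not the study's actual code: the model names, the `query_model` stub, and the prompt wording are assumptions standing in for each model's real API or local inference call. The point it demonstrates is that every model receives the identical prompt text and the identical ECG plot files.

```python
# Sketch of a consistent multi-model evaluation loop.
# `query_model` is a hypothetical placeholder, NOT a real API call.

PROMPT = ("You are shown a plot of a 12-lead ECG. "
          "Does this tracing show atrial fibrillation? Answer 'AF' or 'no AF'.")

MODELS = ["llama-standard", "llama-4bit", "llama-biomedical",
          "gemini-1.5-flash", "gemini-2.0", "gpt-4o"]

def query_model(model: str, prompt: str, ecg_plot_path: str) -> str:
    # Placeholder: a real run would send the plot image plus the prompt
    # to the named model and return its answer.
    return "no AF"

def evaluate(ecg_plots: list[str]) -> dict[str, list[str]]:
    # Every model sees the same PROMPT and the same plot files,
    # so differences in output reflect the models, not the inputs.
    return {m: [query_model(m, PROMPT, p) for p in ecg_plots] for m in MODELS}

answers = evaluate(["ecg_001.png", "ecg_002.png"])
```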

Results:
ECG-CNN-CLS performed best across both datasets. On the KORA set, it reached an AUC of 0.935, accuracy of 0.993, sensitivity of 0.875, and specificity of 0.996. Among the LLaMA models, the 4-bit quantized version performed best, with an AUC of 0.589 and sensitivity of 0.338. On the public dataset, ECG-CNN-CLS again showed strong results: AUC 0.868, accuracy 0.939, sensitivity 0.789, and specificity 0.947. The LLMs underperformed by comparison. GPT-4o had an AUC of 0.591 and sensitivity of 0.415. Gemini 2.0 showed the best AUC among the LLMs (0.635), but its sensitivity remained below 0.5. Gemini 1.5 Flash achieved the highest specificity (0.987) and accuracy (0.942) on this dataset, exceeding ECG-CNN-CLS on both, but still missed more than half of the AF cases.
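The reported sensitivity, specificity, and accuracy follow the standard confusion-matrix definitions. The sketch below, using illustrative counts that are not taken from the study, shows how a high accuracy and specificity can coexist with a low sensitivity when AF cases are rare:

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float, float]:
    """Standard confusion-matrix metrics for a binary (AF / no AF) classifier."""
    sensitivity = tp / (tp + fn)                 # fraction of AF cases detected
    specificity = tn / (tn + fp)                 # fraction of non-AF correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall fraction correct
    return sensitivity, specificity, accuracy

# Illustrative counts only (not the study's data): 8 AF cases among 1,000 ECGs.
sens, spec, acc = binary_metrics(tp=7, fn=1, tn=990, fp=2)
print(round(sens, 3), round(spec, 3), round(acc, 3))  # → 0.875 0.998 0.997
```

Because non-AF tracings dominate, a model that rarely flags AF can still post high accuracy and specificity, which is why sensitivity is the decisive metric for a screening task like this.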

Discussion:
Across over 43,000 ECGs, the specialized ECG-CNN-CLS model outperformed all public LLMs in AF detection. This performance gap reflects fundamental differences in architecture and training. LLMs are optimized for text and cannot effectively process waveform data like ECGs. Presenting ECGs as static images removes key temporal patterns such as irregular rhythms or absent P-waves, limiting diagnostic value. None of the tested LLMs had prior training on large ECG datasets, while ECG-CNN-CLS was developed specifically for this purpose.

Conclusion:
Given the current evidence, specialized cardiology algorithms remain the most reliable option for point-of-care AF detection. Until multimodal foundation models are rigorously trained on large ECG repositories, chatbot interpretations should be treated with caution.
