Can Large Language Models Make Guideline‑adherent Treatment Decisions For Patients with Severe Aortic Stenosis?

Tobias Röschl (Berlin)1, M. Hoffmann (Berlin)1, D. Hashemi (Berlin)2, F. Rarreck (Berlin)1, T. D. Trippel (Berlin)3, H. Dreger (Berlin)4, J. Kempfert (Berlin)1, G. Hindricks (Berlin)5, V. Falk (Berlin)6, A. Meyer (Berlin)1

1Deutsches Herzzentrum der Charite (DHZC) Klinik für Herz-, Thorax- und Gefäßchirurgie, Berlin, Germany; 2Deutsches Herzzentrum der Charite (DHZC) Klinik für Kardiologie, Angiologie und Intensivmedizin | CVK, Berlin, Germany; 3Charité - Universitätsmedizin Berlin CC11: Med. Klinik m.S. Kardiologie, Berlin, Germany; 4Deutsches Herzzentrum der Charite (DHZC) Klinik für Kardiologie, Angiologie und Intensivmedizin | CBF, Berlin, Germany; 5Charité - Universitätsmedizin Berlin CC11: Med. Klinik m. S. Kardiologie und Angiologie, Berlin, Germany; 6Charité - Universitätsmedizin Berlin Klinik für kardiovaskuläre Chirurgie, Berlin, Germany


Background and Aims

Large language models (LLMs), such as ChatGPT, achieve high scores on medical board examinations and can solve difficult diagnostic challenges. However, the utility of LLMs for guiding treatment decision-making based on real clinical data is unknown. Our aim was to determine whether ChatGPT can be used for guideline-adherent treatment decision-making for patients with severe aortic stenosis in a realistic clinical setting.


Methods

We included 40 patients with severe aortic stenosis who had been discussed by our institutional Heart Team in 2022. We presented ChatGPT with raw medical text reports, case summaries, and both formats additionally enriched with clinical practice guideline content, in order to obtain a treatment decision for either surgical or transcatheter aortic valve replacement. We evaluated ChatGPT's treatment decisions, separately for versions 3.5 and 4, in terms of agreement with the treatment decisions of our institutional Heart Team.


Results

Based on raw medical text reports, both ChatGPT-3.5 and ChatGPT-4 invariably opted against surgical aortic valve replacement. Cohen's kappa coefficients steadily increased from -0.47 to 0.32 for ChatGPT-3.5 and from -0.05 to 0.66 for ChatGPT-4 when the raw medical reports were augmented with case summaries and decision-relevant clinical practice guideline content.
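For context, Cohen's kappa quantifies chance-corrected agreement between two raters over the same cases (here, ChatGPT versus the Heart Team). A minimal sketch of the computation, using hypothetical SAVR/TAVR labels rather than the study data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical decisions for six patients (illustrative only, not the study data):
heart_team = ["TAVR", "SAVR", "TAVR", "TAVR", "SAVR", "SAVR"]
model      = ["TAVR", "TAVR", "TAVR", "TAVR", "SAVR", "SAVR"]
print(round(cohens_kappa(heart_team, model), 2))  # → 0.67
```

A kappa of 1 indicates perfect agreement, 0 agreement no better than chance, and negative values systematic disagreement; values above roughly 0.6 are conventionally read as substantial agreement.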


Conclusions

ChatGPT was not capable of choosing the correct treatment option for patients with severe aortic stenosis based on raw medical reports alone. In addition, ChatGPT exhibited a bias towards transcatheter over surgical aortic valve replacement. Only when medical records were extensively preprocessed did the latest ChatGPT version show substantial agreement with the Heart Team.