Can Large Language Models Make Guideline‑adherent Treatment Decisions For Patients with Severe Aortic Stenosis?

Background and Aims

Large language models (LLMs), such as ChatGPT, achieve high scores in medical board exams and can solve difficult diagnostic challenges. However, the utility of LLMs for guiding treatment decision-making based on real clinical data is unknown. Our aim was to determine whether ChatGPT can be used for guideline-adherent treatment decision-making for patients with severe aortic stenosis in a realistic clinical setting.

Methods

We included 40 patients with severe aortic stenosis who had been discussed by our institutional Heart Team in 2022. We presented ChatGPT with (i) raw medical text reports, (ii) case summaries, and (iii) case summaries additionally enriched with clinical practice guideline content, to obtain a treatment decision for either surgical or transcatheter aortic valve replacement. We evaluated ChatGPT's treatment decisions, separately for versions 3.5 and 4, in terms of agreement with the treatment decisions made by our institutional Heart Team.

Results

Based on raw medical text reports, both ChatGPT-3.5 and ChatGPT-4 invariably opted against surgical aortic valve replacement. Cohen's kappa coefficients steadily increased from −0.47 to 0.32 for ChatGPT-3.5 and from −0.05 to 0.66 for ChatGPT-4 when raw medical reports were augmented with case summaries and decision-relevant clinical practice guideline content.
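To illustrate the agreement metric used here, the following is a minimal sketch of Cohen's kappa for two raters (e.g. the Heart Team and a model); the decision labels and example data are hypothetical, not taken from the study:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    labelling the same cases. Returns 1.0 for perfect agreement,
    0.0 for chance-level agreement, negative for below-chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed agreement: fraction of cases with identical labels
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: "SAVR" = surgical, "TAVR" = transcatheter replacement
heart_team = ["SAVR", "SAVR", "TAVR", "TAVR"]
model      = ["SAVR", "TAVR", "TAVR", "TAVR"]
print(cohens_kappa(heart_team, model))  # 0.5
```

A model that always answers "TAVR" can still show high raw agreement on a TAVR-heavy cohort, which is why the study reports kappa rather than simple percent agreement.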

Conclusions

ChatGPT was not capable of choosing the correct treatment option for patients with severe aortic stenosis based on raw medical reports. In addition, ChatGPT exhibited a bias towards transcatheter over surgical aortic valve replacement. Only when medical records were extensively preprocessed did the latest ChatGPT version show substantial alignment with the Heart Team.