https://doi.org/10.1007/s00392-025-02737-x
1 Universitätsklinikum Heidelberg, Klinik für Innere Medizin III (Kardiologie, Angiologie und Pneumologie), Heidelberg, Germany
Background:
Large Language Models (LLMs) are increasingly used by patients seeking cardiovascular health information. However, their accuracy and suitability for providing guidance on heart failure and cardiomyopathies remain inadequately evaluated. This study systematically evaluated the quality of responses generated by LLMs to patient-relevant questions about heart failure and cardiomyopathies.
Methods:
Six widely used LLMs were evaluated: OpenAI GPT-4o (2024-11-20), DeepSeek Chat, Gemini 2.5 Pro (Preview 05-06), Anthropic Claude 3.7 Sonnet (20250219), Perplexity Sonar Pro, and xAI Grok-3. Fifty patient-centric questions covering disease understanding, diagnosis, treatment, prognosis, and lifestyle concerns were developed, derived from clinical practice observations, patient forums, and current literature. Questions were submitted identically to all models via standardized API interfaces. A purpose-built, web-based evaluation platform (LLM Response Evaluator) randomized and blinded all responses for assessment by twelve reviewers: three cardiologists, three medical students, and six AI auto-graders that performed blinded evaluation via function calling. Responses were scored across nine quality dimensions on 5-point Likert scales (3 = neutral, 5 = excellent): appropriateness, comprehensibility, completeness, conciseness, confabulation avoidance, readability, educational value, actionability, and tone/empathy.
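To make the pipeline concrete, the sketch below illustrates the two technical steps described above: submitting an identical question to a chat model through an OpenAI-compatible API, and auto-grading a blinded response via function calling. This is a minimal, hypothetical illustration rather than the study's actual code; the endpoint handling, grader model, prompt wording, and schema details are assumptions, while the nine dimensions and the 1-5 Likert scale follow the abstract.

```python
# Hypothetical sketch of the evaluation pipeline (not the study's actual code).
import json
from openai import OpenAI

DIMENSIONS = [
    "appropriateness", "comprehensibility", "completeness", "conciseness",
    "confabulation_avoidance", "readability", "educational_value",
    "actionability", "tone_empathy",
]

def ask_model(base_url: str, api_key: str, model: str, question: str) -> str:
    """Send the same patient question to one provider's OpenAI-compatible endpoint."""
    client = OpenAI(base_url=base_url, api_key=api_key)
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return completion.choices[0].message.content

# One integer property (1-5) per quality dimension; tool_choice forces the
# grader to return a complete, structured rating for the blinded answer.
GRADING_TOOL = {
    "type": "function",
    "function": {
        "name": "submit_ratings",
        "description": "Rate an anonymized chatbot answer on nine quality dimensions "
                       "(1 = poor, 3 = neutral, 5 = excellent).",
        "parameters": {
            "type": "object",
            "properties": {d: {"type": "integer", "minimum": 1, "maximum": 5}
                           for d in DIMENSIONS},
            "required": DIMENSIONS,
        },
    },
}

def auto_grade(grader: OpenAI, grader_model: str, question: str, blinded_answer: str) -> dict:
    """Score one blinded response; the grader never sees which model produced it."""
    completion = grader.chat.completions.create(
        model=grader_model,  # assumed grader model
        messages=[
            {"role": "system",
             "content": "You are grading an anonymized patient-education answer about heart failure."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer:\n{blinded_answer}"},
        ],
        tools=[GRADING_TOOL],
        tool_choice={"type": "function", "function": {"name": "submit_ratings"}},
    )
    call = completion.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)  # e.g. {"appropriateness": 5, ...}
```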
Results:
Readability analysis showed significant variation, with Gemini producing the most accessible content (Flesch-Kincaid Grade Level: 11.9±1.8) compared to Claude's more complex outputs (35.2±20.5). However, Gemini's responses were the longest (word count: 668.7±116.1), while Claude's were far more concise (word count: 226.9±38.9). Across 2,700 total ratings, Gemini demonstrated the highest overall performance (mean rating 4.55±0.02), particularly excelling in completeness and confabulation avoidance, followed by xAI Grok (4.41±0.02), OpenAI GPT-4o (4.26±0.02), DeepSeek (4.20±0.02), Anthropic Claude (4.15±0.02), and Perplexity (4.00±0.02). Confabulation avoidance scored highest across models (4.49±0.02), while conciseness scored lowest (3.81±0.05). Lower-rated models (e.g., Perplexity, 4.00 overall) performed less consistently, particularly in tone/empathy and conciseness. Rating tendencies varied by evaluator group: auto-graders gave the highest average scores (4.58±0.60), followed by medical students (4.10±0.88), while experts were more conservative (3.79±0.93), reflecting stricter grading closer to the neutral midpoint.
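For orientation, the readability and length metrics reported above could be computed along the following lines. This is a minimal sketch under assumptions: the abstract does not name its tooling, so the textstat package and the mean ± standard-error summary used here are illustrative choices, not the study's documented method.

```python
# Illustrative sketch of per-response metrics and per-model summaries (assumed tooling).
import statistics
import textstat

def response_metrics(text: str) -> dict:
    """Readability and length metrics for a single model response."""
    return {
        "fk_grade": textstat.flesch_kincaid_grade(text),  # Flesch-Kincaid Grade Level
        "word_count": len(text.split()),
    }

def mean_sem(values: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean, as in summaries such as 4.55±0.02."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem
```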
Discussion:
Google's Gemini 2.5 Pro achieved the highest overall performance across appropriateness, completeness, and actionability, suggesting strong potential for patient-facing cardiovascular communication. All evaluated LLMs showed good accuracy in avoiding medical misinformation, although readability and comprehensiveness varied substantially. While serious errors or confabulations were rare, they were not entirely absent. Differences in grading strictness between experts and other raters further emphasize the need for careful validation of chatbot outputs in clinical settings. LLMs hold promise for enhancing patient education but should be deployed with oversight and model-specific awareness.