LINDA-LLM: Extraction of Cardiovascular Molecular Interaction Networks with Large Language Models

E. Gjerga (Heidelberg)1, P. Wiesenbach (Heidelberg)2, C. Dieterich (Heidelberg)3
1Universitätsklinikum Heidelberg Innere Medizin III, Inst. für Molekulare und Translationale Kardiologie, Heidelberg, Germany; 2Universitätsklinikum Heidelberg Innere Medizin III, Klaus-Tschira-Institut für Computerkardiologie, Heidelberg, Germany; 3Universitätsklinikum Heidelberg Klinik für Innere Med. III, Kardiologie, Angiologie u. Pneumologie, Heidelberg, Germany
Objective: Curated databases of molecular interactions are essential in cardiovascular research for network modelling, for improving interpretability and reproducibility, and for prioritising mechanisms and targets. Despite their utility, these resources depend on manual curation, which makes them costly to maintain and often leaves them outdated. Moreover, many of them exhibit a significant research bias towards overrepresented medical conditions, especially cancer, which limits their applicability in cardiobiology. Our LINDA-LLM project creates a continuously updated, cardiac-specific molecular relationship resource. This is accomplished by integrating large-language-model (LLM)–based relation extraction from the cardiovascular literature with subsequent in silico validation using an automated AlphaFold workflow. Source code is available at https://github.com/dieterich-lab/LLM_Relations.

Methods: We employ a robust LLM-based workflow to systematically mine full-text cardiac literature for protein–protein interactions (PPIs) and gene regulatory network (GRN) relations. We have benchmarked various open-source models (e.g., Meta's Llama-70B vs. Llama-8B), evaluated different prompt strategies (including positive/negative examples and retrieval-augmented context), and optionally used Named Entity Recognition (NER) pre-annotation. High-confidence, isoform-aware PPIs are prioritised through further analysis: AlphaFold-Multimer is used to model the extracted interactions and to assess interface plausibility and domain mediation. Where feasible, molecular-dynamics (MD) simulations estimate thermodynamic stability.
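
As an illustration of the prompt strategies mentioned above, the following is a minimal sketch of how a few-shot extraction prompt with positive/negative examples and optional NER pre-annotation could be assembled. The function name `build_prompt`, its parameters, the instruction wording, and the output format are all hypothetical and are not taken from the LINDA-LLM code base; the example sentences use placeholder protein names.

```python
from typing import Optional, Sequence


def build_prompt(passage: str,
                 positive_examples: Sequence[str] = (),
                 negative_examples: Sequence[str] = (),
                 entities: Optional[Sequence[str]] = None) -> str:
    """Assemble a plain-text PPI extraction prompt (hypothetical format)."""
    parts = [
        "Extract all protein-protein interactions stated in the passage.",
        "Answer with one 'PROTEIN_A | PROTEIN_B' pair per line, or 'NONE'.",
    ]
    if positive_examples:
        # Sentences that do state an interaction.
        parts.append("Examples that state an interaction:")
        parts.extend(f"- {s}" for s in positive_examples)
    if negative_examples:
        # Sentences that mention proteins without stating an interaction.
        parts.append("Examples that do NOT state an interaction:")
        parts.extend(f"- {s}" for s in negative_examples)
    if entities:
        # Optional NER pre-annotation: entities found in the passage upstream.
        parts.append("Pre-annotated entities: " + ", ".join(entities))
    parts.append("Passage:")
    parts.append(passage)
    return "\n".join(parts)


prompt = build_prompt(
    passage="PROT1 binds PROT2 in cardiomyocytes.",
    positive_examples=["PROTA directly interacts with PROTB."],
    negative_examples=["PROTC and PROTD are both expressed in the heart."],
    entities=["PROT1", "PROT2"],
)
```

Keeping the examples and the entity list as separate optional arguments makes it straightforward to switch each strategy on or off when comparing prompt configurations.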

Results: Previous benchmarking of this resource demonstrated that Llama-70B consistently outperformed Llama-8B across all evaluated tasks. For PPI extraction, the highest F1 score (≈0.72) was achieved using prompts with positive examples without entity pre-annotation. In contrast, for GRN extraction, providing NER pre-annotation improved recall and yielded the best F1 score (≈0.60). Across all prompting configurations, precision remained more stable than recall. Our proposed approach aims to apply AlphaFold3- and MD-based triage to candidate PPIs to improve identifiability and enable domain-level resolution. The resulting refined data will be integrated into our ALL-LINDA modelling framework, initially focusing on cardiac hypertrophy use cases.
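
For readers less familiar with the metrics reported above, the following is a minimal sketch of how precision, recall and F1 can be computed for a set of extracted relations against a gold standard. It assumes that PPIs are scored as undirected, case-insensitive pairs; this matching rule, the function names, and the placeholder identifiers are assumptions for illustration and may differ from the evaluation actually used in the benchmark.

```python
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str]


def normalise(pairs: Iterable[Relation]) -> Set[frozenset]:
    # Assumption: PPIs are undirected, so (A, B) and (B, A) count as the same pair.
    return {frozenset((a.upper(), b.upper())) for a, b in pairs}


def precision_recall_f1(predicted: Iterable[Relation],
                        gold: Iterable[Relation]) -> Tuple[float, float, float]:
    pred, ref = normalise(predicted), normalise(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with placeholder identifiers (not data from the study).
gold = [("P1", "P2"), ("P3", "P4"), ("P7", "P8")]
predicted = [("P2", "P1"), ("P3", "P4"), ("P5", "P6")]
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case two of the three predicted pairs are in the gold set and two of the three gold pairs are recovered, so precision, recall and F1 all equal 2/3.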

Conclusion: We present a novel approach that uses high-throughput LLM extraction and structure-guided validation to create an open, specialised knowledge resource for cardiac biology. This resource addresses current database biases and obsolescence, improving its relevance for CVD research. Ultimately, it will facilitate better downstream signalling reconstruction and target nomination in cardiac biology. The resource with all the predicted structural properties will be showcased at DGK 2026 as a web application.