Aiding data retrieval in clinical trials with large language models: The APOLLO 11 Consortium in advanced lung cancer patients

Corso, Federica; Mazzeo, Laura; Peppoloni, Vittoria; Leone, Giuseppe; Miskovic, Vanja; Wiest, Isabella; Silvestri, Cecilia; Occhipinti, Mario; Brambilla, Marta; Beninato, Teresa; Ferrarin, Alberto; Meazza Prina, Marco; Proto, Claudia; Baili, Paolo; Ganzinelli, Monica; De Braud, Filippo Guglielmo Maria; Lo Russo, Giuseppe; Kather, Jakob Nikolas; Pedrocchi, Alessandra; Prelaj, Arsela

doi:10.1200/jco.2025.43.16_suppl.e23161

Background: Data retrieval is challenging in clinical research, and traditional data collection methods are often time-consuming and error-prone. Large Language Models (LLMs) have shown zero-shot capabilities in converting unstructured clinical text into structured data. These technologies could support the retrieval stage of clinical trials by leveraging the information reported in Electronic Health Records (EHRs), reducing reliance on manual curation. The APOLLO 11 Consortium (NCT05550961) is a multicenter Italian trial that uses a federated infrastructure to analyze data from patients with advanced lung cancer across Italy.
Methods: We conducted a pilot study using Llama 3.1 8B on 358 Non-Small Cell Lung Cancer patients from the IRCCS Istituto Nazionale dei Tumori, the lead center of the APOLLO 11 Consortium. Anonymized EHRs were analyzed with the LLM pipeline for feature extraction developed by Wiest et al. A combination of zero-shot and few-shot prompting techniques was used, in both English and Italian. We selected smoking, histology, PD-L1 and staging as multiclass variables and bone/brain/liver metastases as binary variables. Ground truth collection involved a first Manual Data Entry (1-MDE) and a final, fully revised MDE (2-MDE). LLM accuracy was calculated only against 2-MDE. In addition, we calculated the percentage of Missing Information (% MI) in 1-MDE, 2-MDE and the LLM extraction.
Results: Compared to 2-MDE, the LLM achieved feature-specific accuracies of 0.78 for PD-L1, 0.85 for bone metastasis, 0.83 for brain metastasis, 0.89 for liver metastasis and 0.96 for tumor staging. For smoking and staging, LLM extraction also reduced % MI relative to 1-MDE (Table 1). For PD-L1 only, we further analyzed the 12.8% MI and found that 91.3% of it resulted from hallucinations (i.e., PD-L1 was misclassified as missing). Evaluations using English prompts confirmed the pipeline’s adaptability and high task accuracy.
Conclusions: This study confirms the feasibility of LLMs for data retrieval in clinical trials, demonstrating strong performance across diverse clinical features with minimal prompt optimization. LLMs could assist clinicians and data entry personnel in the 1-MDE process, streamlining initial data structuring and saving time. The 2-MDE step can remain as a quality check to resolve any discrepancies. Further improvements could focus on prompt optimization and on integrating human feedback to reduce hallucination rates. [Table presented]
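The evaluation described in the Methods (per-feature accuracy against 2-MDE, % MI, and the share of LLM "missing" outputs that were actually reported in the reference) can be illustrated with a minimal sketch. This is not the study's code: the function names, the encoding of missing values as `None`, the choice to treat "missing" as an ordinary class when scoring accuracy, and the toy PD-L1 labels are all assumptions made for illustration.

```python
from typing import Optional, Sequence

Label = Optional[str]  # assumption: a missing value is encoded as None


def accuracy(llm: Sequence[Label], reference: Sequence[Label]) -> float:
    """Fraction of patients whose LLM label equals the reference (2-MDE) label.

    Assumption: 'missing' (None) is treated as a class like any other.
    """
    if len(llm) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(llm, reference)) / len(reference)


def pct_missing(values: Sequence[Label]) -> float:
    """Percentage of Missing Information (% MI) in one data source."""
    if not values:
        raise ValueError("input must be non-empty")
    return 100.0 * sum(v is None for v in values) / len(values)


def pct_missing_hallucinated(llm: Sequence[Label],
                             reference: Sequence[Label]) -> float:
    """Among LLM outputs marked missing, the percentage for which the
    reference does contain a value (the LLM wrongly reported 'missing')."""
    if len(llm) != len(reference):
        raise ValueError("inputs must be of equal length")
    missing_idx = [i for i, p in enumerate(llm) if p is None]
    if not missing_idx:
        return 0.0
    wrong = sum(reference[i] is not None for i in missing_idx)
    return 100.0 * wrong / len(missing_idx)


# Toy PD-L1 labels for five hypothetical patients (invented, not study data).
llm_pdl1 = ["<1%", None, ">=50%", "1-49%", None]
mde2_pdl1 = ["<1%", "1-49%", ">=50%", ">=50%", None]

print(f"accuracy vs 2-MDE: {accuracy(llm_pdl1, mde2_pdl1):.2f}")
print(f"% MI (LLM): {pct_missing(llm_pdl1):.1f}")
print(f"% of LLM MI that is hallucinated: "
      f"{pct_missing_hallucinated(llm_pdl1, mde2_pdl1):.1f}")
```

Running the same three functions once per feature, and `pct_missing` once per source (1-MDE, 2-MDE, LLM), would yield the kind of per-feature summary the abstract reports, under the assumptions stated above.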