J. Beattie1, S. Neufeld2, D. X. Yang2, C. Chukwuma2, A. Gul3, N. B. Desai2, M. Dohopolski2, and S. B. Jiang2; 1Medical Artificial Intelligence and Automation (MAIA) Lab, Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX, 2Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, TX, 3UT Southwestern Medical Center, Dallas, TX
Purpose/Objective(s): Clinical trial screening is a critical yet resource-intensive step in the development of medical treatments. Large language models (LLMs) have become increasingly capable and widely adopted in recent years. This study investigates the potential of LLMs as an alternative to traditional screening tools and techniques, measuring their efficiency and accuracy.
Materials/Methods: We used two datasets: the publicly available n2c2 2018 cohort selection dataset and a dataset of patients screened for an institutional phase II clinical trial. The trials had 13 and 14 criteria, respectively. Ground-truth criterion labels were readily available for the former, while the latter required manual verification by clinical research staff. The LLM pipeline was developed on the n2c2 dataset, using 20 patient records for initial prompt refinement; a further 20 patients were used to refine prompts for our institutional application. Patient electronic health records (EHRs) were processed with OpenAI's text-embedding-3-small model, and the resulting embeddings populated a vector database for efficient document section retrieval. For each screening criterion, we crafted physician-guided, LLM-enhanced prompts to retrieve the five most pertinent EHR sections, which were then analyzed by the GPT-3.5 Turbo/GPT-4 API on our HIPAA-compliant servers. The analysis assessed the accuracy, sensitivity, and specificity of the LLMs for each criterion, complemented by a qualitative failure analysis to identify the primary causes of incorrect classifications.

Results: On the n2c2 dataset (182 patients, 13 criteria), accuracy reached 87%, with 84% sensitivity and 89% specificity. For 10 patients fully eligible across the 14 criteria of the institutional phase II trial, accuracy and sensitivity were both 98%. For 10 ineligible patients, accuracy was 87%, sensitivity 87%, and specificity 86%. The LLM screening process took 1-5 minutes per patient. Failure analysis revealed that most incorrect classifications stemmed from document retrieval challenges, temporal reasoning errors, or misinterpretation of criteria requirements.

Conclusion: Our findings demonstrate that LLMs enable rapid and accurate clinical trial screening on both public and institutional datasets. This approach can aid clinical research staff in the clinical trial screening process, improving enrollment and reducing costs. Further investigation in more diverse environments and more complex clinical trials is required to establish the approach's efficacy across disciplines.
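To make the retrieval-augmented screening pipeline described in Materials/Methods concrete, the following is a minimal Python sketch, assuming the openai (>=1.0) SDK and numpy. The in-memory cosine-similarity search stands in for the study's vector database, and the function names, prompt wording, and "gpt-4" model string are illustrative assumptions, not the authors' implementation.

```python
# Sketch of retrieval-augmented criterion screening.
# Assumes OPENAI_API_KEY is set; all helper names are hypothetical.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed text chunks with OpenAI's text-embedding-3-small model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_sections(criterion: str, sections: list[str], k: int = 5) -> list[str]:
    """Return the k EHR sections most similar to the criterion query
    (in the study, section embeddings were precomputed into a vector DB)."""
    doc_vecs = embed(sections)
    query_vec = embed([criterion])[0]
    # Cosine similarity; normalization is defensive since these
    # embeddings are already unit-length.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [sections[i] for i in np.argsort(sims)[::-1][:k]]

def screen_criterion(criterion: str, sections: list[str]) -> str:
    """Ask the LLM whether the retrieved evidence satisfies one criterion."""
    evidence = "\n\n".join(top_k_sections(criterion, sections))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are screening a patient for a clinical trial. "
                        "Answer MET or NOT MET with a one-sentence rationale."},
            {"role": "user",
             "content": f"Criterion: {criterion}\n\nEHR excerpts:\n{evidence}"},
        ],
    )
    return resp.choices[0].message.content
```

In a deployment like the one described, this per-criterion call would be repeated over all 13-14 criteria for each patient, which is consistent with the reported 1-5 minute screening time per patient.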
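The metrics reported in Results follow standard confusion-matrix definitions. A short sketch, assuming "positive" means a criterion is met and the function name is hypothetical:

```python
def screening_metrics(y_true: list[bool], y_pred: list[bool]) -> dict:
    """Accuracy, sensitivity (TP/(TP+FN)), and specificity (TN/(TN+FP))
    for one criterion, where True means the criterion is met."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }
```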