R. Khanmohammadi1, A. I. Ghanem2, K. Verdecchia2, R. D. Hall3, M. A. Elshaikh3, B. Movsas2, H. Bagher-Ebadian2, I. J. Chetty4, M. M. Ghassemi1, and K. Thind2; 1Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 2Department of Radiation Oncology, Henry Ford Health, Detroit, MI, 3Department of Radiation Oncology, Henry Ford Cancer Institute, Detroit, MI, 4Department of Radiation Oncology, Cedars-Sinai Medical Center, Los Angeles, CA
Purpose/Objective(s): Current radiation oncology toxicity-abstraction systems perform poorly at concept extraction and are optimized for single-institution use, which limits widespread applicability. We propose a novel student-teacher large language model (LLM) architecture that self-improves key-concept abstraction through automatic prompt optimization and is designed for local use to safeguard patient privacy.
Materials/Methods: The student-teacher agent is tested on two key concepts, RT-induced symptoms and the treatments for those symptoms, in the long-term (6 months post-RT) free-text notes of prostate cancer patients. A total of 177 patients who received 78 Gy of RT from 2013-2020 were selected. Model optimization used 235 notes for symptoms and 313 notes for treatments, while validation used 59 notes for single-symptom presence, 375 for multiple symptoms, 79 for a single treatment, and 15 for multiple treatments. Manually annotated notes served as the accuracy benchmark. The model focuses on 12 symptoms and 9 treatments. The Mixtral-8x7B student model initially extracts symptoms and treatments from given prompts, which are then refined by the GPT-4 teacher model over 16 rounds and 5 epochs based on the student's performance and rationale. In this process, the student ranks each concept as positive, negative, or neutral and justifies the ranking, and the teacher model evaluates and improves the prompts based on this analysis.
Results: Improvement is observed for both symptom and treatment abstraction, with incremental progress in each epoch. In single-symptom notes, average accuracy improved from 0.51 to 0.71, precision from 0.52 to 0.82, recall from 0.52 to 0.72, and F1 from 0.49 to 0.73. For multi-symptom notes, average accuracy improved from 0.24 to 0.43, precision from 0.60 to 0.76, recall from 0.24 to 0.43, and F1 from 0.20 to 0.44. For treatment abstraction, in single-treatment notes average accuracy improved from 0.34 to 0.71, precision from 0.64 to 0.81, recall from 0.34 to 0.71, and F1 from 0.41 to 0.72. For multi-treatment notes, average accuracy improved from 0.03 to 0.44, precision from 0.12 to 0.75, recall from 0.03 to 0.44, and F1 from 0.02 to 0.48.
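The student-teacher optimization loop described in Materials/Methods can be sketched as follows. The LLM calls are stubbed out here; in the real system the student is Mixtral-8x7B run locally and the teacher is GPT-4. All function names, signatures, and the early-stopping detail are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the student-teacher prompt-optimization loop, with the
# LLM calls replaced by stubs. Names and control flow are assumptions.
from dataclasses import dataclass

@dataclass
class StudentOutput:
    label: str       # student's ranking: "positive", "negative", or "neutral"
    rationale: str   # free-text justification passed to the teacher

def student_extract(prompt: str, note: str, concept: str) -> StudentOutput:
    """Stub for the local student model (Mixtral-8x7B in the abstract)."""
    # A real call would run local inference; this fakes a keyword heuristic.
    label = "positive" if concept in note.lower() else "negative"
    return StudentOutput(label=label, rationale=f"keyword check for '{concept}'")

def teacher_refine(prompt: str, errors: list) -> str:
    """Stub for the GPT-4 teacher: rewrites the prompt given student errors."""
    hints = "; ".join(f"clarify '{c}'" for c, _ in errors)
    return prompt + f" [teacher hint: {hints}]"

def optimize(prompt, notes, gold, concepts, rounds=16, epochs=5):
    """Refine the prompt from the student's performance and rationale."""
    for _ in range(epochs):
        for _ in range(rounds):
            errors = []
            for note, labels in zip(notes, gold):
                for concept in concepts:
                    out = student_extract(prompt, note, concept)
                    if out.label != labels[concept]:
                        errors.append((concept, out.rationale))
            if not errors:  # assumed early stop once the student is correct
                return prompt
            prompt = teacher_refine(prompt, errors)
    return prompt
```

With a toy note and gold labels, `optimize("Extract symptoms.", ["patient reports diarrhea"], [{"diarrhea": "positive", "fatigue": "negative"}], ["diarrhea", "fatigue"])` returns the prompt unchanged, since the stub student already agrees with the annotations.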
A Wilcoxon test shows p < 0.05 for weighted accuracy, precision, recall, and F1 scores in all categories except precision in multi-symptom abstraction.
Conclusion: This novel architecture offers a method for locally optimized LLM agents capable of extracting key toxicities from clinical notes using a zero-shot learning approach. The student model's ability to perform complete local inference aligns with the imperative to safeguard healthcare data privacy. Demonstrating proof of concept in a research environment, this system highlights the promising role of natural language processing techniques in augmenting radiation oncology clinical care.
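The Wilcoxon comparison reported above can in principle be reproduced as a paired signed-rank test on per-note metric pairs before versus after optimization. A minimal exact-test sketch for small samples follows; the score vectors in the usage note are hypothetical, and the simple implementation assumes distinct absolute differences (no tie correction).

```python
# Exact two-sided Wilcoxon signed-rank test for small paired samples,
# implemented by enumerating all sign patterns. Illustrative only;
# assumes distinct nonzero |differences| (no tie handling).
from itertools import product

def wilcoxon_signed_rank_p(before, after):
    diffs = [a - b for a, b in zip(after, before) if a != b]
    n = len(diffs)
    # rank the absolute differences from 1 (smallest) to n (largest)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    # observed statistic: sum of ranks of positive differences
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # exact null distribution: every sign pattern is equally likely
    totals = [sum(r for r, s in zip(ranks, signs) if s)
              for signs in product([0, 1], repeat=n)]
    mean = n * (n + 1) / 4
    extreme = sum(1 for t in totals if abs(t - mean) >= abs(w_plus - mean))
    return extreme / len(totals)
```

For ten paired scores where every post-optimization value exceeds its pre-optimization counterpart, the test returns p ≈ 0.002, below the 0.05 threshold used in the abstract.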