A. J. Jinia1, K. L. Chapman1, S. Liu2, C. Della Biancia1, A. Li2, and J. M. Moran2; 1Memorial Sloan Kettering Cancer Center, New York, NY, 2Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY
Purpose/Objective(s): Despite recent advances in artificial intelligence (AI) and large language models (LLMs), commercially developed LLMs often lack the domain-specific nuance needed to analyze patient safety events in radiation oncology, which limits their effectiveness in improving patient care.
Materials/Methods: We developed a method to preprocess events before submitting them to our artificial intelligence-based incident learning system (AI-ILS). Preprocessing involved automatic de-identification of protected health information (PHI) using Clinical BERT and expansion of acronyms and abbreviations using the National Library of Medicine MetaMap program. AI-ILS investigates patient safety events through the application of the Human Factors Analysis and Classification System (HFACS). Two LLMs, Llama-2 and Mistral-V2, were chosen for AI-ILS because of their strong performance in benchmark experiments. Their open-source nature allows flexible fine-tuning for HFACS and enables local deployment of AI-ILS, thereby strengthening data confidentiality. Mock narrative events were created for the Tier 2 category of HFACS, which covers Preconditions for Unsafe Acts. Experts created mock events for the following sub-categories: environmental factors, personnel factors, and conditions of the operator. Ten initial events per sub-category were developed and served as seed input for ChatGPT-4 to produce an additional 50 events. The resulting ChatGPT-4 events were reviewed for feasibility.

Results: 153 events related to patient setup documentation were extracted from our event reporting system and underwent preprocessing with the de-identification and acronym/abbreviation expansion modules. 87% of the events were preprocessed successfully; the remainder failed because of typos, inconsistent spacing between words, and multiple candidate expansions for a given acronym/abbreviation. Baseline accuracies without fine-tuning were 60.1% for ChatGPT-4, 54.7% for Mistral-V2, and 33.9% for Llama-2. Although ChatGPT-4 outperformed the other methods, its integration into AI-ILS was deemed infeasible owing to confidentiality concerns. Instead, ChatGPT-4 was evaluated as a means of generating a large training dataset from the seed mock narrative events; however, nearly half of the generated events were deemed unusable. For instance, one unusable ChatGPT-4 event read: "The patients treatment was postponed when the treatment machines radiation shield was found to be damaged."

Conclusion: To support analysis of larger datasets, we successfully preprocessed events, underscoring the potential of these methods to protect PHI and reduce word ambiguity. We identified the need for model fine-tuning and for human review of ChatGPT-4-generated events to support future applications of AI-ILS.
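As an illustration of the de-identification step, the sketch below uses the Hugging Face transformers token-classification pipeline with a hypothetical fine-tuned Clinical BERT checkpoint; the model name and entity tags are placeholders, not the system's actual configuration. Detected PHI spans are replaced right to left so that character offsets remain valid.

```python
from transformers import pipeline

# Hypothetical checkpoint name: assumes a Clinical BERT model fine-tuned
# for PHI named-entity recognition, as described in the abstract.
DEID_MODEL = "clinical-bert-phi-ner"  # placeholder, not a real model ID

ner = pipeline("token-classification", model=DEID_MODEL,
               aggregation_strategy="simple")

def deidentify(text: str) -> str:
    """Replace each detected PHI span with a bracketed entity tag."""
    # Process spans right to left so earlier offsets stay valid.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(deidentify("Pt John Smith seen on 03/14 for SBRT setup."))
# e.g. "Pt [PATIENT] seen on [DATE] for SBRT setup."
```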
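The acronym/abbreviation expansion in the study relied on the NLM MetaMap program; the stand-in below substitutes a static dictionary (with invented entries) purely to show the surrounding logic, including the failure mode noted in the Results when an abbreviation has multiple candidate expansions.

```python
import re

# Static stand-in for MetaMap lookups; entries are illustrative only.
EXPANSIONS = {
    "SBRT": ["stereotactic body radiation therapy"],
    "CBCT": ["cone-beam computed tomography"],
    "RT":   ["radiation therapy", "respiratory therapy"],  # ambiguous
}

def expand_abbreviations(text: str) -> str:
    """Expand unambiguous abbreviations; leave ambiguous ones untouched,
    mirroring the multiple-expansion failure mode noted in the Results."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, EXPANSIONS)) + r")\b")

    def sub(match: re.Match) -> str:
        candidates = EXPANSIONS[match.group(1)]
        return candidates[0] if len(candidates) == 1 else match.group(1)

    return pattern.sub(sub, text)

print(expand_abbreviations("CBCT acquired before RT delivery."))
# "cone-beam computed tomography acquired before RT delivery."
```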
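For the HFACS Tier 2 classification itself, a locally deployed open model such as Llama-2 or Mistral could be prompted as sketched below; the checkpoint, prompt wording, and label parsing are assumptions for illustration, not the AI-ILS implementation.

```python
from transformers import pipeline

# Assumed locally hosted open checkpoint; the study's exact model and
# serving setup are not specified in the abstract.
generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")

TIER2_LABELS = ["environmental factors", "personnel factors",
                "conditions of the operator"]

def classify_tier2(event: str) -> str:
    """Ask the model for a single HFACS Tier 2 sub-category label."""
    prompt = (
        "Classify the following radiation oncology safety event into one "
        f"HFACS Tier 2 sub-category ({', '.join(TIER2_LABELS)}). "
        "Answer with the label only.\n\n"
        f"Event: {event}\nLabel:"
    )
    out = generator(prompt, max_new_tokens=10, do_sample=False)
    answer = out[0]["generated_text"][len(prompt):].strip().lower()
    # Fall back to the raw answer if it does not match a known label.
    return next((lbl for lbl in TIER2_LABELS if lbl in answer), answer)
```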