C. Dvorak1, Y. Vesga-Prada1, J. Salazar1, A. P. Shah1, T. H. Wagner1, P. Kelly1, S. Meeks2, C. S. Wuu3, and T. Dvorak1; 1Department of Radiation Oncology, Orlando Health Cancer Institute, Orlando, FL, 2Varian Medical Systems, Palo Alto, CA, 3Columbia University, New York, NY
Purpose/Objective(s): The study aims to assess the efficacy of OpenAI's ChatGPT and Google's Bard, both generative AI large language models, in medical physics education for resident radiation oncology physicians. By analyzing AI performance on the RAPHEX exams, we seek to explore their potential as an educational tool and to contribute insight into the integration of AI technologies into medical education.
Materials/Methods: Three generative AI models (OpenAI's GPT-3.5 and GPT-4, and Google's Bard) were evaluated on a total of 673 text-based questions from the RAPHEX radiation physics exams spanning 2018-2022 (133 to 137 questions per year), excluding image-related questions. Permission was obtained from MPP Publishing. The models were tested through their respective online interfaces, using the software versions available in June-August 2023, and were scored against the RAPHEX Therapy answer key. A subset of questions was subsequently re-entered under a standardized, motivational prompt to test reproducibility. Statistical analyses compared model performances, focusing on accuracy and consistency of responses.
Results: GPT-4 achieved a 77% accuracy rate, with performance ranging from 71% to 84% by year. In contrast, both GPT-3.5 and Bard recorded a 52% accuracy rate, with Bard's performance varying from 43% to 61% and GPT-3.5's from 45% to 62% (p<0.01). In a head-to-head comparison of each vendor's best model, GPT-4 and Bard both correctly answered 45% of the 673 questions (range 37% to 58% by year). Instances where GPT-4 was correct and Bard was incorrect accounted for 31% (range 26% to 34%), while instances where GPT-4 was incorrect but Bard was correct constituted 6% (range 3% to 11%). Both models answered 18% of the questions incorrectly (range 13% to 23%). Concordance between GPT-4 and Bard, measured by Cohen's kappa, was 0.26 (range 0.11 to 0.33), indicating low agreement in the models' "knowledge". A reproducibility test used 26 questions from the 2022 exam, evenly split between questions GPT-4 had initially answered correctly and incorrectly (50% expected accuracy); across repeated sessions, GPT-4's accuracy ranged from 46% to 69%. GPT-4 answered 77% of these questions correctly at least once (GPT4-Max) but also answered 35% of them incorrectly at least once (GPT4-Min), showing significant variability in the model's consistency and accuracy across sessions.
Conclusion: GPT-4 was able to answer ~80% of RAPHEX questions correctly, a significant improvement over its predecessor GPT-3.5 and over Google Bard. However, there was significant variability in accuracy between different models, as well as between different sessions of the same model. Incorporating such generative AI tools into medical physics education for radiation oncology residents will require a better understanding of their performance and reliability across physics knowledge domains.
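As an illustrative consistency check (not part of the reported analysis), the quoted kappa can be approximately recovered from the 2x2 agreement fractions above. With observed agreement $p_o = 0.45 + 0.18 = 0.63$ and marginal correct rates of roughly 0.76 for GPT-4 and 0.51 for Bard, the chance-expected agreement is $p_e = (0.76)(0.51) + (0.24)(0.49) \approx 0.505$, giving
\[
\kappa = \frac{p_o - p_e}{1 - p_e} \approx \frac{0.63 - 0.505}{1 - 0.505} \approx 0.25,
\]
consistent with the reported 0.26 once the exact question counts, rather than rounded percentages, are used.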
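The following is a minimal sketch of how such an accuracy and concordance analysis could be carried out, assuming per-question answer keys and model responses are available as simple lists; the toy data, variable names, and use of scikit-learn's cohen_kappa_score are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: score two models against a shared answer key and
# compute accuracy plus Cohen's kappa on the correct/incorrect labels.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-question data (answer key and each model's chosen option).
key  = ["B", "D", "A", "C", "B", "A", "D", "C"]
gpt4 = ["B", "D", "A", "C", "A", "A", "D", "B"]
bard = ["B", "C", "A", "C", "B", "D", "A", "B"]

# Convert each model's responses to correct (1) / incorrect (0) against the key.
gpt4_correct = [int(m == k) for m, k in zip(gpt4, key)]
bard_correct = [int(m == k) for m, k in zip(bard, key)]

accuracy_gpt4 = sum(gpt4_correct) / len(key)
accuracy_bard = sum(bard_correct) / len(key)

# Kappa on the correct/incorrect labels measures chance-corrected agreement in
# WHICH questions the two models answer correctly, not their raw accuracy.
kappa = cohen_kappa_score(gpt4_correct, bard_correct)

print(f"GPT-4 accuracy: {accuracy_gpt4:.2f}, "
      f"Bard accuracy: {accuracy_bard:.2f}, kappa: {kappa:.2f}")
```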