J. Bertschmann1, Y. Xu2, C. Bayley3, and S. L. Lee2; 1University of Alberta, Edmonton, AB, Canada, 2Division of Radiation Oncology, Tom Baker Cancer Centre, Calgary, AB, Canada, 3Department of Oncology, Division of Radiation Oncology, Tom Baker Cancer Centre, University of Calgary, Calgary, AB, Canada
Purpose/Objective(s): Large language models (LLMs) have demonstrated potential as a tool in medical practice and education. However, their performance in radiation oncology requires further study. The present study aimed to evaluate the knowledge and question-interpretation ability of three popular LLMs, OpenAI's ChatGPT-3.5 and GPT-4 and Meta's Llama-2, in the field of cancer and radiation biology through a multiple-choice question (MCQ) examination format.
Materials/Methods: The 2023 ASTRO Radiation and Cancer Biology Exam Study Guide was used to evaluate the performance of three LLMs: ChatGPT-3.5, GPT-4, and Llama-2. The practice examination comprises 337 questions; two questions containing graphical content were excluded. The exam questions were classified by the ten major topics and by whether they required the use of math. The multiple-choice questions were individually entered verbatim into each LLM's chat interface, and answers were recorded and graded as either correct or incorrect. Performance of the LLMs was assessed in relation to overall score, topic, and use of math. The word count and level of detail of each response were recorded. Reliability of the LLMs was also assessed by inputting each question into each LLM five separate times and recording the number of unique answers. Statistical analysis was performed using one-way analysis of variance (ANOVA) followed by post-hoc analysis with the Bonferroni correction to adjust for multiple comparisons.
Results: The overall scores of GPT-3.5, GPT-4, and Llama-2 on the Radiation and Cancer Biology practice exam were 221/335 (62%), 271/335 (81%), and 175/335 (51%), respectively. Thus, GPT-4 significantly outperformed GPT-3.5 (p < 0.001) and Llama-2 (p < 0.001). Overall, all three LLMs were strongest in domains related to molecular and cellular biology (molecular and cellular damage, tumor biology and microenvironment, and cancer biology). They performed poorly in domains related to the effects and delivery of radiation (interaction of radiation with matter, dose delivery, and late effects and radiation protection). All three LLMs performed significantly worse on questions requiring the use of math, with GPT-3.5, GPT-4, and Llama-2 scoring 26% (6/23), 57% (13/23), and 34% (8/23), respectively. The average word counts of the response explanations provided by GPT-3.5, GPT-4, and Llama-2 were 134, 183, and 162 words, respectively. When each question was inputted into each LLM five times, GPT-3.5 outputted more than one distinct response for 39% of questions (132/335), GPT-4 for 20% of questions (70/335), and Llama-2 for 19% of questions (62/335). Thus, despite having the lowest overall exam score, Llama-2 was the most consistent in its responses (p < 0.01).
Conclusion: GPT-4 showed remarkable performance despite the lack of training specific to radiation and cancer biology. While LLMs exhibit potential, these findings underscore the need for rigorous oversight and validation of LLM-generated information related to radiation and cancer biology.
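The statistical comparison described in Materials/Methods can be illustrated with a minimal sketch. The per-question correctness vectors below are hypothetical reconstructions from the reported overall scores (221/335, 271/335, and 175/335), not the study data, and the scipy-based one-way ANOVA with Bonferroni-adjusted pairwise t-tests is one plausible implementation of the analysis the abstract names, not the authors' actual code.

```python
# Minimal sketch (assumption, not the study's code): one-way ANOVA on
# per-question correctness, followed by Bonferroni-adjusted pairwise t-tests.
from itertools import combinations

import numpy as np
from scipy import stats

N_QUESTIONS = 335

# Hypothetical per-question correctness vectors (1 = correct, 0 = incorrect),
# reconstructed from the reported overall scores rather than the actual data.
scores = {
    "GPT-3.5": np.array([1] * 221 + [0] * (N_QUESTIONS - 221)),
    "GPT-4":   np.array([1] * 271 + [0] * (N_QUESTIONS - 271)),
    "Llama-2": np.array([1] * 175 + [0] * (N_QUESTIONS - 175)),
}

# One-way ANOVA across the three models.
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")

# Post-hoc pairwise comparisons with Bonferroni correction (3 comparisons).
pairs = list(combinations(scores, 2))
for a, b in pairs:
    _, p = stats.ttest_ind(scores[a], scores[b])
    p_adj = min(p * len(pairs), 1.0)  # Bonferroni-adjusted p-value
    print(f"{a} vs {b}: adjusted p = {p_adj:.2e}")
```

On binary correctness data such as this, ANOVA with pairwise t-tests behaves much like a comparison of proportions; a chi-square test would be a common alternative, but the sketch follows the ANOVA-plus-Bonferroni procedure stated in the abstract.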