W. K. Chuang1,2, Y. S. Kao3, Y. T. Liu4,5, and C. Y. Lee3,6; 1Department of Radiation Oncology, Shuang Ho Hospital, Taipei Medical University, New Taipei, Taiwan, 2Department of Biomedical Imaging and Radiological Sciences, National Yang Ming Chiao Tung University, Taipei, Taiwan, 3Department of Radiation Oncology, Taoyuan General Hospital, Ministry of Health and Welfare, Taoyuan, Taiwan, 4Division of Radiation Oncology, Department of Oncology, National Taiwan University Hospital Yunlin Branch, Yunlin, Taiwan, 5Department of Biomedical Engineering, National Taiwan University, Taipei, Taiwan, 6Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei, Taiwan
Purpose/Objective(s): ChatGPT has been increasingly applied to medical fields, but its performance in this setting has yet to be validated. Our study aims to assess the practicality and correctness of ChatGPT-4's answers to clinical inquiries in radiation oncology and to identify the types of mistakes in its incorrect answers.
Materials/Methods: A total of 164 expert-formulated questions covering representative professional domains (Clinical_G: knowledge of standardized guidelines; Clinical_C: complex or controversial clinical scenarios; Nursing: nursing and health education; Technology: radiation technology, physics, and dosimetry) and cancer types were presented to ChatGPT-4. Each ChatGPT-4 answer was graded as 1 (directly applicable to clinical scenarios), 2 (correct but inadequate), 3 (mixed correct and incorrect information), or 4 (completely incorrect). Incorrect answers (Grade 3 or 4) were further analyzed for error type.
Results: Of ChatGPT-4's answers, 73.17% were judged Grade 1, 16.46% Grade 2, 9.76% Grade 3, and only 0.61% Grade 4. Regarding practicality, the proportion of clinically applicable (Grade 1) answers was significantly associated with professional domain but not with cancer type. Grade 1 answers were more frequent in the Nursing (91.89%) and Clinical_G (82.22%) domains than in the Clinical_C (54.05%) and Technology (64.44%) domains. Overall, 89.63% of answers were correct (Grade 1 or 2), and the proportion of correct answers was uniformly high across all professional domains and cancer types. The predominant error types in incorrect answers included "adding incorrect details to a generally correct answer", "misreading articles or guidelines", "missing crucial items", and "inaccurate calculations or geometrical assessments in dosimetry".
Conclusion: In radiation oncology, ChatGPT-4 could serve as a supportive inquiry resource, as the correctness of its answers to expert-formulated questions was validated across professional domains and cancer types. However, the practicality of its clinical application is established only in the Nursing and Clinical_G domains and is not sufficiently assured in the Clinical_C and Technology domains. The risk of hallucinatory mistakes might compromise its immediate clinical application.
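Illustrative note (not part of the authors' reported analysis): the abstract does not state the per-domain question counts or the statistical test used. The minimal Python sketch below back-calculates counts that are consistent with the reported Grade 1 percentages, assuming 37, 45, 37, and 45 questions in the Nursing, Clinical_G, Clinical_C, and Technology domains respectively, and applies a chi-square test of independence as one plausible choice of test.

from scipy.stats import chi2_contingency

# Hypothetical per-domain totals, chosen only so that the reported Grade 1
# percentages resolve to whole numbers of answers; not reported in the abstract.
grade1_vs_other = {
    "Nursing":    (34, 37 - 34),   # 34/37 ~ 91.89% Grade 1
    "Clinical_G": (37, 45 - 37),   # 37/45 ~ 82.22%
    "Clinical_C": (20, 37 - 20),   # 20/37 ~ 54.05%
    "Technology": (29, 45 - 29),   # 29/45 ~ 64.44%
}

# Chi-square test of independence on the domain x (Grade 1 vs. other) contingency table.
table = [list(counts) for counts in grade1_vs_other.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")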