QP 14 - Patient Safety 2: From Time Tracking to AI: Let's Drive Safer Treatments Together
1079 - Automatic Large-Scale Prospective Monitoring and Blind Evaluation of the Performance of Commercial AI Auto-Segmentation in Routine Clinical Practice: Preliminary Results for a Cohort of Breast Cancer
P. Dubrowski1, M. J. Kim1, Y. Yang2, and D. H. Hristov1; 1Department of Radiation Oncology, Stanford University School of Medicine, Palo Alto, CA, 2Department of Radiation Oncology, Stanford University, Stanford, CA
Purpose/Objective(s): Anatomical structures segmented by artificial intelligence (AI) can form the baseline for an automatic contouring quality assurance program, provided AI performance is reliably quantified so that outliers can be detected. Small retrospective studies (a few tens of patients) that are limited to geometric metrics and performed at a single time point are inherently biased and unsuitable for this purpose. Hence, we aim to develop an unsupervised software framework that automatically evaluates AI auto-segmentation performance for every treated patient.
Materials/Methods: An evaluation tool (AI-Evaluator) was developed with the Eclipse Scripting Application Programming Interface (Varian Medical Systems). Upon treatment plan approval for a given patient, the tool autonomously compares the clinically approved structures with the initial AI structures, both geometrically and dosimetrically, by evaluating the dose distribution on both structure sets. AI structure metrics are also used to establish Statistical Process Control (SPC) limits, creating a real-time system that flags outliers for further inspection. Metrics relevant to planning include Relative Volume Difference (RVD), Maximum Relative Volume Difference (MRVD) in dose-volume histogram (DVH) indices, Hausdorff Distance (HD), compliance with dosimetric protocols, and others. Additional metrics are automatically recorded to evaluate correlations with anatomical site, image artifacts, treating physician, and other factors. The AI-Evaluator can introduce new metrics and re-assess them for previously processed patients, allowing new, important AI performance correlates to be evaluated as they emerge.
Results: In a subset of 65 breast cancer patients treated by 5 physicians, the median RVD across all 23 AI structures per patient was 2.8%. Contour adjustments varied significantly among physicians, as measured by RVD (p < 0.0001). The left/right internal mammary (mean RVD = 13%/21%), left axillary L1 (mean RVD = 8.8%), and left supraclavicular (mean RVD = 8.4%) lymph nodes had the largest RVDs (ANOVA, p < 0.0001). When RVDs exceeded 1%, mean MRVDs for the same structures were 42%/51%, 66%, and 51%, respectively. Variations in MRVD across segmented structures were statistically significant (ANOVA, p < 0.0001). SPC charts identified 3 structures outside the control limits (3σ). Manual inspection found that 2 of the 3 contours were anatomically correct, and none were adjusted in the clinical treatment.
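As an illustration of the geometric metrics named in the Methods, the following minimal Python sketch computes an RVD and an HD for one structure. It is not the AI-Evaluator code, which is built on the Eclipse Scripting API. The abstract does not give formulas, so the RVD definition used here (absolute volume difference divided by the clinical volume) and the symmetric point-set form of the Hausdorff distance are assumptions. The function names, volumes, and surface points are hypothetical.

```python
import math


def relative_volume_difference(v_ai, v_clinical):
    """Assumed RVD definition: |V_AI - V_clinical| / V_clinical (a fraction)."""
    if v_clinical <= 0:
        raise ValueError("clinical volume must be positive")
    return abs(v_ai - v_clinical) / v_clinical


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical toy data: volumes in cm^3, surface points in cm.
ai_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
clinical_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]

rvd = relative_volume_difference(v_ai=10.5, v_clinical=10.0)
hd = hausdorff_distance(ai_surface, clinical_surface)

print(f"RVD = {rvd:.1%}, HD = {hd:.1f} cm")
```

The brute-force nearest-point search is quadratic in the number of surface points, which is acceptable for a sketch. A production tool would more likely use a spatial index or a library routine.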
Conclusion: We have implemented a tool that evaluates AI auto-segmented structures against clinically approved ones for every treated patient without supervision and can flag contouring outliers for further inspection. Initial use on a subset of breast cancer patients already points to structures that need to be examined carefully in view of existing discrepancies in contours and their dosimetric impact.
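The outlier flagging described above can be sketched as a simple 3-sigma check. This is a minimal example under stated assumptions, not the authors' implementation: the abstract specifies only that control limits were set at 3 sigma, so estimating the limits as the mean plus or minus three sample standard deviations of a baseline set is an assumption. The function names and RVD values are hypothetical.

```python
import statistics


def control_limits(baseline, n_sigma=3.0):
    """Lower and upper limits as mean +/- n_sigma * sample standard deviation."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - n_sigma * sigma, center + n_sigma * sigma


def flag_outliers(values, limits):
    """Indices of values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical RVD values (percent) for one structure.
baseline_rvd = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 3.5]
new_rvd = [2.8, 3.1, 9.0]

limits = control_limits(baseline_rvd)
flagged = flag_outliers(new_rvd, limits)

print(f"limits = ({limits[0]:.2f}, {limits[1]:.2f}), flagged indices = {flagged}")
```

The limits are estimated from a separate baseline rather than from the values being tested, because a large outlier included in a small sample inflates the standard deviation and can mask itself.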