V. Lee1, H. S. M. Park1, S. Aneja1, and A. A. Patel2; 1Department of Therapeutic Radiology, Yale School of Medicine, New Haven, CT, 2Department of Therapeutic Radiology, Yale University School of Medicine, New Haven, CT
Purpose/Objective(s): Lymph node metastasis (LNM) plays a critical role in prognosis and treatment planning for non-small cell lung cancer (NSCLC). Initial diagnosis typically involves a CT scan, but its sensitivity for detecting mediastinal LNM is limited, sometimes as low as 51%. To improve detection, PET-CT, mediastinoscopy, and endobronchial ultrasound (EBUS)-guided needle aspiration are employed. Prediction of nodal metastasis from transcriptomic analysis of primary tumor tissue could inform clinical decisions, particularly in cases of diagnostic uncertainty, such as indeterminate PET-avid lesions or suspected false-negative EBUS-guided biopsies. We hypothesized that primary tumor transcriptome data from The Cancer Genome Atlas (TCGA) could be used to predict LNM with machine learning models, and we sought to compare the performance of several such models to aid NSCLC staging.
Materials/Methods: RNA sequencing data for primary lung adenocarcinoma were downloaded from the TCGA database. These data underwent a bootstrapping process to generate ten training and testing splits, each maintaining a 9:1 ratio. The 2000 genes with the largest median absolute deviation were selected and log-normalized. Logistic regression, support vector machine (SVM), XGBoost, and random forest models were trained and compared. Analyses were conducted in Python 3.7 with the numpy, pandas, sklearn, XGBoost, matplotlib, and shap libraries.
Results: A total of 493 primary tumor transcriptome cases were analyzed. The random forest model had the highest median area under the receiver operating characteristic curve (AUC), at 0.76 (standard deviation [sd] ± 0.08), with a median accuracy of 0.70 (sd ± 0.04). Conversely, the SVM had the lowest median AUC, at 0.35 (sd ± 0.18), with a median accuracy of 0.66 (sd ± 0.05). The XGBoost model had a median AUC of 0.73 (sd ± 0.09) and a median accuracy of 0.71 (sd ± 0.05), while the logistic regression model had a median AUC of 0.68 (sd ± 0.08) and a median accuracy of 0.66 (sd ± 0.06).
Conclusion: These results suggest that machine learning models have the potential to predict LNM in NSCLC using transcriptome data from the primary tumor, albeit with substantial variability across models. Because performance differed by model, it is important to assess multiple machine learning models when evaluating the ability to predict LNM from primary tumor transcriptome data. Such models could help guide clinical decision-making in situations of diagnostic uncertainty.
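The workflow described in Materials/Methods (median absolute deviation filtering, log normalization, ten 9:1 splits, and per-model median AUC) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code. Several choices are assumptions not stated in the abstract: the splits are stratified random 9:1 splits from scikit-learn's `StratifiedShuffleSplit` rather than a true bootstrap, genes are selected on each training split, the log transform is `log2(x + 1)`, all hyperparameters are library defaults or arbitrary, and the data are synthetic stand-ins for TCGA expression values. XGBoost is left out so the sketch depends only on numpy and scikit-learn. The names `select_top_mad_genes`, `evaluate`, and `demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC


def select_top_mad_genes(expr, n_genes):
    """Return column indices of the n_genes with the largest median absolute deviation."""
    med = np.median(expr, axis=0)
    mad = np.median(np.abs(expr - med), axis=0)
    return np.argsort(-mad, kind="stable")[:n_genes]


def continuous_scores(model, X):
    """Use decision_function when available, else the positive-class probability."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]


def evaluate(expr, labels, n_genes=2000, n_splits=10, seed=0):
    """Return {model name: (median AUC, sd of AUC)} over repeated 9:1 splits."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=2000),
        "svm": SVC(),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    # Assumption: stratified random splits stand in for the abstract's bootstrapping.
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.1, random_state=seed)
    aucs = {name: [] for name in models}
    for train_idx, test_idx in splitter.split(expr, labels):
        # Assumption: genes are chosen on the training split only, then log-transformed.
        keep = select_top_mad_genes(expr[train_idx], n_genes)
        X_train = np.log2(expr[train_idx][:, keep] + 1.0)
        X_test = np.log2(expr[test_idx][:, keep] + 1.0)
        for name, model in models.items():
            model.fit(X_train, labels[train_idx])
            scores = continuous_scores(model, X_test)
            aucs[name].append(roc_auc_score(labels[test_idx], scores))
    return {name: (float(np.median(v)), float(np.std(v))) for name, v in aucs.items()}


def demo():
    """Run the sketch on synthetic data (not TCGA data)."""
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=120)
    expr = rng.gamma(shape=2.0, scale=50.0, size=(120, 300))
    expr[labels == 1, :10] *= 3.0  # make the first ten synthetic genes informative
    return evaluate(expr, labels, n_genes=50, n_splits=10, seed=0)


if __name__ == "__main__":
    for name, (median_auc, sd_auc) in demo().items():
        print(f"{name}: median AUC {median_auc:.2f} (sd {sd_auc:.2f})")
```

Selecting genes inside each split keeps the held-out samples from influencing feature selection. Whether the original analysis filtered genes before or after splitting is not stated. A gradient-boosted model could be added to the `models` dictionary in the same way if the xgboost package is installed.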