Xian Gong, Bin Zheng, Guobing Xu, Hao Chen, Chun Chen.
Abstract
BACKGROUND: Accurate prognostic estimation for esophageal cancer (EC) patients plays an important role in clinical decision-making. The objective of this study was to develop an effective model to predict the 5-year survival status of EC patients using machine learning (ML) algorithms.
Keywords: Esophageal cancer (EC); Surveillance, Epidemiology, and End Results (SEER); machine learning (ML); survival
Year: 2021 PMID: 34992804 PMCID: PMC8662490 DOI: 10.21037/jtd-21-1107
Source DB: PubMed Journal: J Thorac Dis ISSN: 2072-1439 Impact factor: 2.895
Selected clinicopathological features from the SEER dataset
| Features | Number of categories | Top category^a | Frequency^b | P value^c |
|---|---|---|---|---|
| Categorical features | ||||
| Race recode (W, B, AI, API) | 4 | White | 8,941 | 0.156 |
| Sex | 2 | Male | 8,441 | 0.232 |
| Primary site-labelled | 7 | C15.5-lower third of esophagus | 6,941 | 0.841 |
| Diagnostic confirmation | 4 | Positive histology | 10,508 | 0.857 |
| ICD-O-3 Hist/behav | 47 | 8140/3: adenocarcinoma, NOS | 5,955 | <0.001 |
| Derived AJCC stage group, 7th ed (2010–2015) | 11 | IV | 3,171 | <0.001 |
| Derived AJCC T, 7th ed (2010–2015) | 11 | T3 | 4,270 | <0.001 |
| Derived AJCC N, 7th ed (2010–2015) | 4 | N1 | 4,729 | <0.001 |
| Derived AJCC M, 7th ed (2010–2015) | 2 | M0 | 7,417 | <0.001 |
| RX Summ—Surg Prim Site (1998+) | 4 | None | 7,364 | <0.001 |
| RX Summ—Scope Reg LN Sur (2003+) | 8 | None | 7,398 | <0.001 |
| RX Summ—Surg Oth Reg/Dis (2003+) | 6 | None; diagnosed at autopsy | 10,265 | 0.749 |
| SEER combined mets at DX-bone (2010+) | 2 | No | 9,832 | <0.001 |
| SEER combined mets at DX-brain (2010+) | 2 | No | 10,420 | <0.001 |
| SEER combined mets at DX-liver (2010+) | 2 | No | 9,170 | <0.001 |
| SEER combined mets at DX-lung (2010+) | 2 | No | 9,643 | <0.001 |
| CS tumor size (2004–2015) | 170 | 50 | 1,193 | <0.001 |
| CS lymph nodes (2004–2015) | 19 | 0 | 4,132 | <0.001 |
| CS mets at DX (2004–2015) | 6 | 0 | 7,417 | <0.001 |
| Sequence number | 8 | One primary only | 7,737 | <0.001 |
| Reason no cancer-directed surgery | 7 | Not recommended | 6,215 | <0.001 |
| 5-year survival | 2 | Dead | 9,048 | |
| Numerical features | ||||
| Age recode with single ages and 85+ | 66.71^d | 10.89^e | [18, 85]^f | <0.001 |
| Regional nodes examined (1988+) | 8.66^d | 21.4^e | [0, 98]^f | <0.001 |
| Regional nodes positive (1988+) | 70.87^d | 43.38^e | [0, 98]^f | <0.001 |
^a, the category with the highest frequency; ^b, the corresponding frequency; ^c, χ2 test; ^d, mean; ^e, standard deviation; ^f, [range]. SEER, Surveillance, Epidemiology, and End Results.
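The P values in the rightmost column come from χ2 tests of each categorical feature against 5-year survival status. A minimal sketch of one such test, using an illustrative (made-up, not the study's) 2×2 contingency table for a metastasis-style feature:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = feature level (no / yes metastasis),
# columns = 5-year survival status (alive / dead).
table = np.array([[900, 4100],
                  [100, 4900]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, significant at 0.001: {p_value < 0.001}")
```

Features whose test fails a chosen significance threshold are the "non-significant features" removed in the reduced dataset evaluated below.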
Results of hyperparameter tuning
| Classifier | Training parameters | Searching space | Best parameters |
|---|---|---|---|
| XGBoost | n_estimators | [100, 10,000] | 1,169 |
| | learning_rate | [0.001, 0.5] | 0.1 |
| | max_depth | [1, 10] | 5 |
| | subsample | [0.25, 0.75] | 0.62 |
| | colsample_bytree | [0.05, 0.5] | 0.49 |
| | colsample_bylevel | [0.05, 0.5] | 0.41 |
| CatBoost | n_estimators | [100, 10,000] | 1,642 |
| | learning_rate | [0.001, 0.5] | 0.1 |
| | max_depth | [0, 5] | 3 |
| | reg_lambda | [1e−8, 10] | 0.3455 |
| LightGBM | n_estimators | [100, 10,000] | 3,248 |
| | learning_rate | [0.001, 0.5] | 0.0316 |
| | max_depth | [1, 10] | 5 |
| | num_leaves | [1, 300] | 16 |
| | lambda_l1 | [1e−8, 10] | 0.52 |
| | lambda_l2 | [1e−8, 10] | 0.2 |
| GBDT | n_estimators | [100, 5,000] | 1,340 |
| | learning_rate | [0.001, 0.5] | 0.0023 |
| | max_depth | [1, 20] | 11 |
| | max_leaf_nodes | [2, 100] | 23 |
| | subsample | [0.25, 0.75] | 0.27 |
| RF | n_estimators | [100, 10,000] | 200 |
| | max_depth | [1, 10] | 6 |
| | min_samples_split | [2, 11] | 2 |
| | min_samples_leaf | [1, 10] | 4 |
GBDT, gradient boosting decision trees; RF, random forest.
Figure 1 An example of hyperparameter tuning of the LightGBM model. Each of the 100 dots represents one trial; the shade of blue encodes the objective value. The objective function is defined as the average logistic loss over 5-fold cross-validation of the LightGBM model.
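This excerpt does not name the tuning tool, so the sketch below stands in with scikit-learn's RandomizedSearchCV, using the GBDT ranges from the table and the same objective as Figure 1 (average log loss over 5 CV folds). The data are synthetic, and `n_iter` and `n_estimators` are deliberately capped so the sketch runs quickly; the study searched up to 5,000 trees over far more trials.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the SEER cohort.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Ranges mirror the GBDT row of the table above; n_estimators is
# capped at 500 here (the study searched [100, 5,000]).
param_distributions = {
    "n_estimators": randint(100, 500),
    "learning_rate": loguniform(0.001, 0.5),
    "max_depth": randint(1, 20),
    "max_leaf_nodes": randint(2, 100),
    "subsample": uniform(0.25, 0.50),  # i.e., the interval [0.25, 0.75]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=3,                # each draw is one "trial" (Figure 1 shows 100)
    scoring="neg_log_loss",  # objective: average log loss over the CV folds
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Trial-based optimizers that adaptively propose the next configuration (as Figure 1 suggests) would replace the random draws, but the objective and cross-validation loop are the same.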
Model performance using 8 algorithms (columns 2–4: the complete dataset; columns 5–7: the dataset with the non-significant features removed)

| Classifier | AUC | Accuracy | Logistic loss | AUC | Accuracy | Logistic loss |
|---|---|---|---|---|---|---|
| XGBoost | 0.852 | 0.875 | 0.301 | 0.845 | 0.871 | 0.307 |
| LightGBM | 0.850 | 0.875 | 0.302 | 0.844 | 0.870 | 0.308 |
| CatBoost | 0.849 | 0.874 | 0.304 | 0.843 | 0.871 | 0.308 |
| GBDT | 0.846 | 0.875 | 0.307 | 0.842 | 0.871 | 0.311 |
| ANN^a | 0.844 | 0.871 | 0.308 | 0.833 | 0.869 | 0.316 |
| RF | 0.838 | 0.865 | 0.319 | 0.838 | 0.865 | 0.319 |
| NB | 0.833 | 0.769 | 1.766 | 0.833 | 0.769 | 1.766 |
| SVM | 0.789 | 0.855 | 0.364 | 0.789 | 0.855 | 0.363 |
^a, the ANN structure with the best AUC is n-4-4-1. AUC, area under the receiver operating characteristic curve; GBDT, gradient boosting decision trees; RF, random forest; NB, naive Bayes; ANN, artificial neural networks; SVM, support vector machines.
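The three metrics in the table can be reproduced with scikit-learn. A small sketch with illustrative probabilities (positive class = dead at 5 years), not the study's actual predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Hypothetical ground truth and predicted probabilities for 8 patients.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.65, 0.8, 0.3, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 cutoff

auc = roc_auc_score(y_true, y_prob)   # threshold-free ranking quality
acc = accuracy_score(y_true, y_pred)  # fraction of correct hard labels
ll = log_loss(y_true, y_prob)         # penalizes confident wrong probabilities
print(f"AUC={auc:.3f}, accuracy={acc:.3f}, log loss={ll:.3f}")
```

Note that AUC and log loss are computed from the probabilities, while accuracy depends on the chosen cutoff; this is why NB can have a competitive AUC (0.833) yet much worse accuracy and log loss than the boosting models.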
Figure 2 Visual representation of model performance for the 8 algorithms trained on the complete dataset. (A) The precision-recall curve. (B) The ROC curve. The closer the AUC is to 1, the better the model's classification and prediction performance. ROC, receiver operating characteristic; AUC, area under the ROC curve.
Figure 3 SHAP feature importance measured as the mean absolute SHAP value. SHAP, SHapley Additive exPlanations.
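The bar lengths in a mean-|SHAP| importance plot are, for each feature, the average of the absolute SHAP values over all patients. A numpy sketch with an illustrative (made-up) 3-patient, 3-feature SHAP matrix:

```python
import numpy as np

# Illustrative SHAP values: rows = patients, columns = features.
shap_values = np.array([
    [ 0.40, -0.05,  0.10],
    [-0.35,  0.02, -0.20],
    [ 0.50, -0.01,  0.15],
])
feature_names = ["Derived AJCC M", "Sex", "CS tumor size"]

# Global importance = mean |SHAP| per feature (the bar lengths).
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:  # most important first
    print(f"{feature_names[i]}: {importance[i]:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down (like the first column here) would otherwise cancel toward zero.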
Figure 4 SHAP summary plot. The summary plot uses SHAP values to show the distribution of the impact each feature has on the model output. The position on the Y-axis is determined by the feature, and the position on the X-axis by the SHAP value. The trend of the SHAP value distribution per feature can be read from the overlapping points jittered in the Y-axis direction. The color indicates the feature value, from low to high. The features are ordered by importance; the importance values are shown in Figure 3. SHAP, SHapley Additive exPlanations.
Figure 5 SHAP dependence plots: (A) regional nodes positive (1988+); (B) CS tumor size (2004–2015). Each panel plots the SHAP value of the feature against the value of that feature for all patients in the dataset. The light grey bars are frequency distribution histograms for the two features. SHAP, SHapley Additive exPlanations.