| Literature DB >> 32054897 |
Yucan Xu1, Lingsha Ju1, Jianhua Tong1, Cheng-Mao Zhou2, Jian-Jun Yang3.
Abstract
The aim of this study was to explore the feasibility of using machine learning (ML) technology to predict the risk of postoperative recurrence among patients with stage IV colorectal cancer. Four basic ML algorithms were used for prediction: logistic regression, decision tree, GradientBoosting and lightGBM. The research samples were randomly divided into a training group and a testing group at a ratio of 8:2. In total, 999 patients with stage IV colorectal cancer were included in this study. In the training group, the GradientBoosting model had the highest AUC value (0.881) and the Logistic model had the lowest (0.734); the GradientBoosting model also had the highest F1_score (0.912). In the testing group, the Logistic model had the lowest AUC value (0.692). The GradientBoosting model's AUC value was 0.734, which can still predict cancer progression. However, the gbm model had the highest AUC value (0.761) as well as the highest F1_score (0.974). The GradientBoosting model and the gbm model performed better than the other two algorithms. The weight matrix diagram of the GradientBoosting algorithm shows that chemotherapy, age, LogCEA, CEA and anesthesia time were the five most influential risk factors for tumor recurrence. Each of the four machine learning algorithms can predict the risk of tumor recurrence after surgery in patients with stage IV colorectal cancer; among them, GradientBoosting and gbm performed best. Moreover, the GradientBoosting weight matrix shows that the five most influential variables accounting for postoperative tumor recurrence are chemotherapy, age, LogCEA, CEA and anesthesia time.
Year: 2020 PMID: 32054897 PMCID: PMC7220939 DOI: 10.1038/s41598-020-59115-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
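The modelling pipeline described in the abstract (an 8:2 random split into training and testing groups, followed by fitting the four classifiers) could look roughly like the sketch below. This is a minimal illustration, not the authors' published code: the file name, the binary `Progress` label, and the fixed random seed are assumptions, and default hyperparameters are used here (the paper's tuning parameters are listed in the table at the end of this record).

```python
# Hypothetical sketch of the study's pipeline: 8:2 random split, then fit the
# four classifiers named in the paper. Data loading and column names are assumed.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("stage_iv_crc.csv")      # hypothetical file name
X = df.drop(columns=["Progress"])         # predictors (baseline variables)
y = df["Progress"]                        # 1 = recurrence/progression, 0 = none (assumed coding)

# 8:2 split into training and testing groups, as described in the abstract
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic": LogisticRegression(max_iter=100),
    "DecisionTree": DecisionTreeClassifier(max_depth=7),
    "GradientBoosting": GradientBoostingClassifier(),
    "gbm": lgb.LGBMClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```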
Baseline data.
| Variable | Progression: No | Progression: Yes | P-value* |
|---|---|---|---|
| N | 221 | 778 | |
| AGE (years) | 68.9 ± 12.7 | 64.1 ± 13.8 | <0.001 |
| CEA | 219.6 ± 719.3 | 269.8 ± 1053.6 | 0.434 |
| LOGCEA | 1.3 ± 0.9 | 1.4 ± 0.9 | 0.410 |
| ANESTIME (min) | 326.2 ± 122.1 | 341.9 ± 120.6 | 0.050 |
| GENDER | | | 0.924 |
| Male | 136 (61.5%) | 476 (61.2%) | |
| Female | 85 (38.5%) | 302 (38.8%) | |
| ASA | | | <0.001 |
| 1 | 7 (3.2%) | 46 (5.9%) | |
| 2 | 113 (51.1%) | 446 (57.3%) | |
| 3 | 89 (40.3%) | 277 (35.6%) | |
| 4 | 11 (5.0%) | 9 (1.2%) | |
| 5 | 1 (0.5%) | 0 (0.0%) | |
| DM | | | 0.179 |
| No | 169 (76.5%) | 627 (80.6%) | |
| Yes | 52 (23.5%) | 151 (19.4%) | |
| CAD | | | 0.541 |
| No | 203 (91.9%) | 724 (93.1%) | |
| Yes | 18 (8.1%) | 54 (6.9%) | |
| HF | | | 0.456 |
| No | 209 (94.6%) | 746 (95.9%) | |
| Yes | 12 (5.4%) | 32 (4.1%) | |
| CVA | | | 0.076 |
| No | 203 (91.9%) | 739 (95.0%) | |
| Yes | 18 (8.1%) | 39 (5.0%) | |
| CKD | | | 0.227 |
| No | 185 (83.7%) | 676 (86.9%) | |
| Yes | 36 (16.3%) | 102 (13.1%) | |
| LAPAROSCOPIC | | | 1.000 |
| No | 213 (96.4%) | 748 (96.1%) | |
| Yes | 8 (3.6%) | 30 (3.9%) | |
| EA | | | 0.472 |
| No | 188 (85.1%) | 646 (83.0%) | |
| Yes | 33 (14.9%) | 132 (17.0%) | |
| AJCC | | | 0.105 |
| No | 134 (60.6%) | 424 (54.5%) | |
| Yes | 87 (39.4%) | 354 (45.5%) | |
| LIVER_ONLY | | | 0.259 |
| No | 132 (59.7%) | 497 (63.9%) | |
| Yes | 89 (40.3%) | 281 (36.1%) | |
| CT | | | <0.001 |
| No | 78 (35.3%) | 32 (4.1%) | |
| Yes | 143 (64.7%) | 746 (95.9%) | |
| RT | | | <0.001 |
| No | 213 (96.4%) | 676 (86.9%) | |
| Yes | 8 (3.6%) | 102 (13.1%) | |
| NACTRT | | | 0.081 |
| No | 195 (88.2%) | 649 (83.4%) | |
| Yes | 26 (11.8%) | 129 (16.6%) | |
Abbreviations: ASA: American Society of Anesthesiologists physical status; CEA: carcinoembryonic antigen; ANESTIME: anesthesia time; CT: chemotherapy; RT: radiotherapy; DM: diabetes mellitus; CKD: chronic kidney disease; HF: heart failure; CAD: coronary artery disease; CVA: cerebrovascular accident.
Note: CEA and LogCEA had a missing-value proportion of 0.099 (9.9%); the remaining variables had no missing values.
Figure 1. Correlation analysis of the various factors. Abbreviations: ASA: American Society of Anesthesiologists physical status; CEA: carcinoembryonic antigen; CT: chemotherapy; RT: radiotherapy; CKD: chronic kidney disease; HF: heart failure; CAD: coronary artery disease.
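A correlation matrix such as the one in Figure 1 could be produced, in outline, with pandas and seaborn. This is a minimal sketch assuming the baseline variables sit in the DataFrame `df` from the pipeline sketch above; the plotting choices are illustrative, not the authors' figure code.

```python
# Pairwise correlations between the numeric baseline variables, shown as a heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes("number").corr()          # Pearson correlations of numeric columns
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation analysis of various factors")
plt.tight_layout()
plt.show()
```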
Figure 2. Variable importance of the features included in the GradientBoosting machine learning algorithm for prediction of recurrence of colorectal cancer after tumor resection. Abbreviations: ASA: American Society of Anesthesiologists physical status; CEA: carcinoembryonic antigen; CT: chemotherapy; RT: radiotherapy; CKD: chronic kidney disease; HF: heart failure; CAD: coronary artery disease.
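The importances plotted in Figure 2 correspond to the `feature_importances_` attribute of a fitted GradientBoostingClassifier. A short sketch continuing the pipeline above; the variable names are those of the training frame and the printed ranking is only what the paper reports, not a guaranteed output of this code.

```python
# Rank the predictors by their GradientBoosting importance scores.
import pandas as pd

gb = models["GradientBoosting"]                                   # fitted model from the sketch above
importances = pd.Series(gb.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
# The paper reports chemotherapy (CT), age, LogCEA, CEA and anesthesia time as the top five.
```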
Prediction results for the training and testing groups.
| Model | Accuracy (train) | Precision (train) | Recall (train) | F1_score (train) | AUC (train) | Accuracy (test) | Precision (test) | Recall (test) | F1_score (test) | AUC (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic | 0.827 | 0.842 | 0.958 | 0.896 | 0.734 | 0.830 | 0.828 | 0.987 | 0.901 | 0.692 |
| DecisionTree | 0.847 | 0.844 | 0.986 | 0.909 | 0.766 | 0.810 | 0.821 | 0.968 | 0.888 | 0.723 |
| GradientBoosting | 0.851 | 0.841 | 0.997 | 0.912 | 0.881 | 0.820 | 0.819 | 0.987 | 0.895 | 0.734 |
| gbm | 0.825 | 0.841 | 0.955 | 0.895 | 0.752 | 0.825 | 0.831 | 0.974 | 0.974 | 0.761 |
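The metrics reported above (accuracy, precision, recall, F1_score, AUC) can be computed with scikit-learn's metrics module. A hedged sketch continuing the pipeline above; the default 0.5 decision threshold and the rounding to three decimals are assumptions.

```python
# Evaluate each fitted model on the testing group.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

for name, model in models.items():
    y_pred = model.predict(X_test)                 # class labels at the default threshold
    y_prob = model.predict_proba(X_test)[:, 1]     # predicted probability of recurrence
    print(name,
          round(accuracy_score(y_test, y_pred), 3),
          round(precision_score(y_test, y_pred), 3),
          round(recall_score(y_test, y_pred), 3),
          round(f1_score(y_test, y_pred), 3),
          round(roc_auc_score(y_test, y_prob), 3))
```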
Figure 3. Machine learning algorithms for prediction of recurrence of colorectal cancer after tumor resection in the training group (four machine learning algorithms: logistic regression, decision tree, GradientBoosting and lightGBM).
Figure 4. Machine learning algorithms for prediction of recurrence of colorectal cancer after tumor resection in the testing group (four machine learning algorithms: logistic regression, decision tree, GradientBoosting and lightGBM).
Functions, packages, and tuning parameters in the Anaconda software used for each machine learning algorithm.
| Algorithm | Classifier | Package | Tuning Parameters |
|---|---|---|---|
| Logistic regression | LogisticRegression | sklearn 0.19.1 (from sklearn.linear_model import LogisticRegression) | penalty = 'l2', tol = 0.0001, C = 0.7, fit_intercept = True, intercept_scaling = 1, class_weight = None, max_iter = 100, multi_class = 'ovr', verbose = 0, warm_start = False, n_jobs = -1 |
| DecisionTree | DecisionTreeClassifier | sklearn 0.19.1 (from sklearn.tree import DecisionTreeClassifier) | criterion = 'gini', splitter = 'best', max_depth = 7, min_samples_split = 20, min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = None, random_state = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, class_weight = None, presort = False |
| GradientBoosting | GradientBoostingClassifier | sklearn 0.19.1 (from sklearn.ensemble import GradientBoostingClassifier) | learning_rate = 0.01, n_estimators = 100, min_samples_split = 10, min_samples_leaf = 1, subsample = 0.5, max_depth = 5 |
| gbm | lgb.LGBMClassifier | lightgbm 2.2.0 | boosting_type = 'gbdt', reg_alpha = 0.001, reg_lambda = 0.8, learning_rate = 0.1, max_depth = 1, n_estimators = 100, objective = 'binary' |
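For reference, the four classifiers instantiated with the tuning parameters listed above might be written as in the sketch below. It follows the table verbatim where possible; `presort` and `min_impurity_split` are omitted because they were removed from later scikit-learn releases, and `multi_class` is deprecated in recent versions.

```python
# Classifier configurations as listed in the table (sklearn 0.19.1 / lightgbm 2.2.0 era).
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

logistic = LogisticRegression(penalty='l2', tol=0.0001, C=0.7, fit_intercept=True,
                              intercept_scaling=1, class_weight=None, max_iter=100,
                              multi_class='ovr', verbose=0, warm_start=False, n_jobs=-1)
tree = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=7,
                              min_samples_split=20, min_samples_leaf=1,
                              min_weight_fraction_leaf=0.0, max_features=None,
                              random_state=None, max_leaf_nodes=None,
                              min_impurity_decrease=0.0, class_weight=None)
boosting = GradientBoostingClassifier(learning_rate=0.01, n_estimators=100,
                                      min_samples_split=10, min_samples_leaf=1,
                                      subsample=0.5, max_depth=5)
gbm = lgb.LGBMClassifier(boosting_type='gbdt', reg_alpha=0.001, reg_lambda=0.8,
                         learning_rate=0.1, max_depth=1, n_estimators=100,
                         objective='binary')
```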