| Literature DB >> 35833032 |
Youdan Feng1, Fan Song1, Peng Zhang1, Guangda Fan1, Tianyi Zhang1, Xiangyu Zhao1, Chenbin Ma1, Yangyang Sun1, Xiao Song2, Huangsheng Pu3, Fei Liu4, Guanglei Zhang1.
Abstract
Objectives: We aimed to identify whether ensemble learning can improve the performance of models predicting epidermal growth factor receptor (EGFR) mutation status.
Keywords: EGFR; computed tomography; ensemble learning; non–small cell lung cancer; radiogenomics
Year: 2022 PMID: 35833032 PMCID: PMC9271946 DOI: 10.3389/fphar.2022.897597
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.988
FIGURE 1 Framework of the proposed radiomics model. It includes volume of interest (VOI) segmentation, radiomics feature extraction, and model construction. In the model construction process, the training set is expanded with SMOTE, and feature selection is applied to the training and test sets. The most appropriate hyperparameters are selected by the average accuracy on the validation sets during training, and the best model is passed on to the testing process.
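The framework above expands the minority (mutant) class of the training set with SMOTE before model construction. A minimal NumPy sketch of the SMOTE idea — interpolating between each minority sample and one of its nearest minority-class neighbours — is shown below; the 10-dimensional random features are a stand-in, and the study itself presumably used an off-the-shelf implementation such as `imblearn.over_sampling.SMOTE`:

```python
import numpy as np

def smote(X_minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                      # pick a minority sample
        j = rng.choice(neighbours[i])            # pick one of its neighbours
        gap = rng.random()                       # interpolation factor in (0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)

# Mimic the training-set imbalance (112 wild type vs. 39 mutant):
# generate 73 synthetic mutant samples to balance the classes.
rng = np.random.default_rng(0)
X_mut = rng.normal(size=(39, 10))    # hypothetical 10-dim mutant features
X_new = smote(X_mut, n_new=112 - 39, k=5, rng=1)
print(X_new.shape)  # → (73, 10)
```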
Number of samples of each phenotype in the dataset before SMOTE.
| | Wild type | Mutant type | Total |
|---|---|---|---|
| Training set | 112 | 39 | 151 |
| Test set | 13 | 4 | 17 |
| Total | 125 | 43 | 168 |
FIGURE 2 Description of the 1,409 radiomics features. (A) Process of feature extraction. (B,C) The 14 filters and five matrices used in (A). The numbers of extracted features are indicated in parentheses.
FIGURE 3 Feature selection using the LASSO method: relationship between MSE and λ.
FIGURE 4 Variation of the feature count through feature selection. (A) Number of features before and after selection. (B) Feature ratio before selection. (C) Feature ratio after selection.
FIGURE 5 Weights of the features selected by LASSO in the linear regression classifier.
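The λ-versus-MSE selection in Figure 3 can be sketched with scikit-learn's `LassoCV`, which chooses the regularization strength by cross-validated mean squared error and zeroes out the coefficients of uninformative features. The data here are synthetic (200 random features in place of the paper's 1,409 radiomics features), so the selected count is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 151 training samples (as in the paper), 200 features
X, y = make_classification(n_samples=151, n_features=200,
                           n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO is sensitive to feature scale

# Pick λ (alpha) by 5-fold cross-validated MSE, as in Figure 3
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Features with non-zero coefficients survive the selection
selected = np.flatnonzero(lasso.coef_)
print(len(selected), "of", X.shape[1], "features kept")
```

The non-zero coefficients of the fitted model correspond to the feature weights plotted in Figure 5.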
ACC and AUC of individual models.
| Classifier | Validation ACC (mean ± std) | Validation AUC (mean ± std) | Test ACC | Test AUC |
|---|---|---|---|---|
| LR | 0.7944 ± 0.0542 | 0.8607 ± 0.0547 | 0.7647 | 0.7885 |
| SVM | 0.7942 ± 0.0503 | 0.8634 ± 0.0517 | 0.7647 | 0.7885 |
| RF | 0.7744 ± 0.0690 | 0.7815 ± 0.0741 | 0.7647 | 0.8269 |
| XGBoost | 0.7744 ± 0.0780 | 0.7911 ± 0.1129 | 0.7647 | 0.8269 |
FIGURE 6 Confusion matrices of the SVM. (A–E) Confusion matrices on the validation sets during the 5-fold cross-validation. (F) Confusion matrix on the test set.
ACC and AUC of ensemble models.
| Classifier | Voting method | Validation ACC (mean ± std) | Validation AUC (mean ± std) | Test ACC | Test AUC |
|---|---|---|---|---|---|
| LR | — | 0.7944 ± 0.0542 | 0.8607 ± 0.0547 | 0.7647 | 0.7885 |
| SVM | — | 0.7942 ± 0.0503 | 0.8634 ± 0.0517 | 0.7647 | 0.7885 |
| RF | — | 0.7744 ± 0.0690 | 0.7815 ± 0.0741 | 0.7647 | 0.8269 |
| XGBoost | — | 0.7744 ± 0.0780 | 0.7911 ± 0.1129 | 0.7647 | 0.8269 |
| RF + XGBoost + LR | soft | 0.7944 ± 0.0653 | 0.8465 ± 0.0659 | **0.8824** | **0.8654** |
| XGBoost + SVM + LR | soft | | | 0.8235 | 0.8462 |
| RF + XGBoost + SVM | soft | 0.8011 ± 0.0480 | 0.8453 ± 0.0684 | | |
| RF + XGBoost + LR | hard | 0.7811 ± 0.0695 | — | 0.8235 | — |
| All | hard | 0.8211 ± 0.0456 | — | 0.7647 | — |
| All | soft | 0.8144 ± 0.0275 | 0.8587 ± 0.0550 | 0.7647 | 0.8654 |
The best performance in the models is highlighted in bold.
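The soft- and hard-voting ensembles compared above can be sketched with scikit-learn's `VotingClassifier`: soft voting averages the members' predicted probabilities, while hard voting takes a majority vote on predicted labels. This is a hedged stand-in, not the paper's exact pipeline — the data are synthetic (with roughly the paper's 151/17 split and class imbalance), and `GradientBoostingClassifier` substitutes for XGBoost so the example needs only scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced dataset: 168 samples, ~74% majority class
X, y = make_classification(n_samples=168, n_features=20,
                           weights=[0.74], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=17,
                                          stratify=y, random_state=0)

members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),        # probability=True enables soft voting
    ("gb", GradientBoostingClassifier()),  # stand-in for XGBoost
]
# Soft voting: average predicted class probabilities across members
soft = VotingClassifier(members, voting="soft").fit(X_tr, y_tr)
# Hard voting: majority vote on the members' predicted labels
hard = VotingClassifier(members, voting="hard").fit(X_tr, y_tr)
print("soft ACC:", round(soft.score(X_te, y_te), 4))
print("hard ACC:", round(hard.score(X_te, y_te), 4))
```

Soft voting can outperform hard voting when members disagree but one is confident, since the averaged probabilities retain that confidence information — consistent with the table, where the best test accuracy comes from a soft-voting ensemble.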
Further performance of the combination of XGBoost, SVM, and LR on the test set.
| Classifier | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| SVM | 0.76 | 0.67 | 0.67 | 0.67 |
| LR | 0.76 | 0.70 | 0.76 | 0.72 |
| XGBoost | 0.76 | 0.75 | 0.85 | 0.74 |
| Hard-voting | 0.76 | 0.70 | 0.76 | 0.72 |
| Soft-voting | 0.82 | 0.76 | 0.80 | 0.77 |
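The precision, recall, and F1-score in the table above can be computed with `sklearn.metrics`. The predictions below are hypothetical, chosen only to illustrate the calculation on a 13-wild-type / 4-mutant test set; notably, this particular assignment reproduces the soft-voting row, which suggests (but does not confirm) that the table's metrics are macro-averaged over the two classes:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0] * 13 + [1] * 4                  # 13 wild type, 4 mutant (test set)
y_pred = [0] * 11 + [1] * 2 + [1] * 3 + [0]  # hypothetical: 2 FP, 1 FN

acc = accuracy_score(y_true, y_pred)                       # 14/17 correct
prec = precision_score(y_true, y_pred, average="macro")    # mean over classes
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
# → 0.82 0.76 0.8 0.77
```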
FIGURE 7 Confusion matrices of the ensemble model and individual models on the test set. (A) XGBoost. (B) SVM. (C) LR. (D) Ensemble model with soft-voting of XGBoost, SVM, and LR.
FIGURE 8 ROC curves of the ensemble model and individual models on the test set. (A) XGBoost. (B) SVM. (C) LR. (D) Ensemble model with soft-voting of XGBoost, SVM, and LR.