Ruiyuan Yang, Xingyu Xiong, Haoyu Wang, Weimin Li.
Abstract
Objectives: The aim of this study is to determine whether clinical features, including blood markers, can be used to build an explainable machine learning model that predicts epidermal growth factor receptor (EGFR) mutation in lung cancer.
Keywords: EGFR mutation; SHAP value; lung cancer; machine learning; prediction
Year: 2022 PMID: 35814445 PMCID: PMC9259982 DOI: 10.3389/fonc.2022.924144
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738
Patient characteristics and blood markers.
| Variables | EGFR-Wild Type | EGFR-Mutation | P-Value |
|---|---|---|---|
| Patient population, n | 3,886 | 3,527 | |
| Gender, n (%) | | | <0.001 |
| Female | 1,446 (37.210) | 1,970 (55.855) | |
| Male | 2,440 (62.790) | 1,557 (44.145) | |
| Smoking Consumption, n (%) | | | <0.001 |
| No | 1,814 (46.873) | 2,490 (71.001) | |
| Yes | 2,056 (53.127) | 1,017 (28.999) | |
| Age (year), mean (SD) | 56.965 (10.749) | 56.893 (10.197) | 0.768 |
| Hemoglobin (g/L), mean (SD) | 120.959 (17.833) | 122.760 (17.470) | <0.001 |
| Platelet (10^9/L), mean (SD) | 215.069 (83.880) | 211.581 (81.704) | 0.079 |
| Neutrophils%, mean (SD) | 64.071 (12.673) | 63.993 (11.888) | 0.794 |
| Lymphocyte%, mean (SD) | 24.315 (10.758) | 24.825 (10.108) | 0.043 |
| Cholesterol (mmol/L), mean (SD) | 4.773 (1.006) | 4.726 (1.002) | 0.053 |
| Albumin Globulin Ratio, mean (SD) | 1.527 (0.329) | 1.587 (0.320) | <0.001 |
| Glutamyl Transpeptidase (IU/L), mean (SD) | 36.925 (22.541) | 33.808 (22.600) | <0.001 |
| Aspartate Aminotransferase (IU/L), mean (SD) | 24.854 (8.265) | 24.952 (8.444) | 0.623 |
| Carcinoembryonic Antigen (ng/ml), median [Q1, Q3] | 6.120 [2.440, 25.348] | 6.640 [2.300, 30.797] | 0.881 |
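The table reports P-values for the categorical variables without naming the test behind them. As a rough illustration only, the sketch below assumes a Pearson chi-square test without continuity correction (an assumption, not something stated in the source) and applies it to the gender counts. The function name `chi_square_2x2` is made up for this example.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Gender counts from the table: rows are female/male,
# columns are EGFR wild type / EGFR mutation.
stat, p_value = chi_square_2x2(1446, 1970, 2440, 1557)
print(f"chi2 = {stat:.1f}, p < 0.001: {p_value < 0.001}")
```

Under this assumed test, the gender split gives a P-value far below 0.001, which agrees with the table entry.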
The quantitative performance and the ROC curves of included models.
| Model | AUC | Youden_Index | Sensitivity | Specificity |
|---|---|---|---|---|
| RF | 0.825 (0.823, 0.827) | 0.510 (0.506, 0.514) | 0.738 (0.736, 0.740) | 0.752 (0.750, 0.754) |
| XGBoost | 0.826 (0.824, 0.828) | 0.513 (0.509, 0.517) | 0.749 (0.747, 0.751) | 0.751 (0.749, 0.753) |
| LightGBM | 0.819 (0.817, 0.821) | 0.517 (0.512, 0.522) | 0.749 (0.746, 0.752) | 0.751 (0.748, 0.754) |
| Decision Tree | 0.648 (0.647, 0.649) | 0.277 (0.275, 0.279) | 0.306 (0.305, 0.307) | 0.804 (0.803, 0.805) |
| LR | 0.695 (0.693, 0.697) | 0.299 (0.295, 0.303) | 0.633 (0.631, 0.635) | 0.636 (0.634, 0.638) |
| SVM | 0.795 (0.793, 0.797) | 0.472 (0.468, 0.476) | 0.719 (0.716, 0.722) | 0.727 (0.725, 0.729) |
| MLP | 0.774 (0.772, 0.776) | 0.442 (0.437, 0.447) | 0.711 (0.708, 0.714) | 0.714 (0.711, 0.717) |
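For readers unfamiliar with the columns above, the following minimal sketch shows one common way to derive an AUC and a Youden index from predicted probabilities: trace the ROC curve over every score threshold, integrate it with the trapezoidal rule, and take the largest value of sensitivity + specificity - 1. The labels and scores are invented for illustration, the helper names are hypothetical, and the source does not state how its operating threshold was chosen.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def youden(points):
    """Largest value of sensitivity + specificity - 1 (that is, tpr - fpr)."""
    return max(tpr - fpr for fpr, tpr in points)

# Made-up labels (1 = EGFR mutation) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
print(f"AUC = {auc(points):.4f}, Youden index = {youden(points):.2f}")
```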
Figure 1. Comparison of AUCs among seven machine learning models with ROC curves; RF achieved the greatest AUC for single-model prediction.
Figure 2. Comparison of AUCs among seven machine learning models with a bar graph.
Figure 3. SHAP summary plot of the 12 features of the RF model. The higher the SHAP value of a feature, the higher the probability of EGFR mutation. Red represents a closer association with the mutation, and blue represents the opposite.
Figure 4. SHAP dependence plots explaining how a single factor affects the output of the RF model. A SHAP value above zero for a given feature represents an increased likelihood of EGFR mutation. Demographic factors: (A) smoking consumption, (B) gender, and (C) age.
Figure 5. SHAP dependence plots for blood markers: (A) cholesterol, (B) albumin globulin ratio, (C) carcinoembryonic antigen, and (D) platelet.
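SHAP values are Shapley values computed per feature for a single prediction, so a positive value pushes the prediction above a baseline and a negative value pushes it below. The toy sketch below computes exact Shapley values by enumerating feature subsets. The scoring function `toy_score`, its coefficients, and the baseline are entirely hypothetical and are not the RF model from the study; they only mimic the direction of the smoking and gender differences in the patient table.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; absent features take the baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical scoring function (NOT the study's model).
# Features: [smoker (0/1), female (0/1), albumin globulin ratio].
def toy_score(f):
    smoker, female, agr = f
    return 0.5 - 0.2 * smoker + 0.1 * female + 0.1 * (agr - 1.5) - 0.05 * smoker * female

baseline = [0, 0, 1.5]   # assumed reference patient
sample = [1, 1, 1.6]     # assumed patient to explain
phi = shapley_values(toy_score, sample, baseline)
print([round(v, 3) for v in phi])
```

The values sum to the difference between the sample's score and the baseline score, which is the property that lets a summary or dependence plot read each value as that feature's share of the prediction.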