Xueyun Tan1, Yuan Li2, Sufei Wang1, Hui Xia1, Rui Meng3, Juanjuan Xu1, Yanran Duan4, Yan Li5, Guanghai Yang6, Yanling Ma1, Yang Jin7.
Abstract
BACKGROUND: Timely identification of epidermal growth factor receptor (EGFR) mutation and anaplastic lymphoma kinase (ALK) rearrangement status in patients with non-small cell lung cancer (NSCLC) is essential for the administration of tyrosine kinase inhibitors (TKIs). We aimed to use artificial intelligence (AI) models to predict EGFR mutation and ALK rearrangement status from common demographic features, pathology, and serum tumor markers (STMs).
Keywords: Anaplastic lymphoma kinase; Artificial intelligence; Deep learning; Epidermal growth factor receptor; Machine learning; Non-small cell lung cancer; Serum tumor markers
Year: 2022 PMID: 35624472 PMCID: PMC9145462 DOI: 10.1186/s12931-022-02053-2
Source DB: PubMed Journal: Respir Res ISSN: 1465-9921
Fig. 1 Study design and patient selection
Association between clinical characteristics and EGFR and ALK status in the testing cohort
| Characteristics | EGFR wild-type | EGFR mutant-type | P value (EGFR) | ALK wild-type | ALK mutant-type | P value (ALK) |
|---|---|---|---|---|---|---|
| Age, years | | | 0.390 | | | < 0.001 |
| Median | 61 | 60 | | 61 | 54 | |
| Range | 23–88 | 31–86 | | 23–88 | 27–81 | |
| Gender | | | < 0.001 | | | 0.337 |
| Female | 495 (72.37) | 321 (41.15) | | 783 (56.01) | 33 (50.00) | |
| Male | 189 (27.63) | 459 (58.85) | | 615 (43.99) | 33 (50.00) | |
| Smoking history | | | < 0.001 | | | 0.121 |
| Never-smoker | 345 (50.51) | 635 (81.41) | | 930 (66.57) | 50 (75.76) | |
| Ever-smoker | 338 (49.49) | 145 (18.59) | | 467 (33.43) | 16 (24.24) | |
| Pathology | | | < 0.001 | | | 0.036 |
| Adenocarcinoma | 585 (85.53) | 753 (96.54) | | 1273 (91.06) | 65 (98.48) | |
| Non-adenocarcinoma | 99 (14.47) | 27 (3.46) | | 125 (8.94) | 1 (1.52) | |
| AFP | | | 0.104 | | | 1.000 |
| Negative | 538 (98.18) | 642 (99.23) | | 1126 (98.69) | 54 (100.00) | |
| Positive | 10 (1.82) | 5 (0.77) | | 15 (1.31) | 0 (0.00) | |
| CEA | | | 0.014 | | | 0.060 |
| Negative | 341 (50.52) | 342 (44.07) | | 645 (46.54) | 38 (58.46) | |
| Positive | 334 (49.48) | 434 (55.93) | | 741 (53.46) | 27 (41.54) | |
| CA125 | | | < 0.001 | | | 0.164 |
| Negative | 319 (48.41) | 458 (59.87) | | 748 (54.96) | 29 (46.03) | |
| Positive | 340 (51.59) | 307 (40.13) | | 613 (45.04) | 34 (53.97) | |
| CA19-9 | | | 0.031 | | | 0.530 |
| Negative | 493 (75.61) | 610 (80.37) | | 1055 (78.32) | 48 (75.00) | |
| Positive | 159 (24.39) | 149 (19.63) | | 292 (21.68) | 16 (25.00) | |
| CA15-3 | | | 0.195 | | | < 0.001 |
| Negative | 451 (75.04) | 542 (78.10) | | 962 (77.64) | 31 (55.36) | |
| Positive | 150 (24.96) | 152 (21.90) | | 277 (22.36) | 25 (44.64) | |
| FERR | | | < 0.001 | | | 0.135 |
| Negative | 210 (55.56) | 289 (68.16) | | 471 (61.65) | 28 (73.68) | |
| Positive | 168 (44.44) | 135 (31.84) | | 293 (38.35) | 10 (26.32) | |
| CA72-4 | | | 0.010 | | | 0.209 |
| Negative | 289 (72.43) | 369 (79.87) | | 630 (76.83) | 28 (68.29) | |
| Positive | 110 (27.57) | 93 (20.13) | | 190 (23.17) | 13 (31.71) | |
| PSA | | | 0.610 | | | 0.711 |
| Negative | 262 (91.93) | 165 (93.22) | | 403 (92.22) | 24 (96.00) | |
| Positive | 23 (8.07) | 12 (6.78) | | 34 (7.78) | 1 (4.00) | |
| FPSA | | | 0.436 | | | 1.000 |
| Negative | 272 (95.44) | 166 (93.79) | | 414 (94.74) | 24 (96.00) | |
| Positive | 13 (4.56) | 11 (6.21) | | 23 (5.26) | 1 (4.00) | |
| SCC | | | < 0.001 | | | 0.178 |
| Negative | 472 (76.62) | 636 (90.34) | | 1063 (84.23) | 45 (77.59) | |
| Positive | 144 (23.38) | 68 (9.66) | | 199 (15.77) | 13 (22.41) | |
| CYFRA 21-1 | | | < 0.001 | | | 0.137 |
| Negative | 216 (35.64) | 332 (47.91) | | 519 (41.75) | 29 (51.79) | |
| Positive | 390 (64.36) | 361 (52.09) | | 724 (58.25) | 27 (48.21) | |
| NSE | | | 0.034 | | | 0.472 |
| Negative | 212 (44.17) | 268 (50.85) | | 460 (47.92) | 20 (42.55) | |
| Positive | 268 (55.83) | 259 (49.15) | | 500 (52.08) | 27 (57.45) | |
| TTF-1 | | | < 0.001 | | | 0.145 |
| Negative | 129 (20.48) | 14 (2.04) | | 140 (11.13) | 3 (5.08) | |
| Positive | 501 (79.52) | 673 (97.96) | | 1118 (88.87) | 56 (94.92) | |
| Napsin A | | | < 0.001 | | | 0.401 |
| Negative | 121 (28.67) | 19 (5.23) | | 135 (18.10) | 5 (12.82) | |
| Positive | 301 (71.33) | 344 (94.77) | | 611 (81.90) | 34 (87.18) | |
| CK-7 | | | 0.006 | | | 1.000 |
| Negative | 19 (5.79) | 3 (1.27) | | 21 (3.94) | 1 (3.13) | |
| Positive | 309 (94.21) | 234 (98.73) | | 512 (96.06) | 31 (96.88) | |
| Ki67 | | | 0.083 | | | 0.528 |
| Negative | 37 (28.03) | 28 (40.00) | | 60 (31.58) | 5 (41.67) | |
| Positive | 95 (71.97) | 42 (60.00) | | 130 (68.42) | 7 (58.33) | |
Values presented are n (%) unless otherwise noted
EGFR epidermal growth factor receptor; ALK anaplastic lymphoma kinase; AFP alpha fetoprotein; CEA carcinoembryonic antigen; CA carbohydrate antigen; FERR ferritin; PSA prostate specific antigen; FPSA free prostate specific antigen; SCC squamous cell carcinoma antigen; CYFRA 21-1 soluble fragment of cytokeratin 19; NSE neuron-specific enolase; TTF-1 thyroid transcription factor-1; CK-7 cytokeratin-7
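The record does not say which statistical test produced the P values in the table. As a minimal sketch, assuming a Pearson chi-square test without continuity correction on each 2×2 block of counts, a P value can be recomputed with the Python standard library alone. The function name `chi_square_2x2` is hypothetical, and the test choice is an assumption rather than the authors' stated method; the counts are copied from the table.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Assumption: the source does not name its test; this is an illustration only.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Gender by ALK status: rows are Female, Male; columns are wild-type, mutant-type.
gender_alk_stat, gender_alk_p = chi_square_2x2(783, 33, 615, 33)

# Smoking history by EGFR status: rows are Never-smoker, Ever-smoker.
smoking_egfr_stat, smoking_egfr_p = chi_square_2x2(345, 635, 338, 145)

print(f"Gender vs ALK:   chi2={gender_alk_stat:.3f}, p={gender_alk_p:.3f}")
print(f"Smoking vs EGFR: chi2={smoking_egfr_stat:.1f}, p={smoking_egfr_p:.2e}")
```

Under this assumption the gender-by-ALK comparison comes out close to the tabulated 0.337, and the smoking-by-EGFR comparison falls well below 0.001. Rows with very small cell counts (for example AFP by ALK) would more plausibly call for an exact test, which this sketch does not cover.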
Fig. 2 Discrimination of EGFR mutant status by the computational algorithms in the training cohort and the testing cohort. A–B Deep learning model; C–D DRF model; E–F GBM model; G–H GLM model; I–J XGBoost model; K–L XRF model; M–N Stacked Ensemble model
Performance measures of the stacked ensemble model for two-class prediction of EGFR mutation status
| Cohort | Sensitivity | Specificity | Accuracy | PPV | NPV |
|---|---|---|---|---|---|
| The training cohort | 0.835 | 0.677 | 0.578 | 0.886 | 0.732 |
| The testing cohort | 0.856 | 0.680 | 0.638 | 0.877 | 0.750 |
EGFR epidermal growth factor receptor; PPV positive predictive value; NPV negative predictive value
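For readers unfamiliar with the five measures in this table, the sketch below shows how each one is conventionally derived from a binary confusion matrix. The counts are hypothetical and are not taken from the study; treating mutant-type as the positive class is also an assumption, and the function name `performance_measures` is illustrative.

```python
def performance_measures(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among actual positives
        "specificity": tn / (tn + fp),  # true negatives among actual negatives
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts for illustration only (positive class = mutant-type, assumed).
measures = performance_measures(tp=80, fp=30, tn=70, fn=20)

for name, value in measures.items():
    print(f"{name}: {value:.3f}")

# Accuracy is a prevalence-weighted average of sensitivity and specificity,
# so it always lies between the two.
low, high = sorted((measures["sensitivity"], measures["specificity"]))
assert low <= measures["accuracy"] <= high
```

PPV and NPV depend on the proportion of positives in the cohort, so they are not directly comparable across cohorts with different mutation rates.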
Fig. 3 Discrimination of ALK rearrangement by the Stacked Ensemble model in the training cohort (A) and the testing cohort (B)
Fig. 4 The importance of differential clinical parameters in different computational algorithm models. The AutoML randomly generated twenty algorithms based on a deep learning model and five machine learning models, and the twenty models were interpreted using feature importance plots through the matplotlib package built into the software. A Variable importance for EGFR prediction. B Variable importance for ALK prediction
Fig. 5 Stacked Ensemble model to distinguish common and uncommon EGFR mutations. A–C Discrimination of common and uncommon EGFR mutations by the Stacked Ensemble model in the training cohort, the testing cohort and total patients, respectively. D The importance of differential clinical parameters in the Stacked Ensemble model. The AutoML randomly generated twenty algorithms based on a deep learning model and five machine learning models, and the twenty models were interpreted using feature importance plots through the matplotlib package built into the software
Fig. 6 Stacked Ensemble model to distinguish EGFR mutant status and ALK rearrangement concurrently and corresponding variable importance. A–C The overall accuracy in the training cohort, the testing cohort and total patients. D The importance of differential clinical parameters in the Stacked Ensemble model to distinguish EGFR mutant status and ALK rearrangement concurrently. The AutoML randomly generated twenty algorithms based on a deep learning model and five machine learning models, and the twenty models were interpreted using feature importance plots through the matplotlib package built into the software