Mohamed Hosny Osman1, Reham Hosny Mohamed1, Hossam Mohamed Sarhan2, Eun Jung Park3, Seung Hyuk Baik3, Kang Young Lee4, Jeonghyun Kang3.
Abstract
PURPOSE: Machine learning (ML) is a strong candidate for making accurate predictions because it can apply powerful computational algorithms to large amounts of data. We developed an ML-based model to predict survival of patients with colorectal cancer (CRC) using data from two independent datasets.
Keywords: Area under the curve; Colorectal neoplasms; LightGBM; Machine learning; Mortality; SEER
Year: 2021 PMID: 34126702 PMCID: PMC9016295 DOI: 10.4143/crt.2021.206
Source DB: PubMed Journal: Cancer Res Treat ISSN: 1598-2998 Impact factor: 4.679
Characteristics of included patients
| Variable | SEER dataset (n=364,316) | Korean dataset (n=1,572) | p-value |
|---|---|---|---|
| | 2004–2015 | 2003–2012 | |
| | 2016 | 2019 | |
| | 67.0±13.7 | 61.6±11.9 | < 0.001 |
| Sex | | | |
| Male | 188,549 (51.8) | 965 (61.4) | < 0.001 |
| Female | 175,767 (48.2) | 607 (38.6) | |
| Tumor location | | | |
| Colon | 264,288 (72.5) | 939 (59.7) | < 0.001 |
| Rectum | 100,028 (27.5) | 618 (39.3) | |
| Missing | | 15 (1.0) | |
| Histology | | | |
| Adenocarcinoma | 326,628 (89.7) | 1,383 (88.0) | 0.308 |
| Other histology | 37,688 (10.3) | 146 (9.3) | |
| Missing | | 43 (2.7) | |
| Stage | | | |
| Stage I | 90,647 (24.9) | 318 (20.2) | < 0.001 |
| Stage II | 96,337 (26.4) | 443 (28.2) | |
| Stage III | 100,478 (27.6) | 535 (34.0) | |
| Stage IV | 44,161 (12.1) | 180 (11.5) | |
| Missing | 32,693 (9.0) | 96 (6.1) | |
| Histologic grade | | | |
| Grade I | 34,608 (9.5) | 237 (15.1) | < 0.001 |
| Grade II | 230,595 (63.3) | 1,101 (70.0) | |
| Grade III | 56,079 (15.4) | 45 (2.9) | |
| Grade IV | 8,150 (2.2) | 73 (4.6) | |
| Missing | 34,884 (9.6) | 116 (7.4) | |
| | 44.9±34.4 | 43.8±23.4 | < 0.001 |
| | 15.0±11.0 | 21.0±17.9 | < 0.001 |
| < 12 | 127,245 (35.4) | 397 (25.4) | < 0.001 |
| ≥ 12 | 226,543 (63.1) | 1,167 (74.5) | |
| Unknown | 5,218 (1.5) | 2 (0.1) | |
| | 1.6±3.6 | 1.8±4.0 | < 0.001 |
| CEA | | | |
| Low | 109,429 (30.0) | 987 (62.8) | < 0.001 |
| High | 80,015 (22.0) | 507 (32.3) | |
| Missing | 174,872 (48.0) | 78 (5.0) | |
| | | | |
| Yes | 43,087 (11.8) | 280 (17.8) | < 0.001 |
| No/Unknown | 321,229 (88.2) | 1,292 (82.2) | |
| | | | |
| Yes | 124,894 (34.3) | 983 (62.5) | < 0.001 |
| No/Unknown | 239,422 (65.7) | 589 (37.5) | |
Values are presented as mean±SD or number (%). CEA, carcinoembryonic antigen; LN, lymph node; SD, standard deviation; SEER, Surveillance, Epidemiology, and End Results.
Histologic grade: G1, well differentiated; G2, moderately differentiated; G3, poorly differentiated; G4, undifferentiated.
High: CEA ≥ 5, low: CEA < 5.
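The categorical p-values in the table above can be sanity-checked from the counts alone. The sketch below assumes a Pearson chi-square test without continuity correction; the source does not name the test it used, so this is an illustration rather than a reproduction of the authors' analysis. The function name `chi_square_2x2` is hypothetical, and the counts are the sex rows of the table.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Sex distribution: rows are Male/Female, columns are SEER/Korean.
stat, p_value = chi_square_2x2(188_549, 965, 175_767, 607)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

Under this assumed test the p-value falls well below 0.001, which agrees with the value reported for sex.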
Fig. 1. Comparison of 5-year overall survival between Surveillance, Epidemiology, and End Results (SEER) dataset and Korean dataset.
Comparison of AUROC and accuracy of the light gradient boosting algorithm with Bayesian optimization on the SEER dataset and the Korean dataset
| Survival periods | Internal validation using 18-fold CV on SEER dataset: Accuracy (average±SD) | Internal validation using 18-fold CV on SEER dataset: AUC (average±SD) | External validation using Korean dataset: Accuracy | External validation using Korean dataset: AUC |
|---|---|---|---|---|
| 1 | 76.33±2.89 | 83.26±1.46 | 80.08 | 82.55 |
| 2 | 75.63±1.89 | 82.45±1.12 | 78.16 | 83.62 |
| 3 | 75.37±1.79 | 81.98±1.17 | 77.69 | 81.02 |
| 4 | 74.87±1.42 | 81.83±1.22 | 76.41 | 80.52 |
| 5 | 74.45±1.56 | 81.71±1.36 | 75.20 | 80.46 |
| 6 | 74.08±1.28 | 81.59±1.17 | 74.57 | 78.75 |
| 8 | 73.99±1.59 | 81.91±1.42 | 73.67 | 78.76 |
| 10 | 74.26±1.41 | 82.82±1.20 | 74.21 | 77.72 |
AUC, area under the curve; AUROC, area under the receiver operating characteristic curve; CV, cross validation; SD, standard deviation; SEER, Surveillance, Epidemiology, and End Results.
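For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below is a minimal pure-Python illustration of that definition and of an average±SD summary across cross-validation folds. The function names `auroc` and `summarize` are hypothetical, the toy labels, scores, and fold values are invented for illustration, and none of this is taken from the authors' code.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(fold_values):
    """Return (mean, sample standard deviation) over per-fold values."""
    return mean(fold_values), stdev(fold_values)


# Toy data (hypothetical): 1 = event, 0 = no event.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))

# Hypothetical per-fold AUROC values, only to show the summary format.
fold_mean, fold_sd = summarize([0.80, 0.82, 0.84])
print(f"{fold_mean:.2f}±{fold_sd:.2f}")
```

The pairwise loop is quadratic in the number of cases, so it suits small illustrations only; a rank-based implementation from a statistics library would be the practical choice on a dataset of SEER size.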
Fig. 2. Comparison of receiver operating characteristic curve using 5-year survival in the training, internal validation and external validation. AUC, area under the curve.
Fig. 3. Feature importance selection in respective survival time periods. CEA, carcinoembryonic antigen; LN, lymph node.
Comparison of the area under the receiver operating characteristic curve between the light gradient boosting algorithm with Bayesian optimization and AJCC staging, using the Korean validation dataset
| Survival periods | AJCC stage | LGB algorithm | p-value |
|---|---|---|---|
| 1 | 75.13 | 82.55 | 0.002 |
| 2 | 76.66 | 83.62 | < 0.001 |
| 3 | 75.14 | 81.02 | < 0.001 |
| 4 | 75.15 | 80.52 | 0.001 |
| 5 | 73.67 | 80.46 | < 0.001 |
| 6 | 72.46 | 78.75 | 0.001 |
| 8 | 71.54 | 78.76 | 0.002 |
| 10 | 70.28 | 77.72 | 0.017 |
AJCC, American Joint Committee on Cancer; LGB algorithm, light gradient boosting algorithm.
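The size of the improvement over AJCC staging is easier to see as a per-period difference. The short sketch below copies the AUROC values from the table above and subtracts them; the variable names are hypothetical, and the calculation says nothing about how the reported p-values were obtained, which the source does not describe.

```python
from statistics import mean

# Survival period -> (AJCC stage AUROC, LGB algorithm AUROC), as reported above.
reported_auroc = {
    1: (75.13, 82.55),
    2: (76.66, 83.62),
    3: (75.14, 81.02),
    4: (75.15, 80.52),
    5: (73.67, 80.46),
    6: (72.46, 78.75),
    8: (71.54, 78.76),
    10: (70.28, 77.72),
}

# Difference in AUROC (same scale as the table) for each survival period.
gains = {
    period: round(lgb - ajcc, 2)
    for period, (ajcc, lgb) in reported_auroc.items()
}

for period, gain in gains.items():
    print(f"period {period:>2}: +{gain:.2f}")
print(f"mean gain: {mean(gains.values()):.2f}")
```

In these reported figures the gain ranges from 5.37 (period 4) to 7.44 (period 10).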