Jin-Ah Sim, Young Ae Kim, Ju Han Kim, Jong Mog Lee, Moon Soo Kim, Young Mog Shim, Jae Ill Zo, Young Ho Yun.
Abstract
The primary goal of this study was to evaluate the role of health-related quality of life (HRQOL) in a 5-year lung cancer survival prediction model built with machine learning techniques (MLTs). The predictive performances of the models were compared using data from 809 survivors who had undergone lung cancer surgery. Each modeling technique was applied to two feature sets: feature set 1 included clinical and sociodemographic variables, and feature set 2 added HRQOL factors to the variables from feature set 1. For each feature set, prediction models were trained with the decision tree (DT), logistic regression (LR), bagging, random forest (RF), and adaptive boosting (AdaBoost) methods, and the best algorithm for modeling was then determined. The models' performances were compared using fivefold cross-validation. For feature set 1, there were no significant differences in model accuracies (ranging from 0.647 to 0.713). Among the models in feature set 2, the AdaBoost and RF models outperformed the other prognostic models [area under the curve (AUC) = 0.850, 0.898, 0.981, 0.966, and 0.949 for the DT, LR, bagging, RF, and AdaBoost models, respectively] in the test set. Overall, 5-year disease-free lung cancer survival prediction models with MLTs that included HRQOL as well as clinical variables showed improved predictive performance.
Year: 2020 PMID: 32612283 PMCID: PMC7329866 DOI: 10.1038/s41598-020-67604-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
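The comparison described in the abstract (five classifiers, two feature sets, fivefold cross-validation) can be sketched roughly as follows. This is a minimal illustration with scikit-learn on synthetic data, not the authors' code: the dataset, the column split standing in for the two feature sets, and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cohort; the study data are not reproduced here.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)

# Hypothetical split: the first 8 columns play the role of feature set 1,
# all 20 columns play the role of feature set 2 (feature set 1 plus HRQOL items).
feature_sets = {"feature set 1": X[:, :8], "feature set 2": X}

# Hyperparameters are illustrative assumptions, not values from the source.
models = {
    "DT": DecisionTreeClassifier(max_depth=4, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

results = {}
for set_name, features in feature_sets.items():
    for model_name, model in models.items():
        # Fivefold cross-validated accuracy for this model on this feature set.
        scores = cross_val_score(model, features, y, cv=5, scoring="accuracy")
        results[(set_name, model_name)] = scores.mean()
        print(f"{set_name:14s} {model_name:9s} mean accuracy = {scores.mean():.3f}")
```

Running every model on both feature sets with the same folds keeps the comparison between feature sets like for like.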
Comparison of the baseline characteristics between the living and deceased groups with up-sampled data.
| Variable | Living (N = 713), n | Living, % | Deceased (N = 713), n | Deceased, % | p-value |
|---|---|---|---|---|---|
| | 62.51 ± 8.55 | | 66.21 ± 8.31 | | < 0.001 |
| < 65 | 393 | 63.3 | 228 | 36.7 | < 0.001 |
| ≥ 65 | 320 | 39.8 | 485 | 60.2 | |
| Female | 177 | 69.4 | 78 | 30.6 | < 0.001 |
| Male | 537 | 45.8 | 635 | 54.2 | |
| ≥ 3,000 | 207 | 69.5 | 91 | 30.5 | < 0.001 |
| < 3,000 | 506 | 44.9 | 622 | 55.1 | |
| ≥ High school degree | 185 | 56.4 | 143 | 43.6 | 0.01 |
| < High school degree | 528 | 48.1 | 570 | 51.9 | |
| Yes | 655 | 50 | 656 | 50 | 0.92 |
| No | 58 | 50.4 | 57 | 49.6 | |
| | 72.55 ± 15.11 | | 65.77 ± 10.62 | | < 0.001 |
| (FEV1/FVC)*100 ≥ 0.7 | 454 | 61.7 | 282 | 38.3 | < 0.001 |
| (FEV1/FVC)*100 < 0.7 | 259 | 37.5 | 431 | 62.5 | |
| No | 253 | 62.3 | 153 | 37.7 | < 0.001 |
| Yes | 460 | 45.1 | 560 | 54.9 | |
| No | 508 | 53.2 | 446 | 46.8 | < 0.001 |
| Yes | 205 | 43.4 | 267 | 56.6 | |
| Stage 0–I | 464 | 56.9 | 352 | 43.1 | < 0.001 |
| Stage II–III | 249 | 40.8 | 361 | 59.2 | |
| No | 630 | 62.8 | 373 | 37.2 | < 0.001 |
| Yes | 83 | 19.6 | 340 | 80.4 | |
| 0 | 320 | 49 | 333 | 51 | 0.49 |
| ≥ 1 | 393 | 50.8 | 380 | 49.2 | |
| OP | 435 | 51.7 | 417 | 48.6 | < 0.001 |
| OP + RT | 41 | 37.6 | 68 | 62.4 | |
| OP + CT | 193 | 53.6 | 167 | 46.4 | |
| OP + CT + RT | 44 | 40 | 66 | 60 | |
| | 2.93 ± 1.59 | | 2.983 ± 1.68 | | 0.29 |
| ≥ 3 years | 307 | 53 | 272 | 47 | 0.06 |
| < 3 years | 406 | 47.9 | 441 | 52.1 | |
OP, operation; RT, radiation therapy; CT, chemotherapy; FEV1/FVC, forced expiratory volume 1/forced vital capacity.
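As a rough check on how the categorical rows of this table can be read, the sketch below recomputes the row percentages and a p-value for the age-group row (< 65 vs. ≥ 65). It assumes a chi-square test of independence, which the record does not confirm as the test the authors used; only the four counts come from the table.

```python
from scipy.stats import chi2_contingency

# Age-group row of the table: counts of living and deceased patients.
counts = [
    [393, 228],  # < 65 years: living, deceased
    [320, 485],  # >= 65 years: living, deceased
]

# Assumption: a chi-square test of independence on the 2x2 table.
chi2, p_value, dof, expected = chi2_contingency(counts)

# Row percentages, matching the "%" columns of the table.
row_pct = [[round(100 * c / sum(row), 1) for c in row] for row in counts]

print(row_pct)          # [[63.3, 36.7], [39.8, 60.2]]
print(p_value < 0.001)  # True
```

The percentages are computed within each row (each age group), which is why every pair of living and deceased percentages sums to 100.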
The normalized importance scores of prognostic variables for each of the five MLTs.
Values are normalized variable importance (%). FS1, feature set 1 (sociodemographic and clinical variables); FS2, feature set 2 (PRO variables added to feature set 1).

| Domain | Variable | FS1 DT | FS1 Bagging | FS1 RF | FS1 AdaBoost | FS1 LR* | FS2 DT | FS2 Bagging | FS2 RF | FS2 AdaBoost | FS2 LR* |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Clinical factors | Cancer stage II–III | 19.39 | 23.44 | 6.06 | 6.60 | ||||||
| Local invasion of tumor | 8.90 | 14.34 | 14.28 | 10.66 | 12.50 | 6.31 | 5.58 | 3.26 | NS | ||
| Regional lymph node metastasis | 23.71 | 10.25 | 10.42 | 9.13 | NS | 6.20 | 6.42 | NS | |||
| Sociodemographic factors | Low household income (< 3,000$) | 13.82 | 18.46 | 16.00 | 14.26 | 20.49 | 4.93 | 5.33 | 5.29 | 5.60 | 5.14 |
| Age over 65 years | 20.45 | 19.87 | 23.19 | 6.04 | 6.28 | ||||||
| Male | 8.76 | 17.01 | 18.03 | 19.19 | 24.84 | 5.90 | 5.70 | 6.96 | |||
| HRQOL factors | BMI (kg/m²) before the operation (≥ 23.5) | 5.91 | 6.55 | 10.09 | |||||||
| Anxiety | 3.34 | 5.49 | 6.28 | 7.04 | |||||||
| Depression | 3.59 | 6.60 | 6.36 | 4.46 | |||||||
| Poor physical functioning | 1.74 | 2.36 | 1.74 | 1.48 | 6.47 | ||||||
| Role functioning | 1.64 | 1.73 | 1.53 | 2.14 | 3.54 | ||||||
| Poor dyspnea | 5.47 | 5.89 | 6.59 | 3.82 | |||||||
| Poor appetite loss | 3.53 | 4.24 | 3.35 | 4.44 | NS | ||||||
| Poor diarrhea | 1.95 | 2.64 | 2.10 | 2.78 | NS | ||||||
| Poor lung cancer-specific cough | 3.00 | 4.27 | 3.98 | 4.28 | NS | ||||||
| Poor pain in chest | 3.36 | 4.41 | 4.34 | 3.88 | NS | ||||||
| Low new possibility | 5.93 | 5.26 | 4.88 | 3.69 | 7.69 | ||||||
| Low personal strength | 6.62 | 5.45 | 5.56 | 6.11 | |||||||
| Low appreciation of life | 6.89 | 5.19 | |||||||||
NS, nonsignificant; BMI, body mass index; HRQOL, health-related quality of life; DT, decision tree; RF, random forest; LR, logistic regression.
*LR variable selection using stepwise feature selection with a 5% significance level.
The most important variables (top 20%) from each model are highlighted in bold font.
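One common way to obtain importance scores on a percentage scale is to rescale a fitted model's raw importances so that they sum to 100. The sketch below does this for a random forest on synthetic data; the record does not state how the authors normalized their scores, so the rescaling rule, the data, and the variable names are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical variable names; not the study variables.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
names = [f"var_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, rescaled so that they sum to 100%.
raw = model.feature_importances_
normalized_pct = 100 * raw / raw.sum()

# Rank variables from most to least important.
ranked = sorted(zip(names, normalized_pct), key=lambda item: item[1], reverse=True)
for name, pct in ranked:
    print(f"{name}: {pct:.2f}%")
```

Putting every model's scores on the same 0 to 100 scale makes rankings comparable across models even when the raw importance measures differ.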
Model comparisons based on the five machine learning techniques.
| Feature set | Machine learning algorithm | Validation method | N folds | Training set size | Testing set size | Training accuracy | Testing accuracy |
|---|---|---|---|---|---|---|---|
| 1 | DT | Holdout sampling | | 1,140 | 286 | 0.668 | 0.703 |
| 1 | DT | Cross-validation | 5 | 912 | 286 | 0.625 | 0.692 |
| 1 | LR | Holdout sampling | | 1,140 | 286 | 0.663 | 0.647 |
| 1 | LR | Cross-validation | 5 | 912 | 286 | 0.657 | 0.632 |
| 1 | Bagging | Holdout sampling | | 1,140 | 286 | 0.680 | 0.710 |
| 1 | Bagging | Cross-validation | 5 | 912 | 286 | 0.655 | 0.706 |
| 1 | RF | Holdout sampling | | 1,140 | 286 | 0.675 | 0.713 |
| 1 | RF | Cross-validation | 5 | 912 | 286 | 0.675 | 0.692 |
| 1 | AdaBoost | Holdout sampling | | 1,140 | 286 | 0.668 | 0.696 |
| 1 | Real AdaBoost | Cross-validation | 5 | 912 | 286 | 0.642 | 0.713 |
| 2 | DT | Holdout sampling | | 1,140 | 286 | 0.780 | 0.762 |
| 2 | DT | Cross-validation | 5 | 912 | 286 | 0.758 | 0.745 |
| 2 | LR | Holdout sampling | | 1,140 | 286 | 0.791 | 0.746 |
| 2 | LR | Cross-validation | 5 | 912 | 286 | 0.814 | 0.825 |
| 2 | Bagging | Holdout sampling | | 1,140 | 286 | 0.976 | 0.930 |
| 2 | Bagging | Cross-validation | 5 | 912 | 286 | 0.794 | 0.776 |
| 2 | RF | Holdout sampling | | 1,140 | 286 | 0.949 | 0.916 |
| 2 | RF | Cross-validation | 5 | 912 | 286 | 0.918 | 0.941 |
| 2 | AdaBoost | Holdout sampling | | 1,140 | 286 | 0.943 | 0.878 |
| 2 | Real AdaBoost | Cross-validation | 5 | 912 | 286 | 0.932 | 0.948 |
DT, decision tree; RF, random forest; LR, logistic regression.
Feature set 1 includes sociodemographic and clinical variables.
Feature set 2 includes PRO variables and the variables included in feature set 1.
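The set sizes in this table are consistent with an 80/20 holdout split of the 1,426 up-sampled records (713 + 713), followed by fivefold cross-validation inside the training portion. That reading is inferred from the numbers alone and is not confirmed by the record; the sketch below only checks the arithmetic, using scikit-learn's splitters on a plain index array.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_samples = 1426  # 713 living + 713 deceased in the up-sampled data
indices = np.arange(n_samples)

# Assumed 80/20 holdout split (the split ratio is an inference, not stated).
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)
print(len(train_idx), len(test_idx))  # 1140 286

# Fivefold cross-validation within the holdout training portion.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(fit), len(val)) for fit, val in kfold.split(train_idx)]
print(fold_sizes[0])  # (912, 228)
```

Under this reading, each cross-validation model is fitted on 912 records and the same 286 held-out records serve as the test set in both validation schemes.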
Figure 1. Comparison of ROC curves for the five MLT-based lung cancer models using the cross-validation test set. DT, decision tree; RF, random forest; Boost, AdaBoost; LR, logistic regression. (A) Model from feature set 1, (B) Model from feature set 2.
Figure 2. Calibration plots for each MLT-based lung cancer model at five risk levels using the cross-validation test set. DT, decision tree; RF, random forest; LR, logistic regression.
Figure 3. Study hypothesis and process.
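The quantities plotted in Figures 1 and 2 (ROC curves with AUC, and calibration at five risk levels) can be computed along the following lines. This is a sketch on synthetic data with a single random forest; the choice of quantile-based bins for the five risk levels is an assumption, since the record does not say how the levels were formed.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the study cohort.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# ROC curve points and the area under the curve (as in Figure 1).
fpr, tpr, _ = roc_curve(y_test, prob)
auc = roc_auc_score(y_test, prob)

# Observed vs. mean predicted risk in up to five bins (as in Figure 2).
# Assumption: quantile-based bins; the source's binning rule is not stated.
observed, predicted = calibration_curve(y_test, prob, n_bins=5, strategy="quantile")

print(f"AUC = {auc:.3f}")
for obs, pred in zip(observed, predicted):
    print(f"mean predicted risk {pred:.2f} -> observed event rate {obs:.2f}")
```

A well-calibrated model has observed event rates close to the mean predicted risk in every bin, which is a separate property from the ranking ability that AUC measures.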