Lira Pi, Susan Halabi.
Abstract
BACKGROUND: Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Such models are usually built and validated using variable selection methods and machine learning tools. In ultra-high dimensional space, however, the challenge is not only to reduce the dimensionality of the data but also to retain the important variables that predict the outcome. Screening approaches such as sure independence screening (SIS), iterative SIS (ISIS) and principled SIS (PSIS) have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While these variable selection approaches have theoretical justification for selecting SNPs, the performance of the combined methods in predicting time-to-event outcomes has not previously been compared in ultra-high dimensional space with hundreds of thousands of variables.
Keywords: Variable selection; calibration; elastic net; germline single-nucleotide polymorphism; high dimensional data; machine learning; overfitting; prognostic models; proportional hazards model; random forest; survival outcomes
Year: 2018 PMID: 30393771 PMCID: PMC6214199 DOI: 10.1186/s41512-018-0043-4
Source DB: PubMed Journal: Diagn Progn Res ISSN: 2397-7523
Fig. 1 Overall diagram of the screening approaches and the variable selection methods
Fig. 2 Process of the aggressive variant of sure independence screening (SIS)
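
As a rough illustration of what a marginal screening step does in general (this is not the aggressive SIS variant of Fig. 2, whose details are not given here), the sketch below ranks each feature by a Cox score-test statistic evaluated at β = 0 and keeps the top `d`. The score statistic is a swapped-in shortcut that stands in for a full marginal Cox fit; the function names, the simulated genotypes, and the choice `d = 5` are all assumptions made for the example.

```python
import math
import random

def cox_score_statistic(x, times, events):
    """Score-test statistic U^2 / I for one feature in a Cox model at beta = 0."""
    n = len(times)
    u, info = 0.0, 0.0
    for i in range(n):
        if not events[i]:
            continue  # only observed events contribute to the partial likelihood
        at_risk = [x[j] for j in range(n) if times[j] >= times[i]]
        mean = sum(at_risk) / len(at_risk)
        u += x[i] - mean
        info += sum((v - mean) ** 2 for v in at_risk) / len(at_risk)
    return u * u / info if info > 0 else 0.0

def screen_top_d(X, times, events, d):
    """Indices of the d features with the largest marginal statistic."""
    p = len(X[0])
    stats = [cox_score_statistic([row[k] for row in X], times, events)
             for k in range(p)]
    ranked = sorted(range(p), key=lambda k: stats[k], reverse=True)
    return sorted(ranked[:d])

# Hypothetical data: 100 subjects, 50 SNP-like features coded 0/1/2,
# with only features 0 and 1 affecting the hazard.
rng = random.Random(0)
X = [[rng.randint(0, 2) for _ in range(50)] for _ in range(100)]
times = [rng.expovariate(math.exp(0.8 * row[0] - 0.8 * row[1])) for row in X]
events = [1] * len(times)  # no censoring in this toy example
selected = screen_top_d(X, times, events, d=5)
print("features kept after screening:", selected)
```

Each feature is scored on its own, so the cost grows linearly with the number of features; a penalized or tree-based method would then be fitted on the reduced set only.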
Mean (SD) of for the training sets over 300 simulations
| Selection approach | Weak signal | Weak signal | Strong signal | Strong signal |
|---|---|---|---|---|
| ISIS-LASSO | 0.122 (0.109) | 0.578 (0.071) | 0.335 (0.318) | 0.797 (0.071) |
| ISIS-ALASSO | 0.108 (0.108) | 0.578 (0.071) | 0.337 (0.322) | 0.797 (0.071) |
| ISIS-RSF | 0.069 (0.088) | 0.564 (0.105) | 0.232 (0.284) | 0.790 (0.086) |
| SIS | 0.065 (0.085) | 0.303 (0.116) | 0.114 (0.119) | 0.491 (0.177) |
| SIS-LASSO | 0.112 (0.093) | 0.306 (0.113) | 0.142 (0.119) | 0.491 (0.177) |
| SIS-ALASSO | 0.093 (0.092) | 0.306 (0.113) | 0.137 (0.120) | 0.491 (0.177) |
| SIS-RSF | 0.063 (0.082) | 0.294 (0.125) | 0.097 (0.104) | 0.482 (0.178) |
| PSIS | 0.756 (0.116) | 0.811 (0.038) | 0.642 (0.183) | 0.869 (0.047) |
| PSIS-LASSO | 0.875 (0.051) | 0.805 (0.037) | 0.869 (0.062) | 0.874 (0.049) |
| PSIS-ALASSO | 0.874 (0.052) | 0.780 (0.042) | 0.869 (0.063) | 0.849 (0.057) |
| PSIS-RSF | 0.688 (0.088) | 0.413 (0.109) | 0.655 (0.111) | 0.555 (0.167) |
| LASSO | 0.758 (0.138) | 0.834 (0.043) | 0.787 (0.136) | − 0.211 (0.036) |
| ALASSO | 0.583 (0.115) | 0.611 (0.058) | 0.812 (0.095) | 0.815 (0.063) |
SD standard deviation
Mean (SD) of final model size for the training sets over 300 simulations
| Selection approach | Weak signal | Weak signal | Strong signal | Strong signal |
|---|---|---|---|---|
| ISIS-LASSO | 1.087 (1.130) | 6.010 (0.672) | 2.567 (2.265) | 6.073 (0.261) |
| ISIS-ALASSO | 0.993 (0.925) | 5.943 (0.732) | 2.397 (2.164) | 6.033 (0.180) |
| ISIS-RSF | 2.463 (0.550) | 6.007 (0.954) | 3.367 (1.777) | 6.033 (0.304) |
| SIS | 2.277 (0.448) | 3.120 (0.925) | 2.477 (0.500) | 4.380 (0.955) |
| SIS-LASSO | 0.773 (0.955) | 3.033 (1.014) | 1.217 (0.945) | 4.377 (0.951) |
| SIS-ALASSO | 0.750 (0.737) | 3.003 (1.049) | 1.100 (0.824) | 4.373 (0.951) |
| SIS-RSF | 2.057 (0.644) | 3.033 (1.037) | 2.240 (0.719) | 4.327 (0.961) |
| PSIS | 132.183 (11.557) | 120.440 (10.707) | 129.980 (11.203) | 119.103 (10.702) |
| PSIS-LASSO | 69.850 (7.238) | 78.503 (7.301) | 63.170 (7.212) | 65.473 (6.786) |
| PSIS-ALASSO | 52.613 (6.344) | 49.007 (7.881) | 42.487 (8.784) | 32.067 (9.986) |
| PSIS-RSF | 44.26 (10.512) | 16.033 (5.713) | 42.667 (11.009) | 15.113 (5.159) |
| LASSO | 26.619 (11.320) | 61.437 (13.310) | 58.347 (10.184) | 108.470 (14.364) |
| ALASSO | 19.948 (7.318) | 45.010 (8.276) | 33.793 (7.738) | 67.147 (9.422) |
SD standard deviation
Mean of the number of informative features (% of uninformative features) selected in final model over 300 simulations in the training sets
| Selection approach | Weak signal | Weak signal | Strong signal | Strong signal |
|---|---|---|---|---|
| ISIS-LASSO | 0.373 (64.1%) | 5.913 (1.6%) | 1.920 (35.3%) | 6.000 (1.0%) |
| ISIS-ALASSO | 0.383 (59.9%) | 5.897 (0.9%) | 1.983 (30.0%) | 6.000 (0.5%) |
| ISIS-RSF | 0.343 (88.3%) | 5.817 (3.3%) | 1.657 (61.1%) | 5.973 (0.9%) |
| SIS | 1.973 (89.4%) | 0.160 (5.5%) | 1.720 (70.7%) | 0.023 (0.4%) |
| SIS-LASSO | 0.303 (56.4%) | 2.960 (2.8%) | 0.757 (33.4%) | 4.357 (0.4%) |
| SIS-ALASSO | 0.303 (57.3%) | 2.960 (1.6%) | 0.757 (29.6%) | 4.357 (0.3%) |
| SIS-RSF | 0.293 (88.8%) | 2.880 (5.3%) | 0.683 (69.4%) | 4.317 (0.2%) |
| PSIS | 4.577 (96.5%) | 5.943 (95.0%) | 5.437 (95.8%) | 5.987 (94.9%) |
| PSIS-LASSO | 4.547 (93.4%) | 5.943 (92.4%) | 5.437 (91.3%) | 5.987 (90.8%) |
| PSIS-ALASSO | 4.180 (91.9%) | 5.943 (87.5%) | 5.403 (86.5%) | 5.987 (78.7%) |
| PSIS-RSF | 2.183 (94.8%) | 3.717 (73.8%) | 3.703 (90.7%) | 4.973 (63.3%) |
| LASSO | 4.671 (80.0%) | 6.000 (89.7%) | 5.990 (89.4%) | 6.000 (94.4%) |
| ALASSO | 5.731 (65.0%) | 6.000 (86.2%) | 6.000 (81.1%) | 6.000 (90.9%) |
Mean optimism and corrected c-index over 100 bootstrapped samples
| Selection approach | Weak signal: optimism | Weak signal: corrected c-index | Strong signal: optimism | Strong signal: corrected c-index |
|---|---|---|---|---|
| ISIS-LASSO | 0.010 | 0.654 | 0.005 | 0.794 |
| ISIS-ALASSO | 0.010 | 0.655 | 0.005 | 0.794 |
| ISIS-RSF | 0.010 | 0.641 | 0.005 | 0.791 |
| SIS | 0.008 | 0.426 | 0.007 | 0.627 |
| SIS-LASSO | 0.007 | 0.431 | 0.007 | 0.627 |
| SIS-ALASSO | 0.006 | 0.431 | 0.007 | 0.627 |
| SIS-RSF | 0.008 | 0.416 | 0.007 | 0.624 |
| PSIS | * | * | 0.119 | 0.770 |
| PSIS-LASSO | 0.082 | 0.745 | 0.043 | 0.836 |
| PSIS-ALASSO | 0.042 | 0.772 | 0.019 | 0.844 |
| PSIS-RSF | 0.026 | 0.574 | 0.016 | 0.721 |
| LASSO | 0.045 | 0.808 | 0.048 | 0.903 |
| ALASSO | 0.057 | 0.648 | 0.058 | 0.781 |
*Did not converge
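
The table above reports an optimism estimate and an optimism-corrected c-index. A minimal sketch of how such a correction is commonly computed with the bootstrap follows. The source does not spell out its exact procedure, so the steps, the toy `fit_best_feature` model, and the simulated data are assumptions for illustration only.

```python
import random

def c_index(times, events, risks):
    """Harrell's c-index: share of comparable pairs in which the subject
    with the earlier observed event has the higher risk score."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier time in a pair must be an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    # Sketch convention: return 0.5 when no pair is comparable.
    return concordant / comparable if comparable else 0.5

def fit_best_feature(X, times, events):
    """Toy 'model' (hypothetical): keep the single feature and sign
    with the highest c-index on the data it is fitted to."""
    best = None
    for k in range(len(X[0])):
        for sign in (1.0, -1.0):
            c = c_index(times, events, [sign * row[k] for row in X])
            if best is None or c > best[0]:
                best = (c, k, sign)
    _, k, sign = best
    return lambda row: sign * row[k]

def optimism_corrected_c(X, times, events, fit, n_boot=100, seed=0):
    """Return (apparent c-index, mean optimism, corrected c-index)."""
    rng = random.Random(seed)
    n = len(times)
    model = fit(X, times, events)
    apparent = c_index(times, events, [model(r) for r in X])
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        Xb = [X[i] for i in idx]
        tb = [times[i] for i in idx]
        eb = [events[i] for i in idx]
        mb = fit(Xb, tb, eb)  # repeat the whole selection step
        c_boot = c_index(tb, eb, [mb(r) for r in Xb])
        c_orig = c_index(times, events, [mb(r) for r in X])
        optimism += (c_boot - c_orig) / n_boot
    return apparent, optimism, apparent - optimism

# Hypothetical pure-noise data: any apparent discrimination is overfitting.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
times = [rng.expovariate(1.0) for _ in range(50)]
events = [1 if rng.random() < 0.7 else 0 for _ in range(50)]
print(optimism_corrected_c(X, times, events, fit_best_feature))
```

Refitting the selection step inside every bootstrap sample is the design choice that matters here: if features were selected once and only the final model were resampled, the optimism due to selection would be missed.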
Mean (SD) of C for the testing sets over 300 simulations
| Selection approach | Weak signal | Weak signal | Strong signal | Strong signal |
|---|---|---|---|---|
| ISIS-LASSO | 0.532 (0.053) | 0.820 (0.033) | 0.644 (0.149) | 0.895 (0.003) |
| ISIS-ALASSO | 0.534 (0.054) | 0.819 (0.037) | 0.648 (0.152) | 0.895 (0.003) |
| ISIS-RSF | 0.531 (0.050) | 0.815 (0.044) | 0.630 (0.135) | 0.894 (0.010) |
| SIS | 0.528 (0.045) | 0.696 (0.052) | 0.575 (0.068) | 0.805 (0.050) |
| SIS-LASSO | 0.528 (0.045) | 0.696 (0.052) | 0.575 (0.068) | 0.805 (0.050) |
| SIS-ALASSO | 0.528 (0.045) | 0.696 (0.052) | 0.575 (0.068) | 0.805 (0.050) |
| SIS-RSF | 0.528 (0.044) | 0.692 (0.057) | 0.570 (0.063) | 0.803 (0.051) |
| PSIS | 0.527 (0.035) | 0.711 (0.027) | 0.590 (0.066) | 0.856 (0.018) |
| PSIS-LASSO | 0.608 (0.050) | 0.737 (0.025) | 0.769 (0.072) | 0.873 (0.014) |
| PSIS-ALASSO | 0.600 (0.048) | 0.748 (0.027) | 0.769 (0.085) | 0.880 (0.014) |
| PSIS-RSF | 0.556 (0.033) | 0.696 (0.046) | 0.665 (0.066) | 0.832 (0.039) |
| LASSO | 0.692 (0.072) | 0.786 (0.011) | 0.857 (0.018) | 0.881 (0.005) |
| ALASSO | 0.797 (0.071) | 0.818 (0.006) | 0.886 (0.005) | 0.886 (0.004) |
SD standard deviation
Mean (SD) of calibration slope for testing set over 300 simulations
| Selection approach | Weak signal | Weak signal | Strong signal | Strong signal |
|---|---|---|---|---|
| ISIS-LASSO | 0.345 (0.340) | 0.944 (0.065) | 0.646 (0.347) | 0.965 (0.058) |
| ISIS-ALASSO | 0.318 (0.343) | 0.945 (0.065) | 0.628 (0.373) | 0.965 (0.058) |
| ISIS-RSF | 0.205 (0.379) | 0.936 (0.095) | 0.486 (0.515) | 0.965 (0.059) |
| SIS | 0.210 (0.395) | 0.887 (0.126) | 0.433 (0.512) | 0.955 (0.076) |
| SIS-LASSO | 0.391 (0.336) | 0.895 (0.091) | 0.624 (0.326) | 0.955 (0.076) |
| SIS-ALASSO | 0.321 (0.373) | 0.897 (0.091) | 0.611 (0.339) | 0.955 (0.076) |
| SIS-RSF | 0.092 (2.247) | 0.881 (0.129) | 0.448 (0.583) | 0.950 (0.080) |
| PSIS | 0.004 (0.006) | 0.245 (0.055) | 0.013 (0.017) | 0.370 (0.060) |
| PSIS-LASSO | 0.077 (0.049) | 0.355 (0.060) | 0.200 (0.072) | 0.530 (0.056) |
| PSIS-ALASSO | 0.075 (0.046) | 0.376 (0.072) | 0.246 (0.116) | 0.605 (0.100) |
| PSIS-RSF | 0.078 (0.055) | 0.574 (0.135) | 0.241 (0.114) | 0.806 (0.097) |
| LASSO | 0.359 (0.126) | 0.381 (0.076) | 0.213 (0.086) | 0.219 (0.055) |
| ALASSO | 0.780 (0.131) | 0.732 (0.069) | 0.642 (0.115) | 0.653 (0.069) |
SD standard deviation
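
For readers unfamiliar with the calibration slope, the sketch below assumes the usual definition for survival models: the coefficient obtained when the model's linear predictor is entered as the only covariate in a Cox regression on the testing set. The source does not state its exact computation, so this is an assumption, and `calibration_slope`, the Newton-Raphson settings, and the simulated data are hypothetical.

```python
import math
import random

def calibration_slope(times, events, lp, max_iter=25, tol=1e-9):
    """Coefficient b that maximizes the Cox partial likelihood with
    b * lp as the only term, found by Newton-Raphson."""
    n = len(times)
    b = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # subject j is still at risk
                    w = math.exp(b * lp[j])
                    s0 += w
                    s1 += w * lp[j]
                    s2 += w * lp[j] ** 2
            mean = s1 / s0
            score += lp[i] - mean
            info += s2 / s0 - mean ** 2
        if info <= 0:
            raise ValueError("linear predictor carries no information")
        step = score / info
        b += step
        if abs(step) < tol:
            break
    return b

# Hypothetical testing set in which lp is the true log hazard ratio.
rng = random.Random(0)
lp = [rng.gauss(0.0, 1.0) for _ in range(200)]
event_t = [rng.expovariate(math.exp(v)) for v in lp]
censor_t = [rng.expovariate(0.3) for _ in lp]
times = [min(t, c) for t, c in zip(event_t, censor_t)]
events = [1 if t <= c else 0 for t, c in zip(event_t, censor_t)]

slope = calibration_slope(times, events, lp)
# Doubling the linear predictor mimics predictions that are too extreme.
overfit = calibration_slope(times, events, [2.0 * v for v in lp])
print(round(slope, 3), round(overfit, 3))
```

A slope near 1 indicates that predicted log hazard ratios are on the right scale, while a slope well below 1 indicates predictions that are too extreme, the typical signature of overfitting. In the sketch, doubling the linear predictor halves the estimated slope.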
Fig. 3 Calibration plots on the training set for observed survival probability at 2 years versus predicted survival for (a) ALASSO with n = 150 and weak signal strength, (b) ISIS-ALASSO with n = 150 and weak signal strength, (c) ISIS-LASSO with n = 150 and strong signal strength, and (d) ISIS-LASSO with n = 300 and strong signal strength
c-indices based on the training and testing sets for the real example
| Selection approach | No. of SNPs selected | Training set (419): original c-index | Training set (419): corrected c-index* | Testing set: c-index* (95% CI) |
|---|---|---|---|---|
| ISIS-LASSO | 2 | 0.649 | 0.646 | 0.664 (0.621–0.707) |
| ISIS-ALASSO | 0 | 0.649 | 0.650 | 0.671 (0.624–0.719) |
| ISIS-RSF | 2 | 0.650 | 0.645 | 0.669 (0.618–0.720) |
| SIS | 2 | 0.650 | 0.646 | 0.669 (0.624–0.714) |
| SIS-LASSO | 0 | 0.649 | 0.649 | 0.671 (0.626–0.717) |
| SIS-ALASSO | 0 | 0.649 | 0.650 | 0.671 (0.620–0.723) |
| SIS-RSF | 2 | 0.650 | 0.648 | 0.669 (0.623–0.716) |
| PSIS | 40 | 0.749 | – | 0.572 (0.527–0.617) |
| PSIS-LASSO | 28 | 0.746 | 0.727 | 0.568 (0.524–0.613) |
| PSIS-ALASSO | 24 | 0.744 | 0.727 | 0.573 (0.528–0.617) |
| PSIS-RSF | 35 | 0.748 | – | 0.575 (0.529–0.622) |
| LASSO | 16 | 0.716 | – | 0.586 (0.540–0.632) |
| ALASSO | 13 | 0.653 | 0.634 | 0.647 (0.601–0.693) |
*Based on 200 bootstrapped samples