Qing-Yan Yin, Jun-Li Li, Chun-Xia Zhang.
Abstract
As a pivotal tool for building interpretable models, variable selection plays an increasingly important role in high-dimensional data analysis. In recent years, variable selection ensembles (VSEs) have gained much interest due to their many advantages. Stability selection (Meinshausen and Bühlmann, 2010), a VSE technique that combines subsampling with a base algorithm such as the lasso, is an effective method for controlling the false discovery rate (FDR) and improving selection accuracy in linear regression models. By adopting the lasso as a base learner, we extend stability selection to variable selection problems in a Cox model. In our experience, it is crucial to set the regularization region Λ of the lasso and the parameter λmin properly for stability selection to work well; to the best of our knowledge, however, no existing literature addresses this problem explicitly. We therefore first provide a detailed procedure for specifying Λ and λmin. We then use simulated and real-world data with various censoring rates to examine how well stability selection performs, comparing it with several other popular variable selection approaches. Experimental results demonstrate that it achieves better or competitive performance.
Year: 2017 PMID: 29270195 PMCID: PMC5706076 DOI: 10.1155/2017/2747431
Source DB: PubMed Journal: Comput Intell Neurosci
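The abstract stresses that the regularization region Λ and the parameter λmin must be chosen with care. As a hedged sketch only, the snippet below builds a glmnet-style geometric grid for a plain linear lasso, where λmax is the smallest penalty that zeroes all coefficients and λmin = ε·λmax; the paper's own procedure for the Cox partial likelihood may differ, and `lambda_grid`, `n_lambdas`, and `eps` are illustrative names.

```python
import numpy as np

def lambda_grid(X, y, n_lambdas=100, eps=0.01):
    """Geometric grid from lambda_max down to lambda_min = eps * lambda_max.

    lambda_max is the smallest penalty at which the lasso KKT conditions
    force every coefficient to zero (computed on centered X and y).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    lam_max = np.max(np.abs(Xc.T @ yc)) / n   # KKT bound for the lasso
    return np.geomspace(lam_max, eps * lam_max, n_lambdas)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 2.0, 3.0, 0.0])
grid = lambda_grid(X, y, n_lambdas=50)
```

Fitting the base learner on this grid and recording which variables enter the model at each λ is what the selection frequencies below are computed from.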
Algorithm 1. The stability selection algorithm for the Cox model.
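The subsample-and-count scheme behind stability selection can be sketched as follows. This is a minimal illustration, not the paper's implementation: the lasso-penalized Cox base learner is replaced by a simple correlation screener (`top_k_correlation`, an assumed stand-in) so the skeleton stays self-contained, while the `B` subsamples of size n/2 and the frequency threshold `pi_thr` follow the recipe of Meinshausen and Bühlmann (2010).

```python
import numpy as np

def stability_selection(X, y, base_selector, B=100, pi_thr=0.6, rng=None):
    """Stability selection: run `base_selector` on B random subsamples of
    size n/2 and keep variables whose selection frequency is >= pi_thr."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts[base_selector(X[idx], y[idx])] += 1
    freq = counts / B
    return np.flatnonzero(freq >= pi_thr), freq

# Stand-in base learner: pick the k features most correlated with y.
# In the paper this role is played by the lasso-penalized Cox model.
def top_k_correlation(X, y, k=3):
    score = np.abs((X - X.mean(axis=0)).T @ (y - y.mean()))
    return np.argsort(score)[-k:]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
selected, freq = stability_selection(X, y, top_k_correlation, rng=1)
```

With a survival library available, `top_k_correlation` would be swapped for a lasso Cox fit over the grid Λ that returns the indices of coefficients that are nonzero for some λ ≥ λmin.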
Selection frequencies of StabSel in identifying important variables (IVs) and unimportant variables (UIVs).
| Censoring rate | IV Min | IV Med | IV Max | UIV Min | UIV Med | UIV Max |
|---|---|---|---|---|---|---|
| 0% | | | | | | |
| | 67 | 69 | 73 | 0 | 0 | 1 |
| | 75 | 77 | 81 | 0 | 1 | 3 |
| | 77 | 81 | 84 | 3 | 7 | 15 |
| 20% | | | | | | |
| | 85 | 88 | 91 | 0 | 0 | 3 |
| | 93 | 99 | 100 | 1 | 3 | 6 |
| | 100 | 100 | 100 | 3 | 8 | 20 |
| 40% | | | | | | |
| | 49 | 76 | 98 | 0 | 0 | 2 |
| | 94 | 98 | 100 | 0 | 1 | 6 |
| | 100 | 100 | 100 | 3 | 6 | 14 |
Figure 1. Selection probabilities of StabSel and lasso.
Selection frequencies of each method in Simulation 3.
| Method | IV Min | IV Med | IV Max | UIV Min | UIV Med | UIV Max |
|---|---|---|---|---|---|---|
| 0% | | | | | | |
| Stepwise | 97 | 100 | 100 | 13 | 22 | 30 |
| BSS | 79 | 100 | 100 | 3 | 7 | 10 |
| PGA | 40 | 93 | 100 | 0 | 0 | 1 |
| StabSel | 91 | 97 | 97 | 0 | 3 | 5 |
| RSMA | 79 | 98 | 100 | 4 | 8 | 13 |
| ST2E | 100 | 100 | 100 | 10 | 15 | 18 |
| 20% | | | | | | |
| Stepwise | 94 | 100 | 100 | 19 | 24 | 31 |
| BSS | 70 | 100 | 100 | 6 | 12 | 17 |
| PGA | 29 | 94 | 100 | 0 | 0 | 1 |
| StabSel | 94 | 96 | 97 | 1 | 3 | 5 |
| RSMA | 80 | 98 | 100 | 4 | 9 | 17 |
| ST2E | 94 | 100 | 100 | 8 | 15 | 23 |
| 40% | | | | | | |
| Stepwise | 94 | 100 | 100 | 22 | 26 | 38 |
| BSS | 65 | 89 | 96 | 8 | 11 | 15 |
| PGA | 31 | 95 | 100 | 0 | 0 | 1 |
| StabSel | 97 | 100 | 100 | 1 | 3 | 7 |
| RSMA | 80 | 99 | 100 | 7 | 13 | 18 |
| ST2E | 91 | 100 | 100 | 11 | 15 | 25 |
Figure 2. Average selection rate for different ensemble approaches.
Results for each method in Simulation 3.
| Method | Succ. rate | Size | TNR | TPR |
|---|---|---|---|---|
| 0% | | | | |
| Stepwise | 0.02 | 6.92 | 0.768 | 0.990 |
| BSS | 0.51 | 3.89 | 0.935 | 0.930 |
| PGA | 0.37 | 2.36 | 0.998 | 0.777 |
| StabSel | 0.55 | 3.30 | 0.973 | 0.950 |
| RSMA | 0.21 | 4.09 | 0.922 | 0.923 |
| ST2E | 0.01 | 5.47 | 0.855 | 1.000 |
| 20% | | | | |
| Stepwise | 0.04 | 7.02 | 0.760 | 0.980 |
| BSS | 0.31 | 4.76 | 0.879 | 0.900 |
| PGA | 0.28 | 2.25 | 0.999 | 0.743 |
| StabSel | 0.57 | 3.33 | 0.914 | 0.957 |
| RSMA | 0.14 | 4.50 | 0.899 | 0.927 |
| ST2E | 0.05 | 5.55 | 0.849 | 0.993 |
| 40% | | | | |
| Stepwise | 0.02 | 7.44 | 0.735 | 0.980 |
| BSS | 0.15 | 4.55 | 0.878 | 0.823 |
| PGA | 0.30 | 2.30 | 0.998 | 0.753 |
| StabSel | 0.61 | 3.51 | 0.968 | 0.990 |
| RSMA | 0.06 | 4.87 | 0.878 | 0.930 |
| ST2E | 0.03 | 5.68 | 0.837 | 0.970 |
| Oracle | 1.00 | 3.00 | 1.00 | 1.00 |
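The columns above can be reproduced per replication from the selected set and the true informative set, assuming the standard definitions (TPR: fraction of important variables selected; TNR: fraction of unimportant variables excluded; success rate: fraction of runs recovering the true set exactly). The helper below uses illustrative names, not code from the paper.

```python
def selection_metrics(selected, true_iv, p):
    """Size, TPR, TNR, and exact-recovery indicator for one replication,
    given the indices of selected and truly informative variables."""
    selected, true_iv = set(selected), set(true_iv)
    noise = set(range(p)) - true_iv
    tpr = len(selected & true_iv) / len(true_iv)   # important vars kept
    tnr = len(noise - selected) / len(noise)       # noise vars excluded
    return {"size": len(selected), "TPR": tpr, "TNR": tnr,
            "exact": selected == true_iv}

# One false positive (variable 7) out of 17 noise variables:
m = selection_metrics({0, 1, 2, 7}, {0, 1, 2}, p=20)
```

Averaging these quantities over replications yields the table entries, with the success rate being the mean of the `exact` indicator.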
Main characteristics of the real-world datasets used.
| Dataset | Number of variables | Number of samples | Training size |
|---|---|---|---|
| PBC | 15 (original covariates) + 20 (random uniform) | 276 | 200 |
| Lung | 8 (original covariates) + 20 (random uniform) | 167 | 100 |
| Rats | 3 (original covariates) + 20 (random uniform) | 300 | 250 |
The performance of each method on three real datasets.
| Dataset | Metric | PGA | BSS | StabSel | RSMA | ST2E |
|---|---|---|---|---|---|---|
| PBC | Sel. rate, IVs (1–15) | 0.299 | 0.597 | 0.518 | 0.543 | 0.605 |
| | Sel. rate, UIVs (26–35) | 0.015 | 0.291 | 0.053 | 0.074 | 0.120 |
| | C-index | 0.792 | 0.812 | 0.819 | 0.826 | 0.835 |
| | TPR | 0.24 | 0.50 | 0.42 | 0.54 | 0.63 |
| | TNR | 0.98 | 0.60 | 0.99 | 0.96 | 0.96 |
| Lung | Sel. rate, IVs (1–8) | 0.284 | 0.607 | 0.477 | 0.476 | 0.466 |
| | Sel. rate, UIVs (9–28) | 0.077 | 0.426 | 0.097 | 0.289 | 0.200 |
| | C-index | 0.631 | 0.703 | 0.695 | 0.680 | 0.695 |
| | TPR | 0.27 | 0.61 | 0.41 | 0.51 | 0.54 |
| | TNR | 0.92 | 0.55 | 0.83 | 0.71 | 0.74 |
| Rats | Sel. rate, IVs (1–3) | 0.627 | 0.890 | 0.850 | 0.893 | 0.997 |
| | Sel. rate, UIVs (4–23) | 0.043 | 0.332 | 0.101 | 0.160 | 0.159 |
| | C-index | 0.800 | 0.870 | 0.853 | 0.869 | 0.693 |
| | TPR | 0.60 | 0.89 | 0.70 | 0.89 | 1.00 |
| | TNR | 0.91 | 0.67 | 0.90 | 0.84 | 0.84 |
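The C-index reported in the real-data table is the standard concordance index for survival models. A naive O(n²) sketch, assuming Harrell's usual definition (a pair is comparable when the subject with the earlier observed time experienced an event, and at least one comparable pair exists):

```python
def concordance_index(time, event, risk):
    """Harrell's C-index: the fraction of comparable pairs whose predicted
    risk ordering agrees with the observed survival ordering (ties in
    predicted risk count as half)."""
    n = len(time)
    concordant = ties = comparable = 0
    for i in range(n):
        for j in range(n):
            # comparable: subject i had the earlier time and an observed event
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:      # higher risk, earlier event: agree
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

# Risks perfectly ordered against survival times (subject 3 censored):
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect concordance, which is why the C-index values around 0.7–0.87 above indicate useful but imperfect risk discrimination.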