Suyan Tian, Chi Wang, Howard H Chang, Jianguo Sun.
Abstract
In contrast to feature selection and gene set analysis, bi-level selection is a process of selecting not only important gene sets but also important genes within those gene sets. Depending on the order of selection, a bi-level selection method falls into one of three categories: forward selection, which first selects relevant gene sets and then selects relevant individual genes; backward selection, which takes the reverse order; and simultaneous selection, which performs the two tasks at once, usually with the aid of a penalized regression model. To test for the existence of subtype-specific prognostic genes for non-small cell lung cancer (NSCLC), we previously proposed the Cox-filter method, which examines the association of patients' survival time after diagnosis with one specific gene, the disease subtype, and their interaction terms. In this study, we extend it to carry out forward and backward bi-level selection. Using simulations and an NSCLC application, we demonstrate that the forward selection outperforms the backward selection and other relevant algorithms in our setting. Both proposed methods are readily understandable and interpretable. Therefore, they represent useful tools for researchers who are interested in exploring the prognostic value of gene expression data for specific subtypes or stages of a disease.
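The Cox-filter model described above relates survival to one gene, the disease subtype, and their interaction. A minimal sketch of such a model for two subtypes follows; the symbols $g$ (expression of the gene), $s$ (subtype indicator), and $\beta_1, \beta_2, \beta_3$ are notation assumed here for illustration, not taken from the source.

```latex
h(t \mid g, s) = h_0(t)\,\exp\left(\beta_1 g + \beta_2 s + \beta_3\, g s\right),
\qquad
s =
\begin{cases}
0 & \text{AC} \\
1 & \text{SCC}
\end{cases}
```

Under this assumed coding, the log hazard ratio per unit of gene expression is $\beta_1$ in AC and $\beta_1 + \beta_3$ in SCC, so a nonzero $\beta_3$ is what would indicate a subtype-specific effect.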
Year: 2017 PMID: 28387364 PMCID: PMC5384004 DOI: 10.1038/srep46164
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The results of simulation 1.
| Method: subtype | Size | ARRB1(%) | ECE2(%) | COPA(%) | SMAD4(%) | C-Stat (SE)% | Rand (SE)% |
|---|---|---|---|---|---|---|---|
| Forward-AC | 28.7 | 16 | 0 | 69.71(4.11) | 34.68(5.56) | ||
| Forward-SCC | 52.2 | 64 | 4 | 74.95(6.61) | 28.02(2.26) | ||
| Backward-AC | 49.5 | 16 | 0 | 66.99(3.10) | 23.20(2.67) | ||
| Backward-SCC | 60.4 | 0 | 76 | 62.65(5.55) | 15.87(5.04) | ||
| Cox-filter: AC | 59.2 | 4 | 0 | 54.09(7.09) | 35.89(2.46) | ||
| Cox-filter: SCC | 74.1 | 0 | 0 | 54.44(10.05) | 26.47(2.80) | ||
| Cox-TGDR: AC | 8.2 | 0 | 0 | 69.09(2.66) | 32.13(3.33) | ||
| Cox-TGDR:SCC | 3.8 | 100 | 4 | 53.34(7.79) | 42.16(8.59) | ||
| LASSO: AC | 37.8 | 94 | 2 | 76.00(2.20) | 22.65(0.92) | ||
| LASSO: SCC | 4.0 | 0 | 0 | 54.03(4.67) | 34.05(7.75) |
Note: Size: the average number of selected genes over 50 replicates. Under each gene symbol is the percentage of the 50 replicates in which each algorithm selected that gene. Forward: forward Cox-filter selection; Backward: backward Cox-filter selection.
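The Size and selection-frequency columns can be illustrated with a short sketch. The replicate contents below are made-up toy data (four replicates rather than 50), and the function names are hypothetical; only the four gene symbols come from the table.

```python
from collections import Counter


def selection_frequency(selections, genes):
    """Percentage of replicates in which each gene of interest was selected."""
    n = len(selections)
    # Count each gene at most once per replicate.
    counts = Counter(g for sel in selections for g in set(sel))
    return {g: 100.0 * counts[g] / n for g in genes}


def average_size(selections):
    """Average number of selected genes per replicate (the 'Size' column)."""
    return sum(len(set(sel)) for sel in selections) / len(selections)


# Toy input: hypothetical selections from 4 replicates (G1..G3 are invented noise genes).
replicates = [
    {"ARRB1", "ECE2", "G1"},
    {"ARRB1"},
    {"ECE2", "G2", "G3"},
    {"ARRB1", "G1"},
]

freq = selection_frequency(replicates, ["ARRB1", "ECE2", "COPA", "SMAD4"])
size = average_size(replicates)
```

Here `freq` maps each gene symbol to a percentage and `size` is the mean signature length, matching the layout of the table columns.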
The results of simulation 2.
| Method: subtype | Size | ARRB1(%) | ECE2(%) | COPA(%) | SMAD4(%) | C-Stat (SE) % | Rand (SE) % |
|---|---|---|---|---|---|---|---|
| Forward-AC | 85.5 | 100 | 58 | 88 | 68 | 75.45(4.05) | 44.44(2.76) |
| Forward-SCC | 109.5 | 70 | 56 | 74 | 90 | 69.29(8.52) | 38.58(2.41) |
| Backward-AC | 110.4 | 64 | 76 | 92 | 66 | 72.39(4.04) | 28.11(4.53) |
| Backward-SCC | 142.3 | 54 | 56 | 84 | 86 | 70.62(7.28) | 37.98(7.64) |
| Cox-filter: AC | 78.9 | 100 | 20 | 0 | 74 | 72.31(2.38) | 50.50(6.49) |
| Cox-filter: SCC | 145.3 | 40 | 44 | 10 | 94 | 64.13(4.55) | 21.32(8.11) |
| Cox-TGDR: AC | 5.5 | 100 | 38 | 0 | 46 | 61.57(3.09) | 35.03(5.43) |
| Cox-TGDR:SCC | 8.7 | 98 | 64 | 22 | 90 | 54.16(4.61) | 38.35(6.22) |
| LASSO: AC | 28.7 | 98 | 72 | 48 | 98 | 81.27(2.10) | 25.48(2.06) |
| LASSO: SCC | 4.9 | 2 | 8 | 4 | 28 | 56.29(4.87) | 21.35(1.27) |
Note: Size: the average number of selected genes over 50 replicates. Under each gene symbol is the percentage of the 50 replicates in which each algorithm selected that gene. Forward: forward Cox-filter selection; Backward: backward Cox-filter selection. C-Stat (SE): the mean of the C-statistics over the replicates (with its standard error).
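As a rough illustration of the C-statistic column, the sketch below computes a concordance index for right-censored data. It assumes Harrell's definition, which the text here does not confirm, and uses invented toy data; the function name is hypothetical.

```python
def c_statistic(times, events, risk):
    """Harrell-style concordance: share of usable pairs ordered correctly by risk.

    A pair (i, j) is usable when subject i has an observed event strictly
    before subject j's follow-up time. It is concordant when the subject who
    failed earlier has the higher risk score; ties in risk count as 0.5.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data (hypothetical): follow-up times, event indicators (1 = death), risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]

c = c_statistic(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the percentages in the table can be read.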
The frequencies for four causal genes under a random guess model.
| Method: subtype | Size | ARRB1(%) | ECE2(%) | COPA(%) | SMAD4(%) |
|---|---|---|---|---|---|
| Forward-AC | 2.46 | 0 | 0 | 0 | 0 |
| Forward-SCC | 8.64 | 0 | 0 | 4 | 0 |
| Backward-AC | 1.2 | 0 | 0 | 0 | 0 |
| Backward-SCC | 1.5 | 0 | 0 | 0 | 0 |
Note: Size: the average number of selected genes over 50 replicates. Under each gene symbol is the percentage of the 50 replicates in which each algorithm selected that gene. Forward: forward Cox-filter selection; Backward: backward Cox-filter selection.
Performance statistics for the NSCLC application.
| Method: subtype | 10-fold cross-validation: Rand_gene (SE) | 10-fold cross-validation: Rand_gs (SE) | C-statistic: test set (SE) |
|---|---|---|---|
| A. Using the microarray data as the training set | |||
| Forward: AC | 38.33% (1.51%) | 51.92% (4.17%) | 66.82% (3.52%) |
| Forward: SCC | 42.48% (1.13%) | 69.89% (6.76%) | 71.93% (4.38%) |
| Backward: AC | 46.08% (0.66%) | 43.45% (2.27%) | 58.29% (3.42%) |
| Backward: SCC | 46.04% (0.67%) | 56.00% (2.99%) | 52.80% (2.75%) |
| B. Using the RNA-Seq data as the training set | |||
| Forward: AC | 35.53% (0.67%) | 57.38% (3.45%) | 55.16% (5.98%) |
| Forward: SCC | 36.15% (0.58%) | 65.72% (1.91%) | 66.25% (6.20%) |
| Backward: AC | 40.89% (0.51%) | 23.59% (5.21%) | 57.28% (5.57%) |
| Backward: SCC | 39.56% (0.55%) | 27.03% (4.66%) | 60.93% (7.07%) |
| C. Comparison with other relevant algorithms by training on the microarray data | |||
| Cox-filter: AC | 25.25% (3.68%) | — | 60.34% (2.85%) |
| Cox-filter: SCC | 24.75% (3.65%) | — | 59.94% (2.55%) |
| Cox-TGDR: AC | 17.07% (3.31%) | — | 52.32% (5.49%) |
| Cox-TGDR: SCC | 18.65% (5.33%) | — | 48.83% (4.34%) |
| Lasso: AC (s) | 23.97% (2.06%) | — | 55.35% (6.78%) |
| Lasso: SCC (s) | 23.77% (3.47%) | — | 50.00% (7.88%) |
Note: Rand_gene: the Rand index evaluating stability at the gene level; Rand_gs: the Rand index evaluating stability at the gene set level; Forward: forward Cox-filter selection; Backward: backward Cox-filter selection; –: not available because the method performs only gene-level selection; s: applied separately to each subtype because the method itself does not account for subtype information; SE: the standard errors obtained using the bootstrapped samples. The last column lists the C-statistics and their standard errors for the different methods on the test set.
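To make the stability measure concrete, here is a minimal sketch of a Rand index between two selections. It assumes each selection is treated as a two-group partition (selected versus not selected) of a common gene universe; the exact definition behind the table is not given here, and the gene names are invented.

```python
from itertools import combinations


def rand_index(selected_a, selected_b, universe):
    """Rand index between two selections over the same universe of items.

    Each selection splits the universe into 'selected' and 'not selected'.
    A pair of items agrees if both selections put the pair in the same group,
    or both put it in different groups.
    """
    agree = 0
    total = 0
    for g, h in combinations(universe, 2):
        same_a = (g in selected_a) == (h in selected_a)
        same_b = (g in selected_b) == (h in selected_b)
        agree += int(same_a == same_b)
        total += 1
    return agree / total


# Toy example (hypothetical genes): two cross-validation folds select different genes.
universe = ["g1", "g2", "g3", "g4", "g5"]
fold_1 = {"g1", "g2"}
fold_2 = {"g1", "g3"}

ri = rand_index(fold_1, fold_2, universe)
```

The same function would apply at the gene set level by passing gene set names instead of gene names. The pairwise loop is quadratic in the universe size, so a closed-form count would be preferable for genome-scale inputs.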
Adjusted prognostic values of the resulting signatures in the presence of other clinical factors.
| Covariate | Forward: AC, β (p-value) | Forward: SCC, β (p-value) | Backward: AC, β (p-value) | Backward: SCC, β (p-value) |
|---|---|---|---|---|
| Signature (risk score) | 1.6(9.2×10⁻⁷)* | 1.85(0.02)* | 3.01(3.4×10⁻⁶)* | 4.24(0.02)* |
| Female versus male | −0.27(0.39) | −0.21(0.99) | −0.23(0.46) | −0.21(0.99) |
| Age | 0.03(0.08) | 0.04(0.37) | 0.03(0.11) | 0.02(0.69) |
| Smoking vs non-smoking | 0.24(0.47) | −2.18(0.05)* | 0.26(0.43) | −1.91(0.08) |
Note: β: the estimated coefficient in the multivariate Cox regression model with the prognostic signature, age, sex, and smoking status as covariates, representing the log hazard ratio. *p-value < 0.05, which is regarded as statistically significant.
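Because β is a log hazard ratio, exponentiating it gives the hazard ratio itself. The sketch below does this for the signature coefficients reported in the table; the interpretation as "per one-unit increase in the risk score" is the standard reading of a Cox coefficient, and the variable names are illustrative.

```python
import math

# Signature (risk score) coefficients as reported in the table above.
betas = {
    "Forward: AC": 1.6,
    "Forward: SCC": 1.85,
    "Backward: AC": 3.01,
    "Backward: SCC": 4.24,
}

# Hazard ratio per one-unit increase in the risk score: HR = exp(beta).
hazard_ratios = {name: math.exp(beta) for name, beta in betas.items()}
```

For example, a coefficient of 1.6 corresponds to a hazard ratio of roughly 4.95. The coefficients are not directly comparable across columns unless the risk scores are on the same scale, which is not stated here.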
Figure 1. Venn diagrams showing the overlaps between the selected genes/gene sets for AC and SCC.
(A) At the gene level: F_AC: the selected genes by the forward method for AC; F_SCC: the selected genes by the forward method for SCC; B_AC: the selected genes by the backward method for AC; B_SCC: the selected genes by the backward method for SCC; (B) At the gene set level: sccf: the selected gene sets by the forward method for SCC; acf: the selected gene sets by the forward method for AC; sccb: the selected gene sets by the backward method for SCC; acb: the selected gene sets by the backward method for AC.
Figure 2. Graphic illustration of the proposed methods.
(A) The forward Cox-filter method; (B) The backward Cox-filter method.
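The difference between the forward and backward orderings described in the abstract can be sketched generically. This is not the Cox-filter itself: the gene-level and set-level relevance rules below are hypothetical stand-ins (toy p-values with arbitrary thresholds), and all names are invented for illustration.

```python
def forward_selection(gene_sets, set_is_relevant, gene_is_relevant):
    """Forward order: pick relevant gene sets first, then genes within them."""
    selected = {}
    for name, genes in gene_sets.items():
        if set_is_relevant(genes):
            kept = {g for g in genes if gene_is_relevant(g)}
            if kept:
                selected[name] = kept
    return selected


def backward_selection(gene_sets, set_is_relevant, gene_is_relevant):
    """Backward order: pick relevant genes first, then gene sets built from them."""
    selected = {}
    for name, genes in gene_sets.items():
        kept = {g for g in genes if gene_is_relevant(g)}
        if kept and set_is_relevant(kept):
            selected[name] = kept
    return selected


# Hypothetical per-gene p-values; in practice these would come from a fitted model.
p_values = {"a": 0.01, "b": 0.2, "c": 0.03, "d": 0.5, "e": 0.04}
gene_sets = {"S1": {"a", "b"}, "S2": {"c", "d"}, "S3": {"d", "e"}}


def gene_rule(g):
    # Toy gene-level rule: p-value below 0.05.
    return p_values[g] < 0.05


def set_rule(genes):
    # Toy set-level rule: mean p-value of the supplied genes below 0.15.
    return sum(p_values[g] for g in genes) / len(genes) < 0.15


fwd = forward_selection(gene_sets, set_rule, gene_rule)
bwd = backward_selection(gene_sets, set_rule, gene_rule)
```

With these toy rules the two orderings return different results, because the forward version scores each gene set on all of its genes while the backward version scores it only on the genes that survived the first step.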