Lei Zhang, Linlin Wang, Bochuan Du, Tianjiao Wang, Pu Tian, Suyan Tian.
Abstract
Within non-small cell lung cancer (NSCLC), adenocarcinoma (AC) and squamous cell carcinoma (SCC) are the two major histological subtypes, accounting for roughly 40% and 30% of all lung cancer cases, respectively. Because AC and SCC differ in their cell of origin, location within the lung, and growth pattern, they are considered distinct diseases. Gene expression signatures have been shown to be an effective tool for distinguishing AC from SCC. Gene set analysis is generally regarded as irrelevant to the identification of gene expression signatures. Nevertheless, we found that one specific gene set analysis method, significance analysis of microarray-gene set reduction (SAMGSR), can be used directly to select relevant features and to construct gene expression signatures. In this study, we applied SAMGSR to an NSCLC gene expression dataset. Compared with several other feature selection algorithms, for example LASSO, SAMGSR performs equivalently or better in terms of predictive ability and model parsimony. SAMGSR can therefore serve as a feature selection algorithm. Additionally, we applied SAMGSR to the AC and SCC subtypes separately to discriminate their respective stages, that is, stage II versus stage I. The two resulting gene signatures overlap very little, which supports the view that AC and SCC are distinct diseases. Stratified analyses by subtype are therefore recommended when diagnostic or prognostic signatures for these two NSCLC subtypes are constructed.
Year: 2016 PMID: 27446945 PMCID: PMC4944087 DOI: 10.1155/2016/2491671
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Study schema. Graphical illustration of the application of SAMGSR to the stage segmentation of early-stage NSCLC.
Performance of SAMGSR on NSCLC data for stage segmentations.
| Setting | Training set: Error (%) | Training set: GBS | Training set: BCM | Training set: AUPR | Test set: Error (%) | Test set: GBS | Test set: BCM | Test set: AUPR |
|---|---|---|---|---|---|---|---|---|
| (A) Trained on the microarray data (GSE50081) | | | | | | | | |
| No IC filtering, on stage (115) | 1.18 | 0.050 | 0.809 | 0.976 | 32 | 0.318 | 0.51 | 0.612 |
| No IC filtering, for AC (83) | 0 | 0.039 | 0.825 | 0.996 | 35.7 | 0.357 | 0.5 | 0.627 |
| No IC filtering, for SCC (14) | 7.14 | 0.082 | 0.758 | 0.957 | 43.6 | 0.308 | 0.511 | 0.513 |
| With IC filtering, on stage (75) | 5.92 | 0.067 | 0.784 | 0.964 | 36 | 0.344 | 0.56 | 0.535 |
| With IC filtering, for AC (119) | 0 | 0.043 | 0.810 | 0.996 | 42.9 | 0.350 | 0.609 | 0.630 |
| With IC filtering, for SCC (26) | 2.36 | 0.062 | 0.802 | 0.992 | 32.7 | 0.256 | 0.589 | 0.583 |
| (B) Trained on the RNA-seq data | | | | | | | | |
| No IC filtering, on stage (52) | 0 | 0.028 | 0.871 | 0.997 | 30.8 | 0.270 | 0.523 | 0.529 |
| No IC filtering, for AC (14) | 11.43 | 0.087 | 0.779 | 0.961 | 58.4 | 0.454 | 0.533 | 0.536 |
| No IC filtering, for SCC (28) | 0 | 0.035 | 0.842 | 0.991 | 45.2 | 0.278 | 0.532 | 0.563 |
| With IC filtering, on stage (24) | 12.8 | 0.110 | 0.725 | 0.873 | 38.6 | 0.272 | 0.569 | 0.623 |
| With IC filtering, for AC (31) | 0 | 0.033 | 0.848 | 0.995 | 30.7 | 0.258 | 0.533 | 0.576 |
| With IC filtering, for SCC (10) | 9.09 | 0.101 | 0.712 | 0.905 | 33.3 | 0.279 | 0.556 | 0.641 |
Note: the test set is RNA-seq data in part (A) and GSE50081 microarray data in part (B).
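To make the Error (%) and GBS columns easier to read, the sketch below computes a misclassification rate and the standard binary Brier score on invented toy data. It is only an illustration: the function names `error_rate` and `brier_score`, the labels, and the probabilities are all assumptions, and the generalized Brier score (GBS) reported in the table may be defined differently from the plain Brier score used here.

```python
from typing import Sequence


def error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def brier_score(y_true: Sequence[int], prob_pos: Sequence[float]) -> float:
    """Mean squared difference between P(class = 1) and the 0/1 label.

    Standard binary Brier score (an assumption; not necessarily the
    paper's GBS). Lower is better; 0 means perfect, confident predictions.
    """
    if len(y_true) != len(prob_pos) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for t, p in zip(y_true, prob_pos)) / len(y_true)


# Hypothetical toy data: 1 = stage II, 0 = stage I.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
prob_stage2 = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9, 0.3, 0.7]

# Threshold the probabilities at 0.5 to obtain hard class labels.
y_pred = [1 if p >= 0.5 else 0 for p in prob_stage2]

print(f"error = {100 * error_rate(y_true, y_pred):.1f}%")
print(f"Brier = {brier_score(y_true, prob_stage2):.3f}")
```

With hard 0/1 predictions in place of probabilities, this Brier score reduces to the error rate, which is one way to read rows where the two columns agree (for example an error of 35.7% alongside a GBS of 0.357).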
Comparison of SAMGSR with other feature selection algorithms.
| Method | Subtype | Training set: Error (%) | Training set: GBS | Training set: BCM | Training set: AUPR | TCGA RNA-seq: Error (%) | TCGA RNA-seq: GBS | TCGA RNA-seq: BCM | TCGA RNA-seq: AUPR |
|---|---|---|---|---|---|---|---|---|---|
| SAMGSR + SVM | AC (119) | 0 | 0.043 | 0.810 | 0.996 | 42.9 | 0.350 | 0.609 | 0.630 |
| SAMGSR + SVM | SCC (26) | 2.36 | 0.062 | 0.802 | 0.992 | 32.7 | 0.256 | 0.589 | 0.583 |
| Lasso | AC (81) | 0 | 1.14 × 10⁻⁴ | 0.990 | 0.996 | 35.7 | 0.357 | 0.5 | 0.624 |
| Lasso | SCC (33) | 0 | < 10⁻⁴ | 0.993 | 0.992 | 29.1 | 0.291 | 0.5 | 0.565 |
| Penalized SVM (SCAD) | AC (528) | 0 | 0.003 | 0.951 | 0.996 | 37.1 | 0.318 | 0.524 | 0.615 |
| Penalized SVM (SCAD) | SCC (63) | 0 | < 10⁻⁴ | 0.999 | 0.959 | 27.3 | 0.273 | 0.531 | 0.654 |
| DEGs + SVM | AC (145) | 0 | 0.042 | 0.810 | 0.996 | 51.9 | 0.465 | 0.562 | 0.638 |
| DEGs + SVM | SCC (46) | 0 | 0.046 | 0.803 | 0.992 | 29.1 | 0.287 | 0.501 | 0.632 |
| Radviz + SVM | AC (9) | 22.83 | 0.166 | 0.559 | 0.734 | 37.1 | 0.363 | 0.493 | 0.541 |
| Radviz + SVM | SCC (8) | 4.76 | 0.076 | 0.774 | 0.934 | 30.9 | 0.293 | 0.493 | 0.536 |
Figure 2. Venn diagrams showing how the gene sets and genes selected for SCC and AC stage segmentation overlap. (a) At the level of gene sets selected by SAMGS. (b) At the level of genes selected by SAMGSR. (c) At the level of enriched KEGG pathways. There are 5 overlapping gene sets, 3 overlapping genes, and 12 overlapping KEGG pathways, respectively.
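The overlap counts behind a Venn diagram like this one come down to a set intersection. The following is a minimal sketch under stated assumptions: `signature_overlap` is a hypothetical helper, the gene identifiers are placeholders rather than genes from the study, and the Jaccard index is added only as one convenient summary of overlap, not as a statistic reported by the authors.

```python
from typing import Set, Tuple


def signature_overlap(sig_a: Set[str], sig_b: Set[str]) -> Tuple[Set[str], float]:
    """Return the shared members of two signatures and their Jaccard index.

    The Jaccard index is |A ∩ B| / |A ∪ B|; it is 0.0 when both sets are empty.
    """
    shared = sig_a & sig_b
    union = sig_a | sig_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Placeholder identifiers only (hypothetical, not the study's signatures).
ac_signature = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
scc_signature = {"GENE_C", "GENE_E", "GENE_F"}

shared_genes, jaccard_index = signature_overlap(ac_signature, scc_signature)
print(f"shared genes: {sorted(shared_genes)}")
print(f"Jaccard index: {jaccard_index:.3f}")
```

The same function applies unchanged to gene set names or KEGG pathway identifiers, since it only requires two sets of strings.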