Suyan Tian1,2, Howard H Chang3, Chi Wang4.
Abstract
BACKGROUND: It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. Significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension to a gene set analysis method with further reduction of the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method.
Keywords: Multiple sclerosis (MS); Non-small cell lung cancer (NSCLC); Pathway knowledge; Pathway-based feature selection; Significance analysis of microarray (SAM); Weighted gene expression profiles
Year: 2016 PMID: 27681389 PMCID: PMC5041498 DOI: 10.1186/s13062-016-0152-3
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Three categories of pathway-based feature selection algorithms. The filter and embedded methods are two typical types of gene-based feature selection algorithms. As defined by [32], filter methods assess the relevance of features by calculating some functional score, while embedded methods search for the optimal subset simultaneously with the classifier construction.
| Category/description | Property | Pathway topology information | Examples [Ref.] |
|---|---|---|---|
| Penalty: add an extra penalty term that accounts for the pathway structure to the objective function, then optimize the resulting function to obtain the final gene subset | Embedded feature selection methods; carry out feature selection and coefficient estimation simultaneously; moderate to heavy computing burden | Need the pathway topology information for all genes, e.g., whether they are connected and the distance between them | Net-Cox [Zhang et al. 2013]; netSVM [Chen et al. 2011] |
| Stepwise forward: order genes based on one specific statistic, then add genes one by one until there is no gain in the pre-defined score | Usually filter methods; the underlying concepts and theory are simple. However, they also inherit the filter methods’ drawbacks of inferior model parsimony and thus a high false positive rate | Usually ignore the pathway topology information; the decision hinges mainly on the genes’ expression values | SAM-GSR [Dinu et al. 2009]; SurvNet [Li et al. 2012] |
| Weighting: create weights according to the pathway knowledge and then combine them with other feature selection methods to identify the relevant genes | With different weights, the chance that “driving” genes with subtle changes are selected increases. However, if the estimated weights are subject to large biases, the resulting model might even be inferior to one without weights | Account for the pathway topology information | RRFE [Johannes et al. 2010]; DRW [Liu et al. 2013] |
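
The stepwise forward category in the table above can be illustrated with a minimal Python sketch. It is a generic greedy procedure, not the exact SAM-GSR or SurvNet rule: the function name `forward_select`, the toy statistics, and the toy `score` function are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence


def forward_select(
    genes: Sequence[str],
    statistic: Dict[str, float],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedy forward selection (illustrative sketch).

    Genes are ranked by the absolute value of a per-gene statistic and
    added one at a time; the loop stops at the first gene whose addition
    does not improve the pre-defined score.
    """
    ranked = sorted(genes, key=lambda g: abs(statistic[g]), reverse=True)
    selected: List[str] = []
    best = float("-inf")
    for gene in ranked:
        candidate = selected + [gene]
        current = score(candidate)
        if current <= best:  # no gain on the score: stop adding genes
            break
        selected, best = candidate, current
    return selected


# Toy data (hypothetical): per-gene statistics and a score that rewards
# large statistics but charges a cost of 1.0 per selected gene.
toy_stat = {"g1": 3.0, "g2": -2.5, "g3": 0.4, "g4": 0.1}


def toy_score(subset: List[str]) -> float:
    return sum(abs(toy_stat[g]) for g in subset) - 1.0 * len(subset)


print(forward_select(list(toy_stat), toy_stat, toy_score))  # ['g1', 'g2']
```

Because the ordering depends only on the per-gene statistic, a procedure of this shape ignores pathway topology, which matches the third column of the table for this category.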
Fig. 1 Diagrams to elucidate both SAMGSR and weighted-SAMGSR algorithms
Fig. 2 Scatterplot to show the correlation between the number of gene sets in which a gene is involved and its connectivity. ρ is the estimated Spearman correlation coefficient between the number of gene sets involved and (1 + the number of connected genes)
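
As a hedged illustration of the quantity ρ in the Fig. 2 caption, the following stdlib-only Python sketch computes a Spearman correlation as the Pearson correlation of average ranks. The input vectors are made-up toy values, not data from the study, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical): gene-set counts and numbers of connected genes.
n_gene_sets = [1, 2, 2, 5, 8]
n_connected = [0, 3, 1, 10, 40]
connectivity = [1 + c for c in n_connected]

rho = spearman(n_gene_sets, connectivity)
print(round(rho, 3))  # 0.975
```

Adding 1 to the number of connected genes is a monotone shift, so it leaves the ranks and therefore ρ unchanged.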
Simulation results
| | Training set | | Test set | | | |
|---|---|---|---|---|---|---|
| Method (Size^a) | HDAC1 (%) | GNAS (%) | Error (%) | GBS | BCM | AUPR |
| A. Simulated from 60 independent normal-distributed random variables | ||||||
| SAMGSR (3.8) | 19 | 100 | 16.5 | 0.118 | 0.733 | 0.921 |
| W-SAMGSR (6.23) | 65 | 100 | 13.2 | 0.101 | 0.755 | 0.948 |
| B. Simulated based on the NSCLC microarray data | ||||||
| SAMGSR (3.94) | 0 | 100 | 44.5 | 0.256 | 0.517 | 0.550 |
| W-SAMGSR (6.28) | 77 | 100 | 40.5 | 0.241 | 0.534 | 0.621 |
Note: W-SAMGSR stands for weighted-SAMGSR
^a The average number of genes selected by either SAMGSR or W-SAMGSR over 100 replicates
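
The Error (%) and GBS columns can be read with the help of a small Python sketch. It assumes one common definition of the generalized Brier score, namely the mean over samples of Σ_k (1[y = k] − p_k)² / 2; the formula is not spelled out in this section, so treat it as an assumption. The labels and probabilities below are toy values.

```python
def error_rate(y_true, probs):
    """Misclassification rate in percent, predicting the most probable class."""
    wrong = 0
    for y, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != y:
            wrong += 1
    return 100.0 * wrong / len(y_true)


def generalized_brier_score(y_true, probs):
    """Assumed definition: sum over samples and classes of
    (indicator - probability)^2, divided by twice the sample size."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += sum(((1.0 if k == y else 0.0) - pk) ** 2
                     for k, pk in enumerate(p))
    return total / (2 * len(y_true))


# Toy two-class example (hypothetical values).
y_true = [0, 1, 1, 0]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

print(error_rate(y_true, probs))                         # 25.0
print(round(generalized_brier_score(y_true, probs), 3))  # 0.125
```

Under this definition a lower GBS is better, with 0 for perfectly confident correct predictions, whereas higher values are better for AUPR.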
Selected pathways and genes on MS data
| Method | Pathways with high frequency (frequency %) | Genes (frequency %) |
|---|---|---|
| SAMGSR | DNA Directed DNA Polymerase Activity (100 %); DNA Polymerase Activity (90 %); COVALENT_CHROMATIN_MODIFICATION (70 %); HISTONE_MODIFICATION (70 %) | DPM3 (70 %) |
| | Stability = 14.04 % | Stability = 12.83 % |
| Weighted-SAMGSR | DNA_RECOMBINATION (70 %); LIPOPROTEIN_BIOSYNTHETIC_PROCESS (70 %); NEGATIVE_REGULATION_OF_IMMUNE_SYSTEM_PROCESS (70 %); PROTEIN_AMINO_ACID_LIPIDATION (70 %); DEPHOSPHORYLATION (60 %); INOSITOL_OR_PHOSPHATIDYLINOSITOL_KINASE_ACTIVITY (60 %); LIPOPROTEIN_METABOLIC_PROCESS (60 %); PROTEIN_C_TERMINUS_BINDING (60 %) | CHAF1A (70 %); PEX16 (60 %); PPP1CA (60 %) |
| | Stability = 15.76 % | Stability = 14.03 % |
Note: Gene symbols in bold are those overlapped genes by SAMGSR and weighted-SAMGSR; gene symbols underlined are directly related to MS according to the genecards database
Performance statistics of selected genes on MS data
| | Training set (10-fold CV results) | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|
| A. Performance comparison | ||||||||
| Method (n) | Error (%) | GBS | BCM | AUPR | Error (%) | GBS | BCM | AUPR |
| SAMGSR (52) | 34.09 | 0.244 | 0.570 | 0.645 | 46.67 | 0.465 | 0.501 | 0.725 |
| W-SAMGSR (25) | 31.82 | 0.191 | 0.611 | 0.771 | 43.33 | 0.341 | 0.564 | 0.860 |
| LASSO (30) | 34.09 | 0.275 | 0.632 | 0.672 | 46.67 | 0.377 | 0.499 | 0.747 |
| Penalized SVM (11) | 47.73 | 0.406 | 0.534 | 0.630 | 45 | 0.569 | 0.431 | 0.555 |
| gelnet (169) | 34.09 | 0.251 | 0.528 | 0.589 | 46.67 | 0.246 | 0.547 | 0.746 |
| RRFE (198) | 43.18 | 0.263 | 0.547 | 0.619 | 46.67 | 0.300 | 0.523 | 0.693 |
| B. Performance of the top 3 teams in sbv MS sub-challenge (among 54 teams) | ||||||||
| Study (size) | Training data used/Method used | Error (%) | GBS | BCM | AUPR | |||
| Lauria’s ( | E-MTAB-69/Mann-Whitney test, then use top α % of the selected genes and Cytoscape to get the clusters on the test set | -- | -- | 0.884 | 0.874 | |||
| Tarca’s ( | GSE21942 (on Human Gene 1.0 ST)/LDA | -- | -- | 0.629 | 0.819 | |||
| Zhao’s ( | 7 other data and E-MTAB-69/Elastic net | 30 | -- | 0.576 | 0.820 | |||
Note: W-SAMGSR weighted-SAMGSR, LDA linear discriminant analysis, gelnet generalized elastic net by [25], RRFE reweighted recursive feature elimination by [14]
--: not available. Lauria’s, Tarca’s, and Zhao’s studies [38, 39, 44] are the 3 best studies in the sbv MS sub-challenge
Selected pathways and genes on NSCLC RNA-seq data (stage segmentation)
| Method | Pathways with high frequency (frequency %) | Genes (frequency %) |
|---|---|---|
| SAMGSR | DNA_FRAGMENTATION_DURING_APOPTOSIS (70 %) | |
| | Stability = 18.28 % | Stability = 24.48 % |
| Weighted-SAMGSR | ANION_CHANNEL_ACTIVITY (100 %) | |
| | Stability = 42.75 % | Stability = 32.38 % |
Note: Gene symbols in bold are those overlapped genes by SAMGSR and weighted-SAMGSR; gene symbols underlined are directly related to NSCLC according to the genecards database
Performance statistics of selected genes on NSCLC RNA-seq data (stage segmentation)
| | Training set (10-fold CV results) | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|
| Method (n) | Error (%) | GBS | BCM | AUPR | Error (%) | GBS | BCM | AUPR |
| SAMGSR (9) | 35.2 | 0.242 | 0.539 | 0.575 | 50 | 0.279 | 0.507 | 0.531 |
| W-SAMGSR (8) | 32.8 | 0.231 | 0.556 | 0.584 | 49.3 | 0.276 | 0.513 | 0.580 |
| LASSO (30) | 36 | 0.219 | 0.558 | 0.610 | 50 | 0.453 | 0.500 | 0.509 |
| Penalized SVM (34) | 36.8 | 0.255 | 0.562 | 0.603 | 50 | 0.329 | 0.501 | 0.518 |
| gelnet (252) | 36.8 | 0.231 | 0.517 | 0.547 | 50 | 0.465 | 0.499 | 0.475 |
| RRFE (93) | 35.2 | 0.185 | 0.545 | 0.578 | 50 | 0.471 | 0.500 | 0.506 |
Note: W-SAMGSR weighted-SAMGSR, gelnet generalized elastic net, RRFE reweighted recursive feature elimination
Performance statistics of selected genes on NSCLC data (multiple-class case)
| | Training set (5-fold CV results) | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|
| A. Performance comparison | ||||||||
| Method (n) | Error (%) | GBS | BCM | AUPR | Error (%) | GBS | BCM | AUPR |
| SAMGSR (30)^a | 40.7 | 0.279 | 0.377 | 0.462 | 51.3 | 0.348 | 0.407 | 0.486 |
| W-SAMGSR (27)^a | 37.2 | 0.276 | 0.378 | 0.453 | 51.3 | 0.345 | 0.405 | 0.492 |
| LASSO (95) | 38.6 | 0.281 | 0.458 | 0.483 | 52.7 | 0.395 | 0.456 | 0.485 |
| pSVM (>100) | 42.8 | 0.370 | 0.344 | 0.428 | 53.3 | 0.433 | 0.385 | 0.397 |
| gelnet (>400) | 36.6 | 0.284 | 0.346 | 0.416 | 54.7 | 0.343 | 0.377 | 0.489 |
| RRFE (>200) | 36.6 | 0.272 | 0.395 | 0.448 | 54 | 0.336 | 0.410 | 0.468 |
| B. Performance of the top 3 teams in sbv NSCLC sub-challenge (among 54 teams) | ||||||||
| Study (size) | Training data used/Method used | Error (%) | GBS | BCM | AUPR | |||
| Ben-Hamo’s (23) | GSE10245, GSE18842, GSE31799/PAM | 49.3 | -- | 0.48 | 0.46 | |||
| Tarca’s (25) | GSE10245, GSE18842, GSE2109/moderated t-tests + LDA | -- | -- | 0.459 | 0.454 | |||
| Tian’s (66) | GSE10245, GSE18842, GSE2109/TGDR in hierarchical way | 53.3 | 0.374 | 0.440 | 0.471 | |||
Note: W-SAMGSR weighted-SAMGSR, pSVM penalized support vector machine (SCAD penalty term), gelnet generalized elastic net, RRFE reweighted recursive feature elimination, LDA linear discriminant analysis, PAM partitioning around medoid, TGDR threshold gradient descent regularization
^a Sizes are those of the final models for the stage segmentation, because the results for the subtype segmentation are identical for both algorithms (but the final size is > 300). Ben-Hamo’s study [31], Tarca’s study [44] and Tian’s study [45] are the 3 best studies in the sbv LC sub-challenge
Performance statistics on the test set for the weighted-SAMGSR algorithm (PPI information retrieved from the STRING database)
| Application | No. | Error (%) | GBS | BCM | AUPR | Rand (gene) | Rand (GS) |
|---|---|---|---|---|---|---|---|
| MS (b) | 22 | 43.3 | 0.279 | 0.581 | 0.847 | 15.3 % | 27.1 % |
| MS (c) | 20 | 28.3 | 0.179 | 0.613 | 0.828 | 15.5 % | 25.4 % |
| Stage for LC (b) | 32 | 45.3 | 0.318 | 0.520 | 0.552 | 36.3 % | 40.1 % |
| Stage for LC (c) | 26 | 45.3 | 0.274 | 0.525 | 0.566 | 35.8 % | 40.4 % |
| MC for LC (b) | 22^a | 47.3 | 0.337 | 0.411 | 0.510 | -- | -- |
| MC for LC (c) | 31^a | 51.3 | 0.334 | 0.410 | 0.512 | -- | -- |
Note: (b): using the binary values indicating if two genes are connected or not; (c): using the confidence scores for the gene connectivity. MS: the multiple sclerosis application; Stage for LC: the NSCLC stage application trained on the RNA-Seq data; MC for LC: the NSCLC multiple-class application. Rand (gene): the rand index at the gene level, across the gene lists obtained from 10-fold cross-validation data; Rand (GS): the rand index at the gene set level
^a The number of selected genes for the stage segmentation; the number of selected genes for the subtype segmentation is > 300
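
To make the idea of a Rand index across cross-validation gene lists more concrete, here is a minimal Python sketch of one standard reading. Each gene list is treated as a two-group partition (selected versus not selected) of a common gene universe, and the Rand index is the fraction of gene pairs on which two partitions agree. The exact formula used for the table is not given in this section, so this is an assumption and its values need not match the reported percentages. All names and data are hypothetical.

```python
from itertools import combinations


def rand_index(selected_a, selected_b, universe):
    """Rand index between two selected/not-selected partitions of `universe`."""
    a, b = set(selected_a), set(selected_b)
    agree = 0
    total = 0
    for g, h in combinations(universe, 2):
        same_in_a = (g in a) == (h in a)  # same group under list A?
        same_in_b = (g in b) == (h in b)  # same group under list B?
        agree += same_in_a == same_in_b
        total += 1
    return agree / total


def mean_pairwise_rand(gene_lists, universe):
    """Average Rand index over all pairs of gene lists (e.g. one per CV fold)."""
    pairs = list(combinations(gene_lists, 2))
    return sum(rand_index(x, y, universe) for x, y in pairs) / len(pairs)


# Toy example (hypothetical): five genes, three fold-specific gene lists.
universe = ["g1", "g2", "g3", "g4", "g5"]
folds = [["g1", "g2"], ["g1", "g3"], ["g1", "g2"]]

print(rand_index(folds[0], folds[1], universe))  # 0.4
print(round(mean_pairwise_rand(folds, universe), 3))  # 0.6
```

The pairwise loop is quadratic in the size of the universe, which is acceptable for a sketch; a contingency-table formula would be preferable for genome-scale gene lists.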