Lei Zhang, Linlin Wang, Pu Tian, Suyan Tian.
Abstract
The focus of microarray data analysis has shifted from identifying individual associated genes to identifying associated biological pathways or gene sets. In bioinformatics, a feature selection algorithm is usually used to cope with the high dimensionality of microarray data. In addition to algorithms that use the biological information contained within a gene set as a priori knowledge to facilitate feature selection, various gene set analysis methods can be applied directly, or readily modified, for feature selection. The significance analysis of microarray to gene-set reduction analysis (SAM-GSR) algorithm, a novel direction in gene set analysis, is one such method. Here, we explore the feature selection property of SAM-GSR and propose a modification that better achieves the goal of feature selection. In an application to multiple sclerosis (MS) microarray data, both SAM-GSR and our modification perform well. Our results show that SAM-GSR can indeed carry out feature selection and that the modified SAM-GSR outperforms SAM-GSR. Because pathway information is far from complete, a statistical method capable of constructing biologically meaningful gene networks is of interest. Consequently, both SAM-GSR algorithms will continue to be re-evaluated in our future work and thus better characterized.
Year: 2016 PMID: 27846233 PMCID: PMC5112852 DOI: 10.1371/journal.pone.0165543
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Graphical illustration of the SAM-GSR and modified SAM-GSR algorithms.
A. The SAM-GSR algorithm. B. The modified SAM-GSR algorithm.
Fig 2. Study schema.
Graphical illustration of how the multiple sclerosis (MS) microarray data were analyzed.
Fig 3. Pathways and genes selected by both SAM-GSR algorithms using pathways in the MSigDB c2 category.
Gene symbols in purple are genes that the GeneCards database indicates as directly related to MS. Gene symbols selected by both the SAM-GSR and modified SAM-GSR algorithms are in bold.
Fig 4. Pathways and genes selected by both SAM-GSR algorithms using pathways in the MSigDB c5 category.
Gene symbols in purple are genes that the GeneCards database indicates as directly related to MS. Gene symbols selected by both the SAM-GSR and modified SAM-GSR algorithms are in bold.
Performance statistics of selected genes using E-MTAB-69 as the training set.
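| Method (size) | E-MTAB-69: Error (%) | E-MTAB-69: GBS | E-MTAB-69: BCM | E-MTAB-69: AUPR | sbv Improver test set: Error (%) | sbv Improver test set: GBS | sbv Improver test set: BCM | sbv Improver test set: AUPR |
|---|---|---|---|---|---|---|---|---|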
| C2: SAM-GSR (18) | 20.45 | 0.121 | 0.701 | 0.896 | 46.67 | 0.464 | 0.500 | 0.644 |
| C2: M-SAM-GSR (271) | 0 | 0.066 | 0.747 | 0.992 | 46.67 | 0.291 | 0.520 | 0.612 |
| C2: L1 as penalty (112) | 0 | 0.083 | 0.719 | 0.992 | 33.33 | 0.207 | 0.564 | 0.776 |
| C5: SAM-GSR (8) | 13.64 | 0.134 | 0.673 | 0.904 | 46.67 | 0.464 | 0.500 | 0.579 |
| C5: M-SAM-GSR (40) | 0 | 0.046 | 0.800 | 0.992 | 43.33 | 0.365 | 0.577 | 0.703 |
Note: C2 denotes analyses using the pathways in the MSigDB c2 category; C5 denotes analyses using the pathways in the MSigDB c5 category. M-SAM-GSR stands for the modified SAM-GSR algorithm. GBS: Generalized Brier Score; BCM: Belief Confusion Metric; AUPR: Area Under the Precision-Recall Curve.
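The error rate and GBS columns can be illustrated with a minimal Python sketch. It assumes a commonly used form of the Generalized Brier Score, the sum of squared differences between one-hot class indicators and predicted class probabilities divided by 2n; the source does not state the exact formula, and the function names here are hypothetical. BCM is left out because its definition is not given in the source.

```python
def error_rate(y_true, probs):
    """Fraction of samples whose most probable class differs from the true label."""
    wrong = 0
    for y, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=lambda k: p[k])  # index of largest probability
        wrong += predicted != y
    return wrong / len(y_true)


def generalized_brier_score(y_true, probs):
    """Assumed definition: sum_i sum_k (y_ik - p_ik)^2 / (2n), bounded by [0, 1]."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, probs):
        for k, p_k in enumerate(p):
            indicator = 1.0 if k == y else 0.0  # one-hot encoding of the true class
            total += (indicator - p_k) ** 2
    return total / (2 * n)


# Hypothetical toy data: 4 samples, 2 classes, hard 0/1 predictions.
labels = [0, 1, 1, 0]
hard_probs = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(error_rate(labels, hard_probs))
print(generalized_brier_score(labels, hard_probs))
```

Under this assumed definition, a classifier that outputs hard 0/1 probabilities has a GBS equal to its error rate, while well-calibrated soft probabilities can lower the GBS without changing the error rate.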
Performance statistics of selected genes using the sbv Improver MS data as the training set.
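| Method (size) | sbv Improver test set: Error (%) | sbv Improver test set: GBS | sbv Improver test set: BCM | sbv Improver test set: AUPR | E-MTAB-69: Error (%) | E-MTAB-69: GBS | E-MTAB-69: BCM | E-MTAB-69: AUPR |
|---|---|---|---|---|---|---|---|---|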
| C2: SAM-GSR (257) | 0 | 0.054 | 0.772 | 0.995 | 42.73 | 0.296 | 0.486 | 0.483 |
| C2: M-SAM-GSR (111) | 0 | 0.020 | 0.901 | 0.995 | 59.09 | 0.316 | 0.501 | 0.516 |
| C5: SAM-GSR (204) | 0 | 0.046 | 0.793 | 0.995 | 54.55 | 0.337 | 0.457 | 0.422 |
| C5: M-SAM-GSR (72) | 0 | <0.001 | 0.993 | 0.995 | 40.91 | 0.409 | 0.501 | 0.750 |
Note: C2 denotes analyses using the pathways in the MSigDB c2 category; C5 denotes analyses using the pathways in the MSigDB c5 category. M-SAM-GSR stands for the modified SAM-GSR algorithm. GBS: Generalized Brier Score; BCM: Belief Confusion Metric; AUPR: Area Under the Precision-Recall Curve.
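As a rough illustration of the AUPR column, the sketch below computes average precision, one common step-wise estimator of the area under the precision-recall curve. The source does not say which estimator was used, so this is an assumption, and the function name is hypothetical.

```python
def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    y_true holds 0/1 labels (1 = positive class) and scores holds the predicted
    scores for the positive class. Tied scores are not grouped, so samples with
    equal scores are ranked in input order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy data: positives are ranked 1st and 3rd.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Other estimators, such as trapezoidal interpolation of the curve, can give slightly different values on small test sets, so AUPR figures are only comparable when computed the same way.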
Comparison with other relevant signatures on the sbv Improver set.
| Study (size) | Training data used | Error (%) | GBS | BCM | AUPR |
|---|---|---|---|---|---|
| SAM-GSR (8) | E-MTAB-69 | 46.67 | 0.464 | 0.500 | 0.579 |
| M-SAM-GSR (40) | E-MTAB-69 | 43.33 | 0.365 | 0.577 | 0.703 |
| Lauria (n>100) | E-MTAB-69 | — | — | 0.884 | 0.874 |
| Tarca (n = 2) | GSE21942 (on Human Gene 1.0 ST) | — | — | 0.629 | 0.819 |
| Zhao (n = 58) | 7 other data sets in addition to E-MTAB-69 | 30 | — | 0.576 | 0.820 |
| Zhao (n = 84) | 7 other data sets in addition to E-MTAB-69 | 35 | — | 0.549 | 0.636 |
| Tian (n = 28) | 5 other data sets in addition to E-MTAB-69 | 68.33 | 0.546 | 0.345 | 0.362 |
| Tian (n = 38) | E-MTAB-69 | 38.33 | 0.290 | 0.559 | 0.593 |
| Guo (n = 8) | E-MTAB-69 | 46.67 | 0.462 | 0.499 | 0.504 |
Note: M-SAM-GSR stands for the modified SAM-GSR algorithm. GBS: Generalized Brier Score; BCM: Belief Confusion Metric; AUPR: Area Under the Precision-Recall Curve; —: not available.
(*) The predictive statistics on the test set for Guo's study were calculated from the 8-gene signature provided in their article.
(1) Our original submission to sbv IMPROVER, which used the TGDR algorithm, ranked around 30th among 54 participants.
(2) We trained TGDR on E-MTAB-69 to evaluate whether different training sets lead to different performance of an algorithm.
(a) Zhao et al. used the elastic net to select individual genes; this submission ranked third in the sbv MS subtask.
(b) Zhao et al. used the elastic net to select pseudo genes created by averaging the genes inside pathways.
Performance statistics for the lung adenocarcinoma application (evaluated on the TCGA RNA-Seq data).
| Method | Size | Error (%) | GBS | BCM | AUPR |
|---|---|---|---|---|---|
| SAM-GSR | 111 | 35.7 | 0.357 | 0.5 | 0.692 |
| M-SAM-GSR | 89 | 44.3 | 0.312 | 0.552 | 0.666 |
| SVM SCAD | 117 | 32.9 | 0.329 | 0.54 | 0.645 |
| Lasso | 84 | 52.9 | 0.528 | 0.511 | 0.504 |
| Moderated t-test | 329 | 35.7 | 0.357 | 0.5 | 0.569 |
Note: M-SAM-GSR stands for the modified SAM-GSR algorithm. GBS: Generalized Brier Score; BCM: Belief Confusion Metric; AUPR: Area Under the Precision-Recall Curve.