Suyan Tian, Chi Wang, Howard H Chang.
Abstract
BACKGROUND: Feature selection and gene set analysis are of increasing interest in the field of bioinformatics. While these two approaches have been developed for different purposes, we describe how some gene set analysis methods can be utilized to conduct feature selection.
Keywords: Core subset; Feature selection; Gene set analysis; Longitudinal microarray data; Significance analysis of microarray (SAM)
Year: 2018 PMID: 30526581 PMCID: PMC6284265 DOI: 10.1186/s12911-018-0685-8
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1. Flowchart illustrating the longitudinal SAMGSR algorithm
Fig. 2. Graphical presentation illustrating how the performance statistics are calculated
Performance of SAMGSR extension and other relevant algorithms on the injury data
| Method | # of genes | 5-fold CV: Error | 5-fold CV: GBS | 5-fold CV: BCM | 5-fold CV: AUPR | Test set: Error | Test set: GBS | Test set: BCM | Test set: AUPR |
|---|---|---|---|---|---|---|---|---|---|
| L-SAMGSR1 | 97 | 0.442 | 0.268 | 0.515 | 0.576 | 0.356 | 0.230 | 0.535 | 0.622 |
| EDGE1 | 1083 | 0.442 | 0.281 | 0.511 | 0.526 | 0.407 | 0.234 | 0.514 | 0.594 |
| SAMGSR separately^a | > 400 | 0.419 | 0.246 | 0.510 | 0.559 | 0.428 | 0.243 | 0.511 | 0.553 |
| P-SVM separately | > 1000 | 0.488 | 0.281 | 0.477 | 0.454 | 0.441 | 0.244 | 0.511 | 0.560 |
| LASSO separately | 147 | 0.465 | 0.261 | 0.497 | 0.498 | 0.407 | 0.237 | 0.509 | 0.580 |
Note: ^a The posterior probabilities were calculated using an SVM classifier. The q-value cutoff in the SAM-GS part is set at 0.05. "# of genes" is the size of the union of the individual genes selected at each time point. CV: cross-validation.
Fig. 3. Venn diagram illustrating how the genes selected by the longitudinal SAMGSR method overlap at different time points
Performance of the longitudinal SAMGSR on simulated data
| Simulation | Statistic | Time 1 | Time 2 | Time 3 | Time 4 | Time 5 |
|---|---|---|---|---|---|---|
| Simulation 1 (Ave. # 32.06) | # of genes | 19.84 | 19.14 | 13.68 | 9.30 | 11.00 |
| | F13A1 (%) | 72 | 100 | 100 | 92 | 68 |
| | GSTM1 (%) | 0 | 0 | 62 | 22 | 0 |
| Simulation 2 (Ave. # 291.98) | # of genes | 182.38 | 56.18 | 35.44 | 30.94 | 123.84 |
| | COX4I2 (%) | 96 | 0 | 0 | 0 | 4 |
| | RP9 (%) | 10 | 4 | 4 | 6 | 96 |
Note: Ave. # represents the average number of the union of individual genes selected at each time point over 50 simulated datasets; # of genes represents the average number of individual genes selected at the specific time point over 50 simulated datasets; % represents the percentage of the corresponding true causal gene being selected by the algorithm over 50 simulated datasets
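
The three summary statistics defined in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the input layout (one dictionary per simulated dataset, mapping a time-point label to the set of selected gene symbols) and the name `summarize_selections` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical input layout (an assumption, not from the source): one entry per
# simulated dataset, mapping a time-point label to the set of selected genes.
Selection = Dict[str, Set[str]]


def summarize_selections(runs: List[Selection], causal_gene: str) -> dict:
    """Summarize per-time-point gene selections over simulated datasets."""
    n_runs = len(runs)
    time_points = sorted(runs[0])

    # "Ave. #": average size of the union of genes selected across time points.
    ave_union = sum(len(set().union(*run.values())) for run in runs) / n_runs

    # "# of genes": average number of genes selected at each time point.
    ave_genes = {
        t: sum(len(run[t]) for run in runs) / n_runs for t in time_points
    }

    # "%": percentage of datasets in which the causal gene was selected
    # at each time point.
    pct_selected = {
        t: 100.0 * sum(causal_gene in run[t] for run in runs) / n_runs
        for t in time_points
    }

    return {
        "ave_union": ave_union,
        "ave_genes": ave_genes,
        "pct_selected": pct_selected,
    }
```

Calling `summarize_selections(runs, "F13A1")` on the 50 selection results of one simulation would give the quantities in the corresponding rows of the table. Sets are used so that a gene selected at several time points is counted once in the union.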