Carmen Lai, Marcel J T Reinders, Laura J van't Veer, Lodewyk F A Wessels.
Abstract
BACKGROUND: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. It is frequently claimed that, because genes are co-regulated (due to pathway dependencies), multivariate approaches are by definition more desirable than univariate selection approaches. However, the published performances of these approaches do not allow a fair comparison, mainly for two reasons. First, the results are often biased: the validation set is in one way or another involved in training the predictor, which yields optimistically biased performance estimates. Second, the published results are often based on a small number of relatively simple datasets. Consequently, no generally applicable conclusions can be drawn.
Year: 2006 PMID: 16670007 PMCID: PMC1569875 DOI: 10.1186/1471-2105-7-235
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Mean and standard deviation of the 10-fold cross-validation error (in percent) for the different approaches on the Affymetrix platform datasets employed in the study.
| Method | CNS | Colon | Leukemia | Prostate |
| Gene selection | mean ± std | mean ± std | mean ± std | mean ± std |
| U, SNR, NMC | 12.9 ± 4.2 * | 4.8 ± 2.7 * | 9.7 ± 4.2 * | |
| U, SNR, FLD | 42.5 ± 7.3 | 19.2 ± 5.9 | 8.0 ± 3.2 | 10.0 ± 3.0 * |
| U, t-test, NMC | 32.5 ± 4.9 * | 12.5 ± 4.2 * | 4.8 ± 2.7 * | 10.8 ± 3.4 |
| U, t-test, FLD | 35.8 ± 6.5 * | 12.0 ± 4.2 | ||
| BP greedy, FLD | 43.8 ± 6.2 | 12.9 ± 3.8 * | 11.6 ± 3.6 | 9.8 ± 3.3 * |
| FS, FLD | 47.9 ± 5.1 | 15.4 ± 4.1 | 10.2 ± 4.2 | 14.0 ± 3.4 |
| RFE, FLD | 34.2 ± 5.0 * | 22.9 ± 4.4 | 10.0 ± 2.6 * | |
| RFE, SVM | 35.4 ± 5.0 * | 22.1 ± 3.5 | 4.5 ± 2.6 * | |
| Liknon | 32.9 ± 6.1 * | 13.3 ± 4.2 * | 11.8 ± 4.0 | 10.8 ± 3.7 |
| TSP | 47.0 ± 5.6 | 5.4 ± 2.9 * | 10.6 ± 3.8 | 7.0 ± 2.6 * |
| no gene selection | mean ± std | mean ± std | mean ± std | mean ± std |
| NMC | 42.1 ± 5.5 | 17.9 ± 3.3 | 33.7 ± 3.9 | |
| FLD | 32.9 ± 6.3 * | 21.7 ± 3.7 | 4.5 ± 2.6 * | |
| SVM | 35.4 ± 7.0 * | 22.1 ± 3.5 | ||
Mean and standard deviation of the 10-fold cross-validation error (in percent) for the different approaches on the cDNA platform datasets employed in the study.
| Method | DLBCL | HNSCC | Breast |
| gene selection | mean ± std | mean ± std | mean ± std |
| U, SNR, NMC | 33.0 ± 3.4 * | ||
| U, SNR, FLD | 15.8 ± 6.4 | 33.3 ± 6.6 | |
| U, t-test, NMC | 33.5 ± 3.8 * | ||
| U, t-test, FLD | 15.8 ± 6.4 | 36.2 ± 6.2 | 32.6 ± 3.0 * |
| BP greedy, FLD | 10.0 ± 4.3 | 36.2 ± 7.0 | 35.8 ± 2.3 |
| FS, FLD | 10.8 ± 3.7 | 45.4 ± 8.5 | 35.4 ± 4.2 |
| RFE, FLD | 16.7 ± 5.3 | 35.0 ± 6.3 | 33.8 ± 3.5 |
| RFE, SVM | 15.8 ± 5.2 | 35.4 ± 7.2 | 32.6 ± 3.2 * |
| Liknon | 13.3 ± 5.3 | 37.5 ± 7.4 | 34.5 ± 5.2 |
| TSP | 27.5 ± 2.8 | 37.6 ± 6.0 | 49.9 ± 4.6 |
| no gene selection | mean ± std | mean ± std | mean ± std |
| NMC | 6.7 ± 3.5 | 29.2 ± 7.2 | 36.7 ± 3.2 |
| FLD | 14.2 ± 5.4 | 32.5 ± 6.6 | 35.8 ± 4.1 |
| SVM | 9.2 ± 3.8 | 29.6 ± 5.7 | 34.3 ± 4.2 |
Figure 1. The training-validation protocol employed to evaluate various gene selection and classification approaches, in simplified schematic format. The input is a labeled dataset, D, and the output is an estimate of the validation performance of algorithm A, denoted by P. The most important steps in the protocol are the training step (block labeled 'Train') and the validation step (block labeled 'Validate'). The training step, in turn, consists of two steps: 1) optimizing the gene selection parameter, ϕ, in an N-fold cross-validation loop, and 2) training the final classifier given the optimal setting of the selection parameter. The validation step estimates the performance of the optimally trained classifier on the completely independent validation set.
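The protocol in Figure 1 can be sketched as a nested cross-validation. The following is a minimal illustration under stated assumptions, not the authors' implementation: it uses univariate signal-to-noise-ratio (SNR) ranking with a nearest mean classifier (the "U, SNR, NMC" row of the tables), takes ϕ to be the number of top-ranked genes, assumes binary labels 0/1, and uses stratified folds. All function names (`snr_scores`, `fit`, `train`, `evaluate`) and default fold counts are hypothetical.

```python
import random
import statistics

def snr_scores(X, y):
    """Per-gene signal-to-noise ratio: |mean0 - mean1| / (sd0 + sd1)."""
    scores = []
    for g in range(len(X[0])):
        a = [x[g] for x, lab in zip(X, y) if lab == 0]
        b = [x[g] for x, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b)
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / (spread + 1e-12))
    return scores

def fit(X, y, phi):
    """Select the phi top-ranked genes, then store the class means (NMC)."""
    scores = snr_scores(X, y)
    genes = sorted(range(len(scores)), key=lambda g: -scores[g])[:phi]
    means = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == c]
        means[c] = [statistics.mean(r[g] for r in rows) for g in genes]
    return genes, means

def predict(model, x):
    """Assign x to the class whose mean is nearest on the selected genes."""
    genes, means = model
    dist = {c: sum((x[g] - m) ** 2 for g, m in zip(genes, mu))
            for c, mu in means.items()}
    return min(dist, key=dist.get)

def error_rate(model, X, y):
    return sum(predict(model, x) != lab for x, lab in zip(X, y)) / len(y)

def folds(y, k, rng):
    """Stratified folds (an assumption): shuffle, group by label, deal round-robin."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    idx.sort(key=lambda i: y[i])  # stable sort keeps the shuffled order per class
    return [idx[i::k] for i in range(k)]

def subset(X, y, idx):
    return [X[i] for i in idx], [y[i] for i in idx]

def train(X, y, phis, n_inner, rng):
    """'Train' block: choose phi by inner CV, then refit on all training data."""
    parts = folds(y, n_inner, rng)
    cv_err = {}
    for phi in phis:
        errs = []
        for held in parts:
            rest = [i for i in range(len(y)) if i not in held]
            model = fit(*subset(X, y, rest), phi)
            errs.append(error_rate(model, *subset(X, y, held)))
        cv_err[phi] = statistics.mean(errs)
    best = min(phis, key=lambda p: cv_err[p])
    return fit(X, y, best), best

def evaluate(X, y, phis, n_outer=10, n_inner=5, seed=0):
    """'Validate' block: each held-out fold is never seen during training."""
    rng = random.Random(seed)
    errs = []
    for held in folds(y, n_outer, rng):
        rest = [i for i in range(len(y)) if i not in held]
        model, _ = train(*subset(X, y, rest), phis, n_inner, rng)
        errs.append(error_rate(model, *subset(X, y, held)))
    return statistics.mean(errs), statistics.pstdev(errs)
```

The point of the structure is that gene ranking and the choice of ϕ both happen inside `train`, using only the training portion of each outer split. Ranking genes on the full dataset before splitting would let the validation samples influence the predictor, which is the source of the optimistic bias described in the abstract.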