Edward R Dougherty, Jianping Hua, Chao Sima.
Abstract
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used, and what is the optimal number of features that should be used? These criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.
Year: 2009 PMID: 20190952 PMCID: PMC2766788 DOI: 10.2174/138920209789177629
Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236
Feature Selection Studies. The Sample Size column gives the total sample size; the actual number used for training depends on the criterion function used.
| Paper | Data Set Name | # of Classes | # of Features | Sample Size | Criterion Function |
|---|---|---|---|---|---|
| Jain and Zongker 1997 | Kittler’s synthetic data | 2 | 20 | 2000 | Mahalanobis |
| | SAR data | 2 | 18 | ~11000 | Split |
| Kudo and Sklansky 2000 | SAR | 3 | 10 | 285 | LOO |
| | Vehicle | 4 | 18 | ~800 | 1 x CV-9 |
| | Mammogram (small) | 2 | 19 | 86 | LOO |
| | Kittler’s synthetic data | 2 | 20 | 2000 | Mahalanobis |
| | Mushroom (small) | 2 | 29 | 1000 | LOO |
| | Sonar (small) | 2 | 40 | 208 | LOO |
| | Sonar (large) | 2 | 60 | 208 | LOO |
| | Mammogram (large) | 2 | 65 | 86 | LOO |
| Kestler and Müssel 2006 | Golub Data Set | 2 | 3051 | 72 | LOO |
| | Khan Data Set | 4 | 2308 | 63 | |
| | Diagnostic Chip Data Set | 2 | 169 | 62 | |
| Jeffery | DLBCL | 2 | 7129 | 77 | Resubstitution |
| | Prostate | 2 | 12625 | 102 | |
| | Colon | 2 | 2000 | 62 | |
| | Leukaemia (Golub Data Set) | 2 | 7129 | 72 | |
| | Myeloma | 2 | 12625 | 173 | |
| | ALL.1 | 2 | 12628 | 128 | |
| | ALL.2 | 2 | 12628 | 125 | |
| | ALL.3 | 2 | 12628 | 100 | |
| | ALL.4 | 2 | 12628 | 93 | |
The criterion functions are as follows. Mahalanobis: Mahalanobis distance; Split: the data are split equally into training and testing sets; LOO: leave-one-out cross-validation; m x CV-n: n-fold cross-validation repeated m times; m x Hold-out n: n sample points are held out, with testing on the remaining points, repeated m times; Resubstitution: the resubstitution method, in which the error is estimated on the training sample itself.
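To make the criterion functions in the table concrete, the following is a minimal sketch of four of them in Python with NumPy: resubstitution, m x CV-n, LOO, and the Mahalanobis distance between two classes. The nearest-mean classifier, the synthetic Gaussian data, and all function names are assumptions chosen for illustration; none of them comes from the surveyed papers, which used their own classifiers and data sets.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Return the class labels and one mean vector per class."""
    labels = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, means

def nearest_mean_predict(model, X):
    """Assign each row of X to the class with the closest mean."""
    labels, means = model
    # Squared Euclidean distance from every point to every class mean.
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return labels[d2.argmin(axis=1)]

def resubstitution_error(X, y):
    """Resubstitution: train and test on the same sample."""
    model = nearest_mean_fit(X, y)
    return float(np.mean(nearest_mean_predict(model, X) != y))

def cv_error(X, y, n_folds, m_repeats=1, seed=0):
    """m x CV-n: n-fold cross-validation repeated m times, then averaged."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(m_repeats):
        folds = np.array_split(rng.permutation(n), n_folds)
        wrong = 0
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False          # hold this fold out of training
            model = nearest_mean_fit(X[train], y[train])
            pred = nearest_mean_predict(model, X[test_idx])
            wrong += int(np.sum(pred != y[test_idx]))
        estimates.append(wrong / n)
    return float(np.mean(estimates))

def loo_error(X, y):
    """LOO: cross-validation with as many folds as sample points."""
    return cv_error(X, y, n_folds=len(y), m_repeats=1)

def mahalanobis_distance(X, y):
    """Mahalanobis distance between the means of classes 0 and 1,
    using the pooled sample covariance (two-class case only)."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = X0.mean(axis=0) - X1.mean(axis=0)
    c0 = np.atleast_2d(np.cov(X0, rowvar=False))
    c1 = np.atleast_2d(np.cov(X1, rowvar=False))
    pooled = ((len(X0) - 1) * c0 + (len(X1) - 1) * c1) / (len(y) - 2)
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

# Synthetic two-class Gaussian sample (illustrative values only).
rng = np.random.default_rng(1)
n_per_class, d = 30, 5
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, d)),
               rng.normal(0.8, 1.0, (n_per_class, d))])
y = np.repeat([0, 1], n_per_class)

print("resubstitution:", resubstitution_error(X, y))
print("LOO:           ", loo_error(X, y))
print("10 x CV-5:     ", cv_error(X, y, n_folds=5, m_repeats=10))
print("Mahalanobis:   ", mahalanobis_distance(X, y))
```

The three error estimators score a feature set by an estimated classification error (lower is better), whereas the Mahalanobis distance scores it by class separation (higher is better) and involves no held-out data. This difference is one reason the number of points actually used for training depends on the criterion function.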