Malik Yousef, Segun Jung, Louise C Showe, Michael K Showe.
Abstract
BACKGROUND: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure that selects significant genes by recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures that use RFE.
Year: 2007 PMID: 17474999 PMCID: PMC1877816 DOI: 10.1186/1471-2105-8-144
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary results for the SVM-RCE, SVM-RFE and PDA-RFE methods applied to 6 public datasets. The #c field is the number of clusters for the SVM-RCE method. The #g field is the number of genes in the associated #c clusters for SVM-RCE, while for SVM-RFE and PDA-RFE it indicates the number of genes used.
| Method | #c | #g | ACC | #c | #g | ACC | #c | #g | ACC | #c | #g | ACC | #c | #g | ACC | #c | #g | ACC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVM-RCE | 2 | 12 | 99% | 2 | 8 | 100% | 2 | 8 | 91% | 2 | 8 | 100% | 2 | 9 | 100% | 2 | 8 | 87% |
| | 3 | 32 | 98% | 9 | 32 | 100% | 9 | 34 | 96% | 8 | 32 | 100% | 6 | 32 | 100% | 11 | 36 | 95% |
| | 28 | 100 | 97% | 32 | 101 | 100% | 28 | 104 | 96% | 28 | 103 | 100% | 25 | 103 | 100% | 32 | 100 | 93% |
| SVM-RFE | | 11 | 96% | | 9 | 89% | | 8 | 84% | | 8 | 92% | | 8 | 98% | | 8 | 93% |
| | | 32 | 96% | | 32 | 94% | | 32 | 85% | | 32 | 90% | | 32 | 98% | | 36 | 95% |
| | | 102 | 97% | | 102 | 100% | | 102 | 87% | | 102 | 90% | | 102 | 98% | | 102 | 94% |
| PDA-RFE | | 8 | 96% | | 8 | 92% | | 8 | 83% | | 8 | 89% | | 8 | 70% | | 8 | 94% |
| | | 32 | 96% | | 32 | 92% | | 33 | 81% | | 31 | 96% | | 32 | 98% | | 32 | 94% |
| | | 104 | 96% | | 104 | 95% | | 108 | 79% | | 109 | 96% | | 102 | 98% | | 104 | 90% |
Figure 1. Classification performance of SVM-RCE on Head & Neck vs. Lung tumors (I). All values are averages over 100 iterations of SVM-RCE. ACC is the accuracy, TP is the sensitivity, and TN is the specificity of the remaining genes, determined on the test set. Avg is the average accuracy of the individual clusters at each cluster level, determined on the test set. The average accuracy increases as low-information clusters are eliminated. The x-axis shows the average number of genes contained in the clusters.
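As a small illustration of the three quantities reported in Figure 1, the sketch below computes accuracy, sensitivity and specificity from true and predicted binary labels. The function name `classification_metrics`, the label encoding (1 = positive class) and the example labels are assumptions made for illustration only; they are not data or code from the study.

```python
from typing import Sequence, Tuple


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical test-set labels (not taken from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc, sens, spec = classification_metrics(y_true, y_pred)
print(f"ACC={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Averaging these values over repeated train/test splits would give per-level summaries of the kind plotted in the figure.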
Figure 2. Hierarchical clustering of CTCL(I) on the top 20 genes from SVM-RFE and SVM-RCE. (a) Hierarchical clustering on the top 20 genes from SVM-RFE. (b) Hierarchical clustering on the top 20 genes (~4 clusters) from SVM-RCE. Sample names that start with S are CTCL patients, while those that start with C are controls. LT = long term, ST = short term.
Figure 3. Description of the SVM-RCE algorithm. The flowchart of the SVM-RCE algorithm consists of three main steps: the Cluster step, which clusters the genes; the SVM scoring step, which assesses the significance of each cluster; and the RCE step, which removes clusters with low scores.
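The three steps named in the Figure 3 caption can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the clustering and scoring routines are passed in as callables, and the parameter names (`n_start`, `n_stop`, `drop_fraction`) are hypothetical. In the demo, `chunk_clusters` is a toy stand-in for a real gene-clustering method, and `mean_relevance` is a toy stand-in for an SVM-based cluster score.

```python
import math
from typing import Callable, Dict, List, Sequence

Clusters = Dict[int, List[int]]


def recursive_cluster_elimination(
    genes: Sequence[int],
    cluster_genes: Callable[[Sequence[int], int], Clusters],
    score_cluster: Callable[[List[int]], float],
    n_start: int,
    n_stop: int,
    drop_fraction: float = 0.1,
) -> List[int]:
    """Return the genes that survive repeated removal of low-scoring clusters."""
    if n_stop < 1 or n_start < n_stop:
        raise ValueError("require 1 <= n_stop <= n_start")
    surviving = list(genes)
    n_clusters = n_start
    while True:
        # Cluster step: group the surviving genes.
        clusters = cluster_genes(surviving, n_clusters)
        if len(clusters) <= n_stop:
            return surviving
        # Scoring step: one score per cluster (an SVM-based score in SVM-RCE).
        scores = {cid: score_cluster(members) for cid, members in clusters.items()}
        # RCE step: drop the lowest-scoring clusters, always at least one.
        n_drop = max(1, int(drop_fraction * len(clusters)))
        n_keep = max(n_stop, len(clusters) - n_drop)
        kept = sorted(scores, key=scores.get, reverse=True)[:n_keep]
        surviving = [g for cid in kept for g in clusters[cid]]
        n_clusters = n_keep


def chunk_clusters(genes: Sequence[int], k: int) -> Clusters:
    """Toy stand-in for clustering: split genes into k contiguous chunks."""
    genes = list(genes)
    size = math.ceil(len(genes) / k)
    chunks = (genes[i * size:(i + 1) * size] for i in range(k))
    return {i: chunk for i, chunk in enumerate(chunks) if chunk}


# Hypothetical per-gene relevance: genes 0-5 informative, 6-11 not.
relevance = {g: (0.9 if g < 6 else 0.1) for g in range(12)}


def mean_relevance(members: List[int]) -> float:
    """Toy stand-in for an SVM-based cluster score."""
    return sum(relevance[g] for g in members) / len(members)


selected = recursive_cluster_elimination(
    range(12), chunk_clusters, mean_relevance,
    n_start=4, n_stop=2, drop_fraction=0.5,
)
print(selected)
```

Passing the clustering and scoring steps as callables keeps the elimination loop independent of any particular clustering algorithm or classifier, so either can be swapped without touching the loop.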