Abstract
BACKGROUND: Recursive Feature Elimination (RFE) is a common and well-studied method for reducing the number of attributes used for further analysis or development of prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle in using it is the amount of computational power required.
Year: 2006 PMID: 17118133 PMCID: PMC1683561 DOI: 10.1186/1471-2105-7-S2-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A Linear Support Vector Machine [16]. In its simplest, linear form, an SVM is a hyperplane that separates two classes of examples (positive and negative) with maximum margin. The margin is defined by the distance from the hyperplane to the nearest of the data points.
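The margin definition in this caption can be made concrete with a small calculation. The sketch below is illustrative only: the hyperplane coefficients and the four 2-D points are made-up values, and the function name `margin` is hypothetical. It measures the margin of a given hyperplane; it does not search for the maximum-margin hyperplane as an SVM solver would.

```python
import math

def margin(w, b, points):
    """Smallest distance from the hyperplane w.x + b = 0 to any of the points."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x in points
    )

# Made-up toy data: two positive and two negative examples in 2-D.
positives = [(3.0, 2.0), (4.0, 4.0)]
negatives = [(1.0, 0.0), (0.0, 0.0)]

# Hypothetical separating hyperplane x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0
m = margin(w, b, positives + negatives)
print(f"margin = {m:.4f}")
```

Here the nearest points on each side, (3, 2) and (1, 0), lie at the same distance from the hyperplane, which is the situation a maximum-margin separator produces for its support vectors.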
Figure 2. Comparison of RFE and RFE-Annealing in terms of Prediction Rate on SJCRH data. This figure shows the results of comparing RFE and RFE-Annealing using the St. Jude Children's Research Hospital ALL study. Accuracy rates on the hidden test set are very high for both gene selection algorithms across all gene set sizes up to 200.
Figure 3. Comparison of SQRT-RFE and RFE-Annealing in terms of Prediction Rate on SJCRH data. This figure shows the results of comparing SQRT-RFE and RFE-Annealing. The two are very comparable in accuracy on the hidden test data.
Figure 4. Comparison of RFE and RFE-Annealing in terms of Prediction Rate based on Bhattacharjee Data. Both algorithms do well on the hidden test data from the Bhattacharjee set. RFE does slightly better when more than 100 genes are selected, and RFE-Annealing does slightly better when fewer than 100 genes are used.
Figure 5. Comparison of SQRT-RFE and RFE-Annealing in terms of Prediction Rate based on Bhattacharjee Data. The accuracy rates are very similar when comparing SQRT-RFE and RFE-Annealing on the hidden test data from the Bhattacharjee set.
Figure 6. Comparison of RFE, RFE-Annealing and SQRT-RFE in terms of Prediction Rate based on Alon's Data. On the smaller Alon data set, all three gene selection methods yielded the same accuracy on the hidden test data when 10 or more genes were selected. The largest gene set selected was 50 because this data set has fewer genes and samples.
Comparison of RFE, RFE-Annealing and SQRT-RFE algorithms in terms of time
| Data | Number of samples | RFE | RFE-Annealing | SQRT-RFE |
| St.Jude | 246 | 58 hours | 26 minutes | 60 minutes |
| Bhattacharjee | 203 | 27 hours | 20 minutes | 38 minutes |
| Alon | 62 | 6.3 minutes | 0.5 minute | 1 minute |
This table shows that RFE-Annealing ran faster than SQRT-RFE, and far faster than RFE, on all three data sets. When data sets with a large number of samples and a large number of genes (e.g. SJCRH and Bhattacharjee) are used, the difference is substantial.
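One way to see why the running times differ so much is to count how many times a classifier must be retrained under each elimination schedule. The sketch below is a rough illustration under stated assumptions, not the authors' implementation. The schedules are assumptions, since none of them is spelled out in the text here: RFE drops one gene per iteration, SQRT-RFE drops the square root of the remaining genes (as the name suggests), and RFE-Annealing drops a fraction 1/(i+1) of the remaining genes at iteration i. The starting gene count of 12000 is hypothetical.

```python
import math

def rounds(n_features, target, step):
    """Count the training rounds needed to shrink n_features down to target.

    step(remaining, iteration) returns how many features to drop next;
    one classifier fit is assumed per elimination round.
    """
    remaining, iteration = n_features, 0
    while remaining > target:
        iteration += 1
        drop = max(1, int(step(remaining, iteration)))
        # Never overshoot the requested final size.
        remaining -= min(drop, remaining - target)
    return iteration

def one_at_a_time(remaining, iteration):
    return 1  # assumed plain RFE schedule

def sqrt_step(remaining, iteration):
    return math.sqrt(remaining)  # assumed SQRT-RFE schedule

def annealing_step(remaining, iteration):
    return remaining / (iteration + 1)  # assumed RFE-Annealing schedule

n_genes, final_size = 12000, 200  # hypothetical sizes
counts = {
    "RFE": rounds(n_genes, final_size, one_at_a_time),
    "SQRT-RFE": rounds(n_genes, final_size, sqrt_step),
    "RFE-Annealing": rounds(n_genes, final_size, annealing_step),
}
for name, count in counts.items():
    print(f"{name}: {count} training rounds")
```

Under these assumed schedules the round counts fall in the same order as the measured times in the table, although wall-clock time also depends on how many genes remain in each round.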
Pathways associated with the selected genes in SJCRH data
| Pathway | RFE | RFE-Annealing | SQRT-RFE |
| Apoptosis | 2 | 2 | 2 |
| Apoptosis_GenMAPP | 3 | 3 | 3 |
| Apoptosis_KEGG | 2 | 2 | 2 |
| Arginine and proline metabolism | 2 | 2 | 2 |
| Biosynthesis of steroids | 1 | 1 | 1 |
| Calcium signaling pathway | 19 | 18 | 17 |
|  | 6 | 4 | 4 |
| Cell_cycle_KEGG | 8 | 8 | 9 |
| Cholesterol_Biosynthesis | 1 | 1 | 1 |
| Circadian_Exercise | 2 | 2 | 2 |
| DNA_replication_Reactome | 0 | 0 | 1 |
| Electron_Transport_Chain | 1 | 1 | 1 |
| Fructose and mannose metabolism | 1 | 1 | 1 |
| G1_to_S_cell_cycle_Reactome | 6 | 6 | 7 |
|  | 6 | 4 | 4 |
| Galactose metabolism | 1 | 1 | 1 |
| Glutathione metabolism | 1 | 1 | 1 |
| Glycerolipid metabolism | 1 | 1 | 1 |
| Glycerophospholipid metabolism | 3 | 3 | 3 |
| Glycine, serine and threonine metabolism | 2 | 2 | 2 |
| Glycolysis/Gluconeogenesis | 1 | 1 | 1 |
| Glycolysis_and_Gluconeogenesis | 1 | 1 | 1 |
| GPCRDB_Class_A_Rhodopsin-like | 0 | 1 | 0 |
| Hypertrophy_model | 1 | 1 | 1 |
|  | 1 | 1 | 4 |
|  | 4 | 4 | 0 |
| Integrin-mediated_cell_adhesion_KEGG | 4 | 4 | 4 |
| MAPK_Cascade | 1 | 1 | 1 |
| mRNA_processing_Reactome | 2 | 2 | 3 |
| Nicotinate and nicotinamide metabolism | 1 | 1 | 1 |
| Ovarian_Infertility_Genes | 3 | 3 | 3 |
| Oxidative phosphorylation | 1 | 1 | 1 |
| Pentose phosphate pathway | 1 | 1 | 1 |
| Phosphatidylinositol signaling system | 4 | 4 | 3 |
| Prostaglandin_synthesis_regulation | 1 | 0 | 1 |
| Proteasome_Degradation | 1 | 1 | 1 |
| Purine metabolism | 3 | 2 | 2 |
| Pyrimidine metabolism | 2 | 2 | 2 |
|  | 12 | 10 | 10 |
| Statin_Pathway_PharmGKB | 0 | 0 | 1 |
| TGF_Beta_Signaling_Pathway | 4 | 4 | 4 |
| Terpenoid biosynthesis | 1 | 1 | 1 |
| Type I diabetes mellitus | 4 | 5 | 5 |
| Ubiquitin mediated proteolysis | 2 | 2 | 2 |
| Urea cycle and metabolism of amino groups | 2 | 2 | 2 |
| Wnt_signaling | 3 | 3 | 2 |
This table lists the known pathway associations of the 200-gene sets selected by each algorithm from the SJCRH data. Each value is the number of selected genes associated with that pathway. In general, there was considerable overlap in the genes selected (over 90%). The pathways with the least consistency are shown in boldface. Pathway data were derived using NetAffx [13].
Figure 7. Model Design. This figure describes the methodology used in the testing, which employs a wrapper technique similar to that described in [17]. The data sets were first divided, in a stratified way, into training and hidden test data (80% training and 20% test, except for Alon, which was split 75%/25% because it was smaller). The weight vector from the SVM with a linear kernel was used to identify the gene(s) to remove. The training data were projected down to just the specified subset, and a classifier was constructed using a linear SVM. The test data were then projected onto the same feature set and tested using the classifier built from the training data. This was repeated for each data set, for each algorithm, and for each feature-set size from 1 to the maximum number of features selected.
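The split, rank, and project steps described in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, the data are made up, and `mean_difference` is a simple stand-in for the weight vector of a linear SVM (it uses the gap between per-feature class means instead). Training and scoring the final classifier on the projected test data is left out.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)

def project(rows, features):
    """Keep only the selected feature columns of each sample."""
    return [[row[j] for j in features] for row in rows]

def eliminate(rows, labels, weight_fn, keep):
    """Drop the feature with the smallest |weight| until `keep` remain."""
    features = list(range(len(rows[0])))
    while len(features) > keep:
        w = weight_fn(project(rows, features), labels)
        worst = min(range(len(features)), key=lambda k: abs(w[k]))
        del features[worst]
    return features

def mean_difference(rows, labels):
    """Stand-in for linear SVM weights: per-feature gap between class means."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y != 1]
    return [sum(p) / len(pos) - sum(n) / len(neg)
            for p, n in zip(zip(*pos), zip(*neg))]

# Made-up data: 8 samples, 3 features; only feature 0 separates the classes.
rows = [[5.0, 1.0, 0.3], [5.2, 1.1, 0.1], [4.8, 0.9, 0.2], [5.1, 1.0, 0.4],
        [0.1, 1.0, 0.2], [0.0, 1.1, 0.3], [0.2, 0.9, 0.1], [0.1, 1.0, 0.4]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_idx, test_idx = stratified_split(labels, train_fraction=0.75)
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]

selected = eliminate(train_rows, train_labels, mean_difference, keep=1)
test_rows = project([rows[i] for i in test_idx], selected)
print("selected features:", selected)
```

Passing the ranking function as an argument keeps the elimination loop independent of the classifier, so a function returning real linear SVM weights could be substituted without changing the rest of the sketch.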