Satoshi Niijima, Satoru Kuhara.
Abstract
BACKGROUND: In class prediction problems using microarray data, gene selection is essential both to improve prediction accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, its performance can easily be affected by noise and outliers when it is applied to noisy, small-sample-size microarray data.
Year: 2006 PMID: 17187691 PMCID: PMC1790716 DOI: 10.1186/1471-2105-7-543
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
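The recursive elimination loop that the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the default scorer is a simple per-gene difference of class means used as a stand-in for the SVM weight vector so that the example needs only the standard library.

```python
from statistics import mean


def mean_difference_weights(X, y, active):
    """Stand-in linear scorer (NOT an SVM): per-gene difference of class means."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    return {j: mean(r[j] for r in pos) - mean(r[j] for r in neg) for j in active}


def recursive_feature_elimination(X, y, n_keep, weight_fn=mean_difference_weights):
    """Refit the scorer on the surviving genes and drop the one with the smallest |weight|."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        weights = weight_fn(X, y, active)  # recomputed on the current gene subset
        worst = min(active, key=lambda j: abs(weights[j]))
        active.remove(worst)
        eliminated.append(worst)
    return active, eliminated


# Hypothetical toy data: 4 samples (rows) x 4 genes (columns), labels in {+1, -1}.
X = [
    [5.0, 1.0, 0.2, 3.0],
    [4.8, 1.2, 0.2, 2.9],
    [1.0, 1.0, 0.2, 2.0],
    [1.2, 1.0, 0.2, 2.2],
]
y = [1, 1, -1, -1]

selected, eliminated = recursive_feature_elimination(X, y, n_keep=2)
print(selected, eliminated)
```

With a univariate stand-in like this one the weights do not change between rounds, so the loop reduces to a simple ranking. With a multivariate model such as a linear SVM, the weight vector generally changes after each removal, which is why the scorer is called again on every iteration.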
Performance comparison for binary-class datasets.
| Classifier+Selection criterion | Number of genes | | | | |
| | 10 | 20 | 30 | 50 | 100 |
| Colon cancer | | | | | |
| NMC+S2N | 12.2 ± 0.6 | 12.5 ± 0.6 | 12.3 ± 0.6 | 12.8 ± 0.6 | 12.9 ± 0.5 |
| NMC+MMC-RFE(U) | 13.9 ± 0.6 | 12.5 ± 0.6 | 12.5 ± 0.6 | 12.1 ± 0.6 | 11.2 ± 0.5 |
| NMC+MMC-RFE(O) | 13.4 ± 0.6 | 11.7 ± 0.6 | 11.5 ± 0.6 | 11.3 ± 0.6 | 11.2 ± 0.6 |
| NMC+SVM-RFE(H) | 16.2 ± 0.7 | 14.6 ± 0.6 | 13.7 ± 0.6 | 12.5 ± 0.6 | 11.6 ± 0.6 |
| NMC+SVM-RFE(S) | 13.3 ± 0.6 | 11.2 ± 0.6 | 10.7 ± 0.5 | 10.5 ± 0.5 | 10.9 ± 0.5 |
| MMC+MMC-RFE(U) | 13.6 ± 0.6 | 12.1 ± 0.6 | 11.9 ± 0.6 | 11.7 ± 0.5 | 11.0 ± 0.5 |
| MMC+MMC-RFE(O) | 13.2 ± 0.6 | 11.7 ± 0.6 | 11.5 ± 0.6 | 11.0 ± 0.6 | 11.1 ± 0.6 |
| SVM+SVM-RFE(H) | 18.3 ± 0.6 | 16.2 ± 0.7 | 15.8 ± 0.7 | 15.0 ± 0.6 | 15.0 ± 0.6 |
| SVM+SVM-RFE(S) | 13.5 ± 0.5 | 10.7 ± 0.5 | 10.2 ± 0.5 | 10.0 ± 0.5 | 10.5 ± 0.6 |
| Prostate cancer | | | | | |
| NMC+S2N | 10.1 ± 0.4 | 11.3 ± 0.5 | 12.3 ± 0.6 | 13.6 ± 0.6 | 16.0 ± 0.7 |
| NMC+MMC-RFE(U) | 9.9 ± 0.5 | 10.4 ± 0.5 | 10.8 ± 0.5 | 11.6 ± 0.5 | 13.4 ± 0.6 |
| NMC+MMC-RFE(O) | 9.6 ± 0.5 | 9.9 ± 0.5 | 10.3 ± 0.6 | 11.2 ± 0.6 | 13.5 ± 0.7 |
| NMC+SVM-RFE(H) | 9.6 ± 0.4 | 10.1 ± 0.5 | 10.2 ± 0.5 | 10.8 ± 0.5 | 12.2 ± 0.6 |
| NMC+SVM-RFE(S) | 9.7 ± 0.4 | 9.6 ± 0.4 | 10.0 ± 0.5 | 10.7 ± 0.5 | 12.4 ± 0.6 |
| MMC+MMC-RFE(U) | 8.8 ± 0.4 | 8.4 ± 0.4 | 8.4 ± 0.4 | 8.4 ± 0.4 | 8.6 ± 0.4 |
| MMC+MMC-RFE(O) | 8.5 ± 0.4 | 8.2 ± 0.4 | 7.9 ± 0.4 | 7.9 ± 0.4 | 8.1 ± 0.5 |
| SVM+SVM-RFE(H) | 9.9 ± 0.5 | 9.1 ± 0.4 | 9.3 ± 0.4 | 9.2 ± 0.4 | 9.1 ± 0.4 |
| SVM+SVM-RFE(S) | 8.5 ± 0.4 | 8.0 ± 0.4 | 8.5 ± 0.4 | 8.4 ± 0.4 | 8.8 ± 0.4 |
| Leukemia | | | | | |
| NMC+S2N | 5.6 ± 0.7 | 5.8 ± 0.6 | 5.4 ± 0.6 | 3.8 ± 0.5 | 3.2 ± 0.5 |
| NMC+MMC-RFE(U) | 5.7 ± 0.6 | 3.9 ± 0.5 | 3.8 ± 0.5 | 2.2 ± 0.4 | 0.8 ± 0.2 |
| NMC+MMC-RFE(O) | 5.8 ± 0.6 | 2.8 ± 0.5 | 1.8 ± 0.4 | 0.8 ± 0.2 | 0.4 ± 0.2 |
| NMC+SVM-RFE(H) | 5.4 ± 0.6 | 3.8 ± 0.5 | 3.4 ± 0.5 | 1.8 ± 0.4 | 0.6 ± 0.2 |
| NMC+SVM-RFE(S) | 6.0 ± 0.6 | 3.1 ± 0.4 | 2.0 ± 0.4 | 1.5 ± 0.3 | 0.9 ± 0.3 |
| MMC+MMC-RFE(U) | 5.6 ± 0.6 | 3.7 ± 0.5 | 3.7 ± 0.5 | 2.3 ± 0.4 | 0.8 ± 0.3 |
| MMC+MMC-RFE(O) | 5.8 ± 0.6 | 2.8 ± 0.5 | 1.6 ± 0.3 | 0.6 ± 0.2 | 0.3 ± 0.2 |
| SVM+SVM-RFE(H) | 4.1 ± 0.5 | 3.0 ± 0.4 | 2.9 ± 0.4 | 1.3 ± 0.3 | 1.3 ± 0.3 |
| SVM+SVM-RFE(S) | 3.8 ± 0.5 | 3.1 ± 0.4 | 2.5 ± 0.4 | 1.3 ± 0.3 | 1.3 ± 0.3 |
Average error rates and their standard errors (%) for Colon cancer, Prostate cancer and Leukemia when the number of genes is 10, 20, 30, 50 or 100. For SVM-RFE(S), results are shown for the best-performing value of the C parameter: C = 0.01 for both NMC+SVM-RFE(S) and SVM+SVM-RFE(S) on Colon cancer; C = 0.01 for both on Prostate cancer; and C = 0.001 for NMC+SVM-RFE(S) and C = 100 for SVM+SVM-RFE(S) on Leukemia.
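Each table cell reports an average error rate together with a standard error. A minimal sketch of how such a cell could be computed from per-trial error rates is given below. It assumes the standard error is the sample standard deviation divided by the square root of the number of trials; the record does not state the exact formula or the number of trials, and the trial values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_error_rates(error_rates):
    """Return (mean, standard error of the mean) for per-trial error rates in percent."""
    n = len(error_rates)
    return mean(error_rates), stdev(error_rates) / sqrt(n)


def format_cell(error_rates):
    """Render a cell in the 'mean ± standard error' style used in the tables."""
    avg, se = summarize_error_rates(error_rates)
    return f"{avg:.1f} ± {se:.1f}"


# Hypothetical test error rates (%) from four random train/test splits.
trials = [10.0, 12.0, 14.0, 12.0]
avg, se = summarize_error_rates(trials)
cell = format_cell(trials)
print(cell)
```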
Performance comparison for binary-class datasets (continued).
| Classifier+Selection criterion | Number of genes | | | | |
| | 10 | 20 | 30 | 50 | 100 |
| Medulloblastoma | | | | | |
| NMC+S2N | 42.1 ± 1.1 | 40.9 ± 1.0 | 40.1 ± 0.9 | 40.8 ± 1.0 | 39.3 ± 1.1 |
| NMC+MMC-RFE(U) | 39.0 ± 1.0 | 36.5 ± 1.1 | 36.5 ± 1.0 | 35.8 ± 0.9 | 35.2 ± 1.0 |
| NMC+MMC-RFE(O) | 39.7 ± 0.9 | 37.1 ± 0.9 | 34.7 ± 0.9 | 33.2 ± 0.9 | 32.4 ± 0.9 |
| NMC+SVM-RFE(H) | 42.2 ± 1.1 | 38.5 ± 1.0 | 37.5 ± 1.0 | 34.8 ± 0.9 | 34.3 ± 0.9 |
| NMC+SVM-RFE(S) | 35.3 ± 0.9 | 32.8 ± 0.9 | 32.3 ± 0.9 | 31.5 ± 0.9 | 31.0 ± 0.9 |
| MMC+MMC-RFE(U) | 38.8 ± 0.9 | 36.9 ± 1.0 | 36.4 ± 1.0 | 35.8 ± 0.9 | 35.3 ± 1.0 |
| MMC+MMC-RFE(O) | 40.0 ± 0.9 | 37.0 ± 0.9 | 34.0 ± 0.9 | 32.9 ± 0.9 | 32.2 ± 0.9 |
| SVM+SVM-RFE(H) | 41.0 ± 1.0 | 37.9 ± 0.9 | 36.8 ± 0.9 | 35.7 ± 0.9 | 36.0 ± 0.9 |
| SVM+SVM-RFE(S) | 34.6 ± 0.4 | 32.9 ± 0.6 | 33.2 ± 0.8 | 33.9 ± 0.8 | 34.6 ± 0.8 |
| Breast cancer | | | | | |
| NMC+S2N | 34.2 ± 0.8 | 34.5 ± 0.8 | 35.0 ± 0.8 | 35.9 ± 0.8 | 36.1 ± 0.8 |
| NMC+MMC-RFE(U) | 38.0 ± 0.8 | 37.3 ± 0.7 | 36.8 ± 0.8 | 36.7 ± 0.7 | 35.4 ± 0.7 |
| NMC+MMC-RFE(O) | 37.7 ± 0.7 | 36.4 ± 0.7 | 35.6 ± 0.7 | 34.8 ± 0.7 | 35.2 ± 0.7 |
| NMC+SVM-RFE(H) | 39.4 ± 0.8 | 37.8 ± 0.7 | 36.6 ± 0.8 | 36.5 ± 0.7 | 35.6 ± 0.7 |
| NMC+SVM-RFE(S) | 36.6 ± 0.9 | 34.4 ± 0.8 | 34.1 ± 0.7 | 33.8 ± 0.7 | 33.4 ± 0.7 |
| MMC+MMC-RFE(U) | 38.5 ± 0.9 | 39.3 ± 0.7 | 38.2 ± 0.7 | 38.4 ± 0.7 | 37.2 ± 0.8 |
| MMC+MMC-RFE(O) | 38.0 ± 0.8 | 38.2 ± 0.8 | 37.0 ± 0.7 | 38.0 ± 0.7 | 36.9 ± 0.7 |
| SVM+SVM-RFE(H) | 41.1 ± 1.0 | 41.3 ± 0.9 | 41.7 ± 1.0 | 40.8 ± 0.8 | 40.7 ± 0.8 |
| SVM+SVM-RFE(S) | 43.4 ± 0.3 | 38.2 ± 0.6 | 36.3 ± 0.7 | 34.8 ± 0.7 | 35.0 ± 0.7 |
Average error rates and their standard errors (%) for Medulloblastoma and Breast cancer when the number of genes is 10, 20, 30, 50 or 100. For SVM-RFE(S), results are shown for the best-performing value of the C parameter: C = 0.001 for NMC+SVM-RFE(S) and C = 0.01 for SVM+SVM-RFE(S) on Medulloblastoma; C = 0.001 for both NMC+SVM-RFE(S) and SVM+SVM-RFE(S) on Breast cancer.
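The captions report SVM-RFE(S) at the value of C that gave the best result. One way such a choice could be made is sketched below, assuming that "best" means the lowest error averaged over the five gene-set sizes; the record does not specify the criterion, and both the function name and the error values are hypothetical.

```python
def best_c(errors_by_c):
    """Return the C whose error rate, averaged over gene-set sizes, is lowest (assumed criterion)."""
    return min(errors_by_c, key=lambda c: sum(errors_by_c[c]) / len(errors_by_c[c]))


# Hypothetical error rates (%) at 10, 20, 30, 50 and 100 genes for three candidate C values.
errors_by_c = {
    0.001: [40.0, 38.0, 37.0, 36.0, 36.0],
    0.01: [35.0, 33.0, 33.0, 34.0, 35.0],
    0.1: [39.0, 36.0, 35.0, 35.0, 36.0],
}

chosen = best_c(errors_by_c)
print(chosen)
```

Because the reported C is picked by its observed result, the SVM-RFE(S) rows are best read as an optimistic reference point rather than as the outcome of a tuned, held-out model selection.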
Figure 1. Performance comparison for binary-class datasets. The average error rates (%) as a function of the number of genes from 1 to 100, for Colon cancer, Prostate cancer and Leukemia.
Figure 2. Performance comparison for binary-class datasets (continued). The average error rates (%) as a function of the number of genes from 1 to 100, for Medulloblastoma and Breast cancer.
Performance comparison for multi-class datasets.
| Classifier+Selection criterion | Number of genes | | | | |
| | 10 | 20 | 30 | 50 | 100 |
| MLL | | | | | |
| NMC+BW | 11.5 ± 0.7 | 8.8 ± 0.6 | 7.4 ± 0.5 | 6.1 ± 0.5 | 5.6 ± 0.5 |
| NMC+MMC-RFE(U) | 7.0 ± 0.6 | 5.8 ± 0.5 | 5.1 ± 0.5 | 4.9 ± 0.5 | 4.0 ± 0.4 |
| NMC+MMC-RFE(O) | 6.4 ± 0.5 | 5.9 ± 0.5 | 5.6 ± 0.5 | 4.9 ± 0.4 | 4.4 ± 0.4 |
| NMC+SVM-RFE(H) | 26.9 ± 1.4 | 19.3 ± 1.2 | 15.5 ± 1.1 | 12.0 ± 0.8 | 9.1 ± 0.7 |
| NMC+SVM-RFE(S) | 28.0 ± 1.3 | 21.4 ± 1.1 | 16.6 ± 1.0 | 11.9 ± 0.8 | 7.9 ± 0.7 |
| MMC+MMC-RFE(U) | 6.8 ± 0.5 | 6.0 ± 0.5 | 5.2 ± 0.5 | 4.9 ± 0.5 | 4.0 ± 0.4 |
| MMC+MMC-RFE(O) | 6.4 ± 0.5 | 5.8 ± 0.5 | 5.6 ± 0.5 | 4.9 ± 0.4 | 4.5 ± 0.4 |
| SVM+SVM-RFE(H) | 31.3 ± 1.5 | 24.0 ± 1.4 | 18.3 ± 1.1 | 12.9 ± 0.8 | 7.9 ± 0.6 |
| SVM+SVM-RFE(S) | 26.2 ± 1.2 | 20.2 ± 1.1 | 14.4 ± 1.0 | 10.6 ± 0.8 | 6.8 ± 0.6 |
| SRBCT | | | | | |
| NMC+BW | 35.2 ± 1.4 | 22.1 ± 0.7 | 19.3 ± 0.7 | 10.5 ± 0.7 | 7.6 ± 0.6 |
| NMC+MMC-RFE(U) | 5.0 ± 0.5 | 3.0 ± 0.4 | 2.4 ± 0.3 | 2.2 ± 0.3 | 2.7 ± 0.3 |
| NMC+MMC-RFE(O) | 8.9 ± 0.7 | 6.0 ± 0.5 | 6.5 ± 0.5 | 6.8 ± 0.5 | 6.4 ± 0.5 |
| NMC+SVM-RFE(H) | 29.2 ± 1.2 | 22.9 ± 1.1 | 19.5 ± 1.0 | 15.7 ± 0.9 | 11.6 ± 0.7 |
| NMC+SVM-RFE(S) | 27.2 ± 1.2 | 21.9 ± 1.2 | 18.3 ± 1.0 | 14.2 ± 0.7 | 11.1 ± 0.8 |
| MMC+MMC-RFE(U) | 4.4 ± 0.5 | 2.5 ± 0.3 | 2.0 ± 0.3 | 1.7 ± 0.3 | 1.3 ± 0.2 |
| MMC+MMC-RFE(O) | 4.7 ± 0.5 | 4.1 ± 0.4 | 4.4 ± 0.4 | 3.5 ± 0.4 | 3.3 ± 0.4 |
| SVM+SVM-RFE(H) | 24.0 ± 1.3 | 14.2 ± 1.0 | 9.6 ± 0.7 | 6.3 ± 0.5 | 3.6 ± 0.4 |
| SVM+SVM-RFE(S) | 24.8 ± 1.4 | 12.7 ± 1.1 | 8.8 ± 0.8 | 5.1 ± 0.5 | 3.4 ± 0.4 |
Average error rates and their standard errors (%) for MLL and SRBCT when the number of genes is 10, 20, 30, 50 or 100. For SVM-RFE(S), results are shown for the best-performing value of the C parameter: C = 0.1 for both NMC+SVM-RFE(S) and SVM+SVM-RFE(S) on MLL; C = 100 for NMC+SVM-RFE(S) and C = 1000 for SVM+SVM-RFE(S) on SRBCT.
Performance comparison for multi-class datasets (continued).
| Classifier+Selection criterion | Number of genes | | | | |
| | 10 | 20 | 30 | 50 | 100 |
| CNS | | | | | |
| NMC+BW | 31.1 ± 1.3 | 23.1 ± 1.2 | 20.1 ± 1.1 | 18.3 ± 1.0 | 15.9 ± 1.0 |
| NMC+MMC-RFE(U) | 27.2 ± 1.1 | 22.8 ± 0.9 | 21.9 ± 0.9 | 19.4 ± 0.8 | 16.8 ± 0.8 |
| NMC+MMC-RFE(O) | 24.4 ± 1.0 | 22.7 ± 0.8 | 22.1 ± 0.9 | 20.6 ± 0.9 | 18.9 ± 0.8 |
| NMC+SVM-RFE(H) | 45.6 ± 1.3 | 35.4 ± 1.0 | 33.3 ± 1.0 | 28.8 ± 0.9 | 24.9 ± 0.8 |
| NMC+SVM-RFE(S) | 45.4 ± 1.3 | 34.9 ± 1.0 | 32.5 ± 0.9 | 27.6 ± 0.8 | 24.6 ± 0.8 |
| MMC+MMC-RFE(U) | 27.6 ± 1.1 | 22.5 ± 0.9 | 21.3 ± 0.9 | 19.2 ± 0.8 | 16.9 ± 0.8 |
| MMC+MMC-RFE(O) | 24.4 ± 1.0 | 22.9 ± 0.8 | 22.2 ± 0.9 | 20.2 ± 0.9 | 19.4 ± 0.8 |
| SVM+SVM-RFE(H) | 54.0 ± 1.5 | 42.6 ± 1.4 | 36.8 ± 1.3 | 31.0 ± 0.9 | 25.2 ± 0.8 |
| SVM+SVM-RFE(S) | 47.3 ± 1.2 | 37.7 ± 1.1 | 32.6 ± 1.1 | 28.4 ± 1.0 | 26.6 ± 0.9 |
| NCI60 | | | | | |
| NMC+BW | 49.8 ± 1.2 | 44.0 ± 1.0 | 41.6 ± 1.0 | 39.1 ± 0.8 | 37.7 ± 0.7 |
| NMC+MMC-RFE(U) | 46.4 ± 0.8 | 38.9 ± 0.8 | 34.0 ± 0.9 | 29.8 ± 0.9 | 26.8 ± 0.7 |
| NMC+MMC-RFE(O) | 48.2 ± 0.9 | 39.6 ± 0.9 | 35.0 ± 0.9 | 31.6 ± 0.8 | 30.2 ± 0.9 |
| NMC+SVM-RFE(H) | 60.6 ± 1.0 | 51.4 ± 1.0 | 48.4 ± 1.0 | 43.4 ± 0.9 | 38.0 ± 0.8 |
| NMC+SVM-RFE(S) | 60.8 ± 1.0 | 52.2 ± 0.9 | 47.3 ± 1.0 | 41.3 ± 0.9 | 39.0 ± 0.9 |
| MMC+MMC-RFE(U) | 46.0 ± 0.9 | 37.3 ± 0.8 | 33.7 ± 0.8 | 29.0 ± 0.9 | 25.0 ± 0.7 |
| MMC+MMC-RFE(O) | 49.0 ± 1.0 | 38.6 ± 0.9 | 34.3 ± 0.9 | 30.4 ± 0.8 | 28.7 ± 0.9 |
| SVM+SVM-RFE(H) | 64.7 ± 1.2 | 54.3 ± 1.1 | 47.7 ± 1.0 | 42.0 ± 0.9 | 35.9 ± 0.9 |
| SVM+SVM-RFE(S) | 59.9 ± 1.1 | 50.3 ± 1.0 | 46.2 ± 1.0 | 42.8 ± 1.1 | 35.8 ± 0.9 |
Average error rates and their standard errors (%) for CNS and NCI60 when the number of genes is 10, 20, 30, 50 or 100. For SVM-RFE(S), results are shown for the best-performing value of the C parameter: C = 10 for NMC+SVM-RFE(S) and C = 0.1 for SVM+SVM-RFE(S) on CNS; C = 100 for NMC+SVM-RFE(S) and C = 0.1 for SVM+SVM-RFE(S) on NCI60.
Figure 3. Performance comparison for multi-class datasets. The average error rates (%) as a function of the number of genes from 1 to 100, for MLL and SRBCT.
Figure 4. Performance comparison for multi-class datasets (continued). The average error rates (%) as a function of the number of genes from 1 to 100, for CNS and NCI60.
Performance comparison for independent test samples.
| Dataset | Classifier | # misclassifications (# genes) | | | |
| | | S2N | MMC-RFE(U) | MMC-RFE(O) | SVM-RFE |
| Prostate cancer | NNC | 1 (1) | 0 (45) | 0 (22) | 1 (1) |
| Prostate cancer | NMC | 1 (1) | 0 (2) | 0 (22) | 1 (1) |
| Leukemia | NNC | 0 (50) | 0 (3) | 0 (3) | 0 (3) |
| Leukemia | NMC | 1 (15) | 0 (54) | 0 (29) | 1 (1) |
| Breast cancer | NNC | 4 (19) | 3 (91) | 3 (85) | 4 (2) |
| Breast cancer | NMC | 4 (1) | 4 (35) | 4 (36) | 1 (2) |
Minimum number of misclassifications and the number of genes used for Prostate cancer, Leukemia and Breast cancer. The C parameter of SVM-RFE was set to 0.01 for Prostate cancer, and to 0.001 for Leukemia and Breast cancer.
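Each cell in this table pairs a minimum misclassification count with a number of genes. A minimal sketch of how such a pair could be read off a curve of test errors is given below. It assumes that ties are broken in favour of the smallest gene set, which the caption does not state; the function name and the example curve are hypothetical.

```python
def best_operating_point(misclassified_by_n_genes):
    """Return (minimum misclassifications, smallest number of genes that achieves it)."""
    fewest = min(misclassified_by_n_genes.values())
    n_genes = min(n for n, m in misclassified_by_n_genes.items() if m == fewest)
    return fewest, n_genes


# Hypothetical curve: number of selected genes -> misclassified independent test samples.
curve = {1: 3, 2: 1, 3: 0, 5: 1, 10: 0}

fewest, n_genes = best_operating_point(curve)
cell = f"{fewest} ({n_genes})"  # same "misclassifications (genes)" layout as the table
print(cell)
```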
Comparison of selected genes for Prostate cancer.
| Rank | GAN | [14] | S2N rank | SVM-RFE rank | Gene description |
| 1 | X07732 | • | 1 | 1 | hepsin (transmembrane protease, serine 1) |
| 2 | M30894 | • | 2 | 2 | TCR gamma alternate reading frame protein |
| 3 | M84526 | • | 3 | 89 | complement factor D (adipsin) |
| 4 | AL049969 | • | 4 | 65 | PDZ and LIM domain 5 |
| 5 | X51345 | | 38 | 5 | jun B proto-oncogene |
| 6 | U21689 | | 68 | 6 | glutathione S-transferase pi |
| 7 | M98539 | • | 297 | 15 | prostaglandin D2 synthase 21kDa (brain) |
| 8 | X17206 | | 95 | 12 | ribosomal protein S2 |
| 9 | D83018 | • | 6 | 41 | NEL-like 2 (chicken) |
| 10 | AF065388 | • | 18 | 13 | tetraspanin 1 |
The 10 top-ranked genes of orthogonal MMC-RFE are listed in order of the rank; GAN: Gene Accession Number. Genes selected by Singh et al. [14] are denoted by •. C = 0.01 was used for SVM-RFE.
Comparison of selected genes for Leukemia.
| Rank | GAN | [3] | S2N rank | SVM-RFE rank | Gene description |
| 1 | M27891 | • | 1 | 2 | cystatin C |
| 2 | M28130 | • | 25 | 3 | interleukin 8 |
| 3 | M84526 | • | 5 | 1 | D component of complement (adipsin) |
| 4 | M19507 | | 131 | 7 | myeloperoxidase |
| 5 | Y00787 | • | 23 | 4 | interleukin-8 precursor |
| 6 | M11722 | | 71 | 41 | deoxynucleotidyltransferase, terminal |
| 7 | X95735 | • | 2 | 11 | zyxin |
| 8 | D88422 | | 3 | 8 | cystatin A |
| 9 | M27783 | | 15 | 5 | elastase 2, neutrophil |
| 10 | M96326 | • | 75 | 10 | azurocidin 1 |
The 10 top-ranked genes of orthogonal MMC-RFE are listed in order of the rank; GAN: Gene Accession Number. Genes selected by Golub et al. [3] are denoted by •. C = 0.001 was used for SVM-RFE.
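The agreement between the three rankings can be summarized by counting how many of the top-10 orthogonal MMC-RFE genes also fall within the top 10 of S2N or SVM-RFE. The sketch below does this for the Leukemia table. The tuples are transcribed from that table (bullet column read as a boolean), while the cutoff of 10 and the helper name are illustrative choices rather than anything prescribed by the source.

```python
# (GAN, marked by Golub et al. [3], S2N rank, SVM-RFE rank), in orthogonal MMC-RFE rank order.
leukemia_top10 = [
    ("M27891", True, 1, 2),
    ("M28130", True, 25, 3),
    ("M84526", True, 5, 1),
    ("M19507", False, 131, 7),
    ("Y00787", True, 23, 4),
    ("M11722", False, 71, 41),
    ("X95735", True, 2, 11),
    ("D88422", False, 3, 8),
    ("M27783", False, 15, 5),
    ("M96326", True, 75, 10),
]


def count_within_cutoff(rows, column, cutoff=10):
    """Count genes whose rank in the given column is at or above the cutoff position."""
    return sum(1 for row in rows if row[column] <= cutoff)


shared_with_s2n = count_within_cutoff(leukemia_top10, column=2)
shared_with_svm_rfe = count_within_cutoff(leukemia_top10, column=3)
marked_by_golub = sum(1 for row in leukemia_top10 if row[1])

print(shared_with_s2n, shared_with_svm_rfe, marked_by_golub)  # 4 8 6
```

On this table, the orthogonal MMC-RFE top 10 shares 8 genes with the SVM-RFE top 10 but only 4 with the S2N top 10, and 6 of the 10 are among the genes marked as selected by Golub et al. [3].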
Comparison of selected genes for Breast cancer.
| Rank | GAN | [16] | S2N rank | SVM-RFE rank | Gene description |
| 1 | Contig63649_RC | • | 3 | 40 | ESTs |
| 2 | AL080059 | • | 1 | 2 | TSPY-like5 |
| 3 | Contig27312_RC | | 133 | 48 | collagen, type XXIII, alpha 1 |
| 4 | NM_001756 | | 412 | 35 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6 |
| 5 | Contig48328_RC | • | 2 | 4 | zinc finger protein 533 |
| 6 | NM_001635 | | 69 | 24 | amphiphysin |
| 7 | NM_006681 | • | 17 | 13 | neuromedin U |
| 8 | NC_001807 | | 1174 | 39 | Human mitochondrion |
| 9 | NM_000599 | • | 53 | 38 | insulin-like growth factor binding protein 5 |
| 10 | NM_000518 | | 1387 | 45 | hemoglobin, beta |
The 10 top-ranked genes of orthogonal MMC-RFE are listed in order of the rank; GAN: Gene Accession Number. Genes selected by van't Veer et al. [16] are denoted by •. C = 0.001 was used for SVM-RFE.
Figure 5. The uncorrelated MMC-RFE algorithm.
Figure 6. The orthogonal MMC-RFE algorithm.