Hongyan Zhang, Haiyan Wang, Zhijun Dai, Ming-shun Chen, Zheming Yuan.
Abstract
BACKGROUND: Although the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, improving accuracy remains a major challenge. One difficulty is establishing an effective method for selecting a parsimonious set of relevant genes. So far, most gene selection methods in the literature screen individual genes or pairs of genes without considering possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulties associated with the search schemes of traditional wrapper methods and the overfitting problem in high-dimensional search spaces, but also takes potential gene interactions into account during gene selection. This method, coupled with a Support Vector Machine (SVM) for implementation, often selects a very small number of genes, which makes the resulting model easy to interpret.
Year: 2012 PMID: 23148517 PMCID: PMC3562261 DOI: 10.1186/1471-2105-13-298
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Selected genes from original colon dataset after screening by t-test
| T48041 | 0.90237 | 0.01289 |
| M19311 | 0.95889 | 1.2E-08 |
| T51023 | 0.00121 | 3.0E-09 |
| D63874 | 0.00798 | 0.00006 |
| X57206 | 0.11898 | 5.4E-07 |
| T57882 | 0.18106 | 0.00006 |
The paired t-test was conducted according to the method described in section "Importance ordering and significance of the selected genes".
Figure 1. Plot of LOOCV accuracy of LDA, NB, and SVM using the top-ranked genes from SVM-RFE, with the number of genes ranging from 2 to 150. The accuracy of SVM generally increases as more genes are included in the model. The accuracies of LDA and NB show no increasing pattern, suggesting that the gene ranking by SVM-RFE is SVM-specific and may not generalize well to NB or LDA. The plotted curves assume the number of genes is known (oracle situation). When the number of genes to use is not known in advance, choosing it adds further variability to the LOOCV accuracy. The diamond-shaped points show the LOOCV accuracy of the LDA, NB, and SVM classifiers using the genes selected by BMSF.
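The kind of curve described in Figure 1 can be outlined in code. This is a minimal sketch, not the authors' implementation: the nearest-centroid classifier is a stand-in for LDA, NB, or SVM, the gene ranking is passed in rather than computed by SVM-RFE, and all function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the paper's SVM, LDA, or NB): predict the
    class whose mean expression vector is closest to x."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and return the
    fraction of held-out samples that are classified correctly."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def accuracy_by_gene_count(X, y, ranked_genes, counts):
    """LOOCV accuracy using only the top-k ranked genes, for each k in counts."""
    result = {}
    for k in counts:
        cols = ranked_genes[:k]
        X_k = [[row[j] for j in cols] for row in X]
        result[k] = loocv_accuracy(X_k, y)
    return result


# Hypothetical toy data: 6 samples x 3 genes; gene 0 separates the classes.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 2.0], [0.0, 3.0, 0.5],
     [1.1, 4.0, 1.5], [0.9, 2.0, 0.7], [1.0, 0.5, 1.2]]
y = ["N", "N", "N", "T", "T", "T"]
curve = accuracy_by_gene_count(X, y, ranked_genes=[0, 2, 1], counts=[1, 2, 3])
```

Like the oracle setting in the caption, this sketch takes both the ranking and the number of genes as given. Ranking genes on the full dataset before cross-validation would typically make the accuracy estimate optimistic.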
Summary of nine datasets used in our experiments
| Dataset | No. of genes | Class 1 samples | Class 2 samples | Reference |
|---|---|---|---|---|
| CNS | 7129 | 25(C) | 9(D) | [ |
| Colon | 2000 | 40(T) | 22(N) | [ |
| DLBCL | 7129 | 58(D) | 19(F) | [ |
| GCM | 16063 | 190(C) | 90(N) | [ |
| Leukemia | 7129 | 25(AML) | 47(ALL) | [ |
| Lung | 12533 | 150(A) | 31(M) | [ |
| Prostate1 | 12600 | 52(T) | 50(N) | [ |
| Prostate2 | 12625 | 38(T) | 50(N) | [ |
| Prostate3 | 12626 | 24(T) | 9(N) | [ |
Summary of selected genes
| Dataset | No. of selected genes | Selected genes |
|---|---|---|
| CNS | 3 | J03507, U00968, Y00757 |
| Colon | 7 | Z50753, H67764, H17434, R88740, R36977, R81170, U14631 |
| DLBCL | 6 | K03430_at, M37815_cds1_at, X51688_at, X76534_at, Z70723_at, M16652_at |
| GCM | 32 | S82075_at, U35048_at, U61374_at, U87964_at, U95090_at, U97188_at, X14445_at, X92715_at, M29610_at, M21642_at, M19267_s_at, M19878_at, Z50115_s_at, X93511_s_at, X56687_s_at, AA256220_at, AA334630_at, AA362708_at, D30921_at, M54994_f_at, R10529_at, R29657_at, W27827_at, W39573_at, Z49995_at-2, RC_AA100437_at, RC_AA210695_at, RC_AA252372_at, RC_AA278134_at, RC_AA281769_s_at, RC_AA405698_at, RC_AA489009_at |
| Leukemia | 3 | X95735_at, Y07604_at, D26156_s_at |
| Lung | 8 | 1779_s_at, 33246_at, 36354_at, 38221_at, 40363_r_at, 540_at, 631_g_at, 885_g_at |
| Prostate1 | 8 | 33614_at, 38322_at, 36627_at, 38041_at, 41303_r_at, 1846_at, 930_at, 829_s_at |
| Prostate2 | 11 | 31673_s_at, 35099_at, 37821_at, 40038_at, 41077_at, 36030_at, 37209_g_at, 38983_at, 33928_r_at, 38832_r_at, 33188_at |
| Prostate3 | 2 | 39364_s_at, 40546_s_at |
Figure 2. Comparison with the top performance results reported in the literature for nine cancer datasets.
Number of genes used in the classifiers for gene-expression datasets
| Method | CNS | Colon | DLBCL | GCM | Leukemia | Lung | Prostate1 | Prostate2 | Prostate3 |
|---|---|---|---|---|---|---|---|---|---|
| TSP* | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| k-TSP* | 10 | 2 | 2 | 10 | 18 | 10 | 2 | 18 | 2 |
| SVM/NB ‡ | 7129 | 2000 | 7129 | 16063 | 7129 | 12533 | 12600 | 12625 | 12626 |
| PAM* | 4 | 15 | 17 | 47 | 2296 | 9 | 47 | 13 | 701 |
| Sumdiff/mul/sign-PAM† | 286 | 80 | 286 | 642 | 286 | 502 | 504 | 506 | 506 |
| HBEЅ | - | - | 6 | - | 4 | - | 10 | - | - |
| BBF-SVMÞ | - | 20 | 5 | - | - | - | 13 | - | - |
| Random forest (GeneSrF) | 11 | 3 | 3 | 186 | 4 | 12 | 2 | 4 | 3 |
| BMSF-SVM/NB/LDA/QDAϒ | 3 | 7 | 6 | 32 | 3 | 8 | 8 | 11 | 2 |
*Results obtained in Tan et al. [3].
†Results obtained in Chopra et al. [12].
ÞResults obtained in Zhang and Deng [10].
ЅResults obtained in Dagliyan et al. [4].
‡Results obtained using all genes.
ϒRefers to each of the classifiers using our selected genes.
Figure 3. Average of absolute correlation (AAC) at each stage and its relationship with NB and SVM accuracy. The top panel gives the AAC for each dataset. ‘Original’ refers to the entire dataset; ‘Filtering’ refers to the stage after the procedures in Section 5.1; ‘Detailed evaluation’ refers to the stage at the end of Section 5.2. The bottom panels show the relationship between the AAC on the original dataset and the accuracy of the NB, BMSF-NB, SVM, and BMSF-SVM classifiers. The original AAC appears to be inversely related to the accuracy of NB. The relationship between the original AAC and the accuracy of SVM is not obvious. BMSF-NB and BMSF-SVM are much less influenced by the original AAC.
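The AAC in Figure 3 can be illustrated with a short computation. The exact definition is not given in this excerpt, so the sketch below assumes Pearson correlation averaged over all distinct gene pairs; the function names and the toy matrix are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def average_absolute_correlation(X):
    """X is a samples x genes matrix (list of rows).
    Assumed definition: mean of |r| over all distinct gene pairs."""
    genes = list(zip(*X))  # transpose to one tuple per gene
    pairs = list(combinations(genes, 2))
    return sum(abs(pearson(u, v)) for u, v in pairs) / len(pairs)


# Hypothetical toy matrix: genes 0 and 1 are perfectly correlated,
# gene 2 is only weakly related to either.
X = [[1.0, 2.0, 1.0],
     [2.0, 4.0, 0.0],
     [3.0, 6.0, 1.0],
     [4.0, 8.0, 0.0]]
aac = average_absolute_correlation(X)
```

With thousands of genes the number of pairs grows quadratically, so a practical version would likely use a vectorized correlation matrix rather than this explicit loop.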
Figure 4. Change in the number of selected genes in each round. The labeled values are the best MCC.
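Assuming MCC in Figure 4 denotes the Matthews correlation coefficient (the excerpt does not expand the abbreviation), it can be computed from a two-class confusion table as sketched below. The class sizes 40 and 22 are taken from the Colon row of the dataset summary; the error counts are hypothetical.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-table counts:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the table is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# 40 tumor and 22 normal samples, as in the Colon dataset.
perfect = matthews_corrcoef(tp=40, tn=22, fp=0, fn=0)
# Hypothetical outcome: one tumor sample misclassified as normal.
one_error = matthews_corrcoef(tp=39, tn=22, fp=0, fn=1)
```

Unlike plain accuracy, MCC uses all four cells of the table, which makes it less sensitive to unequal class sizes such as the 40 versus 22 split here.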
Selected genes from Colon dataset after screening by t-test
| Z50753 | 4.7E-06 | 5.2E-12 |
| H67764 | 0.01822 | 5.4E-15 |
| H17434 | 0.00514 | 1.7E-14 |
| R88740 | 0.02422 | 0.00022 |
| R36977 | 0.00119 | 2.6E-18 |
| R81170 | 0.02326 | 2.6E-18 |
| U14631 | 0.00248 | 0.00002 |
The paired t-test was conducted according to the method described in section "Importance ordering and significance of the selected genes".
Results from three runs of the leukemia dataset
| Run | Selected genes | Accuracy (%) |
|---|---|---|
| 1 | X95735_at, Y07604_at, D26156_s_at | 98.61 |
| 2 | U82759_at, X95735_at, Y07604_at | 97.22 |
| 3 | M23197_at, U77604_at, M28170_at | 100 |
Figure 5. Joint effect of informative genes from multiple runs of the leukemia dataset. The left panel gives the LOOCV accuracy ± standard error from 30 runs using the combined list of genes. The right panel gives the number of genes in the combined list ± standard deviation from 30 runs. In both plots, the number of lists being combined is on the horizontal axis.
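The combined gene list in Figure 5 can be illustrated with the three leukemia runs tabulated above. This sketch assumes that lists are combined by taking their union, which the excerpt does not state explicitly; the function name is hypothetical.

```python
# Gene lists from the three runs of the leukemia dataset.
runs = [
    ["X95735_at", "Y07604_at", "D26156_s_at"],
    ["U82759_at", "X95735_at", "Y07604_at"],
    ["M23197_at", "U77604_at", "M28170_at"],
]


def combine_gene_lists(lists):
    """Order-preserving union of several gene lists (assumed combination rule)."""
    seen, combined = set(), []
    for genes in lists:
        for gene in genes:
            if gene not in seen:
                seen.add(gene)
                combined.append(gene)
    return combined


# Size of the combined list when the first 1, 2, and 3 lists are merged.
sizes = [len(combine_gene_lists(runs[:m])) for m in range(1, len(runs) + 1)]
```

Runs 1 and 2 share two genes while run 3 shares none, so the combined list grows unevenly. The figure summarizes this kind of growth over 30 runs rather than the three shown here.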
Figure 6. Comparison of different variable selection methods for the same classification algorithm. For each of the classification algorithms (LDA, QDA, SVM, NB), the same number of genes is selected for each cancer dataset by BMSF and by 11 other variable selection criteria (the number of genes used is the number selected by BMSF). The LOOCV accuracy is presented in the dotplot, in which the horizontal coordinate of a point indicates the accuracy, so a point further to the right represents higher accuracy. In most cases, the algorithms with variables selected by BMSF reach the highest LOOCV accuracy. For the GCM data, the variables selected by the eight criteria from RankGene and by MaxRel cannot be used for QDA because of rank deficiency. The average accuracy for QDA is therefore calculated over the other datasets for a fair comparison.
Average and standard deviation of 10-fold CV accuracy from 10 runs
| CNS | 94.11(1.38) | 88.23(0) | 92.05(2.79) | 91.17(0) | 96.47(1.86) | 83.23(3.93) | 95.88(2.05) | NA(NA) |
| Colon | 94.35(2.04) | 78.22(2.18) | 87.41(1.98) | 86.93(1.93) | 87.41(1.48) | 81.45(0.85) | 89.51(1.9) | 81.12(1.08) |
| DLBCL | 98.57(1.29) | 88.44(2.48) | 88.96(0.68) | 88.57(0.54) | 96.23(0.41) | 85.71(1.36) | 94.15(1.1) | 90(0.62) |
| GCM | 98.07(0.61) | 92.07(0.9) | 87.42(0.76) | 84.17(0.24) | 91.14(0.66) | 77.53(1.41) | 90.25(0.93) | NA(NA) |
| Leukemia | 98.33(0.58) | 96.66(0.97) | 96.25(0.93) | 94.58(0.43) | 98.33(0.58) | 93.19(0.43) | 97.63(0.67) | 94.58(0.78) |
| Lung | 99.11(0.64) | 98.95(0.4) | 98.39(0.31) | 98.34(0) | 97.79(0.26) | 98.34(0) | 97.56(0.46) | 98.34(0.26) |
| Pros1 | 96.76(0.8) | 92.64(1.24) | 89.6(0.82) | 93.33(0.62) | 95.49(0.5) | 90.98(1.01) | 93.23(0.85) | 91.56(0.68) |
| Pros2 | 97.38(1.2) | 83.75(2.27) | 90(1.17) | 85.11(1.46) | 95.34(1.13) | 85.45(0.71) | 90.11(2.27) | 83.63(1.09) |
| Pros3 | 98.48(1.59) | 93.93(2.47) | 99.69(0.95) | 99.69(0.95) | 96.66(0.95) | 93.93(0) | 100(0) | 96.66(1.72) |
BMSF and random forest (GeneSrF) are used for gene selection; SVM, NB, LDA, and QDA are used to build models on the training data and to predict the classes of the test data.
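Entries of the form mean(SD) in the table above summarize repeated cross-validation. The sketch below shows one way to draw a random 10-fold split and to summarize accuracies over 10 runs. The per-run accuracies are hypothetical, the function names are illustrative, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the excerpt does not say which form was used.

```python
import random
import statistics


def kfold_indices(n_samples, k, rng):
    """Randomly partition sample indices 0..n_samples-1 into k folds
    whose sizes differ by at most one."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def summarize(accuracies):
    """Format run accuracies (in percent) as 'mean(SD)'.
    Uses the sample standard deviation, which is an assumption."""
    mean = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)
    return f"{mean:.2f}({sd:.2f})"


# One random 10-fold split of 72 samples (the size of the leukemia dataset).
folds = kfold_indices(72, 10, random.Random(0))

# Hypothetical accuracies (%) from 10 repeated runs of 10-fold CV.
run_accuracies = [98.61, 97.22, 98.61, 100.0, 98.61,
                  97.22, 98.61, 98.61, 100.0, 98.61]
summary = summarize(run_accuracies)
```

A standard deviation of 0, as reported for some table entries, means every run gave the same accuracy, which can happen when the same samples are misclassified regardless of how the folds are drawn.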