Hongyan Zhang, Lanzhi Li, Chao Luo, Congwei Sun, Yuan Chen, Zhijun Dai, Zheming Yuan.
Abstract
In efforts to discover disease mechanisms and improve clinical diagnosis of tumors, it is useful to mine profiles for informative genes with definite biological meanings and to build robust classifiers with high precision. In this study, we developed a new method for tumor-gene selection, the Chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC). First, we obtained the weighted integrated rank of gene importance from chi-square tests of single and pairwise gene interactions. Then, we sequentially introduced the ranked genes and removed redundant genes by using leave-one-out cross-validation of the chi-square test-based Direct Classifier (χ²-DC) within the training set to obtain informative genes. Finally, we determined the accuracy of independent test data by utilizing the genes obtained above with χ²-DC. Furthermore, we analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. An independent test of ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC could efficiently control overfitting and had higher generalization performance. The informative genes selected by χ²-IRG-DC could dramatically improve the independent test precision of other classifiers; meanwhile, the informative genes selected by other feature selection methods also had good performance in χ²-DC.
Year: 2014 PMID: 25140319 PMCID: PMC4130026 DOI: 10.1155/2014/589290
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Multiclass gene-expression datasets.
| Dataset | Platform | No. of classes | No. of genes | No. of samples in training | No. of samples in test | Source |
|---|---|---|---|---|---|---|
| Leuk1 | Affy | 3 | 7,129 | 38 | 34 | [ |
| Lung1 | Affy | 3 | 7,129 | 64 | 32 | [ |
| Leuk2 | Affy | 3 | 12,582 | 57 | 15 | [ |
| SRBCT | cDNA | 4 | 2,308 | 63 | 20 | [ |
| Breast | Affy | 5 | 9,216 | 54 | 30 | [ |
| Lung2 | Affy | 5 | 12,600 | 136 | 67 | [ |
| DLBCL | cDNA | 6 | 4,026 | 58 | 30 | [ |
| Leuk3 | Affy | 7 | 12,558 | 215 | 112 | [ |
| Cancers | Affy | 11 | 12,533 | 100 | 74 | [ |
| GCM | Affy | 14 | 16,063 | 144 | 46 | [ |
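The test-set sizes in the table above limit how finely accuracy can be resolved. A minimal sketch, assuming accuracy is reported as the percentage of correctly classified test samples; the function names are illustrative and not taken from the paper:

```python
# Test-set sizes copied from the dataset table above.
TEST_SIZES = {
    "Leuk1": 34, "Lung1": 32, "Leuk2": 15, "SRBCT": 20, "Breast": 30,
    "Lung2": 67, "DLBCL": 30, "Leuk3": 112, "Cancers": 74, "GCM": 46,
}


def accuracy_percent(n_correct: int, n_test: int) -> float:
    """Percentage of correctly classified test samples (assumed definition)."""
    return round(100.0 * n_correct / n_test, 2)


def accuracy_step(n_test: int) -> float:
    """Accuracy change, in percentage points, caused by one extra error."""
    return round(100.0 / n_test, 2)


if __name__ == "__main__":
    for name, n_test in TEST_SIZES.items():
        print(f"{name}: one misclassified sample = {accuracy_step(n_test)} points")
```

Under this assumption, 33 of 34 correct on Leuk1 gives 97.06, while a single additional error on Leuk2 (15 test samples) moves accuracy by 6.67 points, so differences between models on the smaller test sets can come down to one or two samples.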
Frequency counts of samples in each class for single genes.
| Class | | | Total |
|---|---|---|---|
| ⋮ | ⋮ | ⋮ | ⋮ |
| Total | | | |
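The frequency table above is the input to a chi-square test of independence between tumor class and a gene's expression level. A minimal sketch in Python, assuming each gene is binarized into two levels; the split at the gene's mean and all function names are assumptions for illustration, because the column definitions are not recoverable here:

```python
from collections import Counter
from typing import Sequence


def contingency_table(levels: Sequence[int], classes: Sequence[str]) -> list[list[int]]:
    """Rows are classes (sorted); columns are expression level 0 and 1."""
    counts = Counter(zip(classes, levels))
    return [[counts[(c, lvl)] for lvl in (0, 1)] for c in sorted(set(classes))]


def chi_square(table: Sequence[Sequence[int]]) -> float:
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def binarize_at_mean(values: Sequence[float]) -> list[int]:
    """ASSUMPTION: level 1 if the value is at or above the gene's mean, else 0."""
    mean = sum(values) / len(values)
    return [1 if v >= mean else 0 for v in values]


# Toy data invented for illustration; not taken from the paper.
expression = [0.2, 0.4, 0.1, 2.5, 2.9, 3.1]
classes = ["A", "A", "A", "B", "B", "B"]
table = contingency_table(binarize_at_mean(expression), classes)
print(table, chi_square(table))
```

A gene whose levels separate the classes perfectly yields the largest statistic for a given sample size, and a gene whose levels are spread evenly across classes yields zero, which is what makes the statistic usable as a ranking score.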
Frequency counts of samples in each class for pairwise genes.
| Class | x | x | Total |
|---|---|---|---|
| ⋮ | ⋮ | ⋮ | ⋮ |
| Total | | | |
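The pairwise table has the same shape as the single-gene one, so gene pairs can be scored with the same statistic. A minimal sketch, assuming the two columns record the relative order of the two genes within each sample (first gene higher versus not); this column definition, the toy data, and the function names are assumptions, since the subscripts in the table header were lost:

```python
from itertools import combinations
from typing import Sequence


def chi_square(table: Sequence[Sequence[int]]) -> float:
    """Pearson chi-square statistic for a class-by-level frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def pair_table(xi: Sequence[float], xj: Sequence[float],
               classes: Sequence[str]) -> list[list[int]]:
    """ASSUMPTION: column 0 counts samples with xi > xj, column 1 the rest."""
    labels = sorted(set(classes))
    table = [[0, 0] for _ in labels]
    for a, b, c in zip(xi, xj, classes):
        table[labels.index(c)][0 if a > b else 1] += 1
    return table


def rank_pairs(genes: dict[str, Sequence[float]], classes: Sequence[str]):
    """Score every gene pair and sort from most to least class-associated."""
    scores = {
        (g, h): chi_square(pair_table(genes[g], genes[h], classes))
        for g, h in combinations(genes, 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data invented for illustration; not taken from the paper.
genes = {
    "g1": [1, 1, 1, 5, 5, 5],
    "g2": [3, 3, 3, 3, 3, 3],
    "g3": [0, 4, 0, 4, 6, 4],
}
classes = ["A", "A", "A", "B", "B", "B"]
ranked = rank_pairs(genes, classes)
print(ranked)
```

How the single-gene and pairwise scores are weighted into one integrated rank is not specified in this record, so the sketch stops at the pairwise ranking.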
Independent test accuracy and number of informative genes (in parentheses) used in different models for multiclass gene-expression datasets.
| Model | Leuk1 | Lung1 | Leuk2 | SRBCT | Breast | Lung2 | DLBCL | Leuk3 | Cancers | GCM | Aver ± std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC-TSP∗ | | 71.88 | 80 | 95 | 66.67 | 83.58 | 83.33 | 77.68 | 74.32 | 52.17 | 78.17 ± 13.17 |
| | (4) | (4) | (4) | (6) | (8) | (8) | (10) | (12) | (20) | (26) | (10.2) |
| HC-K-TSP∗ | | 78.13 | | | 66.67 | 94.03 | 83.33 | 82.14 | 82.43 | | 85.12 ± 12.42 |
| | (36) | (20) | (24) | (30) | (24) | (28) | (46) | (64) | (128) | (134) | (53.4) |
| DT∗ | 85.29 | 78.13 | 80 | 75 | 73.33 | 88.06 | 86.67 | 75.89 | 68.92 | 52.17 | 76.35 ± 10.49 |
| | (2) | (4) | (2) | (3) | (4) | (5) | (5) | (16) | (10) | (18) | (6.9) |
| PAM∗ | | 78.13 | 93.33 | 95 | | | 90 | | 87.84 | 56.52 | 88.5 ± 12.71 |
| | (44) | (13) | (62) | (285) | (4,822) | (614) | (3,949) | (3,338) | (2,008) | (1,253) | (1,638.8) |
| mRMR-SVM | 76.47 | 78.13 | 100.00 | 75.00 | 96.67 | 95.52 | 96.67 | 91.96 | 71.62 | 45.65 | 82.77 ± 16.85 |
| | (7) | (13) | (19) | (9) | (97) | (120) | (16) | (119) | (89) | (57) | (54.6) |
| SVM-RFE-SVM | 85.29 | 78.13 | 93.33 | 95.00 | 90.00 | 88.06 | 90.00 | 91.07 | 93.24 | 63.04 | 86.72 ± 9.62 |
| | (5) | (9) | (8) | (3) | (7) | (9) | (13) | (35) | (29) | (199) | (31.7) |
| TSG | 97.06 | 81.25 | 100 | 100 | 86.67 | 95.52 | 93.33 | 91.07 | 79.73 | 67.39 | 89.20 ± 10.5 |
| | (6) | (20) | (44) | (13) | (63) | (60) | (16) | (95) | (81) | (112) | (51) |
| χ²-IRG-DC | | 84.38 | | | 90 | 97.01 | 93.33 | | 85.14 | | |
| | (29) | (23) | (20) | (23) | (31) | (52) | (37) | (46) | (47) | (64) | (37.2) |
*Results reported in [28].
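The "Aver ± std" column can be checked from the per-dataset values. A minimal sketch using the TSG row of the table above, assuming the reported spread is the sample standard deviation (n − 1 denominator); the function name is illustrative:

```python
from statistics import mean, stdev

# TSG row of the table above (independent test accuracy, %).
tsg = [97.06, 81.25, 100, 100, 86.67, 95.52, 93.33, 91.07, 79.73, 67.39]


def summarize(accuracies):
    """Return (mean, sample standard deviation) across datasets."""
    return mean(accuracies), stdev(accuracies)  # stdev divides by n - 1


avg, sd = summarize(tsg)
print(f"{avg:.2f} ± {sd:.2f}")
```

This gives 89.20 ± 10.49, which agrees with the tabulated 89.20 ± 10.5; the population standard deviation (n denominator) would come out near 9.95 instead, so the sample form appears to be the one used.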
Figure 1. Accuracy of mRMR-SVM for training and test data.
Figure 2. Accuracy of SVM-RFE-SVM for training and test data.
Figure 3. Accuracy of HC-k-TSP for training and test data.
Figure 4. Accuracy of TSG for training and test data.
Figure 5. Accuracy of χ²-IRG-DC for training and test data.
Test accuracy of different classifiers with informative genes selected by different feature-selection methods.
| Classifier | Feature-selection method | Leuk1 | Lung1 | Leuk2 | SRBCT | Breast | Lung2 | DLBCL | Leuk3 | Cancers | GCM | Aver- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NB | ALL∗ | 85.29 | 81.25 | 100.00 | 60.00 | 66.67 | 88.06 | 86.67 | 32.14 | 79.73 | 52.17 | 73.20 |
| | χ²-IRG | 97.06 | 81.25 | 100.00 | 85.00 | 86.67 | 92.54 | 96.67 | 59.82 | 82.43 | 60.87 | 84.23 |
| | mRMR | 79.41 | 68.75 | 100.00 | 90.00 | 93.33 | 97.01 | 96.67 | 74.11 | 70.27 | 45.65 | 81.52 |
| | SVM-RFE | 67.65 | 81.25 | 80.00 | 95.00 | 80.00 | 89.55 | 90.00 | 95.00 | 77.03 | 63.04 | 81.85 |
| | HC-K-TSP | 91.18 | 81.25 | 100.00 | 80.00 | 80.00 | 95.52 | 86.67 | 100.00 | 77.03 | 65.22 | 85.69 |
| | TSG | 91.18 | 84.38 | 93.33 | 100 | 86.67 | 94.03 | 100 | 51.79 | 71.62 | 65.22 | 83.82 |
| | Aver- | 85.30 | 79.38 | 94.67 | 90.00 | 85.33 | 93.73 | 94 | 76.14 | 75.68 | 60.00 | |
| KNN | ALL∗ | 67.65 | 75.00 | 86.67 | 70.00‡ | 63.33 | 88.06 | 93.33 | 75.89 | 64.86 | 34.78 | 71.96 |
| | χ²-IRG | 97.06 | 71.88 | 86.67 | 100.00 | 86.67 | 85.07 | 96.67 | 87.50 | 85.14 | 58.70 | 85.54 |
| | mRMR | 70.59 | 68.75 | 80.00 | 80.00 | 96.67 | 86.57 | 100.00 | 91.07 | 54.05 | 36.96 | 76.47 |
| | SVM-RFE | 76.47 | 68.75 | 86.67 | 100.00 | 90.00 | 86.57 | 90.00 | 91.96 | 58.11 | 45.65 | 79.42 |
| | HC-K-TSP | 88.24 | 87.50 | 86.67 | 85.00 | 83.33 | 94.03 | 93.33 | 88.39 | 64.86 | 52.17 | 82.35 |
| | TSG | 91.18 | 75 | 93.33 | 100 | 80 | 88.06 | 96.67 | 86.6 | 74.32 | 39.13 | 82.43 |
| | Aver- | 84.71 | 74.38 | 86.67 | 93.00 | 87.33 | 88.06 | 95.33 | 89.10 | 67.30 | 46.52 | |
| SVM | ALL∗ | 79.41 | 87.50 | 100.00 | 100.00 | 83.33 | 97.01 | 100.00 | 84.82 | 83.78 | 65.22 | 88.11 |
| | χ²-IRG | 97.06 | 87.50 | 93.33 | 100.00 | 93.33 | 92.54 | 96.67 | 86.61 | 91.89 | 56.52 | 89.54 |
| | mRMR | 76.47 | 78.13 | 100.00 | 75.00 | 96.67 | 95.52 | 96.67 | 91.96 | 71.62 | 45.65 | 82.77 |
| | SVM-RFE | 85.29 | 78.13 | 93.33 | 95.00 | 90.00 | 88.06 | 90.00 | 91.07 | 93.24 | 63.04 | 86.72 |
| | HC-K-TSP | 85.29 | 84.38 | 100.00 | 90.00 | 86.67 | 98.51 | 96.67 | 94.64 | 82.43 | 60.87 | 87.95 |
| | TSG | 91.18 | 81.25 | 93.33 | 80 | 80 | 94.03 | 100 | 80.36 | 68.92 | 54.35 | 82.34 |
| | Aver- | 87.06 | 81.88 | 96.00 | 88.00 | 89.33 | 93.73 | 96.00 | 88.93 | 81.62 | 56.09 | |
| χ²-DC | χ²-IRG | 97.06 | 84.38 | 100.00 | 100.00 | 90.00 | 97.01 | 93.33 | 93.75 | 85.14 | 67.39 | 90.81 |
| | mRMR | 82.35 | 65.63 | 100.00 | 90.00 | 90.00 | 95.52 | 70.00 | 96.43 | 60.81 | 47.83 | 79.86 |
| | SVM-RFE | 79.41 | 56.25 | 66.67 | 85.00 | 76.67 | 92.54 | 80.00 | 96.43 | 94.59 | 69.57 | 79.71 |
| | HC-K-TSP | 97.06 | 84.38 | 100.00 | 95.00 | 76.67 | 97.01 | 93.33 | 88.39 | 78.38 | 69.57 | 87.98 |
| | TSG | 97.06 | 81.25 | 100 | 100 | 86.67 | 95.52 | 93.33 | 91.07 | 79.73 | 67.39 | 89.20 |
| | Aver- | 90.59 | 74.38 | 93.33 | 94.00 | 84.00 | 95.52 | 86.00 | 93.21 | 79.73 | 64.35 | |
*Results reported in [28]; ‡30 in original paper, whereas the actual number was 70 after validation; †Aver-C was the average accuracy of a classifier with informative genes selected by five feature-selection methods.