Shu-Lin Wang, Xue-Ling Li, Jianwen Fang.
Abstract
BACKGROUND: Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving classification performance. Moreover, finding important tumor-related genes with the highest accuracy is an important task because these genes might serve as tumor biomarkers, which would benefit both tumor molecular diagnosis and drug development.
Year: 2012 PMID: 22830977 PMCID: PMC3465202 DOI: 10.1186/1471-2105-13-178
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1 A diagram of the expanded tree generated by HBSA.
Figure 2 A diagram of two regulatory pathways. The dotted lines represent all possible combinations of genes on different pathways.
Figure 3 The flowchart of our analysis method.
Figure 4 The construction of HBSA-SVM-based ensemble classifier.
Table 1 Designation of training set and test set in our experiments
| Dataset | Set** | m* | n* | k* | Platform |
| --- | --- | --- | --- | --- | --- |
| Prostate102 | Tr | 102 | 12,600 | 2 | Affy HU95A V2 |
| Prostate34 | Te | 34 | 12,626 | 2 | Affy U95A |
| DLBCL77 | Tr | 77 | 7,129 | 2 | Affy HU6800 |
| DLBCL21 | Te | 21 | 12,581 | 2 | Affy HU95AV2 |
| Leukemia72 | Tr | 72 | 7,129 | 2 | Affy HU6800 |
| Leukemia52 | Te | 52 | 12,582 | 2 | AffyHGU95a |
| Colon | Tr | 42 | 2,000 | 2 | AffyHUM6000 |
| Colon | Te | 20 | 2,000 | 2 | AffyHUM6000 |
| ALL | Tr | 148 | 12,626 | 6 | Affy HGU95AV2 |
| ALL | Te | 100 | 12,626 | 6 | Affy HGU95AV2 |
| SRBCT | Tr | 63 | 2,308 | 4 | cDNA |
| SRBCT | Te | 20 | 2,308 | 4 | cDNA |
* m denotes the number of samples in the dataset, n the number of genes, and k the number of subclasses.
** “Tr” denotes a dataset used as the training set, and “Te” a dataset used as the test set.
Table 2 Representative results obtained by the HBSA-SVM(Biased) and HBSA-SVM(Unbiased)
| Dataset | No. | Gene subset | 10-fold CV accuracy (%) | Full-fold CV accuracy (%) | Biased (%) | Unbiased (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Leukemia | 1 | {M23197, M31523} | 100 | 98.75 ± 0.42 | 86.54 | 86.54 |
| | 2 | {M23197, Y07604} | 100 | 99.41 ± 0.69 | 80.77 | 73.08 |
| | 3 | {M23197, U46751} | 100 | 99.96 ± 0.33 | 80.77 | 73.08 |
| | 4 | | | | | |
| | 5 | {M31523, L47738} | 100 | 99.22 ± 0.73 | 88.46 | 71.15 |
| | 6 | | | | | |
| DLBCL | 1 | {U28386, U81375, D78134} | 100 | 100 ± 0 | 90.48 | 76.19 |
| | 2 | | | | | |
| | 3 | {X67951, L06132, D78134} | 100 | 100 ± 0 | 80.95 | 76.19 |
| | 4 | {U81375, L06132, D78134} | 100 | 100 ± 0 | 95.24 | 90.48 |
| | 5 | {L06132, L35249, D78134} | 100 | 99.86 ± 0.58 | 85.71 | 85.71 |
| | 6 | | | | | |
| Prostate | 1 | {37639_at, 41504_s_at, 40074_at, 1708_at} | 100 | 99.96 ± 0.24 | 91.18 | 76.47 |
| | 2 | | | | | |
| | 3 | {41288_at, 38087_s_at, 41504_s_at, 32786_at} | 100 | 99.99 ± 0.10 | 88.24 | 82.35 |
| | 4 | | | | | |
| SRBCT | 1 | {770394, 769716, 563673} | 100 | 99.90 ± 0.39 | 80 | 75 |
| | 2 | {859359, 1435862, 769716} | 100 | 99.80 ± 0.73 | 90 | 85 |
| | 3 | {377461, 769716, 563673} | 100 | 99.97 ± 0.20 | 85 | 75 |
| | 4 | {859359, 377461, 782193} | 100 | 99.72 ± 0.93 | 85 | 75 |
| | 5 | {1435862, 143306, 782193} | 100 | 99.72 ± 1.17 | 80 | 65 |
| | 6 | | | | | |
| | 7 | {1435862, 207274, 878652} | 100 | 99.97 ± 0.20 | 90 | 80 |
| | 8 | | | | | |
| | 9 | {308231, 214572, 784257} | 100 | 99.90 ± 0.63 | 75 | 65 |
| | 10 | {1435862, 383188, 141768} | 100 | 94.88 ± 1.59 | 70 | 65 |
| ALL | 1 | {AF068180, L13939, AF041434, M64925, X17025, J03473} | 100 | 99.94 ± 0.27 | 96 | 95 |
| | 2 | {M11722, AF013249, Z50022, X17025, J03473, U03106} | 100 | 99.98 ± 0.12 | 95 | 94 |
| | 3 | {M11722, AF013249, X17025, J03473, U03106, AB018310} | 100 | 99.99 ± 0.08 | 94 | 92 |
| | 4 | | | | | |
| | 5 | | | | | |
| Colon | 1 | | | | | |
| | 2 | {M26383, R84411} | 100 | 99.94 ± 0.37 | 80 | 65 |
| | 3 | | | | | |
| | 4 | {J05032, M76378} | 100 | 99.65 ± 1.14 | 70 | 65 |
| | 5 | {J05032, M63391} | 100 | 99.71 ± 0.95 | 75 | 70 |
Table 3 Prediction accuracies of the ensemble SVM(Biased) and SVM(Unbiased) classifiers
| Dataset | Selection criterion | No. of gene subsets | Biased (%) | Unbiased (%) |
| --- | --- | --- | --- | --- |
| Leukemia | Top 300 gene subsets | 147 | 92.31 | 84.62 |
| Leukemia | 10-Fold > 98* | 47 | 96.15 | 88.46 |
| Leukemia | 10-Fold = 100 and Full-fold ≥ 99 | 5 | 88.46 | 86.54 |
| DLBCL | Top 300 gene subsets | 61 | 95.24 | 85.71 |
| DLBCL | 10-Fold = 100 | 143** | 95.24 | 85.71 |
| DLBCL | 10-Fold = 100 and Full-fold = 100 | 29** | 95.24 | 85.71 |
| Prostate | Top 300 gene subsets | 300 | 97.06 | 88.24 |
| Prostate | Full-fold > 98 | 290 | 97.06 | 88.24 |
| Prostate | Full-fold > 99 | 139 | 97.06 | 88.24 |
| SRBCT | Top 300 gene subsets | 300 | 90 | 80 |
| SRBCT | Full-fold > 98 | 114 | 95 | 85 |
| SRBCT | Full-fold > 98 and 10-Fold = 100 | 8 | 100 | 90 |
| ALL | Top 300 gene subsets | 300 | 96 | 96 |
| ALL | 10-Fold = 100 | 59 | 97 | 96 |
| ALL | 10-Fold = 100 and Full-fold ≥ 99 | 42 | 95 | 95 |
| Colon | Top 300 gene subsets | 300 | 90 | 70 |
| Colon | 10-Fold = 100 | 62 | 85 | 65 |
| Colon | 10-Fold = 100 and Full-fold ≥ 98 | 59 | 85 | 65 |
* The Biased and Unbiased prediction accuracies in this row are obtained on the Leukemia52 test set. The item 10-Fold > 98 means that gene subsets with 10-fold CV accuracy greater than 98% are selected from the 300 top-ranked gene subsets; only 47 of these gene subsets are shared between the Leukemia72 training set and the Leukemia52 test set. The final ensemble classifier therefore consists of 47 individual classifiers, one constructed from each of these gene subsets, built with SVM(Biased) and SVM(Unbiased), respectively.
** The individual classifiers are constructed from gene subsets selected from all nodes in the last layer, not only from the 300 top-ranked nodes, because more than 300 gene subsets achieve 100% 10-fold CV accuracy on DLBCL.
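The selection-then-ensemble procedure described in these footnotes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the individual classifiers are combined by simple majority voting, and the names `select_subsets`, `majority_vote`, `ensemble_predict` and the `cv10` field are hypothetical.

```python
from collections import Counter


def select_subsets(subsets, min_cv=98.0):
    """Keep gene subsets whose 10-fold CV accuracy (percent) exceeds min_cv.

    Each subset is a dict with a hypothetical 'cv10' field; the threshold
    mirrors the '10-Fold > 98' criterion and is otherwise an assumption.
    """
    return [s for s in subsets if s["cv10"] > min_cv]


def majority_vote(predictions):
    """Return the label chosen by the most individual classifiers.

    Ties are broken by first occurrence, which is an arbitrary choice here.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, sample):
    """Predict one sample with every individual classifier, then vote.

    'classifiers' is a list of callables mapping a sample to a class label;
    each one stands in for a classifier trained on one selected gene subset.
    """
    return majority_vote([clf(sample) for clf in classifiers])
```

In this sketch, one classifier would be trained per subset returned by `select_subsets`, and `ensemble_predict` would then be applied to each test sample.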
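Table 4 Confidence levels of 20 test samples by the HBSA-SVM(Unbiased)-based ensemble classifier on the colon tumor dataset
| Sample no.* | Votes, first class | Votes, second class | Confidence level | Result** |
| --- | --- | --- | --- | --- |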
| 1 (43) | 116 | 184 | 1.5862 | C |
| 2 (44) | 298 | 2 | 149 | C |
| 3 (45) | 111 | 189 | 1.7027 | |
| 4 (46) | 285 | 15 | 19 | C |
| 5 (47) | 286 | 14 | 20.4286 | C |
| 6 (48) | 119 | 181 | 1.5210 | C |
| 7 (49) | 69 | 231 | 3.3478 | |
| 8 (50) | 165 | 135 | 1.2222 | |
| 9 (51) | 227 | 73 | 3.1096 | |
| 10 (52) | 297 | 3 | 99 | C |
| 11 (53) | 276 | 24 | 11.5 | C |
| 12 (54) | 19 | 281 | 14.7895 | C |
| 13 (55) | 297 | 3 | 99 | |
| 14 (56) | 88 | 212 | 2.4091 | |
| 15 (57) | 193 | 107 | 1.8037 | C |
| 16 (58) | 230 | 70 | 3.2857 | C |
| 17 (59) | 260 | 40 | 6.5 | C |
| 18 (60) | 98 | 202 | 2.0612 | C |
| 19 (61) | 300 | 0 | 300 | C |
| 20 (62) | 118 | 182 | 1.5424 | C |
* The number in parentheses denotes the serial number of the sample in the original colon tumor dataset.
** “C” means the sample was classified correctly and “E” means it was misclassified.
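The fourth column of the table above appears to be the ratio of the larger vote count to the smaller one among the 300 individual classifiers. A minimal sketch of that computation follows; the function name `confidence_level` is hypothetical, and the handling of a zero vote count (dividing by one instead) is an assumption chosen only because it reproduces the row `300 | 0 | 300`.

```python
def confidence_level(votes_a, votes_b):
    """Ratio of the majority vote count to the minority vote count.

    Assumption: when the minority count is zero, divide by 1 so the
    result equals the majority count (matches the row with 300 and 0).
    """
    majority = max(votes_a, votes_b)
    minority = min(votes_a, votes_b)
    return majority / max(minority, 1)
```

Under this reading, a value near 1 marks a nearly split vote, while large values mark samples on which almost all individual classifiers agree.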
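Table 5 Prediction accuracies of five runs of HBSA-KNN-based ensemble classifier on six test sets
| Dataset | First | Second | Third | Fourth | Fifth | Average accuracy |
| --- | --- | --- | --- | --- | --- | --- |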
| Leukemia | 86.54 | 84.62 | 88.46 | 84.62 | 84.62 | 85.57 ± 1.66 |
| DLBCL | 90.48 | 90.48 | 90.48 | 85.71 | 90.48 | 89.53 ± 2.13 |
| Prostate | 85.29 | 82.35 | 85.29 | 85.29 | 85.29 | 84.70 ± 1.31 |
| SRBCT | 95 | 95 | 95 | 90 | 95 | 94 ± 2.24 |
| ALL | 95 | 97 | 96 | 95 | 95 | 95.60 ± 0.89 |
| Colon | 75 | 75 | 75 | 75 | 75 | 75 ± 0 |
Column “First” gives the prediction accuracy of the ensemble classifier constructed on the first run of HBSA-KNN, and likewise for the remaining runs. The average accuracy is the mean prediction accuracy over the five runs of HBSA-KNN.
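The last column can be recomputed from the five per-run accuracies. The sketch below assumes the reported spread is the sample standard deviation (n − 1 in the denominator), which is what reproduces the DLBCL, SRBCT and ALL rows; the helper name `summarize_runs` is hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation), each rounded to 2 decimals."""
    return round(mean(accuracies), 2), round(stdev(accuracies), 2)


# Per-run accuracies copied from the DLBCL and SRBCT rows of the table.
dlbcl = summarize_runs([90.48, 90.48, 90.48, 85.71, 90.48])
srbct = summarize_runs([95, 95, 95, 90, 95])
```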
Figure 5 Power-law distribution of the occurrence frequency of genes selected on six tumor datasets. The abscissa denotes the frequency rank order of the selected genes. The vertical axis denotes the occurrence frequency of genes selected. The figure is drawn by using log-log coordinates.
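A rank-frequency curve of this kind can be derived from the selected gene subsets by counting how often each gene occurs and sorting the counts. The sketch below is illustrative only: the function names are hypothetical, and the least-squares fit in log-log coordinates is a common quick check for power-law behaviour rather than a method stated in the source.

```python
import math
from collections import Counter


def rank_frequency(gene_subsets):
    """Occurrence counts of genes across subsets, sorted from most to least frequent."""
    counts = Counter(gene for subset in gene_subsets for gene in subset)
    return sorted(counts.values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Requires at least two ranks. A roughly straight line in log-log
    coordinates (constant slope) is consistent with a power law.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```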
Figure 6 Classification accuracy versus the number of top-ranked genes on the six test sets.
Table 6 Comparison of the classification accuracies for HBSA-SVM(Biased), HBSA-SVM(Unbiased) and HBSA-KNN methods with the top-ranked genes
| 2 | 88.46 | 3 | 82.69 | 2 | 84.62 | |
| 5 | 86.54 | |||||
| 71 | 96.15 (H) | 15 | 92.31 (H) | 24 | 96.15 (H) | |
| 2 | 95.24 | 2 | 80.95 | 2 | 80.95 | |
| 1 | 94.12 | |||||
| 4 | 88.24 | 5 | 85.29 | |||
| 4 | 75 | 4 | 70 | 3 | 75 | |
| 24 | 100 (H) | 28 | 100 (H) | |||
| 7 | 96 | |||||
| 112 | 100 (H) | 111 | 99 (H) | 85 | 100 (H) | |
| 2 | 65 | 3 | 70 (H) | |||
| 7 | 70 | |||||
| 4 | 80 | 7 | 80 | |||
* ‘#TG’ denotes the number of top-ranked genes. An accuracy labeled ‘H’ is the highest accuracy achieved, and the corresponding #TG is the smallest number of top-ranked genes that achieves it.
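The rule in this footnote, reporting the smallest number of top-ranked genes that reaches the highest accuracy, can be sketched as below. The function name `minimal_top_genes` and the toy accuracy values are hypothetical and are not taken from the experiments.

```python
def minimal_top_genes(accuracy_by_size):
    """Return (smallest #TG reaching the best accuracy, best accuracy).

    accuracy_by_size maps a number of top-ranked genes to the accuracy
    (percent) obtained with that many genes.
    """
    best = max(accuracy_by_size.values())
    size = min(n for n, acc in accuracy_by_size.items() if acc == best)
    return size, best


# Toy curve: the best accuracy is first reached with 4 genes.
toy_curve = {1: 70.0, 2: 85.0, 4: 95.0, 8: 95.0, 16: 90.0}
toy_result = minimal_top_genes(toy_curve)
```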
Figure 7 ROC comparisons of HBSA-SVM(Unbiased) and HBSA-KNN-SVM(Unbiased).
Table 7 The comparison of prediction accuracies by HBSA-KNN, PAM and ClaNC on independent test set
| HBSA-KNN | Leukemia ( | 84.62 | 94.23 | 92.31 | 94.23 | 82.69 | 82.69 | 82.69 | 88.46 | 90.38 | 92.31 |
| 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||
| 71.43 | 89.29 | 85.71 | 89.29 | 67.86 | 67.86 | 67.86 | 78.57 | 82.14 | 85.71 | ||
| 75 | 88.89 | 85.71 | 88.89 | 72.73 | 72.73 | 72.73 | 80 | 82.76 | 85.71 | ||
| 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||
| DLBCL ( | 95.24 | 100 | 80.95 | 85.71 | 80.95 | 85.71 | 85.71 | 85.71 | 85.71 | 90.48 | |
| 100 | 100 | 85.71 | 85.71 | 78.57 | 85.71 | 85.71 | 85.71 | 85.71 | 92.86 | ||
| 85.71 | 100 | 71.43 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | ||
| 93.33 | 100 | 85.71 | 92.31 | 91.67 | 92.31 | 92.31 | 92.31 | 92.31 | 92.86 | ||
| 93.33 | 100 | 85.71 | 92.31 | 91.67 | 92.31 | 92.31 | 92.31 | 92.31 | 92.86 | ||
| Prostate ( | 88.24 | 76.47 | 82.35 | 85.29 | 82.35 | 79.41 | 76.47 | 76.47 | 82.35 | 85.29 | |
| 100 | 100 | 100 | 100 | 100 | 88.89 | 88.89 | 88.89 | 88.89 | 100 | ||
| 84 | 68 | 76 | 80 | 76 | 76 | 72 | 72 | 80 | 80 | ||
| 69.23 | 52.94 | 60 | 64.29 | 60 | 57.14 | 53.33 | 53.33 | 61.54 | 64.29 | ||
| 100 | 100 | 100 | 100 | 100 | 95 | 94.74 | 94.74 | 95.24 | 100 | ||
| SRBCT | 75 | 95 | 95 | 100 | 95 | 95 | 95 | 95 | 95 | 100 | |
| ALL | 75 | 82 | 87 | 94 | 92 | 93 | 93 | 96 | 97 | 99 | |
| Colon ( | 65 | 70 | 80 | 75 | 80 | 80 | 75 | 75 | 75 | 75 | |
| 75 | 83.33 | 91.67 | 91.67 | 91.67 | 83.33 | 83.33 | 83.33 | 75 | 75 | ||
| 50 | 50 | 62.50 | 50 | 62.50 | 75 | 62.50 | 62.50 | 75 | 75 | ||
| 69.23 | 71.43 | 78.57 | 73.33 | 78.57 | 83.33 | 76.92 | 76.92 | 81.82 | 81.82 | ||
| | 57.14 | 66.67 | 83.33 | 80 | 83.33 | 75 | 71.43 | 71.43 | 66.67 | 66.67 | |
| PAM | |||||||||||
| Leukemia | 82.69 | 90.38 | 90.38 | 90.38 | 92.31 | 94.23 | 96.15 | 96.15 | 98.08 | 98.08 | |
| DLBCL | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | 66.67 | |
| Prostate | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | 73.53 | |
| SRBCT | 45 | 45 | 75 | 75 | 85 | 95 | 95 | 95 | 95 | 95 | |
| ALL | 43 | 61 | 61 | 68 | 68 | 83 | 85 | 85 | 86 | 86 | |
| | Colon | 65 | 75 | 70 | 70 | 70 | 75 | 75 | 75 | 75 | 75 |
| ClaNC | |||||||||||
| Leukemia | 86.54 | 90.39 | 90.39 | 92.31 | 90.39 | 94.23 | 94.23 | 94.23 | 94.23 | 96.15 | |
| DLBCL | 80.95 | 95.24 | 95.24 | 95.24 | 95.24 | 80.95 | 76.19 | 71.43 | 71.43 | 71.43 | |
| Prostate | 73.53 | 85.29 | 79.41 | 76.47 | 76.47 | 79.41 | 79.41 | 76.47 | 76.47 | 79.41 | |
| SRBCT | 85 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | 95 | |
| ALL | 86 | 95 | 97 | 99 | 98 | 98 | 99 | 99 | 99 | 98 | |
| Colon | 65 | 65 | 65 | 70 | 70 | 75 | 75 | 75 | 75 | 75 | |
* k denotes the number of subclasses in each dataset, which ranges from two to six. For example, for the ALL dataset (k = 6), the size of the selected gene subset ranges from six (1 × 6) to sixty (10 × 6).
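The gene subset sizes compared across the ten columns follow directly from this footnote: each size is a multiple i × k with i from 1 to 10. A small sketch, with the hypothetical helper name `subset_sizes`:

```python
def subset_sizes(k, max_multiple=10):
    """Gene subset sizes i * k for i = 1..max_multiple, where k is the number of subclasses."""
    return [i * k for i in range(1, max_multiple + 1)]


# ALL has k = 6 subclasses, so sizes run from 6 to 60.
all_sizes = subset_sizes(6)
```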
Figure 8 The comparisons of three gene ranking methods.
Table 8 Top 10 genes selected by HBSA-KNN from the leukemia dataset
| APLP2 | 4 | stability |
| CD33 | 3 | [ |
| ZYX | -- | Tumor suppressor |
| MARCKSL1 | | [ |
| SP3 | 9 | [ |
| CD63 | 2 | [ |
| TCF3 | -- | [ |
| PSME1 | 1 | -- |
| CCND3 | -- | Tumor suppressor |
| CST3 | -- | PMID: 17728092 |
The genes are sorted according to their frequency. If a gene is validated in the literature, the corresponding reference is shown (‘PMID’ denotes the PubMed ID).
Table 9 Top 10 genes selected by HBSA-KNN on the prostate dataset
| MAF | 7 | Tumor oncogene |
| HPN | 1 | [ |
| ABL1 | 46 | Tumor suppressor |
| SLC25A6 | -- | [ |
| CHD9 | 4 | PMID: 20308527 |
| SERPINB5 | -- | Tumor suppressor |
| A2R6W1 | -- | PMID:17259976;Tumor suppressor |
| WWC1 | 2 | PMID: 16684779 |
| NELL2 | 2 | -- |
| RBP1 | 4 | PMID: 15280411 |
The columns are the same as described in Table 8.