Abstract
BACKGROUND: One intractable problem in microarray-based cancer classification is how to reduce extremely high-dimensional gene expression data so that the effects of noise are removed. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most existing methods of microarray-based cancer classification use too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.
Year: 2009 PMID: 19874631 PMCID: PMC2777919 DOI: 10.1186/1755-8794-2-64
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Microarray data decision table
Summary of the five gene expression datasets
| Dataset | No. of genes | Classes | Training samples (per class) | Test samples (per class) |
| Leukemia 1 | 7129 | ALL/AML | 38 (27/11) | 34 (20/14) |
| Lung Cancer | 12533 | MPM/ADCA | 32 (16/16) | 149 (15/134) |
| Prostate Cancer | 12600 | Tumor/Normal | 102 (52/50) | 34 (25/9) |
| Breast Cancer | 24481 | relapse/non-relapse | 78 (34/44) | 19 (12/7) |
| Leukemia 2 | 12582 | ALL/MLL/AML | 57 (20/17/20) | 15 (4/3/8) |
Thirteen gene pairs with high classification accuracy in the Leukemia dataset 1
| Gene pair | Training set: correctly classified (ALL/AML) | Training set: accuracy, % (ALL/AML) | Test set: correctly classified (ALL/AML) | Test set: accuracy, % (ALL/AML) |
| U46499_at - M92287_at | 35 (26/9) | 92.11 (96.30/81.82) | 33 (20/13) | 97.06 (100/92.86) |
| U46499_at - M12959_s_at | 36 (27/9) | 94.74 (100/81.82) | | |
| U46499_at - D63880_at | 36 (27/9) | 94.74 (100/81.82) | 33 (20/13) | 97.06 (100/92.86) |
| U46499_at - S50223_at | | | 33 (19/14) | 97.06 (95/100) |
| U46499_at - Z15115_at | 35 (26/9) | 92.11 (96.30/81.82) | 33 (20/13) | 97.06 (100/92.86) |
| L09209_s_at - M92287_at | | | 33 (20/13) | 97.06 (100/92.86) |
| L09209_s_at - S50223_at | | | 33 (19/14) | 97.06 (95/100) |
| X61587_at - M92287_at | 36 (26/10) | 94.74 (96.30/90.91) | 33 (20/13) | 97.06 (100/92.86) |
| X61587_at - M12959_s_at | | | 33 (19/14) | 97.06 (95/100) |
| L09209_s_at - D63880_at | | | 32 (19/13) | 94.12 (95/92.86) |
| U05259_rna1_at - M92287_at | 36 (26/10) | 94.74 (96.30/90.91) | 32 (20/12) | 94.12 (100/85.71) |
| L09209_s_at - X59417_at | | | 32 (19/13) | 94.12 (95/92.86) |
| L09209_s_at - Z15115_at | | | 32 (19/13) | 94.12 (95/92.86) |
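Each table entry pairs a count of correctly classified samples with a percentage, both overall and per class. The following is a minimal sketch of that arithmetic, assuming the class sizes of Leukemia dataset 1 given in the dataset summary (27 ALL and 11 AML training samples; 20 ALL and 14 AML test samples). The function name `accuracy_summary` is hypothetical and is not taken from the article.

```python
def accuracy_summary(correct_per_class, total_per_class):
    """Return (total correct, overall %, per-class %), rounded to two decimals."""
    if len(correct_per_class) != len(total_per_class):
        raise ValueError("class lists must have the same length")
    total_correct = sum(correct_per_class)
    overall = round(100.0 * total_correct / sum(total_per_class), 2)
    per_class = [
        round(100.0 * c / n, 2)
        for c, n in zip(correct_per_class, total_per_class)
    ]
    return total_correct, overall, per_class


# Training set of Leukemia dataset 1: 26 of 27 ALL and 9 of 11 AML correct.
print(accuracy_summary([26, 9], [27, 11]))   # (35, 92.11, [96.3, 81.82])

# Test set of Leukemia dataset 1: 20 of 20 ALL and 13 of 14 AML correct.
print(accuracy_summary([20, 13], [20, 14]))  # (33, 97.06, [100.0, 92.86])
```

The same check can be applied to any row in the other tables by substituting the class sizes of the corresponding dataset.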
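Sixteen genes with high classification accuracy in the Lung Cancer dataset
| Gene | Training set: correctly classified (MPM/ADCA) | Training set: accuracy, % (MPM/ADCA) | Test set: correctly classified (MPM/ADCA) | Test set: accuracy, % (MPM/ADCA) |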
| 2047_s_at | 30 (15/15) | 93.75 (93.75/93.75) | 122 (11/111) | 81.88 (73.33/82.84) |
| 266_s_at | | | 129 (13/116) | 86.58 (86.67/86.57) |
| 32046_at | 30 (15/15) | 93.75 (93.75/93.75) | 133 (12/121) | 89.26 (80/90.30) |
| 32551_at | 31 (15/16) | 96.88 (93.75/100) | 134 (14/120) | 89.93 (93.33/89.55) |
| 33245_at | 30 (15/15) | 93.75 (93.75/93.75) | 137 (14/123) | 91.95 (93.33/91.79) |
| 33833_at | | | 139 (13/126) | 93.29 (86.67/94.03) |
| 35330_at | 31 (15/16) | 96.88 (93.75/100) | 118 (14/104) | 79.19 (93.33/77.61) |
| 36533_at | 30 (15/15) | 93.75 (93.75/93.75) | 141 (13/128) | 94.64 (86.67/95.52) |
| 37205_at | 30 (15/15) | 93.75 (93.75/93.75) | 135 (12/123) | 90.60 (80/91.79) |
| 37716_at | 30 (15/15) | 93.75 (93.75/93.75) | | |
| 39795_at | 31 (16/15) | 96.88 (100/93.75) | 135 (14/121) | 90.60 (93.33/90.30) |
| 40936_at | 31 (15/16) | 96.88 (93.75/100) | 140 (12/128) | 93.96 (80/95.52) |
| 41286_at | 30 (15/15) | 93.75 (93.75/93.75) | 121 (13/108) | 81.21 (86.67/80.60) |
| 41402_at | 31 (16/15) | 96.88 (100/93.75) | 123 (13/110) | 82.55 (86.67/82.09) |
| 575_s_at | | | 141 (14/127) | 94.64 (93.33/94.78) |
| 988_at | 30 (15/15) | 93.75 (93.75/93.75) | 132 (13/119) | 88.59 (86.67/88.81) |
Twenty-five gene pairs with 100% LOOCV accuracy in the Lung Cancer dataset
| Gene pair | Test set: correctly classified (MPM/ADCA) | Test set: accuracy, % (MPM/ADCA) |
| 33754_at - 36562_at | | |
| 33754_at - 40496_at | 143 (11/132) | 95.97 (73.33/98.51) |
| 34105_f_at - 40496_at | 141 (9/132) | 94.64 (60/98.51) |
| 34105_f_at - 36562_at | 140 (10/130) | 93.96 (66.67/97.01) |
| 37004_at - 40496_at | 140 (11/129) | 93.96 (73.33/96.27) |
| 36562_at - 37004_at | 139 (13/126) | 93.29 (86.67/94.03) |
| 38827_at - 40445_at | 138 (15/123) | 92.62 (100/91.79) |
| 1882_g_at - 36562_at | 136 (11/125) | 91.28 (73.33/93.28) |
| 1882_g_at - 40496_at | 136 (10/126) | 91.28 (66.67/94.03) |
| 33907_at - 36562_at | 134 (10/124) | 89.93 (66.67/92.54) |
| 36562_at - 40496_at | 134 (9/125) | 89.93 (60/93.28) |
| 1882_g_at - 33907_at | 133 (11/122) | 89.26 (73.33/91.04) |
| 1882_g_at - 37004_at | 132 (13/119) | 88.59 (86.67/88.81) |
| 35947_at - 36269_at | 132 (12/120) | 88.59 (80/89.55) |
| 33907_at - 34105_f_at | 131 (9/122) | 87.92 (60/91.04) |
| 36269_at - 40445_at | 131 (14/117) | 87.92 (93.33/87.31) |
| 35947_at - 40445_at | 130 (14/116) | 87.25 (93.33/86.57) |
| 38074_at - 38827_at | 129 (14/115) | 86.58 (93.33/85.82) |
| 33907_at - 40496_at | 127 (8/119) | 85.23 (53.33/88.81) |
| 36269_at - 38074_at | 125 (13/112) | 83.89 (86.67/83.58) |
| 38074_at - 40445_at | 122 (13/109) | 81.88 (86.67/81.34) |
| 1117_at - 38827_at | 116 (15/101) | 77.85 (100/75.37) |
| 1117_at - 36269_at | 113 (13/100) | 75.84 (86.67/74.63) |
| 1117_at - 35947_at | 109 (12/97) | 73.15 (80/72.39) |
| 1117_at - 38074_at | 106 (14/92) | 71.14 (93.33/68.66) |
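The caption above refers to leave-one-out cross-validation (LOOCV) accuracy on the training set. As a rough illustration of how such a figure is obtained, the sketch below holds out each sample once, fits a classifier on the remaining samples, and counts the correct predictions. The classifier here is a hypothetical single-gene midpoint threshold used only as a stand-in; it is not the depended-degree rule induction reported in the article, and the expression values are made up.

```python
def fit_threshold_rule(values, labels):
    """Hypothetical stand-in classifier: cut at the midpoint of the two class means."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    means = {}
    for c in classes:
        vals = [v for v, y in zip(values, labels) if y == c]
        means[c] = sum(vals) / len(vals)
    low, high = sorted(classes, key=lambda c: means[c])
    cut = (means[low] + means[high]) / 2.0
    return lambda v: low if v <= cut else high


def loocv_accuracy(values, labels):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predict = fit_threshold_rule(train_values, train_labels)
        if predict(values[i]) == labels[i]:
            correct += 1
    return correct, 100.0 * correct / len(values)


# Made-up expression values for one gene (illustrative only, not from the datasets).
toy_values = [0.2, 0.4, 0.5, 0.9, 2.1, 2.4, 2.8, 3.0]
toy_labels = ["MPM"] * 4 + ["ADCA"] * 4
print(loocv_accuracy(toy_values, toy_labels))
```

Because the held-out sample never influences the rule that classifies it, the LOOCV count is a less optimistic estimate than accuracy measured on the same samples used for fitting.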
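Eight genes with high classification accuracy in the Prostate Cancer dataset
| Gene | Training set: correctly classified (Tumor/Normal) | Training set: accuracy, % (Tumor/Normal) | Test set: correctly classified (Tumor/Normal) | Test set: accuracy, % (Tumor/Normal) | α |
| 32598_at | | | 23 (17/6) | 67.65 (68.00/66.67) | 0.85 |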
| 36491_at | 84 (41/43) | 82.35 (78.85/86.00) | 30 (23/7) | 88.24 (92.00/77.78) | 0.80 |
| 40856_at | 85 (46/39) | 83.33 (88.46/78.00) | 23 (15/8) | 67.65 (60.00/88.89) | 0.80 |
| 32243_g_at | 84 (41/43) | 82.35 (78.85/86.00) | | | 0.80 |
| 36601_at | 85 (46/39) | 83.33 (88.46/78.00) | 17 (8/9) | 50.00 (32.00/100) | 0.80 |
| 38044_at | 81 (41/40) | 79.41 (78.85/80.00) | 29 (21/8) | 85.29 (84.00/88.89) | 0.80 |
| 41288_at | 88 (41/47) | 86.27 (78.85/94.00) | | | 0.80 |
| 1767_s_at | 83 (40/43) | 81.37 (76.92/86.00) | 24 (22/2) | 70.59 (88.00/22.22) | 0.80 |
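Three gene pairs with good classification accuracy in the Prostate Cancer dataset
| Gene pair | Training set: correctly classified (Tumor/Normal) | Training set: accuracy, % (Tumor/Normal) | Test set: correctly classified (Tumor/Normal) | Test set: accuracy, % (Tumor/Normal) | α |
| 35178_at - 35277_at | 83 (33/50) | 81.37 (63.46/100) | 26 (20/6) | 76.47 (80.00/66.67) | 0.75 |
| 35178_at - 38087_s_at | 83 (33/50) | 81.37 (63.46/100) | | | 0.75 |
| 39331_at - 33121_g_at | | | | | 0.75 |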
Eight genes with high classification accuracy in the Breast Cancer dataset
| Gene | Training set: correctly classified (relapse/non-relapse) | Training set: accuracy, % (relapse/non-relapse) | Test set: correctly classified (relapse/non-relapse) | Test set: accuracy, % (relapse/non-relapse) | α |
| | 57 (21/36) | 73.08 (61.76/81.82) | | | 0.70 |
| | | | 13 (8/5) | 68.42 (66.67/71.43) | 0.70 |
| | | | 13 (9/4) | 68.42 (75.00/57.14) | 0.70 |
| | 55 (17/38) | 70.51 (50.00/86.36) | 13 (11/2) | 68.42 (91.67/28.57) | 0.70 |
| | 57 (22/35) | 73.08 (64.71/79.55) | 15 (9/6) | 78.95 (75.00/85.71) | 0.70 |
| | | | | | 0.70 |
| | 57 (20/37) | 73.08 (58.82/84.09) | 13 (9/4) | 68.42 (75.00/57.14) | 0.70 |
| | 55 (22/33) | 70.51 (64.71/75.00) | 13 (10/3) | 68.42 (83.33/42.86) | 0.70 |
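Twenty-one genes with high classification accuracy in the Leukemia dataset 2
| Gene | Training set: correctly classified (ALL/MLL/AML) | Training set: accuracy, % (ALL/MLL/AML) | Test set: correctly classified (ALL/MLL/AML) | Test set: accuracy, % (ALL/MLL/AML) | α |
| 36239_at | | | | | 0.90 |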
| 39318_at | 47 (17/11/19) | 82.46 (85/64.71/95) | 13 (2/3/8) | 86.67 (50/100/100) | 0.80 |
| 40191_s_at | 48 (17/13/18) | 84.21 (85/76.47/90) | 12 (2/2/8) | 80 (50/66.67/100) | 0.80 |
| 840_at | 47 (19/10/18) | 82.46 (95/58.82/90) | 11 (3/1/7) | 73.33 (75/33.33/87.50) | 0.80 |
| 266_s_at | 46 (19/11/16) | 80.70 (95/64.71/80) | 13 (4/1/8) | 86.67 (100/33.33/100) | 0.80 |
| 37933_at | 45 (20/7/18) | 78.95 (100/41.18/90) | 8 (2/0/6) | 53.33 (50/0/75) | 0.75 |
| 38989_at | 43 (19/6/18) | 75.44 (95/35.29/90) | 12 (3/1/8) | 80 (75/33.33/100) | 0.75 |
| 33833_at | 44 (16/10/18) | 77.19 (80/58.82/90) | 10 (2/0/8) | 66.67 (50/0/100) | 0.75 |
| 32874_at | 43 (14/11/18) | 75.44 (70/64.71/90) | 10 (2/1/7) | 66.67 (50/33.33/87.5) | 0.7 |
| 37487_at | 41 (14/7/20) | 71.93 (70/41.18/100) | 11 (3/0/8) | 73.33 (75/0/100) | 0.7 |
| 31886_at | 42 (16/8/18) | 73.68 (80/47.06/90) | 13 (3/2/8) | 86.67 (75/66.67/100) | 0.7 |
| 35164_at | 48 (19/15/14) | 84.21 (95/88.24/70) | 13 (4/2/7) | 86.67 (100/66.67/87.5) | 0.7 |
| 36905_at | 46 (14/12/20) | 80.70 (70/70.59/100) | 9 (0/1/8) | 60 (0/33.33/100) | 0.7 |
| 37539_at | 50 (16/16/18) | 87.72 (80/94.12/90) | 10 (3/3/4) | 66.67 (75/100/50) | 0.7 |
| 37910_at | 45 (18/9/18) | 78.95 (90/52.94/90) | 9 (1/1/7) | 60 (25/33.33/87.5) | 0.7 |
| 32847_at | 44 (18/12/14) | 77.19 (90/70.59/70) | 11 (4/2/5) | 73.33 (100/66.67/62.5) | 0.7 |
| 35260_at | 42 (20/8/14) | 73.68 (100/47.06/70) | 9 (2/1/6) | 60 (50/33.33/75) | 0.7 |
| 41790_at | 47 (19/11/17) | 82.46 (95/64.71/85) | 13 (3/2/8) | 86.67 (75/66.67/100) | 0.7 |
| 32579_at | 48 (15/13/20) | 84.21 (75/76.47/100) | 11 (2/1/8) | 73.33 (50/33.33/100) | 0.7 |
| 1373_at | 47 (16/12/19) | 82.46 (80/70.59/95) | 10 (1/1/8) | 66.67 (25/33.33/100) | 0.7 |
| 1325_at | 47 (19/14/14) | 82.46 (95/82.35/70) | 10 (3/3/4) | 66.67 (75/100/50) | 0.7 |
Comparison of best classification accuracy for the Leukemia dataset 1
| Method | No. of genes | Correctly classified test samples (accuracy) | Rule-based classifier |
| depended degree + decision rules [this work] | 1 | 31 (91.18%) | yes |
| | 2 | 34 (100%) | |
| t-test, attribute reduction + decision rules [ | 1 | 31 (91.18%) | yes |
| attribute reduction + | 2 | 33 (97.06%) | no |
| rough sets, GAs + | 9 | 31 (91.18%) | no |
| EPs [ | 1 | 31 (91.18%) | yes |
| discretization + decision trees [ | unknown (c) | 31 (91.18%) | yes |
| CBF + decision trees [ | 1 | 31 (91.18%) | yes |
| TSP [ | 2 | 31 (91.18%) | yes |
| RCBT [ | 10-40 | 31 (91.18%) | yes |
| neighborhood analysis + weighted voting [ | 50 | 29 (85.29%) | no |
| signal to noise ratios + PNNs [ | 50 | 34 (100%) | no |
| MAMA [ | 132-549 | 34 (100%) | no |
| PLS + LD or QDA [ | 50-1500 | 28-33 (82.4%-97%) | no |
| prediction strength + SVMs [ | 25-1000 | 30-32 (88.2%-94.1%) | no |
| SVMs [ | 8-30 | 34 (100%) | no |
(a) The text before "+" states the feature selection method, while that after it states the classification method. The absence of "+" means that the same method was used for both feature selection and classification.
(b) The decision trees are also involved in feature selection.
(c) "unknown" means that no related data are provided in the article.
These explanations apply to the other tables.
Comparison of best classification accuracy for the Lung Cancer dataset
| Method | No. of genes | Correctly classified test samples (accuracy) | Rule-based classifier |
| depended degree + decision rules [this work] | 1 | 145 (97.34%) | yes |
| | 2 | 144 (96.64%) | |
| attribute reduction + | 2 | 146 (97.99%) | no |
| PCLs [ | unknown | 146 (97.99%) | yes |
| C4.5 [ | 1 | 122 (81.88%) | yes |
| Bagging [ | unknown | 131 (87.92%) | yes |
| Boosting [ | unknown | 122 (81.88%) | yes |
| SVMs [ | unknown | 148 (99.33%) | no |
| | unknown | 148 (99.33%) | no |
| discretization + decision trees [ | unknown | 139 (93.29%) | yes |
| RCBT [ | 10-40 | 146 (97.99%) | yes |
| gene expression ratios [ | 6 | 148 (99.33%) | no |
Comparison of best classification accuracy for the Prostate Cancer dataset
| Method | No. of genes | Correctly classified test samples (accuracy) | Rule-based classifier |
| depended degree + decision rules [this work] | 1 | 31 (91.18%) | yes |
| | 2 | 27 (79.41%) | |
| TSP [ | 2 | 32 (94.12%) | yes |
| PCLs [ | unknown | 33 (97.06%) | yes |
| discretization + Single C4.5 [ | unknown | 23 (67.65%) | yes |
| discretization + Bagging C4.5 [ | unknown | 25 (73.53%) | yes |
| discretization + AdaBoost C4.5 [ | unknown | 23 (67.65%) | yes |
| RCBT [ | unknown | 33 (97.06%) | yes |
| SVMs [ | unknown | 27 (79.41%) | no |
| signal to noise ratios + | 4 | 26 (77.2%) | no |
| | 16 | 29 (85.7%) | no |
(d) In [18], as both raw and normalized datasets were used, two groups of prediction results were obtained. Here, we chose their results from the normalized dataset. Another small difference is that we obtained the dataset from the Kent Ridge Bio-medical Data Set Repository, where the prostate test set includes 25 tumor and 9 normal samples instead of the 27 tumor and 8 normal samples studied in [69]. To facilitate comparison, the correctly classified sample numbers were calculated according to the total of 34 samples.
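Comparison of best classification accuracy for the Breast Cancer dataset
| Method | No. of genes | Correctly classified test samples (accuracy) | Rule-based classifier |
| | 1 | 16 (84.21%) | yes |
| TSP [ | 2 | 79.38% (e) | yes |
| RBF [ | 67 | 79.38% (e) | yes |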
| discretization + decision trees [ | unknown | 17 (89.47%) | yes |
| correlation coefficient [ | 70 | 17 (89.47%) | no |
(e) LOOCV result in the total of 97 samples.
Comparison of best classification accuracy for the Leukemia dataset 2
| Method | No. of genes | Correctly classified test samples (accuracy) | Rule-based classifier |
| | 1 | 14 (93.33%) | yes |
| HykGene + | 26 | 100% (f) | no (i) |
| signal to noise ratios + | 40 | 95% (g) | no |
| | 100 | 9 (90%) (h) | |
(f) LOOCV result in a total of 72 samples.
(g) LOOCV result in a total of 57 training samples.
(h) In [20], only 3 of 8 AML testing samples in the dataset were mentioned. Thus, their test set contained 10 rather than 15 samples.
(i) Except for C4.5, none of the others are rule-based classifiers.