Elizabeth Tapia, Leonardo Ornella, Pilar Bulacio, Laura Angelone.
Abstract
BACKGROUND: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes increases. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates of the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained.
Year: 2011 PMID: 21342522 PMCID: PMC3056725 DOI: 10.1186/1471-2105-12-59
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The maximum fraction of genes per binary classifier. The maximum fraction of genes Q that can be handled by core binary classifiers in binary mediated multiclass classification of microarray data samples involving p genes and q samples, p ≫ q. Multiclass classifiers for M ≥ 3 classes built from n binary classifiers, n ≥ ⌈log2 M⌉ + 2, are considered. Q is evaluated on the 8 benchmark microarray datasets used in this paper (Lymphoma, SRCBT, Brain, NCI60, Staunton, Su, GCMRM and GCM).
The classification performance of OAA and ECOC classifiers
| Dataset | M | n | Error-ECOC (F) | Error-OAA (G) | KS two-sided p-value^a | KS one-sided p-value^a | MW one-sided p-value^a |
|---|---|---|---|---|---|---|---|
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | NA | NA | 0 | NA | NA | - |
| SRCBT | 4 | 9 | 0 | 0 | 0.00437 | 0.00219 | 0.99682 |
| Brain | 5 | 9 | 0.1250 | 0.1250 | 0.98741 | - | - |
| NCI60 | 8 | 9 | 0.3077 | 0.2308 | 0.02222 | 0.01111 | 0.99682 |
| Staunton | 9 | 12 | 0.4615 | 0.4615 | 0.71123 | - | - |
| GCM RM | 11 | 11 | 0 | 0 | 0.39273 | - | - |
| Su | 11 | 13 | 0.0857 | 0.0857 | 0.92282 | - | - |
| GCM | 14 | 12 | 0.3625 | 0.2863 | 9.99e-16 | 4.76e-16 | 1 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | 11 | 0 | 0 | 0.98741 | - | - |
| SRCBT | 4 | 9 | 0 | 0 | 0.00307 | 0.00153 | 0.99999 |
| Brain | 5 | 15 | 0.1250 | 0.1250 | 0.99970 | - | - |
| NCI60 | 8 | 14 | 0.3077 | 0.2308 | 0.00213 | 0.00106 | 0.99996 |
| Staunton | 9 | 19 | 0.4615 | 0.4615 | 0.79201 | - | - |
| GCM RM | 11 | 12 | 0 | 0 | 0.79201 | - | - |
| Su | 11 | 17 | 0.0857 | 0.0857 | 0.32750 | - | - |
| GCM | 14 | 12 | 0.3624 | 0.2863 | 9.99e-16 | 4.76e-16 | 1 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | 11 | 0 | 0 | 0.98741 | - | - |
| SRCBT | 4 | 9 | 0 | 0 | 0.00307 | 0.00153 | 0.99999 |
| Brain | 5 | 18 | 0.125 | 0.125 | 0.99999 | - | - |
| NCI60 | 8 | 16 | 0.3077 | 0.2308 | 0.00045 | 0.00022 | 0.99999 |
| Staunton | 9 | 19 | 0.4615 | 0.4615 | 0.62717 | - | - |
| GCM RM | 11 | 12 | 0 | 0 | 0.96394 | - | - |
| Su | 11 | 17 | 0.0857 | 0.0857 | 0.46532 | - | - |
| GCM | 14 | 12 | 0.3666 | 0.2863 | < 2.2e-16 | < 2.2e-16 | 1 |
The classification performance of ECOC classifiers of size at most ⌈η·log2M⌉ and OAA classifiers under bounded optimum S2N gene selection over 200 runs of Montecarlo 4:1 train-test partitions. M and n respectively denote the median number of binary classifiers in OAA and ECOC classifiers. Error-ECOC and Error-OAA respectively denote the median classification errors attained by ECOC and OAA classifiers. For the purposes of KS tests, Error-ECOC and Error-OAA are denoted as F and G, respectively.
a p-values of two-sided KS tests, one-sided KS tests and one-sided MW tests. The alternative hypothesis of two-sided KS tests is "the error of ECOC classifiers is different from that of OAA classifiers", i.e., the relationship between CDFs is F ≠ G. The alternative hypothesis for one-sided KS tests is "the error of ECOC classifiers is greater than that of OAA classifiers", i.e., the relationship between CDFs is F < G.
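The correspondence between "greater values" and a lower CDF in the one-sided KS hypothesis can be checked with a small sketch. This is only an illustration using the Python standard library: the helper names `ecdf` and `ks_statistics` and the error values are hypothetical, and the sketch computes KS statistics only, not the p-values reported in the table.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistics(f_sample, g_sample):
    """Return (two-sided D, sup(F - G), sup(G - F)) for two samples."""
    F, G = ecdf(f_sample), ecdf(g_sample)
    # The suprema of the ECDF differences are attained at observed values.
    points = sorted(set(f_sample) | set(g_sample))
    d_f_above = max(F(x) - G(x) for x in points)
    d_g_above = max(G(x) - F(x) for x in points)
    return max(d_f_above, d_g_above), d_f_above, d_g_above


# Hypothetical per-run errors (not taken from the paper): ECOC errors are
# shifted upwards, so their ECDF F lies below the OAA ECDF G.
ecoc_errors = [0.30, 0.35, 0.40, 0.45]
oaa_errors = [0.20, 0.25, 0.30, 0.35]

d, d_f_above, d_g_above = ks_statistics(ecoc_errors, oaa_errors)
print(d, d_f_above, d_g_above)
```

In this toy case F never rises above G, which is the situation described by the one-sided alternative F < G.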
The overall number of genes selected by OAA and ECOC classifiers
| Dataset | M | n | B-ECOC | B-OAA | G-ECOC (F) | G-OAA (G) | KS two-sided p-value^a | KS one-sided p-value^a | MW one-sided p-value^a |
|---|---|---|---|---|---|---|---|---|---|
| 200 Montecarlo 4:1 train-test partitions at | |||||||||
| Lymphoma | 3 | NA | NA | 4 | NA | 22 | NA | NA | NA |
| SRCBT | 4 | 9 | 14.22 | 6 | 37 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
| Brain | 5 | 9 | 28.1 | 19 | 177 | 109.5 | 5.08e-05 | 2.54e-05 | 0.99975 |
| NCI60 | 8 | 9 | 45.11 | 34 | 310 | 326 | 9.31e-07 | 0.27804 | 0.07651 |
| Staunton | 9 | 12 | 46 | 34.11 | 387 | 296 | 9.91e-08 | 4.95e-08 | 0.99993 |
| GCM RM | 11 | 11 | 142 | 36 | 800 | 365.5 | < 2.2e-16 | 2.76e-08 | 1 |
| Su | 11 | 13 | 126 | 62 | 1056 | 916 | 5.36e-12 | 1.15e-24 | 0.99978 |
| GCM | 14 | 12 | 322 | 128 | 2096 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||||
| Lymphoma | 3 | 11 | 4.27 | 4 | 12 | 22 | 5.52e-08 | 1 | 9.85e-09 |
| SRCBT | 4 | 9 | 12.22 | 6 | 33 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
| Brain | 5 | 15 | 16.16 | 19 | 109.5 | 109.5 | 0.03970 | 0.01984 | 0.54495 |
| NCI60 | 8 | 14 | 42.12 | 39 | 286.5 | 326 | 9.31e-07 | 0.95599 | 0.00105 |
| Staunton | 9 | 19 | 40.03 | 34.11 | 381.5 | 296 | 6.95e-10 | 3.48e-10 | 0.99997 |
| GCM RM | 11 | 12 | 72 | 36 | 570 | 365.5 | < 2.2e-16 | 1.66e-19 | 1 |
| Su | 11 | 17 | 112 | 62 | 940 | 916 | 1.82e-10 | 9.11e-11 | 0.98387 |
| GCM | 14 | 12 | 322 | 128 | 2078 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||||
| Lymphoma | 3 | 11 | 4.26 | 4 | 12 | 22 | 3.05e-08 | 1 | 3.85e-09 |
| SRCBT | 4 | 9 | 12.22 | 6 | 33 | 23 | < 2.2e-16 | < 2.2e-16 | 1 |
| Brain | 5 | 18 | 16.06 | 19 | 105 | 109.5 | 0.03970 | 0.01984 | 0.15586 |
| NCI60 | 8 | 16 | 36.15 | 39 | 251 | 326 | 9.31e-07 | 1 | 3.23e-05 |
| Staunton | 9 | 19 | 34.09 | 34.11 | 373.5 | 296 | 4.81e-09 | 2.41e-09 | 0.99989 |
| GCM RM | 11 | 12 | 72 | 36 | 561 | 365.5 | < 2.2e-16 | 1.66e-19 | 1 |
| Su | 11 | 17 | 112 | 62 | 924.5 | 916 | 1.34e-09 | 6.69e-10 | 0.97006 |
| GCM | 14 | 12 | 322 | 128 | 2066 | 1406 | < 2.2e-16 | < 2.2e-16 | 1 |
The number of genes selected by OAA and ECOC classifiers of size at most ⌈η·log2M⌉ under bounded optimum S2N gene selection over 200 Montecarlo 4:1 train-test partitions. M and n respectively denote the median number of binary classifiers at OAA and ECOC classifiers. B-ECOC and B-OAA respectively denote the median number of genes per binary SVM at ECOC and OAA classifiers. G-ECOC and G-OAA respectively denote the median overall number of genes selected at ECOC and OAA classifiers. G-ECOC and G-OAA are denoted as F and G for purposes of KS tests, respectively.
a p-values of two-sided KS tests, one-sided KS tests and one-sided MW tests. The alternative hypothesis of two-sided KS tests is "the number of genes selected by ECOC classifiers (F) is different from that of OAA classifiers (G)", i.e., the relationship between corresponding CDFs is F ≠ G. The alternative hypothesis for one-sided KS tests is "the number of genes selected by ECOC classifiers (F) is greater than that of OAA classifiers (G)", i.e., the relationship between corresponding CDFs is F < G.
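The evaluation protocol named in the caption above (a size cap of ⌈η·log2M⌉ and 200 Montecarlo 4:1 train-test partitions) might be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the seed, the sample count and the η value are hypothetical, and the splits are plain random shuffles with no stratification by class, which may differ from the procedure actually used.

```python
import math
import random


def max_ecoc_size(eta, num_classes):
    """Upper bound ceil(eta * log2(M)) on the number of binary classifiers."""
    return math.ceil(eta * math.log2(num_classes))


def montecarlo_partitions(num_samples, runs=200, train_parts=4, test_parts=1, seed=0):
    """Yield (train_indices, test_indices) for repeated random 4:1 splits."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    n_train = round(num_samples * train_parts / (train_parts + test_parts))
    for _ in range(runs):
        indices = list(range(num_samples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


# Hypothetical usage: 60 samples, 200 runs, and an illustrative eta of 3.
splits = list(montecarlo_partitions(num_samples=60, runs=200))
size_cap = max_ecoc_size(eta=3, num_classes=14)
print(len(splits), len(splits[0][0]), len(splits[0][1]), size_cap)
```

Each run would then train and test both classifiers on its own split, and the medians over the 200 runs would be the quantities reported per dataset.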
The stability of gene selection attained by OAA and ECOC classifiers
| Dataset | M | n | S-ECOC (F) | S-OAA (G) | KS two-sided p-value^a | KS one-sided p-value^a | MW one-sided p-value^a |
|---|---|---|---|---|---|---|---|
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | NA | NA | 0.5539 | NA | NA | NA |
| SRCBT | 4 | 9 | 0.6835 | 0.5652 | < 2.2e-16 | 0.99979 | < 2.2e-16 |
| Brain | 5 | 9 | 0.4643 | 0.4315 | < 2.2e-16 | 0.02363 | < 2.2e-16 |
| NCI60 | 8 | 9 | 0.4313 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
| Staunton | 9 | 12 | 0.4129 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 0.73628 |
| GCM RM | 11 | 11 | 0.6043 | 0.6143 | < 2.2e-16 | < 2.2e-16b | < 2.2e-16 |
| Su | 11 | 13 | 0.6286 | 0.5461 | < 2.2e-16 | 0.99594 | < 2.2e-16 |
| GCM | 14 | 12 | 0.6783 | 0.5886 | < 2.2e-16 | 1 | < 2.2e-16 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | 11 | 0.6093 | 0.5539 | < 2.2e-16 | 1 | < 2.2e-16 |
| SRCBT | 4 | 9 | 0.6745 | 0.5652 | < 2.2e-16 | 1 | < 2.2e-16 |
| Brain | 5 | 15 | 0.4582 | 0.4315 | < 2.2e-16 | 0.00213b | < 2.2e-16 |
| NCI60 | 8 | 14 | 0.4234 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
| Staunton | 9 | 19 | 0.4185 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 5.93e-07 |
| GCM RM | 11 | 12 | 0.6112 | 0.6143 | < 2.2e-16 | 6.83e-08b | < 2.2e-16 |
| Su | 11 | 17 | 0.6423 | 0.5461 | < 2.2e-16 | 0.99154 | < 2.2e-16 |
| GCM | 14 | 12 | 0.6650 | 0.5886 | < 2.2e-16 | 0.42216 | < 2.2e-16 |
| 200 Montecarlo 4:1 train-test partitions at | |||||||
| Lymphoma | 3 | 11 | 0.6093 | 0.5539 | < 2.2e-16 | 1 | < 2.2e-16 |
| SRCBT | 4 | 9 | 0.6740 | 0.5652 | < 2.2e-16 | 1 | < 2.2e-16 |
| Brain | 5 | 18 | 0.4591 | 0.4315 | < 2.2e-16 | 0.00165b | < 2.2e-16 |
| NCI60 | 8 | 16 | 0.4170 | 0.4365 | < 2.2e-16 | < 2.2e-16 | 1 |
| Staunton | 9 | 19 | 0.4168 | 0.4119 | < 2.2e-16 | < 2.2e-16 | 0.02409 |
| GCM RM | 11 | 12 | 0.6124 | 0.6143 | < 2.2e-16 | 8.46e-05b | < 2.2e-16 |
| Su | 11 | 17 | 0.6405 | 0.5461 | < 2.2e-16 | 0.99154 | < 2.2e-16 |
| GCM | 14 | 12 | 0.6578 | 0.5886 | < 2.2e-16 | 0.03809b | < 2.2e-16 |
The stability of gene selection attained by ECOC classifiers of size at most ⌈η·log2M⌉ and OAA classifiers under bounded optimum S2N gene selection over 200 Montecarlo 4:1 train-test partitions. M and n respectively denote the median number of binary classifiers in OAA and ECOC classifiers. S-ECOC and S-OAA respectively denote the stability of gene selection attained by ECOC and OAA classifiers, measured by Salton's coefficient. For the purposes of KS tests, S-ECOC and S-OAA are denoted as F and G, respectively.
a p-values of two-sided KS tests, one-sided KS tests and one-sided MW tests. The alternative hypothesis of two-sided KS tests is "the stability of gene selection in ECOC classifiers (F) is different from that in OAA classifiers (G)", i.e., the relationship between corresponding CDFs is F ≠ G. The alternative hypothesis for one-sided KS tests is "the stability of gene selection in ECOC classifiers (F) is lower than that in OAA classifiers (G)", i.e., the relationship between corresponding CDFs is F > G. The alternative hypothesis of one-sided MW tests is "the median stability of gene selection in ECOC classifiers is higher than that of OAA classifiers".
b Difficult to definitely compare. Highly significant p-values for both one-sided KS tests.
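Salton's coefficient between two gene sets A and B is the cosine similarity |A ∩ B| / sqrt(|A|·|B|). The sketch below shows one way a stability score could be derived from it. Averaging the coefficient over all pairs of runs is an assumption made for illustration, not necessarily the aggregation used for the table, and the function names and gene identifiers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def salton(a, b):
    """Salton's (cosine) coefficient between two sets of selected genes."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty selections
    return len(a & b) / sqrt(len(a) * len(b))


def selection_stability(gene_sets):
    """Mean pairwise Salton coefficient (assumed aggregation over runs)."""
    pairs = list(combinations(gene_sets, 2))
    return sum(salton(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene selections from three train-test partitions.
runs = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6", "g7"},
]
stability = selection_stability(runs)
print(round(stability, 4))
```

A value of 1 means every run selected the same genes, while values near 0 mean the selections barely overlap.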
The performance of OAA and ECOC classifiers on train-test partitions
| Dataset | M | n | G-ECOC | G-OAA | Error-ECOC | Error-OAA |
|---|---|---|---|---|---|---|
| GCM RM | 11 | 10 | 926 | 1260 | 0.1852 | 0.1852 |
| GCM | 14 | 20 | 1314 | 423 | 0.4782 | 0.3043 |
The performance of OAA and ECOC classifiers of size at most ⌈η·log2M⌉ on benchmark microarray datasets under bounded optimum S2N gene selection and a public train-test partition. M and n respectively denote the number of binary classifiers used by OAA and ECOC classifiers. G-OAA and G-ECOC respectively denote the overall number of genes selected by OAA and ECOC classifiers. Error-OAA and Error-ECOC respectively denote the classification error attained by OAA and ECOC classifiers.
Figure 2. The architecture of an ECOC-LDPC classifier under bounded gene selection. Right squares represent constituent ECOC classifiers induced from simple parity codes, left squares represent practical binary classifiers, rectangles represent "channel" functions, ellipses represent binary predictions, and small circles represent gene expression measurements. Edges are put between constituent ECOC classifiers and ideal binary predictions c, taking care that the connectivity profile remains sparse (j <
The best C for ECOC and OAA classifiers based on linear SVMs
| Dataset | ECOC at | ECOC at | ECOC at | OAAa |
|---|---|---|---|---|
| Lymphoma | NA | 1:1-1 | 1:1-1 | 1:1-1 |
| SRCBT | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
| Brain | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
| NCI60 | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
| Staunton | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
| GCMRM | 1:1-1 | 1:1-1 | 1:1-1 | 1:1-1 |
| Su | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
| GCM | 1:1-1 | 1:1-1 | 1:1-1 | 1:0.5-1 |
The best C for ECOC classifiers of size at most ⌈η·log2M⌉ and OAA classifiers, both based on linear SVM classifiers, under bounded optimum S2N gene selection over 200 Montecarlo 4:1 train-test partitions
a The best C expressed as median: lower quartile-upper quartile.
54], we have I(y, r) ≤ I(y, x). In other words, the maximization of I(y, x) requires the choice of an error correcting output code such that I(y, r) is maximized. In addition, let r be a set of n i.i.d. random variables r. Thus, we have I(y, r) = ΣI(y, r) and . Again by the data processing inequality, we have I(v, r) ≤ I(y, r). If we further assume that T is a gene selection algorithm able to select just the genes relevant to r, i.e., H(v | r) = 0, we have I(v, r) = H(v) - H(v | r) = H(v). Finally, let the genes in v be a set of g i.i.d. binary random variables. Thus, we have H(v) = H(T(x)) = Q · p · H(f), where Q is the fraction of genes relevant to r and H(f) is the binary entropy function measuring the information content of a generic gene which is expressed with probability f and not expressed with probability 1 - f. Hence, the following upper bound on the fraction of genes Q that can be handled by any binary classifier in a binary mediated multiclass classifier for microarray data samples is obtained