Md Rafiul Hassan, M Maruf Hossain, James Bailey, Geoff Macintyre, Joshua W K Ho, Kotagiri Ramamohanarao.
Abstract
BACKGROUND: Microarray gene expression profiling has provided extensive datasets that can describe characteristics of cancer patients. An important challenge for this type of data is the discovery of gene sets that can be used as the basis for developing a clinical predictor for cancer. It is desirable that such gene sets be compact, give accurate predictions across many classifiers, be biologically relevant and have good biological process coverage.
Year: 2009 PMID: 19208118 PMCID: PMC2648737 DOI: 10.1186/1471-2105-10-S1-S19
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Our set of 7 genes selected by majority voting and ordered by area under ROC curve
| Area under ROC curve | Gene |
| 0.800802 | TSPY-like 5 |
| 0.794786 | Neuromedin U |
| 0.794786 | Carbonic Anhydrase IX |
| 0.792781 | ATP/GTP binding protein 1 |
| 0.774733 | Lin-9 homolog (C. elegans) |
| 0.766711 | ASP (abnormal spindle) homolog, microcephaly associated (Drosophila) |
| 0.764706 | Diaphanous homolog 1 (Drosophila) |
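The ordering in the table above relies on a per-gene area under the ROC curve. The sketch below shows one way such a ranking could be computed; it is not the authors' implementation. The function names, the toy data, and the choice to fold the AUC around 0.5 (so that under-expressed genes rank as highly as over-expressed ones) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def single_gene_auc(values: Sequence[float], labels: Sequence[int]) -> float:
    """AUC obtained when one gene's expression value is used directly as the score.

    Mann-Whitney formulation: the fraction of (positive, negative) sample
    pairs in which the positive sample has the higher value, with ties
    counted as half a win.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    auc = wins / (len(pos) * len(neg))
    # Assumption (not from the source): a gene that is lower in the positive
    # class is treated as equally informative, so fold the AUC around 0.5.
    return max(auc, 1.0 - auc)


def rank_genes_by_auc(
    expression: Dict[str, Sequence[float]], labels: Sequence[int]
) -> List[Tuple[str, float]]:
    """Return (gene, AUC) pairs sorted from highest to lowest AUC."""
    scored = [(gene, single_gene_auc(vals, labels)) for gene, vals in expression.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three genes measured on four samples (labels 0/1).
toy_expression = {
    "gene_c": [4.0, 1.0, 3.0, 2.0],
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [1.0, 3.0, 2.0, 4.0],
}
toy_labels = [0, 0, 1, 1]
ranking = rank_genes_by_auc(toy_expression, toy_labels)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of a few hundred patients; a rank-based computation would be preferable for larger data.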
Accuracy achieved on test set by different classifiers using various subsets of our 7 genes
| Classifier | Accuracy |
| C4.5 | 84.52% |
| C4.5 with boosting (ADABoost.M1) | 91.67% |
| C4.5 with bagging | 84.52% |
| Naïve Bayes | 84.52% |
| Naïve Bayes with boosting | 84.52% |
| Naïve Bayes with bagging | 88.69% |
| LMT | 84.52% |
| NBTree | 84.52% |
| Random Forest | 84.52% |
| Random Forest with boosting | 84.52% |
| Random Forest with bagging | 88.69% |
| | 80.36% |
| Logistic Regression | 81.55% |
| ANN | 77.38% |
| SVM | 83.33% |
Comparison of the weighted accuracy of different classifiers using i) subsets of our 7 genes and ii) all 25,000 genes
| Classifier | 7-gene subsets: test set (19) | 7-gene subsets: all data (5-fold CV) | All 25,000 genes: test set (19) | All 25,000 genes: all data (5-fold CV) |
| C4.5 | 84.52% | 88.49% | 79.17% | 62.36% |
| C4.5 with boosting (ADABoost) | 91.67% | 89.54% | 63.10% | 62.89% |
| C4.5 with bagging | 84.52% | 88.94% | 48.81% | 63.98% |
| Naïve Bayes | 84.52% | 92.13% | 50.00% | 52.17% |
| Naïve Bayes with bagging | 88.69% | 86.82% | 50.00% | 52.17% |
| Naïve Bayes with boosting | 84.52% | 87.65% | 50.00% | 52.17% |
| LMT | 84.52% | 88.11% | 77.38% | 60.29% |
| NBTree | 84.52% | 83.69% | 66.07% | 58.76% |
| Random Forest | 84.52% | 90.59% | 66.07% | 62.47% |
| Random Forest with bagging | 88.69% | 90.59% | 73.21% | 64.75% |
| Random Forest with boosting | 84.52% | 88.48% | 66.07% | 62.45% |
| | 80.36% | 83.00% | 63.69% | 61.94% |
| Logistic Regression | 81.55% | 88.11% | Out of memory* | Out of memory* |
| ANN | 77.38% | 83.44% | Out of memory* | Out of memory* |
| SVM | 83.33% | 76.23% | 63.69% | 68.12% |
*Our experiments were carried out on a standard desktop computer with an Intel Core 2 Duo 2.4 GHz CPU and 2 GB of RAM.
Comparison of the accuracy of different classifiers using 2 known biomarker genes and our selection of 6 genes on Ma et al. and Loi et al. data
| 2 genes | 6 genes | 2 genes | 6 genes | |
| C4.5 | 60.00% | 75.64% | ||
| C4.5 with boosting (ADABoost) | 70.00% | 66.67% | ||
| C4.5 with bagging | 70.00% | 67.95% | ||
| Naïve Bayes | 60.00% | 74.36% | 74.36% | |
| Naïve Bayes with boosting | 60.00% | 74.36% | ||
| Naïve Bayes with bagging | 60.00% | 75.64% | 75.64% | |
| LMT | 70.00% | 76.92% | ||
| NBTree | 80.00% | 80.00% | 75.64% | |
| Random Forest | 60.00% | 74.36% | ||
| Random Forest with boosting | 70.00% | 67.95% | ||
| Random Forest with bagging | 70.00% | 71.79% | ||
| 70.00% | 71.79% | |||
| Logistic Regression | 70.00% | 74.36% | ||
| ANN | 60.00% | 74.36% | ||
| SVM | 60.00% | 74.36% | 74.36% | |
Comparison of the weighted accuracy on the test set of the best result from our voting method versus some well known cancer treatment guidelines
| C4.5 with boosting | 91.67% |
| St. Gallen 1998* | 68% |
| NIH 2000* | 79% |
| NPI* | 58% |
| 70-genes* | 74% |
| BPIM* | 68% |
| BDIM* | 58% |
*Results obtained from Gevaert et al. [43], where results were provided as number of true positives and true negatives.
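This excerpt does not define "weighted accuracy". One common reading, assumed here and not confirmed by the text, is the mean of sensitivity and specificity, which can be computed directly from counts of true positives and true negatives such as those mentioned in the footnote above. The function name and the example counts below are hypothetical.

```python
def weighted_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity.

    Assumption: this is the definition of "weighted accuracy"; the source
    excerpt does not state it explicitly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must have at least one sample")
    sensitivity = tp / (tp + fn)  # fraction of positive cases recovered
    specificity = tn / (tn + fp)  # fraction of negative cases recovered
    return 0.5 * (sensitivity + specificity)


# Hypothetical counts, not taken from the paper.
example = weighted_accuracy(tp=8, fn=2, tn=3, fp=1)
```

Under this definition each class contributes equally regardless of its size, so a classifier that always predicts the majority class scores 50% rather than the majority-class proportion.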
Comparison of the classifier performance using i) a variable subset of our 7 genes, ii) a set of 17 genes identified by ROC with Markov Blanket [9], iii) a set of 17 genes identified by LAD [15], iv) a set of 70 genes identified by van 't Veer [3] and v) a set of 231 genes identified by van 't Veer [3]
| C4.5 with boosting | 91.67% | 84.52% | 68.42% | 59.52% | 54.76% | 76.19% |
| C4.5 | 84.52% | 84.52% | 68.42% | 57.90%* | 42.11%* | 73.68%* |
| | 80.36% | 77.38% | 74.21% | 63.16%* | 63.16%* | 78.94%* |
| Logistic Regression | 81.55% | 77.38% | 73.68% | 73.68%* | 47.37%* | 73.68%* |
| ANN | 77.38% | 76.19% | 84.21% | 84.21%* | 42.11%* | 73.68%* |
| SVM | 83.33% | 76.19% | 79.47% | 63.16%* | 57.90%* | 73.68%* |
*Results adopted from Alexe et al. [15].
Figure 1. ROC curves of three classifiers with selected genes using three filter approaches: FROC, t-test and PCA.