Susmita Datta, Vasyl Pihur, Somnath Datta.
Abstract
BACKGROUND: Different classifiers tend to work well for different types of data, and it is usually not known a priori which algorithm will be optimal in a given classification application. Moreover, for most classification problems, selecting the best performing algorithm among a number of competing algorithms is difficult for various reasons. For example, the order of performance may depend on the performance measure used for the comparison. In this work, we present a novel adaptive ensemble classifier, constructed by combining bagging and rank aggregation, that is capable of adaptively changing its performance depending on the type of data being classified. The attractive feature of the proposed classifier is its multi-objective nature: the classification results can be simultaneously optimized with respect to several performance measures, for example accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance, as judged on test samples, than a more naive approach that attempts to directly identify the optimal classifier based on the training data performances of the individual classifiers.
Year: 2010 PMID: 20716381 PMCID: PMC2933716 DOI: 10.1186/1471-2105-11-427
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Threenorm simulation data
| | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| SVM | 0.451900 | 0.468200 | 0.435600 | 0.429016 |
| | (0.00988) | (0.02144) | (0.02314) | (0.01318) |
| RF | 0.562200 | 0.557600 | 0.566800 | 0.591170 |
| | (0.00540) | (0.00853) | (0.00806) | (0.00635) |
| PLS + LDA | 0.610000 | 0.608000 | 0.612000 | 0.610032 |
| | (0.00561) | (0.00860) | (0.00797) | (0.00561) |
| PCA + LDA | 0.503600 | 0.501800 | 0.505400 | 0.505236 |
| | (0.00617) | (0.00674) | (0.00680) | (0.00753) |
| PLS + RF | 0.612200 | 0.586400 | 0.638000 | 0.648102 |
| | (0.00506) | (0.01250) | (0.01198) | (0.00595) |
| PLS + QDA | 0.607500 | 0.617200 | 0.597800 | 0.607500 |
| | (0.00577) | (0.01142) | (0.01218) | (0.00577) |
| PLR | 0.540800 | 0.538000 | 0.543600 | 0.557342 |
| | (0.00459) | (0.00819) | (0.00804) | (0.00553) |
| PLS | 0.600300 | 0.600400 | 0.600200 | 0.647896 |
| | (0.00542) | (0.01319) | (0.01361) | (0.00609) |
| Greedy | 0.596600 | 0.581800 | 0.611400 | 0.621590 |
| | (0.00559) | (0.01117) | (0.01045) | (0.00657) |
| Ensemble | 0.613000 | 0.606200 | 0.619800 | 0.653700 |
| | (0.00563) | (0.00823) | (0.00729) | (0.00587) |
Average accuracy, sensitivity, specificity and AUC for 100 datasets from the threenorm data with N = 100 and d = 1000. Standard errors are reported in parentheses.
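The parenthesized values can be read as standard errors of the mean over the simulated datasets. The source does not state the formula used, so the following is a minimal sketch that assumes the usual definition (sample standard deviation divided by the square root of the number of datasets). The function name and the input values are hypothetical and are not taken from the paper.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean.

    Assumption: SE = sample standard deviation / sqrt(n).
    """
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se

# Hypothetical per-dataset accuracies, for illustration only.
per_dataset_accuracy = [0.58, 0.61, 0.63, 0.60, 0.62]
mean, se = mean_and_standard_error(per_dataset_accuracy)

# Printed in the same "mean (SE)" layout as the table.
print(f"{mean:.6f} ({se:.5f})")
```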
Simulated microarray data
| | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Linear SVM | 0.902200 | 0.907600 | 0.896800 | 0.967464 |
| | (0.00451) | (0.00683) | (0.00679) | (0.00216) |
| Polynomial SVM | 0.506200 | 0.716400 | 0.296000 | 0.498772 |
| | (0.00383) | (0.05493) | (0.05477) | (0.00640) |
| Radial SVM | 0.773200 | 0.882000 | 0.664400 | 0.833576 |
| | (0.03090) | (0.02851) | (0.04473) | (0.03750) |
| Sigmoid SVM | 0.905000 | 0.910400 | 0.899600 | 0.968472 |
| | (0.00432) | (0.00655) | (0.00581) | (0.00210) |
| Greedy | 0.671400 | 0.807200 | 0.535600 | 0.702040 |
| | (0.04177) | (0.03811) | (0.05508) | (0.05016) |
| Ensemble | 0.900600 | 0.902400 | 0.898800 | 0.968156 |
| | (0.00366) | (0.00661) | (0.00592) | (0.00213) |
Average accuracy, sensitivity, specificity and AUC for 50 datasets from the simulated microarray data with N = 100 and d = 5000. Standard errors are reported in parentheses. A single SVM classifier was used with four different kernel settings.
Breast cancer microarray data
| | Accuracy | Sensitivity | Specificity | AUC | Count |
|---|---|---|---|---|---|
| SVM | 0.5846 | 0.6679 | 0.5525 | 0.6845 | 168 |
| PLR | 0.6154 | 0.6859 | 0.5706 | 0.6503 | 197 |
| PLS + RF | 0.6077 | 0.6615 | 0.5562 | 0.6498 | 170 |
| PLS + LDA | 0.6846 | 0.6744 | 0.6887 | 0.6826 | 305 |
| PLS + QDA | 0.6462 | 0.7063 | 0.5799 | 0.6871 | 78 |
| PCA + QDA | 0.4692 | 0.3127 | 0.6645 | 0.5401 | 92 |
| Ensemble | 0.6385 | 0.6563 | 0.6227 | 0.7108 | |
Average of 10-fold cross validation for the breast cancer microarray data. The number of bootstraps N = 101. The count column shows the number of times a particular individual algorithm was a locally "best" performing classifier across all 10 folds.
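The individual-classifier rows of this table illustrate why the ordering of classifiers depends on the performance measure, and how several orderings might be merged into one. The sketch below ranks the six classifiers on each measure and combines the rankings with an unweighted Borda count. Borda count is a simple stand-in chosen for illustration: this excerpt does not specify which rank aggregation scheme the authors use, and their method presumably operates on per-bootstrap performances rather than on these cross-validated averages. The function names are hypothetical.

```python
# Averages copied from the breast cancer table:
# (accuracy, sensitivity, specificity, AUC)
scores = {
    "SVM":       (0.5846, 0.6679, 0.5525, 0.6845),
    "PLR":       (0.6154, 0.6859, 0.5706, 0.6503),
    "PLS + RF":  (0.6077, 0.6615, 0.5562, 0.6498),
    "PLS + LDA": (0.6846, 0.6744, 0.6887, 0.6826),
    "PLS + QDA": (0.6462, 0.7063, 0.5799, 0.6871),
    "PCA + QDA": (0.4692, 0.3127, 0.6645, 0.5401),
}
measures = ("accuracy", "sensitivity", "specificity", "AUC")

def ranking(measure_index):
    """Classifier names ordered from best to worst on one measure."""
    return sorted(scores, key=lambda name: scores[name][measure_index],
                  reverse=True)

def borda(rankings):
    """Unweighted Borda count (an assumption, not the paper's scheme).

    In each ranking of n items, the top item earns n - 1 points and the
    last item earns 0; items are then ordered by total points.
    """
    n = len(rankings[0])
    points = {name: 0 for name in rankings[0]}
    for order in rankings:
        for position, name in enumerate(order):
            points[name] += n - 1 - position
    return sorted(points, key=points.get, reverse=True), points

per_measure = [ranking(i) for i in range(len(measures))]
for measure, order in zip(measures, per_measure):
    print(f"best by {measure}: {order[0]}")

aggregate, points = borda(per_measure)
print("aggregated order:", aggregate)
```

On these averages, PLS + LDA ranks first on accuracy and specificity while PLS + QDA ranks first on sensitivity and AUC, so no single measure settles the choice; the aggregated ordering is one way to trade the measures off against each other.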
Proteomics ovarian cancer data
| | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| RF | 0.9550 | 0.9639 | 0.9520 | 0.9924 |
| SVM | 0.9350 | 0.9021 | 0.9731 | 0.9795 |
| PLS + RF | 0.9050 | 0.9040 | 0.9029 | 0.9703 |
| PLS + LDA | 0.9600 | 0.9639 | 0.9624 | 0.9784 |
| PLS + QDA | 0.9550 | 0.9539 | 0.9648 | 0.9781 |
| Ensemble | 0.9650 | 0.9639 | 0.9711 | 0.9871 |
Averages of 5-fold cross validation for the proteomics ovarian cancer data.
Confusion matrix
| | | True | | |
|---|---|---|---|---|
| | | Class 1 | Class 0 | Total |
| Predicted | Class 1 | a | b | a + b |
| | Class 0 | c | d | c + d |
| | Total | a + c | b + d | a + b + c + d |
Confusion matrix from which many performance measures (accuracy, sensitivity, specificity) can be computed.
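As a minimal sketch of how these measures follow from the four cell counts, the function below uses the standard definitions and assumes that Class 1 is the positive class (the excerpt does not say which class is treated as positive). The function name and the example counts are hypothetical.

```python
def confusion_measures(a, b, c, d):
    """Compute accuracy, sensitivity and specificity from cell counts.

    Layout follows the confusion matrix, with predictions in rows:
      a = predicted 1, true 1    b = predicted 1, true 0
      c = predicted 0, true 1    d = predicted 0, true 0
    Assumes Class 1 is the positive class and that each true class
    contains at least one sample (otherwise a division by zero occurs).
    """
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,   # correct predictions / all samples
        "sensitivity": a / (a + c),    # true positives / all true Class 1
        "specificity": d / (b + d),    # true negatives / all true Class 0
    }

# Hypothetical counts, for illustration only.
m = confusion_measures(a=40, b=10, c=5, d=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```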
Figure 1. Workflow of our ensemble classifier.