Jonathan M Keith, Christian M Davey, Sarah E Boyd.
Abstract
BACKGROUND: Many problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers.
Year: 2012 PMID: 22838505 PMCID: PMC3473310 DOI: 10.1186/1471-2105-13-179
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data matrix
| Individual | Classifier 1 | Classifier 2 | … | Classifier K |
|---|---|---|---|---|
| 1 | 0 | 1 | … | 0 |
| 2 | 1 | 0 | … | 1 |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| N | 1 | 1 | … | 1 |
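As an illustration of how a data matrix like the one above might be handled, the following minimal sketch stores one row per individual and one binary column per classifier, then tallies how often each classification pattern occurs. The toy values and the names `data_matrix` and `pattern_counts` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy data matrix (not from the paper): one row per individual,
# one column per classifier; 1 = classified positive, 0 = classified negative.
data_matrix = [
    (0, 1, 0),
    (1, 0, 1),
    (1, 1, 1),
    (0, 1, 0),
]

def pattern_counts(rows):
    """Count how many individuals share each classification pattern."""
    return Counter(tuple(row) for row in rows)

counts = pattern_counts(data_matrix)
for pattern, n in sorted(counts.items()):
    print(pattern, n)
```

Collapsing the matrix to pattern counts can be convenient for large data sets, since with K classifiers there are at most 2^K distinct rows regardless of the number of individuals.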
Figure 1. Protein sub-cellular localisation results. Density plots of model variables for the chloroplast localisation data. Vertical lines show gold standard sensitivity, specificity or proportion. A, B, C: the sensitivity of the AA, DP and NCC classifiers, respectively. D, E, F: the specificity of the AA, DP and NCC classifiers, respectively. G: estimated proportion of proteins localised to chloroplasts.
Figure 2. Swine flu results. Density plots of model variables for the swine flu data. A: Sensitivity of the NPA classifier. B: Sensitivity of the NS classifier. C: Specificity of the NPA classifier. D: Specificity of the NS classifier. E: Prevalence of the disease.
Sensitivities and specificities of the chloroplast localisation classifier combinations
| | | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.311 | 0.307 | 0.094 | 1.000 | 1.000 | 0.000 | |||||
| 4 | 0.394 | 0.384 | 0.096 | 0.999 | 0.999 | 0.001 | |||||
| 6 | 0.553 | 0.544 | 0.108 | 0.997 | 0.997 | 0.001 | |||||
| 18 | 0.436 | 0.433 | 0.101 | 0.998 | 0.998 | 0.001 | |||||
| 24 | At least two classifiers | 0.762 | 0.772 | 0.089 | 0.994 | 0.995 | 0.003 | ||||
| 64 | 0.867 | 0.870 | 0.058 | 0.932 | 0.933 | 0.021 | |||||
| 96 | 0.934 | 0.939 | 0.037 | 0.894 | 0.895 | 0.025 | |||||
| 120 | 0.900 | 0.910 | 0.053 | 0.904 | 0.907 | 0.025 | |||||
| 128 | 0.969 | 0.975 | 0.021 | 0.868 | 0.869 | 0.029 | |||||
SD: Standard Deviation.
‡Optimal combination using ranking criteria 1, 2 and 3.
†Optimal combination using ranking criterion 4.
Sensitivities and specificities of the swine flu classifier combinations
| | | ||||||
|---|---|---|---|---|---|---|---|
| 2 | |||||||
| 1 | All results are negative | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 |
| 2 | 0.626 | 0.633 | 0.162 | 0.991 | 0.994 | 0.010 | |
| 3 | ¬ | 0.116 | 0.101 | 0.088 | 0.938 | 0.947 | 0.047 |
| 4 | 0.742 | 0.760 | 0.152 | 0.928 | 0.939 | 0.054 | |
| 5 | 0.214 | 0.201 | 0.127 | 0.879 | 0.885 | 0.072 | |
| 6 | 0.840 | 0.859 | 0.119 | 0.870 | 0.875 | 0.078 | |
| 7 | (¬ | 0.330 | 0.340 | 0.124 | 0.817 | 0.821 | 0.079 |
| 8 | 0.957 | 0.974 | 0.050 | 0.808 | 0.813 | 0.087 | |
| 9 | ¬( | 0.043 | 0.026 | 0.050 | 0.192 | 0.187 | 0.087 |
| 10 | ( | 0.670 | 0.660 | 0.124 | 0.183 | 0.179 | 0.079 |
| 11 | ¬ | 0.160 | 0.141 | 0.119 | 0.130 | 0.125 | 0.078 |
| 12 | ¬ | 0.786 | 0.799 | 0.127 | 0.121 | 0.115 | 0.072 |
| 13 | ¬ | 0.258 | 0.240 | 0.152 | 0.072 | 0.061 | 0.054 |
| 14 | 0.884 | 0.899 | 0.088 | 0.062 | 0.053 | 0.047 | |
| 15 | ¬( | 0.374 | 0.367 | 0.162 | 0.009 | 0.006 | 0.010 |
| 16 | All results are positive | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
SD: Standard Deviation.
‡Optimal combination using ranking criteria 1, 2 and 3.
†Optimal combination using ranking criterion 4.
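The sixteen rows of the swine flu table correspond to the sixteen possible logical combinations of two binary classifiers, one for each truth table over the four result patterns. The sketch below enumerates every such combination for K classifiers and computes its sensitivity and specificity from fixed per-classifier true-positive rates (`alpha`) and false-positive rates (`beta`). It assumes the classifiers are conditionally independent given the true class, and it uses hypothetical point values rather than the distributions summarised in the table, so it illustrates the idea and is not the authors' implementation.

```python
from itertools import product

def pattern_prob(pattern, rates):
    """P(pattern | true class), assuming classifiers are conditionally
    independent given the true class (an assumption of this sketch)."""
    p = 1.0
    for c, r in zip(pattern, rates):
        p *= r if c else (1.0 - r)
    return p

def combination_performance(alpha, beta):
    """Yield (truth_table, sensitivity, specificity) for every logical
    combination of K classifiers. alpha[k] and beta[k] are the true-positive
    and false-positive rates of classifier k."""
    patterns = list(product((0, 1), repeat=len(alpha)))
    # A logical combination is a choice of output (0 or 1) for each pattern.
    for table in product((0, 1), repeat=len(patterns)):
        sens = sum(pattern_prob(p, alpha) for p, out in zip(patterns, table) if out)
        false_pos = sum(pattern_prob(p, beta) for p, out in zip(patterns, table) if out)
        yield dict(zip(patterns, table)), sens, 1.0 - false_pos

# Hypothetical rates for two classifiers (not taken from the paper).
results = list(combination_performance(alpha=[0.8, 0.7], beta=[0.1, 0.2]))
print(len(results))  # number of logical combinations for K = 2
```

The number of combinations grows as 2^(2^K), giving 16 for two classifiers and 256 for three, so exhaustive enumeration of this kind is only practical for a small number of classifiers.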
Sensitivities and specificities of the SNP classifier combinations
| No. | Combination | Sensitivity mean | Sensitivity median | Sensitivity SD | Specificity mean | Specificity median | Specificity SD |
|---|---|---|---|---|---|---|---|
| 2 | | 0.083 | 0.082 | 0.017 | 1.000 | 1.000 | 0.0000 |
| 4 | | 0.098 | 0.097 | 0.018 | 1.000 | 1.000 | 0.0000 |
| 6 | | 0.144 | 0.143 | 0.020 | 1.000 | 1.000 | 0.0000 |
| 18 | | 0.483 | 0.481 | 0.053 | 1.000 | 1.000 | 0.0000 |
| 24 | At least two classifiers | 0.559 | 0.558 | 0.053 | 1.000 | 1.000 | 0.0000 |
| 64 | | 0.645 | 0.647 | 0.047 | 0.997 | 0.997 | 0.0001 |
| 96 | | 0.869 | 0.869 | 0.035 | 0.997 | 0.997 | 0.0001 |
| 120 | | 0.932 | 0.933 | 0.021 | 0.998 | 0.998 | 0.0001 |
| 128 | | 0.943 | 0.944 | 0.018 | 0.996 | 0.996 | 0.0002 |

∼ 5000 positives

| No. | Combination | Sensitivity mean | Sensitivity median | Sensitivity SD | Specificity mean | Specificity median | Specificity SD |
|---|---|---|---|---|---|---|---|
| 2 | | 0.065 | 0.066 | 0.007 | 1.000 | 1.000 | 0.0000 |
| 4 | | 0.079 | 0.079 | 0.007 | 1.000 | 1.000 | 0.0000 |
| 6 | | 0.123 | 0.123 | 0.008 | 1.000 | 1.000 | 0.0000 |
| 18 | | 0.439 | 0.440 | 0.025 | 1.000 | 1.000 | 0.0000 |
| 24 | At least two classifiers | 0.510 | 0.511 | 0.026 | 1.000 | 1.000 | 0.0000 |
| 64 | | 0.601 | 0.601 | 0.024 | 0.987 | 0.987 | 0.0002 |
| 96 | | 0.850 | 0.852 | 0.018 | 0.991 | 0.991 | 0.0004 |
| 120 | | 0.918 | 0.919 | 0.011 | 0.994 | 0.994 | 0.0004 |
| 128 | | 0.930 | 0.931 | 0.010 | 0.986 | 0.986 | 0.0004 |
SD: Standard Deviation.
‡Optimal combination using ranking criteria 1, 2, 3 and 4.
Run times
| Data set | | | | |
|---|---|---|---|---|
| Swine Flu | 48 | 625 s | 0 s | 1 s |
| Chloroplasts | 357 | 624 s | 1 s | 12 s |
| SNP | 541094 | 8 hrs | 2917 s | 12 s |
Figure 3. Conditional dependencies of the model. The dependencies of parameters in the model. ϕ is the proportion of the population that is positive for the feature of interest, Tₙ is the true classification of individual n, αₖ and βₖ are the probabilities of a true positive and a false positive (respectively) for classifier k, and Cₙₖ is the classification of individual n according to classifier k.
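The dependencies described in this caption can be read as a generative process: each individual is truly positive with probability ϕ, and each classifier then reports a positive with its true-positive probability if the individual is positive, or with its false-positive probability otherwise. The following minimal sketch simulates data under that reading and evaluates the marginal probability of a classification pattern. It assumes the classifiers act independently given the true class; the function names and parameter values are hypothetical, and no prior distributions or inference steps from the paper are reproduced.

```python
import random

def simulate(n_individuals, phi, alpha, beta, seed=0):
    """Draw true classes and classifier outputs from the model sketched in
    Figure 3, assuming conditional independence given the true class."""
    rng = random.Random(seed)
    truth, data = [], []
    for _ in range(n_individuals):
        t = 1 if rng.random() < phi else 0   # true class of this individual
        rates = alpha if t else beta         # per-classifier P(positive call)
        row = tuple(1 if rng.random() < r else 0 for r in rates)
        truth.append(t)
        data.append(row)
    return truth, data

def marginal_pattern_prob(pattern, phi, alpha, beta):
    """P(pattern) with the unobserved true class summed out."""
    pos, neg = phi, 1.0 - phi
    for c, a, b in zip(pattern, alpha, beta):
        pos *= a if c else (1.0 - a)
        neg *= b if c else (1.0 - b)
    return pos + neg

# Hypothetical parameter values for two classifiers (not from the paper).
truth, data = simulate(50, phi=0.3, alpha=[0.9, 0.8], beta=[0.1, 0.2])
print(marginal_pattern_prob((1, 1), 0.3, [0.9, 0.8], [0.1, 0.2]))
```

Because the true class is summed out in `marginal_pattern_prob`, a function of this kind needs only the observed classifier outputs, which is what makes estimation without a gold standard possible in principle.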