Torsten Blum, Sebastian Briesemeister, Oliver Kohlbacher.
Abstract
BACKGROUND: Knowledge of subcellular localization of proteins is crucial to proteomics, drug target discovery and systems biology since localization and biological function are highly correlated. In recent years, numerous computational prediction methods have been developed. Nevertheless, there is still a need for prediction methods that show more robustness and higher accuracy.
Year: 2009 PMID: 19723330 PMCID: PMC2745392 DOI: 10.1186/1471-2105-10-274
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. MultiLoc2 architecture. The architecture of MultiLoc2-HighRes (animal version). A query sequence is processed by a first layer of six subprediction methods (SVMTarget, SVMSA, SVMaac, PhyloLoc, GOLoc and MotifSearch). The two new subprediction methods, PhyloLoc and GOLoc, are highlighted in bold. The individual outputs of the first-layer methods are collected in the protein profile vector (PPV), which enters a second layer of SVMs that produces probability estimates for each localization.
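The two-layer design in the caption above can be illustrated with a minimal sketch of how a protein profile vector might be assembled. The function names `build_ppv`, `toy_svmaac` and `toy_motif_search` are hypothetical stand-ins, not the actual MultiLoc2 subprediction methods, and the second SVM layer is deliberately left out.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a first-layer method maps a sequence to a list of scores.
SubPredictor = Callable[[str], List[float]]


def build_ppv(sequence: str, first_layer: Sequence[SubPredictor]) -> List[float]:
    """Concatenate the outputs of all first-layer methods into one vector."""
    ppv: List[float] = []
    for method in first_layer:
        ppv.extend(method(sequence))
    return ppv


def toy_svmaac(sequence: str) -> List[float]:
    """Toy stand-in (assumption): relative frequency of two residues."""
    n = max(len(sequence), 1)
    return [sequence.count("K") / n, sequence.count("L") / n]


def toy_motif_search(sequence: str) -> List[float]:
    """Toy stand-in (assumption): 1.0 if an example motif occurs, else 0.0."""
    return [1.0 if "KDEL" in sequence else 0.0]


# The resulting vector would be the input of the second-layer classifier.
ppv = build_ppv("MKLLKDEL", [toy_svmaac, toy_motif_search])
print(ppv)
```

Keeping the first layer as a list of interchangeable callables mirrors the way the caption describes adding PhyloLoc and GOLoc: a new method only lengthens the vector that the second layer sees.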
Figure 2. PhyloLoc and GOLoc architecture. The architectures of PhyloLoc and GOLoc from MultiLoc2-LowRes. The input of PhyloLoc is a vector of similarities (phylogenetic profile) between the query sequence and the best sequence match in each genome inferred from BLAST. The input of GOLoc is a binary-coded vector representing the GO terms of the query sequence inferred from InterPro using InterProScan. PhyloLoc and GOLoc use one-versus-one SVMs to process their input and to calculate probability estimates for each localization.
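As a rough illustration of the two input encodings described in this caption, the sketch below builds a binary GO-term vector and a phylogenetic profile. The function names, the genome labels and the assumption that per-genome similarities are already available as numbers between 0 and 1 are illustrative only; the caption does not specify how the BLAST similarity is scaled.

```python
from typing import Dict, List, Set


def go_vector(query_terms: Set[str], vocabulary: List[str]) -> List[int]:
    """Binary coding: 1 if the query is annotated with the term, else 0."""
    return [1 if term in query_terms else 0 for term in vocabulary]


def phylo_profile(best_hits: Dict[str, float], genomes: List[str]) -> List[float]:
    """One similarity value per genome; 0.0 when no hit was found (assumption)."""
    return [best_hits.get(genome, 0.0) for genome in genomes]


# Illustrative vocabulary and genome list (placeholders, not the real ones).
vocabulary = ["GO:0005634", "GO:0005739", "GO:0005737"]
genomes = ["genome_A", "genome_B", "genome_C"]

go_input = go_vector({"GO:0005634", "GO:0005737"}, vocabulary)
profile_input = phylo_profile({"genome_A": 0.8, "genome_C": 0.3}, genomes)
print(go_input, profile_input)
```

Both vectors have a fixed length determined by the vocabulary or genome list, which is what allows them to be fed to an SVM.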
Cross-validation performance comparison of different MultiLoc architectures trained using the BaCelLo and the Höglund datasets
| BaCelLo | | | | | | | |
| MultiLoc | 77.3 (± 2.9) | 75.7 (± 3.1) | 78.4 (± 2.7) | 71.0 (± 2.6) | 71.4 (± 6.8) | 67.8 (± 3.8) | |
| + PhyloLoc | 80.1 (± 2.4) | 78.2 (± 2.9) | 80.0 (± 2.5) | 73.6 (± 0.9) | 78.6 (± 3.6) | 77.4 (± 1.9) | |
| + GOLoc | 84.0 (± 1.7) | 82.8 (± 2.0) | 81.1 (± 0.5) | 75.5 (± 1.1) | 80.9 (± 4.4) | 77.6 (± 3.5) | |
| MultiLoc2-LowRes | 86.1 (± 1.4) | 84.0 (± 1.7) | 82.8 (± 2.2) | 77.9 (± 0.5) | 81.9 (± 4.1) | 80.2 (± 3.5) | |
| Höglund | | | | | | | |
| MultiLoc | 78.6 (± 1.2) | 76.4 (± 1.2) | 78.0 (± 1.3) | 76.6 (± 1.2) | 78.0 (± 1.8) | 76.4 (± 1.7) | |
| + PhyloLoc | 84.6 (± 0.7) | 84.0 (± 0.6) | 84.7 (± 1.4) | 84.4 (± 0.9) | 86.5 (± 1.5) | 84.3 (± 0.7) | |
| + GOLoc | 87.3 (± 1.8) | 86.7 (± 1.0) | 87.1 (± 0.9) | 86.9 (± 0.8) | 86.9 (± 1.4) | 86.3 (± 1.1) | |
| MultiLoc2-HighRes | 89.3 (± 1.4) | 88.6 (± 1.0) | 89.2 (± 1.1) | 88.9 (± 1.2) | 89.4 (± 0.8) | 88.7 (± 0.9) | |
This table compares the average sensitivities (AVGs) and overall accuracies (ACCs) of MultiLoc2-LowRes and MultiLoc2-HighRes with those of the original MultiLoc and of the extended architectures that add only PhyloLoc or only GOLoc. The AVGs and ACCs are given in percent. The standard deviations (in parentheses) describe the variation of the AVGs and ACCs across the different cross-validation models.
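The difference between the two measures in this table can be made concrete with a small sketch. It assumes that the average sensitivity is the unweighted mean of the per-localization sensitivities and that the overall accuracy is the fraction of all correct predictions. The use of the sample standard deviation across folds is also an assumption, since the note does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def overall_accuracy(y_true, y_pred):
    """Percentage of all proteins whose predicted localization is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def average_sensitivity(y_true, y_pred):
    """Unweighted mean of the per-localization sensitivities, in percent."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return 100.0 * mean(correct[c] / total[c] for c in total)


def summarize_folds(values):
    """Mean and sample standard deviation over cross-validation folds."""
    return mean(values), stdev(values)


# Toy labels: a large class predicted well, a small class predicted badly.
y_true = ["nu", "nu", "nu", "cy"]
y_pred = ["nu", "nu", "nu", "mi"]
acc = overall_accuracy(y_true, y_pred)
avg = average_sensitivity(y_true, y_pred)
fold_mean, fold_sd = summarize_folds([80.0, 82.0, 84.0])
print(acc, avg, fold_mean, fold_sd)
```

In the toy data the overall accuracy is dominated by the large class, while the average sensitivity weights every localization equally, which is why the two numbers can diverge on unbalanced datasets.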
Comparison of the localization-specific prediction results using the BaCelLo independent dataset (BaCelLo IDS)
| SP | 75 | 97 | 97 | 0.89 | 9 | 78 | 98 | 0.60 | 6 | 83 | 95 | 0.58 | |
| mi | 48 | 89 | 97 | 0.81 | 77 | 68 | 94 | 0.62 | 6 | 67 | 96 | 0.51 | |
| ch | - | - | - | - | - | - | - | - | 72 | 77 | 94 | 0.72 | |
| nu | 224 | 62 | 93 | 0.57 | 152 | 63 | 79 | 0.36 | 36 | 91 | 90 | 0.77 | |
| cy | 85 | 72 | 82 | 0.43 | 180 | 54 | 78 | 0.27 | 17 | 41 | 94 | 0.38 | |
| nu/cy | 308 | 93 | 96 | 0.87 | 332 | 92 | 78 | 0.63 | 52 | 94 | 92 | 0.84 | |
| SP | 75 | 87 | 95 | 0.79 | 9 | 78 | 98 | 0.63 | 6 | 83 | 93 | 0.50 | |
| mi | 48 | 83 | 96 | 0.75 | 77 | 51 | 95 | 0.52 | 6 | 67 | 93 | 0.40 | |
| ch | - | - | - | - | - | - | - | - | 72 | 53 | 94 | 0.51 | |
| nu | 224 | 58 | 93 | 0.54 | 152 | 50 | 84 | 0.32 | 36 | 86 | 91 | 0.74 | |
| cy | 85 | 71 | 80 | 0.39 | 180 | 56 | 75 | 0.22 | 17 | 37 | 87 | 0.20 | |
| nu/cy | 308 | 91 | 91 | 0.78 | 332 | 84 | 76 | 0.48 | 52 | 93 | 84 | 0.74 | |
| SP | 75 | 93 | 97 | 0.88 | 9 | 100 | 98 | 0.74 | 6 | 100 | 95 | 0.66 | |
| mi | 48 | 74 | 95 | 0.66 | 77 | 79 | 87 | 0.58 | 6 | 17 | 100 | 0.40 | |
| ch | - | - | - | - | - | - | - | - | 72 | 71 | 83 | 0.54 | |
| nu | 224 | 57 | 83 | 0.41 | 152 | 72 | 67 | 0.38 | 36 | 88 | 78 | 0.60 | |
| cy | 85 | 51 | 74 | 0.21 | 180 | 32 | 84 | 0.19 | 17 | 27 | 98 | 0.38 | |
| nu/cy | 308 | 93 | 92 | 0.83 | 332 | 85 | 83 | 0.61 | 52 | 88 | 84 | 0.70 | |
| SP | 75 | 79 | 91 | 0.65 | 9 | 78 | 92 | 0.35 | 6 | 83 | 96 | 0.60 | |
| mi | 48 | 64 | 92 | 0.51 | 77 | 42 | 92 | 0.38 | 6 | 58 | 90 | 0.30 | |
| ch | - | - | - | - | - | - | - | - | 72 | 77 | 88 | 0.66 | |
| nu | 224 | 66 | 73 | 0.39 | 152 | 63 | 59 | 0.22 | 36 | 72 | 89 | 0.61 | |
| cy | 85 | 35 | 86 | 0.22 | 180 | 35 | 78 | 0.15 | 17 | 33 | 97 | 0.39 | |
| nu/cy | 308 | 84 | 83 | 0.64 | 332 | 83 | 49 | 0.31 | 52 | 75 | 93 | 0.70 | |
| SP | 75 | 86 | 99 | 0.88 | 9 | 93 | 99 | 0.20 | 6 | 100 | 92 | 0.61 | |
| mi | 48 | 51 | 99 | 0.71 | 77 | 33 | 99 | 0.51 | 6 | 67 | 86 | 0.40 | |
| ch | - | - | - | - | - | - | - | - | 72 | 7 | 95 | 0.40 | |
| nu/cy | 308 | 98 | 73 | 0.79 | 332 | 98 | 40 | 0.52 | 52 | 86 | 77 | 0.52 | |
| SP | 75 | 88 | 98 | 0.88 | 9 | 89 | 97 | 0.56 | 6 | 100 | 93 | 0.61 | |
| mi | 48 | 82 | 92 | 0.63 | 77 | 50 | 92 | 0.44 | 6 | 50 | 91 | 0.26 | |
| ch | - | - | - | - | - | - | - | - | 72 | 55 | 91 | 0.49 | |
| nu/cy | 308 | 89 | 89 | 0.75 | 332 | 89 | 59 | 0.48 | 52 | 83 | 79 | 0.62 | |
| SP | 75 | 92 | 94 | 0.80 | 9 | 89 | 99 | 0.73 | 6 | 33 | 95 | 0.24 | |
| mi | 48 | 71 | 95 | 0.63 | 77 | 53 | 90 | 0.44 | 6 | 42 | 99 | 0.52 | |
| ch | - | - | - | - | - | - | - | - | 72 | 61 | 81 | 0.43 | |
| nu | 224 | 77 | 81 | 0.58 | 152 | 93 | 39 | 0.35 | 36 | 72 | 83 | 0.52 | |
| cy | 85 | 34 | 88 | 0.23 | 180 | 11 | 98 | 0.19 | 17 | 24 | 83 | 0.28 | |
| nu/cy | 308 | 89 | 90 | 0.76 | 332 | 89 | 57 | 0.46 | 52 | 87 | 74 | 0.61 | |
The sensitivity (SE) and specificity (SP), given in percentages, and Matthews correlation coefficient (MCC) are listed for each localization (Loc). The number of clusters (No.) per localization is also shown. In Protein Prowler and TargetP, predictions for nu and cy are only available grouped as nu/cy.
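A minimal sketch of how the three per-localization measures could be computed in a one-versus-rest fashion is shown below. It assumes the common definitions SE = TP/(TP+FN), SP = TN/(TN+FP) and the standard MCC formula; the note itself does not spell out the formulas, and the function name `class_metrics` is hypothetical. The sketch also scores individual items, whereas the table counts clusters.

```python
from math import sqrt


def class_metrics(y_true, y_pred, loc):
    """Return (SE in %, SP in %, MCC) for one localization versus the rest."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == loc and p == loc:
            tp += 1
        elif t != loc and p == loc:
            fp += 1
        elif t == loc:
            fn += 1
        else:
            tn += 1
    se = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    sp = 100.0 * tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is set to 0.0 when the denominator vanishes (a common convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, mcc


# Toy labels using the localization abbreviations from the table.
y_true = ["SP", "SP", "mi", "nu"]
y_pred = ["SP", "mi", "mi", "nu"]
se, sp, mcc = class_metrics(y_true, y_pred, "SP")
print(se, sp, round(mcc, 2))
```

For the grouped nu/cy rows, the same function could be applied after mapping both nu and cy labels to a single class.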
Comparison of the overall performance results using BaCelLo independent dataset (BaCelLo IDS)
| 3 | 79 ( | |||
| 4 | 66 ( | |||
| 3 | 87 (89) | 71 (76) | 74 (71) | |
| 4 | 75 (68) | 59 (52) | 65 (62) | |
| 3 | 87 (91) | 69 (76) | ||
| 4 | 69 (64) | 61 (69) | ||
| 3 | 76 (81) | 68 (75) | 73 (76) | |
| 4 | 61 (62) | 55 (47) | 65 (70) | |
| 3 | 78 (91) | 75 (86) | 65 (63) | |
| 4 | - | - | - | |
| 3 | 86 (88) | 76 (82) | 72 (67) | |
| 4 | - | - | - | |
| 3 | 84 (88) | 77 (82) | 56 (69) | |
| 4 | 69 (71) | 62 (51) | 46 (57) |
The average sensitivity and the overall accuracy (in parentheses) are shown for the prediction of three and four classes for animals and fungi, and of four and five classes for plants. Both measures are given in percentages. The top-scoring average sensitivity and overall accuracy are highlighted in bold. Results for Protein Prowler and TargetP predictions are only available for a reduced number of classes since nu and cy are grouped as nu/cy.
Performance comparison of MultiLoc2-HighRes with WoLF PSORT using Höglund independent dataset (Höglund IDS)
| ex | 78 | 78 | 91 | 7 | 77 | 86 | 1 | 0 | 100 | |
| pm | 34 | 55 | 78 | 29 | 10 | 31 | 6 | 33 | 50 | |
| pe | 3 | 33 | 100 | 5 | 20 | 100 | 2 | 50 | 100 | |
| er | 25 | 28 | 70 | 46 | 46 | 83 | 6 | 50 | 83 | |
| go | 14 | 7 | 57 | 8 | 25 | 63 | 6 | 33 | 50 | |
| ly | 4 | 25 | 75 | - | - | - | - | - | - | |
| va | - | - | - | 11 | 0 | 0 | 9 | 11 | 33 | |
| 57 | ||||||||||
| ex | 78 | 93 | 97 | 7 | 36 | 79 | 1 | 0 | 0 | |
| pm | 34 | 41 | 59 | 29 | 59 | 79 | 6 | 83 | 83 | |
| pe | 3 | 0 | 0 | 5 | 0 | 0 | 2 | 0 | 0 | |
| er | 25 | 8 | 40 | 46 | 9 | 54 | 6 | 0 | 50 | |
| go | 14 | 0 | 7 | 8 | 0 | 0 | 6 | 17 | 17 | |
| ly | 4 | 0 | 25 | - | - | - | - | - | - | |
| va | - | - | - | 11 | 0 | 0 | 9 | 0 | 33 | |
| 24 | 38 | 17 | 35 | 17 | 31 | |||||
| 68 | 22 | 51 | 20 | 40 | ||||||
The sensitivity (SE) and top three sensitivity (SE3) for each localization are shown. SE3 measures the fraction of correctly predicted proteins within the top three ranked localizations. The corresponding average sensitivity and overall accuracy are also listed, with the top-scoring values highlighted in bold. Values based on very few proteins (fewer than six) are drawn in gray. All measures are given as percentages.
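The SE3 measure can be sketched as a top-k sensitivity, assuming each predictor returns its localizations ranked from most to least likely. The function name `top_k_sensitivity` and the toy rankings are illustrative, not taken from the paper.

```python
def top_k_sensitivity(y_true, ranked_preds, loc, k=3):
    """Percentage of proteins of `loc` whose true label is in the top k ranks."""
    hits = 0
    total = 0
    for true_loc, ranking in zip(y_true, ranked_preds):
        if true_loc != loc:
            continue
        total += 1
        if loc in ranking[:k]:
            hits += 1
    return 100.0 * hits / total if total else 0.0


# Toy rankings, most likely localization first.
y_true = ["er", "er", "go"]
ranked_preds = [
    ["pm", "er", "go", "ex"],  # true "er" is ranked second
    ["ex", "pm", "go", "er"],  # true "er" is ranked fourth
    ["go", "er", "pm", "ex"],  # true "go" is ranked first
]
se_er = top_k_sensitivity(y_true, ranked_preds, "er", k=1)
se3_er = top_k_sensitivity(y_true, ranked_preds, "er", k=3)
print(se_er, se3_er)
```

With k=1 the function reduces to the ordinary sensitivity, so SE3 can never be lower than SE for the same localization.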