Ramanuja Simha, Hagit Shatkay.
Abstract
MOTIVATION: Knowing the location of a protein within the cell is important for understanding its function, its role in biological processes, and its potential use as a drug target. Much progress has been made in developing computational methods that predict single locations for proteins. Most such methods rest on the over-simplifying assumption that proteins localize to a single location. However, it has been shown that proteins localize to multiple locations. While a few recent systems attempt to predict multiple locations of proteins, their performance leaves much room for improvement. Moreover, they typically treat locations as independent and do not attempt to utilize possible inter-dependencies among locations. Our hypothesis is that directly incorporating inter-dependencies among locations into both the classifier-learning and the prediction process can improve location prediction performance.
Year: 2014 PMID: 24646119 PMCID: PMC3994749 DOI: 10.1186/1748-7188-9-8
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. An example of a collection of Bayesian network classifiers we learn. The collection consists of several classifiers C1,…,Cq, one for each of the q subcellular locations. Directed edges represent dependencies between the connected nodes. There are edges among location variables (L1,…,Lq), as well as between feature variables (F1, F2, …) and location variables (L1,…,Lq), but not among the feature variables. The absence of edges among feature variables indicates independencies among features, as well as conditional independencies among features given the locations.
Figure 2. Adding, deleting, and reversing an edge in a Bayesian network during structure learning. The network on the left (i) is the starting point. Networks (ii), (iii), and (iv) show the addition, deletion, and reversal of an edge, respectively, as performed by the greedy hill climbing algorithm for structure learning.
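The three edge operations in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the node labels, and the `score` callable are hypothetical, and the toy score in the usage note only counts mismatched edges rather than measuring how well a network fits data.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return visited == len(nodes)

def neighbors(nodes, edges):
    """List every graph reachable by adding, deleting, or reversing one edge."""
    edges = frozenset(edges)
    result = []
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            # Deleting an edge can never create a cycle.
            result.append(("delete", edges - {(u, v)}))
            flipped = (edges - {(u, v)}) | {(v, u)}
            if is_acyclic(nodes, flipped):
                result.append(("reverse", flipped))
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if is_acyclic(nodes, added):
                result.append(("add", added))
    return result

def hill_climb(nodes, edges, score, max_steps=100):
    """Greedily move to the best-scoring neighbor until no neighbor improves."""
    current = frozenset(edges)
    current_score = score(current)
    for _ in range(max_steps):
        candidates = [(score(e), e) for _, e in neighbors(nodes, current)]
        if not candidates:
            break
        best_score, best = max(candidates, key=lambda pair: pair[0])
        if best_score <= current_score:
            break  # local optimum reached
        current, current_score = best, best_score
    return current, current_score
```

For example, with nodes `["L1", "L2", "L3"]`, an empty starting graph, and a toy score `lambda e: -len(e ^ target)` for some target edge set, the search adds one target edge per step and stops once no single operation improves the score. In practice the score would be computed from training data, and greedy search can stop at a local optimum.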
Figure 3. Multiple location prediction for a protein. First, the SVMs SVM1,…,SVMq are used to obtain the location-indicator estimates. The Bayesian network classifiers C1,…,Cq are then used to predict the actual location indicators. The Bayesian network classifiers use the location-indicator estimates as well as the inter-dependencies among the locations.
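The data flow in Figure 3 can be sketched as a two-stage function. This is only an illustration of the flow, not the authors' implementation: the first-stage scorers and second-stage classifiers are hypothetical stand-in callables (simple threshold rules rather than trained SVMs or Bayesian networks), the feature names `signal` and `nls` are invented, and the sketch assumes that each second-stage classifier receives the estimates for the other locations.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Dict[str, float]
FirstStage = Callable[[Features], int]                     # stand-in for one SVM
SecondStage = Callable[[Features, Tuple[int, ...]], int]   # stand-in for one classifier C_i

def predict_locations(
    features: Features,
    first_stage: Sequence[FirstStage],
    second_stage: Sequence[SecondStage],
) -> Tuple[List[int], List[int]]:
    """Return (location-indicator estimates, final location indicators)."""
    if len(first_stage) != len(second_stage):
        raise ValueError("need one first-stage and one second-stage model per location")
    # Stage 1: independent estimate for each location.
    estimates = [scorer(features) for scorer in first_stage]
    # Stage 2: classifier i sees the features plus the estimates of the other locations.
    predictions = []
    for i, classifier in enumerate(second_stage):
        others = tuple(e for j, e in enumerate(estimates) if j != i)
        predictions.append(classifier(features, others))
    return estimates, predictions

# Toy stand-ins for two locations (hypothetical rules, for illustration only).
toy_first = [
    lambda f: int(f["signal"] > 0.5),
    lambda f: int(f["nls"] > 0.5),
]
toy_second = [
    lambda f, others: int(f["signal"] > 0.5),
    # Weak evidence for location 2 is accepted when location 1 was estimated present.
    lambda f, others: int(f["nls"] > 0.5 or (others[0] == 1 and f["nls"] > 0.1)),
]

estimates, predictions = predict_locations(
    {"signal": 0.9, "nls": 0.2}, toy_first, toy_second
)
print(estimates, predictions)  # [1, 0] [1, 1]
```

In this toy run the independent first stage misses the second location, while the second stage recovers it by using the estimate for the first location.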
Multi-location prediction results on the PROSITE-GO version of the dataset, averaged over 25 runs of 5-fold cross-validation, for multi-localized proteins only, using our system, YLoc [26], Euk-mPLoc [24], WoLF PSORT [23], and KnowPred [36]
| Measure | Our system | YLoc | Euk-mPLoc | WoLF PSORT | KnowPred |
|---|---|---|---|---|---|
| F1-label | 0.66 (± 0.02) | 0.68 | 0.44 | 0.53 | 0.66 |
| Acc | 0.63 (± 0.01) | 0.64 | 0.41 | 0.43 | 0.63 |
The F1-label score and Acc measures shown for all the systems except for ours are taken directly from Table Three in the paper by Briesemeister et al. [26]. Standard deviations are provided for our system (not available for others).
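To make the two measures concrete, here is a minimal sketch of how an F1-label score and a multi-label accuracy can be computed. It assumes commonly used multi-label definitions, which are not spelled out in this record: F1-label as the per-location F1 score averaged over locations, and Acc as the size of the intersection of true and predicted location sets divided by the size of their union, averaged over proteins. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Set

def multilabel_accuracy(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Average over proteins of |true AND predicted| / |true OR predicted|."""
    total = 0.0
    for true, pred in zip(true_sets, pred_sets):
        union = true | pred
        # Two empty sets agree perfectly by convention.
        total += len(true & pred) / len(union) if union else 1.0
    return total / len(true_sets)

def f1_label(
    true_sets: List[Set[str]], pred_sets: List[Set[str]], labels: Sequence[str]
) -> float:
    """Per-location F1 score, averaged over the given locations."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if label in t and label in p)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if label not in t and label in p)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: three proteins with true and predicted location sets.
true_sets = [{"cyt", "nuc"}, {"mi"}, {"cyt"}]
pred_sets = [{"cyt"}, {"mi"}, {"cyt", "nuc"}]
print(multilabel_accuracy(true_sets, pred_sets))
print(f1_label(true_sets, pred_sets, ["cyt", "nuc", "mi"]))
```

Under these definitions, a prediction that recovers only part of a protein's location set still earns partial credit in Acc, whereas F1-label weights every location equally regardless of how many proteins it contains.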
Multi-location prediction results on the PROSITE-GO version of the dataset, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins, using our system
| Method | F1 | F1-label | Acc |
|---|---|---|---|
| SVMs (without location inter-dependencies) | 0.77 (± 0.01) | 0.67 (± 0.02) | 0.72 (± 0.01) |
| Our system (with location inter-dependencies) | 0.81 (± 0.01) | 0.76 (± 0.02) | 0.76 (± 0.01) |
The table shows the F1 score, the F1-label score, and the overall accuracy (Acc) obtained from SVMs without using location inter-dependencies and from our system, which uses location inter-dependencies. Standard deviations are shown in parentheses.
Multi-location prediction results on the No-PROSITE-GO, No-PROSITE, and No-GO versions of the dataset, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins, using our system
| Dataset | Method | F1 | F1-label | Acc |
|---|---|---|---|---|
| No-PROSITE-GO | SVMs | 0.75 (± 0.04) | 0.66 (± 0.02) | 0.70 (± 0.04) |
| No-PROSITE-GO | Our system | 0.78 (± 0.05) | 0.72 (± 0.07) | 0.73 (± 0.05) |
| No-PROSITE | SVMs | 0.77 (± 0.01) | 0.66 (± 0.02) | 0.72 (± 0.01) |
| No-PROSITE | Our system | 0.80 (± 0.01) | 0.75 (± 0.02) | 0.75 (± 0.01) |
| No-GO | SVMs | 0.76 (± 0.03) | 0.67 (± 0.03) | 0.71 (± 0.03) |
| No-GO | Our system | 0.79 (± 0.04) | 0.72 (± 0.08) | 0.74 (± 0.04) |
The table shows the F1 score, the F1-label score, and the overall accuracy (Acc) obtained from SVMs without using location inter-dependencies and from our system, which uses location inter-dependencies. Standard deviations are shown in parentheses.
Multi-location prediction results on the PROSITE-GO version of the dataset, per location, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins
| 0.87 (± 0.02) | |||||
| 0.90 (± 0.01) | 0.87 (± 0.03) | ||||
| 0.85 (± 0.01) | 0.64 (± 0.02) | 0.72 (± 0.02) | 0.79 (± 0.02) | 0.62 (± 0.03) | |
| 0.89 (± 0.02) | 0.87 (± 0.03) | ||||
| 0.81 (± 0.02) | 0.90 (± 0.01) | ||||
| 0.78 (± 0.01) | 0.72 (± 0.02) | 0.77 (± 0.01) | 0.76 (± 0.01) | 0.68 (± 0.02) | |
Results are shown for the five locations that have the largest number of associated proteins (the number of proteins per location is given in parentheses): cytoplasm (cyt), extracellular space (ex), nucleus (nuc), membrane (mem), and mitochondrion (mi). The table shows the per-location measures: standard precision (Pre), recall (Rec), Multilabel-Precision, and Multilabel-Recall, obtained from SVMs without using location inter-dependencies and from our system using location inter-dependencies. For each location and measure, the higher of the values obtained from the two methods is shown in boldface. Standard deviations are shown in parentheses.
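The difference between the standard and the multilabel per-location measures can be illustrated with a short sketch. The definitions below are assumptions, since this record does not state them: standard precision and recall count only whether the given location itself was predicted correctly, while Multilabel-Precision averages, over the proteins predicted to be at the location, the fraction of each protein's predicted locations that are correct, and Multilabel-Recall averages, over the proteins truly at the location, the fraction of each protein's true locations that were predicted. Function names and toy data are hypothetical.

```python
from typing import List, Set, Tuple

def standard_precision_recall(
    true_sets: List[Set[str]], pred_sets: List[Set[str]], location: str
) -> Tuple[float, float]:
    """Precision and recall for one location, treated as a binary label."""
    tp = sum(1 for t, p in zip(true_sets, pred_sets) if location in t and location in p)
    n_pred = sum(1 for p in pred_sets if location in p)
    n_true = sum(1 for t in true_sets if location in t)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall

def multilabel_precision_recall(
    true_sets: List[Set[str]], pred_sets: List[Set[str]], location: str
) -> Tuple[float, float]:
    """Per-location measures that credit the whole predicted location set."""
    # Proteins predicted at this location: share of their predictions that are correct.
    pre_terms = [len(t & p) / len(p) for t, p in zip(true_sets, pred_sets) if location in p]
    # Proteins truly at this location: share of their true locations that were found.
    rec_terms = [len(t & p) / len(t) for t, p in zip(true_sets, pred_sets) if location in t]
    ml_pre = sum(pre_terms) / len(pre_terms) if pre_terms else 0.0
    ml_rec = sum(rec_terms) / len(rec_terms) if rec_terms else 0.0
    return ml_pre, ml_rec

# Toy data: three proteins with true and predicted location sets.
true_sets = [{"cyt", "nuc"}, {"mi"}, {"cyt"}]
pred_sets = [{"cyt"}, {"mi"}, {"cyt", "nuc"}]
print(standard_precision_recall(true_sets, pred_sets, "cyt"))    # (1.0, 1.0)
print(multilabel_precision_recall(true_sets, pred_sets, "cyt"))  # (0.75, 0.75)
```

In the toy data every cytoplasm call is correct, so the standard measures are perfect, but the multilabel measures drop because the other locations of the same proteins are partly wrong.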