Ramanuja Simha, Sebastian Briesemeister, Oliver Kohlbacher, Hagit Shatkay.
Abstract
MOTIVATION: Proteins are responsible for a multitude of vital tasks in all living organisms. Given that a protein's function and role are strongly related to its subcellular location, protein location prediction is an important research area. While proteins move from one location to another and can localize to multiple locations, most existing location prediction systems assign only a single location per protein. A few recent systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. Moreover, such systems do not capture dependencies among locations and usually treat locations as independent. We hypothesize that a multi-location predictor that captures location inter-dependencies can improve location predictions for proteins.
Year: 2015 PMID: 26072505 PMCID: PMC4765880 DOI: 10.1093/bioinformatics/btv264
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example location-Bayesian-network that we learn. Directed edges represent dependencies between the connected nodes. The location associated with each variable is shown below the corresponding node
Fig. 2. The generative process for a protein P. First, location coins are tossed (top left); based on the outcomes, location indicator values are chosen (bottom left). Collectively, these values make up the location indicator vector. For each feature F, a die is then tossed to select a location dependency set (top right); based on the selected set LS, the feature die is tossed to pick the feature-value (bottom right)
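The generative process in the Fig. 2 caption can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the two locations, the single feature `F1`, the binary feature values, the parent structure, and every probability below are hypothetical and do not come from the paper.

```python
import random

# Hypothetical toy parameters; none of these names or numbers come from the paper.
LOCATIONS = ["nucleus", "cytoplasm"]  # listed in an order that respects the network
PARENTS = {"nucleus": None, "cytoplasm": "nucleus"}
# Location coins: P(indicator = 1 | value of the parent indicator).
COIN = {
    "nucleus": {None: 0.4},
    "cytoplasm": {0: 0.3, 1: 0.7},
}
# Per feature: candidate location dependency sets, die weights over those sets,
# and P(feature = 1 | indicator values of the locations in the selected set).
FEATURES = {
    "F1": {
        "sets": [("nucleus",), ("nucleus", "cytoplasm")],
        "set_weights": [0.6, 0.4],
        "value_prob": [
            {(0,): 0.2, (1,): 0.8},
            {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9},
        ],
    },
}


def generate_protein(rng):
    """Sample one location indicator vector and one value per feature."""
    # Step 1: toss the location coins; each coin may depend on its parent's outcome.
    indicators = {}
    for loc in LOCATIONS:
        parent = PARENTS[loc]
        key = None if parent is None else indicators[parent]
        indicators[loc] = 1 if rng.random() < COIN[loc][key] else 0

    # Step 2: for each feature, toss a die to select a location dependency set,
    # then toss the feature die conditioned on the indicators in that set.
    features = {}
    for name, spec in FEATURES.items():
        k = rng.choices(range(len(spec["sets"])), weights=spec["set_weights"])[0]
        context = tuple(indicators[loc] for loc in spec["sets"][k])
        features[name] = 1 if rng.random() < spec["value_prob"][k][context] else 0
    return indicators, features
```

Calling `generate_protein(random.Random(0))` returns a pair of dictionaries: the sampled location indicators and the sampled feature values. Features are binary here only to keep the sketch short; a multi-valued feature die would use `rng.choices` over the feature's values instead.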
Fig. 3. The probabilistic graphical model for the generation of protein features. Directed edges represent dependencies between nodes. Locations and features are shown as circles and location sets as squares. Shaded nodes represent observed variables and unshaded nodes represent latent variables. A selection variable takes on a value k, indicating the selection of the set LS, with an associated probability. The rectangular plate notation is used to represent replication of features and location sets with the same dependencies
Fig. 4. A summary of our model-learning process. The rectangular boxes represent steps in the learning process, the diamond indicates checking for a stopping criterion, and the oval represents the output, which in our case is the learned model. Directed edges indicate the order among steps
Table 1. Multi-location prediction results, averaged over 25 runs of 5-fold cross-validation, for multi-localized proteins only
| (A) | MDLoc | BNCs | YLoc+ | Euk-mPLoc | WoLF PSORT | KnowPredsite |
|---|---|---|---|---|---|---|
| F1-label | | 0.66 (± 0.02) | 0.68 | 0.44 | 0.53 | 0.66 |
| Acc | | 0.63 (± 0.01) | 0.64 | 0.41 | 0.43 | 0.63 |
Standard deviations are shown in parentheses (if available). The highest values are shown in boldface. (A) Overall F1-label scores and overall accuracy (Acc) obtained using our current system MDLoc, our preliminary system (denoted BNCs; Simha and Shatkay, 2014), YLoc+ (Briesemeister et al.), Euk-mPLoc (Chou and Shen, 2007), WoLF PSORT (Horton et al.) and KnowPredsite (Lin et al.). The four rightmost columns are taken directly from Table 3 in the article by Briesemeister et al. (B) Per-location scores: Multilabel-Precision and Recall, as well as standard precision (Pre-) and recall (Rec-), for each location s, for MDLoc and YLoc+. Results for YLoc+ were reproduced using our five-way splits. The p-values indicate the statistical significance of the differences between the values obtained from MDLoc and from YLoc+.
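To make the overall measures concrete, the sketch below computes a multi-label accuracy and a label-averaged F1 under one common convention: accuracy as the mean overlap (intersection over union) between the true and predicted location sets, and F1-label as the per-location F1 averaged over locations. These definitions are an assumption; this excerpt does not give the paper's exact formulas, which may differ. The function names and the toy location labels are hypothetical.

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Mean intersection-over-union between true and predicted location sets."""
    scores = []
    for true, pred in zip(true_sets, pred_sets):
        union = true | pred
        # Two empty sets agree perfectly by convention.
        scores.append(len(true & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def f1_label(true_sets, pred_sets, labels):
    """Per-location F1, averaged over the given locations."""
    f1_scores = []
    for s in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if s in t and s in p)
        n_pred = sum(1 for p in pred_sets if s in p)
        n_true = sum(1 for t in true_sets if s in t)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: two proteins, two locations.
true_sets = [{"nuc", "cyt"}, {"cyt"}]
pred_sets = [{"nuc"}, {"cyt"}]
acc = multilabel_accuracy(true_sets, pred_sets)      # (1/2 + 1) / 2 = 0.75
f1 = f1_label(true_sets, pred_sets, ["nuc", "cyt"])  # (1 + 2/3) / 2
```

In the toy example the first protein is only half right (one of its two locations is predicted), which lowers the accuracy to 0.75 and the recall for `cyt` to 0.5.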
Multi-location prediction results, per location-combination, obtained using one run of 5-fold cross-validation, for multi-localized proteins only
| MDLoc | 51 (45.1%) | |||||||
| BNCs | 976 (51.9%) | 16 (4.8%) | 15 (6%) | 25 (10.4%) | 11 (9.2%) | 16 (13.9%) | ||
| MDLoc | 164 (68.3%) | |||||||
| BNCs | 1578 (83.8%) | 60 (18%) | 174 (69%) | 37 (30.8%) | 68 (60.2%) | |||
| MDLoc | ||||||||
| BNCs | 1240 (65.9%) | 246 (73.7%) | 68 (27%) | 85 (35.4%) | 64 (53.3%) | 27 (23.5%) | 68 (60.2%) | |
For each combination, the table shows the number of proteins with correct predictions for both locations, for the first of the two locations, and for the second of the two locations, using MDLoc and using our preliminary system (BNCs, Simha and Shatkay, 2014). The highest values are shown in boldface.
Multi-location prediction results, per location, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins
| MDLoc | |||||||||||
| BNCs | 0.795 (±0.011) | 0.784 (±0.017) | 0.737 (±0.022) | 0.780 (±0.014) | 0.730 (±0.025) | ||||||
| MDLoc | 0.03 | 0.822 (±0.014) | 0.02 | 0.864 (±0.020) | 0.872 (±0.014) | 0.861 (±0.024) | 0.001 | ||||
| BNCs | 0.809 (±0.018) | ||||||||||
| MDLoc | 0.1 | ||||||||||
| BNCs | 0.861 (±0.014) | 0.736 (±0.031) | 0.652 (±0.024) | 0.805 (±0.017) | 0.664 (±0.034) | ||||||
| MDLoc | 0.001 | 0.783 (±0.020) | 0.6 | 0.839 (±0.028) | 0.882 (±0.014) | 0.843 (±0.026) | 0.001 | ||||
| BNCs | 0.840 (±0.011) | ||||||||||
The table shows the same measures used in Table 1B obtained over the combined dataset using our current system MDLoc, and using our preliminary system (denoted BNCs) (Simha and Shatkay, 2014). The highest values are shown in boldface. The p-values indicate the statistical significance of the differences between the values obtained from MDLoc and those obtained from BNCs. Standard deviations are shown in parentheses.