
Protein (multi-)location prediction: utilizing interdependencies via a generative model.

Ramanuja Simha¹, Sebastian Briesemeister¹, Oliver Kohlbacher¹, Hagit Shatkay².

Abstract

MOTIVATION: Proteins are responsible for a multitude of vital tasks in all living organisms. Given that a protein's function and role are strongly related to its subcellular location, protein location prediction is an important research area. While proteins move from one location to another and can localize to multiple locations, most existing location prediction systems assign only a single location per protein. A few recent systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. Moreover, such systems do not capture dependencies among locations and usually consider locations as independent. We hypothesize that a multi-location predictor that captures location inter-dependencies can improve location predictions for proteins.
RESULTS: We introduce a probabilistic generative model for protein localization, and develop a system based on it, which we call MDLoc, that utilizes inter-dependencies among locations to predict multiple locations for proteins. The model captures location inter-dependencies using Bayesian networks and represents dependency between features and locations using a mixture model. We use iterative processes for learning model parameters and for estimating protein locations. We evaluate our classifier, MDLoc, on a dataset of single- and multi-localized proteins derived from the DBMLoc dataset, which is the most comprehensive protein multi-localization dataset currently available. Our results, obtained by using MDLoc, significantly improve upon results obtained by an initial simpler classifier, as well as on results reported by other top systems.
AVAILABILITY AND IMPLEMENTATION: MDLoc is available at: http://www.eecis.udel.edu/~compbio/mdloc.

Substances: Proteins

Year: 2015 PMID: 26072505 PMCID: PMC4765880 DOI: 10.1093/bioinformatics/btv264

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Proteins are responsible for a multitude of diverse vital tasks in all living organisms (Rost ). Given that a protein’s function and role are strongly related to its subcellular location, protein location prediction is an important research area (Alberts ; Nair and Rost, 2008). Furthermore, the location of a protein helps in assessing the protein’s prospective utility as a drug target (Bakheet and Doig, 2009). Methods for determining protein locations include experimental as well as high-throughput computational ones. The experimental methods, which include mass spectrometry (Dreger, 2003) and green fluorescence detection (Simpson ), determine protein locations accurately, but are typically time consuming and not cost effective for finding the locations of a large number of proteins. The computational methods, on the other hand, are fast, and can potentially predict locations for proteins whose actual locations have not yet been experimentally determined. Most of the prediction systems represent proteins using sequence-derived features and utilize machine learning methods (e.g. Blum ; Emanuelsson ; Nakai and Kanehisa, 1991; Shatkay ).
Proteins move from one location to another and localize to multiple subcellular compartments (Murphy, 2010; Pohlschroder ). For instance, the enzyme TREX1, which assists in DNA repair, is primarily present in the cytoplasm but is also transported to the nucleus in response to DNA damage (Tomicic ). Thus, predicting multiple locations for proteins is important, as protein movement across locations enables the protein to serve multiple distinct functions. Nevertheless, all prediction systems mentioned earlier and most current systems assign only a single location per protein.
Since proteins localize systematically, and translocation occurs only among specific locations for the purpose of a particular subcellular function, our hypothesis is that modeling inter-dependencies among locations can assist in predicting locations of proteins more accurately. In machine-learning terms, assigning multiple locations to proteins is a multi-label classification task. Traditional single-label classification assigns a single label (location) to each instance (protein), and is addressed by methods such as Support Vector Machines (Scholkopf and Smola, 2002), naïve Bayes or neural networks (Russell and Norvig, 2010). Multi-label classification, on the other hand, aims to associate each instance with possibly multiple classes. Some of the simplest and most commonly used approaches transform the multi-label classification task into one or more single-label classification task(s) (Tsoumakas ); such approaches do not capture label inter-dependencies. More sophisticated multi-label classification approaches attempt to capture label inter-dependencies and incorporate them into the classification process. Such multi-label classification methods have not yet been employed in the context of protein location prediction. In this article, we present a new, dependency-based probabilistic generative model for eukaryotic protein localization, and develop a multi-location prediction system, which we call MDLoc, to predict locations of multiply localized proteins. As was done before (Briesemeister ; King and Guda, 2007; Li ), we use sequence-derived features and Gene Ontology (GO) terms to represent proteins. Here we introduce a new model using Bayesian networks to directly address and capture inter-dependencies among locations. Furthermore, we present the concept of location dependency sets and use a mixture model to represent feature dependency on location-combinations.
The new system uses a generative model and an iterative procedure for estimating its parameters, and effectively improves the estimation process of multi-locations. Our method is based on iteratively learning parameters of the location-Bayesian-network and the mixture model, while re-inferring the location estimates of the proteins in each iteration. This improves on our preliminary system, which comprised a collection of Bayesian network classifiers, where location inter-dependencies were not learnt as part of the model but rather captured based on simple estimates of location values (Simha and Shatkay, 2014). We evaluate MDLoc on a dataset derived from the DBMLoc dataset (Zhang ), which is the most comprehensive protein multi-localization dataset currently available, using multiple runs of 5-fold cross-validation. We show that the performance of MDLoc on multi-localized proteins improves over earlier results for a top performing system, YLoc+ (Briesemeister ). The improved results obtained by MDLoc demonstrate the advantage of utilizing location inter-dependencies and feature dependencies on locations in the prediction process. The rest of the article proceeds as follows: Section 2 surveys methods for protein multi-location prediction. In Section 3, we introduce the concept of location dependency sets and provide relevant notations; we also present our new probabilistic generative model for protein localization, which captures dependencies between protein-features and locations. In Section 4, we discuss the model parameters, the learning procedure used for finding them, and the inference technique used for predicting multiple protein locations. Experiments and results are presented in Section 5, followed by conclusions and future directions.

2 Related work

A number of recent location prediction systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. While most use sequence-derived features (e.g. amino acid composition) and GO terms to represent proteins and to predict protein locations, a few are based solely on sequence-based similarity. The former class of methods incorporates one or more of the following classifiers: k-nearest neighbors (k-NN, Chou ), Support Vector Machines (Li ), naïve Bayes (Briesemeister ) and neural networks (Emanuelsson ). KnowPredsite (Lin ) is an example of the latter, similarity-based, class of methods. Systems that use k-NN adaptations to predict multiple locations for proteins include WoLF PSORT (Horton ), Euk-mPLoc (Chou and Shen, 2007), iLoc-Euk (Chou ) and an ensemble system (Li ). WoLF PSORT outputs for a query protein the location-combination that is most frequent among the protein’s k nearest neighbors in the training set; the predictions are thus restricted to location-combinations already present in the set. Both iLoc-Euk and Euk-mPLoc compute a score for each candidate location, based on the query protein. iLoc-Euk outputs the locations having the highest scores, where the number of locations is the same as that associated with the query protein’s nearest neighbor in the dataset; Euk-mPLoc assigns the protein to locations whose score lies within a certain deviation from the highest score. All the methods described thus far treat locations as independent from one another and do not utilize possible inter-dependencies among locations in the prediction process. A few systems, however, have tried to make use of location inter-dependencies to predict multiple locations for proteins. For example, the classifier by He attempts to use pairwise location-correlation in the prediction process, but does not use more complex inter-dependencies.
YLoc+ (Briesemeister ) introduces a new class for each location-combination represented in the training dataset and uses a naïve Bayes classifier to predict a probability distribution over these new classes. Thus, each classifier prediction is restricted to location-combinations in the training set. YLoc+’s performance was evaluated using the most comprehensive protein multi-localization dataset and is the highest among current multi-location prediction systems. In our earlier preliminary work (Simha and Shatkay, 2014), we used a collection of Bayesian network classifiers to predict multiple locations of proteins. The simplified model used there did not incorporate location inter-dependencies into the iterative learning process, but rather utilized one-time estimates of location values to establish inter-dependencies. The performance of that classifier was comparable to that of YLoc+ when using the same dataset, but did not improve on it. In the next section, we present a new probabilistic generative model for protein localization that directly incorporates the learning of location inter-dependencies into the iterative learning process. Additionally, we introduce the concept of location dependency sets, which enables us to capture feature dependencies on location-combinations in a mixture model setting. The resulting system, MDLoc, shows significant improvement, according to all evaluation metrics, compared with previously reported performance for protein multi-location prediction.

3 A probabilistic generative model for protein localization

As we and others have done before (Briesemeister ; Garg and Raghava, 2008; Simha and Shatkay, 2014), we represent each protein P as a weighted feature vector, $\vec{f}^P = \langle f_1^P, \ldots, f_d^P \rangle$, where d is the number of features. Let $S = \{s_1, \ldots, s_q\}$ be the set of q subcellular components in the cell. Each protein P localizes to at least one, and possibly more than one, location. The locations of each protein P are represented by a location indicator vector, $\vec{l}^P = \langle l_1^P, \ldots, l_q^P \rangle$, of 0/1 values, where $l_i^P = 1$ if P localizes to $s_i$, and $l_i^P = 0$ otherwise. We view each location indicator $l_i^P$ as a value taken by a random variable $L_i$ and each feature $f_j^P$ as a value taken by a random variable $F_j$. Given a protein P, represented as a vector $\vec{f}^P$, the multi-localization task amounts to assigning a (correct) 0/1 value to each of the entries $l_i^P$ of $\vec{l}^P$.

3.1 Modeling location inter-dependency

We use Bayesian networks to model inter-dependencies among subcellular locations. A Bayesian network consists of a directed acyclic graph whose set of nodes L corresponds to random variables and whose set of edges E indicates dependencies among the variables. In our case, nodes represent location variables, denoted $L_1, \ldots, L_q$. Each variable $L_i$ corresponds to a location $s_i$ within the cell and takes on a 0/1 value. Figure 1 shows an example Bayesian network we learn over location variables. A directed edge, for instance, from membrane to cytoplasm represents the assertion that knowing that a protein localizes to the membrane influences the level of belief about the protein localizing to the cytoplasm. According to the conditional independence relationship encoded in the Bayesian network, each variable $L_i$ is conditionally independent of its non-descendants given its parents, $Pa(L_i)$ (for additional details see Russell and Norvig, 2010). The joint distribution of the location variables can thus be calculated as:
$$\Pr(L_1, \ldots, L_q) = \prod_{i=1}^{q} \Pr(L_i \mid Pa(L_i)). \qquad (1)$$

Fig. 1.

An example location-Bayesian-network that we learn. Directed edges represent dependencies between the connected nodes. The location associated with each variable is shown below the corresponding node
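
The factorization of the joint distribution over location variables (Section 3.1) can be illustrated with a minimal Python sketch. The three-location network, the conditional probability values and the function name `joint_probability` are hypothetical and chosen only for illustration; they are not taken from the learned MDLoc network.

```python
from typing import Dict, Tuple

# Hypothetical network: mem -> cyt -> nuc (not the network learned in the paper).
parents: Dict[str, Tuple[str, ...]] = {"mem": (), "cyt": ("mem",), "nuc": ("cyt",)}

# cpt[loc][parent_values] = Pr(L_loc = 1 | parent values); numbers are made up.
cpt = {
    "mem": {(): 0.3},
    "cyt": {(0,): 0.5, (1,): 0.2},
    "nuc": {(0,): 0.1, (1,): 0.6},
}


def joint_probability(assignment: Dict[str, int]) -> float:
    """Multiply Pr(L_i = l_i | Pa(L_i)) over all location variables."""
    prob = 1.0
    for loc, pa in parents.items():
        pa_vals = tuple(assignment[p] for p in pa)
        p_one = cpt[loc][pa_vals]
        prob *= p_one if assignment[loc] == 1 else 1.0 - p_one
    return prob


# Membrane-only protein: 0.3 * (1 - 0.2) * (1 - 0.1)
p_mem_only = joint_probability({"mem": 1, "cyt": 0, "nuc": 0})
```

Because each factor is a proper conditional distribution, the products over all 0/1 assignments sum to one, which is a convenient sanity check for any hand-built table.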

3.2 Capturing location-feature dependency

The value of each feature represents a certain characteristic of a protein, such as the relative abundance of each amino acid in the protein’s amino-acid composition (King and Guda, 2007). In our experiments, we use the exact same features used by Briesemeister , as explained in Section 5.1. For the purpose of predicting locations for a protein, we view a protein as though it was generated through a stochastic process, in which each of its feature values was determined. The value of each feature variable $F_j$ ($1 \le j \le d$) is assigned based on the values taken by one or more location random variables; that is, each feature value may depend on multiple locations and not just on one. For instance, consider a feature capturing the abundance of tryptophan (Trp) residues in the amino acid composition of a protein; we denote the random variable associated with this feature by $F_{Trp}$. The value of this feature varies greatly between proteins known to localize to the membrane and those known to localize to both the membrane and the cytoplasm. Specifically, the probability of a membrane protein to have more than three Trp residues (formally denoted as the conditional probability $\Pr(F_{Trp} > 3 \mid L_{mem} = 1)$) is 0.36 [this high probability agrees with the well-established importance of Trp’s role in membrane proteins (Schiffer )]. In contrast, the probability of proteins known to be multi-localized to both the membrane and the cytoplasm to have more than three Trp residues ($\Pr(F_{Trp} > 3 \mid L_{mem} = 1, L_{cyt} = 1)$) is only 0.15 (the probability values are calculated based on the dataset described in Section 5.1). Thus, the feature value depends on more than a single location value. To accurately capture the dependency between protein features and location-combinations, we view each feature value as depending on a set of location indicator values. Recall that we view a protein as represented by (i.e. comprised of) a set of features.
As such, we view each possible location of a protein P as depending on a set of locations to which proteins with similar feature values (including P itself) are likely to be localized. We thus introduce the concept of location dependency sets. For a location $s_i$, its dependency set comprises the minimal set of locations $s_{i_1}, \ldots, s_{i_m}$ such that the likelihood of a protein to localize to $s_i$ depends on (i.e. is correlated or anti-correlated with) its likelihood to localize to each of $s_{i_1}, \ldots, s_{i_m}$. Using the Bayesian network framework, we note that a dependency as described above between a location $s_{i_k}$ and $s_i$ can be represented as a directed edge from the graph node $L_{i_k}$ to $L_i$. Given a Bayesian network that represents the dependencies among locations in this way, we can thus express the location dependency set for each location variable $L_i$ in terms of the parents of $L_i$ in the Bayesian network. As such, we define q location dependency sets, one set per location:
$$LS_i = \{L_i\} \cup Pa(L_i), \qquad (2)$$
where $Pa(L_i)$ ($1 \le i \le q$) denotes the parents of location variable $L_i$ in a Bayesian network. Given a Bayesian network G, the steps involved in protein generation are discussed in the rest of this section. We use a coin-toss model to set location indicator values, and two die-roll processes to set feature values. For each feature, one die roll is used to select a location dependency set, and another to assign the actual feature value. We next describe each of the steps in detail.
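
Read this way, a location dependency set is simply a location variable together with its parents in the network. The short sketch below derives one set per location from a parent map; the network shown and the helper name `dependency_sets` are hypothetical and serve only as an illustration.

```python
from typing import Dict, FrozenSet, Tuple


def dependency_sets(parents: Dict[str, Tuple[str, ...]]) -> Dict[str, FrozenSet[str]]:
    """Map each location to its dependency set: the location plus its parents."""
    return {loc: frozenset((loc, *pa)) for loc, pa in parents.items()}


# Hypothetical parent map (not the network learned in the paper).
example_parents = {"mem": (), "cyt": ("mem",), "nuc": ("cyt", "mem")}
sets = dependency_sets(example_parents)
```

A root node such as `mem` yields a singleton set, so there are always exactly q sets, one per location.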

3.3 Setting location values

As part of the generative process for a protein P, we view the value of a location indicator $l_i^P$ ($1 \le i \le q$) as set by tossing a coin $C_i$; if the coin comes up Heads, the location indicator $l_i^P$ is set to 1; otherwise $l_i^P = 0$. The probability of $C_i$ to come up Heads is $\Pr(l_i^P = 1 \mid Pa(L_i))$. Values comprising the location indicator vector $\vec{l}^P$ are thus set by tossing the location-specific coins in sequence, one after the other. We assume that there is a specific order in which the coins are tossed. To establish the order, we use a topological ordering of the location variables in the Bayesian network G, denoted as $L_{t_1}, \ldots, L_{t_q}$, where each parent in the network appears before its descendants; the network in Figure 1, being acyclic, admits such an ordering. Consequently, coin $C_{t_1}$ is tossed first, and based on its outcome, the location indicator value $l_{t_1}^P$ is set, then $C_{t_2}$ is tossed and $l_{t_2}^P$ is set, and so on, until $C_{t_q}$ is tossed and $l_{t_q}^P$ is set.
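
The coin-toss step amounts to ancestral sampling over the location network. The following sketch assumes Python 3.9 or later (for the standard-library `graphlib` module); the network, the Heads probabilities and the function name `sample_location_vector` are hypothetical. The probabilities are deliberately set to 0 or 1 so that the outcome is deterministic and easy to follow.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sample_location_vector(parents, heads_prob, rng):
    """Toss one coin per location, parents before children."""
    # TopologicalSorter takes a node -> predecessors map, i.e. node -> parents.
    order = list(TopologicalSorter(parents).static_order())
    values = {}
    for loc in order:
        pa_vals = tuple(values[p] for p in parents[loc])
        # Heads (indicator = 1) with probability Pr(l = 1 | parent values).
        values[loc] = 1 if rng.random() < heads_prob[loc][pa_vals] else 0
    return order, values


# Hypothetical chain mem -> cyt -> nuc with made-up, degenerate probabilities.
parents = {"cyt": ("mem",), "nuc": ("cyt",), "mem": ()}
heads_prob = {
    "mem": {(): 1.0},
    "cyt": {(0,): 0.0, (1,): 1.0},
    "nuc": {(0,): 0.0, (1,): 0.0},
}
order, values = sample_location_vector(parents, heads_prob, random.Random(0))
```

Note that nothing in this sampling step prevents an all-zero vector; how that case is handled is not specified here.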

3.4 Setting feature values

We further view each feature value $f_j^P$ as selected from among n possible distinct values by adhering to the following steps:
(i) A dependency set is selected: A location dependency set is chosen based on a probability distribution over the q such sets [see Equation (2) for the definition of the sets]. For each feature $F_j$, let $I_j$ be a random variable that takes on the values $1, \ldots, q$, where a value i ($1 \le i \le q$) indicates that the ith location dependency set is selected. We denote the event of selecting the ith set, $LS_i$, by $I_j = i$. Given a location indicator vector $\vec{l}^P$ (selected in the previous step), to select a location dependency set, a die with q faces is rolled. If the die lands with the ith face up, the set $LS_i$ is selected. The probability of this die to come up as i is $\Pr(I_j = i \mid \vec{l}^P)$.
(ii) A feature-value is assigned: Based on the values taken by variables in a selected location set, the feature value is chosen. Given that the set $LS_i$ was selected, we assume that a die with n faces is rolled to pick a value for feature $F_j$. If the die lands with the vth face up, the feature value $f_j^P$ is set to v. The probability of this die to come up as v is $\Pr(F_j = v \mid LS_i)$, where $F_j$ is the random variable associated with the jth feature.
Based on this model, each feature value is set independently of other features to construct the complete feature vector of the protein P, $\vec{f}^P$. The generative process for a protein P is summarized in Figure 2. First, the location coins are tossed in the order $C_{t_1}, \ldots, C_{t_q}$, as shown on the left side of the figure. If the coin $C_{t_i}$ ($1 \le i \le q$) comes up Heads, the location indicator $l_{t_i}^P$ is set to 1; otherwise $l_{t_i}^P = 0$. Collectively, this results in choosing the location indicator vector $\vec{l}^P$. Next, for each feature $F_j$ ($1 \le j \le d$), the location-set die is rolled; if the die lands with the kth face up, the set $LS_k$ is selected, as shown on the top-right side of the figure. Based on the selected set $LS_k$, the feature die is rolled; if the die lands with the vth face up, the feature value $f_j^P$ is set to v, as shown on the bottom-right side of the figure.
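
The two die rolls used for a single feature can be sketched as follows. All names (`sample_feature_value`, `set_probs`, `value_probs`) and all numbers are hypothetical; in particular, `set_probs` stands for the distribution over dependency sets for one already-chosen location vector, and the probabilities are degenerate so that the result is deterministic.

```python
import random


def sample_feature_value(set_probs, value_probs, dep_sets, loc_vector, rng):
    """Roll a q-faced die to pick a dependency set, then an n-faced die for the value."""
    # First die: one face per dependency set.
    k = rng.choices(range(len(set_probs)), weights=set_probs)[0]
    # Restrict the location vector to the members of the chosen set.
    restricted = tuple(loc_vector[loc] for loc in dep_sets[k])
    # Second die: one face per possible feature value.
    dist = value_probs[k][restricted]
    v = rng.choices(range(len(dist)), weights=dist)[0]
    return k, v


# Hypothetical two-location example with n = 3 feature values.
dep_sets = [("mem",), ("cyt", "mem")]
loc_vector = {"mem": 1, "cyt": 0}
set_probs = [0.0, 1.0]  # the second dependency set is always chosen
value_probs = [
    {(0,): [1.0, 0.0, 0.0], (1,): [0.0, 1.0, 0.0]},
    {(0, 0): [1.0, 0.0, 0.0], (0, 1): [0.0, 0.0, 1.0],
     (1, 0): [0.0, 1.0, 0.0], (1, 1): [1.0, 0.0, 0.0]},
]
k, v = sample_feature_value(set_probs, value_probs, dep_sets, loc_vector,
                            random.Random(0))
```

Repeating this call once per feature, with feature-specific tables, would yield a complete feature vector under the independence assumption stated in the text.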

Fig. 2.

The generative process for a protein P. First, location coins, $C_{t_1}, \ldots, C_{t_q}$, are tossed (top left); based on the outcomes, location indicator values, $l_{t_1}^P, \ldots, l_{t_q}^P$, are chosen (bottom left). Collectively, these values make up the location indicator vector $\vec{l}^P$. For each feature $F_j$, the location-set die is then tossed to select a location dependency set (top right); based on the selected set $LS_k$, the feature die is tossed to pick the feature-value (bottom right)

We note that our generative model makes the following two independence assumptions:
(i) The feature values of a protein P, $f_1^P, \ldots, f_d^P$, are conditionally independent of each other given the protein’s location indicator vector $\vec{l}^P$, formally:
$$\Pr(\vec{f}^P \mid \vec{l}^P) = \prod_{j=1}^{d} \Pr(f_j^P \mid \vec{l}^P). \qquad (3)$$
While this assumption may oversimplify the underlying biological mechanisms, it works well in practice and has proven useful before (Briesemeister ). Moreover, our model carefully accounts for inter-dependencies among locations, as well as among locations and features, thus indirectly capturing inter-dependencies among features.
(ii) Given the values taken by a location variable $L_k$ and its parents in a selected location dependency set $LS_k$, the feature value for a protein, $f_j^P$, is conditionally independent of all other location values, formally:
$$\Pr(f_j^P \mid \vec{l}^P, I_j = k) = \Pr(f_j^P \mid LS_k), \qquad (4)$$
where $I_j = k$ denotes the event that the set $LS_k$ is selected for feature $F_j$.
Figure 3 shows the protein generation process using the standard notation of a probabilistic graphical model. Nodes represent random variables and directed edges represent dependencies among variables. The values of location and feature random variables are governed by a probability distribution and as such are denoted using circles. In contrast, the value of each location dependency set variable $LS_k$ is assigned deterministically based on the values of the location variable $L_k$ and its parents $Pa(L_k)$, and is denoted as a square. The variables representing locations, features, and location dependency sets are observed and hence are shown as shaded; the rest of the variables are latent and are shown unshaded. The latent variable $I_j$ takes on a value k, indicating the selection of the location set $LS_k$, with a probability $\Pr(I_j = k \mid \vec{l}^P)$. As was shown in Figure 1, edges among location variables capture inter-dependencies among locations. The rectangular plate notation is used to represent replication of feature and location set variables with the same dependencies. The lack of feature–feature edges captures the conditional independencies among features given location sets.

Fig. 3.

The probabilistic graphical model for the generation of protein features. Directed edges represent dependencies between nodes. Locations and features are shown as circles and location sets as squares. Shaded nodes represent observed variables and unshaded nodes represent latent variables. The latent variable $I_j$ takes on a value k, indicating the selection of the set $LS_k$, with a probability $\Pr(I_j = k \mid \vec{l}^P)$. The rectangular plate notation is used to represent replication of features and location sets with the same dependencies

Under the independence assumptions and the structure of our model described earlier, the joint probability of the location indicator vector $\vec{l}^P$ and the feature vector $\vec{f}^P$ is expressed as:
$$\Pr(\vec{l}^P, \vec{f}^P) = \underbrace{\prod_{i=1}^{q} \Pr(l_i^P \mid Pa(L_i))}_{(a)} \times \prod_{j=1}^{d} \sum_{k=1}^{q} \underbrace{\Pr(f_j^P \mid LS_k)}_{(b)} \, \underbrace{\Pr(I_j = k \mid \vec{l}^P)}_{(c)}, \qquad (5)$$
where each term corresponds to a parameter of the generative model as described below:
(a) $\prod_{i=1}^{q} \Pr(l_i^P \mid Pa(L_i))$ is the factorization of the joint probability $\Pr(\vec{l}^P)$ over the individual q location indicator values;
(b) $\Pr(f_j^P \mid LS_k)$ denotes the conditional probability of a feature value $f_j^P$ ($1 \le j \le d$, where d is the total number of features), given the values taken by a location variable $L_k$ and its parents, which comprise the location dependency set $LS_k$ (under the current model G);
(c) $\Pr(I_j = k \mid \vec{l}^P)$ denotes the probability that the location dependency set $LS_k$ was selected for a given feature $F_j$ and a location indicator vector $\vec{l}^P$.
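
To make the structure of the joint probability concrete, the sketch below evaluates a product over locations times, for each feature, a mixture over dependency sets. It is an illustration under stated assumptions rather than the MDLoc code: every name and number is hypothetical, and `set_probs[j]` is taken to be the distribution over dependency sets for feature j given the particular location vector being scored.

```python
def joint_probability(loc_vector, features, parents, heads_prob,
                      dep_sets, set_probs, value_probs):
    """Location factor times, per feature, a mixture over dependency sets."""
    prob = 1.0
    # Location part: product of Pr(l_i | parents of L_i).
    for loc, pa in parents.items():
        p_one = heads_prob[loc][tuple(loc_vector[p] for p in pa)]
        prob *= p_one if loc_vector[loc] == 1 else 1.0 - p_one
    # Feature part: for feature j, sum over sets k of
    # Pr(f_j | values of set k) * Pr(set k selected | location vector).
    for j, v in enumerate(features):
        mixture = 0.0
        for k, members in enumerate(dep_sets):
            restricted = tuple(loc_vector[m] for m in members)
            mixture += value_probs[j][k][restricted][v] * set_probs[j][k]
        prob *= mixture
    return prob


# Hypothetical two-location, one-feature example with made-up numbers.
parents = {"mem": (), "cyt": ("mem",)}
heads_prob = {"mem": {(): 0.4}, "cyt": {(0,): 0.5, (1,): 0.25}}
dep_sets = [("mem",), ("cyt", "mem")]
loc_vector = {"mem": 1, "cyt": 0}
features = [1]                        # the single feature takes value index 1
set_probs = [[0.5, 0.5]]              # for this location vector only
value_probs = [[{(1,): [0.2, 0.8]}, {(0, 1): [0.6, 0.4]}]]

p = joint_probability(loc_vector, features, parents, heads_prob,
                      dep_sets, set_probs, value_probs)
```

Here the location part is 0.4 × 0.75 = 0.3 and the feature mixture is 0.8 × 0.5 + 0.4 × 0.5 = 0.6, giving 0.18. For realistic numbers of features, working with log-probabilities would avoid numerical underflow.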

4 Model learning and protein multi-location prediction

In this section, we introduce the procedure used for learning the structure and the parameters of our generative model and for predicting multiple locations for proteins. We present an expectation maximization (EM) algorithm to estimate the hidden parameters and explain the inference technique used for multi-location prediction. As our goal is to predict multiple locations for proteins, we use the probabilistic generative model presented in Section 3 to predict a 0/1 value for each location variable $L_i$. To obtain the model, we use an iterative process (see Fig. 4) in which the structure of a Bayesian network and the parameters of the generative model [shown in Equation (5)] are learned. Each iteration consists of first learning a network structure and estimating its parameters, and then assessing the resulting model by using it to infer the locations of the proteins in the training dataset. This process continues until a stopping criterion is met, namely, until the prediction performance of the learned model on the training-set proteins does not improve between two successive iterations. Typically the process requires no more than ten iterations. To measure prediction performance in each iteration, we use the F1-score metric, which is formally defined later in Section 5.2. We next discuss the procedures used for learning the structure and the parameters of our model.
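
The outer learning loop described above can be sketched generically. The callables `learn`, `infer` and `score` are hypothetical placeholders for structure-and-parameter learning, training-set inference and F1 computation, respectively. Returning the best model seen so far when the score stops improving is an assumption of this sketch; the text does not state which model is kept.

```python
from typing import Any, Callable, Tuple


def iterate_learning(learn: Callable[[Any], Any],
                     infer: Callable[[Any], Any],
                     score: Callable[[Any], float],
                     initial_locations: Any,
                     max_iterations: int = 100) -> Tuple[Any, float, int]:
    """Alternate learning and re-inference until the score stops improving."""
    locations = initial_locations
    best_model, best_score = None, float("-inf")
    for iteration in range(1, max_iterations + 1):
        model = learn(locations)      # network structure + parameters
        locations = infer(model)      # re-infer training-set locations
        current = score(locations)    # e.g. F1 on the training set
        if current <= best_score:     # no improvement: stop
            return best_model, best_score, iteration
        best_model, best_score = model, current
    return best_model, best_score, max_iterations


# Toy stand-ins: the "model" is a counter and the score saturates at 0.3.
result = iterate_learning(
    learn=lambda locs: locs + 1,
    infer=lambda model: model,
    score=lambda locs: min(locs, 3) / 10,
    initial_locations=0,
)
```

With the toy stand-ins, the score rises for three iterations and then plateaus, so the loop stops in the fourth iteration and returns the third model.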

Fig. 4.

A summary of our model-learning process. The rectangular boxes represent steps in the learning process, the diamond indicates checking for a stopping criterion, and the oval represents the output, which in our case is the learned model. Directed edges indicate the order among steps

4.1 Model learning

In each iteration of the learning process, we obtain a Bayesian network structure of locations using the software package BANJO (Smith ) and estimate the model parameters shown in Equation (5) in the previous section. The initial Bayesian network structure is learned from protein locations in the training set, and iteratively updated to reflect the most-recently estimated locations. To estimate the model parameters described in components (a) and (b) of Equation (5), we calculate the maximum likelihood estimates from frequency counts in the training dataset. As for component (c) there, the location set probability $\Pr(I_j = k \mid \vec{l}^P)$ for a given location indicator vector $\vec{l}^P$ and a feature $F_j$, where the latent variable $I_j$ indicates which dependency set is selected for $F_j$, cannot be directly computed from the dataset. We thus use an EM algorithm (Dempster ) to estimate this hidden parameter, as described next.
In the E-step, for each protein P and each of its feature values $f_j^P$ in the training set, we compute the probability of a location set $LS_k$ to be used to determine the protein’s feature value as:
$$\Pr(I_j = k \mid f_j^P, \vec{l}^P) = \frac{\Pr(f_j^P \mid LS_k)\,\Pr(I_j = k \mid \vec{l}^P)}{\sum_{k'=1}^{q} \Pr(f_j^P \mid LS_{k'})\,\Pr(I_j = k' \mid \vec{l}^P)}. \qquad (6)$$
In Equation (6), for each location indicator vector $\vec{l}^P$ and feature $F_j$, the distribution over the q location sets is initialized as uniform; thus initially, for all k, $\Pr(I_j = k \mid \vec{l}^P) = 1/q$. The conditional probability, $\Pr(f_j^P \mid LS_k)$, of a feature value given the location set $LS_k$ (where $1 \le k \le q$) is initialized to the maximum likelihood estimate computed using the training dataset.
In the M-step, we re-estimate all the model parameters. For each location indicator vector $\vec{l}$ and feature $F_j$, the probability of a location dependency set $LS_k$ is re-estimated as:
$$\Pr(I_j = k \mid \vec{l}) = \frac{\sum_{v} \sum_{P:\ \vec{l}^P = \vec{l},\ f_j^P = v} \Pr(I_j = k \mid f_j^P = v, \vec{l}^P)}{\sum_{k'=1}^{q} \sum_{v} \sum_{P:\ \vec{l}^P = \vec{l},\ f_j^P = v} \Pr(I_j = k' \mid f_j^P = v, \vec{l}^P)}, \qquad (7)$$
where v is a feature value of $F_j$ and k denotes the selection of the dependency set $LS_k$. That is, in the numerator, for each feature $F_j$, we go over all feature values v that $F_j$ takes, and all proteins in the set that have this feature value; we sum the probability of having used the dependency set $LS_k$ to generate feature value v, so that each value is weighted by how often it is observed. The denominator is a normalization factor ensuring that probabilities sum to 1. The probability of a set $LS_k$ to be selected for determining $f_j^P$, $\Pr(I_j = k \mid f_j^P, \vec{l}^P)$, is calculated in the E-step [see Equation (6)]. Note that $\Pr(I_j = k \mid \vec{l})$ is computed separately for each feature since feature-values are determined independently of each other during protein generation.
To re-estimate the conditional probability $\Pr(F_j = v \mid LS_k)$, we introduce the notation $\vec{l}^P_{LS_k}$ to denote the restriction of the location indicator vector $\vec{l}^P$ to only those locations that are members of the location dependency set $LS_k$. The conditional probability is then calculated as:
$$\Pr(F_j = v \mid LS_k = \vec{l}_{LS_k}) = \frac{\sum_{P:\ \vec{l}^P_{LS_k} = \vec{l}_{LS_k},\ f_j^P = v} \Pr(I_j = k \mid f_j^P, \vec{l}^P)}{\sum_{v'} \sum_{P:\ \vec{l}^P_{LS_k} = \vec{l}_{LS_k},\ f_j^P = v'} \Pr(I_j = k \mid f_j^P, \vec{l}^P)}. \qquad (8)$$
This re-estimation formula is similar to the one shown in Equation (7), but takes into account only those proteins in the training set whose locations match on the members of the dependency set $LS_k$. The process of alternating between the E-step and the M-step is carried out until convergence is reached, i.e. until changes to the hidden parameter values between iterations are no greater than 0.05. Throughout the estimation process, we use Laplace smoothing to avoid overfitting, by adding fractional pseudocounts to observed counts of events (Russell and Norvig, 2010). The smoothing parameter (α) is set to 0.5, which is close to the count of rare events and almost insignificant compared with counts of frequent ones. We next present the inference procedure that we use for predicting protein locations.
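
The alternation between the E-step and the M-step can be illustrated with a deliberately reduced sketch that re-estimates only the dependency-set probabilities for one feature and one location vector, holding the feature-value likelihoods fixed. This simplification, the expected-count form of the M-step and all function names are assumptions of the sketch; only the smoothing value 0.5 and the convergence threshold 0.05 come from the text.

```python
ALPHA = 0.5      # smoothing pseudocount stated in the text
TOLERANCE = 0.05  # convergence threshold stated in the text


def e_step(value_given_set, set_probs):
    """Posterior over dependency sets for one protein and one feature.

    value_given_set[k]: probability of the observed feature value under set k.
    set_probs[k]: current probability that set k is selected.
    """
    weights = [pf * pk for pf, pk in zip(value_given_set, set_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def m_step(posteriors, alpha=ALPHA):
    """Smoothed expected counts, normalized to a distribution over sets."""
    q = len(posteriors[0])
    counts = [alpha + sum(post[k] for post in posteriors) for k in range(q)]
    total = sum(counts)
    return [c / total for c in counts]


def em_set_probs(likelihoods, q, tol=TOLERANCE, max_iters=100):
    """likelihoods: one value_given_set list per training protein."""
    probs = [1.0 / q] * q                      # uniform initialization
    for _ in range(max_iters):
        posts = [e_step(lik, probs) for lik in likelihoods]
        new_probs = m_step(posts)
        if max(abs(a - b) for a, b in zip(new_probs, probs)) <= tol:
            return new_probs
        probs = new_probs
    return probs


# Two hypothetical proteins, both better explained by the first set.
estimated = em_set_probs([[0.8, 0.2], [0.6, 0.4]], q=2)
```

A fuller version would also re-estimate the feature-value tables in each M-step and would keep a separate distribution per feature and per location vector.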

4.2 Multiple location prediction

Given a protein P, represented as a feature vector $\vec{f}^P$, our task is to predict its location indicator vector $\vec{l}^P$, i.e. we need to assign a 0/1 value to each of its location indicators $l_i^P$. Under the Bayesian network model, this task translates to inferring the value of the random variable $L_i$, which in turn depends on the values of its parent nodes $Pa(L_i)$. We thus infer the values of the location dependency set $LS_i$. The inference procedure aims to assign values to $L_i$ and to $Pa(L_i)$ such that the conditional probability of these values, given the feature vector $\vec{f}^P$ and the values of the remaining location variables, is maximized. To infer these values, we follow an iterative process. We start by initializing all location indicators in $\vec{l}^P$ to 0. For any value-assignment, $l_i$, to $L_i$, we denote by $\mathbf{pa}_i$ the values assigned to all parents of $L_i$. We also denote by $\vec{l}_{rest}$ the current value assignment for all location random variables in the network other than $L_i$ and $Pa(L_i)$. In each iteration, we consider in turn each of the random variables $L_i$. For all possible value assignments, $l_i$ and $\mathbf{pa}_i$, to $L_i$ and $Pa(L_i)$, respectively, we calculate the conditional probability, $\Pr(L_i = l_i, Pa(L_i) = \mathbf{pa}_i \mid \vec{f}^P, \vec{l}_{rest})$. The value assignment to $L_i$ that produces the highest probability is the one used as the current estimate for $L_i$. As noted earlier, the process typically requires about ten iterations to reach convergence. We next describe our experiments and the results obtained using the protein generation model.
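
The iterative inference procedure can be sketched as a coordinate-wise search. The callable `joint_score` is a hypothetical stand-in for a quantity proportional to the conditional probability being maximized (for a fixed feature vector, the joint probability of locations and features would serve). The toy score used here simply counts agreements with a target vector and is not a probability.

```python
from itertools import product


def infer_locations(locations, parents, joint_score, max_iters=10):
    """Update one location at a time, searching over it and its parents."""
    current = {loc: 0 for loc in locations}     # start with all indicators at 0
    for _ in range(max_iters):
        changed = False
        for loc in locations:
            block = (loc, *parents[loc])        # the variable and its parents
            best_vals, best_score = None, float("-inf")
            for vals in product((0, 1), repeat=len(block)):
                candidate = dict(current)
                candidate.update(zip(block, vals))
                s = joint_score(candidate)
                if s > best_score:
                    best_vals, best_score = vals, s
            # Only the value of the location itself is kept as the estimate.
            if current[loc] != best_vals[0]:
                current[loc] = best_vals[0]
                changed = True
        if not changed:                         # no indicator changed: converged
            break
    return current


# Toy score: number of indicators agreeing with a hypothetical target vector.
target = {"mem": 1, "cyt": 0, "nuc": 1}
toy_parents = {"mem": (), "cyt": ("mem",), "nuc": ("cyt",)}
predicted = infer_locations(
    ["mem", "cyt", "nuc"], toy_parents,
    joint_score=lambda a: sum(a[k] == target[k] for k in target),
)
```

Stopping when no indicator changes during a full pass is one reasonable convergence test; the text only states that about ten iterations are typically needed.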

5 Experiments and results

We implemented our algorithms for learning parameters of the generative model and for inferring locations using Python. We have applied our system MDLoc to the largest available dataset of multi-localized proteins, previously used for training YLoc+ (Briesemeister ). Next, we describe the dataset and the evaluation methods we use, followed by experiments and results obtained using MDLoc. We also provide several specific examples demonstrating the utility of incorporating location inter-dependencies into the prediction process.

5.1 Data

In our experiments, we use a dataset first constructed for an extensive comparison of multi-location prediction systems as part of the evaluation of YLoc+ (Briesemeister ). It contains 5447 single-localized proteins, originally published by Höglund , and 3056 multi-localized proteins, originally published as part of the DBMLoc dataset (Zhang ). Because in a true prediction scenario it is not known a priori whether a protein localizes to a single location or to multiple locations, we train our system on the combined set of proteins, thus enabling it to handle the actual prediction task. The dataset is already homology-reduced, i.e. proteins sharing >80% sequence identity with another protein in the dataset were removed. We compare the performance of our system to that of others using only the multi-localized proteins (3056 proteins) because the only results publicly available for the other systems were obtained on this dataset (Briesemeister ). The single-localized proteins are from the following locations (abbreviations and number of proteins per location are given in parentheses): cytoplasm (cyt, 1411 proteins); endoplasmic reticulum (ER, 198); extracellular space (ex, 843); Golgi apparatus (gol, 150); lysosome (lys, 103); mitochondrion (mi, 510); nucleus (nuc, 837); membrane (mem, 1238); peroxisome (per, 157). The multi-localized proteins are from the following pairs of locations: cyt_nuc: 1882 proteins; ex_mem: 334; cyt_mem: 252; cyt_mi: 240; nuc_mi: 120; ER_ex: 115; ex_nuc: 113. Note that all the multi-location subsets used have over 100 representative proteins.
We use the exact same representation of a 30-dimensional feature vector as used for evaluating YLoc+ (for further details see Briesemeister ): (i) thirteen features derived directly from the protein sequence data; (ii) nine features constructed using pseudo-amino acid composition (Chou, 2001); (iii) two annotation-based features constructed using two distinct groups of PROSITE patterns; (iv) six annotation-based features based on GO-annotations.
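
As a small consistency check on the counts reported above, and as an illustration of how a location label maps to a 0/1 indicator vector, consider the sketch below. The ordering of `LOCATIONS` and the helper name `indicator_vector` are arbitrary choices made for this example.

```python
LOCATIONS = ["cyt", "ER", "ex", "gol", "lys", "mi", "nuc", "mem", "per"]

# Counts as reported in Section 5.1.
single_counts = {"cyt": 1411, "ER": 198, "ex": 843, "gol": 150, "lys": 103,
                 "mi": 510, "nuc": 837, "mem": 1238, "per": 157}
multi_counts = {"cyt_nuc": 1882, "ex_mem": 334, "cyt_mem": 252, "cyt_mi": 240,
                "nuc_mi": 120, "ER_ex": 115, "ex_nuc": 113}


def indicator_vector(label):
    """Turn a label such as 'cyt_nuc' into a 0/1 vector over LOCATIONS."""
    members = set(label.split("_"))
    return [1 if loc in members else 0 for loc in LOCATIONS]


n_single = sum(single_counts.values())
n_multi = sum(multi_counts.values())
cyt_nuc = indicator_vector("cyt_nuc")
```

The per-location counts add up to the stated totals of 5447 single-localized and 3056 multi-localized proteins, for 8503 proteins in the combined training set.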

5.2 Experimental setting and performance measures

We compare the performance of MDLoc to that of our preliminary system (Simha and Shatkay, 2014) and to other systems, specifically YLoc+ (Briesemeister et al.), Euk-mPLoc (Chou and Shen, 2007), WoLF PSORT (Horton et al.) and KnowPredsite (Lin et al.), whose results on the multi-localized proteins are described in a previously published comprehensive study by Briesemeister et al. The comparison uses the exact same dataset as that study, and employs multiple runs of stratified 5-fold cross-validation. That is, we ran 5-fold cross-validation five complete times (25 runs in total), using a different five-way split each time. The use of multiple runs with multiple splits helps validate the stability and the significance of the results. The total training time for our system for the 25 training experiments is about 8 hours (wall-clock), when running on a standard Dell PowerEdge machine with 32 AMD Opteron 6276 processors.

To formally define the evaluation measures we use, let $D$ be a dataset of proteins and let $S = \{s_1, \ldots, s_q\}$ be the set of locations. For a given protein $P$, let $M^P = \{s_i \mid l_i^P = 1\}$, where $1 \le i \le q$, be the set of locations to which protein $P$ localizes according to the dataset, and let $\hat{M}^P = \{s_i \mid \hat{l}_i^P = 1\}$, where $1 \le i \le q$, be the set of locations that a classifier predicts for $P$, where $\hat{l}_i^P$ is the 0/1 prediction obtained for location $s_i$. We use adapted measures of multi-label precision and recall, denoted $Pre_{s_i}$ and $Rec_{s_i}$ and defined as follows (Briesemeister et al.):

$$Pre_{s_i} = \frac{1}{|\{P \in D \mid s_i \in \hat{M}^P\}|} \sum_{P \in D \,\mid\, s_i \in \hat{M}^P} \frac{|M^P \cap \hat{M}^P|}{|\hat{M}^P|}$$

$$Rec_{s_i} = \frac{1}{|\{P \in D \mid s_i \in M^P\}|} \sum_{P \in D \,\mid\, s_i \in M^P} \frac{|M^P \cap \hat{M}^P|}{|M^P|}$$

We also use the adapted measure of accuracy proposed by Tsoumakas et al. for evaluating multi-label classification. Some of these measures have also been previously used for multi-location evaluation (Briesemeister et al.; He et al.). The multi-label accuracy and the F1-label score used for the evaluation of YLoc+ (Briesemeister et al.) are computed as:

$$Acc = \frac{1}{|D|} \sum_{P \in D} \frac{|M^P \cap \hat{M}^P|}{|M^P \cup \hat{M}^P|}$$

$$F1\text{-}label = \frac{1}{|S|} \sum_{s_i \in S} \frac{2 \cdot Pre_{s_i} \cdot Rec_{s_i}}{Pre_{s_i} + Rec_{s_i}}$$

Finally, to evaluate the correctness of predictions made for each location $s_i$, we use the standard precision and recall measures, denoted by $\text{Pre-Std}_{s_i}$ and $\text{Rec-Std}_{s_i}$ and defined as:

$$\text{Pre-Std}_{s_i} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{Rec-Std}_{s_i} = \frac{TP}{TP + FN},$$

where $TP$ (true positives) denotes the number of proteins that localize to $s_i$ and are predicted to localize to $s_i$, $FP$ (false positives) denotes the number of proteins that do not localize to $s_i$ but are predicted to localize to $s_i$, and $FN$ (false negatives) denotes the number of proteins that localize to $s_i$ but are not predicted to localize to $s_i$. The F1-score for location $s_i$ is defined as:

$$F1_{s_i} = \frac{2 \cdot \text{Pre-Std}_{s_i} \cdot \text{Rec-Std}_{s_i}}{\text{Pre-Std}_{s_i} + \text{Rec-Std}_{s_i}}$$
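
The evaluation measures can be made concrete with a small Python sketch. It assumes the usual set-overlap forms of these measures: multi-label precision (recall) for a location averages the overlap between true and predicted location sets, relative to the predicted (true) set, over the proteins predicted (known) to be at that location; accuracy is the mean ratio of intersection to union; and F1-label averages the per-location harmonic mean of multi-label precision and recall. The function names and the three toy proteins are hypothetical and serve only to illustrate the computation.

```python
from typing import Dict, Iterable, Set, Tuple

Labels = Dict[str, Set[str]]  # protein id -> set of locations


def multilabel_precision(true: Labels, pred: Labels, s: str) -> float:
    """Mean of |M & M_hat| / |M_hat| over proteins predicted at s."""
    ratios = [len(true[p] & pred[p]) / len(pred[p]) for p in pred if s in pred[p]]
    return sum(ratios) / len(ratios) if ratios else 0.0


def multilabel_recall(true: Labels, pred: Labels, s: str) -> float:
    """Mean of |M & M_hat| / |M| over proteins that truly localize to s."""
    ratios = [len(true[p] & pred[p]) / len(true[p]) for p in true if s in true[p]]
    return sum(ratios) / len(ratios) if ratios else 0.0


def multilabel_accuracy(true: Labels, pred: Labels) -> float:
    """Mean of |M & M_hat| / |M | M_hat|; every protein has >= 1 true location."""
    return sum(len(true[p] & pred[p]) / len(true[p] | pred[p]) for p in true) / len(true)


def f1_label(true: Labels, pred: Labels, locations: Iterable[str]) -> float:
    """Average over locations of the harmonic mean of multi-label Pre and Rec."""
    scores = []
    for s in locations:
        pre = multilabel_precision(true, pred, s)
        rec = multilabel_recall(true, pred, s)
        scores.append(2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0)
    return sum(scores) / len(scores)


def standard_precision_recall(true: Labels, pred: Labels, s: str) -> Tuple[float, float]:
    """Standard per-location precision TP/(TP+FP) and recall TP/(TP+FN)."""
    tp = sum(1 for p in true if s in true[p] and s in pred[p])
    fp = sum(1 for p in true if s not in true[p] and s in pred[p])
    fn = sum(1 for p in true if s in true[p] and s not in pred[p])
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return pre, rec


# Toy data (invented for illustration, not taken from the dataset).
true = {"P1": {"cyt", "nuc"}, "P2": {"cyt"}, "P3": {"ex", "mem"}}
pred = {"P1": {"cyt", "nuc"}, "P2": {"cyt", "nuc"}, "P3": {"mem"}}
locations = ["cyt", "nuc", "ex", "mem"]

print("Acc      :", multilabel_accuracy(true, pred))
print("F1-label :", f1_label(true, pred, locations))
for s in locations:
    print(s, multilabel_precision(true, pred, s), multilabel_recall(true, pred, s),
          standard_precision_recall(true, pred, s))
```

In the toy data, protein P2 is over-predicted (an extra nuc assignment), which lowers the multi-label precision for both cyt and nuc even though cyt itself is predicted correctly; this is the sense in which the multi-label measures reward getting the whole location set right.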

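The repeated cross-validation protocol used in this section (five complete runs of stratified 5-fold cross-validation, 25 train/test runs in total) can be sketched as follows. This is a minimal standard-library illustration, assuming that stratification is done on each protein's location-combination label; the function names, the seed and the toy labels are hypothetical, and the sketch is not the splitting code used for MDLoc.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split item indices into k folds, spreading each label evenly."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for lab in sorted(by_label):
        members = by_label[lab]
        rng.shuffle(members)  # a different assignment on every repeat
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


def repeated_stratified_cv(labels, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for k * repeats runs."""
    rng = random.Random(seed)
    for r in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, test


# Toy labels: ten cyt_nuc proteins and five ex_mem proteins.
labels = ["cyt_nuc"] * 10 + ["ex_mem"] * 5
runs = list(repeated_stratified_cv(labels))
print(len(runs), "train/test runs")
```

Each of the 25 runs trains on four folds and tests on the fifth, and every fold keeps roughly the same proportion of each location-combination as the full dataset.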
5.3 Classification results

In this section, we compare the performance of our system with that of existing location prediction systems over the commonly used set of multi-localized proteins. We also report experiments using the combined set of single- and multi-localized proteins, as mentioned in Section 5.1. Our analysis includes an examination of the per-location breakdown of the results. Additionally, we focus on several specific examples demonstrating the benefit of incorporating location interdependency into our prediction system.

Table 1A shows the F1-label score and the accuracy obtained by our current system MDLoc compared with those obtained by other multi-location predictors [YLoc+, Euk-mPLoc, WoLF PSORT and KnowPredsite, as reported by Briesemeister et al. in their Table 3] and by our preliminary system (Bayesian network classifiers, denoted BNCs, Simha and Shatkay, 2014), using the same set of multi-localized proteins and evaluation measures. The table shows that MDLoc performs better than the existing top systems, including YLoc+, which has the best performance reported so far and whose predictions are based only on location-combinations in the training set. In contrast, MDLoc is not limited to the location-combinations in the training set, as it represents dependency of features on location-combinations in a generalizable manner, and directly captures inter-dependencies among locations. The only other system that attempts to capture such dependencies is our preliminary system BNCs.

Table 1.

Multi-location prediction results, averaged over 25 runs of 5-fold cross-validation, for multi-localized proteins only

(A)
	MDLoc	BNCs	YLoc+	Euk-mPLoc	WoLF PSORT	KnowPred_site
F1-label	0.71 (± 0.02)	0.66 (± 0.02)	0.68	0.44	0.53	0.66
Acc	0.68 (± 0.01)	0.63 (± 0.01)	0.64	0.41	0.43	0.63

Standard deviations are shown in parentheses (if available). The highest values are shown in boldface. (A) Overall F1-label scores and overall accuracy (Acc) obtained using our current system MDLoc, our preliminary system (denoted BNCs, Simha and Shatkay, 2014), YLoc+ (Briesemeister et al.), Euk-mPLoc (Chou and Shen, 2007), WoLF PSORT (Horton et al.) and KnowPredsite (Lin et al.). The four rightmost columns are taken directly from Table 3 in the article by Briesemeister et al. (B) Per-location scores: Multilabel-Precision ($Pre_{s_i}$) and Multilabel-Recall ($Rec_{s_i}$), as well as standard precision ($\text{Pre-Std}_{s_i}$) and recall ($\text{Rec-Std}_{s_i}$), for each location $s_i$, for MDLoc and YLoc+. Results for YLoc+ were reproduced using our five-way splits. The p-values indicate the statistical significance of the differences between the values obtained from MDLoc and from YLoc+.

Table 3.

Multi-location prediction results, per location-combination, obtained using one run of 5-fold cross-validation, for multi-localized proteins only

		cyt_nuc (1882)	ex_mem (334)	cyt_mem (252)	cyt_mi (240)	nuc_mi (120)	ER_ex (115)	ex_nuc (113)
Both locations correct	MDLoc	1253 (66.6%)	34 (10.2%)	31 (12.3%)	36 (15%)	15 (12.5%)	35 (30.4%)	51 (45.1%)
Both locations correct	BNCs	976 (51.9%)	16 (4.8%)	15 (6%)	25 (10.4%)	11 (9.2%)	16 (13.9%)	54 (47.8%)
First location correct	MDLoc	1603 (85.2%)	87 (26%)	186 (73.8%)	164 (68.3%)	43 (35.8%)	66 (57.4%)	73 (64.6%)
First location correct	BNCs	1578 (83.8%)	60 (18%)	174 (69%)	165 (68.8%)	37 (30.8%)	66 (57.4%)	68 (60.2%)
Second location correct	MDLoc	1481 (78.7%)	258 (77.2%)	82 (32.5%)	99 (41.3%)	67 (55.8%)	51 (44.3%)	72 (63.7%)
Second location correct	BNCs	1240 (65.9%)	246 (73.7%)	68 (27%)	85 (35.4%)	64 (53.3%)	27 (23.5%)	68 (60.2%)

For each combination, the table shows the number of proteins with correct predictions for both locations, for the first of the two locations, and for the second of the two locations, using MDLoc and using our preliminary system (BNCs, Simha and Shatkay, 2014). The highest values are shown in boldface.

To illustrate the use of interdependency, consider the protein Securin, which is included in our dataset and localizes to both the cytoplasm (cyt) and the nucleus (nuc). Securin, initially present in the cytoplasm, translocates to the nucleus in response to DNA damage (Kim et al.). While MDLoc assigns it to both the cyt and the nuc, YLoc+ assigns it to the nuc only. Our system utilizes the dependency between nuc and cyt (represented by a directed edge between the two locations, see Fig. 1) to make an accurate multi-location prediction. Location dependencies reflect intrinsic relationships that locations share with each other, and in this case, it is well-known that proteins shuttle continuously between the nucleus and the cytoplasm to control a variety of functions such as cell cycle progression (Gama-Carvalho and Carmo-Fonseca, 2001). MDLoc’s benefit from capturing the interdependency between cyt and nuc is also reflected in its significantly higher Multilabel-Precision and Multilabel-Recall ($Pre_{s_i}$ and $Rec_{s_i}$, respectively) for the cyt and the nuc, as shown in Table 1B.

As another example, consider Protransforming growth factor alpha (TGF-alpha), a protein that assists in cell growth (see NCBI’s Gene database, http://www.ncbi.nlm.nih.gov/gene/7039), localizes to both the extracellular space (ex) and the plasma membrane (mem), and is correctly assigned by MDLoc to both. Here MDLoc employs the well-known dependency between the extracellular space and the plasma membrane, as reflected for instance in the exocytic trafficking pathway (Tokarev et al.), and in the transition of proteins such as hsp 90-alpha (initiated by TGF-alpha) from the extracellular space to the plasma membrane in response to stress (Cheng et al.). Again, the value of utilizing interdependencies is demonstrated in MDLoc’s significantly improved Multilabel-Precision ($Pre_{s_i}$) on the ex and mem proteins, while still retaining a level of recall, $Rec_{s_i}$, similar to that of YLoc+.

As an example of MDLoc’s ability to handle proteins whose location-combination is not included in the training set, consider Transmembrane emp24 domain-containing protein 7 (emp24). It localizes to the ER and transports secretory proteins to the Golgi complex (gol) (Belden and Barlowe, 1996). (The tables shown do not include ER and gol proteins, as the number of proteins from either of these locations in the dataset is very small.) MDLoc assigns emp24 to both the ER and the gol, whereas YLoc+ assigns it to the ER only. As indicated before, MDLoc makes use of the dependency that captures the relationship between the ER and the gol, both of which act as components in the exocytic trafficking pathway (Tokarev et al.). We thus see that MDLoc is not restricted to predicting only pre-defined location-combinations.

Table 1B shows the per-location prediction results for multi-localized proteins obtained by MDLoc compared with those obtained by YLoc+ (Briesemeister et al.). Per-location predictions for the other systems are not shown here as they are not publicly available. Results are shown for the five locations with the largest number of associated proteins. For each location $s_i$, we show Multilabel-Precision ($Pre_{s_i}$) and Multilabel-Recall ($Rec_{s_i}$) as well as standard precision ($\text{Pre-Std}_{s_i}$) and recall ($\text{Rec-Std}_{s_i}$). For the cytoplasm and the nucleus, which have a large number of proteins, the precision and recall values obtained using MDLoc are significantly higher in most cases than those obtained using YLoc+. For locations with far fewer proteins, while the recall values when using MDLoc are marginally lower than when using YLoc+, MDLoc’s precision values are typically significantly higher than those of YLoc+. We note that YLoc+ assigns each protein to all the locations whose probability exceeds a pre-defined threshold; as such, the number of locations it assigns can exceed the number to which the protein actually localizes, resulting in a lower precision. In contrast, MDLoc does not simply assign a protein to each location whose probability is high, but rather, it simultaneously considers a set of locations and assigns each protein to the set whose overall probability is high, leading to a higher precision.

Table 2 shows the per-location prediction results on the combined dataset of both single- and multi-localized proteins obtained by MDLoc, in comparison to those obtained by BNCs (Simha and Shatkay, 2014). While MDLoc’s precision values are somewhat lower than those of BNCs, MDLoc’s recall is typically higher. MDLoc simultaneously infers the probability of a set of locations; in contrast, BNCs uses an independent Bayesian network structure to infer the probability of each location separately. As such, BNCs is much less likely to correctly assign a combination of several locations to a protein than to correctly assign a single location, which directly translates into a relatively low recall measure. When using MDLoc, the increase in recall values is, in almost all cases, higher than the decrease in the precision values, except in the case of the extracellular space (ex). Notably, proteins in the extracellular space all originate from, or are bound for, another location within the cell, and as such predicting them as extracellular is challenging for most prediction systems.
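
The contrast drawn above between per-location thresholding and assigning a whole location set can be illustrated with a toy Python example. The probabilities, the 0.5 threshold and the function names are invented for illustration; this sketch is not the inference procedure of either YLoc+ or MDLoc, and it only shows why thresholding each location separately can assign more locations than the most probable location set contains.

```python
# Hypothetical distribution over location sets for a single protein.
joint = {
    frozenset({"cyt"}): 0.45,
    frozenset({"nuc"}): 0.40,
    frozenset({"cyt", "nuc"}): 0.15,
}


def threshold_assignment(joint_probs, threshold=0.5):
    """Assign every location whose marginal probability exceeds the threshold."""
    marginals = {}
    for loc_set, p in joint_probs.items():
        for loc in loc_set:
            marginals[loc] = marginals.get(loc, 0.0) + p
    return {loc for loc, p in marginals.items() if p > threshold}


def joint_assignment(joint_probs):
    """Assign the single location set with the highest overall probability."""
    return set(max(joint_probs, key=joint_probs.get))


print("per-location threshold:", sorted(threshold_assignment(joint)))
print("most probable set     :", sorted(joint_assignment(joint)))
```

Here both marginals (about 0.60 for cyt and 0.55 for nuc) exceed the threshold, so thresholding assigns both locations, even though the combination {cyt, nuc} is the least probable of the three sets; choosing the most probable set assigns cyt only.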

Table 2.

Multi-location prediction results, per location, averaged over 25 runs of 5-fold cross-validation, for the combined set of single- and multi-localized proteins

		cyt (3785)	p-value	nuc (2952)	p-value	ex (1405)	p-value	mem (1824)	p-value	mi (870)	p-value
$Rec_{s_i}$	MDLoc	0.825 (±0.009)	≪0.001	0.830 (±0.010)	≪0.001	0.780 (±0.020)	≪0.001	0.822 (±0.012)	≪0.001	0.773 (±0.013)	≪0.001
$Rec_{s_i}$	BNCs	0.795 (±0.011)	≪0.001	0.784 (±0.017)	≪0.001	0.737 (±0.022)	≪0.001	0.780 (±0.014)	≪0.001	0.730 (±0.025)	≪0.001
$Pre_{s_i}$	MDLoc	0.819 (±0.013)	0.03	0.822 (±0.014)	0.02	0.864 (±0.020)	≪0.001	0.872 (±0.014)	≪0.001	0.861 (±0.024)	0.001
$Pre_{s_i}$	BNCs	0.809 (±0.018)	0.03	0.832 (±0.013)	0.02	0.912 (±0.019)	≪0.001	0.900 (±0.012)	≪0.001	0.885 (±0.023)	0.001
$\text{Rec-Std}_{s_i}$	MDLoc	0.867 (±0.015)	0.1	0.808 (±0.021)	≪0.001	0.715 (±0.030)	≪0.001	0.842 (±0.017)	≪0.001	0.719 (±0.028)	≪0.001
$\text{Rec-Std}_{s_i}$	BNCs	0.861 (±0.014)	0.1	0.736 (±0.031)	≪0.001	0.652 (±0.024)	≪0.001	0.805 (±0.017)	≪0.001	0.664 (±0.034)	≪0.001
$\text{Pre-Std}_{s_i}$	MDLoc	0.854 (±0.014)	0.001	0.783 (±0.020)	0.6	0.839 (±0.028)	≪0.001	0.882 (±0.014)	≪0.001	0.843 (±0.026)	0.001
$\text{Pre-Std}_{s_i}$	BNCs	0.840 (±0.011)	0.001	0.786 (±0.026)	0.6	0.906 (±0.022)	≪0.001	0.900 (±0.015)	≪0.001	0.873 (±0.034)	0.001

The table shows the same measures used in Table 1B obtained over the combined dataset using our current system MDLoc, and using our preliminary system (denoted BNCs) (Simha and Shatkay, 2014). The highest values are shown in boldface. The p-values indicate the statistical significance of the differences between the values obtained from MDLoc and those obtained from BNCs. Standard deviations are shown in parentheses.

Moreover, MDLoc assigns some proteins hitherto known to localize only to a single location to multiple locations. It is likely that at least some of these additional predicted locations are indeed correct and can be the subject of experimental validation. For instance, Calreticulin (Cal) is currently annotated by SwissProt as localized to the ER only. However, MDLoc assigns it to both the ER and the ex, and work by Gold et al. suggests that it indeed relocates from the ER to the ex.

We also examine the statistically significant differences in the Multilabel-Recall for the location with the highest number of multi-localized proteins (cytoplasm, 2374 proteins) and the location with the lowest number (endoplasmic reticulum, 115 proteins). The Multilabel-Recall for cytoplasm ($Rec_{cyt}$) increases from 0.80 when classifying using BNCs to 0.83 when using MDLoc. Similarly, the Multilabel-Recall for endoplasmic reticulum ($Rec_{ER}$, not shown in Table 2) increases from 0.64 to 0.69. This analysis demonstrates the advantage of using MDLoc for predicting protein locations, not just for locations that have a large number of associated proteins but also for locations that are associated with relatively few proteins.

Table 3 shows the prediction results obtained using MDLoc in contrast to those obtained using BNCs (Simha and Shatkay, 2014) for all location-combinations, using multi-localized proteins only. For each location-combination in the dataset, we show the number of proteins with correct predictions for both locations, as well as for the first of the two locations, and for the second, separately. For almost all combinations, the number of proteins whose location is correctly predicted by MDLoc is significantly higher than the corresponding number when using BNCs. We examine the predictions for the location-combination with the highest number of proteins (cytoplasm and nucleus, 1882 proteins) and its constituent locations (cytoplasm, 1411 single-localized proteins; nucleus, 837 single-localized proteins). As can be seen from the table, the number of multi-localized proteins whose combined location is correctly predicted increases significantly, from 976 when classifying using BNCs to 1253 when using MDLoc. The increase shows that location inter-dependencies learnt using MDLoc help to improve predictions for multi-localized proteins.
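
The percentages in the "Both locations correct" rows of Table 3 follow directly from the counts. The short Python sketch below recomputes them and totals the counts over all seven location-combinations; the variable names are illustrative, and the numbers are copied from Table 3.

```python
# (combination size, MDLoc both correct, BNCs both correct), from Table 3.
both_correct = {
    "cyt_nuc": (1882, 1253, 976),
    "ex_mem": (334, 34, 16),
    "cyt_mem": (252, 31, 15),
    "cyt_mi": (240, 36, 25),
    "nuc_mi": (120, 15, 11),
    "ER_ex": (115, 35, 16),
    "ex_nuc": (113, 51, 54),
}

for combo, (n, mdloc, bncs) in both_correct.items():
    print(f"{combo:8s} MDLoc {100 * mdloc / n:5.1f}%  BNCs {100 * bncs / n:5.1f}%")

total = sum(n for n, _, _ in both_correct.values())
mdloc_total = sum(m for _, m, _ in both_correct.values())
bncs_total = sum(b for _, _, b in both_correct.values())
bncs_better = [c for c, (_, m, b) in both_correct.items() if b > m]

print(f"overall  MDLoc {mdloc_total}/{total}  BNCs {bncs_total}/{total}")
print("combinations where BNCs gets more proteins fully correct:", bncs_better)
```

For cyt_nuc, the gain from 976 to 1253 proteins corresponds to a rise from 51.9% to 66.6% of the 1882 proteins. Summed over all combinations, MDLoc predicts both locations correctly for 1455 of the 3056 multi-localized proteins, compared with 1113 for BNCs; ex_nuc is the only combination in this row where BNCs is ahead.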

6 Conclusion and future work

We presented a new probabilistic generative model for protein localization based on Bayesian networks and a mixture model, and developed a system, MDLoc, to predict multiple locations for proteins. MDLoc takes advantage of the location inter-dependencies and location-feature dependency to provide a generalizable method for predicting multiple locations for proteins. Our results demonstrate the utility of using location inter-dependencies in the prediction process, and show that the performance of MDLoc improves over current state-of-the-art reported results. MDLoc significantly improves over our own preliminary method, which used a relatively simple collection of Bayesian network classifiers (Simha and Shatkay, 2014) whose performance was on par with that of YLoc+ (Briesemeister et al.). In our previous method, location inter-dependencies were not learnt as part of the model but rather captured based on simple estimates of location values. In contrast, MDLoc uses a generative model comprising Bayesian networks to directly address and capture inter-dependencies among locations, and a mixture model to represent feature dependency on location-combinations. We iteratively learn a Bayesian network over location variables while estimating the locations using expectation maximization.

Our future work includes exploring alternative ways to learn the mixture model parameters, to evaluate the model learned in each iteration of our current process, and to perform multi-location inference. We will also conduct experiments testing our system’s performance on more complex location-combinations. Having a larger set of multi-localized proteins from plant-, fungi- and animal-specific organelles will also enable us to explore the possibility of building a model for each taxonomic group. As another direction, we will experiment with features other than the ones previously used by YLoc+, utilizing multiple data sources, which is likely to be more appropriate for representing proteins in the context of multi-location prediction.

References (33 in total; 10 listed)

1. M Gama-Carvalho; M Carmo-Fonseca. The rules and roles of nucleocytoplasmic shuttling proteins (review). FEBS Lett, 2001-06-08.
2. Mechthild Pohlschröder; Enno Hartmann; Nicholas J Hand; Kieran Dilks; Alex Haddad. Diversity and evolution of protein translocation (review). Annu Rev Microbiol, 2005.
3. Tala M Bakheet; Andrew J Doig. Properties and identification of human protein drug targets. Bioinformatics, 2009-01-21.
4. W J Belden; C Barlowe. Erv25p, a component of COPII-coated vesicles, forms a complex with Emp24p that is required for efficient endoplasmic reticulum to Golgi transport. J Biol Chem, 1996-10-25.
5. Robert F Murphy. Communicating subcellular distributions (review). Cytometry A, 2010-07.
6. D S Kim; J A Franklyn; V E Smith; A L Stratford; H N Pemberton; A Warfield; J C Watkinson; T Ishmail; M J O Wakelam; C J McCabe. Securin induces genetic instability in colorectal cancer by inhibiting double-stranded DNA repair activity. Carcinogenesis, 2006-10-27.
7. Sebastian Briesemeister; Jörg Rahnenführer; Oliver Kohlbacher. Going from where to why--interpretable prediction of protein subcellular localization. Bioinformatics, 2010-03-17.
8. Sebastian Briesemeister; Jörg Rahnenführer; Oliver Kohlbacher. YLoc--an interpretable web server for predicting subcellular localization. Nucleic Acids Res, 2010-05-27.
9. Kuo-Chen Chou; Zhi-Cheng Wu; Xuan Xiao. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One, 2011-03-30.
10. Ramanuja Simha; Hagit Shatkay. Protein (multi-)location prediction: using location inter-dependencies in a probabilistic framework. Algorithms Mol Biol, 2014-03-19.
