Abstract
One of the fundamental tasks in biology is to identify the functions of all proteins in order to reveal the primary machinery of a cell. Knowledge of the subcellular locations of proteins provides key hints about their functions and helps to explain the intricate pathways that regulate biological processes at the cellular level. Protein subcellular location prediction has been studied extensively over the past two decades, and many methods have been developed based on protein primary sequences as well as on protein-protein interaction networks. In this paper, we propose to use the protein-protein interaction network as an infrastructure to integrate existing sequence-based predictors. When predicting the subcellular locations of a given protein, we consider not only the protein itself but also all of its interacting partners. Unlike existing methods, our method requires neither comprehensive knowledge of the protein-protein interaction network nor experimentally annotated subcellular locations for most proteins in that network. In addition, our method can serve as a framework for integrating multiple predictors. Our method achieved an absolute-true rate of 56% on the human proteome, which is higher than that of the state-of-the-art methods.
Year: 2014 PMID: 24466278 PMCID: PMC3900678 DOI: 10.1371/journal.pone.0086879
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Summary of the dataset.
(A) The number of locative proteins in each subcellular location. The dataset contains 6951 proteins with experimentally annotated subcellular locations. Because one protein may have more than one subcellular location, the number of locative proteins is 9493. (B) The number of proteins with different numbers of subcellular locations.
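The difference between the protein count and the locative-protein count can be made concrete with a small sketch. It assumes the annotations are available as a mapping from a protein identifier to its set of locations; the function name `summarize_locations` and the toy identifiers are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter

def summarize_locations(annotations):
    """Summarize a mapping of protein id -> set of subcellular locations.

    A "locative protein" is counted once per (protein, location) pair,
    so a protein with two locations contributes two locative proteins.
    """
    n_proteins = len(annotations)
    n_locative = sum(len(locs) for locs in annotations.values())
    # Panel (A): locative proteins per subcellular location.
    per_location = Counter(loc for locs in annotations.values() for loc in locs)
    # Panel (B): proteins grouped by how many locations they have.
    multiplicity = Counter(len(locs) for locs in annotations.values())
    return n_proteins, n_locative, per_location, multiplicity

# Toy data for illustration only (not the paper's dataset).
toy = {
    "P1": {"Nucleus"},
    "P2": {"Nucleus", "Cytoplasm"},
    "P3": {"Membrane"},
}
n_proteins, n_locative, per_location, multiplicity = summarize_locations(toy)
print(n_proteins, n_locative)
```

In the toy data, three proteins yield four locative proteins because `P2` is counted in both of its locations, which is the same effect that turns 6951 proteins into 9493 locative proteins.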
Figure 2. The relationship between ECC and co-localization scores.
For every pair of interacting proteins with experimentally annotated subcellular locations, the ECC of the interaction and the co-localization score were computed. The interactions were divided into ten groups according to their ECC values: the first group contained the interactions with ECC values between 0 and 0.1, the second group those between 0.1 and 0.2, the third group those between 0.2 and 0.3, and so forth. The average ECC and the average co-localization score were computed for every group. The horizontal axis is the average ECC and the vertical axis is the average co-localization score. Ten dots represent the ten groups of interactions. A straight line, fitted by simple linear regression, represents the linear relationship between the average ECC and the average co-localization score.
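The grouping and regression described in this caption can be sketched as follows. This is a minimal sketch that takes precomputed ECC values and co-localization scores as input; how either quantity is computed is not given in this section, so neither is implemented here. The function name `bin_and_fit`, the handling of group boundaries, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def bin_and_fit(ecc, coloc, n_bins=10):
    """Group interactions by ECC, average each group, and fit a line.

    Assumption: group k covers [k/n_bins, (k+1)/n_bins), and an ECC of
    exactly 1.0 falls into the last group. Empty groups are skipped.
    At least two non-empty groups are needed for the line fit.
    """
    ecc = np.asarray(ecc, dtype=float)
    coloc = np.asarray(coloc, dtype=float)
    idx = np.minimum((ecc * n_bins).astype(int), n_bins - 1)
    mean_ecc, mean_coloc = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_ecc.append(ecc[mask].mean())
            mean_coloc.append(coloc[mask].mean())
    # Simple linear regression on the per-group averages.
    slope, intercept = np.polyfit(mean_ecc, mean_coloc, deg=1)
    return np.array(mean_ecc), np.array(mean_coloc), slope, intercept

# Synthetic values for illustration only: coloc = 0.5 * ecc + 0.2.
toy_ecc = [0.05, 0.15, 0.25, 0.95]
toy_coloc = [0.225, 0.275, 0.325, 0.675]
mean_ecc, mean_coloc, slope, intercept = bin_and_fit(toy_ecc, toy_coloc)
print(round(slope, 3), round(intercept, 3))
```

Fitting the line to the group averages rather than to the raw interactions matches the caption, where each plotted dot is one group.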
Comparison of prediction performances.
| Predictor | AIM | CVR | ACC | ATR | AFR |
| Hum-mPLoc 2.0 | 75.7% | 75.4% | 67.1% | 51.4% | 7.4% |
| Y-Loc | 72.4% | 61.0% | 59.8% | 47.4% | 8.4% |
| This method | 79.8% | 74.9% | 70.0% | 56.0% | 6.5% |
AIM is Aiming, as defined in eqn (15);
CVR is Coverage, as defined in eqn (16);
ACC is Accuracy, as defined in eqn (17);
ATR is Absolute-True-Rate, as defined in eqn (18);
AFR is Absolute-False-Rate, as defined in eqn (19).
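Eqns (15) to (19) are not reproduced in this section. The sketch below assumes they follow the commonly used set-based definitions of these five multi-label metrics: aiming averages the overlap over the predicted set, coverage averages it over the true set, accuracy averages it over the union, absolute-true counts exact matches, and absolute-false averages the number of wrong or missed labels over the total number of locations. The function name `multilabel_metrics` and the toy labels are hypothetical.

```python
def multilabel_metrics(true_sets, pred_sets, n_locations):
    """Compute AIM, CVR, ACC, ATR and AFR for multi-label predictions.

    Assumed definitions (the source equations are not shown here):
      AIM = mean(|T & P| / |P|)    CVR = mean(|T & P| / |T|)
      ACC = mean(|T & P| / |T | P|)
      ATR = mean(T == P)           AFR = mean((|T | P| - |T & P|) / M)
    where T is the true set, P the predicted set, M = n_locations.
    """
    n = len(true_sets)
    aim = cvr = acc = atr = afr = 0.0
    for true, pred in zip(true_sets, pred_sets):
        inter = len(true & pred)
        union = len(true | pred)
        aim += inter / len(pred) if pred else 0.0
        cvr += inter / len(true) if true else 0.0
        acc += inter / union if union else 0.0
        atr += 1.0 if true == pred else 0.0
        afr += (union - inter) / n_locations
    return {"AIM": aim / n, "CVR": cvr / n, "ACC": acc / n,
            "ATR": atr / n, "AFR": afr / n}

# Toy labels for illustration only.
true_sets = [{"Nucleus"}, {"Nucleus", "Cytoplasm"}]
pred_sets = [{"Nucleus"}, {"Nucleus"}]
metrics = multilabel_metrics(true_sets, pred_sets, n_locations=4)
print(metrics)
```

Under these definitions, higher is better for the first four metrics and lower is better for AFR, which is consistent with how the table reports "This method".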
Performance improvements for every single predictor.
| Predictor | AIM | CVR | ACC | ATR | AFR |
| Hum-mPLoc 2.0 | 75.7% | 75.4% | 67.1% | 51.4% | 7.4% |
| Hum-mPLoc 2.0 + PPI | 79.1% | 72.0% | 68.4% | 54.9% | 6.8% |
| Y-Loc | 72.4% | 61.0% | 59.8% | 47.4% | 8.4% |
| Y-Loc + PPI | 73.2% | 61.1% | 60.5% | 48.6% | 8.2% |
AIM is Aiming, as defined in eqn (15);
CVR is Coverage, as defined in eqn (16);
ACC is Accuracy, as defined in eqn (17);
ATR is Absolute-True-Rate, as defined in eqn (18);
AFR is Absolute-False-Rate, as defined in eqn (19);
These performance values were obtained without optimizing parameters. “+PPI” means using the current method with a single sequence-based predictor as input: only Hum-mPLoc 2.0 for “Hum-mPLoc 2.0 + PPI”, and only Y-Loc for “Y-Loc + PPI”.
Performances of iterative prediction.
| Iterations | AIM | CVR | ACC | ATR | AFR |
| 1 | 79.8% | 74.9% | 70.0% | 56.0% | 6.5% |
| 2 | 80.0% | 74.8% | 70.0% | 56.2% | 6.5% |
| 3 | 80.0% | 74.8% | 70.0% | 56.2% | 6.5% |
| 4 | 80.0% | 74.8% | 70.0% | 56.2% | 6.5% |
AIM is Aiming, as defined in eqn (15);
CVR is Coverage, as defined in eqn (16);
ACC is Accuracy, as defined in eqn (17);
ATR is Absolute-True-Rate, as defined in eqn (18);
AFR is Absolute-False-Rate, as defined in eqn (19).
Figure 3. Information flow chart of the whole framework.
The only input of the framework is the protein sequences. The process has three phases. (A) In the first phase, several existing sequence-based predictors produce prediction results using only protein sequences. In the current study, these predictors are Y-Loc and Hum-mPLoc 2.0, so the number of predictors, n, is 2. (B) In the second phase, the prediction results of the first phase are collected and annotated on a protein-protein interaction network. (C) In the third phase, the annotated protein-protein interaction network is analyzed and the network-based prediction results are generated.
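The three phases can be sketched as a small pipeline. This is an illustration of the data flow only, not the paper's algorithm: the rule for combining a protein's scores with those of its partners is not given in this section, so the averaging step, the weight `alpha`, the cutoff `threshold`, and every function and predictor below are hypothetical stand-ins.

```python
from collections import defaultdict

def phase1(sequences, predictors):
    """Phase A: run every sequence-based predictor on every sequence."""
    return {pid: [p(seq) for p in predictors] for pid, seq in sequences.items()}

def phase2(raw_scores):
    """Phase B: merge predictor outputs into one score dict per network node.

    Assumption: predictor scores are simply averaged.
    """
    annotated = {}
    for pid, score_dicts in raw_scores.items():
        total = defaultdict(float)
        for scores in score_dicts:
            for loc, s in scores.items():
                total[loc] += s / len(score_dicts)
        annotated[pid] = dict(total)
    return annotated

def phase3(annotated, edges, alpha=0.5, threshold=0.5):
    """Phase C: combine each protein's scores with its partners' scores.

    Assumption: combined = alpha * own + (1 - alpha) * mean(partners);
    a location is predicted when the combined score reaches threshold.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    predictions = {}
    for pid, own in annotated.items():
        partners = [annotated[q] for q in neighbors[pid] if q in annotated]
        locs = set(own) | {loc for p in partners for loc in p}
        chosen = set()
        for loc in locs:
            if partners:
                nb = sum(p.get(loc, 0.0) for p in partners) / len(partners)
            else:
                nb = own.get(loc, 0.0)  # no partners: fall back to own score
            if alpha * own.get(loc, 0.0) + (1 - alpha) * nb >= threshold:
                chosen.add(loc)
        predictions[pid] = chosen
    return predictions

# Hypothetical stand-ins for two sequence-based predictors (n = 2).
def toy_predictor_1(seq):
    return {"Nucleus": 0.9} if seq.startswith("MK") else {"Nucleus": 0.4, "Cytoplasm": 0.3}

def toy_predictor_2(seq):
    return {"Nucleus": 0.7} if seq.startswith("MK") else {"Nucleus": 0.4, "Cytoplasm": 0.5}

sequences = {"A": "MKT", "B": "MVL"}
edges = [("A", "B")]
annotated = phase2(phase1(sequences, [toy_predictor_1, toy_predictor_2]))
predictions = phase3(annotated, edges)
print(predictions)
```

In the toy run, protein `B` scores below the cutoff for every location on its own, but its partner `A` lifts its combined nucleus score above the cutoff. This illustrates why considering interacting partners can change a prediction.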