Sovan Saha, Piyali Chatterjee, Subhadip Basu, Mita Nasipuri, Dariusz Plewczynski.
Abstract
Proteins are the most versatile macromolecules in living systems and perform crucial biological functions. With the advent of the post-genomic era, next-generation sequencing is performed routinely at the population scale for a variety of species. The challenge is to determine, at scale, the functions of proteins that have not yet been characterized by detailed experimental studies. Identifying protein functions experimentally is laborious, time-consuming and resource-intensive. We therefore propose an automated protein function prediction methodology using in silico algorithms trained on carefully curated experimental datasets. We present an improved protein function prediction tool, FunPred 3.0, an extended version of our previous methodology FunPred 2, which exploits neighborhood properties in the protein-protein interaction network (PPIN) and physicochemical properties of amino acids. Our method is validated using the available functional annotations in the PPIN of Saccharomyces cerevisiae in the latest Munich Information Center for Protein Sequences (MIPS) dataset. The PPIN data of S. cerevisiae in the MIPS dataset includes 4,554 unique proteins in 13,528 protein-protein interactions after the elimination of the self-replicating and the self-interacting protein pairs. Using FunPred 3.0, we achieve mean precision, recall and F-score values of 0.55, 0.82 and 0.66, respectively. FunPred 3.0 is then used to predict the functions of unpredicted protein pairs (incomplete and missing functional annotations) in the MIPS dataset of S. cerevisiae. The method is also capable of predicting the subcellular localization of proteins along with their corresponding functions. The code and the complete prediction results are freely available at: https://github.com/SovanSaha/FunPred-3.0.git.
Keywords: MIPS Database; Neighborhood approach; Physico-chemical properties; Protein function prediction; Protein interaction networks; Protein–protein interactions
Year: 2019 PMID: 31198622 PMCID: PMC6535044 DOI: 10.7717/peerj.6830
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
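The abstract states that the PPIN is built after eliminating self-replicating and self-interacting protein pairs. A minimal sketch of such a cleaning step follows. It assumes that "self-replicating" means the same unordered pair listed more than once; the function name `clean_ppin` and the input format are hypothetical and not taken from the FunPred 3.0 code.

```python
def clean_ppin(raw_pairs):
    """Return (proteins, interactions) for an undirected PPIN.

    Assumptions (not from the paper): raw_pairs is an iterable of
    (protein_a, protein_b) name tuples; a pair and its reverse count
    as the same interaction.
    """
    interactions = set()
    for a, b in raw_pairs:
        if a == b:
            # Self-interacting pair: dropped.
            continue
        # frozenset makes (a, b) and (b, a) identical, removing repeats.
        interactions.add(frozenset((a, b)))
    # Every protein that still takes part in at least one interaction.
    proteins = set().union(*interactions)
    return proteins, interactions


raw = [
    ("YAL014c", "YAL030w"),
    ("YAL030w", "YAL014c"),  # same pair, reversed
    ("YAL014c", "YAL014c"),  # self-interaction
    ("YAL014c", "YMR197c"),
]
proteins, interactions = clean_ppin(raw)
print(len(proteins), len(interactions))
```

Applied to the full MIPS interaction list, the two lengths would be expected to correspond to the protein and interaction counts quoted in the abstract, provided the cleaning rules match those the authors used.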
Figure 1. Filtering of PPIN.
Application of node weight and edge weight at three levels of threshold: High, Medium and Low in FunPred 3.0_Clust.
Figure 2. Cluster formations.
Formation of clusters from refined network after application of three levels of node and edge weight threshold in FunPred 3.0_Clust.
FunPred 3.0_Clust (formation of protein clusters consisting of essential proteins and reliable edges).
| Input: Undirected PPIN |
| Output: Protein clusters at three levels of threshold: high, medium and low |
| Begin |
| for all nodes in the PPIN |
|   compute node weight |
| compute node weight threshold at three levels (high, medium and low) using equation 1 |
| for each level of threshold |
|   for all nodes in the PPIN |
|     if node weight does not exceed the threshold |
|       remove the corresponding node |
| for all edges in the PPIN |
|   compute edge weight |
| compute edge weight threshold at three levels (high, medium and low) using equation 1 |
| for all edges in the PPIN |
|   if edge weight does not exceed the high level of edge threshold |
|     remove the corresponding edge |
| repeat the same for the medium and low levels of threshold |
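| form the refined network from the remaining nodes and edges |
| for all proteins in the refined network |
|   form node weight table |
| sort the node weight table based on the node weights |
| select the first protein |
| form a cluster from the selected protein and its neighbors |
| update the node weight table by eliminating all the proteins present in the cluster |
| repeat the same procedure for the remaining proteins in the table |
| End |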
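The filtering and cluster-formation steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. Node and edge weights are taken as precomputed inputs because their definitions (equation 1) are not reproduced here. The node weight table is assumed to be sorted in descending order, and each cluster is assumed to be the seed protein plus its not-yet-clustered neighbors. The function names and toy weights are hypothetical.

```python
def filter_network(adjacency, node_weight, edge_weight, node_thr, edge_thr):
    """Keep only nodes and edges whose weight exceeds the thresholds."""
    kept = {p for p in adjacency if node_weight[p] > node_thr}
    refined = {p: set() for p in kept}
    for p in kept:
        for q in adjacency[p]:
            if q in kept and edge_weight[frozenset((p, q))] > edge_thr:
                refined[p].add(q)
    return refined


def form_clusters(refined, node_weight):
    """Greedy clustering over the sorted node weight table."""
    # Assumption: highest node weight first; names break ties.
    table = sorted(refined, key=lambda p: (-node_weight[p], p))
    unassigned = set(table)
    clusters = []
    for seed in table:
        if seed not in unassigned:
            continue  # already absorbed into an earlier cluster
        cluster = {seed} | (refined[seed] & unassigned)
        clusters.append(cluster)
        unassigned -= cluster  # "eliminate" clustered proteins from the table
    return clusters


# Toy network with hypothetical weights.
edges = {
    frozenset(("A", "B")): 0.20,
    frozenset(("A", "C")): 0.15,
    frozenset(("B", "C")): 0.10,
    frozenset(("C", "D")): 0.12,
    frozenset(("D", "E")): 0.30,
}
node_weight = {"A": 1.080, "B": 1.075, "C": 1.090, "D": 1.073, "E": 1.050}
adjacency = {p: set() for p in node_weight}
for pair in edges:
    a, b = tuple(pair)
    adjacency[a].add(b)
    adjacency[b].add(a)

# Threshold values follow the "High" row of the threshold table (1.072, 0.110).
refined = filter_network(adjacency, node_weight, edges, node_thr=1.072, edge_thr=0.110)
clusters = form_clusters(refined, node_weight)
print(clusters)
```

In this toy run, protein E is removed by the node threshold and edge B-C by the edge threshold, leaving one cluster seeded at C and a singleton cluster for B. Running the same two functions with the medium and low thresholds would give the other two cluster sets.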
FunPred 3.0_Pred: (Protein function prediction of test proteins).
| Input: Set of un-annotated proteins in |
| Output: Functional group of un-annotated proteins |
| Begin |
| |
| for each protein |
| for each |
| add |
| |
| add immediate neighbors of |
| |
| compute physicochemical features of each protein from its amino acid sequence and execute the selected classifiers to select the most essential features |
| for each protein |
|   compute its mean PCPscore of six selected physicochemical features |
| for each formed cluster |
|   compute its mean PCPscore of six selected physicochemical features |
| for each protein |
|   for all clusters |
|     obtain the difference between the PCPscore of the protein and that of the cluster |
|   assign the functional groups of the cluster with the minimum difference to the protein |
| Repeat all the above steps for annotation of protein functions in |
| End |
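The final assignment step (compare the PCPscore of a test protein with that of each cluster and inherit the functions of the closest one) can be sketched as below. The sketch assumes that a protein's PCPscore is the plain mean of its six selected feature values, already scaled to comparable ranges, that a cluster's score is the mean of its members' scores, and that the difference is absolute. The function names, data layout and feature values are hypothetical, and the function labels are reused from the prediction tables only as placeholders.

```python
from statistics import mean


def pcp_score(features):
    """Mean of the selected physicochemical feature values (assumed pre-scaled)."""
    return mean(features)


def predict_functions(test_features, clusters):
    """Return (functions, difference) for the cluster closest in PCPscore.

    clusters: list of dicts with keys
      "members":   {protein_name: [six feature values]}
      "functions": set of functional group labels
    """
    test_score = pcp_score(test_features)
    best_cluster, best_diff = None, None
    for cluster in clusters:
        cluster_score = mean(pcp_score(f) for f in cluster["members"].values())
        diff = abs(test_score - cluster_score)
        if best_diff is None or diff < best_diff:
            best_cluster, best_diff = cluster, diff
    if best_cluster is None:
        return set(), None  # no clusters available, nothing to assign
    return set(best_cluster["functions"]), best_diff


# Hypothetical feature values and cluster annotations.
clusters_demo = [
    {"members": {"P1": [0.4] * 6, "P2": [0.6] * 6}, "functions": {"DNA repair"}},
    {"members": {"P3": [0.9] * 6}, "functions": {"Cell polarity"}},
]
functions, diff = predict_functions([0.5] * 6, clusters_demo)
print(functions, diff)
```

A single scalar score discards a lot of information from the six features, so a distance over the full feature vector would be a natural alternative. The scalar form is used here only because the listed steps compare PCPscore values.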
Figure 3. FunPred 3.0_Pred.
Working model of FunPred 3.0_Pred. A: Selected test protein. B: Formation of the PPIN of the test protein. C: Formation of clusters. D: Computation of the distance of the test protein from each formed cluster. E: Allocation of the test protein to the cluster at minimum distance, along with all its functions.
Top-ranked selected physicochemical features (marked in blue and bold) using four classifiers, based on the maximum number of hits.
| Physicochemical properties | XGBoost | Random tree | Extra tree | Recursive feature elimination | #Hits |
|---|---|---|---|---|---|
| Instability index | | | | | |
| Extinction coefficient | | | | | |
| Absorbance | | | | | |
| Ip/mol weight | | | | | |
Figure 4. Categorization of proteins based on subcellular localization.
PPIN of yeast (Saccharomyces cerevisiae): cytoplasm proteins (red), nuclear proteins (green), interface proteins (blue), unpredicted localization proteins (orange).
Figure 6. Disintegrated network views of PPIN of yeast.
Separate PPINs of cytoplasm proteins (red), nuclear proteins (green), interface proteins (blue), unpredicted localization proteins (orange) and their interactions.
Figure 7. Nuclear PPIN of yeast.
Candidate (green) and test (yellow) proteins in nuclear PPIN (green and yellow) of yeast (violet: other nodes in the network).
Figure 9. Interface PPIN of yeast.
Candidate (blue) and test (yellow) proteins in interface PPIN (blue and yellow) of yeast (violet: other nodes in the network).
Performance analyses of FunPred 3.0_Pred_SL.
| Types of Proteins (based on Subcellular-localization) | Total no. of proteins in database | Total number of selected annotated proteins | Total number of selected essential test proteins | Prediction accuracy (Total no. of matched proteins) | Prediction accuracy (Total no. of unmatched proteins) | Failed to predict |
|---|---|---|---|---|---|---|
| Nuclear proteins | 1,771 | 1,609 | 162 | 112 | 32 | 18 |
| Cytoplasm proteins | 1,757 | 1,566 | 191 | 109 | 51 | 31 |
| Interface proteins | 2,246 | 2,176 | 70 | 37 | 23 | 10 |
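The counts in this table can be checked for internal consistency and turned into proportions. The sketch below is only arithmetic on the reported numbers. The variable and function names are hypothetical, and the resulting fractions are not necessarily the accuracy measure used by the authors.

```python
# (total in database, annotated, test, matched, unmatched, failed), copied from the table.
rows = {
    "Nuclear": (1771, 1609, 162, 112, 32, 18),
    "Cytoplasm": (1757, 1566, 191, 109, 51, 31),
    "Interface": (2246, 2176, 70, 37, 23, 10),
}


def outcome_fractions(total, annotated, test, matched, unmatched, failed):
    """Fractions of test proteins that were matched, unmatched or not predicted."""
    if annotated + test != total:
        raise ValueError("annotated + test proteins do not add up to the total")
    if matched + unmatched + failed != test:
        raise ValueError("outcome counts do not add up to the test set size")
    return tuple(round(count / test, 3) for count in (matched, unmatched, failed))


fractions = {name: outcome_fractions(*counts) for name, counts in rows.items()}
for name, (m, u, f) in fractions.items():
    print(f"{name}: matched={m}, unmatched={u}, failed={f}")
```

All three rows pass both consistency checks. The matched fraction is roughly 0.69 for nuclear, 0.57 for cytoplasm and 0.53 for interface test proteins.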
Figure 10. Network view.
PPI network of Yeast (Saccharomyces cerevisiae).
Figure 11. Selected candidate and test proteins.
PPIN of annotated (red circle) and test/unannotated proteins (yellow circle) of the yeast network (Saccharomyces cerevisiae).
Precision, recall and F-score obtained at three levels of node and edge weight threshold.
| Threshold type | Node weight threshold | Edge weight threshold | Selected test proteins | Precision | Recall | F-score |
|---|---|---|---|---|---|---|
| High | 1.072 | 0.110 | 433 | 0.55 | 0.82 | 0.66 |
| Medium | 1.068 | 0.109 | 433 | 0.55 | 0.82 | 0.66 |
| Low | 1.064 | 0.107 | 520 | 0.54 | 0.82 | 0.65 |
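The last column of this table is consistent with the F-score defined as the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the rounded precision and recall values in each row. Whether the authors computed the F-score from the mean values or averaged it per protein is not stated here, so agreement to two decimals is an observation rather than a confirmation of their procedure.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) per threshold level, copied from the table.
reported = {
    "High": (0.55, 0.82, 0.66),
    "Medium": (0.55, 0.82, 0.66),
    "Low": (0.54, 0.82, 0.65),
}

recomputed = {level: round(f_score(p, r), 2) for level, (p, r, _) in reported.items()}
for level, (_, _, f_reported) in reported.items():
    print(level, recomputed[level], f_reported)
```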
Performance analyses of FunPred 3.0 with other protein function prediction methodologies.
| Methods | Precision | Recall | F-score |
|---|---|---|---|
| FunPred 3.0 | 0.55 | 0.82 | 0.66 |
| FunPred-2 | 0.51 | 0.90 | 0.65 |
| FPred_Apriori | 0.64 | 0.66 | 0.65 |
| FunPred 1.1 | 0.61 | 0.50 | 0.55 |
| FunPred 1.2 | 0.63 | 0.56 | 0.59 |
| Deep_GO | 0.48 | 0.49 | 0.48 |
| Chi-square #1&2 | 0.20 | 0.25 | 0.22 |
| Chi-square #1 | 0.25 | 0.27 | 0.26 |
| Neighborhood counting #1&2 | 0.28 | 0.41 | 0.33 |
| Neighborhood counting #1 | 0.26 | 0.45 | 0.33 |
| Fs-weight #1&2 | 0.36 | 0.43 | 0.39 |
| Fs-weight #1 | 0.33 | 0.42 | 0.37 |
| Nrc | 0.37 | 0.43 | 0.40 |
| Zhang | 0.20 | 0.19 | 0.19 |
| DCS | 0.36 | 0.37 | 0.36 |
| DSCP | 0.39 | 0.40 | 0.39 |
| PON | 0.15 | 0.14 | 0.14 |
Predicted samples of unpredicted protein pair interactions/functions (“missing” protein-pair-interactions/functions) in the MIPS dataset.
| Protein #1 | Protein #2 | Predicted interaction #1 | Predicted interaction #2 | Predicted function #1 | Predicted function #2 |
|---|---|---|---|---|---|
| YAL014c | YAL030w | Two hybrid | Coimmunoprecipitation | – | – |
| YAL014c | YMR197c | Two hybrid | Coimmunoprecipitation | – | – |
| YLR459w | YDR434w | Unable to predict | – | – | – |
| YDR167w | YBR081c | Two hybrid | – | – | – |
| YGL173c | YML085c | Synthetic lethal | – | – | – |
| YGL190c | YKL048c | Synthetic lethal | Two hybrid | Cell polarity | – |
| YMR167w | YNL082w | Coimmunoprecipitation | Copurification | DNA repair | – |
| YDR027c | YJR060w | Affinity chromatography, affinity-tag GST | Two hybrid | – | – |
| YGR082w | YNL131w | Crosslinking | Coimmunoprecipitation | – | – |
| YJR066w | YHR186c | Synthetic lethal | – | – | Lipid metabolism |
| YDR363w-a | YER008c | Synthetic lethal | – | Vesicular transport | – |
| YDR309c | YLR319c | Synthetic lethal | Cell structure | – | – |
| YLR336c | YPL268w | Unable to predict | – | – | – |
| YKR099w | YDL106c | Unable to predict | – | – | – |
Predicted samples of unpredicted protein pair interactions/functions (“unknown” protein-pair-interactions/functions) in the MIPS dataset.
| Protein #1 | Protein #2 | Predicted interaction #1 | Predicted interaction #2 | Predicted function #1 | Predicted function #2 |
|---|---|---|---|---|---|
| YLR418c | YIL040w | Two hybrid | – | Pol II transcription | – |
| YOR326w | YNL120c | Mitosis | – | Cell polarity | Cell cycle control |
| YJR057w | YDR438w | Unable to predict | – | – | – |
| YFL037w | YMR299c | Cell structure | – | RNA processing | DNA repair |
| YHR129c | YGL124c | Mitosis | Two hybrid | – | – |
| YGR078c | YAL011w | Synthetic lethal | Two hybrid | – | – |
| YNL153c | YDR149c | Two hybrid | – | Pol II transcription | – |
| YMR307w | YMR317w | Two hybrid | – | Carbohydrate metabolism | – |
| YLR039c | YIL039w | Vesicular transport | Two hybrid | – | – |
| YMR307w | YHR004c | Two hybrid | – | Carbohydrate metabolism | – |
| YDL003w | YGL250w | Two hybrid | – | Energy generation | – |
| YNL271c | YGR228w | Meiosis | – | Cell polarity | Protein modification |
| YML094w | YBR108w | Unable to predict | – | – | – |
| YEL003w | YDR334w | Unable to predict | – | – | – |