Kaustav Sengupta, Sovan Saha, Anup Kumar Halder, Piyali Chatterjee, Mita Nasipuri, Subhadip Basu, Dariusz Plewczynski.
Abstract
Protein function prediction is emerging as an essential field in biological and computational studies. Computational approaches are now well established, and methods that combine information from multiple sources have been observed to outperform those that rely on a single source. Motivated by this, a methodology, PFP-GO, is proposed in which three heterogeneous sources (protein sequence, protein domain, and the protein-protein interaction network) are processed separately to rank each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. During GO ranking, protein sequence contributes sequence-based information, while protein domain and the protein-protein interaction network contribute structural/functional and topological information, respectively. The performance of PFP-GO is evaluated using Precision, Recall, and F-Score, and it performs better than the existing state-of-the-art methods it was compared against. PFP-GO achieves an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we examine some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
Keywords: 3D gene-gene association; protein domain; protein function prediction; protein sequence; protein-protein interaction network; ranked GO
Year: 2022 PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
Current computational methodologies of protein function prediction.
| Features used | Brief description | References |
|---|---|---|
| Sequence and Network | A deep learning framework for gene ontology annotations with sequence- and network-based information | F. |
| | DeepFunc: A deep learning framework for accurate prediction of protein functions from protein sequences and interactions | F. |
| | Predicting GO annotations from protein sequences and interactions | X. |
| GO terms | A deep learning framework for predicting protein functions with co-occurrence of GO terms | M. |
| | Gene function prediction based on gene ontology hierarchy preserving hashing | |
| | Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning | Z. |
| Structure | Structure-based protein function prediction using graph convolutional networks | |
| | Structure-based function prediction: approaches and applications | |
| | Prediction of protein function from structure: insights from methods for the detection of local structural similarities | |
FIGURE 1. Functionally active target protein selection. Two databases, STRING and UniProt, are used for this purpose.
FIGURE 2. Pruning of target neighborhood graphs. Bridge, fjord, and shore proteins are detected and pruned.
FIGURE 3. Double filtering of the target neighborhood graph. The edge clustering coefficient and edge weight are computed, and non-essential nodes are filtered out on that basis.
FIGURE 4. Schematic representation of sequence-based protein function prediction. The essential steps of this phase are the selection of seeds, followed by the formation of clusters and the computation of physicochemical properties.
FIGURE 5. Working strategy of domain-based protein function prediction. Four databases, STRING, UniProt, PFAM, and DOMINE, are used in this phase.
FIGURE 6. PPI network-based protein function prediction. GO term enrichment with its p-value is vital in this phase.
FIGURE 7. Sample human PPI network from the STRING database. It consists of nodes (proteins) and the interactions between them.
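Figure 3 names the edge clustering coefficient (ECC) without defining it. The following is a minimal sketch, assuming one commonly used definition, ECC(u, v) = z / min(d_u - 1, d_v - 1), where z is the number of triangles that contain the edge (u, v) and d is the node degree. The exact variant used by PFP-GO, the toy graph, and the function names are illustrative assumptions and are not taken from the PFP-GO source code.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def edge_clustering_coefficient(adj, u, v):
    """ECC under the assumed definition:
    shared neighbors of u and v / min(deg(u) - 1, deg(v) - 1).
    Returns 0.0 when the denominator is 0 (an endpoint has degree 1)."""
    if v not in adj.get(u, set()):
        raise ValueError(f"({u}, {v}) is not an edge of the graph")
    triangles = len(adj[u] & adj[v])  # each shared neighbor closes one triangle
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / denom if denom > 0 else 0.0


# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to C.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
toy_adj = build_adjacency(toy_edges)

ecc_ab = edge_clustering_coefficient(toy_adj, "A", "B")  # edge inside the triangle
ecc_cd = edge_clustering_coefficient(toy_adj, "C", "D")  # edge to the pendant node
print(ecc_ab, ecc_cd)
```

Under this definition, an edge embedded in a dense neighborhood scores high, while an edge leading to a sparsely connected node scores low, which is the kind of signal a filtering step could threshold on.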
Performance Analysis of INGA and PFP-GO based on PPI network, sequence, and domain.
| Methodology | Precision | Recall | F-score |
|---|---|---|---|
| PFP-GO | 0.67 | 0.58 | 0.62 |
| INGA | 0.44 | 0.51 | 0.47 |
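For reference, Precision, Recall, and F-score for a single protein can be computed from its predicted and annotated GO term sets. The sketch below uses the standard set-based definitions; the GO term sets are hypothetical, and the function name is an assumption for illustration rather than part of PFP-GO.

```python
def precision_recall_f(predicted, actual):
    """Set-based Precision, Recall, and F-score (harmonic mean) for one protein."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: three predicted terms, two annotated terms, one in common.
predicted_terms = {"GO:0000049", "GO:0001731", "GO:0005634"}
annotated_terms = {"GO:0000049", "GO:0003743"}

p, r, f = precision_recall_f(predicted_terms, annotated_terms)
print(round(p, 2), round(r, 2), round(f, 2))
```

How the per-protein scores are aggregated into the table values is not stated here. If scores are averaged per protein, the reported F-score need not equal the harmonic mean of the reported average Precision and Recall.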
Performance analysis of PFP-GO with other methods based on PPI network.
| Methodology | Precision | Recall | F-score |
|---|---|---|---|
| PFP-GO | 0.74 | 0.67 | 0.73 |
| FunApriori | 0.57 | 0.61 | 0.58 |
| Chi square #1 and #2 | 0.13 | 0.12 | 0.12 |
| Chi square #1 | 0.12 | 0.15 | 0.13 |
| Neighborhood counting #1 and #2 | 0.21 | 0.25 | 0.18 |
| Neighborhood counting #1 | 0.15 | 0.21 | 0.17 |
| Fs-weight #1 and #2 | 0.24 | 0.22 | 0.22 |
| Fs-weight #1 | 0.16 | 0.19 | 0.19 |
| Nrc | 0.25 | 0.24 | 0.22 |
Performance analysis of PFP-GO with other methods based on PPI network and sequence.
| Methodology | Precision | Recall | F-score |
|---|---|---|---|
| PFP-GO | 0.52 | 0.64 | 0.56 |
| Deep_GO | 0.48 | 0.49 | 0.48 |
| BLAST | 0.30 | 0.50 | 0.37 |
| NAÏVE | 0.33 | 0.31 | 0.31 |
Performance analysis of INGA and PFP-GO separately on CC, MF and BP.
| Methodology | Precision (BP) | Precision (MF) | Precision (CC) | Recall (BP) | Recall (MF) | Recall (CC) | F-score (BP) | F-score (MF) | F-score (CC) |
|---|---|---|---|---|---|---|---|---|---|
| PFP-GO | 0.49 | 0.51 | 0.48 | 0.95 | 0.98 | 0.95 | 0.64 | 0.67 | 0.64 |
| INGA | 0.37 | 0.53 | 0.42 | 0.33 | 0.63 | 0.63 | 0.58 | 0.57 | 0.49 |
Performance analysis of PFP-GO with other methods based on Fmax score.
| Methodology | BP | MF | CC |
|---|---|---|---|
| PFP-GO | 0.65 | 0.61 | 0.66 |
| NetGO 3.0 | 0.64 | 0.43 | 0.66 |
| Deep_GO_Plus | 0.57 | 0.41 | 0.59 |
| BLAST | 0.63 | 0.31 | 0.56 |
| NAÏVE | 0.40 | 0.23 | 0.54 |
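The Fmax score in the table above is the maximum F-score over all prediction score thresholds. A minimal sketch follows, assuming the protein-centric definition popularized by the CAFA challenges: at each threshold, Precision is averaged over proteins with at least one prediction and Recall is averaged over all proteins. Whether PFP-GO uses exactly this variant is not stated here; the scores, term names, and function name below are hypothetical.

```python
def fmax(scores, truth, thresholds=None):
    """Protein-centric Fmax (assumed CAFA-style definition).

    scores: {protein: {go_term: confidence in [0, 1]}}
    truth:  {protein: non-empty set of annotated go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            predicted = {go for go, s in scores.get(protein, {}).items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction at t
                precisions.append(tp / len(predicted))
            recall_sum += tp / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical toy data: two proteins with made-up terms and confidence scores.
toy_scores = {"P1": {"GO:a": 0.9, "GO:b": 0.4}, "P2": {"GO:c": 0.7}}
toy_truth = {"P1": {"GO:a"}, "P2": {"GO:c", "GO:d"}}

toy_fmax = fmax(toy_scores, toy_truth)
print(round(toy_fmax, 3))
```

Because Fmax picks the best threshold per method, it compares methods at their most favorable operating point rather than at a fixed cutoff.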
Top-ranked gene ontology terms selected from GGA validation.
| Rank | Gene | GO-terms |
|---|---|---|
| 1 | ENSG00000123131 | GO:0000049 |
| 2 | ENSG00000123131 | GO:0001731 |
| 3 | ENSG00000123131 | GO:0003743 |
| 4 | ENSG00000130741 | GO:0005576 |
| 5 | ENSG00000130741 | GO:0005634 |
| 6 | ENSG00000130741 | GO:0005783 |