Juan A G Ranea, Pedro Seoane-Zonjic, Elena Rojano, Fernando M Jabato, James R Perkins, José Córdoba-Caballero, Federico García-Criado, Ian Sillitoe, Christine Orengo.
Abstract
BACKGROUND: Protein function prediction remains a key challenge, and a protein's domain composition affects its function. Here we present DomFun, a Ruby gem that uses associations between protein domains and functions, calculated using multiple indices based on tripartite network analysis. These domain-function associations are combined at the protein level to generate protein-function predictions.
Keywords: CAFA; CATH; DomFun; Function prediction; Protein domains
Year: 2022 PMID: 35033002 PMCID: PMC8761305 DOI: 10.1186/s12859-022-04565-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow of the procedure followed in this study. We first built the domain-protein-function tripartite network. Then, we calculated associations between domains and functional annotations (through shared proteins) with NetAnalyzer. Once calculated, these domain-function associations were combined to predict protein function with DomFun. For a given protein, DomFun obtains its constituent domains and their associated functions. These domain-function association values are then combined to obtain protein-function scores.
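The association step in this workflow can be illustrated with a small sketch. It assumes the textbook definitions of the Jaccard index and the Simpson (overlap) index over the sets of proteins shared by a domain and a function; the toy annotations and every function name below are hypothetical and are not taken from NetAnalyzer or DomFun.

```python
from collections import defaultdict

# Hypothetical toy annotations (not data from the study).
protein_domains = {
    "P1": {"D1", "D2"},
    "P2": {"D1"},
    "P3": {"D2", "D3"},
}
protein_functions = {
    "P1": {"F1"},
    "P2": {"F1", "F2"},
    "P3": {"F2"},
}

def invert(mapping):
    """Turn protein -> items into item -> set of proteins."""
    inverted = defaultdict(set)
    for protein, items in mapping.items():
        for item in items:
            inverted[item].add(protein)
    return dict(inverted)

def jaccard(a, b):
    """Shared proteins divided by the union of both protein sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def simpson(a, b):
    """Shared proteins divided by the size of the smaller protein set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def domain_function_associations(protein_domains, protein_functions, index):
    """Score every domain-function pair that shares at least one protein."""
    domain_proteins = invert(protein_domains)
    function_proteins = invert(protein_functions)
    scores = {}
    for domain, d_prots in domain_proteins.items():
        for function, f_prots in function_proteins.items():
            if d_prots & f_prots:
                scores[(domain, function)] = index(d_prots, f_prots)
    return scores

jac = domain_function_associations(protein_domains, protein_functions, jaccard)
sim = domain_function_associations(protein_domains, protein_functions, simpson)
```

In this toy network, domain D1 and function F1 are annotated to exactly the same proteins, so both indices give them the maximum score, while D3 and F1 share no protein and receive no association at all.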
Fig. 2 DomFun evaluation in terms of maximum F-measure (Fmax) when predicting Gene Ontology (GO) molecular function (GOMF) (a), biological process (GOBP) (b) and cellular component (GOCC) (c) terms using the CAFA 3 prediction benchmark. The associations between GO terms and protein domains, classified using FunFams (FF) and superfamilies (SF) separately, were calculated using four different association indices: Jaccard (Jac), Simpson (Sim), Pearson correlation coefficient (PCC) and hypergeometric index (HyI), and combined using either Fisher's method (Fis) or Stouffer's method (Sto). Results for the baseline methods BLAST and Naïve are also included for comparison. Coverage (C) values for each method are included within the bars. The CAFA 3 evaluation procedure was set to partial mode and limited knowledge.
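The two combination methods named in this caption can be sketched in their textbook form, operating on a list of independent p-values. How DomFun converts each association index into the inputs of these tests is not described in this record, so the functions below are hypothetical illustrations and not the gem's API.

```python
import math
from statistics import NormalDist

def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-squared
    distribution with 2k degrees of freedom under the null hypothesis.
    For even degrees of freedom the survival function has a closed form,
    so no external library is needed. Each p must be > 0."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def stouffer_combine(p_values):
    """Stouffer's method: convert each p-value to a z-score, sum them,
    and rescale by sqrt(k). Each p must lie strictly between 0 and 1."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - normal.cdf(z)

# Hypothetical example: two domains of one protein both support a function.
combined_fisher = fisher_combine([0.05, 0.05])
combined_stouffer = stouffer_combine([0.05, 0.05])
```

Both methods return the input unchanged when given a single p-value, which is a convenient sanity check for a protein with only one domain.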
Maximum F-measure (Fmax) scores obtained with DomFun using the CAFA 3 prediction benchmark
| Domain classification | FA | HyI+Fis | PCC+Sto | Jac+Sto | Sim+Sto |
|---|---|---|---|---|---|
| FunFams | GOMF | 0.608 | 0.553 | 0.515 | |
| FunFams | GOBP | 0.452 | 0.444 | 0.443 | |
| FunFams | GOCC | 0.542 | 0.529 | 0.529 | |
| Superfamilies | GOMF | 0.314 | 0.347 | 0.350 | |
| Superfamilies | GOBP | 0.099 | 0.172 | 0.174 | |
| Superfamilies | GOCC | 0.254 | 0.353 | 0.340 | |
The best performing methods for each domain/GO subontology combination are indicated in bold
FA Functional annotation, HyI hypergeometric index, Sim Simpson index, PCC Pearson correlation coefficient, Jac Jaccard index, Sto Stouffer’s combination method, Fis Fisher’s combined probability test. CAFA 3 evaluation procedure set to partial mode and limited knowledge
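The maximum F-measure reported in this table can be sketched as follows. This is a simplified version of the usual CAFA-style definition: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins (full mode), and the best harmonic mean over thresholds is kept. Propagation of GO terms to their ancestors is omitted, and the function name and toy data are hypothetical.

```python
def f_max(predictions, truth, thresholds=None):
    """predictions: protein -> {term: score in [0, 1]}
    truth: protein -> non-empty set of true terms."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at t.
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical toy benchmark (not data from the study).
truth = {"P1": {"F1", "F2"}, "P2": {"F3"}}
predictions = {"P1": {"F1": 0.9, "F4": 0.4}, "P2": {"F3": 0.6}}
score = f_max(predictions, truth)
```

Restricting `truth` to the proteins that received at least one prediction before calling the function would approximate the partial evaluation mode, under the same simplifying assumptions.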
DomFun ranking analysis based on comparing different evaluation methods
| Ontology | Type | Mode | FF-HyI-Fis | FF-PCC-Sto | FF-Jac-Sto | FF-Sim-Sto | SF-HyI-Fis | SF-PCC-Sto | SF-Jac-Sto | SF-Sim-Sto |
|---|---|---|---|---|---|---|---|---|---|---|
| GOMF | 1 | 1 | 2.5 | 4.5 | 4.5 | 2.5 | 8 | 6 | 1 | 7 |
| GOMF | 1 | 2 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOMF | 2 | 1 | 2 | 3 | 4 | 1 | 8 | 6 | 7 | 5 |
| GOMF | 2 | 2 | 2 | 3 | 4 | 1 | 8 | 7 | 6 | 5 |
| GOBP | 1 | 1 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOBP | 1 | 2 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOBP | 2 | 1 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOBP | 2 | 2 | 2 | 3 | 4 | 1 | 8 | 6 | 5 | 7 |
| GOCC | 1 | 1 | 3 | 3 | 3 | 1 | 8 | 7 | 5 | 6 |
| GOCC | 1 | 2 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOCC | 2 | 1 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
| GOCC | 2 | 2 | 2 | 3.5 | 3.5 | 1 | 8 | 6 | 5 | 7 |
Type 1: no knowledge, type 2: limited knowledge. Mode 1: full, mode 2: partial. FF FunFams, SF superfamilies. Jac Jaccard, Sim Simpson, PCC Pearson correlation coefficient, HyI hypergeometric, Sto Stouffer, Fis Fisher
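The fractional ranks in this table (for example 3.5) are consistent with tied methods sharing the mean of the positions they occupy, although the tie rule is not stated here. A minimal sketch of such an average-rank scheme, with a hypothetical function name and made-up scores, is:

```python
def average_ranks(scores):
    """scores: method name -> score, where higher is better.
    Tied methods receive the mean of the rank positions they span."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the last entry with the same score as entry i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for name, _ in ordered[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks

# Hypothetical scores: B and C tie for positions 3 and 4.
ranks = average_ranks({"A": 0.6, "B": 0.5, "C": 0.5, "D": 0.7})
```

With this scheme the ranks of n methods always sum to n(n + 1)/2, which matches every row of the table (each row of eight methods sums to 36).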
Top Fmax values: DomFun (Simpson + Stouffer) vs. CAFA 3 methods
| Ontology | Type | Mode | Top DomFun | DomFun coverage | Top CAFA 3 | CAFA 3 coverage |
|---|---|---|---|---|---|---|
| GOMF | 1 | 1 | 0.357 | 0.71 | 0.618 | 1 |
| GOMF | 1 | 2 | 0.567 | 0.41 | 0.622 | 0.02 |
| GOMF | 2 | 1 | 0.431 | 0.49 | 0.622 | 1 |
| GOMF | 2 | 2 | 0.624 | 0.49 | 0.623 | 0.88 |
| GOBP | 1 | 1 | 0.275 | 0.46 | 0.397 | 1 |
| GOBP | 1 | 2 | 0.402 | 0.46 | 0.418 | 0.62 |
| GOBP | 2 | 1 | 0.37 | 0.55 | 0.598 | 1 |
| GOBP | 2 | 2 | 0.492 | 0.55 | 0.64 | 0.83 |
| GOCC | 1 | 1 | 0.412 | 0.49 | 0.615 | 1 |
| GOCC | 1 | 2 | 0.606 | 0.49 | 0.908 | 0 |
| GOCC | 2 | 1 | 0.422 | 0.51 | 0.615 | 1 |
| GOCC | 2 | 2 | 0.602 | 0.51 | 0.825 | 0 |
Type 1: no knowledge, type 2: limited knowledge. Mode 1: full evaluation, mode 2: partial evaluation
Maximum F-measure (Fmax) scores for precision and recall (PR) curves obtained with DomFun using the Pathway Prediction Performance benchmark procedure
| Domain classification | FA | HyI+Fis | PCC+Sto | Jac+Sto | Sim+Sto |
|---|---|---|---|---|---|
| FunFams | GOMF | 0.779 | 0.749 | 0.749 | |
| FunFams | GOBP | 0.643 | 0.604 | 0.604 | |
| FunFams | GOCC | 0.750 | 0.704 | 0.704 | |
| FunFams | KEGG | 0.730 | 0.730 | 0.730 | |
| FunFams | Reactome | 0.762 | 0.680 | 0.663 | |
| Superfamilies | GOMF | 0.241 | 0.370 | 0.139 | |
| Superfamilies | GOBP | 0.196 | 0.291 | 0.089 | |
| Superfamilies | GOCC | 0.221 | 0.217 | 0.129 | |
| Superfamilies | KEGG | 0.132 | 0.340 | 0.271 | |
| Superfamilies | Reactome | 0.127 | 0.327 | 0.081 | |
The best performing methods for each domain/annotation source combination are indicated in bold
FA Functional annotation, HyI hypergeometric index, PCC Pearson correlation coefficient, Jac Jaccard index, Sim Simpson index, Sto Stouffer’s method, Fis Fisher’s method