Damiano Piovesan, Silvio C E Tosatto.
Abstract
Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA 2.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture, interaction networks and information from the 'dark proteome', such as transmembrane and intrinsically disordered regions, to generate a consensus prediction. INGA was ranked in the top ten methods on both the CAFA2 and CAFA3 blind tests. The new algorithm can process entire genomes in a few hours, or even less when additional input files are provided. The new interface provides a better user experience by integrating filters and widgets to explore the graph structure of the predicted terms. The INGA web server, databases and benchmarking are available from URL: https://inga.bio.unipd.it/.
Year: 2019 PMID: 31073595 PMCID: PMC6602455 DOI: 10.1093/nar/gkz375
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Enriched terms in the INGA domain architecture database
| Architecture type | Signature | Architectures | Proteins | Enriched terms (MF) | Enriched terms (BP) | Enriched terms (CC) | Average depth (MF) | Average depth (BP) | Average depth (CC) |
|---|---|---|---|---|---|---|---|---|---|
| Dark | Transmembrane | 165 465 | 11 864 693 | 778 207 | 1 445 467 | 248 211 | 3.67 | 4.23 | 2.67 |
| Dark | Signal | 109 529 | 2 953 070 | 433 446 | 884 071 | 133 662 | 3.50 | 4.19 | 2.60 |
| Dark | Cytoplasmic | 5312 | 67 800 | 36 365 | 142 550 | 30 771 | 3.85 | 4.51 | 2.97 |
| Dark | Extracellular | 3292 | 22 381 | 25 425 | 102 560 | 16 810 | 3.82 | 4.54 | 2.95 |
| Dark | C-term disorder | 166 470 | 2 329 632 | 747 643 | 1 738 544 | 329 203 | 3.65 | 4.32 | 2.77 |
| Dark | N-term disorder | 161 436 | 2 348 434 | 729 138 | 1 675 203 | 329 310 | 3.66 | 4.33 | 2.78 |
| Dark | Central disorder | 126 412 | 1 134 920 | 519 506 | 1 228 743 | 239 888 | 3.67 | 4.37 | 2.81 |
| Dark | Fully disordered | 3047 | 43 869 | 7 837 | 30 078 | 7 722 | 3.38 | 4.39 | 2.71 |
| Dark | All | 488 312 | 18 112 467 | 2 181 980 | 4 626 913 | 843 145 | 3.61 | 4.26 | 2.71 |
| Non-dark | All | 366 108 | 72 418 252 | 2 019 833 | 3 943 739 | 650 507 | 3.50 | 4.11 | 2.63 |
| Total | | 854 420 | 90 530 719 | 4 201 813 | 8 570 652 | 1 493 652 | 3.56 | 4.19 | 2.68 |
Number of molecular function (MF), biological process (BP) and cellular component (CC) terms that are statistically enriched (Enriched terms) for different types of architectures in the INGA database. Average depth is the average minimum distance of the enriched terms from the corresponding ontology root. All architectures contain an InterPro signature; dark architectures (Dark) also contain a non-globular signature. The same architecture can have multiple 'dark' signatures, so partial counts are provided in separate rows (transmembrane, signal, etc.).
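The "average depth" measure described in the caption can be illustrated with a minimal sketch. It assumes the ontology is available as a child-to-parents mapping and that depth means the smallest number of edges between a term and the root; the term names and the `min_depths` helper are hypothetical and are not taken from the INGA code base.

```python
from collections import deque


def min_depths(parents, root):
    """Return the minimum number of edges from `root` to every reachable term.

    `parents` maps each term to the list of its direct parents (a DAG).
    """
    # Invert the child -> parents mapping into parent -> children.
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first search from the root: the first visit of a term
    # is guaranteed to use a shortest path.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in depth:
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth


# Hypothetical toy ontology: D has two parents, at different depths.
toy_parents = {
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],
}
depths = min_depths(toy_parents, "root")

# Average minimum depth over a (hypothetical) set of enriched terms.
enriched = ["C", "D"]
average_depth = sum(depths[t] for t in enriched) / len(enriched)
print(depths["D"], average_depth)
```

In this toy graph, D is reachable through both B and C, and the shorter path through B determines its depth.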
Figure 1. Estimated precision of three INGA components. Precision is reported for different ranking positions. Ranking is provided by the BLAST bit-score for Homology and by the enrichment P-value for Architectures and Interactions (see Methods). The horizontal axis is cut at 10.
Table 2. INGA performance in comparison with other methods
| Ontology | Method | Th (Fmax) | Precision | Recall | Fmax | Th (Smin) | Smin | Coverage |
|---|---|---|---|---|---|---|---|---|
| | INGA 2.0 | 0.49 | 0.660 | 0.730 | 0.693 | 0.67 | 5.83 | 0.93 |
| | INGA 2.0 Arch | 0.28 | 0.545 | 0.495 | 0.519 | 0.60 | 13.25 | 0.61 |
| | INGA 2.0 Arch Non-Dark | 0.47 | 0.600 | 0.365 | 0.454 | 0.60 | 11.20 | 0.46 |
| | INGA 1.0 | 0.78 | 0.658 | 0.583 | 0.618 | 0.95 | 10.33 | 0.90 |
| | BLAST | 0.68 | 0.568 | 0.321 | 0.410 | 1.0 | 19.25 | 0.90 |
| | Naive | 0.06 | 0.296 | 0.082 | 0.128 | 0.6 | 28.93 | 1.00 |
| | | | | | | | | |
| | INGA 2.0 | 0.40 | 0.515 | 0.632 | 0.567 | 0.56 | 29.91 | 0.93 |
| | INGA 2.0 Arch | 0.16 | 0.394 | 0.396 | 0.395 | 0.46 | 71.96 | 0.59 |
| | INGA 2.0 Arch Non-Dark | 0.21 | 0.370 | 0.321 | 0.344 | 0.57 | 62.54 | 0.47 |
| | INGA 1.0 | 0.59 | 0.482 | 0.499 | 0.490 | 0.76 | 56.98 | 0.90 |
| | BLAST | 0.22 | 0.422 | 0.097 | 0.158 | 1.0 | 123.91 | 0.91 |
| | Naive | 0.22 | 0.030 | 0.027 | 0.029 | 0.46 | 150.09 | 1.00 |
| | | | | | | | | |
| | INGA 2.0 | 0.40 | 0.589 | 0.641 | 0.614 | 0.56 | 3.78 | 0.96 |
| | INGA 2.0 Arch | 0.16 | 0.480 | 0.337 | 0.396 | 0.40 | 12.78 | 0.54 |
| | INGA 2.0 Arch Non-Dark | 0.16 | 0.431 | 0.314 | 0.363 | 0.50 | 11.51 | 0.48 |
| | INGA 1.0 | 0.65 | 0.503 | 0.508 | 0.505 | 0.87 | 10.19 | 0.85 |
| | BLAST | 0.79 | 0.452 | 0.184 | 0.262 | 1.0 | 25.77 | 0.90 |
| | Naive | 0.09 | 0.152 | 0.188 | 0.168 | 0.09 | 32.12 | 1.00 |
This evaluation corresponds to the CAFA full evaluation, with both no-knowledge and limited-knowledge examples merged into a single benchmark. Precision and recall are reported at the confidence threshold (Th) that maximizes the F-score. The coverage is the fraction of predicted targets. The INGA Architecture (INGA 2.0 Arch) component includes 'dark' signatures. INGA 2.0 corresponds to the full algorithm. BLAST and Naive are implemented and trained as described in CAFA2. Table values do not correspond to a fair blind test, as training and test examples overlap.
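The relation between the Precision, Recall and Fmax columns can be checked with a short sketch. It assumes the F-score is the usual harmonic mean of precision and recall and that Fmax is its maximum over confidence thresholds. The function names and the toy curve are hypothetical; the full CAFA procedure, which averages precision and recall over targets before combining them, is not reproduced here.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_max(curve):
    """Return (best F-score, threshold) over (threshold, precision, recall) points."""
    return max((f_score(p, r), t) for t, p, r in curve)


# Precision and recall reported for INGA 2.0 in the first block of the table.
print(round(f_score(0.660, 0.730), 3))  # 0.693, the reported Fmax

# Hypothetical precision-recall points, for illustration only.
toy_curve = [(0.2, 0.4, 0.8), (0.5, 0.6, 0.6), (0.8, 0.9, 0.2)]
best_f, best_threshold = f_max(toy_curve)
```

Applying `f_score` to the other rows, for example 0.658 and 0.583 for INGA 1.0, gives 0.618, again matching the table.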
Figure 2. Precision-recall curves for the methods compared in Table 2 for the three GO ontologies. In the legend, (F) is the Fmax and (C) is the coverage, i.e. the fraction of predicted targets.