INGA 2.0: improving protein function prediction for the dark proteome.

Damiano Piovesan¹, Silvio C E Tosatto^1,2.

Abstract

Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from the sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA 2.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture, interaction networks and information from the 'dark proteome', such as transmembrane and intrinsically disordered regions, to generate a consensus prediction. INGA was ranked in the top ten methods on both the CAFA2 and CAFA3 blind tests. The new algorithm can process entire genomes in a few hours, or even less when additional input files are provided. The new interface provides a better user experience by integrating filters and widgets to explore the graph structure of the predicted terms. The INGA web server, databases and benchmarking are available from URL: https://inga.bio.unipd.it/.

Year: 2019 PMID: 31073595 PMCID: PMC6602455 DOI: 10.1093/nar/gkz375

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The problem of predicting protein function from the amino acid sequence is intrinsically difficult due to the limited number of available experimentally validated examples and the complexity of the cellular machinery. The Gene Ontology (GO) (1), which provides a vocabulary of function descriptors, includes more than 45 thousand different terms. Manually annotated GO terms in UniProtKB (2) cover <1% of the entries. UniProt-GOA (3) provides automatic annotation for the rest of the database. It employs a number of different techniques exploiting sequence properties, InterPro (4) predictions and taxonomy. Yet about 40% of entries remain unannotated and the quality of predicted GO terms is unknown. The need for better methods to improve the functional characterization of known proteins and to predict the function of proteins from new organisms is becoming critical. Scores of new function prediction methods are published every year. However, an objective overview of their real performance is problematic, and a comparison between methods is almost impossible given the heterogeneity of the adopted evaluation protocols. The Critical Assessment of protein Function Annotation (CAFA) challenge addresses this problem by implementing a real blind test (5) and highlights the most effective methods for automatic protein function prediction. The latest CAFA results (6) show that methods using ‘transfer by homology’ approaches (7–9), based on sequence similarity, compete with both machine learning (10) and integrative methods (11). The BLAST Fmax, for example, is only about 10% lower than that of the best method. CAFA also shows that predicting biological processes (BP) is much more difficult than predicting molecular function (MF). The Naive baseline, which assigns terms to all benchmark proteins simply based on their frequency in UniProtKB, is still a good predictor due to strong biases in annotation databases. For example, a very large fraction of experimentally annotated proteins are annotated with the ‘protein binding’ term.
Eukaryotes, including simple organisms such as yeast, are much more difficult to predict than prokaryotes. Recent work shows that organism complexity is negatively correlated with residue-level annotation (12). A large fraction of eukaryotic proteome residues, up to 50% for human, is uncharacterized and remains inaccessible to common domain detection pipelines. This so-called ‘dark proteome’ is thought to be composed of new folds, transmembrane regions and intrinsically disordered residues (13). Here we present a new version of INGA (11), the Interaction Network GO Annotator, which combines homology, domain architecture, interaction networks and ‘dark’ features to predict protein function. In our previous work we showed how protein-protein interaction (PPI) networks can be used effectively to infer function based on the ‘Guilty by Association’ principle (14). It is becoming evident in the literature that disordered regions, compared to globular domains, provide a repertoire of new alternative functions (15–17), in particular for longer regions (18). The extraction of disorder features from the sequence has proven useful for function prediction methods (19,20). INGA already ranked in the top ten in CAFA2 (6) for both the MF and BP ontologies when considering the Fmax in the full-evaluation/no-knowledge mode. Recently, INGA 2.0 ranked in the top ten in CAFA3 for all three ontologies and for both the Fmax and Smin evaluations (manuscript in preparation). A lower Smin, in particular, indicates the ability of the method to predict specific and difficult rare terms, i.e. those less represented in annotation databases. The INGA server has been completely redesigned in order to improve reproducibility, reliability and usability. The INGA 2.0 algorithm can be executed in two alternative modes by providing either the protein sequence(s) or BLAST and InterProScan predictions.
In the first case, different components can be excluded to speed up the calculation at the cost of partially losing specificity. In the second case, INGA can provide maximum accuracy and predict function for entire genomes in less than one hour.

MATERIALS AND METHODS

INGA derives function information from different sources to generate a consensus prediction. The method exploits homology, domain architecture and interaction networks as proxies for transferring function from annotated proteins. The new version of INGA also integrates intrinsic disorder and transmembrane region prediction to cover information from the ‘dark proteome’, i.e. regions poorly characterized in public databases. The consensus prediction provided by INGA 2.0 has been evaluated in the CAFA3 assessment, ranking among the top ten methods for all three ontologies and for both Fmax and Smin. A description of the implementation and of the contribution of each component to the overall consensus accuracy follows.

Homology and protein interaction networks

Homology is based on the concept of vicinity. In the context of genetic phylogeny, homologous proteins share a common ancestor and therefore the same biological function (21). Other methods are able to distinguish paralogy from orthology because paralogous proteins often diverge too much and lose function similarity (8). However, these methods are bound to the computational cost of building a phylogenetic tree and to the number of available representatives for a given protein family. Instead, INGA infers homology by simply measuring sequence similarity. In particular, it performs a BLAST search, with default parameters, against the entire UniProtKB sequence database. The default sorting based on the BLAST Bit-score is used to transfer GO terms and assign an estimated probability representing prediction precision. Different probabilities are assigned simply based on the BLAST ranking, independently of input properties or alignment coverage. In contrast to the previous INGA version, hits are not filtered. INGA also exploits information from protein-protein interaction networks to predict function. This has been shown to be effective in our previous work (14), in particular for the Cellular Component and partially the Biological Process ontologies. The new version of INGA uses exactly the same implementation. It considers only direct interactors from the STRING database, filtered with a confidence score of at least 0.4, corresponding to the STRING default. GO terms associated with direct interactors are transferred with a probability representing their enrichment in comparison to the entire STRING database. The enrichment is calculated with a Fisher exact test, while the probability is estimated by considering the P-value ranking and measuring the precision for each ranking position.
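The enrichment step for direct interactors can be illustrated with a minimal sketch. It is not the INGA implementation: the function names, the input layout and the choice of a one-sided (over-representation) test are assumptions made for illustration, and confidence scores are assumed to be expressed on a 0 to 1 scale.

```python
from math import comb

def fisher_right_tail(k, n, K, N):
    """One-sided Fisher exact P-value (hypergeometric upper tail):
    probability of observing at least k annotated proteins among n
    interactors when K of the N background proteins carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def rank_interactor_terms(interactors, annotations, background_counts,
                          background_size, min_score=0.4):
    """Rank GO terms of direct interactors by enrichment P-value.

    interactors: list of (protein_id, confidence) pairs (hypothetical layout)
    annotations: protein_id -> set of GO term identifiers
    background_counts: GO term -> number of annotated background proteins
    """
    # Keep only interactors passing the confidence cut-off (0.4 in the text).
    kept = [protein for protein, score in interactors if score >= min_score]
    counts = {}
    for protein in kept:
        for term in annotations.get(protein, ()):
            counts[term] = counts.get(term, 0) + 1
    pvalues = {
        term: fisher_right_tail(k, len(kept), background_counts[term], background_size)
        for term, k in counts.items()
    }
    # Lowest P-value first; the position in this list is the ranking that
    # would then be mapped to an estimated precision.
    return sorted(pvalues.items(), key=lambda item: item[1])
```

In this sketch a term shared by all retained interactors but rare in the background ends up at the top of the ranking, which is the behaviour the enrichment is meant to capture.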

Domain architecture database and the dark proteome

Proteins are organized in modular architectures (22). According to classification databases, complex architectures are provided by the repetition and rearrangement of a relatively small number of domains (23,24). Domains can be considered as functional determinants and are therefore subject to evolutionary pressure. When the three-dimensional structure is conserved across different species, domain detection from the sequence is straightforward as key positions are also conserved. InterPro provides the largest collection of sequence models (signatures) of protein domains with known biological role (4). However, when processing proteomes with InterPro a large fraction of residues remains undetected, in particular for eukaryotes (12). The ‘dark proteome’ includes all those functional modules for which key residues are not position specific but are, instead, characterized by compositionally biased regions, as in disordered and transmembrane proteins (13). INGA transfers GO terms from proteins with the same domain architecture. The new version uses InterProScan (25), Phobius (26) and MobiDB-lite (27) to predict domain, transmembrane and disordered signatures (labels), respectively. In addition to ‘transmembrane’, Phobius also provides the ‘signal peptide’, ‘cytoplasmic’ and ‘extracellular’ labels. MobiDB-lite predictions are transformed into four different signatures, either representing the localization in the sequence or indicating ‘fully disordered’ when the disorder content is larger than 75%. Both InterPro and ‘dark’ signatures are combined to generate the INGA domain architecture database. Architectures are calculated for the entire GOA (3). GO annotations of proteins with the same architecture are grouped together and sorted inside the cluster based on their enrichment (Fisher's test) calculated in comparison to the rest of the database (background).
When a target sequence matches an architecture in the INGA database, GO terms are transferred with a probability estimated from the ranking provided by the enrichment. Terms with a P-value higher than 0.001 are discarded, so that only significantly enriched terms are retained. This ensures that the retained terms are specific, i.e. distant from the ontology root. Table 1 shows the number of enriched terms for different architectures in the database. Notably, 57% of architectures contain ‘dark’ signatures (Dark), while the number of associated proteins is much higher for globular (Non-dark) architectures, indicating, on average, larger clusters. The number of enriched terms is almost the same for the two major classes, but terms enriched in the ‘dark’ database (Dark) are slightly more specific (Average depth). The introduction of ‘dark’ signatures results in the split of large clusters and therefore in the separation of different functional groups. On average 5 MF, 10 BP and 2 CC terms are associated with each architecture and can be safely transferred to the matching sequences.
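How predicted features could be turned into architecture labels is sketched below. Only the 75% threshold for ‘fully disordered’ and the four disorder labels come from the text; the rule used here to decide between N-terminal, central and C-terminal disorder (midpoint of the region falling in the first, middle or last third of the sequence) and the representation of an architecture as an unordered set of labels are assumptions, and the function names are hypothetical.

```python
def disorder_signatures(regions, length, full_threshold=0.75):
    """Map disordered regions to coarse signature labels.

    regions: non-overlapping 1-based inclusive (start, end) segments
    length: sequence length in residues
    """
    disordered = sum(end - start + 1 for start, end in regions)
    # 'Fully disordered' when disorder content exceeds 75% (from the text).
    if length > 0 and disordered / length > full_threshold:
        return {"fully_disordered"}
    labels = set()
    for start, end in regions:
        midpoint = (start + end) / 2
        # ASSUMPTION: localization is decided by sequence thirds.
        if midpoint <= length / 3:
            labels.add("n_term_disorder")
        elif midpoint > 2 * length / 3:
            labels.add("c_term_disorder")
        else:
            labels.add("central_disorder")
    return labels

def architecture_key(interpro_ids, phobius_labels, disorder_labels):
    """Combine globular and 'dark' signatures into one hashable key.
    ASSUMPTION: the order and repetition of signatures are ignored."""
    return tuple(sorted(set(interpro_ids) | set(phobius_labels) | set(disorder_labels)))
```

Two proteins that produce the same key would fall into the same cluster, and the GO terms enriched in that cluster become the candidates for transfer.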

Table 1.

Enriched terms in the INGA domain architecture database

Class	Signature	Architectures	Proteins	Enriched terms (MF)	Enriched terms (BP)	Enriched terms (CC)	Average depth (MF)	Average depth (BP)	Average depth (CC)
Dark	Transmembrane	165 465	11 864 693	778 207	1 445 467	248 211	3.67	4.23	2.67
	Signal	109 529	2 953 070	433 446	884 071	133 662	3.50	4.19	2.60
	Cytoplasmic	5312	67 800	36 365	142 550	30 771	3.85	4.51	2.97
	Extracellular	3292	22 381	25 425	102 560	16 810	3.82	4.54	2.95
	C-term disorder	166 470	2 329 632	747 643	1 738 544	329 203	3.65	4.32	2.77
	N-term disorder	161 436	2 348 434	729 138	1 675 203	329 310	3.66	4.33	2.78
	Central disorder	126 412	1 134 920	519 506	1 228 743	239 888	3.67	4.37	2.81
	Fully disordered	3047	43 869	7 837	30 078	7 722	3.38	4.39	2.71
	All	488 312	18 112 467	2 181 980	4 626 913	843 145	3.61	4.26	2.71
Non-dark	All	366 108	72 418 252	2 019 833	3 943 739	650 507	3.50	4.11	2.63
Total		854 420	90 530 719	4 201 813	8 570 652	1 493 652	3.56	4.19	2.68

Number of molecular function (MF), biological process (BP) and cellular component (CC) terms statistically enriched (enriched terms) for different types of architectures in the INGA database. (Average depth) Average minimum distance from the corresponding ontology root. All architectures contain an InterPro signature; dark architectures also contain a non-globular signature (Dark). The same architecture can have multiple ‘dark’ signatures; partial counts are provided in separate rows (transmembrane, signal, etc.).

Consensus and training

The training set is the one provided by the CAFA organizers and published on the official web page (https://biofunctionprediction.org/cafa) as ‘CAFA 3 Training Data’, corresponding to all experimental GO terms available in UniProtKB. It includes 35 086, 50 813 and 49 328 proteins with 371 584, 2 047 227 and 582 454 terms for the MF, BP and CC ontologies, respectively. The training set is used to estimate, with a ten-fold cross-validation, the correlation between precision and ranking position for the three INGA components: Homology, Architectures and Interactions. Figure 1 shows the distribution of precision in relation to the ranking. When generating predictions, INGA assigns a confidence score which is the average precision of the ranking. The ranking is calculated in different ways for the three INGA components. For Homology it corresponds to the BLAST output position, and ranking 1 means the hit (or set of hits) with the best Bit-score. For the Architecture and Interaction components the ranking is provided by the enrichment. Ranking 1 corresponds to all those terms with the lowest P-value (see methods for details). The final consensus is calculated in the same way as in the previous INGA version, i.e. by calculating the joint probability for terms provided by different methods. An additional weighting parameter to balance the contribution of the different methods has been trained using the same dataset and applying a simple grid search algorithm.
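The two steps described above, mapping a ranking position to an estimated precision and merging the component scores, can be sketched as follows. The text only states that a joint probability is calculated; the noisy-OR form 1 − Π(1 − w·p) used here, and the way the weights enter it, are assumptions, as are all function names.

```python
from collections import defaultdict

def precision_by_rank(training_predictions):
    """Estimate precision for each ranking position.

    training_predictions: iterable of (rank, is_correct) pairs collected
    on held-out folds of a cross-validation.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for rank, is_correct in training_predictions:
        totals[rank] += 1
        hits[rank] += int(is_correct)
    return {rank: hits[rank] / totals[rank] for rank in totals}

def consensus(component_scores, weights):
    """Merge per-component probabilities into a single score per term.

    component_scores: component name -> {GO term: probability}
    weights: component name -> weight in [0, 1] (e.g. from a grid search)
    ASSUMPTION: noisy-OR combination, 1 - prod(1 - w * p).
    """
    miss = {}
    for name, scores in component_scores.items():
        w = weights.get(name, 1.0)
        for term, p in scores.items():
            miss[term] = miss.get(term, 1.0) * (1.0 - w * p)
    return {term: 1.0 - m for term, m in miss.items()}
```

With this form a term supported by several components always scores at least as high as it does in any single weighted component, which matches the intuition behind a consensus.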

Figure 1.

Estimated precision of the three INGA components. Precision is reported for different ranking positions. Ranking is provided by the BLAST Bit-score for Homology and by the enrichment P-value for Architectures and Interactions (see methods). The horizontal axis is cut at 10.

Evaluation

INGA has been evaluated in the CAFA2 (6) and CAFA3 (manuscript in preparation) blind test experiments as ‘INGA-Tosatto’. In CAFA2, considering the Fmax and the full-evaluation/no-knowledge mode, INGA ranked among the top 10 methods for the MF and BP ontologies. In CAFA3, INGA (version 2.0) is in the top 10 also for CC and for both the Fmax and Smin metrics. The latter takes into consideration the information content of the terms and gives an indication of prediction specificity (28). Terms with high information content are less frequent in the annotation databases and therefore more difficult to predict. A fair comparison with other methods is very difficult outside the CAFA context due to a number of variables which cannot be controlled, for example the version of the training databases, ontology, etc. In Table 2, we report a comparison with the previous INGA and baseline methods as implemented in CAFA, using the benchmarking data provided in CAFA2, which contain 2618 BP, 2938 CC and 1828 MF protein targets. It has to be noted that the numbers in the table are not comparable with CAFA evaluations, as we consider the whole reference instead of subcategories and differences in calculation details exist. For example, no- and limited-knowledge examples were not separated in order to maximize the dataset size, and the source of GO terms (UniProt-GOA) contains new terms not present in the benchmarking. Also, the test is not fully blind, as the training data (UniProt-GOA) overlap with the test examples. Table 2 is provided just to show the contribution of the different INGA components and a comparison with baseline methods. We used the same input (when applicable), i.e. same BLAST database, UniProt-GOA, Gene Ontology version, etc. in order to equally propagate the effect of possible biases. For a fair evaluation we refer to the official CAFA3 results.
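For readers unfamiliar with Smin, the sketch below follows the information-theoretic definition of reference (28): at each confidence threshold the remaining uncertainty (information content of missed true terms) and the misinformation (information content of wrongly predicted terms) are averaged over the benchmark proteins, and Smin is the minimum of their Euclidean norm. The input layout and function name are assumptions, and terms are assumed to be already propagated to their ancestors; this is not the official CAFA evaluation code.

```python
from math import sqrt

def smin(predictions, truth, ic, thresholds=None):
    """Return (S_min, threshold) for a set of predictions.

    predictions: protein -> {term: score}
    truth: protein -> set of true terms
    ic: term -> information content
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    n = len(truth)
    best = None
    for t in thresholds:
        ru = mi = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            ru += sum(ic[f] for f in true_terms - predicted)   # missed terms
            mi += sum(ic[f] for f in predicted - true_terms)   # wrong terms
        s = sqrt((ru / n) ** 2 + (mi / n) ** 2)
        if best is None or s < best[0]:
            best = (s, t)
    return best
```

Because rare terms carry a high information content, missing them or predicting them wrongly is penalized more than errors on generic terms, which is why a lower Smin points to better prediction of specific functions.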
Table 2 also reports performance for the INGA Architectures component as it can be used for fast large-scale prediction and also for the same component without ‘dark’ features (INGA Arch Non-dark). The evaluation is provided as in the full CAFA evaluation, where methods with a lower coverage are penalized because recall is calculated averaging over the benchmark size. INGA 2.0 outperforms all methods and has ∼10% higher Fmax compared to its previous version for all ontologies. The INGA Architecture component has generally an 18% lower Fmax than the consensus but 6% higher than the one without ‘dark’ features. The Smin shows the same trend with a stronger difference between INGA 1.0 and INGA 2.0. Figure 2 shows the precision recall curves for methods reported in Table 2. The higher performance of INGA 2.0 over other methods (and INGA 2.0 Arch over INGA 2.0 Arch Non-Dark) can be explained by the higher number of considered features, as ‘dark’ features are expected to be extensively represented both in the training and test examples. All INGA CAFA3 predictions and benchmarking data are available for download from URL https://inga.bio.unipd.it/documentation/cafa.
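The coverage penalty mentioned above can be made explicit with a minimal sketch of a protein-centric Fmax in the style of the CAFA full evaluation: precision is averaged only over proteins with at least one prediction above the threshold, whereas recall is averaged over the whole benchmark. Function name and input layout are assumptions, every benchmark protein is assumed to have at least one true term, and terms are assumed to be already propagated to their ancestors.

```python
def fmax(predictions, truth, thresholds=None):
    """Return (F_max, threshold).

    predictions: protein -> {term: score}
    truth: protein -> non-empty set of true terms (the benchmark)
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = (0.0, None)
    for t in thresholds:
        precisions, recall_sum = [], 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            if predicted:
                overlap = len(predicted & true_terms)
                precisions.append(overlap / len(predicted))
                recall_sum += overlap / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        # Recall is divided by the benchmark size, so unpredicted
        # proteins count as zero recall (coverage penalty).
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best[0]:
                best = (f, t)
    return best
```

As a consistency check on Table 2, the MF row for INGA 2.0 gives 2 × 0.660 × 0.730 / (0.660 + 0.730) ≈ 0.693, which is the reported Fmax.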

Table 2.

INGA performance in comparison with other methods

Ontology	Method	Th (F_max)	Precision	Recall	F_max	Th (S_min)	S_min	Coverage
MF	INGA 2.0	0.49	0.660	0.730	0.693	0.67	5.83	0.93
	INGA 2.0 Arch	0.28	0.545	0.495	0.519	0.60	13.25	0.61
	INGA 2.0 Arch Non-Dark	0.47	0.600	0.365	0.454	0.60	11.20	0.46
	INGA 1.0	0.78	0.658	0.583	0.618	0.95	10.33	0.90
	BLAST	0.68	0.568	0.321	0.410	1.0	19.25	0.90
	Naive	0.06	0.296	0.082	0.128	0.6	28.93	1.00
BP	INGA 2.0	0.40	0.515	0.632	0.567	0.56	29.91	0.93
	INGA 2.0 Arch	0.16	0.394	0.396	0.395	0.46	71.96	0.59
	INGA 2.0 Arch Non-Dark	0.21	0.370	0.321	0.344	0.57	62.54	0.47
	INGA 1.0	0.59	0.482	0.499	0.490	0.76	56.98	0.90
	BLAST	0.22	0.422	0.097	0.158	1.0	123.91	0.91
	Naive	0.22	0.030	0.027	0.029	0.46	150.09	1.00
CC	INGA 2.0	0.40	0.589	0.641	0.614	0.56	3.78	0.96
	INGA 2.0 Arch	0.16	0.480	0.337	0.396	0.40	12.78	0.54
	INGA 2.0 Arch Non-Dark	0.16	0.431	0.314	0.363	0.50	11.51	0.48
	INGA 1.0	0.65	0.503	0.508	0.505	0.87	10.19	0.85
	BLAST	0.79	0.452	0.184	0.262	1.0	25.77	0.90
	Naive	0.09	0.152	0.188	0.168	0.09	32.12	1.00

This evaluation corresponds to the CAFA full-evaluation with both no- and limited-knowledge examples merged in a single benchmark. Precision and recall measures are reported for the confidence threshold which maximizes the F-score. The coverage is the fraction of predicted targets. The INGA Architecture (INGA Arch) component includes ‘dark’ signatures. INGA 2.0 corresponds to the full algorithm. BLAST and Naive are implemented and trained as described in CAFA2. Table values do not correspond to a fair blind test as training and test examples overlap.

Figure 2.

Precision recall curves for methods compared in Table 2 for the three GO ontologies. In the legend, (F) is the Fmax and (C) is the coverage as the fraction of predicted targets.

Implementation

The INGA web server is implemented using the REST (Representational State Transfer) architecture. The INGA services can be accessed either from the web interface or from a custom client. Submitted jobs can be retrieved at a later time by providing the session identifier or the URL of the result page. INGA guarantees to maintain job sessions for at least two weeks. Predictions are stored permanently in a database where entries are indexed by their sequence in order to speed up the service when requesting a cached protein.
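The idea of indexing stored predictions by sequence can be illustrated with a small in-memory sketch. The class name, the use of a SHA-256 digest as key and the normalization of the sequence are all assumptions for illustration; the text does not describe how the INGA database builds its index.

```python
import hashlib

class PredictionCache:
    """Toy cache that returns stored predictions for identical sequences."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(sequence):
        # ASSUMPTION: whitespace and letter case do not distinguish sequences.
        normalized = "".join(sequence.split()).upper()
        return hashlib.sha256(normalized.encode("ascii")).hexdigest()

    def get_or_compute(self, sequence, predict):
        """Return the cached result or run `predict` once and store it."""
        k = self.key(sequence)
        if k not in self._store:
            self._store[k] = predict(sequence)
        return self._store[k]
```

Keying on the sequence rather than on an accession means that the same protein submitted under different identifiers is computed only once.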

SERVER DESCRIPTION

Input

The INGA website is free and open to all users and there is no login requirement. The interface accepts either protein sequences (Sequence input tab) or BLAST and InterPro predictions (Prediction input tab). In the first case INGA outputs single or multiple predictions (up to 50 or 1000 sequences in slow and fast mode, respectively) from pasted or uploaded FASTA sequences or UniProtKB accessions (e.g. P04050). A checkbox group allows the user to choose which components to run, for example limiting the execution to the INGA Architectures component for a faster prediction. A single job (e.g. 10 sequences) lasts around 30 min in default mode and 15 min in fast mode, considering only the INGA Architectures component. The alternative Prediction input tab allows the user to provide intermediate files, namely the InterPro output, a BLAST search against UniProtKB and another BLAST search against the STRING sequence database. In this case input sequences are not necessary and INGA generates predictions in constant time independently of the input size.
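A client preparing a submission may want to check the input against the limits stated above before sending it. The sketch below parses pasted FASTA text and enforces the 50 (slow) and 1000 (fast) sequence limits; the function names and error messages are hypothetical and the code does not reproduce the validation performed by the server.

```python
MAX_SEQUENCES = {"slow": 50, "fast": 1000}  # limits quoted in the text

def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif not records:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]

def check_job(text, mode="slow"):
    """Return the parsed records, or raise if the job exceeds the limit."""
    records = parse_fasta(text)
    limit = MAX_SEQUENCES[mode]
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences exceed the {mode}-mode limit of {limit}")
    return records
```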

Output

The server provides a results page listing all submitted sequences. Once predictions are ready, the user can access single protein pages listing the predicted GO terms. Terms are split into three tables available in three different tabs corresponding to the different ontologies. For each GO term the score (probability) and annotation source (UniProtKB annotated entries) provided by the different methods are reported in the same row. Predicted terms are sorted by INGA score and then by specificity, i.e. terms more distant from the root are shown first. A left sidebar provides filters and widgets to explore the graph structure of the predicted terms. The specificity and INGA score can be filtered on the fly. Ancestors and children of a given term can be highlighted with a single click in order to visualize specific GO branches. The protein architecture (where available) is shown at the top of the table and a feature viewer can optionally be opened to visualize the sequence position of the detected signatures. Both predictions and predicted features are available for download in JSON and text formats.
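The ordering and filtering of the result table can be reproduced on downloaded predictions with a few lines of code. In this sketch specificity is taken as the minimum distance from the ontology root, in line with the ‘Average depth’ definition of Table 1; the function names and the parent-map representation of the ontology are assumptions.

```python
def min_depth(term, parents):
    """Minimum number of edges from `term` up to a root (term without parents)."""
    direct = parents.get(term, ())
    if not direct:
        return 0
    return 1 + min(min_depth(parent, parents) for parent in direct)

def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    found, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found

def sort_and_filter(scores, parents, min_score=0.0, min_specificity=0):
    """Rows (term, score, depth) sorted by score, then by depth, both descending."""
    rows = [(term, score, min_depth(term, parents)) for term, score in scores.items()]
    rows = [row for row in rows if row[1] >= min_score and row[2] >= min_specificity]
    return sorted(rows, key=lambda row: (-row[1], -row[2]))
```

For example, with a toy fragment of the MF ontology in which GO:0005515 (protein binding) is a child of GO:0005488 (binding), itself a child of the root GO:0003674, two equally scored predictions are listed with the more specific ‘protein binding’ first, and `ancestors` returns the branch to highlight.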

CONCLUSIONS

We have presented a new version of the INGA algorithm for the prediction of Gene Ontology terms from the protein sequence. The new version integrates ‘dark’ proteome information to improve prediction accuracy, in particular intrinsic disorder and transmembrane region detection. INGA ranked in the top ten for both CAFA2 and CAFA3. A new option allows fast prediction of entire genomes at the cost of partially losing accuracy. The web server was completely redesigned to provide a better interpretation of the function and visualization of the different predicted GO branches. We believe that improving the characterization and classification of ‘dark’ features will provide a better description of protein function and quality of predictors.

28 in total

Review 1. More than the sum of their parts: on the evolution of proteins from peptides.

Authors: Johannes Söding; Andrei N Lupas
Journal: Bioessays Date: 2003-09 Impact factor: 4.345

2. A combined transmembrane topology and signal peptide prediction method.

Authors: Lukas Käll; Anders Krogh; Erik L L Sonnhammer
Journal: J Mol Biol Date: 2004-05-14 Impact factor: 5.469

Review 3. Intrinsic disorder and functional proteomics.

Authors: Predrag Radivojac; Lilia M Iakoucheva; Christopher J Oldfield; Zoran Obradovic; Vladimir N Uversky; A Keith Dunker
Journal: Biophys J Date: 2006-12-08 Impact factor: 4.033

4. FFPred 2.0: improved homology-independent prediction of gene ontology terms for eukaryotic protein sequences.

Authors: Federico Minneci; Damiano Piovesan; Domenico Cozzetto; David T Jones
Journal: PLoS One Date: 2013-05-22 Impact factor: 3.240

5. BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences.

Authors: Damiano Piovesan; Pier Luigi Martelli; Piero Fariselli; Andrea Zauli; Ivan Rossi; Rita Casadio
Journal: Nucleic Acids Res Date: 2011-05-26 Impact factor: 16.971

6. InterProScan 5: genome-scale protein function classification.

Authors: Philip Jones; David Binns; Hsin-Yu Chang; Matthew Fraser; Weizhong Li; Craig McAnulla; Hamish McWilliam; John Maslen; Alex Mitchell; Gift Nuka; Sebastien Pesseat; Antony F Quinn; Amaia Sangrador-Vegas; Maxim Scheremetjew; Siew-Yit Yong; Rodrigo Lopez; Sarah Hunter
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

7. The GOA database: gene Ontology annotation updates for 2015.

Authors: Rachael P Huntley; Tony Sawford; Prudence Mutowo-Meullenet; Aleksandra Shypitsyna; Carlos Bonilla; Maria J Martin; Claire O'Donovan
Journal: Nucleic Acids Res Date: 2014-11-06 Impact factor: 19.160

8. The challenge of increasing Pfam coverage of the human proteome.

Authors: Jaina Mistry; Penny Coggill; Ruth Y Eberhardt; Antonio Deiana; Andrea Giansanti; Robert D Finn; Alex Bateman; Marco Punta
Journal: Database (Oxford) Date: 2013-04-19 Impact factor: 3.451

9. Information-theoretic evaluation of predicted ontological annotations.

Authors: Wyatt T Clark; Predrag Radivojac
Journal: Bioinformatics Date: 2013-07-01 Impact factor: 6.937

10. A large-scale evaluation of computational protein function prediction.

Authors: Predrag Radivojac; Wyatt T Clark; Tal Ronnen Oron; Alexandra M Schnoes; Tobias Wittkop; Artem Sokolov; Kiley Graim; Christopher Funk; Karin Verspoor; Asa Ben-Hur; Gaurav Pandey; Jeffrey M Yunes; Ameet S Talwalkar; Susanna Repo; Michael L Souza; Damiano Piovesan; Rita Casadio; Zheng Wang; Jianlin Cheng; Hai Fang; Julian Gough; Patrik Koskinen; Petri Törönen; Jussi Nokso-Koivisto; Liisa Holm; Domenico Cozzetto; Daniel W A Buchan; Kevin Bryson; David T Jones; Bhakti Limaye; Harshal Inamdar; Avik Datta; Sunitha K Manjari; Rajendra Joshi; Meghana Chitale; Daisuke Kihara; Andreas M Lisewski; Serkan Erdin; Eric Venner; Olivier Lichtarge; Robert Rentzsch; Haixuan Yang; Alfonso E Romero; Prajwal Bhat; Alberto Paccanaro; Tobias Hamp; Rebecca Kaßner; Stefan Seemayer; Esmeralda Vicedo; Christian Schaefer; Dominik Achten; Florian Auer; Ariane Boehm; Tatjana Braun; Maximilian Hecht; Mark Heron; Peter Hönigschmid; Thomas A Hopf; Stefanie Kaufmann; Michael Kiening; Denis Krompass; Cedric Landerer; Yannick Mahlich; Manfred Roos; Jari Björne; Tapio Salakoski; Andrew Wong; Hagit Shatkay; Fanny Gatzmann; Ingolf Sommer; Mark N Wass; Michael J E Sternberg; Nives Škunca; Fran Supek; Matko Bošnjak; Panče Panov; Sašo Džeroski; Tomislav Šmuc; Yiannis A I Kourmpetis; Aalt D J van Dijk; Cajo J F ter Braak; Yuanpeng Zhou; Qingtian Gong; Xinran Dong; Weidong Tian; Marco Falda; Paolo Fontana; Enrico Lavezzo; Barbara Di Camillo; Stefano Toppo; Liang Lan; Nemanja Djuric; Yuhong Guo; Slobodan Vucetic; Amos Bairoch; Michal Linial; Patricia C Babbitt; Steven E Brenner; Christine Orengo; Burkhard Rost; Sean D Mooney; Iddo Friedberg
Journal: Nat Methods Date: 2013-01-27 Impact factor: 28.547
