CombFunc: predicting protein function using heterogeneous data sources.

Mark N Wass, Geraint Barton, Michael J E Sternberg.

Abstract

Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data, from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein-protein interaction data. In benchmarking on a set of 1686 proteins, CombFunc obtains precision and recall of 0.71 and 0.64 respectively for GO molecular function terms. For biological process GO terms, precision of 0.74 and recall of 0.41 are obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.

Year: 2012 PMID: 22641853 PMCID: PMC3394346 DOI: 10.1093/nar/gks489

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein function prediction is essential to provide insight into the functions of uncharacterized proteins. This is highlighted by the gap between the large number of proteins that have been identified and the small percentage of them that have been functionally characterized (1). Annotation transfer using BLAST (2) represents a standard and widely used method of function prediction, but as protein function is often only conserved by homologues sharing a high sequence identity, this approach can be prone to errors (3). In recent years many methods have been developed to improve upon BLAST-based annotation transfer. These include methods such as GOtcha (4) and PFP/ESG (5,6), which combine the Gene Ontology (GO) (7) annotations present in multiple homologues and use their e-values to weight predictions or use machine learning to optimize predictions (8). Phylogenomics approaches distinguish between orthologues and paralogues to infer function (9). The presence of domains from Interpro (10) or Pfam (11) is used for electronic annotation in GO annotations (12) and combinations of domains have also been used for function prediction (13). In ConFunc we used conserved residues representative of individual GO terms to predict protein function (14). Other methods have used protein–protein interaction networks (15,16), gene co-expression (17,18) or multiple protein features including protein disorder and secondary structure (19). Some methods combine predictions from multiple sources of data (20–26). These include methods that use Bayesian approaches (24) or Support Vector Machines (SVMs) (23,25) to combine predictions. Some of these methods are available as web servers. The ProKnow (21) web server combines the evidence from multiple sources to make overall predictions of GO functions.
In contrast, the ProFunc (20) and PredUS (22) servers do not make overall predictions of protein function; instead they enable the user to explore the results of the many sequence and structural analyses that they perform. Further details of these methods and others are available in recent reviews (1,27). Here we present CombFunc, a server for GO-based protein function prediction. CombFunc incorporates ConFunc (14), our existing sequence-based function prediction method, and it also extends our recent use of multiple methods to predict the functions of proteins in the Plasmodium berghei male gamete (28). CombFunc uses sequence information including BLAST/PSI-BLAST (29) annotation transfer, domain information from Interpro, protein–protein interaction data from IntAct (30) and MiNT (31) and gene expression data from COXPRESdb (32).

MATERIALS AND METHODS

The CombFunc algorithm

CombFunc obtains information from multiple analyses, which are then combined using an SVM (33) to make an overall prediction. The data sources used are described below. The sequence-based sources of input to CombFunc are: ConFunc, BLAST/PSI-BLAST annotation transfer, domain information and a sequence search against the fold library of Phyre2 (34), our in-house protein structure prediction server. ConFunc is run as previously described in Wass and Sternberg (14). Both BLAST and PSI-BLAST are used to search for GO annotated homologues of the query sequence in UniProt (35). Where PSI-BLAST is used, UniRef50 is initially searched and the profile generated is used to search the full UniProt database, as this approach has been shown to improve the identification of homologues (36). Domain information is obtained using Interpro (10) and Pfam domain combinations are also used to make predictions as described in (13). HHsearch (37) is used to search the fold library of Phyre2 to identify structures homologous to the query sequence, whose annotations are input to the SVM. All methods use only experimentally determined GO annotations. The non-sequence-based data sources are protein–protein interactions (PPI) and gene co-expression. PPI data are obtained from both IntAct (30) and MiNT (31). Function prediction is performed by simple neighbour counting (38) and indirect neighbours are also included (15). Gene expression data are obtained from the COXPRESdb database (32), which contains expression data for Human, Mouse, Rat, Chicken, Zebrafish, Fly and Nematode. COXPRESdb uses a mutual rank score to determine the strength of co-expression, which is calculated as the geometric mean of the correlation rank of gene A to gene B and of gene B to gene A. The frequency of GO terms within the set of co-expressed genes with a mutual rank less than 50 (39) is input into the SVM. CombFunc uses each of the individual methods to identify GO terms that may be associated with the query.
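The mutual rank calculation and the GO term frequency described above can be illustrated with a minimal Python sketch. It is not the CombFunc or COXPRESdb implementation: the function names, the input layout (a list of partner genes with their two directional ranks) and the normalization of the frequency by the number of selected partners are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def mutual_rank(rank_a_to_b, rank_b_to_a):
    """Geometric mean of the two directional correlation ranks."""
    return sqrt(rank_a_to_b * rank_b_to_a)


def go_term_frequencies(partners, annotations, max_mr=50.0):
    """Fraction of strongly co-expressed partners annotated with each GO term.

    partners: list of (gene_id, rank_a_to_b, rank_b_to_a) tuples (hypothetical layout).
    annotations: dict mapping gene_id to a set of GO term identifiers.
    max_mr: partners are kept only if their mutual rank is below this value.
    """
    selected = [gene for gene, r_ab, r_ba in partners
                if mutual_rank(r_ab, r_ba) < max_mr]
    if not selected:
        return {}
    counts = Counter(term for gene in selected
                     for term in annotations.get(gene, ()))
    # Assumption: frequency is normalized by the number of selected partners.
    return {term: n / len(selected) for term, n in counts.items()}


partners = [("g1", 4, 9), ("g2", 100, 100), ("g3", 1, 16)]
annotations = {
    "g1": {"GO:0005515"},
    "g2": {"GO:0016301"},
    "g3": {"GO:0005515", "GO:0003824"},
}
frequencies = go_term_frequencies(partners, annotations)
```

In this toy input, g2 has a mutual rank of 100 and is excluded, so only the annotations of g1 and g3 contribute to the frequencies.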
Features associated with the GO terms identified by individual methods are used by CombFunc to make a final prediction of the query function. The features used for each method are listed in Supplementary Table S1 and are described below. For BLAST and PSI-BLAST the top annotated hit is identified and the GO terms it is annotated with are used for prediction. The features from BLAST and PSI-BLAST include the e-value of the top annotated hit, the sequence identity between the query and top annotated hit and also the sequence coverage of the query by the top hit. Additionally, for PSI-BLAST data the annotations of multiple sequences are considered by calculating the i-score as used in GOtcha (4). For terms identified by the interactome analysis, the features correspond to the fraction of direct and also indirect neighbours that are annotated with that term. For terms present in the Interpro analysis, the feature corresponds to the lowest e-value of a domain hit annotated with that term (maximum of 1). For the Pfam domain combinations analysis, the feature is 1 if predicted by the method and 0 otherwise. Features from the Phyre2 fold library use terms present in the top annotated hit and use the probability score from HHsearch (37) between the query and the hit and also the sequence coverage of the query by the hit. GO terms identified from expression data are described by several features: the fraction of co-expressed proteins annotated with the function and the minimum, average and maximum mutual rank and correlation coefficients of the co-expressed proteins. Finally, a feature is included for each of the individual level 1 GO terms (i.e. binding and catalytic function in molecular function). These features are set to 1 if they are a parent term of the term being considered and 0 otherwise. CombFunc uses three classifiers for the molecular function and biological process categories.
As the features associated with GO terms are likely to vary depending on their location in the GO graph, the three classifiers are used for different levels of GO. One classifier considers only terms one level below the root (e.g. catalytic activity or binding in the molecular function category), the second considers terms in the next two levels, while the third classifier considers all more specific terms. The scores output from the SVMs are converted to probabilities as described in Platt (40). The classification process is repeated 10 times, using the 10 sets of optimized SVMs generated during cross-validation. GO terms are predicted to be a function of the query protein if they are predicted to be so by at least 5 of the 10 sets of SVMs with a probability score set as an average of the probability scores for the SVMs that predicted the function.
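The conversion of SVM scores to probabilities and the voting over the 10 sets of SVMs can be sketched as follows. This is a minimal illustration rather than the server's code. The sigmoid form follows Platt's method, but the parameters `a` and `b` would be fitted on held-out data, and the values used here are placeholders. The rule for when a single SVM counts as having "predicted" a term (`call_threshold`) is not given in the text and is an assumption.

```python
from math import exp


def platt_probability(score, a, b):
    """Platt scaling: map a raw SVM score to a probability via a fitted sigmoid."""
    return 1.0 / (1.0 + exp(a * score + b))


def consensus_prediction(probabilities, min_votes=5, call_threshold=0.5):
    """Combine the outputs of the 10 SVM sets for one GO term.

    probabilities: one probability per SVM set.
    call_threshold: assumed cut-off above which a single SVM set is taken
        to have predicted the term (not specified in the text).
    Returns the mean probability of the SVM sets that predicted the term,
    or None if fewer than min_votes did so.
    """
    voters = [p for p in probabilities if p >= call_threshold]
    if len(voters) < min_votes:
        return None
    return sum(voters) / len(voters)


supported = consensus_prediction([0.9, 0.8, 0.7, 0.6, 0.6, 0.2, 0.1, 0.3, 0.4, 0.1])
rejected = consensus_prediction([0.9, 0.8, 0.7, 0.6, 0.4, 0.2, 0.1, 0.3, 0.4, 0.1])
```

Here `supported` is the average over the five SVM sets that predicted the term, while `rejected` is `None` because only four sets did so.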

Generating a test set

A test set of proteins with experimental GO annotations in both the molecular function and biological process GO categories was extracted using the UniProt-GOA annotations from December 2011. This was reduced to a representative set with less than or equal to 25% sequence identity using CD-HIT (41). Of the resulting 6686 sequences, 5000 were used for cross-validation and the remaining 1686 for final testing of the server.

SVM training

The SVMs were generated using SVMlight (33). A linear kernel was used for classification. In each round of 10-fold cross-validation, eight folds were used for training, a further fold was used for optimization and the SVM was tested on the remaining fold. In cross-validation each SVM was optimized for the trade-off between training error and margin. As the training data are unbalanced, with many more negative examples than positive ones, we also assessed the effect of the cost factor, which determines by how much training errors on positive examples should outweigh those on negative examples (see Supplementary Material section).
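The division of the 10 folds into training, optimization and test roles can be sketched as below. The text does not state which fold served for optimization in each round, so the rotation used here (the fold after the test fold) is an assumption, and `fold_roles` is a hypothetical helper name.

```python
def fold_roles(n_folds=10):
    """Yield (training folds, optimization fold, test fold) for each round.

    Assumption: the optimization fold is the one following the test fold,
    wrapping around; the remaining eight folds are used for training.
    """
    for test_fold in range(n_folds):
        optimization_fold = (test_fold + 1) % n_folds
        training_folds = [f for f in range(n_folds)
                          if f not in (test_fold, optimization_fold)]
        yield training_folds, optimization_fold, test_fold


rounds = list(fold_roles())
for training_folds, optimization_fold, test_fold in rounds:
    print(f"test={test_fold} optimize={optimization_fold} train={training_folds}")
```

Each round produces one optimized SVM, giving the 10 sets of SVMs that are later applied to a query.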

EVALUATING COMBFUNC PERFORMANCE

Here we assess the performance of CombFunc using the set of sequences that were not used in cross-validation. The performance of CombFunc on this set of 1686 sequences was assessed using precision and recall calculated as described in Wass and Sternberg (14). The precision-recall graphs in Figure 1 show the performance of CombFunc at a range of thresholds and a comparison with the performance of BLAST annotation transfer. For CombFunc the performance is assessed at confidence thresholds in the range 0–1. We observe that at high confidence (>0.95) CombFunc obtains high precision (0.96) and low recall (0.21) for molecular function terms. As the threshold is reduced, the recall increases while precision decreases; when low confidence predictions are included, CombFunc obtains precision and recall of 0.71 and 0.64 respectively (Figure 1A). CombFunc does not perform as well on biological process terms, with both lower recall and precision at equivalent confidence scores. For these terms, a confidence threshold of 0.3 gives precision of 0.74 and recall of 0.41.
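How precision and recall change with the confidence threshold can be illustrated with a short sketch. The exact calculation follows Wass and Sternberg (14) and is not reproduced in this text, so the version below is the standard definition pooled over all proteins, which is an assumption. The data layout and the toy values are invented for illustration.

```python
def precision_recall(predictions, truth, threshold):
    """Precision and recall of GO term predictions at one confidence threshold.

    predictions: dict mapping protein to {GO term: probability}.
    truth: dict mapping protein to its set of experimentally annotated GO terms.
    Counts are pooled over all proteins (an assumed, standard definition).
    """
    tp = fp = fn = 0
    for protein, true_terms in truth.items():
        called = {term for term, p in predictions.get(protein, {}).items()
                  if p >= threshold}
        tp += len(called & true_terms)
        fp += len(called - true_terms)
        fn += len(true_terms - called)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"C": 0.8}}
truth = {"P1": {"A", "B"}, "P2": {"D"}}
strict = precision_recall(predictions, truth, 0.5)
relaxed = precision_recall(predictions, truth, 0.3)
```

Lowering the threshold from 0.5 to 0.3 admits the correct low-confidence term B, so recall rises in this toy case, mirroring the trade-off explored in the benchmark.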

Figure 1.

Benchmarking CombFunc. Precision-recall graphs showing the performance of CombFunc on 1686 sequences not used in cross-validation, for (A) the GO molecular function and (B) biological process categories. CombFunc results are shown in blue, ConFunc in black and BLAST in red.

For comparison, the performance of BLAST and ConFunc on the same dataset was considered. For BLAST (Version 2.219) annotation transfer the UniProt database (version December 2011) was searched and the annotation of the top (lowest e-value) experimentally annotated hit transferred to the query sequence. A range of precision and recall scores is obtained by only transferring the annotation if the top hit has an e-value below a threshold, which was varied from 0 to 1e−03. For ConFunc, precision-recall values were obtained using a threshold for the ratio score (range 0–1). For benchmarking of all three methods, sequences with >99% sequence identity were excluded for the sequence-based prediction components to ensure that the query sequence was not used to make predictions for itself. We observe that CombFunc performs better than both BLAST and ConFunc. For ConFunc predictions there is a large reduction in precision as the prediction threshold is reduced. ConFunc considers all of the annotations that are present in the homologues of the query identified by BLAST. This often includes the annotation of the query sequence but additionally includes many other functions that are not annotations of the query sequence. At low thresholds many false positive predictions are made. In contrast, through the use of multiple data sources and machine learning, CombFunc does not have such a large reduction in precision at lower thresholds, particularly when predicting molecular function terms (Figure 1).
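The BLAST baseline described above can be sketched as a top-hit annotation transfer with an e-value cut-off. This is an illustrative sketch, not the benchmark code: the hit tuples, accession names and the inclusive comparison with the threshold are assumptions, and parsing of real BLAST output is omitted.

```python
def transfer_top_hit(hits, annotations, max_evalue=1e-3, max_identity=99.0):
    """Transfer GO terms from the lowest e-value experimentally annotated hit.

    hits: list of (accession, e-value, percent identity) tuples (hypothetical layout).
    annotations: dict mapping accession to a set of experimental GO terms.
    Hits above max_identity are skipped so that the query cannot annotate itself.
    Assumption: a hit exactly at the threshold is accepted.
    """
    candidates = [(evalue, accession) for accession, evalue, identity in hits
                  if identity <= max_identity and annotations.get(accession)]
    if not candidates:
        return set()
    evalue, accession = min(candidates)
    return set(annotations[accession]) if evalue <= max_evalue else set()


hits = [("Q1", 0.0, 100.0), ("Q2", 1e-30, 45.0), ("Q3", 1e-50, 60.0), ("Q4", 1e-80, 70.0)]
annotations = {"Q1": {"GO:0000001"}, "Q2": {"GO:0005515"}, "Q3": {"GO:0003824"}}
default_call = transfer_top_hit(hits, annotations)
strict_call = transfer_top_hit(hits, annotations, max_evalue=1e-60)
```

In this toy input Q1 is excluded as a self-hit and Q4 has no experimental annotation, so the terms of Q3 are transferred; tightening the threshold to 1e-60 leaves the query unannotated, which is how a range of precision and recall values arises.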

THE COMBFUNC WEB SERVER

CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc. Users are required to submit a protein sequence in FASTA format and may also input the UniProt accession of the query sequence. The UniProt accession is required to perform the PPI and co-expression analyses. Processing time for each submission can vary from 20 min to a few hours, largely because of the time taken to search the Phyre2 fold library.

Results output

CombFunc results output is split into two main sections. The prediction section provides details of the functions predicted by the SVM. In the second section, details of the data generated from each of the individual analyses are provided, which users can explore to obtain further details of the data used to make the prediction. The prediction section displays separate results for molecular function and biological process predictions. For both of these GO categories, a table of the predictions lists the term, its name and the probability score of the prediction, which has a range of 0–1, with 1 being the highest confidence (Figure 2). The probability scores are colour coded to indicate the confidence of the predictions, ranging from yellow for low probability predictions to red for high probability. Longer descriptions of the predicted functions are displayed adjacent to the table when the mouse is moved over the rows of the table. Additionally, links to the GO terms on the GO website are provided, enabling the user to access further external information about the predicted GO terms.

Figure 2.

Display of a CombFunc prediction. CombFunc predictions are displayed in a table showing the confidence of the prediction and in an image and list placing them in the context of GO structure.

The predictions are visualized within the GO graph in an image that displays a subgraph of GO containing all of the predicted terms and their parent terms (Figure 2). Again, predicted terms are colour coded to indicate the confidence of their prediction. The image has a zoom function that enables users to zoom into different areas of the graph to investigate the predictions, which is particularly useful when multiple terms are predicted and the subgraph becomes large. Additionally, the predictions are displayed as an expandable list, which enables similar investigation of the predicted terms. The second section of the results page contains the output from each of the individual analyses performed. The data associated with each analysis are initially hidden so that the user can view only the analyses they wish to. For each analysis a table lists the GO terms identified by the method and the values or scores associated with those terms. Interpro results are additionally displayed graphically, enabling the user to identify the location of the hits on the query sequence. For all analyses the same colour coding as for the main predictions is used to give an indication of how ‘good’ the different scores displayed are. This includes colour coding sequence identity and e-values of BLAST hits and mutual rank values for gene co-expression. Where relevant, links to external data on the GO, UniProt and Interpro websites are provided. For each submission to CombFunc a submission is also made to 3DLigandSite (42,43), our in-house ligand binding site prediction server. This enables users to combine the function prediction results with the binding site prediction of 3DLigandSite. A link to the 3DLigandSite results is provided at the end of the analysis section.

CONCLUSION

CombFunc was developed to utilize the multiple data sources that are available for protein function prediction. In benchmarking CombFunc obtains good performance with 0.71 and 0.64 precision and recall respectively for molecular function GO terms and precision of 0.74 and recall of 0.41 for biological process terms. The CombFunc server provides a resource for users to view predicted functions in both tabular and graphical formats, access to the raw data from each individual method and access to external resources to enable users to explore the functions and data used to make predictions.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table 1 and Supplementary Methods.

FUNDING

Biotechnology and Biological Sciences Research Council [BB/F020481/1 to M.N.W.]. Funding for open access charge: Imperial College London Library. Conflict of interest statement. M.J.E.S. is a founder director of Equinox Pharma Ltd, holds shares in the company, and has obtained remuneration from the company. Equinox Pharma Ltd is exploiting computational methods for drug discovery and markets software.

42 in total

1. Inference of protein function from protein structure.

Authors: Debnath Pal; David Eisenberg
Journal: Structure Date: 2005-01 Impact factor: 5.006

2. Protein structure prediction on the Web: a case study using the Phyre server.

Authors: Lawrence A Kelley; Michael J E Sternberg
Journal: Nat Protoc Date: 2009 Impact factor: 13.491

3. 3DLigandSite: predicting ligand-binding sites using similar structures.

Authors: Mark N Wass; Lawrence A Kelley; Michael J E Sternberg
Journal: Nucleic Acids Res Date: 2010-05-31 Impact factor: 16.971

4. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

5. Proteomic analysis of Plasmodium in the mosquito: progress and pitfalls.

Authors: M N Wass; R Stanway; A M Blagborough; K Lal; J H Prieto; D Raine; M J E Sternberg; A M Talman; F Tomley; J Yates; R E Sinden
Journal: Parasitology Date: 2012-02-16 Impact factor: 3.234

6. Ongoing and future developments at the Universal Protein Resource.

Authors:
Journal: Nucleic Acids Res Date: 2010-11-04 Impact factor: 16.971

7. The Pfam protein families database.

Authors: Marco Punta; Penny C Coggill; Ruth Y Eberhardt; Jaina Mistry; John Tate; Chris Boursnell; Ningze Pang; Kristoffer Forslund; Goran Ceric; Jody Clements; Andreas Heger; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2011-11-29 Impact factor: 16.971

8. Inferring function using patterns of native disorder in proteins.

Authors: Anna Lobley; Mark B Swindells; Christine A Orengo; David T Jones
Journal: PLoS Comput Biol Date: 2007-07-03 Impact factor: 4.475

9. COXPRESdb: a database of coexpressed gene networks in mammals.

Authors: Takeshi Obayashi; Shinpei Hayashi; Masayuki Shibaoka; Motoshi Saeki; Hiroyuki Ohta; Kengo Kinoshita
Journal: Nucleic Acids Res Date: 2007-10-11 Impact factor: 16.971

10. Probabilistic protein function prediction from heterogeneous genome-wide data.

Authors: Naoki Nariai; Eric D Kolaczyk; Simon Kasif
Journal: PLoS One Date: 2007-03-28 Impact factor: 3.240
