DOMAC: an accurate, hybrid protein domain prediction server.

Abstract

Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) that combines template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. The DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.html.

Substances:
Proteins

Year: 2007 PMID: 17553833 PMCID: PMC1933197 DOI: 10.1093/nar/gkm390

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein domains are structural, functional and evolutionary units of proteins. The prediction of domains from sequence information can improve tertiary structure prediction (1), enhance protein function annotation (2), aid structure determination (3) and guide protein engineering (4) and mutagenesis (5). A number of methods have been developed to identify domains starting from primary sequences. These methods can be roughly classified into four categories: template-based methods (6–10), ab initio (template-free) methods (11–22), the hybrid approach combining template-based and ab initio methods (23), and meta-domain prediction methods (24). Here we describe an accurate, hybrid domain prediction server (DOMAC) that integrates homology modeling, domain parsing and ab initio methods. The preliminary implementation of the server [under the name FOLDpro (25)] participated in the domain evaluation in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7) (26,27). It was ranked among the top domain prediction servers in CASP7.

IMPLEMENTATION

Our hybrid approach uses a template-based method to predict domains for proteins that have homologous template structures in the Protein Data Bank (PDB) (28), and an ab initio method based on neural networks (29) to predict domains for de novo proteins. It predicts protein domains in two steps. First, it uses PSI-BLAST (30) to search the target sequence against the NCBI Non-Redundant sequence database and construct a profile. The profile is then used to search a template structure library built from the proteins in PDB to identify templates, similar to the PDB-BLAST approach (31). Second, if significant templates are identified (e-value ≤ 0.001), it generates a structure model for the target with Modeller (32) based on the template structures. When multiple significant templates are available, they are combined to improve model quality. It then uses the domain parsing tool PDP (33) to parse the model into domains. If the parsed domains do not cover the whole target sequence, DOMAC assigns the uncovered regions to adjacent domains.

If no significant homologous template is found, DOMAC invokes the ab initio domain predictor DOMpro (29). DOMpro uses neural networks in conjunction with the sequence profile, predicted secondary structure and predicted relative solvent accessibility to predict domain boundaries. The secondary structure and relative solvent accessibility are predicted by SSpro (34) and ACCpro (35) in the SCRATCH suite (36). DOMpro tries to identify domain boundary positions based on the compositional bias of sequence and structural features in domain linker regions.

The preliminary implementation of DOMAC participated in CASP7 and was ranked first among 13 domain prediction servers. Since then, we have significantly sped up the template identification process without sacrificing accuracy and added a module that updates the template library weekly to incorporate newly released proteins in PDB.
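The branching logic of the two-step procedure can be illustrated with a minimal Python sketch. It is not the DOMAC code: the function names, the tuple layout for segments and, in particular, the rule that gives each uncovered stretch to the preceding domain (and a leading gap to the first domain) are assumptions made for illustration, since the text only states that uncovered regions go to adjacent domains. The calls to PSI-BLAST, Modeller, PDP and DOMpro are deliberately left out.

```python
from typing import List, Tuple

E_VALUE_CUTOFF = 1e-3  # significance threshold stated in the text

# (start, end, domain_id), 1-based and inclusive; hypothetical layout
Segment = Tuple[int, int, int]


def choose_method(template_evalues: List[float]) -> str:
    """Pick the template route if any template hit is significant."""
    if any(e <= E_VALUE_CUTOFF for e in template_evalues):
        return "template"
    return "ab_initio"


def fill_uncovered(segments: List[Segment], length: int) -> List[Segment]:
    """Extend non-overlapping parsed segments so residues 1..length are covered.

    Assumed rule (not from the paper): a leading gap joins the first
    segment; every other gap joins the segment that precedes it.
    """
    if not segments:
        # Nothing was parsed: treat the whole chain as one domain.
        return [(1, length, 1)]
    ordered = sorted(segments)
    filled = []
    for i, (start, _end, dom) in enumerate(ordered):
        if i == 0:
            start = 1
        # The segment now runs up to the residue before the next segment.
        if i + 1 < len(ordered):
            new_end = ordered[i + 1][0] - 1
        else:
            new_end = length
        filled.append((start, new_end, dom))
    return filled


if __name__ == "__main__":
    print(choose_method([0.5, 1e-5]))
    print(fill_uncovered([(5, 60, 1), (70, 150, 2)], 160))
```

Other gap-assignment rules (for example, splitting a gap between both neighbours) would fit the description equally well; the sketch only fixes one so that the output is deterministic.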

RESULTS

We first describe the performance of the preliminary implementation of DOMAC in CASP7 (under the server name FOLDpro). We compare it with 12 other server predictors in CASP7 using two evaluation metrics: the CASP evaluation metric (37) and domain number accuracy. The CASP metric (NDO: normalized domain overlap score) computes the overlap between true and predicted domains without explicitly checking the domain number or domain boundaries (37). It counts the correctly and wrongly overlapped residues between true domains and predicted domains and summarizes these counts into a single score. The best score for a target is 1 and the worst is 0. The domain number accuracy is defined as the percentage of targets with correct domain number predictions. Table 1 reports the performance of 13 servers on 95 targets in CASP7. The CASP score is the average domain overlap score across all predicted targets. The domain number accuracy is computed by comparing the domain number predictions with the official domain definitions released by CASP7. On both evaluation metrics, the preliminary implementation of DOMAC (FOLDpro) yielded the best performance.
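As an illustration of the second metric, the following Python sketch computes domain number accuracy over the targets a server actually predicted. The function name and the target identifiers are hypothetical, and the NDO score is not implemented here because its exact formula is given in reference (37) rather than in this text.

```python
from typing import Dict


def domain_number_accuracy(true_counts: Dict[str, int],
                           predicted_counts: Dict[str, int]) -> float:
    """Percentage of predicted targets whose domain count is correct.

    Only targets present in both mappings are scored, so a server that
    skipped some targets is judged on the ones it predicted.
    """
    scored = [t for t in predicted_counts if t in true_counts]
    if not scored:
        raise ValueError("no target has both a prediction and a reference")
    correct = sum(predicted_counts[t] == true_counts[t] for t in scored)
    return 100.0 * correct / len(scored)


if __name__ == "__main__":
    # Made-up example: three of four targets have the right domain count.
    reference = {"targetA": 1, "targetB": 2, "targetC": 1, "targetD": 3}
    predicted = {"targetA": 1, "targetB": 2, "targetC": 2, "targetD": 3}
    print(domain_number_accuracy(reference, predicted))
```

For scale, an accuracy of 93.7% on 95 targets corresponds to 89 correct domain-number predictions, since 89/95 ≈ 0.937.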

Table 1.

The performance of 13 domain prediction servers in CASP7

Method	Target Num	Domain Num Acc. (%)	CASP7 Score
FOLDpro (DOMAC)	95	93.7	0.963
Baker-RosettaDom (23)	94	86.2	0.940
Ma-OPUS-DOM	94	87.2	0.933
ROBETTA-GINZU (23)	94	84.0	0.932
DomSSEA (7)	94	78.7	0.910
HHpred3 (38)	95	75.8	0.910
Meta-DP (24)	95	74.7	0.907
HHpred1 (38)	93	75.3	0.902
DomFOLD	95	75.8	0.898
DPS (13)	93	75.3	0.889
Chop (22)	83	56.6	0.827
Distill (39)	95	70.5	0.819
NN_PUT-Lab	92	58.7	0.795

The second column (target num) lists the number of targets for which a predictor made predictions.

We also evaluate DOMAC on the three categories of CASP7 targets: highly homologous, homologous and analogous/ab initio. The domain number prediction accuracy of DOMAC is 96%, 94% and 88% in the three categories, respectively. However, because the majority (68 out of 95) of CASP7 targets are single-domain proteins, the domain prediction accuracy is very likely overestimated. Thus, we evaluate DOMAC on a larger, balanced, high-quality dataset manually curated by Holland et al. (2). The publicly released version of Holland's benchmark2 dataset has 156 proteins: 54 single-domain proteins, 69 two-domain proteins, 25 three-domain proteins, 4 four-domain proteins, 3 five-domain proteins and 1 six-domain protein. We evaluate the template-based and ab initio methods separately on the whole dataset. Table 2 reports the specificity and sensitivity of each method in each category in terms of domain numbers. The overall domain number prediction accuracy of the template-based and ab initio methods is 75% and 46%, respectively.

Table 2.

The specificity and sensitivity of domain number prediction on the Holland's dataset using the template-based and ab initio methods

Method	Acc. (%)	1-dom	2-dom	3-dom	4-dom	5-dom	6-dom
Template	Sens.	96.1	66.7	56.0	75.0	66.7	–
	Spec.	74.2	88.0	70.0	42.9	33.3	–
Ab initio	Sens.	88.5	31.3	12.0	–	–	–
	Spec.	46.5	48.8	30.0	–	–	–
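The per-category figures in Table 2 can be read with the help of a short Python sketch. The text does not spell out the formulas, so the definitions used here are an assumption: for the k-domain category, sensitivity is the fraction of true k-domain proteins predicted as k-domain, and specificity is the fraction of proteins predicted as k-domain that truly are k-domain (what is often called precision). The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def per_class_sens_spec(
    true_nums: List[int], pred_nums: List[int]
) -> Dict[int, Tuple[Optional[float], Optional[float]]]:
    """Return {domain_number: (sensitivity %, specificity %)}.

    A value is None when its denominator is zero, mirroring the dashes
    used for empty cells in a results table.
    """
    if len(true_nums) != len(pred_nums):
        raise ValueError("inputs must have the same length")
    true_total = Counter(true_nums)
    pred_total = Counter(pred_nums)
    hits = Counter(t for t, p in zip(true_nums, pred_nums) if t == p)

    result = {}
    for k in sorted(set(true_total) | set(pred_total)):
        sens = 100.0 * hits[k] / true_total[k] if true_total[k] else None
        spec = 100.0 * hits[k] / pred_total[k] if pred_total[k] else None
        result[k] = (sens, spec)
    return result


if __name__ == "__main__":
    # Toy data: five proteins with their true and predicted domain numbers.
    print(per_class_sens_spec([1, 1, 2, 2, 3], [1, 2, 2, 2, 2]))
```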

Moreover, we assess the accuracy of domain boundary prediction, which is important for generating hypotheses for crystallizing individual protein domains. Following the established convention (7,22), a predicted boundary within 20 residues of a true domain boundary is considered correct. The domain boundary specificity and sensitivity are 50% and 76.5% for the template-based method, and 27% and 14% for the ab initio method. Thus, the accuracy of the template-based method is sufficient for guiding crystallization experiments, whereas the ab initio method is not always reliable enough for general, practical use.
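The 20-residue criterion can be sketched in Python as follows. Several details are assumptions rather than statements from the text: "within 20 residues" is read as an absolute difference of at most 20, sensitivity is the share of true boundaries that have a predicted boundary within the tolerance, specificity is the share of predicted boundaries that lie within the tolerance of a true one, and one boundary may match several counterparts. The function name and the positions are made up.

```python
from typing import List, Optional, Tuple


def boundary_sens_spec(
    true_bounds: List[int], pred_bounds: List[int], tolerance: int = 20
) -> Tuple[Optional[float], Optional[float]]:
    """Return (sensitivity %, specificity %) for boundary positions."""

    def near(pos: int, others: List[int]) -> bool:
        # True if any position in `others` is within the tolerance of `pos`.
        return any(abs(pos - o) <= tolerance for o in others)

    found = sum(near(t, pred_bounds) for t in true_bounds)
    valid = sum(near(p, true_bounds) for p in pred_bounds)
    sens = 100.0 * found / len(true_bounds) if true_bounds else None
    spec = 100.0 * valid / len(pred_bounds) if pred_bounds else None
    return sens, spec


if __name__ == "__main__":
    # Two true boundaries, three predicted ones; only 110 is close to 100.
    print(boundary_sens_spec([100, 250], [110, 180, 400]))
```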

USE OF WEB SERVICE

DOMAC is used through a simple, intuitive input form. Since the reliability assessment of domain predictions is still an open issue, the user is advised to use the accuracy on Holland's dataset to decide how to use these predictions. The input form requires only three inputs: email address, target name and protein sequence. DOMAC can usually make predictions within 15 min and sends the results back to users by email. Domain prediction results include the user-defined target name, the protein sequence, the predicted domain number, the start and end positions of each domain and the method (template-based or ab initio). For template-based predictions, it also reports the PDB codes of the templates. Figure 1 shows an output example for the CASP7 target T0324.

Figure 1.

Domain prediction result of CASP7 target T0324. The protein is predicted to have two domains. Domain 1 has two non-continuous segments, spanning from residues 1 to 16 and residues 82 to 208, respectively. Domain 2 spans from residues 17 to 81. The templates used to make the domain prediction are identified by PDB code + chain id. The chain in a single-chain protein is always assigned chain id ‘A’ instead of ‘-’.
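Because a domain may consist of non-continuous segments, a prediction like the one in Figure 1 is naturally held as a mapping from domain number to a list of (start, end) ranges. The Python sketch below uses the residue ranges from the caption; the data structure and function names are hypothetical and do not reflect DOMAC's actual output format, and the chain length of 208 is an assumption taken from the last residue named in the caption.

```python
from typing import Dict, List, Tuple

# Residue ranges from the Figure 1 caption (1-based, inclusive).
T0324_DOMAINS: Dict[int, List[Tuple[int, int]]] = {
    1: [(1, 16), (82, 208)],
    2: [(17, 81)],
}


def domain_lengths(domains: Dict[int, List[Tuple[int, int]]]) -> Dict[int, int]:
    """Number of residues per domain, summed over its segments."""
    return {d: sum(e - s + 1 for s, e in segs) for d, segs in domains.items()}


def covers_exactly_once(domains: Dict[int, List[Tuple[int, int]]],
                        length: int) -> bool:
    """Check that every residue 1..length belongs to exactly one segment."""
    counts = [0] * (length + 1)  # index 0 is unused
    for segs in domains.values():
        for start, end in segs:
            if start < 1 or end > length or start > end:
                return False
            for residue in range(start, end + 1):
                counts[residue] += 1
    return all(c == 1 for c in counts[1:])


if __name__ == "__main__":
    print(domain_lengths(T0324_DOMAINS))
    print(covers_exactly_once(T0324_DOMAINS, 208))
```

Under these assumptions, domain 1 has 16 + 127 = 143 residues and domain 2 has 65, which together account for all 208 positions.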

CONCLUSION AND FUTURE WORK

We have developed a hybrid domain prediction web service integrating template-based and ab initio methods. The template-based method is accurate enough for guiding protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. However, the ab initio method still needs to be improved for practical use. Protein domain architecture is largely shaped by gene recombination events such as gene fusion, fission, domain swapping and exon exchange. Leveraging the evolutionary gene recombination signals embedded in the multiple sequence alignment of a protein family, together with the exon boundaries (or splicing sites) in its gene structure, may therefore help improve ab initio domain prediction significantly.

38 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Comparison of sequence profiles. Strategies for structural predictions using sequence information.

Authors: L Rychlewski; L Jaroszewski; W Li; A Godzik
Journal: Protein Sci Date: 2000-02 Impact factor: 6.725

3. Identification of cell-binding sites on the Laminin alpha 5 N-terminal domain by site-directed mutagenesis.

Authors: P K Nielsen; Y Yamada
Journal: J Biol Chem Date: 2000-11-29 Impact factor: 5.157

4. Domain size distributions can predict domain boundaries.

Authors: S J Wheelan; A Marchler-Bauer; S H Bryant
Journal: Bioinformatics Date: 2000-07 Impact factor: 6.937

5. Prediction of coordination number and relative solvent accessibility in proteins.

Authors: Gianluca Pollastri; Pierre Baldi; Pietro Fariselli; Rita Casadio
Journal: Proteins Date: 2002-05-01

6. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles.

Authors: Gianluca Pollastri; Darisz Przybylski; Burkhard Rost; Pierre Baldi
Journal: Proteins Date: 2002-05-01

7. SnapDRAGON: a method to delineate protein structural domains from sequence data.

Authors: Richard A George; Jaap Heringa
Journal: J Mol Biol Date: 2002-02-22 Impact factor: 5.469

8. Rapid protein domain assignment from amino acid sequence using predicted secondary structure.

Authors: Russell L Marsden; Liam J McGuffin; David T Jones
Journal: Protein Sci Date: 2002-12 Impact factor: 6.725

9. Protein domain identification and improved sequence similarity searching using PSI-BLAST.

Authors: Richard A George; Jaap Heringa
Journal: Proteins Date: 2002-09-01

10. Protein structure prediction servers at University College London.

Authors: Kevin Bryson; Liam J McGuffin; Russell L Marsden; Jonathan J Ward; Jaspreet S Sodhi; David T Jones
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

21 in total

1. Structure prediction of domain insertion proteins from structures of individual domains.

Authors: Monica Berrondo; Marc Ostermeier; Jeffrey J Gray
Journal: Structure Date: 2008-04 Impact factor: 5.006

2. FUpred: detecting protein domains through deep-learning-based contact map prediction.

Authors: Wei Zheng; Xiaogen Zhou; Qiqige Wuyun; Robin Pearce; Yang Li; Yang Zhang
Journal: Bioinformatics Date: 2020-06-01 Impact factor: 6.937

3. Molecular cloning, tissue distribution and bioinformatics analyses of the rabbit BK channel beta1 subunit gene.

Authors: Xiao-Yong Zhang; Sha Wang; Zhen Yan; Yi Wan; Wei Wang; Guang-Bin Cui; Pang Du; Ke-Jun Ma; Wei Han; Ying-Qi Zhang; Jing-Guo Wei
Journal: Mol Biol Rep Date: 2007-09-14 Impact factor: 2.316

4. Structure of BT_3984, a member of the SusD/RagB family of nutrient-binding molecules.

Authors: Constantina Bakolitsa; Qingping Xu; Christopher L Rife; Polat Abdubek; Tamara Astakhova; Herbert L Axelrod; Dennis Carlton; Connie Chen; Hsiu Ju Chiu; Thomas Clayton; Debanu Das; Marc C Deller; Lian Duan; Kyle Ellrott; Carol L Farr; Julie Feuerhelm; Joanna C Grant; Anna Grzechnik; Gye Won Han; Lukasz Jaroszewski; Kevin K Jin; Heath E Klock; Mark W Knuth; Piotr Kozbial; S Sri Krishna; Abhinav Kumar; Winnie W Lam; David Marciano; Daniel McMullan; Mitchell D Miller; Andrew T Morse; Edward Nigoghossian; Amanda Nopakun; Linda Okach; Christina Puckett; Ron Reyes; Henry J Tien; Christine B Trame; Henry van den Bedem; Dana Weekes; Keith O Hodgson; John Wooley; Marc André Elsliger; Ashley M Deacon; Adam Godzik; Scott A Lesley; Ian A Wilson
Journal: Acta Crystallogr Sect F Struct Biol Cryst Commun Date: 2010-09-22

5. OPUS-Dom: applying the folding-based method VECFOLD to determine protein domain boundaries.

Authors: Yinghao Wu; Athanasios D Dousis; Mingzhi Chen; Jialin Li; Jianpeng Ma
Journal: J Mol Biol Date: 2008-11-10 Impact factor: 5.469

6. Ab initio and homology based prediction of protein domains by recursive neural networks.

Authors: Ian Walsh; Alberto J M Martin; Catherine Mooney; Enrico Rubagotti; Alessandro Vullo; Gianluca Pollastri
Journal: BMC Bioinformatics Date: 2009-06-26 Impact factor: 3.169

7. SoyDB: a knowledge database of soybean transcription factors.

Authors: Zheng Wang; Marc Libault; Trupti Joshi; Babu Valliyodan; Henry T Nguyen; Dong Xu; Gary Stacey; Jianlin Cheng
Journal: BMC Plant Biol Date: 2010-01-18 Impact factor: 4.215

8. Structure prediction, molecular dynamics simulation and docking studies of D-specific dehalogenase from Rhizobium sp. RC1.

Authors: Ismaila Yada Sudi; Ee Lin Wong; Kwee Hong Joyce-Tan; Mohd Shahir Shamsir; Haryati Jamaluddin; Fahrul Huyop
Journal: Int J Mol Sci Date: 2012-11-26 Impact factor: 5.923

9. Comparative domain modeling of human EGF-like module EMR2 and study of interaction of the fourth domain of EGF with chondroitin 4-sulphate.

Authors: Mukta Rani; Manas R Dikhit; Ganesh C Sahoo; Pradeep Das
Journal: J Biomed Res Date: 2011-03

10. DomHR: accurately identifying domain boundaries in proteins using a hinge region strategy.

Authors: Xiao-yan Zhang; Long-jian Lu; Qi Song; Qian-qian Yang; Da-peng Li; Jiang-ming Sun; Tong-hua Li; Pei-sheng Cong
Journal: PLoS One Date: 2013-04-11 Impact factor: 3.240