iDBPs: a web server for the identification of DNA binding proteins.

Guy Nimrod, Maya Schushan, András Szilágyi, Christina Leslie, Nir Ben-Tal.

Abstract

SUMMARY: The iDBPs server uses the three-dimensional (3D) structure of a query protein to predict whether it binds DNA. First, the algorithm predicts the functional region of the protein based on its evolutionary profile; the assumption is that large clusters of conserved residues are good markers of functional regions. Next, various characteristics of the predicted functional region as well as global features of the protein are calculated, such as the average surface electrostatic potential, the dipole moment and cluster-based amino acid conservation patterns. Finally, a random forests classifier is used to predict whether the query protein is likely to bind DNA and to estimate the prediction confidence. We have trained and tested the classifier on various datasets and shown that it outperformed related methods. On a dataset that reflects the fraction of DNA binding proteins (DBPs) in a proteome, the area under the ROC curve was 0.90. The application of the server to an updated version of the N-Func database, which contains proteins of unknown function with solved 3D-structure, suggested new putative DBPs for experimental studies. AVAILABILITY: http://idbps.tau.ac.il/

Substances: DNA-Binding Proteins

Year: 2010 PMID: 20089514 PMCID: PMC2828122 DOI: 10.1093/bioinformatics/btq019

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNA binding proteins (DBPs) make up a considerable part of the proteomes of various organisms (Nimrod et al., 2009) and take part in processes such as DNA transcription, replication and packing. There are several approaches for identifying DBPs. Some methods look for direct similarity between the query protein and known DBPs (e.g. Gao and Skolnick, 2008; Shanahan et al., 2004). When the DNA binding domain is novel, methods that do not rely directly on previous data may be advantageous. Such methods often rely on electrostatic features of the proteins: DNA is negatively charged, and the DNA binding region of the protein is often positively charged. Therefore, features of positively charged patches on the proteins' surfaces have been examined in order to identify DBPs (Bhardwaj et al., 2005; Stawiski et al., 2003). Other features that represent the distribution of charges within the protein structure have also been used (Ahmad and Sarai, 2004; Szilágyi and Skolnick, 2006), as well as secondary structure content (Stawiski et al., 2003) and amino acid composition (Szilágyi and Skolnick, 2006).

We recently developed a method for the prediction of DBPs based on the identification of the functional region within the query protein (Nimrod et al., 2009). We showed that patches of highly conserved amino acids, detected by PatchFinder (Nimrod et al., 2008), often delineate the functional regions in proteins in general, and the core of DNA binding regions within DBPs in particular (Nimrod et al., 2009). Using features of the predicted functional regions and additional global features, we trained a random forests classifier (Breiman, 2001) on a dataset of 138 DBPs and 110 proteins that do not bind DNA (Szilágyi and Skolnick, 2006). We then evaluated the classifier on a realistic dataset that reflects the fraction of DBPs in proteomes. We estimated this fraction to be 14% and extended the original dataset with 733 additional proteins that do not bind DNA. The sensitivity and the precision on this extended dataset were 0.90 and 0.35, respectively, with the default prediction score cutoff. The area under the ROC curve (AUC) was 0.90. We also showed that the performance of the classifier was superior to that of related methods (Nimrod et al., 2009).

Here, we present the iDBPs web server, which implements the classifier. The server is freely available at http://idbps.tau.ac.il/. It is easy to use and requires only the PDB file (or PDB id) and the chain identifier of the protein of interest.
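As a rough check on the figures quoted above, the following sketch recomputes the DBP fraction of the extended dataset and the approximate confusion-matrix counts implied by the reported sensitivity and precision. The derived counts are back-of-the-envelope arithmetic from rounded metrics, not values reported in the paper, and the variable names are illustrative.

```python
# Dataset sizes quoted in the text.
n_dbp = 138                 # DNA binding proteins
n_non_dbp = 110 + 733       # original non-binders plus the extension

total = n_dbp + n_non_dbp
dbp_fraction = n_dbp / total

# Metrics reported at the default prediction score cutoff.
sensitivity = 0.90          # TP / (TP + FN)
precision = 0.35            # TP / (TP + FP)

# Counts implied by those metrics (approximate, since the metrics are rounded).
true_pos = sensitivity * n_dbp
false_pos = true_pos * (1.0 - precision) / precision
false_pos_rate = false_pos / n_non_dbp

print(f"DBP fraction: {dbp_fraction:.3f}")
print(f"implied TP ~ {true_pos:.0f}, FP ~ {false_pos:.0f}")
print(f"implied false-positive rate ~ {false_pos_rate:.2f}")
```

Because non-binders outnumber DBPs roughly six to one in this dataset, even a moderate false-positive rate yields more false positives than true positives, which is consistent with a low precision alongside a high sensitivity.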

2 RESULTS

The N-Func database is a collection that we recently established of proteins with known three-dimensional (3D) structure that lack functional annotation (Nimrod et al., 2008). The functional region of each protein in N-Func was predicted using PatchFinder as a first step toward the annotation of these proteins. Here, we present an updated version of the database, which includes 973 PDB entries and their predicted functional regions. We applied the iDBPs server to N-Func in order to identify potential DNA binders. The results, available as Supplementary Table 1, include the prediction score of each protein as well as the corresponding estimated precision and sensitivity. Using the default prediction threshold, 233 proteins were identified as potential DBPs. At this threshold, the expected precision is only 0.35, while the sensitivity is 0.9. However, one can filter the results using stricter thresholds in order to gather predictions with high precision. Supplementary Figure 2 presents an example of a predicted DBP from N-Func. We previously showed that many of the patches cover most of the hydrogen bonds within the protein–DNA interface in DBPs (Nimrod et al., 2009). Here, we also show that they cover most of the interface positions that interact with the DNA bases (Supplementary Material and Supplementary Fig. 1).
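The filtering step mentioned above can be sketched as follows. The record layout, the identifiers and the example rows are hypothetical and do not come from Supplementary Table 1; the expected number of true DBPs also assumes that the precision estimated on the extended dataset carries over to N-Func.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    pdb_id: str       # hypothetical identifier, not taken from N-Func
    score: float      # classifier prediction score
    precision: float  # estimated precision at this score cutoff

def filter_by_precision(preds, min_precision):
    """Keep predictions whose estimated precision meets the requested minimum."""
    return [p for p in preds if p.precision >= min_precision]

# Expected number of true DBPs among the 233 default-threshold hits,
# assuming the precision of 0.35 applies to N-Func as well.
expected_true_dbps = 233 * 0.35

# Made-up rows for illustration only.
rows = [
    Prediction("xxxA", 0.92, 0.80),
    Prediction("yyyB", 0.61, 0.45),
    Prediction("zzzC", 0.50, 0.35),
]
high_confidence = filter_by_precision(rows, min_precision=0.75)
print(round(expected_true_dbps), [p.pdb_id for p in high_confidence])
```

Under that assumption, roughly 82 of the 233 default-threshold predictions would be expected to be genuine DBPs, which is why a stricter cutoff may be preferable when selecting candidates for experiments.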

3 IMPLEMENTATION

3.1 Prediction of functional regions in the protein

PatchFinder uses as input the protein structure (or a model) and a multiple sequence alignment (MSA) of the query protein and its sequence homologs. The MSA is generated automatically using the procedure implemented in ConSurf-DB (Goldenberg et al., 2009). PatchFinder searches for statistically significant clusters of evolutionarily conserved residues on the protein surface (ML-patches), which often correspond to the functional regions in proteins (Nimrod et al., 2008). When only a few sequence homologs are available for the query protein, the conservation signal cannot be calculated reliably and the functional region is not predicted. In such cases, the iDBPs server uses a classifier that was trained on the global features alone.
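To make the idea of a cluster of conserved residues concrete, the sketch below finds the largest spatially connected group of conserved residues using a distance cutoff and breadth-first search. It is a simplified stand-in and not PatchFinder's algorithm, which searches for statistically significant patches on the protein surface; the conservation threshold, the 8 Å contact cutoff, the omission of a surface filter and the toy coordinates are all assumptions.

```python
from collections import deque
from math import dist

def largest_conserved_cluster(residues, min_conservation=0.8, contact_cutoff=8.0):
    """Return the largest connected set of conserved residues.

    residues: dict mapping residue id -> (conservation score, (x, y, z)).
    Two conserved residues are linked when their coordinates lie within
    contact_cutoff (in angstroms). Both thresholds are illustrative.
    """
    conserved = {rid: xyz for rid, (score, xyz) in residues.items()
                 if score >= min_conservation}
    unvisited = set(conserved)
    best = set()
    while unvisited:
        start = unvisited.pop()
        cluster, queue = {start}, deque([start])
        while queue:  # breadth-first search over spatial neighbors
            current = queue.popleft()
            near = {r for r in unvisited
                    if dist(conserved[current], conserved[r]) <= contact_cutoff}
            unvisited -= near
            cluster |= near
            queue.extend(near)
        if len(cluster) > len(best):
            best = cluster
    return best

# Toy data: residues 1-3 are conserved and close together; 4 is conserved
# but distant; 5 is nearby but poorly conserved.
toy = {
    1: (0.95, (0.0, 0.0, 0.0)),
    2: (0.90, (5.0, 0.0, 0.0)),
    3: (0.85, (10.0, 0.0, 0.0)),
    4: (0.90, (40.0, 0.0, 0.0)),
    5: (0.20, (2.0, 1.0, 0.0)),
}
print(sorted(largest_conserved_cluster(toy)))
```

In the toy data, residues 1 and 3 are not in direct contact but are joined through residue 2, so all three end up in one cluster, while the isolated residue 4 and the poorly conserved residue 5 are excluded.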

3.2 The classifier's input features

The features calculated for the ML-patches are: average surface electrostatic potential, secondary structure content, patch size (number of residues) and cluster-based amino acid conservation patterns (Nimrod et al., 2009). The global features include the average electrostatic potential, the secondary structure content and the protein size. They also include the protein's dipole moment, its amino acid composition, the spatial asymmetry of residues within the protein structure (Szilágyi and Skolnick, 2006) and the fraction of hydrogen donors/acceptors on the protein surface.
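One of the global features, the dipole moment, can be illustrated with point charges. The sketch below computes the vector sum of q_i (r_i - r_0) with r_0 taken as the geometric center of the charges. The choice of reference point, the point-charge model and the unit conversion are assumptions made here; the exact procedure used by the server is not given in this text.

```python
from math import sqrt

E_ANGSTROM_TO_DEBYE = 4.803  # 1 e*angstrom expressed in debye (approximate)

def dipole_moment(charges, coords):
    """Dipole vector (in e*angstrom) of point charges about their geometric center."""
    n = len(charges)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return [sum(q * (c[k] - center[k]) for q, c in zip(charges, coords))
            for k in range(3)]

def magnitude(vec):
    """Euclidean length of a 3-vector."""
    return sqrt(sum(v * v for v in vec))

# Toy example: +1 and -1 charges separated by 2 angstroms along x.
mu = dipole_moment([1.0, -1.0], [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)])
print(mu, magnitude(mu) * E_ANGSTROM_TO_DEBYE)
```

For a molecule with a non-zero net charge the dipole vector depends on the reference point, so any comparison between proteins needs a consistent choice of origin.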

3.3 The web server

The web server requires the user to upload a protein structure in PDB format (or provide the PDB id) and indicate the chain identifier of the query protein; an e-mail address is optional. Once the calculations are finished, the results are sent to the user and include the prediction score as well as the expected sensitivity and precision at this score cutoff, as calculated on the extended dataset. When available, a link to the PatchFinder results is also supplied. The PatchFinder results include the MSA, the evolutionary rates computed for each position in the protein (Mayrose et al., 2004), the list of residues composing the ML-patch and the confidence of the prediction. In addition, the user can visualize the ML-patch on the 3D structure of the protein using the FirstGlance in Jmol applet.
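Before submitting a job, it can help to confirm that the chosen chain actually exists in the structure file. The sketch below lists the chain identifiers found in the ATOM records of a PDB-format file, where the chain identifier occupies column 22. It is a client-side helper written for illustration and is not part of the iDBPs server; all function names are hypothetical.

```python
def chain_ids(pdb_text):
    """Return chain identifiers seen in ATOM records, in order of first appearance."""
    seen = []
    for line in pdb_text.splitlines():
        # ATOM records start in column 1; the chain identifier is column 22.
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21]
            if chain not in seen:
                seen.append(chain)
    return seen

def has_chain(pdb_text, chain):
    """True if the given chain identifier occurs in the structure."""
    return chain in chain_ids(pdb_text)

def make_atom_line(serial, name, resname, chain):
    # Fixed-width layout of the leading ATOM fields (record name, serial,
    # atom name, alternate location, residue name, chain identifier).
    return f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}"

# Two made-up, truncated ATOM records for illustration only.
example = "\n".join([
    make_atom_line(1, "N", "MET", "A"),
    make_atom_line(2, "CA", "GLY", "B"),
])
print(chain_ids(example))
```

A check of this kind catches the common mistake of requesting a chain letter that is absent from the uploaded file.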

3.4 Update of the N-Func database

The procedure we used to gather the structures in N-Func is described in detail in the original publication (Nimrod et al., 2008), with the following modification: sequence homologs were collected and multiply aligned using the protocol of the ConSurf-DB server (Goldenberg et al., 2009) on the UniProt database (Bairoch et al., 2005).

REFERENCES

1. Annotating nucleic acid-binding function based on protein structure.

Authors: Eric W Stawiski; Lydia M Gregoret; Yael Mandel-Gutfreund
Journal: J Mol Biol Date: 2003-02-28 Impact factor: 5.469

2. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior.

Authors: Itay Mayrose; Dan Graur; Nir Ben-Tal; Tal Pupko
Journal: Mol Biol Evol Date: 2004-06-16 Impact factor: 16.240

3. Identifying DNA-binding proteins using structural motifs and the electrostatic potential.

Authors: Hugh P Shanahan; Mario A Garcia; Susan Jones; Janet M Thornton
Journal: Nucleic Acids Res Date: 2004-09-08 Impact factor: 16.971

4. Moment-based prediction of DNA-binding proteins.

Authors: Shandar Ahmad; Akinori Sarai
Journal: J Mol Biol Date: 2004-07-30 Impact factor: 5.469

5. Efficient prediction of nucleic acid binding function from low-resolution protein structures.

Authors: András Szilágyi; Jeffrey Skolnick
Journal: J Mol Biol Date: 2006-03-10 Impact factor: 5.469

6. Detection of functionally important regions in "hypothetical proteins" of known structure.

Authors: Guy Nimrod; Maya Schushan; David M Steinberg; Nir Ben-Tal
Journal: Structure Date: 2008-12-10 Impact factor: 5.006

7. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features.

Authors: Guy Nimrod; András Szilágyi; Christina Leslie; Nir Ben-Tal
Journal: J Mol Biol Date: 2009-02-20 Impact factor: 5.469

8. Kernel-based machine learning protocol for predicting DNA-binding proteins.

Authors: Nitin Bhardwaj; Robert E Langlois; Guijun Zhao; Hui Lu
Journal: Nucleic Acids Res Date: 2005-11-10 Impact factor: 16.971

9. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures.

Authors: Ofir Goldenberg; Elana Erez; Guy Nimrod; Nir Ben-Tal
Journal: Nucleic Acids Res Date: 2008-10-29 Impact factor: 16.971

10. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions.

Authors: Mu Gao; Jeffrey Skolnick
Journal: Nucleic Acids Res Date: 2008-05-31 Impact factor: 16.971
