Jingna Si, Rui Zhao, Rongling Wu.
Abstract
Interactions between proteins and DNA play an important role in many essential biological processes such as DNA replication, transcription, splicing, and repair. The identification of amino acid residues involved in DNA-binding sites is critical for understanding the mechanism of these biological activities. In the last decade, numerous computational approaches have been developed to predict protein DNA-binding sites based on protein sequence and/or structural information; these approaches play an important role in complementing experimental strategies. Current approaches can be divided into three categories: sequence-based DNA-binding site prediction, structure-based DNA-binding site prediction, and homology modeling and threading. In this article, we review existing research on computational methods to predict protein DNA-binding sites, covering data sets, various residue sequence/structural features, machine learning methods for comparison and selection, evaluation methods, performance comparison of different tools, and future directions in protein DNA-binding site prediction. In particular, we detail the meta-analysis of protein DNA-binding sites. We also propose specific implications that are likely to result in novel prediction methods, increased performance, or practical applications.
Year: 2015 PMID: 25756377 PMCID: PMC4394471 DOI: 10.3390/ijms16035194
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Sequence similarity and structure similarity-based strategies.
Commonly used data sets for DNA-binding site identification.
| ID | Ref. No. | Notes |
|---|---|---|
| DB179 | | 179 DNA-binding proteins, almost entirely nonredundant at 40% sequence identity |
| NB3797 | | 3797 nonbinding proteins, significant redundancy at the 35% sequence identity level (only 3482 independent clusters) |
| PD138 | | 138 DNA-binding proteins, almost entirely nonredundant at 35% sequence identity, divided into seven structural classes |
| DISIS | | 78 DNA-binding proteins, close to nonredundant at 20% sequence identity |
| PDNA62 | | 62 DNA-binding proteins, 78 chains, 57 nonredundant sequences at 30% sequence identity |
| NB110 | | 110 nonbinding proteins, nonredundant at the 30% sequence identity level, derived from the RS126 secondary structure data set by removing entries related to DNA |
| BIND54 | | Reported as 54 binding proteins, actually 58 chains, nonredundant at 30% sequence identity; original list of proteins was reported in [ |
| NB250 | | 250 nonbinding proteins, mostly nonredundant at 35% sequence identity |
| DBP374 | | 374 DNA-binding proteins, significant redundancy at the 25% sequence identity level |
| TS75 | | 75 DNA-binding proteins, designed to be independent from DBP374 and PDNA62 but sharing some redundant entries with both at the 35% sequence identity level |
| PDNA-316 | | 316 target proteins used in the metaDBSite Web server, at 30% sequence identity |
| DNABindR171 | | 171 proteins with mutual sequence identity ≤30%, each with at least 40 amino acid residues. All structures have a resolution better than 3.0 Å and an R factor less than 0.3 |
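The selection criteria listed for DNABindR171 (at least 40 residues, resolution better than 3.0 Å, R factor below 0.3, mutual sequence identity ≤30%) can be illustrated with a minimal Python sketch. This is not the procedure used to build any of the data sets above. The `Chain` record, its field names, the greedy culling order, and the `naive_identity` placeholder are all assumptions made for illustration; a real pipeline would compute identity from a proper sequence alignment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chain:
    """Hypothetical record for one protein chain (field names are illustrative)."""
    chain_id: str
    sequence: str
    resolution: float  # crystallographic resolution in angstroms
    r_factor: float


def passes_quality(chain: Chain, min_length: int = 40,
                   max_resolution: float = 3.0,
                   max_r_factor: float = 0.3) -> bool:
    """Apply the length, resolution, and R-factor cutoffs quoted in the table."""
    return (len(chain.sequence) >= min_length
            and chain.resolution < max_resolution
            and chain.r_factor < max_r_factor)


def greedy_nonredundant(chains: List[Chain],
                        identity: Callable[[str, str], float],
                        max_identity: float = 0.30) -> List[Chain]:
    """Keep a chain only if its identity to every already kept chain is <= max_identity.

    Greedy culling is an assumed strategy; the result depends on input order.
    """
    kept: List[Chain] = []
    for chain in chains:
        if all(identity(chain.sequence, k.sequence) <= max_identity for k in kept):
            kept.append(chain)
    return kept


def naive_identity(a: str, b: str) -> float:
    """Placeholder identity: matching positions over the shorter length, no alignment."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def build_dataset(chains: List[Chain],
                  identity: Callable[[str, str], float]) -> List[Chain]:
    """Quality filter first, then redundancy culling."""
    return greedy_nonredundant([c for c in chains if passes_quality(c)], identity)
```

Passing the identity function as a parameter keeps the sketch independent of any particular alignment tool, so `naive_identity` can be swapped for an alignment-based measure without changing the filtering logic.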
Evaluation parameters.
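| Parameter | Meaning | Expression |
|---|---|---|
| Accuracy (ACC) | Percentage of correct predictions | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Sensitivity | Percentage of positives correctly predicted | $\frac{TP}{TP + FN}$ |
| Specificity | Percentage of negatives correctly predicted | $\frac{TN}{TN + FP}$ |
| Strength | Mean of sensitivity and specificity | $\frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}$ |
| MCC | Matthews correlation coefficient | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |
| Precision | Positive predictive value | $\frac{TP}{TP + FP}$ |
| F-measure | Harmonic mean of precision and sensitivity | $\frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$ |
| AUC b | Probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one | $\frac{1}{nT}\sum_{i=1}^{n} T_i$ |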
a: TP = number of true positives; TN = number of true negatives; FP = number of false positives; FN = number of false negatives; b: In the AUC formulation, i takes on values from 1 to n, T is the total number of positives in the test set, and T_i is the number of positives that score higher than the i-th highest-scoring negative.
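As a worked illustration of the evaluation parameters, the following Python sketch computes the count-based measures from TP, TN, FP, and FN, and the AUC from the rank formulation in note b. It rests on several assumptions: the function names are hypothetical, the F-measure uses the common precision/sensitivity form, n is taken to be the number of negatives, ties between a positive and a negative score count as not higher, and undefined ratios are returned as 0.0.

```python
import math
from typing import Dict, Sequence


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the count-based measures from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": (sensitivity + specificity) / 2,
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
        "precision": precision,
        # Assumed form: harmonic mean of precision and sensitivity.
        "f_measure": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def rank_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as sum_i T_i / (n * T).

    T is the number of positives, n the number of negatives (assumed), and
    T_i the number of positives scoring strictly higher than the i-th negative.
    """
    t = len(pos_scores)
    n = len(neg_scores)
    total = sum(sum(1 for p in pos_scores if p > s) for s in neg_scores)
    return _ratio(total, n * t)


if __name__ == "__main__":
    print(confusion_metrics(tp=30, tn=50, fp=10, fn=10))
    print(rank_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

The double loop in `rank_auc` follows the formulation literally and is quadratic in the number of residues; for large test sets a sort-based rank computation would give the same value more efficiently.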
Performance of the state-of-the-art methods for DNA-binding site prediction.
| Author & Year | Data set (own/PDNA-316) | Performance (reported values in source order; metric columns were ACC, SEN, SPE, AUC, MCC, Strength, F-Measure, Precision) | Algorithm b | Reference |
|---|---|---|---|---|
| Jones 2003 | own | 0.680 | 1 | |
| Ahmad 2004 | own | 0.664, 0.682, 0.660 | 2 | |
| Ferrer-Costa 2005 | own | 0.835 | 4 | |
| Kuznetsov 2006 | own | 0.760, 0.769, 0.747, 0.830, 0.450 | 3 | |
| Wang 2006 | own | 0.703, 0.694, 0.704, 0.750 | 3 | |
| Yan 2006 | own | 0.710, 0.530, 0.350 | 5 | |
| Tjong 2007 | own | 0.680 | 2 | |
| Ofran 2007 | own | 0.890 | 2/3 | |
| Hwang 2007 | own | 0.772, 0.764, 0.766 | 3 | |
| Nimrod 2009 | own | 0.900, 0.900, 0.350 | 6 | |
| Wang 2009 | own | 0.800, 0.731, 0.806, 0.850 | 6 | |
| Wu 2009 | own | 0.914, 0.766, 0.944 | 6 | |
| Carson 2010 | own | 0.785, 0.797, 0.772, 0.860, 0.570 | 7 | |
| Ozbek 2010 | own | 0.960, 0.360, 0.990 | 8 | |
a: Italic lines represent the performance of previous methods using the PDNA-316 data set (from the metaDBSite method); b: Patch prediction = 1, Neural Network = 2, SVM = 3, Linear Predictor = 4, Naïve Bayes = 5, Random Forest = 6, C4.5BAGCST = 7, Gaussian Network Model = 8.
A selection of Web servers for predicting DNA-binding proteins or DNA-binding residues.
| Methods | URLs | References | Publication Year |
|---|---|---|---|
| newDNA-Prot | | | 2014 |
| DNABind | | | 2013 |
| DNABR | | | 2012 |
| DR_bind | | | 2012 |
| MetaDBSite | | | 2011 |
| DNABINDPROT | | | 2010 |
| bindn-rf | | | 2009 |
| DBindR | | | 2009 |
| DP-Bind | | | 2007 |
| BindN | | | 2006 |