R Nagarajan, Shandar Ahmad, M Michael Gromiha.
Abstract
Protein-DNA complexes play vital roles in many cellular processes through the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods showed different levels of accuracy, which may depend on the choice of data sets used in training, the feature sets selected for developing a predictive model, the ability of the models to capture information useful for prediction or a combination of these factors. In many cases, different methods are likely to produce similar results, whereas in others, the predictors may return contradictory predictions. In this situation, a priori estimates of prediction performance applicable to the system being investigated would help biologists choose the best method for designing their experiments. In this work, we have constructed unbiased, stringent and diverse data sets for DNA-binding proteins based on various biologically relevant considerations: (i) seven structural classes, (ii) 86 folds, (iii) 106 superfamilies, (iv) 194 families, (v) 15 binding motifs, (vi) single/double-stranded DNA, (vii) DNA conformation (A, B, Z, etc.), (viii) three functions and (ix) disordered regions. These data sets were culled as non-redundant at sequence identity cutoffs of 25% and 40% and used to evaluate the performance of 11 different methods for which online services or standalone programs are available. We observed that the best performing methods for each of the data sets showed significant biases toward the data sets selected for their benchmark. Our analysis revealed important data set features, which could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem. We have developed a web server, which considers these features on demand and displays the best method that the investigator should use.
The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/. Further, we have grouped the methods based on their complexity and analyzed the performance. The information gained in this work could be effectively used to select the best method for designing experiments.
Year: 2013 PMID: 23788679 PMCID: PMC3763535 DOI: 10.1093/nar/gkt544
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Prediction accuracy of binding sites in different classes
| Methods | Average Accuracy | all-α | all-β | α + β | α/β | Coiled coil | Multidomain | Small proteins |
|---|---|---|---|---|---|---|---|---|
| BindN | 64.2 (74.9) | 66.2 (76.3) | 62.1 (74.6) | 60.3 (74.8) | 62.2 (79.6) | 75.1 (73.2) | 61.3 (79.7) | 62.4 (66.5) |
| BindN+ | 71.1 (82.8) | 76.2 (83.8) | 66.0 (83.7) | 67.9 (81.9) | 66.5 (85.8) | 65.5 (87.3) | 66.9 (70.2) | |
| BindN-RF | 71.9 (82.3) | 68.4 (82.8) | 67.8 (84.8) | 88.1 (86.5) | 65.8 (86.6) | |||
| DBS-Pred | 64.3 (72.6) | 64.2 (73.0) | 62.6 (71.6) | 62.0 (71.7) | 63.1 (75.4) | 74.4 (73.6) | 59.8 (76.4) | 63.6 (66.5) |
| DBS-PSSM | 70.2 (78.5) | 73.2 (80.2) | 65.5 (78.2) | 65.8 (76.5) | 67.1 (83.3) | 87.4 (81.6) | 65.3 (87.0) | 67.3 (62.3) |
| DP-Bind_Binary | 66.9 (68.0) | 68.1 (68.6) | 63.5 (65.8) | 63.1 (67.2) | 66.3 (70.6) | 79.8 (70.4) | 62.0 (70.7) | 65.4 (62.9) |
| DP-Bind_BLOSUM | 66.1 (67.8) | 69.2 (69.5) | 63.2 (66.3) | 62.8 (67.8) | 66.3 (71.5) | 75.6 (66.6) | 61.1 (70.5) | 65.0 (62.4) |
| DP-Bind_PSSM | 72.1 (76.4) | 73.7 (78.4) | 67.9 (75.3) | 70.4 (80.5) | 88.1 (84.6) | 64.8 (56.8) | ||
| DNABindR | 68.0 (71.9) | 70.1 (72.9) | 62.6 (68.1) | 65.2 (71.0) | 66.2 (75.2) | 82.9 (77.3) | 64.2 (77.1) | 64.4 (61.7) |
| metaDBSite | 69.9 (72.3) | 72.0 (74.1) | 66.9 (70.2) | 67.2 (71.5) | 82.0 (74.0) | 65.4 (76.6) | 66.5 (62.9) | |
| NAPS | 63.6 (65.1) | 64.6 (64.8) | 58.8 (61.3) | 59.4 (62.5) | 57.6 (66.9) | 80.6 (75.0) | 62.5 (67.9) | 61.6 (57.6) |
Accuracies obtained with Equation (3) are given in parentheses. The highest accuracy in each class is shown in bold.
Figure 1. Performance of DNA-binding site prediction methods in various folds, superfamilies and families.
Typical examples of best and worst predicted folds, superfamilies and families
| Fold/Superfamily/Family | Method | Sensitivity | Specificity | Accuracy1 | Accuracy2 | MCC | Lowest Accuracy (method) | MCC (lowest) |
|---|---|---|---|---|---|---|---|---|
| Profilin-like (1) | BindN+ | 100.0 | 96.4 | 96.6 | 98.2 | 0.32 | 64.5 (DP-Bind_BLOSUM) | 0.20 |
| Tetracyclin repressor-like, C terminal domain (2) | DP-Bind_PSSM | 96.2 | 89.6 | 89.8 | 92.9 | 0.28 | 51.1 (DP-Bind_BLOSUM) | 0.20 |
| Transcription factor IIA(TFIIA), beta-barrel domain (2) | DBS-Pred | 100.0 | 80.4 | 82.0 | 90.2 | 0.16 | 67.9 (NAPS) | 0.13 |
| Pheromone-binding, quorum-sensing transcription factors (1) | BindN+ | 100.0 | 96.4 | 96.6 | 98.2 | 0.31 | 64.5 (DP-Bind_BLOSUM) | 0.20 |
| Dimeric alpha + beta barrel (1) | BindN-RF | 87.5 | 96.4 | 95.9 | 92.0 | 0.34 | 47.3 (DBS-Pred) | 0.17 |
| DNA-binding domain- eukaryotic transcription factors (1) | DBS-PSSM | 100.0 | 88.5 | 90.5 | 94.3 | 0.28 | 72.3 (DBS-Pred) | 0.20 |
| AraC type transcriptional activator (1) | BindN-RF | 100.0 | 99.0 | 99.1 | 99.5 | 0.32 | 65.9 (DBS-Pred) | 0.19 |
| CopG-like (1) | BindN | 100.0 | 81.1 | 83.7 | 90.5 | 0.22 | 78.1 (BindN-RF) | 0.20 |
| Z-DNA binding domain (1) | DBS-PSSM | 100.0 | 81.1 | 82.5 | 90.6 | 0.26 | 47.4 (DP-Bind_Binary) | 0.19 |
The worst predicted folds/superfamilies/families are shown in italics.
Prediction performance of binding sites in disordered regions
| Method | Sensitivity | Specificity | Accuracy1 | Accuracy2 | MCC |
|---|---|---|---|---|---|
| DBS-Pred | 61.3 | 60.7 | 60.8 | 61.0 | 0.17 |
| BindN | 55.5 | 67.5 | 65.2 | 61.5 | 0.19 |
| BindN+ | 61.3 | 64.6 | 64.0 | 63.0 | 0.21 |
| BindN-RF | 55.5 | 68.3 | 65.9 | 61.9 | 0.19 |
| DP-Bind_Binary | 78.1 | 48.4 | 54.0 | 63.3 | 0.21 |
| DP-Bind_BLOSUM | 73.0 | 50.3 | 54.5 | 61.6 | 0.18 |
| DP-Bind_PSSM | 65.7 | 56.4 | 60.6 | 61.0 | 0.20 |
| NAPS | 59.1 | 58.9 | 58.9 | 59.0 | 0.14 |
| DNABindR | 75.9 | 51.9 | 56.4 | 63.9 | 0.22 |
| metaDBSite | 73.0 | 56.0 | 59.2 | 64.5 | 0.23 |
| DBS-PSSM | 65.0 | 61.1 | 61.8 | 63.0 | 0.20 |
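
The equations behind these columns are not reproduced in this excerpt, so the following is only a minimal sketch under stated assumptions. It assumes the standard confusion-matrix definitions of sensitivity, specificity and MCC, takes Accuracy1 to be overall accuracy, and takes Accuracy2 to be the mean of sensitivity and specificity. The last assumption is consistent with the tabulated values (for DBS-Pred, (61.3 + 60.7) / 2 = 61.0). The function name `binding_site_metrics` and the residue counts in the example are hypothetical.

```python
from math import sqrt


def binding_site_metrics(tp, tn, fp, fn):
    """Per-residue metrics from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    Assumed definitions (not taken from the source equations):
      Accuracy1 = overall accuracy,
      Accuracy2 = mean of sensitivity and specificity.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy1 = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    accuracy2 = (sensitivity + specificity) / 2.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy1": accuracy1,
        "accuracy2": accuracy2,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts: 50 binding and 200 non-binding residues.
    for name, value in binding_site_metrics(tp=30, tn=160, fp=40, fn=20).items():
        print(f"{name}: {value:.2f}")
```

Because binding residues are usually a small minority, Accuracy1 can stay high even when few binding residues are found, which is one reason to read it alongside Accuracy2 and MCC.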
Figure 2. Performance of prediction methods in 15 different types of DNA-binding motifs. The number of motifs predicted with sensitivity and specificity of >60% each is shown for all considered methods.
Prediction performance of different methods in two independent data sets
| Method | Data set 1: Accuracy1 | Data set 1: Accuracy2 | Data set 1: MCC | Data set 2: Accuracy1 | Data set 2: Accuracy2 | Data set 2: MCC |
|---|---|---|---|---|---|---|
| BindN | 76.1 | 63.1 | 0.17 | 76.4 | 61.4 | 0.14 |
| BindN+ | 80.2 | 69.2 | 0.28 | 79.6 | 68.7 | 0.26 |
| BindN-RF | 78.0 | 69.5 | 0.28 | 75.3 | 68.7 | 0.24 |
| DBS-Pred | 72.6 | 62.4 | 0.16 | 72.8 | 62.2 | 0.14 |
| DBS-PSSM | 78.3 | 66.5 | 0.25 | 78.4 | 69.7 | 0.23 |
| NAPS | 63.5 | 60.2 | 0.13 | 64.8 | 60.3 | 0.12 |
| DNABindR | 71.6 | 66.3 | 0.21 | 72.1 | 66.7 | 0.20 |
| metaDBSite | 74.7 | 68.7 | 0.24 | 78.2 | 66.2 | 0.22 |
| DP-Bind_Binary | 67.9 | 65.9 | 0.19 | 68.6 | 67.7 | 0.19 |
| DP-Bind_BLOSUM | 68.4 | 66.1 | 0.19 | 67.3 | 65.4 | 0.17 |
| DP-Bind_PSSM | 75.9 | 70.3 | 0.27 | 77.7 | 70.0 | 0.25 |
Data set 1: List of DNA–protein complexes analyzed in this work and not used in the respective methods.
Data set 2: List of DNA–protein complexes published from June 2011, after the publication of all the analyzed methods.
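
One way to read the independent-data-set table is to rank the methods by a single summary score. The sketch below does this with the mean MCC over the two data sets, using the values copied from the table. Averaging MCC is an illustrative choice made here, not a procedure described in the source, and the names `RESULTS` and `rank_by_mean_mcc` are hypothetical.

```python
# (Accuracy1, Accuracy2, MCC) for data set 1 and data set 2,
# copied from the independent-data-set table.
RESULTS = {
    "BindN": ((76.1, 63.1, 0.17), (76.4, 61.4, 0.14)),
    "BindN+": ((80.2, 69.2, 0.28), (79.6, 68.7, 0.26)),
    "BindN-RF": ((78.0, 69.5, 0.28), (75.3, 68.7, 0.24)),
    "DBS-Pred": ((72.6, 62.4, 0.16), (72.8, 62.2, 0.14)),
    "DBS-PSSM": ((78.3, 66.5, 0.25), (78.4, 69.7, 0.23)),
    "NAPS": ((63.5, 60.2, 0.13), (64.8, 60.3, 0.12)),
    "DNABindR": ((71.6, 66.3, 0.21), (72.1, 66.7, 0.20)),
    "metaDBSite": ((74.7, 68.7, 0.24), (78.2, 66.2, 0.22)),
    "DP-Bind_Binary": ((67.9, 65.9, 0.19), (68.6, 67.7, 0.19)),
    "DP-Bind_BLOSUM": ((68.4, 66.1, 0.19), (67.3, 65.4, 0.17)),
    "DP-Bind_PSSM": ((75.9, 70.3, 0.27), (77.7, 70.0, 0.25)),
}


def rank_by_mean_mcc(results):
    """Return (method, mean MCC) pairs sorted from highest to lowest."""
    scored = [
        (name, (ds1[2] + ds2[2]) / 2.0)
        for name, (ds1, ds2) in results.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, mean_mcc in rank_by_mean_mcc(RESULTS):
        print(f"{name:15s} {mean_mcc:.3f}")
```

A ranking like this compresses the table into one number per method, so it hides the context-specific differences (structural class, fold, disorder) that the abstract highlights; it is meant as a starting point for comparison only.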