Sucharita Dey, Arumay Pal, Mainak Guharoy, Shrihari Sonavane, Pinak Chakrabarti.
Abstract
We present a set of four parameters that, in combination, can predict DNA-binding residues on protein structures to a high degree of accuracy. These are the number of evolutionarily conserved residues (Ncons) and their spatial clustering (ρe), hydrogen bond donor capability (Dp) and residue propensity (Rp). We first used these parameters to characterize 130 interfaces in a set of 126 DNA-binding proteins (DBPs). We then analyzed how well the parameters, individually and in combination, distinguish the true binding region from the rest of the protein surface. Rp performs best, identifying the true interface with the top rank in 83% of cases. Importantly, we also used the unbound-bound test cases of the protein-DNA docking benchmark to test the efficacy of our method. When applied to the unbound form of the DBPs, Rp can distinguish 86% of cases. Finally, we applied a support vector machine (SVM) approach to recognize the interface region, using the above parameters along with the individual amino acid composition as attributes. The accuracy of prediction is 90.5% for the bound structures and 93.6% for the unbound form of the proteins.
Year: 2012 PMID: 22641851 PMCID: PMC3424558 DOI: 10.1093/nar/gks405
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Average values of interface parameters in protein–DNA complexes
| Parameters | Values |
|---|---|
| Number of complexes | 122 |
| | 0.51 ± 0.28 |
| | 0.18 ± 0.20 |
| | 0.09 ± 0.02 |
| | 0.08 ± 0.02 |
| | 1.11 ± 0.10 |
| | 0.12 ± 0.08 |
| | 0.71 ± 2.91 |
| | 18 ± 10 |
| | 18 ± 8 |

(a) Of the 130 DBPs, 8 with only a few homologs were excluded.
(b)
(c) The differences between Ms,int and Ms,cons (and between
Figure 1. Plot of Ms,cons versus Ms,int (clustering of conserved residues versus that for all the residues in the interface).
Figure 2. Distribution of the rank (on a scale of 1 to 10) of the known DNA-binding site relative to other patches on the surface of the protein using four different parameters. In (a), 77 structures are used with a strict definition of patches; in (b), 106 structures are used (where the patches may contain up to 10% interface residues).
Figure 3. Distribution of five parameters calculated for all patches for the DNA complex of human topoisomerase I (PDB code, 1ej9). On each graph, all the surface patches are represented in grey and the value for the known DNA-binding interface is indicated by an arrow. The parameters used are (a) ρ, (b) ρe, (c) Rp, (d) Dp and (e) Ncons.
Percentage of cases where the true interface is ranked #1 using different parameters applied to different datasets
| Parameter | This dataset [77, 106] | Jones and Stawiski |
|---|---|---|
| | 47, 54 | 50, 51 |
| | 79, 83 | 81, 82 |
| | 68, 70 | 67, 72 |
| | 70, 73 | 71, 68 |

(a) ρ is omitted, being already incorporated in ρe.
(b) The first entry indicates the percentage of cases using stringent conditions (surface patches devoid of any interface residue); the second is for patches that may contain up to 10% of interface residues.
(c) Combining the Jones and Stawiski datasets (15,16) and excluding the redundant entries.
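The ranking summarized above can be illustrated with a minimal Python sketch. It assumes, since the text does not spell this out, that a higher parameter value is more interface-like and that the 1 to 10 scale bins the fraction of surface patches scoring above the true interface into tenths. The function names `decile_rank` and `percent_in_first_bin` and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def decile_rank(interface_score: float, patch_scores: Sequence[float]) -> int:
    """Place the interface on a 1..10 scale relative to other patches (1 = best).

    Assumption: higher scores are better and ties count in the interface's favor.
    """
    if not patch_scores:
        return 1  # nothing to compete with
    better = sum(1 for s in patch_scores if s > interface_score)
    # Integer arithmetic avoids floating-point edge cases at bin boundaries.
    return min(10, (better * 10) // len(patch_scores) + 1)


def percent_in_first_bin(cases: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Percentage of proteins whose interface falls in bin 1."""
    hits = sum(1 for iface, patches in cases if decile_rank(iface, patches) == 1)
    return 100.0 * hits / len(cases)


# Made-up scores for four proteins: (interface score, other patch scores).
toy_cases = [
    (0.9, [0.1, 0.2]),
    (0.1, [0.5, 0.6]),
    (0.8, [0.3, 0.7]),
    (0.4, [0.5, 0.1]),
]
print(percent_in_first_bin(toy_cases))
```

Whether "ranked #1" in the table means the first bin of this scale or the single best patch is not stated in the text; the sketch uses the first-bin reading.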
Average accessible surface area,
| Groups | Residues | <ASA> (Å²) in interface, before complexation | <ASA> (Å²) in interface, after complexation | <ASA> (Å²) in surface |
|---|---|---|---|---|
| NE | Arg | 10 ± 4 (10 ± 6) | 3 ± 3 | 7 ± 3 |
| NH1 | Arg | 29 ± 10 (31 ± 15) | 11 ± 7 | 25 ± 9 |
| NH2 | Arg | 35 ± 11 (34 ± 19) | 13 ± 10 | 31 ± 11 |
| ND1 | His | 11 ± 8 (10 ± 9) | 3 ± 4 | 10 ± 5 |
| NE2 | His | 15 ± 9 (17 ± 9) | 5 ± 6 | 13 ± 9 |
| NZ | Lys | 35 ± 8 (32 ± 12) | 19 ± 9 | 33 ± 7 |
| ND2 | Asn | 30 ± 12 (27 ± 17) | 12 ± 10 | 31 ± 10 |
| NE1 | Trp | 12 ± 7 (9 ± 9) | 3 ± 4 | 7 ± 5 |
| NE2 | Gln | 31 ± 15 (21 ± 19) | 12 ± 10 | 27 ± 9 |
| OG | Ser | 17 ± 8 (17 ± 11) | 6 ± 5 | 14 ± 6 |
| OG1 | Thr | 15 ± 7 (14 ± 10) | 5 ± 6 | 12 ± 6 |
| OH | Tyr | 21 ± 11 (21 ± 15) | 7 ± 7 | 19 ± 9 |

(a) The difference between the accessibilities is significant at the 0.1 to 5% level (P-value ranging from 0.001 to 0.05), except for ND1, NE2 (His and Gln), OH and ND2.
(b) The values for the unbound form (from the protein–DNA docking benchmark) are given in parentheses, for comparison.
Figure 4. Distribution of the rank (on a scale of 1 to 10) of the DNA-binding site in the unbound form (obtained by mapping the interface information from the bound structure) of 42 DNA-binding proteins taken from benchmark version 1.2, relative to other patches on the surface of the protein using four different parameters. Patches were identified using the strict definition.
Comparison of the efficiency of the present method with other techniques
| Dataset (# of cases) | Reported prediction accuracy (%) | Accuracy (%) using | |
|---|---|---|---|
| Jones ( | 68 | 82 | 72 |
| Stawiski ( | 81 | ||
| Stawiski enzyme data set ( | 50 | 92 | 62 |

(a) The present method was applied to the combined Jones and Stawiski datasets as given in Table 2.
(b) Based on 13 cases (three could not be used as no surface patch showed up).
Summary of SVM modeling
| Attributes | | γ | MCC |
|---|---|---|---|
| Top 5 | 15 | 0.013 | 0.7867 |
| Top 10 | 14 | 0.5 | 0.8393 |
| Top 15 | 7 | 0.021 | 0.8608 |
| All 25 | 3 | 0 | 0.8508 |
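
The MCC column is the Matthews correlation coefficient, which condenses a binary confusion matrix into a single value between -1 and 1. A minimal Python sketch of the standard definition follows. The function name `mcc` and the convention of returning 0 when the denominator vanishes are choices made here for illustration, not details from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: an undefined MCC (empty row or column) is 0.
    return numerator / denominator if denominator else 0.0


# Toy counts, not from the paper.
print(mcc(tp=10, tn=10, fp=0, fn=0))  # perfect prediction
print(mcc(tp=5, tn=5, fp=5, fn=5))    # no better than chance
```

Unlike accuracy, this measure stays informative when positives and negatives are unequal in number, which is one common reason to use it for model selection.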
Performance of the model on our test set and the unbound cases in protein–DNA docking benchmark
| Test set | Accuracy | Specificity | Sensitivity/Recall | Precision | |
|---|---|---|---|---|---|
| Our dataset | 90.5 | 91.7 | 88.8 | 89.9 | 89.1 |
| protein–DNA docking benchmark | 93.6 | 92.8 | 95.2 | 86.9 | 90.9 |

(a) Values shown are average performance on 10 different randomly generated test sets.
(b) 42 positives and 83 negatives.
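
The column headings in the performance table follow the usual confusion-matrix definitions, which the Python sketch below makes explicit. The counts used (40 true positives, 2 false negatives, 77 true negatives, 6 false positives) are not reported in the source. They are an assumption: one set of integers that is consistent with the 42 positives and 83 negatives of the footnote and that reproduces the accuracy, specificity and sensitivity of the docking-benchmark row, with precision agreeing to within 0.1 percentage point. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),  # also called recall
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


# Hypothetical counts (not reported in the source): 42 positives, 83 negatives.
benchmark_like = binary_metrics(tp=40, fn=2, tn=77, fp=6)
for name, value in benchmark_like.items():
    print(f"{name}: {100 * value:.1f}")
```

The harmonic mean of precision and recall is included as a common companion metric; the text does not say whether the unlabeled last column of the table reports it.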
Figure 5. Distribution of the rank (on a scale of 1 to 10) of the known RNA-binding site relative to other patches on the surface of the protein using four different parameters. In (a), 39 structures are used with a strict definition of patches; in (b), 45 structures are used (where the patches may contain up to 10% interface residues).