Xiaolei Zhu, Spencer S Ericksen, Julie C Mitchell.
Abstract
In this study, we present the DNA-Binding Site Identifier (DBSI), a new structure-based method for predicting protein interaction sites for DNA binding. DBSI was trained and validated on a data set of 263 proteins (TRAIN-263), tested on an independent set of protein-DNA complexes (TEST-206) and data sets of 29 unbound (APO-29) and 30 bound (HOLO-30) protein structures distinct from the training data. We computed 480 candidate features for identifying protein residues that bind DNA, including new features that capture the electrostatic microenvironment within shells near the protein surface. Our iterative feature selection process identified features important in other models, as well as features unique to the DBSI model, such as a banded electrostatic feature with spatial separation comparable with the canonical width of the DNA minor groove. Validations and comparisons with established methods using a range of performance metrics clearly demonstrate the predictive advantage of DBSI, and its comparable performance on unbound (APO-29) and bound (HOLO-30) conformations demonstrates robustness to binding-induced protein conformational changes. Finally, we offer our feature data table to others for integration into their own models or for testing improved feature selection and model training strategies based on DBSI.
Year: 2013 PMID: 23873960 PMCID: PMC3763564 DOI: 10.1093/nar/gkt617
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Calculation of the atom-level electrostatic feature within a shell offset 0.5 Å from the van der Waals surface. The grid on which the electrostatic potential is calculated is shown relative to the molecule (dark gray), and the atom at which the feature is calculated is marked with a black dot. Electrostatic potential values at grid points within the light gray annular region are averaged to generate the feature for this atom. Grid points inside the 0.5 Å offset surface are excluded from the calculation. The light gray annular region is 1.4 Å in width, regardless of the offset used to define the shell.
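The shell averaging described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DBSI implementation: the function name, the tuple-based data layout, and the rule that a grid point is measured from the chosen atom's van der Waals sphere are all hypothetical choices made for the example.

```python
import math

def shell_average_potential(atom_index, atoms, grid_points, potentials,
                            offset=0.5, width=1.4):
    """Average the potential over grid points in a shell around one atom.

    atoms: list of (x, y, z, vdw_radius) tuples (assumed layout).
    grid_points: list of (x, y, z) tuples; potentials: matching floats.
    A grid point is used when its distance from the chosen atom's van der
    Waals sphere lies in [offset, offset + width) and it is not inside the
    offset surface of any other atom. Returns None if no point qualifies.
    """
    ax, ay, az, ar = atoms[atom_index]
    selected = []
    for point, phi in zip(grid_points, potentials):
        gap = math.dist(point, (ax, ay, az)) - ar
        if not (offset <= gap < offset + width):
            continue
        # Skip points that fall inside the offset surface of another atom.
        buried = any(
            math.dist(point, (x, y, z)) - r < offset
            for i, (x, y, z, r) in enumerate(atoms)
            if i != atom_index
        )
        if not buried:
            selected.append(phi)
    return sum(selected) / len(selected) if selected else None
```

With the default arguments the shell starts 0.5 Å outside the van der Waals sphere and is 1.4 Å wide, matching the geometry in the caption; passing `offset=0.0` would correspond to the unshifted shell.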
Figure 2. The basic residue-level electrostatics feature is mapped onto the surface of the Nucleosome Core Particle (PDB 1KX5). The feature calculated in the shell between the van der Waals and solvent accessible surface (top) shows patches where this feature takes on negative values. When this feature is calculated for the shell that is shifted 0.5 Å outward, some patches flip from negative to positive. Thus, a region that might otherwise seem unfavorable to DNA binding is now seen to have the correct biophysical characteristics for recognition.
Table 1. The best model for each electrostatic feature on a subset with 1000 data points
| Electrostatic feature | Sensitivity | Specificity | Precision | F1 |
|---|---|---|---|---|
| ESP_T (Feature 14) | 0.06 | 0.99 | 0.63 | 0.10 |
| AVE_ESP (Feature 17) | 0.10 | 0.98 | 0.63 | 0.26 |
| AVE_ESP1 (Feature 18) | 0.10 | 0.98 | 0.53 | 0.17 |
| RANK_AVEESP1 (Feature 20) | 0.16 | 0.96 | 0.44 | 0.23 |
| ESP_T_0.1 (Feature 21) | 0.08 | 0.98 | 0.54 | 0.14 |
| AVE_ESP_0.1 (Feature 24) | 0.13 | 0.98 | 0.55 | 0.21 |
| AVE_ESP1_0.1 (Feature 25) | 0.09 | 0.99 | 0.59 | 0.16 |
| RANK_AVEESP1_0.1 (Feature 27) | 0.18 | 0.97 | 0.53 | 0.27 |
| ESP_T_0.3 (Feature 28) | 0.25 | 0.87 | 0.31 | 0.28 |
| AVE_ESP_0.3 (Feature 31) | 0.25 | 0.88 | 0.31 | 0.28 |
| AVE_ESP1_0.3 (Feature 32) | 0.30 | 0.88 | 0.35 | 0.32 |
| RANK_AVEESP1_0.3 (Feature 34) | 0.22 | 0.93 | 0.42 | 0.29 |
| ESP_T_0.5 (Feature 35) | 0.33 | 0.88 | 0.38 | 0.35 |
| AVE_ESP_0.5 (Feature 38) | 0.23 | 0.97 | 0.60 | 0.33 |
| AVE_ESP1_0.5 (Feature 39) | 0.31 | 0.94 | 0.54 | 0.40 |
| RANK_AVEESP1_0.5 (Feature 41) | 0.26 | 0.94 | 0.47 | 0.33 |
The number of each feature, as listed in Supplementary Table S3, is given in parentheses. Models were trained on individual electrostatic features in search of single features with high predictive value. Predictive value was measured using the F1 Score, which favored models having high Specificity but low to moderate Sensitivity. The best results were obtained for features calculated using the shell between the surfaces offset 0.5 Å from the van der Waals and solvent accessible surfaces. Based on this observation, we later checked whether use of more distant surfaces improved our final model, but this was not the case. Models trained on these individual features have fairly low F1 Scores, but combining them with other features yields significantly better models.
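As a quick check on how the F1 column relates to the others, the score is the harmonic mean of Precision and Sensitivity. The short sketch below uses the standard definition; the function name is ours, and the inputs are the rounded values from the ESP_T_0.5 row.

```python
def f1_score(sensitivity, precision):
    """Harmonic mean of precision and sensitivity (recall)."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)

# ESP_T_0.5 (Feature 35): Sensitivity 0.33, Precision 0.38
print(round(f1_score(0.33, 0.38), 2))  # 0.35
```

Specificity does not enter the formula directly. It matters only through Precision, which is high only when false positives are rare.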
Table 2. The best models for PSSM-based features on a subset with 1000 training data points
| PSSM features | Sensitivity | Specificity | Precision | F1 |
|---|---|---|---|---|
| PSSM[ (Features 281–380) | 0.29 | 0.89 | 0.37 | 0.32 |
| PSSM[ (Features 261–400) | 0.28 | 0.93 | 0.48 | 0.35 |
| PSSM[ (Features 241–420) | 0.30 | 0.94 | 0.54 | 0.38 |
| PSSM[ (Features 221–440) | 0.32 | 0.93 | 0.52 | 0.40 |
The feature numbers, as listed in Supplementary Table S3, are given in parentheses. These groups are nested so that the second group contains the first, and so on, up to the last group, which consists of all PSSM-based features. Predictive performance was comparable among the groups: larger scoring windows improved performance somewhat, but the improvement was not statistically significant.
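The nesting of the four feature ranges can be reproduced with simple index arithmetic. The sketch below rests on an inference from the ranges themselves rather than on anything stated here: each sequence position is assumed to contribute 20 PSSM scores, and the central residue is assumed to occupy features 321–340. The constant and function names are hypothetical.

```python
SCORES_PER_POSITION = 20   # assumed: one PSSM score per amino acid type
CENTER_FIRST = 321         # assumed first feature number of the central residue

def pssm_feature_range(half_width):
    """Return (first, last) feature numbers for a window of +/- half_width residues."""
    first = CENTER_FIRST - SCORES_PER_POSITION * half_width
    last = CENTER_FIRST + SCORES_PER_POSITION * (half_width + 1) - 1
    return first, last

for half_width in range(2, 6):
    print(half_width, pssm_feature_range(half_width))
```

Under these assumptions, half-widths of 2 through 5 residues give the ranges 281–380, 261–400, 241–420 and 221–440 listed in the table, each containing the previous one.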
Feature comparison and selection based on a subset with 1000 data points
| Feature | Sensitivity | Specificity | Precision | F1 |
|---|---|---|---|---|
| AVE_ESP1_0.5 (Feature 39) | 0.31 | 0.94 | 0.54 | 0.40 |
| Local Amino Acid Microenvironment (Features 133–152) | 0.24 | 0.85 | 0.27 | 0.26 |
| PSSM[ (Features 221–440) | 0.32 | 0.93 | 0.52 | 0.40 |
Training results for best electrostatic-based feature (Table 1), best PSSM-based feature group (Table 2), and best residue microenvironment feature group are compared.
Progressive feature combinations used to develop DBSI
| Iteration | Feature combination | Sensitivity | Specificity | Precision | F1 |
|---|---|---|---|---|---|
| 1 | NEAR_ESP1_0.5, PSSM[ (Features 37, 261–400) | 0.41 | 0.92 | 0.53 | 0.47 |
| 2 | NEAR_ESP1_0.5, PAA, PSSM[ (Features 13, 37, 261–400) | 0.41 | 0.94 | 0.60 | 0.49 |
| 3 | NEAR_ESP1_0.5, NEAR_ESP_0.3, PAA, PSSM[ (Features 13, 29, 37, 261–400) | 0.41 | 0.95 | 0.63 | 0.50 |
| 4 | NEAR_ESP1_0.5, NEAR_ESP_0.3, PAA, nnear_PTN, PSSM[ (Features 13, 29, 37, 175, 261–400) | 0.43 | 0.95 | 0.63 | 0.51 |
Based on the best combination of NEAR_ESP1_0.5 and the PSSM features, we successively introduced additional features. Descriptions of all features are in Supplementary Table S3.
Figure 3. ROC curves for the TRAIN-263 cross-validation results, along with the TEST-206, HOLO-30 and APO-29 predictions. In each case, the AUC is greater than 0.8, which indicates that DBSI is a highly predictive model.
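For readers who want to reproduce an AUC from per-residue scores, the area under the ROC curve equals the probability that a randomly chosen binding residue receives a higher score than a randomly chosen non-binding residue. The sketch below computes it by direct pair counting; the function name and the 0/1 label encoding are assumptions for illustration, and the quadratic loop is meant for clarity rather than for large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values above 0.8 mean that most binding residues outrank most non-binding ones.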
Predictive performance of DBSI on the training and independent data sets relative to a variety of performance metrics
| Data Set | Sensitivity | Specificity | Precision | Accuracy | F1 | Strength | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| TRAIN-263 | 0.70 | 0.85 | 0.50 | 0.82 | 0.58 | 0.77 | 0.48 | 0.86 |
| TEST-206 | 0.74 | 0.85 | 0.49 | 0.84 | 0.59 | 0.80 | 0.51 | 0.88 |
| APO-29 | 0.58 | 0.89 | 0.42 | 0.86 | 0.48 | 0.73 | 0.41 | 0.83 |
| HOLO-30 | 0.60 | 0.89 | 0.45 | 0.85 | 0.52 | 0.75 | 0.44 | 0.85 |
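The threshold-based metrics in this table can all be derived from a residue-level confusion matrix. The sketch below uses the standard definitions. Strength is not defined in this excerpt, so it is taken here as the mean of Sensitivity and Specificity, an assumption that agrees with the tabulated values to rounding. The function name is hypothetical, and the code assumes every marginal total is nonzero.

```python
import math

def binding_site_metrics(tp, fp, tn, fn):
    """Compute classification metrics from confusion-matrix counts.

    Assumes all marginal totals (tp+fn, tn+fp, tp+fp, tn+fn) are nonzero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Assumed definition: average of sensitivity and specificity.
    strength = (sensitivity + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "strength": strength,
        "mcc": mcc,
    }

# Hypothetical counts, chosen only to illustrate the calculation.
metrics = binding_site_metrics(tp=70, fp=70, tn=430, fn=30)
```

Because binding residues are a minority class, Accuracy and Specificity can stay high even when Precision is modest, which is why F1 and MCC are reported alongside them.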
Comparison of cross-validated results for DBSI and DISPLAR on the TRAIN-263 data set
| Method | Sensitivity | Specificity | Precision | Accuracy | F1 | Strength | MCC |
|---|---|---|---|---|---|---|---|
| DBSI | 0.70 | 0.85 | 0.50 | 0.82 | 0.58 | 0.77 | 0.48 |
| DISPLAR | 0.60 | 0.79 | 0.39 | 0.76 | 0.47 | 0.70 | 0.34 |
Comparison between DBSI and several other DNA-binding site prediction methods on the TEST-206 data set
| Method | Sensitivity | Specificity | Precision | Accuracy | F1 | Strength | MCC |
|---|---|---|---|---|---|---|---|
| DBSI | 0.74 | 0.85 | 0.49 | 0.84 | 0.59 | 0.80 | 0.51 |
| DISPLAR | 0.55 | 0.89 | 0.48 | 0.83 | 0.51 | 0.72 | 0.42 |
| BindN | 0.46 | 0.76 | 0.27 | 0.72 | 0.34 | 0.61 | 0.18 |
| BindN-rf | 0.56 | 0.83 | 0.38 | 0.79 | 0.45 | 0.69 | 0.34 |
| DBS-PRED | 0.46 | 0.73 | 0.25 | 0.69 | 0.32 | 0.60 | 0.16 |
| DNABindR | 0.60 | 0.72 | 0.29 | 0.70 | 0.39 | 0.66 | 0.25 |
| DP-Bind | 0.63 | 0.80 | 0.37 | 0.77 | 0.47 | 0.71 | 0.35 |
| metaDBsite | 0.54 | 0.80 | 0.34 | 0.76 | 0.42 | 0.67 | 0.29 |
| DBSI 0–30% | 0.68 | 0.84 | 0.43 | 0.82 | 0.53 | 0.76 | 0.44 |
| DBSI 30–60% | 0.81 | 0.87 | 0.58 | 0.86 | 0.68 | 0.84 | 0.61 |
| DBSI 60–100% | 0.76 | 0.86 | 0.51 | 0.85 | 0.61 | 0.81 | 0.54 |
| DISPLAR 0–30% | 0.50 | 0.89 | 0.45 | 0.83 | 0.47 | 0.70 | 0.38 |
| DISPLAR 30–60% | 0.65 | 0.88 | 0.55 | 0.84 | 0.59 | 0.76 | 0.50 |
| DISPLAR 60–100% | 0.56 | 0.89 | 0.49 | 0.84 | 0.52 | 0.73 | 0.43 |
| DP-Bind 0–30% | 0.62 | 0.79 | 0.33 | 0.76 | 0.43 | 0.70 | 0.32 |
| DP-Bind 30–60% | 0.69 | 0.79 | 0.41 | 0.77 | 0.51 | 0.74 | 0.40 |
| DP-Bind 60–100% | 0.62 | 0.82 | 0.39 | 0.79 | 0.48 | 0.72 | 0.37 |
In addition, we compare DBSI, DISPLAR and DP-Bind on three subsets of the TEST-206 data. Proteins in these subsets have homology in the ranges 0–30, 30–60 and 60–100% to examples in the TRAIN-263 data set.
Comparison between DBSI, DISPLAR and DP-Bind on the HOLO-30 and APO-29 data sets
| Method | Sensitivity | Specificity | Precision | Accuracy | F1 | Strength | MCC |
|---|---|---|---|---|---|---|---|
| DBSI | 0.60 (0.58) | 0.89 (0.89) | 0.45 (0.42) | 0.85 (0.86) | 0.52 (0.48) | 0.75 (0.73) | 0.44 (0.41) |
| DISPLAR | 0.38 (0.35) | 0.91 (0.92) | 0.40 (0.35) | 0.85 (0.85) | 0.39 (0.35) | 0.65 (0.63) | 0.30 (0.26) |
| DP-Bind | 0.61 (0.60) | 0.79 (0.79) | 0.34 (0.30) | 0.77 (0.76) | 0.44 (0.41) | 0.70 (0.69) | 0.32 (0.30) |
a There are two HOLO-30 examples, 3c46_A and 3ei2_A, and two APO-29 examples, 2po4_A and 3ei3_A, that could not be included in the DP-Bind results because their sequence lengths exceed 1000 residues. Also, the difference in results between the two data sets for DP-Bind is due to the inclusion of one additional example in the HOLO-30 data set, as sequence-based predictions are unaltered by protein conformation.
Results in parentheses are for the APO-29 data, whereas other numbers are for the HOLO-30 data.