Wangchao Lou, Xiaoqing Wang, Fan Chen, Yixiao Chen, Bo Jiang, Hua Zhang.
Abstract
DNA-binding proteins play vital roles in gene regulation, so an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with a random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features draw on the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with a polynomial kernel, and support vector machine with a radial basis function kernel. DBPPred achieves the highest average accuracy of 0.791 and average MCC of 0.583 in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader), and DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean values differ statistically significantly between the DNA-binding and the non-DNA-binding proteins.
Taken together, the experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale identification of DNA-binding proteins.
Year: 2014 PMID: 24475169 PMCID: PMC3901691 DOI: 10.1371/journal.pone.0086703
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of the considered features, where x, x′ = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V} denotes the 20 AA types, y = {C, H, E} denotes the three secondary structure states, h = {0.1, 0.2, 0.3, 0.4, 0.5} denotes the cutoff used to categorize the buried/exposed residues based on their relative solvent accessibility, t = {0, 25, 50, 75, 100} denotes the ratio for computing the percentile values, and m = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} denotes the lag for calculating the auto-correlation coefficients.
| Category | Feature description | Abbreviation | No. of features |
| SS based | Content of the residues with secondary structure type y | Con_SSy | 3 |
| Average RSA based | Average RSA of the residues with AA type x | AveRSA_Resx | 20 |
| | Average RSA of the residues with secondary structure type y | AveRSA_SSy | 3 |
| | Average RSA of the residues with AA type x and secondary structure type y | AveRSA_Resx_SSy | 60 |
| Amino acid composition based | Composition of the residues with AA type x | AAC_Resx | 20 |
| | Composition of the residues with AA type x and secondary structure type y | AAC_Resx_SSy | 60 |
| | Composition of the buried residues with AA type x at cutoff h | AAC_Resx_Buh | 100 |
| | Composition of the exposed residues with AA type x at cutoff h | AAC_Resx_Exh | 100 |
| | Composition of dipeptide with the left AA type x and the right AA type x′ | DIC_Resxx′ | 400 |
| PSSM score based | Average PSSM score of the residues along the column of amino acid type x | AvePscore_AAx | 20 |
| | Average PSSM score of the residues with AA type x′ along the column of amino acid type x | AvePscore_AAx_Resx′ | 400 |
| | Percentile of the PSSM scores in the column of amino acid type x according to the percent threshold t | Pscore_AAx_Pt | 100 |
| | Auto-correlation coefficient of the scores in the column of amino acid type x with lag m | AutoCC_AAx_Lagm | 200 |
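The feature counts in the table follow from the sizes of the index sets defined in the caption. The sketch below recomputes them as Cartesian products. Which index sets each group varies over is an assumption read off the counts and the caption (for example, 60 = 20 AA types × 3 secondary structure states), and the group labels and the function name `feature_counts` are illustrative only.

```python
from itertools import product

AA = list("ARNDCQEGHILKMFPSTWYV")      # 20 amino-acid types (x, x')
SS = ["C", "H", "E"]                   # 3 secondary-structure states (y)
CUTOFFS = [0.1, 0.2, 0.3, 0.4, 0.5]    # RSA cutoffs for buried/exposed (h)
PERCENTS = [0, 25, 50, 75, 100]        # percentile thresholds (t)
LAGS = list(range(1, 11))              # auto-correlation lags (m)

# ASSUMPTION: each group is indexed by the Cartesian product of these sets.
GROUPS = {
    "SS content (y)": [SS],
    "Average RSA (x)": [AA],
    "Average RSA (y)": [SS],
    "Average RSA (x, y)": [AA, SS],
    "AA composition (x)": [AA],
    "AA composition (x, y)": [AA, SS],
    "AA composition, buried (x, h)": [AA, CUTOFFS],
    "AA composition, exposed (x, h)": [AA, CUTOFFS],
    "Dipeptide composition (x, x')": [AA, AA],
    "Average PSSM score (x)": [AA],
    "Average PSSM score (x, x')": [AA, AA],
    "PSSM percentile (x, t)": [AA, PERCENTS],
    "PSSM auto-correlation (x, m)": [AA, LAGS],
}

def feature_counts():
    """Return the number of features in each group."""
    return {name: len(list(product(*sets))) for name, sets in GROUPS.items()}

if __name__ == "__main__":
    counts = feature_counts()
    for name, n in counts.items():
        print(f"{name:35s} {n:4d}")
    print("total", sum(counts.values()))
```

Under these assumptions the per-group counts match the last column of the table, and their sum gives the size of the candidate feature pool before ranking and selection.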
Comparison of the prediction performance of the Gaussian naïve Bayes (GNB)-based wrapper, logistic regression (LogR)-based wrapper, decision tree (DT)-based wrapper, k-nearest neighbor (KNN)-based wrapper, and two support vector machine (SVM)-based wrappers with the RBF and polynomial kernels (denoted as SVM-RBF and SVM-Poly respectively).
| Wrapper method | 5-fold CV Sen | 5-fold CV Spe | 5-fold CV Acc | 5-fold CV MCC | Jackknife Sen | Jackknife Spe | Jackknife Acc | Jackknife MCC |
| GNB | 0.815±0.010 | 0.767±0.009 | | | 0.828 | | | |
| DT | 0.716±0.019 | 0.704±0.025 | 0.710±0.011 | 0.421±0.021 | 0.684 | 0.700 | 0.692 | 0.384 |
| LogR | 0.801±0.008 | 0.699±0.005 | 0.750±0.006 | 0.502±0.012 | 0.805 | 0.704 | 0.754 | 0.511 |
| KNN | 0.716±0.015 | | 0.743±0.008 | 0.487±0.016 | 0.721 | 0.771 | 0.746 | 0.492 |
| SVM-Poly | | 0.668±0.011 | 0.768±0.009 | 0.547±0.019 | | 0.687 | 0.771 | 0.550 |
| SVM-RBF | 0.830±0.013 | 0.746±0.006 | 0.788±0.008 | 0.578±0.016 | 0.848 | 0.754 | 0.801 | 0.605 |
Note: The CV tests were based on ten runs and the averages and the standard deviations are shown. The highest values are shown in bold.
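The four measures reported in the table (Sen, Spe, Acc, MCC) are all functions of the confusion matrix. A minimal sketch using the standard definitions is given below. The counts in the usage example are hypothetical and are not taken from the paper, and the function name `binary_metrics` is illustrative.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sen = tp / (tp + fn)                      # true positive rate
    spe = tn / (tn + fp)                      # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sen": sen, "spe": spe, "acc": acc, "mcc": mcc}

if __name__ == "__main__":
    # Hypothetical counts for 100 positives and 100 negatives.
    print(binary_metrics(tp=80, fn=20, tn=70, fp=30))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is a common selection criterion when sensitivity and specificity trade off against each other.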
Figure 1. The flowchart of the proposed method.
Figure 2. The improvement of MCC values (y axis) with the increasing number of selected features (x axis) during the wrapper-based feature selection.
A forward best-first search was executed using both ten runs of five-fold CV and jackknife tests on the PDB594 dataset. The standard deviations of the MCC values for the ten runs of five-fold CV are shown as error bars.
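The wrapper-based selection can be sketched in a few dozen lines. This is a simplified greedy variant written for illustration, not the authors' implementation. It assumes the features arrive already ranked (the random forest ranking is not reproduced), tries them in rank order, and keeps a feature only if it improves the cross-validated MCC of a small hand-written Gaussian naïve Bayes classifier. All function names are hypothetical, and the demo data are synthetic.

```python
import math
import random

def fit_gnb(X, y):
    """Estimate per-class prior, feature means and variances."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                     for col, m in zip(cols, means)]
        model[c] = (len(rows) / len(y), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class with the highest Gaussian log-posterior."""
    def log_post(c):
        prior, means, variances = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
            for v, m, s in zip(x, means, variances))
    return max(model, key=log_post)

def mcc(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cv_mcc(X, y, features, k=5, seed=0):
    """MCC pooled over k held-out folds, using only the given feature columns."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    y_true, y_pred = [], []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_gnb([[X[i][f] for f in features] for i in train],
                        [y[i] for i in train])
        for i in fold:
            y_true.append(y[i])
            y_pred.append(predict_gnb(model, [X[i][f] for f in features]))
    return mcc(y_true, y_pred)

def forward_select(X, y, ranked, k=5):
    """Try features in rank order; keep each one only if CV MCC improves."""
    selected, best = [], -1.0
    for f in ranked:
        score = cv_mcc(X, y, selected + [f], k)
        if score > best:
            selected, best = selected + [f], score
    return selected, best

if __name__ == "__main__":
    rng = random.Random(1)
    y = [i % 2 for i in range(100)]
    # Synthetic data: column 0 is informative, column 1 is pure noise.
    X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]
    print(forward_select(X, y, ranked=[0, 1]))
```

Plotting the running best MCC against the number of kept features gives a curve of the kind described for Figure 2. Repeating `cv_mcc` with different seeds would give the run-to-run spread shown there as error bars.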
Comparison of DBPPred with the existing methods based on independent blind tests on the same dataset PDB186.
| Method | Reference | Sensitivity | Specificity | Accuracy | MCC | AUC |
| DBPPred | This work | | 0.742 | 0.769 | 0.538 | 0.790 |
| iDNA-Prot | | 0.677 | 0.667 | 0.672 | 0.344 | N/A |
| DNA-Prot | | 0.699 | 0.538 | 0.618 | 0.240 | N/A |
| DNAbinder | | 0.570 | 0.645 | 0.608 | 0.216 | 0.607 |
| DNABIND | | 0.667 | 0.688 | 0.677 | 0.355 | 0.694 |
| DBD-Threader | | 0.237 | | 0.597 | 0.279 | N/A |
N/A means that the data are not available.
Figure 3. ROC curves for the predictions of DNA-binding proteins on the PDB186 dataset.
We compare the predictions of DBPPred with those of DNABIND and DNAbinder, which provide real-valued outputs.
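An ROC curve, and hence an AUC, can only be computed for predictors that return a real-valued score, which is why the comparison is limited to these methods. As a minimal sketch, the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The labels and scores below are hypothetical, and `roc_auc` is an illustrative name.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]   # hypothetical predictor outputs
    print(roc_auc(labels, scores))
```

This pairwise form is quadratic in the number of proteins, which is acceptable for a dataset of a few hundred entries. A rank-based formulation would be preferable for much larger sets.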
List of false positive rates of the proposed DBPPred and the existing iDNA-Prot, DNA-Prot, DNAbinder and DNABIND on datasets NDBP4025, RB174, RB256 and RB430.
| Method | NDBP4025 | RB174 | RB256 | RB430 |
| DBPPred | 0.254 | 0.534 | 0.527 | 0.530 |
| iDNA-Prot | 0.310 | 0.483 | 0.559 | 0.528 |
| DNA-Prot | 0.354 | 0.713 | 0.703 | 0.707 |
| DNAbinder | 0.325 | 0.672 | 0.652 | 0.660 |
| DNABIND | 0.299 | 0.741 | 0.727 | 0.733 |
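Because every protein in these four datasets is a negative (not DNA-binding), the false positive rate is simply the fraction of proteins that a method labels as DNA-binding. The sketch below shows that computation and converts a reported rate back into an approximate count. The dataset size used in the example is an assumption: it supposes that the name NDBP4025 encodes 4025 proteins, which is not stated in the text above. Both function names are illustrative.

```python
def false_positive_rate(predictions):
    """Fraction of known negatives predicted as DNA-binding (label 1)."""
    return sum(1 for p in predictions if p == 1) / len(predictions)

def implied_false_positives(rate, n_negatives):
    """Approximate number of false positives behind a reported rate."""
    return round(rate * n_negatives)

if __name__ == "__main__":
    # Toy predictions on four negatives, one of them misclassified.
    print(false_positive_rate([1, 0, 0, 0]))
    # ASSUMPTION: NDBP4025 contains 4025 proteins.
    print(implied_false_positives(0.254, 4025))
```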
The mean values of the selected 56 features and the P values that quantify significance of the differences between DNA-binding and non DNA-binding proteins for PDB594 dataset.
| Feature | Category | Mean±std (DNA-binding) | Mean±std (non DNA-binding) | P-value |
| Pscore_AAQ_P75 | PSSM score based | 0.696±0.095 | 0.626±0.124 | <10⁻³ |
| AvePscore_AAY_ResK | PSSM score based | 0.160±0.078 | 0.207±0.093 | <10⁻³ |
| AveRSA_ResG | Average RSA based | 0.310±0.076 | 0.271±0.061 | <10⁻³ |
| AvePscore_AAP | PSSM score based | 0.232±0.049 | 0.255±0.058 | <10⁻³ |
| DIC_ResKS | AA composition based | 0.005±0.006 | 0.004±0.005 | 0.019 |
| AvePscore_AAR_ResG | PSSM score based | 0.227±0.124 | 0.224±0.096 | 0.765 |
| AutoCC_AAN_Lag7 | PSSM score based | −0.014±0.106 | 0.010±0.095 | 0.003 |
| AvePscore_AAG_ResR | PSSM score based | 0.170±0.094 | 0.207±0.103 | <10⁻³ |
| AvePscore_AAL | PSSM score based | 0.323±0.057 | 0.329±0.048 | 0.117 |
| AvePscore_AAK_ResG | PSSM score based | 0.276±0.122 | 0.266±0.100 | 0.272 |
| AvePscore_AAQ | PSSM score based | 0.422±0.040 | 0.396±0.051 | <10⁻³ |
| AvePscore_AAR_ResM | PSSM score based | 0.250±0.164 | 0.229±0.124 | 0.080 |
| AAC_ResR_Ex0.2 | AA composition based | 0.092±0.043 | 0.080±0.037 | <10⁻³ |
| AvePscore_AAT | PSSM score based | 0.377±0.037 | 0.387±0.040 | 0.001 |
| AAC_ResD_Bu0.3 | AA composition based | 0.018±0.019 | 0.024±0.016 | <10⁻³ |
| AAC_ResV_Ex0.1 | AA composition based | 0.039±0.022 | 0.045±0.019 | <10⁻³ |
| AutoCC_AAR_Lag7 | PSSM score based | 0.008±0.091 | 0.031±0.100 | 0.004 |
| AAC_ResN | AA composition based | 0.038±0.021 | 0.043±0.022 | 0.003 |
| AvePscore_AAC_ResN | PSSM score based | 0.141±0.140 | 0.136±0.096 | 0.644 |
| AutoCC_AAI_Lag7 | PSSM score based | −0.015±0.092 | 0.009±0.097 | 0.002 |
| AAC_ResD_Bu0.2 | AA composition based | 0.012±0.018 | 0.016±0.015 | 0.004 |
| AvePscore_AAG_ResK | PSSM score based | 0.197±0.090 | 0.232±0.107 | <10⁻³ |
| AvePscore_AAR_ResE | PSSM score based | 0.493±0.100 | 0.478±0.085 | 0.044 |
| AutoCC_AAI_Lag8 | PSSM score based | −0.010±0.105 | −0.005±0.076 | 0.511 |
| AAC_ResA_Ex0.1 | AA composition based | 0.062±0.035 | 0.068±0.037 | 0.04 |
| AAC_ResT_Ex0.3 | AA composition based | 0.044±0.028 | 0.055±0.033 | <10⁻³ |
| AutoCC_AAP_Lag7 | PSSM score based | −0.011±0.110 | 0.021±0.096 | <10⁻³ |
| AutoCC_AAE_Lag4 | PSSM score based | 0.106±0.120 | 0.100±0.125 | 0.524 |
| AAC_ResE_Ex0.1 | AA composition based | 0.093±0.032 | 0.088±0.035 | 0.049 |
| AutoCC_AAI_Lag3 | PSSM score based | 0.029±0.092 | 0.007±0.085 | 0.003 |
| AutoCC_AAK_Lag7 | PSSM score based | 0.021±0.101 | 0.041±0.105 | 0.018 |
| AvePscore_AAR_ResR | PSSM score based | 0.930±0.086 | 0.892±0.121 | <10⁻³ |
| AutoCC_AAR_Lag9 | PSSM score based | −0.021±0.110 | −0.007±0.079 | 0.084 |
| AAC_ResL_Bu0.5 | AA composition based | 0.111±0.038 | 0.102±0.031 | 0.003 |
| AvePscore_AAR_ResW | PSSM score based | 0.124±0.200 | 0.167±0.202 | 0.008 |
| Pscore_AAR_P75 | PSSM score based | 0.661±0.154 | 0.584±0.152 | <10⁻³ |
| AAC_ResT_Ex0.4 | AA composition based | 0.035±0.033 | 0.044±0.036 | 0.002 |
| AvePscore_AAI_ResD | PSSM score based | 0.089±0.065 | 0.108±0.059 | <10⁻³ |
| AvePscore_AAN_ResR | PSSM score based | 0.402±0.106 | 0.417±0.108 | 0.105 |
| AutoCC_AAV_Lag8 | PSSM score based | −0.015±0.099 | −0.009±0.077 | 0.381 |
| AvePscore_AAH_ResW | PSSM score based | 0.145±0.202 | 0.214±0.211 | <10⁻³ |
| AvePscore_AAR | PSSM score based | 0.387±0.055 | 0.357±0.055 | <10⁻³ |
| AvePscore_AAW_ResT | PSSM score based | 0.094±0.083 | 0.148±0.110 | <10⁻³ |
| AAC_ResN_Bu0.2 | AA composition based | 0.011±0.016 | 0.015±0.015 | 0.001 |
| AutoCC_AAI_Lag4 | PSSM score based | 0.014±0.106 | −0.010±0.099 | 0.004 |
| AvePscore_AAE | PSSM score based | 0.407±0.043 | 0.391±0.055 | <10⁻³ |
| DIP_ResDL | AA composition based | 0.005±0.006 | 0.005±0.005 | 0.687 |
| AvePscore_AAN_ResI | PSSM score based | 0.104±0.085 | 0.129±0.080 | <10⁻³ |
| AutoCC_AAC_Lag7 | PSSM score based | −0.002±0.103 | 0.020±0.082 | 0.003 |
| AutoCC_AAL_Lag7 | PSSM score based | −0.005±0.101 | 0.016±0.103 | 0.012 |
| AvePscore_AAI_ResA | PSSM score based | 0.285±0.100 | 0.287±0.098 | 0.745 |
| AAC_ResA_Bu0.2 | AA composition based | 0.096±0.056 | 0.102±0.050 | 0.158 |
| AAC_ResE | AA composition based | 0.074±0.025 | 0.069±0.027 | 0.009 |
| AutoCC_AAT_Lag2 | PSSM score based | 0.011±0.101 | 0.050±0.090 | <10⁻³ |
| AAC_ResT_Ex0.2 | AA composition based | 0.052±0.027 | 0.061±0.029 | <10⁻³ |
| AAC_ResC_SSC | AA composition based | 0.011±0.023 | 0.016±0.024 | 0.019 |
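The text above does not say which statistical test produced the P-values, nor how many proteins are in each class. As a rough plausibility check only, the sketch below applies a two-sided Welch-style test with a normal approximation to the reported means and standard deviations. The group size of 297 per class is an assumption (an even split of the 594 proteins suggested by the dataset name), and `welch_z_test` is an illustrative name, so the resulting values need not match the table exactly.

```python
import math

def welch_z_test(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Two-sided test for a difference in means with unequal variances.

    Uses a normal approximation to the test statistic, which is reasonable
    for groups of a few hundred samples.
    """
    se = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # P(|Z| > |z|) for standard normal Z
    return z, p

if __name__ == "__main__":
    N = 297  # ASSUMPTION: equal class sizes, not stated in the source
    # AveRSA_ResG: reported P-value below 10^-3
    print(welch_z_test(0.310, 0.076, N, 0.271, 0.061, N))
    # AvePscore_AAR_ResG: reported P-value 0.765
    print(welch_z_test(0.227, 0.124, N, 0.224, 0.096, N))
```

Under these assumptions the first feature comes out highly significant and the second clearly not, in line with the reported P-values.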