Jun Wang, Huiwen Zheng, Yang Yang, Wanyue Xiao, Taigang Liu.
Abstract
DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, the identification of DBPs by wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely based on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, four base classifiers are trained with the features of HMM-based compositions. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. Based on the PDB1075 benchmark dataset, we conduct a jackknife cross-validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
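The abstract describes encoding each protein by AAC (20 residue frequencies) plus TPC derived from its HMM profile. The paper's exact TPC formula is not given here, so the sketch below uses one plausible formulation: treating the HMM profile as an L × 20 matrix of per-position emission probabilities and accumulating the probability mass of each ordered residue-type pair at consecutive positions, yielding a 400-dimensional vector. Function names and the random toy profile are illustrative, not the authors' code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence: str) -> np.ndarray:
    """Amino acid composition: frequency of each of the 20 residue types."""
    counts = np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

def tpc(hmm_profile: np.ndarray) -> np.ndarray:
    """Transition probability composition from an L x 20 HMM profile.
    Entry (i, j) accumulates the probability of residue type i at one
    position followed by type j at the next; normalised to sum to 1."""
    trans = hmm_profile[:-1].T @ hmm_profile[1:]   # 20 x 20 pair matrix
    return (trans / trans.sum()).ravel()           # flatten to 400-d

# Toy example: a random, row-normalised profile for a 50-residue protein.
rng = np.random.default_rng(0)
profile = rng.random((50, 20))
profile /= profile.sum(axis=1, keepdims=True)      # each row sums to 1

features = np.concatenate([aac(AMINO_ACIDS * 2), tpc(profile)])
print(features.shape)   # 20 + 400 = 420-dimensional representation
```

Concatenating the two compositions gives a fixed-length feature vector regardless of protein length, which is what allows standard classifiers to consume variable-length sequences.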
Year: 2020 PMID: 32352006 PMCID: PMC7174956 DOI: 10.1155/2020/7297631
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. System diagram of PredDBP-Stack.
Figure 2. The framework of a two-stage stacked ensemble scheme.
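The two-stage scheme of Figure 2 — base classifiers emit class probabilities that a meta-classifier then combines — can be sketched with scikit-learn's `StackingClassifier`. The source does not specify which four base learners feed the final stack (the tables evaluate DT, KNN, LR, XGB, RF, and SVM), so the combination below and the synthetic 420-dimensional data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for 420-d HMM-based feature vectors (DBP vs. non-DBP).
X, y = make_classification(n_samples=200, n_features=420, random_state=0)

# Stage 1: base classifiers; StackingClassifier feeds their predicted
# probabilities (out-of-fold, via internal cv) to the meta-classifier.
base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Stage 2: a logistic-regression meta-classifier makes the final call.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
train_acc = stack.score(X, y)
```

Using out-of-fold probabilities (the `cv=5` argument) rather than in-sample predictions is what keeps the meta-classifier from simply memorising the base learners' training-set outputs.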
Performance comparison of six base classifiers on the PDB1075 dataset using the jackknife test.
| Method | OA (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| DT | 74.53 | 74.71 | 74.36 | 0.4906 | 0.7838 |
| KNN | 76.22 | 75.68 | 76.73 | 0.5240 | 0.8364 |
| LR | 78.18 | 78.19 | 78.18 | 0.5635 | 0.8508 |
| XGB | 78.74 | 75.64 | 80.69 | 0.5634 | 0.8624 |
| RF | 78.28 | 83.40 | 73.45 | 0.5702 | 0.8648 |
| SVM | 80.34 | 81.66 | 79.27 | 0.6091 | 0.8774 |
Figure 3. ROC curves of LR, DT, and SVM classifiers on the PDB1075 dataset using the jackknife test.
Performance comparison of five stacked models (SMs) on the PDB1075 dataset using the jackknife test.
| Method | OA (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| SM1 | 92.42 | 92.47 | 92.36 | 0.8482 | 0.9677 |
| SM2 | 92.23 | 91.89 | 92.55 | 0.8444 | 0.9664 |
| SM3 | 91.76 | 91.31 | 92.18 | 0.8350 | 0.9635 |
| SM4 | 79.87 | 82.82 | 77.09 | 0.5993 | 0.8745 |
| SM5 | 90.54 | 90.93 | 90.18 | 0.8108 | 0.9560 |
Figure 4. ROC curves of SM1 and its base classifiers on the PDB1075 dataset.
Performance comparison on the benchmark dataset PDB1075.
| Method | OA (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| DNA-Prot | 72.55 | 82.67 | 59.76 | 0.44 | 0.7890 |
| iDNA-Prot | 75.40 | 83.81 | 64.73 | 0.50 | 0.7610 |
| iDNA-Prot\|dis | 77.30 | 79.40 | 75.27 | 0.54 | 0.8260 |
| DNABinder | 73.95 | 68.57 | 79.09 | 0.48 | 0.8140 |
| Kmerl+ACC | 75.23 | 76.76 | 73.76 | 0.50 | 0.8280 |
| iDNAPro-PseAAC | 76.76 | 75.62 | 77.45 | 0.53 | 0.8392 |
| Local-DPP | 79.20 | 84.00 | 74.50 | 0.59 | — |
| HMMBinder | 86.33 | 87.07 | 85.55 | 0.72 | 0.9026 |
| StackDPPred | 89.96 | 91.12 | 88.80 | 0.80 | 0.9449 |
| Our method | 92.42 | 92.47 | 92.36 | 0.85 | 0.9677 |
Performance comparison on the independent dataset PDB186.
| Method | OA (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| DNA-Prot | 61.80 | 69.90 | 53.80 | 0.240 | 0.7960 |
| iDNA-Prot | 67.20 | 67.70 | 66.70 | 0.334 | 0.8330 |
| iDNA-Prot\|dis | 72.00 | 79.50 | 64.50 | 0.445 | 0.7860 |
| DNABinder | 60.80 | 57.00 | 64.50 | 0.216 | 0.6070 |
| Kmerl+ACC | 70.96 | 82.79 | 59.13 | 0.431 | 0.7520 |
| iDNAPro-PseAAC | 69.89 | 77.41 | 62.37 | 0.402 | 0.7754 |
| Local-DPP | 79.00 | 92.50 | 65.60 | 0.625 | — |
| HMMBinder | 69.02 | 61.53 | 76.34 | 0.394 | 0.6324 |
| StackDPPred | 86.55 | 92.47 | 80.64 | 0.736 | 0.8878 |
| Our method | 86.56 | 87.10 | 86.02 | 0.731 | 0.8932 |