Ziye Zhao, Wen Yang, Yixiao Zhai, Yingjian Liang, Yuming Zhao.
Abstract
The exploration of DNA-binding proteins (DBPs) is an important aspect of studying biological life activities. Research on life activities relies on findings about DBPs, and the decline of many life activities is closely related to DBPs. DBPs are generally identified through biochemical experiments, an approach that is inefficient and requires considerable manpower, material resources and time. Several computational approaches have since been developed to detect DBPs, among which machine learning (ML)-based techniques have shown excellent performance. In our experiments, our method uses fewer features and simpler recognition methods than other methods while still obtaining satisfactory results. First, we use six feature extraction methods to extract sequence features from the same group of DBPs. Then, these features are spliced together, and the data are standardized. Finally, the extreme gradient boosting (XGBoost) model is used to construct an effective predictive model. Compared with other excellent methods, our proposed method achieves better results: its accuracy is 78.26% for PDB2272 and 85.48% for PDB186. The accuracy achieved by our strategy is similar to that of previous detection methods.
Keywords: DNA-binding protein prediction; XGBoost model; dimensionality reduction; feature extraction; machine learning
Year: 2022 PMID: 35154264 PMCID: PMC8837382 DOI: 10.3389/fgene.2021.821996
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Process of predicting DBPs.
Dimensional information about the features.
| Feature extraction method | Dimensionality |
|---|---|
| GE | 150 |
| MCD | 882 |
| NMBAC | 200 |
| PSSM-AB | 200 |
| PSSM-DCT | 399 |
| PSSM-DWT | 1,040 |
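The splicing and standardization steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `splice` and `standardize` are hypothetical, and z-score scaling is an assumption, since the source only says that the data are standardized. The dimensionalities are the ones listed above; concatenating all six blocks gives 150 + 882 + 200 + 200 + 399 + 1,040 = 2,871 features per protein.

```python
from statistics import mean, pstdev

# Feature dimensionalities as listed in the table above.
FEATURE_DIMS = {
    "GE": 150,
    "MCD": 882,
    "NMBAC": 200,
    "PSSM-AB": 200,
    "PSSM-DCT": 399,
    "PSSM-DWT": 1040,
}
TOTAL_DIM = sum(FEATURE_DIMS.values())  # 2871


def splice(feature_blocks):
    """Concatenate the per-method feature vectors of one protein.

    `feature_blocks` maps a method name to its feature vector (hypothetical
    input layout). Blocks are joined in the order of FEATURE_DIMS.
    """
    spliced = []
    for name, dim in FEATURE_DIMS.items():
        block = feature_blocks[name]
        if len(block) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(block)}")
        spliced.extend(block)
    return spliced


def standardize(rows):
    """Z-score every column across samples (assumed scaling scheme).

    Columns with zero variance are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

The standardized matrix would then be passed to the classifier; the XGBoost training step itself is not shown here.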
Basic information about four standard data sets.
| Data set | Negative samples | Positive samples | Total |
|---|---|---|---|
| PDB14189 | 7,060 | 7,129 | 14,189 |
| PDB1075 | 550 | 525 | 1,075 |
| PDB2272 | 1,119 | 1,153 | 2,272 |
| PDB186 | 93 | 93 | 186 |
FIGURE 2. ROC curves of different feature extraction methods on PDB1075 data.
Performance of XGBoost on PDB1075 using different feature extraction methods.
| Model name | Feature extraction method | ACC (%) | SN (%) | MCC | Spec (%) |
|---|---|---|---|---|---|
| XGBoost | GE | 66.87 | 71.17 | 0.3342 | 62.09 |
| XGBoost | MCD | 69.04 | 70.00 | 0.3975 | 67.97 |
| XGBoost | NMBAC | 72.14 | 75.29 | 0.4404 | 68.62 |
| XGBoost | PSSM-AB | 76.47 | 75.29 | 0.5300 | 77.77 |
| XGBoost | PSSM-Pse | 74.30 | 75.88 | 0.4845 | 72.54 |
| XGBoost | PSSM-DWT | 74.92 | 74.70 | 0.4981 | 75.16 |
| XGBoost | The spliced sequence feature | | | | |
Bold indicates the best (highest) value in each column.
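The four metrics reported in these tables can be computed from confusion-matrix counts as sketched below, assuming the standard definitions of accuracy (ACC), sensitivity (SN), specificity (Spec) and the Matthews correlation coefficient (MCC). The function name `binary_metrics` is hypothetical, and the counts in the example are illustrative values chosen to approximate the PSSM-AB row; the source does not report a confusion matrix.

```python
from math import sqrt


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fn, tn, fp):
    """Compute ACC, SN, Spec (as fractions) and MCC from confusion counts."""
    acc = _ratio(tp + tn, tp + fn + tn + fp)
    sn = _ratio(tp, tp + fn)      # sensitivity: recall on the positive class
    spec = _ratio(tn, tn + fp)    # specificity: recall on the negative class
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {"ACC": acc, "SN": sn, "Spec": spec, "MCC": mcc}


# Hypothetical counts (not from the source), picked for illustration only.
example = binary_metrics(tp=128, fn=42, tn=119, fp=34)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the results agree with the PSSM-AB row to within rounding in the last digit.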
Comparison between the XGBoost model and other methods on the PDB186 data set.
| Models | ACC (%) | SN (%) | Spec (%) | MCC |
|---|---|---|---|---|
| iDNA-Prot\|dis | 72.0 | 79.5 | 64.5 | 0.445 |
| iDNA-Prot | 67.2 | 67.7 | 66.7 | 0.344 |
| DNA-Prot | 61.8 | 69.9 | 53.8 | 0.240 |
| DNAbinder | 60.8 | 57.0 | 64.5 | 0.216 |
| DBPPred | 76.9 | 79.6 | 74.2 | 0.538 |
| iDNAPro-PseAAC | 71.5 | 82.8 | 60.2 | 0.442 |
| Kmer1 + ACC | 71.0 | 82.8 | 59.1 | 0.431 |
| Local-DPP | 79.0 | 92.5 | 65.6 | 0.625 |
| DPP-PseAAC | 77.4 | 83.0 | 70.9 | 0.550 |
| MSFBinder | 79.6 | 93.6 | 65.6 | 0.616 |
| MsDBP | 80.1 | 86.0 | 74.2 | 0.606 |
| MKSVM-HKA | 81.2 | 94.6 | 67.7 | 0.648 |
| Adilina’s work | 82.3 | | 69.9 | 0.670 |
| XGBoost | **85.48** | 90.3 | | |
Bold indicates the best (highest) value in each column.
a The experimental results of other methods are taken from Wei et al. (2017).
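Because PDB186 contains 93 positive and 93 negative samples, accuracy on this data set reduces to the mean of sensitivity and specificity under the standard metric definitions. The short check below applies that identity to three rows copied from the table; the helper name `balanced_acc` is hypothetical, and small differences from the reported values are expected because the published figures are rounded.

```python
def balanced_acc(sn, spec):
    """Accuracy (%) implied by SN and Spec (%) when both classes are equal in size."""
    return (sn + spec) / 2


# (SN %, Spec %, reported ACC %) copied from three rows of the PDB186 table.
rows = {
    "Local-DPP": (92.5, 65.6, 79.0),
    "MsDBP": (86.0, 74.2, 80.1),
    "MKSVM-HKA": (94.6, 67.7, 81.2),
}

for name, (sn, spec, reported) in rows.items():
    implied = balanced_acc(sn, spec)
    print(f"{name}: implied ACC {implied:.2f}, reported ACC {reported:.1f}")
```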
Experimental findings for the independent data set PDB2272 using the XGBoost algorithm and other models.
| Methods | ACC (%) | MCC | SN (%) | Spec (%) |
|---|---|---|---|---|
| MK-FSVM-SVDD | 76.12 | 0.5476 | | 60.41 |
| DPP-PseAAC | 58.10 | 0.1625 | 56.63 | 59.61 |
| PseDNA-Pro | 61.88 | 0.2430 | 75.28 | 48.08 |
| MK-SVM | 75.00 | 0.5264 | 91.41 | 58.09 |
| MsDBP | 66.99 | 0.3397 | 70.69 | 63.18 |
| XGBoost | **78.26** | | 80.39 | |
Bold indicates the best (highest) value in each column.
a The experimental results of other methods are taken from Du et al. (2019) and Zou et al. (2021).