Xiuzhi Sang, Wanyue Xiao, Huiwen Zheng, Yang Yang, Taigang Liu.
Abstract
Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science because of the crucial role DBPs play in all aspects of biological activities. Although considerable effort has been devoted to developing powerful computational methods for this problem, it remains a challenging task in bioinformatics. Hidden Markov model (HMM) profiles have been shown to provide important clues for improving DBP prediction performance. In this paper, we propose a method, called HMMPred, which extracts amino acid composition (AAC) and auto- and cross-covariance transformation (ACT and CCT) features from HMM profiles to train a machine learning model for identifying DBPs. A feature selection step based on the extreme gradient boosting (XGBoost) algorithm is then applied. Finally, the selected features are fed into a support vector machine (SVM) classifier to predict DBPs. Experimental results on two benchmark datasets show that the proposed method is superior to most existing methods and could serve as an alternative tool for identifying DBPs.
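The feature extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the HMM profile is already available as an L x 20 matrix (one row per residue, one column per amino acid type) and uses the commonly used auto- and cross-covariance definitions. The abstract does not state the exact formulas, normalization, or lag range used by HMMPred, so the function names, the `max_lag` parameter, and the normalization below are assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence

# Assumed input: L rows (residues) x 20 columns (amino acid types).
Profile = Sequence[Sequence[float]]


def column_means(profile: Profile) -> List[float]:
    """AAC-style features: the mean of each profile column."""
    n_rows, n_cols = len(profile), len(profile[0])
    return [sum(row[j] for row in profile) / n_rows for j in range(n_cols)]


def covariance_features(profile: Profile, max_lag: int) -> Dict[str, List[float]]:
    """Return AAC, auto-covariance (ACT) and cross-covariance (CCT) features.

    For each lag g in 1..max_lag and each column pair (j1, j2), the feature is
    the mean over i of (P[i][j1] - mean_j1) * (P[i + g][j2] - mean_j2).
    Pairs with j1 == j2 go to ACT, all other pairs go to CCT.
    """
    n_rows, n_cols = len(profile), len(profile[0])
    if not 1 <= max_lag < n_rows:
        raise ValueError("max_lag must satisfy 1 <= max_lag < number of rows")
    means = column_means(profile)
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in profile]
    act: List[float] = []
    cct: List[float] = []
    for g in range(1, max_lag + 1):
        span = n_rows - g  # number of residue pairs separated by lag g
        for j1 in range(n_cols):
            for j2 in range(n_cols):
                total = sum(centered[i][j1] * centered[i + g][j2] for i in range(span))
                (act if j1 == j2 else cct).append(total / span)
    return {"AAC": means, "ACT": act, "CCT": cct}
```

Under these assumptions, a 20-column profile with `max_lag = G` yields 20 AAC, 20·G ACT, and 380·G CCT features, which would then be concatenated before feature selection.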
Year: 2020 PMID: 32300371 PMCID: PMC7142336 DOI: 10.1155/2020/1384749
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1. Framework of the proposed method for DBP prediction.
Figure 2. Effect of different g values on prediction accuracy under two CV methods. (a) Prediction results using 10-fold CV. (b) Prediction results using jackknife CV.
Prediction results of SVM and RF classifiers based on the 10-fold CV.
| Classifier | Feature extraction method | ACC | SN | SP | MCC | AUC |
|---|---|---|---|---|---|---|
| SVM | AAC | 0.7893 | 0.8224 | 0.7582 | 0.5810 | 0.8586 |
| SVM | ACT | 0.7004 | 0.6795 | 0.7200 | 0.3999 | 0.7492 |
| SVM | CCT | 0.7678 | 0.7336 | 0.8000 | 0.5352 | 0.8309 |
| SVM | AAC+ACT+CCT | 0.8034 | 0.8147 | 0.7927 | 0.6071 | 0.8717 |
| RF | AAC | 0.7772 | 0.8147 | 0.7418 | 0.5571 | 0.8600 |
| RF | ACT | 0.7369 | 0.7394 | 0.7345 | 0.4737 | 0.8022 |
| RF | CCT | 0.7566 | 0.7896 | 0.7255 | 0.5154 | 0.8232 |
| RF | AAC+ACT+CCT | 0.7781 | 0.8205 | 0.7382 | 0.5596 | 0.8437 |
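The ACC, SN, SP, and MCC columns are not defined in this record. The sketch below assumes they follow the standard binary-classification definitions (accuracy, sensitivity, specificity, Matthews correlation coefficient) computed from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard metrics from confusion counts (assumes both classes are present)."""
    total = tp + fn + tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / total,   # fraction of all proteins classified correctly
        "SN": tp / (tp + fn),       # fraction of DBPs recovered
        "SP": tn / (tn + fp),       # fraction of non-DBPs recovered
        # MCC is conventionally set to 0 when any marginal count is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=80, fn=20, tn=70, fp=30)
```

MCC is the most informative single column when SN and SP diverge, since it accounts for all four counts at once.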
Prediction results of SVM and RF classifiers based on the jackknife CV.
| Classifier | Feature extraction method | ACC | SN | SP | MCC | AUC |
|---|---|---|---|---|---|---|
| SVM | AAC | 0.7912 | 0.8185 | 0.7655 | 0.5841 | 0.8663 |
| SVM | ACT | 0.7004 | 0.6795 | 0.7200 | 0.3999 | 0.7641 |
| SVM | CCT | 0.7650 | 0.7297 | 0.7982 | 0.5296 | 0.8373 |
| SVM | AAC+ACT+CCT | 0.8015 | 0.8127 | 0.7909 | 0.6034 | 0.8806 |
| RF | AAC | 0.7930 | 0.8161 | 0.7618 | 0.5885 | 0.8705 |
| RF | ACT | 0.7369 | 0.7413 | 0.7327 | 0.4738 | 0.8125 |
| RF | CCT | 0.7547 | 0.7761 | 0.7345 | 0.5106 | 0.8299 |
| RF | AAC+ACT+CCT | 0.7706 | 0.8050 | 0.7382 | 0.5437 | 0.8539 |
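The two tables above differ only in the cross-validation scheme. As a minimal sketch of how the two schemes relate, the jackknife test (leave-one-out) can be viewed as k-fold CV with one sample per fold. The helper names below are hypothetical, and the folds are contiguous and unshuffled for simplicity. The record does not say how the authors partitioned the data, and a real experiment would normally shuffle or stratify the samples first.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, test indices)


def k_fold_indices(n_samples: int, n_folds: int) -> Iterator[Split]:
    """Yield (train, test) index lists for contiguous, near-equal folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def jackknife_indices(n_samples: int) -> Iterator[Split]:
    """Jackknife (leave-one-out): every sample is its own test fold."""
    return k_fold_indices(n_samples, n_samples)
```

The jackknife trains one model per sample, so it is far more expensive than 10-fold CV but leaves no randomness in how the data are split.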
Figure 3. Effect of different feature subsets on prediction accuracy for two feature selection methods. (a) Prediction accuracy of SVM based on the 10-fold CV test. (b) Prediction accuracy of SVM based on the jackknife CV test.
Performance comparison before and after feature selection.
| Feature selection | CV method | ACC | SN | SP | MCC | AUC |
|---|---|---|---|---|---|---|
| Before | 10-fold | 0.8034 | 0.8147 | 0.7927 | 0.6071 | 0.8720 |
| Before | Jackknife | 0.8015 | 0.8127 | 0.7909 | 0.6034 | 0.8805 |
| RF | 10-fold | 0.8221 | 0.8243 | 0.8200 | 0.6441 | 0.8819 |
| RF | Jackknife | 0.8267 | 0.8262 | 0.8272 | 0.6533 | 0.8946 |
| XGBoost | 10-fold | 0.8371 | 0.8301 | 0.8436 | 0.6738 | 0.8896 |
| XGBoost | Jackknife | 0.8390 | 0.8398 | 0.8382 | 0.6778 | 0.9018 |
Figure 4. ROC curves of the SVM classifier before and after feature selection. (a) ROC curves based on the 10-fold CV. (b) ROC curves based on the jackknife CV.
Performance comparison on the PDB1075 dataset.
| Methods | ACC | SN | SP | MCC | AUC |
|---|---|---|---|---|---|
| DNAbinder | 0.7395 | 0.6857 | 0.7909 | 0.48 | 0.8140 |
| DNA-Prot | 0.7255 | 0.8267 | 0.5976 | 0.44 | 0.7890 |
| iDNA-Prot | 0.7540 | 0.8381 | 0.6473 | 0.50 | 0.7610 |
| iDNA-Prot\|dis | 0.7730 | 0.7940 | 0.7527 | 0.54 | 0.8260 |
| Kmer1+ACC | 0.7523 | 0.7676 | 0.7376 | 0.50 | 0.8280 |
| iDNAPro-PseAAC | 0.7656 | 0.7562 | 0.7745 | 0.53 | 0.8392 |
| PseDNA-Pro | 0.7655 | 0.7961 | 0.7363 | 0.53 | — |
| Local-DPP | 0.7920 | 0.8400 | 0.7450 | 0.59 | — |
| HMMBinder | 0.8633 | 0.8707 | 0.8555 | 0.72 | 0.9026 |
| Our method | 0.8390 | 0.8398 | 0.8382 | 0.68 | 0.9018 |
Performance comparison on the independent dataset.
| Methods | ACC | SN | SP | MCC | AUC |
|---|---|---|---|---|---|
| DNAbinder | 0.6080 | 0.5700 | 0.6450 | 0.216 | 0.6070 |
| DNA-Prot | 0.6180 | 0.6990 | 0.5380 | 0.240 | — |
| iDNA-Prot | 0.6720 | 0.6770 | 0.6670 | 0.344 | — |
| iDNA-Prot\|dis | 0.7200 | 0.7950 | 0.6450 | 0.445 | 0.7860 |
| Kmer1+ACC | 0.7096 | 0.8279 | 0.5913 | 0.431 | 0.7520 |
| iDNAPro-PseAAC | 0.7150 | 0.8276 | 0.6022 | 0.442 | 0.7780 |
| Local-DPP | 0.7900 | 0.9250 | 0.6560 | 0.625 | — |
| HMMBinder | 0.6902 | 0.6153 | 0.7634 | 0.394 | 0.6324 |
| Our method | 0.8118 | 0.9462 | 0.6774 | 0.648 | 0.8715 |