Yubo Wang, Yijie Ding, Fei Guo, Leyi Wei, Jijun Tang.
Abstract
Since the importance of DNA-binding proteins in multiple biomolecular functions was recognized, a growing number of researchers have attempted to identify them. As the volume of protein sequence data has soared in recent years, machine learning methods have become increasingly attractive because of their speed and accuracy. In this paper, we extract three features from the protein sequence: NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-Specific Scoring Matrix with Discrete Wavelet Transform), and PSSM-DCT (Position-Specific Scoring Matrix with Discrete Cosine Transform). We also apply a feature selection algorithm to these feature vectors. The resulting features are then used to train an SVM (support vector machine) classifier that predicts DNA-binding proteins. We evaluate the approach on three datasets: PDB1075, PDB594 and PDB186. PDB1075 and PDB594 are used for the jackknife test, and PDB186 is used for the independent test. In the jackknife test, our method achieves the best accuracy, improving on the best competing method from 79.20% to 86.23% on PDB1075 and from 80.5% to 86.20% on PDB594. In the independent test, the accuracy of our method is 76.3%, which indicates that the method can be used effectively for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.
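To make the PSSM-DCT idea concrete, the following is a minimal sketch of how a two-dimensional discrete cosine transform can turn a variable-length matrix into a fixed-length feature vector by keeping only a low-frequency block. The orthonormal DCT-II variant, the function names, the toy matrix and the 2 x 2 block size are all assumptions made for illustration; this record does not state which DCT variant or how many coefficients the authors retain, and a real PSSM has 20 columns.

```python
import math

def dct2_1d(x):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct2_2d(matrix):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(r) for r in matrix]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def low_freq_block(coeffs, n_rows, n_cols):
    """Flatten the top-left (low-frequency) block into a feature vector."""
    return [coeffs[i][j] for i in range(n_rows) for j in range(n_cols)]

# Toy 5 x 4 matrix standing in for a PSSM (hypothetical values).
toy_pssm = [
    [2.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, -2.0, 3.0],
    [0.0, 1.0, 1.0, -1.0],
    [-1.0, 2.0, 0.0, 0.0],
    [3.0, -2.0, 1.0, 1.0],
]
features = low_freq_block(dct2_2d(toy_pssm), n_rows=2, n_cols=2)
print(len(features))  # 4, regardless of how many rows the matrix has
```

Because the retained block has a fixed size, proteins of different lengths map to vectors of the same dimension, which is what a classifier such as an SVM requires.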
Year: 2017 PMID: 28961273 PMCID: PMC5621689 DOI: 10.1371/journal.pone.0185587
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The framework of our method.
Fig 2. Schematic diagram of a 4-level DWT.
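A 4-level decomposition of this kind can be sketched as follows. The example uses the Haar wavelet, pads odd-length inputs by repeating the last sample, and summarizes each coefficient band with max, min, mean and standard deviation. All of these choices, as well as the function names and the toy signal, are assumptions made for illustration; this record does not say which wavelet or which summary statistics the authors use.

```python
import math
import statistics

def haar_step(signal):
    """One Haar DWT level: return (approximation, detail), each half as long."""
    if len(signal) % 2:  # pad odd-length input by repeating the last sample
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def dwt_levels(signal, levels=4):
    """Apply haar_step repeatedly to the approximation band."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

def summarize(coeffs):
    """Fixed-length summary of one coefficient band."""
    return [max(coeffs), min(coeffs), statistics.fmean(coeffs), statistics.pstdev(coeffs)]

# Toy signal standing in for one PSSM column read along the sequence.
column = [1.0, 3.0, -2.0, 0.0, 4.0, 1.0, -1.0, 2.0,
          0.0, 0.0, 5.0, -3.0, 2.0, 2.0, 1.0, -1.0]
approx, details = dwt_levels(column, levels=4)
features = []
for band in details + [approx]:
    features.extend(summarize(band))
print(len(features))  # 20: four statistics for each of the five bands
```

Summarizing each band with a fixed set of statistics is one way to obtain the same feature dimension for every protein, independent of sequence length.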
Original values of six physicochemical properties of 20 amino acid types.
| Amino acid | H | VSC | P1 | P2 | SASA | NCISC |
|---|---|---|---|---|---|---|
| A | 0.62 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 |
| C | 0.29 | 44.6 | 5.5 | 0.128 | 1.461 | -0.03661 |
| D | -0.9 | 40 | 13 | 0.105 | 1.587 | -0.02382 |
| E | -0.74 | 62 | 12.3 | 0.151 | 1.862 | 0.006802 |
| F | 1.19 | 115.5 | 5.2 | 0.29 | 2.228 | 0.037552 |
| G | 0.48 | 0 | 9 | 0 | 0.881 | 0.179052 |
| H | -0.4 | 79 | 10.4 | 0.23 | 2.025 | -0.01069 |
| I | 1.38 | 93.5 | 5.2 | 0.186 | 1.81 | 0.021631 |
| K | -1.5 | 100 | 11.3 | 0.219 | 2.258 | 0.017708 |
| L | 1.06 | 93.5 | 4.9 | 0.186 | 1.931 | 0.051672 |
| M | 0.64 | 94.1 | 5.7 | 0.221 | 2.034 | 0.002683 |
| N | -0.78 | 58.7 | 11.6 | 0.134 | 1.655 | 0.005392 |
| P | 0.12 | 41.9 | 8 | 0.131 | 1.468 | 0.239531 |
| Q | -0.85 | 80.7 | 10.5 | 0.18 | 1.932 | 0.049211 |
| R | -2.53 | 105 | 10.5 | 0.291 | 2.56 | 0.043587 |
| S | -0.18 | 29.3 | 9.2 | 0.062 | 1.298 | 0.004627 |
| T | -0.05 | 51.3 | 8.6 | 0.108 | 1.525 | 0.003352 |
| V | 1.08 | 71.5 | 5.9 | 0.14 | 1.645 | 0.057004 |
| W | 0.81 | 145.5 | 5.4 | 0.409 | 2.663 | 0.037977 |
| Y | 0.26 | 117.3 | 6.2 | 0.298 | 2.368 | 0.023599 |
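As an illustration of how such property values feed into NMBAC, the sketch below standardizes the hydrophobicity column (H) over the 20 amino acids and computes a lag-based autocorrelation along a sequence. The formula used here, the average of products of standardized values at positions i and i + lag, is one common formulation of the normalized Moreau-Broto autocorrelation; the use of the population standard deviation, the function names and the example sequence are assumptions, and the maximum lag corresponds to what we take to be the lg parameter explored in Fig 3.

```python
import statistics

# Hydrophobicity (H) column copied from the table above.
H = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}

def normalize(prop):
    """Standardize a property to zero mean and unit variance over the 20 residues."""
    values = list(prop.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return {aa: (v - mean) / std for aa, v in prop.items()}

def autocorrelation(seq, prop, max_lag):
    """Return one autocorrelation value per lag in 1..max_lag."""
    values = [prop[aa] for aa in seq]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    feats = []
    for lag in range(1, max_lag + 1):
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        feats.append(total / (n - lag))
    return feats

h_norm = normalize(H)
toy_sequence = "MKRLAVDEGFWYAKR"  # hypothetical sequence, for illustration only
nmbac_h = autocorrelation(toy_sequence, h_norm, max_lag=5)
print(len(nmbac_h))  # 5 values, one per lag
```

Repeating this for each of the six properties and concatenating the results gives a vector whose length depends only on the number of properties and the maximum lag, not on the sequence length.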
Fig 3. The accuracy of different lg values on PDB1075 (five-fold cross-validation).
Fig 4. The accuracy of different m values on PDB1075 (five-fold cross-validation).
The performance of different features on PDB1075 dataset (Jackknife test evaluation).
| Feature | ACC(%) | MCC | SN(%) | SP(%) |
|---|---|---|---|---|
| NMBAC | 74.05 | 0.4836 | 77.90 | 70.36 |
| PSSM-DCT | 70.60 | 0.4117 | 66.86 | 74.18 |
| PSSM-DWT | 75.07 | 0.5010 | 73.33 | 76.73 |
| NMBAC+PSSM-DCT | 78.05 | 0.5606 | 77.14 | 78.91 |
| NMBAC+PSSM-DWT | 78.70 | 0.5740 | 79.24 | 78.18 |
| PSSM-DWT+PSSM-DCT | 73.77 | 0.4752 | 73.52 | 74.00 |
| NMBAC+PSSM-DWT+PSSM-DCT | 79.26 | 0.5853 | 80.00 | 78.55 |
Fig 5. The AUROC comparison of seven feature combinations through Jackknife cross-validation on PDB1075 dataset.
Fig 6. The feature score through SVM-RFE+CBR on the dataset of PDB1075.
The x-axis represents the feature index.
Fig 7. The accuracy of different dimension features on PDB1075 dataset (Jackknife test evaluation).
Fig 8. The AUROC comparison of seven feature combinations through Jackknife cross-validation on PDB1075 dataset.
* indicates that feature selection was applied to this feature combination.
The performance of different features after feature selection on PDB1075 dataset (Jackknife test evaluation).
| Feature | ACC(%) | MCC | SN(%) | SP(%) |
|---|---|---|---|---|
| NMBAC | 76.09 | 0.5218 | 76.19 | 76.00 |
| PSSM-DCT | 74.60 | 0.4928 | 76.19 | 73.09 |
| PSSM-DWT | 81.02 | 0.6213 | 82.86 | 79.27 |
| NMBAC+PSSM-DCT | 81.40 | 0.6276 | 80.38 | 82.36 |
| NMBAC+PSSM-DWT | 84.93 | 0.6987 | 85.52 | 84.36 |
| PSSM-DWT+PSSM-DCT | 78.33 | 0.5664 | 78.29 | 78.36 |
| NMBAC+PSSM-DWT+PSSM-DCT | 86.23 | 0.7250 | 87.43 | 85.09 |
* indicates that feature selection was applied to this feature combination.
The performance of our method and other existing methods on PDB1075 dataset (Jackknife test evaluation).
| Methods | ACC(%) | MCC | SN(%) | SP(%) |
|---|---|---|---|---|
| IDNA-Prot\|dis | 77.30 | 0.54 | 79.40 | 75.27 |
| PseDNA-Pro | 76.55 | 0.53 | 79.61 | 73.63 |
| IDNA-Prot | 75.40 | 0.50 | 83.81 | 64.73 |
| DNA-Prot | 72.55 | 0.44 | 82.67 | 59.76 |
| DNAbinder(dimension = 400) | 73.58 | 0.47 | 66.47 | 80.36 |
| DNAbinder(dimension = 21) | 73.95 | 0.48 | 68.57 | 79.09 |
| iDNAPro-PseAAC | 76.56 | 0.53 | 75.62 | 77.45 |
| Kmer1+ACC | 75.23 | 0.50 | 76.76 | 73.76 |
| Local-DPP(n = 3,λ = 1) | 79.10 | 0.59 | 84.80 | 73.60 |
| Local-DPP(n = 2,λ = 2) | 79.20 | 0.59 | 84.00 | 74.50 |
| Our method | 86.23 | 0.73 | 87.43 | 85.09 |
The performance of our method and other existing methods on PDB594 dataset (Jackknife test evaluation).
| Methods | ACC(%) | MCC | SN(%) | SP(%) |
|---|---|---|---|---|
| GNB-based-wrapper | 80.5 | 0.610 | 82.8 | 78.1 |
| DT-based-wrapper | 69.2 | 0.384 | 68.4 | 70.0 |
| LongR-based-wrapper | 75.4 | 0.511 | 80.5 | 70.4 |
| KNN-based-wrapper | 74.6 | 0.492 | 72.1 | 77.1 |
| SVM-Poly-based-wrapper | 77.1 | 0.550 | 85.5 | 68.7 |
| SVM-RBF-based-wrapper | 80.1 | 0.605 | 84.8 | 75.4 |
| Our method | 86.2 | 0.724 | 87.2 | 85.2 |
The performance of our method and other existing methods on PDB186 dataset.
| Methods | ACC(%) | MCC | SN(%) | SP(%) |
|---|---|---|---|---|
| IDNA-Prot\|dis | 72.0 | 0.445 | 79.5 | 64.5 |
| IDNA-Prot | 67.2 | 0.344 | 67.7 | 66.7 |
| DNA-Prot | 61.8 | 0.240 | 69.9 | 53.8 |
| DNAbinder | 60.8 | 0.216 | 57.0 | 64.5 |
| DNABIND | 67.7 | 0.355 | 66.7 | 68.8 |
| DNA-Threader | 59.7 | 0.279 | 23.7 | 95.7 |
| DBPPred | 76.9 | 0.538 | 79.6 | 74.2 |
| iDNAPro-PseAAC-EL | 71.5 | 0.442 | 82.8 | 60.2 |
| Kmer1+ACC | 71.0 | 0.431 | 82.8 | 59.1 |
| Local-DPP(n = 3,λ = 1) | 79.0 | 0.625 | 92.5 | 65.6 |
| Local-DPP(n = 2,λ = 2) | 77.4 | 0.568 | 90.3 | 64.5 |
| Our method | 76.3 | 0.557 | 92.5 | 60.2 |
PDB1075 serves as the training dataset and PDB186 as the independent test dataset.
The computational time of feature extraction and jackknife test evaluation on PDB1075.
| Feature | FE(sec) | JT(sec) | JT-FS(sec) |
|---|---|---|---|
| NMBAC | 3.09 | 2317.6 | 486.7 |
| DCT | 187.98 | 1357.5 | 352.0 |
| DWT | 299.75 | 16166.0 | 757.8 |
| NMBAC+DCT+DWT | 490.82 | 17520.0 | 1642.0 |
Column “FE” gives the computational time of feature extraction on PDB1075. Column “JT” gives the computational time of the jackknife test on PDB1075 without feature selection. Column “JT-FS” gives the computational time of the jackknife test on PDB1075 with feature selection.
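The jackknife times above reflect the structure of the test itself: each sample is held out once and a model is fitted on all remaining samples, so the number of fits equals the number of samples. A minimal sketch of that loop follows. To keep it dependency-free, it uses a nearest-centroid rule as a stand-in classifier, not the SVM used in the paper; the function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (not an SVM): pick the class with the closest mean vector."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [v for v, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def jackknife_accuracy(xs, ys, predict):
    """Leave-one-out evaluation: one fit-and-predict round per sample."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # every sample except the i-th
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy two-class data (hypothetical feature vectors).
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
ys = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(xs, ys, nearest_centroid_predict))  # 1.0
```

Since every round refits the model, reducing the feature dimension shortens each fit, which is consistent with the lower JT-FS times in the table.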