Xiu-Juan Liu, Xiu-Jun Gong, Hua Yu, Jia-Hui Xu.
Abstract
Various machine learning-based approaches that use sequence information alone have been proposed for identifying DNA-binding proteins, which are crucial to many cellular processes, such as DNA replication, DNA repair and DNA modification. In these methods, building a meaningful feature representation of the sequences and choosing an appropriate classifier are the most critical tasks. Quantifying the significance and contribution of different feature spaces and classifiers to the final prediction matters not only for prediction performance but also for providing practical clues for the design of biological experiments. In this study, we propose a model stacking framework that orchestrates multi-view features and classifiers (MSFBinder) to investigate how to integrate and evaluate loosely coupled models for predicting DNA-binding proteins. The framework integrates multi-view features including Local_DPP, 188D, Position-Specific Scoring Matrix (PSSM)_DWT and the auto-cross covariance of secondary structures (AC_struct), which are extracted from evolutionary information, sequence composition, physicochemical properties and predicted structural information. These features are fed into various loosely coupled classifiers such as SVM and random forest. A logistic regression model is then applied to evaluate the contributions of the individual classifiers and to make the final prediction. On the training dataset PDB1075, the proposed method achieves an accuracy of 83.53%. On the independent dataset PDB186, it achieves an accuracy of 81.72%, which outperforms many existing methods. These results suggest that the framework can flexibly orchestrate various prediction models with good performance.
Keywords: DNA-binding proteins; logistic regression; model stacking; multi-view features
Year: 2018 PMID: 30071697 PMCID: PMC6116045 DOI: 10.3390/genes9080394
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The MSFBinder workflow.
The two benchmark datasets PDB1075 and PDB186.
| Dataset | DNA-Binding Proteins | Non-DNA-Binding Proteins | Ratio |
|---|---|---|---|
| PDB1075 | 525 | 550 | 0.9545 |
| PDB186 | 93 | 93 | 1 |
Figure 2. Discrete Wavelet Transform (DWT) feature extraction. The high-pass filter removes the low-frequency band of the discrete signal and retains the high-frequency band; the low-pass filter does the opposite.
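The filter-bank idea in Figure 2 can be illustrated with a minimal sketch. It assumes the Haar wavelet, which is chosen here only because it is the simplest case; the record does not say which wavelet or how many decomposition levels MSFBinder uses. The function names `haar_dwt` and `multilevel_dwt` are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT; returns (approximation, detail) coefficients."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd-length input by repeating the last value
    scale = math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums keep the low-frequency band.
    approx = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    # High-pass branch: scaled pairwise differences keep the high-frequency band.
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approx, detail


def multilevel_dwt(signal, levels):
    """Repeatedly decompose the approximation; returns (final_approx, details_per_level)."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details


# Hypothetical signal standing in for one PSSM column read along the sequence.
column = [4, 4, 2, 2, 6, 6, 8, 8]
final_approx, details = multilevel_dwt(column, levels=2)
```

Summary statistics of the coefficients at each level (for example their mean and spread) could then serve as fixed-length features, although the exact statistics used for PSSM_DWT are not given here.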
Categories of the eight physicochemical properties.
| Physicochemical Property | 1st Class | 2nd Class | 3rd Class |
|---|---|---|---|
| hydrophobicity | RKEDQN | GASTPHY | CVLIMFW |
| normalized Van der Waals volume | GASCTPD | NVEQIL | MHKFRYW |
| polarity | LIFWCMVY | PATGS | HQRKNED |
| polarizability | GASDT | CPNVEQIL | KMHFRYW |
| charge | KR | ANCQGHILMFPSTWYV | DE |
| Surface tension | GQDNAHR | KTSEC | ILMFPWYV |
| Secondary structure | EALMQKRH | VIYCWFT | GNPSD |
| Solvent accessibility | ALFCGIVW | RKQEND | MPSTHY |
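As an illustration of how such three-class groupings can be turned into numeric features, the sketch below computes class composition and class-transition frequencies for the hydrophobicity row of the table. Treating 188D as a composition/transition-style descriptor is an assumption of this sketch, and the helper names `encode`, `composition` and `transition` are hypothetical.

```python
# Three-class grouping for hydrophobicity, copied from the table above.
HYDROPHOBICITY = ("RKEDQN", "GASTPHY", "CVLIMFW")


def encode(seq, classes):
    """Map each residue to its class index (0, 1 or 2); unknown residues are skipped."""
    lookup = {aa: i for i, group in enumerate(classes) for aa in group}
    return [lookup[aa] for aa in seq if aa in lookup]


def composition(codes):
    """Fraction of residues that fall in each class (expects a non-empty list)."""
    return [codes.count(k) / len(codes) for k in range(3)]


def transition(codes):
    """Fraction of adjacent residue pairs that switch between two given classes."""
    pairs = list(zip(codes, codes[1:]))
    result = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        result.append(hits / len(pairs))
    return result


# Hypothetical six-residue fragment, used only to exercise the functions.
codes = encode("RKGACV", HYDROPHOBICITY)
comp = composition(codes)
trans = transition(codes)
```

Repeating this for all eight properties yields one block of descriptors per property, which can be concatenated with plain amino acid frequencies.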
Number of features via different extraction methods.
| Local_DPP | PSSM_DWT | 188D | AC_struct |
|---|---|---|---|
| | 1040 | 188 | 153 |
The performances of different feature representations.
| Feature | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| Local_DPP ( | 78.32 | 0.5681 | 81.14 | 75.63 |
| Local_DPP ( | 77.30 | 0.5539 | 84.57 | 70.36 |
| 188D | 75.16 | 0.5034 | 75.81 | 74.55 |
| PSSM_DWT | 72.47 | 0.4488 | 70.67 | 74.18 |
| AC_struct | 68.00 | 0.3595 | 63.62 | 72.18 |
The performances of three predictors on PDB1075.
| Predictor | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| MSFBinder (SVM) ( | 83.53 | 0.6707 | 83.81 | 83.27 |
| MSFBinder (SVM) ( | 83.35 | 0.6670 | 83.62 | 83.09 |
| MSFBinder (SVM, RF) ( | 84.74 | 0.6948 | 84.95 | 84.55 |
| MSFBinder (SVM, RF) ( | 84.84 | 0.6969 | 85.52 | 84.18 |
| MSFBinder (SVM, RF, NB) ( | 84.65 | 0.6935 | 86.10 | 83.27 |
| MSFBinder (SVM, RF, NB) ( | 84.28 | 0.6859 | 85.33 | 83.27 |
The performances of the three predictors on PDB186.
| Predictor | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| MSFBinder (SVM) ( | 81.72 | 0.6417 | 89.25 | 74.19 |
| MSFBinder (SVM) ( | 79.57 | 0.6160 | 93.55 | 65.59 |
| MSFBinder (SVM, RF) ( | 81.18 | 0.6343 | 90.32 | 72.04 |
| MSFBinder (SVM, RF) ( | 79.03 | 0.6028 | 92.47 | 65.59 |
| MSFBinder (SVM, RF, NB) ( | 80.65 | 0.6276 | 91.40 | 69.89 |
| MSFBinder (SVM, RF, NB) ( | 80.11 | 0.6215 | 92.47 | 67.74 |
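The four reported metrics follow from a confusion matrix using the standard definitions of accuracy (ACC), sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC). The sketch below is a worked check rather than the authors' code. The counts are an assumption: they are inferred from the first MSFBinder (SVM) row above together with the 93/93 class split of PDB186, and are not stated in the source.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return ACC, SN, SP (as fractions) and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: recall on DNA-binding proteins
    sp = tn / (tn + fp)  # specificity: recall on non-binding proteins
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}


# Inferred counts (assumption): 83 of 93 binders and 69 of 93 non-binders correct.
m = binary_metrics(tp=83, fn=10, tn=69, fp=24)
print({k: round(v, 4) for k, v in m.items()})
```

With these counts the sketch gives ACC 81.72%, MCC 0.6417, SN 89.25% and SP 74.19%, matching the first row of the PDB186 table.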
Figure 3. The coefficients of the four base models in the first predictor.
Figure 4. The coefficients of the eight base models in the second predictor.
Figure 5. The coefficients of the twelve base models in the third predictor.
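The coefficients in Figures 3 to 5 come from the logistic regression that combines the base models' outputs. The sketch below shows the idea with a tiny logistic regression fitted by plain gradient descent; it is not the authors' implementation. The toy data, the learning rate, the epoch count and the names `fit_meta_lr` and `predict` are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_lr(base_probs, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on base-model probabilities by gradient descent."""
    n_models = len(base_probs[0])
    w = [0.0] * n_models
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Stacked probability that a protein is DNA-binding."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical toy data: each row holds two base-model probabilities for one protein.
# Model 0 tracks the label closely; model 1 carries almost no signal.
base_probs = [[0.9, 0.5], [0.8, 0.6], [0.7, 0.4], [0.2, 0.5], [0.3, 0.6], [0.1, 0.4]]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_meta_lr(base_probs, labels)
```

In this toy setting the informative base model receives the larger coefficient, which is the kind of contribution the coefficient plots are meant to expose.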
Figure 6. T-test for the first predictor. The y-axis shows the distance between the p-value and the threshold; larger values indicate a greater distance. The x-axis shows the combinations of classifiers and feature extraction methods.
Figure 7. T-test for the second predictor.
Figure 8. T-test for the third predictor.
Performance comparisons with single classifiers.
| Method | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| SVM ( | 82.60 | 0.6527 | 84.19 | 81.09 |
| SVM ( | 82.14 | 0.6434 | 83.62 | 80.73 |
| RF ( | 81.58 | 0.6315 | 81.33 | 81.82 |
| RF ( | 80.84 | 0.6166 | 80.76 | 80.91 |
| LR ( | 81.86 | 0.6371 | 81.71 | 82.00 |
| LR ( | 82.33 | 0.6465 | 82.67 | 82.00 |
| MSFBinder (SVM) ( | 83.53 | 0.6707 | 83.81 | 83.27 |
| MSFBinder (SVM) ( | 83.35 | 0.6670 | 83.62 | 83.09 |
The performance comparisons to the majority voting-based methods.
| Method | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| Majority voting (LR) ( | 81.08 | 0.6230 | 78.89 | 83.24 |
| Majority voting (LR) ( | 81.39 | 0.6296 | 79.55 | 83.33 |
| Majority voting (RF) ( | 81.60 | 0.6338 | 82.51 | 80.94 |
| Majority voting (RF) ( | 81.28 | 0.6252 | 81.71 | 80.98 |
| Majority voting (SVM) ( | 81.77 | 0.6361 | 82.28 | 81.39 |
| Majority voting (SVM) ( | 81.08 | 0.6234 | 82.74 | 79.63 |
| MSFBinder (SVM) ( | 83.70 | 0.6744 | 84.47 | 83.19 |
| MSFBinder (SVM) ( | 82.47 | 0.6503 | 82.96 | 82.08 |
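For contrast with stacking, majority voting weights every base model equally. A minimal sketch follows; the tie-breaking rule is an assumption, because the source does not say how ties among an even number of base models are resolved, and the name `majority_vote` is hypothetical.

```python
def majority_vote(predictions, tie_label=1):
    """Combine binary base-model predictions (0 or 1) by unweighted majority vote."""
    votes_pos = sum(predictions)
    votes_neg = len(predictions) - votes_pos
    if votes_pos == votes_neg:
        return tie_label  # assumed tie-break: fall back to a fixed label
    return 1 if votes_pos > votes_neg else 0


# Hypothetical votes from four base models, one per feature representation.
clear_case = majority_vote([1, 1, 1, 0])
tied_case = majority_vote([1, 1, 0, 0])
```

Unlike the logistic regression meta-learner, this rule cannot down-weight a weak base model, which is one plausible reason for the gap in the table above.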
The means and standard deviations for the 5 × 5-fold cross-validations.
| Method | ACC | MCC | SN | SP |
|---|---|---|---|---|
| LR | | | | |
| RF | | | | |
| SVM | | | | |
| Majority voting (LR) | | | | |
| Majority voting (RF) | | | | |
| Majority voting (SVM) | | | | |
| MSFBinder (SVM) | | | | |
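A 5 × 5-fold cross-validation repeats 5-fold cross-validation five times with different shuffles and reports the mean and standard deviation of each metric. The sketch below generates such splits with the standard library only. The seed, the interleaved fold assignment, the per-repeat accuracies and the name `repeated_kfold` are hypothetical; the dataset size 1075 is taken from PDB1075.

```python
import random
import statistics


def repeated_kfold(n_samples, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle for every repeat
        folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, test_idx


splits = list(repeated_kfold(1075))

# Placeholder per-repeat accuracies (hypothetical numbers, not results from the paper).
repeat_acc = [0.83, 0.84, 0.82, 0.83, 0.84]
mean_acc = statistics.mean(repeat_acc)
std_acc = statistics.stdev(repeat_acc)
```

In practice each split would train the base models and the meta-learner on `train_idx`, score them on `test_idx`, and the per-repeat scores would replace the placeholder list.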
Performance comparisons to existing methods on the training set.
| Method | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| iDNA-Prot\|dis | 77.30 | 0.54 | 79.40 | 75.27 |
| iDNA-Prot | 75.40 | 0.50 | 83.81 | 64.73 |
| DNA-Prot | 72.55 | 0.44 | 82.67 | 59.76 |
| DNAbinder (dimension = 400) | 73.58 | 0.47 | 66.47 | 80.36 |
| DNAbinder (dimension = 21) | 73.95 | 0.48 | 68.57 | 79.09 |
| iDNAPro-PseAAC | 76.56 | 0.53 | 75.62 | 77.45 |
| Kmer1 + ACC | 75.23 | 0.50 | 76.76 | 73.76 |
| Local-DPP ( | 79.10 | 0.59 | 84.80 | 73.60 |
| Local-DPP ( | 79.20 | 0.59 | 84.00 | 74.50 |
| PSSM_DT | 79.96 | 0.62 | 78.00 | 81.91 |
| PSSM-DBT | 81.02 | 0.62 | 84.19 | 78.00 |
| iDNAProt-ES | | | | |
| MSFBinder (SVM) ( | 83.53 | 0.67 | 83.81 | 83.27 |
| MSFBinder (SVM) ( | 83.35 | 0.67 | 83.62 | 83.09 |
Performance comparisons to existing methods on the testing dataset.
| Method | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|
| iDNA-Prot\|dis | 72.0 | 0.445 | 79.5 | 64.5 |
| iDNA-Prot | 67.2 | 0.344 | 67.7 | 66.7 |
| DNA-Prot | 61.8 | 0.240 | 69.9 | 53.8 |
| DNAbinder | 60.8 | 0.216 | 57.0 | 64.5 |
| iDNAPro-PseAAC-EL | 71.5 | 0.442 | 82.8 | 60.2 |
| iDNA-KACC-EL | 79.0 | 0.611 | | 63.4 |
| Kmer1 + ACC | 71.0 | 0.431 | 82.8 | 59.1 |
| Local-DPP ( | 79.0 | 0.625 | 92.5 | 65.6 |
| Local-DPP ( | 77.4 | 0.568 | 90.3 | 64.5 |
| PSSM_DT | 80.00 | | 87.09 | 72.83 |
| PSSM-DBT | 80.65 | 0.624 | 90.32 | 70.97 |
| iDNAProt-ES | 80.64 | 0.6130 | 81.31 | |
| MSFBinder (SVM) ( | 81.72 | 0.6417 | 89.25 | 74.19 |
| MSFBinder (SVM) ( | 79.57 | 0.6160 | 93.55 | 65.59 |