| Literature DB >> 28937647 |
Kaiyang Qu1, Ke Han2, Song Wu3, Guohua Wang4, Leyi Wei5,6.
Abstract
DNA-binding proteins play vital roles in cellular processes such as DNA packaging, replication, transcription, and regulation. Current prediction methods are mostly based on machine learning, and their accuracy depends largely on the feature extraction method; an efficient feature representation is therefore important for improving classification accuracy. However, existing feature representation methods cannot efficiently distinguish DNA-binding proteins from non-DNA-binding proteins. In this paper, a multi-feature representation method combining three representations, namely K-Skip-N-Grams, Information theory, and sequential and structural features (SSF), is used to represent protein sequences and improve representation ability; a support vector machine serves as the classifier. The mixed-feature representation is evaluated with 10-fold cross-validation and an independent test set. Feature vectors obtained by combining all three extractions perform best in 10-fold cross-validation, both without dimensionality reduction and with reduction by max-relevance-max-distance (MRMD), and the reduced mixed features outperform the non-reduced ones. On the test set, the combination of SSF and K-Skip-N-Grams performs best. Overall, mixed features are superior to single features.
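To make the K-Skip-N-Grams representation concrete, here is a minimal sketch of a k-skip-2-gram extractor over the 20 standard amino acids. This is an illustrative reading of the technique, not the authors' code: it counts residue pairs separated by at most k intervening residues and normalizes the counts into a fixed 400-dimensional frequency vector.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIR_ORDER = list(product(AMINO_ACIDS, repeat=2))  # 400 ordered residue pairs

def k_skip_2_grams(sequence, k=2):
    """Count residue pairs (a, b) where b occurs with at most k residues
    skipped after a, and normalize by the total pair count.  Returns a
    fixed-length 400-dimensional frequency vector."""
    counts = dict.fromkeys(PAIR_ORDER, 0)
    total = 0
    n = len(sequence)
    for i in range(n):
        # pair residue i with each of the next k + 1 residues
        for j in range(i + 1, min(i + k + 2, n)):
            key = (sequence[i], sequence[j])
            if key in counts:  # silently skip non-standard residues
                counts[key] += 1
                total += 1
    return [counts[p] / total if total else 0.0 for p in PAIR_ORDER]
```

The normalization makes the vector independent of sequence length, which is why representations of this family suit variable-length proteins.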
Keywords: DNA-binding protein; mixed feature representation methods; support vector machine
Year: 2017 PMID: 28937647 PMCID: PMC6151557 DOI: 10.3390/molecules22101602
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Overview of the framework for the DNA-binding protein classifier. First, protein sequences are represented by Information theory, SSF, and K-Skip-N-Grams. The three feature sets are then combined, and max-relevance-max-distance (MRMD) is used to reduce the dimensionality. Finally, a support vector machine classifies the feature vectors generated by these steps.
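The classification and evaluation stages of this pipeline can be sketched with scikit-learn. This is not the authors' implementation; synthetic random features stand in for the real concatenated descriptors, and the SVM kernel and C value are assumptions, but the structure (combined feature matrix, SVM classifier, 10-fold cross-validation) matches the framework described above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Toy stand-in for the combined feature matrix (rows = proteins, columns =
# concatenated Information-theory + SSF + K-Skip-N-Grams features).
# Labels: 1 = DNA-binding, 0 = non-DNA-binding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 40)),
               rng.normal(0.8, 1.0, (60, 40))])
y = np.array([0] * 60 + [1] * 60)

# RBF-kernel SVM with feature scaling; kernel and C are illustrative choices.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(
    clf, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(f"10-fold CV accuracy: {scores.mean():.3f}")
```

Stratified folds keep the binding/non-binding ratio constant across splits, which matters for the small PDB186-sized datasets used in the paper.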
Results of single feature representation methods on the PDB186 dataset.

| Method | 10-Fold CV ACC (%) | Test SN (%) | Test SP (%) | Test MCC | Test ACC (%) |
|---|---|---|---|---|---|
| Information theory | 64.86 | 68.42 | 57.89 | 0.26 | 64.16 |
| K-Skip-N-Grams | 61.86 | 68.42 | 78.95 | 0.48 | 73.68 |
| SSF | 66.22 | 73.68 | 84.21 | 0.58 | 78.95 |
Results of protein classification on the PDB186 dataset using mixed feature representations.

| Method | 10-Fold CV ACC, no reduction (%) | Test ACC, no reduction (%) | 10-Fold CV ACC, MRMD-reduced (%) | Test ACC, MRMD-reduced (%) |
|---|---|---|---|---|
| SSF + K-Skip-N-Grams | 67.57 | 81.58 | 68.24 | 81.58 |
| Information theory + K-Skip-N-Grams | 66.22 | 55.26 | 64.19 | 63.16 |
| SSF + Information theory | 68.92 | 71.05 | 70.27 | 78.95 |
| SSF + Information theory + K-Skip-N-Grams | 69.59 | 71.05 | 71.62 | 71.05 |
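The MRMD reduction used above can be approximated with a simplified ranking: score each feature by its relevance to the labels plus its distance to the other features, then keep the top-scoring subset. This sketch assumes Pearson correlation for relevance and Euclidean distance between feature columns for the distance term; the actual MRMD tool may differ in both metrics and weighting.

```python
import numpy as np

def mrmd_rank(X, y, m):
    """Simplified MRMD-style ranking: score each feature column by its
    absolute Pearson correlation with the labels (max relevance) plus its
    mean Euclidean distance to the other columns (max distance), then
    return the indices of the top-m features."""
    n_feat = X.shape[1]
    Xs = (X - X.mean(0)) / (X.std(0) + 1e-12)
    ys = (y - y.mean()) / (y.std() + 1e-12)
    relevance = np.abs(Xs.T @ ys) / len(y)
    # pairwise Euclidean distances between standardized feature columns
    d = np.linalg.norm(Xs[:, :, None] - Xs[:, None, :], axis=0)
    distance = d.sum(1) / (n_feat - 1)
    # normalize both terms to [0, 1] before summing
    rel = relevance / (relevance.max() + 1e-12)
    dist = distance / (distance.max() + 1e-12)
    return np.argsort(rel + dist)[::-1][:m]
```

The distance term penalizes redundant (mutually close) features, which is why the reduced mixed features can outperform the full concatenation despite carrying fewer dimensions.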
Accuracy of existing feature representation methods using PDB1075 dataset.
| Method | References | ACC (%) | MCC | SN (%) | SP (%) |
|---|---|---|---|---|---|
| SSF + Information theory + K-Skip-N-Grams (reduced) | This paper | 77.43 | 0.55 | 77.84 | 77.05 |
| SSF + Information theory + K-Skip-N-Grams | This paper | 75.19 | 0.51 | 76.88 | 73.59 |
| PseDNA-Pro | [ | 76.55 | 0.53 | 79.61 | 73.63 |
| DNAbinder (P400) | [ | 73.58 | 0.47 | 66.47 | 80.36 |
| DNAbinder (P21) | [ | 73.95 | 0.48 | 68.57 | 79.09 |
| DNA-Prot | [ | 72.55 | 0.44 | 82.67 | 59.76 |
| iDNA-Prot | [ | 75.40 | 0.50 | 83.81 | 64.73 |
Results of classifying PDB186 with a random forest and single feature representations.
| Method | Ten-Cross Validation Accuracy (%) | Test Set Validation Accuracy (%) |
|---|---|---|
| Information theory | 61.49 | 52.63 |
| K-Skip-N-Grams | 56.76 | 78.95 |
| SSF | 62.16 | 76.32 |
Results of classifying PDB186 with a random forest and mixed feature representations.

| Method | 10-Fold CV ACC, no reduction (%) | Test ACC, no reduction (%) | 10-Fold CV ACC, MRMD-reduced (%) | Test ACC, MRMD-reduced (%) |
|---|---|---|---|---|
| SSF + K-Skip-N-Grams | 61.49 | 78.95 | 62.84 | 86.84 |
| Information theory + K-Skip-N-Grams | 62.84 | 73.68 | 63.51 | 73.68 |
| SSF + Information theory | 63.51 | 81.58 | 64.19 | 73.68 |
| SSF + Information theory + K-Skip-N-Grams | 57.43 | 78.95 | 61.49 | 81.58 |
Figure 2. Comparison of the classification accuracy of the single feature representation methods.
Figure 3. Comparison of the accuracy of the multiple feature representation methods.