Xin Ma, Jiansheng Wu, Xiaoyun Xue.
Abstract
DNA-binding proteins are fundamentally important for understanding cellular processes, so identifying them has important practical applications in fields such as drug design. We propose a novel method for predicting DNA-binding proteins using only sequence information. The prediction model is built with the support vector machine-sequential minimal optimization (SVM-SMO) algorithm in conjunction with a hybrid feature. The hybrid feature incorporates an evolutionary information feature, a physicochemical property feature, and two novel attributes. These two attributes use the DNA-binding residues and nonbinding residues in a query protein to obtain its DNA-binding propensity and nonbinding propensity. The results demonstrate that our SVM-SMO model achieves a Matthews correlation coefficient (MCC) of 0.67 and 89.6% overall accuracy, with 88.4% sensitivity and 90.8% specificity. Performance comparisons across feature sets indicate that the two novel attributes contribute to the performance improvement. In addition, our SVM-SMO model outperforms state-of-the-art methods on an independent test dataset.
Year: 2013 PMID: 24151525 PMCID: PMC3787635 DOI: 10.1155/2013/524502
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
The distribution of proteins in main dataset, training dataset, and independent test dataset.
| Dataset | Number of binding proteins | Number of nonbinding proteins | Total number of proteins |
|---|---|---|---|
| Main dataset | 6653 | 6653 | 13306 |
| Training dataset (TrD_10642) | 5321 | 5321 | 10642 |
| Independent test dataset (TeD_2664) | 1332 | 1332 | 2664 |
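The record does not say how the main dataset was divided, so the following is only a minimal sketch of one way to obtain a class-balanced hold-out split with the sizes in the table. The function name `balanced_split`, the random seed, and the placeholder protein identifiers are all assumptions made for illustration.

```python
import random

def balanced_split(binding, nonbinding, n_test_per_class, seed=0):
    """Hold out n_test_per_class items from each class; the rest form the training set."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in ((1, binding), (0, nonbinding)):
        shuffled = list(items)
        rng.shuffle(shuffled)
        test += [(x, label) for x in shuffled[:n_test_per_class]]
        train += [(x, label) for x in shuffled[n_test_per_class:]]
    return train, test

# Placeholder identifiers stand in for real protein records (hypothetical).
binding = [f"bind_{i}" for i in range(6653)]
nonbinding = [f"non_{i}" for i in range(6653)]

train, test = balanced_split(binding, nonbinding, n_test_per_class=1332)
print(len(train), len(test))  # 10642 2664
```

Holding out the same number of proteins from each class keeps both the training and the test set balanced, which matches the 5321/5321 and 1332/1332 counts above.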
The performance of different kinds of feature descriptors with various machine learning algorithms based on main dataset using 5-fold cross-validation.
| Machine learning algorithm | PP | EI | BP + NBP + PP | BP + NBP + EI | PP + EI | BP + NBP + PP + EI |
|---|---|---|---|---|---|---|
| Accuracy (%) | | | | | | |
| SVM-SMO | 83.2 | 85.7 | 85.6 | 87.3 | 87.3 | 89.6 |
| Simple logistic regression | 81.8 | 84.2 | 84.2 | 87.0 | 85.7 | 88.3 |
| Random forest | 81.3 | 84.3 | 83.5 | 86.7 | 85.9 | 88.1 |
| Naive Bayes | 78.6 | 77.2 | 82.8 | 82.3 | 82.6 | 84.3 |
| Decision tree | 80.2 | 82.5 | 82.6 | 84.4 | 84.1 | 86.2 |
| Sensitivity (%) | | | | | | |
| SVM-SMO | 82.4 | 84.9 | 84.4 | 86.5 | 85.8 | 88.4 |
| Simple logistic regression | 80.7 | 83.1 | 82.3 | 84.4 | 85.6 | 86.7 |
| Random forest | 81.1 | 83.6 | 82.8 | 86.0 | 85.3 | 86.2 |
| Naive Bayes | 76.9 | 76.1 | 79.4 | 80.8 | 81.1 | 82.6 |
| Decision tree | 78.6 | 80.4 | 81.7 | 82.7 | 82.5 | 84.7 |
| Specificity (%) | | | | | | |
| SVM-SMO | 84.6 | 86.3 | 86.7 | 88.2 | 88.6 | 90.8 |
| Simple logistic regression | 82.9 | 85.5 | 86.0 | 88.8 | 85.9 | 90.2 |
| Random forest | 81.6 | 85.2 | 84.1 | 87.5 | 86.3 | 90.0 |
| Naive Bayes | 80.2 | 78.5 | 85.6 | 83.8 | 84.7 | 86.0 |
| Decision tree | 81.8 | 84.7 | 83.5 | 86.2 | 85.7 | 87.7 |
| Matthews correlation coefficient | | | | | | |
| SVM-SMO | 0.55 | 0.58 | 0.62 | 0.66 | 0.66 | 0.67 |
| Simple logistic regression | 0.56 | 0.55 | 0.64 | 0.62 | 0.64 | 0.66 |
| Random forest | 0.55 | 0.56 | 0.60 | 0.62 | 0.63 | 0.66 |
| Naive Bayes | 0.52 | 0.49 | 0.56 | 0.53 | 0.54 | 0.59 |
| Decision tree | 0.53 | 0.55 | 0.61 | 0.63 | 0.62 | 0.64 |
| AUC | | | | | | |
| SVM-SMO | 0.83 | 0.86 | 0.86 | 0.88 | 0.87 | 0.90 |
| Simple logistic regression | 0.83 | 0.84 | 0.85 | 0.86 | 0.85 | 0.88 |
| Random forest | 0.81 | 0.84 | 0.84 | 0.86 | 0.85 | 0.87 |
| Naive Bayes | 0.78 | 0.76 | 0.80 | 0.79 | 0.80 | 0.82 |
| Decision tree | 0.80 | 0.82 | 0.83 | 0.84 | 0.84 | 0.86 |
BP: binding propensity feature; NBP: nonbinding propensity feature; PP: physicochemical property feature; EI: evolutionary information feature.
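The table reports 5-fold cross-validation on the main dataset. As a rough illustration of how such folds can be formed, the sketch below deals the indices of each class round-robin into five folds so every fold stays balanced. The function name `stratified_kfold`, the seed, and the label encoding (1 for binding, 0 for nonbinding) are assumptions; the record does not describe the authors' actual fold assignment.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin across folds
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Labels mirroring the main dataset: 6653 binding (1) and 6653 nonbinding (0).
labels = [1] * 6653 + [0] * 6653
splits = list(stratified_kfold(labels, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

In a full evaluation, a classifier would be trained on `train_idx` and scored on `test_idx` for each fold, and the per-fold metrics averaged to give figures like those in the table.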
Figure 1. Box plots of the two components of the BP feature for binding and nonbinding proteins. (a) BP(1); (b) BP(2).
Figure 2. Box plots of the two components of the NBP feature for binding and nonbinding proteins. (a) NBP(1); (b) NBP(2).
Figure 3. Three classifiers were tested on the same independent test dataset, TeD_2664. Accuracy: our SVM-SMO 74.88%, iDNA-Prot 54.06%, and DNA-Prot 47.30%. Sensitivity: our SVM-SMO 72.22%, iDNA-Prot 50.22%, and DNA-Prot 45.72%. Specificity: our SVM-SMO 77.55%, iDNA-Prot 57.88%, and DNA-Prot 48.87%. MCC: our SVM-SMO 0.4981, iDNA-Prot 0.0814, and DNA-Prot −0.0541.