Shahana Yasmin Chowdhury, Swakkhar Shatabda, Abdollah Dehzangi.
Abstract
DNA-binding proteins play a very important role in the structural composition of DNA. In addition, they regulate and affect various cellular processes such as transcription, DNA replication, DNA recombination, repair, and modification. The experimental methods used to identify DNA-binding proteins are expensive and time-consuming, which has attracted researchers from the computational field to address the problem. In this paper, we present iDNAProt-ES, a DNA-binding protein prediction method that uses both sequence-based evolutionary features and structure-based features of proteins to identify their DNA-binding functionality. We used recursive feature elimination to extract an optimal set of features and trained a Support Vector Machine (SVM) with a linear kernel on them to obtain the final model. Our proposed method significantly outperforms the existing state-of-the-art predictors on the standard benchmark dataset. The accuracy of the predictor is 90.18% using the jackknife test and 88.87% using 10-fold cross-validation on the benchmark dataset. The accuracy of the predictor on the independent dataset is 80.64%, which is also significantly better than the state-of-the-art methods. iDNAProt-ES is a novel prediction method that uses evolutionary and structure-based features. We believe the superior performance of iDNAProt-ES will motivate researchers to use this method to identify DNA-binding proteins. iDNAProt-ES is publicly available as a web server at http://brl.uiu.ac.bd/iDNAProt-ES/.
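The feature-selection step named in the abstract can be illustrated with a short sketch of the recursive feature elimination loop. This is not the authors' implementation: the class-mean difference used as the linear model is a simple stand-in for the linear-kernel SVM, and the function names and toy data are hypothetical.

```python
from statistics import mean

def centroid_weights(X, y, active):
    """Stand-in linear model (assumption): one weight per active feature,
    equal to the difference between the class means of that feature."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return {j: mean(r[j] for r in pos) - mean(r[j] for r in neg) for j in active}

def recursive_feature_elimination(X, y, n_keep, step=1):
    """Refit on the surviving features and drop the `step` weakest each round."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        weights = centroid_weights(X, y, active)
        n_drop = min(step, len(active) - n_keep)
        # Features with the smallest |weight| contribute least to the decision.
        weakest = sorted(active, key=lambda j: abs(weights[j]))[:n_drop]
        active = [j for j in active if j not in weakest]
    return active

# Hypothetical toy data: feature 0 separates the classes well,
# feature 1 weakly, feature 2 is constant and carries no signal.
X = [[1.0, 0.4, 5.0], [0.9, 0.5, 5.0], [0.1, 0.1, 5.0], [0.0, 0.2, 5.0]]
y = [1, 1, 0, 0]
selected = recursive_feature_elimination(X, y, n_keep=2)
```

With this stand-in the weights of the surviving features do not change between rounds, so the loop reduces to a one-shot ranking. With an SVM, the weights are re-estimated after each elimination, which is what makes the procedure recursive.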
Year: 2017 PMID: 29097781 PMCID: PMC5668250 DOI: 10.1038/s41598-017-14945-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of performance of the proposed method with other state-of-the-art predictors using the jackknife test on the benchmark dataset.
| Method | Accuracy | Sensitivity | Specificity | MCC | auROC |
|---|---|---|---|---|---|
| iDNAPro-PseAAC | 76.76% | 0.7562 | 0.7745 | 0.53 | 0.8392 |
| DNAbinder (dimension 21) | 73.95% | 0.6857 | 0.7909 | 0.48 | 0.8140 |
| DNAbinder (dimension 400) | 73.58% | 0.6647 | 0.8036 | 0.47 | 0.8150 |
| DNA-Prot | 72.55% | 0.8267 | 0.5976 | 0.44 | 0.7890 |
| iDNA-Prot | 75.40% | 0.8381 | 0.6473 | 0.50 | 0.7610 |
| iDNA-Prot\|dis | 77.30% | 0.7940 | 0.7527 | 0.54 | 0.8310 |
| PseDNA-Pro | 76.55% | 0.7961 | 0.7363 | 0.53 | — |
| Kmer1 + ACC | 75.23% | 0.7676 | 0.7376 | 0.50 | 0.8280 |
| Local-DPP | 79.20% | 0.8400 | 0.7450 | 0.59 | — |
| iDNAProt-ES | 90.18% | | | | |
Comparison of performance of the proposed method with other state-of-the-art predictors on the independent dataset.
| Method | Accuracy | Sensitivity | Specificity | MCC | auROC |
|---|---|---|---|---|---|
| iDNAPro-PseAAC | 69.89% | 0.7741 | 0.6237 | 0.402 | 0.7754 |
| iDNA-Prot | 67.20% | 0.6770 | 0.6670 | 0.344 | — |
| DNA-Prot | 61.80% | 0.6990 | 0.5380 | 0.240 | — |
| DNAbinder | 60.80% | 0.5700 | 0.6450 | 0.216 | 0.6070 |
| DNABIND | 67.70% | 0.6670 | 0.6880 | 0.355 | 0.6940 |
| DNA-Threader | 59.70% | 0.2370 | | 0.279 | — |
| DBPPred | 76.90% | 0.7960 | 0.7420 | 0.538 | 0.7910 |
| iDNA-Prot\|dis | 72.00% | 0.7950 | 0.6450 | 0.445 | 0.7860 |
| Kmer1 + ACC | 70.96% | 0.8279 | 0.5913 | 0.431 | 0.7520 |
| Local-DPP | 79.00% | | 0.6560 | | — |
| iDNAProt-ES | 80.64% | 0.8131 | 0.8000 | 0.6130 | |
Figure 1. Effect of the number of features selected on the accuracy on the benchmark dataset.
Figure 2. Color map showing the importance or ranking of the features on the benchmark dataset.
Comparison of performance of different feature selection methods on the benchmark dataset using 10-fold cross validation.
| Method | Accuracy | Sensitivity | Specificity | MCC | auROC | auPR |
|---|---|---|---|---|---|---|
| RFE | 88.87% | | | | | |
| Tree Based Method | 70.93% | 0.7627 | 0.6480 | 0.4196 | 0.7775 | 0.6470 |
| Sparse Elimination | 75.98% | 0.7727 | 0.7461 | 0.5210 | 0.8308 | 0.7464 |
| No Feature Selection | 74.01% | 0.7581 | 0.7211 | 0.4835 | 0.8224 | 0.7242 |
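The two evaluation protocols reported here differ only in how the data are partitioned. The jackknife test is leave-one-out cross-validation, which is k-fold cross-validation with k equal to the number of samples. The sketch below generates the train/test index splits for both. The function names are hypothetical, the split is not stratified by class, and k is assumed not to exceed the number of samples.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def jackknife_indices(n_samples):
    """Jackknife (leave-one-out): every sample is the test set exactly once."""
    return k_fold_indices(n_samples, k=n_samples)

ten_fold = list(k_fold_indices(n_samples=25, k=10))
leave_one_out = list(jackknife_indices(n_samples=5))
```

The jackknife result is deterministic for a given dataset, whereas a 10-fold estimate depends on how the folds are drawn, so the two accuracies need not match.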
Figure 3. Receiver Operating Characteristic (ROC) curve of different feature selection methods on the benchmark dataset.
Comparison of performance of different Classifiers on the benchmark dataset using 10-fold cross validation.
| Classifier | Accuracy | Sensitivity | Specificity | MCC | auROC | auPR |
|---|---|---|---|---|---|---|
| SVM (linear kernel) | 88.87% | | | | | |
| SVM (rbf kernel) | 81.96% | 0.8309 | 0.8076 | 0.6415 | 0.8866 | 0.8117 |
| SVM (sigmoid kernel) | 56.07% | 0.5672 | 0.5538 | 0.1218 | 0.6010 | 0.5527 |
| Random Forest | 70.56% | 0.7636 | 0.6442 | 0.4107 | 0.7881 | 0.6451 |
| Naive Bayes | 61.58% | 0.7545 | 0.4692 | 0.2362 | 0.7005 | 0.4726 |
| Logistic Regression | 86.72% | 0.8800 | 0.8538 | 0.7359 | 0.9359 | 0.8567 |
Figure 4. Receiver Operating Characteristic (ROC) curve of different classifiers for the benchmark dataset.
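The auROC column summarizes a ROC curve as a single number. One way to compute it without tracing the curve uses the fact that the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below applies that definition directly. The function name and the example scores are hypothetical.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores, for illustration only.
auc = au_roc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples. It is fine for a sketch, but a rank-based formulation is preferable for large datasets.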
Figure 5. System flow diagram of iDNAProt-ES showing the training and prediction procedure as a flowchart.
Figure 6. Screenshot of the web server homepage.
Summary of evolutionary and structural features used in this paper.
| Feature Name | Feature Type | Feature Vector Size |
|---|---|---|
| Amino acid composition | Evolutionary (PSSM) | 20 |
| Dubchak feature | Evolutionary (PSSM) | 105 |
| Bigram | Evolutionary (PSSM) | 400 |
| PSSM composition | Evolutionary (PSSM) | 20 |
| PSSM auto covariance | Evolutionary (PSSM) | 200 |
| One lead bigram | Evolutionary (PSSM) | 400 |
| Segmented distribution | Evolutionary (PSSM) | 200 |
| Secondary structure composition | Structural (SPD3) | 3 |
| Secondary structure occurrence | Structural (SPD3) | 3 |
| ASA, Angle occurrence, probability of CHE | Structural (SPD3) | 12 |
| Bigram of angle sine cosine | Structural (SPD3) | 64 |
| Angles auto covariance | Structural (SPD3) | 80 |
| Bigram probabilities | Structural (SPD3) | 9 |
| Probabilities auto covariance | Structural (SPD3) | 30 |
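To show where a 400-dimensional vector can come from, the sketch below uses the common definition of a PSSM bigram: for each ordered pair of amino-acid columns (k, l), sum the product of the score of k at position i and the score of l at position i + 1 over the sequence, giving 20 × 20 = 400 values. This is an assumption about the feature, since the table gives only its name and size. The `gap` parameter, the reading of "one lead bigram" as `gap=2` (skipping one residue), and the absence of any normalization are also assumptions.

```python
def pssm_bigram(pssm, gap=1):
    """Flattened bigram matrix of a PSSM given as a list of per-residue rows.

    pssm[i][k] is the (assumed already normalized) score of amino acid k
    at sequence position i. Entry (k, l) of the result accumulates
    pssm[i][k] * pssm[i + gap][l] over all valid positions i.
    """
    n_cols = len(pssm[0])
    counts = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(len(pssm) - gap):
        for k in range(n_cols):
            for l in range(n_cols):
                counts[k][l] += pssm[i][k] * pssm[i + gap][l]
    # Row-major flattening: 20 columns give a 400-dimensional vector.
    return [value for row in counts for value in row]

# Hypothetical 3-residue profile over a 2-letter alphabet, for illustration.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
adjacent = pssm_bigram(toy)         # pairs (i, i + 1)
skipped = pssm_bigram(toy, gap=2)   # pairs (i, i + 2)
```

Because the output size depends only on the number of columns, proteins of different lengths map to vectors of the same dimension, which a fixed-input classifier such as an SVM requires.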