Ruifeng Xu, Jiyun Zhou, Bin Liu, Lin Yao, Yulan He, Quan Zou, Xiaolong Wang.
Abstract
DNA-binding proteins are crucial for various cellular processes, such as recognition of specific nucleotides, regulation of transcription, and regulation of gene expression. Developing an effective model for identifying DNA-binding proteins is an urgent research problem. Many methods have been proposed so far, but most of them rely on a single classifier and cannot make full use of the large number of negative samples to improve predictive performance. This study proposes a predictor called enDNA-Prot for DNA-binding protein identification that employs ensemble learning. Experimental results showed that enDNA-Prot was comparable with DNA-Prot and outperformed DNAbinder and iDNA-Prot, with improvements in the range of 3.97-9.52% in ACC and 0.08-0.19 in MCC. Furthermore, when the benchmark dataset was expanded with negative samples, enDNA-Prot outperformed the three existing methods by 2.83-16.63% in ACC and 0.02-0.16 in MCC. These results indicate that enDNA-Prot is an effective method for DNA-binding protein identification and that expanding the training dataset with negative samples can improve its performance. For the convenience of experimental scientists, we developed a user-friendly web server for enDNA-Prot, which is freely accessible to the public.
Year: 2014 PMID: 24977146 PMCID: PMC4058174 DOI: 10.1155/2014/294279
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Summary of the datasets.
| Dataset | DNA-binding proteins | Non-DNA-binding proteins |
|---|---|---|
| Benchmark dataset | 146 | 250 |
| Expanded benchmark dataset | 146 | 2125 |
| Independent dataset1 | 82 | 100 |
| Independent dataset2 | 770 | 815 |
The three groups of amino acids for each physicochemical property.
| Physicochemical property | The 1st group | The 2nd group | The 3rd group |
|---|---|---|---|
| Hydrophobicity | RKEDQN | GASTPHY | CVLIMFW |
| Normalized van der Waals volume | GASCTPD | NVEQIL | MHKFRYW |
| Polarity | LIFWCMVY | PATGS | HQRKNED |
| Polarizability | GASDT | CPNVEQIL | KMHFRYW |
| Charge | KR | ANCQGHILMFPSTWYV | DE |
| Surface tension | GQDNAHR | KTSEC | ILMFPWYV |
| Secondary structure | EALMQKRH | VIYCWFT | GNPSD |
| Solvent accessibility | ALFCGIVW | RKQEND | MPSTHY |
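The grouping table lends itself to a simple lookup: each residue is replaced by the index (1, 2, or 3) of its group under a chosen property, and the fraction of residues in each group can then serve as a numeric feature. The sketch below illustrates that idea only; it is not the feature extraction used by enDNA-Prot, and the names `PROPERTY_GROUPS`, `group_index`, `encode`, and `group_composition` are hypothetical. It also assumes the second hydrophobicity group is GASTPHY, so that each of the 20 standard residues falls in exactly one group.

```python
# Three-group partitions of the 20 standard amino acids, transcribed from the table.
# Assumption: the second hydrophobicity group is taken to be "GASTPHY" so that
# every residue belongs to exactly one group.
PROPERTY_GROUPS = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CVLIMFW"),
    "van_der_waals_volume": ("GASCTPD", "NVEQIL", "MHKFRYW"),
    "polarity": ("LIFWCMVY", "PATGS", "HQRKNED"),
    "polarizability": ("GASDT", "CPNVEQIL", "KMHFRYW"),
    "charge": ("KR", "ANCQGHILMFPSTWYV", "DE"),
    "surface_tension": ("GQDNAHR", "KTSEC", "ILMFPWYV"),
    "secondary_structure": ("EALMQKRH", "VIYCWFT", "GNPSD"),
    "solvent_accessibility": ("ALFCGIVW", "RKQEND", "MPSTHY"),
}


def group_index(property_name, residue):
    """Return 1, 2, or 3 for the group containing `residue`, or 0 if it is not listed."""
    for index, members in enumerate(PROPERTY_GROUPS[property_name], start=1):
        if residue in members:
            return index
    return 0


def encode(sequence, property_name):
    """Map a protein sequence to a list of group indices for one property."""
    return [group_index(property_name, residue) for residue in sequence.upper()]


def group_composition(sequence, property_name):
    """Fraction of recognized residues falling in groups 1, 2, and 3."""
    codes = [code for code in encode(sequence, property_name) if code != 0]
    if not codes:
        return [0.0, 0.0, 0.0]
    return [codes.count(group) / len(codes) for group in (1, 2, 3)]
```

For example, `encode("KRDE", "charge")` yields the group indices of two positively charged residues followed by two negatively charged ones. Non-standard symbols such as `X` map to 0 and are ignored by `group_composition`.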
Figure 1. The framework of enDNA-Prot.
Algorithm 1. Pseudocode of Unbalanced-AdaBoost.
Performance for independent dataset1 (trained on benchmark dataset).
| Method | ACC (%) | MCC | SE (%) | SP (%) | F1-M (%) |
|---|---|---|---|---|---|
| DNAbinder(P21) | 79.00 | 0.61 | 54.87 | 98.08 | 70.31 |
| DNAbinder(P400) | 80.11 | 0.62 | 58.53 | 97.97 | 72.73 |
| DNA-Prot | 84.61 | 0.69 | 73.17 | 94.00 | 81.08 |
| iDNA-Prot | 77.47 | 0.55 | 78.05 | 77.00 | 75.73 |
| enDNA-Prot | 84.62 | 0.70 | 73.18 | 94.00 | 84.62 |
P400 and P21 denote the two vectorization methods PSSM-400 based DNAbinder and PSSM-21 based DNAbinder, respectively.
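The performance tables report accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (SE), specificity (SP), and F1-measure (F1-M). As a reading aid, the sketch below computes these from a binary confusion matrix using their standard definitions; the paper's own evaluation code is not shown here, and the function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fn: DNA-binding proteins predicted correctly / missed.
    tn/fp: non-binding proteins predicted correctly / wrongly flagged.
    Values are returned as fractions in [0, 1] (MCC in [-1, 1]).
    """
    total = tp + fn + tn + fp
    se = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0          # specificity
    acc = (tp + tn) / total if total else 0.0          # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0        # harmonic mean of precision and recall
    return {"ACC": acc, "MCC": mcc, "SE": se, "SP": sp, "F1": f1}


# Illustrative counts only (not taken from the paper): a test set with
# 82 positives and 100 negatives, the same size as independent dataset1.
example = binary_metrics(tp=60, fn=22, tn=94, fp=6)
```

Multiplying ACC, SE, SP, and F1 by 100 gives percentages comparable to the table columns. MCC is useful alongside ACC because it stays informative when the positive and negative classes differ greatly in size.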
Performance for independent dataset2 (trained on benchmark dataset).
| Method | ACC (%) | MCC | SE (%) | SP (%) | F1-M (%) |
|---|---|---|---|---|---|
| DNAbinder(P21) | 76.64 | 0.55 | 86.18 | 67.57 | 74.89 |
| DNAbinder(P400) | 76.38 | 0.52 | 72.35 | 80.19 | 75.23 |
| DNA-Prot | 77.74 | 0.56 | 85.19 | 70.71 | 78.79 |
| iDNA-Prot | 72.19 | 0.45 | 77.01 | 67.64 | 72.89 |
| enDNA-Prot | 81.71 | 0.64 | 84.55 | 79.05 | 81.71 |
P400 and P21 denote the two vectorization methods PSSM-400 based DNAbinder and PSSM-21 based DNAbinder, respectively.
Figure 2. The influence of n on performance.
Performance of enDNA-Prot trained on different datasets.
| Testing dataset | Training dataset | ACC (%) | MCC | SE (%) | SP (%) | F1-M (%) |
|---|---|---|---|---|---|---|
| ID1 | BD | 84.62 | 0.70 | 73.18 | 94.00 | 84.62 |
| ID1 | EBD1100 | 89.56 | 0.79 | 80.48 | 97.00 | 87.42 |
| ID2 | BD | 81.71 | 0.64 | 84.55 | 79.05 | 81.71 |
| ID2 | EBD1100 | 83.48 | 0.67 | 84.29 | 82.72 | 83.21 |
ID1 and ID2 denote the independent dataset1 and independent dataset2, respectively; BD and EBD1100 denote the benchmark dataset and expanded benchmark dataset1100, respectively.
Performance for independent dataset1 (trained on expanded benchmark dataset1100).
| Method | ACC (%) | MCC | SE (%) | SP (%) | F1-M (%) |
|---|---|---|---|---|---|
| DNAbinder(P21) | 72.93 | 0.52 | 42.24 | 100 | 57.39 |
| DNAbinder(P400) | 78.45 | 0.61 | 52.44 | 100 | 68.80 |
| DNA-Prot | 76.37 | 0.58 | 47.56 | 100 | 64.46 |
| iDNA-Prot | 76.92 | 0.58 | 50.00 | 99.00 | 66.13 |
| enDNA-Prot | 89.56 | 0.79 | 80.48 | 97.00 | 87.42 |
P400 and P21 denote the two vectorization methods PSSM-400 based DNAbinder and PSSM-21 based DNAbinder, respectively.
Performance for independent dataset2 (trained on expanded benchmark dataset1100).
| Method | ACC (%) | MCC | SE (%) | SP (%) | F1-M (%) |
|---|---|---|---|---|---|
| DNAbinder(P21) | 75.11 | 0.51 | 64.41 | 85.27 | 71.59 |
| DNAbinder(P400) | 81.65 | 0.65 | 67.14 | 95.42 | 78.09 |
| DNA-Prot | 79.07 | 0.60 | 65.32 | 92.03 | 75.19 |
| iDNA-Prot | 75.60 | 0.54 | 57.01 | 93.14 | 69.41 |
| enDNA-Prot | 83.48 | 0.67 | 84.29 | 82.72 | 83.21 |
P400 and P21 denote the two vectorization methods PSSM-400 based DNAbinder and PSSM-21 based DNAbinder, respectively.