Wei-Zhong Lin, Jian-An Fang, Xuan Xiao, Kuo-Chen Chou.
Abstract
DNA-binding proteins play crucial roles in various cellular processes. Developing high-throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating features extracted from protein sequences via the "grey model" into the general form of pseudo amino acid composition, and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding or non-DNA-binding proteins based on their amino acid sequence information alone. The overall success rate of iDNA-Prot was 83.96%, obtained via jackknife tests on a newly constructed stringent benchmark dataset in which no protein has ≥25% pairwise sequence identity to any other in the same subset. In addition to achieving a high success rate, iDNA-Prot requires remarkably less computational time than the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high-throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.
Year: 2011 PMID: 21935457 PMCID: PMC3174210 DOI: 10.1371/journal.pone.0024756
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The numerical codes of 20 native amino acids.
| Amino acid | Factor score | Numerical code |
| A | | 0.325 |
| C | | 0.297 |
| D | | 0.025 |
| E | 1.477 | 0.814 |
| F | 1.891 | 0.869 |
| G | 1.330 | 0.791 |
| H | | 0.158 |
| I | 2.131 | 0.849 |
| K | 0.533 | 0.630 |
| L | | 0.182 |
| M | 2.219 | 0.902 |
| N | 1.299 | 0.720 |
| P | | 0.164 |
| Q | | 0.047 |
| R | 1.502 | 0.818 |
| S | | 0.008 |
| T | 2.213 | 0.901 |
| V | | 0.367 |
| W | 0.672 | 0.662 |
| Y | 3.097 | 0.957 |
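As an illustration of how such codes might be applied, the sketch below maps a protein sequence to a numerical series. It assumes that the last value in each row of the table is the code for that residue; the names `NUMERICAL_CODE` and `encode_sequence` are hypothetical and do not come from iDNA-Prot, and the grey-model feature extraction itself is not reproduced here.

```python
# Numerical codes copied from the table above (last value in each row).
# Assumption: this value, not the factor score, is what encodes a residue.
NUMERICAL_CODE = {
    "A": 0.325, "C": 0.297, "D": 0.025, "E": 0.814, "F": 0.869,
    "G": 0.791, "H": 0.158, "I": 0.849, "K": 0.630, "L": 0.182,
    "M": 0.902, "N": 0.720, "P": 0.164, "Q": 0.047, "R": 0.818,
    "S": 0.008, "T": 0.901, "V": 0.367, "W": 0.662, "Y": 0.957,
}


def encode_sequence(sequence):
    """Map a protein sequence to its series of numerical codes.

    Raises ValueError on any character that is not one of the
    20 native amino acids, rather than guessing a value for it.
    """
    codes = []
    for position, residue in enumerate(sequence.upper(), start=1):
        if residue not in NUMERICAL_CODE:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        codes.append(NUMERICAL_CODE[residue])
    return codes


if __name__ == "__main__":
    # "MKRE" is an arbitrary four-residue example, not a real protein.
    print(encode_sequence("MKRE"))
```

Rejecting unknown characters keeps ambiguous residue symbols from being silently assigned a code.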
Results obtained by iDNA-Prot on the benchmark dataset of Information S1 through the jackknife test.
| Protein type | Number of proteins | Number of correct predictions | Success rate |
| DNA-binding protein | 212 | 179 | 84.43% |
| Non DNA-binding protein | 212 | 177 | 83.49% |
| Overall | 424 | 356 | 83.96% |
The following parameters were used for the Random Forest algorithm: the number of trees grown was 560, and the number of predictors sampled for splitting at each node was 5.
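The jackknife (leave-one-out) procedure behind these figures can be sketched as follows. This is a minimal illustration under stated assumptions: a 1-nearest-neighbour rule stands in for the Random Forest so that the example needs only the standard library, the feature vectors are made up, and the function names are hypothetical. Only the final line uses numbers from the table.

```python
import math


def nearest_neighbor_predict(train, query):
    """Stand-in classifier (1-nearest neighbour), NOT the Random Forest
    used by iDNA-Prot; it only makes the jackknife loop runnable."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


def jackknife(samples, predict=nearest_neighbor_predict):
    """Leave-one-out test: each sample is predicted by a rule that
    has seen every other sample but not the sample itself.

    Returns {label: (n_correct, n_total)} per class.
    """
    tally = {}
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct, total = tally.get(label, (0, 0))
        hit = predict(train, features) == label
        tally[label] = (correct + int(hit), total + 1)
    return tally


def success_rate(correct, total):
    """Success rate as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors, made up for illustration only.
    toy = [
        ((0.10, 0.20), "DBP"), ((0.20, 0.10), "DBP"), ((0.15, 0.25), "DBP"),
        ((0.90, 0.80), "non DBP"), ((0.80, 0.90), "non DBP"), ((0.85, 0.75), "non DBP"),
    ]
    for label, (correct, total) in jackknife(toy).items():
        print(label, correct, total, success_rate(correct, total))

    # Counts reported in the table: per-class and overall success rates.
    print(success_rate(179, 212), success_rate(177, 212), success_rate(356, 424))
```

The overall rate is pooled over both classes (356 correct out of 424), not an average of the two per-class percentages.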
A comparison of the predicted results by DNA-Prot [22] and iDNA-Prot on the independent dataset in Information S3.
| Protein type | DNA-Prot | iDNA-Prot |
| DNA-binding protein | 87/122 = 71.31% | 109/122 = 89.34% |
| Non DNA-binding protein | 101/122 = 82.79% | 111/122 = 90.98% |
| Overall | 188/244 = 77.05% | 220/244 = 90.16% |
The predicted results by iDNA-Prot on the 100 DNA-binding hypothetical proteins from http://www.ncbi.nlm.nih.gov/protein/?term=DNAbindinghypothetical.
| GI code | Predicted result | GI code | Predicted result | GI code | Predicted result |
| 21960164 | DBP | 29122980 | non DBP | 26832636 | DBP |
| 21957418 | DBP | 21671920 | DBP | 26832400 | DBP |
| 21961058 | DBP | 32880245 | DBP | 21835917 | DBP |
| 21960858 | DBP | 90578605 | DBP | 21835539 | DBP |
| 21960204 | DBP | 90410315 | DBP | 21843120 | DBP |
| 21958545 | DBP | 89076244 | non DBP | 21836833 | DBP |
| 21958841 | non DBP | 23326729 | non DBP | 14627522 | DBP |
| 14828174 | DBP | 90439438 | DBP | 78363301 | DBP |
| 52629876 | DBP | 90328556 | DBP | 30116886 | DBP |
| 21958534 | DBP | 89048073 | non DBP | 20673954 | DBP |
| 21958313 | DBP | 30724697 | DBP | 88595361 | DBP |
| 21957779 | non DBP | 30726408 | DBP | 15769834 | DBP |
| 21958822 | DBP | 30726252 | DBP | 68057023 | DBP |
| 21957238 | DBP | 30725976 | DBP | 59480370 | non DBP |
| 1552778 | DBP | 30725598 | DBP | 52004347 | DBP |
| 21960397 | DBP | 30725306 | DBP | 11114707 | DBP |
| 21960196 | DBP | 30725067 | DBP | 22984739 | DBP |
| 21960777 | DBP | 30724845 | DBP | 22984549 | DBP |
| 21960121 | DBP | 16882676 | DBP | 14564202 | DBP |
| 21960008 | DBP | 30687056 | DBP | 14563738 | non DBP |
| 21959358 | non DBP | 30686776 | DBP | 14563550 | DBP |
| 21959322 | DBP | 30686615 | DBP | 90406920 | DBP |
| 21959035 | DBP | 30686107 | DBP | 14563368 | DBP |
| 21958991 | DBP | 30685947 | DBP | 29434364 | DBP |
| 21957969 | DBP | 30685727 | DBP | 23327083 | non DBP |
| 21957386 | DBP | 30685502 | DBP | 93211002 | DBP |
| 21956952 | DBP | 30685211 | DBP | 14498544 | DBP |
| 21877200 | DBP | 30977777 | DBP | 14526726 | DBP |
| 26832542 | DBP | 18859138 | DBP | 14527329 | DBP |
| 26832403 | DBP | 52002457 | DBP | 22981160 | DBP |
| 33989345 | DBP | 17093877 | DBP | 46913396 | DBP |
| 33989345 | DBP | 30891446 | DBP | 71382240 | DBP |
| 33875610 | DBP | 26832638 | DBP | | |
| 52004491 | DBP | 26832637 | DBP | | |