Abstract
DNA-binding proteins perform vital functions related to transcription, repair and replication. We have developed a new sequence-based machine learning protocol to identify DNA-binding proteins. We compare our method with an extensive benchmark of previously published structure-based machine learning methods as well as a standard sequence alignment technique, BLAST. Furthermore, we elucidate important feature interactions found in a learned model and analyze how specific rules capture general mechanisms that extend across DNA-binding motifs. This analysis is carried out using the malibu machine learning workbench available at http://proteomics.bioengr.uic.edu/malibu and the corresponding data sets and features are available at http://proteomics.bioengr.uic.edu/dna.
Year: 2010 PMID: 20156993 PMCID: PMC2879530 DOI: 10.1093/nar/gkq061
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Statistics of data sets used in this work
| Data set | Examples | Pos. | Neg. | Identity |
|---|---|---|---|---|
| JMB03 | 304 | 54 | 250 | 35/25 |
| JMB04 | 188 | 78 | 110 | 25?/25 |
| NAR05 | 359 | 121 | 238 | 35/25? |
| JMB06 | 248 | 138 | 110 | 35/25 |
| ABME07 | 289 | 75 | 214 | 20 |
| LEAC35 | 388 | 138 | 250 | 35/25 |
| LEAC25 | 372 | 122 | 250 | 25/25 |
In the identity column, the first number refers to the identity of the positive set and the second to that of the negative set. A "?" indicates that the accuracy of the number is uncertain.
Figure 1. Illustration of the calculation of local environment amino acid composition.
Figure 3. An ADTree built over the JMB06 data set. The square nodes in the model hold the name of the feature and the order in which it was learned. The round nodes hold the weighted vote, where a positive number predicts DNA binding. Below each square node is the prediction threshold: if this number is exceeded, the right path is taken; otherwise, the left. Below each path in the tree is a set of numbers in the format counted/total for the DNA-binding subgroup given as the prefix.
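The traversal rule in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration of how an alternating decision tree is scored in general, not the malibu implementation: the class names, the feature names (`sheet_arginine`, `turn_leucine`, `turn_histidine`), and every threshold and vote below are hypothetical and are not read from the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Prediction:
    """Round node: a weighted vote plus any splitters hanging below it."""
    vote: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Square node: compares one feature against a threshold."""
    feature: str
    threshold: float
    left: Prediction   # taken when the value does not exceed the threshold
    right: Prediction  # taken when the value exceeds the threshold


def adtree_score(node: Prediction, x: Dict[str, float]) -> float:
    """Sum the votes of every round node reachable for the example x."""
    total = node.vote
    for s in node.splitters:
        branch = s.right if x[s.feature] > s.threshold else s.left
        total += adtree_score(branch, x)
    return total


# Hypothetical tree: all names and numbers are made up for illustration.
tree = Prediction(-0.2, [
    Splitter("sheet_arginine", 0.05,
             left=Prediction(-0.4),
             right=Prediction(0.6, [
                 Splitter("turn_leucine", 0.10,
                          left=Prediction(0.3),
                          right=Prediction(-0.5)),
             ])),
    Splitter("turn_histidine", 0.02,
             left=Prediction(-0.1),
             right=Prediction(0.2)),
])

example = {"sheet_arginine": 0.08, "turn_leucine": 0.04, "turn_histidine": 0.01}
score = adtree_score(tree, example)
print("DNA-binding" if score > 0 else "non-binding", round(score, 3))
```

Unlike an ordinary decision tree, every splitter under a reached round node is evaluated, so an example follows several paths at once and the final score is the sum of all votes collected along them.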
Comparison of new protocol with previous work and BLAST
| Method | Accuracy | MCC | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| JMB03 | |||||
| BLAST | 79.3 | 21.5 | 27.8 | 90.4 | 66.0 |
| OURS | 89.1 | 66.2 | 48.1 | 98.0 | 90.3 (91.1) |
| Stawiski | 92.0 | 74.0 | 81.0 | 94.4 | – |
| Szilagyi and Skolnick | – | 73.0 | – | – | 93.0 |
| JMB04 | |||||
| BLAST | 81.4 | 70.4 | 80.8 | 81.8 | 90.5 |
| OURS | 89.9 | 84.9 | 84.6 | 93.6 | 97.1 |
| Ahmad and Sarai | 83.9 | 68.0 | 80.8 | 87.0 | – |
| Szilagyi and Skolnick | – | 79.0 | – | – | 95.0 |
| NAR05 | |||||
| BLAST | 82.4 | 70.2 | 75.2 | 86.1 | 90.3 |
| OURS | 94.7 | 88.8 | 88.4 | 97.9 | 96.7 |
| Bhardwaj | 86.3 | – | 80.6 | 87.8 | – |
| JMB06 | |||||
| BLAST | 71.8 | 45.1 | 79.7 | 61.8 | 80.1 |
| OURS | 85.9 | 74.8 | 89.9 | 80.9 | 93.4 |
| Szilagyi and Skolnick | – | 74.0 | – | – | 93.0 |
| ABME07 | |||||
| BLAST | 72.7 | 32.5 | 42.7 | 83.2 | 69.0 |
| OURS | 89.6 | 74.3 | 69.3 | 96.7 | 91.3 |
| Langlois | 88.5 | – | 66.7 | 96.3 | 88.7 |
| LEAC35 | |||||
| BLAST | 72.9 | 46.3 | 59.4 | 80.4 | 74.9 |
| OURS | 84.0 | 69.5 | 68.8 | 92.4 | 92.3 |
| LEAC25 | |||||
| BLAST | 69.4 | 28.6 | 42.6 | 82.4 | 67.8 |
| OURS | 84.7 | 66.2 | 64.8 | 94.4 | 91.5 |
a. Maximum MCC from ROC curve.
b. Area under the ROC curve.
c. Using leave-one-out cross-validation instead of 10-fold.
d. Metric calculated from original published data by Szilagyi and Skolnick (11).
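For readers who want to recompute the column metrics, the sketch below shows the standard definitions of accuracy, sensitivity, specificity, the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC). The function names, confusion-matrix counts, and classifier scores are hypothetical and chosen only for illustration; they are not taken from the paper. The table appears to report all values multiplied by 100.

```python
import math


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # fraction of positives recovered
        "specificity": tn / (tn + fp),   # fraction of negatives recovered
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(pos_scores, neg_scores) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only.
m = confusion_metrics(tp=40, fn=10, tn=80, fp=20)
print({name: round(100 * value, 1) for name, value in m.items()})
print(round(100 * auc([0.9, 0.8, 0.4], [0.7, 0.4, 0.3]), 1))
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for data sets of a few hundred proteins; a rank-based formulation would scale better for larger sets.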
Figure 2. A ROC comparison of the new sequence-based feature representation and Boosted Trees with the JMB06 structure-based protocol and BLAST.
Figure 4. Several example protein structures bound to DNA. (a) 3PVI illustrating turn leucine residues in contact with DNA. (b) 3PVI illustrating turn histidine residues in contact with DNA. (c) 1ECR illustrating sheet arginine residues in contact with DNA.