Chih-Kang Lin, Chien-Yu Chen.
Abstract
Predicting the binding sites of a transcription factor in the genome is an important but challenging problem in the study of gene regulation. Over the past decade, the large number of co-crystallized protein-DNA structures available in the Protein Data Bank has improved the understanding of how transcription factors interact with their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein-DNA complex structures to derive position weight matrices (PWMs) that are consistent with experimental data. To make further use of the available structural models, the proposed Web server, PiDNA, first constructs reliable PWMs by applying an atomic-level knowledge-based scoring function to numerous in silico mutated complex structures, and then uses the PWM built from the structure models with small energy changes to predict the interaction between proteins and DNA sequences. In a single run, users can predict the relative preference of all DNA sequences that carry a limited number of mutations relative to the native sequence co-crystallized in the model. Predictions for sequences with an unlimited number of mutations can be obtained through additional requests or file upload. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed from the favourable mutated structures, and (iii) any mutated protein-DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or the literature. Second, the accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to select the experimentally validated binding sites from 10,000 random sites with high accuracy.
With PiDNA, users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. PiDNA is also expected to be combined with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein-DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
Year: 2013 PMID: 23703214 PMCID: PMC3692134 DOI: 10.1093/nar/gkt388
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Comparison of predicted PFMs with annotated PFMs based on the Ψ-test
| PDB ID | Ori-seq | PiDNA-2mut | PiDNA-3mut | PiDNA-4mut | 3DTF | 3D-footprint |
|---|---|---|---|---|---|---|
| 1aay | 0.131 | 0.111 | 0.169 | 0.161 | 0.101 | |
| 3dfv | 0.184 | 0.141 | 0.193 | 0.619 | 0.122 | |
| 2wty | 0.421 | 0.246 | 0.271 | 0.706 | 0.558 | |
| 1ig7 | 0.381 | 0.170 | 0.177 | 0.585 | 0.172 | |
| 3u2b | 0.621 | 0.461 | 0.461 | 0.461 | 0.965 | |
| 3f27 | 0.388 | 0.129 | 0.129 | 0.129 | 0.878 | |
| 1ysa | 0.223 | 0.324 | 0.342 | 0.274 | ||
| 2er8 | 0.205 | 0.136 | 0.170 | 0.718 | 0.289 | |
| 1mnn | 0.182 | 0.119 | 0.123 | 0.158 | 0.151 | |
| 1a0a | 0.308 | 0.408 | 0.517 | 0.898 | 0.457 | |
| 3ukg | 0.713 | 0.475 | 0.430 | 0.713 | 0.486 | |
| 3mln | 0.364 | 0.245 | 0.603 | 0.389 | ||
| 1dh3 | 0.133 | 0.187 | 0.270 | – | 0.321 | |
| 1awc | 0.134 | 0.139 | 0.251 | 0.270 | – | |
| 1puf | 0.342 | 0.216 | 0.215 | 0.395 | 0.334 | |
| 1h88 | 0.326 | 0.159 | 0.168 | 0.364 | 0.177 | |
| 2ql2 | 0.372 | 0.179 | 0.205 | 0.205 | 0.529 | |
| 3exj | 0.353 | 0.290 | 0.296 | |||
| 1io4 | 0.162 | 0.261 | 0.261 | 0.440 | 0.045 | |
| 3qsv | 0.581 | 0.398 | 0.434 | 0.442 | ||
| 1gt0 | 0.307 | 0.168 | 0.157 | 0.566 | 0.167 | |
| 1pue | 0.205 | 0.185 | 0.215 | 0.355 | 0.194 | |
| 3brg | 0.135 | 0.196 | 0.282 | 0.833 | 0.134 | |
| 2i9t | 0.226 | 0.210 | 0.210 | 0.210 | 0.459 | |
| 1d66 | 0.773 | 0.441 | 0.355 | 0.412 | – | |
| 1yrn | 0.293 | 0.302 | 0.319 | 0.496 | 0.300 | |
| 1le8 | 0.226 | 0.162 | 0.208 | 0.499 | 0.511 | |
| 1mnm | 0.408 | 0.162 | 0.158 | 0.371 | 0.485 | |
| 1pyi | 0.513 | 0.173 | 0.127 | 0.476 | – | |
| 1zme | 0.902 | 0.260 | 0.205 | – | 0.297 | |
| Average | 0.339 | 0.227 | 0.243 | 0.526 | 0.265 | |
| Standard deviation | 0.209 | 0.107 | 0.098 | 0.100 | 0.214 | 0.147 |
A smaller number on the Ψ-test implies a higher degree of consistency between two PFMs.
‘Ori-seq’ denotes the PFM constructed from the original (native) sequence in the protein–DNA complex.
The column title ‘PiDNA-kmut’ denotes that PiDNA constructed the PFM from selected sequences with at most k mutations.
The best performance on each row is highlighted in bold.
a. (3DTF) No prediction available.
b. The sequence logos of the predicted PFMs are shown in Figure 1.
c. (3D-footprint) No structural evidence for specific binding to DNA (<4 informative columns).
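The ‘at most k mutations’ neighbourhood named in the table notes can be illustrated with a short sketch. This is not PiDNA's implementation: the function names `mutants_within` and `build_pfm` are hypothetical, and the energy-based selection of favourable mutated structures is not modelled. The PFM here simply counts bases over whichever sequences are passed in.

```python
from itertools import combinations, product

BASES = "ACGT"

def mutants_within(native, k):
    """Yield every DNA sequence differing from `native` at no more than k positions.

    Hypothetical helper for illustration; the native sequence itself (0 mutations)
    is yielded first.
    """
    native = native.upper()
    for n_mut in range(k + 1):
        for positions in combinations(range(len(native)), n_mut):
            # At each chosen position, substitute one of the three other bases.
            choices = [[b for b in BASES if b != native[p]] for p in positions]
            for subs in product(*choices):
                seq = list(native)
                for p, b in zip(positions, subs):
                    seq[p] = b
                yield "".join(seq)

def build_pfm(sequences):
    """Return per-position base frequencies for a set of equal-length sequences.

    Assumption: every sequence contributes equally; any weighting or filtering
    by predicted energy would have to happen before this call.
    """
    sequences = list(sequences)
    if not sequences:
        raise ValueError("at least one sequence is required")
    width = len(sequences[0])
    pfm = []
    for i in range(width):
        column = [s[i] for s in sequences]
        pfm.append({b: column.count(b) / len(sequences) for b in BASES})
    return pfm
```

For a 10-bp site, k = 2 gives 1 + 30 + 405 = 436 candidate sequences (the native sequence, 10 × 3 single mutants and 45 × 9 double mutants), which is why the neighbourhood grows quickly as k increases.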
Figure 1. More mutations are desirable for binding sites with a large number of degenerate positions. The term ‘kmut’ denotes that the maximum number of mutations in a single sequence is set to k when constructing the PFMs.
Figure 2. Comparison of PiDNA in predicting high-specificity sites among all the sequences with up to two mutations using the AUC scores. The method ‘ddG’ denotes the ranked list produced based on the change in ΔG, i.e. ΔG′ − ΔGnative.
AUC scores for different Web servers based on validation set 2
| Protein | PiDNA (PFM) | PiDNA (ddG) | 3DTF | 3D-footprint |
|---|---|---|---|---|
| Zif268 (mouse) | 0.912 | 0.897 | 0.928 | 0.878 |
| Gata3 (mouse) | 0.932 | 0.842 | 0.738 | 0.944 |
| Mafb (mouse) | 0.775 | 0.798 | 0.669 | 0.821 |
| Msx1 (mouse) | 0.842 | 0.701 | 0.859 | 0.832 |
| Sox4 (mouse) | 0.915 | 0.933 | 0.825 | 0.938 |
| Sox17 (mouse) | 0.959 | 0.958 | 0.773 | 0.929 |
| Gcn4 (yeast) | 0.687 | 0.548 | 0.744 | 0.776 |
| Leu3 (yeast) | 0.713 | 0.498 | 0.659 | 0.709 |
| Ndt80 (yeast) | 0.933 | 0.905 | 0.886 | 0.870 |
| Pho4 (yeast) | 0.901 | 0.756 | 0.547 | 0.745 |
| Rap1 (yeast) | 0.863 | 0.870 | 0.872 | 0.832 |
| Average | 0.857 | 0.791 | 0.773 | 0.843 |
The testing data in this table include sequences with up to two mutations.
The top-10 high-specificity sequences are assigned as the positives.
The performance of PiDNA with PFM scoring is also compared with its performance based on the change in ΔG (denoted as ‘ddG’).
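The evaluation described in these notes, in which the ten sequences with the highest measured specificity are treated as positives and the AUC of a predicted ranking is computed, might be sketched as follows. The function name `auc_top_n` and the dictionary inputs are hypothetical, and the sketch assumes that a higher predicted score means stronger predicted binding. Under the further assumption that a lower ΔG′ − ΔGnative is more favourable, a ddG ranking would be passed in with its sign flipped.

```python
def auc_top_n(predicted, measured, n_pos=10):
    """AUC of `predicted` scores for recovering the n_pos best `measured` sequences.

    Both arguments are dicts keyed by sequence (hypothetical input format).
    The AUC is computed as the fraction of (positive, negative) pairs that the
    prediction orders correctly, counting ties as half.
    """
    ranked = sorted(measured, key=measured.get, reverse=True)
    positives, negatives = ranked[:n_pos], ranked[n_pos:]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if predicted[p] > predicted[q]:
                wins += 1.0
            elif predicted[p] == predicted[q]:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every positive outranks every negative, and 0.5 corresponds to a random ordering.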
Data set (validation set 3) used for evaluating the performance of PiDNA in detecting true binding sequences from random sequences
| Protein | Species | PDB | Number of sites | Width of the sites | True-positive rate (%) | True-negative rate (%) |
|---|---|---|---|---|---|---|
| Zif268 | | 1aay | 6 | 10 | 100.00 | 99.98 |
| Ndt80 | | 1mnn | 8 | 12 | 87.50 | 99.44 |
| Gcn4 | | 1ysa | 9 | 7 | 100.00 | 99.43 |
| MAT a1/alpha2 | | 1yrn | 19 | 19 | 100.00 | 99.09 |
| EcR/Usp | | 1r0o | 33 | 15 | 87.88 | 99.86 |
| Ttk | | 2drp | 16 | 11 | 62.50 | 98.77 |
| Prd (homeo) | | 1fjl | 15 | 13 | 100.00 | 98.99 |
| Ubx/Exd | | 1b8i | 4 | 10 | 100.00 | 99.95 |
| Trl | | 1yui | 5 | 7 | 100.00 | 99.58 |
| MetJ | | 1mj2 | 16 | 16 | 100.00 | 99.40 |
| TrpR | | 1tro | 15 | 18 | 86.67 | 99.56 |
| PhoB | | 1gxp | 16 | 20 | 93.75 | 99.32 |
| DnaA | | 1j1v | 9 | 13 | 77.78 | 99.98 |
| PurR | | 2puc | 23 | 16 | 100.00 | 99.90 |
| Crp | | 1run | 50 | 9 | 76.00 | 98.46 |
Data are retrieved from the Supplement of Morozov et al., 2005.
Testing data include the listed numbers of positives and 10 000 negatives.
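The two rate columns can be related to raw scores with a minimal sketch. How PiDNA sets its decision threshold is not stated here, so the `cutoff` argument and the function name `rates` are assumptions made purely for illustration.

```python
def rates(pos_scores, neg_scores, cutoff):
    """Return (true-positive rate, true-negative rate) in percent.

    A site is called 'binding' when its score is at least `cutoff`
    (hypothetical decision rule; higher score = stronger predicted binding).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    true_pos = sum(1 for s in pos_scores if s >= cutoff)
    true_neg = sum(1 for s in neg_scores if s < cutoff)
    return (100.0 * true_pos / len(pos_scores),
            100.0 * true_neg / len(neg_scores))
```

With 10 000 negatives, a true-negative rate of 99.98% corresponds to two random sites scoring above the cutoff, and a true-positive rate of 76.00% on 50 known sites corresponds to 38 sites recovered.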