Munazah Andrabi, Andrew Paul Hutchins, Diego Miranda-Saavedra, Hidetoshi Kono, Ruth Nussinov, Kenji Mizuguchi, Shandar Ahmad.
Abstract
DNA shape is emerging as an important determinant of transcription factor binding beyond just the DNA sequence. The only tool for large-scale DNA shape estimates, DNAshape, was derived from Monte Carlo simulations and predicts four broad and static DNA shape features: Propeller twist, Helical twist, Minor groove width and Roll. The contributions of other shape features, e.g. Shift, Slide and Opening, cannot be evaluated using DNAshape. Here, we report a novel method, DynaSeq, which predicts molecular dynamics-derived ensembles of a more exhaustive set of DNA shape features. We compared the DNAshape and DynaSeq predictions for the common features and applied both to predict the genome-wide binding sites of 1312 TFs available from protein interaction quantification (PIQ) data. The results indicate good agreement between the two methods for the common shape features and point to advantages in using DynaSeq. Predictive models employing ensembles from individual conformational parameters revealed that base-pair opening, known to be important in strand separation, was the best predictor of transcription factor-binding sites (TFBS), followed by the features employed by DNAshape. Of note, TFBS could be predicted not only from the features at the target motif sites, but also from those as far as 200 nucleotides away from the motif.
Year: 2017 PMID: 28642456 PMCID: PMC5481346 DOI: 10.1038/s41598-017-03199-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overall design of the present study. The study consisted of three steps. (a) Molecular dynamics (MD) simulations were performed for all the unique tetramers, each flanked by a fixed tetramer at both termini, and the conformational trajectory of the central four nucleotides was converted into a conformational ensemble by defining equal-frequency ensemble bins from the entire data. (b) A set of 65 SVR models was trained, one for each of the five ensemble bins of the 13 conformational parameters. The models take a nucleotide sequence as input and predict 65 features (representing ensemble bin occupancies) for a nucleotide in the corresponding sequence environment. Several benchmarks of the effectiveness of DynaSeq were performed. These included the models' performance in recalling PDB-deposited structures (using predicted occupancy-weighted averages of ensemble bins) and DREAM5 TF specificities (from the ensemble occupancies for a sequence window). (c) Benchmarks of DynaSeq's ability to classify TFBS from genomic controls were performed. Predictors were trained by pooling all 65 features together and also by using just the 5-bin ensemble of a single conformational parameter at a time as the sequence feature.
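The equal-frequency binning described in step (a) can be illustrated with a minimal sketch. It assumes that bin edges are taken from the pooled values of one conformational parameter and that occupancy is the fraction of trajectory frames falling in each bin; the function names and the toy data are hypothetical and are not taken from DynaSeq.

```python
import bisect

def equal_frequency_edges(pooled_values, n_bins=5):
    """Interior cut points that split the pooled data into bins of roughly equal counts."""
    ordered = sorted(pooled_values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def bin_occupancies(trajectory, edges):
    """Fraction of trajectory frames that fall into each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for value in trajectory:
        counts[bisect.bisect_right(edges, value)] += 1
    return [c / len(trajectory) for c in counts]

# Toy example: pooled "parameter values" 0..99 give edges at 20, 40, 60, 80.
pooled = list(range(100))
edges = equal_frequency_edges(pooled)
occupancy = bin_occupancies([0, 1, 2, 50, 90], edges)
```

With five bins per parameter and 13 parameters, this kind of vector yields the 65 occupancy targets mentioned in step (b).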
Figure 2. Cross-validation and predictability of DNA conformational ensemble occupancy at each base position. (a) Variation of mean absolute error (absolute difference between predicted and observed ensemble occupancy in each bin) with training window size. Standard deviation in the overall data is shown in red, whereas other values represent cross-validation performances. (b) Overall cumulative frequency of the absolute error distribution at window size = 5. The prediction for each base in any position of a tetranucleotide is counted once, and errors are computed for the left-out sets in leave-one-tetranucleotide cross-validation. (c) Scatterplot of predicted versus observed occupancies in all bins and all conformational parameters. (d) Mean absolute error averages for each bin occupancy.
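As a rough illustration of the error measures in panels (a) and (b), the sketch below computes a mean absolute error over bin occupancies and the cumulative fraction of absolute errors below a threshold. The function names and the example numbers are hypothetical; the actual cross-validation protocol is not reproduced here.

```python
def absolute_errors(predicted, observed):
    """Per-bin absolute difference between predicted and observed occupancy."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return [abs(p - o) for p, o in zip(predicted, observed)]

def mean_absolute_error(predicted, observed):
    """Average of the per-bin absolute errors."""
    errors = absolute_errors(predicted, observed)
    return sum(errors) / len(errors)

def cumulative_fraction(errors, threshold):
    """Fraction of errors that are less than or equal to the threshold."""
    return sum(1 for e in errors if e <= threshold) / len(errors)

# Hypothetical five-bin occupancies for one base position.
predicted = [0.25, 0.25, 0.20, 0.20, 0.10]
observed = [0.20, 0.20, 0.20, 0.20, 0.20]
mae = mean_absolute_error(predicted, observed)
```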
Figure 3. Agreement between DNAshape and DynaSeq features. The four conformational features provided by DNAshape have also been predicted using DynaSeq (occupancy-weighted average of ensemble bins). All four features show strong correlation, supporting the evidence of sequence-dependent specificity in DNA structures. (a) Detailed scatterplot of each of the overlapping features. Even though there is an implicit offset between the MGW values reported by DNAshape and DynaSeq (due to the use of different definitions and software to compute MGW), the general agreement measured by Pearson's correlation remains strong. (b) Comparison between the 12 DynaSeq features and their mutual correlation with the four DNAshape features shows that several of the 12 parameters (e.g. shift and tilt) are significantly novel, as they show no correlation with any of the DNAshape features.
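The comparison above reduces an ensemble to a single value and then correlates it with a static prediction. A minimal sketch follows, assuming each bin is represented by one value (here called `bin_centers`, a hypothetical name; the source does not say how bins are represented). It also shows why a constant offset, such as the one noted for MGW, leaves Pearson's correlation unchanged.

```python
import math

def occupancy_weighted_average(occupancies, bin_centers):
    """Collapse a bin-occupancy vector into one value using per-bin representative values."""
    total = sum(occupancies)
    return sum(o * c for o, c in zip(occupancies, bin_centers)) / total

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# A constant offset between two sets of values does not change the correlation.
values_a = [1.0, 2.0, 3.0, 4.0]
values_b = [v + 5.0 for v in values_a]
r = pearson(values_a, values_b)
```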
Summary of statistics obtained from DynaSeq-derived 3D structural models of 115 DNA sequences observed in PDB and 1000 equally sized randomly generated ones.
| | Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|---|
| PDB RMSD (Å) | 1.5 | 4.1 | 4.2 | 4.7 | 12.8 |
| Random RMSD (Å) | 1.9 | 5.9 | 7.5 | 8.9 | 14.4 |
| Z-score (all PDBs) | −2.79 | −1.70 | −1.48 | −1.26 | 2.73 |
| P-values (all PDBs) | 0.0026 | 0.0448 | 0.069 | 0.103 | 0.997 |
Complete distribution for individual sequences is provided in supplementary table ST2.
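The relation between the Z-score and P-value rows can be sketched as follows. This assumes each PDB RMSD is standardized against the RMSDs of the random sequences and converted to a one-sided lower-tail normal probability; the source does not state the exact procedure, but the tabulated extremes (Z = −2.79 with P = 0.0026, Z = 2.73 with P = 0.997) are consistent with it. Function names and the reference values are hypothetical.

```python
import math

def z_score(value, reference):
    """Standardize a value against a reference sample (sample standard deviation)."""
    n = len(reference)
    mean = sum(reference) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in reference) / (n - 1))
    return (value - mean) / sd

def lower_tail_p(z):
    """One-sided probability of a standard normal value at or below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical reference RMSDs (Å); a lower RMSD than random gives a negative Z.
z = z_score(2.0, [4.0, 6.0, 8.0])
p = lower_tail_p(z)
```

A small P-value under this convention means the model built for the PDB sequence is closer to the observed structure than models of random sequences of the same size.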
Figure 4. Ability of single-parameter conformational ensembles to predict TFBS in PIQ data. (a) TFBSs were classified from genomic control positions using 65 ensemble features at the motif site and within +/−200 nt of it, using a 7-nt window at each position for each of the 1312 TFs; the distribution of AUC for all such predictions is shown as a boxplot. (b) The data were separated into motif positions (0 to 15 bases from motif start) and (c) positions outside the motif. Results in (a–c) indicate that the "opening" ensemble is the best predictor of TFBS in both regions (even if the difference in AUC is small), followed by parameters whose static values are also used in DNAshape (see (d)). (d) Relative number of times a conformational parameter appeared in the top-ranked position, counted over all positions in all TFs.
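For readers unfamiliar with the AUC used throughout these benchmarks, the sketch below computes it directly from classifier scores as the probability that a binding site scores higher than a control, with ties counted as one half. This is the standard rank-based definition, not code from the study; the scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores for binding sites (positives) and genomic controls (negatives).
example_auc = auc([0.9, 0.3], [0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking, so small differences between parameters, as noted in the caption, are differences in how far above 0.5 each one gets.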
Figure 5. Comparison between DNAshape and DynaSeq implementations in discriminating TFBS from genome-wide controls. Sliding windows of different sizes are placed at the motif start positions, and performance levels are scanned +/−200 bases away from the motif sites assigned by PIQ. AUC results are computed for the 1312 TFs in the PIQ data and averaged to produce these plots, giving a comprehensive global view. The performance of DynaSeq is comparable with that of DNAshape in most cases and seems to be slightly better for smaller window sizes. The difference between DNAshape and DynaSeq diminishes at large window sizes because strict cross-validation penalizes models with a higher number of features.
Figure 6. Performance evaluation of DynaSeq for individual TFs and in comparison to DNAshape and GC content-based models for a 7-nt window. (a) Each TF is represented by a single AUC, which is the highest of the 401 AUC values computed at each of the +/−200 nt positions from the motif start position of that TF. (b) Correlation between the best AUC values obtained by DynaSeq and those obtained by DNAshape. (c) A single cumulative performance level is obtained by averaging the AUCs of all TFs at each of the 401 positions relative to their motif start site; it shows how the performance levels vary when DNAshape and DynaSeq features from these positions are used in predictive models. However, such changes are not observed when GC content is used. (d) AUC values plotted as a function of distance from the motif. The values were calculated using the data of Fig. 6(c). The plot gives a directionless estimate of the predictability of TFBS from non-motif positions.
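The three summaries in panels (a), (c) and (d) can be sketched from a table of per-position AUC values. The sketch assumes each TF has one AUC per offset, ordered from the most upstream to the most downstream position with the motif start in the middle, and that the directionless estimate averages the two positions at equal distance; both points are assumptions, and all names and numbers are hypothetical.

```python
def best_auc_per_tf(auc_by_tf):
    """Panel (a): highest AUC over all scanned positions for each TF."""
    return {tf: max(values) for tf, values in auc_by_tf.items()}

def mean_auc_per_position(auc_by_tf):
    """Panel (c): average AUC over all TFs at each relative position."""
    rows = list(auc_by_tf.values())
    n_positions = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(n_positions)]

def fold_by_distance(position_means):
    """Panel (d): average the upstream and downstream values at each distance from the center."""
    center = len(position_means) // 2
    folded = [position_means[center]]
    for d in range(1, center + 1):
        folded.append((position_means[center - d] + position_means[center + d]) / 2)
    return folded

# Toy data with 5 positions (offsets -2..+2) instead of the 401 used in the study.
auc_by_tf = {
    "TF_A": [0.5, 0.6, 0.9, 0.7, 0.5],
    "TF_B": [0.5, 0.8, 0.7, 0.6, 0.5],
}
best = best_auc_per_tf(auc_by_tf)
means = mean_auc_per_position(auc_by_tf)
by_distance = fold_by_distance(means)
```

With 401 positions the center index is 200, so the folded list has 201 entries covering distances 0 to 200 nt.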
Transcription factors that show significantly better (>5% AUC) binding site predictability using (a) DynaSeq features than DNAshape features and (b) DNAshape features than DynaSeq features.
| S. No. | TF ID | TF Name | AUC gain (%) | S. No. | TF ID | TF Name | AUC gain (%) |
|---|---|---|---|---|---|---|---|
| (a) | |||||||
| 1 | MA03841 | SNT2 | 10.0 | 47 | MA01931 | Lag1 | 5.8 |
| 2 | CN00091 | LM9 | 9.7 | 48 | PB00051 | Bbx1 | 5.7 |
| 3 | PB00801 | Tbp1 | 9.4 | 49 | PB01791 | Sp1002 | 5.7 |
| 4 | MA03511 | DOT6 | 9.3 | 50 | MA02591 | HIF1AARNT | 5.6 |
| 5 | PL00131 | hlh2hlh15 | 8.6 | 51 | MA02671 | ACE2 | 5.6 |
| 6 | PH00341 | Gbx2 | 8.6 | 52 | CN00841 | LM84 | 5.6 |
| 7 | CN00441 | LM44 | 8.2 | 53 | MA03501 | TOD6 | 5.6 |
| 8 | PH01681 | Hnf1b | 8.2 | 54 | PB01261 | Gata52 | 5.6 |
| 9 | PH01641 | Six4 | 8.0 | 55 | PF00101 | GCCATNTTG | 5.6 |
| 10 | MA04001 | SUT2 | 7.7 | 56 | PB01321 | Hbp12 | 5.6 |
| 11 | PB01651 | Sox112 | 7.7 | 57 | PF00561 | GGGTGGRR | 5.6 |
| 12 | MA03351 | MET4 | 7.5 | 58 | MA02371 | pan | 5.5 |
| 13 | MA01382 | REST | 7.4 | 59 | MA01421 | Pou5f1 | 5.5 |
| 14 | MA01551 | INSM1 | 7.3 | 60 | POL0121 | TATABox | 5.5 |
| 15 | MA02251 | ftz | 7.1 | 61 | PB01111 | Bhlhb22 | 5.5 |
| 16 | MA00831 | SRF | 7.1 | 62 | MA02321 | lbl | 5.5 |
| 17 | CN00271 | LM27 | 7.0 | 63 | MA00411 | Foxd3 | 5.5 |
| 18 | MA01431 | Sox2 | 7.0 | 64 | PB01631 | Six62 | 5.4 |
| 19 | PH01071 | Msx2 | 7.0 | 65 | MA04071 | THI2 | 5.4 |
| 20 | MA00071 | Ar | 7.0 | 66 | PF01401 | RNGTGGGC | 5.4 |
| 21 | PB01981 | Zfp1282 | 6.9 | 67 | PB01751 | Sox42 | 5.4 |
| 22 | MA03861 | TBP | 6.9 | 68 | CN02111 | LM211 | 5.4 |
| 23 | PH01171 | Nkx31 | 6.8 | 69 | CN00521 | LM52 | 5.4 |
| 24 | PF00231 | TAATTA | 6.7 | 70 | PB01001 | Zfp7401 | 5.4 |
| 25 | PL00041 | hlh27 | 6.7 | 71 | MA02861 | CST6 | 5.4 |
| 26 | PB01571 | Rara2 | 6.7 | 72 | MA01851 | Deaf1 | 5.3 |
| 27 | PH01701 | Tgif2 | 6.6 | 73 | MA02701 | AFT2 | 5.3 |
| 28 | PH01161 | Nkx29 | 6.5 | 74 | PF00761 | CAGGTA | 5.3 |
| 29 | PL00181 | hlh25 | 6.4 | 75 | MA03141 | HAP3 | 5.3 |
| 30 | MA01031 | ZEB1 | 6.4 | 76 | PH01731 | Uncx | 5.2 |
| 31 | PH01481 | Pou3f3 | 6.3 | 77 | PF00121 | CAGGTG | 5.2 |
| 32 | MA00661 | PPARG | 6.2 | 78 | CN01571 | LM157 | 5.2 |
| 33 | CN02301 | LM230 | 6.2 | 79 | PF01421 | ACAWYAAAG | 5.2 |
| 34 | MA00181 | CREB1 | 6.2 | 80 | MF00031 | RELclass | 5.2 |
| 35 | PF01131 | AACYN NNNTTCCS | 6.2 | 81 | PH00981 | Lhx8 | 5.2 |
| 36 | MA00601 | NFYA | 6.1 | 82 | PF00131 | CTTTGT | 5.2 |
| 37 | PB01491 | Myb2 | 6.1 | 83 | PH00971 | Lhx62 | 5.2 |
| 38 | PH01221 | Obox2 | 6.0 | 84 | PB01501 | Mybl12 | 5.2 |
| 39 | PH01621 | Six2 | 6.0 | 85 | PH01091 | Nkx11 | 5.2 |
| 40 | CN00821 | LM82 | 6.0 | 86 | PB00081 | E2F21 | 5.2 |
| 41 | MA01591 | RXRRARDR5 | 6.0 | 87 | CN00591 | LM59 | 5.1 |
| 42 | PH00701 | Hoxc5 | 5.9 | 88 | MA02581 | ESR2 | 5.1 |
| 43 | CN01191 | LM119 | 5.9 | 89 | CN02101 | LM210 | 5.1 |
| 44 | MA04111 | UPC2 | 5.9 | 90 | MA01101 | ATHB5 | 5.0 |
| 45 | MA00861 | sna | 5.8 | 91 | MA00792 | SP1 | 5.0 |
| 46 | PF01631 | GGAR NTKYCCA | 5.8 | ||||
| (b) | |||||||
| 1 | MF00071 | bHLHzipclass | 8.0 | ||||
| 2 | MA02771 | AZF1 | 7.0 | ||||
| 3 | PH00011 | Alx3 | 7.0 | ||||
| 4 | MA01951 | Lim3 | 6.5 | ||||
| 5 | PB01861 | Tcf32 | 6.2 | ||||
| 6 | PF01191 | GGGNRMNNYCAT | 5.7 | ||||
| 7 | MA01891 | E5 | 5.6 | ||||
| 8 | PH00111 | Alx12 | 5.5 | ||||
| 9 | CN00361 | LM36 | 5.4 | ||||
| 10 | MA01621 | Egr1 | 5.3 | ||||
| 11 | MA00131 | brZ4 | 5.3 | ||||
| 12 | PB01851 | Tcf12 | 5.2 | ||||