Daniel D Le, Tyler C Shimko, Arjun K Aditham, Allison M Keys, Scott A Longwell, Yaron Orenstein, Polly M Fordyce.
Abstract
Transcription factors (TFs) are primary regulators of gene expression in cells, where they bind specific genomic target sites to control transcription. Quantitative measurements of TF-DNA binding energies can improve the accuracy of predictions of TF occupancy and downstream gene expression in vivo and shed light on how transcriptional networks are rewired throughout evolution. Here, we present a sequencing-based TF binding assay and analysis pipeline (BET-seq, for Binding Energy Topography by sequencing) capable of providing quantitative estimates of binding energies for more than one million DNA sequences in parallel at high energetic resolution. Using this platform, we measured the binding energies associated with all possible combinations of 10 nucleotides flanking the known consensus DNA target interacting with two model yeast TFs, Pho4 and Cbf1. A large fraction of these flanking mutations change overall binding energies by an amount equal to or greater than consensus site mutations, suggesting that current definitions of TF binding sites may be too restrictive. By systematically comparing estimates of binding energies output by deep neural networks (NNs) and biophysical models trained on these data, we establish that dinucleotide (DN) specificities are sufficient to explain essentially all variance in observed binding behavior, with Cbf1 binding exhibiting significantly more nonadditivity than Pho4. NN-derived binding energies agree with orthogonal biochemical measurements and reveal that dynamically occupied sites in vivo are both energetically and mutationally distant from the highest affinity sites.
Keywords: microfluidics; protein–DNA binding; transcription factor binding; transcription factor specificity; transcriptional regulation
Year: 2018; PMID: 29588420; PMCID: PMC5910820; DOI: 10.1073/pnas.1715888115
Source DB: PubMed; Journal: Proc Natl Acad Sci U S A; ISSN: 0027-8424; Impact factor: 11.205
Fig. 1. DNA library design and assay overview. (A) Schematic of flanking sequence library design indicating Illumina sequencing adapters (blue), unique molecular identifiers (UMIs, red), variable flanking regions (orange), and E-box consensus. (B) Photograph of MITOMI device and schematic showing device operation. (C) Schematic showing downstream sample analysis. Counting of individual molecules within bound and input fractions allows calculation of relative binding energies for each sequence (Left and Middle) to yield a comprehensive thermodynamic binding affinity landscape (Pho4 example). Each color-coded point (blue = −ΔΔG; red = +ΔΔG) represents a sequence, grouped by Hamming distance from the highest affinity sequence and alphabetically ordered in clockwise polar coordinates.
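The counting step in panel C can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the caption, that the bound-to-input count ratio of each sequence is proportional to its association constant, so that ΔΔG = −RT ln(ratio / reference ratio). The function name, the choice of the best binder as reference, the temperature default, and the optional pseudocount are all hypothetical.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def relative_binding_energies(bound, input_counts, temperature_k=298.15, pseudocount=0.0):
    """Estimate ddG (kcal/mol) per sequence from bound and input molecule counts.

    Assumption: bound/input is proportional to the association constant.
    The sequence with the highest ratio is the reference (ddG = 0), so
    weaker binders get positive ddG values.
    """
    ratios = {}
    for seq, n_input in input_counts.items():
        n_bound = bound.get(seq, 0) + pseudocount
        if n_input <= 0 or n_bound <= 0:
            continue  # no usable estimate for this sequence
        ratios[seq] = n_bound / n_input
    reference = max(ratios.values())
    rt = R_KCAL * temperature_k
    return {seq: -rt * math.log(ratio / reference) for seq, ratio in ratios.items()}
```

Sequences never seen in the bound fraction are dropped here rather than assigned an energy; a nonzero pseudocount would keep them at the cost of biasing weak binders.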
Fig. 2. Probing the relationship between assay accuracy, read depth, library size, and energy range. (A) Normalized median accuracy (Pearson’s r² × f, where f is the fraction of observed species) comparing results for down-sampled data with “true” values as a function of read depth for mononucleotide (MN) model coefficients (red) (40) and individual ΔΔG values (blue). (B) Median squared Pearson’s correlation coefficients between recovered and true values (r², Left) and median fraction of observed species (Right) as a function of binding affinity range for various library sizes (rows) sequenced to different depths. (C) Median squared Pearson’s correlation coefficient (r²) between recovered and true values (blue) and fraction of observed species (red) as a function of mean reads per sequence for 10 replicate simulations; cyan rectangle denotes flank library assay conditions presented here.
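A down-sampling experiment of the kind summarized in this figure might be sketched as follows. This is an illustrative simplification, not the authors' simulation: reads are drawn with replacement in proportion to the full counts, and recovery is scored on raw counts rather than on binding energies. All function names and the seed parameter are hypothetical.

```python
import random
from collections import Counter


def pearson_r2(xs, ys):
    """Squared Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def downsample_counts(counts, n_reads, rng):
    """Draw n_reads reads with replacement, weighted by the full counts."""
    species = list(counts)
    weights = [counts[s] for s in species]
    return Counter(rng.choices(species, weights=weights, k=n_reads))


def recovery(counts, n_reads, seed=0):
    """Return (r2, fraction of species observed) for one down-sampled replicate."""
    rng = random.Random(seed)
    sub = downsample_counts(counts, n_reads, rng)
    observed = [s for s in counts if sub[s] > 0]
    fraction = len(observed) / len(counts)
    r2 = pearson_r2([sub[s] for s in observed], [counts[s] for s in observed])
    return r2, fraction
```

Running `recovery` over a range of `n_reads` values and several seeds, then taking medians, gives curves of the same general shape as those described in panels B and C.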
Fig. 3. Modeling and interpretation of binding specificity based on MN features. (A) Data analysis diagram shows NN modeling from raw data followed by model interpretation. (B) Magnitude of energetic changes (ΔΔGs) for core (red) (32) and flanking (blue) mutations for Pho4 and Cbf1. (C) Pho4 and Cbf1 mean MN ΔΔG values as a function of flanking sequence position (Top), compared with the ScerTF database (67–70) sequence logos (Bottom). Gray boxes show position of core consensus CACGTG. (D) Density scatter plots comparing NN model estimates and MN additive model predictions based on those estimates.
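The comparison in panel D can be illustrated with a small sketch of an additive mononucleotide model. It relies on one stated assumption: the input covers every possible sequence of a given length, in which case the least-squares additive coefficient for a base at a position is simply the mean energy of sequences carrying that base minus the grand mean. The function names are hypothetical, and this is not the authors' fitting procedure.

```python
from itertools import product

BASES = "ACGT"


def all_sequences(length):
    """Enumerate every DNA sequence of the given length."""
    return ["".join(p) for p in product(BASES, repeat=length)]


def fit_mn_additive(ddg):
    """Fit per-position base effects from a complete library (dict seq -> ddG)."""
    length = len(next(iter(ddg)))
    grand_mean = sum(ddg.values()) / len(ddg)
    coefficients = []
    for pos in range(length):
        effects = {}
        for base in BASES:
            values = [v for s, v in ddg.items() if s[pos] == base]
            effects[base] = sum(values) / len(values) - grand_mean
        coefficients.append(effects)
    return grand_mean, coefficients


def predict_mn(seq, grand_mean, coefficients):
    """Additive prediction: grand mean plus one effect per position."""
    return grand_mean + sum(coefficients[i][b] for i, b in enumerate(seq))


def variance_explained(observed, predicted):
    """Fraction of variance in observed values captured by the predictions."""
    mean = sum(observed) / len(observed)
    ss_total = sum((o - mean) ** 2 for o in observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total
```

A landscape built purely from per-position contributions is explained completely by this model; any variance left unexplained reflects nonadditive effects such as interactions between neighboring positions.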
Fig. 4. NN model interpretation using DN features. (A) Density scatter plots of NN model estimates vs. DN (nearest neighbor) additive model predictions based on those estimates. (B) Fraction of variance in Pho4 and Cbf1 NN model estimates explained by MN and DN models. (C) Heatmap of mean residual energetic contributions when MN effects are removed for Cbf1.
Fig. 5. Pho4 and Cbf1 binding energies and in vivo activity. (A) Pho4 and Cbf1 affinities (Kd, in nM) compared with in vivo ChIP-seq enrichment (63). (B) Functional-energetic landscapes of ChIP enrichment at dynamically regulated loci, relative to the measured highest affinity sequence in Hamming distance space.
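The two quantities used in this figure, affinities in nM and Hamming distance from the highest affinity sequence, can be related to relative binding energies with a short sketch. It assumes the standard thermodynamic relation Kd = Kd,ref · exp(ΔΔG / RT) with a known reference Kd; the reference value, the temperature default, and all function names are hypothetical and not taken from the source.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kd_from_ddg(ddg_kcal, kd_ref_nm, temperature_k=298.15):
    """Convert ddG relative to a reference sequence into a Kd in nM."""
    return kd_ref_nm * math.exp(ddg_kcal / (R_KCAL * temperature_k))


def group_by_distance(ddg):
    """Group sequences by Hamming distance from the lowest-ddG (best) binder."""
    best = min(ddg, key=ddg.get)
    groups = {}
    for seq in ddg:
        groups.setdefault(hamming(seq, best), []).append(seq)
    return best, groups
```

At 298.15 K, each additional RT·ln(10) of ΔΔG (roughly 1.36 kcal/mol) corresponds to a tenfold weaker Kd under this relation.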