Bryan J Venters, B Franklin Pugh. Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Abstract
The human genome is pervasively transcribed, yet only a small fraction is coding. Here we address whether this non-coding transcription arises at promoters, and detail the interactions of initiation factors TATA box binding protein (TBP), transcription factor IIB (TFIIB) and RNA polymerase (Pol) II. Using ChIP-exo (chromatin immunoprecipitation with lambda exonuclease digestion followed by high-throughput sequencing), we identify approximately 160,000 transcription initiation complexes across the human K562 genome, and more in other cancer genomes. Only about 5% associate with messenger RNA genes. The remainder associates with non-polyadenylated non-coding transcription. Regardless, Pol II moves into a transcriptionally paused state, and TBP and TFIIB remain at the promoter. Remarkably, the vast majority of locations contain the four core promoter elements in constrained positions: upstream TFIIB recognition element (BREu), TATA, downstream TFIIB recognition element (BREd), and initiator element (INR). All but the INR also reside at Pol III promoters, where TBP makes similar contacts. This comprehensive and high-resolution genome-wide detection of the initiation machinery produces a consolidated view of transcription initiation events from yeast to humans at Pol II/III TATA-containing/TATA-less coding and non-coding genes.
The classic paradigm for assembling the minimal core transcription machinery at mRNA promoters starts with the recruitment of the TATA binding protein (TBP) to the TATA box core promoter element[1]. Next is the docking of TFIIB, which straddles TBP and locks onto flanking TFIIB-responsive elements (BREu and BREd)[2,3]. Together with TFIIF, TFIIB then engages Pol II in its active site to help set the start site of transcription (TSS) at an Initiator element (INR)[4-6]. The recruitment of the transcription machinery has long been thought to be an important rate-limiting step in gene expression[7]. Concepts in transcription initiation by all three RNA polymerases (I, II, and III) have been guided by this basic theme[8].
Clashing with this seemingly simplified view is that the TATA box has been identified at only ~10% of human promoters[9,10], with most genes ostensibly being classified as “TATA-less” in all three RNA polymerase systems. The other core promoter elements are apparently equally rare. A second complication of the classic view, particular to multi-cellular eukaryotes, is that the general transcription factors may be largely pre-assembled at promoters. There Pol II is in a transcriptionally engaged but paused state, approximately 30-50 bp downstream from the TSS[11-13]. A third complication is that transcription of genomes is not restricted to coding genes, but appears to be quite pervasive, without clear evidence of being coupled to definable promoters[14]. These complications, together, paint a seemingly complex picture of eukaryotic transcription initiation.
Towards reconciling simplistic models against complex data, we recently developed the ChIP-exo assay to map sites of protein-DNA interactions at near single-base resolution[15]. We discovered in yeast that so-called TATA-less promoters actually possess degenerate versions of the TATA-box, and that most yeast promoters assemble the transcription machinery fundamentally in accord with the classic paradigm[16], although a deep dichotomy between the TATA/SAGA/stress-induced genes and TATA-less/TFIID/housekeeping genes remains. This led us to consider whether similar simplicity was true in humans, albeit with additional complications of paused polymerase and pervasive noncoding transcription.
TBP/TFIIB separation from paused Pol II
Using ChIP-exo, we detected 159,117 TFIIB locations (peak pairs) in K562 cells, of which 36% were associated with ENCODE-defined transcriptional domains[17]. Remarkably, half were associated with heterochromatic regions, which are generally thought to be devoid of stable RNA production. However, heterochromatic transcription may be more pervasive.
We assigned a TBP/TFIIB location to >50% of all annotated protein-coding K562-expressed genes, thereby providing independent validation. Seemingly expressed genes that lacked a TBP/TFIIB location may have arisen from multiple sources including rare but stable mRNAs, detection noise, and antisense transcription arising from a more distal promoter. TBP/TFIIB/Pol II occupancy and mRNA levels were correlated, as expected of recruitment being at least partially rate-limiting in gene expression.
We initially focused on all 8,364 K562 TFIIB locations near the TSS of 6,511 coding RNAs as defined by RefSeq[18]. Fig. 1a provides one example of the raw tag distribution and the identified core promoter elements concentrated ~25 bp upstream of the RPS12 ribosomal protein gene TSS. When individual genes were examined (Fig. 1b), or averaged across all 6,511 genes (Fig. 1c), two regions of high TFIIB/TBP/Pol II occupancy were observed. The major right-ward peaks corresponded to primary promoter transcription initiated complexes (Fig. 1c, upper panel). Those in the left-ward direction matched divergent TSSs[19-22], although the resulting RNA was less abundant than expected from TFIIB/TBP/Pol II occupancy levels (Fig. 1c, lower vs upper panel; note that 2° TSS represents only 24% of the total TSS signal). This may result from RNA instability, as seen in yeast. The clear spatial separation of complexes indicates that divergent transcripts arise from distinct initiation complexes, most (78%) of which were in CpG islands. On average two complexes were detected per CpG island[23], regardless of island length, with the center of the island being enriched ~100 bp downstream of the primary TSS. Complexes tended to be separated by 70-180 bp and had uncorrelated occupancies, which suggests that they are regulated independently.
Figure 1
Transcription machinery organization at human mRNA promoters
a, Smoothed distribution of strand-separated ChIP-exo tag 5’ ends at the RPS12 gene. Core promoter elements are shown with lower case denoting mismatches to the consensus. b, Peak-pair distribution or RNA at RefSeq genes (rows). Rows are linked, and sorted by TFIIB occupancy. c, Upper panel: Averaged ChIP-exo patterns around the closest (1°) RefSeq TSS. The “spikes” of TBP and TFIIB are indiscernible (vertically offset in inset). Lower panel: Distribution of 2° polyadenylated RNA[38], with traces separated by sense (blue) and antisense (red, inverted trace) orientations relative to the corresponding mRNA TSS.
For the vast majority of transcription units, Pol II crosslinked 50 bp downstream of the primary TSS (Fig. 1b, c), where it is expected to pause after initiating transcription[13]. Pol II was most depleted over the core promoter, indicating that it does not stably reside there in proliferating K562 cells. Therefore, when Pol II enters the core promoter, it rapidly initiates transcription and then moves into a paused state ~50 bp downstream, thereby preventing any new polymerase from detectably engaging the core promoter.
The crosslinking pattern of human TFIIB was of particular interest since TFIIB in budding yeast crosslinks broadly across the relatively stable single-stranded DNA region within the Pol II active site at core promoters[16], in accord with crystallographic models of “open” complexes[24]. Remarkably, human TFIIB maintained its contact within this region, despite the absence of polymerase (Fig. 1c, upper panel). Mechanistically, this might occur via TFIIB contacts with BREd[3] (see below), which are absent in budding yeast. The coincidence of TBP and TFIIB crosslinking at the BREd suggests that TBP may be predominantly crosslinking to TFIIB there, rather than directly to DNA.
BREu, TATA, BREd and INR are common
We looked for core promoter elements (illustrated in Fig. 2a) within the narrow intervals defined by 8,364 mRNA TSS-proximal TFIIB locations. Remarkably, and consistent with yeast[16], nearly 85% of them had a sequence with 0-3 mismatches to the TATA-box consensus (TATAWAWR)[25] (Fig. 2b-c). Fewer than 3% had a perfect match to the consensus. Deviations from the TATA box consensus inversely correlated with TFIIB and TBP occupancy levels, indicating that TATA element sequence quality contributes to their occupancy level, consistent with previous observations[26] on their in vivo functionality.
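The mismatch-tolerant search against a degenerate consensus can be illustrated with a minimal sketch. It is not the MEME-based procedure used in this study; the function names are hypothetical, and ties are broken by the leftmost position, whereas the Methods prefer the element closest to the TFIIB peak.

```python
# IUPAC nucleotide codes used by the consensus motifs in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "D": "AGT", "N": "ACGT",
}


def mismatches(window, consensus):
    """Count positions where `window` violates the IUPAC `consensus`."""
    return sum(base not in IUPAC[code] for base, code in zip(window, consensus))


def best_match(seq, consensus="TATAWAWR", max_mismatches=3):
    """Return (mismatch_count, offset) of the best window, or None.

    Assumption of this sketch: the window with the fewest mismatches wins,
    and ties go to the leftmost offset (the TFIIB peak is not modeled).
    """
    k = len(consensus)
    hits = [(mismatches(seq[i:i + k], consensus), i)
            for i in range(len(seq) - k + 1)]
    hits = [h for h in hits if h[0] <= max_mismatches]
    return min(hits) if hits else None
```

Because the lowest mismatch count is always preferred, this is equivalent to searching first for a perfect match and then sequentially allowing one, two, and three mismatches.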
Figure 2
TATA elements at most mRNA genes
a, Core promoter schematic. b, Nucleotide distribution for TATA elements with 0-3 mismatches (panels) to the consensus, and sorted by ascending p-value. Colors are reflected in the logo color. c, Cumulative percent of TFIIB locations having a TATAWAWR sequence with 0-3 mismatches (solid line). Controls include a randomized sequence (60% GC, dashed black line), a scrambled consensus (dashed red line), and 8,364 locations represented by a single background tag (dashed gray line). d, Distance of strand-specific TFIIB peaks (exonuclease stop sites) from TATA element midpoints. Opposite-strand peaks are in red and inverted.
Several controls put the false positive rate for TATA elements at ~20% (Fig. 2c). First, among 10,000 randomly generated sequences having the same sequence bias as the human genome, only 16% were called by chance. Second, a scrambled version of the motif (having 0-3 mismatches) was identified only 20% of the time, and had no positional relationship with TFIIB/TBP (not shown). Third, coordinates having a single isolated tag were used to generate an essentially random set of false-positive locations, and the analysis repeated. TATA elements (0-3 mismatches) were identified only 20% of the time. Fourth, whereas control sequences were distributed randomly across the query space, the distribution of TATA elements was not random. Instead it displayed a tight peak 20 bp upstream of TFIIB and TBP locations (Fig. 2d, and data not shown).
TFIIB in complex with TBP makes sequence-specific contacts with BREu and BREd, which flank the TATA box[2,3] and are upstream of the INR (Fig. 2a). However, these elements are essentially nonexistent in yeast, and ill-defined across mammalian genomes. Using the identified TATA elements as a reference point, we searched upstream for the BREu and downstream for the BREd and INR. Strikingly, in nearly every instance a sequence with three or fewer mismatches to the literature-derived consensus for BREu (SSRCGCC)[2], BREd (RTDKKKK)[3], and INR (YYANWYY)[27] was found (Fig. 3a-c). Remarkably, sequences within each element appeared to co-vary. For example, the BREd consensus tended towards either GTKGGGG or ATKTTTT, rather than an equal mixture of all possible combinations (Fig. 3b), making them less degenerate than the consensus would suggest. Similarly, the INR consensus tended towards either CCANWCC or TTANWTT (Fig. 3c). Sequence bifurcation was not observed with TATA or BREu elements. Given the strong bias towards either strong (G/C) or weak (A/T) base-pairing, this sequence dimorphism may reflect selection for distinct thermodynamic stabilities towards helix melting, which is an essential first step in initiation at these elements. Consistent with this, A/T-rich BREd and INR elements had substantially higher crosslinking levels of TFIIB than their G/C-rich counterparts (not shown). However, this may not explain the strand bias of the sequences.
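The randomized-sequence control described for the TATA analysis can be sketched as follows. This is an illustration only: the window length, the seed, and the function names are assumptions that do not come from the text, and the rate obtained depends strongly on them, so no attempt is made to reproduce the reported percentages.

```python
import random

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "D": "AGT", "N": "ACGT",
}


def has_motif(seq, consensus, max_mismatches):
    """True if any window of `seq` is within `max_mismatches` of `consensus`."""
    k = len(consensus)
    return any(
        sum(b not in IUPAC[c] for b, c in zip(seq[i:i + k], consensus))
        <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def random_sequence(length, gc, rng):
    """Draw a sequence with the requested GC fraction (e.g. 0.6 for 60% GC)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def chance_rate(consensus, max_mismatches, n=10_000, length=80, gc=0.6, seed=0):
    """Fraction of n random sequences in which the motif is called by chance.

    `length` and `seed` are hypothetical parameters chosen for this sketch.
    """
    rng = random.Random(seed)
    hits = sum(
        has_motif(random_sequence(length, gc, rng), consensus, max_mismatches)
        for _ in range(n)
    )
    return hits / n
```

Calling `chance_rate("TATAWAWR", 3)` returns the chance-detection fraction for the TATA consensus under these assumed settings; the same function can be applied to a scrambled consensus.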
Figure 3
BRE and INR at most mRNA genes
a-c, Nucleotide distribution for BREu, BREd, and INR, vertically separated by 0-3 mismatches to the consensus, and sorted by ascending p-value within panels. d, Distance of strand-specific TFIIB peaks from BREu, BREd, and INR. Opposite-strand peaks are in red and inverted. e, Cumulative percent of genes with 0-3 mismatches to each motif in panels a-c. Controls were randomized sequences (60% GC, dashed lines). f, Distribution of core promoter elements relative to TATA box borders.
Similar to our TATA analysis, the TFIIB peak density was tightly focused at a fixed distance from each core promoter element (Fig. 3d), and randomized controls were rarely found (Fig. 3e), thereby validating them. TFIIB peak-pairs were centered over BREd, suggesting that the primary crosslinking point is through the BREd. Unlike the TATA element, the BRE and INR elements deviated relatively little from their consensus (compare Figs. 2c and 3e) and such deviations did not correlate with TBP and TFIIB occupancy levels (not shown). Thus, BRE and INR sequence variability may regulate occupancy of the basal initiation complex to a lesser extent than TATA. Within their search space, the locations of each core promoter element peaked at previously-defined canonical positions (Fig. 3f), thereby providing cross-validation and a core promoter consensus: SSRCGCCNNNTATAWAWRNNRTDKKKKNNNNYYANWYY. The tolerance for mismatches in these elements appears to be 2-3-2-1, respectively.
150,000 noncoding initiation complexes
We next examined the remaining 150,753 putative TFIIB locations that were far (>500 bp) from a protein-coding gene. At a 20% false discovery rate per element, we identified at least 3 of the 4 core promoter elements at 97% of all non-mRNA TFIIB locations. Deviations from the consensus were no greater than at mRNA genes (average of 5 deviations across 28 positions within the four core promoter elements). TBP, TFIIB, and Pol II peaked at the same canonical distances from each motif as found at mRNA promoters. They were also embedded in the same chromatin environment as mRNA promoters (Fig. 4a,b), but displayed comparatively lower TFIIB occupancy.
Figure 4
Noncoding TFIIB locations have chromatin marks and non-polyadenylated RNA
a, Distribution of chromatin marks around TFIIB at RefSeq genes (left) and ncRNA (right). b, TFIIB locations that overlap with chromatin marks and epigenetic regulators[39]. c, Distribution of polyadenylated[38] and non-polyadenylated[40] RNA-seq tags around TFIIB >500 bp from a RefSeq TSS. Percentages reflect TFIIB having an RNA tag <2 kb away. Left panels include sense (blue) and antisense (red and inverted) strands for RefSeq genes, which was not applied to ncRNA (right panels). d, 100-gene moving average of polyadenylated and nonpolyadenylated RNA levels versus TFIIB occupancy at mRNA and ncRNA genes (left and right panels, respectively) on a median-centered log2 scale.
Remarkably, TBP/TFIIB/Pol II complexes were linked to the production of nonpolyadenylated RNA (87% had them) rather than polyadenylated transcripts (Fig. 4c), which is in agreement with the finding of enhancer RNAs[28]. Their locations mapped precisely to the location of TFIIB. Nonpolyadenylated transcript levels also correlated more strongly with “noncoding” TFIIB occupancy than did polyadenylated levels (Fig. 4d), further validating the link. Taken together, we conclude that the vast majority of all 159,117 TFIIB locations (noncoding plus coding) detected in K562 cells represent bona fide and fundamentally identical core promoter initiation complexes of which ~5% produce mRNA and ~95% produce RNA that is non-polyadenylated and noncoding.
Restricted motif spacing in promoters
We searched for an overall core promoter element (CPE) consensus (SSRCGCCNNNTATAWAWRNNRTDKKKKNNNNYYANWYY) and ~40 spacing variants within 100 bp of all TFIIB locations, and plotted their distribution relative to TFIIB. Remarkably, the consensus spacing defined in Fig. 3f displayed the strongest positional relationship with TFIIB (Fig. 5a). For example, a consensus having the spacing between BREd and INR reduced by 1 bp displayed almost no positional relationship with TFIIB (red vs black thick traces in Fig. 5a), as would be expected of a random/nonfunctional sequence.
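To make the notion of a spacing variant concrete, the sketch below assembles a composite consensus from the four element consensuses and three gap lengths. It assumes that a label such as “324” denotes the BREu-TATA, TATA-BREd, and BREd-INR gaps in bp, which is an inference from the consensus string rather than something stated explicitly. The exact-match regular expression is a simplification for illustration; the study itself scored candidates with FIMO.

```python
import re

# IUPAC codes rendered as regular-expression character classes.
IUPAC_RE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "D": "[AGT]", "N": "[ACGT]",
}

# BREu, TATA, BREd, INR consensuses as given in the text.
ELEMENTS = ("SSRCGCC", "TATAWAWR", "RTDKKKK", "YYANWYY")


def cpe_consensus(spacing):
    """Join the four elements with N-spacers; spacing = (gap1, gap2, gap3) in bp."""
    parts = [ELEMENTS[0]]
    for gap, element in zip(spacing, ELEMENTS[1:]):
        parts.append("N" * gap + element)
    return "".join(parts)


def to_regex(consensus):
    """Compile an IUPAC consensus into an exact-match regular expression."""
    return re.compile("".join(IUPAC_RE[code] for code in consensus))


variant_324 = cpe_consensus((3, 2, 4))  # the consensus spacing in the text
variant_323 = cpe_consensus((3, 2, 3))  # BREd-INR gap reduced by 1 bp
```

Under this reading, `variant_324` reproduces the 38-bp consensus quoted above, and `variant_323` is the 37-bp variant with the BREd-INR spacing shortened by one base pair.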
Figure 5
Restricted spacing of core promoter elements
a, Candidate core promoter enrichment at varying distances from all 159,117 TFIIB locations, for spacing variants “323” and “324”, for motifs with weak (thick lines) and strong (thin lines) p-values. b, Traces from panel a and Extended Data Fig. 6 were transformed into enrichment scores and shown as a table, sectored by element spacing, and at two motif p-value intervals. Values are heat-map colored from green to light gray. Configurations in white were not examined. c, Schematic of core promoters having the strongest positional correlation with TFIIB, rank ordered by opacity. “324” (***) stood out as the strongest.
There was very little or no tolerance for variable spacing between core promoter elements (Fig. 5b), which reflects structural constraints of the initiation complex[5]. Surprisingly, proper spacing was accompanied by greater sequence deviations within individual core promoter elements (thick vs thin black line in Fig. 5a), whereas small (~1 bp) spacing deviations were accompanied by stronger elements (thick vs thin red line in Fig. 5a, and summarized in Fig. 5b,c). In short, core promoters may be weak by design, through a compensatory balance of sequence and spacing deviations from the consensus. This allows for greater dependence on transcriptional activators, but also provides for a specified basal output.
We conducted ChIP-exo mapping of TFIIB locations across four ENCODE cancer cell lines: HeLa S3, HepG2, and MCF7 in addition to K562 (cervical, liver, breast, and blood, respectively). We detected TFIIB at 9,074 mRNA genes in at least one cell line, and at 1,691 genes in all lines (group 1 in Extended Data Fig. 9). Cluster analysis suggested that while TFIIB occupancy levels varied from gene to gene, most were relatively constant at individual genes across cell lines. About a third displayed noticeable cell-type specificity (e.g., group 3 in Extended Data Fig. 9). For noncoding initiation complexes, we focused on those present in two or more cell types, and found 100,349 such locations (376,074 locations were found in at least one cell type). Noncoding complexes appeared to have more cell-type specificity and were bimodally distributed at high and low occupancy levels. This heterogeneity may reflect more numerous and diverse roles for the resulting noncoding transcription and/or RNA in cell-type specific physiology compared to proteins.
tRNA genes have TATA and BRE
With some exceptions[29], tRNA genes have been classically defined as TATA-less, where TFIIIC recognizes specific sequences downstream of the TSS, then recruits TFIIIB to a region immediately upstream of the TSS that lacks apparent sequence specificity[30,31]. Pol III then binds to form an initiation complex. TFIIIB contains TBP (and BRF, a factor related to TFIIB) and thus it has been enigmatic as to how TBP in TFIIIB engages the upstream region without a TATA box.
Remarkably, TBP crosslinked ~21 bp upstream of 386 tRNA genes (Fig. 6a, left panel), as seen at Pol II promoters. In nearly every instance we found a TATA element (Fig. 6a, middle) that was ~18 bp further upstream (Fig. 6b). Similar to TBP crosslinking through TFIIB, we suspect that TBP crosslinks through BRF. Indeed, the peaks of BRF and TBP crosslinking are coincident at Pol III genes in mice[32]. As with Pol II promoters, we found a BREd centered between each TBP peak pair (Fig. 6a, right panel) and a BREu immediately upstream of TATA (not shown). Enrichment of these elements, but not the Pol II-specific INR[33], was statistically significant (Fig. 6c). Thus, TBP in complex with a TFIIB family member engages a set of BREu-TATA-BREd core promoter elements similarly in Pol II and III systems.
Figure 6
TATA and BRE elements at most tRNA genes
a, Left panel: TBP peak density separated by forward and reverse strand orientation (blue and red colors, respectively) relative to each tRNA TSS. Corresponding sequences are shown in the right two panels. b, Average distribution of TBP peaks around all identified tRNA TATA elements. c, Cumulative percent of tRNA genes with the indicated promoter element having 0, 1, 2, or 3 mismatches to the consensus. Dashed lines represent calculations for an equivalent number of randomized sequences for the color-linked solid traces.
Consolidated genomic view of initiation
Genome-wide mapping of the general transcription machinery at near single-base resolution offers a consolidated model of certain transcription initiation events from yeast to humans, Pol II to Pol III, TATA-containing to TATA-less, and mRNA to ncRNA. In general, a TFIIB/BRF family member is recruited to all coding or noncoding core promoters via a TBP family member and spatially-constrained core promoter elements. Sequence-specific (BREd) contact with the DNA a few bp downstream of TATA might “bookmark” the site of DNA melting for a rapidly departing Pol II or III. Yeast Pol II is relatively slow to depart, and so it produces equivalent TFIIB-“open” promoter contacts in the absence of a BREd. Pol II then scans downstream several bp, where it encounters an INR that allows for productive transcription, which subsequently pauses 30-50 bp further downstream. In yeast, where an INR and pausing appear absent, a nucleosome border may help set the start site of productive transcription.
Although core promoters are seemingly long (~38 bp in human) for sequence-specific binding, they are designed to be inherently low in specificity, presumably to keep basal transcription low and to maintain high dependence on transcriptional activators. Appropriate specificity is achieved via a blend of degeneracy in motif sequence and spacing. Broad clusters of TSSs at mammalian genes[4] can therefore be explained in terms of clusters of core promoters, many of which may fall below bioinformatic detection.
The discovery that transcription of the human genome is vastly more pervasive than what produces coding mRNA raises the question as to whether Pol II initiates transcription promiscuously through random collisions with chromatin as biological noise or whether it arises specifically from canonical Pol II initiation complexes in a regulated manner. Our discovery of ~150,000 noncoding promoter initiation complexes in human K562 cells and more in other cell lines suggests that pervasive noncoding transcription is promoter-specific, regulated, and not much different from coding transcription, except that it remains nuclear and nonpolyadenylated. An important next question is the extent to which transcription factors regulate this ncRNA.
We detected promoter transcription initiation complexes at 25% of all ~24,000 human coding genes, and found that there were 18-fold more noncoding complexes than coding. We therefore estimate that the human genome may harbor as many as 500,000 promoter initiation complexes, corresponding to an average of about one every 3 kb in the non-repetitive portion of the human genome. This number may vary depending on what constitutes a meaningful transcription initiation event. The finding that these initiation complexes are largely limited to locations having well-defined core promoters and measured TSSs indicates that they are functional and specific, but it remains to be determined to what end. Their massive numbers would appear to provide an origin for the so-called dark matter RNA of the genome[34], and could house a substantial portion of the missing heritability[35].
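The genome-wide estimate can be retraced with simple arithmetic. The sketch below is one plausible reading of how the figures in the text combine, not a calculation reported by the authors; in particular, scaling the noncoding-to-coding ratio by the total number of coding genes is an assumption of this sketch.

```python
# Counts quoted in the text for K562 cells.
tfiib_total = 159_117          # all TFIIB locations
coding_locations = 8_364       # TFIIB locations near a RefSeq mRNA TSS
noncoding_locations = tfiib_total - coding_locations

# Ratio of noncoding to coding complexes (the "18-fold" in the text).
fold_noncoding = noncoding_locations / coding_locations

# Assumption: every one of the ~24,000 coding genes could host one coding
# complex, accompanied by the same noncoding-to-coding ratio.
coding_genes = 24_000
estimated_complexes = coding_genes * (1 + fold_noncoding)

# Genome span implied by "as many as 500,000" complexes at one per 3 kb.
implied_span_bp = 500_000 * 3_000

print(f"noncoding locations: {noncoding_locations:,}")
print(f"noncoding/coding ratio: {fold_noncoding:.1f}")
print(f"estimated complexes: {estimated_complexes:,.0f}")
print(f"implied non-repetitive span: {implied_span_bp / 1e9:.1f} Gb")
```

Under these assumptions the estimate lands somewhat below the rounded upper bound of 500,000 given in the text, and the stated density of one complex per 3 kb corresponds to a non-repetitive span of about 1.5 Gb.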
METHODS
Cell Culture
Human chronic myelogenous leukemia cells (K562, ATCC) were maintained between 1×10^5 and 1×10^6 cells/milliliter in DMEM medium supplemented with 10% bovine calf serum at 37°C with 5% CO2. Human adenocarcinoma cells from the cervix (HeLa S3, ATCC), liver (HepG2, ATCC), and breast (MCF7, ATCC) were grown in a similar manner as K562 cells except that they were maintained between 25-90% confluence. Cells were washed in phosphate buffered saline (1× PBS, 8 mM Na2HPO4, 2 mM KH2PO4, 150 mM NaCl, and 2.7 mM KCl) before incubation with formaldehyde at a final concentration of 1% for 10 minutes. Cells were lysed (10 mM Tris pH 8, 10 mM NaCl, 0.5% NP40, and complete protease inhibitor cocktail (CPI, Roche)), and then the nuclei were lysed (50 mM Tris pH 8, 10 mM EDTA, 0.32% SDS, CPI). Purified chromatin was resuspended in IP dilution buffer (40 mM Tris pH 8.0, 7 mM EDTA, 56 mM NaCl, 0.4% Triton X-100, 0.2% SDS, and CPI) and sonicated with a Bioruptor (Diagenode) to obtain fragments with a size range between 100 and 500 bp.
ChIP-exo and Antibodies
With the following modifications, ChIP-exo was carried out as previously described[36] with chromatin extracted from 10 million cells, ProteinG MagSepharose resin (GE Healthcare), and 3 µg of antibody against either TFIIB (Santa Cruz Biotech, sc-225), TBP (Santa Cruz Biotech, sc-204), or Pol II (Santa Cruz Biotech, sc-899, directed against the N-terminus of the Pol II large subunit encoded by POLR2A).
Alignment to Genome, Peak Calling, and Data Access
Libraries were sequenced on an Illumina HiSeq sequencer. The entire length of each sequenced tag was aligned to the human hg18 reference genome using BWA[41] with default parameters. Raw sequencing data are available at NCBI Sequence Read Archive (SRA067908). The resulting sequence read distribution was used to identify peaks on the forward (W) and reverse (C) strand separately using the peak calling algorithm in GeneTrack (sigma = 20, exclusion zone = 40 bp)[42]. For strand-specific and strand-merged plots, sequencing tags were normalized to input. All 11,458 locations that were present in the ENCODE designated blacklist were removed from the analysis. Peaks were paired if they were 0-80 bp in the 3’ direction from each other and on opposite strands. Since patterns described here were evident among individual biological replicates, and replicates were well correlated, we merged all tags from biological replicate data sets to make final peak-pair calls. Peak pairs were considered to be TFIIB locations if they had a tag count of >4 in the merged datasets. 159,117 locations met these criteria. Peak pair matches across cell lines required that their midpoints be within 80 bp of each other.
NCBI-curated RefSeq TSSs (n=26,987)[18] comprising 23,181 nonredundant mRNA genes were considered. Assignment of TFIIB (8,364 peak-pairs) and TBP (7,642 peak-pairs) to the nearest RefSeq TSS required that they be within 500 bp of the TSS, yielding 6,511 nonredundant mRNA genes. Importantly, using a more stringent interval only marginally changed these numbers and did not alter our conclusions. If a gene had >1 TSS, then the TSS nearest to the bound location (peak-pair midpoint) was used as the primary TSS, and other nearby TSSs were considered secondary (Fig. 1f, lower panel).
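The pairing and TSS-assignment rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the GeneTrack implementation: the `Peak` class and function names are hypothetical, a single chromosome is assumed, each forward-strand peak is greedily paired with the nearest unused reverse-strand peak, and the tag count of a pair is taken as the sum of its two peaks.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    pos: int   # coordinate of the strand-specific peak (exonuclease stop site)
    tags: int  # tag count supporting the peak


def pair_peaks(forward, reverse, max_gap=80, min_tags=4):
    """Pair forward-strand peaks with reverse-strand peaks 0..max_gap bp
    downstream; keep pairs whose summed tag count exceeds min_tags.

    Returns a list of (midpoint, tag_count) tuples.
    """
    reverse = sorted(reverse, key=lambda p: p.pos)
    rev_pos = [p.pos for p in reverse]
    used = set()
    pairs = []
    for f in sorted(forward, key=lambda p: p.pos):
        i = bisect_left(rev_pos, f.pos)  # first reverse peak at or after f
        while i < len(reverse) and reverse[i].pos - f.pos <= max_gap:
            if i not in used:
                used.add(i)
                pairs.append(((f.pos + reverse[i].pos) // 2,
                              f.tags + reverse[i].tags))
                break
            i += 1
    return [(mid, tags) for mid, tags in pairs if tags > min_tags]


def nearest_tss(midpoint, tss_sorted, max_dist=500):
    """Return the closest TSS within max_dist bp of a peak-pair midpoint,
    or None. `tss_sorted` must be an ascending list of coordinates."""
    i = bisect_left(tss_sorted, midpoint)
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda t: abs(t - midpoint))
    return best if abs(best - midpoint) <= max_dist else None
```

The same 80 bp tolerance could be reused to match peak-pair midpoints across cell lines, although that step is not shown here.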
Motif analysis
At each of these 6,511 promoters, using the MEME suite of tools[37], we searched for TATA elements within 80 bp of the midpoint of TFIIB-bound locations on the sense strand, first by searching for the consensus TATAWAWR, then sequentially for one to three mismatches to the consensus, if an element was not found. In rare cases where multiple elements were found, we chose the one closest to the TFIIB peak. This rule had no qualitative impact on the data since such events were rare and choosing the furthest element gave the same result (not shown). Moreover, peak motif detection for BREu, TATA, and INR was not centered over TFIIB, indicating that this distance criterion was not driving the observed motif enrichment at TFIIB locations. Using a similar strategy, we searched for candidate BREu elements (Supplementary Table 4) within 40 bp upstream of the 5,546 identified TATA elements, and searched for candidate BREd and INR elements within 40 bp and 60 bp downstream of the 5,546 TATA elements, respectively. At Pol III promoters, candidate BREd elements were required to be within 20 bp of a TBP peak pair midpoint, and in the same orientation as the TATA element.
Our searches infrequently picked up multiple motif instances within the search window. Where this did occur, we chose the motif with the best match to the published consensus (not the closest to TFIIB). In the situation where we obtained more than one motif with the same number of mismatches, we chose the one closest to TFIIB. When we discarded these multiple occurrences, the results did not change qualitatively. In addition, the peak locations that we obtained for BREu, TATA, and INR were not centered over TFIIB. Instead they peaked at the canonical location that had been established in the literature. This provided independent validation.
Using a PSPM derived from individual core promoter element (CPE) logos from Figs. 2 and 3, FIMO[37] was used to find 37-40 bp sequences that were within 100 bp of a TFIIB peak pair and had a p-value of either <10^-4 (thick trace in Fig. 5) or between 10^-4 and 10^-3 (thin trace). Any CPE <50 bp from a stronger CPE (defined by motif and spacing similarity to the consensus) was eliminated. Distances between the two (TFIIB peak-pair midpoint to consensus BREd midpoint, i.e. 13 bp upstream of the CPE 3’ end) were then calculated for the CPE spacing variants examined. Their frequency distribution was then plotted as a 5 bp moving average. Distributions were transformed into enrichment scores by calculating the ratio of occurrences near TFIIB (0-15 bp) to those far from TFIIB (55-70 bp), then log2-transforming the data.
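The smoothing and enrichment-score steps can be illustrated with a short sketch. Several details are assumptions made here rather than taken from the text: distances are treated as absolute values, both intervals are inclusive, the moving average is centered with shortened windows at the edges, and an optional pseudocount guards against division by zero.

```python
import math


def moving_average(counts, window=5):
    """Centered moving average; edge positions use the available part
    of the window."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        chunk = counts[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def enrichment_score(distances, near=(0, 15), far=(55, 70), pseudocount=1.0):
    """log2 ratio of motif occurrences near TFIIB to those far from TFIIB.

    `distances` are motif-to-TFIIB distances in bp. The two default
    intervals have equal width, so no length normalization is applied.
    """
    absolute = [abs(d) for d in distances]
    n_near = sum(near[0] <= d <= near[1] for d in absolute)
    n_far = sum(far[0] <= d <= far[1] for d in absolute)
    return math.log2((n_near + pseudocount) / (n_far + pseudocount))
```

A positive score indicates that a given spacing variant is found preferentially close to TFIIB, whereas a score near zero is what a randomly positioned sequence would be expected to give.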
Extended Data Table 1
Illumina sequencing statistics
Summary of uniquely mapped sequencing reads for each biological replicate.