Jason A Watts1, Joshua Burdick2, Jillian Daigneault3, Zhengwei Zhu3, Christopher Grunseich4, Alan Bruzel3, Vivian G Cheung5. 1. Department of Internal Medicine, Division of Nephrology, University of Michigan, Ann Arbor, MI, USA. 2. Howard Hughes Medical Institute, Chevy Chase, MD, USA. 3. Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA. 4. National Institute of Neurological Disorders and Stroke, National Institute of Health, Bethesda, MD, USA. 5. Howard Hughes Medical Institute, Chevy Chase, MD, USA; Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA; Department of Pediatrics, Division of Neurology, University of Michigan, Ann Arbor, MI, USA. Electronic address: vgcheung@med.umich.edu.
Abstract
Aberrant gene expression underlies many human diseases. RNA polymerase II (Pol II) pausing is a key regulatory step in transcription. Here, we mapped the locations of RNA Pol II in normal human cells and found that RNA Pol II pauses in a consistent manner across individuals and cell types. At more than 1,000 genes including MYO1E and SESN2, RNA Pol II pauses at precise nucleotide locations. Characterization of these sites shows that RNA Pol II pauses at GC-rich regions that are marked by a sequence motif. Sixty-five percent of the pause sites are cytosines. By differential allelic gene expression analysis, we showed in our samples and a population dataset from the Genotype-Tissue Expression (GTEx) consortium that genes with more paused polymerase have lower expression levels. Furthermore, mutagenesis of the pause sites led to a significant increase in promoter activities. Thus, our data uncover that RNA Pol II pauses precisely at sites with distinct sequence features that in turn regulate gene expression.
Aberrant gene expression underlies many human diseases. RNA polymerase II (Pol II) pausing is a key regulatory step in transcription. Here, we mapped the locations of RNA Pol II in normal human cells and found that RNA Pol II pauses in a consistent manner across individuals and cell types. At more than 1,000 genes including MYO1E and SESN2, RNA Pol II pauses at precise nucleotide locations. Characterization of these sites shows that RNA Pol II pauses at GC-rich regions that are marked by a sequence motif. Sixty-five percent of the pause sites are cytosines. By differential allelic gene expression analysis, we showed in our samples and a population dataset from the Genotype-Tissue Expression (GTEx) consortium that genes with more paused polymerase have lower expression levels. Furthermore, mutagenesis of the pause sites led to a significant increase in promoter activities. Thus, our data uncover that RNA Pol II pauses precisely at sites with distinct sequence features that in turn regulate gene expression.
RNA polymerase synthesizes RNA in a highly regulated manner to allow for a wide range of cellular functions.1, 2, 3, 4, 5, 6 One such regulation is the pausing of the RNA polymerase as it moves along the DNA. Dysregulation leads to too low or too high levels of gene expression that can have dire consequences. Human diseases from cancer to kidney disorders and neurodegeneration are known to result from defects in RNA processing.7, 8, 9The discontinuous synthesis of RNA by RNA polymerase II was first noted in some selected genes but as advances allowed genome-wide studies, RNA Pol II pausing has been recognized as a general regulatory step. Paused RNA polymerase was found in the 5′ ends of β-globin,
hsp70, and proto-oncogenes.12, 13 The resultant accumulation of RNA Pol II modulates gene expression at baseline and in response to stimuli. For example, in Drosophila, RNA Pol II pauses in the promoter region of hsp70, then upon heat induction, these paused polymerases are released and continue into RNA chain elongation. As paused RNA polymerase was identified, studies began to determine how they are regulated. Early studies found two protein complexes, 5, 6-Dichloro-1-β-D-ribofuranosylbenzimidazole Sensitivity Inducing Factor (DSIF)14, 15 and Negative Elongation Factor (NELF) that act in trans to retain the RNA Pol II in the promoter regions. In separate studies, a cis-element was identified at pause sites in Drosophila17, 18 and more recently in human cells.In parallel, our understanding of on the consequences of dysregulation of gene expression has grown. Those studies have led to therapeutics to restore aberrant gene expression, as exemplified by the antisense oligonucleotideNusinersen for treatment of spinal muscular atrophy. RNA-based therapeutics is a burgeoning modality. These drugs aim to change the expression levels of target genes. To achieve this goal optimally, it is critical to understand how human cells fine-tune gene expression. Polymerase pausing represents such a step. A few groups have already suggested manipulations of the DSIF subunit, Spt4, as a treatment for ALS and Huntington disease.7, 21 To advance these early proposals into the clinic, a deeper understanding of RNA polymerase pausing in normal human cells is needed. Studies of RNA Pol II pausing have largely relied on cancer and stem cells which are transcriptionally very active; thus, the pattern of pausing may be quite different than that in normal cells. Furthermore, for disease treatments, it is optimal to target RNA Pol II pausing at genes individually rather than manipulating the trans-acting protein complexes.Recently, techniques such as GRO-seq, PRO-seq, NET-seq,24, 25 and Start-seq have allowed the isolation of nascent RNA and the precise mapping of active RNA polymerases genome-wide. These advances greatly facilitate the quantitative assessment of RNA Pol II pausing. Here, we carried out PRO-seq in adult and neonatal skin as well as in kidney cells. We found that the RNA Pol II pauses in a highly regulated manner. In more than 1,000 sites, RNA Pol II paused at the same nucleotide positions among individuals and in different cell types. To identify the cis-elements that regulate RNA Pol II pausing, we found that the pause sites are found in regions with high GC content, a 9-mer sequence motif, and are predominantly cytosines. The genes with paused polymerases have lower gene expression levels. Perturbations of the cytosines in the 9-mer motif through natural sequence variants and site-directed mutagenesis show that RNA Pol II pausing decreases gene expression. These findings lay the foundation for specific targeting of RNA Pol II pausing to restore aberrant gene expression.
Material and Methods
Cell Culture
Skin fibroblasts from anonymized healthy adults collected as control subjects from an unrelated project and foreskin fibroblasts from a healthy newborn (obtained from the University of Pennsylvania core, SBDRC, see Web Resources) were cultured in DMEM medium (Thermo-Fisher) with 10% fetal bovine serum at 37°C with 5% CO2. Cells were passaged every 72 h using Trypsin-EDTA (0.05%). HK-2 cells (ATCC) were grown in keratinocyte serum-free media (GIBCO-BRL) 37°C with 5% CO2, and media were changed 3 times per week. Adult fibroblast tissue samples were collected using a study approved by the NIH Combined Neuroscience Institutional Review Board, and informed written consent was obtained from all participants.
Precision Run-On Sequencing (PRO-Seq)
PRO-seq libraries were prepared as described previously. Briefly, 5 × 106 nuclei were added to 2X Nuclear Run-On (NRO) reaction mixture (final concentrations: 10 mM Tris-HCl [pH 8.0], 300 mM KCl, 1% Sarkosyl, 5 mM MgCl2, 1 mM DTT, 0.03 mM each of biotin-11-A/C/G/UTP [Perkin-Elmer], 0.2 u/μL RNase inhibitor) and incubated for 3 min at 37°C. Nascent RNA was extracted by phenol (Trizol LS)/chloroform and then fragmented by base hydrolysis in 0.2 N NaOH on ice for 15 min. The reaction was neutralized by adding 0.7 × volume of 1 M Tris-HCl (pH 6.8). The fragmented nascent RNA was purified using 30 μL of Streptavidin M-280 magnetic beads (Invitrogen) and ligated with 3′ RNA adaptor (5′p-GAUCGUCGGACUGUAGAACUCUGAAC-/3InvdT/). Biotin-labeled products were recovered by streptavidin beads. For 5′ end repair, in PRO-seq the RNA products were successively treated with 5′ pyrophosphohydrolase (NEB) and polynucleotide kinase (NEB). 5′ repaired RNA was ligated to the 5′ RNA adaptor (5′-CCUUGGCACCCGAGAAUUCCA-3′). The products were further purified by the streptavidin beads. RNA was reverse transcribed using RT primer (5′-AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA-3′). The product was PCR amplified, the resulting amplicons that are between 150 and 250 bp (insert > 70 bp) were purified using the BluePippin (Sage Science) agarose gel electrophoresis, and then sequenced on the HiSeq 2500 instrument (Illumina) to a depth of >150 million reads per sample (see Table S1).Raw sequencing files were processed by trimming the adaptor sequences from the ends of reads using fastx_clipper from FASTX-Toolkit (Hannon Lab). Sequences with low-quality represented by a stretch of “#” in the quality score string in FASTQ file were removed. Reads that were >35 nt after trimming were included for downstream analysis. Reads were aligned to human reference genome (hg18) using GSNAP (version 2013-10-28) with the following parameters: mismatches < [(read length +2)/12-2]; mapping score > 20; soft-clipping on (-trim-mismatch-score = −3). Bam files were generated and data normalized to reads per million mapped reads (RPM). Data were visualized using IGV. For all the analyses, for each gene, we focused on the longest transcribed isoform.
Pausing Index
For each gene, we calculated the pausing index (PI) which is the ratio of normalized PRO-seq reads in a 1-kb window centered on the TSS to that in the rest of the gene with normalization for gene length. We included the 14,503 genes that are more than 2 kb in length and more than 1 kb from another gene on the same strand. For downstream analyses, comparing gene expression to RNA Pol II pausing, we included genes with pausing indicies from 2 to 900.
Pause Sites
In the in vitro run-on portion of PRO-seq, RNA Pol II incorporates biotinylated bases and halts chain elongation; thus, the biotinylated bases are found at the 3′ end of each PRO-seq read. Accordingly, by determining the number of PRO-seq reads that ends at each nucleotide position, one obtains the number of RNA Pol II found at that position. We determined the sites where most (top 20%) PRO-seq reads end and marked them as to where the RNA Pol II pauses. We then compared the five samples to identify where the RNA Pol II pauses at the same locations among them. Even though in PRO-seq, the RNA chain elongation is expected to halt upon addition of one biotinylated base, sometimes a few bases are added. To accommodate these extra bases, we allowed up to 3 bases among the samples in our comparisons. 1,367 RNA Pol II pause sites were identified with these criteria. The shared pause sites within 500 bp of an annotated transcription start site are considered as being in the promoter region.To determine whether the overlap among individuals in these 1,367 pause sites were more frequent than would be expected by chance, we carried out a permutation test. We used the 6,372,215 sites with a PRO-seq read end in at least one of the 5 adult fibroblast samples. The read counts at each site were randomly shuffled within each of the 14,503 genes, keeping the same number of sites per gene as in the experimental data. The permutation was performed 10,000 times. After each iteration, we determined the number of pause sites in the randomized data using the criteria described above where a pause site has reads in the top 20% in all 5 adult fibroblast samples. This procedure gives 10,000 estimates of the number of pause sites that would be found under the null hypothesis that RNA Pol II pausing occurs randomly. The most number of pause sites found in the randomized data was 7 sites (in 2 of the 10,000 permutations), far less than the 1,367 that were found experimentally. Since none of the 10,000 permutations yielded the number of sites that we observed, we rejected the null hypothesis with a permuted p value < 0.0001.
NELF and DSIF Occupancy
Chromatin immunoprecipitation (ChIP) was performed as described previously. Briefly, foreskin fibroblasts were cross-linked with 1% formaldehyde for 15 min. Cross-linking was stopped with 2.5 M glycine for 5 min. Nuclei were isolated by rotating crosslinked cells for 10 min at 4°C in 5 mL lysis buffer 1 (50 mM HEPES [pH 7.6], 140 mM NaCl, 1 mM EDTA, 10% glycerol, 0.5% NP-40, 0.25% Triton X-100) followed by pelleting, and 10 min rotating in 5 mL lysis buffer 2 (200 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, 10 mM Tris [pH 8]). Nuclei were pelleted, then swelled in lysis buffer 3 (10 mM Tris [pH 8], 1 mM EDTA, 0.5 mM EGTA, 100 mM NaCl, 0.1% deoxycholic acid, 10% N-lauryl sarcosine) for 10 min, then sonicated on high setting (30 s on, 30 s off) for 5 min, 3 times, to shear chromatin to less than 500 bp with Bioruptor (Diagenode). After pelleting the insoluble fraction, the supernatant was pre-cleared with Protein G agarose beads (Sigma) and anti-rabbit IgG (Sigma). 50 μg sheared chromatin was incubated in RIPA buffer (50 mM Tris [pH 8], 150 mM NaCl, 1% NP-40, 0.5% sodium deoxycholate, 0.1% SDS) with 5 μg rabbit IgG (Sigma), 5 μg NELFA (Santa Cruz) or 5 μg SPT5 (Santa Cruz) and recovered with Protein G agarose beads. Beads were washed twice with low-salt RIPA (150 mM NaCl) and twice in high-salt RIPA (300 mM NaCl), then eluted in 100 μL 1% SDS plus 100 mM sodium bicarbonate. After cross-link reversal, DNA was purified over QIAquick PCR Purification Kit (QIAGEN). Factor enrichment was verified with qPCR at the HSP70 promoter (Forward-TCCAGTGAATCCCAGAAGACTC, Reverse-CCTGGGCTTTTATAAGTCGTCA) and gene body (Forward-GTTTGAGCACAAGAGGAAGGAG, Reverse-AGGAAATGCAAAGTCTTGAAGC). ChIP-seq libraries were prepared using the Ovation Ultralow Library system (NuGen). Libraries were sequenced on the HiSeq 2500 instrument (Illumina) and ∼40 million 100-nt reads were generated per ChIP sample. Sequence pre-processing and alignment were performed as described for PRO-seq. NELFA or SPT5 enrichment compared to input DNA was determined using MACS under default settings. MACS fold-enrichment > 5 with an FDR < 0.05 was considered as positive for factor occupancy.
Sequence Features
To identify a sequence motif, we used Weblogo to analyze the 21 bases that flank each of the 1,367 pause sites. The resulting motif is the 9-mer shown in Figure 2A. Then to assess sequence features around the pause sites, we extracted reference genome sequence (hg18) corresponding to 500 bases upstream and 500 bases downstream of each of 1,367 pause sites. The motif scores were determined using the motifcounter package. Then, GC content was determined as (G+C)/(A+T+C+G) in 100-bp sliding windows. GC skew was determined as (G−C)/(G+C) in 100-bp sliding windows. As background, we extracted sequences flanking 1,367 sites randomly selected from regions where RNA Pol II paused (average pausing index > 2 in fibroblast samples) but not at the same sites across the 5 individuals.
Figure 2
Sequence Features of RNA Pol II Pause Sites
(A) Sequence motif found at pause sites in gene promoters.
(B–D) Plots of motif score (B), GC content (C), GC skew (D) at sites with (blue) and without (orange) paused RNA Pol II.
(E) RABL2B gene model (top) and the average profile of RNA Pol II reads among five individuals (middle) are shown. Bottom: An example of differential pausing for two individuals heterozygous (C/T) at a pause site of RABL2B.
(F and G) Data presented as in (E), showing differential pausing for an individual at TTLL12 (F) and FRMD6 (G).
In (E)–(G), asterisk (∗) marks truncated reads where the RNA Pol II have paused; C-allele in blue, T-allele in red. y axis ranges are indicated in parentheses.
(H) Average number of sequence reads that end at each allele of C/A, C/G, and C/T pairs (C versus A ∗∗∗p < 0.004 after removing 1 outlier; C versus G ∗∗∗p < 0.0002; C versus T ∗∗∗p < 10−12, t test). Error bars are the SEM. The extent of differential allelic pausing at cytosine compared to the alternate alleles are indicated.
Linear Discriminant Analysis and Effect Size Determination
To determine whether the sequence features and trans-acting protein complexes allow us to classify sites as to where RNA Pol II pauses, we carried out conventional linear discriminant analysis with leave-one-out cross-validation (in the statistical package Minitab). The analyses were performed using data on NELFA and SPT5 abundance from ChIP assays, motif score, GC content, GC skew, as well as the presence or absence of cytosine at pause site and +1 purine on all 5 fibroblast samples. Then, to assess the effects of the sequence features on RNA Pol II pausing, we carried out stepwise binary logistic regression (in the statistical package Minitab). GC content, GC skew, and motif score were Z-transformed to allow for comparisons of their effect sizes. The results are reported as odds ratios in Table 1.
Table 1
Relative Contributions of the Sequence Feature at 1,367 Locations where RNA Pol II Pauses in Gene Promoters
Feature
Odds Ratio (95% CI)
9-mer Motif
2.02 (1.8, 2.27)
“Cytosine at pause site”
1.42 (1.18, 1.71)
“+1 Purine”
1.21 (1.01, 1.44)
GC skew
1.17 (1.08, 1.28)
GC content
1.14 (1.03, 1.25)
Relative Contributions of the Sequence Feature at 1,367 Locations where RNA Pol II Pauses in Gene Promoters
Gene Expression Analysis
RNA was isolated using RNeasy Mini-Kit (QIAGEN). Sequencing libraries were prepared from total RNA using TruSeq Stranded Total RNA Library Prep Kit (Illumina). Sequencing was performed on Illumina HiSeq 2500, and >135 million 100-nt reads were generated from each sample. Low-quality bases were trimmed from the 3′ end of reads and 3′ adaptor was trimmed using FASTQ/A Clipper with default settings (Hannon lab). Reads shorter than 35 bp were excluded from the analysis. Sequencing reads were aligned to human reference (hg18) using GSNAP (v.2013-10-28) using the following parameters: mismatches ≤ [(read length+2)/12-2]; mapping score ≥ 20; soft-clipping on (-trim-mismatch-score = −3). Reads counts from each sample were normalized to the total number of mapped reads. Relative transcript abundance in fragments per kilobase mapped (FPKM) was determined using Cufflinks (v.2.2.1). For analysis of allelic expression, we considered genes (N = 12) where C/T and C/C genotypes were represented at promoter pause sites. Expression of each gene was normalized across individuals by Z-score and then averaged by genotype.
Differential Allelic Pausing
To determine whether there are allelic differences in RNA Pol II pausing, we identified heterozygous sites from the DNA sequences of the five adult fibroblast samples. For each sample, sites with >10 read coverage and at least 25% of the reads showed an alternate base were considered as heterozygous. To avoid reference bias, we used the identified variants to construct an “alternate” genome where the reference alleles at the heterozygous sites were replaced with the alternate alleles. PRO-seq reads were then aligned to the reference and alternate genomes using GSNAP. We considered 134 sites in 83 genes where RNA Pol II pauses (sites in the top 50 percentile) in at least two samples. To assess for differential allelic pausing, the number of PRO-seq read-ends on the “C” versus “non-C” allele was determined and the group means were compared by t test. Results, including the ratio of read-ends on the “C” versus “non-C” allele, are reported in Figure 2H.We queried the GTEx database for SNPs that overlap the RNA Pol II pause sites. We included SNPs with a minor allele frequency greater than 20%, those that coincide with where RNA Pol II pauses (sites in the top 50 percentile), and are enriched for SPT5 (per ChIP assays). This yielded seven SNPs: rs66966963, rs11547138, rs11248061, rs2303754, rs71476227, rs1049346, and rs35024348. Table 2 lists the genotypes, extent of differential allelic expression by tissue, and p values obtained from the GTEx portal v7 090617.
Table 2
Differential Allelic Expression at Heterozygous Pause Sites
SNP
Gene
Genotype
Tissue
DAEa(C versus Alternate Allele)
p Value
rs66966963
ACTR3B
C/T
prostate
−17%
4 × 10−5
rs11547138
AKIRIN1
C/T
transformed fibroblasts
−12%
3 × 10−22
rs11248061
IDUA
C/A
skin
−12%
8 × 10−10
stomach
−13%
6 × 10−12
nerve
−14%
4 × 10−19
rs2303754
POP4
C/G
aorta
−4%
8 × 10−5
stomach
−4%
5 × 10−5
transformed fibroblasts
−4%
1 × 10−4
nerve
−4%
5 × 10−6
esophagus (muscularis)
−4%
9 × 10−7
breast
−6%
8 × 10−5
lung
−7%
1 × 10−6
coronary artery
−7%
3 × 10−5
cerebellum
−8%
3 × 10−5
rs71476227
ZDHHC21
C/T
esophagus (mucosa)
−10%
2 × 10−9
esophagus (GEJb)
−13%
3 × 10−10
omentum
−14%
2 × 10−16
esophagus (muscularis)
−17%
5 × 10−24
spleen
−17%
1 × 10−7
adipose
−18%
1 × 10−33
whole blood
−19%
1 × 10−6
skin (leg)
−23%
7 × 10−51
skin (suprapubic)
−23%
4 × 10−39
Data obtained from GTEx portal (see Web Resources). We identified seven SNPs that overlap the pause sites in this study. The five of them that showed significant allelic association with gene expression are shown in this table.
Extent of differential allelic expression
Gastresophageal junction
Differential Allelic Expression at Heterozygous Pause SitesData obtained from GTEx portal (see Web Resources). We identified seven SNPs that overlap the pause sites in this study. The five of them that showed significant allelic association with gene expression are shown in this table.Extent of differential allelic expressionGastresophageal junction
Luciferase Assay
The promoter region including the first exon for MYO1E, BLCAP, and SESN2 were cloned into a Topo TA cloning vector (Invitrogen) following PCR amplification using primers: (MYO1E) 5′-GCTAGCTTGCTCACAATCCAGACGTAGG-3′, 5′-CTCGAGCACCCAAGCACTCACAGGA-3′, (BLCAP) 5′-CTTTGAGCCACGAGAAGGTTTT-3′, 5′-CAGGAGTACTATGACCCACCTC-3′ and (SESN2) 5′-GCTAGCCTGTGTCTCGCATCTTTGGAG-3′, 5′-CTCGAGGCTTTGGTGCTGGACTCTTC-3′. Cloned promoters were verified by Sanger sequencing and then subcloned into pGL4.17 firefly luciferase plasmid (Promega). Point mutations were introduced by using Quickchange II site-directed mutagenesis kit (Agilent) and confirmed by Sanger sequencing. Primers for site-directed mutagenesis are listed in Table S2. 293T cells were cotransfected with 100 ng of pGL4-firefly luciferase and 50 ng pGL4.73-Renilla (Promega) using Lipofectamine 3000 (Invitrogen). Luciferase activity was determined 24 h post-transfection using the Dual Glo-Luciferase assay kit (Promega) and quantified on a Microplate Luminometer (Veritas). Differences in reporter activity were determined by t test.
Results
RNA Polymerase II Pauses at the Same Nucleotide Positions across Individuals and Cell Types
Using the precision nuclear run-on assay (PRO-seq), we determined the locations of transcriptionally engaged RNA Pol II at single-base resolution.23, 35 We carried out PRO-seq in skin fibroblasts from forearms of five adults and focused on genes that are at least 2 kb long and more than 1 kb from adjacent ones, to avoid transcription signals from neighboring genes. In PRO-seq, biotinylated ribonucleotides are used in the run-on assay, incorporation of these nucleotides inhibits the RNA polymerase from further chain elongation. Thus, the RNA Pol II is found at the 3′ end of each nascent transcript, and mapping of these transcripts provides the locations of the RNA Pol II. Accumulation of the polymerase in a region such as the promoter relative to the rest of the gene is used as an indicator of pausing. Two examples of RNA Pol II pausing on SESN2 and SLC35D1 are illustrated in Figure 1A.
Figure 1
RNA Pol II Pauses at the Same Nucleotide Positions across Individuals and Cell Types
(A) Paused RNA Pol II at two genes in five individuals, highlighting similar RNA Pol II distribution. Scale bar 5 kb. y axis ranges are indicated in parentheses.
(B) 21-bp regions in SESN2 and SLC35D1 where polymerases are paused at the same base across individuals.
(C) Top: Profile of RNA Pol II in adult fibroblasts across five individuals. Bottom: Heatmap of the locations of RNA Pol II for 9,320 genes in adult fibroblasts; genes are plotted on the same rows for all individuals to allow direct comparisons.
(D) Pairwise correlation coefficients of the pause maxima between fibroblast samples (p < 10−10; Spearman).
(E) Top: Profile of RNA Pol II in adult fibroblasts (average of the five individuals), neonatal fibroblasts, and kidney proximal tubular cells. Bottom: Heatmap of RNA Pol II positions for 7,760 genes; the genes are plotted on the same rows to allow direct comparisons.
RNA Pol II Pauses at the Same Nucleotide Positions across Individuals and Cell Types(A) Paused RNA Pol II at two genes in five individuals, highlighting similar RNA Pol II distribution. Scale bar 5 kb. y axis ranges are indicated in parentheses.(B) 21-bp regions in SESN2 and SLC35D1 where polymerases are paused at the same base across individuals.(C) Top: Profile of RNA Pol II in adult fibroblasts across five individuals. Bottom: Heatmap of the locations of RNA Pol II for 9,320 genes in adult fibroblasts; genes are plotted on the same rows for all individuals to allow direct comparisons.(D) Pairwise correlation coefficients of the pause maxima between fibroblast samples (p < 10−10; Spearman).(E) Top: Profile of RNA Pol II in adult fibroblasts (average of the five individuals), neonatal fibroblasts, and kidney proximal tubular cells. Bottom: Heatmap of RNA Pol II positions for 7,760 genes; the genes are plotted on the same rows to allow direct comparisons.Visual inspection of the PRO-seq data showed that RNA Pol II pauses at very similar genic locations among individuals, and in some cases, at the same nucleotide positions (Figure 1B). To assess whether this is a general feature of RNA Pol II pausing, we asked whether this is seen more globally. For each individual and in each gene promoter region, the base positions with the highest number of paused RNA Pol II (or pause-maxima for short) were determined. We then compared the locations of the pause-maxima among the five individuals. Figure 1C shows across 9,320 gene promoters, RNA Pol II pauses at very similar locations among the five individuals; pairwise correlations (Figure 1D) are highly significant (p < 10−10). We then extended the analysis to include skin samples from a newborn. Figure 1E shows that the locations of paused RNA Pol II in newborn fibroblasts are also very similar to those in the adult samples (p < 10−10). Next, we studied proximal tubule cells, HK-2, from the kidney. The locations of paused RNA Pol II for 7,760 genes that are expressed in the adult and newborn skin, as well as kidney cells, are plotted in Figure 1E which shows that across the different cell types, RNA Pol II pauses at very similar locations along the DNA. The correlations of the locations of the pause-maxima across cell types are highly significant (p < 10−10). For 1,469 (19%) genes, the pause-maxima are within 3 nucleotides in the different cells.We have so far focused on sites with the greatest number of paused polymerases. However, numbers that are at the far ends of distributions could result from various biases, so we carried out another analysis that examines not just one site but those in the top 20%. With this definition of paused RNA Pol II, we asked how often are the polymerases paused at the same locations across our five fibroblast samples. We found there were 1,367 sites where RNA Pol II paused at the same nucleotide locations in gene promoters across the five individuals (Table S3). We show these sites on the UCSC Genome Browser (see Web Resources). The precise pausing of RNA Pol II at these 1,367 sites did not occur by chance, as we performed a permutation test and found at most 7 sites (in two permutations) where RNA Pol II paused in the same locations among the five individuals in randomized data (p < 0.0001; see Material and Methods). Additionally, we looked at PRO-seq data from other groups and found that even though the studies were carried out in different labs on different cells, the RNA Polymerase paused at the same locations. Specifically, of the 1,367 sites, we found 1,086 sites are shared in common with PRO-seq data from Sistonen and colleagues, while in 3 coPRO datasets from Lis and colleagues, 599, 744, and 877 sites were shared. Collectively, these data show that during transcription, RNA Pol II pauses very precisely.
cis-Acting Elements that Characterize RNA Polymerase II Pause Sites
To determine the code that signals for the RNA polymerase to pause, we analyzed the 1,367 pause sites. The pausing complexes NELFA and SPT5 were found at 1,018 (74%) of these sites (Figures S1A and S1B), which is consistent with their role in mediating promoter-proximal pausing.15, 16, 38, 39 We then examined and found several sequence features. First, 906 (66%) of the pause sites are cytosines. Second, at 952 sites (69%), the next base to be added to the RNA chain after the pause is a purine; we will refer to these as “+1 purine.” Third, there is a 9-mer sequence motif (Figures 2A and 2B; Table S2). Cramer and colleagues used a different approach (NET-seq) to map promoter-proximal pauses and observed a similar motif (Figure S2). Fourth, in the 50-nucleotide regions around the pause sites, the GC-content is very high at 70% (Figure 2C). Fifth, there is GC skewing around the pause sites, indicating G-rich RNA (and G-rich non-template strand; Figure 2D).Sequence Features of RNA Pol II Pause Sites(A) Sequence motif found at pause sites in gene promoters.(B–D) Plots of motif score (B), GC content (C), GC skew (D) at sites with (blue) and without (orange) paused RNA Pol II.(E) RABL2B gene model (top) and the average profile of RNA Pol II reads among five individuals (middle) are shown. Bottom: An example of differential pausing for two individuals heterozygous (C/T) at a pause site of RABL2B.(F and G) Data presented as in (E), showing differential pausing for an individual at TTLL12 (F) and FRMD6 (G).In (E)–(G), asterisk (∗) marks truncated reads where the RNA Pol II have paused; C-allele in blue, T-allele in red. y axis ranges are indicated in parentheses.(H) Average number of sequence reads that end at each allele of C/A, C/G, and C/T pairs (C versus A ∗∗∗p < 0.004 after removing 1 outlier; C versus G ∗∗∗p < 0.0002; C versus T ∗∗∗p < 10−12, t test). Error bars are the SEM. The extent of differential allelic pausing at cytosine compared to the alternate alleles are indicated.Next, we asked whether these features allow us to identify pause sites. To assess this, we carried out a linear discriminant procedure using the 7 factors: the abundance of NELFA, SPT5, a cytosine at pause site, +1 purine, motif, GC content, and GC skew. When combined as a linear discriminant function, they correctly classified 72% of the pause sites. By cross-validation, we left out the site to be classified from the calculation of the discriminant function, and then assigned the site as paused or not based on the discriminant function of the remaining 1,366 sites. With this more stringent criterion, 71.8% of the promoter sites were still correctly classified.Together these features allow identification of pause sites, but it is also important to know their relative contributions. However, it is very difficult to assess the relative effects with molecular approaches. Epidemiologic studies have identified risk factors and their effects on health conditions from heart disease to cancers.40, 41, 42 Here, we took a similar approach and carried out regression analyses to assess the relative effects of the sequence features on pausing. We performed stepwise regression and found odds ratios range from 1.1 to 2.0 for the motif, “cytosine at pause site,” “+1 purine,” GC skew, and GC content (see Table 1).Among these features, the one that is the most amenable for further investigation experimentally on a gene-by-gene basis is the cytosine at pause site. Additionally, this cytosine is conserved; as it was found to mediate pausing in E. coli.43, 44 To assess these cytosines, it would be best to focus on the cytosines while controlling other factors that may affect transcription. To accomplish this, we looked for pause sites where our subjects are heterozygous. To include more sites, we broadened the search to sites with paused polymerase in at least 2 (rather than 5) individuals, which yielded 134 heterozygous sites. We compared the number of paused RNA Pol II on the “C-allele” versus the other alleles. In all three comparisons, C versus A, C versus G, C versus T, there were more paused RNA Pol II on the C-alleles. An example is shown in Figure 2E, where at the pause site of RABL2B, two individuals are heterozygous for a C/T variant and RNA Pol II paused only on the C-allele. Similarly, RNA Pol II paused more often on the C-allele at the pause sites in TTLL12 and FRMD6 (Figures 2F and 2G). On average, RNA Pol II paused three times more often on the cytosine than the other alleles (Figure 2H). Thus, these sequence variants as experiments of nature show that the cytosine in the 9-mer motif is part of the cis-regulatory code that governs RNA Pol II pausing. They suggest that these pauses at specific genes can be averted if the cytosines are changed to another base.
Genes with More Paused RNA Polymerase II Have Lower Expression Levels
Next, we investigated the biological implications of RNA Pol II pausing by assessing its effect on gene expression. First, we obtained gene expression levels by sequencing the mRNA from the same adult fibroblast samples that we had determined RNA Pol II pausing. Then to compare gene expression to the level of RNA Pol II pausing, we used the pausing index (PI), which is the abundance of RNA Pol II in the promoter relative to the entire gene. We focused on 5,260 genes with paused RNA Pol II (PI > 2) in their promoters in all 5 individuals. We found that genes with more paused RNA Pol II have significantly (p < 10−16) lower gene expression levels (Figure 3A). There is also a significant negative correlation (R = −0.29; p < 10−83) between the extent of RNA polymerase II pausing and gene expression levels (Figure S3).
Figure 3
RNA Pol II Pausing Is Negatively Correlated with Gene Expression
(A) Genes with high pausing index have significantly lower expression levels (p < 10−16, ANOVA, N = 5,620 genes). Boxes show 25th, 50th, 75th percentiles and whiskers are 5th and 95th percentiles. The ranges of pausing indices by quartiles: Q1 2.5–7.6, Q2 7.6–13, Q3 13–23, Q4 23–803.
(B) Normalized expression levels of genes (n = 12) with C/T variants at promoter pause sites, data for individuals with C/C or C/T genotype are plotted (red bar indicates the averages, p < 0.006; t test).
RNA Pol II Pausing Is Negatively Correlated with Gene Expression(A) Genes with high pausing index have significantly lower expression levels (p < 10−16, ANOVA, N = 5,620 genes). Boxes show 25th, 50th, 75th percentiles and whiskers are 5th and 95th percentiles. The ranges of pausing indices by quartiles: Q1 2.5–7.6, Q2 7.6–13, Q3 13–23, Q4 23–803.(B) Normalized expression levels of genes (n = 12) with C/T variants at promoter pause sites, data for individuals with C/C or C/T genotype are plotted (red bar indicates the averages, p < 0.006; t test).Next, we again leveraged sequence variants and assessed whether RNA polymerase pausing leads to lower gene expression. We examined the heterozygous pause sites to assess the effect of the paused polymerase on gene expression. Among our samples, we have the largest number of C/T heterozygotes, so we compared the expression of genes in individuals who are C/C homozygous to individuals who are heterozygous C/T at the promoter pause sites. The results show that gene expression levels are significantly lower (p < 0.006; t test) for individuals who have C/C genotypes compared to those with C/T genotypes (Figure 3B), thus confirming that the C-alleles with more paused polymerase are expressed at lower levels. To assess whether this relationship between pausing and gene expression can be generalized, we turned to the gene expression data collected by the Genotype-Tissue Expression Consortium, GTEx. We searched the GTEx database for single nucleotide polymorphisms (SNPs) that overlap the pause sites identified in this study. We do not expect to find many SNPs that overlap our pause sites, but any of them allows us to ask whether findings from our samples can be generalized to a much larger dataset. We indeed found seven SNPs that overlap our pause sites. For five of the seven SNPs, the C-alleles are significantly associated with lower gene expression (see Table 2) across different tissues, beyond the skin and renal tubule cells that led us to the finding. For example, across nine tissue types, individuals with C/C genotypes in ZDHHC21 have from 10% to 23% lower expression than individuals with T/T genotypes. Therefore, in a dataset with more individuals and cell types, C-alleles at sites where the polymerase is more likely to pause are associated with lower expression levels. Together, the results show that genes with more paused RNA Pol II have lower gene expression levels.
Mutagenesis of Regulators of RNA Polymerase II Pausing Changes Gene Expression
To validate experimentally the effect of RNA Pol II pausing on gene expression, we turned to luciferase reporter assays. We selected three genes, MYO1E, BLCAP, and SESN2, which have promoter-proximal pause sites. The promoters, including the first exons, were cloned into a luciferase reporter, and then by site-directed mutagenesis, at the pause sites, cytosines were converted to thymines. The single-base change from cytosine to thymine resulted in significantly higher promoter activity for MYO1E (p < 0.05), BLCAP (p < 10−8), and SESN2 (p < 0.001; Figures 4A–4C).
Figure 4
Mutagenesis of Sequences at or near RNA Pol II Pause Sites Changes Gene Expression
(A–C) Luciferase reporter activity of MYO1E, BLCAP, and SESN2 promoters with cytosine at pause sites, compared to those where the pause sites were mutated to thymine (n = 15, ∗p < 0.05, ∗∗∗p < 0.001; t test).
(D) Luciferase reporter activity of SESN2 promoter with the indicated mutations (n = 9, −4T ∗∗p < 0.01; −2T ∗∗∗p < 10−5; +2A5A ∗∗∗p < 10−4; t test). Red line indicates luciferase activity of the wild-type sequence. The 9-mer motif is shown for reference.
Error bars are SEM.
Mutagenesis of Sequences at or near RNA Pol II Pause Sites Changes Gene Expression(A–C) Luciferase reporter activity of MYO1E, BLCAP, and SESN2 promoters with cytosine at pause sites, compared to those where the pause sites were mutated to thymine (n = 15, ∗p < 0.05, ∗∗∗p < 0.001; t test).(D) Luciferase reporter activity of SESN2 promoter with the indicated mutations (n = 9, −4T ∗∗p < 0.01; −2T ∗∗∗p < 10−5; +2A5A ∗∗∗p < 10−4; t test). Red line indicates luciferase activity of the wild-type sequence. The 9-mer motif is shown for reference.Error bars are SEM.Next, we assessed other sequence features that were identified as cis-regulators. Work in bacteria has shown RNA polymerase pausing is affected not only by sequences at the pause site but also sequences upstream of the polymerase active site where the template and non-template DNA strands re-anneal. Using the SESN2 promoter, we changed the guanine at the “−11 position” to thymine and found that resulted in significantly (p < 10−7; Table S5) higher promoter activity. Our regression analysis suggests sequence changes that deviate from the 9-mer motif would lead to less pausing and higher expression, whereas sequence changes which restore cis elements would lead to more pausing and lower expression. We mutated several of the sequences within the 9-mer motif in the SESN2 promoter and all of them resulted in significantly higher promoter activity (Figure 4D). Additionally, changing the +1 thymine to guanine and thereby creating a +1 purine resulted in lower reporter activity (p < 0.01) as is consistent with the expected increase of RNA Pol II pausing (see Table S5). These results confirm that sequences in gene promoters that mediate RNA polymerase pausing regulate gene expression.
RNA Pol II Pausing Affects the Expression of Genes Including Those Mutated in Human Diseases
While our findings uncovered the contributions of sequences to RNA polymerase pausing, they also point to alteration of pausing as a potential treatment of genetic diseases that arise due to aberrant gene expression. Therapies have aimed at restoring gene expression since dysregulated gene expression is the mechanistic basis of many diseases. A recent success is Nusinersen that promotes the expression of SMN2 in spinal muscular atrophy.The 1,367 pause sites that are characterized in this study are found in 1,141 genes. The Online Mendelian Inheritance in Man (OMIM) database shows that mutations in 347 of these genes (30%) are known to cause human diseases (see Table S6 or as an interactive table see Web Resources). Additionally, some of these mutations have already been shown to affect gene expression (see examples in Table 3). Restoring the expression levels of these genes through RNA polymerase pausing is a potential treatment option. The large number of these disease-causing mutant genes with paused RNA Pol II implies that approaches which target the regulatory sequences identified here could have broad applicability.
Table 3
Diseases Characterized by Dysregulated Gene Expression
Disease
Gene
Reference
Multiple myeloma
ELL2
Li et al.52
Elliptocytosis
EPB41L2
Moriniere et al.47
Monocytopenia and mycobacterial infection syndrome (monoMAC)
GATA2
Johnson et al.53
Nephrotic syndrome
KANK2
Gee et al.54
Focal segmental nephrosclerosis
MYO1E
Mele et al.48
Diamond-Blackfan anemia
RPL35A
Noel55
Diamond-Blackfan anemia
RPS19
Gazda et al.56
Diseases Characterized by Dysregulated Gene Expression
Discussion
In this study, we found that RNA Pol II pausing is highly regulated in normal human cells. Regardless of individuals or cell types, at more than 1,300 nucleotide locations of more than 1,000 genes, RNA polymerase pauses very precisely. These large number of sites allowed the identification of cis factors that regulate pausing of humanRNA Pol II. Alteration of these regulators of RNA Pol II pausing affects expression levels of specific genes. These findings thus provide a basis for altering gene expression levels in mechanistic studies and for the development of expression-based therapeutics.Aberrant gene expression leads to human diseases, including many single-gene disorders. To understand how gene expression is regulated, studies have yielded complex interactions between RNA polymerase, regulatory protein complexes, and underlying DNA sequences. While these findings are elegant, they are also daunting since the regulation appears so complex that one wonders whether gene transcription can be targeted to restore aberrant gene expression as disease treatment. However, the development of Nusinersen for treatment of spinal muscular dystrophy by increasing the expression of SMN demonstrates that gene expression-based therapeutics is possible. The urgent question is how to generalize the knowledge of transcription to develop treatments for other disorders.The current approach often involves screening antisense oligonucleotides or small molecules for ones that alter the expression of specific genes. Thus, a study has to be designed for each disease. In addition, while these screening methods may yield ways to alter the expression of a gene, they do not provide any mechanistic information. In contrast, we identified DNA sequences that regulate polymerase pausing. These sequences can be targeted to alter the expression levels of more than 1,000 human genes where we find RNA Pol II pausing, including more than 300 genes that are known to be mutated in genetic diseases. Mutations that result in aberrant gene expression could be ameliorated by targeting RNA Pol II pausing as a means to restore gene expression levels, regardless of whether the causal mutation affects polymerase pausing. Gene therapy trials have shown that relatively modest changes in gene expression can be therapeutic, for example, an increase in the expression of Factor IX to about 10% of normal is sufficient in the treatment of hemophilia B. There are other diseases where gene expression is the underlying cause and/or affects severity. For example, mutations in EPB41L2 that alter splicing result in increased RNA turnover, lower gene expression, and hereditary elliptocytosis. The severity of elliptocytosis is correlated to the expression of EPB41L2, so one can posit that a treatment can be developed that aims at deterring RNA Pol II from pausing to increase transcription and therefore gene expression.Among the genes that we examined in our experimental validations are MYO1E and SESN2. By mutagenesis, we showed that changing the cytosines at the pause sites to thymine led to higher promoter activities for both genes. Loss-of-function mutations in MYO1E results in podocyte injury leading to nephrotic syndrome, whereas overexpression of MYO1E can be protective against podocyte injury. Our results here suggest abrogating the RNA Pol II pause would lead to upregulation of MYO1E expression, which could be protective against podocyte injury in nephrotic syndromes. Similarly, SESN2 encodes an antioxidant enzyme that is protective against cellular stress in the liver and is being considered as a target for the treatment of chronic liver disease. This suggests that targeting RNA Pol II pausing may be used not only to correct pathogenic gene expression in Mendelian disorders, but it could also be used to change the expression of genes in chronic diseases.The ability to alter the expression level of genes specifically is important not only in the therapeutic setting. For mechanistic studies, it is often necessary to manipulate the expression of a gene of interest. Methods such as overexpression and knockdown/knockout of genes often produce expression levels that are too high or too low. Targeting RNA polymerase pausing may allow experiments to be conducted with gene expression changes that are within a more physiologic range.In conclusion, our study identifies the sequences that regulate RNA polymerase pausing on more than 1,000 human genes and show that these sequences can be altered to affect the expression level of specific genes. Thus, our finding provides a rationale to target RNA polymerase pausing in development of expression-based therapeutics for genetic disorders. Studies to elucidate how the sequence features promote the RNA Pol II to pause will expand our understanding of how nucleic acid sequence and most likely structure regulate transcription.
Authors: Peter B Rahl; Charles Y Lin; Amy C Seila; Ryan A Flynn; Scott McCuine; Christopher B Burge; Phillip A Sharp; Richard A Young Journal: Cell Date: 2010-04-30 Impact factor: 41.582
Authors: Chengqi Lin; Edwin R Smith; Hidehisa Takahashi; Ka Chun Lai; Skylar Martin-Brown; Laurence Florens; Michael P Washburn; Joan W Conaway; Ronald C Conaway; Ali Shilatifard Journal: Mol Cell Date: 2010-02-12 Impact factor: 17.970
Authors: Dig Bijay Mahat; Hojoong Kwak; Gregory T Booth; Iris H Jonkers; Charles G Danko; Ravi K Patel; Colin T Waters; Katie Munson; Leighton J Core; John T Lis Journal: Nat Protoc Date: 2016-07-21 Impact factor: 13.491
Authors: Chanhyo Lee; Xiaoyong Li; Aaron Hechmer; Michael Eisen; Mark D Biggin; Bryan J Venters; Cizhong Jiang; Jian Li; B Franklin Pugh; David S Gilmour Journal: Mol Cell Biol Date: 2008-03-10 Impact factor: 4.272
Authors: Telmo Henriques; Benjamin S Scruggs; Michiko O Inouye; Ginger W Muse; Lucy H Williams; Adam B Burkholder; Christopher A Lavender; David C Fargo; Karen Adelman Journal: Genes Dev Date: 2018-01-29 Impact factor: 11.361
Authors: Martyna Gajos; Olga Jasnovidova; Alena van Bömmel; Susanne Freier; Martin Vingron; Andreas Mayer Journal: Nucleic Acids Res Date: 2021-05-07 Impact factor: 16.971
Authors: Jing Luan; Guanjue Xiang; Pablo Aurelio Gómez-García; Jacob M Tome; Zhe Zhang; Marit W Vermunt; Haoyue Zhang; Anran Huang; Cheryl A Keller; Belinda M Giardine; Yu Zhang; Yemin Lan; John T Lis; Melike Lakadamyali; Ross C Hardison; Gerd A Blobel Journal: Cell Rep Date: 2021-02-23 Impact factor: 9.423
Authors: Sergey V Lobanov; Branduff McAllister; Mia McDade-Kumar; G Bernhard Landwehrmeyer; Michael Orth; Anne E Rosser; Jane S Paulsen; Jong-Min Lee; Marcy E MacDonald; James F Gusella; Jeffrey D Long; Mina Ryten; Nigel M Williams; Peter Holmans; Thomas H Massey; Lesley Jones Journal: NPJ Genom Med Date: 2022-09-05 Impact factor: 6.083
Authors: Ning Deng; Yue Zhang; Zhihai Ma; Richard Lin; Tzu-Hao Cheng; Hua Tang; Michael P Snyder; Stanley N Cohen Journal: NAR Genom Bioinform Date: 2022-07-27