Literature DB >> 27663494

A generative model for the behavior of RNA polymerase.

Joseph G Azofeifa¹, Robin D Dowell^1,2,3.

Abstract

MOTIVATION: Transcription by RNA polymerase is a highly dynamic process involving multiple distinct points of regulation. Nascent transcription assays are a relatively new set of high throughput techniques that measure the location of actively engaged RNA polymerase genome wide. Hence, nascent transcription is a rich source of information on the regulation of RNA polymerase activity. To fully dissect this data requires the development of stochastic models that can both deconvolve the stages of polymerase activity and identify significant changes in activity between experiments.
RESULTS: We present a generative, probabilistic model of RNA polymerase that fully describes loading, initiation, elongation and termination. We fit this model genome wide and profile the enzymatic activity of RNA polymerase across various loci and following experimental perturbation. We observe striking correlation of predicted loading events and regulatory chromatin marks. We provide principled statistics that compute probabilities reminiscent of traveler's and divergent ratios. We finish with a systematic comparison of RNA Polymerase activity at promoter versus non-promoter associated loci.
AVAILABILITY AND IMPLEMENTATION: Transcription Fit (Tfit) is a freely available, open source software package written in C/C ++ that requires GNU compilers 4.7.3 or greater. Tfit is available from GitHub (https://github.com/azofeifa/Tfit). CONTACT: robin.dowell@colorado.eduSupplementary information: Supplementary data are available at Bioinformatics online.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：

Year: 2016 PMID： 27663494 PMCID： PMC5942361 DOI： 10.1093/bioinformatics/btw599

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Regulation of gene expression plays crucial roles in diseased and healthy cellular phenotypes. Gene expression requires RNA Polymerase II (RNAP) recruitment to promoters and subsequent signaling cues to direct RNAP to fully transcribe the protein coding region (Bentley, 2014; Fuda ). With the advent of high throughput sequencing data, RNAP’s location has been profiled genome-wide providing deep insight into the enzymatic stages of transcription. In brief, RNAP recruitment, initiation, pause, pause release, elongation and termination are highly controlled transcriptional stages that are distinctly regulated (Fuda ; Jonkers and Lis, 2015; Kwak ; Nojima ). Nascent transcription assays, such as Global Run-On (GRO-seq), Precision Nuclear Run-on (PRO-seq) and Native Elongating Transcript (NET-seq), measure the production of transcripts from all cellular RNA polymerases genome-wide (Core and Lis, 2008; Kwak ; Nojima ). Given their high degree of resolution and strand specific nature, these assays have tremendous potential to refine our understanding of each stage of the transcription process. Indeed, these assays have been used to study the transition from paused to elongating polymerase, enhancer RNA transcription and sites of RNAP termination (Azofeifa ; Fong ; Hah ; Wang ). Yet the precision of these techniques depends on an inherently noisy sequencing process with biases in both the experimental protocol (Kwak ) and read mapping strategies. To fully explore the richness of nascent transcriptional assays requires the development of biologically motivated models of RNAP that provide meaningful summary statistics of the data. To identify transcripts within nascent transcription data, work has primarily focused on segmentation based algorithms such as hidden Markov models (HMMs) (Azofeifa ; Chae ) and windowing approaches (Allison ). These ‘transcribed regions’ share similar statistical properties such as comparable levels of mapped reads. However, mapped reads within these regions are distinctly non-stationary. Seen commonly in chromatin immunoprecipitation followed by sequencing (ChIP-seq) for RNAP, ‘peaks’ of GRO-seq mapping occur at both promoter proximal and enhancer regions (Natoli and Andrau, 2012). In fact, traditional segmentation analysis tends to group these visually distinct elements into one long classification. Consequently, ‘transcribed regions’ often do not correspond to individual transcripts. Within transcribed regions, the behavior of polymerase lends itself to substructure. For example, the initiating form of polymerase pauses and produces bidirectional peak signatures upstream of the gene body (Bentley, 2014; Kwak ). Several recent efforts have focused on identifying the bidirectional transcripts characteristic of initiating/paused RNAP using supervised learning approaches such as naive Bayes (Melgar ), support vector regression (Danko ) or logistic regression (Azofeifa ). Although each approach shows promise, these classifications lack an easy biological interpretation as learned regression coefficients do not directly represent a biological process. Furthermore, methods that focus solely on the bidirectional peak signal fail to capture the productive elongation stage of transcription. Our first approach to a unified model of RNAP behavior described paused RNAP as an asymmetric double geometric followed by elongation signal defined as a homogeneous Poisson point process (Lladser ). Although a significant step forward, model inference was constrained to single isoform genes as the parameter estimation method supports only one loading event. Additionally, the model was single stranded and therefore most applicable to organisms such as Drosophila melanogaster where paused RNAP does not show bidirectional transcript signal. To address these limitations, we propose a novel generative model of RNAP that describes both initiating/paused and elongating RNAP. The model accounts for signal on both strands simultaneously, capturing the behavior of RNA Polymerase II genome-wide. Even in light of the non-exponential family distribution functions, we develop a parameter estimation method based on the theory of maximum likelihood. With our model in hand, we perform inference into RNAP activity and assay changes in loading event locations and pausing probabilities across conditions.

2 Algorithm

2.1 Model description

Eukaryotic gene expression is a highly coordinated stochastic process involving the enzymatic synthesis of RNA by RNAP. The precise location of RNAP along DNA can be measured either by chromatin immunoprecipitation or nascent transcription assays. Conceptually, in the absence of noise, each read originates from an actively engaged RNAP molecule. Here we present a unified probabilistic model of transcription that captures the position Z of RNAP (Fig. 1).

Fig. 1

Model of polymerase activity. A summary of the probabilistic model (on left, see text for full description of parameters) with examples of data generated from the model (on right). Here ‘Loading’ refers to recruitment of polymerase and pre-initiation complex formation, ‘Initiation’ refers to initiation of transcription and promoter-proximal pausing, and ‘Elongation’ refers to productive elongation following pause release (Fuda ; Adelman and Lis, 2012; Lee and Young, 2013; Jonkers and Lis, 2015) (Color version of this figure is available at Bioinformatics online.) At protein coding genes, RNAP is first recruited to the promoter region at the transcriptional start site (TSS). We model the loading position X as a Gaussian distributed random variable with parameters where μ represents the typical loading position and the amount of error in recruitment to μ. Upon recruitment, RNAP selects and binds to either the forward or reverse strand, which we characterize as a Bernoulli random variable S with parameter π. Following loading and pre-initiation, RNAP immediately escapes the promoter and transcribes a short distance, Y. We assume that the initiation distance (also referred to as entry length (Jonkers and Lis, 2015)), is distributed as an exponential random variable with rate parameter λ. For paused polymerase, the final genomic position Z of RNAP is a sum of two independent random variables (Equation 1). In Equation 1, represents the reverse and forward strand decision respectively. Since RNAP processes in a direction, S also encodes the signed displacement away from μ. We solve these convolutions analytically and provide a properly normalized probability distribution function (Equation 2) governing the loading position and entry length of RNAP. From Equation 2, denotes the standard normal distribution and an indicator function. represents the Mill’s ratio which is defined as where is the cumulative distribution function of the standard Gaussian density. To note, the functional limits of h(z, s) as and σ tend to zero are Gaussian and exponential density functions, respectively. For these reasons, h(z, s) has been referred to as an exponentially modified Gaussian (Reed and Jorgensen, 2004). Following recruitment, strand decision and initiation, RNAP may transition to productive elongation (e.g. pause release) moving to transcribe the full length of the coding sequence. And finally, RNAP releases from the DNA at termination sites, for the forward and reverse strand respectively. We describe the location of elongating polymerases as a homogeneous Poisson point process (Lladser ). By well-known conditioning results for Poisson point processes (Kingman, 1992), the positions of elongating polymerases should be uniformly distributed (Equation 3) between the loading (μ) and termination sites (l). In equation 3, represents an indicator function that directly encodes the biological constraint that elongating RNAP must first load at μ. Given that Z describes the location of RNAP, it originates either from the initiating/paused, h(z, s), or elongating, u(z, s), stage of transcription. For brevity, we refer to the Loading/Initiation/Paused and Elongating/Termination stages as LI and ET respectively. Nascent transcription assays serve as a readout on RNAP dynamics. Like most high throughput assays, GRO-seq (or PRO-seq) is a population averaged assay, thus providing a histogram reflecting the distribution of RNAP locations. In this way, however, GRO-seq does not directly identify whether a read originated from either the LI or ET stage of polymerase activity. To capture these processes jointly, let k be a multinomial random variable that records a specific transcriptional component and is selected with probability w. Thusly, represents a finite set of M transcriptional components. With this in mind, represents a mixture distribution describing an arbitrary number of initiation and elongation components (Equation 4). Importantly, in Equation 4 may represent either h(z, s) or u(z, s) (Equations 2, 3 respectively). If represents the set of LI components and represents the set of ET components, then there exists a such that μ lower or upper bounds the support of k depending on the strand orientation of k. In this way, LI and ET components are directly linked.

2.2 Model inference

Under a finite M-mixture model, we wish to perform model inference over Θ given nascent transcription data. Let G be the set of aligned reads across the entire genome where each consists of a genomic coordinate z and strand identifier s. At some genomic locus , let and . In total, we seek to identify a parameter set under which D is most probable, (Equation 5), i.e. the maximum likelihood estimate (MLE). Without specifying the set of transcriptional component identifiers (K) associated with D, Equation 5 does not emit a closed form solution. Even still, does not fully specify or as z equals the sum of two latent random variables: x (loading position) and y (initiating length). However, observing the set of initiating lengths (Y) effectively decouples the convolution in Equation 2 allowing for a straightforward computation of the MLE for a Gaussian and an exponential distribution. Taken together, let the complete data be . It follows easily that has a closed form solution given the assumed independence of . Although we do not observe K or Y, we can treat k and y as random variables and perform iterative optimization of Equation 5 by the Expectation Maximization algorithm (EM). The EM algorithm alternates between two steps: the E-step computes the conditional expectation of latent variables given observed variables (Equation 6) and the M-step performs a gradient step along this expectation. Admittedly daunting, simplification of Equation 6 can be achieved in a number of ways. First, we assume that k and y are independent therefore . Furthermore, integrates to one across and sums to one over all components. Finally, we need not consider for mixture components involving elongating polymerase. Therefore, we can see that the complete data log-likelihood function depends only on three quantities: and k. The probability a component k given a data point g (Equation 7) follows immediately from Bayes’ Theorem and for succinctness we define this term as . Commonly referred to as the responsibility term (Bilmes ), measures the extent to which g belongs to some component k. Turning to Y, the convolution of Z induced by the simultaneous loading and initiation of RNAP requires a more involved computation to the expected value of given the incomplete data (Equation 8). Fortunately however, knowledge of Z = z reveals conditional dependence between X and Y and thus a way forward for iterative optimization of and λ. To complete the E-step, we define the conditional expectation of the initiating length Y conditioned on θ, Equation 9. Expectations over X (the random variable governing RNAP loading position) given g and θ can be shown easily from the linearity of expectation. With the necessary conditional expectations defined, we solve for the maximum of Equations 6 and 8. Equation 10 provides the ‘update rules’ for the EM algorithm. In keeping with the traditional notation of mixture models in Equation 10, we define and . Due to the finite nature of uniform distributions, our EM update rules (Equation 10) assume that l is fixed, presumably to the minimal or maximal order statistics, g0 and g of D. However, the length of elongation or exact site of termination varies throughout the genome (Derrien ). In this way, a fixed l is an unattractive modeling assumption of RNAP. To optimize l requires an adjusted EM algorithm. In brief, we want to preserve the contractive map property of the EM namely where and refers to a fixed point of the EM map. Yet, moving l away from the max and min order statistics will result in some having no probability mass () or to monotonically decrease. To estimate for an optimal l, we place a uniform distribution over D with support between and for either or s = – 1. The mixing weight remains fixed at where E represents the expected number of mapping errors across . Under a binomial error model assumption, where S refers to the length of the genome, the total number of mapped reads and p represents the probability that a read maps by chance alone. With this addition, l is no longer confined to the min and max order statistics of D. We provide a pseudo-code description of our MLE methodology in the Supplement (Algorithm 1).

2.3 Model selection

An important limitation of finite mixture models is a-priori knowledge of , the number of transcription components. To perform model selection over potentially many component sizes, we utilize penalized Bayesian Information Criterion (BIC), equation 11. represents the likelihood function evaluated at , α is the penalty term, is the number of free parameters within a specific model topology (e.g. one initiation component, one forward strand and one reverse strand elongation component contains 8 free parameters) and N is the total number of data points within D. In brief, BIC penalizes model complexity while balancing improvement in . Unless otherwise specified, α is set to one for all subsequent analysis.

3 Results

3.1 Estimation of RNAP location from GRO-seq

After confirming our model inference method using simulated data (see Supplementary text and Fig. S1), we assess our model on publicly available biological data. To perform model inference of RNAP location requires an interval where GRO-seq read mapping data (D) can be collected. To this end, we utilized Fast Read Stitcher (FStitch) that implements a maximum entropy Markov model to segment the genome into ‘transcribed regions’ (Azofeifa , 2016). In a HCT116 GRO-seq dataset (Allen ), FStitch classified 19 709 transcribed regions. With these regions in hand, we computed by our MLE methodology across mixtures containing for each interval independently and selected a final by the minimum BIC score. Figure 2A shows a transcribed region with reference to the estimated density function. As a final illustrative example, Figure 2B displays the associated Bayesian Information Criterion scores as a function of model complexity. Supplementary Table S1 provides a collection of statistics describing the distribution of fitted parameters across all transcribed regions.

Fig. 2

Characteristic loci showing RNAP inference. (A) The final inferred density function at characteristic transcribed region and the super enhancer region contained therein, defined by FStitch (Azofeifa ) and dbSuper (Khan and Zhang, 2016) respectively. (B) The BIC calculation across 20 mixture models. A model complexity of 5 is shown to be minimal and thus considered optimal (Color version of this figure is available at Bioinformatics online.) To address the accuracy of our model complexity procedure, we reasoned that at active single isoform genes we should predict only one LI component while at transcriptionally inactive regions (by FStitch) we should predict no components. Supplementary Figure S2 highlights the accuracy of our RNAP inference model to discriminate between active and inactive transcribed regions based solely on LI component presence (AUC). At a FDR of 0.05, we observe that the distribution of at single isoform genes contains a clear and prominent mode at (Supplementary Fig. S2). The distribution of at both single isoform genes and all FStitch-defined transcribed regions is heavy tailed, suggesting an appreciable number of transcribed regions contain more than one RNAP loading event (Supplementary Fig. S3). To assess whether the model is incorrectly introducing extra LI component centers to compensate for data poorly described by an exponentially modified Gaussian distribution, we compared the distribution of () at loci harboring to the distribution of LI component standard deviation (). We observe that median pairwise LI component center distance far exceeds what we would expect under the variability of LI component sizes, indicating that describe independent portions of the data. Suggested by the associated standard deviations in Supplementary Table S1, varies from locus to locus: some genes experience a large degree of initiation () or considerable strand bias . Whether this variability relates to experimental noise or actual biological structure, may be addressed by the reproducibility and consistency of across biological replicates. To this end, FStitch defined transcribed regions between replicate one and two were merged and model inference, by Tfit, was performed in each replicate independently. We observe strong correlation in model selection (Supplementary Fig. S4A) and exceedingly high correlation between identical parameters (e.g. , Supplementary Fig. S4B), suggesting that estimated parameters display low variance. Apart from w and l, we observe little to no correlation between differing parameters (e.g. and , Supplementary Fig. S4C), suggesting that there is no confounding dependencies between parameters. Estimates of w sufficiently larger than the population average (two standard deviations, Supplementary Table S1) are significantly lacking in an overlapping transcription start site (p-value numerically indistinguishable from zero, Hypergeometric test). Intuitively, this is expected as transcription over enhancer regions or non-coding regulatory loci do not harbor downstream gene bodies. We observed that the loading strand bias (π) tracks closely with the strand orientation of the underlying RefSeq gene (Supplementary Fig. S5A). Particularly, and for forward and reverse strand gene annotations respectively. Loading events lacking an annotated TSS display no appreciable strand bias, . Given our model predicts μ as the site of RNAP loading and l as the site of elongating termination, we compared the location of μ and l to estimates of annotated transcriptional start sites (TSS) and termination sites respectively. We observed a high degree of correlation between μ and the TSS, noting a significant base pair upstream displacement of μ from the TSS (Supplementary Fig. S5B). This displacement is in line with estimates from other groups using independent methods (Jonkers and Lis, 2015). Similar to previous estimates of transcription termination (Azofeifa ; Fong ), we observed l to be KB downstream the polyadenylation site (Supplementary Fig. S5C).

3.2 Predicting enzymatic changes of RNAP following experimental perturbation

Of significant importance to transcriptional studies is to monitor changes in RNAP activity following experimental perturbation. Specifically, the transition between promoter-proximal pausing into the subsequent RNAP elongation constitutes a highly regulated process of tremendous interest (Adelman and Lis, 2012). A popular metric to quantify changes in RNAP pausing, the ‘pausing ratio’ computes mapped reads under some TSS-centered window divided by mapped reads under some gene body-centered window. Given that our model directly infers LI and ET RNAP stages from data alone, we ask whether we can correctly identify changes in RNAP activity following experimental perturbation known to affect promoter proximal pausing. We reanalyzed data from two studies that utilized GRO-seq to probe RNAP pausing activity. Specifically, one study knocked down bromodomain-containing protein 4 (Brd4) in HEK293 cells and observed global changes in RNAP pausing ratios suggesting a critical role in RNAP pause release (Liu ). Yet another study noted global shifts in RNAP pausing ratios by cyclin-dependent kinase (CDK) 9 inhibition in HeLa cells (Laitem ). With these datasets, we hypothesis that our model should accurately reproduce the observed changes in pausing ratios by appropriate changes in the LI and ET mixing weights. For each dataset, we performed model estimation and computed changes in estimated parameters between the untreated and treated cells. Specifically, we fit a single component mixture model (one LI component and two ET components) at each gene annotation region. We then compared pairwise fold change in LI component mixing weights first between untreated biological replicates and then between untreated and treated experiments. In both studies, we observed highly significant global changes in LI component mixing weights relative to untreated replicates (Fig. 3A).

Fig. 3

Changes in promoter proximal pausing are correctly identified by a generative model of RNAP. Mixture model inference was performed over RefSeq gene annotations between control and treated cell lines in two independently derived datasets: Laitem 2015 and Liu 2013. (A) The predicted difference in LI mixing weights between control (on left) and treated cells. Under a normal assumption, mean mixing weights were compared using a t-test. (B) The distribution of LI length (box) with the traditional computation of the pausing ratio as a function of window size, plotted for the control (bottom line) and treated (top line) cell lines of the Laitem 2015 dataset. Grey shading indicates one tenth of one standard deviation (Color version of this figure is available at Bioinformatics online.) Although easily computable, the standard pausing ratio calculations rely on ad hoc methods of window sizes and distance thresholds (Adelman and Lis, 2012). To highlight this point explicitly, we examined the impact of TSS-centered window size on the pausing ratio (Fig. 3B). Intuitively, window sizes that are either too small or too large dramatically reduce the observable differences between treated and untreated cells. For comparison, we provide the distribution of LI standard deviation obtained from our model.

3.3 RNAP model accurately predicts marks of regulatory elements

Beyond annotated genes, it is well known that key chromatin marks are associated with transcription in different parts of the genome (The ENCODE Project Consortium, 2007, 2012). Moderate levels of H3K4me1/2 and high levels of H3K27ac mark active enhancers whereas high levels of H3K4me3 and H3K27ac mark areas of active promoters. Recent studies show that these marks harbor transcription (Li ) and show a characteristic ‘bidirectional’ signature, where forward and reverse strand read coverage appear positively and negatively skewed respectively. To study the interplay between enhancer transcription and chromatin landscape requires the development of models to accurately identify bidirectional transcripts. Although our model of RNAP does not implicitly assume LI components to appear bidirectional, certain parameter combinations (e.g. and ) will show both positive and negative skew emanating from μ. To profile for bidirectional transcripts genome wide, we hereafter utilized our EM seeding method (complete description in Supplement; Θ set to the parameter values in Supplementary Table S1). To assess the accuracy of our bidirectional transcript classifications, we monitored how well a regulatory mark (DNase I HS, H3K27ac, H3K4me1/3) may be predicted from bidirectional transcription alone. With an HCT116 GRO-seq dataset (Allen ), we benchmarked our method against the current state-of-the-art bidirectional detection algorithm, dREG (Danko ). The BIC penalty α (Tfit) and support vector regression score (dREG) was varied. True positives were considered as an overlap with chromatin mark peaks (HCT116 (The Encode Project Consortium, 2012), MACS broad peak settings) and the resulting bidirectional transcript prediction. To assess false positives, we randomly selected an equivalent number of 2KB loci that do not overlap MACS peak calls. Thus, a false positive is a bidirectional prediction overlapping a negative example. We observed improvements over dREG across all regulatory marks (Fig. 4A, B). Both dREG and Tfit predict H3K27ac marks exceedingly well from bidirectional transcript presence alone, suggesting that H3K27ac signal reflects nascent transcription.

Fig. 4

RNAP model accurately profiles for bidirectional transcription. (A) A receiver operating characteristic (ROC) curve displaying the relationship between true and false positive rates of H3K27ac prediction from bidirectional transcription alone. The area under the ROC curve (AUC) values are summarized for multiple marks. As a Venn diagram, (B) shows the overlap in bidirectional transcription classifications between dREG and Tfit at false discovery rate of 0.05 relative to the H3K27ac prediction (Color version of this figure is available at Bioinformatics online.)

3.4 Three dimensionally paired loci display centrality and associativity based on bidirectional transcription

The role of transcription at enhancer elements (defined commonly as non-TSS associated H3K27ac presence) remains an open and exciting question. Correlation in both transcript levels and three dimensional proximity of enhancer elements and target genes (Allen ; Azofeifa ; Le ) point prominently to the functional importance of enhancer transcripts. To begin to address the question of enhancer RNA function, we present an analysis that demonstrates both the utility of our predicted RNAP loading events and the intriguing relationship between chromatin interaction datasets and nascent transcription assays. Given that the insulator protein CTCF has been implicated as a key player in enhancer to gene looping events (Phillips and Corces, 2009; Splinter ), we examined the loci-loci pair interaction network defined by Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) for CTCF derived from the K562 cell line (The Encode Project Consortium, 2012). With a cell line matched GRO-seq dataset (Core ), we compared network attributes of loci containing or lacking bidirectional transcript predictions (γ = 1, FDR = 0.05 according to H3K27ac prediction). Figure 5A displays two illustrative connected component examples, built as described in the Supplement.

Fig. 5

CTCF paired loci network displays assortativity by bidirectional presence (A) displays two characteristic connected components on chromosome 1 and 2 from a CTCF ChIA-PET dataset derived from the K562 cell line. Nodes are colored as to whether a Tfit prediction overlaps a paired loci by one base pair or not; Bidir. and ∼Bidir. respectively. The circumference of the node is proportional to the degree. (B) The proportion of edges containing a similar label, significance is calculated by a Binomial test. (C) The distribution of the assortativity coefficient across all connected components, > 0 indicates modularity (Color version of this figure is available at Bioinformatics online.) Enhancers are implicated in defining key cellular phenotypes such as cell fate (Creyghton ; Whyte ) and tumorigenesis (Hnisz ; Qian ) and thus may play a central role in three dimensional looping. Our constructed network reflects locations in the genome (nodes) connected by the CTCF ChIA-PET data. With this in mind, we grouped nodes by whether they lacked or contained an association with an annotated TSS or Tfit prediction and computed common measures of node centrality. We observed the highest degree of node centrality at non TSS associated bidirectional transcripts across all criteria (Supplementary Table S2). This result suggests that enhancer RNAs play a central role in the 3D configuration of the genome. With the advent of chromosome conformation capture technology, extensive chromosomal looping has been observed at so called transcription factories (Deng ). These discrete nuclear sites of transcription allow for rapid expression of many three dimensionally proximal genes (Edelman and Fraser, 2012). To this end, we investigated evidence of modularity or network homophily based on lacking or containing bidirectional transcript presence. Indeed, the proportion of edges linking bidirectionally transcribed nodes is much higher than by chance alone (Fig. 5B). Furthermore, computation of network modularity (a measure of network label clustering) shows weak assortativity across most connected components (Fig. 5C).

4 Discussion

We described a probabilistic, generative mixture model that is founded on a biologically motivated description of RNAP behavior. Our model is inspired by the current understanding of polymerase behavior (Core ; Fuda ; Kwak ; Lee and Young, 2013) at protein coding genes and provides a principled mathematical approach to GRO-seq data analysis. To perform model inference, we derived a parameter estimation scheme based on the theory of maximum likelihood and the Expectation Maximization algorithm. When applied to GRO-seq, our model can not only identify individual transcripts but also quantify their characteristics, as evidenced by the fact that the loading site predictions of our model correlate well with RefSeq annotation. The relatively recent and unexpected discovery that enhancers are themselves transcribed challenges our conventional view of transcriptional regulation (Li ). Whether the underlying region is a promoter or an enhancer, our model explicitly assumes a singular behavior of RNAP genome-wide. The ability of our model to fit well to transcribed regions that are unannotated suggests that the unified model recently proposed in the literature (Andersson ; Core ) and assumed here is appropriate. Turning our generative model to a discriminative classifier, we observed that our bidirectional transcript predictions precisely recover sites of marked regulatory chromatin. Indeed, our method outperforms the current state of the art algorithm (Danko ) for bidirectional transcript classification. Furthermore, our observation that non-TSS associated bidirectional transcripts constitute a central role in CTCF ChIA-PET networks points yet again to the increasing importance of enhancer RNAs. The parameters of our model provide meaningful summary statistics of nascent transcription data. Changes in these statistics across cell types, experimental conditions and/or perturbations reflect biologically meaningful alterations in RNAP behavior. Consistent with this idea, our inference procedure confirmed a dramatic change in pausing probability following experimental perturbations for two independently derived datasets. We anticipate that our model will be used to assign particular regulatory proteins to the distinct stage of polymerase they regulate. In the case of a single experiment, correlations between the parameters of our model and other high throughput datasets will inform on the underlying regulatory process. For example, it is of great interest to monitor the co-occurrence of transcription factor binding events and motifs with initiation site predictions obtained from our model. As we learn more about how polymerase is regulated, it will be possible to extend our model accordingly. Close inspection of GRO-seq read coverage reveals oscillatory behavior within the gene body, as such a homogeneous Poisson point process may not be an appropriate model governing transcriptional elongation. Pol II elongation rates vary and influence a number of co-transcriptional processes (Jonkers and Lis, 2015; Jonkers ), suggesting that the undulations observed in the elongation region could be biologically informative. However, the extent to which heightened levels of mapped reads correspond to mapping biases or biologically meaningful points of transcription regulation is unclear. Additionally, RNAP is thought to pause proximal to its termination (Bentley, 2014). Indeed, an interesting and prominent 3′ peak is often observed a few kilobases upstream of the end of the elongation region (Fong ). More detailed studies of both elongation and transcriptional termination, both experimentally and computationally, are needed to shed light onto these processes (Fong ; Jonkers and Lis, 2015). Furthermore, an improved understanding of these process may also require us to re-evaluate the interpretation of particular parameters. In summary, with the advent of high throughput sequencing transcriptional assays like GRO-seq, RNAP is now being studied in increasingly exciting and precise ways. One of the key goals of our mixture model was to provide a set of biologically interpretable parameters that capture alterations in polymerase behavior induced by changes in regulatory proteins. To this end, a key next step will be the development of rigorous statistical methods for detecting potentially small, but meaningful changes in model parameters between experiments. We believe our proposed mixture model is one of the first steps in building a comprehensive predictive model of RNAP. Click here for additional data file.

39 in total

1. Transcription regulation through promoter-proximal pausing of RNA polymerase II.

Authors: Leighton J Core; John T Lis
Journal: Science Date: 2008-03-28 Impact factor: 47.728

2. RNA Pol II transcription model and interpretation of GRO-seq data.

Authors: Manuel E Lladser; Joseph G Azofeifa; Mary A Allen; Robin D Dowell
Journal: J Math Biol Date: 2016-05-03 Impact factor: 2.259

3. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers.

Authors: Leighton J Core; André L Martins; Charles G Danko; Colin T Waters; Adam Siepel; John T Lis
Journal: Nat Genet Date: 2014-11-10 Impact factor: 38.330

4. Convergence of developmental and oncogenic signaling pathways at transcriptional super-enhancers.

Authors: Denes Hnisz; Jurian Schuijers; Charles Y Lin; Abraham S Weintraub; Brian J Abraham; Tong Ihn Lee; James E Bradner; Richard A Young
Journal: Mol Cell Date: 2015-03-19 Impact factor: 17.970

5. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

Review 6. Transcription factories: genetic programming in three dimensions.

Authors: Lucas Brandon Edelman; Peter Fraser
Journal: Curr Opin Genet Dev Date: 2012-02-23 Impact factor: 5.578

7. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

Authors: Thomas Derrien; Rory Johnson; Giovanni Bussotti; Andrea Tanzer; Sarah Djebali; Hagen Tilgner; Gregory Guernec; David Martin; Angelika Merkel; David G Knowles; Julien Lagarde; Lavanya Veeravalli; Xiaoan Ruan; Yijun Ruan; Timo Lassmann; Piero Carninci; James B Brown; Leonard Lipovich; Jose M Gonzalez; Mark Thomas; Carrie A Davis; Ramin Shiekhattar; Thomas R Gingeras; Tim J Hubbard; Cedric Notredame; Jennifer Harrow; Roderic Guigó
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

8. Identification of active transcriptional regulatory elements from GRO-seq data.

Authors: Charles G Danko; Stephanie L Hyland; Leighton J Core; Andre L Martins; Colin T Waters; Hyung Won Lee; Vivian G Cheung; W Lee Kraus; John T Lis; Adam Siepel
Journal: Nat Methods Date: 2015-03-23 Impact factor: 28.547

9. B cell super-enhancers and regulatory clusters recruit AID tumorigenic activity.

Authors: Jason Qian; Qiao Wang; Marei Dose; Nathanael Pruett; Kyong-Rim Kieffer-Kwon; Wolfgang Resch; Genqing Liang; Zhonghui Tang; Ewy Mathé; Christopher Benner; Wendy Dubois; Steevenson Nelson; Laura Vian; Thiago Y Oliveira; Mila Jankovic; Ofir Hakim; Anna Gazumyan; Rushad Pavri; Parirokh Awasthi; Bin Song; Geng Liu; Longyun Chen; Shida Zhu; Lionel Feigenbaum; Louis Staudt; Cornelis Murre; Yijun Ruan; Davide F Robbiani; Qiang Pan-Hammarström; Michel C Nussenzweig; Rafael Casellas
Journal: Cell Date: 2014-12-04 Impact factor: 41.582

10. Global analysis of p53-regulated transcription identifies its direct targets and unexpected regulatory mechanisms.

Authors: Mary Ann Allen; Zdenek Andrysik; Veronica L Dengler; Hestia S Mellert; Anna Guarnieri; Justin A Freeman; Kelly D Sullivan; Matthew D Galbraith; Xin Luo; W Lee Kraus; Robin D Dowell; Joaquin M Espinosa
Journal: Elife Date: 2014-05-27 Impact factor: 8.140

20 in total

1. Discovering Transcriptional Regulatory Elements From Run-On and Sequencing Data Using the Web-Based dREG Gateway.

Authors: Tinyi Chu; Zhong Wang; Shao-Pei Chou; Charles G Danko
Journal: Curr Protoc Bioinformatics Date: 2018-12-27

2. Building a Science Gateway For Processing and Modeling Sequencing Data Via Apache Airavata.

Authors: Zhong Wang; Marcus A Christie; Eroma Abeysinghe; Tinyi Chu; Suresh Marru; Marlon Pierce; Charles G Danko
Journal: Pract Exp Adv Res Comput 2018 (2018) Date: 2018-07

3. Quantifying RNA synthesis at rate-limiting steps of transcription using nascent RNA-sequencing data.

Authors: Adelina Rabenius; Sajitha Chandrakumaran; Lea Sistonen; Anniina Vihervaara
Journal: STAR Protoc Date: 2022-01-05

Review 4. Prostate Cancer Epigenetic Plasticity and Enhancer Heterogeneity: Molecular Causes, Consequences and Clinical Implications.

Authors: Jeroen Kneppers; Andries M Bergman; Wilbert Zwart
Journal: Adv Exp Med Biol Date: 2022 Impact factor: 3.650

5. Integrated genomics approaches identify transcriptional mediators and epigenetic responses to Afghan desert particulate matter in small airway epithelial cells.

Authors: Arnav Gupta; Sarah K Sasse; Reena Berman; Margaret A Gruca; Robin D Dowell; Hong Wei Chu; Gregory P Downey; Anthony N Gerber
Journal: Physiol Genomics Date: 2022-09-05 Impact factor: 4.297

6. TFIID Enables RNA Polymerase II Promoter-Proximal Pausing.

Authors: Charli B Fant; Cecilia B Levandowski; Kapil Gupta; Zachary L Maas; John Moir; Jonathan D Rubin; Andrew Sawyer; Meagan N Esbin; Jenna K Rimel; Olivia Luyties; Michael T Marr; Imre Berger; Robin D Dowell; Dylan J Taatjes
Journal: Mol Cell Date: 2020-03-30 Impact factor: 17.970

7. Connected Gene Communities Underlie Transcriptional Changes in Cornelia de Lange Syndrome.

Authors: Imène Boudaoud; Éric Fournier; Audrey Baguette; Maxime Vallée; Fabien C Lamaze; Arnaud Droit; Steve Bilodeau
Journal: Genetics Date: 2017-07-05 Impact factor: 4.562

8. Stress-induced transcriptional memory accelerates promoter-proximal pause release and decelerates termination over mitotic divisions.

Authors: Anniina Vihervaara; Dig Bijay Mahat; Samu V Himanen; Malin A H Blom; John T Lis; Lea Sistonen
Journal: Mol Cell Date: 2021-03-29 Impact factor: 17.970

9. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers.

Authors: Li Yao; Jin Liang; Abdullah Ozer; Alden King-Yung Leung; John T Lis; Haiyuan Yu
Journal: Nat Biotechnol Date: 2022-02-17 Impact factor: 68.164

10. Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment.

Authors: Jonathan D Rubin; Jacob T Stanley; Rutendo F Sigauke; Cecilia B Levandowski; Zachary L Maas; Jessica Westfall; Dylan J Taatjes; Robin D Dowell
Journal: Commun Biol Date: 2021-06-02