Jordan Douglas1,2, Richard Kingston1, Alexei J Drummond1,2. 1. School of Biological Sciences, University of Auckland, Auckland, New Zealand. 2. Centre for Computational Evolution, School of Computer Science, University of Auckland, Auckland, New Zealand.
Abstract
Transcription elongation can be modelled as a three step process, involving polymerase translocation, NTP binding, and nucleotide incorporation into the nascent mRNA. This cycle of events can be simulated at the single-molecule level as a continuous-time Markov process using parameters derived from single-molecule experiments. Previously developed models differ in the way they are parameterised, and in their incorporation of partial equilibrium approximations. We have formulated a hierarchical network comprised of 12 sequence-dependent transcription elongation models. The simplest model has two parameters and assumes that both translocation and NTP binding can be modelled as equilibrium processes. The most complex model has six parameters makes no partial equilibrium assumptions. We systematically compared the ability of these models to explain published force-velocity data, using approximate Bayesian computation. This analysis was performed using data for the RNA polymerase complexes of E. coli, S. cerevisiae and Bacteriophage T7. Our analysis indicates that the polymerases differ significantly in their translocation rates, with the rates in T7 pol being fast compared to E. coli RNAP and S. cerevisiae pol II. Different models are applicable in different cases. We also show that all three RNA polymerases have an energetic preference for the posttranslocated state over the pretranslocated state. A Bayesian inference and model selection framework, like the one presented in this publication, should be routinely applicable to the interrogation of single-molecule datasets.
Transcription elongation can be modelled as a three step process, involving polymerase translocation, NTP binding, and nucleotide incorporation into the nascent mRNA. This cycle of events can be simulated at the single-molecule level as a continuous-time Markov process using parameters derived from single-molecule experiments. Previously developed models differ in the way they are parameterised, and in their incorporation of partial equilibrium approximations. We have formulated a hierarchical network comprised of 12 sequence-dependent transcription elongation models. The simplest model has two parameters and assumes that both translocation and NTP binding can be modelled as equilibrium processes. The most complex model has six parameters makes no partial equilibrium assumptions. We systematically compared the ability of these models to explain published force-velocity data, using approximate Bayesian computation. This analysis was performed using data for the RNA polymerase complexes of E. coli, S. cerevisiae and Bacteriophage T7. Our analysis indicates that the polymerases differ significantly in their translocation rates, with the rates in T7 pol being fast compared to E. coli RNAP and S. cerevisiae pol II. Different models are applicable in different cases. We also show that all three RNA polymerases have an energetic preference for the posttranslocated state over the pretranslocated state. A Bayesian inference and model selection framework, like the one presented in this publication, should be routinely applicable to the interrogation of single-molecule datasets.
Transcription is carried out by RNA polymerases: RNAP in Escherichia coli, pol II in Saccharomyces cerevisiae, and T7 pol in Bacteriophage T7. It involves the copying of template double-stranded DNA (dsDNA) into single-stranded messenger RNA (mRNA). RNAP and pol II are comprised of multiple subunits, and their catalytic subunits are homologous [1, 2]. In contrast, T7 pol exists as a monomer with a distinct sequence, and resembles the E. coli DNA polymerase I [3].Optical trapping experiments have been performed on the transcription elongation complex (TEC) from a variety of organisms [4-10]. In a typical experimental setup, two polystyrene beads (around 600 nm in diameter) are tethered to the system; one attached to the RNA polymerase and the other to the DNA [4]. As transcription elongation progresses, the distance between the two beads increases and the velocity of a single TEC can be computed. Optical tweezers can be used to apply a force F to the system (Fig 1).
Fig 1
Effect of an applied force on elongation velocity.
(A) Optical trapping setup showing dsDNA being transcribed by RNA polymerase (grey ellipse) into mRNA. Two polystyrene beads are tethered to the system allowing the application of force using optical tweezers. An assisting load F > 0 acts in the same direction as transcription (top) while a hindering load F < 0 acts in the opposing direction (bottom). Figure not to scale. (B) Schematic depiction of the effect of applying a force on RNA polymerase. Due to the stochastic nature of transcription at the single-molecule level, each experiment yields a different distance-time trajectory, even under the same applied force.
Effect of an applied force on elongation velocity.
(A) Optical trapping setup showing dsDNA being transcribed by RNA polymerase (grey ellipse) into mRNA. Two polystyrene beads are tethered to the system allowing the application of force using optical tweezers. An assisting load F > 0 acts in the same direction as transcription (top) while a hindering load F < 0 acts in the opposing direction (bottom). Figure not to scale. (B) Schematic depiction of the effect of applying a force on RNA polymerase. Due to the stochastic nature of transcription at the single-molecule level, each experiment yields a different distance-time trajectory, even under the same applied force.Single-molecule studies of the TEC have revealed that RNA polymerases progress in a discontinuous fashion [4, 11–14] with step sizes that correspond to the dimensions of a single nucleotide (3.4 Å [15]). Consequently, at the single molecule level, transcription is best modelled as a discrete process rather than a continuous one.A single cycle in the main transcription elongation pathway (Fig 2) requires (1) Forward translocation of the RNA polymerase, making the active site accessible; (2) Binding of the complementary nucleoside triphosphate (NTP); (3) Addition of the nucleotide onto the 3′ end of the mRNA. This third step involves NTP hydrolysis. Nucleoside monophosphate is added onto the chain and pyrophosphate is released from the enzyme.
Fig 2
State diagrams of RNA polymerase.
(A) The model of the main transcription elongation pathway, which shows the postulated states; the pathways for interconversion; and the rate constants that govern each part of the reaction. The transcription bubble is the set of β1 + h + β2 bases (see main text for definitions) in the double-stranded DNA which are unpaired. States are denoted by S(l, t) where l is the length of the mRNA and t is the position of the polymerase active site (small grey rectangle) with respect to the 3′ end of the mRNA. Polymerase translocation displaces the polymerase by a distance of δ = 1 bp = 3.4 Å. During polymerisation the chain is extended by one nucleotide. (B) Instantiated posttranslocated state of RNA polymerase transcribing the rpoB gene sequence, with β1 = 2, h = 9, β2 = 1. Forward translocation requires melting two T/A basepairs (right arrows). Backward translocation requires melting two C/G basepairs (left arrows). The mRNA secondary structure would also require reconfiguration [16, 17].
State diagrams of RNA polymerase.
(A) The model of the main transcription elongation pathway, which shows the postulated states; the pathways for interconversion; and the rate constants that govern each part of the reaction. The transcription bubble is the set of β1 + h + β2 bases (see main text for definitions) in the double-stranded DNA which are unpaired. States are denoted by S(l, t) where l is the length of the mRNA and t is the position of the polymerase active site (small grey rectangle) with respect to the 3′ end of the mRNA. Polymerase translocation displaces the polymerase by a distance of δ = 1 bp = 3.4 Å. During polymerisation the chain is extended by one nucleotide. (B) Instantiated posttranslocated state of RNA polymerase transcribing the rpoB gene sequence, with β1 = 2, h = 9, β2 = 1. Forward translocation requires melting two T/A basepairs (right arrows). Backward translocation requires melting two C/G basepairs (left arrows). The mRNA secondary structure would also require reconfiguration [16, 17].Our study aimed to identify the best model to describe this reaction cycle for RNAP, pol II, and T7 pol, based on analysis of published force-velocity data. As there are three reactions, up to six rate constants may be necessary for a kinetic model of a single nucleotide addition. These describe forward and backwards translocation (k and k), binding and release of NTP (k and k), and NTP catalysis and reverse-catalysis (k and k), also known as pyrophosphorolysis [18]. However fewer than six parameters may be required in practice.First, it is reasonable to assume that polymerisation is effectively irreversible [17, 19–21], as pyrophosphorolysis is a highly exergonic reaction, reducing the number of rate constants to five. Second, translocation between the pretranslocated and posttranslocated states, and/or NTP binding, may occur on timescales significantly more rapid than the other steps, in which case they may be modelled as equilibrium processes. These assumptions simplify the model, as the respective forward and reverse reaction rate constants are subsumed by a single equilibrium constant. Third, thermodynamic models of nucleic acid structure can be used to estimate sequence-dependent translocation rates k(l) and k(l), by invoking transition state theory, and this can sometimes result in parameter reduction [16, 17, 21].Irrespective of equilibrium assumptions and parameterisation, transcription elongation under applied force can be modelled in two fundamentally distinct ways. First, there are the deterministic equations which can be used to calculate the mean pause-free elongation velocity v(F, [NTP]) as a function of force F and NTP concentration [NTP]. This kind of model can be derived from the differential equations describing the time evolution of all species, by application of the steady state approximation. Force effects on the translocation step are incorporated using transition state theory [22, 23].An example is the following 3-parameter model [4].
where δ is the distance between adjacent basepairs (3.4 Å, [15]), is the equilibrium constant of NTP binding, is the equilibrium constant of translocation, k is the Boltzmann constant, and T is the absolute temperature. Increasingly complex equations may be used as more parameters or states are added to the model [4, 6, 17]. Such equations describe the velocity averaged across an ensemble of molecules. Parameter inference applied to velocity-force-[NTP] experimental data is straightforward and computationally fast when using these equations. However these equations do not describe the distribution of velocity nor do they account for site heterogeneity across the nucleic acid sequence and therefore cannot predict local sequence effects.Second, there are the stochastic models, which can be implemented via simulation of single-molecule behaviour using the Gillespie algorithm [24]. The mean velocity can be calculated by averaging velocities over a number of simulations for a given F and [NTP]. This offers not just the mean but a full distribution of velocities and could potentially explain emergent properties unavailable from a deterministic model. Unfortunately, simulating can be very slow and therefore parameter inference can be a problem.In this study we used a Markov-chain-Monte-Carlo approximate-Bayesian-computation (MCMC-ABC) algorithm [25] to estimate transcription elongation parameters for stochastic models via simulation. The observed pause-free velocities we are fitting to were measured at varying applied force and NTP concentration. For each RNA polymerase under study—E. coli RNAP, S. cerevisiae pol II, and T7 pol—we fit to one respective dataset from the single-molecule literature [4, 26, 27].
Models
Notation and state space
Suppose the TEC is transcribing a gene of length L. Then let S(l, t) denote a TEC state, where the mRNA is currently of length l ≤ L, and describes the position of the active site with respect to the 3′ end of the mRNA. When t = 0 the polymerase is pretranslocated and cannot bind NTP, and when t = 1 the polymerase is posttranslocated and can bind NTP (Fig 2). This study is focused on the main elongation pathway and the observed velocities being fitted have pauses filtered out. Therefore, although additional backtracked states (t < 0) [4, 28, 29] and hypertranslocated states (t > 1) [30, 31] exist, these are not incorporated in the model.Let β1 and β2 be the number of unpaired template nucleotides upstream and downstream of RNA polymerase, respectively, and let h be the number of basepairs in the DNA/mRNA hybrid (Fig 2A). Although there are uncertainties in these parameters, they are held constant at h = 9, β1 = 2, and β2 = 1 [17, 32].Transcription of the gene begins at state S(l0, 0) and ends upon reaching S(L, 0), where l0 = β1 + h + 2.
Parameterisation of the NTP binding step
NTP binding has been modelled as both a kinetic and equilibrium process in the literature [4, 17, 21].In a kinetic binding model, NTP binding occurs at pseudo-first order rate k[NTP], while NTP release occurs at rate k. In this case, k and must be estimated.Under a partial equilibrium approximation NTP binding and release are assumed to be rapid enough that equilibrium is achieved. In this case, the rate constants k and k are subsumed by the NTP dissociation constant which becomes the sole binding-related parameter to estimate.
Parameterisation of the translocation step
While inferences about the rate constants associated with NTP binding and catalysis (k, , and k) can be made directly from the data, the translocation step is more complex. Transition state theory is invoked in order to estimate k and k. Recasting the problem in this way (1) provides a way of accommodating the effects of applied force on the elongation process, and (2) allows the sequence-dependence of translocation to be incorporated by considering the energetics of basepairing. When allowing for sequence dependence, the total number of translocation rates required to model translocation of the full gene is 2(L − l0).
Thermodynamic models of base pairing energies
The standard Gibbs free energies ΔG0(= ΔG) involved in duplex formation are calculated using nearest neighbour models. The standard Gibbs energy of state S—arising from nucleotide basepairing and dangling ends—is calculated as
where SantaLucia’s DNA/DNA basepairing parameters [33] are used to calculate and Sugimoto’s DNA/RNA parameters [34] are used for . For the latter, dangling end energies are estimated as described by Bai et al. 2004 [21]. Here, and elsewhere, the ( superscript is used to denote a model parameter that can be evaluated from the sequence alone. Gibbs energies are expressed on a per molecule basis, relative to the thermal energy of the system, in multiples of kT, where kT = 4.28001 × 10−21 J at T = 310 K.In order for RNA polymerase to translocate forward (backward), up to two basepairs must be disrupted: (1) the basepair at the downstream (upstream) edge of the transcription bubble, and (2) the basepair at the upstream (downstream) end of the DNA/mRNA hybrid (Fig 2B). Differences in the basepairing energies in these regions confer sequence-dependence on the rate of translocation.
Calculation of translocation rates or translocation equilibrium constant
The standard Gibbs energies of the pre and posttranslocated states, and , respectively, are used with up to four additional terms—ΔG, δ1, , and —to calculate the translocation rates. The first three are model parameters which must be estimated while the latter is directly evaluated from the sequence.Let T(l, t) be the translocation transition state between S(l, t) and S(l, t + 1). Then is the sequence-dependent standard Gibbs energy of activation which must be overcome in order to translocate (Fig 3).
Fig 3
Parameterisation of the translocation step.
(A) Effects of model parameters on state energies. The figure displays a schematic Gibbs energy landscape of translocation, with backtracked states included for visualisation purposes. The solid red lines represent translocation states (t = 0: pretranslocated, t = 1: posttranslocated, and t < 0: backtracked), while the dashed red lines represent transition states. Applying an assisting force F > 0 tilts the landscape in favour of higher values of t. The effect of ΔG is observed at the posttranslocated state t = 1. In a translocation equilibrium model, the barrier height is assumed to be so small, = translocation is so rapid, that the transition states are disregarded. (B) A model for the sequence-dependent transition state between translocation states S(l, 0) and S(l, 1). This is required for estimating the Gibbs energy of basepairing in the transition state. The basepairing energy, added to a baseline term , together specify the height of the activation barrier (Eq 10).
Parameterisation of the translocation step.
(A) Effects of model parameters on state energies. The figure displays a schematic Gibbs energy landscape of translocation, with backtracked states included for visualisation purposes. The solid red lines represent translocation states (t = 0: pretranslocated, t = 1: posttranslocated, and t < 0: backtracked), while the dashed red lines represent transition states. Applying an assisting force F > 0 tilts the landscape in favour of higher values of t. The effect of ΔG is observed at the posttranslocated state t = 1. In a translocation equilibrium model, the barrier height is assumed to be so small, = translocation is so rapid, that the transition states are disregarded. (B) A model for the sequence-dependent transition state between translocation states S(l, 0) and S(l, 1). This is required for estimating the Gibbs energy of basepairing in the transition state. The basepairing energy, added to a baseline term , together specify the height of the activation barrier (Eq 10).Given an applied force F, the translocation rates governing transition between the pre and posttranslocated states (k(l) and k(l)) are calculated from barrier height using an Arrhenius type relation:The derived rates k(l) and k(l) are therefore dependent on the local sequence. The pre-exponential factor A is held constant at 106 s−1. This term has been arbitrarily set to a variety of values in previous studies (106−109 s−1 [16, 17, 21]). This has little consequence for model fitting, however the value of is entangled with the value of the pre-exponential factor A and can only be meaningfully interpreted in light of its value.If the system has time to reach equilibrium, the probabilities of observing the pretranslocated state S(l, 0) and posttranslocated state S(l, 1) areThis is described by equilibrium constant K.The physical meanings of the terms ΔG, δ1, , and , and the way they are used in the model, are detailed below.
Energetic bias for the posttranslocated states
ΔG (units kT) is a parameter added to the standard Gibbs energy of the posttranslocated state. If ΔG = 0, then the sequence alone determines the Gibbs energy difference between pre and posttranslocated states. In this case, pretranslocated states are usually favoured over posttranslocated states due to the loss of a single basepair in the hybrid of the latter.ΔG has frequently been estimated for T7 pol [35-37] and there has been discussion around whether such a term is necessary for RNAP [6].
Polymerase displacement and formation of the transition state
δ1 (units Å) is the distance that the polymerase must translocate forward to facilitate the formation of the transition state. The distance between adjacent basepairs is held constant at an experimentally measured value δ = 3.4 Å [15], and 0 < δ1 < δ. The response of the system to an applied force F depends on this term. In general, the application of force F tilts the Gibbs energy landscape—the Gibbs energy difference between adjacent translocation states being augmented by a factor (Fig 3A, [38, 39]).It may be necessary to estimate δ1 to model the data adequately [17], or it may be sufficient to simply set δ1 = δ/2 [38].
Energy barrier of translocation
and (units kT) together determine the activation barrier height in the translocation step. It is assumed that the sequence-dependent standard Gibbs energy of activation can be written asis therefore a sequence-independent baseline term used to compute the translocation barrier heights. The parameter must be estimated in order to evaluate translocation rates.In contrast is a term that is evaluated directly from the sequence derived from a model of the transition state (Fig 3B). The term is evaluated as the standard Gibbs energy of a TEC containing all hybrid and gene basepairs found in both S(l, t) and S(l, t + 1), ie. the intersection between the two sets of basepairs.
Model space
The full transcription elongation model makes use of the following 6 parameters:k (units s−1).(units μM).k (units μM−1 s−1).ΔG (units kT).δ1 (units Å).(units kT).However fewer than 6 parameters may be needed to adequately describe the data. If it is assumed that the energy differences between pre and posttranslocated states are determined by basepairing energies alone, the parameter ΔG does not need to be estimated. This is equivalent to holding ΔG constant at 0. If it is assumed that the displacement required for formation of the translocation transition state is half the distance between adjacent basepairs, the parameter δ1 does not need to be estimated. This is equivalent to holding δ1 constant at δ/2.Partial equilibrium approximations may also simplify the model, as detailed above. If binding is approximated as an equilibrium process, k does not need to be estimated. If translocation is approximated as an equilibrium process, and δ1 do not need to be estimated. One, both, or neither of these two steps (binding and translocation) could be assumed to achieve equilibrium, thus yielding four equilibrium model variants (Fig 4A). The introduction of partial equilibrium approximations for both the NTP binding and translocation steps has implications when specifying the prior distributions for the Bayesian analysis (S4 Appendix) The chemical master equations for single nucleotide addition cycles of these models are presented in S2 Appendix.
Fig 4
The space of models to be compared.
(A) The four equilibrium model variants. NTP binding, translocation, both, or neither, could be assumed to achieve equilibrium prior to catalysis. (B) The 12 transcription elongation models. An arrow connects model i to j if augmentation of model i with a single parameter generates model j. The number of parameters to estimate k is shown for each level in the network. Equilibrium approximation colour scheme is the same as in A. ΔG and δ1 can each be estimated or set to a constant.
The space of models to be compared.
(A) The four equilibrium model variants. NTP binding, translocation, both, or neither, could be assumed to achieve equilibrium prior to catalysis. (B) The 12 transcription elongation models. An arrow connects model i to j if augmentation of model i with a single parameter generates model j. The number of parameters to estimate k is shown for each level in the network. Equilibrium approximation colour scheme is the same as in A. ΔG and δ1 can each be estimated or set to a constant.Incorporating these simplifications to the model in a combinatorial fashion results in a total of 12 related models, which together constitute the model space. Our objective was to determine which of these 12 models provides the best description of the experimental data. The simplest model (Model 1) contains 2 parameters (k and K). The most complex model (Model 12) contains all 6 parameters. The full model space is displayed in Fig 4B.
Stochastic modelling
For each model we performed stochastic simulations, appropriate for the modelling of single-molecule force-velocity data. The simulations, performed using the Gillespie algorithm [24, 40], can be used to estimate the mean elongation velocity under a model.The estimation of mean velocity can be broken down into three steps. First, the system is initialised by placing the RNA polymerase at the 3′ end of the template—state S(l0, 0)—with the transcription bubble open and a DNA/RNA hybrid formed. The force and NTP concentrations are assigned their experimentally set values. Second, a chemical reaction is randomly sampled. The probability that reaction is selected is proportional to its rate constant k (Fig 2). The amount of time taken for the reaction to occur is sampled from the exponential distribution. States which are subject to a partial equilibrium approximation are coalesced into a single state, which augments the outbound rate constants. The second step is repeated until the RNA polymerase has copied the entire template. Third, the previous two steps are repeated c times. The mean elongation velocity is evaluated as the mean of each mean elongation velocity across c simulations. For further information, see S1 Appendix.
Relation to previous models and stochastic simulations
There is an extensive literature concerned with the kinetic modelling of transcription elongation. Such models may incorporate backtracking, hypertranslocation, and other reactions. Here we are concerned only with the central elongation pathway.A stochastic and sequence-dependent model was proposed by Bai et al. 2004 [21] for RNAP, with both NTP binding and translocation treated as equilibrium processes. The translocation equilibrium constant was calculated entirely from basepairing energies. Therefore this model is equivalent to Model 1, and the parameters were estimated as k = 24.7 s−1 and K = 15.6 μM from fit to experimental data. Maoiléidigh et al. 2011 also presented stochastic simulations of RNAP. The elongation component of their model is equivalent to Model 6 [17]. We build on this work by providing a systematic Bayesian framework for model comparison and parameter estimation.While our analysis employed sequence-dependent stochastic models, comparisons can also be made with some deterministic models.Abbondanzieri et al. 2005 [4], Larson et al. 2012 [41], Schweikhard et al. 2014 [26], and Thomen et al. 2008. [27, 37] described a deterministic model (for RNAP, pol II, pol II, and T7 pol respectively) which estimated k, K and translocation equilibrium constant . These are most similar to Model 4.Maoiléidigh et al. 2011 for RNAP, and Dangkulwanich et al. 2013 for pol II, however found that the translocation and catalysis were occurring on similar timescales, and modelled only NTP binding as an equilibrium process [17, 42]. They also estimated the distance of translocation. These deterministic models are most similar to Model 11.Finally, Mejia et al. 2015 [43] used a model that is quite different to all the above models, as it does not explicitly treat translocation. Instead elongation is modelled with a two step kinetic scheme, the first step involving NTP binding and conformational change, and the second step involving nucleotide incorporation and product release. This model is most similar to a special case of Model 5 where ΔG becomes extremely negative, driving the polymerase into the posttranslocated position.
Results and discussion
Model selection with MCMC-ABC
Our aim was to 1) use Bayesian inference to select the best of 12 transcription elongation models for each RNA polymerase; and 2) estimate the parameters for those models appearing in the 95% credible set of the posterior distribution. Selecting prior distributions behind each parameter is a critical process in Bayesian inference. A prior distribution should reflect what is known about the parameter before observing the new data. We have explicated our prior assumptions, with justifications, in Table 1.
Table 1
Prior distributions used during Bayesian inference.
Parameter
Prior distribution(s)
Justification of prior distribution(s)
Model M
P(Mi) = 2/16 for i ∈ {1, 2, 4, 5}P(Mi) = 1/16 for i ∈ {3, 6, 7, 8, 9, 10, 11, 12}
Each model should each have uniformly distributed values. Models with translocation at equilibrium have double the prior probability since these models do not use δ1.
kcat (s−1)
Lognormal(μ = 3.454, σ = 0.587) for RNAP/pol IILognormal(μ = 4.585, σ = 0.457) for T7 pol
kcat and elongation velocity estimates for E. coli RNAP and S. cerevisiae pol II range from 18 to 50 s−1 for optical trapping experiments [6–8, 21, 43], but as much as 100 bp/s in vivo [45–48]. Distribution selected such that (10, 100) is central 95% interval. For T7 pol kcat and elongation velocity estimates range from 43–240 bp/s [9, 49–51]. Distribution selected such that (40, 240) is central 95% interval.
KD (μM)
Lognormal(μ = 1.844, σ = 1.762)
Estimates for KD under binding equilibrium models range from 20-140 μM [6, 20, 37, 41, 52]. In models where binding is kinetic and slow, KD≡krelkbind could be much lower (S4 Appendix). To accommodate for both binding models, the prior distribution was selected such that the central 95% interval is (0.2, 200).
kbind (μM−1s−1)
Lognormal(μ = −1.498, Sσ = 1.585)
Central 95% interval set so that NTP binding is a slow kinetic step (S4 Appendix). Centered around (0.01, 5).
ΔGτ1 (kBT)
Normal(μ = 0, σ = 1.55) for RNAP/pol IINormal(μ = −3.3, σ = 1.55) for T7 pol
For RNAP and pol II, centered around 0 with a standard deviation comparable to the free energy of a single nucleotide basepair doublet, and such that the 95% central interval is (-4, 4). For T7 pol ΔGτ1 has been estimated as -4.3 [37] and -4.87 kBT [35]. However these estimates are likely resulting partially from dangling ends. Thus, we subtracted the mean dangling end contribution of ∼ -1 kBT [33] and centered the prior around this interval with a standard deviation the same as above.
ΔGτ‡ (kBT)
Normal(μ = 5.5, σ = 0.97) for RNAP/pol IINormal(μ = 2.5, σ = 1.36) for T7 pol
Central 95% interval set so that translocation is a slow kinetic step (S4 Appendix). Selected so that 99% central interval is (3, 8) for RNAP and pol II, and (-1, 6) for T7 pol.
δ1 (Å)
Uniform(l = 0, u = 3.4)
Uniformly distributed across all possible values.
Prior distributions behind all estimated parameters and the model indicator. Unless specified otherwise, the prior distribution is used for all three RNA polymerases. Lognormal priors (parameterised in log space) are used for rates and equilibrium constants while normal priors are used for Gibbs energy terms. To maintain statistical integrity of the Bayesian analysis, prior distributions were not derived from the data presented by Abbondanzieri et al. 2005 [4] for RNAP, by Schweikhard et al. 2014 [26] for pol II, or by Thomen et al. 2008 [27] for T7 pol.
Prior distributions behind all estimated parameters and the model indicator. Unless specified otherwise, the prior distribution is used for all three RNA polymerases. Lognormal priors (parameterised in log space) are used for rates and equilibrium constants while normal priors are used for Gibbs energy terms. To maintain statistical integrity of the Bayesian analysis, prior distributions were not derived from the data presented by Abbondanzieri et al. 2005 [4] for RNAP, by Schweikhard et al. 2014 [26] for pol II, or by Thomen et al. 2008 [27] for T7 pol.We performed MCMC-ABC experiments which estimated the parameters and model indicator M for . Models which appear more often in this posterior distribution are better choices, given the data. The model indicator is a discrete variable which can take 12 values, and is treated identically to the 6 continuous parameters in the Bayesian framework.The datasets we fit our models to are all from the single-molecule literature and are presented in: Figures 5a and 5b of Abbondanzieri et al. 2005 [4] for E. coli RNAP, Figure 2a of Schweikhard et al. 2014 [26] for S. cerevisiae pol II, and Table 2 of Thomen et al. 2008 [27] for T7 pol. To computationally replicate these experiments as faithfully as we could with the available information and computational limitations, simulations in this study were run on the 4 kb E. coli
rpoB gene for RNAP (GenBank: EU274658), the first 4.75 kb of the humanrpb1 gene for pol II (NCBI: NG_027747) the first 10 kb of the Enterobacteria phage λ genome for T7 pol (NCBI: NC_001416). The mean velocities from 32 (for RNAP), 10 (for pol II) and 3 (for T7 pol) simulations of the full respective sequences were used to estimate the mean elongation velocity during MCMC-ABC, given F and [NTP].For further information about the MCMC-ABC algorithm [25, 44], or the model indicator M, see S3 Appendix.
The posterior distributions
The posterior distributions from our MCMC-ABC experiments are presented in Table 2, Figs 5 and 6.
Table 2
Summary of MCMC-ABC experiments.
Enzyme
E. coli RNAP
S. cerevisiae pol II
Bacteriophage T7 pol
ϵ
2.39
0.705
4.63
Number of chains
70
10
26
Combined chain length
3.5 × 107
6.2 × 107
1.2 × 108
i
11
12
11
12
5
Model
Description
Binding equilibrium, Translocation kinetic
Binding kinetic, Translocation kinetic
Binding equilibrium, Translocation kinetic
Binding kinetic, Translocation kinetic
Binding kinetic, Translocation equilibrium
ESS / R^
k^cat
257 / 1.04
1441 / 1.03
549 / 1.02
1203 / 1.01
2110 / 1.00
krelkbind^
328 / 1.01
101 / 1.01
536 / 1.01
133 / 1.05
106 / 1.00
k^bind
−
705 / 1.09
−
516 / 1.02
154 / 1.00
ΔG^τ1
466 / 1.02
1844 / 1.00
1145 / 1.00
2769 / 1.00
1626 / 1.02
δ^1
300 / 1.04
2290 / 1.03
658 / 1.01
1469 / 1.00
−
ΔG^τ‡
340 / 1.02
1680 / 1.03
589 / 1.02
1179 / 1.00
−
Posterior
P(Mi|D)
0.81
0.19
0.29
0.71
0.96
Each column summarises the posterior distribution for the respective RNA polymerase, which arises from multiple independent MCMC chains. The combined chain length refers to the total number of states sampled, a small fraction of which are used to estimate the 6 continuous model parameters and the model indicator M. Approximate Bayesian computation threshold ϵ is shown for each enzyme; state Θ is accepted into the posterior distribution only if X2(Θ) ≤ ϵ (S3 Appendix). Models which appear in an RNA polymerase’s 95% credible set and their posterior probabilities P(M|D) are shown. The effective sample size (ESS, calculated with Tracer 1.6 [53]) and R-hat ( [54–56]) of each parameter, conditional on M, are displayed. A large ESS (> 100) and a small (< 1.1) imply that the MCMC experiment has converged. Where a parameter is not incorporated in the kinetic model, a ‘−’ is left in its place.
Fig 5
Posterior and prior distribution plots.
Posterior distributions for all models which appear in the 95% credible set are displayed (two models for RNAP, two models for pol II, and one model for T7 pol). Plots show the prior probability density P(θ) of each parameter and posterior probability density of each parameter conditional on the model P(θ|D, M). The geometric median point-estimates and highest posterior density (HPD) intervals (calculated with Tracer 1.6 [53]) are displayed above each plot (3 sf).
Fig 6
Posterior distributions of simulated velocities.
Black open circles represent experimentally measured mean velocities reported in the original publication for (A) RNAP, (B) pol II, and (C) T7 pol [4, 26, 27]. Each coloured dot represents a single sample simulated from the posterior distribution of parameters/models for the respective polymerase. 30 samples were generated from each of the three posterior distributions. For RNAP, [NTP] is defined as [ATP] = 5 μM, [CTP] = 2.5 μM, [GTP] = 10 μM, and [UTP] = 10 μM.
Each column summarises the posterior distribution for the respective RNA polymerase, which arises from multiple independent MCMC chains. The combined chain length refers to the total number of states sampled, a small fraction of which are used to estimate the 6 continuous model parameters and the model indicator M. Approximate Bayesian computation threshold ϵ is shown for each enzyme; state Θ is accepted into the posterior distribution only if X2(Θ) ≤ ϵ (S3 Appendix). Models which appear in an RNA polymerase’s 95% credible set and their posterior probabilities P(M|D) are shown. The effective sample size (ESS, calculated with Tracer 1.6 [53]) and R-hat ( [54-56]) of each parameter, conditional on M, are displayed. A large ESS (> 100) and a small (< 1.1) imply that the MCMC experiment has converged. Where a parameter is not incorporated in the kinetic model, a ‘−’ is left in its place.
Posterior and prior distribution plots.
Posterior distributions for all models which appear in the 95% credible set are displayed (two models for RNAP, two models for pol II, and one model for T7 pol). Plots show the prior probability density P(θ) of each parameter and posterior probability density of each parameter conditional on the model P(θ|D, M). The geometric median point-estimates and highest posterior density (HPD) intervals (calculated with Tracer 1.6 [53]) are displayed above each plot (3 sf).
Posterior distributions of simulated velocities.
Black open circles represent experimentally measured mean velocities reported in the original publication for (A) RNAP, (B) pol II, and (C) T7 pol [4, 26, 27]. Each coloured dot represents a single sample simulated from the posterior distribution of parameters/models for the respective polymerase. 30 samples were generated from each of the three posterior distributions. For RNAP, [NTP] is defined as [ATP] = 5 μM, [CTP] = 2.5 μM, [GTP] = 10 μM, and [UTP] = 10 μM.A large effective sample size (> 100 [53]) and a small (< 1.1, as defined by Gelman et al. 1992 [54-56]) are essential for making reliable parameter estimates. Table 2 suggests that the parameters in the 95% credible set of models are sufficiently estimated by these criteria.These results indicate that the best models for the datasets examined are Models 11 and 12 for both RNAP and pol II, and Model 5 for T7 pol (Fig 4B).For pol II, Model 12 has the highest posterior probability P(M12|D) = 0.71. This is the most complex model considered, with 6 estimated parameters. In Model 12 translocation, NTP binding and catalysis are all kinetic processes; the displacement required to facilitate formation of the translocation transition state, δ1 < δ, is estimated (); and the standard Gibbs energy of the posttranslocated state is influenced by parameter ΔG ≠ 0.The posterior distribution for RNAP consists of the same set models as that of pol II. For RNAP, Model 11 has the highest probability P(M11|D) = 0.81. This model is a submodel of Model 12 with one fewer parameter: in Model 11 NTP binding is treated as an equilibrium process while in Model 12 it is not.The only model in the 95% credible set for T7 pol is Model 5 P(M5|D) = 0.96. In Model 5 (4 parameters) translocation, but not binding, is treated as an equilibrium process, and ΔG is estimated. This positions T7 pol in a quite different area of the model space to the other two polymerases.
Translocation rates differ among RNA polymerases
For RNAP and pol II, we estimate that a partial equilibrium approximation for the translocation step is inadequate. The posterior probability that such models are inadequate is 1.00 (see Table 2). For T7 pol, however, translocation is significantly faster than catalysis and is best modelled with a partial equilibrium approximation. Using estimates for and ΔG under the maximum posterior models (Model 11 for RNAP and Model 12 for pol II) we estimate the mean forward and backward translocation rates averaged across the rpoB sequence as: 230 s−1 and 112 s−1 for RNAP, and 350 s−1 and 12.7 s−1 for pol II, respectively (3 sf). These estimates are within one order of magnitude of the respective estimate for the rate of catalysis (Fig 5) suggesting that translocation and catalysis indeed occur on similar timescales.For RNAP and pol II, translocation has frequently been modelled as an equilibrium process [4, 21, 26, 41, 43], however in some recent analyses this assumption has been rejected [16, 17, 42, 57, 58]. Our Bayesian analysis supports this. In contrast, there is general agreement that translocation in T7 pol is adequately modelled as an equilibrium process [27, 59, 60].
The data does not determine the kinetics of the NTP binding step
It remains unclear how to best model the NTP binding step. Models that describe NTP binding as a kinetic process have posterior probabilities of 0.19 for RNAP, 0.71 for pol II and 0.96 for T7 pol (Table 2). However, in an earlier experiment, where we used different a prior distribution for , the latter probability was 0.21 and P(M4|D) was 0.79. The intermediate magnitude of these posterior probabilities, and sensitivity to the choice of prior, imply that the data contains very little information about which binding model is preferred.Furthermore, and k (Models 5 and 12) are unable to be estimated simultaneously. For pol II and for T7 pol, k is estimated at around 0.48 and 1.4 μM−1 s−1 respectively with fairly narrow 95% highest posterior density (HPD) intervals (Fig 5). However, the HPD interval of spans three orders of magnitude and the value of this parameter was therefore poorly informed by the data. For RNAP, in contrast, neither k nor were well-informed by the data and both have HPD intervals spanning 1-2 orders of magnitude. This non-identifiability—where two or more parameters are unable to be estimated simultaneously (S4 Appendix)—highlights the appeal of an NTP binding equilibrium model where only one parameter needs to be estimated, despite the unrealistic assumptions it may invoke. In the case of each enzyme, the data has taught us nothing about one or two of the binding parameters.The pause-free mean velocities measured during transcription elongation follow Michaelis-Menten kinetics even though the reaction cycle is more complicated than that of a simple enzyme [61]. As such, the inability to resolve the timescale of the substrate binding step is unsurprising [62-64].In the transcription literature, NTP binding is almost always assumed to achieve equilibrium for RNAP, pol II, and T7 pol [4, 16, 17, 21, 26, 27, 37, 41, 42, 60]. However Mejia et al. 2015 [43] have shown that NTP binding is indeed rate-limiting, and that mutations in the RNAP trigger loop impair the binding rate thus suggesting that the trigger loop is coupled with NTP binding.
RNAP has an energetic preference for the posttranslocated state
In previous stochastic sequence-dependent models [16, 21] the standard Gibbs energies of the pre and posttranslocated states have been based solely on the nucleic acid basepairing energies. Our models include an additional term, ΔG, to account for potential interactions between the protein and the nucleic acid. The marginal posterior probability of a model in which an additional term ΔG is required is 1.00 in all three polymerases. In each case ΔG was estimated to be less than 0 kT and 0 kT is not included in the 95% HPD interval (Fig 5). We find that is the most significant in pol II and T7 pol: −4.6 kT and −4.0 kT respectively, while
kT for RNAP (2 sf).These results suggest that structural elements within RNA polymerases can energetically favour posttranslocated states over pretranslocated states. We note that the sequence-dependent contribution of the dangling end of the DNA/RNA hybrid is included in the thermodynamic model. The energetic bias for the posttranslocated state is separable from this effect.To facilitate comparison with previous deterministic models, using our estimates of ΔG we calculated the equilibrium constant between the pre and posttranslocated states. Geometrically averaged across the rpoB gene, these areThus, for all three polymerases, K < 1, indicating that the small energetic preference that the protein has for the posttranslocated state is sufficient to override the loss of basepairing energy, thereby biasing the system towards population of the posttranslocated positions. This is in agreement with estimates made for pol II and T7 pol [26, 27, 35, 36, 41] and Kireeva et al. 2018 [58] for RNAP: “forward translocation occurs in milliseconds and is poorly reversible”. However these estimates are inconsistent with some RNAP and pol II studies which place this ratio above 1 [4, 17, 42, 52].Kinetic modelling can itself suggest no physical mechanism for the stabilisation. Yu et al. 2012 [36] have identified a conserved tyrosine residue near the active site of T7 pol that pushes against the 3′ end of the mRNA, and thus stabilises the posttranslocated state. They propose a similar mechanism for the multi-subunit RNA polymerases.
δ1 may be an important parameter but its physical meaning is unclear
Our results suggest that δ1, the distance that RNA polymerase must translocate forward by to reach the translocation transition state, is a necessary parameter to estimate for RNAP and pol II. Setting δ1 = δ/2 is not sufficient. The marginal posterior probability of models which estimate this term is 1.00. δ1 is irrelevant to the modeling of the T7 pol data because the best models invoke a partial equilibrium approximation for the translocation step.While our prior distribution restricted δ1 to lie in the range (0, δ), the upper end our 95% HPD intervals of δ1 for RNAP and pol II are very close to δ = 3.4 Å. If it was not for this prior distribution, δ1 estimates would have included values higher than δ. Similar results have been observed by Maoiléidigh et al. 2011 [17] for RNAP.Our interpretation of δ1 implies it should never be greater than δ nor should δ be more than the width of one basepair. The physical meaning of δ1 with values greater than δ is thus unclear. It is noted that δ1 is only used when F ≠ 0.
Comparing the kinetics of RNA polymerases
The in vivo rate of transcription elongation varies considerably across RNAP, pol II and T7 pol. The prokaryotic and eukaryotic RNA polymerases have a mean rate ranging from 20-120 bp/s [45, 46, 48, 49, 65–67], which may be slowed down in histone-wrapped regions of eukaryotic genomes [7]. In contrast, Bacteriophage T7 pol operates up to an order of magnitude faster (around 200-240 bp/s [49, 68]) and is known to be quite insensitive to transcriptional pause sites [9, 27].In additional to these differences, we have shown that translocation is very rapid in T7 pol, relative to the rate of NTP incorporation, while the disparity is much less significant in RNAP and pol II. Furthermore, the model does not fit the data for T7 pol as closely it does for RNAP and pol II (Fig 6). T7 pol therefore seems to operate under quite a different kinetic scheme than that of the cellular polymerases, which is not unexpected given their distant evolutionary relationship [3].In general, the elongation velocity of RNA polymerase is significantly slower in an optical trap (with estimates ranging from 9.7-22 bp/s for RNAP [11–13, 43, 69]) compared with that of the untethered enzyme (with estimates in vitro or in vivo ranging from 25-118 bp/s for RNAP [45, 49, 70, 71]). This relationship holds for multiple RNA polymerases including E. coli RNAP, S. cerevisiae pol II [41, 42, 52, 72], Bacteriophage T7 pol [9, 27, 49, 51], and Bacteriophage Φ6 P2 [10, 73]. This suggests that optical trapping perturbs the system to a significant extent. Additionally, varying degrees of heterogeneity in elongation rate have been observed across different polymerase complexes even under the same conditions [11, 13, 27].The velocity perturbations resulting from the optical trapping apparatus will be propagated into the model parameters, especially k, and , and some caution is needed when extrapolating these results to untethered systems.
Bayesian inference of transcription elongation
To our knowledge we are the first to perform Bayesian inference on single-molecule models of transcription elongation. This was achieved by simulation which necessitated the use of approximate Bayesian computation. An alternative would be to build and use a likelihood function (ie. the probability of taking exactly t units of time for RNA polymerase to copy the sequence n times). The latter approach can be achieved using chemical master equations, as opposed to (Gillespie) sampling from the distribution. Finding analytical, stable numerical, or approximate solutions to the chemical master equations could provide a similar insight in less computational time, however is susceptible to a multitude of analytical and numerical issues associated with the exponentiation of an arbitrary transition rate matrix that grows with the length of the sequence (S2 Appendix) [74]. This problem would be amplified by the introduction of backtracking, hypertranslocation, or NTP misincorporation reactions into the model, for instance. The Bayesian framework we have presented, although computationally intensive due to its simulation requirement, is general and will work on any model of transcription without the need to resolve these issues. The path has been paved for modelling transcriptional pausing, for instance [16, 21, 75]. Nevertheless, likelihood-based Bayesian inference is an approach that should be explored in the future.We have demonstrated that single-molecule data can be usefully analysed using a Bayesian inference and model selection framework. This analysis would have even greater statistical power if applied to the progression of individual RNA polymerase complexes instead of mean velocities averaged across multiple experiments.
Conclusion
In this article we evaluated some simple Brownian ratchet models of transcription elongation (Fig 2). By varying the parameterisation of the translocation step (Fig 3) and incorporating partial equilibrium approximations commonly invoked in the literature (Fig 4A) we enumerated a total of 12 related models (Fig 4B). Using stochastic simulations and approximate Bayesian computation, we then assessed which of these models were capable of describing the force-velocity data previously measured for several RNA polymerases (Table 2 and Fig 5) using single-molecule optical trapping experiments [4, 26, 27].Our analysis suggests that 1) different partial equilibrium approximations of the translocation step are appropriate for the multisubunit RNA polymerases versus the single subunit T7 RNA polymerase. 2) Treatment of the NTP binding step remains a point of ambiguity. The existing data does not place strong constraints on the modelling of this step. 3) There is an energetic bias for posttranslocated state. 4) The model of the force-dependent translocation, which invokes transition state theory, is not physically realistic.
Stochastic simulation.
The Gillespie algorithm is described.(PDF)Click here for additional data file.
Chemical master equations.
Master equations for the four equilibrium model variants are presented.(PDF)Click here for additional data file.
MCMC-ABC.
A description of the MCMC-ABC algorithm and how it is used to infer parameters and models from experimental data.(PDF)Click here for additional data file.
Prior distributions.
Simulation-based justifications behind prior distributions for , k, and .(PDF)Click here for additional data file.
Simulations of the elongation pathway.
Each point is a single simulation of the full rpoB gene (4029 nt). For (A-C), Parameters on the x- and z-axis are sampled uniformly at random from the displayed range at the beginning of each trial. The y-axis of each plot (mean elongation velocity) is then measured from the respective simulation. [NTP] and F held constant at 1000 μM and 0 pN respectively. (A) and (B): Relationship between and k for the melting model with binding at equilibrium (Model 8). ΔG set to its prior mean (0 for RNAP and pol II, and -3.3 for T7 pol). (C) Relationship between k and k for the kinetic binding model with translocation at equilibrium (Model 2). (D) Relationship between K and k with translocation held at equilibrium (Model 2). K and k sampled uniformly from specified range and velocity is measured. Samples with simulated velocities outside of the range 1-2 bp/s were discarded. [NTP] = 10 μM and k = 100 s−1.(PDF)Click here for additional data file.
Authors: Karen Adelman; Arthur La Porta; Thomas J Santangelo; John T Lis; Jeffrey W Roberts; Michelle D Wang Journal: Proc Natl Acad Sci U S A Date: 2002-10-07 Impact factor: 11.205
Authors: Elio A Abbondanzieri; William J Greenleaf; Joshua W Shaevitz; Robert Landick; Steven M Block Journal: Nature Date: 2005-11-13 Impact factor: 49.962