Literature DB >> 20529917

Thermodynamics of RNA structures by Wang-Landau sampling.

Abstract

MOTIVATION: Thermodynamics-based dynamic programming RNA secondary structure algorithms have been of immense importance in molecular biology, where applications range from the detection of novel selenoproteins using expressed sequence tag (EST) data, to the determination of microRNA genes and their targets. Dynamic programming algorithms have been developed to compute the minimum free energy secondary structure and partition function of a given RNA sequence, the minimum free-energy and partition function for the hybridization of two RNA molecules, etc. However, the applicability of dynamic programming methods depends on disallowing certain types of interactions (pseudoknots, zig-zags, etc.), as their inclusion renders structure prediction an nondeterministic polynomial time (NP)-complete problem. Nevertheless, such interactions have been observed in X-ray structures.
RESULTS: A non-Boltzmannian Monte Carlo algorithm was designed by Wang and Landau to estimate the density of states for complex systems, such as the Ising model, that exhibit a phase transition. In this article, we apply the Wang-Landau (WL) method to compute the density of states for secondary structures of a given RNA sequence, and for hybridizations of two RNA sequences. Our method is shown to be much faster than existent software, such as RNAsubopt. From density of states, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. The advantage of the WL method is that by adding a function to evaluate the free energy of arbitrary pseudoknotted structures and of arbitrary hybridizations, we can estimate thermodynamic parameters for situations known to be NP-complete. This extension to pseudoknots will be made in the sequel to this article; in contrast, the current article describes the WL algorithm applied to pseudoknot-free secondary structures and hybridizations. AVAILABILITY: The WL RNA hybridization web server is under construction at http://bioinformatics.bc.edu/clotelab/.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
RNA

Year: 2010 PMID： 20529917 PMCID： PMC2881402 DOI： 10.1093/bioinformatics/btq218

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

RNA is an important biomolecule, now known to play both an information carrying role, as well as a catalytic role. Indeed, the genomic information of retroviruses, such as the hepatitis C and human immunodeficiency viruses, is encoded by RNA rather than DNA, while the peptidyl transferase reaction, arguably the most important enzymatic reaction responsible for life, is catalyzed not by a protein, but rather by RNA (Weinger et al., 2004). It has recently emerged that RNA plays a wide range of previously unsuspected roles in many biological processes, including retranslation of the genetic code [selenocysteine insertion (Böck et al., 1991), ribosomal frameshift (Bekaert et al., 2003)], transcriptional and translational gene regulation (Lim et al., 2003; Mandal et al., 2003), temperature-sensitive conformational switches (Chowdhury et al., 2003; Tucker and Breaker, 2005), chemical modification of specific nucleotides in the ribosome (Omer et al., 2000), regulation of alternative splicing (Cheah et al., 2007), etc. A secondary structure for a given RNA nucleotide sequence a1,…, a is a set S of base pairs (i, j), such that a, a forms either a Watson–Crick or GU (wobble) base pair, and such that there are no base triples or pseudoknots in S. For example, the secondary structure of Y RNA with EMBL access code AAPY01489510/220-119 is displayed in Figure 1a and b, while Figure 1c and d depicts the pseudoknotted structure of the Gag/pro ribosomal frameshift site of mouse mammary tumor virus (Van Batenburg et al., 2001). In conventional dot-bracket notation, this latter structure is given as follows, where it should be noted that two kinds of bracket are needed due to the pseudoknot

Fig. 1.

(a and b) Pseudoknot-free secondary structure of Y RNA with EMBL access code AAPY01489510/220-119, depicted in (a) in Feynman circular form, and in panel (b) in classical form. (c and d) Pseudoknotted structure for the Gag/pro ribosomal frameshift site of mouse mammary tumor virus, depicted in (c) in Feynman circular form, and in (d) in classical form. Images produced with sofware jViz (Wiese et al., 2005) from structures taken, respectively, from Rfam (Griffiths-Jones et al., 2003) and Pseudobase (Van Batenburg et al., 2001). It is computationally intractable to compute the minimum free-energy tertiary structure of RNA; indeed, determining the optimal pseudoknotted structure is nondeterministic polynomial time (NP)-complete Lyngso and Pedersen (2000). In contrast, by disallowing pseudoknots, secondary structure prediction is algorithmically tractable; there are dynamic programming algorithms to compute the minimum free-energy structure for a single RNA molecule, as well as for the hybridization of two or more RNA molecules. In particular, such methods can be loosely grouped into two types of algorithm—those that use (i) a stochastic context free grammar to compute a covariation model and (ii) free-energy parameters obtained from UV absorbance (optical melting) experiments, in order to determine the minimum free energy structure (i.e. thermodynamic-based algorithms). Examples of stochastic context-free grammars are the programs Infernal (Nawrocki et al., 2009) and Pfold (Knudsen and Hein et al., 2003). Examples of thermodynamics-based algorithms are the programs mfold (Zuker and Stiegler, 1981), UNAFOLD (Markham and Zuker, 2008), RNAfold (Hofacker et al., 1994), RNAstructure (Mathews et al., 2004). Thermodynamics-based algorithms for hybridization of two structures are given in UNAFOLD (Dimitrov and Zuker, 2004), RNAcofold (Bernhart et al., 2006; Mückstein et al., 2006), while the NUPACK software considers hybridization of three or more RNA molecules. (Dirks et al., 2007). Such thermodynamics-based algorithms are particularly important, since the tertiary structure of RNA is believed to be largely determined by secondary structure, which acts as a scaffold for tertiary contacts; see Banerjee et al. (1993) for experimental data supporting this view. Computing the minimum free-energy pseudoknotted structure for a given RNA sequence is NP-complete Lyngso and Pedersen (2000) for the Turner nearest neighbor energy model. For that reason, pseudoknot structure prediction algorithms fall into three categories: (i) exponential time exact algorithms, (ii) dynamic programming algorithms that restrict pseudoknots to a particular class and (iii) heuristic methods. Examples of exact algorithms for pseudoknot structure prediction are the branch-and-bound algorithm of (Bon, 2009) and the method using tree-width decomposition of Zhao et al. (2008). Examples of algorithms that consider only pseudoknots of a particular class are found in the pioneering work of Rivas and Eddy (1999) and Lefebvre (1995), with subsequent refinements in Dirks and Pierce (2003); Reeder and Giegerich (2004) and Ren et al. (2005) Examples of heuristic approaches include Monte Carlo methods Metzler and Nebel (2008), genetic algorithms Abrahams et al. (1990) and a simple, yet elegant algorithm called ProbKnot (D.H. Mathews, to appear) that appears to be the state-of-the art method according to recent benchmarking studies. Finally, it is beyond the scope of this article to provide additional background on algorithms for RNA structural alignment, motif detection or tertiary structure prediction. As will be shown later, by Wang-Landau (WL) Monte Carlo methods, we can obtain essentially the same results as by dynamic programming computation of the partition function from UNAFOLD and RNAcofold; however, the advantage of the WL approach is that by extending the energy evaluation function for a given structure or hybridization, we can estimate the partition function for arbitrary pseudoknotted structures, known to be an NP-complete problem. Before proceeding, we formally define a secondary structure as follows. Given an RNA sequence s = a1,…, a, a secondary structure S on s is defined to be a set of ordered pairs corresponding to base pair positions, which satisfies the following requirements. For steric reasons, following convention, the threshold θ, or minimum number of unpaired bases in a hairpin loop, is taken to be three. For any additional background on RNA and dynamic programming computation of secondary structures, see Clote and Backofen (2000) and the recent review Eddy (2004). Watson–Crick or GU wobble pairs: if (i, j) belongs to S, then pair (a, a) must be one of the following canonical base pairs: (A, U), (U, A), (G, C), (C, G), (G, U) and (U, G). Threshold requirement: if (i, j) belongs to S, then j − i > θ. Non-existence of pseudoknots: if (i, j) and (k, ℓ) belong to S, then it is not the case that i < k < j < ℓ. No base triples: if (i, j) and (i, k) belong to S, then j = k; if (i, j) and (k, j) belong to S, then i = k.

2 APPROACH

The non-Boltzmannian WL Monte Carlo algorithm was developed by Wang and Landau, 2001a, b) to estimate the density of states and partition function for complex systems, such as the Ising model, that exhibit a phase transition. While the Metropolis-Hastings Monte Carlo algorithm samples low energy states, the WL algorithm is designed to visit states uniformly across all energies in a discrete energy landscape. Indeed, for the Metropolis–Hastings algorithm, the expected frequency, or stationary probability, p*(x) of visiting the state x, whose energy is E, is given by the uniform probability times the Boltzmann probability , where g(E) is the number of states having energy E and the partition function Z = ∑e−; in contrast, for the WL algorithm, the expected frequency or stationary probability, of visiting state x is , where ℰ is the total number of distinct energies E (in the discrete case), or of energy bins (in the continuous case). It follows that non-Boltzmannian sampling strategies, such as that devised by Wang and Landau (2001a, b), Kou and Wong Kou et al. (2006a), etc. are potentially useful in biopolymer folding, where one searches for a global energy minimum in a landscape having many local energy minima. Indeed in Chen and Xu (2006), Chen and Xu applied the WL algorithm for the structure prediction of helical transmembrane proteins, while the equi-energy sampling method of Kou and Wong Kou et al. (2006a), related to Monte Carlo with replica exchange, has been applied to estimate the density of states for lattice protein folding under the hydrophobic–hydrophilic (HP) energy model Kou et al. (2006b), as well as in protein structure prediction by the fragment assembly Zhang et al. (2009). In this article, we apply the WL algorithm to compute the density of states and partition function for RNA secondary structure as well as for the hybridization of two RNA sequences. We begin by validating and benchmarking the WL method against the exhaustive method RNAsubopt Wuchty et al. (1999), that enumerates all secondary structures of a given RNA sequence. Next, we compute the partition function over all secondary structures and over all pseudoknot-free hybridizations. We describe as well how to compute the partition function Z(T) over all temperatures from 0°C to 100°C by performing two WL computations, followed by convolution calculations. Although the computation of the partition function over all secondary structures and over all pseudoknot-free hybridizations can be done using the existent software RNAfold (Hofacker, 2003), respectively, RNAcofold (Bernhart et al., 2006), UNAFold (Markham and Zuker, 2008) and a recently published method of Chitsaz et al., the real advantage of our method is that by adding a function to evaluate arbitary pseudoknotted structures and arbitrary hybridizations, we can approximately compute the partition function, heat capacity, melting temperature, etc. for a context known to be NP-complete Lyngso and Pedersen (2000). The density of states is defined to be the absolute frequency function for energy; i.e. density of states g(e) counts the number of states having energy e. In the context of RNA secondary structure, a state is a secondary structure for an arbitrary but fixed RNA sequence s. In Cupal et al. (1996), described the first efficient algorithm, running in O(m2n3) time, to compute the density of states for an RNA sequence of length n, where energy is discretized into m bins. The program of Cupal et al. (1996) is no longer available, since it has been superceded by the program RNAsubopt, developed by Wuchty et al. (1999), which enumerates all secondary structures, whose free energy is within a user-defined bound above the minimum free energy. Though not documented, the RNAsubopt program additionally admits the option -D, which, instead of outputting structures, outputs only the number of secondary structures in each energy bin above the minimum free energy (bin size 0.1 kcal/mol).

3 METHODS

Monte Carlo algorithms have been implemented by a number of groups, to study RNA kinetics of folding. In particular, KinFold, developed by Flamm et al. (2000), computes the mean first passage time (MFPT) of folding, by using a variant of the Gillespie algorithm in an event-driven simulation with a choice of Metropolis–Hastings and Kawasaki dynamics. In Isambert and Siggia (2000) and Xayaphoummine et al. (2005) a similar time-driven Monte Carlo simulation program, KineFold, is described to compute kinetically determine pseudoknotted structure for a given RNA sequence. Danilova et al. (2006) describe the RNAkinetics web server used to study the kinetics of the folding transitions of a growing RNA molecule, as in the case of transcriptional folding. We now begin by providing background definitions and describing the WL algorithm.

3.1 WL

The WL algorithm, (Wang and Landau, 2001a, b) was designed in order to compute the density of states and partition function, neither of which can be computed directly by classical Monte Carlo methods, such as the Metropolis–Hastings algorithm, simulated annealing, replica exchange, etc. Recall the definition of Markov chain. Let Q = {1,…, n} be a finite set of states, let π = (p1,…, p) be the distribution for initial state, and let P = (p) be a matrix of transition probabilities, satisfying ∑p = 1 for all i. A (first-order, time-homogeneous) Markov chain M = (Q, π, P) is a stochastic process, whose state q at time t is a random variable determined by Define p(t) = Pr[q = i] and p( = Pr[q = j | q0 = i]. Clearly, the (i, j)-th entry of the t-th power P of P equals p(; moreover, by time-homogeneity it follows that p( = Pr[q = j|q = i], for all t0. The stationary probability of state i is defined by limp(t) = p*, provided the limit exists. It is a classical result that every finite, aperiodic, irreducible Markov chain has an equilibrium distribution of stationary probabilities; see the text of Clote and Backofen (2000) for a new, self-contained proof of this result. A Markov chain with state set Q and stationary probabilities p*1,…, p* is reversible, if for all i, j ∈ Q, p*p = p*p. Figure 2 presents pseudocode for the classical Metropolis–Hastings Monte Carlo algorithm with simulated annealing (Kirkpatrick et al., 1983; Metropolis et al., 1953), which implements a random walk on the Markov chain whose transition probabilities p of moving from state x to x is given by where 𝒩(x) is the set of immediate neighbors of state x and 𝒩(x) the set of immediate neighbors of state x; i.e. 𝒩(x) is the set of states that can be reached by a single move from state x. It can be proved that the stationary probabilities for this Markov chain are given by the Boltzmann probabilities , as shown in Clote and Backofen (2000).

Fig. 2.

Pseudocode for Metropolis–Hastings algorithm with simulated annealing (Kirkpatrick et al., 1983).

Pseudocode for Metropolis–Hastings algorithm with simulated annealing (Kirkpatrick et al., 1983). In contrast, Figure 3 presents pseudocode for the WL algorithm, which implements a random walk on the Markov chain whose transition probabilities p of moving from state x to x are given by In this case, the stationary probability of state x is are given by .

Fig. 3.

Pseudocode for WL algorithm, as applied to RNA secondary structure density of states computation. In line 8, 𝒩(S) denotes the collection of immediate neighbors of structure S; i.e. those obtained by adding or removing a single base pair. In line 16, d.o.s. abbreviates density of states. The mathematical–justification for applying the Metropolis-Hastings Monte Carlo method (Metropolis et al., 1953) to determine the minimum energy conformation of a biopolymer (Bradley et al., 2005; Das and Baker, 2007; Ortiz et al., 1998) depends on two facts: (i) every finite, irreducible, aperiodic Markov chain has a stationary probability distribution and (ii) if the Markov chain is reversible, a situation called detailed balance by the physics community, then the stationary distribution of the Markov chain corresponding to the Metropolis–Hastings algorithm is the Boltzmann distribution, defined by , where E(x) is the energy of state (i.e. conformation) x, R is the universal gas constant 1.986 cal/mol, T is absolute temperature, and the partition function Z is defined by ∑ exp(−E(x)/RT, where the sum is taken over all states x in the Markov chain. As temperature T approaches zero, the Boltzmann probability of the minimum energy state approaches 1, in the case of a unique minimum energy state, or more generally 1/m, in the case of m distinct minimum energy states. See Clote and Backofen (2000) for details. In contrast to the Metropolis–Hastings algorithm, which performs a random walk on the Markov chain of states (secondary structures), the WL algorithm performs a random walk on the energy space of the Markov chain of states (secondary structures), where the stationary probability of visiting energy e is proportional to , then the histogram of energies encountered in the random walk will be flat. In this article, we consider the Markov chain, whose states are the secondary structures of a given RNA sequence, and for which permissible local moves correspond to the addition or removal of a single base pair (Flamm et al., 2000). Although detailed balance holds for the Metropolis–Hastings algorithm in Figure 2, it does not necessarily hold for the Metropolis algorithm, obtained by replacing line by Indeed for the case of RNA secondary structures, detailed balance does not hold in this situation, since if we define the stationary probability p* for state x to be the Boltzmann probability , and the transition probabilities given by Equation (1), then it is not always the case that p* · p = p* · p. For instance, the empty structure S =………. on the 10-mer GGGGGCCCCC has 18 immediate neighbors, one of which is T = (……). The structure T has 11 immediate neighbors, one of which is the empty structure S. Letting x = S and x = T, we have E(x) = 0 kcal/mol, E(x) = 2.70 kcal/mol, ensemble free energy is −RTln(Z) = −3.96, hence Z = exp(3.96/RT) where T = 310°C so Z = 621.5 and we have stationary probabilities , , and . We compute that Summarizing, in the Metropolis algorithm (with modified line 11), reversibility of a Markov chain depends on the permissible local moves, while in the Metropolis–Hastings algorithm (with line 11 as in Fig. 2), reversibility is always ensured. In the case at hand, if every secondary structure is an immediate neighbor of every secondary structure, then in the Metropolis algorithm, transition probabilities would be where 𝒩 is the number of secondary structures. In this case, an easy computation shows that the Markov chain is reversible. Despite the non-reversible nature of the Markov chain corresponding to the Metropolis algorithm, whose states are the secondary structures of a given RNA sequence, and whose local moves consist of the addition or removal of a single base pair, it has been a standard practice to apply the Metropolis algorithm in this case (Danilova et al., 2006; Flamm et al., 2000; Isambert and Siggia, 2000; Xayaphoummine et al., 2005). For that reason, we do not hesitate to apply the WL algorithm for the study of RNA secondary structure formation. Note that in Figure 3, the WL computes the relative density of states, defined by g(i) = N(e)/N, where N(e) is the number of states having energy e and N is the total number of states. In the case of RNA secondary structures, it is simple to compute the total number of secondary structures by dynamic programming, given as follows. Given an RNA sequence of length n, let BP = 1 if positions i,j can form a Watson–Crick or wobble pair, otherwise let BP = 0. Let θ = 3 denote the minimum number of unpaired bases in a hairpin loop. Letting N denote the number of secondary structures on subsequence [i, j] of the given RNA sequence, we have that N = 0 if j < i + 3, and otherwise It follows that the total number of secondary structures is then N1,. From the relative density of states computed by WL algorithm, we compute the absolute density of states by For fixed temperature T for which the WL computation was done, we can compute the partition function Z(T) = ∑ exp(−E(S)/RT) by In their original article Wang and Landau (2001a, b) mentioned that in the case of the Ising model, Equation (4) allows one to compute the partition function at any desired temperature T from the density of states. Unfortunately, this is no longer the case for the Turner nearest neighbor model Xia et al., 1999 of RNA secondary structure, since the free energy parameters for stacked base pairs, hairpins, bulges, internal loops, etc. all depend on temperature. We can nevertheless proceed by computing the density of states for free energy at T = 37°C, and the density of states for enthalpy (assumed to be temperature independent), and then by convoluting these values, we obtain the density of states for free energy at any desired temperature.

3.2 Partition function for a single RNA

Figure 4a displays the relative density of states for the free energy of secondary structures of the 45 nt flavivirus capsid hairpin (cHP) with EMBL access code AB010982/1-45. Figure 4b displays the sum of squared differences between the density of states and the best fitting normal distribution, respectively, extreme value distribution. The cHP is a conserved RNA hairpin structure in the capsid-coding region of flavivirus genomes. Note that the relative density of states, or energy histogram, is approximately normal. In Clote et al. (2009) it is rigorously proved that the relative density of states is asymptotically normal; specifically, it is shown that the limit, as n approaches infinity, of the relative density of states for an RNA sequence of length n is normal, where for the purpose of mathematical analysis it is assumed that any base can pair with any other base (homopolymer model) and that the energy of a secondary structure S is −1 times the number of base pairs in S (Nussinov energy model; Nussinov and Jacobson, 1980).

Fig. 4.

(a) Density of states for free energy of secondary structures of the 45 nt flavivirus cHP with EMBL access code AB010982/1-45 and sequence AUGAACAACC AACGAAAAAG GACGGGAAAA CCGUCUAUCA AUAUG. Overlaid on the graph is the best fitting normal distribution and the best fitting extreme value distribution. (b) Sum of squared differences between the density of states and the best fitting normal distribution, respectively extreme value distribution. The x-axis of both panels depicts free energy in kcal/mol.

3.3 Partition function of hybridization

Following the approach in program RNAcofold of Bernhart et al. (2006), we can modify the WL program of Figure 3 to compute the density of states for all hybridizations of two RNA sequences, where both intermolecular and intramolecular base-pairing is allowed, provided that there are no pseudoknots. In the case of the hybridization of twoRNA secondary structures, the first of length n and the second of length m, we can compute the total number of hybridizations as follows. Given an RNA sequence A = a1,…, a of length n and an RNA sequence B = b1,…, b of length m, let HP = 1 if positions a, b can hybridize, forming a Watson–Crick or wobble pair, otherwise let HP = 0. For 1 ≤ i, j ≤ n, 1 ≤ k, ℓ ≤ m, let H denote the number of hybridizations of the subsequence a,…, a with b,…, bℓ. From Equation (3), we can compute the number NA, respectively, NB of secondary structures on subsequence a,…,a of A, respectively, b,…, b of B. If j < i or ℓ < k, then defined H = 0; otherwise define H by It follows that the total number of pseudoknot-free hybridizations is then H1,. The previous algorithm is clearly O(n4). By considering the number of hybridizations to be the same as the number of secondary structures of a chimeric sequence, formed by concatenating A, B to form c1,…, c = a1,…, a, b1,…, b, we have an O(n3) algorithm, as follows. For 1 ≤ i, j ≤ n + m, if j < i or (1 ≤ i, j ≤ n, j − i ≤ θ = 3), then N = 0, while if 1 ≤ i ≤ n, n + 1 ≤ j ≤ n + m, then N = 1; otherwise N is equal to It follows that the total number of hybridizations is then N1,. We now describe how to compute the melting temperature T of hybridization. Compute number of structures for each of five species (temperature independent): 𝒮(A), 𝒮(B), 𝒮(AA), 𝒮(BB) and 𝒮(AB). For temperature T ∈ {0°C,…, 100°C}, compute relative density of states f(A, T), f(B, T), f(AA, T), f(BB, T) and f(AB, T) for each species by WL. For temperature T ∈ {0°C,…, 100°C}, compute partition functions Z(A, T), Z(B, T), Z(AA, T), Z(BB, T) and Z(AB, T) by where absolute density of states g(E) is relative density times number of structures. For instance Following Dimitrov and Zuker (2004), for temperature T ∈ {0°C,…, 100°C}, compute ensemble free energy ΔG(A, T), ΔG(B, T), ΔG(AA, T), ΔG(BB, T) and ΔG(AB, T). This involves the following. Redundancy correction: Symmetry correction: Temperature-dependent chemical equilibrium constants: Temperature-dependent concentration (number) of molecules A and B: where N0, N0 are given and K, K, K are obtained from the previous step. Values N and N are gotten by using, for example, Newton's method for solving two nonlinear functions; due to issues of numerical instability, Markham uses binary search (p. 43 of Markham, 2006). Letting Z(A, B, AB, AA, BB) equal the following expression: it follows that the total partition function Z satisfies which can be approximated by the term Z(A, B, AB, AA, BB) where N, N, N, N, N obtained as previously explained. The chemical potential μ for each species X is the partial derivative of ensemble free energy with respect to number of molecules of X, hence so Total free energy satisfies which simplifies to Normalize the ensemble free energy in terms of energy per mole of solute: Determine heat capacity as a function of temperature by by computing the second partial of a fitting parabola determined by 2m+1 evenly spaced points, using the approximation for given by In a post-processing step, smooth the heat capacity curve by computing a running average. The melting temperature T(C) is computed by determining the temperature at which heat capacity achieves a maximum.

4 DISCUSSION

The Figure 5a displays the run time of the WL method, compared with that of RNAsubopt from the Vienna RNA package, while the Figure 5b of the same figure shows sample output from our WL program. Figure 5 clearly shows the advantage of WL over existent methods in computing the density of states for both single RNA molecules and for hybridization complexes of two RNA molecules. Figure 6a and b depicts the heat capacity computed by the WL method (Fig. 5a) and the program UNAFold (Fig. 5b). Melting temperature, which is usually defined as the temperature at which half of the molecules are single-stranded, while the other half are hybridized, is determined as that temperature where heat capacity achieves its maximum. The program UNAFold does not allow any intramolecular structure (base pairing between 2 nt of the same structure), a feature that our WL method permits, as does the RNAcofold program. While it is clear that additional work must be done to improve heat capacity computation with the WL method, the melting temperature T computed by WL agrees reasonably well with that computed by O(n3) methods UNAFold, RNAcofold, and the recent O(n6) method of Chitsaz et al. (2009) each of which methods admits slightly different interactions.

Fig. 5.

Fig. 6.

Computation of heat capacity c(T) for the toy sequence 5′-AGCGA-3′, hybridized to its reverse complement 3′-UCGCU-5′. (a) Graph generated by WL method described in this article. (b) Graph generated by the program UNAFold (Markham and Zuker, 2008).

(a) Comparison of execution times of WL and program RNAsubopt-D (Wuchty et al., 1999), in computing density of states. Since the program of Cupal et al. (1996) is no longer publicly available, and is superceded by RNAsubopt (private correspondence from. Hofacker), we computed the execution time in seconds as a function of log n, where n is RNA sequence size. Horizontal green line is slightly above the value of exp(25) seconds, or equivalently a day. It appears that for sequences of length ≥46 nt, the WL method is more efficient than RNAsubopt. (b) Sample output of WL method on sequence CUGCUUUGAGGACAAAGAGAAUAAAGACUUCAUGUU, after 17 402 000 WL Monte Carlo steps, where the value of ε in line 4 of Figure 3 is defined to be 0.001. The leftmost column contains the energy bin, the middle column contains the relative frequency in the WL sampling run, and the rightmost column contains the lowest energy secondary structure in the associated energy bin. Though our WL program allows the user to modify bin size, the default energy bin size (here) is 0.1 kcal/mol; empty bins, where no structure has yet been sampled, are not displayed. The lowest energy structure sampled by the WL method is ((.(((((….)))))))…………….. with energy −3.3 kcal/mol, which is identical to the minimum free energy structure, as computed by RNAfold. Only a portion of the output is displayed. In particular, the largest energy of any sampled structure is 48.8 kcal/mol; in that energy bin the least energy structure is .(..(…).)((…)(…).((…)(…))). Computation of heat capacity c(T) for the toy sequence 5′-AGCGA-3′, hybridized to its reverse complement 3′-UCGCU-5′. (a) Graph generated by WL method described in this article. (b) Graph generated by the program UNAFold (Markham and Zuker, 2008). We now describe how to approximately compute the partition function Z(T) over all secondary structures and over all pseudoknot-free hybridizations, simultaneously over all temperatures from 0°C to 100°C, by performing two WL computations, followed by a computation of the convolution of enthalpy relative frequency with free-energy relative frequency. Similar computations using existent methods require over 100 cubic time computations. By this method, one can approximate the partition function Z(T) for all temperatures from 0°C to 100°C, by performing two WL sampling runs, respectively, at temperatures −373°C and 37°C, and then to repeatedly perform a fast convolution. The method just described, which involves two WL computations, together with convolution computations, has until now not worked well in practice, for certain technical reasons. This direction needs further exploration. Compute the relative density of states p for free energy using WL with temperature T = −273°C (absolute zero Kelvin). It follows that p is the relative density of states for enthalpy, Due to the fundamental thermodynamic relation where T (K) is absolute temperature and ΔG, ΔH, ΔS, respectively, denote the change in free energy, enthalpy and entropy. Compute the relative density of states p for free energy using WL with temperature T = 37°C (310 K). From Equation (6), we have that Given arbitrary absolute temperature T, compute the relative density of states for free energy at temperature T by the following pseudocode, representing a kind of convolution of p with p. Compute the absolute density of states g(z) = p(z) · N, where N is the total number of secondary structures, computed by Equation (3). Another issue concerning any sampling method is the required time to obtain reasonably good estimates of the quantity in question. In the case of RNA kinetics, computations of MFPT to reach the minimum free-energy structure take inordinate amounts of time, when using Metropolis–Hastings Monte Carlo methods, which are time-driven simulations. For this reason, the program KinFold (Flamm et al., 2001) uses an event-driven simulation, where time is incremented by an exponentially distributed random variable. It may be possible to use similar ideas to increase efficiency of our WL program, which should further improve the accuracy in the computation of heat capacity. Finally, we intend to implement a new energy evaluation function, that allows arbitrary pseudoknots, zig-zags, etc. using energy parameters from the recent dissertation of Bon (Bon, 2009). This will allow us to estimate the partition function, ensemble free energy, heat capacity, melting temperature, etc. for a context known to be NP-complete.

5 CONCLUSION

In this article, we have implemented the WL algorithm to compute the relative density of states for RNA secondary structures and hybridizations. Separately computing the number of structures and hybridizations, we obtain the absolute density of states, which then yields the partition function, and thence, in the case of hybridization, the melting temperature. The WL method is much faster than existent software RNAsubopt in computing the density of states, but could not be benchmarked with the binning method of Cupal et al. (1996) which runs in O(m2n3) time, for length n sequence and m energy bins, since the latter software is no longer available, being superceded by RNAsubopt-D. In preliminary tests, we obtain roughly the same melting temperature for duplex RNA, as that computed by existent methods; however, the real advantage of the WL method is that there is no restriction on types of allowed interaction, unlike the situation with dynamic programming approaches that disallow pseudoknots, zig-zags, etc.

54 in total

1. Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram.

Authors: F Wang; D P Landau
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2001-10-17

2. Prediction of hybridization and melting for double-stranded nucleic acids.

Authors: Roumen A Dimitrov; Michael Zuker
Journal: Biophys J Date: 2004-07 Impact factor: 4.033

3. Substrate-assisted catalysis of peptide bond formation by the ribosome.

Authors: Joshua S Weinger; K Mark Parnell; Silke Dorner; Rachel Green; Scott A Strobel
Journal: Nat Struct Mol Biol Date: 2004-10-10 Impact factor: 15.369

4. Toward high-resolution de novo structure prediction for small proteins.

Authors: Philip Bradley; Kira M S Misura; David Baker
Journal: Science Date: 2005-09-16 Impact factor: 47.728

5. Structure prediction of helical transmembrane proteins at two length scales.

Authors: Zhong Chen; Ying Xu
Journal: J Bioinform Comput Biol Date: 2006-04 Impact factor: 1.122

6. Prediction of RNA secondary structure, including pseudoknotting, by computer simulation.

Authors: J P Abrahams; M van den Berg; E van Batenburg; C Pleij
Journal: Nucleic Acids Res Date: 1990-05-25 Impact factor: 16.971

7. Thermodynamics of RNA-RNA binding.

Authors: Ulrike Mückstein; Hakim Tafer; Jörg Hackermüller; Stephan H Bernhart; Peter F Stadler; Ivo L Hofacker
Journal: Bioinformatics Date: 2006-01-29 Impact factor: 6.937

8. Rapid ab initio prediction of RNA pseudoknots via graph tree decomposition.

Authors: Jizhen Zhao; Russell L Malmberg; Liming Cai
Journal: J Math Biol Date: 2007-09-29 Impact factor: 2.259

9. Fold assembly of small proteins using monte carlo simulations driven by restraints derived from multiple sequence alignments.

Authors: A R Ortiz; A Kolinski; J Skolnick
Journal: J Mol Biol Date: 1998-03-27 Impact factor: 5.469

Review 10. Emerging themes in non-coding RNA quality control.

Authors: Karin M Reinisch; Sandra L Wolin
Journal: Curr Opin Struct Biol Date: 2007-03-28 Impact factor: 6.809

1 in total

1. Effect of 3'UTR RET Variants on RET mRNA Secondary Structure and Disease Presentation in Medullary Thyroid Carcinoma.

Authors: Lucieli Ceolin; Mirian Romitti; Débora Rodrigues Siqueira; Carla Vaz Ferreira; Jessica Oliboni Scapineli; Beatriz Assis-Brazil; Rodolfo Vieira Maximiano; Tauanne Dias Amarante; Miriam Celi de Souza Nunes; Gerald Weber; Ana Luiza Maia
Journal: PLoS One Date: 2016-02-01 Impact factor: 3.240

1 in total