Literature DB >> 21685061

A folding algorithm for extended RNA secondary structures.

Christian Höner zu Siederdissen¹, Stephan H Bernhart, Peter F Stadler, Ivo L Hofacker.

Abstract

MOTIVATION: RNA secondary structure contains many non-canonical base pairs of different pair families. Successful prediction of these structural features leads to improved secondary structures with applications in tertiary structure prediction and simultaneous folding and alignment.
RESULTS: We present a theoretical model capturing both RNA pair families and extended secondary structure motifs with shared nucleotides using 2-diagrams. We accompany this model with a number of programs for parameter optimization and structure prediction. AVAILABILITY: All sources (optimization routines, RNA folding, RNA evaluation, extended secondary structure visualization) are published under the GPLv3 and available at www.tbi.univie.ac.at/software/rnawolf/.

Entities: Chemical Disease Gene

Mesh：

Substances：
RNA

Year: 2011 PMID： 21685061 PMCID： PMC3117358 DOI： 10.1093/bioinformatics/btr220

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The classical RNA secondary structure model considers only the Watson–Crick AU and GC base pairs as well as the GU wobble pair. A detailed analysis of RNA 3D structures, however, reveals that there are 12 basic families of interactions between the bases, all of which appear in nature (Leontis and Westhof, 2001; Leontis ). Moreover, virtually all known RNA tertiary structures contain the so-called non-Watson–Crick base pairs. This has led to the development of an extended presentation of RNA contact structures with edges labeled by their pairing type (an example can be seen in Fig. 1). This extended description of base pairing is commonly termed after its inventors the Leontis–Westhof (LW) representation.

Fig. 1.

Example of a structure containing base triplets. The inner part (bases 14–37) of the PDB structure 1dul is shown in a 3D representation and as a 2D structure plot displaying the non-standard base pairs in LW representation. The four bases highlighted in the 3D structure form the two base triplets that can be seen in the upper part of the interior loop in the 2D structure. The LW representation has proved to be a particularly useful means of analyzing 3D structures of RNA as determined by X-ray crystallography and NMR spectroscopy (Leontis and Lescoute, 2006). In particular, it has led to the discovery of recurrent structural motifs, such as kink-turns and C-loops, that act as distinctive building blocks of 3D structures. The sequence variation in these structural motifs follows combinatorial rules that can be understood by the necessity to maintain the overall geometry when base pairs are exchanged. These isostericity rules are discussed in detail by Lescoute ); Stombaugh ). As a new level of RNA structure description, the ability to predict non-standard base pairs can be expected to improve the performance of RNA structure prediction. Furthermore, information about evolutionary conservation of the isostericity classes of these non-standard base pairs will improve consensus structure-prediction and structure-dependent RNA gene finding. Since many additional interactions beyond the standard base pairs are represented in the LW formalism, what was considered to be a loop in classical secondary structures can now appear as complex structures of non-standard base pairs. These non-standard base pairs effectively divide the long ‘classical’ loops into much shorter ones. Parisien and Major (2008) proposed a model that contains loops with no more than four unpaired bases. For unbranched structures, the model is scored using a statistical potential estimated from the available 3D structures by counting the relative frequencies of base pairs, short unbranched loops of particular shapes in dependence of their sequences and combinations of loops with a common base pair. An accompanying folding procedure, MC-Fold (Parisien and Major, 2008), which exhaustively enumerates stem-loop components, is available and has been used very successfully as a first step toward the de novo prediction of RNA 3D structures using MC-Sym (Parisien and Major, 2008), which takes as input the proposed secondary structure from MC-Fold.

2 MC-FOLD REVISITED

2.1 Algorithm

Like ordinary secondary structure prediction tools, MC-Fold (Parisien and Major, 2008) is based on a decomposition of the RNA structure into ‘loops’. In contrast to the standard energy model, however, it considers the full set of base pair types available in the LW representation. Each base pair, therefore, corresponds to a triple (i,j;θ) where θ is one of the 12 types of pairs. In this model, ordinary secondary structures are the subset of pairs with Watson-Crick type (θ=‘WW’) and the two nucleotides form one of the six canonical combinations {AU,UA,CG,GC,GU,UG{. This extension of the structure model also calls for a more sophisticated energy model. While the standard model assumes the contributions of the loops to be strictly additive, MC-Fold also considers interactions between adjacent unbranched loops (hairpins, stacked pairs, bulges and general interior loops). This means that the total energy of a structure is not only dependent on the loop types present, but also on the arrangement of these loops. Dispensing with details of the parametrization, the scoring function of MC-Fold for a structure 𝒮 on sequence x can be written as follows (see Fig. 2): where 𝒞,𝒞′,𝒞′ are different loops of 𝒮. The additive term E tabulates the (sequence-dependent) contributions of the loops. The interaction term E accounts for the ‘junction’ and ‘hinge’ terms in stem–loop regions. These interaction terms depend on the type of the adjacent loops as well as on the type θ and sequence (x[k], x[l]) of the base pair that connects them. For multiloops, only the additive term is considered.

Fig. 2.

MC-Fold and MC-Fold-DP both consider small loops, like the hairpin AAGUG (𝒞) and the 2×2 stack AAGU (𝒟) (read clockwise, starting bottom left). Each loop is scored by a function E(𝒞|AAGUG). The stack (𝒟) follows analoguously. The interaction term between two loops is calculated as indicated by the arrow (α), where the two loops are overlayed at the common AG pair. The contribution of the interaction is Ejunction+hinge(𝒞,𝒟;θ;A,G) with θ the unknown pair family. Let us ignore multiloops for the moment. A basepair (i,j;θ) then encloses a loop of type ℒ which is either a hairpin or encloses a loop 𝒦. It is connected to 𝒦 by a base pair (k,l;ψ) with i This can be expanded to a full ‘next-nearest-neighbor’ model by enforcing an explicit dependence on the type of the inner base pair: The effort to evaluate this recursion equation for a fixed base pair (i,j) is L3T3, where L is the number of loop types and T is the number of base pair types. While this prefactor is inconveniently large, we nevertheless obtain an 𝒪(n2) [or 𝒪(n3) with multibranched loops] folding algorithm instead of the exponential runtime of MC-Fold. The problem with this general form of energy parametrization is the unmanageable number of parameters that need to be measured, estimated or learned from a rather limited set of experiments and known RNA structures.

2.2 Parametrization and implementation

Since the folding problem for the MC-Fold model can be solved in polynomial time, the associated parameter estimation problem becomes amenable to advanced parameter optimization techniques (Andronescu ; Do ). At present, however, we have opted to extend the original MC-Fold parameters only by simple sparse data corrections that can be applied on top of the original MC-Fold database. This has the advantage of allowing a direct comparison between the original version of MC-Fold and our dynamic programming version MC-Fold-DP. In contrast to the original version, MC-Fold-DP can cope with large data sets and long sequences (3 s for 250 nt, about 24 s for 500 nt with MC-Fold-DP, compared to 660 s for 100 nt with MC-Fold). In terms of algorithmic design, we have made several changes. The grammar underlying MC-Fold-DP follows the ideas of Wuchty ). This makes the generation of all suboptimal structures in an energy band above the ground state possible. The decomposition of interior loops into small loops implies that MC-Fold-DP runs in 𝒪(n3) time without the need for the usual explicit truncation of long interior loops. The recursion that fills stem loops [Nucleotide Cyclic Motifs (NCMs) in the nomenclature of Parisien and Major (2008)] is now reduced to a function NCM(i,j,type,k,l,type). For the matrix, entry (i,j,type) is minimized over all (k,l,type) with (k,l) determined by the newly inserted motif type. Hairpins are even simpler: they follow NCM(i,j,type,…) but there is no inner part (k,l,type). The total number of motif types is small (15 in the original set, of which not all are actually used). Both the time and space complexities are, therefore, small enough to handle RNAs with a length of several hundred nucleotides, i.e. in the range that is typically of interest. In fact, the time complexity is similar to ordinary secondary structure prediction where interior loop size is bounded by a constant. Since the grammar is unambiguous, it is also straightforward to compute partition functions and base pairing probabilities, although this feature is not available in the current implementation.

3 BEYOND 1-DIAGRAMS

3.1 Base triplets

An important restriction of secondary structures is that each nucleotide interacts with at most one partner. In combinatorial terms, secondary structures are 1-diagrams. A closer analysis of the available 3D structures, however, reveals that many nucleotides form specific base pairs with two other nucleotides, forming ‘base triplets’ or, more generally, ‘multi-pairings’. Cross-free b-diagrams with maximal number b of interaction partners for each nucleotide can be treated combinatorically in complete analogy with (pseudoknot-free) secondary structures by conceptually splitting each node into as many vertices as there are incident base pairs (arcs). As in the case of secondary structures, we say that (i,j) and (k,l) cross if i Here, we consider only structures with at most two base pairs involving the same nucleotide, i.e. 2-diagrams. In this case, there is a convenient string representation generalizing the Vienna (dot-parentheses) notation for secondary structures by introducing three additional symbols <, >, X for positions in which two arcs meet: (( = <, )) = > and )( = X. For general b, the number of necessary symbols grows quadratically, s=(b+1)(b+2)/2, since each must encode b1 opening and b2 closing pairs with b1,b2≥0 and b1+b2≤b. These symbols provide a direct representation of the arc nodes ‘⊳,⊲, ×’ of Figure 3 and are an optional output of the folding program described below to visualize 2-diagrams in the secondary structure.

Fig. 3.

A simple unambiguous grammar for non-crossing 2-diagrams (The symbols used here denote (non-)terminals in a context-free grammar and are not to be confused with the LW notation used in other figures). Connected parts of diagrams correspond to terminal [individual bullet with no arc (closed circle) = unpaired nucleotide; arc with circular end points (closed circle, open circle) = base pair; arc with triangular endpoints (left-faced triangle, right faced triangle, cross) = part of base triple] or non-terminal (horizontal lines and semicircle) symbols of the grammar. It is important to realize that left-faced and right-faced triangles refer to the same nucleotide when they are adjacent. In terms of a recursion, the index for both left-faced and right-faced triangles is therefore the same. One triangle ‘points’ to the outer arc and one to the inner arc incident to the same nucleotide.

3.2 A grammar with base triplets

In order to design a dynamic programming folding algorithm for cross-free 2-structures we need a decomposition, i.e. a grammar for 2-structures. For practical applications, it is desirable to have not only a minimization algorithm, but also a partition function version. To this end, an unambiguous grammar is required (Dowell and Eddy, 2004; Reeder ). A simple version, treating base pairs as the elementary entities is shown in Figure 3. It translates into an extension of either a Nussinov-style algorithm for maximizing the number of base pairs or a recursion for counting the number of non-crossing 2-diagrams. Let F denote the minimum energy of a structure on the sequence interval x[i..j]. We have and analogous recursion for U,V and W, denoting the minimum energies over all structures whose left, right or both ends, are involved in a triplet. The symbol C refers to structures enclosed by a non-triplet base pair. In the simplest case, C=ϵ+F (lower right corner of Fig. 3). The terminal symbols are the unpaired base •, the ordinary base pairs and the three types of base pairs involved in triplets, contributing ϵ=0 sequence−dependent energy increment ϵ and sequence-dependent energy increments ϵa, ϵb and ϵc, respectively. The recursion is initialized with F=0. Only certain combinations of types of base pairs can occur in triplets. Thus, in a refined model we need to replace U, V and W by U[ν], U[μ] and W[ξ] explicitly referring to the base pair type(s) of the triplet. Furthermore, the energy parameters also become type dependent ϵa→ϵa[ρ] or even ϵa[ρ, ν] where ρ is the type of the pair itself and ν is type of the second pair of the triplet. The first variant is chosen for Nussinov-like algorithms, where each individual base pair is evaluated, splitting triplets, and the second variant is more fitting for Turner-like nearest neighbor models. In that case, recursion on W changes to W[ν,μ] to reflect the pairing choice being made.

3.3 Full loop-based model

The grammar of Figure 3 can be extended to incorporate the standard loop-based Turner energy model (Turner and Mathews, 2010) (which distinguishes hairpin loops, stacks of two base pairs, bulges, interior loops and multibranched loops). The modification of the grammar is tedious but rather straightforward, as seen in Figure 4. Instead of treating the base pairs themselves as terminal symbols (as in Fig. 3), this role is taken over by entire loops. Note that as in the case of ordinary secondary structures, each loop in a given structure is uniquely determined by its closing pair. The energy contributions now depend, in a more complex way, on the characteristics of the loop, hence we also need additional non-terminals to describe e.g. the components of multiloops.

Fig. 4.

Decomposition of one non-terminal in the full loop-based model with triples. The l.h.s. of the production rule denotes a structure enclosed by a base pair where the base at the 3′ end is part of a triple. The second base pair of this triple ends within the structure. The structural element is either bulge like (first column) or multiloop like. In the first case, we have to distinguish whether the enclosed structure has a normal pair or a triple at its 5′ side. In the multiloop case, we use the linear decomposition into components familiar from the Turner model with a non-terminal denoting a partial multiloop containing at least one base pair. Here, we need to distinguish whether the 5′ end of the rightmost component and 3′end of the left components are triples or not. As the multiloop part is not implemented in our current version, it is grayed out. We use a decomposition that is similar to that of MC-Fold and in addition encompasses 2-diagrams. A p×q-loop, p≤q, consists of p nucleotides on one strand and q nucleotides on the other one. In particular, 2×2-loops correspond to stacked base pairs, 1×q-loops, q>1 are triplets and 2×3-loops are stacks with a bulged-out nucleotide. In addition to hairpin loops and these p×q-loops, we consider generic bulges with and without a shared nucleotide, interior loops of larger sizes and multibranched loops, again possibly with shared nucleotides. Figure 4 gives an example for the full loop-based decomposition of one particular non-terminal. In our current implementation, we use several simplifications in particular for multiloops that involve triplets. Some information on the complete grammar used in our implementation can be found in the Appendix A, other information is available on the RNAwolf homepage.

4 IMPLEMENTATION

4.1 Folding software

The implementation available on the RNAwolf homepage is written in the high-level functional programming language Haskell. While this leads to an increase in running times (by a constant factor), the high-level notation and a library of special functions lead to very concise programs, and enable, e.g. the use of multiple cores. Currently, the following algorithms are implemented: (i) an optimizer which takes a set of melting experiments and the PDB database as input and produces a parameter file optimized as described below. (ii) A folding program which expects a sequence of nucleotides as input and produces an extended secondary structure prediction which includes nucleotide pairs of non-canonical types. Furthermore, it can contain motifs with base triplets. (iii) An evaluation program which expects both, a sequence and a secondary structure. The input is then evaluated to return the score of said structure and, if requested, tries to fill the given (canonical) structure with additional pairs. This allows to turn a classical secondary structure into an extended secondary structure by filling large loops with non-canonical pairs. At the moment, base triplets have been restricted slightly in that shared nucleotides are only possible in stem structures, not within a multibranched loop motif. Allowing shared nucleotides between two helices of a multiloop would slow down multiloops by a significant factor. Nevertheless, we will lift this restriction for the full nearest neighbor model we plan to implement. In the full model, we will be able to use data gathered from our current model to reduce the combinatorial complexity of the algorithm within multibranched loops.

4.2 Parameter estimation

In contrast to the Turner model, which considers only canonical base pairs [i.e. Watson–Crick and GU (wobble) pairs], we include all types of base pairs. Thus, we also have to derive parameters for all possible base pair families in our motifs of choice. To this end, we need to find sufficient evidence for each parameter and we need an efficient numerical algorithm for optimizing the parameters. Even if a large body of sequence/structure pairs is available to train the parameters, it is still highly unlikely that each parameter is witnessed. A simple calculation for canonical stacked pairs already produces 44×122=36864 (ignoring symmetries) parameters to be trained. While symmetries reduce the number of distinct parameters, canonical stacks still require ~10000 independent parameters. In total, the number of parameters easily reaches 105, which means that only a very small set of parameters will actually be observed in experimentally verified structures. The second problem is of numerical nature in that it gets hard to estimate a solution in ℝ100000 even under ideal circumstances. In addition, the computational effort for the computation of the solution vector is rather high. There are two different types of approaches to this problem, described in some detail by Andronescu ). In max-margin formulations, parameters are optimized such as to drive them away from wrongly determined structures and toward correctly determined ones. Alternatively, the conditional likelihood of known structures is maximized. Andronescu ) described an extension of the algorithm that can deal with unobserved configurations by employing a hierarchical statistical model. We have selected yet another way of dealing with the immense number of features. Instead of optimizing the full set of parameters directly, we first optimize the parameters for a restricted model closely following the simple unambiguous grammar given in Figure 3. In short a loop of type m (e.g. stacked pair, bulge, etc.) enclosed by two pairs p1, p2 is assigned an energy ϵ(m)+ϵ(p1)+ϵ(p2), where ϵ(m) depends on the type and size of the loop but is independent of sequence, and the pair energies depend on the identity of the nucleotides as well as the LW type (e.g. GC,cWW). We call our model enhanced Nussinov as it distinguishes between loops of different types (say bulges of different lengths are assigned different scores) but assumes that pair energies are independent as in the Nussinov model. This approach has several advantages. First, the resulting algorithm is an accessible ‘toy-model’ that can be employed to test different hypotheses. Second, the estimated parameters provide a useful set of priors for the full model. This is important since, in contrast to the work of Andronescu ), we cannot derive a complete set of priors from known data. Finally, the computational requirements are significantly lower. Training a full Boltzmann model for conditional likelihood maximization might easily have taken months of CPU time (Andronescu ). Here, we utilize both melting experiments and PDB data for parameter estimation. Melting experiments yield a small set of sequences, structures and corresponding free energies. The structural data, unfortunately, provides almost exclusively canonical Turner features and no information regarding the base pair family, although it can be assumed that all pairs are of the Watson–Crick (cWW) style. The PDB data, on the other hand, contain not only non-canonical base pairs, but also provide information on the base pair family. In addition, PDB entries typically refer to structures that are much larger than those used in melting experiments. Together, both sets provide data required for the estimation of an extended set of parameters. In order to keep computation times short, we employ the original no-max-margin constraint-generation approach used by Andronescu ). While not providing the most accurate parameters in the original paper, the relatively short runtimes of ~ 1 CPU day are convenient for experimental purposes. In addition, since we are training an enhanced Nussinov-style model, we can assume that the prediction accuracy is limited by the structure of the model. More advanced, and hence computationally more expensive, training methods are therefore unlikely to lead to substantial improvements of the prediction accuracy.

4.3 Optimization

Our task is to estimate the energy contributions x for a given collection of features j. In this context, a feature corresponds to a terminal symbol in our grammar with a fixed underlying sequence, such as as GC/GC stacked pair or a 1×3-loop with sequence (G—AUC) where GA is a Hoogsteen pair and GC is a Watson–Crick pair. We are given the following types of data: (i) a matrix A whose entries A encode how often feature j occurs in sequence/structure pair i, and (ii) a vector y containing measured melting temperatures y for experiment i. Constraints are now generated as follows. For each entry k of the PDB, we extract the (extended) secondary structure features. This means that neither pseudoknots nor intermolecular interactions (which require more complicated grammars) are considered. The entry f of the row vector f counts how often feature j is observed in the structure. Using the current parameter values x (see below), the sequence of PDB entry k is folded and the corresponding feature vector g is constructed. If the predicted fold has a lower free energy than the known structure, a new constraint (f−g)x≤0 is introduced. Note that fx and gx are, by construction, the free energies of the known and the predicted structure evaluated with the current parameters x. Since the true structure is expected to be the thermodynamic ground state, its free energy must be smaller than that of any other structure. The constraint matrix D contains all currently active constraints where D is the k−th active constraint (in this notation D selects the k-th row, while D., would select the l-th column). Following Andronescu ), we use a slack variable d for each constraint so that Dx≤d. This guarantees that the problem remains feasible as otherwise conflicting constraints could reduce the feasible set for x to the empty set. The slack variables d are bounded from below by 0≤d because (f−g)x≥0, with equality for cooptimal structures. Norm minimization problems can drive individual variables x to extreme values. We, therefore, constrain the energy contribution of individual features to |x|<5 kcal/mol. A subset S of features that act as penalties are constrained to positive values, x>0 for j∈S. The set S is defined along the following principles: unpaired loop regions destabilize the structure relative to a random coil and hence should be penalized. Hairpins, bulges and interior loops fall into this category. In addition, 1×2 and 2×3 stems, which are otherwise modeled as 2×2 stems, are penalized. Hence, for e.g. the 2×3 loop CAUGG with A unpaired, we have ϵ(CG)+ϵ(UG)+ϵ(2×3) where ϵ(2×3) is the penalty term. Parameter estimation is thus reduced to the constrained norm optimization problem with the linear constraints Since this optimization problem is convex it can be solved efficiently. The parameter vector x is optimized iteratively. Initially, D is empty and no slack variables d are used. After the first step, all PDB sequences have been folded and those for which the predicted structure is different from the known structure are included as a row in D as described above. The slack variables are initialized as d=Dx+γ for each constraint k, where γ∈ℝ+ is a small constant. Iterations of the optimization procedure continue until no more constraints have to be added. The computational effort required, both to estimate the parameters and to fold a single sequence, is higher than what is required for the Turner model. The additional computational effort required by the folding algorithm is mainly a result of the inclusion of the pair family information. In the case of 2-loops (stacks, bulges, interior loops), we incur an additional factor of 12 since each possible pair family has to be considered. More problematic are multibranched loops in the case of shared nucleotides as now there are up to 12×11 possibilities to connect a shared nucleotide with its pairing partners.

4.4 Comparison with turner parameters

A comparison with the parameter sets by Turner (Turner and Mathews, 2010) shows that individual contributions are similar enough to make the the ‘enhanced Nussinov’ model a useful prior in the parameter optimization for the full model. Consider, for example, canonical 2×2 stacks, where one pair is of type GC, cWW type and the other pair is of type XY, cWW, with XY ∈{GC,CG,AU,UA,GU,UG} and cWW stands for cis/Watson–Crick/Watson–Crick, the canonical pair type. In the Turner-2004 model, energy contributions range from −1.5 to −3.4 kcal/mol, while the base−pair contribution for the GC, cWW pair is −1.36 kcal/mol in the optimized ‘enhanced Nussinov’ model. Depending on the second pair, we observe discrepancies of ≈0.5 when comparing the sum of individual pair energies to the total stacking energy. This level of agreement is expected and suggests that it makes sense in later iterations of parameter estimation to constrain features to tighter intervals than the current setting of using the open interval of ]−5,5[.

5 RESULTS AND DISCUSSION

MC-Fold-DP: MC-Fold-DP and the original MC-Fold by (Parisien and Major, 2008) show comparable performance on a set of 347 sequences selected from the RNAstrand (Andronescu ) database. There are several differences between the two algorithms. First, the runtime, where MC-Fold-DP is about ×200−×1000 faster for biologically relevant sequences (i.e. <1000 nt). Table 1 shows a small comparison of the prediction accuracy given different measures. Second, we allow for sparse data correction, which can be disabled by the user. And third, the algorithm accepts non-canonical input (e.g. ‘N’ characters) and can be configured to calculate approximate scores for motifs containing such characters.

Table 1.

Prediction accuracy of MC-Fold, MC-Fold-DP and RNAfold 1.8.4 on a set of 347 sequences from the RNAstrand database

Algorithm	Count	MCC	F	S	PPV
MC-Fold, ≤50 nt	298	0.74	0.74	0.80	0.70
MC-Fold, ≤100 nt	37	0.54	0.54	0.66	0.46
MC-Fold, >100 nt	12	0.49	0.49	0.53	0.46
[4pt] MC-Fold-DP, ≤50 nt	298	0.71	0.71	0.77	0.68
MC-Fold-DP, ≤100 nt	37	0.53	0.53	0.64	0.45
MC-Fold-DP, >100 nt	12	0.38	0.37	0.51	0.29
[4pt] RNAfold, ≤50 nt	298	0.76	0.76	0.73	0.81
RNAfold, ≤100 nt	37	0.73	0.73	0.73	0.73
RNAfold, >100 nt	12	0.63	0.63	0.66	0.60

All sequences are < 200 nt long. The longest sequence took just under an hour of computation time using MC-Fold. MC-Fold-DP can compute the predicted structure in ~1 s (loading the MC-Fold motif database requires an additional 1–2 s). Prediction quality has been measured on canonical base pairs only for comparison purposes. Note the small number of sequences > 100 nt. (MCC, Matthews correlation coefficient; F, F-Measure; S, Sensitivity; PPV, Positive Predictive Value).

Prediction accuracy of MC-Fold, MC-Fold-DP and RNAfold 1.8.4 on a set of 347 sequences from the RNAstrand database All sequences are < 200 nt long. The longest sequence took just under an hour of computation time using MC-Fold. MC-Fold-DP can compute the predicted structure in ~1 s (loading the MC-Fold motif database requires an additional 1–2 s). Prediction quality has been measured on canonical base pairs only for comparison purposes. Note the small number of sequences > 100 nt. (MCC, Matthews correlation coefficient; F, F-Measure; S, Sensitivity; PPV, Positive Predictive Value). Differences in predictions are the result of internals of the orignial algorithm that have remained unknown to us since they are not described in full detail in Parisien and Major (2008). It should be noted that our reformulation makes MC-Fold-DP amenable for the parameter optimization approaches pioneered by Andronescu ) for which a polynomial-time prediction algorithm is crucial. The non-ambiguous grammar allows even the advanced, Boltzmann Likelihood-based, approaches to be employed. This presents an opportunity for future research. RNAwolf: we compared our enhanced Nussinov algorithm to three state-of-the-art thermodynamic folding algorithms [RNAfold (Hofacker ), UNAfold (Markham and Zuker, 2008) and RNAstructure (Reuter and Mathews, 2010)] to assess the prediction quality of our model. We folded a subset of 550 randomly chosen structures from RNAstrand (Andronescu ) and compared the F-measure of our results with those of the other programs. The results in Figure 5A show that, not unexpectedly, the ‘enhanced Nussinov’ algorithm cannot compete with state-of-the-art tools due to its simplified energy model.

Fig. 5.

(A) Histogram of F-measures for different folding algorithms, given 550 random RNAstrand entries. (B) F-measures, given 155 PDB entries from the RNAstrand, which are a subset of the 550 random RNAstrand entries. Interestingly, once we focused on data gathered from the PDB database (Fig. 5B), the results showed a remarkable improvement. This could suggest that the PDB structures used for training do not sufficiently cover the RNA structure space and that additional RNAs (for which only secondary structure information is available) should be included in the training. Because of the large number of base pair types, the ‘enhanced Nussinov-algorithm’, has to perform more work than classical secondary structure prediction programs when filling the dynamic programming matrices. This is reflected by rather high runtimes (25 s for 100 nt, 110 s for 200 nt). However, the asymptotic time complexity is still in 𝒪(n3). A constrained folding variant of the ‘enhanced-Nussinov’ algorithm can be used, for example, to predict non-canonical base pairs in large interior loops of structures. As an example, Figure 6, shows that RNAwolf is able to correctly predict the non-canonical base pairs in a situation where the canonical base pairs are already given, i.e. where the input consists of both the sequence and a dot-bracket string representing canonical Watson–Crick base pairs. Only the zig-zag motif (upper part of the interior loop) was not predicted, presumably due to the large penalty of +3.89 for each of the two 1×2 stacks.

Fig. 6.

Prediction of non-canonical base pairs with RNAwolf. (A) Known structure of PDB entry 1dul. (B) Constrained prediction (canonical base pairs were given) of 1dul. Only the central part of the structure is shown. The outer part of the stem contains only canonical base pairs and is not shown. Further results and a semi-automatic system for secondary structure prediction comparison (SSPcompare) are available on the RNAwolf homepage. Table 1 has been created using said program.

6 CONCLUSION

Large experimentally verified RNA structures contain a sizable number of non-canonical base pairs (Stombaugh ). However, only a few RNA folding programs predict non-canonical pairs (Do ; Parisien and Major, 2008). With the exception of MC-Fold, the pair families are not explicitly taken into account. Here, we have shown that the prediction of non-canonical pairs together with the corresponding pair families and their possible interactions in base triples is feasible by efficient dynamic programming approaches. Although direct thermodynamic measurements are not available to cover all aspects of such an extended and refined model of RNA structures, meaningful parameter sets can nevertheless be constructed. To this end, the information of the thermodynamic measurements is combined with a feature analysis of 3D structures using one of several approaches to large-scale parameter optimization. The extended combinatorial model, which in essence covers the LW representations of RNA structures, allows a much more detailed modeling of the intrinsic structures in particular of hairpins, interior loops and bulges. We emphasize that our contribution does not yet provide a full-fledged loop-based LW-style energy model. In essence, we still lack an implementation for the full model of multiloops. As the example of Figure 6 suggests, interactions of adjacent loops as in the MC-Fold model may also be required to obtain satisfactory prediction accuracies for practical applications. Due to the computational cost, it will also make sense to investigate the trade-off between further refinements of the model and speed-ups resulting from additive approximations. Another facet that naturally should be taken into account is coaxial stacking, in particular in the context of multiloops (Tyagi and Mathews, 2007). We have demonstrated here that the goal of an accurate, practically applicable folding algorithm for LW structures is meaningful and reachable: the work of Parisien and Major (2008) shows that major improvements of prediction accuracy can be obtained by employing LW-based folding algorithms. Although RNAwolf does not yet reach the desired levels of accuracy, it allows us to explore the missing components of the energy model in a systematic manner, and it demonstrates that this can be achieved without leaving the realm of fast, efficient and exact dynamic programming approaches. The next step, therefore, is a toolkit for optimizing parameters in the full loop-based model. An interesting possibility for further extensions of the model is the explicit incorporation of recurring RNA structural motifs with non-canonical pairs, such as Kink-Turns (Klein ), into the grammar and the energy model. This may be particularly useful in those cases where motifs are not crossing-free and hence would require a pseudoknot version of the folding algorithm. While the inclusion of various types of pseudoknots is conceptually not more difficult than for ordinary secondary structures, the parametrization of such models will be even more plagued by the lack of training data in the LW framework. The folding algorithm introduced here, furthermore, sets the stage for a complete suite of bioinformatics tools for LW structures. Simple extension can cover the cofolding of two or more RNAs along the lines of (Bernhart ; Dimitrov and Zuker, 2004; Dirks ). Consensus structures can be predicted from given sequence alignments using the same recursions. As in RNAalifold (Bernhart ), it suffices to redefine the energy parameters for alignment columns instead of individual nucleotides. Instead of RIBOSUM-like scores as measures of conservation (Klein and Eddy, 2003), one naturally would employ the isostericity rules for the individual base pair types (Leontis ; Lescoute ). Inverse folding algorithms (Andronescu ; Busch and Backofen, 2006; Hofacker ) design RNA sequences that fold into prescribed structures by iteratively modifying and folding sequences to optimize their fit to substructures of the target. This strategy can immediately be generalized to LW structures; in fact, in essence it suffices to replace secondary structure folding by LW style folding. Combining the algorithmic ideas of this contribution with the Sankoff-style alignment approach of Zhong ) and the progressive multiple alignment scheme of mlocarna (Will ) directly leads to an LW variant of structural alignment algorithms.

27 in total

1. Geometric nomenclature and classification of RNA base pairs.

Authors: N B Leontis; E Westhof
Journal: RNA Date: 2001-04 Impact factor: 4.942

2. Prediction of hybridization and melting for double-stranded nucleic acids.

Authors: Roumen A Dimitrov; Michael Zuker
Journal: Biophys J Date: 2004-07 Impact factor: 4.033

3. Consensus folding of unaligned RNA sequences revisited.

Authors: Vineet Bafna; Haixu Tang; Shaojie Zhang
Journal: J Comput Biol Date: 2006-03 Impact factor: 1.479

4. CONTRAfold: RNA secondary structure prediction without physics-based models.

Authors: Chuong B Do; Daniel A Woods; Serafim Batzoglou
Journal: Bioinformatics Date: 2006-07-15 Impact factor: 6.937

Review 5. The building blocks and motifs of RNA architecture.

Authors: Neocles B Leontis; Aurelie Lescoute; Eric Westhof
Journal: Curr Opin Struct Biol Date: 2006-05-19 Impact factor: 6.809

6. Predicting helical coaxial stacking in RNA multibranch loops.

Authors: Rahul Tyagi; David H Mathews
Journal: RNA Date: 2007-05-16 Impact factor: 4.942

7. RNAstructure: software for RNA secondary structure prediction and analysis.

Authors: Jessica S Reuter; David H Mathews
Journal: BMC Bioinformatics Date: 2010-03-15 Impact factor: 3.169

8. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.

Authors: Sebastian Will; Kristin Reiche; Ivo L Hofacker; Peter F Stadler; Rolf Backofen
Journal: PLoS Comput Biol Date: 2007-02-22 Impact factor: 4.475

9. Effective ambiguity checking in biosequence analysis.

Authors: Janina Reeder; Peter Steffen; Robert Giegerich
Journal: BMC Bioinformatics Date: 2005-06-20 Impact factor: 3.169

10. RNAalifold: improved consensus structure prediction for RNA alignments.

Authors: Stephan H Bernhart; Ivo L Hofacker; Sebastian Will; Andreas R Gruber; Peter F Stadler
Journal: BMC Bioinformatics Date: 2008-11-11 Impact factor: 3.169

16 in total

1. JAR3D Webserver: Scoring and aligning RNA loop sequences to known 3D motifs.

Authors: James Roll; Craig L Zirbel; Blake Sweeney; Anton I Petrov; Neocles Leontis
Journal: Nucleic Acids Res Date: 2016-05-27 Impact factor: 16.971

2. Algebraic Dynamic Programming over general data structures.

Authors: Christian Höner zu Siederdissen; Sonja J Prohaska; Peter F Stadler
Journal: BMC Bioinformatics Date: 2015-12-16 Impact factor: 3.169

Review 3. Computational analysis of noncoding RNAs.

Authors: Stefan Washietl; Sebastian Will; David A Hendrix; Loyal A Goff; John L Rinn; Bonnie Berger; Manolis Kellis
Journal: Wiley Interdiscip Rev RNA Date: 2012-09-18 Impact factor: 9.957

Review 4. How to benchmark RNA secondary structure prediction accuracy.

Authors: David H Mathews
Journal: Methods Date: 2019-04-02 Impact factor: 3.608

5. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction.

Authors: Tomasz Puton; Lukasz P Kozlowski; Kristian M Rother; Janusz M Bujnicki
Journal: Nucleic Acids Res Date: 2013-02-21 Impact factor: 16.971

6. Towards 3D structure prediction of large RNA molecules: an integer programming framework to insert local 3D motifs in RNA secondary structure.

Authors: Vladimir Reinharz; François Major; Jérôme Waldispühl
Journal: Bioinformatics Date: 2012-06-15 Impact factor: 6.937