Jana Sperschneider1, Amitava Datta. 1. School of Computer Science and Software Engineering, The University of Western Australia, Perth, WA 6009, Australia. janaspe@csse.uwa.edu.au
Abstract
RNA pseudoknots are functional structure elements with key roles in viral and cellular processes. Prediction of a pseudoknotted minimum free energy structure is an NP-complete problem. Practical algorithms for RNA structure prediction including restricted classes of pseudoknots suffer from high runtime and poor accuracy for longer sequences. A heuristic approach is to search for promising pseudoknot candidates in a sequence and verify those. Afterwards, the detected pseudoknots can be further analysed using bioinformatics or laboratory techniques. We present a novel pseudoknot detection method called DotKnot that extracts stem regions from the secondary structure probability dot plot and assembles pseudoknot candidates in a constructive fashion. We evaluate pseudoknot free energies using novel parameters, which have recently become available. We show that the conventional probability dot plot makes a wide class of pseudoknots including those with bulged stems manageable in an explicit fashion. The energy parameters now become the limiting factor in pseudoknot prediction. DotKnot is an efficient method for long sequences, which finds pseudoknots with higher accuracy compared to other known prediction algorithms. DotKnot is accessible as a web server at http://dotknot.csse.uwa.edu.au.
RNA pseudoknots are functional structure elements with key roles in viral and cellular processes. Prediction of a pseudoknotted minimum free energy structure is an NP-complete problem. Practical algorithms for RNA structure prediction including restricted classes of pseudoknots suffer from high runtime and poor accuracy for longer sequences. A heuristic approach is to search for promising pseudoknot candidates in a sequence and verify those. Afterwards, the detected pseudoknots can be further analysed using bioinformatics or laboratory techniques. We present a novel pseudoknot detection method called DotKnot that extracts stem regions from the secondary structure probability dot plot and assembles pseudoknot candidates in a constructive fashion. We evaluate pseudoknot free energies using novel parameters, which have recently become available. We show that the conventional probability dot plot makes a wide class of pseudoknots including those with bulged stems manageable in an explicit fashion. The energy parameters now become the limiting factor in pseudoknot prediction. DotKnot is an efficient method for long sequences, which finds pseudoknots with higher accuracy compared to other known prediction algorithms. DotKnot is accessible as a web server at http://dotknot.csse.uwa.edu.au.
RNA is a versatile nucleic acid, which is no longer seen as the passive intermediate between DNA and proteins. Numerous functional RNAs with an astonishing variety have been uncovered in the past decade. For example, non-coding RNAs participate in a wide range of cellular processes, are able to regulate gene expression and can act as catalyst (1,2). Macromolecule function is closely connected to its 3D folding and structure prediction from the base sequence is thus of great importance. RNA structure formation is understood to be hierarchical and, therefore, secondary structure prediction is the foundation for determining the tertiary folding (3,4).There are two streams in computational RNA secondary structure prediction: comparative approaches and single sequence methods. In general, comparative approaches give accurate results for a set of well-conserved sequences (5). However, comparative methods rely on the quality of multiple alignments and are not always feasible for RNA structure prediction, due to a lack of reliable data sets (6).When only a single sequence is given, the most popular approach for RNA structure prediction is free energy minimization. In the minimum free energy (MFE) model, continuous base pairs contribute enthalpic terms and loop regions are purely entropic. RNA comprises various secondary structure elements, i.e. stems, hairpin loops, bulge loops, internal loops and multiloops. Much experimental work has been done to determine their free energy parameters (7,8). The key concept that allows for dynamic programming is that all of these motifs are non-crossing and self-contained in terms of their free energy. The MFE secondary structure based on the additive free energy model can be predicted in time and space using dynamic programming (9,10). MFE prediction has been extended in several ways (11). Suboptimal structures with free energy close to the MFE can be calculated (12,13). Using the dynamic programming principle, the full equilibrium partition function for RNA secondary structure is computed in time and space (14). From the partition function, probabilities for base pairs and structure elements are derived. The main advantage of the MFE algorithm is its guarantee to find an optimal structure with regards to the underlying energy model. However, the inability to predict crossing structure elements, so-called pseudoknots, is a major drawback.Pseudoknots are functional structure elements, which have been reported in most classes of RNA (15). A pseudoknot forms when unpaired bases in a loop pair with complementary bases in a single-stranded region outside the loop (Figure 1). Pseudoknots play key roles in viral genome replication and regulation of protein synthesis (16). Pseudoknots are also found in the cell where they participate in processes such as splicing, ribosomal frameshifting, telomerase activity and ribosome function (17–19). In many cases, they assist in the overall 3D folding (20,21) and should not be excluded from computational structure prediction.
Figure 1.
A simple H-type pseudoknot. (a) Unpaired bases within a hairpin loop bond with unpaired bases in a single-stranded region outside the loop. (b) The resulting pseudoknot has two stems and and three loops , and . (c) A pseudoknot has at least two crossing stems, which can be displayed as intervals on the line.
A simple H-type pseudoknot. (a) Unpaired bases within a hairpin loop bond with unpaired bases in a single-stranded region outside the loop. (b) The resulting pseudoknot has two stems and and three loops , and . (c) A pseudoknot has at least two crossing stems, which can be displayed as intervals on the line.RNA secondary structure prediction with arbitrary pseudoknots under a basic energy model is NP-complete (22,23). Restricted classes of pseudoknots can be included in the dynamic programming algorithm for prediction of the MFE structure, resulting in high computational complexity. In dynamic programming, there is always a trade-off between the generality of pseudoknots, which can be predicted, and runtime. Rivas and Eddy (24) cover a broad class of pseudoknots including kissing hairpins in their algorithm, which requires time and space. More restricted pseudoknots are included in other dynamic programming algorithms, which have runtime of using or space (22,23,25). All of these algorithms are only feasible for short RNA sequences. The most practical method is pknotsRG, which computes the MFE structure with canonical simple recursive pseudoknots in time and space (26). Dynamic programming does guarantee to find a structure with minimum free energy with respect to the underlying energy model. However, the energy model for pseudoknots used in dynamic programming is only a simple parameterization adopted from the affine multiloop energy model (24–26). It was first introduced by Rivas and Eddy (24) because of the lack of experimentally measured parameters for pseudoknot energies. Predictive accuracy of MFE folding is always limited by the underlying energy model, and hence pseudoknot prediction results are poor for longer sequences.Due to the computational complexity of dynamic programming for pseudoknot prediction, heuristic approaches were developed as an alternative. Heuristic methods do not necessarily return the MFE structure; however, they can include a wide class of pseudoknots and more advanced energy models in reasonable runtime. RNA secondary structure prediction including pseudoknots has been approached using genetic algorithms (27,28), stochastic context-free grammars (29,30), kinetic folding simulations (31,32) and maximum weighted matching in a folding graph (33). Iterative stem adding procedures have also been developed (34–36). In several heuristic algorithms, the underlying energy model for secondary structures is simple base pair maximization, neglecting loop entropies (33,34). This may lead to unreliable results, especially for longer sequences. Another drawback is that most heuristic methods reported in the literature employ the same affine pseudoknot energy model as dynamic programming.A different algorithmic framework is used in heuristic pseudoknot detection programs (37–40). Here, one attempts to find promising pseudoknot candidates in a sequence as a first step. These potential pseudoknots are subsequently analysed and verified. After pseudoknot detection, the remaining sequence can be folded using free energy minimization in time and space. Pseudoknot detection has two main advantages over dynamic programming. First, it is computationally much more efficient and, therefore, practical for scanning long RNA sequences for pseudoknots. Second, the underlying framework is less restrictive and allows for easy incorporation of sophisticated energy rules for pseudoknots or even comparative information. Pseudoknot detection delivers accurate pseudoknot prediction results for longer sequences in many cases (37,39).Pseudoknot energy models have been studied in more detail in the past years and reliable energy parameters are in high demand. It is widely accepted that pseudoknot energy cannot be estimated with an additive model as used in MFE folding. There is strong interference between opposite loops and stems ( and , and ), which come in close contact in the 3D fold. In a simple hairpin type (H-type) pseudoknot, loops and span across the deep narrow (major) groove and the shallow wide (minor) groove, respectively. The corresponding loop entropies are, therefore, not equivalent and depend on the structure and length of the opposite stem (41,42).The lack of loop entropy parameters is a critical issue in pseudoknot prediction (43). For H-type pseudoknots with interhelix loop size nt, loop entropy values were derived using several fitted parameters (41). Gaussian chain approximation based on polymer physics for pseudoknot loop entropies was also proposed (44). Lattice-based models were developed to take into account volume exclusion effects (45–47). The most successful models are based on the atomic coordinates of the RNA backbone. Using polymer statistical mechanics, Cao and Chen (42) calculated loop entropy parameters for pseudoknots with interhelix loop size nt. Recently, their so-called virtual bond model has been extended to pseudoknots with interhelix loop size nt (48). Inclusion of such a pseudoknot energy model is not straightforward in dynamic programming in reasonable runtime due to dependencies between opposite loops and stems. Calculation of the full partition function under the virtual bond model takes time and space, making the approach only feasible for sequences shorter than say 150 nt (42). However, pseudoknot energy parameters derived from the virtual bond model are readily available in tabular form for many stem and loop length combinations (42,48). It is straightforward to incorporate such energy parameters in a pseudoknot detection approach, making prediction using much improved energy models feasible for long sequences.We present DotKnot, a pseudoknot prediction method that incorporates stem–loop correlated pseudoknot energy parameters from the virtual bond model (42,48). The workflow is similar to the detection approach used in KnotSeeker (39), with two main improvements. First, a secondary structure partition function calculation in time and space is the basis for finding a set of promising structure elements with high confidence (14). This set includes heuristically derived bulge loops, internal loops and multiloops with low free energy. Second, pseudoknot candidates are assembled using the set of promising structure elements and, therefore, loop entropy parameters can readily be used for pseudoknot energy evaluations (42,48). There is no dynamic programming kernel for verification of pseudoknot candidates. This is a major step towards successful pseudoknot prediction as the vast majority of methods in the literature use simple approximations of pseudoknot energies. However, improving the folding model behind the algorithmic framework is clearly the key to accurate prediction (43).DotKnot predicts the class of recursive H-type pseudoknots where one of the pseudoknot stems can be interrupted by bulge or internal loops. All of the three pseudoknot loops are allowed to form internal secondary structures. Using the set of structural elements derived from the probability dot plot as a construction kit, we could predict a broad class of pseudoknots, including kissing hairpins, in an efficient manner. However, our knowledge about energy parameters and folding mechanisms become the limiting factor for a pseudoknot search tool such as DotKnot. Given an RNA sequence, DotKnot returns only the (possibly empty) set of detected pseudoknots. These detected pseudoknots can subsequently be verified using laboratory techniques or comparative information. The remaining non-crossing sequence can then be folded in time and space using state-of-the-art RNA secondary structure prediction algorithms to obtain a global folding.
MATERIALS AND METHODS
The detailed workflow for DotKnot is shown in Figure 2. Given an RNA sequence, DotKnot finds sequence fragments with pseudoknot folding potential. These candidates are analysed in terms of their free energy values and credibility in the folded sequence. The output is a (possibly empty) set of detected pseudoknots, which can then be examined closely.
Figure 2.
Workflow used by DotKnot for detecting pseudoknots in a sequence using the probability dot plot. MWIS stands for maximum weight independent set calculation (56).
Workflow used by DotKnot for detecting pseudoknots in a sequence using the probability dot plot. MWIS stands for maximum weight independent set calculation (56).
Stems in the probability dot plot
RNA structure forms through complementary base pairing, resulting in stabilizing stems and destabilizing loop regions. The basic building blocks for a pseudoknot are two crossing stems. We use the probability dot plot derived from the partition function as a guide for finding such building blocks. Given an RNA sequence, the partition function is defined as the weighted sum over the set of all possible secondary structures , i.e. where is the universal gas constant and the temperature. Once the partition function is known, probabilities for base pairs and structure elements can be calculated. The software RNAfold returns the probability dot plot representing both base pair probabilities and stack probabilities (11). The stack probability for a base pair is defined as the probability that pair and the subsequent pair are formed simultaneously (49).It has been shown that choosing base pairs with high probability can improve secondary structure prediction (50). There are also several approaches that discuss the exclusion of base pairs with low probability for improving runtime. Here, the common technique is to use a cut-off value for base pair probabilities in order to determine only significant base pairs (51). However, for pseudoknot stems we cannot simply dismiss base pairs with low probability. The partition function calculation is based on the ensemble of secondary structure elements, which are non-crossing interactions. In general, the pseudoknot stems are visible in the probability dot plot as they are members of the folding space. One can expect that at least one of the pseudoknot stems will have low base pair and stack probabilities. By default, RNAfold only displays probabilities larger than . For pseudoknot detection in the dot plot, we use a cutoff probability of to make sure that all pseudoknot stems can be found for long sequences.Given the probability dot plot, stems are assembled according to certain criteria. First, a stem must have at least three base pairs. Second, one can expect that the stack probabilities in a stable stem do not rise or drop sharply (49). Helix elongation is an energetically favourable process; however, wobble base pairs can be destabilizing (8,52,53). Furthermore, there may be other stable secondary structure elements competing for base pairs. Therefore, we demand that the absolute percentage increase or decrease of stack probabilities for subsequent base pairs , in a stem has to be smaller than a certain threshold .One has to keep in mind that base pair probabilities are not independent. Therefore, stem probabilities can never be calculated by simply multiplying the base pair probabilities. However, the average probability of participating base pairs in a stem can be used as a confidence indicator (54). In our approach, we calculate the confidence for a stem as the average stack probabilities. For each stem, an absolute weight is introduced in addition. The confidence is based on the energy model for secondary structures, excluding pseudoknots. Therefore, pseudoknot stems tend to have low average probabilities especially for longer sequences, which does not necessarily correspond to their dominance in native RNA structures. We assign two additional weights for a stem based on a local energy evaluation. The simple stacking model employs the favourable base pair stacking parameters in a stem; however, it excludes the destabilizing energy contributed by the hairpin loop. The more sophisticated free energy model includes all entropy and enthalpy parameters derived by the Turner group (8). The tool RNAeval is used to evaluate the local free energies of the stem candidates according to the two energy models introduced above, taking into account dangling ends on both sides (11). DotKnot stores two stem weights (simple stacking model) and (free energy model) for a stem . Only stems satisfying the following conditions are kept: kcal/mol and kcal/mol. The resulting stems are stored in a stem dictionary . Each stem has a unique key , where is the start position and the end position of the stem in sequence . For each stem , we store its length and the following values in the stem dictionary: and .
Finding bulges, internal loops and multiloops
In RNA structures, stems are often interrupted by bulge loops or internal loops. In the following, a stem interrupted by bulge loops or internal loops is denoted as (Figure 3). Naive construction of interrupted stems from the stem dictionary would be inefficient due to a large number of possible stem combinations. Therefore, we employ a heuristic strategy for finding interrupted stems with low free energy.
Figure 3.
(a) Example for a stem interrupted by a bulge loop and an internal loop. (b) The corresponding secondary structure is shown.
(a) Example for a stem interrupted by a bulge loop and an internal loop. (b) The corresponding secondary structure is shown.Given the stem dictionary , interrupted stems are constructed using maximum weight independent set (MWIS) calculations with the confidence indicators as stem weights (Supplementary Algorithm 1). Using confidence instead of local free energy gives better results in the MWIS calculation. First, it implicitly penalizes the formation of long bulge or internal loops. Second, the confidence is a relative measurement based on the whole folding ensemble. Only stems with confidence are considered as base pairs below this threshold are unlikely to participate in non-crossing secondary structure formation (55). This cutoff step also significantly reduces runtime for longer sequences. Each stem has two (left and right) endpoints and and is represented as an interval on the line. First, the sorted endpoints list for all stems is constructed and scanned from left to right. If a right endpoint is discovered, a candidate list of stems nested in the interval is returned for the corresponding stem . A MWIS calculation on the candidate list returns the set of internal stems with maximum weight in linear time (56). The outer stem can, therefore, become an interrupted stem with bulges and internal loops or the exterior stem of a multiloop. The sum of confidences serves as the updated weight for outer stem . After the whole endpoints list has been scanned, stems interrupted by bulge loops or internal loops are stored in a new dictionary . Multiloops are stored in a separate dictionary . As a last step, the free energies of the interrupted stems and multiloops are evaluated with the tool RNAeval (11). Only structure elements with negative free energy are stored.So far, DotKnot derived a set of regular stems, interrupted stems and multiloops in a heuristic fashion from the probability dot plot (Figure 2). There is no guarantee to find the structure element with local minimum free energy for a given sequence stretch. However, dictionaries and will contain structures with free energy value close or equal to the local minimum free energy. These elements will be of great benefit for the elimination of false positive pseudoknots.
Construction of candidate pseudoknots
One major advantage of the detection approach is that the pseudoknot prediction target class is predefined and transparent. In contrast, dynamic programming algorithms construct pseudoknots through their recursion scheme, which in some cases leads to unspecific pseudoknot target classes (24,57). In this work, recursive H-type pseudoknots will be considered similar to the class of pseudoknots predicted by pknotsRG (26). A recursive H-type pseudoknot has two crossing stems and , resulting in three loops , and . All of the three loops are allowed to form internal secondary structures; however, loop–loop interactions are not allowed. In this work, we also include pseudoknots where one of the stems or is interrupted by bulges or internal loops. This leads to a more comprehensive prediction class as in pknotsRG, where only bulges of size 1 nt are considered (26). In general, the three dictionaries , and allow for the construction of complicated pseudoknot folds. For example, it is straightforward to construct kissing hairpins from the stem dictionary. The pseudoknot energy parameters become the bottleneck in this approach, not necessarily the complexity of pseudoknots.The three main steps for constructing recursive H-type pseudoknots are shown in Figure 4. First, so-called core H-type pseudoknots form through simple combination of two crossing stems. They become recursive H-type pseudoknots when additional secondary structure elements fold in each of the three loops. Note that recursive pseudoknots are not allowed in the loops as this may lead to sterically infeasible configurations. The overall recursive H-type candidate pseudoknot is assembled in a third step.
Figure 4.
Construction of a recursive H-type pseudoknot. On the first level, two stems form a core H-type pseudoknot. On the second level, recursive secondary structure elements may form in each of the three loops. On the third level, the recursive H-type pseudoknot is assembled.
Construction of a recursive H-type pseudoknot. On the first level, two stems form a core H-type pseudoknot. On the second level, recursive secondary structure elements may form in each of the three loops. On the third level, the recursive H-type pseudoknot is assembled.
First level: assembling core H-type pseudoknots
On the first level, two crossing stems are combined to form a core H-type pseudoknot (Figure 4). Core H-type pseudoknots are the building blocks for more complex pseudoknots. The pseudoknot stems can either be regular stems taken from the dictionary or interrupted stems from the dictionary (Figure 5). Here, we only allow pseudoknots with at most one interrupted stem because for more complex and less rigid pseudoknots we would have to employ a highly assumptive energy model. Certain loop length restrictions are applied because more meaningful results can be expected from prediction of shorter and well-studied pseudoknots. Loop and are both required to have at least 1 and 2 nt, respectively. Interhelix loop can have a size of 0 nt. All three loops are restricted to a maximum length, as there are no reliable energy parameters for very long pseudoknots. Loops and can have a length of up to 100 nt, whereas interhelix loop is restricted to a maximum length of 50 nt. During construction of the pseudoknot candidates, we discovered that crossing stems may compete for a base pair. This leads to an overlap at loop . In such a case, one of the stems is truncated according to certain rules (Supplementary Figure S2 and Supplementary Algorithm 2).
Figure 5.
On the first level of pseudoknot construction two stems form a crossing structure. (a) Two regular stems and are crossing. (b) Stem and stem are crossing. (c) Stem and stem are crossing.
On the first level of pseudoknot construction two stems form a crossing structure. (a) Two regular stems and are crossing. (b) Stem and stem are crossing. (c) Stem and stem are crossing.There is little knowledge about the 3D folding of complex pseudoknots with interrupted stems, which may lead to bending or distortions of the RNA A-helix. Loops and need to cross the major and minor groove of stems and , respectively. To disallow sterically infeasible configurations for long interrupted stems, we make the following assumptions (41). For interrupted stems with more than 10 bp (including bulges and internal loops), the loops bridging the stems must have a minimum length: loop must be nt and loop must be nt.After the first level of pseudoknot construction, free energy values are evaluated for each core H-type pseudoknot, which allows to filter unlikely pseudoknots. Only pseudoknots with low free energy (and therefore likely to form) will remain for the next step, which involves recursive secondary structure formation in the three pseudoknot loops. For all core H-type pseudoknots , free energy is calculated as
where is the purely entropic free energy for loops , and . Three different pseudoknot energy models are employed for the loop entropy calculation, each of that comprise pseudoknots with certain characteristics. The length of loop determines which energy model is used for the respective pseudoknot candidate (Table 1). For details on pseudoknot loop entropy calculation using the virtual bond models CC06 and CC09 see (42) and (48), respectively. For pseudoknots with long interhelix loop , there is no physical loop entropy model available and we have to employ heuristics (24–26). This heuristic energy model (LongPK) is also used for pseudoknots with one interrupted stem regardless of the length of loop . Stems interrupted by long bulge or internal loops are likely to result in bending rather than rigid formations (48). Therefore, the loop entropy calculation becomes intricate. We calculate the loop entropy as , where is the number of unpaired nucleotides in the three pseudoknot loops with kcal/mol and kcal/mol.
Table 1.
Three different energy models used for pseudoknot energy evaluation
ID
Loop
Stems ,
Entropy
CC06
regular
Virtual bond model
CC09
regular
Virtual bond model
LongPK
regular
Heuristic model
interrupted
Heuristic model
Three different energy models used for pseudoknot energy evaluationFor pseudoknots with regular stems and loop length of 0 or 1 nt, one can generally assume that the two pseudoknot stems are coaxially stacked. This leads to a stabilizing effect for the two base pairs at the interhelix junction. Here, coaxial stacking is calculated using the Turner energy model, multiplied by an estimated weighting parameter and added to the free energy of a pseudoknot (7,8,24). For pseudoknots with interrupted stems and absent loop , we also add the appropriate coaxial stacking energy multiplied by an estimated weighting parameter . After energy evaluation, only pseudoknots with negative free energy are stored in the pseudoknot dictionary . Additionally, we demand that the free energy of a core H-type pseudoknot needs to be lower than the free energies and of the pseudoknot stems and .
Second level: recursive structure formation
Secondary structure elements often form in pseudoknot loops, resulting in a recursive pseudoknot. After finding stable core H-type pseudoknots, the three loops , and are examined for likely secondary structure elements. It is a valid assumption that the three loops form recursive elements independently and can be treated separately (Figure 4). From an algorithmic point of view, it is efficient to find recursive structure elements using a MWIS calculation (Supplementary Algorithm 3). Given a core H-type pseudoknot, three candidate lists hold all possible secondary structure elements from dictionaries , and contained in each of the loops , and . A standard MWIS calculation with free energy weights for each of the three lists returns the set of secondary structure elements with best local free energy for each loop. The results are combined to form a recursive H-type pseudoknot.For each recursive H-type pseudoknot, we need to evaluate the free energy with respect to the recursive structure elements. As described in Table 1, the set of pseudoknots is divided into the three different classes according to the length of loop . To account for recursive structure elements, the loop entropies need to be recalculated. Following the notation in Cao and Chen (48), we first calculate the effective loop lengths. The effective loop length for a pseudoknot loop () with internal structure elements is the number of unpaired nucleotides outside those internal structure elements plus the number of internal structure elements. For all pseudoknots with recursive structure elements, the loop entropy is recalculated using the effective loop lengths. Internal secondary structure elements add free energy values as given by the Turner energy model (8). We keep only pseudoknots in the pseudoknot dictionary , which satisfy the following two conditions. First, a recursive H-type pseudoknot must have free energy kcal/mol. Second, the normalized pseudoknot free energy must fulfill where denotes the length of pseudoknot . Setting the threshold to is not too restrictive; however, it helps us to eliminate pseudoknots with high free energy (58). For example, for the 3′-UTR TMV sequence with length 214 nt, we have 445 candidate stems in dictionary and 2026 pseudoknot candidates before filtering. After the length-normalized filtering step, only 274 pseudoknot candidates remain.
Verification of pseudoknot candidates in the sequence
The set of recursive H-type pseudoknots stored in dictionary will certainly contain false positive pseudoknots. Therefore, a MWIS calculation with local free energy weights is employed using the structure elements from all four dictionaries: , , and . Stems need to have confidence to participate (55). Furthermore, stems are allowed to contain nested structure elements, including pseudoknots. The MWIS procedure with nesting is described in detail in KnotSeeker (39). Note that we do not include the free energy gain of kcal/mol for the outer loop as in KnotSeeker, because favourable nested structures are already included through dictionaries and .
RESULTS
We evaluated our algorithm on a set of pseudoknotted and pseudoknot-free sequences of different RNA types (Table 2). Given a sequence, DotKnot is a method that predicts only pseudoknots. Therefore, predictive accuracy is measured for base pairs belonging to a pseudoknot. Two measurements are used for comparison of DotKnot and other selected algorithms from the literature. For each published pseudoknot in a sequence we report sensitivity and the positive predictive value . True positive () corresponds to the number of correctly predicted base pairs in the predicted pseudoknot, False negative () to the number of base pairs in the published pseudoknot that were not predicted and False positive () to the number of incorrectly predicted base pairs in the predicted pseudoknot. A pseudoknot is said to be predicted by an algorithm if it is a crossing structure element and at least one of the two pseudoknot stems is partially predicted. Furthermore, the ratio (number of correctly predicted pseudoknots)/(number of predicted pseudoknots) is reported. We compare DotKnot to two dynamic programming methods, namely pknots (24) and pknotsRG (26), the pseudoknot detection tool KnotSeeker (39) and the heuristic approach HotKnots (35). HotKnots returns a number of sub-optimal scenarios; however, we only evaluate predictive accuracy for the best solution.
Table 2.
RNA types and sequences used for pseudoknot prediction
Type of RNA
Sequence ID
Reference
5S rRNA
5SColi, 5SDmobilis
(59)
tRNA
DA0260, DC0010, DY4441
(60)
miRNA
Dros-mel, ath-MIR156a, Human-mi
(61)
Ribozymes
HDV, HDVanti
(62,63)
IRES
CrPV
(64)
3′-UTR
BCV, MHV, NeRNV, TMV
(65–67)
tmRNA
EColi-tmRNA, LP-tmRNA
(68)
Aptamer
HIV1-1.3a
(69)
Viral tRNA-like
LRSVbeta, TYMV
(70,71)
Frameshifting
BWYV, JEV, MEV, VMV, SARS-CoV
(72–75)
Telomerase
Human-telo, Tetra-telo
(76)
5′-UTR
HPeV1
(77)
mRNA
T2, T4
(78)
Note that 5S rRNA, tRNA and miRNA are pseudoknot-free.
RNA types and sequences used for pseudoknot predictionNote that 5S rRNA, tRNA and miRNA are pseudoknot-free.Summary of pseudoknot detection results on various RNA sequencesFor each pseudoknot, the best results in terms of both sensitivity S and positive predictive value PPV are marked in bold. The * symbol indicates that we were not able to run the algorithm due to the high time and space requirements. PK corresponds to the number of pseudoknots in the sequence as reported in the literature. We use pknots 1.05 with coaxial energies, pknotsRG 1.3 and HotKnots 1.2 without suboptimal solutions.5S rRNA, tRNA and miRNA are pseudoknot-free types of RNA. For the 5S rRNA and miRNA sequences chosen in our test set, DotKnot does not introduce any spurious pseudoknots. For two of the tRNA sequences, DotKnot predicts false positive pseudoknots. KnotSeeker, pknotsRG and HotKnots also predict false positive pseudoknots for some of the tRNA sequences. Minimum free energy prediction is known to have low accuracy for tRNAs. This is due to modified bases as well as coaxially stacked helices, which determine the characteristic 3D cloverleaf fold of tRNAs (24,26). Coaxial energies are implemented by pknots, which might explain why it does not predict false positive pseudoknots for the tested tRNA sequences.Several of the test sequences contain pseudoknots where one of the core pseudoknot stems is interrupted by bulges or internal loops: Escherichia coli tmRNA, Legionella pneumophila tmRNA, the SARS frameshifting pseudoknot, human telomerase and Tetrahymena telomerase. For all of these pseudoknots, DotKnot delivers the best results in terms of sensitivity and specificity. For example, the Tetrahymena telomerase RNA (TER) contains a pseudoknot with a conserved central GA bulge in one of its stems, which is vital for telomerase function (76). DotKnot perfectly predicts this bulged pseudoknot, while all other methods do not predict a pseudoknot structure. The biological relevance of pseudoknots with bulged residues should not be underestimated. Therefore, a prediction algorithm that can handle pseudoknots with interrupted stems such as DotKnot is highly desirable.For complex pseudoknot foldings such as the hepatitis delta virus (HDV) double pseudoknot configuration, DotKnot gives the best prediction results out of all tested algorithms. The CrPV IRES has a long pseudoknot, which contains another nested pseudoknot. For this pseudoknot, DotKnot also gives the best prediction in terms of sensitivity and PPV.DotKnot has excellent accuracy for simple H-type pseudoknots as those reported in many viral 3′-UTRs and frameshifting regions. We found that the energy models CC06 and CC09 used for predicting such pseudoknots give better results than the heuristic pseudoknot energy parameters employed by the other algorithms. For example, the NeRNV and TMV 3′-UTRs both have five pseudoknots where four are simple H-type pseudoknots with interhelix loop nt. For these pseudoknots, DotKnot gives the most accurate predictions, which we claim is due to the improved energy parameters by Cao and Chen (42,48).In terms of computational performance, DotKnot is very efficient due to the sparseness of the probability dot plot, the resulting low number of pseudoknot candidates and the implementation using dictionaries in Python. DotKnot runs in the order of seconds for all of the test sequences except T2 and T4, which take several minutes. For T4 with 1340 nt, we have 6567 candidate stems in dictionary and 7534 pseudoknot candidates before filtering. After the length-normalized filtering step, only 100 pseudoknot candidates remain for verification. Overall, it takes DotKnot <5 min to predict the correct pseudoknot in this sequence on our reference machine (Intel QC 2.66 GHz, 4 GB RAM). This is significantly faster than HotKnots, which takes 29 min, and pknotsRG, which takes 31 min. KnotSeeker is even faster than DotKnot and takes <2 min for the T4 sequence, because it does not rely on a partition function calculation. However, DotKnot is a more powerful prediction algorithm than KnotSeeker due to the inclusion of pseudoknots with one interrupted stem.
CONCLUSION
We presented DotKnot, a program that detects recursive H-type pseudoknots given an RNA sequence. Pseudoknot detection is a promising and efficient approach for determining the folding of an RNA. Using pseudoknot detection tools such as DotKnot, KnotSeeker or HPknotter, one can find likely pseudoknots in a sequence with high accuracy (37,39). The structure of the detected pseudoknots can subsequently be investigated using laboratory or bioinformatics techniques. The remaining non-crossing sequence can be folded using secondary structure prediction algorithms in time and space. DotKnot and other pseudoknot detection approaches are very time efficient, even allowing scanning of long regions in viral genomes.DotKnot assembles pseudoknot candidates from a set of structural building blocks. This set contains stems, bulge loops, internal loops and multiloops. In general, complex pseudoknots can be constructed. However, there is a trade-off between the generality of predictable pseudoknots and the biological relevance of the result. Therefore, we restrict DotKnot to the prediction of recursive H-type pseudoknots where one of the pseudoknot stems can contain bulges and internal loops. For these recursive H-type pseudoknots, we are confident that the pseudoknot energy parameters used give a good approximation. For more complex pseudoknot folds such as those with loop–loop interactions, we would have to employ a highly assumptive energy model, thus sacrificing predictive accuracy. In the future, DotKnot will be extended to the prediction of kissing hairpins and other biologically relevant classes of pseudoknots.DotKnot uses stack probabilities from the probability dot plot as the basis for finding pseudoknot building blocks. At this stage, only a single sequence is accepted as an input. It is well-known that functional pseudoknots are highly conserved in nature, for example in virus families. The pseudoknot detection framework used in DotKnot allows for incorporation of comparative information using aligned probability dot plots. One can expect that reliable alignments will greatly improve confidence in the predicted pseudoknots. Especially for complex pseudoknot foldings such as those found in bacterial tmRNA or in the telomerase RNA component, inclusion of comparative information will be invaluable.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Funding for open access charge: The University of Western Australia.
Table 3.
Summary of pseudoknot detection results on various RNA sequences
Sequence
DotKnot
KnotSeeker
pknotsRG
pknots
HotKnots
ID
nt
PK
S
PPV
r
S
PPV
r
S
PPV
r
S
PPV
r
S
PPV
r
5SColi
120
0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
5SDmobilis
133
0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
DA0260
75
0
–
–
0/1
–
–
0/1
–
–
0/0
–
–
0/0
–
–
0/1
DC0010
73
0
–
–
0/0
–
–
0/1
–
–
0/0
–
–
0/0
–
–
0/0
DY4441
73
0
–
–
0/2
–
–
0/0
–
–
0/1
–
–
0/0
–
–
0/0
Dros-mel
81
0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
ath-MIR156a
123
0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
–
–
0/0
Human-mi
110
0
–
–
0/0
–
–
0/1
–
–
0/0
–
–
0/1
–
–
0/0
HDV
87
1
93.8
100
1/1
90.6
93.6
1/1
90.6
93.6
1/1
81.3
81.3
1/1
0
0
0/0
HDVanti
91
1
100
96.1
1/1
84
75
1/1
0
0
0/0
44
33.4
1/1
0
0
0/0
CrPV
190
2
52.3
57.5
2/2
20.5
32.1
2/2
0
0
0/0
*
*
*
0
0
0/0
78.6
73.4
78.6
73.4
0
0
*
*
*
0
0
BCV
345
1
100
81.2
1/1
100
90
1/1
100
90
1/1
*
*
*
0
0
0/0
MHV
315
1
100
81.2
1/1
100
90
1/2
100
90
1/3
*
*
*
0
0
0/0
NeRNV
287
5
100
100
4/7
100
88.9
4/4
0
0
1/2
*
*
*
0
0
0/0
77.8
100
77.8
87.4
0
0
*
*
*
0
0
90
100
90
100
0
0
*
*
*
0
0
0
0
0
0
0
0
*
*
*
0
0
100
100
100
100
100
100
*
*
*
0
0
TMV
214
5
100
100
5/5
77.8
87.5
5/5
0
0
0/0
*
*
*
0
0
0/0
100
100
81.8
90
0
0
*
*
*
0
0
88.9
100
88.9
100
0
0
*
*
*
0
0
95.7
86.9
100
95.8
0
0
*
*
*
0
0
100
100
100
100
0
0
*
*
*
0
0
EColi-tmRNA
363
4
100
100
3/5
100
100
2/3
0
0
0/0
*
*
*
0
0
0/0
100
100
0
0
0
0
*
*
*
0
0
0
0
28.6
40
0
0
*
*
*
0
0
68.4
81.3
0
0
0
0
*
*
*
0
0
LP-tmRNA
362
4
90
90
3/8
0
0
0/2
0
0
0/0
*
*
*
0
0
0/0
0
0
0
0
0
0
*
*
*
0
0
25
26.7
0
0
0
0
*
*
*
0
0
70.6
100
0
0
0
0
*
*
*
0
0
HIV1-1.3a
37
1
100
100
1/1
100
100
1/1
100
100
1/1
100
100
1/1
100
100
1/1
LRSVbeta
221
1
85.3
87.9
1/1
14.7
62.5
1/1
0
0
0/0
*
*
*
0
0
0/0
TYMV
85
2
0
0
1/2
0
0
1/1
0
0
1/1
0
0
1/1
0
0
0/0
62.5
55.6
100
80
62.5
50
100
88.9
0
0
BWYV
69
1
50
57.1
1/1
37.5
50
1/1
0
0
0/1
0
0
0/0
100
88.9
1/1
JEV
138
1
100
78.3
1/3
100
90
1/2
0
0
0/0
0
0
0/1
0
0
0/0
MEV
138
1
100
72
1/2
0
0
0/1
0
0
0/1
0
0
0/0
0
0
0/0
VMV
79
1
100
82.3
1/1
100
82.3
1/1
0
0
0/0
100
60.9
1/1
0
0
0/0
SARS-CoV
244
1
92.3
100
1/2
92.3
92.3
1/3
38.5
66.7
1/1
*
*
*
38.5
55.6
1/1
Human-telo
210
1
61.3
51.3
1/1
35.5
35.5
1/2
0
0
0/1
*
*
*
0
0
0/0
Tetra-telo
159
1
100
100
1/1
0
0
0/0
0
0
0/0
0
0
0/0
0
0
0/0
HPeV1
709
1
54.5
54.5
1/7
100
100
1/8
54.5
54.5
1/4
*
*
*
0
0
0/0
T2
946
1
100
100
1/9
100
100
1/5
100
100
1/1
*
*
*
0
0
0/0
T4
1340
1
100
100
1/10
100
100
1/8
0
0
0/1
*
*
*
0
0
0/0
For each pseudoknot, the best results in terms of both sensitivity S and positive predictive value PPV are marked in bold. The * symbol indicates that we were not able to run the algorithm due to the high time and space requirements. PK corresponds to the number of pseudoknots in the sequence as reported in the literature. We use pknots 1.05 with coaxial energies, pknotsRG 1.3 and HotKnots 1.2 without suboptimal solutions.
Authors: Jamie J Cannone; Sankar Subramanian; Murray N Schnare; James R Collett; Lisa M D'Souza; Yushi Du; Brian Feng; Nan Lin; Lakshmi V Madabusi; Kirsten M Müller; Nupur Pande; Zhidi Shang; Nan Yu; Robin R Gutell Journal: BMC Bioinformatics Date: 2002-01-17 Impact factor: 3.169
Authors: Amin Espah Borujeni; Dennis M Mishler; Jingzhi Wang; Walker Huso; Howard M Salis Journal: Nucleic Acids Res Date: 2015-11-30 Impact factor: 16.971