Tomasz Puton, Lukasz P Kozlowski, Kristian M Rother, Janusz M Bujnicki.
Abstract
We present a continuous benchmarking approach, implemented in the CompaRNA web server, for the assessment of RNA secondary structure prediction methods. As of 3 October 2012, the performance of 28 single-sequence and 13 comparative methods has been evaluated on RNA sequences/structures released weekly by the Protein Data Bank. We also provide a static benchmark generated on RNA 2D structures derived from the RNAstrand database. Benchmarks on both data sets offer insight into the relative performance of RNA secondary structure prediction methods on RNAs of different sizes and with different types of structure. According to our tests, the most accurate predictions obtained by a comparative approach are, on average, generated by CentroidAlifold, MXScarna, RNAalifold and TurboFold. The most accurate predictions obtained by single-sequence analyses are, on average, generated by CentroidFold, ContextFold and IPknot. The best comparative methods typically outperform the best single-sequence methods if an alignment of homologous RNA sequences is available. This article presents the results of our benchmarks as of 3 October 2012, whereas the rankings presented online are continuously updated. We will gladly include new prediction methods and new measures of accuracy in future editions of the CompaRNA benchmarks.
Year: 2013 PMID: 23435231 PMCID: PMC3627593 DOI: 10.1093/nar/gkt101
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
List of 28 single-sequence methods analyzed in the CompaRNA benchmarks
| Name | Description | Availability | Predicts pseudoknots | Reference |
|---|---|---|---|---|
| Afold | Evaluates internal loops of RNA secondary structure with optimized nearest-neighbor model energy functions. | Local installation | No | ( |
| Alterna | Dynamic programming algorithm that minimizes the energy density sum and free energy of an RNA structure. | Web server | No | ( |
| CentroidFold | Uses generalized centroid estimators that maximize the expected weighted true predictions of base pairs in the predicted structure. | Local installation | No | ( |
| CentroidHomfold-LAST | An upgraded version of CentroidHomfold that uses additional homologous sequences collected automatically by the LAST program ( | Web server | No | ( |
| ContextFold | Uses rich parameterized machine-learning models (>70 000 free parameters). | Local installation | No | ( |
| Contrafold | Uses conditional log-linear models (CLLMs), a flexible class of probabilistic models that generalize on stochastic context-free grammars (SCFGs) by using discriminative training and feature-rich scoring. | Local installation | No | ( |
| CRWrnafold | A new version of RNAfold that uses statistical potentials derived from comparative data. | Web server | No | ( |
| Cylofold | Simulates the folding process in a coarse-grained manner by choosing helices based on established energy rules. The steric feasibility of the chosen set of helices is checked during the folding process using a coarse-grained 3D model of the RNA structures. | Web server | Yes | ( |
| Fold | A program from the RNAstructure package for single sequence secondary structure prediction by free-energy minimization. | Local installation | No | ( |
| HotKnots | A heuristic algorithm that iteratively forms stable stems using a free-energy minimization criterion to identify promising candidate stems. | Local installation | Yes | ( |
| IPknot | Predicts the maximum expected accuracy (MEA) structure using integer programming with a threshold cut. | Local installation | Yes | ( |
| MaxExpect | A program from the RNAstructure package for secondary structure prediction by maximizing expected accuracy. | Local installation | No | ( |
| MC-Fold | Uses a nucleotide cyclic motif (NCM) fusion process to generate a pool of secondary structures, from which the final prediction is selected. | Local installation | Yes | ( |
| McQFold | Markov Chain Monte Carlo (MCMC) sampling of secondary structures with pseudoknots. | Local installation | Yes | ( |
| NanoFolder | Predicts the base pairing of potentially pseudoknotted multistrand RNA nanostructures. | Local installation | Yes | ( |
| Pknots | A dynamic programming algorithm for ‘optimal’ RNA pseudoknot prediction. | Local installation | Yes | ( |
| PknotsRG | Uses the same model as Pknots, but additionally uses the Turner energy rules for finding the minimum free-energy structure. Dedicated to pseudoknot prediction. | Local installation | Yes | ( |
| ProbKnot | A program from the RNAstructure package for fast prediction of RNA secondary structure including pseudoknots. Assembles maximum expected accuracy structures from computed base pairing probabilities. | Local installation | Yes | ( |
| RDfolder | RNA folding by energy-weighted Monte Carlo simulations. | Web server | No | ( |
| RNAfold | RNA structure prediction program that comes with the Vienna package. Predicts MFE structures and base pair probabilities based on the dynamic programming algorithm originally developed by Zuker and Stiegler ( | Local installation | No | ( |
| RNASLOpt | Predicts stable locally optimal secondary structures represented by stack configurations. | Local installation | No | ( |
| RNAshapes | Unique suboptimal structures (shapes) are selected based on an abstract representation of RNA secondary structure, which is inspired by the dot bracket representation known from the Vienna RNA package. The user can choose from five different types of shape resolution corresponding to different abstraction levels. | Local installation | No | ( |
| RNAsubopt | Calculates all suboptimal secondary structures within a user-defined energy range above the MFE. | Local installation | No | ( |
| RNAwolf | Predicts an extended structure (including non-canonical base pairs and structures composed of two-diagrams). The allowed base pairs can contain all 4 × 4 nt, and the nucleotide bonds are explicitly annotated with the paired edges and isostericity information. | Local installation | No | ( |
| Sfold | Statistical sampling of all possible structures. The sampling is weighted by partition function probabilities. | Local installation | No | ( |
| UNAFold | An integrated collection of programs that simulate folding, hybridization and melting pathways for one or two single-stranded nucleic acid sequences. Folding (secondary structure) prediction for single-stranded RNA or deoxyribonucleic acid (DNA) combines free-energy minimization, partition function calculations and stochastic sampling. | Local installation | No | ( |
| Vsfold4 | Uses dinucleotide pairing energies for short-range interactions and, for long-range entropy interactions, an entropy-loss model based on the accumulated sum of the entropy of bonding between each base pair, inversely weighted by the correlation of the RNA sequence (the Kuhn length). | Web server | No | ( |
| Vsfold5 | An upgraded version of Vsfold4 capable of predicting pseudoknots. | Web server | Yes | ( |
a Pseudoknot prediction was deliberately disabled in our evaluation of MC-Fold because of its long runtime in the ‘pseudoknotted mode’.
A list of 13 comparative methods used in the CompaRNA benchmarks
| Name | Description | Requires alignment as input | Reference |
|---|---|---|---|
| Carnac | Combines three features: energy minimization, phylogenetic comparison and sequence conservation to predict an RNA secondary structure. | No | ( |
| CentroidAlifold | An extension of the CentroidFold program that takes multiple sequences as input. | Yes | ( |
| CMfinder | An RNA motif prediction tool. It is reported to perform well on unaligned sequences with long flanking regions, and in cases when the motif is only present in a subset of sequences. It is an expectation maximization algorithm that uses covariance models for motif description, heuristics for effective motif search and a Bayesian method for structure prediction combining folding energy and sequence covariation. | No | ( |
| Mastr | Uses an MCMC sampling approach in a simulated annealing framework, where both structure and alignment are optimized by making small local changes. The score combines the log-likelihood of the alignment, a covariation term and the base pair probabilities. | No | ( |
| Multilign | Finds the lowest free-energy secondary structure common to more than two homologous sequences. Uses multiple iterations of Dynalign ( | No | ( |
| Murlet | A variant of the Sankoff algorithm ( | No | ( |
| MXScarna | Performs fast structural multiple alignment of RNA sequences using a progressive alignment based on the pairwise structural alignment algorithm of SCARNA. | No | ( |
| PETFold | Predicts the consensus RNA secondary structure from an RNA alignment. | No | ( |
| PPfold | A new version of Pfold ( | Yes | ( |
| RNAalifold | Computes the minimum free-energy structure that is simultaneously formed by a set of aligned sequences. | Yes | ( |
| RNASampler | A sampling-based program that predicts common RNA secondary structure motifs in a group of related sequences. | No | ( |
| RSpredict | Takes into account sequence covariation and uses effective heuristics for improving accuracy. | Yes | ( |
| TurboFold | The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment–based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. | No | ( |
Except for CMfinder, all these methods were run locally on the CompaRNA server. When run with default options, none of them predicts pseudoknots.
Summary of the number of different base pair types in the PDB and RNAstrand data sets used for benchmarking RNA 2D prediction methods
| Base pair | PDB data set: ‘standard’ base pair counts | PDB data set: ‘extended’ base pair counts | RNAstrand data set: ‘extended’ base pair counts |
|---|---|---|---|
| CG | 2716 | 3023 | 119 146 |
| AU | 957 | 1291 | 69 220 |
| GU | 418 | 541 | 26 525 |
| AG | 0 | 637 | 4606 |
| AA | 0 | 192 | 1502 |
| AC | 0 | 150 | 2352 |
| GG | 0 | 113 | 733 |
| UU | 0 | 85 | 1975 |
| CU | 0 | 73 | 1225 |
| CC | 0 | 41 | 553 |
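A tally like the one in the table can be sketched in a few lines of Python. This is only an illustration: the function names and the 0-based `(i, j)` pair format are assumptions, and the sketch classifies pairs by nucleotide identity alone. The table suggests that the ‘standard’ definition is stricter than that, since the CG, AU and GU counts also differ between the two PDB columns.

```python
from collections import Counter

# Pair types that have non-zero counts under the 'standard' definition in the table.
STANDARD_TYPES = {"CG", "AU", "GU"}


def base_pair_type(a, b):
    """Return an order-independent label such as 'CG' or 'AU' for two nucleotides."""
    return "".join(sorted((a.upper(), b.upper())))


def count_base_pair_types(sequence, pairs):
    """Count base pair types for a sequence and a list of 0-based (i, j) pairs."""
    return Counter(base_pair_type(sequence[i], sequence[j]) for i, j in pairs)


def split_by_identity(counts):
    """Split counts into CG/AU/GU pairs and all other nucleotide combinations."""
    canonical = {t: n for t, n in counts.items() if t in STANDARD_TYPES}
    other = {t: n for t, n in counts.items() if t not in STANDARD_TYPES}
    return canonical, other


if __name__ == "__main__":
    seq = "GGGAAAUCU"  # hypothetical toy sequence
    toy_pairs = [(0, 8), (1, 7), (2, 6), (3, 5)]
    counts = count_base_pair_types(seq, toy_pairs)
    print(split_by_identity(counts))
```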
Data sets used for benchmarking methods predicting RNA secondary structure
| Source | Data set name | Type of RNAs | Sequence length | Number of sequences |
|---|---|---|---|---|
| PDB | All RNAs, standard base pair definition | All | ≥20 | 121 |
| PDB | All RNAs, extended base pair definition | All | ≥20 | 121 |
| PDB | Only pseudoknotted RNAs, standard base pair definition | Pseudoknotted | ≥20 | 33 |
| PDB | Only pseudoknotted RNAs, extended base pair definition | Pseudoknotted | ≥20 | 62 |
| RNAstrand | All RNAs | All | ≥20 | 1987 |
| RNAstrand | All short RNAs | All | 21–200 | 869 |
| RNAstrand | All medium-sized RNAs | All | 201–800 | 818 |
| RNAstrand | All long RNAs | All | >800 | 287 |
| RNAstrand | Pseudoknotted RNAs | Pseudoknotted | ≥20 | 919 |
| RNAstrand | Pseudoknotted-short RNAs | Pseudoknotted | 21–200 | 53 |
| RNAstrand | Pseudoknotted medium-sized RNAs | Pseudoknotted | 201–800 | 610 |
| RNAstrand | Pseudoknotted long RNAs | Pseudoknotted | >800 | 256 |
Equation 1. Formula for calculating sensitivity. TP = number of true-positive base pairs; FN = number of false-negative base pairs.
Equation 2. Formula for calculating PPV. TP = number of true-positive base pairs; FP = number of false-positive base pairs; ε = number of compatible false-positive base pairs.
Equation 3. Formula for calculating the MCC. TP = number of true-positive base pairs; FP = number of false-positive base pairs; ε = number of compatible false positives; TN = number of true negatives; FN = number of false negatives.
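The Python sketch below shows one way the quantities defined for Equations 1 to 3 can be combined. It is an illustration, not the CompaRNA implementation. It assumes the common convention that compatible false positives are not penalized, so FP − ε replaces FP: sensitivity = TP / (TP + FN), PPV = TP / (TP + (FP − ε)), and MCC = (TP·TN − (FP − ε)·FN) / √((TP + (FP − ε))(TP + FN)(TN + (FP − ε))(TN + FN)). Three further choices are assumptions made for this sketch: the name `score_prediction`, the rule that a false positive is ‘compatible’ when neither of its nucleotides is paired in the reference and it crosses no reference pair, and TN counted as all remaining possible pairs.

```python
from math import sqrt


def crosses(p, q):
    """True if pairs p = (i, j) and q = (k, l) interleave, i.e. are not nested."""
    (i, j), (k, l) = p, q
    return i < k < j < l or k < i < l < j


def score_prediction(reference, predicted, length):
    """Compare predicted base pairs with reference pairs (0-based index tuples)."""
    ref = {tuple(sorted(bp)) for bp in reference}
    pred = {tuple(sorted(bp)) for bp in predicted}

    tp = len(ref & pred)
    fn = len(ref - pred)
    false_pos = pred - ref
    fp = len(false_pos)

    # Assumed compatibility rule: no shared nucleotide and no crossing
    # with any reference pair.
    paired_in_ref = {pos for bp in ref for pos in bp}
    eps = sum(
        1
        for bp in false_pos
        if bp[0] not in paired_in_ref
        and bp[1] not in paired_in_ref
        and not any(crosses(bp, r) for r in ref)
    )

    # Assumed TN: every possible pair that is neither predicted nor in the reference.
    tn = length * (length - 1) // 2 - tp - fp - fn
    fp_eff = fp - eps  # false positives that are actually penalized

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp_eff) if tp + fp_eff else 0.0
    denom = sqrt((tp + fp_eff) * (tp + fn) * (tn + fp_eff) * (tn + fn))
    mcc = (tp * tn - fp_eff * fn) / denom if denom else 0.0

    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "eps": eps,
            "sensitivity": sensitivity, "PPV": ppv, "MCC": mcc}


if __name__ == "__main__":
    reference = [(0, 9), (1, 8), (2, 7)]
    predicted = [(0, 9), (1, 8), (3, 6), (2, 12)]
    print(score_prediction(reference, predicted, length=14))
```

In the toy example, the pair (3, 6) is nested inside the reference helix and touches no reference nucleotide, so it counts towards ε, whereas (2, 12) reuses a nucleotide that is paired in the reference and is penalized.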
Figure 1. Assignment of RNAs from the PDB (A) and RNAstrand (B) data sets to specific Rfam families. Both charts show the numbers of RNAs per Rfam family among the sequences to which CompaRNA assigned a family: 48 sequences in the PDB data set and 1242 sequences in the RNAstrand data set. The names on the charts are the Rfam identifiers of the following families: tRNA = transfer RNA; tmRNA = transfer-messenger RNA; RNaseP_bact_a = bacterial RNase P class A; Bacteria_small_SRP = bacterial small signal recognition particle RNA; 5S_rRNA = 5S ribosomal RNA; SSU_rRNA_bacteria = bacterial small subunit ribosomal RNA; Metazoa_SRP = metazoan signal recognition particle RNA; SSU_rRNA_eukarya = eukaryotic small subunit ribosomal RNA; Hammerhead_1 = hammerhead ribozyme (type I); K10_TLS = K10 transport/localization element (TLS); Purine = purine riboswitch; SAM = SAM riboswitch (S box leader); 5_8S_rRNA = 5.8S ribosomal RNA; crcB = crcB RNA; Gammaretro_CES = gammaretrovirus core encapsidation signal.
Best methods according to rankings on the PDB data set
| Ranking type | First rank | Second rank | Third rank |
|---|---|---|---|
| All RNAs | |||
| Std | MXScarna(seed) (W: 38, L: 3, NW: 12) | CentroidAlifold(20) (W: 36, L: 0, NW: 17) | CentroidFold (W: 36, L: 8, NW: 9) |
| Ext | MXScarna(seed) (W: 38, L: 2, NW: 13) | CentroidFold (W: 37, L: 7, NW: 9) | CentroidAlifold(20) (W: 36, L: 0, NW: 17) |
| Pseudoknotted RNAs | |||
| Std | CentroidAlifold(20) (W: 33, L: 0, NW: 20) | RNAalifold(20) (W: 32, L: 1, NW: 20) | CentroidAlifold(seed) and MXScarna(seed) (W: 31, L: 2, NW: 20) |
| Ext | MXScarna(seed) (W: 39, L: 1, NW: 13) | CentroidAlifold(20) (W: 35, L: 0, NW: 18) | RNAalifold(20) (W: 33, L: 2, NW: 18) |
Std = standard base pair definition; Ext = extended base pair definition (see ‘Materials and Methods’ section); W = number of wins; L = number of defeats; NW = number of cases in which it was impossible to select a winner; (20) = test of a comparative method in which 20 representatives of an Rfam seed alignment were used; (seed) = test in which all sequences from a given seed alignment were used.
Figure 2. Comparison of the top-performing methods for predicting RNA secondary structure in a ranking generated on all RNAs extracted from the PDB data set. A plus sign means that the method in the left column scored higher in the pairwise comparison. A minus sign means that the method in the left column scored lower. An equals sign denotes a draw, i.e. both methods generated at least 10 predictions for common targets, but the accuracies of their results are statistically indistinguishable (P > 0.001). A question mark means that the two methods could not be compared (<10 predictions for common targets). The numbers in the lower left part of the figure are the numbers of common targets on which both methods were evaluated.
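The decision rule described in the caption, and the W/L/NW counts reported in the ranking tables, can be sketched as follows. This is a hedged illustration rather than the CompaRNA code. The caption does not name the statistical test, so an exact two-sided sign test is used here purely as a stand-in, and the function names and the per-target score dictionaries are hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value (stand-in for the unspecified test)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(scores_a, scores_b, min_common=10, alpha=0.001):
    """Return '+', '-', '=' or '?' for method A versus method B.

    scores_a and scores_b map a target identifier to an accuracy score
    (for example an MCC) obtained by each method on that target.
    """
    common = scores_a.keys() & scores_b.keys()
    if len(common) < min_common:
        return "?"  # too few common targets to compare
    wins = sum(scores_a[t] > scores_b[t] for t in common)
    losses = sum(scores_a[t] < scores_b[t] for t in common)
    if sign_test_p(wins, losses) > alpha:
        return "="  # statistically indistinguishable
    return "+" if wins > losses else "-"


def tally(method, all_scores):
    """Count wins (W), defeats (L) and undecided cases (NW) against all other methods."""
    outcomes = [compare(all_scores[method], scores)
                for name, scores in all_scores.items() if name != method]
    return {"W": outcomes.count("+"),
            "L": outcomes.count("-"),
            "NW": outcomes.count("=") + outcomes.count("?")}


if __name__ == "__main__":
    targets = [f"t{i}" for i in range(12)]
    all_scores = {
        "A": {t: 0.9 for t in targets},  # hypothetical methods and scores
        "B": {t: 0.5 for t in targets},
        "C": {t: 0.9 for t in targets},
    }
    print(tally("A", all_scores))
```

Because the threshold is strict (P ≤ 0.001), two methods need a fairly large number of common targets before a winner can be declared at all. This is consistent with the non-trivial NW counts in the PDB rankings.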
Best methods according to rankings on the RNAstrand data set
| Ranking type | Base pair definition | First rank | Second rank | Third rank |
|---|---|---|---|---|
| All RNAs | ext | TurboFold(seed) (W: 52, L: 1, NW: 0) | TurboFold(20) (W: 51, L: 1, NW: 1) | ContextFold (W: 51, L: 2, NW: 0) |
| Short RNAs (20–200 nt) | ext | ContextFold (W: 53, L: 0, NW: 0) | TurboFold(20) (W: 51, L: 1, NW: 1) | CentroidHomfold-LAST and CentroidAlifold(seed) (W: 50, L: 3, NW: 0) |
| Medium-sized RNAs (201–800 nt) | ext | ContextFold (W: 43, L: 3, NW: 7) | CentroidAlifold(seed) (W: 42, L: 4, NW: 7) | TurboFold(20) (W: 40, L: 1, NW: 12) |
| Long RNAs (801–30 000 nt) | ext | ContextFold (W: 24, L: 0, NW: 29) | CentroidAlifold(seed) and CentroidAlifold(20) (W: 22, L: 1, NW: 30) | RNAalifold(seed) |
| All pseudoknotted RNAs | ext | CentroidAlifold(seed) (W: 46, L: 4, NW: 3) | ContextFold (W: 45, L: 5, NW: 3) | CentroidHomfold-LAST (W: 43, L: 8, NW: 2) |
| Pseudoknotted-short RNAs (20–200 nt) | ext | Cylofold (W: 35, L: 0, NW: 18) | McQFold (W: 35, L: 1, NW: 17) | Pknots (W: 33, L: 2, NW: 18) |
| Pseudoknotted medium-sized RNAs (201–800 nt) | ext | ContextFold (W: 42, L: 0, NW: 11) | TurboFold(20) (W: 39, L: 1, NW: 13) | PPfold(20) (W: 38, L: 2, NW: 13) |
| Pseudoknotted long RNAs (801–30 000 nt) | ext | ContextFold (W: 24, L: 0, NW: 29) | CentroidAlifold(seed) and CentroidAlifold(20) (W: 22, L: 1, NW: 30) | RNAalifold(seed) |
| Robustness test—1987 sequences | ext | ContextFold (W: 53, L: 0, NW: 0) | IPknot (W: 52, L: 1, NW: 0) | Contrafold (W: 51, L: 2, NW: 0) |
| Robustness test—1242 sequences with Rfam family assigned | ext | ContextFold (W: 53, L: 0, NW: 0) | CentroidAlifold(seed) (W: 52, L: 1, NW: 0) | IPknot (W: 51, L: 2, NW: 0) |
a Fourth place.
W = number of wins; L = number of defeats; NW = number of cases in which it was impossible to select a winner; (20) = test of a comparative method in which 20 representatives of an Rfam seed alignment were used; (seed) = test in which all sequences from a given seed alignment were used.
Figure 3. The results of a robustness test on the RNAstrand data set. The number to the right of each bar is the percentage of RNAs for which a given method returned predictions (dark = 1987 RNAs from the RNAstrand data set; light = 1242 RNAs for which CompaRNA assigned an Rfam family). (20) = test of a comparative method in which 20 representatives of an Rfam seed alignment were used; (seed) = test in which all sequences from a given seed alignment were used.
Figure 4. Difference between the performances of MXScarna and CentroidAlifold run on RNAstrand RNAs (all sequences from the seeds versus 20 representatives). The MXScarna variants were tested on a data set consisting of 416 sequences, whereas the CentroidAlifold variants were compared on 402 sequences. For each method, an average MCC with 95% confidence interval is plotted.
Figure 5. Difference in performance of MXScarna and CentroidAlifold on a data set consisting of 706 sequences from the RNAstrand data set for which CompaRNA identified an Rfam family. Both methods were run on all sequences from the seed alignments corresponding to the identified families. Average MCCs with 95% confidence intervals (CI) are plotted (for each Rfam family, n denotes the number of sequences used to calculate the average MCC and CI). The significance of the difference in performance between two methods can be estimated with the CI error bar overlap rule: if the number of MCCs used to calculate the averages is >10 and the error bars do not overlap, then the P-value can be assumed to be <0.01 (67). CentroidAlifold outperformed MXScarna in tests on the following Rfam families (P ≤ 0.01, n ≥ 10): ciliate telomerase RNA (telomerase_CIL, n = 12), bacterial small signal recognition particle RNA (Bacteria_small_SRP, n = 84), archaeal RNase P (RNaseP_arch, n = 24) and bacterial RNase P class A (RNaseP_bact_a, n = 197).
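The error bar overlap rule from the caption can be expressed as a short Python sketch. The caption does not say how the 95% CI was computed, so a normal approximation around the mean (about ±1.96 standard errors) is assumed here. The function names and the example MCC values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (lower, mean, upper) for a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m - half_width, m, m + half_width


def intervals_separate(mccs_a, mccs_b, min_n=10):
    """Apply the overlap rule: more than min_n values each and non-overlapping CIs."""
    if min(len(mccs_a), len(mccs_b)) <= min_n:
        return False  # too few MCC values for the rule to apply
    lo_a, _, hi_a = mean_ci(mccs_a)
    lo_b, _, hi_b = mean_ci(mccs_b)
    return hi_a < lo_b or hi_b < lo_a


if __name__ == "__main__":
    # Hypothetical per-sequence MCC values for two methods on one Rfam family.
    method_a = [0.80 + 0.01 * (i % 3) for i in range(12)]
    method_b = [0.50 + 0.01 * (i % 3) for i in range(12)]
    print(mean_ci(method_a))
    print(intervals_separate(method_a, method_b))
```

When `intervals_separate` returns `True`, the rule quoted in the caption allows a P-value below 0.01 to be assumed. A `False` result only means the rule does not apply, not that the two methods perform equally.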