Chuong B Do, Chuan-Sheng Foo, Serafim Batzoglou.
Abstract
MOTIVATION: The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction. In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for simultaneous alignment and consensus folding of unaligned RNA sequences. Algorithmically, RAF exploits sparsity in the set of likely pairing and alignment candidates for each nucleotide (as identified by the CONTRAfold or CONTRAlign programs) to achieve an effectively quadratic running time for simultaneous pairwise alignment and folding. RAF's fast sparse dynamic programming, in turn, serves as the inference engine within a discriminative machine learning algorithm for parameter estimation.
Year: 2008 PMID: 18586747 PMCID: PMC2718655 DOI: 10.1093/bioinformatics/btn177
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Sparsity patterns in posterior probability matrices. Panels (a) and (b) illustrate the pairwise pairing posterior probabilities for two different sequences (such as those generated by a single-sequence probabilistic or partition function–based RNA folding program). Panel (c) shows the alignment match probabilities for these sequences (such as those generated by a probabilistic HMM). In each panel, the darkness of each square represents the posterior confidence in the corresponding base pairing or alignment match. While the single-sequence folder or the pairwise sequence aligner may not be able to identify the single correct folding or alignment, respectively, the set of likely candidate base pairings and matched positions is nonetheless extremely sparse.
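The sparsity described in this caption can be illustrated with a minimal sketch. It assumes a posterior matrix is available as a nested list and that candidates are selected by a simple probability cutoff; the function names, the cutoff value `eps` and the toy matrix are hypothetical and are not taken from the RAF implementation.

```python
def candidate_sets(posteriors, eps):
    """For each row i, list the columns j whose posterior is at least eps.

    `posteriors` is a square nested list; `eps` is an assumed cutoff.
    """
    return [
        [j for j, p in enumerate(row) if p >= eps]
        for row in posteriors
    ]


def max_candidates_per_position(candidates):
    """Largest number of retained candidates for any single position."""
    return max((len(c) for c in candidates), default=0)


# Toy 4x4 pairing posterior matrix (made-up values for illustration only).
pairing = [
    [0.00, 0.01, 0.02, 0.90],
    [0.01, 0.00, 0.85, 0.03],
    [0.02, 0.85, 0.00, 0.01],
    [0.90, 0.03, 0.01, 0.00],
]

cands = candidate_sets(pairing, eps=0.05)
c = max_candidates_per_position(cands)
```

In this toy case each position keeps a single candidate partner, so a dynamic program restricted to the retained entries would visit far fewer cells than one that considers every possible pair.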
Comparison of computational complexity of RNA simultaneous folding and alignment algorithms
| Algorithm | Time complexity | Space complexity |
|---|---|---|
| Sankoff | | |
| FOLDALIGN | | |
| LocARNA | | |
| Murlet | | |
| RAF | | |
Here, L denotes the sequence length, c is the number of candidate base pairs per position, d is the number of candidate alignment matches per position and κ is the minimum allowed distance between adjacent helices.
Fig. 2. Trade-off between sparsity factor and proportion of reference base-pairings or aligned matches covered when varying the cutoffs ɛ_paired and ɛ_aligned. This graph was made using training set 𝒯3.
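A trade-off of this kind could be computed along the following lines. This is a sketch under stated assumptions: the sparsity factor is taken here to be the average number of retained candidates per position, which may differ from the definition used for the figure, and the function name, cutoff values and toy data are hypothetical.

```python
def tradeoff(posteriors, reference, cutoffs):
    """Return a list of (cutoff, sparsity_factor, coverage) tuples.

    sparsity_factor: average retained candidates per row (assumed definition).
    coverage: fraction of reference (i, j) entries that survive the cutoff.
    """
    n = len(posteriors)
    results = []
    for eps in cutoffs:
        # Keep every matrix entry whose posterior reaches the cutoff.
        kept = {
            (i, j)
            for i, row in enumerate(posteriors)
            for j, p in enumerate(row)
            if p >= eps
        }
        sparsity = len(kept) / n if n else 0.0
        coverage = len(kept & reference) / len(reference) if reference else 1.0
        results.append((eps, sparsity, coverage))
    return results


# Toy symmetric posterior matrix and reference entries (made-up values).
posteriors = [
    [0.00, 0.60, 0.20],
    [0.60, 0.00, 0.05],
    [0.20, 0.05, 0.00],
]
reference = {(0, 1), (1, 0), (0, 2), (2, 0)}

results = tradeoff(posteriors, reference, cutoffs=[0.01, 0.1, 0.5])
```

Raising the cutoff lowers the number of candidates kept per position but eventually discards reference entries, which is the tension the figure describes.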
Performance comparison on BRAliBASE II datasets. The best number in each column is marked in bold.
| Dataset | Program | Time (s) | SP | Sens | PPV | MCC |
|---|---|---|---|---|---|---|
| 5S rRNA | Murlet | 687 | 0.94 | |||
| LocARNA | 812 | 0.93 | 0.55 | 0.60 | 0.57 | |
| RNA Sampler | 2361 | 0.90 | 0.55 | 0.64 | 0.59 | |
| RAF | 0.66 | 0.66 | 0.66 | |||
| group II intron | Murlet | 962 | 0.75 | |||
| LocARNA | 250 | 0.74 | 0.79 | 0.65 | 0.72 | |
| RNA Sampler | 1626 | 0.72 | 0.77 | 0.65 | 0.71 | |
| RAF | 0.78 | 0.65 | 0.73 | |||
| SRP | Murlet | 20548 | ||||
| LocARNA | 22467 | 0.85 | 0.66 | 0.70 | 0.68 | |
| RAF | 0.87 | 0.72 | 0.71 | 0.70 | ||
| tRNA | Murlet | 525 | 0.93 | 0.86 | 0.90 | 0.88 |
| LocARNA | 246 | 0.86 | 0.90 | 0.88 | ||
| RNA Sampler | 763 | 0.92 | ||||
| RAF | 0.94 | 0.81 | 0.85 | 0.83 | ||
| U5 | Murlet | 1772 | 0.69 | 0.75 | 0.72 | |
| LocARNA | 549 | 0.80 | 0.56 | 0.61 | 0.58 | |
| RNA Sampler | 4084 | 0.77 | 0.75 | 0.70 | 0.72 | |
| RAF | 0.82 |
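For readers unfamiliar with the structure-prediction columns, the sketch below shows how sensitivity (Sens) and positive predictive value (PPV) are conventionally computed from sets of base pairs. The MCC line uses the approximation sqrt(Sens × PPV), which is common in RNA structure benchmarking but is an assumption here, since the excerpt does not state how the reported values were computed. The function name and the toy pairs are hypothetical.

```python
import math


def pair_metrics(predicted, reference):
    """Return (sens, ppv, mcc_approx) for two sets of (i, j) base pairs."""
    tp = len(predicted & reference)  # correctly predicted pairs
    sens = tp / len(reference) if reference else 0.0  # TP / (TP + FN)
    ppv = tp / len(predicted) if predicted else 0.0   # TP / (TP + FP)
    # Assumed approximation of MCC as the geometric mean of Sens and PPV.
    mcc = math.sqrt(sens * ppv)
    return sens, ppv, mcc


# Toy example: three of four reference pairs recovered, one false positive.
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (3, 18), (5, 16)}

sens, ppv, mcc = pair_metrics(predicted, reference)
```

Because per-dataset figures are typically averaged over many alignments, a reported MCC need not equal the square root of the product of the reported Sens and PPV exactly.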
Performance comparison on MASTR benchmarking sets. The best number in each column is marked in bold.
| Program | SP | Sens | PPV | MCC |
|---|---|---|---|---|
| CLUSTAL W+Alifold | 0.81 | 0.57 | 0.73 | 0.65 |
| FoldalignM | 0.78 | 0.38 | 0.55 | |
| LocARNA | 0.75 | 0.41 | 0.77 | 0.56 |
| MASTR | 0.84 | 0.64 | 0.73 | 0.68 |
| Murlet | 0.62 | 0.78 | 0.70 | |
| RNAforester | 0.53 | 0.55 | 0.55 | 0.55 |
| RNA Sampler | 0.82 | 0.65 | 0.70 | 0.67 |
| RAF | 0.88 | 0.77 |