Kazutaka Katoh, Hiroyuki Toh.
Abstract
BACKGROUND: Structural alignment of RNAs has become increasingly important since the discovery of functional non-coding RNAs (ncRNAs). Recent studies, mainly based on various approximations of the Sankoff algorithm, have considerably improved the accuracy of pairwise structural alignment. In contrast, for cases with more than two sequences, the practical merit of structural alignment over traditional sequence-based methods remains unclear, although the importance of multiple structural alignment is widely recognized.
Year: 2008 PMID: 18439255 PMCID: PMC2387179 DOI: 10.1186/1471-2105-9-212
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic representation of the calculation procedure of X-INS-i with Four-way Consistency (A) in comparison to that of G-INS-i (B).
Version number and command-line arguments for each method
| Method | Arguments |
| ClustalW 2.0 (iterative) | -Iteration = tree |
| ProbConsRNA 1.1 | (default) |
| MAFFT-G-INS-i 6.516 | mafft-ginsi |
| LaRA 1.3/1.31 * | (The default parameter file was used.) |
| Murlet 0.1 | (default) |
| MXSCARNA 2 | (default) |
| RNA Sampler 1.3 | RNASampler_driver.pl -i 15 -S 100 > |
| RNA Sampler 1.3 (fast) | RNASampler_driver.pl -i 15 -S 100 -f 1 > |
| MASTR 1.0 | (default) |
| X-INS-i-scarnapair 6.516 | mafft-xinsi --scarnapair |
| X-INS-i-larapair 6.516 | mafft-xinsi --larapair |
* As the latest version of LaRA (1.31) frequently aborted for cases with more than two sequences, version 1.3 was used together with the parameter file of version 1.31.
The KKA dataset.
| Family name | Rfam accession # | Mean length | % identity |
| 5S_rRNA | RF00001 | 116 | 57 |
| 5_8S_rRNA | RF00002 | 154 | 61 |
| IRES_HCV | RF00061 | 261 | 94 |
| Lysine | RF00168 | 181 | 49 |
| RFN | RF00050 | 140 | 66 |
| Retroviral_psi | RF00175 | 117 | 92 |
| SECIS | RF00031 | 64 | 41 |
| SRP_bact | RF00169 | 93 | 47 |
| SRP_euk_arch | RF00017 | 291 | 40 |
| S_box | RF00162 | 107 | 66 |
| T-box | RF00230 | 244 | 45 |
| THI | RF00059 | 105 | 55 |
| U1 | RF00003 | 157 | 59 |
| U2 | RF00004 | 182 | 62 |
| UnaL2 | RF00436 | 54 | 73 |
| sno_14q_I_II | RF00181 | 75 | 64 |
| tRNA | RF00005 | 73 | 45 |
| Average | | 142 | 59 |
The length and identity values were taken from Table 1 in Kiryu et al. [32].
Figure 2. A flowchart of benchmarks using the KKA dataset.
Effects of two different parts that incorporate the structural information
| | | | Accuracy of predicted structure (MCC) | | |
| Structural pairwise alignment | Four-way consistency | SPS | Pfold | McCaskill-MEA | RNAalifold |
| Disabled (globalpair) | Disabled | 0.768 | 0.622 | 0.646 | 0.622 |
| Disabled (globalpair) | Enabled (McCaskill) | 0.782 | 0.674 | 0.680 | 0.670 |
| Disabled (globalpair) | Enabled (CONTRAfold) | 0.781 | 0.665 | 0.675 | 0.668 |
| Enabled (larapair) | Disabled | 0.758 | 0.646 | 0.661 | 0.630 |
| Enabled (larapair) | Enabled (McCaskill) | 0.758 | 0.665 | 0.692 | 0.672 |
| Enabled (larapair) | Enabled (CONTRAfold) | 0.761 | 0.661 | 0.689 | 0.677 |
| Enabled (scarnapair) | Disabled | 0.787 | 0.699 | 0.687 | 0.693 |
| Enabled (scarnapair) | Enabled (McCaskill) | 0.789 | 0.724 | 0.712 | 0.726 |
| Enabled (scarnapair) | Enabled (CONTRAfold) | 0.794 | 0.711 | 0.705 | 0.704 |
The KKA dataset was used as the benchmark. The accuracies of alignments measured by the SPS criterion are listed in the SPS column. The accuracies of predicted common secondary structures are shown in the three columns on the right. The alignment by each method was subjected to three external prediction programs, Pfold, McCaskill-MEA and RNAalifold, and then the differences from the Rfam curated structure were calculated with the MCC criterion.
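As an illustration of the MCC criterion, the following minimal sketch compares a predicted secondary structure with a reference structure, both given as dot-bracket strings of the same length. It assumes the standard Matthews correlation coefficient computed over all possible position pairs; the exact counting convention used in the benchmark is not stated here, and the function names `parse_dot_bracket` and `mcc` are hypothetical.

```python
from math import sqrt


def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), i < j, in a well-formed dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def mcc(predicted, reference):
    """Matthews correlation coefficient between two structures.

    Assumption: every unordered pair of positions is one classification
    case, so true negatives are all position pairs absent from both sets.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same length")
    length = len(reference)
    pred, ref = parse_dot_bracket(predicted), parse_dot_bracket(reference)
    tp = len(pred & ref)   # pairs predicted and present in the reference
    fp = len(pred - ref)   # pairs predicted but absent from the reference
    fn = len(ref - pred)   # reference pairs that were missed
    tn = length * (length - 1) // 2 - tp - fp - fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect prediction gives 1.0, and a prediction with no base pairs is scored 0.0 here by convention because the denominator vanishes.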
Comparison to existing methods
| | | | Accuracy of predicted structure (MCC) | | | |
| Method | Time (s.) | SPS | Pfold | McCaskill-MEA | RNAalifold | (intrinsic) |
| ClustalW (iterative) | 98 | 0.669 | 0.488 | 0.554 | 0.482 | |
| ProbConsRNA | 61 | 0.763 | 0.654 | 0.651 | 0.613 | |
| G-INS-i | 12 | 0.768 | 0.622 | 0.646 | 0.622 | |
| LaRA 1.31 | 15,000 | 0.687 | 0.607 | 0.649 | 0.600 | |
| Murlet | 64,000 | 0.773 | 0.712 | 0.702 | 0.668 | |
| MXSCARNA 2 | 700 | 0.769 | 0.718 | 0.666 | ||
| RNA Sampler (fast) | 19,000 | 0.641 | 0.659 | 0.684 | 0.662 | 0.655 |
| RNA Sampler | 70,000 | 0.655 | 0.685 | 0.705 | 0.705 | |
| MASTR | 24,000 | 0.662 | 0.570 | 0.616 | 0.592 | 0.601 |
| X-INS-i-larapair | 15,000 | 0.758 | 0.665 | 0.692 | 0.672 | |
| X-INS-i-scarnapair | 1,800 | |||||
The KKA dataset was used as the benchmark. The accuracies of alignments measured by the SPS criterion are listed in the SPS column. The SPS value was computed for each alignment and then averaged across all the alignments. The accuracies of predicted common secondary structures are shown in the four columns on the right. The alignment by each method was subjected to three external prediction programs, Pfold, McCaskill-MEA and RNAalifold, and then the differences from the Rfam curated structure were assessed. The MCC values were computed for each sequence and then averaged across all the sequences. The accuracy values for secondary structure internally predicted by RNA Sampler and MASTR are shown in the (intrinsic) column. The highest score in each column is underlined. The scores close to the highest (p > 0.01 in the Wilcoxon test) are shown in bold. McCaskill-MEA was run with the default value α = 0.91.
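The SPS criterion and the per-alignment averaging described above can be sketched as follows. This assumes the usual sum-of-pairs definition (the fraction of residue pairs aligned in the reference that are also aligned in the test alignment), with alignments given as lists of equal-length gapped strings in the same sequence order; `aligned_pairs`, `sps` and `mean_sps` are hypothetical names.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs ((seq, residue), (seq, residue)) sharing a column."""
    counters = [0] * len(alignment)   # next residue index for each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for seq_index, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


def mean_sps(cases):
    """Average SPS over (test, reference) alignment pairs, one score per alignment."""
    scores = [sps(test, reference) for test, reference in cases]
    return sum(scores) / len(scores)
```

Because each alignment contributes one score to the mean, small and large alignments are weighted equally under this scheme.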
Figure 3. Accuracy of alignment and structure prediction as a function of the average percent identity among input sequences. The KKA dataset was used. The alignment and structure in Rfam were assumed to be correct, and the differences from them were estimated with the SPS (for assessing the alignment accuracy) and MCC (for assessing the accuracy of structure prediction). The programs used for predicting secondary structure are indicated in parentheses. The percent identities were calculated from the reference alignments. The curves were fitted using a cubic spline.
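One way to compute an average percent identity from a reference alignment is sketched below. The exact definition used for the figure is not given here, so this is an assumption: identity is counted over columns where both sequences have a residue and then averaged over all sequence pairs. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    both = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not both:
        return 0.0
    return 100.0 * sum(x == y for x, y in both) / len(both)


def mean_percent_identity(alignment):
    """Average pairwise percent identity across all sequence pairs (needs >= 2 sequences)."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

Other conventions (for example, dividing by the shorter sequence length or by the alignment length) would give different values for gapped alignments.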
SPS scores for BRAliBASE version 2.1
| | | SPS | | | | | |
| Method | Time | N = 2 | N = 3 | N = 5 | N = 7 | N = 10 | N = 15 |
| ClustalW (iterative) | 52 minutes | 0.796 | 0.810 | 0.828 | 0.837 | 0.850 | 0.853 |
| ProbConsRNA | 33 minutes | 0.836 | 0.855 | 0.879 | 0.890 | 0.899 | 0.907 |
| G-INS-i | 8.8 minutes | 0.837 | 0.851 | 0.874 | 0.890 | 0.901 | 0.913 |
| LaRA 1.31 | 5.5 days | 0.798 | 0.830 | 0.864 | 0.883 | 0.898 | 0.913 |
| Murlet | 2.5 weeks | 0.843 | 0.863 | 0.886 | 0.897 | 0.906 | 0.915 |
| MXSCARNA 2 | 4.2 hours | 0.850 | 0.866 | 0.884 | 0.894 | 0.907 | 0.914 |
| RNA Sampler (fast) | 8.2 days | 0.787 | 0.801 | 0.824 | 0.828 | 0.841 | 0.855 |
| RNA Sampler | 2.9 weeks | 0.785 | 0.812 | 0.839 | 0.850 | 0.858 | 0.869 |
| X-INS-i-larapair | 5.4 days | 0.837 | 0.869 | 0.896 | 0.909 | 0.919 | |
| X-INS-i-scarnapair | 18 hours | ||||||
| # of alignments | | 8,976 | 4,835 | 2,405 | 1,426 | 845 | 503 |
| (used for Wilcoxon test) | | (8,976) | (4,832) | (2,399) | (1,420) | (836) | (491) |
The highest scores within each group (N = 2, 3, 5, 7, 10, 15) are underlined. The scores close to the highest (p > 0.01 in the Wilcoxon test) are shown in bold. As Murlet and RNA Sampler aborted for a small number of datasets, the Wilcoxon test was carried out using a limited set (the numbers of alignments are in parentheses), for which every method returned an alignment.
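The restriction to datasets for which every method returned an alignment can be sketched as a simple filtering step applied before any paired test. The data layout (a dictionary of per-dataset scores for each method, with `None` marking an aborted run) and the name `common_subset` are assumptions made for illustration.

```python
def common_subset(scores_by_method):
    """Keep only dataset IDs for which every method produced a score.

    scores_by_method maps method name -> {dataset_id: score or None}.
    Returns (paired_scores, shared_ids), where paired_scores maps each
    method to a list of scores in the order given by shared_ids.
    """
    shared = None
    for scores in scores_by_method.values():
        finished = {key for key, value in scores.items() if value is not None}
        shared = finished if shared is None else shared & finished
    shared_ids = sorted(shared or [])
    paired = {
        method: [scores[key] for key in shared_ids]
        for method, scores in scores_by_method.items()
    }
    return paired, shared_ids
```

The resulting equal-length score lists for two methods could then be passed to a paired test implementation such as `scipy.stats.wilcoxon`, assuming the signed-rank variant is the one intended.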
Figure 4. SPS values as a function of the percent identity among input sequences. The BRAliBASE dataset was used. The percent identities given with the dataset were used. The curves were fitted using a cubic spline.