Andreas Wilm, Indra Mainz, Gerhard Steger.
Abstract
BACKGROUND: The performance of alignment programs is traditionally tested on sets of protein sequences for which a reference alignment is known. Conclusions drawn from such protein benchmarks do not necessarily hold for the RNA alignment problem, as was demonstrated in the first RNA alignment benchmark published so far. For example, the twilight zone, the similarity range where alignment quality drops drastically, starts at 60 % for RNAs in comparison to 20 % for proteins. In this study we enhance the previous benchmark.
Year: 2006 PMID: 17062125 PMCID: PMC1635699 DOI: 10.1186/1748-7188-1-19
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Number of reference alignments and average Structure Conservation Index (SCI) for each alignment of k sequences.
| | k2 | k3 | k5 | k7 | k10 | k15 | total |
| no. aln. | 8976 (118) | 4835 | 2405 (481) | 1426 | 845 | 503 | 18990 |
| ∅ SCI | 0.95 (1.05) | 0.92 | 0.91 (0.87) | 0.90 | 0.89 | 0.89 | 0.93 |
Values for the previously used data-set1 [22] are given in brackets.
Figure 1. MAFFT (FFT-NS-2) and ClustalW performance with optimized and old parameters. PROALIGN (earlier identified to be a good aligner [22]) is included as a reference. Performance is measured as BRALISCORE vs. reference APSI and exemplified for k = 5 sequences. MAFFT version 5.667 was used with optimized parameters, which are default in version 5.667, and with (old) parameters of version 4, respectively; CLUSTALW was used either with default parameters or with optimized parameters (see Table 2 and text).
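The x-axis quantity in Figure 1, the average pairwise sequence identity (APSI) of a reference alignment, can be sketched as follows. This is only an illustration: the exact identity definition used in the benchmark (in particular how gap columns are counted) is not given in this section, so the gap handling below is an assumption, and the function names `pairwise_identity` and `apsi` are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a: str, b: str) -> float:
    """Percent identity of two aligned rows.

    Assumption: columns where both rows have a gap are ignored;
    a gap opposite a nucleotide counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def apsi(alignment: list[str]) -> float:
    """Mean of the percent identities over all pairs of rows."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

With such a function, a filter like APSI ≤ 80 % amounts to keeping only those reference alignments for which `apsi(rows) <= 80.0`.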
Averaged ranks derived from Friedman rank sum tests for ClustalW's gap parameter optimization.
| go \ ge | 0.42 | 0.83 | 1.67 | 3.33 | 4.99 | 6.66 | 8.32 | 9.99 |
| 7.5 | 56.0 | 55.0 | 54.0 | 53.0 | 51.2 | 50.0 | 47.0 | 42.8 |
| 11.25 | 47.5 | 44.0 | 41.5 | 37.2 | 34.5 | 27.3 | 28.2 | 31.5 |
| 15.0 | 20.8 | 24.0 | 20.0 | 14.5 | 13.5 | 22.3 | 29.3 | |
| 18.75 | 10.8 | 8.3 | 8.2 | 7.5 | 11.3 | 20.8 | 27.5 | 35.8 |
| 22.5 | 4.7 | 3.7 | 8.8 | 17.7 | 27.0 | 34.5 | 39.2 | |
| 26.25 | 5.8 | 5.5 | 8.8 | 17.5 | 31.2 | 36.7 | 42.3 | 46.2 |
| 30.0 | 15.2 | 17.2 | 22.8 | 32.8 | 39.3 | 45.0 | 49.0 | 51.5 |
Ranks (smaller values mean better performance) for each gap-open (go)/gap-extension (ge) penalty combination are based on the BRALISCORE averaged over all alignment sets with k ∈ {2, 3, 5, 7, 10, 15} sequences and APSI ≤ 80 %. CLUSTALW's default and the optimized value combinations are given in bold-face.
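The averaged ranks in such a table can be illustrated with a small sketch. It assumes that each alignment set forms one block, that every parameter combination has a score in every block, and that a higher score is better; the function names and the dictionary layout are hypothetical and not taken from the paper. Only the rank averaging is shown, not the Friedman test statistic itself.

```python
def rank_block(scores: dict[str, float]) -> dict[str, float]:
    """Rank settings within one block: highest score gets rank 1,
    tied scores share the mean of the ranks they occupy."""
    ordered = sorted(scores, key=lambda name: -scores[name])
    ranks: dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # extend j over the run of settings tied with ordered[i]
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def average_ranks(blocks: list[dict[str, float]]) -> dict[str, float]:
    """Average the within-block ranks of each setting over all blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    names = set(blocks[0])
    totals = {name: 0.0 for name in names}
    for block in blocks:
        if set(block) != names:
            # rank-based comparison needs a complete table of scores
            raise ValueError("every block must contain all settings")
        for name, rank in rank_block(block).items():
            totals[name] += rank
    return {name: total / len(blocks) for name, total in totals.items()}
```

The completeness check mirrors the practical constraint that a missing value for one parameter combination makes the whole block unusable for ranking.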
Averaged ranks derived from Friedman rank sum tests for prank's gap parameter optimization.
| go \ ge | 0.05 | 0.125 | 0.1875 | 0.25 | 0.375 | 0.5 |
| 0.0025 | 3.5 | 2.0 | 4.8 | NA | NA | NA |
| 0.00625 | 6.8 | 3.5 | 3.2 | NA | NA | NA |
| 0.00938 | 8.8 | 6.5 | 8.0 | NA | NA | NA |
| 0.0125 | NA | NA | NA | 8.2 | 11.0 | 13.5 |
| 0.01875 | NA | NA | NA | 12.8 | 12.5 | 15.8 |
| 0.025 | NA | NA | NA | 15.8 | 17.2 | |
| 0.03125 | NA | NA | NA | 20.0 | 22.0 | 23.8 |
| 0.0375 | NA | NA | NA | 25.0 | 27.0 | 27.8 |
Ranks (smaller values mean better performance) for each gap-open (go)/gap-extension (ge) value combination are averaged over all alignment sets with k ∈ {5, 7, 10, 15} sequences and APSI ≤ 80 %. The default option for PRANK version 1508b is given in bold-face. Values for sets k2 and k3 are missing because PRANK crashed repeatedly with these sets, but we needed all values to compute the Friedman tests.
Comparison of default vs. RIBOSUM substitution matrix by Wilcoxon tests
| Program | k2 | k3 | k5 | k7 | k10 | k15 |
| ALIGN-M | / | + | + | + | / | / |
| CLUSTALW | - | - | - | - | - | - |
| POA | + | + | + | / | / | / |
If the use of the RIBOSUM 85–60 matrix resulted in a statistically significant performance increase in comparison to use of the default matrix this is indicated with a "+"; "-" indicates that the default matrix scores significantly better. If no statistical significance was found this is indicated with a "/".
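How paired scores could be turned into the "+", "-" and "/" symbols is sketched below with a Wilcoxon signed-rank test. Several points are assumptions rather than the authors' procedure: the normal approximation without tie or continuity correction, the two-sided test, the significance level `alpha = 0.05`, and the function name `signed_rank_symbol`.

```python
import math


def signed_rank_symbol(default_scores, ribosum_scores, alpha=0.05):
    """Return '+', '-' or '/' for paired scores of the same alignments.

    '+' : RIBOSUM scores significantly higher than default scores
    '-' : default scores significantly higher
    '/' : no significant difference at level alpha (assumed value)
    """
    if len(default_scores) != len(ribosum_scores):
        raise ValueError("score lists must be paired")
    # zero differences are discarded, as in the classical test
    diffs = [r - d for d, r in zip(default_scores, ribosum_scores) if r != d]
    n = len(diffs)
    if n == 0:
        return "/"
    # rank the absolute differences, giving tied values their mean rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approx.
    if p_value >= alpha:
        return "/"
    return "+" if z > 0 else "-"
```

For small samples an exact test would be preferable to the normal approximation; the sketch only shows how the three symbols relate to the sign and significance of the paired differences.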
Figure 2. Performance of Prrn compared to ClustalW in dependence on sequence number per alignment. The plot shows the difference of the scores of PRRN as a representative of an iterative alignment approach and CLUSTALW (standard options) as a representative of a progressive approach.
Ranks determined by Friedman rank sum tests for all top-ranking programs.
| Program/Option | k2 | k3 | k5 | k7 | k10 | k15 |
| CLUSTALW (default) | 8 | 7 | 8 | 8 | 7 | 7 |
| CLUSTALW (optimized) | 6 | 6 | 7 | 7 | 6 | 6 |
| MAFFT (FFT-NS-2) | 2 | 4 | 4 | 4 | 5 | 5 |
| MAFFT (G-INS-i) | 1 | 1 | 1 | 1 | 1 | 1 |
| MUSCLE | 3 | 3 | 3 | 2 | 2 | 2 |
| PCMA | 9 | 10 | 10 | 10 | 10 | 10 |
| POA | 7 | 8 | 9 | 9 | 9 | 9 |
| PROALIGN | 5 | 5 | 6 | 6 | 8 | 8 |
| PROBCONSRNA | 4 | 2 | 2 | 3 | 3 | 4 |
| PRRN | 10 | 9 | 5 | 5 | 4 | 3 |
Programs were ranked according to BRALISCORE averaged over all alignment sets with k ∈ {2, 3, 5, 7, 10, 15} sequences and APSI ≤ 80 %. MAFFT (G-INS-i) is the top performing program on all test sets. For program versions and options see Methods.
Number of reference alignments for each RNA family
| RNA family | k2 | k3 | k5 | k7 | k10 | k15 | ∑ |
| 5S_rRNA | 1162 | 568 | 288 | 150 | 90 | 50 | 2308 |
| 5_8S_rRNA | 76 | 45 | 17 | 5 | 3 | 0 | 146 |
| Cobalamin | 188 | 61 | 15 | 4 | 0 | 0 | 268 |
| Entero_5_CRE | 48 | 32 | 19 | 10 | 8 | 5 | 122 |
| Entero_CRE | 65 | 38 | 20 | 13 | 8 | 4 | 148 |
| Entero_OriR | 49 | 31 | 17 | 11 | 8 | 4 | 120 |
| gcvT | 167 | 67 | 22 | 12 | 3 | 1 | 272 |
| Hammerhead_1 | 53 | 32 | 9 | 1 | 0 | 0 | 95 |
| Hammerhead_3 | 126 | 99 | 52 | 32 | 17 | 12 | 338 |
| HCV_SLIV | 98 | 63 | 36 | 26 | 16 | 10 | 249 |
| HCV_SLVII | 51 | 33 | 19 | 13 | 10 | 7 | 133 |
| HepC_CRE | 45 | 29 | 18 | 11 | 7 | 3 | 113 |
| Histone3 | 84 | 59 | 27 | 11 | 7 | 6 | 194 |
| HIV_FE | 733 | 408 | 227 | 147 | 98 | 56 | 1669 |
| HIV_GSL3 | 786 | 464 | 246 | 151 | 95 | 61 | 1803 |
| HIV_PBS | 188 | 124 | 76 | 55 | 38 | 25 | 506 |
| Intron_gpII | 181 | 82 | 35 | 22 | 11 | 4 | 335 |
| IRES_HCV | 764 | 403 | 205 | 146 | 83 | 47 | 1648 |
| IRES_Picorna | 181 | 117 | 75 | 53 | 35 | 25 | 486 |
| K_chan_RES | 124 | 40 | 2 | 0 | 0 | 0 | 166 |
| Lysine | 80 | 48 | 30 | 17 | 7 | 3 | 185 |
| Retroviral_psi | 89 | 57 | 34 | 24 | 17 | 11 | 232 |
| SECIS | 114 | 67 | 33 | 16 | 11 | 6 | 247 |
| sno_14q_I_II | 44 | 14 | 1 | 0 | 0 | 0 | 59 |
| SRP_bact | 114 | 76 | 39 | 19 | 12 | 7 | 267 |
| SRP_euk_arch | 122 | 94 | 42 | 21 | 12 | 6 | 297 |
| S_box | 91 | 51 | 25 | 12 | 7 | 2 | 188 |
| T-box | 18 | 8 | 0 | 0 | 0 | 0 | 26 |
| TAR | 286 | 165 | 92 | 62 | 42 | 28 | 675 |
| THI | 321 | 144 | 69 | 32 | 17 | 5 | 588 |
| tRNA | 2039 | 1012 | 461 | 267 | 143 | 100 | 4022 |
| U1 | 82 | 65 | 26 | 16 | 6 | 0 | 195 |
| U2 | 112 | 83 | 38 | 22 | 14 | 7 | 276 |
| U6 | 30 | 21 | 14 | 7 | 1 | 0 | 73 |
| UnaL2 | 138 | 71 | 43 | 20 | 7 | 0 | 279 |
| yybP-ykoY | 127 | 64 | 33 | 18 | 12 | 8 | 262 |
| ∑ | 8976 | 4835 | 2405 | 1426 | 845 | 503 | 18990 |
Figure 3. Lowess smoothing. The plot shows the scattered data points, each corresponding to one alignment, exemplified by the performance of PROALIGN with k = 7 sequences per alignment. The curve is the result of a lowess smoothing with a smoothing factor of 0.3.
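The kind of curve shown in Figure 3 can be approximated with a minimal lowess sketch: for each point, a straight line is fitted to the nearest fraction `f` of the data using tricube weights. The sketch assumes plain locally weighted linear regression without the robustness iterations that some lowess implementations add; whether the authors used such iterations is not stated here, and the function name and signature are hypothetical.

```python
import math


def lowess(xs, ys, f=0.3):
    """Locally weighted linear regression (no robustness iterations).

    xs, ys : paired data, e.g. APSI values and alignment scores
    f      : smoothing factor, the fraction of points in each local fit
    Returns the fitted value at every x in xs.
    """
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    r = min(n, max(2, math.ceil(f * n)))  # points per local neighbourhood
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[r - 1]  # bandwidth: distance to r-th nearest point
        if h == 0:
            weights = [1.0 if d == 0 else 0.0 for d in dists]
        else:
            # tricube kernel, zero at and beyond the bandwidth
            weights = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted
```

A small `f` such as 0.3 follows local trends closely, while larger values give a smoother, flatter curve.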