Usman Roshan, Satish Chikkagoudar, Dennis R Livesay.
Abstract
BACKGROUND: Identification of RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or unalignable flanking residues are present. In both cases structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA structures or family alignments. As such, local sequence-sequence alignment programs are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities (implemented in Probalign) are significantly better than contemporary methods on heterogeneous length protein sequence datasets, thus suggesting an affinity for local alignment.
Year: 2008 PMID: 18226231 PMCID: PMC2248559 DOI: 10.1186/1471-2105-9-61
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Mean and median percent error for all methods on the full benchmark (13,716 datasets) including query RNAs with flanks of size 50, 100, and 150.
| Mean and median error | Probalign | SSEARCH | BLAST | ClustalW |
| Complete benchmark | 38.7 | 33.2 | 41.0 | 34.0 | 47.6 | 50.3 | |
| Datasets with pairwise sequence identity at most 30% | 73.0 | 83.4 | 75.9 | 85.3 | 82.9 | 85.0 |
BLAST does not return an alignment in 425 datasets, so these are omitted from the calculations. HMMER is not shown because queries with unalignable flanks cannot be used to produce a reliable model. There are 14 families that contain datasets with at most 30% sequence identity. Probalign has the lowest overall mean and median error. Bold indicates the best performance; the difference is larger on datasets with low sequence identity and is significant with P-value < 0.05 (indicated by *).
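The omission rule in the note above can be sketched as follows. The error values and the `summarize` helper are hypothetical illustrations, not benchmark data; `None` stands for a dataset on which a method returned no alignment, and such datasets are dropped for every method before averaging.

```python
from statistics import mean, median

def summarize(errors_by_method):
    """Mean and median percent error per method, using only datasets
    on which every method returned an alignment."""
    n = len(next(iter(errors_by_method.values())))
    # Keep a dataset index only if no method is missing a value for it.
    kept = [i for i in range(n)
            if all(errs[i] is not None for errs in errors_by_method.values())]
    summary = {method: (mean(errs[i] for i in kept), median(errs[i] for i in kept))
               for method, errs in errors_by_method.items()}
    return summary, len(kept)

# Made-up percent errors for four datasets; BLAST fails on the second one.
errors = {
    "Probalign": [10.0, 40.0, 70.0, 20.0],
    "BLAST": [20.0, None, 90.0, 40.0],
}
summary, n_kept = summarize(errors)
print(n_kept, summary)
```

Dropping the failed dataset for all methods keeps the comparison paired, which is why the per-method figures in this table are computed on slightly fewer datasets than the full benchmark.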
Mean Probalign and SSEARCH percent error shown for each RFAM family in the full benchmark and for datasets with maximum pairwise sequence identity of 30%.
| RFAM family | Probalign (complete) | SSEARCH (complete) | Difference (complete) | Probalign (≤30% identity) | SSEARCH (≤30% identity) | Difference (≤30% identity) |
| 5S_rRNA | 22.7 | 20.7 | -2.0 | | | |
| U1 (4) | 15.0 | 15.6 | 0.6 | 87.3 | 100.0 | 12.7 |
| tRNA (256) | 62.0 | 74.4 | 12.3 | 69.8 | 84.8 | 15.0 |
| RNaseP_bact_a | 34.0 | 33.0 | -1.0 | | | |
| RNaseP_bact_b | 29.0 | 29.1 | -0.1 | | | |
| U3 | 41.3 | 38.8 | -2.5 | | | |
| U4 (8) | 25.3 | 22.2 | -3.1 | 52.8 | 11 | -41.8 |
| SRP_euk_arch (132) | 43.8 | 56.4 | 12.6 | 62.1 | 78.0 | 15.9 |
| tmRNA (180) | 32.0 | 36.3 | 4.3 | 50.5 | 59.8 | 9.4 |
| Intron_gpI (4) | 67.4 | 80.1 | 12.7 | 100.0 | 100.0 | 0.0 |
| SECIS (208) | 82.3 | 93.9 | 11.5 | 87.9 | 100.0 | 12.1 |
| IRE (216) | 44.4 | 48.7 | 4.2 | 88.7 | 96.5 | 7.7 |
| THI | 29.5 | 30.1 | 0.6 | | | |
| Hammerhead_1 | 43.7 | 46.0 | 2.3 | | | |
| Purine (4) | 16.2 | 16.4 | 0.2 | 17.4 | 1.8 | -15.6 |
| Lysine (16) | 48.0 | 57.3 | 9.3 | 73.1 | 100.0 | 26.9 |
| SRP_bact (80) | 28.5 | 25.7 | -2.8 | 62.6 | 65.0 | 2.3 |
| SSU_rRNA_5 (4) | 30.5 | 32.4 | 1.9 | 39 | 61 | 22 |
| T-box | 27.4 | 46.0 | 18.6 | | | |
| glmS (4) | 23.4 | 21.0 | -2.4 | 73.8 | 78.4 | 4.6 |
| RNaseP_arch (8) | 32.4 | 34.0 | 1.6 | 87 | 100.0 | 13 |
| IRES_Cripavirus | 5.7 | 3.9 | -1.8 | | | |
Unlike Table 1 above, where some datasets are omitted because of BLAST, all datasets of the benchmark are considered here. Difference is always calculated as the SSEARCH error minus the Probalign error, so positive numbers indicate that Probalign outperforms SSEARCH. Shown in parentheses is the number of datasets in each family with maximum pairwise sequence identity of 30% (the same query RNA with a different flank size is considered a separate dataset).
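As a small worked illustration of the sign convention, the sketch below recomputes the difference for three families from the rounded complete-benchmark means in the table. The `difference` helper is a hypothetical name introduced only for this example.

```python
def difference(probalign_error, ssearch_error):
    """SSEARCH error minus Probalign error, rounded to one decimal.
    A positive value means Probalign has the lower error."""
    return round(ssearch_error - probalign_error, 1)

# (Probalign, SSEARCH) complete-benchmark mean errors copied from the table.
rows = {"5S_rRNA": (22.7, 20.7), "U1": (15.0, 15.6), "T-box": (27.4, 46.0)}

for family, (probalign, ssearch) in rows.items():
    d = difference(probalign, ssearch)
    better = "Probalign" if d > 0 else "SSEARCH" if d < 0 else "tie"
    print(f"{family}: {d:+.1f} ({better})")
```

Because the published differences were presumably computed before rounding, recomputing them from the rounded means can be off by 0.1 for some rows (for example tRNA: 74.4 minus 62.0 is 12.4, while the table lists 12.3).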
Mean percent error as a function of query RNA flank size.
| Query RNA flank size | Probalign | SSEARCH | BLAST | ClustalW |
| 50 | 39.3 | 41.9 | 48.5 | |
| 100 | 40.8 | 44.5 | 51.4 | |
| 150 | 43.3 | 45.9 | 53.2 |
For each flank size there are 3,429 datasets (see the Methods section for a description of the benchmark). As in Table 1, about 105 datasets per flank size on which BLAST does not return any output are omitted. Bold indicates the best performance and * indicates a Friedman rank test P-value < 0.05.
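For readers unfamiliar with the Friedman rank test, here is a minimal sketch of its test statistic. It assumes each dataset is a block and each method a treatment, which is an assumption about how the test was applied; the error values are made up, and the function names are hypothetical. No tie correction is applied.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """Friedman chi-square statistic for n blocks (datasets) by k treatments (methods)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Made-up percent errors: rows are datasets, columns are three methods.
errors = [
    [10.0, 20.0, 30.0],
    [15.0, 25.0, 45.0],
    [5.0, 12.0, 40.0],
    [22.0, 30.0, 31.0],
]
stat = friedman_statistic(errors)
print(stat)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom to obtain a P-value. SciPy's `scipy.stats.friedmanchisquare` provides a tie-corrected implementation that also returns the P-value.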
Mean and median percent error for all methods on the benchmark without query RNA flanks (3,429 datasets).
| Mean and median error | Probalign | SSEARCH | BLAST | ClustalW | HMMER |
| Complete benchmark | 31.4 | 22.1 | 32.0 | | 37.9 | 38.5 | 44.9 | 44.7 | |
| Datasets with pairwise sequence identity at most 30% (14) | 70.8 | 94.5 | 78.4 | 100.0 | 74.5 | 97.5 | 96.7 | 100.0 |
Probalign has the lowest mean and median error on low sequence identity datasets.
Figure 1. ROC curves for Probalign mean posterior probability and SSEARCH normalized Z-score. To construct these curves, we added a set of false hits to our dataset by replacing each genomic sequence in each dataset of the benchmark with a randomly selected one from a benchmark dataset of a different RNA family. The ROC analysis clearly demonstrates that Probalign is better able to discriminate true from false alignments.
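The construction described in this caption can be sketched as follows: score every true hit and every false hit, then sweep a threshold from high to low and record the true and false positive rates. The scores below are made up, the function names are hypothetical, and the sketch assumes that a higher score means a more confident hit.

```python
def roc_points(true_scores, false_scores):
    """Return (false positive rate, true positive rate) points for a descending threshold."""
    labeled = sorted([(s, 1) for s in true_scores] + [(s, 0) for s in false_scores],
                     key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(labeled):
        j = i
        # Handle all hits with the same score together so ties give one point.
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        points.append((fp / len(false_scores), tp / len(true_scores)))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores for true homolog hits and for shuffled-family false hits.
true_scores = [0.9, 0.8, 0.4]
false_scores = [0.7, 0.3, 0.2]
curve = roc_points(true_scores, false_scores)
print(curve, auc(curve))
```

A method whose curve lies closer to the top-left corner (larger area) separates true from false alignments better, which is the comparison the figure makes between the two scores.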
Description of optimized parameters derived for each method used herein.
| Method | Scoring matrix | Gap opening penalty | Gap extension penalty |
| Probalign | +5/-4 (T = 7) | 32 | 2 |
| SSEARCH | +5/-4 | 10 | 4 |
| ClustalW | +10/-9 | 13 | 6 |
| BLAST | +5/-4 | 8 | 6 |
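The parameters in the table above can be captured in a small configuration structure, sketched below. The class and function names are hypothetical. The `boltzmann_weight` helper assumes the usual partition-function form, in which a substitution score s contributes a weight exp(s / T); this form is stated here as an assumption rather than taken from the table.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AlignmentParams:
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int
    temperature: Optional[float] = None  # only Probalign lists a T value

# Values copied from the table of optimized parameters.
PARAMS = {
    "Probalign": AlignmentParams(5, -4, 32, 2, temperature=7.0),
    "SSEARCH": AlignmentParams(5, -4, 10, 4),
    "ClustalW": AlignmentParams(10, -9, 13, 6),
    "BLAST": AlignmentParams(5, -4, 8, 6),
}

def substitution_score(params, a, b):
    """Score for aligning nucleotide a with nucleotide b under a match/mismatch matrix."""
    return params.match if a == b else params.mismatch

def boltzmann_weight(params, a, b):
    """exp(score / T), the assumed per-pair weight in a partition-function sum."""
    return math.exp(substitution_score(params, a, b) / params.temperature)

print(boltzmann_weight(PARAMS["Probalign"], "A", "A"))
print(boltzmann_weight(PARAMS["Probalign"], "A", "C"))
```

With T = 7, a match is weighted exp(5/7) and a mismatch exp(-4/7). A larger T would flatten the difference between the two, and a smaller T would sharpen it.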
Figure 2. A cartoon of false positive and false negative situations for a query-target alignment.
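One way to make the two situations concrete is to compare the known location of the homolog in the target with the region covered by the predicted alignment. The sketch below counts positions under that assumption, using half-open `(start, end)` coordinates. The `fp_fn` helper is hypothetical, and how the counts are combined into a percent error is not specified here.

```python
def fp_fn(true_span, predicted_span):
    """Count false positive and false negative positions for half-open (start, end) spans."""
    t0, t1 = true_span
    p0, p1 = predicted_span
    overlap = max(0, min(t1, p1) - max(t0, p0))
    false_pos = (p1 - p0) - overlap  # predicted positions outside the true homolog
    false_neg = (t1 - t0) - overlap  # true homolog positions the prediction missed
    return false_pos, false_neg

# The prediction starts late and runs past the true homolog.
print(fp_fn((100, 200), (150, 230)))
# The prediction misses the homolog entirely.
print(fp_fn((100, 200), (300, 350)))
```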