Donglai Wei, Lauren V Alpert, Charles E Lawrence.
Abstract
MOTIVATION: RNA secondary structure plays an important role in the function of many RNAs, and structural features are often key to their interaction with other cellular components. Thus, there has been considerable interest in the prediction of secondary structures for RNA families. In this article, we present a new global structural alignment algorithm, RNAG, to predict consensus secondary structures for unaligned sequences. It uses a blocked Gibbs sampling algorithm, which has a theoretical advantage in convergence time. This algorithm iteratively samples from the conditional probability distributions P(Structure | Alignment) and P(Alignment | Structure). Not surprisingly, there is considerable uncertainty in the high-dimensional space of this difficult problem, and this uncertainty has so far received limited attention in the field. We show how the samples drawn from this algorithm can be used to more fully characterize the posterior space and to assess the uncertainty of predictions.
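The alternating scheme described in the abstract can be illustrated with a toy two-block Gibbs sampler. This is only a sketch under strong simplifying assumptions, not the RNAG implementation: the "alignment" `a` and the "structure" `s` are small integers, and the joint weights in `joint` are made-up values chosen for illustration. The function name `gibbs_two_blocks` is likewise hypothetical.

```python
import random
from collections import Counter


def gibbs_two_blocks(joint, n_iter, seed=0):
    """Alternate draws from P(s | a) and P(a | s) for a toy joint table.

    joint[a][s] is an unnormalized weight for the pair (a, s).
    'a' stands in for an alignment and 's' for a structure; both are
    plain integers here (an assumption made for illustration only).
    """
    rng = random.Random(seed)
    n_a, n_s = len(joint), len(joint[0])
    a = rng.randrange(n_a)  # arbitrary starting "alignment"
    samples = []
    for _ in range(n_iter):
        # Block 1: draw s from P(s | a), proportional to row a.
        s = rng.choices(range(n_s), weights=joint[a])[0]
        # Block 2: draw a from P(a | s), proportional to column s.
        column = [joint[i][s] for i in range(n_a)]
        a = rng.choices(range(n_a), weights=column)[0]
        samples.append((a, s))
    return samples


# Made-up joint weights: matching (a, s) pairs are four times as likely.
joint = [[4.0, 1.0], [1.0, 4.0]]
samples = gibbs_two_blocks(joint, n_iter=20000)

# Empirical frequencies approximate the normalized joint distribution.
counts = Counter(samples)
estimates = {pair: n / len(samples) for pair, n in counts.items()}
```

With enough iterations, the frequencies in `estimates` should approach the normalized joint weights (0.4 for matching pairs and 0.1 for the others in this toy table). The same collection of samples is what allows the spread of the posterior to be examined rather than a single point estimate.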
Year: 2011 PMID: 21788211 PMCID: PMC3167047 DOI: 10.1093/bioinformatics/btr421
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Average performance of different secondary structure prediction methods in the PPV–SEN plane for the MASTR dataset (Lindgreen ). PPV = TP/P = TP/(TP + FP), SEN = TP/T = TP/(TP + FN). Note: the axis ranges are set from 0.3 to 1.0 to improve readability. Points showing the performance of extant procedures were taken from Do ) except for CMfinder, which was included because of its similarity to RNAG. CMfinder was run at default values and settings.
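The PPV and SEN definitions in the caption of Fig. 1 can be made concrete with a short sketch. It assumes base pairs are represented as `(i, j)` index tuples and that a predicted pair counts as a true positive only when it matches a reference pair exactly; the function name `ppv_sen` and the example pairs are hypothetical.

```python
def ppv_sen(predicted, reference):
    """Return (PPV, SEN) for predicted vs. reference base pairs.

    PPV = TP / (TP + FP) and SEN = TP / (TP + FN), as in the caption.
    Pairs are (i, j) tuples; exact matching is an assumption here.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # pairs found in both sets
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # reference pairs that were missed
    ppv = tp / (tp + fp) if predicted else 0.0
    sen = tp / (tp + fn) if reference else 0.0
    return ppv, sen


# Made-up example: 2 of 3 predicted pairs are correct, 2 of 4 reference pairs are found.
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (5, 16)}
ppv, sen = ppv_sen(predicted, reference)
```

Here `ppv` is 2/3 and `sen` is 0.5, so the prediction would sit at the point (PPV, SEN) = (0.67, 0.50) in a plot of this kind.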
Fig. 2. Average performance of different secondary structure prediction methods in the PPV–SEN plane for four RNA families (5S rRNA, group II intron, tRNA and U5 spliceosomal RNA) from the BRAliBASE II dataset (Gardner ). Note: the axis ranges are set from 0.3 to 1.0 to improve readability. Points showing the performance of extant procedures were taken from Do ) except for CMfinder, which was run at defaults.
Table 1. Effects of the number of sequences on prediction results
| No. of sequences | Area under PPV–SEN curve: ensemble | Area under PPV–SEN curve: first cluster | Area under PPV–SEN curve: second cluster | Bias | SD | No. of samples: first cluster | No. of samples: second cluster | No. of samples: first+second cluster | 95% credibility limit: ensemble | 95% credibility limit: first cluster | 95% credibility limit: second cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.44 | 0.46 | 0.37 | 0.27 | 0.04 | 728.13 | 150.76 | 878.89 | 0.21 | 0.14 | 0.11 |
| 3 | 0.58 | 0.59 | 0.49 | 0.20 | 0.03 | 793.15 | 124.94 | 918.09 | 0.14 | 0.10 | 0.07 |
| 4 | 0.58 | 0.58 | 0.48 | 0.20 | 0.03 | 791.66 | 115.00 | 906.66 | 0.14 | 0.09 | 0.06 |
| 5 | 0.62 | 0.63 | 0.51 | 0.17 | 0.03 | 802.20 | 113.24 | 915.44 | 0.12 | 0.08 | 0.05 |
| 6 | 0.67 | 0.67 | 0.54 | 0.16 | 0.03 | 800.50 | 111.66 | 912.16 | 0.11 | 0.07 | 0.05 |
| 7 | 0.70 | 0.69 | 0.57 | 0.15 | 0.03 | 795.52 | 111.92 | 907.44 | 0.10 | 0.07 | 0.05 |
| 8 | 0.73 | 0.71 | 0.60 | 0.15 | 0.03 | 797.56 | 116.19 | 913.75 | 0.10 | 0.07 | 0.04 |
| 9 | 0.73 | 0.73 | 0.60 | 0.14 | 0.02 | 790.59 | 122.38 | 912.97 | 0.09 | 0.06 | 0.04 |
| 10 | 0.75 | 0.74 | 0.63 | 0.13 | 0.02 | 792.85 | 125.11 | 917.96 | 0.09 | 0.06 | 0.04 |
For each row, we calculate the average area under the PPV–SEN curve for accuracy comparison, and we also summarize the bias-variance statistics and the sizes of the two largest clusters to characterize the clustering results. To normalize bias, SD and credibility limits with respect to sequence length, we divide them by the average sequence length for the family.
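The two computations mentioned in this note can be sketched as follows. The source does not state how the area under the PPV–SEN curve is integrated, so the trapezoidal rule used here is an assumption, as are the function names and all numeric inputs.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (SEN, PPV) points.

    The trapezoidal rule is an assumption; the source does not say
    which integration scheme was used.
    """
    pts = sorted(points)  # order by SEN along the x-axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def normalize_by_mean_length(value, lengths):
    """Divide a statistic (bias, SD or credibility limit) by the
    average sequence length of the family, as described in the note."""
    return value / (sum(lengths) / len(lengths))


# Made-up curve with three (SEN, PPV) points.
curve = [(0.0, 1.0), (0.5, 0.8), (1.0, 0.4)]
area = area_under_curve(curve)

# Made-up raw bias of 12 base pairs for sequences of length 100 and 140.
normalized_bias = normalize_by_mean_length(12.0, [100, 140])
```

For these inputs `area` is 0.75 and `normalized_bias` is 0.1, which is the scale on which the Bias, SD and credibility-limit columns are reported.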
Fig. 3. Improvement of the RNAG PPV–SEN curves with increasing numbers of input sequences.
Table 2. A detailed look into the RNAG results on 17 RNA families, listed in groups by their functional type
| RNA family | RNA type | Mean length (percent identity) | Bias | SD | 95% credibility limit: ensemble | 95% credibility limit: first cluster | 95% credibility limit: second cluster | PPV–SEN area: ensemble | PPV–SEN area: first cluster | PPV–SEN area: second cluster | No. of samples: first+second | No. of samples: first cluster | No. of samples: second cluster | Separation index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T-box | tRNA | 244 (45) | 0.10 | 0.01 | 0.06 | 0.04 | 0.02 | 0.58 | 0.55 | 0.47 | 926 | 826 | 100 | 1.00 |
| t-RNA | tRNA | 73 (45) | 0.02 | 0.01 | 0.03 | 0.01 | 0.01 | 1.00 | 0.99 | 0.91 | 949 | 888 | 61 | 2.50 |
| 5S-rRNA | rRNA | 116 (57) | 0.17 | 0.02 | 0.07 | 0.05 | 0.03 | 0.70 | 0.70 | 0.67 | 922 | 751 | 171 | 0.88 |
| 5-8S-rRNA | rRNA | 154 (61) | 0.18 | 0.03 | 0.14 | 0.10 | 0.08 | 0.43 | 0.42 | 0.26 | 907 | 744 | 163 | 0.56 |
| Retroviral-psi | Rviral | 117 (92) | 0.07 | 0.05 | 0.15 | 0.11 | 0.05 | 0.99 | 0.99 | 0.47 | 981 | 952 | 29 | 1.25 |
| U1 | sRNA | 157 (59) | 0.16 | 0.02 | 0.06 | 0.06 | 0.02 | 0.69 | 0.69 | 0.63 | 988 | 928 | 60 | 1.13 |
| U2 | sRNA | 182 (62) | 0.08 | 0.02 | 0.05 | 0.05 | 0.02 | 0.90 | 0.90 | 0.71 | 981 | 941 | 40 | 1.14 |
| Sno-14q-I-II | sRNA | 75 (64) | 0.07 | 0.03 | 0.12 | 0.08 | 0.07 | 1.00 | 0.92 | 0.86 | 838 | 636 | 202 | 0.47 |
| Lysine | riboswitch | 181 (49) | 0.07 | 0.02 | 0.06 | 0.05 | 0.03 | 0.94 | 0.93 | 0.84 | 983 | 923 | 60 | 0.88 |
| RFN | riboswitch | 140 (66) | 0.15 | 0.03 | 0.11 | 0.06 | 0.06 | 0.68 | 0.64 | 0.60 | 820 | 574 | 246 | 0.58 |
| THI | riboswitch | 105 (55) | 0.08 | 0.02 | 0.07 | 0.06 | 0.02 | 0.89 | 0.88 | 0.75 | 968 | 869 | 99 | 1.13 |
| S-box | riboswitch | 107 (66) | 0.09 | 0.02 | 0.07 | 0.03 | 0.03 | 0.88 | 0.87 | 0.74 | 945 | 806 | 139 | 1.17 |
| IRES-HCV | Cis | 261 (94) | 0.25 | 0.05 | 0.21 | 0.16 | 0.08 | 0.61 | 0.58 | 0.44 | 936 | 877 | 59 | 1.00 |
| SECIS | Cis | 64 (41) | 0.17 | 0.02 | 0.08 | 0.02 | 0.02 | 0.74 | 0.71 | 0.72 | 840 | 679 | 161 | 1.50 |
| UnaL2 | Cis | 54 (73) | 0.18 | 0.03 | 0.06 | 0.02 | 0.02 | 0.33 | 0.62 | 0.61 | 867 | 752 | 115 | 1.00 |
| SRP-bact | srpRNA | 93 (47) | 0.16 | 0.03 | 0.12 | 0.04 | 0.04 | 0.79 | 0.78 | 0.70 | 834 | 646 | 188 | 2.75 |
| SRP-euk-arch | srpRNA | 291 (40) | 0.23 | 0.01 | 0.04 | 0.03 | 0.02 | 0.49 | 0.48 | 0.47 | 921 | 837 | 84 | 0.80 |
| Average | | 142 | 0.13 | 0.02 | 0.09 | 0.06 | 0.04 | 0.76 | 0.74 | 0.63 | 926 | 826 | 100 | 0.90 |
We calculated the average area under the PPV–SEN curve for accuracy comparison, as well as statistics from the cluster analysis (bias, SD, credibility limit and separation index), to better understand the posterior secondary structure space.
Fig. 4. 2D plot of bias per base pair and the area under the PPV–SEN curve of the ensemble centroid for the 17 RNA families in Table 2. The results for each family are represented by a symbol indicating their functional group.