Abstract
Structural elements in RNA molecules have a distinct nucleotide composition, which changes gradually over evolutionary time. We discovered certain features of these compositional patterns that are shared between all RNA families. Based on this information, we developed a structure prediction method that evaluates candidate structures for a set of homologous RNAs on their ability to reproduce the patterns exhibited by biological structures. The method is named SPuNC for 'Structure Prediction using Nucleotide Composition'. In a performance test on a diverse set of RNA families we demonstrate that the SPuNC algorithm succeeds in selecting the most realistic structures in an ensemble. The average accuracy of top-scoring structures is significantly higher than the average accuracy of all ensemble members (improvements of more than 20% observed). In addition, a consensus structure that includes the most reliable base pairs gleaned from a set of top-scoring structures is generally more accurate than a consensus derived from the full structural ensemble. Our method achieves better accuracy than existing methods on several RNA families, including novel riboswitches and ribozymes. The results clearly show that nucleotide composition can be used to reveal the quality of RNA structures and thus the presented technique should be added to the set of prediction tools.
Year: 2009 PMID: 19129237 PMCID: PMC2655677 DOI: 10.1093/nar/gkn987
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. RNA structure and nucleotide composition. (A) Alignment of three RNA sequences with a reference structure (both Vienna structure and pairing mask) and the corresponding structural classification. (B) Secondary structure diagram indicating four different structural categories: S = stem, L = loop, B = bulge, O = other. (C) RNA composition space, containing the nucleotide composition of 80 SSU rRNA sequences decomposed under the E. coli reference structure [source: Comparative RNA Web 23]. The space contains five distributions: S = stem, L = loop, B = bulge, O = other, T = total sequence. The inset shows the compositional patterns generated by an incorrect structure, containing more scatter and overlap in the loops and bulges and increased scatter and lower GC content in the stems.
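The decomposition described in the caption can be illustrated with a minimal Python sketch. It assumes that each sequence comes with a per-position classification string using the letters S, L, B and O, and it reports plain nucleotide fractions per category; the function name `composition_by_category` and the toy hairpin are hypothetical, and the mapping onto the axes of the composition space is not reproduced here.

```python
from collections import Counter

NUCLEOTIDES = "ACGU"

def composition_by_category(sequence, classification):
    """Return {category: {nucleotide: fraction}} for one aligned sequence.

    Assumption: `classification` has one letter per alignment column
    (S = stem, L = loop, B = bulge, O = other). The key "T" is added
    for the total sequence, mirroring the T distribution in the figure.
    """
    if len(sequence) != len(classification):
        raise ValueError("sequence and classification must have equal length")
    counts = {}
    for base, category in zip(sequence.upper(), classification):
        if base not in NUCLEOTIDES:  # skip gaps and ambiguous symbols
            continue
        counts.setdefault(category, Counter())[base] += 1
        counts.setdefault("T", Counter())[base] += 1
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        result[category] = {n: counter[n] / total for n in NUCLEOTIDES}
    return result

# Toy hairpin (illustrative only): a 3-bp stem closing a 4-nt loop.
seq = "GGCAAAAGCC"
cls = "SSSLLLLSSS"
comp = composition_by_category(seq, cls)
```

Applying the same function to every sequence in an alignment would give one point per sequence and category, which is the kind of scatter the composition space in panel C summarizes.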
Compositional properties
| Metric | Structure | Axis | Reference Z-score (trained) | Reference Z-score (extreme) | Weight (weighted) | Weight (unweighted) |
|---|---|---|---|---|---|---|
| SD | S | UC | −1.270 | −2 | 1/0.674 | 1 |
| SD | S | UG | −0.985 | −2 | 1/0.535 | 1 |
| MEAN | S | UA | −0.686 | −1 | 1/0.694 | 1 |
| SD | LB | UX | −0.519 | −1 | 1/0.806 | 1 |
| MEAN | LBO | UA | +0.814 | +2 | 1/0.719 | 1 |
Metric: standard deviation (SD) or mean (MEAN).
Structure: structural element; stem (S), loop (L), bulge (B), other (O).
Axis: compositional axis; UX indicates the average over all three axes.
Reference Z-score: we experimented both with trained Z-scores, as observed in the training dataset, and with exaggerations of these (extreme Z-scores).
Weights: compositional properties in the scoring function can be weighted or not. Each weight is one divided by the SD of the Z-scores observed in the training set.
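To make the roles of the reference Z-scores and the weights concrete, here is a hedged Python sketch. The table values are taken from above, but the way they are combined is an assumption: this record does not state the scoring formula, so the sketch uses a negative weighted absolute distance between the observed Z-scores of a candidate structure and the reference Z-scores. The names `PROPERTIES` and `score` are hypothetical.

```python
# Rows of the table above:
# (metric, structure, axis, trained Z, extreme Z, SD of training Z-scores)
PROPERTIES = [
    ("SD",   "S",   "UC", -1.270, -2.0, 0.674),
    ("SD",   "S",   "UG", -0.985, -2.0, 0.535),
    ("MEAN", "S",   "UA", -0.686, -1.0, 0.694),
    ("SD",   "LB",  "UX", -0.519, -1.0, 0.806),
    ("MEAN", "LBO", "UA", +0.814, +2.0, 0.719),
]

def score(observed_z, use_extreme=True, weighted=True):
    """Score one candidate structure from its five observed Z-scores.

    ASSUMED form (not confirmed by the source): the negative weighted
    absolute distance to the reference Z-scores, so higher is better
    and 0.0 means the reference pattern is reproduced exactly.
    """
    total = 0.0
    for z, (_, _, _, trained, extreme, sd) in zip(observed_z, PROPERTIES):
        reference = extreme if use_extreme else trained
        weight = 1.0 / sd if weighted else 1.0  # weight = 1 / SD, as in the table
        total -= weight * abs(z - reference)
    return total

# A candidate that matches the extreme reference on every property
perfect = score([-2.0, -2.0, -1.0, -1.0, 2.0])
# A candidate that is off by one Z-unit on the first property only
off_by_one = score([-1.0, -2.0, -1.0, -1.0, 2.0])
```

With weighting enabled, a deviation on a property whose training Z-scores vary little (small SD) costs more than the same deviation on a noisier property.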
Figure 2. Application of the scoring function. Data are shown for an alignment of 20 5S rRNA sequences and an ensemble of 963 unique structures. The scoring function used the extreme Z-scores and weighted properties. Consensus structures were calculated with a top-cutoff of 10% and a bp-cutoff of 0.4. The accuracy of the structures is expressed by the correlation coefficient (CC). In both sets, the mean accuracy is indicated with a dotted line and the accuracy of the consensus structure with a solid line. The height of a bar corresponds to the fraction of structures in the corresponding accuracy range, and all bars within one set add up to 1.0.
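The consensus step mentioned in the caption can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: it assumes that the top-cutoff is the fraction of best-scoring structures retained and that the bp-cutoff is the minimum fraction of those structures in which a base pair must occur. The function name `consensus_pairs` and the toy ensemble are hypothetical.

```python
from collections import Counter

def consensus_pairs(scored_structures, top_cutoff=0.1, bp_cutoff=0.4):
    """Build a consensus from the best-scoring structures.

    scored_structures: list of (score, pairs), where pairs is a set of
    (i, j) index tuples with i < j and a higher score is better.
    Assumption: a pair is kept when it occurs in at least `bp_cutoff`
    of the retained structures. Conflicting pairs that share a
    position are NOT resolved in this sketch.
    """
    ranked = sorted(scored_structures, key=lambda item: item[0], reverse=True)
    n_top = max(1, round(len(ranked) * top_cutoff))
    top = ranked[:n_top]
    counts = Counter(pair for _, pairs in top for pair in pairs)
    return {pair for pair, c in counts.items() if c / n_top >= bp_cutoff}

# Toy ensemble of four structures (illustrative only)
ensemble = [
    (-1.0, {(0, 9), (1, 8)}),
    (-2.0, {(0, 9), (2, 7)}),
    (-5.0, {(3, 6)}),
    (-6.0, {(3, 6)}),
]
loose = consensus_pairs(ensemble, top_cutoff=0.5, bp_cutoff=0.4)
strict = consensus_pairs(ensemble, top_cutoff=0.5, bp_cutoff=0.6)
```

Because a bp-cutoff below 0.5 can admit two pairs that compete for the same nucleotide, a real implementation would need a rule for resolving such conflicts.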
Alignments and reference structures
| Set | RNA family | Seq | Len | Bps | PK |
|---|---|---|---|---|---|
| T | tRNA-PHE | 20 | 77 | 21 | no |
| T | 5S rRNA | 20 | 122 | 37 | no |
| T | 16S rRNA | 20 | 1546 | 478 | yes |
| V | Hammerhead rz. | 8 | 51 | 14 | no |
| V | Purine rs. | 12 | 77 | 23 | yes |
| V | TPP rs. | 12 | 102 | 20 | no |
| V | glmS rz. | 13 | 172 | 52 | yes |
| B | tRNA-PHE (H) | 11 | 73 | 20 | no |
| B | tRNA-PHE (M) | 11 | 74 | 20 | no |
| B | RNase P (H) | 9 | 385 | 122 | yes |
| B | RNase P (M) | 11 | 431 | 122 | yes |
| B | SSU rRNA (H) | 11 | 1551 | 478 | yes |
| B | SSU rRNA (M) | 11 | 1598 | 478 | yes |
| B | LSU rRNA (H) | 12 | 2940 | 869 | yes |
| B | LSU rRNA (M) | 12 | 3197 | 869 | yes |
Set, T = training, V = validation, B = benchmark; rz., ribozyme; rs., riboswitch; Seq, number of sequences in the alignment; Len, length of the alignment; Bps, number of base pairs in the reference structure; PK, whether the reference structure contains pseudoknots.
Prediction accuracy (CC) for 15 alignments using extreme reference Z-scores and weighted properties in the scoring function, a top-cutoff of 0.1, and a bp-cutoff of 0.4
| Alignment | Ensemble size | Ensemble mean | Ensemble consensus | Top mean | Top consensus | RNAfold | RNAalifold |
|---|---|---|---|---|---|---|---|
| tRNA-PHE | 311 | 0.579 | 0.976 | 0.815 | 1.000 | 0.748 | 1.000 |
| 5S rRNA | 963 | 0.446 | 0.638 | 0.744 | 0.919 | 0.521 | 0.839 |
| 16S rRNA | 908 (1000) | 0.379 (0.492) | 0.555 (0.696) | 0.465 (0.737) | 0.577 (0.865) | 0.388 | 0.465 |
| Training: mean accuracy | | 0.468 (0.506) | 0.723 (0.770) | 0.675 (0.765) | 0.832 (0.928) | 0.553 | 0.768 |
| Hammerhead rz. | 157 | 0.573 | 0.886 | 0.773 | 0.889 | 0.809 | 1.000 |
| Purine rs. | 336 | 0.736 | 0.861 | 0.813 | 0.909 | 0.841 | 0.861 |
| TPP rs. | 795 | 0.500 | 0.700 | 0.725 | 0.886 | 0.505 | 0.868 |
| glmS rz. | 857 | 0.553 | 0.697 | 0.610 | 0.823 | 0.579 | 0.745 |
| Validation: mean accuracy | | 0.591 | 0.786 | 0.730 | 0.877 | 0.683 | 0.869 |
| tRNA-PHE (H) | 576 | 0.477 | 0.592 | 0.518 | 0.462 | 0.390 | 0.950 |
| tRNA-PHE (M) | 547 | 0.573 | 0.837 | 0.865 | 0.976 | 0.611 | 0.976 |
| RNase P (H) | 999 (1000) | 0.500 (0.576) | 0.645 (0.751) | 0.572 (0.609) | 0.647 (0.700) | 0.420 | 0.698 |
| RNase P (M) | 990 (1000) | 0.385 (0.476) | 0.595 (0.666) | 0.464 (0.499) | 0.503 (0.672) | 0.351 | 0.617 |
| SSU rRNA (H) | 990 (1000) | 0.448 (0.606) | 0.622 (0.824) | 0.514 (0.749) | 0.617 (0.900) | 0.474 | 0.650 |
| SSU rRNA (M) | 990 (1000) | 0.412 (0.574) | 0.680 (0.838) | 0.461 (0.689) | 0.621 (0.869) | 0.415 | 0.808 |
| LSU rRNA (H) | 996 (1000) | 0.401 (0.559) | 0.605 (0.809) | 0.403 (0.752) | 0.579 (0.891) | 0.426 | 0.687 |
| LSU rRNA (M) | 996 (1000) | 0.353 (0.519) | 0.601 (0.798) | 0.393 (0.776) | 0.644 (0.902) | 0.386 | 0.798 |
| Benchmark: mean accuracy | | 0.444 (0.545) | 0.647 (0.764) | 0.524 (0.682) | 0.631 (0.796) | 0.434 | 0.773 |
Values in parentheses are the results obtained when the method was applied to an extended ensemble that contains the true structure in addition to the normal ensemble generated by RNAsubopt. Mean values in parentheses use the extended-ensemble results where available.
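For readers who want to reproduce an accuracy figure of this kind, the sketch below computes a base-pair correlation coefficient in Python. The exact definition of CC is not given in this record, so the sketch assumes the common approximation of the Matthews correlation coefficient as the geometric mean of sensitivity and positive predictive value; the function name `approx_cc` is hypothetical.

```python
import math

def approx_cc(predicted, reference):
    """Approximate correlation coefficient between two sets of base pairs.

    ASSUMPTION: CC is approximated as sqrt(sensitivity * PPV), where
    sensitivity = TP / |reference| and PPV = TP / |predicted|.
    Pairs are (i, j) tuples with i < j.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or references
    sensitivity = true_positives / len(reference)
    ppv = true_positives / len(predicted)
    return math.sqrt(sensitivity * ppv)

# Toy example: one of two predicted pairs is in the reference structure
reference_pairs = {(0, 9), (2, 7)}
predicted_pairs = {(0, 9), (1, 8)}
cc = approx_cc(predicted_pairs, reference_pairs)
```

Under this approximation a value of 1.000, as reported for several alignments in the table, would mean that the predicted and reference base-pair sets are identical.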