Mirela Andronescu, Vera Bereg, Holger H Hoos, Anne Condon.
Abstract
BACKGROUND: The ability to access, search and analyse secondary structures of a large set of known RNA molecules is very important for deriving improved RNA energy models, for evaluating computational predictions of RNA secondary structures and for a better understanding of RNA folding. Currently there is no database that can easily provide these capabilities for almost all RNA molecules with known secondary structures.
Year: 2008 PMID: 18700982 PMCID: PMC2536673 DOI: 10.1186/1471-2105-9-340
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. RNA secondary structure example. Schematic representation of the secondary structure of the RNase P RNA molecule of Methanococcus maripaludis from the RNase P Database; the RNA STRAND ID for this molecule is ASE_00199. Solid grey lines represent the ribose-phosphate backbone. Dotted grey lines represent missing nucleotides. Solid circles mark base pairs. Dashed boxes mark structural features.
We define an RNA secondary structure as a set of base pairs [22]. In this work, we consider all C-G, A-U and G-U base pairs as canonical, and all other base pairs as non-canonical. However, we note that from the point of view of the planar edge-to-edge hydrogen bonding interaction [42], there are C-G, A-U and G-U base pairs that do not interact via Watson-Crick edges, and vice versa [14,42]. Comparative sequence analysis tools do not currently describe bond types.
A number of structural motifs can be identified in a secondary structure. A stem is composed of one or more consecutive base pairs. A hairpin loop contains one closing base pair, and all the bases between the paired bases are unpaired. An internal loop is a loop with two closing base pairs, and all bases between them are unpaired. A bulge loop can be seen as a variant of an internal loop in which there are no unpaired bases on one side. A multi-loop is a loop that has at least three closing base pairs; stems emanating from these base pairs are called multi-loop branches. A pseudoknot is a structural motif that involves non-nested, crossing base pairs.
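The definitions above can be illustrated with a minimal Python sketch. It assumes a structure is given as a collection of 0-based index pairs (i, j) over the sequence and that each base takes part in at most one pair; the function names and the toy sequence are hypothetical and are not part of RNA STRAND.

```python
# Canonical pairs as defined in the text: C-G, A-U and G-U (order-independent).
CANONICAL = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def is_canonical(base_i, base_j):
    """Return True if the two bases form a C-G, A-U or G-U pair."""
    return frozenset((base_i.upper(), base_j.upper())) in CANONICAL


def split_pairs(sequence, pairs):
    """Split base pairs into (canonical, non_canonical) lists."""
    canonical, non_canonical = [], []
    for i, j in pairs:
        target = canonical if is_canonical(sequence[i], sequence[j]) else non_canonical
        target.append((i, j))
    return canonical, non_canonical


def has_pseudoknot(pairs):
    """Return True if any two pairs cross, i.e. i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


# Toy input (assumption, for illustration only).
toy_sequence = "GCAAGU"
toy_pairs = [(0, 5), (1, 4), (2, 3)]
canonical, non_canonical = split_pairs(toy_sequence, toy_pairs)
```

The pairwise crossing check is quadratic in the number of base pairs, which is adequate for a sketch; it only detects a pseudoknot and does not classify it.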
Figure 2. Database schema. Construction of RNA STRAND, from data collection to data presentation via dynamic web pages.
The main RNA types included in RNA STRAND v2.0.
| RNA type | Main source(s) | # entries | Length (mean) | Length (std) | % PKBP (mean) | % PKBP (std) |
| Transfer messenger RNA | tmRDB | 726 | 368 | 86 | 21.0 | 6.1 |
| 16S ribosomal RNA | CRW | 723 | 1529 | 286 | 1.8 | 0.5 |
| Transfer RNA | Sprinzl DB | 707 | 76 | 21 | 0.1 | 2.3 |
| Ribonuclease P RNA | RNase P DB | 470 | 323 | 71 | 5.7 | 3.2 |
| Signal rec. particle RNA | SRPDB | 394 | 220 | 111 | 0.0 | 0.0 |
| 23S ribosomal RNA | CRW | 205 | 2699 | 716 | 2.4 | 1.1 |
| 5S ribosomal RNA | CRW | 161 | 115 | 21 | 0.0 | 0.0 |
| Group I intron | CRW | 152 | 563 | 412 | 5.8 | 2.2 |
| Hammerhead ribozyme | Rfam | 146 | 61 | 24 | 0.0 | 0.0 |
| Group II intron | CRW | 42 | 1298 | 829 | 1.4 | 3.5 |
| All molecules | All of the above | 4666 | 527 | 722 | 5.3 | 9.1 |
Overview of the main RNA types in version 2.0 of the RNA STRAND database, their provenance, the number of RNAs, and the mean and standard deviation of the length for each type. % PKBP denotes the percentage of the base pairs that need to be removed in order to render the structure pseudoknot-free. Most of the major RNA types are represented by a large number of molecules.
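As an illustration of the % PKBP measure, the following Python sketch computes the minimum number of base pairs to remove as the total number of pairs minus the size of the largest non-crossing subset, found by dynamic programming. This is one way to obtain such a minimum and is an assumption; the caption does not state which procedure was used. The function names are hypothetical, pairs are (i, j) index tuples, and each position is assumed to occur in at most one pair.

```python
def max_nested_pairs(pairs):
    """Size of the largest subset of pairs that contains no crossing pairs."""
    # Keep only paired positions and renumber them 0..m-1.
    positions = sorted({p for pair in pairs for p in pair})
    index = {p: rank for rank, p in enumerate(positions)}
    m = len(positions)
    partner = [0] * m
    for i, j in pairs:
        a, b = index[i], index[j]
        partner[a], partner[b] = b, a

    # best[i][j]: optimum over the half-open range of positions i..j-1.
    best = [[0] * (m + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(i + 1, m + 1):
            value = best[i + 1][j]  # option 1: drop the pair that opens at i
            k = partner[i]
            if i < k < j:  # option 2: keep pair (i, k) and split the range
                value = max(value, 1 + best[i + 1][k] + best[k + 1][j])
            best[i][j] = value
    return best[0][m]


def percent_pkbp(pairs):
    """Percentage of base pairs to remove to make the structure pseudoknot-free."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    removed = len(pairs) - max_nested_pairs(pairs)
    return 100.0 * removed / len(pairs)


# Toy structure (assumption): pair (4, 12) crosses the three nested pairs.
toy_pairs = [(0, 9), (1, 8), (2, 7), (4, 12)]
print(percent_pkbp(toy_pairs))  # prints 25.0
```

The table uses quadratic memory in the number of paired positions, so a long molecule such as a 23S ribosomal RNA would call for a more economical formulation than this sketch.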
Statistics on the complexity of pseudoknots in RNA STRAND v2.0.
| RNA type | # entries | Stem length (median) | Stem length (mean) | Stem length (std) | # PKBP (median) | # PKBP (mean) | # PKBP (std) |
| 16S ribosomal RNA | 644 | 4.00 | 4.30 | 2.50 | 3.00 | 2.50 | 0.68 |
| 23S ribosomal RNA | 93 | 4.00 | 4.14 | 2.39 | 2.00 | 3.75 | 3.12 |
| Transfer messenger RNA | 657 | 4.00 | 4.11 | 2.24 | 5.00 | 5.51 | 1.00 |
| Ribonuclease P RNA | 433 | 4.00 | 4.45 | 2.51 | 4.00 | 5.18 | 1.36 |
| All, non-redundant | 4104 | 4.00 | 4.35 | 2.44 | 4.00 | 4.14 | 1.86 |
| All, non-redundant & normalised | 4104 | 4.96 | 5.05 | 0.58 | 4.65 | 4.95 | 1.78 |
The columns represent the RNA type, the number of entries for each type, the median, mean and standard deviation of the stem length (i.e., number of adjacent base pairs) and the minimum number of base pairs to break in order to open pseudoknots (# PKBP). For each row, a non-redundant set was selected, and outliers were removed.
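The stem length statistic can be sketched in Python as follows, reading "adjacent base pairs" strictly: pair (i, j) is followed by pair (i+1, j-1) with no bulge in between. That reading, the function names, and the use of the population standard deviation are assumptions; the caption does not specify them.

```python
import statistics


def stem_lengths(pairs):
    """Lengths of maximal runs of stacked pairs (i, j), (i+1, j-1), ..."""
    pair_set = {(min(p), max(p)) for p in pairs}
    lengths = []
    for i, j in sorted(pair_set):
        if (i - 1, j + 1) in pair_set:
            continue  # not the outermost pair of its stem
        length = 0
        while (i + length, j - length) in pair_set:
            length += 1
        lengths.append(length)
    return lengths


def summarize(values):
    """Median, mean and population standard deviation of a non-empty list."""
    return (
        statistics.median(values),
        statistics.mean(values),
        statistics.pstdev(values),
    )


# Toy structure (assumption): one stem of three pairs and one of two pairs.
toy_pairs = [(0, 20), (1, 19), (2, 18), (5, 15), (6, 14)]
toy_lengths = stem_lengths(toy_pairs)  # [3, 2]
median, mean, std = summarize(toy_lengths)
```

The same `summarize` helper could be applied to per-structure # PKBP counts to obtain the remaining columns.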
Figure 3. Histogram of the occurrence of non-canonical base pairs. Histogram of non-canonical base pairs in the 729 non-redundant entries whose structures were determined by NMR or X-ray crystallography.
Figure 4. Prediction accuracy achieved by various energy models. Sensitivity vs. positive predictive value (PPV) of various secondary structure prediction methods. Sensitivity is the number of correctly predicted base pairs divided by the number of base pairs in the reference structure; PPV is the number of correctly predicted base pairs divided by the number of predicted base pairs. Higher prediction accuracy is achieved when the free energy parameters are obtained by training on a larger set of structures. The CONTRAfold prediction program uses a trade-off parameter γ between sensitivity and PPV, and thus we report predictions for γ ranging from 2 to 20.
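The two accuracy measures defined in the caption can be computed as in the short Python sketch below. It assumes both structures are given as collections of (i, j) index pairs and that a predicted pair counts as correct only if it matches a reference pair exactly; the function name and the toy data are hypothetical.

```python
def sensitivity_ppv(predicted, reference):
    """Return (sensitivity, PPV) for a predicted vs. a reference structure."""
    # Normalise each pair so that (i, j) and (j, i) compare equal.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    correct = len(pred & ref)
    # Convention chosen here (assumption): an empty denominator yields 0.0.
    sensitivity = correct / len(ref) if ref else 0.0
    ppv = correct / len(pred) if pred else 0.0
    return sensitivity, ppv


# Toy data (assumption): two of three predicted pairs are in the reference.
reference = [(0, 9), (1, 8), (2, 7), (3, 6)]
predicted = [(0, 9), (1, 8), (4, 6)]
sens, ppv = sensitivity_ppv(predicted, reference)
```

Here sensitivity is 2/4 and PPV is 2/3, which shows how a method can trade one measure against the other.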