Abstract
BACKGROUND: RNA secondary structure prediction methods based on probabilistic modeling can be developed using stochastic context-free grammars (SCFGs). Such methods can readily combine different sources of information that can be expressed probabilistically, such as an evolutionary model of comparative RNA sequence analysis and a biophysical model of structure plausibility. However, the number of free parameters in an integrated model for consensus RNA structure prediction can become untenable if the underlying SCFG design is too complex. Thus a key question is, what small, simple SCFG designs perform best for RNA secondary structure prediction?
Year: 2004 PMID: 15180907 PMCID: PMC442121 DOI: 10.1186/1471-2105-5-71
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Examples of CFG parse trees for an example RNA structure. Left: an RNA secondary structure with two stems. Middle: a parse tree for that structure using the grammar S → aSâ | aS | Sa | SS | ε, with nonterminals in red and terminals in black. Note the correspondence between the RNA structure and the structure of the parse tree; that the individual steps in the grammar correspond to base pairs and single nucleotides (that is, the grammar is used to factor the structure down into individually scored steps); and that the RNA sequence can be read off the parse tree by following the margin of the tree counterclockwise. Right: an alternative parse tree for the same structure, demonstrating that this grammar is structurally ambiguous.
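The structural ambiguity described in the caption can be made concrete by counting parse trees. The following is a minimal sketch, not the authors' code: it assumes the grammar's pairing production has the form S → a S â, represents a structure in dot-bracket notation (an assumption of this example, not something the caption uses), and counts a bifurcation S → SS only when both halves are non-empty, since otherwise ε productions would make the count infinite. The function name `count_parses` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(structure: str) -> int:
    """Count parse trees of S -> a S a' | a S | S a | S S | epsilon for one structure.

    `structure` uses dot-bracket notation: '(' and ')' mark a base pair and
    '.' marks an unpaired nucleotide. Assumption of this sketch: a bifurcation
    S -> S S is counted only when both halves are non-empty.
    """
    if structure == "":
        return 1  # S -> epsilon
    total = 0
    if structure[0] == ".":
        total += count_parses(structure[1:])    # S -> a S (unpaired on the left)
    if structure[-1] == ".":
        total += count_parses(structure[:-1])   # S -> S a (unpaired on the right)
    if len(structure) >= 2 and structure[0] == "(" and structure[-1] == ")":
        total += count_parses(structure[1:-1])  # S -> a S a' (base pair)
    for k in range(1, len(structure)):
        # S -> S S, splitting the structure into two non-empty halves
        total += count_parses(structure[:k]) * count_parses(structure[k:])
    return total


if __name__ == "__main__":
    for s in [".", "()", "(.)", "(.)(.)"]:
        print(s, count_parses(s))
```

Under these assumptions a single unpaired nucleotide already has two parse trees (one via S → aS, one via S → Sa), so any count above 1 means the same structure can be derived in more than one way. An unambiguous grammar would give exactly 1 for every structure it can describe.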
Figure 2. Example parse trees for four different unambiguous grammars, G3-G6. Each grammar's production rules define how the example structure will be described. Some aspects seem artificial, having more to do with the constraints of being unambiguous than with biologically meaningful features of RNA. G3 must bifurcate to accommodate a bulge on one side, but not the other. G4 shows a stutter-step behavior in stems, with a cycle of S ⇒ T ⇒ aSâ productions for each base pair. G5 (rightmost) uses a bifurcation and a null production for every base pair. G6 uses bifurcations at every single-stranded residue.
Simple grammar specifications.
| Grammar | NT | Parameters: total | Parameters: tied | Memory (MB) | Notes |
|---|---|---|---|---|---|
| G1 | 1 | 29 (25) | 25 (22) | 26.9 | ambiguous |
| G3 | 3 | 56 (47) | 28 (23) | 79.4 | min 1 nt loop; no ε string |
| G4 | 2 | 46 (40) | 26 (22) | 53.2 | none |
| G5 | 1 | 23 (20) | 23 (20) | 26.9 | none |
| G6 | 3 | 42 (36) | 26 (21) | 79.4 | min 2 nt loop; no ε string |
Each simple grammar requires a different number of nonterminals (NT) and parameters (with free parameters in parentheses). Memory requirements were determined empirically by measuring the memory each grammar uses to fold a single C. elegans large subunit ribosomal RNA sequence (3662 nucleotides). Under "Notes", we give some of the implications each grammar has for the language it describes.
Stacking grammar specifications.
| Grammar | NT | Parameters: total | Parameters: tied | Memory (MB) | Notes |
|---|---|---|---|---|---|
| G2 | 2 | 287 (281) | 283 (278) | 53.2 | ambiguous |
| G7 | 5 | 341 (326) | 289 (281) | 128 | min 1 nt loop; no lone pairs; no ε string |
| G8 | 4 | 347 (334) | 287 (280) | 103 | min 1 nt loop; no lone pairs |
| G6S | 3 | 282 (276) | 282 (276) | 79.4 | min 2 nt loop; no ε string |
Stacking adds nonterminals (NT) and total (free) parameters to a grammar. This translates into increased memory requirements, as shown by the folding of C. elegans large subunit ribosomal RNA (3662 nucleotides). Stacking may also introduce other restrictions, such as not permitting lone pairs.
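The link between nonterminal count and memory can be checked with simple arithmetic on the tabulated values. The sketch below divides each reported memory figure by the number of nonterminals and compares it with a back-of-the-envelope estimate. The estimate assumes one triangular dynamic-programming matrix per nonterminal with 4-byte cells; that storage layout, the cell size, and the reading of the memory column as megabytes are assumptions of this example, not details given in the text.

```python
# (nonterminals, reported memory) copied from the two specification tables.
# The memory unit is assumed to be megabytes.
reported = {
    "G1": (1, 26.9),
    "G3": (3, 79.4),
    "G4": (2, 53.2),
    "G5": (1, 26.9),
    "G6": (3, 79.4),
    "G2": (2, 53.2),
    "G7": (5, 128.0),
    "G8": (4, 103.0),
    "G6 (stacking)": (3, 79.4),
}

SEQ_LEN = 3662      # length of the C. elegans LSU rRNA used for the measurement
BYTES_PER_CELL = 4  # assumption: one 4-byte score per matrix cell

# Reported memory per nonterminal for each grammar.
per_nt = {name: mem / nt for name, (nt, mem) in reported.items()}

# Hypothetical estimate: one upper-triangular L x L matrix per nonterminal.
cells = SEQ_LEN * (SEQ_LEN + 1) // 2
estimate_mib = cells * BYTES_PER_CELL / 2**20

if __name__ == "__main__":
    for name, value in per_nt.items():
        print(f"{name}: {value:.1f} per nonterminal")
    print(f"estimated triangular matrix: {estimate_mib:.1f} MiB")
```

Under these assumptions the per-nonterminal figures cluster within a narrow band, which is consistent with memory growing roughly linearly in the number of nonterminals for a fixed sequence length.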
Grammar performance.
| Sensitivity % (PPV %) | RNase P | SRP | tmRNA |
|---|---|---|---|
| G1 | 14 (11) | 37 (32) | 10 (6) |
| G3 | 37 (35) | 28 (28) | 31 (22) |
| G4 | 10 (8) | 19 (17) | 4 (2) |
| G5 | 3 (4) | 2 (3) | 4 (3) |
| G6 | 49 (49) | 47 (49) | 44 (33) |
| G2 | 31 (23) | 59 (48) | 33 (17) |
| G7 | 46 (46) | 50 (52) | 40 (30) |
| G8 | 46 (46) | 52 (53) | 44 (32) |
| G6S | 50 (50) | 50 (51) | 44 (32) |
| | 56 (51) | 70 (66) | 46 (30) |
| Vienna RNA | 55 (51) | 67 (64) | 45 (30) |
| PKNOTS | 53 (46) | 58 (55) | 38 (24) |
| RNAstructure | 58 (52) | 62 (58) | 46 (30) |
| Pfold | 42 (76) | 35 (64) | 33 (54) |
The first section contains simple grammars, the second contains stacking grammars, and the last contains other available software packages that predict secondary structure. The nine grammars were trained on rRNA as described in the text. The second column gives the performance on the full benchmarking dataset, which is subdivided by family in the subsequent columns. The metrics are calculated over base pairs as described in the text; evaluating the metrics over individual sequences shows similar results (data not shown).
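As an illustration of metrics "calculated over base pairs", the sketch below pools base pairs across a set of sequences before computing sensitivity (correctly predicted pairs divided by pairs in the trusted structure) and PPV (correctly predicted pairs divided by all predicted pairs). These are the standard definitions and are assumed here; the dot-bracket input format and the function names `base_pairs` and `pooled_sensitivity_ppv` are hypothetical choices for this example, not the authors' evaluation code.

```python
from typing import Iterable, Set, Tuple


def base_pairs(structure: str) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) base pairs in a well-nested dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def pooled_sensitivity_ppv(
    examples: Iterable[Tuple[str, str]]
) -> Tuple[float, float]:
    """Compute sensitivity and PPV pooled over all base pairs.

    `examples` yields (trusted_structure, predicted_structure) pairs.
    """
    n_true = n_pred = n_correct = 0
    for trusted, predicted in examples:
        true_pairs = base_pairs(trusted)
        pred_pairs = base_pairs(predicted)
        n_true += len(true_pairs)
        n_pred += len(pred_pairs)
        n_correct += len(true_pairs & pred_pairs)
    sensitivity = n_correct / n_true if n_true else 0.0
    ppv = n_correct / n_pred if n_pred else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    data = [("((..))", "(....)"), ("(..)", "(..)")]
    sens, ppv = pooled_sensitivity_ppv(data)
    print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

Pooling weights every base pair equally, so long sequences contribute more than short ones. Averaging per-sequence values instead would weight every sequence equally, which is the alternative evaluation the caption says gives similar results.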