Karin Lagesen, Peter Hallin, Einar Andreas Rødland, Hans-Henrik Staerfeldt, Torbjørn Rognes, David W Ussery.
Abstract
The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server.
Year: 2007 PMID: 17452365 PMCID: PMC1888812 DOI: 10.1093/nar/gkm160
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The initial number of rRNA sequences and the number of sequences excluded for different reasons.
| Kingdom | Type | Initial count | Environmental samples | Incomplete sequences | Redundancy reduction | Total in HMM |
|---|---|---|---|---|---|---|
| Archaea | 5S | 58 | 0 | 0 | 10 | 48 |
| | 16S | 589 | 239 | 471 | 287 | 76 |
| | 23S | 37 | 0 | 18 | 8 | 15 |
| Bacteria | 5S | 461 | 0 | 0 | 101 | 360 |
| | 16S | 12 107 | 1429 | 10 723 | 2485 | 743 |
| | 23S | 398 | 0 | 155 | 130 | 127 |
| Eukaryotes | 5S | 316 | 0 | 0 | 33 | 283 |
| | 18S | 6585 | 24 | 5222 | 836 | 979 |
| | 28S | 157 | 0 | 91 | 8 | 58 |
Environmental samples were excluded due to lack of phylogenetic information. Sequences with too many unknown nucleotides at either end of the sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap, so the exclusion counts need not sum to the difference between the initial count and the total. The last column indicates the number of sequences used to build each HMM.
Figure 1. The graphs show conservation in the alignments as measured by information content: $C=\sum_i f_i \log_2(f_i/q_i)$, where $i$ sums over the four nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i=1/4$ is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment.
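The conservation measure in the caption can be sketched in a few lines of Python. This is an illustration only, not the authors' code: it assumes the usual relative-entropy form C = sum_i f_i log2(f_i / q_i) with a base-2 logarithm, and the function names, the IUPAC table and the truncation of the running-average window at the alignment ends are choices made for this sketch.

```python
import math

# IUPAC nucleotide codes mapped to the bases they can stand for.
# Gap symbols are spread over all four bases, as the caption describes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "ACGT", ".": "ACGT",
}

def column_information(column, background=0.25):
    """Information content (in bits) of one non-empty alignment column."""
    counts = dict.fromkeys("ACGT", 0.0)
    for symbol in column.upper():
        bases = IUPAC[symbol]
        for base in bases:
            # Ambiguous symbols and gaps are divided evenly among their bases.
            counts[base] += 1.0 / len(bases)
    total = sum(counts.values())
    info = 0.0
    for count in counts.values():
        f = count / total
        if f > 0:  # a zero frequency contributes nothing to the sum
            info += f * math.log2(f / background)
    return info

def running_average(values, window=75):
    """Centered running mean; the window is truncated at both ends (assumption)."""
    half = window // 2
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

def most_conserved_center(values, window=75):
    """Index of the center of the full-length window with the highest mean."""
    if len(values) < window:
        raise ValueError("alignment is shorter than the window")
    best_start = max(range(len(values) - window + 1),
                     key=lambda start: sum(values[start:start + window]))
    return best_start + window // 2
```

Under these assumptions a fully conserved column scores 2 bits, while a column with all four nucleotides at equal frequency, or a column of gaps only, scores 0 bits.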
The number of rRNAs annotated and predicted in the genomes that were examined.
| Kingdom | Type | Annotated | Same strand | Other strand | Not found | Full model predictions | Novel |
|---|---|---|---|---|---|---|---|
| Archaea ( | 5S | 56 (24) | 43 (21) | 1 (1) | 12 (8) | 47 (23) | 4 (3) |
| | 16S | 47 (25) | 45 (25) | 2 (2) | 0 (0) | 47 (27) | 2 (2) |
| | 23S | 47 (25) | 44 (24) | 2 (2) | 1 (1) | 46 (26) | 2 (2) |
| Bacteria ( | 5S | 1205 (285) | 1166 (285) | 30 (16) | 9 (5) | 1339 (320) | 173 (69) |
| | 16S | 1172 (299) | 1146 (299) | 22 (12) | 4 (4) | 1237 (320) | 91 (34) |
| | 23S | 1197 (297) | 1154 (291) | 22 (13) | 21 (12) | 1248 (313) | 94 (36) |
| Eukaryotes ( | 5S | 65 (7) | 46 (6) | 19 (1) | 0 (0) | 324 (9) | 278 (5) |
| | 18S | 13 (4) | 6 (4) | 0 (0) | 7 (2) | 13 (6) | 7 (3) |
| | 28S | 13 (5) | 12 (4) | 0 (0) | 1 (1) | 19 (7) | 7 (3) |
The table gives the number of annotations and splits them into those matching predictions on the same strand, those matching on the other strand, and those not found. The total number of full model predictions is given. Novel predictions are full model predictions not matching any annotation on the same strand, and include those annotated on the other strand. Numbers in parentheses indicate the number of genomes. It should be noted that the eukaryotic annotated count is somewhat uncertain due to ambiguous rRNA annotations. The analyzed genomes were taken from the GenomeAtlas database, a database of all available fully sequenced genomes.
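The definitions in this note imply two simple identities that can be checked against the table. The sketch below assumes that each annotation matches at most one prediction, so that the novel count equals the full model predictions minus the same-strand matches. The dictionary and function names are made up for the example, and only the counts outside the parentheses are used.

```python
# (kingdom, type): (annotated, same strand, other strand, not found,
#                   full model predictions, novel), copied from the table.
RRNA_COUNTS = {
    ("Archaea", "5S"): (56, 43, 1, 12, 47, 4),
    ("Archaea", "16S"): (47, 45, 2, 0, 47, 2),
    ("Archaea", "23S"): (47, 44, 2, 1, 46, 2),
    ("Bacteria", "5S"): (1205, 1166, 30, 9, 1339, 173),
    ("Bacteria", "16S"): (1172, 1146, 22, 4, 1237, 91),
    ("Bacteria", "23S"): (1197, 1154, 22, 21, 1248, 94),
    ("Eukaryotes", "5S"): (65, 46, 19, 0, 324, 278),
    ("Eukaryotes", "18S"): (13, 6, 0, 7, 13, 7),
    ("Eukaryotes", "28S"): (13, 12, 0, 1, 19, 7),
}

def row_is_consistent(annotated, same, other, not_found, full, novel):
    """Check the two identities implied by the table note."""
    # Every annotation is matched on the same strand, on the other strand,
    # or not found at all.
    annotations_add_up = same + other + not_found == annotated
    # Novel predictions are the full model predictions without a
    # same-strand annotation (assumes one prediction per annotation).
    novel_adds_up = full - same == novel
    return annotations_add_up and novel_adds_up

for key, row in RRNA_COUNTS.items():
    print(key, "consistent" if row_is_consistent(*row) else "inconsistent")
```

Both identities hold for all nine rows of the table.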
Figure 2. Deviation of start and stop positions between predicted and annotated RNA, presented as pairs of panels. The number of predictions among the archaea, bacteria and eukaryotes is given beneath each panel group heading. The zero position in each panel corresponds to the annotation start or stop position, with predicted positions presented relative to it. The yellow dot indicates the median deviation and the black box the quartile range. The hinges extend from the side of the box to the data point that is closest to, but does not exceed, 1.5 times the interquartile range. The curves show the density of the distribution.
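The summary statistics drawn in each panel (median, quartile box and hinge ends) can be computed as in the following sketch. The deviation values are invented for illustration and the quartile convention (Python's `inclusive` method) is an assumption, since the caption does not say which one was used to draw the figure.

```python
import statistics

def box_summary(deviations):
    """Median, quartiles and hinge ends for predicted-minus-annotated positions."""
    data = sorted(deviations)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Each hinge ends at the most extreme data point still inside the fence.
    hinge_low = min(x for x in data if x >= low_fence)
    hinge_high = max(x for x in data if x <= high_fence)
    return {"q1": q1, "median": median, "q3": q3,
            "hinge_low": hinge_low, "hinge_high": hinge_high}

# Hypothetical start-position deviations in nucleotides (not data from the paper).
example = [-10, -2, -1, 0, 0, 1, 2, 3, 40]
print(box_summary(example))
```

In this toy data set the values -10 and 40 fall outside the fences, so the hinges stop at -2 and 3.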
Evaluation of spotter and full model predictions.
| Kingdom | Type | Full model predictions | Spotter predictions | FPS | Score min | Score Q1 | Score median | Score Q3 | Score max | T99 | P |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Archaea | 5S | 47 | 35 | 7 | 2.9 | 12.7 | 20.0 | 35.3 | 50.6 | 34.9 | 0.69 |
| | 16S | 47 | 47 | 0 | 1180.8 | 1891.9 | 1937.9 | 2004.0 | 2096.5 | <0 | 1.0 |
| | 23S | 46 | 46 | 1 | 2240.7 | 2714.1 | 2870.7 | 3155.3 | 3267.3 | <0 | 1.0 |
| Bacteria | 5S | 1339 | 1339 | 123 | 39.9 | 77.7 | 89.5 | 94.6 | 109.6 | 14.0 | 1.0 |
| | 16S | 1237 | 1237 | 31 | 721.9 | 1905.5 | 1989.4 | 2058.7 | 2148.5 | <0 | 1.0 |
| | 23S | 1248 | 1248 | 20 | 2502.8 | 3267.8 | 3586.5 | 3690.7 | 3876.1 | <0 | 1.0 |
| Eukaryotes | 5S | 324 | 324 | 251 | 43.9 | 51.1 | 53.9 | 74.3 | 82.2 | <0 | 1.0 |
| | 18S | 13 | 13 | 14 | 625.3 | 625.3 | 1733.1 | 1777.5 | 1777.6 | <0 | 1.0 |
| | 28S | 19 | 19 | 5 | 1434.2 | 2904.7 | 3225.0 | 3335.9 | 3380.9 | <0 | 1.0 |
This table shows the total number of full model predictions, the number of spotter predictions that had matching full model predictions and the number of false positive spotter predictions (FPS). The characteristics of the full model prediction score distributions are shown as minimum, quartiles and maximum. T99 refers to the lowest score a full model prediction could have while still being detected with 99% probability by a spotter model with positive score. P is the probability that a spotter with positive score would find a full model prediction with the minimum score indicated. The minimum full model score can be used as a lower limit for which results can be expected to be real.
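Two ratios follow directly from the prediction counts in this table: the fraction of full model predictions that the spotter also found, and the fraction of spotter hits that were false positives. The sketch below is a reading aid and not part of the published evaluation. It assumes a one-to-one match between spotter and full model predictions, and the names are invented for the example.

```python
# (kingdom, type): (full model predictions, matching spotter predictions, FPS),
# copied from the table.
PREDICTION_COUNTS = {
    ("Archaea", "5S"): (47, 35, 7),
    ("Archaea", "16S"): (47, 47, 0),
    ("Archaea", "23S"): (46, 46, 1),
    ("Bacteria", "5S"): (1339, 1339, 123),
    ("Bacteria", "16S"): (1237, 1237, 31),
    ("Bacteria", "23S"): (1248, 1248, 20),
    ("Eukaryotes", "5S"): (324, 324, 251),
    ("Eukaryotes", "18S"): (13, 13, 14),
    ("Eukaryotes", "28S"): (19, 19, 5),
}

def spotter_rates(full, spotter, fps):
    """Return (fraction of full predictions recovered, false positive fraction)."""
    recovered = spotter / full
    false_fraction = fps / (spotter + fps)
    return recovered, false_fraction

for key, counts in PREDICTION_COUNTS.items():
    recovered, false_fraction = spotter_rates(*counts)
    print(key, f"recovered={recovered:.3f}", f"false_fraction={false_fraction:.3f}")
```

Archaeal 5S is the only row in which the spotter misses full model predictions (35 of 47 recovered), which agrees with it being the only row with P below 1.0.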