Jamie J Cannone, Sankar Subramanian, Murray N Schnare, James R Collett, Lisa M D'Souza, Yushi Du, Brian Feng, Nan Lin, Lakshmi V Madabusi, Kirsten M Müller, Nupur Pande, Zhidi Shang, Nan Yu, Robin R Gutell.
Abstract
BACKGROUND: Comparative analysis of RNA sequences is the basis for detailed and accurate predictions of RNA structure and for determining the phylogenetic relationships of organisms that span the entire phylogenetic tree. Underlying these accomplishments are very large, well-organized, and processed collections of RNA sequences. These data have been collected and analyzed, starting with sequences that are organized in a database management system and aligned to reveal their higher-order structure and their patterns of conservation and variation across the phylogenetic tree. This type of information can be fundamental to, and can influence, the study of phylogenetic relationships, RNA structure, and the melding of these two fields.
Year: 2002 PMID: 11869452 PMCID: PMC65690 DOI: 10.1186/1471-2105-3-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Introductory view of the CRW Site. The top frame divides the site into eight sections; the first four sections are the primary focus of this manuscript. The bottom frame contains the CRW Site's Table of Contents. Color-coding is used consistently throughout the CRW Site to help orient users.
Reference sequence and nucleotide frequency data available at the CRW Site. Nucleotide frequency data available in tabular form is indicated with "Y." Entries marked with "*" are also available mapped on the phylogenetic tree. L, Lousy; M, Model; T, Tentative.
| Reference Sequence | Single Nucleotide | Base Pair | Base Triple | ||||
|---|---|---|---|---|---|---|---|
| M | T | L | M | T | |||
| 5S | Y | Y | Y | ||||
| 16S | Y* | Y* | Y | Y | Y* | Y | |
| 23S | Y* | Y* | Y | Y | Y* | Y | |
| Y | Y | Y | |||||
| Group I | Y | Y | |||||
| Group IIA | Y | Y | |||||
| Group IIB | Y | Y | |||||
Figure 2. The most recent (November 1999) versions of the rRNA comparative structure models (see text for additional details). A. E. coli 23S rRNA, 5' half. B. E. coli 23S rRNA, 3' half. C. E. coli 16S rRNA. D. The "histogram" format for the E. coli 16S rRNA.
Alignments available from the CRW Site. These alignments were used to generate conservation diagrams (rRNA only) and correspond to the alignments used in the nucleotide frequency tables.
| Molecule | Alignment | # of Sequences |
|---|---|---|
| rRNA (5S / 16S / 23S) | T (Three Domains/Two Organelles) | 686/6389/922 |
| | 3 (Three Phylogenetic Domains) | -- / 5591 / 585 |
| | A (Archaea) | 53/171/39 |
| | B (Bacteria) | 323/4213/431 |
| | C (Eukaryota chloroplast) | -- / 127 / 52 |
| | E (Eukaryota nuclear) | 299/1937/115 |
| | M (Eukaryota mitochondria) | -- / 899 / 295 |
| Group I Intron | A (IA1, IA2, and IA3 subgroups) | 82 |
| | B (IB1, IB2, IB3, and IB4 subgroups) | 72 |
| | C (IC1 and IC2 subgroups) | 305 |
| | Z (IC3 subgroup) | 125 |
| | D (ID subgroup) | 19 |
| | E (IE subgroup) | 46 |
| | U (all other group I introns) | 41 |
| Group II Intron | A (IIA subgroup) / B (IIB subgroup) | 171/571 |
| tRNA | A (Alanine tRNAs) / C (Cysteine tRNAs) | 64/19 |
| | D (Aspartic Acid tRNAs) / E (Glutamic Acid tRNAs) | 35/49 |
| | F (Phenylalanine tRNAs) / G (Glycine tRNAs) | 54/69 |
| | H (Histidine tRNAs) / I (Isoleucine tRNAs) | 38/56 |
| | K (Lysine tRNAs) / M (Methionine tRNAs) | 53/36 |
| | N (Asparagine tRNAs) / P (Proline tRNAs) | 35/55 |
| | Q (Glutamine tRNAs) / R (Arginine tRNAs) | 35/62 |
| | T (Threonine tRNAs) / V (Valine tRNAs) | 49/65 |
| | W (Tryptophan tRNAs) / X (Methionine Initiator tRNAs) | 30/65 |
| | Y (Tyrosine tRNAs) / Z (All Type 1 tRNAs) | 47/895 |
Summary of the Evolution of the Noller-Woese-Gutell 16S rRNA Comparative Structure Model. Categories marked with "*" are calculated compared to the 1999 version of the 16S rRNA model.
| Date of Model | 1980 | 1983 | 1984–86 | 1989–90 | 1993–96 | Current (1999) |
|---|---|---|---|---|---|---|
| 1. Approximate # Complete Sequences | 2 | 15 | 35 | 420 | 1000 | 7000 |
| 2. % of 1999 Sequences | 0.03 | 0.2 | 0.5 | 6.0 | 14.3 | 100 |
| 3. # BP Proposed Correctly * | 284 | 388 | 429 | 450 | 465 | 478 |
| 4. # BP Proposed Incorrectly * | 69 | 49 | 48 | 28 | 6 | 0 |
| 5. Total BP in Model (#3 + #4) | 353 | 437 | 477 | 478 | 471 | 478 |
| 6. % of BP in This Model that Appear in the Current Model (#3 / 478) * | 59.4 | 81.2 | 89.7 | 94.1 | 97.3 | 100 |
| 7. Accuracy of Proposed BP (#3 / #5) | 80.5 | 88.8 | 89.9 | 94.1 | 98.7 | 100 |
| 8. # BP in Current Model Missing from This Model (478 - #3) * | 194 | 90 | 49 | 28 | 13 | 0 |
| 9. # Tertiary BP Proposed Correctly * | 4 | 8 | 15 | 25 | 35 | 40 |
| 10. % Tertiary BP Proposed Correctly * | 10.0 | 20.0 | 37.5 | 62.5 | 87.5 | 100 |
| 11. # Base Triples Proposed Correctly * | 0 | 0 | 0 | 0 | 0 | 6 |
| 12. % Base Triples Proposed Correctly * | 0 | 0 | 0 | 0 | 0 | 100 |
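The derived rows of this table follow directly from the formulas given in the row labels. As a minimal sketch (the variable and function names are illustrative and not part of the CRW Site), rows 6, 7, and 8 can be recomputed from row 3, row 5, and the 478 base pairs of the 1999 model:

```python
# Rows 3 and 5 of the 16S rRNA table, one entry per model version.
versions = ["1980", "1983", "1984-86", "1989-90", "1993-96", "1999"]
bp_correct = [284, 388, 429, 450, 465, 478]  # row 3: # BP proposed correctly
bp_total = [353, 437, 477, 478, 471, 478]    # row 5: total BP in model
CURRENT_BP = 478                             # base pairs in the 1999 model


def derived_rows(correct, total, current=CURRENT_BP):
    """Recompute rows 6, 7, and 8 for one model version (hypothetical helper)."""
    return {
        "pct_of_current": round(100 * correct / current, 1),  # row 6: #3 / 478
        "accuracy": round(100 * correct / total, 1),          # row 7: #3 / #5
        "missing": current - correct,                         # row 8: 478 - #3
    }


summary = {
    version: derived_rows(correct, total)
    for version, correct, total in zip(versions, bp_correct, bp_total)
}

for version, row in summary.items():
    print(version, row)
```

The same three formulas apply to the 23S rRNA table, with 870 in place of 478.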
Summary of the Evolution of the Noller-Woese-Gutell 23S rRNA Comparative Structure Model. Categories marked with "*" are calculated compared to the 1999 version of the 23S rRNA model.
| Date of Model | 1981 | 1984 | 1988–90 | 1992–96 | Current (1997–2000) |
|---|---|---|---|---|---|
| 1. Approximate # Complete Sequences | 2 | 15 | 55 | 220 | 1050 |
| 2. % of 1999 Sequences | 0.2 | 1.4 | 5.2 | 21.0 | 100 |
| 3. # BP Proposed Correctly * | 676 | 692 | 794 | 836 | 870 |
| 4. # BP Proposed Incorrectly * | 102 | 93 | 69 | 26 | 0 |
| 5. Total BP in Model (#3 + #4) | 778 | 785 | 863 | 862 | 870 |
| 6. % of 1999 Model Proposed Correctly (#3 / 870) * | 77.7 | 79.5 | 91.3 | 96.1 | 100 |
| 7. Accuracy of Proposed BP (#3 / #5) | 86.9 | 88.2 | 92.0 | 97.0 | 100 |
| 8. # BP in Current Model Missing from This Model (870 - #3) * | 194 | 178 | 76 | 34 | 0 |
| 9. # Tertiary BP Proposed Correctly * | 4 | 3 | 29 | 49 | 65 |
| 10. % Tertiary BP Proposed Correctly * | 6.2 | 4.6 | 44.6 | 75.4 | 100 |
| 11. # Base Triples Proposed Correctly * | 0 | 0 | 0 | 2 | 7 |
| 12. % Base Triples Proposed Correctly * | 0 | 0 | 0 | 28.6 | 100 |
Figure 3. RDBMS (Standard) search form.
RDBMS Fields and Short Descriptions.
| # | Search Query | Output Field | Description |
|---|---|---|---|
| 1 | ---- | Row# | Index for ease of usage. |
| 2 | Organism | Organism | |
| 3 | Cell Location | L | |
| 4 | RNA Type | RT | |
| 5 | RNA Class | RC | |
| 6 | Exon | EX | Exon sequence containing the intron. The expanded names for the exon abbreviations are available online. |
| 7 | ---- | IN | |
| 8 | Intron Position | IP | |
| 9 | ORF | O | |
| 10 | Sequence Length | Size | Number of nucleotides in the RNA sequence. |
| 11 | ---- | Cmp | |
| 12 | Accession Number | AccNum | GenBank Accession Number. Links directly to the GenBank entry at the NCBI web site. |
| 13 | Secondary Structures | StrDiags | |
| 14 | Common Name | Common Name | From the NCBI Phylogeny, where available. |
| 15 | Group ID | Gr.Id | (Partially implemented feature.) |
| 16 | Group Class | Gr.Class | (Feature not presently implemented.) |
| 17 | Comment | Comment | Additional information about a sequence. |
| 18 | Phylogeny | Phylogeny | NCBI Phylogeny for the Organism. The first level is shown; the remainder is available by following the "m" ("more") link. |
| 19 | ---- | Row# | Index for ease of usage. |
#: order of appearance of fields in the RDBMS output. Search Query: names of fields on the Search screen; ----, not available as a search criterion. Output Field: names of fields in the RDBMS output. Description: more information about the field and its contents. The RDBMS Search page contains two additional options: Results / Page, which allows users to display 20, 50, 100, 200, or 400 results per page, and Color Display, which toggles alternating colored highlighting of adjacent organisms. Expanded descriptions of each field and the corresponding contents are available online at the CRW RDBMS Help Page.
Attributes for the "RNA Structure Query System." The 5' and 3' ends of helices and loops are based on the global orientation determined from the 5' and 3' ends of the entire RNA molecule.
| RNA Types | 5S rRNA, 16S rRNA, 23S rRNA, Group I intron | |
|---|---|---|
| Phylogenetic Groups / Cell Locations | Bacteria (nucleus), Archaea (nucleus), Eucarya (nucleus, mitochondria, and chloroplast) | |
| single nuc | total | |
| | paired (helix) | paired positions |
| | unpaired (loop) | unpaired positions |
| | 5' helix end | 5' end of helix |
| | 3' helix end | 3' end of helix |
| | 5' loop end | 5' end of loop |
| | 3' loop end | 3' end of loop |
| | helix center | in helix but not at the 5' or 3' ends |
| | loop center | in loop but not at the 5' or 3' ends |
| | unpaired/paired | ratio of 'unpaired' / 'paired' |
| adjacent nucs | total | |
| | in helix | paired positions |
| | in loop | unpaired positions |
| | 3'helix 5'loop | junction: 3' end of helix / 5' end of loop |
| | 3'loop 5'helix | junction: 3' end of loop / 5' end of helix |
| | in loop/in helix | ratio 'in loop' / 'in helix' |
| base pairs | total | |
| | 5'helix end | at the 5' end of a helix |
| | 3'helix end | at the 3' end of a helix |
| | helix center | in helix, but not at the 5' or 3' ends |
| three nucs | total | |
| | 000, 111, 001, 011, 010, 100, 101, 110 | 0 = unpaired, 1 = paired; patterns of three consecutive nucleotides |
| | 5'-(A:C)B | base pair with an unpaired nucleotide 3' to one paired position |
| | 5'-A(B:C) | base pair with an unpaired nucleotide 5' to one paired position |
| four nucs | total | |
| | 0000, 1111, 0001, 1110, 0010, 1101, 0011, 1100, 0100, 1011, 0101, 1010, 0110, 1001, 1000, 0111 | 0 = unpaired, 1 = paired; patterns of four consecutive nucleotides |
| | double pair@5end | two consecutive base pairs at the 5' end of helices |
| | double pair@mid | two consecutive base pairs not at the 5' or 3' ends of helices |
| | double pair@3end | two consecutive base pairs at the 3' end of helices |
| | 5-(A:D)BC | base pair with two consecutive unpaired nucleotides 3' to one paired position |
| | lonepair | base pair with unpaired nucleotides 5' and 3' to one paired position |
| | 5-AB(C:D) | base pair with two consecutive unpaired nucleotides 5' to one paired position |
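The 0/1 patterns in the table above can be illustrated with a short tally over sliding windows. The sketch below is not the CRW implementation; it assumes a hypothetical input in which each nucleotide carries a flag, "1" for paired and "0" for unpaired, and it counts the sequence found under each pairing pattern of k consecutive nucleotides:

```python
from collections import Counter, defaultdict


def pattern_counts(seq, paired, k=3):
    """Tally k-nucleotide windows grouped by their paired/unpaired pattern.

    seq    -- nucleotide string, e.g. "GGGAAACCC"
    paired -- string of the same length; "1" = paired, "0" = unpaired
    k      -- window size (3 for "three nucs", 4 for "four nucs")
    """
    if len(seq) != len(paired):
        raise ValueError("seq and paired must have the same length")
    counts = defaultdict(Counter)
    for i in range(len(seq) - k + 1):
        pattern = paired[i:i + k]   # e.g. "110"
        window = seq[i:i + k]       # e.g. "GGA"
        counts[pattern][window] += 1
    return counts


# Toy hairpin (made-up data): a three-pair helix closing a three-nucleotide loop.
toy_seq = "GGGAAACCC"
toy_mask = "111000111"
counts = pattern_counts(toy_seq, toy_mask)

for pattern in sorted(counts):
    print(pattern, dict(counts[pattern]))
```

Dividing each tally by the total for its pattern would give frequencies of the kind the query system reports; calling the function with `k=4` covers the four-nucleotide patterns in the same way.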
Figure 4. RDBMS (PhyloBrowser) basic phylogenetic search screen, showing two additional levels of phylogeny.
Figure 5. Analysis of the Bacterial 16S and 23S rRNA structure models using the "RNA Structure Query System." The entire system (selection frame and results) is shown with the results for the distribution of single nucleotides, sorted in order of decreasing prevalence in unpaired regions.
Significant values from the "RNA Structure Query System." The 5' and 3' ends of helices and loops are based on the global orientation determined from the 5' and 3' ends of the entire RNA molecule. Values are for the Bacterial 16S and 23S rRNA comparative structure models.
| Number/Type of Nucleotides | Structural Element | | |
|---|---|---|---|
| single nuc | total | | |
| | paired (helix) | G (36.57%) | A (14.46%) |
| | 5' helix end | G (46.23%) | U (13.52%) |
| | 3' helix end | C (38.07%) | A (10.57%) |
| | 5' loop end | G (37.06%) | C (10.33%) |
| adjacent nucs | total | GG (9.863%) | UU (4.093%) |
| | in helix | GG (14.06%) | AA (1.981%) |
| | 3'helix 5'loop | CG (14.75%) | UC (1.495%) |
| | loop/helix ratio | AA (5.67934) | CC (0.112825) |
| base pairs | total | GC/CG (28.29%) | CU/UC (0.1351%) |
| | 5'helix end | GC (38.76%) | UC (0.09088%) |
| | 3'helix end | CG (38.77%) | CU (0.09089%) |
| three nucs | total | GGG (3.0%), GAA (2.6%), AAG (2.6%), GGA (2.5%), AGG (2.4%) | |
| | 000 | GAA (7.5%), AAA (6.7%), UAA (5.2%) | |
| | 011 | AGC (9.3%), AGG (8.8%) | |
| | 100 | CGA (7.6%), UGA (5.8%), GGA (5.3%) | |
| | 110 | GCG (6.9%), GGG (4.8%), GGA (4.7%) | |
| | 001 | AAG (14.4%), AAC (6.9%), GAG (5.4%) | |
| | 101 | CAG (7.2%) | |