Rune Matthiesen, Finn Kirpekar.
Abstract
The idea of identifying or characterizing an RNA molecule based on a mass spectrum of specifically generated RNA fragments has been used in various forms for well over a decade. We have developed software, named RRM for 'RNA mass mapping', which can search whole prokaryotic genomes or RNA FASTA sequence databases to identify the origin of a given RNA based on a mass spectrum of RNA fragments. As input, the program uses the masses of the products of specific RNase cleavage of the RNA under investigation. RNase T1 digestion is used here to demonstrate the usability of the method for RNA identification. The concept for identification is that the masses of the digestion products constitute a specific fingerprint, which characterizes the given RNA. The search algorithm is based on the same principles as those used in peptide mass fingerprinting, but has here been extended to work for both RNA sequence databases and genome searches. A simple and powerful probability model for ranking RNA matches is proposed. We demonstrate the viability of the entire setup by identifying the DNA template of a series of RNAs of biological and of in vitro transcriptional origin in complete microbial genomes and by identifying authentic 16S ribosomal RNAs in a 'small ribosomal subunit RNA' database. Thus, we present a new tool for rapid identification of unknown RNAs using only a few picomoles of starting material.
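The fingerprint concept described in the abstract can be illustrated with a minimal Python sketch. It is not the RRM implementation: the function names, the approximate monoisotopic residue masses, the 0.2 Da tolerance, the minimum fragment length of three nucleotides, and the simple match count used for ranking are all assumptions made for illustration. In particular, the match count is only a stand-in for the probability model mentioned in the abstract, which is not reproduced here. The sketch also treats every fragment as a linear product with a 5'-hydroxyl and a 3'-phosphate and reads each candidate on one strand only.

```python
import re

# Approximate monoisotopic masses (Da) of RNA nucleotide residues
# (nucleoside monophosphate minus water). Illustrative values.
RESIDUE_MASS = {"A": 329.0525, "C": 305.0413, "G": 345.0474, "U": 306.0253}
WATER = 18.0106
PROTON = 1.00728


def rnase_t1_digest(seq, min_length=3):
    """In silico RNase T1 digest: cleave after every G.

    DNA input is accepted by reading T as U. Fragments shorter than
    min_length (hypothetical default: 3) are discarded.
    """
    rna = seq.upper().replace("T", "U")
    fragments = re.findall(r"[ACU]*G|[ACU]+$", rna)
    return [f for f in fragments if len(f) >= min_length]


def fragment_mh(fragment):
    """Singly protonated mass, assuming 5'-OH and 3'-phosphate ends."""
    return sum(RESIDUE_MASS[n] for n in fragment) + WATER + PROTON


def count_matches(peaks, seq, tol=0.2):
    """Number of observed peaks explained by the digest of seq."""
    theoretical = [fragment_mh(f) for f in rnase_t1_digest(seq)]
    return sum(any(abs(p - m) <= tol for m in theoretical) for p in peaks)


def rank_candidates(peaks, candidates, tol=0.2):
    """Rank candidate sequences (name -> sequence) by matched peaks."""
    scores = {name: count_matches(peaks, seq, tol)
              for name, seq in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this sketch a candidate is any named sequence, so the same ranking function could be applied to database entries or to windows cut from a genome sequence; how RRM actually scans genomes is not specified in this record.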
Year: 2009 PMID: 19264806 PMCID: PMC2665245 DOI: 10.1093/nar/gkp139
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Bioinformatics strategy for identifying the genetic origin of an RNA molecule.
Figure 2. Mass spectrometry data and search result for a H. marismortui 23S rRNA subfragment. (a) Mass spectrum of H. marismortui 23S rRNA subfragment (around positions 2323–2630) digested with RNase T1. Assigned masses are from singly protonated digestion products; these masses were used in the subsequent genome search. Insert: zoom on peak clusters to illustrate the effect of digestion products with partially overlapping isotope distributions (see text for details). (b) Top-scoring genomic region with flanks for RNA mass mapping of H. marismortui 23S rRNA subfragment 2323–2630. Underlined: identified sequence. Yellow highlight: RNase T1 digestion fragments with masses in the peak list. Bold italic: RNase T1 digestion fragments with masses not present in the peak list.
Top-scoring genomic regions for search with RNase T1 digested H. marismortui 23S rRNA subfragment (positions ∼2320–2630) against the H. marismortui ATCC 43049 genome
| Candidate region | Positions in gene | Score | Z-score | GenBank accession |
|---|---|---|---|---|
| 23S rRNA, rrnA operon | 2305–2627 | 380 | 9.9 | AY596297 |
| 23S rRNA, rrnC operon | 2305–2627 | 380 | 9.9 | AY596297 |
| 23S rRNA, rrnB operon | 2305–2627 | 348 | 9.5 | AY596298 |
Score is calculated according to Equation (3) and the Z-score according to Equation (4).
Overview of top scoring genomic regions for various RNAs
| RNA | Position in gene, calculated | Position in gene, found | Sequence coverage (%)ᵃ |
|---|---|---|---|
| | ∼530–697 | 531–702 | 92.4 |
| | ∼681–969 | 685–935 | 98.8 |
| | 2446–2632 | 2456–2621 | 100 |
| | 1–109 | 4–97 | 100 |
| | 1–117 | 9–113 | 89.4 |
ᵃ The sequence coverage is calculated as the percentage of the identified genomic region that is represented by masses from the peak list, given that RNase T1 was used to obtain the peak list. Each mass may match several positions in the identified genomic region. Note that genome sequences that would result in mono- or di-nucleotides at the RNase T1 digestion level are not included in the calculation of the sequence coverage.
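The coverage definition in this footnote can be sketched in Python as follows. This is an illustrative reading of the footnote, not the authors' code: the function name, the approximate monoisotopic residue masses, the 0.2 Da tolerance, and the assumption that every fragment carries a 5'-hydroxyl and a 3'-phosphate are all hypothetical choices.

```python
import re

# Approximate monoisotopic masses (Da) of RNA nucleotide residues
# (nucleoside monophosphate minus water). Illustrative values.
RESIDUE_MASS = {"A": 329.0525, "C": 305.0413, "G": 345.0474, "U": 306.0253}
WATER_PLUS_PROTON = 18.0106 + 1.00728


def sequence_coverage(region, peaks, tol=0.2, min_length=3):
    """Percentage of countable nucleotides explained by the peak list.

    The region is digested in silico after every G (RNase T1).
    Mono- and di-nucleotide fragments are left out of both the
    numerator and the denominator, as the footnote describes.
    """
    rna = region.upper().replace("T", "U")
    fragments = [f for f in re.findall(r"[ACU]*G|[ACU]+$", rna)
                 if len(f) >= min_length]
    countable = sum(len(f) for f in fragments)
    if countable == 0:
        return 0.0
    covered = 0
    for fragment in fragments:
        mh = sum(RESIDUE_MASS[n] for n in fragment) + WATER_PLUS_PROTON
        # One peak may explain several fragments of equal mass.
        if any(abs(mh - p) <= tol for p in peaks):
            covered += len(fragment)
    return 100.0 * covered / countable
```

Because fragments with the same base composition have the same mass, a single peak can cover several positions in the region, which is the situation the footnote refers to when it says that each mass may match several positions.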
Figure 3. Mass spectrum of T. thermophilus 16S rRNA digested with RNase T1. Assigned masses are of singly protonated digestion products; these masses were used in the subsequent database search.
Top scoring entries for search with RNase T1 digested T. thermophilus 16S rRNA in the RDP 16S rRNA database
| Rank | Score | Z-score | Organism | GenBank accession |
|---|---|---|---|---|
| 1 | 364 | 55 | | AY554280 |
| 1 | 364 | 55 | | AY497773 |
| 1 | 364 | 55 | | No information |
| 1 | 364 | 55 | | DQ087525 |
| 5 | 362 | 54 | | AE017221 |
| 5 | 362 | 54 | | AP008226 |
| 5 | 362 | 54 | | AE017221 |
| 5 | 362 | 54 | | AP008226 |
| 9 | 360 | 54 | | AY788091 |
| 10 | 358 | 53 | | AJ251938 |
| . . . | . . . | . . . | . . . | . . . |
| 19 | 339 | 49 | | X07998 |
| 20 | 313 | 43 | S000345626 uncultured bacterium; G24 | AF407704 |
The 10 highest-scoring entries as well as numbers 19 and 20 are listed.
ᵃ Each identified rRNA represents one of the two copies present in the T. thermophilus genome.
Top three scoring entries for search with RNase T1 digested E. coli 16S rRNA in the RDP 16S rRNA database
| Score | Z-score | Description of organism | GenBank accession |
|---|---|---|---|
| 452 | 110 | | Z83204 |
| 452 | 110 | | Z83203 |
| 450 | 110 | | CP000034 |
ᵃ The subsequent seven identified 16S rRNA candidates originated from either E. coli or S. dysenteriae.