Eric P Nawrocki, Sarah W Burge, Alex Bateman, Jennifer Daub, Ruth Y Eberhardt, Sean R Eddy, Evan W Floden, Paul P Gardner, Thomas A Jones, John Tate, Robert D Finn.
Abstract
The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.
Year: 2014 PMID: 25392425 PMCID: PMC4383904 DOI: 10.1093/nar/gku1063
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Comparison of the old Rfam 11.0 BLAST and Infernal 1.0 search strategy versus the new Rfam 12.0 Infernal 1.1 search strategy for 15 of 200 randomly chosen families
| Accession | Family ID | Length (nt) | # of seed seqs | Time new (h) | Time old (h) | Time (old/new) | New total hits | Old total hits | New unique hits | Old unique hits |
|---|---|---|---|---|---|---|---|---|---|---|
| Top five families | | | | | | | | | | |
| RF00028 | Intron_gpI | 251 | 12 | 125.0 | 357.2 | 2.8 | 71 433 | 60 264 | 11 175 | 1 |
| RF00026 | U6 | 104 | 188 | 31.2 | 181.1 | 5.8 | 66 517 | 62 174 | 4367 | 14 |
| RF00003 | U1 | 166 | 100 | 11.6 | 64.0 | 5.5 | 15 770 | 14 867 | 904 | 1 |
| RF00162 | SAM | 108 | 433 | 8.3 | 590.0 | 70.8 | 4905 | 4797 | 108 | 0 |
| RF00050 | FMN | 140 | 144 | 17.1 | 169.9 | 23.9 | 4381 | 4306 | 76 | 1 |
| Middle five families | | | | | | | | | | |
| RF01426 | snoR126 | 101 | 4 | 40.3 | 7.3 | 0.2 | 78 | 66 | 12 | 0 |
| RF01252 | snR5 | 196 | 11 | 41.1 | 9.8 | 0.2 | 76 | 72 | 4 | 0 |
| RF00544 | snopsi28S-3327 | 143 | 14 | 11.3 | 15.1 | 1.3 | 75 | 74 | 1 | 0 |
| RF00439 | SNORD87 | 85 | 10 | 26.8 | 12.6 | 0.5 | 75 | 74 | 1 | 0 |
| RF01537 | TB11Cs2H1 | 70 | 7 | 5.8 | 7.3 | 1.3 | 74 | 73 | 1 | 0 |
| Bottom five families | | | | | | | | | | |
| RF01439 | S_pombe_snR36 | 164 | 2 | 25.0 | 1.7 | 0.1 | 5 | 2 | 3 | 0 |
| RF01448 | S_pombe_snR93 | 143 | 2 | 11.0 | 1.5 | 0.1 | 4 | 3 | 1 | 0 |
| RF00967 | mir-281 | 83 | 2 | 6.0 | 2.6 | 0.4 | 4 | 4 | 0 | 0 |
| RF00925 | MIR1027 | 142 | 2 | 20.4 | 1.6 | 0.1 | 3 | 3 | 0 | 0 |
| RF01576 | DdR8 | 88 | 2 | 10.4 | 1.6 | 0.2 | 2 | 2 | 0 | 0 |
| all 200 | - | - | - | 4222.2 | 4069.8 | 0.96 | 201 814 | 179 681 | 22 312 | 53 |
The top five, middle five and bottom five families are shown, as ranked by the number of hits found above Rfam GA thresholds using the new search strategy. Identical Rfam 12.0 score thresholds and CM parameters were used for both strategies (new: Rfam 12.0 CM file in Infernal 1.1 format; old: Rfam 12.0 CM file converted to Infernal 1.0 format using Infernal 1.1's cmconvert program). For each family, columns 1–4 give the Rfam accession, family identifier, model length in nucleotides and number of sequences in the seed alignment; columns 5–7 report the running time in hours for the new strategy, the running time in hours for the old strategy and the ratio of the two (old/new), respectively; columns 8 and 9 report the number of hits found above the per-family Rfam 12.0 thresholds by the new and old strategies, respectively; column 10 reports the number of unique hits found by the new strategy but not the old; and column 11 reports the number of unique hits found by the old strategy but not the new. A unique hit is a hit found by one strategy that does not overlap, by one or more nucleotides on the same strand, any hit found by the other strategy. The 200 families were randomly chosen from the set of 2190 families that exist in both Rfam 12.0 and Rfam 11.0, the last release for which the old strategy was used. Initially, MIR1122 (RF00906) was included in the 200, but we replaced it with another random choice (SNORD97, RF01291) after learning that MIR1122 is clearly related to a MITE (miniature inverted-repeat transposable element) in plants and that the curators at the microRNA database miRBase (4) suspect it may not be a true miRNA gene. If the family is removed from miRBase, it will also be removed from Rfam.
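The unique-hit definition in the note above can be sketched in a few lines of Python. This is only an illustration of the stated overlap rule, not the authors' code: the `Hit` record, its field names, and the example coordinates are all hypothetical, and coordinates are assumed to be 1-based, inclusive, and normalised so that `start <= end`.

```python
from collections import namedtuple

# Hypothetical hit record (not an Rfam or Infernal data structure).
# Coordinates are assumed 1-based, inclusive, with start <= end.
Hit = namedtuple("Hit", ["seq_id", "strand", "start", "end"])


def overlaps(a, b):
    """True if two hits share at least one nucleotide on the same strand."""
    return (
        a.seq_id == b.seq_id
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def unique_hits(hits, other_hits):
    """Hits that overlap none of the hits found by the other strategy."""
    return [h for h in hits if not any(overlaps(h, o) for o in other_hits)]


# Invented example data for one family.
new_hits = [
    Hit("chr1", "+", 100, 200),
    Hit("chr1", "-", 500, 600),
    Hit("chr2", "+", 10, 80),
]
old_hits = [
    Hit("chr1", "+", 150, 260),  # overlaps the first new hit
    Hit("chr1", "+", 500, 600),  # same coordinates as a new hit, opposite strand
]

new_unique = unique_hits(new_hits, old_hits)
old_unique = unique_hits(old_hits, new_hits)
```

The all-against-all comparison is quadratic in the number of hits; for families with tens of thousands of hits, sorting by sequence and start position or using an interval index would be the more practical choice.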
Figure 1. Number of Rfam family matches for each of the 34 RMfam motifs.
Figure 2. Overview of the motif page for RM00022, the Terminator1 motif, on the Rfam 12.0 website. As in family and clan pages, tabs on the left-hand side allow the user to access different information for each motif.
Figure 3. Screenshot of the secondary structure representation for the RsmY RNA family (RF00195) with the annotation for the CsrA binding motif (RM00005) overlaid. Positions in red are those where every seed sequence contains the motif; other colours indicate that fewer sequences have matches at that position. The CsrA protein is a homodimeric RNA-binding protein. Each CsrA binds a specific RNA motif that is characterized by a short hairpin hosting a GGA subsequence; these motifs generally occur in pairs. The CsrA-binding sRNAs, like RsmY, generally sequester excess copies of CsrA, which would otherwise bind mRNAs and inhibit translation (23). Therefore, the expression of these sRNAs is a rapid way of altering expression levels for a potentially large network of proteins (24).
Summary statistics for Rfam-based annotation of RNAs in various genomes and metagenomic data sets
| Genome/data set | Size (Mb) | # of hits | # of fams | CPU time (hours) | Mb/hour |
|---|---|---|---|---|---|
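| Homo sapiens | 3099.7 | 14 508 | 796 | 650 | 4.8 |
| Sus scrofa | 2808.5 | 6177 | 625 | 460 | 6.1 |
| Drosophila melanogaster | 168.7 | 4321 | 156 | 30 | 5.7 |
| | 100.3 | 1022 | 175 | 20 | 5.2 |
| Saccharomyces cerevisiae | 12.2 | 376 | 96 | 1.7 | 7.3 |
| Escherichia coli | 4.6 | 256 | 112 | 0.46 | 10.2 |
| Bacillus subtilis | 4.1 | 211 | 52 | 0.57 | 7.2 |
| Methanocaldococcus jannaschii | 1.7 | 257 | 18 | 0.31 | 5.6 |
| Aquifex aeolicus | 1.6 | 52 | 7 | 0.22 | 7.3 |
| Borrelia burgdorferi | 0.9 | 44 | 7 | 0.22 | 4.1 |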
| Human immunodeficiency virus (HIV) | 0.01 | 12 | 10 | 0.016 | 0.63 |
| Human gut microbiome sample (sample ERS167139, 454 sequencing) | 166.1 | 4342 | 54 | 22 | 7.7 |
| Human gut microbiome sample (sample ERS235581, Illumina HiSeq sequencing) | 52.9 | 3159 | 47 | 8.5 | 6.2 |
| Ocean metagenome (sample SRS580499, Illumina genome analyzer) | 44.3 | 6692 | 59 | 13 | 3.5 |
The cmsearch program of Infernal 1.1 was used with Rfam 12.0 CM files and the following command-line options: --noali --cut_ga --rfam --nohmmonly --cpu 0. Overlapping hits were removed so that no nucleotide was matched by more than one family: where hits overlapped, the hit with the lower E-value was kept (or, for tied E-values, the hit with the higher bit score). All searches were run as single execution threads on 3.0 GHz Intel Xeon processors. The Homo sapiens, Sus scrofa, Drosophila melanogaster and Saccharomyces cerevisiae genomes searched were obtained from Ensembl release 76 (http://www.ensembl.org/) (26), and the Escherichia coli (K12 substr MG1655), Bacillus subtilis (BSn5), Methanocaldococcus jannaschii (DSM 2661), Aquifex aeolicus (VF5) and Borrelia burgdorferi (CA-11 2A) genomes were obtained from release 23 of Ensembl Genomes (http://ensemblgenomes.org/) (27). For all of these, the sequence file searched was downloaded via FTP and has the suffix .dna.toplevel.fa.gz. The HIV genome used is ENA accession AJ291720, and the four metagenomic samples were downloaded from the EBI Metagenomics Portal (https://www.ebi.ac.uk/metagenomics/) (29) and can be accessed by the sample accession listed in the table. ‘CPU time’ and ‘Mb/hour’ columns are rounded to two significant digits.
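The overlap-removal rule in the note above (keep the lower E-value, break ties by higher bit score) could be implemented along the following lines. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the `Hit` record, the family labels, and all example values are invented; the greedy best-first pass is one way to apply the pairwise rule; and overlap is tested on sequence coordinates only, ignoring strand, which is an assumption the source does not spell out.

```python
from collections import namedtuple

# Hypothetical record for one hit above threshold (field names are assumptions).
# Coordinates are assumed 1-based, inclusive, with start <= end.
Hit = namedtuple("Hit", ["family", "seq_id", "start", "end", "evalue", "bitscore"])


def remove_overlaps(hits):
    """Greedy filter: visit hits best-first and keep each one that does not
    overlap an already kept hit on the same sequence.

    Best-first means lower E-value first, then higher bit score on ties.
    Strand is ignored here (an assumption of this sketch).
    """
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.bitscore))
    kept = []
    for h in ranked:
        clash = any(
            h.seq_id == k.seq_id and h.start <= k.end and k.start <= h.end
            for k in kept
        )
        if not clash:
            kept.append(h)
    return kept


# Invented example hits; "famA" etc. are placeholder family labels.
hits = [
    Hit("famA", "contig1", 100, 172, 1e-12, 60.1),
    Hit("famB", "contig1", 95, 170, 1e-9, 48.3),    # overlaps famA, worse E-value
    Hit("famC", "contig1", 400, 518, 1e-20, 85.0),  # no overlap
    Hit("famC", "contig2", 100, 218, 1e-20, 90.0),  # different sequence
    Hit("famD", "contig3", 1, 50, 1e-5, 30.0),      # ties with famE on E-value
    Hit("famE", "contig3", 40, 90, 1e-5, 35.0),     # higher bit score wins the tie
]

kept = remove_overlaps(hits)
kept_families = {h.family for h in kept}
```

Because hits are visited in best-first order, a weaker hit is discarded only when it collides with a hit that has already been retained, so a hit is never dropped in favour of one that is itself later removed.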