
Rfam: updates to the RNA families database.

Paul P Gardner¹, Jennifer Daub, John G Tate, Eric P Nawrocki, Diana L Kolbe, Stinus Lindgreen, Adam C Wilkinson, Robert D Finn, Sam Griffiths-Jones, Sean R Eddy, Alex Bateman.

Abstract

Rfam is a collection of RNA sequence families, represented by multiple sequence alignments and covariance models (CMs). The primary aim of Rfam is to annotate new members of known RNA families on nucleotide sequences, particularly complete genomes, using sensitive BLAST filters in combination with CMs. A minority of families with a very broad taxonomic range (e.g. tRNA and rRNA) provide the majority of the sequence annotations, whilst the majority of Rfam families (e.g. snoRNAs and miRNAs) have a limited taxonomic range and provide a limited number of annotations. Recent improvements to the website, methodologies and data used by Rfam are discussed. Rfam is freely available on the Web at http://rfam.sanger.ac.uk/ and http://rfam.janelia.org/.


Year: 2008 PMID: 18953034 PMCID: PMC2686503 DOI: 10.1093/nar/gkn766

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Rfam is a database of sequence families of structural RNAs, including non-coding RNA genes as well as cis-regulatory RNA elements. Rfam release 9.0 contains 603 families, each represented by a multiple sequence alignment of known and predicted representative members of the family, annotated with a consensus base-paired secondary structure. This so-called SEED alignment is used to build a covariance model (CM) with the Infernal software (1). Each Rfam covariance model is searched against a nucleotide sequence database, producing a list of putative hits. Matches that score above a curated threshold are then aligned to the CM to produce a so-called FULL alignment. This process is outlined diagrammatically in Figure 1. The Rfam database was developed as a generic approach to the annotation of structured RNA families on genomic sequences (2,3), but it has been widely used as a source of reliable alignments and structures for the purposes of training and benchmarking RNA sequence and secondary structure analysis software.

Figure 1.

An outline of the Rfam 9.0 databases and methods. RFAMSEQ is drawn from EMBL excluding only the EST, synthetic and patented divisions. There are 603 Rfam families in release 9.0, which are used to scan RFAMSEQ for homologues, first using WU-BLAST filters and then the more accurate CM-based methods cmsearch and cmalign. This results in 603 FULL alignments annotating 636 138 regions.

DATA AND METHODOLOGICAL IMPROVEMENTS

RFAMSEQ

All Rfam models are searched against an underlying nucleotide sequence database, known as RFAMSEQ, which is derived from the EMBL nucleotide sequence database (4). Prior to release 9.0, RFAMSEQ represented only the various species sections of EMBL. These sections contained only sequences that were considered to be of finished quality and excluded sequences from many important genomes. With release 9.0, RFAMSEQ has been expanded to include the whole genome shotgun (WGS) and environmental sequence (ENV) divisions. These changes have increased the number of sequences in RFAMSEQ by more than an order of magnitude (2 225 030 sequences in Rfam 8.0 versus 29 574 458 sequences in Rfam 9.0).
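As a quick check of the "more than an order of magnitude" statement, the fold increase can be computed directly from the two sequence counts quoted above. This is only illustrative arithmetic; the variable names are ours.

```python
import math

# Sequence counts quoted in the text for RFAMSEQ in each release.
RFAMSEQ_8_0 = 2_225_030
RFAMSEQ_9_0 = 29_574_458

# Ratio of the two database sizes and its base-10 logarithm.
fold_increase = RFAMSEQ_9_0 / RFAMSEQ_8_0
orders_of_magnitude = math.log10(fold_increase)

print(f"fold increase: {fold_increase:.1f}x")
print(f"log10(fold increase): {orders_of_magnitude:.2f}")
```

A base-10 logarithm above 1 corresponds to growth of more than one order of magnitude.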

Sequence filters

In order to make it feasible to search more than 120 gigabases of sequence with hundreds of covariance models in a reasonable time, we use sequence-based filters to prune the search space prior to applying the more accurate and more computationally expensive CMs. One of the primary limitations of the Rfam annotation pipeline has been the use of BLAST-based sequence filters, which are likely to compromise search sensitivity. In order to address this issue at least partially, NCBI-BLAST has been replaced with a WU-BLAST search, which has been tuned for high sensitivity at low sequence similarity. A benchmark of several homology search tools has shown WU-BLAST to be the more accurate of the two methods on nucleotide data (5). Additionally, in order to make the BLAST filters more similar to profile HMMs, a sequence mask has been applied to each sequence in the alignment. Any nucleotide in an alignment column that either has a low frequency or is an insert relative to the majority of the rest of the sequences is ‘soft masked’ and not used for the BLAST word matches. These masked nucleotides do, however, still contribute to alignments that were seeded in the flanking regions. This approach has resulted in many fewer spurious hits with no detectable cost to sensitivity (data not shown), thus allowing E-value thresholds to be further relaxed. These observations together mean that the BLAST filters have been improved in terms of specificity and sensitivity.
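The soft-masking idea described above can be sketched as follows. This is a minimal illustration, not the Rfam pipeline code: the function name, the two thresholds (`min_freq`, `max_gap_fraction`) and the use of lower case to mark masked residues are assumptions made for the example.

```python
from collections import Counter

def soft_mask_alignment(alignment, min_freq=0.5, max_gap_fraction=0.5):
    """Lower-case residues that should not seed word matches.

    alignment: equal-length aligned sequences, with '-' or '.' as gaps.
    A residue is masked if it is rare in its column or if the column is
    mostly gaps (an insert relative to most sequences).
    Both thresholds are illustrative assumptions.
    """
    n_seqs = len(alignment)
    masked = [[] for _ in alignment]
    for col in range(len(alignment[0])):
        column = [seq[col].upper() for seq in alignment]
        residues = [c for c in column if c not in "-."]
        gap_fraction = 1 - len(residues) / n_seqs
        counts = Counter(residues)
        for row, char in enumerate(column):
            if char in "-.":
                masked[row].append(char)  # gaps are left untouched
                continue
            is_insert = gap_fraction > max_gap_fraction
            is_rare = counts[char] / n_seqs < min_freq
            masked[row].append(char.lower() if (is_insert or is_rare) else char)
    return ["".join(chars) for chars in masked]

toy_alignment = ["GGCAU-C", "GGCAU-C", "GGUAUAC", "GGCAU-C"]
masked_alignment = soft_mask_alignment(toy_alignment)
```

In the toy alignment, the third sequence carries a rare U in column 3 and an inserted A in column 6, so only those two residues are lower-cased.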

Iteration of families

In order to improve the species and sequence depth of individual Rfam families, more than 370 families have been expanded by an ‘iteration’ process, in which some sequences in the FULL alignment are chosen for promotion to the SEED alignment. The sequences selected from FULL alignments for inclusion in the SEED must pass a series of stringent quality control requirements and be manually approved by a curator. The quality control steps include: ensuring that the sequence and secondary structure are consistent with the existing SEED sequences; the sequence identity with existing SEED sequences falls within 60–95%; the sequence is not truncated with respect to the SEED alignment. The curator also ensures that the new sequences make phylogenetic sense before allowing them to be incorporated into an updated SEED. An example of the utility of iteration is the snoRNA U103 SEED from Rfam 8.1 (accession: RF00188), which contained just three sequences and spanned two eutherian mammals (human and mouse). The SEED in Rfam 9.0 after iteration contains 42 sequences and spans Eutheria, Teleost (ray-finned fishes), Iguanidae, Monotremes, Marsupials, Placentals, Aves and Chondrichthyes (cartilaginous fishes).

Phylogenetic trees have been estimated for both the SEED and FULL alignments. For the majority of the alignments we produced the trees using an accurate maximum-likelihood approach, which included models of indels (6). However, the computational complexity of tree estimation meant that maximum-likelihood was not always possible and hence, where the number of sequences in the alignment was greater than 64, a neighbour-joining method was used instead (7). Large alignments and trees are problematic, both in terms of the computational cost of generation and the challenges of displaying them. Therefore, where the number of sequences in the alignment was greater than 1024, the highly similar sequences were filtered by sequence similarity, resulting in relatively sparse and easily presented trees that required comparatively little computing power to generate.
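Two of the numerical rules above, the 60–95% identity window for SEED promotion and the size cut-offs for tree building, can be expressed as a short sketch. The function names are hypothetical, and two details are assumptions because the text does not specify them: identity is measured against the most similar existing SEED sequence, and the tree method is chosen from the alignment size before any filtering.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def identity_in_window(candidate, seed, low=60.0, high=95.0):
    """True if the candidate's best identity to any SEED sequence is in [low, high].

    Using the best (maximum) identity is an assumption made for this sketch.
    """
    best = max(percent_identity(candidate, s) for s in seed)
    return low <= best <= high

def tree_strategy(n_sequences, ml_limit=64, filter_limit=1024):
    """Return (filter_similar_first, method) for an alignment of a given size.

    The cut-offs of 64 and 1024 sequences are taken from the text; basing the
    method on the unfiltered size is an assumption.
    """
    filter_first = n_sequences > filter_limit
    method = "neighbour-joining" if n_sequences > ml_limit else "maximum-likelihood"
    return filter_first, method

seed = ["GGCAUC"]
near_duplicate_ok = identity_in_window("GGCAUC", seed)   # 100% identical
diverged_ok = identity_in_window("GGCUUG", seed)         # 4 of 6 positions match
```

The remaining checks (structural consistency, truncation, phylogenetic plausibility) involve curator judgement and are not modelled here.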

PRESENTATION IMPROVEMENTS

Website redesign

We are currently developing a new Rfam website, with the aim of improving the presentation of Rfam data and providing more and better tools for searching the data. The new site is now available from http://rfam.sanger.ac.uk/ and can be used to access Rfam 9.0 data. The new site lacks some features of the old site, but we aim to add all existing features and add many new ones over the coming months. Note that, at time of writing, the new website was available only at the Wellcome Trust Sanger Institute (http://rfam.sanger.ac.uk/). The two mirror sites will be updated to run the same website to coincide with the release of Rfam 9.1. The new site provides detailed overviews of Rfam families, including: a snapshot of the latest community-contributed annotation from Wikipedia (see below); tools for viewing and downloading the sequence alignments in various formats; representations of predicted secondary structure (see below); the taxonomic tree for the family; and phylogenetic trees for the SEED and FULL alignments. Additionally, we provide several search tools in the new site. We currently support interactive searches, allowing a single RNA sequence to be searched against the whole Rfam database, and a batch search tool for searching multiple sequences against Rfam, the results of which are returned to the user via email. A new taxonomy search tool allows the user to find Rfam families that are specific to a given taxonomic level, or those found in a set of taxonomic levels that are specified by a complex, boolean query. For example, the query ‘Drosophila AND Caenorhabditis NOT Mammalia’ returns the two Rfam families (RF00047 and RF00533) that contain sequences from both Drosophila and Caenorhabditis but no sequences from any mammalian species.
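The logic of such a boolean taxonomy query can be illustrated with plain set membership. This is a sketch of the idea only, not the website's implementation: the function name is hypothetical and the family-to-taxon table is invented toy data, not real Rfam taxonomic distributions.

```python
def families_matching(family_taxa, require=(), exclude=()):
    """Return accessions whose taxa include every `require` term and no `exclude` term."""
    hits = []
    for accession, taxa in sorted(family_taxa.items()):
        has_all_required = all(t in taxa for t in require)
        has_any_excluded = any(t in taxa for t in exclude)
        if has_all_required and not has_any_excluded:
            hits.append(accession)
    return hits

# Toy data, invented for illustration only.
family_taxa = {
    "FAM_A": {"Drosophila", "Caenorhabditis"},
    "FAM_B": {"Drosophila", "Caenorhabditis", "Mammalia"},
    "FAM_C": {"Drosophila"},
}

# Equivalent of 'Drosophila AND Caenorhabditis NOT Mammalia'.
result = families_matching(
    family_taxa,
    require=("Drosophila", "Caenorhabditis"),
    exclude=("Mammalia",),
)
```

Only `FAM_A` satisfies the query here: `FAM_B` is excluded by its mammalian members and `FAM_C` lacks Caenorhabditis.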

Structure graphics

New graphical representations of secondary structures have been added to the Rfam website, based on software from the Vienna RNA package (8). We now annotate several statistics directly on secondary structure diagrams, including sequence conservation, covariation, base-pair conservation and the maximum CM scores (Figure 2). The sequence conservation metric is computed for each column in the alignment as the frequency of the most common nucleotide in that column (Figure 2A). The covariation metric is based upon that used by RNAalifold (9). For each base pair in the consensus structure and for each pair of sequences in the alignment, the difference in structurally consistent and inconsistent mutations is taken. Each mutation is weighted using a tree-weighting scheme (10) and this value is then normalized by the number of possible mutations (Figure 2B). The base-pair conservation metric is the fraction of canonical base pairs (Watson–Crick and G:U) in any two columns that correspond to a base pair in the consensus structure (Figure 2C). The maximum covariance model score and corresponding nucleotide/base pair is computed for each node in the CM. The resulting sequence, structure and bit-scores are used to produce a marked up secondary structure (Figure 2D).
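The two simplest metrics above, sequence conservation and base-pair conservation, can be sketched directly from their definitions. The function names are ours, and two details are assumptions: gaps count towards the denominator of the column frequency, and sequences are upper-case RNA. The tree-weighted covariation metric is not attempted here.

```python
from collections import Counter

# Watson-Crick pairs plus the G:U wobble pair.
CANONICAL_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                   ("C", "G"), ("G", "U"), ("U", "G")}

def sequence_conservation(alignment):
    """Per-column frequency of the most common nucleotide.

    Gaps ('-' or '.') are not counted as nucleotides but do count in the
    denominator; that choice is an assumption of this sketch.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c not in "-.")
        scores.append(max(counts.values()) / n_seqs if counts else 0.0)
    return scores

def basepair_conservation(alignment, i, j):
    """Fraction of sequences with a canonical pair at 0-based columns i and j."""
    pairs = [(seq[i], seq[j]) for seq in alignment]
    return sum(pair in CANONICAL_PAIRS for pair in pairs) / len(pairs)

# Toy alignment of a hairpin whose consensus pairs are columns (0, 5) and (1, 4).
toy_alignment = ["GGAACC", "GGAAUC", "AGAACC"]
conservation = sequence_conservation(toy_alignment)
outer_pair = basepair_conservation(toy_alignment, 0, 5)
inner_pair = basepair_conservation(toy_alignment, 1, 4)
```

In the toy alignment, the inner pair is fully conserved as a canonical pair (including one G:U), whereas the third sequence has a non-canonical A:C at the outer pair.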

Figure 2.

An example of the new secondary structure markups used by Rfam. The coronavirus 3′-UTR pseudoknot is shown (Rfam Accession RF00165). We display coloured markups of sequence conservation (A), covariation (B), base-pair conservation also known as the fraction of canonical base pairs (C) and CM scores (D).

Wikipedia

The Rfam website now draws textual annotation of RNA families directly from the scientific community, through the online encyclopedia Wikipedia. Any updates to relevant Wikipedia articles are downloaded on a nightly basis using the MediaWiki API, verified by members of the consortium and presented on the Rfam site (11). We consider the resulting articles to be a great improvement on the original static text because they are frequently updated, provide cross links to related articles and are generally considerably more comprehensive and informative than the original Rfam annotations that they replace.
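As an illustration of the kind of request involved, the sketch below builds a MediaWiki Action API query URL asking for the latest revision ID and timestamp of one article. It is not the Rfam update script: the function name and the example article title are our own choices, and the code only constructs the URL without contacting Wikipedia.

```python
from urllib.parse import urlencode

# Public endpoint of the English Wikipedia MediaWiki Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def latest_revision_url(title):
    """Build a query URL for the latest revision ID and timestamp of an article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

url = latest_revision_url("Transfer RNA")
print(url)
```

Comparing the returned revision ID with the one stored at the previous check is one simple way to decide whether an article has changed and needs to be reviewed again.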

FUTURE CHALLENGES

The rate of discovery of new RNA families is accelerating rapidly, facilitated by advancements in new sequencing technologies (12,13) and targeted computational screens (14–17). Keeping abreast of these updates whilst still ensuring the quality of alignments and secondary structures is an ongoing challenge for Rfam. We continue to evaluate new technologies and techniques as they emerge and will adopt new procedures for building and checking Rfam families as necessary. We have been actively updating Rfam families and database crosslinks using more specialized RNA databases such as miRBase (18), IRESite (19), Pseudobase (20), snoRNABase (21), the plant snoRNA database (22), TransTerm (23) and the Yeast snoRNA database (24). As a result of these efforts, the next release of Rfam (version 9.1) will contain more than 700 entirely new families, bringing the total number of Rfam families to over 1300.

A new version of Infernal (v1.0) is now available (http://infernal.janelia.org) and we plan to use this latest version to prepare the next major release of Rfam. Testing suggests that, compared with the version used for Rfam 9 (v0.72), v1.0 is faster and slightly more sensitive, whilst introducing for the first time E-values for hits returned from database searches. Although the speed increase will not be sufficient to obviate the need for BLAST filters in the Rfam production pipeline, this remains a major goal for Infernal development. Importantly, Infernal v1.0 is not compatible with the Rfam 9 CM files. Rfam/Infernal users may wish to generate new CMs from Rfam 9 SEED or FULL alignment files.

We have mapped a subset of three-dimensional RNA structures found in the Protein DataBank (PDB) (25) (primarily SRP and ribosomal RNAs) to corresponding sequences in Rfam. In an initial feasibility study, we have demonstrated that RNA sequences can be retrieved from PDB files and mapped to Rfam sequences reliably. The mapping is currently performed using BLAT (26) to detect local regions of high similarity with high specificity. The positions of matches to Rfam entries are transferred to the PDB sequences, allowing us to colour three-dimensional structures as in Figure 3. We intend to roll out this mapping across all Rfam families and PDB entries using both local similarities and global matches to Rfam models. This sequence-to-structure mapping will allow us to use determined tertiary structures to calculate secondary structure as a quality control for existing families, and catalogue interactions between RNA–RNA and RNA–protein families.
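The position-transfer step can be pictured with a small sketch. It assumes that each local match is available as an ungapped block `(pdb_start, rfam_start, length)` in 1-based coordinates, which is a simplification chosen for this example rather than a description of the actual BLAT output handling. The function names are hypothetical; the green and orange labels follow the colouring described for Figure 3.

```python
def map_positions(blocks):
    """Map PDB sequence positions to Rfam sequence positions.

    blocks: iterable of (pdb_start, rfam_start, length) ungapped matches,
    1-based. This block representation is an assumption of the sketch.
    """
    mapping = {}
    for pdb_start, rfam_start, length in blocks:
        for offset in range(length):
            mapping[pdb_start + offset] = rfam_start + offset
    return mapping

def colour_residues(pdb_length, mapping, matched="green", unmatched="orange"):
    """Assign a colour to each PDB position depending on whether it was matched."""
    return [matched if pos in mapping else unmatched
            for pos in range(1, pdb_length + 1)]

# Toy example: PDB positions 3-6 match Rfam positions 10-13.
mapping = map_positions([(3, 10, 4)])
colours = colour_residues(8, mapping)
```

With the toy block, the first two and last two PDB positions remain unmatched and would be shown in the unmatched colour.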

Figure 3.

An example of how PDB structures are displayed in Rfam. In this case, the structure 1lng, containing the SRP19-7S.S RNA Complex from M. jannaschii, is rendered as cartoons using Jmol. Protein regions are coloured using the following scheme: beta-sheets (yellow), helices (magenta) and unstructured regions (white). RNA bases that match the Rfam model are coloured according to the key given in the web page (not shown here). In this structure, green represents a match to the eukaryotic SRP model, whereas those unmatched bases are coloured orange.

A further area of active research at Rfam is how best to distribute genome annotations. We plan to make annotations available in a variety of formats including the distributed annotation service (DAS) (27), General Feature Format (GFF) (http://song.sourceforge.net/gff3.shtml) and EMBL format, together with links to relevant genome browsers, e.g. ENSEMBL, UCSC and Genome Reviews.
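To make the GFF option concrete, the sketch below formats one annotated region as a nine-column, tab-separated GFF3 record. It is an illustration of the file format only, not an Rfam export tool: the function name, the sequence identifier, the coordinates and the bit score are invented, and the choice of `ncRNA` as the feature type and of `ID`/`Name` attributes is an assumption.

```python
def gff3_line(seqid, start, end, bit_score, strand, accession, name,
              source="Rfam", feature_type="ncRNA"):
    """Format one annotated region as a nine-column GFF3 record."""
    # GFF3 coordinates are 1-based with start <= end; strand carries direction.
    low, high = sorted((start, end))
    attributes = f"ID={accession}_{seqid}_{low}_{high};Name={name}"
    fields = [
        seqid,               # column 1: sequence identifier
        source,              # column 2: annotation source
        feature_type,        # column 3: feature type
        str(low),            # column 4: start
        str(high),           # column 5: end
        f"{bit_score:.2f}",  # column 6: score
        strand,              # column 7: strand
        ".",                 # column 8: phase (not applicable to RNA features)
        attributes,          # column 9: attributes
    ]
    return "\t".join(fields)

# Hypothetical hit of the tRNA family (RF00005) on an invented sequence.
example = gff3_line("EXAMPLE_SEQ.1", 1272, 1200, 61.4, "-", "RF00005", "tRNA")
print(example)
```

A reverse-strand hit reported with start greater than end is normalized here so that the record keeps start no greater than end.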

FUNDING

Wellcome Trust (to P.P.G., J.D., J.T., R.D.F. and A.B.); Howard Hughes Medical Institute (to E.P.N., D.L.K. and S.R.E.); University of Manchester (to S.G.J.); University of Copenhagen (to S.L.). Funding for open access charge: The Wellcome Trust. Conflict of interest statement. None declared.

26 in total

1. Fast and reliable prediction of noncoding RNAs.

Authors: Stefan Washietl; Ivo L Hofacker; Peter F Stadler
Journal: Proc Natl Acad Sci U S A Date: 2005-01-21 Impact factor: 11.205

2. The RNA WikiProject: community annotation of RNA families.

Authors: Jennifer Daub; Paul P Gardner; John Tate; Daniel Ramsköld; Magnus Manske; William G Scott; Zasha Weinberg; Sam Griffiths-Jones; Alex Bateman
Journal: RNA Date: 2008-10-22 Impact factor: 4.942

3. The transcriptional landscape of the mammalian genome.

Authors: P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; 
M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal: Science Date: 2005-09-02 Impact factor: 47.728

4. IRESite: the database of experimentally verified IRES structures (www.iresite.org).

Authors: Martin Mokrejs; Václav Vopálenský; Ondrej Kolenaty; Tomás Masek; Zuzana Feketová; Petra Sekyrová; Barbora Skaloudová; Vítezslav Kríz; Martin Pospísek
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

5. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs.

Authors: Laurent Lestrade; Michel J Weber
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. Transterm--extended search facilities and improved integration with other databases.

Authors: Grant H Jacobs; Peter A Stockwell; Warren P Tate; Chris M Brown
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. Integrating biological data--the Distributed Annotation System.

Authors: Andrew M Jenkinson; Mario Albrecht; Ewan Birney; Hagen Blankenburg; Thomas Down; Robert D Finn; Henning Hermjakob; Tim J P Hubbard; Rafael C Jimenez; Philip Jones; Andreas Kähäri; Eugene Kulesha; José R Macías; Gabrielle A Reeves; Andreas Prlić
Journal: BMC Bioinformatics Date: 2008-07-22 Impact factor: 3.169

8. Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database.

Authors: Guy Cochrane; Ruth Akhtar; Philippe Aldebert; Nicola Althorpe; Alastair Baldwin; Kirsty Bates; Sumit Bhattacharyya; James Bonfield; Lawrence Bower; Paul Browne; Matias Castro; Tony Cox; Fehmi Demiralp; Ruth Eberhardt; Nadeem Faruque; Gemma Hoad; Mikyung Jang; Tamara Kulikova; Alberto Labarga; Rasko Leinonen; Steven Leonard; Quan Lin; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Francesco Nardone; Sheila Plaister; Stephen Robinson; Siamak Sobhany; Robert Vaughan; Dan Wu; Weimin Zhu; Rolf Apweiler; Tim Hubbard; Ewan Birney
Journal: Nucleic Acids Res Date: 2007-11-26 Impact factor: 16.971

9. Deep sequencing analysis of small noncoding RNA and mRNA targets of the global post-transcriptional regulator, Hfq.

Authors: Alexandra Sittka; Sacha Lucchini; Kai Papenfort; Cynthia M Sharma; Katarzyna Rolle; Tim T Binnewies; Jay C D Hinton; Jörg Vogel
Journal: PLoS Genet Date: 2008-08-22 Impact factor: 5.917

10. Probabilistic phylogenetic inference with insertions and deletions.

Authors: Elena Rivas; Sean R Eddy
Journal: PLoS Comput Biol Date: 2008-09-19 Impact factor: 4.475
