
Rfam: annotating non-coding RNAs in complete genomes.

Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R Eddy, Alex Bateman.

Abstract

Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three kingdoms of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.

Year:  2005        PMID: 15608160      PMCID: PMC540035          DOI: 10.1093/nar/gki081

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Non-coding RNA (ncRNA) genes produce a functional RNA product instead of a translated protein. These products are components of some of the most important cellular machines, such as the ribosome (ribosomal RNAs), the spliceosome (U1, U2, U4, U5 and U6 RNAs) and the telomerase (telomerase RNA). The known repertoire of ncRNA cellular functions is expanding rapidly. Small nucleolar RNAs (snoRNAs) guide essential modifications of ribosomal and spliceosomal RNAs [reviewed in (1)]. Ribozymes catalyse a range of reactions, such as self-cleavage of hepatitis delta virus transcripts, and 5′ maturation of transfer RNAs (tRNAs) by the ubiquitous RNase P. A class of small RNAs almost unknown before 2000, the microRNAs (miRNAs), are found to be involved in regulation of ever more processes in higher eukaryotes (including development, cell death and fat metabolism) by repressing the translation of mRNA targets [reviewed in (2)]. Similar mRNA-binding regulatory roles in bacteria are fulfilled by distinct families of small RNAs [reviewed in (3)].

Like protein-coding genes, ncRNA sequences can be grouped into families and much can be learnt about structure and function from multiple sequence alignments of such families. Unlike proteins, ncRNAs often conserve a base-paired secondary structure with low primary sequence similarity. The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs), analogous to profile hidden Markov models (HMMs) of protein alignments.

Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs, available via the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/. All the data are also available for download, local installation and sequence searching using the INFERNAL software package (http://infernal.wustl.edu/) (4). The Rfam/INFERNAL model is much like the Pfam/HMMER system (5), extended to deal with RNA secondary structure consensus, and has been discussed previously (6). Here, we concentrate on recent improvements and discuss challenges that we expect to address through future development.
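The point that ncRNAs can keep a base-paired structure while sharing little primary sequence can be illustrated with a minimal Python sketch. It is not part of Rfam or INFERNAL: the dot-bracket consensus string, the two toy sequences and the helper names (`parse_pairs`, `paired_fraction`, `identity`) are all hypothetical, chosen only to show why a model that scores paired columns jointly can recognise homologues that a sequence-only profile would find hard to align.

```python
from typing import List, Tuple

# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def parse_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return the (i, j) column pairs encoded by a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for col, ch in enumerate(structure):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), col))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def paired_fraction(seq: str, pairs: List[Tuple[int, int]]) -> float:
    """Fraction of consensus pairs that this sequence can actually form."""
    ok = sum(1 for i, j in pairs if seq[i] + seq[j] in CANONICAL)
    return ok / len(pairs)


def identity(a: str, b: str) -> float:
    """Fraction of aligned columns with identical residues."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy alignment: a 4 bp stem closed by a 4 nt loop.
structure = "((((....))))"
seq_a = "GGCAGAAAUGCC"
seq_b = "CAUGGAAACAUG"

pairs = parse_pairs(structure)
print("sequence identity:", round(identity(seq_a, seq_b), 2))
print("pairs kept in seq_a:", paired_fraction(seq_a, pairs))
print("pairs kept in seq_b:", paired_fraction(seq_b, pairs))
```

In this toy case every stem position differs between the two sequences, yet both can form all four consensus pairs; the only identical columns are in the loop.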

RECENT DEVELOPMENTS

The database has grown dramatically over the past two years: from 25 families annotating around 55 000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280 000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. Although we therefore cannot represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database.

Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can be used as detectors for a wide range of environmental conditions [e.g. bacterial riboswitches bind a range of metabolites as reviewed previously (7,8), and the 5′-UTR of the PrfA mRNA acts as a temperature-dependent switch (9)] to regulate message stability or translational efficiency. This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above: ‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user, for instance, to quickly identify all snoRNA gene families or to find all riboswitches in the database.

One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes [e.g. Caenorhabditis briggsae (10), chicken (11) and Erwinia carotovora (12)]. In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.
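As a rough illustration of how a type field can serve as a browsing entry point, the following Python sketch stores each family's type as a path whose first element is one of the three top-level types named above. Only the top-level names and the release 6.1 totals come from the text; the lower-level terms, the family names and the helpers (`Family`, `top_level_counts`, `with_type`) are hypothetical and do not describe the actual Rfam schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

# Top-level totals quoted in the text for release 6.1.
RELEASE_6_1 = {"Gene": 308, "Cis-reg": 69, "Intron": 2}


@dataclass(frozen=True)
class Family:
    name: str
    type_path: Tuple[str, ...]  # first element is the top-level type


def top_level_counts(families: List[Family]) -> Counter:
    """Count families under each top-level type."""
    return Counter(f.type_path[0] for f in families)


def with_type(families: List[Family], term: str) -> List[str]:
    """Names of families whose type path contains the given term."""
    return [f.name for f in families if term in f.type_path]


# Hypothetical entries; names and sub-types are placeholders.
families = [
    Family("example_tRNA", ("Gene", "tRNA")),
    Family("example_CD_snoRNA", ("Gene", "snRNA", "snoRNA", "CD-box")),
    Family("example_HACA_snoRNA", ("Gene", "snRNA", "snoRNA", "HACA-box")),
    Family("example_miRNA", ("Gene", "miRNA")),
    Family("example_riboswitch", ("Cis-reg", "riboswitch")),
    Family("example_thermosensor", ("Cis-reg", "thermoregulator")),
    Family("example_group_I", ("Intron",)),
]

print("families in release 6.1:", sum(RELEASE_6_1.values()))
print("snoRNA families:", with_type(families, "snoRNA"))
print("riboswitches:", with_type(families, "riboswitch"))
print("toy top-level counts:", dict(top_level_counts(families)))
```

The three release 6.1 totals sum to the 379 families quoted at the start of this section.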

NON-CODING RNAS IN COMPLETE GENOMES

Rfam makes available annotation of over 13 400 candidate ncRNA genes (plus 172 self-splicing introns and 1285 cis-regulatory RNA elements) belonging to 172 families in 224 completed chromosomes and genomes. The average bacterial genome contains over 80 hits, dominated by the number of tRNAs. A total of 170 regions are annotated in Escherichia coli, in which most experimental validation of computationally predicted ncRNAs has been carried out. Rfam-annotated regions in Bacillus genomes (B.anthracis is shown in Figure 1) include a number of recently described riboswitches (7,8).

Figure 1. Rfam genome page for Bacillus anthracis. The table contains a summary of the number of members of each Rfam family in the genome, with the distribution of hits shown on the map.

These data provide the first comprehensive view of the distribution of ncRNAs in the three kingdoms of life. There are a small number of very large families representing some of the best-understood RNAs. Figure 2 shows that these few large families are the only RNAs that are ubiquitous across all three domains of life: only the essential translation components, tRNA and ribosomal RNA, together with RNase P (tRNA maturation) and SRP RNA (protein export) are found in eukaryotes, bacteria and archaea. It is tempting to believe that very few families will be added to the catalogue of universally conserved RNAs. However, it is clear that members of some families are so highly divergent as to be computationally almost unrecognizable. For example, although most eukaryotes would be expected to have a telomerase RNA, current computational techniques are unable to identify homologues in even well-studied model organisms such as Caenorhabditis elegans.

Figure 2. Taxonomic distribution of Rfam family members in the three kingdoms of life.

Only snoRNAs are found in eukaryotes and archaea and not in bacteria, but RNA families have not yet been identified that are common to bacteria and archaea but not eukaryotes, or eukaryotes and bacteria but not archaea. The vast majority of Rfam families are small, and are often specific to one taxonomic group, and in some cases to one organism, suggesting relatively recent evolution of function or divergence beyond our ability to recognize homologues. Many novel bacterial ncRNAs have been identified by a number of recent computational screens in E.coli [reviewed in (13)], but comparatively few have been experimentally verified. Rfam contains more than 30 ncRNA families based on the verified genes. Few large-scale studies have been conducted in archaea or eukaryotes, and it is clear that such efforts will identify many more small families.
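The taxonomic pattern described above can be restated as a small set computation. The Python sketch below encodes only the distributions named in the text (the four universal families, and snoRNAs in eukaryotes and archaea) and then lists the two-domain combinations for which no family is recorded. The data structure and function names are illustrative assumptions, not how Rfam stores or computes this information.

```python
from collections import defaultdict
from itertools import combinations

DOMAINS = ("archaea", "bacteria", "eukaryotes")

# Domain distribution of the families named in the text.
OBSERVED = {
    "tRNA": set(DOMAINS),
    "ribosomal RNA": set(DOMAINS),
    "RNase P": set(DOMAINS),
    "SRP RNA": set(DOMAINS),
    "snoRNA": {"archaea", "eukaryotes"},
}


def group_by_distribution(observed):
    """Map each set of domains to the families found in exactly that set."""
    groups = defaultdict(list)
    for family, domains in observed.items():
        groups[frozenset(domains)].append(family)
    return dict(groups)


def unrepresented_pairs(groups):
    """Two-domain combinations with no family restricted to them."""
    return [
        set(pair)
        for pair in combinations(DOMAINS, 2)
        if frozenset(pair) not in groups
    ]


groups = group_by_distribution(OBSERVED)
universal = groups[frozenset(DOMAINS)]
missing = unrepresented_pairs(groups)

print("found in all three domains:", universal)
print("two-domain combinations with no family:", missing)
```

The two combinations reported as empty, bacteria with archaea and bacteria with eukaryotes, are the ones the text says have not yet been observed.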

FUTURE CHALLENGES

Profile SCFG searches are computationally expensive. Rfam at present uses a BLAST-based heuristic (14) as described previously (6), reducing the search space with an inevitable sensitivity cost. This allows us to search a 5 Mb bacterial genome against the entire Rfam library in ∼24 h. Annotation of large eukaryotic genomes is just feasible using this approach. Recent advances allow the speed of profile SCFGs to be increased by a factor of ∼100 for most families, and provably do not reduce the sensitivity of the full SCFG search (15). Work is ongoing to incorporate such algorithms into the Rfam/INFERNAL approach.

We also recognize that the current approach is restricted to RNAs with defined secondary structures, precluding inclusion of important families of essentially unstructured RNAs like XIST (X-Inactive Specific Transcript), RoX (RNA on X) and IPW (Imprinted in Prader–Willi). Furthermore, the consensus structure annotation may conceal additional elements in divergent structures. We plan to evaluate how the use of profile HMMs may allow the detection of homologues of unstructured RNAs, and investigate the propagation of structure annotation at the sequence level.

Perhaps the biggest challenge for annotation of higher eukaryotic genomes is the problem of ncRNA-derived pseudogenes and repeats. For example, the B2 repeat in mouse is evolutionarily related to a tRNA, and Alu repeats in human derive from SRP RNA (16). Over 10% of the draft human genome sequence is made up of 1.1 million Alu sequences (17), and there are over 350 000 B2 repeat sequences in mouse (18). The human genome also contains over 1000 sequences that are closely related to U6 spliceosomal RNA, yet sensible estimates of the U6 gene count suggest that <50 are functional. Other problem families include the pol III-transcribed Y and 7SK RNAs. Distinguishing the functional copies from the large numbers of pseudogenes is an unsolved problem and presents a significant challenge to RNA computational biologists.

It seems likely that computational and experimental screens will continue to identify numerous novel ncRNAs. Most of these genes are predicted to fall into small families with narrow taxonomic ranges. In contrast, we believe that very few universally conserved RNAs will be found, and the large, well-studied and ubiquitous families will continue to make up the large majority of ncRNAs in a single genome. Rfam will continue to translate novel discoveries of ncRNA genes into alignments and models that are immediately useful for genome annotation and phylogenetic analysis.
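To give a feel for the search-cost figures quoted in this section, the following back-of-the-envelope Python sketch extrapolates from the stated rate of roughly 24 h per 5 Mb genome. It rests on assumptions that are not in the text: that run time scales linearly with sequence length, that the work runs on a single processor, and that a large eukaryotic genome is about 3000 Mb. The function name `search_hours` is hypothetical.

```python
def search_hours(genome_mb: float,
                 hours_per_5mb: float = 24.0,
                 speedup: float = 1.0) -> float:
    """Estimated hours to search a genome against the full library.

    Assumes run time grows linearly with genome size (an assumption,
    not a measured property of the BLAST-filtered search).
    """
    return genome_mb / 5.0 * hours_per_5mb / speedup


# Assumed size of a large eukaryotic genome, in Mb (not from the text).
LARGE_GENOME_MB = 3000.0

bacterial = search_hours(5.0)
large = search_hours(LARGE_GENOME_MB)
large_fast = search_hours(LARGE_GENOME_MB, speedup=100.0)

print(f"5 Mb genome: {bacterial:.0f} h")
print(f"3000 Mb genome: {large:.0f} h (~{large / 24:.0f} days)")
print(f"3000 Mb genome with 100x speedup: {large_fast:.0f} h "
      f"(~{large_fast / 24:.0f} days)")
```

Under these assumptions a single processor would need on the order of 600 days for a 3000 Mb genome, which is consistent with such annotation being described as only just feasible, and a roughly 100-fold speedup would bring that down to about 6 days.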
  18 in total

1.  Initial sequencing and analysis of the human genome.

Authors:  E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; 
H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal:  Nature       Date:  2001-02-15       Impact factor: 49.962

Review 2.  The expanding snoRNA world.

Authors:  Jean Pierre Bachellerie; Jérôme Cavaillé; Alexander Hüttenhofer
Journal:  Biochimie       Date:  2002-08       Impact factor: 4.079

3.  Initial sequencing and comparative analysis of the mouse genome.

Authors:  Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal:  Nature       Date:  2002-12-05       Impact factor: 49.962

Review 4.  Riboswitches: the oldest mechanism for the regulation of gene expression?

Authors:  Alexey G Vitreschak; Dimitry A Rodionov; Andrey A Mironov; Mikhail S Gelfand
Journal:  Trends Genet       Date:  2004-01       Impact factor: 11.639

Review 5.  MicroRNAs: genomics, biogenesis, mechanism, and function.

Authors:  David P Bartel
Journal:  Cell       Date:  2004-01-23       Impact factor: 41.582

6.  Rfam: an RNA family database.

Authors:  Sam Griffiths-Jones; Alex Bateman; Mhairi Marshall; Ajay Khanna; Sean R Eddy
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

7.  A survey of small RNA-encoding genes in Escherichia coli.

Authors:  Ruth Hershberg; Shoshy Altuvia; Hanah Margalit
Journal:  Nucleic Acids Res       Date:  2003-04-01       Impact factor: 16.971

8.  Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.

Authors:  International Chicken Genome Sequencing Consortium
Journal:  Nature       Date:  2004-12-09       Impact factor: 49.962

9.  The Pfam protein families database.

Authors:  Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

10.  A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure.

Authors:  Sean R Eddy
Journal:  BMC Bioinformatics       Date:  2002-07-02       Impact factor: 3.169

