ensembldb: an R package to create and use Ensembl-based annotation resources.

Johannes Rainer¹, Laurent Gatto², Christian X Weichenberger¹.

Abstract

SUMMARY: Bioinformatics research frequently involves handling gene-centric data such as exons, transcripts, proteins and their positions relative to a reference coordinate system. The ensembldb Bioconductor package retrieves and stores Ensembl-based genetic annotations and positional information, and also offers identifier conversion and coordinate mapping for gene-associated data. In support of reproducible research, data are tied to Ensembl releases and are kept separately from the software. Premade data packages are available for a variety of genomes and Ensembl releases. Three examples demonstrate typical use cases of this software.
AVAILABILITY AND IMPLEMENTATION: ensembldb is part of Bioconductor (https://bioconductor.org/packages/ensembldb).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019; PMID: 30689724; PMCID: PMC6736197; DOI: 10.1093/bioinformatics/btz031

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

When the human genome was released as a first draft (Lander et al., 2001), researchers started to manage such large, fragmented and rapidly evolving datasets by creating genome browsing and database systems such as EMBL-EBI Ensembl (Birney et al., 2004). The availability of a reference genome allows definition of a coordinate system in which genomic data, such as collaboratively defined gene models, are described unambiguously by chromosomal positions. Ensembl publishes several data releases per year, which makes it a valuable resource for consistent and tightly integrated data. These data are used in high-throughput genomic data analyses, which are frequently carried out in the R statistical programming language using tools provided by Bioconductor (Huber et al., 2015). Here we present ensembldb, a Bioconductor package enabling the creation and usage of comprehensive, locally stored Ensembl-based annotation databases that can be used offline. In addition to gene model annotations, we include protein annotations in the pre-built databases, offer a fast and powerful filter mechanism and provide functions for mapping arbitrary positions between the genome, exome, transcriptome and proteome.

2 Implementation and available data

Ensembl is one of the main annotation resources for genomic data, with a web service for online data access and APIs enabling programmatic data access. The ensembldb package provides functions to retrieve annotations for any of the >300 species available through Ensembl and EnsemblGenomes using their Perl API and to store the information in small custom databases, which can be distributed as self-contained SQLite files or MySQL databases. The annotations included in our EnsDb databases comprise (i) genomic coordinates for all genes, transcripts and exons of a species and their relation to each other; (ii) general metadata such as gene and transcript biotypes and NCBI Entrez gene IDs; and (iii) protein annotations including amino acid sequences, positions of protein domains within these (from e.g. Pfam; Finn et al., 2016) and mappings of Ensembl protein identifiers to UniProt accession numbers. Some of these annotations are also available in other Bioconductor annotation resources, in particular TxDb databases from the GenomicFeatures package (Lawrence et al., 2013), which provide genomic coordinates, and org.*.db packages, which contain gene-related annotations. With ensembldb, all this information is bundled conveniently into a single database.

We distribute pre-built EnsDb databases covering all Ensembl core species for a range of Ensembl releases using Bioconductor's AnnotationHub resource, which can be thought of as a queryable repository for annotation data. These locally stored databases enable offline access to Ensembl annotations in Bioconductor, in contrast to the biomaRt package (Durinck et al., 2009), which also provides Ensembl annotations but requires an active internet connection. In Bioconductor, the AnnotationDbi package provides a common interface for retrieving annotation data, and the GenomicFeatures package defines means for representation, organization and structured retrieval of transcript models and genomic positions of genes and their exons. The ensembldb package is compliant with both interfaces, so that data retrieval and data access are handled in a standardized way.

In addition, we developed a powerful filtering framework in ensembldb, which translates filters directly into SQL queries for better performance (benchmarks are provided in the supplement). It is based on our AnnotationFilter classes, available as a separate Bioconductor package to encourage usage beyond ensembldb. The filters fall into two main groups: one to query arbitrary textual information, such as gene symbols or UniProt accession numbers, and the other to handle positional information of genes, exons, transcripts and protein domains. Filters can be combined with logical expressions to create tailored queries and retrieve only specific data from the databases. This is particularly useful for visualizing transcript models from certain genomic regions: ensembldb facilitates plotting with the Bioconductor packages ggbio (Yin et al., 2012) and Gviz (Hahne and Ivanek, 2016). Generally, results returned by ensembldb are compatible with the standards defined by Bioconductor, so that data can be easily exchanged with other packages for further analysis.

3 Usage and examples

The first example illustrates filtered data retrieval using ensembldb in the context of Down syndrome, a genetic disorder characterized by the presence of all or parts of a third copy of chromosome 21. In our example, we are interested in transcription factors encoded on chromosome 21 with a basic helix-loop-helix DNA-binding domain, as described by Pfam ID PF00010. Given a variable edb of type EnsDb, the simple command genes(edb, filter = ~ protein_domain_id == "PF00010" & seq_name == "21") returns the genomic annotations for three genes: SIM2, a master regulator of neurogenesis thought to contribute to some phenotypes of Down syndrome (Gardiner and Costa, 2006), and the two genes OLIG1 and OLIG2, triplication of which was shown to cause developmental brain defects (Chakrabarti et al., 2010). Visualization of the genomic neighborhood is accomplished by passing the filter to an ensembldb function extracting data for plotting using Gviz, as shown in Figure 1.
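To make the idea of filters that translate to SQL more concrete, the following is a minimal, language-agnostic sketch written in Python with only the standard library. It is not ensembldb's implementation: the table gene_domain, the class FieldFilter and the helpers combine_and and query_genes are hypothetical names invented for illustration, the real EnsDb schema differs, and the rows TOY_A, TOY_B and the domain ID PF99999 are made-up placeholders. Only the three gene names and the PF00010/chromosome 21 query come from the example above.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical whitelist of filterable columns in the toy table below.
ALLOWED_COLUMNS = {"protein_domain_id", "seq_name", "gene_name"}


@dataclass(frozen=True)
class FieldFilter:
    """One equality condition, e.g. seq_name == "21" (toy stand-in for a filter object)."""
    field: str
    value: str

    def to_sql(self):
        if self.field not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown filter field: {self.field}")
        # The value is bound as a parameter instead of being pasted into the SQL text.
        return f"{self.field} = ?", [self.value]


def combine_and(*filters):
    """Join several filters with AND into one WHERE clause plus its parameters."""
    if not filters:
        raise ValueError("at least one filter is required")
    clauses, params = [], []
    for f in filters:
        clause, values = f.to_sql()
        clauses.append(clause)
        params.extend(values)
    return " AND ".join(clauses), params


def query_genes(conn, *filters):
    """Run the combined filter as a single SQL query and return matching gene names."""
    where, params = combine_and(*filters)
    sql = f"SELECT DISTINCT gene_name FROM gene_domain WHERE {where} ORDER BY gene_name"
    return [row[0] for row in conn.execute(sql, params)]


def build_toy_db():
    """Create an in-memory SQLite database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene_domain (gene_name TEXT, seq_name TEXT, protein_domain_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene_domain VALUES (?, ?, ?)",
        [
            ("SIM2", "21", "PF00010"),
            ("OLIG1", "21", "PF00010"),
            ("OLIG2", "21", "PF00010"),
            ("TOY_A", "1", "PF00010"),   # hypothetical: same domain, other chromosome
            ("TOY_B", "21", "PF99999"),  # hypothetical: same chromosome, other domain
        ],
    )
    return conn


conn = build_toy_db()
hits = query_genes(
    conn,
    FieldFilter("protein_domain_id", "PF00010"),
    FieldFilter("seq_name", "21"),
)
print(hits)  # ['OLIG1', 'OLIG2', 'SIM2']
```

Because the whole condition is pushed into one WHERE clause, the database does the subsetting and only matching rows are returned to the caller, which is the performance argument made for the filter framework.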

Fig. 1. Schematic overview of the gene SIM2. Shown are the exons of the gene in brown and all co-locating Pfam protein domains in blue.

Another hallmark of ensembldb is its capability to convert any position within a protein, transcript or the genome to any other of these three entities, extending the genome-to-transcript mapping functionality of GenomicFeatures. For example, one of the known variants responsible for human red hair color is located at position 16:89920138 (dbSNP ID rs1805009) on the human genome (version GRCh38) and is readily converted by ensembldb to position 294 on the respective protein, given by Ensembl ID ENSP00000451605, using the command genomeToProtein(GRanges("16", IRanges(89920138, width = 1)), edb) with edb as defined in the first example. We furthermore find that this protein is encoded by the MC1R gene by issuing the AnnotationDbi-compatible query command select(edb, keys = "ENSP00000451605", keytype = "PROTEINID", columns = "SYMBOL"). Our annotation packages also contain protein sequence information. Thus, the call proteins(edb, filter = ~ protein_id == "ENSP00000451605")$protein_sequence returns the protein sequence for the selected ID, in which position 294 is an aspartic acid ('D'). This agrees with the reference amino acid of variant Asp294His (Valverde et al., 1995) described by the dbSNP ID cited above. Expanded code with descriptions and results for these two examples is provided as Supplementary Material and as a Bioconductor package vignette.

Finally, by providing gene annotations and positional information of exons on the genome and supporting the standard Bioconductor interfaces for data retrieval, ensembldb can be easily integrated into genome analysis pipelines. An extended example is given in the Supplementary Material, where we present a modified version of the standard Bioconductor RNA-seq workflow (Love et al., 2015).
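The arithmetic behind a genome-to-protein conversion such as the one in the second example can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the algorithm used by genomeToProtein: the function genome_to_protein and the exon coordinates are hypothetical, positions are 1-based and inclusive, the coding exons are assumed to contain only coding bases of a single transcript, and cases such as multiple transcripts per position or incomplete coding sequences are ignored.

```python
def genome_to_protein(pos, cds_exons, strand):
    """Map a 1-based genomic position to a 1-based amino acid index.

    cds_exons is a list of (start, end) inclusive genomic ranges that make up
    the coding sequence of one transcript. Returns None when pos lies outside
    the coding sequence (e.g. in an intron).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Visit exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    offset = 0  # number of coding bases in the exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            # Distance from the exon's 5' end, which is 'end' on the minus strand.
            within = pos - start if strand == "+" else end - pos
            cds_index = offset + within  # 0-based index within the coding sequence
            return cds_index // 3 + 1    # three bases per codon, 1-based residue
        offset += end - start + 1
    return None


# Hypothetical transcript with two coding exons of 9 and 12 bases (7 codons).
toy_exons = [(101, 109), (201, 212)]

print(genome_to_protein(101, toy_exons, "+"))  # 1
print(genome_to_protein(201, toy_exons, "+"))  # 4
print(genome_to_protein(212, toy_exons, "-"))  # 1
print(genome_to_protein(150, toy_exons, "+"))  # None (intronic)
```

The same bookkeeping, run in reverse, gives the protein-to-genome direction: a residue index is multiplied out to a codon's three coding offsets, which are then placed back onto the exons.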

4 Conclusion

Here we have described the Bioconductor package ensembldb, which utilizes annotation resources from Ensembl and integrates them into Bioconductor. The separation of source code and annotation data facilitates reproducible research by allowing ensembldb to access any set of annotations published in the past. With an extensive filtering system, searches can be customized to meet very specific requirements, and powerful coordinate mapping functions enable conversion of coordinates between proteins, transcripts and the genome. Providing protein and protein domain annotations along with genome-centered annotations also makes ensembldb an asset for any post-genome data analysis that aims to combine data from these various domains. For each new Ensembl release, we create EnsDb annotation databases for all Ensembl vertebrates and plan to continue supporting them via AnnotationHub.

Conflict of Interest: none declared.

References (12 in total)

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. An overview of Ensembl.

Authors: Ewan Birney; T Daniel Andrews; Paul Bevan; Mario Caccamo; Yuan Chen; Laura Clarke; Guy Coates; James Cuff; Val Curwen; Tim Cutts; Thomas Down; Eduardo Eyras; Xose M Fernandez-Suarez; Paul Gane; Brian Gibbins; James Gilbert; Martin Hammond; Hans-Rudolf Hotz; Vivek Iyer; Kerstin Jekosch; Andreas Kahari; Arek Kasprzyk; Damian Keefe; Stephen Keenan; Heikki Lehvaslaiho; Graham McVicker; Craig Melsopp; Patrick Meidl; Emmanuel Mongin; Roger Pettett; Simon Potter; Glenn Proctor; Mark Rae; Steve Searle; Guy Slater; Damian Smedley; James Smith; Will Spooner; Arne Stabenau; James Stalker; Roy Storey; Abel Ureta-Vidal; K Cara Woodwark; Graham Cameron; Richard Durbin; Anthony Cox; Tim Hubbard; Michele Clamp
Journal: Genome Res Date: 2004-04-12 Impact factor: 9.043

3. The proteins of human chromosome 21.

Authors: Katheleen Gardiner; Alberto C S Costa
Journal: Am J Med Genet C Semin Med Genet Date: 2006-08-15 Impact factor: 3.908

4. Orchestrating high-throughput genomic analysis with Bioconductor.

Authors: Wolfgang Huber; Vincent J Carey; Robert Gentleman; Simon Anders; Marc Carlson; Benilton S Carvalho; Hector Corrada Bravo; Sean Davis; Laurent Gatto; Thomas Girke; Raphael Gottardo; Florian Hahne; Kasper D Hansen; Rafael A Irizarry; Michael Lawrence; Michael I Love; James MacDonald; Valerie Obenchain; Andrzej K Oleś; Hervé Pagès; Alejandro Reyes; Paul Shannon; Gordon K Smyth; Dan Tenenbaum; Levi Waldron; Martin Morgan
Journal: Nat Methods Date: 2015-02 Impact factor: 28.547

5. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.

Authors: Steffen Durinck; Paul T Spellman; Ewan Birney; Wolfgang Huber
Journal: Nat Protoc Date: 2009-07-23 Impact factor: 13.491

6. RNA-Seq workflow: gene-level exploratory analysis and differential expression.

Authors: Michael I Love; Simon Anders; Vladislav Kim; Wolfgang Huber
Journal: F1000Res Date: 2015-10-14

7. Software for computing and annotating genomic ranges.

Authors: Michael Lawrence; Wolfgang Huber; Hervé Pagès; Patrick Aboyoun; Marc Carlson; Robert Gentleman; Martin T Morgan; Vincent J Carey
Journal: PLoS Comput Biol Date: 2013-08-08 Impact factor: 4.475

8. Olig1 and Olig2 triplication causes developmental brain defects in Down syndrome.

Authors: Lina Chakrabarti; Tyler K Best; Nathan P Cramer; Rosalind S E Carney; John T R Isaac; Zygmunt Galdzicki; Tarik F Haydar
Journal: Nat Neurosci Date: 2010-07-18 Impact factor: 24.884

9. ggbio: an R package for extending the grammar of graphics for genomic data.

Authors: Tengfei Yin; Dianne Cook; Michael Lawrence
Journal: Genome Biol Date: 2012-08-31 Impact factor: 13.583

10. The Pfam protein families database: towards a more sustainable future.

Authors: Robert D Finn; Penelope Coggill; Ruth Y Eberhardt; Sean R Eddy; Jaina Mistry; Alex L Mitchell; Simon C Potter; Marco Punta; Matloob Qureshi; Amaia Sangrador-Vegas; Gustavo A Salazar; John Tate; Alex Bateman
Journal: Nucleic Acids Res Date: 2015-12-15 Impact factor: 16.971
