Biomartr: genomic data retrieval with R.

Abstract

Motivation: Retrieval and reproducible functional annotation of genomic data are crucial in biology. However, the current poor usability and transparency of retrieval methods hinder reproducibility. Here we present an open source R package, biomartr, which provides a comprehensive, easy-to-use framework for automating data retrieval and functional annotation for meta-genomic approaches. The functions of biomartr achieve a high degree of clarity, transparency and reproducibility of analyses.
Results: The biomartr package implements straightforward functions for bulk retrieval of all genomic data or data for selected genomes, proteomes, coding sequences and annotation files present in databases hosted by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EMBL-EBI). In addition, biomartr communicates with the BioMart database for functional annotation of retrieved sequences. Comprehensive documentation of biomartr functions and five tutorial vignettes provide step-by-step instructions on how to use the package in a reproducible manner.
Availability and Implementation: The open source biomartr package is available at https://github.com/HajkD/biomartr and https://cran.r-project.org/web/packages/biomartr/index.html.
Contact: hgd23@cam.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28110292 PMCID: PMC5408848 DOI: 10.1093/bioinformatics/btw821

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Modern genome studies are no longer limited to single genome analyses or pairwise genomic comparisons but increasingly involve meta-genomic approaches. For this purpose, the NCBI and EMBL-EBI organize and maintain specialized sequence databases that fulfil various scientific requirements. Among the most important and best-curated databases are GenBank, RefSeq and ENSEMBL. GenBank is an annotated collection of all publicly available DNA sequences (Benson et al.). The RefSeq collection offers a comprehensive, integrated, non-redundant and well-annotated set of sequences, including genomic DNA, transcripts and proteins (Pruitt et al.). The ENSEMBL database provides DNA sequence assemblies and curated Ensembl gene builds from various projects (Yates et al.). Current meta-genomic pipelines consist of custom-prepared scripts that automatically retrieve selected genomes from these resources. Post-processing, handling and analysis usually use the Perl, Python or R programming languages. Although powerful sequence retrieval frameworks have been implemented in Perl (BioPerl), Python (BioPython) and R, their use requires appropriate programming expertise. Furthermore, none of these frameworks combines meta-genomic scale sequence retrieval with functional annotation. These deficiencies also apply to the currently available R packages seqinr and biomaRt. The seqinr package aims to automate sequence retrieval in R but is not designed for meta-genomic approaches and does not include functional annotation. The biomaRt package aims to provide functional annotation methods, but these are likewise not designed for meta-genomic approaches and are not easy to use for non-programmers. To provide a fast, transparent and easy-to-use framework for combined genomic data retrieval and efficient functional annotation of genetic features in meta-genomic approaches, we have designed the R package biomartr for use with the NCBI, ENSEMBL, ENSEMBLGENOMES and BioMart infrastructures (Smedley et al.).
The major advantage of biomartr is that it does not require profound programming expertise. It is optimized to handle multiple genomes simultaneously and allows, for example, assignment of Gene Ontology (GO) information and sequence homology relationships between different organisms by communicating with the BioMart database. The interface functions communicating with the BioMart database use a novel organism-centered notation for information retrieval. Instead of learning the underlying database and dataset linking convention of BioMart, users can type the scientific name of an organism of interest (e.g. 'Homo sapiens') to retrieve a list of all information that BioMart provides for that organism. In summary, the biomartr package provides researchers with a powerful tool for efficient, straightforward and reproducible handling of large-scale meta-genomic data and intuitive organism-centered interface functions to retrieve functional annotation information from the BioMart database.
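To illustrate the kind of convention that the organism-centered notation hides, the following sketch derives the dataset name that the main Ensembl BioMart typically uses for a binomial species name (for example 'hsapiens_gene_ensembl' for 'Homo sapiens'). The helper guess_ensembl_dataset() is hypothetical and is not part of biomartr. It assumes a two-word scientific name and the main Ensembl gene mart, so names with subspecies or organisms hosted in other marts will not follow this pattern.

```r
# Hypothetical helper (not a biomartr function): guess the Ensembl BioMart
# dataset name for a binomial scientific name.
# Assumed pattern: first letter of the genus + species epithet + "_gene_ensembl".
guess_ensembl_dataset <- function(organism) {
  # Normalise case and whitespace, then split into genus and species
  parts <- strsplit(tolower(trimws(organism)), "\\s+")[[1]]
  if (length(parts) != 2) {
    stop("expected a binomial name such as 'Homo sapiens'")
  }
  paste0(substr(parts[1], 1, 1), parts[2], "_gene_ensembl")
}

guess_ensembl_dataset("Homo sapiens")
guess_ensembl_dataset("Mus musculus")
```

With biomartr this lookup is unnecessary, because the interface functions accept the scientific name directly.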

2 Implementation

The biomartr package is released under the GNU General Public License within the CRAN project (R Core Team). The package can be downloaded from https://cran.r-project.org/web/packages/biomartr/index.html. The source code is publicly available at https://github.com/HajkD/biomartr. The biomartr package depends on the R packages Biostrings, data.table, dplyr, readr, downloader, RCurl, XML, biomaRt (Durinck et al.), httr and stringr. The functionality of packages such as biomaRt (Durinck et al.) and seqinr (Charif and Lobry, 2007) is included in biomartr and significantly extended. This is achieved by additional data retrieval functions and the direct combination of biomaRt and seqinr functionality with improved retrieval capability.
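Before installing biomartr, it can help to check which of the dependencies listed above are already present. The sketch below uses only base R. The helper missing_packages() is a hypothetical name introduced for illustration and is not part of biomartr.

```r
# Dependencies named in the text; Biostrings and biomaRt are distributed
# through Bioconductor, the remaining packages through CRAN.
biomartr_deps <- c("Biostrings", "data.table", "dplyr", "readr", "downloader",
                   "RCurl", "XML", "biomaRt", "httr", "stringr")

# Hypothetical helper: return the subset of `pkgs` whose namespace cannot be
# loaded in the current R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Report what still needs to be installed
missing_packages(biomartr_deps)
```

An empty result means all listed dependencies can be loaded. Anything reported can then be installed from CRAN or Bioconductor as appropriate.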

3 Functions and examples

Thirty-seven functions are provided by the biomartr package. For genome and database retrieval, the functions listDatabases(), listKingdoms(), listGroups(), listSubgroups(), listGenomes() and is.genome.available() list all databases and genomic sequences that are available for automated retrieval. For example, the entire NCBI nr database can be downloaded with a single command:

    download.database.all(db = "nr", path = "nr")

Selected genomes can be retrieved analogously. For example, download of the human genome is triggered by typing:

    getGenome(db = "refseq", organism = "Homo sapiens")

The command getGenome() also documents the source and version of the downloaded files. Proteomes, coding sequences and annotation files can be downloaded correspondingly by applying getProteome(), getCDS() and getGFF(), respectively. The db argument can be specified to retrieve genomes from other NCBI or ENSEMBL databases. For meta-genomic approaches, biomartr includes the meta.retrieval() function to download the genomes of entire kingdoms:

    # Download all mammalian genomes
    meta.retrieval(kingdom = "vertebrate_mammalian", type = "genome")

Hence, for example, all mammalian genomes can be downloaded with just one command. The type argument can also be specified for proteomes, coding sequences and annotation files. For functional annotation, available datasets and BioMart connections for a specific organism of interest can be obtained by typing:

    organismBM(organism = "Homo sapiens")

For example, available sequence homology relationships to other organisms can be retrieved by running the command:

    organismAttributes(organism = "Homo sapiens", topic = "homolog")

Finally, users can retrieve GO information for a particular gene or gene set, e.g. the human gene GUCA2A, by running the command:

    getGO(organism = "Homo sapiens", genes = "GUCA2A", filters = "hgnc_symbol")

Tutorials are available at https://github.com/HajkD/biomartr#tutorials and also in the Supplementary Tutorial.
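When only a handful of organisms is needed rather than an entire kingdom, the per-organism retrieval calls described above can be wrapped in a small loop. The following is a minimal sketch under stated assumptions, not part of biomartr: retrieve_genomes() is a hypothetical helper, the retrieval function is assumed to accept db, organism and path arguments and to return the path of the downloaded file, and the date column is added here only to illustrate provenance tracking.

```r
# Hypothetical helper: fetch one file per organism and record what was
# retrieved. `fetch` defaults to biomartr::getGenome; a function with the
# same (db, organism, path) arguments, e.g. biomartr::getCDS, can be
# passed instead.
retrieve_genomes <- function(organisms, db = "refseq", path = "genomes",
                             fetch = biomartr::getGenome) {
  files <- vapply(organisms, function(org) {
    # A failed download is recorded as NA instead of aborting the loop
    result <- tryCatch(
      fetch(db = db, organism = org, path = path),
      error = function(e) NA_character_
    )
    as.character(result)[1]
  }, character(1), USE.NAMES = FALSE)

  data.frame(
    organism  = organisms,
    db        = rep(db, length(organisms)),
    file      = files,
    retrieved = rep(as.character(Sys.Date()), length(organisms)),
    stringsAsFactors = FALSE
  )
}
```

Calling retrieve_genomes(c("Homo sapiens", "Mus musculus")) would trigger the corresponding downloads and return one row per organism. Passing the retrieval function as an argument also allows the loop to be tried with a stand-in function before large downloads are started.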

4 Conclusions

The functions provided by biomartr enable fast data retrieval and functional annotation queries to prominent sequence and annotation databases such as NCBI, ENSEMBL, ENSEMBLGENOMES and BioMart. In addition, all data retrieval functions implemented in biomartr automatically archive and log the source, date, version, taxid and type of data retrieved. Thus, biomartr improves reproducibility and transparency in genomic data handling. It can be integrated easily into meta-genomic analyses.

Funding

This work was supported by a European Research Council grant named EVOBREED [grant number 322621] (to JP) and a Gatsby Fellowship [grant number AT3273/GLE] (to JP).
Conflict of Interest: none declared.

References

1. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.

Authors: Steffen Durinck; Yves Moreau; Arek Kasprzyk; Sean Davis; Bart De Moor; Alvis Brazma; Wolfgang Huber
Journal: Bioinformatics Date: 2005-08-15 Impact factor: 6.937

2. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

3. Ensembl 2016.

Authors: Andrew Yates; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Konstantinos Billis; Denise Carvalho-Silva; Carla Cummins; Peter Clapham; Stephen Fitzgerald; Laurent Gil; Carlos García Girón; Leo Gordon; Thibaut Hourlier; Sarah E Hunt; Sophie H Janacek; Nathan Johnson; Thomas Juettemann; Stephen Keenan; Ilias Lavidas; Fergal J Martin; Thomas Maurel; William McLaren; Daniel N Murphy; Rishi Nag; Michael Nuhn; Anne Parker; Mateus Patricio; Miguel Pignatelli; Matthew Rahtz; Harpreet Singh Riat; Daniel Sheppard; Kieron Taylor; Anja Thormann; Alessandro Vullo; Steven P Wilder; Amonida Zadissa; Ewan Birney; Jennifer Harrow; Matthieu Muffato; Emily Perry; Magali Ruffier; Giulietta Spudich; Stephen J Trevanion; Fiona Cunningham; Bronwen L Aken; Daniel R Zerbino; Paul Flicek
Journal: Nucleic Acids Res Date: 2015-12-19 Impact factor: 16.971

4. GenBank.

Authors: Dennis A Benson; Mark Cavanaugh; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

5. BioMart--biological queries made easy.

Authors: Damian Smedley; Syed Haider; Benoit Ballester; Richard Holland; Darin London; Gudmundur Thorisson; Arek Kasprzyk
Journal: BMC Genomics Date: 2009-01-14 Impact factor: 3.969
