Badger--an accessible genome exploration environment.

Ben Elsworth, Martin Jones, Mark Blaxter.

Abstract

SUMMARY: High-quality draft genomes are now easy to generate, as sequencing and assembly costs have dropped dramatically. However, building a user-friendly searchable Web site and database for a newly annotated genome is not straightforward. Here we present Badger, a lightweight and easy-to-install genome exploration environment designed for next generation non-model organism genomes. AVAILABILITY: Badger is released under the GPL and is available at http://badger.bio.ed.ac.uk/. We show two working examples: (i) a test dataset included with the source code, and (ii) a collection of four filarial nematode genomes. CONTACT: mark.blaxter@ed.ac.uk.

Year: 2013 PMID: 23940251 PMCID: PMC3799468 DOI: 10.1093/bioinformatics/btt466

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

High-throughput sequencing has revolutionized genome sequencing. New sequencing technologies and improved computational tools mean that a high-quality draft eukaryotic genome can be sequenced and assembled in days, on a budget accessible to most researchers. This has produced an explosion of genome projects, an increasing number of which involve multiple species or strains. Genome data and annotation must be made accessible to collaborators via a restricted environment and to the wider research community following publication. Several genome exploration environments (GEEs) exist, including ENSEMBL (Flicek et al.) and National Center for Biotechnology Information (NCBI) Genomes (http://www.ncbi.nlm.nih.gov), but these do not allow real-time updates or restrictions on data access. ENSEMBL can be used as a stand-alone GEE, as can projects like GeneDB (Logan-Klumpler et al.), but these require skills not available in groups new to genome informatics. GEEs that use the Generic Model Organism Database (GMOD) Chado database model (Mungall and Emmert), including Tripal (Ficklin et al.) and Chado on Rails (http://gmod.org/wiki/Chado_on_Rails), are similar to ENSEMBL in scope. Tripal is the most mature GMOD GEE and is widely used, but requires a dedicated software engineer to develop and design each genome instance. Here we present Badger, an easy-to-install GEE for genome scientists who need to produce a web-accessible portal for new data. A single installation of Badger can contain data from multiple species, and each species can have multiple genome drafts and gene sets.

2 FEATURES

A Badger site is managed by an administrator and has independent visibility settings (open or restricted by login) for all major sections. The home page for a Badger instance is fully customizable using a built-in text editor. It can be used to give project background and contains a blog-style news feed for up-to-date commentary. An information button is present on all pages, providing a guide to the site and page-specific information. The species overview page introduces the species included in the GEE and their relationships, including an optional phylogram. After selecting a species, the user selects a genome assembly version and a gene annotation version (multiple versions can coexist). These are shown on the genome overview page, which displays summary metrics, including an interactive chart of metrics for each individual scaffold or contig (collectively referred to as genome objects). Selecting a genome object on the chart links to a page with detailed information. For genome objects with location data (contigs, scaffolds, genes and transcripts), an embedded GBrowse (Stein et al.) instance can be used to display them in genomic context. A search tab offers four ways of searching: (i) a broad search of all annotations and relevant publications, (ii) a restricted search of particular types of annotation (BLAST similarity, domain annotation, etc.), (iii) a search for genes with a particular identifier and (iv) a detailed search of publication abstracts. Where appropriate, search results are presented as a table that allows sorting and filtering. All genome, transcript and protein sequence data can be searched using BLAST (Camacho et al.), with results presented in plain text (for BLAST output) or FASTA (for matching sequences) format. Individual gene pages offer a rich summary of metrics and annotation.

Orthology information is a powerful way of exploring cross-species similarities, or drilling down into a biological component of interest. Optionally, genes can be grouped into putative ortholog sets using OrthoMCL (Li et al.). Orthology assignment is available on the gene page, and ortholog groups can also be identified by direct annotation search or by membership criteria (for example 1:1:1 … orthology). The ortholog overview page contains an interactive chart showing the size, frequency and proportion of members in each dataset, along with alignments.
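Selecting ortholog groups by membership criteria can be illustrated with a short sketch. It assumes the plain-text group layout written by OrthoMCL (one group per line, `group_id: species|gene species|gene ...`); the function names, species codes and gene identifiers are hypothetical, and this is not Badger's own code.

```python
from collections import Counter


def parse_groups(lines):
    """Parse OrthoMCL-style lines into {group_id: [(species, gene), ...]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, _, members = line.partition(":")
        parsed = []
        for member in members.split():
            # Each member is assumed to look like "species|gene".
            species, _, gene = member.partition("|")
            parsed.append((species, gene))
        groups[group_id.strip()] = parsed
    return groups


def single_copy_groups(groups, species):
    """Return IDs of groups holding exactly one gene from each listed species
    and no genes from any other species."""
    wanted = set(species)
    selected = []
    for group_id, members in groups.items():
        counts = Counter(sp for sp, _ in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(group_id)
    return selected


# Hypothetical three-species example.
example = [
    "OG1: spA|g1 spB|g7 spC|g3",          # one gene per species
    "OG2: spA|g2 spA|g5 spB|g8 spC|g4",   # spA is duplicated
    "OG3: spA|g9 spB|g6",                 # spC is missing
]
groups = parse_groups(example)
print(single_copy_groups(groups, ["spA", "spB", "spC"]))
```

Only `OG1` satisfies the one-gene-per-species criterion here; other membership rules (for example, presence in a subset of species) could be expressed by changing the test on `counts`.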

3 IMPLEMENTATION

Badger is built using the Grails (http://grails.org) web application framework and uses the dynamic language Groovy, which runs on the Java Virtual Machine. Badger takes advantage of Grails' embedded web server and uses plug-ins to provide caching and access control. Genome and annotation data are stored in only 15 custom domain classes, making the code base easy to understand and extend. Badger uses PostgreSQL (http://www.postgresql.org/) as a persistent data store. Badger was developed on Linux and has been deployed on both Ubuntu (12.04) and CentOS (6.3) distributions. Hardware requirements will vary with the size of the dataset; we recommend a minimum of 4 GB random access memory and 2 central processing unit cores. Badger requires Grails (v2.1+), Java Development Kit (v1.6+), PostgreSQL (v8.4+) and BLAST+ (v2.2.26+). Detailed installation instructions are available at https://badger.bio.ed.ac.uk. Optionally, OrthoMCL (v2.0.4+) is required for clustering, Muscle (v3.8.31+) (Edgar, 2004) for ortholog group alignment, InterProScan (v4+) (Quevillon et al.) for domain identification and GBrowse for genome browser delivery.

Loading data into a Badger instance is straightforward (Fig. 1). First, metadata are added, including species information and sequence and annotation data file paths. The minimum data requirement is a collection of contigs in FASTA format, which will generate a basic BLAST server, a genome overview page and a sequence download tool. The full potential of the GEE is achieved with annotated gene sets. Badger accepts gene set data (GFF3 format) with corresponding transcript and protein sequences (FASTA format), and also data in BLAST XML output, InterProScan raw output and Tab Separated Value (TSV) custom annotation files. TSV import allows custom annotation types to be added without editing underlying code. Ortholog group information should be provided in OrthoMCL format.

New species can be added rapidly by taking advantage of existing annotation. Badger can import data from external resources and FASTA format genome and gene sequences. External data within a GBrowse instance can also be embedded.
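To make concrete what a FASTA file of contigs alone can yield, the following sketch computes per-contig length and GC fraction plus an assembly N50. The choice of metrics is an assumption for illustration (the text does not list which metrics Badger reports), and all function names and sequences are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the identifier.
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Largest length L such that sequences of length >= L together cover
    at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(handle):
    """Return one row of basic metrics per sequence in the FASTA input."""
    rows = []
    for name, seq in read_fasta(handle):
        seq = seq.upper()
        gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        rows.append({"id": name, "length": len(seq), "gc": gc})
    return rows


# Tiny made-up assembly with three contigs.
demo = io.StringIO(
    ">contig1 test\nACGTACGTAC\nGG\n>contig2\nATAT\n>contig3\nGGCC\nGC\n"
)
rows = summarize(demo)
for row in rows:
    print(row["id"], row["length"], round(row["gc"], 3))
print("N50:", n50([row["length"] for row in rows]))
```

Rows of this shape could feed a per-scaffold chart of the kind described for the genome overview page; a real assembly would be read from a file handle rather than an in-memory string.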

Fig. 1.

Data flow in Badger. Data requirements are split into three groups: core, optional and external. Using core data alone results in a site with a BLAST server, download facility, publication database and extensive search options. If provided, an external GBrowse instance can be embedded. Optional data from OrthoMCL and PhyloXML (Han and Zmasek, 2009) files can provide powerful gene contextualization and phylogenetic visualization.

Data files can be uploaded, edited and updated individually through the administrator interface or en masse using a script (an example is provided in the source code). Once data files are uploaded and metadata entered, Badger parses the input files, loads data into the PostgreSQL database, sources publications for the species from NCBI PubMed, creates BLAST databases, compresses FASTA files for download, generates a phylogenetic tree and catalogs ortholog data. Publication data are updated weekly. The whole process can be rerun at any time to include new data. Badger data upload is fast: on a minimum-specification machine, recreating the test dataset (100 scaffolds, 289 genes and 6615 annotations) takes <5 min, whereas recreating the four-species filarial nematode dataset (50 000 scaffolds, 50 000 genes and 700 000 annotations) takes <24 h. For computationally expensive data in overview pages, Badger makes extensive use of caching to ensure that the interface remains responsive. For full-text searching of annotations, Badger uses PostgreSQL's full-text indexing, allowing it to search around 0.5 million annotations in a few seconds.
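As an illustration of the kind of query that PostgreSQL full-text indexing supports, the sketch below builds a ranked search statement. The table `annotation` and its columns `gene_id` and `description` are hypothetical and are not Badger's schema; only the PostgreSQL functions (`to_tsvector`, `plainto_tsquery`, `ts_rank`) and the GIN index type are real.

```python
# One-off DDL: an expression index so that searches do not rescan every row.
# Table and column names are assumptions for this example.
INDEX_SQL = (
    "CREATE INDEX annotation_fts_idx ON annotation "
    "USING GIN (to_tsvector('english', description))"
)


def build_search(term, limit=50):
    """Return (sql, params) for a ranked full-text search over annotation text.

    The %s placeholders follow the DB-API 'format' parameter style used by
    drivers such as psycopg2; the values are passed separately, never
    interpolated into the SQL string.
    """
    if not term.strip():
        raise ValueError("search term must not be empty")
    sql = (
        "SELECT gene_id, description, "
        "ts_rank(to_tsvector('english', description), query) AS rank "
        "FROM annotation, plainto_tsquery('english', %s) AS query "
        "WHERE to_tsvector('english', description) @@ query "
        "ORDER BY rank DESC LIMIT %s"
    )
    return sql, (term, limit)


sql, params = build_search("zinc finger", limit=20)
print(sql)
print(params)
```

With a live connection, the statement would be run as `cursor.execute(sql, params)`. Because the `WHERE` clause uses the same `to_tsvector('english', description)` expression as the index, PostgreSQL can use the index for the match.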

4 CONCLUSION

Badger is a lightweight GEE with a gene- and annotation-centric approach capable of storing, searching and visualizing diverse genomic data. Installation is simple, and a fully functional instance can be created quickly, even by novice users. Although not a substitute for data submission to databases of record, Badger is a customizable environment for public and collaborative display of gene-centred genomic information, and will aid in preparation for publication and submission.

REFERENCES

1. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

3. OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors: Li Li; Christian J Stoeckert; David S Roos
Journal: Genome Res Date: 2003-09 Impact factor: 9.043

4. A Chado case study: an ontology-based modular schema for representing genome-associated biological information.

Authors: Christopher J Mungall; David B Emmert
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

5. GeneDB--an annotation database for pathogens.

Authors: Flora J Logan-Klumpler; Nishadi De Silva; Ulrike Boehme; Matthew B Rogers; Giles Velarde; Jacqueline A McQuillan; Tim Carver; Martin Aslett; Christian Olsen; Sandhya Subramanian; Isabelle Phan; Carol Farris; Siddhartha Mitra; Gowthaman Ramasamy; Haiming Wang; Adrian Tivey; Andrew Jackson; Robin Houston; Julian Parkhill; Matthew Holden; Omar S Harb; Brian P Brunk; Peter J Myler; David Roos; Mark Carrington; Deborah F Smith; Christiane Hertz-Fowler; Matthew Berriman
Journal: Nucleic Acids Res Date: 2011-11-23 Impact factor: 16.971

6. Tripal: a construction toolkit for online genome databases.

Authors: Stephen P Ficklin; Lacey-Anne Sanderson; Chun-Huai Cheng; Margaret E Staton; Taein Lee; Il-Hyung Cho; Sook Jung; Kirstin E Bett; Doreen Main
Journal: Database (Oxford) Date: 2011-09-29 Impact factor: 3.451

7. InterProScan: protein domains identifier.

Authors: E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. phyloXML: XML for evolutionary biology and comparative genomics.

Authors: Mira V Han; Christian M Zmasek
Journal: BMC Bioinformatics Date: 2009-10-27 Impact factor: 3.169

9. Ensembl 2013.

Authors: Paul Flicek; Ikhlak Ahmed; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Carlos García-Girón; Leo Gordon; Thibaut Hourlier; Sarah Hunt; Thomas Juettemann; Andreas K Kähäri; Stephen Keenan; Monika Komorowska; Eugene Kulesha; Ian Longden; Thomas Maurel; William M McLaren; Matthieu Muffato; Rishi Nag; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Emily Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sheppard; Daniel Sobral; Kieron Taylor; Anja Thormann; Stephen Trevanion; Simon White; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Nathan Johnson; Rhoda Kinsella; Anne Parker; Giulietta Spudich; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2012-11-30 Impact factor: 16.971

10. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169