MicrobeDB: a locally maintainable database of microbial genomic sequences.

Morgan G I Langille, Matthew R Laird, William W L Hsiao, Terry A Chiu, Jonathan A Eisen, Fiona S L Brinkman

Abstract

Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes, both from public repositories and from unpublished sources. MicrobeDB provides a foundation for such projects by automating the download of published, completed bacterial and archaeal genomes from key sources, parsing the annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy-to-use programming interface. MicrobeDB creates a simple-to-use, easy-to-maintain, centralized local resource for various large-scale comparative genomic analyses and a back end for future microbial application design. Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

Year: 2012 PMID: 22576174 PMCID: PMC3389766 DOI: 10.1093/bioinformatics/bts273

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The study of bacterial and archaeal genomes has rapidly progressed from the analysis of single genomes to comparisons between hundreds and thousands. Any biological analysis, or development of novel bioinformatic methods, that uses more than a handful of genomes requires a basic but non-trivial method for obtaining, organizing and storing this genomic information. In the past, this problem was largely limited to large-scale data providers such as IMG (Markowitz ), NCBI (Sayers ), GOLD (Pagani ) and CMR (Davidsen ). Although many of these centers provide genomic data in a variety of static formats such as GenBank and FASTA, these formats are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow their database files to be downloaded directly, but these databases are designed for large web-based infrastructures and contain numerous tables, which imposes a steep learning curve. In addition, adding unpublished genomes to these databases is often not supported.
A well-known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton ) and WormBase (Yook ). GMOD supports a variety of components such as GBrowse (Stein ) and several database schemas, including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but many researchers do not need such a complex system. For example, the Chado schema and the simpler BioSQL schema have over 130 and 20 database tables, respectively.
We propose a minimalistic system that is easy to set up, requires minimal administration for automatic updates, focuses on a lab-based setting where unpublished genomes can be easily added, and allows individual users to work with an unchanging snapshot of genomes from a given download date. To fulfill these goals, we created MicrobeDB, an open-source project that has been used in several comparative genome projects (Ho Sui ; Winstanley ) and as a back end for previously developed applications (Langille and Brinkman, 2009; Yu ).

2 FEATURES

MicrobeDB offers an easy-to-access, manageable and centralized database for microbial genomes. The main features of MicrobeDB are automated downloading of archaeal and bacterial genomes from NCBI, organized storage of the flat files, storage of annotations and genomic metadata in a MySQL database, and a Perl API for interacting with the data. A single script (which can be scheduled to run weekly, monthly, etc.) handles downloading and storing new genomes, parsing and loading the data into the MySQL database, and cleaning up any old ‘versions’ that have not been saved by individual users.
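
The clean-up step at the end of the update script can be pictured with a short Python sketch. It is not taken from the MicrobeDB code: the `Version` class, its field names and the rule that the newest download is always kept are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Version:
    """One downloaded snapshot (hypothetical field names)."""
    version_id: int
    download_date: date
    used_by: List[str] = field(default_factory=list)  # users who saved it


def versions_to_remove(versions: List[Version]) -> List[Version]:
    """Return old versions that no user has saved.

    Assumption: the most recent download is always kept, even if unsaved.
    """
    if not versions:
        return []
    newest = max(versions, key=lambda v: v.download_date)
    return [v for v in versions if v is not newest and not v.used_by]


versions = [
    Version(1, date(2011, 10, 15)),
    Version(2, date(2011, 11, 19), used_by=["Morgan", "Matthew"]),
    Version(3, date(2011, 12, 17)),
]
stale = versions_to_remove(versions)
print([v.version_id for v in stale])  # [1]
```

In this sketch version 2 survives because two users have saved it, and version 3 survives because it is the newest; only version 1 would be deleted.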

2.1 Genome data source

By default, all genomes available in the NCBI RefSeq database (Pruitt ) are downloaded using the Aspera downloader (Beloslyudtsev, 2010). Users can optionally include incomplete genomes and/or restrict the download to a subset of genomes at the genus or species level. In addition to the standard gbk format required by MicrobeDB, users may download the data in several other formats such as fna, faa and gff. After download, all genomes are decompressed into their original flat files and stored under a date-stamped central directory.
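
The date-stamped storage layout could be sketched as follows. This is only an illustration: it assumes gzip-compressed input files, and the helper names `version_directory` and `decompress_into` are hypothetical and not part of MicrobeDB.

```python
import gzip
import shutil
import tempfile
from datetime import date
from pathlib import Path


def version_directory(root: Path, download_date: date) -> Path:
    """Create and return root/YYYY-MM-DD/ for one download."""
    target = root / download_date.isoformat()
    target.mkdir(parents=True, exist_ok=True)
    return target


def decompress_into(archive: Path, target_dir: Path) -> Path:
    """Write the decompressed content of a .gz file into target_dir."""
    out_path = target_dir / archive.stem  # "x.gbk.gz" -> "x.gbk"
    with gzip.open(archive, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Demonstration with a throwaway file instead of a real RefSeq download.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    archive = root / "NC_011770.gbk.gz"
    with gzip.open(archive, "wb") as handle:
        handle.write(b"LOCUS       NC_011770\n")
    target = version_directory(root, date(2011, 12, 17))
    flat_file = decompress_into(archive, target)
    relative_path = flat_file.relative_to(root).as_posix()
    content = flat_file.read_bytes()

print(relative_path)  # 2011-12-17/NC_011770.gbk
```

Keeping one directory per download date means that flat files from different updates never overwrite each other.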

2.2 Annotation extraction and storage

The second step of each update parses annotations and metadata for each genome and stores the information in a locally installed MySQL database. Information is split into different levels of ‘objects’, including Gene (e.g. accession, start position, end position, product, name, etc.), Replicon (e.g. size, number of genes, replicon type, etc.) and Genome Project (e.g. NCBI taxon id, NCBI genome project id, GC%, habitat, pathogen, etc.) (Table 1). This information is obtained from the GenBank-formatted files for each genome, taken from NCBI metadata tables, or derived computationally (e.g. gene counts, GC%, etc.). Additionally, a simplified version of the NCBI taxonomy is stored for each genome and is associated with each Genome Project object. The MicrobeDB schema is easily extended so that users can add their own custom data fields if needed (e.g. SNP positions, regulatory elements, etc.).
The MySQL database can be accessed using any MySQL client or through the Perl API supplied with MicrobeDB. The API provides simple querying and retrieval of information in the MySQL database from within the user's own applications, without the need to write SQL queries. In addition, many free graphical interfaces for interacting with MySQL databases do not require programming skills, including web-based tools such as phpMyAdmin (http://phpmyadmin.net) and local desktop clients such as MySQL Workbench (http://www.mysql.com/products/workbench/).
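
To make the object levels and the kind of cross-genome query they enable more concrete, the sketch below builds a three-table layout and runs one join. It is an assumption-laden illustration, not the MicrobeDB schema: Python's built-in sqlite3 stands in for MySQL, and all table and column names are hypothetical. The inserted values are the example values listed in Table 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a local MySQL server
conn.executescript("""
CREATE TABLE genome_project (
    project_id    INTEGER PRIMARY KEY,
    organism_name TEXT,
    ncbi_taxon_id INTEGER,   -- may stay NULL for unpublished genomes
    gc_percent    REAL
);
CREATE TABLE replicon (
    replicon_id   INTEGER PRIMARY KEY,
    project_id    INTEGER REFERENCES genome_project(project_id),
    replicon_type TEXT,
    accession     TEXT,
    size_bp       INTEGER
);
CREATE TABLE gene (
    gene_id       INTEGER PRIMARY KEY,
    replicon_id   INTEGER REFERENCES replicon(replicon_id),
    locus_id      TEXT,
    gene_name     TEXT,
    start_pos     INTEGER,
    end_pos       INTEGER
);
""")
conn.execute(
    "INSERT INTO genome_project VALUES (1, 'Pseudomonas aeruginosa LESB58', 557722, 66.3)"
)
conn.execute("INSERT INTO replicon VALUES (1, 1, 'chromosome', 'NC_011770', 6601757)")
conn.execute("INSERT INTO gene VALUES (1, 1, 'PLES_00001', 'dnaA', 483, 2027)")

# Find every gene with a given name, across all genomes, with its length in bp.
rows = conn.execute(
    """
    SELECT gp.organism_name, r.accession, g.locus_id,
           g.end_pos - g.start_pos + 1 AS length_bp
    FROM gene AS g
    JOIN replicon AS r ON r.replicon_id = g.replicon_id
    JOIN genome_project AS gp ON gp.project_id = r.project_id
    WHERE g.gene_name = ?
    """,
    ("dnaA",),
).fetchall()
print(rows)
# [('Pseudomonas aeruginosa LESB58', 'NC_011770', 'PLES_00001', 1545)]
```

With many genomes loaded, the same query would return one row per matching gene, which is the kind of question that is awkward to answer from flat files alone.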

Table 1. Annotations stored in the MicrobeDB database

Table/object^a	Field descriptions^a	Example
Genome project	Organism name	Pseudomonas aeruginosa LESB58
	NCBI taxon ID	557722
	Genome size (Mb)	6.6
	Pathogenic in	Human
	GC %	66.3
	Oxygen requirements	Aerobic
Replicon	Replicon type	Chromosome
	Accession (RefSeq)	NC_011770
	Replicon size (bp)	6601757
	Number of genes	6027
	Replicon sequence	TTTAAAGAG…
Gene	Gene type	CDS
	Locus ID	PLES_00001
	Start position	483
	End position	2027
	Gene name	dnaA
	Product	chromosomal replication initiation
	DNA sequence	GTGTCCGT…
	Protein sequence	MSVELWQQ…
Version	Download date	2011-12-17
	Flat file directory	/share/genomes/2011-12-17/
	Used by	Morgan, Matthew

^a Not all fields and tables in MicrobeDB are listed.
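
Some fields in Table 1, such as GC %, are derived from the sequence rather than read from the annotation. A minimal sketch of such a calculation is given below; the function name and the choice to ignore ambiguous bases are assumptions, and the input is only the short sequence prefix shown in the table, so the result is not the GC content of the actual gene.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases A, C, G and T."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # empty or fully ambiguous sequence
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


print(gc_percent("GTGTCCGT"))  # 62.5
```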


2.3 Unpublished genomes

Unpublished genomes (those not in NCBI) can be loaded into MicrobeDB by placing their GenBank-formatted files into a directory and running a single script. MicrobeDB does not annotate genomes or create GenBank files, but many programs are available for producing these files, such as RAST (Aziz ) or Artemis (Carver ). NCBI-specific metadata that is not available for unpublished genomes is simply left as blank fields in MicrobeDB without affecting functionality.
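
As a rough picture of the first thing a directory-based loader has to do, the sketch below lists the GenBank files in a directory and reads the name and length from each LOCUS line. It is not the MicrobeDB loading script: the function name, the `.gbk` extension filter and the example file content are all made up for illustration, and features are not parsed.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_genbank_records(directory: Path) -> List[Tuple[str, str, int]]:
    """Return (file name, locus name, length in bp) for each *.gbk file.

    Only the first LOCUS line of each file is read; features are ignored.
    """
    records = []
    for path in sorted(directory.glob("*.gbk")):
        with open(path) as handle:
            for line in handle:
                if line.startswith("LOCUS"):
                    fields = line.split()
                    records.append((path.name, fields[1], int(fields[2])))
                    break
    return records


# Demonstration with a made-up file standing in for an unpublished genome.
with tempfile.TemporaryDirectory() as tmp:
    genome_dir = Path(tmp)
    (genome_dir / "my_isolate.gbk").write_text(
        "LOCUS       my_isolate_chr1      5200000 bp    DNA     circular BCT 01-JAN-2012\n"
    )
    (genome_dir / "notes.txt").write_text("not a GenBank file\n")
    found = list_genbank_records(genome_dir)

print(found)  # [('my_isolate.gbk', 'my_isolate_chr1', 5200000)]
```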

2.4 Stable versions of genomes

MicrobeDB keeps each update as a separate ‘version’. This allows users to save and work on a particular snapshot of genomes, knowing that the underlying dataset remains consistent. Each MicrobeDB version has an associated download date, and users can save a version until their research is complete. Old unsaved versions are automatically removed after each update is completed to save storage space.
Overall, MicrobeDB provides support for researchers who require a manageable local organization of bacterial and archaeal genomes, either for large comparative genome projects or for constructing new bioinformatic applications.

Funding: This work was supported by the Canadian Institutes of Health Research, the Michael Smith Foundation for Health Research, Genome Canada and the Gordon and Betty Moore Foundation.

Conflict of Interest: none declared.

REFERENCES

1. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes.

Authors: Nancy Y Yu; James R Wagner; Matthew R Laird; Gabor Melli; Sébastien Rey; Raymond Lo; Phuong Dao; S Cenk Sahinalp; Martin Ester; Leonard J Foster; Fiona S L Brinkman
Journal: Bioinformatics Date: 2010-05-13 Impact factor: 6.937

3. Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa.

Authors: Craig Winstanley; Morgan G I Langille; Joanne L Fothergill; Irena Kukavica-Ibrulj; Catherine Paradis-Bleau; François Sanschagrin; Nicholas R Thomson; Geoff L Winsor; Michael A Quail; Nicola Lennard; Alexandra Bignell; Louise Clarke; Kathy Seeger; David Saunders; David Harris; Julian Parkhill; Robert E W Hancock; Fiona S L Brinkman; Roger C Levesque
Journal: Genome Res Date: 2008-12-01 Impact factor: 9.043

4. IMG: the Integrated Microbial Genomes database and comparative analysis system.

Authors: Victor M Markowitz; I-Min A Chen; Krishna Palaniappan; Ken Chu; Ernest Szeto; Yuri Grechkin; Anna Ratner; Biju Jacob; Jinghua Huang; Peter Williams; Marcel Huntemann; Iain Anderson; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2012-01 Impact factor: 16.971

5. WormBase 2012: more genomes, more data, new website.

Authors: Karen Yook; Todd W Harris; Tamberlyn Bieri; Abigail Cabunoc; Juancarlos Chan; Wen J Chen; Paul Davis; Norie de la Cruz; Adrian Duong; Ruihua Fang; Uma Ganesan; Christian Grove; Kevin Howe; Snehalata Kadam; Ranjana Kishore; Raymond Lee; Yuling Li; Hans-Michael Muller; Cecilia Nakamura; Bill Nash; Philip Ozersky; Michael Paulini; Daniela Raciti; Arun Rangarajan; Gary Schindelman; Xiaoqi Shi; Erich M Schwarz; Mary Ann Tuli; Kimberly Van Auken; Daniel Wang; Xiaodong Wang; Gary Williams; Jonathan Hodgkin; Matthew Berriman; Richard Durbin; Paul Kersey; John Spieth; Lincoln Stein; Paul W Sternberg
Journal: Nucleic Acids Res Date: 2011-11-08 Impact factor: 16.971

6. IslandViewer: an integrated interface for computational identification and visualization of genomic islands.

Authors: Morgan G I Langille; Fiona S L Brinkman
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

7. The association of virulence factors with genomic islands.

Authors: Shannan J Ho Sui; Amber Fedynak; William W L Hsiao; Morgan G I Langille; Fiona S L Brinkman
Journal: PLoS One Date: 2009-12-01 Impact factor: 3.240

8. The comprehensive microbial resource.

Authors: Tanja Davidsen; Erin Beck; Anuradha Ganapathy; Robert Montgomery; Nikhat Zafar; Qi Yang; Ramana Madupu; Phil Goetz; Kevin Galinsky; Owen White; Granger Sutton
Journal: Nucleic Acids Res Date: 2009-11-05 Impact factor: 16.971

9. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

Authors: Tim Carver; Matthew Berriman; Adrian Tivey; Chinmay Patel; Ulrike Böhme; Barclay G Barrell; Julian Parkhill; Marie-Adèle Rajandream
Journal: Bioinformatics Date: 2008-10-09 Impact factor: 6.937

10. The RAST Server: rapid annotations using subsystems technology.

Authors: Ramy K Aziz; Daniela Bartels; Aaron A Best; Matthew DeJongh; Terrence Disz; Robert A Edwards; Kevin Formsma; Svetlana Gerdes; Elizabeth M Glass; Michael Kubal; Folker Meyer; Gary J Olsen; Robert Olson; Andrei L Osterman; Ross A Overbeek; Leslie K McNeil; Daniel Paarmann; Tobias Paczian; Bruce Parrello; Gordon D Pusch; Claudia Reich; Rick Stevens; Olga Vassieva; Veronika Vonstein; Andreas Wilke; Olga Zagnitko
Journal: BMC Genomics Date: 2008-02-08 Impact factor: 3.969
