DIYA: a bacterial annotation pipeline for any genomics lab.

Andrew C Stewart, Brian Osborne, Timothy D Read.

Abstract

DIYA (Do-It-Yourself Annotator) is a modular and configurable open source pipeline, written in Perl, for the rapid annotation of bacterial genome sequences. The software currently takes DNA contigs as input, either complete genomes or the result of shotgun sequencing, and produces an annotated sequence in Genbank file format as output.

AVAILABILITY: Distribution and source code are available at https://sourceforge.net/projects/diyg/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2009 PMID: 19254921 PMCID: PMC2660880 DOI: 10.1093/bioinformatics/btp097

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Genome annotation is the process of embellishing raw DNA sequences with predictions of features such as genes and transcription factor binding sites. These assignments are necessary to identify important gene functions and to enable comparative analysis. There are now several web-based servers that allow anonymous public submission of bacterial genomes for high-quality automated annotation [e.g. Almeida et al., 2004; Aziz et al., 2008; Davila et al., 2005; Van Domselaar et al., 2005; IGS Annotation Engine (http://ae.igs.umaryland.edu/cgi/ae_pipeline_outline.cgi)]. However, fewer open source tools have been developed that allow users to run microbial annotation pipelines at their own site.

In recent years there has been a revolution in genome sequencing, allowing rapid shotgun draft sequence production of microbial genomes overnight (Mardis, 2008). These technologies have created many new small ‘genome centers’ (Zwick, 2005). DIYA (Do-It-Yourself Annotator) arose out of our group's desire to examine annotated microbial genomes on our own servers as soon as possible after generating the raw sequence data on Roche/454 Sequencing GS-FLX instruments. The essential properties of the required program were that it:

- Accepts as input either randomly ordered DNA contigs or complete genomes in fasta format, or can download contigs from the NCBI based on Genome project ID.
- Uses open source annotation programs and biological databases.
- Is relatively straightforward to configure.
- Can be installed on a wide range of hardware.
- Is modular, allowing for extension and customization of the pipeline.
- Outputs common file formats.

2 METHODS AND RESULTS

DIYA is written in object-oriented Perl and uses the Bioperl library (Stajich et al., 2002) for sequence conversion and annotation. Installation and configuration of DIYA require basic knowledge of Perl and XML. Each DIYA component is tested on installation.

All DIYA pipelines are composed of steps that are executed in a specific order, and each step is either a ‘parser’ step or a ‘script’ step. A single configuration file is edited to change the order of the steps, or to add and delete steps. The pipeline is controlled by a master Perl module, diya.pm, which reads the configuration file and executes each step in the pipeline, passing the output from one step to the next step as input. The full DIYA pipeline is executed at the command line using a simple Perl script which calls methods in diya.pm. Many genomes can be annotated simultaneously by running in batch mode using the Sun Grid Engine (SGE) scheduler.

Every DIYA parser step has a bioinformatics application that analyzes the sequence and produces output. A corresponding Perl module parses that output and performs an action (for instance, creating an annotated Genbank file). A script step is simpler than a parser step, and may do something like move a file, create a database or send an alert.

An example pipeline is outlined below. This pipeline could be modified by adding other steps. For example, DIYA comes with code and configuration files for the identification of non-coding RNAs, tRNAs (using tRNAscan-SE; Lowe and Eddy, 1997), and for performing Blast or RPS Blast analysis (Altschul et al., 1990) of coding regions extracted from Genbank files. Currently, gene product names are based on the simplistic scheme of transferring annotation from the best Blast or RPS Blast match. This is appropriate for many end uses of the information, such as resequencing or first-pass annotation, but could be extended in the future by more sophisticated product-naming parsers.
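The parser and script step model described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in DIYA's Perl, and every name in it (Step, run_pipeline, toy_gene_finder, add_cds_features) is hypothetical; it does not reproduce diya.pm or its configuration format. The gene finder is a placeholder that returns made-up coordinates instead of calling an external program.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One pipeline step. All names are illustrative, not DIYA's API."""
    name: str
    kind: str                          # "parser" or "script"
    run: Callable[[dict], object]      # stands in for an external program or script
    parse: Optional[Callable[[dict, object], dict]] = None  # parser steps only


def run_pipeline(steps: List[Step], record: dict) -> dict:
    """Run the steps in order, handing each step's result to the next."""
    for step in steps:
        output = step.run(record)
        if step.kind == "parser":
            if step.parse is None:
                raise ValueError(f"parser step {step.name!r} has no parse function")
            record = step.parse(record, output)
        elif step.kind == "script":
            # Script steps act for their side effect; the record passes through.
            pass
        else:
            raise ValueError(f"unknown step kind: {step.kind!r}")
    return record


def toy_gene_finder(record: dict) -> str:
    """Placeholder for an external gene finder; returns made-up coordinates."""
    return "orf1 1 90\norf2 120 300\n"


def add_cds_features(record: dict, raw: str) -> dict:
    """Turn 'id start end' lines into CDS features on a copy of the record."""
    features = list(record.get("features", []))
    for line in raw.splitlines():
        name, start, end = line.split()
        features.append({"type": "CDS", "id": name,
                         "start": int(start), "end": int(end)})
    return {**record, "features": features}


log: List[str] = []

steps = [
    Step("genefinder", "parser", toy_gene_finder, add_cds_features),
    Step("notify", "script",
         lambda rec: log.append(f"{len(rec['features'])} features")),
]

annotated = run_pipeline(steps, {"id": "contig1", "features": []})
print(annotated["features"])
print(log)
```

Reordering, adding or deleting entries in the steps list plays the role that editing the configuration file plays in DIYA.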
1. Download a Genbank genome sequence using its Entrez ID. The nucleotide sequences for an NCBI project are downloaded as a fasta format file.
2. Create a Genbank file. The unordered contigs in the downloaded fasta file are assembled into a ‘pseudo-contig’ in Genbank format.
3. Gene finding using Glimmer3 (Salzberg, 1998). This is a parser step that runs Glimmer3 and parses its output using the g3-from-scratch.csh script. Glimmer3 is commonly used for identification of coding regions in microbial genomes (Delcher et al., 2002). The output of this step is a reannotated Genbank file containing coding regions predicted by Glimmer3.

The outputs of the analyses are Genbank files containing annotations. This file format is very familiar to biologists and can be used in a wide variety of commercial and open source software. We routinely use the Genbank output files as input to General Feature Format (GFF) converters, and the results are visualized in the open source genome browser, GBrowse (Stein et al., 2002).

A DIYA script called gbconvert creates ASN.1 (Abstract Syntax Notation One) format files for easy submission of the genome project to the National Center for Biotechnology Information (NCBI). Information on eight genomes submitted to NCBI via the DIYA pipeline can be found in Supplementary Table 1. The gbconvert script contains more than 100 text-formatting rules derived from interaction with NCBI staff.
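The second step of the example pipeline, joining unordered contigs into a single pseudo-contig, can be sketched as follows. The text does not say how DIYA joins the contigs, so the run of N characters used as a spacer here is an assumption, as are the function names read_fasta and build_pseudo_contig. The sketch is in Python, not DIYA's Perl, and it stops at the joined sequence and contig coordinates rather than writing a Genbank file.

```python
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse fasta text into (name, sequence) pairs, keeping file order."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first fasta header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def build_pseudo_contig(contigs: List[Tuple[str, str]],
                        spacer: str = "N" * 10) -> Tuple[str, List[Tuple[str, int, int]]]:
    """Join contigs with a spacer (an assumption) and record 1-based spans."""
    parts: List[str] = []
    spans: List[Tuple[str, int, int]] = []
    position = 0
    for index, (name, seq) in enumerate(contigs):
        if index > 0:
            parts.append(spacer)
            position += len(spacer)
        start = position + 1
        position += len(seq)
        spans.append((name, start, position))
        parts.append(seq)
    return "".join(parts), spans


fasta_text = ">c1\nACGT\nAC\n>c2 example contig\nGGG\n"
sequence, spans = build_pseudo_contig(read_fasta(fasta_text.splitlines()),
                                      spacer="NNN")
print(sequence)
print(spans)
```

Recording the span of each contig keeps a way to map a feature predicted on the pseudo-contig back to the contig it came from.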

3 CONCLUSIONS

DIYA as currently implemented is a lightweight microbial annotation pipeline producing data suitable for rapid visualization of bacterial genomes. We have used DIYA to annotate more than 50 bacterial genomes to date (including the Yersinia genomes listed in Supplementary Table 1) as a basis for large-scale comparative analysis. Since the program can be installed locally, the user controls how often, and with what priority, jobs are run. Local control is also important if the user is concerned about posting preliminary data on hard drives outside their institution. The recently published PIPA (Protein Identification Pipeline) software (Yu et al., 2008) automates the querying of multiple databases and organizes the output; it can accept the Genbank output file from the DIYA pipeline as input.

An alternative use of DIYA is to reannotate genomes already submitted to NCBI. This can be done by simply supplying the Entrez genome project database ID to DIYA, which will download and annotate all associated DNA molecules.

We plan to gradually make more modules available. Functions we are looking to add to the DIYA pipeline include software for detection of prophages, CRISPR elements (Sorek et al., 2008) and pseudogenes. In the future we plan to integrate DIYA into a virtual appliance for easy deployment across everything from laboratory workstations to cloud computing facilities.

REFERENCES (10 of 14 listed)

1. Fast algorithms for large-scale genome alignment and comparison.

Authors: Arthur L Delcher; Adam Phillippy; Jane Carlton; Steven L Salzberg
Journal: Nucleic Acids Res Date: 2002-06-01 Impact factor: 16.971

2. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

3. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

4. A System for Automated Bacterial (genome) Integrated Annotation--SABIA.

Authors: Luiz G P Almeida; Roger Paixão; Rangel C Souza; Gisele C da Costa; Frank J A Barrientos; M Trindade dos Santos; Darcy F de Almeida; Ana Tereza R Vasconcelos
Journal: Bioinformatics Date: 2004-04-15 Impact factor: 6.937

5. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

6. A genome sequencing center in every lab.

Authors: Michael E Zwick
Journal: Eur J Hum Genet Date: 2005-11 Impact factor: 4.246

7. The impact of next-generation sequencing technology on genetics. (Review)

Authors: Elaine R Mardis
Journal: Trends Genet Date: 2008-02-11 Impact factor: 11.639

8. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.

Authors: T M Lowe; S R Eddy
Journal: Nucleic Acids Res Date: 1997-03-01 Impact factor: 16.971

9. BASys: a web server for automated bacterial genome annotation.

Authors: Gary H Van Domselaar; Paul Stothard; Savita Shrivastava; Joseph A Cruz; AnChi Guo; Xiaoli Dong; Paul Lu; Duane Szafron; Russ Greiner; David S Wishart
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. The RAST Server: rapid annotations using subsystems technology.

Authors: Ramy K Aziz; Daniela Bartels; Aaron A Best; Matthew DeJongh; Terrence Disz; Robert A Edwards; Kevin Formsma; Svetlana Gerdes; Elizabeth M Glass; Michael Kubal; Folker Meyer; Gary J Olsen; Robert Olson; Andrei L Osterman; Ross A Overbeek; Leslie K McNeil; Daniel Paarmann; Tobias Paczian; Bruce Parrello; Gordon D Pusch; Claudia Reich; Rick Stevens; Olga Vassieva; Veronika Vonstein; Andreas Wilke; Olga Zagnitko
Journal: BMC Genomics Date: 2008-02-08 Impact factor: 3.969
