Carol L Ecale Zhou (1), Stephanie Malfatti (1), Jeffrey Kimbrel (1), Casandra Philipson (2, 3), Katelyn McNair (4), Theron Hamilton (2), Robert Edwards (4), Brian Souza (1)
1. Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA, USA.
2. Biological Defense Research Directorate, Naval Medical Research Center, Fort Detrick, MD, USA.
3. Chemical and Biological Research, Defense Threat Reduction Agency, CA, USA.
4. Computational Sciences Research Center, San Diego State University, CA, USA.
Abstract
SUMMARY: To address the need for improved phage annotation tools that scale, we created an automated throughput annotation pipeline: multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE). multiPhATE is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. This tool incorporates a de novo phage gene calling algorithm and assigns putative functions to gene calls using protein-, virus- and phage-centric databases. multiPhATE's modular construction allows the user to implement all or any portion of the analyses by acquiring local instances of the desired databases and specifying the desired analyses in a configuration file. We demonstrate multiPhATE by annotating two newly sequenced Yersinia pestis phage genomes. Within multiPhATE, the PhATE processing pipeline can be readily implemented across multiple processors, making it adaptable for throughput sequencing projects. Software documentation assists the user in configuring the system.

AVAILABILITY AND IMPLEMENTATION: multiPhATE was implemented in Python 3.7 and runs as a command-line code under Linux or Unix. multiPhATE is freely available under an open-source BSD3 license from https://github.com/carolzhou/multiPhATE. Instructions for acquiring the databases and third-party codes used by multiPhATE are included in the distribution README file. Users may report bugs by submitting them to the GitHub issues page associated with the multiPhATE distribution.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
A bacteriophage (also known as ‘phage’) is a virus that parasitizes a bacterium by infecting it and reproducing within it. This work was motivated by a need to increase the throughput potential for describing newly sequenced phage genomes. Global pathogen discovery efforts, such as The Global Virome Project (Carroll et al., 2018), are projected to invest billions of dollars to support surveillance projects that characterize the earth’s virosphere over the next 10 years. PhagesDB already contains >13 000 phage genomes (Russell and Hatfull, 2017). Phage therapy has resurfaced as a method to combat antimicrobial resistance, and upcoming clinical trials necessitate complete sequencing and characterization of therapeutic candidates. High-quality gene calling and functional annotation are vital for successful genomic comparison studies and for discovery of new phage-based therapeutic leads (Kutter et al.). Because annotation of phage genomes is a relatively new science, few bioinformatics pipelines for phage analysis exist that can be readily adapted for use in phage research efforts. Currently, researchers typically apply bacterial gene callers to annotate phage DNA and follow up with largely manual analyses using web forms; integrating the summary results can be time consuming. Although several codes exist for identifying prophage sequences in bacterial genomes (Arndt et al., 2016; Kang et al.; Roux et al.; and others), once these sequences have been identified, they are typically annotated using methods developed for sequences from other taxa (Perkel, 2017; Seemann, 2014). Currently there exists only one automated annotation pipeline specifically for phage: Philipson et al. (2018) describe a pipeline that identifies features in phage that determine their potential suitability as therapeutic reagents. However, there remains a need for an automated phage annotation pipeline that can be readily implemented on multiple nodes of a local server and that requires minimal software development expertise.
To address this need, we present the multiple-genome Phage Annotation Toolkit and Evaluator (multiPhATE) automated high-throughput phage annotation pipeline.
2 Description
The PhATE annotation pipeline incorporates up to four gene callers, as selected by the user: GeneMarkS (Lomsadze et al.), Glimmer (Delcher et al.), Prodigal (Hyatt et al.) and a novel phage-centric gene caller, PHANOTATE (McNair et al.). Functional annotation is achieved by Basic Local Alignment Search Tool (BLAST) and Hidden Markov Model (HMM) searches for homologous sequences in protein- and phage-centric databases. The PhATE workflow is depicted in Supplementary File ‘phate_Fig_1_PhATE_Workflow.pdf’.
2.1 Input
Input to multiPhATE consists of a configuration file that specifies a list of genomes to be processed by PhATE and a set of parameters controlling software execution. The user specifies the names of phage genome fasta files, the names of output subdirectories and other metadata pertaining to the genomes being analyzed. The user also specifies the following optional analyses: (i) the gene caller(s) to be run; (ii) the gene caller to use for subsequent annotation (default: PHANOTATE); (iii) the BLAST parameters; (iv) the BLAST databases to be searched; (v) whether the HMM search is switched on or off. It is possible to run PhATE using any or all of the specified gene callers, databases and searches. In this way, installation can be achieved one gene caller or database at a time, with stepwise testing. The user can also switch individual searches (e.g. NR) on or off to control execution time, which may be useful when performing preliminary annotation of large numbers of sequences. Although multiPhATE is intended for phage sequence annotation, it would be reasonable to run multiPhATE with bacterial genomes to assist identification of embedded phage sequence.
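The options listed above can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, default values and database label are hypothetical and are not multiPhATE's actual configuration keys, which are documented in the distribution README. It shows the kind of consistency check that such a configuration implies, for example that the gene caller chosen for annotation is among those being run.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names throughout: these are NOT multiPhATE's real
# configuration keys; consult the distribution README for the actual format.
KNOWN_GENE_CALLERS = {"genemarks", "glimmer", "prodigal", "phanotate"}


@dataclass
class GenomeEntry:
    fasta_file: str          # name of the phage genome fasta file
    output_subdir: str       # name of the output subdirectory
    genome_name: str = ""    # free-form metadata about the genome


@dataclass
class RunOptions:
    gene_callers: List[str] = field(default_factory=lambda: ["phanotate"])
    primary_caller: str = "phanotate"   # caller whose genes are annotated
    blast_identity_min: float = 60.0    # illustrative BLAST parameter
    blast_databases: List[str] = field(default_factory=list)
    run_hmm_search: bool = False

    def validate(self) -> None:
        """Raise ValueError if the selected options are inconsistent."""
        unknown = set(self.gene_callers) - KNOWN_GENE_CALLERS
        if unknown:
            raise ValueError(f"unknown gene caller(s): {sorted(unknown)}")
        if self.primary_caller not in self.gene_callers:
            raise ValueError("primary caller must be among the callers run")


# Example: two gene callers, one (hypothetically named) BLAST database.
genomes = [GenomeEntry("phage1.fasta", "phage1_out", "example phage 1")]
options = RunOptions(
    gene_callers=["phanotate", "prodigal"],
    blast_databases=["ncbi_virus_protein"],
)
options.validate()
```

Keeping each analysis as an independent switch mirrors the stepwise installation described above: a database or gene caller that is not yet installed is simply left out of the lists.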
2.2 Annotation
PhATE begins by performing gene calling using the selected gene caller(s). When two or more are invoked, PhATE outputs a summary table showing a side-by-side comparison of the gene calls, plus summary statistics regarding the numbers and lengths of gene calls for each algorithm, and the numbers of calls in common and unique to each. Next, PhATE uses the BLAST+ programs (Camacho et al.) blastn and blastp, and the HMM search program jackhmmer (Johnson et al.), to identify homologs of the input genome and its predicted gene and peptide sequences using several databases: National Center for Biotechnology Information (NCBI) virus genomes, NCBI RefSeq proteins, NCBI RefSeq genes, NCBI virus proteins and the Non-Redundant protein sequence database (NR) (NCBI Resource Coordinators, 2016), as well as Swiss-Prot (Bairoch and Apweiler, 2000), Phage Annotation Tools and Methods (PhAnToMe) (www.phantome.org), a virus subset of Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al.) and a fasta sequence dataset derived using the database of phage Virus Orthologous Groups (pVOG) identifiers (Grazziotin et al.). The latter dataset is modified to contain the pVOG identifiers in the fasta headers, by means of scripts included in the multiPhATE distribution.
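The gene-call comparison described above can be illustrated with a short sketch. It is not PhATE's implementation: the representation of a gene call as a (start, end, strand) tuple, the function name and the exact-coordinate matching rule are all assumptions made for illustration, and a real comparison may match calls more leniently.

```python
from statistics import mean
from typing import Dict, Set, Tuple

# A gene call is modeled as (start, end, strand). This representation is an
# assumption for illustration, not PhATE's internal data structure.
GeneCall = Tuple[int, int, str]


def compare_gene_calls(calls: Dict[str, Set[GeneCall]]) -> Dict[str, object]:
    """Summarize calls per caller: count, mean length, shared and unique."""
    all_sets = list(calls.values())
    # Calls reported identically by every caller.
    common = set.intersection(*all_sets) if all_sets else set()
    summary = {"common_to_all": len(common), "per_caller": {}}
    for caller, genes in calls.items():
        # Union of the calls made by every other caller.
        others = set().union(*(g for c, g in calls.items() if c != caller))
        lengths = [abs(end - start) + 1 for start, end, _ in genes]
        summary["per_caller"][caller] = {
            "count": len(genes),
            "mean_length": mean(lengths) if lengths else 0,
            "unique": len(genes - others),  # calls no other caller made
        }
    return summary


# Made-up coordinates, purely to exercise the function.
example_calls = {
    "phanotate": {(1, 300, "+"), (350, 800, "+"), (900, 1200, "-")},
    "prodigal": {(1, 300, "+"), (350, 800, "+")},
    "glimmer": {(1, 300, "+"), (400, 800, "+")},
}
result = compare_gene_calls(example_calls)
```

In this toy input one call is shared by all three callers, while the hypothetical PHANOTATE and Glimmer sets each contain one call that no other caller reports.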
2.3 Output
PhATE generates the following files and directories: (i) output from the gene-call algorithms and the gene-call comparison (Supplementary Material ‘phate_P2_CGC.pdf’); (ii) gene and translated peptide fasta files; (iii) combined-annotation summary files; (iv) directories containing raw BLAST outputs for genome and peptide BLAST runs; (v) directories with raw HMM search outputs for peptide searches; (vi) alignment-ready fasta files containing each predicted peptide plus the members of each identified pVOG family to which a peptide may be assigned and (vii) log files. BLAST and HMM raw data outputs can be saved or cleaned from the output directories (see README). We demonstrate application of multiPhATE to the annotation of two newly sequenced Yersinia pestis phage genomes (see Supplementary Material ‘phate_results.pdf’).
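Output (vi) above, the alignment-ready fasta file, can be pictured with the following minimal sketch. The function name, the record representation and the header strings are hypothetical and do not reproduce the headers written by PhATE; the sketch only shows the idea of placing a predicted peptide ahead of the members of a matching pVOG family so that the file can be passed directly to a multiple-sequence aligner.

```python
from typing import List, Tuple

# A fasta record is modeled as (header without '>', sequence); this is an
# assumption for illustration only.
FastaRecord = Tuple[str, str]


def alignment_ready_fasta(peptide: FastaRecord,
                          pvog_members: List[FastaRecord],
                          width: int = 60) -> str:
    """Return fasta text: the peptide first, then its pVOG family members."""
    lines = []
    for header, seq in [peptide] + pvog_members:
        lines.append(f">{header}")
        # Wrap each sequence at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical headers and toy sequences.
fasta_text = alignment_ready_fasta(
    ("predicted_peptide_1", "MKTAYIAKQR"),
    [("VOG0001|member_a", "MKSAYIAKQR"), ("VOG0001|member_b", "MRTAYLAKQR")],
)
```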
Authors: Dennis Carroll; Peter Daszak; Nathan D Wolfe; George F Gao; Carlos M Morel; Subhash Morzaria; Ariel Pablos-Méndez; Oyewale Tomori; Jonna A K Mazet Journal: Science Date: 2018-02-23 Impact factor: 47.728
Authors: David Arndt; Jason R Grant; Ana Marcu; Tanvir Sajed; Allison Pon; Yongjie Liang; David S Wishart Journal: Nucleic Acids Res Date: 2016-05-03 Impact factor: 16.971
Authors: Casandra W Philipson; Logan J Voegtly; Matthew R Lueder; Kyle A Long; Gregory K Rice; Kenneth G Frey; Biswajit Biswas; Regina Z Cer; Theron Hamilton; Kimberly A Bishop-Lilly Journal: Viruses Date: 2018-04-10 Impact factor: 5.048
Authors: Carol L Ecale Zhou; Jeffrey Kimbrel; Robert Edwards; Katelyn McNair; Brian A Souza; Stephanie Malfatti Journal: G3 (Bethesda) Date: 2021-05-07 Impact factor: 3.154
Authors: Jolene Ramsey; Helena Rasche; Cory Maughmer; Anthony Criscione; Eleni Mijalis; Mei Liu; James C Hu; Ry Young; Jason J Gill Journal: PLoS Comput Biol Date: 2020-11-02 Impact factor: 4.475