Jolene Ramsey1,2, Helena Rasche1,2, Cory Maughmer1,2, Anthony Criscione1,2, Eleni Mijalis1,2, Mei Liu1,2, James C Hu1,2, Ry Young1,2, Jason J Gill1,3.
Abstract
In the modern genomic era, scientists without extensive bioinformatic training need to apply high-power computational analyses to critical tasks like phage genome annotation. At the Center for Phage Technology (CPT), we developed a suite of phage-oriented tools housed in open, user-friendly web-based interfaces. A Galaxy platform conducts computationally intensive analyses and Apollo, a collaborative genome annotation editor, visualizes the results of these analyses. The collection includes open source applications such as the BLAST+ suite, InterProScan, and several gene callers, as well as unique tools developed at the CPT that allow maximum user flexibility. We describe in detail programs for finding Shine-Dalgarno sequences, resources used for confident identification of lysis genes such as spanins, and methods used for identifying interrupted genes that contain frameshifts or introns. At the CPT, genome annotation is separated into two robust segments that are facilitated through the automated execution of many tools chained together in an operation called a workflow. First, the structural annotation workflow results in gene and other feature calls. This is followed by a functional annotation workflow that combines sequence comparisons and conserved domain searching, which is contextualized to allow integrated evidence assessment in functional prediction. Finally, we describe a workflow used for comparative genomics. Using this multi-purpose platform enables researchers to easily and accurately annotate an entire phage genome. The portal can be accessed at https://cpt.tamu.edu/galaxy-pub with accompanying user training material.
Year: 2020 PMID: 33137082 PMCID: PMC7660901 DOI: 10.1371/journal.pcbi.1008214
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. The Galaxy and Web Apollo interface used for analyses and annotation.
By coupling the Galaxy platform for analyses with the editing capabilities in Apollo, contextualized evidence can be used to iteratively annotate genomes as a team effort.
Fig 2. The structural workflow chains together tools in Galaxy for gene calling.
The structural workflow accepts a nucleotide FASTA genome sequence as input and processes it through ARAGORN for tRNAs (dark noodles), Glimmer and MetaGeneAnnotator for high-confidence gene predictions (golden and teal, respectively), and Get ORFs as a naïve ORF/CDS caller (magenta). Potential protein-coding genes are filtered to ensure the presence of a phage start codon (ATG, GTG, or TTG), and a Shine-Dalgarno feature is added to every gene call that has a detectable match. The results are converted between formats, and the gene models are corrected for display in Apollo.
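The start-codon filter and Shine-Dalgarno step described in this caption can be illustrated with a minimal Python sketch. This is not the CPT tool itself: the function names, the motif list (substrings of the AGGAGGT consensus), and the 3 to 14 nt spacer range are all assumptions chosen for illustration, and only the forward strand is handled.

```python
# Illustrative sketch only; not the CPT implementation.
PHAGE_START_CODONS = {"ATG", "GTG", "TTG"}  # start codons named in the caption

# Assumed motif list: substrings of the consensus AGGAGGT, longest first.
SD_MOTIFS = ["AGGAGGT", "GGAGGT", "AGGAGG", "GGAGG", "AGGAG", "GAGG", "GGAG", "AGGA"]


def has_phage_start(cds_seq):
    """Return True if the CDS begins with ATG, GTG, or TTG."""
    return cds_seq[:3].upper() in PHAGE_START_CODONS


def find_shine_dalgarno(genome, cds_start, min_spacer=3, max_spacer=14):
    """Search upstream of a forward-strand CDS for a Shine-Dalgarno motif.

    cds_start is the 0-based index of the first base of the start codon.
    The spacer is the number of bases between the motif end and the start
    codon; the default range is an assumption, not a published parameter.
    Returns (motif_start, motif_end, motif) with a half-open interval,
    or None if no motif is found.
    """
    genome = genome.upper()
    for motif in SD_MOTIFS:  # prefer longer motifs
        for spacer in range(min_spacer, max_spacer + 1):
            end = cds_start - spacer
            begin = end - len(motif)
            if begin < 0:
                continue
            if genome[begin:end] == motif:
                return (begin, end, motif)
    return None


if __name__ == "__main__":
    toy_genome = "CCCCAGGAGGCAACATATGAAATAA"  # made-up sequence
    print(has_phage_start(toy_genome[16:]))
    print(find_shine_dalgarno(toy_genome, 16))
```

A production tool would also need to handle reverse-strand genes and may score motifs differently; the sketch only shows the shape of the filter-then-annotate logic.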
Fig 3. The functional workflow links tools in Galaxy used for functional prediction.
Inputs for the functional workflow are gene calls paired with their genome. These are piped through five sub-paths within the analysis. 1) The BLASTn path uses the full genomic nucleotide sequence (light blue noodles). 2) The BLASTp path compares proteins against curated (UniProtKB SwissProt) and sequence-inclusive (NCBI nonredundant (nr)) databases (dark blue noodles). 3) The search for interrupted genes, such as those containing introns, compiles separate CDS hits to the same protein (yellow noodles). 4) A directed search for spanin proteins uses TMHMM, lipobox finding (with LipoP and a motif search), and BLASTp against a curated database (magenta and pink noodles). 5) Domain analysis plots comprehensive TMHMM outputs and InterProScan results for conserved domains and signatures (green and purple noodles, respectively).
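The idea behind sub-path 3, compiling separate CDS hits to the same protein, can be sketched in Python as a grouping step. The input layout (tuples of CDS ID, subject protein ID, and subject start/end coordinates) and the function name are hypothetical; the actual CPT tool may apply additional criteria such as genomic adjacency or coverage thresholds.

```python
from collections import defaultdict


def candidate_interrupted_genes(hits):
    """Group BLASTp hits by subject protein and flag shared subjects.

    hits: iterable of (cds_id, subject_id, subject_start, subject_end).
    This tuple layout is an assumption made for the sketch.
    Returns {subject_id: [cds_id, ...]} for every subject protein hit by
    two or more distinct CDSs, with CDS IDs ordered by where they align
    on the subject.
    """
    by_subject = defaultdict(list)
    for cds_id, subject_id, s_start, s_end in hits:
        by_subject[subject_id].append((s_start, s_end, cds_id))

    candidates = {}
    for subject_id, segments in by_subject.items():
        ordered = []
        for _, _, cds_id in sorted(segments):
            if cds_id not in ordered:  # keep first occurrence only
                ordered.append(cds_id)
        if len(ordered) >= 2:
            candidates[subject_id] = ordered
    return candidates


if __name__ == "__main__":
    toy_hits = [  # made-up example data
        ("cds_2", "P1", 118, 300),
        ("cds_1", "P1", 1, 120),
        ("cds_3", "P2", 1, 200),
    ]
    print(candidate_interrupted_genes(toy_hits))
```

Two CDSs that align to different parts of one database protein are only candidates for a frameshifted or intron-containing gene; each still needs manual inspection in its genomic context.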
Fig 4. The comparative genomics workflow calculates nucleotide and protein similarity to other phages.
The protein comparison branch starts with a BLASTp job against the nr database, restricted to all phage TaxIDs (see full list in S3 Table), and the results are then sorted according to the organisms with the highest number of unique protein hits (teal noodles). The nucleotide comparison branch begins with a BLASTn job against the nt database, also restricted to all phage TaxIDs. Top nucleotide hits are sorted by Dice score, which accounts for the total coverage. The top five genome sequences are fetched from NCBI, concatenated with the query genome, and routed to MIST v3 for a dot plot and to progressiveMauve for calculation of pairwise percent identity (magenta noodles).
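The two ranking steps in this caption can be sketched in Python. The caption does not give the Dice formula, so the sketch assumes the standard Dice coefficient applied to aligned bases, 2 × aligned / (query length + subject length); the function names and the input layouts are likewise hypothetical, and "unique protein hits" is interpreted here as the number of distinct query proteins with a hit to each organism.

```python
from collections import defaultdict


def dice_score(aligned_bases, query_len, subject_len):
    """Standard Dice coefficient over aligned bases (assumed formula).

    Dividing by both genome lengths penalizes hits that cover only a
    small part of either genome.
    """
    return 2.0 * aligned_bases / (query_len + subject_len)


def rank_by_unique_protein_hits(hits):
    """Rank organisms by the number of distinct query proteins hitting them.

    hits: iterable of (query_protein_id, organism_name); this layout is an
    assumption. Ties are broken alphabetically by organism name.
    """
    proteins_per_organism = defaultdict(set)
    for query_id, organism in hits:
        proteins_per_organism[organism].add(query_id)
    counts = [(org, len(ids)) for org, ids in proteins_per_organism.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(dice_score(40000, 50000, 50000))
    toy_hits = [  # made-up example data
        ("gp1", "Phage A"), ("gp2", "Phage A"),
        ("gp1", "Phage A"), ("gp1", "Phage B"),
    ]
    print(rank_by_unique_protein_hits(toy_hits))
```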