DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication.

Yasuhiro Tanizawa¹, Takatomo Fujisawa¹, Yasukazu Nakamura¹.

Abstract

Summary: We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST started as an on-line annotation server, and over 7000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 min, with rich information such as pseudogenes, translation exceptions and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will facilitate extensions for new functions and incorporation of new tools in the future.

Availability and implementation: The software is implemented in Python and runs in both Python 2.7 and 3.4 on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license, with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Contact: yn@nig.ac.jp.

Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 29106469 PMCID: PMC5860143 DOI: 10.1093/bioinformatics/btx713

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Most scientific journals require newly obtained sequence data to be deposited in the International Nucleotide Sequence Database Collaboration (INSDC) as a condition of publication (Cochrane et al.). However, submission of annotated genomes to public databases remains a burden for researchers. The NCBI provides an annotation service called Prokaryotic Genome Annotation Pipeline (PGAP) (Tatusova et al.) incorporated in its submission system, but it is only available to GenBank submitters. The on-line server Microbial Genome Annotation Pipeline (MiGAP) (Sugawara et al.) partly supports DDBJ submission; however, its output requires extensive manual revision. To address these issues, we recently developed a web-based pipeline called DDBJ Fast Annotation and Submission Tool (DFAST), which aims to help users submit their genomes to DDBJ (Tanizawa et al.). The original version of DFAST employs the lightweight command-line program Prokka (Seemann, 2014) as an annotation engine, combined with curated reference databases and a graphical user interface to create submission files for DDBJ. Here, we report a new implementation of the background engine of DFAST, which is called DFAST-core to differentiate it from the web version. The new version features unique functions, such as pseudogene annotation and orthologous assignment between reference genomes. DFAST-core is also available as a standalone program, providing a flexible local annotation platform. Hereinafter, we simply refer to it as DFAST in this report.

2 Materials and methods

DFAST accepts a FASTA-formatted file as a minimum required input, and users can customize parameters, tools and reference databases by providing command line options or defining an original configuration file (see Supplementary Notes for more details). The workflow is mainly composed of two annotation phases, i.e. structural annotation for predicting biological features such as CDSs, RNAs and CRISPRs, and functional annotation for inferring protein functions of predicted CDSs. Figure 1 shows a schematic depiction of the pipeline. Each annotation process is implemented as a module with common interfaces, allowing both flexible annotation workflows and extensions for new functions in the future.
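As a rough illustration of what "modules with common interfaces" can mean in practice, the sketch below chains a structural step and two functional steps through a single shared method. It is a minimal sketch and not DFAST's actual code: the names `Feature`, `AnnotationModule`, `ToyCdsPredictor`, `ToyFunctionAssigner` and `run_pipeline` are all hypothetical, and the rule that an earlier module's assignment is not overwritten by a later one is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Feature:
    """A predicted feature; later modules add annotations to `qualifiers`."""
    feature_id: str
    kind: str  # e.g. "CDS", "tRNA", "CRISPR"
    qualifiers: Dict[str, str] = field(default_factory=dict)


class AnnotationModule(Protocol):
    """Hypothetical common interface: every module takes and returns features."""

    def run(self, features: List[Feature]) -> List[Feature]:
        ...


class ToyCdsPredictor:
    """Stand-in for a structural annotation step: emits placeholder CDSs."""

    def __init__(self, n_cds: int) -> None:
        self.n_cds = n_cds

    def run(self, features: List[Feature]) -> List[Feature]:
        new = [Feature(f"cds_{i}", "CDS") for i in range(1, self.n_cds + 1)]
        return features + new


class ToyFunctionAssigner:
    """Stand-in for a functional annotation step backed by a lookup table."""

    def __init__(self, lookup: Dict[str, str]) -> None:
        self.lookup = lookup

    def run(self, features: List[Feature]) -> List[Feature]:
        for feat in features:
            # Assumption: a CDS annotated by an earlier module is left alone.
            if feat.kind == "CDS" and "product" not in feat.qualifiers:
                product = self.lookup.get(feat.feature_id)
                if product is not None:
                    feat.qualifiers["product"] = product
        return features


def run_pipeline(modules: List[AnnotationModule],
                 features: Optional[List[Feature]] = None) -> List[Feature]:
    """Run the modules in order, passing the feature list along."""
    current = list(features or [])
    for module in modules:
        current = module.run(current)
    return current
```

With this shape, reordering the workflow or adding a new tool amounts to editing the list passed to `run_pipeline`, which is the kind of flexibility the modular design is meant to provide.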

Fig. 1

DFAST annotation workflow. Items marked with asterisks are included in the default workflow

In the default configuration, functional annotation is processed in the following order:

1. Orthologous assignment (optional). All-against-all pairwise protein alignments are conducted between the query and each reference genome. Orthologous genes are identified based on a Reciprocal-Best-Hit approach. Self-to-self alignments are also conducted within the query genome, in which genes scoring higher than their corresponding orthologs are considered in-paralogs and are assigned the same protein function. This process is effective in transferring annotations from closely related organisms and in reducing running time.

2. Homology search against the default reference database. DFAST uses GHOSTX as the default aligner, which runs tens to hundreds of times faster than BLASTP with similar levels of sensitivity where E-values are less than 10⁻⁶ (Suzuki et al.). Users can also choose BLASTP. For accurate annotation, we constructed a reference database from 124 well-curated prokaryotic genomes from public databases. See Supplementary Data for the breakdown of the database.

3. Pseudogene detection. CDSs and their flanking regions are re-aligned to their subject protein sequences using LAST, which allows frameshift alignment (Kiełbasa et al.). When stop codons or frameshifts are found in the flanking regions, the query is marked as a possible pseudogene. This step also detects translation exceptions such as selenocysteine and pyrrolysine.

4. Profile HMM database search against TIGRFAM (Haft et al.). This step uses hmmscan from the HMMER software package.

5. Assignment of COG functional categories. RPS-BLAST and the rpsbproc utility are used to search against the Clusters of Orthologous Groups (COG) database provided by the NCBI Conserved Domain Database (Marchler-Bauer et al.).

DFAST output files include INSDC submission files as well as standard GFF3, GenBank and FASTA files. For GenBank submission, two input files for the tbl2asn program are generated: a feature table (.tbl) and a sequence file (.fsa). For DDBJ submission, DFAST generates the submission files required by the DDBJ Mass Submission System (MSS) (Mashima et al.). In particular, if additional metadata such as contact and reference information are supplied, it can generate fully qualified files that are ready for submission to MSS.

While the workflow described above is fully customizable in the stand-alone version, only limited features are currently available in the web version; for example, orthologous assignment is not available. As a merit of the web version, users can curate the assigned protein names using an on-line annotation editor with easy access to the NCBI BLAST web service. We also offer optional databases for specific organism groups (Escherichia coli, lactic acid bacteria, bifidobacteria and cyanobacteria). They are downloadable from our web site and can be used in the stand-alone version. We are updating the reference databases to cover more diverse organisms.
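The Reciprocal-Best-Hit idea behind the orthologous assignment step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not DFAST's implementation: alignment results are assumed to be already parsed into `(query_id, subject_id, score)` tuples, the function names are hypothetical, and the in-paralog rule shown is only one reading of "genes scoring higher than their corresponding orthologs".

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, alignment_score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """For each query, keep the subject with the highest score (first wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(query_vs_ref: Iterable[Hit],
                         ref_vs_query: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Return query gene -> (reference gene, score) for reciprocal best hits."""
    forward = best_hits(query_vs_ref)
    reverse = best_hits(ref_vs_query)
    orthologs: Dict[str, Tuple[str, float]] = {}
    for query, (ref, score) in forward.items():
        # A pair is kept only if the reference gene's best hit points back.
        if ref in reverse and reverse[ref][0] == query:
            orthologs[query] = (ref, score)
    return orthologs


def in_paralogs(orthologs: Dict[str, Tuple[str, float]],
                self_hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each in-paralog to the ortholog-bearing gene it resembles.

    Assumed rule: gene B is an in-paralog of gene A when A has an ortholog,
    B has none, and the A-vs-B self-alignment scores higher than A's
    alignment to its ortholog.
    """
    paralogs: Dict[str, str] = {}
    for gene_a, gene_b, score in self_hits:
        if gene_a == gene_b or gene_a not in orthologs or gene_b in orthologs:
            continue
        if score > orthologs[gene_a][1]:
            paralogs[gene_b] = gene_a
    return paralogs
```

In this sketch, a gene returned by `in_paralogs` would inherit the protein function assigned to its partner, so it would not need a separate database search in later steps.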

3 Results and discussion

We annotated the genome of Escherichia coli O26:H11 str. 11368 using DFAST, Prokka and MiGAP, and compared the results to the INSDC data manually curated by the original submitters (deposited in the NCBI Assembly Database under GCA_000091005.1) and the RefSeq data annotated using PGAP (GCF_000091005.1), as summarized in Table 1.

Table 1.

Comparison of annotation results of E. coli O26:H11 str. 11368

Data source/Annotation tool	INSDC^a	RefSeq^b	DFAST	Prokka	MiGAP
Total CDS	5795	6243	5740	5759	5721
Pseudogene^c	276	337 (250/87)	344 (158/186)	[30^d]	—
Selenoprotein	3	1	3	—	—
With COG number	—	—	3965	—	4392
Unknown function	1203	1514	1347	2068	418
tRNA	101	101	105	105	100
rRNA	22	22	22	22	22
CRISPR array	—	2	2	2	—
Running time	—	—	3 m 27 s	3 m 20 s	4 h 43 m

Note: Numbers represent annotated features and running time. DFAST and Prokka were run on a 4-core Macintosh laptop with default settings.

^a Original annotation by submitters (GCA_000091005.1).

^b Annotated by PGAP (GCF_000091005.1).

^c Numbers in parentheses denote internal stop codon/frameshift and partial genes, respectively.

^d Candidates for pseudogenes are mentioned in the log file, not in the result.

Our simple strategy to find pseudogenes depends on the accuracy of the reference databases. However, when references from close relatives are available, DFAST outperforms other tools. Among the 158 CDSs in which internal stop codons or frameshifts were identified, 123 were consistent with the INSDC data (78%). Although the comparison is not straightforward because annotation formats differ, 97 of the 250 identified by PGAP were consistent (39%). Notably, DFAST succeeded in annotating all 3 selenoproteins present in the query genome.

Another major advantage of our pipeline is its speed. The running time of DFAST is comparable with that of Prokka, yet the default reference database of DFAST (417 922 sequences in total) is more than 20 times larger than that of Prokka (18 276 sequences). This is mostly attributable to the efficient algorithm of GHOSTX. If BLASTP is used instead, the running time increases to up to 40 min under the same conditions. In accordance with the database size, the number of genes with an assigned function was larger than with Prokka, although smaller than with MiGAP, which searches against a more comprehensive database such as UniProtKB/TrEMBL.

In general, DFAST performs well with the default settings on well-characterized organisms, such as Actinobacteria, Firmicutes and Proteobacteria. The annotation of genomes from less-studied species, for which references from close relatives are not present in the default database, may contain a relatively large number of uncharacterized genes. In such cases, providing additional references will improve the results, as demonstrated in Supplementary Notes.
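The percentages and the database size ratio quoted in this section can be rechecked from the counts given in the text and in Table 1. The short script below is only an arithmetic check; the helper name `percent` and the variable names are illustrative.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Pseudogene calls consistent with the submitter-curated INSDC annotation.
dfast_consistency = percent(123, 158)   # DFAST: internal stop codon/frameshift
pgap_consistency = percent(97, 250)     # PGAP: internal stop codon/frameshift

# Default reference database sizes (number of sequences).
db_ratio = 417_922 / 18_276             # DFAST versus Prokka

# Pseudogene totals in Table 1 are the sum of the two bracketed counts.
dfast_total = 158 + 186
refseq_total = 250 + 87

print(f"DFAST consistency: {dfast_consistency:.1f}%")   # 77.8%
print(f"PGAP consistency:  {pgap_consistency:.1f}%")    # 38.8%
print(f"Database ratio:    {db_ratio:.1f}x")            # 22.9x
print(f"Pseudogene totals: {dfast_total}, {refseq_total}")  # 344, 337
```

The rounded values agree with the 78% and 39% reported above and with the pseudogene totals of 344 and 337 listed in Table 1.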

References

1. Adaptive seeds tame genomic sequence comparison.

Authors: Szymon M Kiełbasa; Raymond Wan; Kengo Sato; Paul Horton; Martin C Frith
Journal: Genome Res Date: 2011-01-05 Impact factor: 9.043

2. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

3. The International Nucleotide Sequence Database Collaboration.

Authors: Guy Cochrane; Ilene Karsch-Mizrachi; Toshihisa Takagi
Journal: Nucleic Acids Res Date: 2015-12-10 Impact factor: 16.971

4. DFAST and DAGA: web-based integrated genome annotation tools and resources.

Authors: Yasuhiro Tanizawa; Takatomo Fujisawa; Eli Kaminuma; Yasukazu Nakamura; Masanori Arita
Journal: Biosci Microbiota Food Health Date: 2016-07-14

5. DNA Data Bank of Japan.

Authors: Jun Mashima; Yuichi Kodama; Takatomo Fujisawa; Toshiaki Katayama; Yoshihiro Okuda; Eli Kaminuma; Osamu Ogasawara; Kousaku Okubo; Yasukazu Nakamura; Toshihisa Takagi
Journal: Nucleic Acids Res Date: 2016-10-24 Impact factor: 16.971

6. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Authors: Aron Marchler-Bauer; Yu Bo; Lianyi Han; Jane He; Christopher J Lanczycki; Shennan Lu; Farideh Chitsaz; Myra K Derbyshire; Renata C Geer; Noreen R Gonzales; Marc Gwadz; David I Hurwitz; Fu Lu; Gabriele H Marchler; James S Song; Narmada Thanki; Zhouxi Wang; Roxanne A Yamashita; Dachuan Zhang; Chanjuan Zheng; Lewis Y Geer; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

7. TIGRFAMs and Genome Properties in 2013.

Authors: Daniel H Haft; Jeremy D Selengut; Roland A Richter; Derek Harkins; Malay K Basu; Erin Beck
Journal: Nucleic Acids Res Date: 2012-11-28 Impact factor: 16.971

8. GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array.

Authors: Shuji Suzuki; Masanori Kakuta; Takashi Ishida; Yutaka Akiyama
Journal: PLoS One Date: 2014-08-06 Impact factor: 3.240

9. NCBI prokaryotic genome annotation pipeline.

Authors: Tatiana Tatusova; Michael DiCuccio; Azat Badretdin; Vyacheslav Chetvernin; Eric P Nawrocki; Leonid Zaslavsky; Alexandre Lomsadze; Kim D Pruitt; Mark Borodovsky; James Ostell
Journal: Nucleic Acids Res Date: 2016-06-24 Impact factor: 16.971
