
Genome Annotation Generator: a simple tool for generating and correcting WGS annotation tables for NCBI submission.

Scott M Geib¹, Brian Hall², Theodore Derego¹, Forest T Bremer², Kyle Cannoles^1,3, Sheina B Sim¹.

Abstract

Background: One of the most overlooked, yet critical, components of a whole genome sequencing (WGS) project is the submission and curation of the data to a genomic repository, most commonly the National Center for Biotechnology Information (NCBI). While large genome centers or genome groups have developed software tools for post-annotation assembly filtering, annotation, and conversion into the NCBI's annotation table format, these tools typically require back-end setup and connection to a Structured Query Language (SQL) database and/or some knowledge of programming (Perl, Python) to implement. With WGS becoming commonplace, genome sequencing projects are moving away from the genome centers and into the ecology or biology lab, where fewer resources are present to support the process of genome assembly curation. To fill this gap, we developed software to assess, filter, and transfer annotation and convert a draft genome assembly and annotation set into the NCBI annotation table (.tbl) format, facilitating submission to the NCBI Genome Assembly database. This software has no dependencies, is compatible across platforms, and utilizes a simple command to perform a variety of simple and complex post-analysis, pre-NCBI submission WGS project tasks.
Findings: The Genome Annotation Generator is a consistent and user-friendly bioinformatics tool that can be used to generate a .tbl file that is consistent with the NCBI submission pipeline.
Conclusions: The Genome Annotation Generator achieves the goal of providing a publicly available tool that will facilitate the submission of annotated genome assemblies to the NCBI. It is useful for any individual researcher or research group that wishes to submit a genome assembly of their study system to the NCBI.

Year: 2018 PMID: 29635297 PMCID: PMC5887294 DOI: 10.1093/gigascience/giy018

Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524

Introduction

While ever-improving sequencing technology and assembly software enable the collection of raw sequences for genome assembly and structural annotation, further steps need to be taken to ensure the quality and completeness of a whole genome sequencing (WGS) project for submission to the National Center for Biotechnology Information (NCBI) or other data repositories [1]. To submit a genome to the NCBI for curation, it must be converted to the NCBI annotation table format (.tbl). With a genome assembly project consisting of thousands of sequences demarcated by hundreds of thousands of structural annotations, this task clearly requires automation. However, there is currently no freely available tool that performs rapid and controlled conversion of a genome assembly and associated structural annotations into a .tbl format in addition to allowing for editing, modification, and revision of the content of the project. Moreover, the typical assembly and draft annotation contains some degree of questionable or erroneous data that requires correction or omission. It may also be desirable to add functional annotations to the submission and integrate results from InterProScan, Basic Local Alignment Search Tool (BLAST) homology to curated databases, or ontology terms generated by other tools [2-4]. The traditional approach used to address these problems is to use Linux command line tools or write custom scripts that modify and filter the genome using a scripting language such as Perl or Python [5-7] or large-scale genomic database systems [8]. This method may not be easily or readily reproducible or it may be entirely beyond the ability of an investigator who has less familiarity with generating custom scripts de novo. Even among those researchers who use best practices to write clean, well-tested, and reusable scripts to accomplish these tasks, doing so requires a large amount of duplicated effort. 
For this reason, the Genome Annotation Generator (GAG) was written to provide a straightforward and consistent tool for addressing the most common errors in genome assemblies, adding functional annotations from disparate sources, and producing an NCBI submission-ready annotation .tbl file. In addition, the software provides a means for integrating existing functional annotations and marking annotations that require manual curation or review. All of these tasks are done through an intuitive command line program that requires only basic Unix skills and has no required dependencies or packages. GAG facilitates the submission of WGS projects to NCBI and provides a standardized utility and workflow that fosters consistency between projects. Due to emerging genome sequencing initiatives such as the “5,000 Insect Genomes Initiative” (i5k), the Plant Genome Initiative, and Genome 10K (G10K) [9,10], many independent research groups that are not specialized in genome annotation and analysis are generating large genomic datasets and performing genome sequencing projects within their labs. This program can assist in ensuring quality and consistency of data for new genome biologists.

Overview

GAG is a command line program written in Python 2.7 that requires no additional outside programs or packages to run. The user directs the program to the genome .fasta file and a .gff3 file containing structural annotations. A number of options can then be used to fix possible errors, flag or remove features (i.e., genomic elements described in the .gff3 structural annotation file) based on selected criteria, add functional annotations, trim regions out of the assembly, and, of course, write the genome to the NCBI .tbl file format. Changes made to the genome annotation, functional annotations added, and flags requesting manual review are also written back to the .gff3 structural annotation file, and the original .fasta file is corrected as needed. When the user issues commands to modify the genome, e.g., to remove short introns, the statistics will display 2 columns, representing the original and modified genomes. This allows stepwise, documented filtering and review, and supports interaction between GAG and visual genome review tools (e.g., Artemis, Apollo, GBrowse, JBrowse) [11-15].
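To make the target format concrete, the following is a minimal sketch of how features could be rendered as an NCBI 5-column feature table. It is written in Python 3 for illustration and is not GAG's own code; the function name `feature_table_lines`, the dictionary keys, and the example locus tag are hypothetical, and a real submission needs further qualifiers (for example protein and transcript identifiers) that are omitted here.

```python
def feature_table_lines(seq_id, features):
    """Render features as lines of an NCBI 5-column feature table.

    `features` is a list of dicts with hypothetical keys:
    key (e.g. "gene", "CDS"), strand ("+" or "-"),
    intervals (1-based inclusive (start, end) pairs),
    and qualifiers (list of (name, value) pairs).
    """
    lines = [">Feature " + seq_id]
    for feat in features:
        intervals = sorted(feat["intervals"])
        if feat["strand"] == "-":
            # Minus-strand intervals are listed in transcript order,
            # each written with start greater than stop.
            intervals = [(end, start) for start, end in reversed(intervals)]
        for i, (left, right) in enumerate(intervals):
            row = [str(left), str(right)]
            if i == 0:
                # Only the first interval of a feature carries the feature key.
                row.append(feat["key"])
            lines.append("\t".join(row))
        for name, value in feat["qualifiers"]:
            # Qualifier rows leave the first three columns empty.
            lines.append("\t\t\t" + name + "\t" + value)
    return lines


example = [
    {"key": "gene", "strand": "+", "intervals": [(100, 900)],
     "qualifiers": [("locus_tag", "ABC_0001")]},
    {"key": "CDS", "strand": "+", "intervals": [(100, 300), (500, 900)],
     "qualifiers": [("product", "hypothetical protein")]},
]
print("\n".join(feature_table_lines("scaffold_1", example)))
```

The nesting of intervals and qualifiers under each feature is what makes hand conversion from GFF3 error-prone at genome scale.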

Methods

As an example, we consider a possible workflow for a user wishing to prepare a genome for submission to the NCBI Eukaryotic WGS Database. The user has a scaffolded genome assembly produced by one of many whole genome assemblers [16-18] in .fasta file format and a corresponding GFF3 feature file [19, 20] containing structural annotations resulting from an automated annotation pipeline or predictors such as Maker, Evidence Modeler, Jigsaw, and others [21-28]. The user would first generate functional annotations of predicted genes, if desired, using whatever approach they prefer, and then supply the genome and annotations to GAG. After using GAG to remove or flag features of interest, the user may further investigate flagged features in a genome browser by loading the output of GAG, editing, and then performing further filtering in GAG, iterating through this process until a final draft genome product is generated. Finally, GAG writes an NCBI table file, on which tbl2asn is run for submission to the NCBI. This step may identify regions of the genome that need to be trimmed due to possible adapter contamination or low-quality sequence. Any errors reported by tbl2asn can then be corrected in GAG and the genome trimmed, until an error-free submission is generated.
To use GAG, the user creates a folder containing the genome files (or links to them) and runs gag.py from the terminal with the .fasta and .gff3 files. GAG will write a statistics file containing information on the number of each feature type, lengths, and other information that may be useful for the submitter. In our experience, automated genome annotation software frequently produces annotations containing introns as short as 1 base pair (bp) long; if any such features are present, GAG can be run to detect them.
It is important to note that while the NCBI requires short introns to be removed, cutoffs recommended by NCBI may be more stringent than the user wants, as they are set to reduce the amount of erroneous data being entered into NCBI. For example, predicted single-base introns might not be errors and may represent true data. It is up to the user to decide which cutoffs to apply when removing features or flagging them for manual review. To address these short introns, the user simply applies option -ris (REMOVE_INTRONS_SHORTER_THAN) with a value of 10 bp. GAG will discard any mRNA containing an intron shorter than the 10 bp minimum. A comparison of the genome content before and after removal is printed to the .stats file. If the user instead wishes only to flag features that meet these criteria and not remove them, the -fis (FLAG_INTRONS_SHORTER_THAN) option can be used, which adds a GAG_FLAG annotation to the attributes column of the .gff3 file describing the reason for flagging, allowing manual review of flagged features in a genome browser. GAG will automatically update all parent and child features (gene or coding sequence (CDS) entries) to reflect removal of mRNA features. Available flag or removal options are listed in Table 1.
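The logic of a minimum-intron-length filter can be sketched in a few lines. This is an illustrative Python 3 sketch, not GAG's implementation: it assumes introns are inferred as the gaps between consecutive exons of an mRNA, and the function names and mRNA identifiers are hypothetical.

```python
def intron_lengths(exons):
    """Return intron lengths implied by 1-based inclusive exon (start, end) pairs."""
    ordered = sorted(exons)
    # An intron spans the bases strictly between two neighbouring exons.
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def split_by_min_intron(mrnas, min_length):
    """Partition mRNAs into (kept, rejected) using a minimum intron length.

    `mrnas` maps a hypothetical mRNA ID to its list of exon intervals.
    """
    kept, rejected = {}, {}
    for mrna_id, exons in mrnas.items():
        has_short = any(n < min_length for n in intron_lengths(exons))
        (rejected if has_short else kept)[mrna_id] = exons
    return kept, rejected


mrnas = {
    "mRNA1": [(100, 200), (202, 400)],  # 1 bp intron
    "mRNA2": [(100, 200), (300, 400)],  # 99 bp intron
    "mRNA3": [(50, 500)],               # single exon, no introns
}
kept, rejected = split_by_min_intron(mrnas, 10)
print(sorted(kept), sorted(rejected))
```

Returning the rejected set rather than discarding it mirrors the two behaviours described above: the same partition can drive either removal or flagging for manual review.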

Table 1:

Options for GAG

Option	Type of function	Description
-a <annotation file>	Annotate	Adds functional annotations present in annotation file to .gff and .tbl
-t <.bed file>	Trim	Removes regions of genome indicated in .bed file from .fasta and .gff3
-fix_start_stop <no value>	Fix	Adds or corrects start and stop codon features to .gff3
-fix_terminal_ns <no value>	Fix	Removes any terminal Ns from contig ends in assembly, updates .gff3 coordinates
-rcs <integer>	Remove	Remove CDS shorter than <integer>
-rcl <integer>	Remove	Remove CDS longer than <integer>
-res <integer>	Remove	Remove exons shorter than <integer>
-rel <integer>	Remove	Remove exons longer than <integer>
-ris <integer>	Remove	Remove introns shorter than <integer>
-ril <integer>	Remove	Remove introns longer than <integer>
-rgs <integer>	Remove	Remove genes shorter than <integer>
-rgl <integer>	Remove	Remove genes longer than <integer>
-fcs <integer>	Flag	Flag CDS shorter than <integer>
-fcl <integer>	Flag	Flag CDS longer than <integer>
-fes <integer>	Flag	Flag exons shorter than <integer>
-fel <integer>	Flag	Flag exons longer than <integer>
-fis <integer>	Flag	Flag introns shorter than <integer>
-fil <integer>	Flag	Flag introns longer than <integer>
-fgs <integer>	Flag	Flag genes shorter than <integer>
-fgl <integer>	Flag	Flag genes longer than <integer>

Another requirement for submission might be that all coding regions meet a minimum length. For this example we use 150 bp, which is suggested by NCBI [29,30]. To add this additional level of filtering, a second option can be used: -rcs 150, to REMOVE_CDS_SHORTER_THAN 150 bp. When the genome is written to the output folder, GAG will write a file called genome.removed.gff containing all the features left out of the final version. It is important to remember that a CDS cutoff of 150 bp may remove some biologically correct short coding sequences.
GAG supports 2 straightforward correction, or fix, tools. If the user’s GFF3 file does not explicitly indicate the presence of start and stop codons, or if there is reason to believe there are errors in ORF prediction in the provided GFF3 file, GAG can add start and stop codon features to the GFF3 file. The user simply issues the command with the option -fix_start_stop and these features will be added to the GFF3 file and their existence noted in the table file. A second issue that can arise in a draft genome assembly is for a contig or scaffold to have a string of ambiguous bases (Ns) at the very beginning or end. These should be removed from the assembly, as they can be misinterpreted as scaffold gaps; this can be done using the -fix_terminal_ns option. Removing these regions from the genome will disrupt the parity between coordinates in the .fasta genome file and the .gff3 annotation file, so GAG will automatically update coordinates in the .gff3 file to reflect any regions removed from the sequence file.
During execution of tbl2asn or submission to NCBI, regions of the genome may be identified as contaminated with microbial, vector, or sequencing adapter sequence as part of the “contaminate screen” step. A .bed formatted file listing regions of the assembly to exclude, either ranges within a contig or scaffold or an entire scaffold, can be supplied with the -trim option.
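The coordinate bookkeeping behind the terminal-N fix described above can be illustrated with a short sketch. This is Python 3 written for illustration only, with hypothetical function names; it is not taken from GAG. It assumes 1-based inclusive feature coordinates, as in GFF3, and that no feature overlaps the stripped Ns.

```python
def strip_terminal_ns(sequence):
    """Return (trimmed_sequence, number_of_bases_removed_from_the_start)."""
    without_leading = sequence.lstrip("Nn")
    offset = len(sequence) - len(without_leading)
    # Trailing Ns are dropped too, but they do not move any feature.
    return without_leading.rstrip("Nn"), offset


def shift_feature(start, end, offset):
    """Shift 1-based inclusive coordinates left by `offset` bases."""
    return start - offset, end - offset


seq = "NNNNACGTACGTNN"
trimmed, offset = strip_terminal_ns(seq)
shifted = shift_feature(5, 8, offset)
print(trimmed, offset, shifted)
```

Only Ns removed from the start of a sequence shift downstream features; Ns removed from the end merely shorten the sequence.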
GAG will update both the .fasta and .gff3 files so that coordinates are still synchronized. This is a particularly difficult operation to perform without a specialized tool.
At present, GAG has simple commands to remove or flag introns, exons, coding regions, and genes based on minimum or maximum lengths, which will also edit or remove any parent or child feature from the annotation file so as not to create incomplete feature annotations. It can also remove features from a list, which is useful in cases where a genome submission is rejected and a list of invalid mRNAs and genes is provided. In addition, all discarded features are retained in a “genome.removed.gff” file, and the entire editing session is documented so that the user can retain the filtering criteria used on the particular dataset.
GAG supports 2 methods to add functional annotations to a genome. First, it can read an annotated GFF3 file containing gene names, protein products, cross-references to databases, and ontology terms following GFF3 qualified nomenclature in the attribute column of the GFF3 file [31-34]. Any annotations present will be automatically carried over to the NCBI feature table file. For users with annotations from another source, GAG can read them from a simple tab-delimited file. The annotations supported by the current version of GAG are Name (for genes), Dbxref, Ontology_term, and product (for descriptive mRNA products). These are also written to a new GFF3 file, so GAG can also be used as a tool to functionally annotate a GFF3 file.
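As an illustration of the second method, the sketch below merges annotations from a tab-delimited table into the attributes column of a GFF3 line. It is a Python 3 sketch, not GAG's code, and it rests on assumptions: the table is taken to have three columns (feature ID, annotation key, value), which should be checked against the GAG documentation, and the function names, feature IDs, and database identifiers are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict


def read_annotations(handle):
    """Group annotation values as {feature_id: {key: [values]}}.

    Assumes three tab-separated columns: feature ID, key, value.
    """
    annotations = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        feature_id, key, value = row
        annotations[feature_id][key].append(value)
    return annotations


def annotate_gff_line(line, annotations):
    """Return one GFF3 line with matching annotations appended to column 9."""
    line = line.rstrip("\n")
    fields = line.split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line  # leave comments and non-feature lines untouched
    attributes = dict(
        pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
    )
    extra = annotations.get(attributes.get("ID"), {})
    # Multiple values for one key are comma-joined, as GFF3 expects.
    # Escaping of reserved characters in values is omitted from this sketch.
    added = [key + "=" + ",".join(values) for key, values in extra.items()]
    fields[8] = ";".join([fields[8].rstrip(";")] + added)
    return "\t".join(fields)


table = io.StringIO(
    "mRNA1\tproduct\thypothetical protein\n"
    "mRNA1\tDbxref\tEXAMPLE:0001\n"
    "mRNA1\tDbxref\tEXAMPLE:0002\n"
)
gff_line = "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1"
print(annotate_gff_line(gff_line, read_annotations(table)))
```

Keying the lookup on the GFF3 ID attribute keeps the annotation table independent of feature order in the GFF3 file.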
Detailed instructions for running GAG, examples for each of the main functions (e.g., removing features, adding start and stop codons, trimming features, adding annotations), and formats and conversion tools for functional annotations are available on the GAG website: http://genomeannotation.github.io/GAG/. A stable release (version 2.0.1) is available in the accompanying GigaScience database and Code Ocean entries [35,36] and is registered in SciCrunch (Genome Annotation Generator, RRID:SCR_016053).

Implementation

GAG is written in Python 2.7. It has no dependencies beyond the standard library. The program is modular, abstracting biological concepts such as Sequence, Gene, and CDS into classes that may be incorporated into other software tools. In addition, the code is covered by a suite of unit and integration tests, allowing developers to modify or add to the code base with reduced risk of introducing errors. It should be easily executable by the novice programmer with only basic command-line experience but also powerful enough to be implemented within robust genomic data processing pipelines.

Conclusion

GAG can be easily expanded in the future to support more specific needs of researchers and less common annotation types and to integrate conversion of common functional annotation output formats (e.g., InterProScan, BLAST, Blast2Go) for addition to NCBI annotation table formats. Currently, GAG is an intermediate but critical tool; it fits between a simple format conversion tool and more sophisticated annotation editors. In future developments of GAG, we plan to allow the integration of multiple lines of evidence supporting gene models to help users discriminate apparently high-quality annotations from annotations with little support or possible errors. This could rapidly improve and standardize manual annotation efforts in systems and user groups that are not integrated into genome center annotation pipelines.

Availability of supporting data

Snapshots of the software and accompanying files are available from the GigaScience GigaDB database [35], and the algorithm is also accessible in the Code Ocean cloud-based computational reproducibility platform [36]: https://codeocean.com/2018/02/06/genome-annotation-generator-ncbi-for-submission/.
Project name: Genome Annotation Generator
Project home page: https://github.com/genomeannotation/GAG
Operating systems: Linux, Windows, OS
Programming language: Python
Other requirements: Python 2
License: MIT
SciCrunch RRID:SCR_016053

Abbreviations

BLAST: Basic Local Alignment Search Tool; bp: base pair; CDS: coding sequence; GAG: Genome Annotation Generator; NCBI: National Center for Biotechnology Information; WGS: whole genome sequencing

Competing Interests

The authors declare that they have no competing interests.

Funding

Funding for this project was provided by US Department of Agriculture (USDA)-Agricultural Research Service (ARS) and USDA-Animal and Plant Health Inspection Service (APHIS) Farm Bill Section 10007 projects 3.0251.02 (FY 2014), 3.0256.01 (FY 2015), 3.0392.02 (FY 2016).

Author contributions

S.M.G. conceived software concept. B.H., T.D., and S.M.G. designed and wrote the software. B.H., S.M.G., and S.B.S. wrote the manuscript.

32 in total

1. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. The Ensembl automatic gene annotation system.

Authors: Val Curwen; Eduardo Eyras; T Daniel Andrews; Laura Clarke; Emmanuel Mongin; Steven M J Searle; Michele Clamp
Journal: Genome Res Date: 2004-05 Impact factor: 9.043

3. JIGSAW: integration of multiple sources of evidence for gene prediction.

Authors: Jonathan E Allen; Steven L Salzberg
Journal: Bioinformatics Date: 2005-08-02 Impact factor: 6.937

4. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.

Authors:
Journal: J Hered Date: 2009-11-05 Impact factor: 2.645

5. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

6. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

7. The Sequence Ontology: a tool for the unification of genome annotations.

Authors: Karen Eilbeck; Suzanna E Lewis; Christopher J Mungall; Mark Yandell; Lincoln Stein; Richard Durbin; Michael Ashburner
Journal: Genome Biol Date: 2005-04-29 Impact factor: 13.583

8. Sequence ontology annotation guide.

Authors: Karen Eilbeck; Suzanna E Lewis
Journal: Comp Funct Genomics Date: 2004

9. SOBA: sequence ontology bioinformatics analysis.

Authors: Barry Moore; Guozhen Fan; Karen Eilbeck
Journal: Nucleic Acids Res Date: 2010-05-21 Impact factor: 16.971

10. Finding the missing honey bee genes: lessons learned from a genome upgrade.

Authors: Christine G Elsik; Kim C Worley; Anna K Bennett; Martin Beye; Francisco Camara; Christopher P Childers; Dirk C de Graaf; Griet Debyser; Jixin Deng; Bart Devreese; Eran Elhaik; Jay D Evans; Leonard J Foster; Dan Graur; Roderic Guigo; Katharina Jasmin Hoff; Michael E Holder; Matthew E Hudson; Greg J Hunt; Huaiyang Jiang; Vandita Joshi; Radhika S Khetani; Peter Kosarev; Christie L Kovar; Jian Ma; Ryszard Maleszka; Robin F A Moritz; Monica C Munoz-Torres; Terence D Murphy; Donna M Muzny; Irene F Newsham; Justin T Reese; Hugh M Robertson; Gene E Robinson; Olav Rueppell; Victor Solovyev; Mario Stanke; Eckart Stolle; Jennifer M Tsuruda; Matthias Van Vaerenbergh; Robert M Waterhouse; Daniel B Weaver; Charles W Whitfield; Yuanqing Wu; Evgeny M Zdobnov; Lan Zhang; Dianhui Zhu; Richard A Gibbs
Journal: BMC Genomics Date: 2014-01-30 Impact factor: 3.969
