Michael Tillich, Pascal Lehwark, Tommaso Pellizzer, Elena S Ulbricht-Jones, Axel Fischer, Ralph Bock, Stephan Greiner.
Abstract
We have developed the web application GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html) for the rapid and accurate annotation of organellar genome sequences, in particular chloroplast genomes. In contrast to existing tools, GeSeq combines batch processing with a fully customizable selection of reference sequences, drawn from organellar genome records at NCBI and/or from references uploaded by the user. For the annotation of chloroplast genomes, the application additionally provides an integrated database of manually curated reference sequences. GeSeq identifies genes or other feature-encoding regions by BLAT-based homology searches and, additionally, by profile HMM searches for protein and rRNA coding genes and by two de novo predictors for tRNA genes. These unique features enable the user to conveniently compare the annotations of different state-of-the-art methods, thus supporting high-quality annotations. The main output of GeSeq is a GenBank file that usually requires little curation and is instantly visualized by OGDRAW. GeSeq also offers a variety of optional additional outputs that facilitate downstream analyses, for example comparative genomic or phylogenetic studies.
Year: 2017 PMID: 28486635 PMCID: PMC5570176 DOI: 10.1093/nar/gkx391
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Examples of a likely mis-annotated gene in some (provisional) NCBI Reference Sequence (NCBI RefSeq) entries encountered during this work. Exon–exon junction of the chloroplast rpl2 gene as translated from NCBI RefSeq CDS entries (left) and comparison with the corresponding entries in the GeSeq internal reference sequences database (right; see Supplementary Table S1 for GeSeq names and GenBank IDs). Suspected annotation errors are labeled in black. Some NCBI RefSeq records lack the annotation of a given gene (‘no annotation’, cf. Brassica rapa), resulting in no CDS, or the annotation of an intron, resulting in a truncated CDS (Oryza sativa ssp. indica and Saccharum). Inconsistent exon borders result in likely erroneous insertions or deletions of one or several codons (Ginkgo biloba, Medicago truncatula, Ricinus communis, Sorghum bicolor and Selaginella moellendorffii).
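The exon-border inconsistencies described for Figure 1 can be illustrated with a small check: join the annotated exons, translate the result and compare it with a curated reference translation. The following is a minimal sketch and not part of GeSeq; the function names (`spliced_cds`, `translate`), the toy sequence and the coordinates are hypothetical, and only forward-strand features are handled.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def spliced_cds(genome, exons):
    """Join exons given as 1-based inclusive (start, end) pairs on the forward strand."""
    return "".join(genome[start - 1:end] for start, end in exons)

def translate(cds):
    """Translate a CDS codon by codon; unknown codons become 'X', a trailing partial codon is ignored."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))

# Hypothetical toy gene: exon 1 (1..9), intron (10..21), exon 2 (22..30).
genome = "ATGGCTAAA" + "GTAAGTTTTCAG" + "GGTCCTTAA"
curated = [(1, 9), (22, 30)]
shifted = [(1, 9), (19, 30)]  # exon 2 annotated one codon too early

reference_protein = translate(spliced_cds(genome, curated))
suspect_protein = translate(spliced_cds(genome, shifted))

print(reference_protein)  # MAKGP*
print(suspect_protein)    # MAKQGP*
print("consistent" if suspect_protein == reference_protein else "exon border differs")
```

In this toy case the shifted exon border inserts a single extra codon at the exon–exon junction, which is the type of discrepancy the caption describes.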
Figure 2. GeSeq annotation pipeline. The user provides nucleic acid FASTA sequence(s) for annotation and selects or provides reference nucleic acid sequences in GenBank or FASTA format (‘User Input’). Based on the selected or uploaded reference sequences (‘References’), GeSeq builds a non-protein-coding (rRNA, tRNA and DNA) and a protein-coding (CDS) BLAT database, carries out standard (‘BLATn’) and translated BLAT (‘BLATx’) searches, respectively, and filters the hits (‘Hit Filter’). From the filtered hits, GeSeq annotates the classes rRNA, tRNA, CDS and gene (‘gene entries’). ‘Gene entries’ result from tRNA, rRNA and CDS hits, and include introns (if present). DNA hits are annotated as ‘misc_features’ (as shown here) or, alternatively, as ‘primer_bind’ if invoked by the user (see text for details). In addition, the user can activate an nhmmer search by selecting profile HMMs of CDS and rRNA sequences (currently chloroplast only) as references. All profile HMM hits are annotated as misc_features to support manual curation. Optionally, the user can invoke ARAGORN or tRNAscan-SE for de novo annotation of tRNAs and a self-BLATn search for the detection of the inverted repeat (IR) pair typically found in chloroplast genomes. The minimum GeSeq output (all output files are labeled in gray) is a GenBank file that contains all annotations, together with its visualization by OGDRAW for a quick evaluation. The user can also choose optional outputs, including separate multi-FASTA files (‘mFASTAs’) containing the annotated sequences belonging to the classes gene, CDS, rRNA and tRNA. If several sequences were uploaded for annotation in the same job, combined mFASTAs for all annotated sequences of the four classes are also offered for download, and codon-based alignments can optionally be produced for all annotated CDS sequences with or without the selected or uploaded GenBank references.
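The per-class multi-FASTA outputs mentioned in the caption amount to grouping annotated sequences by feature class. The sketch below shows that grouping step only, under stated assumptions: the input tuples, the `>sequence_id|feature_name` header layout and the function name `to_multifasta` are hypothetical and do not reflect the file format GeSeq actually writes.

```python
from collections import defaultdict

def to_multifasta(records):
    """Group (seq_id, feature_class, name, sequence) tuples into one FASTA string per class."""
    by_class = defaultdict(list)
    for seq_id, feature_class, name, sequence in records:
        # Hypothetical header layout: >sequence_id|feature_name
        by_class[feature_class].append(f">{seq_id}|{name}\n{sequence}")
    return {cls: "\n".join(entries) + "\n" for cls, entries in by_class.items()}

# Hypothetical annotated features from two uploaded genomes in one job.
records = [
    ("genomeA", "CDS", "psbZ", "ATGACTATT"),
    ("genomeA", "tRNA", "trnS", "GGAGAGATG"),
    ("genomeB", "CDS", "psbZ", "ATGACAATT"),
]

fasta_by_class = to_multifasta(records)
print(fasta_by_class["CDS"], end="")
```

Because records from all uploaded sequences pass through the same grouping, the combined per-class files for a multi-sequence job fall out of the same step.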
Figure 3. Examples of protein-coding and tRNA gene annotations (green arrows) by GeSeq with high-quality settings, visualized by SeqBuilder v13.0.0 (DNASTAR, Madison, WI, USA). The coordinates of the chloroplast protein-coding genes psbZ and psbC (only 3′ end shown) as annotated by BLATx exactly match their corresponding profile HMM hits (orange arrows). The tRNA gene trnS, which is annotated between the two protein-coding genes by BLATn, is confirmed by both de novo tRNA predictors, tRNAscan-SE and ARAGORN (light and dark gray arrows, respectively). See Supplementary Figure S2B and C for the corresponding GenBank file excerpt.
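The comparison shown in Figure 3 can be expressed as a simple concordance check: for each annotated feature, list the independent methods that report identical coordinates. This is a minimal sketch with hypothetical coordinates and a hypothetical helper `confirming_methods`; it requires exact coordinate identity for simplicity, whereas a practical comparison might tolerate small offsets.

```python
def confirming_methods(feature, evidence):
    """Return the sorted names of methods whose hits contain exactly this (start, end, strand)."""
    return sorted(method for method, hits in evidence.items() if feature in hits)

# Hypothetical BLAT-based annotations: name -> (start, end, strand).
annotations = {
    "psbZ": (1000, 1188, "+"),
    "trnS": (1500, 1592, "+"),
    "psbC": (2000, 3421, "+"),
}

# Hypothetical hits from the supporting methods.
evidence = {
    "HMM": {(1000, 1188, "+"), (2000, 3421, "+")},
    "tRNAscan-SE": {(1500, 1592, "+")},
    "ARAGORN": {(1500, 1592, "+")},
}

support = {name: confirming_methods(coords, evidence) for name, coords in annotations.items()}
for name, methods in support.items():
    print(name, "confirmed by:", ", ".join(methods) if methods else "none")
```

Features with no confirming method are the natural candidates for manual curation.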