Oliver Schwengers, Lukas Jelonek, Marius Alfred Dieckmann, Sebastian Beyvers, Jochen Blom, Alexander Goesmann.
Abstract
Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.
Keywords: bacteria; genome annotation; metagenome-assembled genomes; plasmids; whole-genome sequencing
Year: 2021 PMID: 34739369 PMCID: PMC8743544 DOI: 10.1099/mgen.0.000685
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
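The abstract states that coding sequence annotation is accelerated by an alignment-free sequence identification approach, but it does not describe the mechanism. The following is a minimal sketch of one way such an identification could work, assuming an exact-match lookup of protein sequences through a hash digest. The hash function (SHA-256), the toy reference records and all function names are illustrative assumptions and are not taken from the Bakta source code.

```python
from __future__ import annotations

import hashlib


def digest(protein_seq: str) -> str:
    """Return a hex digest that identifies one exact protein sequence."""
    # Normalise case and drop a trailing stop symbol so that identical
    # proteins always produce the same digest.
    normalized = protein_seq.strip().upper().rstrip("*")
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_index(known_proteins: dict[str, str]) -> dict[str, str]:
    """Map sequence digest -> record identifier for a reference set."""
    return {digest(seq): rec_id for rec_id, seq in known_proteins.items()}


def identify(cds_proteins: dict[str, str], index: dict[str, str]):
    """Split predicted proteins into identified ones and the remainder.

    Identified proteins need no alignment; the remainder would be passed
    on to a slower, alignment-based search (not shown here).
    """
    identified: dict[str, str] = {}
    remaining: list[str] = []
    for cds_id, seq in cds_proteins.items():
        rec_id = index.get(digest(seq))
        if rec_id is None:
            remaining.append(cds_id)
        else:
            identified[cds_id] = rec_id
    return identified, remaining


# Hypothetical toy data, for illustration only.
reference = {"rec1": "MKTAYIAKQR", "rec2": "MSLNNE"}
predicted = {"cds_1": "mktayiakqr*", "cds_2": "MKTAYIAKQK"}
hits, misses = identify(predicted, build_index(reference))
```

In this sketch every lookup is a constant-time dictionary access, which is why an exact-match stage of this kind can reduce the number of sequences that need an alignment. A single amino acid difference (as in `cds_2`) already results in a miss.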
Fig. 1. Overview of the Bakta annotation workflow.
Comparison of annotation results of O26:H11 strain 11368
Numbers represent numbers of annotated features. Prokka, DFAST and Bakta were executed with default parameters providing all relevant information, e.g. genus and species, assembly status and sequence topology; a detailed list of command lines is provided in Supplementary Note S3. Annotations from PGAP were downloaded in GenBank file format from RefSeq.
| Feature type | PGAP | Prokka | DFAST | Bakta |
|---|---|---|---|---|
| Total CDSs | 5794 | 5754 | 5740 | 5841 |
| Identified proteins | 5550 | – | – | 5738 |
| With COG | – | 113 | 3952 | 3277 |
| With EC | 1518 | 1042 | 1217 | 1562 |
| With GO | – | – | – | 3474 |
| Unknown function* | 423 | 1808 | 1358 | 225 |
| sORF† | 44 | – | – | 82 |
| Total tRNA | 102 | 106 | 106 | 106 |
| tRNA | 101 | 105 | 105 | 105 |
| tmRNA | 1 | 1 | 1 | 1 |
| Pseudo | – | – | – | 3 |
| rRNA | 22 | 22 | 22 | 22 |
| Total ncRNA | 17 | 295 | – | 289 |
| Genes | 10 | – | – | 223 |
| Regulatory regions | 6 | – | – | 66 |
| Miscellaneous | | | | |
| CRISPR array | 2 | 2 | 2 | 2 |
| Origins of replication | – | – | – | 4 |
| Computational resources | | | | |
| Wall-clock runtime (min:s)‡ | – | 4:13 | 3:48 | 7:09 |
| RAM (GB) | – | 1.2 | 1.8 | 4.4 |
| DB size (GB) | – | 0.6 | 3.3 | 53 |
*Protein sequences of unknown function: product denoted as hypothetical protein, putative protein, uncharacterized protein or conserved predicted protein.
†sORFs shorter than 29 amino acids.
‡Best out of three wall-clock runtimes executed with eight threads.
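The share of proteins of unknown function per tool follows directly from the "Total CDSs" and "Unknown function" rows of the table. The sketch below works through this arithmetic with the counts copied from the table and adds a small helper that applies the footnote definition of unknown function to a product description. Matching the whole product string exactly (ignoring case) is an assumption of this sketch, and the function names are illustrative.

```python
# (total CDSs, proteins of unknown function) per tool, copied from the table.
TABLE_COUNTS = {
    "PGAP": (5794, 423),
    "Prokka": (5754, 1808),
    "DFAST": (5740, 1358),
    "Bakta": (5841, 225),
}

# Product labels counted as unknown function, as listed in the footnote.
UNKNOWN_LABELS = (
    "hypothetical protein",
    "putative protein",
    "uncharacterized protein",
    "conserved predicted protein",
)


def is_unknown_function(product: str) -> bool:
    """True if a product description equals one of the footnote labels.

    Assumption: the whole description must match, case-insensitively.
    """
    return product.strip().lower() in UNKNOWN_LABELS


def unknown_fraction(total_cds: int, unknown: int) -> float:
    """Fraction of all CDSs whose product is of unknown function."""
    return unknown / total_cds


for tool, (total, unknown) in TABLE_COUNTS.items():
    print(f"{tool}: {unknown_fraction(total, unknown):.1%}")
```

With the table values this gives roughly 7.3% for PGAP, 31.4% for Prokka, 23.7% for DFAST and 3.9% for Bakta.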
Fig. 2. Comparison of wall-clock runtimes. Runtimes of Prokka, DFAST, Bakta and Bakta w/o AFSI annotating O26:H11 strain 11368 were measured three consecutive times using varying numbers of CPUs on a server machine with four Intel Xeon E5-4627 CPUs and 40 cores in total.
Fig. 3. Proportion of protein sequences annotated as hypothetical protein. Distributions of genome-wise ratios of numbers of total CDSs and those annotated as hypothetical protein are shown for 35 selected RefSeq genomes comprising species of high medical and biotechnological relevance.
Fig. 4. Proportion of protein sequences annotated as hypothetical protein. Distributions of genome-wise ratios of numbers of hypothetical proteins and total CDSs are shown for 362 GenBank genomes comprising species of undefined genera.
Functional categories of small proteins detected by Bakta
| Function* | No. of small proteins detected |
|---|---|
| Attenuator and leader peptides | 53 |
| Membrane | 10 |
| Phage | 8 |
| Regulation | 7 |
| Phenol-soluble modulin | 7 |
| Toxin–antitoxin systems | 6 |
| Toxins | 5 |
| Sporulation | 5 |
*Extracted from annotated product descriptions.
Fig. 5. Proportion of protein sequences annotated as hypothetical protein. Distributions of genome-wise ratios of numbers of total CDSs and those annotated as hypothetical protein are shown for 198 bacterial high-quality MAGs screened by genome completeness and contaminations.
Fig. 6. GUI screenshots of the Bakta web version. (a) Submission page with metadata input fields providing taxon autocompletion support for genus and species (top) and replicon table editor (bottom). (b) An igv.js-based genome browser visualizing annotated features. CDS features are coloured according to the annotated COG functional category. (c) Interactive annotation table providing search and filter features. Annotated dbxrefs are linked to target databases.