DFAST and DAGA: web-based integrated genome annotation tools and resources.

Yasuhiro Tanizawa¹, Takatomo Fujisawa², Eli Kaminuma², Yasukazu Nakamura², Masanori Arita³.

Abstract

Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp.

Keywords: Lactobacillus; Pediococcus; annotation; database; genome; lactic acid bacteria

Year: 2016 PMID: 27867804 PMCID: PMC5107635 DOI: 10.12938/bmfh.16-003

Source DB: PubMed Journal: Biosci Microbiota Food Health ISSN: 2186-3342

INTRODUCTION

Major scientific journals request that researchers deposit newly sequenced DNA in the International Nucleotide Sequence Database Collaboration (INSDC) [1]. DDBJ/ENA/GenBank are the core annotation databases, collecting publicly available DNA information with metadata. Recently, INSDC has also begun collecting raw sequences from the new-generation sequencing platforms in the Sequence Read Archive (SRA) [2]. These primary public databases constitute the basis for accessibility, reproducibility, and reusability of genomic data. However, since quality assurance and correct assignment of taxonomy are the responsibility of data contributors, improving quality and taxonomic description has been a persistent problem [3,4,5]. Low-quality data not only decrease the reliability of future analyses but also, in the worst case, lead to biologically incorrect conclusions. To avoid such problems, several tools and methods are available. QUAST [6] is a widely used assessment tool for genome assembly that reports statistical metrics such as N50 and detects misassemblies by using a reference genome. CheckM [7] estimates genome completeness and contamination by inspecting for the presence/absence of marker genes specific to each taxon. To confirm taxonomic affiliation of unidentified genomes, Bull et al. proposed using 16S rRNA genes together with housekeeping genes [4]. Beaz-Hidalgo et al. recommended the use of average nucleotide identity (ANI) to verify the taxonomic position of newly obtained genomes [8]. ANI represents the mean sequence identity of homologous regions between a given pair of genomes, and an ANI value of 95–96% is widely accepted as the threshold for distinguishing species [9,10,11]. Examples of ANI values and the 16S rRNA gene sequences for curated genomes are available from the EzGenome and EzTaxon databases [12].
Recently, the use of genomic comparison methods including ANI was also proposed to find and correct misidentified genomes in the public databases at an NCBI workshop [13]. Along this line of research, we developed the DDBJ Fast Annotation and Submission Tool (DFAST) as a web-based bacterial annotation pipeline with integrated quality assessment using CheckM and taxonomic assessment using ANI. DFAST allows researchers to submit annotated genomes easily to INSDC through the DDBJ Mass Submission System (MSS) [14]. As the initial showcase of DFAST, we targeted lactic acid bacteria (LAB) and constructed a reference protein database tailored for Lactobacillus as well as Pediococcus to enable accurate and rapid annotation. We also developed an associated repository, DFAST Archive of Genome Annotation (DAGA), which stores LAB genomes obtained from DDBJ/ENA/GenBank and SRA with consistent annotation and assessment by DFAST. Our aim is to provide a reliable genome resource to the entire research community, thereby promoting accessibility and reusability of genomic data. Among LAB, Lactobacillus is highly heterogeneous and the largest genus in the family Lactobacillaceae, comprising 185 species and 18 subspecies as of June 2016 (http://www.bacterio.net/lactobacillaceae.html). The genus Pediococcus is another member of Lactobacillaceae consisting of 11 species, and it is phylogenetically placed within the Lactobacillus cluster, near L. plantarum and L. brevis [15, 16]. In a recent study, the term Lactobacillus sensu lato was also proposed to refer to these genera [17]. In both genera, the number of new species described and genomes published have been growing with the improvement of isolation, cultivation, and identification methods as well as sequencing technology (Fig. 1). 
Nowadays, most type strains have been sequenced and become publicly available through large-scale sequencing projects, such as “Genome sequencing of JCM strains under the NBRP program” in Japan (BioProject ID: PRJDB547), “Lactobacillus in severe early childhood caries” by Sanger Institute, UK (PRJEB3060), and “Genomic characterization of the genus Lactobacillus” in China (PRJNA222257). The results of such projects have enabled genus-wide analyses covering almost 90% of the known species based on the genomic information [17, 18]. Our DAGA provides annotated genomes for the family Lactobacillaceae, which includes many species that have undergone reclassification and species difficult to distinguish by 16S rRNA gene sequences. Our data will benefit all researchers who use LAB genomes, especially those focusing on inter- and intraspecific relations.

Fig. 1.

The number of described species and published genomes in Lactobacillus and Pediococcus.

Solid line represents the cumulative number of described (sub)species. Only valid species as of Jan. 2016 were included, not reclassified ones. The bar chart represents the cumulative number of genomes deposited in DDBJ/ENA/GenBank.

In the present article, we describe development of DAGA and DFAST, and we also report several findings related to the current nomenclature.

MATERIALS AND METHODS

Construction of the annotation pipeline

The reference protein database was first constructed to provide consistent annotation to all focused genomes. A total of 69 complete genomes of Lactobacillus and Pediococcus, publicly available as of September 2015, were collected from the NCBI Assembly Database, and their protein sequences were extracted. In addition, 12 other genomes were added to link with the Lactobacillales-specific Clusters of Orthologous Genes (LaCOGs) [19] and Microbial Genome Database (MBGD) [20]: Aerococcus urinae ACS-120-V-Col10a (GCA_000193205.1), Carnobacterium sp. 17-4 (GCA_000195575.1), Enterococcus faecalis V583 (GCA_000007785.1), Lactococcus lactis subsp. cremoris SK11 (GCA_000014545.1), Lactococcus lactis subsp. lactis Il1403 (GCA_000006865.1), Leuconostoc mesenteroides subsp. mesenteroides ATCC 8293 (GCA_000014445.1), Melissococcus plutonius ATCC 35311 (GCA_000270185.1), Oenococcus oeni PSU-1 (GCA_000014385.1), Streptococcus pyogenes M1 GAS (GCA_000006785.1), Streptococcus thermophilus LMD-9 (GCA_000014485.1), Tetragenococcus halophilus NBRC 12172 (GCA_000283615.1), and Weissella koreensis KACC 15510 (GCA_000219805.1). The identified 183,469 protein sequences were grouped into 28,002 orthologous clusters by using the GET_HOMOLOGUES software (version 1.3) with its default settings [21]. Briefly, candidates for orthologous genes were determined by bidirectional BLASTP alignments between each pair of the strains with an E-value threshold of 10e-5 and a minimum coverage threshold of 75%. Then, orthologous clusters were detected by the OrthoMCL algorithm [22]. Among them, 11,993 were shared clusters containing two or more protein sequences, and the remaining 16,009 singletons were discarded. To infer the protein names and gene symbols, the shared clusters were mapped to the orthologous clusters of LaCOGs and MBGD. A total of 6,428 clusters were assigned to LaCOGs, of which 98.9% formed a one-to-one relationship with specific LaCOG clusters. 
Likewise, an additional 1,601 clusters were assigned to MBGD, of which 94.4% were one-to-one. To confirm the protein functions, public protein databases and the NCBI Conserved Domain Database [23] were searched manually. All protein names followed the NCBI guidelines for naming proteins (http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation/). The core annotation process was based on the Prokka annotation software [24], performing prediction of tRNAs, rRNAs, CRISPRs, and protein-coding sequences as well as similarity searches against protein sequence databases and protein family profiles. The reference database was used in our customized Prokka pipeline that can generate DDBJ-compliant submission files.
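
The bidirectional BLASTP step described above can be illustrated with a minimal sketch. This is not the GET_HOMOLOGUES implementation: the `Hit` record, its field names, and both helper functions are hypothetical, and the hit values and the cutoffs passed in the example are illustrative only. It shows only the filtering of hits by E-value and coverage followed by the reciprocal-best-hit test; the subsequent OrthoMCL clustering is not covered.

```python
from collections import namedtuple

# Hypothetical record for one BLASTP hit; field names are illustrative only.
Hit = namedtuple("Hit", "query subject evalue coverage bitscore")


def best_hits(hits, max_evalue, min_coverage):
    """Return {query: subject}, keeping the highest-bitscore hit that passes both filters."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue or h.coverage < min_coverage:
            continue  # discard hits that fail the E-value or coverage cutoff
        current = best.get(h.query)
        if current is None or h.bitscore > current.bitscore:
            best[h.query] = h
    return {query: h.subject for query, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue, min_coverage):
    """Pairs (a, b) in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue, min_coverage)
    backward = best_hits(b_vs_a, max_evalue, min_coverage)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy data: proteins A1, A2 from one strain and B1, B2 from another.
a_vs_b = [
    Hit("A1", "B1", 1e-50, 0.95, 300.0),
    Hit("A1", "B2", 1e-20, 0.90, 120.0),
    Hit("A2", "B2", 1e-30, 0.60, 150.0),  # fails the coverage filter
]
b_vs_a = [
    Hit("B1", "A1", 1e-48, 0.93, 295.0),
    Hit("B2", "A1", 1e-20, 0.90, 120.0),
]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5, min_coverage=0.75)
print(pairs)
```

In this toy input only A1 and B1 pick each other as best hits, so they form the single candidate ortholog pair.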

Data collection

Publicly available genome sequences for Lactobacillus and Pediococcus were downloaded from the NCBI Assembly Database, which is a secondary database of DDBJ/ENA/GenBank that provides assembled sequences for each genome [25]. Raw sequence data (Illumina sequences with the paired-end method) were downloaded from SRA, and de novo assembly was conducted to reconstruct draft genome sequences as described below. All genomes were annotated with the customized Prokka pipeline.

Genome assembly

Raw sequence reads were preprocessed using Platanus_trim (version 1.0.7) to remove low-quality regions. De novo assembly was conducted using the Platanus assembler (version 1.2.4) [26]. Since Platanus was originally developed for heterozygous diploid genomes, we specified the parameters “-d 0.3 -u 0.05” to configure it for bacterial haploid genomes. For each genome, de novo assembly was repeated five times by randomly sampling read sequences of different coverage, and the best result was chosen by the completeness calculated using CheckM and the average sequence length.
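
The selection among the five repeated assemblies can be sketched as follows. The text does not state how the two criteria were combined, so this sketch assumes that completeness is compared first and that average sequence length breaks ties; the function names, dictionary keys, and numbers are hypothetical.

```python
def average_length(contig_lengths):
    """Mean contig length of one assembly."""
    return sum(contig_lengths) / len(contig_lengths)


def pick_best_assembly(assemblies):
    """Return the run with the highest completeness; ties go to the longer average contig.

    The ordering of the two criteria is an assumption made for this sketch.
    """
    return max(
        assemblies,
        key=lambda a: (a["completeness"], average_length(a["contig_lengths"])),
    )


# Illustrative numbers only, not results from the study.
runs = [
    {"run": 1, "completeness": 98.5, "contig_lengths": [400_000, 300_000, 50_000]},
    {"run": 2, "completeness": 99.2, "contig_lengths": [500_000, 200_000]},
    {"run": 3, "completeness": 99.2, "contig_lengths": [900_000, 600_000]},
]
best = pick_best_assembly(runs)
print(best["run"])
```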

Calculation of average nucleotide identity

The pyani script (https://github.com/widdowquinn/pyani) was used to calculate the ANI between two genomes, based on the method by Goris et al. [9]. In brief, one genome was cut into 1,020 nt fragments, which were searched against the other genome by using the BLASTN algorithm [27]. ANI was calculated as the mean identity of top-hit BLASTN matches for all fragments with a sequence identity of ≥30% and an overall aligned region of ≥70% of the fragment length. The trees in Fig. 3 (B–D) were constructed by the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering method with a distance of (1 – ANI).
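
As a rough illustration of the calculation, the sketch below fragments a sequence and averages the identities of the hits that pass both cutoffs. It is not the pyani code: the function names and the `(percent_identity, alignment_length)` input format are hypothetical, the BLASTN search itself is omitted, and dropping a trailing piece shorter than 1,020 nt is an assumption.

```python
FRAGMENT_LENGTH = 1020  # fragment size in nucleotides, as stated in the text


def fragment(sequence, size=FRAGMENT_LENGTH):
    """Cut a sequence into consecutive fragments of `size` nt.

    A trailing piece shorter than `size` is dropped (assumption of this sketch).
    """
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def ani_from_hits(hits, fragment_length=FRAGMENT_LENGTH,
                  min_identity=30.0, min_aligned_fraction=0.70):
    """Mean identity of top hits that pass the identity and aligned-length cutoffs.

    `hits` holds one (percent_identity, alignment_length) tuple per fragment,
    taken from its best BLASTN match. Returns None when no hit qualifies.
    """
    kept = [
        identity
        for identity, aligned in hits
        if identity >= min_identity and aligned / fragment_length >= min_aligned_fraction
    ]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Illustrative hits: the third is too short, the fourth has too low an identity.
hits = [(98.0, 1020), (96.0, 900), (99.0, 500), (25.0, 1020)]
ani = ani_from_hits(hits)
print(ani)
```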

Fig. 3.

A) Distribution of ANI values among 191 representative genomes. Circles and open diamonds indicate interspecific and intraspecific ANI values, respectively. Black circles indicate problematic genomes. B–D) Hierarchical clustering results by using (1 – ANI) as the genome distance. Each label represents the accession number of NCBI Assembly Database and the strain name. B: Lactobacillus gasseri, C: L. jensenii, D: L. delbrueckii.
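
The hierarchical clustering used for panels B–D can be sketched in plain Python. This is a minimal UPGMA written for illustration, with (1 – ANI) as the distance; it is not the code used to draw the figure, the function name and input format are hypothetical, and the ANI values in the example are invented rather than taken from the study.

```python
from itertools import combinations


def upgma(labels, ani):
    """Cluster genomes by UPGMA using (1 - ANI) as the distance.

    `labels` is a list of genome names; `ani` maps frozenset({a, b}) to the ANI
    (as a fraction) for every pair. Returns a nested tuple giving the merge order.
    """
    dist = {pair: 1.0 - value for pair, value in ani.items()}
    sizes = {label: 1 for label in labels}  # cluster -> number of genomes in it
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for other in sizes:
            if other == a or other == b:
                continue
            # Size-weighted average of the distances from the two merged clusters.
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * sizes[a]
                + dist[frozenset((b, other))] * sizes[b]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Invented ANI values for four hypothetical strains forming two subgroups.
labels = ["s1", "s2", "s3", "s4"]
ani = {
    frozenset(("s1", "s2")): 0.99,
    frozenset(("s3", "s4")): 0.98,
    frozenset(("s1", "s3")): 0.93,
    frozenset(("s1", "s4")): 0.93,
    frozenset(("s2", "s3")): 0.92,
    frozenset(("s2", "s4")): 0.94,
}
tree = upgma(labels, ani)
print(tree)
```

With these values the two high-ANI pairs are joined first and the two resulting subgroups are merged last.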

Quality assessment of genomes

CheckM (version 1.0.5) was used to calculate completeness and contamination of each genome [7]. CheckM inspected for the presence/absence of 409 and 664 single-copy gene markers specific for Lactobacillus and Pediococcus, respectively. Genome completeness and contamination were estimated by the number of distinct markers and their multiplicity in each genome, respectively.
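
The idea behind the two estimates can be shown with a deliberately simplified sketch. CheckM itself groups markers into sets and weights them, which is ignored here; the function below, its name, and its inputs are hypothetical and only count distinct markers (completeness) and extra copies (contamination).

```python
def completeness_and_contamination(expected_markers, observed_counts):
    """Simplified estimates, in percent, from single-copy marker counts.

    completeness  = share of expected markers found at least once
    contamination = extra copies (beyond the first) relative to the number of
                    expected markers
    CheckM's marker-set weighting is not reproduced in this sketch.
    """
    total = len(expected_markers)
    found = sum(1 for m in expected_markers if observed_counts.get(m, 0) >= 1)
    extra = sum(max(observed_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return 100.0 * found / total, 100.0 * extra / total


# Toy example: four expected markers, one missing and one duplicated.
expected = ["m1", "m2", "m3", "m4"]
observed = {"m1": 1, "m2": 2, "m3": 1}
completeness, contamination = completeness_and_contamination(expected, observed)
print(completeness, contamination)
```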

Implementation of the web service

DFAST and DAGA were implemented in Python 2.7.11 with PostgreSQL 8.4.20 and Nginx 1.8.0, and run on a Red Hat Enterprise Linux server (release 6.7).

RESULTS

Overview of the DAGA service

We developed an integrated genome archive specialized for LAB, namely DAGA. The first version of the dataset targets the family Lactobacillaceae and contains 1,389 and 32 genomes for Lactobacillus and Pediococcus, respectively. Among them, 743 are publicly available genome sequences deposited in DDBJ/ENA/GenBank; they were obtained from the NCBI Assembly Database. The remaining 678 genomes were assembled de novo from raw reads deposited in SRA. All genomes were annotated by the Prokka (ver. 1.11) pipeline with the customized reference database for LAB. The quality of genomes was assessed by examining the presence of specific gene markers with CheckM, and taxonomic affiliation was verified by ANI. As of January 2016, DAGA covers 168 species and 18 subspecies of the genus Lactobacillus and 11 species of the genus Pediococcus, which correspond to 91% of the known species for both genera. DAGA utilizes accession numbers from the original source as the genome identifiers; data with “GCA” in the genome identifier are from the NCBI Assembly Database, and those with “DRR”, “ERR”, or “SRR” are from SRA. Figure 2 shows screenshots of DAGA. Users can query genomes of interest from the search form in the upper part or select a taxonomic name. A keyword search is also available. The genome quality is rated on a five-grade scale, allowing users to easily select reliable genomes for comparative analysis. The definition of the rating scale and the number of genomes for each grade are shown in Tables 1 and 2. DAGA also provides genome statistics, such as the number of coding sequences and the estimated genome size, as well as external links to related databases. Annotation results can be downloaded in either GenBank or FASTA format files. DAGA is freely accessible at https://dfast.nig.ac.jp.
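
The identifier convention described above can be expressed as a small helper. The function name is hypothetical and the sketch assumes that the prefix is always the first three characters of the identifier.

```python
def data_source(genome_id):
    """Infer the origin of a DAGA genome from its identifier prefix."""
    prefix = genome_id[:3].upper()
    if prefix == "GCA":
        return "NCBI Assembly Database"
    if prefix in {"DRR", "ERR", "SRR"}:
        return "SRA"
    raise ValueError(f"Unrecognized identifier prefix: {genome_id!r}")


print(data_source("GCA_001434555.1"))
print(data_source("ERR387486"))
```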

Fig. 2.

Screenshots of DFAST and DAGA.

A) Main page of DAGA, listing genomes in the database. Users can query genomes from the search form. B) Detail page of each genome, showing statistics and external links. Data files are downloadable in several formats. C) Detail page of annotated features. Links to the Blast web service at NCBI are available. D) Submission form of DFAST. Users can annotate their own genome by uploading the FASTA file. E) Result of DFAST. Submission files for DDBJ Mass Submission System are ready.

Table 1.

Number of genomes deposited in DAGA

Data source	Rating 1	Rating 2	Rating 3	Rating 4	Rating 5	Total
DDBJ/ENA/GenBank	17	11	59	558	98	743
SRA	30	27	4	617	0	678
Total	47	38	63	1,175	98	1,421

Table 2.

Definition of the quality rating grades

Quality rating	Definition
5	High quality complete genomes with completeness ≥95% and contamination ≤5%
4	High quality draft genomes with completeness ≥95% and contamination ≤5%
3	Low quality genomes with completeness ≥80% and contamination ≤10%
2	Disqualified genomes with completeness <80% or contamination >10%
1	Taxonomically mislabeled or misidentified genomes
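
The grades in Table 2 translate directly into a decision rule. The sketch below is an illustration with a hypothetical function name and arguments; it assumes that a taxonomic problem takes precedence over the completeness and contamination values.

```python
def quality_rating(completeness, contamination, is_complete_genome, taxonomy_ok=True):
    """Return the DAGA quality grade (1-5) following the definitions in Table 2.

    completeness and contamination are percentages as reported by CheckM.
    Giving Rating 1 precedence over the other grades is an assumption of this sketch.
    """
    if not taxonomy_ok:
        return 1  # taxonomically mislabeled or misidentified
    if completeness >= 95 and contamination <= 5:
        return 5 if is_complete_genome else 4  # high quality: complete vs. draft
    if completeness >= 80 and contamination <= 10:
        return 3  # low quality
    return 2  # disqualified


print(quality_rating(99.1, 0.5, is_complete_genome=False))
```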

Selection of a representative genome for each species

To verify the taxonomic relationship of each species, we calculated pairwise ANI values among 191 strains representing each species (or subspecies). We gave priority to the type strains in the data selection, and when multiple genomes were available, the one with the highest completeness and the longest average sequence length was chosen. Figure 3A shows the results of ANI calculation (also see our website http://dfast.nig.ac.jp/download/). In most cases, the ANI values between species were below 95%, the threshold to differentiate species. Six strains in Table 3 (black circles in Fig. 3A) showed anomalously high ANI values, indicating the incongruence of their taxonomic positions, which will be discussed later.

Table 3.

Strains with problematic taxonomic positions

Data source*	Organism name	Strain	Description
GCA_000159175.1	Lactobacillus brevis subsp. gravesensis	ATCC 27305^#	Shows an ANI value of 97.3% against L. hilgardii.
ERR387492	Lactobacillus fornicalis	JCM 12512^T	Shows an ANI value of 98.7% against L. plantarum subsp. plantarum.
GCA_001436985.1	Lactobacillus homohiochii	DSM 20571^T	Shows an ANI value of 99.9% against L. fructivorans.
GCA_001434215.1	Lactobacillus parakefiri	DSM 10551^T	Shows an ANI value of 99.9% against L. kefiri. Possibly contaminated with L. kefiri (contamination value 98%).
SRR1561417	Pediococcus lolii	DSM 19927^T	Shows an ANI value of 97.1% against P. acidilactici.
GCA_001437265.1	Pediococcus parvulus	DSM 20332^T	Shows an ANI value of 92.5% against P. acidilactici. Possibly contaminated with P. acidilactici (contamination value 98.9%).

# Non-type strain.

By excluding these six strains, we obtained 185 representative genomes whose interspecific pairwise ANI values were well below 95%. One exception was L. zeae and L. casei, which had an ANI of 94.4% (see Discussion). After a long period of controversy, L. zeae is now considered to be in the same taxon as L. casei [28]. However, the organism name has not been formally rejected in the current nomenclature, and L. zeae was counted with its original name in our database. It should also be noted that the publicly available genome of L. amylotrophicus (GCA_001434555.1), which exhibited an ANI of 99.9% with L. amylophilus, did not serve as the representative genome. Instead, we used data from SRA (ERR387486) as the representative of L. amylotrophicus. The validity of the 185 representative genomes was also confirmed by comparing their reconstructed 16S rRNA gene sequences with those deposited in public databases. When not available, housekeeping genes like pheS or rpoA were used instead. In addition, a phylogenetic tree was constructed using 132 conserved single-copy genes to verify their taxonomic positions, and this tree is available at our website (http://dfast.nig.ac.jp/download/). Selection of representative genomes was implemented as a procedure in our system to serve as a tool for taxonomic studies in which comparison with type strains is critical.

Detection of mislabeled genomes by ANI values

We next checked the taxonomic affiliation for all genomes in DAGA by conducting ANI calculations against the representative genomes. We adopted species names based on the ANI calculations for 77 mislabeled genomes and inferred names for 55 unidentified genomes that were deposited as Lactobacillus sp. Such genomes with problematic taxonomic positions were marked as Rating 1 (Table 4).
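
A simple version of this check is sketched below, assuming a single 95% cutoff applied to the best-matching representative genome. The procedure actually used may differ (several entries in Table 4 are reported with ANI values below 95%), and the function names are hypothetical. In the example, the ANI of 100% against L. amylovorus is taken from Table 4, whereas the value against L. acidophilus is invented for illustration.

```python
def infer_species(ani_to_representatives, threshold=95.0):
    """Return (species, ani) for the best-matching representative genome.

    `ani_to_representatives` maps species name -> ANI in percent.
    The species is None when even the best match falls below the threshold.
    """
    species, best = max(ani_to_representatives.items(), key=lambda item: item[1])
    return (species if best >= threshold else None), best


def check_label(declared, ani_to_representatives, threshold=95.0):
    """Classify a deposited organism name as consistent, mislabeled, or unassigned."""
    inferred, _ = infer_species(ani_to_representatives, threshold)
    if inferred is None:
        return "unassigned"
    return "consistent" if inferred == declared else "mislabeled"


# Strain 30SC (GCA_000191545.1), deposited as L. acidophilus.
ani_30sc = {
    "Lactobacillus amylovorus": 100.0,   # from Table 4
    "Lactobacillus acidophilus": 80.0,   # hypothetical value for illustration
}
print(check_label("Lactobacillus acidophilus", ani_30sc))
```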

Table 4.

Mislabeled genomes deposited in DDBJ/ENA/GenBank

Data source*	Organism name	Strain	Description
GCA_001434555.1	Lactobacillus amylotrophicus	DSM 20534^T	Shows an ANI value of 100% against L. amylophilus. Possibly replaced by the strain of L. amylophilus.
GCA_001314245.1	Lactobacillus gallinarum	HFD4	Shows an ANI value of 96.7% against L. helveticus.
GCA_001273585.1	Lactobacillus plantarum	SNU.Lp177	Shows an ANI value of 98.9% against L. plantarum subsp. argentoratensis and an ANI value of 95.6% against subsp. plantarum.
GCA_001068345.1	Lactobacillus johnsonii	987_LJOH	Shows an ANI value of 93.4% against L. gasseri.
GCA_001066235.1	Lactobacillus johnsonii	770_LJOH	Shows an ANI value of 100% against L. gasseri.
GCA_001064985.1	Lactobacillus helveticus	459_LHEL	Shows an ANI value of 96.8% against L. gasseri.
GCA_001063065.1	Lactobacillus kefiranofaciens	249_LKEF	Shows an ANI value of 100% against L. gasseri.
GCA_001063045.1	Lactobacillus crispatus	240_LCRI	Shows an ANI value of 100% against L. gasseri.
GCA_000469115.1	Lactobacillus plantarum	AY01	Shows an ANI value of 99.6% against L. paraplantarum.
GCA_000463075.2	Lactobacillus plantarum	EGD-AQ4	Shows an ANI value of 92.8% against L. pentosus.
GCA_000191545.1	Lactobacillus acidophilus	30SC	Shows an ANI value of 100% against L. amylovorus.
GCA_000159195.1	Lactobacillus buchneri	ATCC 11577	Shows an ANI value of 99.1% against L. hilgardii.

* Those with GCA were derived from NCBI Assembly Database and those with DRR/SRR/ERR were derived from SRA.

Remarkably, 28 of 32 “L. casei” genomes were in fact L. paracasei, as previously postulated in the literature [29] and indicated by the fact that they shared an ANI of over 98% with L. paracasei ATCC 25302T and an ANI of less than 85% with L. casei ATCC 393T. Among the remaining four “L. casei” genomes, two were type strains, one was low quality with 22% ambiguous bases (N), and the last was the recently published L. casei N87 (GCA_001013375.1). The last strain shared an ANI of 96.8% with L. zeae DSM 20178T and an ANI of 94.3% with L. casei ATCC 393T. In the L. plantarum group, the members of which are notoriously difficult to identify with 16S rRNA sequence similarity, three “L. plantarum” genomes were reassigned organism names inferred from ANI results. The strains SNU.Lp177 (GCA_001273585.1), EGD-AQ4 (GCA_000463075.2), and AY01 (GCA_000469115.1) were inferred to be L. plantarum subsp. argentoratensis, L. pentosus, and L. paraplantarum, respectively. All assignments were recorded, i.e., both the original and the corrected names are available in our database.

Genomic diversity of LAB revealed by ANI

As a demonstrative analysis taking advantage of the wealth of genomic data stored in DAGA, we conducted all-against-all ANI comparison between 704 genomes (N = 704 × 703/2 = 247,456) to further investigate genomic diversity. Low-quality genomes and genomes with ambiguous taxonomy were excluded. All interspecific ANI values (N=239,840) were less than 95%, while 198 out of the remaining 7,616 intraspecific ANI values were also less than 95%. Such exceptions included the divergence within L. kunkeei, L. gasseri, and L. jensenii. L. gasseri and L. jensenii were each clearly separated into two previously unknown subgroups (Figs. 3B and 3C). The ANI values between the subgroups were 93% and 88% for L. gasseri and L. jensenii, respectively, while the ANI values within the same subgroups were over 98% in both species. The intraspecific separation was also supported by the multiple alignments of their pheS gene sequences (alignment data not shown). For L. gasseri and L. jensenii, the nucleotide identities of pheS genes between the subgroups were 96% and 93%, while those of rpoA genes were 99% and 98%, respectively. The intraspecific separation in the two species might deserve subspecies-level differentiation. We must note, however, that our analysis was based on genomic information only. Further analysis including phenotypic characterization is required to establish their valid classifications. To assess the discriminating power of ANI, ANI values were calculated among six subspecies of L. delbrueckii. The ANI values for their type strains were distributed in the range of 97.2–98.4%. In spite of such high values, hierarchical clustering based on the ANI values could separate them (Fig. 3D), and the tree topology was roughly consistent with the ones from multilocus sequence analyses [30, 31]. This implies the reliability of ANI in evaluating the genetic subgroups within a species.
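
The counts reported above can be checked with a few lines of arithmetic, and the inter-/intraspecific tally can be sketched as a small function. The function names are hypothetical, and the three comparisons in the demo are illustrative rather than records from DAGA.

```python
def n_pairs(n):
    """Number of unordered pairs among n genomes."""
    return n * (n - 1) // 2


def tally(comparisons, threshold=95.0):
    """Count inter- and intraspecific comparisons and those on the unexpected side of the cutoff.

    `comparisons` is an iterable of (species_a, species_b, ani_percent).
    """
    counts = {"inter": 0, "intra": 0, "intra_below": 0, "inter_at_or_above": 0}
    for species_a, species_b, ani in comparisons:
        if species_a == species_b:
            counts["intra"] += 1
            if ani < threshold:
                counts["intra_below"] += 1
        else:
            counts["inter"] += 1
            if ani >= threshold:
                counts["inter_at_or_above"] += 1
    return counts


# Figures reported in the text.
total = n_pairs(704)
interspecific, intraspecific, intra_below = 239_840, 7_616, 198
share_below = 100 * intra_below / intraspecific  # percent of intraspecific pairs below 95%

# Illustrative comparisons only.
demo = tally([
    ("L. gasseri", "L. gasseri", 93.0),
    ("L. gasseri", "L. gasseri", 98.5),
    ("L. gasseri", "L. jensenii", 75.0),
])
print(total, round(share_below, 1), demo)
```

The reported inter- and intraspecific counts add up to the total number of pairs, and the 198 divergent intraspecific comparisons correspond to about 2.6% of all intraspecific pairs.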

DFAST online annotation server

We developed a web interface for the DFAST annotation pipeline, so that users can manage metadata and submit annotated genomes to DDBJ. Users can annotate their own genomic data by uploading a FASTA formatted file via a submission form and can perform quality and taxonomic assessment using CheckM and the calculation of ANI. A simple annotation editor is also available, allowing users to modify gene product names or gene symbols. Submission files for the DDBJ Mass Submission System are then automatically generated. Results can be downloaded in several formats, including GenBank, Multi-FASTA, and tab-separated formats.

DISCUSSION

Recent new-generation sequencing technologies are constantly producing more and more genome sequences, making it important to assess their data quality and taxonomic positions. DAGA is a new genome archive that stores quality-controlled and taxonomically confirmed bacterial genomes with consistent annotation. Its quality measure is the genome completeness and contamination values calculated by CheckM, and we were able to use it to successfully identify genomes of incorrect size as compared with typical LAB strains without using any other selection method. In addition, we also identified taxonomically mislabeled genomes in public databases even for type strains (Table 3). These results will help researchers to select genomes for comparative analysis. The NCBI Reference Sequence (RefSeq) and the Pathosystems Resource Integration Center (PATRIC) provide consistently annotated genome collections [32, 33]. They collect genome sequences from DDBJ/ENA/GenBank and re-annotate them using NCBI Prokaryotic Genome Annotation Pipeline (PGAP) and Rapid Annotation using Subsystem Technology (RAST), respectively. As far as we know, there is no database that collects genomic data from both DDBJ/ENA/GenBank and SRA. Because SRA stores raw sequence data, it is difficult for users without bioinformatics skills to exploit the data. DAGA facilitates the reuse of valuable data available in SRA, such as the only reliable genome for L. amylotrophicus, which can only be obtained from SRA (ERR387486). As of January 2016, DAGA provides 1,421 genomes collected from DDBJ/ENA/GenBank and SRA for two genera in Lactobacillaceae. The genus Sharpea was not included even though it is classified in the family Lactobacillaceae. Sharpea azabuensis, the only member of this genus, was initially described as a species related to Lactobacillus catenaformis, but L. catenaformis was later reclassified as Eggerthia catenaformis, and it is no longer a member of Lactobacillaceae [34, 35].
As the number of available genomes is increasing rapidly, we plan to update the database regularly and to expand the scope of the database to other taxonomic groups. The most widely used methodology for bacterial taxonomic identification is the combination of 16S rRNA gene sequencing and DNA-DNA hybridization (DDH) [36]. According to the minimal standard recommended for describing new species of Lactobacillus, DDH should be conducted if the 16S rRNA sequence similarity to the closest known species exceeds 97% [37]. Recently, however, ANI has been used as a substitute for DDH to describe novel species of Lactobacillus [38,39,40]. ANI has several advantages. First, it does not require a laboratory assay and is computationally reproducible. Second, it does not require gene calling and is applicable to draft genomes. It is especially valuable in the case of conducting de novo assembly from short reads because bacterial genomes normally encode multiple rRNA operons that are difficult to reconstruct. Lastly and most importantly, ANI shows prominent discriminatory power to determine genome identity. Even between hard-to-distinguish taxonomic groups such as L. casei and L. plantarum, the ANI values between two different species were below 85%, much less than the threshold of 95%. Furthermore, only 0.4% of the comparisons fell within the “twilight zone” of 85–95% in our analysis of 191 representative genomes (Fig. 3A). For these reasons, we emphasize the benefit of ANI to validate taxonomic status for genomes deposited in DAGA. As an exception, the ANI value between L. casei ATCC 393T and L. zeae DSM 20178T was slightly below the species-level threshold (94.4%) even though the two strains are now considered the same species. In our analysis, ANI values between species were always less than 95%, but the reverse is not always true. As shown by the results for L. gasseri and L. jensenii, intraspecific ANI values can be lower than 95% in some species.
In several species, ANI can help determine subspecies of a given strain, as shown in the results for L. delbrueckii (Fig. 3D). It seems difficult to establish an ANI cutoff value to distinguish subspecies, however, because inter-subspecific ANI values depend on the species (Fig. 3A). For example, the lowest value exhibited by the subspecies of L. aviarius was 89%, much lower than the species-level threshold. The highest value was reported for L. kefiranofaciens, which showed an ANI value as high as 99.4%. According to the original description, the two subspecies of L. kefiranofaciens shared 100% 16S rRNA sequence identity and were distinguishable by morphological and biochemical characteristics [41]. On the other hand, the two subspecies of L. plantarum were distinguished mainly based on their genotypic traits because their morphological, physiological, and biochemical characteristics were almost identical, with the only exceptions being in a few carbohydrate fermentation patterns [42]. The ANI value between L. plantarum subsp. plantarum and subsp. argentoratensis was 95.3%. The difference in inter-subspecific ANI values between the two species seems to reflect their original descriptions. For several strains, the allocation of subspecies was found to be inconsistent with the ANI results. L. sakei subsp. sakei 23 K was more similar to subsp. carnosus than to subsp. sakei, as suggested by Chaillou et al. [43]. The two strains labeled as L. paracasei subsp. tolerans (GCA_000409835.1 and GCA_000410335.1) were more similar to subsp. paracasei, although the difference was as small as 0.2%. The genome sizes of subsp. paracasei and subsp. tolerans differ prominently: 3.0 Mbp and 2.4 Mbp, respectively. Judging from the genome sizes, the two strains are more likely to belong to the subspecies paracasei. However, we could not find any other evidence that supports this assumption. 
The values from all ANI calculations are available from our website: https://dfast.nig.ac.jp/download/. Our assessment found the six questionable genomes listed in Table 3, namely, Pediococcus lolii DSM 19927T (GCA_001437115.1), Pediococcus parvulus DSM 20332T (GCA_001437265.1), Lactobacillus brevis subsp. gravesensis ATCC 27305 (GCA_000159175.1), Lactobacillus fornicalis JCM 12512T (ERR387492), Lactobacillus homohiochii DSM 20571T (GCA_001436985.1), and Lactobacillus parakefiri DSM 10551T (GCA_001434215.1). The P. lolii genome was presumably a misclassification of the sequenced strain. A previous study reported that the type strains of P. lolii deposited in DSMZ and JCM were strains of Pediococcus acidilactici [44]. Our analysis showed that not only P. lolii DSM 19927T but also strain NGRI 0510QT (GCA_000319265.1), an original type strain of P. lolii, shared an ANI of 97% with P. acidilactici. L. brevis subsp. gravesensis was first described over 60 years ago, but it was not mentioned in the Approved Lists of Bacterial Names published in 1980 [45]. This species is displayed as Lactobacillus sp. and Lactobacillus hilgardii in JCM and the EzGenome database, respectively [12, 46]. The type strains of L. homohiochii and L. fornicalis deposited in culture collections were reported to misrepresent the originally described strains [47] (http://www.bacterio.net/lactobacillus.html#fornicalis). Their original strains are no longer available, and designation of a neotype seems appropriate. The genome of L. parakefiri DSM 10551T (GCA_001434215.1) exhibited an extremely high contamination value (98%), indicating the mixture of different strains. Indeed, two pheS genes were found in the genome, each matching the deposited pheS gene sequences of L. kefiri and L. parakefiri. Our analysis suggests that its large genome size [18] and the similarity to L. kefiri [17] are attributable to the sequence contamination. Likewise, the genome of P. parvulus DSM 20332T seems to be contaminated with another strain of P. acidilactici.

Our annotation pipeline is freely available as the DFAST web service. In comparison with other annotation tools such as RAST [48] or the Microbial Genome Annotation Pipeline (MiGAP) [49], the advantage of DFAST is the ability to generate ready-to-submit annotation files. RAST can perform detailed functional annotation based on the platform called SEED. However, if users want to submit an annotated genome to INSDC, they need to convert annotation results into an acceptable format. Although MiGAP partly supports the DDBJ-acceptable format, users are still required to prepare metadata and to curate annotated protein names before submission. As our curated reference database follows the protein naming guidelines of the NCBI, minimal manual curation, if any, is required before submitting genomes to DDBJ. Another advantage of DFAST is its short running time. It takes about 5 minutes to annotate a typical bacterial genome, while RAST and MiGAP take several hours. In addition, DFAST provides quality and taxonomy assessment tools, which prevent users from submitting low quality or mislabeled genomes to INSDC. We have already used DFAST to annotate 5 genomes of Lactobacillus strains, including two candidates for new species (manuscript in preparation). On average, 90.3% of protein coding sequences were annotated based on a similarity search against the reference protein database in this study. We were able to submit them to DDBJ without any manual curation. Currently, the reference database constructed in this study is based mainly on protein sequence data obtained from Lactobacillus and Pediococcus, with additional information from 12 representative strains of other genera. Our future tasks include an update and extension of the reference database to other genera, such as Lactococcus and Leuconostoc, and annotation of frameshifted genes or pseudo-genes.
In conclusion, we assessed 1,421 genomes covering 191 (sub)species in the family Lactobacillaceae and developed a curated genome repository referred to as DAGA. This will improve the accessibility and reusability of LAB genome resources. The annotation and submission pipeline DFAST will help researchers to deal with large amounts of emerging sequence data, thereby accelerating studies of LAB that make use of genomic data.

43 in total

1. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities.

Authors: Johan Goris; Konstantinos T Konstantinidis; Joel A Klappenbach; Tom Coenye; Peter Vandamme; James M Tiedje
Journal: Int J Syst Evol Microbiol Date: 2007-01 Impact factor: 2.747

2. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

3. Multilocus sequence typing reveals a novel subspeciation of Lactobacillus delbrueckii.

Authors: Kana Tanigawa; Koichi Watanabe
Journal: Microbiology (Reading) Date: 2010-12-22 Impact factor: 2.777

4. Reclassification of Lactobacillus catenaformis (Eggerth 1935) Moore and Holdeman 1970 and Lactobacillus vitulinus Sharpe et al. 1973 as Eggerthia catenaformis gen. nov., comb. nov. and Kandleria vitulina gen. nov., comb. nov., respectively.

Authors: Elisa Salvetti; Giovanna E Felis; Franco Dellaglio; Anna Castioni; Sandra Torriani; Paul A Lawson
Journal: Int J Syst Evol Microbiol Date: 2010-11-26 Impact factor: 2.747

5. Lactobacillus herbarum sp. nov., a species related to Lactobacillus plantarum.

Authors: Yuejian Mao; Meng Chen; Philippe Horvath
Journal: Int J Syst Evol Microbiol Date: 2015-09-23 Impact factor: 2.747

6. Lactobacillus sicerae sp. nov., a lactic acid bacterium isolated from Spanish natural cider.

Authors: Ana Isabel Puertas; David R Arahal; Idoia Ibarburu; Patricia Elizaquível; Rosa Aznar; M Teresa Dueñas
Journal: Int J Syst Evol Microbiol Date: 2014-06-04 Impact factor: 2.747

7. Sharpea azabuensis gen. nov., sp. nov., a Gram-positive, strictly anaerobic bacterium isolated from the faeces of thoroughbred horses.

Authors: Hidetoshi Morita; Chiharu Shiratori; Masaru Murakami; Hideto Takami; Hidehiro Toh; Yukio Kato; Fumihiko Nakajima; Misako Takagi; Hiroaki Akita; Toshio Masaoka; Masahira Hattori
Journal: Int J Syst Evol Microbiol Date: 2008-12 Impact factor: 2.747

8. Lactobacillus delbrueckii subsp. jakobsenii subsp. nov., isolated from dolo wort, an alcoholic fermented beverage in Burkina Faso.

Authors: David B Adimpong; Dennis S Nielsen; Kim I Sørensen; Finn K Vogensen; Hagrétou Sawadogo-Lingani; Patrick M F Derkx; Lene Jespersen
Journal: Int J Syst Evol Microbiol Date: 2013-05-03 Impact factor: 2.747

9. RefSeq microbial genomes database: new representation and annotation strategy.

Authors: T Tatusova; S Ciufo; B Fedorov; K O'Neill; I Tolstoy
Journal: Nucleic Acids Res Date: 2015-03-30 Impact factor: 16.971

10. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

Authors: Ross Overbeek; Robert Olson; Gordon D Pusch; Gary J Olsen; James J Davis; Terry Disz; Robert A Edwards; Svetlana Gerdes; Bruce Parrello; Maulik Shukla; Veronika Vonstein; Alice R Wattam; Fangfang Xia; Rick Stevens
Journal: Nucleic Acids Res Date: 2013-11-29 Impact factor: 16.971
