Yasuhiro Tanizawa, Takatomo Fujisawa, Eli Kaminuma, Yasukazu Nakamura, Masanori Arita.
Abstract
Quality assurance and correct taxonomic affiliation of data submitted to public sequence databases have been an everlasting problem. The DDBJ Fast Annotation and Submission Tool (DFAST) is a newly developed genome annotation pipeline with quality and taxonomy assessment tools. To enable annotation of ready-to-submit quality, we also constructed curated reference protein databases tailored for lactic acid bacteria. DFAST was developed so that all the procedures required for DDBJ submission could be done seamlessly online. The online workspace would be especially useful for users not familiar with bioinformatics skills. In addition, we have developed a genome repository, DFAST Archive of Genome Annotation (DAGA), which currently includes 1,421 genomes covering 179 species and 18 subspecies of two genera, Lactobacillus and Pediococcus, obtained from both DDBJ/ENA/GenBank and Sequence Read Archive (SRA). All the genomes deposited in DAGA were annotated consistently and assessed using DFAST. To assess the taxonomic position based on genomic sequence information, we used the average nucleotide identity (ANI), which showed high discriminative power to determine whether two given genomes belong to the same species. We corrected mislabeled or misidentified genomes in the public database and deposited the curated information in DAGA. The repository will improve the accessibility and reusability of genome resources for lactic acid bacteria. By exploiting the data deposited in DAGA, we found intraspecific subgroups in Lactobacillus gasseri and Lactobacillus jensenii, whose variation between subgroups is larger than the well-accepted ANI threshold of 95% to differentiate species. DFAST and DAGA are freely accessible at https://dfast.nig.ac.jp.
Keywords: Lactobacillus; Pediococcus; annotation; database; genome; lactic acid bacteria
Year: 2016 PMID: 27867804 PMCID: PMC5107635 DOI: 10.12938/bmfh.16-003
Source DB: PubMed Journal: Biosci Microbiota Food Health ISSN: 2186-3342
Fig. 1. The number of described species and published genomes in Lactobacillus and Pediococcus.
Solid line represents the cumulative number of described (sub)species. Only valid species as of Jan. 2016 were included, not reclassified ones. The bar chart represents the cumulative number of genomes deposited in DDBJ/ENA/GenBank.
Fig. 2. Screenshots of DFAST and DAGA.
A) Main page of DAGA, listing genomes in the database. Users can query genomes from the search form. B) Detail page of each genome, showing statistics and external links. Data files are downloadable in several formats. C) Detail page of annotated features. Links to the Blast web service at NCBI are available. D) Submission form of DFAST. Users can annotate their own genome by uploading the FASTA file. E) Result of DFAST. Submission files for DDBJ Mass Submission System are ready.
Fig. 3. A) Distribution of ANI values among 191 representative genomes. Circles and open diamonds indicate interspecific and intraspecific ANI values, respectively. Black circles indicate problematic genomes. B–D) Hierarchical clustering results by using (1 – ANI) as the genome distance. Each label represents the accession number of NCBI Assembly Database and the strain name. B: Lactobacillus gasseri, C: L. jensenii, D: L. delbrueckii.
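The Fig. 3 caption describes hierarchical clustering with (1 - ANI) as the genome distance. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The linkage method is not stated in this record, so average linkage is an assumption; the genome labels `g1` to `g4` and their ANI values are hypothetical illustration data.

```python
from itertools import combinations


def ani_to_distance(ani_percent):
    """Convert an ANI percentage into the (1 - ANI) distance."""
    return 1.0 - ani_percent / 100.0


def average_linkage(names, ani):
    """Agglomerative clustering with average linkage (an assumed method).

    names: list of genome labels
    ani:   dict mapping frozenset({a, b}) -> ANI percentage
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []

    def dist(c1, c2):
        # Mean pairwise (1 - ANI) between members of the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(ani_to_distance(ani[frozenset(p)]) for p in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters and record the step.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical ANI values (%): two tight pairs separated by about 93% ANI.
example_names = ["g1", "g2", "g3", "g4"]
example_ani = {
    frozenset({"g1", "g2"}): 99.0,
    frozenset({"g3", "g4"}): 98.0,
    frozenset({"g1", "g3"}): 93.0,
    frozenset({"g1", "g4"}): 93.0,
    frozenset({"g2", "g3"}): 93.0,
    frozenset({"g2", "g4"}): 93.0,
}

merges = average_linkage(example_names, example_ani)
for left, right, distance in merges:
    print(left, right, round(distance, 4))
```

In this toy input the two pairs merge first and are joined last at a distance above 0.05, which corresponds to an ANI below the 95% species threshold cited in the abstract.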
Number of genomes deposited in DAGA
| Data source | Quality rating 1 | Quality rating 2 | Quality rating 3 | Quality rating 4 | Quality rating 5 | Total |
|---|---|---|---|---|---|---|
| DDBJ/ENA/GenBank | 17 | 11 | 59 | 558 | 98 | 743 |
| SRA | 30 | 27 | 4 | 617 | 0 | 678 |
| Total | 47 | 38 | 63 | 1,175 | 98 | 1,421 |
Definition of the quality rating grades
| Quality rating | Definition |
|---|---|
| 5 | High quality complete genomes with completeness ≥95% and contamination ≤5% |
| 4 | High quality draft genomes with completeness ≥95% and contamination ≤5% |
| 3 | Low quality genomes with completeness ≥80% and contamination ≤10% |
| 2 | Disqualified genomes with completeness <80% or contamination >10% |
| 1 | Taxonomically mislabeled or misidentified genomes |
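The grade definitions above can be read as a small decision procedure. The sketch below is one possible encoding, not DFAST's actual code: the function name, parameter names, and the choice to let a taxonomic problem override all other criteria are assumptions made for illustration.

```python
def quality_rating(completeness, contamination, is_complete_genome,
                   taxonomy_problem=False):
    """Return a 1-5 grade following the definitions in the table above.

    completeness, contamination: percentages (0-100)
    is_complete_genome: True for a complete genome, False for a draft
    taxonomy_problem: True if the genome is mislabeled or misidentified
    """
    # Assumption: a taxonomic problem takes precedence over quality metrics.
    if taxonomy_problem:
        return 1
    # Disqualified: completeness <80% or contamination >10%.
    if completeness < 80.0 or contamination > 10.0:
        return 2
    # High quality: completeness >=95% and contamination <=5%.
    if completeness >= 95.0 and contamination <= 5.0:
        return 5 if is_complete_genome else 4
    # Everything else meets the >=80% / <=10% bounds only: low quality.
    return 3


# Example: a draft genome with 97% completeness and 1.2% contamination.
print(quality_rating(97.0, 1.2, is_complete_genome=False))
```

Checking the disqualifying condition before the high-quality one keeps each branch a direct transcription of one table row.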
Strains with problematic taxonomic positions
| Data source* | Organism name | Strain | Description |
|---|---|---|---|
| GCA_000159175.1 | | ATCC 27305# | Shows an ANI value of 97.3% against |
| ERR387492 | | JCM 12512T | Shows an ANI value of 98.7% against |
| GCA_001436985.1 | | DSM 20571T | Shows an ANI value of 99.9% against |
| GCA_001434215.1 | | DSM 10551T | Shows an ANI value of 99.9% against |
| SRR1561417 | | DSM 19927T | Shows an ANI value of 97.1% against |
| GCA_001437265.1 | | DSM 203321T | Shows an ANI value of 92.5% against |
# Non-type strain.
Mislabeled genomes deposited in DDBJ/ENA/GenBank
| Data source* | Organism name | Strain | Description |
|---|---|---|---|
| GCA_000159195.1 | | ATCC 11577 | Shows an ANI value of 99.1% against |
| GCA_001434555.1 | | DSM 20534T | Shows an ANI value of 100% against |
| GCA_001314245.1 | | HFD4 | Shows an ANI value of 96.7% against |
| GCA_001273585.1 | | SNU.Lp177 | Shows an ANI value of 98.9% against |
| GCA_001068345.1 | | 987_LJOH | Shows an ANI value of 93.4% against |
| GCA_001066235.1 | | 770_LJOH | Shows an ANI value of 100% against |
| GCA_001064985.1 | | 459_LHEL | Shows an ANI value of 96.8% against |
| GCA_001063065.1 | | 249_LKEF | Shows an ANI value of 100% against |
| GCA_001063045.1 | | 240_LCRI | Shows an ANI value of 100% against |
| GCA_000469115.1 | | AY01 | Shows an ANI value of 99.6% against |
| GCA_000463075.2 | | EGD-AQ4 | Shows an ANI value of 92.8% against |
| GCA_000191545.1 | | 30SC | Shows an ANI value of 100% against |
| GCA_000159195.1 | | ATCC 11577 | Shows an ANI value of 99.1% against |
* Those with GCA were derived from NCBI Assembly Database and those with DRR/SRR/ERR were derived from SRA.
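The tables above report ANI values both above and below the 95% species threshold cited in the abstract. As a rough illustration of how such a check could be expressed, the sketch below compares a genome's label with its best ANI hit among type-strain genomes. It is not the DFAST procedure; the function name, status strings, and the placeholder species names are all hypothetical.

```python
SPECIES_ANI_THRESHOLD = 95.0  # percent; the threshold cited in the abstract


def check_label(labeled_species, ani_to_type_strains):
    """Compare a genome's label with its best ANI hit among type strains.

    ani_to_type_strains: dict mapping species name -> ANI (%) of the query
    genome against that species' type-strain genome.
    Returns (status, best_species, best_ani).
    """
    best_species, best_ani = max(ani_to_type_strains.items(),
                                 key=lambda item: item[1])
    if best_ani < SPECIES_ANI_THRESHOLD:
        # No type strain reaches the species-level threshold.
        status = "no species-level match"
    elif best_species == labeled_species:
        status = "consistent"
    else:
        # Best hit passes the threshold but belongs to another species.
        status = "possibly mislabeled"
    return status, best_species, best_ani


# Hypothetical input: a genome labeled "species A" whose best hit is
# the type strain of "species B".
result = check_label("species A", {"species A": 88.4, "species B": 99.1})
print(result)
```

A result of "possibly mislabeled" would only be a prompt for manual review; the curated corrections described in the abstract are the authors' own.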