Ikuo Uchiyama, Motohiro Mihara, Hiroyo Nishide, Hirokazu Chiba, Masaki Kato.
Abstract
The Microbial Genome Database for Comparative Analysis (MBGD) is a database for comparative genomics based on comprehensive orthology analysis of bacteria, archaea and unicellular eukaryotes. MBGD now contains 6318 genomes. To support comparison of both closely related and distantly related genomes, MBGD previously provided two types of ortholog tables: the standard ortholog table, which contains one representative genome from each genus and covers the entire taxonomic range, and the taxon-specific ortholog tables for each taxon. However, this approach has a drawback in that the standard ortholog table contains only genes that are conserved in the representative genomes. To address this problem, we developed a stepwise procedure to construct ortholog tables hierarchically in a bottom-up manner. By using this approach, the new standard ortholog table now covers the entire gene repertoire stored in MBGD. In addition, we have enhanced several functionalities, including rapid and flexible keyword searching, profile-based sequence searching for orthology assignment to a user query sequence, and displaying a phylogenetic tree of each taxon based on the concatenated core gene sequences. For integrative database searching, the core data in MBGD are represented in Resource Description Framework (RDF) and a SPARQL interface is provided to search them. MBGD is available at http://mbgd.genome.ad.jp/.
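The bottom-up idea described in the abstract can be illustrated with a toy sketch. This is not the MBGD implementation: the taxonomy, the gene names, and the `family` label used in place of real sequence-similarity clustering are all hypothetical, and the functions `cluster`, `build` and `expand` are names chosen only for this example. It shows how clustering each taxon, passing one entry per cluster upward as a pan-genome, and then expanding top-level clusters downward can recover every gene, not only those in representative genomes.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: internal taxa list their children; leaves are genomes.
TAXONOMY = {
    "root": ["genusA", "genusB"],
    "genusA": ["genome1", "genome2"],
    "genusB": ["genome3"],
}
# Hypothetical genes as (gene_id, family); "family" stands in for sequence similarity.
GENOMES = {
    "genome1": [("g1_a", "famX"), ("g1_b", "famY")],
    "genome2": [("g2_a", "famX")],
    "genome3": [("g3_a", "famY"), ("g3_b", "famZ")],
}


def cluster(items):
    """Group (id, family) pairs by family; a placeholder for real orthology clustering."""
    groups = defaultdict(list)
    for item_id, family in items:
        groups[family].append(item_id)
    return groups


def build(taxon, tables):
    """Fill `tables` bottom-up and return the pan-genome of `taxon` as (id, family) pairs."""
    if taxon in GENOMES:
        return list(GENOMES[taxon])
    members = []
    for child in TAXONOMY[taxon]:
        members.extend(build(child, tables))
    tables[taxon] = {}
    pan_genome = []
    for idx, (family, ids) in enumerate(sorted(cluster(members).items())):
        clust_id = f"{taxon}:{idx}"  # mimics the 'taxid:clustid' naming
        tables[taxon][clust_id] = ids
        pan_genome.append((clust_id, family))
    return pan_genome


def expand(clust_id, tables):
    """Top-down expansion: replace pan-genome entries with their member genes."""
    taxon = clust_id.split(":")[0]
    genes = []
    for member in tables[taxon][clust_id]:
        if ":" in member:
            genes.extend(expand(member, tables))
        else:
            genes.append(member)
    return genes


tables = {}
build("root", tables)
for clust_id in tables["root"]:
    print(clust_id, expand(clust_id, tables))
```

In this sketch the top-level table holds only one entry per lower-level cluster, yet expansion reaches all five toy genes, which is the property the abstract attributes to the new standard ortholog table.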
Year: 2019 PMID: 30462302 PMCID: PMC6324027 DOI: 10.1093/nar/gky1054
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The bottom-up procedure for constructing hierarchical orthology relationships. (A) Overview of the procedure. The procedure progresses from bottom to top. (B) Hierarchical ortholog groups. Here, the construction process goes from right to left and the expansion process goes from left to right. A representative gene in each cluster is indicated in red, and the target clusters to be expanded are underlined. A gene in a pan-genome is represented as ‘taxid:clustid’, which denotes the representative gene of that cluster. The number in parentheses is the domain number, and the two numbers after each gene name are the beginning and end positions of the domain. (C) Domain boundary mapping between clusters at different levels. The example is the same as in B. The red segment corresponds to the domain tax44249:7443(2) in the standard cluster 98932. Positions that are missing from this mapping are filled in by simple linear interpolation, shown by the numbers in parentheses.
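The linear interpolation mentioned in panel C can be sketched as follows. This is a minimal illustration under stated assumptions, not the MBGD code: it assumes the mapping is available as sorted `(source, target)` position pairs (called `anchors` here, a hypothetical name) and that a position with no direct counterpart is placed proportionally between its two flanking anchors. The function name `map_position` and the example coordinates are likewise invented for this sketch.

```python
from bisect import bisect_left


def map_position(pos, anchors):
    """Map `pos` through sorted (source, target) pairs, interpolating linearly in gaps."""
    sources = [s for s, _ in anchors]
    i = bisect_left(sources, pos)
    # Exact hit: the position is directly mapped.
    if i < len(anchors) and sources[i] == pos:
        return anchors[i][1]
    # No flanking anchor on one side: interpolation is not defined.
    if i == 0 or i == len(anchors):
        raise ValueError("position lies outside the mapped range")
    (s0, t0), (s1, t1) = anchors[i - 1], anchors[i]
    return round(t0 + (pos - s0) * (t1 - t0) / (s1 - s0))


# Hypothetical anchors between a lower-level and a higher-level representative.
anchors = [(10, 15), (20, 25), (40, 65)]
print(map_position(20, anchors))  # directly mapped position
print(map_position(30, anchors))  # interpolated between (20, 25) and (40, 65)
```

Raising an error outside the mapped range is a design choice of this sketch; a real pipeline might instead clamp to the nearest anchor.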
Comparison of data sizes between the current and the previous approaches
| | Number of sequences (a) | Number of clusters (b) |
|---|---|---|
| Previous method (representative-genome-based) | 3 735 085 | 491 920 |
| New method (pan-genome-based) | 4 640 598 | 768 073 |
| Total sequences | 22 521 946 | |

(a) The number of sequences used for creating the standard ortholog table.
(b) The number of clusters in the standard ortholog table.
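To put the table in proportion, the following short calculation derives fold changes and fractions from the figures above. It assumes that the "Total sequences" row is the total against which both methods are compared; the variable names are chosen only for this example.

```python
# Figures copied from the table above.
previous = {"sequences": 3_735_085, "clusters": 491_920}
new = {"sequences": 4_640_598, "clusters": 768_073}
total_sequences = 22_521_946

# Fold change of the new method relative to the previous one.
seq_ratio = new["sequences"] / previous["sequences"]
cluster_ratio = new["clusters"] / previous["clusters"]

# Fraction of all sequences used directly to build the standard table.
prev_fraction = previous["sequences"] / total_sequences
new_fraction = new["sequences"] / total_sequences

print(f"sequences: {seq_ratio:.2f}x, clusters: {cluster_ratio:.2f}x")
print(f"fraction of total used: {prev_fraction:.1%} -> {new_fraction:.1%}")
```

The number of clusters grows faster than the number of input sequences (roughly 1.56-fold versus 1.24-fold), which is consistent with the new table picking up genes that were absent from the representative genomes.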
Figure 2. Overall procedure for constructing MBGD. This figure is an update of the previous version (3).
Figure 3. Screenshots of the new functionalities in MBGD. (A) An example of a hierarchical ortholog group. Shown is the ortholog group containing Shiga-like toxin subunit A. (B) A phylogenetic tree shown in the ortholog table summary viewer. Shown is a part of the phylogenetic tree created from the conserved orthologs of the family Bacillaceae. (C) The output of the profile search using MMseqs2.
Figure 4. Interfaces for searching and browsing MBGD. Interfaces are shown in the light pink boxes.