Akihiro Nakaya, Toshiaki Katayama, Masumi Itoh, Kazushi Hiranuka, Shuichi Kawashima, Yuki Moriya, Shujiro Okuda, Michihiro Tanaka, Toshiaki Tokimatsu, Yoshihiro Yamanishi, Akiyasu C Yoshizawa, Minoru Kanehisa, Susumu Goto.
Abstract
The identification of orthologous genes in an increasing number of fully sequenced genomes is a challenging issue in recent genome science. Here we present KEGG OC (http://www.genome.jp/tools/oc/), a novel database of ortholog clusters (OCs). The current version of KEGG OC contains 1 176 030 OCs, obtained by clustering 8 357 175 genes in 2112 complete genomes (153 eukaryotes, 1830 bacteria and 129 archaea). The OCs were constructed by applying the quasi-clique-based clustering method to all possible protein-coding genes in all complete genomes, based on their amino acid sequence similarities. The OCs are computationally efficient to calculate, which enables the contents to be updated regularly. KEGG OC has the following two features: (i) it covers all complete genomes of a wide variety of organisms from the three domains of life, and the number of organisms is the largest among the existing databases; and (ii) it is compatible with the KEGG database by sharing the same sets of genes and identifiers, which leads to seamless integration of OCs with useful components in KEGG such as biological pathways, pathway modules, functional hierarchy, diseases and drugs. The KEGG OC resources are accessible via OC Viewer, which provides an interactive visualization of OCs at different taxonomic levels.
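The abstract names a quasi-clique-based clustering method without giving its details. As a rough illustration only, the sketch below uses one common textbook definition (a gene set is a quasi-clique if the fraction of member pairs linked by a similarity edge reaches a threshold). The threshold `gamma`, the gene names and the edge list are hypothetical, and this is not claimed to be the procedure used to build KEGG OC.

```python
from itertools import combinations


def edge_density(members, similar_pairs):
    """Fraction of member pairs that are linked by a similarity edge."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 1.0  # a single gene is trivially dense
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in similar_pairs)
    return linked / len(pairs)


def is_quasi_clique(members, similar_pairs, gamma=0.8):
    """Density-based quasi-clique test; gamma is an assumed threshold."""
    return edge_density(members, similar_pairs) >= gamma


# Hypothetical genes and similarity edges (not taken from KEGG OC).
edges = {
    frozenset(pair)
    for pair in [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]
}
cluster = {"g1", "g2", "g3", "g4"}

density = edge_density(cluster, edges)  # 5 of the 6 possible pairs are linked
print(density, is_quasi_clique(cluster, edges))
```

Relaxing the all-pairs requirement of a strict clique in this way tolerates a few missing similarity edges between genes that otherwise group together.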
Year: 2012 PMID: 23193276 PMCID: PMC3531156 DOI: 10.1093/nar/gks1239
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Distribution of OCs in KEGG OC across the three domains: eukaryotes, bacteria and archaea. Each number indicates the number of OCs consisting of multiple genes, whereas the number in parentheses indicates the number of singletons (OCs consisting of a single gene).
Comparison of the numbers of fully sequenced organisms, eukaryotes, bacteria, archaea, genes and the last update date among COG, KOG, EGO/TOGA, MultiParanoid, OrthoMCL, OMAbrowser, MBGD, eggNOG, KEGG OC, InParanoid and Roundup
| Database/Method | Number of organisms | Number of eukaryotes | Number of bacteria | Number of archaea | Number of genes | Last update year |
|---|---|---|---|---|---|---|
| COG | 66 | 3 | 50 | 13 | 192 987 | 2003 |
| KOG | 7 | 7 | – | – | 60 759 | 2003 |
| EGO/TOGA | 88 | 88 | – | – | 618 733 | 2006 |
| MultiParanoid | 4 | 4 | – | – | 71 199 | 2011 |
| OrthoMCL | 150 | 98 | 36 | 16 | 1 398 546 | 2011 |
| OMAbrowser | 1211 | 124 | 994 | 93 | 5 701 696 | 2012 |
| MBGD | 1532 | 34 | 1382 | 116 | 5 415 388 | 2012 |
| eggNOG | 1133 | 121 | 943 | 69 | 5 214 234 | 2012 |
| KEGG OC | 2112 | 153 | 1830 | 129 | 8 357 175 | 2012 |
| InParanoid | 100 | 99 | 1 | – | 1 940 193 | 2009 |
| Roundup | 1786 | 226 | 1447 | 113 | 7 931 643 | 2011 |
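The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below transcribes the rows, treats each "–" as zero (an assumption made only for the check), confirms that the per-domain counts add up to the organism totals, and finds the database with the most organisms. The variable names are illustrative.

```python
# (organisms, eukaryotes, bacteria, archaea, genes); "–" entries are taken as 0.
table = {
    "COG": (66, 3, 50, 13, 192_987),
    "KOG": (7, 7, 0, 0, 60_759),
    "EGO/TOGA": (88, 88, 0, 0, 618_733),
    "MultiParanoid": (4, 4, 0, 0, 71_199),
    "OrthoMCL": (150, 98, 36, 16, 1_398_546),
    "OMAbrowser": (1211, 124, 994, 93, 5_701_696),
    "MBGD": (1532, 34, 1382, 116, 5_415_388),
    "eggNOG": (1133, 121, 943, 69, 5_214_234),
    "KEGG OC": (2112, 153, 1830, 129, 8_357_175),
    "InParanoid": (100, 99, 1, 0, 1_940_193),
    "Roundup": (1786, 226, 1447, 113, 7_931_643),
}

# Rows whose domain counts do not sum to the stated number of organisms.
inconsistent = [
    name
    for name, (organisms, euk, bac, arc, _genes) in table.items()
    if euk + bac + arc != organisms
]

# Database covering the most organisms.
largest = max(table, key=lambda name: table[name][0])

# Average number of genes per organism for each database.
genes_per_organism = {
    name: genes / organisms for name, (organisms, _e, _b, _a, genes) in table.items()
}

print(inconsistent, largest, round(genes_per_organism["KEGG OC"]))
```

Every row is internally consistent under this reading, and KEGG OC has the largest organism count in the table, in line with the claim made in the abstract.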
Figure 2. An example output page of OC Viewer for the query ‘eco:b0002’ (a KEGG GENES ID for a gene of E. coli K-12 MG1655). The PC column shows the PCs (eco.14, ecj.17, ecd.113, etc.). These PCs are aggregated into a TC named Escherichia_col.10890 at the higher taxonomic level indicated in the fifth column. As the aggregation of the TCs is iterated from the fifth column to the second column of the OC table, these PCs are merged into the top-level cluster OC.149602. Using the slider at the bottom left, one can focus on an arbitrary depth in the taxonomic tree shown at the bottom right.
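The aggregation described in the Figure 2 caption can be pictured as a child-to-parent map that is followed upward until the top-level cluster is reached. The sketch below is illustrative only: the cluster names come from the caption, but the intermediate taxonomic levels between the TC and the OC are not named there and are skipped, and the function names `lineage` and `split_gene_id` are hypothetical rather than part of any KEGG interface.

```python
# Child cluster -> parent cluster, using only the names given in the caption.
parent = {
    "eco.14": "Escherichia_col.10890",
    "ecj.17": "Escherichia_col.10890",
    "ecd.113": "Escherichia_col.10890",
    # Intermediate TCs between this level and the top are not named in the
    # caption, so the TC is linked directly to the OC in this sketch.
    "Escherichia_col.10890": "OC.149602",
}


def lineage(cluster_id, parent_map):
    """Follow parent links from a cluster up to its top-level cluster."""
    path = [cluster_id]
    while path[-1] in parent_map:
        path.append(parent_map[path[-1]])
    return path


def split_gene_id(gene_id):
    """Split an identifier such as 'eco:b0002' into organism code and gene."""
    organism, _, gene = gene_id.partition(":")
    return organism, gene


print(lineage("eco.14", parent))
print(split_gene_id("eco:b0002"))
```

Choosing a depth with the viewer's slider corresponds, in this picture, to stopping the upward walk at a chosen level of the path instead of continuing to the top.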