
From genomics to chemical genomics: new developments in KEGG.

Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki Katayama, Michihiro Araki, Mika Hirakawa.

Abstract

The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.


Year: 2006 PMID: 16381885 PMCID: PMC1347464 DOI: 10.1093/nar/gkj102

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

While traditional genomics and other types of omics approaches have contributed to our knowledge of the genomic space of possible genes and proteins that make up the biological system, the new chemical genomics initiatives will give us a glimpse of the chemical space of possible chemical substances that exist as an interface between the biological world and the natural world. The KEGG database project was initiated in 1995, the last year of the first 5-year phase of the Japanese Human Genome Programme (1). After 10 years of development in parallel with the growing number of completely sequenced genomes and increased activities in post-genomic research, the KEGG project has entered a new phase in accordance with the chemical genomics initiatives. KEGG is a database resource for understanding higher-order functions and utilities of the biological system, such as the cell or the organism, from genomic and molecular information. In fact, we consider KEGG as a computer representation of the biological system, consisting of building blocks and wiring diagrams, which can be used for modeling and simulation as well as for browsing and retrieval (2). Originally, the wiring diagrams involved endogenous molecules, both those that are directly encoded in the genome (proteins and RNAs) and those that are indirectly encoded through biosynthetic/biodegradation pathways (metabolites, glycans and so on). Now we are extending these wiring diagrams to include exogenous molecules. This will help us understand interactions between the biological system and the natural environment, and would eventually lead to representation and reconstruction of another higher-level biological system, the biological world. Here we report new developments in KEGG in this direction.

THE KEGG RESOURCE

Overview

KEGG consists of four main databases. As illustrated in Figure 1, they are categorized as building blocks in the genomic space (GENES database) and the chemical space (LIGAND database), wiring diagrams in the network space (PATHWAY database) and ontologies for pathway reconstruction (BRITE database). BRITE had been a separate database for many years, but it was formally included in KEGG in release 34.0 (April 2005) to establish a logical foundation for the KEGG project. The URLs for accessing KEGG are summarized in Table 1.

Figure 1

The overall architecture of KEGG now consisting of four main components. KEGG BRITE has been formally added to establish a logical foundation for inference of higher-order functions.

Table 1

URLs for the KEGG resource

Database/content	URL
KEGG home page
KEGG table of contents
KEGG PATHWAY
KEGG GENES
KEGG LIGAND
KEGG BRITE
KGML
KEGG API
KEGG DRUG
KEGG GLYCAN
KEGG REACTION
KEGG EXPRESSION
KEGG ANNOTATION
KegArray/KegDraw
DBGET
BLAST/FASTA
GenomeNet FTP
GenomeNet home page

The current GenomeNet address ‘’ is recommended, but the previous address ‘’ will still be made available.

Biological systems are represented in KEGG by two types of graphs, called nested graphs and line graphs in theoretical computer science. The nested graph is a graph whose nodes can themselves be graphs. It is used for representing KEGG network hierarchy and for pathway reconstruction and functional inference. The line graph is a graph derived by interchanging nodes and edges of another graph. It represents the inherent complementarity of the metabolic pathway, which can be viewed either as a network of genes (enzymes) or as a network of compounds, meaning that one can be generated from the other by the line graph transformation. Thus, the line graph is the basis for integrated analysis of genomic and chemical information.
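The line graph transformation can be illustrated with a minimal sketch. It starts from a toy compound network whose edges are labelled with enzymes and derives the corresponding enzyme network. The compound and enzyme names are invented for illustration and are not KEGG identifiers, and the code follows the standard graph-theoretic definition (two edges become adjacent nodes when they share an endpoint) rather than any KEGG implementation.

```python
from itertools import combinations


def line_graph(edges):
    """Return the line graph of an undirected graph.

    `edges` maps an edge label (here, an enzyme) to the pair of nodes
    (here, compounds) that the edge connects. In the result each edge
    label becomes a node, and two labels are adjacent when their
    original edges share at least one endpoint.
    """
    adjacency = {label: set() for label in edges}
    for (a, ends_a), (b, ends_b) in combinations(edges.items(), 2):
        if set(ends_a) & set(ends_b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Toy compound network (hypothetical names): C1 -E1- C2 -E2- C3 -E3- C4
compound_network = {
    "E1": ("C1", "C2"),
    "E2": ("C2", "C3"),
    "E3": ("C3", "C4"),
}

enzyme_network = line_graph(compound_network)
for enzyme, neighbours in sorted(enzyme_network.items()):
    print(enzyme, "->", sorted(neighbours))
```

In this toy case the chain of four compounds becomes a chain of three enzymes, which is the sense in which one view of a metabolic pathway can be generated from the other.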

BRITE database

KEGG BRITE is a collection of hierarchies and binary relations with two inter-related objectives corresponding to the two types of graphs: to automate functional interpretations associated with the KEGG pathway reconstruction and to assist discovery of empirical rules involving genome-environment interactions. Currently, we focus on hierarchical structuring of our knowledge on functional aspects of the genomic and chemical spaces (Table 2), including the KEGG orthology (KO) system for ortholog/paralog gene groups, the reaction classification (RC) system for biochemical reactions, and other classifications for compounds and drugs tentatively called chemical ontology as shown in Figure 1. We plan to extend the KO system to include the definition of functional modules in the KEGG pathways and to develop ontologies for computational inference of higher-order functions.

Table 2

Functional hierarchies in KEGG BRITE

Network hierarchy

Protein families

Enzymes

Transcription factors

Ribosome

Translation factors

ABC transporters

G-protein-coupled receptors

Ion channels

Cytokines

Cytokine receptors

Cell adhesion molecules (CAMs)

CAM ligands

CD molecules

Bacterial motility proteins

Compounds

Compounds with biological roles

Lipids

Phytochemical compounds

Compound interactions

Ion channel agonists/antagonists

Cytochrome P450 substrates

Drugs

Therapeutic category of drugs

Drug classification

Diseases

Disease genes, genomes and pathways

Organisms

KEGG organisms

As of September 12, 2005.

PATHWAY database

The KEGG PATHWAY database is a collection of manually drawn pathway maps for metabolism, genetic information processing, environmental information processing such as signal transduction, various other cellular processes and human diseases. During the past 2 years we have significantly increased the number of pathway maps for regulatory pathways including signal transduction, ligand–receptor interaction and cell communication, all based on extensive survey of published literature. For metabolic pathways we created two new sections, ‘Glycan Biosynthesis and Metabolism’ and ‘Biosynthesis of Polyketides and Nonribosomal Peptides’. The XML version of the pathway maps is available for both metabolic and regulatory pathways. These KEGG Markup Language (KGML) files provide graph information that can be used to computationally reproduce and manipulate KEGG pathway maps.
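To make the idea of reading graph information from KGML concrete, the sketch below parses a small hand-written XML fragment with Python's standard library. The fragment is a heavily simplified stand-in, not a real KEGG file, and the element and attribute names used (`pathway`, `entry`, `relation`, `id`, `name`, `type`, `entry1`, `entry2`) are assumptions about the KGML layout that should be checked against the KGML documentation before use.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified KGML-like fragment (an assumption, not real data).
SAMPLE = """
<pathway name="path:map00000" title="Toy pathway">
  <entry id="1" name="ko:K00001" type="ortholog"/>
  <entry id="2" name="ko:K00002" type="ortholog"/>
  <entry id="3" name="cpd:C00001" type="compound"/>
  <relation entry1="1" entry2="2" type="ECrel"/>
</pathway>
"""


def read_pathway_graph(xml_text):
    """Extract nodes and edges from a KGML-like document.

    Returns a dict of node id -> (type, name) and a list of
    (source id, target id, relation type) tuples.
    """
    root = ET.fromstring(xml_text)
    nodes = {
        entry.get("id"): (entry.get("type"), entry.get("name"))
        for entry in root.iter("entry")
    }
    edges = [
        (rel.get("entry1"), rel.get("entry2"), rel.get("type"))
        for rel in root.iter("relation")
    ]
    return nodes, edges


nodes, edges = read_pathway_graph(SAMPLE)
print(len(nodes), "nodes;", len(edges), "edges")
```

Once nodes and edges are in plain Python structures, they can be handed to any graph library for layout or analysis.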

GENES database

The KEGG GENES database is a collection of gene catalogs for all complete genomes and some partial genomes (31 eukaryotes, 235 bacteria and 23 archaea as of September 12, 2005), generated from publicly available resources, mostly NCBI RefSeq (3). All genomes in KEGG GENES are subject to SSDB computation and given manual KO assignments as described below. There are auxiliary collections of gene catalogs: DGENES for draft genomes (21 eukaryotes) and EGENES for expressed sequence tag consensus contigs (25 plants). These are meant to supplement the repertoire of KEGG organisms, and all are given automatic KO assignments using GENES as a reference dataset. Each GENES entry contains cross-reference information to outside databases, including NCBI gi numbers, Entrez Gene IDs and UniProt accession numbers. Starting with KEGG release 37.0 (January 2006), automatic ID conversion is implemented, enabling use of such outside identifiers to access KEGG GENES and then the other KEGG databases.

KEGG orthology

There are over one million genes in KEGG GENES, representing a tiny, but well-characterized part of the genomic space that makes up the biological world. From this part we organize knowledge about orthologous genes and paralogous genes, which, we hope, can be generalized for understanding the entire genomic space. This knowledge is stored in the KO system, a pathway-based classification of orthologous genes, including orthologous relationships of paralogous gene groups. The KO identifier, or the K number, is a common identifier for linking genomic information in the GENES database with network information in the PATHWAY database. The pathway nodes represented by rectangles in the KEGG reference pathway maps are given KO identifiers, so that organism-specific pathways can be computationally generated once each genome is annotated with KOs. This annotation, or KO assignment, is done manually for KEGG GENES with the help of the GFIT tool using best-hit relations in pairwise genome comparisons stored in the SSDB database (4). Because the number of ortholog groups that can be linked to pathways is limited, we have introduced two additional ways to define KOs. One is to use COG (5) to cover a broad range of possible ortholog groups. The other is to rely on experts' classifications of protein families, which tend to be more functionally oriented, resulting in narrowly defined KOs. A growing number of protein families are being added to the KO system, and they are shown in separate hierarchies different from the KEGG network hierarchy. The KO system can be best viewed from the KEGG BRITE database (Table 2).
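The way K numbers link a reference pathway to an annotated genome can be sketched as a simple lookup. The example below is only an illustration under stated assumptions: the K numbers and gene identifiers are hypothetical placeholders, and the function is not the procedure used by KEGG itself.

```python
def organism_pathway(reference_nodes, gene_to_ko):
    """Project a genome annotation onto a reference pathway.

    `reference_nodes` lists the K numbers attached to the pathway nodes;
    `gene_to_ko` maps each annotated gene to its K number. The result
    maps every reference node to the organism's genes carrying that
    K number (an empty list when the organism has none).
    """
    ko_to_genes = {}
    for gene, ko in gene_to_ko.items():
        ko_to_genes.setdefault(ko, []).append(gene)
    return {ko: sorted(ko_to_genes.get(ko, [])) for ko in reference_nodes}


# Hypothetical reference pathway and genome annotation.
reference_nodes = ["K90001", "K90002", "K90003"]
annotation = {
    "org:g1": "K90001",
    "org:g2": "K90001",  # paralog assigned to the same K number
    "org:g3": "K90003",
    "org:g4": "K90009",  # K number not used by this pathway
}

pathway = organism_pathway(reference_nodes, annotation)
present = [ko for ko, genes in pathway.items() if genes]
print(pathway)
print("nodes present in this organism:", present)
```

Nodes with an empty gene list correspond to steps of the reference pathway for which the organism has no annotated gene.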

LIGAND database

Originally, the LIGAND database consisted of just two components: ENZYME for enzyme nomenclature and COMPOUND for chemical compound structures (6). It later successively included additional components: REACTION for chemical reaction formulas, GLYCAN for glycan structures, RPAIR for reactant pair transformation patterns and DRUG for drug information. This expansion of the LIGAND collection represents our expanded efforts for understanding the chemical space that is part of the biological world. The KEGG DRUG database is a new addition from KEGG release 36.1 (December 2005). It contains chemical structures and additional information such as therapeutic categories and target molecules. A unique feature of KEGG DRUG is a collection of drug structure maps, which graphically illustrate, in a manner similar to KEGG pathway maps, our knowledge of groups of chemical structural patterns, therapeutic categories, their relationships and the chronology of drug development if known.

Reaction classification

The RC system in the chemical space is a counterpart of the KO system in the genomic space (Figure 1). It represents our attempt to organize knowledge on chemical reactions by categorizing chemical structure transformation patterns. The REACTION database contains individual reaction formulas taken from the ENZYME database. Each reaction formula is split into a set of substrate-product pairs, and the chemical structure comparison program SIMCOMP is applied to obtain an optimal alignment. This comparison is based on atom typing, which is the conversion of regular atomic (C, N, O, S, P and so on) representation to what we call KCF representation that consists of 68 atom types distinguishing functional groups and atomic environments (7). The chemical structure alignment generated by SIMCOMP is used to define the R atom for the reaction center, the D atom(s) for adjacent atom(s) in the mismatched region and the M atom(s) for adjacent atom(s) in the matched region (8). This is first done computationally and is followed by extensive manual curation. The RPAIR database is still under development, but it is the basis for the RC system categorizing curated RDM patterns. Since an enzymatic reaction usually involves multiple substrates and products, one EC number corresponds to a combination of RDM patterns. The RC system has enabled automatic assignment of EC numbers from a set of substrate and product structures (8) and will further enable exploration of unknown reactions by generating plausible combinations of RDM patterns, which may then be related to possible paralogs of enzyme genes.
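Two of the steps described above, splitting a reaction formula into substrate-product pairs and relating a combination of RDM patterns to a class, can be sketched as follows. This is a rough illustration under several assumptions: the equation syntax (`+` and `<=>` separated by spaces) is assumed, all compound names, pattern labels and class labels are placeholders, and the structure alignment by SIMCOMP and the manual curation that select meaningful pairs are not reproduced.

```python
from itertools import product


def reactant_pairs(equation):
    """Enumerate candidate substrate-product pairs of 'A + B <=> C + D'.

    Every substrate is paired with every product; in practice only
    some of these candidates would survive alignment and curation.
    """
    left, right = equation.split("<=>")
    substrates = left.strip().split(" + ")
    products = right.strip().split(" + ")
    return list(product(substrates, products))


def classify(rdm_patterns, class_table):
    """Look up the class assigned to an unordered combination of patterns."""
    return class_table.get(frozenset(rdm_patterns))


pairs = reactant_pairs("S1 + S2 <=> P1 + P2")
print(pairs)

# Hypothetical table: one class per combination of pattern labels.
class_table = {
    frozenset({"pattern_a", "pattern_b"}): "class_X",
    frozenset({"pattern_a"}): "class_Y",
}
print(classify(["pattern_b", "pattern_a"], class_table))
print(classify(["pattern_c"], class_table))
```

Using an unordered set as the lookup key reflects the statement that a class corresponds to a combination of patterns; an unknown combination simply returns `None`, which is where exploration of plausible new combinations would begin.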

Glycosyltransferase reactions

Functional glycomics has been a most successful area for integrated analysis of genomic and chemical information (9). The carbohydrate sequence of glycans is determined by a specific set of biosynthetic reactions catalyzed by different types of glycosyltransferases. Thus, once we know the repertoire of glycosyltransferases in the genome or in the transcriptome, it should in principle be possible to predict the repertoire of glycan structures. Conversely, the knowledge about glycan structures can be used to search and annotate new glycosyltransferases. Composite Structure Map in KEGG GLYCAN is a tool for converting genomic or transcriptomic data to glycan structure variations based on a curated set of known glycosyltransferase reactions.
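The principle that a repertoire of glycosyltransferases determines a repertoire of glycan structures can be sketched as a reachability search. The example below is a simplified illustration, not the algorithm behind Composite Structure Map: the enzyme names, structure names and reaction list are hypothetical, and each reaction is reduced to a triple of enzyme, acceptor structure and product structure.

```python
def reachable_glycans(start, reactions, expressed):
    """Return all structures reachable from `start`.

    `reactions` is a list of (enzyme, acceptor, product) triples and
    `expressed` is the set of enzymes present in the genome or
    transcriptome. A reaction is applied only when its enzyme is
    expressed and its acceptor has already been reached.
    """
    found = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for enzyme, acceptor, product in reactions:
            if enzyme in expressed and acceptor == current and product not in found:
                found.add(product)
                frontier.append(product)
    return found


# Hypothetical curated reactions: enzyme, acceptor structure, product structure.
reactions = [
    ("GT1", "G0", "G1"),
    ("GT2", "G1", "G2"),
    ("GT3", "G1", "G3"),
    ("GT4", "G3", "G4"),
]

predicted = reachable_glycans("G0", reactions, expressed={"GT1", "GT3"})
print(sorted(predicted))
```

With only GT1 and GT3 expressed, the search stops at G3; adding GT4 to the expressed set would extend the predicted repertoire to G4.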

ACCESSING KEGG

Web and FTP

KEGG is the major component of the Japanese GenomeNet, which is served by the Kyoto University Bioinformatics Center. The other GenomeNet services including DBGET and BLAST/FASTA searches are now primarily developed and used to support KEGG. The official URL for GenomeNet has been modified to , but the former URL will still be made available (Table 1). To download the KEGG data, academic users may use the GenomeNet FTP site.

KEGG API

The KEGG API service has become an increasingly popular mode of access. It is the SOAP/WSDL interface to KEGG, enabling users to write their own programs to access, customize and utilize KEGG.

KegArray and KegDraw

KegArray and KegDraw are standalone Java applications that make use of the KEGG resources. KegArray is for microarray data analysis in conjunction with KEGG pathways and genomes. KegDraw is for drawing glycan structures and chemical compound structures, which can then be used to query against KEGG and PubChem databases. Both are freely available to academic and non-academic users.

REFERENCES

1. A database for post-genome analysis.

Authors: M Kanehisa
Journal: Trends Genet Date: 1997-09

2. The KEGG resource for deciphering the genome.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Yasushi Okuno; Masahiro Hattori
Journal: Nucleic Acids Res Date: 2004-01-01

3. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01

4. The KEGG databases at GenomeNet.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Akihiro Nakaya
Journal: Nucleic Acids Res Date: 2002-01-01

5. The COG database: new developments in phylogenetic classification of proteins from complete genomes.

Authors: R L Tatusov; D A Natale; I V Garkavtsev; T A Tatusova; U T Shankavaram; B S Rao; B Kiryutin; M Y Galperin; N D Fedorova; E V Koonin
Journal: Nucleic Acids Res Date: 2001-01-01

6. LIGAND: chemical database for enzyme reactions.

Authors: S Goto; T Nishioka; M Kanehisa
Journal: Bioinformatics Date: 1998

7. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways.

Authors: Masahiro Hattori; Yasushi Okuno; Susumu Goto; Minoru Kanehisa
Journal: J Am Chem Soc Date: 2003-10-01

8. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions.

Authors: Masaaki Kotera; Yasushi Okuno; Masahiro Hattori; Susumu Goto; Minoru Kanehisa
Journal: J Am Chem Soc Date: 2004-12-22

9. KEGG as a glycome informatics resource.

Authors: Kosuke Hashimoto; Susumu Goto; Shin Kawano; Kiyoko F Aoki-Kinoshita; Nobuhisa Ueda; Masami Hamajima; Toshisuke Kawasaki; Minoru Kanehisa
Journal: Glycobiology Date: 2005-07-13
