Tomer Altman, Michael Travers, Anamika Kothari, Ron Caspi, Peter D Karp.
Abstract
BACKGROUND: The MetaCyc and KEGG projects have developed large metabolic pathway databases that are used for a variety of applications including genome analysis and metabolic engineering. We present a comparison of the compound, reaction, and pathway content of MetaCyc version 16.0 and a KEGG version downloaded on Feb-27-2012 to increase understanding of their relative sizes, their degree of overlap, and their scope. To assess their overlap, we must know the correspondences between compounds, reactions, and pathways in MetaCyc, and those in KEGG. We devoted significant effort to computational and manual matching of these entities, and we evaluated the accuracy of the correspondences.
Year: 2013 PMID: 23530693 PMCID: PMC3665663 DOI: 10.1186/1471-2105-14-112
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of chemical compounds in MetaCyc and KEGG
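| Compound type | M(all) | M(base) | M(super) | K(all) | K(module) | K(map) | Common |
|---|---|---|---|---|---|---|---|
| All chemical compounds | 11991 | | | 15161 | | | 5120 (0.23) |
| All reaction substrates | 8891 | | | 6912 | | | 4232 (0.37) |
| Pathway reaction substrates | 5523 | 5371 | 5523 | 4759 | 828 | 4759 | 2384 (0.30) |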
For each type of compound (row), we report the number of compounds in MetaCyc, the number of compounds in KEGG, and the number of compounds in common between MetaCyc and KEGG. “All chemical compounds” includes both compound classes and compound instances for MetaCyc; for KEGG it includes all compounds in the KEGG COMPOUND file. “All reaction substrates” is the union of all literal reaction substrates (reactants plus products) in the specified DB. M(all): all MetaCyc compounds; M(base): compounds in MetaCyc base pathways; M(super): compounds in MetaCyc super pathways; K(all): all KEGG compounds; K(module): compounds in KEGG module; K(map): compounds in KEGG map; Common: corresponding compounds by total number and by the Jaccard coefficient in parentheses.
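The Jaccard coefficient in the Common column can be reproduced from the three counts in each row, since the size of the union is the sum of the two set sizes minus the overlap. The following is a minimal sketch; the function name `jaccard_from_counts` is illustrative and not taken from the paper.

```python
def jaccard_from_counts(size_a: int, size_b: int, common: int) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| computed from set sizes only."""
    union = size_a + size_b - common  # inclusion-exclusion for two sets
    if union <= 0:
        raise ValueError("the union of the two sets must be non-empty")
    return common / union


# Counts from the "All chemical compounds" row: MetaCyc, KEGG, common.
print(round(jaccard_from_counts(11991, 15161, 5120), 2))  # 0.23
```

The same calculation applied to the "All reaction substrates" row (8891, 6912, 4232) gives 0.37, matching the table.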
Comparison of compound data content in MetaCyc and KEGG
| Attribute | MetaCyc | KEGG |
|---|---|---|
| Compounds | 11991 | 15161 |
| Compounds with structures | 10546 | 14621 |
| Compounds with comments | 1486 | 2997 |
| Mean comment length | 47.69 | 6.51 |
| Mean names per compound | 2.37 | 1.62 |
| Mean DB links per compound | 1.71 | 3.71 |
| Mean associated reactions | 3.59 | 2.17 |
| Mean associated pathways (all) per compound | 1.78 | 0.67 |
| Duplicate compounds | 36 | 251 |
Compound entries in either DB may not have information on their chemical structures, and may not have comments describing the properties of the compound. Associated pathways of a compound include base pathways and superpathways in MetaCyc and KEGG maps and modules. Compounds were considered duplicates if they had identical standard InChI strings.
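The duplicate criterion described here, identical standard InChI strings, amounts to grouping compounds by their InChI value. Below is a minimal sketch under stated assumptions: the compound identifiers are hypothetical, the input is a plain mapping from identifier to InChI string, and entries without a structure are skipped. Whether the paper counts duplicate groups or individual duplicate entries is not stated, so the sketch simply returns the groups.

```python
from collections import defaultdict


def find_duplicates(compounds: dict[str, str]) -> list[list[str]]:
    """Group compound IDs that share an identical standard InChI string."""
    by_inchi: dict[str, list[str]] = defaultdict(list)
    for compound_id, inchi in compounds.items():
        if inchi:  # compounds without a structure cannot be compared
            by_inchi[inchi].append(compound_id)
    # Keep only InChI strings that occur for more than one compound.
    return [sorted(ids) for ids in by_inchi.values() if len(ids) > 1]


# Hypothetical identifiers; the InChI strings are those of water and methane.
example = {
    "CPD-A": "InChI=1S/H2O/h1H2",
    "CPD-B": "InChI=1S/H2O/h1H2",
    "CPD-C": "InChI=1S/CH4/h1H4",
    "CPD-D": "",  # no structure available
}
print(find_duplicates(example))  # [['CPD-A', 'CPD-B']]
```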
A comparison of MetaCyc and KEGG compound attributes, for those attributes where one hundred or more objects have a value for that attribute
| MetaCyc attribute | Objects | KEGG attribute | Objects |
|---|---|---|---|
| Monoisotopic-MW | 9475 | Exact_Mass | 14611 |
| Molecular-Weight | 9431 | Mol_Weight | 14611 |
| Creation-date | 11705 | | |
| Creator | 10573 | | |
| SMILES | 10546 | | |
| InChI | 9222 | | |
| Regulates | 3573 | | |
| Credits | 2895 | | |
| Gibbs-0 | 1033 | | |
| Cofactors-Of | 563 | | |
The table presents shared attributes (note that the name for the same conceptual attribute may differ between MetaCyc and KEGG), and attributes unique to MetaCyc; the two attribute sets are further sorted based on the number of objects containing non-null values for each attribute. Gibbs-0 is the Gibbs free energy of formation of a compound. Creator, Creation-Date, and Credits provide data provenance. KEGG compound attribute data are derived from the KEGG COMPOUND dataset.
Comparison of biochemical reactions in MetaCyc and KEGG
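| Reaction type | M(all) | M(base) | M(super) | K(all) | K(module) | K(map) | Common |
|---|---|---|---|---|---|---|---|
| All reactions | 10262 | | | 8692 | | | 3895 (0.26) |
| Pathway reactions | 6348 | 6155 | 6348 | 6174 | 878 | 6173 | 1961 (0.19) |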
MetaCyc reactions are pathway reactions if they are part of one or more base pathways or superpathways. KEGG reactions are pathway reactions if they are part of one or more modules or maps. Columns are the same as defined for Table 1.
Comparison of reaction data content in MetaCyc and KEGG
| Attribute | MetaCyc | KEGG |
|---|---|---|
| Reaction instances | 10262 | 8879 |
| Duplicate reactions | 279 | 341 |
| Reactions with comments | 3206 | 3022 |
| Unbalanced reactions (not counting hydrogen) | 474 | 872 |
| Unbalanced reactions (counting hydrogen) | 532 | 1475 |
| Mean associated pathways | 0.84 | 0.90 |
We report the number of reactions, the number of duplicate reactions, the number of reactions with comments, the number of unbalanced reactions disregarding hydrogen imbalance, the number of unbalanced reactions including hydrogen imbalance, and the average number of associated pathways. Associated pathways of a reaction include base pathways and superpathways in MetaCyc and KEGG modules and maps.
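The two "unbalanced" counts differ only in whether hydrogen atoms are included when comparing the element totals on each side of a reaction. The following sketch illustrates that distinction. It is not the authors' procedure: it assumes simple molecular formulas without parentheses or charges, and the names `atom_counts` and `is_balanced` are illustrative.

```python
import re
from collections import Counter


def atom_counts(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(reactants, products, ignore_hydrogen=False) -> bool:
    """Compare element totals; each side is a list of (formula, coefficient)."""
    def side_total(side) -> Counter:
        total: Counter = Counter()
        for formula, coefficient in side:
            for element, n in atom_counts(formula).items():
                total[element] += coefficient * n
        return total

    left, right = side_total(reactants), side_total(products)
    if ignore_hydrogen:
        left.pop("H", None)
        right.pop("H", None)
    return left == right


# Glucose to two lactate anions, written without the released protons:
# unbalanced when hydrogen is counted, balanced when it is disregarded.
glucose = [("C6H12O6", 1)]
lactate_anions = [("C3H5O3", 2)]
print(is_balanced(glucose, lactate_anions))                        # False
print(is_balanced(glucose, lactate_anions, ignore_hydrogen=True))  # True
```

Reactions of this kind, where only protons are missing, would fall into the difference between the two rows of the table.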
A comparison of MetaCyc (M) and KEGG (K) reaction attributes, for those attributes where one hundred or more objects have a value for that attribute
| MetaCyc attribute | Objects | KEGG attribute | Objects |
|---|---|---|---|
| Physiologically-Relevant? | 10262 | | |
| Creation-Date | 10247 | | |
| | | Rpair | 8292 |
| Creator | 8090 | | |
| EC-Number | 7998 | Enzyme | 7632 |
| Reaction-Direction | 6660 | | |
| Orphan? | 5967 | | |
| Credits | 2779 | | |
| Rxn-Locations | 282 | | |
| Spontaneous? | 238 | | |
Attributes are sorted based on the MetaCyc frequency column. Attribute Physiologically-Relevant? describes whether a reaction occurs in vivo. Reaction-Direction specifies the directionality of the reaction. Orphan? is true when no nucleotide or amino-acid sequence has been determined for any enzyme catalyzing this reaction [39,40]. Rxn-Locations specifies the cellular locations in which a reaction occurs (e.g., cytoplasm or mitochondrion). Spontaneous? specifies whether a reaction occurs spontaneously in living organisms and therefore requires no enzyme. KEGG reaction attribute data are derived from the KEGG REACTION dataset, and thus include glycan reactions.
Comparison of metabolic pathways, average reactions per pathway, and average compounds per pathway in MetaCyc (M) and KEGG (K)
| | M(base) | K(module) | M(super) | K(map) |
|---|---|---|---|---|
| Pathway count | 1846 | 179 | 296 | 237 |
| Reactions per pathway | 4.37 | 6.22 | 14.24 | 28.84 |
| Compounds per pathway | 11.49 | 15.27 | 25.63 | 37.45 |
Comparison of pathway data content in MetaCyc and KEGG
| Attribute | MetaCyc | KEGG |
|---|---|---|
| Pathway classes | 490 | 107 |
| Pathway instances | 2142 | 416 |
| Pathways with comments | 2122 | 51 |
| Mean comment length | 2240.6 | 83.6 |
| DB links per pathway | 0.34 | 0.88 |
| Reactions per pathway | 5.73 | 19.10 |
KEGG pathway classes were extracted from the MAP and MODULE datasets based on the CLASS attribute. Comment length is measured in number of characters.
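The overall "Reactions per pathway" values can be cross-checked against the per-type averages reported earlier, if one assumes that the four pathway-count columns are MetaCyc base pathways (1846), KEGG modules (179), MetaCyc superpathways (296), and KEGG maps (237). That assignment is an inference from the totals (1846 + 296 = 2142 and 179 + 237 = 416), not something the captions state. A minimal sketch of the pooled mean:

```python
def pooled_mean(groups):
    """Weighted mean of per-group means; groups is a list of (count, mean)."""
    total_count = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total_count


# (pathway count, mean reactions per pathway), under the assumption above.
metacyc = pooled_mean([(1846, 4.37), (296, 14.24)])
kegg = pooled_mean([(179, 6.22), (237, 28.84)])
print(f"MetaCyc: {metacyc:.3f}  KEGG: {kegg:.3f}")
```

Both pooled values agree with the table (5.73 and 19.10) to within the rounding of the per-type means.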
Figure 1. A histogram plot of MetaCyc base pathway and KEGG module size by reaction counts. We excluded one outlier consisting of a MetaCyc base pathway (PWYG-321, “mycolate biosynthesis”) with 192 reactions; 17% of MetaCyc base pathways consist of a single reaction.
Figure 2. A histogram plot of MetaCyc super pathway and KEGG map size by reaction counts. We excluded one outlier consisting of a MetaCyc super pathway (PWY-6113, “mycolate biosynthesis”) with 233 reactions.
A comparison of MetaCyc (M) and KEGG (K) pathway attributes, for those attributes where one hundred or more objects have a value for that attribute
| MetaCyc attribute | Objects | KEGG attribute | Objects |
|---|---|---|---|
| Species | 2141 | | |
| Pathway-Links | 1412 | Rel_Pathway | 345 |
| Creation-Date | 2139 | | |
| Taxonomic-Range | 2135 | | |
| Creator | 2092 | | |
| Predecessors | 2089 | ECrel | 154 |
| Credits | 1944 | | |
| Key-Reactions | 373 | | |
| | | Disease | 220 |
| Hypothetical-Reactions | 105 | | |
Attributes are sorted based on frequency. KEGG pathway attribute data are pooled from all objects in the KEGG MODULE and MAP datasets (which include data from global pathways and pathway classes with no metabolic reaction data). Attribute Species specifies the organisms in which the pathway has been studied experimentally. Pathway-Links lists important substrates that connect to other metabolic pathways, whereas KEGG attribute Rel_Pathway links pathways to one another without specifying the compound in common. Taxonomic-Range describes the taxonomic groups in which the pathway is likely to be found; this information increases the accuracy of pathway prediction. Predecessors specifies for each reaction the reaction(s) that precede it in the pathway, and thus defines the connectivity structure of each pathway. KEGG encodes equivalent data in the “ECrel” relationship, obtained from the get_element_relations_by_pathway API function. Key-Reactions increases the accuracy of pathway prediction by specifying reactions whose presence is highly indicative of the pathway and that distinguish the pathway from other, similar pathways. Hypothetical-Reactions identifies pathway reactions that are speculative and have not been firmly established experimentally. The Disease attribute consists of links to the KEGG DISEASE dataset when disease-related genes encode enzymes for one or more reaction steps in the pathway.
Degree to which pathways in MetaCyc (M) and KEGG (K) have their reactions linked to the other DB
| | M(base) | K(module) | M(super) | K(map) |
|---|---|---|---|---|
| All reactions linked | 549 | 73 | 0 | 3 |
| Some reactions linked | 731 | 80 | 73 | 128 |
| No reactions linked | 566 | 26 | 223 | 106 |
For example, for three KEGG maps, all reactions in the pathway are present in MetaCyc.
MetaCyc pathway classes that are significantly enriched or depleted for reactions with links to KEGG
| Status | Pathway class | Class size | p-value | |
|---|---|---|---|---|
| Enriched | Amino Acids Biosynthesis | 112 | 1.4 × 10^-20 | |
| Enriched | Individual Amino Acids Biosynthesis | 99 | 4.0 × 10^-19 | |
| Enriched | Amino Acids Degradation | 118 | 2.0 × 10^-17 | |
| Enriched | Purine Nucleotide Biosynthesis | 19 | 3.2 × 10^-10 | |
| Enriched | Generation Of Precursor Metabolites And Energy | 162 | 2.6 × 10^-9 | |
| Enriched | C1 Compounds Utilization And Assimilation | 28 | 9.3 × 10^-9 | |
| Enriched | Autotrophic CO2 Fixation | 7 | 1.3 × 10^-7 | |
| Enriched | CO2 Fixation | 9 | 2.8 × 10^-7 | |
| Enriched | Vitamins Biosynthesis | 68 | 1.4 × 10^-6 | |
| Enriched | Sugar Derivatives Degradation | 42 | 3.8 × 10^-6 | |
| Enriched | Sugar Alcohols Degradation | 12 | 6.0 × 10^-6 | |
| Enriched | Amines And Polyamines Biosynthesis | 37 | 6.9 × 10^-6 | |
| Enriched | Carboxylates Degradation | 44 | 1.0 × 10^-5 | |
| Enriched | Sugars Degradation | 51 | 1.5 × 10^-5 | |
| Enriched | NAD Biosynthesis | 8 | 1.6 × 10^-5 | |
| Enriched | Fermentation | 46 | 3.6 × 10^-5 | |
| Enriched | Nucleosides And Nucleotides Biosynthesis | 35 | 6.8 × 10^-5 | |
| Enriched | Nucleosides And Nucleotides Degradation | 29 | 1.4 × 10^-4 | |
| Enriched | Purine Nucleotide Salvage | 13 | 1.7 × 10^-4 | |
| Enriched | Arginine Degradation | 15 | 4.6 × 10^-4 | |
| Enriched | Purine Nucleotide De Novo Biosynthesis | 6 | 4.9 × 10^-4 | |
| Enriched | Mandelates Degradation | 2 | 9.9 × 10^-4 | |
| Enriched | Gluconeogenesis | 2 | 1.1 × 10^-3 | |
| Enriched | Glycolysis | 6 | 2.0 × 10^-3 | |
| Enriched | NAD Metabolism | 11 | 2.1 × 10^-3 | |
| Enriched | Geranylgeranyldiphosphate Biosynthesis | 3 | 2.3 × 10^-3 | |
| Enriched | Catechol Degradation | 7 | 2.3 × 10^-3 | |
| Enriched | Methionine Biosynthesis | 13 | 4.0 × 10^-3 | |
| Enriched | Photosynthesis | 5 | 4.0 × 10^-3 | |
| Enriched | Pyrimidine Nucleotide Biosynthesis | 8 | 5.7 × 10^-3 | |
| Enriched | Toluenes Degradation | 13 | 7.2 × 10^-3 | |
| Enriched | Glutamate Degradation | 10 | 1.5 × 10^-2 | |
| Enriched | Formaldehyde Assimilation | 3 | 1.6 × 10^-2 | |
| Enriched | Alcohols Degradation | 17 | 1.6 × 10^-2 | |
| Enriched | Urate Degradation | 2 | 2.4 × 10^-2 | |
| Enriched | Cobalamin Biosynthesis | 9 | 2.5 × 10^-2 | |
| Depleted | Secondary Metabolites Biosynthesis | 460 | 3.8 × 10^-35 | |
| Depleted | Glucosinolates Biosynthesis | 9 | 2.0 × 10^-17 | |
| Depleted | Biosynthesis | 1182 | 2.3 × 10^-16 | |
| Depleted | Nitrogen Containing Glucosides Biosynthesis | 13 | 8.0 × 10^-15 | |
| Depleted | Hormones Degradation | 24 | 2.7 × 10^-13 | |
| Depleted | Polymeric Compounds Degradation | 35 | 3.5 × 10^-12 | |
| Depleted | Polysaccharides Degradation | 33 | 2.0 × 10^-10 | |
| Depleted | Steroids Degradation | 8 | 4.2 × 10^-6 | |
| Depleted | Polyketides Biosynthesis | 13 | 1.5 × 10^-5 | |
| Depleted | Glucosinolates Degradation | 4 | 7.0 × 10^-5 | |
| Depleted | Cholesterol Degradation | 4 | 3.3 × 10^-4 | |
| Depleted | Fatty Acid Biosynthesis | 49 | 3.4 × 10^-4 | |
| Depleted | Nitrogen Containing Secondary Compounds Degradation | 18 | 1.0 × 10^-3 | |
| Depleted | Terpenoids Biosynthesis | 127 | 1.2 × 10^-3 | |
| Depleted | Plant Hormones Degradation | 15 | 1.6 × 10^-3 | |
| Depleted | Sesquiterpenoids Biosynthesis | 32 | 1.6 × 10^-3 | |
| Depleted | Chlorotoluene Degradation | 5 | 2.3 × 10^-3 | |
| Depleted | Auxins Degradation | 8 | 4.1 × 10^-3 | |
| Depleted | Apocarotenoids Biosynthesis | 4 | 2.4 × 10^-2 | |
| Depleted | Lignans Biosynthesis | 5 | 2.4 × 10^-2 | |
Class size is the number of pathway instances for the given pathway class. The ‘Links’ column is the number of reactions among the pathways of the pathway class that have links to KEGG reactions, over the total number of reactions for the pathway class. The Bonferroni-corrected p-value from the hypergeometric test indicates the probability that the observed proportion of reactions with links within the pathway class occurred by chance. Pathway classes with a p-value at or below a cut-off of α = 0.025 are shown. The full list may be found in Additional file 2.
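A hypergeometric enrichment test with Bonferroni correction can be sketched using only the Python standard library. This is an illustration of the general technique, not the authors' code: the paper does not state here whether the test was one-sided or how many classes entered the correction, and the names `hypergeom_upper_tail` and `bonferroni` are illustrative. In this framing the population is all pathway reactions, the "successes" are reactions with links to the other DB, and the draw is the set of reactions in one pathway class.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int,
                         draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement
    from `population` items, of which `successes` are linked."""
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers: 10 reactions, 5 linked; a class of 5 reactions, all linked.
p = hypergeom_upper_tail(population=10, successes=5, draws=5, observed=5)
print(p, bonferroni(p, n_tests=20))
```

Depletion would be tested with the corresponding lower tail, P(X <= observed), summing from 0 up to the observed count instead.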
KEGG pathway classes that are significantly enriched or depleted for reactions with links to MetaCyc
| Status | Pathway class | Class size | p-value | |
|---|---|---|---|---|
| Enriched | Nucleotide And Amino Acid Metabolism | 72 | 2.6 × 10^-56 | |
| Enriched | Carbohydrate Metabolism | 15 | 4.1 × 10^-29 | |
| Enriched | Amino Acid Metabolism | 13 | 3.0 × 10^-20 | |
| Enriched | Energy Metabolism | 8 | 2.9 × 10^-14 | |
| Enriched | Cofactor And Vitamin Biosynthesis | 19 | 1.8 × 10^-11 | |
| Enriched | Energy Metabolism | 24 | 1.3 × 10^-10 | |
| Enriched | Carbon Fixation | 13 | 7.6 × 10^-8 | |
| Enriched | Aromatic Amino Acid Metabolism | 11 | 2.2 × 10^-5 | |
| Enriched | Alkaloid And Other Secondary Metabolite Biosynthesis | 4 | 2.9 × 10^-5 | |
| Enriched | Other Carbohydrate Metabolism | 6 | 2.9 × 10^-5 | |
| Enriched | Nucleotide Metabolism | 2 | 7.6 × 10^-5 | |
| Enriched | Cysteine And Methionine Metabolism | 6 | 3.4 × 10^-4 | |
| Enriched | Central Carbohydrate Metabolism | 13 | 4.4 × 10^-4 | |
| Enriched | Reaction Motif | 3 | 7.1 × 10^-3 | |
| Enriched | Arginine And Proline Metabolism | 3 | 7.1 × 10^-3 | |
| Enriched | Histidine Metabolism | 2 | 1.6 × 10^-2 | |
| Enriched | Purine Metabolism | 3 | 2.1 × 10^-2 | |
| Depleted | Xenobiotics Biodegradation And Metabolism | 20 | 1.0 × 10^-37 | |
| Depleted | Glycan Biosynthesis And Metabolism | 15 | 1.9 × 10^-21 | |
| Depleted | Metabolism Of Terpenoids And Polyketides | 20 | 4.4 × 10^-13 | |
| Depleted | Glycan Metabolism | 10 | 6.8 × 10^-11 | |
| Depleted | Glycosaminoglycan Metabolism | 7 | 2.0 × 10^-5 | |
| Depleted | Lipid Metabolism | 17 | 1.4 × 10^-3 | |
Class size is the number of pathway instances for the given pathway class. The ‘Links’ column is the number of reactions among the pathways of the pathway class that have links to MetaCyc reactions, over the total number of reactions for the pathway class. The Bonferroni-corrected p-value from the hypergeometric test indicates the probability that the observed proportion of reactions with links within the pathway class occurred by chance. Pathway classes with a p-value at or below a cut-off of α = 0.025 are shown. The full list may be found in Additional file 2.
Taxonomic analysis of MetaCyc base pathways that are not represented in KEGG pathways
| ID | Taxon | Pathways | Unique pathways | Percent unique |
|---|---|---|---|---|
| 131567 | Cellular Organisms | 1840 | 878 | 47.7 |
| 2759 | Eukaryota | 1094 | 512 | 46.8 |
| 33154 | Opisthokonta | 351 | 131 | 37.3 |
| 33208 | Metazoa (multicellular animals) | 129 | 47 | 36.4 |
| 7711 | Chordata | 54 | 20 | 37.0 |
| 7742 | Vertebrata | 52 | 20 | 38.5 |
| 4751 | Fungi | 219 | 78 | 35.6 |
| 2 | Bacteria | 1040 | 426 | 41.0 |
| 1224 | Proteobacteria (purple photosynthetic bacteria) | 169 | 64 | 37.9 |
| 2157 | Archaea | 209 | 82 | 39.2 |
The ID column is the NCBI Taxonomy DB [42] identifier. The pathways column is the number of MetaCyc pathways that occur in that taxon based on its Taxonomic-Range slot. The unique pathways column is the number of MetaCyc base pathways for that taxon that are unique relative to KEGG pathways. The percent unique column is the fraction of MetaCyc base pathways for that taxon that are unique relative to KEGG pathways, with rows with a fraction greater than 50% shown in bold. The rows of the table are sorted with respect to the NCBI Taxonomy. Relative taxonomic rank is indicated by indentation. Taxa of the same rank are ordered by decreasing percent unique pathways. The taxon of “Cellular Organisms” is included to provide a baseline from which to compare other taxa.