Masaaki Kotera, Yasuo Tabei, Yoshihiro Yamanishi, Yuki Moriya, Toshiaki Tokimatsu, Minoru Kanehisa, Susumu Goto.
Abstract
BACKGROUND: To develop hypotheses about unknown metabolic pathways, biochemists frequently rely on literature that describes functional groups or substructures in free text. In computational chemistry and cheminformatics, molecules are typically represented by chemical descriptors, i.e., vectors that summarize information on their various properties. However, these chemical descriptors are difficult to interpret because they are not directly linked to the functional-group and substructure terminology that biochemists use.
Year: 2013 PMID: 24564846 PMCID: PMC4029371 DOI: 10.1186/1752-0509-7-S6-S2
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Figure 1. KEGG Chemical Function (KCF) format. (a) KCF format of NADPH. The KCF format has three sections: ENTRY, ATOM and BOND. The ENTRY section gives the KEGG ID and the type of the entry. The ATOM section gives, for each atom, its number, its KEGG Atom Type label, its atomic species (C for carbon, N for nitrogen, etc.), and its 2D coordinates. The BOND section gives, for each bond, its number, the numbers of the two atoms it connects, the bond order, and the steric configuration of the bond. (b) KCF representation of NADPH. Molecules are represented as graph structures in which nodes represent atoms labeled with KEGG Atom Types.
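The section layout described in the caption can be read with a few lines of code. The following is a minimal sketch, not an official KEGG parser: the toy entry `X00001`, the names `Atom`, `Bond` and `parse_kcf`, and the assumption that an optional fifth BOND field carries the steric annotation are all illustrative choices made here.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    index: int
    atom_type: str  # KEGG Atom Type label, e.g. "C1a"
    element: str    # atomic species, e.g. "C"
    x: float        # 2D coordinates
    y: float


@dataclass
class Bond:
    index: int
    a: int          # number of the first atom
    b: int          # number of the second atom
    order: int
    stereo: str     # steric annotation if present, else "" (assumed layout)


# Hypothetical three-atom entry (hand-written, not taken from KEGG).
SAMPLE_KCF = """
ENTRY       X00001                      Compound
ATOM        3
            1   C1a C    0.0000    0.0000
            2   C1b C    1.2000    0.7000
            3   O1a O    2.4000    0.0000
BOND        2
            1     1   2 1
            2     2   3 1
///
"""


def parse_kcf(text):
    """Return (entry, atoms, bonds) from KCF-like text."""
    entry, atoms, bonds, section = None, [], [], None
    for line in text.splitlines():
        if not line.strip() or line.startswith("///"):
            continue
        if not line[0].isspace():
            # A section keyword starts in the first column.
            keyword, *rest = line.split()
            section = keyword
            if keyword == "ENTRY":
                entry = (rest[0], rest[1] if len(rest) > 1 else "")
            continue
        fields = line.split()
        if section == "ATOM":
            atoms.append(Atom(int(fields[0]), fields[1], fields[2],
                              float(fields[3]), float(fields[4])))
        elif section == "BOND":
            stereo = fields[4] if len(fields) > 4 else ""
            bonds.append(Bond(int(fields[0]), int(fields[1]),
                              int(fields[2]), int(fields[3]), stereo))
    return entry, atoms, bonds


entry, atoms, bonds = parse_kcf(SAMPLE_KCF)
```

The atoms and bonds returned here form the labeled molecular graph mentioned in panel (b): atom numbers are the nodes, and each `Bond` is an edge between two of them.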
KEGG Atom Types.
| KEGG Atom Type | Substructure / name |
|---|---|
| Carbon atoms | |
| C1a | R-CH3 / methyl |
| C1b | R-CH2-R / methylene |
| C1c | R-CH(-R)-R / tertiary carbon |
| C1d | R-C(-R)2-R / quaternary carbon |
| C1x | ring-CH2-ring / methylene in ring |
| C1y | ring-CH(-R)-ring / tertiary carbon in ring |
| C1z | ring-C(-R)2-ring / quaternary carbon in ring |
| C2a | R=CH2 / alkenyl terminus carbon |
| C2b | R=CH-R / alkenyl secondary carbon |
| C2c | R=C(-R)2 / alkenyl tertiary carbon |
| C2x | ring-CH=ring / alkenyl secondary carbon in ring |
| C2y | ring-C(-R)=ring or ring-C(=R)-ring / alkenyl tertiary carbon in ring |
| C3a | R#CH / alkynyl terminus carbon |
| C3b | R#C-R / alkynyl secondary carbon |
| C4a | R-CH=O / aldehyde carbon |
| C5a | R-C(=O)-R / keto carbon |
| C5x | ring-C(=O)-ring / keto carbon in ring |
| C6a | R-C(=O)-OH / carboxylate carbon |
| C7a | R-C(=O)-O-R / carboxylate ester carbon |
| C7x | ring-C(=O)-O-ring / lactone carbon |
| C8x | ring-CH=ring / aromatic secondary carbon |
| C8y | ring-C(-R)=ring / aromatic tertiary carbon |
| C0 | Undefined carbon |
| N1a | R-NH2 / primary amine |
| N1b | R-NH-R / secondary amine |
| N1c | R-N(-R)2 / tertiary amine |
| N1d | R-N(-R)3+ / quaternary ammonium |
| N1x | ring-NH-ring / secondary amine in ring |
| N1y | ring-N(-R)-ring / tertiary amine in ring |
| N2a | R=N-H / primary imine |
| N2b | R=N-R / secondary imine |
| N2x | ring-N=ring / secondary imine in ring |
| N2y | ring-N(-R)+=ring / iminium |
| N3a | R#N / nitrile |
| N4x | ring-NH-ring / aromatic secondary amine |
| N4y | ring-N(-R)-ring / aromatic tertiary amine |
| N5x | ring-N=ring / aromatic secondary imine |
| N5y | ring-N(-R)+=ring / aromatic tertiary imine |
| N0 | Undefined nitrogen |
| O1a | R-OH / hydroxy |
| O1b | N-OH / N-hydroxy |
| O1c | P-OH / P-hydroxy |
| O1d | S-OH / S-hydroxy |
| O2a | R-O-R / hydroxy ether |
| O2b | P-O-R / hydroxy phosphate bond |
| O2c | P-O-P / pyrophosphate bond |
| O2x | ring-O-ring / cyclic ether |
| O3a | N=O / N-oxo |
| O3b | P=O / P-oxo |
| O3c | S=O / S-oxo |
| O4a | R-CH=O / aldehyde oxygen |
| O5a | R-C(=O)-R / keto oxygen |
| O5x | ring-C(=O)-ring / keto oxygen in ring |
| O6a | R-C(=O)-OH / carboxylate oxygen |
| O7a | R-C(=O)-O-R / carboxylate ester oxygen |
| O7x | ring-C(=O)-O-ring / lactone oxygen |
| O0 | Undefined oxygen |
| S1a | R-SH / mercapto |
| S2a | R-S-R / sulfide |
| S2x | ring-S-ring / sulfide in ring |
| S3a | R-S-S-R / disulfide |
| S3x | ring-S-S-ring / disulfide in ring |
| S4a | R-SO3 / sulfate |
| S0 | Undefined sulfur |
| P1a | P-R / phosphine |
| P1b | P-O / phosphate |
| X | F / fluoride |
| X | Cl / chloride |
| X | Br / bromide |
| X | I / iodide |
| Z | Other atoms |
KEGG Atom Types were defined in 2003 [17] and are used to label the nodes in molecular graphs. A KEGG Atom Type label consists of three characters, such as "C1a" for a methyl carbon. The first and second characters represent the atom species and the orbital environment, respectively. The third character describes the surroundings of the atom in terms of its bonded neighbors.
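The three-character scheme can be illustrated by splitting a label into its parts. This is a small sketch under assumptions: the names `AtomLabel`, `parse_atom_label` and `in_ring` are hypothetical, and the ring test relies only on the observation that every label ending in x, y or z in the table above is a ring atom.

```python
from typing import NamedTuple, Optional


class AtomLabel(NamedTuple):
    element: str                  # first character: atom species
    bonding_class: Optional[str]  # second character: orbital environment
    environment: Optional[str]    # third character: bonded neighbors


def parse_atom_label(label):
    """Split a KEGG Atom Type label such as "C1a" into its characters.

    Short labels from the table ("C0", "X", "Z") yield None for the
    missing positions.
    """
    if not label or not label[0].isalpha():
        raise ValueError(f"not a KEGG Atom Type label: {label!r}")
    bonding_class = label[1] if len(label) > 1 else None
    environment = label[2] if len(label) > 2 else None
    return AtomLabel(label[0], bonding_class, environment)


def in_ring(label):
    """True if the third character is x, y or z (ring atoms in the table)."""
    return parse_atom_label(label).environment in ("x", "y", "z")


methyl = parse_atom_label("C1a")
aromatic = in_ring("C8y")
```

Here `methyl` holds the three parts of the methyl-carbon label, and `aromatic` is true because "C8y" (aromatic tertiary carbon) is a ring atom type.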
Figure 2. Examples of proposed KCF-Substructures and their relationships. Three types of arrows are used to explain the relationships between objects. See the text for details.
Figure 3. KEGG Chemical Function and Substructures (KCF-S) format, a proposed extension of KCF. The KCF-S format has two sections, ENTRY and SUBSTR (substructures). The SUBSTR section is divided into seven subsections: ATOM, BOND, TRIPLET, VICINITY, RING, SKELETON and INORGANIC. Each subsection lists the substructures as strings, the number of times each substructure appears in the molecule (shown in parentheses), and the atoms involved in the substructure.
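To make the simpler subsections concrete, the sketch below counts ATOM, BOND, TRIPLET and VICINITY strings on a labeled molecular graph. It is an illustration under assumptions, not the authors' implementation: the string orientation rule (take the lexicographically smaller of a path and its reverse), the sorted neighbor order in VICINITY strings, and the restriction of VICINITY to atoms with at least two neighbors are choices made here. The atom types of the alanine-like toy molecule were assigned by hand from the atom type table. RING, SKELETON and INORGANIC are omitted, and only counts are recorded, not the atoms involved.

```python
from collections import Counter, defaultdict


def _canonical(labels):
    """Pick one orientation for a path string (assumed convention)."""
    forward = "-".join(labels)
    backward = "-".join(reversed(labels))
    return min(forward, backward)


def count_substructures(atom_types, bonds):
    """atom_types: {atom number: KEGG Atom Type}; bonds: list of (i, j)."""
    neighbors = defaultdict(set)
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    counts = {name: Counter()
              for name in ("ATOM", "BOND", "TRIPLET", "VICINITY")}
    for label in atom_types.values():
        counts["ATOM"][label] += 1
    for i, j in bonds:
        counts["BOND"][_canonical([atom_types[i], atom_types[j]])] += 1
    for center, nbrs in neighbors.items():
        ordered = sorted(nbrs)
        # Every unordered pair of neighbors gives one path a-center-b.
        for p in range(len(ordered)):
            for q in range(p + 1, len(ordered)):
                path = [atom_types[ordered[p]], atom_types[center],
                        atom_types[ordered[q]]]
                counts["TRIPLET"][_canonical(path)] += 1
        if len(ordered) >= 2:
            around = "+".join(sorted(atom_types[n] for n in ordered))
            counts["VICINITY"][f"{atom_types[center]}({around})"] += 1
    return counts


# Alanine-like toy molecule (heavy atoms only, hand-assigned types).
TOY_ATOMS = {1: "C1a", 2: "C1c", 3: "N1a", 4: "C6a", 5: "O6a", 6: "O6a"}
TOY_BONDS = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6)]

toy_counts = count_substructures(TOY_ATOMS, TOY_BONDS)
```

For this toy molecule the TRIPLET counter contains "C6a-C1c-N1a", the string annotated as alpha-amino acid in the table of named substructures.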
Examples of named substructures and appearance in KEGG COMPOUND, KEGG DRUG and KNApSAcK databases.
| KCF-S / annotation | COMPOUND (#S / #C) | DRUG (#S / #C) | KNApSAcK (#S / #C) |
|---|---|---|---|
| BOND | |||
| C5a-N1b / amide bond | 4174 / 2192 | 2678 / 1385 | 6784 / 2528 |
| C7a-O7a / carboxylate ester bond | 3040 / 2198 | 1787 / 1329 | 21857 / 13166 |
| C5a-S2a / thioester bond | 455 / 453 | 31 / 30 | 36 / 36 |
| N2b-N2b / diazo bond | 83 / 73 | 83 / 19 | 11 / 11 |
| S3a-S3a / disulfide bond | 40 / 37 | 40 / 26 | 43 / 33 |
| N1b-N1b / hydrazine bond | 15 / 13 | 22 / 15 | 3 / 3 |
| TRIPLET | |||
| C6a-C1c-N1a / alpha-amino acid | 512 / 484 | 113 / 104 | 191 / 183 |
| C5a-C1b-C5a / beta-keto carbonyl | 270 / 106 | 6 / 6 | 36 / 36 |
| C6a-C5a-O5a / alpha-keto carboxylate | 169 / 168 | 10 / 8 | 46 / 46 |
| C6a-C1c-O1a / alpha-hydroxy carboxylate | 167 / 154 | 236 / 137 | 108 / 87 |
| VICINITY | |||
| C1y(C1y+C1y+O1a) / cyclic secondary alcohol | 10099 / 3090 | 1171 / 388 | 49015 / 11697 |
| C8y(C8x+C8x+O1a) / phenolic hydroxy | 1562 / 1263 | 376 / 313 | 9978 / 7219 |
| C5a(N1b+N1b+O5a) / pseudourea | 66 / 65 | 82 / 77 | 46 / 43 |
| N1c(C1b+C1b+C1b) / tertiary amine | 54 / 48 | 302 / 235 | 0 / 0 |
| C5x(N1x+N1x+O5x) / cyclic pseudourea | 36 / 36 | 30 / 29 | 20 / 20 |
| RING | |||
| C1y(C1b)-C1y(O1a)-C1y(O1a)-C1y(O1a)-C1y(O2a)-O2x / pyranose sugar ring | 1024 / 824 | 64 / 54 | 7670 / 6187 |
| C8x-N4y(C1y)-C8y(N5x)-C8y(C8y)-N5x / imidazole ring | 549 / 535 | 48 / 47 | 84 / 84 |
| C8x-N4y(C1y)-C8y-N5x-C8x-N5x-C8y(N1a)-C8y-N5x / adenine ring | 428 / 420 | 17 / 17 | 55 / 55 |
| C1x-C1x-N1y(C1b)-C1x-C1x-N1y(C1b) / piperazine ring | 7 / 7 | 45 / 45 | 0 / 0 |
| C8x-C8y(C2b)-C8x-C8y(O1a)-C8y(O1a)-C8y(O1a) / 5-alkenylbenzene-1,2,3-triol | 3 / 3 | 0 / 0 | 12 / 12 |
| SKELETON | |||
| C1b(O2b)-C1y(O2x)-C1y(O1a)-C1y(O1a)-C1y(N4y+O2x) / ribofuranose | 255 / 255 | 20 / 20 | 62 / 62 |
| C1x(N1y)-C1x(N1y) / ethylenediamine in ring | 136 / 136 | 702 / 702 | 0 / 0 |
| C1a-C1c(C1a)-C1b-C1c(N1b)-C5a(N1b+O5a) / leucine residue | 102 / 102 | 79 / 79 | 228 / 228 |
| C7a(O6a+O7a)-C8y-C8x-C8x-C8y(O2a)-C8x-C8x / p-hydroxybenzoate residue | 0 / 0 | 3 / 3 | 51 / 51 |
| INORGANIC | |||
| O1c-P1b(O2b(C1y))(O1c)-O1c / cyclic secondary alcohol orthophosphate | 520 / 520 | 19 / 19 | 66 / 66 |
| O1c-P1b(O2b(C1b))(O1c)-O1c / primary alcohol orthophosphate | 387 / 387 | 43 / 43 | 97 / 97 |
| O1c-P1b(O2b(C1y))(O2b(C1b))-O1c / cyclic orthophosphate | 173 / 173 | 2 / 2 | 2 / 2 |
| O3a-N2b(C8y)-O3a / aryl nitro | 304 / 304 | 164 / 164 | 48 / 48 |
| N2b(C2c)-O1b / oxime | 27 / 27 | 22 / 22 | 61 / 61 |
#S is the number of occurrences of a KCF-Substructure, and #C is the number of compounds containing it. Note that the annotations are not necessary-and-sufficient definitions. For example, the "N1b-N1b" bond is a hydrazine bond, but there are other types of hydrazine bonds: "N1b-N1c" is a hydrazine bond with three substituent groups, and "N1x-N1x" is a hydrazine bond in a ring structure.
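The difference between #S and #C can be shown with a short tally. This is a minimal sketch: the function name `tally` and the compound IDs `cpd1` to `cpd3` with their counts are made up for illustration and are not data from the databases.

```python
from collections import Counter


def tally(per_compound_counts):
    """Return (#S, #C) counters from {compound: {substructure: count}}."""
    n_substructures = Counter()  # #S: total occurrences
    n_compounds = Counter()      # #C: compounds with at least one occurrence
    for counts in per_compound_counts.values():
        for substructure, n in counts.items():
            if n > 0:
                n_substructures[substructure] += n
                n_compounds[substructure] += 1
    return n_substructures, n_compounds


# Hypothetical per-compound counts of two BOND strings.
TOY_COUNTS = {
    "cpd1": {"C5a-N1b": 3, "S3a-S3a": 1},
    "cpd2": {"C5a-N1b": 1},
    "cpd3": {"S3a-S3a": 0},
}

s_counts, c_counts = tally(TOY_COUNTS)
```

In this toy input "C5a-N1b" occurs four times in two compounds, so #S exceeds #C whenever a substructure appears more than once in the same molecule.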
Figure 4. Venn diagrams of common and unique KCF-Substructures in the KEGG COMPOUND, KEGG DRUG, and KNApSAcK databases. The panels show the numbers of (a) KCF-Substructures, (b) BOND and TRIPLET entries, (c) VICINITY and INORGANIC entries, and (d) RING and SKELETON entries. In panels (b) to (d), the two entry types are shown at the top and bottom, respectively.
Top five complete-linkage clusters with weighted Jaccard coefficient >= 0.7.
| #M | Max MW | Molecule | Ave MW | Molecule | Min MW | Molecule | SD | |
|---|---|---|---|---|---|---|---|---|
| (a) clustered by KCF-S descriptor | ||||||||
| #1 acyl-CoA molecules | ||||||||
| 144 | 993.8 | C01894 | 883.8 | C04348 | 767.5 | C00010 | 3.317 | |
| #2 enoyl-CoA molecules | ||||||||
| 79 | 1124 | C16388 | 1026 | C16163 | 891.7 | C05276 | 6.789 | |
| #3 metals and inorganic ions | ||||||||
| 48 | 244.0 | C19159 | 97.75 | C00150 | 1.00 | C00080 | 10.11 | |
| #4 acyl-CoA molecules with aromatic substituted groups | ||||||||
| 48 | 1023 | C14118 | 929.6 | C00323 | 861.6 | C00845 | 6.107 | |
| #5 disaccharides | ||||||||
| 35 | 342.2 | C00897 | 339.3 | C04698 | 326.2 | C19758 | 1.153 | |
| (b) clustered by PubChem fingerprint | ||||||||
| #1 from furanocoumarins to glycosylated flavonoids | ||||||||
| 382 | 918.8 | C12636 | 372.7 | C09956 | 186.1 | C09060 | 5.993 | |
| #2 from biotinyl-5'-AMP to CoA-disulfide | ||||||||
| 237 | 1533 | C02015 | 959.5 | C16339 | 573.5 | C05921 | 7.893 | |
| #3 from flavonoids to pyrones (chromones), aggregated phenols | ||||||||
| 159 | 668.7 | C10669 | 325.1 | C09752 | 166.1 | C10712 | 6.879 | |
| #4 from xanthenes to tannins, glycosylated and acylated flavonoids | ||||||||
| 156 | 2108 | C16302 | 757.2 | C12646 | 346.2 | C09967 | 27.82 | |
| #5 steroids | ||||||||
| 135 | 514.2 | C15359 | 335.8 | C14621 | 270.3 | C14261 | 3.703 | |
| (c) clustered by MACCS fingerprint | ||||||||
| #1 from pyrimidine 5'-deoxynucleotide to CoA-disulfide | ||||||||
| 432 | 1533 | C02015 | 823.4 | C00100 | 277.1 | C08249 | 12.13 | |
| #2 from 3',5'-cyclic CMP to polypeptidyl UDP-glucose | ||||||||
| 195 | 1221 | C04894 | 564.8 | C00842 | 305.1 | C00941 | 13.41 | |
| #3 from xanthenes to highly glycosylated and aromatic acylated flavonoids | ||||||||
| 167 | 2108 | C16302 | 642.3 | C16290 | 244.1 | C10082 | 23.76 | |
| #4 from xanthenes to C-glycosylated flavonoids | ||||||||
| 159 | 610.5 | C10102 | 337.7 | C10049 | 222.2 | C00799 | 5.895 | |
| #5 from pyrones to biflavonoids | ||||||||
| 157 | 1120 | C10235 | 502.5 | C16191 | 206.1 | C09012 | 13.34 | |
#M indicates the number of molecules in each cluster. Max MW, Ave MW, and Min MW give the molecular weight, followed by the ID, of the molecule with the maximum, average, and minimum molecular weight in the cluster, respectively. SD shows the standard deviation of the obtained clusters. The description after each cluster number (#1 - #5) characterizes the group of molecules; "from ... to ..." indicates that the molecular structures in the cluster were so diverse that we could not find appropriate words to describe the cluster.
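The clustering criterion in the table title can be sketched as follows. The text here does not define the weighted Jaccard coefficient, so this example assumes the commonly used form, the sum of element-wise minima divided by the sum of element-wise maxima of two count vectors. The naive agglomeration loop, the function names, and the toy vectors `m1` to `m3` are likewise illustrative and are not the authors' procedure or data.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two sparse count vectors (dicts)."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def complete_linkage(vectors, threshold):
    """Merge clusters while the least similar cross pair stays >= threshold."""
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        # Complete linkage: similarity of the least similar pair.
        return min(weighted_jaccard(vectors[x], vectors[y])
                   for x in c1 for y in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical substructure-count vectors for three molecules.
TOY_VECTORS = {
    "m1": {"C1a": 2, "C1b": 3},
    "m2": {"C1a": 2, "C1b": 4},
    "m3": {"O1a": 5},
}

toy_clusters = complete_linkage(TOY_VECTORS, threshold=0.7)
```

With complete linkage, every pair of molecules inside a resulting cluster has a similarity of at least the threshold. This pure-Python loop is only suitable for small inputs.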
Figure 5. Scatter plot of the clusters consisting of KEGG COMPOUND and KNApSAcK molecules by KCF-S descriptors. Each dot represents a QCC cluster obtained by KCF-S descriptors with weighted Jaccard coefficient >= 0.7 and clique ratio >= 0.7. The horizontal and vertical axes represent the numbers of KEGG COMPOUND and KNApSAcK molecules in the cluster, respectively.
Cross validation experiments for predicting the enzymatic-reaction likeness.
| Chemical descriptors | Vector dimension | Diff-common L1SVM | | Diff-only L1SVM | | Baseline | | Random | |
|---|---|---|---|---|---|---|---|---|---|
| KCF-S k3 | 53679 | 0.9841 | 0.2483 | 0.9827 | 0.1872 | 0.8254 | 0.0584 | 0.4981 | 0.0052 |
| | 10000 | 0.9839 | 0.2481 | 0.9824 | 0.1840 | 0.8299 | 0.0594 | 0.4985 | 0.0052 |
| | 1000 | 0.9814 | 0.2269 | 0.9773 | 0.1508 | 0.8397 | 0.0592 | 0.5006 | 0.0053 |
| KCF-S k2 | 28152 | 0.9761 | 0.2144 | 0.9691 | 0.1330 | 0.8122 | 0.0503 | 0.4995 | 0.0053 |
| | 10000 | 0.9763 | 0.2148 | 0.9698 | 0.1366 | 0.8143 | 0.0501 | 0.4997 | 0.0053 |
| | 1000 | 0.9720 | 0.2012 | 0.9596 | 0.1029 | 0.8178 | 0.0481 | 0.4988 | 0.0053 |
| KCF-S k1 | 11133 | 0.9702 | 0.1835 | 0.9620 | 0.1300 | 0.8184 | 0.0776 | 0.4962 | 0.0052 |
| | 10000 | 0.9699 | 0.1835 | 0.9600 | 0.1197 | 0.8187 | 0.0769 | 0.4960 | 0.0052 |
| | 1000 | 0.9676 | 0.1757 | 0.9475 | 0.0868 | 0.8208 | 0.0744 | 0.4963 | 0.0052 |
| PubChem FP | 879 | 0.9531 | 0.1341 | 0.9067 | 0.0571 | 0.8883 | 0.0667 | 0.5006 | 0.0052 |
| MACCS FP | 164 | 0.9275 | 0.0932 | 0.9097 | 0.0510 | 0.8200 | 0.0336 | 0.5001 | 0.0052 |
| ATOM k3 | 99 | 0.9532 | 0.1362 | 0.9378 | 0.0703 | 0.8195 | 0.0492 | 0.4983 | 0.0052 |
| BOND k3 | 973 | 0.9773 | 0.2023 | 0.9713 | 0.1319 | 0.8260 | 0.0546 | 0.5001 | 0.0053 |
Figure 6. Example of predicted subnetworks. (a) Nodes (labeled with KEGG COMPOUND or KNApSAcK ID numbers) represent molecules. Black bold lines indicate predicted pairs that were considered positive after manual examination. Black thin lines and gray lines indicate pairs that were considered suspicious and negative, respectively. (b) An example pair considered positive, representing a cyclization reaction. (c) Another example pair considered positive, representing a methylation reaction. (d) An example pair considered suspicious, representing a hydroxylation reaction. (e) An example pair considered negative, representing a large rearrangement of the carbon skeleton that seems impossible.