Nhung Pham, Ruben G A van Heck, Jesse C J van Dam, Peter J Schaap, Edoardo Saccenti, Maria Suarez-Diez.
Abstract
Genome-scale metabolic models (GEMs) are manually curated repositories describing the metabolic capabilities of an organism. GEMs have been successfully used in different research areas, ranging from systems medicine to biotechnology. However, the different naming conventions (namespaces) of databases used to build GEMs limit model reusability and prevent the integration of existing models. This problem is known in the GEM community, but its extent has not been analyzed in depth. In this study, we investigate the name ambiguity and the multiplicity of non-systematic identifiers and we highlight the (in)consistency in their use in 11 biochemical databases of biochemical reactions and the problems that arise when mapping between different namespaces and databases. We found that such inconsistencies can be as high as 83.1%, thus emphasizing the need for strategies to deal with these issues. Currently, manual verification of the mappings appears to be the only solution to remove inconsistencies when combining models. Finally, we discuss several possible approaches to facilitate (future) unambiguous mapping.
Keywords: GEM; GEM interoperability; chemical nomenclature; databases; identifier multiplicity; name ambiguity; naming conventions; standardization
Year: 2019 PMID: 30736318 PMCID: PMC6409771 DOI: 10.3390/metabo9020028
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. Overview of namespace mapping problems. (A) The same chemical entities (coloured nodes) link to different names (colourless nodes) in different namespaces: names in namespace A may link to different chemical entities in namespace B; (B) Example of inconsistency within the same namespace: the same name links to different chemical compounds; (C) Example of inconsistency between different namespaces: the same name links to different compounds in different databases. Chemical entities are represented with coloured nodes, names are represented with colourless nodes.
Ambiguity in biochemical databases: number of compound names associated with more than one identifier (ID); s.d. stands for standard deviation. Blue boxes are used to highlight highest numbers.
| Database | #Name | Average Number of IDs per Name (± s.d.) | % Ambiguous Names | # Ambiguous Names | Highest Number of IDs per Name |
|---|---|---|---|---|---|
| BiGG | 5102 | 1.0141 ± 0.126 | 1.31 | 67 | 3 |
| ChEBI | 388,505 | 1.3846 ± 1.52 | 14.8 | 57,497 | 413 |
| enviPath | 11,648 | 1.0804 ± 0.325 | 7.38 | 860 | 10 |
| HMDB | 101,101 | 1.0377 ± 3.865 | 1.67 | 1686 | 921 |
| KEGG | 59,682 | 1.1461 ± 0.422 | 13.3 | 7936 | 16 |
| LIPID MAPS | 77,457 | 1.0113 ± 0.33 | 0.62 | 478 | 63 |
| MetaCyc | 55,823 | 1.0058 ± 0.103 | 0.5 | 279 | 13 |
| Reactome | 6972 | 1.7902 ± 2.458 | 29.43 | 2052 | 34 |
| SABIO-RK | 11,475 | 1.0008 ± 0.031 | 0.07 | 8 | 3 |
| SEED | 47,410 | 1.0108 ± 0.106 | 1.06 | 503 | 4 |
| SLM | 1,218,750 | 1.0782 ± 0.321 | 6.72 | 81,894 | 9 |
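The columns of the table above can be reproduced from a flat list of (name, ID) links. The following is a minimal sketch, not the authors' pipeline: the function name, the toy pairs, exact string comparison of names, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def name_ambiguity(pairs):
    """Summarize how many distinct IDs each compound name links to.

    `pairs` is an iterable of (name, identifier) tuples. Names are compared
    as exact strings (an assumption; no case folding is applied).
    """
    ids_by_name = defaultdict(set)
    for name, identifier in pairs:
        ids_by_name[name].add(identifier)

    counts = [len(ids) for ids in ids_by_name.values()]
    n_ambiguous = sum(1 for c in counts if c > 1)
    return {
        "n_names": len(counts),
        "mean_ids": mean(counts),
        # Population s.d. is an assumption; the source does not say which is used.
        "sd_ids": pstdev(counts),
        "pct_ambiguous": 100.0 * n_ambiguous / len(counts),
        "n_ambiguous": n_ambiguous,
        "max_ids": max(counts),
    }


# Hypothetical toy links, not taken from any of the databases.
pairs = [
    ("name_a", "db:1"),
    ("name_a", "db:2"),
    ("name_b", "db:3"),
    ("name_c", "db:3"),
    ("name_d", "db:4"),
]
stats = name_ambiguity(pairs)
print(stats)
```

Swapping the two elements of each pair before calling the function gives the inverse statistic, the number of names per ID.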
Figure 2. Intra-database consistency. Edges indicate a link between a metabolite name and a database ID. The database name has been added to the ID as a prefix followed by ‘:’ (e.g., kegg:C00228). (A) Examples of metabolite names associated with multiple IDs in ChEBI. (B) Examples of metabolite names associated with multiple IDs in KEGG. (C) Examples of metabolite IDs associated with multiple names in Reactome. (D) Examples of metabolite IDs associated with multiple names in LIPID MAPS.
ID multiplicity in each database: number of IDs in each database, average number of names per ID (average multiplicity), percentage and number of IDs that are associated with more than one name, and highest number of names an ID links to; s.d. stands for standard deviation. Blue boxes are used to highlight highest numbers.
| Database | #ID | Average Multiplicity (± s.d.) | % of IDs with More than One Name | # of IDs with More than One Name | Highest Multiplicity |
|---|---|---|---|---|---|
| BiGG | 5174 | 1.0 ± 0.0 | 0.0 | 0 | 1 |
| ChEBI | 123,835 | 4.344 ± 3.588 | 97.74 | 121,034 | 57 |
| enviPath | 12,306 | 1.0226 ± 0.229 | 1.6 | 197 | 10 |
| HMDB | 43,179 | 2.4297 ± 0.512 | 99.71 | 43,052 | 8 |
| KEGG | 40,256 | 1.6991 ± 1.231 | 38.93 | 15,671 | 31 |
| LIPID MAPS | 40,772 | 3.9213 ± 0.962 | 100.0 | 40,772 | 23 |
| MetaCyc | 17,159 | 3.2722 ± 1.984 | 99.75 | 17,116 | 98 |
| Reactome | 5344 | 2.3355 ± 16.65 | 47.46 | 2536 | 1106 |
| SABIO-RK | 7683 | 1.4947 ± 1.193 | 24.17 | 1857 | 21 |
| SEED | 27,693 | 1.7305 ± 1.311 | 39.83 | 11,031 | 28 |
| SLM | 505,004 | 2.602 ± 0.611 | 99.87 | 504,333 | 9 |
Example of compound names and IDs with high ambiguity and multiplicity.
| Metabolite Name | # Associated IDs | Metabolite ID | # Associated Names |
|---|---|---|---|
| lecithin | 922 | reactome:5278291 | 1106 |
| diacylglycerol | 812 | reactome:1131511 | 266 |
| Lecithin | 417 | reactome:1236709 | 266 |
| Diglyceride | 317 | reactome:1132345 | 180 |
| Diacylglycerol | 317 | reactome:1132084 | 155 |
| Triacylglycerol | 106 | reactome:1132304 | 140 |
| Triglyceride | 103 | reactome:5278409 | 123 |
| PPP | 66 | reactome:5278317 | 107 |
| Cer[NS] | 63 | MetaCyc:PARATHION | 98 |
Number of IDs (#ID) in each database, number of MNXRef IDs (#MNXRef ID) linking to each database, average number of database IDs per MNXRef ID, percentage and number of MNXRef IDs that link to more than one ID in the corresponding database, and highest number of database IDs linked to a single MNXRef ID; s.d. stands for standard deviation. Blue boxes are used to highlight highest numbers, while red boxes are for lowest numbers.
| Database | #ID | #MNXRef ID | Average #ID per MNXRef ID (± s.d.) | % of MNXRef IDs with More than One ID | # of MNXRef IDs with More than One ID | Highest Multiplicity |
|---|---|---|---|---|---|---|
| BiGG | 5174 | 5062 | 1.0221 ± 0.165 | 1.96 | 99 | 4 |
| ChEBI | 123,835 | 96,746 | 1.28 ± 1.005 | 11.93 | 11,541 | 30 |
| enviPath | 12,306 | 11,087 | 1.1099 ± 0.44 | 8.14 | 902 | 9 |
| HMDB | 43,179 | 42,354 | 1.0195 ± 0.176 | 1.63 | 691 | 12 |
| KEGG | 40,256 | 37,722 | 1.0672 ± 0.293 | 6.14 | 2316 | 12 |
| LIPID MAPS | 40,772 | 40,546 | 1.0056 ± 0.083 | 0.51 | 207 | 6 |
| MetaCyc | 17,159 | 16,985 | 1.0102 ± 0.115 | 0.9 | 153 | 5 |
| Reactome | 5344 | 2058 | 2.5967 ± 3.895 | 41.93 | 863 | 34 |
| SABIO-RK | 7683 | 7512 | 1.0228 ± 0.154 | 2.2 | 165 | 3 |
| SEED | 27,693 | 26,894 | 1.0297 ± 0.181 | 2.79 | 749 | 4 |
| SLM | 505,004 | 504,881 | 1.0002 ± 0.016 | 0.02 | 119 | 3 |
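As a quick arithmetic check on the table above (one reading of its columns, not a statement from the authors), the average column is consistent with the ratio of database IDs to MNXRef IDs, and the percentage column with the share of MNXRef IDs that link to more than one database ID:

```latex
\begin{align*}
\text{BiGG:}\quad & \frac{5174}{5062} \approx 1.0221, &
\frac{99}{5062} &\approx 1.96\% \\
\text{Reactome:}\quad & \frac{5344}{2058} \approx 2.5967, &
\frac{863}{2058} &\approx 41.93\%
\end{align*}
```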
Number of IDs in one database (column) that map to IDs in the database in the corresponding row using database names as a bridge for mapping. Percentages indicate fraction of the initial database. Blue boxes indicate highest overall mapping. Red boxes are used to highlight the lowest numbers.
| Database | BiGG | ChEBI | enviPath | HMDB | KEGG | LIPID MAPS | MetaCyc | Reactome | SABIO-RK | SEED | SLM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BiGG | – | 5097 (4.1%) | 150 (1.2%) | 702 (1.6%) | 1489 (3.7%) | 158 (0.4%) | 210 (1.2%) | 361 (6.8%) | 839 (10.9%) | 1829 (6.6%) | 61 (0.0%) |
| ChEBI | 1303 (25.2%) | – | 816 (6.6%) | 9178 (21.3%) | 16,013 (39.8%) | 4662 (11.4%) | 7209 (42.0%) | 2146 (40.2%) | 2552 (33.2%) | 15,837 (57.2%) | 4336 (0.9%) |
| enviPath | 142 (2.7%) | 2284 (1.8%) | – | 304 (0.7%) | 1111 (2.8%) | 55 (0.1%) | 31 (0.2%) | 90 (1.7%) | 300 (3.9%) | 983 (3.5%) | 6 (0.0%) |
| HMDB | 643 (12.4%) | 15,749 (12.7%) | 310 (2.5%) | – | 4745 (11.8%) | 4078 (10.0%) | 1693 (9.9%) | 877 (16.4%) | 1268 (16.5%) | 3868 (14.0%) | 14,007 (2.8%) |
| KEGG | 1286 (24.9%) | 30,098 (24.3%) | 1050 (8.5%) | 3922 (9.1%) | – | 1725 (4.2%) | 731 (4.3%) | 928 (17.4%) | 2604 (33.9%) | 16,646 (60.1%) | 84 (0.0%) |
| LIPID MAPS | 149 (2.9%) | 7832 (6.3%) | 54 (0.4%) | 4200 (9.7%) | 1862 (4.6%) | – | 622 (3.6%) | 311 (5.8%) | 377 (4.9%) | 1893 (6.8%) | 13,478 (2.7%) |
| MetaCyc | 212 (4.1%) | 20,183 (16.3%) | 31 (0.3%) | 1967 (4.6%) | 851 (2.1%) | 648 (1.6%) | – | 1266 (23.7%) | 340 (4.4%) | 7703 (27.8%) | 326 (0.1%) |
| Reactome | 156 (3.0%) | 5833 (4.7%) | 41 (0.3%) | 620 (1.4%) | 588 (1.5%) | 254 (0.6%) | 717 (4.2%) | – | 368 (4.8%) | 542 (2.0%) | 146 (0.0%) |
| SABIO-RK | 864 (16.7%) | 10,413 (8.4%) | 324 (2.6%) | 1456 (3.4%) | 3127 (7.8%) | 390 (1.0%) | 342 (2.0%) | 781 (14.6%) | – | 2692 (9.7%) | 55 (0.0%) |
| SEED | 1824 (35.3%) | 32,212 (26.0%) | 1020 (8.3%) | 4971 (11.5%) | 18,489 (45.9%) | 1915 (4.7%) | 7580 (44.2%) | 985 (18.4%) | 2641 (34.4%) | – | 233 (0.0%) |
| SLM | 55 (1.1%) | 4964 (4.0%) | 4 (0.0%) | 12,354 (28.6%) | 94 (0.2%) | 10,634 (26.1%) | 289 (1.7%) | 225 (4.2%) | 44 (0.6%) | 211 (0.8%) | – |
Percentage of IDs in the database (column) that get mapped to more than one ID in the database in the corresponding row using database names as a bridge. Blue boxes are used to highlight highest numbers, while red boxes indicate lowest numbers.
| Database | BiGG | ChEBI | enviPath | HMDB | KEGG | LIPID MAPS | MetaCyc | Reactome | SABIO-RK | SEED | SLM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BiGG | – | 2.9 | 1.3 | 3.0 | 3.6 | 3.2 | 1.4 | 0.6 | 2.9 | 2.7 | 1.6 |
| ChEBI | 76.3 | – | 67.0 | 38.1 | 38.3 | 34.3 | 58.7 | 81.3 | 78.7 | 37.3 | 26.9 |
| enviPath | 6.3 | 6.5 | – | 8.2 | 6.1 | 0.0 | 0.0 | 12.2 | 7.7 | 4.6 | 0.0 |
| HMDB | 10.7 | 11.5 | 6.8 | – | 7.3 | 4.3 | 13.2 | 22.8 | 12.8 | 7.4 | 0.7 |
| KEGG | 17.0 | 15.2 | 11.1 | 28.5 | – | 10.2 | 18.5 | 34.5 | 19.6 | 12.4 | 33.3 |
| LIPID MAPS | 8.7 | 9.8 | 1.9 | 1.8 | 3.2 | – | 4.2 | 13.2 | 4.5 | 3.2 | 0.8 |
| MetaCyc | 0.5 | 3.9 | 0.0 | 2.5 | 3.9 | 2.0 | – | 6.0 | 4.1 | 1.5 | 0.6 |
| Reactome | 42.3 | 41.4 | 51.2 | 49.0 | 51.4 | 24.4 | 38.9 | – | 49.5 | 43.2 | 47.9 |
| SABIO-RK | 0.0 | 4.5 | 0.0 | 0.0 | 3.8 | 1.0 | 3.8 | 2.2 | – | 3.3 | 1.8 |
| SEED | 3.0 | 6.0 | 0.9 | 2.0 | 2.4 | 2.2 | 3.1 | 8.9 | 5.3 | – | 1.7 |
| SLM | 7.3 | 37.2 | 25.0 | 12.3 | 18.1 | 22.3 | 10.4 | 24.4 | 20.5 | 9.5 | – |
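Bridge-based mapping of the kind summarized above can be sketched as a join on a shared key. This is an illustrative sketch only: the function names and toy pairs are hypothetical, and it assumes the percentage is taken over source IDs that map to at least one target ID, a denominator the captions do not specify. The bridge key can be a compound name or a MetaNetX (MNXRef) ID.

```python
from collections import defaultdict


def map_via_bridge(source_pairs, target_pairs):
    """Map source IDs to target IDs that share at least one bridge key.

    Both arguments are iterables of (bridge_key, identifier) tuples.
    Source IDs with no matching key are left out of the result.
    """
    targets_by_key = defaultdict(set)
    for key, target_id in target_pairs:
        targets_by_key[key].add(target_id)

    mapping = defaultdict(set)
    for key, source_id in source_pairs:
        if key in targets_by_key:
            mapping[source_id] |= targets_by_key[key]
    return dict(mapping)


def pct_one_to_many(mapping):
    """Percentage of mapped source IDs that link to more than one target ID."""
    if not mapping:
        return 0.0
    many = sum(1 for targets in mapping.values() if len(targets) > 1)
    return 100.0 * many / len(mapping)


# Hypothetical toy data: bridge keys stand in for names or MNXRef IDs.
source_pairs = [
    ("key_a", "src:1"),
    ("key_a2", "src:1"),  # a second key for the same source ID
    ("key_b", "src:2"),
    ("key_c", "src:3"),   # no counterpart in the target database
]
target_pairs = [
    ("key_a", "tgt:A"),
    ("key_b", "tgt:B"),
    ("key_b", "tgt:C"),   # the same key links to two target IDs
    ("key_d", "tgt:D"),
]

mapping = map_via_bridge(source_pairs, target_pairs)
print(mapping)
print(pct_one_to_many(mapping))
```

In this toy case two of the three source IDs are mapped, and one of those two links to more than one target ID.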
Number of IDs in one database (column) that map to IDs in the database in the corresponding row using MetaNetX as a bridge. Percentages indicate fraction of IDs in the initial database. Blue boxes are used to highlight highest numbers, while red boxes indicate lowest numbers.
| Database | BiGG | ChEBI | enviPath | HMDB | KEGG | LIPID MAPS | MetaCyc | Reactome | SABIO-RK | SEED | SLM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BiGG | – | 2064 (2.1%) | 232 (2.1%) | 1469 (3.5%) | 1781 (4.7%) | 533 (1.3%) | 1715 (10.1%) | 609 (29.6%) | 1180 (15.7%) | 2652 (9.9%) | 221 (0.0%) |
| ChEBI | 2064 (40.8%) | – | 1424 (12.8%) | 8775 (20.7%) | 19,244 (51.0%) | 5464 (13.5%) | 9019 (53.1%) | 1242 (60.3%) | 3252 (43.3%) | 17,649 (65.6%) | 3848 (0.8%) |
| enviPath | 232 (4.6%) | 1424 (1.5%) | – | 549 (1.3%) | 1093 (2.9%) | 166 (0.4%) | 733 (4.3%) | 120 (5.8%) | 377 (5.0%) | 1123 (4.2%) | 23 (0.0%) |
| HMDB | 1469 (29.0%) | 8775 (9.1%) | 549 (5.0%) | – | 5028 (13.3%) | 5387 (13.3%) | 3283 (19.3%) | 788 (38.3%) | 1804 (24.0%) | 5021 (18.7%) | 9870 (2.0%) |
| KEGG | 1781 (35.2%) | 19,244 (19.9%) | 1093 (9.9%) | 5028 (11.9%) | – | 2397 (5.9%) | 7030 (41.4%) | 926 (45.0%) | 2651 (35.3%) | 16,791 (62.4%) | 375 (0.1%) |
| LIPID MAPS | 533 (10.5%) | 5464 (5.6%) | 166 (1.5%) | 5387 (12.7%) | 2397 (6.4%) | – | 2056 (12.1%) | 325 (15.8%) | 719 (9.6%) | 2807 (10.4%) | 10,076 (2.0%) |
| MetaCyc | 1715 (33.9%) | 9019 (9.3%) | 733 (6.6%) | 3283 (7.8%) | 7030 (18.6%) | 2056 (5.1%) | – | 877 (42.6%) | 2538 (33.8%) | 11,502 (42.8%) | 655 (0.1%) |
| Reactome | 609 (12.0%) | 1242 (1.3%) | 120 (1.1%) | 788 (1.9%) | 926 (2.5%) | 325 (0.8%) | 877 (5.2%) | – | 705 (9.4%) | 1006 (3.7%) | 200 (0.0%) |
| SABIO-RK | 1180 (23.3%) | 3252 (3.4%) | 377 (3.4%) | 1804 (4.3%) | 2651 (7.0%) | 719 (1.8%) | 2538 (14.9%) | 705 (34.3%) | – | 2915 (10.8%) | 253 (0.1%) |
| SEED | 2652 (52.4%) | 17,649 (18.2%) | 1123 (10.1%) | 5021 (11.9%) | 16,791 (44.5%) | 2807 (6.9%) | 11,502 (67.7%) | 1006 (48.9%) | 2915 (38.8%) | – | 647 (0.1%) |
| SLM | 221 (4.4%) | 3848 (4.0%) | 23 (0.2%) | 9870 (23.3%) | 375 (1.0%) | 10,076 (24.9%) | 655 (3.9%) | 200 (9.7%) | 253 (3.4%) | 647 (2.4%) | – |
Figure 3. Number of mappings using the two approaches: the x axis shows the number of mappings obtained using MNXRef ID as a bridge; the y axis shows the number of mappings via name. Each red dot indicates a mapping between a pair of databases; points in blue indicate mappings from a database to itself. Mapping results from/to SLM are not shown in the plot due to the high number of matches outside the considered range.
Figure 4. Visualization of the inter-database inconsistency. An ID from BiGG (in yellow) can link to many other IDs in ChEBI (red) when using MetaNetX ID (green) for the mapping.
Percentage of IDs in the database in the column that get mapped to more than one ID in the database in the corresponding row using MetaNetX as a bridge. Blue boxes are used to highlight highest numbers, while red boxes indicate lowest numbers.
| Database | BiGG | ChEBI | enviPath | HMDB | KEGG | LIPID MAPS | MetaCyc | Reactome | SABIO-RK | SEED | SLM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BiGG | – | 3.9 | 5.2 | 3.5 | 3.9 | 3.2 | 4.0 | 3.8 | 4.5 | 3.2 | 2.7 |
| ChEBI | 83.1 | – | 56.2 | 39.7 | 36.4 | 37.8 | 64.7 | 76.8 | 72.2 | 39.4 | 27.8 |
| enviPath | 9.9 | 10.6 | – | 12.0 | 8.1 | 8.4 | 7.6 | 14.2 | 11.1 | 8.1 | 8.7 |
| HMDB | 19.1 | 6.8 | 12.6 | – | 9.3 | 5.1 | 12.7 | 26.4 | 17.2 | 9.7 | 1.6 |
| KEGG | 15.0 | 10.0 | 11.0 | 22.1 | – | 8.4 | 9.6 | 19.7 | 17.5 | 11.2 | 14.7 |
| LIPID MAPS | 10.5 | 2.8 | 6.0 | 2.5 | 4.5 | – | 4.6 | 14.5 | 7.0 | 4.2 | 0.5 |
| MetaCyc | 3.6 | 1.4 | 3.4 | 2.1 | 1.7 | 0.9 | – | 4.3 | 2.8 | 1.1 | 0.5 |
| Reactome | 42.7 | 33.3 | 45.0 | 32.4 | 37.1 | 28.9 | 36.7 | – | 41.7 | 37.2 | 35.0 |
| SABIO-RK | 8.1 | 4.7 | 5.6 | 6.1 | 5.3 | 3.8 | 5.6 | 9.2 | – | 5.1 | 4.7 |
| SEED | 8.4 | 3.5 | 4.2 | 4.6 | 3.7 | 3.6 | 5.7 | 12.1 | 9.5 | – | 5.1 |
| SLM | 5.0 | 1.1 | 0.0 | 0.2 | 5.6 | 0.4 | 2.7 | 6.0 | 5.1 | 3.2 | – |
Examples of mapping inconsistencies.
| Abbreviation | Database | IDs in Database | MetaNetX ID | Compound(s) |
|---|---|---|---|---|
| suc | MetaCyc | SUC | MNXM25 | succinate |
| suc | Reactome | 188980 | MNXM167 | sucrose |
| H | MetaCyc | PROTON | MNXM1 | proton |
| H | MetaCyc | HIS | MNXM134 | L-histidine |
| tmp | BiGG | tmp | MNXM87343 | TMP |
| tmp | ChEBI | 10529 | MNXM257 | Thymidine monophosphate |
| tmp | KEGG | C01081 | MNXM662 | Thiamine monophosphate |
| tmp | MetaCyc | CPD-610 | MNXM88031 | cyclo-triphosphoric acid |
| PPP | Reactome | 1475054 | MNXM3109 | triphosphate ion |
| PPP | MetaCyc | 2-PHENYL-2-1-PIPERDINYLPROPANE | MNXM150634 | 2-phenyl-2-1piperdinylpropane |
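Conflicts like those listed above can be flagged automatically before manual verification. The sketch below groups the rows of the table by abbreviation and reports abbreviations that resolve to more than one MetaNetX ID; the function name is hypothetical and the check is an illustration, not the procedure used in the study.

```python
from collections import defaultdict

# Rows copied from the table above: (abbreviation, database, database ID, MetaNetX ID).
rows = [
    ("suc", "MetaCyc", "SUC", "MNXM25"),
    ("suc", "Reactome", "188980", "MNXM167"),
    ("H", "MetaCyc", "PROTON", "MNXM1"),
    ("H", "MetaCyc", "HIS", "MNXM134"),
    ("tmp", "BiGG", "tmp", "MNXM87343"),
    ("tmp", "ChEBI", "10529", "MNXM257"),
    ("tmp", "KEGG", "C01081", "MNXM662"),
    ("tmp", "MetaCyc", "CPD-610", "MNXM88031"),
    ("PPP", "Reactome", "1475054", "MNXM3109"),
    ("PPP", "MetaCyc", "2-PHENYL-2-1-PIPERDINYLPROPANE", "MNXM150634"),
]


def conflicting_abbreviations(rows):
    """Return abbreviations that resolve to more than one MetaNetX ID."""
    mnx_by_abbrev = defaultdict(set)
    for abbrev, _database, _database_id, mnx_id in rows:
        mnx_by_abbrev[abbrev].add(mnx_id)
    return {
        abbrev: sorted(mnx_ids)
        for abbrev, mnx_ids in mnx_by_abbrev.items()
        if len(mnx_ids) > 1
    }


conflicts = conflicting_abbreviations(rows)
for abbrev, mnx_ids in conflicts.items():
    print(abbrev, len(mnx_ids), mnx_ids)
```

Every abbreviation in the table is flagged, with "tmp" resolving to four distinct MetaNetX IDs. Such a check only narrows down what needs inspection; deciding which compound is actually meant still requires manual verification.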