Henning Redestig, Miyako Kusano, Atsushi Fukushima, Fumio Matsuda, Kazuki Saito, Masanori Arita.
Abstract
BACKGROUND: Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such meta-data on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of different referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious.
Year: 2010 PMID: 20426876 PMCID: PMC2879285 DOI: 10.1186/1471-2105-11-214
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The MetMask concept. Metabolite identifier consolidation using MetMask. (A) A local database is created by importing public databases as well as platform-specific reference libraries (Ref Lib) that list all relevant analytes and the parent metabolites to which they correspond. (B) The created database can be used to rapidly extract identifiers and meta-data to enable summarization and contextual analysis such as pathway projections.
Figure 2. Metabolite identifier integration. Strategy for integrating metabolite identifiers. (1) The reference library RefLib specifies an identifier and its links. Upon import to the MetMask database, these identifiers are assigned to Group 1. (2) The KEGG entry c00037 links to both the CAS number 56-40-6 and the synonym glycine and is therefore merged into Group 1. (3) The PlantCyc entry gmp overlaps with the identifiers in Group 1 but only in a single synonym and is therefore assigned to a new identifier group, Group 2. Group 2 retains the mapping to the conflicting synonym G but has this link annotated as weak.
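The integration strategy in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not MetMask's actual implementation: the `Group` class, the `integrate` function, the `min_overlap=2` threshold and the reference library identifier `R1` are all assumptions made to reproduce the three steps of the caption.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One metabolite group holding (identifier type, value) pairs."""
    gid: int
    strong: set = field(default_factory=set)  # ordinary links
    weak: set = field(default_factory=set)    # links flagged as conflicting


def integrate(groups, entry, min_overlap=2):
    """Merge `entry` into a group that shares at least `min_overlap`
    identifiers with it; otherwise open a new group and mark any
    identifiers already claimed by another group as weak.

    The threshold of two shared identifiers is an assumption chosen to
    match the example in Figure 2.
    """
    entry = set(entry)
    for group in groups:
        if len(entry & group.strong) >= min_overlap:
            group.strong |= entry
            return group
    conflicting = set()
    for group in groups:
        conflicting |= entry & group.strong
    new_group = Group(gid=len(groups) + 1,
                      strong=entry - conflicting,
                      weak=conflicting)
    groups.append(new_group)
    return new_group


groups = []
# (1) Reference library entry ("R1" is a made-up identifier).
integrate(groups, {("rlib", "R1"), ("cas", "56-40-6"),
                   ("synonym", "glycine"), ("synonym", "G")})
# (2) KEGG entry shares the CAS number and a synonym: merged into Group 1.
integrate(groups, {("kegg", "c00037"), ("cas", "56-40-6"),
                   ("synonym", "glycine")})
# (3) PlantCyc entry shares only the synonym "G": new Group 2, weak link.
integrate(groups, {("cycdb", "gmp"), ("synonym", "G")})
```

After these three calls, the sketch holds two groups, and the synonym `G` is attached to the second group only as a weak link.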
Parsers.
| Name | Resource | Format | Imported identifier types |
|---|---|---|---|
| simple | User provided | Comma separated text file | File specific |
| sdf | The NIST library | SDF chemical information file | NIST number, CAS, Synonyms, Sum formula |
| mpimp | MPIMP MS library | NIST MS export file | Name, KEGG, Synonyms, CAS |
| cycdb | Any *Cyc database | compounds.dat file | Frame ID, CAS, Synonyms, SMILES, InChI, KEGG |
| cyc | Any *Cyc database | compounds dump file | Synonyms, CAS, KEGG, SMILES, Sum formula, Pathway |
| kegg | KEGG | compounds file (local or via FTP) | KEGG ID, Synonyms, CAS, Sum formula, ChEBI, KNApSAcK, Pathway, PubChem SID |
| chebi | ChEBI | online database (SOAP) | ChEBI ID, IUPAC Name, CAS, KEGG, InChI, SMILES, Sum formula, Synonyms |
| metabocard | HMDB | metabocards.txt | BioCyc, CAS, ChEBI, Sum formula, HMDB, InChI, IUPAC, KEGG, Metlin, Synonym, PubChem SID, PubChem CID |
The currently provided parsers for importing metabolite information. File format definitions can be found in the user manual. The imported identifier types indicate the identifiers that are extracted from the source file.
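To give a feel for what a parser such as `simple` has to do, the following sketch reads a comma-separated file and turns each row into a set of (identifier type, value) pairs. The layout assumed here (a header row naming the identifier types, with `|` separating multiple values in one cell) is hypothetical; the real file format is defined in the user manual, and `parse_simple` is not a MetMask function.

```python
import csv
import io


def parse_simple(handle, multi_sep="|"):
    """Yield one set of (identifier type, value) pairs per data row.

    Assumes a header row with identifier type names; empty cells are
    skipped and cells may hold several values separated by `multi_sep`.
    """
    for row in csv.DictReader(handle):
        identifiers = set()
        for id_type, cell in row.items():
            if id_type is None:  # surplus cells beyond the header are ignored
                continue
            for value in (cell or "").split(multi_sep):
                value = value.strip()
                if value:
                    identifiers.add((id_type.strip(), value))
        yield identifiers


# Toy input with made-up reference library identifiers R1 and R2.
example = io.StringIO(
    "rlib,cas,synonym\n"
    "R1,56-40-6,glycine|Gly\n"
    "R2,,alanine\n"
)
entries = list(parse_simple(example))
```

Each yielded set can then be handed to whatever routine assigns entries to metabolite groups.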
The sources for the provided database.
| Name | Source | Synchronization mode | Parser |
|---|---|---|---|
| PRIMe chemical standards | In-house database | No | simple |
| RIKEN MS Library | | No | riken |
| MPIMP MS library | Personal communication | No | mpimp |
| PlantCyc compounds.dat | | Yes | cycdb |
| KEGG Compounds/Pathways | | Yes | kegg |
| ChEBI | | Yes | chebi |
The sources used to build the provided database. Each source contains one or more identifier types. In synchronization mode, a source only adds data to metabolite groups that already exist in the database; it does not create new groups.
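The effect of synchronization mode can be illustrated with a small sketch. It is deliberately simplified and is not MetMask's import routine: groups are plain Python sets, any single shared identifier counts as a match, and the function name `import_source` and the identifier `C99999` are made up for the example.

```python
def import_source(groups, entries, synchronize):
    """Import `entries` (iterables of (type, value) pairs) into `groups`.

    With synchronize=True, entries that match no existing group are
    dropped; with synchronize=False they start a new group.
    Simplifying assumption: one shared identifier is enough to match.
    """
    for entry in entries:
        entry = set(entry)
        target = next((g for g in groups if g & entry), None)
        if target is not None:
            target |= entry          # extend the existing group in place
        elif not synchronize:
            groups.append(entry)     # only allowed outside synchronization mode
    return groups


groups = [{("rlib", "R1"), ("cas", "56-40-6")}]
public_db = [
    {("kegg", "C00037"), ("cas", "56-40-6")},  # matches the existing group
    {("kegg", "C99999")},                      # hypothetical, matches nothing
]
import_source(groups, public_db, synchronize=True)
```

In this toy run the database still holds a single group afterwards, now extended with the KEGG identifier, which keeps large public databases from flooding the local database with metabolites that no reference library refers to.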
Statistics of the provided database.
| Identifier type | Identifier name | Number of identifiers |
|---|---|---|
| Groups | _id | 1439 |
| PRIMe chemical standards | rlib | 1287 |
| RIKEN MS Library | riken | 241 |
| Synonym | synonym | 11180 |
| Sum-formula | formula | 951 |
| CAS | cas | 2416 |
| KEGG Compounds | kegg | 1297 |
| KEGG Pathway | pathway | 184 |
| PubChem Compound | cid | 1857 |
| PubChem Substance | sid | 1077 |
| IUPAC Names | iupac | 1928 |
| SMILES | smiles | 2666 |
| InChI | inchi | 1668 |
| KNApSAcK | knapsack | 671 |
| KaPPA-View | kappav | 261 |
| LipidBank | lipidbank | 127 |
| Lipid maps | lipidmaps | 178 |
| ChEBI | chebi | 1177 |
| Chemspider | chemspider | 1001 |
| MPIMP MS library | mpimp | 3439 |
| PlantCyc Frame ID | cycdb | 495 |
Statistics of the provided database. The number of groups is the total number of distinct metabolite groups constructed. Each group gathers one or more identifiers of the listed identifier types.
Comparison of cross-referencing performance on the example data set.
| Databases | CAS Registry number | KEGG ID | InChI | Any identifier |
|---|---|---|---|---|
| Only reference libraries | 124 | 58 | 0 | 124 |
| PlantCyc | 125 | 82 | 74 | 125 |
| GMD | 199 | 146 | 0 | 202 |
| ChEBI | 124 | 58 | 54 | 124 |
| KEGG | 131 | 111 | 0 | 131 |
| PRIMe chemical standards | 168 | 158 | 166 | 168 |
| All (MetMask) | 235 | 222 | 231 | 238 |
Comparison of cross-referencing performance for the 251 identifiers and synonyms found in the example data set. Local reference libraries were combined with the sources listed in column Databases via MetMask and used to convert local identifiers to CAS, KEGG and InChI identifiers. The table lists the number of successfully converted identifiers. Conversion to "Any identifier" indicates the number of local identifiers that could be converted to any other type of identifier (e.g., SMILES, synonym, IUPAC name, etc.). Using all resources together, as in the MetMask approach, gives the best conversion performance: 238 of the 251 local identifiers could be converted to at least one other identifier type.
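Counts of the kind reported in this table could be computed along the following lines. This is a sketch under stated assumptions, not the procedure used by the authors: groups are modelled as sets of (identifier type, value) pairs, `conversion_counts` is a hypothetical helper, and the data are toy values rather than the example data set.

```python
def conversion_counts(groups, local_ids, targets=("cas", "kegg", "inchi")):
    """Count how many local identifiers can be converted to each target
    identifier type, and how many can be converted to any other identifier.
    """
    counts = {target: 0 for target in targets}
    counts["any"] = 0
    for local in local_ids:
        # Types of all other identifiers that share a group with `local`.
        linked_types = {
            id_type
            for group in groups if local in group
            for (id_type, value) in group if (id_type, value) != local
        }
        for target in targets:
            if target in linked_types:
                counts[target] += 1
        if linked_types:
            counts["any"] += 1
    return counts


# Toy database: R1..R4 are made-up reference library identifiers.
groups = [
    {("rlib", "R1"), ("cas", "56-40-6"), ("kegg", "C00037")},
    {("rlib", "R2"), ("synonym", "alanine")},
    {("rlib", "R3")},
]
local_ids = [("rlib", "R1"), ("rlib", "R2"), ("rlib", "R3"), ("rlib", "R4")]
counts = conversion_counts(groups, local_ids)
```

Repeating such a count with different combinations of imported sources gives one row of the table per combination.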
Figure 3. The connection graph for the KEGG identifier "C00041". An excerpt of the connection graph associated with alanine. Identifiers for D-alanine and L-alanine have been merged since high-throughput metabolomics usually does not resolve optical isomers. Several of the connections are only available via intermediate steps, illustrating how complicated manual identifier conversions can be. Green edges come from the MPIMP MS library, gray edges from ChEBI, red edges from PlantCyc, blue edges from KEGG, and the yellow edge from our CE-MS library.
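The point about intermediate steps can be made concrete with a breadth-first search over a connection graph. The sketch below is illustrative only: `shortest_path` is a hypothetical helper, and the edges are invented for the example rather than read off the figure (in particular, the analyte name `mpimp:Alanine (2TMS)` is a placeholder).

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return the shortest chain of identifiers linking `start` to `goal`,
    or None if they are not connected. `edges` holds (a, b, source) triples
    and is treated as an undirected graph.
    """
    graph = {}
    for a, b, source in edges:
        graph.setdefault(a, []).append((b, source))
        graph.setdefault(b, []).append((a, source))
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor, _source in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Invented edges, each labelled with the database that asserts the link.
edges = [
    ("mpimp:Alanine (2TMS)", "synonym:alanine", "MPIMP MS library"),
    ("synonym:alanine", "kegg:C00041", "KEGG"),
    ("kegg:C00041", "chebi:16977", "ChEBI"),
]
path = shortest_path(edges, "mpimp:Alanine (2TMS)", "chebi:16977")
```

Here the analyte name reaches the ChEBI identifier only through a synonym and the KEGG identifier, which is the kind of multi-step conversion that is tedious to do by hand.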
Figure 4. Obtaining a consensus metabolite feature. Examples of enabled technologies. (A) MetMask makes it easy to cross-reference identifiers used in different metabolomics platforms. Once this has been done, features representing the same metabolite can be summarized using, e.g., principal component analysis to obtain a consensus data set. Here, an example is shown where the features from CE-MS and GC-MS that represent alanine are replaced by PC1. (B) MetMask can link unified identifiers to annotation databases such as PlantCyc, thereby allowing for contextual interpretation such as metabolite set enrichment analysis. The boxplot shows that the log fold changes between red and green tomatoes are higher among the nucleotide-synthesis-related metabolites than among the other metabolites.
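The summarization step in panel (A) can be sketched as follows: two features measuring the same metabolite on different platforms are replaced by their scores on the first principal component. This is a plain-Python illustration using power iteration, with invented intensities; it assumes the features are on comparable scales and positively correlated, and a real analysis would more likely rely on an established PCA implementation.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return the PC1 score of each sample.

    `rows` is a list of samples, each a list of feature values (for example
    one CE-MS and one GC-MS measurement of the same metabolite).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration for the leading eigenvector of the covariance matrix.
    # The all-ones start vector suits positively correlated features.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


# Invented alanine intensities: column 0 from CE-MS, column 1 from GC-MS.
alanine = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
consensus = first_principal_component(alanine)
```

The resulting `consensus` list holds one value per sample and can stand in for the two original alanine features in downstream analysis.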