Jan Krumsiek, Karsten Suhre, Anne M Evans, Matthew W Mitchell, Robert P Mohney, Michael V Milburn, Brigitte Wägele, Werner Römisch-Margl, Thomas Illig, Jerzy Adamski, Christian Gieger, Fabian J Theis, Gabi Kastenmüller.
Abstract
Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable number of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these "unknown metabolites" is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype-metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.
Year: 2012 PMID: 23093944 PMCID: PMC3475673 DOI: 10.1371/journal.pgen.1003005
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. Data integration workflow for the systematic classification of unknown metabolites.
We combine high-throughput metabolomics and genotyping data in Gaussian graphical models (GGMs) [21] and in genome-wide association studies (GWAS) [5] in order to produce testable predictions of the unknown metabolites' identities. These hypotheses are then subjected to experimental verification by mass spectrometry. Six such cases have been fully worked through and are presented in Table 3.
Six specific scenarios and their experimental validations.
| Scenario name | Unknowns | Evidence used | Prediction | Validated as |
| DIPEPTIDE | X-14208 | GGM, genetics | Phe-Ser or Ser-Phe | Phe-Ser |
| | X-14205 | | Glu-Tyr or Tyr-Glu | α-Glu-Tyr |
| | X-14778 | | Phe-Phe | Phe-Phe |
| STEROID | X-11244 | GGM, genetics | sulfated androsterone | androstene disulfate |
| HETE | X-12441 | GGM, pathway | hydroxy-arachidonate (HETE) | 12-HETE |
| CARNITINE | X-11421 | GGM, genetics, pathway | carnitine species, with 6 to 10 carbon atoms | cis-4-decenoyl-carnitine |
| | X-13431 | | | nonanoyl carnitine* |
| BILIRUBIN | X-11793 | GGM, genetics | oxidized bilirubin variant | oxidized bilirubin variant* |
| ASCORBATE | X-11593 | GGM, genetics, pathway | O-methylascorbate | O-methylascorbate* |
We investigated six scenarios that included a total of nine unknown metabolites. The first three scenarios, DIPEPTIDE, STEROID, and HETE are discussed in the main text of this paper; the remaining three scenarios, CARNITINE, BILIRUBIN and ASCORBATE, are discussed in Text S3. Predictions marked by * are confirmed by exact mass, fragmentation pattern and chromatographic retention time; however, validation using a pure standard compound as a reference is pending since these compounds are presently commercially unavailable in pure form.
Figure 2. Manhattan plot of genetic association.
The strength of association for known (bottom) and unknown (top) metabolites is indicated as the negative logarithm of the p-value for the linear model (see Methods). Only metabolite-SNP associations with p-values below 10⁻⁶ are plotted (grey circles). Triangles represent metabolite-SNP associations with p-values below 10⁻⁴⁰. Horizontal lines indicate the threshold for genome-wide significance (p = 1.6×10⁻¹⁰, corresponding to α = 0.05 after Bonferroni correction); red vertical dashes indicate loci at which this threshold is attained.
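The Bonferroni threshold quoted above can be illustrated with a short calculation. This is a minimal sketch, assuming that every SNP is tested against every metabolic trait and using the raw counts from the abstract (655,658 variants, 517 traits); the function name `bonferroni_threshold` is introduced here for illustration only. With these counts the result is roughly 1.5×10⁻¹⁰, slightly below the reported 1.6×10⁻¹⁰. The reported value corresponds to about 3.1×10⁸ tests, so the analysed SNP set presumably differs from the raw count (an assumption; the exact number of tests is given in the Methods, which are not reproduced here).

```python
def bonferroni_threshold(alpha: float, n_snps: int, n_traits: int) -> float:
    """Per-test significance threshold when each SNP is tested against each trait."""
    return alpha / (n_snps * n_traits)


# Counts taken from the abstract; the SNP set actually analysed may be smaller.
threshold = bonferroni_threshold(alpha=0.05, n_snps=655_658, n_traits=517)

# Number of tests implied by the reported threshold of 1.6e-10.
implied_tests = 0.05 / 1.6e-10

print(f"threshold from abstract counts: {threshold:.2e}")
print(f"tests implied by reported threshold: {implied_tests:.3g}")
```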
Genome-wide significant associations (p < 1.6×10⁻¹⁰) involving unknown metabolites.
| Locus | Locus Info | Lead-SNP | Chr | BP | Metabolite | P-value | Published GWAS hits (metabolic traits) | Published GWAS hits (further phenotypes) |
| PYROXD2 | pyridine nucleotide-disulfide oxidoreductase domain 2 | rs4488133 | 10 | 100149126 | X-12092 | 2.2×10⁻²⁸¹ | trimethylamine (urine)/dimethylamine (plasma) | - |
| | | | | | X-12093 | 1.4×10⁻²⁷ | | |
| SLCO1B1 | organic anion transporter family, bile acids | rs4149056 | 12 | 21269288 | X-11529 | 3.3×10⁻⁸¹ | eicosenoate/tetradecanedioate | statin response |
| | | | | | X-11538 | 1.4×10⁻³⁷ | | |
| | | | | | X-13429 | 4.9×10⁻²² | | |
| | | | | | X-12063 | 5.2×10⁻²⁰ | | |
| | | | | | X-12456 | 8.4×10⁻¹⁷ | | |
| | | | | | X-14626 | 2.1×10⁻¹³ | | |
| SLC22A2 | solute carrier family 22 (organic cation transporter), member 2 | rs316020 | 6 | 160589071 | X-12798 | 1.7×10⁻⁷² | | - |
| NAT8 | N-acetyltransferase 8 | rs7598396 | 2 | 73672444 | X-12510 | 1.5×10⁻⁵⁶ | N-acetylornithine | chronic kidney disease |
| | | | | | X-11787 | 3.0×10⁻³⁷ | | |
| | | | | | X-12093 | 8.9×10⁻²² | | |
| COMT | catechol-O-methyltransferase | rs4680 | 22 | 18331271 | X-11593 | 1.1×10⁻⁴⁸ | | - |
| | | | | | X-01911 | 5.8×10⁻¹¹ | | |
| CYP3A5 | cytochrome P450, family 3, subfamily A, polypeptide 5 | rs10242455 | 7 | 99078115 | X-12063 | 1.5×10⁻⁴⁵ | | - |
| SULT2A1 | sulfotransferase family, cytosolic, 2A, dehydroepiandrosterone-preferring | rs296391 | 19 | 53060346 | X-11440 | 1.7×10⁻⁴³ | dehydroepiandrosterone sulfate | - |
| | | | | | X-11244 | 2.1×10⁻²⁶ | | |
| UGT1A | UDP glucuronosyltransferase 1 family, polypeptide A complex locus | rs6742078 | 2 | 234333309 | X-11530 | 2.1×10⁻³⁸ | bilirubin (E,E)/oleoylcarnitine | circulating cell-free DNA |
| | | | | | X-11441 | 5.6×10⁻³⁰ | | |
| | | | | | X-11793 | 2.6×10⁻²⁶ | | |
| | | | | | X-11442 | 1.2×10⁻²⁵ | | |
| ACADL | acyl-CoA dehydrogenase, long-chain | rs2286963 | 2 | 210768295 | X-13431 | 2.7×10⁻³³ | C9/C10:2 | |
| ACADM | acyl-CoA dehydrogenase, medium-chain | rs12134854 | 1 | 75879263 | X-11421 | 1.9×10⁻²⁷ | C12/C8 | - |
| CYP2C18 | cytochrome P450, family 2, subfamily C, polypeptide 18 | rs7896133 | 10 | 96454720 | X-11787 | 4.0×10⁻²⁶ | | warfarin maintenance dose |
| GBA3 | glucosidase, beta, acid 3 (cytosolic) | rs358231 | 4 | 22429602 | X-11799 | 2.9×10⁻¹⁷ | | - |
| ACE | angiotensin I converting enzyme (peptidyl-dipeptidase A) 1 | rs4343 | 17 | 58917190 | X-14189 | 1.5×10⁻¹⁶ | aspartylphenylalanine | angiotensin-converting enzyme activity |
| | | | | | X-14208 | 4.6×10⁻¹⁵ | | |
| | | | | | X-14205 | 4.0×10⁻¹⁴ | | |
| | | | | | X-14304 | 2.7×10⁻¹² | | |
| UGT3A1 | UDP glycosyltransferase 3 family, polypeptide A1 | rs13358334 | 5 | 36025563 | X-11445 | 2.4×10⁻¹² | | - |
| - | [no known gene locus] | rs12413935 | 10 | 85443900 | X-06226 | 4.0×10⁻¹¹ | | - |
We observe associations at 15 genetic loci that involve genes from various biological processes. Note that most of these genes code for proteins that are related to metabolic activities in the body, thereby providing information that allows us to derive concrete hypotheses on the biochemical identity of each unknown. Previously published associations with known metabolites and other phenotypic traits (derived from the GWAS catalog [26]) provide further evidence on specific parts of a pathway in which the unknown might be involved. Chromosomal locations are reported with respect to the positive strand of human genome build 36.1.
Figure 3. Gaussian graphical modeling.
GGMs embed unknown metabolites into their biochemical context. A: Complete network presentation of partial correlations that are significantly different from zero at α = 0.05 after Bonferroni correction. The unknown metabolites are spread over the entire network and are involved in various metabolic pathways. B–D: Selected high-scoring sub-networks. We observe that GGM edges directly correspond to chemical reactions which alter specific chemical groups (e.g. carbonyl groups and methyl groups). Solid lines denote positive partial correlation. Dashed lines indicate negative partial correlations. Line widths represent partial correlation strengths.
Interpretation of top-ranking partial correlation coefficients (PCC>0.5).
| Metabolite 1 | Metabolite 2 | ζ | Interpretation |
| X-11847 | X-11849 | 0.901 | biochemical link between two unknowns |
| 3-indoxyl sulfate | X-12405 | 0.840 | |
| X-11452 | X-12231 | 0.832 | biochemical link between two unknowns |
| X-12094 | X-12095 | 0.822 | biochemical link between two unknowns |
| guanosine | inosine | 0.798 | nucleosides |
| X-11441 | X-11442 | 0.760 | biochemical link between two unknowns |
| androsterone sulfate | epiandrosterone sulfate | 0.755 | steroid sulfates |
| X-11537 | X-11540 | 0.753 | biochemical link between two unknowns |
| X-02269 | X-11469 | 0.734 | biochemical link between two unknowns |
| X-11204 | X-11327 | 0.706 | biochemical link between two unknowns |
| decanoylcarnitine | octanoylcarnitine | 0.689 | β-oxidation footprints |
| linoleamide (18:2n6) | oleamide | 0.654 | C18:1/C18:2 acylamides |
| 3-methyl-2-oxovalerate | 4-methyl-2-oxopentanoate | 0.646 | branched-chain amino acid degradation |
| catecholsulfate | X-12217 | 0.601 | |
| X-14189 | X-14304 | 0.593 | biochemical link between two unknowns |
| 1,5-anhydroglucitol (1,5-AG) | X-12696 | 0.580 | |
| dehydroisoandrosterone sulfate (DHEA-S) | X-18601 | 0.575 | |
| 1-arachidonoylglycerophosphoethanolamine | X-12644 | 0.570 | |
| X-14208 | X-14478 | 0.558 | biochemical link between two unknowns |
| caffeine | paraxanthine | 0.554 | caffeine metabolism |
| X-11423 | X-12749 | 0.549 | biochemical link between two unknowns |
| 1-linoleoylglycerophosphocholine | 2-palmitoylglycerophosphocholine | 0.544 | phospholipids (PC) |
| piperine | X-01911 | 0.526 | |
| 2-hydroxypalmitate | 2-hydroxystearate | 0.523 | hydroxy fatty acids |
| X-14056 | X-14057 | 0.519 | biochemical link between two unknowns |
| 3-methyl-2-oxovalerate | isoleucine | 0.514 | isoleucine degradation |
| X-11244 | X-11443 | 0.510 | biochemical link between two unknowns |
| urea | X-09706 | 0.506 | |
| isoleucine | leucine | 0.506 | branched-chain amino acids |
| 1-arachidonoylglycerophosphoethanolamine | 1-linoleoylglycerophosphoethanolamine | 0.502 | phospholipids (PE) |
Connections between two known metabolites indicate a direct metabolic relationship, e.g. between purine nucleosides (guanosine/inosine) or steroid hormones (androsterone sulfate/epiandrosterone sulfate). A link between a known and an unknown compound therefore provides evidence for a shared metabolic pathway. For instance, the link between 3-indoxyl sulfate and X-12405 suggests a role of this unknown in tryptophan metabolism. Abbreviations: PC = phosphatidylcholine, PE = phosphatidylethanolamine, ζ = partial correlation coefficient. Italic text represents hypothetical known-unknown connections.
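To illustrate why partial correlations single out direct links, the following sketch computes full-order partial correlations from the inverse of the sample covariance matrix, which is the textbook estimator underlying a GGM. It is a hedged illustration on synthetic data, not the pipeline used in the study: preprocessing, covariate correction and significance testing are omitted, and the function name `partial_correlations` and the three simulated variables are assumptions made for this example.

```python
import numpy as np


def partial_correlations(data: np.ndarray) -> np.ndarray:
    """Full-order partial correlations from a (samples x variables) array."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Synthetic chain a -> b -> c: a and c are related only through b.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = a + rng.normal(size=2000)
c = b + rng.normal(size=2000)
data = np.column_stack([a, b, c])

corr = np.corrcoef(data, rowvar=False)  # ordinary Pearson correlations
pcor = partial_correlations(data)       # correlations conditioned on the rest

print("Pearson a-c:", round(float(corr[0, 2]), 2))
print("partial a-c:", round(float(pcor[0, 2]), 2))
print("partial a-b:", round(float(pcor[0, 1]), 2))
```

In this toy chain the Pearson correlation between `a` and `c` is clearly non-zero, while their partial correlation is close to zero; only the direct neighbors `a`-`b` and `b`-`c` keep a strong partial correlation.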
Figure 4. Semi-automatic prediction of unknown metabolite identities.
A: Examples of how to determine pathway classifications based on the functional annotations of GGM and GWAS hits. We present two metabolites, X-11421 and X-11244, whose GGM and GWAS associations clearly point to carnitine and steroid metabolism, respectively. B: Overview of unknowns functionally annotated by both the GGM and the GWAS approach. ‘GGM’ refers to an unknown metabolite that is three or fewer steps away from a known metabolite in the GGM, whereas ‘direct GGM’ represents direct neighbors in the network. C: Pathway predictions for the 16 unknowns with both direct GGM and GWAS annotations. Unknowns marked with a star were subsequently subjected to in-depth analysis and experimental validation.
Figure 5. Detailed investigation of three scenarios (DIPEPTIDE, STEROID, and HETE).
In order to generate concrete hypotheses on the unknowns' identities, we assembled all available information for each scenario. This includes biochemical edges from the GGM, genetic associations from the GWAS, pathway annotations as well as mass information. For details of the predicted identities, see Table 3 and main text. Similar figures for three further scenarios (CARNITINE, BILIRUBIN, and ASCORBATE) are available in Text S3.
Figure 6. Experimental confirmation of X-14208 as phenylalanylserine.
Two possible dipeptide variants were predicted and consequently tested. The fragmentation spectrum of the 253.1 m/z ion (positive mode) of the pure Phe-Ser matches that of the unknown compound, whereas the spectrum for pure Ser-Phe differs visibly. Moreover, the retention index (RI) of Phe-Ser is similar to the RI of X-14208, whereas that of Ser-Phe is significantly different.
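The 253.1 m/z value and the need for fragmentation and retention data can be illustrated with a short mass calculation. This is a minimal sketch, assuming the positive-mode ion is the singly protonated molecule [M+H]⁺ and using standard monoisotopic atomic masses; the helper names `mass` and `dipeptide_mz` are introduced here for illustration. Both sequence orders give the same m/z, so mass alone cannot distinguish Phe-Ser from Ser-Phe.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, and the proton mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727647

# Elemental compositions of the free amino acids and of water.
PHE = {"C": 9, "H": 11, "N": 1, "O": 2}
SER = {"C": 3, "H": 7, "N": 1, "O": 3}
WATER = {"H": 2, "O": 1}


def mass(formula: dict) -> float:
    """Monoisotopic mass of an elemental composition."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def dipeptide_mz(first: dict, second: dict) -> float:
    """m/z of [M+H]+ for a linear dipeptide: two amino acids minus one water, plus a proton."""
    return mass(first) + mass(second) - mass(WATER) + PROTON


phe_ser = dipeptide_mz(PHE, SER)
ser_phe = dipeptide_mz(SER, PHE)
print(f"Phe-Ser [M+H]+: {phe_ser:.1f}")
print(f"Ser-Phe [M+H]+: {ser_phe:.1f}")
```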