Neda Hassanpour, Nicholas Alden, Rani Menon, Arul Jayaraman, Kyongbum Lee, Soha Hassoun.
Abstract
Mass spectrometry coupled with chromatographic separation techniques provides a powerful platform for untargeted metabolomics. Determining the chemical identities of detected compounds, however, remains a major challenge. Here, we present a novel computational workflow, termed extended metabolic model filtering (EMMF), that aims to engineer a candidate set, a listing of putative chemical identities to be used during annotation, through an extended metabolic model (EMM). An EMM includes not only canonical substrates and products of enzymes already cataloged in a database through a reference metabolic model, but also metabolites that can form due to substrate promiscuity. EMMF aims to strike a balance between discovering previously uncharacterized metabolites and the computational burden of annotation. EMMF was applied to untargeted LC-MS data collected from cultures of Chinese hamster ovary (CHO) cells and murine cecal microbiota. EMM metabolites matched, on average, 23.92% of measured masses, providing a >7-fold increase in the candidate set size when compared to a reference metabolic model. Many metabolites suggested by EMMF are not cataloged in PubChem. For the CHO cell, we experimentally confirmed the presence of 4-hydroxyphenyllactate, a metabolite predicted by EMMF that has not been previously documented as part of the CHO cell metabolic model.
Keywords: enzyme promiscuity; extended metabolic models; metabolite annotation; metabolomics
Year: 2020 PMID: 32326153 PMCID: PMC7241244 DOI: 10.3390/metabo10040160
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Figure 1. Comparison between annotation workflows. The candidate set for annotation is derived by filtering the measured masses based on: (A) the metabolic model, (B) databases, and (C) the extended metabolic model (EMM). The candidate sets in (A) and (C) are biologically relevant, while candidates in (B) prior to filtering may not all be biologically relevant.
Table 1. Size of experimental datasets and models. (A) Three experimental datasets under different conditions were collected for the CHO cell, and two for the gut microbiota sample. (B) The size of the metabolic model: number of reactions, metabolites, and unique masses. (C) The size of the extended metabolic model: number of operators derived using PROXIMAL, unique derivatives generated by PROXIMAL, and unique derivative masses due to PROXIMAL. For comparison purposes, the numbers of derivatives and derivative masses exclude those in the metabolic model. (D) Fold increase in the number of metabolites and masses when comparing the size of these sets for the EMM against the metabolic model.
| (A) Biological Sample | (A) Dataset | (A) MS Mode | (A) Number of Measured Masses | (B) Number of Reactions | (B) Number of Metabolites | (B) Number of Unique Masses | (C) Number of Unique Operators | (C) Number of Unique Derivatives | (C) Number of Unique Derivative Masses | (D) Fold Increase in Number of Metabolites | (D) Fold Increase in Number of Unique Masses |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CHO cell | HilNeg | negative | 2502 | 1619 | 1353 | 775 | 2392 | 76745 | 17930 | 56.72 | 23.14 |
| | HilPos | positive | 3856 | | | | | | | | |
| | SynNeg | negative | 5336 | | | | | | | | |
| gut microbiota | Neg | negative | 1651 | 1381 | 1307 | 779 | 2756 | 94186 | 23356 | 72.06 | 29.98 |
| | Pos | positive | 1657 | | | | | | | | |
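The fold increases in column group (D) can be reproduced from the counts in (B) and (C). The sketch below assumes that each fold increase is the number of derivatives (or derivative masses) divided by the corresponding count in the reference metabolic model; this reading reproduces the reported values, but the function and dictionary names are illustrative and not taken from the original work.

```python
# Counts copied from the table above. The key names are illustrative only.
MODELS = {
    "CHO cell": {
        "metabolites": 1353, "unique_masses": 775,
        "derivatives": 76745, "derivative_masses": 17930,
    },
    "gut microbiota": {
        "metabolites": 1307, "unique_masses": 779,
        "derivatives": 94186, "derivative_masses": 23356,
    },
}


def fold_increase(added: int, baseline: int) -> float:
    """Return added / baseline rounded to two decimals.

    Assumption: column (D) is derivatives divided by the model count.
    """
    return round(added / baseline, 2)


for sample, counts in MODELS.items():
    by_metabolite = fold_increase(counts["derivatives"], counts["metabolites"])
    by_mass = fold_increase(counts["derivative_masses"], counts["unique_masses"])
    print(f"{sample}: metabolites x{by_metabolite}, unique masses x{by_mass}")
```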
Table 2. Candidate set size using different workflows. (A) Candidate set size when using the metabolic model: number of measured masses that match metabolites in the model, the equivalent percentage of the measured masses reported for the experimental data in Table 1, and the corresponding number of chemical identities. (B) Candidate set size when using extended metabolic model (EMM)-based filtering: number of measured masses that match metabolites in the EMM, the equivalent percentage of the measured masses reported in Table 1, and the corresponding number of chemical identities. (C) Further filtering of the EMM derivatives reported in column group (B) to include only mass measurements that match previously known chemical IDs as reported in PubChem: number of matched masses, the percentage of these masses relative to the measured masses reported in Table 1, and the corresponding number of chemical IDs. (D) Size of the candidate set when filtering using PubChem.
| Biological Sample | Dataset | (A) Number of Measured Masses Matched to Those in Metabolic Model | (A) Percentage of Measured Masses Matched to Those in Metabolic Model | (A) Number of Chemical IDs Associated with Measured Masses | (B) Number of Masses Matched to Those in EMM | (B) Percentage of Masses Matched to Those in EMM | (B) Number of Unique Mass-Matched Derivatives in EMM but Not in the Model | (C) Number of Masses Matched to Those with Previously Known Chemical IDs | (C) Percentage of Masses Matched to Those with Previously Known Chemical IDs | (C) Number of Previously Known Chemical IDs for EMM Derivatives that Mass-Match to Measurements | (D) Number of Unique Mass Matches in PubChem | (D) Number of Corresponding Chemical IDs Associated with Measured Masses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHO cell | HilNeg | 118 | 4.72% | 178 | 678 | 27.10% | 2,725 | 174 | 6.95% | 386 | 3,951,635 | 7,657,564 |
| | HilPos | 75 | 1.95% | 93 | 715 | 18.54% | 2,729 | 132 | 3.42% | 226 | 3,362,305 | 6,406,877 |
| | SynNeg | 198 | 3.71% | 229 | 1,490 | 27.92% | 4,944 | 293 | 5.49% | 527 | 7,058,696 | 14,133,885 |
| gut microbiota | Neg | 51 | 3.09% | 131 | 445 | 26.95% | 2,470 | 77 | 4.66% | 207 | 2,448,238 | 5,192,205 |
| | Pos | 36 | 2.17% | 43 | 316 | 19.07% | 1,236 | 84 | 5.07% | 149 | 2,774,074 | 5,572,587 |
| Averages | | 96 | 3.13% | 135 | 729 | 23.92% | 2,821 | 152 | 5.12% | 299 | 3,918,990 | 7,792,624 |
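The percentage columns and the averages row follow directly from the matched-mass counts and the numbers of measured masses. As a minimal sketch, assuming each average is an unweighted mean of the per-dataset percentages (an assumption that reproduces the 3.13% and 23.92% entries), the ratio of the two averages gives the more than 7-fold gain quoted in the abstract. All names below are illustrative.

```python
# Measured masses per dataset and matched-mass counts, copied from the tables above.
MEASURED = {"HilNeg": 2502, "HilPos": 3856, "SynNeg": 5336, "Neg": 1651, "Pos": 1657}
MODEL_MATCHED = {"HilNeg": 118, "HilPos": 75, "SynNeg": 198, "Neg": 51, "Pos": 36}
EMM_MATCHED = {"HilNeg": 678, "HilPos": 715, "SynNeg": 1490, "Neg": 445, "Pos": 316}


def percent_matched(matched: int, measured: int) -> float:
    """Share of measured masses that found at least one candidate, in percent."""
    return 100.0 * matched / measured


def mean_percent(matched_by_dataset: dict) -> float:
    """Unweighted mean of per-dataset percentages (assumed averaging scheme)."""
    values = [
        percent_matched(matched_by_dataset[name], MEASURED[name]) for name in MEASURED
    ]
    return sum(values) / len(values)


model_avg = mean_percent(MODEL_MATCHED)
emm_avg = mean_percent(EMM_MATCHED)
print(f"model: {model_avg:.2f}%  EMM: {emm_avg:.2f}%  ratio: {emm_avg / model_avg:.2f}")
```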
Table 3. Percentage of EMMF candidates with non-zero CFM-ID scores and their average scores.
| Biological Sample | Dataset | KEGG: Number of EMMF Derivatives | KEGG: Percentage of EMMF Derivatives with Non-Zero CFM-ID Scores | KEGG: Average CFM-ID Score | PubChem: Number of EMMF Derivatives | PubChem: Percentage of EMMF Derivatives with Non-Zero CFM-ID Scores | PubChem: Average CFM-ID Score |
|---|---|---|---|---|---|---|---|
| CHO cell | HilNeg | 65 | 65% | 0.557 | 280 | 55% | 0.415 |
| | HilPos | 48 | 63% | 0.395 | 286 | 49% | 0.316 |
| | SynNeg | 114 | 64% | 0.501 | 446 | 51% | 0.370 |
| gut microbiota | Neg | 252 | 16% | 0.631 | 197 | 53% | 0.484 |
| | Pos | 56 | 55% | 0.292 | 428 | 29% | 0.270 |
| Average | | | 53% | 0.475 | | 47% | 0.396 |
Figure 2. Distribution of CFM-ID scores for EMMF derivatives. (A) Chinese hamster ovary (CHO) cell derivatives that had a match in PubChem. (B) CHO cell derivatives that had a match in KEGG. (C) Gut microbiota derivatives that had a match in PubChem. (D) Gut microbiota derivatives that had a match in KEGG.
Table 4. Using EMMs to compare the annotation opportunities of PubChem against the KEGG database. (A) Experimental data for different datasets (repeated for convenience). (B) Number of matched masses and candidate chemicals found using EMMF that are reported in KEGG. (C) Number of matched masses and candidate chemicals found using EMMF that are reported in PubChem but not in KEGG. (D) Lower bounds on the discovery of biologically relevant matched masses and candidate chemicals when using PubChem over KEGG.
| Biological Sample | (A) Dataset | (A) Number of Measured Masses | (B) Number of Matched Masses | (B) Number of Candidate Chemical IDs | (C) Number of Matched Masses | (C) Number of Candidate Chemical IDs | (D) Ratio of Matched Masses, (C)/(B) | (D) Ratio of Candidate Chemical IDs, (C)/(B) |
|---|---|---|---|---|---|---|---|---|
| CHO cell | HilNeg | 2502 | 56 | 93 | 118 | 200 | 2.11 | 2.15 |
| | HilPos | 3856 | 26 | 39 | 106 | 148 | 4.08 | 3.79 |
| | SynNeg | 5336 | 88 | 122 | 205 | 283 | 2.33 | 2.32 |
| gut microbiota | Neg | 1651 | 25 | 47 | 52 | 113 | 2.08 | 2.40 |
| | Pos | 1657 | 23 | 28 | 61 | 93 | 2.65 | 3.32 |
| Average | | | | | | | 2.65 | 2.80 |
Table 5. Candidate metabolites identified by EMMF that were used for experimental validation. (A) Candidate mass and name. (B) Ranking of the metabolite and number of candidates that matched the mass measurement using KEGG. (C) Ranking of the metabolite and number of candidates that matched the mass measurement using PubChem. (D) The number of reactions from which the PROXIMAL operator that yielded each candidate metabolite was derived, and the associated number of enzymes that catalyze these reactions. (E) The status of experimental validation.
| (A) Mass Measurement (Daltons) | (A) Candidate Metabolite Identified by EMMF | (B) KEGG Rank | (B) KEGG Matches | (C) PubChem Rank | (C) PubChem Matches | (D) Number of Reactions Used to Derive Operator | (D) Number of ECs Associated with Reactions | (E) Experimentally Validated? |
|---|---|---|---|---|---|---|---|---|
| 122.04 | Salicylaldehyde | 1 | 1 | 1 | 1 | 1 | 1 | No |
| 182.06 | 4-Hydroxyphenyllactate | 1 | 2 | 1 | 4 | 12 | 15 | Yes |
| 101.05 | Acetoacetamide | 1 | 1 | 2 | 3 | 1 | 1 | No |
| 117.79 | 5-Aminopentanoate | 1 | 2 | 1 | 5 | 4 | 4 | No |
| 132.04 | Glutarate | 1 | 1 | 3 | 6 | 12 | 11 | No |
| 167.06 | 3-Methoxyanthranilate | 1 | 1 | 2 | 3 | 8 | 2 | No |
| 152.05 | 2-Hydroxyphenylacetic acid | NA | 1 | 1 | 4 | 1 | 1 | No |
| 183.05 | 4-Pyridoxate | NA | 0 | 1 | 1 | 1 | 1 | No |
Table 6. EMMF candidate metabolites analyzed using annotation tools and databases. (A) Candidate metabolite suggested by EMMF on the basis of scores from CFM-ID. (B) CFM-ID score. (C) Name and score of the top matching compound(s) in the GNPS spectral library. (D) Name and score of the top matching compound(s) in HMDB. (E) Number of PubChem candidates within a 10 ppm window of the measured mass. (F) MetFrag results: the rank of the compound identified via EMMF (selected based on CFM-ID scores and compound availability); the number of peaks in its spectral signature that MetFrag explained, compared to the number of peaks used for the MetFrag ranking; the top match provided by MetFrag; and the same peak counts for that top match.
| (A) Mass Measurement (Daltons) | (A) Candidate Metabolite | (B) CFM-ID Score | (C) GNPS Matched Compound (Score) | (D) HMDB Matched Compound (Score) | (E) Number of PubChem Matches | (F) MetFrag Rank of Compound Identified by EMMF | (F) MetFrag Peaks Explained / Peaks Used, EMMF Compound | (F) MetFrag Top Ranked Candidate | (F) MetFrag Peaks Explained / Peaks Used, Top Ranked Candidate |
|---|---|---|---|---|---|---|---|---|---|
| 122.04 | Salicylaldehyde | 0.596 | No Match | No Match | 241 | 27 | 4/8 | 2-cyclopenta-1,3-dien-1-yl-2-oxo-acetaldehyde | 4/8 |
| 182.06 | 4-Hydroxyphenyllactate | 0.717 | No Match | Homovanillic acid (0.43) | 1694 | 218 | 10/22 | methyl 2-hydroxy-2-phenyl-peroxyacetate | 11/22 |
| 101.05 | Acetoacetamide | 0.682 | Aminocyclopropane (0.92), | No Match | 445 | 331 | 1/2 | hydroxy N-isopropenylmethanimidate | 1/2 |
| 117.79 | 5-Aminopentanoate | 0.979 | No Match | L-Valine (0.44), | 858 | 12 | 2/5 | 2-[ethyl(methyl)amino]acetic acid | 2/5 |
| 132.04 | Glutarate | 0.600 | No Match | Ethylmalonic acid (0.41) | N/A | | | | |
| 167.06 | 3-Methoxyanthranilate | 0.949 | No Match | Mandelic acid (0.55), | 1962 | 972 | 2/7 | (2-aminophenyl) peroxyacetate | 2/7 |
| 152.05 | 2-Hydroxyphenylacetic acid | 0.716 | 4-hydroxyphenylacetic acid (0.81) | No Match | 841 | 129 | 1/4 | methyl-phenyl-silyl-silane | 1/4 |
| 183.05 | 4-Pyridoxate | 0.870 | 4-Pyridoxate (0.76) | No Match | 1252 | 149 | 2/5 | 2-[1-(3-furyl)ethylideneamino]oxyacetic acid | 2/5 |
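The PubChem match counts above depend on a 10 ppm mass window. The sketch below shows one common way such a window could be applied; it assumes a symmetric tolerance of plus or minus 10 ppm around the measured mass, which the record does not specify, and it ignores adduct and charge corrections. The candidate names and masses are hypothetical and serve only to illustrate the filter.

```python
def ppm_tolerance(mass: float, ppm: float = 10.0) -> float:
    """Absolute mass tolerance (in Daltons) for a relative window given in ppm."""
    return mass * ppm / 1_000_000


def candidates_in_window(measured_mass: float, candidates: dict, ppm: float = 10.0) -> list:
    """Return names of candidates whose mass lies within +/- ppm of the measurement.

    Assumption: the window is symmetric around the measured mass.
    """
    tolerance = ppm_tolerance(measured_mass, ppm)
    return [
        name
        for name, mass in candidates.items()
        if abs(mass - measured_mass) <= tolerance
    ]


# Hypothetical candidate masses, not taken from PubChem or from the tables above.
hypothetical_candidates = {
    "candidate_a": 182.0590,
    "candidate_b": 182.0650,
    "candidate_c": 182.0612,
}

print(candidates_in_window(182.0600, hypothetical_candidates))
# prints ['candidate_a', 'candidate_c']
```

At a mass near 182 Da, 10 ppm corresponds to roughly 0.0018 Da, so a measurement needs about four decimal places before this filter discriminates between candidates.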
Figure 3. Mirror plot for 4-hydroxyphenyllactate, KEGG compound C03672. (A) Experimental data collected using untargeted metabolomics from the CHO cell culture. (B) Data from a high-purity chemical standard. This is considered a match by retention time (RT; difference < 3 min) and by MS/MS (Spearman rank correlation p-value < 0.05 and r-value > 0.6).
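The match criteria in the caption (RT difference < 3 min, Spearman p-value < 0.05, r-value > 0.6) can be sketched as a decision rule. This is not the authors' implementation: it assumes the two spectra have already been aligned into equal-length intensity lists, and it swaps in an exact permutation test for the p-value, which is practical only for a handful of peaks. All function names and the example intensities are hypothetical.

```python
from itertools import permutations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_exact(x, y):
    """Spearman r with an exact two-sided permutation p-value (small n only)."""
    rx, ry = average_ranks(x), average_ranks(y)
    r_obs = pearson(rx, ry)
    perms = list(permutations(ry))
    extreme = sum(abs(pearson(rx, p)) >= abs(r_obs) - 1e-12 for p in perms)
    return r_obs, extreme / len(perms)


def is_match(rt_sample, rt_standard, sample_intensities, standard_intensities):
    """Apply the three thresholds stated in the caption."""
    r, p = spearman_exact(sample_intensities, standard_intensities)
    return abs(rt_sample - rt_standard) < 3.0 and p < 0.05 and r > 0.6


# Hypothetical aligned peak intensities for a sample and a chemical standard.
sample = [100, 40, 25, 10, 60, 5]
standard = [95, 35, 30, 8, 70, 3]
print(is_match(6.2, 6.6, sample, standard))
```

For spectra with many peaks, enumerating all permutations becomes infeasible, and an approximate p-value from a statistics library would be the usual substitute.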