Sara M de Cripan, Adrià Cereto-Massagué, Pol Herrero, Andrei Barcaru, Núria Canela, Xavier Domingo-Almenara.
Abstract
In gas chromatography-mass spectrometry-based untargeted metabolomics, metabolites are identified by comparing mass spectra and chromatographic retention time with reference databases or standard materials. Machine learning has been used to predict the retention time of metabolites that lack reference data. However, retention time prediction for trimethylsilyl derivatives of metabolites, which are typically analyzed in untargeted metabolomics using gas chromatography, has been poorly explored. Here, we provide a rationalized framework for machine learning-based retention time prediction of trimethylsilyl derivatives of metabolites in gas chromatography. We compared different machine learning paradigms and explored how the computational representation of molecular structure used to train the prediction models, namely the fingerprint class and the fingerprint calculation software, influences the results. We tested predicted retention times with chemical ionization and electron impact ionization sources in simulated and real cases. Machine learning ranked the correct identity well, but its power to filter out false identities was limited in cases where a spectrum or a monoisotopic mass matched multiple candidates. Specifically, machine learning prediction yielded median absolute and relative retention index (relative retention time) errors of 37.1 retention index units and 2%, respectively. In addition, fingerprint class and fingerprint calculation software, as well as the molecular structural similarity between the training and test or real case sets, proved to be critical modulators of the prediction performance. Finally, we leveraged the structural similarity between the training and test or real case set to determine the probability that the prediction error is below a specific threshold.
Overall, our study demonstrates that predicted retention time can provide insights into the true structure of unknown metabolites by ranking from the most to the least plausible molecular identity, and sets the guidelines to assess the confidence in metabolite identification using predicted retention time data.
Keywords: GC-MS; machine-learning; metabolomics; retention index; retention time
Year: 2022; PMID: 35453629; PMCID: PMC9024754; DOI: 10.3390/biomedicines10040879
Source DB: PubMed; Journal: Biomedicines; ISSN: 2227-9059
Figure 1. Workflow for RI prediction. The histogram shows the number of TMS groups of metabolites in GMD. Multiple fingerprint classes were generated from metabolites in GMD, and metabolites were randomly split into training and test sets, generating 20 different sets to train and test the ML models.
Figure 2. (a) RI of derivatized metabolites with 1TMS vs. 2TMS, (b) RI of derivatized metabolites with 2TMS vs. 3TMS, and box plots showing the corresponding relative prediction errors as determined by linear regression. Dashed black lines are the identity functions and dashed red lines are the regression lines. (c) Prediction error for each ML model and (d) FP class. A p-value < 0.001 from paired Wilcoxon rank tests (n = 5800) is shown as ***. Outliers are not shown (all panels).
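The absolute and relative RI errors reported in the abstract and plotted in these panels can be illustrated with a minimal sketch. It assumes that the relative error is the absolute error divided by the reference RI, expressed as a percentage; the record does not give the exact definition. The function name `ri_errors` and the RI values are hypothetical, not data from the study.

```python
from statistics import median


def ri_errors(reference, predicted):
    """Return (median absolute error, median relative error in %).

    Assumption: relative error = |predicted - reference| / reference * 100.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_err = [abs(p - r) for r, p in zip(reference, predicted)]
    rel_err = [100.0 * a / r for a, r in zip(abs_err, reference)]
    return median(abs_err), median(rel_err)


# Hypothetical retention indices, for illustration only.
reference_ri = [1000.0, 1500.0, 2000.0]
predicted_ri = [1020.0, 1470.0, 2050.0]

median_abs, median_rel = ri_errors(reference_ri, predicted_ri)
print(f"median absolute error: {median_abs} RI units")
print(f"median relative error: {median_rel} %")
```

The median is used here because it matches the summary statistic quoted in the abstract and is less sensitive to a few badly predicted metabolites than the mean.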
Figure 3. Structural similarity influence. (a) Distribution of Tanimoto similarity of the full training set (red) and randomly selected metabolites (blue). (b) Prediction error across similarity ranges. (c) Cumulative probability functions (CPF) for the structural similarity ranges. (d) Statistical significance for pairwise comparisons in (b): Wilcoxon test, *** for p-value < 0.001; ** for p-value < 0.01.
Figure 4. Metabolite identity candidates ranking and filtering using predicted RI in EI and CI. (a) Percentage of correctly identified metabolites with 2 (C = 2), 3 (C = 3), or 4 or more (C ≥ 4) putative candidates ranked according to predicted-reference RI error. (b) Total number of metabolites (black), number of filtered metabolites using a 3% RI threshold (gray) and candidate classification as True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) using the same threshold.
Figure 5. Application of RT/RI prediction in plasma samples from patients with ulcerative colitis (UC). (a) Workflow graphical representation. (b) Experimental vs. reference RI and (c) experimental vs. predicted RI of the identified metabolites. (d) Heat-map and sample hierarchical clustering of identified metabolites in plasma samples (M for male, F for female, CTR for control and UC for ulcerative colitis).
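The two quantities in this figure, Tanimoto similarity and the cumulative probability that a prediction error falls below a threshold, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: fingerprints are represented as sets of on-bit indices, the probability is a plain empirical fraction, and the function names and example values are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as on-bit index sets."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Two empty fingerprints share nothing; report 0.0 by convention here.
    return len(a & b) / union if union else 0.0


def prob_error_below(errors, threshold):
    """Empirical probability that an error is <= threshold."""
    if not errors:
        raise ValueError("errors must be non-empty")
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical fingerprints (indices of bits set to 1).
similarity = tanimoto({1, 2, 3}, {2, 3, 4})

# Hypothetical relative RI errors (%) within one similarity range.
errors_in_range = [1.0, 2.5, 4.0, 0.5]
p_below_3 = prob_error_below(errors_in_range, 3.0)

print(similarity, p_below_3)
```

Evaluating `prob_error_below` over a grid of thresholds, separately for each similarity range, would give one empirical cumulative curve per range, which is the kind of curve panel (c) describes.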
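A minimal sketch of candidate ranking and filtering by predicted RI is given below. It assumes that each candidate is scored by the relative difference between its predicted RI and the observed RI, that candidates above the 3% threshold are discarded, and that TP/FN refer to the true identity being kept or discarded while FP/TN refer to a false identity being kept or discarded. That mapping, the function names, and the candidate values are assumptions for illustration, not details taken from the study.

```python
def rank_and_filter(observed_ri, candidates, threshold_pct=3.0):
    """Rank candidates by relative RI error (%) and flag those kept.

    candidates maps a candidate name to its predicted RI.
    Returns a list of (name, relative_error_pct, kept), best match first.
    """
    scored = []
    for name, predicted_ri in candidates.items():
        rel = 100.0 * abs(predicted_ri - observed_ri) / observed_ri
        scored.append((name, rel, rel <= threshold_pct))
    scored.sort(key=lambda item: item[1])
    return scored


def confusion_label(is_true_identity, kept):
    """Assumed mapping of a candidate to TP, FN, FP or TN."""
    if is_true_identity:
        return "TP" if kept else "FN"
    return "FP" if kept else "TN"


# Hypothetical feature with three putative candidates (C = 3).
observed = 1500.0
predicted = {"cand_A": 1510.0, "cand_B": 1560.0, "cand_C": 1400.0}
true_identity = "cand_A"

ranking = rank_and_filter(observed, predicted)
labels = {name: confusion_label(name == true_identity, kept)
          for name, _, kept in ranking}

for name, rel, kept in ranking:
    print(f"{name}: {rel:.2f}% error, kept={kept}, {labels[name]}")
```

Ranking and filtering are kept as separate outputs because the abstract reports them separately: ranking performed well, whereas threshold-based filtering had limited power when several candidates matched.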