
AI/ML-driven advances in untargeted metabolomics and exposomics for biomedical applications.

Lauren M Petrick, Noam Shomron.

Abstract

Metabolomics describes a high-throughput approach for measuring a repertoire of metabolites and small molecules in biological samples. One utility of untargeted metabolomics, unbiased global analysis of the metabolome, is to detect key metabolites as contributors to, or readouts of, human health and disease. In this perspective, we discuss how artificial intelligence (AI) and machine learning (ML) have promoted major advances in untargeted metabolomics workflows and facilitated pivotal findings in the areas of disease screening and diagnosis. We contextualize applications of AI and ML to the emerging field of high-resolution mass spectrometry (HRMS) exposomics, which unbiasedly detects endogenous metabolites and exogenous chemicals in human tissue to characterize exposure linked with disease outcomes. We discuss the state of the science and suggest potential opportunities for using AI and ML to improve data quality, rigor, detection, and chemical identification in untargeted metabolomics and exposomics studies.


Year:  2022        PMID: 35936554      PMCID: PMC9354369          DOI: 10.1016/j.xcrp.2022.100978

Source DB:  PubMed          Journal:  Cell Rep Phys Sci        ISSN: 2666-3864


INTRODUCTION

Chemical reactions in the body produce the myriad metabolites essential for human life, a process known as metabolism. Metabolism itself falls into two main types: catabolism, or the breakdown of molecules to obtain energy, and anabolism, or the synthesis of compounds required by cells. Metabolism also encompasses deactivation, detoxification, and elimination of foreign or unwanted substances. Insight into these processes is crucial for understanding human physiology in health and disease. There are multiple ways to study these processes individually or collectively, but one comprehensive, high-throughput approach is metabolomics, which relies on measurement of small molecules (<2,000 Da) in a biological sample, typically blood, urine, or saliva (Figure 1, bottom). The metabolomics framework can capture endogenous metabolites and signal molecules that participate in regulation of gene expression, protein function, and enzyme activity. Its high-throughput nature is particularly valuable, given that the scale of small molecule-enzyme interactions varies by organism from around 500 to a few thousand reactions and metabolite intermediates.[1]
Figure 1.

Metabolomics and exposomics facilitate discovery

Metabolomics (pink shading, bottom) focuses on measures of endogenous small molecules as outputs of metabolic pathways. Perturbations in the ‘omics’ layers can lead to changes in metabolite profiles linked with phenotypes. Exposomics (blue shading, top) expands on this to include measurements of exogenous small molecules as well as the influence of exogenous and non-genetic factors on the ‘omics’ cascade that can lead to changes in metabolite profiles linked with phenotype. The timing of sample collection relative to phenotype characterization helps determine whether the metabolite biomarker or pathway is linked to etiology, diagnosis, or progression of disease.

Within the metabolomics framework, different approaches enable different kinds of insights into these metabolic processes. One approach, targeted metabolomics, typically measures concentrations of tens to approximately 100 endogenous metabolites determined a priori. This quantitative approach enables comparisons across studies and populations as well as development of thresholds describing average or expected ranges to aid medical diagnosis and intervention. A specific type of targeted metabolomics is metabolic flux analysis, which monitors the fate of stable isotope tracers (e.g., 13C-glucose, 15N-glutamine), allowing research into the flow of metabolites.[1-5] Together, flux and concentration provide a fuller understanding of metabolism. Complementing these quantitative approaches, untargeted metabolomics is an unbiased, semi-quantitative measure of thousands of small molecules simultaneously. This approach circumvents the logistical and economic challenges that restrict how many chemicals can be measured in a quantitative assay. Here, high-resolution mass spectrometry (HRMS) typically pairs with liquid chromatography (LC) or gas chromatography (GC) to separate and detect thousands of chemical peaks—unitless, semi-quantitative features applicable for downstream analysis. With particular study designs, untargeted metabolomics can generate new hypotheses on altered pathways and individual metabolites that can be linked to disease initiation, diagnosis, progression, or prognosis (Figure 1).[6,7] Untargeted metabolomics is particularly valuable when considering that beyond the core set of metabolites studied in targeted and flux analysis lies the vast unknown metabolome, described holistically as the “exposome.”

The exposome stems from a hypothesis that most diseases and disorders are heterogeneous and that non-genetic influence or “exposure” from environmental chemicals, diet, lifestyle, psychosocial factors, and disease history throughout life may play pivotal roles in health. Indeed, exogenous chemicals from food (genistein, vitamin E), lifestyle (nicotine, caffeine), drugs (cefuroxime, acetaminophen), and pollution (phthalates, perfluoroalkyl substances) enter the body and circulate in the blood to cells and organs. Biological fluids and tissues therefore contain chemical readouts of these exposures, such as cortisol from stress, di(2-ethylhexyl)phthalate (DEHP) from plasticizers, caffeine from coffee, cholesterol from high-fat diets, and antibiotics used before surgery. Over the last decade, untargeted metabolomics strategies have expanded to include detection and measurement of exposure chemicals. “Exposomics” analysis leverages HRMS-based strategies to capture, in the same analytical assay, both the endogenous metabolites typically measured in untargeted metabolomics and the exogenous chemicals resulting from various exposures (Figure 1, top). Despite extensive LC-MS studies collecting exposome data, most of the human exposome remains unknown, with only a small fraction identified and incorporated into in-house libraries or databases.
Even though only about 5,000 chemicals likely have wide enough dispersal in the environment to pose a global threat to the human population, many thousands more are expected to affect individuals.[8] Because such chemicals can be converted to metabolites or environmental transformation products, first-order reaction products could number in the millions.[9] Unlike typical endogenous metabolites and nutritional chemicals detected in high concentrations in most study participants, environmental chemicals often have concentrations orders of magnitude lower; they can be transient and then rapidly disappear, or they can fall below detection limits. Current hypotheses in environmental epidemiology posit that individuals commonly experience chemical exposure in mixtures rather than individually and that mixture effects underlie phenotypic changes in health or during disease.[10,11] These phenomena are important when selecting a suitable analytical approach. Although untargeted assays with HRMS can detect tens of thousands of peaks from each sample, the complexity of these data and the inherent analytical variability of HRMS bring computational challenges. Advanced artificial intelligence (AI) and machine learning (ML) algorithms can assist with alignment of the data; feature selection to pinpoint important exposures, metabolites, and biomarkers as mixtures; and annotation of unknown metabolites. Thus, developing and optimizing such applications is necessary to advance exposomics discovery and further research. Here we highlight some of the most recent AI/ML tools applied to untargeted metabolomics data processing. Although AI/ML is now being applied to new technologies, including MS imaging and single-cell MS metabolomics,[12-15] we focus our discussion on the most widely used LC- or GC-HRMS techniques and on the critical gaps that must be overcome to advance exposomics research. We discuss these as key steps required for successful application of untargeted metabolomics within an exposomics framework.

Typical untargeted workflow

For biological matrices like serum, plasma, or urine, LC or GC column chromatography is used to first separate the complex mixture before detection and measurement by HRMS. The LC/GC-HRMS workflow typically follows a series of steps: sample preparation; data acquisition; data pre- and post-processing; data analysis, including feature selection; and identification/annotation of chemicals (Figure 2).[16-18] Metabolites and chemicals are extracted from the biological sample using a high percentage of organic solvent, which also removes proteins. In LC analysis, the extracts are usually analyzed using hydrophilic interaction LC (HILIC), which uses a column that retains polar chemicals, and reverse-phase (RP) chromatography, which uses a column that retains neutral and non-polar compounds. Together, these complementary methods maximize the total number of measurable small molecules. For GC analysis, a clean-up step using solid-phase extraction can precede protein precipitation; the extracts are then derivatized to make them more volatile so they can be analyzed on a capillary column.[19] Physicochemical properties of the small molecules dictate their extraction efficiency, their retention on the LC or GC column, and their ionization/detection on the MS, which provides an opportunity for ML approaches to support chemical space prediction for selecting chromatographic columns and buffer or temperature gradients (see the sketch below).[20] The untargeted approach uses minimal sample processing steps to maximize the breadth of chemicals measured; multi-step extractions, in contrast, trade this breadth for maximized signal from particular chemical classes. Because most of the MS features measured in an untargeted analysis are unknown, conditions cannot be optimized a priori based on the properties of all intended targets, nor can one determine which chemicals are missing or lost from the analysis because of the selected conditions.
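
To illustrate the kind of ML-supported chemical space prediction referenced above, the following minimal Python sketch predicts LC retention time from computed physicochemical descriptors; the training data, descriptor set, and model choice are illustrative assumptions rather than methods from the cited work.

    # A minimal sketch, assuming a hypothetical training table of SMILES
    # strings and measured retention times.
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from sklearn.ensemble import RandomForestRegressor

    def featurize(smiles):
        # Descriptors relevant to chromatographic retention.
        mol = Chem.MolFromSmiles(smiles)
        return [Descriptors.MolLogP(mol),      # hydrophobicity (RP retention)
                Descriptors.TPSA(mol),         # polarity (HILIC retention)
                Descriptors.MolWt(mol),
                Descriptors.NumHDonors(mol),
                Descriptors.NumHAcceptors(mol)]

    # Hypothetical training pairs: (SMILES, retention time in minutes).
    train = [("CC(=O)Nc1ccc(O)cc1", 3.2),              # acetaminophen
             ("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", 2.1)]    # caffeine
    X = [featurize(s) for s, rt in train]
    y = [rt for s, rt in train]
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    print(model.predict([featurize("CC(=O)Oc1ccccc1C(=O)O")]))  # aspirin

In practice, such models are trained on hundreds to thousands of standards measured on the same column and gradient, and the predictions can inform column selection or, later in the workflow, filter candidate annotations by predicted retention time.
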
Figure 2.

Untargeted analysis workflow for biomedical applications

Step 1: biological samples, including biofluids, tissues, and/or cells, are collected from the study groups or phenotypes of interest (e.g., exposed or not exposed, cases or controls). Metabolites and chemicals are isolated through sample preparation. Step 2: the metabolites and chemicals are analyzed by liquid or gas chromatography-high resolution mass spectrometry (LC- or GC-HRMS, respectively), which separates the molecules in the complex mixture and detects and measures them. Step 3: data undergo processing to convert the 3D data collected on each sample (m/z, retention time, and abundance) to a 2D table for analysis. In pre-processing, peaks are found and aligned across all samples, and the area under each peak curve is calculated. In post-processing, the data are normalized and scaled, batch correction is performed when needed, and the data are filtered for QC. Step 4: data are analyzed using AI/ML and classic statistical approaches to identify individual metabolites and chemicals or pathways that are predictive of the phenotype of interest. Step 5: metabolites and chemicals of interest are annotated by matching to spectral libraries, spectral prediction, and spectral/chemical similarities.

Data acquisition can occur in one stage of mass spectrometry (MS1) or two stages with tandem mass spectrometry (MS/MS), in which ions from MS1 are selectively fragmented to generate a molecular fingerprint of the molecule to aid identification (Figure 3). Most studies collect a combination of MS1 and MS/MS data; MS1 data are usually used for semi-quantification of a feature, and MS/MS fragmentation data are usually used for its annotation.
Figure 3.

MS and MS/MS data acquisition

(A) A sample after injection into a chromatography column enters the mass spectrometer, where eluting chemicals are ionized, accelerated, and analyzed by mass spectrometry (MS1). Each chemical elutes at a characteristic retention time and is detected at a characteristic m/z.

(B) Tandem mass spectrometry (MS/MS) can then be performed, in which ions of interest (e.g., m/z 195.0877) are selectively fragmented to generate fragment ions (e.g., m/z 138.1). These fragment ions are characteristic of the molecule; therefore, MS/MS spectra can be used to aid chemical identification.
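
As a concrete illustration of these two acquisition levels, the following minimal Python sketch separates MS1 and MS/MS scans when reading raw data converted to the open mzML format, using the open-source pyteomics library; the file name is a hypothetical placeholder.

    # A minimal sketch, assuming a hypothetical file "sample.mzML".
    from pyteomics import mzml

    ms1_scans, ms2_scans = [], []
    with mzml.read("sample.mzML") as reader:
        for spectrum in reader:
            if spectrum["ms level"] == 1:
                # MS1: full-scan m/z and intensity arrays, typically used
                # for semi-quantification of features.
                ms1_scans.append((spectrum["m/z array"],
                                  spectrum["intensity array"]))
            elif spectrum["ms level"] == 2:
                # MS/MS: fragment spectrum of a selected precursor ion,
                # typically used for annotation.
                precursor = spectrum["precursorList"]["precursor"][0]
                pre_mz = precursor["selectedIonList"]["selectedIon"][0]["selected ion m/z"]
                ms2_scans.append((pre_mz,
                                  spectrum["m/z array"],
                                  spectrum["intensity array"]))
    print(len(ms1_scans), "MS1 scans;", len(ms2_scans), "MS/MS scans")
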

Data are acquired for each analytical mode (HILIC-MS, RP-MS, or GC-MS) in three dimensions: mass-to-charge ratio (m/z), retention time (rt), and abundance. AI/ML play major roles in subsequent steps of metabolomics workflows. MS1 data are pre- and post-processed using a variety of algorithms to transform the large amount of raw spectral data into a much smaller, statistically manageable set of peaks or features (Figure 2). Software steps include selecting peaks and aligning them across samples. The output is a peak table for each analytical mode containing peak intensity values (abundance) for every metabolite feature in every sample. Once the data are adjusted to remove unwanted technical variation, feature selection takes place, often using statistical approaches or ML to focus on the small molecules associated with health outcomes or exposure. Finally, these features are identified for biological interpretation, with many ML and AI tools developed to facilitate metabolite annotation using MS/MS data.

Innovation in the first two steps of the workflow (Figure 2, steps 1 and 2) stems largely from instrumentation and automation, which enable more reproducible sample preparation, better detection in smaller sample volumes, and a broader range of measurable metabolites. In contrast, innovation in the last three steps (Figure 2, steps 3–5) stems from computational advances. Most AI/ML efforts to date have focused on the latter part of the data processing workflow—feature selection and metabolite identification—because of an urgent need for computational tools to draw biologically interpretable connections between complex MS metabolomics data and health and exposure outcomes. However, recent efforts demonstrate a shift toward developing and applying advanced ML methods to enhance quality control and cleaning—data processing—of untargeted MS data before downstream analysis. This shift reflects the fact that most readily used peak-picking algorithms can successfully measure the high-concentration, Gaussian-shaped peaks typical of endogenous metabolites[21] but are less successful with low-abundance signals. Measuring such signals is a fundamental challenge for HRMS exposomics, which seeks to capture a handful of needles in a large and variable haystack. Therefore, advancing the field of HRMS exposomics requires robust peak picking of low-abundance features and development or optimization of novel computational tools for data processing.
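
The resulting peak table can be pictured as a features-by-samples matrix. A minimal sketch with pandas follows; all values are illustrative (the m/z 195.0877 feature echoes the caffeine example in Figure 3).

    # A minimal sketch of the 2D peak table produced by pre-processing:
    # rows are aligned features (defined by m/z and rt), columns are
    # samples, and values are peak abundances. All values are illustrative.
    import pandas as pd

    peak_table = pd.DataFrame(
        {"sample_01": [1.2e6, 3.4e4, 8.9e5],
         "sample_02": [1.1e6, 2.9e4, 9.3e5]},
        index=pd.MultiIndex.from_tuples(
            [(195.0877, 4.1), (391.2843, 7.8), (132.1019, 1.5)],
            names=["mz", "rt_min"]))
    print(peak_table)
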

Data processing

Metabolomics raw data are inherently complex because of multiple linear and nonlinear interactions among the metabolites as well as challenges inherent to mass spectrometry data structure.[22] These challenges include features (e.g., peaks) that can massively outnumber the samples, high levels of noise, batch and run order effects during measurements, and missing values during peak detection. The data pre-processing step of this workflow is crucial for accurate translation of the 3D data obtained from LC-MS (m/z, rt, abundance) into the 2D aligned peak table (aligned peaks [of specific m/z and rt] and their respective abundances in every sample) required for downstream data analyses.[23,24] This translation is crucial because peak areas correlate with chemical concentrations in a sample, and these are the data ultimately analyzed to draw statistical and biological inferences. Even though data pre-processing is easily performed using automated software, it is challenging to precisely, accurately, and robustly synthesize the data across the full range of metabolite features, concentrations, and sample acquisition times into a manageable dataset.[16,25] Algorithms for pre-processing include the open-source XCMS,[23] MZmine3,[24] MS-DIAL,[26] and MetAlign[27] as well as several types of proprietary software. XCMS and MZmine are the most widely used, but no algorithm has been accepted as the benchmark in the fields of metabolomics or exposomics. Consequently, concordance across methods can be less than 50%.[28] Recent evidence demonstrates that false positives and poorly integrated (low-quality) peaks are retained in large numbers by both public and private software platforms,[29,30] which can propagate errors into downstream analyses.[31] These findings have stoked increased interest in development of quality control (QC) measures to improve the quality and reliability of data reporting for high-throughput untargeted analysis.[32-34]

After peak picking, filtering strategies based on predetermined thresholds, such as mean/median value across samples, variability across biological samples, and levels of missing values, are routinely applied to remove noisy peaks.[35] The most commonly used QC measure is a pooled QC sample, generated by thoroughly combining a small volume from all samples (or from a representative subset) and re-aliquoting the pool into multiple samples. These replicates can be evenly distributed throughout the analytical batch for acquisition. Because the pooled QC sample is representative of the metabolite composition of the study samples, features present in all pooled QC samples with a low coefficient of variation across the analytical batches are retained. However, with complex samples, contaminant features can exceed 50% of detected features in some experiments,[36] and these may be difficult to discriminate from true positives of low abundance. Although QCs allow removal of many false positive features (noise or baseline recognized as a peak), correct features are also discarded.
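
As a minimal sketch of the pooled-QC filtering just described, the following function retains features detected in every pooled QC injection with a coefficient of variation (CV) below a threshold; the 30% cutoff is a common convention in the field rather than a value from this paper, and the column names are hypothetical.

    # A minimal sketch, assuming a pandas peak table with hypothetical
    # pooled QC injection columns; the 30% CV cutoff is a common convention.
    import pandas as pd

    def qc_filter(peak_table, qc_columns, cv_threshold=0.30):
        qc = peak_table[qc_columns]
        cv = qc.std(axis=1) / qc.mean(axis=1)  # per-feature CV across QC injections
        detected = qc.notna().all(axis=1)      # present in every pooled QC
        return peak_table[(cv < cv_threshold) & detected]

    # Usage, assuming QC injections were acquired as columns QC_01..QC_03:
    # filtered = qc_filter(peak_table, qc_columns=["QC_01", "QC_02", "QC_03"])
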
A recent study illustrated this loss of correct features by re-mining untargeted metabolomics data using minimal thresholds, revealing additional metabolites and pathways associated with the outcome that were not identified in the primary publication.[37] Therefore, to retain important but low-abundance features for downstream analysis, new approaches are needed that comprehensively retain all high-quality peaks.[38] Here, new peak-picking algorithms and ML-based filtering have entered the arena. The comprehensive peak characterization (CPC) algorithm with user-based peak criterion filtering removes 35% of the peak-picked XCMS features and demonstrates, for a subset of retained peaks, a 90% true positive rate and an 87% true negative rate.[39] Similarly, Finnee introduces algorithms to correct baseline drift and background noise and uses a clustering and targeted analysis approach to reduce false positives.[40] In a limited demonstration with five controls, five participants with an asthma diagnosis, and five with a chronic obstructive pulmonary disease diagnosis, these algorithms yielded more biomarkers than XCMS or MS-DIAL,[41] suggesting that algorithm development and optimization may be needed to enable detection and measurement of new chemicals in datasets for statistical analysis.

Recent work has introduced ML classification approaches that use peak quality to train models. The first tool, WiPP, introduced in 2019, uses a support vector machine (SVM) to classify high-concentration peaks from GC-MS data but performs worse for low-concentration peaks.[42] MetaClean assesses 24 different classifiers—combinations of eight algorithms and three sets of peak quality metrics—at filtering peaks based on the quality of peak boundaries, revealing that the AdaBoost algorithm and a set of 11 peak quality metrics perform best.[43] However, the distributions of low- and high-abundance peaks between the true positive and true negative rates were not assessed. Peakonly[44] and NeatMS[45] utilize deep learning (DL) neural networks for LC-MS peak classification. Peakonly has a true positive detection rate of 97% but deliberately excludes narrow peaks and uncertain peaks with noisy shapes, trading coverage for confidence that detected peaks are true positives. NeatMS, in contrast, demonstrates a greater ability to retain high-quality peaks even at lower concentrations. In the tool NPFimg, raw GC-MS data are flattened into a 2D image for processing with a neural network model; NPFimg performs as well as or better than XCMS, with true positive and true negative rates above 97%, in a limited demonstration on human breath samples from a single participant.[46] Finally, the software EVA uses convolutional neural networks (CNNs) to classify good and bad peak shapes; applying it to 22 publicly available LC-MS-based metabolomics datasets yielded a classification accuracy greater than 90%.[47]

Although these tools show promise based on their initial demonstrations, their utility must be tested through full evaluations on large datasets. It is critical to determine whether exposure data, and not just endogenous metabolites, are retained in the analysis, especially after data processing,[48-51] to pinpoint where along the workflow critical improvements are needed. Anticipated advancements in algorithms for untargeted metabolomics data mining[52] will contribute to robust data analysis and, ultimately, discovery of new biomarkers for health and disease.
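
In the spirit of these peak-quality classifiers (without reproducing any specific tool), the sketch below trains an AdaBoost model on manually labeled peak quality metrics and uses it to filter candidate peaks; the metric names, values, and labels are hypothetical placeholders.

    # A hedged sketch of ML-based peak-quality filtering; training data are
    # hypothetical. Rows are peaks; columns are quality metrics (e.g.,
    # symmetry, sharpness, signal-to-noise); labels: 1 = good peak,
    # 0 = false positive or poorly integrated, assigned by manual review.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    X_train = np.array([[0.9, 12.0, 45.0],
                        [0.2,  1.5,  3.0],
                        [0.8,  9.0, 30.0],
                        [0.1,  0.8,  2.0]])
    y_train = np.array([1, 0, 1, 0])

    clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    # Apply to newly picked peaks and keep only those predicted high quality:
    X_new = np.array([[0.7, 8.0, 25.0]])
    keep = clf.predict(X_new) == 1
    print(keep)
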

Feature selection

Discovery of molecular biomarkers and metabolomics signatures requires analyzing the complex untargeted data in biological samples. Analyses using traditional univariate and multivariate linear models perform multiple hypothesis tests (one hypothesis per feature) and apply a correction for multiple hypothesis testing (false discovery rate or Bonferroni). Borrowing from the concept of genome-wide association studies (GWASs), environment-wide association studies (EWASs) analytically validate associations between metabolite features and a phenotype. Such studies are comprehensive in that each measured metabolite is assessed for possible association with the target phenotype.[53] These methods have enabled successful biomarker discovery in metabolomics data across disease contexts.[54-57] However, they cannot consider the highly correlated structure of metabolomics data a priori and do not address interactions between molecules,[58] thus increasing the probability of obtaining false positives and false negatives.

In contrast, AI and ML approaches use the data to build models and then test those models with the data. For both epidemiological and clinical studies, AI and ML analysis of metabolomics data can unveil important relationships between phenotypes and exposures or phenotypes and disease groups. AI and ML can identify variation between phenotypes through dimension reduction; metabolites and chemicals that predict disease status or phenotype; and biological pathways that differ between phenotypes, demonstrating the power of these approaches to answer a range of important clinical, environmental health, and precision health questions.[59] The most widely used AI/ML tools in metabolomics include the least absolute shrinkage and selection operator (LASSO), principal-component analysis (PCA), hierarchical clustering analysis (HCA), self-organizing maps (SOMs), partial least squares-discriminant analysis (PLS-DA), and random forest (RF).[60,61] Recent studies have also applied hidden-layer artificial neural networks (ANNs) and DL (CNNs and deep neural networks [DNNs]).[62] These multivariate methods are advantageous in that they can consider all features simultaneously and, consequently, deal with correlation among the metabolites.[63,64] As a result, these techniques have helped uncover significant biomarkers and metabolite signatures.

There are several examples of ML algorithms used for metabolite feature selection across disease contexts. In a recent metabolomics study, feature selection with RF identified 17 metabolites that, in combination, accurately detected cirrhosis resulting from non-alcoholic fatty liver disease (NAFLD) and whose levels were sufficient to discriminate NAFLD cirrhosis from control probands in a PCA. These findings yielded a potential non-invasive stool signature for prediction of NAFLD cirrhosis.[65] In another study, application of RF identified metabolites predictive of coronavirus disease 2019 (COVID-19) severity.[66] Similarly, RF and SVM uncovered a targeted metabolomics signature of Alzheimer’s disease (AD) in the brain.[67] When tested in blood samples, this panel identified distinct metabolites belonging to the sphingolipid and glycerophospholipid classes that are related to the severity of AD pathology in the brain and whose concentrations in blood are associated with preclinical disease progression.
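
A minimal sketch of RF-based feature ranking of the kind used in these studies follows; the data are synthetic placeholders, and selecting the top 17 features simply echoes the NAFLD example above.

    # A minimal sketch, assuming a synthetic peak table and binary phenotype.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.lognormal(size=(60, 500))   # 60 samples x 500 metabolite features
    y = rng.integers(0, 2, size=60)     # case/control phenotype labels

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:17]  # e.g., a 17-feature panel
    print("Candidate biomarker feature indices:", top)
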
In another example, application of LASSO defined metabolites constituting a metabolic clock of gestational age in maternal plasma during pregnancy.[68] Finally, a Hilbert-Schmidt independence criterion (HSIC) LASSO-based prediction model showed better predictive power than LASSO, SVM, PLS, RF, and neural network models for predicting depression symptoms in a study of more than 800 Japanese adults.[69] These examples highlight the number and breadth of applications of ML algorithms—alone or in combination—for discovering metabolomic biomarkers that support prediction of disease incidence or severity, demonstrating the versatility of such models for use in the field.

To date, AI/ML feature selection applications have resulted primarily in identification of significant endogenous metabolites and pathways, not exogenous chemicals. Although classic statistical techniques, correlation analysis, and meet-in-the-middle approaches have identified links between environmental and dietary exposure and disease outcomes,[70-72] success using AI/ML feature selection tools to identify non-endogenous metabolites remains limited. It is possible that these tools have been less successful in selecting robust peaks; however, to our knowledge, the necessary comprehensive comparison of tools with an exposomics focus has yet to be undertaken. The complexity of environmental toxicity relationships (e.g., U-shaped toxicity, nonlinear associations, and unknown interactions) may require more advanced AI or deep ML algorithms. Emerging applications of CNNs and DNNs to metabolomics have shown successful feature selection of predictors of estrogen receptor (ER) status in breast cancer[73] and of Alzheimer’s disease.[74] Indeed, a DL framework yielded the highest area-under-the-curve point estimate for classifying individuals with breast cancer by ER+/ER− status based on metabolomics data compared with six other ML algorithms. Biological interpretation of the first hidden layer identified by the DL framework revealed enrichment of eight cancer-relevant metabolic pathways that were not identified through conventional ML algorithms. Although DL methods do not always outperform traditional ML methods,[75] these results suggest that further development and application of ML, and especially DL, tools for feature selection may help uncover novel exposure risk factors. Several existing projects aim to leverage such opportunities, using HRMS exposomics data integrated with other exposures and measures, with planned AI and ML strategies; for example, in the areas of women’s health[76] and chronic gut inflammation.[77]
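
To close this section, a minimal sketch of LASSO-based selection of a sparse metabolite panel for a continuous outcome (for example, a gestational age clock, echoing the study above); the data are synthetic placeholders, and the regularization strength is chosen by cross-validation.

    # A minimal sketch, assuming synthetic data; LassoCV selects the
    # regularization strength by cross-validation.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 300))            # 100 samples x 300 features
    y = 2.0 * X[:, 5] + rng.normal(size=100)   # outcome driven by one feature

    lasso = LassoCV(cv=5).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)     # features with nonzero coefficients
    print("Selected features:", selected)
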

Metabolite identification

Metabolite identification is a critically important step in the biomarker discovery pipeline. Accordingly, many researchers devote effort to software and tool development to support this process.[78] After feature selection, important peaks or features, minimally defined by a specific m/z and rt, must be annotated to determine biological plausibility for eventual translation into intervention and prevention strategies or clinical practice. This step often relies on metabolite databases and spectral libraries containing experimental and in silico spectra, including GNPS,[79] Metlin,[80] the Human Metabolome Database,[81] MassBank,[82] and others.[83] Users match to databases on m/z alone for low-confidence annotations or include additional orthogonal data (e.g., presence of isotopes and their ratios, MS/MS fragmentation data, neutral losses, and characteristic fragments) to increase the confidence of the annotations.[84,85] For an annotated peak, the chemical or metabolite is confirmed by analyzing a commercially available or synthesized standard under the same experimental conditions and matching across all available parameters (m/z, retention time, MS/MS, etc.).

However, the list of chemicals available in databases is small compared with the more than 68 million known available chemicals,[86] and spectral matching rates for specialized chemicals remain low. Thus, additional tools are needed to help annotate unknown chemicals identified in an untargeted chemical assay. One approach is cognitive metabolomics computing using ML and natural language processing (NLP), which extracts information from the scientific literature and understands its semantic context.[87] Although promising for annotation and biological interpretation of exposomics data,[88] applications remain limited thus far, likely because of entry barriers such as required subscriptions to the databases and the expert user knowledge needed to successfully execute this type of analysis. Recent efforts to overcome these challenges produced a protocol with suggestions for free and open-source NLP tools,[89] showing promise for further expansion of use in metabolomics and exposomics.

Easier-to-use in silico tools provide a widely adopted alternative for annotation. CSI:FingerID uses SVM to predict MS/MS spectra and then suggests candidate compounds to match those spectra.[90] Other tools, like LipidBlast,[26] MetFrag,[91] MIDAS,[92] and CFM-ID,[93] take molecular structures as inputs and predict their spectra. In CFM-ID, a pre-trained neural network model is combined with rule-based fragmentation.[93] The addition of rules, compared with ML alone, improves prediction for classes of metabolites found in food and endogenous metabolite databases but not for exposomic chemicals, possibly because the approach lacks rules specific to industrial chemicals. To match a user’s spectra against those in a database, ranking systems or scores help the user judge the robustness of a match. One common metric for matching MS/MS spectral data is the cosine similarity score, which ranks the overlap between MS/MS spectra but performs poorly at matching chemical analogs with several structural modifications.[94,95] Recent improvements to this approach have added ML algorithms focused on molecular structural similarity.
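
As a minimal sketch of the cosine similarity score just described, the following function bins fragment m/z values and takes the cosine of the two aligned intensity vectors; the binning scheme and spectra are illustrative, and real implementations differ in their peak-matching details.

    # A minimal sketch; bin width and spectra are illustrative.
    import numpy as np

    def cosine_score(mz1, int1, mz2, int2, bin_width=0.01):
        # Assign each fragment to an m/z bin, then build aligned vectors.
        bins1 = np.round(np.asarray(mz1) / bin_width).astype(int)
        bins2 = np.round(np.asarray(mz2) / bin_width).astype(int)
        all_bins = np.union1d(bins1, bins2)
        v1 = np.zeros(len(all_bins))
        v2 = np.zeros(len(all_bins))
        v1[np.searchsorted(all_bins, bins1)] = int1
        v2[np.searchsorted(all_bins, bins2)] = int2
        return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # Illustrative query vs. library spectrum (caffeine-like fragments):
    print(cosine_score([110.071, 138.066], [40, 100],
                       [83.060, 110.071, 138.066], [10, 45, 95]))
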
Spec2Vec uses an unsupervised ML method to learn from the co-occurrence of ion fragments across large datasets.[96] This method is computationally faster than cosine similarity, and its results correlate better with structural similarity than cosine-based scores, suggesting better matching. Similarly, MS2Deep uses neural networks to predict structural similarity scores of MS/MS data without requiring a known molecular formula.[97] Finally, SteroidXtract uses a CNN trained on manually curated steroid MS/MS spectra to predict other steroid-like chemicals in an untargeted dataset.[98]

Further expansion of spectral libraries will facilitate confident metabolite and chemical identification, but this step must be supplemented with new annotation tools. In addition to the approximately 1,500 new chemicals produced annually in the United States,[99] new sample types that require atypical sample processing are likely to generate new spectral adducts and unidentified chemicals. For example, MS/MS spectral matching to Metlin, GNPS, and an in-house library annotated just 4% of chemicals in a study of the tooth exposome.[100] In that example, all 267 metabolites discriminating prenatal and postnatal tooth fractions remained unannotated, highlighting that the large percentage of unannotated chemicals in a study poses a major challenge for biomarker discovery. Development of additional tools for annotation and identification of unknowns can be facilitated by publicly available spectral databases that are now large enough to provide substantial training, validation, and testing data. Whether using network maps to expand annotations of unknowns through chemical similarity to those in databases[101] or using biological information to drive development of chemical class prediction, ML and DL hold promise for robust MS/MS data interpretation and elucidation of endogenous and exogenous chemicals.[102]
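
In practice, such spectral scoring is available through open-source packages; for example, the matchms library, on which the Spec2Vec reference implementation builds. The sketch below is hedged against version differences in the matchms API, and the spectra and metadata are illustrative.

    # A hedged sketch using matchms (API per recent documented versions;
    # details may vary across releases). Spectra are illustrative, and
    # matchms requires m/z arrays sorted in ascending order.
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import CosineGreedy

    query = Spectrum(mz=np.array([110.071, 138.066]),
                     intensities=np.array([0.4, 1.0]),
                     metadata={"precursor_mz": 195.0877})
    library_entry = Spectrum(mz=np.array([83.060, 110.071, 138.066]),
                             intensities=np.array([0.10, 0.45, 0.95]),
                             metadata={"compound_name": "caffeine",
                                       "precursor_mz": 195.0877})

    result = CosineGreedy(tolerance=0.005).pair(library_entry, query)
    print(result["score"], result["matches"])  # similarity and matched peak count
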

Challenges for the future

Significant advances in untargeted chemical analysis instrumentation enable measurement of large numbers of chemicals across several orders of magnitude of concentration, including trace levels. Similar to other fields, such as DNA sequencing, the decreasing cost of these technologies now facilitates cost-effective measurement of thousands of samples for epidemiological and clinical studies. Critically, over the last decade, AI/ML tools have been developed to support extracting and formatting, mining, and annotating the data generated in these massive studies,[103] and they already play an important role in accelerating discovery.

Many applications of ML in metabolomics have focused on the forward-facing step of the untargeted analysis pipeline—the feature selection process. In this step, aided by ML algorithms, thousands of features are narrowed down to the tens of features that are predictive of a health outcome or phenotype. Although current applications are limited largely to individual “omics” datasets, recent advances include using ML to combine data across different “omics” levels in a systems biology approach.[104-106] This development will lead to identification of additional and combined biomarkers that attain higher specificity or that assist with unraveling the cascade of factors associated with disease initiation and progression. However, these achievements require that metabolites are sufficiently annotated.

When features are selected, researchers hit the ultimate bottleneck of untargeted chemical analysis—annotation of the unknown metabolites. Without this critical step, selected metabolites cannot be biologically interpreted or further validated. This issue prompts burgeoning efforts to develop experimental spectral databases along with AI/ML-based in silico prediction models, retention time predictors, and chemical similarity algorithms that facilitate annotation at different levels of confidence, from molecular formulas to chemical classes to absolute identification of a metabolite or chemical. Such tools have the potential to drastically improve the breadth of annotations needed for exposomics, but much work remains to be done.

Evidence suggests that, when we know that an exogenous or non-endogenous chemical exists within an untargeted dataset (for example, by screening the data against an in-house library or an a priori hypothesis), we can uncover associations between that chemical and health outcomes. However, examples of successfully extracting these chemicals directly from the data using untargeted data analysis workflows (e.g., pre-processing, feature selection) are limited beyond food and microbiome metabolites.[54,107] Recent work has shown that typical filtering criteria for untargeted data can miss up to 80% of significant peaks,[37] suggesting that pre-processing may play a pivotal role in this challenge. However, only in the last several years has there been advancement in addressing this peak quality aspect via ML classifiers and AI algorithm development.[16] The ability to maximize features while decreasing false positives is a critical challenge to overcome in the field of untargeted metabolomics. This issue is exacerbated in exposomics, in which exposure biomarkers may be difficult to discriminate from noise.
Although important strides are being made through development of ML classifiers to improve retention of high-quality peaks, these classifiers remain largely untested on the diverse range of concentrations seen in complex biological samples, and the limited data available suggest that current algorithms and classifiers are insufficient to robustly capture low-concentration chemicals. This gap highlights a critical need for ML algorithms focused on retaining quality peaks across the full dynamic range of chemicals measured in a biological sample.

The “functional exposomics” concept suggests that the complexity of the exposome can be reduced by focusing on a biology-driven approach.[108] Such an approach is meant to complement measurement-based approaches; for example, as with SteroidXtract,[98] ML might predict and identify spectral patterns of exogenous chemicals by chemical class, focusing on those with similar biological activity, such as endocrine-disrupting chemicals, or those for which analytical standards are not readily available, such as conjugated phthalates. This principle—that exogenous metabolites with potential to alter biology are likely to consist of several building blocks that mimic biochemical machinery—drives development of tools for determining structures of natural products.[109] In this case, using a biological rather than a statistical approach to focus on feature candidates within the HRMS data may reveal previously unknown combinations of chemicals working synergistically to affect health. Many of these high-confidence annotations require collection of MS/MS data, which may not be possible for low-concentration chemicals. However, this problem might be surmountable by utilizing the MS1 data collected on every sample. Existing HRMS exposomics tools that focus on reaction-level chemical changes[110,111] and “molecular gatekeepers” that determine active molecular networks can be expanded with information from in-source fragments, retention time prediction, and cognitive computing.[112,113]

AI/ML advances in data processing have triggered significant discoveries in metabolomics and are poised to do the same in the field of exposomics. The success of DL algorithms on unstructured data and the use of new AI/ML approaches not yet readily implemented in metabolomics and exposomics, combined with available datasets or samples containing known chemicals at low and high concentrations[114,115] for training and validating peak-picking algorithms and the QC step, are key starting points for catalyzing this new era of discovery toward environmental and precision health.
References (108 in total; 10 shown)

1.  One Step Forward for Reducing False Positive and False Negative Compound Identifications from Mass Spectrometry Metabolomics Data: New Algorithms for Constructing Extracted Ion Chromatograms and Detecting Chromatographic Peaks.

Authors:  Owen D Myers; Susan J Sumner; Shuzhao Li; Stephen Barnes; Xiuxia Du
Journal:  Anal Chem       Date:  2017-08-17       Impact factor: 6.986

2. (Review) Software tools, databases and resources in metabolomics: updates from 2018 to 2019.

Authors:  Keiron O'Shea; Biswapriya B Misra
Journal:  Metabolomics       Date:  2020-03-07       Impact factor: 4.290

3.  A Universal Gut-Microbiome-Derived Signature Predicts Cirrhosis.

Authors:  Tae Gyu Oh; Susy M Kim; Cyrielle Caussy; Ting Fu; Jian Guo; Shirin Bassirian; Seema Singh; Egbert V Madamba; Ricki Bettencourt; Lisa Richards; Ruth T Yu; Annette R Atkins; Tao Huan; David A Brenner; Claude B Sirlin; Michael Downes; Ronald M Evans; Rohit Loomba
Journal:  Cell Metab       Date:  2020-11-03       Impact factor: 27.287

4. (Review) Linking genomics and metabolomics to chart specialized metabolic diversity.

Authors:  Justin J J van der Hooft; Hosein Mohimani; Anelize Bauermeister; Pieter C Dorrestein; Katherine R Duncan; Marnix H Medema
Journal:  Chem Soc Rev       Date:  2020-05-12       Impact factor: 54.564

5.  WaveICA 2.0: a novel batch effect removal method for untargeted metabolomics data without using batch information.

Authors:  Kui Deng; Falin Zhao; Zhiwei Rong; Lei Cao; Liuchao Zhang; Kang Li; Yan Hou; Zheng-Jiang Zhu
Journal:  Metabolomics       Date:  2021-09-20       Impact factor: 4.290

6.  IDSL.IPA Characterizes the Organic Chemical Space in Untargeted LC/HRMS Data Sets.

Authors:  Sadjad Fakouri Baygi; Yashwant Kumar; Dinesh Kumar Barupal
Journal:  J Proteome Res       Date:  2022-05-17       Impact factor: 5.370

7.  MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC-MS metabolomics data.

Authors:  Kelsey Chetnik; Lauren Petrick; Gaurav Pandey
Journal:  Metabolomics       Date:  2020-10-21       Impact factor: 4.290

8. (Review) Guidelines and considerations for the use of system suitability and quality control samples in mass spectrometry assays applied in untargeted clinical metabolomic studies.

Authors:  David Broadhurst; Royston Goodacre; Stacey N Reinke; Julia Kuligowski; Ian D Wilson; Matthew R Lewis; Warwick B Dunn
Journal:  Metabolomics       Date:  2018-05-18       Impact factor: 4.290

9. (Review) Integration strategies of multi-omics data for machine learning analysis.

Authors:  Milan Picard; Marie-Pier Scott-Boyer; Antoine Bodein; Olivier Périn; Arnaud Droit
Journal:  Comput Struct Biotechnol J       Date:  2021-06-22       Impact factor: 7.271

10.  PubChem 2019 update: improved access to chemical data.

Authors:  Sunghwan Kim; Jie Chen; Tiejun Cheng; Asta Gindulyte; Jia He; Siqian He; Qingliang Li; Benjamin A Shoemaker; Paul A Thiessen; Bo Yu; Leonid Zaslavsky; Jian Zhang; Evan E Bolton
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

