Ulf W Liebal, An N T Phan, Malvika Sudhakar, Karthik Raman, Lars M Blank.
Abstract
The metabolome of an organism depends on environmental factors and intracellular regulation and provides information about the physiological conditions. Metabolomics helps to understand disease progression in clinical settings or estimate metabolite overproduction for metabolic engineering. The most popular analytical metabolomics platform is mass spectrometry (MS). However, MS metabolome data analysis is complicated, since metabolites interact nonlinearly, and the data structures themselves are complex. Machine learning methods have become immensely popular for statistical analysis due to their inherently nonlinear data representation and their ability to process large and heterogeneous data rapidly. In this review, we address recent developments in using machine learning for processing MS spectra and show how machine learning generates new biological insights. In particular, supervised machine learning has great potential in metabolomics research because of its ability to supply quantitative predictions. We review here commonly used tools, such as random forest, support vector machines, artificial neural networks, and genetic algorithms. During processing steps, supervised machine learning methods help with peak picking, normalization, and missing data imputation. For knowledge-driven analysis, machine learning contributes to biomarker detection, classification and regression, biochemical pathway identification, and carbon flux determination. Of particular relevance is the combination of different omics data to identify the contributions of the various regulatory levels. Our overview of the recent publications also highlights that data quality determines analysis quality, but also adds to the challenge of choosing the right model for the data. Machine learning methods applied to MS-based metabolomics ease data analysis and can support clinical decisions, guide metabolic engineering, and stimulate fundamental biological discoveries.
Keywords: MS-based metabolomics; machine learning; metabolic engineering; metabolic flux analysis; multi-omics; synthetic biology
Year: 2020 PMID: 32545768 PMCID: PMC7345470 DOI: 10.3390/metabo10060243
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Description of important supervised statistical models.
| Supervised ML Model | Description | Advantage | Disadvantage |
|---|---|---|---|
| PLS: Projection to Latent Structures/Partial Least Squares Regression [ | PLS is a supervised method to construct predictive models when the factors are collinear. PLS-DA is an extension of PLS that can maximize the covariance between classes. Orthogonal PLS (OPLS) is an extension to increase latent feature interpretability. | Overfitting risk: | Collinear data |
| RF: Random Forest [ | Composed of several decision trees. Each decision tree separates the samples according to the measured feature properties. Different trees use a random subset of samples and features for classification. | Overfitting risk: | Features/sample: |
| SVM: Support Vector Machine [ | A boundary is determined that separates the classes. For nonlinear separation, the data is augmented by additional dimensions using a kernel function (Φ), often the Radial Basis Function (RBF). | Features/sample: | Overfitting risk: |
| ANN: Artificial Neural Network [ | The features are transformed by hidden nodes with a linear equation ‘z’ and a nonlinear function ‘g.’ Several layers may follow, each with nodes containing transformations by functions ‘z’ and ‘g.’ The output is generated by a ‘softmax’ function. | Features/sample: | Overfitting risk: |
| GA: Genetic Algorithm [ | The solution space is searched by operations similar to natural genetic processes to identify suitable solutions. A fitness function is defined to find the fittest solutions. The fittest solutions are subject to cross-over and mutations to evolve towards the best solution. | Interpretation: | Overfitting risk: |
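The ANN description in the table above (a linear equation ‘z’, a nonlinear function ‘g,’ and a final ‘softmax’) can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions: the choice of ReLU for ‘g,’ the function names, and the toy weights are all hypothetical and are not taken from the review.

```python
import math

def linear(weights, bias, x):
    """Linear equation z = W x + b for one layer (weights is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(z):
    """Nonlinear function g, applied element-wise (ReLU is an assumed choice)."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Turn output scores into class probabilities that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """Apply z and g in every hidden layer, then softmax on the output layer."""
    *hidden, (w_out, b_out) = layers
    a = x
    for w, b in hidden:
        a = relu(linear(w, b, a))
    return softmax(linear(w_out, b_out, a))

# Hypothetical toy network: 3 features -> 2 hidden nodes -> 2 classes.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),            # output layer
]
probs = forward([1.0, 2.0, 3.0], layers)
print(probs)
```

In practice the weights would be learned from labeled spectra rather than set by hand; the sketch covers only the prediction step.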
Figure 1. History of machine learning (ML) in metabolomics. The graph shows the frequency of articles mentioning ‘metabolomics’ (green bars) or ‘metabolomics’ and ‘multivariate’ (orange bars) in the Web of Science for five-year intervals from 2000 to 2020. The pie charts represent different statistical analysis approaches, and the absolute number represented by the pie charts is equal to the ‘multivariate’ bar (orange bars). We searched: RF: random forest (blue, ‘random forest’ and ‘decision forest’), SVM: support vector machine (pink, ‘support vector machine’), ANN: artificial neural networks (green, ‘neural network’ and ‘deep learning’), GA: genetic algorithm (yellow, ‘genetic algorithm’ and ‘evolutionary computation’), PLS: partial least squares (brown, ‘partial least squares’ and ‘projection to latent’), and missing (grey). The missing fraction decreases continuously, indicating the adaptation of nomenclature or the conformance of statistical analyses. We searched for ‘multivariate’ to assess the overall number of metabolomics papers with a statistical analysis and obtained similar results for the term ‘statistical’.
Figure 2. Mass spectrometry workflow with technical and analytical techniques. The MS investigation begins with the definition of the design of the experiment, whether a comprehensive metabolic overview is required, metabolite class identifications are sufficient or targeted metabolites are quantified. The design determines the analytical methods that are distinguished by their metabolite separation. The data processing includes peak processing, normalization and imputation and the contribution of machine learning is discussed in Section 2. The data interpretation is covered in Section 3 and deals with classification and regression, pathway analysis and multi-omics integration. Abbreviations: GC: gas chromatography; LC: liquid chromatography; CE: capillary electrophoresis; IM: ion mobility; DI: direct infusion; MALDI: matrix-assisted laser desorption ionization; MSI: mass spectrometry imaging; DART: direct analysis in real time.
Available spectral databases for metabolite annotation *.
| Database | Description | URL |
|---|---|---|
| HMDB [ | 114,193 metabolite entries including both polar and non-polar metabolites | |
| LMSD [ | 43,665 lipid structures with MS/MS spectra | |
| METLIN [ | 961,829 molecules (lipids, steroids, plant and bacteria metabolites, small peptides, carbohydrates, exogenous drugs/metabolites, central carbon metabolites and toxicants). Over 14,000 metabolites have been individually analyzed and another 200,000 have in silico MS/MS data | |
| isoMETLIN [ | All computed isotopologues derived from METLIN based on | |
| NIST [ | Reference mass spectra for GC/MS, LC–MS/MS, NMR and gas-phase retention indices for GC | |
| MassBank [ | Shared public repository of mass spectral data with 41,092 spectra | |
| MoNA | 200,000+ mass spectral records from experimental, in silico libraries and user contributions | |
| mzCloud | More than 6 million multi-stage MSn spectra for more than 17,670 compounds | |
| PRIME [ | Standard spectra of standard compounds generated by GC/MS, LC–MS, CE/MS and NMR | |
| | 2019 metabolites with GC-MS spectra and retention time indices | |
| | Community database for natural products | |
| | Over 9000 MS/MS spectra of phytochemicals | |
* Adapted and updated from An PNT et al. [41].
ML tools for data processing since 2019.
| Step | ML Tool | Example | Ref. |
|---|---|---|---|
| Peak picking/integration | SVM | WIPP software: optimization of peak detection, instrument and sample specific | [ |
| | ANN | Peak quality selection for downstream analysis | [ |
| | CNN | Trace: two-dimensional peak picking over retention time and m/z | [ |
| | CNN | peakonly software: peak picking and integration | [ |
| | CNN | Peak classification for subsequent PARAFAC analysis | [ |
| | CNN | DeepSWATH software: correlation between parent metabolites and fragment ions in MS/MS spectra | [ |
| | CNN | Representational learning from different tissues, organisms, ionization, instruments for improved classification on small datasets | [ |
| | CNN | ‘DeepSpectra’: targeted metabolomics on environmental samples, raw spectra analysis | [ |
| | CNN | Compound recognition in complex tandem MS data tested with several ML tools | [ |
| Retention time prediction | ANN | Metlin-integrated prediction of metabolite retention time extrapolation to different chromatographic methods | [ |
| | Ensemble | Performance test of multiple ML algorithms for retention time prediction based on physical properties, ANN and SVM perform well, ensemble training is optimal | [ |
| Metabolite annotation | SVM | Input–output kernel regression (IOKR) to predict fingerprint vectors from m/z spectra, mapping molecular structures to spectra | [ |
| | SVM | CSI:Fingerprint:Structure mapping | [ |
| | Text mining | MS2LDA software: detection of peak co-occurrence | [ |
| | Text mining | MESSAR software: automated substructure recommendation for co-occurring peaks | [ |
| | ANN | NEIMS software: ‘neural electron-ionization MS’ spectrum prediction | [ |
| | ANN | DeepMASS software: substructure detection by comparing unknown spectra to known spectra | [ |
| | CNN | DeepEI software: fingerprint prediction from MS spectrum | [ |
| Normalization | RF | SERRF software: Systematic error removal based on quality control pool samples | [ |
| | RF | pseudoQC software: simulated quality control sample generation, preferably with RF | [ |
| | SVM | Improvement of statistical analysis by SVM normalization | [ |
| Imputation | RF | Best overall performance of RF for unknown missing value type | [ |
| | Bayesian Model | BayesMetab: classification of missing value type, Markov chain Monte Carlo approach with data augmentation | [ |
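The normalization rows above rely on quality control (QC) pool samples as a reference for systematic error. The tools listed there fit random forests or SVMs. The sketch below deliberately swaps in a much simpler per-batch median scaling, purely to illustrate how QC samples can anchor a correction. The function name `qc_median_normalize`, the batch labels, and the toy intensities are hypothetical.

```python
from statistics import median

def qc_median_normalize(intensities, batches, is_qc):
    """Scale one metabolite so every batch's QC median matches the global QC median.

    Simplified stand-in for ML-based QC normalization; it assumes every batch
    contains at least one QC sample with a non-zero intensity.
    """
    target = median(v for v, q in zip(intensities, is_qc) if q)
    factors = {}
    for batch in set(batches):
        qc_values = [v for v, b, q in zip(intensities, batches, is_qc)
                     if q and b == batch]
        factors[batch] = target / median(qc_values)
    return [v * factors[b] for v, b in zip(intensities, batches)]

# Hypothetical toy data: batch 2 reads twice as high as batch 1.
intensities = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
batches = [1, 1, 1, 2, 2, 2]
is_qc = [True, False, True, True, False, True]

normalized = qc_median_normalize(intensities, batches, is_qc)
print(normalized)
```

After scaling, corresponding samples from the two batches carry the same intensity, which is the drift-removal effect that the ML-based tools aim for with more flexible models.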
Data to knowledge procedures with ML support published from 2019. In some cases, different datasets (DS) are used for samples. Spec-Type—spectrometry type; Ens.—ensemble ML approach; Tar.—targeted; SCMS—single-cell MS; Bench. data—benchmark datasets; Sim.—simulated.
| Biological Insight | Optimal ML | Other Models | Samples | Dimension Reduction | Spec-Type | Comment | Ref. |
|---|---|---|---|---|---|---|---|
| Class + biomarker | SVM | LDA, QDA | 4 DS: | PCA | IR | Effect of variance and covariance on classification of infrared spectra. | [ |
| | SVM | RF, | 80 | RFE | LC–MS | Serum identification of lipids, glycans, fatty acids. | [ |
| | RF | N.A. | <100 | N.A. | SCMS | Single-cell MS on drug response, pathway inference. | [ |
| | RF | SVM, | 703 | LASSO | LC–MS | Serum metabolomics related to chronic kidney disease. | [ |
| | RF | N.A. | 3 DS: | Peak-binning | GCMS | Chromatogram peak ranking for sample discrimination. | [ |
| | RF | N.A. | 217 | Human selection | LC–MS | Metabolite selection based on expert classification with tinderest Shiny-App. | [ |
| | ANN | PLS-DA, | 10 DS: | N.A. | Bench. data | Thorough comparison of ML approaches on different published targeted MS datasets. | [ |
| | GA | RF | 60 | N.A. | LC–MS | Wine origin classification. | [ |
| | Ens. | RF, SVM | 111 | Correlation, information filter | N.A. | Use of symbolic methods, analysis of spectrogram. | [ |
| Regression | Ens. | RF, ANN | 2 DS: | N.A. | Assay | Optimization of gene expression for metabolite overproduction. | [ |
| Pathway inference | RF | Bayes | 500 | N.A. | Sim. | Metabolite correlation network on simulated data. | [ |
| | RF | PLS, Bayes | 339 | Information filter | GCMS | Mapping of metabolic correlation networks to metabolic pathways. | [ |
| | Bayes | N.A. | 2 DS: | N.A. | Sim. | ‘PUMA’: Probabilistic modeling for Untargeted Metabolomics Analysis. Simulation of pathway activity, metabolite association, and spectra. | [ |
| Multi-omics integration | ANN | SVM | 2 DS: | Encoder-decoder | LC–MS/MS | Multi-omics projection to 20–70 latent variables. Classification of latent variables. | [ |
| | ANN | N.A. | 2 DS: | Encoder-decoder | LC–MS | Correlation of gut bacteria level to metabolite level, unsupervised clustering of latent variables. | [ |
| | Text Mining | N.A. | 4 DS: | N.A. | Bench. data | ‘mmvec’: microbial sequence to metabolite occurrence mapping with as little as 166 microbes mapped to 85 metabolites | [ |
| | Bayes | N.A. | 25 | N.A. | Sim. | Estimation of metabolic kinetics based on multi-omics data for lysine synthesis. | [ |
| | Bayes | N.A. | 22 | N.A. | | Estimation of metabolic kinetics based on multi-omics data | [ |
Dimensionality reduction strategies. FS—feature selection; FE—feature extraction.
| Type | Method | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| Unsupervised method | ||||
| FE | Principal Component Analysis (PCA) | Unsupervised method to transform data into axes that explain maximum variability. Returns orthogonal features. | Prior Information: | Interpretation: |
| FE | Kernel PCA (k-PCA) | Transforms the data into a lower dimension that is linearly separable. | Correlation type: | Interpretation: |
| FE | Encoder–Decoder | ANN-based, the encoder maps input to lower-dimensional latent variables. The decoder uses latent variables to generate output. | Correlation type: | Correlation type: |
| Regularization | ||||
| FS | LASSO or L1 | Supervised method to select sparse features. Regularization parameter (L1 penalty) can be used for regression and classification problems. The coefficients ( | Interpretation: | Correlation type: |
| FS | Ridge or L2 | Supervised method to penalize (L2 penalty) large individual weights. The coefficients ( | Note: | Note: |
| FS | Elastic Net | Regularization method to retain advantages of both L1 and L2 penalty. | Note: | Correlation type: |
| Discriminant Analysis | ||||
| FE | Linear Discriminant Analysis (LDA) | Supervised method to transform data into axes, which maximizes class separation. Assumes that data is normal with common class covariance. | Prior information: | Correlation type: |
| | Quadratic Discriminant Analysis (QDA) | Supervised classification similar to LDA. Assumes that data is normal but allows for differing class covariance. | Correlation type: | Not useful for dimensionality reduction |
| Sequential Feature Selection | ||||
| FS | Recursive Feature Elimination/Sequential Backward Selection | At each step, the feature with minimal contribution to the model is dropped until required number of features remain. | Interpretation: | Note: |
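The PCA entry in the table above describes a transformation onto axes that explain maximum variability. A minimal sketch of that idea follows. It extracts only the first principal axis by power iteration on the sample covariance matrix, which is one of several possible ways to compute it. The function name, the iteration count, and the toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the first principal axis and the sample scores along it.

    Uses power iteration on the covariance matrix; assumes at least two
    samples and non-constant data (otherwise the norm below is zero).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical toy data: the second feature is roughly twice the first,
# so most of the variance lies along the direction (1, 2).
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
axis, scores = first_principal_component(data)
print(axis, scores)
```

Further components could be obtained by removing the found axis from the covariance matrix and repeating the iteration, which yields the orthogonal features mentioned in the table.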
Cross-consistency matrix with categorical research topics of multi-omics integration. M—metabolomics; T—transcriptomics; P—proteomics; F—fluxomics.
| Data | Integration Method | Dimensionality Reduction | Model Organisms | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MT | MP | MTP | MPF | MTPF | Concatenation | Post-Analysis Integration | Ensemble | PCA | Regularization | LDA | SFE | | | | Mammalian | | |
| Model | Partial least squares | [ | [ | [ | [ | [ | [ | [ | [ | ||||||||
| Random forest | [ | [ | [ | [ | [ | [ | [ | [ | [ | [ | |||||||
| SVM | [ | [ | [ | [ | [ | [ | [ | [ | |||||||||
| Artificial neural network | [ | [ | [ | [ | [ | [ | |||||||||||
| Genetic algorithms | [ | [ | |||||||||||||||
| Bayesian models | [ | [ | [ | [ | |||||||||||||
| Data | MT | [ | [ | [ | [ | [ | [ | ||||||||||
| MP | [ | [ | [ | [ | [ | ||||||||||||
| MTP | [ | [ | [ | ||||||||||||||
| MPF | [ | [ | |||||||||||||||
| MTPF | [ | [ | [ | [ | [ | ||||||||||||
| Integration Method | Concatenation | [ | [ | [ | |||||||||||||
| Post-analysis integration | [ | [ | [ | [ | [ | ||||||||||||
| Ensemble | [ | [ | |||||||||||||||
| Dimensionality reduction | PCA | [ | [ | ||||||||||||||
| Regularization | [ | [ | |||||||||||||||
| LDA | [ | ||||||||||||||||
| SFE | [ | ||||||||||||||||
Figure 3. Strategies for multi-omics integration. Omics data can be combined in a single matrix with all omics features, called ‘concatenation,’ or each omics measurement is separately analyzed, called ‘post-analysis integration,’ or the data is concatenated, but instead of a single ML model, many models are trained and their results are combined to calculate the optimal response, called ‘ensemble.’
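The three integration strategies named in the caption can be contrasted in a short sketch. Everything concrete here is an assumption made for illustration: a nearest-centroid classifier stands in for whatever ML model is actually used, the ensemble is built from fixed feature subsets with majority voting, and the two toy omics layers are invented.

```python
from collections import Counter

def fit_centroids(X, y):
    """Stand-in model: store the mean feature vector of every class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=dist)

def concatenate(*blocks):
    """Join the per-sample feature vectors of several omics layers."""
    return [[v for block_row in rows for v in block_row] for rows in zip(*blocks)]

def select(X, cols):
    """Keep only the given feature columns."""
    return [[row[c] for c in cols] for row in X]

# Hypothetical toy data: two omics layers measured on the same four samples.
metabolome = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.1], [0.2, 0.9]]
transcriptome = [[5.0], [4.8], [1.0], [1.2]]
labels = ["A", "A", "B", "B"]
new_m, new_t = [0.95, 0.15], [4.9]

combined = concatenate(metabolome, transcriptome)
new_combined = new_m + new_t

# 1) Concatenation: one model on the single combined matrix.
pred_concat = predict(fit_centroids(combined, labels), new_combined)

# 2) Post-analysis integration: one model per omics layer, compared afterwards.
pred_post = (predict(fit_centroids(metabolome, labels), new_m),
             predict(fit_centroids(transcriptome, labels), new_t))

# 3) Ensemble: several models on the concatenated data, combined by voting.
subsets = [(0, 1), (1, 2), (0, 2)]
votes = [predict(fit_centroids(select(combined, cols), labels),
                 [new_combined[c] for c in cols]) for cols in subsets]
pred_ensemble = Counter(votes).most_common(1)[0][0]

print(pred_concat, pred_post, pred_ensemble)
```

The strategies differ only in where the omics layers meet: before modeling (concatenation), after modeling (post-analysis integration), or across many models trained on the combined data (ensemble).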