Abstract
Machine learning uses experimental data to optimize clustering or classification of samples or features, or to develop, augment or verify models that can be used to predict behavior or properties of systems. It is expected that machine learning will help provide actionable knowledge from a variety of big data including metabolomics data, as well as results of metabolism models. A variety of machine learning methods have been applied in bioinformatics and metabolism analyses, including self-organizing maps, support vector machines, kernel machines, Bayesian networks and fuzzy logic. To a lesser extent, machine learning has also been utilized to take advantage of the increasing availability of genomics and metabolomics data for the optimization of metabolic network models and their analysis. In this context, machine learning has aided the development of metabolic networks, the calculation of parameters for stoichiometric and kinetic models, as well as the analysis of major features in the model for the optimal application of bioreactors. Examples of this very interesting, albeit highly complex, application of machine learning to metabolism modeling are the primary focus of this review. It presents several types of applications for model optimization, parameter determination and model-based system analysis, as well as the use of several types of machine learning technologies.
Keywords: genomics; machine learning; metabolism modeling; metabolomics; system biology
Year: 2018 PMID: 29324649 PMCID: PMC5875994 DOI: 10.3390/metabo8010004
Source DB: PubMed Journal: Metabolites ISSN: 2218-1989
Groups of machine learning algorithms based on algorithm similarities. Groups are based on [4,5]. Examples of metabolomics applications of the methods are provided in the references included.
| Algorithm Group | Short Description | Methods and Some Metabolomics Uses |
|---|---|---|
| Regression algorithms | Iteratively improve the model of the relationship between features and labels using the error measure. | Ordinary Least Squares Regression (OLSR); linear regression; stepwise regression; Locally Estimated Scatterplot Smoothing (LOESS) |
| Instance-based algorithms | Compare new problem instances (e.g., samples) with examples seen in training. | k-Nearest Neighbors (kNN); Self-Organizing Map (SOM); Locally Weighted Learning (LWL) |
| Regularization algorithms | Extensions to other models that penalize model complexity, generally favouring simpler models. | Least Absolute Shrinkage and Selection Operator (LASSO); elastic net |
| Decision tree algorithms | Trained on the data for classification and regression problems, providing a flowchart-like model in which each node denotes a test on an attribute, each branch represents an outcome of the test and each leaf node holds a class label. | Classification and regression tree (CART); C4.5 and C5.0; decision stump; regression tree |
| Bayesian algorithms | Application of Bayes’ theorem for the probability of classification and regression. | Naive Bayes; Gaussian naive Bayes; Bayesian Belief Network (BBN); Bayesian Network (BN) |
| Association rule learning algorithms | Methods aiming to extract rules that best explain the relationships between variables. | Apriori algorithm; Eclat algorithm |
| Artificial neural network algorithms including deep learning | Building of a neural network. | Perceptron |
| Dimensionality reduction algorithms | Unsupervised and supervised methods seeking and exploiting inherent structures in the data in order to simplify data for easier visualization or selection of major characteristics. | Principal Component Analysis (PCA) |
| Ensemble algorithms | Models composed of multiple weaker models that are trained independently and whose predictions are combined in some way to provide a greatly improved overall prediction. | Boosting |
Figure 1. Overview of different data analysis steps in metabolomics and metabolism modeling where machine learning methodologies have found uses.
Freely available software tools providing machine learning methods applicable to metabolism analysis.
| Tool Name | Focus | Availability |
|---|---|---|
| FingerID | Molecular fingerprinting | |
| SIRIUS | Molecular fingerprinting | |
| MetaboAnalyst | General tool for metabolomics analysis | |
| MeltDB 2.0 | General tool for metabolomics analysis | - |
| KNIME * | General machine learning tool | |
| Weka | General machine learning tool | |
| Orange * | General machine learning tool | |
| TensorFlow | General machine learning tool | |
* Visual programming languages made for easy software development without extensive programming knowledge.
Figure 2. Schematic comparison of representative methods for the metabolic pathway and network models including different constraints and approaches to defining metabolic reactions.
Figure 3. An outline of the use of machine learning methods in metabolism modeling for optimizing model parameters and testing different input conditions. In this example, parameters in the model are selected at random (e.g., using Monte Carlo sampling) or, alternatively, reactions are removed or added (in constraint-based models), and the models are run. The success of each model run is determined using the output of interest (e.g., cell growth or cell death, production of the molecule of interest, etc.). Model parameters (1–N) are used as the feature vector, with the success label used as the class label in the machine learning classifier. The classifier identifies the patterns in parameter space with the highest discriminatory power, i.e., those that best separate successful from unsuccessful model runs.
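The workflow outlined in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any study cited in the review: the Michaelis-Menten style `run_toy_model`, the parameter names and ranges, and the success threshold are all hypothetical, and a decision stump (one of the decision tree methods listed in the table above) stands in for whatever classifier would be used in practice.

```python
import random

# Hypothetical model parameters and sampling ranges (assumptions for illustration).
PARAM_NAMES = ["vmax", "km", "k_deg"]
PARAM_RANGES = [(0.0, 2.0), (0.1, 1.0), (0.0, 0.2)]


def run_toy_model(params, substrate=1.0):
    """Hypothetical stand-in for one model run: net production rate."""
    vmax, km, k_deg = params
    return vmax * substrate / (km + substrate) - k_deg


def sample_runs(n_runs, success_threshold=0.6, seed=0):
    """Monte Carlo sampling of parameter sets, each labelled by model success."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n_runs):
        params = [rng.uniform(lo, hi) for lo, hi in PARAM_RANGES]
        features.append(params)                      # feature vector (parameters 1..N)
        labels.append(run_toy_model(params) > success_threshold)  # class label
    return features, labels


def fit_decision_stump(features, labels):
    """Return (accuracy, parameter index, threshold, success_if_above) for the
    single-parameter rule that best separates successful from failed runs."""
    n = len(labels)
    best = None
    for j in range(len(features[0])):
        for threshold in sorted({row[j] for row in features}):
            # Number of runs correctly classified by "success if value > threshold".
            hits = sum((row[j] > threshold) == label
                       for row, label in zip(features, labels))
            # The reversed rule classifies exactly the remaining runs correctly.
            for success_if_above, correct in ((True, hits), (False, n - hits)):
                candidate = (correct / n, j, threshold, success_if_above)
                if best is None or candidate[0] > best[0]:
                    best = candidate
    return best


features, labels = sample_runs(400)
accuracy, index, threshold, success_if_above = fit_decision_stump(features, labels)
print(PARAM_NAMES[index], round(threshold, 2), success_if_above, round(accuracy, 2))
```

In this toy setting the stump reports which single parameter, and which cut-off on it, best discriminates successful runs. A real application would replace `run_toy_model` with the actual kinetic or constraint-based model and would typically use a richer classifier.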
Figure 4. In the examples presented by [71,72], gene expression data are used to define the upper limit for fluxes through the reactions catalyzed by the corresponding gene products. Further flux optimization is subsequently possible with different methods including machine learning, as shown in Figure 3.
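The idea in Figure 4 can be illustrated with a small sketch. Everything specific here is an assumption made for illustration rather than the procedure of [71,72]: the scaling rule (upper bound proportional to expression relative to the most highly expressed gene), the gene and reaction names, the expression values, and the four-reaction toy network. For this particular topology the maximal steady-state flux has a closed form; a general network would need a linear programming solver.

```python
def expression_to_upper_bounds(expression, reaction_genes, max_flux=10.0):
    """Scale each reaction's upper flux bound by the relative expression of its gene.

    Assumed rule: bound = max_flux * expression / highest expression in the data set.
    """
    top = max(expression.values())
    return {rxn: max_flux * expression[gene] / top
            for rxn, gene in reaction_genes.items()}


def max_biomass_flux(bounds):
    """Maximal steady-state flux for the hypothetical toy network

        uptake -> A,  A -> B via r1 or r2 (parallel),  B -> biomass.

    At steady state the flux through each stage must be equal, so the optimum
    is limited by the tightest stage. This shortcut holds only for this topology.
    """
    return min(bounds["uptake"], bounds["r1"] + bounds["r2"], bounds["biomass"])


# Hypothetical gene-to-reaction mapping and expression levels (arbitrary units).
reaction_genes = {"uptake": "g_uptake", "r1": "g_r1", "r2": "g_r2", "biomass": "g_biomass"}
expression = {"g_uptake": 80.0, "g_r1": 20.0, "g_r2": 10.0, "g_biomass": 100.0}

bounds = expression_to_upper_bounds(expression, reaction_genes)
flux_low = max_biomass_flux(bounds)

# Same network with g_r2 strongly expressed: the A -> B stage stops being limiting.
induced = dict(expression, g_r2=100.0)
flux_induced = max_biomass_flux(expression_to_upper_bounds(induced, reaction_genes))

print(bounds, flux_low, flux_induced)
```

The comparison of the two conditions shows how a change in expression of a single gene can shift the bottleneck from one stage of the pathway to another once expression-derived bounds are imposed.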
Figure 5. Analysis of gene correlation in stoichiometric models and experimental gene expression data can provide information about the system robustness and lead to more detailed information about gene correlation under different conditions.
Figure 6. Schematic representation of the DeepMetabolism approach by Guo et al. [81].