Guido Zampieri, Supreeta Vijayakumar, Elisabeth Yaneske, Claudio Angione.
Abstract
Omic data analysis is steadily growing as a driver of basic and applied molecular biology research. Core to the interpretation of complex and heterogeneous biological phenotypes are computational approaches in the fields of statistics and machine learning. In parallel, constraint-based metabolic modeling has established itself as the main tool to investigate large-scale relationships between genotype, phenotype, and environment. The development and application of these methodological frameworks have occurred independently for the most part, whereas the potential of their integration for biological, biomedical, and biotechnological research is less known. Here, we describe how machine learning and constraint-based modeling can be combined, reviewing recent works at the intersection of both domains and discussing the mathematical and practical aspects involved. We overlap systematic classifications from both frameworks, making them accessible to nonexperts. Finally, we delineate potential future scenarios, propose new joint theoretical frameworks, and suggest concrete points of investigation for this joint subfield. A multiview approach merging experimental and knowledge-driven omic data through machine learning methods can incorporate key mechanistic information in an otherwise biologically agnostic learning process.
Year: 2019 PMID: 31295267 PMCID: PMC6622478 DOI: 10.1371/journal.pcbi.1007084
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Omic data–integration methods in machine learning.
Multiview omic data–integration methods can be classified into three main domains. (a) Concatenation-based (early-stage) integration involves combining all omic data into one large matrix before applying ML methods to obtain a data-driven model. (b) Transformation-based (intermediate-stage) integration involves applying data transformations to obtain a uniform format, which can then permit the combination into one fused dataset. (c) Model-based (late-stage) integration involves obtaining individual machine learning models separately for each dataset before combining the outcomes rather than combining data prior to the learning phase. ML, machine learning.
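The three integration stages described in this caption can be sketched on toy data as follows. This is a minimal illustration, not a method taken from the reviewed studies: the two random "views", the ridge and kernel ridge models, the linear kernel used as the common format, and the equal fusion weights are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (purely illustrative): 40 samples, two omic views, one phenotype.
n = 40
X_a = rng.normal(size=(n, 5))   # stand-in for one omic layer
X_b = rng.normal(size=(n, 3))   # stand-in for a second omic layer
y = X_a[:, 0] + 2.0 * X_b[:, 1] + 0.1 * rng.normal(size=n)


def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression weights (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def kernel_ridge_predict(K, y, lam=1.0):
    """In-sample kernel ridge predictions for a precomputed kernel matrix K."""
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
    return K @ alpha


# (a) Concatenation-based (early): stack the views column-wise, fit one model.
X_cat = np.hstack([X_a, X_b])
pred_early = X_cat @ ridge_fit(X_cat, y)

# (b) Transformation-based (intermediate): bring each view to a uniform
# format (here an n x n linear kernel), fuse, then learn on the fused data.
K_fused = 0.5 * (X_a @ X_a.T) + 0.5 * (X_b @ X_b.T)
pred_mid = kernel_ridge_predict(K_fused, y)

# (c) Model-based (late): fit one model per view, then combine the outcomes.
pred_a = X_a @ ridge_fit(X_a, y)
pred_b = X_b @ ridge_fit(X_b, y)
pred_late = 0.5 * (pred_a + pred_b)

for name, pred in [("early", pred_early),
                   ("intermediate", pred_mid),
                   ("late", pred_late)]:
    print(name, "in-sample MSE:", float(np.mean((pred - y) ** 2)))
```

The errors printed here are in-sample and serve only to show that each strategy runs end to end; comparing the strategies in practice would require held-out data.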
Fig 2. Constraint-based data integration and fluxome generation.
(a) Constraint-based metabolic modeling begins with the construction of a manually curated GSMM recording all reactions taking place in the network. (b) Coded within the structure of a GSMM is the stoichiometric matrix S, denoting the involvement of metabolites in each reaction. Constraints are applied to the model to identify a given metabolic goal, represented as the objective function c, and linear or quadratic optimization is used to maximize or minimize this objective. The steady-state assumption (Sv = 0) requires the product of the stoichiometric matrix S and the flux vector v to be zero, so that metabolite concentrations remain invariant. (c) To compute a unique flux distribution, the objective function can be regularized by subtracting a concave function from it. In addition to v being restricted between default lower and upper limits (v_min and v_max), external multiomic data θ can be used to further constrain fluxes using the mapping function φ(θ), hence driving the output toward condition-dependent solutions. GSMM, genome-scale metabolic model.
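The linear optimization in panels (b) and (c) can be sketched with a generic LP solver. The four-reaction network, the bound values, and the form of the mapping `phi` (scaling each upper limit by a relative expression level) are hypothetical choices for illustration only; the regularization term is omitted. The sketch uses `scipy.optimize.linprog`, which minimizes, so the objective is negated.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical): metabolites A and B; reactions
#   R1: -> A (uptake)    R2: A -> B
#   R3: B -> (objective) R4: A -> (secretion)
S = np.array([
    [1.0, -1.0,  0.0, -1.0],   # mass balance for A
    [0.0,  1.0, -1.0,  0.0],   # mass balance for B
])
c = np.array([0.0, 0.0, 1.0, 0.0])               # objective: flux through R3
v_min = np.zeros(4)                              # default lower limits
v_max = np.array([10.0, 10.0, 1000.0, 1000.0])   # default upper limits


def fba(S, c, v_min, v_max):
    """Maximize c^T v subject to S v = 0 and v_min <= v <= v_max."""
    res = linprog(-c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=list(zip(v_min, v_max)), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x, -res.fun


def phi(theta, v_max):
    """Hypothetical mapping: scale upper limits by relative expression."""
    return v_max * theta


# Reference solution with default limits only.
v_ref, obj_ref = fba(S, c, v_min, v_max)

# Condition-specific solution: the enzyme for R2 is expressed at half level.
theta = np.array([1.0, 0.5, 1.0, 1.0])
v_cond, obj_cond = fba(S, c, v_min, phi(theta, v_max))

print("objective, default limits:", obj_ref)
print("objective, expression-constrained:", obj_cond)
```

In this toy case, halving the capacity of R2 halves the attainable objective (from 10 to 5), while the remaining fluxes are not uniquely determined; that ambiguity is what the regularization mentioned in panel (c) is meant to resolve.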
Overview of previous studies that integrated CBM and machine learning, grouped by task type.
| Study | Data integration approach | Machine learning component | CBM component | Task |
|---|---|---|---|---|
| [ | - | Regularized multinomial logistic regression | FBA | Prediction of growth conditions |
| [ | - | Bagging SVM, random forest | FVA, gene deletion | Inhibitory drug side effect prediction |
| [ | - | ANN | FBA, gene deletion | Prediction of xylitol production |
| [ | - | SVM, ANN, NMF | FBA | Prediction of bacterial ecological niches |
| [ | - | Random forest | dFBA | Prediction of ecological interactions |
| [ | - | Discriminant analysis | Elementary flux modes | Identification of distinguishing metabolic patterns between conditions |
| [ | - | PCA, SVM, elastic net, random forest, XGBoost, kNN, ANN, ensemble learning | FBA | Estimation of titer, production rate, and yield of microbial factories |
| [ | - | Hierarchical clustering | FBA | Characterization of epistasis in yeast metabolism |
| [ | - | PCA | Random sampling | Decomposition of metabolic flexibility |
| [ | - | PCA | Elementary flux modes | Identification of metabolic patterns |
| [ | - | Hierarchical clustering | FBA | Exploration of ecological interactions |
| [ | - | PCA | Stoichiometric constraints | Identification of responsive pathways |
| [ | - | PCA | Elementary flux modes | Identification of metabolic patterns in dynamic conditions |
| [ | Concatenation based | SVM | FBA, reaction deletion | Reaction essentiality prediction |
| [ | Constraint based | Kernel kNN | Maximization of consistency between reaction activity and gene expression | Drug target prediction |
| [ | Concatenation based | Random forest, logistic regression | FBA | Genetic interactions prediction in yeast |
| [ | Constraint based, concatenation based, model based | RNN, LASSO regression, ensemble learning | FBA | Cross-omic states prediction in |
| [ | Constraint based | Decision trees | TFBA | Estimation of kinetic parameter range and identification of key enzymes |
| [ | Concatenation based | SVM-RFE | FCA | Prediction of gene essentiality |
| [ | Transformation based | Sparse-group LASSO | Extreme currents | Identification of disease-deregulated pathways |
| [ | Constraint based, concatenation based | Elastic net regression, PCA, GLM | Bilevel FBA | Prediction of lactate production in CHO cells |
| [ | Model based | ANN, autoencoder | FBA, gene deletion | Phenotypic predictions in |
| [ | Constraint based | Elastic net regression | Bilevel FBA | Identification of polyomic predictors of aging |
| [ | Constraint based | Elastic net regression | Geometric FBA | Identification of disrupted pathways in |
| [ | Constraint based | Bayesian factor modeling | Bilevel FBA | Prediction of temporal pathway activation in |
| [ | Constraint based | Hierarchical clustering, | Bilevel FBA | Polyomic characterization of aging |
| [ | Constraint based | PCA | Geometric FBA | Identification of biomarkers for rhamnolipids biosynthesis |
| [ | Constraint based, model based | ANN | Stoichiometric constraints | Interpretation of gene expression data in |
| [ | - | kNN, decision trees, SVM | Stoichiometric constraints | Metabolic flux estimation based on general genetic and environmental conditions |
| [ | Constraint based | PCA | FBA, Monte Carlo sampling | Characterization of engineered |
| [ | Constraint based | PCA, linear regression | FBA | Metabolic flux estimation in dynamic conditions |
| [ | Concatenation based, constraint based | Elastic net regression, random forest, neural networks, ensemble learning | FBA, pFBA, ME model | Prediction of proteomic data |
Fig 3. Multiomic data analysis by combination of constraint-based modeling with machine learning.
(a) Fluxomic analysis involves FBA or related techniques performed on a general-purpose GSMM, from which the flux data obtained can be used as input for unsupervised or supervised machine learning. (b) To improve the accuracy of machine learning predictions, multiomic datasets are obtained using high-throughput analytics, e.g., transcriptomics (DNA microarrays, RNA sequencing), proteomics (2D gel electrophoresis, stable isotope labeling, mass spectrometry), or metabolomics (NMR spectroscopy, isotopic labeling, LC-MS, GC-MS). As these datasets are obtained from different sources, they must undergo several preprocessing stages such as filtration and normalization to maintain synchronicity, account for variance, and reduce noise. Condition-specific knowledge-based models are generated by introducing these multiple datasets into GSMMs to obtain more precise flux estimations, from which machine learning techniques can be applied to infer biologically relevant patterns in the data. (c) Alternatively, machine learning can be directly applied to single- or multiomic datasets to produce or improve GSMMs or fluxomic data. FBA, flux balance analysis; GC-MS, gas chromatography–mass spectrometry; GSMM, genome-scale metabolic model; LC-MS, liquid chromatography–mass spectrometry; NMR, nuclear magnetic resonance.
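The workflow of panels (a) and (b), where condition-specific flux distributions become the input to a learning method, can be sketched as below. Everything specific here is an assumption made for illustration: the four-reaction network, the six made-up conditions, the small objective weight on R4 (added only so that each optimum is unique), and the choice of PCA via singular value decomposition as the unsupervised step.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical): R1: -> A, R2: A -> B, R3: B ->, R4: A ->
S = np.array([
    [1.0, -1.0,  0.0, -1.0],   # mass balance for A
    [0.0,  1.0, -1.0,  0.0],   # mass balance for B
])
# Hypothetical objective: mainly R3, with a small weight on R4 so that
# the optimal flux distribution is unique in every condition.
c = np.array([0.0, 0.0, 1.0, 0.1])


def condition_specific_fluxes(uptake_limit, expression_r2):
    """FBA with the R1 limit set by the medium and the R2 limit by expression."""
    v_max = np.array([uptake_limit, 10.0 * expression_r2, 1000.0, 1000.0])
    res = linprog(-c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=[(0.0, ub) for ub in v_max], method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x


# Six made-up conditions: (uptake limit, relative expression of the R2 enzyme).
conditions = [(10, 1.0), (10, 0.5), (10, 0.2), (5, 1.0), (5, 0.5), (2, 1.0)]

# Flux matrix: one row per condition, one column per reaction.
F = np.array([condition_specific_fluxes(u, e) for u, e in conditions])

# Unsupervised step: PCA via SVD of the mean-centered flux matrix.
F_centered = F - F.mean(axis=0)
U, sing, Vt = np.linalg.svd(F_centered, full_matrices=False)
scores = U * sing                        # condition coordinates on the PCs
explained = sing**2 / np.sum(sing**2)    # fraction of variance per PC

print("explained variance ratio:", np.round(explained, 3))
```

Because the steady-state constraints tie R3 to R2 and R4 to the difference between R1 and R2, the toy flux matrix has only two independent directions, so two principal components capture essentially all of its variance. A supervised variant would instead pass `F`, alone or alongside other omic layers, to a classifier or regressor.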