Marta Sampaio, Miguel Rocha, Oscar Dias.
Abstract
As plants produce an enormous diversity of metabolites to help them adapt to the environment, the study of plant metabolism is of utmost importance to understand different plant phenotypes. Omics data have been generated at an unprecedented rate for several organisms, including plants, and are widely used to study the central dogma of molecular biology, connecting the genome to phenotypes. Constraint-based modelling (CBM) methods, working over genome-scale metabolic models (GSMMs), have been crucial for organising and analysing omics data by integrating them with biochemical knowledge. In 2009, the first plant GSMM was reconstructed and, since then, several advances have been made, including the creation of context- and multi-tissue models that have supported the study of plant metabolism. Nevertheless, plant metabolic modelling remains very challenging. In parallel, as omics datasets are complex and heterogeneous, machine learning (ML) models have been applied in their interpretation to foster knowledge discovery. Recently, the first studies combining both CBM and ML approaches have emerged and have shown promising results. Here, we present the major advances in plant metabolic modelling and review the main CBM-ML hybrid studies. Finally, we discuss the application of machine learning to address the unique challenges of plant metabolic modelling.
Keywords: Constraint-based modelling; Machine learning; Omics data; Plant genome-scale metabolic models
Year: 2022 PMID: 35521559 PMCID: PMC9052043 DOI: 10.1016/j.csbj.2022.04.016
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Description of the most relevant databases of plant metabolic data.
| Database | Description | Data |
|---|---|---|
| KEGG | Generic database resource that comprises genomes, metabolic pathways, chemical compounds, diseases, and drugs. | Metabolic data |
| MetaCyc | Comprehensive database of extensively curated metabolic pathways, containing information on reactions, enzymes, genes, and compounds for several organisms. | Curated metabolic data |
| BioCyc | Collection of organism-specific pathway genome databases (PGDBs), each containing the complete genome and predicted metabolic network of an organism. | Organism-specific predicted metabolic data |
| NCBI | Online repository containing several databases for genomics and biomedical information and tools for extracting and analysing the data. | Reviewed and unreviewed sequence data |
| UniProt | Resource for protein sequence and related information, including manually reviewed data (Swiss-Prot) and automatic, non-reviewed protein annotations (TrEMBL). | Curated and predicted protein data |
| BRENDA | Main database of manually annotated enzyme functional data, which uses the Enzyme Commission (EC) classification system. | Curated enzyme data |
| TCDB | Curated database containing information on transport systems from several organisms, including sequence, structure, and function. It uses the Transport Classification (TC) system to classify transport proteins. | Curated transport data |
| PubChem | The largest database of chemical information, including molecular structure, physical properties, and biological activities of compounds. | Unreviewed chemical data |
| PlantCyc and PMN | PlantCyc contains more than 1000 curated metabolic pathways for at least 350 plant species. This database is the centre of PMN, a resource of plant metabolic databases, and is used as a reference to create plant-specific PGDBs. The current version of PMN (15.0) comprises 126 plant-specific metabolic databases, including curated and predicted databases. | Curated and predicted plant metabolic data |
| PlantReactome | Manually curated, comparative pathway database for plants that is part of Gramene, a resource for comparative functional genomics. | Curated plant metabolic data |
| MetaCrop | Repository of detailed and manually curated metabolic information for six major crop plants of agronomic importance. It allows the data to be exported automatically for the creation of metabolic models. | Curated plant metabolic data |
| SolCyc | Collection of PGDBs for Solanaceae species, including databases for | Curated metabolic data of |
| TAIR | Database of genetic and molecular data for | Curated genetic and metabolic data of |
Description of the most relevant databases of plant omics data.
| Database | Description | Data |
|---|---|---|
| SRA | Archive for next-generation raw sequence data. | Sequences |
| GenBank | Comprehensive collection of all publicly available DNA sequences and respective annotations. | Sequences |
| RefSeq | A comprehensive, curated, and non-redundant collection of sequences, including genomes, transcripts, and proteins. | Sequences |
| Nucleotide | A collection of sequences from different sources including GenBank and RefSeq. | Sequences |
| GEO | Repository of functional genomics data, including raw and processed data with descriptive metadata. | Genomics and transcriptomics |
| DDBJ | Public database of nucleotide sequences at the National Institute of Genetics. | Sequences |
| ENA | A comprehensive nucleotide sequence resource, including raw sequencing data, assembly information, and functional annotations. | Sequences |
| ArrayExpress | Database of functional genomics data and respective metadata. | Genomics and transcriptomics |
| Expression Atlas | A resource for gene and protein expression data for multiple organisms and across different biological conditions. | Transcriptomics |
| PODC | Database of mRNA-sequencing expression data for plants. | Transcriptomics |
| PlantExpress | Database of gene expression data from microarrays for | Transcriptomics |
| ProteomicsDB | Database for quantitative mass spectrometry (MS)-based proteomics data. Currently, it also includes RNA-Seq expression datasets, drug-target interactions, and protein turnover data. | Proteomics |
| PRIDE | Repository of MS-based proteomics data, including protein identification and quantification, post-translational modifications, analysed mass spectra, and technical metadata. | Proteomics |
| PeptideAtlas | Database of peptides identified in MS proteomics experiments. It provides tools for processing and analysing raw MS output data. | Proteomics |
| GPMDB | Database for analysis, validation, and storage of MS proteomics data. | Proteomics |
| MassIVE | Community resource for raw MS data, including proteomics datasets. | Proteomics |
| PPDB | Database for integrating MS-based proteomics data of | Proteomics |
| MetaboLights | Repository for metabolomics data and associated metadata, covering metabolite structures, reference spectra, concentrations, and functions. | Metabolomics |
| MetabolomeExpress | Online server for processing, interpreting, and storing MS metabolomics data. | Metabolomics |
| Metabolomics Workbench | Repository for metabolomics data and associated metadata from MS and nuclear magnetic resonance studies. | Metabolomics |
| GDM | Collection of reference mass spectra and retention times for metabolites. | Metabolomics |
Fig. 1. Timeline of the most relevant plant metabolic model reconstructions.
Fig. 2. Schematic representation of the dynamic multi-tissue model of A. thaliana, including the leaf and root tissues and the common pool in both light and dark phases [62]. Each tissue module includes five compartments: cytoplasm, mitochondria, vacuole, plastid, and peroxisome. Starch, glucose, sucrose, fructose, malate, fumarate, citrate, and nitrate can accumulate in the light and dark phases of leaf and root (dashed rectangle between phases). Amino acids can be stored in the light phase and used in the dark phase. Exchange of amino acids, sucrose, sulphate, nitrate, and phosphate (Pi) between leaf and root was allowed through a common pool using proton pumps. Photon uptake was allowed through the leaf in the light phase, while uptake of mineral nutrients, such as nitrate, sulphate, and Pi, was allowed through the root in both phases. Exchange of carbon dioxide and oxygen was allowed through leaf and root in both phases.
Fig. 3. Schematic overview of the two-tissue model of Z. mays, representing the mesophyll (M) and bundle sheath (BS) leaf cells of C4 plants [75]. Each cell includes four compartments: cytoplasm, chloroplast, peroxisome, and mitochondrion. Small molecules are exchanged directly between the two tissues by transport reactions. The M cells exchange carbon dioxide and oxygen with the intercellular air space, while the BS cells exchange sucrose, glutathione, and glycine with the phloem and import water and inorganic nutrients from the xylem. This type of model representation is used to understand photosynthesis in C4 plants, particularly the interactions between M and BS cells.
Description of the supervised and unsupervised ML algorithms used in combination with CBM methods.
| ML method | Type | Description |
|---|---|---|
| Principal Component Analysis (PCA) | Unsupervised | Linearly transforms the variable space into uncorrelated variables, named principal components, which capture most data variability. |
| Clustering | Unsupervised | Analyses the underlying data structure and groups data observations with similar features into clusters. |
| Autoencoder | Unsupervised | Artificial neural network that compresses the input data into an encoded representation and learns to reconstruct the input from it by minimising the difference from the original data. |
| Support Vector Machine (SVM) | Supervised | Prediction algorithm that aims to find a hyperplane that separates data observations into two classes, while maximising the margin between the hyperplane and the nearest data points of each class. |
| Artificial Neural Network (ANN) and Deep Learning | Supervised | Inspired by biological neural networks, an ANN comprises a collection of connected units named neurons. Each neuron computes a weighted sum of its inputs, which is passed through an activation function to generate the neuron's output signal. |
| Regression algorithms | Supervised | Estimate the functional relationship between the output and the input features. Linear regression is used when the output variable is continuous, while logistic regression predicts a discrete output. Regressions are often combined with regularisation methods, such as the least absolute shrinkage and selection operator (LASSO) and elastic nets. |
| K-nearest neighbours (KNN) | Supervised | Instance-based method that compares new observations with the previously trained examples that have been stored in memory. |
| Decision Trees | Supervised | Build a tree-like model of decisions, wherein nodes denote the attributes, branches represent attribute values, and leaf nodes hold the class labels. The paths from the root to leaf represent classification rules. |
| Random Forests (RF) | Supervised | Ensemble of decision trees in which each tree is built using a randomly selected subset of features. |
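
As a rough illustration of the instance-based idea behind KNN in the table above, the following minimal Python sketch classifies a new flux profile by majority vote among its nearest stored examples. The function name `knn_predict`, the flux values, and the "light"/"dark" labels are hypothetical and chosen only for illustration; they are not taken from any of the reviewed studies.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank the stored examples by Euclidean distance to the query.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    # Count the labels of the k closest examples and return the most frequent.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical flux profiles (three reaction fluxes each) with made-up labels.
flux_profiles = [
    (9.0, 1.0, 0.5), (8.5, 1.2, 0.4), (9.2, 0.8, 0.6),  # labelled "light"
    (2.0, 6.0, 3.0), (1.8, 6.5, 2.7), (2.3, 5.8, 3.2),  # labelled "dark"
]
labels = ["light"] * 3 + ["dark"] * 3

prediction = knn_predict(flux_profiles, labels, (8.8, 1.1, 0.5), k=3)
print(prediction)
```

No model is fitted here: all training examples are kept in memory and compared with each query, which is what distinguishes KNN from the other supervised methods in the table.
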
Fig. 4. Types of analyses combining CBM and ML. Fluxomics analysis consists of applying ML to the fluxomics data predicted by metabolic model simulations. In multi-omics analysis, omics data can be integrated within metabolic models to generate context-specific fluxomics data, which ML can analyse in combination with omics data from high-throughput technologies. Alternatively, ML can be trained with omics datasets to produce or improve metabolic models or fluxomics data.
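
The fluxomics analysis branch in Fig. 4 can be sketched as applying an unsupervised method such as PCA to a matrix of simulated fluxes. The minimal Python example below assumes the flux vectors have already been produced by a metabolic model; the values are invented, and the helper `first_principal_component` is a hypothetical name. It uses plain power iteration on the covariance matrix rather than any particular library routine.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of largest variance."""
    n, d = len(rows), len(rows[0])
    # Centre each flux (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    # Assumes the starting vector is not orthogonal to the leading component.
    vec = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # the data have no variance at all
            break
        vec = [x / norm for x in nxt]
    return vec


# Invented flux samples: flux 2 is always twice flux 1, flux 3 is constant.
flux_samples = [
    (1.0, 2.0, 0.5),
    (2.0, 4.0, 0.5),
    (3.0, 6.0, 0.5),
    (4.0, 8.0, 0.5),
]
pc1 = first_principal_component(flux_samples)
print([round(x, 3) for x in pc1])
```

In this toy case the leading component loads only on the two coupled fluxes, which is the kind of pattern the PCA-based studies look for in much larger flux matrices.
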
Hybrid studies combining ML and CBM approaches, including the CBM and ML components and the application.
| First Author | CBM | ML | Task |
|---|---|---|---|
| **Fluxomics analysis** | | | |
| Folch-Fortuny 2016 | EFMs | PCA (21–26 samples) | Identify metabolic patterns |
| Bhadra 2018 | EFMs | PCA (12–28 samples) | Identify responsive pathways |
| Folch-Fortuny 2018 | Dynamic EFMs | Discriminant analysis (64 samples) | Identify distinguishing metabolic patterns between conditions |
| Magnusdottir 2017 | FBA | Hierarchical clustering (298378 samples) | Explore ecological interactions |
| DiMucci 2018 | dFBA | RF (9900 samples) | Predict microbial interactions |
| Shaked 2016 | FVA, gene knockouts | Ensemble of SVMs (190–426 samples) | Predict drug side effects |
| Oyetunde 2019 | FBA | PCA, SVM, elastic net, RF, kNN, ANN, ensemble (1200 samples) | Estimate titer, production rate and yield of microbial factories |
| Czajka 2021 | FBA, gene knockouts, gene overexpression | RF, elastic net, kNN, Gaussian process regression, support vector regressors (2915 samples) | Predict |
| Schinn 2021 | Flux sampling | Linear regressions (80 samples) | Predict amino acid concentrations in CHO cell cultures |
| **Multi-omics analysis** | | | |
| Plaimas 2008 | FBA, gene KO | SVM (1356 samples) | Predict essential reactions |
| Nandi 2017 | Flux Coupled Analysis | SVM-RFE (768 samples) | Predict essential genes |
| Li 2010 | Condition-specific models | Kernel kNN (260 samples) | Predict new drug targets |
| Kim 2016 | Condition-specific models | RNN, LASSO regression, ensemble (649 samples) | Predict cross-omics states in |
| Culley 2020 | Strain-specific models | Support Vector Regressor, RF, ANNs, BEMKL, MMANN, ensemble (1143 samples) | Estimate yeast growth rate |
| Magazzù 2021 | Strain-specific models | Regularised linear models, ANNs, MMANN (1143 samples) | Estimate yeast growth rate |
| Lewis 2021 | Patient-specific models | Ensemble of gradient boosting machines (915 samples) | Identify biomarkers of radiation resistance |
| Guebila 2019 | Drug-specific models | SVMs, clustering (605 samples) | Predict gastrointestinal drug effects |
| Vijayakumar 2020 | Condition-specific models | PCA, k-means clustering, LASSO regularisation (24 samples) | Improve phenotypic prediction in cyanobacteria |
| Kavvas 2021 | MAC | MAC (375 samples) | Predict allele-specific antimicrobial resistance in |
| Guo 2017 | FBA, gene KO | Deep ANN (30000 samples) | Predict phenotypes (Deep Metabolism) |
| **Generation of CBM models and fluxomics** | | | |
| Wu 2016 | Stoichiometry | SVM, kNN, decision trees (450 samples) | Estimate metabolic fluxes |
| Brunk 2016 | FBA | PCA (126 samples) | Characterise strain variation |
| Bordbar 2017 | Random sampling | PCA, linear regression (22 samples) | Estimate metabolic fluxes in dynamic conditions |
| Nagaraja 2019 | | ANNs (121 samples) | Predict fluxes for the upper part of glycolysis |
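
Many of the CBM components listed above (FBA, FVA, dFBA) build on the same linear programme. As a reference sketch in standard textbook notation (the symbols $S$, $v$, $c$, $v^{\mathrm{lb}}$, and $v^{\mathrm{ub}}$ are not defined in the source and are introduced here only for illustration), FBA can be written as:

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

Here $S$ is the stoichiometric matrix (metabolites by reactions), $v$ is the vector of reaction fluxes, $c$ selects the objective (often a biomass reaction), and the bounds encode reversibility and uptake limits. The constraint $S\,v = 0$ is the steady-state mass balance.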