Tolutola Oyetunde1, Di Liu1, Hector Garcia Martin2,3,4,5, Yinjie J Tang1.
Abstract
Metabolic models can estimate intrinsic product yields for microbial factories, but such frameworks struggle to predict cell performance (including product titer or rate) under suboptimal metabolism and complex bioprocess conditions. Machine learning, on the other hand, is complementary to metabolic modeling but requires large amounts of data. Building such a database for metabolic engineering designs requires significant manpower and is prone to human error and bias. We propose an approach that integrates data-driven methods with a genome-scale metabolic model to assess microbial bio-production (yield, titer, and rate). Using engineered E. coli as an example, we manually extracted and curated a data set comprising about 1200 experimentally realized cell factories from ~100 papers. We further augmented the key design features (e.g., genetic modifications and bioprocess variables) extracted from the literature with additional features derived from simulations of the genome-scale metabolic model iML1515 under constraints that match the experimental data. Data augmentation and ensemble learning (e.g., support vector machines, gradient boosted trees, and neural networks in a stacked regressor model) are then employed to alleviate the challenges of sparse, non-standardized, and incomplete data sets, while multiple correspondence analysis and principal component analysis are used to rank the factors that influence bio-production. The hybrid framework demonstrates reasonably high cross-validation accuracy for predicting E. coli factory performance metrics under presumed bioprocess and pathway conditions (Pearson correlation coefficients between 0.8 and 0.93 on new data not seen by the model).
Year: 2019 PMID: 30645629 PMCID: PMC6333410 DOI: 10.1371/journal.pone.0210558
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Database curation and feature extraction methodology.
Fig 2. Feature additions via genome-scale model simulations and data augmentation based on case studies described in the literature.
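As a rough illustration of how simulation-derived features might be produced, the sketch below runs flux balance analysis as a linear program on a hypothetical one-metabolite toy network using `scipy.optimize.linprog`. It is not iML1515 and not the authors' code: the stoichiometry, the bounds, and the feature names `max_product_flux` and `product_per_substrate` are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (NOT iML1515): one internal metabolite A.
# Columns: v_uptake (-> A), v_biomass (A ->), v_product (A ->)
S = np.array([[1.0, -1.0, -1.0]])


def max_product_flux(uptake_limit, min_biomass_flux):
    """Maximize product flux at steady state (S @ v = 0) under flux bounds."""
    c = np.array([0.0, 0.0, -1.0])  # linprog minimizes, so negate the objective
    bounds = [
        (0.0, uptake_limit),       # substrate uptake capped by the experiment
        (min_biomass_flux, None),  # require a minimum biomass drain
        (0.0, None),               # product secretion is irreversible
    ]
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds,
                  method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.x


# Illustrative constraint values, not taken from any curated record.
fluxes = max_product_flux(uptake_limit=10.0, min_biomass_flux=2.0)
fba_features = {
    "max_product_flux": fluxes[2],
    "product_per_substrate": fluxes[2] / fluxes[0],
}
```

In a real workflow the uptake and growth bounds would be set from the experimental record, and the resulting fluxes appended to that record as extra columns.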
Metabolic engineering design factors template used for feature extraction.
Sample values are taken from [19]. Features that refer to a list of genes are entered as a vector of ones and zeros (categorical values), with one entry per gene. For example, in the sample values, ‘het_gene’ (whether the gene inserted/overexpressed was heterologous) is entered as 1,0,0, meaning alsS is heterologous while ilvC and ilvD are not. YE stands for yeast extract.
| No. | Feature | Description | Sample value | |
|---|---|---|---|---|
| 1 | cs1 | first carbon source | 1 | |
| 2 | cs1_mw | first carbon source molecular weight | 180 | |
| 3 | cs_conc1 | first carbon source concentration (mM) | 111 | |
| 4 | CS_C1 | mol C in first carbon source | 6 | |
| 5 | CS_H1 | mol H in first carbon source | 12 | |
| 6 | CS_O1 | mol O in first carbon source | 6 | |
| 7 | reactor_type | type of reactor (continuous, batch or fed-batch) | 1 | |
| 8 | rxt_volume | working volume of reactor (L) | 2 | |
| 9 | media | media used for fermentation (M9,AM1,AM2, M9+ yeast extract,LB,NBS,TB,other rich media) | YE | |
| 10 | temp | temperature of medium used for fermentation (°C) | 37 | |
| 11 | time | total time for fermentation | 36 | |
| 12 | oxygen | oxygen condition in reactor (aerobic, anaerobic, microaerobic,extra aerobic) | 2 | |
| 13 | sbg_ref | reference strain in the study | BFA7.001(DE3) PCT01 | |
| 14 | s_ref_gen | genes modified from the strain MG1655 | lacI, rrnB, lacZ, hsdR514, araBAD, rhaBAD, zwf, mdh, frdA, ndh, pta, poxB, ldhA,T7 RNA polymerase | |
| 15 | s_gen_mod | type of gene modification: insertion/deletion | 0,0,0,0,0,0,0,0,0,0,0,0,0,1 | |
| 16 | gene_mod | genes modified from reference strain of study | alsS, ilvC, ilvD | |
| 17 | gene_del | whether or not the gene was deleted | 0,0,0 | |
| 18 | gene_ovr | whether or not the gene was overexpressed | 1,1,1 | |
| 19 | het_gene | is the gene heterologous? (yes/no) | 1,0,0 | |
| 20 | rep_origin | plasmid copy numbers | 5,5,5 | |
| 21 | codon_opt | codon optimization? | 0,0,0 | |
| 22 | sen_reg | sensor regulator? | 0,0,0 | |
| 23 | enz_design | enzyme redesign evolution? | 0,0,0 | |
| 24 | protein_scaffold | protein scaffolding? | 0,0,0 | |
| 25 | dir_evo | directed evolution? | 0 | |
| 26 | Mod_path_opt | modular pathway optimization? | 0 | |
| 27 | prod_name | name of the product | Isobutanol | |
| 28 | no_C | mol C in product | 4 | |
| 29 | no_H | mol H in product | 10 | |
| 30 | no_O | mol O in product | 1 | |
| 31 | no_N | mol N in product | 0 | |
| 32 | mw | molecular weight of product | 74 | |
| 33 | precursor | precursor from central metabolism | 6 | |
| 34 | enz_steps | number of enzyme steps from precursor | 5 | |
| 35 | atp_cost | number of ATP molecules needed from precursor to product | 0 | |
| 36 | na_cost | number of NADH/NADPH molecules needed from precursor to product | 2 | |
| 37 | yield_1 | yield in g product/g carbon source fed | 0.0405 | |
| 38 | yield_2 | yield in g product/g carbon source consumed | NA | |
| 39 | yield_3 | yield in g product/g biomass | 0.623 | |
| 40 | titer | concentration of product in g/L | 0.81 | |
| 41 | rate | maximum productivity in g product/L/h | 0.0225 | |
| 42 | bio_titre | biomass concentration (g/L) | 1.3 | |
| 43 | bio_grw_rate | biomass growth rate in exponential phase (/h) | 0.45 | |
| 44 | gen_info | are all the genetic modifications in the paper fully captured by the above categories? (yes/no) | 1 | |
| 45 | env_info | are all the reactor conditions in the paper fully captured by the above categories? (yes/no) | 1 |
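The sample row in the table above can be read programmatically. The sketch below pairs the comma-separated 0/1 vectors with their gene names and recomputes three of the performance metrics from the other sample values. The formulas are assumptions inferred from the sample numbers, not definitions stated in the source; in particular, `time` is assumed to be in hours, and `rate` is taken as titer divided by total time even though the table describes it as maximum productivity.

```python
# Sample values copied from the template table above.
sample = {
    "cs1_mw": 180.0,    # g/mol
    "cs_conc1": 111.0,  # mM
    "time": 36.0,       # assumed to be hours
    "titer": 0.81,      # g/L
    "bio_titre": 1.3,   # g/L
    "gene_mod": "alsS, ilvC, ilvD",
    "het_gene": "1,0,0",
}


def parse_flags(genes, flags):
    """Pair each gene name with its 0/1 flag from the comma-separated vector."""
    names = [g.strip() for g in genes.split(",")]
    values = [int(f) for f in flags.split(",")]
    if len(names) != len(values):
        raise ValueError("gene list and flag vector differ in length")
    return dict(zip(names, values))


het = parse_flags(sample["gene_mod"], sample["het_gene"])

# Assumed relationships (inferred from the sample row, not stated in the source).
substrate_g_per_l = sample["cs_conc1"] / 1000.0 * sample["cs1_mw"]  # mM -> g/L
yield_1 = sample["titer"] / substrate_g_per_l   # g product / g carbon source fed
rate = sample["titer"] / sample["time"]         # g product / L / h
yield_3 = sample["titer"] / sample["bio_titre"] # g product / g biomass
```

Under these assumptions the recomputed values round to the tabulated 0.0405, 0.0225, and 0.623, which makes this kind of check a cheap way to catch transcription errors during curation.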
Fig 3. Summary of curated database showing distribution of titers (units in g/L) for 25 different products from the bacterium E. coli.
Fig 4. Comparison of production metrics (titer, rate, and yield).
The size of the dots corresponds to the rate values (in g/L/h, scaled between the minimum and maximum values of 0.000043 and 10.83 g/L/h, respectively). The molecular weight of each product (g/mol) is shown by the color gradient of the dots (color bar).
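The scaling described in this caption is presumably ordinary min-max normalization. A minimal sketch, assuming a linear mapping; the marker-size range passed to `dot_size` is a hypothetical choice, not a value from the figure:

```python
RATE_MIN = 0.000043  # g/L/h, minimum rate quoted in the caption
RATE_MAX = 10.83     # g/L/h, maximum rate quoted in the caption


def scale_rate(rate, lo=RATE_MIN, hi=RATE_MAX):
    """Map a rate onto [0, 1] by min-max normalization."""
    return (rate - lo) / (hi - lo)


def dot_size(rate, min_size=10.0, max_size=300.0):
    """Convert a rate to a marker size (size range is hypothetical)."""
    return min_size + scale_rate(rate) * (max_size - min_size)
```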
Fig 5. Inferring possible influential factors on metabolic engineering design performance.
A. First two principal components from multiple correspondence analysis (MCA). The labels correspond to titer values in g/L. The shaded area for each point shows the predicted region within which all points have a high probability of belonging to the specified titer range. B. Impact of different influential factors on the first two principal components from principal component analysis (PCA). The PCA plot is shown in S1 Fig in S2 File. Carbon sources 1, 2, and 3 are used to capture the cases in which more than one carbon source was used. If only one was used, the corresponding entries for carbon sources 2 and 3 were set to zero. E. coli MG1655 was taken as the reference strain, and all modifications made to obtain the background strain used in each study were captured as ‘background modifications’. The scores describe the relative contribution of each feature to the principal components.
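One common way to express the contribution of each feature to the principal components is through PCA loadings. The sketch below uses scikit-learn on synthetic data; the feature names are borrowed from the template table, but the values, the injected correlation, and the summed-absolute-loading score are assumptions for illustration and may differ from the scoring used for the figure. MCA, which targets categorical features, is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature names come from the template; the values are synthetic.
feature_names = ["cs_conc1", "temp", "time", "enz_steps"]
X = rng.normal(size=(200, len(feature_names)))
X[:, 2] += 2.0 * X[:, 0]  # inject a correlation so PC1 has structure

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(X))

# Loadings: rows are features, columns are PC1 and PC2.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Hypothetical ranking score: summed absolute loading over both components.
scores = np.abs(loadings).sum(axis=1)
ranking = [feature_names[i] for i in np.argsort(scores)[::-1]]
```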
Fig 6. Prediction of production metrics (titer, rate, and yield; TRY).
R²: coefficient of determination. The solid diagonal lines show where all points would fall for perfect prediction. A scaled version of Fig 6 is presented in S4 Fig in S2 File (enabling the fit to be visualized without the outlier effects). The data points are scaled by the maximum value (titer, rate, or yield) for the particular product in our curated database.
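The two fit statistics mentioned in this record (coefficient of determination and Pearson correlation) and the per-product scaling can be sketched with NumPy as follows. The product labels and numbers are made up for illustration and are not values from the curated database.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1]


def scale_by_product_max(values, products):
    """Divide each value by the maximum observed for its product."""
    values = np.asarray(values, dtype=float)
    products = np.asarray(products)
    scaled = np.empty_like(values)
    for p in np.unique(products):
        mask = products == p
        scaled[mask] = values[mask] / values[mask].max()
    return scaled


# Illustrative numbers only: two hypothetical products "A" and "B".
scaled = scale_by_product_max([2.0, 1.0, 40.0, 10.0], ["A", "A", "B", "B"])
```

Scaling within each product keeps a few very high-titer products from dominating the axes, which is the motivation the caption gives for the scaled figure.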
Fig 7. Model performance analyses.
A. Quantification of the effect of COBRA-based (Constraint-Based Reconstruction and Analysis) features on model performance. B. Comparison of the performance of individual machine learning models with that of the ensemble model. TS stands for test scores (R² values). CV stands for the best cross-validation accuracy (R² values). Higher scores imply a better fit.
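A comparison like the one in panel A can be set up by cross-validating the same model with and without the extra feature columns. The sketch below uses synthetic stand-ins for both feature groups and a single gradient boosted model rather than the full ensemble; the data, the target construction, and the helper `mean_cv_r2` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
X_lit = rng.normal(size=(n, 5))    # stand-in for literature-derived features
X_cobra = rng.normal(size=(n, 2))  # stand-in for model-derived features
# Synthetic target that depends on both feature groups by construction.
y = X_lit[:, 0] + 2.0 * X_cobra[:, 0] + rng.normal(scale=0.1, size=n)


def mean_cv_r2(X, y):
    """Mean 5-fold cross-validated R^2 for a gradient boosted regressor."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()


cv_without = mean_cv_r2(X_lit, y)
cv_with = mean_cv_r2(np.hstack([X_lit, X_cobra]), y)
```

Because the synthetic target is built to depend on the second feature group, the score with those columns is higher here by design; on real data the size of the gain is an empirical question.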
Fig 8. Titer learning curve as a function of the size of the training data set.
The training scores (R²) and cross-validation (CV) scores (also R²) are shown. Below 800 training examples, the variation in cross-validation accuracy was too large. The hybrid model fits the training data set (red points) well irrespective of the number of training examples, while the cross-validation scores improve only slightly with more data points. This implies that more feature engineering (and not necessarily more data) would be needed to significantly improve model performance.
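A learning curve of this kind can be computed with scikit-learn's `learning_curve`. The sketch below uses synthetic regression data and a single gradient boosted model as a stand-in for the hybrid model; the data set size, the training fractions, and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Synthetic data standing in for the curated titer records.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

sizes, train_scores, cv_scores = learning_curve(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=[0.2, 0.5, 1.0],  # fractions of the available training folds
    cv=5,
    scoring="r2",
)

# One row per training size, one column per CV fold.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
cv_std = cv_scores.std(axis=1)  # spread across folds at each size
```

A training curve that stays high while the CV curve flattens is the pattern the caption describes; `cv_std` gives a simple view of how much the CV score varies at small training sizes.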
Fig 9. Machine learning pipeline.
Ensemble learning using stacked regressors.
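A stacked regressor of the kind named in the abstract (support vector machine, gradient boosted trees, and a neural network as base learners) can be sketched with scikit-learn's `StackingRegressor`. This is not the authors' pipeline: the synthetic data, all hyperparameters, and the choice of `RidgeCV` as the meta-learner are assumptions, since the record does not specify them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for the curated feature matrix and a TRY target.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Base learners; SVR and the MLP are scaled because both are scale-sensitive.
base_learners = [
    ("svr", make_pipeline(StandardScaler(), SVR(C=10.0))),
    ("gbt", GradientBoostingRegressor(n_estimators=100, random_state=0)),
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    )),
]

# The meta-learner (assumed RidgeCV) is trained on out-of-fold predictions.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=RidgeCV(), cv=5)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
test_r2 = stack.score(X_test, y_test)           # R^2 on held-out data
pearson = np.corrcoef(y_test, y_pred)[0, 1]     # Pearson r on held-out data
```

Training the meta-learner on out-of-fold predictions limits leakage from the base learners' training fit into the final combination.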