Hai-Feng Yang1, Xiao-Nan Zhang2, Yan Li3, Yong-Hong Zhang4, Qin Xu5, Dong-Qing Wei1.
Abstract
With the rapid growth of microorganism metabolic networks, accurately acquiring the intracellular concentrations of microbial metabolites in large batches is critical to the development of metabolic engineering and synthetic biology. As a complement to experimental methods, computational methods have been used as effective tools for assessing intracellular metabolite concentrations. In this study, a dataset of 130 metabolites from E. coli and S. cerevisiae with available experimental concentrations was used to develop an SVM model of the negative logarithm of the concentration (-logC). In addition to common molecular property descriptors, this statistical model includes two special types of descriptors: metabolic network topological descriptors and metabolic pathway descriptors. All 1997 descriptors were finally reduced to 14 by variable selection, including a genetic algorithm (GA). The model was evaluated by internal validation with 10-fold and leave-one-out (LOO) cross-validation, and by external validation through prediction of the -logC values of the test set. The developed SVM model is robust and has strong predictive potential (n = 91, m = 14, R2 = 0.744, RMSE = 0.730, Q2 = 0.57; R2p = 0.59, RMSEp = 0.702, Q2p = 0.58). This analysis could provide an effective tool for the large-batch prediction of intracellular metabolite concentrations in microorganisms.
Year: 2017 PMID: 28831069 PMCID: PMC5567373 DOI: 10.1038/s41598-017-08793-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The variable selection.
| Round | Number of variables (m) | 10-fold RMSE | 10-fold Q2 | LOO RMSE | LOO Q2 |
|---|---|---|---|---|---|
| Initial | 1669 | 1.120 | 0 | 1.103 | 0.02 |
| 1 | 697 | 1.053 | 0.13 | 1.050 | 0.13 |
| 2 | 258 | 0.973 | 0.23 | 0.960 | 0.25 |
| 3 | 70 | 0.912 | 0.33 | 0.898 | 0.35 |
| 4 | 18 | 0.771 | 0.52 | 0.769 | 0.52 |
| 5 | 14 | 0.741 | 0.55 | 0.730 | 0.57 |
| 6 | 13 | 0.772 | 0.52 | 0.758 | 0.54 |
| 7 | 12 | 0.786 | 0.50 | 0.756 | 0.54 |
| 8 | 11 | 0.780 | 0.51 | 0.751 | 0.55 |
| 9 | 10 | 0.853 | 0.41 | 0.828 | 0.45 |
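The RMSE and Q2 columns in the variable selection table come from cross-validated predictions. The sketch below shows one way such values could be computed under leave-one-out cross-validation. It is an illustration only: a one-descriptor least-squares line stands in for the SVM, the toy data are hypothetical, and the definition Q2 = 1 - PRESS / SS_tot is an assumption, since the record does not state which formula the authors used.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x (a stand-in for the SVM)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds


def rmse(ys, preds):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def q2(ys, preds):
    """Assumed definition: Q2 = 1 - PRESS / SS_tot about the observed mean."""
    mean_y = sum(ys) / len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Hypothetical toy data: one descriptor value and one -logC value per metabolite.
descriptor = [-4.5, -3.4, -2.7, -2.0, -1.5, -0.5, -0.2]
neg_log_c = [2.0, 2.4, 2.2, 2.7, 2.8, 3.2, 3.9]

cv_preds = loo_predictions(descriptor, neg_log_c)
print(f"LOO RMSE = {rmse(neg_log_c, cv_preds):.3f}, Q2 = {q2(neg_log_c, cv_preds):.2f}")
```

A 10-fold variant would follow the same pattern, holding out one tenth of the samples at a time instead of a single sample.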
The selected 14 variables of the optimal variable set.
| Name | Type | Description |
|---|---|---|
| Clustering-Coefficient | Topological parameter | Clustering coefficients of nodes |
| Degree | Topological parameter | Degree of nodes |
| BCUT_SLOGP_2 | Molecular descriptor | LogP BCUT (2/3) |
| BCUT_SMR_3 | Molecular descriptor | Molar refractivity BCUT (3/3) |
| GCUT_PEOE_1 | Molecular descriptor | PEOE charge GCUT (1/3) |
| SlogP_VSA9 | Molecular descriptor | Bin 9 SlogP (0.40, 10] |
| PEOE_VSA+0 | Molecular descriptor | Total positive 0 vdw surface area |
| PEOE_VSA+5 | Molecular descriptor | Total positive 5 vdw surface area |
| Vsa_hyd | Molecular descriptor | VDW hydrophobe surface area |
| Opr_nring | Molecular descriptor | Oprea ring count |
| 6mem_rings_molecules | Molecular descriptor | Number of 6 membered rings |
| RPCG | Molecular descriptor | Ratio of most positive charge on sum total positive charge (Relative positive charge) |
| ClogP | Molecular descriptor | Partition coefficient octanol/water |
| MPF descriptor | Metabolic pathway | Five Metabolic Pathways’ Features descriptor |
Correlation coefficients between selected variables and -logC.
| Variables | correlation coefficients |
|---|---|
| BCUT_SLOGP_2 | 0.446 |
| MPF | −0.437 |
| Degree | −0.325 |
| 6mem rings Molecules | 0.296 |
| opr_nring | 0.296 |
| ClogP | 0.267 |
| GCUT_PEOE_1 | 0.235 |
| Clustering Coefficient | −0.124 |
| vsa_hyd | 0.099 |
| RPCG | −0.091 |
| PEOE_VSA+0 | −0.075 |
| PEOE_VSA+5 | −0.062 |
| SlogP_VSA9 | −0.035 |
| BCUT_SMR_3 | −0.024 |
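As an illustration of how a coefficient like those above could be obtained, the sketch below computes a Pearson correlation between one descriptor and -logC. That the table reports Pearson coefficients is an assumption; the record does not name the statistic. The input values are the first -logCe column and the CLogP column listed for the MPF metabolites in this record (rows with NA omitted), so the result describes only that subset, not the 130-metabolite dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# CLogP and first-column -logCe values for the 13 MPF metabolites without NA.
clogp = [-2.69, -4.55, -2.41, -3.38, -2.00, -1.52, -3.54,
         -0.53, -3.94, -0.17, -5.08, -3.12, -5.53]
neg_log_ce = [1.02, 2.02, 2.37, 2.42, 2.71, 2.77, 3.22,
              3.24, 3.63, 3.94, 3.74, 2.59, 2.31]

r = pearson_r(clogp, neg_log_ce)
print(f"Pearson r between CLogP and -logCe (13 metabolites): {r:.3f}")
```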
Figure 1. Correlation between metabolite concentration and Clustering-Coefficient in (a) E. coli; and (b) S. cerevisiae.
Figure 2. Correlations between concentration and CLogP of 130 metabolites (R2 = 0.838).
The deviation of the concentrations and polarities of the 14 metabolites in the five pathways included in the MPF descriptor.
| Name | -logCe (E. coli) | -logCe (S. cerevisiae) | CLogP |
|---|---|---|---|
| Glutamate | 1.02 | 1.09 | −2.69 |
| ATP | 2.02 | 2.47 | −4.55 |
| Aspartate | 2.37 | 1.80 | −2.41 |
| Glutamine | 2.42 | 1.09 | −3.38 |
| Citrate | 2.71 | 2.83 | −2.00 |
| Malate | 2.77 | 2.77 | −1.52 |
| Acetyl-CoA | 3.22 | NA | −3.54 |
| Succinate | 3.24 | 3.47 | −0.53 |
| Succinyl-CoA | 3.63 | NA | −3.94 |
| Fumarate | 3.94 | 2.78 | −0.17 |
| S-adenosyl-L-methionine | 3.74 | NA | −5.08 |
| Alanine | 2.59 | 1.61 | −3.12 |
| GTP | 2.31 | 3.23 | −5.53 |
| D-Glucose 6-phosphate | NA | 2.43 | −3.28 |
| Average of 14 above metabolites | | | |
| Average of 130 metabolites | | | |
-logCe is the negative logarithm of the corresponding experimental concentration in E. coli and S. cerevisiae. NA means no data available.
Predictive performance among internal and external validation.
| Model | Training samples (n) | Training R2 | 10-fold RMSE | 10-fold Q2 | LOO RMSE | LOO Q2 | Test R2p (n = 39) | Test RMSEp | Test Q2p |
|---|---|---|---|---|---|---|---|---|---|
| Zhu’s model | 80 | 0.683 | 0.729 | 0.59 | | | | | |
| Bar-Even | 60 | 0.43 | 0.43 | | | | | | |
| This model | 91 | 0.744 | 0.741 | 0.55 | 0.730 | 0.57 | 0.586 | 0.702 | 0.58 |
Figure 3. Plot of the -logC values predicted by the SVM model (-logCp) vs. those observed (-logCe).
Figure 4. Williams plot of standardized residual versus leverage.
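A Williams plot places each sample by its leverage (x-axis) and standardized residual (y-axis). The sketch below computes both quantities for a hypothetical one-descriptor case, using the closed-form leverage of a straight-line model with intercept. The data, the residual scaling (division by the root-mean-square residual), the warning leverage h* = 3(p + 1)/n and the ±3 residual limit are all assumptions based on common QSAR practice, not details taken from the record.

```python
import math


def leverages(xs):
    """Leverage for a straight-line model with intercept:
    h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]


def standardized_residuals(observed, predicted):
    """Residuals divided by their root-mean-square value (assumed scaling)."""
    res = [o - p for o, p in zip(observed, predicted)]
    rms = math.sqrt(sum(r * r for r in res) / len(res))
    return [r / rms for r in res]


# Hypothetical descriptor values, observed -logC and model-predicted -logC.
descriptor = [-4.5, -3.4, -2.7, -2.0, -1.5, -0.5, -0.2]
observed = [2.0, 2.4, 2.2, 2.7, 2.8, 3.2, 3.9]
predicted = [1.9, 2.3, 2.5, 2.7, 2.9, 3.2, 3.3]

h = leverages(descriptor)
z = standardized_residuals(observed, predicted)

p = 1                                    # number of descriptors in this sketch
h_star = 3 * (p + 1) / len(descriptor)   # assumed warning leverage

for hi, zi in zip(h, z):
    flag = "check" if hi > h_star or abs(zi) > 3 else "ok"
    print(f"leverage={hi:.3f}  std. residual={zi:+.2f}  {flag}")
```

For a model with many descriptors, the leverages would instead be the diagonal of the hat matrix built from the full descriptor matrix.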
Comparison of performances between the non-overlap and the random strategy.
| Separation strategy | Training R2 | 10-fold RMSE | 10-fold Q2 | LOO RMSE | LOO Q2 | Test R2p | Test RMSEp | Test Q2p |
|---|---|---|---|---|---|---|---|---|
| non-overlap | 0.77 | 0.72 | 0.57 | 0.71 | 0.59 | 0.55 | 0.74 | 0.54 |
| random | 0.74 | 0.74 | 0.55 | 0.73 | 0.57 | 0.59 | 0.70 | 0.58 |
Figure 5. Flow chart of the model development.