Literature DB >> 32030085

iMTBGO: An Algorithm for Integrating Metabolic Networks with Transcriptomes Based on Gene Ontology Analysis.

Abstract

BACKGROUND: Constraint-based metabolic network models have been widely used in pheno-typic prediction and metabolic engineering design. In recent years, researchers have attempted to im-prove prediction accuracy by integrating regulatory information and multiple types of "omics" data into this constraint-based model. The transcriptome is the most commonly used data type in integration, and a large number of FBA (flux balance analysis)-based integrated algorithms have been developed. METHODS AND
RESULTS: We mapped the Kcat values to the tree structure of GO terms and found that the Kcat values under the same GO term have a higher similarity. Based on this observation, we developed a new method, called iMTBGO, to predict metabolic flux distributions by constraining reaction bounda-ries based on gene expression ratios normalized by marker genes under the same GO term. We applied this method to previously published data and compared the prediction results with other metabolic flux analysis methods which also utilize gene expression data. The prediction errors of iMTBGO for both growth rates and fluxes in the central metabolic pathways were smaller than those of earlier published methods.
CONCLUSION: Considering the fact that reaction rates are not only determined by genes/expression levels, but also by the specific activities of enzymes, the iMTBGO method allows us to make more precise pre-dictions of metabolic fluxes by using expression values normalized based on GO.

Entities: Chemical

Keywords: Transcriptome; constraint-based model; flux balance analysis; gene ontology; metabolic network; turnover number

Year: 2019 PMID： 32030085 PMCID： PMC6983954 DOI： 10.2174/1389202920666190626155130

Source DB: PubMed Journal: Curr Genomics ISSN： 1389-2029 Impact factor: 2.236

INTRODUCTION

Flux balance analysis is a constraint-based method that uses the metabolic network and an objective function to obtain the optimal flux distribution [1]. It has been successfully used to predict the metabolic phenotypes of gene knockout strains [2-4], as well as growth rates and metabolite production levels [5-7]. In recent years, many researchers have tried to integrate multiple types of “omics” data to improve the predictive power of metabolic models or to define some new restrictive boundaries to narrow the feasible solution space. In these approaches, the integration of data on regulation or “omics” data is the most common means. The transcriptome is the most commonly used data type in integration, and a large number of FBA-based integrated algorithms have been developed. Transcriptome has been integrated into constraint-based models in three ways. Firstly, this has been done by setting an apriori threshold to judge the gene expression state [8, 9]. For example, GIMME [8] directly removes the reactions for which the expression values are lower than the pre-set threshold. Secondly, one can calculate the relative expression thresholds to determine the gene expression status. This approach usually classifies the reactions into high or low expression and treats these as part of additional constraints or an objective function [10-13]. For the classification of high- and low expressions, there are generally two cases. One involves ranking the expression value. For example, EXAMO [11] ranks the expression levels of all genes from high to low, with the top 15% being defined as highly expressed genes and the last 15% being defined as the low expressing genes. Then, these high expression reactions and low expression reactions are treated as a new objective function. Another approach is to determine the state of gene expression by differential expression analysis of expression profiles, such as MADE [12] and Adam [13]. The last approach involves directly introducing transcriptome data into the objective function [14-17] or the constraints [18]. Several of these methods result in a reaction with a low expression level but a high enzyme turnover number (Kcat) being filtered (e.g. GIMME and iMAT) or strongly constrained (e.g. E-Flux and GX-FBA), and currently, there is no method that considers the enzymes turnover numbers. Gene Ontology [19, 20] is the most comprehensive and widely used knowledge base on gene function established by the Gene Ontology Consortium, which covers three different aspects of gene function: molecular function (MF), cellular components (CC), and biological processes (BP). In addition, gene ontology gives the EC number corresponding to the GO term. By including the Kcat information corresponding to the EC number given by the Brenda database [21], we found that GO terms with similar functions have a smaller difference in Kcat values, and Kcat is also relatively close in the same GO term. Based on this observation, we developed the gene ontology-based flux analysis method (iMTBGO) that integrates transcriptome data with metabolic network analysis to make more accurate predictions of metabolic fluxes. This method first identified some marker genes with a stable expression from the published large-scale E. coli expression profiles. Subsequently, the expression value of the marker gene in each term was used to modify the upper bounds of each reaction. We used different experimental data (e.g. overexpression or knockout) to evaluate the predictive ability of iMTBGO and found that the flux predictions of growth rate and the central metabolic pathways were stable and maintained a low level of error (Fig. ).

MATERIALS AND METHODS

Distribution of Kcat Values at Each Level of GO Terms

The Kcat values corresponding to all EC numbers were downloaded and extracted from the Brenda database, and the relationship between the GO and EC number was obtained from the Gene Ontology database. Kcat values were assigned to each GO term based on the GO-EC and EC- Kcat relationships. GO terms were organized hierarchically at different levels in the Gene Ontology database. The three root terms (‘molecule function’, ‘cellular component’ and ‘biological process’) are at level 0, their direct child GO terms are at level 1, and so on. Child terms have more specialized function descriptions than their parent terms [19]. A parent GO term contains Kcat values for all its child GO terms. We mainly focused on the GO terms under ‘molecule function’ (MF), as most EC numbers are assigned to these terms. We used Eq. 1 to calculate the standard deviation of Kcat values for each GO term (containing two or more Kcat values). (1) Where i represents the i-th GO term in a level, represents the j-th Kcat value in i-th GO term, represents the mean of the Kcat values in i-th GO term, and N is the number of Kcat values. To test if GO terms with more specific functions (higher level numbers) have similar Kcat values, we calculated the average standard deviations for all GO terms in the respective same levels and compared the values at different levels. To test if genes under the same GO term also have similar Kcat values, we also shuffled the GO-Kcat pairs to simulate the discrete distribution of Kcat values when a GO term contained Kcat values from other GO terms (1000 times), and ran a t-test statistic with real a Kcat distribution.

The iMTBGO Algorithm

iMTBGO uses the expression ratio obtained from a specific expression profile to correct the upper and lower boundary of each reaction (Fig. dotted line rectangle). When we know the substrate absorption rate (such as glucose absorption rate) or the concentration of an intermediate metabolite under specific conditions, we can assess the true flux corresponding to each reaction via the formula . Finally, the optimization problem is converted into the following form, shown in Eqs. 2-8. (2) (3) (4) (5) (6) (7) (8) where is the glucose absorption rate calculated by iMTBGO, is the true glucose absorption rate (e.g. experimentally-measured phenotype data from Holm et al.), is the flux of the j-th reaction predicted by iMTBGO that needs to be corrected, is the corrected flux of the j-th reaction, is the maximum boundary of calculated from the expression data, and represents the expression value of the marker genes in each GO term.

Determination of Marker Genes Using Gene Expression Profiles in EcoMAC

We directly used the expression profiles in EcoMAC, derived from the literature [22] and standardized, to determine the marker genes. We used the DAVID [23, 24] annotation information to obtain the GO term information for each gene in EcoMAC. Then, we mapped all of the gene ids in EcoMAC to Ecogene [25] and removed the unique ids from EcoMAC. If the different ids in the EcoMAC matrix corresponded to the same id in Ecogene, they were replaced with the gene id in the Ecogene, and the expression value was the mean of these two ids. Furthermore, we selected the mean expression value of the wild-type strains during the exponential phase in the M9 medium as the control group, and other expression data in EcoMAC as the case group to run the differential expression analysis. We defined genes with fold changes of more than two times and the number of samples exceeded half of the total as unstably expressed genes and others as stably expressed genes. We chose the coefficient of variation (Eq. 9) to obtain the discrete distribution of the genes’ expression and finally selected candidates among the stably expressed genes and CV less than specific values (e.g. 0.175 in this research) as potential marker genes. The detailed flowchart is shown in Fig. , of a solid rectangle. (9) where i represents the i-th gene in the profiles, represents the expression value of the i-th gene in the j-th profile, represents the mean of the expression values of the i-th gene, and P is the number of expression profiles. Based on the GO-gene relationship obtained from the DAVID database, we mapped potential marker genes onto individual GO terms. Since some GO terms did not have a marker gene, we relied on the tree structure of GO terms in gene ontology to develop the following standardized rules for different types of terms (Fig. dashed rectangle). First, when a term had a number of potential marker genes, we selected the gene with the smallest CV value among these potential marker genes. Second, for terms without a maker, a single set was formed, and the gene with the smallest CV value in the set was used as the marker.

Determination of Expression Ratios Using an Expression Profile Under Specific Conditions

It is known from the Michaelis-Menten equation that V. When the substrate concentration is saturated, . In the same GO term, Kcat is relatively close, and we can reasonably assume that , where marker represents the marker gene in GO term, i represents the i-th gene in the same GO term, represents the maximum flux of i-th gene, and represents the expression value of i-th gene. If we split the entire metabolic network into multiple GO term combinations and assume that =1 in each combination for computational simplicity, then , that is . Therefore, we used the standardized rules of each term to obtain the expression ratio of genes under specific experimental conditions. If an enzyme consists of multiple subunits encoded by different genes, the minimum value of these genes was used as the expression ratio of the reaction (Eq. 10). If a reaction can be independently catalyzed by multiple enzymes, the corresponding expression ratio for this enzyme is the sum of expression ratios of the subunits, as shown in Eq. 11. (10) (11) where marker (Termi) indicates the expression value of the marker gene in the term containing the i-th gene. If an enzyme consists of multiple subunits, gi represents the expression value of the gene encoding the i-th subunit, and otherwise represents the expression value of the gene encoding the i-th enzyme.

Evaluation of the Predictive Accuracy of iMTBGO

Microarray and experimentally-measured phenotype data for E. coli from Holm et al. [26] (wild-type, overexpressed ATPase, and overexpressed NADH oxidase) and Ishii et al. (two wild-type strains: WT0.5, WT0.7, and five mutant strains: , , , and ) [27] were used to solve the metabolic fluxes under specific conditions. For the datasets of Holm et al., the raw data (GSE20374) was first read into R using the affy package. Then, the RMA algorithm was applied for a set of replicates for background correction and normalization. Since the data of Holm et al. involved three replicates, we used the mean of the three sets of experiments as the expression value. Because the datasets provided by Ishii et al. did not give the corresponding GEO id, so we directly downloaded the chip signal file from the literature, and took the logarithm of the “normalized intensity of sample” column as the expression value. Since the datasets of Ishii and Holm [26, 27] provided experimentally-measured phenotype data of the central metabolic pathways, we used the growth rate error [28] (Eq. 12) and flux error measurement [29] (Eq. 13) to evaluate the prediction. (12) (13) where and represent the predicted growth rate and the experimentally measured growth rate, respectively. n is the number of experiments, is the flux of the i-th reaction in the central metabolic pathways and is the measured flux of i-th reaction. We also carried out sensitivity analysis to determine the optimal CV value by changing the range of CV values according to the literature [29]. We then performed robustness analysis by using the optimal CV value to determine the sensitivity of fluxes to gene expression levels. In the robustness analysis, we shuffled each column of the expression matrix (x) to determine a random matrix (r), and then we obtained the noisy expression matrix (y) according to Eq. 14, as published previously [29]. The value of λ was consistent with the literature [28], which set the values of 0, 0.25, 0.5, 0.75, and 1 for λ. (14) We obtained the complete codes of E-Flux, RELATCH [15], iMAT [10], GIMME, and GX-FBA [16] from the literature [29]. In all simulation experiments, the genome-scale metabolic model iJO1366 [30] was used in conjunction with the COBRA toolbox 2.0 in MATLAB software to solve the optimal solution by glpk/qpng, in which the objective function is the maximization of biomass. For the Ishii datasets, the oxygen uptake rate was also used as a constraint, which was set as the measured value when running the simulation.

RESULTS

The Distribution of Kcat in the GO Terms

We obtained 4178 pairs of GO-EC numbers from the Gene Ontology database, including 4150 GO terms and 3950 EC numbers (Supplementary file). In addition, we obtained 59615 pairs of EC-Kcat information with 2840 EC numbers from the Brenda database (Supplementary file). We mapped the corresponding Kcat to each GO term through a common EC number, and finally obtained 44118 pairs of GO-Kcat data including 2043 GO terms (Supplementary file). We also used child nodes to assign Kcat values to the terms without Kcat values level by level, and re-updated 463 GO terms. Only 413 pairs of GO-Kcat records belonged to CC (4 terms) and BP (120 terms), so these two types of terms were not considered in the subsequent analysis. We then used the standard deviation to evaluate the distribution of Kcat at each level of MF and found that except for levels 10 and 6, the larger the level, the smaller the mean value of the standard deviation (Table ). It can be inferred that GO terms with similar functions have a smaller difference in Kcat values. Furthermore, we also found that the standard deviation of the GO term of real Kcat was significantly smaller than that of the simulation (nearly 90% of the simulations when p<0.001). When the p-value was set to 0.05, the standard deviation of the real Kcat was smaller than that of all simulations, which indicated that Kcat values in the same GO term were relatively similar.

Determining Marker Genes and Standardization Rules

We found that GO annotation information for 126 genes of 4190 genes in EcoMAC could not be obtained using the DAVID database, 35 of the 126 genes were only found in EcoMAC, and the remaining 91 genes used outdated gene ids. After excluding these, we obtained a final set of 4106 genes. We carried out sensitivity analysis to find the optimal value of CV to determine the marker genes and standardization rules. The results showed that the average value of the growth error changes with the change of CV, as well as that the variation is monotonic with respect to the CV value, and the average error changes little with the change of CV (Fig. S1). When considering the fluxes in the central metabolic pathways, we found that the optimal values of CV were 0.175 and 0.125 in the datasets of the Holm et al. and Ishii et al. respectively. Furthermore, 0.175 was also a suboptimal value for the dataset of Ishii et al. (Fig. ). When considering the ratio of the number of genes in a GO term to the total number of genes (Fig. S2 a) and the proportion of terms containing potential marker genes to the total numbers of terms in the different CVs (Fig. S2 b), we found that the proportion occupied by CV at 0.175 was basically saturated. Based on these findings, we finally selected the genes with CV less than 0.175, which resulted in 1083 potential marker genes. We obtained 21411 pairs of GO-gene information from the DAVID database, including 3675 GO terms and 3815 genes (Supplementary file). Using the GO-gene relationship, we mapped these potential marker genes onto individual GO terms of MF, encompassing 1705 GO terms and 1235 genes are annotated into these GO terms (Supplementary file). We used the gene with the smallest CV value as the final marker gene with multiple potential markers, and finally determined 1209 unique marker genes. For terms without a marker, we used b3033 (yqiB) as the marker gene, since it had the lowest CV value among these terms. We found that the accuracy of iMTBGO for growth rate prediction was the highest under different experimental conditions among several methods for the datasets of Holm et al. (Fig. ), but did not perform the same for the datasets of Ishii et al. (Fig. ), which may have a certain relationship to the use and omission of the optimal CV. For fluxes in the central metabolic pathways, iMTBGO still maintained lower flux errors than other methods (Fig. , ). It was found to be superior to several other methods for integrating the transcriptome data, but not superior to pFBA, especially for the datasets of Ishii et al. (Fig. ). In addition, iMAT and RELACTH always remained unchanged for growth rate prediction (Fig. , ) and GX-FBA was not able to calculate the fluxes for partial expression profiles (Fig. , ). Compared with these methods, except for the way of integrating transcriptome data, the biggest difference is that the iMTBGO considers the enzymes turnover numbers. By analyzing GO terms, the genes stably expressed in the same GO term are used to standardize the genes in the term, eliminating the constraint error caused by low enzyme expression in conjunction with a high turnover number, and the flux is only related to the enzyme expression level. This is why the iMTBGO method is superior to several other methods that integrate the transcriptome data. As the central metabolic pathways are more heavily regulated at post-transcriptional levels [31-33], transcript levels are not suitable to represent the final enzyme activity levels, which leads to problems in the upper constraints of fluxes in the central metabolic pathways, as observed in our method and finally results in incorrect prediction compared with pFBA. Subsequently, robustness analysis was carried out to evaluate the effects of noise in the expression data, and we found that iMTBGO to be robust due to a relatively smooth increase in the error with an increase in the noise level for the datasets of Holm et al. and Ishii et al. (Fig. S3). However, it was also evident that the error of several noise points was reduced, which indirectly indicated that iMTBGO was dependent on the expression profiles, and it was necessary to provide more accurate transcriptomic data or directly provide proteomic data if possible. In conclusion, this GO-based metabolic flux analysis method is relatively reliable and can be used to estimate flux changes in metabolic networks with comparatively good accuracy.

CONCLUSION

We developed the new method iMTBGO that integrates metabolic network and transcriptome data. This method first introduces Gene Ontology information into metabolic network analysis by obtaining the expression ratios of the enzymes by normalizing the marker genes in the same GO term. iMTBGO maintained higher predictive accuracy in some experiments and always maintained a lower level of flux errors in the central metabolic pathways than other tested methods. The iMTBGO method provides a feasible process for integrating metabolic networks with omics datasets, which offers new broader ideas for using expression profiles. Although there are still some problems when the method integrates metabolic networks with transcriptome data, it cannot be denied that this integrated analysis can improve the prediction process under certain experimental conditions.

Table 1

The distribution of Kcat values according to GO levels.

GO Level	Mean of the Standard Deviation of K_cat	Numbers of GO Terms with K_cat
0	959774.44	1
1	242600.75	4
2	147651.83	24
3	43151.06	129
4	16857.15	369
5	12258.59	1181
6	19380.66	364
7	11525.28	120
8	387.64	24
9	6.25	5
10	20.78	2

33 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes.

Authors: Stephen S Fong; Bernhard Ø Palsson
Journal: Nat Genet Date: 2004-09-26 Impact factor: 38.330

3. Multiple high-throughput analyses monitor the response of E. coli to perturbations.

Authors: Nobuyoshi Ishii; Kenji Nakahigashi; Tomoya Baba; Martin Robert; Tomoyoshi Soga; Akio Kanai; Takashi Hirasawa; Miki Naba; Kenta Hirai; Aminul Hoque; Pei Yee Ho; Yuji Kakazu; Kaori Sugawara; Saori Igarashi; Satoshi Harada; Takeshi Masuda; Naoyuki Sugiyama; Takashi Togashi; Miki Hasegawa; Yuki Takai; Katsuyuki Yugi; Kazuharu Arakawa; Nayuta Iwata; Yoshihiro Toya; Yoichi Nakayama; Takaaki Nishioka; Kazuyuki Shimizu; Hirotada Mori; Masaru Tomita
Journal: Science Date: 2007-03-22 Impact factor: 47.728

4. Role of transcriptional regulation in controlling fluxes in central carbon metabolism of Saccharomyces cerevisiae. A chemostat culture study.

Authors: Pascale Daran-Lapujade; Mickel L A Jansen; Jean-Marc Daran; Walter van Gulik; Johannes H de Winde; Jack T Pronk
Journal: J Biol Chem Date: 2003-11-20 Impact factor: 5.157

5. Improving metabolic flux predictions using absolute gene expression data.

Authors: Dave Lee; Kieran Smallbone; Warwick B Dunn; Ettore Murabito; Catherine L Winder; Douglas B Kell; Pedro Mendes; Neil Swainston
Journal: BMC Syst Biol Date: 2012-06-19

6. BRENDA in 2017: new perspectives and new tools in BRENDA.

Authors: Sandra Placzek; Ida Schomburg; Antje Chang; Lisa Jeske; Marcus Ulbrich; Jana Tillack; Dietmar Schomburg
Journal: Nucleic Acids Res Date: 2016-10-19 Impact factor: 16.971

7. The fluxes through glycolytic enzymes in Saccharomyces cerevisiae are predominantly regulated at posttranscriptional levels.

Authors: Pascale Daran-Lapujade; Sergio Rossell; Walter M van Gulik; Marijke A H Luttik; Marco J L de Groot; Monique Slijper; Albert J R Heck; Jean-Marc Daran; Johannes H de Winde; Hans V Westerhoff; Jack T Pronk; Barbara M Bakker
Journal: Proc Natl Acad Sci U S A Date: 2007-09-26 Impact factor: 11.205

8. Inferring metabolic states in uncharacterized environments using gene-expression measurements.

Authors: Sergio Rossell; Martijn A Huynen; Richard A Notebaart
Journal: PLoS Comput Biol Date: 2013-03-21 Impact factor: 4.475

9. Integration of time-resolved transcriptomics data with flux-based methods reveals stress-induced metabolic adaptation in Escherichia coli.

Authors: Nadine Töpfer; Szymon Jozefczuk; Zoran Nikoloski
Journal: BMC Syst Biol Date: 2012-11-30

10. GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism.

Authors: Dany J V Beste; Tracy Hooper; Graham Stewart; Bhushan Bonde; Claudio Avignone-Rossa; Michael E Bushell; Paul Wheeler; Steffen Klamt; Andrzej M Kierzek; Johnjoe McFadden
Journal: Genome Biol Date: 2007 Impact factor: 13.583