Literature DB >> 21685093

Detection and interpretation of metabolite-transcript coresponses using combined profiling data.

Abstract

MOTIVATION: Studying the interplay between gene expression and metabolite levels can yield important information on the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments represents a systematic way to gain such information. Several combined profiling datasets have been added to the public domain and they form a valuable resource for hypothesis generating studies. Unfortunately, detecting coresponses between transcript levels and metabolite abundances is non-trivial: they cannot be assumed to overlap directly with underlying biochemical pathways and they may be subject to time delays and obscured by considerable noise.
RESULTS: Our aim was to predict pathway comemberships between metabolites and genes based on their coresponses to applied stress. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model-based similarity outperforms the simpler Pearson correlation but performs comparably or worse in their absence. Therefore, we propose a supervised method that applies pathway information to summarize similarity statistics to a consensus statistic that is more informative than any of the single measures. Using four combined profiling datasets, we show that comembership between metabolites and genes can be predicted for numerous KEGG pathways; this opens opportunities for the detection of transcriptionally regulated pathways and novel metabolically related genes. AVAILABILITY: A command-line software tool is available at http://www.cin.ufpe.br/~igcf/Metabolites. CONTACT: henning@psc.riken.jp; igcf@cin.ufpe.br

Entities: Chemical Disease Gene Species

Mesh：

Year: 2011 PMID： 21685093 PMCID： PMC3117345 DOI： 10.1093/bioinformatics/btr231

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Changes in metabolite abundances are the first molecular events that follow shifts in an organism's environment. As the steady state of metabolism is broken, gene expression changes to reestablish homeostatic control and to adapt the system to the altered living conditions. This interlocking regulation between gene expression and metabolite levels manifests in intricate patterns of coresponses where metabolite and transcript levels alternately vary consistently with the behavior of both regulators and targets (Kresnowati ). Therefore, the elucidation of metabolite and transcript coresponses can yield important clues to the consequences of altered environmental conditions on the biochemical level as well as to the organism's coping strategies. From the perspective of biotechnology, knowing which genes affect metabolite levels is crucial for metabolic engineering since metabolism cannot, whereas gene expression can, be altered directly. Consequently, many studies using metabolomics and transcriptomics in parallel have been performed to monitor development (e.g. Carrari ; Urbanczyk-Wochniak ) and stress responses (e.g. Dutta ; Hirai ; Kaplan ) in time-series experiments. The datasets from this type of experiments are now a valuable resource with the potential to fill similar needs as the widely applied gene coexpression databases (e.g.Obayashi ) for hypothesis-generating research (Saito ). Transcript–transcript (TT) correlations that derive from genes coding for enzymes that catalyze nearby reactions are stronger than correlations that derive from genes that catalyze metabolically distant reactions (Kharchenko ; Walther ). When the system is not at steady state, abundances of neighboring metabolites also tend to be correlated; this observation was used to approximate the glycolysis pathway from metabolite abundance measurements (Arkin ). The inverse relationship between metabolite–metabolite (MM) correlations versus pathway distance was recently shown to manifest also on the omics level during heat and cold stress responses in Escherichia coli (Walther ). Although interpreting MM correlations is far from straightforward (Steuer ), metabolite–transcript (MT) correlations are arguably even more multifaceted. First, since enzymes influence the reaction speed, metabolically driven correlations are more intuitive between transcripts and metabolite fluxes than for the more easily measured abundances. Second, correlations are not expected for metabolites that are rapidly consumed by subsequent reactions. Third, other modes such as post-translational modification appear to be more common for regulating enzyme activity than transcription (Carrari ; Gibon ). However, there are many other reasons for correlations between metabolites and transcripts that do not derive from the underlying biochemical pathway. In a 2006 review, (Ladurner, 2006) described several mechanisms in which metabolites can tune gene expression; they include riboswitches, direct interactions with transcription factors, regulation of cofactors, chromatin remodeling, chromatin modification and hormone signaling. In plants, the last is arguably the best known since hormones play central roles in nearly all stress responses and developmental programs. With these considerations in mind, it becomes clear that MT correlations may take many forms exhibiting various degrees of noise, time lags and conditionality on the studied response (see Fig. 1 for examples). Therefore, to detect such coresponses using combined profiling data, methods that can handle these issues and deliver interpretable results are needed.

Fig. 1.

Three examples of MT coresponse patterns seen in the glucosinolate (Gls) synthesis pathway during sulfur deficiency [data from (Hirai )]. (a) ATST5a catalyzes a late step in the synthesis of Indolylmethyl-Gls and its transcript is positively correlated with the reaction product. (b) CYP7F1 and MAM3 catalyzes aliphatic Gls synthesis from methionine that appear anticorrelated with isoleucine, also a precursor for aliphatic Gls via a pathway that shares enzymes with the methionine → aliphatic Gls pathway. (c) SUR1 catalyzes a later step in aliphatic Gls synthesis and appear correlated with methylthiobutyl Gls at a positive time lag.

1.1 Previous approaches

The most commonly applied method has been to rank all transcripts according to their Pearson correlation with different metabolites (Urbanczyk-Wochniak ). Hirai ) coclustered transcripts and metabolites and identified relevant patterns by manually inspecting the resulting clusters. Bradley ) used a Bayes net to calculate an association statistic by conditioning the Pearson correlation on four broad classes of metabolites and the studied stress response. Although it has proven highly useful, the Pearson correlation has the considerable disadvantage that it is sensitive to noise and that it cannot detect time-lagged responses. To cope with delayed responses, lagged covariances/Pearson correlations have been applied to study the heat and cold stress responses in Saccharomyces cerevisiae (Walther ) and E.coli (Takahashi ) and for predicting transcription factor targets in Arabidopsis thaliana (Redestig ). However, the introduction of the lags truncates the already typically short time-series and may exacerbate the problem with noise sensitivity. Another serious problem with the direct study of pairwise MT correlations (regardless of how they are calculated) is that since data are scarce, the large number of false-positives renders interpretation very difficult. Even with a very good measure, finding relevant patterns among millions of correlations is a staggering task, especially when done in a completely unsupervised fashion. One approach that addresses this issue is to use multivariate regression via for example O2PLS (Bylesjö ) or PLS (Pir ). These methods relate the entire transcriptomics and metabolomics data blocks with each other in a single model; this is useful for interpreting large-scale trends but cannot easily be used for identifying specific MT coresponses.

1.2 Our approach

In this study, we aimed to design a general method for detecting MT coresponses that addresses these issues. Specifically, we demanded that the method should (i) be able to work with time-series data with few time-points, that it be robust to (ii) noise and (iii) time lags and (iv) that it deliver directly interpretable results to facilitate hypothesis generation. To this end, we introduced two novel methodological developments: (i) a hidden Markov model-based similarity measure and (ii) an approach that summarizes several similarity statistics on the pathway level.

1.2.1 Hidden Markov Model based similarity

We propose here a novel Hidden Markov Model (HMM)-based similarity for accessing the temporal coresponse of MT pairs. We used a particular linear HMM topology defined by Schliep , 2005). Such models have been successfully applied in distinct gene expression time course applications such as querying (Schliep ), model-based clustering (Costa ; Schliep ) and classifying treatment responses (Costa ; Hafemeister ). The linear HMM can be interpreted as a segmentation method, where each state defines an expression range that a time-series follows, e.g. low or high expression, during a particular time interval. For example, the model in Figure 2 defines a prototypical up and down expression behavior. Given its stochastic nature, the linear HMMs have been shown to be robust to modeling noisy and lagged time courses (Costa ; Schliep ).

Fig. 2.

Examples of three-state HMM time courses with an up and down gene expression/metabolite abundance pattern (top). The state mean and the transitions determine the expected expression/abundance value for a particular time interval. For example, we depict in the bottom a time series with high similarity to the example HMM. The colored boxes correspond to the most probable expression/abundance intervals defined for each of the three states.

1.2.2 Pathway-based coresponse summarization

Considering the complex interdependency between metabolites and transcripts, we reasoned that regulatory patterns may be easier to detect and interpret on the level of isolated pathways such as glycolysis or the tricarboxylic acid cycle than when considering individual MT pairs. Therefore, we propose the use of orthogonal projections to latent structures discriminant analysis (OPLS-DA) (Bylesjö ) to combine the similarity statistics for metabolites in a particular pathway. Moreover, this approach can be used to combine different types of similarity statistics. The results can be used to rank the transcripts after determining the strength of their association with a given pathway based on their coresponses with metabolites in that pathway. We depict in Figure 3 the method schematic. For a metabolite's abundance profile at a particular stress condition (a) we apply traditional approaches by measuring the Pearson and lagged Pearson correlation with all transcripts (b). Additionally, we train one HMM to capture the metabolite's prototypical temporal abundance (c). Next we evaluate the likelihood of all gene expression time-courses with each HMM (d). A high likelihood indicates a high similarity/coresponse between the metabolite and the transcript. Note that the HMM also assigns high scores to time courses with similar but lagged expression patterns. Steps (a)–(d) are repeated for all metabolites measured in a particular pathway/stress condition. Next, we use OPLS-DA to summarize the statistics for all metabolites in the corresponding pathway using the annotated transcripts as true positives (Step e). To evaluate the found coresponses, we use both Fisher's exact test on pathway enrichments and lift charts on consensus statistics calculated in cross-validation fashion.

Fig. 3.

Schematic of the proposed methodology. For the metabolites in a given pathway (Step a), we calculate their (direct and lagged) Pearson correlation with the expression of all genes (Step b). Additionally, we train a linear HMM for each metabolite (Step c) and evaluate gene expression in the obtained models (Step d). Steps (a–d) are repeated for all metabolites in a given pathway for a particular dataset. To combine multiple coresponse statistics, we perform OPLS-DA using a set of true positives from a pathway database (Step e). Finally, we evaluate the results using an enrichment test and lift-charts on consensus statistics predicted during cross-validation (Step f).

2 METHODS

2.1 Notation

g∈{1,…,N}: be a running index for the genes, m∈{1,…,M}: be a running index for the metabolites, t∈{1,…,T}: denote the time points, X∈ℝ: expression matrix of all genes and time-points in a particular treatment/stress, Y∈ℝ: abundance matrix of all metabolites and time-points in a particular treatment/stress, x∈ℝ: expression time course of gene g, y∈ℝ: abundance time course of metabolite m, x∈ℝ: expression of gene g at time t, y∈ℝ: abundance of metabolite m at time t.

2.2 Detection of metabolite–transcript coresponses

The application problem concerns scoring the coresponse between a pair of transcripts and metabolites, (x,y) in a given stress response. For this, we use a series of statistics that measure the similarity between temporal expression/abundances. Moreover, as distinct similarity statistics measure distinct similarity properties of a pair of time courses, we investigate the use of either individual statistics or their combination, as described below.

2.2.1 Pearson correlation

For a pair of gene expression and metabolite abundance time courses (x,y), the Pearson correlation can be estimated as where μ=∑x/T, μ=∑y/T, S=∑(x−μ)2/T and S=∑(y−μ)2/T. PC(x,y) will have values in the range [−1, 1], where 0 indicates no correlation, 1 a perfect positive correlation and −1 a perfect negative correlation.

2.2.2 Lagged Pearson correlation

The lagged Pearson correlation can be estimated by introducing shifts in the original time courses and estimating the Pearson correlation with the above-defined formula. More formally, the shifted pair (x(,y() with a lag l is obtained for positive lags as y(={y,…,y} and x(={x,…,x} and for negative lags as y(={y,…,y} and x(={x,…,x}. The lag Pearson correlation (LPC) is estimated as Correspondingly, one can estimate the optimal time lag as

2.2.3 Linear HMMs

An HMM is a probabilistic function of a Markov Chain. It is defined by a set of states S, the transition probabilities a for moving from state i to j and the emission densities f attached to the i-th state (see Rabiner, 1989 for details). We use HMM with a linear topology, i.e. each state has a self-transition and a transition to the next state (Fig. 2). The model also has a start and end state to force the time-courses to visit all states at least once. The emission densities are based on univariate Gaussians with free parameters μ,σ. To support missing values and noise, we use a mixture model as the emission function (see Costa ; Schliep for details). A HMM is parametrized as λ=(A,(μ1…,μ),(σ1,…,σ)), where A is a S×S matrix with entry a defining the transition from state i to state j. The linear HMM can be interpreted as a segmentation method, where each state defines an expression range that a time-series follow, e.g. low or high expression, during a particular time interval. Here, the intensity of the expression is parametrized by the mean value of the emission density μ and the length of the interval is parametrized by the transition probability s. For example, the model in Figure 2 defines a prototypical up and down expression behavior.

2.2.4 HMM-based similarity

To obtain an HMM-based similarity for a pair (y,x), we perform the following procedure. For a given metabolite m, we estimate an HMM λ with the Baum-Welch algorithm using y as the training data. Next, we use the forward algorithm to obtain the likelihood over observation x under the model λ (Rabiner, 1989). The likelihood reflects the similarity between the HMM trained with y and that evaluated with x, that is This statistic will have high values for time-series with high similarities and low values for time courses with no similarity. Note that the Pearson correlation also captures negative correlation, i.e. time courses that have a similar but inverted pattern of expression. Given that the time-series are normalized, i.e. mean zero and SD 1, an HMM-based inverted similarity can be obtained by inverting the signal of the gene time course before evaluating Equation (4). We will call the normal and inverted similarities HMM+ and HMM−, respectively. Overall, the only parameter to be defined by the user is the number of states. For our previous applications, we chose the number of states to be at maximum 40% of the number of time points. Given the short duration of the time courses analyzed in this work, we only consider HMMs with two or three states, called HMM2 and HMM3, respectively. We also performed analysis for four states (Supplementary Figs. S1–S3 and S8); however, the results were either worse or similar to HMM3.

2.3 Combining similarity statistics

We summarize the coresponse statistics for all metabolites in a given pathway, p, to obtain a single statistic, b, that measures the coresponse between each transcript and that pathway in general. For this purpose, we chose OPLS-DA as it sets no constraints regarding colinearity or the distribution of the input data (Bylesjö ). Furthermore, a wide range of diagnostic tools and plots exist for this type of models and this facilitates interpretation. Let Z be the scaled and centered g×m∈p matrix with the similarity statistics for the metabolites in p. Further, let be a column vector with ones for the transcripts that are annotated to p and zeroes for those that are not (as defined by a pathway database such as KEGG). Then we use OPLS-DA to decompose Z into a -correlated part and a -uncorrelated part by: where BP approximates the -uncorrelated variance in Z and E is the residual. Our consensus statistic is given by the -correlated score vector . The model is predictive and can be estimated for a new Z as well. We estimate the complexity of the model (number columns in P) by 10-fold class-balanced cross-validation and use separate round of cross-validation to obtain independent observations of to use for evaluation purposes. The same approach can be used to summarize any number of statistics by simply augmenting Z with other coresponse statistics. Thereby, we provide a general way to combine complementary methods for scoring MT coresponses. In particular, we use this approach for combining the HMM+ statistics with the inverted similarities from HMM− or for combining Pearson with HMM.

2.4 Datasets

2.4.1 Arabidopsis stress response data

Four public combined profiling datasets were used to evaluate our approach. The ‘sulfur root’ and ‘sulfur leaf’ were introduced by Hirai , 2005) for studying the transcriptome and metabolome in leaves and roots during sulfur deficiency stress. The transcriptomics data were obtained from custom Agilent arrays and the metabolomics data were obtained by untargeted FTMS and targeted high performance liquid chromatography (HPLC) and capillary electrophoresis (CE) analysis [42 (root) and 28 (leaf) annotated metabolites and 7342 transcript probes after filtering, 6 time-points in both datasets]. The third dataset is from a study of cold stress response reported by Kaplan ) who used Affymetrix ATH1 array and GC-TOF/MS analysis, respectively (87 annotated metabolites, 6680 filtered transcript probesets, 7 time-points). The fourth dataset comes from a study of the response to elevated CO2 concentration in leaves by Dutta ); it was performed using TIGR arrays and GC-TOF/MS (76 annotated metabolites and 7138 filtered transcript probesets, 9 time-points). All metabolites and transcripts were standardized to zero-mean and unit-variance. Metabolite identifiers were unified using the MetMask tool (Redestig ).

2.4.2 Simulated data

We resort to simulated data to investigate the performance of the similarity statistics under controlled conditions. We use the impulse model (Chechik and Koller, 2009) as a smooth function of the gene expression/metabolite abundance response to stress/treatment conditions. For a particular simulated stress/treatment condition, we define an impulse model for each metabolite. We sample a time course from this function at regular intervals to represent metabolite abundance. Gene expression, associated with this metabolite, was sampled from the same function (50 samples). In this case, we also allowed for time lags, that is, observations can be drawn at previous or later time points than the original metabolite time course. Noise sampled from a normal distribution with mean zero and σnoise is added independently to each time point from all time courses. We repeat the same procedure for five distinct impulse models representing five distinct metabolites. We allow time lags from the range [−2,−1,0,1,2]; they are chosen with probability [0.1,0.2,0.4,0.2,0,1]. At the end, we obtain for each simulated datum the temporal abundance of 5 metabolites and the expression of 250 genes. As the gold standard for evaluation, gene–metabolite pairs are associated if they were sampled from the same impulse model. To investigate how the distinct method works in distinct conditions, we changed the random noise parameter σnoise from 0.1 to 0.5 and the number of time points from 7 to 14. For each parameter selection, we generate 10 datasets. All software and processed datasets can be found at http://www.cin.ufpe.br/~igcf/Metabolites.

3 RESULTS

3.1 Evaluation of individual statistics on simulated data

To evaluate the individual similarity statistics under controlled conditions, we applied HMM, Pearson and lagged Pearson approaches on simulated data. We are particularly interested in evaluating the methods under distinct amounts of injected noise, numbers of time points and the presence of time lags. If we consider only time courses without lag, the Pearson correlation is overall the best method (Supplementary Fig. S1). Nevertheless, for the most degenerate scenario—high noise and few time points—the distinct HMM topologies display similar performance as the Pearson correlation. In the presence of time courses with time lag (Fig. 4), lagged Pearson performs best in the presence of few time points and low noise, but rather poorly under other conditions. The Pearson correlation performs best when the number of time points is equal to 14, or in the lower values of gene calls (Supplementary Figs. S1–S3). This is mostly because of the deficiency of Pearson in the detection of the coresponse of pairs that contains lags in the time courses (see Supplementary Fig. S2 for evaluation of only lagged time courses). The two HMM topologies present consistent results over all scenarios. In particular, they outperform Pearson and lagged Pearson correlations in the scenario of high injected noise and few time points (Fig. 4, right-most panel). Note that few time points and the presence of high amounts of noise are characteristic of the majority of actual experimental data. These results reinforce the importance of the HMM-based distance for detecting coresponses in noisy, undersampled time courses with time-lagged MT coresponses.

Fig. 4.

Boxplots with area under the curve (AUC) estimates for the prediction of true MT coresponses on simulated data using HMMs, Pearson and lagged Pearson. Experiments were based on time courses with 7 or 14 time points and low- (σnoise=0.1) or high-injected noise (σnoise=0.5).

3.2 Unsupervised prediction of specific metabolite–transcript coresponses

A good measure for coresponses should give higher values for truly associated than for unrelated MT pairs. A direct way to determine whether a coresponse measure lives up to this expectation is to compare the values for close neighbors in the underlying metabolic pathways with values for distant pairs. To this end, we generated an undirected aggregated bipartite graph for all A.thaliana-specific KEGG pathways by connecting metabolites to enzymes that catalyze their formation or breakdown. Similar to the methodology of Walther ), we used this graph to classify TT, MM and MT pairs at a maximum distance of three steps as cognate pairs and metabolites at a minimum distance of eight steps as non-cognates. We then evaluate the performance of the different coresponse measures for discriminating cognate from non-cognate pairs by calculating the area under the ROC curve (AUC) (Fig. 5). Notably, with all AUC values close to 0.5 as expected from a random classifier for all MT pairs, none appeared informative for this classification, indicating that expression/abundance profiles contain little information about immediate associations in the metabolic reaction graph.

Fig. 5.

AUC for the classification of cognate pairs of TT, MM and MT (maximum three steps in the metabolic pathway) from non-cognate pairs (minimum of eight steps). The box shows the expected distribution of the AUC under H0 as estimated by randomization and the dot indicates the observed AUC. While Pearson correlation shows higher coresponses between MM pairs in CO2 and sulfur; and with TT pairs in CO2, none of the considered that MT coresponse measures are informative for performing this classification.

3.3 Supervised prediction of pathway level metabolite–transcript coresponses

The above results confirm the complex interdependency and difficulty in detecting MT coresponses. We investigated whether regulatory patterns may be easier to detect and interpret on the level of whole pathways than when considering individual MT pairs. Using the summarization approach introduced in Section 2.3, we therefore calculate consensus statistics for each transcript and KEGG pathway using each of the considered coresponse measures or their combination. As a result, we obtain ranks of the transcripts according to how strongly associated they are with the different pathways given their metabolite coresponse patterns. It should be noted that summarized statistics can only be calculated for pathway-level coresponses since the computation requires a set of true positives. All statistics were calculated in cross-validation fashion to minimize the generalization error. To evaluate these rankings, we first used the Fisher's exact test to obtain a list of the pathways for which a significant enrichment of truly associated transcripts was seen among the top 1-, 3-, 5- and 10% transcripts for each method and dataset. Of the 119 pathways (and pathway maps), 50 were significantly enriched for at least one method and dataset at the 10% cut-off level. To visualize the enrichment at all possible cutoffs for these pathways, we use lift charts. The lift value is defined as the observed ratio of true positives divided by the total ratio of true positives (expected invariant over the cut-off rate for a random classifier). We use lift values since they emphasize performance in lower positive call rates; this is important in this application. Furthermore, because they indicate the relative benefit, they can be compared across different pathways even though the pathways contain different numbers of genes and we therefore summarize them by averaging (Fig. 6). In the sulfur root dataset, HMM2 and HMM3 performed slightly better than Pearson which in turn outperformed lagged Pearson. All combined approaches performed better than their individual counterparts and the best results were obtained with HMM3+Pearson. For the sulfur leaf dataset, the Pearson correlation gives the worst- and HMM2+Pearson the best result. On the cold stress dataset, the Pearson correlation performed comparable to HMM2+Pearson and better than the other HMM-based measures. On the CO2 dataset, all the combined methods performed comparably with a slight preference for HMM2+Pearson and HMM3+Pearson. Of the individual methods, HMM2 was overall best but it notably underperformed on the cold stress dataset. In general, the Pearson correlation and HMM-based methods appeared complementary and provided the overall best performance when used together.

Fig. 6.

Average lift charts for classifying transcripts to 50 different KEGG pathways. The lift value indicates how many times better the actual coresponse statistic is compared to using a random classifier at the corresponding cut-off level. Pathway level analysis provides a direct way to search for interpretable coresponse patterns between metabolites and transcripts. To further visualize these results, we performed side-by-side comparison of the Fisher exact test P-values at the 10% cut-off level using Pearson and the HMM method of choice for each pathway (Fig. 7 and Supplementary Figs. S4–S7). Similar to the results seen for the lift charts, it is clear that HMM and Pearson are complementary, i.e. they show enrichment for distinct pathways. Moreover, this complementarity is successfully summarized using our OPLS-DA-based approach, which shows an overall higher enrichment than Pearson alone.

Fig. 7.

P-value scatter plots comparing pathway enrichments at the 10% cut-off level for the CO2 dataset for (a) HMM3 and Pearson, (b) combined HMM3+Pearson and Pearson. The gray line corresponds to FDR<0.05. Points in the upper-left part of the graph indicate a higher enrichment for the method on the y-axis; points in the bottom-right corner indicate a higher enrichment for the method on the x-axis. The farther the points are from the diagonal, the higher is the enrichment gain of the best method. In the left plot, we observe the presence of pathways with higher enrichment for either HMM or Pearson. This indicates that each method recovers coresponses on distinct pathways. In the right plots, we see that the combination of Pearson and HMM has an overall higher enrichment of pathways than Pearson alone. PS: photosynthesis; Starch/suc: starch and sucrose synthesis pathway.

3.4 Examples of metabolite–transcript coregulation

Although phosphate was the only detected metabolite member of the metabolic map for photosynthesis (map 00195), the corresponding transcripts were already highly enriched in the top 3% of the gene lists from HMM3 with the cold, CO2 and sulfur root datasets (phosphate was not present in the sulfur leaf dataset and plants were grown on transparent media; this may explain the expression of photosynthesis genes in the root). At the same cutoff, the Pearson correlation-based gene lists showed high enrichments of photosynthesis genes for the cold stress dataset but markedly lower enrichment than HMM3 in the CO2 and sulfur root datasets (Fig. 7 and Supplementary Figs. S4–S7). Considering the abundance/expression profiles for phosphate and the photosynthesis-related genes, it becomes clear how HMM3 and Pearson are distinguished in their priorization of different patterns (Fig. 8a, b and c). Expectedly, genes that are ranked highly by the Pearson correlation have patterns nearly identical (or inversely identical) to phosphate itself. This leads to a high enrichment of photosynthetic genes under cold stress but much less enrichment under CO2 and sulfur stress conditions where the response of the photosynthesis genes exhibits a more complex dependency on the phosphate levels. HMM3 on the other hand is mostly concerned with the overall shape of the expression patterns and therefore ranks the photosynthesis genes highly in all three cases despite their inverted pattern and the presence of noise. Moreover, delay lag is noticeable in the gene expression response in the sulfur root and CO2 datasets.

Fig. 8.

Coresponse patterns between metabolites and transcripts as detected by HMM3 and the Pearson correlation. (a–c) Coresponse between phosphate and transcripts in the photosynthesis pathway map (00195) under (a) cold stress, (b) elevated CO2 and (c) sulfur deficiency (root). (d) Coresponse between OAS and transcripts in the zeatin synthesis pathway (00908) under sulfur deficiency (root). The color intensity is proportional to the rank. Numbers to the right of the gene names correspond to the ranking of the gene in the entire dataset (max rank 7342 for sulfur-, 6680 for the cold stress- and 7138 for the CO2 dataset). The only detected metabolite in the zeatin synthesis pathway (Fig. 8d) is O-acetyl-l-serine (OAS). A significant proportion of the zeatin synthesis-related genes are enriched at the very top of HMM3-based gene list but not on Pearson. In particular, the expression profile of ATCKX5 and DOGT1, which were not detected by Pearson, precede the fluctuations in OAS abundance. Another informative example is the starch and sucrose synthesis pathway (pathway 00500) under elevated CO2 conditions. Six different metabolites in this pathway were detected and our summarization approach integrates their coresponse patterns with the transcripts to form a single consensus statistic. This statistic significantly enriches the relevant transcripts at the top of the list when using either Pearson or HMM3 (Fig. 7). However, HMM3 and Pearson reveal different coresponse patterns; using them together results in a greatly improved recovery of the transcripts annotated to this pathway (Fig. 9). Since the summarization is performed with OPLS-DA, the importance of different metabolites can be investigated by looking at the model's loadings [ in Equation (5)]. In this example, HMM3 mainly detects coresponse with maltose and glucose-6-phosphate, whereas the Pearson correlation is higher with trehalose and sucrose. When using both HMM3 and Pearson together, these signals become integrated resulting in a high recovery of a large proportion of the starch and sucrose synthesis-related transcripts.

Fig. 9.

HMM3 and Pearson consensus statistics for scoring coresponses patterns between genes and metabolites in the starch and sucrose synthesis pathway under elevated CO2 conditions. Blue crosses indicate genes that are pathway members and blue circles other genes. Red crosses correspond to the metabolites and their corresponding loadings (importance for separating pathway members from other genes) are given on top and on right axes.

3.5 De novo detection of metabolite–transcript coregulation

The summarization step requires a set of true positives for weighting the different statistics according to how well they capture the observed coresponse patterns in a given pathway. Although this requirement precludes complete de novo discovery of MT coresponses on the pathway level, unknown genes can still be scored when a set of guiding examples is available. To examine whether relevant genes outside the KEGG pathways also receive high rankings in our examples, we extracted genes that code for proteins that are predicted by AtPIN (Brandão ) to interact with the enzymes in each pathway. Using the Fisher's exact test, we then checked whether the transcripts for the interacting proteins were overrepresented among the top 10% in the ranked gene lists. For the cold and CO2 datasets, several such scenarios could be seen (Table 1). As a notable example, AT2G46820 (P subunit of photosystem I), AT4G21280 (a locus coding for PsbQ), and AT4G39710 photosystem I interacting NADH dehydrogenase,(Peng ) are predicted to interact with the proteins in the photosynthesis map and are among the top 2% of the gene list for the cold stress datasets. These genes were not among the KEGG annotations originally used (the org.At.Tair.db package from Bioconductor), indicating that our approach can be used to mine for novel metabolic properties and also for genes outside the set used for calculating the consensus statistic.

Table 1.

Example of pathways, where interacting proteins as described by AtPIN, also were enriched among the top 10% in the ranked gene lists

Dataset/method	Pathway	Description	KEGG	AtPIN
Cold	00020	TCA cycle	0.001	0.025
hmm2 + Pearson	00195	Photosynthesis	2(−22)	0.005
	00230	Purine metabolism	6(−04)	0.0013
	00240	Pyrimidine metabolism	0.13	0.00086
	01061	BS of phenylprop.	0.07	0.012
	01064	Alkaloids BS (ornit, lys, nic)	0.23	0.039
	01065	Alkaloids BS (his., purine)	0.0018	0.005
	00290	Val., Leu. and Ile BS	0.77	0.0068
	00640	Propanoate metabolism	0.3	0.031
CO₂	00230	Purine metabolism	2.3(−05)	0.033
hmm3 + Pearson	00240	Pyrimidine metabolism	0.0057	0.0083

Numbers correspond to the FDR values from Fisher's exact test.

Example of pathways, where interacting proteins as described by AtPIN, also were enriched among the top 10% in the ranked gene lists Numbers correspond to the FDR values from Fisher's exact test.

4 DISCUSSION

We presented an approach for detecting coresponses between metabolites and transcripts using combined profiling time-series data. In previous studies, the direct or lagged Pearson correlation was used for identifying such patterns but as demonstrated by our simulations, this approach does not work well for short time-series with strong noise and time-lagged responses (Fig. 4). Our HMM-based similarity measure performed well also in this most degenerate scenario but, as expected, not better than the Pearson correlation in the more favorable scenarios. In general, both MM and TT correlations are expected to decrease with increasing distance in the underlying metabolic pathways. By looking at differences in the correlations between cognate and non-cognate transcript and metabolite pairs, we could reproduce earlier findings (Kharchenko ; Walther ) with the CO2 dataset for TT and MM correlations, and for MM correlations with the sulfur datasets (Fig. 5). However, different from Walther ), we could not see a dependency between metabolic distance and MT correlations for any of the studied datasets regardless of the used measure. Circumstances such as differences in experiment designs and analytical platforms may be causative of this discrepancy. Possibly, since A.thaliana is a much more complex organism than yeast, it could also show less obvious dependency on the pathways that have been deciphered primarily in single-cell organisms. Nonetheless, with the inevitably high number of false positives among pairwise correlations, we contend that coresponses on the level of whole pathways are both more likely to be detectable and easier to interpret. In actual stress response data, we saw similar tendencies as in our simulation; Pearson was better at detecting direct, clear responses (Fig. 8a) and HMM was better at more complex patterns (Fig. 8b–d). In combined profiling data, we both expected and saw many types of coresponses between metabolites and transcripts. However, the nature of these responses—lagged or direct, noisy or clear—is not enough for judging their relevance because they have many possible origins (Ladurner, 2006): each with its own expected pattern. Therefore, we propose to combine different types of measures and then to summarize them by looking at a set of true associations as guiding examples. This approach worked well; the combined HMM+Pearson methods were best for predicting pathway comemberships between the metabolites and transcripts on all four datasets (Fig. 6). The richness of the interplay between metabolism and gene expression requires attentiveness when interpreting observed patterns. While direct coregulation is conceptually possible, pleiotropic and indirect causes are presumably more likely. However, observed coresponses may still be informative for understanding the perception of stress and the events that follow on a molecular level. We presented three examples of this. (i) In plant stress responses, photosynthesis is strongly downregulated to avoid the production of detrimental reactive oxygen species (Mittler, 2002) while metabolism undergoes large-scale changes. With decreasing energy production, the ATP/ADP ratio drops and free phosphate levels increase. These are the same patterns we observe under cold stress and sulfur deficiency (Fig. 8a and c). However, under conditions of elevated CO2, which may initially be beneficial to the plant (Kanani ), the photosynthesis rate increases and consequently, phosphate levels start to decrease (Fig. 8b). This pattern was also detected with our approach although it is overloaded by circadian rhythmic expression (Harmer ). (ii) OAS is a key metabolite in sulfur assimilation and also a member of the cytokinin (zeatin) synthesis pathway (Hirai ). Cytokinins themselves are well-known repressors of sulfur uptake (Ohkama ). The observation of a strong increase in ATCKX5 expression before the increase in the OAS-levels may therefore be a manifestation of upregulated sulfur transport and of changes that follow sulfur assimilation. The coresponses in the glucosinolate synthesis pathway discussed by Hirai ) were also clearly detected in both root and shoot datasets (Supplementary Figs. S6 and S7; pathway 00966). (iii) During elevated CO2 conditions, as the expression of photosynthesis genes increases, carbon assimilation also changes. As it is a central metabolism pathway, metabolites involved in starch and sucrose synthesis show complex behavior but our method still highlighted coresponses between them and related transcripts. Interestingly, both the Pearson correlation and HMM3 consensus statistics were informative here but completely orthogonal (Fig. 9). Furthermore, the OPLS-DA model prioritizes different metabolites, highlighting that different types of coresponse patterns are present and that Pearson and HMM complement each other. Bradley ) predicted MT associations by combining several stress response datasets using a Bayesian approach. As that methodology was based on the use of Pearson correlations and does not handle time lags, it would profit from the incorporation of the coresponse statistics proposed here. The presented photosynthesis example indicates that combining datasets may be useful also in our case. However, this appeared to be more of an exception than a rule since the significant pathways overall differed strongly between different datasets (Supplementary Figs. S4–S7). Furthermore, as a difficulty in combining datasets from completely different experiments, there typically is a very small overlap of the detected metabolites. For example, only seven metabolites were measured in all the datasets we studied. Moreover, combining data will only capture MT coresponses in pathways affected in the majority of the studied stress conditions. This may worsen predictions for stress-specific metabolite–transcript associations. Extrapolation of the predictions with additional sources of biological data, as shown with AtPIN, indicates the power of the methodology for finding de novo associations. One interesting future aspect is the inclusion of further interaction data for validation purposes and integrative analysis. Moreover, we are currently developing a web tool over an expanded collection of datasets that includes genome-wide predictions of MT associations.

34 in total

1. Analyzing gene expression time-courses.

Authors: Alexander Schliep; Ivan G Costa; Christine Steinhoff; Alexander Schönhuth
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2005 Jul-Sep Impact factor: 3.710

2. Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data.

Authors: Max Bylesjö; Daniel Eriksson; Miyako Kusano; Thomas Moritz; Johan Trygg
Journal: Plant J Date: 2007-10-10 Impact factor: 6.417

3. Transcript and metabolite profiling during cold acclimation of Arabidopsis reveals an intricate relationship of cold-regulated gene expression with modifications in metabolite content.

Authors: Fatma Kaplan; Joachim Kopka; Dong Yul Sung; Wei Zhao; Mick Popp; Ron Porat; Charles L Guy
Journal: Plant J Date: 2007-04-25 Impact factor: 6.417

Review 4. Decoding genes with coexpression networks and metabolomics - 'majority report by precogs'.

Authors: Kazuki Saito; Masami Y Hirai; Keiko Yonekura-Sakakibara
Journal: Trends Plant Sci Date: 2007-12-21 Impact factor: 18.313

5. Integration of metabolite with transcript and enzyme activity profiling during diurnal cycles in Arabidopsis rosettes.

Authors: Yves Gibon; Bjoern Usadel; Oliver E Blaesing; Beate Kamlage; Melanie Hoehne; Richard Trethewey; Mark Stitt
Journal: Genome Biol Date: 2006-08-17 Impact factor: 13.583

6. Integrative investigation of metabolic and transcriptomic data.

Authors: Pinar Pir; Betül Kirdar; Andrew Hayes; Z Ylsen Onsan; Kutlu O Ulgen; Stephen G Oliver
Journal: BMC Bioinformatics Date: 2006-04-12 Impact factor: 3.169

7. When transcriptome meets metabolome: fast cellular responses of yeast to sudden relief of glucose limitation.

Authors: M T A P Kresnowati; W A van Winden; M J H Almering; A ten Pierick; C Ras; T A Knijnenburg; P Daran-Lapujade; J T Pronk; J J Heijnen; J M Daran
Journal: Mol Syst Biol Date: 2006-09-12 Impact factor: 11.429

8. ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis.

Authors: Takeshi Obayashi; Kengo Kinoshita; Kenta Nakai; Masayuki Shibaoka; Shinpei Hayashi; Motoshi Saeki; Daisuke Shibata; Kazuki Saito; Hiroyuki Ohta
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

9. Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior.

Authors: Fernando Carrari; Charles Baxter; Björn Usadel; Ewa Urbanczyk-Wochniak; Maria-Ines Zanor; Adriano Nunes-Nesi; Victoria Nikiforova; Danilo Centero; Antje Ratzka; Markus Pauly; Lee J Sweetlove; Alisdair R Fernie
Journal: Plant Physiol Date: 2006-10-27 Impact factor: 8.340

10. Transcription factor target prediction using multiple short expression time series from Arabidopsis thaliana.

Authors: Henning Redestig; Daniel Weicht; Joachim Selbig; Matthew A Hannah
Journal: BMC Bioinformatics Date: 2007-11-18 Impact factor: 3.169

19 in total

1. Analyzing LC/MS metabolic profiling data in the context of existing metabolic networks.

Authors: Tianwei Yu; Yun Bai
Journal: Curr Metabolomics Date: 2013-01-01

2. Eigenvector metabolite analysis reveals dietary effects on the association among metabolite correlation patterns, gene expression, and phenotypes.

Authors: Clare H Scott Chialvo; Ronglin Che; David Reif; Alison Motsinger-Reif; Laura K Reed
Journal: Metabolomics Date: 2016-09-20 Impact factor: 4.290

3. Glycosylation and post-translational modification gene expression analysis by DNA microarrays for cultured mammalian cells.

Authors: Arthur Nathan Brodsky; Mary Caldwell; Sarah W Harcum
Journal: Methods Date: 2011-10-19 Impact factor: 3.608

4. Investigation of the transcriptomic and metabolic changes associated with superficial scald physiology impaired by lovastatin and 1-methylcyclopropene in pear fruit (cv. "Blanquilla").

Authors: Jordi Giné-Bordonaba; Nicola Busatto; Christian Larrigaudière; Violeta Lindo-García; Gemma Echeverria; Urska Vrhovsek; Brian Farneti; Franco Biasioli; Concetta De Quattro; Marzia Rossato; Massimo Delledonne; Fabrizio Costa
Journal: Hortic Res Date: 2020-04-01 Impact factor: 6.793

5. Genomic analysis of the hydrocarbon-producing, cellulolytic, endophytic fungus Ascocoryne sarcoides.

Authors: Tara A Gianoulis; Meghan A Griffin; Daniel J Spakowicz; Brian F Dunican; Cambria J Alpha; Andrea Sboner; A Michael Sismour; Chinnappa Kodira; Michael Egholm; George M Church; Mark B Gerstein; Scott A Strobel
Journal: PLoS Genet Date: 2012-03-01 Impact factor: 5.917

6. Identification of regulatory network hubs that control lipid metabolism in Chlamydomonas reinhardtii.

Authors: Mahmoud Gargouri; Jeong-Jin Park; F Omar Holguin; Min-Jeong Kim; Hongxia Wang; Rahul R Deshpande; Yair Shachar-Hill; Leslie M Hicks; David R Gang
Journal: J Exp Bot Date: 2015-05-28 Impact factor: 6.992

7. Coordinating metabolite changes with our perception of plant abiotic stress responses: emerging views revealed by integrative-omic analyses.

Authors: Jordan D Radomiljac; James Whelan; Margaretha van der Merwe
Journal: Metabolites Date: 2013-09-06

8. Combined Use of Genome-Wide Association Data and Correlation Networks Unravels Key Regulators of Primary Metabolism in Arabidopsis thaliana.

Authors: Si Wu; Saleh Alseekh; Álvaro Cuadros-Inostroza; Corina M Fusari; Marek Mutwil; Rik Kooke; Joost B Keurentjes; Alisdair R Fernie; Lothar Willmitzer; Yariv Brotman
Journal: PLoS Genet Date: 2016-10-19 Impact factor: 5.917

9. The role of tumor metabolism as a driver of prostate cancer progression and lethal disease: results from a nested case-control study.

Authors: Matthew Vander Heiden; Lorelei A Mucci; Rachel S Kelly; Jennifer A Sinnott; Jennifer R Rider; Ericka M Ebot; Travis Gerke; Michaela Bowden; Andreas Pettersson; Massimo Loda; Howard D Sesso; Philip W Kantoff; Neil E Martin; Edward L Giovannucci; Svitlana Tyekucheva
Journal: Cancer Metab Date: 2016-12-07

10. Cellular and Molecular Responses of Dunaliella tertiolecta by Expression of a Plant Medium Chain Length Fatty Acid Specific Acyl-ACP Thioesterase.

Authors: Huixin Lin; Hui Shen; Yuan K Lee
Journal: Front Microbiol Date: 2018-04-04 Impact factor: 5.640