Meta-analysis of pathway enrichment: combining independent and dependent omics data sets.

Alexander Kaever1, Manuel Landesfeind1, Kirstin Feussner2, Burkhard Morgenstern1, Ivo Feussner2, Peter Meinicke1.   

Abstract

A major challenge in current systems biology is the combination and integrative analysis of large data sets obtained from different high-throughput omics platforms, such as mass spectrometry based Metabolomics and Proteomics or DNA microarray or RNA-seq-based Transcriptomics. Especially in the case of non-targeted Metabolomics experiments, where it is often impossible to unambiguously map ion features from mass spectrometry analysis to metabolites, the integration of more reliable omics technologies is highly desirable. A popular method for the knowledge-based interpretation of single data sets is the (Gene) Set Enrichment Analysis. In order to combine the results from different analyses, we introduce a methodical framework for the meta-analysis of p-values obtained from Pathway Enrichment Analysis (Set Enrichment Analysis based on pathways) of multiple dependent or independent data sets from different omics platforms. For dependent data sets, e.g. obtained from the same biological samples, the framework utilizes a covariance estimation procedure based on the nonsignificant pathways in single data set enrichment analysis. The framework is evaluated and applied in the joint analysis of Metabolomics mass spectrometry and Transcriptomics DNA microarray data in the context of plant wounding. In extensive studies of simulated data set dependence, the introduced correlation could be fully reconstructed by means of the covariance estimation based on pathway enrichment. By restricting the range of p-values of pathways considered in the estimation, the overestimation of correlation, which is introduced by the significant pathways, could be reduced. When applying the proposed methods to the real data sets, the meta-analysis was shown not only to be a powerful tool to investigate the correlation between different data sets and summarize the results of multiple analyses but also to distinguish experiment-specific key pathways.

Year:  2014        PMID: 24586671      PMCID: PMC3938450          DOI: 10.1371/journal.pone.0089297

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

High-throughput omics platforms, such as mass spectrometry (MS)-based Metabolomics and Proteomics or DNA microarray- or RNA-seq-based Transcriptomics, allow the comprehensive analysis of an organism's reaction under different experimental conditions [1]–[5]. A current major challenge in systems biology is the combination and integrative analysis of the large data sets obtained from these platforms [6]–[8]. A single data set usually contains the intensity/expression profiles (intensities for all measured samples) of thousands of features, such as different ion species in MS or spots in DNA microarray analysis. After individual preprocessing of each data set, which includes the statistical analysis, ranking, or filtering of features according to the relevance of their profiles [9]–[11], the features have to be assigned to known biological entities [12], such as metabolites, genes, or proteins. Especially in MS-based Metabolomics, a major bottleneck is the identification of metabolites in non-targeted experiments [13]. In many applications, the putative monoisotopic masses of measured ion species cannot be unambiguously mapped to metabolite entries in public databases. The integration of data from other omics platforms which provide a more reliable mapping, such as DNA microarrays, can significantly support the metabolite identification in this case. After annotation, the results are usually interpreted in the context of current knowledge, e.g. known biochemical pathways or processes [14]–[16].

A popular method for this knowledge-based interpretation of single data sets is the Gene Set Enrichment Analysis [17] or Overrepresentation Analysis [18], [19]. Many similar approaches have been developed and the methodology has been transferred to other omics platforms [20]–[23]. In general, the enrichment analysis is based on sets of entities, e.g. pathways with associated metabolites, and results in a list of relevant sets which are enriched in high-ranking features (in comparison to all features in the data set). In most methods, the enrichment level of a single set is expressed as a p-value. Modelling metabolic pathways as well-defined sets of biological entities, e.g. metabolites, enzymes, and corresponding genes, has been shown to be a powerful approach to interpreting complex omics data sets. Furthermore, the concept of pathways associated with different types of biological entities facilitates the joint analysis of different data sets [24].

The combination of results from different studies sharing the same experimental design in terms of null and alternative hypothesis (meta-analysis) is a central task in various statistical applications [25]–[27]. For the combination of independent p-values, Fisher's method [28] or Stouffer's method [29], also known as the normal, Z-method, or Z-transform test, is often applied. For dependent p-values with known covariances, an extended version of Fisher's method (Brown's method) was proposed in [30]. In order to increase statistical power, meta-analysis has been applied to Pathway Enrichment Analysis (Set Enrichment Analysis utilizing pathways as sets) in the context of cancer studies [31]. The proposed methods focused on the combination of independent p-values based on DNA microarray data. In contrast, we introduce a general methodical framework for the meta-analysis of multiple dependent or independent data sets resulting from different omics platforms applied to Pathway Enrichment Analysis. In order to cope with dependent data sets, such as those obtained from the same biological samples analyzed by MS in negative and positive ionization mode, the framework utilizes a covariance estimation procedure based on the nonsignificant pathways in single data set enrichment analysis. The framework is applied and evaluated on two Metabolomics MS data sets [11] and two Transcriptomics DNA microarray studies [32] in the context of wounding of Arabidopsis thaliana. The main focus of this exemplary meta-analysis lies on the enhancement of MS-based Metabolomics results by means of the microarray studies.

Materials and Methods

Data sets and preprocessing

For application and evaluation of the meta-analysis, two Metabolomics MS data sets (M1 and M2) [11] and two Transcriptomics DNA microarray data sets (T1 and T2) [32] were used (see Table 1 and Dataset S1 for details). All studies investigate the wounding of Arabidopsis thaliana wild type and the jasmonate-deficient dde 2-2 mutant plants [33]. The experimental designs comprise conditions for control plants as well as plants harvested at different times after wounding (see Table 1). The two Metabolomics data sets derive from an Ultra Performance Liquid Chromatography (UPLC) analysis coupled to Time-Of-Flight (TOF) MS detection. With this method, the non-polar extraction phase of one set of samples was analyzed in positive and negative ionization mode. Since some metabolites may have been measured in both ionization modes following different (partially unknown) ionization rules [34], the level of dependence between both data sets is not clear. In case of the MS data sets, a single feature corresponds to a particular ion species, which is characterized by an exact mass-to-charge ratio and a retention time. A single metabolite may be represented by multiple features, e.g. corresponding to different adduct formations and isotopologues. The features in the microarray data sets correspond to different spots on the array containing DNA probes that match a particular sequence. In this case as well, a single transcript may be represented by multiple features corresponding to particular sequences of the respective gene. The feature profiles of all data sets were ranked separately utilizing a signal-to-noise ratio (similar to the method described in [9], see TechnicalDescription S1).
Table 1

Overview on data sets.

| Label | Number of features | Times | Platform | Ionization mode | Reference |
|---|---|---|---|---|---|
| M1 | 24796 | 0.5 h, 2 h, 5 h | Mass spectrometry | negative | [11] |
| M2 | 23325 | 0.5 h, 2 h, 5 h | Mass spectrometry | positive | [11] |
| T1 | 25392 | 1 h | DNA microarray | - | [32], E-ATMX-9 |
| T2 | 25392 | 3 h | DNA microarray | - | E-MEXP-1475 |

The table gives an overview on the four data sets used for evaluation and application. The third column (Times) summarizes the different points in time when the wounded plants were harvested in the respective experiment. The T1 and T2 data sets can be obtained from the ArrayExpress [44] website.

Pathway enrichment analysis

The ranked features were mapped to the pathway entries in AraCyc [35] and the Arabidopsis-specific pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [14] (see TechnicalDescription S1). In case of the Metabolomics MS data sets, all potential monoisotopic masses were calculated per feature based on the ionization rules and number of isotopes used in [11] and mapped to the metabolite masses in the databases. In case of the Transcriptomics DNA microarray data, the features were mapped to the A. thaliana genes utilizing their CATMA IDs [36]. Based on the mappings, a set of feature ranks was extracted for each pathway and data set. In order to test for an over-representation of high-ranked features, a p-value was calculated for each set of ranks (pathway) utilizing a one-sided Kolmogorov-Smirnov (KS) or Wilcoxon rank-sum test (also known as Mann-Whitney U test) [21]. In case of the KS test, the empirical distribution of ranks in a given set is compared to the distribution of ranks in the respective data set. In case of the rank-sum test, the sum of feature ranks within a given set is evaluated. Especially for Gene Set Enrichment Analysis of DNA microarrays, many methods have been published [20]. Most of these methods are based on KS-like or average gene-specific statistics. For a general meta-analysis and in order to combine the Metabolomics and Transcriptomics data sets in a robust way, we decided to utilize the rank-based KS and rank-sum test. However, more specialized methods for the pathway-specific p-value calculation may be employed as well. The resulting p-values for the dependent Metabolomics data sets were used for the covariance estimation (see corresponding section). The covariances between both Transcriptomics data sets and between the Metabolomics and Transcriptomics data sets, which were obtained from independent biological samples, were set to zero.
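The rank-sum variant of the pathway test described above can be illustrated with a minimal sketch. It assumes that ranks run from 1 (most relevant) to the number of features without ties, and it uses the normal approximation of the Wilcoxon rank-sum statistic rather than an exact test. The function name `rank_sum_enrichment_p` is illustrative and not taken from the authors' code in File S1; the KS variant is not covered.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_enrichment_p(set_ranks, n_features):
    """One-sided p-value for enrichment of a set in top-ranked features.

    Assumptions of this sketch: rank 1 is the most relevant feature,
    there are no ties, and the normal approximation is adequate.
    """
    n = len(set_ranks)
    if not 0 < n < n_features:
        raise ValueError("set must be a non-empty proper subset of all features")
    w = sum(set_ranks)                                   # observed rank sum
    mean = n * (n_features + 1) / 2                      # expected rank sum under H0
    var = n * (n_features - n) * (n_features + 1) / 12   # variance under H0
    z = (w - mean) / sqrt(var)
    # A small rank sum (many top-ranked features) gives a small p-value.
    return NormalDist().cdf(z)
```

A set whose rank sum equals the expected value receives a p-value of 0.5, whereas a set made up of the top-ranked features receives a p-value close to zero.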

Meta-analysis of p-values

In statistical meta-analysis, the most common methods for combining independent p-values from related tests are Fisher's [28] and Stouffer's method [29]. In Fisher's method, the meta-p-value is calculated based on a chi-squared distribution (see TechnicalDescription S1). In Stouffer's method, the test statistic is the sum of p-values transformed into normally distributed random variables (standard normal deviates). For dependent p-values, a powerful approach is Brown's method [30], which is an extension of Fisher's method based on a scaled chi-squared distribution and modified degrees of freedom utilizing a known covariance matrix for standard normal deviates. The given p-values can be transformed into standard normal deviates by means of the inverse cumulative distribution function of the standard normal distribution. The covariance matrix of the standard normal deviates can also be utilized in order to extend Stouffer's method to dependent p-values.
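As a rough illustration of the combination rules named above, the following sketch implements Fisher's method and Stouffer's method, the latter with an optional covariance matrix of the standard normal deviates for dependent p-values. It assumes the sign convention z = Φ⁻¹(1 − p), so that small p-values map to large deviates, and p-values strictly between 0 and 1. The function names and the `cov` argument are illustrative; Brown's method is not included because its scaled chi-squared distribution needs a non-integer number of degrees of freedom.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fisher_method(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k df under H0."""
    k = len(p_values)
    half_x = -sum(log(p) for p in p_values)
    # Closed-form chi-squared survival function for an even number of df (2k).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


def stouffer_method(p_values, cov=None):
    """Stouffer's method; `cov` is the covariance matrix of the deviates.

    With cov=None the p-values are treated as independent.
    """
    k = len(p_values)
    z = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    if cov is None:
        variance = float(k)
    else:
        # Variance of the sum of correlated unit-variance deviates.
        variance = sum(cov[i][j] for i in range(k) for j in range(k))
    return 1.0 - _STD_NORMAL.cdf(sum(z) / sqrt(variance))
```

For two p-values of 0.01, a positive covariance yields a less extreme meta-p-value than the independent case; with perfectly correlated deviates the result is again 0.01.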

Estimation of covariances

In most applications with dependent data sets, the covariance matrix is not known and has to be estimated. In our proposed procedure, the pairwise covariance between two data sets is estimated based on the standard normal deviates of the pathway-specific p-values, which were obtained for each single data set in Pathway Enrichment Analysis. This estimation is expected to be biased by the alternative hypothesis: the similar or identical experimental setup of the data sets imposes a certain dependence, and significant pathways associated with very low p-values will strongly influence the results. In order to minimize this bias in the estimation of the pairwise covariance between two data sets, a threshold parameter is introduced and only pathways whose p-values lie within the range defined by this parameter are considered. This procedure leaves out significant pathways for which the null hypothesis is likely to be rejected for at least one of the data sets. Instead of directly estimating the sample covariance of the transformed p-values in this range (which would again be biased because of the range restriction), Pearson's correlation coefficient is used.
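The estimation step can be sketched as follows. Because the exact bounds of the p-value range are not given here, they are passed in as the hypothetical parameters `lower` and `upper`; a pathway is kept only if its p-value lies inside the range for both data sets. The deviates again use the assumed convention z = Φ⁻¹(1 − p), and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def estimate_correlation(p_a, p_b, lower, upper):
    """Pearson correlation of normal deviates over range-restricted pathways.

    p_a and p_b hold the p-values of the same pathways in two data sets.
    `lower` and `upper` are hypothetical range bounds of this sketch.
    """
    nd = NormalDist()
    pairs = [
        (nd.inv_cdf(1.0 - pa), nd.inv_cdf(1.0 - pb))
        for pa, pb in zip(p_a, p_b)
        if lower <= pa <= upper and lower <= pb <= upper
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pathways inside the range")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)
```

Since standard normal deviates have unit variance under the null hypothesis, the resulting coefficient can be used directly as the off-diagonal covariance entry for the two data sets.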

Results

The Pathway Enrichment Analysis, the transformation of pathway-specific p-values into standard normal deviates, the estimation of covariances for dependent data sets, and the meta-analysis based on the previous results were applied and evaluated on the four Metabolomics/Transcriptomics data sets (see previous section). First, in order to check the distribution of transformed p-values, the histograms of the standard normal deviates were inspected. Because of significant pathways which are highly relevant in this context, the p-values are not expected to be fully uniformly distributed, which may result in a distribution of transformed p-values that deviates from the standard normal distribution. In this case, the p-values/normal deviates should be corrected for significance analysis. Second, the performance of the introduced method in reconstructing simulated data set correlations was evaluated for different values of the threshold parameter. This performance was not clear in advance, since the proposed correlation estimation includes several complex steps, such as the mapping of a proportion of feature ranks to pathways of different size, the calculation and restriction of p-values, and the transformation into normal deviates. Additionally, the parameter might have a strong influence on the results. Therefore, another objective of the simulation studies was the identification of an appropriate parameter value for the real data sets. Third, the correlation estimation and meta-analysis were applied to all four real data sets. All data sets, containing the annotation information from the pathway mapping, and the results from Pathway Enrichment Analysis are available as comma-separated-values files (see Dataset S1 and Table S1). The source code of functions for the meta-analysis of p-values can be found in File S1.

Distribution of standard normal deviates

Figure 1 shows the histograms of the transformed p-values (standard normal deviates) from Pathway Enrichment Analysis for the two Metabolomics and two Transcriptomics data sets within the restricted p-value range. The histograms for the KS and the rank-sum test are similar and confirm the normal-like distribution of deviates. In both cases, however, the sample standard deviation is higher than the unit standard deviation used for the transformation. Additionally, the sample mean for the combined Transcriptomics data sets is smaller than zero. This difference may be caused by pathways which are directly or indirectly influenced by the experimental setup. Although the highly significant pathways with p-values below the threshold were left out, many other pathways are expected to be indirectly affected by the wounding process. Another explanation would be the dependence of feature ranks used for p-value calculation, e.g. introduced by the dependence of different microarray spots representing the same gene or by gene-gene correlations [17]. In order to eliminate the observed bias, the p-values were restandardized [37] for significance analysis: the observed normal deviates were standardized per data set by means of their sample mean and sample standard deviation and then retransformed into corrected p-values. This is a conservative correction because the observed bias also includes the pathways which are directly influenced by the wounding process.
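The restandardization can be sketched in a few lines. This is an illustration under assumptions: the deviates use the convention z = Φ⁻¹(1 − p), and the sample mean and standard deviation are computed over all p-values passed in, so any range restriction has to be applied by the caller. The function name `restandardize` is illustrative.

```python
from statistics import NormalDist, mean, stdev


def restandardize(p_values):
    """Return corrected p-values after standardizing the normal deviates.

    Sketch only: deviates are z = inv_cdf(1 - p); they are centred and
    scaled with the sample mean and sample standard deviation of the input.
    """
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in p_values]
    m, s = mean(z), stdev(z)
    # Transform the standardized deviates back into p-values.
    return [1.0 - nd.cdf((zi - m) / s) for zi in z]
```

After the correction, the deviates of the returned p-values have a sample mean of zero and unit sample standard deviation, while the ordering of the pathways is unchanged.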
Figure 1

Histograms of standard normal deviates for the Metabolomics and Transcriptomics data sets.

For the p-value calculation, the Kolmogorov-Smirnov (KS) and rank-sum tests were utilized. Only p-values within the restricted range were considered. The red graph represents the expected density assuming the standard normal distribution. The green graph shows the expected density assuming a normal distribution with the sample mean and standard deviation as parameters. The histograms for both tests are similar and confirm the normal-like distribution of deviates. In both cases, however, the sample standard deviation is higher than the unit standard deviation used for the transformation. Additionally, the sample mean for the combined Transcriptomics data sets is smaller than zero.

Estimation of data set correlation

In simulated studies (see TechnicalDescription S1 for details), the correlation estimation was evaluated by calculating the pairwise Pearson correlation coefficients between each of the four data sets and a copy of the respective data set with different percentages of feature ranks randomly permuted. For each original and permuted data set, the p-values were calculated for all pathways using the KS or rank-sum test. The correlation coefficient between each original and permuted data set was computed based on the respective standard normal deviates (not restandardized) and the restriction of p-values utilizing different values of the threshold parameter. As a measure of the introduced artificial correlation, the correlation coefficient between the feature ranks of each data set and the permuted ranks (feature rank correlation) was calculated and averaged. The whole procedure was repeated for negative correlation by randomly permuting a percentage of the inverted original feature ranks per data set. Table S2 shows the average results over all data sets in detail. Figures 2 and 3 summarize the differences between the reconstructed correlation coefficients from pathway enrichment and the introduced positive or negative feature rank correlation. In comparison to the average feature rank correlation coefficients (x-axis), the absolute correlation is overestimated for low parameter values and underestimated for high parameter values. A value of 0.01 results in the best reconstruction of data set correlation: the absolute difference between the correlation coefficients from pathway enrichment and the feature rank correlation is close to zero for both tests. In case of the observed overestimation for low parameter values, the relevant pathways, which are associated with many top-ranking features, are assigned a low p-value, even when randomly permuting some of the features, and have a high influence on the correlation estimation. In case of the underestimation for high parameter values, the introduced correlation over all features and pathways cannot be fully recovered when restricting the range of p-values and number of utilized pathways too much.
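One possible reading of the simulation setup is sketched below: a chosen fraction of feature ranks is shuffled among the selected positions, and the Pearson correlation between original and permuted ranks serves as the introduced feature rank correlation. The details of the authors' procedure are given in TechnicalDescription S1 and may differ; the function names and the `rng` argument are illustrative.

```python
import random
from math import sqrt


def permute_fraction(ranks, fraction, rng):
    """Copy `ranks` and shuffle a random `fraction` of its positions."""
    ranks = list(ranks)
    k = round(fraction * len(ranks))
    positions = rng.sample(range(len(ranks)), k)
    values = [ranks[i] for i in positions]
    rng.shuffle(values)
    for i, v in zip(positions, values):
        ranks[i] = v
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Usage: positive and negative artificial correlation for 2000 features.
rng = random.Random(0)
original = list(range(1, 2001))
inverted = [len(original) + 1 - r for r in original]
positive = pearson(original, permute_fraction(original, 0.3, rng))
negative = pearson(original, permute_fraction(inverted, 0.3, rng))
```

Permuting a larger fraction moves the feature rank correlation towards zero, and starting from the inverted ranks yields the negative counterpart.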
Figure 2

Differences between the reconstructed correlation coefficients from pathway enrichment and the introduced positive feature correlation.

The differences were calculated for different values of the threshold parameter and for the Kolmogorov-Smirnov (KS) and rank-sum tests. The best reconstruction, corresponding to differences near zero, can be observed for a parameter value of 0.01.

Figure 3

Differences between the reconstructed correlation coefficients from pathway enrichment and the introduced negative feature correlation.

The differences were calculated for different values of the threshold parameter and for the Kolmogorov-Smirnov (KS) and rank-sum tests. The best reconstruction, corresponding to differences near zero, can be observed for a parameter value of 0.01. The KS test is not able to fully reconstruct strong negative feature correlations.

For the KS test and strongly negative feature rank correlations, the estimated coefficients from enrichment are considerably larger, e.g. showing a difference between 0.2 and 0.4 in case of a feature rank correlation of −1 (see Figure 3). This can be explained by the non-symmetric properties of the one-sided KS test. A set enriched in both high-ranking and low-ranking features would receive a low p-value when performing the one-sided KS test on the original as well as the inverted ranks. The rank-sum test, on the contrary, would result in an average p-value in both cases because the sum of ranks in the set is near the expected value. For a parameter value of 0.01 and negative correlation, the KS test is still able to reconstruct feature rank correlation coefficients between 0 and −0.3 with a difference near zero. For the correlation estimation between the two dependent Metabolomics data sets, a parameter value of 0.01, which showed the best reconstruction in the simulations, was utilized. The estimation resulted in relatively small coefficients, 0.12 (KS test) and 0.08 (rank-sum test).

Meta-analysis of pathway enrichment

Tables 2 and 3 show the results from meta-analysis of pathway enrichment utilizing Brown's and Stouffer's extended method integrating the correlation estimation for the Metabolomics data sets. The pathways are sorted according to the False Discovery Rate (FDR) [38] calculated based on the meta-p-values. Pathways with more than 500 associated entries were left out in this analysis for better interpretability. For both methods, the top-ranked pathways are the “alpha-Linolenic acid metabolism” (KEGG, 214 feature hits), the “jasmonic acid biosynthesis” (AraCyc, 176 feature hits), and the “glycolipid desaturation” (AraCyc, 325 feature hits). These pathways specifically describe parts of the biosynthesis of the well-known wound hormone jasmonate [39]. The first two pathways cover all biosynthetic steps from the fatty acid alpha-linolenic acid to jasmonic acid. The first committed step is catalyzed by the allene oxide synthase (AOS), whose gene is mutated in the dde 2-2 mutant plants [33]. The glycolipid desaturation pathway describes the formation of the alpha-linolenic acid via sequential steps of glycolipid-linked desaturation. The FDRs for these key pathways are much lower compared to the following pathways. Tables 4, 5, 6, 7, 8, 9, 10, and 11 show the results from enrichment analysis of the four single data sets and selected mappings of top-ranked features which were assigned to entries in the three key pathways, respectively. The enrichment analysis of the M1 data set (negative ionization mode, see Table 4) provides a major contribution to the results from meta-analysis. The first two pathways are also top-ranked but associated with much higher FDRs. The high-ranked features associated with jasmonic acid and its precursor metabolites, such as OPDA and OPC-8:0, are mainly responsible for this ranking (see Table 5). However, the mapping of putative monoisotopic feature masses to metabolites is error-prone and ambiguous. 
For example, OPDA, EOTrE, and a couple of other metabolites provided by KEGG and AraCyc share the same sum formula and single ion features cannot be unambiguously assigned without further information. In contrast to the alpha-linolenic acid metabolism pathway (KEGG), the very similar jasmonic acid biosynthesis pathway (AraCyc) is associated with a much higher FDR. This can be explained by a number of additional entries found only in the AraCyc version of the pathway and representing general substrates, such as acetyl-CoA, intermediate products which could not be measured with a high signal-to-noise ratio, such as OPC6-3-hydroxyacyl-CoA, or other side products. The glycolipid desaturation pathway, which can be found at position seven, is associated with a very high FDR. Most of the glycolipid species show higher intensities and signal-to-noise ratios in positive compared to negative ionization mode, which results in a very low FDR in pathway enrichment analysis of the M2 data set (see Tables 6 and 7). In contrast, jasmonate and many direct precursor metabolites cannot be measured in positive ionization mode with sufficient intensity, which explains the less prominent ranking of the alpha-linolenic acid metabolism (rank 12) and jasmonic acid biosynthesis (rank 13). Nonetheless, metabolites such as OPDA can be measured in both ionization modes with high signal-to-noise ratio and these findings confirm the corresponding pathways in meta-analysis. Integrating the Transcriptomics data sets T1 and T2 results in a much more comprehensive data interpretation (see Tables 9 and 11). Figure 4 exemplarily shows the pathway map of the alpha-linolenic acid metabolism with marked entries matched by high-ranking features from all data sets. In this combination, the ambiguous mapping of the MS data is supported by unambiguously matching transcripts. 
Almost all of the transcripts corresponding to enzymes in the alpha-linolenic acid metabolism can be found in the T1 and T2 data sets with relatively high signal-to-noise ratios. This results in much lower FDRs for the jasmonate-specific pathways in meta-analysis compared to the results from single Metabolomics data set analysis. Also in the analysis of the single Transcriptomics data sets (see Tables 8 and 10), these two pathways are associated with relatively high FDRs. In case of the T1 data set, both pathways can be found at less prominent positions (rank 14 and 23, see Table 8). For both Transcriptomics data sets, the glycolipid desaturation is ranked in the middle of all pathways (rank 420 and 161). Only a small number of transcripts associated with fatty acid desaturase show a high signal-to-noise ratio (see Tables 9 and 11).
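The text cites [38] for the False Discovery Rate without naming the procedure. Assuming that it refers to the widely used Benjamini-Hochberg step-up adjustment, the FDR values for a list of (meta-)p-values could be computed as in the following sketch; the function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption of this sketch: the FDR in the text is the BH adjustment.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The monotonicity step explains why several neighbouring pathways in a ranked list can share exactly the same adjusted value.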
Table 2

Results from meta-analysis of pathway enrichment (Brown's method).

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | KEGG | alpha-Linolenic acid metabolism | 214 | 0.0001321 | 2.383e-05 |
| 2 | AraCyc | jasmonic acid biosynthesis | 176 | 0.003676 | 0.0003101 |
| 3 | AraCyc | glycolipid desaturation | 325 | 0.0007046 | 0.0102 |
| 4 | KEGG | Linoleic acid metabolism | 147 | 0.5252 | 0.3968 |
| 5 | AraCyc | superpathway of phenylalanine, tyrosine and tryptophan biosynthesis | 86 | 0.5364 | 0.5253 |
| 6 | AraCyc | traumatin and (Z)-3-hexen-1-yl acetate biosynthesis | 131 | 0.4538 | 0.5557 |
| 7 | KEGG | 2-Oxocarboxylic acid metabolism | 578 | 0.5252 | 0.767 |
| 8 | AraCyc | glucosinolate biosynthesis from dihomomethionine | 153 | 0.5364 | 0.7857 |
| 9 | KEGG | Starch and sucrose metabolism | 335 | 0.5252 | 0.7857 |
| 10 | KEGG | Proteasome | 114 | 0.5252 | 0.7857 |

The table contains the high-ranking pathways from meta-analysis of pathway enrichment (Brown's method) based on the Kolmogorov-Smirnov (KS) and rank-sum test utilizing all data sets. The p-values per data set were restandardized. The pathways are sorted according to the meta-p-values derived from the rank-sum test. The second column (DB) contains the name of the source database, the fourth column (Hits) the number of feature assignments. The last two columns comprise the false discovery rates calculated from the meta-p-values.

Table 3

Results from meta-analysis of pathway enrichment (Stouffer's extended method).

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | KEGG | alpha-Linolenic acid metabolism | 214 | 6.708e-05 | 1.127e-05 |
| 2 | AraCyc | jasmonic acid biosynthesis | 176 | 0.001774 | 7.328e-05 |
| 3 | AraCyc | glycolipid desaturation | 325 | 0.02043 | 0.2122 |
| 4 | AraCyc | traumatin and (Z)-3-hexen-1-yl acetate biosynthesis | 131 | 0.4545 | 0.4326 |
| 5 | AraCyc | superpathway of phenylalanine, tyrosine and tryptophan biosynthesis | 86 | 0.4545 | 0.4326 |
| 6 | KEGG | Linoleic acid metabolism | 147 | 0.7866 | 0.499 |
| 7 | AraCyc | glucosinolate biosynthesis from dihomomethionine | 153 | 0.4545 | 0.6282 |
| 8 | AraCyc | glucosinolate biosynthesis from tryptophan | 132 | 0.4545 | 0.6583 |
| 9 | AraCyc | glucosinolate biosynthesis from phenylalanine | 115 | 0.6522 | 0.6583 |
| 10 | AraCyc | glucosinolate biosynthesis from tetrahomomethionine | 114 | 0.4545 | 0.6583 |

The table contains the high-ranking pathways from meta-analysis of pathway enrichment (Stouffer's extended method) based on the Kolmogorov-Smirnov (KS) and rank-sum test utilizing all data sets. The last two columns comprise the false discovery rates calculated from the meta-p-values.

Table 4

Results from pathway enrichment analysis of data set M1.

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | KEGG | alpha-Linolenic acid metabolism | 65 | 0.02084 | 0.03717 |
| 2 | AraCyc | jasmonic acid biosynthesis | 68 | 0.1531 | 0.1503 |
| 3 | KEGG | Linoleic acid metabolism | 43 | 0.4598 | 0.8524 |
| 4 | AraCyc | indole-3-acetyl-amino acid biosynthesis | 29 | 0.4598 | 0.8524 |
| 5 | AraCyc | traumatin and (Z)-3-hexen-1-yl acetate biosynthesis | 38 | 0.4598 | 0.8524 |
| 6 | AraCyc | galactosylcyclitol biosynthesis | 14 | 0.4598 | 0.8524 |
| 7 | AraCyc | glycolipid desaturation | 144 | 0.4598 | 0.8524 |
| 8 | KEGG | Porphyrin and chlorophyll metabolism | 222 | 0.8841 | 0.8524 |
| 9 | AraCyc | poly-hydroxy fatty acids biosynthesis | 59 | 0.9248 | 0.8524 |
| 10 | KEGG | Lysine degradation | 46 | 0.4598 | 0.8524 |

The table contains the high-ranking pathways from enrichment analysis of data set M1 based on the Kolmogorov-Smirnov (KS) and rank-sum test. The pathways are sorted according to the restandardized p-values derived from the rank-sum test. The last two columns comprise the false discovery rates calculated from the restandardized p-values.

Table 5

Selected feature mappings from data set M1.

| Rank | rt | m/z | Mappings |
|---|---|---|---|
| 1 | 0.73 | 255.1218 | Jasmonic acid |
| 3 | 0.73 | 209.1168 | Jasmonic acid |
| 7 | 0.73 | 256.1264 | Jasmonic acid |
| 8 | 2.08 | 337.1999 | OPDA, EOTrE |
| 11 | 2.08 | 338.2044 | OPDA, EOTrE |
| 321 | 5.66 | 986.6145 | 18:3/18:1-DGD, 18:2/18:2-DGD |
| 324 | 5.78 | 822.5428 | 18:2/16:0-MGD, 18:1/16:1-MGD |
| 410 | 5.53 | 820.5295 | 18:3/16:0-MGD, 18:2/16:1-MGD, 18:1/16:2-MGD |
| 447 | 2.33 | 339.2155 | OPC-8:0 |
| 540 | 5.64 | 960.5985 | 18:3/16:0-DGD |
| 542 | 6.02 | 823.5541 | 18:1/16:0-MGD, 18:0/16:1-MGD |
| 554 | 5.67 | 858.5064 | 18:3/18:3-MGD |
| 563 | 5.74 | 795.5232 | 18:3/18:1-MGD, 18:2/18:2-MGD |
| 650 | 6.18 | 939.5986 | 18:2/18:3-DGD |
| 846 | 5.89 | 962.613 | 18:2/16:0-DGD |
| 879 | 6.23 | 859.5155 | 18:2/18:3-MGD |
| 899 | 1.86 | 309.2055 | HpOTrE |
| 1445 | 6.17 | 964.6258 | 18:1/16:0-DGD |
| 1727 | 7.53 | 278.2245 | Linolenic acid |
| 2142 | 0.52 | 239.0895 | 9-Oxononanoic acid |

The table shows selected mappings of features from data set M1 (24796 features) to entries in the first three pathways in tables 2 and 3. The first column contains the feature rank. The second and third column show the corresponding retention times and mass-to-charge ratios. Multiple mappings correspond to different ionization rules or isotopologues.

Table 6

Results from pathway enrichment analysis of data set M2.

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | AraCyc | glycolipid desaturation | 167 | 0.0009173 | 0.002862 |
| 2 | AraCyc | antheraxanthin and violaxanthin biosynthesis | 63 | 0.1033 | 0.2477 |
| 3 | KEGG | Carotenoid biosynthesis | 389 | 0.3608 | 0.8365 |
| 4 | AraCyc | zeaxanthin biosynthesis | 29 | 0.5251 | 0.8365 |
| 5 | AraCyc | lutein biosynthesis | 34 | 0.5251 | 0.8365 |
| 6 | AraCyc | capsanthin and capsorubin biosynthesis | 38 | 0.5251 | 0.8365 |
| 7 | AraCyc | brassinosteroids inactivation | 20 | 0.5251 | 0.8365 |
| 8 | KEGG | Porphyrin and chlorophyll metabolism | 236 | 0.8693 | 0.8365 |
| 12 | KEGG | alpha-Linolenic acid metabolism | 89 | 0.8693 | 0.8365 |
| 13 | AraCyc | jasmonic acid biosynthesis | 54 | 0.8693 | 0.8365 |

The table contains the high-ranking pathways from enrichment analysis of data set M2 based on the Kolmogorov-Smirnov (KS) and rank-sum test. The last two columns comprise the false discovery rates calculated from the restandardized p-values.

Table 7

Selected feature mappings from data set M2.

| Rank | rt | m/z | Mappings |
|---|---|---|---|
| 2 | 2.08 | 310.2377 | OPDA, EOTrE |
| 8 | 2.08 | 293.2117 | OPDA, EOTrE |
| 11 | 2.08 | 311.2422 | OPDA, EOTrE |
| 48 | 2.08 | 315.1932 | OPDA, EOTrE |
| 180 | 6.17 | 942.6175 | 18:1/16:0-DGD |
| 211 | 6.17 | 941.6124 | 18:2/18:3-DGD |
| 231 | 6.22 | 772.5912 | 18:2/16:0-MGD, 18:1/16:1-MGD |
| 248 | 5.51 | 776.5365 | 18:3/16:0-MGD, 18:2/16:1-MGD, 18:1/16:2-MGD |
| 295 | 6.00 | 960.6576 | 18:3/18:1-DGD, 18:2/18:2-DGD |
| 297 | 4.69 | 772.5034 | 18:3/16:2-MGD |
| 310 | 5.07 | 937.5843 | 18:3/18:3-DGD |
| 330 | 5.87 | 935.6452 | 18:2/16:0-DGD |
| 413 | 6.15 | 915.5996 | 16:0/18:1-DGD |
| 459 | 6.45 | 774.6054 | 18:1/16:0-MGD, 18:0/16:1-MGD |
| 507 | 5.72 | 768.56 | 18:3/16:1-MGD, 18:2/16:2-MGD, 18:1/16:3-MGD |
| 615 | 5.18 | 748.5052 | 18:3/16:3-MGD |
| 699 | 1.41 | 441.3184 | Volicitin |

The table contains selected feature mappings from data set M2 (23325 features) to the first three pathways in Tables 2 and 3. Multiple mappings correspond to different ionization rules or isotopologues.

Table 8

Results from pathway enrichment analysis of data set T1.

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | KEGG | Glycolysis/Gluconeogenesis | 108 | 0.3527 | 0.2952 |
| 2 | KEGG | Proteasome | 57 | 0.2885 | 0.2952 |
| 3 | KEGG | Protein processing in endoplasmic reticulum | 176 | 0.2885 | 0.2952 |
| 4 | KEGG | Ribosome | 220 | 0.02489 | 0.2952 |
| 5 | KEGG | Oxidative phosphorylation | 118 | 0.2885 | 0.2952 |
| 6 | KEGG | Phenylalanine, tyrosine and tryptophan biosynthesis | 54 | 0.492 | 0.2952 |
| 7 | AraCyc | superpathway of phenylalanine, tyrosine and tryptophan biosynthesis | 43 | 0.4966 | 0.3302 |
| 14 | AraCyc | jasmonic acid biosynthesis | 27 | 0.8201 | 0.35 |
| 23 | KEGG | alpha-Linolenic acid metabolism | 30 | 0.6202 | 0.4782 |
| 420 | AraCyc | glycolipid desaturation | 7 | 0.976 | 0.9758 |

The table contains the high-ranking pathways from enrichment analysis of data set T1 based on the Kolmogorov-Smirnov (KS) and rank-sum test. The last two columns comprise the false discovery rates calculated from the restandardized p-values.

Table 9

Selected feature mappings from data set T1.

| Rank | ID | Mappings |
|---|---|---|
| 6 | AT2G06050 | 12-oxophytodienoate reductase 3 |
| 12 | AT3G11170 | fatty acid desaturase 7 |
| 16 | AT5G42650 | allene oxide synthase |
| 18 | AT1G17420 | lipoxygenase 3 |
| 82 | AT2G06050 | 12-oxophytodienoate reductase 3 |
| 120 | AT4G15440 | hydroperoxide lyase 1 |
| 226 | AT5G48880 | peroxisomal 3-keto-acyl-CoA thiolase 5 |
| 241 | AT2G44810 | phospholipase A1 |
| 316 | AT1G20510 | OPC-8:0 CoA ligase 1 |
| 436 | AT1G76680 | 12-oxophytodienoate reductase 1 |
| 638 | AT1G72520 | lipoxygenase 4 |
| 737 | AT4G16760 | peroxisomal acyl-coenzyme A oxidase 1 |
| 744 | AT1G17420 | lipoxygenase 3 |
| 1037 | AT3G45140 | lipoxygenase 2 |
| 1487 | AT1G13280 | allene oxide cyclase 4 |
| 2116 | AT2G06925 | phospholipase A2-ALPHA |
| 2788 | AT2G31360 | delta 9 acyl-lipid desaturase 2 |
| 3146 | AT3G15290 | 3-hydroxyacyl-CoA dehydrogenase |
| 4263 | AT5G04040 | triacylglycerol lipase SDP1 |
| 4276 | AT1G76150 | enoyl-CoA hydratase 2 |

The table contains selected feature mappings from data set T1 (25392 features) to the first three pathways in Tables 2 and 3. Multiple mappings correspond to different spots on the microarray.

Table 10

Results from pathway enrichment analysis of data set T2.

| Rank | DB | Pathway | Hits | KS | Rank-sum |
|---|---|---|---|---|---|
| 1 | KEGG | alpha-Linolenic acid metabolism | 30 | 0.5885 | 0.0794 |
| 2 | KEGG | Starch and sucrose metabolism | 142 | 0.6748 | 0.3277 |
| 3 | AraCyc | jasmonic acid biosynthesis | 27 | 0.7192 | 0.3277 |
| 4 | KEGG | Linoleic acid metabolism | 11 | 0.7192 | 0.7457 |
| 5 | AraCyc | glucosinolate biosynthesis from phenylalanine | 16 | 0.7192 | 0.7457 |
| 6 | AraCyc | glucosinolate biosynthesis from dihomomethionine | 19 | 0.7192 | 0.7457 |
| 7 | KEGG | Valine, leucine and isoleucine biosynthesis | 19 | 0.7192 | 0.7457 |
| 8 | AraCyc | glucosinolate biosynthesis from tryptophan | 21 | 0.7192 | 0.7457 |
| 9 | AraCyc | starch degradation I | 37 | 0.7192 | 0.7457 |
| 161 | AraCyc | glycolipid desaturation | 7 | 0.8851 | 0.9666 |

The table contains the high-ranking pathways from enrichment analysis of data set T2 based on the Kolmogorov-Smirnov (KS) and rank-sum test. The last two columns comprise the false discovery rates calculated from the restandardized p-values.

Table 11

Selected feature mappings from data set T2.

| Rank | ID | Mappings |
|---|---|---|
| 25 | AT5G42650 | allene oxide synthase |
| 104 | AT2G06050 | 12-oxophytodienoate reductase 3 |
| 355 | AT1G76680 | 12-oxophytodienoate reductase 1 |
| 376 | AT5G48880 | peroxisomal 3-keto-acyl-CoA thiolase 5 |
| 426 | AT1G17420 | lipoxygenase 3 |
| 484 | AT3G15870 | oxidoreductase |
| 631 | AT1G19640 | jasmonic acid carboxyl methyltransferase |
| 1019 | AT3G11170 | fatty acid desaturase 7 |
| 1263 | AT5G04040 | triacylglycerol lipase SDP1 |
| 1354 | AT4G16760 | peroxisomal acyl-coenzyme A oxidase 1 |
| 1371 | AT1G17420 | lipoxygenase 3 |
| 1544 | AT3G45140 | lipoxygenase 2 |
| 1812 | AT3G15850 | fatty acid desaturase 5 |
| 1940 | AT2G06925 | phospholipase A2-ALPHA |
| 2139 | AT4G30950 | fatty acid desaturase 6 |
| 2413 | AT2G06050 | 12-oxophytodienoate reductase 3 |
| 2653 | AT3G15290 | 3-hydroxyacyl-CoA dehydrogenase |
| 3022 | AT1G67560 | lipoxygenase 3 |
| 3297 | AT3G06860 | 3-hydroxyacyl-CoA dehydrogenase |
| 3383 | AT2G33150 | peroxisomal 3-keto-acyl-CoA thiolase 2 |

The table contains selected feature mappings from data set T2 (25392 features) to the first three pathways in Tables 2 and 3. Multiple mappings correspond to different spots on the microarray.

Figure 4

Pathway map of the alpha-linolenic acid metabolism (KEGG) with marked entries.

Entries mapped to features from all data sets are marked in gray, selected entries from Tables 5, 7, 9, and 11 are marked in red.

In case of both methods for meta-analysis, the pathways “Linoleic acid metabolism” and “traumatin and (Z)-3-hexen-1-yl acetate biosynthesis” appear among the top ten. These pathways are directly connected with the alpha-linolenic acid metabolism and are also affected by the AOS mutation [40]. However, it should be noted that the second pathway is only of limited relevance in this context because the genotype used, Columbia, is a natural mutant in the second enzymatic step of this pathway, the fatty acid hydroperoxide lyase reaction [41]. The 2-Oxocarboxylic acid metabolism (Brown's method) and several pathways in the ranking based on Stouffer's extended method describe glucosinolate biosynthesis, the major chemical defense reaction of Arabidopsis plants upon wounding, which is regulated by jasmonates [42]. These pathways, however, are associated with comparatively high FDRs.

Comparing the results based on the KS and the rank-sum test, no clear trend towards lower FDRs can be observed. In case of Brown's method, the glycolipid desaturation pathway is associated with a much lower FDR for both tests. In case of Stouffer's extended method, both jasmonate-specific pathways are scored with lower FDRs.
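The FDR values compared above are derived from p-values. The text here does not name the procedure used, so the following is only a minimal sketch that assumes the Benjamini-Hochberg step-up adjustment; the function name `benjamini_hochberg` and the input p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR estimates) in input order.

    Assumption: BH is used for illustration only; the source does not state
    which FDR procedure was applied.
    """
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that every adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pathway p-values, not taken from the tables.
pvals = [0.001, 0.04, 0.03, 0.5, 0.45]
fdr = benjamini_hochberg(pvals)
print(fdr)
```

Under this assumption, the capping step can give several consecutively ranked pathways the same adjusted value, which is one way identical FDRs for neighbouring ranks could arise.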

Discussion

The meta-analysis of pathway enrichment was evaluated and applied to two Metabolomics and two Transcriptomics data sets in the context of plant wounding. The meta-analysis based on Brown's and Stouffer's extended method is able to incorporate information from different independent and dependent omics data sets and to distinguish key pathways in the experimental context. The FDRs calculated from the meta-p-values are much lower than those from the single data set analyses. Especially for the pathway analysis of non-targeted Metabolomics studies, where the identification of metabolites is a bottleneck, the integration of data from other omics platforms, such as DNA microarrays, increases the value and reliability of results.

In this application, Brown's and Stouffer's extended method showed overall similar results. However, Brown's method seems to be more powerful for pathways that are associated with extreme p-values in only a proportion of the data sets. The glycolipid desaturation pathway, for example, is associated with very small p-values (KS and rank-sum test) for the M2 data set, relatively small p-values for the M1 data set, and much larger p-values for the T1 and T2 data sets (see Table S1). With Brown's method, this pathway is associated with smaller FDRs (0.0007 and 0.01) than with Stouffer's method (0.02 and 0.21). In contrast, Stouffer's method seems to be more powerful when a pathway is associated with comparatively small p-values for all data sets (see the alpha-linolenic acid metabolism and jasmonic acid biosynthesis pathways). The choice of method depends on the objective of the meta-analysis, e.g. a focus on pathways that show a consensus across all data sets, or the inclusion of pathways with significant p-values for only a single data set or a small number of data sets [26], [43]. In the context of heterogeneous omics data sets, which contain entities that cannot be measured in all experiments, e.g. metabolites that can be ionized in either positive or negative ionization mode, and pathways that may be associated with only a small number of entries for a particular omics platform, Brown's method (or Fisher's method in the case of independent p-values) seems to be the better choice.

In both meta-analyses, a couple of pathways related to the wounding process were detected with relatively large FDRs. In order to combine the Metabolomics and Transcriptomics data sets in a robust way, we utilized general rank-based tests and a conservative restandardization of p-values per data set. The introduced framework may also be combined with more powerful tests specialized for microarray data analysis [37]. The enrichment analysis of the single T1 and T2 data sets resulted in considerably different rankings. This is likely to be related to the different time points at which the wounded plants were harvested (one and three hours).

In the performed simulation studies, the introduced feature rank correlation could be fully reconstructed utilizing the correlation estimation from pathway enrichment. By restricting the range of p-values via the corresponding parameter, thereby leaving out significant pathways, the estimation bias could be reduced. The comparison of the two dependent Metabolomics data sets, which were obtained from the same biological samples analyzed in positive and negative ionization mode, resulted in relatively small positive correlation coefficients. This indicates that only a small proportion of metabolites could be detected in both ionization modes with comparable quality of intensity profiles and that data from both modes should be considered in a comprehensive analysis. In general, the statistical power of the meta-analysis increases with decreasing dependence of data sets. Therefore, nearly independent data sets are desirable. Comparing the one-sided KS and rank-sum test, both tests resulted in a similar distribution of normal deviates.
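The contrast between the two families of combination rules discussed here can be illustrated with a minimal Python sketch. It uses Fisher's method (the independent-case counterpart of Brown's method, as noted in the text) and Stouffer's Z-method. The optional correlation matrix `corr` applies the standard variance correction for correlated normal deviates; this correction, the function names, and all input p-values are assumptions for illustration and are not taken from the paper's technical description.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method."""
    k = len(pvalues)
    # half_stat = X / 2 with X = -2 * sum(ln p), chi-square with 2k d.o.f.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_combine(pvalues, corr=None):
    """Combine p-values with Stouffer's Z-method.

    corr: optional correlation matrix of the normal deviates (assumption:
    the variance of the sum of deviates is the sum of all matrix entries).
    """
    z = [_STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues]
    k = len(z)
    if corr is None:
        variance = float(k)  # independent deviates
    else:
        variance = sum(corr[i][j] for i in range(k) for j in range(k))
    return 1.0 - _STD_NORMAL.cdf(sum(z) / math.sqrt(variance))


# Hypothetical p-values for one pathway across four data sets.
one_extreme = [1e-4, 0.5, 0.5, 0.5]   # extreme in a single data set only
consistent = [0.05, 0.05, 0.05, 0.05]  # moderately small in all data sets
# Hypothetical pairwise correlation of 0.5 between all data sets.
corr = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]

print(fisher_combine(one_extreme), stouffer_combine(one_extreme))
print(fisher_combine(consistent), stouffer_combine(consistent))
print(stouffer_combine(consistent, corr))
```

With these made-up inputs, Fisher's method yields the smaller combined p-value when only one data set is extreme, Stouffer's method yields the smaller one when all data sets agree, and introducing positive correlation increases the combined p-value, in line with the remark that power increases with decreasing dependence.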
In the simulation studies, the one-sided KS test was not able to fully reconstruct strong negative feature correlations. In most applications, however, this type of data set correlation is not expected.

Matlab source code for functions used in meta-analysis. (GZ)

Data sets with database entry and pathway annotations. The archive file contains the data sets in comma-separated values format. The first column contains the feature IDs. The rt and Former m/z columns (M1 and M2 data sets) contain the retention times and mass-to-charge ratios from MS analysis. The raw intensities for each sample can be found in the following columns. The s/n column shows the feature-specific signal-to-noise ratios, and the last columns contain the KEGG and AraCyc entries and pathways mapped to the corresponding features, separated by slash characters. (ZIP)

Pathways with p-values and FDRs from Pathway Enrichment Analysis. The comma-separated values file contains the p-values, restandardized p-values, meta-p-values, and corresponding FDRs for single data set and meta-analysis. (CSV)

Supplementary tables for simulation studies. (PDF)

Technical description of methods. (PDF)