
PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways.

Nicolas Goffard, Tancred Frickey, Georg Weiller.

Abstract

The post-genomic era presents us with the challenge of linking the vast amount of raw data obtained with transcriptomic and proteomic techniques to relevant biological pathways. We present an update of PathExpress, a web-based tool to interpret gene-expression data and explore the metabolic network without being restricted to predefined pathways. We define the Enzyme Neighbourhood (EN) as a sub-network of linked enzymes with a limited path length to identify the most relevant sub-networks affected in gene-expression experiments. PathExpress is freely available at: http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/.

Year: 2009 PMID: 19474337 PMCID: PMC2703986 DOI: 10.1093/nar/gkp432

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

With the development of transcriptomic and proteomic techniques, post-genomic data represents a new challenge for researchers attempting to interpret the vast amount of raw data in a biological context (1). The analysis of microarray data is usually performed in two steps: the identification of genes that are differentially expressed under two or more conditions, using different statistical methods (2), and a comparison of selected genes with a background to find overlaps between the observed changes in expression and biologically relevant partitionings of the measured genes. Many ontological tools are now available that support the functional interpretation of gene-expression data via the identification of significantly enriched Gene Ontology (GO) categories (3) within groupings of genes of interest (4). Additionally, with the availability of pathway databases such as KEGG (5,6) and MetaCyc (7), numerous tools have been proposed that analyse microarray data and visually present associated metabolic or regulatory pathway information (8–16). However, the predefined metabolic pathways used in these methods represent an essentially arbitrary segmentation of the metabolism. In contrast, other methods integrate, a priori, the knowledge of gene networks in the analysis of gene-expression data. Ideker and co-workers presented a procedure for screening a molecular interaction network combined with a statistical measure to identify sub-networks that show significant changes in expression (17). This approach has been included in Cytoscape to identify functional modules, i.e. highly connected network regions with similar responses across multiple experimental conditions (18). Hanisch and co-workers proposed a co-clustering method based on a distance function that combines information from expression data and biological networks (19). A Potts spin algorithm was developed to cluster gene-expression data by using the nearest neighbour reactions of biochemical networks (20). 
Rapaport and co-workers extracted gene-expression patterns of neighbouring genes in the network, involving the attenuation of high-frequency signals with respect to the graph (21). Another approach identifies the smallest functional units based on the network topology using the Petri net theory (22). It has been shown by Schwartz and co-workers that elementary modes represent true functional units of metabolism and can be used to reveal transcriptional activity (23). However, the combinatorial explosion of computing elementary modes in large networks limits the practical use of these methods. We previously presented a web-based tool called PathExpress (10) that allowed us to interpret gene-expression results from microarrays in the context of biological pathways. PathExpress has been developed to identify the most relevant pathways or sub-pathways associated with a subset of genes of interest (e.g. a set of differentially expressed genes). It is based on a directed graph modelling enzymatic reactions derived from the publicly available KEGG LIGAND database (24,25). In the present article, we describe a new development in PathExpress—the enzyme neighbourhood (EN) method. We define the EN as a sub-network of linked enzymes with a limited path length. The EN method enables us to explore the metabolic network and identify the most relevant sub-networks affected in gene-expression experiments without being restricted to predefined pathways. While the interaction with the web server is essentially unchanged, PathExpress now incorporates the EN method and supports 28 Affymetrix 3′ Gene-expression Analysis Arrays, representing 32 distinct organisms, and is easy to extend further. In a case study, the EN method was tested with gene-expression data of the model legume Medicago truncatula by comparing the transcriptomes of meristematic and non-meristematic root cells (26).

METHODS

Data representation

PathExpress is based on a directed graph modelling enzymatic reactions as used in the Petri net representation of biological networks (27). Two types of nodes are used to represent compounds and reactions. Specific reactions can encompass one or more enzymes. Directed edges, connecting these nodes, correspond to the consumption or the production of compounds by the reaction. We first built the global metabolic network consisting of 2276 enzymes and 3810 compounds involved in 3663 reactions as specified in the KEGG LIGAND database (24,25). In order to avoid annotation errors due to the misinterpretation of partial Enzyme Commission (EC) numbers (28), we only utilized enzymes defined by a full EC term. This database has the advantage of providing a manually curated representation of enzymatic reactions involved in metabolic pathways where most secondary metabolites (very common and highly connected compounds such as water, oxygen, major coenzymes and prosthetic groups) have been removed, thus avoiding invalid metabolic connections and unspecified pathways. Many of the current methods for the functional interpretation of gene-expression data are constrained by their need to link expressed genes with predefined metabolic pathways and are therefore often hampered when the species to be analysed is not represented in the pathway database. To overcome this limitation, probe sets of the genome arrays supported in PathExpress are linked to the metabolic network using NetAffx annotations (29) or similarities with protein sequences of known EC numbers retrieved from the UniProt database (30). A complete metabolic graph representing all assignments is produced for each organism. This strategy can be applied to any set of sequences and makes it easy to extend PathExpress for use with novel species. In addition, EC numbers can be directly uploaded and compared to the reference network, which allows the analysis of custom data.
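The data model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PathExpress implementation: the class name `MetabolicGraph`, the helper `is_full_ec`, and the reaction and compound identifiers (`R1`, `C_a`, `C_b`) are hypothetical, and a "full EC term" is assumed to mean four numeric fields separated by dots.

```python
import re
from collections import defaultdict

# Assumption: a full EC number has four numeric fields, e.g. "5.3.1.12";
# partial terms such as "1.1.1.-" are rejected.
FULL_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def is_full_ec(ec):
    """Return True if `ec` looks like a complete four-field EC number."""
    return bool(FULL_EC.match(ec))


class MetabolicGraph:
    """Directed bipartite graph with compound nodes and reaction nodes."""

    def __init__(self):
        self.consumes = defaultdict(set)  # reaction -> compounds consumed
        self.produces = defaultdict(set)  # reaction -> compounds produced
        self.enzymes = defaultdict(set)   # reaction -> full EC numbers

    def add_reaction(self, reaction_id, substrates, products, ec_numbers):
        """Register one reaction; enzymes with partial EC numbers are dropped."""
        self.consumes[reaction_id].update(substrates)
        self.produces[reaction_id].update(products)
        self.enzymes[reaction_id].update(
            ec for ec in ec_numbers if is_full_ec(ec)
        )

    def edges(self):
        """Yield directed edges: compound -> reaction, then reaction -> compound."""
        for reaction, compounds in self.consumes.items():
            for compound in sorted(compounds):
                yield (compound, reaction)
        for reaction, compounds in self.produces.items():
            for compound in sorted(compounds):
                yield (reaction, compound)


# Hypothetical toy reaction: identifiers are placeholders, not KEGG entries.
graph = MetabolicGraph()
graph.add_reaction("R1", ["C_a"], ["C_b"], ["5.3.1.12", "1.1.1.-"])
```

Keeping consumption and production in separate maps preserves edge direction, while a traversal that ignores direction can simply take the union of both.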

Enzyme neighbourhood

In the global network, two reactions are regarded as neighbours if a metabolite exists that is the product of one reaction and the substrate of the other. We define the EN of depth d for an enzyme e as the set of enzymes that can be reached in the graph from e by traversing a maximum of d compounds, regardless of the direction of the edges (Figure 1). The EN of depth 1 for a given enzyme thus corresponds to the set of enzymes directly connected via a compound (i.e. its immediate neighbours). The EN of depth 2 includes the enzymes in the EN of depth 1 plus the enzymes linked to these. As different paths can connect two enzymes, the shortest distance between two enzymes is used to define the EN. These ENs correspond to different sub-networks of the global metabolic network. By comparing a specific list of genes to the ENs, it is possible to identify those ENs that are significantly over-represented in the gene list.
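One way to compute such a neighbourhood is a breadth-first search in which each step crosses one shared compound. The sketch below assumes a simplified input, a mapping from each enzyme to the set of compounds it consumes or produces (direction already discarded), and it includes the seed itself at distance 0. The function name `enzyme_neighbourhood` and the toy identifiers are hypothetical.

```python
from collections import defaultdict, deque


def enzyme_neighbourhood(enzyme_compounds, seed, depth):
    """Return {enzyme: distance} for all enzymes within `depth` compounds of `seed`.

    `enzyme_compounds` maps each enzyme to the compounds it touches
    (substrates and products alike, since edge direction is ignored).
    Breadth-first search guarantees that each recorded distance is the
    shortest one, i.e. the minimal number of compounds traversed.
    """
    # Invert the mapping: compound -> enzymes that touch it.
    compound_enzymes = defaultdict(set)
    for enzyme, compounds in enzyme_compounds.items():
        for compound in compounds:
            compound_enzymes[compound].add(enzyme)

    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        enzyme = queue.popleft()
        if distance[enzyme] == depth:
            continue  # do not expand beyond the requested depth
        for compound in enzyme_compounds.get(enzyme, ()):
            for neighbour in compound_enzymes[compound]:
                if neighbour not in distance:
                    distance[neighbour] = distance[enzyme] + 1
                    queue.append(neighbour)
    return distance


# Hypothetical linear chain: E1 -A- E2 -B- E3 -C- E4
toy = {
    "E1": {"A"},
    "E2": {"A", "B"},
    "E3": {"B", "C"},
    "E4": {"C"},
}
en_depth_2 = enzyme_neighbourhood(toy, "E1", 2)
```

On the toy chain, depth 1 from `E1` reaches only `E2`, and depth 2 adds `E3`, which mirrors the nesting of neighbourhoods described above.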

Figure 1.

EN of depth 4, identified from a list of differentially expressed genes in Medicago truncatula. Compounds (labelled with their KEGG identifier and represented as ellipses) and reactions (labelled with the EC number of the enzymes that mediate them and represented as boxes) are the nodes of the directed graph. The enzyme coloured in black was used to seed this EN (entry point). Greyed reactions show that at least one enzyme thought to be capable of catalyzing the corresponding reaction was present in the submitted list of genes. The label of edges indicates the level of EN depth, i.e. the minimal number of compounds traversed in the global network from the seed enzyme to this point, regardless of the direction of the edges.

To identify the most relevant sub-network associated with a list of submitted enzymes, the EN of each seed (submitted EC number), for a given depth, is determined in the global network and the EC numbers contained in the resulting EN are compared to the submitted list. For each test, a P-value, representing the probability that the observed overlap between the submitted list and the enzymes belonging to the given EN occurs by chance in the population of enzymes involved in the entire network, is calculated using the hypergeometric distribution (31). Because multiple tests are performed, it is necessary to correct these P-values with adjustment methods such as the conservative Bonferroni correction (32) or the False Discovery Rate approach (33).

The size of the EN depends on its depth d, which has to be specified as a parameter in the current implementation. To optimize this parameter with the size of the submitted list of genes, we have computed the average number of enzymes involved in each possible EN for a range of depths (Table 1). Based on these results, it is possible to adjust the depth parameter to compare groups of enzymes with sub-networks of similar size. For example, to compare a group of 10 enzymes, we recommend a depth parameter of 1 (i.e. direct neighbours), corresponding to an average size of 11.7 enzymes.
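The enrichment test and the two corrections mentioned above can be illustrated as follows. This is a minimal sketch under assumptions that the text does not spell out: the P-value is taken as the upper tail P(X >= k) of the hypergeometric distribution, and the False Discovery Rate adjustment is assumed to be the Benjamini-Hochberg step-up procedure. All function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, en_size, list_size, overlap):
    """Upper-tail probability P(X >= overlap).

    X counts how many of `list_size` enzymes drawn without replacement
    from `population` enzymes fall inside an EN of `en_size` enzymes.
    """
    total = comb(population, list_size)
    upper = min(en_size, list_size)
    tail = sum(
        comb(en_size, i) * comb(population - en_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

With one test per seed enzyme, the number of tests equals the number of submitted EC numbers, so the Bonferroni correction becomes stricter as the submitted list grows.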

Table 1.

Average size of the EN according to the depth parameter

Depth	Average no. of neighbours
1	11.7
2	14.5
3	21.9
4	34.0
5	51.0
6	74.2
7	105.5
8	145.1
9	193.8
10	253.5
20	995.0
30	1397.7
40	1622.1
50	1767.4
100	2106.8
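As an illustration of how Table 1 might guide the choice of depth, the sketch below picks the depth whose average EN size is closest to the number of submitted enzymes. The "closest average size" rule is an assumption made for this example; the text only recommends comparing groups of enzymes with sub-networks of similar size, and the application example uses a depth of 4 for 50 enzymatic functions, so it should be read as a starting point rather than the authors' procedure. Only depths 1 to 10 are included, and the name `suggest_depth` is hypothetical.

```python
# Average EN size per depth, copied from Table 1 (depths 1 to 10 only).
AVERAGE_EN_SIZE = {
    1: 11.7,
    2: 14.5,
    3: 21.9,
    4: 34.0,
    5: 51.0,
    6: 74.2,
    7: 105.5,
    8: 145.1,
    9: 193.8,
    10: 253.5,
}


def suggest_depth(n_enzymes, table=AVERAGE_EN_SIZE):
    """Return the depth whose average EN size is closest to `n_enzymes`.

    Ties are broken in favour of the smaller depth.
    """
    return min(table, key=lambda d: (abs(table[d] - n_enzymes), d))


depth_for_ten = suggest_depth(10)  # matches the depth-1 recommendation in the text
```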


THE PATHEXPRESS WEB SERVER

As input data, PathExpress receives a list of identifiers (Affymetrix probe set identifiers and/or GenBank accession numbers). Other parameters can be specified: the type of comparison (pathway, sub-pathway or EN), the P-value significance threshold and the adjustment method used to correct for multiple testing. The PathExpress output contains the list of sub-networks (metabolic pathways, sub-pathways or ENs) that are associated with the enzymes in the submitted list of identifiers. The ones with significant association are highlighted. Each of these networks can be displayed, both via an automatically generated graphical representation and as an enumeration of enzymatic reactions.

APPLICATION EXAMPLE

As an example, we used PathExpress to analyse microarray data obtained from the model legume Medicago truncatula, comparing the gene expression of meristematic and non-meristematic root tissues (26). The data have been deposited in NCBI's Gene Expression Omnibus (34) and are accessible through GEO series accession number GSE8115. Following normalization, differentially expressed probe sets were identified by evaluating the log2 ratio between the two conditions. All probe sets that differed by more than 2-fold were considered to be differentially expressed. Of the 390 transcripts over-expressed in the non-meristematic tissue, 94 could be assigned to 50 distinct enzymatic functions, as defined by their EC number in the Affymetrix Medicago Genome Array. To contrast the whole pathway approach with the EN method, we used the ‘Entire Pathway’ option of PathExpress to identify over-representation of metabolic pathways in the non-meristematic root. The most significant result (P-value: 1.09e–03) was the carbon fixation pathway, which is defined by 22 enzymes, of which six are differentially expressed in the tissue. We also identified the most relevant sub-networks corresponding to the same group of over-expressed transcripts, using the EN option with a depth of 4. The resulting sub-networks were ranked by increasing P-values. The most significant EN (P-value: 4.06e–04) is given in Figure 1 and was seeded by the glucuronate isomerase (EC 5.3.1.12, black). Of the 13 enzymes present in the depicted sub-network, seven are involved in the pentose and glucuronate interconversion pathway as described in the KEGG database. The remaining six enzymes connected to this sub-network are part of different pathways involved in carbohydrate metabolism (galactose, inositol phosphate, ascorbate and aldarate) and would not have been considered by an approach restricted to the predefined metabolic pathways.
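The fold-change selection used in this example can be sketched as a simple filter on the log2 ratio. The code assumes normalized, strictly positive, linear-scale intensities for each condition (if the values are already log2-transformed, the ratio becomes a difference). The function name, the argument names and the probe identifiers are hypothetical and do not come from GSE8115.

```python
import math


def split_by_fold_change(condition_a, condition_b, fold=2.0):
    """Return (up_in_b, up_in_a): probe sets differing by more than `fold`.

    Both arguments map probe set identifiers to positive, linear-scale
    intensities. Probe sets missing from either condition are ignored.
    """
    threshold = math.log2(fold)
    up_in_b, up_in_a = [], []
    for probe in sorted(condition_a.keys() & condition_b.keys()):
        log_ratio = math.log2(condition_b[probe] / condition_a[probe])
        if log_ratio > threshold:
            up_in_b.append(probe)
        elif log_ratio < -threshold:
            up_in_a.append(probe)
    return up_in_b, up_in_a


# Hypothetical intensities for four probe sets.
meristematic = {"p1": 100.0, "p2": 100.0, "p3": 100.0, "p4": 100.0}
non_meristematic = {"p1": 250.0, "p2": 150.0, "p3": 40.0, "p4": 200.0}
up_non_meristematic, up_meristematic = split_by_fold_change(
    meristematic, non_meristematic
)
```

The comparison is strict, so a probe set that changes by exactly 2-fold (such as `p4`) is not selected, in line with "more than 2-fold".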

DISCUSSION

Our web-based tool for the interpretation of genomics data, first described in 2007 (10), has been extended to implement the concept of ENs. The EN of a given enzyme is defined as a connected sub-network within the global metabolic network, built from the KEGG database. The identification of statistically significantly over-represented ENs is based on the same statistical approach used for the identification of gene enrichment in GO terms or metabolic pathways. However, the clustering method differs, as it includes knowledge about the network of gene products without being restricted to predefined pathways. Recently, another tool called KEGG spider, presenting a similar approach of interpretation of genomics data in the context of the global gene metabolic network, has been published (35). Although both methods identify statistically significant sub-networks in a submitted list of genes, there are some fundamental differences. KEGG spider infers the network that minimizes the distance between each connected gene pair according to pair-wise distances between genes. It estimates the significance of the inferred network by a Monte Carlo procedure. On the other hand, PathExpress performs an enrichment analysis by comparing the EN of a given depth with the submitted genes, using the hypergeometric distribution and an adjustment method. While KEGG spider limits sub-networks by allowing a maximum of three consecutive missing enzymes, PathExpress can consider all sub-networks up to a depth of 10, corresponding to approximately 250 enzymes. KEGG spider uses the KEGG orthology database to map the genes to the metabolic network and is available only for nine reference organisms, whereas PathExpress uses pre-computed assignments of sequences to EC numbers, and can easily be extended from the currently supported 32 organisms to any organism or set of sequences (e.g. custom DNA microarray, proteome array), enabling the analysis of a wider range of gene-expression experiments. 
For example, it has recently been used to compare the proteomic data derived from seeds of plants within and beyond the legume family (36). Since its initial development, PathExpress has been extended to explore the Enzyme Neighbourhood for the identification of relevant sub-networks affected in gene-expression experiments. Many genome arrays have been added, making PathExpress a useful resource for the integration of transcriptomic, proteomic, and enzymatic or metabolic reaction datasets.

FUNDING

Australian Research Council Centre of Excellence Grant. Funding for open access charge: Australian Research Council Centre of Excellence Grant.

Conflict of interest statement. None declared.

34 in total

Review 1. Biological microarray interpretation: the rules of engagement.

Authors: Rainer Breitling
Journal: Biochim Biophys Acta Date: 2006-07-13

2. Integration of biological networks and gene expression data using Cytoscape.

Authors: Melissa S Cline; Michael Smoot; Ethan Cerami; Allan Kuchinsky; Nerius Landys; Chris Workman; Rowan Christmas; Iliana Avila-Campilo; Michael Creech; Benjamin Gross; Kristina Hanspers; Ruth Isserlin; Ryan Kelley; Sarah Killcoyne; Samad Lotia; Steven Maere; John Morris; Keiichiro Ono; Vuk Pavlovic; Alexander R Pico; Aditya Vailaya; Peng-Liang Wang; Annette Adler; Bruce R Conklin; Leroy Hood; Martin Kuiper; Chris Sander; Ilya Schmulevich; Benno Schwikowski; Guy J Warner; Trey Ideker; Gary D Bader
Journal: Nat Protoc Date: 2007 Impact factor: 13.491

3. Observing metabolic functions at the genome scale.

Authors: Jean-Marc Schwartz; Claire Gaugain; Jose C Nacher; Antoine de Daruvar; Minoru Kanehisa
Journal: Genome Biol Date: 2007 Impact factor: 13.583

4. MetaCyc: a multiorganism database of metabolic pathways and enzymes.

Authors: Ron Caspi; Hartmut Foerster; Carol A Fulcher; Rebecca Hopkinson; John Ingraham; Pallavi Kaipa; Markus Krummenacker; Suzanne Paley; John Pick; Seung Y Rhee; Christophe Tissier; Peifen Zhang; Peter D Karp
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

5. Classification of microarray data using gene networks.

Authors: Franck Rapaport; Andrei Zinovyev; Marie Dutreix; Emmanuel Barillot; Jean-Philippe Vert
Journal: BMC Bioinformatics Date: 2007-02-01 Impact factor: 3.169

6. Application of Petri net based analysis techniques to signal transduction pathways.

Authors: Andrea Sackmann; Monika Heiner; Ina Koch
Journal: BMC Bioinformatics Date: 2006-11-02 Impact factor: 3.169

7. The universal protein resource (UniProt).

Authors:
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971

8. Transcriptional profiling of Medicago truncatula meristematic root cells.

Authors: Peta Holmes; Nicolas Goffard; Georg F Weiller; Barry G Rolfe; Nijat Imin
Journal: BMC Plant Biol Date: 2008-02-27 Impact factor: 4.215

9. GenMAPP 2: new features and resources for pathway analysis.

Authors: Nathan Salomonis; Kristina Hanspers; Alexander C Zambon; Karen Vranizan; Steven C Lawlor; Kam D Dahlquist; Scott W Doniger; Josh Stuart; Bruce R Conklin; Alexander R Pico
Journal: BMC Bioinformatics Date: 2007-06-24 Impact factor: 3.169

10. PathExpress: a web-based tool to identify relevant pathways in gene expression data.

Authors: Nicolas Goffard; Georg Weiller
Journal: Nucleic Acids Res Date: 2007-06-22 Impact factor: 16.971
