
PathExpress: a web-based tool to identify relevant pathways in gene expression data.

Nicolas Goffard, Georg Weiller

Abstract

PathExpress is a web-based tool developed to interpret gene expression data obtained from microarray experiments by identifying the most relevant metabolic pathways associated with a subset of genes (e.g. differentially expressed genes). A graphical pathway representation permits the visualization of the expressed genes in a functional context. Based on the publicly accessible KEGG Ligand database, PathExpress can be adapted to any organism and is currently available for seven Affymetrix genome arrays. About 20% of the probe sets of each array have been assigned to Enzyme Commission numbers by homology relationship and linked to corresponding metabolic pathways. PathExpress is available at http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/.

Year:  2007        PMID: 17586825      PMCID: PMC1933187          DOI: 10.1093/nar/gkm261

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Microarrays enable us to investigate the expression of thousands of genes simultaneously, providing a comprehensive overview of the gene activities in a given tissue. The results of such experiments are usually presented as lists of (differentially expressed) genes. A number of statistical tests have been employed for assessing differential gene expression (1) and several ontological tools are available (2–6) to support the biological interpretation of these data. Most are based on the identification of significant associations of gene ontology terms (7) with groups of genes, but this does not directly reflect metabolic networks. With the availability of biological pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (8) or MetaCyc (9), several resources have been developed to visualize and analyse microarray data in the context of known biological networks (10–17).

However, some limitations remain: current pathway databases are restricted to organisms with well-annotated genomes, and most analysis tools only attempt to match expression data to entire pathways without considering sub-pathways. PathExpress overcomes some of these limitations and provides a user-friendly web-based tool to interpret gene expression results from microarray experiments in the context of biological pathways. Based on the publicly available KEGG Ligand database of chemical compounds and reactions in biological pathways (18,19), PathExpress can be extended to any organism, as it uses similarity between the probe set sequences of supported genome arrays and the sequences of genes with known Enzyme Commission (EC) numbers to link probe sets to the metabolic networks. To take into account how reactions are linked in a pathway, sub-pathways are defined as a chain of reactions linked to each other by a common compound (substrate or product) (Figure 1).

Two statistical approaches can be considered to perform a pathway analysis. The first compares a gene list to a pathway using a chi-squared test, a Fisher exact test or the hypergeometric distribution to calculate the probability of observing a specific number of genes from one pathway. The second is based on the analysis of all genes present on the genome array and measures the significance of pathway-level statistics computed from the gene-level statistics, using gene set enrichment analysis (20), random forests methods (21), Hotelling's T-square statistics (22) or random-set methods (23).

To give users flexibility in defining their genes of interest, PathExpress compares a submitted list of genes to the genes involved in annotated pathways. The significantly overrepresented sets of reactions (pathways or sub-pathways) in the query list of genes are identified using a hypergeometric distribution test as developed in the BlastSets system (2). As the comparisons are based on enzyme compositions rather than single probe assignments, problems that arise from a multiplicity of genes coding for the same enzyme are largely overcome and the functional activities become apparent. In addition, an automatically generated graphical representation of the metabolic pathways allows the visualization of differential gene expression in a functional context.
Figure 1.

Representation of metabolic pathways and sub-pathways. The directed graph contains two types of nodes, compounds (labelled with their KEGG identifier and represented as ellipses) and reactions (labelled with the EC number of the enzyme involved and represented as boxes). Greyed reactions show that the corresponding enzyme has been identified in a given genome array. Directed arcs between two different nodes represent the consumption or the production of compounds by a reaction. The presented pathway contains eight enzymes with probe sets assignments. Two sub-pathways can be considered, containing three and four enzymes (surrounded by a black line).

PathExpress is freely available at http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/.

METHODS

Data representation

PathExpress is based on a directed graph to model enzymatic reactions in the context of biological pathways (Figure 1). Two types of nodes are used to represent compounds and reactions that can be mediated by one or more enzymes. Directed edges connecting these nodes correspond to the consumption or the production of compounds by the reactions. The data used to build this network are derived from the Compound, Reaction and Enzyme sections of the publicly available KEGG Ligand database (18,19).

To link gene expression data to pathways, PathExpress uses pre-computed assignments of the probe sets of supported genome arrays to EC numbers, which identify enzyme entries. These assignments are based on sequence similarities with proteins retrieved from the Swiss-Prot database (24). Blastx (25) is used to find the best match (E-value ≤ 10^-8) for the sequence representing each probe set (i.e. the sequence derived from the most 5′ to the most 3′ probe in the public UniGene cluster) of the genome array analysed. If the best-matching entry has been annotated as an enzyme, the probe set is assigned to the corresponding EC number, extracted from its definition line. Note that probe sets that cannot be assigned to EC numbers are excluded from further analyses; although this limits the number of usable probe sets, it also eliminates much of the ambiguity that arises from multiple (iso) genes encoding the same enzymatic function. This strategy can be applied to any set of sequences. As of March 2007, data from eight Affymetrix Genome Arrays are available in PathExpress (Table 1). The organisms were selected because of their importance as model organisms for various taxonomic groups or their economic interest, even if they do not have a well-annotated genome. Additional species could be included upon request.
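The assignment step described above can be sketched as follows. This is a minimal illustration, not the PathExpress implementation (which the paper says is written in Perl): the input tuple layout, the function name `assign_ec_numbers` and the assumption that EC numbers appear in the definition line as `EC 1.1.1.1` are all hypothetical.

```python
import re

# Assumed definition-line format, e.g. "Alcohol dehydrogenase (EC 1.1.1.1)".
EC_PATTERN = re.compile(r"EC[ =](\d+\.\d+\.\d+\.\d+)")
E_VALUE_CUTOFF = 1e-8  # threshold quoted in the text


def assign_ec_numbers(hits):
    """Map each probe set to the EC number(s) of its best Swiss-Prot hit.

    hits: iterable of (probe_set, evalue, definition_line) tuples
          (hypothetical layout for parsed blastx results).
    """
    best = {}
    for probe_set, evalue, definition in hits:
        if evalue > E_VALUE_CUTOFF:
            continue  # hit too weak to be considered
        if probe_set not in best or evalue < best[probe_set][0]:
            best[probe_set] = (evalue, definition)

    assignments = {}
    for probe_set, (_, definition) in best.items():
        ec_numbers = EC_PATTERN.findall(definition)
        if ec_numbers:  # probe sets whose best hit is not an enzyme are dropped
            assignments[probe_set] = ec_numbers
    return assignments


example_hits = [
    ("probe_a", 1e-30, "Alcohol dehydrogenase (EC 1.1.1.1)"),
    ("probe_a", 1e-12, "Hexokinase (EC 2.7.1.1)"),
    ("probe_b", 1e-5, "Enolase (EC 4.2.1.11)"),
    ("probe_c", 1e-20, "Hypothetical protein"),
]
print(assign_ec_numbers(example_hits))
```

Only the single best hit per probe set is inspected, so a probe set whose best match is not an enzyme stays unassigned even if a weaker hit carries an EC number.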
Table 1.

Available Affymetrix genome arrays and assignment statistics

| Affymetrix Genome Array | Organism | Sequences (a) | Assigned sequences (b) | EC (c) |
|---|---|---|---|---|
| ATH1 Genome Array | Arabidopsis thaliana | 22 765 | 5 177 | 823 |
| Drosophila Genome 2.0 Array | Drosophila melanogaster | 18 952 | 3 107 | 724 |
| E. coli Genome 2.0 Array | Escherichia coli | 10 208 | 2 245 | 803 |
| Human Genome U133 Plus 2.0 Array | Homo sapiens | 39 070 | 3 332 | 658 |
| Medicago Genome Array | Medicago truncatula | 50 900 | 8 981 | 953 |
| Rice Genome Array | Oryza sativa | 57 194 | 10 068 | 923 |
| Soybean Genome Array | Glycine max | 37 618 | 6 502 | 803 |
| Yeast Genome 2.0 Array | Saccharomyces cerevisiae | 5 814 | 1 471 | 601 |
| Yeast Genome 2.0 Array | Schizosaccharomyces pombe | 5 028 | 1 333 | 566 |

(a) Number of probe set sequences.

(b) Number of probe set sequences assigned to an EC number.

(c) Number of distinct EC numbers corresponding to the probe set sequences.

Microarray data analysis

To interpret gene expression results from microarray experiments, PathExpress detects whether the genes associated with a pathway or sub-pathway are statistically over-represented in a set of sequences, when compared to the rest of the genome array. When a list of identifiers has been submitted, PathExpress first assigns them to EC numbers according to pre-computed relationships. The proportion of submitted EC numbers is then tested for every pathway or sub-pathway. For each test, a P-value, representing the probability that the intersection of the given list with the list of enzymes belonging to the given set of reactions occurs by chance, is calculated using the hypergeometric distribution (26). Because multiple hypothesis tests are performed, it is necessary to correct these P-values. Two adjustment methods are available in PathExpress: the conservative Bonferroni correction (27), in which the P-values are multiplied by the number of comparisons, and the less stringent False Discovery Rate (FDR) approach (28), which controls the expected proportion of false positive results among all rejected hypotheses.
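The test and the two corrections can be illustrated with a short sketch. It rests on assumptions not stated in the text: the P-value is taken as the upper tail P(X ≥ k) of the hypergeometric distribution, and the FDR adjustment is taken to be the Benjamini-Hochberg step-up procedure. The function and parameter names are illustrative only.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k).

    N: enzymes (EC numbers) on the array, K: of these, in the pathway,
    n: enzymes in the submitted list, k: submitted enzymes in the pathway.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 enzymes on the array, 4 in the pathway,
# 3 submitted, 2 of which fall in the pathway.
print(hypergeom_pvalue(k=2, n=3, K=4, N=10))
print(bonferroni([0.01, 0.04, 0.03, 0.005]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Counting enzymes rather than probe sets mirrors the point made above: several probe sets mapping to the same EC number contribute a single observation.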

SYSTEM IMPLEMENTATION

The PathExpress web server runs on a Linux server (2 Intel 4 3.20 GHz, 1 GB RAM). It combines a PostgreSQL database management system to store the data with a dynamic web interface based on PHP and Perl. Data pre-processing is implemented in Perl, statistical analyses are performed using Perl and the R statistical package and graphical representations are generated using GraphViz software (http://www.graphviz.org). An analysis of 1000 identifiers (i.e. comparison to sub-pathways with FDR adjustments) takes less than 2 s and the automatic generation of a graphical representation takes less than 1 s.

WEB INTERFACE

Input

The input data for PathExpress consist of a list of genes of interest (Affymetrix probe set identifiers and/or GenBank accession numbers) present in the selected genome array. Other parameters can be specified: the type of comparison (pathway or sub-pathway), the P-value significance threshold and the adjustment method for multiple testing.

Output

The PathExpress output contains the list of pathways or sub-pathways that are significantly associated with the enzymes in a list of submitted sequence identifiers (Figure 2a). Metabolic pathways are ranked by increasing P-values whereas sub-pathways are grouped according to the pathway to which they belong. In each case, those that are significant (according to the P-value threshold defined by the user) are highlighted. Each pathway can be displayed as an automatically generated graphical representation (Figure 2b) and as an enumeration of reactions. In these pictures, reactions are highlighted in grey if the corresponding enzyme was identified in the genome array and in yellow if it is also present in the submitted list of identifiers. The names of the compounds and the definitions of the reactions are displayed as tool-tips when the mouse is over any of the nodes in the graph. In addition, compounds are linked to the corresponding KEGG entry. If the user clicks on a reaction node, a new page opens containing the description of the enzymes and the list of probe sets assigned to them in the selected genome array (Figure 2c). Blast results used for the EC assignments are available for each probe set in its ‘detail’ page.
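The colouring scheme described above maps naturally onto the GraphViz dot language. The sketch below only builds the dot text and is not taken from PathExpress; the function name, the input layout and the example identifiers are illustrative assumptions.

```python
def pathway_to_dot(reactions, on_array, in_query, name="pathway"):
    """Return GraphViz dot text for a pathway.

    reactions: list of (ec_number, substrates, products), with compounds
               given as lists of identifiers (hypothetical layout).
    on_array:  EC numbers identified in the genome array (drawn in grey).
    in_query:  EC numbers also present in the submitted list (drawn in yellow).
    """
    lines = [f'digraph "{name}" {{']
    compounds = sorted({c for _, subs, prods in reactions for c in subs + prods})
    for compound in compounds:
        lines.append(f'  "{compound}" [shape=ellipse];')
    for index, (ec, subs, prods) in enumerate(reactions):
        node = f"r{index}"
        if ec in in_query:
            style = ", style=filled, fillcolor=yellow"
        elif ec in on_array:
            style = ", style=filled, fillcolor=grey"
        else:
            style = ""
        lines.append(f'  {node} [shape=box, label="{ec}"{style}];')
        for substrate in subs:  # compound consumed by the reaction
            lines.append(f'  "{substrate}" -> {node};')
        for product in prods:  # compound produced by the reaction
            lines.append(f'  {node} -> "{product}";')
    lines.append("}")
    return "\n".join(lines)


# Illustrative three-step chain; identifiers are examples only.
example = [
    ("2.7.1.1", ["C00031"], ["C00092"]),
    ("5.3.1.9", ["C00092"], ["C00085"]),
    ("2.7.1.11", ["C00085"], ["C00354"]),
]
print(pathway_to_dot(example, on_array={"2.7.1.1", "5.3.1.9"}, in_query={"5.3.1.9"}))
```

The resulting text can be saved to a file and rendered locally, for example with `dot -Tpng pathway.dot -o pathway.png`.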
Figure 2.

Screenshots of the PathExpress web interface. (a) The list of metabolic pathways whose enzyme composition intersects with the enzymes corresponding to a list of submitted identifiers, ordered by increasing P-value. Each row reports information concerning the pathway's name and the number of enzymes. The comparison of groups is reported with the number of submitted enzymes involved in the pathway and the P-value for finding the group by chance, associated with the corresponding adjusted P-value. The significant pathways are highlighted in red (‘carbon fixation’, ‘glycolysis/gluconeogenesis’ and ‘pentose phosphate pathway’ in this example). (b) Graphical representation of the glycolysis/gluconeogenesis pathway for enzymes identified in the Affymetrix ATH1 Genome Array (Arabidopsis thaliana). Reactions mediated by enzymes found in the genome array are highlighted in grey whereas those where the enzymes are also present in the query are highlighted in yellow. (c) Example of the detail page of an enzymatic reaction. Each enzyme is reported with its EC number linked to the KEGG database (Entry), its recommended and alternative names, the pathways in which this enzyme is involved (glycolysis/gluconeogenesis in this example) and the list of probe sets assigned. For each probe set, the identifier and description of the best match in Swiss-Prot is displayed. The ‘detail’ button is linked to the complete blast report. The corresponding row is highlighted in yellow if the probe set belongs to the submitted list of identifiers.

All results can be downloaded as tab-delimited text files for further statistical analyses. Pictures representing the pathways can be saved in png or dot format and visualized locally using the GraphViz software. To enhance the visualization of the expression of individual probe sets, all resources (EC assignments and pictures with xml descriptions) are available to be imported into MapMan (17). Although initially developed for the Arabidopsis thaliana array, MapMan has been extended to other organisms, including classifications for the Affymetrix arrays of the plants listed in Table 1 (29).

COMPARISON WITH EXISTING TOOLS

To illustrate the novelty and utilities of PathExpress, we compare it with existing web-based pathway analysis tools (Table 2). Except for the KOBAS server, these tools are limited to organisms with well-annotated genomes. Indeed, the KOBAS server annotates a set of genes with KEGG Orthologous (KO) terms to link them to KEGG pathways (13). To overcome the problems that arise from a multiplicity of genes coding for the same enzyme and thus to provide a robust pathway identification, PathExpress focuses on chemical compounds and reactions in the KEGG Ligand database. It uses pre-computed assignments of probe set sequences to EC numbers and thus can be extended to any organism or set of sequences. Another limitation of these tools is that they define a metabolic pathway as a set of genes without considering the relationships between the reactions within a pathway. To take into account these relationships, we defined a sub-pathway as a chain of enzymatic reactions linked to each other by a common compound (substrate or product) within a pathway (Figure 1). This strategy allows us to identify the most relevant sets of reactions (pathways or sub-pathways) associated with a subset of genes.
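One possible reading of the sub-pathway definition is sketched below: reactions whose enzymes were identified in the genome array are grouped whenever they share a compound, and each connected group is reported as a sub-pathway. The paper does not spell out the exact procedure, so the grouping rule, the `min_size` parameter and all names here are assumptions.

```python
from collections import defaultdict


def find_subpathways(reactions, on_array, min_size=2):
    """Group array-identified reactions that are chained by shared compounds.

    reactions: dict mapping EC number -> (substrates, products).
    on_array:  set of EC numbers identified in the genome array.
    """
    # Index the identified reactions by every compound they touch.
    by_compound = defaultdict(set)
    for ec, (subs, prods) in reactions.items():
        if ec in on_array:
            for compound in set(subs) | set(prods):
                by_compound[compound].add(ec)

    # Two reactions are neighbours if they share at least one compound.
    neighbours = defaultdict(set)
    for ecs in by_compound.values():
        for ec in ecs:
            neighbours[ec] |= ecs - {ec}

    # Depth-first search to collect connected components.
    seen, subpathways = set(), []
    for start in sorted(ec for ec in reactions if ec in on_array):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            ec = stack.pop()
            if ec in component:
                continue
            component.add(ec)
            stack.extend(neighbours[ec] - component)
        seen |= component
        if len(component) >= min_size:
            subpathways.append(sorted(component))
    return subpathways


# Toy pathway: reaction "C" is absent from the array and splits the chain.
toy = {
    "A": (["x"], ["y"]),
    "B": (["y"], ["z"]),
    "C": (["z"], ["w"]),
    "D": (["w"], ["v"]),
    "E": (["v"], ["u"]),
}
print(find_subpathways(toy, on_array={"A", "B", "D", "E"}))
```

In practice, ubiquitous compounds such as water or ATP would connect almost every reaction, so a real implementation would probably need to exclude them before grouping.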
Table 2.

Comparison of PathExpress with existing tools

| Software | Pathways | Input | Output | Comments | References |
|---|---|---|---|---|---|
| PathExpress | KEGG | List of identifiers (Affymetrix identifiers and/or Gene accession numbers) | Statistically significant pathways and sub-pathways with graphical visualisation | Affymetrix data but extendable to any organism or set of sequences | |
| ArrayXPath II | GenMAPP, KEGG, BioCarta, PharmGKB | Clustered gene expression profile | Statistically significant pathways with graphical visualisation | Homo sapiens, Mus musculus, Rattus norvegicus | (10) |
| KOBAS | KEGG | List of KO identifiers | Statistically significant pathways | Annotation of a set of genes or proteins with KO terms | (13) |
| PathMAPA | KEGG, TAIR, NCBI, GO | Expression data file (GenePix, Affymetrix) | Statistically significant pathways, enzymes, genes with graphical visualisation | Arabidopsis thaliana | (14) |
| PathwayExplorer | KEGG, BioCarta, GenMAPP | Expression data file (tab-delimited) | Statistically significant pathways with graphical visualisation | Well-annotated genomes | (15) |
| Pathway Miner | GenMAPP, KEGG, BioCarta | List of Gene Accession numbers with gene expression values | Statistically significant pathways with graphical visualisation | Homo sapiens, Mus musculus | (16) |

CONCLUSION

We have developed PathExpress, a web-based tool for identifying the pathways most relevant to the results of gene expression experiments. The focus on enzymes results in robust pathway identification. PathExpress can also identify partial pathways, and provides a graphical representation for their visualization. Based on the KEGG Ligand database and on a pre-computed assignment of probe sets to EC numbers, PathExpress can be extended to any organism or set of sequences (e.g. custom DNA microarray, proteome array) and hence provides a useful resource for the integration of transcriptomic and proteomic data sets. In the near future, additional species will be included in PathExpress. The process of assigning probe sets to EC numbers will be improved by taking into account the domain composition of the protein and by using enzyme-specific profiles. We are also considering other pathway-based tests that analyse all genes from expression data. Finally, we intend to further develop the system by extending its application to the analysis of metabolomic results, since compound information is already included in the metabolic network representation.
  26 in total

1.  Pathway Miner: extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data.

Authors:  Ritu Pandey; Raghavendra K Guru; David W Mount
Journal:  Bioinformatics       Date:  2004-05-14       Impact factor: 6.937

2.  NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis.

Authors:  Jill Cheng; Shaw Sun; Adam Tracy; Earl Hubbell; Joseph Morris; Venu Valmeekam; Andrew Kimbrough; Melissa S Cline; Guoying Liu; Ron Shigeta; David Kulp; Michael A Siani-Rose
Journal:  Bioinformatics       Date:  2004-02-12       Impact factor: 6.937

3.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

4.  LIGAND: chemical database for enzyme reactions.

Authors:  S Goto; T Nishioka; M Kanehisa
Journal:  Bioinformatics       Date:  1998       Impact factor: 6.937

5.  Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors:  Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-30       Impact factor: 11.205

6.  PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways.

Authors:  Bernhard Mlecnik; Marcel Scheideler; Hubert Hackl; Jürgen Hartler; Fatima Sanchez-Cabo; Zlatko Trajanoski
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

7.  MetaCyc: a multiorganism database of metabolic pathways and enzymes.

Authors:  Ron Caspi; Hartmut Foerster; Carol A Fulcher; Rebecca Hopkinson; John Ingraham; Pallavi Kaipa; Markus Krummenacker; Suzanne Paley; John Pick; Seung Y Rhee; Christophe Tissier; Peifen Zhang; Peter D Karp
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

8.  The Universal Protein Resource (UniProt).

Authors:  Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal:  Nucleic Acids Res       Date:  2005-01-01       Impact factor: 16.971

9.  ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics.

Authors:  Hee-Joon Chung; Chan Hee Park; Mi Ryung Han; Seokho Lee; Jung Hun Ohn; Jihoon Kim; Jihun Kim; Ju Han Kim
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

10.  PathMAPA: a tool for displaying gene expression and performing statistical tests on metabolic pathways at multiple levels for Arabidopsis.

Authors:  Deyun Pan; Ning Sun; Kei-Hoi Cheung; Zhong Guan; Ligeng Ma; Matthew Holford; Xingwang Deng; Hongyu Zhao
Journal:  BMC Bioinformatics       Date:  2003-11-07       Impact factor: 3.169
