R spider: a network-based analysis of gene lists by combining signaling and metabolic pathways from Reactome and KEGG databases.

Alexey V Antonov, Esther E Schmidt, Sabine Dietmann, Maria Krestyaninova, Henning Hermjakob.

Abstract

R spider is a web-based tool for the analysis of a gene list using the systematic knowledge of core pathways and reactions in human biology accumulated in the Reactome and KEGG databases. R spider implements a network-based statistical framework, which provides a global understanding of gene relations in the supplied gene list, and fully exploits the Reactome and KEGG knowledge bases. R spider provides a user-friendly dialog-driven web interface for several model organisms and supports most available gene identifiers. R spider is freely available at http://mips.helmholtz-muenchen.de/proj/rspider.

Year: 2010 PMID: 20519200 PMCID: PMC2896180 DOI: 10.1093/nar/gkq482

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

High-throughput technologies enable biological researchers to study hundreds or thousands of genes simultaneously. Genes or proteins are detected that are differentially expressed or co-expressed across varying cellular conditions. However, generating hypotheses about the underlying biological mechanisms from experimentally derived gene/protein lists remains a non-trivial task for biologists. In 2002, a computerized analysis approach using the Gene Ontology (GO) was proposed to deal with this issue (1,2). Currently, there are over 25 tools performing this type of analysis with some variations (3–13). More recently, computational methods have sought to interpret, or at least visualize, the pathway context of the experimentally derived genes (14–17). In this respect, one should mention a landmark procedure proposed recently (17,18), which goes beyond gene pairs and fully captures the topology of signaling pathways by propagating the perturbations measured at the gene level through the entire pathway. However, the development of rigorous statistical methods for global network inference has been a challenging task. Recently, we introduced a network-based computational framework for the interpretation of gene/protein lists derived from high-throughput studies (19,20). Our approach overcomes a major bottleneck of the commonly employed methods for enrichment analysis (21) by providing network models that unite genes from different pathways into a single connected network. A Monte Carlo procedure is employed to estimate the significance of the inferred models, thus providing rigorous quantitative statistical control (22). A web-based tool, KEGG spider (19), was introduced that exploits the network-based methodology for the exploration of metabolic reactions accumulated in the KEGG database (23). It was demonstrated that KEGG spider provides deeper insight into the genomic basis of metabolism variations than other tools (19).
Although it is a powerful tool, KEGG spider is limited to metabolism-related genes, which cover <10% of the human genome (about 1100 genes). Many other important cellular processes, such as regulatory and signaling pathways, are therefore not covered by the inferred network models. The Reactome knowledgebase (24,25), on the other hand, is a dynamically expanding project that provides high-quality, expert-authored, peer-reviewed knowledge of human reactions and pathways, covering 3916 human proteins (as of release 30). To provide experimentalists with an efficient web-based tool for the analysis of high-throughput data using Reactome knowledge, we have developed R spider, which implements the network-based methodology and exploits the data accumulated in the Reactome knowledgebase to the full extent. R spider unites the Reactome and KEGG knowledge databases and thus covers proteins from both signaling and metabolic pathways. We would like to point out that other signaling and metabolic databases are available in the public domain, such as the manually curated BioCarta and NCI collections or inferred data (26,27). R spider has the option to switch between Reactome&KEGG, Nature Curated pathways (http://pid.nci.nih.gov/) and BioCarta (www.biocarta.com).

MATERIALS AND METHODS

A global Reactome protein network

Reactome (http://www.reactome.org/) is an expert-authored, peer-reviewed knowledgebase of human reactions and pathways. We used a file in tab-delimited format which specifies protein–protein interaction pairs derived from Reactome data (http://www.reactome.org/download/current/homo_sapiens.interactions.txt.gz). The meaning of ‘interaction’ is quite broad: two protein sequences occur in the same complex or they occur in the same or neighbouring reaction(s). For the human genome, the global Reactome protein network covers about 3700 proteins (including proteins from non-human species that interact with human proteins) involved in approximately 83 000 unique pairwise interactions (based on release 30).

A global metabolic gene network

The KEGG database is a collection of chemical structure transformation patterns for substrate–product pairs (reactant pairs). A detailed description of the procedure used to construct a global metabolic gene network can be found in ref. (19). The resulting global metabolic gene network links by edges any two genes that are associated with reactions sharing common compounds (from the main reaction pair). For the human genome, the global metabolic gene network covers about 1100 genes involved in approximately 15 000 unique pairwise interactions.

Integral reference network

To unite both networks, the Reactome protein network was transformed into a gene network. Because in many cases several proteins map to the same gene, the resulting gene network has fewer nodes and edges than the protein network. Once the KEGG and Reactome networks use the same type of node identifiers, they can be united. For the human genome, the resulting integral network covers about 3700 genes involved in approximately 50 000 unique pairwise gene interactions.
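The merging step can be illustrated with a minimal sketch. It assumes that the protein-to-gene mapping is available as a dictionary and that both networks are given as lists of undirected pairs; the function names and the toy identifiers are hypothetical and are not taken from the R spider implementation.

```python
from itertools import product


def protein_edges_to_gene_edges(protein_edges, protein_to_genes):
    """Project protein-protein pairs onto undirected gene-gene pairs."""
    gene_edges = set()
    for p, q in protein_edges:
        for g, h in product(protein_to_genes.get(p, ()), protein_to_genes.get(q, ())):
            if g != h:  # two proteins of the same gene would give a self-loop
                gene_edges.add(frozenset((g, h)))
    return gene_edges


def unite_networks(*edge_sets):
    """Union of several undirected gene networks; duplicate pairs collapse."""
    united = set()
    for edges in edge_sets:
        united |= {frozenset(edge) for edge in edges}
    return united


# Toy data (hypothetical identifiers): proteins P1a and P1b map to gene G1.
reactome_protein_edges = [("P1a", "P2"), ("P1b", "P2"), ("P1a", "P1b")]
protein_to_genes = {"P1a": ["G1"], "P1b": ["G1"], "P2": ["G2"]}
kegg_gene_edges = [("G2", "G3"), ("G1", "G2")]

reactome_gene_edges = protein_edges_to_gene_edges(reactome_protein_edges, protein_to_genes)
integral_network = unite_networks(reactome_gene_edges, kegg_gene_edges)
```

In the toy data three protein pairs collapse into a single gene pair, which illustrates why the gene network is smaller than the protein network it was derived from.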

Network inference procedure and statistical treatment

Detailed information on the network inference and the Monte Carlo simulation procedure for computing P-values can be found in our previously published papers (19,20,28). Initially, the genes from the input list are mapped to the global reference network. At this point, all nodes from the input list are disconnected. In the first step, all pairs of input nodes at distance 1 are connected by edges and connected subnetworks are extracted. The subnetwork with the maximal number of nodes is referred to as the inferred network model D1. In the second step, the still disconnected input nodes at distance 2 are connected by edges; the subnetwork with the maximal number of input nodes is referred to as network model D2. In the third step, the still disconnected input nodes at distance 3 are connected by edges and network model D3 (again the subnetwork with the maximal number of input nodes) is inferred. Models D2 and D3 incorporate nodes that are not from the input list but are added to connect input nodes in the network model. We refer to these added nodes as intermediate or missing genes.
Let N be the number of genes from the input list that are mapped to the reference network; we refer to N as the size of the input list. Let S1, S2, S3 denote the numbers of input nodes in the inferred network models D1, D2, D3; we refer to them as the sizes of the respective models. Given N, we treat the sizes S1, S2, S3 as statistics and estimate the probability of obtaining models of the same or larger size from a randomly generated gene list that has N genes mapped to the reference network. To generate the background distributions BD1, BD2, BD3, we repeat the following simulation procedure k times, where k specifies the upper significance level (the smallest P-value that can be estimated is 1/k). A random gene list Lj of size N (equal to the size of the input list) is generated by sampling genes from the global gene network, where the index j = 1 … k identifies each of the k random simulations. The network inference procedure described above is applied to the random list Lj and the network models D1, D2, D3 are inferred. Let R1j, R2j, R3j denote the sizes (the numbers of input genes) of the models D1, D2, D3 inferred for the random list Lj. After repeating the simulation procedure k times, we thus obtain the background distribution R1j (j = 1 … k) for models D1, R2j (j = 1 … k) for models D2 and R3j (j = 1 … k) for models D3. To estimate the significance of the inferred network model D1 for the input gene list, the value S1 is compared with the distribution R1j. Let n be the number of values in the distribution R1j that are equal to or greater than S1. The P-value of the inferred network model D1 is then estimated as P = (n + 1)/k. The P-values for models D2 and D3 are estimated in the same way.
Statistical treatment plays an important role in the quality control of inferred models. Given a gene list and a reference network, one can always infer some model that covers all genes from the list simply by relaxing the number of allowed intermediate genes. There is a very simple test for any tool of this kind: it must be able to recognize a random gene list and return, on average, insignificant P-values for the random case. In 20 submissions of different randomly generated gene lists, on average only 1 case is expected to be significant at the level of 0.05 (1/20). The estimate of the P-value provided by the Monte Carlo procedure corresponds exactly to the definition of a P-value: the probability of obtaining a model of the same quality for a random gene list.
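The inference and simulation steps can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the R spider code: it assumes that model Dd joins two input genes whenever their shortest-path distance in the reference network is at most d, that the model size is the number of input genes in the largest resulting group, and that random lists are sampled uniformly from all network nodes. All function names are hypothetical, and the cap at 1.0 is an addition of this sketch.

```python
import random
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency map from a list of gene pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def model_size(adj, input_genes, max_dist):
    """Largest number of input genes linked by paths of length <= max_dist."""
    inputs = [g for g in set(input_genes) if g in adj]  # mapped genes only
    parent = {g: g for g in inputs}  # union-find over input genes

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for start in inputs:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search limited to max_dist steps
            node = queue.popleft()
            if dist[node] == max_dist:
                continue
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
                    if nb in parent:  # reached another input gene
                        parent[find(nb)] = find(start)
    sizes = {}
    for g in inputs:
        root = find(g)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


def monte_carlo_p_value(adj, input_genes, max_dist, k=1000, seed=0):
    """Estimate P = (n + 1)/k from k random lists of the same mapped size."""
    rng = random.Random(seed)
    mapped = [g for g in set(input_genes) if g in adj]
    observed = model_size(adj, mapped, max_dist)
    universe = sorted(adj)
    n = sum(
        model_size(adj, rng.sample(universe, len(mapped)), max_dist) >= observed
        for _ in range(k)
    )
    return observed, min(1.0, (n + 1) / k)  # cap added here; not in the text


# Toy reference network: a chain A-B-C-D-E plus a separate pair F-G.
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "G")])
inputs = ["A", "C", "E", "F"]
sizes = [model_size(adj, inputs, d) for d in (1, 2, 3)]
s2, p2 = monte_carlo_p_value(adj, inputs, max_dist=2, k=200)
```

In the toy network no two input genes are direct neighbours, so D1 contains a single gene, whereas allowing one intermediate gene (D2) links A, C and E through B and D.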

Enrichment of the Reactome and KEGG canonical pathways

To compute the enrichment of canonical Reactome and KEGG pathways, we also employ a Monte Carlo procedure. We randomly draw k genes (here k is the number of genes in the input list) 100 times from the set of all genes (or from a background set of genes supplied by the user), and each time we compute the hypergeometric P-value of the best-scoring pathway, whichever pathway that is. This yields a distribution of 100 best P-values for random drawings of k genes, which we compare with the P-value of the best-scoring pathway for the original list. The adjusted P-value is the share of random simulations in which the best P-value was equal to or smaller (i.e. more significant) than the best P-value for the original gene list.
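A minimal sketch of this adjustment is given below. It assumes that pathways are supplied as a mapping from pathway name to a set of member genes and that the hypergeometric P-value is the upper tail, i.e. the probability of seeing at least the observed overlap; the function names, the toy pathways and the gene identifiers are hypothetical.

```python
import random
from math import comb


def hypergeom_upper_tail(x, M, K, n):
    """P(X >= x) when n genes are drawn from M, of which K are in the pathway."""
    total = comb(M, n)
    return sum(comb(K, i) * comb(M - K, n - i) for i in range(x, min(K, n) + 1)) / total


def best_pathway_p(gene_list, pathways, background):
    """Smallest hypergeometric P-value over all pathways for one gene list."""
    genes = set(gene_list) & background
    M, n = len(background), len(genes)
    best = 1.0
    for members in pathways.values():
        members = set(members) & background
        best = min(best, hypergeom_upper_tail(len(genes & members), M, len(members), n))
    return best


def adjusted_p_value(gene_list, pathways, background, n_sim=100, seed=0):
    """Share of random lists whose best P-value is <= the observed best P-value."""
    rng = random.Random(seed)
    background = set(background)
    observed = best_pathway_p(gene_list, pathways, background)
    size = len(set(gene_list) & background)
    pool = sorted(background)
    hits = sum(
        best_pathway_p(rng.sample(pool, size), pathways, background) <= observed
        for _ in range(n_sim)
    )
    return observed, hits / n_sim


# Toy example: 40 background genes, two pathways, input list equal to pathway P1.
background = {f"g{i}" for i in range(40)}
pathways = {"P1": {f"g{i}" for i in range(5)}, "P2": {f"g{i}" for i in range(5, 15)}}
observed_p, adjusted_p = adjusted_p_value([f"g{i}" for i in range(5)], pathways, background)
```

Because the best pathway is re-selected for every random list, the adjustment accounts for the fact that many pathways are tested at once.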

RESULTS

R spider (http://mips.helmholtz-muenchen.de/proj/rspider) is a freely available web-based tool that implements a pathway-free statistical framework for the interpretation of gene lists from high-throughput studies. R spider is available for several model organisms (Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster). In addition, R spider has the option to switch to other publicly available signaling pathway databases, Nature Curated pathways (http://pid.nci.nih.gov/) and BioCarta (www.biocarta.com). R spider has a simple, user-friendly interface. As input, it accepts several types of gene or protein identifiers, such as identifiers from ‘Entrez Gene’ (29), ‘UniProt/Swiss-Prot’ (30), ‘Hugo Gene Symbols’, ‘UniGene’, ‘Ensembl’ (31), ‘RefSeq’ (32) and ‘Affymetrix’ (33). As output, the user obtains network models (D1, D2, D3), where the index (1, 2, 3) indicates the maximal distance at which two input genes are still considered ‘connected’ in the output model. Each network model (D1, D2 or D3) represents a connected subnetwork with the maximal number of input genes. R spider provides a report on the statistical significance of the inferred network models (D1, D2, D3), as well as a catalog of the enriched Reactome or KEGG pathways. For each model (D1, D2, D3), a link is provided to obtain a graphical visualization. The visualization is performed by the Medusa package (34). We would like to point out that the online visualization capabilities are limited. For this reason, we recommend downloading the inferred network models as text files (links are provided on the visualization page) and using freely available packages (Cytoscape, Medusa) for network visualization. With these programs, users can produce high-quality figures (34,35).

Graphical output

In the graphical output, input genes are represented by rectangles and labeled with the input gene IDs. Intermediate genes are represented by triangles and labeled with Entrez Gene Symbols. Compounds are represented by circles and labeled with compound names (if the name is longer than 10 characters, the KEGG compound ID is used instead). Different colors are used to specify canonical Reactome or KEGG pathways. In general, up to 11 of the most representative pathways (in terms of the number of genes in the model; both input and intermediate genes are counted) are colored. In most cases, a gene can be associated with multiple pathways. For this reason, R spider implements a strict hierarchical procedure for gene coloring. First, pathways are ordered with respect to the number of their genes that are present in the model. The most represented pathway is colored red. The colored (red) genes are then excluded and the pathways are reordered considering only the remaining genes. The next most represented pathway is colored blue. The colored genes (red and blue) are excluded and the pathways are reordered again considering only the remaining genes. The procedure continues until 11 pathways have been colored or no remaining pathway covers at least two genes. Colors therefore have a strict hierarchy: red, blue, green and so on. The number before the color indicates the hierarchy order (Figure 1). It follows that some red genes may also belong to the blue (green and so on) pathway, but not vice versa.
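The hierarchical coloring can be sketched as a greedy loop. The sketch assumes that pathways are given as a mapping from name to gene set, that the length of the palette caps the number of colored pathways, and that ties are broken alphabetically by pathway name; the tie-breaking rule and all names are assumptions made for illustration only.

```python
def color_pathways(model_genes, pathways, palette, min_genes=2):
    """Greedy hierarchical coloring: each gene gets the first color that claims it."""
    remaining = set(model_genes)  # genes not yet colored
    used = set()                  # pathways that already have a color
    gene_color = {}
    legend = []                   # (rank, color, pathway) in hierarchy order
    for rank, color in enumerate(palette, start=1):
        counts = {
            name: len(set(members) & remaining)
            for name, members in pathways.items()
            if name not in used
        }
        if not counts:
            break
        # Most represented pathway among the still uncolored genes.
        best = min(counts, key=lambda name: (-counts[name], name))
        if counts[best] < min_genes:
            break
        for gene in set(pathways[best]) & remaining:
            gene_color[gene] = color
        remaining -= set(pathways[best])
        used.add(best)
        legend.append((rank, color, best))
    return gene_color, legend


# Toy model with seven genes and three overlapping pathways (hypothetical names).
model_genes = ["a", "b", "c", "d", "e", "f", "g"]
pathways = {
    "Signaling": {"a", "b", "c", "d", "e"},
    "Metabolism": {"d", "e", "f"},
    "Other": {"f", "g"},
}
gene_color, legend = color_pathways(model_genes, pathways, ["red", "blue", "green"])
```

In the toy model, genes d and e also belong to "Metabolism" but stay red, and after the red genes are removed "Other" overtakes "Metabolism", which shows why the pathways must be reordered after every round.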

Figure 1.

Network model D3 returned by R spider on submission of 360 candidate genes residing in regions with copy number alteration typical of the Sézary syndrome (37). Boxes represent input genes; triangles represent intermediate genes (genes that are added to connect two input genes; for model D3, up to two intermediate genes are allowed between any two input genes); circles represent compounds that are common substrates or products of both connected genes. Diamonds are used to specify the colour of canonical Reactome or KEGG pathways.

Table: interaction context

For each gene in the reported model, R spider provides the full interaction context. This information is summarized in the table ‘Interacting Pairs’. In the case of Reactome, there are four types of interactions: ‘direct_complex’, ‘indirect_complex’, ‘reaction’ or ‘neighbouring_reaction’. In the case of the KEGG database, an interaction is represented either by a compound (the connected genes are assigned to different reactions utilizing the same compound) or, rarely, by a reaction ID (both connected genes catalyze the same metabolic reaction). An edge can be supported by several different interactions; all of them are reported, with corresponding links to the source data.

Example

On our website (http://mips.helmholtz-muenchen.de/proj/rspider/example.html), we present several hundred examples of R spider analyses of gene lists that were automatically extracted by text mining from proteomics studies in various biological contexts (36). Here, we present one example in detail to demonstrate the potential benefit of our tool. Currently, many clinical studies are designed to reveal possible pathogenic mechanisms and novel therapeutic targets for complex diseases with specific phenotypes. The Sézary syndrome, for example, is an aggressive cutaneous T-cell lymphoma/leukemia. In a study by Vermeer et al. (37), high-resolution array-based comparative genomic hybridization was performed on malignant T cells from 20 patients to reveal highly recurrent genetic alterations typical of the Sézary syndrome. Minimal common regions with copy number alteration occurring in at least 35% of patients were reported, which comprised in total about 360 candidate genes (see Table 1 in ref. 37). Only 22 of these genes are mapped to KEGG metabolic pathways. Accordingly, an analysis by KEGG spider, performed for comparison, reports that the inferred network model is not significant (P ≈ 0.1). In contrast, the integral reference network that unites both Reactome and KEGG data provides more interesting insights into the possible molecular mechanisms behind genes with copy number alteration in the Sézary syndrome. In this case, 92 out of the 360 genes are mapped to the integral network. Network model D3, which allows up to two missing genes between any two input genes, connects 74 out of the 92 mapped candidate genes into a single non-interrupted network. The model is statistically significant (P < 0.01). R spider randomly sampled 92 genes from the set of 3700 human genes that constitute the integral reference network 1000 times; in 993 cases, the resulting network model D3 contained fewer than 74 genes, i.e. only 7 random lists produced a model of at least the same size. With the estimate P = (n + 1)/k, this gives P = 8/1000 = 0.008, so the significance of the model is about 0.01.
R spider provides graphical models. The network model D3 for the considered example, which covers 74 genes (P < 0.01), is presented in Figure 1. Proteins from the input list are indicated by rectangles, intermediate proteins by triangles and chemical compounds by circles. Colours are used to specify Reactome and KEGG canonical pathways.
In comparison to other available pathway analysis tools, R spider provides a global view of functional relations between genes. For example, submission to Onto-express (17) results in a report of several (∼10) enriched pathways, with the possibility to visualize them separately. This is certainly valuable information. However, the best model (enriched pathway ‘Pathways in cancer’) covers 19 genes. The relations between pathways, as well as the roles of and relations between genes that are not covered by enriched pathways, are not disclosed. In comparison to Onto-express, R spider thus demonstrates that genes residing in regions that frequently have a copy number alteration in Sézary syndrome are dependent, although they belong to a wide spectrum of signaling and metabolic pathways. In this case, the user gets a newly created pathway that covers 74 genes and runs through several canonical Reactome and KEGG pathways.

CONCLUSIONS

Various modern genomics technologies result in gene lists. A better understanding of the biological mechanisms that unite the identified genes can give clues to the phenomena under study. R spider makes it possible to exploit the knowledge of diverse biological processes accumulated in the Reactome knowledgebase, together with the metabolism-related processes in the KEGG database, to decipher the mechanisms behind experimentally derived gene lists. A pathway-free statistical framework combined with the most advanced publicly available databases for pathways and reactions makes R spider a very attractive tool for the interpretation of genomics data.

FUNDING

Funding for open access charge: European Bioinformatics Institute, Wellcome Trust Genome Campus. The development of Reactome is supported by a grant from the US National Institutes of Health (P41 HG003751) and EU grant LSHG-CT-2005-518254 “ENFIN”.

Conflict of interest statement. None declared.

36 in total

1. Profiling gene expression using onto-express.

Authors: Purvesh Khatri; Sorin Draghici; G Charles Ostermeier; Stephen A Krawetz
Journal: Genomics Date: 2002-02 Impact factor: 5.736

2. Global functional profiling of gene expression.

Authors: Sorin Draghici; Purvesh Khatri; Rui P Martins; G Charles Ostermeier; Stephen A Krawetz
Journal: Genomics Date: 2003-02 Impact factor: 5.736

3. Characterizing gene sets with FuncAssociate.

Authors: Gabriel F Berriz; Oliver D King; Barbara Bryant; Chris Sander; Frederick P Roth
Journal: Bioinformatics Date: 2003-12-12 Impact factor: 6.937

4. NetAffx: Affymetrix probesets and annotations.

Authors: Guoying Liu; Ann E Loraine; Ron Shigeta; Melissa Cline; Jill Cheng; Venu Valmeekam; Shaw Sun; David Kulp; Michael A Siani-Rose
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

6. Onto-Tools: an ensemble of web-accessible, ontology-based tools for the functional design and interpretation of high-throughput gene expression experiments.

Authors: Purvesh Khatri; Pratik Bhavsar; Gagandeep Bawa; Sorin Draghici
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

7. GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining.

Authors: Marco Masseroli; Dario Martucci; Francesco Pinciroli
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

8. PLIPS, an automatically collected database of protein lists reported by proteomics studies.

Authors: Alexey V Antonov; Sabine Dietmann; Philip Wong; Rodchenkov Igor; Hans W Mewes
Journal: J Proteome Res Date: 2009-03 Impact factor: 4.466

9. PPI spider: a tool for the interpretation of proteomics data in the context of protein-protein interaction networks.

Authors: Alexey V Antonov; Sabine Dietmann; Igor Rodchenkov; Hans W Mewes
Journal: Proteomics Date: 2009-05 Impact factor: 3.984

10. KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors: H Ogata; S Goto; K Sato; W Fujibuchi; H Bono; M Kanehisa
Journal: Nucleic Acids Res Date: 1999-01-01 Impact factor: 16.971
