Radek Szklarczyk1, Wout Megchelenbrink2, Pavel Cizek3, Marie Ledent4, Gonny Velemans3, Damian Szklarczyk5, Martijn A Huynen6. 1. Maastricht University Medical Centre, P. Debyelaan 25, 6226XM Maastricht, The Netherlands Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Centre, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, Nijmegen, The Netherlands radek.szklarczyk@maastrichtuniversity.nl. 2. Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Centre, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, Nijmegen, The Netherlands Institute for Computing and Information Sciences (ICIS), Radboud University, Toernooiveld 212, 6525 EC Nijmegen, The Netherlands Centre for Systems Biology and Bioenergetics (CSBB), Radboud University Medical Centre, Geert Grooteplein Zuid 10, 6500 HB Nijmegen, Nijmegen, The Netherlands. 3. Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Centre, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, Nijmegen, The Netherlands. 4. Maastricht University Medical Centre, P. Debyelaan 25, 6226XM Maastricht, The Netherlands. 5. Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland. 6. Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Centre, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, Nijmegen, The Netherlands Centre for Systems Biology and Bioenergetics (CSBB), Radboud University Medical Centre, Geert Grooteplein Zuid 10, 6500 HB Nijmegen, Nijmegen, The Netherlands Martijn.Huijnen@radboudumc.nl.
Abstract
We have developed the Weighted Gene Expression Tool and database (WeGET, http://weget.cmbi.umcn.nl) for the prediction of new genes of a molecular system by correlated gene expression. WeGET utilizes a compendium of 465 human and 560 murine gene expression datasets that have been collected from multiple tissues under a wide range of experimental conditions. It exploits this abundance of expression data by assigning a high weight to datasets in which the known genes of a molecular system are harmoniously up- and down-regulated. WeGET ranks new candidate genes by calculating their weighted co-expression with that system. A weighted rank is calculated for human genes and their mouse orthologs. Then, an integrated gene rank and p-value is computed using a rank-order statistic. We applied our method to predict novel genes that have a high degree of co-expression with Gene Ontology terms and pathways from KEGG and Reactome. For each query set we provide a list of predicted novel genes, computed weights for transcription datasets used and cell and tissue types that contributed to the final predictions. The performance for each query set is assessed by 10-fold cross-validation. Finally, users can use the WeGET to predict novel genes that co-express with a custom query set.
We have developed the Weighted Gene Expression Tool and database (WeGET, http://weget.cmbi.umcn.nl) for the prediction of new genes of a molecular system by correlated gene expression. WeGET utilizes a compendium of 465 human and 560 murine gene expression datasets that have been collected from multiple tissues under a wide range of experimental conditions. It exploits this abundance of expression data by assigning a high weight to datasets in which the known genes of a molecular system are harmoniously up- and down-regulated. WeGET ranks new candidate genes by calculating their weighted co-expression with that system. A weighted rank is calculated for human genes and their mouse orthologs. Then, an integrated gene rank and p-value is computed using a rank-order statistic. We applied our method to predict novel genes that have a high degree of co-expression with Gene Ontology terms and pathways from KEGG and Reactome. For each query set we provide a list of predicted novel genes, computed weights for transcription datasets used and cell and tissue types that contributed to the final predictions. The performance for each query set is assessed by 10-fold cross-validation. Finally, users can use the WeGET to predict novel genes that co-express with a custom query set.
Ever since the publication of the first gene expression arrays, the correlated expression of genes involved in a related molecular process has been used to predict functional relations between gene pairs (1). Large amounts of microarray and RNA-seq transcript expression, measured under a plethora of conditions enable mining for concordantly expressed genes. Indeed, this concept has been successfully employed in databases such as COEXPRESSdb, GeneFriends, GeneMANIA and STARNET 2 (2–5). Nevertheless, relative to other types of genomics data, co-expression has lower sensitivity and selectivity (6). To improve the quality of the predictions, various strategies have been applied, like exploiting the conservation of co-expression between species (7), combining many gene expression datasets (8,9) or biclustering datasets to identify groups of genes that co-express within a subset of the experiments (see (10) for a review). Expression screening, an extension of biclustering methods (11), weighs gene expression datasets based on the co-expression of genes within a molecular system and uses those weights to predict new genes involved in that system. It has been successfully applied to predict new mitochondrial proteins essential for the organelle (11) and to discover new players in heme biosynthesis (12). The principle behind this method is appealing: it systematically exploits the available gene expression data and, via its weighting scheme, implicitly solves the question facing many researchers: which gene expression data to use to predict new genes for a pathway? Nevertheless, it is computationally costly, as the weighting has to be recalculated for each pathway separately and additional cross validation requires multiple runs per pathway. We have therefore developed and implemented a fast expression screening algorithm that includes a dataset weighting and allows for the rapid computation of genes that co-regulate with a query gene set. Our algorithm was employed to compile a weighted co-expression database for all Gene Ontology (GO) terms and human pathways annotated in the KEGG and Reactome databases. Furthermore, we provide information regarding the original experimental setup of the highly weighted datasets. In particular, WeGet reports the cell and tissue types in which the query genes are consistently up- and down-regulated with each other. Finally, the robustness of the predicted results is assessed by 10-fold cross-validation and reported as the receiver operating characteristic (ROC) curve.We compared WeGET with five popular web tools and databases that predict novel genes based on their co-expression with specified query gene sets, using two query gene sets published by Baughman et al. (ref 11) and show that indeed, weighting the datasets results in improved precision, in particular at low recall rates (the top 100 genes). The complete WeGET database, together with a custom query submission system, is available through the WeGET website.
THE WeGET ANALYSIS PIPELINE
WeGET uses a compendium of 465 human and 560 murine gene expression datasets ranging from 6 to 192 samples per dataset. In total, ∼30 000 samples from multiple mammalian platforms were collected from the Gene Expression Omnibus (GEO) (13).The WeGET computational pipeline starts with selecting the normalized expression values for all probes associated with the query genes. For genes with multiple probes, the probe with the highest average Pearson correlation coefficient with all other query probes is selected. Subsequently, the pipeline calculates the average Pearson correlation between each gene and the set of query genes in every dataset (Figure 1). Then, all probes are ranked based on their average correlation with the query probes and mapped back to their associated gene. Each gene i obtains a score s depending on the fraction of the query set that has been ranked above that gene. These calculations are then repeated four times for the same query set and gene expression dataset, where the expression values have been randomly permuted between the genes in every measurement. This step estimates the number of genes that are expected to highly correlate with the query set in a random model. To calculate the dataset weight, an N100 value is calculated that is the fraction of query genes found among the top 100 genes with highest average Pearson correlation. The ratio between the N100 from the original dataset and the average N100 value for the randomized datasets constitutes the weight of the experiments. A species score is the weighted average of all its datasets. The final ranked gene list is obtained by integrating the ranked human and mouse list (mouse genes that are unambiguous human one-to-one orthologs). This is performed using the RobustRankAggreg R-package (14) that computes the final gene rankings using a rank-order statistic (15,16).
Figure 1.
WeGET computational pipeline used to create the database. (A) Determining the dataset weight wdataset. The transcriptome measurements are converted into a correlation matrix. The average correlation with the query set (sgene) is used for gene ranking and the dataset weight calculation wdataset. (B) Data integration across datasets, platforms and species. Gene scores sgene from all datasets are combined taking into account the precomputed weights. Subsequently different transcriptome platforms and species data are integrated to arrive at the final ranking. The process is repeated after excluding each query gene to construct a receiver operating characteristic (ROC) curve that visualizes predictive power of the method for a specific query set of genes.
WeGET computational pipeline used to create the database. (A) Determining the dataset weight wdataset. The transcriptome measurements are converted into a correlation matrix. The average correlation with the query set (sgene) is used for gene ranking and the dataset weight calculation wdataset. (B) Data integration across datasets, platforms and species. Gene scores sgene from all datasets are combined taking into account the precomputed weights. Subsequently different transcriptome platforms and species data are integrated to arrive at the final ranking. The process is repeated after excluding each query gene to construct a receiver operating characteristic (ROC) curve that visualizes predictive power of the method for a specific query set of genes.Thus for each set of expression data the pipeline measures whether genes in a given pathway are co-expressed with each other better than expected and uses that to assign weights to that expression dataset. These weights are subsequently used in determining the (weighted) co-expression of all genes with that pathway. The source of the variation in the weights between the datasets can be technical, e.g. variation in the probes that have been used, or biological, e.g. variation in the tissues in which gene expression has been measured. The important assumption behind the method is that new genes for a pathway are significantly co-expressed with the majority of the genes of a pathway that they belong to, rather than only with some of its members. This, in turn, depends on the pathway definition. To aid in finding the genes from a pathway that co-express with each other, the results include a visualization of co-expression between query genes displayed as a network. This allows the user to select a subset of co-expressed genes from that pathway to repeat the procedure.
WeGET VALIDATION AND COMPARISON TO OTHER CO-EXPRESSION DATABASES
To assess and compare the predictive power of different co-expression methods (see Supplementary Table SI1), we used two query gene sets (11): 19 query genes in the cholesterol biosynthesis pathway and 76 genes involved in oxidative phosphorylation (OXPHOS). We manually performed leave-one-out or 10-fold cross validation by multiple submissions (see Supplementary Methods for details). We took into account the top 100 ranked genes as a likely use case scenario. Figure 2 shows the WeGET results for the cholesterol biosynthesis pathway and OXPHOS system compared to other online tools employing the co-expression analysis.
Figure 2.
ROC performance curves for online co-expression tools (see Supplementary Table SI1). Performance measured by multiple cross-validation runs is indicated by the area under the curve (AUC) for the top 100 genes corresponding to a typical use case scenario. (A) Results for 19 genes in the cholesterol pathway using leave-one-out cross-validation. (B) Results for 10-fold cross-validation in the oxidative phosphorylation (OXPHOS) query set.
ROC performance curves for online co-expression tools (see Supplementary Table SI1). Performance measured by multiple cross-validation runs is indicated by the area under the curve (AUC) for the top 100 genes corresponding to a typical use case scenario. (A) Results for 19 genes in the cholesterol pathway using leave-one-out cross-validation. (B) Results for 10-fold cross-validation in the oxidative phosphorylation (OXPHOS) query set.Baughman et al. (11) carried out one-time computations for cholesterol and OXPHOS datasets. The Weget webserver achieves identical (cholesterol) or marginally better performance (OXPHOS, 86.4% sensitivity at 99.8% specificity, compared to 85% and 99.4%, respectively, see Supplementary Figure SI1).
THE WeGET DATABASE AND WEB ACCESS
Figure 3 depicts the architecture of the WeGET database. Human pathways and their associated genes from GO and KEGG are stored in a central database. The WeGET parallel algorithm that calculates each dataset on a separate thread precomputes the co-expressed genes and dataset weights for all pathways using the transcriptome compendium. The results are presented to the user using the WeGET webtool (implemented in Python Flask) and can additionally be downloaded.
Figure 3.
The WeGET system architecture. Results for predefined pathways (GO, KEGG and Reactome) are precomputed and exposed through the WeGET webtool. Custom defined gene sets can be analyzed by submitting the gene ids or gene symbols to the webserver.
The WeGET system architecture. Results for predefined pathways (GO, KEGG and Reactome) are precomputed and exposed through the WeGET webtool. Custom defined gene sets can be analyzed by submitting the gene ids or gene symbols to the webserver.On the WeGET website, pathways are shown in a data grid (Figure 4), which can be sorted and searched. Detailed information (Figure 5), such as the best scoring genes, dataset weights, cell and tissue types in which the genes highly co-express (see also Supplementary Figures SI2 and SI3) and cross-validation results are shown when a row entry is selected. A pathway can be accessed directly as weget.cmbi.umcn.nl/pathwaydb/identifier where pathwaydb denotes the pathway database (one of: GO, KEGG or Reactome) and identifier the category identification (e.g. http://weget.cmbi.umcn.nl/GO/GO:0000398). User queries (different than the predefined sets) can be entered using the ‘Custom pathway’ tab, specifying genes as Entrez ID or HUGO gene symbol. The query is then scheduled for analysis. After the analysis, the user receives an email with results, including the cross-validation and a network that displays the co-regulation of the query genes within the datasets (see below), in a spreadsheet.
Figure 4.
The GO data grid. Users can browse or search the list of Gene Ontology (GO) terms. When a row is clicked, detailed information is provided (see Figure 5). KEGG and Reactome pathways can be accessed in a similar fashion.
Figure 5.
Detailed information for a precomputed term/pathway. The query genes, co-expressed genes, dataset weights and cross-validation results are shown.
The GO data grid. Users can browse or search the list of Gene Ontology (GO) terms. When a row is clicked, detailed information is provided (see Figure 5). KEGG and Reactome pathways can be accessed in a similar fashion.Detailed information for a precomputed term/pathway. The query genes, co-expressed genes, dataset weights and cross-validation results are shown.The website provides an opportunity to learn more about the experimental conditions in which the concordant expression of the query molecular system has been observed. The tab ‘Dataset Weights’ accessible for each precomputed query set lists GEO dataset records with concordant expression patterns of the query system, indicating congruent co-expression of the gene components of the molecular system. The dataset identifier is directly hyperlinked with the GEO entry description (both online and in Excel output file) such that users can read details of the experiment that lead to harmonious expression of the query set.
EVALUATION OF WeGET RESULTS FOR A QUERY SET
The robustness of the results is tested by k-fold cross-validation and graphically displayed with a ROC curve. The curve illustrates the performance of the WeGET method, by plotting the true positive rate (successfully cross-validated query genes) versus all human genes (Figure 6). The curve is plotted for every molecular system stored in the database (GO, KEGG and Reactome pathways) separately. The area under the curve (AUC) is a measure of the prediction quality and robustness for that pathway. The average AUC computed for all pathways is around 0.7. Well-studied and clearly defined cellular components such as mitochondria and biological processes such as cilium movement and assembly have a higher AUC (average 0.83 and 0.84, respectively) reflecting their concordant expression patterns. For pathways with less than 50 genes we use leave-one-out cross-validation, for larger pathways 10-fold cross-validation is carried out.
Figure 6.
Visualization of performance of the WeGET results for neuropathic pain genes. (A) ROC curves for the neuropathic pain query set (Table 1). The X axis represents fraction of human genes, the Y axis the fraction of the neuropathic pain molecular system. Shown are ROC curves for final results (blue), the cross-validation (CV) of integrated datasets (green), the average co-expression across all datasets (integration with equal contribution of each dataset) with CV (red) and results of co-expression within a single high-weight dataset (GDS1634, a nodose and dorsal root ganglia comparison, cyan) (22). (B) Network visualization of the co-expression allows identification of genes less co-expressed with the core of the query set.
Visualization of performance of the WeGET results for neuropathic pain genes. (A) ROC curves for the neuropathic pain query set (Table 1). The X axis represents fraction of human genes, the Y axis the fraction of the neuropathic pain molecular system. Shown are ROC curves for final results (blue), the cross-validation (CV) of integrated datasets (green), the average co-expression across all datasets (integration with equal contribution of each dataset) with CV (red) and results of co-expression within a single high-weight dataset (GDS1634, a nodose and dorsal root ganglia comparison, cyan) (22). (B) Network visualization of the co-expression allows identification of genes less co-expressed with the core of the query set.
Table 1.
Genes implicated in neuropathic pain collected from the literature.
No.
Gene Symbol
Sub-system
No.
Gene Symbol
Sub-system
1
ACTG1
CORE
11
DPYSL2
TAG
2
ANK3
CORE
12
KCNK3
TAG
3
SCN10A
CORE
13
NRCAM
TAG
4
SCN11A
CORE
14
ANXA2
TAG
5
SCN1B
CORE
15
PRKACA
TAG
6
SCN2B
CORE
16
PRKCB
TAG
7
SCN3A
CORE
17
SYN2
TAG
8
SCN3B
CORE
18
TNR
TAG
9
SCN8A
CORE
19
MSN
PI
10
SPTBN4
CORE
20
NEDD4L
PI
Core molecular sub-system associated with voltage-gated sodium channels (CORE), trafficking-associated genes (TAG) and the peripheral involvement (PI) classes are indicated (23–25).
Finally, the cohesion of the query gene set is displayed as a network using a node-force algorithm (Figure 6B). Query genes that consistently co-express perform a large attractive force and therefore cluster together. In contrast, genes that show little evidence of co-regulation exhibit a smaller force and do not cluster with the other query genes. Using this visualization, the user can resubmit the query gene set to omit genes that do not show evidence of co-regulation.
USING WeGET TO PREDICT GENES INVOLVED IN NEUROPATHIC PAIN
Previous studies indicate that mutations in genes coding for voltage-gated sodium channels and related processes may impair the nociceptive pathway and influence response to pain stimuli (17). From the literature we collected genes implicated in neuropathic pain (Table 1) and used the WeGET database to predict novel candidate genes for this pathway. Table 2 shows genes co-expressing with the neuropathic pain molecular system as calculated by WeGET. Next to sodium channels and its regulators (PRMT8, UNC80 rank 41 and 74, respectively) also genes of the voltage-gated potassium system are strongly represented among top co-expressing genes (MAP1A, PPP2R2C, KCNH3, KCNQ2 rank 6, 7, 66 and 87, respectively) consistent with their involvement in nociceptive processing (18), their expression in dorsal root ganglion neurons analogous to voltage-gated sodium channels (19) and with recently discovered genetic variants that modulate neuropathic pain (20). The PIEZO2 gene, a nociceptive component mechanically activated in nerve endings (21) ranks 74th among all genes. Additional poorly characterized genes such as SERP2, TMEM130 and CCDC155 (ranks 5, 9 and 20, respectively) are also present among genes co-expressing with the system and constitute novel candidate genes for nociceptive pathway. Figure 6A shows the higher performance of WeGET integration of all datasets (cross-validated AUC = 0.82) compared to integration of all datasets with equal weights (average co-expression across all experiments, AUC = 0.71) and a high-weight individual dataset GDS1634 of dorsal root ganglia neurons (AUC = 0.68). Weights assigned to GEO datasets reveal a high contribution of transcriptome measurements related to neurons: a murine nodose and dorsal root ganglia study (GDS1634, weight 3.0) (22), gene expression in human neurofibrillary tangles (GDS2795, weight 2.5) and DNA methylation effect on neural stem cells (GDS538, weight 3.0). The peripheral roles of DPYSL2 (trafficking subset, Table 1), MSN and NEDD4L proteins (peripheral subclass) are visualized in the query gene network (Figure 6B). Currently we screen patients with a familial form of neuropathic pain for genetic variants that may impact the function of the candidate genes.
Table 2.
Results from custom molecular system as received by the user. Top 40 genes prioritized for their involvement in neuropathic pain are shown. Genes that were part of the query set are shown on shaded background.
Core molecular sub-system associated with voltage-gated sodium channels (CORE), trafficking-associated genes (TAG) and the peripheral involvement (PI) classes are indicated (23–25).
Authors: Christian von Mering; Roland Krause; Berend Snel; Michael Cornell; Stephen G Oliver; Stanley Fields; Peer Bork Journal: Nature Date: 2002-05-08 Impact factor: 49.962
Authors: Sarah L Pollema-Mays; Maria Virginia Centeno; Crystle J Ashford; A Vania Apkarian; Marco Martina Journal: Mol Cell Neurosci Date: 2013-08-29 Impact factor: 4.314
Authors: Xianghong Jasmine Zhou; Ming-Chih J Kao; Haiyan Huang; Angela Wong; Juan Nunez-Iglesias; Michael Primig; Oscar M Aparicio; Caleb E Finch; Todd E Morgan; Wing Hung Wong Journal: Nat Biotechnol Date: 2005-01-16 Impact factor: 54.908
Authors: Pieter J Peeters; Jeroen Aerssens; Ronald de Hoogt; Andrzej Stanisz; Hinrich W Göhlmann; Kirk Hillsley; Ann Meulemans; David Grundy; Ronald H Stead; Bernard Coulie Journal: Physiol Genomics Date: 2005-11-22 Impact factor: 3.107
Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
Authors: Khalid Zuberi; Max Franz; Harold Rodriguez; Jason Montojo; Christian Tannus Lopes; Gary D Bader; Quaid Morris Journal: Nucleic Acids Res Date: 2013-07 Impact factor: 16.971
Authors: Milena Ślęczkowska; Rowida Almomani; Margherita Marchi; Bianca T A de Greef; Maurice Sopacua; Janneke G J Hoeijmakers; Patrick Lindsey; Erika Salvi; Gidon J Bönhof; Dan Ziegler; Rayaz A Malik; Stephen G Waxman; Giuseppe Lauria; Catharina G Faber; Hubert J M Smeets; Monique M Gerrits Journal: Int J Mol Sci Date: 2022-06-28 Impact factor: 6.208
Authors: Teunis J P van Dam; Julie Kennedy; Robin van der Lee; Erik de Vrieze; Kirsten A Wunderlich; Suzanne Rix; Gerard W Dougherty; Nils J Lambacher; Chunmei Li; Victor L Jensen; Michel R Leroux; Rim Hjeij; Nicola Horn; Yves Texier; Yasmin Wissinger; Jeroen van Reeuwijk; Gabrielle Wheway; Barbara Knapp; Jan F Scheel; Brunella Franco; Dorus A Mans; Erwin van Wijk; François Képès; Gisela G Slaats; Grischa Toedt; Hannie Kremer; Heymut Omran; Katarzyna Szymanska; Konstantinos Koutroumpas; Marius Ueffing; Thanh-Minh T Nguyen; Stef J F Letteboer; Machteld M Oud; Sylvia E C van Beersum; Miriam Schmidts; Philip L Beales; Qianhao Lu; Rachel H Giles; Radek Szklarczyk; Robert B Russell; Toby J Gibson; Colin A Johnson; Oliver E Blacque; Uwe Wolfrum; Karsten Boldt; Ronald Roepman; Victor Hernandez-Hernandez; Martijn A Huynen Journal: PLoS One Date: 2019-05-16 Impact factor: 3.240
Authors: Selma L van Esveld; Şirin Cansız-Arda; Fenna Hensen; Robin van der Lee; Martijn A Huynen; Johannes N Spelbrink Journal: Front Cell Dev Biol Date: 2019-11-15
Authors: Hao Li; Daria Rukina; Fabrice P A David; Terytty Yang Li; Chang-Myung Oh; Arwen W Gao; Elena Katsyuba; Maroun Bou Sleiman; Andrea Komljenovic; Qingyao Huang; Robert W Williams; Marc Robinson-Rechavi; Kristina Schoonjans; Stephan Morgenthaler; Johan Auwerx Journal: Genome Res Date: 2019-11-21 Impact factor: 9.043
Authors: Job A J Verdonschot; Emma L Robinson; Kiely N James; Mohamed W Mohamed; Godelieve R F Claes; Kari Casas; Els K Vanhoutte; Mark R Hazebroek; Gabriel Kringlen; Michele M Pasierb; Arthur van den Wijngaard; Jan F C Glatz; Stephane R B Heymans; Ingrid P C Krapels; Shareef Nahas; Han G Brunner; Radek Szklarczyk Journal: Mol Genet Genomic Med Date: 2019-12-27 Impact factor: 2.183