Literature DB >> 26582928

WeGET: predicting new genes for molecular systems by weighted co-expression.

Radek Szklarczyk¹, Wout Megchelenbrink², Pavel Cizek³, Marie Ledent⁴, Gonny Velemans³, Damian Szklarczyk⁵, Martijn A Huynen⁶.

Abstract

We have developed the Weighted Gene Expression Tool and database (WeGET, http://weget.cmbi.umcn.nl) for the prediction of new genes of a molecular system by correlated gene expression. WeGET utilizes a compendium of 465 human and 560 murine gene expression datasets that have been collected from multiple tissues under a wide range of experimental conditions. It exploits this abundance of expression data by assigning a high weight to datasets in which the known genes of a molecular system are harmoniously up- and down-regulated. WeGET ranks new candidate genes by calculating their weighted co-expression with that system. A weighted rank is calculated for human genes and their mouse orthologs. Then, an integrated gene rank and p-value is computed using a rank-order statistic. We applied our method to predict novel genes that have a high degree of co-expression with Gene Ontology terms and pathways from KEGG and Reactome. For each query set we provide a list of predicted novel genes, computed weights for transcription datasets used and cell and tissue types that contributed to the final predictions. The performance for each query set is assessed by 10-fold cross-validation. Finally, users can use the WeGET to predict novel genes that co-express with a custom query set.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2015 PMID： 26582928 PMCID： PMC4702868 DOI： 10.1093/nar/gkv1228

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Ever since the publication of the first gene expression arrays, the correlated expression of genes involved in a related molecular process has been used to predict functional relations between gene pairs (1). Large amounts of microarray and RNA-seq transcript expression, measured under a plethora of conditions enable mining for concordantly expressed genes. Indeed, this concept has been successfully employed in databases such as COEXPRESSdb, GeneFriends, GeneMANIA and STARNET 2 (2–5). Nevertheless, relative to other types of genomics data, co-expression has lower sensitivity and selectivity (6). To improve the quality of the predictions, various strategies have been applied, like exploiting the conservation of co-expression between species (7), combining many gene expression datasets (8,9) or biclustering datasets to identify groups of genes that co-express within a subset of the experiments (see (10) for a review). Expression screening, an extension of biclustering methods (11), weighs gene expression datasets based on the co-expression of genes within a molecular system and uses those weights to predict new genes involved in that system. It has been successfully applied to predict new mitochondrial proteins essential for the organelle (11) and to discover new players in heme biosynthesis (12). The principle behind this method is appealing: it systematically exploits the available gene expression data and, via its weighting scheme, implicitly solves the question facing many researchers: which gene expression data to use to predict new genes for a pathway? Nevertheless, it is computationally costly, as the weighting has to be recalculated for each pathway separately and additional cross validation requires multiple runs per pathway. We have therefore developed and implemented a fast expression screening algorithm that includes a dataset weighting and allows for the rapid computation of genes that co-regulate with a query gene set. Our algorithm was employed to compile a weighted co-expression database for all Gene Ontology (GO) terms and human pathways annotated in the KEGG and Reactome databases. Furthermore, we provide information regarding the original experimental setup of the highly weighted datasets. In particular, WeGet reports the cell and tissue types in which the query genes are consistently up- and down-regulated with each other. Finally, the robustness of the predicted results is assessed by 10-fold cross-validation and reported as the receiver operating characteristic (ROC) curve. We compared WeGET with five popular web tools and databases that predict novel genes based on their co-expression with specified query gene sets, using two query gene sets published by Baughman et al. (ref 11) and show that indeed, weighting the datasets results in improved precision, in particular at low recall rates (the top 100 genes). The complete WeGET database, together with a custom query submission system, is available through the WeGET website.

THE WeGET ANALYSIS PIPELINE

WeGET uses a compendium of 465 human and 560 murine gene expression datasets ranging from 6 to 192 samples per dataset. In total, ∼30 000 samples from multiple mammalian platforms were collected from the Gene Expression Omnibus (GEO) (13). The WeGET computational pipeline starts with selecting the normalized expression values for all probes associated with the query genes. For genes with multiple probes, the probe with the highest average Pearson correlation coefficient with all other query probes is selected. Subsequently, the pipeline calculates the average Pearson correlation between each gene and the set of query genes in every dataset (Figure 1). Then, all probes are ranked based on their average correlation with the query probes and mapped back to their associated gene. Each gene i obtains a score s depending on the fraction of the query set that has been ranked above that gene. These calculations are then repeated four times for the same query set and gene expression dataset, where the expression values have been randomly permuted between the genes in every measurement. This step estimates the number of genes that are expected to highly correlate with the query set in a random model. To calculate the dataset weight, an N100 value is calculated that is the fraction of query genes found among the top 100 genes with highest average Pearson correlation. The ratio between the N100 from the original dataset and the average N100 value for the randomized datasets constitutes the weight of the experiments. A species score is the weighted average of all its datasets. The final ranked gene list is obtained by integrating the ranked human and mouse list (mouse genes that are unambiguous human one-to-one orthologs). This is performed using the RobustRankAggreg R-package (14) that computes the final gene rankings using a rank-order statistic (15,16).

Figure 1.

WeGET computational pipeline used to create the database. (A) Determining the dataset weight wdataset. The transcriptome measurements are converted into a correlation matrix. The average correlation with the query set (sgene) is used for gene ranking and the dataset weight calculation wdataset. (B) Data integration across datasets, platforms and species. Gene scores sgene from all datasets are combined taking into account the precomputed weights. Subsequently different transcriptome platforms and species data are integrated to arrive at the final ranking. The process is repeated after excluding each query gene to construct a receiver operating characteristic (ROC) curve that visualizes predictive power of the method for a specific query set of genes. Thus for each set of expression data the pipeline measures whether genes in a given pathway are co-expressed with each other better than expected and uses that to assign weights to that expression dataset. These weights are subsequently used in determining the (weighted) co-expression of all genes with that pathway. The source of the variation in the weights between the datasets can be technical, e.g. variation in the probes that have been used, or biological, e.g. variation in the tissues in which gene expression has been measured. The important assumption behind the method is that new genes for a pathway are significantly co-expressed with the majority of the genes of a pathway that they belong to, rather than only with some of its members. This, in turn, depends on the pathway definition. To aid in finding the genes from a pathway that co-express with each other, the results include a visualization of co-expression between query genes displayed as a network. This allows the user to select a subset of co-expressed genes from that pathway to repeat the procedure.

WeGET VALIDATION AND COMPARISON TO OTHER CO-EXPRESSION DATABASES

To assess and compare the predictive power of different co-expression methods (see Supplementary Table SI1), we used two query gene sets (11): 19 query genes in the cholesterol biosynthesis pathway and 76 genes involved in oxidative phosphorylation (OXPHOS). We manually performed leave-one-out or 10-fold cross validation by multiple submissions (see Supplementary Methods for details). We took into account the top 100 ranked genes as a likely use case scenario. Figure 2 shows the WeGET results for the cholesterol biosynthesis pathway and OXPHOS system compared to other online tools employing the co-expression analysis.

Figure 2.

ROC performance curves for online co-expression tools (see Supplementary Table SI1). Performance measured by multiple cross-validation runs is indicated by the area under the curve (AUC) for the top 100 genes corresponding to a typical use case scenario. (A) Results for 19 genes in the cholesterol pathway using leave-one-out cross-validation. (B) Results for 10-fold cross-validation in the oxidative phosphorylation (OXPHOS) query set. Baughman et al. (11) carried out one-time computations for cholesterol and OXPHOS datasets. The Weget webserver achieves identical (cholesterol) or marginally better performance (OXPHOS, 86.4% sensitivity at 99.8% specificity, compared to 85% and 99.4%, respectively, see Supplementary Figure SI1).

THE WeGET DATABASE AND WEB ACCESS

Figure 3 depicts the architecture of the WeGET database. Human pathways and their associated genes from GO and KEGG are stored in a central database. The WeGET parallel algorithm that calculates each dataset on a separate thread precomputes the co-expressed genes and dataset weights for all pathways using the transcriptome compendium. The results are presented to the user using the WeGET webtool (implemented in Python Flask) and can additionally be downloaded.

Figure 3.

The WeGET system architecture. Results for predefined pathways (GO, KEGG and Reactome) are precomputed and exposed through the WeGET webtool. Custom defined gene sets can be analyzed by submitting the gene ids or gene symbols to the webserver. On the WeGET website, pathways are shown in a data grid (Figure 4), which can be sorted and searched. Detailed information (Figure 5), such as the best scoring genes, dataset weights, cell and tissue types in which the genes highly co-express (see also Supplementary Figures SI2 and SI3) and cross-validation results are shown when a row entry is selected. A pathway can be accessed directly as weget.cmbi.umcn.nl/pathwaydb/identifier where pathwaydb denotes the pathway database (one of: GO, KEGG or Reactome) and identifier the category identification (e.g. http://weget.cmbi.umcn.nl/GO/GO:0000398). User queries (different than the predefined sets) can be entered using the ‘Custom pathway’ tab, specifying genes as Entrez ID or HUGO gene symbol. The query is then scheduled for analysis. After the analysis, the user receives an email with results, including the cross-validation and a network that displays the co-regulation of the query genes within the datasets (see below), in a spreadsheet.

Figure 4.

Figure 5.

Detailed information for a precomputed term/pathway. The query genes, co-expressed genes, dataset weights and cross-validation results are shown.

The GO data grid. Users can browse or search the list of Gene Ontology (GO) terms. When a row is clicked, detailed information is provided (see Figure 5). KEGG and Reactome pathways can be accessed in a similar fashion. Detailed information for a precomputed term/pathway. The query genes, co-expressed genes, dataset weights and cross-validation results are shown. The website provides an opportunity to learn more about the experimental conditions in which the concordant expression of the query molecular system has been observed. The tab ‘Dataset Weights’ accessible for each precomputed query set lists GEO dataset records with concordant expression patterns of the query system, indicating congruent co-expression of the gene components of the molecular system. The dataset identifier is directly hyperlinked with the GEO entry description (both online and in Excel output file) such that users can read details of the experiment that lead to harmonious expression of the query set.

EVALUATION OF WeGET RESULTS FOR A QUERY SET

The robustness of the results is tested by k-fold cross-validation and graphically displayed with a ROC curve. The curve illustrates the performance of the WeGET method, by plotting the true positive rate (successfully cross-validated query genes) versus all human genes (Figure 6). The curve is plotted for every molecular system stored in the database (GO, KEGG and Reactome pathways) separately. The area under the curve (AUC) is a measure of the prediction quality and robustness for that pathway. The average AUC computed for all pathways is around 0.7. Well-studied and clearly defined cellular components such as mitochondria and biological processes such as cilium movement and assembly have a higher AUC (average 0.83 and 0.84, respectively) reflecting their concordant expression patterns. For pathways with less than 50 genes we use leave-one-out cross-validation, for larger pathways 10-fold cross-validation is carried out.

Figure 6.

Visualization of performance of the WeGET results for neuropathic pain genes. (A) ROC curves for the neuropathic pain query set (Table 1). The X axis represents fraction of human genes, the Y axis the fraction of the neuropathic pain molecular system. Shown are ROC curves for final results (blue), the cross-validation (CV) of integrated datasets (green), the average co-expression across all datasets (integration with equal contribution of each dataset) with CV (red) and results of co-expression within a single high-weight dataset (GDS1634, a nodose and dorsal root ganglia comparison, cyan) (22). (B) Network visualization of the co-expression allows identification of genes less co-expressed with the core of the query set.

Table 1.

Genes implicated in neuropathic pain collected from the literature.

No.	Gene Symbol	Sub-system	No.	Gene Symbol	Sub-system
1	ACTG1	CORE	11	DPYSL2	TAG
2	ANK3	CORE	12	KCNK3	TAG
3	SCN10A	CORE	13	NRCAM	TAG
4	SCN11A	CORE	14	ANXA2	TAG
5	SCN1B	CORE	15	PRKACA	TAG
6	SCN2B	CORE	16	PRKCB	TAG
7	SCN3A	CORE	17	SYN2	TAG
8	SCN3B	CORE	18	TNR	TAG
9	SCN8A	CORE	19	MSN	PI
10	SPTBN4	CORE	20	NEDD4L	PI

Core molecular sub-system associated with voltage-gated sodium channels (CORE), trafficking-associated genes (TAG) and the peripheral involvement (PI) classes are indicated (23–25).

Finally, the cohesion of the query gene set is displayed as a network using a node-force algorithm (Figure 6B). Query genes that consistently co-express perform a large attractive force and therefore cluster together. In contrast, genes that show little evidence of co-regulation exhibit a smaller force and do not cluster with the other query genes. Using this visualization, the user can resubmit the query gene set to omit genes that do not show evidence of co-regulation.

USING WeGET TO PREDICT GENES INVOLVED IN NEUROPATHIC PAIN

Previous studies indicate that mutations in genes coding for voltage-gated sodium channels and related processes may impair the nociceptive pathway and influence response to pain stimuli (17). From the literature we collected genes implicated in neuropathic pain (Table 1) and used the WeGET database to predict novel candidate genes for this pathway. Table 2 shows genes co-expressing with the neuropathic pain molecular system as calculated by WeGET. Next to sodium channels and its regulators (PRMT8, UNC80 rank 41 and 74, respectively) also genes of the voltage-gated potassium system are strongly represented among top co-expressing genes (MAP1A, PPP2R2C, KCNH3, KCNQ2 rank 6, 7, 66 and 87, respectively) consistent with their involvement in nociceptive processing (18), their expression in dorsal root ganglion neurons analogous to voltage-gated sodium channels (19) and with recently discovered genetic variants that modulate neuropathic pain (20). The PIEZO2 gene, a nociceptive component mechanically activated in nerve endings (21) ranks 74th among all genes. Additional poorly characterized genes such as SERP2, TMEM130 and CCDC155 (ranks 5, 9 and 20, respectively) are also present among genes co-expressing with the system and constitute novel candidate genes for nociceptive pathway. Figure 6A shows the higher performance of WeGET integration of all datasets (cross-validated AUC = 0.82) compared to integration of all datasets with equal weights (average co-expression across all experiments, AUC = 0.71) and a high-weight individual dataset GDS1634 of dorsal root ganglia neurons (AUC = 0.68). Weights assigned to GEO datasets reveal a high contribution of transcriptome measurements related to neurons: a murine nodose and dorsal root ganglia study (GDS1634, weight 3.0) (22), gene expression in human neurofibrillary tangles (GDS2795, weight 2.5) and DNA methylation effect on neural stem cells (GDS538, weight 3.0). The peripheral roles of DPYSL2 (trafficking subset, Table 1), MSN and NEDD4L proteins (peripheral subclass) are visualized in the query gene network (Figure 6B). Currently we screen patients with a familial form of neuropathic pain for genetic variants that may impact the function of the candidate genes.

Table 2.

Results from custom molecular system as received by the user. Top 40 genes prioritized for their involvement in neuropathic pain are shown. Genes that were part of the query set are shown on shaded background.

Core molecular sub-system associated with voltage-gated sodium channels (CORE), trafficking-associated genes (TAG) and the peripheral involvement (PI) classes are indicated (23–25).

25 in total

1. Comparative assessment of large-scale data sets of protein-protein interactions.

Authors: Christian von Mering; Roland Krause; Berend Snel; Michael Cornell; Stephen G Oliver; Stanley Fields; Peer Bork
Journal: Nature Date: 2002-05-08 Impact factor: 49.962

2. Expression of background potassium channels in rat DRG is cell-specific and down-regulated in a neuropathic pain model.

Authors: Sarah L Pollema-Mays; Maria Virginia Centeno; Crystle J Ashford; A Vania Apkarian; Marco Martina
Journal: Mol Cell Neurosci Date: 2013-08-29 Impact factor: 4.314

3. Functional annotation and network reconstruction through cross-platform integration of microarray data.

Authors: Xianghong Jasmine Zhou; Ming-Chih J Kao; Haiyan Huang; Angela Wong; Juan Nunez-Iglesias; Michael Primig; Oscar M Aparicio; Caleb E Finch; Todd E Morgan; Wing Hung Wong
Journal: Nat Biotechnol Date: 2005-01-16 Impact factor: 54.908

4. Gene prioritization through genomic data fusion.

Authors: Stein Aerts; Diether Lambrechts; Sunit Maity; Peter Van Loo; Bert Coessens; Frederik De Smet; Leon-Charles Tranchevent; Bart De Moor; Peter Marynen; Bassem Hassan; Peter Carmeliet; Yves Moreau
Journal: Nat Biotechnol Date: 2006-05 Impact factor: 54.908

5. Molecular profiling of murine sensory neurons in the nodose and dorsal root ganglia labeled from the peritoneal cavity.

Authors: Pieter J Peeters; Jeroen Aerssens; Ronald de Hoogt; Andrzej Stanisz; Hinrich W Göhlmann; Kirk Hillsley; Ann Meulemans; David Grundy; Ronald H Stead; Bernard Coulie
Journal: Physiol Genomics Date: 2005-11-22 Impact factor: 3.107

Review 6. Trafficking mechanisms underlying neuronal voltage-gated ion channel localization at the axon initial segment.

Authors: Helene Vacher; James S Trimmer
Journal: Epilepsia Date: 2012-12 Impact factor: 5.864

7. NCBI GEO: archive for functional genomics data sets--update.

Authors: Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

8. GeneMANIA prediction server 2013 update.

Authors: Khalid Zuberi; Max Franz; Harold Rodriguez; Jason Montojo; Christian Tannus Lopes; Gary D Bader; Quaid Morris
Journal: Nucleic Acids Res Date: 2013-07 Impact factor: 16.971

9. COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals.

Authors: Takeshi Obayashi; Yasunobu Okamura; Satoshi Ito; Shu Tadaka; Ikuko N Motoike; Kengo Kinoshita
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

Review 10. Opening paths to novel analgesics: the role of potassium channels in chronic pain.

Authors: Christoforos Tsantoulas; Stephen B McMahon
Journal: Trends Neurosci Date: 2014-01-21 Impact factor: 13.837

13 in total

1. COSCEB: Comprehensive search for column-coherent evolution biclusters and its application to hub gene identification.

Authors: Ankush Maind; Shital Raut
Journal: J Biosci Date: 2019-06 Impact factor: 1.826

2. Peripheral Ion Channel Gene Screening in Painful- and Painless-Diabetic Neuropathy.

Authors: Milena Ślęczkowska; Rowida Almomani; Margherita Marchi; Bianca T A de Greef; Maurice Sopacua; Janneke G J Hoeijmakers; Patrick Lindsey; Erika Salvi; Gidon J Bönhof; Dan Ziegler; Rayaz A Malik; Stephen G Waxman; Giuseppe Lauria; Catharina G Faber; Hubert J M Smeets; Monique M Gerrits
Journal: Int J Mol Sci Date: 2022-06-28 Impact factor: 6.208

3. Exploiting single-cell expression to characterize co-expression replicability.

Authors: Megan Crow; Anirban Paul; Sara Ballouz; Z Josh Huang; Jesse Gillis
Journal: Genome Biol Date: 2016-05-06 Impact factor: 13.583

4. Genome-wide association studies identified multiple genetic loci for body size at four growth stages in Chinese Holstein cattle.

Authors: Xu Zhang; Qin Chu; Gang Guo; Ganghui Dong; Xizhi Li; Qin Zhang; Shengli Zhang; Zhiwu Zhang; Yachun Wang
Journal: PLoS One Date: 2017-04-20 Impact factor: 3.240

5. CLIC, a tool for expanding biological pathways based on co-expression across thousands of datasets.

Authors: Yang Li; Alexis A Jourdain; Sarah E Calvo; Jun S Liu; Vamsi K Mootha
Journal: PLoS Comput Biol Date: 2017-07-18 Impact factor: 4.475

6. Latent network-based representations for large-scale gene expression data analysis.

Authors: Wajdi Dhifli; Julia Puig; Aurélien Dispot; Mohamed Elati
Journal: BMC Bioinformatics Date: 2019-02-04 Impact factor: 3.169

7. CiliaCarta: An integrated and validated compendium of ciliary genes.

Authors: Teunis J P van Dam; Julie Kennedy; Robin van der Lee; Erik de Vrieze; Kirsten A Wunderlich; Suzanne Rix; Gerard W Dougherty; Nils J Lambacher; Chunmei Li; Victor L Jensen; Michel R Leroux; Rim Hjeij; Nicola Horn; Yves Texier; Yasmin Wissinger; Jeroen van Reeuwijk; Gabrielle Wheway; Barbara Knapp; Jan F Scheel; Brunella Franco; Dorus A Mans; Erwin van Wijk; François Képès; Gisela G Slaats; Grischa Toedt; Hannie Kremer; Heymut Omran; Katarzyna Szymanska; Konstantinos Koutroumpas; Marius Ueffing; Thanh-Minh T Nguyen; Stef J F Letteboer; Machteld M Oud; Sylvia E C van Beersum; Miriam Schmidts; Philip L Beales; Qianhao Lu; Rachel H Giles; Radek Szklarczyk; Robert B Russell; Toby J Gibson; Colin A Johnson; Oliver E Blacque; Uwe Wolfrum; Karsten Boldt; Ronald Roepman; Victor Hernandez-Hernandez; Martijn A Huynen
Journal: PLoS One Date: 2019-05-16 Impact factor: 3.240

8. A Combined Mass Spectrometry and Data Integration Approach to Predict the Mitochondrial Poly(A) RNA Interacting Proteome.

Authors: Selma L van Esveld; Şirin Cansız-Arda; Fenna Hensen; Robin van der Lee; Martijn A Huynen; Johannes N Spelbrink
Journal: Front Cell Dev Biol Date: 2019-11-15

9. Identifying gene function and module connections by the integration of multispecies expression compendia.

Authors: Hao Li; Daria Rukina; Fabrice P A David; Terytty Yang Li; Chang-Myung Oh; Arwen W Gao; Elena Katsyuba; Maroun Bou Sleiman; Andrea Komljenovic; Qingyao Huang; Robert W Williams; Marc Robinson-Rechavi; Kristina Schoonjans; Stephan Morgenthaler; Johan Auwerx
Journal: Genome Res Date: 2019-11-21 Impact factor: 9.043

10. Mutations in PDLIM5 are rare in dilated cardiomyopathy but are emerging as potential disease modifiers.

Authors: Job A J Verdonschot; Emma L Robinson; Kiely N James; Mohamed W Mohamed; Godelieve R F Claes; Kari Casas; Els K Vanhoutte; Mark R Hazebroek; Gabriel Kringlen; Michele M Pasierb; Arthur van den Wijngaard; Jan F C Glatz; Stephane R B Heymans; Ingrid P C Krapels; Shareef Nahas; Han G Brunner; Radek Szklarczyk
Journal: Mol Genet Genomic Med Date: 2019-12-27 Impact factor: 2.183