GPSy: a cross-species gene prioritization system for conserved biological processes--application in male gamete development.

Ramona Britto, Olivier Sallou, Olivier Collin, Grégoire Michaux, Michael Primig, Frédéric Chalmel.

Abstract

We present the gene prioritization system (GPSy), a cross-species tool that facilitates the arduous but critical task of prioritizing genes for follow-up functional analyses. GPSy's modular design with regard to species, data sets and scoring strategies enables users to formulate queries in a highly flexible manner. Currently, the system encompasses 20 topics related to conserved biological processes, including male gamete development, which is discussed in this article. The web server-based tool is freely available at http://gpsy.genouest.org.

Year: 2012 PMID: 22570409 PMCID: PMC3394256 DOI: 10.1093/nar/gks380

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

High-throughput technologies have generated a vast amount of biological information. However, it remains difficult for biologists and clinical researchers to identify, on the basis of these data, genes potentially important for a given biological process or for disorders related to it. When investigators weight and prioritize various sources of information according to their subjective perception of importance, a bias may be introduced. To tackle this critical problem, the bioinformatics field has developed a number of solutions for gene prioritization (1); these methods are typically based on the idea that genes whose expression patterns, subcellular localization, structural domains, molecular functions or physical interactions are similar to those of genes known to be important for a given biological process or pathology are likely to play critical roles as well. Alternatively, genes can be prioritized on the basis of domain-specific knowledge for specific diseases and biological processes (2,3). The tools available are either standalone applications (4–6) or solutions implemented on web servers (1). These systems exploit several data sources and many of them require known (‘training’) genes as a control (positive) reference set for prioritization (1,7–12). A number of these solutions bring together information from diverse sources, both within and across species, that is often too vast to be integrated manually. The existing solutions, while very useful, are limited in the choice of species, query options and coverage of data types. Moreover, none of them fully exploits multiple sources of information across species. The majority of existing approaches (Supplementary Table S1) are centered on human, some include several species (13–16), and others utilize data from one organism to drive prioritization in another species (4,11–13,17–22). Chen et al. (11) demonstrated that the inclusion of a single data type (phenotype) from an alternate organism (mouse) significantly improved prioritization of human disease candidates. Protein–protein interaction data from multiple organisms have also been shown to aid gene prioritization (12,21,22). This cross-species capability, however, is restricted to a single data type in each case.

Our lab has been developing and maintaining solutions for genome biological data management, data analysis and data dissemination (23–25) during the last decade. Here, we present the first release of the gene prioritization system (GPSy), which currently covers 20 topics related to conserved biological processes including cellular development and differentiation (3 topics), organ/tissue development (15 topics) and disorders/diseases (2 topics; see Supplementary Table S2 for a complete list). Users can query the system with genes from a list of 45 eukaryotic species including all major model organisms; it is possible to upload lists of genes identified via expression profiling, proteomics, genome-wide association (GWA) studies or even complete genomes. The submitted lists of genes are analysed using biological data falling into four broad categories (Sequence, Expression, Annotation and Association), each in combination with a specific ranking method (Figure 1A and Supplementary Table S3). Importantly, the ranking parameters are flexible, which enables users to attribute different weights and to select species of interest for each data type (Figure 1B). We provide an optimized weight scheme for each topic based on an evaluation of different weight combinations ranging from 1 to 10 for each data type. Taken together, these features allow for complex queries pertaining to very specific questions for each topic. We have successfully tested GPSy using worm homologs of mammalian candidate genes followed by validation using phenotypic data from high-throughput RNA interference (RNAi) studies in Caenorhabditis elegans (26) and our own manual RNAi experiments.

Figure 1.

Framework for the prioritization of candidate genes. (A) and (B) describe the steps involved in pre-processing and querying, respectively. (A) Lane 1 (Data categories and modules) shows a non-exhaustive list of modules falling into the four categories (Sequence, Expression, Annotation and Association) that were collected and curated from different species to drive gene prioritization. Lane 2 outlines the scoring strategies, one for each module. The species-wise ranking process that follows the scoring of individual genes is depicted in Lane 3. H, M, F, W and Y indicate the ranked lists for human, mouse, fly, worm and yeast, respectively. (B) The server accepts as input a gene list from any one of the 45 species (human, in the displayed example). Genes in the input list are mapped onto pre-computed ranked lists for selected species (Lane 4) and an intra-module rank is generated (Lane 5). Lane 6 (WS; Weight Scheme) highlights the weight applied to each module. Lanes 7 and 8 describe the final step in gene prioritization, the calculation of an inter-module weighted average rank for each gene. The output is the prioritized input list.

RESULTS

User interface: data input/data output

GPSy has a simple and intuitive interface including a Query tab, which enables users first to select one of the 20 currently available topics from a dropdown menu and then to define the query species. A text field is available to enter the list of candidates; alternatively, the user can request prioritization of 1000 randomly selected genes or the entire genome for the chosen species. Additionally, for human, a set of positive reference genes can be uploaded for each topic. Currently, GPSy only accepts Entrez Gene identifiers (IDs) because reliable and consistent gene ID conversion is a complex problem; users are referred to two up-to-date resources for gene ID unification over a wide range of organisms (27,28). It is possible to select individual species and data modules and to modify their weights (from 0 to 10) using the Advanced options tab (Figure 1B). By default, all data sets are selected for all available species (n = 45) and the preset parameters from the optimal weight scheme are applied. The output page displays the top 50 genes by default but users can change this setting as they deem appropriate. The result is displayed in the form of a table containing one gene per line with columns for Gene IDs (hyperlinked to the NCBI), Priority ranking, individual module ranks and other relevant information. The weight used in each module to compute the overall score is indicated in brackets. The output list is ordered (prioritized) according to the overall score; it can be reordered based on the ranks of individual modules. Information regarding the intra-module ranks is accessible through the magnifying glass icon. The table in the HTML output displays the top 1000 genes; the entire gene list and corresponding ranking information can be exported as an archive file (.tar) via the ‘Export results’ link at the bottom of the page. The welcome page includes a link to a brief tutorial for GPSy.

Species and homology

We assembled a map of conserved genes across the 45 eukaryotic species for which complete genome sequence information was available (Supplementary Table S3). Related homolog clusters from NCBI’s HomoloGene (29) and the OMA (Orthologous MAtrix) (30) projects were merged using verified homolog pairs (BLAST reciprocal best hits) as suggested by Roth et al. (31) (Supplementary Figure S2A).
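The merging step can be pictured with a small union-find sketch. This is only an illustration under stated assumptions, not the procedure described in the Supplementary Methods: the function name `merge_clusters`, the toy gene identifiers and the rule that clusters sharing a gene or linked by a verified pair are joined are all hypothetical choices made for the example.

```python
from collections import defaultdict


def merge_clusters(clusters_a, clusters_b, verified_pairs):
    """Join homolog clusters from two sources (hypothetical merging rule).

    Genes grouped by either source stay together; a verified pair
    (e.g. a reciprocal best hit) bridges the clusters of its two genes.
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    def union(g1, g2):
        r1, r2 = find(g1), find(g2)
        if r1 != r2:
            parent[r2] = r1

    # Keep every source cluster intact.
    for cluster in list(clusters_a) + list(clusters_b):
        genes = sorted(cluster)
        for gene in genes:
            union(genes[0], gene)

    # Bridge clusters only through verified pairs of known genes.
    for g1, g2 in verified_pairs:
        if g1 in parent and g2 in parent:
            union(g1, g2)

    merged = defaultdict(set)
    for gene in list(parent):
        merged[find(gene)].add(gene)
    return sorted(sorted(c) for c in merged.values())


if __name__ == "__main__":
    source_a = [{"h1", "m1"}]          # toy cluster from one resource
    source_b = [{"w1", "f1"}, {"y1"}]  # toy clusters from the other
    print(merge_clusters(source_a, source_b, [("m1", "w1")]))
```

In this toy run the verified pair ("m1", "w1") joins the two multi-gene clusters, while the unlinked gene "y1" stays in its own cluster.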

Modules and ranking

Thirteen different types of genomic data common to the included topics were assembled from various sources (Supplementary Table S1). These were organized into four data categories (Sequence, Expression, Annotation and Association), each associated with a unique scoring strategy. The integration of genome data sets with distinct scoring strategies forms the basis of GPSy’s modular architecture, allowing for maximum query flexibility (Figure 1A). The choice of data sources and scoring strategies is explained in detail in Supplementary Methods. In contrast to methods used in generic gene prioritization tools, the process-specific approach implemented in GPSy enables the pre-computation of module- and species-wise ranks, a feature that greatly accelerates the process of prioritization. When the system is queried, candidate genes in the input list are mapped onto the pre-computed ranked lists for the corresponding species. An intra-module weighted average rank is computed for each gene in the input list by combining the relative ranks for the input species according to every other selected species.
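A minimal sketch of how such an intra-module rank could be derived from pre-computed lists is given below. It assumes that a relative rank is the list position divided by the list length, that homologs absent from a list are skipped, and that species weights are simple multipliers; the function names and the toy identifiers are hypothetical and the exact formula is given in the Supplementary Methods, not here.

```python
def relative_rank(ranked_genes, gene):
    """Position of a gene in a ranked list scaled to (0, 1]; None if absent."""
    try:
        return (ranked_genes.index(gene) + 1) / len(ranked_genes)
    except ValueError:
        return None


def intra_module_rank(gene, homologs, ranked_lists, species_weights):
    """Weighted average of relative ranks across the selected species.

    homologs:        {query_gene: {species: homolog_id}} (assumed layout)
    ranked_lists:    {species: [gene ids, best first]}, pre-computed
    species_weights: {species: weight}; weight 0 deselects a species
    """
    total, weight_sum = 0.0, 0.0
    for species, ranked in ranked_lists.items():
        weight = species_weights.get(species, 0)
        if weight == 0:
            continue
        target = homologs.get(gene, {}).get(species)
        rank = relative_rank(ranked, target) if target else None
        if rank is None:
            continue  # no homolog, or homolog not scored in this species
        total += weight * rank
        weight_sum += weight
    return total / weight_sum if weight_sum else None


if __name__ == "__main__":
    lists = {"human": ["H1", "H2", "H3", "H4"], "mouse": ["M2", "M1"]}
    homologs = {"W1": {"human": "H1", "mouse": "M1"}}
    print(intra_module_rank("W1", homologs, lists, {"human": 1, "mouse": 1}))
```

Because the lists are computed ahead of time, a query reduces to lookups and a short average, which is consistent with the speed advantage described above.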

Positive and negative reference gene sets

Positive reference sets (PRSs) of genes known to be relevant for each topic were assembled for the 45 species and used for scoring genes in the Annotation and Association categories (Supplementary Table S5). For this purpose, information was gathered from the Gene Ontology and phenotype projects in various organisms. The ontological structure of these data allowed us to identify the ensemble of relevant annotation terms for each topic. This included ‘biological process’ terms from the Gene Ontology project (e.g. gamete generation) and species-specific phenotype terms (e.g. azoospermia; listed in Supplementary Table S4). Negative reference sets (NRSs) of 1000 randomly chosen genes not annotated with the selected terms were generated as controls. Note that the human PRS and NRS were employed in the Weightage optimization procedure.
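The construction of a negative reference set can be sketched as a random draw from the genes that carry none of the selected annotation terms. The function name `build_nrs`, the fixed seed and the toy identifiers are assumptions made for illustration only.

```python
import random


def build_nrs(all_genes, annotated_genes, size=1000, seed=0):
    """Draw `size` genes at random from those lacking the selected terms."""
    candidates = sorted(set(all_genes) - set(annotated_genes))
    if len(candidates) < size:
        raise ValueError("not enough unannotated genes to sample from")
    # A seeded generator keeps the draw reproducible (an assumption here).
    return random.Random(seed).sample(candidates, size)


if __name__ == "__main__":
    genome = [f"gene{i}" for i in range(5000)]  # toy genome
    prs = genome[:200]                          # toy positive reference set
    nrs = build_nrs(genome, prs)
    print(len(nrs), len(set(nrs) & set(prs)))
```

Sampling without replacement from the complement of the annotated genes guarantees that the PRS and NRS never overlap.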

Weightage optimization and overall prioritization

To assess the contribution of each module to overall prioritization, we tested the effectiveness of different weight combinations. We employed an approach similar to that of Sun et al. (2) to test different weight vectors (ranging from 1 to 10) in the 13 different modules for each topic (Supplementary Table S2). To evaluate the performance of each weight combination, a discrimination analysis method was employed. Sensitivity and specificity values were computed and a receiver operating characteristic (ROC) curve was plotted (1-Specificity versus Sensitivity). The area under this curve (AUC) corresponds to the probability that a random positive instance will score higher than a random negative instance (32). An AUC of 1 indicates that all PRS genes ranked above NRS genes; 0.5 indicates that the genes ranked randomly. As an exhaustive test of all weight combinations (2) is impractical (10 possible weights for each of 13 modules gives 10^13 weight schemes), we employed a heuristic approach to achieve a satisfactory discrimination of true positive (PRS) from true negative (NRS) candidates (Supplementary Methods). The overall rank of a given gene is an inter-module weighted average of the individual module ranks. The final output is a reordered list based on the overall ranking of each gene. A more detailed description of the pre-processing steps and overall prioritization can be found in Supplementary Methods.
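The two quantities in this section can be illustrated with a short sketch. It assumes that a smaller rank number means a better rank, that ties count as half a win in the AUC, and that modules with weight 0 are ignored; the function names and module labels are hypothetical, and the heuristic weight search itself is not reproduced here.

```python
def overall_rank(module_ranks, weights):
    """Inter-module weighted average of one gene's module ranks."""
    used = [m for m in module_ranks if weights.get(m, 0) > 0]
    denominator = sum(weights[m] for m in used)
    if denominator == 0:
        return None
    return sum(weights[m] * module_ranks[m] for m in used) / denominator


def auc(positive_ranks, negative_ranks):
    """Probability that a random positive outranks a random negative.

    Lower rank numbers are treated as better; ties count as half a win.
    """
    wins = 0.0
    for p in positive_ranks:
        for n in negative_ranks:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_ranks) * len(negative_ranks))


if __name__ == "__main__":
    # Hypothetical module names and weights for a single gene.
    print(overall_rank({"expression": 10, "annotation": 30},
                       {"expression": 3, "annotation": 1}))
    # Perfect separation of positives from negatives gives an AUC of 1.
    print(auc([1, 2, 3], [4, 5, 6]))
    # Size of the exhaustive search space: 10 weights for 13 modules.
    print(10 ** 13)
```

The pairwise definition of the AUC is quadratic in the set sizes, which is acceptable for a reference set of about 1000 genes; the search space of 10^13 schemes shows why each scheme cannot be scored exhaustively.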

Caenorhabditis elegans as a model for spermatogenesis

The worm is a key model organism for the high-throughput analysis of genes involved in meiotic development; these functional studies typically involve small interfering RNA (siRNA), which down-regulates mRNA expression (33). High-throughput RNAi studies are informative; however, they are often limited to detecting specific defects and are biased by a number of experimental artefacts such as wrongly annotated RNAi clones and false-positive or false-negative phenotype scores. Finally, the penetrance of a phenotype depends upon the technique used: RNAi feeding, where worms are bred on a layer of bacteria containing a plasmid expressing the siRNA, is less efficient than direct RNAi injection or the use of a bona fide gene deletion strain. To corroborate GPSy’s ranking output, we therefore decided to test the ability of a selected group of genes to induce a sterility or germ line defect phenotype in a strain background particularly sensitive to RNAi by the feeding method (Supplementary File S5). We first selected 56 C. elegans orthologues of mammalian genes previously identified in our lab as strongly induced in the worm and mouse germ line (34). Among the 56 genes investigated, 23 were associated with a reproductive phenotype (RP, corresponding to sterility or a germ line defect) when the union of results from our RNAi experiments (11 genes associated with RP; Supplementary File S4) and those of large-scale and individual studies available via Wormbase (18 genes associated with RP) was taken into consideration. These additional phenotypes, reported but not identified in our experiments, are likely due to different strain backgrounds and experimental approaches. The remaining 33 genes (non-RP set) showed no clearly detectable RP under the conditions we and others employed. Next, we prioritized the worm gene list (56 genes) with GPSy’s Spermatogenesis topic, using default weight settings and all species and modules with the exception of C. elegans phenotype data. The output list was integrated with phenotypic information from our and other experiments (23 RP and 33 non-RP genes; Figure 2A).

Figure 2.

Gene ranking and RNAi phenotypes. (A) The most relevant phenotypes are plotted for each gene in the prioritized candidate list (from the 1st to the 56th, x-axis). On the y-axis, phenotype classes are indicated: RP = reproduction-associated phenotype; LP = lethal phenotype; OP = other phenotype; None = no observable phenotype. Official gene symbols are displayed for all genes. (B) Displays receiver operating characteristic (ROC) curves for: (i) the candidate gene set (n = 56 genes) versus the C. elegans negative reference set (NRS; n = 1000; blue curve); (ii) the RP genes set (n = 23) versus NRS (red); (iii) the RP versus non-RP sets (union of LP, OP and None phenotype; n = 33; green). The corresponding area under the ROC curve (AUC) values are indicated. Note the significant improvement in AUC value between (ii) and (i). The AUC value for (iii) is significantly non-random. (C) Displays ROC curves for the discrimination of the C. elegans RP (n = 23) versus non-RP sets (n = 33) using GPSy (default settings, solid blue line), GPSy (C. elegans data only, dashed blue line), Endeavour (red) and Génie (green).

Combining the GPSy ranks with the validated phenotypic data suggests a promising pattern: we observe a tendency for genes associated with reproductive phenotypes (RP phenotype class) to receive a high rank in comparison to genes whose involvement in the gametogenic process could not be established (bottom of the list, non-RP classes; Figure 2A). Eight of the top 10 genes display a reproductive or lethal phenotype. These genes are discussed in Supplementary File S5. The lower half of the list has relatively few genes with documented germ line/sterility phenotypes. The overall trend for high-ranking genes to result in a sterility/germ line defect phenotype is also demonstrated by the reliable discrimination of genes associated with a reproductive phenotype (RP, n = 23) from a worm negative reference set (NRS, n = 1000) based on GPSy ranking (Figure 2B). Since the candidate list (n = 56) itself is expected to be enriched for PRS genes, its AUC is non-random (75.2%). This is, however, significantly lower than the AUC obtained with RP genes alone (86.2%). The ranking also demonstrated sufficient discriminability within the candidate list (RP versus non-RP; AUC = 71.9%). A chi-square test performed on the same set (RP genes against all others) revealed a statistically significant trend (P = 0.002). To illustrate the contribution of cross-species information, we subjected the gene list to GPSy prioritization without considering data from homologs in other species. The resulting difference in AUC value (0.582 versus 0.722) clearly illustrates the value of the cross-species approach (Figure 2C).

Comparison to other methods

We wanted to test GPSy’s ability to efficiently prioritize the worm candidate gene list in comparison to existing approaches. A comprehensive survey of freely available, web-based gene prioritization software revealed that for C. elegans, as with most non-human species, the choices are limited (Supplementary Table S1). Seven of the 30 tools compared offer multi-species capability. Of these, only two tools allow the querying of C. elegans data sets and provide gene ranking based on diverse data types, thus enabling comparison with GPSy’s results. The performance of these two tools, Génie and Endeavour (13,16), was compared to that of GPSy using the discrimination analysis method described. We subjected the C. elegans shortlist (n = 56) to GPSy and to Endeavour using default parameters. We used the worm PRS for spermatogenesis as the training set for Endeavour. For Génie, we used ‘spermatogenesis’ as topic of interest, a P-value cutoff of 1.0 for abstracts and a false discovery rate of 1.0 for gene selection, while taking into consideration all possible orthologues. The resulting receiver operating characteristic (ROC) curves and corresponding AUC values show significant differences among the tools in favor of GPSy (72.2%) as compared to Génie (68.9%) and Endeavour (65.2%; Figure 2C). We also observed a considerable increase in computation time for the method dependent on a training set (∼10 min using Endeavour as against 10 s for GPSy). The justification of several high- and low-ranking genes obtained through a fair validation strategy (exclusion of worm phenotype data during prioritization) points to the effectiveness of the cross-species approach. The correlation of GPSy rank and phenotype relevance (Figure 2A) and the reliable discrimination of genes with and without the phenotype of interest (Figure 2B and C) suggest that the use of this system on large candidate gene lists will enable the focusing of time and experimental resources on those predictions most likely to be true.

DISCUSSION

The wide variety of data types included in GPSy, in conjunction with its modular nature, enables users to address very specific biological questions. In the Spermatogenesis topic, maximizing the weight of the Tissue specificity module may be advantageous for identifying potential gonad (germ line)-specific marker genes across species. On the other hand, decreasing the weight of Gene Ontology and Phenotype annotations for the query species improves the ranking of uncharacterized genes, thus facilitating the discovery of novel genes important for the selected topic. In comparison to other prioritization methods, GPSy covers many more data sources and provides users with a choice of different species (Supplementary Table S1). The multi-species capability is important for basic scientists whose research is primarily conducted in model organisms. This feature is especially valuable for recently sequenced organisms and others where little or no data beyond the genomic sequence are available (27 out of 45 species; Supplementary Table S3). The value of a cross-species approach is also evident in the case of established model organisms; for example, very little phenotype/disease data are available for primates in comparison to mouse, fly, worm and yeast. Existing approaches using machine learning (35), and kernel- (16) or network-based (32,36) strategies generally rely on training gene sets provided during the query. Systems such as GPSy that use pre-defined criteria and pre-computed scores have the advantage of being much faster. GPSy returns priority lists for the mouse and human genomes in 45 s in comparison to 30 min on average in the case of Endeavour (with a small training set and all data sets selected). With the majority of tools, limitations exist for the size of the reference or candidate gene sets, or both; thus a direct comparison of all performance aspects is not feasible. 
The choice of positive reference genes (PRS) for training purposes is a critical factor because both the size and the homogeneity of the reference set affect the reliability of gene prioritization. There is usually an inverse relationship between them; for very small training sets, homogeneity increases but at the cost of statistical validity. It has been noted that the training set homogeneity is an important factor for effective ranking (10). Estimating homogeneity is a non-trivial task and the time required for the process increases with the size of the reference set. GPSy uses a comprehensive reference set (PRS) relevant for each topic that was carefully selected and then reviewed by experts in the field. Nevertheless, such contrasting features between GPSy and the other gene prioritization approaches suggest that the tools may be used in a complementary fashion (37). The effective prioritization of C. elegans genes through data available in other species shows that the system is scientifically sound and stresses the importance of a cross-species approach. It is obvious, however, that investigator discretion is important in the inclusion/exclusion of selected species particularly for widely divergent clades (e.g. Human–Plant).

CONCLUSION

We report the development and application of GPSy, a novel multi-dimensional tool which integrates distinct data types across a wide range of organisms. This tool is intended for the rapid identification of genes potentially important for conserved biological processes such as male gamete development. GPSy is modular and extendable, which enables us and others to include novel topics and data sets as the need arises. In the future, GPSy will include less utilized data sets, such as regulation by non-coding RNAs (38), as they become available. A future release of our tool will include an update of GPSy’s ‘Cancer’ topic through the inclusion of gene expression data in normal versus cancer samples. We intend to complete GPSy’s repertoire with other topics of interest related to conserved biological processes in the near future.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–7, Supplementary Figures 1–5, Supplementary Methods, Supplementary Files 1–5 and Supplementary References [39-71].

FUNDING

Funding for open access charge: Inserm, Région Bretagne (PhD fellowship); University of Rennes 1 awarded (to R.B.); Inserm Avenir [R07216NS to M.P.].

Conflict of interest statement. None declared.

69 in total

1. Using literature-based discovery to identify disease candidate genes.

Authors: Dimitar Hristovski; Borut Peterlin; Joyce A Mitchell; Susanne M Humphrey
Journal: Int J Med Inform Date: 2005-03 Impact factor: 4.046

2. Specific interference by ingested dsRNA.

Authors: L Timmons; A Fire
Journal: Nature Date: 1998-10-29 Impact factor: 49.962

Review 3. A guide to web tools to prioritize candidate genes.

Authors: Léon-Charles Tranchevent; Francisco Bonachela Capdevila; Daniela Nitsch; Bart De Moor; Patrick De Causmaecker; Yves Moreau
Journal: Brief Bioinform Date: 2010-03-21 Impact factor: 11.622

4. The role of the SPO11 gene in meiotic recombination in yeast.

Authors: S Klapholz; C S Waddell; R E Esposito
Journal: Genetics Date: 1985-06 Impact factor: 4.562

5. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

6. OMA 2011: orthology inference among 1000 complete genomes.

Authors: Adrian M Altenhoff; Adrian Schneider; Gaston H Gonnet; Christophe Dessimoz
Journal: Nucleic Acids Res Date: 2010-11-27 Impact factor: 16.971

7. Profiling spermatogenic failure in adult testes bearing Sox9-deficient Sertoli cells identifies genes involved in feminization, inflammation and stress.

Authors: Aurélie Lardenois; Frédéric Chalmel; Francisco Barrionuevo; Philippe Demougin; Gerd Scherer; Michael Primig
Journal: Reprod Biol Endocrinol Date: 2010-12-23 Impact factor: 5.211

8. Speeding disease gene discovery by sequence based candidate prioritization.

Authors: Euan A Adie; Richard R Adams; Kathryn L Evans; David J Porteous; Ben S Pickard
Journal: BMC Bioinformatics Date: 2005-03-14 Impact factor: 3.169

9. Promoter features related to tissue specificity as measured by Shannon entropy.

Authors: Jonathan Schug; Winfried-Paul Schuller; Claudia Kappen; J Michael Salbaum; Maja Bucan; Christian J Stoeckert
Journal: Genome Biol Date: 2005-03-29 Impact factor: 13.583

10. GeneDistiller--distilling candidate genes from linkage intervals.

Authors: Dominik Seelow; Jana Marie Schwarz; Markus Schuelke
Journal: PLoS One Date: 2008-12-05 Impact factor: 3.240
