
Web-based Gene Pathogenicity Analysis (WGPA): a web platform to interpret gene pathogenicity from personal genome data.

Juan J Diaz-Montana¹, Owen J L Rackham², Norberto Diaz-Diaz¹, Enrico Petretto².

Abstract

As the volume of patient-specific genome sequences increases, the focus of biomedical research is switching from the detection of disease mutations to their interpretation. To this end, a number of techniques have been developed that use mutation data collected within a population to predict whether individual genes are likely to be disease-causing. As both sequence data and associated analysis tools proliferate, it becomes increasingly difficult for the community to make sense of these data and their implications. Moreover, no single analysis tool is likely to capture all relevant genomic features that contribute to a gene's pathogenicity. Here, we introduce Web-based Gene Pathogenicity Analysis (WGPA), a web-based tool to analyze genes impacted by mutations and rank them through the integration of existing prioritization tools, which assess different aspects of gene pathogenicity using population-level sequence data. Additionally, to explore the polygenic contribution of mutations to disease, WGPA implements gene set enrichment analysis to prioritize disease-causing genes and gene interaction networks, thereby providing a comprehensive annotation of personal genome data in disease.
AVAILABILITY AND IMPLEMENTATION: wgpa.systems-genetics.net.

Substances: Virulence Factors

Year: 2015 PMID: 26490503 PMCID: PMC4743624 DOI: 10.1093/bioinformatics/btv598

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Motivation

With the growing volume of patient-specific sequence data being generated, there is an increasing need to annotate these data and to distinguish possible disease-causing mutations from benign ones. To this end, a number of approaches have been developed to prioritize genes based on their predicted pathogenicity using whole-exome and whole-genome data. A recently introduced class of approaches uses the pattern of functional sequence variation (i.e. rare and common mutations) observed in the human population (Petrovski et al.), the likelihood of observed mutations according to evolution (Rackham et al.) or statistical modelling of genes under selective constraint (Samocha et al.) to prioritize (rank) disease-causing genes from sets of genes impacted by mutations. Unlike sequence variant-level analysis (e.g. PolyPhen-2; Adzhubei et al.), these methods specifically allow a gene-level analysis of pathogenicity, providing elegant yet distinct schemes to evaluate the significance of individual genes in disease (Enns et al.; Shashi et al.). Here we provide an easy-to-use web-based tool (Web-based Gene Pathogenicity Analysis, or WGPA) that integrates these methods for gene-level pathogenicity analysis (Petrovski et al.; Rackham et al.; Samocha et al.) as well as any future scoring system, thereby facilitating the assessment of the evidence supporting a role for a gene or variant in disease pathogenesis. Beyond single-gene analyses, WGPA provides, all within the same web-based framework, a means to assess and test pathogenicity for groups of genes of interest using gene set enrichment analysis (Subramanian et al.), to look for mutations in the so-called hot-zone using the gene-level scores in conjunction with PolyPhen-2 (Adzhubei et al.) or FATHMM (Shihab et al.), and to incorporate information from known gene interaction networks.
Our platform will allow the scientific community to critically evaluate and interpret the large sets of mutation data from sequencing studies, aiding in the identification of genes and networks that play a critical role in disease aetiology.

2 Methods and implementation

2.1 Measures of genic intolerance

To date, only a few methods are available that predict pathogenicity at the gene level using sequence or population information alone: the Residual Variation Intolerance Score (RVIS) (Petrovski et al.), the Evolutionary Intolerance score (EvoTol) (Rackham et al.) and gene constraint scores (GCS) (Samocha et al.). Combining these techniques with other analysis tools provides a means to assess pathogenicity for sets of genes found to be mutated in a disease, such as those identified by whole-exome and whole-genome sequencing. Here we provide a web-based tool that integrates the following genic intolerance measures in a single framework of analysis. RVIS identifies an intolerant gene as one containing a higher number of rare mutations than would be expected compared with other genes with a similar number of mutations. EvoTol identifies an intolerant gene as one containing an excess of mutations that, in protein space, are not favoured by evolution compared with other genes with the same number of mutations. GCS identifies excessively constrained genes using a statistical model that allows genes to be ranked by their relative deficiency of functional variation.

2.2 Gene set enrichment analysis of gene pathogenicity

The methods described above provide gene-level scores for the identification of variants and genes that have a critical role in disease; these scores can be used to create ranked gene lists in which individual highly intolerant (or constrained) genes can be prioritized. In order to integrate these scores over sets of genes, we provide a gene set enrichment analysis (GSEA) implementation (Subramanian et al.) that can be used with RVIS, EvoTol or GCS. Briefly, given a ranked list of genes (calculated genome-wide for each method described above), the GSEA tool tests whether the genic intolerance scores of a user-provided subset of genes occupy higher (or lower) positions in the ranked gene list than would be expected by chance. Gene set enrichment scores and the significance of the enrichment (P-value, false discovery rate (FDR) and family-wise error rate (FWER) P-value) are reported in the GSEA output format developed by the Broad Institute of MIT and Harvard (Subramanian et al.).
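
The enrichment test described above can be illustrated with a minimal Python sketch. It uses the unweighted (Kolmogorov-Smirnov-style) running sum and a simple gene-set permutation test; this is a simplification for illustration and is not claimed to match the WGPA or Broad Institute implementation. The function names, the two-sided P-value and the assumption that the list is ordered from most to least intolerant are all choices made here.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative simplification).

    ranked_genes is assumed to be ordered from most to least intolerant.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Two-sided empirical P-value from random gene sets of the same size."""
    rng = random.Random(seed)
    size = sum(1 for g in ranked_genes if g in gene_set)
    observed = enrichment_score(ranked_genes, gene_set)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

A positive score indicates that the gene set is concentrated towards the top of the ranked list and a negative score towards the bottom. FDR and FWER correction across multiple gene sets is not covered by this sketch.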

2.3 Interactome data

Genes that are mutated in disease do not operate in isolation, but as part of highly complex cellular and regulatory systems. A number of sources of gene interaction data are available, and here we use the STRING database (von Mering et al.), which provides several types of gene-gene interaction data. In order to remove less reliable interactions, we have filtered the STRING network to include only those interactions that have a STRING confidence score greater than 500 and are experimentally supported (Rackham et al.). The interaction data are used to display the pathogenicity scores for a set of genes on a network which, for instance, can be used to identify genes that are both intolerant to mutation and network hubs.
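
As a rough illustration of the filtering step and the hub example, the following Python sketch keeps only interactions with a combined score above 500 and non-zero experimental evidence, then lists genes that are both highly connected and intolerant. The dictionary keys, the percentile convention (lower means more intolerant) and the `min_degree` and `max_percentile` cut-offs are hypothetical choices for this example, not values taken from WGPA.

```python
from collections import defaultdict

def filter_interactions(edges, min_combined=500):
    """Keep edges scoring above min_combined that have experimental support.

    Each edge is a dict with hypothetical keys: gene_a, gene_b,
    combined_score and experimental (0 meaning no experimental evidence).
    """
    return [
        e for e in edges
        if e["combined_score"] > min_combined and e["experimental"] > 0
    ]

def intolerant_hubs(edges, intolerance_percentile, max_percentile=25.0, min_degree=3):
    """Genes with at least min_degree partners and a low intolerance percentile.

    intolerance_percentile maps gene -> percentile, where lower values are
    assumed to mean more intolerant; genes without a score are ignored.
    """
    degree = defaultdict(int)
    for e in edges:
        degree[e["gene_a"]] += 1
        degree[e["gene_b"]] += 1
    return sorted(
        gene for gene, d in degree.items()
        if d >= min_degree
        and intolerance_percentile.get(gene, 100.0) <= max_percentile
    )
```

Calling `intolerant_hubs(filter_interactions(edges), percentiles)` chains the two steps; what counts as a hub depends entirely on the chosen degree threshold.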

2.4 Tools for annotating individual SNPs

In developing RVIS, the authors also defined the ‘hot-zone’ of mutation: the set of mutations that are both predicted to be damaging and lie within genes predicted to be intolerant to mutation. To generalize this concept, we have integrated both PolyPhen-2 and FATHMM, allowing the hot-zone to be defined by combining either of these predictors with any of the three measures of intolerance.
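
The hot-zone idea amounts to a joint filter on a variant-level score and a gene-level score, as in the Python sketch below. The record layout and both cut-offs (0.95 and 25) are illustrative defaults chosen for this example rather than values stated in the text. The variant score is assumed to be oriented so that higher means more damaging, and the gene percentile so that lower means more intolerant; a predictor whose scores run the other way would need to be re-oriented first.

```python
def hot_zone(variants, gene_percentile, damaging_cutoff=0.95, intolerance_cutoff=25.0):
    """Return variants that are predicted damaging AND sit in intolerant genes.

    variants: list of dicts with hypothetical keys id, gene, damaging_score.
    gene_percentile: gene -> intolerance percentile (lower = more intolerant).
    Variants in genes without a gene-level score are skipped.
    """
    selected = []
    for variant in variants:
        percentile = gene_percentile.get(variant["gene"])
        if percentile is None:
            continue
        is_damaging = variant["damaging_score"] >= damaging_cutoff
        is_intolerant = percentile <= intolerance_cutoff
        if is_damaging and is_intolerant:
            selected.append(variant)
    return selected
```

Swapping the gene-level dictionary between RVIS, EvoTol and GCS percentiles, or the variant-level score between predictors, gives the different hot-zone combinations described above.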

2.5 Web interface

In order to facilitate the annotation of personal genome data with respect to disease pathogenesis, we have developed a unified web-based tool for pathogenicity analysis of individual genes, gene sets and gene interaction networks. To this end, we developed an intuitive graphical user interface that makes the available prioritization methods (RVIS, EvoTol, GCS) and integrated analysis tools (GSEA, cell-type specificity, gene interaction networks) easy to access and use by the general scientific community. The types of input data, integrative analysis components and outputs are summarized schematically in Figure 1, and include the following inputs, analyses and outputs:

Fig. 1.

Schematic representation of the inputs, integrative data analyses component and associated outputs available through WGPA

Inputs – Gene-Level: manual data entry; gene list (*.txt); GRP, gene set (*.grp); GMX, gene matrix (*.gmx); GMT, gene matrix transposed (*.gmt); WGCNA, weighted gene co-expression network analysis output (*.wgcna); Variant-Level: manual data entry; list of protein substitutions (*.txt); list of dbSNP identifiers (*.txt); Network-Level: manual data entry; list of gene identifiers for STRING (*.txt); list of gene pairs (*.txt).

Analyses – RVIS, EvoTol (can be stratified by gene expression), GCS (user-selected); RVIS, EvoTol, GCS combined with variant-level consequence predictions (PolyPhen-2 (Adzhubei et al.) or FATHMM (Shihab et al.)); gene set enrichment analysis (for Gene-Level inputs).

Outputs – genes ranked by their genic intolerance or constraint scores (graphical and table formats); GSEA results for gene sets (graphical and table formats); gene pathogenicity annotation using both the predicted ‘functionally damaging’ mutations and genic intolerance (or constraint) scores (to identify the so-called hot-zone, i.e. mutations predicted to be both ‘functionally damaging’ and located in highly intolerant genes) (graphical and table formats); gene interaction network annotated according to RVIS, EvoTol or GCS, allowing zooming out from a particular gene and visualizing its connections to other genes (graphical format).

3 Example

One example of where WGPA is useful is the prioritization of genes carrying de novo mutations from trio sequencing projects. For instance, in the Epi4K project (Allen et al.), trio sequencing was performed on epilepsy patients, resulting in the identification of 329 de novo mutations impacting 176 different genes. By cross-matching the RVIS, GCS and EvoTol scores and focusing on the genes in the top 25th percentile, we identify a set of 17 genes of interest (ATP2B4, CHD4, DNM1, FLNA, FLNC, GABRA1, GABRB3, GNAO1, GRIN1, KCNQ2, MLL, MLL4, MYH6, SCN1A, SCN2A, SCN8A, WHSC1L1; Supplementary Table S1). Using WGPA it was also possible to perform a GSEA for each of the measures of intolerance using the Epi4K mutated genes as the gene set of interest, and to show that in each case the Epi4K mutated gene set is significantly enriched for predicted pathogenic genes (Supplementary Figure S1).
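
The cross-matching step can be sketched in Python as an intersection of the most intolerant quartile under each score. This is one plausible reading of the procedure, assumed here for illustration; the function names, the percentile convention and the per-score direction flags are hypothetical, and the direction of each real score should be checked before use.

```python
def percentile_ranks(scores, lower_is_more_intolerant=True):
    """Map gene -> percentile rank, where small percentiles are most intolerant."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_more_intolerant)
    n = len(ordered)
    return {gene: 100.0 * (i + 1) / n for i, gene in enumerate(ordered)}

def top_in_all(score_tables, cutoff=25.0):
    """Genes within the top `cutoff` percent under every score table.

    score_tables: list of (scores, lower_is_more_intolerant) pairs, where
    scores maps gene -> raw score for one method.
    """
    shared = None
    for scores, lower_is_more in score_tables:
        ranks = percentile_ranks(scores, lower_is_more)
        top = {gene for gene, pct in ranks.items() if pct <= cutoff}
        shared = top if shared is None else shared & top
    return sorted(shared or [])
```

Passing one table per method, each restricted to the mutated genes or computed genome-wide, returns the genes that every method places among its most intolerant quarter.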

Funding

Supported by The Duke-NUS Graduate Medical School Signatures Research Program (Program in Cardiovascular and Metabolic Disorders).

Conflict of Interest: none declared.

References

1. STRING: a database of predicted functional associations between proteins.

Authors: Christian von Mering; Martijn Huynen; Daniel Jaeggi; Steffen Schmidt; Peer Bork; Berend Snel
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

2. The RBMX gene as a candidate for the Shashi X-linked intellectual disability syndrome.

Authors: V Shashi; P Xie; K Schoch; D B Goldstein; T D Howard; M N Berry; C E Schwartz; K Cronin; S Sliwa; A Allen; A C Need
Journal: Clin Genet Date: 2014-12-05 Impact factor: 4.438

3. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

4. Predicting functional effect of human missense mutations using PolyPhen-2.

Authors: Ivan Adzhubei; Daniel M Jordan; Shamil R Sunyaev
Journal: Curr Protoc Hum Genet Date: 2013-01

5. EvoTol: a protein-sequence based evolutionary intolerance framework for disease-gene prioritization.

Authors: Owen J L Rackham; Hashem A Shihab; Michael R Johnson; Enrico Petretto
Journal: Nucleic Acids Res Date: 2014-12-29 Impact factor: 16.971

6. Genic intolerance to functional variation and the interpretation of personal genomes.

Authors: Slavé Petrovski; Quanli Wang; Erin L Heinzen; Andrew S Allen; David B Goldstein
Journal: PLoS Genet Date: 2013-08-22 Impact factor: 5.917

7. Mutations in NGLY1 cause an inherited disorder of the endoplasmic reticulum-associated degradation pathway.

Authors: Gregory M Enns; Vandana Shashi; Matthew Bainbridge; Michael J Gambello; Farah R Zahir; Thomas Bast; Rebecca Crimian; Kelly Schoch; Julia Platt; Rachel Cox; Jonathan A Bernstein; Mena Scavina; Rhonda S Walter; Audrey Bibb; Melanie Jones; Madhuri Hegde; Brett H Graham; Anna C Need; Angelica Oviedo; Christian P Schaaf; Sean Boyle; Atul J Butte; Rui Chen; Rong Chen; Michael J Clark; Rajini Haraksingh; Tina M Cowan; Ping He; Sylvie Langlois; Huda Y Zoghbi; Michael Snyder; Richard A Gibbs; Hudson H Freeze; David B Goldstein
Journal: Genet Med Date: 2014-03-20 Impact factor: 8.822

8. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models.

Authors: Hashem A Shihab; Julian Gough; David N Cooper; Peter D Stenson; Gary L A Barker; Keith J Edwards; Ian N M Day; Tom R Gaunt
Journal: Hum Mutat Date: 2012-11-02 Impact factor: 4.878

9. De novo mutations in epileptic encephalopathies.

Authors: Andrew S Allen; Samuel F Berkovic; Patrick Cossette; Norman Delanty; Dennis Dlugos; Evan E Eichler; Michael P Epstein; Tracy Glauser; David B Goldstein; Yujun Han; Erin L Heinzen; Yuki Hitomi; Katherine B Howell; Michael R Johnson; Ruben Kuzniecky; Daniel H Lowenstein; Yi-Fan Lu; Maura R Z Madou; Anthony G Marson; Heather C Mefford; Sahar Esmaeeli Nieh; Terence J O'Brien; Ruth Ottman; Slavé Petrovski; Annapurna Poduri; Elizabeth K Ruzzo; Ingrid E Scheffer; Elliott H Sherr; Christopher J Yuskaitis; Bassel Abou-Khalil; Brian K Alldredge; Jocelyn F Bautista; Samuel F Berkovic; Alex Boro; Gregory D Cascino; Damian Consalvo; Patricia Crumrine; Orrin Devinsky; Dennis Dlugos; Michael P Epstein; Miguel Fiol; Nathan B Fountain; Jacqueline French; Daniel Friedman; Eric B Geller; Tracy Glauser; Simon Glynn; Sheryl R Haut; Jean Hayward; Sandra L Helmers; Sucheta Joshi; Andres Kanner; Heidi E Kirsch; Robert C Knowlton; Eric H Kossoff; Rachel Kuperman; Ruben Kuzniecky; Daniel H Lowenstein; Shannon M McGuire; Paul V Motika; Edward J Novotny; Ruth Ottman; Juliann M Paolicchi; Jack M Parent; Kristen Park; Annapurna Poduri; Ingrid E Scheffer; Renée A Shellhaas; Elliott H Sherr; Jerry J Shih; Rani Singh; Joseph Sirven; Michael C Smith; Joseph Sullivan; Liu Lin Thio; Anu Venkat; Eileen P G Vining; Gretchen K Von Allmen; Judith L Weisenberg; Peter Widdess-Walsh; Melodie R Winawer
Journal: Nature Date: 2013-08-11 Impact factor: 49.962

10. A framework for the interpretation of de novo mutation in human disease.

Authors: Kaitlin E Samocha; Elise B Robinson; Stephan J Sanders; Christine Stevens; Aniko Sabo; Lauren M McGrath; Jack A Kosmicki; Karola Rehnström; Swapan Mallick; Andrew Kirby; Dennis P Wall; Daniel G MacArthur; Stacey B Gabriel; Mark DePristo; Shaun M Purcell; Aarno Palotie; Eric Boerwinkle; Joseph D Buxbaum; Edwin H Cook; Richard A Gibbs; Gerard D Schellenberg; James S Sutcliffe; Bernie Devlin; Kathryn Roeder; Benjamin M Neale; Mark J Daly
Journal: Nat Genet Date: 2014-08-03 Impact factor: 38.330
