
Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies.

Ignacio Medina¹, David Montaner, Nuria Bonifaci, Miguel Angel Pujana, José Carbonell, Joaquin Tarraga, Fatima Al-Shahrour, Joaquin Dopazo.

Abstract

Genome-wide association studies have become a popular strategy to find associations of genes to traits of interest. Despite the high resolution available today for genotyping, the success of such studies in practice has been limited by the testing strategy used. As an alternative to brute-force solutions involving very large cohorts, we propose the use of Gene Set Analysis (GSA), a different analysis strategy based on testing the association of modules of functionally related genes. We show here how the Gene Set-based Analysis of Polymorphisms (GeSBAP), a simple implementation of the GSA strategy for the analysis of genome-wide association studies, provides a significant increase in testing power for this type of study. GeSBAP is freely available at http://bioinfo.cipf.es/gesbap/.

Year: 2009 PMID: 19502494 PMCID: PMC2703970 DOI: 10.1093/nar/gkp481

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Genome-wide association studies (GWAS) use high-resolution maps of markers (Single Nucleotide Polymorphisms, SNPs, and, in some platforms, Copy Number Variations, CNVs) along the genome to look for allele frequency differences between cases (e.g. individuals with a certain disease or trait) and controls. A significant frequency difference is taken as an indication that functional variants related to the disease or trait in question are present in the genomic region corresponding to the markers (1). Despite the high resolution available to interrogate genomes (e.g. more than 1.8 million markers, half of them SNPs, in the Affymetrix 6.0 chip), current testing strategies that consider markers independently result in weak, usually non-significant, associations (2). Thus, under this conventional testing paradigm, only consortia analysing very large cohorts have been successful in finding clear and reproducible results (3). However, a new paradigm inspired by systems biology has recently been adopted in other areas, such as transcriptomics (4). Since the concept sought is functional association, individual genes may not be the best proxies for it. Indeed, it is widely accepted that conventional biological functions, as we understand them, can rarely be attributed to an individual molecule alone. Instead, most of the biological functionality of the cell arises from complex interactions among its molecular components, which define operational interacting entities or modules of functionally related genes (5). In the case of genetic traits, it is generally believed that multigenicity reflects disruptions in proteins that participate in a protein complex or in a pathway (6). Accordingly, Gene Set Analysis (GSA) methods aim to test the activity of such modules instead of testing the activity of individual genes (7,8).
The extrapolation of this concept to other fields has recently been suggested (9–11), in particular to the study of complex traits by GWAS, where combinations of mutations in different genes can jointly affect a pathway, making associations difficult to detect in individual genes but detectable at the level of the pathway. Indeed, the application of this emerging concept (also known as pathway-based analysis) to the analysis of GWAS is already producing new and original results for different pathologies (12–14). Although stand-alone software is available for this type of analysis (9,11,15), no web servers offering a friendlier user interface are available for experimentalists. The Gene Set-based Analysis of Polymorphisms (GeSBAP) web server takes as input lists of SNP or CNV identifiers arranged according to a parameter accounting for the association (P-value) and returns a collection of biological processes significantly associated to the trait studied. Lists of genes belonging to such processes can easily be obtained for further validation.

RATIONALE FOR EXPANDING GENE-SET ANALYSIS CONCEPTS TO THE STUDY OF POLYMORPHISMS

The rationale behind the GeSBAP approach relies on a testing strategy that targets the damaged functionalities that can produce the final phenotype. Most of the traits analysed in GWAS experiments are multigenic. Even in cases where such traits depend on a single main gene, there are other modifier genes that modulate the phenotype. The main problem with GWAS is that testing markers independently results in weak, usually non-significant, associations to the trait (2). Since multigenicity is generally caused by different combinations of mutations whose only common feature is that they belong to a pathway (or, generally speaking, to a functional unit), the GeSBAP approach proposed here, where the entity tested is the pathway, seems the natural way of discovering associations in GWAS.

DESCRIPTION OF THE PROGRAM

GeSBAP is a web server written in Java, with a core written in C++ that conducts the test itself. The program runs on a high-end cluster with 10 dedicated Intel XEON Quad-Core CPUs at 2.0 GHz (a total of 40 cores) and 60 GB of RAM in total. The default mode of use is the traditional ‘anonymous user’. In this case, the results are kept for only 1 day and then deleted. The program also allows registered users: an account can be created and used later after logging in. The results of the sessions are preserved in the account and can be recovered in future sessions after login.

Input

The program takes as input a tab-delimited list of polymorphisms (SNPs and/or CNVs) along with the corresponding P-values of the association. It is also possible to enter more processed data, such as a list of genes with the corresponding P-values of association. Any of the most common gene identifiers can be used, given that GeSBAP internally uses the INFRARED engine for gene ID conversion used in the Babelomics package (16). Finally, data can also be input in the format of the popular PLINK program (17), which performs the association test and obtains the corresponding list of P-values. At present, only human, mouse and rat SNPs can be used in the program. The user can choose to test one or several functional categories among Gene Ontology (GO) (18), KEGG (19) and Biocarta (http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways) pathways. Several filters can be applied to use subsets of any of these categories: filtering by the maximum and minimum number of genes in the terms tested, filtering by keywords and, in the case of GO, filtering by levels in the GO hierarchy.
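As a rough illustration of the input and filtering steps described above, the following Python sketch parses a tab-delimited list of identifiers and P-values and applies size and keyword filters to a collection of gene sets. The function names, the comment-line convention, the default size limits and the marker identifiers are assumptions made for this example; they are not taken from the GeSBAP implementation.

```python
import csv
import io

def parse_association_table(text):
    """Parse lines of 'identifier<TAB>P-value' into a dict.

    Blank lines and lines starting with '#' are skipped; this is an
    assumption of the sketch, not a documented GeSBAP rule.
    """
    pvalues = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        identifier, p = row[0].strip(), float(row[1])
        if not 0.0 < p <= 1.0:
            raise ValueError(f"P-value out of range for {identifier}: {p}")
        pvalues[identifier] = p
    return pvalues

def filter_gene_sets(gene_sets, min_genes=10, max_genes=500, keyword=None):
    """Keep terms whose size lies in [min_genes, max_genes] and whose name
    contains the keyword (case-insensitive) when a keyword is given.
    The default limits are hypothetical."""
    kept = {}
    for name, genes in gene_sets.items():
        if not min_genes <= len(genes) <= max_genes:
            continue
        if keyword and keyword.lower() not in name.lower():
            continue
        kept[name] = genes
    return kept

# Hypothetical marker identifiers, used only to show the expected layout.
example = "# marker\tP\nsnp_A\t0.0004\n\nsnp_B\t0.21\n"
table = parse_association_table(example)
```

Rejecting P-values of exactly zero is a design choice of the sketch: it keeps a later –log(P-value) transformation finite.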

Testing strategy

Given a list of genes ranked by any criterion, GSA is used to find gene sets significantly enriched at high (or low) values of the rank. Here, the ranking criterion for the genes is derived from the associations of the SNPs to the trait studied. In particular, the program uses the –log(P-value) of the association test corresponding to the case–control comparison. The program selects the polymorphisms (SNPs or CNVs) that map within genes or in their neighbourhoods (±5 kb). Among all the polymorphisms corresponding to each gene, the one with the highest association to the trait studied is taken as a proxy of the gene. Then, all the genes represented in the GWAS (usually the complete genome) are mapped to the functional categories previously selected by the user and ranked according to their proxy polymorphisms. Finally, a GSA test (10) is used to check for functional categories showing significant association to the trait studied. The significant functional terms along with the corresponding P-values [adjusted for multiple testing (20)] are listed in the output.
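The steps of this strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the server's code: the data layout, the function names and the toy coordinates are hypothetical, the logarithm is taken in base 10 (the text does not specify the base, and the ranking does not depend on it), and the enrichment step is a plain one-sided hypergeometric test at a single cut-off, standing in for the GSA test of reference (10), which may differ.

```python
import math

FLANK_BP = 5_000  # the ±5 kb neighbourhood mentioned in the text

def gene_proxies(snps, genes, flank=FLANK_BP):
    """Return {gene: smallest P-value among markers within gene ± flank}.

    snps:  iterable of (marker_id, chrom, position, p_value)
    genes: iterable of (gene_id, chrom, start, end)
    """
    best = {}
    for gene, g_chrom, start, end in genes:
        for _marker, s_chrom, pos, p in snps:
            inside = s_chrom == g_chrom and start - flank <= pos <= end + flank
            if inside and p < best.get(gene, float("inf")):
                best[gene] = p
    return best

def rank_genes(proxies):
    """Order genes by decreasing -log10(P) of their proxy marker."""
    return sorted(proxies, key=lambda g: -math.log10(proxies[g]), reverse=True)

def enrichment_p(ranked, gene_set, top_fraction=0.25):
    """One-sided hypergeometric P-value for over-representation of gene_set
    among the top fraction of the ranked list (a stand-in test)."""
    n_total = len(ranked)
    n_top = max(1, int(n_total * top_fraction))
    members = set(gene_set) & set(ranked)
    n_set = len(members)
    k = len(members & set(ranked[:n_top]))
    tail = sum(
        math.comb(n_set, i) * math.comb(n_total - n_set, n_top - i)
        for i in range(k, min(n_set, n_top) + 1)
    )
    return tail / math.comb(n_total, n_top)

# Toy data with made-up coordinates and P-values.
genes = [("G1", "1", 10_000, 20_000), ("G2", "1", 50_000, 60_000),
         ("G3", "2", 1_000, 2_000), ("G4", "2", 30_000, 40_000)]
snps = [("m1", "1", 6_000, 0.2), ("m2", "1", 15_000, 0.001),
        ("m3", "1", 64_000, 0.03), ("m4", "2", 9_000, 0.0005),
        ("m5", "2", 1_500, 0.5), ("m6", "2", 35_000, 0.04)]

proxies = gene_proxies(snps, genes)
ranked = rank_genes(proxies)
```

In the toy data, marker m4 has the smallest P-value but lies more than 5 kb from any gene, so it does not contribute to any proxy. The nested loop is quadratic and only suitable for illustration; a genome-scale mapping would normally use sorted positions or an interval index.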

Output and an example

The output of the program provides a general overview of the functional categories found to be significantly associated to the trait studied. Figure 1 shows a representation of this summary. A table including the significant functional categories along with their corresponding P-values and the genes included in each one is provided. In addition, when GO terms are analysed, a graphical representation of the significant terms within the GO hierarchy is also provided. The GO viewer implemented in the Babelomics package (16) is used for this purpose.
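To make the content of such a table concrete, the sketch below adjusts a set of per-term P-values for multiple testing and keeps the significant rows. It assumes the Benjamini-Hochberg false discovery rate procedure and a 0.05 threshold; the text only states that P-values are adjusted for multiple testing (20) and reports FDR-adjusted values, so both choices, like the function and term names, are assumptions of this example.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_terms(raw, alpha=0.05):
    """raw: {term: unadjusted P-value}.

    Returns [(term, adjusted P-value)] for terms with adjusted P < alpha,
    sorted by increasing adjusted P-value. The alpha default is hypothetical.
    """
    terms = list(raw)
    adjusted = benjamini_hochberg([raw[t] for t in terms])
    rows = [(t, a) for t, a in zip(terms, adjusted) if a < alpha]
    return sorted(rows, key=lambda row: row[1])

# Hypothetical terms and P-values, for illustration only.
raw_pvalues = {"term_a": 0.001, "term_b": 0.02, "term_c": 0.6}
rows = significant_terms(raw_pvalues)
```

Each returned row could then be joined with the list of genes annotated to the term to reproduce the three columns described above.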

Figure 1.

Plot representing a summary of the GO terms significantly associated to breast cancer in the case–control study analysed (see text). The length of the red bar represents the relative proportion of genes of the GO term in the partition at which the enrichment was significant (16,21).

As an example of the application of the program, we show the analysis of a breast cancer case–control study from the CGEMS initiative (22), in which a total of 528 173 SNPs were genotyped in 1145 post-menopausal women with invasive breast cancer and 1142 controls. The original study identified four SNPs in intron 2 of FGFR2 (which encodes a receptor tyrosine kinase) that were highly associated with breast cancer (22). The GeSBAP analysis of the resulting list of ∼18 000 genes ranked by –log(P-value) revealed a considerable number of biological processes associated with risk of sporadic postmenopausal breast cancer (23), where the conventional tests detected only a single gene (22). Figure 1 represents the GO biological processes detected as significantly associated to the cancer, among them ‘transmembrane receptor protein tyrosine kinase signaling pathway’ (GO:0007169, False Discovery Rate (FDR)-adjusted P-value = 1.73 × 10⁻³) and ‘regulation of signal transduction’ (GO:0009966, FDR-adjusted P-value = 4.45 × 10⁻³), both of which include FGFR2. Figure 2 represents a summary of these GO terms mapped within the GO hierarchy. Supplementary Figure 1 displays the complete relationships among all the GO terms found. Notably, some of these processes have long been linked to human neoplasia, although not at the germline genetic level and, in any case, not as comprehensively as proposed here. The role of RAS signal transduction in tumorigenesis is well established (24), but this study extends its involvement to cancer susceptibility by suggesting that several genes annotated with the corresponding GO term play a key role in risk of breast cancer. This term is a child of Cell Communication and a parent of Rho signaling, which provide, respectively, a global view of the cellular alteration in breast cancer susceptibility and a more detailed indication of which components of RAS signaling may be altered. Notably, previous work has linked the Rho guanine nucleotide exchange factor AKAP13 (significant here as part of the ‘transmembrane receptor protein tyrosine kinase signaling pathway’, GO:0007169, FDR-adjusted P-value = 1.73 × 10⁻³) with familial breast cancer (25), and two of the most recent findings replicated world-wide include FGFR2 (significant here, as previously mentioned, as part of the GO terms ‘transmembrane receptor protein tyrosine kinase signaling pathway’, GO:0007169, FDR-adjusted P-value = 1.73 × 10⁻³, and ‘regulation of signal transduction’, GO:0009966, FDR-adjusted P-value = 4.45 × 10⁻³) and MAP3K1 (significant here as part of the GO terms ‘transmembrane receptor protein tyrosine kinase signaling pathway’, GO:0007169, FDR-adjusted P-value = 1.73 × 10⁻³, and ‘Rho protein signal transduction’, GO:0007266, FDR-adjusted P-value = 2.26 × 10⁻²).

Figure 2.

Plot of the relationships in the GO hierarchy of a summary of the main GO terms found as significantly associated to the breast cancer in the case–control experiment analysed (see text). Octagons represent GO terms found as significant. Rectangles represent other GO terms in the hierarchy depicting the functional relationships among the significant terms. Supplementary Figure 1 displays the complete relationships among all the terms found.

DISCUSSION

We have shown how the concept of GSA can easily be extrapolated to the field of polymorphism analysis. An example shows how the application of this test to a breast cancer case–control study reveals a considerable number of biological processes associated with risk of sporadic postmenopausal breast cancer (23), where the conventional tests detected only a single gene association (22). At present, the main limitation for targeting functionality is our limited catalogue of functions, represented mainly by functional annotations contained in repositories such as GO (18), KEGG (19), Biocarta pathways, etc. This not only limits the number of testable effects to the content of these repositories, but also conceptually restricts the associations that can be found to genes covered by the available information. It is expected that projects like ENCODE (26) or the 1000 genomes (http://www.1000genomes.org/) will help to define extra-genic regions with functional significance. Nevertheless, this is not a limitation of the proposed methodology, but of the information available. In any case, the philosophy of testing groups of markers corresponding to functional units instead of markers alone has proven to be superior to the conventional testing scheme and will increase its scope as new functional definitions become available in the future. Although there are other general-purpose programs and web servers for carrying out different flavours of GSA, reviewed in (7,8,27) (see also http://bioinfo.cipf.es/docus/tools-citations/functional_profiling/), no available application is comparable to GeSBAP, which is oriented to the analysis of GWAS, calculates association P-values and maps SNPs or CNVs to genes prior to carrying out the GSA. The GeSBAP program has been running for >6 months. The first paper using this approach has recently been published (23). To our knowledge, there are no other web applications offering this type of analysis.
GeSBAP is freely available at http://bioinfo.cipf.es/gesbap/www/index.jsp.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

26 in total

Review 1. Beyond Mendel: an evolving view of human genetic disease transmission.

Authors: Jose L Badano; Nicholas Katsanis
Journal: Nat Rev Genet Date: 2002-10 Impact factor: 53.242

2. Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information.

Authors: Fátima Al-Shahrour; Ramón Díaz-Uriarte; Joaquín Dopazo
Journal: Bioinformatics Date: 2005-04-19 Impact factor: 6.937

3. Analyzing gene expression data in terms of gene sets: methodological issues.

Authors: Jelle J Goeman; Peter Bühlmann
Journal: Bioinformatics Date: 2007-02-15 Impact factor: 6.937

4. Pathway-based approaches for analysis of genomewide association studies.

Authors: Kai Wang; Mingyao Li; Maja Bucan
Journal: Am J Hum Genet Date: 2007-12 Impact factor: 11.025

5. Formulating and testing hypotheses in functional genomics.

Authors: Joaquin Dopazo
Journal: Artif Intell Med Date: 2008-09-11 Impact factor: 5.326

Review 6. Hyperactive Ras in developmental disorders and cancer.

Authors: Suzanne Schubbert; Kevin Shannon; Gideon Bollag
Journal: Nat Rev Cancer Date: 2007-04 Impact factor: 60.716

7. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; 
Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather 
Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

8. From genes to functional classes in the study of biological systems.

Authors: Fátima Al-Shahrour; Leonardo Arbiza; Hernán Dopazo; Jaime Huerta-Cepas; Pablo Mínguez; David Montaner; Joaquín Dopazo
Journal: BMC Bioinformatics Date: 2007-04-03 Impact factor: 3.169

9. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists.

Authors: Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

10. Babelomics: advanced functional profiling of transcriptomics, proteomics and genomics experiments.

Authors: Fátima Al-Shahrour; José Carbonell; Pablo Minguez; Stefan Goetz; Ana Conesa; Joaquín Tárraga; Ignacio Medina; Eva Alloza; David Montaner; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2008-05-31 Impact factor: 16.971
