Gowinda: unbiased analysis of gene set enrichment for genome-wide association studies.

Abstract

SUMMARY: An analysis of gene set [e.g. Gene Ontology (GO)] enrichment assumes that all genes are sampled independently of each other and with the same probability. These assumptions are violated in genome-wide association (GWA) studies because (i) longer genes typically have more single-nucleotide polymorphisms, resulting in a higher probability of being sampled, and (ii) overlapping genes are sampled in clusters. Herein, we introduce Gowinda, a software tool specifically designed to test for enrichment of gene sets in GWA studies. We show that GO tests on GWA data could result in a substantial number of false-positive GO terms. Permutation tests implemented in Gowinda eliminate these biases but maintain sufficient power to detect enrichment of GO terms. Since sufficient resolution for large datasets requires millions of permutations, we use multi-threading to keep computation times reasonable.
AVAILABILITY AND IMPLEMENTATION: Gowinda is implemented in Java (v1.6) and freely available at http://code.google.com/p/gowinda/
CONTACT: christian.schloetterer@vetmeduni.ac.at
SUPPLEMENTARY INFORMATION: Manual: http://code.google.com/p/gowinda/wiki/Manual. Test data and tutorial: http://code.google.com/p/gowinda/wiki/Tutorial. Validation: http://code.google.com/p/gowinda/wiki/VALIDATION.

Year: 2012 PMID: 22635606 PMCID: PMC3400962 DOI: 10.1093/bioinformatics/bts315

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The advent of high-throughput methods such as single-nucleotide polymorphism (SNP) arrays and next-generation sequencing has enabled large-scale genome-wide association (GWA) studies (Nordborg and Weigel, 2008) and GWA-like studies, such as selective genotyping (Darvasi and Soller, 1994) and experimental evolution (Turner et al., 2011), for almost any phenotype of interest. These studies typically yield hundreds of candidate SNPs associated with the studied trait. A widespread approach to shed light on the biological implications of these SNPs is a test for gene set enrichment [e.g. Gene Ontology (GO)] (Ashburner et al., 2000; Wang et al., 2010). Such an analysis rests on the assumptions that all genes are sampled independently of each other and with the same probability. These assumptions are violated by data from GWA studies because (i) longer genes usually have more SNPs, resulting in a higher probability of being sampled, and (ii) overlapping genes are sampled in clusters (Holmans et al., 2009). For these reasons, we developed Gowinda, a user-friendly and multi-threaded software tool specifically designed for unbiased detection of gene set enrichment in large datasets such as those generated by GWA studies. Because it relies on standard file formats, any species with a sequenced and annotated genome may be analyzed; it therefore compares favorably with a similar tool that is restricted to the analysis of GWA datasets in humans (Holmans et al., 2009) [for a review of the available methods, see Wang et al. (2010)]. We validated Gowinda and show that the biases inherent to GWA datasets could result in a substantial number of false-positive GO terms and that Gowinda eliminates these biases while still yielding highly reliable results.

2 IMPLEMENTATION

Gowinda calculates the significance of overrepresentation for each gene set with permutation tests. It randomly samples SNPs from the total set of SNPs and records the associated genes. Repeating this permutation many times yields an empirical null distribution of gene abundance for every gene set, from which the significance of overrepresentation of the candidate SNPs is estimated. To account for multiple testing, an empirical false discovery rate (FDR) is calculated by dividing the number of gene sets expected at a given P-value (averaged over the simulations) by the number of gene sets observed at that P-value.

Gowinda requires four input files, all in widely used standard formats: a file with the annotation of the species of interest (.gtf or .gff), a file with the total set of SNPs used for the GWA study [.vcf, .mpileup or similar (Danecek et al., 2011; Li et al., 2009)], a file containing the candidate SNPs (which must be a subset of all SNPs) and a file with the mapping of genes to gene sets. Files with the mapping between genes and GO terms can, for example, be obtained either from FuncAssociate2 (Berriz et al., 2009) or from High-Throughput (HT) GoMiner (Zeeberg et al., 2005). In addition, Gowinda can be used to identify enrichment of SNPs in any user-defined gene set (Manual: http://code.google.com/p/gowinda/wiki/Manual; Test data and walkthrough: http://code.google.com/p/gowinda/wiki/Tutorial).

Gowinda does not reproduce the exact pattern of linkage disequilibrium (LD) between SNPs but offers two complementary test strategies that make two extreme assumptions about LD:

(i) SNPs are in linkage equilibrium: Gowinda randomly samples the same number of SNPs as there are candidate SNPs. Subsequently, the corresponding genes are identified and overrepresentation is estimated as described above. Note that the number of randomly sampled SNPs is kept constant in the simulations (--mode SNP) and that a gene may be considered several times, according to the number of candidate SNPs it contains.

(ii) SNPs are in complete LD: Gowinda randomly samples SNPs until the corresponding number of genes is identical to the number of candidate genes. The significance of overrepresentation is then estimated as described above. Note that the number of randomly sampled genes is kept constant in the simulations (--mode gene), but the number of sampled SNPs may vary between simulations. Furthermore, every gene is considered only once, even when it contains several candidate SNPs. This approach assumes complete LD between SNPs within a gene but does not account for LD between genes.

Another important feature of Gowinda is a flexible definition of a gene. This enables the user to include SNPs mapping to different features in the analysis, such as exons, introns, untranslated region or 2000 bp downstream. Possible definitions are as follows: exons, CDS, exons + introns, untranslated region, upstream + exons + introns, exons + introns + downstream and upstream + exons + introns + downstream.
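The permutation scheme described in this section can be illustrated with a minimal Python sketch. It is not Gowinda's Java implementation. The function names, the toy data and the P-value convention (the share of simulations with a count at least as large as the observed one) are assumptions made for illustration. In gene mode the sketch stops as soon as at least the target number of distinct genes has been hit, which is a simplification when genes overlap.

```python
import random


def sample_genes(all_snps, snp_to_genes, target, mode, rng):
    """Draw one random SNP set and return the genes it hits.

    mode "snp":  exactly `target` SNPs are drawn; a gene is listed once per SNP hitting it.
    mode "gene": SNPs are drawn until at least `target` distinct genes are hit;
                 every gene is listed once.
    """
    if mode == "snp":
        drawn = rng.sample(all_snps, target)
        return [g for snp in drawn for g in snp_to_genes.get(snp, ())]
    hit = set()
    for snp in rng.sample(all_snps, len(all_snps)):  # random order, no replacement
        if len(hit) >= target:
            break
        hit.update(snp_to_genes.get(snp, ()))
    return sorted(hit)


def set_counts(genes, gene_to_sets):
    """Count, for every gene set, how many of the listed genes belong to it."""
    counts = {}
    for gene in genes:
        for gene_set in gene_to_sets.get(gene, ()):
            counts[gene_set] = counts.get(gene_set, 0) + 1
    return counts


def p_values(counts, null, gene_sets):
    """Empirical P-value per gene set: share of simulations with a count >= `counts`."""
    return {s: sum(sim.get(s, 0) >= counts.get(s, 0) for sim in null) / len(null)
            for s in gene_sets}


def enrichment(all_snps, candidates, snp_to_genes, gene_to_sets,
               mode="snp", simulations=300, seed=0):
    """Return {gene_set: (empirical P-value, empirical FDR)}."""
    rng = random.Random(seed)
    gene_sets = sorted({s for sets in gene_to_sets.values() for s in sets})
    cand_genes = [g for snp in candidates for g in snp_to_genes.get(snp, ())]
    if mode == "gene":
        cand_genes = sorted(set(cand_genes))  # every gene counted once
    target = len(candidates) if mode == "snp" else len(cand_genes)
    observed = set_counts(cand_genes, gene_to_sets)
    null = [set_counts(sample_genes(all_snps, snp_to_genes, target, mode, rng),
                       gene_to_sets) for _ in range(simulations)]
    p_obs = p_values(observed, null, gene_sets)
    # Each simulated draw is scored against the same null distribution to estimate
    # how many gene sets reach a given P-value by chance.
    p_null = [p_values(sim, null, gene_sets) for sim in null]
    result = {}
    for s in gene_sets:
        p = p_obs[s]
        expected = sum(sum(v <= p for v in sim_p.values()) for sim_p in p_null) / simulations
        n_observed = sum(v <= p for v in p_obs.values())  # at least 1: the set itself
        result[s] = (p, min(1.0, expected / n_observed))
    return result


# Toy data (hypothetical): 20 genes with 5 SNPs each; genes g0 to g3 form set "A".
snp_to_genes = {f"g{i}_s{j}": [f"g{i}"] for i in range(20) for j in range(5)}
gene_to_sets = {f"g{i}": ["A" if i < 4 else "B"] for i in range(20)}
candidates = [f"g{i}_s0" for i in range(4)]  # one candidate SNP in every gene of "A"
all_snps = sorted(snp_to_genes)
snp_mode = enrichment(all_snps, candidates, snp_to_genes, gene_to_sets, mode="snp")
gene_mode = enrichment(all_snps, candidates, snp_to_genes, gene_to_sets, mode="gene")
```

The quadratic scoring of every simulation against the null keeps the sketch short but would not scale to millions of permutations; a production tool would need a more efficient ranking scheme.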

3 VALIDATION AND BENCHMARKS

To test the reliability of Gowinda, we asked whether Gowinda reproduces the same significance of overrepresentation for GO categories as the widely used tool HT GoMiner (Zeeberg et al., 2005). We created an unbiased dataset by filtering for non-overlapping Drosophila melanogaster genes and introduced exactly five SNPs into each of the genes. Subsequently, we randomly sampled 1000 SNPs and computed the significance of overrepresentation for every GO category, either on the basis of SNPs using Gowinda or on the basis of the corresponding genes using HT GoMiner. We found that Gowinda yields results almost identical to those of HT GoMiner (Fig. 1A; Spearman's rank correlation; ρ>0.99; P-value < 2.2 × 10^−16). We also assessed the bias introduced by GWA data and the extent to which Gowinda corrects for this bias. We created a biased dataset by introducing a SNP every 100 bp into all genes and again randomly sampled 1000 SNPs. We calculated the significance of overrepresentation using Gowinda and HT GoMiner. Consistent with the expected bias due to different gene lengths and overlapping genes, HT GoMiner reports a significant enrichment for 341 GO categories (FDR < 0.05), whereas Gowinda correctly reports zero (FDR < 0.05). We also found that the correlation between the P-values of overrepresentation from Gowinda and HT GoMiner decreased dramatically with the biased dataset (Fig. 1B; Spearman's rank correlation; ρ>0.56; P-value < 2.2 × 10^−16).
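The length bias in the biased dataset can be made concrete with a short calculation. The Python sketch below computes the probability that a gene is hit by at least one randomly sampled SNP, given the number of SNPs it carries. The gene lengths (1 kb and 50 kb) and the total of 1 000 000 SNPs are hypothetical values chosen for illustration; only the spacing of one SNP per 100 bp, the five SNPs per gene and the 1000 sampled SNPs come from the text.

```python
from math import comb


def hit_probability(gene_snps, total_snps, sampled):
    """Probability that at least one of `sampled` SNPs, drawn without replacement
    from `total_snps`, falls in a gene carrying `gene_snps` of them."""
    return 1 - comb(total_snps - gene_snps, sampled) / comb(total_snps, sampled)


TOTAL_SNPS = 1_000_000   # hypothetical size of the total SNP set
SAMPLED = 1000           # number of randomly sampled SNPs, as in the validation

# Biased design: one SNP every 100 bp, so SNP count grows with gene length.
short_gene = hit_probability(1_000 // 100, TOTAL_SNPS, SAMPLED)    # 1 kb gene, 10 SNPs
long_gene = hit_probability(50_000 // 100, TOTAL_SNPS, SAMPLED)    # 50 kb gene, 500 SNPs

# Unbiased design: exactly five SNPs per gene, so every gene is equally likely to be hit.
any_gene = hit_probability(5, TOTAL_SNPS, SAMPLED)

print(f"1 kb gene:  {short_gene:.4f}")
print(f"50 kb gene: {long_gene:.4f}")
print(f"5-SNP gene: {any_gene:.4f}")
```

Under these assumptions the long gene is far more likely to appear among the sampled genes than the short one, which is the violation of the equal-probability assumption that a gene-based test does not account for.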

Fig. 1.

Correlation of the significance of overrepresentation for GO terms as calculated by Gowinda and GoMiner using an unbiased (A) and a biased (B) dataset

Finally, we tested whether Gowinda correctly identifies overrepresentation for five randomly preselected GO categories. We randomly picked five small GO categories (5–10 genes) and introduced a candidate SNP into every gene associated with these GO categories. Subsequently, we randomly sampled SNPs from the biased dataset until a total of 1000 candidate SNPs were obtained. After analysis for GO term enrichment, we found that Gowinda correctly identified all preselected GO categories (FDR < 0.05). Interestingly, significant enrichment was also identified for another 14 GO categories, which is due to the nesting of GO categories. Details about validation can be found at: http://code.google.com/p/gowinda/wiki/Validation.

Gowinda is reasonably fast: in D. melanogaster, 1 000 000 simulations for 2000 candidate SNPs out of a total of 1.8 million SNPs take about 31 min on a Mac Pro (10.5.8) using eight threads and require about 1.2 GB of RAM. Memory consumption depends mostly on the total number of SNPs, and computation time scales with the number of simulations.

Funding: Austrian Science Fund (FWF) grant (P19467) to C.S.

Conflict of Interest: none declared.

REFERENCES

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Analysing biological pathways in genome-wide association studies.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nat Rev Genet Date: 2010-12 Impact factor: 53.242

3. Next-generation genetics in plants.

Authors: Magnus Nordborg; Detlef Weigel
Journal: Nature Date: 2008-12-11 Impact factor: 49.962

4. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder.

Authors: Peter Holmans; Elaine K Green; Jaspreet Singh Pahwa; Manuel A R Ferreira; Shaun M Purcell; Pamela Sklar; Michael J Owen; Michael C O'Donovan; Nick Craddock
Journal: Am J Hum Genet Date: 2009-06-18 Impact factor: 11.025

5. Next generation software for functional trend analysis.

Authors: Gabriel F Berriz; John E Beaver; Can Cenik; Murat Tasan; Frederick P Roth
Journal: Bioinformatics Date: 2009-08-28 Impact factor: 6.937

6. Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus.

Authors: A Darvasi; M Soller
Journal: Genetics Date: 1994-12 Impact factor: 4.562

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

8. Population-based resequencing of experimentally evolved populations reveals the genetic basis of body size variation in Drosophila melanogaster.

Authors: Thomas L Turner; Andrew D Stewart; Andrew T Fields; William R Rice; Aaron M Tarone
Journal: PLoS Genet Date: 2011-03-17 Impact factor: 5.917

9. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

10. High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID).

Authors: Barry R Zeeberg; Haiying Qin; Sudarshan Narasimhan; Margot Sunshine; Hong Cao; David W Kane; Mark Reimers; Robert M Stephens; David Bryant; Stanley K Burt; Eldad Elnekave; Danielle M Hari; Thomas A Wynn; Charlotte Cunningham-Rundles; Donn M Stewart; David Nelson; John N Weinstein
Journal: BMC Bioinformatics Date: 2005-07-05 Impact factor: 3.169
