Biomolecular Systems of Disease Buried Across Multiple GWAS Unveiled by Information Theory and Ontology.

Younghee Lee¹, Jianrong Li, Eric Gamazon, James L Chen, Anna Tikhomirov, Nancy J Cox, Yves A Lussier.
1. Sect. of Genetic Medicine.

Abstract

A key challenge for genome-wide association studies (GWAS) is to understand how single nucleotide polymorphisms (SNPs) mechanistically underpin complex diseases. While this challenge has been addressed partially by Gene Ontology (GO) enrichment of large lists of host genes of SNPs prioritized in GWAS, these enrichments have not been formally evaluated. Here, we develop a novel computational approach anchored in information theoretic similarity, by systematically mining lists of host genes of SNPs prioritized in three adult-onset diabetes mellitus GWAS. The "gold standard" is based on GO terms associated with the host genes of 20 published diabetes SNPs and on our own evaluation. We computationally identify 69 similarity-predicted GO terms that are independently validated in all three GWAS (FDR<5%) and enriched with those of the gold standard (odds ratio=5.89, P=4.81e-05); these terms can be organized by similarity criteria into 11 groupings termed "biomolecular systems". Six biomolecular systems were corroborated by the gold standard and the remaining five were previously uncharacterized. http://lussierlab.org/publications/ITS-GWAS.

Year: 2010 PMID: 21347143 PMCID: PMC3041547

Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430

Introduction

Single nucleotide polymorphism (SNP) arrays have been used extensively in genome-wide association studies (GWAS) to predict how genetic variants relate to a single phenotype. Many methods have been developed to confirm these predictions, and SNPs predicted by GWAS have been assessed with follow-up studies and with biological models. The opposite has not been true: there is a paucity of validation studies exploring the predicted biomolecular functions associated with known SNPs. Properly evaluating these predicted biomolecular functions requires a sufficiently large number of annotated SNPs and a sound methodology. Three important considerations argue for improving the accuracy and the validation of the sets of related annotations predicted from SNP arrays, which we call “biomolecular systems”. First, biomolecular systems associated with complex diseases are poorly understood and remain largely without computational replication in a different dataset or biological validation. Current enrichment approaches have only been conducted with intragenic SNPs. Second, the existing GWAS can be leveraged at minimal additional expense to unveil additional knowledge. Finally, the increasing availability of multiple, independent, and disease-specific SNP array datasets provides an excellent opportunity to analyze across related experiments. Others have previously conducted straightforward Gene Ontology (GO) [1] enrichment studies over a limited number of host genes annotated to intragenic SNPs. Such single-study predictions with an arbitrary number of prioritized SNPs are available through web-based tools [2]. However, these studies do not formally evaluate the optimal cutoff for the number of prioritized SNPs, and their computational validation remains entirely “internal” to a single study, using empirical or theoretical statistics for correction of multiple comparisons.
Additionally, others have found a large number of biomolecular systems in the intersection of gene expression and SNPs or interacting quantitative trait loci (QTLs) [3, 4]. Together, these results suggest that long lists of prioritized host genes (PHG) annotated to intragenic SNPs in expression arrays contain additional biological function information about diseases beyond that found in the top few SNPs. The limitation of these previous computational validation approaches is that they remain entirely “internal” to a single study and use empirical or theoretical statistics for multiple comparison correction rather than considering any recapitulation in new patient datasets or biological validation. In this paper, we propose that multiple GWAS focusing on the same complex disease phenotype may provide a means for “external” validation of the biomolecular systems predicted by enrichment studies in a single GWAS. However, to provide this evaluation, biomolecular functions and processes need to be replicated between GWAS. This is challenging because the Gene Ontology, through which biomolecular functions and processes are systematically queried, contains over 25,000 distinct terms organized in a directed acyclic graph of varying depth. Because of the high dimensionality of GO, the enriched GO terms in two studies may be similar (e.g. parent-child) but not identical. One solution is to use Information Theoretic Similarity (ITS), which was originally applied to GO by Lord et al. to show consistency between sequence and annotation [5]. Lerman et al. have shown high concordance between functions inferred from similarity in GO and both the measured structure and the functional similarity of proteins. We and others have also shown how biological systems similarity measured by ITS in GO can improve the accuracy of machine learning-based predictions of biological function for inadequately characterized genes [6].
These studies further support the accuracy of ITS and of “annotation transference” between sequence, structure, and function [7]. We first hypothesize that biomolecular functions and processes discovered in a meta-analysis of several GWAS can formally be “replicated in silico” in a separate independent GWAS and validated by a gold standard with no significant bias. This gold standard can be derived from known published disease-specific genes and GO annotations. Second, we hypothesize that we can increase the number of accurate predictions of disease-related biomolecular functions and processes that are significantly validated in the second GWAS by studying its ITS in GO in addition to its exact GO overlap when compared to other independent GWAS. Third, we hypothesize that we can organize the resulting specific biomolecular functions and processes into a smaller set of biomolecular systems, using the similarity criteria between GO terms to discriminate between uncharacterized systems of a disease and corroborated ones. In this proof-of-concept paper, we chose to focus our analysis on Adult Onset Diabetes Mellitus (AODM), as it is a particularly intricate multi-system disease with relatively low odds ratios (OR) for individual SNPs in pathologic phenotypes. We also propose an original evaluation based on GO terms associated with known host genes of AODM’s SNPs.

Methods

Data Processing and Experimental Design

The accompanying figure provides a schema and succinct description of the proposed methods and evaluation.

GWAS Datasets and SNP’s Host Gene(s)

In this study, we used three independent GWAS: the Wellcome Trust Case Control Consortium (WTCCC)[8], the Finland-United States Investigation of NIDDM Genetics (FUSION)[9], and the Diabetes Genetics Initiative (DGI)[10]. To standardize the gene names, refflat.txt, kgalias.txt, and knowngene.txt were downloaded from the UCSC genome browser (Nov. 2007) [11], as was the core data from the Human Genome Organisation (HUGO) Gene Nomenclature Committee (June 30, 2008). The Gene Ontology structure file gene_ontology.obo (July 08, 2008) and annotation file gene_annotation.goa_human.gz (Oct. 28, 2008) were downloaded from the GO website [12].

SNP’s Host Gene(s)

Detailed methods are described in the Suppl. Methods.

Functional Annotation of GWAS to GO

We conducted an enrichment study with GO functional annotations to prioritize biomolecular systems related to AODM, using host genes annotated to intragenic SNPs that were prioritized statistically in a GWAS. We systematically evaluated several numbers of prioritized host genes (PHG): 50, 100, 200, 300, 400, 500, and 1000. The unadjusted P-value of the GO enrichment was calculated using the cumulative hypergeometric distribution provided by an open source Perl API, GO-TermFinder[13]. Bonferroni correction (P-value*n, where n=number of GO terms in the test) was applied to control for multiple comparisons.

Hierarchical Refinement of Enriched GO Terms

Recent reports have stated that enrichment studies conducted over genes in GO can generate falsely significant P-values because parent classes inherit the genes of child classes that may be highly enriched [14].
To remove such false positive signals inherited through the GO hierarchy during enrichment, we refined the enriched GO terms in each study for each PHG list with a novel set-theoretic method described in the Supplemental Methods.

Determining Similarity between GO Enriched GWAS

In order to reduce the dimensionality of the predicted GO terms and to increase precision, we retained only those terms that were either identical or similar between two studies. We previously implemented Lin’s standardized ITS metric [15], which ranges from 0 to 1, to identify similarity between GO terms, and have shown that an ITS score ≥ 0.7 was significant and optimal for the prediction of GO function in sparsely annotated genes [6]. GO terms enriched in each GWAS were systematically compared to one another using (I) simple overlap (i.e. the same GO term, or ITS=1) and (II) similarity, pairing GO terms that each come from a distinct GWAS and have ITS ≥ 0.7 between them (left side of Panel IIB). Similar GO terms between two studies were also compared to the third GWAS for calculation of FDR and validation (evaluation by replication in the third study; right side of Panel IIB). With three GWAS, there were three possible combinations, in which each study serves once as the validation study. Each of these groups of predicted related GO terms (biological processes and molecular functions) was controlled with an FDR calculation using a bootstrap of randomly selected genes. For each GWAS and each prioritized host gene threshold, 100 bootstraps were conducted and subjected to the same analyses (GO enrichment, hierarchical refinement, and ITS between GO terms of studies) to control for in silico replication (right side of Panel IIB). Finally, GO terms with a similarity ≥ 0.7 among all three studies were also illustrated in a Cytoscape network [16] (network figure and Suppl. Fig. 1).
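The similarity comparison described above can be illustrated with a minimal sketch of Lin's measure, in which the similarity of two terms is twice the information content of their most informative common ancestor divided by the sum of their own information contents. The ontology, term names, and annotation probabilities below are hypothetical toy values, not real GO data, and the function names are illustrative rather than the authors' implementation.

```python
import math
from itertools import product

# Hypothetical toy ontology (child -> parents); a small DAG, not real GO.
PARENTS = {
    "root": set(),
    "signaling": {"root"},
    "metabolism": {"root"},
    "gtpase_regulation": {"signaling"},
    "gtpase_activation": {"gtpase_regulation"},
    "gtpase_inhibition": {"gtpase_regulation"},
    "glucose_metabolism": {"metabolism"},
}

# Hypothetical annotation probabilities: fraction of genes annotated to a
# term or to any of its descendants. The root has probability 1.
PROB = {
    "root": 1.0,
    "signaling": 0.4,
    "metabolism": 0.5,
    "gtpase_regulation": 0.05,
    "gtpase_activation": 0.02,
    "gtpase_inhibition": 0.02,
    "glucose_metabolism": 0.1,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), written as log(1/p) so the root gives exactly 0.0."""
    return math.log(1.0 / PROB[term])


def lin_similarity(a, b):
    """Lin's similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ancestors(a) & ancestors(b)
    ic_mica = max(information_content(t) for t in common)
    denom = information_content(a) + information_content(b)
    return 0.0 if denom == 0 else 2.0 * ic_mica / denom


def similar_pairs(terms_a, terms_b, threshold=0.7):
    """Pairs of terms, one from each study, whose similarity meets the threshold."""
    return sorted(
        (a, b)
        for a, b in product(terms_a, terms_b)
        if lin_similarity(a, b) >= threshold
    )


# Hypothetical enriched terms from two studies.
study_a = {"gtpase_activation", "glucose_metabolism"}
study_b = {"gtpase_inhibition", "signaling"}
pairs = similar_pairs(study_a, study_b)
print(pairs)
```

In this toy setting the two sibling GTPase terms are paired even though neither study shares an identical term with the other, which is the situation the exact-overlap comparison would miss.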
Generation and Evaluation of Gold Standard

To evaluate our predictions of GO terms associated with AODM, we developed a “gold standard” with no significant bias, based on GO annotations of 20 published diabetes genes [17]. Of the 20 genes, 19 were annotated in GO, generating 245 distinct terms in what we call our “gold standard GO” (GS-GO). In addition, we conducted a study confirming that GO terms associated with these genes are more related to one another (and thus to diabetes) than an empirical distribution derived from 100 bootstraps of a random pick of 19 host genes annotated in all three GWAS (P<0.0001; details in Suppl. Results).
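The GO enrichment step of the Methods (a cumulative hypergeometric P-value followed by Bonferroni correction) can be sketched as follows. This is a stand-in written with the Python standard library, not the GO-TermFinder code the authors used; the function names and all counts in the usage lines are hypothetical, apart from the roughly 2800 GO terms mentioned in the Table 1 legend.

```python
from math import comb


def hypergeom_tail(k, n_list, K, N):
    """P(X >= k): probability of seeing at least k annotated genes in a list
    of n_list genes drawn without replacement from N background genes, of
    which K carry the annotation."""
    upper = min(n_list, K)
    total = comb(N, n_list)
    hits = sum(comb(K, i) * comb(N - K, n_list - i) for i in range(k, upper + 1))
    return hits / total


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value (P * n), capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical example: 200 prioritized host genes, 8 of which carry a GO
# term annotated to 120 of 15000 background genes; about 2800 terms tested.
p_raw = hypergeom_tail(k=8, n_list=200, K=120, N=15000)
p_adj = bonferroni(p_raw, n_tests=2800)
print(p_raw, p_adj)
```

Exact integer binomials keep the sketch dependency-free; for large-scale use, a statistics library's hypergeometric survival function would be the more usual choice.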

Results and Discussion

Reproducibility and Validation of Biomolecular Systems (Exact Overlap)

We estimated the likelihood of a straightforward overlap of a specific enriched GO term in two or three studies. Table 1 provides the number of overlapping GO terms enriched between the FUSION and DGI GWAS (archetypical results; other combinations of GWAS not shown) according to the following parameters: (I) P-value of GO enrichment adjusted for multiplicity and reproducibility in two studies, and (II) length of the PHG list. The large number of predicted GO terms does not satisfy statistical significance as measured by adjusted P-value or by odds ratio with the gold standard. These drawbacks suggest that there is an opportunity to improve the accuracy of predictions of replicated GO terms between two studies using similarity (for detail, see Suppl. Results and Suppl. Tables 1&2).

Biomolecular Systems Predicted by ITS (Biological Similarity)

In a sense, GO annotations serve as a proxy to identify components of biomolecular functions and processes, which are then assembled into systems using the similarity metric as well. Applying such an approach to uncharacterized SNPs is related to our previous work on predicting GO functions and processes in sparsely annotated genes [6], where we showed that similarity could perform optimally in GO for values of ITS ≥ 0.7. Table 2 provides the number of similarity-predicted GO terms enriched between three independent GWAS (top) and between two GWAS, FUSION and DGI (bottom), according to the following parameters: (I) P-value of GO enrichment adjusted for multiplicity and reproducibility in multiple studies, and (II) length of the PHG list. As compared to the exact match study (Table 1), ITS dramatically improved the accuracy of the predictions according to three metrics: (I) an adjusted theoretical P-value threshold of 0.05 (white zones, non-grey), (II) empirical FDR ≤ 5% (unstruck data), and (III) a statistically significant odds ratio using the GS-GO (red and bold).
The optimal range of prediction with a high odds ratio is around 200–300 PHG, at an unadjusted GO enrichment P-value of 0.025. As the tables show, whether by exact match or by similarity, results between two studies are not statistically significant in most ranges, in contrast to those between three studies. There is nevertheless a distinct improvement with ITS: the number of accurate predictions tripled, and in some cases quintupled, as compared to those found by the exact overlap method. These results suggest that ITS provides opportunities to uncover novel properties of biomolecular systems overlooked by straightforward overlap methods. We also organized these GO terms into 11 biomolecular system classes by similarity, to identify which of the GO terms found by similarity were novel as compared to those found by exact match (network figure). Cluster C, GTPase regulator activity, does not contain any black circles (exact match) and thus corresponds to a group of GO terms predicted exclusively by similarity. Further, Suppl. Figure 1 provides a more detailed map of GO terms in which six “biomolecular systems” are corroborated by the gold standard (red circles) and five may be novel. Using ITS, the 69 GO terms that were similar among the three GWAS were also enriched in diabetes signal, as they comprised 12 gold standard terms (GS-GO; P=4.81e-05, cumulative hypergeometric test). There was a significant fourfold increase in the number of predictions and a threefold increase in the recapitulation of our gold standard annotations. At least four biomolecular systems were related to the nervous system, which may indicate a pleiotropy of biomolecular systems involved in complex traits. Although commonly thought of as an endocrine organ, the endocrine pancreas, which is responsible for secreting insulin and glucagon, is a neuroendocrine organ with parasympathetic innervation. Further analysis is described in the Suppl. Results.
Future Studies and Limitations

We observed several limitations to this study. First, GO does not encompass pathophysiological and higher clinical concepts. Consequently, important associations with complex disease systems cannot be derived. Further, the gold standard we derived from all GO terms associated with diabetes SNPs is far from optimal and likely comprises GO terms not related to diabetes. In future studies, we will explore the use of expression arrays to derive an improved gold standard, and also the use of eQTL associations in order to include intergenic SNPs via their “trans”-correlated genes in addition to intragenic ones. We will use this approach to identify common systems across complex diseases. For example, one can imagine a study of metabolic syndrome that pools hypertension, diabetes, and cardiovascular studies along with obesity to generate the biomolecular systems underlying this complex disease.

Conclusion

We successfully implemented a framework of functional similarity using Gene Ontology to more accurately recapitulate known associations of biomolecular systems with complex diseases in GWAS according to a gold standard, and to predict uncharacterized ones. We further show that this similarity technique is able to find more associations than a straightforward overlap of GO terms. Currently, new biological processes are found through GWAS one host gene at a time, which requires additional biological validation. To our knowledge, this is the first study that demonstrates reliable novel biological processes repeated across GWAS. Since very little is known of the biomolecular mechanisms relating SNPs to complex diseases, and of how these processes differ from those of single gene inheritance or from those observed in animal models, this approach could contribute to mapping the SNP phenome in high throughput and at low cost from existing studies, providing insight into the genetic architecture underpinning the inheritable variants of complex diseases. (See Supplementary Information for Additional References and Acknowledgements)

Table 1.

Reproducibility of Biomolecular Systems predicted in GWAS through exact overlap of GO terms. This table presents the count of GO terms enriched in each GWAS according to the unadjusted P-value (top row) and replicated in two (FUSION and DGI, lower) or three studies (upper). Different rows present results for increasing numbers of prioritized host genes (PHG) derived from GWAS intragenic SNPs. Joint P-values for two and three GWAS were corrected with a combinatorial Bonferroni correction ((P-value^m)*n, where m=number of GWAS and n=count of GO terms studied, approximately 2800). Odds ratios were calculated from a gold standard of biomolecular results (see Methods). Legend: empty sets (0) are presented in pale grey (see Methods, bootstrap); struck, FDR>5%; unshaded, meets Bonferroni correction; bold and red, odds ratio significant (95% confidence interval) according to our gold standard.
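The combinatorial Bonferroni correction in the legend, (P-value^m)*n, can be worked through numerically. The sketch below assumes the same unadjusted P-value applies in each of the m studies and caps the result at 1; both the cap and the function name are assumptions made for illustration. The inputs P=0.025 and n≈2800 are taken from the text.

```python
def joint_bonferroni(p, m, n):
    """Joint P-value for a term enriched at unadjusted P-value p in each of
    m independent GWAS, corrected for n GO terms: (p ** m) * n, capped at 1."""
    return min(1.0, (p ** m) * n)


p_unadjusted = 0.025  # unadjusted enrichment P-value quoted in the Results
n_terms = 2800        # approximate count of GO terms studied (table legend)

two_studies = joint_bonferroni(p_unadjusted, m=2, n=n_terms)
three_studies = joint_bonferroni(p_unadjusted, m=3, n=n_terms)
print(two_studies, three_studies)
```

Under these assumptions the two-study value exceeds 1 before capping (0.025^2 * 2800 = 1.75), while the three-study value is about 0.044, which falls below 0.05. This is consistent with the observation that replication across three studies reaches significance where two studies often do not.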

Table 2.

Reproducibility of Biomolecular Systems Predicted through Similarity. This table presents the count of GO terms enriched in each GWAS using information theoretic similarity (ITS) between GO terms (ITS>0.7). The lower table shows the number of GO terms predicted in two studies (FUSION and DGI) by ITS. The upper table then relates these, again by ITS, to those of the WTCCC. A similarity of ITS=1 is the same as the intersection of the GO term sets reported in Table 1, whose legend also applies here; *odds ratio significant in every combination of GWAS at that PHG and P-value.

1. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.

Authors: P W Lord; R D Stevens; A Brass; C A Goble
Journal: Bioinformatics Date: 2003-07-01 Impact factor: 6.937

2. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

3. GO::TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes.

Authors: Elizabeth I Boyle; Shuai Weng; Jeremy Gollub; Heng Jin; David Botstein; J Michael Cherry; Gavin Sherlock
Journal: Bioinformatics Date: 2004-08-05 Impact factor: 6.937

4. Modularity and interactions in the genetics of gene expression.

Authors: Oren Litvin; Helen C Causton; Bo-Juen Chen; Dana Pe'er
Journal: Proc Natl Acad Sci U S A Date: 2009-02-17 Impact factor: 11.205

5. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels.

Authors: Richa Saxena; Benjamin F Voight; Valeriya Lyssenko; Noël P Burtt; Paul I W de Bakker; Hong Chen; Jeffrey J Roix; Sekar Kathiresan; Joel N Hirschhorn; Mark J Daly; Thomas E Hughes; Leif Groop; David Altshuler; Peter Almgren; Jose C Florez; Joanne Meyer; Kristin Ardlie; Kristina Bengtsson Boström; Bo Isomaa; Guillaume Lettre; Ulf Lindblad; Helen N Lyon; Olle Melander; Christopher Newton-Cheh; Peter Nilsson; Marju Orho-Melander; Lennart Råstam; Elizabeth K Speliotes; Marja-Riitta Taskinen; Tiinamaija Tuomi; Candace Guiducci; Anna Berglund; Joyce Carlson; Lauren Gianniny; Rachel Hackett; Liselotte Hall; Johan Holmkvist; Esa Laurila; Marketa Sjögren; Maria Sterner; Aarti Surti; Margareta Svensson; Malin Svensson; Ryan Tewhey; Brendan Blumenstiel; Melissa Parkin; Matthew Defelice; Rachel Barry; Wendy Brodeur; Jody Camarata; Nancy Chia; Mary Fava; John Gibbons; Bob Handsaker; Claire Healy; Kieu Nguyen; Casey Gates; Carrie Sougnez; Diane Gage; Marcia Nizzari; Stacey B Gabriel; Gung-Wei Chirn; Qicheng Ma; Hemang Parikh; Delwood Richardson; Darrell Ricke; Shaun Purcell
Journal: Science Date: 2007-04-26 Impact factor: 47.728

6. Information theory applied to the sparse gene ontology annotation network to predict novel gene function.

Authors: Ying Tao; Lee Sam; Jianrong Li; Carol Friedman; Yves A Lussier
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

7. Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease.

Authors: Norbert Hubner; Caroline A Wallace; Heike Zimdahl; Enrico Petretto; Herbert Schulz; Fiona Maciver; Michael Mueller; Oliver Hummel; Jan Monti; Vaclav Zidek; Alena Musilova; Vladimir Kren; Helen Causton; Laurence Game; Gabriele Born; Sabine Schmidt; Anita Müller; Stuart A Cook; Theodore W Kurtz; John Whittaker; Michal Pravenec; Timothy J Aitman
Journal: Nat Genet Date: 2005-02-13 Impact factor: 38.330

8. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants.

Authors: Laura J Scott; Karen L Mohlke; Lori L Bonnycastle; Cristen J Willer; Yun Li; William L Duren; Michael R Erdos; Heather M Stringham; Peter S Chines; Anne U Jackson; Ludmila Prokunina-Olsson; Chia-Jen Ding; Amy J Swift; Narisu Narisu; Tianle Hu; Randall Pruim; Rui Xiao; Xiao-Yi Li; Karen N Conneely; Nancy L Riebow; Andrew G Sprau; Maurine Tong; Peggy P White; Kurt N Hetrick; Michael W Barnhart; Craig W Bark; Janet L Goldstein; Lee Watkins; Fang Xiang; Jouko Saramies; Thomas A Buchanan; Richard M Watanabe; Timo T Valle; Leena Kinnunen; Gonçalo R Abecasis; Elizabeth W Pugh; Kimberly F Doheny; Richard N Bergman; Jaakko Tuomilehto; Francis S Collins; Michael Boehnke
Journal: Science Date: 2007-04-26 Impact factor: 47.728

9. The HUGO Gene Nomenclature Database, 2006 updates.

Authors: Tina A Eyre; Fabrice Ducluzeau; Tam P Sneddon; Sue Povey; Elspeth A Bruford; Michael J Lush
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. The UCSC genome browser database: update 2007.

Authors: R M Kuhn; D Karolchik; A S Zweig; H Trumbower; D J Thomas; A Thakkapallayil; C W Sugnet; M Stanke; K E Smith; A Siepel; K R Rosenbloom; B Rhead; B J Raney; A Pohl; J S Pedersen; F Hsu; A S Hinrichs; R A Harte; M Diekhans; H Clawson; G Bejerano; G P Barber; R Baertsch; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2006-11-16 Impact factor: 16.971
