
Evolutionary triangulation: informing genetic association studies with evolutionary evidence.

Minjun Huang¹, Britney E Graham¹, Ge Zhang², Reed Harder¹, Nuri Kodaman¹, Jason H Moore³, Louis Muglia⁴, Scott M Williams⁵.

Abstract

Genetic studies of human diseases have identified many variants associated with pathogenesis and severity. However, most studies have used only statistical association to assess putative relationships to disease, and ignored other factors for evaluation. For example, evolution is a factor that has shaped disease risk, changing allele frequencies as human populations migrated into and inhabited new environments. Since many common variants differ among populations in frequency, as does disease prevalence, we hypothesized that patterns of disease and population structure, taken together, will inform association studies. Thus, the population distributions of allelic risk variants should reflect the distributions of their associated diseases. Evolutionary Triangulation (ET) exploits this evolutionary differentiation by comparing population structure among three populations with variable patterns of disease prevalence. By selecting populations based on patterns where two have similar rates of disease that differ substantially from a third, we performed a proof of principle analysis for this method. We examined three disease phenotypes, lactase persistence, melanoma, and Type 2 diabetes mellitus. We show that for lactase persistence, a phenotype with a simple genetic architecture, ET identifies the key gene, lactase. For melanoma, ET identifies several genes associated with this disease and/or phenotypes related to it, such as skin color genes. ET was less obviously successful for Type 2 diabetes mellitus, perhaps because of the small effect sizes in known risk loci and recent environmental changes that have altered disease risk. Alternatively, ET may have revealed new genes involved in conferring disease risk for diabetes that did not meet nominal GWAS significance thresholds. 
We also compared ET to another method used to filter for phenotype associated genes, population branch statistic (PBS), and show that ET performs better in identifying genes known to associate with diseases appropriately distributed among populations. Our results indicate that ET can filter association results to improve our ability to discover disease loci.

Keywords: Genetic association; Health disparity; Population differentiation; Selection

Year: 2016 PMID: 27042214 PMCID: PMC4818851 DOI: 10.1186/s13040-016-0091-7

Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522

Introduction

As humans moved out of Africa, population patterns of genetic variation changed dramatically, due to both random (e.g., genetic drift that can cause serial founder events) and non-random (environment-specific selection) processes. The histories of the different populations have therefore resulted in substantial differences among populations in allele frequencies at loci throughout the genome [1]. Many, if not most, of these differences will have limited effect on phenotypic variation, but some allelic substitutions may have implications for differences in phenotypic variation among populations that may help inform the search for genes of biomedical significance. Specifically, it could be hypothesized that the alleles that affect disease risk in multiple populations should be distributed in ways that are consistent with differences in population prevalence, such that risk alleles should be more frequent in those populations where a given disease is more prevalent and less frequent where the disease is relatively rare. Screening variants for differences in allele frequencies among populations may serve as an effective filter to identify candidates for disease, even in the absence of prior physiological evidence for disease-related function. However, because of the large number of allele frequency differences between any two populations [2], pairwise comparisons between most human populations will generate too many differentiated single nucleotide polymorphisms (SNPs) to be of much practical use; any single comparison is likely to generate too many possible loci, most of which are unlikely to be related to a specific phenotype or class of phenotypes (Fig. 1a).
We hypothesize that adding a third population with similar disease prevalence to one of the other two populations being compared increases our ability to define genomic regions of particular interest with respect to disease or phenotypic variation by removing the vast majority of loci or SNPs that, although highly differentiated, are unlikely to associate with phenotypes of interest (Fig. 1b). The intersection of variants that have similar allele frequencies in populations with similar disease prevalences, and different allele frequencies between populations with different prevalences, should yield an enrichment of genes that associate with a given disease. We define such variants as “appropriately distributed” with respect to a given phenotype. Similarly, diseases that we will call “appropriately distributed” are those that have prevalences distributed consistently with the allele frequency patterns of variation.

Fig. 1

Limiting the number of SNPs by switching from pairwise to three-way comparison. The numbers of SNPs identified in pairwise versus three-way population comparisons are illustrated by showing each two-way analysis and the three-way overlap. The number in each circle indicates the number of SNPs that are identified under a particular FST threshold for a pairwise comparison; the numbers are shown in the lower left legend. For example, a indicates that there are 65119 SNPs under the 95th percentile FST threshold in the comparison between CEU and YRI. b shows that there is a significant decrease to 22 SNPs (center triangle) when using a three-way comparison under thresholds of the same stringency.

In this manuscript, we present a heuristic approach that incorporates genetic and epidemiological information – namely, known disease distributions with pairwise allele frequency differences among three populations – to filter association results from other genetic epidemiological analyses, such as genome wide association studies (GWAS). We call this approach Evolutionary Triangulation (ET), as it represents comparisons of three genetically distinct populations, assayed simultaneously. To assess levels of differentiation, we use Wright’s FST [3], which is a metric that provides an estimate of the level of allele frequency differences among populations. To assess the performance and to determine the limits of our approach, we applied ET to a number of phenotypes that vary in presumed genetic architecture, ranging from an essentially Mendelian trait (lactase persistence), to diseases of increasing complexity, i.e., melanoma, representing an oligogenic disease, and fasting glucose/Type 2 diabetes mellitus as a highly complex disease with many genes of small effect.

Methods

Index phenotype and population selection

We selected three index phenotypes in this study: lactase persistence, melanoma, and Type 2 diabetes mellitus/fasting glucose, based on prior epidemiological data, most of which were derived from the World Health Organization’s website (http://www.who.int/en/) or current literature (Additional file 1: Table S1). Our selection of phenotypes was further guided by the availability of genetic data from appropriate HapMap populations that coincide with the epidemiological data (Additional file 1: Table S1). For instance, melanoma is rare in South Asians and Africans, but common in Europeans (e.g., ~150-fold greater prevalence than in South Asians), matching the HapMap populations GIH, YRI and CEU. Additionally, these phenotypes represent a range of underlying genetic architectures, allowing us to assess whether ET is applicable to a broad variety of traits, from effectively Mendelian diseases to complex disorders with small effect sizes (e.g., odds ratios as low as 1.04) (Table 1).

Table 1

Genetic associations of the 85th/15th threshold ET genes with index diseases/traits (CEU-GIH-YRI)

Index disease	ET genes	5th^a	10th^a	Odds ratio^b
Lactase Persistence	LCT	Y	Y	Mendelian
Melanoma/Skin Neoplasms	OCA2			3.16 [58]
	SLC45A2	Y	Y	2.78 [15]
	TYRP1			1.15 [59]
	XRCC1			0.60 [60]
Diabetes Mellitus, Type 2/Glucose Intolerance/Insulin Resistance^c	ADAMTS9			1.12 [61]
	DGKB			1.04 [62]
	FTO			1.17 [63]
	IDE			1.28 [64]
	IGF2BP2		Y	1.22 [65]
	IL6		Y	1.29 [66]
	SH2B1		Y	1.16 [67]
	SREBF1		Y	1.17 [68]

^a 5th/10th: Y indicates ET genes identified under the 95th/5th or 90th/10th threshold, respectively

^b Largest reported odds ratio

^c Case diagnosis is as defined in each reference

Triangulation for ET SNPs and mapping to ET genes

We chose representative HapMap populations based on the prevalences of the index phenotypes (Additional file 1: Table S1). We obtained all SNP allele frequency data for unrelated individuals from the selected populations in the International HapMap Project Phase III [4], which included 113 Utah residents with Northern and Western European ethnicity (CEU), 84 Han Chinese from Beijing, China (CHB), 88 Tuscans from Italy (TSI), 88 Gujarati Indians from Houston (GIH), 113 Yoruba from Ibadan, Nigeria (YRI), 86 Japanese in Tokyo, Japan (JPT), and 50 Mexican Americans in Los Angeles, California (MEX). CEU was a proxy for Northern European populations, CHB and JPT for East Asian populations, TSI for Southern European populations, GIH for South Asian populations, MEX for Central American Hispanic populations, and YRI for West African populations. To assess the population genetic differences we calculated unbiased estimates of FST based on HapMap allele frequency data according to the Weir and Cockerham formula [5]. For example, among the CEU, GIH and YRI, as shown in Fig. 1, the FST of each SNP was calculated pairwise. ET SNPs (center triangle, Fig. 1b) were selected according to the overlaps of high FST(CEU_GIH), high FST(CEU_YRI) and low FST(GIH_YRI). Since we did not know a priori the appropriate levels of FST similarity/dissimilarity that would enable us to detect genes associated with specific phenotypes, we used different FST thresholds to yield different numbers of ET SNPs. We first chose a highly stringent threshold, with the 95th percentile of FST indicating high differentiation and the 5th indicating a high degree of similarity (Fig. 2). This allowed us to limit the number of ET SNPs suitable for further hand-curated genetic association mining. We then applied more lenient thresholds, namely the 90th percentile and the 85th percentile reflecting high differentiation, and the corresponding 10th and 15th percentile reflecting sufficient similarity. 
ET genes were defined as genes that were within 100 Kb upstream or downstream of each ET SNP, according to NCBI Build 37. To further explore putative associating loci, we also examined the recombination landscape around ET-detected SNPs, using LocusZoom (http://locuszoom.sph.umich.edu/locuszoom/).
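The selection and mapping steps described above can be illustrated with a minimal sketch. It assumes per-SNP pairwise FST values have already been computed (the study used the Weir and Cockerham estimator, which is not reimplemented here). The function names `percentile`, `et_snps` and `genes_near`, the interpolation rule for percentiles, and the toy gene coordinates are illustrative assumptions, not part of the published analysis.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between order statistics.

    The interpolation rule is an assumption; the text does not say how
    percentiles of the FST distribution were computed.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    k = (len(xs) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def et_snps(fst_ab, fst_ac, fst_bc, high_pct=95.0):
    """Return indices of SNPs passing the three-way ET filter.

    fst_ab, fst_ac: FST between the outlier population A (e.g. CEU) and
                    each of the two similar populations B and C.
    fst_bc:         FST between the two populations with similar prevalence.
    high_pct:       upper percentile; the lower one is 100 - high_pct,
                    so 95.0 corresponds to the 95th/5th threshold.
    """
    if not (len(fst_ab) == len(fst_ac) == len(fst_bc)):
        raise ValueError("FST lists must cover the same SNPs")
    hi_ab = percentile(fst_ab, high_pct)
    hi_ac = percentile(fst_ac, high_pct)
    lo_bc = percentile(fst_bc, 100.0 - high_pct)
    return [
        i
        for i in range(len(fst_ab))
        if fst_ab[i] >= hi_ab and fst_ac[i] >= hi_ac and fst_bc[i] <= lo_bc
    ]


def genes_near(snp_pos, genes, window=100_000):
    """Names of genes within `window` bp upstream or downstream of a SNP.

    genes: iterable of (name, start, end) on the same chromosome as the SNP.
    """
    return [
        name
        for name, start, end in genes
        if start - window <= snp_pos <= end + window
    ]


# Toy data: only the last SNP is highly differentiated in both A-B and A-C
# and nearly undifferentiated in B-C.
selected = et_snps(
    [0.0, 0.1, 0.2, 0.3, 0.9],
    [0.0, 0.1, 0.2, 0.3, 0.8],
    [0.5, 0.4, 0.3, 0.2, 0.0],
    high_pct=80.0,
)
```

Relaxing `high_pct` from 95 to 90 or 85 widens all three tails at once, which is how the more lenient 90th/10th and 85th/15th thresholds enlarge the ET SNP set.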

Fig. 2

FST distributions among ET comparisons of CEU, GIH and YRI populations. The figure shows the FST distributions of CEU-GIH, CEU-YRI and GIH-YRI. In all three distributions, most FST values are less than 0.1, which means that most SNPs are not greatly differentiated among populations. The red dotted lines indicate the percentile thresholds we used to generate ET SNPs. For example, for CEU-GIH and CEU-YRI we took SNPs having an FST greater than or equal to the 95th percentile, which are those to the right of the red dotted lines. For GIH-YRI, we took SNPs having an FST less than or equal to the 5th percentile, which are those to the left of the red dotted line. By overlapping these three sets of SNPs, we generated the ET SNPs.

Genetic association mining and identification of additional phenotypes

Genetic associations of ET genes with diseases were analyzed based on data retrieved from the HuGE Navigator [6] (http://www.cdc.gov/genomics/hugenet/hugenavigator.htm), a human genome epidemiology knowledgebase that incorporates information from PubMed abstracts as its core data source. We defined a “true” association as one for which a gene had: (1) five or more HuGE publications significantly associating it with a phenotype using the Genopedia function and p < 0.05 in any combination of studied populations, or (2) GWAS evidence with a P value of less than 5 × 10⁻⁸ reported in HuGE Navigator. Although negative GWAS results have been shown to correlate with geography and FST [7], we chose to be conservative, ignoring negative results as this is a proof of principle. Additional phenotypes were noted if they too associated with ET genes in the HuGE Navigator database. Other associating diseases were then assessed for their global distributions to assess whether they are also similarly distributed to the index phenotypes (Additional file 1: Table S1).
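The two criteria for a “true” association can be written as a simple predicate. This is a sketch only: the function name and its inputs are hypothetical, and it assumes the publication count has already been restricted to studies reporting p < 0.05. It does not query HuGE Navigator.

```python
GWAS_THRESHOLD = 5e-8   # genome-wide significance threshold from criterion (2)
MIN_PUBLICATIONS = 5    # minimum significant publications from criterion (1)


def is_true_association(n_significant_pubs, best_gwas_p=None):
    """Apply the two curation criteria described in the text.

    n_significant_pubs: number of publications significantly (p < 0.05)
                        associating the gene with the phenotype.
    best_gwas_p:        smallest reported GWAS p value, or None if the gene
                        has no GWAS evidence.
    """
    if n_significant_pubs >= MIN_PUBLICATIONS:
        return True
    return best_gwas_p is not None and best_gwas_p < GWAS_THRESHOLD
```

Either criterion alone is sufficient, so a gene with a single genome-wide significant GWAS hit qualifies even if few candidate-gene publications exist.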

Random resampling analysis

To determine whether ET significantly enriched for genes that associated with the index or other appropriately distributed phenotypes, we performed permutation testing. Specifically, for each ET threshold we sampled from the genome the same number of genes that were identified to be associated with appropriately distributed disease using ET. We then verified, using the same criteria as for ET, how many of these genes were associated with appropriately distributed disease based on the continental ancestry for a given comparison. We used the empirical distribution to determine the p-value of the ET analyses. The empirical distributions were determined by sampling the genome 10,000 times without replacement. We established the number of times that the same number or more such genes were found by this random process.
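A minimal sketch of such a resampling test follows. Several details are assumptions rather than the published procedure: the size of each random draw is passed in as `n_sampled`, sampling is without replacement within each draw, and the p-value is the plain proportion of draws that match or exceed the observed count. All names are hypothetical.

```python
import random


def empirical_p(genome_genes, associated_genes, n_sampled, n_observed,
                n_iter=10_000, seed=0):
    """Estimate how often random gene sets do as well as the ET gene set.

    genome_genes:     all gene names eligible for sampling.
    associated_genes: genes meeting the association criteria for an
                      appropriately distributed phenotype.
    n_sampled:        number of genes drawn per iteration.
    n_observed:       number of associated genes found by ET.
    """
    rng = random.Random(seed)
    genome = list(genome_genes)
    associated = set(associated_genes)
    as_extreme = 0
    for _ in range(n_iter):
        draw = rng.sample(genome, n_sampled)  # without replacement
        if sum(gene in associated for gene in draw) >= n_observed:
            as_extreme += 1
    return as_extreme / n_iter
```

A common variant adds one to both the numerator and the denominator so that the estimate can never be exactly zero; which convention was used in the study is not stated in the text.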

Results

ET comparisons among populations were chosen to reflect the distribution of several phenotypes, ranging from genetically simple to increasingly complex, that differ in prevalence among HapMap populations. Estimated genetic complexity was based on the number of GWAS identified genes and their effect sizes, extracted from the NHGRI GWAS catalog (Additional file 2: Table S2). The index phenotypes we analyzed were, in order of presumed increasing genetic complexity, lactase persistence, melanoma, and Type 2 diabetes mellitus/fasting glucose. These phenotypes are likely related to evolutionary adaptations to different environments, making the integration of evolutionary data more promising.

Lactase persistence

Lactase persistence is the ability to digest lactose in adulthood and is the inverse of lactose intolerance [8]. More than 90 % of Northern European individuals have lactase persistence. In contrast, about half of the population of Southern Europe, and most Asians and Africans, do not have lactase persistence, as milk has been rarely consumed in these regions (Additional file 1: Table S1). Evolutionary evidence indicates that there has been strong selective pressure in populations with animal domestication and adult milk consumption to maintain the ability to metabolize lactose after weaning [9]. The genetics of lactase persistence is relatively simple. The primary gene that affects the phenotype is lactase (LCT) in the LCT-MCM6 locus on chromosome 2 [8], and the identified variations are primarily cis-acting elements [10]. Interestingly, the variants associated with lactase persistence in European and African pastoralist populations are different, even though they are within the same upstream region of LCT [9]. In both cases, the evidence indicates that all variants are under strong selection corresponding to levels of animal milk consumption and pastoralist culture. We applied ET to lactase persistence using three HapMap populations, CEU, TSI and CHB. Using the 95th percentile threshold to define highly differentiated SNPs between CEU/TSI and CEU/CHB, and the 5th percentile threshold to define similarly distributed SNPs between TSI and CHB, we identified eight ET SNPs and eight ET genes (Additional file 3: Table S3a). Seven of these SNPs were within ± 100 Kb of the LCT-MCM6 region and, although more than 100 Kb from LCT, the eighth was in the same region (Fig. 3). Of note, the majority of the ET SNPs were bounded by distal recombination hotspots, illustrating how we might refine locus boundaries (Additional file 4: Figure S1).

Fig. 3

ET SNPs in vicinity of the LCT gene. Seven out of eight ET SNPs generated under the 95th/5th percentile threshold are within 100 Kb of the LCT locus. The eighth is to the right of the DARS gene, upstream of the LCT-MCM6 locus. (Coordinates show relative distance only and do not indicate exact build 37 coordinates.)

To assess the sensitivity of ET to varying levels of FST thresholds, we used multiple thresholds ranging from the 80th/20th to the 90th/10th percentile. The significant over-representation of the LCT-MCM6 signal still held under less stringent thresholds. Under the 90th/10th threshold, we obtained 29 ET SNPs and 14 genes (Additional file 3: Table S3b). Twenty of these SNPs were within ± 100 Kb of the LCT-MCM6 region. Under the 85th/15th threshold, we generated 33 ET SNPs and 14 genes (Additional file 3: Table S3c). Again, twenty of these SNPs were within ± 100 Kb of the LCT-MCM6 region. Even using the 80th percentile threshold, with FST(CEU_CHB) and FST(CEU_TSI), and the 20th percentile threshold, with FST(TSI_CHB), we generated 50 ET SNPs, of which 23 were within ± 100 Kb of the LCT-MCM6 region (Additional file 3: Table S3d). FST values for other thresholds and populations are presented in Additional file 5: Table S4. Based on the prevalence of lactase persistence in other populations [11] (Additional file 1: Table S1), we also applied ET to CEU, YRI and GIH. Highly differentiated SNPs between CEU and GIH and between CEU and YRI were compared to highly similar SNPs between GIH and YRI. With the 95th/5th threshold, we identified 22 SNPs, of which two were within ± 100 Kb of the LCT-MCM6 region (Additional file 6: Table S5). Additionally, using the 95th/5th threshold, we were able to identify the LCT-MCM6 gene region in comparisons of CEU to CHB and LWK (HapMap East African), neither of which has a high prevalence of lactase persistence (data not shown).

Two way comparison for lactase persistence

As we hypothesized that adding a third population would improve the efficiency of filtering for disease related variants, we investigated the number of genes revealed by examining only two-way comparisons to assess the improved precision derived from adding a third population. To do this, we compared the percentage of SNPs that are within ± 100 Kb of the LCT-MCM6 locus in all two-way FST comparisons vs. the three-way FST comparison (ET). Comparing only CEU to TSI under FST thresholds ranging from 95th/5th to 85th/15th, we found that for two-way FST comparisons at best 0.13 % of the generated signals were located within the LCT-MCM6 region (95th/5th comparison). However, in the three-way FST comparison, more than 60 % of the ET SNPs are located within ± 100 Kb of the LCT-MCM6 region (Fig. 4). Adding a third population greatly enriched the proportion of variants related to the corresponding phenotype(s), demonstrating that ET can better resolve associating SNPs/genes.

Fig. 4

Percentage of ET SNPs within ± 100 Kb of the LCT-MCM6 genic region. In all two-way comparisons, only a small proportion (<0.2 %) of the SNPs are within ± 100 Kb of the LCT-MCM6 region. For example, the 95th percentile FST threshold of the two-way comparison between CEU and TSI (red line) generates 68,106 SNPs, only 91 (0.133 %) of which are located within ± 100 Kb of the LCT-MCM6 region. In contrast, in the three-way FST comparison (ET), more than 60 % of the ET SNPs are within the same region (blue line). The black line represents the pairwise comparison between CEU and CHB, while the green line represents that between CHB and TSI.
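The enrichment reported for lactase persistence can be checked with a short calculation using the SNP counts quoted in the text and in the Fig. 4 caption. The helper name `percent_in_region` is ours; only the counts come from the source.

```python
def percent_in_region(n_in_region, n_total):
    """Percentage of SNPs falling within ± 100 Kb of the target locus."""
    return 100.0 * n_in_region / n_total


# Two-way comparison, CEU vs. TSI at the 95th percentile: 91 of 68,106 SNPs.
two_way = percent_in_region(91, 68_106)

# Three-way (ET) comparison of CEU, TSI and CHB, as (SNPs in region, ET SNPs):
# 95th/5th, 90th/10th and 85th/15th thresholds.
three_way = [percent_in_region(n, total)
             for n, total in [(7, 8), (20, 29), (20, 33)]]
```

Under these counts the two-way figure stays near 0.13 %, while every three-way figure exceeds 60 %, matching the contrast described for Fig. 4.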

Melanoma and other diseases correlated with latitude

Melanoma was our second selected index phenotype. About 40 genes have been associated with melanoma, according to HuGE Navigator, making it substantially more complex than lactase persistence. The prevalence of melanoma varies with latitude [12], a relationship likely due to dark skin being favored in regions of high UV radiation, where it protects from skin damage, while still allowing adequate vitamin D to be synthesized. At the same time, dark skin limits UV degradation of folate in underlying skin layers, and low folate levels are associated with increased risk of neural tube defects in utero [13]. In contrast, light skin is more common at high latitudes where its presence promotes vitamin D biosynthesis in environments with low ultraviolet radiation [14]. Melanoma is relatively common in populations of European descent, but its prevalence is much lower in populations of South Asian and African descent (Additional file 1: Table S1). Thus, applying ET analysis to melanoma using the CEU, YRI and GIH populations (with CEU as the outlier population) fits the population prevalence requirements for ET filtering. Under the stringent 95th/5th percentile threshold we obtained 22 ET SNPs that mapped to 33 ET genes (Additional file 6: Table S5a). One of these 33 genes, SLC45A2, has previously been shown to associate with melanoma [15]. The ET SNP that identified this locus, rs28117, is bounded both proximally and distally by recombination hotspots, thereby providing additional resolution (Additional file 7: Figure S2). Under the 90th/10th percentile threshold, we generated 168 ET SNPs which mapped to 230 ET genes. Under the least stringent 85th/15th percentile threshold, we generated 733 ET SNPs that mapped to 971 ET genes, four of which are known to associate with melanoma (Table 1, Additional file 6: Table S5b, S5c). This analysis also identified numerous other genes associated with diseases that are distributed appropriately among these three populations. 
One example is multiple sclerosis (MS), a complex disease with more than 80 genes that have been associated with susceptibility, but each of these variants confers only a very small change in risk. Collectively, these genes explain approximately 20 % of the genetics of MS [16, 17]. Similar to melanoma, the prevalence of MS is associated with latitude; the most prominent correlating factor is ultraviolet radiation/vitamin D [18]. Recently, by showing positive selection on MS associated loci, researchers proposed that MS might have arisen due to pleiotropic effects of host resistance to pathogens over the course of human history, where there has been significant selective pressure acting to increase this resistance to pathogens [19]. Under the 90th/10th percentile threshold, we were able to identify one gene, IL6, previously associated with MS. IL6 is of particular interest as it has been associated with 15 diseases that are similarly distributed, several of which are related to immune function, indicating strong pleiotropic effects (Additional file 8: Table S6). The ET analysis for melanoma also identified genes associating with many other similarly distributed phenotypes (Additional file 8: Table S6). For example, ET detected LCT under the most stringent 95th/5th FST percentile threshold, consistent with the distribution of lactase persistence. Besides lactase persistence, ET genes under the least stringent 85th/15th threshold have also been associated with other Mendelian disorders having appropriate distributions among the three populations, such as oculocutaneous albinism, glucosephosphate dehydrogenase deficiency and Smith-Lemli-Opitz Syndrome. ET was able to capture the key genes that cause these diseases, namely OCA2, G6PD and DHCR7. Other disorders have also been associated with these ET genes, but the association is not as direct. 
For example, G6PD has been associated with favism caused by G6PD deficiency that can affect survival of the malaria parasite, Plasmodium falciparum [20]. Malaria is much more common in South Asia and Africa than it is in northern Europe, highlighting another appropriately identified genetic association. ET was also able to capture genes that have been associated with complex diseases having appropriate distributions among the three populations, for example, Vitamin D deficiency, which is strongly correlated with skin color. DHCR7 and CYP24A1 have the largest reported effect sizes for Vitamin D deficiency and were found using ET [21]. Interestingly, another ET gene, BNC2, has also been associated with hair and skin color in a GWAS study [22]. As melanoma is strongly associated with skin color, we expect that genes affecting skin color would also appear. The evolutionary factors that drive these phenotypes are likely common, emphasizing the ability of ET to detect genes that are pleiotropic. ET genes have also been associated with other complex diseases, such as pulmonary tuberculosis, multiple cancers, and Alzheimer’s disease, all of which have relatively large effect sizes (Additional file 8: Table S6). ET comparisons also allowed us to identify key immune regulators, such as IFNG and IL6, which associate with many phenotypes.

Type 2 diabetes mellitus/fasting glucose

Fasting glucose, although measured as a continuous trait, is reflective of Type 2 diabetes mellitus prevalence. Because diabetes case criteria may differ among studies based on variable diagnostic standards, we chose to use fasting glucose as our index phenotype, as it is a more objective but highly correlated trait representing the complex genetic architecture of Type 2 diabetes. In the Mexican population, the average fasting glucose level is higher than in Asian and Northern European populations. For this analysis, we used the populations MEX, JPT, and CEU for ET analyses based on current epidemiological data. We identified no genes that have been associated with Type 2 diabetes mellitus, according to our criteria. Because of this inability to identify appropriate genes, we expanded our search to include genes that were within 250 Kb of each ET SNP. Although we discovered no genes associated with Type 2 diabetes mellitus using the most stringent threshold, we did find one gene with the 90th/10th threshold, BHMT, that has been putatively associated with diabetes (β = 0.26; p value 5.9 × 10⁻³) [23]. Under the 95th/5th percentile threshold, we found only seven ET SNPs which mapped to four genes. None of these four genes had evidence of genetic association with any phenotype, to date. Using the 90th/10th percentile threshold, we identified 164 ET SNPs that mapped to 115 genes. Again, none of these genes were associated with any disease having the expected population distribution. Under the 85th/15th percentile threshold, we identified 407 ET SNPs that mapped to 233 genes. Of these genes, two had prior evidence of association with diseases with the appropriate population distributions. Both genes, XG and NLGN4X, are related to autism spectrum disorders that are also appropriately distributed among these populations (Additional file 1: Table S1). The p value of the most significant SNP in XG was 3.79 × 10⁻⁸ in a family based GWAS [24].
In contrast, in NLGN4X the smallest p value was 0.024 in a candidate gene study [25]. We also found a gene, SHROOM2, that is associated with an Alzheimer’s disease endophenotype, but the GWAS describing this association only generated a p value of 0.0003 [26].

Enrichment of ET genes relative to randomly sampled genes

To assess whether ET significantly enriched for genes associating with our index and other appropriately distributed phenotypes, we performed random resampling of the genome to estimate an empirical p value. We then interrogated HuGE Navigator to identify associations and tested whether our randomly sampled genes were associated with either our index phenotypes or other traits similarly distributed among the respective populations. From the CEU, GIH and YRI ET analyses for melanoma, for the most stringent threshold, four out of the 33 identified genes were known to be associated with phenotypes appropriately distributed among the three populations (Table 2). For the next two thresholds, the results were 15 out of 230 genes and 49 out of 971 genes. Our random resampling determined in all three of these cases that ET significantly enriched for genes that associated with appropriately distributed diseases/traits, with the p-value being more significant for the less stringent thresholds (Table 2).

Table 2

Significant Association of ET Genes (CEU, GIH, YRI) with diseases/traits

Percentile threshold	Number of ET SNPs	Number of ET genes	Number of diseases appropriately distributed among ET populations	Number of ET genes associated with appropriately distributed diseases	Permutation P value
95th/5th	22	33	6	4	0.0237
90th/10th	168	230	27	15	0.0111
85th/15th	733	971	42	49	0.0023

Comparison to population branch statistic

An existing method, the Population Branch Statistic (PBS), also integrates evolutionary comparisons among three populations to search for candidate genes and has been successful in identifying genes related to high altitude adaptation [27]. PBS uses two populations with known differences in phenotype prevalence and a third random outlier population to identify genes likely to associate with a predefined trait. Variants that show high allele frequency differentiation in only one of the two pre-selected populations are assumed to be under selection, reflected by high PBS scores. Unlike our method, there is no phenotype prevalence restriction on the outlier population. We compared ET with PBS to determine if adding the phenotypic prevalence in the third population increases our ability to find disease associating genes. To do this, we first chose CEU and YRI as the two pre-selected populations, then ran PBS analysis using all other available HapMap III populations as the outlier. We selected the top 22 SNPs from PBS (based on the most stringent CEU-YRI-GIH comparison from ET) to assess the performance against ET. SNPs with the highest PBS scores were picked for each population and then were mapped to genes within ± 100 Kb (Additional file 9: Table S7). To determine association, we applied the same criteria as described above for our ET analyses. We found that, at most, three genes, SLC12A1, SLC45A2 and AMACR, are associated with phenotypes that are appropriately distributed between European and West African populations, two of which had been observed using ET. In other analyses using MEX, TSI, GIH, ASW, LWK and MKK as outlier populations, only one appropriate gene was identified, SLC12A1. Using CHD and JPT as outlier populations, we identified no genes associated with differentially distributed disease rates. In contrast, ET was able to identify four genes relevant to diseases (Table 2).
It is likely that ET performed better in most cases because we explicitly added a third population with known disease prevalences and this additional information appeared to increase our resolving power.
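For reference, PBS as commonly defined in the literature converts each pairwise FST to a branch length T = −log(1 − FST) and assigns the focal population half of the sum of its two branches minus the branch between the other two populations. The sketch below implements that standard definition; it is not taken from this manuscript, and the function names are ours.

```python
import math


def branch_length(fst):
    """Log-transformed FST, T = -log(1 - FST); requires FST < 1."""
    return -math.log(1.0 - fst)


def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for focal population A.

    fst_ab, fst_ac: FST between A and each of the other two populations.
    fst_bc:         FST between the two non-focal populations.
    """
    return (branch_length(fst_ab) + branch_length(fst_ac)
            - branch_length(fst_bc)) / 2.0
```

The same three pairwise FST values feed both methods: PBS combines them into one continuous score, whereas ET applies a separate percentile cut-off to each comparison and additionally requires the third population to match one of the others in disease prevalence.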

Discussion

ET’s performance is related to effect size

In this manuscript, we addressed whether population differences shaped by evolutionary histories can be used to identify or enrich for genes associated with phenotypes that are differentially distributed. We hypothesized that comparing three populations, using the ET algorithm, can enrich for variants or genes that can be compared to standard association results, helping define those of greater interest with respect to phenotype. This, in essence, is an additional filtering method that may allow us to relax standard association significance thresholds, such that we can identify putative disease associating variants that can be targeted for follow-up, even if they do not reach standard genome-wide association significance. In our analyses, we focused on three different phenotypes that are both differentially distributed and have putatively different genetic architectures. ET’s performance appeared to differ among these phenotypes. For the trait with the simplest genetic architecture, lactase persistence, ET was robust; ET defined the causative genetic region using several population comparisons that matched the distribution of lactase persistence [9, 28]. For likely oligogenic phenotypes such as melanoma, ET performed well. Assuming that the number of GWAS hits in each of these two diseases is correlated with the complexity of the genetic models (one for lactase persistence and 22 for melanoma), we can see that as the effect sizes decrease and the number of associating genes increases, which, presumably, is correlated with increasing complexity of the genetic architecture, ET was less effective in identifying previously known genes [29, 30]. It may also be possible that ET is best suited to detect loci that may not only be associating with diseases that are differentially distributed, but with loci that differentiated due to selection, as is the case with genes for lactase persistence and melanoma. 
Interestingly, standard GWAS found multiple signals for lactase persistence, but after adjusting for ancestry all except LCT, which was identified by ET, became non-significant, demonstrating that ET does not conflate ancestry with association [29]. Regardless of the FST threshold, ET clearly showed an enrichment for melanoma genes compared to the random resampling analysis, emphasizing its utility. It is also interesting that ET did significantly enrich for genes associated with several other diseases that shared the same prevalence pattern and are presumably related to melanoma. Multiple sclerosis (MS), although sharing the same geographic distribution as melanoma, does not appear to be as good a candidate disease for ET analysis. MS appears to be more genetically complex, based on the number and effect sizes of genes already identified as associating with it (Additional file 2: Table S2). Although we did identify some genes related to MS, we were not able to see an enrichment for them. MS also appears to be more affected by environmental variation, and risk can change with migration during early adolescence, thereby serving as a modifier of genetic risk [31]. Another important finding from ET in the melanoma analyses is the identification of G6PD. This is of particular interest as G6PD deficiency is known to protect from malarial infection [20, 32–34], although it was not found in the GWAS for malaria [35]. We do, however, note that the failure to identify G6PD in the original malaria GWAS is likely an artifact of poor coverage in this genic region as well as low frequency of the protective alleles in the studied populations. Nonetheless, this demonstrates the potential of ET not only to refine but, in some cases, to identify important signals missed by standard association that are included in publicly available databases but absent from genotyping arrays. 
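The comparison of ET results against random resampling can be illustrated with a minimal sketch. The resampling procedure is not specified in this section, so the version below is an assumption: it draws random gene sets of the same size as the ET set and asks how often they overlap the known disease genes at least as much as the ET set does. The function name, arguments and gene labels are all hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def resampling_enrichment_p(
    all_genes: List[str],
    known_genes: Set[str],
    et_genes: Sequence[str],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[int, float]:
    """Return (observed overlap, empirical one-sided p value).

    Assumed procedure: random gene sets of the same size as the ET set
    are drawn without replacement from all_genes.
    """
    rng = random.Random(seed)
    et_set = set(et_genes)
    observed = len(et_set & known_genes)
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(all_genes, len(et_set))
        if len(known_genes.intersection(sample)) >= observed:
            hits += 1
    # Add-one correction so the empirical p value is never exactly zero.
    return observed, (hits + 1) / (n_resamples + 1)


# Toy gene universe (hypothetical labels, not real data).
genes = [f"g{i}" for i in range(1000)]
known = set(genes[:20])
observed, p_value = resampling_enrichment_p(genes, known, genes[:10])
print(observed, p_value)
```

With an add-one correction, the smallest attainable p value is 1/(n_resamples + 1), so the number of resamples bounds the resolution of the test.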
For phenotypes that are polygenic, with putative small effect sizes such as fasting glucose or Type 2 diabetes mellitus, ET’s utility appeared to be more limited. Specifically, the role of environment in confounding the ability of ET to identify risk genes may be seen in diabetes, as the prevalence of Type 2 diabetes is increasing, presumably because of the adoption of lifestyles that promote it. Therefore, even if there are strong diabetes genes in a few environments, their action may not be detectable in many modern human societies where the effect of the environment has become preeminent and time has been inadequate to differentiate genes affecting this phenotype. We may, therefore, conclude that as the environment’s recent role in disease increases, the ability of ET to identify key genes will decrease. This should correlate with the length of time for which a population has been exposed to a new environment. This is not a feature unique to ET, as this will be the case even with standard association approaches. Nonetheless, even though ET and standard association methods differ substantially in their underlying assumptions and metrics, they may still be used to support each other [36]. In cases where there is significant locus heterogeneity among cohorts, ET may be extremely useful in helping to identify interesting associations in subsets of data that are often ignored in large GWAS meta-analyses. Therefore, despite the fact that ET did not enrich for known diabetes genes, it may have enriched for previously undetected diabetes genes, and can be used to provide additional evidence for follow-up. As with other approaches, ET is not as useful for detecting variants that are rare in all populations, as association offers low power in detecting differences. However, we emphasize that all current population-based designs suffer from this limitation and are unlikely to be practical for the statistical detection of rare variants that associate with disease. Family-based (linkage) approaches to identifying rare disease-associated genes are still the preferred method in these cases.

ET as an agnostic filter

GWAS have been used to examine almost every common complex disease, and although many SNPs have been dismissed because they do not meet standard multiple testing correction thresholds, it is probable that many of them truly associate with the disease. The detection of these SNPs represents a major challenge in our developing understanding of disease etiology because the current methods produce inflated Type II error rates. Thus, although GWAS have been able to discover significant hits in certain diseases, their success is limited in most situations, which probably reflects the limitations of using only p values for biological data mining [37] and makes filtering association results based solely on p value problematic. Therefore, it is important to develop methods that can minimize this type of error while still controlling for the Type I error rate. One way to do this is to integrate knowledge from independent analyses [36]. Integration of several data types has been successful in the analysis of diseases, such as MS [38]. However, pathway-oriented filtering, such as for MS in Baranzini et al., requires prior knowledge of biological functions, which are still poorly understood in many cases. We present an alternative filtering metric based not on prior biological knowledge but only on the evolutionary history that shaped biology. ET also has an advantage over some standard association analyses because, if a variant that increases risk is fixed in a population that has a higher prevalence (i.e., is an etiological agent), it would not be detected in association analyses; however, ET may identify it. A key to the success of ET is having at least three populations, since comparing only two would produce an enormous number of false positives, while adding a third population can reduce this number. ET requires only knowledge of allele frequencies and disease distributions; the latter may be a limiting factor for many diseases, although this will change with better surveillance. 
However, limited surveillance is not the only factor that may affect the utility of ET. For example, variable diagnoses across countries may present an issue, as will changing definitions of disease phenotypes over time. The latter will make comparisons over time difficult. The issue of ethnic variation in disease prevalence may, however, be partially ameliorated in countries that have highly diverse populations, such as the US, where data on many continental populations can be collected simultaneously and used to define appropriate ET populations. This is still not as robust as would be ideal, because many populations in such situations are highly admixed. As ET detects differences in local variation, the method should still be powerful for the detection of genes that affect disease risk in all populations, i.e. universal genetic risk factors. However, in epistatic models of disease it may not be possible to detect all key genes when local ancestry varies elsewhere in the genome [39].
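The three-population logic described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the study: the FST estimator (a simple equal-weight Wright's FST for a biallelic SNP), the function names, the thresholds and the allele frequencies are all hypothetical. Populations A and B are taken to have similar disease prevalence, and population C a markedly different one.

```python
from typing import Dict, List, Tuple


def pairwise_fst(p1: float, p2: float) -> float:
    """Equal-weight Wright's FST for one biallelic SNP (assumed estimator)."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic across both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def et_candidates(
    freqs: Dict[str, Tuple[float, float, float]],
    low: float,
    high: float,
) -> List[str]:
    """Keep SNPs where A and B are similar and both differ from C."""
    kept = []
    for snp, (p_a, p_b, p_c) in freqs.items():
        similar = pairwise_fst(p_a, p_b) <= low
        differs = pairwise_fst(p_a, p_c) >= high and pairwise_fst(p_b, p_c) >= high
        if similar and differs:
            kept.append(snp)
    return kept


# Toy allele frequencies in (A, B, C); values are made up for illustration.
toy = {
    "snp1": (0.05, 0.08, 0.80),  # A ~ B, both far from C: retained
    "snp2": (0.30, 0.35, 0.40),  # little differentiation anywhere
    "snp3": (0.10, 0.70, 0.75),  # A differs from B: pattern does not match
}
print(et_candidates(toy, low=0.05, high=0.30))  # ['snp1']
```

The third case shows why a two-population comparison alone would retain more SNPs: snp3 is highly differentiated between A and C, but it is excluded because A and B, which share a prevalence pattern, are not similar at that locus.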

Combining FST with other metrics

In this study, we used only FST as a metric of population differentiation. FST is computationally straightforward and thus makes our model easily applied in any population comparisons. Since genetic differentiation could be due to both selection and random drift, and FST does not explicitly differentiate between these two processes, extensions of ET that include measures of selection may be useful for future analyses. An existing statistic that can be used to provide evidence for selection is iHS [40] (http://haplotter.uchicago.edu/). We investigated iHS patterns in the genomic regions identified in our study using the most stringent FST thresholds among CEU, YRI and GIH. In most regions the iHS scores were larger in CEU than in YRI and ASN (the East Asian HapMap population), indicating stronger selection signals in the CEU population (Additional file 10: Table S8). Clearly, integrating the role of selection into the ET analyses in the future will be an important extension of the method. Selection may be an important factor in shaping the distribution of disease risk alleles; however, it is not always the mechanism through which important differentiation, which affects phenotypic variation, takes place. One strong example of this is the LCT locus, which has been shown by multiple metrics to be under selection in multiple populations and was identified by ET in our study [9, 41]. Another example of a selected locus affecting disease risk that was identified by ET is APOL1. This gene has been shown to be under selection in African populations, due to protection against human African trypanosomiasis [42]. Different metrics may be able to pick up different signals indicating selection (e.g., Cross Population Extended Haplotype Homozygosity (XP-EHH) [43], Tajima’s D [44], Fay and Wu’s H [45]). 
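To illustrate why FST is computationally straightforward, the sketch below computes it for a single biallelic SNP from two population allele frequencies. The estimator used in the study is not stated in this section; the equal-weight Wright's formulation here, FST = (H_T - H_S) / H_T, is an assumption chosen for simplicity, and the function name is hypothetical.

```python
def wright_fst(p1: float, p2: float) -> float:
    """FST = (H_T - H_S) / H_T for one biallelic SNP, two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity in the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # no variation at this SNP in either population
    # Mean expected heterozygosity within populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


for p1, p2 in [(0.5, 0.5), (0.1, 0.9), (0.0, 1.0)]:
    print(p1, p2, round(wright_fst(p1, p2), 3))
```

Because the calculation depends only on allele frequencies, it cannot by itself distinguish differentiation produced by selection from differentiation produced by drift, which is the motivation for combining it with haplotype-based statistics such as iHS.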
In contrast to the phenotypes with comparatively simple genetic architecture in which selection appears to be important in shaping population structure, the role of selection in Type 2 diabetes mellitus is less clear. Recent studies of Type 2 diabetes mellitus genes have indicated that they are not under selective pressure among multiple human populations [46, 47]. However, another paper studying the distribution of Type 2 diabetes mellitus risk alleles indicates that random drift alone has not shaped diabetes’ genetic risk differences from Africa to East Asia [48], making it difficult to draw strong conclusions at this point. Understanding the role of non-random processes in shaping disease risk will be an important area of research, especially as it affects disease disparity among continental populations in changing environments.

Possible pleiotropy in ET genes

In our analyses of ET genes and their associated diseases, it became obvious that several of the genes identified are risk factors for multiple diseases. Among the genes from the least stringent ET threshold among CEU, YRI and GIH, we discovered that several genes were associated with multiple diseases and traits (Table 1; Additional file 8: Table S6). For example, IL6 has been associated with 15 phenotypes that are appropriately distributed and IFNG with seven. Both genes are related to immune response, among other processes. It is possible that selection has been important in changing allele frequency distributions, while at the same time having multiple pleiotropic effects that may or may not have been directly driven by selection. Another example is the genes related to melanoma, as they are also associated with skin color, eye color and albinism (SLC45A2 [49, 50]), preterm birth (DHCR7 [51]), and Vitamin D levels (NADSYN1 [21]). Interestingly, an ET gene from the most stringent threshold, BNC2, has also been associated with hair and skin color [22, 52, 53]. In this specific case, it seems that skin color is a confounder of other genetic effects. As humans migrated to higher latitudes with insufficient UV radiation, lighter skin color facilitated more vitamin D synthesis in the skin. However, this increased the risk of melanoma. Although it has fewer than five publications indicating associations, BNC2 has also been associated with ovarian cancer in two GWAS, both reaching genome-wide significance [54, 55]. We do note that our calculations of the number of associating genes, with appropriately distributed phenotypes, were conservative; each gene was counted only once, even if it associated with multiple pleiotropic phenotypes.

Comparisons to other related methods

Models developed in different studies employ similar evolutionary mindsets to what we have discussed. PBS does not require a specific phenotype prevalence pattern on the third population in the three population comparison, which may make it less powerful than ET in identifying disease associating genes. On the other hand, Hancock et al. [56] reported that correlations between allele frequencies and climate variables can be applied to detect SNPs associating with pigmentation and autoimmune diseases. In this case, they were looking for local selection due to environmental factors chosen a priori, whereas ET is not limited to specific factors that give rise to the disease prevalence distribution pattern. Another method was developed to identify loci covarying with the African Pygmy phenotype based on how much additional information is provided by phenotype to infer the geographic origin compared to information provided without genotypes [57]. However, this method requires the phenotype, which is height, to be known for every individual. In our method, we only need to compare disease risk at the population level.
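For comparison, the population branch statistic (PBS) mentioned above can be computed from the same three pairwise FST values that ET uses. The sketch below follows the commonly cited definition, in which each FST is log-transformed to a divergence time T = -ln(1 - FST) and the branch length for the focal population A is (T_AB + T_AC - T_BC) / 2. The function name and input values are hypothetical.

```python
import math


def pbs(fst_ab: float, fst_ac: float, fst_bc: float) -> float:
    """Branch length for focal population A from three pairwise FST values."""
    for fst in (fst_ab, fst_ac, fst_bc):
        if not 0.0 <= fst < 1.0:
            raise ValueError("each FST must lie in [0, 1)")
    # Log transform makes the divergence measure approximately additive.
    t_ab = -math.log(1.0 - fst_ab)
    t_ac = -math.log(1.0 - fst_ac)
    t_bc = -math.log(1.0 - fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0


# Toy values: A is differentiated from both B and C, which are identical.
print(round(pbs(0.5, 0.5, 0.0), 4))
```

PBS scores change along one population's branch without reference to phenotype, whereas ET additionally requires that the populations be chosen to match a disease prevalence pattern.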

Conclusions

Evolutionary thinking can provide important insights for biomedical research; when combined with current approaches commonly employed in human disease studies, it can increase our ability to find key genes or pathways that affect etiology. By taking advantage of both epidemiological differences and population structure, we demonstrated that many genes associating with diseases can be found. This paper presents a proof of principle for this approach, as well as some of its limitations. Clearly, for phenotypes with simple genetic architecture, ET is an extremely powerful approach, but this becomes less practical for traits of increasingly complex architecture. Nonetheless, for several traits, we were able to identify genes with known effects. It is also possible that many more of the ET genes are truly associating, but have not been reported as such, as they do not meet current thresholds for significance. Therefore, we propose that ET can be a useful filter with which to interrogate existing and new association studies for consistent patterns that might lead to the identification of additional genetic risk factors.

66 in total

1. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

2. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

3. The lactase persistence/non-persistence polymorphism is controlled by a cis-acting element.

Authors: Y Wang; C B Harvey; W S Pratt; V R Sams; M Sarner; M Rossi; S Auricchio; D M Swallow
Journal: Hum Mol Genet Date: 1995-04 Impact factor: 6.150

4. Common risk alleles for inflammatory diseases are targets of recent positive selection.

Authors: Towfique Raj; Manik Kuchroo; Joseph M Replogle; Soumya Raychaudhuri; Barbara E Stranger; Philip L De Jager
Journal: Am J Hum Genet Date: 2013-03-21 Impact factor: 11.025

5. European genetic variants associated with type 2 diabetes in North African Arabs.

Authors: S Cauchi; I Ezzidi; Y El Achhab; N Mtiraoui; L Chaieb; D Salah; C Nejjari; Y Labrune; L Yengo; D Beury; M Vaxillaire; T Mahjoub; M Chikri; P Froguel
Journal: Diabetes Metab Date: 2012-03-29 Impact factor: 6.041

6. Adaptations to climate-mediated selective pressures in humans.

Authors: Angela M Hancock; David B Witonsky; Gorka Alkorta-Aranburu; Cynthia M Beall; Amha Gebremedhin; Rem Sukernik; Gerd Utermann; Jonathan K Pritchard; Graham Coop; Anna Di Rienzo
Journal: PLoS Genet Date: 2011-04-21 Impact factor: 5.917

7. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.

Authors: Stephen Sawcer; Garrett Hellenthal; Matti Pirinen; Chris C A Spencer; Nikolaos A Patsopoulos; Loukas Moutsianas; Alexander Dilthey; Zhan Su; Colin Freeman; Sarah E Hunt; Sarah Edkins; Emma Gray; David R Booth; Simon C Potter; An Goris; Gavin Band; Annette Bang Oturai; Amy Strange; Janna Saarela; Céline Bellenguez; Bertrand Fontaine; Matthew Gillman; Bernhard Hemmer; Rhian Gwilliam; Frauke Zipp; Alagurevathi Jayakumar; Roland Martin; Stephen Leslie; Stanley Hawkins; Eleni Giannoulatou; Sandra D'alfonso; Hannah Blackburn; Filippo Martinelli Boneschi; Jennifer Liddle; Hanne F Harbo; Marc L Perez; Anne Spurkland; Matthew J Waller; Marcin P Mycko; Michelle Ricketts; Manuel Comabella; Naomi Hammond; Ingrid Kockum; Owen T McCann; Maria Ban; Pamela Whittaker; Anu Kemppinen; Paul Weston; Clive Hawkins; Sara Widaa; John Zajicek; Serge Dronov; Neil Robertson; Suzannah J Bumpstead; Lisa F Barcellos; Rathi Ravindrarajah; Roby Abraham; Lars Alfredsson; Kristin Ardlie; Cristin Aubin; Amie Baker; Katharine Baker; Sergio E Baranzini; Laura Bergamaschi; Roberto Bergamaschi; Allan Bernstein; Achim Berthele; Mike Boggild; Jonathan P Bradfield; David Brassat; Simon A Broadley; Dorothea Buck; Helmut Butzkueven; Ruggero Capra; William M Carroll; Paola Cavalla; Elisabeth G Celius; Sabine Cepok; Rosetta Chiavacci; Françoise Clerget-Darpoux; Katleen Clysters; Giancarlo Comi; Mark Cossburn; Isabelle Cournu-Rebeix; Mathew B Cox; Wendy Cozen; Bruce A C Cree; Anne H Cross; Daniele Cusi; Mark J Daly; Emma Davis; Paul I W de Bakker; Marc Debouverie; Marie Beatrice D'hooghe; Katherine Dixon; Rita Dobosi; Bénédicte Dubois; David Ellinghaus; Irina Elovaara; Federica Esposito; Claire Fontenille; Simon Foote; Andre Franke; Daniela Galimberti; Angelo Ghezzi; Joseph Glessner; Refujia Gomez; Olivier Gout; Colin Graham; Struan F A Grant; Franca Rosa Guerini; Hakon Hakonarson; Per Hall; Anders Hamsten; Hans-Peter Hartung; Rob N Heard; Simon Heath; Jeremy Hobart; Muna Hoshi; Carmen Infante-Duarte; 
Gillian Ingram; Wendy Ingram; Talat Islam; Maja Jagodic; Michael Kabesch; Allan G Kermode; Trevor J Kilpatrick; Cecilia Kim; Norman Klopp; Keijo Koivisto; Malin Larsson; Mark Lathrop; Jeannette S Lechner-Scott; Maurizio A Leone; Virpi Leppä; Ulrika Liljedahl; Izaura Lima Bomfim; Robin R Lincoln; Jenny Link; Jianjun Liu; Aslaug R Lorentzen; Sara Lupoli; Fabio Macciardi; Thomas Mack; Mark Marriott; Vittorio Martinelli; Deborah Mason; Jacob L McCauley; Frank Mentch; Inger-Lise Mero; Tania Mihalova; Xavier Montalban; John Mottershead; Kjell-Morten Myhr; Paola Naldi; William Ollier; Alison Page; Aarno Palotie; Jean Pelletier; Laura Piccio; Trevor Pickersgill; Fredrik Piehl; Susan Pobywajlo; Hong L Quach; Patricia P Ramsay; Mauri Reunanen; Richard Reynolds; John D Rioux; Mariaemma Rodegher; Sabine Roesner; Justin P Rubio; Ina-Maria Rückert; Marco Salvetti; Erika Salvi; Adam Santaniello; Catherine A Schaefer; Stefan Schreiber; Christian Schulze; Rodney J Scott; Finn Sellebjerg; Krzysztof W Selmaj; David Sexton; Ling Shen; Brigid Simms-Acuna; Sheila Skidmore; Patrick M A Sleiman; Cathrine Smestad; Per Soelberg Sørensen; Helle Bach Søndergaard; Jim Stankovich; Richard C Strange; Anna-Maija Sulonen; Emilie Sundqvist; Ann-Christine Syvänen; Francesca Taddeo; Bruce Taylor; Jenefer M Blackwell; Pentti Tienari; Elvira Bramon; Ayman Tourbah; Matthew A Brown; Ewa Tronczynska; Juan P Casas; Niall Tubridy; Aiden Corvin; Jane Vickery; Janusz Jankowski; Pablo Villoslada; Hugh S Markus; Kai Wang; Christopher G Mathew; James Wason; Colin N A Palmer; H-Erich Wichmann; Robert Plomin; Ernest Willoughby; Anna Rautanen; Juliane Winkelmann; Michael Wittig; Richard C Trembath; Jacqueline Yaouanq; Ananth C Viswanathan; Haitao Zhang; Nicholas W Wood; Rebecca Zuvich; Panos Deloukas; Cordelia Langford; Audrey Duncanson; Jorge R Oksenberg; Margaret A Pericak-Vance; Jonathan L Haines; Tomas Olsson; Jan Hillert; Adrian J Ivinson; Philip L De Jager; Leena Peltonen; Graeme J Stewart; David A Hafler; 
Stephen L Hauser; Gil McVean; Peter Donnelly; Alastair Compston
Journal: Nature Date: 2011-08-10 Impact factor: 49.962

8. Diverse convergent evidence in the genetic analysis of complex disease: coordinating omic, informatic, and experimental evidence to better identify and validate risk factors.

Authors: Sarah A Pendergrass; Marquitta J White; Nuri Kodaman; Timothy H Ciesielski; Rafal S Sobota; Minjun Huang; Jacquelaine Bartlett; Jing Li; Qinxin Pan; Jiang Gui; Scott B Selleck; Christopher I Amos; Marylyn D Ritchie; Jason H Moore; Scott M Williams
Journal: BioData Min Date: 2014-06-30 Impact factor: 2.522

9. Genetic associations of type 2 diabetes with islet amyloid polypeptide processing and degrading pathways in asian populations.

Authors: Vincent Kwok Lim Lam; Ronald Ching Wan Ma; Heung Man Lee; Cheng Hu; Kyong Soo Park; Hiroto Furuta; Ying Wang; Claudia Ha Ting Tam; Xueling Sim; Daniel Peng-Keat Ng; Jianjun Liu; Tien-Yin Wong; E Shyong Tai; Andrew P Morris; Nelson Leung Sang Tang; Jean Woo; Ping Chung Leung; Alice Pik Shan Kong; Risa Ozaki; Wei Ping Jia; Hong Kyu Lee; Kishio Nanjo; Gang Xu; Maggie Chor Yin Ng; Wing-Yee So; Juliana Chung Ngor Chan
Journal: PLoS One Date: 2013-06-11 Impact factor: 3.240

10. Genome-wide and fine-resolution association analysis of malaria in West Africa.

Authors: Muminatou Jallow; Yik Ying Teo; Kerrin S Small; Kirk A Rockett; Panos Deloukas; Taane G Clark; Katja Kivinen; Kalifa A Bojang; David J Conway; Margaret Pinder; Giorgio Sirugo; Fatou Sisay-Joof; Stanley Usen; Sarah Auburn; Suzannah J Bumpstead; Susana Campino; Alison Coffey; Andrew Dunham; Andrew E Fry; Angela Green; Rhian Gwilliam; Sarah E Hunt; Michael Inouye; Anna E Jeffreys; Alieu Mendy; Aarno Palotie; Simon Potter; Jiannis Ragoussis; Jane Rogers; Kate Rowlands; Elilan Somaskantharajah; Pamela Whittaker; Claire Widden; Peter Donnelly; Bryan Howie; Jonathan Marchini; Andrew Morris; Miguel SanJoaquin; Eric Akum Achidi; Tsiri Agbenyega; Angela Allen; Olukemi Amodu; Patrick Corran; Abdoulaye Djimde; Amagana Dolo; Ogobara K Doumbo; Chris Drakeley; Sarah Dunstan; Jennifer Evans; Jeremy Farrar; Deepika Fernando; Tran Tinh Hien; Rolf D Horstmann; Muntaser Ibrahim; Nadira Karunaweera; Gilbert Kokwaro; Kwadwo A Koram; Martha Lemnge; Julie Makani; Kevin Marsh; Pascal Michon; David Modiano; Malcolm E Molyneux; Ivo Mueller; Michael Parker; Norbert Peshu; Christopher V Plowe; Odile Puijalon; John Reeder; Hugh Reyburn; Eleanor M Riley; Anavaj Sakuntabhai; Pratap Singhasivanon; Sodiomon Sirima; Adama Tall; Terrie E Taylor; Mahamadou Thera; Marita Troye-Blomberg; Thomas N Williams; Michael Wilson; Dominic P Kwiatkowski
Journal: Nat Genet Date: 2009-05-24 Impact factor: 38.330
