Literature DB >> 28018425

The Decay of Disease Association with Declining Linkage Disequilibrium: A Fine Mapping Theorem.

Mehdi Maadooliat¹, Naveen K Bansal², Jiblal Upadhya², Manzur R Farazi², Xiang Li³, Max M He⁴, Scott J Hebbring⁵, Zhan Ye³, Steven J Schrodi⁵.

Abstract

Several important and fundamental aspects of disease genetics models have yet to be described. One such property is the relationship of disease association statistics at a marker site closely linked to a disease causing site. A complete description of this two-locus system is of particular importance to experimental efforts to fine map association signals for complex diseases. Here, we present a simple relationship between disease association statistics and the decline of linkage disequilibrium from a causal site. Specifically, the ratio of Chi-square disease association statistics at a marker site and causal site is equivalent to the standard measure of pairwise linkage disequilibrium, r2. A complete derivation of this relationship from a general disease model is shown. Quite interestingly, this relationship holds across all modes of inheritance. Extensive Monte Carlo simulations using a disease genetics model applied to chromosomes subjected to a standard model of recombination are employed to better understand the variation around this fine mapping theorem due to sampling effects. We also use this relationship to provide a framework for estimating properties of a non-interrogated causal site using data at closely linked markers. Lastly, we apply this way of examining association data from high-density genotyping in a large, publicly-available data set investigating extreme BMI. We anticipate that understanding the patterns of disease association decay with declining linkage disequilibrium from a causal site will enable more powerful fine mapping methods and provide new avenues for identifying causal sites/genes from fine-mapping studies.

Entities: Chemical Disease Gene Mutation Species

Keywords: disease association; disease genetics; fine-mapping; linkage disequilibrium; mode of inheritance; statistical genetics/genomics; theoretical genetics; two-site model

Year: 2016 PMID： 28018425 PMCID： PMC5149547 DOI： 10.3389/fgene.2016.00217

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

Introduction

Genetic markers closely linked to disease-causing sites will exhibit association with disease through linkage disequilibrium (Lai et al., 1994; Weiss and Clark, 2002; Morton, 2005; Slatkin, 2008). This is the central idea behind population-based association mapping of disease genes using high density SNP arrays (McVean et al., 2005; Balding, 2006). However, the decay of disease association with declining linkage disequilibrium from a disease-predisposing, functional site has not yet been completely described even though this is a fundamental property of disease genetics. Doing so will provide much needed information concerning the properties of disease genetics and greatly aid experimental designs and statistical methods for identifying functional variants in regions that exhibit disease association. Although many have argued that genome-wide association studies have been largely unsuccessful in that they have not revealed a large proportion of the heritability from most complex diseases (Latham, 2011), it is certainly clear that numerous loci with impressive statistical evidence for correlation with a wide variety of complex diseases have been identified and replicated (Welter et al., 2014). In a number of instances, these results have provided much needed insight into the biochemical pathways and cellular mechanisms responsible for increasing disease risk (Klein et al., 2005; Cargill et al., 2007; Xavier et al., 2008; Visscher et al., 2012). However, the functional variants underlying the majority of these disease-associated regions have yet to be identified and described (McClellan and King, 2010). The dearth of information concerning functional variants obviously presents a sizable impediment to further dissection of complex disease etiologies and subsequent utility in impacting clinical practice. If genetic and statistical methods can aid in generating either supporting or opposing evidence for the role of functional motifs within a region of disease association, then the progression of human genetics studies can be made much more efficient and potent. When designing fine mapping genotyping experiments, it is important to select genetic variants and subregions such that the design is well-powered to discover functional variants under two important types of disease models: The first class of model that should be covered by such efforts encompasses models of a causal variant driving a portion, or perhaps all of the disease association within a region. Under this model, varying levels of association signal at different sites are explained by different levels of linkage disequilibrium with causal variants. Hence, given allele frequencies and linkage disequilibrium patterns, one can, in principle, back-calculate the properties of putative functional variants that could be driving an initially observed disease association within the region of interest. Known variants, including those that were not initially interrogated, fulfilling these calculated allele frequency and linkage disequilibrium properties with the initial markers should then be included in a fine-mapping panel. The second model that should be covered by a fine-mapping panel of markers is one of allelic heterogeneity at a functional motif (e.g., a gene) that was originally found to exhibit a disease association signal. Empirical data tends to strongly favor this type of model over an individual variant serving as the sole driving allele within a region (Raychaudhuri et al., 2011; Rivas et al., 2011; Nelson et al., 2012; Kim-Howard et al., 2013; Seddon et al., 2013). Indeed, it is quite typical for studies aiming to fine map regions harboring a GWAS-significant SNP to reveal multiple disease-correlated variants within the same gene. This is not terribly surprising as the site frequency spectrum is expected to contain vast numbers of rare variants in outbred populations, which is accentuated in rapidly expanding demographics (Wright, 1931; Coventry et al., 2010; Keinan and Clark, 2012). Even if there is a small likelihood of any one of these rare variants to exhibit pathogenic effects, the sheer number of variants segregating at a gene trends to produce multiple functional alleles in a sizable population. To cover this class of disease models, one would want to reliably identify the functional motifs tagged by an initial association signal and proceed by exhaustively interrogating variants within those functional motifs. Ultimately, in vitro or in vivo functional studies will serve to confirm that specific, very rare variants have pathogenic effects. In practice, this two-model approach guiding fine mapping was successfully employed to identify alleles segregating at the TRAF1-C5 region conferring susceptibility to rheumatoid arthritis (Schrodi et al., 2007a; Chang et al., 2008) and to fine map the IL23R region in psoriasis (Garcia et al., 2008). Here, building upon previous work (Kruglyak, 1999; Pritchard and Przeworski, 2001; Zaykin et al., 2006; Schrodi et al., 2007b, 2009), we prove a simple, analytic relationship between case/control association statistics at two closely-linked sites and the linkage disequilibrium between the two sites under a generalized disease genetics model. The result holds treating the parameters as being fixed. Interestingly, the result is invariant with mode of inheritance parameters. Further, we posit that concurrently considering the patterns of disease-association and the genetic architecture within a region of interest may strengthen the ability to assess the likelihood that a particular variant is indeed causal with regard to inflating the risk of disease. By doing so, one may be better able to prioritize variants for functional follow-up studies. For finite sample sizes, dispersion around this relationship is expected if the parameters are replaced with random variables and we therefore explore this variation in the result through the use of a Monte Carlo simulation. Lastly, we investigate these patterns in experimental data around the FTO locus in a large GWAS of extreme BMI.

Results

Approximation

Several groups have described the relationship of statistical power at a marker site in linkage disequilibrium with a causal site. In 1999, using the coalescent process to investigate the density of markers necessary for adequate coverage across the genome to detect disease-associated regions, Kruglyak presented the outline of an argument that the sample size necessary to detect association at a marker locus in linkage disequilibrium with a causal site is approximately S/d2, where S is the number of samples required to detect disease association at the causal site with a given level of power and d2 = [q(1 − q)p−1(1 − p)−1]r2, such that r2 is the standard measure of linkage disequilibrium between the causal site and the marker site and q and p are the allele frequencies at the marker and causal sites, respectively (Kruglyak, 1999). Later, Pritchard and Pzreworski performed a derivation showing a similar result, also with regard to power (Pritchard and Przeworski, 2001). Under the Pritchard- Przeworski derivation, the power to detect disease association at a causal site and marker site were found to be approximately the same if the sample size at a marker site is increased by a factor of (r2)−1 over that used in interrogating the causal site. While certainly an intriguing relationship between sample sizes, as it is, the finding may not always have utility in fine mapping applications as most association studies use the same number samples at all sites interrogated. That said, this relationship can be used to motivate related and illuminating properties regarding how fast the disease association signal can be expected to decay as a function of declining linkage disequilibrium from a causal site. Equating the power at the disease-predisposing site to that at the marker site, it follows that, where Z and Z are the normally-distributed Z-scores for testing disease-association at the causal site and marker site, respectively; and α is the significance level. Taking the inverse functions and squaring yields the provocative approximation, where and are the Chi-Square statistics for disease association at the disease and marker sites, respectively. An interesting parallel was described by Luo, Thompson, and Wooliams in the context of marker-assisted selection of quantitative traits where the authors showed that the proportion of the additive variance of a trait due to a marker in linkage disequilibrium with a causal quantitative trait locus, , is equal to r2 (Luo et al., 1997). Plotting the Equation (2) approximation with the χ2 disease-association statistic on the ordinate and 1 − r2 on the abscissa is a simple method of displaying the expected linear decay in the χ2 -values as the linkage disequilibrium with a causal site declines at different marker sites. Figure 1 shows this relationship. This decay pattern was first used empirically in 2007–2008 to fine map the TRAF1 region in rheumatoid arthritis (Schrodi et al., 2007a) and the IL23R region in psoriasis (Garcia et al., 2008) and has, in an analogous form, subsequently been used in other applications (Farh et al., 2015). Although this approximation is very useful in understanding the decay of disease association with declining linkage disequilibrium from a causal site, several simplifying assumptions were made in the original Pritchard-Przeworski derivation. While the impact of these assumptions have been explored to some extent in previous work (Hu et al., 2004), it is not known how violations of the original assumptions might produce departures from Equation (2) nor what the effect of sampling haplotypes does to the relationship. Hence, an exact relationship between disease association statistics and r2 -values with a causal site would aid in clarifying this relationship and motivate statistical approaches to harnessing this pattern for the purpose of fine-mapping functional alleles. Further, Monte Carlo simulations can be used to explore the how treating haplotype counts as random variables generates stochastic variation around this central relationship.

Figure 1

The expected decay of disease association with declining linkage disequilibrium for four modes of inheritance. The standard recursive haplotype frequencies under recombination were used to generate a series of haplotype combinations. The disease-predisposing allele at the causal site was set at a general population frequency of 0.01. The penetrance f22 was set to 0.001 and the remaining two penetrances varied according to the modes of inheritance examined and the relative risks (R) cited in the Figures. Sample sizes were set at nD = 2000 and nC = 2000. (A) displays the results for an additive model, such that f12 is the arithmetic mean of f22 and f11. (B) shows the results under a multiplicative model. (C) shows the results under a general recessive model. (D) shows the results under a general dominant model.

Full derivation

In this section, we will show the algebraic relationship between the Chi-Square-test statistics at a causal site and marker site, without any assumptions regarding the probabilistic properties (or whether they are fixed parameters) of the allele frequencies or haplotype frequencies of which the statistics are composed. Note that in the Monte Carlo Simulations Section we will treat the haplotype counts as random variables; and hence the Chi-Squared statistics and r2 will each carry stochastic properties and we investigate these properties in that section. Defining the Chi-Square-test statistics for a disease-causing site () and a marker closely linked to the disease site () following the Pritchard-Przeworski derivation, where a two-site model is considered (site A segregating alleles A1 and A2, and site B segregating alleles B1 and B2), p, p, and p are the frequencies of the A1 allele in the combined population, disease-affected population, and the control population, respectively, and where q, q, and q are the frequencies of the B1 allele in the combined population, disease-affected population, and the control population, respectively. n and n are the sample sizes for diploid cases and controls, respectively, and n = n + n. For this work, haplotype and allele probabilities conditional on disease status (i.e., within cases or within controls) are derived. For the haplotype and allele probabilities in the general population, we weighted the disease status conditional probabilities by the probability of disease or healthy control attributable to the causal site, in accordance with the law of total probability. Note, that the form of these Chi-Square statistics in Equations (3) and (4) is twice the value of traditionally-defined Chi-Square statistic. However, this scalar inflation factor cancels out in the subsequent derivation. Noting that where K is the P(Case) attributable to the causal site, we can substitute and into Equation (5), resulting in This treatment of the allele frequencies using the law of total probability holds for all populations in which each individual is either a case or control (e.g., cohort studies or case/control study designs). The next aim in the derivation is to substitute quantities for the allele frequencies in the affected population at both sites in terms of penetrances, disease prevalence, and general population allele frequencies. The allele frequencies at both the causal and marker sites have been previously described for two-locus systems under general disease models (Schrodi et al., 2007b): where f11, f12, and f22 are the prevalences of the A1A1, A1A2, and A2A2 genotypes, respectively, such that f = P(Case|AA); which, under this monogenic model and assuming Hardy-Weinberg Equilibrium in the general population and using the law of total probability we can express the disease prevalence as, ; and haplotype frequencies P11 = P(A1B1), and P21 = P(A2B1). Applied to complex diseases, it may be useful to think of this disease model as the subset of individuals with a common disease that is primarily driven by a particular locus. With the substitution into Equation (6), In Equation (9), the R.H.S. numerator can be simplified to Noting that P21 = q − P11 and substituting , the numerator becomes whereas, the denominator in Equation (9) can be simplified to Hence, Equation (9) can be written as where D = P11P22 − P12P21 = P11 − pq. Again substituting , Therefore, we have shown the exact relationship under our model, Not only is this relationship an exact result under the model employed, but it is universal in that there is no dependence on the penetrances. Thus, we may expect that from a true disease-susceptibility site, that there should be a linear decay in the Chi-square statistics for disease association with declining r2 -values with the causal site. Figure 1 shows the expected disease association decay with declining linkage disequilibrium from the causal site for additive, multiplicative, recessive, and dominant sets of models. The patterns arising from various relative risks are presented. Similarly, Figure 2 presents the patterns expected as a function of sample sizes. Aside from Equation (13) illuminating a central aspect of disease genetics, we suspect that it carries utility in fine mapping applications—we hypothesize that identifying this type of pattern in fine mapping data will better enable the pinpointing of truly causal sites through harnessing correlated data.

Figure 2

Effect of sample size on the expected decay of disease association with declining linkage disequilibrium. This figure shows how the fine mapping theorem behaves under different sample sizes. The case/control sample sizes in the two-site model are varied from 500 to 10,000.

Corollary

Consider the situation where there is a disease-susceptibility site and other sites in differing levels of linkage disequilibrium with the disease-susceptibility site. From large-scale genotyping or sequencing studies, we often know the matrix of pairwise r2 -values, and allele frequencies at each site in the general population, broadly defined. An interesting question arises: If one has genotyped a marker site in a case/control sample set and calculated testing for disease association, can we infer the expected effect size at a non-interrogated causal site? Using Equation (13), and substituting allele frequencies at the causal site, where , the effective total number of independent diploid samples. Defining a traditional allelic odds ratio, R, calculated at the causal site as the allele frequency in the cases can be solved: . Therefore, To simplify the derivation, we will assume that the disease studied is not very common such that the allele frequency in controls is well-approximated by the allele frequency in the general population, p ≅ p. This is also true if samples drawn from the general population are serving as the controls. Hence, Solving for R, To illustrate the use and implications of Equation (17), suppose that we have genotyped a site in 500 diploid cases and 500 diploid controls and calculated the test statistic χ2 = 20, corresponding to p = 1.57E-03 (recall that half the Pritchard-Przeworski statistic is Chi-Square distributed with one degree of freedom). Further assume that this region has previously been subjected to next-generation sequencing in individuals derived from the same source population as the cases and controls which has yielded the discovery of numerous additional variants closely linked to the genotyped site, allele frequencies at those variants, and an array of pairwise linkage disequilibrium values across the region of interest. Under that scenario, one would typically have access to good estimates of the general population allele frequencies and r2 -values at sites neighboring the genotyped site that produced the original finding. Suppose that one of these adjacent sites has a general population allele frequency p = 0.03 and a linkage disequilibrium value with the genotyped site of r2 = 0.2. Under the two-site model, we would therefore estimate the odds ratio at the putative, non-genotyped, causal site to be 5.17. Put another way, the putative causal site, having the general population allele frequency and linkage disequilibrium values above, would have to have an odds ratio of 5.17 in order to generate twice a standard Chi-Square statistic value at the genotyped site of 20 given 500 cases and 500 controls. Indirect inference of the properties of non-interrogated causal sites can be helpful in subsequent experimental efforts to identify disease-predisposing sites in a fine-mapped region. Figure 3 displays the relationship between the inferred odds ratio at the causal site from disease association data at the marker site as a function of linkage disequilibrium between the two sites. Graphs for various p-values at marker site are shown. Additional work under a stochastic model would enable the calculation of the posterior probabilities of properties of non-interrogated causal sites given genetic data at linked markers.

Figure 3

Inferred odds ratio. The relationship between the inferred odds ratio at a causal site and the level of linkage disequilibrium with an interrogated marker is presented in (A,B). Equation (17) is used for the calculations. The seven curves show the patterns of expected odds ratios for disease association at the causal site under different observed p-values calculated at the marker site. Sample size was set at ne = 5000. (A) shows results assuming that the disease-predisposing allele at the causal site has frequency of 0.10 in the general population, whereas (B) sets that frequency at 0.01. The results detailed in Equations (1–17) do not treat any of the parameters, such as haplotype frequencies, as random variables. Clearly, haplotype counts in cases and controls should be treated with sampling processes from a larger population. To address this issue, we have constructed a Monte Carlo simulation program to generate haplotypes under a probabilistic model. Under this program we are able to explore the variation around Equation (13) generated by sampling haplotypes and to observe effects that may be produced by different sets of parameters.

Monte carlo simulations

In an effort to understand the variation in the patterns of disease association decay as a function of linkage disequilibrium with a causative site, we constructed a Monte Carlo simulation using a generalized disease model (penetrances for each of the three genotypes at the causal site are parameterized) and treating the haplotype counts in cases and controls as random variables. Recombination was introduced between a causal site and a closely linked marker as a realistic method of generating different sets of 2-site haplotypes for the general population (Hartl and Clark, 1989). For a rate of recombination, c, and generation time t, we used the following set of recursions (Haldane model of recombination): Hence, for the general population, we can express r2 as a function of generation time using the recursions in Equations (18–21): Assuming Hardy-Weinberg equilibrium in the general population at both sites, the proportion of individuals affected by the disease attributable to this locus, is calculated through the previously-described formula for disease prevalence. To calculate the expected haplotype frequencies in cases, we used Bayes theorem. Hence, the expected frequency of the A1B1 haplotype in cases is In an analogous manner, the remaining haplotype frequencies in cases, where the subscript indicates the haplotype, are The haplotype frequencies in controls are simply Sampling of the case and control haplotypes from the expected frequencies is accomplished through two independent multinomial variates such that the joint densities are given by Hence, the sample frequency of the causal allele in cases and controls, respectively, are and We employed an additive model for the penetrances at the causal site and a design using 10,000 cases and 40,000 controls. As the time parameter is increased, the number of recombination events between the casual site and the marker site increases and there is a corresponding reduction in the linkage disequilibrium between the two sites. Figure 4 shows the distribution of the association statistic at the marker site (Equation 4) plotted against the product of the association statistic at the causal site (Equation 3) and the -value between the two sites. Four different time points were evaluated in the simulation, each with 10,000 replicates generated. The patterns show the general linear trend of how the association statistics scale with linkage disequilibrium and the variation around this pattern. For fixed properties at a causal site, Figure 5 displays the mean value and 95% confidence interval of the association statistic at the marker site as the -value declines.

Figure 4

Figure 5

Mean and 95% Confidence Interval for the Chi-Square Statistic at the Marker Site as a Function of Linkage Disequilibrium. Results from the Monte Carlo simulation showing the mean and 95% confidence interval for the Chi-Square statistic at the marker site, plotted as a function of r2 with the causal site. The two-site model with recombination and additive mode of inheritance described previously were used.

Monte Carlo results under the 2-site model with recombination. The Chi-Square statistic as measured at the marker site is plotted against the product of r2 and the Chi-Square Statistic at the disease site for 10,000 replications of the simulation. 10,000 disease cases and 40,000 controls were assumed in the calculations. The initial frequencies of the four haplotypes were 0.70 for the parental, non-causal haplotype, 0.28 for the parental haplotype carrying the causal variant, and 0.01 for each of the recombinant haplotypes. As time (t) increases, these frequencies varied according to the recursions specified in Equations (18–21). An additive model was assumed as the mode of inheritance model with penetrances of 0.01, 0.03, and 0.05 for the three genotypes at the causal site. Mean and 95% Confidence Interval for the Chi-Square Statistic at the Marker Site as a Function of Linkage Disequilibrium. Results from the Monte Carlo simulation showing the mean and 95% confidence interval for the Chi-Square statistic at the marker site, plotted as a function of r2 with the causal site. The two-site model with recombination and additive mode of inheritance described previously were used.

Application to experimental data

All indicated earlier, there are several uses of the theorem presented here. The pattern of linear decay of association (as measured by the test statistic) with declining linkage disequilibrium can be used to support various markers as causal sites. Conversely, significant departure from the expected pattern can indicate multiple causal sites segregating at the disease locus. And additionally, understanding this fine mapping theorem can be used to infer properties of non-interrogated causal sites. To illustrate the application of the relationship described in Equation (13) to experimental data, we used GWAS data around the well-established obesity locus, FTO, generated by a recent large study of extreme BMI (Berndt et al., 2013). The FTO gene encodes for an alpha-ketoglutarate-dependent dioxygenase (Gerken et al., 2007), playing a role in growth and development (Boissel et al., 2009; Daoud et al., 2016), and has been reliably associated with the related conditions of type 2 diabetes, BMI, adiposity and other obesity-related traits (Scott et al., 2007; Zeggini et al., 2007; Lindgren et al., 2009; Thorleifsson et al., 2009; Fox et al., 2012; DIAGRAM Consortium et al., 2014; Wood et al., 2016). Within the FTO gene region, the study found that rs11075990 exhibited the strongest association with extreme BMI with a reported p-value of 9.3E-33. From this study, we identified 752 SNPs residing within a ~1 Mb region surrounding FTO, having linkage disequilibrium data from the 1000 Genomes project (The 1000 Genomes Project Consortium et al., 2015). Figure 6 displays the positional association of these data, showing a substantial peak localized on chr16q over the FTO gene. Plotting these association results as a function of pairwise linkage disequilibrium (as measured by r2) with rs11075990, there is a general decay of the Chi-Square association statistics with declining r2 -values (Figure 7). Pearson's correlation is 0.979 and the p-value for this relationship (testing Spearman's rho under the null model of no correlation) is 2.87E-29. For this example, there are some immediate findings by visual inspection. The general pattern following the theorem is present. In addition, there appear to be some SNPs with extreme BMI associations that substantially exceed the level of association expected to be driven through linkage disequilibrium with rs11075990. That is, the theoretical model of one causal site (rs11075990) driving the extreme BMI association patterns in the FTO gene region may not explain the association statistics at some SNPs, such as rs2058908, where the theory only predicts a Chi-Square-value of 12.36 (r2 with rs11075990 is 0.087) and yet the observed Chi-Square statistic is 73.98. Hence, the genetic information at rs2058908 may be driven by a causal signal independent of rs11075990 (rs2058908 is denoted with a green circle in Figure 7). A test of conditional association could be used to verify these types of hypotheses. Since the residuals obtains from the fitted line (the line that passes through the origin and the Chi-Square value associated to the causal site) and the observed Chi-Square-values are not normally-distributed, we used a resampling approach to obtain a 95% confidence band (dashed lines in Figure 7). In this approach, we treat the fitted Chi-Square-values to be the expected response for the bootstrap samples, and by resampling the original residuals, we obtain bootstrap replicates for the fixed covariate (r2) (Fox and Weisberg, 2012). Here, we resampled the original residuals 100,000 times in the R programming language (R Core Team, 2014) and used the 0.025 and 0.975 quantiles of the resampled fits to achieve the 95% confidence band in Figure 7.

Figure 6

Figure 7

Decay of association from rs11075990. Association from the Berndt et al. study of extreme BMI was plotted against the pairwise linkage disequilibrium (r2) values of each SNP and rs11075990—the most significant finding in the region. rs2058908 is denoted with the green circle. Ninety-five percent of confidence intervals are determined through the resampling scheme presented in the text.

Positional association plot of SNPs around the . Association data from the Berndt et al. study of extreme BMI were used for this positional plot. SNPs (n = 752) within 500 kb of the FTO gene, having linkage disequilibrium information from the 1000 genomes study were used in the plot showing strong localization of the association signal around the FTO gene. Decay of association from rs11075990. Association from the Berndt et al. study of extreme BMI was plotted against the pairwise linkage disequilibrium (r2) values of each SNP and rs11075990—the most significant finding in the region. rs2058908 is denoted with the green circle. Ninety-five percent of confidence intervals are determined through the resampling scheme presented in the text.

Discussion

One of the most fundamental patterns in disease genetics is the nature of the decay of disease association with declining linkage disequilibrium from a causal site. Motivated by the Kruglyak and Pritchard-Przeworski derivations for the approximate increase in sample size to attain the equivalent statistical power at a marker site in linkage disequilibrium with a causal site, we first showed how this result could be used to produce an approximation showing a linear relationship in the Chi-Square association statistics testing disease association at a marker and a causal site and that the ratio of the two was approximately r2 (Equation 2). Next, using a general two-site model with penetrances, we showed that this is indeed an exact result and invariant to the mode of inheritance model (Equation 13). In this derivation, we treated the variables as fixed parameters. To treat the situation where the haplotype frequencies have sampling properties (i.e., are treated as random variables), we wrote a Monte Carlo simulation of this system for finite sample sizes and used a standard model of recombination between the causal and marker sites. The results characterized the stochastic variability around the initial result. Lastly, we applied this work to experimental data from a large GWAS on extreme BMI and showed reasonably good correspondence with this fine mapping theory. Aside from being a theorem in disease genetics for dichotomous traits, we hope that this fine mapping theorem can serve as an aid in identifying casual variants segregating in a region associated with disease. Recently, substantial effort has driven the field of fine-mapping forward. To address the statistical aspects of prioritizing potentially causal variants within a fine-mapped region, several methods have been developed including a useful Bayesian method created by Maller et al. (The Wellcome Trust Case Control Consortium et al., 2012), which uses Bayes Factor for each variant in the region and calculates the proportion of the total sum of Bayes Factors in the region that is attributable to that variant, producing a relative ranking of the strength of evidence for each variant within the disease-associated region being causal. These calculations allow for the determination of a credible set of highest ranked variants that explains the large majority of the statistical evidence of disease association within the region of interest. The Maller et al. method has been applied to fine mapping data for complex diseases, such as type 1 diabetes (Onengut-Gumuscu et al., 2015). Other important developments in fine mapping approaches include: Bim-Bam (Servin and Stephens, 2007), another Bayesian approach which determines subsets of variants that likely contain causal sites, CAVIAR (Hormozdiari et al., 2014) and CAVIARBF (Chen et al., 2015), coalescent-based methods (Graham, 1998; Morris et al., 2002; Zöllner and Pritchard, 2005), and PAINTOR (Kichaev et al., 2014), which incorporates functional annotation data in a probabilistic manner. Several different extensions of the work presented here could substantially aid fine mapping efforts for complex diseases: (1) Statistical approaches that harness the pattern of association decay with declining linkage disequilibrium will leverage the genetic data at a fine-mapped region to better support or reject the hypothesis that a particular site is indeed causal. Screening each site for a goodness-of-fit with the expected decay pattern from a causal site would better enable the detection of causal sites; (2) Future work focusing on imputing additional properties of a non-interrogated causal variant within a disease-associated region using the linkage disequilibrium patterns and disease association statistics would provide valuable insights into design and interpretation of fine mapping studies. For example, if one imputed a low-frequency, high effect size variant, then experimental designs and genetic techniques, such as sequencing, that have high power to detect such variants can be utilized; and (3) It is becoming increasingly clear that the large majority of regions associated with complex disease susceptibility have multiple predisposing alleles segregating in the populations examined. Methods that extend the simple two-site model explored here to include multiple causal sites will be invaluable for the identification of these functional variants.

Author contributions

MM made substantial contributions to the analysis and interpretation of data, constructed the Monte Carlo simulations, devised the method for obtaining confidence intervals and generated figures and drafting/revising the manuscript. NB made substantial contributions to the interpretation of data, proofing the derivations, and editing the manuscript. JU, MF, XL, and ZY reviewed the manuscript, proofed the derivations, and aided in analyses. MH and SH reviewed and edited the manuscript. SS originated the concept of the manuscript, designed the study, derived the equations, aided in the design of the simulations, interpreted the data, generated figures and drafted/revised the manuscript.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

49 in total

1. Linkage disequilibrium and the mapping of complex human traits.

Authors: Kenneth M Weiss; Andrew G Clark
Journal: Trends Genet Date: 2002-01 Impact factor: 11.639

2. Complement factor H polymorphism in age-related macular degeneration.

Authors: Robert J Klein; Caroline Zeiss; Emily Y Chew; Jen-Yue Tsai; Richard S Sackler; Chad Haynes; Alice K Henning; John Paul SanGiovanni; Shrikant M Mane; Susan T Mayne; Michael B Bracken; Frederick L Ferris; Jurg Ott; Colin Barnstable; Josephine Hoh
Journal: Science Date: 2005-03-10 Impact factor: 47.728

3. Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics.

Authors: Wenan Chen; Beth R Larrabee; Inna G Ovsyannikova; Richard B Kennedy; Iana H Haralambieva; Gregory A Poland; Daniel J Schaid
Journal: Genetics Date: 2015-05-06 Impact factor: 4.562

4. Identifying causal variants at loci with multiple signals of association.

Authors: Farhad Hormozdiari; Emrah Kostem; Eun Yong Kang; Bogdan Pasaniuc; Eleazar Eskin
Journal: Genetics Date: 2014-08-07 Impact factor: 4.562

Review 5. Linkage disequilibrium in humans: models and data.

Authors: J K Pritchard; M Przeworski
Journal: Am J Hum Genet Date: 2001-06-14 Impact factor: 11.025

Review 6. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future.

Authors: Montgomery Slatkin
Journal: Nat Rev Genet Date: 2008-06 Impact factor: 53.242

7. Perspectives on human genetic variation from the HapMap Project.

Authors: Gil McVean; Chris C A Spencer; Raphaelle Chaix
Journal: PLoS Genet Date: 2005-10 Impact factor: 5.917

8. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

9. Genome-wide association scan meta-analysis identifies three Loci influencing adiposity and fat distribution.

Authors: Cecilia M Lindgren; Iris M Heid; Joshua C Randall; Claudia Lamina; Valgerdur Steinthorsdottir; Lu Qi; Elizabeth K Speliotes; Gudmar Thorleifsson; Cristen J Willer; Blanca M Herrera; Anne U Jackson; Noha Lim; Paul Scheet; Nicole Soranzo; Najaf Amin; Yurii S Aulchenko; John C Chambers; Alexander Drong; Jian'an Luan; Helen N Lyon; Fernando Rivadeneira; Serena Sanna; Nicholas J Timpson; M Carola Zillikens; Jing Hua Zhao; Peter Almgren; Stefania Bandinelli; Amanda J Bennett; Richard N Bergman; Lori L Bonnycastle; Suzannah J Bumpstead; Stephen J Chanock; Lynn Cherkas; Peter Chines; Lachlan Coin; Cyrus Cooper; Gabriel Crawford; Angela Doering; Anna Dominiczak; Alex S F Doney; Shah Ebrahim; Paul Elliott; Michael R Erdos; Karol Estrada; Luigi Ferrucci; Guido Fischer; Nita G Forouhi; Christian Gieger; Harald Grallert; Christopher J Groves; Scott Grundy; Candace Guiducci; David Hadley; Anders Hamsten; Aki S Havulinna; Albert Hofman; Rolf Holle; John W Holloway; Thomas Illig; Bo Isomaa; Leonie C Jacobs; Karen Jameson; Pekka Jousilahti; Fredrik Karpe; Johanna Kuusisto; Jaana Laitinen; G Mark Lathrop; Debbie A Lawlor; Massimo Mangino; Wendy L McArdle; Thomas Meitinger; Mario A Morken; Andrew P Morris; Patricia Munroe; Narisu Narisu; Anna Nordström; Peter Nordström; Ben A Oostra; Colin N A Palmer; Felicity Payne; John F Peden; Inga Prokopenko; Frida Renström; Aimo Ruokonen; Veikko Salomaa; Manjinder S Sandhu; Laura J Scott; Angelo Scuteri; Kaisa Silander; Kijoung Song; Xin Yuan; Heather M Stringham; Amy J Swift; Tiinamaija Tuomi; Manuela Uda; Peter Vollenweider; Gerard Waeber; Chris Wallace; G Bragi Walters; Michael N Weedon; Jacqueline C M Witteman; Cuilin Zhang; Weihua Zhang; Mark J Caulfield; Francis S Collins; George Davey Smith; Ian N M Day; Paul W Franks; Andrew T Hattersley; Frank B Hu; Marjo-Riitta Jarvelin; Augustine Kong; Jaspal S Kooner; Markku Laakso; Edward Lakatta; Vincent Mooser; Andrew D Morris; Leena Peltonen; Nilesh J Samani; Timothy D Spector; David P Strachan; Toshiko Tanaka; Jaakko Tuomilehto; André G Uitterlinden; Cornelia M van Duijn; Nicholas J Wareham; Dawn M Waterworth; Michael Boehnke; Panos Deloukas; Leif Groop; David J Hunter; Unnur Thorsteinsdottir; David Schlessinger; H-Erich Wichmann; Timothy M Frayling; Gonçalo R Abecasis; Joel N Hirschhorn; Ruth J F Loos; Kari Stefansson; Karen L Mohlke; Inês Barroso; Mark I McCarthy
Journal: PLoS Genet Date: 2009-06-26 Impact factor: 5.917

10. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes.

Authors: Eleftheria Zeggini; Michael N Weedon; Cecilia M Lindgren; Timothy M Frayling; Katherine S Elliott; Hana Lango; Nicholas J Timpson; John R B Perry; Nigel W Rayner; Rachel M Freathy; Jeffrey C Barrett; Beverley Shields; Andrew P Morris; Sian Ellard; Christopher J Groves; Lorna W Harries; Jonathan L Marchini; Katharine R Owen; Beatrice Knight; Lon R Cardon; Mark Walker; Graham A Hitman; Andrew D Morris; Alex S F Doney; Mark I McCarthy; Andrew T Hattersley
Journal: Science Date: 2007-04-26 Impact factor: 47.728