
Novel bayes factors that capture expert uncertainty in prior density specification in genetic association studies.

Amy V Spencer, Angela Cox, Wei-Yu Lin, Douglas F Easton, Kyriaki Michailidou, Kevin Walters.

Abstract

Bayes factors (BFs) are becoming increasingly important tools in genetic association studies, partly because they provide a natural framework for including prior information. The Wakefield BF (WBF) approximation is easy to calculate and assumes a normal prior on the log odds ratio (logOR) with a mean of zero. However, the prior variance (W) must be specified. Because of the potentially high sensitivity of the WBF to the choice of W, we propose several new BF approximations with logOR ∼N(0,W), but allow W to take a probability distribution rather than a fixed value. We provide several prior distributions for W which lead to BFs that can be calculated easily in freely available software packages. These priors allow a wide range of densities for W and provide considerable flexibility. We examine some properties of the priors and BFs and show how to determine the most appropriate prior based on elicited quantiles of the prior odds ratio (OR). We show by simulation that our novel BFs have superior true-positive rates at low false-positive rates compared to those from both P-value and WBF analyses across a range of sample sizes and ORs. We give an example of utilizing our BFs to fine-map the CASP8 region using genotype data on approximately 46,000 breast cancer case and 43,000 healthy control samples from the Collaborative Oncological Gene-environment Study (COGS) Consortium, and compare the single-nucleotide polymorphism ranks to those obtained using WBFs and P-values from univariate logistic regression.
© 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.

Keywords:  elicitation; empirical; expert knowledge; filtering; fine mapping; flexibility; hyperparameter; sensitivity; single-nucleotide polymorphism

Year:  2015        PMID: 25727067      PMCID: PMC4406822          DOI: 10.1002/gepi.21891

Source DB:  PubMed          Journal:  Genet Epidemiol        ISSN: 0741-0395            Impact factor:   2.135


Introduction

Recently, several methods have been published for analyzing genotype data at the fine‐mapping level [Maller et al., 2012; Udler et al., 2009; Vignal et al., 2011]. These data consist of hundreds or thousands of single‐nucleotide polymorphisms (SNPs) in a small region of the genome, where associations with a disease have previously been found, commonly in a genome‐wide association study (GWAS). Bayesian methods [Stephens and Balding, 2009] have the advantage of naturally allowing for the inclusion of prior functional genetic information which could inform the probability of an SNP being causal. This could help to overcome the problems faced in fine‐mapping, where it is expected that the causal SNPs not yet identified are likely to have small effect sizes (odds ratios [ORs] of the order of 1.1 or less) and the results are likely to be confounded by high levels of short‐range linkage disequilibrium (LD). Furthermore, some causal SNPs may have low minor allele frequencies (MAFs). These factors make it difficult to identify causal SNPs, even using the tens of thousands of subjects that are currently being analyzed by international consortia. We previously carried out a thorough investigation of several different frequentist fine‐mapping filtering methods [Spencer et al., 2014]. Filtering is a general framework in which the SNPs are ranked according to a given statistic, a threshold value of that statistic is chosen and SNPs with a value below this threshold are removed from the set of candidate causal SNPs. This results in a more manageable number of SNPs which can be investigated for causality in further biological tests, such as the analysis of gene expression in cell lines. The Bayes factor (BF) is a Bayesian statistic which can be used for filtering [Kass and Raftery, 1995], and it has already been used as a tool for use in genetic association analyses [Stephens and Balding, 2009; Wakefield, 2008, 2009]. 
The calculation of BFs is now implemented in genetic analysis software such as snptest2 [Marchini et al., 2007] and their use is becoming increasingly popular as a filter in fine‐mapping studies [Maller et al., 2012]. The BF is a ratio that compares the probabilities of the data under two models or hypotheses. In this setting, the two models are one in which an SNP is not causally associated with a disease and one in which it is. BFs can be used alone or used to update prior odds of the SNP being causal. BFs often have to be approximated due to intractable integrals and, although current methods of approximation are good, they restrict the form of the prior on the effect size. We consider the use of Wakefield's approximate BF [Wakefield, 2008, 2009], which requires that each SNP has a prior on its log odds ratio (logOR) of the form logOR ∼ N(0, W) for some fixed W. When eliciting a value of W from experts there may be some uncertainty in W, perhaps because several quantiles have been elicited giving inconsistent values of W. Because of this we examine several families of prior distributions on W, which result in novel tractable BFs. Such priors on W can be thought of as providing a weighted average of the BF over each value in the support of W. We examine the properties of these priors on W and give the novel BF approximations in forms that are easily calculable in commonly used software. Although this removes the problem of specifying a fixed W value, most of the novel BFs require hyperparameters to be specified. We demonstrate how appropriate values may be obtained via expert elicitation. The BFs we describe could be used in any genetic association study, but we give an example with simulated fine‐mapped data to show how effective the use of these BFs can be in filtering. We compare the results to those using the Wakefield BF (WBF) and examine the effect of the choice of hyperparameters.
We give an example of eliciting the prior hyperparameters and using the BFs as a fine‐mapping tool using breast cancer case‐control data from an international consortium. We are able to show that our methods may be used to describe a variety of uncertainties and appropriately incorporate these into a BF analysis. Not only this, but they can potentially produce better results than if the uncertainty was not taken into account.

Materials and Methods

Bayes Factors and the Wakefield Approximation

BFs compare the probability of the observed data under two models or hypotheses. For our purposes, the BF can be defined as

$\mathrm{BF} = \frac{\Pr(\text{data} \mid H_1)}{\Pr(\text{data} \mid H_0)}. \qquad (1)$

BFs are also used to update prior odds ($\delta/(1-\delta)$) to posterior odds ($\Delta/(1-\Delta)$) via $\Delta/(1-\Delta) = \mathrm{BF} \times \delta/(1-\delta)$, where in our case Δ and δ are the posterior and prior probabilities of “true” association, respectively. By “true association,” we mean causally linked to disease risk rather than being associated through LD or sampling variation. Here, a BF greater than one indicates that the data are more likely under the alternative than the null hypothesis. BFs require the specification of a likelihood and prior on all model parameters. Both BFs and posterior probabilities can be used to fine‐map genomic regions in case‐control studies by using the likelihood from a logistic regression model. For SNP i in single‐SNP logistic regression models, the probability ($p_{ij}$) of subject j, with $x_{ij}$ copies of the minor allele, being a case is

$\log\frac{p_{ij}}{1-p_{ij}} = \beta_{0i} + \beta_{1i} x_{ij}. \qquad (2)$

With this definition, $\beta_{1i}$ can be interpreted as the SNP‐specific per‐allele natural logarithm of the OR comparing the minor to the major allele. For SNP i, the BF is calculated comparing the hypotheses $H_0: \beta_{1i} = 0$ and $H_1: \beta_{1i} \neq 0$ [Stephens and Balding, 2009]. The BF, as given in Equation (1), is the ratio of marginal likelihoods, which can lead to intractable integrals for many prior densities. For nontractable BFs, it is common to use a Laplace approximation [Kass and Raftery, 1995]. The Laplace approximation is implemented in software packages, including snptest2 [Marchini et al., 2007]. Wakefield [2008, 2009] derived a tractable approximation to the BF (which we abbreviate as WBF). We found excellent agreement between the WBF and Laplace approximations from snptest2 for sample sizes ⩾10,000 for a variety of ORs and MAFs (data not shown). Both methods are based on asymptotic approximations and, given the large sample sizes in the types of dataset we consider, should provide accurate approximations to the true BF.
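As a small worked illustration of the prior-to-posterior odds update described above, the following Python sketch converts a BF and a prior probability of association into a posterior probability. The function name and the example numbers are illustrative assumptions, not values from the study.

```python
def posterior_probability(bayes_factor, prior_prob):
    """Update a prior probability of association using a Bayes factor.

    posterior odds = BF x prior odds, then odds are mapped back to a probability.
    """
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical example: a SNP with a 1% prior probability of being causal
# and a BF of 50 in favour of association.
print(posterior_probability(50.0, 0.01))
```

A BF of one leaves the prior probability unchanged, which is a quick sanity check on any implementation.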
Using the definition of the BF in Equation (1), the Wakefield approximate BF is

$\mathrm{WBF} = \sqrt{\frac{V}{V+W}} \exp\left(\frac{\hat{\beta}_1^2 W}{2V(V+W)}\right). \qquad (3)$

In Equation (3), β1, the logOR of causal SNPs in the genomic region under consideration, is assumed to follow a normal distribution given by $\beta_1 \sim N(0, W)$, and $\hat{\beta}_1$ is the maximum likelihood estimator (MLE) of β1. Rather than consider the logistic likelihood, Wakefield used the asymptotic distribution of the MLE, $\hat{\beta}_1 \sim N(\beta_1, V)$, which leads to the WBF given in Equation (3). Note that the WBF we specify in Equation (3), and use in the rest of this paper, is the reciprocal of the WBF given by Wakefield [2009].
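The approximation above can be sketched in a few lines of Python. This assumes the standard Wakefield form written in favour of association (the reciprocal of Wakefield's 2009 expression, as stated above); the function name and the summary statistics in the example are hypothetical.

```python
import math


def wakefield_bf(beta_hat, v, w):
    """Approximate BF in favour of association (H1 versus H0).

    beta_hat: estimated logOR; v: its asymptotic variance; w: prior variance of the logOR.
    """
    z_squared = beta_hat ** 2 / v
    shrinkage = w / (v + w)
    # sqrt(1 - shrinkage) equals sqrt(v / (v + w)).
    return math.sqrt(1.0 - shrinkage) * math.exp(0.5 * z_squared * shrinkage)


# Hypothetical summary statistics for one SNP, evaluated at two prior variances.
for w in (0.003, 0.1):
    print(w, wakefield_bf(beta_hat=0.10, v=0.0018, w=w))
```

Evaluating the same SNP at two values of W gives a direct feel for how strongly the result depends on the prior variance.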

Motivation for the Study

To use the WBF, one needs to specify W (e.g., through elicitation) and be prepared to accept that the prior distribution of the logOR is Gaussian. For a percentile p of the prior distribution of the OR with value $\theta_p$, such that $\Pr(\mathrm{OR} \le \theta_p) = p$, W is calculated using $W = [\ln\theta_p / \Phi^{-1}(p)]^2$, where Φ is the distribution function of the standard normal distribution (Wakefield, 2009). When performing elicitation about W, an expert may express some uncertainty about its value. For example, the expert may believe that the 80th percentile of the prior distribution for the OR is between 1.05 and 1.3, which implies that W lies between approximately 0.003 and 0.1. As a result, we investigated how sensitive the WBFs are to the choice of W and found the results to be highly dependent on it. We therefore wanted to allow for uncertainty about W in the BF calculations. We retained the normal density for the prior for β1 and considered three different parametric families of priors for W that yield BFs that should be flexible enough to capture expert uncertainty in W. We also considered an additional prior that may be useful in some scenarios. We wanted priors that led to BFs that could be easily calculated and this informed our choice of priors. We go on to show that these priors have desirable properties in the context of fine‐mapping and carry out sensitivity analysis to demonstrate the effect of the prior parameters. We give suggestions for how prior hyperparameters might be elicited and in the supplementary material we provide R code to calculate the new BFs.
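The conversion from an elicited percentile of the OR to a prior variance can be checked numerically. The sketch below assumes the usual relation for a zero-mean normal prior on the logOR, W = (ln OR / Φ⁻¹(p))², and applies it to the 80th-percentile example above; the function name is an illustrative choice.

```python
import math
from statistics import NormalDist


def prior_variance_from_percentile(odds_ratio, p):
    """Prior variance W implied by the p-th percentile of the OR being `odds_ratio`.

    Assumes logOR ~ N(0, W), so ln(odds_ratio) = sqrt(W) * inverse_normal_cdf(p).
    """
    z = NormalDist().inv_cdf(p)
    return (math.log(odds_ratio) / z) ** 2


# 80th percentile of the OR believed to lie between 1.05 and 1.3.
w_low = prior_variance_from_percentile(1.05, 0.80)
w_high = prior_variance_from_percentile(1.30, 0.80)
print(round(w_low, 4), round(w_high, 4))
```

The two results bracket the range of W that the expert's uncertainty implies, which is the interval the new priors are designed to cover.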

Novel BFs Allowing for Uncertainty in W

We have derived four forms for priors on W which are given (up to proportionality) in Table 1. Three of these forms, the power, hybrid, and reciprocal priors, use the genotype data through V. The dependence of the prior for W on V is purely for mathematical convenience to yield tractable integrals. Therefore, they are not true priors but we show that in practice the values of V likely to be encountered in large association studies have very little impact on the prior density of W. We also provide one prior on W, the exponential prior, which does not depend on the genotype data.
Table 1

Prior densities for W

Name of prior | f(W) | Restrictions on hyperparameters
Power | $(V+W)^{k}$ | $k < -\tfrac{1}{2}$
Exponential | $\exp(-cW/2)$ | $c > 0$
Hybrid | $(V+W)^{k}\exp\left(-\frac{d}{2(V+W)}\right)$ | $d > -\hat{\beta}_1^2$, $k < -1$
Reciprocal | $\frac{1}{V+W}\exp\left(-\frac{V+W}{2}\right)$ | none

Proportional density functions for each of the four prior forms (applies for $a \le W \le b$).

Each of the power, exponential, and hybrid priors is really a family of priors because it depends on hyperparameters, whereas the reciprocal prior takes a single density. These hyperparameters are represented by k, c, and d, and all priors have support [a, b]. We suggest choosing the values of a, b, c, d, and k via expert elicitation. Figure 1 shows the densities of some possible priors and hence the range of prior beliefs they can accommodate. We derive the approximate BFs relating to these priors in the supplementary material. All four new BFs can be easily calculated in R [R Core Team, 2012] (code is provided in the supplementary material). It should be noted that what we term the hybrid prior is identical to a shifted inverse gamma distribution on a restricted support.
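Because a prior on W amounts to a weighted average of the fixed-W BF over the support of W, the idea can be illustrated with simple numerical integration. The Python sketch below is not the closed-form calculation derived in the supplementary material; it substitutes a midpoint-rule quadrature, and the summary statistics, support, and exponent used in the example are illustrative assumptions.

```python
import math


def wakefield_bf(beta_hat, v, w):
    """Wakefield-type approximate BF for association, for a fixed prior variance w."""
    return math.sqrt(v / (v + w)) * math.exp(beta_hat ** 2 * w / (2.0 * v * (v + w)))


def averaged_bf(beta_hat, v, a, b, prior_density, n_steps=2000):
    """Average the fixed-W BF over [a, b], weighting by an unnormalised prior density.

    Uses the midpoint rule; the normalising constant of the prior cancels in the ratio.
    """
    step = (b - a) / n_steps
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n_steps):
        w = a + (i + 0.5) * step
        weight = prior_density(w)
        weighted_sum += weight * wakefield_bf(beta_hat, v, w)
        weight_total += weight
    return weighted_sum / weight_total


# Illustrative inputs only (hypothetical SNP summary statistics).
beta_hat, v = 0.12, 0.0018
k = -1.5  # power-prior exponent; a negative k puts more mass on small W
bf = averaged_bf(beta_hat, v, a=0.003, b=0.1, prior_density=lambda w: (v + w) ** k)
print(round(bf, 3))
```

Passing a different `prior_density` function (for example an exponential form) reuses the same averaging step, which makes it easy to compare how the shape of the prior changes the resulting BF.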
Figure 1

Densities of three families of tractable priors and one specific prior for W. β1 is the logOR with β1 ∼ N(0, W), and a single fixed value of V is used in all plots.

Eliciting Hyperparameters of the Priors for W

Our new BFs avoid the problem of specifying a fixed value for W, but most of them instead require other hyperparameters to be specified. The hyperparameters would usually be determined through elicitation based on the distribution function of W. The distribution function for W when using the power prior Bayes factor (PPBF) is given by

$F(W) = \frac{(V+W)^{k+1} - (V+a)^{k+1}}{(V+b)^{k+1} - (V+a)^{k+1}}, \quad k \neq -1. \qquad (4)$

So if a single percentile ($p_1$) of the distribution of W is elicited ($w_1$) then we find k by equating Equation (4) to $p_1$ with W replaced by $w_1$. For the PPBF and exponential prior Bayes factor (EPBF) this can easily be solved algebraically. The hybrid prior Bayes factor (HPBF) requires some other numerical search method. The distribution functions for all four priors are given in Table 2. A more reliable strategy is to determine suitable hyperparameters by optimizing the fit of multiple percentiles elicited from an expert. For example, if we elicit h percentiles of W, $(p_i, w_i)$ for $i = 1, \ldots, h$, then we find the hyperparameter values $\hat{\Lambda}$ such that

$\hat{\Lambda} = \arg\min_{\Lambda} \sum_{i=1}^{h} \left[F(w_i \mid \Lambda) - p_i\right]^2, \qquad (5)$

where Λ denotes the hyperparameters of the prior. Rather than directly eliciting W values, it is likely that an expert will find it easier to envisage particular central probability intervals (PIs) for the ORs. For the ith percentile to be elicited ($p_i$), let $u_i$ represent the upper limit of the $100(2p_i - 1)$% central PI for the OR. Then $w_i$ can be found using $w_i = [\ln u_i / \Phi^{-1}(p_i)]^2$. It is also likely to help if experts are encouraged to choose the particular PIs themselves.
Table 2

Prior distribution functions for W

Type of prior | F(W) | Limitations
Power | $\frac{(V+W)^{k+1} - (V+a)^{k+1}}{(V+b)^{k+1} - (V+a)^{k+1}}$ | $k < -\tfrac{1}{2}$, $k \neq -1$
Power | $\ln\frac{V+W}{V+a} \Big/ \ln\frac{V+b}{V+a}$ | $k = -1$
Exponential | $\frac{\exp(-cW/2) - \exp(-ca/2)}{\exp(-cb/2) - \exp(-ca/2)}$ | $c > 0$
Hybrid | $\frac{\Gamma\left(-k-1, \frac{d}{2(V+W)}\right) - \Gamma\left(-k-1, \frac{d}{2(V+a)}\right)}{\Gamma\left(-k-1, \frac{d}{2(V+b)}\right) - \Gamma\left(-k-1, \frac{d}{2(V+a)}\right)}$ | $d > -\hat{\beta}_1^2$, $k < -1$
Reciprocal | $\frac{\ln\frac{W+V}{a+V} + \sum_{n=1}^{\infty} \frac{(-1)^n}{n \, n!}\left[\left(\frac{W+V}{2}\right)^n - \left(\frac{a+V}{2}\right)^n\right]}{\ln\frac{b+V}{a+V} + \sum_{n=1}^{\infty} \frac{(-1)^n}{n \, n!}\left[\left(\frac{b+V}{2}\right)^n - \left(\frac{a+V}{2}\right)^n\right]}$ | none

Distribution functions for each of the four prior forms.

To calculate F(W), we also need to specify a, b, and V. Suppose the expert provides the minimum and maximum values of the upper limits of, say, the 80th percentile of the OR that they consider plausible. The minimum determines a through the relation between W and an elicited percentile of the OR, $W = [\ln(\mathrm{OR}) / \Phi^{-1}(p)]^2$ with p = 0.8 in this case, and the maximum determines b in the same way. The value of V will be different for every SNP, so we suggest taking the median of the range of V. The values of V can be found by fitting univariate logistic regression models to the data, where V is the square of the standard error of the parameter estimate for the SNP. This can be done in many standard statistical software packages. The values of the hyperparameters in Figure 1 give a good indication of the space over which to search. We have written R code to carry out this search over the hyperparameters. The output includes the minimum sum of squares from Equation (5), so that the form of the prior which results in the smallest value can be determined. This code is available upon request. There are other methods available to determine the hyperparameters, for example, empirical Bayes methods. In empirical Bayes, the hyperparameters (Λ) are found as the values that maximize the marginal likelihood of the data. In our case, this corresponds to maximizing the BF over Λ, which cannot be done analytically.
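The least-squares search over hyperparameters can be sketched for the power prior as follows. This Python example is an illustration rather than the authors' R code: it assumes the power-prior distribution function with exponent k + 1 (and the logarithmic form at k = -1), uses a plain grid search, and fits synthetic "elicited" percentiles generated from a known k so that the recovery can be checked. All names and numerical inputs are hypothetical.

```python
import math


def power_prior_cdf(w, k, v, a, b):
    """Distribution function of W under a prior proportional to (v + W)**k on [a, b]."""
    if abs(k + 1.0) < 1e-12:
        # Special case k = -1, where the antiderivative is logarithmic.
        return math.log((v + w) / (v + a)) / math.log((v + b) / (v + a))
    e = k + 1.0
    lower = (v + a) ** e
    return ((v + w) ** e - lower) / ((v + b) ** e - lower)


def fit_power_prior(ws, ps, v, a, b, k_grid):
    """Return (k, sse) minimising the sum of squared differences between F(w_i) and p_i."""
    best_k, best_sse = None, float("inf")
    for k in k_grid:
        sse = sum((power_prior_cdf(w, k, v, a, b) - p) ** 2 for w, p in zip(ws, ps))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, best_sse


# Synthetic "elicited" values generated from k = -1.5, to show the search recovers it.
V, A, B = 0.00017, 0.003, 0.1
ws = [0.005, 0.01, 0.03]
ps = [power_prior_cdf(w, -1.5, V, A, B) for w in ws]
k_grid = [round(-3.0 + 0.01 * i, 2) for i in range(250)]  # -3.00 up to -0.51
k_hat, sse = fit_power_prior(ws, ps, V, A, B, k_grid)
print(k_hat, sse)
```

Running the same search with the distribution function of another prior family and comparing the minimised sums of squares gives a simple way to choose between prior forms.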

Testing the Properties and Efficacy of the New BFs on Simulated Data

Three out of the four forms of BF use the genotype data to inform the prior through V, the asymptotic variance of the estimate of the logOR. V will be different for each SNP since it depends on, among other quantities, the MAF. We used simulated data to generate realistic values of V and examined their effect on the prior density of W for values of V corresponding to SNPs that have an MAF not less than 0.005, as these are the SNPs that we might have sufficient power to detect with current sample sizes. These datasets were simulated using hapgen2 [Spencer et al., 2009] and the European haplotypes of the August 2010 release of the 1000 Genomes data, with large sample sizes reflecting those now being generated by disease‐specific consortia. There has been some suggestion that the effect size of causal SNPs may increase with decreasing MAF [Wang et al., 2005]. We investigate whether the three empirical forms of prior implicitly have this property. To assess this we examine how E(W) changes with V, over a support relevant to studies with sample sizes of 2,000 or more. Since SNPs with lower MAFs have larger V [Slager and Schaid, 2001], an appropriate prior would possess the property that E(W) is a nondecreasing function of V. Then as the MAF decreases, V increases and rarer SNPs have a priori larger effects on average. We also tested the use of the novel BFs as a method for filtering (narrowing down the set of candidate causal variables) in a fine‐mapping study by carrying out such an analysis on simulated datasets with known causal SNPs. We give results for scenarios in which the causal SNP has an MAF of 0.08, for ORs of 1.10, 1.14, and 1.18 and for total sample sizes of 2,000, 4,000, and 20,000 (with an equal number of cases and controls). We simulated 1,000 datasets for this scenario, and illustrate the results using receiver operating characteristic (ROC) curves.
Fawcett [2006] outlines several ways to determine ROC curves when they are used to represent a summary of multiple analyses (in this case those on each of the datasets). Our ROC curves present the true and mean false‐positive rates (FPRs) over the multiple analyses, but give no indication as to the variation in FPRs between datasets. Fawcett calls the method that we use threshold averaging.
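Threshold averaging can be illustrated with a short Python sketch. It assumes each simulated dataset has one known causal SNP and a score per SNP (for example a BF), with SNPs retained when their score is at or above the threshold; the data layout, function name, and toy scores are hypothetical and are not taken from the study.

```python
def threshold_averaged_roc(datasets, thresholds):
    """Return one (mean FPR, TPR) point per threshold, averaged over datasets.

    datasets: list of (scores, causal_index) pairs, one per simulated dataset.
    TPR is the fraction of datasets in which the causal SNP is retained;
    FPR is the fraction of non-causal SNPs retained, averaged over datasets.
    """
    points = []
    for t in thresholds:
        causal_retained = 0
        fpr_sum = 0.0
        for scores, causal_index in datasets:
            kept = [s >= t for s in scores]
            causal_retained += kept[causal_index]
            false_positives = sum(kept) - kept[causal_index]
            fpr_sum += false_positives / (len(scores) - 1)
        points.append((fpr_sum / len(datasets), causal_retained / len(datasets)))
    return points


# Two toy datasets with four SNPs each; the causal SNP is at index 0 and 1, respectively.
toy = [([5.0, 1.0, 0.5, 3.0], 0), ([0.2, 2.0, 4.0, 0.1], 1)]
for fpr, tpr in threshold_averaged_roc(toy, thresholds=[4.5, 1.5, 0.0]):
    print(round(fpr, 3), round(tpr, 3))
```

Because the same threshold is applied to every dataset before averaging, the resulting curve summarises average behaviour and, as noted above, carries no information about the spread of FPRs between datasets.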

Comparing Fine‐Mapping Methods for the CASP8 Region Using iCOGS Data

The Collaborative Oncological Gene‐environment Study (COGS) Consortium have recently carried out a number of studies using a specially developed Illumina array, known as the iCOGS array [Michailidou et al., 2013]. This was designed to fine‐map regions that had been previously identified by GWAS, by concentrating a large number of SNPs in regions of interest where there is already thought to be a causal association with breast, ovarian, or prostate cancer. One such region comprises base positions 201500074–202569992 of chromosome two, including the CASP8 gene. In this region, 585 SNPs were originally genotyped on breast cancer case and control samples from the Breast Cancer Association Consortium and 501 passed quality control checks. A further 1,232 were successfully imputed using impute2 (Marchini and Howie, 2010), resulting in genotypes for 1,733 SNPs in 46,450 cases and 42,600 controls (total sample size: 89,050). We used both the full dataset and a subset of 5,238 individuals (2,721 cases and 2,517 controls) to assess the impact of our priors on both smaller and larger studies. Prior to receiving the data, we carried out elicitation with a breast cancer genetics expert who had previously been involved in studies into the CASP8 region. We then determined the prior distribution that best matched their beliefs and used it to calculate BFs and carry out filtering on the genotype data from iCOGS.

Results

The Dependence of the Prior Densities of W upon Hyperparameters and the Genotype Data

Figure 1 shows some possible prior densities for W, with V held at a single fixed value for those densities which depend on V. We see that most of the priors put the majority of the weight of W close to the lower limit of its support (in this case close to zero). Other than being independent of the genotype data, one of the main advantages of the exponential prior over the other forms is its ability to provide an almost uniform prior distribution for W over the range of values likely to be considered appropriate. The HPBF allows for more flexibility in the shape of the prior distribution. In particular, it is the only prior that allows more mass at higher values of W than at lower values of W and the only prior to have a stationary point. While the other priors are monotonically decreasing with W, the mode of the hybrid prior is at $W = -d/(2k) - V$. To assess the impact of V on our priors in an intermediate‐sized fine‐mapping study, we simulated a dataset of size 20,000 (using exactly the same scenario as for the simulated dataset of 4,000) and determined the distribution of values of V for those SNPs with MAF ⩾0.005; the minimum, median, and maximum of these values were 0.00040, 0.00176, and 0.02211, respectively. Figure 2 panels (a), (c), and (e) show the prior densities for W for the power, hybrid, and reciprocal priors for these three values of V, using fixed values of the hyperparameters. While there is some variation in the prior for W as V takes its maximum and minimum values, for most of the SNPs with MAF ⩾0.005 the prior will be relatively similar. There is little reason to suspect these observations will not generalize to other genomic regions.
Figure 2

Prior densities of W (plots (a), (c), and (e)) and E(W) as a function of V (plots (b), (d), and (f)) for empirical forms of the prior. Prior densities are given for minimum, median, and maximum values of V for SNPs with MAF>0.005 in a sample size of 20,000. E(W) is given over a range of V likely to be seen in sample sizes of 2,000 or greater with different values of the hyperparameters, where relevant.

We next consider SNPs with extreme values of V as these may lead to extreme priors and potentially large BFs. Because V is bounded below by 0, we only need to consider extremely large values of V. If the number of cases and controls is equal and denoted by n, Slager and Schaid [2001] gave an approximation for V in terms of n and the MAF. Studies with total sample sizes of 2,000 will yield most values of V in a limited range for SNPs with MAF greater than 0.005. As sample size decreases, occasional large values of V may be seen, resulting from rare alleles (with MAFs less than 0.005) having very low counts in the case and/or control groups. These SNPs will have broad likelihoods and so are unlikely to have high BFs regardless of the prior. So although very rare alleles may have priors that are different from most of the SNPs, these rare SNPs are unlikely to be retained after filtering.
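To get a feel for how V scales with sample size and allele frequency, the sketch below uses a common large-sample approximation for the variance of a per-allele logOR estimate with n cases and n controls, V ≈ 1 / (n × MAF × (1 − MAF)). This expression is an assumption adopted here for illustration (it follows from the null Fisher information of a logistic model with an additive genotype code), and the function name is hypothetical.

```python
def approx_logor_variance(n_per_group, maf):
    """Approximate variance of the per-allele logOR estimate under no association.

    Assumes n_per_group cases and n_per_group controls, Hardy-Weinberg genotype
    frequencies, and an additive (0/1/2) genotype coding.
    """
    return 1.0 / (n_per_group * maf * (1.0 - maf))


# Total sample size 20,000 (10,000 cases and 10,000 controls).
for maf in (0.5, 0.08, 0.005):
    print(maf, round(approx_logor_variance(10_000, maf), 5))
```

Under this approximation a common SNP (MAF of 0.5) in a study of 20,000 subjects gives V = 0.0004, and V grows rapidly as the MAF approaches 0.005, which is in line with the range of simulated values reported above.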

Are These Priors Consistent with Rare Alleles Having Larger Effects?

There is some suggestion in the literature that rare SNPs might be expected to have larger effect sizes [Wang et al., 2005]. We wanted to assess whether this was the case for our suggested priors. To do this we examined how the expected value of W, E(W), varied with V. If the hypothesis is true then rarer alleles should have larger effect sizes, hence larger values of W. Rarer alleles also have larger values of V. So if our priors possess this property, we should observe that E(W) increases with V. For the power, hybrid, and reciprocal priors, E(W) can be found by integration, either directly (using integration by parts) or via the expectation of a related quantity. Those expectations that depend on V are plotted in Figure 2 panels (b), (d), and (f) for values of V likely to occur. E(W) is independent of V for the exponential prior. For the three expectations which depend on V, we were unable to verify algebraically that E(W) is nondecreasing in V within the specified limits (where relevant), but Figure 2 plots E(W) against V for a range of values of the hyperparameters. Panels (b) and (f) show this appears to be the case for the power and reciprocal priors. However, we can see from panel (d) of Figure 2 that E(W) could decrease with V for the hybrid prior. Researchers who believe that rare SNPs will a priori have larger effects should constrain d to be close to zero. Alternatively, a generalization of the Savage‐Dickey density ratio (Verdinelli and Wasserman, 1995) could be used with a prior that does not involve V, so removing the dependence on V. The generalization of the Savage‐Dickey density ratio approximates the BF without the need for integration, hence allowing for a wider range of priors. It bases calculation of the BF on a large number of samples from the posterior distributions of the model parameters.

Fine‐Mapping Using BFs on Simulated Data

We carried out filtering using WBFs and our new BFs on 1,000 simulated datasets with a single causal SNP with an MAF of 0.08. We chose ORs of 1.10, 1.14, and 1.18 and sample sizes of 2,000, 4,000, and 20,000. We also included the results of filtering using P‐values from univariate logistic regression. Table 3 shows the true‐positive rates (TPRs) (× 1,000) from the simulated data analysis by sample size, OR, and FPR. We have limited the results to FPRs of at most 20% as this more than covers the desired range of FPRs in any fine‐mapping procedure. The support for the new priors is [0.003, 0.1] and so we let W = 0.003 and W = 0.1 in the WBF in Table 3. In Table 3, we chose hyperparameters for our new priors that place more of the mass at smaller values of W. The priors for PPBF and HPBF with the specified hyperparameters have similarly shaped densities and so, not surprisingly, produce broadly similar rankings.
Table 3

TPRs (× 1,000) from the simulated data analysis by sample size (SS), OR, and FPR

                          SS = 2,000, FPR         SS = 4,000, FPR         SS = 20,000, FPR
Method (parameter values) 5%    10%   15%   20%   5%    10%   15%   20%   1%    5%    10%   15%
OR = 1.10
P‐value                   72    118   163   259   151   249   320   384   202   497   677   786
WBF (W=0.003)             48    131   216   291   136   285   385   442   231   579   762   836
WBF (W=0.1)               89    126   166   173   152   221   242   259   173   433   530   582
PPBF (k=−1.5)             88    140   202   245   170   264   318   346   224   524   676   758
HPBF (d=0.01, k=−1.1)     89    140   204   248   170   263   308   334   225   517   655   707
OR = 1.14
P‐value                   146   247   331   395   232   357   457   531   508   831   924   962
WBF (W=0.003)             106   255   376   457   222   408   528   574   502   850   960   979
WBF (W=0.1)               159   248   290   307   239   335   373   387   503   789   875   899
PPBF (k=−1.5)             159   272   331   372   257   386   457   486   536   846   926   956
HPBF (d=0.01, k=−1.1)     161   276   332   363   256   387   450   470   539   843   919   943
OR = 1.18
P‐value                   178   286   382   454   323   475   551   619   753   950   984   994
WBF (W=0.003)             134   302   430   496   290   500   614   663   730   950   989   997
WBF (W=0.1)               200   298   340   353   328   452   478   495   751   947   980   988
PPBF (k=−1.5)             196   322   397   439   351   496   560   602   767   954   987   994
HPBF (d=0.01, k=−1.1)     207   325   396   434   355   493   551   582   770   955   987   994

True‐positive rates (TPRs) multiplied by a thousand at the most relevant false‐positive rates (FPRs) for different filtering methods (PPBF, HPBF, Wakefield Bayes factors, and P‐values) applied to 1,000 simulated datasets with 2,871 SNPs. For PPBF and HPBF, the support is [0.003, 0.1]. The data were simulated using the LD structure of the CASP8 region for a scenario with a single causal SNP with an MAF of 0.08 for various sample sizes, odds ratios, and FPRs. In the published table, figures in bold are those that exceed the TPR obtained using P‐values.

Table 3 shows that using WBF with W = 0.003 yields higher TPRs than using WBF with W = 0.1 in many of the scenarios considered. This is perhaps not surprising because the logORs for the causal SNP used in the simulations are all small (0.095–0.166) and so lower values of W, which put more weight at small ORs, are expected to perform better. The exception to this is at the lowest FPR, where W = 0.1 generally yields higher TPRs. It is at the very low (and arguably most relevant) FPRs that the PPBF and HPBF methods generally outperform all the other methods considered. If we focus on the smallest FPR considered at each level of sample size and OR, we see that either the HPBF or the PPBF has the highest TPR of any of the methods considered in eight out of the nine scenarios. It is only when the sample size is 20,000 and the OR is 1.1 that one of the other methods (WBF with W = 0.003) is superior at the lowest FPR. What is somewhat surprising is that even when the sample size is as high as 20,000, there are substantial differences in the performances of the P‐value method and the methods using BFs. For example, with a sample size of 20,000, an OR of 1.10 and an FPR of 5%, the TPRs for the P‐value and WBF with W = 0.003 are 0.497 and 0.579, respectively. The TPR for the WBF method is nearly 17% bigger than the TPR for the P‐value method.
Making the same comparison when the FPR is 1%, the TPR for the WBF method is over 14% bigger than the TPR for the P‐value method (0.231 compared to 0.202). Figure 3 shows several ROC curves derived from filtering with a causal SNP with an OR of 1.14, an MAF of 0.08 and a sample size of 4,000. Figure 3a shows the whole ROC space while Figure 3b shows the ROC curve for FPRs below 20%. Figure 3 includes the results of filtering using the WBF approximation with two fixed values of W, and also the results of filtering using a power prior and a hybrid prior on a common support. The two new priors were chosen so that the power prior puts a lot of mass near one of the two W values used for the WBF and the hybrid prior puts a lot of mass around the other. Consequently, using the power prior produces an ROC curve similar to that using WBFs with the first value, and using the hybrid prior produces an ROC curve similar to that using WBFs with the second. There is clearly a lot of variation in the effectiveness of the WBF filter as W changes. Putting too much prior weight on large effect sizes clearly leads to poor performance of the WBF when the actual causal effect size is small. The new BFs can be thought of as a weighted average of the BFs over the support of W and so do not suffer to the same degree as using WBF with a value of W that puts a lot of mass at values of β that have a low likelihood.
Figure 3

ROC curves showing the results of WBF, PPBF, HPBF, and P‐value filtering. (A) Shows the whole ROC space while (B) shows the ROC curve for false‐positive rates below 20%. We use and for the WBF analysis. The prior for the PPBF analysis has and puts most of the mass close to . The prior for the HPBF analysis has and and puts most of the weight close to . All filtering was carried out on 1,000 datasets simulated using the LD structure of the CASP8 region for a scenario with a single causal SNP that has an OR of 1.14, an MAF of 0.08, and a total sample size of 4,000.


Fine‐Mapping Using BFs on the iCOGS Data

Before analyzing the iCOGS data [Michailidou et al., 2013], we carried out elicitation with a breast cancer genetics expert. They believed that there was a causal SNP with a small effect size in the CASP8 region on chromosome two. We initially asked our expert to give an interval of possible ORs at a single percentile. They thought that the 80th percentile of the OR was between 1.05 and 1.3, yielding a = 0.003 and b = 0.1. We then asked them to choose three further PIs that they were comfortable thinking in terms of. They provided the 0.95, 0.75, and 0.50 PIs, giving percentiles of 0.975, 0.875, and 0.75. For these probabilities, we asked for their best estimate of the upper limit of the OR, and converted the values the expert provided into the corresponding values of W. The median value of V for the SNPs in the iCOGS data was 0.00017. We used all these values to carry out a search over k at intervals of 0.01 for the PPBF. Evaluating Equation (5) for each of these values of k, we found the value of k that minimizes the sum of squared differences. Using the same method to find other prior forms which fit the elicited values gives a fitted exponential prior and a fitted hybrid prior. Of these three priors, the power prior has the closest fit to the elicited values and is a better fit than the reciprocal prior, so we used this to calculate the BF for the iCOGS data. Table 4 contains information about the top 20 ranked SNPs based on the PPBF values calculated from the full iCOGS data. It should be noted that none of these will necessarily be the causal SNP, although previous simulations show us that, with such a large sample size, there is a high probability that the causal SNP will be highly ranked (Spencer et al., 2014). We observe that nine of the top 20 ranked SNPs were among the 501 SNPs genotyped (rather than imputed) in the iCOGS data.
Table 4 also gives the rank of each SNP by P‐value from univariate logistic regression and by WBF with W = 0.003 and W = 0.1, the two endpoints of the support used for W with the PPBF. All of these methods rank the SNPs similarly, and the same three SNPs always occupy the top three places, but the rankings do differ somewhat.
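The first elicitation step converts a percentile of the OR into a value of W, but the resulting values did not survive in the text above. The following is a minimal sketch of one possible reading, not the authors' stated procedure: it assumes logOR ~ N(0, W) and treats the elicited OR as a one-sided percentile, so that W = (log OR / z_p)^2. The function name `w_from_or_percentile` is hypothetical.

```python
from math import log
from statistics import NormalDist


def w_from_or_percentile(odds_ratio, prob):
    """Prior variance W such that P(OR <= odds_ratio) = prob
    when logOR ~ N(0, W). Assumes prob > 0.5 and odds_ratio > 1."""
    z = NormalDist().inv_cdf(prob)      # standard normal quantile z_p
    return (log(odds_ratio) / z) ** 2   # solve z_p * sqrt(W) = log(OR)


# Elicited limits for the 80th percentile of the OR (from the text).
w_low = w_from_or_percentile(1.05, 0.80)
w_high = w_from_or_percentile(1.30, 0.80)
print(f"W range implied by the elicited limits: {w_low:.4f} to {w_high:.4f}")
```

Under this reading, the limits 1.05 and 1.3 give values of W of roughly 0.003 and 0.1, which agrees with the two W values used for the WBF rankings in Table 4.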
Table 4

Results of analysis of iCOGS data

| SNP number | OR (95% CI) | MAF | PPBF | Rank: P‐value | Rank: WBF with W = 0.003 | Rank: WBF with W = 0.1 | Rank: PPBF |
|---|---|---|---|---|---|---|---|
| 980 b | 1.048 (1.027, 1.071) | 0.294 | 1,387 | 1 | 1 | 1 | 1 |
| 1,027 | 1.046 (1.024, 1.068) | 0.285 | 664 | 2 | 2 | 2 | 2 |
| 992 b | 1.045 (1.022, 1.067) | 0.287 | 334 | 3 | 3 | 3 | 3 |
| 909 | 1.043 (1.021, 1.065) | 0.287 | 234 | 9 | 4 | 6 | 4 |
| 878 a | 1.081 (1.039, 1.125) | 0.061 | 228 | 12 | 13 | 4 | 5 |
| 1,272 a | 1.075 (1.036, 1.116) | 0.071 | 217 | 14 | 11 | 5 | 6 |
| 950 b | 1.043 (1.021, 1.065) | 0.286 | 217 | 10 | 5 | 7 | 7 |
| 838 | 1.041 (1.020, 1.062) | 0.338 | 213 | 5 | 6 | 10 | 8 |
| 960 b | 1.043 (1.021, 1.065) | 0.285 | 213 | 7 | =7 | =8 | =9 |
| 961 b | 1.043 (1.021, 1.065) | 0.285 | 213 | 8 | =7 | =8 | =9 |
| 985 b | 1.043 (1.021, 1.066) | 0.286 | 206 | 4 | 9 | 11 | 11 |
| 837 | 1.042 (1.021, 1.064) | 0.299 | 200 | 6 | 10 | 13 | 12 |
| 907 | 1.042 (1.020, 1.064) | 0.287 | 167 | 11 | 12 | 16 | 13 |
| 896 | 1.042 (1.020, 1.064) | 0.287 | 166 | 13 | =14 | =17 | =14 |
| 912 | 1.042 (1.020, 1.064) | 0.287 | 166 | 15 | =14 | =17 | =14 |
| 956 a, b | 1.052 (1.025, 1.080) | 0.170 | 159 | 16 | 16 | 15 | 16 |
| 681 a | 1.074 (1.035, 1.116) | 0.069 | 149 | 17 | 19 | 14 | 17 |
| 1,004 a, b | 1.051 (1.024, 1.078) | 0.173 | 124 | 18 | 18 | 20 | 18 |
| 885 | 1.041 (1.019, 1.063) | 0.287 | 119 | 19 | 17 | 23 | 19 |
| 955 a, b | 1.050 (1.023, 1.078) | 0.173 | 112 | 21 | 21 | 21 | 20 |

a For these SNPs, the major allele is associated with a higher disease risk.

b These SNPs were not genotyped but imputed.

Top‐ranked SNPs in CASP8 region based on power prior Bayes factor (PPBF) approximation with hyperparameter and . Rankings using P‐value and Wakefield Bayes factor (WBF) are also included, as is the logistic regression estimate and 95% confidence interval (CI) of the odds ratio (OR) for each SNP. The genotype data for CASP8 region come from the iCOGS study and has a total sample size of 89,050 and 1,733 SNPs.

With such a large sample size, the majority of the information comes from the likelihood rather than the prior, which is why the rankings from the different methods are so similar for the full iCOGS data. Many association studies have smaller sample sizes than those used here. In such studies the prior has more influence on the BFs, so we would expect much more variation in the ranks across the different BFs. We investigated this using a stratified random subset of the iCOGS subjects, with 2,721 cases and 2,517 controls (5,238 in total). The same analyses were carried out on this subset and the results are given in Table 5. As expected, the prior has a much larger influence in this smaller fine‐mapping study. Of the top 20 SNPs selected by PPBF, only four are in the top 20 for WBF with W = 0.003 (where the prior has most mass around small effect sizes). Of the top 20 SNPs selected by WBF with W = 0.003, 10 SNPs are not in the top 50 for PPBF and three are not in the top 75 for PPBF. There is more agreement between WBF with W = 0.1 and PPBF, but this WBF still selects two SNPs in its top 20 that are not in the top 20 for PPBF.
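The sensitivity of the WBF to W in the subset analysis can be illustrated with two rows of Table 5. The sketch below uses the standard Wakefield approximation, written here as the BF in favor of association (so larger values rank higher), and recovers the variance V of the logOR estimate from the reported 95% CI. Both choices are assumptions made for illustration: the helper names are hypothetical, and backing V out of a rounded CI is only approximate.

```python
from math import exp, log, sqrt


def variance_from_ci(lower, upper, z=1.96):
    """Approximate variance V of the logOR estimate from a 95% Wald CI."""
    se = (log(upper) - log(lower)) / (2 * z)
    return se ** 2


def wakefield_bf(odds_ratio, lower, upper, w):
    """Wakefield approximate BF for association versus no association,
    with prior logOR ~ N(0, w)."""
    beta_hat = log(odds_ratio)
    v = variance_from_ci(lower, upper)
    z_sq = beta_hat ** 2 / v          # squared Wald statistic
    shrink = w / (v + w)              # weight given to the data
    return sqrt(1 - shrink) * exp(0.5 * z_sq * shrink)


# OR and 95% CI taken from Table 5 (subset analysis).
snps = {
    "822": (1.514, 1.215, 1.886),  # rarer SNP, larger estimated OR
    "346": (1.262, 1.099, 1.449),  # more common SNP, smaller estimated OR
}

wbf = {
    (name, w): wakefield_bf(or_hat, lo, hi, w)
    for name, (or_hat, lo, hi) in snps.items()
    for w in (0.003, 0.1)
}
for (name, w), value in sorted(wbf.items()):
    print(f"SNP {name}, W = {w}: WBF = {value:.1f}")
```

With W = 0.003 the more common SNP 346 receives the larger WBF, whereas with W = 0.1 the ordering reverses in favor of SNP 822, in line with the ranks reported for these two SNPs in Table 5.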
Table 5

Results of subset analysis of iCOGS data

| SNP number | OR (95% CI) | MAF | PPBF value | Rank: P‐value | Rank: WBF with W = 0.003 | Rank: WBF with W = 0.1 | Rank: PPBF |
|---|---|---|---|---|---|---|---|
| 822 a | 1.514 (1.215, 1.886) | 0.037 | 45.1 | 1 | 28 | 1 | 1 |
| 807 a | 1.520 (1.216, 1.900) | 0.036 | 42.2 | 2 | 35 | 2 | 2 |
| 820 a | 1.515 (1.213, 1.893) | 0.036 | 40.0 | 3 | 37 | 3 | 3 |
| 824 a | 1.514 (1.212, 1.891) | 0.036 | 39.7 | 4 | 39 | 4 | 4 |
| 868 a | 1.508 (1.209, 1.881) | 0.038 | 38.5 | 5 | 38 | 5 | 5 |
| 378 a | 1.431 (1.174, 1.745) | 0.046 | 33.3 | 7 | 16 | 6 | 6 |
| 858 a | 1.495 (1.198, 1.866) | 0.036 | 30.6 | 6 | 47 | 7 | 7 |
| 379 a | 1.409 (1.162, 1.709) | 0.047 | 29.2 | 8 | 15 | 8 | 8 |
| 854 | 1.470 (1.181, 1.829) | 0.036 | 24.0 | 9 | 56 | 9 | 9 |
| 346 | 1.262 (1.099, 1.449) | 0.093 | 23.7 | 22 | 2 | 27 | 10 |
| 845 a | 1.469 (1.180, 1.829) | 0.037 | 23.6 | 11 | 57 | 10 | 11 |
| 879 a | 1.480 (1.184, 1.851) | 0.037 | 23.2 | 10 | 64 | 11 | 12 |
| 823 a | 1.480 (1.183, 1.851) | 0.036 | 22.9 | 12 | 65 | 12 | 13 |
| 339 a | 1.266 (1.099, 1.459) | 0.091 | 21.7 | 28 | 3 | 37 | 14 |
| 705 a | 1.439 (1.161, 1.761) | 0.043 | 20.5 | 15 | 53 | 15 | 15 |
| 752 a | 1.449 (1.168, 1.798) | 0.039 | 20.2 | 14 | 61 | 14 | 16 |
| 900 a | 1.475 (1.177, 1.849) | 0.036 | 19.7 | 13 | 89 | 13 | 17 |
| 698 a | 1.454 (1.167, 1.812) | 0.036 | 18.3 | 16 | 79 | 16 | 18 |
| 699 a | 1.454 (1.167, 1.812) | 0.036 | 18.3 | 17 | 80 | 17 | 19 |
| 700 a | 1.432 (1.159, 1.771) | 0.038 | 18.2 | 20 | 63 | 19 | 20 |

a These SNPs were imputed.

Ranks of the top‐ranked SNPs in CASP8 region based on the power prior Bayes factor (PPBF) approximation with hyperparameter and . Rankings using P‐value and Wakefield Bayes factor (WBF) are also included, as is the logistic regression estimate and 95% confidence interval (CI) of the odds ratio (OR) for each SNP. Values of the Bayes factors for the PPBF are also provided. The genotype data for CASP8 region come from a subset of the iCOGS study and has a total sample size of 5,238 and 1,733 SNPs.


Discussion

Novel BFs and Their Properties

We have developed several new forms of approximate BF. These include three parametric families and one fixed form, each relating to a different prior distribution on W, where we assume that the prior distribution on the logOR of an SNP is N(0, W). This allows for the calculation of BFs where a normal distribution is believed to be an appropriate form for the prior, but where there is uncertainty in W. Most of the priors we suggest for W put most of the weight at the lower end of the support, but the exponential prior allows for an almost uniform prior, which is useful when an expert believes a range of values of W are equally likely a priori. The hybrid prior can be specified so that the mode is anywhere in the given range. This might be useful when an expert has a strong prior belief in a particular value of W but wants to allow for some uncertainty in it. Three of these prior forms also depend upon the genotype data. The expectation of W appears universally to be an increasing function of V only for the power and reciprocal priors. This means that these priors are consistent with the hypothesis that rarer alleles have larger effects. Depending on the values of the hyperparameters, the expectation of W may be a decreasing function of V when the hybrid prior is employed. Such a prior would be inappropriate if it was believed that rare causal SNPs do indeed have larger effect sizes. In this case, we suggest a Savage‐Dickey density ratio approach [Verdinelli and Wasserman, 1995].

All the novel BFs described here are univariate. The solution we have employed in the univariate case relies on recognizing the BF integrands as standard probability densities. We did consider extending these BFs to the multivariate case. While specifying the prior in the multivariate case is straightforward, the resulting integrals appear to be intractable and would probably need to be solved numerically, so easily calculated closed‐form expressions for the BFs may not be available.
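The idea of placing a distribution on W can be illustrated numerically. Because the null model does not involve W, a BF with a prior p(W) on the variance is the WBF averaged over that prior. The sketch below does this with a simple trapezoid rule instead of the closed forms derived in the paper. The two prior shapes (flat, and proportional to 1/W) are hypothetical stand‐ins chosen for illustration and are not claimed to match the power, reciprocal, exponential, or hybrid priors; the effect estimate and V are approximate values backed out of the first row of Table 5.

```python
from math import exp, log, sqrt


def wbf(beta_hat, v, w):
    """Wakefield approximate BF (association vs. null) for fixed W."""
    shrink = w / (v + w)
    return sqrt(1 - shrink) * exp(0.5 * (beta_hat ** 2 / v) * shrink)


def averaged_bf(beta_hat, v, prior_density, a, b, n=2000):
    """Average the WBF over a prior on W supported on [a, b].

    prior_density may be unnormalized: the same trapezoid weights are
    used for the numerator and for the normalizing constant."""
    step = (b - a) / n
    num = den = 0.0
    for i in range(n + 1):
        w = a + i * step
        weight = 0.5 if i in (0, n) else 1.0
        p = prior_density(w)
        num += weight * p * wbf(beta_hat, v, w)
        den += weight * p
    return num / den


beta_hat, v = log(1.514), 0.0126   # approximate values for one SNP
a, b = 0.003, 0.1                  # support used for W in the tables

bf_flat = averaged_bf(beta_hat, v, lambda w: 1.0, a, b)
bf_low = averaged_bf(beta_hat, v, lambda w: 1.0 / w, a, b)
print(f"flat prior on W:          BF = {bf_flat:.1f}")
print(f"prior favoring small W:   BF = {bf_low:.1f}")
print(f"fixed W at the endpoints: {wbf(beta_hat, v, a):.1f}, {wbf(beta_hat, v, b):.1f}")
```

The averaged BF always lies between the smallest and largest fixed‐W values over the support, which is why a distribution on W is less sensitive than committing to a single value.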

Using Novel BFs in Practice

All novel BF approximations can be calculated in R (we give code in supplementary data), although the EPBF is computationally intensive and cannot produce results for SNPs which have very small MLEs of the logOR. The computation for all of the other forms is simple and efficient and we therefore recommend using the PPBF or HPBF instead of the EPBF where possible. In most cases, hyperparameters can be found which result in power or hybrid priors very similar to the desired exponential prior. In these situations, we recommend that the EPBF only be used if the investigator does not want to include any information from the data in the prior. We described how elicitation may be employed with an expert to determine the most appropriate values for the hyperparameters, although such experts may find this task difficult. The importance of feedback in the elicitation process is worth emphasizing. Once a distribution for W has been determined based on the quantiles elicited from the expert, it is important to relay back to them what this means about other quantiles not elicited to check that these are acceptable. A web‐based tool, MATCH, which may help with this purpose is now available [Morris et al., 2014].
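The feedback step can be sketched as follows: given a prior on W, compute the OR quantiles it implies and show them to the expert. With logOR ~ N(0, W) and a prior p(W), the distribution function of the logOR is a mixture of normal distribution functions, and quantiles can be found by bisection. The flat prior on [0.003, 0.1] and the function names are assumptions for illustration only; the priors fitted in the paper, or a tool such as MATCH, would be used in practice.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def logor_cdf(x, prior_density, a, b, n=400):
    """P(logOR <= x) when logOR ~ N(0, W) and W has (possibly
    unnormalized) density prior_density on [a, b]; trapezoid rule."""
    step = (b - a) / n
    num = den = 0.0
    for i in range(n + 1):
        w = a + i * step
        weight = 0.5 if i in (0, n) else 1.0
        p = prior_density(w)
        num += weight * p * STD_NORMAL.cdf(x / sqrt(w))
        den += weight * p
    return num / den


def implied_or_quantile(prob, prior_density, a, b):
    """OR value q with P(OR <= q) = prob, found by bisection on logOR."""
    lo, hi = -5.0, 5.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if logor_cdf(mid, prior_density, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return exp((lo + hi) / 2)


def flat(w):
    return 1.0  # hypothetical prior shape for W


for prob in (0.75, 0.90, 0.975):
    q = implied_or_quantile(prob, flat, 0.003, 0.1)
    print(f"implied {prob:.1%} quantile of the OR: {q:.3f}")
```

Quantiles that were not part of the original elicitation (here 90%) are the useful ones to relay back, since they reveal whether the fitted prior says something the expert would not endorse.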

BF Analysis in Future Genetics

Filtering using BFs has already been used in fine‐mapping [Maller et al., 2012] and our more flexible approach is likely to appeal to investigators who struggle to specify a suitable variance for a fixed normal prior. The methods we use assume that all SNPs have equal prior odds of being causal, but BFs can also be used to update the odds, and filtering can then be carried out using the posterior odds of causality. The effectiveness of BF filtering may be further improved by appropriate incorporation of functional information through SNP‐specific prior odds. Such functional data can be found in the ENCODE database [Encode Project Consortium, 2011], the RegulomeDB database [Boyle et al., 2012] and the F‐SNP database [Lee and Shatkay, 2009]. All functional SNP‐level data sources are currently limited because information is not complete for all SNPs across the genome. The FS score found in the F‐SNP database has the advantage that it integrates a large amount of data from multiple publicly available data sources. It formally combines scores from a number of bioinformatics tools, weighting them by the “reliability” of these tools, to give a score between 0 and 1. BFs are likely to become increasingly popular as investigators seek to make use of the vast prior functional information available. The new BFs presented in this paper allow researchers more flexibility in the specification of the prior OR, allowing the distribution used to better match expert prior beliefs.

Disclaimer: Supplementary materials have been peer‐reviewed but not copyedited.
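Updating SNP‐specific prior odds with a BF is a one‐line calculation: posterior odds = BF × prior odds. The sketch below applies it to two PPBF values from Table 5. The prior probabilities of causality are invented purely for illustration (they are not derived from ENCODE, RegulomeDB, or F‐SNP), and the function names are hypothetical.

```python
def posterior_odds(bf, prior_prob):
    """Posterior odds of causality from a BF and a prior probability."""
    prior_odds = prior_prob / (1 - prior_prob)
    return bf * prior_odds


def odds_to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)


# PPBF values from Table 5; prior probabilities are made-up examples,
# e.g. a SNP with functional annotation given a higher prior.
candidates = {
    "822": {"bf": 45.1, "prior_prob": 0.001},
    "346": {"bf": 23.7, "prior_prob": 0.005},
}

post = {
    name: posterior_odds(c["bf"], c["prior_prob"])
    for name, c in candidates.items()
}
for name in sorted(post, key=post.get, reverse=True):
    print(f"SNP {name}: posterior odds = {post[name]:.3f}, "
          f"posterior probability = {odds_to_prob(post[name]):.3f}")
```

In this made‐up example the SNP with the smaller BF ends up with the larger posterior odds, which shows how functional information could change a filtering order that is based on BFs alone.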
References

1. S L Slager; D J Schaid. Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Hum Hered. 2001.

2. Jonathan Marchini; Bryan Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010-07.

3. Jon Wakefield. Reporting and interpretation in genome-wide association studies. Int J Epidemiol. 2008-02-11.

4. Miriam S Udler; Kerstin B Meyer; Karen A Pooley; Eric Karlins; Jeffery P Struewing; Jinghui Zhang; David R Doody; Stewart MacArthur; Jonathan Tyrer; Paul D Pharoah; Robert Luben; Leslie Bernstein; Laurence N Kolonel; Brian E Henderson; Loic Le Marchand; Giske Ursin; Michael F Press; Paul Brennan; Suleeporn Sangrajrang; Valerie Gaborieau; Fabrice Odefrey; Chen-Yang Shen; Pei-Ei Wu; Hui-Chun Wang; Daehee Kang; Keun-Young Yoo; Dong-Young Noh; Sei-Hyun Ahn; Bruce A J Ponder; Christopher A Haiman; Kathleen E Malone; Alison M Dunning; Elaine A Ostrander; Douglas F Easton. FGFR2 variants and breast cancer risk: fine-scale mapping using African American studies and analysis of chromatin conformation. Hum Mol Genet. 2009-02-17.

5. Jon Wakefield. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009-01.

6. Matthew Stephens; David J Balding. Bayesian statistical methods for genetic association studies. Nat Rev Genet. 2009-10.

7. Phil Hyoun Lee; Hagit Shatkay. An integrative scoring system for ranking SNPs by their potential deleterious effects. Bioinformatics. 2009-02-19.

8. ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011-04-19.

9. Amy Victoria Spencer; Angela Cox; Kevin Walters. Comparing the efficacy of SNP filtering methods for identifying a single causal SNP in a known association region. Ann Hum Genet. 2013-11-11.

10. Chris C A Spencer; Zhan Su; Peter Donnelly; Jonathan Marchini. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009-05-15.
