FINEMAP: efficient variable selection using summary data from genome-wide association studies.

Christian Benner¹, Chris C A Spencer², Aki S Havulinna³, Veikko Salomaa³, Samuli Ripatti⁴, Matti Pirinen⁵.

Abstract

MOTIVATION: The goal of fine-mapping in genomic regions associated with complex diseases and traits is to identify causal variants that point to molecular mechanisms behind the associations. Recent fine-mapping methods using summary data from genome-wide association studies rely on exhaustive search through all possible causal configurations, which is computationally expensive.
RESULTS: We introduce FINEMAP, a software package to efficiently explore a set of the most important causal configurations of the region via a shotgun stochastic search algorithm. We show that FINEMAP produces accurate results in a fraction of processing time of existing approaches and is therefore a promising tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing projects.
AVAILABILITY AND IMPLEMENTATION: FINEMAP v1.0 is freely available for Mac OS X and Linux at http://www.christianbenner.com
CONTACT: christian.benner@helsinki.fi or matti.pirinen@helsinki.fi.

Year: 2016 PMID: 26773131 PMCID: PMC4866522 DOI: 10.1093/bioinformatics/btw018

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Genome-Wide Association Studies (GWAS) have identified thousands of genomic regions associated with complex diseases and traits. Any associated region may contain thousands of genetic variants with complex correlation structure. Therefore, one of the next challenges is fine-mapping that aims to pinpoint individual variants and genes that have a direct effect on the trait. This step is crucial for fully exploiting the potential of GWAS: to unveil molecular biology of complex traits and, eventually, provide targets for therapeutic interventions. For a recent review on fine-mapping, see Spain and Barrett (2015). A standard approach for refining association signals is a step-wise conditional analysis, an iterative procedure that conditions on the Single-Nucleotide Polymorphisms (SNPs) with the lowest P-value of association until no additional SNP reaches the pre-assigned P-value threshold. While conditional analysis is informative about the number of complementary sources of association signals within the region, it fails to provide probabilistic measures of causality for individual variants. To overcome this problem, many recent fine-mapping methods have adopted a Bayesian framework. Approaches for Bayesian analysis of multi-SNP GWAS data include exhaustive search as implemented in software BIMBAM (Servin and Stephens, 2007), MCMC algorithms (Guan and Stephens, 2011), variational approximations (Carbonetto and Stephens, 2012) and stochastic search as implemented in software GUESS (Bottolo and Richardson, 2010; Bottolo et al.) and GUESSFM (Wallace et al.). Bayesian fine-mapping has also been conducted under a simplified assumption of a single causal variant in the region (WTCCC et al.). Common to these approaches is that they require original genotype-phenotype data as input, which is becoming impractical or even impossible as the size of current GWAS meta-analyses rises to several hundreds of thousands of samples (Wood et al.).
For this reason, fine-mapping methods have recently been extended to use only GWAS summary data together with a SNP correlation estimate from a reference panel. To our knowledge, the existing fine-mapping implementations using GWAS summary data are PAINTOR (Kichaev et al.; Kichaev and Pasaniuc, 2015), CAVIAR (Hormozdiari et al.) and CAVIARBF (Chen et al.). PAINTOR uses an EM algorithm to jointly fine-map several associated regions by utilizing functional annotation information of individual variants. In the special case of a single region without annotation information, PAINTOR tackles the standard fine-mapping problem. CAVIAR differs from PAINTOR by modeling the uncertainty in the observed association statistics. This might be a reason why CAVIARBF, a more efficient implementation of CAVIAR, has been reported to be more accurate than PAINTOR in prioritizing variants when no annotation information is available (Chen et al.). Although PAINTOR, CAVIAR and CAVIARBF are very useful methods for performing fine-mapping on GWAS summary data, we think that their implementation via an exhaustive search through all possible causal configurations is likely to hinder their use in several settings. For example, it becomes computationally slow or even impossible to run these methods when allowing more than three causal variants on dense genotype data with thousands of variants per region. Thus, these methods are unlikely to make full use of the unprecedented statistical power to discern complex association patterns provided by ever increasing GWAS sample sizes and genome sequencing technologies. We introduce FINEMAP, a novel software package to improve the performance of GWAS summary data based fine-mapping. The statistical model of FINEMAP is similar to that of CAVIAR and CAVIARBF; the important difference is the computational algorithm.
FINEMAP uses a Shotgun Stochastic Search (SSS) algorithm (Hans et al.) that explores the vast space of causal configurations by concentrating efforts on the configurations with non-negligible probability. We compare FINEMAP with the exhaustive search algorithm implemented in CAVIARBF. The comparisons to two other GWAS summary data based fine-mapping methods, CAVIAR and PAINTOR, are not shown in this paper since CAVIARBF is more efficient than, but as accurate as, CAVIAR and more accurate than PAINTOR without annotation information (Chen et al.). In this paper we show that FINEMAP is thousands of times faster than CAVIARBF while still providing similar accuracy in the examples where CAVIARBF can be applied. FINEMAP is more accurate than CAVIARBF when the number of causal variants in CAVIARBF needs to be restricted for computational reasons. Our examples are based on genotype and lipid level data of the Finnish population (Borodulin et al.) as well as summary statistics from GWAS on Parkinson’s disease (UKPDC and WTCCC2, 2011).

2 Model

We are interested in fine-mapping a genomic region using GWAS summary data instead of original genotype-phenotype data as input. The building blocks of our Bayesian approach are the likelihood function (Section 2.1), priors (Section 2.2), efficient likelihood evaluation (Section 2.3) and an efficient search algorithm (Section 3). At each step we describe how our choices differ from the existing methods PAINTOR and CAVIARBF.

2.1 Likelihood function

For a quantitative trait, we assume the following linear model

$$y = X\lambda + \varepsilon,$$

where $y$ is a mean-centered vector of values of a quantitative trait for n individuals, $X$ a column-standardized SNP genotype matrix of dimension n × m and $\varepsilon \sim N(0, \sigma^2 I_n)$. The Maximum Likelihood Estimate (MLE) of the causal SNP effects $\lambda$,

$$\hat{\lambda} = (X^T X)^{-1} X^T y = n^{-1/2} \sigma R^{-1} \hat{z},$$

depends on $X$ and $y$ only through the SNP correlation matrix $R = n^{-1} X^T X$ and single-SNP z-scores $\hat{z} = n^{-1/2} \sigma^{-1} X^T y$. Thus, it is possible to approximate the likelihood function for $\lambda$,

$$p(y \mid \lambda, X) \propto N(\hat{\lambda} \mid \lambda, \sigma^2 (nR)^{-1}),$$

by using a SNP correlation estimate from a reference panel and single-SNP z-scores from a standard GWAS software, as previously done in GCTA (Yang et al.), PAINTOR and CAVIARBF. Note that with z-scores for quantitative traits we can assume that $\sigma^2 = 1$ without any loss of generality. For binary traits, a similar approximation applies with z-scores originating from logistic regression and $\sigma^2 = \{\phi(1 - \phi)\}^{-1}$, where $\phi$ is the proportion of cases among the n individuals (Pirinen et al.). When m is large but $\lambda$ has only very few non-zero elements, the MLE alone is not ideal since it does not account for the sparsity assumption (Fig. 1). Thus, we take a Bayesian approach with a prior distribution that induces sparsity among causal effects.
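The claim that the MLE depends on the data only through the SNP correlation matrix and the single-SNP z-scores can be checked numerically. The following is a minimal sketch on simulated genotypes; the sample size, allele frequency, effect sizes and variable names (`R`, `z`, `lam_summary`) are assumptions made for illustration and are not taken from the FINEMAP software.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 4

# Hypothetical genotypes (0/1/2 allele counts), column-standardized so that
# every column has mean 0 and sum of squares n.
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Quantitative trait from the linear model, then mean-centered.
sigma = 1.0
lam_true = np.array([0.2, 0.0, 0.0, -0.1])
y = X @ lam_true + rng.normal(0.0, sigma, size=n)
y = y - y.mean()

# Summary data: SNP correlation matrix and single-SNP z-scores.
R = X.T @ X / n
z = X.T @ y / (np.sqrt(n) * sigma)

# MLE from individual-level data versus MLE from summary data only.
lam_full = np.linalg.solve(X.T @ X, X.T @ y)
lam_summary = sigma / np.sqrt(n) * np.linalg.solve(R, z)

print(np.max(np.abs(lam_full - lam_summary)))
```

The two estimates agree up to floating-point error when the correlation matrix is computed from the same genotypes as the z-scores; with a reference-panel correlation estimate the identity holds only approximately.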

Fig. 1.

The binary indicator vector $\gamma$ determines which SNPs have non-zero causal effects. The corresponding causal (linear) model for a quantitative trait assumes only few SNPs with a causal effect. The Maximum Likelihood Estimate (MLE) of the causal SNP effects can be computed by using only the SNP correlation matrix and single-SNP z-scores. However, the MLE is not ideal because it does not account for the sparsity assumption.

2.2 Priors for $\gamma$ and $\lambda$

Let a binary indicator vector $\gamma$ determine which SNPs have non-zero causal effects ($\gamma_\ell = 1$ if the $\ell$th SNP is causal and 0, otherwise; see top panel in Fig. 1). For the causal effects, we use the prior

$$p(\lambda \mid \gamma) = N(\lambda \mid 0, s_\lambda^2 \sigma^2 \Delta_\gamma),$$

where $s_\lambda^2$ is the user given prior variance for the causal effects in units of $\sigma^2$, with $\sigma^2 = 1$ for quantitative traits and $\sigma^2 = \{\phi(1 - \phi)\}^{-1}$ for binary traits, and $\Delta_\gamma$ a diagonal matrix with $\gamma$ on the diagonal. In our examples for quantitative traits, we have set $s_\lambda^2 = 0.05^2$. This means that with 95% probability a causal SNP explains less than 1% of the trait variation. When available z-scores originate from logistic regression, a value of $s_\lambda^2 = 0.05^2$ means that with 95% probability the effect of a causal SNP on the odds-ratio scale is less than 1.15 for common variants (MAF = 0.5) and less than 2.0 for low-frequency variants (MAF = 0.01), where MAF is the minor allele frequency. Robustness to the values of $s_\lambda^2$ has been studied previously by Chen et al. To define the prior for each causal configuration, we use a general discrete distribution for the number of causal SNPs

$$p_k = \Pr(\text{number of causal SNPs is } k), \quad k = 1, \ldots, K,$$

where $K \leq m$ is the maximum number of SNPs in the causal configuration. Note that we assume that the region to be fine-mapped includes at least one causal SNP, i.e. $p_0 = 0$. For a fixed value of k, we assume the same probability for each configuration with k causal SNPs. Thus, a priori,

$$p(\gamma) = p_k \Big/ \binom{m}{k} \quad \text{when } \sum_{\ell=1}^{m} \gamma_\ell = k.$$

PAINTOR does not use an explicit prior on k but restricts $k$ in practice. The default prior used by CAVIARBF builds on that of CAVIAR and assumes that each SNP is causal with probability $1/m$ and that $k \leq 5$. This is a special case of our prior when we set $p_k = \binom{m}{k} (1/m)^k (1 - 1/m)^{m-k}$ and renormalize for K = 5 except that CAVIARBF assigns non-zero prior also for the null configuration k = 0.
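The two-level prior (a distribution on the number of causal SNPs, then a uniform choice among configurations of that size) can be sketched in a few lines. The sketch assumes a binomial-type prior in which each SNP is causal with probability 1/m, renormalized over k = 1, ..., K; that choice and the function names are illustrative assumptions, not the FINEMAP interface.

```python
from math import comb

def binomial_prior_on_k(m, K):
    """p_k proportional to C(m,k) (1/m)^k (1-1/m)^(m-k), renormalized over k = 1..K."""
    weights = [comb(m, k) * (1 / m) ** k * (1 - 1 / m) ** (m - k)
               for k in range(1, K + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def config_prior(k, m, p_k):
    """Prior of one specific configuration with k causal SNPs (uniform within size k)."""
    return p_k[k - 1] / comb(m, k)

m, K = 150, 5
p_k = binomial_prior_on_k(m, K)

# Summing the per-configuration prior over all configurations recovers 1.
total_mass = sum(comb(m, k) * config_prior(k, m, p_k) for k in range(1, K + 1))

# Implied prior probability that any given SNP is causal: sum_k (k/m) p_k.
prior_inclusion = sum(k / m * p_k[k - 1] for k in range(1, K + 1))

print(p_k, total_mass, prior_inclusion)
```

Because configurations of equal size share one prior value, only K numbers need to be stored regardless of how many configurations are later evaluated.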

2.3 Marginal likelihood for $\gamma$

We now show how the marginal likelihood for the causal configuration can be computed efficiently.

2.3.1 Integrating out causal effects

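The likelihood function of the causal SNP effects $\lambda$ is (proportional to) a Normal density $N(\hat{\lambda} \mid \lambda, \sigma^2 (nR)^{-1})$. This enables an analytic solution for the marginal likelihood of $\gamma$ eliminating the causal effects

$$p(y \mid \gamma, X) = \int p(y \mid \lambda, X)\, p(\lambda \mid \gamma)\, d\lambda = N(\hat{z} \mid 0, R + R \Sigma_\gamma R),$$

where we defined $\Sigma_\gamma = n s_\lambda^2 \Delta_\gamma$. Importantly, an evaluation of the marginal likelihood requires only single-SNP z-scores and SNP correlations from a reference panel and does not depend on $\sigma^2$. This elimination of $\lambda$ is similar to the one used by CAVIAR and CAVIARBF and differs from PAINTOR that fixes those values based on the observed z-scores. Next, we describe two implementations to evaluate $N(\hat{z} \mid 0, R + R \Sigma_\gamma R)$ with high computational efficiency.

2.3.2 Reducing the complexity from $O(m^3)$ to $O(k^3)$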

Option 1

Let $C$ and $N$ be respectively the set of causal and non-causal SNPs, and let $\Sigma_\gamma$ denote the m × m diagonal matrix with $n s_\lambda^2$ at the causal SNPs and zero elsewhere. Consider the quadratic form inside the exponential function in $N(\hat{z} \mid 0, R + R \Sigma_\gamma R)$,

$$\hat{z}^T (R + R \Sigma_\gamma R)^{-1} \hat{z} = \hat{z}^T (I_m + \Sigma_\gamma R)^{-1} R^{-1} \hat{z},$$

where $a = R^{-1} \hat{z}$ can be precomputed. We solve the linear system $(I_m + \Sigma_\gamma R)\, b = a$ for $b$ by observing that the m – k elements in $b$ corresponding to non-causal SNPs ($\gamma_\ell = 0$) are $b_N = a_N$ and the remaining elements result from solving a system of k equations

$$(I_k + n s_\lambda^2 R_{CC})\, b_C = a_C - n s_\lambda^2 R_{CN}\, a_N,$$

where $R_{CC}$ is the k × k correlation matrix of the causal SNPs and $R_{CN}$ the submatrix of $R$ corresponding to the correlations between the causal and non-causal SNPs. In addition, we observe that $\det(R + R \Sigma_\gamma R)$ is simply $\det(R) \det(I_k + n s_\lambda^2 R_{CC})$ after expanding $\det(I_m + \Sigma_\gamma R)$ with respect to the rows corresponding to non-causal SNPs. These computations require one Cholesky decomposition with complexity $O(k^3)$ and thus provide a considerable saving compared to the naive way of decomposing the whole m × m matrix with complexity $O(m^3)$. This derivation differs from the one used by CAVIARBF that is similar to our option 2 below. It also differs from PAINTOR that fixes the causal effects based on the observed z-scores and performs once a Cholesky decomposition of the whole m × m SNP correlation matrix that is used repeatedly in each likelihood evaluation. Note that option 1 cannot be used in case of collinearity among the SNPs, because the correlation matrix $R$ is not invertible if two SNPs are perfectly correlated and its inversion is unstable with nearly perfectly correlated SNPs.
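A small numerical check of option 1 may help: the quadratic form and determinant of the full m × m covariance of the z-scores can be recovered from a k × k system once the product of the inverse correlation matrix with the z-scores is precomputed. The sketch below assumes the covariance has the form R + R Σ R, with Σ diagonal and equal to n·s² at the causal SNPs; the toy correlation matrix, the helper names `a` and `b` and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s2 = 6, 1000, 0.05 ** 2

# Toy positive-definite correlation matrix and z-scores (hypothetical data).
R = np.corrcoef(rng.normal(size=(50, m)), rowvar=False)
z = rng.normal(size=m)
causal = np.array([1, 4])
noncausal = np.setdiff1d(np.arange(m), causal)

# Naive route: build and factorize the full m x m covariance of z.
Sigma = np.zeros((m, m))
Sigma[causal, causal] = n * s2
V = R + R @ Sigma @ R
quad_full = z @ np.linalg.solve(V, z)
logdet_full = np.linalg.slogdet(V)[1]

# Reduced route: a = R^{-1} z is computed once per region ...
a = np.linalg.solve(R, z)
logdet_R = np.linalg.slogdet(R)[1]

# ... and each configuration only needs a k x k Cholesky factorization.
M = np.eye(len(causal)) + n * s2 * R[np.ix_(causal, causal)]
rhs = a[causal] - n * s2 * R[np.ix_(causal, noncausal)] @ a[noncausal]
L = np.linalg.cholesky(M)
b = a.copy()                      # non-causal entries of b equal those of a
b[causal] = np.linalg.solve(L.T, np.linalg.solve(L, rhs))

quad_reduced = z @ b
logdet_reduced = logdet_R + 2.0 * np.sum(np.log(np.diag(L)))

print(quad_full, quad_reduced, logdet_full, logdet_reduced)
```

The precomputation requires an invertible correlation matrix, which is why this route breaks down when SNPs are perfectly or nearly perfectly correlated.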

Option 2

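We partition the observed z-scores into components $\hat{z}_C$ and $\hat{z}_N$ and permute rows and columns of the SNP correlation matrix $R$ and covariance matrix $\Sigma_\gamma$ such that

$$R = \begin{pmatrix} R_{CC} & R_{CN} \\ R_{NC} & R_{NN} \end{pmatrix} \quad \text{and} \quad \Sigma_\gamma = \begin{pmatrix} \Sigma_{CC} & 0 \\ 0 & 0 \end{pmatrix}.$$

This partitioning entails a block structure in the covariance matrix of $\hat{z}$

$$R + R \Sigma_\gamma R = \begin{pmatrix} R_{CC} + R_{CC} \Sigma_{CC} R_{CC} & R_{CN} + R_{CC} \Sigma_{CC} R_{CN} \\ R_{NC} + R_{NC} \Sigma_{CC} R_{CC} & R_{NN} + R_{NC} \Sigma_{CC} R_{CN} \end{pmatrix}.$$

Using properties of the multivariate Normal distribution, the conditional expectation and covariance matrix of $\hat{z}_N$ given $\hat{z}_C$,

$$E(\hat{z}_N \mid \hat{z}_C) = R_{NC} R_{CC}^{-1} \hat{z}_C \quad \text{and} \quad \mathrm{Cov}(\hat{z}_N \mid \hat{z}_C) = R_{NN} - R_{NC} R_{CC}^{-1} R_{CN},$$

are readily available and do not depend on $\Sigma_{CC}$. We rewrite the marginal likelihood in terms of the marginal distribution of $\hat{z}_C$ and conditional distribution of $\hat{z}_N$ given $\hat{z}_C$ to obtain the following expression

$$p(y \mid \gamma, X) = N(\hat{z}_C \mid 0, R_{CC} + R_{CC} \Sigma_{CC} R_{CC}) \times N(\hat{z}_N \mid R_{NC} R_{CC}^{-1} \hat{z}_C, R_{NN} - R_{NC} R_{CC}^{-1} R_{CN}).$$

This means that we can compute the Bayes factor for assessing the evidence against the null model by using only the causal SNPs,

$$\mathrm{BF}(\gamma : \mathrm{NULL}) = \frac{N(\hat{z}_C \mid 0, R_{CC} + R_{CC} \Sigma_{CC} R_{CC})}{N(\hat{z}_C \mid 0, R_{CC})},$$

and that the marginal likelihood is proportional to this expression. CAVIARBF utilizes this result, although without a mathematical derivation explicitly shown in Chen et al. Note that the correlation submatrix $R_{CC}$ is not invertible if there is almost perfect collinearity among the SNPs in C. To handle this case, we have implemented an option in FINEMAP to set the posterior probability of a causal configuration to zero if it contains at least one SNP pair with absolute correlation greater than or equal to some specified threshold.

2.4 Posterior for $\gamma$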
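The key consequence of option 2 is that the Bayes factor against the null model involves only the z-scores and correlations of the causal SNPs. The sketch below compares that k-dimensional computation with the ratio of full m-dimensional densities, assuming the covariance of the z-scores is R + R Σ R with Σ equal to n·s² on the causal diagonal entries; function names and the toy data are assumptions for illustration.

```python
import numpy as np

def log_mvn0(x, cov):
    """Log density of a zero-mean multivariate Normal evaluated at x."""
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)

def log_bf(z, R, causal, n, s2):
    """Log Bayes factor of a configuration versus the null, causal SNPs only."""
    zc = z[causal]
    Rcc = R[np.ix_(causal, causal)]
    return log_mvn0(zc, Rcc + n * s2 * Rcc @ Rcc) - log_mvn0(zc, Rcc)

rng = np.random.default_rng(2)
m, n, s2 = 5, 2000, 0.05 ** 2
R = np.corrcoef(rng.normal(size=(40, m)), rowvar=False)  # toy correlation matrix
z = rng.normal(size=m)                                    # toy z-scores
causal = np.array([0, 3])

# Reference value from the full m-dimensional densities.
Sigma = np.zeros((m, m))
Sigma[causal, causal] = n * s2
full = log_mvn0(z, R + R @ Sigma @ R) - log_mvn0(z, R)

reduced = log_bf(z, R, causal, n, s2)
print(full, reduced)
```

The agreement reflects the cancellation of the conditional density of the non-causal z-scores, which is the same under the causal configuration and under the null.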

According to the Bayesian paradigm, we want to base our inference on the posterior of causal configurations $p(\gamma \mid y, X)$. The unnormalized posterior $p^*(\gamma \mid y, X)$ can be evaluated by combining the prior with the marginal likelihood (option 1) as

$$p^*(\gamma \mid y, X) = p(\gamma) \times p(y \mid \gamma, X) = p_k \Big/ \binom{m}{k} \times N(\hat{z} \mid 0, R + R \Sigma_\gamma R),$$

where k is the number of causal SNPs in configuration $\gamma$. In addition, we can compute the unnormalized posterior by using the Bayes factor (option 2)

$$p^*(\gamma \mid y, X) = p_k \Big/ \binom{m}{k} \times \mathrm{BF}(\gamma : \mathrm{NULL}).$$

We observed that option 2 was faster than option 1 and therefore option 2 is used by default in FINEMAP. Ideally, $p^*(\gamma \mid y, X)$ would be normalized over all causal configurations. Unfortunately, this is computationally intractable already for modest values of K > 5. However, as we show in the results section, typically a large majority of the causal configurations have negligible posterior probability and hence a good approximation for the posterior can be achieved by concentrating on only those with non-negligible probability. We explore the space of causal configurations with a Shotgun Stochastic Search (SSS) algorithm (Hans et al.) that rapidly evaluates many configurations and is designed to discover especially those with highest posterior probability.

3 Shotgun stochastic search

We use SSS to efficiently evaluate many causal configurations and discover especially those with highest posterior probability. SSS conducts a pre-defined number of iterations within the space of causal configurations. In each iteration (Fig. 2), the neighborhood of the current causal configuration is defined by configurations that result from deleting, changing or adding a causal SNP from the current configuration. The next iteration starts by sampling a new causal configuration from the neighborhood based on the unnormalized posterior probabilities $p^*(\gamma \mid y, X)$ normalized within the neighborhood. All evaluated causal configurations and their unnormalized posterior probabilities are saved in a list $\Gamma^*$ for downstream analyses. The aim of the algorithm is that $\Gamma^*$ contains all relevant causal configurations, that is, those with non-negligible posterior probabilities.
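The iteration described above can be sketched compactly. This is a simplified illustration rather than the FINEMAP implementation: configurations are frozensets of SNP indices, the null configuration is never visited, and `toy_log_score` is a hypothetical stand-in for the log of the unnormalized posterior.

```python
import numpy as np

def neighborhood(config, m, K):
    """Configurations obtained by deleting, changing or adding one causal SNP."""
    cur = set(config)
    outside = [j for j in range(m) if j not in cur]
    nbrs = []
    if len(cur) > 1:                                   # delete (never reach the null)
        nbrs += [frozenset(cur - {i}) for i in cur]
    nbrs += [frozenset((cur - {i}) | {j}) for i in cur for j in outside]  # change
    if len(cur) < K:                                   # add
        nbrs += [frozenset(cur | {j}) for j in outside]
    return nbrs

def sss(log_score, m, K, n_iter, rng, start=(0,)):
    """Shotgun stochastic search; returns every evaluated configuration with its score."""
    current = frozenset(start)
    visited = {current: log_score(current)}
    for _ in range(n_iter):
        nbrs = neighborhood(current, m, K)
        logs = np.array([log_score(c) for c in nbrs])
        visited.update(zip(nbrs, logs))                # save the whole neighborhood
        w = np.exp(logs - logs.max())                  # normalize within neighborhood
        current = nbrs[rng.choice(len(nbrs), p=w / w.sum())]
    return visited

# Hypothetical score: per-SNP evidence minus a penalty of 2 per included SNP.
evidence = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 6.0, 0.0, 0.0])

def toy_log_score(config):
    return sum(evidence[j] for j in config) - 2.0 * len(config)

visited = sss(toy_log_score, m=8, K=3, n_iter=50, rng=np.random.default_rng(3))
best = max(visited, key=visited.get)
print(sorted(best), len(visited))
```

Saving the whole neighborhood, not only the sampled configuration, is what distinguishes this search from a standard single-proposal sampler.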

Fig. 2.

Shotgun stochastic search rapidly identifies configurations of causal SNPs with high posterior probability. In each iteration, the neighborhood of the current causal configuration is defined by configurations that result from deleting, changing or adding a causal SNP from the current configuration. The next iteration starts by sampling a new causal configuration from the neighborhood based on the scores normalized within the neighborhood. The unnormalized posterior probabilities remain fixed throughout the algorithm and can thus be memoized to avoid recomputation when already-evaluated configurations appear in another neighborhood.

The posterior probability that SNPs in configuration $\gamma$ are causal is computed by normalizing over $\Gamma^*$, the list of all evaluated configurations,

$$p(\gamma \mid y, X) = \frac{p^*(\gamma \mid y, X)}{\sum_{\gamma' \in \Gamma^*} p^*(\gamma' \mid y, X)}.$$

We compute the marginal posterior probability that the $\ell$th SNP is causal, also called single-SNP inclusion probability, by averaging over all evaluated configurations

$$p(\gamma_\ell = 1 \mid y, X) = \sum_{\gamma \in \Gamma^*} 1(\gamma_\ell = 1)\, p(\gamma \mid y, X).$$

In addition, we compute a single-SNP Bayes factor for assessing the evidence that the $\ell$th SNP is causal as

$$\mathrm{BF}(\gamma_\ell = 1 : \gamma_\ell = 0) = \frac{p(\gamma_\ell = 1 \mid y, X)}{p(\gamma_\ell = 0 \mid y, X)} \Big/ \frac{p(\gamma_\ell = 1)}{p(\gamma_\ell = 0)},$$

where the prior probability of the $\ell$th SNP being causal is

$$p(\gamma_\ell = 1) = \sum_{k=1}^{K} \frac{k}{m}\, p_k.$$

PAINTOR, CAVIAR and CAVIARBF do not perform a stochastic search but enumerate all causal configurations with at most $K$ causal SNPs. When m is large but there are only few true causal SNPs, the exhaustive search is computationally expensive and inefficient since most configurations make a negligible contribution to the single-SNP inclusion probabilities.
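The post-processing of the saved configurations can be illustrated as follows. The sketch assumes the search output is a dictionary mapping each evaluated configuration (a frozenset of SNP indices) to its log unnormalized posterior; the function name `summarize` and the toy numbers are hypothetical.

```python
import numpy as np

def summarize(visited, m, prior_inclusion):
    """Normalize over evaluated configurations; return inclusion probabilities and BFs."""
    configs = list(visited)
    logs = np.array([visited[c] for c in configs])
    post = np.exp(logs - logs.max())
    post /= post.sum()                          # posterior over evaluated configurations

    pip = np.zeros(m)
    for config, p in zip(configs, post):
        for j in config:
            pip[j] += p                         # sum over configurations containing SNP j

    prior_odds = prior_inclusion / (1.0 - prior_inclusion)
    with np.errstate(divide="ignore"):          # an inclusion probability of 1 gives inf
        bf = (pip / (1.0 - pip)) / prior_odds
    return dict(zip(configs, post)), pip, bf

# Toy list of evaluated configurations with unnormalized posteriors 6, 3 and 1.
visited = {
    frozenset({0}): np.log(6.0),
    frozenset({1}): np.log(3.0),
    frozenset({0, 1}): np.log(1.0),
}
post, pip, bf = summarize(visited, m=3, prior_inclusion=0.5)
print(pip, bf)
```

SNPs that never appear in an evaluated configuration receive an inclusion probability of exactly zero, which is the approximation error the search accepts in exchange for speed.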

3.1 Computational implementation

For a current configuration with $k$ causal SNPs ($1 < k < K$), the number of causal configurations to be evaluated in each iteration is: $k$ for deleting, $k(m - k)$ for changing and $m - k$ for adding a causal SNP. Computing $p^*(\gamma \mid y, X)$ requires a Cholesky decomposition with complexity $O(k^3)$ that is fast when $k \ll m$. Importantly, each unnormalized posterior probability remains fixed throughout the algorithm. This means that we can use a hash table (std::unordered_map in C++) to avoid recomputing $p^*(\gamma \mid y, X)$ when already-evaluated configurations appear in another neighborhood. Inserting to and retrieving from the hash table requires constant time on average. Hash table lookups reduce the dominant computational cost of the algorithm: exploring the vast space of causal configurations. This renders SSS computationally efficient because it traverses the space of causal configurations by moving back and forth to configurations with high posterior probability and overlapping neighborhoods.
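The neighborhood size and the effect of memoization can be illustrated with a short sketch. A Python dictionary plays the role of the hash table here; the class name `MemoizedScore` and the placeholder score are assumptions for illustration, not part of FINEMAP.

```python
def neighborhood_size(m, k):
    """Configurations evaluated per iteration: delete + change + add."""
    return k + k * (m - k) + (m - k)

class MemoizedScore:
    """Wraps a score function with a hash table keyed by the configuration."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.n_evals = 0          # number of actual (non-cached) evaluations

    def __call__(self, config):
        key = frozenset(config)   # order-independent, hashable key
        if key not in self.cache:
            self.cache[key] = self.score_fn(key)
            self.n_evals += 1
        return self.cache[key]

# Placeholder score standing in for the unnormalized posterior.
score = MemoizedScore(lambda config: -float(len(config)))

score({3, 7})
score({7, 3})                     # same configuration: served from the cache
score({3, 7, 9})

print(neighborhood_size(1920, 3), score.n_evals)
```

With 1920 SNPs and three causal SNPs in the current configuration, each iteration touches several thousand configurations, many of which recur in later neighborhoods, so avoiding recomputation matters.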

4 Test data generation

We obtained real genotype data on 18 834 individuals from the National FINRISK study (Borodulin et al.). The genotype data comprise a 500 kilobase region centered on rs11591147 in the PCSK9 gene on chromosome 1 with 1920 polymorphic SNPs with pairwise absolute correlations less than 0.99. To assess the computational efficiency and fine-mapping accuracy, we considered the following scenarios:
Scenario A: Increasing number of SNPs considering causal configurations with up to K = 3 or K = 5 SNPs.
Scenario B: Fixed number of m = 150 SNPs considering causal configurations with increasing maximum number of SNPs $K = 1, \ldots, 5$.
We generated datasets where causal SNPs had highly correlated proxies since this is a setting where a non-exhaustive search could theoretically have problems. Five hundred datasets were generated under each combination of m and K in scenarios A and B using the following linear model:

$$y = \sum_{c \in C} x_c \beta_c + \varepsilon,$$

where C is the set of causal SNPs, $x_c$ the vector of genotypes at the cth causal SNP, $\beta_c$ and $f_c$ respectively the effect size and minor allele frequency of the cth causal SNP and $\varepsilon$ the residual error. The number of causal SNPs was five in both scenarios A and B. In each dataset, the causal SNPs were randomly chosen among those variants that had highly correlated proxies (absolute correlation greater than 0.5) among the other variants. The effect sizes of the causal SNPs were specified so that the statistical power at the chosen significance level was approximately 0.5. Single-SNP testing using a linear model was performed to compute z-scores. Each set of z-scores was then analyzed with CAVIARBF (default parameters) and FINEMAP (100 iterations saving the top 50 000 evaluated causal configurations). For both methods, the prior standard deviation of the causal effects was set to 0.05 and the prior distribution of each configuration with k causal SNPs was specified as

$$p(\gamma) \propto (1/m)^k (1 - 1/m)^{m-k}.$$

This required excluding the null configuration (k = 0) from the output of CAVIARBF.

5 Results

The main difference between FINEMAP and CAVIARBF is the search strategy to explore the space of causal configurations. We compare the computational efficiency and fine-mapping accuracy of FINEMAP with CAVIARBF to assess the impact of replacing exhaustive with stochastic search. We also illustrate FINEMAP on data from the 4q22/SNCA region that contains a complex association pattern with Parkinson’s disease (UKPDC and WTCCC2, 2011) as well as on data from the 15q21/LIPC region associated with HDL-C (Surakka et al.).

5.1 Computational efficiency

The top panel of Figure 3 shows that FINEMAP is thousands of times faster than CAVIARBF when considering causal configurations with up to three SNPs in Scenario A. The difference in processing time becomes even larger when the maximum number of possible causal SNPs increases (Scenario B, bottom panel of Figure 3). CAVIARBF slows down quickly due to the exhaustive search but FINEMAP’s processing time does not increase considerably with increasing K. Importantly, there is no need to restrict the number of causal SNPs in FINEMAP to small values as is necessary for CAVIARBF.
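The rapid slowdown of an exhaustive search follows directly from counting configurations. The short sketch below sums binomial coefficients for the region sizes mentioned in this paper; the function name is illustrative, and the count ignores the null configuration.

```python
from math import comb

def n_configurations(m, K):
    """Number of causal configurations with 1..K causal SNPs among m SNPs."""
    return sum(comb(m, k) for k in range(1, K + 1))

small = n_configurations(150, 3)      # Scenario B region, up to three causal SNPs
large = n_configurations(8612, 5)     # LIPC region, up to five causal SNPs

print(small)
print(f"{large:.3e}")
```

Going from 150 SNPs with K = 3 to 8612 SNPs with K = 5 raises the count from a few hundred thousand to more than 10^17, which is why a search that evaluates only the relevant fraction of configurations scales so differently.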

Fig. 3.

Processing time of one locus with FINEMAP and CAVIARBF on log10 scale. Top panel: Scenario A with increasing number of SNPs allowing K = 3 or K = 5 causal SNPs. Bottom panel: Scenario B with 150 SNPs considering causal configurations with different maximum numbers of SNPs. All processing times are averaged over 500 datasets using one core of an Intel Haswell E5-2690v3 processor running at 2.6 GHz.

5.2 Fine-mapping accuracy

We computed the maximum absolute differences between the single-SNP inclusion probabilities in each dataset under scenario B to assess the fine-mapping accuracy of FINEMAP and CAVIARBF (Table 1). The small differences (max < 0.11, median $\leq 6 \times 10^{-4}$) show that for practical purposes FINEMAP achieves similar accuracy to CAVIARBF despite concentrating only on a small but relevant subset of all possible causal configurations (see Discussion). Figure 4 shows details of those SNPs in Scenario B for which the difference between the methods is larger than 0.01. We see that by ignoring the large majority of very improbable configurations, FINEMAP slightly overestimates the largest probabilities, which typically belong to the truly causal SNPs, and underestimates smaller probabilities, which most often belong to the non-causal SNPs.

Table 1

Percentiles of absolute maximum differences between FINEMAP’s and CAVIARBF’s single-SNP inclusion probabilities in Scenario B

K (m = 150)	1	2	3	4	5^a
Max	5e−7	8e−3	2e−2	1e−1	–
99th percentile	4e−7	2e−3	8e−3	4e−2	–
95th percentile	3e−7	5e−4	3e−3	1e−2	–
Median	4e−8	4e−7	2e−5	6e−4	–

^a CAVIARBF could not compute single-SNP inclusion probabilities due to a memory allocation failure (std::bad_alloc).

Fig. 4.

Single-SNP inclusion probabilities of all SNPs in Scenario B with absolute difference larger than 0.01 between FINEMAP and CAVIARBF

In addition to considering only causal configurations with up to three SNPs under scenario A, we also ran FINEMAP with K = 5 to demonstrate the increase in fine-mapping performance in this case where the true number of causal SNPs was five. We determined the proportion of causal SNPs that are included when selecting different numbers of top SNPs on the basis of ranked single-SNP inclusion probabilities (Fig. 5). FINEMAP and CAVIARBF had the same performance when considering causal configurations with up to three SNPs in genomic regions with 1500 SNPs. (Similar performance was also observed for genomic regions with different numbers of SNPs.) As expected, FINEMAP showed better fine-mapping performance when considering causal configurations with up to five SNPs.

Fig. 5.

Fine-mapping accuracy of FINEMAP and CAVIARBF on data with five causal SNPs, allowing either K = 3 or K = 5 causal SNPs. The proportion of causal SNPs included is plotted against the number of top SNPs selected on the basis of ranked single-SNP inclusion probabilities. Proportions are averaged over 500 datasets with 1500 SNPs. Case K = 5 is computationally intractable for CAVIARBF

5.3 4q22/SNCA association with Parkinson’s disease

Using single-SNP testing, the UKPDC and WTCCC2 (2011) found evidence for an association with Parkinson’s disease in the 4q22 region with the lowest P-value at rs356220. A conditional analysis on rs356220 revealed a second associated SNP, rs7687945, that in the single-SNP testing had only a modest P-value of 0.13. These two SNPs are in low Linkage Disequilibrium (LD) in the original data but the LD was sufficient to mask the effect of rs7687945 in single-SNP testing. This complex pattern of association was replicated in an independent French dataset (UKPDC and WTCCC2, 2011). To test whether FINEMAP is able to pick up this complex association pattern, we extracted a 2 megabase region centered on rs356220 with 363 directly genotyped SNPs from the original genotype data. Single-SNP testing using a logistic model implemented in SNPTEST was performed to compute z-scores. The dataset was then analyzed with FINEMAP using 100 iterations and a fixed value of the prior variance parameter for the causal effects. The top panel of Figure 6 shows that the evidence that rs356220 and rs7687945 are causal is the largest among all SNPs. In addition, the causal configuration that simultaneously contains both rs356220 and rs7687945 has the highest posterior probability (0.132). The second most probable (0.113) causal configuration contains rs356220 and rs2301134. High correlation between rs7687945 and rs2301134 explains why these two SNPs are difficult to tell apart. We conclude that FINEMAP was able to identify the complex association pattern at the second SNP that only became identifiable after the first SNP was included in the model. As opposed to the standard conditional analysis, FINEMAP provides posterior probabilities for all SNPs in the region and is thus able to simultaneously identify many causal variants without a step-wise procedure.

Fig. 6.

Fine-mapping of the 4q22/SNCA region associated with Parkinson’s disease. The associated SNPs rs356220 and rs7687945 and their joint configuration are highlighted. Dashed lines correspond respectively to a single-SNP Bayes factor of 100 and to a P-value threshold. Squared correlations are shown with respect to rs356220.

5.4 15q21/LIPC association with high-density lipoprotein cholesterol

Using single-SNP testing and conditional analysis, evidence for multiple independent association signals with high-density lipoprotein cholesterol was found in the 15q21 region (Holmen et al.; Surakka et al.). A conditional analysis using genotype data on 19 115 individuals from the National FINRISK study (Borodulin et al.) revealed three independent associations at rs2043085, rs1800588 and rs113298164. We extracted a 6 megabase region centered on rs2043085 with 8612 polymorphic SNPs and pairwise absolute correlations less than 0.99 from the original genotype data. Single-SNP testing using a linear model implemented in SNPTEST was performed to compute z-scores. The dataset was then analyzed with FINEMAP using 100 iterations, allowing for at most five causal variants, and a fixed value of the prior variance parameter for the causal effects. The top panel of Figure 7 shows that the functional lipid SNPs rs113298164 (missense variant, Durstenfeld et al.) and rs1800588 (affecting hepatic lipase activity, Deeb and Peng, 2000) are among the variants with the largest evidence of being causal. The SNP rs2043085, which had the lowest P-value in single-SNP testing, showed less evidence of being causal than rs7350789. Indeed, there is substantial evidence that the configuration from standard conditional analysis (rs2043085, rs1800588 and rs113298164) is not the causal one; the top 3 configurations from FINEMAP have between 50 and 190 times higher likelihood values. This demonstrates the importance of jointly modeling the SNPs in the region. Given that FINEMAP completes in less than 30 s (Intel Haswell E5-2690v3 processor running at 2.6 GHz) while the exhaustive search implemented in CAVIARBF is estimated to run over 300 years on these data, this example demonstrates the utility of FINEMAP as a tool to carry out future fine-mapping analyses.

Fig. 7.

Fine-mapping of the 15q21/LIPC region associated with high-density lipoprotein cholesterol. Independent association signals in conditional analysis are highlighted. Dashed lines correspond respectively to a single-SNP Bayes factor of 100 and to a P-value threshold. Squared correlations are shown with respect to rs2043085.

6 Discussion

GWAS have linked thousands of genomic regions to complex diseases and traits in humans and in model organisms. Fine-mapping causal variants in these regions is a high-dimensional variable selection problem complicated by strong correlations between the variables. We introduced a software package FINEMAP that implements an important solution to the problem: a stochastic search algorithm to circumvent computationally expensive exhaustive search. In all datasets we have tested, FINEMAP achieves similar accuracy as the exhaustive search but uses only a fraction of processing time. For example, fine-mapping a genomic region with 8612 SNPs allowing for at most five causal variants completes in less than 30 s using FINEMAP while the exhaustive search implemented in CAVIARBF is estimated to run over 300 years. Computationally efficient algorithms are a key to handle the ever-increasing amount of genetic variation captured by emerging sequencing studies as well as to scale up the analyses to whole chromosomes or even to whole genomes. FINEMAP uses a Shotgun Stochastic Search (SSS) algorithm (Hans ). SSS has been inspired by Markov Chain Monte Carlo (MCMC) algorithms that are widely used for Bayesian inference. For a review on MCMC, see Andrieu ). Standard MCMC methods, such as the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis ) and Gibbs sampler (Geman and Geman, 1984), perform a sequence of steps in the parameter space via a stochastic transition mechanism that ensures a valid approximation to the target distribution. MCMC can often quickly reach an interesting region of the parameter space, but, at each step, it only considers one of the possible neighboring states. This means that MCMC is often slow to explore a high-dimensional state space. To improve on this, SSS generates a whole set of neighboring configurations at each iteration and saves them all for further use in probability calculations. 
This way a large number of parameter configurations with relatively high probability is quickly explored. FINEMAP is accurate when the set of causal configurations explored captures a large majority of the total posterior probability. Our results show that this is the case in all datasets we have tested: the maximal error in any single-SNP inclusion probability is smaller than 0.11 across all 2000 datasets of Scenario B. Using exhaustive search, we observed in genomic regions with 750 SNPs of which five were truly causal that on average only the top 123 (median = 14) causal configurations out of all possible already cover 95% of the total posterior probability. (Similar results were also observed for genomic regions with different numbers of SNPs.) This explains why an efficient stochastic search can achieve accurate results in a tiny fraction of the processing time of an exhaustive search. Our datasets were generated by requiring that the causal SNPs had highly correlated proxies (absolute correlation greater than 0.5) among the other variants. The high accuracy of FINEMAP throughout these tests makes us believe that FINEMAP is accurate in typical GWAS data with complex correlation structure among the SNPs. Although we have not encountered any dataset where FINEMAP would not have performed well, theoretically, it remains possible that an in-exhaustive search could miss some relevant causal configurations. A simple way to assess possible problems is to run many searches in parallel and compare and combine their outcomes. Another way is parallel tempering (Geyer, 1991) where several searches are run in parallel in different ‘temperatures’. Intuitively, increasing temperature flattens the likelihood function and hence a search in a higher temperature moves around more freely than one in a colder temperature. 
Such a tempering approach, together with complex global transition mechanisms for escaping from local modes, was introduced in an evolutionary stochastic search algorithm by Bottolo and Richardson (2010), which was later tailored for genetic analyses of multiple SNPs and multivariate phenotypes in the software package GUESS (Bottolo et al.). These two papers suggest how FINEMAP could be modified further if trapping in local modes of the search space were encountered in real-data analyses of GWAS regions.

Fine-mapping methods based on summary data require a high-quality correlation estimate. Ideally, the correlation matrix is computed from the same genotype data from which the z-scores originate. In that case, for quantitative traits, the equations in Section 2.1 connecting the original genotype-phenotype data and the GWAS summary data are exact, and hence no information is lost by working with summary data. For case-control data, a normal approximation to the logistic likelihood causes some difference between the two approaches, but the difference is expected to be small with current GWAS sample sizes (Pirinen et al.). For some populations, sequencing of many thousands of individuals has either already been carried out or will be completed soon. Such reference data allow reliable fine-mapping down to low-frequency variants even when the original genotype data are not available. A more challenging problem is posed by large meta-analyses that combine individuals of varying ancestries. Assuming that the causal variants are included in the data and have the same effect sizes across the ancestral backgrounds, FINEMAP can be run with the sample-size-weighted SNP correlation matrix. If these assumptions are not met, then a hierarchical model allowing separate SNP correlation structures in each ancestry would perform better (Kichaev and Pasaniuc, 2015).

Fine-mapping methods based on summary data assume that the causal variants are included in the data.
Recent advances in z-score imputation (Lee et al., 2013; Pasaniuc et al.) help to satisfy this requirement even when a causal variant may not have been genotyped. However, some SNPs are difficult to impute because they are not tagged well by the SNPs in the data. We do not expect to capture the association signal from such SNPs well, either through imputation or indirectly through other SNPs in the data.

The output from FINEMAP is a list of possible causal configurations together with their posterior probabilities and Bayes factors, similar to the output of CAVIARBF. These probabilities contain all the information from the model that is needed for downstream analyses. Examples of useful derived quantities are the single-SNP inclusion probabilities, single-SNP Bayes factors, credible sets of causal variants (WTCCC et al., 2012) and a regional Bayes factor to assess the evidence against the null model in which none of the SNPs is causal (Chen et al., 2015).

We believe that FINEMAP, or related future applications of shotgun stochastic search to GWAS summary data, enables unprecedented opportunities to reveal valuable information that could otherwise remain undetected due to computational limitations of the existing fine-mapping methods.
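As an illustration of how such derived quantities follow from a list of configurations and their posterior probabilities, the Python sketch below computes single-SNP inclusion probabilities and a credible set. It is a sketch under stated assumptions, not FINEMAP's output format or code: the function names and the example numbers are hypothetical, the configuration probabilities are renormalized over the configurations supplied, and the credible set follows the usual construction under a single-causal-variant assumption, where the per-SNP probabilities sum to one.

```python
def inclusion_probabilities(config_posteriors, n_snps):
    """Sum the (renormalized) posterior of every configuration that contains each SNP."""
    total = sum(config_posteriors.values())
    pips = [0.0] * n_snps
    for config, prob in config_posteriors.items():
        for snp in config:
            pips[snp] += prob / total
    return pips


def credible_set(snp_probs, level=0.95):
    """Smallest set of top-ranked SNPs whose probabilities sum to at least `level`.

    Assumes one causal variant, so that `snp_probs` sums to one.
    """
    ranked = sorted(snp_probs.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for snp, prob in ranked:
        chosen.append(snp)
        cumulative += prob
        if cumulative >= level:
            break
    return chosen


# Hypothetical posterior probabilities of three explored configurations.
configs = {frozenset({0}): 0.5, frozenset({0, 1}): 0.3, frozenset({2}): 0.2}
pips = inclusion_probabilities(configs, n_snps=4)

# Hypothetical per-SNP probabilities under a single-causal-variant model.
single = {0: 0.60, 1: 0.30, 2: 0.08, 3: 0.02}
cs95 = credible_set(single, level=0.95)
print(pips, cs95)
```

Because the inclusion probabilities are sums over the explored configurations only, they are accurate to the extent that those configurations capture most of the total posterior probability.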

19 in total

1. Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics.

Authors: Wenan Chen; Beth R Larrabee; Inna G Ovsyannikova; Richard B Kennedy; Iana H Haralambieva; Gregory A Poland; Daniel J Schaid
Journal: Genetics Date: 2015-05-06 Impact factor: 4.562

2. Identifying causal variants at loci with multiple signals of association.

Authors: Farhad Hormozdiari; Emrah Kostem; Eun Yong Kang; Bogdan Pasaniuc; Eleazar Eskin
Journal: Genetics Date: 2014-08-07 Impact factor: 4.562

3. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies.

Authors: Gleb Kichaev; Bogdan Pasaniuc
Journal: Am J Hum Genet Date: 2015-07-16 Impact factor: 11.025

4. Molecular characterization of human hepatic lipase deficiency. In vitro expression of two naturally occurring mutations.

Authors: A Durstenfeld; O Ben-Zeev; K Reue; G Stahnke; M H Doolittle
Journal: Arterioscler Thromb Date: 1994-03

5. Imputation-based analysis of association studies: candidate regions and quantitative traits.

Authors: Bertrand Servin; Matthew Stephens
Journal: PLoS Genet Date: 2007-05-30 Impact factor: 5.917

6. Dissection of a Complex Disease Susceptibility Region Using a Bayesian Stochastic Search Approach to Fine Mapping.

Authors: Chris Wallace; Antony J Cutler; Nikolas Pontikos; Marcin L Pekalski; Oliver S Burren; Jason D Cooper; Arcadio Rubio García; Ricardo C Ferreira; Hui Guo; Neil M Walker; Deborah J Smyth; Stephen S Rich; Suna Onengut-Gumuscu; Stephen J Sawcer; Maria Ban; Sylvia Richardson; John A Todd; Linda S Wicker
Journal: PLoS Genet Date: 2015-06-24 Impact factor: 5.917

7. DIST: direct imputation of summary statistics for unmeasured SNPs.

Authors: Donghyung Lee; T Bernard Bigdeli; Brien P Riley; Ayman H Fanous; Silviu-Alin Bacanu
Journal: Bioinformatics Date: 2013-08-28 Impact factor: 6.937

8. Integrating functional data to prioritize causal variants in statistical fine-mapping studies.

Authors: Gleb Kichaev; Wen-Yun Yang; Sara Lindstrom; Farhad Hormozdiari; Eleazar Eskin; Alkes L Price; Peter Kraft; Bogdan Pasaniuc
Journal: PLoS Genet Date: 2014-10-30 Impact factor: 5.917

9. Bayesian refinement of association signals for 14 loci in 3 common diseases.

Authors: Julian B Maller; Gilean McVean; Jake Byrnes; Damjan Vukcevic; Kimmo Palin; Zhan Su; Joanna M M Howson; Adam Auton; Simon Myers; Andrew Morris; Matti Pirinen; Matthew A Brown; Paul R Burton; Mark J Caulfield; Alastair Compston; Martin Farrall; Alistair S Hall; Andrew T Hattersley; Adrian V S Hill; Christopher G Mathew; Marcus Pembrey; Jack Satsangi; Michael R Stratton; Jane Worthington; Nick Craddock; Matthew Hurles; Willem Ouwehand; Miles Parkes; Nazneen Rahman; Audrey Duncanson; John A Todd; Dominic P Kwiatkowski; Nilesh J Samani; Stephen C L Gough; Mark I McCarthy; Panagiotis Deloukas; Peter Donnelly
Journal: Nat Genet Date: 2012-10-28 Impact factor: 38.330

10. Defining the role of common variation in the genomic and biological architecture of adult human height.

Authors: Andrew R Wood; Tonu Esko; Jian Yang; Sailaja Vedantam; Tune H Pers; Stefan Gustafsson; Audrey Y Chu; Karol Estrada; Jian'an Luan; Zoltán Kutalik; Najaf Amin; Martin L Buchkovich; Damien C Croteau-Chonka; Felix R Day; Yanan Duan; Tove Fall; Rudolf Fehrmann; Teresa Ferreira; Anne U Jackson; Juha Karjalainen; Ken Sin Lo; Adam E Locke; Reedik Mägi; Evelin Mihailov; Eleonora Porcu; Joshua C Randall; André Scherag; Anna A E Vinkhuyzen; Harm-Jan Westra; Thomas W Winkler; Tsegaselassie Workalemahu; Jing Hua Zhao; Devin Absher; Eva Albrecht; Denise Anderson; Jeffrey Baron; Marian Beekman; Ayse Demirkan; Georg B Ehret; Bjarke Feenstra; Mary F Feitosa; Krista Fischer; Ross M Fraser; Anuj Goel; Jian Gong; Anne E Justice; Stavroula Kanoni; Marcus E Kleber; Kati Kristiansson; Unhee Lim; Vaneet Lotay; Julian C Lui; Massimo Mangino; Irene Mateo Leach; Carolina Medina-Gomez; Michael A Nalls; Dale R Nyholt; Cameron D Palmer; Dorota Pasko; Sonali Pechlivanis; Inga Prokopenko; Janina S Ried; Stephan Ripke; Dmitry Shungin; Alena Stancáková; Rona J Strawbridge; Yun Ju Sung; Toshiko Tanaka; Alexander Teumer; Stella Trompet; Sander W van der Laan; Jessica van Setten; Jana V Van Vliet-Ostaptchouk; Zhaoming Wang; Loïc Yengo; Weihua Zhang; Uzma Afzal; Johan Arnlöv; Gillian M Arscott; Stefania Bandinelli; Amy Barrett; Claire Bellis; Amanda J Bennett; Christian Berne; Matthias Blüher; Jennifer L Bolton; Yvonne Böttcher; Heather A Boyd; Marcel Bruinenberg; Brendan M Buckley; Steven Buyske; Ida H Caspersen; Peter S Chines; Robert Clarke; Simone Claudi-Boehm; Matthew Cooper; E Warwick Daw; Pim A De Jong; Joris Deelen; Graciela Delgado; Josh C Denny; Rosalie Dhonukshe-Rutten; Maria Dimitriou; Alex S F Doney; Marcus Dörr; Niina Eklund; Elodie Eury; Lasse Folkersen; Melissa E Garcia; Frank Geller; Vilmantas Giedraitis; Alan S Go; Harald Grallert; Tanja B Grammer; Jürgen Gräßler; Henrik Grönberg; Lisette C P G M de Groot; Christopher J Groves; Jeffrey Haessler; Per Hall; Toomas Haller; Goran 
Hallmans; Anke Hannemann; Catharina A Hartman; Maija Hassinen; Caroline Hayward; Nancy L Heard-Costa; Quinta Helmer; Gibran Hemani; Anjali K Henders; Hans L Hillege; Mark A Hlatky; Wolfgang Hoffmann; Per Hoffmann; Oddgeir Holmen; Jeanine J Houwing-Duistermaat; Thomas Illig; Aaron Isaacs; Alan L James; Janina Jeff; Berit Johansen; Åsa Johansson; Jennifer Jolley; Thorhildur Juliusdottir; Juhani Junttila; Abel N Kho; Leena Kinnunen; Norman Klopp; Thomas Kocher; Wolfgang Kratzer; Peter Lichtner; Lars Lind; Jaana Lindström; Stéphane Lobbens; Mattias Lorentzon; Yingchang Lu; Valeriya Lyssenko; Patrik K E Magnusson; Anubha Mahajan; Marc Maillard; Wendy L McArdle; Colin A McKenzie; Stela McLachlan; Paul J McLaren; Cristina Menni; Sigrun Merger; Lili Milani; Alireza Moayyeri; Keri L Monda; Mario A Morken; Gabriele Müller; Martina Müller-Nurasyid; Arthur W Musk; Narisu Narisu; Matthias Nauck; Ilja M Nolte; Markus M Nöthen; Laticia Oozageer; Stefan Pilz; Nigel W Rayner; Frida Renstrom; Neil R Robertson; Lynda M Rose; Ronan Roussel; Serena Sanna; Hubert Scharnagl; Salome Scholtens; Fredrick R Schumacher; Heribert Schunkert; Robert A Scott; Joban Sehmi; Thomas Seufferlein; Jianxin Shi; Karri Silventoinen; Johannes H Smit; Albert Vernon Smith; Joanna Smolonska; Alice V Stanton; Kathleen Stirrups; David J Stott; Heather M Stringham; Johan Sundström; Morris A Swertz; Ann-Christine Syvänen; Bamidele O Tayo; Gudmar Thorleifsson; Jonathan P Tyrer; Suzanne van Dijk; Natasja M van Schoor; Nathalie van der Velde; Diana van Heemst; Floor V A van Oort; Sita H Vermeulen; Niek Verweij; Judith M Vonk; Lindsay L Waite; Melanie Waldenberger; Roman Wennauer; Lynne R Wilkens; Christina Willenborg; Tom Wilsgaard; Mary K Wojczynski; Andrew Wong; Alan F Wright; Qunyuan Zhang; Dominique Arveiler; Stephan J L Bakker; John Beilby; Richard N Bergman; Sven Bergmann; Reiner Biffar; John Blangero; Dorret I Boomsma; Stefan R Bornstein; Pascal Bovet; Paolo Brambilla; Morris J Brown; Harry Campbell; Mark J 
Caulfield; Aravinda Chakravarti; Rory Collins; Francis S Collins; Dana C Crawford; L Adrienne Cupples; John Danesh; Ulf de Faire; Hester M den Ruijter; Raimund Erbel; Jeanette Erdmann; Johan G Eriksson; Martin Farrall; Ele Ferrannini; Jean Ferrières; Ian Ford; Nita G Forouhi; Terrence Forrester; Ron T Gansevoort; Pablo V Gejman; Christian Gieger; Alain Golay; Omri Gottesman; Vilmundur Gudnason; Ulf Gyllensten; David W Haas; Alistair S Hall; Tamara B Harris; Andrew T Hattersley; Andrew C Heath; Christian Hengstenberg; Andrew A Hicks; Lucia A Hindorff; Aroon D Hingorani; Albert Hofman; G Kees Hovingh; Steve E Humphries; Steven C Hunt; Elina Hypponen; Kevin B Jacobs; Marjo-Riitta Jarvelin; Pekka Jousilahti; Antti M Jula; Jaakko Kaprio; John J P Kastelein; Manfred Kayser; Frank Kee; Sirkka M Keinanen-Kiukaanniemi; Lambertus A Kiemeney; Jaspal S Kooner; Charles Kooperberg; Seppo Koskinen; Peter Kovacs; Aldi T Kraja; Meena Kumari; Johanna Kuusisto; Timo A Lakka; Claudia Langenberg; Loic Le Marchand; Terho Lehtimäki; Sara Lupoli; Pamela A F Madden; Satu Männistö; Paolo Manunta; André Marette; Tara C Matise; Barbara McKnight; Thomas Meitinger; Frans L Moll; Grant W Montgomery; Andrew D Morris; Andrew P Morris; Jeffrey C Murray; Mari Nelis; Claes Ohlsson; Albertine J Oldehinkel; Ken K Ong; Willem H Ouwehand; Gerard Pasterkamp; Annette Peters; Peter P Pramstaller; Jackie F Price; Lu Qi; Olli T Raitakari; Tuomo Rankinen; D C Rao; Treva K Rice; Marylyn Ritchie; Igor Rudan; Veikko Salomaa; Nilesh J Samani; Jouko Saramies; Mark A Sarzynski; Peter E H Schwarz; Sylvain Sebert; Peter Sever; Alan R Shuldiner; Juha Sinisalo; Valgerdur Steinthorsdottir; Ronald P Stolk; Jean-Claude Tardif; Anke Tönjes; Angelo Tremblay; Elena Tremoli; Jarmo Virtamo; Marie-Claude Vohl; Philippe Amouyel; Folkert W Asselbergs; Themistocles L Assimes; Murielle Bochud; Bernhard O Boehm; Eric Boerwinkle; Erwin P Bottinger; Claude Bouchard; Stéphane Cauchi; John C Chambers; Stephen J Chanock; Richard S Cooper; 
Paul I W de Bakker; George Dedoussis; Luigi Ferrucci; Paul W Franks; Philippe Froguel; Leif C Groop; Christopher A Haiman; Anders Hamsten; M Geoffrey Hayes; Jennie Hui; David J Hunter; Kristian Hveem; J Wouter Jukema; Robert C Kaplan; Mika Kivimaki; Diana Kuh; Markku Laakso; Yongmei Liu; Nicholas G Martin; Winfried März; Mads Melbye; Susanne Moebus; Patricia B Munroe; Inger Njølstad; Ben A Oostra; Colin N A Palmer; Nancy L Pedersen; Markus Perola; Louis Pérusse; Ulrike Peters; Joseph E Powell; Chris Power; Thomas Quertermous; Rainer Rauramaa; Eva Reinmaa; Paul M Ridker; Fernando Rivadeneira; Jerome I Rotter; Timo E Saaristo; Danish Saleheen; David Schlessinger; P Eline Slagboom; Harold Snieder; Tim D Spector; Konstantin Strauch; Michael Stumvoll; Jaakko Tuomilehto; Matti Uusitupa; Pim van der Harst; Henry Völzke; Mark Walker; Nicholas J Wareham; Hugh Watkins; H-Erich Wichmann; James F Wilson; Pieter Zanen; Panos Deloukas; Iris M Heid; Cecilia M Lindgren; Karen L Mohlke; Elizabeth K Speliotes; Unnur Thorsteinsdottir; Inês Barroso; Caroline S Fox; Kari E North; David P Strachan; Jacques S Beckmann; Sonja I Berndt; Michael Boehnke; Ingrid B Borecki; Mark I McCarthy; Andres Metspalu; Kari Stefansson; André G Uitterlinden; Cornelia M van Duijn; Lude Franke; Cristen J Willer; Alkes L Price; Guillaume Lettre; Ruth J F Loos; Michael N Weedon; Erik Ingelsson; Jeffrey R O'Connell; Goncalo R Abecasis; Daniel I Chasman; Michael E Goddard; Peter M Visscher; Joel N Hirschhorn; Timothy M Frayling
Journal: Nat Genet Date: 2014-10-05 Impact factor: 38.330
