Dominic Holland, Yunpeng Wang, Wesley K Thompson, Andrew Schork, Chi-Hua Chen, Min-Tzu Lo, Aree Witoelar, Thomas Werge, Michael O'Donovan, Ole A Andreassen, Anders M Dale.
Abstract
Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need for better understanding the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N = 82,315) and putamen volume (N = 12,596), with approximately 9.3 million SNP z-scores in both cases. We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We assess the degree to which effect sizes are over-estimated when based on linear-regression association coefficients. 
We estimate the polygenicity of schizophrenia to be 0.037 and the putamen to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
Keywords: GWAS; Gaussian mixture model; SNP discovery; effect size; heritability; putamen; schizophrenia
Year: 2016 PMID: 26909100 PMCID: PMC4754432 DOI: 10.3389/fgene.2016.00015
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. For schizophrenia, posterior estimate of (A) effect size and (B) variance; (C) estimate of replication probability for .
Figure 2. Posterior effect size and variance, calculated for effective sample size . Note that sparse effects have a component that arises from ubiquitous effects. z = δ + ϵ, where δ = δ + δ and E(ϵ) = 0; δ are ubiquitous effects, while δ are additional contributions to total sparse effects.
Figure 3. Randomly culling SNPs with LD . The black plot is for light random pruning at r² ≥ 0.8, shown in Figure 1A.
Figure 4. (A) Empirical and model QQ plots for putamen volume and schizophrenia. (B) Proportion of total additive genetic variance or chip heritability explained by sparse effects for all “tagged” SNPs with p-value less than the GWAS p-value threshold (), as a function of effective sample size, for putamen volume and schizophrenia (the asterisks correspond to the current effective sample sizes for ENIGMA and PGS2). Of the total variance that is explained by sparse effects for all SNPs, the proportion explained by SNPs currently reaching the usual GWAS significance level is approximately 15% for both phenotypes.
Figure 5. For schizophrenia, (A) posterior estimates of effect-size-squared, as given by Equation 25, vs. . When assuming that the phenotypic variance explained by a SNP is given by , the degree to which this is an over-estimate is indicated by the ratio of the height of the black dashed line (the assumption δ² = z²) to the height of the corresponding point on the curve for a given sample size. The asterisks correspond to the threshold significant z-score. (B) For a multistage GWAS, where discovery is from a subset (20%, 50%, 90%) of the total PGC2 sample, the curves give the probability of a SNP with p-value p in the discovery sample passing genome-wide significance () in the combined (total) data set, Equation 29. The vertical gray line is at p = p.
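The abstract's idea of modeling z-scores as a mixture of Gaussians, with z = δ + ϵ as in the Figure 2 caption, can be illustrated with a minimal sketch. This is not the paper's four-parameter model: it assumes a simplified two-component form in which a fraction `pi1` of SNPs carry a sparse effect δ ~ N(0, `sigma1`²) and all SNPs share background noise ϵ ~ N(0, `sigma0`²). The function name `posterior_effect` and the values of `sigma0` and `sigma1` are hypothetical; `pi1 = 0.037` is borrowed from the abstract's schizophrenia estimate purely for illustration.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean Gaussian with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_effect(z, pi1, sigma0, sigma1):
    """Posterior mean of the sparse effect given an observed z-score.

    Assumed (simplified) model, not the paper's exact parametrization:
      z = delta + eps, eps ~ N(0, sigma0^2)
      delta = 0 with probability 1 - pi1, else delta ~ N(0, sigma1^2)
    Returns (E[delta | z], P(sparse component | z)).
    """
    sd_sparse = math.sqrt(sigma0 ** 2 + sigma1 ** 2)
    w_background = (1.0 - pi1) * normal_pdf(z, sigma0)
    w_sparse = pi1 * normal_pdf(z, sd_sparse)
    p_sparse = w_sparse / (w_background + w_sparse)
    # Within the sparse component, the Gaussian posterior mean shrinks z.
    shrink = sigma1 ** 2 / (sigma0 ** 2 + sigma1 ** 2)
    return p_sparse * shrink * z, p_sparse


if __name__ == "__main__":
    # Illustrative parameter values only (sigma0, sigma1 are assumptions).
    for z in (0.0, 2.0, 4.0, 6.0):
        effect, prob = posterior_effect(z, pi1=0.037, sigma0=1.0, sigma1=3.0)
        print(f"z={z:.1f}  P(sparse|z)={prob:.4f}  E[delta|z]={effect:.4f}")
```

Under these assumptions the posterior mean is always smaller in magnitude than the observed z, which is the same qualitative point the Figure 5 caption makes about taking δ² = z² at face value.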