
An atlas of genetic correlations across human diseases and traits.

Brendan Bulik-Sullivan^1,2,3, Hilary K Finucane^4, Verneri Anttila^1,2,3, Alexander Gusev^5,6, Felix R Day^7, Po-Ru Loh^1,5, Laramie Duncan^1,2,3, John R B Perry^7, Nick Patterson^1, Elise B Robinson^1,2,3, Mark J Daly^1,2,3, Alkes L Price^1,5,6, Benjamin M Neale^1,2,3.

Abstract

Identifying genetic correlations between complex traits and diseases can provide useful etiological insights and help prioritize likely causal relationships. The major challenges preventing estimation of genetic correlation from genome-wide association study (GWAS) data with current methods are the lack of availability of individual-level genotype data and widespread sample overlap among meta-analyses. We circumvent these difficulties by introducing a technique, cross-trait LD Score regression, for estimating genetic correlation that requires only GWAS summary statistics and is not biased by sample overlap. We use this method to estimate 276 genetic correlations among 24 traits. The results include genetic correlations between anorexia nervosa and schizophrenia, anorexia and obesity, and educational attainment and several diseases. These results highlight the power of genome-wide analyses, as there currently are no significantly associated SNPs for anorexia nervosa and only three for educational attainment.

Year: 2015 PMID: 26414676 PMCID: PMC4797329 DOI: 10.1038/ng.3406

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

Introduction

Understanding the complex relationships among human traits and diseases is a fundamental goal of epidemiology. Randomized controlled trials and longitudinal studies are time-consuming and expensive, so many potential risk factors are studied using cross-sectional correlation studies at a single time point. Obtaining causal inferences from such studies can be challenging due to issues such as confounding and reverse causation, which can lead to spurious associations and mask the effects of real risk factors [1, 2]. Genetics can help elucidate cause and effect, since inherited genetic risks cannot be subject to reverse causation and are correlated with a smaller list of confounders. The first methods for testing for genetic overlap were family studies [3, 4, 5, 6, 7]. In order to estimate genetic overlaps among many pairs of phenotypes, family designs require measuring multiple traits on the same individuals. Consequently, it is challenging to scale family designs to a large number of traits, especially traits that are difficult or costly to measure (e.g., low-prevalence diseases). More recently, genome-wide association studies (GWAS) have allowed us to obtain effect-size estimates for specific genetic variants, so it is possible to test for shared genetics by looking for correlations in effect sizes across traits, which does not require measuring multiple traits per individual. There exists a large class of methods for interrogating genetic overlap via GWAS that focus only on genome-wide significant SNPs. One of the most influential methods in this class is Mendelian randomization, which uses significantly associated SNPs as instrumental variables to attempt to quantify causal relationships between risk factors and disease [1, 2]. Methods that focus on significant SNPs are effective for traits where there are many significant associations that account for a substantial fraction of heritability [8, 9].
For many complex traits, heritability is distributed over thousands of variants with small effects, and the proportion of heritability accounted for by significantly associated variants at current sample sizes is small [10]. In such situations, one can often obtain more accurate results by using genome-wide data, rather than just significantly associated variants [11]. A complementary approach is to estimate genetic correlation, which includes the effects of all SNPs, including those that do not reach genome-wide significance (Methods). The two main existing techniques for estimating genetic correlation from GWAS data are restricted maximum likelihood (REML) [11, 12, 13, 14, 15, 16] and polygenic scores [17, 18]. These methods have only been applied to a few traits, because they require individual genotype data, which are difficult to obtain due to informed consent limitations. In order to overcome these limitations, we have developed a technique for estimating genetic correlation using only GWAS summary statistics that is not biased by sample overlap. Our method, cross-trait LD Score regression, is a simple extension of single-trait LD Score regression [19] and is computationally very fast. We apply this method to data from 24 GWAS and report genetic correlations for 276 pairs of phenotypes, demonstrating shared genetic bases for many complex diseases and traits.

Results

Overview of Methods

The method presented here for estimating genetic correlation from summary statistics relies on the fact that the GWAS effect-size estimate for a given SNP incorporates the effects of all SNPs in linkage disequilibrium (LD) with that SNP [19, 20]. For a polygenic trait, SNPs with high LD will have higher $\chi^2$ statistics on average than SNPs with low LD [19]. A similar relationship holds if we replace $\chi^2$ statistics for a single study with the product of z-scores from two studies of traits with non-zero genetic correlation. More precisely, under a polygenic model [11, 13], the expected value of $z_{1j}z_{2j}$ is

$$E[z_{1j}z_{2j}] = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}}, \qquad (1)$$

where $N_i$ is the sample size for study $i$, $\rho_g$ is genetic covariance (defined in Methods), $\ell_j$ is the LD Score of SNP $j$ [19], $M$ is the number of SNPs (Methods), $N_s$ is the number of individuals included in both studies, and $\rho$ is the phenotypic correlation among the $N_s$ overlapping samples. We derive this equation in the Supplementary Note. If study 1 and study 2 are the same study, then Equation 1 reduces to the single-trait result from [19], because genetic covariance between a trait and itself is heritability, and $\chi^2 = z^2$. As a consequence of Equation 1, we can estimate genetic covariance using the slope from the regression of $z_{1j}z_{2j}$ on LD Score, which is computationally very fast (Methods). Sample overlap creates spurious correlation between $z_{1j}$ and $z_{2j}$, which inflates $z_{1j}z_{2j}$. The expected magnitude of this inflation is uniform across all markers, and in particular does not depend on LD Score. As a result, sample overlap only affects the intercept from this regression (the term $\rho N_s/\sqrt{N_1 N_2}$) and not the slope, so the estimates of genetic correlation will not be biased by sample overlap. Similarly, shared population stratification will alter the intercept but have minimal impact on the slope, because the correlation between LD Score and the rate of genetic drift is minimal [19].
If we are willing to assume no shared population stratification, and we know the amount of sample overlap and phenotypic correlation in advance (i.e., the true value of $\rho N_s/\sqrt{N_1 N_2}$), we can constrain the intercept to this value. We refer to this approach as constrained intercept LD Score regression. Constrained intercept LD Score regression has lower standard error (often by as much as 30%) than LD Score regression with unconstrained intercept, but will yield biased and misleading estimates if the intercept is misspecified, e.g., if we specify the wrong value of $N_s$ or do not completely control for population stratification. Normalizing genetic covariance by the SNP-heritabilities yields genetic correlation: $r_g := \rho_g/\sqrt{h_1^2 h_2^2}$, where $h_i^2$ denotes the SNP-heritability [11] from study $i$. Genetic correlation ranges between −1 and 1. Results similar to Equation 1 hold if one or both studies is a case/control study, in which case genetic covariance is on the observed scale. Details are provided in the Supplementary Note. There is no distinction between observed and liability scale genetic correlation for case/control traits, so we can define and estimate genetic correlation between a case/control trait and a quantitative trait and genetic correlation between pairs of case/control traits without the need to specify a scale (Supplementary Note).
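
As a small numerical illustration of Equation 1 and of the intercept used in constrained intercept LD Score regression, the sketch below assumes the equation has the form E[z1j z2j] = sqrt(N1 N2) ρ_g ℓ_j / M + ρ N_s / sqrt(N1 N2), which matches the symbol definitions above. The function names and all input values are hypothetical and are not taken from the paper or from ldsc.

```python
import math


def expected_intercept(rho, n_s, n1, n2):
    """Intercept term rho * N_s / sqrt(N1 * N2) induced by sample overlap."""
    return rho * n_s / math.sqrt(n1 * n2)


def expected_z1z2(ld_score, n1, n2, rho_g, m, rho=0.0, n_s=0):
    """Expected product of z-scores for one SNP under the assumed Equation 1."""
    slope = math.sqrt(n1 * n2) * rho_g / m
    return slope * ld_score + expected_intercept(rho, n_s, n1, n2)


# Hypothetical inputs: two studies of 50,000 individuals, 1,000,000 SNPs,
# genetic covariance 0.25, and a SNP with LD Score 100.
no_overlap = expected_z1z2(100.0, 50_000, 50_000, 0.25, 1_000_000)
full_overlap = expected_z1z2(
    100.0, 50_000, 50_000, 0.25, 1_000_000, rho=0.3, n_s=50_000
)

# The overlap shifts the expectation by the intercept only; the slope in
# LD Score is unchanged.
shift = full_overlap - no_overlap
```

With complete overlap and equal sample sizes the shift equals the phenotypic correlation itself, which is why the intercept can be constrained when both the overlap and the phenotypic correlation are known.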

Simulations

We performed a series of simulations to evaluate the robustness of the model to potential confounders such as sample overlap and model misspecification, and to verify the accuracy of the standard error estimates (Methods). Table 1 shows cross-trait LD Score regression estimates and standard errors from 1,000 simulations of quantitative traits. For each simulation replicate, we generated two phenotypes for each of 2,062 individuals in our sample by drawing effect sizes for approximately 600,000 SNPs on chromosome 2 from a bivariate normal distribution. We then computed summary statistics for both phenotypes and estimated heritability and genetic correlation with cross-trait LD Score regression. The summary statistics were generated from completely overlapping samples. These simulations confirm that cross-trait LD Score regression yields accurate estimates of the true genetic correlation and that the standard errors match the standard deviation across simulations. Thus, cross-trait LD Score regression is not biased by sample overlap, in contrast to estimation of genetic correlation via polygenic risk scores, which is biased in the presence of sample overlap [18]. We also evaluated simulations with one quantitative trait and one case/control study and show that cross-trait LD Score regression can be applied to binary traits and is not biased by oversampling of cases (Supplementary Table 1).

Table 1

Simulations with complete sample overlap. Truth shows the true parameter values. Estimate shows the average cross-trait LD Score regression estimate across 1000 simulations. SD shows the standard deviation of the estimates across 1000 simulations, and SE shows the mean cross-trait LD Score regression SE across 1000 simulations. Further details of the simulation setup are given in the Methods.

Parameter	Truth	Estimate	SD	SE
h²	0.58	0.58	0.072	0.075
ρ_g	0.29	0.29	0.057	0.058
r_g	0.50	0.49	0.079	0.073
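
The true values in Table 1 can be cross-checked against the normalization of genetic covariance by SNP-heritability. The short sketch below assumes that both simulated phenotypes share the single heritability listed in the table; the variable names are illustrative only.

```python
import math

# True parameter values copied from Table 1.
h2_truth = 0.58
rho_g_truth = 0.29

# Genetic correlation is genetic covariance divided by the geometric mean of
# the two heritabilities (assumed equal here).
r_g_truth = rho_g_truth / math.sqrt(h2_truth * h2_truth)

# Ratio of the average reported SE to the empirical SD for r_g, from Table 1.
se_over_sd = 0.073 / 0.079
```

The recovered value agrees with the r_g row of Table 1, and the SE/SD ratio close to one is the sense in which the standard errors are described as matching the spread across simulations.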

Estimates of heritability and genetic covariance can be biased if the underlying model of genetic architecture is misspecified, e.g., if variance explained is correlated with LD Score or MAF [19, 21]. Because genetic correlation is estimated as a ratio, it is more robust; biases that affect the numerator and the denominator in the same direction tend to cancel. We obtain approximately correct estimates of genetic correlation even in simulations with models of genetic architecture where our estimates of heritability and genetic covariance are biased (Supplementary Table 2).

Replication of Psychiatric Cross-Disorder Results

As technical validation, we replicated the estimates of genetic correlations among psychiatric disorders obtained with individual genotypes and REML in [14], by applying cross-trait LD Score regression to summary statistics from the same data [22]. These summary statistics were generated from non-overlapping samples, so we applied cross-trait LD Score regression using both unconstrained and constrained intercepts (Methods). Results from these analyses are shown in Figure 1. The results from cross-trait LD Score regression were similar to the results from REML. Cross-trait LD Score regression with constrained intercept gave standard errors that were only slightly larger than those from REML, while the standard errors from cross-trait LD Score regression with unconstrained intercept were substantially larger, especially for traits with small sample sizes (e.g., ADHD, ASD).

Figure 1

Application to Summary Statistics From 24 Phenotypes

We used cross-trait LD Score regression to estimate genetic correlations among 24 phenotypes (URLs, Methods). Genetic correlation estimates for all 276 pairwise combinations of the 24 traits are shown in Figure 2. For clarity of presentation, the 24 phenotypes were restricted to contain only one phenotype from each cluster of closely related phenotypes (Methods). Genetic correlations among the educational, anthropometric, smoking, and insulin-related phenotypes that were excluded from Figure 2 are shown in Supplementary Figures 1, 2, 3 and 4, respectively. A full table of 1176 genetic correlations among 49 traits is provided in Supplementary Table 4. References and sample sizes are shown in Supplementary Table 3.
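
The pair counts quoted above follow from counting unordered pairs of distinct traits, which the following short check reproduces (variable names are illustrative).

```python
from math import comb

# Number of unordered pairs among n traits is n * (n - 1) / 2.
pairs_main_figure = comb(24, 2)
pairs_full_table = comb(49, 2)
```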

Figure 2

The first section of Table 2 lists genetic correlation results that are consistent with epidemiological associations, but, as far as we are aware, have not previously been reported using genetic data. The estimates of the genetic correlation between age at menarche and adult height [29], triglycerides [30] and type 2 diabetes [30, 31] are consistent with the epidemiological associations. The estimate of a negative genetic correlation between anorexia nervosa and obesity suggests that the same genetic factors influence normal variation in BMI as well as dysregulated BMI in psychiatric illness. This result is consistent with the observation that BMI GWAS findings implicate neuronal, rather than metabolic, cell-types and epigenetic marks [32, 33]. The negative genetic correlation between adult height and coronary artery disease agrees with a replicated epidemiological association [34, 35, 36]. We observe several significant associations with the educational attainment phenotypes from Rietveld et al. [37]: we estimate a statistically significant negative genetic correlation between college and Alzheimer’s disease, which agrees with epidemiological results [38, 39]. The positive genetic correlation between college and bipolar disorder is consistent with previous epidemiological reports [40, 41]. The estimate of a negative genetic correlation between smoking and college is consistent with the observed differences in smoking rates as a function of educational attainment [42].

Table 2

Genetic correlation estimates, standard errors and p-values for selected pairs of traits. Results are grouped into genetic correlations that are new genetic results, but are consistent with established epidemiological associations (“Epidemiological”), genetic correlations that are new both to genetics and epidemiology (“New/Nonzero”) and interesting null results (“New/Low”). The p-values are uncorrected p-values. Results that pass multiple testing correction for the 300 tests in Figure 2 at 1% FDR have a single asterisk; results that pass Bonferroni correction have two asterisks. We present some genetic correlations that agree with epidemiological associations but that do not pass multiple testing correction in these data.

Category	Phenotype 1	Phenotype 2	r_g (SE)	p-value
Epidemiological	Age at menarche	Adult height	0.13 (0.03)	2×10⁻⁶**
	Age at menarche	Type 2 diabetes	−0.13 (0.04)	2×10⁻³*
	Age at menarche	Triglycerides	−0.12 (0.04)	1×10⁻³*
	Coronary artery disease	Age at menarche	−0.12 (0.05)	3×10⁻²
	Coronary artery disease	Years of education	−0.25 (0.06)	1×10⁻⁴**
	Coronary artery disease	Adult height	−0.17 (0.04)	1×10⁻⁵**
	Alzheimer’s	Years of education	−0.29 (0.1)	5×10⁻³*
	Bipolar disorder	Years of education	0.30 (0.06)	9×10⁻⁷**
	BMI	Years of education	−0.28 (0.03)	6×10⁻¹⁶**
	Triglycerides	Years of education	−0.26 (0.06)	2×10⁻⁸**
	Anorexia nervosa	BMI	−0.18 (0.04)	3×10⁻⁷**
	Ever/never smoker	Years of education	−0.36 (0.06)	2×10⁻⁸**
	Ever/never smoker	BMI	0.20 (0.04)	8×10⁻⁷**
New/Nonzero	Autism spectrum disorder	Years of education	0.30 (0.08)	2×10⁻⁴*
	Ulcerative colitis	Childhood obesity	−0.34 (0.08)	3.1 × 10⁻⁵**
	Anorexia nervosa	Schizophrenia	0.19 (0.04)	2×10⁻⁵**
New/Low	Schizophrenia	Alzheimer’s	0.04 (0.06)	>0.1
	Schizophrenia	Ever/never smoker	0.04 (0.06)	>0.1
	Schizophrenia	Triglycerides	−0.04 (0.04)	>0.1
	Schizophrenia	LDL cholesterol	−0.04 (0.04)	>0.1
	Schizophrenia	HDL cholesterol	0.03 (0.04)	>0.1
	Schizophrenia	Rheumatoid arthritis	−0.04 (0.05)	>0.1
	Crohn’s disease	Rheumatoid arthritis	−0.03 (0.08)	>0.1
	Ulcerative colitis	Rheumatoid arthritis	0.09 (0.08)	>0.1

The second section of Table 2 lists three results that are, to the best of our knowledge, new both to genetics and epidemiology. One, we find a positive genetic correlation between anorexia nervosa and schizophrenia. Comorbidity between eating and psychotic disorders has not been thoroughly investigated in the psychiatric literature [43, 44], and this result raises the possibility of similarity between these classes of disease. Two, we estimate a negative genetic correlation between ulcerative colitis (UC) and childhood obesity. The relationship between premorbid BMI and ulcerative colitis is not well-understood; exploring this relationship may be a fruitful direction for further investigation. Three, we estimate a positive genetic correlation between autism spectrum disorder (ASD) and educational attainment (which has very high genetic correlation with IQ [37, 45, 46]). The ASD summary statistics were generated using a case-pseudocontrol study design, so this result cannot be explained by oversampling of ASD cases from the more highly educated parents, which is observed epidemiologically [47]. The distribution of IQ among individuals with ASD has lower mean than the general population, but with heavy tails [48] (i.e., an excess of individuals with low and high IQ). There is also emerging evidence that the genetic architecture of ASD varies across the IQ distribution [49]. The third section of Table 2 lists interesting examples where the genetic correlation is close to zero with small standard error. The low genetic correlation between schizophrenia and rheumatoid arthritis is interesting because schizophrenia has been observed to be protective for rheumatoid arthritis [50], though the epidemiological effect is weak, so it is possible that there is a real genetic correlation, but it is too small for us to detect. 
The low genetic correlation between schizophrenia and smoking is notable because of the increased tobacco use (both prevalence and number of cigarettes per day) among individuals with schizophrenia [51]. The low genetic correlation between schizophrenia and plasma lipid levels contrasts with a previous report of pleiotropy between schizophrenia and triglycerides [52]. Pleiotropy (unsigned) is different from genetic correlation (signed; see Methods); however, the pleiotropy reported by Andreassen et al. [52] could be explained by the sensitivity of the method used to the properties of a small number of regions with strong LD, rather than trait biology (Supplementary Figure 5). We estimate near-zero genetic correlation between Alzheimer’s disease and schizophrenia. The genetic correlations between Alzheimer’s disease and the other psychiatric traits (anorexia nervosa, bipolar disorder, major depression, ASD) are also close to zero, but with larger standard errors, due to smaller sample sizes. This suggests that the genetic basis of Alzheimer’s disease is distinct from that of psychiatric conditions. Last, we estimate near-zero genetic correlation between rheumatoid arthritis (RA) and both Crohn’s disease (CD) and UC. Although these diseases share many associated loci [53, 54], there appears to be no directional trend: some RA risk alleles are also risk alleles for UC and CD, but many RA risk alleles are protective for UC and CD [53], yielding near-zero genetic correlation. This example highlights the distinction between pleiotropy and genetic correlation (Methods). Finally, the estimates of genetic correlations among metabolic traits are consistent with the estimates obtained using REML in Vattikuti et al. [15] (Supplementary Table 6), and are directionally consistent with the recent Mendelian randomization results from Wuertz et al. [55]. The estimate of 0.54 (0.07) for the genetic correlation between CD and UC is consistent with the estimate of 0.62 (0.04) from Chen et al. [16].

Discussion

We have described a new method for estimating genetic correlation from GWAS summary statistics, which we applied to a dataset of GWAS summary statistics consisting of 24 traits and more than 1.5 million unique phenotype measurements. We reported several new findings that would have been difficult to obtain with existing methods, including a positive genetic correlation between anorexia nervosa and schizophrenia. Our method replicated many previously-reported GWAS-based genetic correlations, and confirmed observations of overlap among genome-wide significant SNPs, MR results and epidemiological associations. This method is an advance for several reasons: it does not require individual genotypes, genome-wide significant SNPs or LD-pruning (which loses information if causal SNPs are in LD). Our method is not biased by sample overlap and is computationally fast. Furthermore, our approach does not require measuring multiple traits on the same individuals, so it scales easily to studies of thousands of pairs of traits. These advantages allow us to estimate genetic correlation for many more pairs of phenotypes than was possible with existing methods. The challenges in interpreting genetic correlation are similar to the challenges in MR. We highlight two difficulties. First, genetic correlation is immune to environmental confounding, but is subject to genetic confounding, analogous to confounding by pleiotropy in MR. For example, the genetic correlation between HDL and CAD in Figure 2 could result from a causal effect HDL→CAD, but could also be mediated by triglycerides (TG) [9, 56], represented graphically [57] as HDL←G→TG→CAD, where G is the set of genetic variants with effects on both HDL and TG. Extending genetic correlation to multiple genetically correlated phenotypes is an important direction for future work [58]. 
Second, although genetic correlation estimates are not biased by oversampling of cases, they are affected by other forms of biased sampling, such as misclassification [14] and case/control/covariate sampling (e.g., a BMI-matched study of T2D). We note several limitations of cross-trait LD Score regression as an estimator of genetic correlation. First, cross-trait LD Score regression requires larger sample sizes than methods that use individual genotypes in order to achieve equivalent standard error. Second, cross-trait LD Score regression is not currently applicable to samples from recently-admixed populations. Third, we have not investigated the potential impact of assortative mating on estimates of genetic correlation, which remains as a future direction. Fourth, methods built from polygenic models, such as cross-trait LD Score regression and REML, are most effective when applied to traits with polygenic genetic architectures. For traits where significant SNPs account for a sizable proportion of heritability, analyzing only these SNPs can be more powerful. Developing methods that make optimal use of both large-effect SNPs and diffuse polygenic signal is a direction for future research. Despite these limitations, we believe that the cross-trait LD Score regression estimator of genetic correlation will be a useful addition to the epidemiological toolbox, because it allows for rapid screening for correlations among a diverse set of traits, without the need for measuring multiple traits on the same individuals or genome-wide significant SNPs.

Methods

Definition of Genetic Covariance and Correlation

All definitions refer to narrow-sense heritabilities and genetic covariances. Let $S$ denote a set of $M$ SNPs, let $X$ denote a vector of additively (0-1-2) coded genotypes for the SNPs in $S$, and let $y_1$ and $y_2$ denote phenotypes. Define $\beta := \operatorname{argmax}_{\alpha \in \mathbb{R}^M} \mathrm{Cor}[y_1, X\alpha]$, where the maximization is performed in the population (i.e., in the infinite data limit). Let $\gamma$ denote the corresponding vector for $y_2$. This is a projection, so $\beta$ is unique modulo SNPs in perfect LD. Define $h^2_S$, the heritability explained by SNPs in $S$, as $h^2_S(y_1) := \sum_{j \in S} \beta_j^2$ and $\rho_S(y_1, y_2)$, the genetic covariance among SNPs in $S$, as $\rho_S(y_1, y_2) := \sum_{j \in S} \beta_j \gamma_j$. The genetic correlation among SNPs in $S$ is $r_S(y_1, y_2) := \rho_S(y_1, y_2)/\sqrt{h^2_S(y_1)\, h^2_S(y_2)}$, which lies in [−1,1]. Following [11], we use subscript $g$ (as in $h^2_g$, $\rho_g$, $r_g$) when the set of SNPs is genotyped and imputed SNPs in GWAS.

SNP genetic correlation ($r_g$) is different from family study genetic correlation. In a family study, the relationship matrix captures information about all genetic variation, not just common SNPs. As a result, family studies estimate the total genetic correlation ($S$ equals all variants). Unlike the relationship between SNP-heritability [11] and total heritability, for which $h^2_g \le h^2$, no similar relationship holds between SNP genetic correlation and total genetic correlation. If $\beta$ and $\gamma$ are more strongly correlated among common variants than rare variants, then the total genetic correlation will be less than the SNP genetic correlation.

Genetic correlation is (asymptotically) proportional to Mendelian randomization estimates. If we use a genetic instrument $g_i := \sum_{j \in S} X_{ij}\beta_j$ to estimate the effect $b_{12}$ of $y_1$ on $y_2$, the 2SLS estimate is $\hat{b}_{2SLS} := g^{T}y_2 / g^{T}y_1$ [59]. The expectations of the numerator and denominator are $E[g^{T}y_2] = \rho_S(y_1, y_2)$ and $E[g^{T}y_1] = h^2_S(y_1)$. Thus, $\operatorname{plim}_{N\to\infty} \hat{b}_{2SLS} = r_S(y_1, y_2)\sqrt{h^2_S(y_2)/h^2_S(y_1)}$. If we use the same set $S$ of SNPs to estimate $b_{12}$ and $b_{21}$ (e.g., if $S$ is the set of all common SNPs, as in the genetic correlation analyses in this paper), then this procedure is symmetric in $y_1$ and $y_2$.

Genetic correlation is different from pleiotropy. Two traits have a pleiotropic relationship if many variants affect both.
Genetic correlation is a stronger condition than pleiotropy: to exhibit genetic correlation, the directions of effect must also be consistently aligned.
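
The distinction between pleiotropy and genetic correlation can be made concrete with a toy calculation. The sketch below assumes the definitions h²_S = Σ β_j², ρ_S = Σ β_j γ_j and r_S = ρ_S / sqrt(h²_S(y1) h²_S(y2)) for per-SNP effect vectors β and γ; the function name and the four-SNP effect vectors are hypothetical.

```python
import numpy as np


def snp_genetic_summary(beta, gamma):
    """Return (h2_1, h2_2, rho, r) for two per-SNP effect vectors."""
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    h2_1 = float(np.sum(beta ** 2))      # heritability of trait 1 explained by S
    h2_2 = float(np.sum(gamma ** 2))     # heritability of trait 2 explained by S
    rho = float(np.sum(beta * gamma))    # genetic covariance among SNPs in S
    r = rho / np.sqrt(h2_1 * h2_2)       # genetic correlation
    return h2_1, h2_2, rho, r


# Every SNP affects both traits in both scenarios (complete pleiotropy).
beta = [0.1, 0.1, 0.1, 0.1]

# Effects on trait 2 point the same way as the effects on trait 1.
aligned = snp_genetic_summary(beta, [0.2, 0.2, 0.2, 0.2])

# Effects on trait 2 have alternating signs, so there is no directional trend.
mixed = snp_genetic_summary(beta, [0.2, -0.2, 0.2, -0.2])
```

Both scenarios are fully pleiotropic, yet only the aligned one produces a nonzero genetic correlation, mirroring the RA versus CD/UC example in the Results.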

Cross-Trait LD Score Regression

Recall from the Overview of Methods that the cross-trait LD Score regression equation is

$$E[z_{1j}z_{2j}] = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}},$$

where $z_{ij}$ denotes the z-score for study $i$ and SNP $j$, $N_i$ is the sample size for study $i$, $\rho_g$ is genetic covariance, $\ell_j$ is LD Score [19], $N_s$ is the number of individuals included in both studies, and $\rho$ is the phenotypic correlation among the $N_s$ overlapping samples. We derive this equation in the Supplementary Note. We estimate genetic covariance by regressing $z_{1j}z_{2j}$ against $\ell_j\sqrt{N_{1j}N_{2j}}$ (where $N_{ij}$ is the sample size for SNP $j$ in study $i$), then multiplying the resulting slope by $M$, the number of SNPs in the reference panel with MAF between 5% and 50% (technically, this is an estimate of the genetic covariance among SNPs with 5–50% MAF; Supplementary Note). If we know the correct value of the intercept term $\rho N_s/\sqrt{N_1 N_2}$ ahead of time, we can reduce the standard error by constraining the intercept to this value using the -constrain-intercept flag in ldsc (for pairs of binary traits, we give a corresponding expression in terms of the number of overlapping cases and controls in the Supplementary Note). Note that this works even when there is known nonzero sample overlap. We recommend using the in-sample estimate of $\rho$ (denoted $\hat{\rho}$), rather than the population value of $\rho$. Under unbiased sampling, $\hat{\rho}$ is consistent for $\rho$ with $O(1/N)$ variance, so in this case the distinction between $\rho$ and $\hat{\rho}$ is not of great importance. Under biased sampling (as discussed in the previous section), the expected LD Score regression intercept depends on the expected sample correlation $E[y_1 y_2 \mid s=1]$ (which is estimated consistently by $\hat{\rho}$), not the population $\rho$. Thus, we advise using $\hat{\rho}$ rather than $\rho$ when constraining the intercept.
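
The regression described above can be sketched on simulated summary statistics. This is a deliberately simplified illustration, not the ldsc implementation: it assumes the regression equation E[z1j z2j] = sqrt(N1 N2) ρ_g ℓ_j / M + ρ N_s / sqrt(N1 N2), draws z-scores independently across SNPs (real z-scores are correlated through LD), uses unweighted least squares, and treats every reference SNP as a regression SNP. All function names and parameter values are hypothetical.

```python
import numpy as np


def simulate_z_pairs(ld, n1, n2, n_s, h2_1, h2_2, rho_g, rho, m, rng):
    """Draw (z1, z2) per SNP with variances and covariance linear in LD Score."""
    var1 = n1 * h2_1 * ld / m + 1.0
    var2 = n2 * h2_2 * ld / m + 1.0
    cov = np.sqrt(n1 * n2) * rho_g * ld / m + rho * n_s / np.sqrt(n1 * n2)
    z1 = rng.normal(0.0, np.sqrt(var1))
    # Draw z2 conditional on z1 so that Cov[z1, z2] equals cov.
    z2 = cov / var1 * z1 + rng.normal(0.0, np.sqrt(var2 - cov ** 2 / var1))
    return z1, z2


def estimate_rho_g(z1, z2, ld, n1, n2, m):
    """Regress z1*z2 on ld*sqrt(N1*N2); slope times M estimates rho_g."""
    x = ld * np.sqrt(n1 * n2)
    slope, intercept = np.polyfit(x, z1 * z2, 1)
    return slope * m, intercept


rng = np.random.default_rng(0)
m = 50_000                                   # hypothetical number of SNPs
ld = rng.uniform(1.0, 100.0, size=m)         # hypothetical LD Scores
params = dict(n1=100_000, n2=100_000, h2_1=0.5, h2_2=0.5, rho_g=0.25, m=m)

# Scenario A: no shared samples. Scenario B: complete overlap with rho = 0.3.
z1a, z2a = simulate_z_pairs(ld, n_s=0, rho=0.0, rng=rng, **params)
z1b, z2b = simulate_z_pairs(ld, n_s=100_000, rho=0.3, rng=rng, **params)

rho_g_no_overlap, _ = estimate_rho_g(z1a, z2a, ld, 100_000, 100_000, m)
rho_g_overlap, _ = estimate_rho_g(z1b, z2b, ld, 100_000, 100_000, m)
```

In both scenarios the slope-based estimate should land near the simulated genetic covariance, because the overlap term enters only through the intercept. The intercept itself is estimated much less precisely in this unweighted toy setting.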

Regression Weights

For heritability estimation, we use the regression weights from [19]. If effect sizes for both phenotypes are drawn from a bivariate normal distribution, then the optimal regression weights for genetic covariance estimation are inversely proportional to

$$\mathrm{Var}[z_{1j}z_{2j} \mid \ell_j] = \left(\frac{N_1 h_1^2 \ell_j}{M} + 1\right)\left(\frac{N_2 h_2^2 \ell_j}{M} + 1\right) + \left(\frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}}\right)^2$$

(Supplementary Note). This quantity depends on several parameters ($h_1^2$, $h_2^2$, $\rho_g$, $\rho$, $N_s$) which are not known a priori, so it is necessary to estimate them from the data. We compute the weights in two steps:

1. The first regression is weighted using heritabilities from the single-trait LD Score regressions, $\rho N_s = 0$, and a preliminary estimate of $\rho_g$.
2. The second regression is weighted using the estimates of $\rho N_s$ and $\rho_g$ from step 1.

The genetic covariance estimate that we report is the estimate from the second regression. Linear regression with weights estimated from the data is called feasible generalized least squares (FGLS). FGLS has the same limiting distribution as WLS with optimal weights, so WLS p-values are valid for FGLS [59]. We multiply the heteroskedasticity weights by $1/\ell_j$ (where $\ell_j$ is LD Score with sum over regression SNPs) in order to downweight SNPs that are overcounted. This is a heuristic: the optimal approach is to rotate the data so that it is de-correlated, but this rotation matrix is difficult to compute.
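
A minimal sketch of such weights is given below. It assumes that the variance of the product of two zero-mean bivariate normal z-scores is Var[z1]·Var[z2] + Cov[z1, z2]², with Var[z_i] = N_i h_i² ℓ_j / M + 1 and the covariance given by the cross-trait regression equation; this is a standard identity rather than a formula quoted from the Supplementary Note. The function name, its arguments, and the use of a single LD Score for both the variance and the overcounting correction are simplifying assumptions.

```python
import numpy as np


def gencov_weights(ld, n1, n2, n_s, h2_1, h2_2, rho_g, rho, m):
    """Inverse-variance weights for z1*z2, further divided by LD Score."""
    ld = np.asarray(ld, dtype=float)
    var1 = n1 * h2_1 * ld / m + 1.0
    var2 = n2 * h2_2 * ld / m + 1.0
    cov = np.sqrt(n1 * n2) * rho_g * ld / m + rho * n_s / np.sqrt(n1 * n2)
    het_var = var1 * var2 + cov ** 2      # assumed Var[z1 * z2 | ld]
    # 1/het_var handles heteroskedasticity; 1/ld downweights overcounted SNPs.
    return 1.0 / (het_var * ld)


# Hypothetical inputs: first-step weights use rho * N_s = 0.
w = gencov_weights(
    ld=[1.0, 10.0, 100.0], n1=1000, n2=1000, n_s=0,
    h2_1=0.5, h2_2=0.5, rho_g=0.0, rho=0.0, m=1000,
)
```

The weights fall quickly with LD Score, so high-LD SNPs, whose products of z-scores are both noisier and partly redundant, contribute less to the fitted slope.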

Two-Step Estimator

As noted in [19], SNPs with very large effect sizes can result in large LD Score regression standard errors for single-trait LD Score regression with unconstrained intercept; cross-trait LD Score regression with unconstrained intercept behaves similarly. This is due to the well-known fact that linear regression deals poorly with outliers in the response variable (LD Score regression with constrained intercept is not nearly as adversely affected by large-effect SNPs). The solution proposed in [19] was to remove SNPs with $\chi^2 > 80$ from the LD Score regression. This is a satisfactory solution when the goal is to estimate the LD Score regression intercept. If the goal is to distinguish polygenicity from population stratification, and we are willing to assume that the population stratification is subtle, such that SNPs with $\chi^2 > 80$ are much more likely to be real causal SNPs rather than artifacts, then we can make the task much easier by removing those SNPs. However, this is unsatisfactory if the goal is to estimate $h^2$: ignoring large-effect SNPs with $\chi^2 > 80$ would bias estimates of $h^2$ and $\rho_g$ towards zero. Therefore, for estimating $h^2$ or $\rho_g$, we take a two-step approach. The first step is to estimate the LD Score regression intercept with all SNPs with $\chi^2 > 30$ removed (i.e., all genome-wide significant SNPs; the threshold can be adjusted with the -two-step flag in ldsc). The second step is to estimate $h^2$ or $\rho_g$ using all SNPs and constrained intercept LD Score regression with the intercept constrained to the value from the first step (note that we account for uncertainty in the intercept when computing a standard error; see the next section).
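
The two-step procedure can be sketched for the single-trait case as follows. This is an unweighted illustration under stated assumptions, not the ldsc code: it assumes the single-trait relationship E[χ²_j] = N h² ℓ_j / M + a with intercept a, simulates independent SNPs, and uses hypothetical function names and parameter values.

```python
import numpy as np


def two_step_h2(chisq, ld, n, m, chisq_max=30.0):
    """Step 1: fit the intercept without large chi-square SNPs.
    Step 2: fit the slope on all SNPs with the intercept held fixed."""
    keep = chisq <= chisq_max
    _, intercept = np.polyfit(ld[keep], chisq[keep], 1)

    x = n * ld / m                          # regressor whose slope is h2
    h2 = np.sum(x * (chisq - intercept)) / np.sum(x ** 2)
    return h2, intercept


rng = np.random.default_rng(1)
n_regression_snps = 50_000                  # hypothetical regression SNPs
n, m, h2_true = 50_000, 1_000_000, 0.5      # hypothetical N, M and h2
ld = rng.uniform(1.0, 100.0, size=n_regression_snps)

# Simulated z-scores with Var[z] = N * h2 * ld / M + 1 (true intercept 1).
z = rng.normal(0.0, np.sqrt(n * h2_true * ld / m + 1.0))
h2_hat, intercept_hat = two_step_h2(z ** 2, ld, n, m)
```

Filtering on the response in step 1 slightly distorts the fitted line, which is acceptable when the aim of that step is only the intercept; step 2 then uses every SNP so that large-effect variants still contribute to the heritability estimate.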

Assessment of Statistical Significance via Block Jackknife

Summary statistics for SNPs in LD are correlated, so the OLS standard error will be biased downwards. We estimate a heteroskedasticity-and-correlation-robust standard error with a block jackknife over blocks of adjacent SNPs. This is the same procedure used in [19], and gives accurate standard errors in simulations (Table 1). We obtain a standard error for the genetic correlation by using a ratio block jackknife over SNPs. The default setting in ldsc is 200 blocks per genome, which can be adjusted with the -num-blocks flag. For the two-step estimator, if we were to estimate the intercept in the first step, then obtain a jackknife standard error for the second step treating the intercept as fixed, the standard error would be biased downwards, because it would not take into account the uncertainty in the intercept. Instead, we jackknife both steps of the procedure, which appropriately accounts for uncertainty in the intercept and yields a valid standard error.
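
A generic delete-one-block jackknife over adjacent SNPs can be sketched as below. The helper is hypothetical and is not the ldsc implementation; it accepts any estimator, so a ratio statistic can be jackknifed by passing both numerator and denominator columns and letting each leave-one-block-out replicate recompute the ratio.

```python
import numpy as np


def block_jackknife(data, estimator, n_blocks=200):
    """Delete-one-block jackknife estimate and standard error.

    Rows of `data` are assumed to be ordered by genomic position, so that
    contiguous blocks keep correlated neighbours together.
    """
    data = np.asarray(data)
    blocks = np.array_split(np.arange(data.shape[0]), n_blocks)
    full = estimator(data)
    # Re-estimate with each block of adjacent rows removed in turn.
    loo = np.array([estimator(np.delete(data, idx, axis=0)) for idx in blocks])
    pseudo = n_blocks * full - (n_blocks - 1) * loo
    return pseudo.mean(), np.sqrt(pseudo.var(ddof=1) / n_blocks)


rng = np.random.default_rng(2)

# Sanity check on a simple mean with 10 equal blocks.
values = rng.normal(size=1000)
jk_mean, jk_se = block_jackknife(values, np.mean, n_blocks=10)
block_means = values.reshape(10, 100).mean(axis=1)

# Ratio statistic: the whole ratio is recomputed inside every replicate.
pairs = np.column_stack(
    [rng.normal(1.0, 0.1, 1000), rng.normal(2.0, 0.1, 1000)]
)
ratio_est, ratio_se = block_jackknife(
    pairs, lambda d: d[:, 0].mean() / d[:, 1].mean(), n_blocks=10
)
```

For the mean with equal block sizes, the jackknife standard error reduces to the standard error of the block means, which gives a quick way to validate the helper before applying it to regression estimates.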

Reverse Causation

Consider a scenario where a risk factor E1 causes a disease D, but incidence of disease D changes postmorbid levels of E1 (this could occur if, e.g., incidence of disease persuades affected individuals to change their behavior in ways that lower E1). If D is sufficiently common in our GWAS sample, then the genetic correlation may be affected by reverse causation. LD Score regression (or any genetic correlation estimator) will yield a consistent estimate of the cross-sectional genetic correlation between E1 and D at the given timepoint; however, the cross-sectional genetic correlation between E1 and D will be attenuated relative to the genetic correlation between disease and pre-morbid levels of E1. The genetic correlation between disease and pre-morbid levels of the risk factor will typically be the more interesting quantity to estimate, because it is more closely related to the causal effect of E1 on D. We can estimate this quantity by excluding all post-morbid measurements of the risk factor from the risk factor GWAS. This allows us to circumvent reverse causation, at the cost of a small decrease in sample size. If D is uncommon, then modification of behavior after onset of D will account for only a small fraction of the population variance in E1, so the effect of reverse causation on the genetic correlation will be small. Thus, reverse causation is primarily a concern for high-prevalence diseases.

Non-Random Ascertainment

We show in the Supplementary Note that LD Score regression is robust to oversampling of cases in case/control studies, modulo transformation between observed and liability scale heritability and genetic covariance. Oversampling of cases is the most common form of biased sampling, but there are many other forms of biased sampling. For example, consider case/control/covariate ascertainment, where the sampling of cases and controls takes into account a covariate. As a concrete example, we know that high BMI is a major risk factor for T2D. If we wish to discover genetic variants that influence risk for T2D via mechanisms other than BMI, we may wish to perform a case/control study for T2D where we compare BMI-matched cases and controls. If we were to use such a T2D study and a random population study of BMI to compute the genetic correlation between BMI and T2D, the result would be substantially attenuated relative to the population genetic correlation between T2D and BMI. (Note that this example holds irrespective of whether there is sample overlap and applies to all genetic correlation estimators, not just LD Score regression.) More generally, let s=1 denote the event that individual i is selected into our study, and let C denote a vector of covariates describing individual i (which may include the phenotype of individual i). Then we can represent an arbitrary biased sampling scheme by specifying the selection probabilities f(C):=P[s=1|C] (note that case/control ascertainment is the special case where C=y). Suppose that phenotypes are generated following the model from Section 1.1 of the Supplementary Note, but that our sample is selected following the biased sampling scheme f. Let a_j denote the additive genetic component for phenotype j in individual i. 
If there is no direct ascertainment on genotype (i.e., if C does not include genotypes), then the proof of Proposition 1 in the Supplementary Note goes through, except that ρ is replaced with E[y1y2|s=1] and ρ_g is replaced with E[a1a2|s=1]. This has two practical implications. First, in studies with biased sampling schemes and sample overlap, if one wishes to constrain the intercept, one should use the sample correlation between phenotypes ρ̂ rather than the population correlation ρ. Under biased sampling, plim_{N→∞} ρ̂ = E[y1y2|s=1], which is typically not equal to ρ. Second, even if there is no sample overlap, biased sampling can affect the genetic correlation estimate. If the biased sampling mechanism (i.e., the function f(C):=P[s=1|C]) is known, then it may be possible to explicitly model the biased sampling and derive a function for converting genetic correlation estimates from the biased sample to population genetic correlations (similar to the derivations in sections 1.3 and 1.4 of the Supplementary Note). If the biased sampling mechanism can only be described qualitatively, then it should at least be possible to guess the magnitude and direction of the bias by reasoning about E[y1y2|s=1] and E[a1a2|s=1].
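A toy simulation can make the gap between ρ and E[y1y2|s=1] concrete. The sketch below is illustrative only and is not part of the analyses in this paper: it assumes two standard-normal phenotypes with population correlation ρ, and a hypothetical selection function that oversamples individuals with high values of the first phenotype. The names `mean_products` and `oversample_high_y1` are made up for this example.

```python
import random


def mean_products(rho, n, select_prob, seed=0):
    """Simulate n individuals and compare E[y1*y2] in the population
    with E[y1*y2 | s=1] in the selected sample.

    select_prob(y1, y2) plays the role of f(C) = P[s=1 | C].
    Returns (population mean product, selected mean product, n selected).
    """
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    pop_sum = 0.0
    sel_sum = 0.0
    n_sel = 0
    for _ in range(n):
        # Bivariate normal phenotypes with unit variance and correlation rho.
        y1 = rng.gauss(0.0, 1.0)
        y2 = rho * y1 + scale * rng.gauss(0.0, 1.0)
        pop_sum += y1 * y2
        # Biased sampling: keep the individual with probability f(C).
        if rng.random() < select_prob(y1, y2):
            sel_sum += y1 * y2
            n_sel += 1
    return pop_sum / n, sel_sum / n_sel, n_sel


def oversample_high_y1(y1, y2):
    """Hypothetical scheme: always keep y1 > 1, keep 10% of everyone else."""
    return 1.0 if y1 > 1.0 else 0.1
```

Calling `mean_products(0.3, 100000, oversample_high_y1)` returns a population mean product close to 0.3 and a noticeably larger value in the selected sample. In this toy setup selection depends only on y1, so E[y1y2|s=1] = ρ·E[y1²|s=1], and oversampling large values of y1 inflates the second factor.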

Computational Complexity

Let N denote sample size and M the number of SNPs. The computational complexity of the steps involved in LD Score regression is as follows: Computing summary statistics takes O(MN) time. Computing LD Scores takes O(MN) time, though the N for computing LD Scores need not be large. We use the N=378 Europeans from 1000 Genomes. LD Score regression takes O(M) time and space. For a user who has already computed summary statistics and downloads LD Scores from our website (URLs), the computational cost of LD Score regression is O(M) time and space. For comparison, REML takes O(MN^2) time for computing the GRM and O(N^3) time for maximizing the likelihood. Practically, estimating LD Scores takes roughly an hour parallelized over chromosomes, and LD Score regression takes about 15 seconds per pair of phenotypes on a 2014 MacBook Air with a 1.7 GHz Intel Core i7 processor.

We simulated quantitative traits under an infinitesimal model in 2062 controls from a Swedish study. To simulate the standard scenario where many causal SNPs are not genotyped, we simulated phenotypes by drawing causal SNPs from 622,146 best-guess imputed 1000 Genomes SNPs on chromosome 2, then retained only the 90,980 HM3 SNPs with MAF above 5% for LD Score regression.

We note that the simulations in [19] show that single-trait LD Score regression is only minimally biased by uncorrected population stratification and moderate ancestry mismatch between the reference panel used for estimating LD Scores and the population sampled in GWAS. In particular, LD Scores estimated from the 1000 Genomes reference panel are suitable for use with European-ancestry meta-analyses. Put another way, LD Score is only minimally correlated with F_ST, and the differences in LD Score among European populations are not so large as to bias LD Score regression. Since we use the same LD Scores for cross-trait LD Score regression as for single-trait LD Score regression, these results extend to cross-trait LD Score regression.

Summary Statistic Datasets

We selected traits for inclusion in the main text via the following procedure:

1. Begin with all publicly available non-sex-stratified European-only summary statistics.
2. Remove studies that do not provide signed summary statistics.
3. Remove studies not imputed to at least HapMap 2.
4. Remove studies that adjust for heritable covariates [60].
5. Remove all traits with heritability z-score below 4. Genetic correlation estimates for traits with heritability z-score below 4 are generally too noisy to report.
6. Prune clusters of correlated phenotypes (e.g., obesity classes 1–3) by picking the trait from each cluster with the highest heritability z-score.

We then applied the following filters (implemented in the script munge_sumstats.py included with ldsc):

1. For studies that provide a measure of imputation quality, filter to INFO above 0.9.
2. For studies that provide sample MAF, filter to sample MAF above 1%.
3. In order to restrict to well-imputed SNPs in studies that do not provide a measure of imputation quality, filter to HapMap3 [61] SNPs with 1000 Genomes EUR MAF above 5%, which tend to be well-imputed in most studies. This step should be skipped if INFO scores are available for all studies.
4. If sample size varies from SNP to SNP, remove SNPs with effective sample size less than 0.67 times the 90th percentile of sample size.
5. For specialty chip (e.g., metabochip) meta-analyses, remove SNPs with N above the maximum GWAS N.
6. Remove indels and structural variants.
7. Remove strand-ambiguous SNPs.
8. Remove SNPs whose alleles do not match the alleles in 1000 Genomes.

Genomic control (GC) correction at any stage biases the heritability and genetic covariance estimates downwards (see the Supplementary Note of [19]). The biases in the numerator and denominator of genetic correlation cancel exactly, so genetic correlation is not biased by GC correction. A majority of the studies analyzed in this paper used GC correction, so we do not report genetic covariance and heritability. 
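Several of the SNP-level filters listed above can be sketched as follows. This is a minimal illustration and not the code in munge_sumstats.py: the record layout (dictionaries with hypothetical keys `SNP`, `A1`, `A2`, `INFO`, `MAF`, `N`), the function names, the nearest-rank percentile, and the choice to compute the sample-size percentile after the other filters are all assumptions made for the example. Filters that need external reference data (the HapMap3 list and the 1000 Genomes allele check) and the specialty-chip filter are omitted.

```python
import math

# Allele pairs that read the same on both strands.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}
BASES = set("ACGT")


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile (a simple stand-in, assumed for this sketch)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def filter_snps(rows, info_min=0.9, maf_min=0.01, n_frac=0.67):
    """Apply INFO, MAF, indel, strand-ambiguity and sample-size filters.

    rows: list of dicts with keys SNP, A1, A2, N and optionally INFO, MAF.
    A missing or None INFO/MAF value means the study did not provide it,
    so the corresponding filter is skipped for that row.
    """
    kept = []
    for r in rows:
        a1, a2 = r["A1"].upper(), r["A2"].upper()
        if a1 not in BASES or a2 not in BASES:
            continue  # indel or structural variant
        if frozenset((a1, a2)) in AMBIGUOUS:
            continue  # strand-ambiguous SNP (A/T or C/G)
        if r.get("INFO") is not None and r["INFO"] <= info_min:
            continue  # poorly imputed
        if r.get("MAF") is not None and r["MAF"] <= maf_min:
            continue  # rare in the study sample
        kept.append(r)
    if kept:
        # Drop SNPs whose N is below 0.67 times the 90th percentile of N.
        n_min = n_frac * percentile_nearest_rank([r["N"] for r in kept], 0.9)
        kept = [r for r in kept if r["N"] >= n_min]
    return kept
```

The thresholds are exposed as keyword arguments so that the defaults mirror the values stated in the text while remaining easy to vary.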
Data on Alzheimer’s disease were obtained from the following source: International Genomics of Alzheimer’s Project (IGAP) is a large two-stage study based upon genome-wide association studies (GWAS) on individuals of European ancestry. In stage 1, IGAP used genotyped and imputed data on 7,055,881 single nucleotide polymorphisms (SNPs) to meta-analyze four previously-published GWAS datasets consisting of 17,008 Alzheimer’s disease cases and 37,154 controls (The European Alzheimer’s Disease Initiative, EADI; the Alzheimer Disease Genetics Consortium, ADGC; The Cohorts for Heart and Aging Research in Genomic Epidemiology consortium, CHARGE; The Genetic and Environmental Risk in AD consortium, GERAD). In stage 2, 11,632 SNPs were genotyped and tested for association in an independent set of 8,572 Alzheimer’s disease cases and 11,312 controls. Finally, a meta-analysis was performed combining results from stages 1 and 2. We only used stage 1 data for LD Score regression.

54 in total

1. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.

Authors: Brendan K Bulik-Sullivan; Po-Ru Loh; Hilary K Finucane; Stephan Ripke; Jian Yang; Nick Patterson; Mark J Daly; Alkes L Price; Benjamin M Neale
Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330

2. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies.

Authors: Hugues Aschard; Bjarni J Vilhjálmsson; Amit D Joshi; Alkes L Price; Peter Kraft
Journal: Am J Hum Genet Date: 2015-01-29 Impact factor: 11.025

3. Associations of adult height and its components with mortality: a report from cohort studies of 135,000 Chinese women and men.

Authors: Na Wang; Xianglan Zhang; Yong-Bing Xiang; Gong Yang; Hong-Lan Li; Jing Gao; Hui Cai; Yu-Tang Gao; Wei Zheng; Xiao-Ou Shu
Journal: Int J Epidemiol Date: 2011-12 Impact factor: 7.196

4. Autism spectrum disorder severity reflects the average contribution of de novo and familial influences.

Authors: Elise B Robinson; Kaitlin E Samocha; Jack A Kosmicki; Lauren McGrath; Benjamin M Neale; Roy H Perlis; Mark J Daly
Journal: Proc Natl Acad Sci U S A Date: 2014-10-06 Impact factor: 11.205

5. Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits.

Authors: Shashaank Vattikuti; Juen Guo; Carson C Chow
Journal: PLoS Genet Date: 2012-03-29 Impact factor: 5.917

6. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche.

Authors: John Rb Perry; Felix Day; Cathy E Elks; Patrick Sulem; Kari Stefansson; Joanne M Murabito; Ken K Ong; Deborah J Thompson; Teresa Ferreira; Chunyan He; Daniel I Chasman; Tõnu Esko; Gudmar Thorleifsson; Eva Albrecht; Wei Q Ang; Tanguy Corre; Diana L Cousminer; Bjarke Feenstra; Nora Franceschini; Andrea Ganna; Andrew D Johnson; Sanela Kjellqvist; Kathryn L Lunetta; George McMahon; Ilja M Nolte; Lavinia Paternoster; Eleonora Porcu; Albert V Smith; Lisette Stolk; Alexander Teumer; Natalia Tšernikova; Emmi Tikkanen; Sheila Ulivi; Erin K Wagner; Najaf Amin; Laura J Bierut; Enda M Byrne; Jouke-Jan Hottenga; Daniel L Koller; Massimo Mangino; Tune H Pers; Laura M Yerges-Armstrong; Jing Hua Zhao; Irene L Andrulis; Hoda Anton-Culver; Femke Atsma; Stefania Bandinelli; Matthias W Beckmann; Javier Benitez; Carl Blomqvist; Stig E Bojesen; Manjeet K Bolla; Bernardo Bonanni; Hiltrud Brauch; Hermann Brenner; Julie E Buring; Jenny Chang-Claude; Stephen Chanock; Jinhui Chen; Georgia Chenevix-Trench; J Margriet Collée; Fergus J Couch; David Couper; Andrea D Coveillo; Angela Cox; Kamila Czene; Adamo Pio D'adamo; George Davey Smith; Immaculata De Vivo; Ellen W Demerath; Joe Dennis; Peter Devilee; Aida K Dieffenbach; Alison M Dunning; Gudny Eiriksdottir; Johan G Eriksson; Peter A Fasching; Luigi Ferrucci; Dieter Flesch-Janys; Henrik Flyger; Tatiana Foroud; Lude Franke; Melissa E Garcia; Montserrat García-Closas; Frank Geller; Eco Ej de Geus; Graham G Giles; Daniel F Gudbjartsson; Vilmundur Gudnason; Pascal Guénel; Suiqun Guo; Per Hall; Ute Hamann; Robin Haring; Catharina A Hartman; Andrew C Heath; Albert Hofman; Maartje J Hooning; John L Hopper; Frank B Hu; David J Hunter; David Karasik; Douglas P Kiel; Julia A Knight; Veli-Matti Kosma; Zoltan Kutalik; Sandra Lai; Diether Lambrechts; Annika Lindblom; Reedik Mägi; Patrik K Magnusson; Arto Mannermaa; Nicholas G Martin; Gisli Masson; Patrick F McArdle; Wendy L McArdle; Mads Melbye; Kyriaki Michailidou; Evelin Mihailov; Lili Milani; 
Roger L Milne; Heli Nevanlinna; Patrick Neven; Ellen A Nohr; Albertine J Oldehinkel; Ben A Oostra; Aarno Palotie; Munro Peacock; Nancy L Pedersen; Paolo Peterlongo; Julian Peto; Paul Dp Pharoah; Dirkje S Postma; Anneli Pouta; Katri Pylkäs; Paolo Radice; Susan Ring; Fernando Rivadeneira; Antonietta Robino; Lynda M Rose; Anja Rudolph; Veikko Salomaa; Serena Sanna; David Schlessinger; Marjanka K Schmidt; Mellissa C Southey; Ulla Sovio; Meir J Stampfer; Doris Stöckl; Anna M Storniolo; Nicholas J Timpson; Jonathan Tyrer; Jenny A Visser; Peter Vollenweider; Henry Völzke; Gerard Waeber; Melanie Waldenberger; Henri Wallaschofski; Qin Wang; Gonneke Willemsen; Robert Winqvist; Bruce Hr Wolffenbuttel; Margaret J Wright; Dorret I Boomsma; Michael J Econs; Kay-Tee Khaw; Ruth Jf Loos; Mark I McCarthy; Grant W Montgomery; John P Rice; Elizabeth A Streeten; Unnur Thorsteinsdottir; Cornelia M van Duijn; Behrooz Z Alizadeh; Sven Bergmann; Eric Boerwinkle; Heather A Boyd; Laura Crisponi; Paolo Gasparini; Christian Gieger; Tamara B Harris; Erik Ingelsson; Marjo-Riitta Järvelin; Peter Kraft; Debbie Lawlor; Andres Metspalu; Craig E Pennell; Paul M Ridker; Harold Snieder; Thorkild Ia Sørensen; Tim D Spector; David P Strachan; André G Uitterlinden; Nicholas J Wareham; Elisabeth Widen; Marek Zygmunt; Anna Murray; Douglas F Easton
Journal: Nature Date: 2014-07-23 Impact factor: 49.962

7. Genetic and epigenetic fine mapping of causal autoimmune disease variants.

Authors: Kyle Kai-How Farh; Alexander Marson; Jiang Zhu; Markus Kleinewietfeld; William J Housley; Samantha Beik; Noam Shoresh; Holly Whitton; Russell J H Ryan; Alexander A Shishkin; Meital Hatan; Marlene J Carrasco-Alfonso; Dita Mayer; C John Luckey; Nikolaos A Patsopoulos; Philip L De Jager; Vijay K Kuchroo; Charles B Epstein; Mark J Daly; David A Hafler; Bradley E Bernstein
Journal: Nature Date: 2014-10-29 Impact factor: 49.962

8. A framework for the interpretation of de novo mutation in human disease.

Authors: Kaitlin E Samocha; Elise B Robinson; Stephan J Sanders; Christine Stevens; Aniko Sabo; Lauren M McGrath; Jack A Kosmicki; Karola Rehnström; Swapan Mallick; Andrew Kirby; Dennis P Wall; Daniel G MacArthur; Stacey B Gabriel; Mark DePristo; Shaun M Purcell; Aarno Palotie; Eric Boerwinkle; Joseph D Buxbaum; Edwin H Cook; Richard A Gibbs; Gerard D Schellenberg; James S Sutcliffe; Bernie Devlin; Kathryn Roeder; Benjamin M Neale; Mark J Daly
Journal: Nat Genet Date: 2014-08-03 Impact factor: 38.330

9. Metabolic signatures of adiposity in young adults: Mendelian randomization analysis and effects of weight change.

Authors: Peter Würtz; Qin Wang; Antti J Kangas; Rebecca C Richmond; Joni Skarp; Mika Tiainen; Tuulia Tynkkynen; Pasi Soininen; Aki S Havulinna; Marika Kaakinen; Jorma S Viikari; Markku J Savolainen; Mika Kähönen; Terho Lehtimäki; Satu Männistö; Stefan Blankenberg; Tanja Zeller; Jaana Laitinen; Anneli Pouta; Pekka Mäntyselkä; Mauno Vanhala; Paul Elliott; Kirsi H Pietiläinen; Samuli Ripatti; Veikko Salomaa; Olli T Raitakari; Marjo-Riitta Järvelin; George Davey Smith; Mika Ala-Korpela
Journal: PLoS Med Date: 2014-12-09 Impact factor: 11.069

10. Using multivariable Mendelian randomization to disentangle the causal effects of lipid fractions.

Authors: Stephen Burgess; Daniel F Freitag; Hassan Khan; Donal N Gorman; Simon G Thompson
Journal: PLoS One Date: 2014-10-10 Impact factor: 3.240
