
Efficient variance components analysis across millions of genomes.

Ali Pazokitoroudi¹, Yue Wu¹, Kathryn S Burch², Kangcheng Hou², Aaron Zhou¹, Bogdan Pasaniuc^3,4,5, Sriram Sankararaman^6,7,8.

Abstract

While variance components analysis has emerged as a powerful tool in complex trait genetics, existing methods for fitting variance components do not scale well to large-scale datasets of genetic variation. Here, we present a method for variance components analysis that is accurate and efficient: capable of estimating one hundred variance components on a million individuals genotyped at a million SNPs in a few hours. We illustrate the utility of our method in estimating and partitioning variation in a trait explained by genotyped SNPs (SNP-heritability). Analyzing 22 traits with genotypes from 300,000 individuals across about 8 million common and low frequency SNPs, we observe that per-allele squared effect size increases with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD) consistent with the action of negative selection. Partitioning heritability across 28 functional annotations, we observe enrichment of heritability in FANTOM5 enhancers in asthma, eczema, thyroid and autoimmune disorders.

Year: 2020 PMID: 32782262 PMCID: PMC7419517 DOI: 10.1038/s41467-020-17576-9

Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919

Introduction

Variance components analysis[1] has emerged as a versatile tool in human complex trait genetics, enabling studies of the genetic contribution to variation in a trait[2] as well as its distribution across genomic loci[3,4], allele frequencies[3], and functional annotations[3,5,6]. There is increasing interest in applying methods for variance components analysis to large-scale genetic datasets with the goal of uncovering novel insights into the genetic architecture of complex traits[4,7]. A prominent example of the utility of these methods is in the estimation of SNP heritability ($h^2_{\mathrm{SNP}}$)[2], the variance in a trait explained by a given set of genotyped SNPs. Variance components methods for estimating SNP heritability typically assume a genetic variance component that represents the fraction of phenotypic variation explained by the SNPs included in the study and a residual variance component. Recent studies have shown that these “single-component” methods yield biased estimates of SNP heritability due to the linkage disequilibrium (LD) and minor allele frequency (MAF)-dependent architecture of complex traits[8,9]. On the other hand, flexible models with multiple variance components[3,4] that allow SNP effects to vary with MAF and LD have been shown to yield more accurate SNP heritability estimates[8,9]. Recent work has shown that SNP heritability can be estimated with minimal assumptions about the genetic architecture[10]; however, this method cannot partition heritability across categories of SNPs of interest such as functional or population genomic annotations. Partitioning heritability requires fitting multiple variance components, thus creating the need for accurate and scalable methods that can fit tens or even hundreds of variance components to large-scale genomic data to obtain accurate and novel insights into genetic architecture. 
While the ability to fit flexible variance component models to large-scale datasets is essential to obtain accurate and novel insights into genetic architecture, fitting such models requires scalable algorithms. Approaches for estimating variance components typically search for parameter values that maximize the likelihood or the restricted maximum likelihood (REML)[11]. Despite a number of algorithmic improvements[2,4,12-16], computing REML estimates of the variance components on data sets such as the UK Biobank[17] (≈500,000 individuals genotyped at nearly one million SNPs) remains challenging. The reason is that methods for computing these estimators typically perform repeated computations on the input genotypes. We propose a method that can jointly estimate multiple variance components efficiently. Our proposed method, RHE-mc, is a randomized multi-component version of the classical Haseman–Elston regression for heritability estimation[18,19]. RHE-mc builds on our previously proposed method, RHE-reg[20], which uses a randomized algorithm to estimate a single variance component. RHE-mc can simultaneously estimate multiple variance components, as well as variance components associated with continuous and overlapping annotations. Further, unlike REML estimation algorithms, RHE-mc requires only a single pass over the input genotypes that results in a highly memory efficient implementation. The resulting computational efficiency permits RHE-mc to jointly fit 300 variance components in less than an hour on a dataset of about 300,000 individuals and 500,000 SNPs, about two orders of magnitude faster than state-of-the-art methods. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in about 12 h. To demonstrate its utility, we first show that RHE-mc can accurately estimate genome-wide and partitioned SNP heritability under realistic genetic architectures (the functional dependence of SNP effect sizes on MAF and LD). 
We applied RHE-mc to 22 traits measured across 291,273 individuals genotyped at 459,792 common SNPs (MAF > 1%) in the UK Biobank to obtain estimates of genome-wide SNP heritability. We then used RHE-mc to partition heritability for the 22 traits across seven million imputed SNPs (MAF > 0.1%) into 144 bins defined based on MAF and LD. We observe that the per-allele squared effect size tends to increase with lower MAF and LD across the traits considered. Finally, we partitioned heritability for SNPs with MAF > 0.1% across 28 functional annotations. We recover previously reported enrichment of heritability in annotations corresponding to conserved regions[7] and also document enrichment of heritability in FANTOM5 enhancers in eczema, asthma, autoimmune disorders, and thyroid disorders.

Results

Methods overview

RHE-mc aims to fit a variance component model that relates phenotypes $\mathbf{y}$ measured across N individuals to their genotypes over M SNPs $\mathbf{X}$:

$$\mathbf{y} \mid \boldsymbol{\epsilon}, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K = \sum_{k=1}^{K} \mathbf{X}_k \boldsymbol{\beta}_k + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{D}\left(\mathbf{0}, \sigma_e^2 \mathbf{I}_N\right), \qquad \boldsymbol{\beta}_k \sim \mathcal{D}\left(\mathbf{0}, \frac{\sigma_k^2}{M_k} \mathbf{I}_{M_k}\right), \quad k \in \{1, \ldots, K\},$$

where $\mathcal{D}(\boldsymbol{\mu}, \Sigma)$ is an arbitrary distribution with mean $\boldsymbol{\mu}$ and covariance Σ. Each of the M SNPs is assigned to one of K non-overlapping categories so that $\mathbf{X}_k$ is the $N \times M_k$ matrix consisting of standardized genotypes of SNPs belonging to category k (note that the expected heritability is constant within categories when we use standardized genotypes). $\boldsymbol{\beta}_k$ denotes the effect sizes of SNPs assigned to category k, which are drawn from a zero-mean distribution with covariance parameter $\sigma_k^2$ (the variance component of category k), while $\sigma_e^2$ is the residual variance. In this model, the genome-wide SNP heritability is defined as

$$h^2_{\mathrm{SNP}} = \frac{\sum_{k=1}^{K} \sigma_k^2}{\sum_{k=1}^{K} \sigma_k^2 + \sigma_e^2},$$

while the SNP heritability of category k is defined as

$$h_k^2 = \frac{\sigma_k^2}{\sum_{l=1}^{K} \sigma_l^2 + \sigma_e^2}.$$

By choosing categories to represent genomic annotations of interest, e.g., chromosomes, allele frequencies, or functional annotations, these models can be used to estimate the phenotypic variation that can be attributed to the relevant annotation. The key inference problem in this model is the estimation of the variance components: $(\sigma_1^2, \ldots, \sigma_K^2, \sigma_e^2)$. These parameters are typically estimated by maximizing the likelihood or the restricted likelihood. Instead, RHE-mc uses a scalable method-of-moments estimator, i.e., finding values of the variance components such that the population moments match the sample moments[18,19,21-23]. RHE-mc uses a randomized algorithm that avoids explicitly computing N × N genetic relatedness matrices that are required by method-of-moments estimators. Instead, it operates on a smaller matrix formed by multiplying the input genotype matrix with a small number of random vectors (see “Methods” section). The application of a randomized algorithm for SNP heritability estimation using a single variance component was proposed in our previous work, RHE-reg[20]. RHE-mc extends our previous work in several directions. 
RHE-mc can efficiently fit multiple variance components (both non-overlapping and overlapping) and can also handle continuous annotations. The resulting algorithm has scalable runtime as it only requires operating on the genotype matrix one time. Further, RHE-mc uses a streaming implementation that does not require all the genotypes to be stored in memory leading to scalable memory requirements (Supplementary Notes). Finally, RHE-mc uses an efficient implementation of a block Jackknife to estimate standard errors with little computational overhead (Supplementary Notes).
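The block Jackknife mentioned above can be illustrated with a generic delete-one-block sketch. This is not the efficient implementation described in the Supplementary Notes: the function name `block_jackknife_se` and the toy data are assumptions made purely for illustration. Given estimates recomputed with each of J blocks left out in turn, the standard error is obtained from the spread of those leave-one-out estimates.

```python
import numpy as np


def block_jackknife_se(leave_one_out_estimates):
    """Delete-one-block jackknife standard error.

    leave_one_out_estimates[j] is the estimate recomputed with block j removed.
    """
    theta = np.asarray(leave_one_out_estimates, dtype=float)
    n_blocks = theta.shape[0]
    centered = theta - theta.mean(axis=0)
    # Standard jackknife variance: (J - 1) / J * sum_j (theta_j - mean)^2
    variance = (n_blocks - 1) / n_blocks * np.sum(centered ** 2, axis=0)
    return np.sqrt(variance)


# Toy check on a quantity with a known answer: the sample mean of 100 values,
# treating every value as its own block (hypothetical data).
rng = np.random.default_rng(0)
values = rng.normal(size=100)
loo_means = (values.sum() - values) / (values.size - 1)
se = block_jackknife_se(loo_means)
```

For the sample mean, the jackknife standard error coincides with the usual s/√n, which makes this a convenient sanity check before applying the same function to leave-one-block-out heritability estimates.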

Accuracy of genome-wide SNP heritability estimates in simulations

We assessed the accuracy of RHE-mc in estimating genome-wide SNP heritability as previous attempts at estimating SNP heritability have been shown to be sensitive to assumptions about how SNP effect size varies with MAF and LD[8]. Starting with genotypes of M = 593,300 array SNPs over N = 337,205 unrelated white British individuals in the UK Biobank, we simulated phenotypes according to 64 MAF and LD-dependent architectures by varying the SNP heritability, the proportion of variants that have non-zero effects (causal variants or CVs), the distribution of CVs across minor allele frequencies (CVs distributed across all minor allele frequency bins or CVs restricted to either common or low-frequency bins), and the form of coupling between the SNP effect size and MAF as well as LD. For RHE-mc, we partitioned the SNPs into 24 variance components based on six MAF bins as well as four LD bins (see “Methods” section). The key parameter in applying RHE-mc is the number of random vectors B which we set to 10. RHE-mc estimates were relatively insensitive when we increased the number of random vectors B to 100 (Supplementary Figs. 1 and 2, Supplementary Table 1). Across these 64 architectures, RHE-mc is relatively unbiased (a two-sided t-test of the hypothesis of no bias is not rejected across any of the architectures at a p-value < 0.05) with the largest relative bias observed to be 0.5% of the true SNP heritability (Supplementary Fig. 3). We used a block Jackknife (number of blocks = 100) to estimate the standard errors of RHE-mc and confirmed that the estimated standard errors are close to the true SE (Supplementary Table 2). We compared the accuracy of RHE-mc to state-of-the-art methods for heritability estimation that can be applied to large datasets (across architectures where the true SNP heritability was fixed at 0.5). These methods, LDSC[24], SumHer[25], S-LDSC[26], and GRE[10], all leverage summary statistics while RHE-mc requires individual genotype data. 
We found that estimates from the summary-statistic methods tend to be sensitive to the underlying genetic architecture: across 16 architectures, relative biases range from −31% to 27% for LDSC, −27% to 5% for S-LDSC, and −5% to 9% for SumHer (Fig. 1). We also compared to a recently proposed method (GRE[10]) that only estimates genome-wide SNP heritability (without partitioning by MAF/LD) and observed that relative biases ranged from 1% to 1.4% for GRE and from −1.5% to 0.5% for RHE-mc. We also considered architectures in which only rare variants are causal and found RHE-mc is accurate relative to other methods (Supplementary Fig. 4). These results further emphasize that RHE-mc can accurately estimate SNP-heritability through fitting multiple variance components.

Fig. 1

Comparison of estimates of genome-wide SNP heritability from RHE-mc with LDSC, GRE, S-LDSC, and SumHer in large-scale simulations (N = 337,205 unrelated individuals, M = 593,300 array SNPs).

a We compared methods for heritability estimation under 16 different genetic architectures. We set true heritability to 0.5 and varied the MAF range of causal variants (MAF of CV), the coupling of MAF with effect size (a = 0 indicates no coupling with MAF and a = 0.75 indicates coupling with MAF), and the effect of local LD on effect size (b = 0 indicates no dependence on LDAK weights and b = 1 indicates dependence on LDAK weights) (see “Methods” section). Each boxplot represents estimates from 100 simulations. b Relative bias of each method (as a percentage of the true h2) across 16 distinct MAF-dependent and LD-dependent architectures. Each boxplot contains 16 points; each point is the relative bias estimated from 100 simulations under a single genetic architecture. Points and error bars represent the mean and ±2 SE. In a and b, boxplot whiskers extend to the minimum and maximum estimates located within 1.5× interquartile range (IQR) from the first and third quartiles, respectively. Here, we run RHE-mc using 24 bins formed by the combination of six bins based on MAF as well as four bins based on quartiles of the LDAK score of a SNP (see “Methods” section). We run S-LDSC with only 10 MAF bins (see Supplementary Table 5). For a fair comparison, for every method, we computed LD scores and LDAK weights using in-sample LD, and in all simulations we aim to estimate the SNP-heritability explained by the same set of M SNPs.

We compared RHE-mc to the state-of-the-art REML-based variance component estimation method, GCTA-mc (multi-component GREML[8,27,28]) and to exact multi-component Haseman–Elston Regression (HE-mc) as implemented in GCTA[27]. We ran each of these methods by partitioning SNPs into 24 variance components (6 MAF bins by 4 LD bins, see “Methods” section). To make these experiments computationally feasible, we simulated phenotypes starting from a smaller set of genotypes (M = 593,300 array SNPs and N = 10,000 white British individuals). 
Across 16 architectures where the true SNP heritability was fixed at 0.25, the relative biases for RHE-mc range from −3.2% to 3.6%, and from −3.2% to 5% for GCTA-mc (Fig. 2). On average, RHE-mc has standard errors that are 1.1 times larger than GCTA-mc (which range from 0.97 to 1.24) and 1.08 times larger than HE-mc (which range from 1.00 to 1.21).

Fig. 2

Comparison of SNP heritability estimates from RHE-mc with GCTA-mc (GCTA with multiple variance components) and HE-mc (HE with multiple variance components) (N = 10,000 unrelated individuals, M = 593,300 array SNPs).

In a–d we compared heritability estimates from these methods under 16 different genetic architectures. We varied the MAF range of causal variants (MAF of CV), the coupling of MAF with effect size (a), and the effect of local LD on effect size (b = 0 indicates no dependence on LDAK weights and b = 1 indicates dependence on LDAK weights) (see “Methods” section). We ran 100 replicates where the true heritability of the phenotype is 0.25. We run RHE-mc, HE-mc, and GCTA-mc using 24 bins formed by the combination of six bins based on MAF as well as four bins based on quartiles of the LDAK score of a SNP (see “Methods” section). Across all different genetic architectures, the relative biases range from −3.2% to 3.6% for RHE-mc, from −3.2% to 5% for GCTA-mc, and from −2.6% to 1.45% for HE-mc. On average, RHE-mc has SEs that are 1.1 and 1.08 times larger than GCTA-mc and HE-mc, respectively. Black points and error bars represent the mean and ±2 SE. Each boxplot represents estimates from 100 simulations. Boxplot whiskers extend to the minimum and maximum estimates located within 1.5× interquartile range (IQR) from the first and third quartiles, respectively. The SEs are computed from 100 simulations (note that GCTA-mc did not run successfully on all 100 simulations).

Accuracy of heritability partitioning in simulations

We also evaluated the accuracy of RHE-mc in partitioning SNP heritability in both small-scale (M = 593,300 SNPs, N = 10,000 individuals) (Supplementary Fig. 5) and large-scale settings (M = 593,300 SNPs, N = 337,205 individuals) (see Supplementary Fig. 6). For these experiments, we restrict our attention to architectures for which the CVs are chosen to lie within a narrow range of MAF. Since the variance components correspond to bins of MAF and LD, a subset of the variance components would have no causal SNPs and hence have a heritability of zero. We assess the accuracy of estimates of heritability aggregated over these components (termed the non-causal bin) as well as the heritability aggregated over the remaining genetic components (termed the causal bin). For example, variance components that correspond to MAF ∈ [0.01, 0.05] would be included in the causal bin for an architecture that restricts the MAF of CVs to lie in the range [0.01, 0.05]. For the small-scale simulations, we compared RHE-mc to GCTA-mc. We ran both methods by partitioning the SNPs into 24 variance components based on six MAF bins as well as four LD bins defined by quartiles of the measure of LDAK weight at a SNP (see “Methods” section). Across the genetic architectures tested, estimates of heritability within each of the causal and non-causal bins are highly concordant between RHE-mc and GCTA-mc (Supplementary Fig. 5, Supplementary Table 3): for the causal bin, the relative bias ranges from −4% to 0.4% for RHE-mc and −3.6% to 2% for GCTA-mc while, for the non-causal bin, the bias ranges from 0 to 0.7% for RHE-mc and 0 to 1.4% for GCTA-mc (Supplementary Table 3). For the large-scale settings, RHE-mc remains accurate: the relative bias ranges from −2.6% to 3.2% (causal bin) and −0.5% to 0.2% (non-causal bin) over the genetic architectures considered (Supplementary Fig. 6, Supplementary Table 4). 
Heritability partitioning has been used to estimate heritability attributed to functional genomic annotations[7]. However, some of these annotations (such as FANTOM5 enhancers) are quite small covering <1% of the genome. We explored the ability of RHE-mc to accurately estimate heritability as a function of the size of the annotation. To this end, we performed simulations using N = 291,273 unrelated white British individuals and M = 459,792 common SNPs. We defined eight annotations (four MAF bins and two LD bins) in which we fixed the enrichment of a selected bin and varied the proportion of SNPs in the selected category. RHE-mc obtained accurate estimates of enrichment even when the selected bin only contained 0.4% of the genome-wide SNPs (comparable to the size of FANTOM5 enhancers). RHE-mc estimates are well-calibrated: when the bin has zero enrichment, RHE-mc rejected the null hypothesis of no enrichment in 5% of the simulations, while attaining high power to reject the null hypothesis even when the bin contained <1% of the SNPs (Supplementary Notes).

Computational efficiency

We benchmarked the runtime and memory usage of RHE-mc as a function of the number of individuals, SNPs, and variance components (Fig. 3, Table 1). We ran RHE-mc with B = 10 random vectors and 22 variance components where each chromosome forms a distinct component. On a dataset of ≈300,000 individuals and ≈500,000 SNPs, RHE-mc can fit 22 variance components in less than an hour and ≈300 variance components (corresponding to bins of size 10 Mb) with little increase in its runtime. On a dataset of one million individuals and one million SNPs, RHE-mc can fit 100 variance components in a few hours. Further, due to its use of a streaming implementation that only requires the genotypes to be operated on once, the memory requirement of RHE-mc is modest: all experiments required <60 GB. We compared the run time and memory usage of RHE-mc with REML-based methods (GCTA[27] and BOLT-REML[4]) on the UK Biobank genotypes consisting of around 500,000 SNPs over varying sample sizes and observed that RHE-mc achieves several orders-of-magnitude reduction in runtime. Summary-statistic methods such as S-LDSC require pre-computed inputs whose cost depends on the runtimes of other software, making a direct comparison of speed difficult. Thus, we have restricted our comparison to individual-level methods where the benchmarking can be done in a comparable manner.

Fig. 3

Comparison of running time of RHE-mc, GCTA-mc, and BOLT-REML.

We compared runtime of RHE-mc, GCTA-mc, and BOLT-REML with increasing sample size N (for a fixed number of SNPs M = 459,792 and components K = 22).

Table 1

Comparison of running time of RHE-mc, GCTA-mc, and BOLT-REML.

Parameters			Running time (h)
N	M	K	RHE-mc	GCTA-mc	BOLT-REML
10,000	459,792	22	<1	1.3	1
100,000	459,792	22	<1	–	40
291,273	459,792	22	<1	–	162
291,273	459,792	300	<1	–	–
291,273	4,824,392	8	3.2	–	–
1,000,000	1,000,000	8	3	–	–
1,000,000	1,000,000	100	12.4	–	–

Here M, N, and K are the number of SNPs, individuals, and variance components, respectively. RHE-mc runs efficiently even on datasets with one million individuals and one million SNPs, and can efficiently estimate hundreds of variance components. All comparisons were performed on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM.

Estimating total SNP heritability in the UK Biobank

We applied RHE-mc to estimate genome-wide SNP heritability for 22 complex traits (6 quantitative and 16 binary traits) measured in the UK Biobank. We analyzed N = 291,273 unrelated white British individuals and M = 459,792 SNPs genotyped on the UK Biobank Axiom array (see “Methods” section). We ran RHE-mc with B = 10 and with SNPs divided into eight bins based on two MAF bins (0.01 ≤ MAF < 0.05, MAF ≥ 0.05) and quartiles of the LD-scores. We compared the estimates from RHE-mc to those from LDSC, S-LDSC, SumHer, and GRE. Restricting our analysis to 18 traits for which the point estimate of genome-wide SNP heritability from RHE-mc is >0.05, the estimates from S-LDSC, GRE, SumHer, and LDSC were on average 2.5%, 10%, 25%, and 67% higher than RHE-mc (Fig. 4). Relative to the simulation results, the estimates from S-LDSC are generally consistent with those from RHE-mc. This is likely due to the fact that, in simulations, our application of S-LDSC used only MAF bins. On the other hand, in real data, we used S-LDSC with the recommended baseline-LD annotations (including functional annotations).

Fig. 4

Estimates of genome-wide SNP heritability from RHE-mc, LDSC, S-LDSC, GRE, and SumHer for 22 complex traits and diseases in the UK Biobank.

We restricted our analysis to N = 291,273 unrelated white British individuals. We applied all methods to M = 459,792 array SNPs (MAF > 1%). We ran S-LDSC with baseline-LD model. For every method, LD scores or LDAK weights are computed using in-sample LD among the SNPs, and we aim to estimate the SNP-heritability explained by the same set of SNPs. RHE-mc was applied to array SNPs with eight MAF/LD bins. Black error bars mark ±2 standard errors centered on the estimated heritability. We used a block Jackknife (number of blocks = 100) to estimate the standard errors. In Supplementary Fig. 7, we also report RHE-mc estimates of genome-wide SNP heritability on M = 4,824,392 imputed SNPs (MAF > 1%) with 8 MAF/LD bins and M = 7,774,235 imputed SNPs (MAF > 0.1%) with 144 MAF/LD bins (see “Methods” section).

We then applied RHE-mc to estimate genome-wide heritability attributable to imputed variants. The genome-wide estimates of SNP heritability from RHE-mc on imputed SNPs (MAF > 1%) are concordant with the estimates from array SNPs (2.8% higher on average). We then analyzed M = 7,774,235 imputed genotypes with MAF > 0.1% using 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). Genome-wide SNP heritability estimates from RHE-mc on imputed SNPs (MAF > 0.1%) are 11.4% higher than RHE-mc on imputed SNPs (MAF > 1%) (Fig. 4, Supplementary Fig. 7). Following previous work[10], we have removed the MHC region to enable a systematic comparison since the estimation of LD in the MHC region can be challenging; it would be of interest to compare methods when the MHC is included.

Partitioning SNP heritability across allele frequency and LD bins

We used RHE-mc to partition SNP heritability of 22 complex traits across MAF and LD bins. We analyzed M = 7,774,235 imputed SNPs with MAF > 0.1%. We used 144 bins formed by 4 LD bins and 36 MAF bins (see “Methods” section). We compute the per-allele squared effect size of SNPs in bin k as $\frac{h_k^2}{2 M_k f_k (1 - f_k)}$, where $h_k^2$ is the heritability estimated in bin k, $f_k$ is the mean MAF in bin k, and $M_k$ is the number of SNPs in bin k. We observe that allelic effect size increases with lower MAF and LD. For height, in the lowest quartile of LD scores, SNPs with MAF ≈ 0.1% have allelic effect sizes ≈27x ± 8 larger than SNPs with MAF ≈ 50%. Similarly, among SNPs with MAF ≈ 50%, SNPs in the lowest quartile of LD scores have allelic effect sizes ≈5x ± 1 larger than SNPs in the highest quartile (Fig. 5 for height; other traits in Supplementary Fig. 9). While these trends have been observed in previous studies[9,29,30], the ability of RHE-mc to jointly fit multiple variance components allows us to estimate effect sizes at SNPs with MAF as low as 0.1%. We caution that negative heritability estimates in bins of lowest MAF and high LD score could arise due to one or more of the following factors: low number of SNPs in this bin (we did not constrain our variance components estimates to be non-negative), the inadequacy of the assumed heritability model, and errors in the imputed genotypes used for the analysis.
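As a small worked illustration of the per-allele squared effect size, the sketch below assumes it takes the form h²_k / (2 M_k f_k (1 − f_k)), i.e., the bin heritability divided by the number of SNPs in the bin and by the allelic variance 2f(1 − f) at the mean MAF. The function name and all numeric inputs are hypothetical and are not values from the study.

```python
import math


def per_allele_squared_effect(h2_bin, n_snps_bin, mean_maf_bin):
    """Assumed form: h2_k / (2 * M_k * f_k * (1 - f_k)) for one MAF/LD bin."""
    if n_snps_bin <= 0:
        raise ValueError("a bin must contain at least one SNP")
    if not 0.0 < mean_maf_bin <= 0.5:
        raise ValueError("mean MAF must lie in (0, 0.5]")
    return h2_bin / (2.0 * n_snps_bin * mean_maf_bin * (1.0 - mean_maf_bin))


# Two hypothetical bins with the same heritability and SNP count but
# different mean MAF (invented numbers, for illustration only).
rare = per_allele_squared_effect(h2_bin=0.01, n_snps_bin=50_000, mean_maf_bin=0.001)
common = per_allele_squared_effect(h2_bin=0.01, n_snps_bin=50_000, mean_maf_bin=0.5)
ratio = rare / common
```

With heritability and SNP count held equal, the ratio depends only on the allelic variances of the two bins, which is why equal heritability in a rarer bin implies a larger per-allele effect.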

Fig. 5

Per-allele squared effect size of height as a function of MAF.

We applied RHE-mc to N = 291,273 unrelated white British individuals and M = 7,774,235 imputed SNPs. SNPs were partitioned into 144 bins based on LD score (4 bins based on quartiles of the LD score with i denoting the ith quartile) and MAF (36 MAF bins) (see “Methods” section). Per-allele effect size squared for bin k is defined as $\frac{h_k^2}{2 M_k f_k (1 - f_k)}$, where $h_k^2$ is the heritability attributed to bin k, $M_k$ is the number of SNPs in bin k, and $f_k$ is the average MAF in bin k. Each point represents the estimated allelic effect size. Bars mark ±2 standard errors centered on the estimated allelic effect size. See Supplementary Fig. 9 for results on all 22 traits.

Partitioning heritability by functional annotations

The ability of RHE-mc to estimate variance components associated with a large number of overlapping annotations enables us to explore the contribution of a variety of functional genomic annotations to trait heritability using individual-level data in the UK Biobank. We applied RHE-mc to jointly partition heritability of 22 complex traits across 28 functional annotations as defined in ref. [7] restricting our analysis to N = 291,273 unrelated white British individuals and M = 5,670,959 imputed SNPs (we restrict to SNPs with MAF > 0.1% which are also present in 1000 Genomes Project). We grouped the traits into five categories (autoimmune, diabetes, respiratory, anthropometric, cardiovascular); for a representative trait from each category, we report enrichment of each of the 28 functional annotations in Fig. 6 (see “Methods” section; for all traits see Supplementary Fig. 8). Our results are largely concordant with previous studies[7,9]: we observe enrichment of heritability across traits in conserved regions (Z-score > 3 in 15 traits). We also observe enrichment of heritability at FANTOM5 enhancers (labeled Enhancer_Andersson in Fig. 6) in asthma, eczema, autoimmune disorders (broad), hypothyroidism, and thyroid disorders (Z-score > 3) even though these annotations cover only 0.4% of the analyzed SNPs.

Fig. 6

Enrichment of heritability across 28 functional annotations.

We applied RHE-mc to N = 291,273 unrelated white British individuals and M = 5,670,959 imputed SNPs (MAF > 0.1% and present in 1000 Genomes Project). SNPs were partitioned based on 28 functional annotations that were defined in a previous study[7]. We grouped 22 traits in the UK Biobank into five categories (autoimmune, diabetes, respiratory, anthropometric, cardiovascular). Here we plot enrichment of five traits (one representative trait per category). Each bar represents the estimated enrichment. Black error bars mark ±2 standard errors centered on the estimated enrichment. Annotations are ordered by the proportions of SNPs in that annotation (given in parentheses). See Supplementary Fig. 8 for results on all 22 traits.

Discussion

We have presented RHE-mc, an algorithm that can efficiently estimate multiple variance components on large-scale genotype data. In light of increasing evidence for SNP effect sizes that vary as a function of covariates, such as MAF and LD and the bias associated with methods that fit only a single variance component[8], the ability to define flexible models endowed with multiple variance components is important to obtain unbiased estimates of fundamental quantities such as SNP heritability. We confirm that RHE-mc yields accurate genome-wide SNP heritability estimates under diverse genetic architectures. In applications to 22 complex traits in the UK Biobank, RHE-mc yields heritability estimates on array SNPs that are lower on average relative to S-LDSC and SumHer. We have explored the utility of RHE-mc in heritability partitioning analyses. These analyses show that per-allele squared effect sizes tend to increase with a decrease in MAF and LD consistent with previous studies[9]. We also partitioned heritability across functional annotations to reveal enrichment of heritability at FANTOM5 enhancers in specific traits such as asthma and eczema. We discuss several limitations of RHE-mc as well as directions for future work. First, the method-of-moments estimator underlying RHE-mc tends to yield slightly larger standard errors, on average, relative to REML estimators. The relative performance of the two methods likely depends on a number of aspects of the study design such as sample size, number of SNPs, the LD structure, relatedness patterns, and the underlying genetic architecture. Nevertheless, our method is designed to be applicable to massive datasets for which the heritability estimates are relatively precise. Developing scalable variance components estimators that are as efficient as REML-based methods is an important direction for future work. Second, this work has primarily explored the partitioning of heritability across discrete annotations. 
While we have shown how the methodology can be extended to continuous-valued annotations (see “Methods” section and Supplementary Notes), it would be of interest to explore variation in trait heritability as a function of the value of an annotation. On the other hand, the ability of RHE-mc to fit many annotations allows the annotation to be divided into a sufficiently large number of bins. Third, we have applied RHE-mc to binary traits available in the UK Biobank treating these traits as continuous. Methods that explicitly model binary traits as well as the underlying ascertainment involved in case-control studies are likely to lead to more accurate heritability estimates[23,31]. For example, the PCGC method[23] is an extension of HE regression and it would be of interest to develop a scalable randomized PCGC estimator. Fourth, RHE-mc requires access to individual-level genotype and phenotype data. Methods that only require summary statistic data (GRE[10], LDSC[24], and SumHer[25]) have the advantage of being applicable to datasets where acquiring access to individual-level data can be challenging[10]. Finally, our method could potentially lead to improvements in association testing, trait prediction, and understanding of polygenic selection.

Methods

Multi-component linear mixed model

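RHE-mc attempts to fit the following variance component model:

$$
\begin{aligned}
\boldsymbol{y} \mid \boldsymbol{\epsilon}, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K &= \sum_{k=1}^{K} \boldsymbol{X}_k \boldsymbol{\beta}_k + \boldsymbol{\epsilon} \\
\boldsymbol{\epsilon} &\sim \mathcal{D}\left(\boldsymbol{0}, \sigma_e^2 \boldsymbol{I}_N\right) \\
\boldsymbol{\beta}_k &\sim \mathcal{D}\left(\boldsymbol{0}, \frac{\sigma_k^2}{M_k} \boldsymbol{I}_{M_k}\right), \quad k \in \{1, \ldots, K\}
\end{aligned}
\tag{1}
$$

Here $\boldsymbol{y}$ is an $N$-vector of centered phenotypes and each of the $M$ SNPs is assigned to one of $K$ non-overlapping categories. Each category $k$ contains $M_k$ SNPs, $k \in \{1, \ldots, K\}$, $\sum_k M_k = M$. $\boldsymbol{X}_k$ is an $N \times M_k$ matrix, where $x_{k,n,m}$ denotes the standardized genotype for individual $n$ at SNP $m$ in category $k$. We have $\sum_n x_{k,n,m} = 0$ and $\sum_n x_{k,n,m}^2 = N$ for $m \in \{1, 2, \ldots, M_k\}$. $\boldsymbol{\beta}_k$ denotes the $M_k$-vector of SNP effect sizes for the $k$th category, where $\mathcal{D}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is an arbitrary distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. In the above model, $\sigma_e^2$ is the residual variance, and $\sigma_k^2$ is the variance component of the $k$th category.

The total SNP heritability is defined as

$$
h_{\mathrm{SNP}}^2 = \frac{\sum_{k=1}^{K} \sigma_k^2}{\sum_{k=1}^{K} \sigma_k^2 + \sigma_e^2}
\tag{2}
$$

The SNP heritability of category $k$ is defined as

$$
h_k^2 = \frac{\sigma_k^2}{\sum_{k'=1}^{K} \sigma_{k'}^2 + \sigma_e^2}
\tag{3}
$$

Enrichment in bin $k$ is defined as

$$
e_k = \frac{h_k^2 / h_{\mathrm{SNP}}^2}{M_k / M}
\tag{4}
$$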
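As a small numerical illustration of these definitions, the sketch below computes total heritability, per-category heritability, and enrichment from a set of variance components. It assumes the usual ratio definitions (each category's variance component divided by the total phenotypic variance, and enrichment as share of heritability over share of SNPs). The function name `heritability_summary` and the input values are hypothetical and not taken from the RHE-mc software.

```python
import numpy as np

def heritability_summary(sigma2_k, sigma2_e, m_k):
    """Heritability and enrichment from variance components (illustrative).

    sigma2_k : per-category genetic variance components
    sigma2_e : residual variance
    m_k      : number of SNPs in each category
    """
    sigma2_k = np.asarray(sigma2_k, dtype=float)
    m_k = np.asarray(m_k, dtype=float)
    total_var = sigma2_k.sum() + sigma2_e
    h2_k = sigma2_k / total_var            # per-category SNP heritability
    h2_snp = h2_k.sum()                    # total SNP heritability
    # Share of heritability divided by share of SNPs in each category.
    enrichment = (h2_k / h2_snp) / (m_k / m_k.sum())
    return h2_snp, h2_k, enrichment

# Two categories with equal variance components but unequal sizes:
# the smaller category is enriched, the larger one is depleted.
h2_snp, h2_k, enrichment = heritability_summary([0.2, 0.2], 0.6, [1000, 3000])
```

An enrichment above 1 means the category explains more heritability than its share of SNPs would suggest.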

Method-of-moments for estimating multiple variance components

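To estimate the variance components, RHE-mc uses a Method-of-Moments (MoM) estimator that searches for parameter values so that the population moments are close to the sample moments[32]. Since $E[\boldsymbol{y}] = \boldsymbol{0}$, we derived the MoM estimates by equating the population covariance to the empirical covariance. The population covariance is given by

$$
\operatorname{cov}(\boldsymbol{y}) = E[\boldsymbol{y}\boldsymbol{y}^T] - E[\boldsymbol{y}]E[\boldsymbol{y}]^T = \sum_{k=1}^{K} \sigma_k^2 \boldsymbol{K}_k + \sigma_e^2 \boldsymbol{I}_N
\tag{5}
$$

Here $\boldsymbol{K}_k = \boldsymbol{X}_k \boldsymbol{X}_k^T / M_k$ is the genetic relatedness matrix (GRM) computed from all SNPs of the $k$th category. Using $\boldsymbol{y}\boldsymbol{y}^T$ as our estimate of the empirical covariance, we need to solve the following least-squares problem to find the variance components:

$$
\left(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_K^2, \hat{\sigma}_e^2\right) = \underset{\sigma_1^2, \ldots, \sigma_K^2, \sigma_e^2}{\arg\min} \left\lVert \boldsymbol{y}\boldsymbol{y}^T - \left( \sum_{k=1}^{K} \sigma_k^2 \boldsymbol{K}_k + \sigma_e^2 \boldsymbol{I}_N \right) \right\rVert_F^2
\tag{6}
$$

The MoM estimator satisfies the following normal equations:

$$
\begin{bmatrix} \boldsymbol{T} & \boldsymbol{b} \\ \boldsymbol{b}^T & N \end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\sigma}}_g^2 \\ \hat{\sigma}_e^2 \end{bmatrix}
=
\begin{bmatrix} \boldsymbol{c} \\ \boldsymbol{y}^T \boldsymbol{y} \end{bmatrix}
\tag{7}
$$

Here $\hat{\boldsymbol{\sigma}}_g^2 = (\hat{\sigma}_1^2, \ldots, \hat{\sigma}_K^2)^T$, $\boldsymbol{T}$ is a $K \times K$ matrix with entries $T_{k,l} = \operatorname{tr}(\boldsymbol{K}_k \boldsymbol{K}_l)$, $k, l \in \{1, \ldots, K\}$, $\boldsymbol{b}$ is a $K$-vector with entries $b_k = \operatorname{tr}(\boldsymbol{K}_k) = N$ (because the columns of each $\boldsymbol{X}_k$ are standardized), and $\boldsymbol{c}$ is a $K$-vector with entries $c_k = \boldsymbol{y}^T \boldsymbol{K}_k \boldsymbol{y}$. Each GRM $\boldsymbol{K}_k$ can be computed in $\mathcal{O}(N^2 M_k)$ time and $\mathcal{O}(N^2)$ memory. Given $K$ GRMs, the quantities $T_{k,l}$, $c_k$, $k, l \in \{1, \ldots, K\}$, can be computed in $\mathcal{O}(K^2 N^2)$. Given the quantities $T_{k,l}$, $c_k$, the normal Eq. (7) can be solved in $\mathcal{O}(K^3)$. Therefore, the total time complexity for estimating the variance components is $\mathcal{O}(N^2 M + K^2 N^2 + K^3)$.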
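The exact estimator described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it forms each GRM explicitly, so it is only practical for small N, and the helper names `standardize` and `mom_exact` are hypothetical. It assumes every SNP column is polymorphic, so that standardization is well defined.

```python
import numpy as np

def standardize(G):
    """Scale each SNP column so that sum_n x = 0 and sum_n x^2 = N."""
    G = np.asarray(G, dtype=float)
    return (G - G.mean(axis=0)) / G.std(axis=0)  # population SD (ddof=0)

def mom_exact(X_blocks, y):
    """Solve the (K+1) x (K+1) normal equations with explicit GRMs.

    X_blocks : list of standardized N x M_k genotype matrices
    y        : centered phenotype vector of length N
    Returns (sigma_1^2, ..., sigma_K^2, sigma_e^2).
    """
    N, K = y.shape[0], len(X_blocks)
    grms = [X @ X.T / X.shape[1] for X in X_blocks]   # K_k = X_k X_k^T / M_k
    A = np.empty((K + 1, K + 1))
    rhs = np.empty(K + 1)
    for k in range(K):
        for l in range(K):
            # tr(K_k K_l) equals the elementwise sum because GRMs are symmetric.
            A[k, l] = np.sum(grms[k] * grms[l])
        A[k, K] = A[K, k] = np.trace(grms[k])         # equals N when standardized
        rhs[k] = y @ grms[k] @ y
    A[K, K] = N
    rhs[K] = y @ y
    return np.linalg.solve(A, rhs)
```

Because the estimator is unconstrained, individual variance components can come out negative in small samples.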

RHE-mc: Randomized estimator of multiple variance components

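The key bottleneck in solving the normal Eq. (7) is the computation of $T_{k,l}$, $k, l \in \{1, \ldots, K\}$, which takes $\mathcal{O}(N^2 M)$. Instead of computing the exact value of $T_{k,l}$, we use an unbiased estimator of the trace[33] based on the following identity: for a given $N \times N$ matrix $\boldsymbol{C}$, $\boldsymbol{z}^T \boldsymbol{C} \boldsymbol{z}$ is an unbiased estimator of $\operatorname{tr}(\boldsymbol{C})$ ($E[\boldsymbol{z}^T \boldsymbol{C} \boldsymbol{z}] = \operatorname{tr}[\boldsymbol{C}]$), where $\boldsymbol{z}$ is a random vector with mean zero and covariance $\boldsymbol{I}_N$. Hence, we can estimate the values $T_{k,l}$, $k, l \in \{1, \ldots, K\}$, as follows:

$$
T_{k,l} = \operatorname{tr}(\boldsymbol{K}_k \boldsymbol{K}_l) \approx \hat{T}_{k,l} = \frac{1}{B} \frac{1}{M_k M_l} \sum_{b=1}^{B} \boldsymbol{z}_b^T \boldsymbol{X}_k \boldsymbol{X}_k^T \boldsymbol{X}_l \boldsymbol{X}_l^T \boldsymbol{z}_b
$$

Here $\boldsymbol{z}_1, \ldots, \boldsymbol{z}_B$ are $B$ independent random vectors with zero mean and covariance $\boldsymbol{I}_N$. We draw these random vectors independently from a standard normal distribution. Computing $\hat{T}_{k,l}$ using the unbiased estimator involves four multiplications of sub-matrices of the genotype matrix with a vector, repeated $B$ times. Therefore, the total running time for estimating the matrix $\boldsymbol{T}$ is $\mathcal{O}(NMB)$. Moreover, we can leverage the structure of the genotype matrix, which only contains entries in {0, 1, 2}. For a fixed genotype matrix, we can improve the per-iteration time complexity of matrix-vector multiplication from $\mathcal{O}(NM)$ to $\mathcal{O}\left(\frac{NM}{\max(\log_3 N, \log_3 M)}\right)$ by using the Mailman algorithm[34]. Solving the normal equations takes $\mathcal{O}(K^3)$ time so that the overall time complexity of our algorithm is $\mathcal{O}\left(\frac{NMB}{\max(\log_3 N, \log_3 M)} + K^3\right)$.

RHE-mc uses a block Jackknife to estimate standard errors. In Supplementary Notes, we show how the block Jackknife estimates can be computed with little additional computational overhead. Further, we also show how covariates can be efficiently included in the model (Supplementary Notes).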
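The randomized trace estimator can be illustrated with the following NumPy sketch. It is a simplified reading of the approach, not the RHE-mc code: it uses dense matrix products in place of the Mailman algorithm, omits the block Jackknife and covariates, and assumes standardized genotype blocks so that tr(K_k) = N. The names `standardize`, `estimate_T`, and `rhe_mc` are hypothetical.

```python
import numpy as np

def standardize(G):
    """Scale each SNP column so that sum_n x = 0 and sum_n x^2 = N."""
    G = np.asarray(G, dtype=float)
    return (G - G.mean(axis=0)) / G.std(axis=0)

def estimate_T(X_blocks, B, rng):
    """Estimate T[k, l] = tr(K_k K_l) from B random Gaussian vectors."""
    N = X_blocks[0].shape[0]
    Z = rng.standard_normal((N, B))
    # U[k] = K_k Z, computed with two thin products; no N x N GRM is formed.
    U = [X @ (X.T @ Z) / X.shape[1] for X in X_blocks]
    K = len(X_blocks)
    T = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            # z^T K_k K_l z = (K_k z)^T (K_l z) because K_k is symmetric.
            T[k, l] = np.sum(U[k] * U[l]) / B
    return T

def rhe_mc(X_blocks, y, B=10, seed=0):
    """Randomized MoM estimate of (sigma_1^2, ..., sigma_K^2, sigma_e^2)."""
    rng = np.random.default_rng(seed)
    N, K = y.shape[0], len(X_blocks)
    A = np.full((K + 1, K + 1), float(N))   # b_k = tr(K_k) = N, corner entry = N
    A[:K, :K] = estimate_T(X_blocks, B, rng)
    c = [np.sum((X.T @ y) ** 2) / X.shape[1] for X in X_blocks]  # y^T K_k y
    return np.linalg.solve(A, np.array(c + [y @ y]))
```

Only products of genotype blocks with thin N x B or M_k x B matrices appear, which is what makes the cost linear in N and M for a fixed number of random vectors.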

Multi-component LMM with overlapping annotations

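RHE-mc can also be applied in the setting where annotations overlap. Following ref. [7], the heritability of SNPs belonging to annotation $k$ is defined as $h_k^2 = \sum_{m \in S_k} \operatorname{Var}(\beta_m)$, where $S_k$ is the set of SNPs in the $k$th annotation and $M_k = |S_k|$. Enrichment in bin $k$ is defined as $\frac{h_k^2 / h_{\mathrm{SNP}}^2}{M_k / M}$.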

Multi-component LMM with continuous annotations

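We have described the derivation of RHE-mc using binary annotations. Following ref. [29], we can extend RHE-mc to support continuous-valued annotations as follows:

$$
\begin{aligned}
\boldsymbol{y} \mid \boldsymbol{\epsilon}, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_K &= \sum_{k=1}^{K} \boldsymbol{X}_k \boldsymbol{\beta}_k + \boldsymbol{\epsilon} \\
\boldsymbol{\epsilon} &\sim \mathcal{D}\left(\boldsymbol{0}, \sigma_e^2 \boldsymbol{I}_N\right) \\
\boldsymbol{\beta}_k &\sim \mathcal{D}\left(\boldsymbol{0}, \frac{\sigma_k^2}{M_k} \operatorname{diag}(\boldsymbol{a}_k)\right), \quad k \in \{1, \ldots, K\}
\end{aligned}
$$

This model is similar to the model in Eq. (1) except that here we assume that the variance of the effect sizes depends on a continuous-valued annotation. Let $\boldsymbol{a}_k$ be an $M_k$-vector where $a_{k,m}$ is the value of the $k$th annotation at SNP $m$ (the elements of $\boldsymbol{a}_k$ must be non-negative). Let $S_k$ be the set of SNPs belonging to annotation $k$. In this model, the SNP heritability of annotation $k$ is defined as:

$$
h_k^2 = \sum_{m \in S_k} \operatorname{Var}(\beta_m)
$$

To estimate the variance components of this new model, we only need to replace $\boldsymbol{K}_k$ with $\boldsymbol{X}_k \operatorname{diag}(\boldsymbol{a}_k) \boldsymbol{X}_k^T / M_k$ in Eq. (5) for every annotation $k$. We assessed the accuracy of RHE-mc in estimating variance components with continuous annotations in Supplementary Notes.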
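A sketch of how a continuous annotation might enter the computation is given below. It assumes that the annotation-weighted relatedness matrix has the form X diag(a) X^T / M, which is our reading of the replacement described above; the function name `grm_matvec` is hypothetical. The product with a vector is computed without forming the N x N matrix, which is the operation a randomized estimator would need.

```python
import numpy as np

def grm_matvec(X, v, a=None):
    """Return (X diag(a) X^T / M) v without forming the N x N matrix.

    X : standardized N x M genotype block
    v : vector of length N
    a : non-negative annotation values of length M (defaults to all ones,
        which recovers the ordinary GRM product X X^T v / M)
    """
    M = X.shape[1]
    a = np.ones(M) if a is None else np.asarray(a, dtype=float)
    if np.any(a < 0):
        raise ValueError("annotation values must be non-negative")
    # Weight each SNP's contribution by its annotation value.
    return X @ (a * (X.T @ v)) / M
```

With a binary annotation (all ones within the category) this reduces to the unweighted case, so the same code path can serve both settings.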

Simulations

We performed simulations to compare the performance of RHE-mc with several state-of-the-art methods for heritability estimation that cover the spectrum of methods that have been proposed. We considered two simulation settings. In the large-scale simulation setting, we simulated phenotypes for the full set of UK Biobank genotypes consisting of M = 593,300 array SNPs and N = 337,205 individuals. We obtained the individuals by keeping unrelated white British individuals who are more distantly related than third-degree relatives (defined as pairs of individuals with kinship coefficient $< 1/2^{9/2}$)[17], and removing individuals with putative sex chromosome aneuploidy. The small-scale setting was designed so that we could compare the accuracy of RHE-mc to that of REML methods. In this setting, we simulated phenotypes from a subsample of the UK Biobank genotypes used in the large-scale simulation[35]. Specifically, we randomly chose a subset of N = 10,000 individuals from the large-scale data so that we have M = 593,300 array SNPs and N = 10,000 individuals.

We simulated phenotypes from genotypes using the following model, which is used in refs. [8,10]:

$$
\begin{aligned}
\sigma_m^2 &= S \, c_m \, w_m^{b} \left[ f_m (1 - f_m) \right]^{a} \\
\beta_m \mid \sigma_m^2 &\sim \mathcal{N}(0, \sigma_m^2) \\
\boldsymbol{y} \mid \boldsymbol{\beta} &\sim \mathcal{N}\left(\boldsymbol{X}\boldsymbol{\beta}, (1 - h^2) \boldsymbol{I}_N\right)
\end{aligned}
\tag{12}
$$

where $S$ is a normalizing constant chosen so that $\sum_{m=1}^{M} \sigma_m^2 = h^2$. Here $h^2 \in [0, 1]$, $a \in \{0, 0.75\}$, $b \in \{0, 1\}$. $\beta_m$, $f_m$, and $w_m$ are the effect size, the minor allele frequency, and the LDAK score of the $m$th SNP, respectively. Let $c_m \in \{0, 1\}$ be an indicator variable for the causal status of SNP $m$. The LD score of a SNP is defined to be the sum of the squared correlation of the SNP with all other SNPs that lie within a specific distance, and the LDAK score of a SNP is computed based on local levels of LD such that the LDAK score tends to be higher for SNPs in regions of low LD[36]. The above models relating genotype to phenotype are commonly used in methods for estimating SNP heritability: the GCTA Model (when a = b = 0 in Eq. (12)), which is used by the software GCTA[27] and LD Score regression (LDSC)[24], and the LDAK Model (where a = 0.75, b = 1 in Eq. (12)), which is used by the software LDAK[36].

Moreover, under each model, we varied the proportion and minor allele frequency (MAF) of causal variants (CVs). The proportion of CVs was set to either 100% or 1%, and the MAF of CVs was drawn uniformly from [0, 0.5], [0.01, 0.05], or [0.05, 0.5], to consider genetic architectures that are either infinitesimal or sparse, as well as genetic architectures that include a mixture of common and rare SNPs and ones that consist of only rare or only common SNPs. The true heritability was chosen from {0.1, 0.25, 0.5, 0.8}. We generated 100 sets of simulated phenotypes for each setting of parameters and report accuracies averaged over these 100 sets.
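The simulation model can be sketched as follows. This is an illustrative reading, assuming that the per-SNP effect variance is proportional to c_m · w_m^b · [f_m(1 − f_m)]^a and is rescaled so that the variances sum to h², with residual variance 1 − h². The function name `simulate_phenotype` and its arguments are hypothetical, and the LDAK scores are taken as given rather than computed.

```python
import numpy as np

def simulate_phenotype(X, maf, ldak_w, h2, a, b, causal, rng):
    """Simulate one phenotype from standardized genotypes X (N x M).

    maf    : minor allele frequency of each SNP
    ldak_w : LDAK score of each SNP (assumed precomputed)
    h2     : target SNP heritability in [0, 1]
    a, b   : exponents; a = b = 0 gives the GCTA model,
             a = 0.75, b = 1 gives the LDAK model
    causal : 0/1 indicator of causal status for each SNP
    """
    N, M = X.shape
    maf = np.asarray(maf, dtype=float)
    ldak_w = np.asarray(ldak_w, dtype=float)
    causal = np.asarray(causal, dtype=float)
    var = causal * ldak_w ** b * (maf * (1.0 - maf)) ** a
    var = h2 * var / var.sum()          # normalizing constant S folded in here
    beta = rng.standard_normal(M) * np.sqrt(var)
    eps = rng.standard_normal(N) * np.sqrt(1.0 - h2)
    return X @ beta + eps, beta, var
```

Setting `causal` to all ones gives the infinitesimal architecture, while a sparse indicator (for example 1% of SNPs) gives the sparse one.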

Comparisons

For the large-scale simulations, we compared RHE-mc to methods that rely on summary statistics for estimating heritability. Among the summary statistic methods, LD score regression (LDSC)[24] uses the slope from the GWAS χ² statistics regressed on the LD scores to estimate heritability. Stratified LD score regression (S-LDSC)[7] is an extension of LDSC for partitioning heritability from summary statistics. SumHer is the summary statistic analog of LDAK[25]. We ran S-LDSC with 10 binary MAF bin annotations defined such that each bin contains exactly 10% of the typed SNPs; this is intended to mirror the 10 MAF bin annotations in the S-LDSC “baseline-LD model”[29] (see Supplementary Table 5). To run SumHer, we used the LDAK software to compute the default “LDAK weights” using in-sample LD[25,36,37]. We then computed “LD tagging” using 1-Mb windows centered on each SNP as recommended[25]. To ensure a fair comparison, we computed LD scores for LDSC, S-LDSC, GRE, and SumHer using in-sample LD among the M SNPs, and in all simulations we aimed to estimate the SNP heritability explained by the same set of M SNPs. We describe the parameter settings of the summary statistic methods in Supplementary Notes.

For the small-scale simulations, we compared RHE-mc to GCTA-mc and HE-mc[27]. GCTA-mc and HE-mc are the extensions of GCTA and HE, respectively, to a multi-component LMM, where the variance components are typically defined by binning SNPs according to their MAF as well as local LD[8]. We ran GCTA-mc, HE-mc, and RHE-mc using 24 bins formed by the combination of six bins based on MAF (MAF ≤ 0.01, 0.01 < MAF ≤ 0.02, 0.02 < MAF ≤ 0.03, 0.03 < MAF ≤ 0.04, 0.04 < MAF ≤ 0.05, MAF > 0.05) and four bins based on quartiles of the LDAK score of a SNP. We ran both GCTA-mc and RHE-mc allowing estimates of a variance component to be negative.

For comparisons of runtime, we compared RHE-mc to GCTA[27] and BOLT-REML[4], which is a computationally efficient approximate method for computing the REML estimator. We ran all methods with 22 components (one for each chromosome). We also ran RHE-mc with ≈300 components (corresponding to 10 Mb bins) on the UK Biobank genotypes (Supplementary Fig. 10). To create our largest dataset, we replicated individuals from the UK Biobank and a subset of the imputed SNPs to obtain a dataset with one million individuals and one million SNPs. We used the latest versions of BOLT-REML (Version 2.3.2) and GCTA (Version 1.92.1) in our comparison. All comparisons were performed on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM.
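The 24-bin partition used in the small-scale comparisons (six MAF bins crossed with four LDAK-score quartiles) can be sketched as below. The bin edges follow the MAF thresholds listed above, with 0.04 assumed as the upper edge of the fourth bin. The function name `assign_bins` and the bin numbering (MAF bin × 4 + LD quartile) are illustrative choices and are not taken from any of the compared tools.

```python
import numpy as np

# Upper edges of the first five MAF bins; the sixth bin is MAF > 0.05.
MAF_EDGES = np.array([0.01, 0.02, 0.03, 0.04, 0.05])

def assign_bins(maf, ldak_score):
    """Map each SNP to one of 24 bins: 6 MAF bins x 4 LDAK-score quartiles."""
    maf = np.asarray(maf, dtype=float)
    ldak_score = np.asarray(ldak_score, dtype=float)
    # right=True gives intervals of the form (lower, upper], matching
    # "0.01 < MAF <= 0.02" and so on; values <= 0.01 fall in bin 0.
    maf_bin = np.digitize(maf, MAF_EDGES, right=True)          # 0..5
    quartiles = np.quantile(ldak_score, [0.25, 0.5, 0.75])
    ld_bin = np.digitize(ldak_score, quartiles, right=True)    # 0..3
    return maf_bin * 4 + ld_bin
```

The resulting labels can be used to split the genotype matrix into the per-category blocks that a multi-component estimator expects.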

Heritability estimates in the UK Biobank

We estimated SNP heritability for 22 complex traits (6 quantitative, 16 binary) in the UK Biobank[17]. In this study, we restricted our analysis to SNPs that were present on the UK Biobank Axiom array used to genotype the UK Biobank. SNPs with >1% missingness or minor allele frequency <1% were removed. Moreover, SNPs that fail the Hardy–Weinberg test at significance threshold $10^{-7}$ were removed. We restricted our study to self-reported British white ancestry individuals who are more distantly related than third-degree relatives (defined as pairs of individuals with kinship coefficient $< 1/2^{9/2}$)[17]. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. Finally, we obtained a set of N = 291,273 individuals and M = 459,792 SNPs to use in the real data analyses. We included age, sex, and the top 20 genetic principal components (PCs) as covariates in our analysis for all traits. We used PCs precomputed by the UK Biobank from a superset of 488,295 individuals. Additional covariates were used for waist-to-hip ratio (adjusted for BMI) and diastolic/systolic blood pressure (adjusted for cholesterol-lowering medication, blood pressure medication, insulin, hormone replacement therapy, and oral contraceptives).

Heritability partitioning

In our initial analysis, we removed SNPs with >1% missingness or minor allele frequency <1%. Moreover, we removed SNPs that fail the Hardy–Weinberg test at significance threshold $10^{-7}$ as well as SNPs that lie within the MHC region (Chr6: 25–35 Mb) to obtain 4,824,392 SNPs. We restricted our study to self-reported British white ancestry individuals who are more distantly related than third-degree relatives (defined as pairs of individuals with kinship coefficient $< 1/2^{9/2}$)[17]. Furthermore, we removed individuals who are outliers for genotype heterozygosity and/or missingness. Finally, we obtained 291,273 individuals. We partitioned SNPs into eight bins based on two MAF bins (MAF ≤ 0.05, MAF > 0.05) and quartiles of the LD scores. For each bin k, we computed the heritability enrichment as the ratio of the percentage of heritability explained by SNPs in bin k to the percentage of SNPs in bin k.

We considered an additional analysis in which we included SNPs with MAF > 0.1%, resulting in N = 291,273 unrelated white British individuals and M = 7,774,235 imputed SNPs. We defined 144 bins based on 4 LD bins and 36 MAF bins. The 4 LD bins are defined by quartiles of the LD scores, and the 36 MAF bins are defined by dividing each of the following four intervals into 9 quantile bins: 0.001 ≤ MAF ≤ 0.01, 0.01 < MAF ≤ 0.05, 0.05 < MAF ≤ 0.10, 0.10 < MAF ≤ 0.50.

27 in total

1. Measuring missing heritability: inferring the contribution of common variants.

Authors: David Golan; Eric S Lander; Saharon Rosset
Journal: Proc Natl Acad Sci U S A Date: 2014-11-24 Impact factor: 11.205

2. Estimating SNP-Based Heritability and Genetic Correlation in Case-Control Studies Directly and with Summary Statistics.

Authors: Omer Weissbrod; Jonathan Flint; Saharon Rosset
Journal: Am J Hum Genet Date: 2018-07-05 Impact factor: 11.025

3. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs.

Authors: S Hong Lee; Teresa R DeCandia; Stephan Ripke; Jian Yang; Patrick F Sullivan; Michael E Goddard; Matthew C Keller; Peter M Visscher; Naomi R Wray
Journal: Nat Genet Date: 2012-02-19 Impact factor: 38.330

4. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.

Authors: Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
Journal: PLoS Med Date: 2015-03-31 Impact factor: 11.069

5. Employing a Monte Carlo algorithm in Newton-type methods for restricted maximum likelihood estimation of genetic parameters.

Authors: Kaarina Matilainen; Esa A Mäntysaari; Martin H Lidauer; Ismo Strandén; Robin Thompson
Journal: PLoS One Date: 2013-12-10 Impact factor: 3.240

6. SumHer better estimates the SNP heritability of complex traits from summary statistics.

Authors: Doug Speed; David J Balding
Journal: Nat Genet Date: 2018-12-03 Impact factor: 38.330

7. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.

Authors: Jian Yang; Andrew Bakshi; Zhihong Zhu; Gibran Hemani; Anna A E Vinkhuyzen; Sang Hong Lee; Matthew R Robinson; John R B Perry; Ilja M Nolte; Jana V van Vliet-Ostaptchouk; Harold Snieder; Tonu Esko; Lili Milani; Reedik Mägi; Andres Metspalu; Anders Hamsten; Patrik K E Magnusson; Nancy L Pedersen; Erik Ingelsson; Nicole Soranzo; Matthew C Keller; Naomi R Wray; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2015-08-31 Impact factor: 38.330

8. Partitioning heritability by functional annotation using genome-wide association summary statistics.

Authors: Hilary K Finucane; Brendan Bulik-Sullivan; Alexander Gusev; Gosia Trynka; Yakir Reshef; Po-Ru Loh; Verneri Anttila; Han Xu; Chongzhi Zang; Kyle Farh; Stephan Ripke; Felix R Day; Shaun Purcell; Eli Stahl; Sara Lindstrom; John R B Perry; Yukinori Okada; Soumya Raychaudhuri; Mark J Daly; Nick Patterson; Benjamin M Neale; Alkes L Price
Journal: Nat Genet Date: 2015-09-28 Impact factor: 38.330

9. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations.

Authors: Steven Gazal; Po-Ru Loh; Hilary K Finucane; Andrea Ganna; Armin Schoech; Shamil Sunyaev; Alkes L Price
Journal: Nat Genet Date: 2018-10-08 Impact factor: 38.330

10. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962
