Literature DB >> 33367483

A non-linear regression method for estimation of gene-environment heritability.

Abstract

MOTIVATION: Gene-environment (GxE) interactions are one of the least studied aspects of the genetic architecture of human traits and diseases. The environment of an individual is inherently high dimensional, evolves through time and can be expensive and time consuming to measure. The UK Biobank study, with all 500,000 participants having undergone an extensive baseline questionnaire, represents a unique opportunity to assess GxE heritability for many traits and diseases in a well powered setting.
RESULTS: We have developed a randomized Haseman-Elston non-linear regression method applicable when many environmental variables have been measured on each individual. The method (GPLEMMA) simultaneously estimates a linear environmental score (ES) and its GxE heritability. We compare the method via simulation to a whole-genome regression approach (LEMMA) for estimating GxE heritability. We show that GPLEMMA is more computationally efficient than LEMMA on large datasets, and produces results highly correlated with those from LEMMA when applied to simulated data and real data from the UK Biobank. AVAILABILITY: Software implementing the GPLEMMA method is available from https://jmarchini.org/gplemma/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Year: 2020 PMID： 33367483 PMCID： PMC8023682 DOI： 10.1093/bioinformatics/btaa1079

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

The advent of genome-wide association studies (The Wellcome Trust Case Control Consortium, 2007) has catalyzed a huge number of discoveries linking genetic markers to many human complex diseases and traits. For the most part, these discoveries have involved common variants that confer relatively small amounts of risk and only account for a small proportion of the phenotypic variance of a trait (Manolio ). This has led to a surge of interest in methods and applications that measure the joint contribution to phenotypic variance of all measured variants throughout the genome (SNP heritability), and in testing individual variants within this framework. Most notably the seminal paper of Yang et al. (2010), who used a linear mixed model (LMM) to show that the majority of missing heritability for height could be explained by genetic variation by common SNPs (Yang ). When testing variants for association these LMMs can reduce false positive associations due to population structure, and improve power by implicitly conditioning on other loci across the genome (Listgarten ; Loh ; Yang ). These methods model the unobserved polygenic contribution as a multivariate Gaussian with covariance structure proportional to a genetic relationship matrix (GRM) (Eskin ; Lippert ; Zhou and Stephens, 2012). This approach is mathematically equivalent to a whole genome regression (WGR) model with a Gaussian prior over SNP effects (Listgarten ). Subsequent research has shown that the simplest LMMs make assumptions about the relationship between minor allele frequency (MAF), linkage disequilibrium (LD) and trait architecture that may not hold up in practice (Evans et al., 2018; Speed ) and generalizations have been proposed that stratify variance into different components by MAF and LD (Speed , 2017; Yang ). Other flexible approaches have been proposed in both the animal breeding (de los Campos ; Hayes ) and human literature (Carbonetto and Stephens, 2012; Logsdon ; Zhou ) to allow different prior distributions that better capture SNPs of small and large effects. For example, a mixture of Gaussians (MoG) prior can increase power to detect associated loci in some (but not all) complex traits (Carbonetto and Stephens, 2012; Loh ). Other methods have been proposed that estimate heritability only from summary statistics and LD reference panels (Bulik-Sullivan ; Speed and Balding, 2019). Heritability can also be estimated using Haseman–Elston regression (Haseman and Elston, 1972) and has recently been extended using a randomized approach (Wu and Sankararaman, 2018) that has computational complexity and works for multiple variance components (Pazokitoroudi ). Other recent work has shown that LMM approaches such as these are not able to disentangle direct and indirect genetic effects, the balance of which will vary depending on the trait being studied (Young ). There has been less exploration of methods for estimating heritability that account for gene–environment interactions. One interesting approach has proposed using spatial location as a surrogate for environment (Heckerman ) using a three component LMM—one based on genomic variants, one based on measured spatial location as a proxy for environmental effects, and a gene–environment component, modelled as the Hadamard product of the genomic and spatial covariance matrices. Other authors have used this method to account for gene-gene interactions (Crawford ; Ober ). Modelling gene–environment interactions when many different environmental variables are measured is a more challenging problem. If several environmental variables drive interactions at individual loci, or if an unobserved environment that drives interactions is better reflected by a combination of observed environments, it can make sense to include all variables in a joint model. StructLMM (Moore ) focuses on detecting GxE interactions at individual markers. Environmental similarity between individuals is modelled (over multiple environments) as a random effect, and each SNP is tested independently for GxE interactions. However, this approach does not model the genome wide contribution of all the markers, which is often a major component of phenotypic variance. We recently proposed a WGR approach called LEMMA applicable to large human datasets such as UK Biobank, where many potential environmental variables are available (Kerin and Marchini, 2020). The LEMMA regression model includes main effects of each genotyped SNP across the genome, and also interactions of each SNP with a environmental score (ES), that is a linear combination of the environmental variables. The ES is estimated as part of the method using a Variational Bayes algorithm to fit the WGR model. The model uses mixture of Gaussian (MoG) priors on main and GxE SNP effects, that allow for a range of different genetic architectures from polygenic to sparse genetic effects (Carbonetto and Stephens, 2012; Logsdon ; Zhou ). The ES can be readily interpreted and its main use is to test for GxE interactions one variant at a time, typically at a larger set of imputed SNPs in the dataset. However, the ES can also be used to estimate the proportion of phenotypic variability that is explained by GxE interactions (SNP GxE heritability), using a two component randomized Haseman–Elston (RHE) regression (Pazokitoroudi ). The main contribution of this article is to combine the estimation of the LEMMA ES into a stand-alone RHE framework. This results in a non-linear optimization problem that we solve using the Levenburg-Marquardt (LM) algorithm. The method implicitly assumes a Gaussian prior on main effect and GxE effect sizes. We also propose a separate RHE method that estimates the independent GxE contribution of each measured environmental variable. We set out the differences between these two models and present a simulation study to compare them to LEMMA. We show that GPLEMMA is more computationally efficient than LEMMA on large datasets. We also apply the method to UK Biobank data and show that GPLEMMA produces estimates very close to LEMMA. Software implementing the GPLEMMA algorithm in C++ is available at https://jmarchini.org/gplemma/.

2 Materials and methods

2.1 Modelling SNP heritability

The simplest model for estimating SNP heritability has the form where y is a continuous phenotype, X is an N × M matrix of genotypes that has been normalized to have column mean zero and column variance one, and β is an M-vector of SNP effect sizes. Integrating out β leads to the variance component model where is known as the genomic relationship matrix (GRM) (Yang ). Estimating the two parameters in this model σ and σ leads to an estimate of SNP heritability of . This is commonly referred to in the literature as the single component model. Subsequent research has shown that the single component model makes assumptions about the relationship between minor allele frequency (MAF), linkage disequilibrium (LD) and trait architecture that may not hold up in practice (Evans ; Speed ). There have been many follow up methods, including; generalizations that stratify variance into different components by MAF and LD (Yang ), approaches that assign different weights for the GRM (Speed , 2017), methods that replace the Gaussian prior on β with a spike and slab on SNP effect sizes (Powell ) and methods that estimate heritability only from summary statistics and LD reference panels (Bulik-Sullivan ; Finucane ).

2.2 Randomized Haseman–Elston regression

An alternative method used to compute heritability is known as Haseman–Elston (HE) regression (Haseman and Elston, 1972). HE-regression is a method of moments (MoM) estimator that optimizes variance components in order to minimize the squared difference between the observed and expected trait covariances. The MoM estimator can be obtained by solving the minimization or equivalently by solving the linear regression problem where is the vectorization operator that transforms an N × M matrix into an NM-vector. In matrix format, both of these forms correspond to solving the following linear system HE-regression methods are widely acknowledged to be more computationally efficient (Golan ; Wu and Sankararaman, 2018; Yang ) and do not require any assumptions on the phenotype distribution beyond the covariance structure (Golan ) (in contrast to maximum-likelihood estimators). However, HE-regression based estimates typically have higher variance (Yang ), thus implying that they have less power. Recent method developments (Pazokitoroudi ; Wu and Sankararaman, 2018) have shown that a randomized HE-regression (RHE) approach can be used to compute efficiently on genetic datasets with hundreds of thousands of samples. Wu and Sankararaman (2018) observed that Equation (1) can be solved efficiently without ever having to explicitly compute the kinship matrix K by using Hutchinson’s estimator (Hutchinson, 1990), which states that for any matrix where z is a random vector with mean zero and covariance given by the identity matrix. The proposed method involves approximating and using only matrix vector multiplications with the genotype matrix X, to compute the following expressions Thus an approximate solution can be obtained in time, where B denotes a relatively small number of random samples. Subsequent work by Pazokitoroudi et al. extended this approach to a multiple component model (Pazokitoroudi ) With parameter estimates obtained as solution to the linear system given by where and . Finally, both papers show how to efficiently control for covariates by projecting them out of all terms in the system of equations. Thus with covariates included the multiple component model becomes and terms in the subsequent linear system are given by and , where . The GPLEMMA and MEMMA approaches developed in this article use this method of handling covariates.

2.3 Modelling GxE heritability

We introduce two extensions of the RHE framework for modelling GxE interactions with multiple environmental variables. In both models we let E be an N × L matrix of environmental variables and C be an N × D matrix of covariates, where both matrixes are normalized such that columns have mean zero and variance one. We use notation E to denote the lth column of E, and denotes a null matrix with elements of vector x on its diagonal. We note that columns of E are always included in C, so that D > L.

MEMMA

The first model assumes that each environmental variable interacts independently with the genome where and denotes the element-wise product of E with each column of X. Integrating out β and λ leads to the variance component model where and Fitting the variance components is done analytically by solving the system of equations where and . As shown in the original RHE method (Pazokitoroudi ; Wu and Sankararaman, 2018), Hutchinson’s estimator can be used to efficiently estimate T. To do this our software streams SNP markers from a file and computes and the following N-vectors where for are random N-vectors. Then where is defined as Finally, the variance components are converted to heritability estimates using the following formula We call this approach MEMMA (Multiple Environment Mixed Model Analysis). MEMMA costs in compute and in RAM.

GPLEMMA

The second model involves the estimating a linear combination of environments, or environmetal score (ES), that interacts with the genome. The model is given by where is a column vector that we refer to as the linear environmental score (ES). This is the same model used by LEMMA (Kerin and Marchini, 2020), which we include below for completeness, except the mixture of Gaussians priors on SNP effects (β and γ) have been replaced with Gaussian priors. For this reason, we call this approach GPLEMMA (Gaussian Prior Linear Environment Mixed Model Analysis). Integrating out the SNP effects yields the model where and . Minimizing the squared loss between the expected and observed covariance is equivalent to the following regression problem In this format it is clear that optimizing is a non-linear regression problem. Further, including a parameter for is no longer necessary. From here on we set and drop the parameterization without loss of generality.

Levenburg–Marquardt algorithm

We use the Levenburg-Marquardt (LM) algorithm (Zolfaghari ), which is commonly used for non-linear least squares problems. The algorithm effectively interpolates between the Gauss-Newton algorithm and the method of steepest gradient descent, by use of an adaptive damping parameter. In this manner, it is more robust than the straight forward Gauss-Newton algorithm but should have faster convergence than a gradient descent approach. Without loss of generality, consider the model where is a function that is non-linear in the parameters θ. Given a starting point θ0, LM proposes a new point by solving the normal equations where and are respectively the Jacobian and the residual vector evaluated at θ0. If has lower squared error than θ0, then the step is accepted and the adaptive damping parameter μ is reduced. Otherwise, μ is increased and a new step δ is proposed. For small values of μEquation (11) approximates the quadratic step appropriate for a fully linear problem, whereas for large values of μEquation (11) behaves more like steepest gradient descent. This allows the algorithm to defensively navigate regions of the parameter space where the model is highly non-linear. If reduces the squared error, then the step is accepted and μ is reduced, otherwise μ is increased and a new step δ is proposed. In summary the LM algorithm requires computation of the matrices at each step, as well as the squared error (which we define as ). We now give statements of the equations used to compute each of these values, and show that each iteration can be performed in time. We apply the LM algorithm with and Several quantities can be pre-calculated and re-used in the LM algorithm. The N-vectors u, and are needed and have been defined above. In addition, GPLEMMA also benefits from the pre-calculation of which can also be computed as genotypes are streamed from file. Let denote the entry of the J that corresponds to for and define the N-vector . Then the matrix is given by is given by where Finally the squared error, which we define as , is given by where The initial preprocessing step has costs in compute and in RAM. The remaining algorithm does not require much RAM in addition to that required in the preprocessing step, so also costs in RAM. Construction of the summary variable costs in compute. Each iteration of the LM algorithm costs . It is possible to parallelize GPLEMMA using OpenMPI by partitioning samples across cores, in a similar manner to that used by LEMMA (Kerin and Marchini, 2020). Given that evaluating the objective function is characterized by BLAS level 1 array operations, a distributed algorithm using OpenMPI should have superior runtime versus an the same algorithm using OpenMP as well as providing RAM limited only by the size of a researchers compute cluster. We perform 10 repeats of the LM algorithm with different initializations, and keep results from the solution with lowest squared error . Each run is initialized with a vector of interaction weights , where each entry set to and a small amount of Gaussian noise is added. To transform the initial weights vector to the initial parameters θ0 we let be solutions to The GPLEMMA algorithm is then initialized with where .

LEMMA

The LEMMA model is given by where is the linear environmental score (ES). The use of the MoG priors makes it harder to analytically integrate out the parameters. The LEMMA algorithm uses a Variational Bayes approach to first estimate the ES and fit the whole genome regression pararmeters β and γ. The primary use of the LEMMA model is for testing individual SNPs for GxE effects. However, LEMMA can also estimate GxE heritability. It uses the VB algorithm to estimate the ES, and then plugs this estimate into Eq 6 and uses a randomized Haseman–Elston linear regression to estimate the variance components.

Relationship between MEMMA and GPLEMMA

Comparing Equation (3) with Section 6, suggests that the GPLEMMA model can be expressed at the MEMMA model with the added constraint that where is the L × M matrix of GxE effect sizes in MEMMA for the L environments and M SNPs. We can expect the two models to give similar heritability estimates, under the simplifying assumptions that GxE interactions do occur with a single linear combination of the environments and that the set of random variables is mutually independent. Let and . Then connection between the two models is revealed by observing where . Thus we should expect both models to have the same estimate for the proportion of variance explained by GxE interaction effects. Even in the case that MEMMA and GPLEMMA have the same expected heritability estimate, there are still some differences between the two. GPLEMMA is a constrained model, so the variance of its heritabiity estimates may be smaller. Further, although is proportional to the square of the weights used to construct the ES the sign of the interaction weight w has been lost. Thus it is not possible to reconstruct an ES for use in single SNP testing using MEMMA.

2.4 Simulated data

We carried out simulation studies to compare the performance of MEMMA, LEMMA and GPLEMMA in a variety of different scenarios, using three different phenotype simulation strategies; a baseline phenotype that was simulated according to the LEMMA model (Kerin and Marchini, 2020), a phenotype which generalizes the LEMMA model to three orthogonal ESs and a misspecified phenotype that had squared dependancy on a heritable environmental variable S. The simulations use real data subsampled from genotyped SNPs in the UK Biobank (Bycroft ), drawing SNPs from all 22 chromosomes in proportion to chromosome length and using unrelated samples of mixed ancestry (; 12 500 white British, 7500 Irish and 5000 white European, ; 29 567 white British, 7500 Irish and 12 568 white European, ; 79 567 white British, 7500 Irish and 12 568 white European; using self-reported ancestry in field f.21000.0.0). All samples were genotyped using the UKBB genotype chip and were included in the internal principal component analysis performed by the UK Biobank. Environmental variables were simulated from a standard Gaussian distribution. Let N be the number of individuals, M the total number of SNPs, L the total number of environmental variables and and the herirtability of main effects and GxE effects. The baseline phenotype was simulated using the model where X represents the N × M genotype matrix after columns have been standardized to have mean zero and variance one, C is the first principle component of the genotype matrix and E is the N × L matrix of environmental variables. In all simulations α was set such that explained one percent of trait variance. The interaction weight vector w contained non-zero elements, which were drawn from a decreasing sequence The effect size parameters β and γ were simulated from a spike and slab prior such that the number of non-zero elements was governed by and for main and interaction effects respectively. Non-zero elements were drawn from a standard Gaussian, and then subsequently rescaled to ensure that the heritability given by main and interaction effects was and respectively. We chose a set of default parameters: , and then varied one parameter at a time to examine the effects of sample size, number of environments, number of non-zero SNP effects and GxE heritability. In addition, we investigated performance using a larger baseline simulation with samples and variants. The first genetic principal component was provided as a covariate to all methods. Figure 1 compares estimates of the proportion of variance explained (PVE) by GxE effects from all three methods. In general, all methods had upwards bias that decreased with sample size and increased with the number of environments. While heritability estimates from LEMMA and GPLEMMA appeared quite similar, estimates from MEMMA had much higher variance and also appeared to have higher upwards bias as the total number of environments increase. All the methods exhibited less bias in the larger simulations with samples and variants (Fig. 1e–g). Supplementary Figure S1 shows the estimated PVE by main effects from the same set of simulations. In general the estimated PVE by main effects from the three methods was extremely similar.

Fig. 1.

PVE estimation. Estimates of the proportion of variance explained by GxE effects by LEMMA, MEMMA and GPLEMMA whilst varying the number of environments, the number of active environments, the number of non-zero SNP effects and GxE heritability. All simulations constructed with the baseline phenotype. a–d Simulations results using samples and variants by default, e–g show simulation results using samples and variants. Results from 15 repeats shown Figure 2 compares the absolute correlation between the simulated ES and the ES inferred by LEMMA and GPLEMMA (note that MEMMA does not provide an estimate of the ES). In general, the estimated ES from GPLEMMA had slightly lower absolute correlation with the true ES than the estimated ES from LEMMA. LEMMA models SNP effects directly, so can estimate the SNPs involved in the interaction, and this is likely the reason for the small improvement in the accuracy of the ES. In large sample sizes ( with SNPs), both methods achieve a correlation of over 0.98 with the simulated ES.

Fig. 2.

Correlation between the true ES and the inferred ES. Absolute correlation between the true ES and the ES inferred by LEMMA and GPLEMMA whilst varying the number of environments, the number of active environments, the number of non-zero SNP effects and GxE heritability. All simulations constructed with the baseline phenotype. a–d Simulations results using samples and variants by default, e–g simulation results using samples and variants. Results from 15 repeats shown Next, we tested the methods against a generalized phenotype with three orthogonal ESs that interacted with disjoint sets of SNPs. More explicitly, the generalized phenotype was simulated from the following model where is an N-vector and W is an L × K matrix of environmental weights with K = 3. W contained rows with non-zero elements drawn from a standard Gaussian distribution, and columns were pairwise orthogonal. Similarly, γ denotes an M-vector where the non-zero elements of γ and γ are disjoint. SNP coefficients were drawn from a spike and slab prior in a similar manner to the baseline phenotype with and . Figure 3 displays results from a set of simulations with samples and SNPs. The three ESs were scaled such that the singular values of Λ where decreasing (with values 80, 60, 40). Figure 3b shows that the ES estimated by both LEMMA and GPLEMMA was, in general, highly correlated with the ES with highest singular value when the simulated GxE heritability was over 0.05. In this scenario, estimates of GxE heritability from both methods was centered around the true simulated heritability of that ES (see Fig. 3a). However at lower levels of simulated GxE heritability, both methods generally failed to identify one of the true ESs (consistent with Fig. 2). In contrast, the estimates of GxE heritability from MEMMA were centered on the sum of the GxE heritabilities from all three ESs. This suggests that running both GPLEMMA and MEMMA may help to elucidate the architecture of GxE interactions for a given trait.

Fig. 3.

Comparison with 3 simulated ESs. a Estimates of the proportion of variance explained by GxE effects by LEMMA, MEMMA and GPLEMMA whilst varying the simulated GxE heritability. The red dashed line indicates the total GxE heritability. The dashed green line indicates the heritability of the first GxE component η1. b The (absolute) correlation between the estimated ES and the three simulated ESs. Simulations constructed using samples and variants. Results from 15 repeats shown For a more challenging scenario, we reran the high GxE heritability () simulation using a larger number of SNPs () so that N > M and scaled the ESs such that the singular values corresponding to the first two ESs were roughly equal (the singular values were 80, 78, 60). In this scenario the ES estimated by GPLEMMA did not correlate well with any one of the three simulated ESs, but was highly correlated with a linear combination of them (Supplementary Fig. S2). We hypothesize that with a large number of ‘null’ variants GPLEMMA becomes less able to identify a single ES and instead infers a mixture. In contrast, LEMMA consistently identified one of the simulated ESs. The GxE heritability estimates from all three methods were roughly as before, but with some inflation (Supplementary Fig. S3). In the third batch of simulation results we used a misspecified phenotype, with squared dependancy on a heritable environmmental variable S. The misspecified phenotype was simulated using the model where S was simulated to have heritability of 30% with 2500 causal SNPs drawn from a spike and slab prior, and a range of values for α was used to vary the strength of the non-linear relationship between y and S. Supplementary Figure S4 compares MEMMA, GPLEMMA and LEMMA in a simulation where the functional form of a heritable environmental variable was misspecified (or more specifically; the phenotype depended on the squared effect of a heritable environment). All methods were first tested without any attempt to control for model misspecification, and second using a preprocessing strategy where each environment was tested independently for squared effects on the phenotype and any squared effects with p-value were included as covariates. These are referred to as (−SQE) and (+SQE) respectively in the figures. Using the (−SQE) strategy, all methods showed upwards bias in estimates of GxE heritability that increased with the strength of the squared effect on the phenotype (Supplementary Fig. S4b). Model misspecification also caused bias in the ES of both GPLEMMA and LEMMA, however bias in the ES from GPLEMMA appeared to be much worse (Supplementary Fig. S4a). Using the strategy, all GxE heritability estimates were unbiased, consistent with earlier simulation results. Figure 4 shows a comparison of the runtime of LEMMA and GPLEMMA as the sample size and number of environments varied. The figure shows that on relatively small datasets ( samples or L = 10 environmental variables) LEMMA was slightly faster, however on large datasets GPLEMMA is faster due to superior algorithmic complexity. To give a direct comparison, on a simulated data with samples, SNPs and L = 30 environmental variables using 4 cores for each run, LEMMA took an average of 648 minutes to run whereas GPLEMMA took an average of 233 minutes.

Fig. 4.

Computational cost of GPLEMMA. Comparison of the computational cost of LEMMA and GPLEMMA as (a) sample size and (b) the number of environments increases. Runtime shown on a scale. By default each run used four cores, samples, L = 30 environments and 10 random starts of the Levenburg-Marquardt algorithm. Results from 10 repeats shown. Runtime for LEMMA excludes time spent on single SNP hypothesis testing Supplementary Figure S5 displays simulation results on the computational complexity of GPLEMMA. Supplementary Figure S5a and b shows that GPLEMMA achieved perfect strong scaling (A parallel algorithm has perfect strong scaling if the runtime on T processors is linear in , including communication costs.) on the range of cores tested. This suggests that GPLEMMA has superior scalability to LEMMA, as for LEMMA the speedup due to increased cores began to decay after the number of samples per core dropped below 3000 (Kerin and Marchini, 2020). Time to compute the preprocessing step and solve the non-linear least squared problem are shown in Supplementary Figure S5c–f, while the number of environments and sample size were varied. As expected, the preprocessing step appeared to be linear in both the number of environments and sample size. Time to solve the non-linear least squares problem appeared to be quadratic in the number of environments and approximately linear in sample size N. As a single LM iteration should have complexity , this suggests that the number of iterations required for convergence of the LM algorithm was independent of sample size and the number of environments (at least for the range of values tested). Finally, we tested GPLEMMA in simulation where we simulated ordinal environmental variables, created using a binomial distribution Bin(n, p) where n with levels and . Comparing GPLEMMAs performance in these simulations with previous results using a continuous environmental variable (Supplementary Fig. S6) suggests that GPLEMMA is not sensitive to the choice of ordinal or continuous environmental variables.

2.5 Analysis of UK Biobank data

To compare GPLEMMA and LEMMA on real data we ran both methods on body mass index (log BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP) and pulse pressure (PP) measured on individuals from the UK Biobank. We filtered the SNP genotype data based on minor allele frequency () and IMPUTE info score (), leaving approximately 642 000 variants per trait. We used 42 environmental variables from the UK Biobank, similar to those used in previous GxE analyses of BMI in the UK Biobank (Moore ; Young ). After filtering on ancestry and relatedness, sub-setting down to individuals who had complete data across the phenotype, covariates and environmental factors we were left with approximately samples per trait. The sample, SNP and covariate processing and filtering applied is the same as that reported in the LEMMA paper (Kerin and Marchini, 2020). Table 1 shows the estimates and standard errors for SNP main effects () and GxE effects () for GPLEMMA and LEMMA applied to the 4 traits. In all cases there is good agreements between the estimates from both methods.

Table 1.

Comparison of GPLEMMA and LEMMA on 4 UK Biobank traits

Trait	hG2 (s.e)		hGxE2 (s.e)
Trait	GPLEMMA	LEMMA	GPLEMMA	LEMMA
log BMI	0.256 (0.078)	0.259 (0.069)	0.074 (0.008)	0.071 (0.009)
PP	0.230 (0.042)	0.233 (0.039)	0.063 (0.007)	0.075 (0.018)
SBP	0.237 (0.057)	0.240 (0.053)	0.036 (0.003)	0.033 (0.003)
DBP	0.273 (0.037)	0.277 (0.034)	0.021 (0.003)	0.014 (0.001)

Note: Heritability estimates obtained using genotyped SNPs.

Comparison of GPLEMMA and LEMMA on 4 UK Biobank traits Note: Heritability estimates obtained using genotyped SNPs. Finally we ran RHE-regression on the four UK Biobank traits whilst controlling for the same set of covariates. Heritability estimates from RHE-regression were marginally higher than those obtained by LEMMA and GPLEMMA (see Supplementary Table S1).

3 Discussion

Primarily this article develops a novel randomized Haseman–Elston non-linear regression approach for modelling GxE interactions of quantitative traits in large genetic studies with multiple environmental variables. This approach estimates GxE heritability at the same time as estimating the linear combination of environmental variables (called an ES) that underly that heritability. This general idea was pioneered in our previous approach LEMMA (Kerin and Marchini, 2020) which used a whole-genome regression approach to learn the ES, and this was then used in a randomized Haseman–Elston approach to estimate GxE heritability. The GPLEMMA approach introduced in this article does not need that first whole-genome regression step, and this leads to substantial computational savings. The model underlying GPLEMMA is very similar to that in LEMMA, but implicity assumes a Gaussian distribution for main SNP effects and GxE effects at each SNP. We compared GPLEMMA to a simpler approach, which we called MEMMA, that estimates GxE heritability of each environmental variable in a joint model, but does not attempt to find the best linear combination of them. We found that estimates of GxE heritability from MEMMA had higher variance than estimates from LEMMA and GPLEMMA, suggesting that the usefulness of MEMMA might be limited. Results from LEMMA and GPLEMMA were very similar, both in terms of estimating the ES and GxE heritability. The primary advantage of GPLEMMA over LEMMA is in computational complexity, as the empirical complexity of GPLEMMA appeared to be linear in sample size whereas LEMMA was shown to be super-linear (Kerin and Marchini, 2020). The methods LEMMA, GPLEMMA and MEMMA have all been developed for quantitative traits, and we have not explored their use when applied directly to binary traits, as if they were continuous, as was carried out in Pazokitoroudi . Developing GPLEMMA to directly model binary traits is a direction for future work. It maybe that transformations that exist for single component LMMs to convert heritability estimates to the liability scale may be also work here. In the future it may also be interesting to extend the GPLEMMA model to partition variance using multiple orthogonal linear combinations of environmental variables. This could be expressed using the model where is an N-vector, w is an L-vector and . LEMMA is also able to perform single SNP hypothesis testing whereas GPLEMMA (currently) does not. The linear weighting parameter w from GPLEMMA could be used to initialize LEMMA, or the estimated ES could be used as a single environmental variable in LEMMA. Exploring these, and other, approaches is future work. While we are enthusiastic about the potential of GPLEMMA (and LEMMA) to elucidate the contribution of GxE interactions to disease traits, we suggest that more care is needed than a standard genetic heritability analysis for a number of reasons. As our simulations show, including a variable as an ’environment’ that is itself under genetic control can lead to bias if the relationship between that variable and trait of interest is not well modelled, and this should be carefully considered. Also, GPLEMMA is only able to assess the contribution of those variables included in the model. It may well be the case that a relevant environment is not available and so its GxE contribution cannot be assessed. Finally, the scale the measured trait can impact results (Kerin and Marchini, 2020) so it can be useful to assess results on a range of scales. Click here for additional data file.

33 in total

1. Measuring missing heritability: inferring the contribution of common variants.

Authors: David Golan; Eric S Lander; Saharon Rosset
Journal: Proc Natl Acad Sci U S A Date: 2014-11-24 Impact factor: 11.205

Review 2. Finding the missing heritability of complex diseases.

Authors: Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher
Journal: Nature Date: 2009-10-08 Impact factor: 49.962

Review 3. Whole-genome regression and prediction methods applied to plant and animal breeding.

Authors: Gustavo de Los Campos; John M Hickey; Ricardo Pong-Wong; Hans D Daetwyler; Mario P L Calus
Journal: Genetics Date: 2012-06-28 Impact factor: 4.562

4. Accounting for genetic architecture improves sequence based genomic prediction for a Drosophila fitness trait.

Authors: Ulrike Ober; Wen Huang; Michael Magwire; Martin Schlather; Henner Simianer; Trudy F C Mackay
Journal: PLoS One Date: 2015-05-07 Impact factor: 3.240

5. Efficient Bayesian mixed-model analysis increases association power in large cohorts.

Authors: Po-Ru Loh; George Tucker; Brendan K Bulik-Sullivan; Bjarni J Vilhjálmsson; Hilary K Finucane; Rany M Salem; Daniel I Chasman; Paul M Ridker; Benjamin M Neale; Bonnie Berger; Nick Patterson; Alkes L Price
Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330

6. SumHer better estimates the SNP heritability of complex traits from summary statistics.

Authors: Doug Speed; David J Balding
Journal: Nat Genet Date: 2018-12-03 Impact factor: 38.330

7. Inferring Gene-by-Environment Interactions with a Bayesian Whole-Genome Regression Model.

Authors: Matthew Kerin; Jonathan Marchini
Journal: Am J Hum Genet Date: 2020-09-03 Impact factor: 11.025

8. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.

Authors: Jian Yang; Andrew Bakshi; Zhihong Zhu; Gibran Hemani; Anna A E Vinkhuyzen; Sang Hong Lee; Matthew R Robinson; John R B Perry; Ilja M Nolte; Jana V van Vliet-Ostaptchouk; Harold Snieder; Tonu Esko; Lili Milani; Reedik Mägi; Andres Metspalu; Anders Hamsten; Patrik K E Magnusson; Nancy L Pedersen; Erik Ingelsson; Nicole Soranzo; Matthew C Keller; Naomi R Wray; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2015-08-31 Impact factor: 38.330

9. Partitioning heritability by functional annotation using genome-wide association summary statistics.

Authors: Hilary K Finucane; Brendan Bulik-Sullivan; Alexander Gusev; Gosia Trynka; Yakir Reshef; Po-Ru Loh; Verneri Anttila; Han Xu; Chongzhi Zang; Kyle Farh; Stephan Ripke; Felix R Day; Shaun Purcell; Eli Stahl; Sara Lindstrom; John R B Perry; Yukinori Okada; Soumya Raychaudhuri; Mark J Daly; Nick Patterson; Benjamin M Neale; Alkes L Price
Journal: Nat Genet Date: 2015-09-28 Impact factor: 38.330

3. Genome-Wide Interaction Study of Late-Onset Asthma With Seven Environmental Factors Using a Structured Linear Mixed Model in Europeans.

Authors: Eun Ju Baek; Hae Un Jung; Tae-Woong Ha; Dong Jun Kim; Ji Eun Lim; Han Kyul Kim; Ji-One Kang; Bermseok Oh
Journal: Front Genet Date: 2022-03-30 Impact factor: 4.599

3 in total