Literature DB >> 32365090

Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models.

Sahir R Bhatnagar^1,2, Yi Yang³, Tianyuan Lu^4,5, Erwin Schurr⁶, J C Loredo-Osti⁷, Marie Forest⁸, Karim Oualkacha⁹, Celia M T Greenwood^1,4,5,10,11.

Abstract

Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMM) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect called ggmix for simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models compared to the two-stage approach or principal component adjustment with better prediction accuracy. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are available in an R package available on CRAN (https://cran.r-project.org/package=ggmix).

Entities: Chemical Disease Gene Mutation Species

Year: 2020 PMID： 32365090 PMCID： PMC7224575 DOI： 10.1371/journal.pgen.1008766

Source DB: PubMed Journal: PLoS Genet ISSN： 1553-7390 Impact factor: 5.917

Introduction

Genome-wide association studies (GWAS) have become the standard method for analyzing genetic datasets owing to their success in identifying thousands of genetic variants associated with complex diseases (https://www.genome.gov/gwastudies/). Despite these impressive findings, the discovered markers have only been able to explain a small proportion of the phenotypic variance; this is known as the missing heritability problem [1]. One plausible reason is that there are many causal variants that each explain a small amount of variation with small effect sizes [2]. Methods such as GWAS, which test each variant or single nucleotide polymorphism (SNP) independently, may miss these true associations due to the stringent significance thresholds required to reduce the number of false positives [1]. Another major issue to overcome is that of confounding due to geographic population structure, family and/or cryptic relatedness which can lead to spurious associations [3]. For example, there may be subpopulations within a study that differ with respect to their genotype frequencies at a particular locus due to geographical location or their ancestry. This heterogeneity in genotype frequency can cause correlations with other loci and consequently mimic the signal of association even though there is no biological association [4, 5]. Studies that separate their sample by ethnicity to address this confounding suffer from a loss in statistical power due to the drop in sample size. To address the first problem, multivariable regression methods have been proposed which simultaneously fit many SNPs in a single model [6, 7]. Indeed, the power to detect an association for a given SNP may be increased when other causal SNPs have been accounted for. Conversely, a stronger signal from a causal SNP may weaken false signals when modeled jointly [6]. Solutions for confounding by population structure have also received significant attention in the literature [8-11]. There are two main approaches to account for the relatedness between subjects: 1) the principal component (PC) adjustment method and 2) the linear mixed model (LMM). The PC adjustment method includes the top PCs of genome-wide SNP genotypes as additional covariates in the model [12]. The LMM uses an estimated covariance matrix from the individuals’ genotypes and includes this information in the form of a random effect [3]. While these problems have been addressed in isolation, there has been relatively little progress towards addressing them jointly at a large scale. Region-based tests of association have been developed where a linear combination of p variants is regressed on the response variable in a mixed model framework [13]. In case-control data, a stepwise logistic-regression procedure was used to evaluate the relative importance of variants within a small genetic region [14]. These methods however are not applicable in the high-dimensional setting, i.e., when the number of variables p is much larger than the sample size n, as is often the case in genetic studies where millions of variants are measured on thousands of individuals. There has been recent interest in using penalized linear mixed models, which place a constraint on the magnitude of the effect sizes while controlling for confounding factors such as population structure. For example, the LMM-lasso [15] places a Laplace prior on all main effects while the adaptive mixed lasso [16] uses the L1 penalty [17] with adaptively chosen weights [18] to allow for differential shrinkage amongst the variables in the model. Another method applied a combination of both the lasso and group lasso penalties in order to select variants within a gene most associated with the response [19]. However, methods such as the LMM-lasso are normally performed in two steps. First, the variance components are estimated once from a LMM with a single random effect. These LMMs normally use the estimated covariance matrix from the individuals’ genotypes to account for the relatedness but assumes no SNP main effects (i.e. a null model). The residuals from this null model with a single random effect can be treated as independent observations because the relatedness has been effectively removed from the original response. In the second step, these residuals are used as the response in any high-dimensional model that assumes uncorrelated errors. This approach has both computational and practical advantages since existing penalized regression software such as glmnet [20] and gglasso [21], which assume independent observations, can be applied directly to the residuals. However, recent work has shown that there can be a loss in power if a causal variant is included in the calculation of the covariance matrix as its effect will have been removed in the first step [13, 22]. In this paper we develop a general penalized LMM framework called ggmix that simultaneously selects variables and estimates their effects, accounting for between-individual correlations. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Our method can handle several sparsity inducing penalties such as the lasso [17] and elastic net [23]. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models compared to the two-stage approach or principal component adjustment with better prediction accuracy. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. All of our algorithms are implemented in the ggmix R package hosted on CRAN with extensive documentation (https://sahirbhatnagar.com/ggmix). We provide a brief demonstration of the ggmix package in S2 Text. The rest of the paper is organized as follows. In Results, we compare the performance of our proposed approach and demonstrate the scenarios where it can be advantageous to use over existing methods through simulation studies and three real data analyses. This is followed by a discussion of our results, some limitations and future directions in Discussion. Materials and methods describes the ggmix model, the optimization procedure and the algorithm used to fit it.

Results

In this section we demonstrate the performance of ggmix in a simulation study and three real data applications.

Simulation study

We evaluated the performance of ggmix in a variety of simulated scenarios. For each simulation scenario we compared ggmix to the lasso and the twostep method. For the lasso, we included the top 10 principal components from the simulated genotypes used to calculate the kinship matrix as unpenalized predictors in the design matrix. For the twostep method, we first fitted an intercept only model with a single random effect using the average information restricted maximum likelihood (AIREML) algorithm [24] as implemented in the gaston R package [25]. The residuals from this model were then used as the response in a regular lasso model. Note that in the twostep method, we removed the kinship effect in the first step and therefore did not need to make any further adjustments when fitting the penalized model. We fitted the lasso using the default settings and standardize = FALSE in the glmnet package [20], with 10-fold cross-validation (CV) to select the optimal tuning parameter. For other parameters in our simulation study, we defined the following quantities: n: sample size c: percentage of causal SNPs : true effect size vector of length p S0 = {j; (β) ≠ 0} the index of the true active set with cardinality |S0| = c × p causal: the list of causal SNP indices kinship: the list of SNP indices for the kinship matrix X: n × p matrix of SNPs that were included as covariates in the model We simulated data from the model where is the polygenic effect and is the error term. Here, Φ is the covariance matrix based on the kinship SNPs from n individuals, I is the identity matrix and parameters σ2 and η ∈ [0, 1] determine how the variance is divided between P and . The values of the parameters that we used were as follows: narrow sense heritability η = {0.1, 0.3}, number of covariates p = 5, 000, number of kinship SNPs k = 10, 000, percentage of causal SNPs c = {0%, 1%} and σ2 = 1. In addition to these parameters, we also varied the amount of overlap between the causal list and the kinship list. We considered two main scenarios: None of the causal SNPs are included in kinship set. All of the causal SNPs are included in the kinship set. Both kinship matrices were meant to contrast the model behavior when the causal SNPs are included in both the main effects and random effects (referred to as proximal contamination [8]) versus when the causal SNPs are only included in the main effects. These scenarios are motivated by the current standard of practice in GWAS where the candidate marker is excluded from the calculation of the kinship matrix [8]. This approach becomes much more difficult to apply in large-scale multivariable models where there is likely to be overlap between the variables in the design matrix and kinship matrix. We simulated random genotypes from the BN-PSD admixture model with 1D geography and 10 subpopulations using the bnpsd package [26, 27]. In Fig 1, we plot the estimated kinship matrix from a single simulated dataset in the form of a heatmap where a darker color indicates a closer genetic relationship.

Fig 1

Empirical kinship matrix.

Example of an empirical kinship matrix used in simulation studies. This scenario models a 1D geography with extensive admixture.

Empirical kinship matrix.

Example of an empirical kinship matrix used in simulation studies. This scenario models a 1D geography with extensive admixture. In Fig 2 we plot the first two principal component scores calculated from the simulated genotypes used to calculate the kinship matrix in Fig 1, and color each point by subpopulation membership. We can see that the PCs can identify the subpopulations which is why including them as additional covariates in a regression model has been considered a reasonable approach to control for confounding.

Fig 2

First two principal components.

First two principal component scores of the genotype data used to estimate the kinship matrix where each color represents one of the 10 simulated subpopulations.

First two principal components.

First two principal component scores of the genotype data used to estimate the kinship matrix where each color represents one of the 10 simulated subpopulations. Using this set-up, we randomly partitioned 1000 simulated observations into 80% for training and 20% for testing. The training set was used to fit the model and select the optimal tuning parameter only, and the resulting model was evaluated on the test set. Let be the estimated value of the optimal regularization parameter, the estimate of at regularization parameter , and the index of the set of non-zero estimated coefficients. To compare the methods in the context of true positive rate (TPR), we selected the largest tuning parameter that would result in a false positive rate (FPR) closest to 5%, but not more. Note that in practice, this approach to selecting the tuning parameter is generally not possible since we do not know the underlying true model in advance. For real data, we suggest an information criterion approach described in S1 Text or a sample splitting approach such as the one we used for the UK Biobank analysis. We also compared the model size (), test set prediction error based on the refitted unpenalized estimates for each selected model, the estimation error (), and the variance components (η, σ2) for the polygenic random effect and error term. The results are summarized in Table 1. We see that ggmix outperformed the twostep in terms of TPR, and was comparable to the lasso. This was the case, regardless of true heritability and whether the causal SNPs were included in the calculation of the kinship matrix. For the twostep however, the TPR at a FPR of 5%, drops, on average, from 0.84 (when causal SNPs are not in the kinship) to 0.76 (when causal SNPs are in the kinship). Across all simulation scenarios, ggmix had the smallest estimation error, and smallest root mean squared prediction error (RMSE) on the test set while also producing the most parsimonious models. Both the lasso and twostep selected more false positives, even in the null model scenario. Both the twostep and ggmix overestimated the heritability though ggmix was closer to the true value. When none of the causal SNPs were in the kinship, both methods tended to overestimate the truth when η = 10% and underestimate when η = 30%. Across all simulation scenarios ggmix was able to (on average) correctly estimate the error variance. The lasso tended to overestimate σ2 in the null model while the twostep overestimated σ2 when none of the causal SNPs were in the kinship matrix.

Table 1

Simulation study results.

		Null model				1% Causal SNPs
		No overlap		All causal SNPs in kinship		No overlap		All causal SNPs in kinship
Metric	Method	10%	30%	10%	30%	10%	30%	10%	30%
TPR at FPR	twostep	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.84 (0.05)	0.84 (0.05)	0.76 (0.09)	0.77 (0.08)
	lasso	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.86 (0.05)	0.85 (0.05)	0.86 (0.05)	0.86 (0.05)
	ggmix	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.86 (0.05)	0.86 (0.05)	0.85 (0.05)	0.86 (0.05)
Model Size	twostep	0 (0, 5)	0 (0, 2)	0 (0, 5)	0 (0, 2)	328 (289, 388)	332 (287, 385)	284 (250, 329)	284 (253, 319)
	lasso	0 (0, 6)	0 (0, 5)	0 (0, 6)	0 (0, 5)	278 (246, 317)	276 (245, 314)	279 (252, 321)	285 (244, 319)
	ggmix	0 (0, 0)	0 (0, 0)	0 (0, 0)	0 (0, 0)	43 (39, 49)	43 (39, 48)	44 (38, 49)	43 (38, 48)
RMSE	twostep	1.02 (0.07)	1.02 (0.06)	1.02 (0.07)	1.02 (0.06)	1.42 (0.10)	1.41 (0.10)	1.44 (0.33)	1.40 (0.22)
	lasso	1.02 (0.06)	1.02 (0.06)	1.02 (0.06)	1.02 (0.06)	1.39 (0.09)	1.38 (0.09)	1.40 (0.08)	1.38 (0.08)
	ggmix	1.00 (0.05)	1.00 (0.05)	1.00 (0.05)	1.00 (0.05)	1.22 (0.10)	1.20 (0.10)	1.23 (0.11)	1.23 (0.12)
Estimation Error	twostep	0.12 (0.22)	0.09 (0.19)	0.12 (0.22)	0.09 (0.19)	2.97 (0.60)	2.92 (0.60)	3.60 (5.41)	3.21 (3.46)
	lasso	0.13 (0.21)	0.12 (0.22)	0.13 (0.21)	0.12 (0.22)	2.76 (0.46)	2.69 (0.47)	2.82 (0.48)	2.75 (0.48)
	ggmix	0.00 (0.01)	0.01 (0.02)	0.00 (0.01)	0.01 (0.02)	2.11 (1.28)	2.04 (1.22)	2.21 (1.24)	2.28 (1.34)
Error Variance	twostep	0.87 (0.11)	0.69 (0.15)	0.87 (0.11)	0.69 (0.15)	14.23 (3.53)	14.13 (3.52)	1.42 (1.71)	1.28 (1.66)
	lasso	0.98 (0.05)	0.96 (0.05)	0.98 (0.05)	0.96 (0.05)	1.04 (0.13)	1.02 (0.13)	1.03 (0.14)	1.01 (0.14)
	ggmix	0.85 (0.18)	0.64 (0.20)	0.85 (0.18)	0.64 (0.20)	2.00 (0.49)	1.86 (0.51)	1.06 (0.46)	0.83 (0.45)
Heritability	twostep	0.13 (0.11)	0.31 (0.15)	0.13 (0.11)	0.31 (0.15)	0.26 (0.14)	0.26 (0.14)	0.92 (0.08)	0.93 (0.08)
	lasso	–	–	–	–	–	–	–	–
	ggmix	0.15 (0.18)	0.37 (0.21)	0.15 (0.18)	0.37 (0.21)	0.18 (0.16)	0.23 (0.17)	0.59 (0.20)	0.68 (0.19)

Note: median (inter-quartile range) is given for model size.

Simulation study results.

Mean (standard deviation) from 200 simulations stratified by the number of causal SNPs (null, 1%), the overlap between causal SNPs and kinship matrix (no overlap, all causal SNPs in kinship), and true heritability (10%, 30%). For all simulations, sample size is n = 1000, the number of covariates is p = 5000, and the number of SNPs used to estimate the kinship matrix is k = 10000. TPR at FPR = 5% is the true positive rate at a fixed false positive rate of 5%. Model Size () is the number of selected variables in the training set using the high-dimensional BIC for ggmix and 10-fold cross validation for lasso and twostep. RMSE is the root mean squared error on the test set. Estimation error is the squared distance between the estimated and true effect sizes. Error variance (σ2) for twostep is estimated from an intercept only LMM with a single random effect and is modeled explicitly in ggmix. For the lasso we use [28] as an estimator for σ2. Heritability (η) for twostep is estimated as from an intercept only LMM with a single random effect where and are the variance components for the random effect and error term, respectively. η is explictly modeled in ggmix. There is no positive way to calculate η for the lasso since we are using a PC adjustment. Note: median (inter-quartile range) is given for model size. Overall, we observed that variable selection results and RMSE for ggmix were similar regardless of whether the causal SNPs were in the kinship matrix or not. This result is encouraging since in practice the kinship matrix is constructed from a random sample of SNPs across the genome, some of which are likely to be causal, particularly in polygenic traits. In particular, our simulation results show that the principal component adjustment method may not be the best approach to control for confounding by population structure, particularly when variable selection is of interest.

Real data applications

Three datasets with different features were used to illustrate the potential advantages of ggmix over existing approaches such as PC adjustment in a lasso regression. In the first two datasets, family structure induced low levels of correlation and sparsity in signals. In the last, a dataset involving mouse crosses, correlations were extremely strong and could confound signals.

UK Biobank

With more than 500,000 participants, the UK Biobank is one of the largest genotyped health care registries in the world. Among these participants, 147,731 have been inferred to be related to at least one individual in this cohort [29]. Such a widespread genetic relatedness may confound association studies and bias trait predictions if not properly accounted for. Among these related individuals, 18,150 have a documented familial relationship (parent-offspring, full siblings, second degree or third degree) that was previously inferred in [30]. We attempted to derive a polygenic risk score for height among these individuals. As suggested by a reviewer, the goal of this analysis was to see how the different methods performed for a highly polygenic trait in a set of related individuals. We compared the ggmix-derived polygenic risk score to those derived by the twostep and lasso methods. We first estimated the pairwise kinship coefficient among the 18,150 reportedly related individuals based on 784,256 genotyped SNPs using KING [31]. We grouped related individuals with a kinship coefficient > 0.044 [31] into 8,300 pedigrees. We then randomly split the dataset into a training set, a model selection set and a test set of roughly equal sample size, ensuring all individuals in the same pedigree were assigned into the same set. We inverse normalized the standing height after adjusting for age, sex, genotyping array, and assessment center following Yengo et al. [32]. To reduce computational complexity, we selected 10,000 SNPs with the largest effect sizes associated with height from a recent large meta-analysis [32]. Among these 10,000 SNPs, 1,233 were genotyped and used for estimating the kinship whereas the other 8,767 SNPs were imputed based on the Haplotype Reference Consortium reference panel [33]. The distribution of the 10,000 SNPs by chromosome and whether or not the SNP was imputed is shown in S1 Fig. We see that every chromosome contributed SNPs to the model with 15% coming from chromosome 6. The markers we used are theoretically independent since Yengo et al. performed a COJO analysis which should have tuned down signals due to linkage disequilibrium [32]. We used ggmix, twostep and lasso to select SNPs most predictive of the inverse normalized height on the training set, and chose the λ with the lowest prediction RMSE on the model selection set for each method. We then examined the performance of each derived polygenic risk score on the test set. Similar to simulation study, we adjusted for the top 10 genetic PCs as unpenalized predictors when fitting the lasso models, and supplied the kinship matrix based on 784,256 genotyped SNPs to ggmix and twostep. We found that with a kinship matrix estimated using all genotyped SNPs, ggmix had the possibility to achieve a lower RMSE on the model selection set compared to the twostep and lasso methods (Fig 3A). An optimized ggmix-derived polygenic risk score that utilized the least number of SNPs was also able to better predict the trait with lower RMSE on the test set (Fig 3B).

Fig 3

Model selection and testing in the UK Biobank.

Model selection and testing in the UK Biobank.

(a) Root-mean-square error of three methods on the model selection set with respect to a grid search of penalty factor used on the training set. (b) Performance of four methods on the test set with penalty factor optimized on the model selection set. The x-axis has a logarithmic scale. The BSLMM method optimized coefficients of each SNP through an MCMC process on the training set and was directly evaluated on the test set. We additionally applied a Bayesian Sparse Linear Mixed Model (BSLMM) [34] implemented in the GEMMA package [35] to derive a polygenic risk score on the training set. A posterior probability of inclusion of each SNP was provided and prediction was based on all SNPs with a positive posterior probability. We found that although the BSLMM-based polygenic risk score leveraged the most SNPs, it did not achieve a comparable prediction accuracy as the other three methods (Fig 3B). Likely due to the small effect sizes of these SNPs, only 94, 35 and 1 SNPs had a posterior inclusion probability above 0.05, 0.10 and 0.50, respectively. The model would have further reduced prediction accuracy if the prediction was based only on these SNPs.

GAW20

In the most recent Genetic Analysis Workship 20 (GAW20), the causal modeling group investigated causal relationships between DNA methylation (exposure) within some genes and the change in high-density lipoproteins ΔHDL (outcome) using Mendelian Randomization (MR) [36]. Penalized regression methods were used to select SNPs strongly associated with the exposure in order to be used as an instrumental variable (IV) [37, 38]. However, since GAW20 data consisted of families, twostep methods were used which could have resulted in a large number of false positives or false negatives. ggmix now provides an alternative approach that could be used for selecting the IV while accounting for the family structure of the data. We applied ggmix to all 200 GAW20 simulation datasets, each of 679 observations, and compared its performance to the twostep and lasso methods. Using a Factored Spectrally Transformed Linear Mixed Model (FaST-LMM) [39] adjusted for age and sex, we validated the effect of rs9661059 on blood lipid trait to be significant (genome-wide p = 6.29 × 10−9). Though several other SNPs were also associated with the phenotype, these associations were probably mediated by CpG-SNP interaction pairs and did not reach statistical significance. Therefore, to avoid ambiguity, we only focused on chromosome 1 containing 51,104 SNPs, including rs9661059. Given that population admixture in the GAW20 data was likely, we estimated the population kinship using REAP [40] after decomposing population compositions using ADMIXTURE [41]. We used 100,276 LD-pruned whole-genome genotyped SNPs for estimating the kinship. Among these, 8100 were included as covariates in our models based on chromosome 1. The causal SNP was also among the 100,276 SNPs. All methods were fit according to the same settings described in our simulation study, and adjusting for age and sex. We calculated the median (inter-quartile range) number of active variables, and RMSE (standard deviation) based on five-fold CV on each simulated dataset. On each simulated replicate, we calibrated the methods so that they could be easily compared by fixing the true positive rate to 1 and then minimizing the false positive rate. Hence, the selected SNP, rs9661059, was likely to be the true positive for each method, and non-causal SNPs were excluded to the greatest extent. All three methods precisely chose the correct predictor without any false positives in more than half of the replicates, as the causal signal was strong. However, when some false positives were selected (i.e. when the number of active variables > 1), ggmix performed comparably to twostep, while the lasso was inclined to select more false positives as suggested by the larger third quartile number of active variables (Table 2). We also observed that ggmix outperformed the twostep method with lower CV RMSE using the same number of SNPs. Meanwhile, it achieved roughly the same prediction accuracy as lasso but with fewer non-causal SNPs (Table 2). It is also worth mentioning that there was very little correlation between the causal SNP and SNPs within a 1Mb-window around it (see S2 Fig), making it an ideal scenario for the lasso and related methods.

Table 2

GAW20 simulation study results.

Method	Model Size	RMSE (SD)
twostep	1 (1–11)	0.3604 (0.0242)
lasso	1 (1–15)	0.3105 (0.0199)
ggmix	1 (1–12)	0.3146 (0.0210)
BSLMM (PIP > 0)	40,737 (39,901–41,539)	0.2503 (0.0099)
BSLMM (PIP > 0.05)	2 (1–4)	-
BSLMM (PIP > 0.10)	0 (0–1)	-
BSLMM (PIP > 0.50)	0 (0–0)	-

Note: median (inter-quartile range) is given for model size.

GAW20 simulation study results.

Summary of model performance based on 200 GAW20 simulations for the twostep, lasso, ggmix and BSLMM model with different posterior inclusion probability (PIP) thresholds. Five-fold cross-validation root-mean-square error (RMSE) was reported for each simulation replicate. Prediction performance was not reported for BSLMM with PIP greater than 0.05, 0.10 and 0.50 because some of the replications contained no active SNPs. Note: median (inter-quartile range) is given for model size. We also applied the BSLMM method by performing five-fold CV on each of the 200 simulated replicates. We found that while BSLMM achieved a lower CV RMSE compared to the other methods (Table 2), this higher prediction accuracy relied on approximately 80% of the 51,104 SNPs with a positive posterior inclusion probability. This may suggest overfitting in this dataset. We additionally tried imposing a stricter posterior inclusion probability threshold (0.05, 0.10 and 0.50) in order to improve feature selection. These thresholds however, resulted in overly sparse models as most SNPs had a low posterior probability. It is also noteworthy that we did not adjust for age and sex in the BSLMM model, as the current implementation of the method in the GEMMA package does not allow adjustment for covariates.

Mouse crosses and sensitivity to mycobacterial infection

Mouse inbred strains of genetically identical individuals are extensively used in research. Crosses of different inbred strains are useful for various studies of heritability focusing on either observable phenotypes or molecular mechanisms, and in particular, recombinant congenic strains have been an extremely useful resource for many years [42]. However, ignoring complex genetic relationships in association studies can lead to inflated false positives in genetic association studies when different inbred strains and their crosses are investigated [43-45]. Therefore, a previous study developed and implemented a mixed model to find loci associated with mouse sensitivity to mycobacterial infection [46]. The random effects in the model captured complex correlations between the recombinant congenic mouse strains based on the proportion of the DNA shared identical by descent. Through a series of mixed model fits at each marker, new loci that impact growth of mycobacteria on chromosome 1 and chromosome 11 were identified. Here we show that ggmix can identify these loci, as well as potentially others, in a single analysis. We reanalyzed the growth permissiveness in the spleen, as measured by colony forming units (CFUs), 6 weeks after infection from Mycobacterium bovis Bacille Calmette-Guerin (BCG) Russia strain as reported in [46]. By taking the consensus between the “main model” and the “conditional model” of the original study, we regarded markers D1Mit435 on chromosome 1 and D11Mit119 on chromosome 11 as two true positive loci. We directly estimated the kinship between mice using genotypes at 625 microsatellite markers. The estimated kinship was entered directly into ggmix and twostep. For the lasso, we calculated and included the first 10 principal components of the estimated kinship. To evaluate the robustness of different models, we bootstrapped the 189-sample dataset and repeated the analysis 200 times. We then conceived a two-fold criteria to evaluate performance of each model. We first examined whether a model could pick up both true positive loci using some λ. If the model failed to pick up both loci simultaneously with any λ, we counted as modeling failure on the corresponding boostrap replicate; otherwise, we counted as modeling success and recorded which other loci were picked up given the largest λ. Consequently, similar to the strategy used in the GAW20 analysis, we optimized the models by tuning the penalty factor such that these two true positive loci were picked up, while the number of other active loci was minimized. Significant markers were defined as those captured in at least half of the successful bootstrap replicates (Fig 4).

Fig 4

Comparison of model performance on the mouse cross data.

Comparison of model performance on the mouse cross data.

Pie charts depict model robustness where grey areas denote bootstrap replicates on which the corresponding model is unable to capture both true positives using any penalty factor, whereas colored areas denote successful replicates. Chromosome-based signals record in how many successful replicates the corresponding loci are picked up by the corresponding optimized model. Red dashed lines delineate significance thresholds. We demonstrated that ggmix recognized the true associations more robustly than twostep and lasso. In almost all (99%) bootstrap replicates, ggmix was able to capture both true positives, while the twostep failed in 19% of the replicates and the lasso failed in 56% of the replicates by missing at least one of the two true positives (Fig 4). The robustness of ggmix is particularly noteworthy due to the strong correlations between all microsatellite markers in this dataset (see S3 Fig). These strong correlations with the causal markers, partially explain the poor performance of the lasso as it suffers from unstable selections in the presence of correlated variables (e.g. [47]). We also identified several other loci that might also be associated with susceptibility to mycobacterial infection (Table 3). Among these new potentially-associated markers, D2Mit156 was found to play a role in control of parasite numbers of Leishmania tropica in lymph nodes [48]. An earlier study identified a parent-of-origin effect at D17Mit221 on CD4M levels [49]. This effect was more visible in crosses than in parental strains. In addition, D14Mit131, selected only by ggmix, was found to have a 9% loss of heterozygosity in hybrids of two inbred mouse strains [50], indicating the potential presence of putative suppressor genes pertaining to immune surveillance and tumor progression [51]. This result might also suggest association with anti-bacterial responses yet to be discovered.

Table 3

Mouse crosses and sensitivity to mycobacterial infection.

Additional loci significantly associated with mouse susceptibility to mycobacterial infection, after excluding two true positives. Loci needed to be identified in at least 50% of the successful bootstrap replicates that captured both true positive loci.

Method	Marker	Position in cM	Position in bp
twostep	N/A	N/A	N/A
lasso	D2Mit156	Chr2:31.66	Chr2:57081653-57081799
lasso	D14Mit155	Chr14:31.52	Chr14:59828398-59828596
ggmix	D2Mit156	Chr2:31.66	Chr2:57081653-57081799
	D14Mit131	Chr14:63.59	Chr14:120006565-120006669
	D17Mit221	Chr17:59.77	Chr17:90087704-90087842

Note: median (inter-quartile range) is given for model size.

Mouse crosses and sensitivity to mycobacterial infection.

Discussion

We have developed a general penalized LMM framework called ggmix which simultaneously selects SNPs and adjusts for population structure in high dimensional prediction models. We compared our method to the twostage procedure, where in the first stage, the dependence between observations is adjusted for in a LMM with a single random effect and no covariates (i.e. null model). The residuals from this null model can then be used in any model for independent observations because the relatedness has been effectively removed from the original response. We also compared our method to the lasso and BSLMM which are closely related to ggmix since they also jointly model the relatedness and SNPs in a single step. The key differences are that the lasso uses a principal component adjustment and BSLMM is a Bayesian method focused on phenotype prediction. Through an extensive simulation study and three real data analyses that mimic many experimental designs in genetics, we show that the current approaches of PC adjustment and two-stage procedures are not necessarily sufficient to control for confounding by population structure leading to a high number of false positives. Our simulation results show that ggmix outperforms existing methods in terms of sparsity and prediction error even when the causal variants are included in the kinship matrix (Table 1). Many methods for single-SNP analyses avoid this proximal contamination [8] by using a leave-one-chromosome-out scheme [52], i.e., construct the kinship matrix using all chromosomes except the one on which the marker being tested is located. However, this approach is not possible if we want to model many SNPs (across many chromosomes) jointly to create, for example, a polygenic risk score. For the purposes of variable selection, we would also want to model all chromosomes together since the power to detect an association for a given SNP may be increased when other causal SNPs have been accounted for. Conversely, a stronger signal from a causal SNP may weaken false signals when modeled jointly [6], particularly when the markers are highly correlated as in the mouse crosses example. In the UK Biobank, we found that with a kinship matrix estimated using all genotyped SNPs, ggmix had achieved a lower RMSE on the model selection set compared to the twostep and lasso methods. Furthermore, an optimized ggmix-derived polygenic risk score that utilized the least number of SNPs was also able to better predict the trait with lower RMSE on the test set. In the GAW20 example, we showed that while all methods were able to select the strongest causal SNP, ggmix did so with the least amount of false positives while also maintaining good predictive ability. In the mouse crosses example, we showed that ggmix is robust to perturbations in the data using a bootstrap analysis. Indeed, ggmix was able to consistently select the true positives across bootstrap replicates, while twostep failed in 19% of the replicates and lasso failed in 56% of the replicates by missing of at least one of the two true positives. Our re-analysis of the data also lead to some potentially new findings, not found by existing methods, that may warrant further study. This particular example had many markers that were strongly correlated with each other (see S3 Fig). Nevertheless, we observed that the two true positive loci were the most often selected while none of the nearby markers were picked up in more than 50% of the 200 bootstrap replicates. This shows that our method does recognize the true positives in the presence of highly correlated markers. Nevertheless, we think the issue of variable selection for correlated SNPs warrants further study. The recently proposed Precision Lasso [47] seeks to address this problem in the high-dimensional fixed effects model. We emphasize here that previously developed methods such as the LMM-lasso [15] use a two-stage fitting procedure without any convergence details. From a practical point of view, there is currently no implementation that provides a principled way of determining the sequence of tuning parameters to fit, nor a procedure that automatically selects the optimal value of the tuning parameter. To our knowledge, we are the first to develop a coordinate gradient descent (CGD) algorithm in the specific context of fitting a penalized LMM for population structure correction with theoretical guarantees of convergence. Furthermore, we develop a principled method for automatic tuning parameter selection and provide an easy-to-use software implementation in order to promote wider uptake of these more complex methods by applied practitioners. Although we derive a CGD algorithm for the ℓ1 penalty, our approach can also be easily extended to other penalties such as the elastic net and group lasso with the same guarantees of convergence. A limitation of ggmix is that it first requires computing the covariance matrix with a computation time of followed by a spectral decomposition of this matrix in time where k is the number of SNP genotypes used to construct the covariance matrix. This computation becomes prohibitive for large cohorts such as the UK Biobank [53] which have collected genetic information on half a million individuals. When the matrix of genotypes used to construct the covariance matrix is low rank, there are additional computational speedups that can be implemented. While this has been developed for the univariate case [8], to our knowledge, this has not been explored in the multivariable case. We are currently developing a low rank version of the penalized LMM developed here, which reduces the time complexity from to . There is also the issue of how our model scales with an increasing number of covariates (p). Due to the coordinate-wise optimization procedure, we expect this to be less of an issue, but still prohibitive for p > 1 × 105. The biglasso package [54] uses memory mapping strategies for large p, and this is something we are exploring for ggmix. As was brought up by a reviewer, the simulations and real data analyses presented here contained many more markers used to estimate the kinship than the sample size (n/k ≤ 0.1). In the single locus association test, Yang el al. [22] found that proximal contamination was an issue when n/k ≈ 1. We believe further theoretical study is needed to see if these results can be generalized to the multivariable models being fit here. Once the computational limitations of sample size mentioned above have been addressed, these theoretical results can be supported by simulation studies. There are other applications in which our method could be used as well. For example, there has been a renewed interest in polygenic risk scores (PRS) which aim to predict complex diseases from genotypes. ggmix could be used to build a PRS with the distinct advantage of modeling SNPs jointly, allowing for main effects as well as interactions to be accounted for. Based on our results, ggmix has the potential to produce more robust and parsimonious models than the lasso with better predictive accuracy. Our method is also suitable for fine mapping SNP association signals in genomic regions, where the goal is to pinpoint individual variants most likely to impact the undelying biological mechanisms of disease [55].

Materials and methods

Model set-up

Let i = 1, …, N be a grouping index, j = 1, …, n the observation index within a group and the total number of observations. For each group let be the observed vector of responses or phenotypes, X an n × (p + 1) design matrix (with the column of 1s for the intercept), a group-specific random effect vector of length n and the individual error terms. Denote the stacked vectors , , , and the stacked matrix . Furthermore, let be a vector of fixed effects regression coefficients corresponding to X. We consider the following linear mixed model with a single random effect [56]: where the random effect and the error variance ε are assigned the distributions Here, is a known positive semi-definite and symmetric covariance or kinship matrix calculated from SNPs sampled across the genome, is the identity matrix and parameters σ2 and η ∈ [0, 1] determine how the variance is divided between and ε. Note that η is also the narrow-sense heritability (h2), defined as the proportion of phenotypic variance attributable to the additive genetic factors [1]. The joint density of Y is therefore multivariate normal: The LMM-Lasso method [15] considers an alternative but equivalent parameterization given by: where , is the genetic variance and is the residual variance. We instead consider the parameterization in Eq 1 since maximization is easier over the compact set η ∈ [0, 1] than over the unbounded interval δ ∈ [0, ∞) [56]. We define the complete parameter vector as Θ ≔ (, η, σ2). The negative log-likelihood for Eq 1 is given by where V = η Φ + (1 − η)I and det(V) is the determinant of V. Let Φ = UDU be the eigen (spectral) decomposition of the kinship matrix Φ, where is an orthonormal matrix of eigenvectors (i.e. UU = I) and is a diagonal matrix of eigenvalues Λ. V can then be further simplified [56] where Since Eq 5 is a diagonal matrix, its inverse is also a diagonal matrix: From Eqs 4 and 6, log(det(V)) simplifies to since det(U) = 1. It also follows from Eq 4 that since for an orthonormal matrix U−1 = U. Substituting Eqs 7, 8 and 9 into Eq 3 the negative log-likelihood becomes where , , denotes the ith element of , is the i, jth entry of and 1 is a column vector of N ones.

Penalized maximum likelihood estimator

We define the p + 3 length vector of parameters Θ ≔ (Θ0, Θ1, …, Θ, Θ, Θ) = (, η, σ2) where . In what follows, p + 2 and p + 3 are the indices in Θ for η and σ2, respectively. In light of our goals to select variables associated with the response in high-dimensional data, we propose to place a constraint on the magnitude of the regression coefficients. This can be achieved by adding a penalty term to the likelihood function Eq 10. The penalty term is a necessary constraint because in our applications, the sample size is much smaller than the number of predictors. We define the following objective function: where f(Θ) ≔ −ℓ(Θ) is defined in Eq 10, P(⋅) is a penalty term on the fixed regression coefficients β1, …, β (we do not penalize the intercept) controlled by the nonnegative regularization parameter λ, and v is the penalty factor for jth covariate. These penalty factors serve as a way of allowing parameters to be penalized differently. Note that we do not penalize η or σ2. An estimate of the regression parameters is obtained by This is the general set-up for our model. In the next Section we provide more specific details on how we solve Eq 11. We note here that the main difference between the proposed model, and the lmmlasso [57], is that we rotate the response vector Y and the design matrix by the eigen vectors of the kinship matrix. This results in a diagonal covariance matrix making our method orders of magnitude faster and usable for high-dimensional genetic data. A secondary difference is that we are limiting ourselves to a single unpenalized random effect.

Computational algorithm

We use a general purpose block coordinate gradient descent algorithm (CGD) [58] to solve Eq 11. At each iteration, we cycle through the coordinates and minimize the objective function with respect to one coordinate only. For continuously differentiable f(⋅) and convex and block-separable P(⋅) (i.e. P() = ∑ P(β)), Tseng and Yun [58] show that the solution generated by the CGD method is a stationary point of Qλ(⋅) if the coordinates are updated in a Gauss-Seidel manner i.e. Qλ(⋅) is minimized with respect to one parameter while holding all others fixed. The CGD algorithm has been successfully applied in fixed effects models (e.g. [20, 59]) and linear mixed models with an ℓ1 penalty [57]. In the next section we provide some brief details about Algorithm 1. A more thorough treatment of the algorithm is given in S1 Text. Algorithm 1: Block Coordinate Gradient Descent Set the iteration counter k ← 0, initial values for the parameter vector Θ(0) and convergence threshold ϵ; for λ ∈ {λ, …, λ} do repeat k ← k + 1 until convergence criterion is satisfied: ‖Θ( − Θ(‖2 < ϵ; end

Updates for the β parameter

Recall that the part of the objective function that depends on has the form where Conditional on η( and σ2(, it can be shown that the solution for β, j = 1, …, p is given by where is the soft-thresholding operator sign(x) is the signum function and (x)+ = max(x, 0). We provide the full derivation in S1 Text.

Updates for the η paramter

Given ( and σ2(, solving for η( becomes a univariate optimization problem: We use a bound constrained optimization algorithm [60] implemented in the optim function in R and set the lower and upper bounds to be 0.01 and 0.99, respectively.

Updates for the σ2 parameter

Conditional on ( and η(, σ2( can be solved for using the following equation: There exists an analytic solution for Eq 12 given by:

Regularization path

In this section we describe how to determine the sequence of tuning parameters λ at which to fit the model. Recall that our objective function has the form The Karush-Kuhn-Tucker (KKT) optimality conditions for Eq 13 are given by: The equations in Eq 14 are equivalent to where w is given by Eq, is with the first column removed, is the first column of , and is the subgradient function of the ℓ1 norm evaluated at . Therefore is a solution in Eq 11 if and only if satisfies Eq 15 for some γ. We can determine a decreasing sequence of tuning parameters by starting at a maximal value for λ = λ for which for j = 1, …, p. In this case, the KKT conditions in Eq 15 are equivalent to We can solve the KKT system of equations in Eq 16 (with a numerical solution for η) in order to have an explicit form of the stationary point . Once we have , we can solve for the smallest value of λ such that the entire vector () is 0: Following Friedman et al. [20], we choose τλ to be the smallest value of tuning parameters λ, and construct a sequence of K values decreasing from λ to λ on the log scale. The defaults are set to K = 100, τ = 0.01 if n < p and τ = 0.001 if n ≥ p.

Warm starts

The way in which we have derived the sequence of tuning parameters using the KKT conditions, allows us to implement warm starts. That is, the solution for λ is used as the initial value Θ(0) for λ. This strategy leads to computational speedups and has been implemented in the ggmix R package.

Prediction of the random effects

We use an empirical Bayes approach (e.g. [61]) to predict the random effects . Let the maximum a posteriori (MAP) estimate be defined as where, by using Bayes rule, f(|Y, , η, σ2) can be expressed as Solving for Eq 17 is equivalent to minimizing the exponent in Eq 18: Taking the derivative of Eq 19 with respect to and setting it to 0 we get: where are the estimates obtained from Algorithm 1.

Phenotype prediction

Here we describe the method used for predicting the unobserved phenotype Y⋆ in a set of individuals with predictor set X⋆ that were not used in the model training e.g. a testing set. Let q denote the number of observations in the testing set and N − q the number of observations in the training set. We assume that a ggmix model has been fit on a set of training individuals with observed phenotype Y and predictor set X. We further assume that Y and Y⋆ are jointly multivariate Normal: Then, from standard multivariate Normal theory, the conditional distribution Y⋆|Y, η, σ2, β, X, X⋆ is where The phenotype prediction is thus given by: where Φ⋆ is the q × (N − q) covariance matrix between the testing and training individuals.

Choice of the optimal tuning parameter

In order to choose the optimal value of the tuning parameter λ, we use the generalized information criterion [62] (GIC): where is the number of non-zero elements in [63] plus two (representing the variance parameters η and σ2). Several authors have used this criterion for variable selection in mixed models with a = log N [57, 64], which corresponds to the BIC. We instead choose the high-dimensional BIC [65] given by a = log(log(N)) * log(p). This is the default choice in our ggmix R package, though the interface is flexible to allow the user to select their choice of a.

Software availability

The ggmix method is written in an R package, which is freely available on CRAN at https://cran.r-project.org/package=ggmix. The complete documentation for this package is available at https://sahirbhatnagar.com/ggmix/. Scripts for running the analyses and reproducing the tables and figures reported in the manuscript are available in an RMarkdown document at https://github.com/sahirbhatnagar/ggmix/blob/master/manuscript/bin/tables-figures.Rmd.

Distribution of SNPs used in UK Biobank analysis.

Distribution of SNPs used in UK Biobank analysis by chromosome and whether or not the SNP was imputed. (TIF) Click here for additional data file.

LD structure among the markers in the GAW20 dataset.

We illustrate the LD structure among the markers in the GAW20 dataset. We show the pairwise r2 for 655 SNPs within a 1Mb-window around the causal SNP rs9661059 (indicated) that we focused on. The dotplot above the heatmap denotes r2 between each SNP and the causal SNP. It is clear that although strong correlation does exist between some SNPs, none of these nearby SNPs is correlated with the causal SNP. The only dot denoting an r2 = 1 represents the causal SNP itself. (TIF) Click here for additional data file.

LD structure among the markers in the mouse dataset.

We illustrate the LD structure among the markers in the mouse dataset. Shown is the pairwise r2 for all microsatellite markers. It is clear that many markers are considerably strongly correlated with each other, as we expected. (TIF) Click here for additional data file.

Block coordinate gradient descent algorithm.

We provide a full derivation of the algorithm used to solve the ggmix objective function. (PDF) Click here for additional data file.

ggmix Package Showcase.

We introduce the freely available and open source ggmix package in R available on CRAN (https://cran.r-project.org/package=ggmix). (PDF) Click here for additional data file. 8 Sep 2019 Dear Dr Bhatnagar, Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time. Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines. Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions. Yours sincerely, Heather J Cordell Associate Editor PLOS Genetics Scott Williams Section Editor: Natural Variation PLOS Genetics Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors demonstrate a method for genome-wide association that combines mixed-model based control of population structure or genetic relatedness with multi-variate regressions. The idea is to increase power for GWAS applications by including multiple markers at once while accounting for structure. Previous tools for this application are somewhat limited, so this method aims to provide a more comprehensive and computationally efficient tools. They include a well-structured R package to implement their method. This is an important topic, and their software is much faster and easier to use than existing tools. However I have several comments on their presentation, particularly the comparisons with existing methods and some limitations of the simulations: Major issues: 1. I think that the novelty of the method is not made clear in the paper. The model itself appears virtually identical to that of the glmmlasso R package (Schelldorfer et al 2011) except limited to one random effect – the objective function appears identical, and those authors also derived a block coordinate gradient descent algorithm, though I don’t know if they are identical. The addition here seems mostly in terms of computational efficiency and flexibility (combinations of L1 and L2 penalties vs just L1 for Schellendorfer). Using the SVD to rotate and diagonalize the LMM so that the random effect covariance matrix is diagonal is a very useful addition and makes the software very fast, but this is not really emphasized in the methods. Also, the BSLMM model (Zhou et al 2013 Plos Genetics) is also effectively a variable selection LMM, though it’s Bayesian instead of frequentist and requires MCMC so is slower. But it’s performance could be compared to ggmix and the two-step LASSO presented in this paper. 2. While this method is much faster than glmmlasso (and presumably BSLMM), how does it scale to large numbers of markers? Typical genomics datasets today contain >1e5-1e6 markers. The datasets used here seem to contain 10K-50K. There is some discussion about limitations of scaling to large N, but not large p. 3. The first set of simulations include population structure and admixture. But all 10K markers are simulated in linkage equilibrium. This is an ideal situation for LASSO. But in real data, nearby markers will be partially correlated. How well does this method select the correct marker when there are other correlated markers? What does it do when the true causal variant is not in the data, but imperfectly correlated markers are (the typical GWAS setting)? Does it tend to select the nearest marker, or does it select a set of nearby markers? I can’t tell in the GAW20 and mouse datasets the extent of the LD among markers and whether this is addressed there. Also, the simulation result that including the causal markers in the kinship matrix has little impact is encouraging. But the Yang et al 2014 paper, which discusses the impact of including causal markers in the estimated kinship matrix, suggests that this is only likely when N/M is large (ie as many or more individuals as markers). In the simulations presented here, N/M is <0.1, which is probably small enough that proximal contamination is not an issue. If N/M were ~1, then the impact might be greater. Note: M is the effective number of markers after accounting for LD among markers. Minor issues • 97: Wang et al 2011 does not treat the variance components as fixed, but iteratively estimates them along with beta. • 142: How much of the total variance is accounted for by causal SNPs vs the background in the simulations? • 355: X_1,…,X_N should each be transposed • Fig 5: I don’t think that the red line is a p-value threshold. Maybe “significance threshold”? How can this be <100 if there were 200 bootstraps and significance required >50% inclusion? • I believe Eq (30) is wrong. The term V^{-1} shouldn’t be present in the likelihood if b is included. Reviewer #2: I have a few comments and suggestions: Page 2, Line 33: It this true: better sensitivity and specificity? What are you using to define sensitivity? Page 7, line 143-163. I found the definitions of the different X matrices confusing. I think you can simplify to indexes: eg: causal for list of causal snp indexes, kinship for list of snp index in the kinship matrix, etc. I’d also refer the the snps as covariates and not label them as fixed, and state when the causal set is in the kinship set. Page 7, line 159: Did you try a larger number of SNPs? In GWAS the number is much larger and PRS methods use much more than 50 independent SNPs for estimation. Page 10, line 183: if I understand your sparsity estimate correctly, with a causal rate of 1%, setting all B to zero would give a value of 99%? Page 11, 188: I am curious why you only reported the ‘optimal’ value of the penalty parameter. Is your method outperforming in terms of sparsity because it just does a better job of selecting a sparse model? Your false positive rate is lower but the true positive rate is much lower. If one is searching for the set of true causal variants, they are usually willing to take the tradeoff of better sensitivity for weeding through the false positives. I would prefer to see curves of FP versus TP rates with the values at the optimal tuning parameter marked. In practice I found AIC/BIC somewhat conservative compared to CV or controlling for error rate via phenotype permutation. Page 11, Line 196: I am sorry if I missed this somewhere, but how was the model tuned for the lasso and twostep in the training data? Did ggmix use the GIC in the materials and methods? Page 14, Line 243 typo- methods Page 12: Figure 3. This data might be better summarized in a table that could include the additional data in the supplementary files. The math is a bit beyond my abilities, but I have previously read a paper that suggests maximizing the log likelihood, −12[ln|V|+(Y−βX)TV−1(Y−BX)] subject to the L1/L2 norm penalty to control for relatedness in penalized regression methods for genetic data (where V is the variance covariance matrix). Is this essentially what you are doing? It would have been interesting to use the set of related individuals in the UK Biobank on a few traits where PRS works well. Reviewer #3: The authors present a penalized multi-variate regression model, ggmix, that jointly models multiple genotypes in mixed-model setup that incorporates the kinship or the genetic relationship matrix (GRM). It is an important problem in statistical genetics and I appreciate that the authors developed a comprehensive algorithm that simultaneously incorporates population structure and variable selection problem. The methods are well described and the paper was easy to follow, but I have some concerns in the experiments and evaluation metric. Overall, I think the paper presents an important problem, but the results are not convincing. First, in simulation results, the ‘correct sparsity’ measure is basically ‘accuracy’ measure in binary classification problem of whether the regression coefficients are zero or non-zero. This ‘accuracy’ measure is often misleading when class distribution is imbalanced. For example, if 99% of coefficients are zero, you can get 99% accuracy by just classifying everything to be zero, which is clearly not a good model. I suggest adapting different measure, such as MCC. I can see that twostep and LASSO both maintains FPR at 0.05, which is why ‘correct sparsity’ is around 0.95. At the same time, we can see that LASSO and twostep achieve higher TPR. Also from a slightly different point of view, in Figure 3(D), comparing TPR at different points on FPR is not a fair comparison. To compare different methods in the context of TPR and FPR, either AUC or TPR at the same rate of FPR should be considered. Intro is slightly misleading, especially in lines 107-109, because I first thought that ggmix takes out causal (i.e. selected) variables out of the relationship matrix, then I later realized that loss of power due to causal SNPs included in the GRM still happens in ggmix, but you aim to minimize the loss by joint modeling – but it is not clear why this would be the case. Is there any theoretical or simulation-based evidence that joint modeling achieves higher power in such a case? In the mouse data, the model parameters were optimized so that the two loci are picked up, and then the evaluation metric is based on whether these two loci are picked up, which is circular. In discussion, it was mentioned that leave-one-chromosome-out approach is possible, but has not been tried. What would be the compelling reason to model all chromosomes together in the proposed problem, especially when the model is still additive and trans-interaction term is not directly modeled? ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No 10 Dec 2019 Submitted filename: response_to_reviewers.pdf Click here for additional data file. 11 Feb 2020 * Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. * Dear Dr Bhatnagar, Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved. We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer. In addition we ask that you: 1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. 2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images. We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] Please let us know if you have any questions while making these revisions. Yours sincerely, Heather J Cordell Associate Editor PLOS Genetics Scott Williams Section Editor: Natural Variation PLOS Genetics Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have significantly revised their manuscript. I think that the inclusions of BSLMM in the comparisons is useful. However, I have a couple additional concerns. 1. Are you sure you’re extracting the model fit from BSLMM correctly? The specifics of use are not described in the methods. It’s a Bayesian model, so gives a probability of inclusion of each SNP. From the prefix.param.txt file, you should use the gamma column to report the number of SNPs that cross a posterior inclusion probability threshold. If you count how many betas are != 0, that will likely be large. But this is not an accurate estimate of the number of markers included in the model. You can also get the posterior on the number of SNPs included from the prefix.gamma.txt file. If you’re counting SNPs that are included, what posterior probability threshold are you using for inclusion? If you’re reporting the % of SNPs included, are you reporting the posterior mean? 2. Also, the authors didn’t apply BSLMM to several of the analyses. There is a write.plink function in the snpStats package that could be used to write GEMMA-compatible input files from R. Also, the PhenotypeSimulator package seems to have a writeStandardOutput function that can write bimbam or Gemma output. Especially if the extraction of estimates of needs to be revised and in fact works more similarly to the other methods, then I would recommend applying it to all analyses (Biobank may be too large). 3. I am still a bit confused about when tuning parameters were set based on TPR / FPR and when based on GIC or CV or other direct methods. In real data, it’s generally not possible to set based on TPR/FPR, but I think most of your comparisons now are done that way. I understand that the goal is to show that your model can work well, and so comparing to the truth is useful. But I think more clarity is needed about when you’re demonstrating the true performance of the model vs when you’re demonstrating how the model would actually be run by a practitioner who did not know any true positive effects going in. 4. 412-414. Is this statement true? Schelldorfer et al 2011 used what they called a "Block Coordinate Gradient Descent method" for their penalized LMM 5. 501: I think that this is underselling your method. The main difference is that your method is orders of magnitude faster than lmmlasso, so it’s actually usable. A secondary difference is that it is limited to only one random effect. 6. 288: how do you get a p-value from ggmix? Reviewer #3: The authors addressed most of my major concerns in this revision. I have one more comment: The authors first state they could not run BSLMM on simulated data and mouse data because of data format issues. Converting genotypes to PLINK format should be straightforward and should not be an issue. I understand for simulating large amount of data, it might be practically difficult, but I cannot understand why mouse microsatellite data cannot be converted to PLINK format. I do not think you need additional experiments with BSLMM but please remove the statement that converting the data to PLINK format is not possible. ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #3: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #3: No 9 Mar 2020 Submitted filename: ggmix_response_to_reviewers.pdf Click here for additional data file. 8 Apr 2020 Dear Dr Bhatnagar, We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been editorially accepted for publication in PLOS Genetics. Congratulations! Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made. Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org. In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date. Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics! Yours sincerely, Heather J Cordell Associate Editor PLOS Genetics Scott Williams Section Editor: Natural Variation PLOS Genetics www.plosgenetics.org Twitter: @PLOSGenetics ---------------------------------------------------- Comments from the reviewers (if applicable): Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: I am satisfied with the changes made by the authors ********** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No ---------------------------------------------------- Data Deposition If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website. The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01153R2 More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support. Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present. ---------------------------------------------------- Press Queries If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org. 24 Apr 2020 PGENETICS-D-19-01153R2 Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models Dear Dr Bhatnagar, We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work! With kind regards, Matt Lyles PLOS Genetics On behalf of: The PLOS Genetics Team Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom plosgenetics@plos.org | +44 (0) 1223-442823 plosgenetics.org | Twitter: @PLOSGenetics

42 in total

1. Estimating kinship in admixed populations.

Authors: Timothy Thornton; Hua Tang; Thomas J Hoffmann; Heather M Ochs-Balcom; Bette J Caan; Neil Risch
Journal: Am J Hum Genet Date: 2012-06-28 Impact factor: 11.025

2. The Bayesian lasso for genome-wide association studies.

Authors: Jiahan Li; Kiranmoy Das; Guifang Fu; Runze Li; Rongling Wu
Journal: Bioinformatics Date: 2010-12-14 Impact factor: 6.937

3. Fast model-based estimation of ancestry in unrelated individuals.

Authors: David H Alexander; John Novembre; Kenneth Lange
Journal: Genome Res Date: 2009-07-31 Impact factor: 9.043

4. Multiple-trait quantitative trait loci analysis using a large mouse sibship.

Authors: A U Jackson; A Fornés; A Galecki; R A Miller; D T Burke
Journal: Genetics Date: 1999-02 Impact factor: 4.562

5. Variance component model to account for sample structure in genome-wide association studies.

Authors: Hyun Min Kang; Jae Hoon Sul; Susan K Service; Noah A Zaitlen; Sit-Yee Kong; Nelson B Freimer; Chiara Sabatti; Eleazar Eskin
Journal: Nat Genet Date: 2010-03-07 Impact factor: 38.330

6. Recombinant congenic strains derived from A/J and C57BL/6J: a tool for genetic dissection of complex traits.

Authors: A Fortin; E Diez; D Rochefort; L Laroche; D Malo; G A Rouleau; P Gros; E Skamene
Journal: Genomics Date: 2001-05-15 Impact factor: 5.736

Review 7. Finding the missing heritability of complex diseases.

Authors: Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher
Journal: Nature Date: 2009-10-08 Impact factor: 49.962