
pLARmEB: integration of least angle regression with empirical Bayes for multilocus genome-wide association studies.

J Zhang¹, J-Y Feng¹, Y-L Ni¹, Y-J Wen¹, Y Niu¹, C L Tamba¹, C Yue¹, Q Song², Y-M Zhang^1,3.

Abstract

Multilocus genome-wide association studies (GWAS) have become the state-of-the-art procedure to identify quantitative trait nucleotides (QTNs) associated with complex traits. However, implementing a multilocus model in GWAS is still difficult. In this study, we integrated least angle regression with empirical Bayes to perform multilocus GWAS under polygenic background control. We used a model transformation algorithm that whitened the covariance matrix of the polygenic matrix K and environmental noise. Markers on one chromosome were included simultaneously in a multilocus model and least angle regression was used to select the most potentially associated single-nucleotide polymorphisms (SNPs), whereas the markers on the other chromosomes were used to calculate the kinship matrix for polygenic background control. The SNPs selected in the multilocus model were then tested for association with the trait by empirical Bayes and a likelihood ratio test. We herein refer to this method as pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes). Results from simulation studies showed that pLARmEB was more powerful in QTN detection and more accurate in QTN effect estimation, had a lower false positive rate and required less computing time than the Bayesian hierarchical generalized linear model, efficient mixed model association (EMMA) and least angle regression plus empirical Bayes. pLARmEB, the multilocus random-SNP-effect mixed linear model and the fast multilocus random-SNP-effect EMMA had almost equal power of QTN detection in simulation experiments. However, 48 previously reported genes for 7 flowering time-related traits in Arabidopsis thaliana were identified only by pLARmEB.


Year: 2017 PMID: 28295030 PMCID: PMC5436030 DOI: 10.1038/hdy.2017.8

Source DB: PubMed Journal: Heredity (Edinb) ISSN: 0018-067X Impact factor: 3.821

Introduction

Most complex traits in human, plant and animal genetics are quantitative traits and these traits are controlled by multiple quantitative trait loci (QTLs). The identification of these loci is usually performed by QTL mapping or genome-wide association study (GWAS). A large number of single-nucleotide polymorphisms (SNPs) can be easily obtained for the genotypes by the rapid development of sequencing and genotyping technologies. If all the SNPs are included in a genetic model, the number of SNPs will be much larger than the sample size. The commonly used methods are infeasible for such an oversaturated model. Many approaches have been proposed to estimate the parameters in the oversaturated model and these approaches include ridge regression (Hoerl and Kennard, 1970), stochastic search variable selection (George and McCulloch, 1993; Yi ), Bayesian shrinkage estimation (Meuwissen ; Wang ), penalized maximum likelihood (Zhang and Xu, 2005; Hoggart ; Zhang ), empirical Bayes (Xu, 2010) and Bayesian-LASSO (Bayesian-least absolute shrinkage and selection operator; Park and Casella, 2008; Yi and Xu, 2008). However, these methods are mainly proposed for linkage analysis in biparental segregation populations, rather than for GWAS in natural population. GWAS has been used to dissect the genetic foundation of quantitative traits (Zhang , 2010; Yu ; Kang ; Zhou and Stephens, 2012; Wang ). The widely used approach, such as efficient mixed model association (EMMA; Kang ; Zhou and Stephens, 2012), was proposed for single-marker analysis under the population structure and polygenic background controls. However, this method has relatively low power in detecting small-effect QTLs. To overcome these problems, therefore, multilocus model methods have been suggested (Fridley ; Lü ), for example, a Bayesian-inspired penalized maximum likelihood approach (Zhang and Xu, 2005; Hoggart ) and PUMA (Penalized Unified Multiple-locus Association; Hoffman ). 
These methods can be used if the number of variables in the multilocus model is not too large. Recent strategies for high-dimensional modeling have focused on reducing the dimension of a large matrix and then selecting the most potentially associated SNPs by using shrinkage methods such as the LASSO and SCAD (smoothly clipped absolute deviation) penalty (Fan and Lv, 2008; Wu ). Although other multilocus approaches have also been proposed by Segura , Moser , Liu , Wang and Wen , further refinement and study are still needed. In this study, we integrated the least angle regression (LARS) algorithm with empirical Bayes to perform multilocus GWAS for quantitative traits, as the LARS algorithm makes LASSO (Tibshirani, 1996) efficient and acceptable (Efron ). To control polygenic background, we adopted the model transformation of Wen that whitens the covariance matrix of the polygenic matrix K and residual noise. The LARS algorithm was applied to the transformed model to select the SNPs most potentially associated with the trait, empirical Bayes was used to estimate the effects of all the selected SNPs, and all the nonzero effects were further examined by a likelihood ratio test to confirm true quantitative trait nucleotides (QTNs). We refer to this method as pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes). pLARmEB was validated by analyzing data sets from a series of Monte Carlo simulation experiments and seven Arabidopsis flowering time traits. We also discuss the possibility of applying pLARmEB to linkage analysis.

Materials and methods

Genetic model

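Let $y_i$ ($i=1,\cdots,n$) be the phenotypic value of the $i$th individual in a sample of size $n$ from a natural population. The genetic model is expressed by

$$y = 1\mu + W\alpha + Z\gamma + u + \varepsilon \qquad (1)$$

where $y=(y_1,\cdots,y_n)^T$; $1$ is an $n \times 1$ vector of 1s and $\mu$ is the total average; $\alpha$ is the population structure effect, treated as fixed; $\gamma \sim \mathrm{MVN}(0,\Sigma)$, with $\Sigma=\mathrm{diag}\{\sigma_1^2,\cdots,\sigma_m^2\}$, are the QTN effects, treated as random, and $m$ is the number of putative QTNs; $W$ and $Z$ are the corresponding design matrices for $\alpha$ and $\gamma$; the polygenic effect $u \sim \mathrm{MVN}(0,\sigma_g^2 K)$ is an $n \times 1$ random vector and $K$ is a known $n \times n$ relatedness matrix; and $\varepsilon$ is the residual error with an assumed $\mathrm{MVN}(0,\sigma^2 I)$ distribution, where $\sigma^2$ is the residual error variance and $I$ is an $n \times n$ identity matrix. As $\gamma$ is treated as being random, the variance of $y$ in the model (1) is

$$\mathrm{Var}(y) = Z\Sigma Z^T + \sigma_g^2 K + \sigma^2 I = \sigma^2 H \qquad (2)$$

where $\lambda_k=\sigma_k^2/\sigma^2$ ($k=1,\cdots,m$), $\lambda_g=\sigma_g^2/\sigma^2$ and $H=Z\,\mathrm{diag}\{\lambda_1,\cdots,\lambda_m\}Z^T+\lambda_g K+I$. Using EMMA, we can obtain the estimate of $\lambda_g$, denoted by $\hat\lambda_g$. Let $B=\hat\lambda_g K+I$; an eigen (or spectral) decomposition of the positive semidefinite matrix $B$ is

$$B = Q\Lambda Q^T = (Q_1\;\; Q_2)\begin{pmatrix}\Lambda_r & 0\\ 0 & 0\end{pmatrix}(Q_1\;\; Q_2)^T = Q_1\Lambda_r Q_1^T \qquad (3)$$

where $Q$ is orthogonal, $\Lambda_r$ is a diagonal matrix with positive eigenvalues, $r=\mathrm{Rank}(B)$, $Q_1$ and $Q_2$ are the $n \times r$ and $n \times (n-r)$ block matrices of $Q$, respectively, and $0$ is the corresponding block zero matrix (Wen ). Let $C=Q_1\Lambda_r^{-1/2}Q_1^T$; the model (1) is changed to

$$y_c = 1_c\mu + W_c\alpha + Z_c\gamma + \varepsilon_c \qquad (4)$$

where $y_c=Cy$, $1_c=C1$, $W_c=CW$, $Z_c=CZ$ and $\varepsilon_c=Cu+C\varepsilon \sim \mathrm{MVN}(0,\sigma^2 I)$ (Wen ). In the above model (4), let $\beta=(\alpha^T,\gamma^T)^T$ and $Y=y_c-1_c\mu$ with a zero mean; standardizing each column in the matrix $(W_c\;\; Z_c)$ produces a new matrix $X$ with $\sum_{i=1}^{n}x_{ij}=0$ and $\sum_{i=1}^{n}x_{ij}^2=1$ ($j=1,\cdots,m$). Therefore, the model (4) can be rewritten as

$$Y = X\beta + \varepsilon_c \qquad (5)$$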
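The whitening step described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' R implementation: it assumes B = λ_g K + I and C = Q1 Λ_r^(-1/2) Q1^T, builds a toy kinship matrix from random markers, and uses a hypothetical value for the variance ratio (`lam_g`). All function and variable names are illustrative.

```python
import numpy as np

def whitening_matrix(K, lam_g, tol=1e-10):
    """Return C = Q1 diag(eigenvalues^-1/2) Q1^T for B = lam_g * K + I."""
    n = K.shape[0]
    B = lam_g * K + np.eye(n)
    eigvals, Q = np.linalg.eigh(B)       # B is symmetric, so eigh applies
    keep = eigvals > tol                 # columns of Q1: positive eigenvalues only
    Q1 = Q[:, keep]
    return Q1 @ np.diag(eigvals[keep] ** -0.5) @ Q1.T

# Toy data: 8 individuals, 50 biallelic markers coded -1/+1 (assumption).
rng = np.random.default_rng(0)
M = rng.choice([-1.0, 1.0], size=(8, 50))
K = M @ M.T / M.shape[1]                 # simple toy relatedness matrix
lam_g = 0.7                              # hypothetical estimate of sigma_g^2 / sigma^2

C = whitening_matrix(K, lam_g)
B = lam_g * K + np.eye(8)
whitened = C @ B @ C.T                   # should be the identity matrix
print(np.allclose(whitened, np.eye(8)))
```

Because C B C^T reduces to the identity, the transformed error Cu + Cε has covariance σ²I, which is what allows ordinary regression tools such as LARS to be applied to the transformed model.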

Parameter estimation

LARS for the full model

LARS is a flexible method for variable selection that has been described previously (Efron ). We used the LARS algorithm to select the n−1 variables that are most likely associated with the quantitative trait of interest. First, let $\hat\beta=0$, so $\hat\mu=X\hat\beta=0$. Then, suppose that $\hat\mu_F$ is the current LARS estimate and that $\hat c=X^T(Y-\hat\mu_F)$ is the vector of current correlations. The active set $F$ is the set of indices corresponding to covariates with the greatest absolute current correlations, $\hat C=\max_j\{|\hat c_j|\}$ and $F=\{j: |\hat c_j|=\hat C\}$. Let $s_j=\mathrm{sign}\{\hat c_j\}$ for $j \in F$. We can calculate $X_F=(\cdots s_j x_j \cdots)_{j\in F}$, $u_F=X_F\omega_F$, $G_F=X_F^TX_F$, $A_F=(1_F^TG_F^{-1}1_F)^{-1/2}$, where $\omega_F=A_FG_F^{-1}1_F$ and $1_F$ is a vector of 1s of length $|F|$. Third, update $\hat\mu_F$ in the LARS algorithm:

$$\hat\mu_{F+} = \hat\mu_F + \hat\gamma u_F \qquad (6)$$

where the step length is

$$\hat\gamma = {\min_{j \in F^c}}^{+}\left\{\frac{\hat C-\hat c_j}{A_F-a_j},\; \frac{\hat C+\hat c_j}{A_F+a_j}\right\},$$

min+ indicates that the minimum is taken over only positive components within each choice of $j$ in the formula of $\hat\gamma$, and $a \equiv X^Tu_F$. Repeat step 2 to step 3 until a convergence criterion is satisfied. The above algorithm was run with the lars package (http://cran.r-project.org/web/packages/lars/) in R. Usually, if all the marker effects are included in one genetic model, the parameters cannot be estimated when m≫n, where n is the sample size and m is the number of variables. As most markers are unlikely to be associated with the trait of interest, once the markers with zero effects are deleted from the full model, the marker effects of the reduced model are estimable. In each LARS variable selection, the n−1 SNPs that are most potentially associated with the trait are selected to construct the reduced model.
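The study ran LARS through the R lars package. Purely as an illustration of the steps described above, here is a small NumPy sketch of basic LARS (without the LASSO modification) that returns the order in which variables enter the active set. The function names, the toy design and the simulated effects are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def standardize_columns(X):
    """Center each column and scale it to unit length, as LARS assumes."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc ** 2).sum(axis=0))

def lars_select(X, y, n_select):
    """Return the indices of the first n_select variables entering a basic LARS path."""
    n, m = X.shape
    mu = np.zeros(n)                              # current LARS estimate
    c = X.T @ (y - mu)                            # current correlations
    active = [int(np.argmax(np.abs(c)))]          # start with the most correlated variable
    while len(active) < n_select:
        c = X.T @ (y - mu)
        C = np.max(np.abs(c[active]))             # common absolute correlation of the active set
        signs = np.sign(c[active])
        XF = X[:, active] * signs                 # sign-adjusted active columns
        g_inv_one = np.linalg.solve(XF.T @ XF, np.ones(len(active)))
        A = g_inv_one.sum() ** -0.5
        u = XF @ (A * g_inv_one)                  # equiangular direction
        a = X.T @ u
        best_step, best_j = np.inf, None
        with np.errstate(divide="ignore", invalid="ignore"):
            for j in range(m):
                if j in active:
                    continue
                # smallest positive step at which variable j ties with the active set
                for step in ((C - c[j]) / (A - a[j]), (C + c[j]) / (A + a[j])):
                    if 1e-12 < step < best_step:
                        best_step, best_j = step, j
        if best_j is None:
            break
        mu = mu + best_step * u                   # move along the equiangular direction
        active.append(best_j)
    return active

# Toy example: 200 individuals, 30 standardized markers, two simulated effects.
rng = np.random.default_rng(42)
X = standardize_columns(rng.normal(size=(200, 30)))
y = 3.0 * X[:, 2] + 2.5 * X[:, 7]                 # noise-free toy phenotype (assumption)
selected = lars_select(X, y, n_select=3)
print(selected)
```

In this toy design the two markers carrying simulated effects are expected to enter first; in a real analysis the selected set would then be passed on to the reduced model.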

Empirical Bayes estimation in the reduced model

In the reduced model,

$$y = X\beta + Z\gamma + \varepsilon \qquad (7)$$

where y is the same as that in the model (1); β is a vector of fixed effects, γ is a vector of random effects of the selected markers, and X and Z are the design matrices for β and γ, respectively. All the parameters in the model (7) were estimated by the empirical Bayes method proposed by Xu (2010). The fixed effect β and residual variance σ2 were estimated by where . The random effect γ of each marker and its prediction error var(γ) were predicted by best linear unbiased prediction: where , ω=τ=0, and m is the number of genotypes at locus k. The method requires the inverse of matrix V. If the sample size is large, that is, n>p, the binomial inverse theorem (Henderson and Searle, 1980) can be used: where Based on our experience, empirical Bayes is feasible when the number of variables is less than 40 times the sample size. However, this condition is not frequently met in GWAS. If the LARS algorithm is used to select the variables that are most potentially associated with the trait under polygenic background control, the effects of the selected markers can be estimated by empirical Bayes.
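To illustrate why the binomial inverse theorem helps when n>p, the sketch below inverts a covariance matrix through a p × p system instead of an n × n one. It assumes the usual random-effects form V = Z diag(σ_k²) Z^T + σ²I; that form, the function name and all numerical values are assumptions made for this example, since the source formula is not recoverable here.

```python
import numpy as np

def v_inverse_binomial(Z, sigma2_k, sigma2):
    """Invert V = Z diag(sigma2_k) Z^T + sigma2 * I by solving a p x p system.

    Uses V^-1 = (I - Z (sigma2 * D^-1 + Z^T Z)^-1 Z^T) / sigma2,
    which requires every entry of sigma2_k to be strictly positive.
    """
    n = Z.shape[0]
    inner = sigma2 * np.diag(1.0 / sigma2_k) + Z.T @ Z   # p x p matrix
    return (np.eye(n) - Z @ np.linalg.solve(inner, Z.T)) / sigma2

# Toy check with n = 40 individuals and p = 5 selected markers (hypothetical values).
rng = np.random.default_rng(1)
n, p = 40, 5
Z = rng.choice([-1.0, 1.0], size=(n, p))
sigma2_k = np.array([0.5, 0.1, 0.8, 0.3, 0.2])
sigma2 = 2.0

V = Z @ np.diag(sigma2_k) @ Z.T + sigma2 * np.eye(n)
V_inv = v_inverse_binomial(Z, sigma2_k, sigma2)
max_err = np.max(np.abs(V_inv @ V - np.eye(n)))
print(max_err < 1e-8)
```

The saving grows with the ratio n/p, which is why selecting a small set of markers by LARS first makes the empirical Bayes step cheap.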

Likelihood ratio (LR) test

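Based on the estimate of marker effect γ in the reduced model, markers whose estimated effects are effectively zero are considered not to be associated with the trait; however, the association of the chosen markers with the trait and the effects $\theta=\{\gamma_{(1)},\cdots,\gamma_{(q)}\}$ need to be tested, where q is the number of SNPs in the reduced model. To test the null hypothesis $H_0:\gamma_{(k)}=0$, that is, no QTL linked to the marker, we conducted an LR test by

$$LR_k = -2\left[L(\theta_{-k}) - L(\theta)\right]$$

where $\theta_{-k}=\{\gamma_{(1)},\cdots,\gamma_{(k-1)},\gamma_{(k+1)},\cdots,\gamma_{(q)}\}$, $L(\theta)=\sum_{i=1}^{n}\ln\varphi(y_i;X_i\beta+Z_i\gamma,\sigma^2)$ is a log-likelihood function, $\varphi(y;X\beta+Z\gamma,\sigma^2)$ is a normal density function with mean $X\beta+Z\gamma$ and variance $\sigma^2$, and LOD=LR/4.605. The critical value for significance was set at LOD=2.0 (Bu ).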
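The constant 4.605 in the LOD conversion is 2 ln 10. As a small worked example, the sketch below converts between LR and LOD and, under the assumption that the LR statistic is referred to a chi-square distribution with 1 degree of freedom (not stated explicitly in the text, but consistent with the LOD and P-value pairs in Table 2), gives the P-value implied by a LOD score. Function names are illustrative.

```python
import math

LR_PER_LOD = 2.0 * math.log(10.0)      # about 4.605

def lod_from_lr(lr):
    """Convert a likelihood ratio statistic to a LOD score."""
    return lr / LR_PER_LOD

def p_value_from_lod(lod):
    """P-value for a LOD score, assuming LR ~ chi-square with 1 degree of freedom."""
    lr = lod * LR_PER_LOD
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lr / 2.0))

threshold_p = p_value_from_lod(2.0)    # P-value at the LOD = 2.0 critical value
table2_p = p_value_from_lod(2.18)      # LOD of the first LD entry in Table 2
print(threshold_p, table2_p)
```

Under this assumption the LOD=2.0 criterion corresponds to a per-test P-value of roughly 0.0024, and a LOD of 2.18 reproduces the reported P-value of about 1.52E−03 up to rounding.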

AIC and BIC for testing goodness of fit of models

The goodness of fit for a statistical model can be measured by

$$\mathrm{AIC} = 2k - 2\ln L, \qquad \mathrm{BIC} = k\ln n - 2\ln L$$

where L is the likelihood function value, k is the number of independent variables and n is the sample size. A smaller Akaike information criterion (AIC) or Bayesian information criterion (BIC) value indicates a better fit. pLARmEB has been implemented in R and its software can be downloaded from https://cran.r-project.org/web/packages/mrMLM/index.html.
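As a minimal sketch of these two criteria, the functions below compute AIC and BIC from a log-likelihood, together with a helper giving the maximized Gaussian log-likelihood of a regression from its residuals. The helper and the example numbers are assumptions for illustration; the paper does not state how k was counted or which likelihood form was used.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * k - 2.0 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2.0 * log_lik

def gaussian_log_lik(residuals):
    """Maximized Gaussian log-likelihood of a linear regression given its residuals."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n    # ML estimate of the error variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

# Hypothetical fit: 5 SNPs regressed on a trait measured in 199 lines.
example_aic = aic(-100.0, k=5)
example_bic = bic(-100.0, k=5, n=199)
print(example_aic, example_bic)
```

Because ln n exceeds 2 for any sample larger than about 7, BIC penalizes each additional SNP more heavily than AIC in data sets of the size analyzed here.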

Data sets for analyses

One Arabidopsis data set and four Monte Carlo simulated data sets were used to validate pLARmEB. Each data set contained phenotypic observations for quantitative traits and genotypic values for molecular markers.

The Arabidopsis data set

The data set downloaded from http://www.arabidopsis.org/ includes 199 diverse inbred lines each with 216 130 SNPs and 107 traits (Atwell ). Among these traits, seven are related to flowering time, including days to flowering under long days, days to flowering under long days with vernalization, days to flowering under short days, days to flowering under short days with vernalization, days to flowering at 10 °C, days to flowering at 16 °C and days to flowering at 22 °C. We analyzed these traits using pLARmEB, EMMA, multilocus random-SNP-effect mixed linear model (mrMLM) and fast multilocus random-SNP-effect EMMA (FASTmrEMMA) methods. The population structure Q matrix and kinship coefficient matrix K between all the pairs of lines were used to control population structure and polygenic background. We also deleted the SNPs with minor allele frequency <10%. When all the markers on one chromosome were in one genetic model, the markers on other chromosomes were used to calculate K matrix as polygenic background control (Rincent ; Yang ; Wei and Xu, 2016). Here 50 SNPs most potentially associated with the trait are selected to construct the reduced model. This number may vary across different data sets.
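The marker filtering and the chromosome-wise kinship calculation described above can be sketched as follows. The paper does not give its kinship formula in this section, so the centered cross-product used here is an assumption, as are the 0/1 genotype coding, the function names and the toy data.

```python
import numpy as np

def maf_filter(G, min_maf=0.10):
    """Boolean mask of SNPs to keep; G is lines x SNPs with alleles coded 0/1."""
    freq = G.mean(axis=0)
    maf = np.minimum(freq, 1.0 - freq)
    return maf >= min_maf

def kinship_excluding_chromosome(G, chrom, target_chr):
    """Kinship from all SNPs that are NOT on the chromosome being scanned."""
    use = chrom != target_chr
    X = G[:, use] - G[:, use].mean(axis=0)        # center each marker
    return X @ X.T / use.sum()                    # assumed simple cross-product kinship

# Toy data: 20 inbred lines, 12 SNPs on 3 chromosomes (hypothetical).
rng = np.random.default_rng(3)
G = rng.integers(0, 2, size=(20, 12)).astype(float)
G[:, 0] = 0.0
G[0, 0] = 1.0                                     # SNP 0 has allele frequency 0.05
chrom = np.repeat([1, 2, 3], 4)

keep = maf_filter(G)                              # drops SNPs with MAF < 10%
K_for_chr1 = kinship_excluding_chromosome(G[:, keep], chrom[keep], target_chr=1)
print(keep[0], K_for_chr1.shape)
```

Excluding the scanned chromosome from K keeps the candidate markers themselves out of the polygenic background term, so their effects are not absorbed by it.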

Data sets from Monte Carlo simulation in natural population

Three Monte Carlo simulation experiments were conducted to validate pLARmEB. The three data sets are the same as those in Wang . In the first experiment, all the SNP genotypes were derived from 216 130 SNPs reported by Atwell and 2000 SNPs were randomly sampled from each chromosome (Chr.). The positions of these SNPs in the genome were between 11 226 256 and 12 038 776 bp on Chr. 1, between 5 045 828 and 6 412 875 bp on Chr. 2, between 1 916 588 and 3 196 442 bp on Chr. 3, between 2 232 796 and 3 143 893 bp on Chr. 4 and between 19 999 868 and 21 039 406 bp on Chr. 5 (Wang ). The sample size was 199, and this was the number of lines in Atwell . Six QTNs were simulated and placed on the SNPs with minor allele frequency of 0.30. The heritabilities of the QTNs were set as 0.10, 0.05, 0.05, 0.15, 0.05 and 0.05, respectively; their positions and effects are listed in Supplementary Table S1. The total average was set at 10.0 and residual variance was set at 10.0. For each simulated QTN, we counted the number of samples in which the LOD (logarithm (base 10) of odds) exceeded 2.0 (Bu ). A detected QTN within 2 kb of the simulated QTN was considered a true QTN. The ratio of the number of such samples to the total number of replicates (1000) represented the empirical power of this QTN. False positive rate (FPR) was calculated as the ratio of the number of false positive effects to the total number of zero effects considered in the full model. To measure the bias of gene effect estimate, mean squared error (MSE) was calculated, $\mathrm{MSE}_k=\frac{1}{1000}\sum_{i=1}^{1000}(\hat\gamma_{k(i)}-\gamma_k)^2$, where $\hat\gamma_{k(i)}$ is the estimate of effect $\gamma_k$ in the ith sample. We investigated the effect of polygenic background on pLARmEB in the second experiment by adding polygenic effects from a multivariate normal distribution $\mathrm{MVN}(0,\sigma_g^2K)$, where $\sigma_g^2$ is polygenic variance and K is a pairwise kinship coefficient matrix among individuals. Here , so . 
The QTN size (h2), total average, residual variance and other parameter values were the same as those in the first experiment, and all the parameters are listed in Supplementary Table S2. In the third experiment, we investigated the effect of epistatic background on pLARmEB. Three epistatic QTNs were added. The related parameters for the simulated three epistatic QTNs have been described in Wang . The QTN sizes (h2), total average, residual variance and other parameter values were also the same as those in the first experiment (Supplementary Table S3).
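The evaluation criteria used in these experiments (empirical power, FPR and MSE) can be sketched in a few lines. The data layout below (one list of detections per replicate, each with a position, LOD and effect estimate) is hypothetical, as are the function names and toy numbers; the sketch also assumes that MSE is averaged over the replicates in which the QTN was detected and that the FPR denominator is the number of zero effects times the number of replicates.

```python
def summarize_qtn(replicates, qtn_pos, qtn_effect, window_bp=2000, lod_cut=2.0):
    """Empirical power and MSE for one simulated QTN."""
    hits = []
    for detections in replicates:
        near = [d for d in detections
                if abs(d["pos"] - qtn_pos) <= window_bp and d["lod"] > lod_cut]
        if near:
            best = max(near, key=lambda d: d["lod"])   # strongest signal in the window
            hits.append(best["effect"])
    power = len(hits) / len(replicates)
    mse = (sum((e - qtn_effect) ** 2 for e in hits) / len(hits)) if hits else float("nan")
    return power, mse

def false_positive_rate(replicates, qtn_positions, n_null_effects,
                        window_bp=2000, lod_cut=2.0):
    """Share of zero-effect markers declared significant, pooled over replicates."""
    false_hits = 0
    for detections in replicates:
        for d in detections:
            far = all(abs(d["pos"] - q) > window_bp for q in qtn_positions)
            if d["lod"] > lod_cut and far:
                false_hits += 1
    return false_hits / (n_null_effects * len(replicates))

# Four toy replicates for one QTN at 1 000 000 bp with true effect 1.0.
replicates = [
    [{"pos": 1_000_500, "lod": 5.1, "effect": 0.9},
     {"pos": 4_000_000, "lod": 2.4, "effect": 0.2}],
    [{"pos": 1_003_500, "lod": 6.0, "effect": 1.0}],   # 3.5 kb away: counted as false
    [{"pos": 999_000, "lod": 3.0, "effect": 1.1}],
    [{"pos": 1_000_000, "lod": 1.5, "effect": 0.5}],   # below the LOD criterion
]
power, mse = summarize_qtn(replicates, qtn_pos=1_000_000, qtn_effect=1.0)
fpr = false_positive_rate(replicates, [1_000_000], n_null_effects=9_999)
print(power, mse, fpr)
```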

Monte Carlo simulation experiments in backcross

To test whether pLARmEB can be used in biparental population, we conducted another simulation experiment. In this experiment, 200 individuals each with 10 001 evenly spaced markers on the entire genome of 100 000 cM length were simulated in backcross population. Eight main-effect QTLs were simulated and placed at marker positions. The sizes and locations of these QTLs are listed in Supplementary Table S4. The population mean (b0) and residual error variance (σ2) were set at 10 and 10, respectively. The number of replicates was set at 200.

Results

Monte Carlo simulation studies

Statistical power for QTN detection

To validate pLARmEB, three simulation experiments were conducted. In the first experiment, each simulated sample was analyzed by pLARmEB, least angle regression plus empirical Bayes (LARmEB), EMMA, FASTmrEMMA, mrMLM and Bayesian hierarchical generalized linear model (BhGLM). Among the 1000 samples, the first 100 were further analyzed using the BhGLM method. As shown in Supplementary Table S1 and Figure 1a, the average power for the above 6 methods was 77.1, 68.9, 46.0, 70.7, 68.6 and 54.5%, respectively. pLARmEB, in which polygenic background was controlled, had the highest average power among the six methods (Figure 1a). To further confirm the effectiveness of pLARmEB, polygenic effect simulated from multivariate normal distribution (r2=9.2%) was added to each phenotype in the second experiment and three epistatic QTNs (r2=15%) were added in the third simulation experiment. The average powers based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 78.3, 69.6, 42.5, 75.0, 67.6 and 60.7%, respectively, in the second experiment (Supplementary Table S2); and 74.4, 57.5, 39.1, 59.2, 58.9 and 56.3%, respectively, in the third experiment (Supplementary Table S3). In all three experiments, pLARmEB, which includes polygenic background control, had the highest average power.

Figure 1

Average powers in the detection of QTNs (a) and average of mean squared errors in the estimation of QTN effects (b) across six simulated QTNs using pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM.

Accuracies of estimated QTN effects

MSE measures the accuracy of estimated QTN effects, and a low MSE indicates high accuracy of parameter estimation. As shown in Figure 1b and Supplementary Tables S1–S3, the average MSEs based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 0.0895, 0.1005, 0.5432, 0.2885, 0.0940 and 0.2577, respectively, in the first experiment (Figure 1b and Supplementary Table S1); 0.0917, 0.0997, 0.5680, 0.3227, 0.0852 and 1.3139, respectively, in the second experiment (Supplementary Table S2); and 0.0973, 0.1240, 0.5973, 0.3450, 0.1024 and 0.3934, respectively, in the third experiment (Supplementary Table S3). pLARmEB had the lowest average MSE among the six methods in the first and third experiments and the second lowest, after mrMLM, in the second experiment.

FPR and ROC curve

High FPR is a major concern in GWAS. To overcome this issue, a very stringent significance level is frequently adopted in genome-wide single-marker scans. In our multilocus method, a less stringent significance level (LOD=2.0) was recommended. We wanted to know whether this criterion produces a high FPR. All the FPR results in the three simulation experiments are listed in Supplementary Tables S1–S3. The FPRs based on pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM were 0.0009, 0.0127, 0.0325, 0.0084, 0.0168 and 0.0115 (%), respectively, in the first experiment (Supplementary Table S1); 0.0025, 0.0010, 0.0166, 0.0081, 0.0210 and 0.0093%, respectively, in the second experiment (Supplementary Table S2); and 0.0089, 0.0031, 0.0253, 0.0148, 0.0265 and 0.0120%, respectively, in the third experiment (Supplementary Table S3). These results indicate that pLARmEB had a low FPR. To compare the efficiencies of the various approaches in detecting significant QTNs, receiver operating characteristic (ROC) curves were plotted. An ROC curve is a plot of average power against FPR. We calculated the corresponding average powers for the 41 thresholds between $10^{-6}$ and $10^{-2}$ in the first simulation experiment, and compared the ROC curves among the above 6 methods. At significance levels from 0.01 to 0.001, pLARmEB had the highest power to detect QTNs among the six methods (Figure 2).
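A sketch of how such ROC points could be tabulated is given below. The text specifies 41 thresholds between 10^-6 and 10^-2 but not their spacing; equal spacing on the log10 scale (steps of 0.1) is assumed here because it yields exactly 41 values. The P-values and function name are hypothetical toy inputs.

```python
def roc_points(p_true, p_null, thresholds):
    """Return (threshold, FPR, power) triples for a list of significance thresholds."""
    points = []
    for t in thresholds:
        power = sum(p <= t for p in p_true) / len(p_true)
        fpr = sum(p <= t for p in p_null) / len(p_null)
        points.append((t, fpr, power))
    return points

# 41 thresholds from 1e-6 to 1e-2, equally spaced on the log10 scale (assumption).
thresholds = [10 ** (-6 + i / 10) for i in range(41)]

# Hypothetical P-values for simulated QTNs and for zero-effect markers.
p_true = [3e-7, 5e-5, 2e-4, 8e-3, 0.2, 4e-6]
p_null = [0.5, 0.03, 0.9, 4e-3, 0.6, 0.75, 0.11, 0.33, 0.02, 0.48]

points = roc_points(p_true, p_null, thresholds)
print(points[0], points[-1])
```

Plotting power against FPR on a log10 axis for each method then gives curves of the kind shown in Figure 2.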

Figure 2

Statistical powers of six simulated QTNs in the first simulation experiment plotted against false positive rate (in a log10 scale) for pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM.

Computational efficiency

We scanned and identified SNPs that were associated with the trait on each chromosome using LARS. We then included all the potentially associated SNPs across the genome into one genetic model and estimated their effects by empirical Bayes (Xu, 2010). For the first simulation experiment, the above procedures took 4.20, 6.82, 68.77, 8.32, 13.29 and >100 h (Intel Core i5-4570 CPU 3.20 GHz, Memory 7.88G, Nanjing, China) for pLARmEB, LARmEB, EMMA, FASTmrEMMA, mrMLM and BhGLM, respectively. pLARmEB took the least computing time among the six approaches. A similar trend was found in real data analyses (Supplementary Table S5).

Analysis of the Arabidopsis data set

To test the performance of pLARmEB, a data set containing 7 Arabidopsis flowering time traits along with 216 130 SNPs in Atwell was reanalyzed by pLARmEB, EMMA, FASTmrEMMA and mrMLM. All the significantly associated SNPs were used to fit a regression for each trait, and model fit was assessed by AIC and BIC values. The AIC values for all seven traits based on pLARmEB were much lower than those based on EMMA, FASTmrEMMA and mrMLM (Table 1). In addition, FASTmrEMMA and mrMLM generally fitted better than EMMA, the exception being FASTmrEMMA for SDV, and a similar result was observed for the BIC values. These findings suggest that pLARmEB gives a better model fit than EMMA, FASTmrEMMA and mrMLM.

Table 1

AIC and BIC values for the regression of significantly associated SNPs on each Arabidopsis flowering time trait using pLARmEB, EMMA, FASTmrEMMA and mrMLM

Trait	BIC				AIC
	pLARmEB	EMMA	FASTmrEMMA	mrMLM	pLARmEB	EMMA	FASTmrEMMA	mrMLM
LD	63.53	289.74	263.56	260.60	−26.90	286.62	201.20	195.12
LDV	−306.01	−104.50	−157.79	−142.31	−380.99	−113.87	−198.40	−176.67
SD	−118.34	118.17	48.55	31.26	−251.10	115.08	2.24	−42.84
SDV	−155.98	90.55	124.10	−96.31	−269.53	75.20	78.07	−148.49
FT10	−390.40	28.18	−99.08	−216.17	−514.58	24.92	−164.44	−281.52
FT16	−6.09	222.04	189.81	192.32	−84.40	218.78	144.13	127.06
FT22	182.71	332.36	283.04	235.13	120.72	329.10	230.84	160.09

Abbreviations: AIC, Akaike information criterion; BIC, Bayesian information criterion; EMMA, efficient mixed model association; FASTmrEMMA, fast multi-locus random-SNP-effect EMMA; FT10, FT16 and FT22, days to flowering at 10, 16 and 22 °C, respectively; LD, days to flowering under long days; LDV, days to flowering under long days with vernalization; mrMLM, multilocus random-SNP-effect mixed linear model; pLARmEB, polygenic-background-control-based least angle regression plus empirical Bayes; SD, days to flowering under short days; SDV, days to flowering under short days with vernalization; SNP, single-nucleotide polymorphism.

Within 20 kb of each SNP significantly associated with the traits, we mined candidate genes for these traits. Among the genes identified in previous studies, pLARmEB, FASTmrEMMA and mrMLM identified more previously reported genes than EMMA (Supplementary Table S6). For example, pLARmEB, FASTmrEMMA and mrMLM identified more than three genes for long days with vernalization, whereas EMMA detected only one gene (AT5G45890). A similar trend was also observed for other traits (Supplementary Table S6). Among these previously reported genes, 48 were identified only by pLARmEB (Table 2). Interestingly, genes AT2G19690 and AT2G19760 identified by pLARmEB were associated simultaneously with long days with vernalization (LDV) and short days with vernalization (SDV), and three genes (AT2G07020, AT2G07040 and AT2G07050) adjacent to the SNP at 2 910 430 bp of chromosome 2 were found to be associated with short days (SD).
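The candidate-gene search described above amounts to an interval lookup around each significant SNP. A minimal sketch follows, assuming that "within 20 kb" means the gene interval overlaps a window of 20 kb on either side of the SNP; the gene names and coordinates are invented placeholders, not Arabidopsis annotations.

```python
def genes_near_snp(snp_chr, snp_pos, genes, window_bp=20_000):
    """Names of genes whose interval overlaps [snp_pos - window, snp_pos + window]."""
    hits = []
    for name, chrom, start, end in genes:
        if chrom != snp_chr:
            continue
        if start <= snp_pos + window_bp and end >= snp_pos - window_bp:
            hits.append(name)
    return hits

# Placeholder annotation: (name, chromosome, start bp, end bp); all values hypothetical.
genes = [
    ("geneA", 2, 2_895_000, 2_897_500),
    ("geneB", 2, 2_925_000, 2_928_000),
    ("geneC", 2, 2_940_000, 2_942_000),   # more than 20 kb from the SNP
    ("geneD", 3, 2_910_000, 2_912_000),   # different chromosome
]
candidates = genes_near_snp(2, 2_910_430, genes)
print(candidates)
```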

Table 2

The previously reported genes for seven flowering time traits in Arabidopsis that were detected only by pLARmEB

Trait^a	Gene	Chr.	SNP (bp)	P-value	Effect	LOD	r² (%)	Trait^a	Gene	Chr.	SNP (bp)	P-value	Effect	LOD	r² (%)
LD	AT3G56960	3	21 079 518	1.52E−03	−0.040	2.18	0.38	SDV	AT2G19690 AT2G19760	2	8 516 520	6.65E−05	0.030	3.45	0.43
	AT5G11320	5	3 594 757	7.54E−05	−0.028	3.40	0.23		AT2G32700	2	13 853 405	4.60E−08	−0.066	6.49	3.29
	AT5G64510	5	25 783 160	7.64E−05	0.021	3.40	0.12		AT4G12920	4	7 586 463	3.19E−10	−0.071	8.59	2.07
LDV	AT1G68050 AT1G68090 AT1G68130	1	25 525 403	8.71E−08	0.039	6.22	3.40		AT5G01600	5	239 433	8.39E−07	0.036	5.27	0.90
	AT2G19690 AT2G19760	2	8 516 520	7.37E−09	0.048	7.26	3.29		AT5G16780	5	5 526 925	4.30E−04	0.024	2.69	0.53
	AT3G07050	3	2 215 112	5.69E−06	0.029	4.47	1.99		AT5G45890	5	18 607 728	1.23E−03	−0.014	2.27	0.13
SD	AT1G01510 AT1G01530	1	192 020	1.88E−06	0.029	4.93	0.52	FT10	AT1G61290	1	22 619 960	9.12E−06	0.012	4.28	0.56
	AT1G68090 AT1G68130	1	25 532 914	7.90E−13	0.036	11.14	1.18		AT2G01200	2	134 343	1.03E−05	−0.013	4.22	0.71
	AT2G07020 AT2G07040 AT2G07050	2	2 910 430	3.63E−10	−0.036	8.53	1.71		AT2G03500	2	1 076 833	1.30E−05	0.006	4.13	0.15
	AT2G22540	2	9 588 685	1.00E−16	−0.072	16.74	5.41		AT2G18790	2	8 124 967	1.98E−04	0.019	3.01	0.81
	AT2G27990	2	11 931 686	4.95E−14	0.041	12.32	2.25		AT3G47870	3	17 653 089	9.62E−08	−0.015	6.18	0.46
	AT3G01780	3	286 197	1.29E−04	−0.017	3.18	0.32		AT4G01220	4	518 797	1.47E−07	−0.024	6.00	2.35
	AT3G28780	3	10 816 150	2.21E−08	−0.049	6.80	1.50		AT4G33240	4	16 017 869	6.61E−05	−0.006	3.46	0.08
	AT3G55200 AT3G55220	3	20 477 225	1.49E−03	0.011	2.19	0.16	FT16	AT2G03060	2	882 256	1.48E−03	−0.022	2.19	0.29
	AT4G00650	4	268 809	2.95E−06	0.013	4.74	0.14		AT3G56960	3	21 079 518	3.24E−04	−0.043	2.81	1.01
	AT4G03090 AT4G03110	4	1 371 766	1.32E−06	0.023	5.08	0.66		AT4G01220	4	500 090	5.46E−10	−0.090	8.36	3.17
	AT5G45890	5	18 611 542	1.00E−07	0.015	6.16	0.22	FT22	AT1G52740	1	19 629 918	3.00E−06	−0.087	4.74	2.24
	AT5G59570	5	24 008 772	8.91E−06	0.010	4.28	0.07		AT1G71270	1	26 869 825	8.33E−07	0.118	5.27	1.90
	AT5G63160	5	25 347 883	2.62E−04	0.002	2.89	0.002		AT4G34040	4	16 310 486	4.26E−04	0.064	2.70	0.76

Abbreviations: Chr., chromosome; LOD, logarithm (base 10) of odds; pLARmEB, polygenic-background-control-based least angle regression plus empirical Bayes; SNP, single-nucleotide polymorphism.

^a Trait abbreviations are the same as those in Table 1.

Discussion

Analysis of one random sample in the first Monte Carlo simulation experiment using LARS, empirical Bayes and pLARmEB showed that LARS identified many QTNs with small effects in addition to all the simulated QTNs, and thus its FPR was high (Figure 3a). The empirical Bayes was also able to identify simulated and small-effect QTNs although FPR was decreased (Figure 3b), and pLARmEB detected almost all the simulated QTNs and the effects of nonsimulated QTNs were almost close to zero (Figure 3c). More importantly, 48 previously reported genes in Arabidopsis were identified only by pLARmEB. Therefore, pLARmEB is a good alternative method for multilocus GWAS.

Figure 3

Comparison of least angle regression (a), empirical Bayes (b) and pLARmEB (c) in the estimation of QTN effects in one random sample of the first simulation experiment.

Although pLARmEB was proposed for GWAS, it is also appropriate for mapping populations of backcross, doubled haploid and recombinant inbred lines. To illustrate the effectiveness of pLARmEB, pseudo-markers in every d cM were created genome-wide, and the fourth Monte Carlo simulation experiment with 200 simulated data sets was conducted and analyzed using pLARmEB and empirical Bayes. Higher power for QTL detection and less bias in the QTL-effect estimates were observed for pLARmEB than for empirical Bayes (Supplementary Table S4). pLARmEB is also suitable for a population consisting of chromosome segment substitution lines. However, we can only scan marker positions, because we cannot calculate conditional probabilities of pseudo-marker positions. If the number of genotypes in a mapping population is more than two, for example, AA, Aa and aa in F2, the current method requires some modifications. Among the previously identified genes in Arabidopsis (Supplementary Table S6), only a few were found in common by several approaches, and this is different from linkage analysis. The main reason is that a GWAS mapping population has a complicated population structure. Although pLARmEB, FASTmrEMMA and mrMLM had similar powers of QTN detection in the simulation experiments, different previously reported genes were detected in real data analysis. For example, 48 previously reported genes were identified only by pLARmEB (Table 2). For this reason, we recommend pLARmEB as an alternative method for GWAS and also recommend the joint use of several methods in the GWAS analysis of one trait. The AIC or BIC values of FASTmrEMMA in Wen and mrMLM in Wang are different from the corresponding values in this study. In this study, we considered population structure in GWAS. With the inclusion of population structure in the genetic model, some different SNPs are found to be significantly associated with the trait. 
The above two differences result in different AIC or BIC values for the same trait in different studies. Multilocus GWAS has become the state-of-the-art GWAS procedure. Iwata , 2009) developed multilocus Bayesian GWAS approaches for quantitative and ordinal traits, although running time is a major concern. Segura proposed a multilocus linear mixed model method that is a simple, stepwise mixed-model regression with forward inclusion and backward elimination. Wang suggested mrMLM and Wen proposed FASTmrEMMA. To make assumptions more suitable to a given data set, Zhou and Moser proposed a hybrid method of mixed linear model and sparse regression model, named Bayesian sparse linear mixed model. In this study, the integration of LARS with empirical Bayes under polygenic background control provides one simple and efficient way to perform multilocus GWAS. In the Arabidopsis real data analysis, the number of SNPs was >1000 times larger than the sample size, yet we were able to scan each chromosome by LARS, include all the associated SNPs across the genome in the multilocus model and estimate their effects by empirical Bayes; thus, pLARmEB is better than EMMA. To obtain a low FPR in GWAS, a relatively stringent significance criterion, such as Bonferroni correction, is widely adopted. Even with a less stringent significance criterion (such as LOD=2.0), pLARmEB has a lower FPR and higher power than EMMA. We also ran GEMMA (Zhou and Stephens, 2012) and its power was the same as that of EMMA (results not shown). pLARmEB works better than all the other methods considered.

Data archiving

All simulated data sets are available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.sk652. The real data set can be retrieved from: http://www.arabidopsis.org/.

References

1. Bayesian shrinkage estimation of quantitative trait loci parameters.

Authors: Hui Wang; Yuan-Ming Zhang; Xinmin Li; Godfred L Masinde; Subburaman Mohan; David J Baylink; Shizhong Xu
Journal: Genetics Date: 2005-03-21 Impact factor: 4.562

2. Genome-wide association analysis by lasso penalized logistic regression.

Authors: Tong Tong Wu; Yi Fang Chen; Trevor Hastie; Eric Sobel; Kenneth Lange
Journal: Bioinformatics Date: 2009-01-28 Impact factor: 6.937

3. An expectation-maximization algorithm for the Lasso estimation of quantitative trait locus effects.

Authors: S Xu
Journal: Heredity (Edinb) Date: 2010-01-06 Impact factor: 3.821

4. Bayesian mixture models for the incorporation of prior knowledge to inform genetic association studies.

Authors: Brooke L Fridley; Daniel Serie; Gregory Jenkins; Kristin White; William Bamlet; John D Potter; Ellen L Goode
Journal: Genet Epidemiol Date: 2010-07 Impact factor: 2.135

5. Bayesian association mapping of multiple quantitative trait loci and its application to the analysis of genetic variation among Oryza sativa L. germplasms.

Authors: Hiroyoshi Iwata; Yusaku Uga; Yosuke Yoshioka; Kaworu Ebana; Takeshi Hayashi
Journal: Theor Appl Genet Date: 2007-03-14 Impact factor: 5.699

6. Advantages and pitfalls in the application of mixed-model association methods.

Authors: Jian Yang; Noah A Zaitlen; Michael E Goddard; Peter M Visscher; Alkes L Price
Journal: Nat Genet Date: 2014-02 Impact factor: 38.330

7. PUMA: a unified framework for penalized multiple regression analysis of GWAS data.

Authors: Gabriel E Hoffman; Benjamin A Logsdon; Jason G Mezey
Journal: PLoS Comput Biol Date: 2013-06-27 Impact factor: 4.475

8. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations.

Authors: Vincent Segura; Bjarni J Vilhjálmsson; Alexander Platt; Arthur Korte; Ümit Seren; Quan Long; Magnus Nordborg
Journal: Nat Genet Date: 2012-06-17 Impact factor: 38.330

9. Epistatic association mapping in homozygous crop cultivars.

Authors: Hai-Yan Lü; Xiao-Fen Liu; Shi-Ping Wei; Yuan-Ming Zhang
Journal: PLoS One Date: 2011-03-15 Impact factor: 3.240

10. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines.

Authors: Susanna Atwell; Yu S Huang; Bjarni J Vilhjálmsson; Glenda Willems; Matthew Horton; Yan Li; Dazhe Meng; Alexander Platt; Aaron M Tarone; Tina T Hu; Rong Jiang; N Wayan Muliyati; Xu Zhang; Muhammad Ali Amer; Ivan Baxter; Benjamin Brachi; Joanne Chory; Caroline Dean; Marilyne Debieu; Juliette de Meaux; Joseph R Ecker; Nathalie Faure; Joel M Kniskern; Jonathan D G Jones; Todd Michael; Adnane Nemri; Fabrice Roux; David E Salt; Chunlao Tang; Marco Todesco; M Brian Traw; Detlef Weigel; Paul Marjoram; Justin O Borevitz; Joy Bergelson; Magnus Nordborg
Journal: Nature Date: 2010-03-24 Impact factor: 49.962

62 in total

1. Fine mapping QTL and mining genes for protein content in soybean by the combination of linkage and association analysis.

Authors: Xiyu Li; Ping Wang; Kaixin Zhang; Shulin Liu; Zhongying Qi; Yanlong Fang; Yue Wang; Xiaocui Tian; Jie Song; Jiajing Wang; Chang Yang; Xu Sun; Zhixi Tian; Wen-Xia Li; Hailong Ning
Journal: Theor Appl Genet Date: 2021-01-09 Impact factor: 5.699

2. Combination of multi-locus genome-wide association study and QTL mapping reveals genetic basis of tassel architecture in maize.

Authors: Yanli Wang; Jie Chen; Zhongrong Guan; Xiaoxiang Zhang; Yinchao Zhang; Langlang Ma; Yiming Yao; Huanwei Peng; Qian Zhang; Biao Zhang; Peng Liu; Chaoying Zou; Yaou Shen; Fei Ge; Guangtang Pan
Journal: Mol Genet Genomics Date: 2019-07-09 Impact factor: 3.291

3. Genetic dissection of flowering time in flax (Linum usitatissimum L.) through single- and multi-locus genome-wide association studies.

Authors: Braulio J Soto-Cerda; Gabriela Aravena; Sylvie Cloutier
Journal: Mol Genet Genomics Date: 2021-04-26 Impact factor: 3.291

4. GWAS Case Studies in Wheat.

Authors: Deepmala Sehgal; Susanne Dreisigacker
Journal: Methods Mol Biol Date: 2022

5. Identification and validation of a novel locus, Qpm-3BL, for adult plant resistance to powdery mildew in wheat using multilocus GWAS.

Authors: Xijun Du; Weigang Xu; Chaojun Peng; Chunxin Li; Yu Zhang; Lin Hu
Journal: BMC Plant Biol Date: 2021-07-30 Impact factor: 4.215

6. Genome-wide association study of pre-harvest sprouting tolerance using a 90K SNP array in common wheat (Triticum aestivum L.).

Authors: Yulei Zhu; Shengxing Wang; Wenxin Wei; Hongyong Xie; Kai Liu; Can Zhang; Zengyun Wu; Hao Jiang; Jiajia Cao; Liangxia Zhao; Jie Lu; Haiping Zhang; Cheng Chang; Xianchun Xia; Shihe Xiao; Chuanxi Ma
Journal: Theor Appl Genet Date: 2019-07-19 Impact factor: 5.699

7. Multi-Locus Genome-Wide Association Studies Reveal Fruit Quality Hotspots in Peach Genome.

Authors: Cassia da Silva Linge; Lichun Cai; Wanfang Fu; John Clark; Margaret Worthington; Zena Rawandoozi; David H Byrne; Ksenija Gasic
Journal: Front Plant Sci Date: 2021-02-25 Impact factor: 5.753

8. Loci harboring genes with important role in drought and related abiotic stress responses in flax revealed by multiple GWAS models.

Authors: Demissew Sertse; Frank M You; Sridhar Ravichandran; Braulio J Soto-Cerda; Scott Duguid; Sylvie Cloutier
Journal: Theor Appl Genet Date: 2020-10-12 Impact factor: 5.699

9. Genome-wide association study and its applications in the non-model crop Sesamum indicum. (Review)

Authors: Muez Berhe; Komivi Dossa; Jun You; Pape Adama Mboup; Idrissa Navel Diallo; Diaga Diouf; Xiurong Zhang; Linhai Wang
Journal: BMC Plant Biol Date: 2021-06-22 Impact factor: 4.215

10. Genetic Dissection of Seedling Root System Architectural Traits in a Diverse Panel of Hexaploid Wheat through Multi-Locus Genome-Wide Association Mapping for Improving Drought Tolerance.

Authors: Thippeswamy Danakumara; Jyoti Kumari; Amit Kumar Singh; Subodh Kumar Sinha; Anjan Kumar Pradhan; Shivani Sharma; Shailendra Kumar Jha; Ruchi Bansal; Sundeep Kumar; Girish Kumar Jha; Mahesh C Yadav; P V Vara Prasad
Journal: Int J Mol Sci Date: 2021-07-02 Impact factor: 5.923