
Methodological implementation of mixed linear models in multi-locus genome-wide association studies.

Yang-Jun Wen¹, Hanwen Zhang², Yuan-Li Ni¹, Bo Huang¹, Jin Zhang¹, Jian-Ying Feng¹, Shi-Bo Wang³, Jim M Dunwell⁴, Yuan-Ming Zhang^1,3, Rongling Wu^5,6.

Abstract

The mixed linear model has been widely used in genome-wide association studies (GWAS), but its application to multi-locus GWAS analysis has not been explored and assessed. Here, we implemented a fast multi-locus random-SNP-effect EMMA (FASTmrEMMA) model for GWAS. The model is built on random single nucleotide polymorphism (SNP) effects and a new algorithm. This algorithm whitens the covariance matrix of the polygenic matrix K and environmental noise, and specifies the number of nonzero eigenvalues as one. The model first selects all putative quantitative trait nucleotides (QTNs) with P-values ≤ 0.005 and then includes them in a multi-locus model for true QTN detection. Owing to the multi-locus feature, the Bonferroni correction is replaced by a less stringent selection criterion. Results from analyses of both simulated and real data showed that FASTmrEMMA is more powerful in QTN detection and model fit, has less bias in QTN effect estimation and requires less running time than existing single- and multi-locus methods, such as empirical Bayes, settlement of mixed linear model under progressively exclusive relationship (SUPER), efficient mixed model association (EMMA), compressed MLM (CMLM) and enriched CMLM (ECMLM). FASTmrEMMA provides an alternative for multi-locus GWAS.

Substances:
Arabidopsis Proteins

Year: 2018 PMID: 28158525 PMCID: PMC6054291 DOI: 10.1093/bib/bbw145

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

Genome-wide association studies (GWAS) have been widely used in the genetic dissection of quantitative traits in human, animal and plant genetics, especially in combination with the output of genomic sequencing technologies. The most popular method for GWAS is the mixed linear model (MLM) method [1, 2] because of its demonstrated effectiveness in correcting the inflation from many small genetic effects (polygenic background) and controlling the bias of population stratification [3-7]. Since the MLM of Yu et al. [2] was published, many MLM-based methods have been proposed. However, most of them perform a one-dimensional genome scan, testing one marker at a time, which requires a multiple-test correction of the significance threshold. The widely used Bonferroni correction is often too conservative to detect many important loci for quantitative traits. Most quantitative traits are controlled by a few genes with large effects and numerous polygenes with minor effects, so the current one-dimensional genome scan approaches for GWAS do not match the true genetic model for these traits. To overcome this issue, multi-locus methodologies have been developed, for example, Bayesian least absolute shrinkage and selection operator (LASSO) [8], adaptive mixed LASSO [9], penalized logistic regression [10, 11], Elastic-Net [12], empirical Bayes (E-BAYES) [13] and E-BAYES LASSO [14]. If the number of markers is several times larger than sample size, all marker effects can be included in one single model and estimated in an unbiased way. If the number of markers is many times larger than sample size, however, these shrinkage approaches will fail. In this situation, we should consider how to reduce the number of marker effects in the multi-locus genetic model. For example, Zhou et al. [15] developed a Bayesian sparse linear mixed model, and Moser et al. [16] proposed a Bayesian mixture model.
Under these models, two to four common components in the mixture distribution were considered and only a few variance components were estimated. Although about 500 effects in the genetic model are finally considered after several rounds of Gibbs sampling, computing time remains a major concern for these Bayesian approaches. Recently, Segura et al. [17] and Wang et al. [7] proposed multi-locus MLM approaches. However, faster algorithms are still needed. Zhang et al.'s [1] MLM method treated the quantitative trait nucleotide (QTN) effect as random, so that three variance components, owing to QTNs, polygenes and residual errors, need to be estimated. If the number of effects is large, this calculation takes a long time. To reduce computing time and increase power in QTN detection, a compressed MLM (CMLM) with a population parameters previously determined (P3D) algorithm [18] and an enriched CMLM (ECMLM) [19] have been proposed. On the other hand, Kang et al. [3] proposed an efficient mixed model association (EMMA), and other authors suggested alternatives, such as EMMA eXpedited (EMMAX) [20], FaST-LMM [21], FaST-LMM-Select [22], genome-wide EMMA [4] and genome-wide rapid association using mixed model and regression-Gamma (GRAMMAR-Gamma) [23]. Recently, settlement of mixed linear model under progressively exclusive relationship (SUPER) [24] has been developed based on FaST-LMM. In all the above fast methods, the SNP effect is treated as fixed. Goddard et al. [25] noted that a random-marker model has several advantages compared with the fixed model [7, 26, 27]. For example, the random model approach will shrink the estimated SNP effects toward zero. However, Goddard et al. [25] did not provide an efficient computational algorithm to estimate marker effects.
In this article, we describe a new method that quickly scans each random-effect marker throughout the genome by constructing a fast, new matrix transformation for the three variance components. All the putative QTNs with P-values ≤ 0.005 are then placed into one multi-locus genetic model, and their effects are estimated by EM empirical Bayes (EMEB) [28] for true QTN identification. This new method, called fast multi-locus random-SNP-effect EMMA (FASTmrEMMA), was validated by analysis of real data from Arabidopsis [29] and by a series of simulation studies, and was compared with other methods: E-BAYES (multi-locus model) [30] and SUPER, EMMA, ECMLM and CMLM (single-locus models).

Statistical approaches for GWAS

Fast multi-locus random-SNP-effect EMMA

FASTmrEMMA (Appendix A) is a two-stage multi-locus GWAS approach. In the first stage, each SNP effect is treated as random, and a small subset of SNPs is selected on the premise that most SNPs have no effect on the quantitative trait. Three techniques are used to save running time. First, the original MLM is multiplied by a new transformation matrix, whose purpose is to whiten the covariance matrix of the polygenic matrix K and environmental noise. Second, the polygenic-to-residual variance ratio under the null hypothesis is fixed in all the single-marker genome tests. Third, the number of nonzero eigenvalues is specified as one. In the second stage, all the SNP effects selected in the first stage are placed into one multi-locus model and estimated by expectation and maximization empirical Bayes (EMEB) [28] for true QTN identification. The new method has been implemented in R, and the software can be downloaded from https://cran.r-project.org/web/packages/mrMLM/index.html.
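The whitening step can be illustrated with a small NumPy sketch. It assumes that the transformation is the symmetric inverse square root of λ_g ZKZ^T + I_n, obtained from an eigen decomposition. The function name `whitening_matrix`, the toy kinship matrix and the value of `lambda_g` are hypothetical choices for illustration only; this is not the code of the mrMLM package.

```python
import numpy as np

def whitening_matrix(K, Z, lambda_g):
    """Return C = Q diag(eigvals^-1/2) Q^T for B = lambda_g * Z K Z^T + I_n."""
    n = Z.shape[0]
    B = lambda_g * Z @ K @ Z.T + np.eye(n)
    eigvals, Q = np.linalg.eigh(B)  # B is symmetric positive definite
    return Q @ np.diag(eigvals ** -0.5) @ Q.T

# Toy data (assumption): 6 individuals, 20 markers coded as -1/+1.
rng = np.random.default_rng(0)
n, m = 6, 20
G = rng.choice([-1.0, 1.0], size=(n, m))
K = G @ G.T / m        # simple positive semi-definite kinship matrix
Z = np.eye(n)          # one record per individual
lambda_g = 0.8         # assumed fixed polygenic-to-residual variance ratio

C = whitening_matrix(K, Z, lambda_g)
B = lambda_g * Z @ K @ Z.T + np.eye(n)
whitened = C @ B @ C.T  # covariance (up to the residual variance) after transformation
print(np.allclose(whitened, np.eye(n)))
```

After multiplying the model by `C`, the polygenic background plus residual error has a covariance proportional to the identity matrix, so only the QTN variance remains to be handled in each single-marker test.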

E-BAYES

E-BAYES is an existing multi-locus Bayesian approach implemented in a SAS program [30], and was used as a gold standard for multi-locus model comparison. In this method, all the SNP-effect variances are estimated simultaneously. Owing to the multi-locus nature, Bonferroni correction is replaced by a less stringent selection criterion. The critical P-value for the significance test is set at 0.05 in the three simulation experiments.

EMMA

EMMA is an existing single-locus genome scan method for GWAS [3], and a fixed model version of the original MLM, in which QTN effect is treated as a fixed effect with no prior distribution assigned. The method was implemented by the R software package EMMA (http://mouse.cs.ucla.edu/emma/).

CMLM and ECMLM

CMLM [18] and ECMLM [19] are existing single-locus genome scan methods for GWAS. CMLM decreases the effective sample size by clustering individuals into groups and eliminates the need to re-compute variance components. ECMLM chooses the best combination of three kinship algorithms and eight grouping algorithms to increase statistical power. Both methods are also fixed-model versions of the original MLM and use approximate algorithms for SNP effect estimation.

SUPER

FaST-LMM [21] is a newly developed algorithm in GWAS that can solve the computational problem, but requires that the number of SNPs be less than the number of individuals. To overcome this shortcoming, SUPER [24] extracts a small subset of SNPs and uses them in FaST-LMM. SUPER not only retains the computational advantage of FaST-LMM but also remarkably increases statistical power. ECMLM, CMLM and SUPER are all implemented in the R software package GAPIT (http://zzlab.net/GAPIT). The methodological comparison of the above approaches is listed in Table 1.

Table 1

Comparison of six methods and their software for GWAS

Case	FASTmrEMMA	E-BAYES	EMMA	CMLM	ECMLM	SUPER
Model	Multi-locus model	Multi-locus model	Single-locus model	Single-locus model	Single-locus model	Single-locus model
QTN effect	Random	Random	Fixed	Fixed	Fixed	Fixed
Polygenic background control	Yes	No	Yes	Yes	Yes	Yes
Population structure control	Yes	No	Yes	Yes	Yes	Yes
Number of variance components	Three	No. of effects	Two	Two	Two	Two
Polygenic-to-residual variance ratio	Fixed	NA	NA	Fixed	Fixed	NA
Significant critical value	LOD (logarithm of odds)=3	P-value=0.05	P-value=0.05/p, where p is no. of markers	P-value=0.05/p	P-value=0.05/p	P-value=0.05/p
Transformation matrix and performances	$Q_1\Lambda_r^{-1/2}Q_1^T$, where $(Q_1\Lambda_r^{1/2}Q_1^T)(Q_1\Lambda_r^{1/2}Q_1^T)=\hat{\lambda}_g ZKZ^T+I_n$. The covariance matrix of the polygenic matrix K and environmental noise is whitened. The number of nonzero eigenvalues is specified as one.	Shrinkage is selective. Large effects are subject to virtually no shrinkage while small effects are shrunken to zero.	$U_R^T$, where $SHS=U_R\,\mathrm{diag}(\xi_1+\delta,\cdots,\xi_n+\delta)\,U_R^T$, $H=ZKZ^T+\delta I$ and $S=I-X(X^TX)^{-1}X^T$. One-dimensional optimization by deriving the likelihood as a function of QTN-to-residual variance ratio.	Kinship among individuals is replaced by the kinship among groups. Fits the groups as the random effect, estimates population parameters only once and then fixes them to test genetic markers.	Kinship among individuals is replaced by the kinship among groups. Chooses the best combination between kinship algorithms and grouping algorithms.	Dramatically reduces the number of markers used to define individual relationships, and uses them in FaST-LMM.
Running time	Fast	Depend on the number of effects.	Slow	Fast	Fast	Moderate
Software Web site	https://cran.r-project.org/web/packages/mrMLM/index.html	http://statgen.ucr.edu/software.html	http://mouse.cs.ucla.edu/emma/	http://zzlab.net/GAPIT	http://zzlab.net/GAPIT	http://zzlab.net/GAPIT

Results

Estimation of the QTN variance

FASTmrEMMA (Appendix A) is a new algorithm that approximates the estimate of the QTN variance. We therefore need to know whether this approximation materially affects the estimate. To answer this question, four flowering time traits in Arabidopsis [29] (Appendix B) were re-analyzed by FASTmrEMMA and by an exact method implemented by PROC MIXED in SAS. The estimates of QTN variance are shown in Figure 1 and listed in Supplementary Table S1. The relative error between the two methods ranged from 0.0% to 24.09%, with an average of 1.60%, indicating that the approximation in FASTmrEMMA had little effect on the QTN variance estimate for these data.
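As a minimal sketch of the comparison above, the relative error between an approximate and an exact estimate can be computed as follows. The definition used here (absolute difference divided by the exact value, in percent) is an assumption, since the text does not spell out the formula, and the numbers are made up for illustration rather than taken from Supplementary Table S1.

```python
def relative_error_percent(approx, exact):
    """Relative error of an approximate estimate, as a percentage of the exact one."""
    return abs(approx - exact) / abs(exact) * 100.0

# Hypothetical QTN-variance estimates for three SNPs (not values from the paper).
exact_estimates = [0.50, 1.20, 0.08]
approx_estimates = [0.50, 1.17, 0.10]

errors = [relative_error_percent(a, e)
          for a, e in zip(approx_estimates, exact_estimates)]
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

A single large relative error can coexist with a small average, because small exact values inflate the ratio.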

Figure 1

Comparison of the QTN-variance estimates between fast multi-locus random-SNP-effect EMMA (FASTmrEMMA) and one exact algorithm implemented by PROC MIXED in SAS. LD: days to flowering under long days; SDV: days to flowering under short days with vernalization; 8W GH LN: leaf number at flowering with 8 weeks vernalization, greenhouse; and 8W GH FT: days to flowering, 8 weeks vernalization, greenhouse.

To confirm the effectiveness of FASTmrEMMA, three Monte Carlo simulation experiments (Appendix C) were carried out, and the simulation procedures were almost the same as those in Wang et al. [7]. In the three experiments, various backgrounds (none, polygenes and epistasis) were simulated to conduct sensitivity analysis. Each sample in these simulation experiments was analyzed by six methods. Among the six methods, FASTmrEMMA is a new multi-locus algorithm within the framework of MLM, E-BAYES [30] is an existing multi-locus approach under the framework of Bayesian statistics, and SUPER, EMMA, ECMLM and CMLM are existing single-locus GWAS methods.

Statistical power for QTN detection

In the above three simulation experiments, the power for each QTN was defined as the proportion of samples where the QTN was detected (the P-value is smaller than the designated threshold). When only six QTNs were simulated in the first experiment, the power in the detection of each QTN was higher for FASTmrEMMA than for the others (Figure 2A; Supplementary Table S2). When a polygenic background () was added to the first experiment, a similar trend was observed (Figure 2B; Supplementary Table S2). When the polygenic background was changed into an epistatic background (), the results were also similar to those in the first experiment (Figure 2C; Supplementary Table S2). These results demonstrate the highest power of FASTmrEMMA across all the approaches under various genetic backgrounds, although the other methods are also robust under these backgrounds.
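The definition of power used here can be sketched in a few lines of NumPy. The array layout (one row per simulated sample, one column per QTN), the function name `detection_power` and the P-values are assumptions made for illustration.

```python
import numpy as np

def detection_power(p_values, threshold):
    """Proportion of simulated samples in which each QTN has a P-value below the threshold."""
    p_values = np.asarray(p_values, dtype=float)
    return (p_values < threshold).mean(axis=0)

# Hypothetical P-values: 4 replicate samples (rows) x 2 simulated QTNs (columns).
p = np.array([
    [1e-6, 0.2],
    [1e-5, 1e-7],
    [0.03, 1e-9],
    [1e-8, 0.5],
])
power = detection_power(p, threshold=1e-4)
print(power)
```

Each entry of `power` is the fraction of replicates in which the corresponding QTN passed the chosen threshold.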

Figure 2

Comparison of FASTmrEMMA with the single- and multi-locus approaches under various genetic backgrounds. The single-locus approaches are SUPER, EMMA, ECMLM and CMLM, and the multi-locus approach is E-BAYES. The powers are presented in A–C, MSEs are shown in D–F and MADs are shown in G–I. Six QTNs (A, D and G), six QTNs plus polygenes (B, E and H) and six QTNs plus three epistatic effects (C, F and I) were simulated, respectively, in the first to third simulation experiments.

Accuracy for estimated QTN effects

We used the average, mean squared error (MSE) and mean absolute deviation (MAD) to measure the accuracy of an estimated QTN effect, and evaluated these for all six simulated QTNs across all six methods. In most cases, the estimate of each QTN effect from FASTmrEMMA was much closer to the true value than the estimates obtained from the other methods. On two occasions (QTN numbers 1 and 4), the averages from E-BAYES were closer to the true value than those from FASTmrEMMA in the three simulation experiments (Supplementary Table S2). The MSE and MAD for each QTN effect were considerably smaller from FASTmrEMMA than from the other methods, with two exceptions: for QTN number 6, E-BAYES had slightly higher accuracy than FASTmrEMMA in the first and second simulation experiments (Figure 2D–I; Supplementary Table S2). These results indicate that FASTmrEMMA generally estimates QTN effects more accurately than the other methods.
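The three accuracy measures can be sketched as below. The sketch assumes that MSE and MAD are taken around the true (simulated) effect and averaged over the replicates in which the QTN was detected; the text does not state this explicitly, and the estimates are invented for illustration.

```python
import numpy as np

def effect_accuracy(estimates, true_effect):
    """Average, MSE and MAD of effect estimates relative to the simulated true effect."""
    est = np.asarray(estimates, dtype=float)
    dev = est - true_effect
    return {
        "average": float(est.mean()),
        "mse": float(np.mean(dev ** 2)),     # mean squared error
        "mad": float(np.mean(np.abs(dev))),  # mean absolute deviation
    }

# Hypothetical estimates of one QTN effect across four replicates; true effect is 1.0.
acc = effect_accuracy([0.9, 1.1, 1.2, 0.8], true_effect=1.0)
print(acc)
```

An average close to the true value with a large MSE or MAD indicates estimates that are unbiased on average but variable across replicates.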

False-positive rate and receiver operating characteristic curve

All the false QTNs detected by the six methods in the three simulation experiments were used to calculate the empirical false-positive rates of the six methods. These results are listed in Supplementary Table S3. In the three simulation experiments, the empirical false-positive rates of the six methods were between 0.357 and 7.785 (in units of 1E-4), and had the same order of magnitude. ECMLM had the lowest false-positive rate, followed by CMLM, FASTmrEMMA and EMMA; SUPER had the highest, followed by E-BAYES. A receiver operating characteristic curve is a plot of the statistical power against the controlled type I error. This curve is frequently used to compare different methods for their efficiencies in the detection of significant effects; the higher the curve, the better the method. In the first simulation experiment, the powers were calculated at 11 probability levels for significance between 1E-8 and 1E-3. The results are shown in Figure 3. Among the six approaches, FASTmrEMMA is clearly the best, followed by E-BAYES.
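A minimal sketch of both quantities follows. It makes two assumptions that the text does not confirm: the empirical false-positive rate is the number of false detections divided by the number of null-marker tests, and the 11 significance levels are equally spaced on a log10 scale. All function names and numbers are hypothetical.

```python
import numpy as np

def empirical_fpr(n_false_detections, n_null_markers, n_replicates):
    """False detections divided by the total number of null-marker tests (assumed definition)."""
    return n_false_detections / (n_null_markers * n_replicates)

def power_curve(p_values, levels):
    """Fraction of true-QTN tests with P-value at or below each significance level."""
    p_values = np.asarray(p_values, dtype=float)
    return np.array([(p_values <= alpha).mean() for alpha in levels])

# 11 levels from 1E-8 to 1E-3, log-spaced (the spacing is an assumption).
levels = np.logspace(-8, -3, 11)

# Hypothetical P-values for true QTNs pooled over replicates.
qtn_p = np.array([1e-9, 5e-7, 2e-5, 4e-4, 0.02])
curve = power_curve(qtn_p, levels)
fpr = empirical_fpr(n_false_detections=5, n_null_markers=1000, n_replicates=10)
print(curve, fpr)
```

Plotting `curve` against `np.log10(levels)` gives one line of the kind of plot described above; a method whose line lies higher detects more true QTNs at the same type I error.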

Figure 3

Statistical powers for six simulated QTNs in the first simulation experiment plotted against type I error (in a log10 scale) for the six GWAS methods (FASTmrEMMA, E-BAYES, SUPER, EMMA, ECMLM and CMLM).

Computing time

In each of the three simulation experiments, computing times for the six methods were recorded and are listed in Supplementary Table S4. In summary, FASTmrEMMA required the least computing time, followed by ECMLM, E-BAYES, CMLM and SUPER, and EMMA required the most.

Real data analysis in Arabidopsis

To validate FASTmrEMMA, this new method, along with E-BAYES, SUPER, EMMA, ECMLM and CMLM, was used to re-analyze the Arabidopsis data [29] for four traits: days to flowering under long days (LD); days to flowering under short days with vernalization (SDV); leaf number at flowering with 8 weeks vernalization, greenhouse (8W GH LN); and days to flowering, 8 weeks vernalization, greenhouse (8W GH FT). The results are listed in Supplementary Table S5. FASTmrEMMA identified 20, 17, 14 and 17 SNPs significantly associated with LD, SDV, 8W GH LN and 8W GH FT, respectively. The corresponding numbers were 2, 6, 1 and 5 from E-BAYES; 21, 0, 0 and 0 from SUPER; 1, 5, 0 and 2 from EMMA; and 0, 1, 0 and 0 from both ECMLM and CMLM. With the exception of SUPER for LD, FASTmrEMMA detected many more significantly associated SNPs than the other methods. The significantly associated SNPs for each trait were then used in a multiple linear regression analysis, and the corresponding Bayesian information criteria (BIC) were calculated. For example, the BIC value for the model of 8W GH LN was −103.47 for FASTmrEMMA, 77.76 for E-BAYES and 117.50 for the others. FASTmrEMMA shows the lowest BIC values for all four traits (Table 2), indicating the best model fit among the six approaches.
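The BIC comparison can be sketched as follows. The sketch assumes the common Gaussian form BIC = n ln(RSS/n) + k ln(n) with additive constants dropped; the text does not state which form was used, so the values produced here are not comparable with Table 2. The function name and the simulated data are hypothetical.

```python
import numpy as np

def linear_model_bic(y, X):
    """BIC = n*ln(RSS/n) + k*ln(n) for an ordinary least squares fit with an intercept."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X])  # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]                        # number of regression coefficients
    return n * np.log(rss / n) + k * np.log(n)

# Toy data (assumption): 200 individuals, 3 SNPs coded -1/+1, two of them causal.
rng = np.random.default_rng(1)
n = 200
snps = rng.choice([-1.0, 1.0], size=(n, 3))
y = 0.8 * snps[:, 0] - 0.5 * snps[:, 1] + rng.normal(scale=0.5, size=n)

bic_null = linear_model_bic(y, np.empty((n, 0)))  # intercept only
bic_two = linear_model_bic(y, snps[:, :2])        # the two causal SNPs
print(bic_null, bic_two)
```

A lower BIC indicates a better trade-off between fit and model size, which is the sense in which the lowest value is read as the best model fit.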

Table 2

Bayesian information criterion values for four flowering time traits in Arabidopsis using six genome-wide association study approaches

Trait	FASTmrEMMA	E-BAYES	SUPER	EMMA	ECMLM	CMLM
LD	39.54	287.00	396.65	299.97	382.07	382.07
SDV	−88.09	43.20	179.54	100.69	169.87	169.87
8W GH LN	−103.47	77.76	117.50	117.50	117.50	117.50
8W GH FT	−321.72	−155.55	−82.41	−101.83	−82.41	−82.41

LD: days to flowering under long days; SDV: days to flowering under short days with vernalization; 8W GH LN: leaf number at flowering with 8 weeks vernalization, greenhouse; 8W GH FT: days to flowering, 8 weeks vernalization, greenhouse.

Based on the SNPs detected by FASTmrEMMA, 6, 11, 5 and 7 genes were previously reported to be associated with the above four traits [31-33]. In the vicinity of the SNPs detected by E-BAYES, the corresponding numbers of the known genes are 2, 1, 0 and 1, respectively, for the above four traits [31]. Only four known genes for LD (SUPER), two known genes for LD (EMMA) and three known genes for SDV (EMMA) are in the neighborhood of the detected SNPs [31, 33] (Table 3). Clearly, FASTmrEMMA method detected more known genes than did the other methods.

Table 3

GWAS for four flowering time traits in Arabidopsis using six GWAS methods

Trait	Gene	Chr	SNP (bp)	FASTmrEMMA				E-BAYES				SUPER				EMMA				References
Trait	Gene	Chr	SNP (bp)	LOD	Effect	MAF	r² (%)	LOD	Effect	MAF	r² (%)	P-value	Effect	MAF	r² (%)	P-value	Effect	MAF	r² (%)	References
LD	At1g22770	1	8045438	4.872	−0.112	0.395	0.549													[31]
	At1g23000	1	8128350	9.006	−0.197	0.461	1.767													[31]
	At2g22540	2	9588685	10.338	−0.330	0.281	4.034	10.753	−0.611	0.281	13.817					2.78E-09	−0.815	0.281	24.607	[31]
	At2g22610	2	9588685	10.338	−0.330	0.281	4.034	10.753	−0.611	0.281	13.817					2.78E-09	−0.815	0.281	24.607	[31]
	At3g61970	3	22949227	5.919	0.149	0.413	0.986													[31]
	At5g10140	5	3188328	12.759	−0.272	0.263	2.630													[31]
	At4g00310	4	153459									8.39E-08	−0.363	0.168	3.374					[31]
	At4g00335	4	167142									6.75E-08	−0.538	0.138	6.307					[31]
	At4g00450	4	196614									2.88E-08	−0.227	0.389	2.243					[31]
	At4g01280	4	516758									8.15E-08	−0.504	0.108	4.483					[31]
SDV	At1g05440	1	1595585	4.298	0.117	0.214	1.346													[31]
	At1g05470	1	1595585	4.298	0.117	0.214	1.346													[31]
	At1g77080	1	28965510	10.817	−0.177	0.484	4.576	4.020	−0.170	0.484	4.221									[31]
	At2g41890	2	17488070	4.339	0.099	0.302	1.208													[31]
	At3g20260	3	7084425	3.309	0.068	0.302	0.570													[31]
	At3g49600	3	18385143	4.529	0.118	0.321	1.774													[31]
	At4g05420	4	2748735	4.286	−0.091	0.459	1.203													[31]
	At5g04240	5	1164843	4.479	−0.137	0.220	1.884													[31]
	At5g09805	5	3055565	4.763	−0.105	0.233	1.151													[32]
	At5g57360	5	23249199	5.419	−0.141	0.321	2.533													[31]
	At5g57390	5	23249199	5.419	−0.141	0.321	2.533													[31]
	At5g46880	5	19044037													3.55E-08	0.408	0.107	9.296	[31]
	At5g67100	5	26794176													1.79E-07	−0.292	0.321	10.864	[31]
	At5g67200	5	26794176													1.79E-07	−0.292	0.321	10.864	[33]
8W GH LN	At1g77080	1	28965510	3.857	−0.109	0.497	2.610													[31]
	At2g27380	2	11703876	9.631	−0.153	0.325	4.514													[33]
	At4g32980	4	15918498	4.651	−0.147	0.147	2.384													[31]
	At5g15850	5	5196549	5.923	−0.106	0.319	2.145													[31]
	At5g45890	5	18600041	4.608	−0.107	0.423	2.456													[31]
8W GH FT	At1g03457	1	863771	5.055	0.040	0.460	1.199													[31]
	At2g27380	2	11703876	4.744	−0.043	0.323	1.122													[33]
	At2g47230	2	19396129	4.208	−0.038	0.298	0.911													[31]
	At3g56900	3	21079518	3.081	−0.032	0.311	0.661													[31]
	At3g57000	3	21079518	3.081	−0.032	0.311	0.661													[31]
	At5g06550	5	2002341	3.169	−0.070	0.186	2.241													[31]
	At5g06590	5	2002341	3.169	−0.070	0.186	2.241													[33]
	At5g67100	5	26781546					4.302	0.076	0.317	3.772									[31]

MAF: minor allele frequency. The individuals with missing phenotypes and the SNPs with MAF were excluded. The critical value for significance was for FASTmrEMMA and E-BAYES, and approximately 2.8E-07 P-value for SUPER, EMMA, CMLM and ECMLM. The results from CMLM and ECMLM were not listed in this table because no genes were detected. The data set was derived from Atwell et al. (2010).

We also compared all the known genes detected in this study with all the candidate genes in Atwell et al. [29]. For example, among seven known genes (At1g03457, At2g27380, At2g47230, At3g56900, At3g57000, At5g06550 and At5g06590) for 8W GH FT in this study, no genes were within the 133 candidate genes in Atwell et al. [29]. Among 11 known genes for SDV in this study, only three genes (At5g04240, At5g57360 and At5g57390) were within the 153 candidate genes in Atwell et al. [29]. Clearly, FASTmrEMMA method detected new genes.

Discussion

When SNP effects are viewed as random, three variance components need to be estimated. Generally, polygenic variance is larger than zero while variance components for most SNPs are zero because these markers are not associated with the trait of interest. In other words, as in most mixed model approaches, variance components in FASTmrEMMA are also estimated under the assumption that one variance component is zero. FASTmrEMMA is a new algorithm, different from the widely used one-dimensional genome scan approaches, such as SUPER, EMMA, ECMLM and CMLM. First, the SNP effects are viewed as random in FASTmrEMMA, whereas they are viewed as fixed in SUPER, EMMA, ECMLM and CMLM. The random model approach shrinks the estimated SNP effects toward zero when the simulated QTN effects are small, leading to maximum correlations between observed and predicted phenotypic values [25, 34]. Meanwhile, the power of detecting QTNs with random effects is higher than that with fixed effects [35]. Second, a quick single-marker genome scan method was proposed to estimate the three variance components in the above mixed model. Several techniques have been incorporated into the algorithm. The first technique is to fix the polygenic-to-residual variance ratio, which was adopted in CMLM/P3D [18] and EMMAX [20]. Although this algorithm is approximate, it has almost no effect on the estimate of SNP-effect variance, even if there is a large difference in the above ratios between the approximate and exact algorithms (Supplementary Table S1). This supports fixing the ratio in FASTmrEMMA. The second technique is to use a quick matrix calculation algorithm, such as the eigen decomposition of matrix is the same as that of (a positive number). Thus, eigen decomposition, determinant and derivatives in the estimation of can be quickly calculated. The final technique is to estimate residual variance along with the estimation of fixed effects.
In the single marker genome scan, therefore, only one parameter needs to be estimated, so running time is markedly reduced. Although the GCTA algorithm [36] may be used to estimate the above three variance components, running time is a major concern. A similar situation is also apparent when using PROC MIXED in SAS, as in Zhang et al. (2005) [1]. Finally, our matrix transformation algorithm in FASTmrEMMA is different from those in SUPER, EMMA, ECMLM, CMLM and multi-locus random-SNP-effect mixed linear model (mrMLM) [7]. For example, when many random effects are included simultaneously in one genetic model and polygenic background also needs to be controlled, no methods are available at present. However, the new matrix transformation algorithm can transform polygenic background plus residual error into a normal residual error. This new model can be easily treated by a Bayesian method. The applied study will be reported in the near future. The multi-variance-component algorithm, E-BAYES [30], was also used to conduct multi-locus GWAS, especially for the situation where the number of markers is several times larger than sample size. However, results from simulation experiments showed that FASTmrEMMA is more powerful in QTN detection and more accurate in QTN effect estimation than E-BAYES (Supplementary Table S2). FASTmrEMMA is different from the adaptive mixed LASSO [9]. If the number of markers is many times larger than sample size, the adaptive mixed LASSO does not work. FASTmrEMMA is also different from both the Bayesian sparse linear mixed model [15] and the Bayesian mixture model [16]. The latter two operate under the framework of Bayesian statistics, and the computing time becomes a major concern. FASTmrEMMA is different from the multi-locus mixed-model (MLMM) of Segura et al. [17] in two aspects. First, MLMM is a simple, stepwise mixed-model regression with forward inclusion and backward elimination, whereas FASTmrEMMA is a two-step combined method.
In MLMM, the computationally intensive forward-backward inclusion of SNPs is clearly a limiting factor in exploring the huge model space [17]. Second, the matrix transformation algorithm in MLMM is different from that in FASTmrEMMA. This difference also exists between FASTmrEMMA and mrMLM of Wang et al. [7]. As described by Wang et al. [7], single-locus genome scan approaches for GWAS require Bonferroni correction for multiple tests. However, this correction is often too conservative to detect important loci for quantitative traits when the number of markers is extremely large. In contrast, FASTmrEMMA is based on a multi-locus model. Owing to the multi-locus nature, Bonferroni correction is replaced by a less stringent selection criterion. Results from analysis of simulated and real data further validated the idea of a less stringent selection criterion in this study. FASTmrEMMA is a combined method with two steps, each of which needs a critical P-value. In the first step, three critical P-values (0.01, 0.005 and 0.001) were compared to obtain the best one, and the 0.005 critical P-value performed best (Supplementary Table S6). In the second step, a less stringent selection criterion between 0.05 and 0.05/p was adopted, where p is the number of markers. The two critical P-values in FASTmrEMMA have been confirmed by our simulated and real data analysis. FASTmrEMMA was validated by sensitivity analysis in two aspects. First, various backgrounds (none, polygenes and epistasis) in the three simulation experiments have validated the new method (Supplementary Table S2). Second, the new method works well for more than 10 QTNs. For example, 14–20 QTNs have been found to be associated with the four traits in Arabidopsis thaliana and then to be closely linked with the 5–11 known genes (Supplementary Table S5).
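To see how a criterion of LOD = 3 (the FASTmrEMMA critical value in Table 1) sits between 0.05 and 0.05/p, the following sketch converts a LOD score to a P-value. It assumes that the likelihood-ratio statistic 2 ln(10) × LOD is referred to a chi-square distribution with one degree of freedom, which the text here does not state. The marker count of 200,000 is a hypothetical value, not the number used in the study.

```python
import math

def lod_to_p(lod):
    """P-value of the statistic 2*ln(10)*LOD under a chi-square with 1 df (assumed)."""
    lrt = 2.0 * math.log(10.0) * lod
    # For 1 df, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))

def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-marker critical P-value under Bonferroni correction."""
    return alpha / n_markers

p_lod3 = lod_to_p(3.0)
p_bonf = bonferroni_threshold(200_000)  # hypothetical number of markers
print(p_lod3, p_bonf)
```

Under these assumptions the LOD = 3 criterion corresponds to a P-value of roughly 2E-4, which is stricter than 0.05 but far less strict than the Bonferroni threshold for a marker set of this size.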

Conclusion

In the FASTmrEMMA algorithm, random SNP effects and a multi-locus model are used to improve the power for QTN detection. To decrease the false-positive rate, a new matrix transformation in the first step of FASTmrEMMA is constructed to obtain a new genetic model that includes only QTN variation and normal residual error. Additionally, setting the number of nonzero eigenvalues to one and fixing the polygenic-to-residual variance ratio save running time. As a result, FASTmrEMMA has the highest power and accuracy for QTN detection and the best fit for a genetic model, as compared with E-BAYES, SUPER, EMMA, ECMLM and CMLM.

Key Points

GWAS aims to identify a genome-wide set of genetic variants in a population by associating all possible markers with a complex trait.

Owing to low power and high false-positive rates in a single-marker genome-wide scan, multi-locus GWAS methodologies have been developed, such as FASTmrEMMA.

We review and assess six GWAS methodologies using both simulated and real data.

In FASTmrEMMA, SNP effects are viewed as random, the covariance matrix of the polygenic matrix K and environmental noise is whitened, and multiple markers potentially associated with a trait are further detected by EMEB.

FASTmrEMMA is more powerful in QTN detection and model fit, has less bias in QTN effect estimation, and requires less running time than the other five methods.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Funding

National Natural Science Foundation of China (grants 31571268 and 31301229), and Huazhong Agricultural University Scientific & Technological Self-innovation Foundation (Program No. 2014RC020).
