Wenlong Ren, Zhikai Liang, Shu He, Jing Xiao.
Abstract
In genome-wide association studies (GWAS), linear mixed models (LMMs) have been widely used to explore the molecular mechanisms of complex traits. However, typical association approaches suffer from several important drawbacks: estimating variance components in LMMs with a large number of individuals is computationally slow, and the single-locus model handles complex confounding poorly and loses statistical power. To address these issues, we propose an efficient two-stage method based on a hybrid of restricted and penalized maximum likelihood, named HRePML. First, we perform restricted maximum likelihood (REML) on a single-locus LMM to remove unrelated markers, using spectral decomposition of the covariance matrix to estimate variance components quickly. Second, we carry out penalized maximum likelihood (PML) on a multi-locus LMM for the markers with reasonably large effects. To validate the effectiveness of HRePML, we conducted a series of simulation studies and real data analyses. Our method always had the highest average statistical power compared with the multi-locus mixed-model (MLMM), fixed and random model circulating probability unification (FarmCPU), and genome-wide efficient mixed model association (GEMMA) methods. More importantly, HRePML provides more accurate estimates of marker effects. HRePML also identified 41 previously reported genes associated with development traits in Arabidopsis, more than were detected by the other methods.
Keywords: GWAS; computational efficiency; linear mixed model; penalized; restricted maximum likelihood
Year: 2020 PMID: 33138126 PMCID: PMC7692801 DOI: 10.3390/genes11111286
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
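The spectral-decomposition step of the first stage can be illustrated with the standard single-locus LMM formulation used by EMMA-style methods. This is a sketch under stated assumptions: the notation ($K$, $U$, $D$, $\lambda$, $H_\lambda$) and the treatment of the tested marker as a column of the fixed-effect design $X$ are illustrative choices, not taken from the abstract, and the paper's own parameterization may differ.

```latex
% Assumed notation: y is the n x 1 phenotype vector, X the n x p fixed-effect
% design (covariates plus the tested marker), K the n x n covariance (kinship) matrix.
\begin{align}
y &= X\beta + u + \varepsilon, \qquad
  u \sim N(0, \sigma_g^2 K), \quad \varepsilon \sim N(0, \sigma_e^2 I_n) \\
K &= U D U^{\top}, \qquad D = \operatorname{diag}(d_1, \dots, d_n) \\
U^{\top} y &\sim N\!\left(U^{\top} X \beta,\; \sigma_e^2 H_\lambda\right), \qquad
  H_\lambda = \lambda D + I_n, \quad \lambda = \sigma_g^2 / \sigma_e^2 \\
\ell_R(\lambda) &= -\tfrac{1}{2}\Big[(n-p)\log \mathrm{RSS}(\lambda)
  + \sum_{i=1}^{n} \log(\lambda d_i + 1)
  + \log\big|\tilde{X}^{\top} H_\lambda^{-1} \tilde{X}\big|\Big] + \text{const}
\end{align}
% Here \tilde{X} = U^{\top} X, \tilde{y} = U^{\top} y, and
% RSS(\lambda) = (\tilde{y} - \tilde{X}\hat{\beta})^{\top} H_\lambda^{-1} (\tilde{y} - \tilde{X}\hat{\beta}),
% with \hat{\beta} the generalized least squares estimate for the given \lambda.
```

Because $H_\lambda$ is diagonal after the rotation by $U$, each evaluation of $\ell_R(\lambda)$ avoids an $n \times n$ matrix inversion, and the decomposition of $K$ is computed once and reused for every marker.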
Table 1. Comparison of the statistical power and mean squared errors (MSE) for each quantitative trait nucleotide (QTN) among the hybrid of restricted and penalized maximum likelihood (HRePML), multi-locus mixed-model (MLMM), fixed and random model circulating probability unification (FarmCPU), and genome-wide efficient mixed model association (GEMMA) methods in the first simulation study *.
| QTN | Chr. | Position (bp) | R² | Effect | Power (%): HRePML | Power (%): MLMM | Power (%): FarmCPU | Power (%): GEMMA | MSE: HRePML | MSE: MLMM | MSE: FarmCPU | MSE: GEMMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 404108 | 0.01 | 0.4328 | 9.9 | 2.4 | 1.6 | 0.0 | 0.0509 | 0.1334 | 0.1224 | na # |
| 2 | 1 | 636788 | 0.03 | 0.7497 | 45.1 | 39.9 | 53.7 | 0.2 | 0.0193 | 0.0440 | 0.0241 | 0.3112 |
| 3 | 3 | 507976 | 0.03 | 0.7497 | 66.7 | 40.1 | 13.9 | 8.1 | 0.1443 | 0.0992 | 0.2756 | 0.1597 |
| 4 | 3 | 931437 | 0.05 | 0.9679 | 89.5 | 69.5 | 58.8 | 55.3 | 0.0321 | 0.0276 | 0.0434 | 0.0770 |
| 5 | 4 | 75898 | 0.08 | 1.2243 | 100.0 | 99.8 | 100.0 | 97.5 | 0.0407 | 0.0375 | 0.0527 | 0.0283 |
| 6 | 4 | 461978 | 0.01 | 0.4328 | 12.7 | 5.0 | 8.9 | 0.7 | 0.2488 | 0.3808 | 0.3429 | 0.5502 |
| 7 | 4 | 607026 | 0.05 | 0.9679 | 69.6 | 80.9 | 98.5 | 73.8 | 0.0421 | 0.0988 | 0.0367 | 0.1544 |
| 8 | 5 | 282008 | 0.05 | 0.9679 | 89.6 | 87.6 | 90.3 | 55.1 | 0.0397 | 0.0334 | 0.0345 | 0.0725 |
* In the first simulation study, the dataset consists of 500 individuals and 10,000 single nucleotide polymorphism (SNP) markers, with 1000 replicates. Eight true QTNs are set in each replicate. The dataset can therefore be regarded as having 10,000,000 SNPs and 8000 true QTNs in total. # “na” means not available.
Table 2. Comparison of average statistical power, average mean squared errors (MSE), and running time among the HRePML, MLMM, FarmCPU, and GEMMA methods in the first simulation study *.
| Statistical Properties | HRePML | MLMM | FarmCPU | GEMMA |
|---|---|---|---|---|
| Average power (%) | 60.39 | 53.15 | 53.21 | 36.34 |
| Average MSE | 0.0772 | 0.1068 | 0.1165 | 0.1933 |
| Running time (Hour) | 3.1419 | 22.7274 | 4.6653 | 2.4186 |
* The dataset used in Table 2 is the same as that used in Table 1.
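The averages in Table 2 can be checked against the per-QTN values in Table 1. The following is a minimal sketch that transcribes those values and averages them per method; the variable and function names are illustrative, and treating the "na" entry as excluded from the mean is an assumption rather than something the source states.

```python
# Per-QTN values transcribed from Table 1 (QTN order 1..8).
# None marks the "na" entry reported for GEMMA at QTN 1 (assumed to be
# left out of the average rather than counted as zero).
power = {
    "HRePML":  [9.9, 45.1, 66.7, 89.5, 100.0, 12.7, 69.6, 89.6],
    "MLMM":    [2.4, 39.9, 40.1, 69.5, 99.8, 5.0, 80.9, 87.6],
    "FarmCPU": [1.6, 53.7, 13.9, 58.8, 100.0, 8.9, 98.5, 90.3],
    "GEMMA":   [0.0, 0.2, 8.1, 55.3, 97.5, 0.7, 73.8, 55.1],
}
mse = {
    "HRePML":  [0.0509, 0.0193, 0.1443, 0.0321, 0.0407, 0.2488, 0.0421, 0.0397],
    "MLMM":    [0.1334, 0.0440, 0.0992, 0.0276, 0.0375, 0.3808, 0.0988, 0.0334],
    "FarmCPU": [0.1224, 0.0241, 0.2756, 0.0434, 0.0527, 0.3429, 0.0367, 0.0345],
    "GEMMA":   [None, 0.3112, 0.1597, 0.0770, 0.0283, 0.5502, 0.1544, 0.0725],
}


def mean_available(values):
    """Average the entries that are not None."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)


average_power = {method: mean_available(v) for method, v in power.items()}
average_mse = {method: mean_available(v) for method, v in mse.items()}

for method in power:
    print(f"{method:8s} power = {average_power[method]:.2f}%  "
          f"MSE = {average_mse[method]:.4f}")
```

Under this reading, the reported GEMMA average MSE of 0.1933 corresponds to the mean over the seven QTNs with an available value; this is inferred from the numbers and is not stated in the source.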
Figure 1. Comparison of statistical powers of eight simulated quantitative trait nucleotides (QTNs) using four genome-wide association study (GWAS) methods (hybrid of restricted and penalized maximum likelihood (HRePML), multi-locus mixed-model (MLMM), fixed and random model circulating probability unification (FarmCPU), and genome-wide efficient mixed model association (GEMMA)). (A) The first simulation study: no polygenic background. (B) The second simulation study: an additive polygenic variance involved.
Figure 2. Comparison of mean squared errors of each simulated QTN effect using four GWAS methods (HRePML, MLMM, FarmCPU, and GEMMA). The descriptions in (A,B) are the same as those in Figure 1.
Table 3. Effect of the sample size on the statistical power and running time using the HRePML method in the third simulation study *.
| QTN | R² | Power (%), n = 500 | Power (%), n = 1000 | Power (%), n = 2000 | Power (%), n = 4000 |
|---|---|---|---|---|---|
| 1 | 0.01 | 12 | 13 | 30 | 61 |
| 2 | 0.03 | 48 | 84 | 94 | 99 |
| 3 | 0.03 | 61 | 91 | 93 | 96 |
| 4 | 0.05 | 91 | 99 | 97 | 99 |
| 5 | 0.08 | 100 | 100 | 100 | 100 |
| 6 | 0.01 | 9 | 23 | 46 | 77 |
| 7 | 0.05 | 71 | 98 | 98 | 99 |
| 8 | 0.05 | 87 | 98 | 98 | 98 |
| Average power (%) | | 59.88 | 75.75 | 82.00 | 91.13 |
| Running time (Hour) | | 0.3142 | 1.1244 | 3.9969 | 39.5439 |
* In the third simulation study, there are four datasets consisting of 500, 1000, 2000, and 4000 individuals, respectively, each with 10,000 SNP markers and 100 replicates. Eight true QTNs are set in each replicate. Each dataset can therefore be regarded as having 1,000,000 SNPs and 800 true QTNs in total.
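To see how the HRePML running time grows with sample size, the running times in the sample-size table above can be compared between consecutive sample sizes. This is a rough sketch: the variable names are illustrative, and the empirical exponent (the slope between two points on a log-log scale) is only a descriptive summary of four measurements, not a complexity result from the source.

```python
import math

# Running time (hours) of HRePML by sample size, transcribed from the
# sample-size table above.
running_time = {500: 0.3142, 1000: 1.1244, 2000: 3.9969, 4000: 39.5439}

sizes = sorted(running_time)
growth = []  # (smaller n, larger n, time ratio, empirical exponent)
for small, large in zip(sizes, sizes[1:]):
    ratio = running_time[large] / running_time[small]
    # Exponent b such that time scales like n**b between the two sizes.
    exponent = math.log(ratio) / math.log(large / small)
    growth.append((small, large, ratio, exponent))
    print(f"n = {small} -> {large}: time x{ratio:.2f}, exponent ~ {exponent:.2f}")
```

With these figures, the running time grows roughly 3.6-fold per doubling of the sample size up to 2000 individuals and about 9.9-fold from 2000 to 4000 individuals; the reason for that jump is not given in this excerpt.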
Figure 3. Effect of the sample size on the statistical power using HRePML in the third simulation study.
Figure 4. Comparison of total running time using four GWAS methods (HRePML, MLMM, FarmCPU, and GEMMA). The descriptions in (A,B) are the same as those in Figure 1.
Figure 5. (A) Effect of the sample size on the running time in the third simulation study. (B) Effect of the number of markers on the running time in the fourth simulation study.
Table 4. Previously reported genes identified simultaneously by at least two of the HRePML, MLMM, FarmCPU, and GEMMA methods.
| Detected Genes | Associated Trait | Chr. | Position | Effect Estimate | LOD/ | Methods |
|---|---|---|---|---|---|---|
| | LFS GH | 2 | 7140030 | −7.461, −9.107, −5.16 | | FarmCPU, MLMM, GEMMA |
| | LFS GH | 3 | 2280271 | −5.934, −8.845 | | FarmCPU, MLMM |
| | MT GH | 3 | 20090780 | 1.002, 1.762 | | FarmCPU, MLMM |
| | FT Duration GH | 4 | 6228754 | 0.822, 1.136 | 3.74, | HRePML, FarmCPU |
| | LC Duration GH | 4 | 16140068 | 2.996, 2.540 | 4.78, | HRePML, MLMM |
| | LC Duration GH | 5 | 18625634, | −3.707, −6.051 | 4.78, | HRePML, FarmCPU |
| | LFS GH | 5 | 18625634, | −4.318, −5.147, −5.616 | 5.23, | HRePML, FarmCPU, GEMMA |
| | MT GH | 5 | 21646741 | 0.236, 0.267 | | FarmCPU, GEMMA |