Wei Zhou, Jonas B Nielsen, Lars G Fritsche, Rounak Dey, Maiken E Gabrielsen, Brooke N Wolford, Jonathon LeFaive, Peter VandeHaar, Sarah A Gagliano, Aliya Gifford, Lisa A Bastarache, Wei-Qi Wei, Joshua C Denny, Maoxuan Lin, Kristian Hveem, Hyun Min Kang, Goncalo R Abecasis, Cristen J Willer, Seunggeun Lee.
Abstract
In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
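The abstract's central idea, calibrating a score test with a saddlepoint approximation instead of a normal approximation, can be illustrated with a much simpler setting than SAIGE's. The sketch below is not the SAIGE implementation: it assumes no random effects and no covariates, takes the null case probabilities `mu` as given, computes only the upper tail, and uses the standard Barndorff-Nielsen form of the saddlepoint tail formula. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def cgf(t, g, mu):
    """CGF K(t) of S = sum_i g_i * (Y_i - mu_i), with Y_i ~ Bernoulli(mu_i)."""
    return sum(math.log(1 - m + m * math.exp(gi * t)) - t * gi * m
               for gi, m in zip(g, mu))


def cgf_d1(t, g, mu):
    """First derivative K'(t); increasing in t because K''(t) > 0."""
    total = 0.0
    for gi, m in zip(g, mu):
        e = math.exp(gi * t)
        total += gi * m * e / (1 - m + m * e) - gi * m
    return total


def cgf_d2(t, g, mu):
    """Second derivative K''(t); K''(0) is the null variance of S."""
    total = 0.0
    for gi, m in zip(g, mu):
        e = math.exp(gi * t)
        total += gi * gi * m * (1 - m) * e / (1 - m + m * e) ** 2
    return total


def normal_upper_tail(s, g, mu):
    """Normal approximation to P(S >= s) using the null variance."""
    return NormalDist().cdf(-s / math.sqrt(cgf_d2(0.0, g, mu)))


def spa_upper_tail(s, g, mu, lo=-40.0, hi=40.0, iters=200):
    """Saddlepoint approximation to P(S >= s).

    Assumes s != 0 (the formula is singular at the mean) and that s lies
    strictly inside the attainable range of S.
    """
    # Solve K'(t) = s by bisection; K' is monotone, so the root is unique.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid, g, mu) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    w = math.copysign(math.sqrt(2.0 * (t * s - cgf(t, g, mu))), t)
    v = t * math.sqrt(cgf_d2(t, g, mu))
    return NormalDist().cdf(-(w + math.log(v / w) / w))


if __name__ == "__main__":
    # Toy example (hypothetical numbers): 200 carriers of a variant,
    # 1% null case probability, 10 carriers observed as cases.
    g = [1.0] * 200
    mu = [0.01] * 200
    s = 10 - sum(gi * m for gi, m in zip(g, mu))  # centered score
    print("normal:", normal_upper_tail(s, g, mu))
    print("saddlepoint:", spa_upper_tail(s, g, mu))
```

In this toy case the centered score plus its mean is exactly binomial, so the two approximations can be checked against an exact tail. With so few expected cases the score distribution is strongly skewed, and the normal tail comes out smaller than the exact tail by orders of magnitude, which is the kind of miscalibration the abstract attributes to unbalanced case-control ratios.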
Year: 2018 PMID: 30104761 PMCID: PMC6119127 DOI: 10.1038/s41588-018-0184-y
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Comparison of different methods for GWAS with mixed effect models
| Does not | Feasible | Developed | Accounts | Tests | Time complexity, Step 1 | Time complexity, Step 2 | Memory usage, Step 1 | Memory usage, Step 2 | Benchmark time | Benchmark memory |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | O(P·M1·N^1.5) | O(M·N) | M1·N/4 | N | 517 | 10.3G |
| | | | | | O(P·N^3) | O(M·N^2) | F·N^2 | F·N^2 | NA | NA |
| | | | | | O(P·M1·N^1.5) | O(M·N) | M1·N/4 | N | 360 | 10.9G |
| | | | | | O(N^3) | O(M·N^2) | F·N^2 | F·N^2 | NA | NA |
N: number of samples
P: number of iterations required to reach convergence
M1: number of markers used to construct the kinship matrix
M: total number of markers to be tested
F: number of bytes per floating-point number
The number of iterations in the preconditioned conjugate gradient (PCG) method is assumed to be O(N^0.5) (ref. 8).
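To make the table's memory and time expressions concrete, the following sketch plugs in the UK Biobank sample size from the abstract (N = 408,961). It rests on assumptions not stated in the table: memory expressions are read as bytes, `M1 = 100_000` and `F = 4` are hypothetical values chosen for illustration, and constant factors hidden by the O(·) notation are ignored. The function names are also hypothetical.

```python
def packed_genotype_bytes(m1: int, n: int) -> float:
    """Step 1 memory for the rows listed as M1*N/4 (read here as bytes)."""
    return m1 * n / 4


def dense_matrix_bytes(n: int, f: int) -> int:
    """Memory for the rows listed as F*N^2: an N x N matrix of F-byte floats."""
    return f * n * n


def step1_time_ratio(m1: int, n: int) -> float:
    """Ratio of P*N^3 to P*M1*N^1.5 operations, ignoring constants."""
    return n ** 1.5 / m1


if __name__ == "__main__":
    N = 408_961   # sample size reported in the abstract
    M1 = 100_000  # hypothetical number of kinship markers
    F = 4         # hypothetical: single-precision floats

    print(f"M1*N/4 : {packed_genotype_bytes(M1, N) / 1e9:.1f} GB")
    print(f"F*N^2  : {dense_matrix_bytes(N, F) / 1e9:.1f} GB")
    print(f"Step 1 operation ratio: {step1_time_ratio(M1, N):.0f}x")
```

Under these assumptions the M1·N/4 expression comes to roughly 10 GB, while a dense N×N matrix needs roughly 670 GB, which is consistent with the table reporting benchmarks only for the rows with the smaller memory footprint.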
Figure 1. Manhattan plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank.
GWAS results from SAIGE, SAIGE-NoSPA (asymptotically equivalent to GMMAT), and BOLT-LMM are shown for A. coronary artery disease (PheCode 411, case:control = 1:12, N = 408,458), B. colorectal cancer (PheCode 153, case:control = 1:84, N = 387,318), C. glaucoma (PheCode 365, case:control = 1:89, N = 402,223), and D. thyroid cancer (PheCode 193, case:control = 1:1138, N = 407,757). N: sample size. Blue: previously reported loci with association p-value < 5×10^-8. Green: loci with association p-value < 5×10^-8 that have not been reported before. Because the results from SAIGE-NoSPA and BOLT-LMM contain many false-positive signals for colorectal cancer, glaucoma, and thyroid cancer, the significant loci in those results are not highlighted. The upper dashed line marks the break point between the two scales of the y axis, and the lower dashed line marks genome-wide significance (p-value = 5×10^-8).
Figure 2. Quantile-quantile plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank.
GWAS results from SAIGE, SAIGE-NoSPA (asymptotically equivalent to GMMAT), and BOLT-LMM are shown for A. coronary artery disease (PheCode 411, case:control = 1:12, N = 408,458), B. colorectal cancer (PheCode 153, case:control = 1:84, N = 387,318), C. glaucoma (PheCode 365, case:control = 1:89, N = 402,223), and D. thyroid cancer (PheCode 193, case:control = 1:1138, N = 407,757). N: sample size.
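For readers unfamiliar with quantile-quantile plots of GWAS results, the coordinates behind such a plot can be computed as in the sketch below. It is a generic construction, not the code used to draw Figure 2: it assumes all p-values are strictly positive, uses the common (i + 0.5)/n plotting positions for the expected quantiles under the null, and the function name is hypothetical.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot.

    The i-th smallest of n p-values (i = 0, ..., n - 1) is paired with the
    uniform plotting position (i + 0.5) / n expected under the null.
    """
    observed = sorted(pvalues)
    n = len(observed)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]
```

Well-calibrated p-values fall along the diagonal where expected equals observed; points rising above the diagonal across much of the range, rather than only in the extreme tail, suggest inflated test statistics.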