Rounak Dey, Wei Zhou, Tuomo Kiiskinen, Aki Havulinna, Amanda Elliott, Juha Karjalainen, Mitja Kurki, Ashley Qin, Seunggeun Lee, Aarno Palotie, Benjamin Neale, Mark Daly, Xihong Lin.
Abstract
With decades of electronic health records linked to genetic data, large biobanks provide unprecedented opportunities for systematically understanding the genetics of the natural history of complex diseases. Genome-wide survival association analysis can identify genetic variants associated with ages of onset, disease progression and lifespan. We propose an efficient and accurate frailty model approach for genome-wide survival association analysis of censored time-to-event (TTE) phenotypes by accounting for both population structure and relatedness. Our method utilizes state-of-the-art optimization strategies to reduce the computational cost. The saddlepoint approximation is used to allow for analysis of heavily censored phenotypes (>90%) and low frequency variants (down to minor allele count 20). We demonstrate the performance of our method through extensive simulation studies and analysis of five TTE phenotypes, including lifespan, with heavy censoring rates (90.9% to 99.8%) on ~400,000 UK Biobank participants with white British ancestry and ~180,000 individuals in FinnGen. We further analyzed 871 TTE phenotypes in the UK Biobank and presented the genome-wide scale phenome-wide association results with the PheWeb browser.
Year: 2022 PMID: 36114182 PMCID: PMC9481565 DOI: 10.1038/s41467-022-32885-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1. Projected computation cost for GATE, COXMEG-Score, and COXMEG-sparse as a function of sample size.
Panel A shows computation time and panel B shows memory usage. The numerical data are provided in Supplementary Table 1. Benchmarking was performed for the GWAS of lifespan on randomly subsampled data from UK Biobank subjects with White British ancestry. Association tests were performed on 200,000 randomly selected markers with imputation INFO ≥ 0.3 and MAC ≥ 20. The computation times were then projected to testing 46 million variants with INFO ≥ 0.3 and MAC ≥ 20. The reported run times are medians of five runs, each using subjects sampled with a different randomization seed. The x-axis is plotted on a log10 scale.
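The projection described in this caption can be sketched as a simple scaling of the median benchmark time. This is a minimal illustration that assumes the cost grows linearly with the number of markers tested; the caption does not state whether one-time steps such as null model fitting were handled separately. The function name and the example timings are hypothetical and are not taken from the paper.

```python
from statistics import median


def project_runtime_hours(run_seconds, markers_tested, markers_target):
    """Scale the median benchmark time linearly to the target marker count.

    Assumption: per-marker cost is constant, so total time is proportional
    to the number of markers tested.
    """
    per_marker_seconds = median(run_seconds) / markers_tested
    return per_marker_seconds * markers_target / 3600.0


# Hypothetical timings (seconds) for five subsampled runs; not from the paper.
example_runs = [610.0, 598.0, 640.0, 605.0, 622.0]

# 200,000 benchmarked markers projected to 46 million variants, as in the caption.
projected = project_runtime_hours(example_runs, 200_000, 46_000_000)
print(f"Projected run time: {projected:.1f} hours")
```

Using the median rather than the mean keeps a single slow run, for example one delayed by shared-cluster load, from inflating the projection.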
Fig. 2. Manhattan plots for GWAS of four time-to-event phenotypes with different censoring rates in the UK Biobank data with White British ancestry.
GWAS results using GATE-noSPA (A) and GATE (B) are shown for ischemic heart disease (PheCode 411, N = 407,776, censoring rate = 90.9%), female breast cancer (PheCode 174.1, N = 208,160, censoring rate = 92.6%), glaucoma (PheCode 365, N = 398,971, censoring rate = 98.5%), and Alzheimer’s disease (PheCode 290.11, N = 342,881, censoring rate = 99.8%).
Fig. 3. Quantile–quantile (QQ) plots for GWAS of four time-to-event phenotypes with different censoring rates in the UK Biobank data with White British ancestry.
GWAS results using GATE-noSPA (A) and GATE (B) are shown for ischemic heart disease (PheCode 411, N = 407,776, censoring rate = 90.9%), female breast cancer (PheCode 174.1, N = 208,160, censoring rate = 92.6%), glaucoma (PheCode 365, N = 398,971, censoring rate = 98.5%), and Alzheimer’s disease (PheCode 290.11, N = 342,881, censoring rate = 99.8%). QQ plots are color-coded by minor allele frequency (MAF) category. The 95% error bands around the nominal x = y diagonal line are also shown for each MAF category.
Fig. 4. Predicted risk of disease onset over age for the top two loci in the GWAS of four phenotypes in the UK Biobank data with White British ancestry.
Predicted risk of disease onset is plotted over age by genotype for the loci LPA and CELSR2 for ischemic heart disease, FGFR2 and CASC16 for female breast cancer, MYOC and TMCO1 for glaucoma, and the APOE ε4 variant for Alzheimer’s disease. The red, green, and blue lines represent the risk of disease onset for alternate allele counts of zero, one, and two, respectively, for a female subject born in 1950 (the median birth year in the UK Biobank data) with each of the top four PC coordinates set to its mean across the UK Biobank subjects with White British ancestry.
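How such genotype-specific risk curves could be computed is not stated in this caption. The sketch below assumes the standard proportional-hazards form, in which the risk by a given age is 1 − exp(−H0(t)·exp(β·g)), where H0(t) is the cumulative baseline hazard for the reference covariate profile, β is the per-allele log hazard ratio, and g is the alternate allele count. All names and numeric values are hypothetical and chosen only for illustration.

```python
import math


def onset_risk(cum_baseline_hazard, beta, allele_count):
    """Risk of onset by a given age under a proportional-hazards assumption.

    cum_baseline_hazard: H0(t) for the reference covariate profile (assumed known).
    beta: per-allele log hazard ratio (hypothetical value in this example).
    allele_count: number of alternate alleles (0, 1, or 2).
    """
    return 1.0 - math.exp(-cum_baseline_hazard * math.exp(beta * allele_count))


# Hypothetical inputs, not estimates from the paper.
H0_AT_AGE = 0.05
BETA = 0.3

risks = [onset_risk(H0_AT_AGE, BETA, g) for g in (0, 1, 2)]
for g, r in zip((0, 1, 2), risks):
    print(f"alternate allele count {g}: risk {r:.4f}")
```

With a positive β the three curves separate in the order zero, one, two alternate alleles, which is the pattern the red, green, and blue lines in the figure are meant to display.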