Literature DB >> 35876838

Incorporating family disease history and controlling case-control imbalance for population based genetic association studies.

Yongwen Zhuang^1,2, Brooke N Wolford³, Kisung Nam⁴, Wenjian Bi⁵, Wei Zhou⁶, Cristen J Willer^3,7,8, Bhramar Mukherjee^2,9,10, Seunggeun Lee⁴.

Abstract

BACKGROUND: In the genome-wide association analysis of population-based biobanks, most diseases have low prevalence, which results in low detection power. One approach to tackle the problem is using family disease history, yet existing methods are unable to address type I error inflation induced by increased correlation of phenotypes among closely related samples, as well as unbalanced phenotypic distribution.
RESULTS: We propose a new method for genetic association test with family disease history, TAPE (mixed-model-based Test with Adjusted Phenotype and Empirical saddlepoint approximation), which controls for increased phenotype correlation by adopting a two-variance-component mixed model, accounts for case-control imbalance by using empirical saddlepoint approximation, and is flexible to incorporate any existing adjusted phenotypes such as phenotypes from the LT-FH method. We show through simulation studies and analysis of UK-Biobank data of white British samples and the Korean Genome and Epidemiology Study (KoGES) of Korean samples that the proposed method is robust and yields better calibration compared to existing methods while gaining power for detection of variant-phenotype associations. AVAILABILITY: The summary statistics and code generated in this study are available at https://github.com/styvon/TAPE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Year: 2022 PMID： 35876838 PMCID： PMC9477535 DOI： 10.1093/bioinformatics/btac459

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

The Genome-wide and phenome-wide studies are facilitated by the recent development of large-scale biobanks, such as the UK Biobank (UKB) (Bycroft ), BioBank Japan (Nagai ) and the Korean Genome and Epidemiology Study (KoGES) (Kim ). Individuals in the biobanks are samples from a target population and large numbers of phenotypes are collected for each individual, which allows phenome-wide scans. However, challenges remain to gain enough power to identify associated variants, especially for binary traits with a low prevalence. One promising approach to improve detection power is using family disease history to infer risk of diseases of unaffected individuals. For family-based cohorts with partially-missing genotypes, association test power can be improved by using pedigree information (Gudbjartsson ; Kong ; Thornton and McPeek, 2007; Zhong ). The GWAX method first demonstrated that with completely missing family genotypes, unaffected individuals with family disease history can be used as proxy cases to find genetic associations (Liu ). The LT-FH method (Hujoel ) further increases association power by estimating a liability of disease conditional on the observed phenotypes and family disease history, which differentiate the disease risks among the proxy cases. Despite the progress, several important limitations remain. First, when samples are related, the increased correlation among the inferred risks (Supplementary Fig. S1) can lead to type I error inflation. Hujoel showed that since samples with close relatedness, such as sibling pairs, tend to have highly correlated GWAX or LT-FH phenotypes due to nearly identical family disease history, GWAX and LT-FH suffered poor calibration compared to GWAS. Thus, the usage of the existing methods should be restricted to testing unrelated individuals only, which can reduce power. Second, with unbalanced case–control ratios, the distributions of inferred risks are still unbalanced, hence testing for association using linear mixed model (LMM) can yield inflated type I error rates. For example, diseases, such as Parkinson’s disease, have low prevalence in UKB, which leads to a small number of cases and proxy cases (i.e. controls with non-zero inferred disease risk) in GWAX and a relatively low posterior liability conditioning on family history in LT-FH (Supplementary Figs S2 and S3). Since the Gaussian approximation does not perform well in this setting, LMMs can yield inflated type I error rates. Currently no method exists to handle situations of this kind. We propose a new method for genetic association test with family disease history, mixed-model-based Test with Adjusted Phenotype and Empirical saddlepoint approximation (TAPE), which controls for increased phenotype correlation and case–control imbalance. In standard mixed-model methods, only a dense genetic relatedness matrix (GRM) is used as the variance component. TAPE uses a sparse kinship matrix as an additional variance component to further account for the increased correlation among phenotypes in closely related individuals. In addition, to adjust for case–control imbalance, TAPE uses empirical saddlepoint approximation under a LMM (Bi ; Davison and Hinkley, 1988; Feuerverger, 1989). We show through simulation studies and analysis of UKB that the proposed method is robust and yields better calibration compared to existing methods while gaining power for detection of variant–phenotype associations.

2 Materials and methods

2.1 Overview of methods

The TAPE method takes a three-step framework (Fig. 1): (i) infer the disease risk for all individuals in the analysis based on the original case–control status and family disease history to be used as phenotype; (ii) fit a two-variance-components null LMM to obtain parameter estimates; and (iii) test for genetic association using score test with empirical saddlepoint approximation.

Fig. 1.

Analytical framework of TAPE. In Step 1, latent disease risk of individuals is estimated from observed phenotypes and family disease history using a weighted proportion of the affected close relatives to the individual. In Step 2, a null LMM is fit with covariates and two random effects with the sparse kinship matrix and the dense GRM as covariance structures. In Step 3, P-values score test is performed for each genetic variant using empirical saddlepoint approximation In Step 1, the phenotypes are adjusted using inferred risk of individuals. TAPE-WP uses a weighted proportion of the affected close relatives to the control, which can be viewed as an extension of the GWAX method8 to further differentiate disease risk of controls based on family disease history configurations. TAPE-LTFH uses the liability of diseases generated from the existing LT-FH method as the adjusted phenotypes. In Step 2, we fit the null LMM with two random effects, the first uses the sparse kinship matrix (Jiang ), and the second uses the dense GRM. These two-variance-components can capture both increased correlation in phenotypes due to phenotype adjustment procedure and distance genetic relatedness. To make the method scalable, average information restricted maximum likelihood (AI-REML) (Gilmour ), with preconditioned conjugate gradient (PCG) method (Hestenes and Stiefel, 1952) similar to that used in BOLT-LMM (Loh ) and SAIGE, is used. In Step 3, a score test statistic is calculated for each genetic variant against the adjusted phenotype. Since the Gaussian approximation does not perform well at the tails of the test statistic distribution, we approximate the distribution by empirical saddlepoint approximation (Feuerverger, 1989), which uses an empirically estimated cumulant-generating function (CGF) to calculate P-value. The empirical saddlepoint approximation is utilized when the test statistic exceeds 2 SD of the mean. Time complexity for this step is .

2.2 Phenotype adjustment

We first introduce the proposed phenotype adjustment procedure (Step 1) in TAPE. For TAPE-WP, we assume a sample of individuals where each individual has relatives with phenotypic information (, denotes the kinship coefficient between individual and relative (, denotes the phenotype of relative of individual , is an -vector of observed binary phenotypes. The adjusted quantitative phenotype for individual , , is expressed as: where denotes indicator function, is a pre-specified constant indicating the increase in latent disease risk and = . If and all relatives of the ith individual are cases, the latent disease risk is =. For the analysis in this article, we assume that latent risk of such individual is 0.5 (i.e. ). In addition, the phenotype adjustment procedure can be adapted to include information other than family disease status that is potentially indicative of latent disease risk. See Supplementary Notes S1 and S2 for details.

2.3 LMM for adjusted phenotype

We denote as a (p + 1)-vector of covariates with the intercept, and as the allele counts for the variant to be tested. We consider the following linear model: where is a -vector of fixed effect coefficients, is a genetic effect coefficient and the random effect term for the ith individual with We assume the random effect to follow a multivariate Gaussian distribution , where is the variance component parameter for a noise term. Parameters for other variance components are denoted as , and are pre-specified correlation matrices. To better capture phenotype correlation, we use a variance component of sparse kinship in addition to GRM, i.e. and , where is a sparse matrix of the estimated kinship coefficients after thresholding, and is the GRM. The inclusion of the sparse kinship matrix as an additional variance component can be justified by the observation that the phenotype adjustment using family disease information increases the concordance among related individuals. For example, the adjusted phenotype for a control sibling pair would be identical as they share the same parental disease status (Supplementary Fig. S1). Such phenotypic concordance is not sufficiently captured by GRM alone and can lead to mis-calibration as pointed out by Hujoel . It is also shown that incorporating pedigree structure as a variance component in LMMs improves association outcomes (Tucker ; Zaitlen ).

2.4 Parameter estimation for the null model

In Step 2, we fit a LMM under the following null hypothesis Treating as a quantitative trait, the marginal log likelihood of in REML is where is a constant, ,. Parameters are estimated iteratively with a working model for iteration . Let be the working variance matrix. The score function with respect to are: For each iteration, variance components are updated using AI-REML algorithm (Gilmour ), in which the Hessian is approximated by an average information matrix, , with its entries expressed as: where . Then the variance component parameters are updated by . Both the score and AI matrix involve , which is computationally heavy when is large. To reduce the computational burden, the PCG method (Hestenes and Stiefel, 1952) with Jacobi preconditioner is adopted, which avoids directly calculating matrix inverse by finding solutions of linear systems and involves only matrix multiplication. Since is a linear combination of and , matrix multiplication with regard to each part can be calculated separately. For , the computation cost is further lowered by using the sparsity. For , we reduce the memory usage by calculating its elements in runtime instead of using a pre-computed GRM matrix. The overall time complexity for null model estimation is , where is the number of iterations until the algorithm reaches convergence, is the number of non-zero elements of sparse kinship matrix, is the number of variants included in the GRM construction. Here, we assume that PCG algorithm has complexity (Loh ). To avoid double-fitting the candidate variant in the model and GRM, leave-one-chromosome-out scheme was implemented.

2.5 Single variant association test with empirical SPA

In Step 3, we use the score test to calculate the P-value for association of each genetic variant. The score statistic for testing for variant is , where is an N-vector of covariate-adjusted genotypes. Under the null hypothesis, the variance of the statistic is . For computational efficiency, can be approximated using combined with a calibration factor estimated using a subset of single-nucleotide polymorphism (SNP) data (Loh ; Svishcheva ; Zhou ). The variance-adjusted statistic after calibration is . For the proposed method, 30 independent SNPs were randomly chosen to obtain the estimated calibration factor . The number of SNPs were chosen such that the estimated value is stable within a given variation threshold. The TAPE program is capable of adaptively increase the number of SNPs to reach a stable estimate. When Z is unbalanced and a variant has low minor allele count, using a Gaussian distribution to calculate a P-value of can result in type I error inflation. Saddlepoint approximation is shown to improve over Gaussian approximation in such conditions by utilizing the entire CGF (Daniels, 1954; Jensen, 1995). Fixing , can be viewed as a weighted sum of residuals , yet the adjusted phenotype has an intractable distribution, which makes it impossible to derive the CGF. Alternatively, we use the empirical version of saddlepoint approximation (Davison and Hinkley, 1988; Feuerverger, 1989) as a non-parametric estimator for the distribution of the test statistic (Bi ). The empirical estimator for CGF of is where is the residual of the ith individual from Step 2. The empirical approximation of the first and second derivative is and respectively. Suppose is a value satisfying the equation , the P-value can be calculated by the following formula (Kuonen, 1999) where , is the standard normal cumulative distribution function.

2.6 Simulation studies of type I error control and power

We considered two types of relatedness structures. The first one consists of 5000 independent individuals and 2500 sibling pairs (%). The second one is a mixture of independent individuals and families with eight members in each family. The pedigree for the eight-member family was shown in Supplementary Figure S5. Binary phenotypes for sample individuals and parents were simulated from with from a logistic mixed model where for individual , is a covariate randomly sampled from , is the genotypes of the M variants, is the intercept determined by prevalence , is a vector of log odds ratio of genetic effects and is a random effect with underlying distribution depending on the true underlying kinship coefficient matrix . Given the kinship coefficient between individual i and individual j, the value for an element in K is . Simulation results for TAPE-WP and TAPE-LTFH were compared with two other methods: (i) GWAS with original binary phenotypes by SAIGE (Zhou ); and (ii) original LT-FH method that uses BOLT-LMM with LT-FH phenotypes (Hujoel ) for all individuals (hereafter denoted as LT-FH), which is shown to increase association power over GWAX (Liu ). Type I error rates were evaluated with independent null SNPs, and sample size 10 000 at case–control ratio of 1:99, 5:95 and 10:90. Phenotypes were generated given , corresponding to liability-scale heritability 0.23 (Zhou ). To investigate type I error rates by minor allele frequency (MAF), SNPs were generated with MAF 0.001, 0.01 and 0.1, respectively, for each simulated dataset. Power of the tests was assessed using simulated datasets with 10 000 individuals and 100 000 variants with MAF 0.1 for each setting with 1% variants selected as causal variants. We calculated both the average statistics for causal SNPs and the empirical power at empirical level from SAIGE, LT-FH, TAPE-LTFH and TAPE-WP. Genetic effect sizes ranged from 0.4 to 2.3 and three case–control ratio settings were considered, i.e. 1:99, 5:95 and 10:90. We generated 100 replications for each simulation scenario.

2.7 Computation time evaluation

Computation time was evaluated with variants and sample size ranging from 10 000 to 408 898 sampled from white British individuals in UKB data for type II diabetes (case:control =1:20). Projected time for the analysis of 21 million variants with MAF % was calculated based on the evaluation results. All evaluations were computed on an Intel(R) Xeon(R) Gold 6152 CPU.

2.8 UK biobank data

Over 21 million genetic variants imputed from the Haplotype Reference Consortium (McCarthy ) and with MAF % were used for the association analysis among a sample population of 408 898 white British individuals. NCBI Build 37/UCSC hg19 was adopted for genomic coordinates. A total of 10 binary traits with available parental disease status were analyzed, where the binary traits for genotyped individuals were defined by the PheWAS codes (Zhou ). Parental phenotypes were extracted from data fields for self-reported paternal and maternal illness. We included sex, age and first 10 principal components as covariates to adjust for. GRM was constructed using 93 511 genotyped variants suggested by UKB (Bycroft ; Zhou ). Kinship coefficients were estimated using the KING software (Manichaikul ), and the sparse kinship matrix was constructed using those with estimated kinship no larger than third-degree relatedness. Calibration of the testing method was evaluated by the attenuation ratio obtained from stratified LD score regression (LDSC). The attenuation ratio is defined as (LDSC intercept−1)/(average −1), with smaller values indicating better control of false positives.

2.9 KoGES data

For the association analysis among a sample population of 72 298 Korean individuals, over 8 million genetic variants were imputed from 1000 Genome project phase 3 + Korean reference genome (397 samples) and with MAF>1% (Kim ). Two binary traits (diabetes and gastric cancer) with different case–control ratios were analyzed. Phenotypes for both genotyped individuals and their relatives are self-reported survey data. We adjusted for sex, age, first 10 principal components and 34 indicator variables of batch information (cohort collection year). GRM was constructed using 327 540 genotyped variants. The sparse GRM was constructed using SAIGE with pairwise relatedness coefficients larger than 0.1.

3 Results

3.1 Simulation study results

Type I error rates were evaluated at genome-wide with sample size of 10 000 and case–control ratio ranging from 1:99 to 10:90. For each case–control ratio setting, two sets of genotype data with independent variants were generated with MAF of 0.1, 0.01 and 0.001, respectively. We first simulated a population consisting of 2500 pairs of siblings and 5000 independent individuals (Table 1). The empirical type I error rates of LT-FH were significantly inflated under more unbalanced case–control ratio and lower MAF, while results from TAPE-WP and SAIGE were well calibrated. TAPE-LTFH also yielded better controlled type I error rates than that of LT-FH, especially when the case–control ratio is more unbalanced (1:99). Further, we evaluated type I error rates with a more complex relatedness structure, i.e. a population consisting of 625 eight-member families and 5000 independent individuals (Table 1). Inflated type I error rates were observed in results from LT-FH but with lower magnitude compared to the previous setting. TAPE-LTFH had slightly inflated type I error rates. One explanation is that LT-FH phenotypes are less concordant in the latter setting since there is a smaller number of individual sharing identical family history under a more complicated pedigree. On the other hand, type I error rates from TAPE and SAIGE were relatively well controlled with a slight deflation.

Table 1.

Empirical type I error rates for TAPE-WP, TAPE-LTFH, LT-FH and SAIGE, estimated using independent SNPs and a sample size of 10 000 ()

Case:control	MAF	TAPE-WP	TAPE-LTFH	LTFH	SAIGE
2500 pairs of siblings and 5000 independent individuals
1:99	0.001	4.977e−08	1.019e−07	5.928e−06	4.418e−08
5:95	0.001	5.115e−08	8.275e−08	1.252e−06	4.368e−08
10:90	0.001	5.476e−08	7.452e−08	5.489e−07	4.641e−08
1:99	0.01	5.455e−08	1.069e−07	1.409e−07	3.963e−08
5:95	0.01	5.143e−08	1.158e−07	1.940e−07	4.341e−08
10:90	0.01	5.459e−08	9.086e−08	1.141e−07	4.980e−08
1:99	0.10	5.007e−08	1.275e−07	1.500e−07	3.964e−08
5:95	0.10	5.213e−08	1.639e−07	1.238e−07	4.355e−08
10:90	0.10	6.416e−08	7.782e−08	7.232e−08	4.650e−08
625 8-member families and 5000 independent individuals
1:99	0.001	3.329e−08	9.028e−08	4.446e−06	3.832e−08
5:95	0.001	3.051e−08	6.563e−08	8.171e−07	4.245e−08
10:90	0.001	2.967e−08	5.145e−08	3.751e−07	4.721e−08
1:99	0.01	3.742e−08	9.792e−08	4.818e−07	4.547e−08
5:95	0.01	3.156e−08	7.906e−08	1.463e−07	4.311e−08
10:90	0.01	2.978e−08	6.215e−08	8.811e−08	4.324e−08
1:99	0.10	3.113e−08	7.730e−08	1.000e−07	3.895e−08
5:95	0.10	3.050e−08	7.983e−08	6.025e−08	4.232e−08
10:90	0.10	3.163e−08	6.372e−08	5.857e−08	4.546e−08

Note: Two types of population structure were considered: (i) sample consists of 2500 pairs of siblings and 5000 independent individuals; and (ii) sample consists of 625 8-member families and 5000 independent individuals.

Empirical type I error rates for TAPE-WP, TAPE-LTFH, LT-FH and SAIGE, estimated using independent SNPs and a sample size of 10 000 () Note: Two types of population structure were considered: (i) sample consists of 2500 pairs of siblings and 5000 independent individuals; and (ii) sample consists of 625 8-member families and 5000 independent individuals. One of the important features of TAPE is the use of a kinship matrix in addition to (dense) GRM to account for increased correlation among phenotypes. Two additional analyses were performed to investigate the influence of no kinship variance component (TAPE-nok) and mis-specified kinship matrix (TAPE-misk) on calibration of TAPE under the eight-member family pedigree scenario. For TAPE-nok, the sparse kinship matrix was not included as an LMM variance component and inflated empirical type I error was observed (Supplementary Fig. S4). For TAPE-misk, the true kinship matrix of an eight-member family pedigree was replaced with a slightly mis-specified one (Supplementary Fig. S5) in Steps 1 and 2. The empirical type I error of TAPE-misk was similar to that of TAPE. The results indicated that the impact of a slightly mis-specified kinship matrix was negligible, while the inclusion of the kinship matrix as a variance component is crucial in controlling type I error rate when family information is incorporated into the analysis. To assess empirical power, we compared the average statistics of causal SNPs (Fig. 2) and the proportion of causal SNPs significant at empirical level (Supplementary Fig. S6) for simulated datasets with sample size 10 000 under different genetic effects and case–control ratio. For each dataset, 100 000 independent variants with MAF 0.1 were simulated in which 1% were causal, and we generated 100 datasets for each setting. TAPE-WP and TAPE-LTFH achieve greater detection power over SAIGE, with a 21.0% and 26.5% average increase in average statistics, and a 18.3% and 22.4% average increase in proportion of causal SNPs detected, respectively. LT-FH also had increased over SAIGE by 27.5% and had a 22.1% average increase in detection rate, but it suffered from type I error inflation especially when analyzing related samples.

Fig. 2.

Average values of causal variants with N = 10 000 (5000 independent individuals and 2500 pairs of siblings), comparing TAPE-WP, TAPE-LTFH, LT-FH and SAIGE. For each dataset, 100 000 independent variants were simulated and 1% variants were selected as causal variants with four different effect sizes. A total of 100 datasets were generated to calculate average values. MAFs of variants were 0.1 To investigate how more complex relatedness structures will influence simulation results, we further simulated a population in which related individuals form families with eight members (Supplementary Fig. S5). TAPE-WP achieves greater detection power over SAIGE GWAS results, with a 24.1% average increase in average statistics and a 30.6% average increase in proportion of causal SNPs detected. TAPE-LTFH has higher overall power than both LT-FH and SAIGE under such relatedness structure (Supplementary Fig. S6), with a 27.5% average increase in average statistics and a 40.8% average increase in proportion of significant SNPs detected over SAIGE, whereas for LT-FH the increase is 25.6% for average statistics and 36.6% for proportion of significant SNPs detected. In general, TAPE-WP and TAPE-LTFH yielded well-controlled type I error rate even when case–control ratio is unbalanced, which makes the incorporation of family disease information in genetic association test feasible in the presence of sample relatedness and gains detection power. In addition, TAPE-LTFH achieved higher detection power than LT-FH, especially under the simulation scenario with more complex relatedness structure.

3.2 Computation time

Computation time was evaluated using randomly selected samples from 408 898 white British individuals in UKB data for type II diabetes (case:control =1:20) with variants. Projected computation time for 21 million variants with MAF % was estimated and plotted on log10 scale against sample size varying from 10 000 to 408 898 (Supplementary Fig. S7). Computation time for TAPE-LTFH is similar to that for TAPE-WP and is therefore omitted in the plot. A break-down of run time for null model estimation and P-value calculation is presented in Supplementary Table S3. Since TAPE fits the model with two-variance-components and uses ESPA in P-value calculation, which requires additional computation, TAPE was slower than SAIGE and LT-FH. Overall, TAPE is scalable to analyze biobank size data. For genome-wide analysis of testing 21 million variants, TAPE required 16 CPU hours with 40 000 samples and 284 CPU hours with 408 898 samples.

3.3 Analysis of binary traits in biobank data

We analyzed 10 binary disease outcomes with available parental disease status in the UKB (Table 2).

Table 2.

Summary of 10 traits in UKB

Trait	Phecode	Case:control	Parental prevalence
Parkinson’s disease	332	1:360	0.0186
Dementias	290.1	1:406	0.0609
Lung cancer	165.1	1:181	0.0604
Depression	296.2	1:33	0.0462
Type II diabetes	250.2	1:20	0.0845
Hypertension	401	1:4	0.2388
Chronic bronchitis	496.2	1:136	0.0785
Colorectal cancer	153	1:87	0.0499
Ischemic heart disease	411	1:11	0.2373
Cerebral ischemia	433.3	1:138	0.1348

Summary of 10 traits in UKB Figures 3 and 4 present Manhattan plots and Q–Q plots stratified by MAF categories for two phenotypes with different case–control ratio: type II diabetes (case–control ratio 1:20), and Parkinson’s disease (case–control ratio 1:350). Plots for all 10 diseases in the analysis are shown in Supplementary Figures S8 and S9. For diseases, such as Parkinson’s disease (Fig. 4), the observed quantile distribution of −log10(p) corresponding to SNPs with MAF<0.01 for LT-FH method in the Q–Q plot curved off in the middle of the graph, indicating potential type I error inflation due to unaccounted-for relatedness structure. Similar problematic patterns can also be found in LT-FH Q–Q plots for lung cancer, depression, chronic bronchitis, colorectal cancer and cerebral ischemia (Supplementary Fig. S8), but not in plots for TAPE-WP and TAPE-LTFH.

Fig. 3.

Fig. 4.

Q–Q plot for the UKB association test results from SAIGE, LT-FH, TAPE-LTFH and TAPE-WP among white British (N = 408 898), categorized by MAF. Up: type II diabetes (Phecode 250.2); bottom: Parkinson’s disease (Phecode 332)

Manhattan plot for the UKB association test results from SAIGE (first row), TAPE-LTFH (second row) and TAPE-WP (third row) among white British (N = 408 898). Left: type II diabetes (Phecode 250.2); right: Parkinson’s disease (Phecode 332). Significant clumped variants are identified using a window width of 5 Mb and a linkage disequilibrium threshold of 0.1 Q–Q plot for the UKB association test results from SAIGE, LT-FH, TAPE-LTFH and TAPE-WP among white British (N = 408 898), categorized by MAF. Up: type II diabetes (Phecode 250.2); bottom: Parkinson’s disease (Phecode 332) To assess the calibration of testing methods, we performed stratified LDSC with the baselineLD model to obtain the attenuation ratios (Finucane ) (Supplementary Table S2). For traits with more unbalanced case–control, TAPE-WP consistently yields relatively lower attenuation ratios than TAPE-LTFH, while LT-FH generates the highest attenuation ratio, indicating poor calibration. For example, the average attenuation ratio for type II diabetes (case:control =1:20) is 0.110, 0.120 and 0.142 for TAPE-WP, TAPE-LTFH and LT-FH, respectively; for Parkinson’s disease (case:control =1:360), the average attenuation ratio is 0.125, 0.222 and 0.462 for TAPE-WP, TAPE-LTFH and LT-FH, respectively. Since we used all the individuals regardless of relatedness, the observation supports the previously reported result that LT-FH suffers poor calibration in related samples due to concordance between phenotypes from closely related samples, such as sibling pairs (Hujoel ). On the contrary, TAPE-WP is able to generate better calibrated results under such situations, followed by TAPE-LTFH. Due to the above-mentioned potential type I error inflation of LT-FH method for samples with relatedness, we only compare the proposed method with SAIGE, which is also capable of handling related samples. Supplementary Table S1 lists the number of significant variants and significant clumped variants at detected by TAPE-WP, TAPE-LTFH and SAIGE. Significant clumped variants were further identified by clumping genome-wide significant variants with 5 Mb window size and linkage disequilibrium threshold using PLINK software (Purcell ). TAPE-WP identified 84 more genome-wide significant clumped variants than SAIGE for type II diabetes, and 5 more for Parkinson’s disease. For TAPE-LTFH, a total of 111 more genome-wide significant clumped variants were identified for type II diabetes, and 7 more for Parkinson’s disease as compared to SAIGE. For all 10 diseases analyzed, a total of 663 genome-wide significant clumped variants were identified by TAPE-WP, including 33 clumped variants with MAF<1%; whilst a total of 344 clumped variants were identified by SAIGE, of which 25 were with MAF<1%. For TAPE-LTFH, a total of 726 genome-wide significant clumped variants were identified, including 63 clumped variants with MAF<1%. For additional analysis, we applied TAPE-WP, TAPE-LTFH and SAIGE to two binary phenotypes for 72 298 individuals with family disease history in the KoGES data and analyzed 8 million variants. Disease prevalence among sample individuals and their relatives is shown in Supplementary Table S4. For diabetes (case:control =1:12), TAPE-WP identified 14 more genome-wide significant clumped variants than SAIGE, while TAPE-LTFH identified 15 more than SAIGE. For gastric cancer (case:control =1:191), both TAPE-WP and TAPE-LTFH identified three genome-wide significant clumped variants (rs760077, rs35972942 and rs2978977) while no variants were genome-wide significant by SAIGE. The three clumped variants have been previously reported to be associated with gastric cancer among Chinese or Japanese population (Du ; Tanikawa ; Yan ), but not among Korean samples. Manhattan plots and Q–Q plots are presented in Supplementary Figures S10 and S11.

4 Discussion

We propose a robust method that incorporates family disease information for genetic association test while accounting for case–control unbalance and close relatedness in the population. Previous studies have shown that additional information from family disease history can help improve test power, yet challenges remain (i) to control for type I error inflation induced by increased correlation of phenotypes; and (ii) to account for unbalanced distribution of phenotypes after being adjusted by family disease information. Our TAPE framework uses both a dense GRM and a sparse kinship matrix in the LMM to account for sample relatedness and family history-induced correlations. Empirical saddlepoint approximation is adopted to control for type I error inflation under unbalanced phenotypic distribution. Optimization strategies, such as PCG, for computing components with matrix inversion, and runtime GRM calculation from raw genotypes were implemented to improve computation efficiency and reduce memory usage. For the null model, both sparse kinship matrix and GRM are included in the TAPE framework as variance components to account for the potential phenotypic concordance. The use of two or more variance components in mixed model has been shown to better control for test statistics inflation and improves association power as well as prediction accuracy in standard GWAS and family studies (Speed and Balding, 2014; Widmer ), yet we are not aware of existing methods that apply more than one variance components to mixed model while incorporating family disease history. From simulation studies, we show that the absence of kinship matrix in variance components leads to inflated type I error rates of association test results. This result echoes previous findings from LT-FH (Hujoel ), and indicates a possible solution to control for phenotypic correlation introduced by incorporating family disease information. When estimating variance parameters, the TAPE framework improves computation efficiency by applying PCG algorithm on top of the sparse estimated kinship matrix and the dense GRM, where sparsity of the estimated kinship matrix is ascertained by proper thresholding. The analytical framework of TAPE allows for flexible choice of outcome variables. For example, TAPE-LTFH uses LT-FH phenotypes in the proposed two-variance-component mixed model. We show by simulation studies that TAPE-LTFH can better control for type I error inflation than LT-FH and achieves higher power. It remains a future work to better capture latent risk while accounting for phenotypic concordance to further improve association power using external information, such as family disease history. When there is relatively high sample relatedness in the target population, TAPE-WP is recommended since it consistently controls type I error better than TAPE-LTFH and LT-FH while being capable of incorporating family disease history from the whole population. For studies with exploratory or discovery purposes, we would recommend TAPE-LTFH, as it increases power for detecting possible associations, and can keep type I error inflation on a controllable level. For both TAPE-LTFH and LT-FH in the study, we used the posterior mean genetic liability from the LT-FH method proposed by Hujoel et al. as the outcome variable for all individuals (including related ones), which is computed conditioning on test samples’ binary phenotypes and family disease history. Since the original LT-FH study suggested against the use of LT-FH phenotypes for related individuals, we also evaluated the performance of two corresponding methods, TAPE-LTFHc and LT-FHc, which adjust phenotypes based on the presence of genetically related individuals in the data (details in Supplementary Note S3). Simulation studies (Supplementary Note S3) and UKB data analysis (Supplementary Note S4) show that this approach can help lowering type I error rates for both TAPE-LTFHc and LT-FHc, but at the cost of a decrease in detection power. We also note several limitations of our proposed method. First, the potential difference in the phenotype classification for genotyped individuals and their relatives is not accounted for in the TAPE framework. For example, phenotypes of genotyped individuals in UKB dataset were defined using the PheWAS codes aggregated from ICD9 and ICD10 codes, whereas parental phenotypes were extracted from self-reported surveys. The different phenotype classification standard may induce bias in the adjusted phenotype after incorporating family disease history. The second limitation lies in the modeling assumption of infinitesimal genetic effects, i.e. the effect size of each variant follows a standard Normal distribution, which may yield less detection power when the assumption does not match the true underlying genetic architecture. Despite the above-mentioned limitations, the TAPE framework is the only existing approach that incorporates family disease history while handling related samples and phenotype unbalance. With the increasing accessibility to large-scale biobank data with population relatedness and family disease history information, our proposed method is expected to contribute to improving detection power for genetic association studies, especially for late-onset diseases that are underrepresented in the sample cohorts. Click here for additional data file.

25 in total

1. Case-control association testing with related individuals: a more powerful quasi-likelihood score test.

Authors: Timothy Thornton; Mary Sara McPeek
Journal: Am J Hum Genet Date: 2007-07-10 Impact factor: 11.025

2. Rapid variance components-based method for whole-genome association analysis.

Authors: Gulnara R Svishcheva; Tatiana I Axenovich; Nadezhda M Belonogova; Cornelia M van Duijn; Yurii S Aulchenko
Journal: Nat Genet Date: 2012-09-16 Impact factor: 38.330

3. Two-Variance-Component Model Improves Genetic Prediction in Family Datasets.

Authors: George Tucker; Po-Ru Loh; Iona M MacLeod; Ben J Hayes; Michael E Goddard; Bonnie Berger; Alkes L Price
Journal: Am J Hum Genet Date: 2015-11-05 Impact factor: 11.025

4. Case-control association mapping by proxy using family history of disease.

Authors: Jimmy Z Liu; Yaniv Erlich; Joseph K Pickrell
Journal: Nat Genet Date: 2017-01-16 Impact factor: 38.330

5. MultiBLUP: improved SNP-based prediction for complex traits.

Authors: Doug Speed; David J Balding
Journal: Genome Res Date: 2014-06-24 Impact factor: 9.043

6. Efficient Bayesian mixed-model analysis increases association power in large cohorts.

Authors: Po-Ru Loh; George Tucker; Brendan K Bulik-Sullivan; Bjarni J Vilhjálmsson; Hilary K Finucane; Rany M Salem; Daniel I Chasman; Paul M Ridker; Benjamin M Neale; Bonnie Berger; Nick Patterson; Alkes L Price
Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330

7. Further improvements to linear mixed models for genome-wide association studies.

Authors: Christian Widmer; Christoph Lippert; Omer Weissbrod; Nicolo Fusi; Carl Kadie; Robert Davidson; Jennifer Listgarten; David Heckerman
Journal: Sci Rep Date: 2014-11-12 Impact factor: 4.379

8. Parental origin of sequence variants associated with complex diseases.

Authors: Augustine Kong; Valgerdur Steinthorsdottir; Gisli Masson; Gudmar Thorleifsson; Patrick Sulem; Soren Besenbacher; Aslaug Jonasdottir; Asgeir Sigurdsson; Kari Th Kristinsson; Adalbjorg Jonasdottir; Michael L Frigge; Arnaldur Gylfason; Pall I Olason; Sigurjon A Gudjonsson; Sverrir Sverrisson; Simon N Stacey; Bardur Sigurgeirsson; Kristrun R Benediktsdottir; Helgi Sigurdsson; Thorvaldur Jonsson; Rafn Benediktsson; Jon H Olafsson; Oskar Th Johannsson; Astradur B Hreidarsson; Gunnar Sigurdsson; Anne C Ferguson-Smith; Daniel F Gudbjartsson; Unnur Thorsteinsdottir; Kari Stefansson
Journal: Nature Date: 2009-12-17 Impact factor: 49.962

9. A reference panel of 64,976 haplotypes for genotype imputation.

Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin
Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330

10. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962