Literature DB >> 25519384

Adjustment of familial relatedness in association test for rare variants.

Cong Li1, Can Yang2, Mengjie Chen1, Xiaowei Chen1, Lin Hou2, Hongyu Zhao2.   

Abstract

High-throughput sequencing technology allows researchers to test associations between phenotypes and all the variants identified throughout the genome, and is especially useful for analyzing rare variants. However, the statistical power to identify phenotype-associated rare variants is very low with typical genome-wide association studies because of their low allele frequencies among unrelated individuals. In contrast, a family-based design may have more power because rare variants are more likely to be enriched in families than among unrelated individuals. Regardless, an analysis of family-based association studies needs to account appropriately for relatedness between family members. We analyzed the observed quantitative trait systolic blood pressure as well as the simulated Q1 data in the Genetic Analysis Workshop 18 data set using 4 tests: (a) a single-variant test, (b) a collapsing test, (c) a single-variant test where familial relatedness was accounted for, and (d) a collapsing test where familial relatedness was accounted for. We then compared the results of the 4 methods and observed that adjusting for familial relatedness could appropriately control the false-positive rate while maintaining reasonable power to detect several strongly associated variants/genes.

Entities:  

Year:  2014        PMID: 25519384      PMCID: PMC4143885          DOI: 10.1186/1753-6561-8-S1-S39

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

Current platforms for genome-wide association studies are limited to scanning common variants. Although rare variants may contribute a significant proportion of heritability, the statistical power to detect rare variants is low because of their low allele frequencies. Recent advances in high-throughput sequencing technologies have provided us great opportunities to delve deeper into the genetic components of complex traits by identifying millions of rare variants in the human genome [1] and allowing them to be tested for associations with complex traits. In an effort to increase the power to detect rare-variant associations, many methods have been proposed to aggregate the effects of multiple rare variants within a specific functional unit, for example, a gene [2]. Among those methods, the kernel score test has enjoyed great popularity thanks to its flexibility and computational efficiency [3]. Family-based designs may offer more power, however, because related individuals are more likely than unrelated individuals to be enriched for rare variants [4]. One challenge of analyzing family data is that familial relatedness needs to be carefully adjusted for in the association analysis. Linear mixed models have been used to account for population stratification in the context of genome-wide association studies (GWAS) [5], and can also be naturally adapted to a family-based design. In this paper, we adopted a linear mixed model to adjust for familial relatedness while aggregating the effects of rare variants within a gene using a kernel score test.

Methods

Data description and preprocessing

The Genetic Analysis Workshop 18 (GAW18) data consists of genotyping and sequencing data for 959 individuals from 20 extended families. Among them, 849 individuals had at least 1 blood pressure measurement. For the real data, we used the first nonmissing measurement of systolic blood pressure (SBP) as the target quantitative trait. For the 200 replicates of the simulated phenotype, we also used the first measurement of SBP as the target quantitative trait. Finally, to evaluate the false-positive rate, we used the simulated Q1 data as the null phenotype. The common variants in the genotyping data were used to estimate the genetic similarity matrix. As for the sequencing data, we selected only the nonsynonymous mutations for the association test in recognition of the fact that variants causing amino acid alterations tend to have large effects on the phenotype.

The model

We used the following linear mixed model to adjust for familial relatedness: where Y represents the phenotype (quantitative trait), C represents the collection of the covariates (age and gender), X includes the genotypes of the variants to be tested, and Z is the design matrix for the whole-genome polygenic random effects α. Finally, γ and β are fixed effects and e is the random residual. When the random effects are integrated out, the model is marginalized as where and are the variances for the random effects α and e, respectively. As for K, one could use twice the theoretical kinship matrix to represent the familial relationships. However, to account for potential cryptic relatedness between individuals across different families, we used a genetic similarity matrix that was estimated from the genotyping data as follows: where is the number of minor alleles for ith individual at the mth marker, is the allele frequency for the mth marker, and p is the total number of markers in the genotyping data. We investigated 4 methods of association analysis: (a) a single-variant association test without adjusting for familial relatedness (UNI), that is, the component is excluded in the model; (b) a single-variant association test adjusting for familial relatedness (UNI-ADJ); (c) a kernel score test without adjusting for familial relatedness (SKAT); and (d) a kernel score test adjusting for familial relatedness (SKAT-ADJ). The first method is a simple linear regression. The second method tests the null hypothesis of zero fixed effects in a linear mixed model. Specifically, we used the likelihood ratio test implemented in the software GEMMA [6]. The third method is the direct application of the SKAT software [3] that ignores familial relatedness. As for the fourth method, we adopted the approach described in Ref. [7]. Specifically, we estimated the covariance matrix under the null model without genes or variants being tested. Then we applied a data transformation to calculate and where: leading to the following transformed model, where . Note that each is independent and identically distributed; therefore, the kernel score test can be appropriately applied on the transformed data.

Results

Simulated data

For the simulated data, we analyzed the genes that explain the highest percentages of SBP variance (the percentages of SBP variance explained by each gene in the simulation were provided by the GAW18 organizers) using the 4 methods described above. Genes without nonsynonymous mutations were excluded. For the first two single-variant methods, we performed a Bonferroni correction on the smallest p value of the variants in a gene to obtain the gene-level p value. To assess the false-positive rate, we also calculated the p values using Q1 as the phenotype. The power and the false-positive rate were calculated based on the 200 replicates. Table 1 shows the results for the top 7 genes. As the table shows, ignoring familial relatedness results in inflation of the false-positive rate using the UNI and SKAT methods, in contrast with the UNI-ADJ and SKAT-ADJ methods, which still demonstrate good control of the false-positive rate. As for the power, the single-variant method does better for genes whose effects on the phenotype are dominated by a single variant (eg, LEPR) whereas the collapsing method is more powerful for genes containing multiple variants with comparable effect sizes.
Table 1

The power and false-positive rate (FPR) of the 4 methods in simulated data

GeneUNIUNI-ADJSKATSKAT-ADJ

# of NS variantsPowerFPRPowerFPRPowerFPRPowerFPR
MAP4810.1210.0410.1310.03

TNN200.90.2250.740.0550.8950.380.8250.04

LEPR60.9150.1250.8950.050.220.0850.730.04

FLT340.660.1050.30.0450.330.1350.3050.07

FLNB130.440.140.1650.030.080.170.5150.035

ZNF44390.0700.1550.0350.0300.1500.0750.0500.075

GSN60.2650.1550.1300.0700.0300.0650.3850.065

A significance level of 0.05 was used to calculate both power and FPR.

The power and false-positive rate (FPR) of the 4 methods in simulated data A significance level of 0.05 was used to calculate both power and FPR.

Real data

Consistent with our observation in the simulated data, the Q-Q plot (Figure 1) shows that the UNI and SKAT methods that ignore familial relatedness result in substantial inflation of the false-positive rate. For UNI-ADJ, only 1 variant (chr11:119059358) has a Bonferroni-corrected p value smaller than 0.05 (PBonferroni = 0.038). It is a coding variant for gene PDZD3. To our knowledge, the literature has suggested no role for this gene in blood pressure. For SKAT-ADJ, there is no gene with a Bonferroni-corrected p value smaller than 1.
Figure 1

Q-Q plots for the 4 methods in real data. The Q-Q plots are based on −log10 p-values of the 7781 genes with at least 1 nonsynonymous mutation in the real data. Red curves represent the observed p values and the black curves represent the expected p values under the null model.

Q-Q plots for the 4 methods in real data. The Q-Q plots are based on −log10 p-values of the 7781 genes with at least 1 nonsynonymous mutation in the real data. Red curves represent the observed p values and the black curves represent the expected p values under the null model.

Discussion and conclusions

In this study, we investigated the utility of linear mixed models in adjusting for familial structure. We also attempted to combine a linear mixed model method with a kernel score test--a state-of-the-art collapsing methodology that tests for association between a group of variants and a phenotype.

Conclusions

We found that linear mixed models are able to satisfactorily adjust for familial relatedness in the sense that false-positive rates are well controlled at the nominal significance level. Not surprisingly, in terms of the statistical power, the performance of the single-variant method and the collapsing method depends on the distribution of the variants' effect sizes within the gene. The single-variant method is more powerful for genes with one dominating causal variant, whereas the collapsing method is more powerful for genes with multiple causal variants of similar effect sizes.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CL and HZ designed the overall study; CY, CL, MC, XC, and LH conducted statistical analyses; and CL and HZ drafted the manuscript. All authors read and approved the final manuscript.
  6 in total

1.  Genomics: In search of rare human variants.

Authors:  Rasmus Nielsen
Journal:  Nature       Date:  2010-10-28       Impact factor: 49.962

Review 2.  Family-based designs for genome-wide association studies.

Authors:  Jurg Ott; Yoichiro Kamatani; Mark Lathrop
Journal:  Nat Rev Genet       Date:  2011-06-01       Impact factor: 53.242

3.  Variance component model to account for sample structure in genome-wide association studies.

Authors:  Hyun Min Kang; Jae Hoon Sul; Susan K Service; Noah A Zaitlen; Sit-Yee Kong; Nelson B Freimer; Chiara Sabatti; Eleazar Eskin
Journal:  Nat Genet       Date:  2010-03-07       Impact factor: 38.330

4.  Rare-variant association testing for sequencing data with the sequence kernel association test.

Authors:  Michael C Wu; Seunggeun Lee; Tianxi Cai; Yun Li; Michael Boehnke; Xihong Lin
Journal:  Am J Hum Genet       Date:  2011-07-07       Impact factor: 11.025

Review 5.  Statistical analysis strategies for association studies involving rare variants.

Authors:  Vikas Bansal; Ondrej Libiger; Ali Torkamani; Nicholas J Schork
Journal:  Nat Rev Genet       Date:  2010-10-13       Impact factor: 53.242

6.  Genome-wide efficient mixed-model analysis for association studies.

Authors:  Xiang Zhou; Matthew Stephens
Journal:  Nat Genet       Date:  2012-06-17       Impact factor: 38.330

  6 in total
  1 in total

1.  Testing genetic association with rare and common variants in family data.

Authors:  Han Chen; Dörthe Malzahn; Brunilda Balliu; Cong Li; Julia N Bailey
Journal:  Genet Epidemiol       Date:  2014-09       Impact factor: 2.135

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.