Literature DB >> 26069459

Detecting the Genomic Signature of Divergent Selection in Presence of Gene Flow.

Ping Zeng1, Ting Wang2.   

Abstract

In this paper the detection of rare variants association with continuous phenotypes of interest is investigated via the likelihood-ratio based variance component test under the framework of linear mixed models. The hypothesis testing is challenging and nonstandard, since under the null the variance component is located on the boundary of its parameter space. In this situation the usual asymptotic chisquare distribution of the likelihood ratio statistic does not necessarily hold. To circumvent the derivation of the null distribution we resort to the bootstrap method due to its generic applicability and being easy to implement. Both parametric and nonparametric bootstrap likelihood ratio tests are studied. Numerical studies are implemented to evaluate the performance of the proposed bootstrap likelihood ratio test and compare to some existing methods for the identification of rare variants. To reduce the computational time of the bootstrap likelihood ratio test we propose an effective approximation mixture for the bootstrap null distribution. The GAW17 data is used to illustrate the proposed test.

Entities:  

Keywords:  Bootstrap test; Distribution approximation; Likelihood ratio test; Linear mixed model; Rare variants association study; Variance component

Year:  2015        PMID: 26069459      PMCID: PMC4460223          DOI: 10.2174/1389202916666150313230943

Source DB:  PubMed          Journal:  Curr Genomics        ISSN: 1389-2029            Impact factor:   2.236


BACKGROUND

Over the past few years due to rapid advances in next-generation sequencing technologies, the detection of rare variants (i.e., minor allele frequency MAF less than 0.01) association with phenotypes of interest has drawn much attention in the literature. However, because of their rarity the frequently-used methods for common variants (MAF greater than 0.01) have remarkably limited power for rare variants; thus a variety of approaches especially for rare variants association analysis have been developed recently [1-5], including the collapsing-based burden test [6-8], the variable allele-frequency threshold test [9], the data-adaptive sum test [10] and the sequence kernel association test (SKAT) [11, 12], and many others. However, many existing methods for the association of rare variants are subject to a serious power loss when both harmful and protective effects of causal rare variants within a gene are present [12, 13]. It has been shown that an efficient way for handling the directionality problem is to treat the effects of rare variants to be random with a common variance component and then test the significance based on this variance component [12, 14, 15]. By doing so the detection of rare variants is immediately translated into the variance component test under the context of linear mixed effects model [16]. Assume for the moment the variance component for multiple rare variants to be τ2; then H0: τ2 = 0 indicates no association exists between rare variants and phenotypes. The hypothesis testing is nonstandard since τ2 should be nonnegative and is located on the boundary of its parameter space under the null. In this situation the usual asymptotic distribution of the test statistic does not necessarily hold. Under some regularity conditions Self and Liang [17] proved that the likelihood ratio statistic asymptotically followed a 50:50 mixture of and , where is a point mass at zero and is the distribution with one degree of freedom; see also Stram and Lee [18] and Liang and Self [19]. However, some authors pointed out that this equal-weight mixture may be not reasonable in practice and may lead to conservative results [20]. To adjust for the conservatism, Pinheiro and Bates [21] advised to use the 65:35 mixture. In fact, determining a proper weight for the mixture of the likelihood ratio statistic is not straightforward [22]. In the literature under complex hypothesis-testing scenarios a natural way to resolve the problem is to resort to the bootstrap technique because of its generality and conceptual simplicity [23-25]. Faraway [26] and Samuh et al. [27] adopted parametric bootstrap methods for the likelihood ratio variance component test in linear mixed models. To examine the variance component in generalized linear mixed models, Sinha [28] proposed a score test and conducted it by a means of parametric bootstrap. They have shown the bootstrap test can maintain the type I error rate at the nominal level and is more powerful than the method that uses the usual asymptotic 50:50 mixture. The main advantage of the bootstrap test is that it avoids the derivation of the complex null distribution analytically and is easy to implement. Motivated by these results, in this paper we attempt to apply the bootstrap test to detect the association of rare variants in sequencing data. Both parametric and nonparametric bootstrap likelihood ratio tests are studied. Since the restricted maximum likelihood estimation generally offers less unbiased estimates for variance components than the maximum likelihood estimation [20, 29, 30], thus instead of the likelihood ratio test, we use the restricted likelihood ratio test (ReLRT) in the paper. This paper has the following organizations. After introducing the association of rare variants briefly and defining the ReLRT statistic, we present the algorithm for the parametric and nonparametric bootstrap tests. We implement extensive numerical studies to evaluate the performance of the proposed bootstrap likelihood ratio test, and compare it with other existing methods for the identification of rare variants. We finally apply the proposed ReLRT to the GAW17 data and give some discussions.

MATERIALS AND METHODS

Linear Mixed Model

Denote X= [x1, x2, …, x] covariates, such as smoking and age; Z= [z1, z2, …, z] genotypes of rare variants defined within a gene, where z is equal to 0, 1, and 2 representing the number of minor alleles; the phenotype is y; and the sample size is n. We use the following linear mixed model [16, 21] to characterize the relationship between y, X and Z where β = [β1, β2, …, β] are fixed effects, b = [b1, b2, …, b] are random effects with variance component τ2, where is the given weight for rare variant j and is typically calculated in light of MAF [8, 12]. Following Wu et al. [12], we selected w = Beta(MAF; 1, 25) in the paper, where Beta is the Beta probability density function with parameters 1 and 25. This selection implies that rarer variants have more weights. Clearly, τ2 = 0 suggests that b = 0 and there is no statistical association between the genotypes Z and phenotype y. Note that model is different from the model consider in Fitzmaurice et al. [22] and Sinha [28], where the mixed effects model was built for the dependent clustered data and included only one random effect representing the cluster effect; while in model the phenotype y is assumed to be unrelated although the random effects can induce a marginal correlation among y. In addition, multiple random effects are contained in model . In model the problem of the identification of rare variants has been converted into the test of variance component τ2. This transformation results in at least two useful consequences: (i) it circumvents the directionality problem encountered by the burden test and other collapsing-based methods, accordingly the variance-component based test generally has a better power when the effects of rare variants are heterogeneous [12, 31]; (ii) it is a single parameter test and thus avoids performing the multivariate hypothesis testing by directly deeming b as fixed effects, which may lead to a serious loss of degrees of freedom if a large number of rare variants are included and hence is often less powerful. Note that the null hypothesis H0: τ2 = 0 is nonstandard; the usual asymptotic distribution does not hold [22, 28]. Under H0, model reduces to the general linear model

Restricted Likelihood Ratio Test

The methods of maximum likelihood (ML) and restricted maximum likelihood (REML) are two commonly-used approaches for fitting model . It has been proved that REML provides less unbiased estimates for variance components than ML [29, 30]. So we only consider ReLRT for the problem of testing H0: τ2 = 0 versus H1: τ2 > 0. By replacing β and σ2 with their restricted maximum likelihood estimators, we can obtain the restricted profile log-likelihood function for model up to an independent constant , where V = lZWWZ′ + I with l = τ2/σ2, in which I is the identity matrix with order n and W is a diagonal matrix with elements w; ; where Y, X andZ are respectively the vector of y and the matrices of X andZ. It is easy to see that testing l = 0 is equivalent to testing τ2 = 0. The ReLRT statistic is accordingly defined as Noted that in Equation is actually a constant independent of l. Since T has a complicated expression, deriving its null distribution analytically is typically not feasible. To overcome this difficulty we next turn to the bootstrap procedure [23, 24].

Bootstrap Procedure

The bootstrap procedure is conceptually simple and easy to conduct as long as one can repeatedly obtain bootstrap samples. First we fit the null model M0 in , and yield with , and , here the subscript 0 indicates these quantities are estimated under M0. Sampling on in the bootstrap procedure is a natural choice; however, as Davison and Hinkley [23] stated that a better strategy is on the modified residuals , where h0 is the leverage value calculated as the diagonal element of hat matrix X(X′X)-1X′. Then has a constant variance while is heteroscedastic. Under these settings, the general bootstrap procedure can be described as follows. Algorithm 1. The general bootstrap procedure for ReLRT (i) Calculate the observed value of T according to M1 and M0 using the original data, denote as tobs; (ii) Generate bootstrap samples many times from , say B, and calculate T for each bootstrap sample using Equation , obtain the bootstrap value of T, say t, b = 1, 2, …, B; (iii) The bootstrap null distribution of T consists of t, and the Monte Carlo p value of ReLRT is simply estimated as the proportion of t equal to or greater than tobs. In the second step if we directly generate bootstrap samples using the fitted null model , i.e., if the bootstrap phenotype is created as , then we are conducting the parametric bootstrap test [26]; on the other hand, if we generate bootstrap samples by resampling the modified residuals with replacement, then we are conducting the nonparametric bootstrap test [23]. More specifically, the nonparametric bootstrap test is carried out by creating the bootstrap phenotype Y* = X + , where is produced through resampling with replacement. Note that in the nonparametric bootstrap test if we create via resampling without replacement, then actually we are performing the permutation test [23, 25], and they typically produce similar results.

Approximate Mixture of the Bootstrap Null Distribution

To reduce the computation time of the proposed bootstrap test, we propose a mixture to approximate the bootstrap null distribution of ReLRT. Following Greven et al. [32], we specify the approximate mixture as where π and a are respectively the proportion and scale parameters, and both of them can be estimated via the method of moment. Let t, i = 1, 2, …, L is the vector of the bootstrap value of T via the parametric or nonparametric bootstrap procedure described in Algorithm 1. Then the moment estimate for a is , where is the average of t and is the estimate of π via the spectral representation [20, 33] where μ’s are the eigenvalues of WZ′P0ZWwith P0 = I-X(X′X)-1X′, u’sare independently standard normal random variables. Note that this estimation of π is different from the original moment estimation in Greven et al. [32]. The estimate of π in is independent of the choice of L; hence it has the advantage of numerical stability over the moment estimation.

Settings of Numerical Study

The genotypes of rare variants Z were generated via the package COSI based on a coalescent model for European population [12, 34]. The covariates X = [1, x1, x2], where x1 and x2 are the standard normal variable and the binary variable with a probability of 0.5, respectively. The phenotype was sampled from the normal distribution , where k0 is the number of causal rare variants; i.e., the variants that have nonzero effects on the phenotype. The sample size was set to 200, 400, 600 and 800; in our simulations the average number of rare variants (i.e., k) corresponding to these sample sizes are 32, 46, 54 and 61, respectively. In the bootstrap ReLRT algorithm, B was set to 2000, and for the approximation mixture in we selected L = 2000, 1500, 1000, 800, 500, 300 and 100. We also implemented the simulation-based algorithm for the finite sample null distribution of ReLRT [20, 35, 36], and the number of runs in this algorithm is set to 10000. Besides ReLRT, the burden test, the optimal SKAT (SKAT-O) [37, 38], SKAT [12], the genetic random field (GenRF) model [39, 40] and the mixed effects score test (MiST) [41] were conducted together for comparisons. For type I error simulations the number of replicates was 2000; for power simulations the number of replicates was 1000. The estimated type I error rate and power were calculated as the proportion of p values equal to or less than 0.01.

Numerical Study 1: Type I Error Control

In model k0 was set to 0. The results are given in (Tables -). In the tables sim corresponds to results of ReLRT computed via the simulation-based algorithm [20], mix0 indicates results of ReLRT computed via the bootstrap test with B = 2000, and mix1-mix7 represent results of ReLRT computed via the approximate mixture with L = 2000, 1500, 1000, 800, 500, 300 and 100, respectively.

Numerical Study 2: Statistical Power Evaluation

For power simulations, only 30% rare variants were causal to reflect the fact that not all the rare variants are related to the phenotype. Like in Wu et al. [12], the effect size |b| was specified to 0.3|log10MAF|, which results in an effect size of 1.20 for MAF = 0.0001 and an effect size of 0.60 for MAF = 0.01. To study the influence of the direction of effects of rare variants on the statistical power, we considered two situations: (A) all of the causal rare variants had positive effects, i.e., b = 0.3|log10MAF|; (B) half of the causal rare variants had positive effects, i.e., b = 0.3|log10MAF| and the rest half had negative effects, i.e., b = -0.3|log10MAF|. The results are given in (Fig. -).

Data Analysis

We applied the proposed bootstrap ReLRT to the genetic analysis workshop (GAW) 17 data [42, 43]. The GAW17 data has 24487 single nucleotide polymorphisms (SNPs) on 3205 genes across 697 individuals. The MAF ranges from 0.07% to 25.8% and most of the SNPs are rare. The covariates include age, sex and smoking. We used the quantitative trait Q1 as the phenotype in our analysis. Three genes (i.e., ARNT, FLT1 and KDR) listed in (Table ) in Almasy et al. [43] were analyzed here. The results are given in (Tables and ).

RESULTS

Results for Numerical Study

The 50:50 and 65:35 mixture distributions for ReLRT typically lead to incorrect type I error rates (Table ). Specifically, in our simulations the 50:50 mixture usually produces conservative results, i.e., the type I error rate is less than the given nominal level, this is similar to that of Fitzmaurice et al. [22]; in contrast, the 65:35 mixture usually produces liberal results, i.e., the type I error rate is larger than the given nominal level. These results suggest that fixing the proportion parameter in the asymptotic null mixture for ReLRT is inappropriate. In fact, Claeskens [33] and Crainiceanu and Ruppert [20] have found that the mixture proportion parameter is case-specific instead of a fixed constant. For fair comparisons, in the following for power evaluations we do not consider the 50:50 and 65:35 mixtures. (Table ) shows that the burden test, SKAT-O, SKAT, GenRF and MiST can correctly control the type I error rate, except that SKAT is slightly conservative when the sample size is small (e.g., n = 200 and 400). SKAT is a score-based variance component test [44]; its relatively conservative nature has been also observed previously [12, 40]. It is seen from (Table ) that ReLRT based on the simulation algorithm of Crainiceanu and Ruppert [20] can protect the type I error rate. (Table ) also shows that both the parametric and nonparametric bootstrap ReLRT methods can effectively control the type I error, except when a very small value of L (e.g., L = 100) is used in the approximation mixture, under which slightly liberal results are obtained. From (Fig. to Fig. ), on the whole we can see that when all the causal rare variants are in the same direction (situation A), the burden test, SKAT-O and MiST have a relatively higher power, and MiST has the highest power; while when both negative and positive effects are present (situation B), SKAT and ReLRT have better power compared with the other competing methods. Additionally, it can be seen from (Fig. -) that: (i) the performance of the parametric and nonparametric bootstrap procedures for ReLRT is comparable, little difference is observed; (ii) and the performance of ReLRT based on various methods (e.g., the simulation-based algorithm, the parametric and nonparametric bootstrap procedures and mixture distribution) is largely analogous; (iii) under both situations A and B, ReLRT uniformly outperforms SKAT; (iv) in our simulations GenRF has the lowest power; (v) when the effects of the causal rare variants are heterogeneous (situation B), all the methods are subject to the power loss (Fig. ); however, SKAT and ReLRT suffer less reduction than the others, implying they are relatively robust to the effect heterogenicity.

Results for the GAW 17 Data

From (Tables and ) the genes FLT1 and KDR can be viewed as to be statistically associated with the phenotype at the level of 0.01, while ARNT cannot be. From (Table ) it is observed empirically that when the p value is large, the proposed approximate mixture leads to the very similar results regardless of the choice of L; when the p value is small, the mixture may lead to different results. But as seen from the simulations for a long run the approximate mixture with different choice of L generally produces comparable results.

DISCUSSION

Since its introduction in 1979 by Efron [45], the bootstrap method has become a simple and powerful means for statistical inference and has been widely employed in various scientific problems [23-25]. In this paper we propose the bootstrap-based restricted likelihood ratio test for the identification of rare variants in sequencing data. The simulation results have shown that both the parametric and nonparametric bootstrap ReLRT can correctly maintain the type I error at the nominal significant level, while the usual asymptotic mixture cannot. The two bootstrap procedures typically lead to almost the same results. Several competing methods for the identification of rare variants, including the burden test, SKAT-O, SKAT and MiST, are also considered for comparisons; actually, these methods are essentially score tests based only on the null model [12, 41]. An attractive feature of these tests is that they are computationally fast [12]. However, tests based only the null model explicitly ignore useful information about the possible alternative model. In contrast, ReLRT needs to fit both the null and alternative models; as a result, it is less computationally efficient. But compared to these existing methods, the proposed ReLRT based on bootstrap resampling has a comparable power or behave better. More importantly, the simulations show that ReLRT is robust in the sense that it suffers from less power loss when the effects of causal rare variants are in mixed directions. This is a satisfactory feature since in practice the true underlying genetic model of diseases is generally not known and tests that are robust against the effect directionality are more preferred. Despite its generic applicability, the bootstrap test is a computer-intensive statistical method. This problem will be more pronounced for the likelihood ratio test. Varieties of strategies are developed to decrease the computational cost of bootstrap; for example, reducing the number of replicates required by the bootstrap test, subsampling and the m out of n bootstrap. Kleiner et al. [46] recently presented an overview of these strategies, and pointed out that none of them are perfect and each has different degrees of shortcomings. To reduce the computational cost in this paper we employ an approximate mixture null distribution for the bootstrap ReLRT. The simulation results have shown that this mixture performs very well even for a relatively small value of L (e.g., L = 100 ~ 300). The importance is that the computation time of the bootstrap ReLRT is reduced significantly compared with the original bootstrap procedure. For example, compared to the bootstrap procedure with 2000 replicates, the proposed mixture with L = 200 can decrease the computation burden about 10 times while maintaining its desired performance. The performance with various choices of L has little difference. This property is desired since we are typically not clear about how the magnitude L should be, and tend to select a relatively small value. To improve the speed, the modern parallel computing via clusters and multicore processor can be utilized [46, 47]. Recently, a computationally efficient method, called bag of little bootstraps, has been proposed by Kleiner et al. [46]. This method has been shown to have favorable statistical performance for massive data; however, this method cannot be applicable to our setting, because it was initially designed for assessing the uncertainty of the estimates. Extending it to the likelihood ratio test may be an interesting problem needing further investigation.
Table 1

Type I error rate for the 50:50 and 65:35 mixtures and other existing methods.

n50:5065:35BurdenSKAT-OSKATGenRFMiST
2000.0070.0130.0080.0080.0070.0090.008
4000.0070.0120.0110.0100.0070.0090.009
6000.0070.0130.0100.0090.0090.0110.008
8000.0080.0130.0110.0100.0080.0080.009

Note: here 50:50 and 65: 35 correspond to results of ReLRT obtained by using the 50:50 and 65: 35 null mixtures.

Table 2

Type I error rate for ReLRT and its approximation with the mixture.

nsimmix0mix1mix2mix3mix4mix5mix6mix7
parametric bootstrap
2000.0110.0120.0120.0120.0110.0110.0120.0090.013
4000.0100.0080.0090.0090.0090.0100.0100.0110.011
6000.0110.0110.0110.0110.0120.0120.0130.0130.013
8000.0100.0090.0110.0110.0090.0100.0110.0130.013
nonparametric bootstrap
2000.0110.0110.0110.0110.0100.0110.0110.0120.016
4000.0090.0090.0110.0110.0110.0110.0090.0090.013
6000.0100.0110.0110.0100.0110.0110.0110.0120.016
8000.0070.0100.0110.0100.0100.0100.0100.0120.014

Note: here sim corresponds to results of ReLRT computed via the simulation-based algorithm described by Crainiceanu and Ruppert (2004); mix0 indicates results of ReLRT computed via the bootstrap test with B = 2000; and mix1-mix7 represent results of ReLRT computed via the mixture distribution described in Equation (6) with L = 2000, 1500, 1000, 800, 500, 300 and 100, respectively.

Table 3

Type I error rate for ReLRT and its approximation with the mixture.

 BurdenSKAT-OSKATGenRFMiSTReLRT
ARNT0.4090.5330.3410.720 0.483 0.343
FLT12.286E-31.034E-48.266E-52.589E-2 5.320E-5 2.000E-4
KDR1.080E-62.685E-68.425E-52.135E-1 6.156E-8 9.999E-5

Note: here the p value of ReLRT is computed through the simulation-based algorithm of Crainiceanu and Ruppert (2004).

Table 4

The p values for the three genes in the GWA17 data via ReLRT computed by using the parametric and nonparametric bootstrap procedures.

Mixture KDR FLT1 ARNT
pbootnbootpbootnbootpbootnboot
mix00.3560.3445.00E-45.00E-45.00E-45.00E-4
mix10.3520.3524.72E-51.27E-45.50E-79.01E-6
mix20.3530.3524.86E-51.92E-45.54E-76.78E-6
mix30.3530.3515.30E-51.20E-43.15E-76.81E-6
mix40.3540.3505.11E-51.06E-47.12E-73.01E-6
mix50.3560.3502.93E-57.27E-55.51E-71.45E-6
mix60.3560.3473.13E-52.07E-58.74E-74.55E-6
mix70.3510.3431.26E-42.55E-54.17E-106.76E-6

Note: pboot and nboot respectively represent the parametric and nonparametric bootstrap procedures; mix0 indicates results of ReLRT computed via the bootstrap test with B = 2000; and mix1-mix7 represent results of ReLRT computed via the mixture distribution described in Equation (6) with L = 2000, 1500, 1000, 800, 500, 300 and 100, respectively..

  29 in total

1.  Optimal tests for rare variant effects in sequencing association studies.

Authors:  Seunggeun Lee; Michael C Wu; Xihong Lin
Journal:  Biostatistics       Date:  2012-06-14       Impact factor: 5.899

2.  Pooled association tests for rare variants in exon-resequencing studies.

Authors:  Alkes L Price; Gregory V Kryukov; Paul I W de Bakker; Shaun M Purcell; Jeff Staples; Lee-Jen Wei; Shamil R Sunyaev
Journal:  Am J Hum Genet       Date:  2010-05-13       Impact factor: 11.025

3.  Permutation tests for random effects in linear mixed models.

Authors:  Oliver E Lee; Thomas M Braun
Journal:  Biometrics       Date:  2011-09-27       Impact factor: 2.571

4.  A note on permutation tests for variance components in multilevel generalized linear mixed models.

Authors:  Garrett M Fitzmaurice; Stuart R Lipsitz; Joseph G Ibrahim
Journal:  Biometrics       Date:  2007-04-02       Impact factor: 2.571

5.  Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models.

Authors:  Dawei Liu; Xihong Lin; Debashis Ghosh
Journal:  Biometrics       Date:  2007-12       Impact factor: 2.571

6.  A generalized genetic random field method for the genetic association analysis of sequencing data.

Authors:  Ming Li; Zihuai He; Min Zhang; Xiaowei Zhan; Changshuai Wei; Robert C Elston; Qing Lu
Journal:  Genet Epidemiol       Date:  2014-01-30       Impact factor: 2.135

7.  A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST).

Authors:  Stephan Morgenthaler; William G Thilly
Journal:  Mutat Res       Date:  2006-11-13       Impact factor: 2.433

8.  Genetic Analysis Workshop 17 mini-exome simulation.

Authors:  Laura Almasy; Thomas D Dyer; Juan Manuel Peralta; Jack W Kent; Jac C Charlesworth; Joanne E Curran; John Blangero
Journal:  BMC Proc       Date:  2011-11-29

9.  A groupwise association test for rare mutations using a weighted sum statistic.

Authors:  Bo Eskerod Madsen; Sharon R Browning
Journal:  PLoS Genet       Date:  2009-02-13       Impact factor: 5.917

10.  An evaluation of statistical approaches to rare variant analysis in genetic association studies.

Authors:  Andrew P Morris; Eleftheria Zeggini
Journal:  Genet Epidemiol       Date:  2010-02       Impact factor: 2.135

View more
  3 in total

1.  Boosting Gene Mapping Power and Efficiency with Efficient Exact Variance Component Tests of Single Nucleotide Polymorphism Sets.

Authors:  Jin J Zhou; Tao Hu; Dandi Qiao; Michael H Cho; Hua Zhou
Journal:  Genetics       Date:  2016-09-19       Impact factor: 4.562

2.  Cis-SNPs Set Testing and PrediXcan Analysis for Gene Expression Data using Linear Mixed Models.

Authors:  Ping Zeng; Ting Wang; Shuiping Huang
Journal:  Sci Rep       Date:  2017-11-10       Impact factor: 4.379

3.  A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies.

Authors:  Zhonghe Shao; Ting Wang; Jiahao Qiao; Yuchen Zhang; Shuiping Huang; Ping Zeng
Journal:  BMC Bioinformatics       Date:  2022-08-30       Impact factor: 3.307

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.