Literature DB >> 29239496

Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing.

Osvaldo Espin-Garcia^1,2, Radu V Craiu³, Shelley B Bull^1,2.

Abstract

We evaluate two-phase designs to follow-up findings from genome-wide association study (GWAS) when the cost of regional sequencing in the entire cohort is prohibitive. We develop novel expectation-maximization-based inference under a semiparametric maximum likelihood formulation tailored for post-GWAS inference. A GWAS-SNP (where SNP is single nucleotide polymorphism) serves as a surrogate covariate in inferring association between a sequence variant and a normally distributed quantitative trait (QT). We assess test validity and quantify efficiency and power of joint QT-SNP-dependent sampling and analysis under alternative sample allocations by simulations. Joint allocation balanced on SNP genotype and extreme-QT strata yields significant power improvements compared to marginal QT- or SNP-based allocations. We illustrate the proposed method and evaluate the sensitivity of sample allocation to sampling variation using data from a sequencing study of systolic blood pressure.

Entities: Chemical Disease Gene Species

Keywords: Genetic Analysis Workshop 19; fine-mapping; genetic association studies; joint outcome covariate dependent sampling; outcome-/covariate-dependent sampling

Mesh：

Year: 2017 PMID： 29239496 PMCID： PMC5814750 DOI： 10.1002/gepi.22099

Source DB: PubMed Journal: Genet Epidemiol ISSN： 0741-0395 Impact factor: 2.135

INTRODUCTION

Regional sequencing to follow‐up findings from a genome‐wide association study (GWAS), in which a large number of single nucleotide polymorphisms (SNPs) are tested one by one, can be cost‐effective for fine mapping. In the “post‐GWAS” era, identifying causal variants and susceptibility genes in GWAS‐identified regions of association has become an important goal for researchers. Despite decreasing costs of next‐generation sequencing (NGS) technologies, sequencing all subjects in large‐scale studies is still prohibitive. Thus, careful planning and innovative designs become essential to optimize the available resources. Sequence variants at high density in the fine‐mapping region of interest are typically in linkage disequilibrium (LD) with previously, strongly associated, variants from GWAS. However, the GWAS SNPs may or may not have biological function themselves. Consequently, fine‐mapping variants in the selected region are tested for association to identify biologically relevant loci. A thorough review and description of the strategies utilized for fine mapping can be found in Spain and Barrett (2015). Two‐phase sampling design and analysis has proven to be an efficient technique to select and analyze a cost‐effective subsample, for example, individuals to be sequenced. In their original formulation, two‐phase sampling designs were developed to estimate population parameters of an expensive variable, that is, a random variable that is costly to measure (in time, materials, or personnel) (Pickles, Dunn, & Vázquez‐Barquero, 1995; White, 1982). At phase 1, data for an inexpensive or surrogate variable (correlated to the expensive variable) are collected for a large random sample of the population. At phase 2, the expensive variable is measured only in a subset of the phase 1 sample; the subsample is drawn based on the information provided by the sample distribution of the inexpensive variable. In this case, the expensive variable is missing by design in individuals not selected in phase 2. Various methods have been proposed to analyze data collected under a two‐phase design, including estimating functions (Breslow & Wellner, 2007; Chatterjee, Chen, & Breslow, 2003; Chen, Craiu, & Bull, 2012; Scott & Wild, 2011) and (conditional or full) maximum likelihood methods (Breslow & Cain, 1988; Breslow & Holubkov, 1997; Derkach, Lawless, & Sun, 2015; Lawless, Kalbfleisch, & Wild, 1999; Lin, Zeng, & Tang, 2013; Song, Zhou, & Kosorok, 2009; Zeng & Lin, 2014; Zhao, Lawless, & McLeish, 2009). Under a case‐control design, sampling within strata defined by both disease status and covariates is more powerful than sampling from cases and controls only (Breslow & Chatterjee, 1999; Schaid, Jenkins, Ingle, & Weinshilboum, 2013). For a quantitative trait (QT), sampling strategies that select informative individuals according to extreme‐trait or genotype values are known to have good properties (Chen et al., 2012; Lin et al., 2013; Thomas, Yang, & Yang, 2013), but designs based on joint sampling are not well developed. In this effort, we evaluate phase 2 sampling according to values of a SNP genotype and a QT. Statistical analysis is based on semiparametric maximum likelihood (SPML) estimation where a GWAS‐identified SNP or a candidate GWAS‐SNP (Z) is a surrogate covariate used to infer the association between a sequence variant (G) and a QT (Y). We estimate parameters via an EM algorithm; our approach is novel in that it is tailored for the post‐GWAS scenario by incorporating GWAS SNP data available for all individuals. As a result, efficiency is improved compared to methods that use phase 2 data alone. Specifically, we assume that the linear relationship between a causal variant, G, and a QT, Y, is of the form: where are regression parameters, ; thus , . Additionally, we hypothesize that a causal sequence variant () and the GWAS‐SNP (Z) tend to occur on the same haplotype with a specified LD structure. G and Z have minor allele frequencies (MAFs) of and , respectively. The remainder of the paper is structured as follows. Section 2 describes the designs for two‐phase sampling in the post‐GWAS setting. Section 3 explains the model formulation and details of the statistical inference via SPML. Section 4 elaborates on the alternative phase 2 sample allocations we study. In Section 5, we report simulation studies performed to evaluate estimation efficiency, association test validity, and power. Further, we quantify improvements associated with QT‐SNP joint sampling compared to QT or SNP marginal samplings under alternative sample allocations. Fine‐mapping analysis of systolic blood pressure is presented in Section 6 to illustrate application of the methods and compare sampling variation of alternative allocations. A discussion focusses on limitations and extensions of the proposed method.

TWO‐PHASE DESIGNS IN POST‐GWAS REGIONAL SEQUENCING

The main objective in the post‐GWAS setting is to conduct statistical inference on the association of a sequence variant, G, that is, a potentially causal variant, with a QT of interest Y. This variant is located in a genomic region of interest, narrowed down following GWAS results. This so‐called fine mapping generally begins with multiple single‐variant analysis across a region; although, for analysis of low‐count rare variants, the application of multivariant burden and variance component tests may be necessary. To reduce sequencing costs, however, variants in the region are ascertained in only a fraction of individuals, making G missing by design for a potentially large subset of the first sample. Consequently, two‐phase studies consist of GWAS in phase 1 and fine‐mapping analysis of sequence data in phase 2, the latter obtained in a subsample of individuals from the initial GWAS. In this post‐GWAS scenario, the trait data (Y) and the GWAS‐SNP (Z), a surrogate for the causal variant, are observed for every subject in the study. The design objective is to select a subset of informative subjects based on available data in phase 1, namely . Inference on the missing‐by‐design sequence variants is conducted using all available data from phases 1 and 2. We define the missing indicator , , where N is the number in individuals measured in phase 1. S 2 represents the phase 2 sample of n subjects. We let denote the set of () subjects that are not in the phase 2 sample but are present in the GWAS study. Regression analysis at the GWAS phase uses the surrogate variable, Z, which does not usually have biological function, say by fitting . Nonetheless, GWAS analyses serve as an efficient screening strategy to identify candidate regions in the genome. On the other hand, in fine‐mapping, both causal and null variants are colocated in the region, and these null variants do not correspond to the usual model‐based null hypothesis, that is, in (1). Rather, a null SNP in the region of interest (denoted by G 0) has no direct association with the QT (Y), but because it may be in LD with G 1 or Z, an indirect association with Y may be detected. This makes it difficult to distinguish among the individual contributions, even in the complete data case (see Faye, Machiela, Kraft, Bull, and Sun (2013) for reranking strategies in this scenario). In addition, we note that including an identified GWAS‐SNP, Z, in strong LD in the causal variant regression model (1) would introduce collinearity and decrease precision, reducing the ability to detect genetic association of a causal variant G with the QT.

Sampling designs

Under the sampling design, , with the inclusion probability for the ith subject. Therefore, is missing at random. Observe that this definition allows for inclusion probabilities of the form or , which we call a marginal sampling design as opposed to , which we refer to as a joint sampling design. Among marginal sampling designs, covariate‐dependent sampling, that is, , occurs when covariates of interest are used as surrogates of the expensive covariate to select individuals for the phase 2 subset. Analogously, when the trait of interest is used to select individuals this is called outcome‐dependent sampling, response‐dependent sampling, or trait‐dependent sampling (TDS), that is, . Among TDS designs, extreme‐trait sampling has arguably become the most widely used marginal sampling design. In particular it has been used in rare variant analysis (Derkach et al., 2015; Li, Lewinger, Gauderman, Murcray, & Conti, 2011; Lin et al., 2013; Yilmaz & Bull, 2011), because this approach effectively enriches for rare exposure, for example, White (1982). These methods are also applicable in the context of nonrare variants, for example, Satagopan and Elston (2003), Wang, Thomas, Pe'er, and Stram (2006), and Thomas et al. (2009) propose trait‐dependent phase 2 sample allocations and discuss related issues in the context of GWAS. In the following paragraphs, we describe the classes of marginal sampling designs we examine in this work. For marginal GWAS‐SNP sampling, strata are naturally defined by the number of copies of the minor allele carried by individuals in the study. The expected counts assuming Hardy‐Weinberg equilibrium (HWE) for a GWAS‐SNP with MAF are given by . The fact that Z is a discrete covariate is important in our formulation to ensure existence of the maximum likelihood estimate (MLE) (van der Vaart & Wellner, 2001). Further extensions could also include more than one SNP genotype as stratification factors, for example, Schaid et al. (2013). To operationalize the QT sampling, we discretize the QT (Y) into a three strata variable with labels (T1, T2, T3). Let be fixed cut‐off values of Y that partition the QT as follows: In applications, are prespecified quantities, for example, relevant in clinical practice, such as systolic and diastolic blood pressure level categorization into hypertensive, prehypertensive, and normotensive. An important consideration for the cut points is that these values are not determined from the data at hand; otherwise dependency among observations is introduced through the inclusion probabilities. Joint sampling designs, on the other hand, use both response and covariate simultaneously to select individuals for the phase 2 subset. Table 1 illustrates the joint distribution of the discretized QT and the GWAS‐SNP along with the marginal distributions. The goal of joint allocation in this case is to select individuals for the phase 2 subset based on the nine strata determined by the QT categories and the SNP genotype. In discretizing Y, the choice of three strata is the simplest, but is straightforward to extend. In general, for marginal and joint sampling designs, strata are determined by partitioning into K groups.

Table 1

Joint distribution of the GWAS‐SNP, Z, and the discretized version of the QT,

Z∖Yst	T1	T2	T3	M_Z
0	N ₀₁	N ₀₂	N ₀₃	N0·
1	N ₁₁	N ₁₂	N ₁₃	N1·
2	N ₂₁	N ₂₂	N ₂₃	N2·
M_Y	N·1	N·2	N·3	N

Joint distribution of the GWAS‐SNP, Z, and the discretized version of the QT,

A SEMIPARAMETRIC MAXIMUM LIKELIHOOD APPROACH

We extend the SPML formulations described in Zhao et al. (2009) and Lin et al. (2013). Let be the functional (parametric) relationship between G and Y indexed by ; in our case, corresponds to the probability density function of a normal distribution with parameters , . On the other hand, we denote , as the sets of unique observed values of G (in S 2) and Z (in ). Let be the joint probability function of G and Z given by the discrete probabilities , , which we estimate nonparametrically. The proposed method differs from the formulations of Zhao et al. (2009) and Lin et al. (2013) in three aspects: conditional independence assumption between Y and Z given G, which leads to exclusion of Z in the linear predictor of the parametric trait model ; use of Z in nonparametric estimation of the joint distribution of G and Z, with support for defined by the Cartesian product; and the individual weights calculation in the E‐step of the EM algorithm. We detail similarities and differences among these methods in Appendix A (supplementary material).

Likelihood formulation

Considering the above, we define the observed data likelihood following Robins, Hsieh, and Newey (1995) and Lawless et al. (1999) by: Because SPML estimation of θ and observed information matrix calculation does not involve s, we can disregard such terms, leading to: The loglikelihood takes the form , where we let . Note that subjects in have incomplete data , whereas phase 2 (S 2) subjects have completely observed data . Our formulation uses the auxiliary covariate, Z, only in the nonparametric part as Y and Z are assumed to be conditionally independent given G. When Z is a non‐causal surrogate for , its inclusion in the regression may adversely affect model performance due to collinearity between Z and G. Direct maximization of (4) is difficult so in the next section we describe the EM algorithm used to maximize the observed‐data loglikelihood.

EM algorithm

We apply the EM algorithm to estimate θ and , , . We specify initial values in a similar fashion as Lin et al. (2013), that is, , , except for which we specify by for , ; denotes the set of cardinality m containing the different pairs observed in S 2. Further, we define as the set of different z's in S 2. Due to some of the allocation designs, it is possible that will not contain all the values of Z observed in phase 1. E‐step. Let be the complete data loglikelihood, then where M‐step. We update the estimates' values as follows: where and ⊗ is the outer product. The algorithm iterates between the E and M steps until convergence is achieved, that is, and ), yielding the maximum likelihood estimates, MLEs, . For hypothesis testing, Wald score and likelihood ratio (LR) tests can be constructed following standard procedures (Louis, 1982). We elaborate on the construction details of these tests in Appendix B (supplementary material). Furthermore, concerning the LR, Murphy and van der Vaart (2000) point out that the usual full LR statistic fails in semiparametric models. To overcome this, they argue in favor of using the profile LR in a semiparametric framework, which is defined as , where . Hence, has corresponding test statistic asymptotically distributed as . For the profile LR statistics, we observed in our simulations that the following simplification rendered adequate results for testing , , which is effectively substituting the nonparametric estimates of the restricted MLEs, , by the unrestricted MLEs. Justification for this substitution under SPML estimation, in the simplest case without Z in the trait model, is that estimates for were the same under the genetic association alternative and null leading to acceptable profile LR tests. A precedent on related substitutions using profile likelihoods has been discussed in Murphy and van der Vaart (2000); Pace, Salvan, and Ventura (2011).

PHASE 2 SAMPLE ALLOCATION

Despite the extensive literature on two‐phase designs and estimation approaches, relatively less attention has been paid to allocation of the phase 2 sample across defined strata. We first review some of the approaches that have been applied to draw phase 2 data, and then specify alternative approaches that we investigate in simulations and application.

Background

Several authors have investigated two‐phase designs that analyze QTs (continuous outcomes) in general settings with covariates. Lawless et al. (1999) explore allocations with equal phase 2 sample sizes within strata in Y‐dependent sampling, that is, balanced on a discretized version of Y. Chatterjee et al. (2003) present a pseudoscore estimator and study three phase 2 allocations using a joint (discretized) outcome‐covariate strata definition for dichotomous Y and X with selection probabilities , namely (a) simple random sampling with constant, (b) stratified sampling with for each , or (c) restricted sampling with but . Song et al. (2009) propose semiparametric efficient inference with selection of phase 2 data via outcome‐dependent sampling with the following allocations: (a) a sample taken from the trait distribution tails, or (b) all subjects from the two trait distribution tails, both augmented with a simple random sample drawn prior to the extreme‐trait selection. Zhao et al. (2009) consider similar outcome‐dependent sampling with an allocation based on defining three strata for Y as , , and such that . Then, phase 2 sampling probabilities are assigned values 1, 0.056, and 1 for the corresponding three strata, that is, all subjects from the tails of the distribution plus a random sample from the middle. Zhou, Wu, Liu, and Cai (2011) study “outcome‐auxiliary‐dependent sampling” for the case when there is a continuous auxiliary covariate; phase 2 data are selected from a mixture of a simple random sample of size n 0 and a sample taken from the four extreme strata derived from discretizing the outcome and auxiliary covariates in tertiles. In the genetic analysis literature, two‐phase designs have specified QT and GWAS‐SNP allocations in various ways. Yilmaz and Bull (2011) consider a distance‐based sampling function for extreme‐trait values, that is, allocating only observations in the tails of the QT distribution, and compare it to simple random sampling; Li et al. (2011) study allocations based on extreme‐phenotype sampling of the tails, that is, or and almost‐extreme sampling, which is similar, but removes the very extremes beyond C 3 and C 4, that is, or ; Lin et al. (2013) examine another version of extreme sampling in which they select a predefined number of the highest and lowest values of Y while taking a random sample among the remaining values; lastly, Derkach et al. (2015) select phase 2 data under Y‐dependent sampling by drawing observations from , where and , with and determined so that . Notably, these approaches are mostly motivated by rare variant analysis. On the other hand, Chen et al. (2012) examine marginal GWAS‐SNP sampling designs and phase 2 sample allocations, whereas Chen, Craiu, and Bull (2014) focus on a fine‐mapping scenario using GWAS‐SNP as a stratification factor under equal strata size and rare‐homozygote stratum enriched allocations. Although there is some literature on joint trait‐SNP sampling designs for case‐control studies (Schaid et al., 2013; Thomas et al., 2013), to our best knowledge no previous study has focused on joint QT‐genotype sampling specifically for genetic fine mapping.

Alternative allocations

In this section, we specify marginal and joint sampling designs for various phase 2 sample allocations and focus on allocations for joint QT‐SNP‐dependent sampling. To make comparisons between the joint and marginal sampling designs, we define the marginal allocations as sums over the joint strata margins. Let be the number of subjects to be allocated in each stratum for the joint sampling design. Consequently, the marginal allocations are determined by and , where , are intervals defining the strata. Let ϱ be the overall sampling fraction, where we want to draw a subsample of size , such that . As described in Lawless et al. (1999), variable probability sampling (VPS) and basic stratified sampling (BSS) are popular choices to select subjects in the phase 2 sample. In VPS, is considered a random variable, as units are selected with specified probability . BSS, on the other hand, considers as fixed quantities given . Thus, for each defined stratum, subjects are selected with equal probability. VPS often makes more sense in the context where subjects are assigned to phase 2 dynamically as a consequence of data collection over time. In this report, we use BSS because strata can be defined a priori from the GWAS. However, full likelihood methods as well as the proposed semiparametric method are capable of handling various sampling schemes (Derkach et al., 2015). In the following, we define four possible allocations for the regional sequencing. Table 2 illustrates possible sample allocations under the joint QT‐SNP‐dependent sampling. In the simulation studies, we compare statistical efficiencies for each allocation, and associated hypothesis testing properties.

Table 2

	(A) pps			(B) extreme			(C) balanced			(D) combined
Z∖Yst	T1	T2	T3	T1	T2	T3	T1	T2	T3	T1	T2	T3
0	491	236	477	625	0	625	278	277	278	416	0	416
1	384	204	463	0	0	0	278	278	278	417	0	417
2	93	42	111	625	0	625	278	277	278	417	0	417

Adding up by row or column leads to allocations under marginal QT‐ and GWAS‐SNP‐ dependent sampling, respectively.

Example sample allocations for phase two sample size , , r2 = 0.75, , under joint QT‐GWAS‐SNP‐dependent sampling for allocations (A) proportional to size, (B) extreme on Z and , (C) balanced on Z and , and (D) balanced on Z and extreme on Adding up by row or column leads to allocations under marginal QT‐ and GWAS‐SNP‐ dependent sampling, respectively.

Proportional to stratum size

This allocation aims to preserve the strata distribution structure in the complete data. In this allocation scenario, we simply draw subsamples within strata of size , . This is the only case in which the phase 2 sample has the same distribution (over Y and Z) as the complete data, but it may be inefficient.

Extreme

As hinted by its name, extreme allocation aims to distribute the sample to the extreme values of Z and/or . In the joint sampling case, the samples are taken from extreme valued strata in Table 1, . Strata T1 and T3 are defined in Equation (2). The simplest case is to allocate the same sample size across the four extreme strata, thus Observe that the min function is necessary because the number of subjects can be smaller than especially when is low. This allocation oversamples those strata believed to be most informative for the QT and the extreme genotype categories for the GWAS‐SNP.

Balanced

Here, allocation on Z and specifies the same number of subjects per strata, therefore , ; . Once again, the min function is necessary to avoid empty cells but this occurs less frequently than in the extreme case. This allocation may be preferred when one still wants to oversample the most informative strata without losing all information from other strata.

Combined

This allocation combines balanced selection in Z and extreme selection in . It reflects what has been previously reported in the literature as useful strategies to select subjects marginally (Chen et al., 2012; Derkach et al., 2015; Lin et al., 2013; Schaid et al., 2013; Thomas et al., 2013). In this case

SIMULATION STUDIES

Design

We conduct simulation studies in which the GWAS‐SNP, Z, is indirectly associated with the QT of interest, Y, through LD with a causal SNP, G 1 (Fig. 1). The correlated SNPs G 1 and Z are randomly drawn from a haplotype with fixed MAFs and LD (r2) values assuming HWE. The generating model for the QT is given by: , . To mimic the post‐GWAS scenario, we keep only those replicates that achieve suggestive genome‐wide significance, that is, P‐value for the parameter γ1 in the regression ; we proceed until R such replicates are drawn, discarding those that do not achieve significance. Table 3 displays the simulation design parameters, and corresponding values. Genetic effects under the alternative were chosen to achieve roughly 100% power in the complete data.

Figure 1

Fine‐mapping scenario specified in the simulation setup

Note: A causal variant G 1 has a linear effect β1 on the QT, Y. Z is indirectly associated with the QT; this association may be detected through a GWAS. The indirect assocation arises from the LD structure between Z and G 1. Type I error is studied through an unrelated (with any of Y, G, or Z) SNP, G 0.

Table 3

Simulation design parameters and values. The number of replicates yields precision of for empirical type I error (at 5%) and for empirical power (at 80%)

Design parameter	Value(s)
Replicates (R)	10, 000 (studied null) and 1, 000 (Ha)
Study sample size (N)	5, 000
MAFs (qG,qZ)	0.2, 0.3; 0.2, 0.2
LD (r²)	0.5; 0.75
β₀	2
β₁	0.25
σ²	2.25
Phase‐two sample size (n)	540; 1,000; 2,500
Phase‐two sample allocation	(A) Proportional to stratum size (pps)
	(B) Extreme on both Z and Yst (extreme)
	(C) Equal number in each stratum (balanced)
	(D) Balanced on Z and extreme on Yst (combined)

Fine‐mapping scenario specified in the simulation setup Note: A causal variant G 1 has a linear effect β1 on the QT, Y. Z is indirectly associated with the QT; this association may be detected through a GWAS. The indirect assocation arises from the LD structure between Z and G 1. Type I error is studied through an unrelated (with any of Y, G, or Z) SNP, G 0. Simulation design parameters and values. The number of replicates yields precision of for empirical type I error (at 5%) and for empirical power (at 80%) For the case of no genetic effect, we note that trait values simulated under the null hypothesis when would lead only to false positives under GWAS screening on Z. Therefore, to evaluate type I error in the fine mapping setting, we retain the trait values generated under the alternative hypothesis, but instead generate a random null SNP, G 0 with the same MAF () as the causal G 1, but uncorrelated with Z. We then analyze the null SNP, G 0 in a similar fashion as the causal SNPs G 1. We examine three sampling designs to select individuals into S 2, namely (1) marginal SNP‐dependent sampling (M), (2) marginal QT‐dependent sampling (M), and (3) joint QT‐SNP‐dependent sampling (J). For (2) and (3), we discretize the QT, Y, into three‐strata (T1, T2, T3) according to fixed cut points as the percentiles of a normal distribution with mean and variance , that is, under the null . These values allow for a range of sampling variability in the simulations specially for the extreme allocations. This contrasts with other extreme‐trait allocations, e.g., Lin et al. (2013); Zhao et al. (2009), in which the sampling probabilities for the extreme strata are 1 (or close to 1). We explore a more extreme definition of the strata in Appendix C of the supplementary material. We evaluate three phase 2 (S 2) sample sizes determined by overall sampling fractions of 10.8%, 20%, and 50% of the phase 1 sample. Under each design, we examine the sample allocations described in Section 4: namely (A) proportional to stratum size (pps), (B) extreme allocation on Z and (extreme), (C) equal numbers across Z and (balanced), and (D) balanced on Z and extreme on (combined). When sampling selection of the desired number of subjects for a particular stratum is infeasible due to an insufficient number of observations in that particular stratum, we select up to the maximum number of subjects in the small stratum, and then reallocate to the remaining strata to achieve the specified S 2 sample size. We assess test validity and power for the Wald, score and LR statistics as described in Section 3, specifying test sizes of under the null, and under the alternative. In addition, we evaluate bias, empirical/average (model‐based) standard errors and relative efficiency via root mean square error (RMSE) of . We calculate relative efficiency with the complete data RMSE as denominator. The simulation studies are designed to address the following questions: identify allocations where J shows improved power; identify design/allocation combinations with good power (J vs. M or M); compare power of SPML versus alternative methods; and evaluate test statistics for the proposed SPML estimation (score vs. Wald vs. LR). In the next subsection, we highlight the main results on power comparisons: among allocations, between joint versus marginal designs, and between estimation methods including (1) analysis of phase 2 data only (denoted as S2 alone) and (2) TDS described in Lin et al. (2013) (with and without Z in the formulation). We further discuss related issues including test validity under the studied S 2 sample sizes, estimation bias and variance, relative efficiency, and effect of GWAS and phase 2 sample sizes versus sampling fractions.

Results

Under joint QT‐SNP‐dependent sampling, combined allocations yield higher power than alternative allocations (pps, extreme, balanced) (Table 4) while maintaining type 1 error (T1E) control (Fig. 2). We observe no differences between marginal and joint sampling designs within the pps allocation. Marginal dependent samplings show better power under the extreme (marginal QT) and combined (marginal SNP) allocations. Joint QT‐SNP‐dependent sampling consistently exhibits better power compared to marginal QT‐ or SNP‐dependent sampling under a combined allocation (Table 4). The proposed SPML estimation under joint trait‐covariate dependent sampling method (hereafter JTCsamp) compares favorably to two alternative methods: (1) S 2 alone and (2) our implementation of TDS. We observe substantial power increases for JTCsamp compared to S 2 data alone under all allocations (Table 4). TDS (with and without Z) and analysis of phase 2 data alone yield generally similar results for T1E and power with a phase 2 sample size of (supplementary Table S1). On the other hand, as anticipated, including Z in the trait model adversely affects the power to detect association between the causal variant, G, and the QT, Y (supplementary Table S1). This is not surprising mainly due to collinearity derived from the underlying LD between Z and G. Overall score, Wald and LR statistics tests show similar T1E and power patterns (supplementary Figs. S1 and S2).

Table 4

Empirical power (in percentages) for 1,000 replicates under the score test for a phase two sample size of ,

				(A) pps			(B) extreme			(C) balanced			(D) combined
MAFs	LD (r²)	Comp. Data	Method	M_Z	M_Y	JZ,Y	M_Z	M_Y	JZ,Y	M_Z	M_Y	JZ,Y	M_Z	M_Y	JZ,Y
0.2, 0.3	0.5	99.4	JTCsamp	84.4	84.8	85.7	84.6	92.9	90.3	90.3	78.1	85.1	87.8	92.0	94.3
			S ₂ alone	39.7	42.8	40.9	53.8	67.3	63.2	62.1	27.3	50.1	55.6	69.7	66.1
	0.75	97.0	JTCsamp	84.2	85.6	85.7	85.0	90.7	88.1	90.2	83.0	87.7	88.7	89.1	91.8
			S ₂ alone	31.8	29.3	26.6	41.4	53.7	36.8	56.5	19.8	41.4	45.4	53.9	44.4
0.2, 0.2	0.5	98.4	JTCsamp	83.4	84.5	84.9	85.8	90.1	91.8	89.6	79.3	85.3	89.3	90.7	93.8
			S ₂ alone	42.8	43.3	41.0	52.3	66.0	47.3	59.7	29.8	31.6	57.8	68.1	60.9
	0.75	97.7	JTCsamp	86.7	85.0	85.1	87.0	90.2	91.5	89.3	83.4	87.8	88.8	91.7	93.0
			S ₂ alone	29.9	29.8	25.9	42.3	54.6	16.5	52.0	19.9	12.1	45.1	52.5	32.4

Power is calculated at a nominal level of . For sampling designs: marginal GWAS‐SNP‐dependent sampling (M ), marginal QT‐dependent sampling (M ), and joint QT‐GWAS‐SNP‐dependent sampling (J ); and allocations (A) proportional to stratum size, (B) extreme on Z and , (C) balanced on Z and , and (D) balanced on Z and extreme on . Column for complete data (Comp. Data) for comparison purposes.

Figure 2

Quantile‐quantile plots of (P values) for testing β1 under the type 1 error scenario (G0) across 10,000 replicates with a phase 2 sample size of 2,500; we compare analyses of JTCsamp and S2 data alone

Notes: Each facet represents a sampling design and allocation combination. LD (r2) is fixed at 0.75 and MAFs fixed at q =0.2, q =0.3.

Empirical power (in percentages) for 1,000 replicates under the score test for a phase two sample size of , Power is calculated at a nominal level of . For sampling designs: marginal GWAS‐SNP‐dependent sampling (M ), marginal QT‐dependent sampling (M ), and joint QT‐GWAS‐SNP‐dependent sampling (J ); and allocations (A) proportional to stratum size, (B) extreme on Z and , (C) balanced on Z and , and (D) balanced on Z and extreme on . Column for complete data (Comp. Data) for comparison purposes. Quantile‐quantile plots of (P values) for testing β1 under the type 1 error scenario (G0) across 10,000 replicates with a phase 2 sample size of 2,500; we compare analyses of JTCsamp and S2 data alone Notes: Each facet represents a sampling design and allocation combination. LD (r2) is fixed at 0.75 and MAFs fixed at q =0.2, q =0.3. Regarding test validity, quantile‐quantile plots of values) for testing in JTCsamp and S 2 alone when phase 2 sample size is (Fig. 2) do not exhibit gross departures from the expected distribution (see also Table S2). Likewise, there are no signs of more liberal T1E rates in JTCsamp compared to the complete data case (supplementary Fig. S3). In analysis of smaller phase 2 sample sizes, we observe liberal T1E that decreases with sample size increments (see supplementary Figs. S4 and S5); we believe that small stratum‐specific sample sizes are driving these results. For pps and balanced allocations, effect estimation biases in JTCsamp and S 2 data alone are similar (supplementary Table S3), although these differences decrease with increasing LD. However, for extreme and combined allocations, these larger biases are only observed in the marginal SNP sampling, whereas marginal QT and joint QT‐SNP exhibit similar biases to the complete data. This latter bias arises from GWAS screening on Z, a well‐known phenomenon in genetic association studies called the “winner's curse.” In addition, in almost all cases, empirical and mean model‐based standard errors for JTCsamp are smaller than those for S 2 alone (supplementary Table S3). JTCsamp has higher relative efficiency than analysis of phase 2 data alone (supplementary Table S4). Extreme and combined allocations have the largest relative efficiencies; further, under a combined allocation, joint QT‐SNP‐dependent sampling is more efficient than marginal SNP or QT sampling. In addition, RMSE decreases with stronger Z‐G 1 LD (supplementary Table S4). To understand to what extent the overall sampling fraction or stratum‐specific sizes affect hypothesis testing, we repeated the simulations with , , and MAFs of and (supplementary Table S5). These simulations show similar patterns as those in Table 4, suggesting that it is the stratum‐specific sample size rather than the sampling fraction that is important in determining test validity.

APPLICATION TO FINE‐MAPPING OF SYSTOLIC BLOOD PRESSURE

The proposed method is intended to narrow down a candidate region by hypothesis testing of multiple variants within the region. To evaluate application under a more realistic multivariant causal model and assess sensitivity to sampling variation, we analyzed data from the Genetic Analysis Workshop 19 (GAW19). The study data included 1,943 unrelated individuals from the T2D‐GENES Project 1, each with whole‐exome sequencing (WES) by Illumina HiSeq2000 technology. Trait values for systolic and diastolic blood pressure (SBP and DBP) were generated from the WES data under a known polygenic model with causal variants in multiple genes including the MAP4 gene (Blangero et al., 2016) and replicated independently 200 times. Since all four generating “causal” variants are located in chromosome 3, we focus on the 38,102 chromosome 3 sequence variants. Since GWAS data per se are not available, we specify pseudo GWAS SNPs by extracting nonrare noncausal WES variants (MAF>1%) overlapping with a commercial genotype assay (Illumina Human Core Exome‐12v1.0). This yields a total of 3,192 chromosome 3 SNPs (48 in the MAP4 region) in the synthetic GWAS, which serves as the phase 1 genotype data.

Methods

For each of the 200 phenotypic replicates we perform the following. First, we perform chromosome‐wide scans on the synthetic GWAS data using complete data (1,943) to identify the surrogate SNPs. Linear regression of SBP is carried out for each of the genetic variants coded additively as independent covariates. The most significant SNP among the 3,192 (top SNP) is specified as the GWAS‐SNP (Z) for phase 2 design and analysis (see supplementary Figs. S6A and S7). The LD structure in the MAP4 sequenced region reveals correlations among the identified GWAS‐SNPs and the causal variants (supplementary Fig. S6B). Second, to implement a two‐phase design, we stratify SBP into three groups according to commonly used thresholds: normotensive (SBP<120), prehypertensive (120⩽SBP<140), and hypertensive (SBP⩾140). We assess three overall sampling fractions in the phase 2 data: 25%, 50%, and 75%, which roughly correspond to 486, 971, and 1,457 respectively. The observed sample sizes can vary across SNPs in the analyzed region due to missing sequencing data. For each sampling fraction, we draw subjects for the S 2 sample using a joint QT‐SNP sampling design under extreme, balanced, and combined allocations with BSS as described in Section 4 (see supplementary Fig. S8 for joint strata distributions in the complete data and the studied allocations). We do not consider pps allocation as it showed the poorest performance in our simulations. Lastly, we conduct regional association analysis in the MAP4 gene fine‐mapped through WES data. Thus, we selected sequence variants flanking 500 kb around this gene. There were 322 variants sequenced in the region (48 nonrare, that is, MAF>1%). These 48 variants constitute the analysis in the two‐phase approach, which we carry out one variant at a time. Besides the three studied sampling fractions examined at four allocations, we compare SPML analysis of combined phases 1 and 2 data (JTCsamp) to analysis of phase 2 data only (S 2 alone) and to our TDS implementation (Lin et al., 2013) without Z. Four markers, all within the MAP4 region, meet the genome‐wide significant threshold, that is, P‐value in at least one replicate (Table 5); these SNPs serve as the GWAS‐SNP in their respective replicate. Region (box)plots of the 200 replicates demonstrate that a joint QT‐SNP sampling design under a combined allocation achieves lower P values compared to balanced and extreme allocations for the studied sampling fractions (Fig. 3). Nevertheless, empirical power within allocations may vary across sampling designs, for example, marginal Y design has highest power under an extreme allocation, whereas marginal Z design achieves highest power under a balanced allocation. Additional results considering the best performing sampling design within the examined allocation closely agree with the results displayed in Figure 3; this suggests that among the studied phase 2 allocations, a combined allocation under a joint QT‐SNP sampling design achieves lower P values compared to other strategies (supplementary Fig. S9). Summaries of the fine mapping analysis of the variants that achieve genome‐wide significance under the former strategy show that JTCsamp detected the four causal variants consistently across replicates (Table 6). However, an additional noncausal variant (chr3.47734700) was also detected potentially due to LD structure in the region (see supplementary Fig. S6B).

Table 5

Summaries for four SNPs identified as the top GWAS‐SNP in at least one replicate

			Mean (min, max) across 200 replicates			Mean (min, max) in top replicates
GWAS‐SNP	MAF	Top reps.a	Estimate	SE	−log10(P‐value)	Estimate	SE	−log10(P‐value)
chr3:47618953	0.45	136	3.45 (2.2, 4.4)	0.493 (0.47, 0.51)	11.5 (5.1, 17.0)	3.6 (2.8, 4.4)	0.492 (0.47, 0.51)	12.5 (7.5, 17.0)
chr3:47712202	0.34	18	3.06 (2.1, 4.0)	0.517 (0.50, 0.54)	8.53 (4.6, 14.0)	3.57 (3.1, 4.0)	0.515 (0.50, 0.53)	11.3 (8.5, 14.0)
chr3:48309828	0.19	45	−4.04 (−5.3, −2.9)	0.630 (0.61, 0.66)	9.84 (5.7, 16.0)	−4.45 (−5.3, −3.7)	0.630 (0.61, 0.66)	11.8 (8.1, 16.0)
chr3:48360992	0.16	1	−3.80 (−5.1, −2.7)	0.673 (0.65, 0.70)	7.83 (4.1, 13.0)	−4.12 (N/A)	0.685 (N/A)	8.68 (N/A)

Denotes the number of times each SNP was detected as the most significant across 200 replicates.

Figure 3

Region (box)plots of the distribution of the P values, in scale, across 200 replicates in the GAW19 simulated data under a joint QT‐SNP design and JTCsamp analysis calculated using the likelihood ratio statistic (LRS)

Notes: Column facets denote results under extreme, balanced, and combined allocations. Row facets correspond to different overall sampling fractions (in percentages), leading to (approximate) phase 2 sample sizes of 486, 921, and 1,457, respectively. The dashed line represents the genome‐wide significance threshold.

Table 6

Summaries for five sequenced variants identified using JTCsamp under a combined allocation

				Mean (min, max) across replicates
Seq. variant	MAF	Generating value	Samp. frac	Estimate	SE	−log₁₀(P ‐value)	Emp. SD of Est.
chr3.47734700	0.33	0.0	25	−5.57 (−7.4, −4.1)	0.656 (0.59, 0.77)	17.0 (7.6, 28.0)	0.555
			50	−5.41 (−7.1, −4.0)	0.566 (0.27, 0.61)	21.4 (12.0, 110.0)	0.473
			75	−5.41 (−7.2, −4.0)	0.561 (0.50, 0.59)	21.4 (12.0, 35.0)	0.475
chr3.47956424	0.34	−16.2	25	−6.01 (−7.7, −4.5)	0.636 (0.56, 0.73)	20.7 (12.0, 31.0)	0.592
			50	−6.00 (−7.4, −4.6)	0.553 (0.53, 0.59)	26.8 (16.0, 39.0)	0.471
			75	−6.00 (−7.3, −4.6)	0.547 (0.52, 0.57)	27.4 (17.0, 39.0)	0.475
chr3.47957996	0.02	−18.2	25	−20.2 (−29.0, −15.0)	2.28 (1.90, 2.80)	18.8 (8.6, 39.0)	2.51
			50	−18.7 (−26.0, −14.0)	2.00 (1.80, 2.20)	20.3 (12.0, 38.0)	1.91
			75	−18.7 (−26.0, −14.0)	1.97 (1.80, 2.10)	20.7 (12.0, 38.0)	1.87
chr3.47958037	0.31	−1.0 × 10⁻⁵	25	−5.97 (−7.7, −4.3)	0.666 (0.66, 0.77)	18.7 (9.6, 30.0)	0.613
			50	−5.84 (−7.5, −4.3)	0.574 (0.47, 0.61)	23.8 (14.0, 36.0)	0.500
			75	−5.85 (−7.5, −4.3)	0.567 (0.26, 0.60)	24.9 (14.0, 160.0)	0.509
chr3.48040283	0.03	−20.7	25	−18.4 (−25.0, −13.0)	2.10 (1.70, 2.70)	18.4 (8.6, 38.0)	2.38
			50	−16.9 (−24.0, −13.0)	1.81 (1.70, 2.00)	20.1 (12.0, 37.0)	1.88
			75	−16.9 (−23.0, −13.0)	1.79 (1.70, 1.90)	20.6 (13.0, 37.0)	1.84

Generating values for the multivariate model used for SBP were provided by GAW19 data simulators (Blangero et al., 2016). Sampling fractions are in percentages. Interestingly, we identified one locus that is not in the list of “causal” variants, chr3:47734700, this can be explained by the LD structure in the MAP4 region (supplementary Fig. S6B) where we can observe that chr3:47734700 is in high LD with two of the causal variants: chr3:47956424 and chr3:47958037 (see Shin, Yi, and Bull, 2016, for details on this phenomenon).

Summaries for four SNPs identified as the top GWAS‐SNP in at least one replicate Denotes the number of times each SNP was detected as the most significant across 200 replicates. Region (box)plots of the distribution of the P values, in scale, across 200 replicates in the GAW19 simulated data under a joint QT‐SNP design and JTCsamp analysis calculated using the likelihood ratio statistic (LRS) Notes: Column facets denote results under extreme, balanced, and combined allocations. Row facets correspond to different overall sampling fractions (in percentages), leading to (approximate) phase 2 sample sizes of 486, 921, and 1,457, respectively. The dashed line represents the genome‐wide significance threshold. Summaries for five sequenced variants identified using JTCsamp under a combined allocation Generating values for the multivariate model used for SBP were provided by GAW19 data simulators (Blangero et al., 2016). Sampling fractions are in percentages. Interestingly, we identified one locus that is not in the list of “causal” variants, chr3:47734700, this can be explained by the LD structure in the MAP4 region (supplementary Fig. S6B) where we can observe that chr3:47734700 is in high LD with two of the causal variants: chr3:47956424 and chr3:47958037 (see Shin, Yi, and Bull, 2016, for details on this phenomenon). Comparisons across methods under the joint QT‐SNP sampling design and combined allocation in the regional association demonstrate that JTCsamp is more powerful than S 2 and TDS and is more similar to the complete data analysis. Consequently, even at the lowest sampling fraction, JTCsamp is the only method that consistently detected the four causal variants across all replicates. Although we observe some outliers in JTCsamp for noncausal variants at the end of the region, as the sampling fraction increases, the numbers of outliers decrease considerably (supplementary Fig. S10). Results of the WES fine‐mapping application are consistent with the simulation studies: higher power for joint QT‐SNP sampling and combined allocation compared to other methods (S 2 alone, TDS). Thus, the proposed method exhibits better agreement with the results of the complete data analysis. Nonetheless, further investigation is warranted in a lower powered setting. Results on sensitivity to phase 1 sampling variation suggest that the extreme and combined allocations are less sensitive to sampling variation, mainly because of the enrichment of the strata of interest that leads to sampling of most (or all) of the subjects in those categories (Appendix C supplementary material).

DISCUSSION

Two‐phase joint QT‐SNP sampling designs can be more powerful and less sensitive to sample variation than the marginal counterparts provided the proper allocation is chosen. In particular, we find that under a combined allocation, that is, extreme trait and balanced genotype, a joint QT‐SNP design exhibits better power compared to marginal designs. An important advantage of the SPML estimation is that the extreme and combined sample allocations we explored can be handled under this framework unlike, for instance, an approach based on inverse probability weighted estimating functions. The EM algorithm described in Section 3 can accommodate additional nongenetic covariates in the regression model by assuming , where w is a vector of covariates with ; and . Then is replaced with in the M‐step, for , where is the ith subject's vector of covariates, these additional covariates are observed across all subjects in both phases 1 and 2 data. Note that independence between G and W is implicitly assumed. Because JTC analysis is likelihood based, credible intervals can be constructed using Bayes factors, which may be useful for comparison with other fine mapping methods (see Maller et al. (2012) for an illustration of this approach) The simulation studies show that JTCsamp outperforms TDS in terms of power. This may be due to the fact that the latter formulation was designed for association analysis of sequencing data under marginal trait‐dependent sampling focusing on set‐based analysis of rare variants in hypothesis generating genome‐wide analysis. The proposed method fills a gap for those situations where the missing (by design) covariate observation (G) is correlated with the surrogate (and fully observed) covariate Z. It would be of practical interest to explore the impact of low allele counts of Z in the performance of the design and the estimation. Choice of stratification criteria for the QT poses some challenges, including potential losses of information, and it is arguably a limitation of this approach despite its wide usage in two‐phase designs. We examine a few reasonable joint allocations that extend existing marginal approaches that have proven useful in the literature. Further work is necessary to determine robust or optimal allocations under budget constraints. Throughout the paper, we assume additive genetic models for the (surrogate or causal) variants and the QT as these are most often encountered. We expect that a change in the model structure, for example, dominant or recessive, may affect the performances reported here. Past studies (Chen et al., 2014) suggest, however, that the form of the genetic model has a limited impact on the efficiency of the two‐phase analysis. The current specification of JTCsamp is yet to incorporate multiple G or multiple Z, which can be a desirable feature in regional sequencing; extensions in this direction are warranted. The described method can be extended to generalized linear models (GLMs) using available estimation methods by replacing the Gaussian distribution by another member of the exponential family. To do so, the individual weights computed in the E‐step need to reflect the chosen family distribution. Consequently, the updates for the parameters in the M‐step cannot be obtained in closed form, requiring iteratively reweighted least squares iterations. Extensions to other location‐scale or time‐to‐event models are warranted. In principle, these extensions can be covered by a general framework (Derkach et al., 2015). However, further considerations of sampling designs for these models will be required (see Lawless, 2016). We emphasize that the proposed method can be used in other settings in which molecular genetic technologies are too expensive to apply to an entire large cohort. For instance, in microbiome analysis, application of initial rRNA sequencing in phase 1 could be followed in phase 2 with metagenomic whole genome shotgun sequencing, a technology that provides a deeper understanding of the functions and pathways present in the human microbiome.

CONFLICT OF INTEREST

The authors have declared no conflict of interest. Supporting Information. Click here for additional data file.

24 in total

1. Two-phase stratified sampling designs for regional sequencing.

Authors: Zhijian Chen; Radu V Craiu; Shelley B Bull
Journal: Genet Epidemiol Date: 2012-03-28 Impact factor: 2.135

2. Optimal two-stage genotyping designs for genome-wide association scans.

Authors: Hansong Wang; Duncan C Thomas; Itsik Pe'er; Daniel O Stram
Journal: Genet Epidemiol Date: 2006-05 Impact factor: 2.135

3. Quantitative trait analysis in sequencing studies under trait-dependent sampling.

Authors: Dan-Yu Lin; Donglin Zeng; Zheng-Zheng Tang
Journal: Proc Natl Acad Sci U S A Date: 2013-07-11 Impact factor: 11.205

4. Semiparametric inference for a 2-stage outcome-auxiliary-dependent sampling design with continuous outcome.

Authors: Haibo Zhou; Yuanshan Wu; Yanyan Liu; Jianwen Cai
Journal: Biostatistics Date: 2011-01-20 Impact factor: 5.899

Review 5. Screening for stratification in two-phase ('two-stage') epidemiological surveys.

Authors: A Pickles; G Dunn; J L Vázquez-Barquero
Journal: Stat Methods Med Res Date: 1995-03 Impact factor: 3.021

6. A two stage design for the study of the relationship between a rare exposure and a rare disease.

Authors: J E White
Journal: Am J Epidemiol Date: 1982-01 Impact factor: 4.897

7. Methodological Issues in Multistage Genome-wide Association Studies.

Authors: Duncan C Thomas; Graham Casey; David V Conti; Robert W Haile; Juan Pablo Lewinger; Daniel O Stram
Journal: Stat Sci Date: 2009-11-01 Impact factor: 2.901

8. Are quantitative trait-dependent sampling designs cost-effective for analysis of rare and common variants?

Authors: Yildiz E Yilmaz; Shelley B Bull
Journal: BMC Proc Date: 2011-11-29

Review 9. Strategies for fine-mapping complex traits.

Authors: Sarah L Spain; Jeffrey C Barrett
Journal: Hum Mol Genet Date: 2015-07-08 Impact factor: 6.150

Review 10. Two-phase and family-based designs for next-generation sequencing studies.

Authors: Duncan C Thomas; Zhao Yang; Fan Yang
Journal: Front Genet Date: 2013-12-13 Impact factor: 4.599

3 in total

1. Estimating Additive Interaction Effect in Stratified Two-Phase Case-Control Design.

Authors: Ai Ni; Jaya M Satagopan
Journal: Hum Hered Date: 2019-10-21 Impact factor: 0.444

2. Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing.

Authors: Osvaldo Espin-Garcia; Radu V Craiu; Shelley B Bull
Journal: Genet Epidemiol Date: 2017-12-14 Impact factor: 2.135

3. Two-phase sample selection strategies for design and analysis in post-genome-wide association fine-mapping studies.

Authors: Osvaldo Espin-Garcia; Radu V Craiu; Shelley B Bull
Journal: Stat Med Date: 2021-10-01 Impact factor: 2.497

3 in total