
Two-phase sample selection strategies for design and analysis in post-genome-wide association fine-mapping studies.

Osvaldo Espin-Garcia, Radu V Craiu, Shelley B Bull.

Abstract

Post-GWAS analysis, in many cases, focuses on fine-mapping targeted genetic regions discovered at GWAS-stage; that is, the aim is to pinpoint potential causal variants and susceptibility genes for complex traits and disease outcomes using next-generation sequencing (NGS) technologies. Large-scale GWAS cohorts are necessary to identify target regions given the typically modest genetic effect sizes. In this context, two-phase sampling design and analysis is a cost-reduction technique that utilizes data collected during phase 1 GWAS to select an informative subsample for phase 2 sequencing. The main goal is to make inference for genetic variants measured via NGS by efficiently combining data from phases 1 and 2. We propose two approaches for selecting a phase 2 design under a budget constraint. The first method identifies sampling fractions that select a phase 2 design yielding an asymptotic variance covariance matrix with certain optimal characteristics, for example, smallest trace, via Lagrange multipliers (LM). The second relies on a genetic algorithm (GA) with a defined fitness function to identify exactly a phase 2 subsample. We perform comprehensive simulation studies to evaluate the empirical properties of the proposed designs for a genetic association study of a quantitative trait. We compare our methods against two ranked designs: residual-dependent sampling and a recently identified optimal design. Our findings demonstrate that the proposed designs, GA in particular, can render competitive power in combined phase 1 and 2 analysis compared with alternative designs while preserving type 1 error control. These results are especially evident under the more practical scenario where design values need to be defined a priori and are subject to misspecification. We illustrate the proposed methods in a study of triglyceride levels in the North Finland Birth Cohort of 1966. R code to reproduce our results is available at github.com/egosv/TwoPhase_postGWAS.


Keywords: genetic algorithm; post-GWAS targeted sequencing; practical two-phase studies; statistical fine-mapping; two-phase study design and analysis


Year: 2021 PMID: 34596256 PMCID: PMC9293221 DOI: 10.1002/sim.9211

Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497

INTRODUCTION

Genome-wide association studies (GWASs) have become well-established untargeted approaches for identifying genetic loci that influence the etiology of complex diseases and traits. Single-nucleotide polymorphisms (SNPs) genotyped using GWAS arrays typically lack any known biological function. Consequently, in post-GWAS studies, identifying causal variants and susceptibility genes in GWAS-identified regions of association is the next important step for researchers. Identified variants and genes can become instrumental in personalized medicine, from diagnosis and intervention to drug development and other forms of therapy. Recent advances in next-generation sequencing (NGS) technologies allow investigators to sequence the entire human genome at the base-pair level, but the costs of whole genome sequencing remain high relative to GWAS analysis. Targeted sequencing, which identifies all variants in a region with high confidence, can be cost effective when fine mapping a genetic region identified at the GWAS stage. Indeed, high-density sequence variants in the targeted region are typically in linkage disequilibrium (LD) with strongly associated SNPs from GWAS, making the latter good candidates as auxiliary covariates for subsample selection. Thus, two-phase sampling design and analysis emerges as a suitable cost-reduction technique in the post-GWAS context. The main goal of this strategy is to make inference on incompletely observed sequencing data. At phase 1, GWAS data are collected for everyone in the study. At phase 2, sequencing data are collected only in a subsample of the phase 1 sample. The subsample is selected based on phase 1 information alone (outcome, auxiliary SNPs), making the sequence data missing-by-design in the nonselected individuals. While the majority of the literature on two-phase sampling designs concentrates on effect estimation and hypothesis testing, relatively little attention has been paid to phase 2 sample selection.
Specifically, most of the work examining optimal designs has focused on case-control studies, in which, for example, a balanced design (equal sample distribution across strata) has been recommended as near optimal. Typically, in the design of case-control studies, optimization is performed to determine sampling fractions across predefined strata subject to a budget constraint on the phase 2 sample size. Another approach, described in Zhao et al, seeks to optimize the sampling fraction under simple random sampling by considering the asymptotic relative efficiency of the maximum likelihood estimators from the one- vs two-phase designs. More recently, Tao et al derived general optimal designs of two-phase studies, paying special attention to continuous, binary, and time-to-event outcomes. Specifically, Tao et al demonstrate the relationship between their optimal design (hereafter referred to as TZL) and previously proposed (ranked) designs such as outcome-dependent and residual-dependent sampling (ODS and RDS, respectively). In this report, we propose two approaches for two-phase sample selection in post-GWAS fine-mapping studies. Our previously described methods, when paired with the resulting sample designs, preserve type I error control and are applicable for all distributions in the exponential family. The first approach, LM, extends and adapts previous work primarily developed for case-control studies by solving a constrained optimization problem via Lagrange multipliers using numerical methods. The second approach, GA, exploits the advantages of genetic algorithms (GAs) for discrete optimization with fixed-size subsets. To the best of our knowledge, this work introduces a novel usage of GAs in the context of selecting phase 2 designs. In the next section, we introduce a maximum likelihood framework for design and analysis of two-phase studies and define the two approaches to select a phase 2 subsample.
In addition, we contrast the proposed designs (LM and GA) with ranked designs (ODS, RDS, and TZL). In Section 3 we conduct simulation studies of a quantitative trait (QT) to evaluate the performance of the proposed designs against ranked designs under the ideal scenario in which all design quantities are known in advance. In Section 4, we assess a more practical scenario where the design values are misspecified using simulated data with realistic LD patterns from the 1000 Genomes Project. Our results show that the proposed designs, GA in particular, achieve competitive power against alternative designs under various scenarios. Additionally, in Section 5, we illustrate these methods in an application to the North Finland Birth Cohort of 1966. We conclude with a discussion of the advantages and challenges of the studied approaches as well as potential avenues of future research.

PHASE 2 SAMPLE SELECTION UNDER MAXIMUM LIKELIHOOD

Two‐phase designs in fine‐mapping studies

Let Y be the trait of interest and G be a (potentially causal) sequence variant located in a genomic region identified by GWAS test results. By design, variants in the region of interest are ascertained in only a fraction of individuals. Consequently, two-phase studies consist of a GWAS in phase 1 from which a subsample of individuals is selected; in phase 2, fine-mapping sequence data are collected for the subsample, and a combined analysis is performed using information from phases 1 and 2. In this post-GWAS setting, the trait data (Y) and the GWAS SNP (Z) are observed for every subject in the study. The two-phase design aims to select a subset of informative subjects based on the data available in the GWAS, namely, (Y, Z). Of note, Z can be either an observed or an imputed genotype; in the latter case, the purpose might be to verify the association with sequencing data. Inference on the missing-by-design sequence variants is conducted using all available data. We define the missing indicator R_i = I(i ∈ S), i = 1, ..., N, where N is the number of individuals in the entire phase 1 cohort and S represents the set of subjects selected into the phase 2 subsample. We let S̄ denote the set of (N − n) subjects in the GWAS study who were unselected for phase 2. We specify the selection model for the ith subject through its inclusion probability, π_i = P(R_i = 1 | Y_i, Z_i), whose distribution is characterized by a parameter vector. To operationalize the selection, (Y, Z) can be stratified into K disjoint groups such that π_i = π_k for all i in the kth stratum; that is, all subjects in the kth stratum have equal selection probabilities. R is designed to be conditionally independent of G given Y and Z, that is, the phase 2 selection mechanism dictated by R is completely determined by Y and Z, making G missing at random.
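The stratified selection mechanism described above can be sketched in a few lines. This is only an illustration in Python (the authors' own code is in R): the function name `draw_phase2`, the stratum labels, and the sampling fractions are all hypothetical, and independent Bernoulli draws within strata are an assumed sampling scheme.

```python
import random

def draw_phase2(strata, pi, seed=None):
    """Return selection indicators, drawn independently with
    P(selected) = pi[stratum of subject]. Assumed Bernoulli sampling."""
    rng = random.Random(seed)
    return [1 if rng.random() < pi[s] else 0 for s in strata]

# Hypothetical phase 1 cohort: each stratum is a (trait category, Z genotype) pair.
rng = random.Random(2021)
trait_levels = ["low", "mid", "high"]
strata = [(rng.choice(trait_levels), rng.choice([0, 1, 2])) for _ in range(5000)]

# Hypothetical sampling fractions that oversample the trait extremes.
pi = {(y, z): (0.15 if y != "mid" else 0.02)
      for y in trait_levels for z in [0, 1, 2]}

R = draw_phase2(strata, pi, seed=1)
expected_n = sum(pi[s] for s in strata)
print(f"expected phase 2 size: {expected_n:.1f}, achieved: {sum(R)}")
```

Because selection depends only on the stratum, which is a function of phase 1 data, the unobserved sequence variant is missing at random by construction. The achieved phase 2 size varies from draw to draw around its expectation.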

Maximum likelihood formulation

Let the parametric relationship between (G, Z) and Y be given by a regression model indexed by a vector of regression parameters. Here, the model corresponds to a probability function in the exponential family whose mean is related to a linear predictor in G and Z through a link function. We denote by two separate sets the uniquely observed values of G (in the phase 2 subsample) and Z (in the phase 1 sample). Let the joint probability function of G and Z be given by discrete probabilities, one for each pair of values of G and Z, which are left unspecified. We consider here the nonparametric estimation of the joint distribution of G and Z, with support on the Cartesian product of these two sets. Considering the above, we define the observed-data likelihood, Equation (1), following previous literature. In (1), the proportionality arises since estimation of the model parameters does not involve the selection probabilities, and the log-likelihood follows directly. For simplicity, the formulation above considers G as a single variable; however, it can be extended to a vector with the respective considerations, as illustrated in Section 4 below. We also note that additional phase 1 covariates can be introduced into the parametric model, with corresponding regression coefficients, by assuming that G and these covariates are conditionally independent given Z. In the case of covariates such as genomic principal components to account for population stratification, conditional independence is a working assumption, which does not negatively affect the model performance, as shown in Section 5. If we denote by Φ the full parameter set, then, under regularity conditions, the maximum likelihood estimator of Φ is asymptotically normal with variance-covariance matrix given by 𝕁(Φ)^−1, where 𝕁(Φ) is the expected information matrix, which is a function of the full parameter set as the expectation is taken with respect to the joint distribution of the data; note that G is observed in the phase 2 subsample and missing by design in its complement. The derivation of the expected information matrix is shown in Appendix A.
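For orientation, the observed-data likelihood for a covariate that is missing by design commonly takes the form below. This is a sketch in assumed notation and not necessarily the authors' Equation (1): $R_i$ is the phase 2 selection indicator, $f_{\beta}$ the outcome model, $p_{gz} = P(G = g, Z = z)$ the unspecified joint probabilities, and $\mathcal{G}$ the set of observed values of $G$.

```latex
\begin{align*}
L(\beta, p) &\propto \prod_{i=1}^{N}
  \bigl[ f_{\beta}(Y_i \mid G_i, Z_i)\, p_{G_i Z_i} \bigr]^{R_i}
  \Bigl[ \sum_{g \in \mathcal{G}} f_{\beta}(Y_i \mid g, Z_i)\, p_{g Z_i} \Bigr]^{1 - R_i}, \\
\ell(\beta, p) &= \sum_{i=1}^{N} R_i
  \bigl\{ \log f_{\beta}(Y_i \mid G_i, Z_i) + \log p_{G_i Z_i} \bigr\}
  + \sum_{i=1}^{N} (1 - R_i)
  \log \sum_{g \in \mathcal{G}} f_{\beta}(Y_i \mid g, Z_i)\, p_{g Z_i}.
\end{align*}
```

Under missingness at random, the selection probabilities enter the full likelihood only as a multiplicative factor free of $(\beta, p)$, which is why they can be dropped up to proportionality.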

Post‐GWAS analysis under maximum likelihood

Note that the likelihood in Equation (1) is most useful at the design stage, when no phase 2 subsample has been identified and no data have been collected. However, once these items are available, the likelihood is typically re-expressed as Equation (2). This formulation has been amply studied. Estimates can be obtained via the EM algorithm, and the corresponding asymptotic variance-covariance matrix is computed via Louis' method. In fine-mapping, the aim is to identify and prioritize potential causal variants in a genetic region of interest to allow for follow-up replication and functional studies. This can be achieved under the proposed maximum likelihood (ML) framework as follows: first, the genetic effect of each variant in the targeted region is estimated and tested individually (single-variant analysis); second, genetic effects are estimated and tested in multivariable models (conditional on the strongest single-variant signals). Conditional analysis serves to identify independent signals in the region and to unmask associations that may have been missed in single-variant analysis. These steps are detailed in Section 4.
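To illustrate how an EM algorithm can handle a sequence variant that is missing by design, here is a minimal Python sketch for a toy setting. Everything in it is an assumption made for illustration: a normal linear model of Y on G alone, genotypes coded 0/1/2, a simple random phase 2 subsample, and the function names `simulate_two_phase` and `em_two_phase`. It is not the authors' R implementation and it omits the variance computation via Louis' method.

```python
import math
import random

G_VALUES = (0, 1, 2)

def simulate_two_phase(N, n, maf=0.3, b0=0.0, b1=0.5, seed=1):
    """Toy cohort of [y, z, g] rows; g is None outside a random phase 2 subsample."""
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        g = sum(rng.random() < maf for _ in range(2))
        # Z copies G with probability 0.8, otherwise it is an independent genotype.
        z = g if rng.random() < 0.8 else sum(rng.random() < maf for _ in range(2))
        rows.append([b0 + b1 * g + rng.gauss(0.0, 1.0), z, g])
    selected = set(rng.sample(range(N), n))
    for i, row in enumerate(rows):
        if i not in selected:
            row[2] = None
    return rows

def em_two_phase(rows, max_iter=200, tol=1e-8):
    """EM for Y = b0 + b1*G + N(0, s2) with nonparametric P(G = g, Z = z)."""
    z_values = sorted({z for _, z, _ in rows})
    b0, b1, s2 = 0.0, 0.0, 1.0
    p = {(g, z): 1.0 / (len(G_VALUES) * len(z_values))
         for g in G_VALUES for z in z_values}
    for _ in range(max_iter):
        # E-step: posterior weight of each genotype for subjects with missing G.
        expanded = []
        for y, z, g_obs in rows:
            if g_obs is not None:
                expanded.append((y, z, g_obs, 1.0))
                continue
            dens = [math.exp(-((y - b0 - b1 * g) ** 2) / (2.0 * s2)) * p[(g, z)]
                    for g in G_VALUES]
            total = sum(dens)
            expanded.extend((y, z, g, d / total) for g, d in zip(G_VALUES, dens))
        # M-step: weighted least squares and weighted genotype frequencies.
        sw = sum(w for _, _, _, w in expanded)
        sg = sum(w * g for _, _, g, w in expanded)
        sy = sum(w * y for y, _, _, w in expanded)
        sgg = sum(w * g * g for _, _, g, w in expanded)
        sgy = sum(w * g * y for y, _, g, w in expanded)
        new_b1 = (sw * sgy - sg * sy) / (sw * sgg - sg * sg)
        new_b0 = (sy - new_b1 * sg) / sw
        s2 = sum(w * (y - new_b0 - new_b1 * g) ** 2 for y, _, g, w in expanded) / sw
        p = {key: 0.0 for key in p}
        for _, z, g, w in expanded:
            p[(g, z)] += w / sw
        converged = abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return {"b0": b0, "b1": b1, "s2": s2, "p": p}

rows = simulate_two_phase(N=3000, n=1000, seed=7)
fit = em_two_phase(rows)
print({k: round(v, 3) for k, v in fit.items() if k != "p"})
```

The E-step spreads each unselected subject across the possible genotypes in proportion to the current model, and the M-step refits the regression and the joint genotype probabilities on that weighted data.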

Selecting phase 2 designs

In post-GWAS fine-mapping studies that target an identified genomic region, the costs of sequencing can make it infeasible or inefficient to sequence all subjects available in phase 1, restricting the number of individuals in the phase 2 subsample (n). Here we propose two approaches to select a phase 2 design under a budget constraint and flexible optimality criteria, using Lagrange multipliers or genetic algorithms. In addition, we discuss the specification of such optimality criteria and compare the proposed methods against another class of widely used phase 2 sample selection strategies, the so-called ranked designs.

Lagrange multipliers (LM)

Following previous ideas, we first propose to obtain a phase 2 design for regional fine-mapping studies by minimizing an expression, Equation (3), that combines an optimality criterion, Λ, with a Lagrange multiplier accounting for the budget constraint, and that involves the number of subjects in phase 1 belonging to each of the K strata. Here, we formulate the approach specifically for the ML framework described in Sections 2.2 and 2.3. For our purposes, the regression parameters and the joint distribution of G and Z are design quantities; thus, they need to be specified a priori, leaving the stratum sizes to be determined from phase 1 data alone. Note that this approach aims to find the selection probabilities that minimize Equation (3) for allocating the phase 2 sample across the K strata. The vector of joint probabilities can be interpreted in terms of the (joint) genotype distribution of G and Z, which can be easily specified according to well-established genetic principles, for example, Hardy-Weinberg equilibrium (HWE), or by external data such as the 1000 Genomes Project. Thus, the expected effect size of the sequence variant, β1, becomes the primary parameter to specify.
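A plausible shape for the Lagrangian objective described above is shown below. This is a reconstruction from the verbal description, not a quotation of the authors' Equation (3); the symbols $\pi_k$ (stratum-specific selection probability), $N_k$ (phase 1 stratum size), $\lambda$ (Lagrange multiplier), and the exact form of the constraint are assumptions.

```latex
\[
\mathcal{L}(\pi_1, \dots, \pi_K, \lambda)
  = \Lambda\!\bigl( \mathbb{J}(\Phi)^{-1} \bigr)
  + \lambda \Bigl( \sum_{k=1}^{K} \pi_k N_k - n \Bigr),
\qquad 0 \le \pi_k \le 1 .
\]
```

The first term depends on the $\pi_k$ through the expected information matrix, and the second term forces the expected phase 2 sample size to equal the budget $n$.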

Genetic algorithms (GA)

Genetic algorithms are designed to mimic nature's evolutionary process, in which the fittest members of a population are selected to pass on their genetic information. GAs are powerful tools to optimize a fitness measure or objective function, Λ; overviews can be found in Holland and Whitley. This optimization technique is suitable for a discrete solution space and is performed through a stochastic search by building an initial population of candidate solutions that evolves generationally through pairing, mating, recombining, and mutating the candidate solutions. In our case, these candidate solutions correspond to vectors of N indicator variables, exactly n of which equal 1, denoting whether the ith subject is selected for phase 2. The reasoning behind GA implementation in the context of phase 2 sample selection is twofold: (1) it provides a suitable framework for discrete optimization, and (2) it has proven to be an efficient strategy to find a fittest member in large search spaces (N choose n possibilities in this case). These appealing features of GAs come with some challenges, namely, that there are no clear convergence criteria, tuning parameters need to be specified, and they can be computationally expensive when the objective function is hard to calculate. Nevertheless, the GA approach brings novelty to the field, as it removes the uncertainty due to the sampling variability introduced when utilizing stratum-specific selection probabilities. This is achieved by selecting a vector that characterizes a unique phase 2 subsample and optimizes the fitness measure. Furthermore, GA can forgo strata definition, since the search can be agnostic to specific strata configurations. In GA, the budget constraint can be introduced by a so-called cardinality constraint, which consists of selecting a subset of a required size (n).
This constraint guarantees that the phase 2 sample is exactly of size n, as opposed to methods that depend on selection probabilities, which introduce some variation into the achieved phase 2 sample size. To date, there are several implementations of GAs available in the R statistical language, namely the packages GA, genalg, kofnGA, mcga, mco, and NMOF. Of these, only kofnGA is specifically designed for fixed-size subset selection with a flexible specification of the objective/fitness function; see Wolters for a detailed explanation of the package. In any GA, it is important to consider the set of control parameters necessary to implement decision rules at each step. Specifically, kofnGA requires the user to specify the objective/fitness function (ie, Λ), the subset size (n), and the number of candidates (N), while additional control parameters related to algorithmic design have default (but adjustable) settings. These control parameters are: population size, number of generations, size of selection tournament, mutation rate, and number of elites. A pseudo-algorithm relating standard GA terminology to the two-phase design, along with a description of the steps where the control parameters are used, is presented in Algorithm 1. The population is the pool of candidate solutions at each iteration from which the fittest members, that is, members with optimal Λ values, will ultimately be selected. The number of generations denotes the number of iterations the algorithm will run for. The size of the selection tournament determines the number of members of the population selected to produce the next generation. The mutation rate determines the probability at which random swaps in the indexes of the candidate solutions occur in the population. Lastly, elites are the fittest members in a given generation that are kept in the next generation. The algorithm parameters can be tuned by the user in accordance with the problem at hand.
For simplicity, two of these parameters can be re-expressed as proportions of the population size. Because kofnGA does not implement a stopping rule, the algorithm iterates for as many times as specified by the provided number of generations. To accelerate convergence, we set two of the control parameters at high levels, as suggested by Wolters. This approach may diminish the search improvements derived from mutations, relying more heavily on the initial population and number of generations. Therefore, instead of setting a completely random initial population (pop in Algorithm 1), we initialize it with an equal number of samples from the top 20 performers (based on the fitness value) out of 100 draws of each of the balanced, combined, and LM sampling designs, plus the RDS design. This strategy guarantees that GA has at least the same performance as the RDS design. Additional considerations for the optimization strategies in LM and GA, as well as further details on the balanced and combined designs, can be found in Sections S2 and S3, respectively (Online Supplementary Material). It is also worth noting that the proposed approaches are feasible in any context where Y can be modeled within the exponential family and both G and Z are discrete (or easy to discretize) covariates.
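The roles of the control parameters can be made concrete with a toy fixed-size subset GA. The Python sketch below is a generic illustration, not kofnGA and not the authors' Algorithm 1: the function `subset_ga`, its argument names, the crossover rule (sampling the child from the union of two parents), and the stand-in fitness function are all assumptions. In the paper's setting the fitness would be the optimality criterion evaluated on the variance-covariance matrix.

```python
import random

def subset_ga(N, n, fitness, pop_size=30, n_gen=40, tourn_size=4,
              mut_prob=0.02, n_elite=2, seed=0):
    """Minimize fitness(subset) over size-n subsets of range(N).
    Returns (best subset, its fitness, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [sorted(rng.sample(range(N), n)) for _ in range(pop_size)]
    history = []

    def tournament(scored):
        # Pick tourn_size random members and return the fittest of them.
        return min(rng.sample(scored, tourn_size), key=lambda t: t[0])[1]

    for _ in range(n_gen):
        scored = sorted(((fitness(s), s) for s in pop), key=lambda t: t[0])
        history.append(scored[0][0])
        next_pop = [s for _, s in scored[:n_elite]]  # elites survive unchanged
        while len(next_pop) < pop_size:
            parent_a, parent_b = tournament(scored), tournament(scored)
            pool = sorted(set(parent_a) | set(parent_b))
            child = set(rng.sample(pool, n))  # crossover keeps exactly n indices
            for idx in sorted(child):
                if rng.random() < mut_prob:  # mutation: swap one index out
                    outside = rng.choice([j for j in range(N) if j not in child])
                    child.remove(idx)
                    child.add(outside)
            next_pop.append(sorted(child))
        pop = next_pop
    best = min(pop, key=fitness)
    return best, fitness(best), history

# Stand-in fitness: prefer subsets with large squared "residuals" (hypothetical data).
data_rng = random.Random(42)
resid = [data_rng.gauss(0.0, 1.0) for _ in range(60)]

def toy_fitness(subset):
    return -sum(resid[i] ** 2 for i in subset)

best, value, history = subset_ga(N=60, n=10, fitness=toy_fitness)
print(best, round(value, 3))
```

Every candidate has exactly n members at all times, which mirrors the cardinality constraint, and keeping the elites makes the best fitness non-increasing across generations. Seeding the random generator makes a run reproducible.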

Specifying an optimality criterion

There are several ways to define a functional Λ(·) as the optimality criterion/fitness measure, mostly grounded in experimental design. In this report, we explore three criteria: A-optimality, D-optimality, and parameter-specific. Each criterion focuses on different features of the variance-covariance matrix (VCM), 𝕁(Φ)^−1 (Table 1). The parameter-specific criterion is optimal to identify designs with minimum variance when testing a single parameter. Similarly, the A- and D-criteria are optimal to identify designs with minimum average variance across all parameters and minimum volume of the confidence ellipsoid, respectively.

TABLE 1

Description of the three optimality criteria evaluated. 𝕁(Φ)^−1 denotes the variance-covariance matrix (VCM).

Λ(·)	Formula	Description
A-optimality	∑diag(𝕁(Φ)^−1)	Minimizes the average variance of the parameter estimates
D-optimality	det(𝕁(Φ)^−1)	Minimizes the generalized variance (the product of the variances when the VCM is diagonal)
Parameter-specific	𝕁(Φ)^−1[β1,β1]	Minimizes the variance of a particular entry in the VCM

In the outlined post-GWAS setting, the focus lies on testing a single parameter, β1. Thus, the parameter-specific criterion is the most natural choice. However, when multiple parameters are of interest, that is, when the parameter of interest is a vector, A-optimality may be preferred if the parameters of interest are loosely correlated, whereas D-optimality may be preferred when the parameters are strongly correlated. This makes intuitive sense when there are two (or more) estimators but a combination of them is of interest. The variance of that combination then includes the covariance between the estimators, thus involving off-diagonal elements of the VCM. We evaluate potential differences in the choice of optimality criterion in Section 4.
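The three criteria in Table 1 are simple functions of the inverse information matrix. The Python sketch below computes them for a small, made-up information matrix; the helper names (`invert`, `determinant`, `a_criterion`, `d_criterion`, `parameter_specific`) and the example matrix are illustrative assumptions, and the linear algebra is written out with the standard library only.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def determinant(matrix):
    """Determinant via Gaussian elimination with row swaps."""
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return det

def a_criterion(info):
    """Sum of the diagonal of the inverse information (total variance)."""
    vcm = invert(info)
    return sum(vcm[i][i] for i in range(len(vcm)))

def d_criterion(info):
    """Determinant of the inverse information (generalized variance)."""
    return determinant(invert(info))

def parameter_specific(info, index):
    """Variance of one parameter: a single diagonal entry of the inverse."""
    return invert(info)[index][index]

# Hypothetical 2x2 information matrix; index 1 plays the role of beta_1.
info = [[2.0, 1.0], [1.0, 2.0]]
print(a_criterion(info), d_criterion(info), parameter_specific(info, 1))
```

A design search such as LM or GA would evaluate one of these functions on the information matrix implied by each candidate design and keep the design with the smallest value.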

Ranked designs

Recently, Tao et al proposed general optimal designs for phase 2 studies. In this section, we describe their approach, summarize their findings, and draw comparisons with the proposed designs, LM and GA. Despite their names, outcome-dependent sampling (ODS) and residual-dependent sampling (RDS), as defined by Tao et al, are not sampling designs in the classical sense, because their specification is independent of any sampling mechanism. This is also true for the TZL design. We refer to them as ranked designs because they are defined in terms of ordered quantities: the outcome, residuals, and scaled residuals for ODS, RDS, and TZL, respectively. Tao et al show that the scaling factor in TZL depends on the conditional variance of G given Z, which is unknown at the design stage and thus needs to be specified prior to phase 2. An intuition for why this scaling factor matters for the optimal design, provided by Tao et al, is that G is harder to recover from Z when this conditional variance is large, and thus these subjects need to be oversampled. The ranked designs achieve a given phase 2 sample size by selecting an equal number of subjects from each of the top and bottom rankings of the outcome, residuals, or scaled residuals. This particularity makes them appealing for a few reasons: (1) the ranked designs are unique, (2) the selection is intuitive and can be performed quickly, and (3) for QTs, no stratification on the outcome is required. Tao et al show that the TZL design reduces to the RDS design when G and Z are independent (and RDS reduces to ODS when Y and Z are independent); in Sections 4 and 5, we investigate the effect of misspecifying the scaling factor.

There are five main underlying differences between the proposed designs (LM and GA) and the ranked designs, particularly TZL. (1) LM depends on the stratification strategy undertaken for the outcome, while none of the ranked designs depends directly on outcome stratification for QTs. GA, on the other hand, can in principle be performed without defining any stratification; however, selection of initial values may depend on values drawn from LM or other designs to accelerate convergence, which could introduce some dependency on a chosen stratification. (2) LM provides optimal sampling fractions, and thus a sample must be drawn accordingly, subjecting this design to sampling variability. GA selects a unique solution with optimal Λ value, but, given the stochastic nature of the genetic algorithm, this solution is approximate and varies at each run unless a random seed is specified. In contrast, the ranked designs are not subject to sampling variation. (3) TZL and the proposed designs (under the parameter-specific criterion) seek to minimize the variance of β̂1. In the case of TZL, this is achieved by maximizing the inverse of the efficiency bound for estimating β1 with one observation in theorem 1 of Tao et al. Note that the proof of this theorem relies on the assumption that Y and G are approximately independent given Z, which is justified when the effect of G on Y is small, that is, β1 = o(1). LM and GA, on the other hand, do not depend on the small-β1 assumption and can thus be implemented in more general settings. (4) Relatedly, LM and GA rely on an empirical approximation to the information matrix, whereas the variance considered in TZL uses an exact expected information under the working assumption of a small β1. This defines a trade-off between the generality of LM and GA and the increase in efficiency of TZL when the assumption is justified. (5) LM and GA can optimize general functions beyond the variance of β̂1 through Λ(·), whereas the results in TZL are mostly concerned with the variance of β̂1 (or functions thereof).

It remains unclear what constitutes a good stratification for LM; intuitively, LM should approximate TZL as the number of strata approaches the phase 1 sample size. However, a theoretical proof is beyond the scope of this paper. To circumvent the sampling variability issue in LM, one could draw a predetermined number of subsamples and select the one with optimal Λ value. Regarding whether restricted or unrestricted parameter values are preferred, Tao et al show that TZL performs well for alternatives close to the null. Although these alternatives are typical for genetic studies, a more comprehensive comparison for alternatives farther away from the null is warranted. Lastly, LM and GA can, in fact, approximate the optimization strategy in TZL by utilizing an alternative variance expression instead of the full inverse information in the objective function, obtained by partitioning the information matrix with respect to the parameter of interest. It is worth noting that the dimension of this expression corresponds to that of the subspace determined by the null hypothesis of interest. In the simplest case of β1 being a scalar, the expression is also a scalar. In LM and GA, higher dimensions can be easily accommodated by specifying a different Λ(·) to obtain, say, A- or D-optimal designs.
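The ranked selection rule described in this section is simple enough to sketch directly. The Python example below illustrates an RDS-style selection under stated assumptions: residuals come from an ordinary least-squares fit of Y on Z, the phase 2 size is split evenly between the two tails, and the function name `rds_select` and the toy data are hypothetical. A TZL-style variant would additionally scale each residual before ranking, which is not shown here.

```python
def rds_select(y, z, n):
    """Indices of the n subjects with the most extreme residuals from a
    simple linear regression of y on z: half from each tail."""
    N = len(y)
    if n > N:
        raise ValueError("phase 2 size cannot exceed phase 1 size")
    mean_z = sum(z) / N
    mean_y = sum(y) / N
    szz = sum((zi - mean_z) ** 2 for zi in z)
    slope = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / szz
    intercept = mean_y - slope * mean_z
    resid = [yi - intercept - slope * zi for yi, zi in zip(y, z)]
    order = sorted(range(N), key=lambda i: resid[i])  # ascending residuals
    n_low = n // 2
    n_high = n - n_low
    return sorted(order[:n_low] + order[N - n_high:])

# Toy phase 1 data: six subjects, genotype Z coded 0/1/2.
z = [0, 1, 2, 0, 1, 2]
y = [-3.0, 0.0, 2.5, 1.0, 4.0, 1.5]
print(rds_select(y, z, 2))
```

The rule is deterministic given the phase 1 data, which is why ranked designs carry no sampling variability.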

SIMULATION STUDIES

In this section, we describe the data generation steps and analysis plan, and report the results of an initial set of simulations. The main objective is to compare the statistical power of the proposed phase 2 designs, LM and GA, in a post-GWAS fine-mapping scenario by testing for the effect of G (the missing-by-design variable), that is, testing the null hypothesis β1 = 0. In addition, we compare LM and GA against two ranked designs: TZL and RDS. Of note, we exclude ODS from these studies given the known indirect association between Z and Y at the GWAS stage. Estimates and standard errors are constructed following Equation (2) in Section 2.3. For comparability with RDS and TZL, we use the alternative variance expression described in Section 2.4.4 in the parameter-specific optimality criterion for LM and GA, given that this variance estimate does not depend on the small-effect assumption. Additional numerical studies comparing LM and GA against alternative heuristic designs are found in Section S3, Online Supplementary Material.

Data generation

We assume a data generating mechanism similar to Espin-Garcia et al. Briefly, for a phase 1 sample size (N), and given values for the minor allele frequencies (MAFs) of G and Z and for the linkage disequilibrium (LD) between them, quantified through the Pearson correlation coefficient, r, we simulate two variants on the same haplotype under Hardy-Weinberg equilibrium (HWE): G and Z. Here, the MAFs are the frequencies of the less common allele in the population for G and Z, respectively, whereas LD is the level of correlation between them. Notably, since the actual allele frequencies cannot be negative and the additive linkage disequilibrium coefficient D is constrained, not all combinations of r and the two MAFs can occur. The trait of interest is then generated as a function of G and a random error term. We note that in this setting, as opposed to Section 2.2, Z is assumed to be conditionally independent of Y given G. This simulation setup aims to resemble a more realistic scenario in which the GWAS SNP, Z, is not causal itself but rather is in linkage disequilibrium with the causal variant, G. To imitate the GWAS setting in each dataset, we test the genetic effect of Z for association in a regression model of Y on Z and only keep replicates that meet a suggestive genome-wide significance criterion for the hypothesis of no association. Lastly, to study type 1 error (T1E) under this data-generation mechanism, we simulate another SNP independently from Z with a prespecified MAF. Strata for Y are defined by discretizing the trait values into a three-category variable according to fixed cut points, taken as the percentiles (40, 60) of a normal distribution with a specified mean and variance that reflect the trait distribution under the null. Strata for the biallelic GWAS SNP, Z, are defined by considering Z as a three-category variable corresponding to genotypes and coded by the number of copies of the minor allele (a), that is, 0, 1, or 2 (additive association).
Of note, stratification by the discretized trait and Z is only employed for optimization in LM and for visualization to compare the distribution of selected individuals under other designs.
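The constraint that not every combination of MAFs and r is attainable can be seen by writing out the haplotype frequencies. The Python sketch below uses the standard two-locus parameterization, with D equal to r times the square root of the product of the four allele frequencies. The function names, the default parameter values, and the linear trait model with normal errors are assumptions made for illustration rather than the authors' exact settings.

```python
import math
import random

def haplotype_freqs(maf_g, maf_z, r):
    """Frequencies of the four (G allele, Z allele) haplotypes, 1 = minor allele.
    Raises ValueError when the requested LD is incompatible with the MAFs."""
    d = r * math.sqrt(maf_g * (1 - maf_g) * maf_z * (1 - maf_z))
    freqs = {
        (1, 1): maf_g * maf_z + d,
        (1, 0): maf_g * (1 - maf_z) - d,
        (0, 1): (1 - maf_g) * maf_z - d,
        (0, 0): (1 - maf_g) * (1 - maf_z) + d,
    }
    if min(freqs.values()) < -1e-12:
        raise ValueError("this combination of MAFs and r cannot occur")
    return {h: max(f, 0.0) for h, f in freqs.items()}

def simulate_cohort(N, maf_g, maf_z, r, beta0=0.0, beta1=0.3, sigma=1.0, seed=None):
    """Draw (y, g, z) for N subjects: two haplotypes each (HWE), and an
    assumed linear trait model y = beta0 + beta1 * g + normal error."""
    rng = random.Random(seed)
    freqs = haplotype_freqs(maf_g, maf_z, r)
    haps = list(freqs)
    weights = [freqs[h] for h in haps]
    cohort = []
    for _ in range(N):
        h1, h2 = rng.choices(haps, weights=weights, k=2)
        g = h1[0] + h2[0]  # minor-allele count at the sequence variant
        z = h1[1] + h2[1]  # minor-allele count at the GWAS SNP
        y = beta0 + beta1 * g + rng.gauss(0.0, sigma)
        cohort.append((y, g, z))
    return cohort

cohort = simulate_cohort(1000, maf_g=0.2, maf_z=0.2, r=0.5, seed=3)
print(cohort[0])
```

Under this setup Z affects Y only through its correlation with G, matching the conditional independence of Z and Y given G described above.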

Assessing the phase 2 designs

The first set of evaluations consists of the following. We specify a phase 1 sample size of N = 5000 and phase 2 sample sizes of n = 540, 1250, and 2500, that is, 0.108, 0.25, and 0.50 of the phase 1 data, respectively. We draw 1250 replicates for each combination of simulation parameters. We evaluate the performance of the proposed designs against two ranked designs, RDS and TZL, across three statistical tests (Wald, likelihood ratio [LR], and score). Since β1 is a scalar, the comparison against ranked designs only examines a parameter-specific criterion, as a consequence of the variance expression used in the objective function, as mentioned in Section 2.4.4. To compare power, we assess the ratio of the empirical power of each design over that of the complete data case (relative empirical power, rEP). In addition, estimation efficiency is compared via the relative asymptotic and empirical standard errors (rASE and rESE, respectively) of β̂1 for each design over those of the complete data. We deem these measures to provide a better reflection of the design performance compared with studies that benchmark against simple random sampling. Note that the closer these ratios are to 1 (100%), the better the studied designs are able to recover the performance of the complete data analysis. The design regression parameters are specified under the null hypothesis for the effect of G, with the remaining parameters set to the maximum likelihood estimates (MLEs) from the GWAS model. Similarly, we specify the design haplotype distribution between G and Z under HWE by estimating the MAF of Z from the phase 1 sample and designating the MAF of G and r to be equal to their generating values. These design values are used to determine the LM, GA, and TZL designs but not RDS, which is agnostic to these design quantities. Although correct specification of the design quantities is hardly ever attainable, the settings above allow us to evaluate the true type 1 error and power of the studied designs.
We discuss in the next section how to proceed in practice when the true design values are unavailable. Moreover, by specifying the regression parameters under the null hypothesis, the design problem greatly simplifies to specifying values only for the joint distribution of G and Z.

Results

The ranked designs can be specified without strata definitions; however, for visualization and comparison purposes, we plot the distribution of RDS and TZL according to the predetermined strata. When comparing LM and GA against the ranked designs for the smallest and largest studied genetic effects (β1 = 0.1 and β1 = 0.7), we observe that: (1) LM, GA, and TZL vary considerably across values of β1 and n; (2) LM displays a more unstable strata distribution when compared against GA and the ranked designs, especially for the smaller phase 2 sample sizes; (3) GA closely follows the RDS design in some settings; (4) LM and GA reach an approximately equal strata distribution in some settings; and (5) as expected, the strata distribution of the RDS design remains practically unchanged between genetic effect sizes and phase 2 sample sizes (Figure 1).

FIGURE 1

Mosaic plots with the average strata sizes across replicates for the proposed designs against ranked designs across phase 2 sample sizes under the parameter-specific criterion. Averages were taken from the resulting designs in the main simulation study for the two most extreme values of β1 (0.1, 0.7).

At the 1% level, type 1 error (T1E) rates are well controlled across the three tests in most cases. For LM, the observed T1E rates of the Wald test are slightly anti-conservative at the smallest phase 2 sample size and stabilize around the nominal rate as n increases (Table 2). Closer inspection of the p-value distribution under LR displays no gross departure from the expected uniform distribution (Figure S1). Overall, we observed that the LR statistic showed better behavior than the score and Wald statistics, even under small sample sizes, especially under the LM design (Table 2). Empirical bias is well centered around zero overall and decreases as n increases for all designs when the true value of β1 is small (Figure 2). However, for larger values of β1, LM and TZL show biased estimates in some settings, and the bias worsens as β1 increases (Figure 2). All designs show relatively close agreement between (r)ASE and (r)ESE across values of β1 and n (Tables S1 and S2). TZL shows values of rASE and rESE closest to 1, with GA second, RDS third, and LM last. The GA and RDS designs exhibit adequate coverage, while the coverage for LM and TZL worsens as β1 increases in some settings (Table S3).

TABLE 2

Type 1 error (T1E) rates along with their corresponding 99% Clopper‐Pearson confidence intervals () across studied designs, phase 2 sample sizes () and statistical tests under a parameter‐specific criterion

n	Test	LM	GA	RDS	TZL
540	Wald	1.45 (1.23‐1.70)	1.17 (0.97‐1.40)	1.10 (0.91‐1.32)	1.01 (0.83‐1.22)
	LR	1.02 (0.84‐1.24)	1.14 (0.94‐1.36)	1.10 (0.90‐1.32)	0.97 (0.79‐1.17)
	Score	0.94 (0.76‐1.14)	1.10 (0.90‐1.32)	1.07 (0.88‐1.29)	0.95 (0.77‐1.15)
1250	Wald	1.06 (0.87‐1.28)	1.15 (0.95‐1.37)	1.12 (0.93‐1.34)	1.02 (0.84‐1.24)
	LR	0.94 (0.76‐1.14)	1.14 (0.95‐1.37)	1.12 (0.93‐1.34)	0.99 (0.81‐1.20)
	Score	0.90 (0.72‐1.10)	1.13 (0.94‐1.35)	1.10 (0.90‐1.32)	0.96 (0.78‐1.17)
2500	Wald	1.00 (0.82‐1.21)	1.08 (0.89‐1.30)	1.08 (0.89‐1.30)	1.09 (0.89‐1.30)
	LR	0.97 (0.79‐1.18)	1.06 (0.87‐1.27)	1.07 (0.88‐1.29)	1.07 (0.88‐1.29)
	Score	0.96 (0.78‐1.17)	1.05 (0.86‐1.27)	1.07 (0.88‐1.29)	1.06 (0.87‐1.27)

Note: Each entry represents 17 500 replicates pooled across empirical null scenarios. The rest of the simulation parameters correspond to , , , , , . The complete data T1E rate is 1.16 (0.97‐1.37) for Wald/LR tests and 1.14 (0.95‐1.35) for the score test. To further evaluate test validity under the studied sample sizes, we plot histograms of the observed LR test p‐values in Figure S1.
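The exact (Clopper‐Pearson) intervals reported in Table 2 can be checked with a short calculation. The sketch below is a minimal standard-library illustration, not the authors' code: it finds the interval limits by bisection on the binomial distribution function. The rejection count of 254 is an assumption, back-calculated from the reported 1.45% of 17 500 replicates for the Wald test under LM at n = 540.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(x, n, level=0.99):
    """Exact two-sided confidence interval for a binomial proportion."""
    alpha = 1.0 - level

    def solve(k, target):
        # binom_cdf(k, n, p) decreases in p, so bisect for the p that hits the target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower limit solves P(X >= x | p) = alpha / 2, i.e. P(X <= x - 1 | p) = 1 - alpha / 2.
    lower = 0.0 if x == 0 else solve(x - 1, 1.0 - alpha / 2.0)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else solve(x, alpha / 2.0)
    return lower, upper

# Assumed count: 1.45% of 17 500 replicates rounds to 254 rejections.
rejections, replicates = 254, 17500
ci_low, ci_high = clopper_pearson(rejections, replicates)
print(f"{100 * rejections / replicates:.2f} ({100 * ci_low:.2f}-{100 * ci_high:.2f})")
```

Under these assumptions the printed interval should land close to the corresponding Table 2 entry; small discrepancies would be expected if the true rejection count differs from the assumed one.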

FIGURE 2

Boxplots for the distribution of the bias across genetic effect estimates () in the studied designs under a parameter‐specific criterion. Row facets denote different true values (0, 0.1, 0.3, 0.5, 0.7).

Power curves under the LR test at level, show that TZL consistently demonstrates the highest power across values of n, with GA second, RDS third, and LM having the lowest power. Notably, all designs reach similar power when (Figure S2). Interestingly, not all methods show power increases at the same rate due to the differences in efficiency across designs. Additional simulations for a larger phase 1 sample size () and similar selection fractions () result in analogous type 1 error and power results among designs, suggesting that testing performance is contingent upon sampling fraction and not phase 2 sample size (Section S4.1, Online Supplementary Material). Under the LR test, the rEP is highest for TZL across values of and n. GA shows higher power than RDS across virtually all scenarios, while LM comes last when but reaches similar power to GA when (Table 3).

TABLE 3

			Relative empirical power (rEP)
n	β1	Complete (EP)	LM	GA	RDS	TZL
540	0.225	58.1	2.9	9.2	6.7	32.5
	0.250	80.6	4.5	14.8	11.7	40.4
	0.300	98.2	15.0	38.7	35.5	74.2
	0.400	100.0	69.2	94.2	91.8	99.6
	0.500	100.0	97.1	100.0	99.9	100.0
1250	0.225	58.1	36.0	44.1	45.5	82.1
	0.250	80.6	44.4	57.5	56.0	87.4
	0.300	98.2	77.0	85.7	84.9	97.0
	0.400	100.0	99.6	99.9	99.9	100.0
	0.500	100.0	100.0	100.0	100.0	100.0
2500	0.225	58.1	96.0	96.1	86.4	95.9
	0.250	80.6	96.8	97.1	90.8	98.6
	0.300	98.2	99.3	99.2	98.2	99.6
	0.400	100.0	100.0	100.0	100.0	100.0
	0.500	100.0	100.0	100.0	100.0	100.0

Note: Column “Complete” corresponds to the estimated power of the complete data (ie, the denominator of the rEP ratio). Phase 1 sample size is whereas phase 2 sample size is . These results exclude values of lower than 0.225 and greater than 0.5 since the power of the complete data was less than 50% for the former and had already reached 100% for latter. The rest of the simulation parameters correspond to , , , , and .

Relative empirical power (rEP), calculated as the ratio of the empirical power of each studied design over that of the complete data, across studied designs, phase 2 sample sizes, and effect sizes under the LR test () for the ideal scenario of correctly specifying

Besides the additional simulations on different phase 1 and 2 sample sizes, we also studied the influence of different specifications of the joint distribution of G and Z. In summary, these results are analogous to those reported above, with GA showing competitive power when compared against alternative designs (Section S4.2, Online Supplementary Material).

TWO‐PHASE STUDY DESIGN IN PRACTICE

For LM, GA, and TZL designs, specifying different design quantities, , will lead to different phase 2 subsamples. Little attention has been paid to the practical considerations entailed in choosing a study design. One practical strategy is to make an educated guess for the design quantities; another is to consider a range of plausible values. Though adaptive/sequential designs may be feasible in some circumstances, in the post‐GWAS setting processing data by batch may be operationally inefficient. In addition, although a sequential strategy will provide more precise design parameters, it will not necessarily help solve the outstanding issue of selecting a unique phase 2 sample, given that multiple, more precise estimates will potentially be identified. Therefore, we propose a strategy that relies only on phase 1 data to select a unique phase 2 sample when a range of effect sizes, allele frequencies, and LD values can be considered at the design stage.

A grid search procedure to select a unique phase 2 subsample

It is likely that there will be uncertainty about the specification of the effect size () and haplotype distribution () at the design stage, so we must consider a range of probable values and define a grid of intermediate points inside this range. Let , be the set of probable values or design quantities of interest. Each design quantity will yield an optimal phase 2 subsample, P2S, for the second stage. Thus, to select a unique design under the set we propose the following procedure, which is motivated by robustness considerations. Given an optimality criterion, :

(1) for each h, obtain a phase 2 subsample, namely, P2S, via LM/GA or otherwise by optimizing ;
(2) given P2S, calculate for ;
(3) compute , where is a summary function, for example, mean or median;
(4) select the P2S with minimum .

This procedure will identify a unique design from the ones generated using alternative specifications . To better understand the proposed procedure, let us assume that we are interested in comparing two designs, P2S and P2S, which are optimal when the design values are or , respectively. In order to select the best design, we adopt a criterion based on robustness. In other words, we are interested in determining which one of these two designs exhibits an overall superior performance when differs from its generating design value. To this end, we compute the fitness function for each design and each with and select the design that achieves the best average (or median) performance. The formal description simply extends this principle to comparing H designs and selecting the most robust one. Ultimately, regional sequencing data will be collected for the subsample from the resulting design alone. Once data are collected, statistical fine‐mapping can be conducted following the analytic strategy described in Section 2.3.
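The selection procedure described above can be sketched in a few lines. The code below is a hypothetical illustration rather than the authors' implementation: `build_design` stands in for an LM/GA optimizer, `fitness` for the optimality criterion (smaller is better), and the toy callables in the usage example are invented purely to make the sketch runnable. It assumes each design is evaluated at the design values other than the one that generated it.

```python
from statistics import mean

def select_robust_design(design_values, build_design, fitness, summary=mean):
    """Choose the phase 2 subsample that performs best across misspecified design values.

    design_values: candidate design quantities (the grid); at least two are required.
    build_design:  callable h -> subsample optimized for design value h (e.g., via LM or GA).
    fitness:       callable (subsample, h) -> criterion value, smaller is better.
    summary:       callable aggregating a list of criterion values (mean, median, ...).
    """
    if len(design_values) < 2:
        raise ValueError("need at least two candidate design values")
    best_index, best_score, best_design = None, float("inf"), None
    for i, h in enumerate(design_values):
        design = build_design(h)
        # Assumption: evaluate the design under every other candidate design value.
        cross = [fitness(design, other)
                 for j, other in enumerate(design_values) if j != i]
        score = summary(cross)
        if score < best_score:
            best_index, best_score, best_design = i, score, design
    return best_index, best_score, best_design

# Toy usage: a "design" is just the value it was tuned for, and the criterion is the
# squared distance between that value and the value under which it is evaluated.
grid = [0.1, 0.3, 0.5]
index, score, design = select_robust_design(
    grid,
    build_design=lambda h: h,
    fitness=lambda d, h: (d - h) ** 2,
)
print(index, round(score, 4))
```

In this toy setting the middle grid point is selected because it is never far from any alternative value, which is the robustness idea behind the procedure. Swapping `summary` for `statistics.median` gives the median-based variant.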

Simulation under realistic LD patterns

The purpose of this simulation study is to evaluate the studied designs when the values of are unknown and a range of values for is considered instead. In this study, we generate data under a scenario where multiple “causal variants” and a realistic LD structure from a targeted region were considered. Specifically, we select four loci on chromosome 16 as causal variants, with corresponding effect sizes −0.20, 0.12, 0.25, and −0.15 (Table 4) at hg19 positions 56989830, 56993324, 56994990, and 56995236, and designate rs247617 (pos. 56990716) as the GWAS SNP (hereafter all positions are truncated to the last five digits). We then generate a QT, Y, across 500 replicates following , where , and . Details of the data generation are provided in Section S4.5, Online Supplementary Material.

Selecting phase 2 samples under prespecified sets of design quantities

Since multiple causal variants are assumed in this section, we ascertain the performance of the studied designs under alternative optimization criteria in addition to the parameter‐specific criterion, specifically under A‐ and D‐optimality. This allows to be treated as a vector at the design stage, that is , with being a zero vector. As before, correspond to the MLEs of the regression model , that is, GWAS MLEs. In the simulated data, range (1.68‐1.78), range (0.097‐0.176) across the 500 replicates. As mentioned in Section 2.4.4, it is straightforward to modify the optimality criterion for LM and GA. For TZL, no specific results were provided under alternative optimality criteria. However, Tao et al discussed that since is a matrix when is a vector, it was sufficient to replace the scaling factor (when is a scalar) with . Additionally, for LM and GA, the optimization is performed using , as this approach showed the best performance in the first simulation study when the true values are close to the null hypothesis (Section 3.2). A unique design is then selected for LM, GA, and TZL considering multiple (misspecified) values of following Section 4.1. We also considered RDS in this simulation study; however, since RDS does not depend on in any way, it was not determined using the procedure outlined in Section 4.1. For each replicate, we select a phase 2 sample of size . The specification of under parameter‐specific, A‐, and D‐optimality criteria is described below.
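To make the three criteria concrete, the sketch below evaluates each one on a variance‐covariance matrix using their standard definitions: A‐optimality as the trace and D‐optimality as the determinant. Reading the parameter‐specific criterion as the variance of a single parameter of interest is an assumption about how it is operationalized here. The function names and the two example matrices are hypothetical and do not come from the simulation study.

```python
def parameter_specific(cov, index=0):
    """Variance of one parameter of interest (assumed reading of the criterion)."""
    return cov[index][index]

def a_criterion(cov):
    """A-optimality: trace of the variance-covariance matrix (smaller is better)."""
    return sum(cov[i][i] for i in range(len(cov)))

def d_criterion(cov):
    """D-optimality: determinant of the variance-covariance matrix (smaller is better)."""
    m = [row[:] for row in cov]          # work on a copy
    n = len(m)
    det = 1.0
    for c in range(n):
        # Gaussian elimination with partial pivoting.
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-15:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det                   # a row swap flips the sign
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return det

# Hypothetical asymptotic covariance matrices for two candidate phase 2 designs.
cov_design_1 = [[0.04, 0.01], [0.01, 0.09]]
cov_design_2 = [[0.05, 0.00], [0.00, 0.06]]

for name, cov in (("design 1", cov_design_1), ("design 2", cov_design_2)):
    print(name, parameter_specific(cov), a_criterion(cov), d_criterion(cov))
```

In this made-up example the first design is preferred under the parameter‐specific criterion while the second is preferred under both A‐ and D‐optimality, which illustrates why the choice of criterion can change the selected subsample.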

Parameter‐specific criterion

Under this criterion, is a scalar. Thus, we specify assuming each of the resulting combinations between the following: , and , where denote the first, second, and third quartiles of (MAF) or (LD between Z and G) across the 29 sequence variants in the region; for example, is the median MAF value across seq‐SNPs in the fine‐mapped region, while denotes the 25th percentile across correlation values between the GWAS‐SNP, Z, and the seq‐SNPs, G.
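A grid of this kind, built from all combinations of candidate effect sizes with the quartiles of MAF and LD across the sequence variants, could be assembled as in the sketch below. All numerical inputs are placeholders chosen for illustration, since the values used in the study are not reproduced here, and the helper name `quartile_grid` is hypothetical.

```python
from itertools import product
from statistics import quantiles

def quartile_grid(effect_sizes, mafs, lds):
    """Cartesian product of candidate effect sizes with the MAF and LD quartiles."""
    # method="inclusive" treats the observed variants as the full population of interest.
    maf_q1, maf_q2, maf_q3 = quantiles(mafs, n=4, method="inclusive")
    ld_q1, ld_q2, ld_q3 = quantiles(lds, n=4, method="inclusive")
    return list(product(effect_sizes, (maf_q1, maf_q2, maf_q3), (ld_q1, ld_q2, ld_q3)))

# Placeholder inputs (hypothetical): candidate effect sizes, per-variant MAFs and
# per-variant LD values with the GWAS SNP.
candidate_effects = [0.1, 0.2]
variant_mafs = [0.0, 0.1, 0.2, 0.3, 0.4]
variant_lds = [0.0, 0.2, 0.4, 0.6, 0.8]

grid = quartile_grid(candidate_effects, variant_mafs, variant_lds)
print(len(grid), grid[0])
```

Each tuple in the resulting list is one candidate set of design quantities, to which a design-selection routine would then be applied.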

A‐ and D‐optimality criteria

Under these criteria, is assumed to be a vector of size 2. Thus, we specify assuming each of the resulting combinations between the following: , , , , and , as before, and denote the first and third quartiles of (MAF), (LD between Z and G), or (LD between G and ) across the 29 sequence variants in the region. In the simulated data, ; ; and . The average sample distribution of the resulting phase 2 designs across the 500 replicates is portrayed via mosaic plots (Figures S4 and S5). GA, RDS, and TZL designs show rather similar distributions across optimality criteria especially when . LM, on the other hand, selects only from the extremes of the distribution for common heterozygous () for . In addition, the intersection of the subsamples taken across each design/optimality criterion combination is presented via upset plots for a single replicate (Figures S3 and 3). These plots show that almost a third of the phase 2 subsamples are common among GA, RDS, and TZL designs and optimality criteria when . Notably, the number of common subsamples jumps to about a half among GA, RDS, and TZL and optimality criteria when .

FIGURE 3

Upset plot for a single replicate in the realistic simulation to quantify the intersection sizes across studied designs and optimality criteria when . Each bar denotes the size of a given intersection highlighted in the x‐axis, that is, the number of subjects common among designs. The matrix in the x‐axis corresponds to each optimality criterion (parameter‐specific/A‐/D‐optimality) and design (LM/GA/TZL) combination as well as RDS

Single‐variant fine‐mapping analysis

Once the phase 2 sample is selected in a given replicate, we perform a region scan using the 29 variants, that is, we test for association one variant at a time, across each phase 2 sample size (). To decrease the collinearity between G and Z in the model, we treat the GWAS SNP, Z (rs247617), as a (three‐level) categorical covariate at design and analysis stages. We summarize the point estimates, asymptotic standard errors, and empirical power rates (under the LR test) for the region scans across replicates for each design and optimality criteria (Tables S4 and 4).
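The one-variant-at-a-time logic of a region scan can be illustrated with a likelihood ratio (LR) test in an ordinary Gaussian linear model. The sketch below is deliberately simplified: it handles complete data only, omits the GWAS SNP covariate Z, and does not implement the two-phase maximum likelihood analysis used in the paper. The variant names and toy data are hypothetical.

```python
import math

def lr_test_slope(y, g):
    """LR test of H0: b1 = 0 in the Gaussian linear model y = b0 + b1 * g + e."""
    n = len(y)
    y_bar = sum(y) / n
    g_bar = sum(g) / n
    sxx = sum((gi - g_bar) ** 2 for gi in g)
    if sxx == 0.0:
        raise ValueError("monomorphic variant: slope is not estimable")
    sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
    b1 = sxy / sxx
    rss_null = sum((yi - y_bar) ** 2 for yi in y)   # intercept-only model
    rss_alt = rss_null - b1 * sxy                   # model including the variant
    # With the error variance profiled out, the LR statistic is n * log(RSS0 / RSS1).
    lr = max(n * math.log(rss_null / rss_alt), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return b1, lr, p_value

def region_scan(y, variants):
    """Test each variant separately; variants maps a name to a genotype vector."""
    return {name: lr_test_slope(y, g) for name, g in variants.items()}

# Toy data (hypothetical): additive genotype codes and a trait driven by var_signal.
var_signal = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
var_other = [1, 1, 0, 2, 0, 1, 2, 0, 1, 1]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.0, 0.05, -0.05, 0.0]
trait = [1.0 + 0.5 * gi + ei for gi, ei in zip(var_signal, noise)]

scan = region_scan(trait, {"var_signal": var_signal, "var_other": var_other})
for name, (b1, lr, p) in scan.items():
    print(name, round(b1, 3), round(lr, 2), p)
```

In a real scan the resulting p‐values would be compared against a multiplicity-corrected threshold across the variants in the region.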

TABLE 4

Empirical power rates at significance level for causal variants only and across 500 replicates in realistic fine‐mapping simulation single‐variant analysis, that is, misspecified

				Par‐spec			A‐opt			D‐opt
G pos.	β1G	Complete	RDS	LM	GA	TZL	LM	GA	TZL	LM	GA	TZL
85805	0.00	8.4	8.6	6.6	8.4	8.0	6.0	8.6	8.2	6.2	8.6	8.8
86045a	0.00	76.4	72.4	60.4	72.0	73.0	56.4	72.2	72.6	56.4	72.4	72.2
86762a	0.00	100.0	100.0	99.4	100.0	100.0	99.2	100.0	100.0	99.0	100.0	100.0
86914	0.00	8.6	7.2	5.6	7.0	8.0	5.0	7.6	7.6	5.8	7.0	7.8
87015	0.00	0.2	0.2	0.2	0.2	0.2	0.4	0.2	0.0	0.4	0.2	0.0
87765	0.00	0.4	0.4	0.0	0.4	0.4	0.0	0.2	0.2	0.2	0.4	0.2
88044	0.00	0.4	0.2	0.2	0.2	0.2	0.2	0.2	0.2	0.2	0.2	0.2
88958a	0.00	83.6	78.8	62.0	79.2	81.4	59.4	78.6	78.4	60.2	78.8	78.0
89015	0.00	5.4	4.8	2.4	4.8	5.6	4.2	4.8	5.8	3.8	5.0	5.0
89830b	‐0.20	100.0	100.0	99.6	100.0	100.0	99.8	100.0	100.0	99.6	100.0	100.0
90803a	0.00	77.2	74.2	61.8	74.0	74.6	58.2	73.6	74.8	57.4	73.4	74.6
91143a	0.00	67.4	63.6	49.2	63.4	64.8	48.0	63.0	64.4	44.8	63.4	62.6
91524	0.00	6.0	5.2	4.4	5.0	6.0	4.6	5.0	5.8	3.6	5.2	5.4
92017	0.00	7.0	5.6	4.8	5.8	6.0	4.6	6.0	6.2	4.2	5.4	5.8
93161	0.00	0.8	0.4	0.2	0.4	0.4	0.0	0.4	0.6	1.0	0.4	0.2
93211	0.00	7.0	6.8	2.4	6.6	7.0	3.2	7.0	6.4	3.4	6.8	6.4
93324b	0.12	3.0	2.4	1.0	2.4	2.2	1.2	2.4	2.2	1.4	2.4	2.0
93886	0.00	0.2	0.2	0.0	0.2	0.0	0.0	0.2	0.2	0.2	0.2	0.2
93897	0.00	20.0	17.6	9.2	17.8	16.2	10.2	17.6	16.6	9.6	17.4	17.2
93901	0.00	18.0	16.0	10.4	16.0	15.4	9.8	16.2	16.4	8.2	16.0	17.4
93935a	0.00	80.6	76.0	62.6	75.6	75.6	61.4	75.8	76.4	60.2	76.0	75.4
94192a	0.00	80.6	76.4	63.2	76.0	75.8	61.8	75.8	75.8	61.6	76.2	75.8
94212	0.00	3.8	2.8	1.8	2.8	2.8	1.0	2.6	2.8	1.4	3.0	2.8
94244	0.00	0.2	0.2	0.0	0.2	0.0	0.0	0.2	0.2	0.2	0.2	0.2
94528	0.00	0.2	0.2	0.0	0.2	0.0	0.0	0.2	0.2	0.2	0.2	0.2
94990b	0.25	84.4	82.0	65.8	82.0	81.8	66.2	81.4	81.6	65.8	81.0	80.4
95038a	0.00	79.6	74.6	62.2	74.6	75.0	60.4	74.2	74.8	60.2	74.6	74.0
95234	0.00	0.2	0.6	0.2	0.6	0.2	0.0	0.6	0.4	0.8	0.8	0.4
95236b	‐0.15	95.2	93.0	77.8	92.8	92.4	79.0	93.0	92.0	78.4	93.0	92.4

Note: Base pair positions (pos.) marked with “b” denote causal variants whereas “a” denote hitchhikers. The remaining are noncausal. Positions are truncated to the last five digits.

GA, RDS, and TZL show similar results in terms of estimation and power across different values of n and optimality criteria (for GA and TZL), whereas LM exhibits considerably lower power (Tables S6 and 4). We observe similar distributions of the LR test p‐value (in scale) across replicates for the studied designs with the exception of some outliers; LM shows the smallest () p‐values compared with the other designs (Figures S6 and 4). Lastly, no optimality criterion shows consistently best estimation or power, suggesting that no specific criterion substantially improves overall performance when the design values, , are misspecified.

FIGURE 4

Boxplots of the () p‐values across 500 replicates in the fine‐mapping simulation single‐variant analyses for a phase 2 sample size of across optimality criteria (for LM, GA, and TZL only): parameter‐specific, A‐ and D‐optimality in each row facet. Each column facet corresponds to the complete data analysis and studied designs respectively. The dashed line corresponds to a Bonferroni‐corrected significance threshold of

It is obvious that in most cases the mean of the estimate for causal sequence variants does not correspond with its true value, being both over‐ and underestimated (Tables S4 and S5). In fact, this discrepancy occurs even for the complete data case, which is unsurprising considering the unaccounted variation resulting from the single‐variant analysis. The power to detect association (at ) is above 80% in the complete data for three out of four causal variants: 89830, 94990, 95236. The power for the remaining causal variant (93324) is almost zero (Table S6). This decrease in power is likely due to the high LD between this variant and the GWAS SNP, Z ( and ), which is already included in the regression, thus diluting its signal. Moreover, there are additional noncausal variants that display power above 80% in the complete data analysis (Table S6). These so‐called “hitchhiker variants” achieve significant association as a consequence of their LD with causal variants. The performance of the studied designs for the hitchhiker variants resembles the complete data analysis and its ranking is similar to the one shown with the causal variants. These results indicate that single‐variant analysis does not distinguish well between causal and hitchhiker SNPs in either complete or two‐phase analysis.
A common strategy to distinguish potential causal variants from hitchhikers consists of adjusting for the most significant variant (or variants) in the region and performing a new conditional scan (ie, one variant at a time) fitting the following model: , where G denotes a variant in the region, is the most significant locus from the single‐variant analysis, and Z is the GWAS SNP treated as a (three‐level) categorical variable. This approach aims to discover independent signals in the region. Results for this conditional analysis can be found in Section S6, Online Supplementary Material.

APPLICATION IN THE NORTHERN FINLAND BIRTH COHORT OF 1966

We illustrate the methods outlined in Sections 2.4 and 4 using the Northern Finland Birth Cohort of 1966 (NFBC1966), which is a longitudinal, prospective birth cohort consisting of women and their offspring from the two northernmost provinces in Finland: Oulu and Lapland. Comprehensive phenotypic, lifestyle, and demographic data were collected after birth via questionnaires and clinical evaluations on the offspring at years 1, 7, 14‐16, and 31. The NFBC1966 aims to study genetic, biological, social, or behavioral risk factors associated with the onset of different diseases as well as morbidity and mortality derived from adverse events such as preterm birth and intrauterine growth retardation. In particular, as part of an NHLBI‐sponsored project designed to characterize the genetic determinants of metabolic and cardiovascular diseases, special attention was paid to a selected list of heritable quantitative traits related to cardiovascular diseases or type 2 diabetes. These traits are body mass index (BMI), high density lipoproteins (HDL), low density lipoproteins (LDL), triglycerides (TG), glucose (GLU), insulin (INS), C‐reactive protein (CRP), systolic blood pressure (SBP), and diastolic blood pressure (DBP). We focus on 5402 subjects for which genotype information was collected using the Illumina Infinium platform, which comprises 346 590 SNPs (after standard quality control). In addition to the genotype information, custom targeted sequencing (CTS) data were collected for 4511 of them (83.5%) as part of a series of resequencing studies to deepen the understanding of genotypic variation on metabolic traits. The CTS data contain the coding sequence and 5′ and 3′ untranslated regions of 78 genes, which were selected based on previous GWAS meta‐analyses of cardiovascular diseases. Details of these regions can be found in Service et al.
The purpose of this illustration is to optimally select a subsample of subjects for targeted sequencing study (phase 2) and fine‐mapping to locate potential causal variants using the methods outlined in the previous sections.

Phase 2 subsample selection

We first identify GWAS‐SNPs by performing genome‐wide associations on the available quantitative traits. Although genome‐wide scans on these very same metabolic traits for the NFBC1966 have been previously carried out in Sabatti et al, our analyses differ in a couple of aspects: (1) the sample size we utilized is slightly larger because additional subjects were genotyped at a later time and (2) we perform multiple linear regression adjusting by the SexOCPG covariate described in Sabatti et al, which is a composite categorical variable determined by sex, oral contraceptive use, and pregnancy status. We center attention on one trait, log‐transformed TG (Y), as its GWAS has few peaks that identify only two genetic regions for further study: GCKR in chromosome 2 and LPL in chromosome 8 (Figure S7). We comment on the challenges of more complex GWAS scenarios in the discussion. We locate one SNP in each of these regions that meets the usual genome‐wide significance threshold (): rs1260326 (chr2:27730940, , s.e., , MAF), and rs10096633 (chr8:19830921, , s.e., , MAF). Due to missing data on the TG values, the number of subjects available for the genome scan was 5300. Of these, the number of subjects with both GWAS and CTS data is , which is the phase 1 sample size considered for phase 2 analyses. In addition to the GCKR and LPL regions used in the phase 2 selection, we consider another region for analysis: APOA5, to illustrate the correspondence of the two‐phase design and analysis with the complete data approach pursued in Service et al. Using the two identified GWAS‐SNPs, we select phase 2 subsamples under three of the previously described designs: GA, RDS, and TZL. We drop LM as it showed the worst performance in simulations. The phase 2 sample size is specified to be approximately 25% or 50% of the phase 1 sample size (). To define Z, we use all allele combinations of the GWAS‐SNPs rs1260326 and rs10096633, which results in a nine‐category variable.
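The nine‐category auxiliary variable can be constructed by crossing the two three‐level genotypes. The sketch below assumes each GWAS‐SNP is coded as a minor-allele count (0, 1, or 2); the particular numbering of the nine categories and the toy genotype vectors are illustrative choices, not taken from the NFBC1966 data.

```python
from collections import Counter

def joint_genotype_category(g1, g2):
    """Map two genotypes coded 0/1/2 to a single category in 0..8."""
    if g1 not in (0, 1, 2) or g2 not in (0, 1, 2):
        raise ValueError("genotypes must be coded as 0, 1, or 2")
    return 3 * g1 + g2

# Toy genotype vectors (hypothetical) standing in for rs1260326 and rs10096633.
snp_a = [0, 1, 2, 1, 0, 2, 1, 1]
snp_b = [0, 0, 1, 0, 0, 0, 1, 0]

z = [joint_genotype_category(a, b) for a, b in zip(snp_a, snp_b)]
counts = Counter(z)
# Categories absent from the sample show up with a count of zero.
print([counts.get(category, 0) for category in range(9)])
```

Tabulating the categories in this way makes sparse cells visible, which matters when one of the SNPs has a low minor allele frequency.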
Considering that no optimality criterion performed best in Section 4, we deem it appropriate to assume is a scalar and use a parameter‐specific criterion, which greatly simplifies the specification of the design quantities, particularly . Phase 2 subsample selection is performed separately for each phase 2 sample size. In each case, a set of design quantities is defined as follows: First, , where and correspond to the MLEs from the following regression model based on phase 1 data: rs1260326rs10096633, where is a vector of additional covariates including SexOCPG and the first four genetic principal components (PC1‐4). Second, for and given that MAF and LD values are unavailable a priori, we postulate the following ranges for these design quantities: and . We use a categorical coding to define Z in the ML analysis because it reduces the potential collinearity with G, whereas there is no such issue at the subsample selection stage, where the typical GWAS analysis involves additive coding for Z (rs1260326 and rs10096633 in this case). For visualization purposes, we categorize TG () into three groups corresponding to commonly used blood test ranges, that is, normal (<150 mg/dL), borderline high (150‐199 mg/dL), and high (≥200 mg/dL). Notably, the groups in are asymmetrical with respect to the middle stratum. On the other hand, the distribution of the nine‐category variable determined by the two GWAS‐SNPs (rs1260326 and rs10096633) has a small number of subjects for some categories due to the relatively low MAF of SNP rs10096633, which differs largely from the simulations (Figure S8). Consequently, the phase 2 subsample distributions tend to not select subjects from those categories. Notably, GA, RDS, and TZL show similar category distribution across phase 2 sample sizes (Figure S9). The proportion of subjects that are common among designs in the phase 2 subsamples is above between pairs of GA, RDS, and TZL across phase 2 sample sizes (Figure S10).
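The three TG strata used for visualization can be expressed as a small classification function. The sketch below applies the cut points quoted above on the original mg/dL scale; treating values between 199 and 200 mg/dL as borderline high is an assumption, and the example measurements and auxiliary categories are invented.

```python
from collections import Counter

def tg_stratum(tg_mg_dl):
    """Classify a triglyceride value (mg/dL) into the three blood test ranges."""
    if tg_mg_dl < 150:
        return "normal"
    if tg_mg_dl < 200:          # assumption: 199 < TG < 200 counts as borderline high
        return "borderline high"
    return "high"

def cross_tabulate(tg_values, z_categories):
    """Count subjects in each (TG stratum, auxiliary category) cell."""
    return Counter((tg_stratum(tg), z) for tg, z in zip(tg_values, z_categories))

# Toy data (hypothetical): TG in mg/dL and a nine-level auxiliary category per subject.
tg_values = [95.0, 149.9, 150.0, 199.0, 200.0, 310.5]
z_categories = [0, 0, 3, 3, 6, 0]

table = cross_tabulate(tg_values, z_categories)
for cell, count in sorted(table.items()):
    print(cell, count)
```

A cross-tabulation of this kind is the sort of summary a mosaic plot of strata sizes would display.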

Fine‐mapping analysis

CTS data for genes GCKR, LPL, and APOA5 were downloaded from the NCBI's dbGAP repository according to their GRCh37.p13 location ±5 kbps. Since aligned reads were available, we performed variant calling using the GotCloud pipeline developed by the Center for Statistical Genetics at the University of Michigan. Sequence data were analyzed in two ways. First, we used linear regression with complete data, that is, subjects with both genotyping and CTS data (), and second via the ML approach described above for each studied design: GA, RDS, and TZL (). For the ML analysis, the nine‐category variable defined by the GWAS SNPs was used as auxiliary variable Z. All analyses were adjusted by the GWAS‐SNPs (rs1260326 and rs10096633), SexOCPG and PC1‐PC4 as covariates. Our main interest lies in gauging the performance of the two‐phase designs with respect to the complete data analysis, evaluating both estimation and hypothesis testing. For estimation, we focus on three sequence variants reported in table 2 of Service et al: rs268, rs2266788, and rs3135506 (Table 5). For these variants, GA, RDS, and TZL show similar results with improving performance as n increases. Comparisons of association estimates in Beta‐Beta plots show a similar spread of estimates across designs (Figure S11). Similarly, region plots of association signals for complete data and ML analyses across the studied designs indicate that all designs tend to display results closer to those of the complete data case as n increases, with no overwhelmingly better design (Figure 5). In addition to the region scans performed for analyzing common variants, we demonstrate that rare variants can be investigated under a two‐phase design via burden tests in Section S7, Online Supplementary Material.

TABLE 5

Estimation and testing results for analyzing (log‐transformed) triglyceride levels across three sequence SNPs from the fine‐mapping analysis in the NFBC1966

					β^1 (S.E.) p‐value
n	Gene	Chr.	pos. (hg19)	Variant	Complete	GA	RDS	TZL
1123	LPL	8	19813529	rs268	0.186 (0.04) p=2.69e−07	0.221 (0.06) p=1.12e−05	0.223 (0.06) p=9.28e−06	0.221 (0.06) p=8.24e−06
	APOA5	11	116660686	rs2266788	0.108 (0.02) p=1.06e−10	0.115 (0.02) p=8.41e−09	0.116 (0.02) p=6.23e−09	0.114 (0.02) p=1.03e−08
			116662407	rs3135506	0.100 (0.02) p=1.50e−06	0.083 (0.02) p=4.23e−04	0.081 (0.02) p=7.78e−04	0.079 (0.02) p=8.87e−04
2246	LPL	8	19813529	rs268	0.186 (0.04) p=2.69e−07	0.186 (0.04) p=1.14e−06	0.185 (0.04) p=1.20e−06	0.198 (0.04) p=4.21e−07
	APOA5	11	116660686	rs2266788	0.108 (0.02) p=1.06e−10	0.106 (0.02) p=1.53e−09	0.105 (0.02) p=2.02e−09	0.103 (0.02) p=3.32e−09
			116662407	rs3135506	0.100 (0.02) p=1.50e−06	0.096 (0.02) p=5.10e−06	0.096 (0.02) p=5.07e−06	0.097 (0.02) p=4.22e−06

FIGURE 5

Region plots of the NFBC1966 CTS data for the ML analyses under the studied designs compared with the complete data analysis across phase 2 sample sizes (). Column facets denote each of the studied designs plus the complete data analysis, whereas row facets show each locus of interest.

DISCUSSION

In this report, we propose and evaluate state‐of‐the‐art sample selection strategies for two‐phase designs in the context of post‐GWAS fine‐mapping studies. We pay special attention to the comparison against a recently proposed optimal design, TZL. Our first set of simulations, considering a parameter‐specific criterion, shows a clear advantage of TZL under the strong assumption of correctly specified design values ( or ). On the other hand, TZL demonstrates biased estimation when the phase 2 sample size/nonmissing fraction is small () and the effect sizes are farther from the null, with improvements noticed when n increases. These results are aligned with those obtained under LM, which reinforces our initial belief that LM is a crude approximation to TZL. Moreover, LM is only competitive when , suggesting that the sampling variability introduced by this method can only be overcome under larger nonmissing fractions. In contrast, GA demonstrates unbiased estimation and competitive power (often larger than RDS) across all simulation settings. The appeal of GA lies in its generality, as it can be extended to other settings beyond linear, logistic and Cox models while also avoiding uncertainty associated with sampling. Thus, GA provides an alternative approach to obtain efficient and robust two‐phase designs across a wider range of settings, including large effect sizes. Additionally, we investigate the use of different variances considered in the optimization: , and , which are, respectively, the inverse of the Fisher information matrix and the variance‐covariance matrix of the most powerful test under the null hypothesis (Sections 3.2 and S3, Online Supplementary Material). The results support the use of for the parameter‐specific criterion when the effect size is close to the null, as expected. Correct specification of the design values is never attainable in practice.
Hence, our second set of simulations evaluates a more realistic setting for which we propose a grid search approach to select a unique phase 2 design assuming a set of (misspecified) design values is available. In this scenario, GA, RDS, and TZL have comparable performance with LM falling behind. GA is advantageous as it can be easily applied to general functions. Thus, apart from the parameter‐specific criterion, we examine two additional criteria to select a phase 2 design: A‐ and D‐optimality. Notably, based on our simulations we found no evidence in favor of any particular optimality criterion. However, the parameter‐specific criterion may be preferred as designating its design values requires fewer assumptions. A‐ and D‐optimality criteria were chosen because they have been amply explored in the literature, possess solid roots in experimental design, and have a natural connection with hypothesis testing. Nonetheless, other criteria may be better suited for optimizing power; for instance, maximizing the noncentrality parameter of the likelihood ratio or score test statistic may improve the power performance of the phase 2 designs. An important observation from the simulations is that although the best performing designs have higher relative efficiency compared with less favorable designs (eg, combined or simple random sampling) across all phase 2 sample sizes, this improvement does not automatically translate to a closer agreement with the complete data analysis. That is, power performance is contingent upon the nonmissing fraction and not necessarily the phase 2 sample size itself. This finding is consistent throughout our investigations, where both parameter estimates and p‐values achieve similar values as in the complete data analysis only when the phase 2 sample size is half of the phase 1 sample size (). Thus, careful evaluation of the statistical power of the phase 2 design needs to be considered in advance.
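To illustrate how a genetic algorithm can search for a phase 2 subsample under a general fitness function, the toy sketch below selects a fixed-size subset that minimizes a user-supplied criterion. It is not the GA proposed in this article: the operators (truncation selection, crossover by resampling from the parents' union, single-swap mutation, elitism), the tuning values, and the example fitness are all illustrative assumptions. The `init_population` argument lets a run start from a supplied population, for example the final population of an earlier run, and the population size must be at least two.

```python
import random

def ga_select(n_total, n_select, fitness, generations=50, pop_size=20,
              init_population=None, seed=0):
    """Toy GA: search for a size-n_select subset of range(n_total) minimizing fitness."""
    rng = random.Random(seed)
    if init_population is None:
        population = [frozenset(rng.sample(range(n_total), n_select))
                      for _ in range(pop_size)]
    else:
        population = [frozenset(s) for s in init_population]
    everyone = set(range(n_total))
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        parents = ranked[: max(2, len(ranked) // 2)]    # truncation selection
        children = [ranked[0]]                          # elitism: keep the current best
        while len(children) < len(population):
            a, b = rng.sample(parents, 2)
            pool = sorted(a | b)                        # crossover: resample from the union
            child = set(rng.sample(pool, n_select))
            outside = sorted(everyone - child)
            if outside and rng.random() < 0.3:          # mutation: swap one subject
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = children
    return min(population, key=fitness), population

# Example fitness (assumption): the slope variance in simple linear regression is
# proportional to 1 / Sxx, so minimizing it favors subjects with extreme covariate values.
x = [0.1 * i for i in range(30)]                        # hypothetical phase 1 covariate

def slope_variance(subset):
    values = [x[i] for i in subset]
    centre = sum(values) / len(values)
    sxx = sum((v - centre) ** 2 for v in values)
    return float("inf") if sxx == 0.0 else 1.0 / sxx

best, final_population = ga_select(len(x), 6, slope_variance, seed=1)
# Warm start: reuse the final population of the first run as the initial population.
best_again, _ = ga_select(len(x), 6, slope_variance,
                          init_population=final_population, seed=2)
print(sorted(best), slope_variance(best))
print(sorted(best_again), slope_variance(best_again))
```

Because the best subset is always carried forward, the criterion cannot get worse from one generation to the next, or from one warm-started run to the next, although nothing here guarantees that a global optimum is reached.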
The competitive performance of GA notwithstanding, there are other considerations in implementing this algorithm. For instance, we provide in all simulations an ad‐hoc approach to initialize the population of possible solutions to accelerate convergence. In general, giving a particular initialization is not necessary; however, a completely random initial population may need a larger number of generations () to achieve good performance. In addition, given the stochastic nature of the search, there are few guarantees that the final solution has indeed reached a global as opposed to a local optimum. It is also worth mentioning that the tuning parameter settings for the proposed GA are intended as a guideline only and do not replace a more careful evaluation in specific problems. Walters suggests iterative calls, which consist of running the GA multiple times, so that the final population in each run serves as the initial population in subsequent runs. The budgetary constraint implicitly assumes that the cost of sequencing samples is the same for all study samples. This assumption may be relaxed as sequencing costs may vary due to location, tissue availability or number of samples. Thus, extensions to consider differential costs are yet to be considered; one example of such an approach under tracing study designs can be found in Moon et al. Another important issue that deserves further investigation in terms of budget constraints (or otherwise) is the use of differential sequencing depths across samples. Further investigation regarding selection of an optimal phase 2 subsample under a set of loosely defined design values, , is warranted. Indeed, beyond the proposed grid search approach, alternative means to select a phase 2 subsample across ranges of are possible, for example, via min‐max approaches. However, this selection problem may also be addressed under a Bayesian framework for which a prior (joint) distribution for and needs to be specified.
The appeal of this approach is that it may better incorporate the uncertainty in the design values when selecting a phase 2 subsample, although at the expense of computational complexity.
Beyond the feasibility and applicability of the proposed methodologies in practice, the illustration on the amply studied NFBC1966 raises some additional questions about the design and analysis of two‐phase post‐GWAS fine‐mapping studies. First, our rare‐variant analysis shows no association in either of the studied regions. This result is not surprising for two reasons: the limited sample size in the NFBC1966 and the low correlation between the GWAS SNPs and the computed genetic score. Additionally, the burden test assumes the same direction of effect across all variants, which is a limitation in various settings. Thus, further investigations are required for variance component tests under the proposed post‐GWAS scenario, potentially by extending ideas from extreme‐phenotype sampling designs for cross‐sectional association studies. Second, current practice in the field involves the imputation of GWAS data using high‐quality reference panels such as the TOPMed Imputation Panel. In principle, one can use the imputed variants to construct a suitable auxiliary variable, Z, to select a phase 2 subsample using the proposed framework. Alternative methods that accommodate differences between genotyped and imputed data for subjects not selected for phase 2 sequencing have been discussed. Hence, comparisons between these methods and the approach undertaken in this article can also be evaluated. Lastly, in a similar vein, methodological extensions for situations in which multiple loci are pinpointed by GWAS and/or multiple traits of equal interest are collected remain topics of future work. A starting point in this direction may involve the calculation and application of polygenic risk scores to inform the phase 2 subsampling.
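As a rough illustration of the last point, the sketch below computes a polygenic risk score as a weighted sum of allele dosages and then picks a phase 2 subsample from both tails of the score distribution. The tail‐based selection rule, the function names, the weights, and the genotypes are all assumptions made for this example; the article leaves the use of polygenic risk scores in phase 2 subsampling as future work and does not prescribe any such rule.

```python
from typing import List, Sequence


def polygenic_risk_score(dosages: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


def select_extremes(scores: Sequence[float], n2: int) -> List[int]:
    """Return indices of n2 individuals, split between the lowest and
    highest scores (hypothetical rule, used here for illustration)."""
    if not 0 <= n2 <= len(scores):
        raise ValueError("n2 must be between 0 and the number of individuals")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_low = n2 // 2
    n_high = n2 - n_low
    return sorted(order[:n_low] + order[len(order) - n_high:])


# Made-up per-variant weights (eg, GWAS effect estimates) and genotypes
# for six phase 1 individuals at three variants.
weights = [0.2, -0.1, 0.3]
genotypes = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 2, 0],
    [2, 1, 1],
    [0, 1, 0],
    [1, 1, 2],
]

scores = [polygenic_risk_score(g, weights) for g in genotypes]
selected = select_extremes(scores, n2=4)
print(scores)
print(selected)
```

Any such score could equally be treated as the auxiliary variable Z within the proposed framework instead of being used for direct tail sampling; which option performs better is an open question.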
Another issue deserving further investigation is the influence of the phase 2 sample selection on the variant calling pipeline. For simplicity, in this application, variant calling was performed per locus on all available CTS samples. However, even though genotype likelihoods are typically inferred by sample, population‐specific filters such as MAF may change with the design. Thus, sensitivity analyses of these filters can be additionally explored.
We emphasize that although this report focuses on a normally distributed continuous trait, all the derivations apply in the context of generalized linear models within the exponential family. For instance, as part of the accompanying GitHub repository, we include an example of how to obtain the GA design when the response variable is a count and can be modeled using Poisson regression. Furthermore, we are engaged in the development of an R package for this general case. Results of this research and the accompanying software aim to support investigators' decision‐making pertaining to study design, evaluation, and analysis of two‐phase studies. These tools can serve to make more efficient use of limited budgetary resources for data acquisition and analysis.
Two‐phase study designs can be sought in other contexts. In particular, their use in a variety of 'omics problems is broadly relevant as new and more costly technologies continue to arise. Beyond the case of fine‐mapping, where causal variants from GWAS‐identified regions can be pinpointed at a fraction of the cost, this approach can be extended to phase 2 variables that are not categorical, for example, methylation, gene expression, or other 'omics measurements. Additionally, methods that introduce functional knowledge to further inform the inference (possibly through Bayesian methods) deserve further investigation.

CONFLICT OF INTEREST

The authors declare no potential conflicts of interest.

AUTHOR CONTRIBUTIONS

Osvaldo Espin‐Garcia wrote the first draft of this manuscript and subsequent revisions, performed simulations, and analyzed the NFBC1966 data. Radu V. Craiu and Shelley B. Bull provided statistical guidance and methodological support. All authors contributed to the design of the numerical experiments and the overall analysis plan, and reviewed and approved the final version of the manuscript.

FINANCIAL DISCLOSURE

This research is supported by funding from the Canadian Institutes of Health Research: CIHR Operating Grant MOP‐84287 and CIHR Project Grant PJT‐159463 (RVC, SBB), CIHR Training Grant GET‐101831 (OE‐G); and the Ontario Institute for Cancer Research (OICR) through funding provided by the Government of Ontario (OE‐G). OE‐G has been a fellow trainee of the OICR Biostatistics Training Initiative and CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) ‐ CIHR Training Grant in Genetic Epidemiology and Statistical Genetics.

