| Literature DB >> 34596256 |
Osvaldo Espin-Garcia1,2, Radu V Craiu3, Shelley B Bull2,4.
Abstract
Post-GWAS analysis, in many cases, focuses on fine-mapping targeted genetic regions discovered at GWAS-stage; that is, the aim is to pinpoint potential causal variants and susceptibility genes for complex traits and disease outcomes using next-generation sequencing (NGS) technologies. Large-scale GWAS cohorts are necessary to identify target regions given the typically modest genetic effect sizes. In this context, two-phase sampling design and analysis is a cost-reduction technique that utilizes data collected during phase 1 GWAS to select an informative subsample for phase 2 sequencing. The main goal is to make inference for genetic variants measured via NGS by efficiently combining data from phases 1 and 2. We propose two approaches for selecting a phase 2 design under a budget constraint. The first method identifies sampling fractions that select a phase 2 design yielding an asymptotic variance covariance matrix with certain optimal characteristics, for example, smallest trace, via Lagrange multipliers (LM). The second relies on a genetic algorithm (GA) with a defined fitness function to identify exactly a phase 2 subsample. We perform comprehensive simulation studies to evaluate the empirical properties of the proposed designs for a genetic association study of a quantitative trait. We compare our methods against two ranked designs: residual-dependent sampling and a recently identified optimal design. Our findings demonstrate that the proposed designs, GA in particular, can render competitive power in combined phase 1 and 2 analysis compared with alternative designs while preserving type 1 error control. These results are especially evident under the more practical scenario where design values need to be defined a priori and are subject to misspecification. We illustrate the proposed methods in a study of triglyceride levels in the North Finland Birth Cohort of 1966. R code to reproduce our results is available at github.com/egosv/TwoPhase_postGWAS.Entities:
Keywords: genetic algorithm; post-GWAS targeted sequencing; practical two-phase studies; statistical fine-mapping; two-phase study design and analysis
Mesh:
Year: 2021 PMID: 34596256 PMCID: PMC9293221 DOI: 10.1002/sim.9211
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497
Description of the three optimality criteria evaluated. denotes the variance‐covariance matrix.
|
| Formula | Description |
|---|---|---|
| A‐optimality |
| Minimizes the average variance of the parameter estimates |
| D‐optimality |
| Minimizes the product of the variances for diagonal matrices |
| Parameter‐specific |
| Minimizes the variance of a particular entry in the VCM |
FIGURE 1Mosaic plots with the average strata sizes across replicates for the proposed designs against ranked designs across phase 2 sample sizes, , under the parameter‐specific criterion. Averages were taken from the resulting designs in the main simulation study for the two most extreme values of (0.1, 0.7)
Type 1 error (T1E) rates along with their corresponding 99% Clopper‐Pearson confidence intervals () across studied designs, phase 2 sample sizes () and statistical tests under a parameter‐specific criterion
|
| Test | LM | GA | RDS | TZL |
|---|---|---|---|---|---|
| 540 | Wald | 1.45 (1.23‐1.70) | 1.17 (0.97‐1.40) | 1.10 (0.91‐1.32) | 1.01 (0.83‐1.22) |
| LR | 1.02 (0.84‐1.24) | 1.14 (0.94‐1.36) | 1.10 (0.90‐1.32) | 0.97 (0.79‐1.17) | |
| Score | 0.94 (0.76‐1.14) | 1.10 (0.90‐1.32) | 1.07 (0.88‐1.29) | 0.95 (0.77‐1.15) | |
| 1250 | Wald | 1.06 (0.87‐1.28) | 1.15 (0.95‐1.37) | 1.12 (0.93‐1.34) | 1.02 (0.84‐1.24) |
| LR | 0.94 (0.76‐1.14) | 1.14 (0.95‐1.37) | 1.12 (0.93‐1.34) | 0.99 (0.81‐1.20) | |
| Score | 0.90 (0.72‐1.10) | 1.13 (0.94‐1.35) | 1.10 (0.90‐1.32) | 0.96 (0.78‐1.17) | |
| 2500 | Wald | 1.00 (0.82‐1.21) | 1.08 (0.89‐1.30) | 1.08 (0.89‐1.30) | 1.09 (0.89‐1.30) |
| LR | 0.97 (0.79‐1.18) | 1.06 (0.87‐1.27) | 1.07 (0.88‐1.29) | 1.07 (0.88‐1.29) | |
| Score | 0.96 (0.78‐1.17) | 1.05 (0.86‐1.27) | 1.07 (0.88‐1.29) | 1.06 (0.87‐1.27) |
Note: Each entry represents 17 500 replicates pooled across empirical null scenarios. The rest of the simulation parameters correspond to , , , , , . The complete data T1E rate is 1.16 (0.97‐1.37) for Wald/LR tests and 1.14 (0.95‐1.35) for the score test. To further evaluate test validity under the studied sample sizes, we plot histograms of the observed LR test p‐values in Figure S1.
FIGURE 2Boxplots for the distribution of the bias across genetic effect estimates () in the studied designs under a parameter‐specific criterion. Row facets denote different true values (0, 0.1, 0.3, 0.5, 0.7) [Colour figure can be viewed at wileyonlinelibrary.com]
Relative empirical power (rEP), calculated as the ratio of the empirical power of each studied design over that of the complete data, across studied designs, phase 2 sample sizes, and effect sizes under the LR test () for the ideal scenario of correctly specifying
| Relative empirical power (rEP) | ||||||
|---|---|---|---|---|---|---|
|
|
| Complete (EP) | LM | GA | RDS | TZL |
| 540 | 0.225 | 58.1 | 2.9 | 9.2 | 6.7 | 32.5 |
| 0.250 | 80.6 | 4.5 | 14.8 | 11.7 | 40.4 | |
| 0.300 | 98.2 | 15.0 | 38.7 | 35.5 | 74.2 | |
| 0.400 | 100.0 | 69.2 | 94.2 | 91.8 | 99.6 | |
| 0.500 | 100.0 | 97.1 | 100.0 | 99.9 | 100.0 | |
| 1250 | 0.225 | 58.1 | 36.0 | 44.1 | 45.5 | 82.1 |
| 0.250 | 80.6 | 44.4 | 57.5 | 56.0 | 87.4 | |
| 0.300 | 98.2 | 77.0 | 85.7 | 84.9 | 97.0 | |
| 0.400 | 100.0 | 99.6 | 99.9 | 99.9 | 100.0 | |
| 0.500 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | |
| 2500 | 0.225 | 58.1 | 96.0 | 96.1 | 86.4 | 95.9 |
| 0.250 | 80.6 | 96.8 | 97.1 | 90.8 | 98.6 | |
| 0.300 | 98.2 | 99.3 | 99.2 | 98.2 | 99.6 | |
| 0.400 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | |
| 0.500 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | |
Note: Column “Complete” corresponds to the estimated power of the complete data (ie, the denominator of the rEP ratio). Phase 1 sample size is whereas phase 2 sample size is . These results exclude values of lower than 0.225 and greater than 0.5 since the power of the complete data was less than 50% for the former and had already reached 100% for latter. The rest of the simulation parameters correspond to , , , , and .
FIGURE 3Upset plot for a single replicate in the realistic simulation to quantify the intersection sizes across studied designs and optimality criteria when . Each bar denotes the size of a given intersection highlighted in the x‐axis, that is, the number of subjects common among designs. The matrix in the x‐axis corresponds to each optimality criterion (parameter‐specific/A‐/D‐optimality) and design (LM/GA/TZL) combination as well as RDS
Empirical power rates at significance level for causal variants only and across 500 replicates in realistic fine‐mapping simulation single‐variant analysis, that is, misspecified
| Par‐spec | A‐opt | D‐opt | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
| Complete | RDS | LM | GA | TZL | LM | GA | TZL | LM | GA | TZL |
| 85805 | 0.00 | 8.4 | 8.6 | 6.6 | 8.4 | 8.0 | 6.0 | 8.6 | 8.2 | 6.2 | 8.6 | 8.8 |
| 86045 | 0.00 | 76.4 | 72.4 | 60.4 | 72.0 | 73.0 | 56.4 | 72.2 | 72.6 | 56.4 | 72.4 | 72.2 |
| 86762 | 0.00 | 100.0 | 100.0 | 99.4 | 100.0 | 100.0 | 99.2 | 100.0 | 100.0 | 99.0 | 100.0 | 100.0 |
| 86914 | 0.00 | 8.6 | 7.2 | 5.6 | 7.0 | 8.0 | 5.0 | 7.6 | 7.6 | 5.8 | 7.0 | 7.8 |
| 87015 | 0.00 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.4 | 0.2 | 0.0 | 0.4 | 0.2 | 0.0 |
| 87765 | 0.00 | 0.4 | 0.4 | 0.0 | 0.4 | 0.4 | 0.0 | 0.2 | 0.2 | 0.2 | 0.4 | 0.2 |
| 88044 | 0.00 | 0.4 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 88958 | 0.00 | 83.6 | 78.8 | 62.0 | 79.2 | 81.4 | 59.4 | 78.6 | 78.4 | 60.2 | 78.8 | 78.0 |
| 89015 | 0.00 | 5.4 | 4.8 | 2.4 | 4.8 | 5.6 | 4.2 | 4.8 | 5.8 | 3.8 | 5.0 | 5.0 |
| 89830 | ‐0.20 | 100.0 | 100.0 | 99.6 | 100.0 | 100.0 | 99.8 | 100.0 | 100.0 | 99.6 | 100.0 | 100.0 |
| 90803 | 0.00 | 77.2 | 74.2 | 61.8 | 74.0 | 74.6 | 58.2 | 73.6 | 74.8 | 57.4 | 73.4 | 74.6 |
| 91143 | 0.00 | 67.4 | 63.6 | 49.2 | 63.4 | 64.8 | 48.0 | 63.0 | 64.4 | 44.8 | 63.4 | 62.6 |
| 91524 | 0.00 | 6.0 | 5.2 | 4.4 | 5.0 | 6.0 | 4.6 | 5.0 | 5.8 | 3.6 | 5.2 | 5.4 |
| 92017 | 0.00 | 7.0 | 5.6 | 4.8 | 5.8 | 6.0 | 4.6 | 6.0 | 6.2 | 4.2 | 5.4 | 5.8 |
| 93161 | 0.00 | 0.8 | 0.4 | 0.2 | 0.4 | 0.4 | 0.0 | 0.4 | 0.6 | 1.0 | 0.4 | 0.2 |
| 93211 | 0.00 | 7.0 | 6.8 | 2.4 | 6.6 | 7.0 | 3.2 | 7.0 | 6.4 | 3.4 | 6.8 | 6.4 |
| 93324 | 0.12 | 3.0 | 2.4 | 1.0 | 2.4 | 2.2 | 1.2 | 2.4 | 2.2 | 1.4 | 2.4 | 2.0 |
| 93886 | 0.00 | 0.2 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 93897 | 0.00 | 20.0 | 17.6 | 9.2 | 17.8 | 16.2 | 10.2 | 17.6 | 16.6 | 9.6 | 17.4 | 17.2 |
| 93901 | 0.00 | 18.0 | 16.0 | 10.4 | 16.0 | 15.4 | 9.8 | 16.2 | 16.4 | 8.2 | 16.0 | 17.4 |
| 93935 | 0.00 | 80.6 | 76.0 | 62.6 | 75.6 | 75.6 | 61.4 | 75.8 | 76.4 | 60.2 | 76.0 | 75.4 |
| 94192 | 0.00 | 80.6 | 76.4 | 63.2 | 76.0 | 75.8 | 61.8 | 75.8 | 75.8 | 61.6 | 76.2 | 75.8 |
| 94212 | 0.00 | 3.8 | 2.8 | 1.8 | 2.8 | 2.8 | 1.0 | 2.6 | 2.8 | 1.4 | 3.0 | 2.8 |
| 94244 | 0.00 | 0.2 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 94528 | 0.00 | 0.2 | 0.2 | 0.0 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 94990 | 0.25 | 84.4 | 82.0 | 65.8 | 82.0 | 81.8 | 66.2 | 81.4 | 81.6 | 65.8 | 81.0 | 80.4 |
| 95038 | 0.00 | 79.6 | 74.6 | 62.2 | 74.6 | 75.0 | 60.4 | 74.2 | 74.8 | 60.2 | 74.6 | 74.0 |
| 95234 | 0.00 | 0.2 | 0.6 | 0.2 | 0.6 | 0.2 | 0.0 | 0.6 | 0.4 | 0.8 | 0.8 | 0.4 |
| 95236 | ‐0.15 | 95.2 | 93.0 | 77.8 | 92.8 | 92.4 | 79.0 | 93.0 | 92.0 | 78.4 | 93.0 | 92.4 |
Note: Base pair positions (pos.) marked with “b” denote causal variants whereas ”a” denote hitchhikers. The remaining are noncausal. Positions are truncated to the last five digits.
FIGURE 4Boxplots of the () p‐values across 500 replicates in the fine‐mapping simulation single‐variant analyses for a phase 2 sample size of across optimality criteria (for LM, GA, and TZL only): parameter‐specific, A‐ and D‐optimality in each row facet. Each column facet corresponds to the complete data analysis and studied designs respectively. The dashed line corresponds to a Bonferroni‐corrected significance threshold of [Colour figure can be viewed at wileyonlinelibrary.com]
Estimation and testing results for analyzing (log‐transformed) triglyceride levels across three sequence SNPs from the fine‐mapping analysis in the NFBC1966
|
| ||||||||
|---|---|---|---|---|---|---|---|---|
|
| Gene | Chr. | pos. (hg19) | Variant | Complete | GA | RDS | TZL |
| 1123 | LPL | 8 | 19813529 | rs268 | 0.186 (0.04) | 0.221 (0.06) | 0.223 (0.06) | 0.221 (0.06) |
| APOA5 | 11 | 116660686 | rs2266788 | 0.108 (0.02) | 0.115 (0.02) | 0.116 (0.02) | 0.114 (0.02) | |
| 116662407 | rs3135506 | 0.100 (0.02) | 0.083 (0.02) | 0.081 (0.02) | 0.079 (0.02) | |||
| 2246 | LPL | 8 | 19813529 | rs268 | 0.186 (0.04) | 0.186 (0.04) | 0.185 (0.04) | 0.198 (0.04) |
| APOA5 | 11 | 116660686 | rs2266788 | 0.108 (0.02) | 0.106 (0.02) | 0.105 (0.02) | 0.103 (0.02) | |
| 116662407 | rs3135506 | 0.100 (0.02) | 0.096 (0.02) | 0.096 (0.02) | 0.097 (0.02) | |||
FIGURE 5Region plots of the NFBC1966 CTS data for the ML analyses under the studied designs compared with the complete data analysis across phase 2 sample sizes (). Column facets denote each of the studied designs plus the complete data analysis whereas row facets show each loci of interest