C Ryan King, Paul J Rathouz, Dan L Nicolae.
Abstract
Sequencing technologies are becoming cheap enough to apply to large numbers of study participants and promise to provide new insights into human phenotypes by bringing to light rare and previously unknown genetic variants. We develop a new framework for the analysis of sequence data that incorporates all of the major features of previously proposed approaches, including those focused on allele counts and allele burden, but is both more general and more powerful. We harness population genetic theory to provide prior information on effect sizes and to create a pooling strategy for information from rare variants. Our method, EMMPAT (Evolutionary Mixed Model for Pooled Association Testing), generates a single test per gene (substantially reducing multiple testing concerns), facilitates graphical summaries, and improves the interpretation of results by allowing calculation of attributable variance. Simulations show that, relative to previously used approaches, our method increases the power to detect genes that affect phenotype when natural selection has kept alleles with large effect sizes rare. We demonstrate our approach on a population-based re-sequencing study of association between serum triglycerides and variation in ANGPTL4.
Year: 2010 PMID: 21085648 PMCID: PMC2978703 DOI: 10.1371/journal.pgen.1001202
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. Hypotheses relating SNP effect and fitness effect.
Panel A depicts the scenario where the trait is directly under selection. Panel B depicts the scenario where a gene with pleiotropic effects creates fitness-trait correlation via a related phenotype.
Figure 2. Relationship between sampled frequency and mean fitness.
Simulation results using fitted DFE of non-synonymous variation from [33] and a sample size of 1000 diploids. Red bars are median ±35% of the distribution at that sampled frequency. The x-axis is logarithmic and scaled by 100, i.e., the first point is 1/2000 chromosomes.
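As a quick check on the x-axis scaling described in this caption, the following minimal sketch computes the smallest observable sample frequency for 1000 diploid individuals and its value on an axis scaled by 100. The function and variable names are illustrative and do not come from the paper.

```python
def min_sample_frequency(n_diploids: int) -> float:
    """Frequency of a variant seen exactly once among 2 * n_diploids chromosomes."""
    n_chromosomes = 2 * n_diploids  # each diploid individual carries two chromosomes
    return 1 / n_chromosomes


# 1000 diploids give 2000 chromosomes, so a singleton has frequency 1/2000.
freq = min_sample_frequency(1000)

# Assumption: "scaled by 100" means the axis shows frequency as a percentage.
scaled = 100 * freq

print(f"frequency = {freq:.4f}, axis value = {scaled:.2f}")
```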
Table 1. Genetic variation in ANGPTL4.
| Population | N individuals | N Non-synonymous variants | N Non-coding variants |
| Pooled | 3476 | 32 | 62 |
| Non-Hispanic whites | 1043 | 20 | 23 |
| Non-Hispanic blacks | 1832 | 15 | 38 |
| Hispanic | 601 | 8 | 17 |
Table 2. Model fit for ANGPTL4.
| Population | SNP Type | | | | SE | nonfitness % variance | fitness % variance |
| Pooled | non-syn | 0.13 | 0.0 | 2.5 | 8.7 | 0.54 | 0.003 |
| Pooled | non-coding | 0.02 | 8.3 | −9.6 | 6.5 | 0.09 | 0.08 |
| NHW | non-syn | 0.15 | 0.0 | 5.8 | 13.5 | 0.53 | 0.03 |
| NHW | non-coding | 0.02 | 0.0 | 1.9 | 7.3 | 0.004 | 0.008 |
| NHB | non-syn | 0.08 | 0.0 | 0.5 | 11.4 | 0.42 | 0.0002 |
| NHB | non-coding | 0.02 | 0.0 | −11.4 | 8.1 | 0.07 | 0.13 |
| Hispanic | non-syn | 0.00 | 0.0 | 20.5 | 43.9 | 0 | 0.03 |
| Hispanic | non-coding | 0.10 | 19.6 | −40.8 | 38.2 | 0.08 | 0.66 |
Parameters are defined in equations (1), (4), and (6). SE is for . Attributable variance is that due to decomposition (7), see Text S1 for calculation. Pooled model p = .0064 on 10000 permutations. Pooled model residual variance = 0.29. NHW is non-Hispanic white; NHB is non-Hispanic black.
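The permutation p-value quoted in this note can be illustrated with a minimal sketch. It uses a toy statistic (absolute difference in mean phenotype between carriers and non-carriers) rather than the likelihood ratio statistic of the paper, and all function names are hypothetical. It assumes both carriers and non-carriers are present.

```python
import random


def permutation_p_value(observed, permuted_stats):
    """Share of permuted statistics at least as large as the observed one."""
    exceed = sum(1 for s in permuted_stats if s >= observed)
    return exceed / len(permuted_stats)


def mean_difference(phenotype, carrier):
    """Toy statistic: mean phenotype of carriers minus mean of non-carriers."""
    carriers = [y for y, c in zip(phenotype, carrier) if c]
    others = [y for y, c in zip(phenotype, carrier) if not c]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


def permutation_test(phenotype, carrier, n_perm=10000, seed=0):
    """Shuffle phenotypes against fixed genotypes and compare statistics."""
    rng = random.Random(seed)
    observed = abs(mean_difference(phenotype, carrier))
    shuffled = list(phenotype)
    stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any phenotype-genotype association
        stats.append(abs(mean_difference(shuffled, carrier)))
    return permutation_p_value(observed, stats)
```

Under this plain-proportion convention, a p-value of .0064 from 10000 permutations would correspond to 64 permuted statistics at least as extreme as the observed one. Whether the authors used this convention or an alternative such as (k+1)/(N+1) is not stated in the note.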
Figure 3. Frequency versus estimated effect size in ANGPTL4 with ordinary least squares estimates.
The SNPs have been rank-ordered by observed frequency on the x-axis, with ties broken by estimated effect size. The left y-axis is the predicted effect (defined in (4)) on the log of serum triglycerides. Green solid dots are the point estimates of each variant's effect on log-triglycerides from one-variant-at-a-time ordinary least squares, adjusted for non-genetic covariates. Open circles are joint point estimates of the effects from our method, and bars are 95% prediction intervals on those estimates. Confidence intervals are the elementwise Wald-type estimates described in chapter 6 of [37] and produced by SAS's estimate command in the mixed procedure. See Text S1 for the calculation of point estimates, and the sample code at the author's website for SAS commands. Non-synonymous variation is in black; non-coding variation is in red. The right y-axis and blue line depict the observed count pooled across ethnicities on a log scale.
Table 3. Simulation design.
| scenario | | | | residual standard deviation | expected fitness % variance explained | expected nonfitness % variance explained |
| Base | −7.0 | 0.012 | 0.007 | 0.22 | 0.84 | 0.84 |
| High | −21.0 | 0.012 | 0.007 | 0.50 | 1.51 | 0.17 |
| High | −7.0 | 0.018 | 0.007 | 0.28 | 0.55 | 1.13 |
| Low | −6.4 | 0.012 | 0.003 | 0.21 | 0.83 | 0.85 |
| Very high | −63.1 | 0.012 | 0.003 | 1.43 | 1.66 | 0.02 |
| Zero | 0.0 | 0.012 | 0.007 | 0.16 | 0.00 | 1.68 |
Parameters chosen for simulation. Data were generated by the mechanism of formulas (1) and (2). Parameters are defined in equations (1), (2), and (6). Explained variance is the average true variance over individuals of the fitness component and the fitness-independent component.
Table 4. Simulation study power results.
| scenario | Min p 1% | Min p 5% | CAST 1% | CAST 5% | CMC 1% | CMC 5% | Weighted Sum All | Weighted Sum 5% | Optimal Mean All | Optimal Mean 5% | EMMPAT Split | EMMPAT One |
| Null | .05 | .05 | .05 | .05 | .04 | .06 | .05 | .04 | .04 | .05 | .05 | .04 |
| Base | .22 | .26 | .30 | .27 | .30 | .36 | .22 | .35 | .45 | .39 | .54 | .56 |
| High | .06 | .08 | .12 | .11 | .08 | .10 | .26 | .13 | .38 | .21 | .48 | .48 |
| High | .28 | .34 | .27 | .27 | .42 | .46 | .18 | .45 | .36 | .47 | .52 | .53 |
| Low | .20 | .26 | .21 | .21 | .32 | .36 | .18 | .38 | .36 | .42 | .49 | .48 |
| Very High | .04 | .06 | .06 | .06 | .05 | .06 | .22 | .08 | .32 | .15 | .45 | .45 |
| Zero | .44 | .52 | .48 | .44 | .61 | .66 | .09 | .62 | .45 | .61 | .58 | .62 |
| violation of population model assumptions | ||||||||||||
| DFE*5 | .23 | .28 | .30 | .27 | .33 | .39 | .24 | .39 | .43 | .42 | .51 | .53 |
| DFE*.2 | .22 | .27 | .33 | .30 | .33 | .36 | .24 | .37 | .46 | .39 | .56 | .55 |
| Exponential growth | .29 | .32 | .37 | .35 | .46 | .51 | .13 | .49 | .38 | .48 | .50 | .53 |
| violation of fitness linearity and distribution of | ||||||||||||
| Square | .20 | .23 | .28 | .23 | .29 | .33 | .12 | .31 | .39 | .39 | .57 | .59 |
| Random sign | .20 | .27 | .28 | .25 | .32 | .36 | .10 | .33 | .28 | .34 | .43 | .45 |
| Square root | .23 | .27 | .35 | .28 | .35 | .34 | .23 | .39 | .43 | .41 | .49 | .48 |
| Skew effects | .22 | .29 | .34 | .31 | .33 | .37 | .24 | .38 | .45 | .41 | .56 | .56 |
| 20% no effect | .30 | .33 | .31 | .28 | .40 | .45 | .24 | .46 | .45 | .50 | .59 | .60 |
| 80% no effect | .33 | .33 | .18 | .17 | .35 | .36 | .13 | .36 | .24 | .38 | .44 | .43 |
Simulation parameters are described in Table 3. 1000 replicates were generated for each scenario. Assumption violating scenarios are described in the text. Simulations on violation of model assumptions use the same parameters as “base”. Where nonlinear transformations of fitness effects are used, the variance of transformed fitness effects is rescaled to be the same. Power is proportion of p less than .05. Min p is Bonferroni corrected minimum p-value; CMC is the method of [20]; CAST is the same method ignoring common variants; Weighted is the method of [21]. Optimal Mean uses our implementation of , but not variance components. See methods text “Simulation Studies” for how CAST, CMC, and Weighted were modified to be closer to the data generating mechanism. Columns with “1%” or “5%” involve dichotomizing variants at the specified frequency threshold. CMC, Weighted, and Optimal Mean treat variants above that threshold as free regression parameters. Our method (EMMPAT) likelihood ratio p value is estimated from 500 permutations.
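The quantities named in this note can be sketched in a few lines. The code below is a generic illustration, not the authors' modified implementations: the collapsing indicator only follows the general idea of dichotomizing variants at a frequency threshold, the strict "below threshold" comparison is an assumption, and all function names are hypothetical.

```python
import math


def bonferroni_min_p(p_values):
    """Bonferroni-corrected minimum p-value across the tested variants."""
    return min(1.0, min(p_values) * len(p_values))


def collapse_rare(genotypes, frequencies, threshold):
    """Per-individual indicator of carrying any variant rarer than threshold.

    genotypes: one list of minor-allele counts (0, 1, 2) per individual.
    frequencies: one allele frequency per variant, in the same order.
    Assumption: a variant counts as rare when frequency < threshold.
    """
    rare = [j for j, f in enumerate(frequencies) if f < threshold]
    return [int(any(g[j] > 0 for j in rare)) for g in genotypes]


def power(p_values, alpha=0.05):
    """Proportion of replicate p-values below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def power_standard_error(estimate, n_replicates):
    """Binomial Monte Carlo standard error of an estimated power."""
    return math.sqrt(estimate * (1 - estimate) / n_replicates)
```

With 1000 replicates per scenario, `power_standard_error(0.5, 1000)` is about 0.016, which gives a rough sense of how large a difference between two columns of the table needs to be before it exceeds simulation noise.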