Yun Joo Yoo, Lei Sun, Julia G Poirier, Andrew D Paterson, Shelley B Bull.
Abstract
By jointly analyzing multiple variants within a gene, instead of one at a time, gene-based multiple regression can improve power, robustness, and interpretation in genetic association analysis. We investigate multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes. MLC is a directional test that exploits LD structure in a gene to construct clusters of closely correlated variants recoded such that the majority of pairwise correlations are positive. It combines variant effects within the same cluster linearly, and aggregates cluster-specific effects in a quadratic sum of squares and cross-products, producing a test statistic with reduced degrees of freedom (df) equal to the number of clusters. By simulation studies of 1000 genes from across the genome, we demonstrate that MLC is a well-powered and robust choice among existing methods across a broad range of gene structures. Compared to minimum P-value, variance-component, and principal-component methods, the mean power of MLC is never much lower than that of other methods, and can be higher, particularly with multiple causal variants. Moreover, the variation in gene-specific MLC test size and power across 1000 genes is less than that of other methods, suggesting it is a complementary approach for discovery in genome-wide analysis. The cluster construction of the MLC test statistics helps reveal within-gene LD structure, allowing interpretation of clustered variants as haplotypic effects, while multiple regression helps to distinguish direct and indirect associations.
Keywords: common variants; linkage disequilibrium; multibin linear combination test; multivariant test; quantitative trait
Year: 2016 PMID: 27885705 PMCID: PMC5245123 DOI: 10.1002/gepi.22024
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
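The abstract describes MLC as a linear combination of variant effects within each cluster, followed by a quadratic form across clusters with df equal to the number of clusters. The following is a minimal sketch of one statistic with that shape, not necessarily the authors' exact definition. It assumes the joint regression coefficients `beta` and their covariance matrix `cov` are already available, that variants have already been recoded so within-cluster correlations are mostly positive, and that within-cluster weights come from the inverse covariance matrix. The function name `mlc_statistic` and the example numbers are hypothetical.

```python
import numpy as np

def mlc_statistic(beta, cov, clusters):
    """Quadratic form of cluster-level linear combinations (illustrative sketch).

    beta     : length-K joint regression coefficients (assumed given)
    cov      : K x K covariance matrix of beta (assumed given, symmetric)
    clusters : length-K cluster labels, one per variant
    Returns (statistic, df, cluster_effects), with df = number of clusters.
    """
    beta = np.asarray(beta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    labels = sorted(set(clusters))
    # K x L indicator matrix: J[k, l] = 1 if variant k belongs to cluster l
    J = np.array([[1.0 if c == lab else 0.0 for lab in labels] for c in clusters])
    cov_inv_J = np.linalg.solve(cov, J)      # cov^{-1} J
    A = J.T @ cov_inv_J                      # J' cov^{-1} J, inverse covariance of cluster effects
    u = cov_inv_J.T @ beta                   # J' cov^{-1} beta
    cluster_effects = np.linalg.solve(A, u)  # one combined effect per cluster
    stat = float(u @ cluster_effects)        # sum of squares and cross-products
    return stat, len(labels), cluster_effects

# Hypothetical three-variant example: variants 1 and 2 share a cluster.
beta = [0.2, -0.1, 0.3]
cov = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.05]]
stat, df, effects = mlc_statistic(beta, cov, clusters=[1, 1, 2])
```

Under these assumptions, placing every variant in its own cluster reproduces the generalized Wald statistic, and placing all variants in one cluster gives a 1-df linear combination statistic, so the cluster structure controls where the test sits between those two extremes.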
Application to DCCT/EDIC genetics study: Regression analysis and gene‐based analysis results for association of the HDL trait with 10 SNPs in the CETP gene
Regression Analysis Results

| SNP | rs ID | bp Position | Cluster Allocation | MAF | Beta (Joint) | P (Joint) | LC‐B per Cluster | Wald per Cluster | Beta (Marginal) | P (Marginal) |
|---|---|---|---|---|---|---|---|---|---|---|
| SNP1 | rs17245715 | 56961078 | 1 | 0.082 | –0.040 | 0.075 | 0.17 | 5.56 | 0.006 | 0.709 |
| SNP8 | rs12720898 | 56977331 | 1 | 0.068 | 0.048 | 0.058 | ( | ( | 0.037 | 0.043 |
| SNP10 | rs1801706 | 56983750 | 1 | 0.166 | –0.018 | 0.556 | | | 0.014 | 0.262 |
| SNP3 | rs708273 | 56966037 | 2 | 0.299 | 0.001 | 0.985 | 0.372 | 0.482 | –0.012 | 0.230 |
| SNP6 | rs289717 | 56975476 | 2 | 0.347 | –0.011 | 0.497 | ( | ( | –0.012 | 0.206 |
| SNP4 | rs12720922 | 56966973 | 3 | 0.173 | –0.063 | 0.001 | 10.5 | 12.6 | –0.080 | 4.33 × 10⁻¹¹ |
| SNP5 | rs11076176 | 56973534 | 3 | 0.193 | –0.008 | 0.767 | ( | ( | –0.067 | 5.34 × 10⁻⁹ |
| SNP7 | rs736274 | 56975857 | 4 | 0.109 | 0.011 | 0.690 | 0.789 | 0.813 | 0.045 | 0.002 |
| SNP9 | rs5882 | 56982180 | 4 | 0.304 | 0.003 | 0.893 | ( | ( | 0.032 | 0.001 |
| SNP2 | rs3816117 | 56962246 | 5 | 0.469 | 0.035 | 0.173 | 1.86 | 1.86 | 0.055 | 1.45 × 10⁻⁹ |
| | | | | | | | ( | ( | | |
List of test statistics: Wald: generalized Wald test (10 df); MLC‐B: MLC test using beta coefficients; MLC‐Z: MLC test using Z statistics; LC‐B: linear combination test using beta coefficients; LC‐Z: linear combination test using Z statistics; MinP‐J: minimum P‐value test based on joint regression analysis; MinP‐M: minimum P‐value test based on marginal regression analysis; PC80: global test based on regression using the minimum number of principal components capturing 80% of variance (Gauderman et al., 2007); SSB: sum of squared marginal beta coefficients (Pan, 2009); SSBw: Sum of squared marginal beta coefficients with inverse variance weights (Pan, 2009); SKAT: SKAT for common variants with weights obtained from Beta(0.5,0.5) density function (Ionita‐Laza et al., 2013); SKAT‐O: Linear combination of SKAT and burden test with optimized mixing proportion (Lee et al., 2012).
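The PC80 entry above relies on choosing the minimum number of principal components that capture 80% of the variance. A minimal sketch of that selection step only (not the regression test itself) follows. It assumes the components are taken from the SNP correlation matrix; the function name `n_components_for_variance` and the example matrix are hypothetical.

```python
import numpy as np

def n_components_for_variance(corr, fraction=0.80):
    """Smallest number of leading principal components whose eigenvalues
    sum to at least `fraction` of the total variance."""
    # eigvalsh returns eigenvalues in ascending order; reverse to descending
    eigvals = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # first index where the cumulative share reaches the target, as a count
    count = int(np.searchsorted(cumulative, fraction)) + 1
    return min(count, eigvals.size)

# Three SNPs in strong mutual LD (r = 0.9): one component is enough.
corr = np.full((3, 3), 0.9)
np.fill_diagonal(corr, 1.0)
k = n_components_for_variance(corr)
```

With strongly correlated SNPs the count is small, which is how a principal-component test reduces its degrees of freedom relative to the 10-df Wald test.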
Figure 1. Clustering of SNPs in DCCT/EDIC CETP gene data by applying the CLQ algorithm to the linkage disequilibrium (r) pattern. Edges with |r| < 0.5 are removed. SNPs in the same cluster have the same color. The cluster construction threshold value for the CLQ algorithm was set at c = 0.5.
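The edge-removal step in the Figure 1 caption can be illustrated with a small sketch. This is a simplified stand-in and not the CLQ algorithm: it drops edges with |r| below the threshold c and groups SNPs by connected components, whereas a clique-based method would also require every pair within a cluster to be connected. The function name `threshold_components` and the correlation values are hypothetical.

```python
def threshold_components(r, c=0.5):
    """Group SNPs by connected components of the graph that keeps only
    edges with |r| >= c. Simplified stand-in, NOT the CLQ algorithm."""
    n = len(r)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        next_label += 1
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first walk over the retained edges
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and abs(r[i][j]) >= c:
                    labels[j] = next_label
                    stack.append(j)
    return labels

# Hypothetical 4-SNP correlation matrix: SNPs 0-1 and SNPs 2-3 are linked.
r = [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, -0.6],
     [0.0, 0.1, -0.6, 1.0]]
clusters = threshold_components(r, c=0.5)
```

Because the threshold is applied to |r|, the negatively correlated pair is still grouped; recoding one allele of that pair would make the within-cluster correlation positive.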
Trait models for simulation study
| Model | Description for Causal SNP Selection | No. of SNPs | Effect Size | Correlation | Error | Allele Frequency |
|---|---|---|---|---|---|---|
| 0 | No SNP association | 4–30 | All zero | HapMap | 5 | HapMap |
| 1 | One causal SNP within a gene | 4–30 | | HapMap | Adjusted | HapMap |
| 2 | Two causal SNPs in the same cluster, both deleterious | 4–30 | | HapMap | Adjusted | HapMap |
| 3 | Two causal SNPs in different clusters, both deleterious | 4–30 | | HapMap | Adjusted | HapMap |
| 4 | Two causal SNPs in the same cluster, one deleterious and one protective | 4–30 | | HapMap | Adjusted | HapMap |
| 5 | Two causal SNPs in different clusters, one deleterious and one protective | 4–30 | | HapMap | Adjusted | HapMap |
| 6 | Up to ten deleterious or protective causal SNPs, randomly assigned within a gene | 4–30 | | HapMap | Adjusted | HapMap |
The trait model is $Y = \sum_{j=1}^{C} b_j G_j + \varepsilon$, where $\varepsilon$ is the random error, $C$ is the number of causal SNPs, $b_j$ is the effect of the $j$th causal SNP, and $G_j$ is the number of causal alleles for the $j$th causal SNP. SNPs with index $j > C$ are neutral.
HapMap: Based on the distribution and patterns of HapMap Asian gene panels. Adjusted: The error variance is adjusted to make the power of Wald test 60% for each gene.
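The footnotes define the trait through per-SNP effects b and causal-allele counts G. A minimal simulation sketch follows, assuming an additive linear model with independent normal errors. For brevity it draws genotypes independently from binomial distributions, so it does not reproduce the HapMap-based LD used in the study. The function name `simulate_trait`, the allele frequencies, and the effect sizes are hypothetical.

```python
import numpy as np

def simulate_trait(genotypes, effects, error_sd, rng):
    """Additive trait: sum over causal SNPs of effect * allele count, plus noise.

    genotypes : n x K allele counts (0/1/2); causal SNPs are assumed to
                occupy the first C columns, the rest are neutral
    effects   : length-C effect sizes, one per causal SNP
    error_sd  : standard deviation of the normal error (an assumption)
    """
    genotypes = np.asarray(genotypes, dtype=float)
    effects = np.asarray(effects, dtype=float)
    n_causal = effects.shape[0]
    signal = genotypes[:, :n_causal] @ effects  # neutral SNPs contribute nothing
    noise = rng.normal(0.0, error_sd, size=genotypes.shape[0])
    return signal + noise

rng = np.random.default_rng(0)
maf = np.array([0.08, 0.30, 0.17, 0.47])         # illustrative allele frequencies
G = rng.binomial(2, maf, size=(1000, maf.size))  # independent SNPs, no LD
y = simulate_trait(G, effects=[0.3, -0.2], error_sd=1.0, rng=rng)
```

Here the first causal SNP is deleterious and the second protective in the sense of opposite effect signs, loosely mirroring Models 4 and 5; tuning `error_sd` per gene would play the role of the "Adjusted" error variance.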
Figure 2. Simulation study results (Models 1–5): Average empirical power of MLC test statistics and other gene‐based statistics for 1,000 genes at nominal level α = 0.05 (N = 1,000 simulation replicates used to estimate power for each gene).
Figure 3. Simulation study results (Model 6): Distribution of gene‐specific empirical power of MLC‐B (c = 0.5 and 0.7) and other gene‐based statistics obtained for 1,000 genes at nominal level α = 0.05, stratified by the number of causal SNPs. The box plot shows five points: the median, the first and third quartiles (computed using Tukey's “hinges”), and the end points of the whiskers. The whiskers extend to the most extreme values lying no more than 1.5 times the interquartile range from the box. Outliers are shown in sand color. Note that the simulation error variance was adjusted separately for each gene to obtain 60% Wald test power in a sample size of n = 1,000, assuming the regression analysis includes the causal SNPs. Upper panel (a): causal SNPs included in the regression analysis; lower panel (b): causal SNPs excluded from the regression analysis.
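The box plot convention in the Figure 3 caption can be made concrete with a short sketch. It assumes the common definition of Tukey's hinges (medians of the lower and upper halves of the sorted data, with the overall median shared by both halves when the sample size is odd) and whiskers measured from the hinges. The function name `tukey_box_stats` and the sample values are hypothetical.

```python
from statistics import median

def tukey_box_stats(values, k=1.5):
    """Median, Tukey hinges, whisker end points, and outliers."""
    x = sorted(values)
    n = len(x)
    half = (n + 1) // 2  # halves share the median when n is odd
    lower_hinge = median(x[:half])
    upper_hinge = median(x[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - k * spread
    high_fence = upper_hinge + k * spread
    inside = [v for v in x if low_fence <= v <= high_fence]
    return {
        "median": median(x),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "whisker_low": inside[0],    # most extreme value within the fence
        "whisker_high": inside[-1],
        "outliers": [v for v in x if v < low_fence or v > high_fence],
    }

stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this example the hinges are 3 and 8, so any value above 8 + 1.5 × 5 = 15.5 is drawn as an outlier rather than stretching the whisker.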