
Meta-analysis of gene-level tests for rare variant association.

Dajiang J Liu¹, Gina M Peloso², Xiaowei Zhan¹, Oddgeir L Holmen³, Matthew Zawistowski⁴, Shuang Feng⁴, Majid Nikpay⁵, Paul L Auer⁶, Anuj Goel⁷, He Zhang⁸, Ulrike Peters⁹, Martin Farrall⁷, Marju Orho-Melander¹⁰, Charles Kooperberg¹¹, Ruth McPherson⁵, Hugh Watkins⁷, Cristen J Willer⁸, Kristian Hveem¹², Olle Melander¹⁰, Sekar Kathiresan¹³, Gonçalo R Abecasis¹.

Abstract

The majority of reported complex disease associations for common genetic variants have been identified through meta-analysis, a powerful approach that enables the use of large sample sizes while protecting against common artifacts due to population structure and repeated small-sample analyses sharing individual-level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the focus of analysis. Here we propose and evaluate new approaches for performing meta-analysis of rare variant association tests, including burden tests, weighted burden tests, variable-threshold tests and tests that allow variants with opposite effects to be grouped together. We show that our approach retains useful features from single-variant meta-analysis approaches and demonstrate its use in a study of blood lipid levels in ∼18,500 individuals genotyped with exome arrays.


Substances: Lipids

Year: 2013 PMID: 24336170 PMCID: PMC3939031 DOI: 10.1038/ng.2852

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

Introduction

Proceeding from the discovery of a genetic association signal to a mechanistic insight about human biology should be much easier for one or a set of alleles with clear functional consequence, including non-synonymous, splice altering and protein truncating alleles. Most of these alleles are very rare, with only one such allele expected to reach MAF>5% in the average human gene[1]. Recent advances in exome sequencing and the development of exome genotyping arrays are enabling explorations of the very large reservoir of rare coding variants in humans and are expected to accelerate the pace of discovery in human genetics[2]. Rare variants can be examined using association tests that group alleles in a gene or other functional unit[3]. Compared to tests of individual alleles, this grouping can increase power, especially when applied to large samples where several rare variants are observed in the same functional unit[4]. The simplest rare variant tests consider the number of potentially functional alleles in each individual[5], but the tests can be refined to weigh variants according to their likely functional impact[6], to allow for imputed or uncertain genotypes[7,8], or to allow variants that increase and decrease risk to reside in the same gene[9-11] (a feature that is important when the same gene harbors hypermorph and hypomorph alleles[12]). The optimal strategy for grouping and weighting rare variants – ranging from focusing on protein truncation alleles to examining all non-synonymous variants and encompassing strategies that examine all variants with frequency <5% as well as alternatives that examine only singletons – depends on the unknown genetic architecture of each trait and each locus[13]. Here, we describe practical approaches for meta-analysis of rare variants. 
Our approach starts with simple statistics that can be calculated in an individual study (single site score statistics and their covariance matrix, which summarizes the linkage disequilibrium information and relatedness among sampled individuals). We then show that, when these statistics are shared, a wide variety of gene-level association tests can be executed centrally – including both weighted or un-weighted burden tests with fixed[5] or variable frequency threshold[6] and sequence kernel association tests (SKAT) that accommodate alleles with opposite effects within a gene[9]. Our approach generates comparable results to sharing individual level data (and, in fact, identical results when allowing for between study heterogeneity in nuisance parameters, such as trait means, variances and covariate effects). As an illustration of our approach, we analyze blood lipid levels in >18,500 individuals genotyped with exome genotyping arrays. Our analysis of blood lipid levels provides examples of loci where signal for gene-level association tests exceeds signal for single variant tests and shows that our approach can recover signals driven by very rare variants (frequency <0.05%). Given that very large sample sizes are required for successful rare variant association studies, we expect our methods (and refined versions thereof) will be widely useful. Our approach is based on the insight that analogues of most gene level association tests can be constructed using single variant test statistics and knowledge of their correlation structures. As shown in Methods, simple[14] and weighted[10,15] burden tests, variable threshold tests[6] and tests allowing for variants with opposite effects[9] can be constructed in this manner. We meta-analyze single variant statistics using the Cochran-Mantel-Haenszel method, calculate variance-covariance matrices for these statistics, and construct gene-level association tests by combining the two. 
In Supplementary Notes, we show that rare variant statistics generated in this way are identical to those obtained by sharing individual level data and allowing for heterogeneity in nuisance parameters, with no loss of power. Importantly, rare variant statistics calculated in this way are less vulnerable to artifacts due to population stratification than statistics generated by naïvely pooling individual level data. As in other meta-analysis settings, sharing summary statistics accelerates the overall analysis process, mitigates concerns about participant confidentiality, and reduces the risk that data will be used for unapproved analyses (as always, to avoid violating the trust of research subjects, we strongly recommend that investigators sharing summary statistics agree that these will not be used to identify research subjects). For evaluating significance, we propose methods for calculating p-values using asymptotics and also Monte-Carlo methods that use knowledge of linkage disequilibrium relationships to sample plausible combinations of single variant statistics and then generate empirical distributions for our gene-level statistics. Since evaluating asymptotic p-values can be numerically unstable, Monte-Carlo methods can be used to verify interesting p-values.

Results

We first evaluated our method using simulations. Genes were simulated as stretches of 5,000 base-pairs using the coalescent[16] and a demographic model (including an ancient bottleneck, recent exponential growth, differentiation and migration) calibrated to mimic a sample of multiple European populations[17,18] (Supplementary Figure 1 and Supplementary Notes). The average FST value between simulated populations was 0.004 – as expected when the distribution of rare variants is geographically restricted[19]. The simulations produced samples of 1,000 individuals, each drawn from one of several related populations, typically including a few shared variants and many population specific variants. Half of the simulated variants were randomly set to increase trait values by 1/8th of a standard deviation (Supplementary Figure 2 and see Supplementary Figures 3 and 4 for similar results using alternative trait models). We analyzed each simulated sample with a series of gene-level association tests. Supplementary Figures 2-4 compare results obtained for 10,000 simulated genes using our meta-analysis approach to a combined analysis of individual level data across studies. For variable threshold tests, we found the p-values were sometimes slightly different (r^2=0.995 between the two sets of log p-values); for the other two tests p-values and test statistics were indistinguishable. Calculation of analytical p-values for variable threshold tests requires the evaluation of high-dimensional integrals that can be numerically unstable and is thus very sensitive to small differences in the variance-covariance matrix. In practice, it will often be a good idea to confirm significant p-values using our Monte-Carlo approach. To evaluate our Monte-Carlo approach, we compared its empirical p-values to those obtained by permuting phenotypes between individuals within each study.
We implemented adaptive versions of both algorithms[20], with more simulations carried out when the p-value is small and fewer simulations when the p-value is large. Log p-values for the two approaches are highly concordant (r^2=0.996). When small p-values are estimated, increasing the number of simulations improves the precision for the estimated p-values (Supplementary Figure 5). We next verified type I error was well controlled (Supplementary Table 1). In all analyses, we first applied an inverse normal transformation to trait residuals (which helps ensure our statistics are well behaved even for very rare variants, as in Supplementary Figure 6). Reassured that type I error was well controlled, we next explored power for several scenarios (Figure 1A, 1B, 1C and Supplementary Figure 7A, 7B, 7C). It is clear that, for the effect sizes simulated here, very large samples may be required. In some settings, power only reaches ∼60% in analyses of ∼100,000 individuals. We did not find a universally most powerful method, emphasizing the value of implementing a diverse set of test statistics (see also Ladouceur et al[13]). Since meta-analysis methods that combine p-values are popular for common variants and can also be implemented for rare variants, we compared power between our method and analyses based on Fisher's method and the minimal p-value approach for combining p-values (Figure 1 and Supplementary Figure 7). In all the simulation scenarios considered, our method greatly outperforms these alternatives, especially when information is combined across a large number of samples. In addition to power, our approach provides three useful features. First, it provides great flexibility in the choice of rare variant association test (definition of functional units, choice of variants to be grouped, frequency thresholds for analysis); approaches based on Fisher's method would likely require every contributing study to re-analyze their data when any of these changes.
Second, because in addition to p-values it provides for estimates of effect size (in all cases) and allele frequency thresholds for candidate variants (in the variable threshold test), our method provides rich information that helps interpretation. Third, our approach allows the relationship between multiple association signals in a region to be dissected through conditional analysis, as detailed below.

Figure 1

Power comparison for our approach, Fisher's method and the minimal p-value approach. Three phenotype models were simulated: (1) half of low frequency variants with MAF < 0.5% are causal, each increasing expected trait values by 1/4 standard deviation; (2) half of all variants are causal, irrespective of frequency, and increase trait values by 1/4 standard deviation; (3) 50% of the variants are causal, irrespective of frequency, and 80% of these increase expected trait values by 1/4 standard deviation, while the remaining 20% decrease trait values by the same amount. Between 2 and 100 samples of size 1,000 were simulated for each model, with each sample drawn from a randomly chosen population. Meta-analysis was performed using our approach or using Fisher's method and the minimal p-value approach to combine burden test, SKAT and variable threshold (VT) test statistics for variants with MAF<5%. The power was evaluated at the significance threshold of α=2.5×10^-6 using 10,000 replicates. Panels A, B and C display the power of the three meta-analysis methods under model (1) using the simple burden test, VT and SKAT, respectively. Panels D, E and F display the corresponding results under model (2), and panels G, H and I display the corresponding results under model (3). Note that differences between our approach and these alternatives become more marked when more studies are meta-analyzed.

We proceeded to a meta-analysis of blood lipid levels in 18,699 individuals of European ancestry genotyped with Illumina Exome arrays and drawn from 7 studies: the Women's Health Initiative[21], the Ottawa Heart Study[22], the Malmö Diet and Cancer Study – Cardiovascular Cohort (MDC)[23], the PROCARDIS Precocious Coronary Artery Disease Case Series, PROCARDIS Control series[24] and the Nord-Trøndelag Health Study (HUNT) myocardial infarction cases and matched controls[25] (see Supplementary Tables 2 and 3 for summary statistics for each of these samples, including basic demographics, summaries of lipid levels, number of non-synonymous and loss-of-function variants per individual and number of variant sites shared across different studies). Overall, 171,193 variants were polymorphic in at least one individual. Among these variants, 125,702 – the vast majority – have frequency <1%. To verify the soundness of our approach, we repeated our power and type I error simulations using real genotype data from the HUNT and MDC studies but simulated phenotypes. These additional experiments confirm that our method produces well-calibrated statistics and is more robust to stratification than analyses that directly pool individual level data and treat the complete dataset as a single study without modeling heterogeneity between studies (Supplementary Figure 8). In addition, the power for our method continued to exceed that for alternatives that directly combine p-values from individual studies (Supplementary Figure 9). We then proceeded to meta-analyze single variant association test results. The resulting test statistics appear well calibrated, with genomic control value <1.05 for all three traits, both for common and for rare variants (Supplementary Figure 10).
At a significance threshold of p<3×10^-7 (corresponding to 0.05/171,193), we found significantly associated variants (with MAF<5%) at LPL[26], ANGPTL4[26], LIPG[26], CD300LG[27], LIPC[26], APOB[26], HNF4A[26] for HDL; PCSK9[26], BCAM-CBLC-PVR (neighboring APOE)[26], and APOB[26] for LDL; ANGPTL4[26], LPL[26] and APOB[26] for TG (Supplementary Table 4). Except for the variants in LIPC and APOB, all other significantly associated variants have frequency >1%, reflecting the limited power of single variant association tests for rare alleles. We next carried out gene-level tests. Again, test statistics appear well calibrated, with genomic control value <1.05 (Supplementary Figure 11). At a significance threshold of p<3.1×10^-6 (corresponding to 0.05/16,153 and thus allowing for the number of genes tested), we observed association at LIPC, LPL, ANGPTL4, LIPG, HNF4A and CD300LG for HDL, at PCSK9, the APOE locus (as well as nearby genes PVR, BCAM, and CBLC), and LDLR for LDL, and at ANGPTL4 and LPL for triglycerides (Table 1). Supplementary Table 5 emphasizes that, at these loci, much stronger signals are identified in meta-analysis than in any component study. Reassuringly, these signals point to loci identified in previous genome-wide association studies and/or re-sequencing studies. Importantly, note that our approach was able to appropriately identify the signal in LDLR, which is driven by several very rare variants (each with frequency <0.00052) that nearly always increase blood LDL cholesterol levels, and that, at several other loci, the gene-level association signal exceeded the best single variant signal in the gene (Supplementary Table 6). We again compared our method with conventional methods such as the minimal p-value approach, Fisher's method, and an extended Fisher's method taking into account unequal sample sizes (Methods). As shown in Supplementary Tables 7-9, our method identifies a larger number of loci, all known to be associated with lipid levels in humans.
We also compared results obtained from our meta-analysis method with results from directly pooling a subset of the data (after normal transformation of trait values in each sample to avoid artifacts due to stratification). Reassuringly, p-values from our approach and joint analysis of pooled data were highly concordant with r^2>0.99 (Supplementary Figure 12), in accordance with results obtained using coalescent simulations.

Table 1

Results for meta-analysis of gene-level rare variant association tests. Associations that attain exome-wide significance (p < 3.1×10^-6) are displayed. Five gene-level association tests were used to analyze the data: simple burden tests with 1% or 5% cutoff (Burden-1 and Burden-5), SKAT tests with 1% or 5% cutoff (SKAT-1 and SKAT-5) and variable threshold (VT) tests that analyze variants with MAF<5%. Significant p-values for each test are displayed in bold font. For the associations that are significant, estimates of average genetic effect are also shown. The loci where one or more gene-based association signal exceeds the top single variant association signal are labeled with an asterisk.

Gene	Gene Position (a)	Burden-1	Burden-5	SKAT-1	SKAT-5	VT	MAF Cutoff	Direction of Single Variant Association Statistics (b)	Average Genetic Effect (s.d. units), MAF<0.01	Average Genetic Effect (s.d. units), MAF<0.05	Average Genetic Effect (s.d. units), VT
HDL
LIPC*	chr15:58.7Mb	1.4×10^-12	3.5×10^-7	1.8×10^-9	1.4×10^-2	4.5×10^-12	3.7×10^-3	-++++--+-	0.5	0.1	0.5
LPL*	chr8:19.8Mb	9.7×10^-1	2.5×10^-24	3.5×10^-1	5.0×10^-13	1.5×10^-23	2.5×10^-2	(-)-(-)+-++	-	-0.3	-0.3
ANGPTL4*	chr19:8.4Mb	2.2×10^-2	2.9×10^-19	2.2×10^-2	3.0×10^-19	1.8×10^-18	2.6×10^-2	(+)--++-+++	-	0.3	0.3
LIPG*	chr18:47.1Mb	2.2×10^-5	6.4×10^-19	2.1×10^-5	2.9×10^-9	4.4×10^-18	1.3×10^-2	-++----(+)+	-	0.4	0.4
HNF4A	chr20:43.0Mb	7.5×10^-1	2.8×10^-7	6.8×10^-1	2.5×10^-7	1.5×10^-6	4.1×10^-2	(-)--+-+	-	-0.1	-0.1
CD300LG	chr17:41.9Mb	4.9×10^-1	8.5×10^-7	5.2×10^-1	1.0×10^-5	3.1×10^-6	3.3×10^-2	(-)+-(+)	-	-0.1	-

LDL
PCSK9*	chr1:55.5Mb	1.8×10^-2	7.4×10^-19	8.1×10^-2	5.5×10^-17	2.0×10^-28	1.3×10^-2	(-)--(-)--+-++-	-	-0.3	-0.5
BCAM	chr19:45.3Mb	1.7×10^-1	1.6×10^-18	1.5×10^-1	3.0×10^-5	2.6×10^-17	3.6×10^-2	(-)+++(-)+-+++---+(-)+--+--++	-	-0.1	-0.1
CBLC	chr19:45.3Mb	9.4×10^-1	2.0×10^-15	4.4×10^-1	1.5×10^-4	1.0×10^-14	4.4×10^-2	-(-)--+-(-)(+)	-	-0.1	-0.1
PVR	chr19:45.2Mb	6.1×10^-2	3.0×10^-10	4.8×10^-2	6.3×10^-2	1.1×10^-9	4.9×10^-2	(-)++--+	-	-0.1	-0.1
LDLR*	chr19:11.2Mb	1.8×10^-3	4.7×10^-5	3.8×10^-2	2.5×10^-1	2.4×10^-7	5.2×10^-4	+++++++++-++++--+	-	-	0.8

TG
ANGPTL4*	chr19:8.4Mb	2.6×10^-2	1.2×10^-24	3.7×10^-2	3.9×10^-25	7.1×10^-24	2.6×10^-2	(-)+---+---	-	-0.3	-0.2
LPL*	chr8:19.8Mb	6.8×10^-1	7.7×10^-20	2.6×10^-1	1.8×10^-11	4.6×10^-19	2.5×10^-2	(+)+(+)--+-	-	0.2	0.2

(a) Gene position is defined based upon hg19 (GRCh37, Genome Reference Consortium Human Reference 37).

(b) Direction of single site statistics for variants with MAF<5%. Variants within parentheses have frequency >1%.

(*) The loci with one or more gene-level association signal exceeding the top single variant signal.

An added convenience of sharing single-variant statistics together with their covariance matrices, as we propose, is that it facilitates conditional analyses, extending an idea used by Yang et al[28] for analysis of common variants in GWAS meta-analysis. Supplementary Figure 13 illustrates how, in simulations, common variants can generate shadow rare variant association signals at nearby genes, and how our method for conditional analysis resolves the problem. In real data, we re-examined two of the LDL associated loci in detail, LDLR and APOE-BCAM-CBLC-PVR. For LDLR, we examined the relationship between rare variant signals and three nearby common variants[26]. Specifically, we conditioned on genotypes for 3 common variants (rs6511720, rs2228671 and rs72658855) exhibiting significant association in the region, and found that the LDLR rare variant association remains significant (p-value 4.6×10^-7) (Supplementary Table 10). For the APOE-BCAM-CBLC-PVR locus, after conditioning on the common variant showing strongest association in the region (rs7412), gene-level associations at BCAM, CBLC and PVR become non-significant, suggesting that these rare-variant signals are the result of regional linkage disequilibrium with more common and well described variants in APOE (Supplementary Table 11). We also analyzed the top single variant association signals conditional on the genotypes of rare variants (with MAF≤5%) that are included in the burden tests. The top single variant signals from both the APOE gene and the LDLR gene remained significant (Supplementary Table 12). For completeness, Supplementary Figures 14 and 15 show that conditional analyses using individual level data in a subset of samples and conditional analyses using our meta-analysis based approach give highly concordant p-values (r^2>0.99).

Discussion

In the analysis of each sample, when population stratification is of concern, we recommend that principal components of the genotype matrix be incorporated in the regression model as covariates[29] or that linear mixed models with empirically estimated kinship matrices be used[30]. Linear mixed models can also be used to account for relatedness in family studies or other samples that include cryptically related individuals. Our software implementation readily allows for both these options, including correct calculation of kinship matrices to allow family samples to be included in meta-analyses (see Methods for details). Although we only presented applications of our method to quantitative trait meta-analysis, our methods and tools can be applied to binary traits as well (see Methods for details). For binary traits, assumptions about the normality of test statistics may be less reliable. This could affect the performance of our resampling method for empirical p-values, meta-analysis results for the rarest variants, and conditional analysis statistics (see also the work of Lin and Tang [9] and Lee et al [31]). Since performance of our methods (and other similar approaches) for binary traits will depend on factors like sample size and the balance of cases and controls in each sample, we recommend careful quality control of results for such studies, including, for example, review of quantile-quantile plots for variants of different frequencies. Our methods are implemented as freely available software, including programs for calculating summary statistics, annotating the resulting summaries, performing meta-analysis, calculating gene-level statistics and executing conditional analyses. Our tools work with standard VCF files[32] for genotype data and Merlin[33] or PLINK[34] files for phenotype data. Meta-analysis has facilitated many discoveries in common variant association studies.
Here, we describe a powerful framework for meta-analysis of rare variants at the level of genes or other functional units. Through simulation and empirical evaluation, we demonstrate that our approach is well calibrated and provides comparable power to more cumbersome analyses that require pooling all individual level data. Through the analysis of blood lipid levels across seven studies, we show that our approach can detect rare variant association signals at known candidate loci. Our method has a variety of unique features, which include supporting a variety of rare variant association tests, allowing for the analysis of family samples and the calculation of empirical p-values, and enabling conditional analysis that can distinguish truly novel rare variant signals from shadows of other nearby common or rare associations. We envision that this approach (and continued development of related approaches [35-37]) will facilitate the large sample sizes required to accelerate new discoveries in complex trait genetics.

Methods

This section starts with a summary of notation, proceeds to describe the statistics to be shared between studies and methods for single variant meta-analysis. We then show that the statistics for different gene-level tests can be calculated using summary level data, enabling efficient meta-analysis. In the Supplementary Notes, we provide many additional details and summarize how each of the test statistics used here can be derived as a score test using likelihood functions that allow for per-sample nuisance parameters.

Notation

For simplicity, we describe our strategy for analysis of a single gene. Let $J$ be the number of variant nucleotide sites genotyped in at least one study. For study $k$, let $N_k$ denote the number of samples phenotyped and genotyped, and let the vector $\mathbf{y}_k = (Y_{1,k}, \ldots, Y_{N_k,k})^T$ denote the quantitative trait residuals (after adjustment for any covariates), with variance $\sigma_k^2$. Within each study $k$, we encode genotype information in the matrix $\mathbf{X}_k$, where each entry $X_{i,j,k}$ represents the genotype for individual $i$ at site $j$, coded as the number of alternative alleles. We encode missing genotypes in the dataset as the average number of minor alleles in individuals who are genotyped for that marker. The multi-site genotype for individual $i$ is denoted by the row vector $\mathbf{x}_{i,\bullet,k}$, and the genotypes for all $N_k$ individuals at site $j$ are given by the column vector $\mathbf{x}_{\bullet,j,k}$. For ease of presentation, we define the mean genotype matrix $\bar{\mathbf{X}}_k$, where the $(i,j)$-th element is $(\sum_i X_{i,j,k})/N_k$.

Summary Statistics To Be Shared

For each study, we first calculate and share a vector of score statistics $\mathbf{u}_k = (\mathbf{X}_k - \bar{\mathbf{X}}_k)^T \mathbf{y}_k$, a corresponding variance-covariance matrix $\mathbf{V}_k$, and allele frequencies for each marker $p_{j,k} = \sum_i X_{i,j,k} / 2N_k$. Note that $\mathbf{V}_k$ effectively describes linkage disequilibrium relationships between the variants being examined. To perform quality control, we also share the mean and variance of the quantitative trait residuals, the genotype call rate and Hardy-Weinberg equilibrium p-values at each variant site.
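
As a rough illustration of the per-study step, the Python sketch below computes a score vector, a covariance matrix and allele frequencies from a small genotype matrix and a vector of trait residuals. The function name `summary_statistics` and the toy data are hypothetical. The covariance form used here (estimated residual variance times the cross-product of centered genotypes) is an assumption: it is the null variance implied by a score vector of the form "centered genotypes transposed times residuals", but the exact expression used by the authors is not recoverable from this text.

```python
def summary_statistics(genotypes, residuals):
    """Per-study summaries for one gene.

    genotypes: list of rows, one per individual, holding allele counts (0/1/2).
    residuals: trait residuals, one per individual (already covariate-adjusted).
    """
    n = len(genotypes)
    n_sites = len(genotypes[0])
    # Column means of the genotype matrix (the "mean genotype" for each site).
    means = [sum(row[j] for row in genotypes) / n for j in range(n_sites)]
    centered = [[row[j] - means[j] for j in range(n_sites)] for row in genotypes]
    # Score vector: centered genotypes (transposed) times residuals.
    u = [sum(centered[i][j] * residuals[i] for i in range(n)) for j in range(n_sites)]
    # Residual variance estimate (assumption: plain mean square about the mean).
    y_bar = sum(residuals) / n
    sigma2 = sum((y - y_bar) ** 2 for y in residuals) / n
    # Assumed covariance: sigma2 * (X - Xbar)^T (X - Xbar).
    v = [[sigma2 * sum(centered[i][a] * centered[i][b] for i in range(n))
          for b in range(n_sites)] for a in range(n_sites)]
    # Allele frequency: total allele count divided by 2N.
    freq = [m / 2 for m in means]
    return u, v, freq


# Toy study: 4 individuals, 2 variant sites.
toy_genotypes = [[0, 0], [1, 0], [0, 1], [2, 0]]
toy_residuals = [-1.0, 0.5, 0.0, 0.5]
u_k, v_k, freq_k = summary_statistics(toy_genotypes, toy_residuals)
```

Only `u_k`, `v_k` and `freq_k` (plus the quality-control summaries) would leave the study; the genotype matrix itself is never shared.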

Meta-analysis of Single Variant Association Test Statistics

We first combine single variant association test statistics across studies using the Cochran-Mantel-Haenszel method. Specifically, we calculate a score statistic at each site as: $T_j = U_j / \sqrt{V_j}$, where $U_j = \sum_k U_{j,k}$ and $V_j = \sum_k V_{jj,k}$, with $U_{j,k}$ the score statistic for site $j$ in study $k$ and $V_{jj,k}$ its variance. For ease of presentation, we denote the vector of single variant association tests after meta-analysis as $\mathbf{u} = \sum_k \mathbf{u}_k$. Under the null, this vector is distributed as multivariate normal with mean vector $\mathbf{0}$ and covariance matrix $\sum_k \mathbf{V}_k$.
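
The combination step amounts to summing score vectors and covariance matrices across studies and standardizing each site. The sketch below illustrates this with made-up numbers; `meta_single_variant` is a hypothetical helper, the two-sided normal p-value is a conventional choice rather than something stated in the text, and sites missing from a study are assumed to be represented by zeros in that study's inputs.

```python
import math


def meta_single_variant(u_list, v_list):
    """Sum per-study score vectors and covariance matrices, then standardize."""
    n_sites = len(u_list[0])
    u = [sum(u_k[j] for u_k in u_list) for j in range(n_sites)]
    v = [[sum(v_k[a][b] for v_k in v_list) for b in range(n_sites)]
         for a in range(n_sites)]
    # Standardized statistic per site; sites with zero variance get 0.
    z = [u[j] / math.sqrt(v[j][j]) if v[j][j] > 0 else 0.0 for j in range(n_sites)]
    # Two-sided p-value from the standard normal distribution.
    p = [math.erfc(abs(z_j) / math.sqrt(2)) for z_j in z]
    return u, v, z, p


# Two toy studies, two sites each (values are illustrative only).
study_u = [[1.5, 0.0], [2.5, -1.0]]
study_v = [
    [[1.0, -0.25], [-0.25, 0.5]],
    [[3.0, 0.25], [0.25, 0.5]],
]
u_meta, v_meta, z_meta, p_meta = meta_single_variant(study_u, study_v)
```

The summed vector and matrix returned here are the only inputs the gene-level tests need.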

Burden Tests That Assume Variants Have Similar Effect Sizes

For a simple burden test in study $k$, the impact of multiple rare variants in a region can be modeled using a shared regression coefficient in a model that takes the form: $Y_{i,k} = \beta_{0,k} + \beta_{\mathrm{BURDEN}}\, C(\mathbf{x}_{i,\bullet,k}) + \varepsilon_{i,k}$, with $\varepsilon_{i,k} \sim N(0, \sigma_k^2)$. Here, $C(\mathbf{x}_{i,\bullet,k})$ is a function that takes genotypes for a single individual as input and returns the count of rare alleles (the “rare variant burden”) in the gene being examined. When individual level data is available and nuisance parameters $\beta_{0,k}$ and $\sigma_k^2$ are allowed to vary between studies, the score statistic for a rare variant burden test becomes: $U_{\mathrm{BURDEN}} = \boldsymbol{\omega}^T \sum_k \mathbf{u}_k = \sum_j \omega_j U_j$, which is equal to a linear sum of (weighted) single variant score statistics. Under the null, this statistic is approximately normally distributed with mean 0 and variance $V_{\mathrm{BURDEN}} = \boldsymbol{\omega}^T (\sum_k \mathbf{V}_k) \boldsymbol{\omega}$, enabling significance tests. Here, $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_J)^T$ is the vector of weights, with each element $\omega_j$ representing the weight assigned to variant $j$ according to its allele frequency or its computationally predicted functional impact[10,15]. The formula above makes it clear that, when nuisance parameters are allowed to vary between studies, the same burden score statistics that could be calculated by sharing individual data can be equivalently calculated using shared summary statistics.
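
To make the "weighted sum of single variant statistics" concrete, the sketch below computes a burden statistic from a meta-analyzed score vector and its covariance matrix. The helper name `burden_test`, the unit weights and the input numbers are illustrative assumptions; the two-sided normal p-value follows from the approximate normality stated above.

```python
import math


def burden_test(u, v, weights):
    """Burden statistic from summed scores u and summed covariance v."""
    n_sites = len(u)
    # Weighted sum of single variant scores.
    u_burden = sum(w * u_j for w, u_j in zip(weights, u))
    # Quadratic form w^T V w gives the null variance of the weighted sum.
    v_burden = sum(weights[a] * v[a][b] * weights[b]
                   for a in range(n_sites) for b in range(n_sites))
    z = u_burden / math.sqrt(v_burden)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return u_burden, v_burden, z, p


# Illustrative meta-analysis summaries for a gene with two rare variants.
gene_u = [4.0, -1.0]
gene_v = [[4.0, 0.0], [0.0, 1.0]]
u_b, v_b, z_b, p_b = burden_test(gene_u, gene_v, weights=[1.0, 1.0])
```

Changing the grouping or the weights only changes `weights`; no contributing study needs to re-run its analysis.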

Variable Threshold Tests with an Adaptive Frequency Threshold

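In the variable threshold test, rare variant burden statistics are calculated for each observed variant minor allele frequency threshold and significance is evaluated for the maximum of these statistics. Given a specific variant frequency threshold $F$, we define the resulting burden score statistic as: $U_{VT(F)} = \mathbf{v}_F^T \sum_k \mathbf{u}_k$. Here, $\mathbf{v}_F$ is a vector of indicators where the $j$th element equals 1 if the pooled minor allele frequency at variant site $j$ is less than $F$ and zero otherwise. For convenience, we also define a matrix of indicators for the $m$ minor allele frequency thresholds, $\boldsymbol{\Phi} = (\mathbf{v}_{F_1}, \mathbf{v}_{F_2}, \ldots, \mathbf{v}_{F_m})$. After a burden statistic is calculated for each potential frequency threshold, these are standardized, dividing each statistic by the square root of its corresponding variance $V_{VT(F)} = \mathbf{v}_F^T (\sum_k \mathbf{V}_k) \mathbf{v}_F$, and the maximum statistic is identified: $T_{VT} = \max_F |U_{VT(F)}| / \sqrt{V_{VT(F)}}$. Significance for this statistic can be evaluated using the cumulative distribution function for the multivariate normal distribution[38]. Specifically, given the definition of the covariance between burden statistics calculated using different allele frequency thresholds, we have: $\mathrm{Cov}(U_{VT(F_a)}, U_{VT(F_b)}) = \mathbf{v}_{F_a}^T (\sum_k \mathbf{V}_k) \mathbf{v}_{F_b}$. The p-value for the VT test statistic is given by $p_{VT} = 1 - \Pr(|U_{VT(F_a)}| \le t_{VT} \sqrt{V_{VT(F_a)}},\ a = 1, \ldots, m)$, where $t_{VT}$ is the observed value of $T_{VT}$ and the probability is evaluated using the distribution function $\mathcal{F}$ for the multivariate normal distribution $MVN(\mathbf{0}, \boldsymbol{\Phi}^T (\sum_k \mathbf{V}_k) \boldsymbol{\Phi})$.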
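
The search over thresholds can be sketched as follows. This is only an illustration under stated assumptions: the helper `variable_threshold_statistic` is hypothetical, each observed pooled frequency is used as a candidate threshold with variants at or below it included, the maximum is taken over absolute standardized statistics, and no p-value is computed (that requires the multivariate normal integral or a resampling scheme).

```python
import math


def variable_threshold_statistic(u, v, maf):
    """Return the largest standardized burden statistic over MAF thresholds."""
    best_z, best_f, per_threshold = 0.0, None, []
    for f in sorted(set(maf)):
        # Indicator vector: variants with pooled MAF at or below this threshold.
        idx = [j for j, m in enumerate(maf) if m <= f]
        u_f = sum(u[j] for j in idx)
        v_f = sum(v[a][b] for a in idx for b in idx)
        # Standardize by the square root of the variance.
        z_f = u_f / math.sqrt(v_f) if v_f > 0 else 0.0
        per_threshold.append((f, z_f))
        if abs(z_f) > abs(best_z):
            best_z, best_f = z_f, f
    return best_z, best_f, per_threshold


# Illustrative summaries for three variants of increasing frequency.
vt_u = [3.0, 1.0, -2.0]
vt_v = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 2.0]]
vt_maf = [0.001, 0.004, 0.02]
vt_z, vt_f, vt_all = variable_threshold_statistic(vt_u, vt_v, vt_maf)
```

In this toy example the rarest variant alone gives the strongest signal, so the selected threshold is the smallest frequency.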

Burden Tests that Assume A Distribution of Variant Effect Sizes (e.g. SKAT tests)

The simple burden test and variable threshold test described above can be underpowered when variants with opposite phenotypic effects reside in the same gene and are grouped together, because the shared regression coefficient can average close to zero in that situation[9-12]. To accommodate this setting, we consider an underlying distribution of rare variant effect sizes with mean zero and test whether the variance of this distribution, $\tau$, is greater than zero. When individual level data is available, association analysis in study $k$ is performed using the following model: $Y_{i,k} = \beta_{0,k} + \mathbf{x}_{i,\bullet,k} \boldsymbol{\beta} + \varepsilon_{i,k}$. We make inferences about rare variant effect sizes $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_J)^T$ by assuming these follow a common distribution with mean zero and variance $\tau$. Under the null, $\tau = 0$. Following Wu et al[9], in Supplementary Notes we derive the score statistic for this model and show that it can be calculated on the basis of per-study summary statistics: $Q = (\sum_k \mathbf{u}_k)^T \mathbf{K} (\sum_k \mathbf{u}_k)$. Here, $\mathbf{K}$ is the kernel matrix that compares multi-site genotypes. A default choice[9] is a diagonal matrix $\mathbf{K} = \mathrm{diag}(\omega_1, \omega_2, \ldots, \omega_J)$, with $\omega_j$ being the weight assigned to variant site $j$. The statistic $Q$ follows a mixture chi-square distribution[31], which means that $Q$ is equivalent in distribution to a weighted sum of independent chi-square random variables. The weights (or mixture proportions) are given by the eigenvalues of the matrix $(\sum_k \mathbf{V}_k)^{1/2} \mathbf{K} (\sum_k \mathbf{V}_k)^{1/2}$.
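
For a diagonal kernel, the quadratic-form statistic reduces to a weighted sum of squared single variant scores. The sketch below computes that statistic together with the null mean and variance of the mixture, using the trace identities for a quadratic form in a zero-mean normal vector (mean = sum of the mixture weights, variance = twice the sum of their squares). It deliberately does not compute the eigenvalues or a p-value; the helper name `skat_statistic` and the inputs are illustrative assumptions.

```python
def skat_statistic(u, v, weights):
    """Quadratic-form statistic with a diagonal kernel, plus null moments."""
    n_sites = len(u)
    # Q = u^T K u with K = diag(weights).
    q = sum(w * u_j * u_j for w, u_j in zip(weights, u))
    # trace(K V): equals the sum of the mixture weights (eigenvalues).
    mean_q = sum(weights[j] * v[j][j] for j in range(n_sites))
    # 2 * trace(K V K V): equals twice the sum of squared eigenvalues.
    var_q = 2.0 * sum(weights[a] * weights[b] * v[a][b] ** 2
                      for a in range(n_sites) for b in range(n_sites))
    return q, mean_q, var_q


# Illustrative summaries: two variants raise the trait, one lowers it.
skat_u = [3.0, 1.0, -2.0]
skat_v = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 2.0]]
q_obs, q_mean, q_var = skat_statistic(skat_u, skat_v, weights=[1.0, 1.0, 1.0])
```

Because the scores are squared, the variant with a negative score adds to the statistic instead of cancelling the positive ones, which is the property the text highlights.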

Monte-Carlo Method for Empirical Assessment of Significance

The previous sections describe how a series of gene-level test statistics can be calculated and, for each one, propose a strategy for evaluating significance using asymptotic distributions. In practice, evaluating the required numerical integrals can be challenging because variance-covariance matrices are sometimes singular or nearly singular. Note that single variant test statistics are distributed as: $\mathbf{u} \sim MVN(\mathbf{0}, \sum_k \mathbf{V}_k)$. Then, to evaluate significance empirically, one can sample random vectors from the distribution $MVN(\mathbf{0}, \sum_k \mathbf{V}_k)$ and calculate gene-level rare variant test statistics for each of these sampled random vectors, resulting in an empirical distribution for any gene-level statistic[39]. As usual, p-values can then be evaluated by comparing the test statistics for the original data with those in this empirical distribution. For computational efficiency, we use an adaptive algorithm where a larger number of vectors are sampled when assessing small p-values and fewer vectors are sampled when assessing larger p-values[20].
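
One way to picture the adaptive resampling scheme is the sketch below. It is not the authors' implementation: the stopping rule (stop once a fixed number of sampled statistics reach the observed value, or a draw limit is hit), the batch size, the add-one p-value estimate and the Cholesky-based sampler that zeroes out degenerate pivots are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import random


def cholesky_psd(v, tol=1e-12):
    """Lower-triangular factor of a positive semi-definite matrix."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = v[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s) if s > tol else 0.0
            else:
                low[i][j] = s / low[j][j] if low[j][j] > 0 else 0.0
    return low


def adaptive_mc_pvalue(stat_fn, observed, v, rng,
                       batch=1000, min_hits=50, max_draws=100000):
    """Empirical p-value: sample u ~ MVN(0, v) until enough exceedances."""
    low = cholesky_psd(v)
    n = len(v)
    hits = draws = 0
    while draws < max_draws and hits < min_hits:
        for _ in range(batch):
            z = [rng.gauss(0.0, 1.0) for _ in range(n)]
            sample = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
            if stat_fn(sample) >= observed:
                hits += 1
        draws += batch
    return (hits + 1) / (draws + 1), draws


def sum_of_squares(u):
    """Example gene-level statistic: unweighted quadratic form."""
    return sum(x * x for x in u)


mc_v = [[1.0, 0.5], [0.5, 1.0]]
rng = random.Random(2013)
p_large, draws_large = adaptive_mc_pvalue(sum_of_squares, 0.5, mc_v, rng)
p_small, draws_small = adaptive_mc_pvalue(sum_of_squares, 14.0, mc_v, rng)
```

An unremarkable statistic stops after the first batch, while a large one keeps sampling until the tail is resolved, which is the behaviour the adaptive algorithm is meant to provide.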

Conditional Analyses

It is well known that, due to linkage disequilibrium, one or more common causal variants can result in shadow association signals at other nearby common variants. For common variants, Yang et al[28] have shown that linkage disequilibrium relationships between variants, estimated from external reference panels, can be used to enable conditional analysis in meta-analysis settings. For rare variants and gene-level tests, accurately describing relationships between variants is crucial and we recommend against the use of external reference panels. Instead, in the Supplementary Notes, we describe how conditional analysis statistics can be derived for different gene-level tests in our meta-analysis setting.

Analysis of Samples of Known or Hidden Relatedness

Our methods and tools can also be used when samples within a study are related to each other. Detailed formulae for the score statistics and their covariance matrices, for the case where linear mixed models are used to account for relatedness, are given in the Supplementary Notes.

Analysis of Dichotomous Traits

Our approach extends naturally to the analysis of binary traits. Specifically, when single variant score statistics and their covariance matrices are shared, meta-analysis test statistics can be calculated in the same manner as for continuous traits. Detailed definitions of test statistics for binary traits are given in the Supplementary Notes. A limitation is that, when variant counts in a gene or analysis unit are very small, or when the numbers of cases and controls in a study are very unbalanced, the asymptotic distributions for burden statistics may not hold, and p-values obtained using our approach may not be accurate. In practice, we recommend careful review of QQ plots for meta-analysis statistics (as is standard in genome-wide association studies).

Weighted Fisher's Methods, Incorporating Unequal Sample Sizes

To accommodate the scenario where samples of different sizes are meta-analyzed, we use a modified version of Fisher's method that incorporates sample sizes as weights for each study. Specifically, our test statistic is defined by T = −2 Σ_k N_k log p_k, where p_k and N_k denote the p-value and sample size for study k. The weighted Fisher's test statistic follows a mixture chi-square distribution with mixture proportions given by N_1, N_1, N_2, N_2, …, with each study's sample size N_k appearing twice.
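A small sketch of the weighted statistic follows, assuming the form T = −2 Σ_k N_k log p_k. Under the null each −2 log p_k is chi-square with 2 degrees of freedom, so each weighted term is exponential with rate 1 / (2 N_k), and when all sample sizes differ the tail probability of T has a closed form as a sum of exponentials. The function names are hypothetical, and the closed form is used here for illustration only; it becomes numerically fragile when sample sizes are close to each other, in which case sampling or a general mixture chi-square routine would be preferable.

```python
import math
from typing import Sequence


def weighted_fisher_statistic(p_values: Sequence[float], sizes: Sequence[float]) -> float:
    """T = -2 * sum_k N_k * log(p_k)."""
    if len(p_values) != len(sizes):
        raise ValueError("one sample size is needed per p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    return -2.0 * sum(n * math.log(p) for p, n in zip(p_values, sizes))


def weighted_fisher_p_value(p_values: Sequence[float], sizes: Sequence[float]) -> float:
    """Null tail probability of T, valid for pairwise distinct sample sizes."""
    if len(set(sizes)) != len(sizes):
        raise ValueError("this closed form needs pairwise distinct sample sizes")
    t = weighted_fisher_statistic(p_values, sizes)
    rates = [1.0 / (2.0 * n) for n in sizes]
    tail = 0.0
    for k, rate_k in enumerate(rates):
        coef = 1.0
        for j, rate_j in enumerate(rates):
            if j != k:
                coef *= rate_j / (rate_j - rate_k)
        tail += coef * math.exp(-rate_k * t)
    return tail


# Toy example: the larger study contributes more to the combined evidence.
p_combined = weighted_fisher_p_value([0.05, 0.05], [1.0, 2.0])
```

With a single study the weighting cancels and the function returns the input p-value, which is a convenient sanity check.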

Simulation of Population Genetic Data

We simulated haplotypes using a coalescent model and the program ms[16]. We chose a demographic model consistent with European demographic history[4], including an ancestral bottleneck followed by more recent population differentiation and exponential growth. Model parameters were based upon estimates from large scale sequencing studies[40], as detailed in Supplementary Notes.

Meta-Analysis of Lipid Traits

Summary statistics were calculated for each participating study and shared to enable a central meta-analysis. In single variant and gene-based rare variant association analyses, age, age², sex and cohort-specific covariates, such as principal components of ancestry, were included in the analysis. Trait residuals were standardized using inverse normal transformation. More detailed descriptions for each participating cohort are given in the Supplementary Notes. This research was approved by the Institutional Review Board of the University of Michigan and the Broad Institute. Informed consent was obtained from all study subjects. In addition, all participating studies received approvals from local ethics committees.
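The inverse normal transformation of trait residuals mentioned above can be sketched as a rank-based mapping to standard normal quantiles. The exact variant used by each cohort is not stated in this section, so the rank offset below (0.5 by default) and the handling of ties by average rank are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def inverse_normal_transform(values: Sequence[float], offset: float = 0.5) -> List[float]:
    """Replace each value by the standard-normal quantile of its average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0  # 1-based average rank for ties
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    std_normal = NormalDist()
    denom = n - 2.0 * offset + 1.0
    return [std_normal.inv_cdf((r - offset) / denom) for r in ranks]


# Toy example: three residuals map to symmetric quantiles around zero.
transformed = inverse_normal_transform([3.0, 1.0, 2.0])
```

Setting `offset=0.375` gives the Blom variant; either choice preserves the ordering of the residuals and yields an approximately standard normal trait.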

36 in total

1. Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors: Richard R Hudson
Journal: Bioinformatics Date: 2002-02 Impact factor: 6.937

2. Exome sequencing and the genetic basis of complex traits.

Authors: Adam Kiezun; Kiran Garimella; Ron Do; Nathan O Stitziel; Benjamin M Neale; Paul J McLaren; Namrata Gupta; Pamela Sklar; Patrick F Sullivan; Jennifer L Moran; Christina M Hultman; Paul Lichtenstein; Patrik Magnusson; Thomas Lehner; Yin Yao Shugart; Alkes L Price; Paul I W de Bakker; Shaun M Purcell; Shamil R Sunyaev
Journal: Nat Genet Date: 2012-05-29 Impact factor: 38.330

3. Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms.

Authors: Alison M Adams; Richard R Hudson
Journal: Genetics Date: 2004-11 Impact factor: 4.562

4. An efficient resampling method for assessing genome-wide statistical significance in mapping quantitative trait Loci.

Authors: Fei Zou; Jason P Fine; Jianhua Hu; D Y Lin
Journal: Genetics Date: 2004-12 Impact factor: 4.562

5. Deep resequencing reveals excess rare recent variants consistent with explosive population growth.

Authors: Alex Coventry; Lara M Bull-Otterson; Xiaoming Liu; Andrew G Clark; Taylor J Maxwell; Jacy Crosby; James E Hixson; Thomas J Rea; Donna M Muzny; Lora R Lewis; David A Wheeler; Aniko Sabo; Christine Lusk; Kenneth G Weiss; Humeira Akbar; Andrew Cree; Alicia C Hawes; Irene Newsham; Robin T Varghese; Donna Villasana; Shannon Gross; Vandita Joshi; Jireh Santibanez; Margaret Morgan; Kyle Chang; Walker Hale Iv; Alan R Templeton; Eric Boerwinkle; Richard Gibbs; Charles F Sing
Journal: Nat Commun Date: 2010-11-30 Impact factor: 14.919

6. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits.

Authors: Jian Yang; Teresa Ferreira; Andrew P Morris; Sarah E Medland; Pamela A F Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael N Weedon; Ruth J Loos; Timothy M Frayling; Mark I McCarthy; Joel N Hirschhorn; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2012-03-18 Impact factor: 38.330

7. Genome-wide efficient mixed-model analysis for association studies.

Authors: Xiang Zhou; Matthew Stephens
Journal: Nat Genet Date: 2012-06-17 Impact factor: 38.330

8. Exome sequencing-driven discovery of coding polymorphisms associated with common metabolic phenotypes.

Authors: A Albrechtsen; N Grarup; Y Li; T Sparsø; G Tian; H Cao; T Jiang; S Y Kim; T Korneliussen; Q Li; C Nie; R Wu; L Skotte; A P Morris; C Ladenvall; S Cauchi; A Stančáková; G Andersen; A Astrup; K Banasik; A J Bennett; L Bolund; G Charpentier; Y Chen; J M Dekker; A S F Doney; M Dorkhan; T Forsen; T M Frayling; C J Groves; Y Gui; G Hallmans; A T Hattersley; K He; G A Hitman; J Holmkvist; S Huang; H Jiang; X Jin; J M Justesen; K Kristiansen; J Kuusisto; M Lajer; O Lantieri; W Li; H Liang; Q Liao; X Liu; T Ma; X Ma; M P Manijak; M Marre; J Mokrosiński; A D Morris; B Mu; A A Nielsen; G Nijpels; P Nilsson; C N A Palmer; N W Rayner; F Renström; R Ribel-Madsen; N Robertson; O Rolandsson; P Rossing; T W Schwartz; P E Slagboom; M Sterner; M Tang; L Tarnow; T Tuomi; E van't Riet; N van Leeuwen; T V Varga; M A Vestmar; M Walker; B Wang; Y Wang; H Wu; F Xi; L Yengo; C Yu; X Zhang; J Zhang; Q Zhang; W Zhang; H Zheng; Y Zhou; D Altshuler; L M 't Hart; P W Franks; B Balkau; P Froguel; M I McCarthy; M Laakso; L Groop; C Christensen; I Brandslund; T Lauritzen; D R Witte; A Linneberg; T Jørgensen; T Hansen; J Wang; R Nielsen; O Pedersen
Journal: Diabetologia Date: 2012-11-19 Impact factor: 10.122

9. MASS: meta-analysis of score statistics for sequencing studies.

Authors: Zheng-Zheng Tang; Dan-Yu Lin
Journal: Bioinformatics Date: 2013-05-21 Impact factor: 6.937

10. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917
