Literature DB >> 27380176

An Optimal Bahadur-Efficient Method in Detection of Sparse Signals with Applications to Pathway Analysis in Sequencing Association Studies.

Hongying Dai^1,2, Guodong Wu³, Michael Wu⁴, Degui Zhi⁵.

Abstract

Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency,[Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε →0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2016 PMID： 27380176 PMCID： PMC4933358 DOI： 10.1371/journal.pone.0152667

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Next-generation sequencing (NGS) technology has opened a new era for studying genetic associations with complex diseases. Yet, although whole genome searching has become easier and less costly to perform, our ability to critically evaluate such high throughput data has not improved substantially. Sequencing data often contain millions of genetic variants. However, testing millions of markers using the "single marker—single trait" analysis often loses power after the multiple-testing adjustment. Genome-wide significance requires strict Bonferroni correction with p-value < 2.5×10−6 for a total of 20,000 gene-based statistical tests. To maintain statistical power of detecting rare variants, a theoretical sample size of n>10,000 may be required for sequencing data [1]. These dimensional challenges motivate us to aggregate effects from multiple genes using pathway analysis. Genetic pathways comprise molecular entities that interact with each other to regulate specific cell functions, metabolic processes, biosynthesis, and embryonic developments. For non-Mendelian diseases and complex traits, multiple genetic risk factors may function together in the pathway. As a result, signals may not be significant in the "single marker-single trait" analysis, but many such values from related genes might provide valuable information regarding gene function and regulation. The pathway information can be extracted from bioinformatic resources, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [2], the PANTHER classification system for protein sequence data [3], and Reactome database for human pathway data [4]. We propose a two-stage combined p-value method for pathway (gene set) analysis of NGS data. The first stage is at the gene level, where we integrate effects from rare and common variants within a gene. The goal of the first stage analysis is to generate a p-value that summarizes an overall effect within a gene. The second stage is at the pathway level, where we aggregate p-values among all genes in a pathway. An exome sequencing simulation study was conducted to compare the SKAT-Lancaster procedure and Gene Set Enrichment Analysis (GSEA) [5]. We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium.

Methods

Two-Stage Pathway Analysis for Sequencing Data

There is a different nature of effects between gene and pathway. At the gene level, we are interested in identifying genetic variants from high throughput data. At the pathway level, genes with similar functions work together to fulfill biological tasks. Thus, we are interested in detecting effects among genes. The proposed “SKAT-Lancaster” procedure provides a two-stage framework in order to (1) reduce the dimension of genetic variants, (2) combine effects from multiple genes, and (3) take genetic correlation architecture into account.

Stage I—Gene Level Testing

In the first stage, we suggest integrating effects from rare variants within the i gene using the Sequence Kernel Association Test (SKAT) [6]. Several tests have been proposed to analyze rare variants at the gene level, including burden tests and the C-alpha test. We choose SKAT because SKAT has been proven to be a locally most powerful score test [7]. Let G be the j variant of the i gene. Let β = (β, ⋯, β, ⋯) be the effects from markers in the i gene. Generate a p-value, P for the i hypothesis testing vs. in the i gene. A → is added to denote the zero vector. SKAT is a locally most powerful score test on the variance component of a regression model , where Y is a phenotype, α is a vector of fixed effects from covariates X, and ε is an error term. To increase the power, SKAT tests H0: β = 0 by treating β as a random variable with mean zero and variance wτ, where τ is a common variance component and w is a pre-specified weight for variant G. As a result, is equivalent to H0: τ = 0. The variance component score statistic is , where is the predicted mean of Y under H0, and W = diag(w, ⋯) are the weights of the variants. Under the null hypothesis, Q follows a mixture of chi-square distributions [6]. Common variants, population stratification, and other covariates can also be included as fixed effects in the model. The goal of the first stage analysis is to generate a p-value that summarizes the overall effect for each gene.

Stage II—Pathway Level P-value Combination

The second stage is at the pathway level, where we perform the modified Lancaster procedure to combine effects from multiple genes within a pathway. We choose the Lancaster procedure because it is optimal in Bahadur efficiency among all weighted combined p-value methods. The original Lancaster procedure is based on the independent p-value assumption. However, genetic data are highly correlated and ignoring the correlation structure will severely inflate the Type I error rate. Thus we need a modification of the Lancaster procedure to take the complex correlation structure among genetic variants into account [8]. Consider m sequences of test statistics, and the corresponding significance levels,, where n is the sample size for the i test statistic. Let the Lancaster statistic , where is the inverse cumulative distribution function (CDF) of with w > 0 for i = 1, 2,⋯, m. When p-values are correlated, does not follow . The null distribution of does not have an explicit analytical form. To address this issue, we approximate the statistic with a scaled chi-square distribution. Let where c > 0 is a scalar and v > 0 is the degrees of freedom for the approximate chi-square distribution. Under H0: θ ∈ Θ0, we have and where takes the correlation among p-values into account. We use the Satterthwaite method to match the mean and variance of and , and solve the equations to derive c and v. Thus we have , where and . As genetic variants have very complex correlation architecture, there is no analytical form for the exact correlated Lancaster procedure. The Satterthwaite approximation is an effective approach to summarize the distribution of the exact correlated Lancaster procedure. Q-Q plots from simulated data suggest a good match between the approximated and exact , with a very slight deviation in the tail part. By introducing the correlation structure, the Satterthwaite approximation can significantly reduce the Type I error among correlated p-values.

Self Contained vs. Competitive Lancaster Procedure

The main difference between competitive and self-contained tests lies in the formulation of the null hypothesis [9]. Let μ stand for the effect size from the i pathway. The null hypothesis for the self-contained test of the i pathway is H0, self-contained: μ = 0. Thus, the correlated Lancaster procedure can be considered as a self-contained test. The null hypothesis in the competitive test is H0, competitive: μ1 = μ2 = ⋯ = μ = ⋯ The competitive Lancaster test can be carried out using permutation testing: Step 1: Let P be the p-value from the Lancaster procedure in the i real pathway. Step 2: Create L, say 100000, permutated pathways by shuffling genes among pathways. The permutated pathway sizes should resemble the real pathway sizes. Let P be the p-value from the Lancaster procedure in the l permutated pathway for l = 1, 2,⋯, L. Step 3: The p-value of the competitive Lancaster procedure is for the i real pathway, where I{.} is an indicator function.

Meta-analysis in Sequencing Association Studies

Due to cost, the rarity of diseases involved, and high dimensionality of variants, sequencing association studies are often underpowered to detect modest genetic effects. Meta-analysis can be used to address this issue by analyzing data across studies. Meta-analysis uses study-specific summary statistics, allowing investigators to combine information across studies when individual-level data cannot be shared. The Lancaster procedure is independent from the SKAT test. One can directly apply the Lancaster procedure to meta-analysis, as we demonstrated in our analysis of the Global Lipids Genetics Consortium data. In this work, we choose SKAT to pair with the Lancaster procedure in order to detect rare variants in exome sequencing data. For other types of sequencing data, we suggest replacing SKAT with other statistical tests, such as FaST-LMM [10] or GEMMA [11], at the gene level and then applying the Lancaster procedure to combine multiple effects at the pathway level.

Lancaster Procedure Is Optimal in Bahadur Efficiency

Several weighted combined p-value methods have been developed. See [12] for a comprehensive review. Since high throughput sequencing data pose a severe challenge in retaining the statistical power for small sample sizes in detection of sparse signals, it is critical to theoretically assess the efficiency among the weighted combined p-value methods. Let P, (i = 1, 2,⋯, m) be p-values from m hypothesis tests. Littell and Folks [13, 14] showed that Fisher's method of combining independent tests () is asymptotically Bahadur efficient. However, Fisher's method does not allow a weight function when combining p-values. The weight function can be used to integrate multiple-source omics data from varying sequencing platforms. For instance, one can apply weight functions to integrate microarray data and CHIP-TIE data to identify the protein involved in transcription. In this case, weight functions can be considered as prior information to ensure the binding calling is a real signal instead of an artifact. As [15] pointed out, carefully chosen weights can generally improve power for a combination of p-values. There is no uniformly most powerful method of combining p-values. The Bahadur efficiency is an important way to compare sample sizes required by two statistics in detection of sparse signals (ε → 0).

The Notation of Bahadur Relative Efficiency

Consider a hypothesis test for H0: θ ∈ Θ0 vs. H: θ ∈ Θ − Θ0. Bahadur efficiency offers an asymptotic relative comparison between two competing test statistics. Under H, a test statistic whose significance level converges to zero at a faster rate is considered more Bahadur efficient. Let T be a real valued test statistic depending on an independent sample, x1, x2,⋯, x for n = 1, 2, ⋯ Assume for all θ ∈ Θ0, T follows the same null CDF F0. Let t be the value attained by T, then the significance level of T is P = 1 − F0(t). Suppose that − 2ln P / n converges to c(θ) with probability 1, i.e., for some c(θ) > 0 under H: θ ∈ Θ−Θ0. The value c(θ) is dependent on θ under the alternative hypothesis and c(θ) is called the Bahadur efficiency slope of T when n → ∞. Consider two competing sequences of test statistics, and , with the Bahadur efficiency slopes c1(θ) and c2(θ), respectively. The ratio is the Bahadur efficiency of relative to . Let N( be the minimal sample size satisfying for the i test. Bahadur [16] shows that with probability 1 under H: θ ∈ Θ − Θ0, which indicates that the Bahadur efficiency ratio ϕ12(θ) gives the limiting ratio of sample sizes required by the two statistics to attain an equally small significance level. As a result, is deemed superior to, i.e. more Bahadur efficient than, if ϕ12(θ) ≥ 1 under H: θ ∈ Θ − Θ0.

Bahadur Efficiency for Lancaster Procedure, Weighted Z-test, and Good's test

Consider m sequences of test statistics, and the corresponding significance levels,, where n is the sample size for the i test statistic. Assume that for each i = 1, 2,⋯, m, the sequence has a Bahadur efficiency slope c(θ). That is, for some c(θ) > 0 under H: θ ∈ Θ − Θ0. Assume also that the sample sizes n1, ⋯, n have an average sample size n = (n1 +⋯+n) / m and for i = 1, 2,⋯, m. Then we have . For each n, it is desired to combine the m statistics into an overall test statistic T for testing H0: θ ∈ Θ0 vs. H: θ ∈ Θ − Θ0. We first derive the Bahadur efficiency for the Lancaster test. Let f, F and be the PDF, CDF, and inverse CDF for , with w > 0 for i = 1, 2,⋯, m. [Theorem 1] Assume m independent test statistics have significance levels respectively. The Lancaster statistic, , has the Bahadur efficiency slope under H: θ ∈ Θ − Θ0. [13] derived the Bahadur efficiency slope for the regular (unweighted) Z-test, . Here we generalize their findings to the weighted Z-test. [Theorem 2] Assume m independent test statistics have significance levels respectively. The weighted Z-test, , has the Bahadur efficiency slope under H: θ ∈ Θ − Θ0. When w = 1 for all i = 1, 2,⋯, m, the weighted Z-test is reduced to the regular z statistic and the Bahadur efficiency slope is . This finding is in agreement with the Bahadur efficiency finding for the regular Z-test in [13]. The Lancaster test statistic is superior to the weighted Z-test and regular Z-test in terms of Bahadur efficiency. Using the induction method, we show that the Bahadur relative efficiency ϕ12 = c (θ)/ c (θ) ≥ 1 and ϕ12 = c (θ)/ c (θ) ≥ 1 for all θ ∈ Θ − Θ0. The fact that indicates that the Lancaster procedure will require smaller sample sizes as compared to the weighted Z-test to achieve the same significance level. [17] suggested a weighted Fisher’s method. Let , so . When the weights are unequal, the null CDF of Q is given by , where . Below we will derive the Bahadur efficiency for Good’s test. [Theorem 3] Assume m independent test statistics have significance levels respectively. Good’s test statistic, , has the Bahadur efficiency slope under H: θ ∈ Θ − Θ0. The maximal weight, , has a strong impact on the Bahadur efficiency in Good’s test. Only the individual test(s) assigned with the maximal weight reserves its Bahadur efficiency in Good's test. That is, if , then . Other individual tests will relatively lose more Bahadur efficiency as the maximal weight gets larger. That is, if , then . The Lancaster procedure is superior to Good’s test in terms of the Bahadur efficiency, i.e. ϕ12 = c (θ)/ c (θ) ≥ 1 for all θ ∈ Θ − Θ0 and . For large-scale tests, which often occur in next-generation sequencing data, the Lancaster procedure will require relatively smaller sample sizes as compared to Good’s test, i.e., N ≤ N when the significance level goes to 0, which represents sparse signaling in high throughput data.

Lancaster Procedure Has the Optimal Bahadur Efficiency

We can further prove that the Lancaster procedure reaches the upper bounds of Bahadur efficiency among all non-decreasing T. Thus the Lancaster procedure has the optimal Bahadur efficiency compared to all other combination methods under mild conditions. [Proposition 1] Let T be any function of m independent test statistics . Let c (θ) > 0 be the Bahadur efficiency slope of T. Assume T is non-decreasing in a way that . Then the Lancaster statistics have the optimal Bahadur efficiency, with c (θ) ≥ c (θ) for all θ ∈ Θ − Θ0. The Lancaster procedure and Fisher’s test both have the optimal Bahadur efficiency among all non-decreasing combined tests. Since the Lancaster procedure can incorporate weight functions for auxiliary information in modeling and testing, the Lancaster procedure has more flexibility and it can be considered as the optimal generalized Fisher’s method. The non-decreasing condition, , is easy to meet in practice.

Comparing Bahadur Efficiency for Correlated Data

It is critical to assess Bahadur efficiency for correlated data as it will shed light on the impact of correlation structures on the asymptotic convergence rate of significance levels and will further impact the sample sizes required for the experiments. This is a challenging topic since the distributions of combined test statistics under complex correlation structures have no closed analytical forms. To address this issue, we give an approximate Bahadur efficiency using the techniques described in the Methods Section. Below are some interesting results. [Proposition 2] When m statistics are correlated, under H: θ ∈ Θ − Θ0: the Lancaster statistic, has an approximate Bahadur efficiency slope where the Good’s test statistic has an approximate Bahadur efficiency slope where the Fisher’s statistic has an approximate Bahadur efficiency slope

Simulation Study

We conducted an extensive simulation study to assess the type I error and power of the SKAT-Lancaster procedure. We further compared our proposed method to Gene Set Enrichment Analysis (GSEA) [5]. The empirical assessment was based on rigorous simulation algorithms for sequencing-based genome-wide association studies [18]. The simulation was conducted using the whole exome sequencing genotype data from the 1000 Genomes Project Phase 1 study (n = 822 individuals). After filtration, 40,918 biallellic protein-changing coding variants in selected pathways were mapped to KEGG and Biocarta pathways. To avoid testing over- or under- sized pathways, we selected pathways containing 10 to 100 genes. This yielded 353 pathways with 3304 genes for our simulation study. We applied a genome-wide additive model to evaluate pathway-testing methods using realistic genetic architectures. Let Y = Xβ + ε, where Y is a continuous trait, the vector X is the whole exome sequencing genotype for the i subject and is random noise. The vector β contains genetic effect regression coefficients corresponding to genotype variants. In simulation, the j variant is causal if |β| > 0; pathways and genes are causal if they harbor causal variants. We adopted a stochastic hierarchical effect model β = C × Cd × Cd × e to distribute the total genetic variance into pathways, genes, and individual variants [18]. Within a central causal pathway, we first randomly selected 50% of the genes to be associated with the trait. Then we randomly selected 70% of the variants in causal genes to be associated with the trait. We randomly assigned 80% (20%) of causal genes to be detrimental (protective). For variants within causal genes, 80% were detrimental and 20% were protective. We set the whole-genome heritability h2 = Var(Xβ)/(Var(Xβ) + σ2) = 20%. This resembles heritability in real data, which often ranges between 20% and 30%. We used Bonferroni correction to control Family-Wise Error Rate (FWER) and set the genome-wide significance level at α = 0.05/353 = 1.4164E-4 for testing 353 pathways. We performed principal component analysis and included the top 3 principal components as covariates in regression analyses to adjust for the population stratification. In the SKAT-Lancaster procedure, we first performed SKAT to test overall effects on the gene level. Then we considered 4 weight functions in the Lancaster procedure when combining p-values among genes in a pathway: , where n is the number of SNPs in the i gene and is the median gene size. This weight can remove bias when testing overly small or overly generalized pathways. AIC weight, BIC weight: these weight functions calculate the degrees of variations summarized by the gene level multi-SNP regression. Uniform weight = 2. We considered 3 simulation scenarios (Table 1). In Scenario 1, we assessed the global null hypothesis type I error by setting all genetic effect coefficients as zero, i.e. . Any pathways or genes reaching the significance level were considered as false positives. The results in Table 2 indicate that the SKAT-Lancaster procedure has well-controlled type I error rates (~10E-4). We further investigated the Q-Q plot by comparing observed p-values versus expected p-values (Fig 1). The type I error inflation factor (λ) is the ratio between the area under the curve and the area under the diagonal reference line. Fig 1 indicates that SKAT-Lancaster procedure with 4 weight functions has no inflation of the global null hypothesis type I error rate (λ < 1).

Table 1

Simulation Scenarios and parameters*.

Simulation Scenarios 1

Include 353 pathways, 3304 genes.

Phenotype is normally distributed.

Assume heritability is 20%.

β=0→.

No pathways, genes or variations are associated with the trait.

Significance level is 0.05/353. All significant results are considered as type 1 errors.

Simulation Scenario 2

Include 353 pathways and 3304 genes.

Phenotype is normally distributed.

Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes.

Randomly assigned 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective.

Associated variants' effect size ~ log₁₀(MAF).

Significance level is 0.05/353.

Simulation Scenarios 3

Include 353 pathways and 3304 genes.

Phenotype is normally distributed.

Assume heritability is 20%.

Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes.

Randomly assigned 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective.

Associated variants' effect size ~1/MAF*(1−MAF).

Significance level is 0.05/353.

* Covariates: top 3 principal components for population stratification are included as covariates in all three simulation scenarios.

Table 2

Comparison of type I error and power among competing methods.

Simulation Scenario 1
		Type_1 error	Inflation factor
Test	Weight function	(10E-4)	λ
SKAT- Lancaster	Uniform	1.1615	0.9921
SKAT- Lancaster	Gene size	1.3598	0.9852
SKAT- Lancaster	AIC	0.9632	0.9477
SKAT- Lancaster	BIC	1.1331	0.9770
GSEA		12.0000	1.2390
Simulation Scenario 2
Test	Weight function	Stringent Power	Lenient Power
SKAT- Lancaster	Uniform	0.870	0.884
SKAT- Lancaster	Gene size	0.810	0.836
SKAT- Lancaster	AIC	0.832	0.854
SKAT- Lancaster	BIC	0.809	0.826
GSEA		0.279	0.373
Simulation Scenario 3
Test	Weight function	Stringent Power	Lenient Power
SKAT- Lancaster	Uniform	0.610	0.645
SKAT- Lancaster	Gene size	0.509	0.543
SKAT- Lancaster	AIC	0.585	0.628
SKAT- Lancaster	BIC	0.540	0.558
GSEA		0.468	0.505

Fig 1

Q-Q plots investigating global null hypothesis type-I errors for the SKAT-Lancaster procedure under Simulation Scenario 1 (λ is the inflation factor for the Type I error rate).

The type I error inflation factor (λ) is the ratio between the area under the curve and the area under the diagonal reference line.

Include 353 pathways, 3304 genes. Phenotype is normally distributed. Assume heritability is 20%. . No pathways, genes or variations are associated with the trait. Significance level is 0.05/353. All significant results are considered as type 1 errors. Include 353 pathways and 3304 genes. Phenotype is normally distributed. Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes. Randomly assigned 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective. Associated variants' effect size ~ log10(MAF). Significance level is 0.05/353. Include 353 pathways and 3304 genes. Phenotype is normally distributed. Assume heritability is 20%. Randomly assign one central causal pathway. Within the central causal pathway, randomly assign 50% causal genes. Randomly assign 70% causal variants in associated genes. Randomly assigned 80% (20%) of causal genes to be detrimental (protective). For variants within the causal genes, 80% are detrimental and 20% are protective. Associated variants' effect size ~. Significance level is 0.05/353. * Covariates: top 3 principal components for population stratification are included as covariates in all three simulation scenarios.

Q-Q plots investigating global null hypothesis type-I errors for the SKAT-Lancaster procedure under Simulation Scenario 1 (λ is the inflation factor for the Type I error rate).

The type I error inflation factor (λ) is the ratio between the area under the curve and the area under the diagonal reference line. In Scenarios 2 and 3, we assessed the stringent power and lenient power when randomly generating one central causal pathway in each simulation (Table 1). The stringent power calculates the percentage of times the central causal pathway is found significant. Due to the correlation among pathways, pathways that share causal genes with the central causal pathway are overlapping causal pathways. The lenient power calculates the percentage of times (central and overlapping) causal pathways are found significant. The results in Table 2 indicate that the SKAT-Lancaster procedure outperformed GSEA. In Scenario 2, the SKAT-Lancaster procedure with 4 weight functions had lenient power ranging between 0.826 and 0.884, while GSEA had lenient power of 0.373. In Scenario 3, the SKAT-Lancaster procedure with 4 weight functions had lenient power ranging between 0.543 and 0.645, while GSEA had lenient power of 0.505. We randomly assign the causal variances in Scenarios 2 and 3, the SKAT-Lancaster procedure with uniform weight had the best detection power. Regarding the computing time, the self-contained Lancaster procedure compares a test statistic to an asymptotic distribution, thus it does not require intensive computation. The competitive Lancaster procedure is based on permutation and it has similar computation efficiency as compared with GSEA.

Case Study: Lipid Meta-Analysis

We illustrate our method using meta-analysis data generated by the Global Lipids Genetics Consortium. To identify new loci and validate existing loci associated with lipids, [19] we analyzed the levels of low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol, triglycerides (TG) and total cholesterol (TC) of 196,475 individuals from 60 studies. A total of 1,048,161 Single Nucleotide Polymorphisms (SNPs) were genotyped using the genome-wide association study (GWAS) arrays and Metabochip arrays. These variants were selected from promising loci associated with lipid and coronary artery disease, based on findings from previous GWAS studies and the 1000 Genome Project. Subjects taking lipid-lowering medications were excluded in the meta-analysis. The additive effect of each SNP on blood lipid levels after adjusting for age and sex was analyzed and p-value was generated for each SNP and each lipid variable. Genomic control values for the initial meta-analyses were 1.10–1.15, indicating that population stratification had only a minor impact on the results [20]. The SKAT-Lancaster procedure can only be applied to original data. Remarkably, as the Lancaster procedure is independent from the SKAT test, it can be applied to secondary data analysis. To identify pathways that are more significant than others, we performed the competitive Lancaster procedure. In the competitive test, we performed 100,000 times of permutations and ensured that the permutated pathways preserved the size and characteristics of original pathways. Our simulation study showed that the competitive Lancaster procedure had well-controlled type I error rates to prevent false discoveries. Before comparing the proposed method to Fisher’s method [21] and weighted Z-test [22], we considered 4 weight functions for the Lancaster procedure: , where n is the number of SNPs in the i gene and is the median gene size. This is a weight adjusted by gene size to remove the bias from large genes. , where MAF stands for minor allele frequency. Common variants receive higher weights. . Rare variants receive higher weights. Uniform weight: w4 = 2. Pathway analysis was performed using the gene ontology (GO) gene sets from http://www.broadinstitute.org/gsea/index.jsp. A total of 1454 pathways were analyzed and multiple testing was adjusted by False Discovery Rate (FDR) [23]. The numbers of significant pathways are summarized in Fig 2 As shown in Table 3, the Lancaster procedure outperformed Fisher's method and weighted Z-test by identifying more significant pathways. When the Lancaster procedure was assigned with uniform weights (w4), it performed equivalently to Fisher's method. The weighted Z-test is not optimal in Bahadur efficiency, so it identified fewer pathways than the Lancaster procedure and Fisher's method. Weight functions w1 and w2 outperformed w3 and w4, indicating that removing gene size bias and assigning higher weights to common variants can improve power of the Lancaster procedure.

Fig 2

Venn Diagrams for Significant Pathways (FDR < 0.05).

Table 3

Number of Significant pathways.

FDR < 0.05	HDL	LDL	TC	TG
Lancaster (w1)	117	79	150	93
Lancaster (w2)	91	55	129	72
Lancaster (w3)	0	0	0	0
Lancaster (w4)	77	44	115	60
Fisher	77	44	115	60
Weighted Z-test (w1)	2	1	3	2
Weighted Z-test (w2)	5	0	5	4
Weighted Z-test (w3)	0	0	0	0
Weighted Z-test (w4)	4	0	4	6

We compared our pathway findings with findings from the MAGENTA analysis in [19] (Table 4). The Lancaster procedure (w1) showed that the "enzyme binding" pathway is significantly associated with HDL (FDR<10−5), which agrees with the finding from [19] (FDR = 0.038). The "enzyme binding" pathway contained 178 genes interacting selectively and non-covalently with any enzyme. The Lancaster procedure (w1, w2, w4) showed that the "lipid transport" pathway is significantly associated with LDL (FDR adjusted p-value <10−5), which agrees with the finding from [19] (FDR = 0.0016). The "lipid transport" pathway contains 28 genes involving directed movement of lipids into, out of, within, or between cells. Lipids are compounds soluble in an organic solvent but not, or sparingly, in an aqueous solvent. The Lancaster procedure (w1, w2, w4)found that the "lipoprotein metabolic process" pathway is significantly associated with LDL (FDR adjusted p-value <10−5), which agrees with the finding from [19] (FDR = 0.00017). The "lipoprotein metabolic process" pathway contains 33 genes involving the chemical reactions. The pathway also involves any conjugated, water-soluble protein in which the non-protein moiety consists of a lipid or lipids.

Table 4

Comparison of pathway analysis p-values.

Pathway Name	enzyme binding	lipid transport	lipoprotein metabolic process
GO Accession	GO:0019899	GO:0006869	GO:0042157
Gene Ontology	molecular function	biological process	biological process
Description	Interacting selectively with any enzyme	The directed movement of lipids into, out of, within or between cells. Lipids are compounds soluble in an organic solvent but not, or sparingly, in an aqueous solvent.	The chemical reactions and pathways involving any conjugated, water-soluble protein in which the nonprotein moiety consists of a lipid or lipids.
number of genes	178	28	33
number of SNPs	12089	1058	769
(Willer 2013)*	0.038	0.0016	0.00017
Lancaster (w1)*	<10⁻⁵	<10⁻⁵	<10⁻⁵
Lancaster (w2) *	0.80	<10⁻⁵	<10⁻⁵
Lancaster (w3)*	0.98	0.44	0.44
Lancaster (w4)*	0.82	<10⁻⁵	<10⁻⁵
Fisher*	0.82	<10⁻⁵	<10⁻⁵
Weighted z (w1)*	0.90	0.36	0.58
Weighted z (w2)*	0.94	0.21	0.35
Weighted z (w3)*	0.99	0.65	0.55
Weighted z (w4)*	0.94	0.23	0.36

* FDR adjusted p-values

Discussion and Conclusions

The proposed two-stage approach is a powerful tool to integrate information in pathway analysis of sequencing association studies. The first stage is the gene-based testing, where effects from rare variants within a gene are summarized into one p-value using the SKAT test. In the second stage, p-values from multiple genes are combined for pathway analysis and meta-analysis using the correlated Lancaster procedure. In this work, we prove that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. We assess the Bahadur efficiency among weighted combined p-value methods and further prove that the Lancaster procedure is optimal in Bahadur efficiency under very mild conditions. There has been a lack of theatrical comparison among combined p-value methods. Several simulation studies have compared weighted combined p-value methods [15, 22, 24]. With more than 400 citations in the literature, these studies have been a subject of intense interest to the research community heated discussions in the research community, but yield controversial results in different simulation scenarios. Thus, we fill the gap by comparing the Bahadur efficiency among methods. The Bahadur efficiency is a critical measure of performance of statistical testing [25] [26]. In [25], Bahadur efficiency has been applied for sensitivity analyses in observation studies. The Bahadur efficiency, , compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε → 0. As the number of genetic variants scanned by the sequencing technology increases from thousands to millions, signals that are associated with phenotypes become sparse, requiring a more stringent statistical significance level to detect sparse signals, i.e. (). The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals. Among combined p-value methods, the Lancaster procedure can be considered as the generalized Fisher's method with a weight function. Weight functions, when used appropriately, can generally increase the power of combined p-value methods [27-29]. Evaluating Bahadur efficiency for high-throughput genetic data is critical since there is no combined p-value method that that is uniformly the most powerful. Bahadur efficiency calculates the limiting ratio of sample sizes required by two statistics to attain an equally small significance level. The optimal Bahadur efficiency indicates that the Lancaster procedure asymptotically requires a minimal sample size to attain the significance level.

Data and Software

R package ‘CombinePValue’ has been created for the proposed Lancaster procedure. Case study data are available from http://csg.sph.umich.edu//abecasis/public/lipids2013/. Source codes for simulation analyses can be provided upon contacting Dr Guodong Wu.

Appendix

Lemma 1 [16, 30] is needed to derive the Bahadur efficiency. [Lemma 1] If the following two conditions are met, (Condition 1) there exists a function b(θ), 0 < b(θ) < ∞, such that with probability 1 under H: θ ∈ Θ − Θ0; (Condition 2) there exists a function f(t), 0 < f(t) < ∞, which is continuous in some open set containing the range of b(θ) such that for each t in the open set , then the Bahadur efficiency slope of {T} is c(θ) = 2f (b(θ)). Proof of Theorem 1: Since the equivalent tests have the same Bahadur efficiency, we can consider , where . Under H: θ ∈ Θ0, and . According to Theorem 2.1 by [31], we have as . So with probability 1 under H: θ ∈ Θ − Θ0. It follows that Now, for θ ∈ Θ0, is distributed as the square root of with the CDF F and PDF f. According to Theorem 2.1 by [31], we have, Plug the results from Eqs (1) and (2) to Lemma 1. Then the Bahahur efficiency slope for the Lancaster statistic is under H: θ ∈ Θ − Θ0. Proof of Theorem 2: Rewrite the weighted z statistic as , where . Under H0: θ ∈ Θ0, and . According to Theorem 2.1 by [31], we have as . So with probability 1 under H: θ ∈ Θ − Θ0. It follows that Now, for θ ∈ Θ0, . Thus, Eqs (3) and (4) and Lemma 1 imply that the Bahadur efficiency slope of is Proof of Theorem 3: It is equivalent to consider . Then with probability 1 under H: θ ∈ Θ − Θ0. Direction calculation shows that the PDF of is under H0. Let I be the index corresponds to . One can construct the upper and lower bounds of f(x), when x is greater than a certain finite number. This implies that By Lemma 1, the Bahadur efficiency slope for Good’s test is for θ ∈ Θ − Θ0. Proof of Proposition 1: Let P be the significance level of T and let t(1), ⋯, t( be the observed values of . For any non-decreasing T, we have Therefore, for all θ ∈ Θ − Θ0. Proof of Proposition 2: We give the proof to the Lancaster statistic. Note that any correlation structure among p-values has no impact to the first condition of Lemma 1. As a result, one can repeat the derivation for Theorem 1 to get under H: θ ∈ Θ − Θ0. When are correlated, the null distribution of no longer follows distribution. One can approximate it by a scaled chi-square distribution such as . By matching expectation and variance between two sides, one can solve for c and v. Under H0: θ ∈ Θ0, direct calculations show that where F is the CDF of and () are the CDF (PDF) of . Plug the results from Eqs (5) and (6) to Lemma 1. Then the Bahahur efficiency slope for the Lancaster statistic is under H: θ ∈ Θ − Θ0.

17 in total

1. Genomic control for association studies.

Authors: B Devlin; K Roeder
Journal: Biometrics Date: 1999-12 Impact factor: 2.571

Review 2. Analysing biological pathways in genome-wide association studies.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nat Rev Genet Date: 2010-12 Impact factor: 53.242

3. Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach.

Authors: M C Whitlock
Journal: J Evol Biol Date: 2005-09 Impact factor: 2.411

4. Pathway-based approaches for analysis of genomewide association studies.

Authors: Kai Wang; Mingyao Li; Maja Bucan
Journal: Am J Hum Genet Date: 2007-12 Impact factor: 11.025

5. Protein networks and pathway analysis. Preface.

Authors: Yuri Nikolsky; Julie Bryant
Journal: Methods Mol Biol Date: 2009

6. Is the weighted z-test the best method for combining probabilities from independent tests?

Authors: Z Chen
Journal: J Evol Biol Date: 2011-01-24 Impact factor: 2.411

7. FaST linear mixed models for genome-wide association studies.

Authors: Christoph Lippert; Jennifer Listgarten; Ying Liu; Carl M Kadie; Robert I Davidson; David Heckerman
Journal: Nat Methods Date: 2011-09-04 Impact factor: 28.547

8. Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

Authors: Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey
Journal: Science Date: 2012-05-17 Impact factor: 47.728

9. Reactome knowledgebase of human biological pathways and processes.

Authors: Lisa Matthews; Gopal Gopinath; Marc Gillespie; Michael Caudy; David Croft; Bernard de Bono; Phani Garapati; Jill Hemish; Henning Hermjakob; Bijay Jassal; Alex Kanapin; Suzanna Lewis; Shahana Mahajan; Bruce May; Esther Schmidt; Imre Vastrik; Guanming Wu; Ewan Birney; Lincoln Stein; Peter D'Eustachio
Journal: Nucleic Acids Res Date: 2008-11-03 Impact factor: 16.971

10. Discovery and refinement of loci associated with lipid levels.

Authors: Cristen J Willer; Ellen M Schmidt; Sebanti Sengupta; Michael Boehnke; Panos Deloukas; Sekar Kathiresan; Karen L Mohlke; Erik Ingelsson; Gonçalo R Abecasis; Gina M Peloso; Stefan Gustafsson; Stavroula Kanoni; Andrea Ganna; Jin Chen; Martin L Buchkovich; Samia Mora; Jacques S Beckmann; Jennifer L Bragg-Gresham; Hsing-Yi Chang; Ayşe Demirkan; Heleen M Den Hertog; Ron Do; Louise A Donnelly; Georg B Ehret; Tõnu Esko; Mary F Feitosa; Teresa Ferreira; Krista Fischer; Pierre Fontanillas; Ross M Fraser; Daniel F Freitag; Deepti Gurdasani; Kauko Heikkilä; Elina Hyppönen; Aaron Isaacs; Anne U Jackson; Åsa Johansson; Toby Johnson; Marika Kaakinen; Johannes Kettunen; Marcus E Kleber; Xiaohui Li; Jian'an Luan; Leo-Pekka Lyytikäinen; Patrik K E Magnusson; Massimo Mangino; Evelin Mihailov; May E Montasser; Martina Müller-Nurasyid; Ilja M Nolte; Jeffrey R O'Connell; Cameron D Palmer; Markus Perola; Ann-Kristin Petersen; Serena Sanna; Richa Saxena; Susan K Service; Sonia Shah; Dmitry Shungin; Carlo Sidore; Ci Song; Rona J Strawbridge; Ida Surakka; Toshiko Tanaka; Tanya M Teslovich; Gudmar Thorleifsson; Evita G Van den Herik; Benjamin F Voight; Kelly A Volcik; Lindsay L Waite; Andrew Wong; Ying Wu; Weihua Zhang; Devin Absher; Gershim Asiki; Inês Barroso; Latonya F Been; Jennifer L Bolton; Lori L Bonnycastle; Paolo Brambilla; Mary S Burnett; Giancarlo Cesana; Maria Dimitriou; Alex S F Doney; Angela Döring; Paul Elliott; Stephen E Epstein; Gudmundur Ingi Eyjolfsson; Bruna Gigante; Mark O Goodarzi; Harald Grallert; Martha L Gravito; Christopher J Groves; Göran Hallmans; Anna-Liisa Hartikainen; Caroline Hayward; Dena Hernandez; Andrew A Hicks; Hilma Holm; Yi-Jen Hung; Thomas Illig; Michelle R Jones; Pontiano Kaleebu; John J P Kastelein; Kay-Tee Khaw; Eric Kim; Norman Klopp; Pirjo Komulainen; Meena Kumari; Claudia Langenberg; Terho Lehtimäki; Shih-Yi Lin; Jaana Lindström; Ruth J F Loos; François Mach; Wendy L McArdle; Christa Meisinger; Braxton D Mitchell; Gabrielle Müller; Ramaiah Nagaraja; Narisu Narisu; Tuomo V M Nieminen; Rebecca N Nsubuga; Isleifur Olafsson; Ken K Ong; Aarno Palotie; Theodore Papamarkou; Cristina Pomilla; Anneli Pouta; Daniel J Rader; Muredach P Reilly; Paul M Ridker; Fernando Rivadeneira; Igor Rudan; Aimo Ruokonen; Nilesh Samani; Hubert Scharnagl; Janet Seeley; Kaisa Silander; Alena Stančáková; Kathleen Stirrups; Amy J Swift; Laurence Tiret; Andre G Uitterlinden; L Joost van Pelt; Sailaja Vedantam; Nicholas Wainwright; Cisca Wijmenga; Sarah H Wild; Gonneke Willemsen; Tom Wilsgaard; James F Wilson; Elizabeth H Young; Jing Hua Zhao; Linda S Adair; Dominique Arveiler; Themistocles L Assimes; Stefania Bandinelli; Franklyn Bennett; Murielle Bochud; Bernhard O Boehm; Dorret I Boomsma; Ingrid B Borecki; Stefan R Bornstein; Pascal Bovet; Michel Burnier; Harry Campbell; Aravinda Chakravarti; John C Chambers; Yii-Der Ida Chen; Francis S Collins; Richard S Cooper; John Danesh; George Dedoussis; Ulf de Faire; Alan B Feranil; Jean Ferrières; Luigi Ferrucci; Nelson B Freimer; Christian Gieger; Leif C Groop; Vilmundur Gudnason; Ulf Gyllensten; Anders Hamsten; Tamara B Harris; Aroon Hingorani; Joel N Hirschhorn; Albert Hofman; G Kees Hovingh; Chao Agnes Hsiung; Steve E Humphries; Steven C Hunt; Kristian Hveem; Carlos Iribarren; Marjo-Riitta Järvelin; Antti Jula; Mika Kähönen; Jaakko Kaprio; Antero Kesäniemi; Mika Kivimaki; Jaspal S Kooner; Peter J Koudstaal; Ronald M Krauss; Diana Kuh; Johanna Kuusisto; Kirsten O Kyvik; Markku Laakso; Timo A Lakka; Lars Lind; Cecilia M Lindgren; Nicholas G Martin; Winfried März; Mark I McCarthy; Colin A McKenzie; Pierre Meneton; Andres Metspalu; Leena Moilanen; Andrew D Morris; Patricia B Munroe; Inger Njølstad; Nancy L Pedersen; Chris Power; Peter P Pramstaller; Jackie F Price; Bruce M Psaty; Thomas Quertermous; Rainer Rauramaa; Danish Saleheen; Veikko Salomaa; Dharambir K Sanghera; Jouko Saramies; Peter E H Schwarz; Wayne H-H Sheu; Alan R Shuldiner; Agneta Siegbahn; Tim D Spector; Kari Stefansson; David P Strachan; Bamidele O Tayo; Elena Tremoli; Jaakko Tuomilehto; Matti Uusitupa; Cornelia M van Duijn; Peter Vollenweider; Lars Wallentin; Nicholas J Wareham; John B Whitfield; Bruce H R Wolffenbuttel; Jose M Ordovas; Eric Boerwinkle; Colin N A Palmer; Unnur Thorsteinsdottir; Daniel I Chasman; Jerome I Rotter; Paul W Franks; Samuli Ripatti; L Adrienne Cupples; Manjinder S Sandhu; Stephen S Rich
Journal: Nat Genet Date: 2013-10-06 Impact factor: 38.330