Literature DB >> 27307612

Using genomic annotations increases statistical power to detect eGenes.

Dat Duong¹, Jennifer Zou¹, Farhad Hormozdiari¹, Jae Hoon Sul², Jason Ernst³, Buhm Han⁴, Eleazar Eskin⁵.

Abstract

MOTIVATION: Expression quantitative trait loci (eQTLs) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes or genes whose expressions are associated with at least one eQTL. The standard statistical method to determine whether a gene is an eGene requires association testing at all nearby variants and the permutation test to correct for multiple testing. The standard method however does not consider genomic annotation of the variants. In practice, variants near gene transcription start sites (TSSs) or certain histone modifications are likely to regulate gene expression. In this article, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases the statistical power.
RESULTS: We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from TSSs, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detected more candidate eGenes. Distance from TSS appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method. CONTACT: buhm.han@amc.seoul.kr or eeskin@cs.ucla.edu.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2016 PMID： 27307612 PMCID： PMC4908356 DOI： 10.1093/bioinformatics/btw272

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Many studies over the past decade examined the contribution of genetic loci to phenotypic variation in complex traits. Genetic loci that are associated with gene expression are called expression quantitative trait loci (eQTLs) (Brem and Kruglyak, 2005; Gilad ; The GTEx Consortium, 2015). One important task in eQTL studies is to find eGenes or genes whose expressions are associated with at least one genetic variant. The standard method to determine whether a gene is an eGene requires association testing at all variants near the gene (cis-variants) and the permutation test to correct for multiple testing. The permutation test is the gold standard for multiple-testing correction, as it properly accounts for the linkage disequilibrium (LD) structure in the genome. However, this standard method does not consider which variants are more likely to regulate gene expression. In order to better detect eGenes, we can increase the statistical power of this standard method by using annotation data. In practice, regulatory variants found near the transcription start sites (TSSs) and certain histone modifications are more likely to be associated with gene expression (van de Geijn ). Additionally, recent large-scale genomics studies have annotated regions of the genome that are likely to alter gene expression in individuals (Ernst and Kellis, 2015; The Roadmap Epigenomics Mapping Consortium, 2015). For example, almost 80% of the chip-based heritability of disease risk for 11 human diseases examined in the Wellcome Trust Case Control Consortium (WTCCC) can be explained by genome variation in DNase I hypersensitivity sites. These variations are likely to regulate chromatin accessibility and thus transcription (Gusev ). These genomic annotations for the variants can be used to increase the power to detect eGenes. Although several methods were recently developed to address challenges in multiple-testing correction in eQTL studies, these methods do not improve statistical power in comparison to the standard method. Sul improved the runtime of the standard permutation test by replacing the permutation procedure with sampling from the multivariate normal distribution (Mvn). Davis further improved this runtime by estimating the effective number of tests based on the eigen-decomposition of the genotype correlation matrix. These methods aim to reduce runtime but do not attempt to increase statistical power of the standard approach. Therefore, these methods are not capable of detecting more eGenes. In this article, we introduce a new method for discovering eGenes (Func-eGene) that uses genomic annotation of the variants to increase statistical power. Although gene expression can be affected by trans-variants (Bryois ), in this article, we focus on methods to detect cis-acting eQTLs. We rely on the multi-threshold association test that specifies different significance thresholds for the variants when correcting for multiple testing (Darnell ; Eskin, 2008). In this multi-threshold association study mindset, we can assign less stringent significance thresholds to variants that have a high propensity to contribute to gene expression, thereby increasing power. If an appropriate prior is provided, this multi-threshold association method has a closed-form solution that guarantees the best statistical power for the association test (Darnell ). However, there are two key difficulties we encounter when directly applying this multi-threshold association study method to discover eGenes. First, the multi-threshold association test depends on a permutation test to correct for LD. This permutation test is slow when applying to a large dataset. Second, we rarely know of an appropriate prior based on the annotation for genetic variants under study. Our new method Func-eGene avoids these difficulties. To reduce runtime, we replace the permutation test with the sampling procedure in Sul . To find an appropriate prior, we do a grid search over possible sets of scores assigned to annotation categories. The goal of our search is to find the set of scores that maximizes the number of eGenes. To avoid data re-use and over-fitting, we use a cross-validation strategy. We applied our method to the liver dataset from the Genotype Tissue Expression (GTEx) Consortium. First, we used the distance from the TSS as a genomic annotation because variants near the TSS are likely to affect gene expression. Using this annotation alone, our method Func-eGene increased the number of candidate eGenes by 50% compared with the standard method. Then, we added DNase hypersensitivity sites as a second genomic annotation. The TSS and DNase annotations together did not discover more candidate eGenes than the TSS annotation alone. Third, we separately applied the binding sites for histones H3K27ac, H3K27me3, H3K4me1, H3K4me3, H3K9ac and H3K9me3 as genomic annotations. These histone marks found more candidate eGenes than the standard method but less than the number reported by using the TSS annotation alone. Distance from TSS appears to create the most informative prior for detecting eGenes.

2 Methods

Standard method to detect eGenes

An eGene is a gene that has an associated eQTL (Sul ). Let s be a test statistic such as a t-distributed statistic from Pearson or Spearman correlation between a genetic variant and gene expression. Define F(s) as its cumulative density function. Suppose α is the desired false-positive rate. In a two-tailed hypothesis test, assuming that the distribution is symmetric, is the P-value, and defines the rejection region for s. We use to specify a matrix x of dimension p × q. Parentheses and subscripts are omitted whenever the context is clear.

Association test

Suppose that there are N individuals. In the simplest scenario, which considers only one variant and one gene y, the hypothesis test tests whether gene expression of y is independent of the variant genotype . If they are independent, the variant is not an eQTL and y is not an eGene. The null hypothesis H0 is that y is not an eGene, and the alternative hypothesis H1 is that y is an eGene. To conduct this single hypothesis test, the standard method assumes a linear model In Equation 1, the design matrix contains P fixed effects (i.e. gender, ethnicity, age, etc.). The vector is their regression coefficients. β is the regression coefficient of the variant genotypes. is a vector of independent sampling errors that is normally distributed (). In Equation 1, linear regression is used to estimate and its variance . Our test statistic s is the normalized (). Under the null hypothesis, s follows a t-distribution with degrees of freedom. If we suppose N is large, then where is a normal cumulative density with mean μ and variance one. This mean μ is also known as the z-score non-centrality parameter. Our null and alternative hypotheses can then be written as H0 is rejected if the P-value is less than α. This P-value is named eGene P-value (Sul ).

Multi-association test

In a more common scenario, many variants in-cis with gene y are tested. In this case, the test consists of M univariate association tests. The hypothesis test tests whether the expression of y is independent of all variant genotypes . As before, one assumes In Equation 2, contains P fixed effects, and is their regression coefficients. β is the regression coefficient for the genotypes of variant i. is a vector of independent sampling errors, and follows . In Equation 2, linear regression is again used to estimate and its variance . Our test statistic s is the normalized (). Let μ be the expected value of each test statistic s. We write the hypothesis as We then compute the P-value at each variant i and reject H0 if their minimum P-value is less than α (Sul ). α is the false-positive rate adjusted for multiple testing. For example, if Bonferonni correction is applied, . Bonferonni correction is conservative because it ignores linkage-disequilibrium among the variants. The permutation test is thus the gold standard method (Sul ).

Functional annotation-based multi-threshold eGene (Func-eGene)

The standard method applies an identical univariate association test to each variant and uses the minimum P-value as a test statistic. This is equivalent to assigning to all variants a uniform prior of being associated with the gene expression. However, we often have annotations that specify whether the variants are in some regulatory regions of the genome. We developed a new method named Func-eGene that considers this evidence to increase statistical power to find candidate eGenes.

Multi-threshold association test

Our Func-eGene is built upon the multi-threshold association test method that assigns different significance thresholds to different hypothesis tests (Darnell ; Eskin, 2008). We briefly describe their method here, assuming that we have data about the relative importance of the genetic variants in our study. Let this information about the M variants be given as , where and for all i. For example, regulatory variants can be assigned higher c than those which are not. We assume that are given beforehand. Later, we drop this assumption and determine an appropriate from the data. Darnell and Eskin (2008) use this data to create a non-uniform prior in the hypothesis test by giving the M observed statistics their own thresholds , which are functions of . These t must be corrected for multiple testing by using the constraint , where α is the genome-wide false-positive rate. This constraint holds only if the variants are independent. Later, we remove this requirement and properly account for LD. We then maximize the statistical power of the hypothesis test, which is a function of t. Statistical power is defined as the probability of an observed statistic being significant when the alternative hypothesis is true. In the simplest case, there is only one variant i and the gene y. The power of the two-tail hypothesis test denoted as is the probability that when the true mean of is not zero. Suppose N is large so that can be approximated using the standard normal cumulative density . In a more common scenario, one considers M variants in-cis with gene y. The statistical power to detect y being an eGene denoted as is the weighted average over M variants, . For a more detailed discussion of this definition, see Eskin (2008) and Rubin . Darnell and Eskin (2008) find so that the statistical power is maximized under the constraint and for all i. This solution is obtained by taking the gradient of the Lagrangian with respect to the Lagrangian multiplier , and the unknown variables . Optimal solution is achieved when for all (Eskin, 2008). Moreover, this equality has a closed form (Darnell ) The symbol is the probability density function of a normal distribution having mean z and variance one. is the quantile of w under a normal distribution of mean zero and variance one. Once the observed statistics are estimated, we compare their P-values to . If p<, then variant i is an eQTL. If at least one variant is an eQTL, then the gene is an eGene. This method can be used to calculate the multiple-testing-corrected P-values . In fact, finding per-marker significance thresholds and computing the corrected P-value are two related tasks in multiple-testing correction. Briefly, to obtain the corrected P-value , we begin with a small , find optimal with constraint in which case due to small . We repeat this analysis while increasing until . will be our corrected P-value . If , then variant i is an eQTL. If any of the variants is an eQTL, then the gene is an eGene. The multiple-testing-corrected eGene P-value becomes Comparing p against t and comparing against α give identical eQTLs and eGenes. These are two different viewpoints of the same multiple-testing correction.

LD-corrected P-value

When LD among the variants is unignorable, corrected P-values and eGene P-value violate the independence assumption and become conservative. To avoid this, Darnell and Eskin (2008) suggested using a permutation test to compute . Because these studies did not describe in detail how a permutation test is done, we explain the procedure here. We do one permutation by permuting the expression measurements among the individuals while keeping their genotype data unchanged so that LD is retained. Leaving the LD intact keeps the correlation between the genotypes of the individuals which then retains the correlation of the test statistics. Suppose that we do such permutation B times. In the j-th permutation, we find M corrected P-values and their eGene P-value as Let be the eGene P-value in the observed data. Define the LD-corrected eGene P-value as where is an indicator function. This LD-corrected eGene P-value is not conservative and has correct false-positive rate. This permutation however is time-consuming. In each permutation, we need to find the corrected P-values which requires a search for as described above. Repeating this search B times makes the permutation test very time-consuming.

LR-based permutation test

To speed up the permutation test, Func-eGene uses the likelihood ratio (LR). The permutation using P-values in Darnell and Eskin (2008) is slow because every permutation finds the corrected P-value . Func-eGene uses a test statistic that does not require . To do this, we interpret Equation 5 as a LR multiplied by a prior probability. Define such that Equation 8 becomes a LR evaluated at s where and H1 is an average of two two-tail hypotheses and . Here is a monotonic increasing in |s|. The key concept is that we can replace in Equation 6 with . Define the eGene LR to be Let be the observed eGene LR in the data computed as in Equation 9 using the observed test statistics . Define as the eGene LR computed as in Equation 9 using the test statistics in the j-th permutation. The LD-corrected P-value becomes and is equivalent to Equation 7. By using LR instead of P-value, we avoid finding and make the permutation test faster.

LR-based Mvn sampling

Even with the LR-based permuation test, the method’s runtime depends on the number of individuals. To overcome this problem, we observe that Equation 8 is a simple function in s. Applying the method in Sul , we use the Mvn density instead of the permutation test. We sample from a null density . measures the LD among the M variants, and is computed as where is the genotype data. When there are few individuals, the null density in the permutation test and one formed by are mismatched. Following Sul , we correct using the variant minor allele frequencies and the sample size. The corrected are used to compute in Equation 10.

Finding appropriate priors using genomic annotations

We demonstrated that Func-eGene can maximize statistical power and approximate eGene P-values when the functional priors are specified beforehand. However, these priors are hardly known a priori. In this section, we introduce a heuristic data-adaptive procedure to determine an appropriate prior that can yield the most number of candidate eGenes. Suppose we categorize the cis-variants using J different annotations. Each annotation j is given a score b. A variant belonging to j inherits its score. Define so that k = 1 if the variant i belongs to j. The score u at a variant i is defined as The normalized prior c at variant i is where M is the number of variants. Equation 11 assumes that the functional annotations behave in an additive manner. It is possible to include interaction terms among the annotations, and the optimization procedure below remains applicable. Equations 11 and 12 imply that Equations 8, 9 and 10 are functions of the annotation score . Using these priors in Func-eGene, we calculate Equation 10. It is important to see that Equation 8 is the product of c and the function . Thus, we compute from the observed data only once, and then use them when evaluating eGene LR at each possible b to get . Our objective is to determine the optimal score for the annotations that yield the most number of candidate eGenes. One immediate but impractical solution is to search all possible b. To do this, Func-eGene must be run for each b to get the threshold for the observed eGene LR in Equation 9, which corresponds to the specified significance threshold α. This threshold is some upper α quantile under a null density of . This quantile depends on b. Let be the quantile threshold of at gene y, using some k-th choice of b denoted as . If , then y is an eGene. Using these quantile thresholds, the number of eGenes can be estimated for a choice . Running Func-eGene for all b is time demanding. Thus, for finding a good choice of b, we use the following procedure. We aim to approximate quantile thresholds of all b, while implementing Func-eGene only once. One can pick a starting , compute for all genes y in the data. In each subsequent choice k, find for only a subset of genes, and estimate the ratio change of the quantile threshold Determine the average using the computed over the subset. Use and to estimate for all genes y in the data, assuming that the ratio Equation 13 changes only slightly for all the genes. This procedure quickly calculates the observed eGene LR and its threshold at all b, using the permutation test or the sampling scheme in Sul only once. We emphasize that this approximation is based on a subset of genes and is best used for finding a good choice . After we find , ideally, we would apply a complete Func-eGene run using to determine the number of eGenes. Because we determine good choices for b from the data, data re-use and over-fitting are two issues which can inflate the false-positive rate. To avoid this, we use a cross-validation method that divides the data into two subsets. We obtain best scores from one set and apply these scores to find eGenes in the other set, and vice versa.

3 Results

We applied our method Func-eGene to the GTEx dataset. The GTEx pilot study collected 9365 tissue samples from more than 30 distinct tissues from 237 post-mortem donors and performed RNA-seq to quantify gene expression in those tissues (The GTEx Consortium, 2015). We used the liver tissue data that has 97 samples. All individuals were genotyped at 5M SNPs and imputed with 1000 Genomes Phase I as the reference panel. The number of genes expressed in this tissue is 21 868.

Func-eGene controls false-positive rate

There are two ways to apply Func-eGene. Permutation Func-eGene relies on the traditional permutation test to calculate the null density of the observed statistic, whereas Mvn Func-eGene relies on the Mvn-sampling procedure in Sul . Simulations demonstrate that both the permutation and the Mvn Func-eGene control the false-positive rate well. The gene ENSG00000204219.5 expressed in the liver tissue is chosen as an example. This gene belongs to chromosome 1 and has 3872 cis-variants of which 431 are in the TSS region. For the sake of simplicity, variants in the TSS region are assumed 100 times more likely to affect gene expression. The non-uniform priors are then specified such that the ratio of priors for variants inside and outside TSS is 100/1. This ratio is reset to 1/1 in the uniform prior. In the alternative hypothesis, our method requires the true effects (i.e. z-score non-centrality parameters) of the variants as input parameters μi's. In this experiment and onward, in the alternative hypothesis, we use —this choice is addressed in Section 4. To simulate gene expression data under the null hypothesis, we permuted the gene expression measurements among the 97 individuals and kept the genotype data unchanged. In each simulation, we computed the eGene P-value using permutation and Mvn Func-eGene with both uniform and non-uniform priors. We applied the permutation procedure using 104 permutations and the Mvn method using 106 samplings. Simulated eGene P-values under 0.05 are considered significant. Permutation and Mvn Func-eGene have false-positive rates of 0.046 and 0.044, respectively, using a uniform prior. Both have false-positive rates of 0.051 and 0.052, respectively, using a non-uniform prior. Q–Q plots against the uniform density illustrate that the simulated densities under the null hypothesis match the uniform distribution well (Fig. 1(a)).

Fig. 1.

(a) Q–Q plots of the uniform density quantiles against the simulated eGene P-value quantiles using Func-eGene at the gene ENSG00000204219.5 under the null hypothesis. (b) Func-eGene simulated statistical power at the gene ENSG00000204219.5

Func-eGene increases statistical power

An appropriate prior when applied in Func-eGene increases statistical power to detect candidate eGenes. To demonstrate this, we conducted simulation studies where there exists one variant that induces the gene expression. The gene ENSG00000204219.5 is again chosen as an example. To simulate the gene expression measurements for 97 individuals, we consider only variants within 150 kb up- and down-stream of the TSS, and presume that the maximum absolute effects of these variants (denoted as ) to be the only genetic contribution toward the gene expression. Gene expression measurements are thus simulated as where is the simulated gene expression measurements, is the genotype of the variant corresponding to , and is sampled from . In our experiment, we vary this standard deviation σ from 0.00 to 1.50 At each instance of σ, we simulate 200 sets of gene expression data and compute the eGene P-value for each of them. The simulated power at this σ is the fraction of eGene P-values from the 200 simulated datasets that are less than 0.05. As we increase σ, the randomness effect dominates the effects of variants and the statistical power to discover any association between a variant and the gene expression diminishes. We applied the uniform and non-uniform priors to the simulated data. In the non-uniform prior, the ratio of prior for a variant inside and outside the TSS region is 100/1. This non-uniform prior reflects the fact that eQTLs tend to reside near the TSS. Figure 1(b) indicates that the non-uniform prior increases statistical power in both the permutation and Mvn Func-eGene and that conditioned on the priors, permutation and Mvn Func-eGene archive equivalent statistical power.

Func-eGene discovers more candidate eGenes in the liver GTEx data

Permutation and Mvn Func-eGene are applied to the liver GTEx data which contains 21 868 quantile normalized gene expression measurements across 22 autosomal chromosomes in 97 individuals. Both uniform and non-uniform priors are tested. Our goal in this experiment is to demonstrate that: (i) using an informative non-uniform prior increases the number of candidate eGenes and (ii) Mvn Func-eGene is a good estimation of the gold-standard permutation test. We computed eGene P-values by using 106 samplings for Mvn Func-eGene and 104 permutations for the permutation Func-eGene as indicated in the GTEx pilot analysis (The GTEx Consortium, 2015). The efficiency gain of Mvn sampling over the permutation test diminishes when the number of cis-variants for a gene is much greater than the number of sample size. Following Sul , we divide the cis-variants for a gene into blocks of size 500 kb, and apply Mvn sampling separately to each block. The most significant P-value taken across these blocks is the eGene P-value. Cis-variants are variants located within the 1 Megabase up- and down-stream of TSS of a gene (The GTEx Consortium, 2015). In the liver GTEx data, the average number of cis-variants per gene is 4681. We define gene TSS-region to be 150 kb up- and down-stream of the gene TSS. The average fraction of variants inside this region is 14.74%. Spurious effects on gene expression might dominate the effects of the cis-variants. To remove them, we regress out the following covariates: the first three genotyping principal components, the first 15 expression Probabilistic Estimation of Expression Residuals (PEER) factors, and gender (Stegle ; The GTEx Consortium, 2015). To be consistent with the GTEx pilot analysis, we transform eGene P-values into Q-values to control the false discovery rate over the entire sample. Genes having Q-values under 0.05 are considered eGenes (Storey and Tibshirani, 2003; The GTEx Consortium, 2015). For the sake of simplicity, non-uniform priors are assigned so that the ratio of prior for a variant inside and outside the TSS region is 100/1, an assumption we address later. Using an informative non-uniform prior increases the number of candidate eGenes in both permutation and Mvn Func-eGene (Table 1). The number of candidate eGenes has increased by 50.4% in permutation Func-eGene (from 1626 to 2445) and by 54.8% in Mvn Func-eGene (from 1582 to 2449) by applying non-uniform priors. The numbers were comparable between the two implementations, indicating that Mvn approximates the null distribution of observed statistics well.

Table 1.

Number of candidate eGenes discovered by permutation and Mvn Func-eGene using uniform and non-uniform priors for 21 868 genes in the liver tissue GTEx data

	Permutation		Mvn		Overlap^a
eGene^b	Yes	No	Yes	No	Yes	No
Uniform^c	1626	20 242	1582	20 286	1549	20 209
Non-uniform^d	2445	19 423	2449	19 419	2379	19 353
Overlap^e	1484	19 281	1457	19294

aCondition on uniform prior or non-uniform prior and count the number of eGenes (or not eGenes) that permutation Func-eGene agrees with Mvn Func-eGene.

bIndicates the number of genes detected to be eGenes.

cUniform prior uses the prior ratio 1/1 for all variants.

dNon-uniform prior uses the prior ratio 100/1 so that variants in the TSS are 100 times more likely to affect gene expression.

eCondition on permutation or Mvn Func-eGene and count the number of eGenes (or not eGenes) that the uniform prior agrees with non-uniform prior.

Number of candidate eGenes discovered by permutation and Mvn Func-eGene using uniform and non-uniform priors for 21 868 genes in the liver tissue GTEx data aCondition on uniform prior or non-uniform prior and count the number of eGenes (or not eGenes) that permutation Func-eGene agrees with Mvn Func-eGene. bIndicates the number of genes detected to be eGenes. cUniform prior uses the prior ratio 1/1 for all variants. dNon-uniform prior uses the prior ratio 100/1 so that variants in the TSS are 100 times more likely to affect gene expression. eCondition on permutation or Mvn Func-eGene and count the number of eGenes (or not eGenes) that the uniform prior agrees with non-uniform prior. The number of eGene discovered at 19 annotation score ratios aPrior ratios of variants inside and outside an annotation. These ratios are in order . bThe numbers are obtained using the approximation method in Section 2.2.5. Numbers in parentheses are obtained using Mvn Func-eGene Sul have shown that Mvn sampling and permutation test have comparable eGene P-values without using any prior. Here we show that the results are also comparable when using a non-uniform prior. Figure 2(a) compares Mvn eGene P-values against those in the permutation test, and Figure 2(b) indicates the Mvn method is at most ±0.10 units from the gold-standard permutation test P-values. Our Mvn method and the permutation test agree on 2379 candidate eGenes. Because Mvn method is an approximation to the permutation test, we analyze the candidate eGenes that the Mvn method misses and falsely discovers. Of the 66 missed eGenes, the maximum Q-value is 0.079 and the median is 0.053. Of the 70 falsely discovered eGenes, the minimum Q-value is 0.028 and the median 0.045. Thus, these misclassified eGenes are genes with borderline Q-values.

Fig. 2.

(a) Scatter plot of eGene P-values using Mvn Func-eGene and the permutation test. (b) Histogram of the difference between eGene P-values using Mvn Func-eGene and the permutation test

(a) Scatter plot of eGene P-values using Mvn Func-eGene and the permutation test. (b) Histogram of the difference between eGene P-values using Mvn Func-eGene and the permutation test The function Q-value in R requires that the distribution of input P-values has a fairly flat right tail (Storey and Tibshirani, 2003). Our eGene P-values meet this condition (Fig. 3).

Fig. 3.

Histogram of the eGene P-values using Mvn Func-eGene and the permutation test

Histogram of the eGene P-values using Mvn Func-eGene and the permutation test Figure 4(a) compares our eGene Q-values against those in the permutation test, and Figure 4(b) indicates that accuracy of Mvn Func-eGene is at most ±0.05 units from the gold-standard permutation test Q-values.

Fig. 4.

(a) Scatter plot of the eGene Q-values using Mvn Func-eGene and the permutation test. (b) Histogram of the difference between eGene Q-values using Mvn Func-eGene and the permutation test

Func-eGene chooses appropriate priors

So far, we have assumed that the priors are defined a priori. In this section, we applied the method in Section 2.2.5 to determine appropriate priors from the data. To demonstrate this method, we consider two functional annotations and . contains variants within 150 kb of the TSS. contains variants within the E066 DNase hypersensitivity narrow gapped peaks (http://egg2.wustl.edu/roadmap/data) in the liver tissue. Across 21 868 autosomal genes, the average fraction of cis-variants belonging in the and the are 14.74% and 4.66%, respectively. The overlap between them is 0.88%. The relative prior ratio between annotations can be represented by three numbers, b1, b2 and b3, which correspond to the scores for and neither. For example, a variant in has times higher score than a variant in . Since the scores are assumed to be additive in Equation 11, a variant in both and has times higher score than a variant in neither classes. We constrained each of b1, b2, and b3 to be between 100 times greater and 100 times smaller than the other two. Only the relative ratios of the scores matter. Given this constraint, we did a grid search over the parameter space evaluating a total of 441 choices of score . We implemented Mvn Func-eGene only once with the functional scores . At each gene, we recorded the upper 1% quantile of the observed statistics, which corresponds to the significance threshold . We used this P-value threshold because this threshold roughly corresponds to the maximum eGene P-value of genes whose Q-values are under the threshold 0.05 in our data of Section 3.3. Using this complete Mvn Func-eGene run at , we computed new observed statistics and their thresholds at another choice using the approximation method in Section 2.2.5. We tested 441 choices of b. Table 2 displays 19 of these 441 choices and the number of candidate eGenes discovered using these scores. The score had the most candidate eGenes, indicating that using alone is enough to increase the number of eGene discovered. For a few choices of b, we ran the Mvn Func-eGene without using the approximation method, and the results are comparable.

Table 2.

The number of eGene discovered at 19 annotation score ratios

Row	Ratio^a	eGene ^b	Row	Ratio^a	eGene^b
1	1:1:1	2057	10	1:1:10	1579
2	1:10:1	2032	11	1:10:10	1747
3	1:100:1	1890 (1834)	12	1:100:10	1904
4	10:1:1	2450	13	10:1:10	2060
5	10:10:1	2331	14	100:1:10	2473 (2413)
6	10:100:1	1991	15	1:1:100	1280
7	100:1:1	2493 (2449)	16	1:10:100	1391
8	100:10:1	2489 (2449)	17	1:100:100	1673
9	100:100:1	2329	18	10:1:100	1548
			19	100:1:100	2014

aPrior ratios of variants inside and outside an annotation. These ratios are in order .

bThe numbers are obtained using the approximation method in Section 2.2.5. Numbers in parentheses are obtained using Mvn Func-eGene

The number of candidate eGenes detected by Mvn Func-eGene at the best priors in each annotation aAverage percent of variants in an annotation. bBest prior ratios of variants inside and outside an annotation given by the method in Section 2.2.5. cNumbers are obtained by using Mvn Func-eGene at the best ratios. dAssociated with gene activation. eAssociated with gene suppression

Evaluation using histone marks

Because there exist other genomic annotations, we hope to evaluate their effects on eGene detection. Ideally, we would use all annotations simultaneously in the model and find the optimal prior parameters. Unfortunately, our method uses grid search which is prohibitively time-consuming in high dimensional space. For this reason, we evaluate each annotation separately, which at least can provide an overview of which annotation contains useful information for eGene discovery. We applied the optimization to six histone marks, , and . In each annotation, we find the best prior ratio of the variants inside and outside the annotated site. Table 3 indicates that using these annotations can increase the number of candidate eGenes. These numbers are from a complete Mvn Func-eGene run using all genes in the liver tissue data and not from the method in Section 2.2.5. The marks associated with activation of gene expression (H3K27ac, H3K4me1, H3K4me3, H3K9ac) have prior ratios more than one, whereas the marks associated with suppression of gene expression (H3K27me3, H3K9me3) have prior ratios less than one. All of the histone marks yield more candidate eGenes than the uniform prior.

Table 3.

The number of candidate eGenes detected by Mvn Func-eGene at the best priors in each annotation

Annotation	(%)^a	Ratio^b	eGene^c
H3K27ac^d	12.25	40/1	1944
H3K27me3^e	7.26	1/70	1880
H3K4me1^d	16.38	80/1	1858
H3K4me3^d	7.73	50/1	1917
H3K9ac^d	9.74	100/1	1861
H3K9me3^e	11.92	1/50	1879
TSS	14.74	60/1	2479
DNase	4.66	100/1	1834
Uniform	100	1/1	1592

aAverage percent of variants in an annotation.

bBest prior ratios of variants inside and outside an annotation given by the method in Section 2.2.5.

cNumbers are obtained by using Mvn Func-eGene at the best ratios.

dAssociated with gene activation.

eAssociated with gene suppression

In Table 3, TSS has the best prior ratio at 60/1. This prior gives more eGenes than the prior 100/1 reported in Table 1. Figure 5(a) indicates that the false-positive rate at the gene ENSG00000204219.5 using prior 60/1 matches well to the uniform density and is not inflated. Figure 5(b) indicates that the simulated statistical power at this gene using Mvn Func-eGene with the prior ratio 60/1 is not worse than the guessed ratio 100/1. It is important to stress that the ratio 60/1 is best with respect to all the genes in liver tissue data and not this particular gene.

Fig. 5.

(a) False-positive rate for permutation and Mvn Func-eGene using prior ratio 60/1. (b) Mvn Func-eGene statistical power at gene ENSG00000204219.5 using ratio 60/1 is not worse than the ratio 100/1

Mvn Func-eGene has better runtime than permutation method

In this section, we compare the runtime of the permutation and Mvn Func-eGene. In this experiment, the Mvn method uses 1000 samples, and the permutation-based procedure uses 1000 permutations. In both cases the number of individuals is 97. The standard permutation method computes one eGene P-value in time linear in the number of cis-variants for a gene (Sul ). The permutation-based Func-eGene relies on this basic permutation procedure and has almost identical runtime to the standard permutation method. Figure 6 displays the runtime for a few randomly chosen genes from the GTEx liver tissue data. Mvn Func-eGene is considerably faster than permutation approaches but runs in polynomial time. In the GTEx liver tissue data, the 5% upper quantile of cis-variants per gene is 6833 variants. Thus in practice the polynomial nature of the Mvn Func-eGene does not impede its application.

Fig. 6.

Permutation and Mvn Func-eGene runtime

4 Discussion

In this article, we have introduced a new method Func-eGene that relies on the association study methods in Darnell and Eskin (2008) and uses genomic annotations of the cis-variants to create a non-uniform prior that can detect more eGenes. We applied our method to the liver tissue dataset from the GTEx Consortium, and the results indicate that distance from TSS appears to contain enough information that is needed to find more candidate eGenes. Our method has many layers of procedures which can be time-consuming. To reduce runtime, we introduced many ideas. We employed LR statistic which is more efficient to obtain than a P-value in a multi-threshold association study. We replaced the time-consuming permutation test with the use of Mvn sampling. To avoid reassessing significance thresholds at each new prior in our grid search, we developed an approximation method which uses a subset of the tested genes. Due to these heuristics, we were able to conduct a grid search using TSS, DNase and six histone modifications as functional classes. Because our method uses grid search where the runtime increases exponentially with the number of annotations, our current method is not yet applicable for simultaneously handling a large number of annotations. One future goal is to develop a better optimization method than a grid search. Lastly, we address the fact that in the alternative hypothesis, the true effects of the variants on the gene expression are unknown in practice. It has been demonstrated in previous studies that different choices of these effect sizes do not greatly change the outcome (Darnell ; Eskin, 2008). Another option is to consider some continuous prior density on these true effects and then integrate over their valid domain (Benner ; Hormozdiari , 2015). This idea is another future research plan.

17 in total

1. Increasing power in association studies by using linkage disequilibrium structure and molecular function as prior information.

Authors: Eleazar Eskin
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

2. Accurate and fast multiple-testing correction in eQTL studies.

Authors: Jae Hoon Sul; Towfique Raj; Simone de Jong; Paul I W de Bakker; Soumya Raychaudhuri; Roel A Ophoff; Barbara E Stranger; Eleazar Eskin; Buhm Han
Journal: Am J Hum Genet Date: 2015-05-28 Impact factor: 11.025

3. Identifying causal variants at loci with multiple signals of association.

Authors: Farhad Hormozdiari; Emrah Kostem; Eun Yong Kang; Bogdan Pasaniuc; Eleazar Eskin
Journal: Genetics Date: 2014-08-07 Impact factor: 4.562

4. The landscape of genetic complexity across 5,700 gene expression traits in yeast.

Authors: Rachel B Brem; Leonid Kruglyak
Journal: Proc Natl Acad Sci U S A Date: 2005-01-19 Impact factor: 11.205

5. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

Authors: Jason Ernst; Manolis Kellis
Journal: Nat Biotechnol Date: 2015-02-18 Impact factor: 54.908

6. Incorporating prior information into association studies.

Authors: Gregory Darnell; Dat Duong; Buhm Han; Eleazar Eskin
Journal: Bioinformatics Date: 2012-06-15 Impact factor: 6.937

7. Identification of causal genes for complex traits.

Authors: Farhad Hormozdiari; Gleb Kichaev; Wen-Yun Yang; Bogdan Pasaniuc; Eleazar Eskin
Journal: Bioinformatics Date: 2015-06-15 Impact factor: 6.937

8. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

9. Cis and trans effects of human genomic variants on gene expression.

Authors: Julien Bryois; Alfonso Buil; David M Evans; John P Kemp; Stephen B Montgomery; Donald F Conrad; Karen M Ho; Susan Ring; Matthew Hurles; Panos Deloukas; George Davey Smith; Emmanouil T Dermitzakis
Journal: PLoS Genet Date: 2014-07-10 Impact factor: 5.917

10. FINEMAP: efficient variable selection using summary data from genome-wide association studies.

Authors: Christian Benner; Chris C A Spencer; Aki S Havulinna; Veikko Salomaa; Samuli Ripatti; Matti Pirinen
Journal: Bioinformatics Date: 2016-01-14 Impact factor: 6.937

7 in total

Review 1. Computational Prediction of the Global Functional Genomic Landscape: Applications, Methods, and Challenges.

Authors: Weiqiang Zhou; Ben Sherwood; Hongkai Ji
Journal: Hum Hered Date: 2017-01-12 Impact factor: 0.444

2. Meningeal lymphatics affect microglia responses and anti-Aβ immunotherapy.

Authors: Zachary Papadopoulos; Taitea Dykstra; Logan Brase; Sandro Da Mesquita; Fabiana Geraldo Farias; Morgan Wall; Hong Jiang; Chinnappa Dilip Kodira; Kalil Alves de Lima; Jasmin Herz; Antoine Louveau; Dylan H Goldman; Andrea Francesca Salvador; Suna Onengut-Gumuscu; Emily Farber; Nisha Dabhi; Tatiana Kennedy; Mary Grace Milam; Wendy Baker; Igor Smirnov; Stephen S Rich; Bruno A Benitez; Celeste M Karch; Richard J Perrin; Martin Farlow; Jasmeer P Chhatwal; David M Holtzman; Carlos Cruchaga; Oscar Harari; Jonathan Kipnis
Journal: Nature Date: 2021-04-28 Impact factor: 69.504

7. Applying meta-analysis to genotype-tissue expression data from multiple tissues to identify eQTLs and increase the number of eGenes.

Authors: Dat Duong; Lisa Gai; Sagi Snir; Eun Yong Kang; Buhm Han; Jae Hoon Sul; Eleazar Eskin
Journal: Bioinformatics Date: 2017-07-15 Impact factor: 6.937