Literature DB >> 27489002

Gene and Network Analysis of Common Variants Reveals Novel Associations in Multiple Complex Diseases.

Priyanka Nakka1, Benjamin J Raphael2, Sohini Ramachandran3.   

Abstract

Genome-wide association (GWA) studies typically lack power to detect genotypes significantly associated with complex diseases, where different causal mutations of small effect may be present across cases. A common, tractable approach for identifying genomic elements associated with complex traits is to evaluate combinations of variants in known pathways or gene sets with shared biological function. Such gene-set analyses require the computation of gene-level P-values or gene scores; these gene scores are also useful when generating hypotheses for experimental validation. However, commonly used methods for generating GWA gene scores are computationally inefficient, biased by gene length, imprecise, or have low true positive rate (TPR) at low false positive rates (FPR), leading to erroneous hypotheses for functional validation. Here we introduce a new method, PEGASUS, for analytically calculating gene scores. PEGASUS produces gene scores with as much as 10 orders of magnitude higher numerical precision than competing methods. In simulation, PEGASUS outperforms existing methods, achieving up to 30% higher TPR when the FPR is fixed at 1%. We use gene scores from PEGASUS as input to HotNet2 to identify networks of interacting genes associated with multiple complex diseases and traits; this is the first application of HotNet2 to common variation. In ulcerative colitis and waist-hip ratio, we discover networks that include genes previously associated with these phenotypes, as well as novel candidate genes. In contrast, existing methods fail to identify these networks. We also identify networks for attention-deficit/hyperactivity disorder, in which GWA studies have yet to identify any significant SNPs.
Copyright © 2016 by the Genetics Society of America.

Entities:  

Keywords:  GWAS; common variants; complex diseases; pathway analysis; quantitative traits

Mesh:

Year:  2016        PMID: 27489002      PMCID: PMC5068862          DOI: 10.1534/genetics.116.188391

Source DB:  PubMed          Journal:  Genetics        ISSN: 0016-6731            Impact factor:   4.562


GENOME-WIDE association (GWA) studies and meta-analyses are widely used to identify susceptibility loci for complex diseases and traits, which are phenotypes generated by multiple mutations of moderate to small effect (Hirschhorn and Daly 2005; McCarthy ; Daly 2010; Jiang ; Evangelou and Ioannidis 2013; Nalls ; Skibola ; Woo ; Buch ; Kouri ; Litchfield ; Renton ; Hallberg ). To date, >2400 GWA studies have been conducted to find causal variants that are statistically associated with a disease or trait (http://www.ebi.ac.uk/gwas/). The GWA framework tests the hypothesis that individual mutations of large effect generate phenotypes of interest. However, this framework has multiple limitations when applied to complex diseases. First, complex diseases are known to exhibit genetic heterogeneity on multiple levels: (i) The disease may be generated by multiple mutations within an associated gene and (ii) mutations in distinct genes within a pathway may interact and produce the disease state (McClellan and King 2010). In both cases, separately testing individual variants for statistical associations with a phenotype may not identify susceptibility loci (McClellan and King 2010; Stranger ). Further, SNP-level GWA results are unlikely to reveal complex disease mechanisms, given that different combinations of functionally related variants in genes and pathways may interact to produce the phenotype of interest. Gene-set analyses, which test for the statistical association of phenotype state with a set of genes, are commonly used to address these limitations of the GWA framework (see Wang , Leiserson , and Mooney for reviews). To increase computational efficiency and limit the number of hypotheses tested, it is necessary to reduce the combinations of variants examined to a tractable number. This is typically done using databases of known pathways or other biological interactions, nearly all of which are annotated at the gene level. Thus, a crucial step in most gene-set analyses is combining SNP-level GWA P-values within genes into a “gene score” (Mooney ). Here we use “gene-set analysis” to describe three types of statistical tests for association at the gene level with a phenotype of interest. First, we describe permutation tests (e.g., DAPPLE and dmGWAS) (Jia ; Rossin ) where P-values are assigned to gene scores observed in an annotated pathway. Second, we describe tests for enrichment in related annotations among genes in a predetermined list (e.g., GRAIL, MAGENTA, DAVID, and GSEA-SNP) (Huang ; Holden ; Raychaudhuri ; Segrè ). To conduct these tests, the investigator must compute a gene score and, in some cases, determine a threshold for extreme gene scores to generate a list of genes associated with the phenotype of interest. In the third type of test, once a gene score is computed, the investigator can conduct a gene-level association test and/or a gene-network association test, to identify novel combinations of variants that generate a phenotype of interest, due to unknown interactions between genes or uncharacterized crosstalk between pathways. An informative gene score is a necessary ingredient for accurate gene-set analyses, but all commonly used methods for generating gene scores have substantial drawbacks. Commonly used methods include choosing the best SNP P-value within a gene to be the gene score, which is sometimes referred to as “minSNP” (Torkamani ; Fehringer ; Gelernter ; Hu ); permutation-based methods such as permSNP (Wang ; Eleftherohorinou ; Ballard ; Christoforou ; Evangelou ; Backes ); regression-based methods such as the sequence kernel association test (SKAT) family of tests (Wu , 2011) and stratified LD score regression (Finucane ); and VEGAS (Liu ) and RAREMETALS (Liu ), which correct for linkage disequilibrium (LD) between SNPs, using simulations from a multivariate normal distribution whose variance is the empirical LD observed among SNPs within each gene being analyzed. Multiple methods exist that use the same null distribution as VEGAS (Tzeng and Zhang 2007; Pan 2009). Other methods that have been proposed include Fisher’s combination test (where the gene score must be calculated empirically using permutation tests), Simes’ combination test, and Sidak’s combination test (Ballard ; Peng ; Wojcik ). The limitations of these approaches range from biased results to computational inefficiency to imprecision (Table 1). minSNP is heavily biased by gene length; the longer the gene is, the more likely it is to have a low gene score. permSNP permutes case–control labels within a genotype data set to calculate an empirical P-value for every gene, which becomes computationally intractable for large data sets (Liu ). Further, permSNP and SKAT require gaining access to genotype data to perform permutations and regression, respectively. The VEGAS method is more computationally efficient than other permutation methods (e.g., permSNP requires recomputing GWA P-values for each permuted data set) and requires only GWA SNP-level P-values as input, but both permSNP and VEGAS give gene scores whose numerical precision depends on the number of permutations and simulations, respectively, that are performed. The smallest gene score VEGAS reports by default is and for permSNP, it is the reciprocal of the number of permutations performed per gene (Table 1).
Table 1

Summary of minSNP, permSNP, SKAT, and VEGAS gene score methods and limitations

Gene score methodHow it worksLimitations
minSNPThe gene score for each gene is the smallest SNP P-value observed within that gene in a GWA study.Biased by gene length (longer genes have lower gene scores).
permSNPPermutes case–control labels within a genotype data set, recomputes GWA SNP P-values using a permuted data set, and calculates an empirical gene P-value based on the number of times the observed average SNP P-value is lower than the permuted P-valuesRequires access to genotype data; very computationally costly for genome-wide data sets; numerical precision of gene scores is bounded by the number of permutations performed.
SKATUses multiple linear/logistic mixed-model regression of covariates (such as principal components to control for population stratification) and genotypes for variants in a gene set onto disease state.Requires access to genotype data.
VEGASUses simulations from a multivariate normal distribution to correct for LD between SNPs. The variance of the distribution is the empirical LD observed among SNPs within each gene in the data set.Numerical precision of gene scores is bounded by the number of simulations performed; computationally inefficient due to simulations.
Here we propose a new method—the precise, efficient gene association score using SNPs (PEGASUS)—to calculate gene scores analytically from a null chi-square distribution that captures LD between SNPs in a gene and addresses the shortcomings of existing methods. PEGASUS requires only GWA study summary statistics and a suitable reference population for LD calculations as input and thus can be applied to GWA study meta-analyses performed on summary statistics, pooled DNA sequencing GWA studies, family-based GWA studies, transmission disequilibrium test (TDT) results, and also traditional GWA studies where consent guidelines prohibit release of genotype data. PEGASUS gene scores are correlated with statistics like VEGAS (Tzeng and Zhang 2007; Pan 2009; Liu ), which rely on the same null distributions to calculate gene scores. These methods use different approximations for the distribution of the sum of correlated chi-square statistics, in contrast to the more accurate numerical integration of the null distribution implemented in PEGASUS. We apply our method to publicly available GWA data sets for nine common diseases and three quantitative traits from the Psychiatric Genomics Consortium (PGC) (Neale ; Ripke , 2013; Sklar ), the International IBD Genetics Consortium (IIBDGC) (Franke ; Anderson ), the Genetic Investigation of Anthropometric Traits (GIANT) Consortium (Heid ; Lango Allen ; Speliotes ), the Broad Institute (Stahl ), the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) Consortium (Morris ), and Xu (Table 2). These data sets were chosen because the full set of SNP-level P-values from the GWA study were available for public download. We compare our method to gene scores generated by minSNP, permSNP, SKAT, and VEGAS, using real and simulated GWA data. Finally, we use our gene scores as input in pathway analysis with HotNet2 (Leiserson ), thereby conducting the first application of HotNet2 to common genetic variation and identifying gene networks harboring several variants associated with three phenotypes: attention-deficit/hyperactivity disorder, ulcerative colitis, and waist–hip ratio. For these three phenotypes of interest, HotNet2 using VEGAS gene scores recovered fewer significant subnetworks for attention-deficit/hyperactivity disorder, ulcerative colitis, and waist–hip ratio. Neither VEGAS nor PEGASUS yielded significant subnetworks for the other nine traits studied here.
Table 2

Total numbers of cases and controls and number of SNP loci in GWA studies for the 12 phenotypes studied here

Disease or trait (reference)No. casesNo. controlsNo. SNPs
Attention-deficit/hyperactivity disorder (ADHD) (Neale et al. 2010)864 + 2,064 trios2,4551,206,461
Acute lymphoblastic leukemia (ALL) (Xu et al. 2013)1,5936,661709,059
Bipolar disorder (BIP) (Sklar et al. 2011)7,4819,2502,427,220
Body mass index (BMI) (Speliotes et al. 2010)NA123,8652,471,516
Crohn’s disease (CD) (Franke et al. 2010)6,33315,056953,241
Height (Lango Allen et al. 2010)NA183,7272,469,635
Major depressive disorder (MDD) (Ripke et al. 2013)9,2409,5191,235,109
Rheumatoid arthritis (RA) (Stahl et al. 2010)5,53920,1692,556,271
Schizophrenia (SCZ) (Ripke et al. 2011)9,39412,4621,252,901
Type 2 diabetes (T2D) (Morris et al. 2012)12,17156,8622,473,441
Ulcerative colitis (UC) (Anderson et al. 2011)6,68719,7181,428,749
Waist–hip ratio adjusted for BMI (WHR) (Heid et al. 2010)NA77,1672,483,325

Materials and Methods

Data sets analyzed from genome-wide association studies

Methods for computing gene scores require a full list of GWA SNP-level P-values; these can be computed from genotype data obtained from previously published GWA studies. We were able to obtain complete results from previous GWA studies for nine common diseases and three quantitative traits (Table 2). See Supplemental Material, Table S3 for URLs to download GWA P-values from the studies referenced in Table 2.

Gene scores

We compared PEGASUS to four existing methods for generating gene-based scores from GWA SNP-level P-values: (i) minSNP (Torkamani ; Fehringer ; Gelernter ; Hu ), (ii) permSNP (Wang ; Eleftherohorinou ; Ballard ; Christoforou ; Evangelou ; Backes ), (iii) SKAT (Wu , 2011), and (iv) VEGAS (Liu ). For n markers in a given gene, these methods use different strategies, each detailed below and summarized in Figure 1, to combine P-values within the gene and calculate a gene-level P-value. We refer to this gene-level P-value as the gene score or
Figure 1

Schematic representations of PEGASUS and the three other methods—minSNP, permSNP, and VEGAS—assessed in this study. (A) minSNP defines the gene score to be the lowest of the SNP-level P-values within the gene observed in a GWA study. (B) permSNP (Ballard ) performs M permutations of case–control labels in genotype data, recomputes GWA P-values for each SNP for each permuted data set, averages SNP P-values within each gene, and computes an empirical gene P-value based on the number of times the observed gene P-value is lower than permuted P-values. (C) VEGAS performs multivariate normal simulations from a null distribution of statistics where the statistics are correlated by empirical LD calculated from genotype data. M simulations are performed, the null statistics are summed within each gene and the empirical gene P-value is the number of times the observed statistic is lower than the permuted statistic. (D) In PEGASUS, for each gene, we numerically integrate the distribution of the sum of correlated statistics at the observed gene statistic to determine the gene score. We also assess the performance of SKAT (Wu , 2011), which is not depicted here. SKAT uses a multiple linear/logistic regression framework, where genotypes for variants in a gene set and covariates are regressed onto phenotype to generate gene scores.

Schematic representations of PEGASUS and the three other methods—minSNP, permSNP, and VEGAS—assessed in this study. (A) minSNP defines the gene score to be the lowest of the SNP-level P-values within the gene observed in a GWA study. (B) permSNP (Ballard ) performs M permutations of case–control labels in genotype data, recomputes GWA P-values for each SNP for each permuted data set, averages SNP P-values within each gene, and computes an empirical gene P-value based on the number of times the observed gene P-value is lower than permuted P-values. (C) VEGAS performs multivariate normal simulations from a null distribution of statistics where the statistics are correlated by empirical LD calculated from genotype data. M simulations are performed, the null statistics are summed within each gene and the empirical gene P-value is the number of times the observed statistic is lower than the permuted statistic. (D) In PEGASUS, for each gene, we numerically integrate the distribution of the sum of correlated statistics at the observed gene statistic to determine the gene score. We also assess the performance of SKAT (Wu , 2011), which is not depicted here. SKAT uses a multiple linear/logistic regression framework, where genotypes for variants in a gene set and covariates are regressed onto phenotype to generate gene scores.

minSNP:

The minSNP method (Torkamani ; Fehringer ; Gelernter ; Hu ) for generating gene scores assigns the smallest GWA SNP-level P-value in a given gene to be the gene score (Equation 1):

permSNP:

The permSNP method (Wang ; Eleftherohorinou ; Ballard ; Christo-Forou ; Evangelou ; Backes ) produces gene scores by permuting phenotype labels across all genotyped individuals to generate an empirical P-value for every gene. We carried out permSNP only on the acute lymphoblastic leukemia (ALL) data set (Xu ), as genotype data are required for this method, and we did not have genotype data for the other traits analyzed here. We calculated permSNP gene scores only for the top 400 most significant genes determined by minSNP using set-based test analysis in PLINK due to computational constraints (Purcell ) (see File S1, Algorithm S1 for more details.) The following settings were used to calculate permSNP gene scores in PLINK (Purcell ): --set-r2 1, --set-p 1, --set-max 99999, --maf 0.01, and --mperm 10,000 permutations of case–control labels. With these command flags, PLINK first does an association test between phenotype state and allele dosage at each SNP. Second, for every gene, the SNP test statistics () within the gene are averaged to calculate the observed gene-level test statistic (File S1, Algorithm S1). Third, the phenotype labels are permuted M times and the previous two steps are repeated for the permuted data each time, resulting in SNP statistics using the permuted phenotype data and corresponding gene statistics The gene score is then the fraction of times the gene statistic is greater than the observed statistic over the M permutations (File S1, Algorithm S1).

SKAT (Wu , 2011):

This method uses multiple linear/logistic mixed-model regression of covariates and genotypes for variants in a gene set, along with covariates, onto disease state. Covariates can include sex, age, or top principal components of genotype data to control for population stratification. Under the multiple logistic regression model for a continuous phenotype, the relationship between variant genotypes and the phenotype for the ith individual (of p total individuals) is given by Equation 2, where is an intercept term, is a vector of covariates, is the vector of regression coefficients for m covariates, is the vector of regression coefficients for the n SNPs in a gene, and is an error term that is normally distributed with mean of zero and variance Given this model, SKAT tests the null hypothesis Assuming that each for the jth variant follows some distribution with mean 0 and variance where τ is a variance component and is a weight for variant j, the null hypothesis can be restated as The variance component score statistic for this test is given by Equation 3, where is a matrix with the weighted genetic similarity between two subjects i and in the region with n markers, and Wu suggest setting the weights the beta distribution density function with parameters and evaluated at the sample minor-allele frequency (MAF) for a given variant j. The SKAT test statistic follows a mixture of chi-square distributions that can be evaluated using numerical integration to obtain a P-value for the gene (Wu , 2011):Because full genotype data are required for this method, we applied SKAT only on the ALL data set (Xu ) and the Wellcome Trust Case Control Consortium (WTCCC) type 2 diabetes data set (WTCCC 2007). We used the top four principal components from principal components analysis (PCA) on these data sets as covariates in the regression and hold these covariates constant across all methods tested in this study (Wu , 2011; Peloso ). The following settings were used to calculate SKAT scores in R, using the R package SKAT (Lee et al. 2015): (a) in the R function SKAT_Null_Model, out_type “D” and Adjustment “F” and (b) in the R function SKATBinary.SSD.All, method “SKAT”. These settings specify a linear weighted kernel with the weights

VEGAS (Liu ):

Consider a gene with n SNPs. Under the null hypothesis of no association, a gene can be represented by an n-element multivariate normal vector with mean 0 and variance the pairwise LD matrix. Given this model, VEGAS generates gene scores by (i) performing multivariate normal simulations from the null distribution of LD-correlated SNPs, (ii) squaring the simulated values and summing to get a null test statistic for each gene, and (iii) calculating an empirical gene-level P-value based on the proportion of times the observed test statistic is smaller than the simulated null statistics across all simulations (Liu ).

PEGASUS:

The main innovation in PEGASUS is using an analytical approach to compute gene-level P-values of observed gene scores according to a null distribution modeling LD (Figure 1D). Consider a gene (defined as the gene boundaries ±50 kb to include regulatory regions; the buffer of 50 kb can be varied) with n SNPs. Suppose the P-values for SNPs within the gene boundaries are Let where is the inverse of the cumulative distribution function (CDF) At the gene level, we are interested in the observed value q, defined as the sum of the correlated variables within a gene (Equation 4):Our model for q is as follows: Let be an n-element multivariate normal vector with mean and positive definite covariance matrix where is the LD between SNP i and SNP j and The quadratic form in the random variables associated with an symmetric matrix is defined as (Mathai and Provost 1992). The quadratic form has the following representation (Equation 5), where are the eigenvalues of and are mutually independent standard normal variables (Mathai and Provost 1992): follows the same distribution as Equation 5, and so the characteristic function of can be inverted to find the CDF of the null distribution accounting for empirical LD, which can be numerically integrated at the observed value (q) to find the gene-level P-value (), (Mathai and Provost 1992). The numerical integration is implemented in the R package CompQuadForm (Duchesne and Lafaye De Micheaux 2010). The LD (covariance) matrix is calculated using the --r flag (correlation) in PLINK (Purcell ). In contrast, VEGAS (Liu ) draws samples from the multivariate normal distribution with variance equal to the LD matrix, which are then summed to obtain an approximation of the P-value. Software to run PEGASUS is available at https://github.com/ramachandran-lab/PEGASUS. Empirical LD can be calculated using the 1000 Genomes Phase 3 data set (Auton ) (release date: November 2014) as references. These data contain 2426 individuals in five superpopulations: East Asians, Europeans, Africans, South Asians, and admixed Americans.

Connection between SKAT and PEGASUS tests:

As shown in Text S1, the SKAT and PEGASUS null distributions are mixtures of chi-square distributions. Mixture proportions for the SKAT null distribution are the eigenvalues of the matrix where is a projection matrix dependent on the covariate matrix and is a kernel matrix dependent on the genotype matrix and a diagonal matrix of weights For the PEGASUS null distribution, mixture proportions are given by the eigenvalues of the LD matrix If no covariates are considered and the variant weights are uniform () for all variants, the SKAT null distribution becomes a mixture of chi-square distributions with mixture proportions given by the eigenvalues of the matrix, which is a variance–covariance matrix similar to the PEGASUS LD matrix Thus, under these circumstances, the two tests give similar results. However, PEGASUS requires only summary statistics and is a better choice when genotype data are not available for analysis.

GWA study replication

To further assess the robustness of our method, PEGASUS, we attempted to replicate gene hits ( or ) generated by PEGASUS for four data sets [bipolar disorder (BIP), Crohn’s disease (CD), rheumatoid arthritis (RA), and type 2 diabetes (T2D)], using genotype data from the WTCCC (WTCCC 2007). For the replication study, we carried out PEGASUS on these four WTCCC data sets (our “replication cohort”) and compared the top genes found in our “discovery” data sets (Franke ; Stahl ; Sklar ; Morris ) to those found in the WTCCC data sets. We note that this is not an independent replication study since the WTCCC data sets were included in the discovery cohorts; cases from the WTCCC data sets comprised at most 38% of the cases included in the discovery cohorts (Table S2). The replication data sets consist of 2000 cases for each disease and 3000 shared controls recruited from the United Kingdom and genotyped on the Affymetrix 500K GeneChip (WTCCC 2007). Eight quality control steps were carried out for each of the four WTCCC data sets. Steps 1–7 were carried out using PLINK (Purcell ) (version 1.07): Markers with minor allele frequency <1% were removed. Loci with a call rate ≤95% across individuals were removed. Individuals with at least 5% missingness across all loci were removed. Loci not in Hardy–Weinberg equilibrium were removed (P-value threshold of ). Individuals were pruned based on inbreeding coefficient (). Duplicate individuals were removed (one individual for each pair with ). Related individuals were removed (one individual for each pair with ). Individuals determined to be outliers by principal component analysis were removed. SmartPCA from the EIGENSOFT (Price ) software package (version 4.0.2) was used to do PCA with outlier removal. Five iterations of outlier removal were performed with the outlier σ threshold = 6. We conducted GWA analysis using PLINK (Purcell ) (version 1.07) on the WTCCC data sets. SNP-level P-values were determined by logistic regression of disease state onto minor allele dosage, using the top four principal components as covariates in the logistic regression to control for ancestry.

GWA study simulation

To compare how well minSNP, SKAT, VEGAS, and PEGASUS can recover causal genes, we conducted a GWA study for a simulated complex phenotype with known genetic architecture based on the approach outlined in Wojcik and applied these four methods to the simulated data (Figure S13). To choose causal genes, we picked four pathways with >20 genes each at random from the KEGG pathway database (Kanehisa 1997; Kanehisa ). For each pathway, we randomly sampled 20% of its genes, resulting in 54 total causal genes. We ran Tagger (Haploview, Version 4.3) (Barrett ) on each gene to find independent tag SNPs ( 0.2), using the WTCCC controls (N = 2900 individuals) as reference individuals to calculate LD. For each of the 54 causal genes, we chose 1, 2, or 5 tag SNPs to be associated with the phenotype, giving 123 total causal SNPs. All the chosen SNPs in each gene were randomly assigned an effect size of either 1.2 or 2 to simulate a range of effect sizes. Using software from Wojcik , we then calculated a per-individual liability score for each individual (WTCCC controls served as our simulated cases and controls) from a model of additive genetic effects by summing the effect size s of each SNP multiplied by the minor allele dosage X at the SNP over all n SNPs (Equation 6). A “wiggle” (ε) was added to each raw liability score (Equation 7) to allow the cases and controls to overlap in their liability score distributions:Phenotype was assigned to each individual based on the mean of 100 deviates from the binomial distribution with probability of success equal to the probability of the wiggled score from the logistic distribution, which we obtained by applying the logistic function to the wiggled score. We then conducted GWA analysis using PLINK (Purcell ) (version 1.07) on the WTCCC controls and the simulated phenotypes. SNP-level P-values were determined by logistic regression of minor allele dosage onto disease state. We used the top four principal components, determined by applying smartPCA (Price ) to the genotypes, as covariates in the logistic regression to control for ancestry. To simulate spurious associations between SNPs and our associated phenotype, we added 20% of significant SNP P-values (144 new SNPs total) from an existing GWA study on CD (Franke ) to our simulated GWA P-values; these spuriously associated SNPs did not overlap with SNPs already associated with simulated phenotype. By “spuriously associated” SNPs, we mean SNPs that achieve genome-wide significance (P-value ) but are not discussed or selected for replication studies, eQTL analysis, or other downstream analyses due to filtering steps. Such SNPs may be excluded based on criteria such as failure to achieve significance within a majority of the individual cohorts analyzed in a meta-analysis (Anderson ), location within regions with complex LD or complex association patterns with the trait such as the MHC or TNFAIP3 regions for RA (Stahl ), or P-value thresholds based on additional in silico analyses such as GRAIL (Raychaudhuri ; Franke ). Since the true causal genes underlying the simulated phenotype are known, we are able to measure true positive rate (TPR) and false positive rate (FPR) for each gene score method and used the following gene score thresholds: We find that our simulation results are robust to varying percentages of spuriously associated SNPs added and the GWA data set used (Figure S14).

Pathway analysis

We performed pathway analysis with HotNet2 (Leiserson ), a topology-based method for finding significantly mutated subnetworks within protein–protein interaction networks, originally developed for analyzing somatic mutation data from cancer data sets. HotNet2 uses directed heat diffusion along interaction networks where every gene, represented by nodes in the network graph, has a “heat score” based on its gene score. We used negative log-transformed gene scores generated by PEGASUS, VEGAS, and minSNP as heat scores in HotNet2. We use HotNet2 to find gene interaction subnetworks containing genes that we have highest confidence are truly associated with the phenotype. We found that HotNet2 does not perform well when too many genes are assigned similar heat scores, as will happen when the majority of genes have insignificant P-values. Thus, following the approach used in earlier applications of HotNet2 to somatic mutations in cancer, we assigned heat scores of zero to genes that we have low confidence are truly associated with the phenotype (Leiserson ). To identify low-confidence genes, we compute a hard threshold based on local false discovery rates (lFDR) for the gene P-values. In a disease association setting, lFDR is the probability that a gene is not associated with the phenotype given its corresponding observed P-value; thus, can be thought of as “confidence.” When plotting confidence against gene scores (Figure S5 and Figure S6), an “elbow” or inflection point typically corresponds to a sharp drop in confidence; therefore, the inflection point is a natural choice for a gene score threshold. We compute the lFDRs for the gene scores using the twilight R package (Scheid and Spang 2005) (version 1.44.0). We calculate lFDR for minSNP, VEGAS, and PEGASUS gene scores and then determine a cutoff at the first elbow or inflection point in the graph of against gene scores where possible (Figure S5 and Figure S6). If the graph has no elbow point, as in the minSNP lFDR curves, we used the gene score corresponding to an lFDR cutoff of 0.05 (Figure S4). Since the cutoffs for the minSNP scores were greater than those calculated using PEGASUS and VEGAS gene scores, we ran HotNet2 twice: once using the PEGASUS lFDR threshold and once with the higher minSNP lFDR threshold. We assessed significance of HotNet2 results for each run by permuting the heat scores on genes to find a P-value for the number of subnetworks containing ≥k genes, as reported by HotNet2 (Leiserson ). HotNet2 analysis was performed using the HINT interaction network (Das and Yu 2012). Runs that had multiple P-values of varying size k in the permutation test were further studied, for example, by annotation using the Genome-wide Repository of Associations between SNPs and Phenotypes (GRASP) GWA study catalog (Leslie ) to determine significance of the genes in previous GWA studies, along with functional annotations and literature searches.

Data availability

GWA P-values analyzed here (Table 2) can be accessed at the URLs given in Table S3. WTCCC data (WTCCC 2007) used in power simulations and replication studies can be accessed through the Wellcome Trust Case Control Consortium: http://www.wtccc.org.uk/. PEGASUS software and LD reference data are available online at the following URL: https://github.com/ramachandran-lab/PEGASUS. HotNet2 (Leiserson ) software is available online at the following URL: https://github.com/raphael-group/hotnet2.

Results

Performance comparison of PEGASUS against minSNP, permSNP, SKAT, and VEGAS

We compared PEGASUS against minSNP (Torkamani ; Fehringer ; Gelernter ; Hu ), permSNP (Wang ; Eleftherohorinou ; Ballard ; Christoforou ; Evangelou ; Backes ), SKAT (Wu , 2011), and VEGAS (Liu ) (Figure 2 and Figure S15), using several metrics to evaluate the different scores.
Figure 2

Quantile–quantile plots comparing gene scores produced by PEGASUS against those produced by minSNP, permSNP, and VEGAS. (A) Quantile–quantile plots of PEGASUS gene scores vs. minSNP gene scores. Each point represents a gene and is colored yellow, red, or blue based on gene length percentile, 0–25%, 25–75%, and 75–100%, respectively. The phenotype used is waist–hip ratio adjusted for body mass index (WHR). minSNP gene scores are smaller than PEGASUS gene scores and decrease with increasing number of SNPs in a gene. The deviations from show that minSNP scores are biased toward being smaller than PEGASUS scores, and this bias increases for increasing gene length (genes colored in blue and red). (B) Base-10 logarithm of PEGASUS gene scores vs. base-10 logarithm of permSNP gene scores for acute lymphoblastic leukemia (ALL). permSNP can determine gene scores only as low as the reciprocal of the number of permutations (10,000 in this case) whereas PEGASUS can determine gene scores as low as (the numerical precision of R). Note that the minimum permSNP scores of differ widely from their P-values computed by PEGASUS. (C) Base-10 logarithm of PEGASUS gene scores vs. base-10 logarithm VEGAS gene scores. Using 1 million simulations, the lowest gene scores output by VEGAS are while PEGASUS determines gene scores as low as In addition, for gene scores close to the reciprocal of the maximum number of simulations performed, VEGAS can return inaccurate gene scores compared to PEGASUS.

Quantile–quantile plots comparing gene scores produced by PEGASUS against those produced by minSNP, permSNP, and VEGAS. (A) Quantile–quantile plots of PEGASUS gene scores vs. minSNP gene scores. Each point represents a gene and is colored yellow, red, or blue based on gene length percentile, 0–25%, 25–75%, and 75–100%, respectively. The phenotype used is waist–hip ratio adjusted for body mass index (WHR). minSNP gene scores are smaller than PEGASUS gene scores and decrease with increasing number of SNPs in a gene. The deviations from show that minSNP scores are biased toward being smaller than PEGASUS scores, and this bias increases for increasing gene length (genes colored in blue and red). (B) Base-10 logarithm of PEGASUS gene scores vs. base-10 logarithm of permSNP gene scores for acute lymphoblastic leukemia (ALL). permSNP can determine gene scores only as low as the reciprocal of the number of permutations (10,000 in this case) whereas PEGASUS can determine gene scores as low as (the numerical precision of R). Note that the minimum permSNP scores of differ widely from their P-values computed by PEGASUS. (C) Base-10 logarithm of PEGASUS gene scores vs. base-10 logarithm VEGAS gene scores. Using 1 million simulations, the lowest gene scores output by VEGAS are while PEGASUS determines gene scores as low as In addition, for gene scores close to the reciprocal of the maximum number of simulations performed, VEGAS can return inaccurate gene scores compared to PEGASUS. We find that, for all 12 GWA data sets analyzed, minSNP gene scores are almost always smaller than PEGASUS gene scores (Figure 2A and Figure S1). We also find that minSNP gene scores show a clear dependence on gene length; as the number of SNPs in a gene increases, the minSNP gene score decreases for all data sets analyzed. In contrast, PEGASUS gene scores do not show this trend (Figure 2A, Figure S2, and Figure S16). We tested whether two corrections to minSNP gene scores mitigated its bias with gene length: (i) calculating P-values for minSNP gene scores from the Beta(1, no. of SNPs in gene) distribution (note that minSNP can be thought of as the first-order statistic for SNP P-values) and (ii) multiplying minSNP gene scores by the number of SNPs in a gene. Both corrections resulted in gene scores that decrease with increasing gene length (Figure S7 and Figure S8). Since permSNP requires genotype data and is computationally expensive, permSNP was carried out only on the ALL data set (Xu ), using 10,000 permutations; thus, there is large variation in PEGASUS gene scores at a permSNP gene score of (Figure 2B). Further, permSNP is extremely computationally costly: Carrying out permSNP on a random subset of 400 genes took ∼6 hr. Thus, permSNP would be extremely computationally inefficient for analyzing a genome-wide human data set (∼18,000–20,000 genes). Since SKAT requires genotype data, SKAT was carried out on the ALL (Xu ) and the WTCCC T2D (WTCCC 2007) data sets, using the top four principal components from PCA on these data sets as covariates. We find that PEGASUS gene scores and SKAT gene scores are correlated ( and P-values and for ALL and T2D, respectively) for both data sets (Figure S15, A and B). We also find that PEGASUS gene scores and unweighted SKAT gene scores are correlated ( and P-values and for ALL and T2D, respectively) (Figure S15, C and D). Compared to VEGAS, our method has increased numerical precision when calculating gene scores (Figure 2C). Due to its underlying Monte Carlo simulations ( by default), VEGAS does not calculate gene scores less than the reciprocal of the number of simulations. However, PEGASUS can evaluate gene scores to the machine precision of R, which is In addition, VEGAS gene scores become inaccurate close to due to the random nature of Monte Carlo simulations (Figure 2C), whereas PEGASUS does not have a stochastic element. We find that VEGAS produces less numerically precise gene scores than PEGASUS in all 12 data sets analyzed (Figure S3). We also find that PEGASUS runs twice as fast as VEGAS when using HapMap data (Frazer ) (Phase 2) as references for LD.

Enrichment for known associations in real data

To assess how well minSNP and PEGASUS recover known GWA associations, we calculated the percentage of genes with significant minSNP and PEGASUS gene scores () for the 12 phenotypes in this analysis that have been found to be significantly associated (SNP-level ) with the disease or trait in GWA studies conducted with different genotype data (Figure 3). Taking known associations to be “true positives,” we calculated positive predictive values (PPV) for every gene score for every disease. We find that in 10 of 12 data sets, significantly associated PEGASUS gene hits () have higher PPV than minSNP gene hits by as much as 2.8-fold, showing that PEGASUS gene hits are enriched for known associations in comparison to minSNP (Figure 3). For the remaining two disorders, attention-deficit/hyperactivity disorder (ADHD) and major depressive disorder (MDD), minSNP identifies significantly associated genes () that have not been found in other GWA studies while PEGASUS does not report any findings (Figure 3).
Figure 3

PEGASUS gene hits are enriched for known GWA study associations compared to minSNP gene hits. Shown are the numbers of minSNP gene hits (blue) and PEGASUS gene hits (orange) that contain known GWA study associations and gene hits not previously found in GWA studies (gray) for 12 GWA study data sets. To the right of each bar are positive predictive values (PPV) for each gene score for every data set; where possible, boldface type indicates the gene score with the highest PPV for each disease and “NA” means that PPV is undefined, which occurs when there are zero gene hits. A gene hit is a gene with a score of and known GWA study associations are genes containing genome-wide significant SNPs in GWA studies conducted with different data sets from the 12 data sets analyzed here. We find that PEGASUS gene hits have as much as 2.8-fold higher PPV than minSNP gene hits.

PEGASUS gene hits are enriched for known GWA study associations compared to minSNP gene hits. Shown are the numbers of minSNP gene hits (blue) and PEGASUS gene hits (orange) that contain known GWA study associations and gene hits not previously found in GWA studies (gray) for 12 GWA study data sets. To the right of each bar are positive predictive values (PPV) for each gene score for every data set; where possible, boldface type indicates the gene score with the highest PPV for each disease and “NA” means that PPV is undefined, which occurs when there are zero gene hits. A gene hit is a gene with a score of and known GWA study associations are genes containing genome-wide significant SNPs in GWA studies conducted with different data sets from the 12 data sets analyzed here. We find that PEGASUS gene hits have as much as 2.8-fold higher PPV than minSNP gene hits.

Replication of PEGASUS gene hits in WTCCC data

We attempted to replicate gene hits () generated by PEGASUS for four data sets (BIP, CD, RA, and T2D) for which we have genotype data from the WTCCC (WTCCC 2007). We note that our replication cohorts (WTCCC) were included in the discovery cohorts, and thus this is not an independent replication. We find that we are able to replicate up to 57.2% of gene hits in the case of RA and as low as 0 gene hits in the BIP data set (Table S2). We note that the four meta-analyses we consider our “discovery cohorts” are composed of much larger sample sizes than our replication data sets and thus had greater power to identify associated variants. Further, we do not necessarily follow the same steps for quality control and ancestry correction in our GWA study as did the meta-analyses reanalyzed here (see Materials and Methods), which may explain our low percentage of replicated significant genes for BIP and CD.

Power analysis using simulated GWA data

To compare how well minSNP, SKAT, VEGAS, and PEGASUS can recover causal genes, we conducted gene-level tests for association, using these methods on a GWA study for a simulated phenotype calculated from a model of additive genetic effects. PEGASUS and VEGAS outperform minSNP with a 30% TPR when the FPR is fixed at 1%, and PEGASUS and VEGAS outperform SKAT with 28% higher TPR when FPR is fixed at 1% (Figure 4). We find that minSNP, when applied to the GWA data sets in Table 2, outputs high numbers of significant genes (genes with ). For example, as many as 5.5% of all genes (lower bound: 0.01%) are below the Bonferroni-corrected threshold for significance when using minSNP across the 12 data sets we analyzed (Table S1). In our simulation, we find that the high FPR in minSNP is due to genes with spurious SNP association P-values added in as part of the simulation. This suggests that some minSNP gene hits in observed GWA studies are spurious associations as well. By correcting for LD, PEGASUS and VEGAS are able to ignore these false positives (Figure 4). At very low FPR (), SKAT is the most sensitive method, but SKAT has lower sensitivity overall. This could be a desirable feature if one is looking for a small number of reliable gene hits. However, for pathway/network analysis, it is useful to obtain high sensitivity, as the pathway/network information will help reduce the remaining false positives. We were unable to add in spurious SNP association P-values for the SKAT method since this test requires genotype data; however, inclusion of these SNPs could only decrease the performance of SKAT in simulation.
Figure 4

Receiver operating characteristic (ROC) curves from GWA conducted with simulated phenotypes show gene scores controlling for LD achieve higher true positive rates at low false positive rates. We performed a GWA study for a simulated phenotype with known underlying true causal genes (see Materials and Methods) and determined true positive rate (TPR; genes truly associated with phenotype that were identified as such) and false positive rate (FPR; genes identified as causal by a gene score method that were not truly associated with the simulated phenotype) for minSNP, VEGAS, SKAT, and PEGASUS for various gene score thresholds (see Materials and Methods). We find that PEGASUS and VEGAS, which control for pairwise correlations between SNPs within genes, outperform minSNP and SKAT with higher TPRs at very low FPRs.

Receiver operating characteristic (ROC) curves from GWA conducted with simulated phenotypes show gene scores controlling for LD achieve higher true positive rates at low false positive rates. We performed a GWA study for a simulated phenotype with known underlying true causal genes (see Materials and Methods) and determined true positive rate (TPR; genes truly associated with phenotype that were identified as such) and false positive rate (FPR; genes identified as causal by a gene score method that were not truly associated with the simulated phenotype) for minSNP, VEGAS, SKAT, and PEGASUS for various gene score thresholds (see Materials and Methods). We find that PEGASUS and VEGAS, which control for pairwise correlations between SNPs within genes, outperform minSNP and SKAT with higher TPRs at very low FPRs.

The downstream effect of gene scores on pathway analysis

Using gene scores generated from minSNP, VEGAS, and PEGASUS as input to HotNet2 (Leiserson ), we performed pathway analysis on each of the 12 GWA data sets. We find significantly associated gene subnetworks for ADHD, ulcerative colitis (UC), and waist–hip ratio adjusted for body mass index (WHR). Figure 5 shows a selection of subnetworks containing known or biologically plausible gene associations for these three phenotypes based on previous GWA or functional studies. Other significantly associated subnetworks can be found in Figure S9 and Figure S12.
Figure 5

Subnetworks for ulcerative colitis (A–C), attention-deficit/hyperactivity disorder (D–F), and waist–hip ratio adjusted for body mass index (G and H) from significant runs of HotNet2 (Leiserson ) ( for multiple subnetwork sizes), using PEGASUS gene scores as input. Circles represent genes in each subnetwork and are colored by heat score (negative log-transformed PEGASUS gene scores); the color bar indicates the lowest heat score (blue or “cold” genes) and the highest heat score (red or “hot” genes) in each subnetwork for a given phenotype. Lines between genes indicate a direct gene–gene interaction from the HINT database (Das and Yu 2012). Gene names that are underlined, in boldface type, and italicized represent genes that have been previously associated with the ulcerative colitis (Duerr ; Raelson ; WTCCC 2007; Barrett ; Silverberg ; Franke ; McGovern ; Anderson ; Festen ; Ellinghaus ; Jostins ; Scharl ; Parkes ), attention-deficit/hyperactivity disorder (Wan ; Maden 2007; Davis ; Naka ; McGrath ; Need ; Chen ; Cirulli , 2012; Fuentealba ; Neale ; Hu ; Luciano ; Tang ; De Jager ; Rivière ; Schuurs-Hoeijmakers ; Peixoto and Abel 2013; Rietveld ), and waist–hip ratio (Cantile ; Eguchi , 2011; Guth ; Hagberg ; Heid ; Siervo ; Tchkonia ; Karpe and Pinnick 2015) phenotypes in GWA or functional studies.

Subnetworks for ulcerative colitis (A–C), attention-deficit/hyperactivity disorder (D–F), and waist–hip ratio adjusted for body mass index (G and H) from significant runs of HotNet2 (Leiserson ) ( for multiple subnetwork sizes), using PEGASUS gene scores as input. Circles represent genes in each subnetwork and are colored by heat score (negative log-transformed PEGASUS gene scores); the color bar indicates the lowest heat score (blue or “cold” genes) and the highest heat score (red or “hot” genes) in each subnetwork for a given phenotype. Lines between genes indicate a direct gene–gene interaction from the HINT database (Das and Yu 2012). Gene names that are underlined, in boldface type, and italicized represent genes that have been previously associated with the ulcerative colitis (Duerr ; Raelson ; WTCCC 2007; Barrett ; Silverberg ; Franke ; McGovern ; Anderson ; Festen ; Ellinghaus ; Jostins ; Scharl ; Parkes ), attention-deficit/hyperactivity disorder (Wan ; Maden 2007; Davis ; Naka ; McGrath ; Need ; Chen ; Cirulli , 2012; Fuentealba ; Neale ; Hu ; Luciano ; Tang ; De Jager ; Rivière ; Schuurs-Hoeijmakers ; Peixoto and Abel 2013; Rietveld ), and waist–hip ratio (Cantile ; Eguchi , 2011; Guth ; Hagberg ; Heid ; Siervo ; Tchkonia ; Karpe and Pinnick 2015) phenotypes in GWA or functional studies. PEGASUS identifies multiple subnetworks containing genes with known associations to each phenotype of interest. Some of these subnetworks are not found when using minSNP and VEGAS gene scores as input to HotNet2 (Figure S10 and Figure S11). We also identify subnetworks associated with ADHD (Figure 5, D–F), a disorder for which GWA studies have not identified any SNPs with genome-wide significance. We detail our findings for each disease or trait below.

UC:

Inflammatory bowel disease (IBD), an inflammatory disease of the gastrointestinal tract, has two major subtypes: UC and CD. IBD is hypothesized to result from dysregulated T-cell immune responses to commensal enteric bacteria in the gut that develop in individuals who are genetically predisposed to the disease. Environmental factors also play an important role in triggering onset or recurrence of symptoms (Sartor 2006; Lee ). UC is characterized by superficial, ulcerating inflammation that is limited to the colon (Christophi ). HotNet2 analysis using PEGASUS scores identifies a subnetwork containing several genes in JAK2-STAT signaling pathways as associated with UC disease state (Figure 5A). The subnetwork in Figure 5A contains the genes JAK2, IL12RB2, IFNG, PTPN2, and STAT4, which all have genome-wide significant SNP hits (SNP P-value ) in GWA studies for IBD (Duerr ; Jostins ). The gene IFNG also has genome-wide significant SNPs in GWA studies for ulcerative colitis conducted with different data sets from the data set used in this study and (Silverberg ; McGovern ). The following genes shown in this subnetwork have also been significantly associated (SNP P-value ) with CD (CD alone or in concert with psoriasis or celiac disease) in other GWA studies: JAK2, IL12RB2, SOCS1, and PTPN2 (Raelson ; WTCCC 2007; Barrett ; Franke ; Festen ; Ellinghaus ). Two closely related pro-inflammatory cytokine signaling pathways involve many genes shown in this subnetwork: the interleukin (IL)-23/type 17 helper T-cell () signaling pathway and the IL-12/type 1 helper T-cell () signaling pathway. Both signaling pathways ultimately result in cytokine-mediated gut destruction (Wang ; Parkes ). Similar pathways centered around IL-12, IL-23, and JAK2-STAT signaling have been manually compiled based on GWA studies for CD and IBD (Wang ; Parkes ). In addition, the subnetwork (Figure 5A) contains the gene PTPN2, which encodes protein tyrosine phosphatase nonreceptor type 2 (PTPN2) and has been shown to regulate autophagy in human intestinal epithelial cells; knockdown of PTPN2 caused impaired autophagosome formation and dysfunctional autophagy that eventually resulted in increased apoptosis of intestinal cells in response to IFNG and tumor necrosis factor-α (Scharl ). This subnetwork illustrates interactions between multiple genes involved in the immune response to pathogens that may underlie the pathology of ulcerative colitis. The subnetwork we identify in Figure 5B shows interactions between the human leukocyte antigen (HLA) class I genes and transporter associated with antigen processing (TAP) genes. These genes are thought to underlie IBD pathology as well as other immune-mediated disorders such as psoriasis and ankylosing spondylitis (Parkes ). Finally, HotNet2 reports a significantly associated subnetwork containing genes that are part of the tumor necrosis factor (TNF) signaling pathway (Figure 5C). TNF signaling results in activation of nuclear factor kappa-light-chain enhancer of activated B cells (NF-κB), which is a known inflammatory response in IBD (Anderson ). Additional significant gene subnetworks are shown in Figure S12. We emphasize that neither the JAK2-STAT subnetwork (Figure 5A) nor the HLA class 1 and TAP genes (Figure 5B) are found when using VEGAS gene scores as input to HotNet2 (Figure S11), demonstrating the importance of high-precision gene scores in downstream analysis.

ADHD:

ADHD is a highly heritable neuropsychiatric disorder characterized by the following traits: inattention, hyperactivity, and impulsivity (Franke ). It is thought to be a very complex multifactorial trait and, despite high heritability estimates (76%) (Neale ), it has been difficult to find genes underlying the phenotype. GWA studies with very large sample sizes have been performed, yet no variants have reached genome-wide significance (Franke ). Using PEGASUS gene scores as input to HotNet2, we find five significantly associated gene subnetworks associated with ADHD (Figure 5 and Figure S9). Figure 5D includes interactions between multiple genes previously associated with cognition-related traits and neurodevelopmental disorders. RPS6KA5, PDZD2, and RAI1 are associated with years of education (GWA SNP P-values of and respectively) (Rietveld ). The genes RPS6KA5, PDZD2, and ST3GAL3 have also been associated () with performance on multiple tests of cognitive function and memory in previous GWA studies (Need ; Cirulli , 2012; Luciano ; De Jager ). In particular, RPS6KA5, which encodes ribosomal protein S6 kinase alpha-5, is part of a pathway involved in brain-derived neurotrophic factor (BDNF)/neurotrophin signaling; BDNF and other neurotrophins are important in neural development, learning, and memory and are implicated in neurodegenerative diseases such as Huntington’s, Alzheimer’s and Parkinson’s (Tang ). RPS6KA5 is also thought to be a part of the upstream pathway involved in learning-dependent chromatin remodeling, which is important in long-term memory formation (Peixoto and Abel 2013). In addition, the gene PDLIM1 has been found to have strong maternal transmission in trios where the child is affected with ADHD, and PDLIM1 has been shown to play a role in Alzheimer’s disease (Wang ). A linkage study for intellectual disability found mutations in the ST3GAL3 gene (Hu ), and mutations in the ACTG1 gene cause Baraitser–Winter syndrome, a developmental disorder affecting the face and brain (Rivière ). Taken together, multiple studies suggest that genes in this subnetwork we identified using PEGASUS and HotNet2 are involved in regulatory processes that affect neural development, learning, and memory. The second subnetwork (Figure 5E) contains FURIN and other genes that play important roles in cell trafficking processes that may be involved in the pathology of neuropsychiatric and cognitive disorders such as Alzheimer’s disease and intellectual disability (Wan ; Fuentealba ; Schuurs-Hoeijmakers ; Carlino ). A third subnetwork (Figure 5F) contains genes that are likely involved in brain development (Maden 2007). Two additional subnetworks found by HotNet2 using PEGASUS scores contain genes that play regulatory roles in neurogenesis and responses to stress (Figure S9A) and genes associated with other social and behavioral abnormalities (Figure S9B). Additional information about genes in these subnetworks can be found in Text S2.

WHR adjusted for body mass index:

WHR adjusted for body mass index (BMI) is a quantitative trait that measures body fat distribution. Both WHR and BMI are heritable traits (25–70% heritability), but mechanisms underlying body fat distribution are still unclear (Baker ). WHR is a useful trait for predicting risk for T2D and heart disease since it accounts for waist and hip size, which both have associations with these traits (Heid ). Increasing waist size is associated with increased risk for T2D and heart disease, but gluteal fat deposits play a protective role against T2D, hypertension, and dyslipidemia (Heid ; Shungin ). HotNet2 analysis with PEGASUS gene scores identifies two subnetworks with interactions between genes known to be associated with WHR that may shed light on the genetic determination of body fat distribution. Figure 5G displays a subnetwork of homeobox (HOX) genes, a family of transcription factors that play an important role in morphogenesis in animals (Zhang ). The complete HOX gene network was found to be active in human white adipose tissue and fetal brown adipose tissue (Cantile ). HOX genes contained in this subnetwork include HOXA9, HOXC11, HOXA11, and EMX2. Multiple studies of gene expression in human subcutaneous abdominal adipose tissue and gluteal adipose tissue found that the HOXA9 gene has increased expression in abdominal adipose tissue; MEIS1, which encodes a HOX cofactor and is also contained in this subnetwork, was also expressed more in abdominal adipose tissue than in gluteal depots in men only (Karpe and Pinnick 2015). In contrast, HOXC11 and HOXA11 have increased gene expression in gluteal adipose tissue than in abdominal adipose tissue (Karpe and Pinnick 2015). HOXA9 and EMX2, another homeobox gene, were induced after extreme weight loss following bariatric surgery (Tchkonia ). The subnetwork shows SOX8, which encodes a transcription factor thought to play a role in development; mice deficient in SOX8 develop surprisingly normally, but undergo a severe degeneration of adipose tissue as adult mice (Guth ). Guth posit that SOX8 plays a role in adipocyte development, especially during replenishment of the adipocyte pool in adult mice. IRF4 is also contained in this subnetwork and encodes interferon regulatory factor 4 (IRF4), which is part of a family of transcription factors that are involved in various immune functions including regulation of innate immunity via the Toll-like receptor (TLR) signaling pathway (Eguchi ). Many IRF proteins including IRF4 are also expressed in preadipocytes and mature adipocytes; their function is to repress adipogenesis (Eguchi ). Eguchi find that knockout mice lacking IRF4 in adipocytes display excess adiposity and resistance to lipolysis induced by catecholamines, fasting, or cold exposure, suggesting that IRF4 plays an important role in transcriptional regulation of lipid handling in fat. These findings indicate that this subnetwork may play an important role in development of adipocytes and fat distribution. The second subnetwork found by HotNet2 (Figure 5H) elucidates the interactions of VEGFA, which contains a known GWA study association for WHR (SNP P-value ) and VEGFB, which is moderately associated with WHR (SNP P-value ) (Heid ). Additional information about this subnetwork can be found in Text S3. We also conducted HotNet2 analysis using minSNP gene scores for all phenotypes studied here. Since minSNP gene scores are misleadingly small (Figure 2A, Figure S2, and Figure S16), single highly significant genes pull in many unrelated genes with low gene scores to create artifactual “star-shaped” subnetworks that are likely false positives (see Figure S10 for HotNet2 results using minSNP gene scores for ulcerative colitis).

Discussion

Here we present a new approach for identifying geno–phenotype associations from case–control data that combines a novel gene score, PEGASUS, with network-based analyses using HotNet2 (Leiserson ). PEGASUS computes gene scores that measure the statistical association of a gene with a phenotype of interest and has multiple advantages over commonly used methods for generating gene scores. First, PEGASUS analytically models LD among SNPs within genes, producing a computationally efficient method that yields precise results—gene scores as small as (the machine precision of R). By modeling linkage disequilibrium, PEGASUS, like VEGAS (Liu ), is sensitive to genes with multiple SNPs that are moderately associated with the phenotype of interest. Unlike existing methods, our gene scores are not biased by gene length or poor precision. In the future, our approach can be extended to combine SNP-level P-values within linkage blocks in contrast to the gene boundaries used in this study. Second, we apply PEGASUS to 12 genome-wide association studies for complex diseases and traits and, in 10 of 12 studies, our significant gene scores () enrich for genes known to be associated with the phenotypes of interest (Figure 3). In simulation studies, we find that modeling fine-scale LD (or pairwise correlations between SNPs within genes), as in the PEGASUS and VEGAS methods, produces gene scores with much higher true positive rates of identifying genes associated with the simulated phenotype than does using minSNP (Figure 4). Thus, PEGASUS can be applied after conducting a GWA study to prioritize genes for functional validation. Third, because our approach precisely assesses the statistical association of a gene with a phenotype of interest, PEGASUS offers the opportunity to identify novel sets of interacting genes underlying complex phenotypes via pathway or network analysis with our gene scores as input. Complex phenotypes may be generated via mutations in a subset of genes in a predefined gene set [such as the gene sets used in gene set enrichment analysis (Subramanian )] or via crosstalk between gene sets or pathways. To identify novel subnetworks of genes associated with each of the complex phenotypes analyzed in this study, we used gene scores generated by PEGASUS as input to HotNet2 (Leiserson ); this is the first application of HotNet2 to common genetic variation. In our network analyses, minSNP and VEGAS miss key subnetworks containing genes known to be associated with the phenotype of interest (Figure 5, A and B, ulcerative colitis). Fourth, in ADHD, a disease for which large GWA studies ( individuals) have not identified genome-wide significant associations (Franke ; Neale ), we identify significant subnetworks of interacting genes () that are involved in neural development and learning and cognition. As knowledge of the human interactome continues to improve, post hoc analysis of GWA studies using gene scores from PEGASUS in conjunction with HotNet2 has the potential to generate promising hypotheses for functional validation. Finally, we argue that human genetics would benefit greatly if more GWA studies released all SNP-level P-values generated, instead of reporting a subset of P-values the authors consider to be of interest. The present situation, where only the genotypes are deposited in public repositories under managed access, makes it impossible to replicate the various filtering, quality control, ancestry correction, and other steps that lead from raw genotype calls to the handful of genome-wide significant SNPs reported in publications. Of course, participant confidentiality requirements may limit the public distribution of SNP P-values (Masca ), but P-values could be released with the genotype data under a managed access model. If these summary statistics were routinely released, gene score methods like PEGASUS, network/pathway analyses like HotNet2 (Leiserson ), and other computational innovations from the community could be more widely applied to yield new insight into the genomic underpinnings of complex diseases and traits.
  114 in total

Review 1.  Genome-wide association studies for common diseases and complex traits.

Authors:  Joel N Hirschhorn; Mark J Daly
Journal:  Nat Rev Genet       Date:  2005-02       Impact factor: 53.242

2.  Principal components analysis corrects for stratification in genome-wide association studies.

Authors:  Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal:  Nat Genet       Date:  2006-07-23       Impact factor: 38.330

Review 3.  Genetic insights into common pathways and complex relationships among immune-mediated diseases.

Authors:  Miles Parkes; Adrian Cortes; David A van Heel; Matthew A Brown
Journal:  Nat Rev Genet       Date:  2013-08-06       Impact factor: 53.242

4.  Meta-analysis of genome-wide association studies identifies 1q22 as a susceptibility locus for intracerebral hemorrhage.

Authors:  Daniel Woo; Guido J Falcone; William J Devan; W Mark Brown; Alessandro Biffi; Timothy D Howard; Christopher D Anderson; H Bart Brouwers; Valerie Valant; Thomas W K Battey; Farid Radmanesh; Miriam R Raffeld; Sylvia Baedorf-Kassis; Ranjan Deka; Jessica G Woo; Lisa J Martin; Mary Haverbusch; Charles J Moomaw; Guangyun Sun; Joseph P Broderick; Matthew L Flaherty; Sharyl R Martini; Dawn O Kleindorfer; Brett Kissela; Mary E Comeau; Jeremiasz M Jagiella; Helena Schmidt; Paul Freudenberger; Alexander Pichler; Christian Enzinger; Björn M Hansen; Bo Norrving; Jordi Jimenez-Conde; Eva Giralt-Steinhauer; Roberto Elosua; Elisa Cuadrado-Godia; Carolina Soriano; Jaume Roquer; Peter Kraft; Alison M Ayres; Kristin Schwab; Jacob L McCauley; Joanna Pera; Andrzej Urbanik; Natalia S Rost; Joshua N Goldstein; Anand Viswanathan; Eva-Maria Stögerer; David L Tirschwell; Magdy Selim; Devin L Brown; Scott L Silliman; Bradford B Worrall; James F Meschia; Chelsea S Kidwell; Joan Montaner; Israel Fernandez-Cadenas; Pilar Delgado; Rainer Malik; Martin Dichgans; Steven M Greenberg; Peter M Rothwell; Arne Lindgren; Agnieszka Slowik; Reinhold Schmidt; Carl D Langefeld; Jonathan Rosand
Journal:  Am J Hum Genet       Date:  2014-03-20       Impact factor: 11.025

5.  Transcriptional control of adipose lipid handling by IRF4.

Authors:  Jun Eguchi; Xun Wang; Songtao Yu; Erin E Kershaw; Patricia C Chiu; Joanne Dushay; Jennifer L Estall; Ulf Klein; Eleftheria Maratos-Flier; Evan D Rosen
Journal:  Cell Metab       Date:  2011-03-02       Impact factor: 27.287

6.  Genetic variants associated with antithyroid drug-induced agranulocytosis: a genome-wide association study in a European population.

Authors:  Pär Hallberg; Niclas Eriksson; Luisa Ibañez; Emmanuelle Bondon-Guitton; Reinhold Kreutz; Alfonso Carvajal; M Isabel Lucena; Esther Sancho Ponce; Mariam Molokhia; Javier Martin; Tomas Axelsson; Qun-Ying Yue; Patrik K E Magnusson; Mia Wadelius
Journal:  Lancet Diabetes Endocrinol       Date:  2016-05-03       Impact factor: 32.069

7.  Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors:  Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-30       Impact factor: 11.205

8.  Combined analysis of genome-wide association studies for Crohn disease and psoriasis identifies seven shared susceptibility loci.

Authors:  David Ellinghaus; Eva Ellinghaus; Rajan P Nair; Philip E Stuart; Tõnu Esko; Andres Metspalu; Sophie Debrus; John V Raelson; Trilokraj Tejasvi; Majid Belouchi; Sarah L West; Jonathan N Barker; Sulev Kõks; Külli Kingo; Tobias Balschun; Orazio Palmieri; Vito Annese; Christian Gieger; H Erich Wichmann; Michael Kabesch; Richard C Trembath; Christopher G Mathew; Gonçalo R Abecasis; Stephan Weidinger; Susanna Nikolaus; Stefan Schreiber; James T Elder; Michael Weichenthal; Michael Nothnagel; Andre Franke
Journal:  Am J Hum Genet       Date:  2012-04-06       Impact factor: 11.025

9.  Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47.

Authors:  Carl A Anderson; Gabrielle Boucher; Charlie W Lees; Andre Franke; Mauro D'Amato; Kent D Taylor; James C Lee; Philippe Goyette; Marcin Imielinski; Anna Latiano; Caroline Lagacé; Regan Scott; Leila Amininejad; Suzannah Bumpstead; Leonard Baidoo; Robert N Baldassano; Murray Barclay; Theodore M Bayless; Stephan Brand; Carsten Büning; Jean-Frédéric Colombel; Lee A Denson; Martine De Vos; Marla Dubinsky; Cathryn Edwards; David Ellinghaus; Rudolf S N Fehrmann; James A B Floyd; Timothy Florin; Denis Franchimont; Lude Franke; Michel Georges; Jürgen Glas; Nicole L Glazer; Stephen L Guthery; Talin Haritunians; Nicholas K Hayward; Jean-Pierre Hugot; Gilles Jobin; Debby Laukens; Ian Lawrance; Marc Lémann; Arie Levine; Cecile Libioulle; Edouard Louis; Dermot P McGovern; Monica Milla; Grant W Montgomery; Katherine I Morley; Craig Mowat; Aylwin Ng; William Newman; Roel A Ophoff; Laura Papi; Orazio Palmieri; Laurent Peyrin-Biroulet; Julián Panés; Anne Phillips; Natalie J Prescott; Deborah D Proctor; Rebecca Roberts; Richard Russell; Paul Rutgeerts; Jeremy Sanderson; Miquel Sans; Philip Schumm; Frank Seibold; Yashoda Sharma; Lisa A Simms; Mark Seielstad; A Hillary Steinhart; Stephan R Targan; Leonard H van den Berg; Morten Vatn; Hein Verspaget; Thomas Walters; Cisca Wijmenga; David C Wilson; Harm-Jan Westra; Ramnik J Xavier; Zhen Z Zhao; Cyriel Y Ponsioen; Vibeke Andersen; Leif Torkvist; Maria Gazouli; Nicholas P Anagnou; Tom H Karlsen; Limas Kupcinskas; Jurgita Sventoraityte; John C Mansfield; Subra Kugathasan; Mark S Silverberg; Jonas Halfvarson; Jerome I Rotter; Christopher G Mathew; Anne M Griffiths; Richard Gearry; Tariq Ahmad; Steven R Brant; Mathias Chamaillard; Jack Satsangi; Judy H Cho; Stefan Schreiber; Mark J Daly; Jeffrey C Barrett; Miles Parkes; Vito Annese; Hakon Hakonarson; Graham Radford-Smith; Richard H Duerr; Séverine Vermeire; Rinse K Weersma; John D Rioux
Journal:  Nat Genet       Date:  2011-02-06       Impact factor: 38.330

10.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

View more
  25 in total

Review 1.  Network propagation: a universal amplifier of genetic associations.

Authors:  Lenore Cowen; Trey Ideker; Benjamin J Raphael; Roded Sharan
Journal:  Nat Rev Genet       Date:  2017-06-12       Impact factor: 53.242

2.  Germline Features Associated with Immune Infiltration in Solid Tumors.

Authors:  Sahar Shahamatdar; Meng Xiao He; Matthew A Reyna; Alexander Gusev; Saud H AlDubayan; Eliezer M Van Allen; Sohini Ramachandran
Journal:  Cell Rep       Date:  2020-03-03       Impact factor: 9.423

3.  Novel Gene and Network Associations Found for Acute Lymphoblastic Leukemia Using Case-Control and Family-Based Studies in Multiethnic Populations.

Authors:  Priyanka Nakka; Natalie P Archer; Heng Xu; Philip J Lupo; Benjamin J Raphael; Jun J Yang; Sohini Ramachandran
Journal:  Cancer Epidemiol Biomarkers Prev       Date:  2017-07-27       Impact factor: 4.254

4.  Impact of polymorphisms within genes involved in regulating DNA methylation in patients with metastatic colorectal cancer enrolled in three independent, randomised, open-label clinical trials: a meta-analysis from TRIBE, MAVERICC and FIRE-3.

Authors:  Alberto Puccini; Fotios Loupakis; Sebastian Stintzing; Shu Cao; Francesca Battaglin; Ryuma Togunaka; Madiha Naseem; Martin D Berger; Shivani Soni; Wu Zhang; Christoph Mancao; Bodour Salhia; Shannon M Mumenthaler; Daniel J Weisenberger; Gangning Liang; Chiara Cremolini; Volker Heinemann; Alfredo Falcone; Joshua Millstein; Heinz-Josef Lenz
Journal:  Eur J Cancer       Date:  2019-03-07       Impact factor: 9.162

5.  AMPK variant, a candidate of novel predictor for chemotherapy in metastatic colorectal cancer: A meta-analysis using TRIBE, MAVERICC and FIRE3.

Authors:  Ryuma Tokunaga; Shu Cao; Madiha Naseem; Francesca Battaglin; Jae Ho Lo; Hiroyuki Arai; Fotios Loupakis; Sebastian Stintzing; Alberto Puccini; Martin D Berger; Shivani Soni; Wu Zhang; Christoph Mancao; Bodour Salhia; Shannon M Mumenthaler; Daniel J Weisenberger; Gangning Liang; Chiara Cremolini; Volker Heinemann; Alfredo Falcone; Joshua Millstein; Heinz-Josef Lenz
Journal:  Int J Cancer       Date:  2019-03-26       Impact factor: 7.396

6.  Early complement genes are associated with visual system degeneration in multiple sclerosis.

Authors:  Kathryn C Fitzgerald; Kicheol Kim; Matthew D Smith; Sean A Aston; Nicholas Fioravante; Alissa M Rothman; Stephen Krieger; Stacey S Cofield; Dorlan J Kimbrough; Pavan Bhargava; Shiv Saidha; Katharine A Whartenby; Ari J Green; Ellen M Mowry; Gary R Cutter; Fred D Lublin; Sergio E Baranzini; Philip L De Jager; Peter A Calabresi
Journal:  Brain       Date:  2019-09-01       Impact factor: 13.501

7.  Boosting GWAS using biological networks: A study on susceptibility to familial breast cancer.

Authors:  Héctor Climente-González; Christine Lonjou; Fabienne Lesueur; Dominique Stoppa-Lyonnet; Nadine Andrieu; Chloé-Agathe Azencott
Journal:  PLoS Comput Biol       Date:  2021-03-18       Impact factor: 4.475

8.  Estimation of non-null SNP effect size distributions enables the detection of enriched genes underlying complex traits.

Authors:  Wei Cheng; Sohini Ramachandran; Lorin Crawford
Journal:  PLoS Genet       Date:  2020-06-15       Impact factor: 5.917

9.  Natural variation in the regulation of neurodevelopmental genes modifies flight performance in Drosophila.

Authors:  Adam N Spierer; Jim A Mossman; Samuel Pattillo Smith; Lorin Crawford; Sohini Ramachandran; David M Rand
Journal:  PLoS Genet       Date:  2021-03-18       Impact factor: 5.917

10.  Genetic underpinnings of affective temperaments: a pilot GWAS investigation identifies a new genome-wide significant SNP for anxious temperament in ADGRB3 gene.

Authors:  Xenia Gonda; Nora Eszlari; Dora Torok; Zsofia Gal; Janos Bokor; Andras Millinghoffer; Daniel Baksa; Peter Petschner; Peter Antal; Gerome Breen; Gabriella Juhasz; Gyorgy Bagdy
Journal:  Transl Psychiatry       Date:  2021-06-01       Impact factor: 6.222

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.