MOTIVATION: To increase detection power, gene-level analysis methods aggregate weak signals. To greatly increase computational efficiency, most methods use as input summary statistics from genome-wide association studies (GWAS). Gene statistics are then constructed using linkage disequilibrium (LD) patterns from a relevant reference panel. However, all methods, including our own Joint Effect on Phenotype of eQTL/functional single nucleotide polymorphisms (SNPs) associated with a Gene (JEPEG), assume homogeneous panels, e.g. European, which renders these tools unsuitable for the analysis of large cosmopolitan cohorts.
RESULTS: We propose a JEPEG extension, JEPEGMIX, which, similar to one of our software tools, Direct Imputation of summary STatistics of unmeasured SNPs from MIXed ethnicity cohorts (DISTMIX), is capable of estimating accurate LD patterns for cosmopolitan cohorts. JEPEGMIX uses these accurate LD estimates to (i) impute the summary statistics at unmeasured functional variants and (ii) test for the joint effect of all measured and imputed functional variants that are associated with a gene. We illustrate the performance of our tool by analyzing the GWAS meta-analysis summary statistics from the multi-ethnic Psychiatric Genomics Consortium Schizophrenia stage 2 cohort. This practical application supports the immune system being one of the main drivers of the process leading to schizophrenia.
AVAILABILITY AND IMPLEMENTATION: Software, annotation database and examples are available at http://dleelab.github.io/jepegmix/.
CONTACT: donghyung.lee@vcuhealth.org
SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
1 Introduction
Univariate analysis of genome-wide association studies (GWAS) has emerged as the main tool for identifying trait/disease-associated genetic variants (Burton ). However, most variants reported by complex trait GWAS are common single nucleotide polymorphisms (SNPs) with weak or moderate effect sizes, which account for only a small fraction of the overall phenotypic variation (Manolio ). This is because most common causal variants, having small effect sizes, are unlikely to be detected in GWAS (Yang ).
A reasonable approach to increase the power to detect true association signals with small effect sizes is to aggregate them by jointly analyzing multiple SNPs. To leverage information from multiple SNPs, multivariate association tests (Ehret ; Wood ; Yang ) have also been proposed. However, these methods typically test all SNPs, regardless of their functionality.
Given that functional SNPs are likely to jointly affect gene expression, to increase detection power, our group proposed JEPEG (Joint Effect on Phenotype of eQTL/functional SNPs associated with a Gene; Lee ), which (i) uses only summary association statistics, (ii) imputes summary statistics of unmeasured functional SNPs and (iii) boosts detection power by jointly analyzing measured and imputed functional variants. However, similar to direct imputation methods based on summary statistics, e.g. DIST (Lee ) and ImpG (Pasaniuc ), it is only applicable to homogeneous cohorts. To overcome this limitation, concurrently with Adapt-Mix (Park ) and DISSCO (Xu ), our group developed DISTMIX (Direct Imputation of summary STatistics of unmeasured SNPs from MIXed ethnicity; Lee ). It extends DIST capabilities to the analysis of mixed ethnicity cohorts by estimating their linkage disequilibrium (LD) patterns as a mixture of the LD patterns from the constituent ethnicities of large reference panels, e.g. 1000 Genomes data (1KG) (Altshuler ).
Here, for the gene level analysis of the ever more common (and well powered) mixed ethnicity cohorts, we propose JEPEG for MIXed ethnicity cohorts (JEPEGMIX), which adapts the LD estimation strategy used by DISTMIX, while retaining all JEPEG advantages.
2 Methods
Similar to DISTMIX, to accurately estimate LD patterns for mixed ethnicity cohorts, JEPEGMIX first estimates the ethnic proportions of study cohorts using study allele frequency (AF) information [see Supplementary Text S1 in supplementary data (SD) for details]. (Alternatively, when AF information is not available, the user can pre-specify the proportions based on the ethnic composition information typically provided by published studies.) Next, using the estimated/user-specified ethnic proportions, the software estimates LD patterns of the study cohort as a weighted mixture of the LD matrices of all ethnic groups in a reference panel (Supplementary Text S2 of SD). Finally, it uses these estimated mixture LD patterns and association summary statistics to (i) rapidly and accurately impute, when necessary, summary statistics of unmeasured functional SNPs (Supplementary Text S3 of SD) and (ii) jointly test the effect of measured and imputed functional SNPs associated with each gene (Supplementary Text S4 of SD).
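The mixture-LD and imputation steps described above can be illustrated with a minimal Python sketch. It is not the JEPEGMIX implementation: the function names (`mix_ld`, `solve`, `impute_z`), the `ridge` parameter, the population labels and all numbers are hypothetical, and the imputation uses the standard conditional-expectation formula for multivariate normal Z-scores, which is assumed here to approximate what Supplementary Text S3 describes.

```python
def mix_ld(ld_by_pop, weights, ridge=0.0):
    """Weighted mixture of per-population LD (correlation) matrices.

    `weights` are ethnic proportions; they are renormalized to sum to 1.
    `ridge` (hypothetical) is added to the diagonal for numerical stability.
    """
    total = sum(weights.values())
    n = len(next(iter(ld_by_pop.values())))
    mixed = [[0.0] * n for _ in range(n)]
    for pop, w in weights.items():
        for i in range(n):
            for j in range(n):
                mixed[i][j] += (w / total) * ld_by_pop[pop][i][j]
    for i in range(n):
        mixed[i][i] += ridge
    return mixed


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def impute_z(ld, measured, target, z_measured):
    """Impute the Z-score at `target` as R_tm R_mm^{-1} z_m.

    Returns the imputed Z-score and its variance under the null,
    R_tm R_mm^{-1} R_mt (a value in [0, 1] for a valid LD matrix).
    """
    r_mm = [[ld[i][j] for j in measured] for i in measured]
    r_tm = [ld[target][j] for j in measured]
    coef = solve(r_mm, r_tm)  # R_mm^{-1} R_mt, using symmetry of R_mm
    z_hat = sum(c * z for c, z in zip(coef, z_measured))
    info = sum(c * r for c, r in zip(coef, r_tm))
    return z_hat, info


# Toy example: two reference populations, three SNPs (all values invented).
ld_by_pop = {
    "POP_A": [[1.0, 0.8, 0.6], [0.8, 1.0, 0.5], [0.6, 0.5, 1.0]],
    "POP_B": [[1.0, 0.4, 0.2], [0.4, 1.0, 0.1], [0.2, 0.1, 1.0]],
}
cohort_ld = mix_ld(ld_by_pop, {"POP_A": 0.75, "POP_B": 0.25})
z_hat, info = impute_z(cohort_ld, measured=[0, 1], target=2, z_measured=[3.0, 2.0])
print(cohort_ld, z_hat, info)
```

In this sketch, SNPs 0 and 1 are measured and SNP 2 is the unmeasured functional variant; changing the proportions passed to `mix_ld` changes the LD used for imputation, which is the point of the mixture approach.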
3 Results
To estimate false positive rates, null hypothesis cosmopolitan cohorts were simulated using haplotypic patterns from 1KG (see Supplementary Text S5 of SD). When compared with JEPEG, JEPEGMIX maintains the false positive rates at or below nominal thresholds (Supplementary Fig. S1 in SD). We obtained gene-level statistics by applying the method to association summary statistics from the large-scale cosmopolitan Psychiatric Genomics Consortium Schizophrenia stage 2 (PGC SCZ2) cohort (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). A subsequent Ingenuity Pathway Analysis (www.ingenuity.com) of the 61 significant JEPEGMIX genes (Supplementary Table S1 in SD), i.e. those with false discovery rate q-values < 0.05, yields a large number of immune pathways and only one (in italics) that is neurologically related (Table 1). The pattern is maintained even when excluding the 21 genes located in the immune-related Major Histocompatibility Complex (MHC) region on chromosome 6p (Supplementary Table S2).
Table 1. Pathways significant at a Type I error of 0.05

| Pathway | P-value |
| --- | --- |
| Antigen presentation pathway | 0.0002 |
| Graft-versus-host disease signaling | 0.0004 |
| Autoimmune thyroid disease signaling | 0.0005 |
| Granzyme A signaling | 0.002 |
| Dendritic cell maturation | 0.002 |
| Allograft rejection signaling | 0.002 |
| OX40 signaling pathway | 0.003 |
| Crosstalk between dendritic cells and natural killer cells | 0.003 |
| Communication between innate and adaptive immune cells | 0.003 |
| Cytotoxic T lymphocyte-mediated apoptosis | 0.004 |
| Type I diabetes mellitus signaling | 0.005 |
| Role of RIG1-like receptors in antiviral innate immunity | 0.008 |
| *Neuroprotective Role of THOP1 in Alzheimer's Disease* | 0.009 |
| Nur77 signaling in T lymphocytes | 0.01 |
| Cdc42 signaling | 0.01 |
| Calcium-induced T Lymphocyte apoptosis | 0.02 |
| Caveolar-mediated endocytosis signaling | 0.02 |
| CTLA4 signaling in cytotoxic T lymphocytes | 0.03 |
| Virus entry via endocytic pathways | 0.03 |
| p53 Signaling | 0.04 |
| G-protein coupled receptor signaling | 0.05 |
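The gene selection criterion used in the Results (false discovery rate q-values < 0.05) can be illustrated with a short Python sketch. It assumes Benjamini-Hochberg adjusted P-values as the q-values, which may differ from the q-value estimator actually used by the authors; the function name `bh_qvalues`, the gene labels and the P-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Toy gene-level P-values (invented for illustration only).
gene_p = {"GENE_A": 0.0001, "GENE_B": 0.03, "GENE_C": 0.2, "GENE_D": 0.004}
q = bh_qvalues(list(gene_p.values()))
significant = [g for g, qv in zip(gene_p, q) if qv < 0.05]
print(significant)
```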
4 Conclusions
For multi-ethnic cohorts, unlike existing methods, JEPEGMIX controls the Type I error rates at or below nominal levels. Because the ridge adjustment is inversely related to the size of the relevant 1KG subpopulations (Supplementary Text S2 of SD), the method is at present rather conservative. However, this conservativeness is expected to become negligible with the advent of extremely large reference panels (http://www.haplotype-reference-consortium.org). Thus, to the capabilities of JEPEG, JEPEGMIX adds the much needed applicability to the analysis of large cosmopolitan cohorts, which are the state of the art in detecting genetic signals. For such cohorts, it (i) imputes unmeasured functional SNPs, (ii) pools into a synthetic variable the information of measured and imputed SNPs from the same functional category and (iii) combines these synthetic variables in a gene-level Mahalanobis test. Application of JEPEGMIX to the PGC SCZ2 cohort suggests that, in the etiology of SCZ, the immune system might play a more substantial role than currently accepted.
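Steps (ii) and (iii) above can be sketched in Python as follows. This is an illustration under stated assumptions, not the JEPEGMIX code: the per-category weight vectors, the function names (`solve`, `quad`, `gene_test`) and all numbers are hypothetical, and the actual weighting of SNPs within a functional category is defined in the supplementary material rather than here. Each synthetic variable is a standardized weighted sum of Z-scores, and the gene statistic is the Mahalanobis distance of the synthetic variables under their LD-induced correlation, which under the null follows a chi-square distribution with as many degrees of freedom as there are categories.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def quad(u, mat, v):
    """Bilinear form u' mat v."""
    return sum(u[i] * mat[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def gene_test(z, ld, category_weights):
    """Return (Mahalanobis statistic, degrees of freedom) for one gene.

    `category_weights` holds one weight vector per functional category,
    with zeros for SNPs that do not belong to that category.
    """
    sds = [math.sqrt(quad(w, ld, w)) for w in category_weights]
    # Synthetic variables: standardized weighted sums of Z-scores.
    s = [sum(wi * zi for wi, zi in zip(w, z)) / sd
         for w, sd in zip(category_weights, sds)]
    k = len(s)
    # Correlation between synthetic variables implied by the LD matrix.
    corr = [[quad(category_weights[a], ld, category_weights[b]) / (sds[a] * sds[b])
             for b in range(k)] for a in range(k)]
    stat = sum(si * xi for si, xi in zip(s, solve(corr, s)))
    return stat, k


# Toy gene: three functional SNPs in two categories (all values invented).
ld = [[1.0, 0.7, 0.5], [0.7, 1.0, 0.4], [0.5, 0.4, 1.0]]
z = [3.0, 2.0, 1.5]
weights = [[1.0, 1.0, 0.0],  # hypothetical category 1: SNPs 0 and 1
           [0.0, 0.0, 1.0]]  # hypothetical category 2: SNP 2
stat, df = gene_test(z, ld, weights)
# For exactly 2 degrees of freedom the chi-square tail is exp(-x / 2).
p_value = math.exp(-stat / 2.0)
print(stat, df, p_value)
```

The closed-form P-value on the last lines holds only for two categories; for other degrees of freedom a chi-square survival function from a statistics library would be needed.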
Funding
This work was supported by the National Institute on Drug Abuse [R25DA026119 to D.L.], the National Institute of Mental Health [R21MH100560 to S.A.B. and B.P.R.] and the National Institute on Alcohol Abuse and Alcoholism [R21AA022717 to S.A.B. and V.I.V.; P50AA022537 to S.A.B. and K.S.K.].
Conflict of Interest: none declared.
Authors: Jian Yang; Beben Benyamin; Brian P McEvoy; Scott Gordon; Anjali K Henders; Dale R Nyholt; Pamela A Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael E Goddard; Peter M Visscher Journal: Nat Genet Date: 2010-06-20 Impact factor: 38.330
Authors: Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean Journal: Nature Date: 2010-10-28 Impact factor: 49.962
Authors: Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher Journal: Nature Date: 2009-10-08 Impact factor: 49.962
Authors: Bogdan Pasaniuc; Noah Zaitlen; Huwenbo Shi; Gaurav Bhatia; Alexander Gusev; Joseph Pickrell; Joel Hirschhorn; David P Strachan; Nick Patterson; Alkes L Price Journal: Bioinformatics Date: 2014-07-01 Impact factor: 6.937
Authors: Donghyung Lee; Vernell S Williamson; T Bernard Bigdeli; Brien P Riley; Ayman H Fanous; Vladimir I Vladimirov; Silviu-Alin Bacanu Journal: Bioinformatics Date: 2014-12-12 Impact factor: 6.937
Authors: Donghyung Lee; T Bernard Bigdeli; Vernell S Williamson; Vladimir I Vladimirov; Brien P Riley; Ayman H Fanous; Silviu-Alin Bacanu Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937
Authors: Alexis C Edwards; Joseph D Deak; Ian R Gizer; Dongbing Lai; Chris Chatzinakos; Kirk P Wilhelmsen; Jonathan Lindsay; Jon Heron; Matthew Hickman; Bradley T Webb; Silviu-Alin Bacanu; Tatiana M Foroud; Kenneth S Kendler; Danielle M Dick; Marc A Schuckit Journal: Alcohol Clin Exp Res Date: 2018-10-28 Impact factor: 3.455
Authors: Chris Chatzinakos; Donghyung Lee; Bradley T Webb; Vladimir I Vladimirov; Kenneth S Kendler; Silviu-Alin Bacanu Journal: Bioinformatics Date: 2018-01-15 Impact factor: 6.937