Motivation: To increase detection power, researchers use gene-level analysis methods to aggregate weak marker signals. Because gene expression controls biological processes, researchers proposed aggregating signals for expression Quantitative Trait Loci (eQTL). Most gene-level eQTL methods make statistical inferences based on (i) summary statistics from genome-wide association studies (GWAS) and (ii) linkage disequilibrium patterns from a relevant reference panel. While most such tools assume homogeneous cohorts, our Gene-level Joint Analysis of functional SNPs in Cosmopolitan Cohorts (JEPEGMIX) method accommodates cosmopolitan cohorts by using heterogeneous panels. However, JEPEGMIX relies on brain eQTLs from older gene expression studies and does not adjust for background enrichment in GWAS signals.
Results: We propose JEPEGMIX2, an extension of JEPEGMIX. Compared to JEPEGMIX, it uses (i) cis-eQTL SNPs from the latest expression studies and (ii) brain-specific (sub)tissues and tissues other than brain. JEPEGMIX2 also (i) avoids accumulating averagely enriched polygenic information by adjusting for background enrichment and (ii) outputs gene q-values based on Holm adjustment of P-values, to avoid an increase in false positive rates for studies with numerous highly enriched (above the background) genes.
Availability and implementation: https://github.com/Chatzinakos/JEPEGMIX2.
Supplementary information: Supplementary data are available at Bioinformatics online.
Gene expression is believed to have influenced human evolution and to play a key role in diseases (Emilsson et al.). Thus, it is critical for understanding diseases and developing treatments. The importance of gene expression was further underlined by the enrichment of association signals in SNPs tagging gene expression (Nica and Dermitzakis, 2008; Nicolae et al.), which are denoted as expression quantitative trait loci (eQTL).

Currently, the identification of complex disease susceptibility loci is performed via genome-wide association studies (GWAS), which scan single nucleotide polymorphisms (SNPs) across the entire genome for genetic variants associated with a trait. Univariate analysis of GWAS is still the de facto tool for identifying trait-associated SNPs (Wellcome Trust Case Control, 2007). However, in GWAS of complex traits, where SNPs have weak or moderate effect sizes, the significant findings account for only a small fraction of the total trait variation (Manolio et al.). Due to their small effect sizes, these SNPs are rarely detected in GWAS (Yang et al.). To increase the power of detection, researchers proposed analyzing genetic variants multivariately (Wang et al.).

One type of multivariate analysis is the transcriptome-wide association study (TWAS), which identifies significant expression-trait associations. Such methods, e.g. joint effect on phenotype of eQTL/functional SNPs associated with a gene (JEPEG) (Lee et al.), PredictXcan (Gamazon et al.), JEPEGMIX (Lee et al.) and TWAS (Gusev et al.), use eQTLs to predict gene expression and/or infer which genes are associated with traits. However, unlike competing non-eQTL paradigms, e.g. LDscore/LDpred (Bulik-Sullivan et al.), current TWAS methods (i) lack competitive adjustment for background enrichment ('average signal') and (ii) do not output q-values that control false positive rates when a substantial number of genes are enriched (above background) in signals.

To address these shortcomings, we propose JEPEGMIX2, an extension of JEPEGMIX. In addition to the existing advantages of imputing eQTL statistics and inferring gene-trait associations in cosmopolitan cohorts, JEPEGMIX2 (i) adjusts for background enrichment, (ii) offers the option to upweight rarer eQTLs and (iii) outputs Holm q-values, to avoid an increase in false positive rates under high signal enrichment.
2 Materials and methods
To avoid a mere accumulation of just averagely enriched polygenic information, we competitively adjust statistics for background enrichment. This is achieved by adjusting the $\chi^2$ statistic for average noncentrality. We denote this 'centralized' JEPEGMIX statistic as competitive (C) and the original statistic as non-competitive (NC).

Let $Z = (Z_1, \ldots, Z_m)$ be the vector of Z-scores for the $m$ measured SNPs in the genome scan. Due to polygenicity, the expected genome scan $\chi^2$ statistics, each with 1 degree of freedom (df), have a non-zero background noncentrality parameter $\lambda$, i.e. $E(Z_i^2) = 1 + \lambda$. Thus, by the method of moments, we can estimate $\hat{\lambda} = \overline{Z^2} - 1$, where $\overline{Z^2}$ is the mean of $Z_i^2$ computed using all measured SNPs in the genome scan. However, given that $\lambda \ge 0$, a better estimator is $\hat{\lambda} = \max(\overline{Z^2} - 1, 0)$. To develop a competitive test, before computing gene-level statistics, Z-scores must be shrunk towards zero by adjusting for the average background enrichment. This can be achieved via a 3-step process:

1. Recompute, under 'average' noncentrality, the P-value associated with the $\chi^2$ statistics: $p_i = 1 - F(Z_i^2; 1, \hat{\lambda})$, where $F(\cdot; 1, \hat{\lambda})$ is the cumulative distribution function (cdf) of the non-central $\chi^2$ distribution with 1 df and noncentrality parameter $\hat{\lambda}$.
2. Transform $p_i$ into its quantile from a central $\chi^2$ distribution with 1 df, i.e. $Q_i = F^{-1}(1 - p_i; 1, 0)$.
3. Transform $Q_i$ to a 'central' Z-score: $Z_i^{c} = \mathrm{sign}(Z_i)\sqrt{Q_i}$.

By the Delta method (a first order Taylor approximation), $Z^{c}$, as a linear transformation (deflation) of $Z$, has the same correlation structure. Thus, $Z^{c}$ can be used to build the competitive gene statistics (Supplementary Text S1), which have the same variance as their non-competitive versions.

To facilitate user-specific input along with future extensions, the new annotation file now includes an R-like formula for the expression of each gene as a function of its eQTL genotypes. The annotation file includes cis-eQTLs for all tissues available in PREDICTDB (http://predictdb.hakyimlab.org/). To avoid making inference about genes poorly predicted by SNPs, for the available tissues we retain only genes whose expression is predicted from their eQTLs with a q-value below a significance threshold. Additionally, given the increased deleteriousness of rarer mutations, we offer the possibility to upweight the coefficients of rarer variants (see Supplementary Text S1 for statistic computation) using a Madsen and Browning type approach (Madsen and Browning, 2009).

For linkage disequilibrium (LD) estimates in cosmopolitan cohorts (needed for both imputation and statistical inference), we allow the user to input the study cohort proportions of ethnicities from the reference panel. LD patterns of the study cohort are estimated as a weighted mixture (with the above weights) of the LD matrices for all ethnic groups in a reference panel (Supplementary Text S2). LD patterns are subsequently used to (i) accurately impute summary statistics of unmeasured eQTLs (Supplementary Text S3) and (ii) compute the variance of the SNP linear combinations used for gene-level tests in each tissue (Supplementary Text S2).
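The weighted LD mixture described above can be illustrated with a minimal sketch. It assumes the simplest possible reading, a proportion-weighted average of per-population correlation matrices; the exact estimator is given in Supplementary Text S2 and may differ (for example by working with covariances and allele frequencies). The function name `mix_ld` and the population labels are hypothetical and are not part of JEPEGMIX2.

```python
def mix_ld(ld_by_population, proportions):
    """Proportion-weighted average of per-population LD (correlation) matrices.

    ld_by_population: dict mapping a population label to a square matrix
                      (list of lists) over the same ordered set of SNPs.
    proportions:      dict mapping the same labels to cohort proportions.
    Assumption: a plain weighted average; this is an illustration only.
    """
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    labels = list(ld_by_population)
    size = len(ld_by_population[labels[0]])
    mixed = [[0.0] * size for _ in range(size)]
    for label in labels:
        weight = proportions[label] / total  # normalize so weights sum to 1
        matrix = ld_by_population[label]
        for i in range(size):
            for j in range(size):
                mixed[i][j] += weight * matrix[i][j]
    return mixed


# Hypothetical two-SNP example: 75% of the cohort from one panel group.
ld_panels = {
    "EUR": [[1.0, 0.8], [0.8, 1.0]],
    "AFR": [[1.0, 0.4], [0.4, 1.0]],
}
cohort_ld = mix_ld(ld_panels, {"EUR": 0.75, "AFR": 0.25})
```

Normalizing the proportions inside the function keeps the diagonal at 1 even when the user-supplied weights do not sum exactly to one.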
The current version uses the 1000 Genomes (1KG) Phase I release version 3 as the reference panel (Durbin et al.). It consists of Europeans, Asians, Africans and Native Americans.
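As a rough illustration of the three-step background adjustment from the start of this section, the sketch below estimates the background noncentrality by the method of moments and deflates a Z-score accordingly. It is not the JEPEGMIX2 implementation (which is written in C++). It relies on the fact that a non-central $\chi^2$ variable with 1 df and noncentrality $\lambda$ is the square of a normal variable with mean $\sqrt{\lambda}$ and unit variance, so only the standard normal cdf is needed. The function names are hypothetical.

```python
from math import copysign, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def estimate_background_noncentrality(z_scores):
    """Method-of-moments estimate: max(mean(Z^2) - 1, 0)."""
    mean_square = sum(z * z for z in z_scores) / len(z_scores)
    return max(mean_square - 1.0, 0.0)


def centralize_z(z, noncentrality):
    """Deflate one Z-score for a background noncentrality (steps 1 to 3)."""
    root = sqrt(noncentrality)
    magnitude = abs(z)
    # Step 1: upper-tail P-value of Z^2 under the non-central chi-square
    # with 1 df, written with normal cdfs:
    # P(X^2 > z^2) for X ~ N(sqrt(lambda), 1).
    p_value = _STD_NORMAL.cdf(root - magnitude) + _STD_NORMAL.cdf(-root - magnitude)
    # Guard against underflow to 0 and rounding slightly above 1.
    p_value = min(max(p_value, 1e-300), 1.0)
    # Steps 2 and 3: the central chi-square (1 df) quantile Q satisfies
    # sqrt(Q) = -Phi^{-1}(p / 2); the sign of the original Z is restored.
    central_magnitude = -_STD_NORMAL.inv_cdf(p_value / 2.0)
    return copysign(central_magnitude, z)


# Hypothetical usage on a small vector of Z-scores.
z_scores = [1.0, -1.0, 2.0]
lambda_hat = estimate_background_noncentrality(z_scores)
central_z_scores = [centralize_z(z, lambda_hat) for z in z_scores]
```

With zero noncentrality the transformation leaves Z-scores unchanged, and with positive noncentrality every Z-score shrinks towards zero while keeping its sign, which is the deflation the competitive test requires.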
3 Simulations
To estimate the false positive rates of JEPEGMIX2, for five different cosmopolitan study scenarios (Supplementary Text S4), we simulated (under $H_0$) 100 cosmopolitan cohorts of 10,000 subjects for Illumina 1M autosomal SNPs using 1KG haplotype patterns (Supplementary Text S4, Supplementary Table 1). The subject phenotypes were simulated independently of genotypes as a random Gaussian sample. SNP phenotype-genotype association summary statistics were computed from a correlation test. We obtained JEPEGMIX2 statistics for (i) competitive (C) and non-competitive (NC) tests and (ii) tests with rare (Madsen and Browning like) (R) and non-rare (NR) eQTL weights. To test the ability of the methods to maintain false positive rates under background enrichment, we provide an enriched scenario. Under this scenario, we quantile transform the simulated 'central' Z-scores (CZ) to 'non-central' Z-scores (NCZ) by following the three steps from the previous section in the reverse direction: the first step uses the central distribution and the second a non-central distribution whose noncentrality is an extrapolation of the PGC3 schizophrenia noncentrality from PGC2 (Ripke et al.). We also applied JEPEGMIX2 to 16 real summary datasets (Supplementary Text S5, Supplementary Table S2). To limit the increase in Type I error rates of JEPEGMIX2, we deem as significantly associated only genes whose Holm-adjusted P-value (q-value) is below the significance threshold. Because C4 explains most of the Major Histocompatibility Complex (MHC; chr6: 25–33 Mb) signals for schizophrenia (SCZ) (McCarthy et al.), for this trait we omit non-C4 genes in this region.
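The Holm adjustment used for the gene-level q-values can be sketched as follows. This is the standard Holm step-down procedure written in plain Python; the function name `holm_adjust` and the example P-values are hypothetical, and the sketch is not taken from the JEPEGMIX2 source.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted P-values, returned in the input order."""
    count = len(p_values)
    # Indices of the P-values from smallest to largest.
    order = sorted(range(count), key=lambda index: p_values[index])
    adjusted = [0.0] * count
    running_max = 0.0
    for rank, index in enumerate(order):
        # The k-th smallest P-value (k = rank + 1) is multiplied by
        # (count - k + 1) and capped at 1.
        candidate = min(1.0, (count - rank) * p_values[index])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical gene-level P-values.
gene_p_values = [0.01, 0.04, 0.03]
gene_q_values = holm_adjust(gene_p_values)
```

A gene is then declared significant when its adjusted value falls below the chosen threshold. Because Holm controls the family-wise error rate, it is more conservative than an FDR adjustment, which is the property the text relies on under enrichment.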
Table 1. Signals for real datasets

Trait      No. of unique genes
SCZ        68
ALZ        34
AMD        17
BIP        11
HDL        79
LDL        78
T2D        5
TG         48
Smoking    5
4 Results
JEPEGMIX2 with competitive (C) statistics controls the false positive rates at or below nominal thresholds for both the central (CZ) and non-central (NCZ) scenarios, while the non-competitive (NC) version behaves similarly only in the central case, i.e. when the GWAS statistics are not enriched (Supplementary Text S5, Supplementary Figs S1–S5). Under the enriched scenario (NCZ), the non-competitive version of the test has much increased false positive rates.

Using the Holm P-value adjustment and both rare (R) and non-rare (NR) eQTL weights, significant gene signals were found for 9 traits in the real datasets, for which we present heatmaps (Supplementary Text S5, Supplementary Figs S6–S23). The number of genes with significant q-values is presented in Table 1 (for the abbreviations see Supplementary Table S2). Each analysis ran in less than 3 h on a cluster node with 4× Intel Xeon 6 core 2.67 GHz.
5 Conclusions
We propose JEPEGMIX2, an updated software/method for testing the association between (cis-eQTL mediated) gene expression and trait. Unlike existing methods, even for highly enriched GWAS, the competitive version of JEPEGMIX2 fully controls the false positive rates at or below nominal levels. To the applicability of JEPEGMIX to cosmopolitan cohorts, we add a competitive version and extend the number of included (i) eQTLs and (ii) tissues. Unlike existing methods, it also accommodates upweighting of rare variants and, by using a Holm adjustment, avoids the increased rate of false positives incurred by FDR adjustment under enrichment. While gene expression in different tissues is often correlated, and its measurement incomplete due to the rather small sample sizes of existing gene expression experiments, the capacity to discriminate causal tissues will be enhanced by further increases in the sample size of such studies. Being written in C++, JEPEGMIX2 is very fast. Future versions of the software will use larger reference panels.

Conflict of Interest: none declared.
Authors: Brendan K Bulik-Sullivan; Po-Ru Loh; Hilary K Finucane; Stephan Ripke; Jian Yang; Nick Patterson; Mark J Daly; Alkes L Price; Benjamin M Neale Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330
Authors: Jian Yang; Beben Benyamin; Brian P McEvoy; Scott Gordon; Anjali K Henders; Dale R Nyholt; Pamela A Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael E Goddard; Peter M Visscher Journal: Nat Genet Date: 2010-06-20 Impact factor: 38.330
Authors: Valur Emilsson; Gudmar Thorleifsson; Bin Zhang; Amy S Leonardson; Florian Zink; Jun Zhu; Sonia Carlson; Agnar Helgason; G Bragi Walters; Steinunn Gunnarsdottir; Magali Mouy; Valgerdur Steinthorsdottir; Gudrun H Eiriksdottir; Gyda Bjornsdottir; Inga Reynisdottir; Daniel Gudbjartsson; Anna Helgadottir; Aslaug Jonasdottir; Adalbjorg Jonasdottir; Unnur Styrkarsdottir; Solveig Gretarsdottir; Kristinn P Magnusson; Hreinn Stefansson; Ragnheidur Fossdal; Kristleifur Kristjansson; Hjortur G Gislason; Tryggvi Stefansson; Bjorn G Leifsson; Unnur Thorsteinsdottir; John R Lamb; Jeffrey R Gulcher; Marc L Reitman; Augustine Kong; Eric E Schadt; Kari Stefansson Journal: Nature Date: 2008-03-16 Impact factor: 49.962
Authors: Donghyung Lee; Vernell S Williamson; T Bernard Bigdeli; Brien P Riley; Bradley T Webb; Ayman H Fanous; Kenneth S Kendler; Vladimir I Vladimirov; Silviu-Alin Bacanu Journal: Bioinformatics Date: 2015-10-01 Impact factor: 6.937
Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330
Authors: Stephan Ripke; Colm O'Dushlaine; Kimberly Chambert; Jennifer L Moran; Anna K Kähler; Susanne Akterin; Sarah E Bergen; Ann L Collins; James J Crowley; Menachem Fromer; Yunjung Kim; Sang Hong Lee; Patrik K E Magnusson; Nick Sanchez; Eli A Stahl; Stephanie Williams; Naomi R Wray; Kai Xia; Francesco Bettella; Anders D Borglum; Brendan K Bulik-Sullivan; Paul Cormican; Nick Craddock; Christiaan de Leeuw; Naser Durmishi; Michael Gill; Vera Golimbet; Marian L Hamshere; Peter Holmans; David M Hougaard; Kenneth S Kendler; Kuang Lin; Derek W Morris; Ole Mors; Preben B Mortensen; Benjamin M Neale; Francis A O'Neill; Michael J Owen; Milica Pejovic Milovancevic; Danielle Posthuma; John Powell; Alexander L Richards; Brien P Riley; Douglas Ruderfer; Dan Rujescu; Engilbert Sigurdsson; Teimuraz Silagadze; August B Smit; Hreinn Stefansson; Stacy Steinberg; Jaana Suvisaari; Sarah Tosato; Matthijs Verhage; James T Walters; Douglas F Levinson; Pablo V Gejman; Kenneth S Kendler; Claudine Laurent; Bryan J Mowry; Michael C O'Donovan; Michael J Owen; Ann E Pulver; Brien P Riley; Sibylle G Schwab; Dieter B Wildenauer; Frank Dudbridge; Peter Holmans; Jianxin Shi; Margot Albus; Madeline Alexander; Dominique Campion; David Cohen; Dimitris Dikeos; Jubao Duan; Peter Eichhammer; Stephanie Godard; Mark Hansen; F Bernard Lerer; Kung-Yee Liang; Wolfgang Maier; Jacques Mallet; Deborah A Nertney; Gerald Nestadt; Nadine Norton; Francis A O'Neill; George N Papadimitriou; Robert Ribble; Alan R Sanders; Jeremy M Silverman; Dermot Walsh; Nigel M Williams; Brandon Wormley; Maria J Arranz; Steven Bakker; Stephan Bender; Elvira Bramon; David Collier; Benedicto Crespo-Facorro; Jeremy Hall; Conrad Iyegbe; Assen Jablensky; Rene S Kahn; Luba Kalaydjieva; Stephen Lawrie; Cathryn M Lewis; Kuang Lin; Don H Linszen; Ignacio Mata; Andrew McIntosh; Robin M Murray; Roel A Ophoff; John Powell; Dan Rujescu; Jim Van Os; Muriel Walshe; Matthias Weisbrod; Durk Wiersma; Peter Donnelly; Ines Barroso; Jenefer M 
Blackwell; Elvira Bramon; Matthew A Brown; Juan P Casas; Aiden P Corvin; Panos Deloukas; Audrey Duncanson; Janusz Jankowski; Hugh S Markus; Christopher G Mathew; Colin N A Palmer; Robert Plomin; Anna Rautanen; Stephen J Sawcer; Richard C Trembath; Ananth C Viswanathan; Nicholas W Wood; Chris C A Spencer; Gavin Band; Céline Bellenguez; Colin Freeman; Garrett Hellenthal; Eleni Giannoulatou; Matti Pirinen; Richard D Pearson; Amy Strange; Zhan Su; Damjan Vukcevic; Peter Donnelly; Cordelia Langford; Sarah E Hunt; Sarah Edkins; Rhian Gwilliam; Hannah Blackburn; Suzannah J Bumpstead; Serge Dronov; Matthew Gillman; Emma Gray; Naomi Hammond; Alagurevathi Jayakumar; Owen T McCann; Jennifer Liddle; Simon C Potter; Radhi Ravindrarajah; Michelle Ricketts; Avazeh Tashakkori-Ghanbaria; Matthew J Waller; Paul Weston; Sara Widaa; Pamela Whittaker; Ines Barroso; Panos Deloukas; Christopher G Mathew; Jenefer M Blackwell; Matthew A Brown; Aiden P Corvin; Mark I McCarthy; Chris C A Spencer; Elvira Bramon; Aiden P Corvin; Michael C O'Donovan; Kari Stefansson; Edward Scolnick; Shaun Purcell; Steven A McCarroll; Pamela Sklar; Christina M Hultman; Patrick F Sullivan Journal: Nat Genet Date: 2013-08-25 Impact factor: 38.330
Authors: John M Hettema; Brad Verhulst; Chris Chatzinakos; Silviu-Alin Bacanu; Chia-Yen Chen; Robert J Ursano; Ronald C Kessler; Joel Gelernter; Jordan W Smoller; Feng He; Sonia Jain; Murray B Stein Journal: Am J Med Genet B Neuropsychiatr Genet Date: 2019-12-30 Impact factor: 3.568
Authors: Chris Chatzinakos; Foivos Georgiadis; Donghyung Lee; Na Cai; Vladimir I Vladimirov; Anna Docherty; Bradley T Webb; Brien P Riley; Jonathan Flint; Kenneth S Kendler; Nikolaos P Daskalakis; Silviu-Alin Bacanu Journal: Am J Med Genet B Neuropsychiatr Genet Date: 2020-09-21 Impact factor: 3.568