Integration of multidimensional splicing data and GWAS summary statistics for risk gene discovery.

Ying Ji^1,2, Qiang Wei^1,2, Rui Chen^1,2, Quan Wang^1,2, Ran Tao^2,3, Bingshan Li^1,2.

Abstract

A common strategy for the functional interpretation of genome-wide association study (GWAS) findings has been the integrative analysis of GWAS and expression data. Using this strategy, many association methods (e.g., PrediXcan and FUSION) have been successful in identifying trait-associated genes via mediating effects on RNA expression. However, these approaches often ignore the effects of splicing, which can carry as much disease risk as expression. Compared to expression data, one challenge in detecting associations using splicing data is the large multiple testing burden due to multidimensional splicing events within genes. Here, we introduce a multidimensional splicing gene (MSG) approach, which consists of two stages: 1) we use sparse canonical correlation analysis (sCCA) to construct latent canonical vectors (CVs) by identifying sparse linear combinations of genetic variants and splicing events that are maximally correlated with each other; and 2) we test for the association between the genetically regulated splicing CVs and the trait of interest using GWAS summary statistics. Simulations show that MSG has proper type I error control and substantial power gains over existing multidimensional expression analysis methods (i.e., S-MultiXcan, UTMOST, and sCCA+ACAT) under diverse scenarios. When applied to the Genotype-Tissue Expression Project data and GWAS summary statistics of 14 complex human traits, MSG identified on average 83%, 115%, and 223% more significant genes than sCCA+ACAT, S-MultiXcan, and UTMOST, respectively. We highlight MSG's applications to Alzheimer's disease, low-density lipoprotein cholesterol, and schizophrenia, and find that the majority of MSG-identified genes would have been missed by expression-based analyses. Our results demonstrate that aggregating splicing data through MSG can improve power in identifying gene-trait associations and help better understand the genetic risk of complex traits.

Year: 2022 PMID: 35771864 PMCID: PMC9278751 DOI: 10.1371/journal.pgen.1009814

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 6.020

Introduction

Over the past two decades, genome-wide association studies (GWAS) have led to the discovery of many trait-associated loci. However, most loci are located in non-coding regions of the genome, whose functional relevance remains largely unclear [1]. Recent research suggested that a large portion of GWAS loci might influence complex traits through regulating gene expression levels [2, 3]. One family of methods called transcriptome-wide association studies (TWAS) has been developed to integrate GWAS and gene expression datasets to identify gene-trait associations [4]. In particular, TWAS methods like PrediXcan [5], FUSION [6], and S-PrediXcan [7] first build gene expression prediction models using reference transcriptome datasets (e.g., the Genotype-Tissue Expression (GTEx) Project [8]) and then test the associations between tissue-specific genetically predicted gene expressions and disease phenotypes using readily-available GWAS individual- or summary-level data. These methods have been widely used in practice as they facilitate the functional interpretation of existing GWAS associations and detection of novel trait-associated genes. Gene expression is not the only mediator of genetic effects on complex traits. Splicing is of comparable importance and often functions independently of expression [3, 7, 9, 10]. The splicing process involves highly context-dependent regulation and other complex mechanisms, which could be prone to errors with potentially pathological consequences [11]. In fact, recent studies indicated that at least 20% of disease-causing mutations might affect pre-mRNA splicing [12], and splicing quantitative trait loci (sQTLs) could account for disproportionately high fractions of disease heritability [13, 14]. Despite the importance of splicing regulation, it has been understudied largely due to its complexity. Therefore, there is a pressing need to investigate trait-associated genes with effects mediated by splicing. 
Splicing can be quantified in many ways using RNA-seq data. LeafCutter [15] is a recently developed approach that allows for the identification and quantification of novel and known splicing events by leveraging information from spliced reads (reads that span an intron) in short-read RNA-seq data. LeafCutter does not rely on the transcript models or predefined splicing events used in previous splicing quantification approaches like isoform-quantification or exon-quantification, which might be inaccurate or incomplete. It has been used to analyze RNA-seq data in a number of important genomic studies, including GTEx [16], ROSMAP [17], and CommonMind [18]. While gene expression can usually be summarized into one measurement per gene per tissue, there are on average eight RNA splicing events quantified using LeafCutter per protein coding gene per tissue [15]. To analyze splicing data, a straightforward extension of the TWAS framework for expression data is to test each genetically predicted splicing event separately and then correct for multiple testing [9, 14, 17, 19, 20]. For example, Gusev et al. [19] detected a comparable number of significant genes associated with schizophrenia from around nine times as many splicing events (99,562) as expression features (10,819). While these results lend support for the importance of splicing as a genotype-phenotype link, they also suggest that there is room for appreciable power gain when information embedded in splicing events can be effectively aggregated and the multiple testing burden can be dramatically alleviated. A closely related multiple testing problem arises in TWASs when the most relevant tissue for the disease of interest is unclear, and one has to test the association between the predicted gene expression and disease outcome in each tissue separately and then apply multiple testing correction. 
To alleviate this multiple testing burden and improve statistical power, multi-tissue TWAS approaches like S-MultiXcan [21] and UTMOST [22] have been proposed to evaluate multiple single-tissue associations jointly by an omnibus test. Specifically, S-MultiXcan first builds gene expression prediction models in each tissue separately and then performs a chi-squared test for the joint effects of expressions from different tissues on the trait of interest. To avoid collinearity issues, it applies singular value decomposition (SVD) to the covariance matrix of predicted expressions and then discards the axes of small variation. UTMOST first builds tissue-specific expression prediction models by borrowing information across tissues and then uses the generalized Berk-Jones (GBJ) test [23, 24] to combine associations across tissues. Recently, Feng et al. [25] proposed to use sparse canonical correlation analysis (sCCA) [26] to directly build multi-tissue gene expression features and then jointly test those sCCA features and single-tissue predicted expressions using the aggregate Cauchy association test (ACAT) [27]. They showed that this sCCA+ACAT approach could be more powerful than S-MultiXcan and UTMOST. In this paper, we propose a multidimensional splicing gene (MSG) framework to jointly test the association between all splicing events in a gene and the trait of interest. In brief, we use sCCA to build genetically predicted multi-splicing-event features, and then perform association tests of the predicted splicing events with the trait of interest. We use the GTEx published intron excision ratios calculated by LeafCutter as the splicing event phenotype in this analysis. To efficiently capture the genetic components of splicing, we use the SVD regularization approach of S-MultiXcan [21] to compute a pseudo-inverse of the covariance matrix of the genetically predicted splicing events, which removes the axes of small variation. 
This strategy offers advantages in statistical power by reducing the degrees of freedom of the chi-squared test statistic in subsequent gene-trait association analysis. We evaluated the performance of our MSG approach, and compared its performance with those of S-MultiXcan, UTMOST, and sCCA+ACAT through extensive simulations and real data applications. In simulations, we showed that MSG provided properly controlled type I error rates, and yielded substantial power gains over S-MultiXcan, UTMOST, and sCCA+ACAT. Real data applications using GTEx data and summary statistics from 14 complex human traits demonstrated that MSG identified on average 83%, 115%, and 223% more significant genes than sCCA+ACAT, S-MultiXcan, and UTMOST, respectively. We showcased the applications of MSG to GWAS summary statistics of Alzheimer’s disease (AD), low-density lipoprotein cholesterol (LDL-C), and schizophrenia, and found that the majority of significant splicing-trait associated genes (75%, 86%, and 89% of genes for AD, LDL-C, and schizophrenia, respectively) would have been missed by expression-based analyses, highlighting the potential of incorporating splicing data into post-GWAS analyses to better our understanding of the genetic underpinnings of complex traits.

Results

Methods overview

Our proposed MSG method consists of two stages. In the first stage, we use sCCA to construct latent canonical vectors (CVs) by identifying sparse linear combinations of single nucleotide polymorphisms (SNPs) and splicing events that are maximally correlated with each other, where each splicing event is a standardized and normalized intron excision ratio calculated via LeafCutter [15] using default parameters (see details in the Methods section). In the second stage, we test for the association between each of the genetically regulated splicing CVs and the trait of interest using GWAS summary statistics. To integrate single splicing CV-trait associations into a gene-level statistic, we estimate the correlation matrix of these predicted splicing CVs using an external linkage disequilibrium (LD) reference panel. We use the SVD regularization method of [21] to determine the number of informative splicing CVs (i.e., effective degree of freedom) that explain the largest variations. Finally, we combine the associations using a chi-squared test. Fig 1 displays an overview of the MSG method (see details in the Methods section).
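As a rough illustration of the first stage, the sketch below runs a simplified rank-one sparse CCA on simulated data. It is a toy penalized power iteration with a relative soft-threshold, not the sCCA implementation of [26] used in the paper; the helper names (standardize, sparse_cca_rank1), the threshold fractions, and the simulated genotype and splicing matrices are all assumptions made for illustration.

```python
import numpy as np

def standardize(M):
    """Center and scale each column to unit variance."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam (L1 proximal step)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.0, n_iter=50):
    """Toy rank-one sparse CCA (hypothetical helper, not the paper's code).

    X: n x p genotype matrix, Y: n x q splicing-event matrix.
    frac_u, frac_v: soft-threshold levels expressed as a fraction of the
    largest absolute loading (an illustrative choice of tuning scheme).
    Returns SNP weights u and splicing-event weights v, each of unit norm.
    """
    Xs, Ys = standardize(X), standardize(Y)
    C = Xs.T @ Ys / Xs.shape[0]                       # p x q cross-correlation
    v = np.linalg.svd(C, full_matrices=False)[2][0]   # leading right singular vector
    for _ in range(n_iter):
        a = C @ v
        u = soft_threshold(a, frac_u * np.abs(a).max())
        u /= np.linalg.norm(u)
        b = C.T @ u
        v = soft_threshold(b, frac_v * np.abs(b).max())
        v /= np.linalg.norm(v)
    return u, v

# Simulated example: 50 SNPs, 5 splicing events driven by the first 3 SNPs.
rng = np.random.default_rng(0)
n, p, q = 500, 50, 5
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
Y = X[:, :3].sum(axis=1, keepdims=True) + rng.normal(size=(n, q))

u, v = sparse_cca_rank1(X, Y)
cv_corr = np.corrcoef(standardize(X) @ u, standardize(Y) @ v)[0, 1]
print("SNPs with non-zero weight:", np.flatnonzero(u))
print("correlation between the genotype CV and the splicing CV:", cv_corr)
```

In this sketch the non-zero entries of u play the role of the SNP weights that define one genetically regulated splicing CV; further CVs would require deflation or a full sCCA package.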

Fig 1

Schematic of the MSG method.
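The second stage can be sketched in a few lines under some simplifying assumptions: genotypes are standardized, so the LD reference panel enters only through a SNP correlation matrix D, the per-CV z-scores follow the usual summary-statistic TWAS form, and axes of small variation are dropped with a condition-number cutoff in the spirit of the SVD regularization of [21]. The function name msg_style_gene_test and the cutoff value of 30 are illustrative choices, not taken from the paper's code.

```python
import numpy as np
from scipy.stats import chi2

def msg_style_gene_test(W, z_gwas, D, cond_max=30.0):
    """Gene-level test from summary statistics (illustrative sketch).

    W: p x K matrix of SNP weights, one column per splicing CV.
    z_gwas: length-p vector of GWAS z-scores for the same SNPs.
    D: p x p SNP correlation (LD) matrix from a reference panel.
    cond_max: eigenvalues below max/cond_max are discarded (assumed cutoff).
    Returns the chi-squared statistic, its degrees of freedom, and the p-value.
    """
    cov = W.T @ D @ W                        # K x K covariance of predicted CVs
    sd = np.sqrt(np.diag(cov))
    z_cv = (W.T @ z_gwas) / sd               # z-score of each CV-trait association
    R = cov / np.outer(sd, sd)               # correlation of the predicted CVs
    evals, evecs = np.linalg.eigh(R)
    keep = evals > evals.max() / cond_max    # drop axes of small variation
    proj = evecs[:, keep].T @ z_cv
    stat = float(np.sum(proj ** 2 / evals[keep]))   # z' R^+ z with regularized pseudo-inverse
    df = int(keep.sum())
    return stat, df, float(chi2.sf(stat, df))

# Two uncorrelated CVs: the statistic is the sum of squared z-scores on 2 df.
stat, df, pval = msg_style_gene_test(np.eye(2), np.array([3.0, 4.0]), np.eye(2))

# Two perfectly collinear CVs: one axis is discarded and 1 df remains.
W_collinear = np.array([[1.0, 1.0], [0.0, 0.0]])
stat2, df2, pval2 = msg_style_gene_test(W_collinear, np.array([3.0, 4.0]), np.eye(2))
print(stat, df, pval)
print(stat2, df2, pval2)
```

The collinear case shows why the regularization matters: redundant CVs do not inflate the degrees of freedom of the test.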

Simulations: Type I error and power analysis

We performed extensive simulations to compare the performance of MSG, S-MultiXcan, UTMOST, and sCCA+ACAT in terms of their type I error and power under various scenarios (see details in the Methods section). Following Barbeira et al. [21], Hu et al. [22], and Feng et al. [25], we considered correlations between splicing events within a gene but not correlations between nearby genes induced by LD, because such simulations would be too complicated and would obscure the main purpose of the paper. In the first set of simulations, we varied the number of effect-sharing splicing events (referred to as “sharing”), the proportion of genetic variants that have non-zero effects on splicing (referred to as “sparsity”), and the cis-heritability of splicing events (referred to as h_c^2; hc2 in Table 1). We found that MSG, S-MultiXcan, and sCCA+ACAT have properly controlled type I error rates in all scenarios, while UTMOST is slightly conservative (Table 1). Fig 2 shows that power increases with splicing heritability and decreases as sparsity decreases, although the impact of sparsity on power appears to be less pronounced than that of heritability. In the second set of simulations, we defined “effect-sharing splicing events”, “non-effect-sharing splicing events”, and “trait-contributing splicing events” as splicing events that are regulated by a common set of SNPs, splicing events that are regulated by non-overlapping SNPs, and splicing events that are associated with the trait, respectively. We considered three scenarios: 1) all splicing events are trait-contributing; 2) only effect-sharing splicing events are trait-contributing; and 3) only non-effect-sharing splicing events are trait-contributing. Fig 3 shows that power increases with the number of trait-contributing splicing events for all methods, regardless of the number of effect-sharing splicing events. 
In both sets of simulations, we found that MSG is uniformly more powerful than S-MultiXcan, sCCA+ACAT, and UTMOST, with substantial margins.
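To make the three simulation parameters concrete, the sketch below generates splicing events for one gene with a given sharing, sparsity, and cis-heritability. The actual simulation design is described in the Methods section and may differ; the function simulate_splicing_events, the normally distributed effect sizes, and the binomial genotypes are assumptions made purely for illustration.

```python
import numpy as np

def simulate_splicing_events(G, n_events=10, sharing=4, sparsity=0.05, h2=0.05, seed=0):
    """Simulate splicing events for one gene (hypothetical design).

    G: n x p genotype matrix of cis-SNPs.
    sharing: number of events regulated by one common SNP set.
    sparsity: proportion of SNPs with non-zero effects on an event.
    h2: cis-heritability of each splicing event.
    Returns the simulated events and their genetic components (both n x n_events).
    """
    rng = np.random.default_rng(seed)
    n, p = G.shape
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)
    n_causal = max(1, round(sparsity * p))
    n_sets = 1 + (n_events - sharing)        # one shared set plus one set per remaining event
    if n_sets * n_causal > p:
        raise ValueError("not enough SNPs for non-overlapping causal sets")
    order = rng.permutation(p)
    sets = [order[i * n_causal:(i + 1) * n_causal] for i in range(n_sets)]
    genetic = np.empty((n, n_events))
    for k in range(n_events):
        idx = sets[0] if k < sharing else sets[1 + k - sharing]
        g = Gs[:, idx] @ rng.normal(size=n_causal)             # assumed normal effect sizes
        genetic[:, k] = np.sqrt(h2) * (g - g.mean()) / g.std() # scale genetic variance to h2
    noise = rng.normal(scale=np.sqrt(1.0 - h2), size=(n, n_events))
    return genetic + noise, genetic

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(1000, 300)).astype(float)
events, genetic = simulate_splicing_events(G, sharing=4, sparsity=0.05, h2=0.05)
print(events.shape, genetic.var(axis=0).round(3))
```

Here the first four events share one causal SNP set and the remaining six each draw a disjoint set, which mirrors the definitions of effect-sharing and non-effect-sharing events given above.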

Table 1

Type I error rates in the first set of simulations.

Sharing	Sparsity	hc2	S-MultiXcan	UTMOST	sCCA+ACAT	MSG
2	1%	1%	0.047	0.036	0.052	0.049
		5%	0.051	0.038	0.053	0.050
		10%	0.057	0.043	0.051	0.049
	5%	1%	0.044	0.035	0.050	0.051
		5%	0.050	0.036	0.051	0.051
		10%	0.055	0.042	0.046	0.052
	10%	1%	0.041	0.031	0.054	0.051
		5%	0.054	0.042	0.052	0.053
		10%	0.053	0.039	0.054	0.052
4	1%	1%	0.044	0.034	0.053	0.050
		5%	0.053	0.039	0.052	0.051
		10%	0.056	0.043	0.055	0.048
	5%	1%	0.043	0.033	0.049	0.051
		5%	0.048	0.034	0.054	0.049
		10%	0.053	0.040	0.054	0.051
	10%	1%	0.044	0.031	0.055	0.047
		5%	0.047	0.037	0.055	0.051
		10%	0.057	0.045	0.053	0.054
8	1%	1%	0.044	0.034	0.054	0.051
		5%	0.047	0.037	0.054	0.050
		10%	0.054	0.041	0.050	0.049
	5%	1%	0.044	0.033	0.051	0.047
		5%	0.051	0.036	0.054	0.053
		10%	0.053	0.040	0.059	0.050
	10%	1%	0.042	0.032	0.051	0.048
		5%	0.050	0.040	0.053	0.050
		10%	0.044	0.034	0.051	0.051

Note: Type I error was computed as the proportion of significant genes under the p-value cutoff of 0.05. Each entry is based on 20,000 replicates. The total number of splicing events is 10.
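When reading Table 1 it helps to know how much an empirical type I error rate can fluctuate by chance. The short sketch below computes the binomial Monte Carlo standard error for a nominal level of 0.05 and 20,000 replicates, using a normal approximation; the function name and the 1.96 multiplier are illustrative choices.

```python
import math

def monte_carlo_interval(alpha=0.05, n_rep=20_000, z=1.96):
    """Approximate range for an empirical rejection rate under a valid test."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_rep)   # binomial standard error
    return se, alpha - z * se, alpha + z * se

se, lower, upper = monte_carlo_interval()
print(f"standard error = {se:.5f}, approximate 95% range = ({lower:.4f}, {upper:.4f})")
```

Entries that sit well below the lower end of this range point to a conservative test, while occasional small excursions are expected when many scenarios are tabulated.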

Fig 2

Power comparison between the S-MultiXcan, UTMOST, sCCA+ACAT, and MSG methods in the first set of simulations.

Results are shown for different numbers of effect-sharing splicing events (2, 4, 8), sparsity levels (0.01, 0.05, 0.1), and splicing heritabilities (0.01, 0.05, 0.1). The trait heritability is fixed at 0.01. For each subplot, the x-axis stands for the number of effect-sharing splicing events and the y-axis stands for the proportion of significant genes under the p-value cutoff of 5 × 10^−6 across 2000 replicates.

Fig 3

Power comparison between the S-MultiXcan, UTMOST, sCCA+ACAT, and MSG models in the second set of simulations.

Results are shown for different sets of trait-contributing splicing events. For each subplot, the x-axis stands for the number of effect-sharing splicing events (2, 4, 8) and the y-axis stands for the proportion of significant genes under the p-value cutoff of 5 × 10^−6 across 2000 replicates.

Applications to complex human traits

Summary of applications to 14 traits

We applied MSG, S-MultiXcan, sCCA+ACAT, and UTMOST to splicing data from the GTEx project (V8 release [28]) to obtain genetic prediction models for splicing events. We then applied the models to GWAS summary statistics of 14 complex traits to identify trait-associated genes whose genetic effects were mediated via splicing. Following [22], for each trait, we chose the tissue with the top trait heritability enrichment in the respective tissue-specific annotation using linkage disequilibrium score regression [29] (see Supplementary Table 24 of [22]). We applied all methods (MSG, S-MultiXcan, sCCA+ACAT, and UTMOST) to splicing data from the same tissue for each trait; see Table 2 for the tissue chosen for each trait. The sample sizes of these tissues in GTEx range from 175 (brain frontal cortex BA9) to 706 (muscle skeletal). We extracted cis-SNPs within 500 kb upstream of the transcription start site and 500 kb downstream of the transcription stop site. We selected GWASs of 14 complex traits (both quantitative and binary traits) with reasonably large sample sizes, ranging from 51,710 (bipolar disorder) to 408,953 (type 2 diabetes). When implementing MSG, we used 5,000 randomly selected European subjects from BioVU, the Vanderbilt University biorepository linked to de-identified electronic medical records [30], because MSG requires a larger LD reference panel to ensure proper type I error control (Table A in S1 Appendix). We also used the same LD reference panel for S-MultiXcan, UTMOST, and sCCA+ACAT to ease the comparison between methods. We used Bonferroni correction to account for multiple testing across all testable genes (Table B in S2 Appendix) for each trait separately. 
Table 2 shows that, at a Bonferroni-corrected significance level of 0.05, MSG identified on average 83%, 115%, and 223% more significant genes than sCCA+ACAT, S-MultiXcan, and UTMOST, respectively, a substantial improvement over existing methods (see Tables C-P in S2 Appendix for complete lists of significant genes identified by MSG). In particular, we examined closely the results for AD, LDL-C, and schizophrenia, with details in the next three subsections.
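For readers unfamiliar with the per-trait Bonferroni correction, the sketch below shows the arithmetic. The number of testable genes per trait and method is listed in Table B in S2 Appendix and is not reproduced here, so the count of 10,000 and the gene names are hypothetical placeholders.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

def significant_genes(gene_pvalues, alpha=0.05):
    """Genes whose p-value passes the Bonferroni cutoff for this set of tests."""
    threshold = bonferroni_threshold(len(gene_pvalues), alpha)
    return sorted(g for g, p in gene_pvalues.items() if p < threshold)

# Hypothetical number of testable genes for one trait.
threshold = bonferroni_threshold(10_000)

# Hypothetical gene-level p-values; with three tests the cutoff is 0.05 / 3.
toy_pvalues = {"GENE_A": 1e-9, "GENE_B": 0.02, "GENE_C": 0.3}
hits = significant_genes(toy_pvalues)
print(threshold, hits)
```

Because the correction is applied to each trait and method separately, a method with more testable genes faces a slightly stricter per-gene cutoff.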

Table 2

Numbers of significant gene-trait associations across 14 human traits using S-MultiXcan, UTMOST, sCCA+ACAT, and MSG.

Trait	Tissue	MSG	S-MultiXcan	UTMOST	sCCA+ACAT
AD	Brain frontal cortex BA9	32	19	14	19
Bipolar disorder	Brain frontal cortex BA9	67	23	17	31
Major depressive disorder	Brain frontal cortex BA9	23	5	3	0
Body mass index (BMI)	Brain frontal cortex BA9	1757	786	497	704
Schizophrenia	Brain frontal cortex BA9	501	222	153	234
Neuroticism	Brain frontal cortex BA9	178	68	46	72
Type 2 diabetes	Adipose subcutaneous	104	53	41	59
Total cholesterol	Liver	202	109	66	118
LDL-C	Liver	200	108	69	120
Serum urate	Liver	87	63	50	64
High-Density Lipoprotein Cholesterol	Adipose subcutaneous	161	79	53	111
Triglycerides	Adipose subcutaneous	144	96	69	94
Waist hip ratio adjusted for BMI	Adipose subcutaneous	860	397	259	516
Age at natural menopause	Muscle skeletal	220	118	79	115

Application to AD

We used the brain frontal cortex BA9 splicing data from the GTEx project to build genetic prediction models for splicing events and then conducted gene-trait association analysis using the stage I GWAS summary statistics from the International Genomics of Alzheimer’s Project (IGAP) (N = 54,162) [31]. MSG, UTMOST, S-MultiXcan, and sCCA+ACAT identified 32, 14, 19, and 19 significant genes, respectively (Table 2 and Fig 4A). We observed that 26 out of the 32 MSG significant genes are within 500 kb of five GWAS-identified lead SNPs, including the PTK2B-CLU locus on chromosome (CHR) 8, SPI1 locus on CHR 11, MS4A4A locus on CHR 11, PICALM locus on CHR 11, and APOE locus on CHR 19 (Table Q in S2 Appendix). Among the gene-trait associations identified using MSG, 22% (7/32) were also identified by all the other three approaches; 44% (14/32) were also identified by at least one of the other approaches; and 34% (11/32) were identified by MSG only (Fig 4B). To replicate our findings, we applied these four methods to summary statistics from the GWAS by proxy (GWAX) for AD in the UK Biobank (N = 114,564) [32]. MSG, sCCA+ACAT, S-MultiXcan, and UTMOST replicated six (MARK4, ERCC1, RELB, CLASRP, PPP1R37, CEACAM19), two (RELB, APOC1), one (RELB), and zero significant genes, respectively, under the Bonferroni-corrected significance threshold. We compiled a list of well-known AD-associated genes (Note 1 Section 1A in S1 Appendix) from [33], and found that several MSG-identified AD genes are in this list (labelled in red in Fig 4C).

Fig 4

Results of the AD analysis using the IGAP stage I GWAS summary statistics.

Application to LDL-C

We used the liver splicing data from the GTEx project to build genetic prediction models for splicing events and then conducted gene-trait association analysis using the LDL-C GWAS summary statistics from the Global Lipids Genetics Consortium (GLGC) (N = 188,578) [43]. MSG, S-MultiXcan, UTMOST, and sCCA+ACAT identified 200, 108, 69, and 120 significant genes, respectively (Table 2 and Fig 5A). We found that 102 out of the 200 MSG significant genes are within 500 kb of the 20 GWAS significant lead SNPs; that is, the MSG genes cluster around known SNP-level significant loci to a lesser extent than in AD (Table R in S2 Appendix). Among the gene-trait associations identified by MSG, 23% (47/200) were also identified by all the other three approaches; 37% (75/200) were also identified by at least one of the other approaches; and 39% (78/200) were identified by MSG only (Fig 5B). To replicate our findings, we applied these four approaches to summary statistics from the LDL-C UK Biobank GWAS [44] (N = 343,621) and identified 474, 223, 254, and 175 genes using MSG, S-MultiXcan, sCCA+ACAT, and UTMOST, respectively. The replication rates are high for all four methods: among the significant genes identified in the GLGC GWAS, 161 out of 200 (81%), 79 out of 108 (73%), 93 out of 120 (77%), and 52 out of 69 (75%) were replicated in the UK Biobank analysis using MSG, S-MultiXcan, sCCA+ACAT, and UTMOST, respectively, under the Bonferroni-corrected significance threshold. We compiled a list of well-known LDL-associated genes (Note 1 Section 1B in S1 Appendix) from [45], and found that several MSG-identified LDL-C genes are in this list, including LPIN3, FADS3, LDLRAP1, FADS1, LDLR, and FADS2 (labelled in red in Fig 5C).

Fig 5

Results of the LDL-C analysis using the GLGC GWAS summary statistics.

Application to schizophrenia

We used the brain frontal cortex BA9 splicing data from the GTEx project to build genetic prediction models for splicing events and then conducted gene-trait association analysis using a schizophrenia GWAS (N = 105,318) [55]. MSG, S-MultiXcan, sCCA+ACAT, and UTMOST identified 501, 222, 234, and 153 significant genes, respectively (Table 2 and Fig 6A). We also investigated the performance of the strategy adopted in [19], which tests each genetically predicted splicing event separately and then corrects for multiple testing. This approach identified 170 significant genes, fewer than MSG (501), S-MultiXcan (222), and sCCA+ACAT (234), showcasing room for appreciable power gain when information embedded in multiple splicing events can be aggregated and the multiple testing burden can be alleviated (incidentally, this approach identified more significant genes than UTMOST).

Fig 6

Results of schizophrenia analysis.

A) Bar plots of the number of significant genes using different methods; B) Venn diagram showing the overlap of significant genes identified by different methods; C) Manhattan plot for the MSG analysis. Genes with strong literature support are labeled in red.

We observed that 376 out of the 501 MSG significant genes are within 500 kb of 76 GWAS significant SNPs (see full list of these genes in Table S in S2 Appendix). Among the gene-trait associations identified using MSG, 16% (83/501) were also identified by all the other three approaches; 33% (165/501) were also identified by at least one of the other approaches; and 51% (253/501) were identified by MSG only (Fig 6B). Currently available large-scale schizophrenia GWASs often have sample overlap, so we were unable to replicate the genes in an independent GWAS. We found that a few genes identified by MSG had been reported to influence schizophrenia risk via splicing (Note 1 Section 1C in S1 Appendix), including SNX19 [56], AS3MT [57], and CYP2D6 [56] (Fig 6C). We also conducted a conventional TWAS using S-PrediXcan, GTEx brain frontal cortex BA9 gene expression data, and the same GWAS summary statistics. We found that out of the 501 genes identified by MSG using splicing data, 55 genes could also be identified by S-PrediXcan using expression data. Due to the complex haplotype and LD structure of the major histocompatibility complex (MHC) region, we summarized the results for genes in and outside of the MHC region separately. In the MHC region, 30 genes overlapped between the 33 genes identified by S-PrediXcan and the 101 genes identified by MSG (Table V in S2 Appendix). 
Genes in the MHC region with literature support for an association with schizophrenia that could only be identified using splicing data include NOTCH4 (MSG p-value = 8.35 × 10^−29; S-PrediXcan p-value = 8.19 × 10^−2) [58], TRIM26 (MSG p-value = 4.64 × 10^−14; S-PrediXcan p-value = 4.40 × 10^−1) [59], and ZSCAN9 (MSG p-value = 4.64 × 10^−14; S-PrediXcan p-value = 4.40 × 10^−1) [60]. Outside of the MHC region, 25 genes overlapped between the 58 genes identified by S-PrediXcan and the 400 genes identified by MSG (Table V in S2 Appendix). Among genes that could only be identified using splicing data (Fig E in S1 Appendix), we highlighted CACNA1C (MSG p-value = 9.35 × 10^−10; S-PrediXcan p-value = 5.44 × 10^−1), which encodes the predominant calcium voltage-gated channel α1 subunit in neurons, cardiac muscle, and endocrine cells [61]. CACNA1C is a high confidence disease risk gene with evidence from GWAS [55, 62], candidate gene studies [63, 64], and a whole exome sequencing study [65]. Yet, CACNA1C has not been reported to be significant in TWAS analyses based on expression data. CACNA1C has a complex splicing profile [66] and encodes multiple alternatively spliced transcripts, which can result in functionally and pharmacologically distinct channels. Since intracellular calcium is important for cellular signalling processes, and its intracellular levels are tightly regulated in neurons, it is unsurprising that dysregulation of these calcium channels can cause disruption of neural developmental pathways [67]. 
Other genes include CACNA1G (MSG p-value = 2.45 × 10^−6; S-PrediXcan p-value = 7.15 × 10^−1), which also encodes a calcium voltage-gated channel subunit and has been implicated in multiple studies as a risk gene associated with schizophrenia [62, 68]; SNX19 (MSG p-value = 2.27 × 10^−10; S-PrediXcan p-value = 2.38 × 10^−3), which has been reported to have schizophrenia risk-associated transcripts defined by an exon-exon splice junction between exons 8 and 10 (junc8.10) and is predicted to encode proteins lacking the characteristic nexin C terminal domain [69]; GRIA1 (MSG p-value = 1.28 × 10^−8; S-PrediXcan p-value = 1.62 × 10^−5), which has been reported to be a schizophrenia risk gene [62]; and PPP1R16B (MSG p-value = 4.86 × 10^−18; S-PrediXcan p-value = 8.62 × 10^−1), which has been reported to be associated with schizophrenia in several populations [62, 70] and with multiple psychiatric disorders [71, 72].

Discussion

While there has been extensive recent research on trait-associated gene discovery based on gene expression using methods like S-PrediXcan and FUSION and their multidimensional variants like S-MultiXcan, UTMOST, and sCCA+ACAT, there have been few studies on trait-associated gene discovery using splicing data so far. Splicing data present unique challenges due to their multidimensional nature, which demands the development of efficient analytic approaches. In this paper, we proposed MSG, a framework to construct cross-splicing event models using sCCA to boost power in identifying genes influencing traits via splicing. Through simulations, we showed that MSG has proper type I error control and superior power compared to current state-of-the-art approaches, e.g., S-MultiXcan, UTMOST, and sCCA+ACAT. In real data applications, MSG identified on average 83%, 115%, and 223% more significant genes than sCCA+ACAT, S-MultiXcan, and UTMOST, respectively, across 14 complex traits. We highlighted our findings on AD, LDL-C, and schizophrenia, and found that the significant genes identified by MSG cover comparable or more existing GWAS loci than the ones identified by sCCA+ACAT, UTMOST, or S-MultiXcan. Furthermore, MSG identified more genes both within GWAS loci (+/- 500kb) and outside GWAS loci. The latter could be potential novel associations missed by GWAS due to small genetic effect sizes at the single-SNP level. We found independent literature support for MSG-identified genes, showcasing MSG’s advantage of capturing novel risk genes mediated via splicing. While we focused on splicing intron excision ratios calculated by LeafCutter [15] throughout this paper, MSG is agnostic to different splicing quantifications and can potentially be applied to splicing data generated by other quantification methods. 
Throughout this paper, we compared our proposed MSG approach with existing TWAS approaches (UTMOST, S-MultiXcan, and sCCA+ACAT) for their performance in analyzing multidimensional splicing events in one tissue rather than multi-tissue expression data. This is primarily because these existing TWAS methods can handle multidimensional data and to our knowledge there is no existing method that was specifically developed for multidimensional splicing data in a single tissue. While the comparisons between the proposed MSG and UTMOST, S-MultiXcan, and sCCA+ACAT are “fair” in the sense that all methods used the same input data, we acknowledge that such comparisons may not be “fair” in the sense that those existing TWAS approaches were all developed for analyzing multi-tissue expression data and thus might not be optimized for analyzing multidimensional splicing data in a single tissue. An interesting future project would be to extend MSG to analyze multi-tissue expression data and compare its performance with UTMOST, S-MultiXcan, and sCCA+ACAT in that scenario, where the results and conclusions could be different from the ones presented in this paper. The interpretation of genes identified by MSG using splicing data is different from that of genes identified in conventional TWAS using expression data. While a gene identified using expression data indicates significant association between its genetically regulated expression level and the trait of interest, a gene identified by MSG using splicing data indicates significant association between the combined variation from multiple genetically regulated splicing events and the trait of interest. Because the genetic effects on splicing are largely independent of those on expression [3, 7, 9, 10], the proposed MSG approach using splicing data could complement conventional TWAS approaches that use expression data only, identifying additional gene-trait associations mediated through splicing. 
Through MSG, we found a considerable number of trait-associated genes that were not identified by S-PrediXcan using expression data, demonstrating the complementary roles of genetic regulation through splicing and expression on trait variation and disease susceptibility. The number of genes identified by MSG using splicing data is usually larger than that identified by S-PrediXcan using expression data. A few factors may contribute to this phenomenon. One factor is that splicing is highly prevalent, affecting over 95% of human genes [12]. It provides the possibility of cell type- and tissue-specific protein isoforms, and the possibility of regulating the production of different proteins through specific signaling pathways [73]. Another factor is that the rich multidimensional splicing information may yield higher power to detect gene-trait associations compared to one-dimensional expression information. It was shown that the power of conventional TWASs increases to a maximum when the sample size of the reference transcriptome dataset exceeds 1000 [6]. As most tissues in GTEx have sample sizes less than 1000, the sample size of a target tissue may be too small to yield enough power for expression data-based TWAS analysis, but may be sufficient to detect associations for multidimensional splicing data analysis. Thus, we believe that splicing data may offer unique opportunities to study genetic risk of complex traits, and view our method as an important step toward using sQTLs for GWAS interpretation and gene discovery. We observed an 83%–223% increase in the number of trait-associated splicing genes identified by MSG compared to established methods like sCCA+ACAT, S-MultiXcan, and UTMOST. The relative increase of power using MSG can be attributed to several factors. Specifically, the MSG models tend to be less sparse (i.e., include more SNPs with non-zero weights) than the S-MultiXcan and UTMOST models and explain more variability in splicing variation. 
As a result, the number of testable genes for MSG tends to be larger than those for S-MultiXcan and UTMOST and comparable to those of sCCA+ACAT (Table B in S2 Appendix). MSG is also substantially more powerful than sCCA+ACAT, despite the fact that both use sCCA to build genetically regulated splicing models. We speculate that it may be due to the following reasons: 1) MSG directly uses the sCCA-generated CVs for association tests, while sCCA+ACAT retrains the splicing CV models using elastic net. Retraining the splicing CV models is unnecessary because the CVs learned in sCCA are the best sparse linear combinations of the columns of the splicing and SNP data matrices that maximize the correlations between them. Therefore, the SNP weights obtained from sCCA are already the desirable predictors for the splicing CVs. In fact, retraining the splicing CV models could hurt because in this situation the splicing CVs estimated from sCCA would be treated as “observed outcomes” and the uncertainties associated with the sCCA procedure would be ignored. In addition, elastic net tends to generate models that are sparser and capture less splicing variation than sCCA. 2) MSG chooses the number of CVs to be included in the association test in an adaptive manner using the SVD regularization approach of [21], while sCCA+ACAT uses three CVs throughout, which may not be optimal for all genes and tissues. 3) MSG fully incorporates information from multiple CVs using a multi-degree-of-freedom chi-squared test, while the ACAT test directly combines p-values and thus could entail information loss. To check our speculations, we conducted simulation studies and real-data applications using the sCCA+ACAT method but without the model retraining step using elastic net. The results are shown in Fig F in S1 Appendix and Table W in S2 Appendix. 
We found that the sCCA+ACAT method without the model retraining step has comparable or higher power than that with the model retraining step using elastic net. We also used the same sCCA-generated CVs (without model retraining) and compared the SVD regularization coupled with multi-degree-of-freedom chi-squared test (implemented in MSG and S-MultiXcan) and the GBJ (implemented in UTMOST) and ACAT (implemented in sCCA+ACAT) tests. We found that the SVD regularization coupled with multi-degree-of-freedom chi-squared test has comparable or higher power than GBJ or ACAT tests in both simulations (Fig G in S1 Appendix) and real-data applications (Table X in S2 Appendix).

Because the MSG models tend to be less sparse compared to alternative methods, they require a larger reference panel than the commonly used 1000 Genomes European samples to ensure accurate LD calculation and proper type I error control. We conducted simulation studies using LD reference panels of different sizes when performing gene-trait association analysis using the summary statistics of a GWAS with 50,000 samples and found that a reference panel of 5,000 individuals is adequate for MSG (Table A in S1 Appendix). To construct this large reference sample in practice, we randomly selected 5,000 samples of European descent in BioVU [30]. Note that similarly large reference samples are also needed when performing conditional analysis using GWAS summary statistics [74]. That said, we acknowledge that accessing large genetic datasets might be difficult for some users. When such LD reference panels are not available, one may still use the 1000 Genomes samples for initial screening purposes, but more stringent validation will be needed to follow up on the candidate genes identified.

There are several limitations in our study. 
First, we focused on single-gene, single-trait analyses of splicing data, and there are exciting opportunities for methods development and gene discovery in multi-tissue, multi-trait, multi-gene, and cis and trans effects analyses [25, 75–77]. There are some important challenges for multi-tissue, multiple splicing events analysis. To leverage multi-tissue splicing data for gene discovery, one option would be to select a subset of relevant tissues for each trait. However, the relevant tissues are largely unknown and splicing regulation might be shared across many tissues. Another option is to jointly analyze data from all tissues. In this situation, cross-tissue imputation will be needed because the individuals in the GTEx data do not fully overlap across tissues. Second, we used the GTEx transcriptome data from adult bulk tissues. Consequently, findings driven by differences in cellular composition or developmental stages cannot be fully resolved. As splicing is likely to be tightly regulated, the association of splicing-implicated genes with traits in different cell types or developmental stages remains to be studied. Third, like S-MultiXcan, UTMOST, and sCCA+ACAT for multi-tissue expression data analysis, significant multidimensional splicing-trait associations identified by MSG do not imply causality because LD tagging effects and the sheer number of potential splicing events that could be tagged within a given locus can lead to apparent associations at non-causal genes [4, 20]. One can perform multidimensional conditional analysis similar to the one proposed in [22] to resolve the issue of gene prioritization to some extent, but fine-mapping of multidimensional splicing-trait associations is generally a challenging problem, especially in regions with extensive LD, and further investigation in this important direction is warranted.
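To make concrete the Discussion's point that the ACAT test "directly combines p-values", the following is a minimal sketch of the Cauchy combination rule on which ACAT is based. The function name `acat` and the default of equal weights are assumptions made for illustration; this is not the sCCA+ACAT implementation used in the comparisons.

```python
import math

def acat(pvalues, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted average of these variates is
    mapped back to a p-value through the Cauchy survival function.
    Equal weights are an assumption of this sketch.
    """
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k
    total = sum(weights)
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues)) / total
    return 0.5 - math.atan(t) / math.pi
```

Because only the per-CV p-values enter the combination, the correlation among the CV-level statistics is not used, whereas a multi-degree-of-freedom chi-squared test works with the joint statistic and its covariance.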

Conclusion

By integrating multidimensional splicing information with GWAS summary statistics, we are able to pinpoint candidate risk genes associated with common traits via splicing. This approach can potentially be extended to integrate molecular data beyond splicing, such as epigenetic data. With the increasing availability of GWAS summary statistics of many complex traits and molecular data, we believe that our framework and its extensions will enable us to better understand how genes influence complex traits through diverse regulatory effects.

Methods

MSG framework

In this study, we use splicing and genotype data from the GTEx project and GWAS summary statistics of the traits of interest to identify splicing-trait-associated genes. For a given gene, let n, p, and q denote the sample size, the number of SNPs in the cis-region of the gene (i.e., a 1-Mb window around the transcription start sites of a gene), and the number of splicing events, respectively, in GTEx. We note that q ≪ p in practice. Let X and Y denote the n × p standardized genotype matrix and the n × q matrix of standardized measured splicing events, respectively.

In the first stage of MSG, we use sCCA [26, 78] implemented in the R package “PMA” to identify sparse linear combinations of the columns of X and Y that are highly correlated with each other. That is, we wish to find vectors $w_1$ and $u_1$ that solve the following optimization problem:

$$\max_{w_1, u_1} \; w_1^\top X^\top Y u_1 \quad \text{subject to} \quad \|w_1\|_2 \le 1, \; \|u_1\|_2 \le 1, \; \|w_1\|_1 \le c_1, \; \|u_1\|_1 \le c_2, \tag{1}$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the L1 and L2 norms, respectively, and $c_1$ and $c_2$ are parameters that control the sparsity of $w_1$ and $u_1$, respectively. We choose $c_1$ and $c_2$ by permutation using the “CCA.permute” function in “PMA”. As demonstrated in [26], this choice of parameters does fairly well at identifying linear combinations of the underlying factors with reasonable sparsity across a wide range of scenarios. Given the selected pair $(c_1, c_2)$, we obtain subsequent CVs by repeatedly applying the sCCA algorithm (1) to the updated matrix $X^\top Y$ after regressing out the previous CVs. We repeat this procedure q − 1 times to obtain $(w_2, u_2), \ldots, (w_q, u_q)$. Let $W \equiv (w_1, \ldots, w_q)$ be the p × q matrix of SNP weights. We note that genes with W = 0 are “non-testable” and will be excluded from subsequent analysis.

In the second stage of MSG, we test the association between the genetically regulated splicing CVs and the trait of interest using GWAS summary statistics. Specifically, let z be the vector of z-statistics in the GWAS of the trait of interest. 
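As an illustration of the first stage, the sketch below finds one pair of sparse canonical vectors by alternating soft-thresholded updates, in the spirit of the penalized matrix decomposition behind sCCA. It is a simplified NumPy stand-in and not the “PMA” implementation: the function names, the bisection search, and the iteration counts are assumptions, and the L1 bounds `c1` and `c2` are passed directly (each must be at least 1) rather than chosen by permutation.

```python
import numpy as np

def soft_threshold(a, delta):
    """Elementwise soft-thresholding operator S(a, delta)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_bounded_unit_vector(a, c, n_bisect=60):
    """Return S(a, delta) scaled to unit L2 norm, with delta >= 0 found
    by bisection so that the L1 norm of the result is at most c."""
    v = a / np.linalg.norm(a)
    if np.abs(v).sum() <= c:
        return v  # constraint already satisfied without thresholding
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        s = soft_threshold(a, mid)
        norm = np.linalg.norm(s)
        if norm == 0.0 or np.abs(s).sum() / norm <= c:
            hi = mid  # feasible: try a smaller threshold
        else:
            lo = mid  # infeasible: threshold harder
    s = soft_threshold(a, hi)
    norm = np.linalg.norm(s)
    if norm == 0.0:  # degenerate case: keep only the largest entry
        s = np.zeros_like(a)
        k = np.argmax(np.abs(a))
        s[k] = np.sign(a[k])
        return s
    return s / norm

def sparse_cca_rank1(X, Y, c1, c2, n_iter=100):
    """One pair (w, u) of sparse canonical vectors for standardized X, Y."""
    K = X.T @ Y  # p x q cross-product matrix
    # Initialise u with the leading right singular vector of K.
    u = np.linalg.svd(K, full_matrices=False)[2][0]
    for _ in range(n_iter):
        w = l1_bounded_unit_vector(K @ u, c1)
        u = l1_bounded_unit_vector(K.T @ w, c2)
    return w, u
```

Under these assumptions, subsequent pairs could be obtained by removing the fitted rank-one component from `K` before repeating the updates, which mirrors the "regressing out the previous CVs" step described above.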
The multivariate z-statistic for the association between genetically regulated splicing CVs and the trait of interest is $W^\top z$. Under the null hypothesis of no association, it can be shown that $W^\top z$ follows a multivariate normal distribution with mean zero and covariance matrix $W^\top \Sigma W$, where $\Sigma$ is the p × p LD matrix. In practice, we can estimate $\Sigma$ using an external LD reference panel. A chi-squared test statistic about the gene-trait association can be constructed as

$$(W^\top z)^\top (W^\top \Sigma W)^{-1} (W^\top z). \tag{3}$$

In practice, the splicing events within a gene can be highly correlated, such that the rank of the SNP weight matrix W can be less than q, and the majority of variations may be explained by a few leading splicing CVs. Consequently, $W^\top \Sigma W$ in expression (3) can be close to singular and its inverse cannot be reliably estimated for many genes. To address this problem, we use the SVD regularization of [21]. Specifically, we compute the pseudo-inverse of $W^\top \Sigma W$ via SVD, decomposing it into its principal components and removing those with small eigenvalues. We use the condition number threshold $\lambda_{\max}/\lambda_i < 30$ to select the number of components, where $\lambda_i$ and $\lambda_{\max}$ are the ith and maximum eigenvalues of $W^\top \Sigma W$. Denoting the resulting pseudo-inverse of $W^\top \Sigma W$ as $(W^\top \Sigma W)^{-}$ and substituting it into Eq (3), we have

$$T_2 = (W^\top z)^\top (W^\top \Sigma W)^{-} (W^\top z).$$

Under the null hypothesis, $T_2$ follows a $\chi^2_r$ distribution, where r is the number of components that contribute to the pseudo-inverse. We test the gene-trait association using a chi-squared test. For each trait and tissue combination, we use Bonferroni correction to determine the genome-wide significance threshold by dividing 0.05 by the number of genes with at least two splicing events in that tissue. This value varies between trait-tissue pairs, and is usually around 0.05/10000 = 5 × 10⁻⁶.
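The second-stage computation can be sketched as follows, assuming the SNP weight matrix, the GWAS z-statistics, and the LD matrix are already aligned on the same SNPs. The function name `msg_statistic` and its arguments are hypothetical; the sketch uses an eigendecomposition of the symmetric matrix $W^\top \Sigma W$, which plays the role of the SVD here, and keeps only components with $\lambda_{\max}/\lambda_i < 30$.

```python
import numpy as np

def msg_statistic(W, z, Sigma, cond_max=30.0):
    """Regularized chi-squared statistic and its degrees of freedom.

    W     : p x q matrix of SNP weights for the splicing CVs
    z     : length-p vector of GWAS z-statistics
    Sigma : p x p LD (correlation) matrix from a reference panel
    """
    s = W.T @ z          # multivariate z-statistic, length q
    V = W.T @ Sigma @ W  # its covariance under the null, q x q
    eigval, eigvec = np.linalg.eigh(V)  # V is symmetric
    lam_max = eigval.max()
    if lam_max <= 0:
        raise ValueError("W'SigmaW has no positive eigenvalue (non-testable gene)")
    # Keep components whose condition number lam_max / lam_i is below cond_max.
    keep = eigval > lam_max / cond_max
    proj = eigvec[:, keep].T @ s
    T2 = float(np.sum(proj ** 2 / eigval[keep]))
    r = int(keep.sum())
    return T2, r
```

The p-value would then be the upper tail of a chi-squared distribution with `r` degrees of freedom evaluated at `T2` (for example via `scipy.stats.chi2.sf(T2, r)` if SciPy is available), compared against the Bonferroni threshold, e.g. 0.05/10000 = 5 × 10⁻⁶.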

Simulations

To evaluate the type I error rate and power of the gene-trait association tests, we simulated a training dataset with genetic and splicing data, a GWAS dataset, and an LD reference panel. Then, we conducted gene–trait association tests using our proposed MSG method and the S-MultiXcan, UTMOST, and sCCA+ACAT methods in a variety of realistic scenarios.

To simulate the training dataset with genetic and splicing data, we set n = 200, p = 300, and q = 10. We generated rows of X independently from a multivariate normal distribution with mean 0, variance 1, and autoregressive covariance structure determined by ρ = 0.1. We generated Y from the multivariate linear regression model Y = XB + E, where B is a p × q matrix of genetic effects on splicing events, and E is an n × q matrix of random errors. Following [79], we factor the effect size matrix B into SNP- and splicing event-dependent components, such that B = diag(b)D, where b is a p-vector of shared genetic effects on all splicing events, diag(b) is the p × p diagonal matrix expanded by b, and D is a p × q matrix of splicing event-specific effects. We specify the structure of D through the following parameters: the number of effect-sharing splicing events (i.e., sharing = 2, 4, 8); the fraction of shared SNPs among non-zero effect SNPs for those effect-sharing splicing events (fixed at 0.3); and the proportion of genetic variants that have non-zero effects on splicing (i.e., sparsity = 1%, 5%, 10%). We generated the elements of b independently from a standard normal distribution. We generated the non-zero elements in D independently from the uniform distribution on [−1, 1]. We generated the rows of E independently from a multivariate normal distribution with mean zero, variance scaled such that the desired splicing heritability (0.01, 0.05, 0.1) was achieved, and autoregressive covariance structure determined by ρ = 0.5. 
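The training-data model Y = XB + E with B = diag(b)D can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the autoregressive structure is taken to be a correlation of ρ^|i−j|, each splicing event draws its own random set of non-zero SNPs (the effect-sharing structure of D is not reproduced), and the function names are hypothetical.

```python
import numpy as np

def ar1_cov(dim, rho):
    """AR(1) correlation matrix with entries rho**|i - j| (assumed form)."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_training(n=200, p=300, q=10, sparsity=0.05, h2=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Genotype-like matrix: rows are independent multivariate normal draws.
    L = np.linalg.cholesky(ar1_cov(p, 0.1))
    X = rng.standard_normal((n, p)) @ L.T
    # Shared SNP effects b and event-specific effects D.
    b = rng.standard_normal(p)
    D = np.zeros((p, q))
    n_causal = max(1, int(round(sparsity * p)))
    for j in range(q):
        # Simplification: independent causal sets, no effect sharing.
        causal = rng.choice(p, size=n_causal, replace=False)
        D[causal, j] = rng.uniform(-1.0, 1.0, size=n_causal)
    B = b[:, None] * D  # same as diag(b) @ D
    G = X @ B           # genetically regulated splicing
    # Correlated errors, rescaled per event to reach heritability h2.
    Le = np.linalg.cholesky(ar1_cov(q, 0.5))
    E = rng.standard_normal((n, q)) @ Le.T
    scale = np.sqrt(G.var(axis=0) * (1.0 - h2) / h2 / E.var(axis=0))
    Y = G + E * scale
    return X, Y, B
```

Rescaling each error column by a constant leaves the correlation among the errors unchanged while fixing the ratio of genetic to total variance for every splicing event.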
As a note, [19] estimated the average cis- and trans- heritability of Leafcutter generated splicing events (that mapped to canonical exon junctions) to be 0.017 and 0.046, respectively. To simulate the GWAS dataset, we generated a genotype matrix X1 with 50,000 rows representing the subjects and 300 columns representing the cis-SNPs. We generated the rows of X1 in a similar manner as we generated X. We generated the trait of interest using Y1 = X1Bα + ϵ, where α is a q-vector of splicing effects on the trait, and ϵ is a vector of random errors. Under the null hypothesis of no gene-trait association, we set α = 0. Under the alternative hypothesis, we denote the splicing events with non-zero elements in α as the “trait-contributing splicing events”, with non-zero values generated independently from a uniform distribution on [−1, 1]. We generated the elements of ϵ independently from a normal distribution with mean zero and variance scaled such that the trait heritability was 0.01. We first generated the individual-level dataset and then obtained the GWAS summary statistics. We assumed that only the GWAS summary statistics rather than the individual-level data were available in the subsequent gene-trait association analysis. We generated an independent genotype matrix X3 in a similar manner as we generated X1 and X2 and used it as an external LD reference panel. We considered two sample sizes for this LD reference panel: 400 (mimicking the 1000 Genomes European reference samples) and 5,000 (mimicking the randomly selected BioVU European samples). Our simulation showed that the MSG method requires more than 400 subjects in the LD reference panel to ensure proper type I error control (Table A in S1 Appendix). We considered a number of realistic scenarios by varying splicing sparsity, splicing heritability, effect-sharing splicing events, and trait-contributing splicing events. When implementing the S-MultiXcan, MSG, and sCCA+ACAT methods, we used their default settings. 
For type I error evaluation, we used 20,000 replicates for each scenario and used the p-value cutoff of 0.05. For power evaluation, we used 2,000 replicates for each scenario and used the p-value cutoff of 5 × 10⁻⁶, which was chosen to mimic the Bonferroni correction in real data applications.
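The step from simulated individual-level GWAS data to the inputs used by the association tests (per-SNP z-statistics and an LD matrix from a separate reference panel) can be sketched as follows. The helper names are hypothetical, and the z-statistics come from simple single-SNP linear regressions, which is one common way to obtain GWAS summary statistics and is assumed here rather than taken from the paper.

```python
import numpy as np

def simulate_trait(X1, B, alpha, h2, rng):
    """Trait X1 @ B @ alpha + eps, with eps scaled so that the genetic
    component explains a fraction h2 of the variance; alpha = 0 gives
    the null of no gene-trait association."""
    g = X1 @ B @ alpha
    eps = rng.standard_normal(X1.shape[0])
    if g.var() > 0:
        eps *= np.sqrt(g.var() * (1.0 - h2) / h2 / eps.var())
    return g + eps

def marginal_z(X1, y):
    """z-statistics from regressing y on each column of X1 separately."""
    n = X1.shape[0]
    Xc = X1 - X1.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)
    beta = Xc.T @ yc / sxx
    # Residual sum of squares of each single-SNP model.
    rss = (yc ** 2).sum() - beta ** 2 * sxx
    se = np.sqrt(rss / (n - 2) / sxx)
    return beta / se

def ld_matrix(X_ref):
    """SNP-by-SNP correlation matrix estimated from a reference panel."""
    return np.corrcoef(X_ref, rowvar=False)
```

With these pieces, the empirical type I error rate or power for a scenario is the fraction of replicates whose test p-value falls below the chosen cutoff (0.05 or 5 × 10⁻⁶, respectively).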

Compilation of well-known trait-associated gene lists

We obtained AD genes (Note 1 Section 1A in S1 Appendix) from [33]. The authors performed intensive hand-curation to identify confident AD-associated genes from various disease gene resources, including AlzGene, AlzBase, OMIM, DisGenet, DistiLD, UniProt, Open Targets, GWAS Catalog, ROSMAP, and existing literature. We obtained LDL-C genes (Note 1 Section 1B in S1 Appendix) from [45] that included genes from KEGG pathways and existing literature. We obtained a list of genes that influence schizophrenia via splicing (Note 1 Section 1C in S1 Appendix) from [18, 56, 80, 81].

S1 Appendix (PDF)

Note 1. Well-known trait-associated gene lists.
Fig A. LocusZoom [82] plot for SNP rs9331888 near gene CLU. This SNP was identified as a significant sQTL (q-value = 2.7 × 10⁻⁵) in the ROSMAP study [17]. It is also nominally significant in the IGAP GWAS (p-value = 3.86 × 10⁻⁵).
Fig B. LocusZoom [82] plot for SNP rs1991570 near gene PTK2B. This SNP was identified as a significant sQTL (q-value = 2.98 × 10⁻¹³) in the ROSMAP study [17]. It is also nominally significant in the IGAP GWAS (p-value = 3.67 × 10⁻⁵).
Fig C. AD genes identified via splicing analysis using MSG that would have been missed from expression analysis using S-PrediXcan.
Fig D. LDL-C genes identified via splicing analysis using MSG that would have been missed from expression analysis using S-PrediXcan.
Fig E. Schizophrenia genes identified via splicing analysis using MSG that would have been missed from expression analysis using S-PrediXcan.
Fig F. Power comparison between sCCA+ACAT with and without retraining the prediction model using elastic net in the first set of simulations.
Fig G. Power comparison between SVD+χ², GBJ, and ACAT tests using sCCA-generated splicing-CVs in the first set of simulations.
Table A. Comparison of type I error for MSG using individual GWAS, MSG with GWAS summary statistics and reference genome of 400 and 5000 individuals in simulation.

S2 Appendix (XLSX)

Tables B-P. Summary of MSG application to 14 human traits. The source of GWAS and MSG identified trait-associated genes are provided.
Tables Q-S. MSG identified genes that are within 500 kb distance to GWAS significant loci in AD, LDL-C, and schizophrenia.
Tables T-V. MSG identified genes that overlap with S-PrediXcan identified genes in AD, LDL-C, and schizophrenia.
Table W. Summary of applying sCCA+ACAT with and without retraining the prediction model using elastic net to AD, LDL-C, and schizophrenia.
Table X. Summary of using sCCA-generated splicing-CVs and applying SVD+χ², GBJ, and ACAT tests to AD, LDL-C, and schizophrenia.
Table Y. Number of testable genes in the analysis of 14 human traits.
Table Z. Number of significant genes within and outside of GWAS loci and number of GWAS loci covered by significant genes in the analysis of AD, LDL-C, and schizophrenia.

77 in total

1. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.

Authors: Brendan K Bulik-Sullivan; Po-Ru Loh; Hilary K Finucane; Stephan Ripke; Jian Yang; Nick Patterson; Mark J Daly; Alkes L Price; Benjamin M Neale
Journal: Nat Genet Date: 2015-02-02 Impact factor: 38.330

2. Common SNPs in HMGCR in micronesians and whites associated with LDL-cholesterol levels affect alternative splicing of exon13.

Authors: Ralph Burkhardt; Eimear E Kenny; Jennifer K Lowe; Andrew Birkeland; Rebecca Josowitz; Martha Noel; Jacqueline Salit; Julian B Maller; Itsik Pe'er; Mark J Daly; David Altshuler; Markus Stoffel; Jeffrey M Friedman; Jan L Breslow
Journal: Arterioscler Thromb Vasc Biol Date: 2008-09-18 Impact factor: 8.311

3. Pleiotropic Meta-Analysis of Cognition, Education, and Schizophrenia Differentiates Roles of Early Neurodevelopmental and Adult Synaptic Pathways.

Authors: Max Lam; W David Hill; Joey W Trampush; Jin Yu; Emma Knowles; Gail Davies; Eli Stahl; Laura Huckins; David C Liewald; Srdjan Djurovic; Ingrid Melle; Kjetil Sundet; Andrea Christoforou; Ivar Reinvang; Pamela DeRosse; Astri J Lundervold; Vidar M Steen; Thomas Espeseth; Katri Räikkönen; Elisabeth Widen; Aarno Palotie; Johan G Eriksson; Ina Giegling; Bettina Konte; Annette M Hartmann; Panos Roussos; Stella Giakoumaki; Katherine E Burdick; Antony Payton; William Ollier; Ornit Chiba-Falek; Deborah K Attix; Anna C Need; Elizabeth T Cirulli; Aristotle N Voineskos; Nikos C Stefanis; Dimitrios Avramopoulos; Alex Hatzimanolis; Dan E Arking; Nikolaos Smyrnis; Robert M Bilder; Nelson A Freimer; Tyrone D Cannon; Edythe London; Russell A Poldrack; Fred W Sabb; Eliza Congdon; Emily Drabant Conley; Matthew A Scult; Dwight Dickinson; Richard E Straub; Gary Donohoe; Derek Morris; Aiden Corvin; Michael Gill; Ahmad R Hariri; Daniel R Weinberger; Neil Pendleton; Panos Bitsios; Dan Rujescu; Jari Lahti; Stephanie Le Hellard; Matthew C Keller; Ole A Andreassen; Ian J Deary; David C Glahn; Anil K Malhotra; Todd Lencz
Journal: Am J Hum Genet Date: 2019-08-01 Impact factor: 11.025

4. A comprehensive family-based replication study of schizophrenia genes.

Authors: Karolina A Aberg; Youfang Liu; Jozsef Bukszár; Joseph L McClay; Amit N Khachane; Ole A Andreassen; Douglas Blackwood; Aiden Corvin; Srdjan Djurovic; Hugh Gurling; Roel Ophoff; Carlos N Pato; Michele T Pato; Brien Riley; Todd Webb; Kenneth Kendler; Mick O'Donovan; Nick Craddock; George Kirov; Mike Owen; Dan Rujescu; David St Clair; Thomas Werge; Christina M Hultman; Lynn E Delisi; Patrick Sullivan; Edwin J van den Oord
Journal: JAMA Psychiatry Date: 2013-06 Impact factor: 21.596

5. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits.

Authors: Jian Yang; Teresa Ferreira; Andrew P Morris; Sarah E Medland; Pamela A F Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael N Weedon; Ruth J Loos; Timothy M Frayling; Mark I McCarthy; Joel N Hirschhorn; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2012-03-18 Impact factor: 38.330

6. L-type Ca²⁺ channels in heart and brain.

Authors: Jörg Striessnig; Alexandra Pinggera; Gurjot Kaur; Gabriella Bock; Petronel Tuluc
Journal: Wiley Interdiscip Rev Membr Transp Signal Date: 2014-03-01

7. Annotation-free quantification of RNA splicing using LeafCutter.

Authors: Yang I Li; David A Knowles; Jack Humphrey; Alvaro N Barbeira; Scott P Dickinson; Hae Kyung Im; Jonathan K Pritchard
Journal: Nat Genet Date: 2017-12-11 Impact factor: 38.330

8. Genome-wide identification of splicing QTLs in the human brain and their enrichment among schizophrenia-associated loci.

Authors: Atsushi Takata; Naomichi Matsumoto; Tadafumi Kato
Journal: Nat Commun Date: 2017-02-27 Impact factor: 14.919

9. Genetic Control of Expression and Splicing in Developing Human Brain Informs Disease Mechanisms.

Authors: Rebecca L Walker; Gokul Ramaswami; Christopher Hartl; Nicholas Mancuso; Michael J Gandal; Luis de la Torre-Ubieta; Bogdan Pasaniuc; Jason L Stein; Daniel H Geschwind
Journal: Cell Date: 2019-10-17 Impact factor: 66.850

10. Integrative transcriptome analyses of the aging brain implicate altered splicing in Alzheimer's disease susceptibility.

Authors: Towfique Raj; Yang I Li; Garrett Wong; Jack Humphrey; Minghui Wang; Satesh Ramdhani; Ying-Chih Wang; Bernard Ng; Ishaan Gupta; Vahram Haroutunian; Eric E Schadt; Tracy Young-Pearse; Sara Mostafavi; Bin Zhang; Pamela Sklar; David A Bennett; Philip L De Jager
Journal: Nat Genet Date: 2018-10-08 Impact factor: 38.330