Literature DB >> 26829591

A new view of transcriptome complexity and regulation through the lens of local splicing variations.

Jorge Vaquero-Garcia^1,2, Alejandro Barrera^1,2, Matthew R Gazzara^1,3, Juan González-Vallinas^1,2, Nicholas F Lahens⁴, John B Hogenesch⁴, Kristen W Lynch^1,3, Yoseph Barash^1,2.

Abstract

Alternative splicing (AS) can critically affect gene function and disease, yet mapping splicing variations remains a challenge. Here, we propose a new approach to define and quantify mRNA splicing in units of local splicing variations (LSVs). LSVs capture previously defined types of alternative splicing as well as more complex transcript variations. Building the first genome wide map of LSVs from twelve mouse tissues, we find complex LSVs constitute over 30% of tissue dependent transcript variations and affect specific protein families. We show the prevalence of complex LSVs is conserved in humans and identify hundreds of LSVs that are specific to brain subregions or altered in Alzheimer's patients. Amongst those are novel isoforms in the Camk2 family and a novel poison exon in Ptbp1, a key splice factor in neurogenesis. We anticipate the approach presented here will advance the ability to relate tissue-specific splice variation to genetic variation, phenotype, and disease.

Entities: CellLine Chemical Disease Gene Species

Keywords: alternative splicing; alzheimer's disease; camk2; chromosomes; computational biology; genes; human; mouse; ptb1; statistical modeling; systems biology; transcriptomics

Mesh：

Year: 2016 PMID： 26829591 PMCID： PMC4801060 DOI： 10.7554/eLife.11752

Source DB: PubMed Journal: Elife ISSN： 2050-084X Impact factor: 8.140

Introduction

Production of distinct mRNA isoforms from the same locus has been shown to be common phenomena across metazoans (Barbosa-Morais et al., 2012; Merkin et al., 2012). Different isoforms may arise through the use of alternative transcription start and end sites, or through alternative processing of pre-mRNA. A key process is alternative splicing (AS) of pre-mRNA, where different subsets of pre-mRNA segments are removed while others are joined, or spliced together. The resulting differences between the mature mRNA isoforms can, in turn, encode different protein products, or affect mRNA stability, localization, and translation. Over 95% of human multiexon genes undergo AS, and disease associated genetic variants have been shown to frequently lead to splicing defects (Cooper et al., 2009; Pan et al., 2008; Wang et al., 2008). These observations emphasize the need to accurately map and quantify splice variations. RNA-Seq technology has advanced the detection and quantitation of splice variants by producing millions of short sequence reads derived from the transcriptome. Despite constant technological advancement, the combination of limited coverage depth, experimental biases, and reads spanning only a small fraction of the variable parts of transcripts has left accurate mapping of transcriptome variations an open challenge (Alamancos et al., 2014). Transcriptome variations have been traditionally studied either at the level of full gene isoforms or through the specification of alternative splicing 'events'. The latter have been categorized into several common types, such as intron retention, alternative 3’/5’ splice sites, or cassette exons. Importantly, while exact isoforms and their quantifications cannot be directly inferred from the short RNA-Seq reads, AS events can be detected via reads that span across spliced exons (junction reads). Both AS events and full isoforms can be captured by a gene schematic or a splice graph (Heber et al., 2002), where edges (lines) connect pre-mRNA segments spliced together in different transcripts (Figure 1A, top).

Figure 1.

LSV formulation and prevalence.

DOI: http://dx.doi.org/10.7554/eLife.11752.003

LSV formulation and prevalence.

(A) LSVs can be represented as splice graph splits from a single source exon (yellow) or into a single target exon (pink). LSV formulation captures previously defined, 'classical', binary alternative splicing cases (top) as well as other variations (bottom). An asterisk denotes complex variations involving more than two alternative junctions; dash line denotes redundant LSVs that are a subset of other LSVs (see Materials and methods). (B) Example of a complex LSV in the Camk2g gene. The gene’s splice graph (top) includes known splice junctions from annotated transcripts (red) and novel junctions (green) detected from RNA-Seq data. The splice graph includes a complex LSV involving exons 14–17 (middle). RT-PCR validation of the LSV in brainstem, cerebellum, hypothalamus, muscle, and adrenal is shown at the bottom. Several isoforms are preferentially included in brain and muscle. DOI: http://dx.doi.org/10.7554/eLife.11752.003 While useful, the previously defined AS types fail to capture the full complexity of spliceosome decisions. Specifically, AS types represent spliceosome decisions as strictly binary, involving only two exons or two splice sites in the same exon. The bottom panel in Figure 1A illustrates a few possible splicing variations that do not fit the previously defined AS types and can involve more than two alternative junctions. Figure 1B serves as a visual summary for both the potential and challenges in analyzing splicing variations. Combining known transcripts and RNA-Seq data results in the Camk2g splice graph shown (Figure 1B, top). This splice graph includes novel, unannotated, splice junctions detected from junction spanning RNA-Seq reads (green), as well as a complex case where exon 14 can be spliced to exons 15, 16, or 17 (Figure 1B, middle). Quantification by RT-PCR in several mouse tissues validate the existence of these variations and also points to isoforms that are predominantly produced in brain subregions and in muscle (Figure 1B, bottom). In order to achieve such results we need to have a computational framework that efficiently combines RNA-Seq with existing gene annotation and enables us to accurately detect, quantify, and visualize diverse splicing variations across different experimental conditions.

Results

Formulation of local splicing variations (LSVs)

To address the shortcomings of previously defined AS types we suggest the formulation of local splicing variations, or LSVs. LSVs are defined and easily visualized as splits (multiple edges) in a splice graph where several edges either come into or from a single exon, termed the reference exon. A Single Source (SS) LSV (Figure 1, yellow) corresponds to a reference exon spliced to several downstream RNA segments while single target (ST) LSV (Figure 1, pink) corresponds to a reference exon spliced to upstream segments. The full specification of an LSV also includes the relative location of the exons and junctions (see Material and methods). Figure 1A illustrates how this formulation captures previously defined AS types (top panel) as well as more complex cases (bottom panel). Specifically, previously defined 'classical' AS events appear as special cases of binary graph splits (e.g., include or skip a cassette exon), while LSVs capture non-classical binary splits and splits involving more than two junctions. Such non-binary splits are termed complex LSVs. LSVs can also involve intron retention (intronic LSVs) or be comprised of only exons (exonic LSVs). Moreover, the transcriptome variability captured by LSVs may be the result of not only spliceosome decisions but also of alternative transcription start or end positions. For example, the gene in Figure 1A bottom panel involves two alternative first exons so a relative change in the transcription start site usage can result in changes in downstream LSVs quantification. Importantly, LSV formulation allows the probing of transcriptome structure and complexity yet, unlike full transcripts, can still be quantified directly from junction spanning reads.

LSV detection, quantification and visualization using MAJIQ

In order to address the challenges involved in detection, quantification and visualization of LSVs we developed a new computational framework that we have termed Modeling Alternative Junction Inclusion Quantification (MAJIQ). MAJIQ’s first step (Figure 2A, top) is to parse a known database of transcripts, given as a GFF3 annotation file, along with a set of mapped and aligned RNA-Seq experiments (indexed BAM files). Unlike many methods that only analyze known isoforms, MAJIQ supplements known transcripts with 'reliable' edges derived from de novo junction spanning reads. Several filters can be applied to define which edges are considered reliable and which LSVs have enough reads to be later quantified (see Material and methods). Similarly, LSVs whose edges are a subset of other LSVs, such as those denoted with dashed rectangles in Figure 1A, are removed to avoid redundancy (see Material and methods). Next, MAJIQ can be executed to quantify LSVs either in a specific condition or to compare two experimental conditions, with or without replicates. LSV quantification in a specific condition is based on the marginal percent selected index (PSI, denoted Ψ) for each junction involved in the LSV, while comparison of experimental conditions is based on relative changes in PSI (dPSI, ΔΨ). MAJIQ uses a combination of read rate modeling, Bayesian Ψ modeling, and bootstrapping to report posterior Ψ and ΔΨ distributions for each quantified LSV. The results of MAJIQ’s LSV detection and quantification can be interactively visualized with the package VOILA in a standard web browser (Figure 2A bottom).

Figure 2.

LSV analysis using MAJIQ.

(A) MAJIQ’s analysis pipeline. RNA-Seq reads are combined with an annotated transcriptome to create splice graphs and detect LSVs for each gene, then LSVs are quantified and compared between conditions. The visual output (VOILA) lists LSVs with violin plots representing estimates of percent inclusion index (PSI, Ψ) or changes in inclusion (dPSI, ΔΨ). Two cases are illustrated, for a single source three way LSV (orange), and a single target two way LSV (pink). (B) Correspondence between E[Ψ] by MAJIQ and Ψ by RT-PCR. R is the correlation coefficient. Colors and shapes represent different experimental conditions: mouse cerebellum and liver (dark and light orange diamonds, respectively); human unstimulated and stimulated T-Cells (dark and light purple dots, respectively). Total n = 208. (C) Correspondence between E[ΔΨ] by MAJIQ and ΔΨ by RT-PCR, where |ΔΨRT|>0.2. R is the correlation coefficient. Changes in inclusion were measured between liver and cerebellum mouse tissues (diamonds, n = 45); stimulated and unstimulated T-Cells (dots, n = 9). (D) Reproducibility ratio (RR) of high confidence differentially included LSVs, i.e. LSVs for which P(|ΔΨ|> 0.2) > 0.95), when comparing RNA-Seq from two conditions. A differentially included LSV is considered replicated if it maintains a rank at least as high as N in biological replicates, where N is the set size. LSVs are ranked by E[ΔΨ] and filtered for overlap. Twelve replicate pairs from Keane et al. (2011) were used to compute the histogram’s std (light blue). Other lines show MAJIQ’s RR with replicates (thick blue), RR for AS events detected by rMATS w/wo replicates (light and dark green), MISO (red), and RR for LSVs using Naïve Bootstrapping (orange). The inset bar chart shows the number of LSVs or AS events (N) derived by each method and used in the RR plots (see Materials and methods for more details).

DOI: http://dx.doi.org/10.7554/eLife.11752.004

(A) Correspondence between E[Ψ] by MAJIQ (top) or MISO (bottom) and Ψ by RT-PCR across four different experimental conditions. (B) The same set of LSVs used to measure correspondence between E[ΔΨ] by MAJIQ (top) and MISO (bottom) and ΔΨ by RT-PCR. Changes in inclusion were measured between cerebellum and liver mouse tissues (diamonds, right panel, n = 50); stimulated and unstimulated T-Cells (dots, center panel, n = 57). Setting a threshold of ΔΨRT =20% for a significant change MAJIQ has no false positives and fewer false negatives compared to MISO. (C) Histogram of Ψ reproducibility, computed as the absolute difference between biological replicates of hippocampus and liver (R = E[Ψr1]-E[Ψr1] ). Overall, 81.2% of the junctions in quantifiable LSVs were reproducible within 5% (R(Ψ) < 5%. Average n = 8058. Twelve replicate pairs were used to compute the histogram’s std (light color). Inset graph: comparing MAJIQ and MISO reproducibility for paired junction (ΔR = RMISO - RMAJIQ). Plot shows the cumulative distribution over ΔR>0 (blue) and ΔR<0 (red) and over the subset with significant difference ( ΔR >0.05, dashed lines). Overall MAJIQ improved Ψ reproducibility for approximately two thirds of the LSVs (P(ΔR = RMISO- RMAJIQ)> 0 = 61.7%) and over two fold more showed a significant improvement (P(ΔR>0.05) = 21.2%), P(ΔR< -0.05) = 10.1%).

DOI: http://dx.doi.org/10.7554/eLife.11752.005

DOI: http://dx.doi.org/10.7554/eLife.11752.006

LSV analysis using MAJIQ.

Quantifying PSI and dPSI accuracy.

Quantifying differential splicing reproducibility.

(A) Effect of the threshold applied to the significance of splicing changes on the number of LSVs identified as changing (N) and the reproducibility ratio (RR). Both MAJIQ and rMATS estimate (P(ΔΨ) > α) > β) for an inclusion difference α with confidence level β. In the paper we used a strict β = 95% to control for false positives and a conservative α = 20% to call differentially spliced LSVs. Here, the results with a relaxed criteria of α = 15% (left) and α = 10% (right) are shown. The plots are otherwise identical to Figure 2D. (B) Breakup by coverage level (x-axis) of the high confidence differentially spliced LSVs depicted in Figure 2D. Y-axis denotes reproducibility ratio by RNA-Seq from biological replicates and the numbers at the top of each bar denote reproducibility by RT-PCR (|ΔΨRT|>0.2) of a randomly chosen subset of LSVs from that bin. The overall reproducibility is represented by the far left bin. (C) Correlation between MAJIQ E[ΔΨ] and average ΔΨ by RT-PCR among 3 biologic replicates for the most changing junction in validated complex LSVs examined in this paper between various pairs of tissues (n = 78). Of the junctions predicted to change between tissues (|E[ΔΨ]| > 20%), 55/56 validated (98.2%) by RT-PCR with an average |ΔΨ| > 20%. DOI: http://dx.doi.org/10.7554/eLife.11752.006 We assessed MAJIQ’s quantification accuracy for both Ψ and ΔΨ using a combination of RNA-Seq from biological replicates and an extensive set of 208 RT-PCR validations. These experiments included two mouse tissues (cerebellum and liver [Zhang et al., 2014]), and a human Jurkat T cell line (unstimulated and stimulated, [Cole et al., 2015]). While accuracy depended on the dataset used, MAJIQ achieved an overall correlation of R = 0.8 and R = 0.95 for PSI and dPSI quantification by RT-PCR, comparing favorably to alternative methods on all datasets (Figure 2B,C, Figure 2—figure supplement 1). Next, we used biological replicates from the Mouse Genome Project (Keane et al., 2011) to assess reproducibility of differential splicing detection from RNA-Seq when comparing two experimental conditions. The reproducibility ratio (RR, see Material and methods) captures the fraction of top ranked differentially spliced LSVs that maintain their top ranking when analyzing another set of replicate experiments. Figure 2D shows MAJIQ compares favorably to other methods, including MISO (Katz et al., 2010), rMATS (Shen et al., 2014), and a bootstrapping approach (Xiong et al., 2015) adopted for LSV. While MISO and rMATS achieved a reproducibility ratio of 61–67% we found the bootstrapping approach (N.B.) suffered from particularly high variance, which degraded reproducibility of LSVs ranking. In comparison, MAJIQ achieved a mean RR=77% when comparing two pairs of experiments and improving to RR=86% when the experiments compared had replicates. Notably, detection power was also improved. Defining differentially spliced LSVs as those for which P(|ΔΨ|>0.2) > 0.95, the number of detected LSVs (N), after removing LSVs overlap (see Materials and methods), was on average 400 for pairwise and 447 for group comparisons, compared to 240 and 260 respectively by rMATS. The improvement in both detection and reproducibility of differentially spliced LSVs (N, RR) was robust to the statistical threshold used to define N (Figure 2—figure supplement 2A) and when we removed MAJIQ’s de-novo junction detection the number of LSVs dropped as expected but reproducibility remained high (N = 337, RR= 87%, data not shown). Importantly, this result also indicated that including de-novo junctions increased the number of differentially spliced LSVs that could be detected by over 30% (337 vs. 447), while retaining equivalent reproducibility. Defining differential splicing reproducibility by RT-PCR as LSVs for which |ΔΨRT|>20% resulted in 95% reproducibility. The higher reproducibility by RT-PCR can be expected given the lower experimental variability compared to RNA-Seq. Notably, the LSVs tested by RT-PCR were selected to cover a wide spectrum of read depth. We found that while higher coverage allowed more differential LSVs to be detected and steadily increased reproducibility by RNA-Seq, MAJIQ’s reproducibility by RT-PCR was stable across read coverage depth, pointing to the robustness of the method (Figure 2—figure supplement 2B). Finally, we note that the above RT-PCR evaluation concentrated on binary LSVs to allow comparison to currently available methods, but we observed similar accuracy for the quantification of complex LSVs (Figure 2—figure supplement 2C).

Figure 2—figure supplement 1.

Quantifying PSI and dPSI accuracy.

DOI: http://dx.doi.org/10.7554/eLife.11752.005

Figure 2—figure supplement 2.

Quantifying differential splicing reproducibility.

DOI: http://dx.doi.org/10.7554/eLife.11752.006

Complex LSV are prevalent in diverse metazoa

To assess the significance of LSVs formulation we estimated LSVs prevalence in several metazoans, ranging from lizard to human (Figure 3). Naturally, this analysis is affected by how well a species transcriptome is annotated, and how permissive the database used is. In human for example, complex LSVs constitute 20.6% to 33.7% of the LSVs in annotated transcripts by RefSeq and Ensembl respectively, but only 1.86% in opossum’s Ensembl annotation (Figure 3A,B). Next, we expanded the set of annotated transcripts with novel junctions detected from RNA-Seq junction spanning reads. Limiting our analysis to only 5–6 similar tissues in all species and conservative junction detection still increased the total number of LSVs in human by 11% and the fraction of complex LSVs from 33.7% to 37.1% (Figure 3A). In species not as well annotated the effect of adding RNA-Seq data was more dramatic, jumping in opossum for example from 1,610 to 10,228 LSVs, of which 10% were complex. In summary, while LSV analysis across species was confounded by read coverage and transcriptome annotation we find that non-classical and complex LSVs make up a substantial fraction of observed transcriptome variations. Such complex LSVs are likely to be removed, undetected, or mislabeled by algorithms that only quantify binary AS events from previously annotated transcripts.

Figure 3.

LSV prevalence across diverse metazoans.

() Number of LSVs (top) and fraction of complex LSVs (bottom) when using Ensembl annotated transcripts only (grey) or combining it with RNA-Seq from 5–6 similar tissues (red). Mouse* is the dataset from Zhang et al. (2014). (B) Number of LSVs (top) and fraction of complex LSVs (bottom) when using RefSeq (orange) and Ensembl (blue). The RNA-Seq dataset is the same as in (A).

DOI: http://dx.doi.org/10.7554/eLife.11752.007

LSV prevalence across diverse metazoans.

A genome wide view of LSV across 12 mouse tissues

Given the clear impact of the RNA-Seq dataset and transcriptome annotation, we chose to focus our genome wide analysis on a recent mouse dataset. This allowed us to analyze 12 tissues with an average of over 120M reads per sample, produced by a single lab (Zhang et al., 2014). This data included three brain subregions, eight samples per tissue, and matching RNA for RT-PCR validations, leading to a total of 100,512 LSVs detected. First, we used this data to assess the usage of LSVs across tissues. In order to minimize LSVs that result from false junctions identified by the mapper we only included junctions with multiple uniquely mapped staggered reads across multiple biological replicates (see Material and methods). Next, we tested the maximal inclusion level of the second, third, or the least used junction in an LSV across twelve mouse tissues. We detected a switch behavior where a different junction becomes dominant at 50% inclusion or more in approximately 5% of the classical binary LSVs (Figure 4A, grey), compared to 12% for the second most used junction in complex LSVs (Figure 4A, light green). Setting a conservative threshold of Ψ > 10% to denote splice junctions that are less likely to be splicing noise or database errors we find that for the classical binary LSVs approximately 32%, or 9,516 pass that threshold, compared to 57% and 19% of the complex LSVs that pass that threshold for the second and third most used junction respectively. These correspond to a total of 6,338 and 2,112 LSVs in our datasets, pointing to the importance of complex LSVs in transcriptome analysis. Even when testing for the least used junction in complex LSVs (e.g. the ninth in a nine junction LSV), we still find almost 10% pass the 10% inclusion threshold (Figure 4A, dark green). Finally, for intronic LSVs we find almost 11,000 cases where an intron is retained at least 50% in one tissue, and 3,844 cases where the intron is almost always retained with Ψ > 99% (Figure 4—figure supplement 1D). This observation of widespread intron retention (IR), especially in brain tissues, is in line with a recent study across many more tissues and cell lines (Braunschweig et al., 2014), though our overall estimate of IR prevalence is more conservative.

Figure 4.

Genome wide view of exonic LSVs across twelve mouse tissues.

(A) Cumulative distribution (CDF) for maximal junction inclusion (PSI) across tissues. Plot includes the least used junction in binary LSV (grey), the second, third and least used junction in complex LSVs (light, medium, dark green). Dashed vertical line denotes 10% inclusion. (B) Histogram of the most common exonic LSV types. (C) Histogram of the number of exons, junctions, 3’ and 5’ splice sites in all identified LSV. (D) Histogram of which 3’ (left) or 5’ (right) splice site are found to be dominant across all tissues and all LSVs. X-axis denotes the order of the splice site. Dominance is defined as E[Ψ] > 0.6. Cases with no dominant junction are represented by the bars on the far left. (E) The fraction of complex LSVs (green, top right) from the total number (purple, bottom left) of differentially spliced LSVs (|E[ΔΨ]| >0.2) between pairs of tissues.

DOI: http://dx.doi.org/10.7554/eLife.11752.008

DOI: http://dx.doi.org/10.7554/eLife.11752.009

DOI: http://dx.doi.org/10.7554/eLife.11752.010

Figure 4—figure supplement 1.

Intronic LSV detection and quantification.

DOI: http://dx.doi.org/10.7554/eLife.11752.010

Genome wide view of exonic LSVs across twelve mouse tissues.

dPSI values for all pairs of tissues.

DOI: http://dx.doi.org/10.7554/eLife.11752.009

Intronic LSV detection and quantification.

(A) Effect of intronic average coverage on the number of detected introns in LSVs. (B) Histogram of mean Ψ reproducibility as in Figure 2—figure supplement 1C but for intronic LSV. Ψ Reproducibility is computed as the absolute difference between biological replicates (R = |E[Ψr1]-E[Ψr1]| ). Twelve replicate pairs were used to compute each histogram’s mean. The histograms’ std was too small to be plotted clearly. Colors correspond to different thresholds on average intronic coverage. Numbers in the legend represent average number of introns quantified in experiment pairs. Based on the tradeoff between reproducibility and overall detection shown in (A) subsequent evaluations and figures were executed using average intronic coverage threshold of 0.5. (C) Histogram of the most common intronic LSV types. Only non-redundant LSVs are included (See Materials and methods). (D) Cumulative distribution function for the fraction of the introns in intronic LSVs as a function of the minimal intronic Ψ observed across the twelve mouse tissues. Vertical dashed line corresponds to Ψ=10%. (E) Bottom left (purple): Each entry A(i,j) is the number of intron containing LSVs where the intron is differentially spliced between tissue i and tissue j. Upper right (green): Each entry A(i,j) is the fraction of complex LSVs from the LSVs listed in the matching bottom left rectangle entry A(j,i). Diagonal (red): Each entry A(i,i) is the total number of unique differentially spliced intron containing LSVs in tissue i compared to all other tissues where the intron is differentially spliced. (F) Bottom left (purple): Each entry A(i,j) is the number of intron containing LSVs where an exonic junction (and not the intron) is differentially spliced between tissue i and tissue j. Note that in this case the upper right (green) triangle that gives the fraction of complex LSVs from the LSVs listed in the matching bottom left rectangle is by definition 100% and is therefore not shown. DOI: http://dx.doi.org/10.7554/eLife.11752.010 Commonly occurring network substructures, or network motifs, have garnered much research attention in diverse fields (Milo et al., 2002). Gene splice graphs can also be thought of as networks with exons as nodes and spliced junctions as edges. In this interpretation, LSVs can be thought of as small network motifs and used to shed light on the transcriptome complexity and commonly reoccurring sub-structure. Comparing the frequency of exonic LSV types (Figure 4B) we find that the more common non classical LSVs involve 3 to 5 exons, combine exon skipping with an alternative 3’/5’ splice site, or involve alternative transcript start/end at the LSV’s reference exon. In contrast, intronic LSVs are much less diverse, with classical intron retention making 68% of the cases (Figure 4—figure supplement 1C). Figure 4C shows that for exonic LSVs 14% involve more than 2 exons, 30% of the single source and 20% and of the single target LSVs involve a reference exon with two or more 5’/3’ splice sites, respectively. Overall, complex (non-binary) LSVs comprise 36.2% of the transcriptome variations detected in the data and 27.5% of the variations deemed quantifiable (see Materials and methods), yet spliceosome decisions still appear localized, with few LSVs involving more than 6 exons or junctions. When analyzing LSVs usage, we found that the biochemical 'proximity rule', by which the splice site nearest to the reference exon is preferred (Reed and Maniatis, 1986), is commonly not reflected at the genomic level. Defining 'dominant' junctions as those included at least 60%, we found proximal junctions appear dominant in approximately two thirds of the cases involving binary LSVs (Figure 4D) while more complex LSV tend to have more evenly distributed inclusion levels with no dominant junction (Figure 4D, left bars). This more evenly distributed usage of exons and junctions in complex LSVs further supports possible functionality of multiple isoforms. Figure 4E gives a genome wide view of the exonic LSVs that exhibit significant splicing changes (|E[ΔΨ]|> 20%) between mouse tissues. In line with previous reports (Barash et al., 2010; Barbosa-Morais et al., 2012), we find clear clusters for brain and muscle tissues (average of 875 and 657 changing LSVs, respectively), a weaker cluster for digestive tissues (liver, kidney) with an average of 501 changing LSVs, and lung as a unique signal (549 changing LSVs). Brain regions have a higher average of 927 (Cerebellum) to 840 (brainstem) changing LSVs compared to non-brain tissues. The number of LSVs changing between brain subregions varies between 36% and 57% of those changing between CNS and non-CNS tissues, with hypothalamus standing out as more similar to the two other CNS tissues (average of 937 and 343 changing LSVs when compared to non brain and other brain sub-regions, respectively). Overall, we find that complex LSVs make up almost 47% of the differentially spliced LSVs, a fold enrichment of 1.7 compared to their relative proportion of 27.5% in the quantifiable set (P < 2.3 x10-278, binomial test).

Complex LSV are enriched in regulated splicing that is associated with higher intronic conservation and specific protein features

Given the above result of complex LSV enrichment in tissue dependent splicing variations we decided to test whether this enrichment holds in other datasets that involve developmental stages, splice factor knockdowns, and disease. We performed a meta analysis of 31 mouse datasets that involve a total of 243 RNA-Seq experiments covering a variety of tissues, cell lines, developmental stages, and knockdowns of key splicing factors. To this set we also added a human dataset comparing Alzheimer’s disease and healthy brain samples (Figure 5A and below). We found the median fraction of complex LSV in these datasets was 0.309 and their median fold enrichment in differentially spliced LSVs was 1.63, a significant enrichment in 30/32 of the datasets (1.6x10-322 < p-val < 1x10-3, Bonferroni corrected binomial test, see Figure 5A, and Figure 5—source data1). This consistent overrepresentation of complex LSVs among differentially spliced LSVs across a variety of contexts further suggests that complex LSVs are an important aspect of regulated alternative splicing.

Figure 5.

Meta analysis of complex LSVs.

(A) Fold enrichment (green dots) of complex LSVs calculated by comparing the fraction of complex LSVs among differentially spliced LSVs (dark blue bars) to their relative proportion (light blue bars) in 32 datasets. The corrected p-value column on the left measures significance of the fold enrichment (binomial test, Bonferroni corrected p-value) Medians are displayed for fold enrichment (green line, 1.63), fraction of complex LSVs among changing LSVs (orange line, 0.52), and fraction of complex LSVs among all detected LSVs (red line, 0.31). Human AD versus healthy brain data corresponds to the cohort from (Bai et al., 2013). See Figure 5—source data 1 for more information. (B) Empirical cumulative distribution function (CDF) of the maximal change of junction inclusion ( ΔΨ ) across all mouse datasets in Figure 5A. Only the LSVs detected in the twelve mouse tissues (Figure 4) are included. The plot includes junctions in binary LSVs (grey), and the second, third, and least changing junction in complex LSVs (light, medium, dark green). Dashed vertical line denotes ΔΨ of 10%. (C) Per nucleotide average conservation score (phastCons60 track) in regions proximal to single source (top) and single target (bottom) LSVs that were differentially spliced between any pair of tissues shown in Figure 4. The average is plotted for the subsets of complex (green) LSVs and binary (grey) LSVs as well as around a randomly selected set of constitutively spliced junctions (red, see Materials and methods for details).

DOI: http://dx.doi.org/10.7554/eLife.11752.011

DOI: http://dx.doi.org/10.7554/eLife.11752.012

DOI: http://dx.doi.org/10.7554/eLife.11752.013

Meta analysis of complex LSVs.

LSV enrichment meta analysis table.

DOI: http://dx.doi.org/10.7554/eLife.11752.012

Empirical cumulative distribution function (CDF) of the maximal junction inclusion (E[Ψ]) across all mouse datasets in Figure 5A.

Only the LSVs detected in the twelve mouse tissues (Figure 4) are included. This plot is equivalent to the ΔΨ plot in Figure 5B and includes junctions in binary LSVs (grey), as well as the second, third, and least included junction in complex LSVs (light, medium, dark green). Dashed vertical line denotes 10% inclusion. DOI: http://dx.doi.org/10.7554/eLife.11752.013 Next, we asked how does the inclusion of junctions change across these datasets. For this, we took a conservative approach monitoring only the LSVs that have been already identified in normal tissues used to build the genome wide view of LSVs (Figure 4). Figure 5B shows over 20% of all complex LSVs detected in more than one sample had the third most differentially included junction exhibit |ΔΨ|> 10%, corresponding to 2,236 LSVs. Strikingly, these additional experimental contexts showed that over 39% of all complex LSVs detected in our normal tissue set had their third most included junction with Ψ > 10%, corresponding to 4,201 LSVs (Figure 5—figure supplement 1).

Figure 5—figure supplement 1.

Empirical cumulative distribution function (CDF) of the maximal junction inclusion (E[Ψ]) across all mouse datasets in Figure 5A.

DOI: http://dx.doi.org/10.7554/eLife.11752.013

Finally, we plotted the conservation level around constitutive exons and differentially spliced LSVs shown in Figure 4 that are either binary or complex (Figure 5C). Inline with previous reports, we found tissue regulated splicing involves significantly higher conservation in the intron proximal to the variable exonic segments, a region known to include cis elements to which tissue specific splice factors bind. However, we also found that differentially spliced complex LSVs exhibited significantly higher conservation levels in these regions compared to their binary counterparts. This finding may be the result of the more complex splicing changes that need to be controlled or tighter control associated with complex LSVs specific function. In summary, these different lines of evidence all support the functional relevance and utility of accurately mapping and quantifying complex splicing variations in genome wide studies. The observed evolutionary pressure to conserve intronic segments around tissue dependent LSV raises the questions what are the functional consequences of LSVs and whether complex LSVs are functionally distinct from classical binary ones. To probe possible function we mapped exons in LSVs into their matching protein domains (see Material and methods). We then grouped LSV junctions based on whether they were part of binary or complex LSVs and whether they were differentially included across tissues. In line with previous works (Ellis et al., 2012), we find that binary LSVs, such as cassette exons, which are also differentially included across tissues, more frequently affect low-complexity, disordered regions when compared to non-changing binary LSVs (p<1x10-4, corrected Fisher’s exact test). Interestingly, differentially included complex LSVs are similarly enriched for such low-complexity regions (p<1x10-4), but also show enrichment for specific protein families (e.g. spectrin/filamin) and domains (e.g. RNA recognition motifs) when compared to non-changing complex LSVs. These families and domains are largely distinct from those enriched in binary LSVs (e.g. WW domains or coiled coils). The complete list of enriched protein features can be found in Supplementary file 1. Overall, this analysis suggests that regulated alternative splicing of both binary and complex LSVs can affect protein interactions via unstructured protein regions, or affect the inclusion of distinct protein domains in specific families.

MAJIQ detects a novel, brain-specific, PTC-introducing, developmentally- regulated exon in Ptbp1

To further demonstrate the power of MAJIQ and our LSV based approach we validated a set of complex LSVs that exhibit tissue and brain region dependent splicing patterns. Surprisingly, this analysis revealed a previously uncharacterized, brain-specific exon in the gene encoding PTBP1, an extremely well studied splicing factor critical to neural development (Keppetipola et al., 2012) (Figure 6A, Figure 6—figure supplement 1A). While this novel exon remained undetected when running cufflinks (Trapnell et al., 2010) on this dataset (data not shown), expression of this novel exon as part of a complex LSV was supported by RT-PCR from cerebellum and adrenal tissues (Figure 6B, top) with good concordance with MAJIQ’s PSI quantification (Figure 6B, bottom). Products including exon 14 were also weakly detected by RT-PCR of brainstem and hypothalamus-derived RNA, but not from any of the other eight tissues tested (Figure 6—figure supplement 2). Together these data strongly support exon 14 as brain-specific.

Figure 6.

Identification of a novel, brain-specific, PTC-introducing, developmentally-regulated exon in Ptbp1.

(A) Top: Splice graph representation of a complex target LSV containing a previously unannotated, PTC-introducing exon in Ptbp1 (exon 14, green). Stop signs indicate multiple conserved premature termination codons. Bottom: UCSC Genome Browser tracks of RNA-seq reads from adrenal (red) and cerebellum (blue), and conserved Rbfox binding sites ([U]GCAUG) found within the bounds of this LSV. (B) Top panel: RT-PCR validation of RNA from replicate cerebellar and adrenal tissues with isoforms illustrated on the left. Asterisk denotes a background band that migrates non-specifically. Bottom panel: E[Ψ] violin plots of MAJIQ quantification for the colored junctions in (A). Matching isoforms are indicated on the left. (C) Top: RNA-seq reads from mouse cortices (Yan et al., 2015). Developmental time points indicated on the right with exons colored as in (A). Bottom: Ψ violin plots for the PTC-introducing exon 14 across brain development. (D) Top panel: Top regulatory motifs predicted by AVISPA to influence the neuronal-specific splicing of exon 14. Stacked bars represent the normalized feature effect (NFE) for each motif. Colors indicate the contribution of the corresponding motif in the region indicated in the inset. (E) MAJIQ Ψ quantification of the LSV shown in (A), using RNA-seq from one month old wild type whole brain (left) and nestin-specific Rbfox1 KO littermates (right).

DOI: http://dx.doi.org/10.7554/eLife.11752.014

(A) PTBP1 domain structure (top) and splice graph from cerebellum data (bottom) highlighting approximate locations of known alternatively spliced linker region (dark grey) encoded by exon 13 (exon 9 in the literature), the novel PTC introducing exon 14 (red stop sign in protein, green exon in splice graph), and the known PTC upon exclusion of exon 16 (exon 11 in the literature). (B) UCSC genome browser view with sequence alignment and placental mammalian conservation. Novel exon 14 is highlighted in blue and boxed regions correspond to conservation of 3’ and 5’ splice sites and the in frame PTCs. (C) Locations of additional primers with RT-PCR from replicate cerebellum RNA. (D) UCSC genome browser view showing conserved Rbfox binding sites ([U]GCAUG) and brain RNA-seq reads from wild type one month old mice (top) and Rbfox1 KO littermates (bottom) corresponding to experiments quantified in Figure 6E. Exon 14 location is highlighted in blue. (E) MAJIQ Ψ quantification of junctions as illustrated in (C) from one month old wild type mice (top) and Rbfox2 KO littermates (bottom).

DOI: http://dx.doi.org/10.7554/eLife.11752.015

DOI: http://dx.doi.org/10.7554/eLife.11752.016

Figure 6—figure supplement 1.

Novel exon and PTCs in Ptbp1 are conserved, independent from known PTC event, and regulated by Rbfox1 and 2.

DOI: http://dx.doi.org/10.7554/eLife.11752.015

Figure 6—figure supplement 2.

RT-PCR validation of complex Ptbp1 LSV across 11 mouse tissues.

DOI: http://dx.doi.org/10.7554/eLife.11752.016

Identification of a novel, brain-specific, PTC-introducing, developmentally-regulated exon in Ptbp1.

Novel exon and PTCs in Ptbp1 are conserved, independent from known PTC event, and regulated by Rbfox1 and 2.

RT-PCR validation of complex Ptbp1 LSV across 11 mouse tissues.

(A) Representation of Ptbp1 target LSV analyzed with primers indicated by arrows. (B) RT-PCR from replicates across tissues indicated with isoforms indicated on the left. (C) Representative RT-PCR from tissues indicated with isoforms indicated on the left. [Bstm: brainstem; Hyp: hypothalamus; Cer: cerebellum; Adr: adrenal gland; Kid: kidney; Hrt: heart; Mus: muscle; Bfat: brown adipose; Wfat: white adipose; Liv: liver] DOI: http://dx.doi.org/10.7554/eLife.11752.016 Interestingly, Ptbp1 exon 14 shows conservation of splice sites between mouse and human and inserts multiple premature termination codons (PTCs) in both species, as well as in other mammals, before RMMs 3–4 of PTBP1 (Figure 6—figure supplement 1A,B), suggesting that mRNAs including this exon are likely targets of nonsense-mediated decay (NMD). Regulated alternative splicing that introduce PTCs is a common theme among numerous splicing factors (Ni et al., 2007) and exclusion of Ptbp1 exon 16 (exon 11 in the literature) has already been identified and shown to induce NMD (Figure 6—figure supplement 1A) (Wollerton et al., 2004). Remarkably, exclusion of exon 16 is barely detectable in the brain regions examined and inclusion of exon 14 is not associated with this event (Figure 6—figure supplement 1C). Together, this suggests that these splicing events are independent mechanisms to control Ptbp1 expression and that inclusion of novel exon 14 plays a larger role in the brain regions examined, with 26% of the Ptbp1 transcripts in the cerebellum containing PTCs. Embryonic down regulation of Ptbp1 by miR-124 is crucial at the onset of neurogenesis (Makeyev et al., 2007) and leads a change in splicing programs (Boutz et al., 2007; Keppetipola et al., 2012), but cannot account for additional postnatal down regulation of this protein (Boutz et al., 2007; Zheng et al., 2012). Remarkably, MAJIQ analysis of RNA-seq data from mouse cortices across development (Yan et al., 2015) reveals clear developmental regulation of exon 14 with a dramatic increase in inclusion from P15 through adulthood (Figure 6C). Taken together, this complex LSV offers a novel mechanism for postnatal neuronal reduction in Ptbp1. To identify putative regulators of novel exon 14, we used AVISPA (Barash et al., 2013), a web tool that utilizes splicing code models to suggest motifs important for tissue-specific splicing, and identified the [U]GCAUG binding motif of the Rbfox family as important for neuronal splicing outcome (Figure 6D). AVISPA’s map of regulatory motifs pointed to a number of Rbfox binding sites downstream of exon 14 (Figure 6A). These motifs, perfectly conserved between mouse and human, suggested enhancement of inclusion by the Rbfox family (Lovci et al., 2013). Consistent with this regulatory hypothesis, MAJIQ analysis of RNA-seq data from one month old nestin-specific Rbfox1 KO mice revealed a marked decrease in inclusion of exon 14 from ~16% in wild type mice to nearly undetectable in the KO (Figure 6E; Figure 6—figure supplement 1D) and similar decreased inclusion was observed upon Rbfox2 KO (Lovci et al., 2013) (Figure 6—figure supplement 1E). Together these data demonstrate the power of MAJIQ, in combination with the VOILA and AVISPA analysis tools, in identifying previously uncharacterized isoforms and understanding the regulation of biologically important transcript variation.

MAJIQ detects novel splicing variations in the CAMK2 family which are conserved, developmentally regulated, and dysregulated in AD

Several of the brain specific LSVs we detected were found in genes encoding calcium/calmodulin-dependent protein kinase II (CAMK2) subunits which regulate functions in the brain such as neurotransmitter synthesis and release, cellular transport, neurite extension, synaptic plasticity, learning and memory (Griffith, 2004). We focused on Camk2d and Camk2g as these exhibit complex changes and were expressed in nearly all tissues examined (Figure 4—source data 1). Figure 1B and Figure 7—figure supplement 1B show MAJIQ’s analysis and matching RT-PCR validation of a Camk2g LSV containing three exons across five tissues. Figure 7 shows similar verification for another complex LSV but in Camk2d. In both cases, exon inclusion creates consensus NLS motifs (KKRK), which localize these subunits to the nucleus (Braun and Schulman, 1995). For Camk2g the NLS motif is contained in exon 15 whose inclusion levels are highest in the brain, particularly in the brainstem (Figure 1B, Figure 7—figure supplement 1B).

Figure 7—figure supplement 1.

Complex and de novo LSVs in Camk2g are developmentally regulated and dysregulated in Alzheimer’s disease.

(A) Splice graph representation of Camk2g in the cerebellum. Green junctions and exons denote de novo detection from RNA-seq data. Numbers represent number of raw reads across the junction. (B) Representation of complex source LSV in Camk2g (top) with matching RT-PCR validation in five tissues (brainstem, cerebellum, hypothalamus, heart, and adrenal, middle). Colored arcs represent the junctions quantified by MAJIQ for this LSV while dashed arcs correspond to junctions in the RNA-seq data, but not directly quantified by the LSV. Violin plots on the bottom display E[Ψ] quantifications (x-axis) for each of the colored junctions (y-axis) across the five tissues with appropriate isoforms from the gel on the right. (C) Line graphs of MAJIQ Ψ quantification (y-axis) of junctions as in (B) across time points (x-axis) through cortex development (top) and heart development (bottom). Points represent mean Ψ and error bars represent one standard deviation. (D) Representation of de novo exon 13 detected in mouse (top) and MAJIQ Ψ across mouse cortex development, points represent mean Ψ and error bars represent one standard deviation (bottom). (E) VOILA ΔΨ visualization of LSV from (D) that is conserved in human showing E[Ψ] values (stacked bar chart, sides) and E[ΔΨ] (center) for each junction between control and Alzheimer’s disease brains. (F) Top regulatory motifs predicted by AVISPA to influence the CNS splicing patterns of exon 13. Stacked bars represent the normalized feature effect (NFE) for each motif as in (Barash et al., 2013). Colors indicate the contribution of the corresponding motif in the region indicated in the inset. (G) VOILA ΔΨ visualization LSV from (B) showing E[Ψ] values (stacked bar chart, sides) and E[ΔΨ] (center) between wild type and Rbfox1 (top) or Rbfox2 (bottom) KO mice.

DOI: http://dx.doi.org/10.7554/eLife.11752.018

Figure 7.

Camk2d LSV exhibits complex developmental dynamics and is misregulated in Alzheimer’s disease.

(A) Representation of complex source LSV in Camk2d with matching RT-PCR validation in five tissues (brainstem, cerebellum, hypothalamus, heart, and adrenal). Colored arcs represent the junctions quantified by MAJIQ for this LSV while dashed arcs correspond to junctions in the RNA-seq data that are not part of the quantified LSV. Violin plots on the bottom display Ψ quantifications (x-axis) for each of the colored junctions (y-axis) across the five tissues with appropriate isoforms from the gel on the right. Isoforms with known tissue-specific splicing patterns are labeled as in the literature (B) Line graphs of MAJIQ E[Ψ] quantification (y-axis) of junctions as in (A) across time points (x-axis) through cortex development (top) and heart development (bottom). Points represent mean Ψ and error bars represent one standard deviation in E[Ψ]. (C) ΔΨ quantification comparing changes between control and Alzheimer’s patient brains of the homologous junctions illustrated in (A).

DOI: http://dx.doi.org/10.7554/eLife.11752.017

DOI: http://dx.doi.org/10.7554/eLife.11752.018

(A) Splice graph representation of Camk2a across three tissues indicated. Dashed box indicates region containing cassette exon that inserts a consensus NLS. (B) MAJIQ E[ΔΨ] between cerebellum and muscle for inclusion (green) and exclusion (blue) isoforms. Red junction corresponds to an alternative 5’ss not highly used in any tissue. (C) Line graphs of MAJIQ Ψ quantification (y-axis) of junctions as in (B) across time points (x-axis) through cortex development (left) and heart development (right). Points represent mean Ψ and error bars represent one standard deviation.

DOI: http://dx.doi.org/10.7554/eLife.11752.019

(A) VOILA thumbnail representation of complex target LSV in Camk2b detected from cortex development data including a alternative transcription start in exon 21 (green junction) that is not highly utilized and NAGNAG alternative 3’ splice sites of reference exon 22. (B) Line graphs of MAJIQ Ψ quantification (y-axis) of junctions as in (A) across time points (x-axis) through cortex development show known increase in exon 20 inclusion through development, coupled with a novel switch from proximal NAG 3’ss (red) to almost exclusive use of distal NAG 3’ss (orange) by adulthood. Points represent mean Ψ and error bars represent one standard deviation. (C) UCSC genome browser view of mapped reads from cortex of embryonic 16.5 mouse (top, purple) or postnatal 21-month mouse (bottom, blue). Dashed box highlights nucleotides corresponding to conserved NAGNAG alternative 3’ss that is developmentally regulated. (D) Top regulatory motifs predicted by AVISPA to influence the CNS splicing patterns of exon 20. Stacked bars represent the normalized feature effect (NFE) for each motif. Colors indicate the contribution of the corresponding motif in the region indicated in the inset. (E) Violin plots representing MAJIQ Ψ for wild type E18.5 mice (top) and Ptbp2 KO littermates (bottom) shows embryonic Ptbp2 represses adult specific inclusion of exon 20, as previously reported (Li et al., 2014), in addition to the switch in NAGNAG 3’ splice site use.

DOI: http://dx.doi.org/10.7554/eLife.11752.020

(A) Boxplot showing distribution of E[Ψ] values and all E[Ψ] values (dots) for the most changing junction in the CAMK2D event examined in Figure 7 from a larger, independent cohort of normal and AD patients in the given brain sub regions. Two-tailed rank sum p-values are shown. (B) Same as (A) but for CAMK2G event examined in Figure 7—figure supplement 1. (C) Same as (A) but for CLTA event examined in Figure 7—figure supplement 6. Total samples analyzed for frontal pole normal and AD are 58 and 62; superior temporal gyrus normal and AD, 37 and 50; parahippocampal gyrus normal and AD, 33 and 45.

DOI: http://dx.doi.org/10.7554/eLife.11752.021

(A) Splice graph representation of a complex alternative end LSV of Klc1. Dark grey represents a 26 nt alternative 5’ss of exon 13. (B) Top panel: RT-PCR validation with RNA from replicate cerebellar and adrenal tissues with isoforms illustrated on the left. Dark outlined isoforms are those that include the 26 nt alternative 5’ss of exon 13. Bottom panel: PSI violin plots of MAJIQ quantification of junctions as colored in (A).

DOI: http://dx.doi.org/10.7554/eLife.11752.022

(A) Splice graph for Clta and representation of target LSV. (B) Top panel: RT-PCR validation with RNA from replicate tissues with isoforms illustrated on the left. Bottom panel: PSI violin plots of MAJIQ quantification of junctions as colored in (A). (C) Line graphs of MAJIQ Ψ quantification (y-axis) of junctions as in (A) across time points (x-axis) through cortex development. Points represent mean Ψ and error bars represent one standard deviation. (D) ΔΨ quantification comparing changes between control and Alzheimer’s patient brains of the homologous junctions illustrated in (A).

DOI: http://dx.doi.org/10.7554/eLife.11752.023

DOI: http://dx.doi.org/10.7554/eLife.11752.024

Camk2d LSV exhibits complex developmental dynamics and is misregulated in Alzheimer’s disease.

Complex and de novo LSVs in Camk2g are developmentally regulated and dysregulated in Alzheimer’s disease.

LSV in Camk2a is developmentally regulated oppositely in the brain and heart.

Developmentally controlled, complex LSV in Camk2b is regulated by Ptbp2.

Analysis of CAMK2D, CAMK2D, and CLTA LSVs in an independent Alzheimer’s cohort.

Figure 7—figure supplement 6.

Clta splicing is developmentally regulated and dysregulated in Alzheimer’s Disease.

DOI: http://dx.doi.org/10.7554/eLife.11752.023

DOI: http://dx.doi.org/10.7554/eLife.11752.021

Complex alternative end of Alzheimer’s-associated Klc1.

Clta splicing is developmentally regulated and dysregulated in Alzheimer’s Disease.

Eif4g3 splicing shows brain subregion-specificity and a novel exon in muscle.

(A) Representation of complex source LSV in Eif4g3. Red junction, red portion of exon 10 correspond to a novel alternative 3’ss detected in the brain. Purple junction, purple portion of exon 12, and dashed exon 13 correspond to Ensembl annotated tandem cassette exons with no support in any experiment. Larger, pink portion of exon 12 corresponds to 120 nt, unannotated exon that is included with exon 11 in muscle. (B) Top panel: RT-PCR validation with RNA from replicate tissues with isoforms and expected product sizes illustrated on the left. Bottom panel: PSI violin plots of MAJIQ quantification of junctions as colored in (A). (C) Line graphs of MAJIQ Ψ quantification (y-axis) of junctions as in (A) across time points (x-axis) through cortex development. Points represent mean Ψ and error bars represent one standard deviation. (D) UCSC genome browser view of the bounds of this LSV with mapped reads from representative muscle sample. Inset shows zoomed area of dashed lines corresponding to the location of the novel bounds of exon 11 . (E) RT-PCR from replicate muscle RNA using an additional primer set from exon 10 to 14. DOI: http://dx.doi.org/10.7554/eLife.11752.024 Several other important aspects of Camk2d splicing are accurately captured by MAJIQ. These include near 100% skipping of exons 21 through 23 in all non brain or muscle tissues (known in the literature as isoform C or Camk2δC, (Xu et al., 2005)), high relative inclusion of NLS containing exon 21 in heart (isoform B or Camk2δB), and high levels of isoform A (Camk2δA), which includes exons 22 and 23, in the brain regions examined (Figure 7A). This result is consistent with previous reports of Camk2d splicing patterns and isoform A being neuronal-specific (Xu et al., 2005). Importantly though, MAJIQ also detects isolated inclusion of exon 23 in the heart (Figure 7A, green junction), which is supported by both the RT-PCR experiment and analysis of an independent dataset across heart development (see below). Previous studies focused on splicing regulation of Camk2d in the heart used junction spanning primers that preclude detection of this highly utilized splicing choice (Xu et al., 2005; Ye et al., 2015). Because CAMK2 has been implicated in neurodevelopment and is proposed to be critical for postnatal heart development (Xu et al., 2005), we next looked for developmental changes in LSVs by analyzing RNA-seq data derived from mouse cortices (Yan et al., 2015) and hearts (Giudice et al., 2014) at different time points. In the brain there is a switch in the splicing of Camk2d between the C and the A isoforms, reaching over 80% use of the A isoform by postnatal day 15, corresponding to a time of intense synaptogenesis and plasticity (Licatalosi et al., 2012) (Figure 7B, top). In the heart we see a more modest decrease in isoform C and increase in exon 23 only during postnatal heart development (Figure 7B, bottom, compare purple with green), consistent with results from RT-PCR from eight week old mice (Figure 7A). Notably, other CAMK2 subunits also displayed developmental dynamics in both tissues, such as inclusion of NLS containing exons in Camk2g and Camk2a (Figure 7—figure supplement 1C and 2), an unannotated mouse cassette exon in Camk2g regulated by the Rbfox family (Figure 7—figure supplement 1D–G), and a complex LSV in the variable domain of Camk2b that affects autophosphorylation and is regulated by Ptbp2 (Li et al., 2014) (Figure 7—figure supplement 3).

Figure 7—figure supplement 2.

LSV in Camk2a is developmentally regulated oppositely in the brain and heart.

DOI: http://dx.doi.org/10.7554/eLife.11752.019

Figure 7—figure supplement 3.

Developmentally controlled, complex LSV in Camk2b is regulated by Ptbp2.

DOI: http://dx.doi.org/10.7554/eLife.11752.020

Given the suggested role of calcium signaling in neurodegeneration (Marambaud et al., 2009) and CAMK2 implication in Alzheimer’s disease (AD) (Steiner et al., 1990), we also analyzed RNA-seq data from three control brains and compared them to three AD brains (Bai et al., 2013). Strikingly, in CAMK2D we observe a marked decrease of ~38% of the neuronal specific isoform of the complex, developmentally-regulated mouse LSV we validated above, with reciprocal increase in the all exclusion, isoform C in AD brains (Figure 7C). We also observe changes in a CAMK2G LSV that corresponds to an unannotated mouse exon (Figure 7—figure supplement 1D,E). Importantly, these exons are perfectly conserved between mouse and human at the amino acid level, further suggesting physiologic importance of the novel splicing variations detected by MAJIQ. Finally, we validated that the observed CAMK2 splicing changes in AD brains can be reproduced in a second independent study. We used data from the AMP-AD Target Discovery Consortium (doi:10.7303/syn2580853) involving a larger cohort of 157 samples from AD patient’s brains and 128 control samples, across three different brain sub regions (Figure 7—figure supplement 4). Overall, we detected approximately 200 LSVs that are reproducibly differentially spliced between AD and normal brains (see Methods) and enriched in GO terms such as cytoskeleton, GTPase regulator activity, and synapse organization (data not shown). This set constitutes approximately 12% of the changing LSVs detected in the original dataset, a fraction that grows to 21% but only 164 LSVs if stricter filtering is applied to both datasets (data not shown). This relatively low percentage of reproducible changes across the two datasets can be at least partially attributed to the small number of samples in the original study combined with an average of 1.8 fold lower coverage in the second, larger dataset. Notably though, among the reproducible set of differentially spliced LSVs 79 are complex, a significant, 1.2-fold enrichment compared to their relative proportion among all LSVs detected (p=0.04, binomial test). While the validation and experimental follow up on these LSVs is beyond the scope of this paper these results and the related CAMK2 analysis demonstrate the usefulness of our combined approach for LSV detection, quantification, and visualization for disease studies.

Figure 7—figure supplement 4.

Analysis of CAMK2D, CAMK2D, and CLTA LSVs in an independent Alzheimer’s cohort.

DOI: http://dx.doi.org/10.7554/eLife.11752.021

Overall, our analysis of CAMK2 is in line with previous studies but also detects additional isoforms and exons that are conserved, developmentally regulated, and dysregulated in AD, making for a more accurate picture of CAMK2 splicing patterns. Additional complex LSVs we validated and analyzed include brain specific isoforms of the kinesin light chain Klc1, recently shown to be an amyloid-beta accumulation modifier (Morihara et al., 2014) (Figure 7—figure supplement 5); the clathrin light chain Clta, which displays developmental dynamics and dysregulation in both Alzheimer's disease cohorts (Figure 7—figure supplement 6, Figure 7—figure supplement 4); and the translation initiation factor scaffold Eif4g3, which has high inclusion of a cassette microexon specifically in cerebellum and a novel, muscle-specific exon (Figure 7—figure supplement 7).

Figure 7—figure supplement 5.

Complex alternative end of Alzheimer’s-associated Klc1.

DOI: http://dx.doi.org/10.7554/eLife.11752.022

Figure 7—figure supplement 7.

Eif4g3 splicing shows brain subregion-specificity and a novel exon in muscle.

DOI: http://dx.doi.org/10.7554/eLife.11752.024

Discussion

The work presented here spans a wide spectrum of topics from a new formulation of transcriptome variations in units of local splicing variations (LSVs); through algorithms for detecting, quantification and visualization of LSVs; a genome wide map of LSVs; analysis of the prevalence and functional significance of complex LSVs; to validation of several complex LSVs that affect protein domains in developmentally regulated genes with key roles in neurogenesis or other brain functions. For the latter, we also demonstrated dysregulation in Alzheimer’s disease using two independent datasets. The new formulation of LSVs sheds light on what has thus far been mostly a 'dark side' of the transcriptome and RNA-Seq based studies, i.e. complex splicing variations. Several previous works aimed to address the apparent representational gap between full transcripts and the classical binary AS events. For example, (Nagasaki et al., 2006) developed an efficient bit array representation for the various exonic segments that make up different gene isoforms, and (Sammeth et al., 2008) suggested an elaborate notational system that allowed them to catalogue all the splicing variations in a given transcriptome, comparing the frequencies of different AS types across 12 metazoa. More recently, (Pervouchine et al., 2013) developed bam2ssj, a package implementing a general intron centric approach to estimate AS from RNA-Seq data that can capture non classical AS variations. bam2ssj gives a BAM-file–processing pipeline that counts junction reads to compute the ratio of inclusion levels either from the 5’ or the 3’ end of an intron, denoted Ψand ΨA different, graph based, approach was taken by (Hu et al., 2013) where a splice graph is divided into subunits termed alternative splicing modules (ASMs). ASMs are hierarchically structured, each capturing all the possible paths along a splice graph between specific start (‘single entry’) and end (‘single exit’) points. The matching algorithm, DiffSplice, then aims to identify cases of differential transcription of ASMs between two experimental conditions. All of these works differ substantially in the formulation of splicing variation, the underlying algorithms, and visualization approach, yet all share the effort to capture non classical AS types. In comparison, MAJIQ offers a unique approach that spans formulation, detection, quantification and visualization of splicing variations. Unlike ASMs, LSVs can be inferred directly from junction spanning reads and result in quantitative PSI and dPSI estimates, while MAJIQ’s probabilistic model offers significant accuracy boost for PSI and dPSI estimates compared to alternative methods. The importance of LSVs formulation is manifested in how common complex LSVs are in diverse metazoans, making up at least a third of observed LSVs in human and mouse. Complex LSVs are also enriched for regulated splicing when analyzing over thirty datasets across different tissues, developmental stages, splice factor knockdowns and neurodegenerative disease. In addition, LSV formulation can be used to investigate substructures of the transcriptome. We found that the biochemically-based proximity rule is commonly overcome at the genomic level and that complex LSVs are less likely to have a dominant splice junction. As for LSVs possible function, our results indicate that tissue dependent binary and complex LSVs both tend to occur in unstructured regions known to affect protein-protein interactions, as well as in specific yet distinct protein domains and families. In order to benefit from the new LSV formulation matching software is needed. The software we developed, MAJIQ, is LSV focused and compares favorably with available tools on AS quantification based both on RNA-Seq from biological replicates and on a compendium of over 200 RT-PCR experiments. Unlike many tools, MAJIQ supplements annotated transcriptomes with novel splice junctions, while VOILA allows the resulting LSVs to be interactively visualized within standard web browsers. Thus, MAJIQ and VOILA offer a compelling LSV centered addition to tools such as MISO (Katz et al., 2010), rMATS (Shen et al., 2014) and cuffdiff (Trapnell et al., 2013) that allow users to quantify whole isoforms relative abundance, alternative polyadenylation, or differential expression. Immediate applications of the novel LSV framework and the MAJIQ software cover a wide spectrum. Examples include improved disease studies where transcriptome variations play a role, enhancing predictive models for splicing and for the effect of genetic variants, studying the regulatory underpinning of complex LSVs, and examining their evolutionary history. At the most basic level, our results illustrate the potential for novel discoveries in reanalyzing previously published data with the new LSV based methods. We anticipate the framework and resources provided here will form the basis of many additional new discoveries in diverse fields.

Materials and methods

RNA-Seq read mapping

All RNA-Seq was mapped using STAR (Dobin et al., 2013). STAR was run with alignSJoverhangMin 8. We created the STAR genome based on mm10 or hg19, with an in-house junction DB containing all possible junctions within each gene.

LSV definition

An LSV (local splice variation) is defined as a split in a splice graph into or from a single exon, termed the reference exon. Single Source LSV (SS-LSV) correspond to splits from a reference exon to multiple 3’ splice sites in downstream exons, single target LSV (ST-LSV) correspond to multiple 5’ splice sites spliced to an upstream reference exon. The reference exon may include multiple 3’ splice sites (ST-LSV) or 5’ splice sites (SS-LSV). An LSV type is defined by the reference exon type (SS, ST) and the set of junctions it includes. Each junction is defined by the splice site ID in the reference exon, and the splice site ID in its target/source exon. Under the above formulation some SS-LSV and ST-LSV may include exactly the same set of edges or one LSV may contain a subset of another LSV’s edges. For example the SS-LSV from exon 4 and the ST-LSV into exon 5 in Figure 1A bottom are comprised of exactly the same edges, while the ST-LSV into exon 2 is a subset of the SS-LSV from exon 1. Such cases are easily detected and removed from further analysis to avoid redundancy. It is important to note that under the LSV formulation classical cassette exons correspond to two distinct LSVs, a single source and a single target. These LSVs are not redundant as they correspond to different lines of experimental evidence; one from junction reads connecting the alternative middle exon with the upstream exon (SS-LSV) and one connecting the middle exon to the downstream exon (ST-LSV). The separate quantification for such LSVs, combined with the joint visualization using VOILA (see below), help distinguish between cases where the two LSVs give similar PSI or dPSI quantifications and cases where they disagree. A case of possible disagreement is illustrated in the last three exons of Figure 1A, where an alternative transcription start site and a third junction going into the last exon may lead to different PSI values.

LSV as structural network motifs

The above definition gives a one-to-one mapping between a local splice graph split and an LSV type. Given a set of LSVs we can compute a distribution over their types or group several types together to detect a distribution over specific LSV features. We note that unlike the analysis of network motifs in Milo et al. (2002), we do not compare LSVs to random connections in a network as the null hypothesis, but rather to a sequential network where all exons are connected via a single path. Thus, we compute a distribution over relevant statistics such as the number of junctions in the reference exon, the total number of junctions or the total number of exons in the LSV (Figure 4).

MAJIQ

MAJIQ is comprised of two main components, a builder and a quantifier. The builder analyzes a given set of RNA-Seq experiments and a transcriptome database to detect LSVs (either known or de-novo) and create a splice graph for each gene in the database. The quantifier subsequently estimates PSI or dPSI for LSVs detected by the builder.

MAJIQ Builder

The builder accepts as input a list of RNA-Seq indexed BAM files and a transcriptome database. For each gene defined in the database it determines all its known exons, associated 3’/5’ splice sites, and the splice graph edges (i.e. splice sites spliced together). It then scans the BAM files to find which of those edges are supported by RNA-Seq reads, and which de-novo splice sites and edges should be added based on junction reads. A user-controlled filter defines which edge is considered 'reliable' to be included in the splice graph. The default is set to “appears in the database or has at least two reads from two different positions”. In order to create the splice graph, MAJIQ combines the known transcriptome with the reliable de-novo junctions. If de-novo 3’ or 5’ splice sites are found outside the boundaries of any annotated exons, they are connected to the proximal upstream or downstream exon, respectively. If such a de-novo 3’ splice site is followed by a de-novo 5’ splice site then the area in between is denoted as a putative de-novo exon. However, de-novo exonic regions added to the splice graph are not allowed to exceed a user-defined threshold. The default threshold is set to 500bp, which corresponds to approximately the 95 percentile of known exons length in the mouse genome. In cases where a de-novo exonic extension might have exceeded this threshold it is instead used to create an 'open ended' putative de-novo exonic region with its boundaries marked accordingly. All the de-novo junctions and exonic regions are marked in green in VOILA’s output (see for example Figure 6—figure supplement 1A). Finally, for retained introns, an additional filter is added for the average coverage in consecutive windows across the intron. The default is set to 1.5 (see Supplementary Figure 4—figure supplement 1A, B) but it is important to adjust this threshold depending on the coverage depth and how permissive one wants to be when calling retained introns. In the next step, the builder creates a list of LSVs that are considered to 'exist in the data' based on a second user defined filter. The default for this filter is 'all the LSV junctions are reliable and at least one junction has at least two reads from at least two positions'. Intuitively this setting retains only LSVs for which there is evidence for expression of at least some gene isoforms involved in that LSV though not all edges involved in the LSV will necessarily be found in the data. The MAJIQ Builder outputs two types of binary files, the first including the LSVs to be quantified by the MAJIQ quantifier, the second including the splice graphs to be visualized by VOILA.

MAJIQ quantifier

The quantifier estimates the fraction each LSV junction is selected, denoted percent selected index (PSI or Ψ), or the changes in each junction’s PSI between two experimental conditions with or without replicates (dPSI or ∆Ψ). These fractions are inferred from short sequence reads that span across junctions (junction reads), whose distribution can be affected by many factors. Consequently, MAJIQ quantifies PSI and dPSI not as a point estimate but as a posterior distribution over possible fractions in the range [0,1]. Importantly, the quantifier screens the builder’s list of LSVs for those that are deemed 'quantifiable' by a user-defined filter. Intuitively, the sparsity of RNA-Seq data results in many LSVs being reliably detected by RNA-Seq yet lacking sufficient read coverage to accurately quantify PSI or to confidently infer significant changes in PSI across conditions. By filtering those from downstream analysis by the quantifier significant computational resources can be saved. The default for the quantifiable filter is at least 10 reads from at least 3 positions. The first step in MAJIQ’s quantification is estimating the read rate per position in each junction, after correcting for GC content bias (Risso et al., 2011). Note that here the position’s read rate corresponds to reads that start/end at that position, not reads that overlap it. The estimation of each junction’s read rate involves three components: A global parametric model per experiment for read count variability; a stack removal procedure; and a local estimator for read rate derived from bootstrapping over each junction’s relevant positions. For the global parametric model MAJIQ uses the zero truncated negative binomial (ZTNB) distribution. The dispersion parameter r is optimized per experiment t by bootstrapping over non-zero positions in a randomly selected set of up to 10,000 quantifiable junctions. Next, given the derived negative binomial model with dispersion r, MAJIQ performs a screening step to detect possible read stacks. A read stack is defined as a position i in junction j with an observed read rate xj,i which is highly unlikely given r and the average read rate in other non zero positions in the junction. After experimentation with the effect on reproducibility, a conservative threshold of p-val ≤ 10−7 was set as the default to flag possible read stacks. Flagged positions and their respective reads are then removed from further consideration. Finally, we noticed that even after fitting a global dispersion parameter per experiment r and discarding read stacks the data still exhibits variability not fully accounted for by this model (data not shown). Therefore, and in order to account for local dispersion (i.e. at a specific junction) we bootstrap N positions from the relevant set of positions to get an estimate for the read rate in junction j where is the observed number of reads that start in the i sampled position and W is the number of relevant positions in junction j. Here, the relevant positions refer to those where uniquely mapped, non-stack, reads start. By repeating this procedure M times we get an empirical distribution over estimates. These M samples are then used for computing posterior distributions over PSI and ∆PSI. Estimating the percent selected index (PSI, or Ψ) per junction j in a given LSV e in experiment t requires to derive a posterior distribution over multinomial distributions Ψ, s.t. ∑ Ψ Previous works concentrated on common cases involving two junctions such as cassette exons, where the posterior for Ψ can be computed in closed form using for example a Beta prior. For more complex cases where J > 2 the PSI posterior was commonly computed either as a point estimate (e.g., ML estimator using EM) or using MCMC sampling techniques (Katz et al., 2010). In general, sampling based estimation for Ψ or ∆Ψ (below) scales exponentially in the number of junctions J and is also hard to visualize beyond J = 2. However, noting that in most cases researchers are interested in relative abundance of specific variants rather than a complete distribution over all isoforms, MAJIQ side steps these issues by computing only the posterior marginal distributions per variant. This computation scales linearly with J and simplifies both downstream analysis and visualization of the results. It has been previously observed that alternative junctions in a given experimental condition generally tend to be either highly included or highly excluded (Shen et al., 2012; Wu et al., 2011). In line with these observations and based on fitting empirical distributions of observed PSI (data not shown) MAJIQ uses the following prior: . The default is resulting in a Jeffery prior that encourages either high inclusion or exclusion levels, but any (α, β) can be set. MAJIQ then uses the M read samples per junction (see above) to get a posterior Ψ as an average over those posterior distributions. PSI and ∆PSI are modeled as continuous random variables confined to the intervals [0, 1] and [−1, 1] respectively. In practice though, the required precision for these quantities is limited by both the problem being studied and the experimental techniques used to validate results (e.g., RT-PCR). This observation motivates MAJIQ’s discretized representation of possible PSI and ∆PSI. The discretization level, controlled by a tunable resolution parameter V, allows explicit tradeoff between accuracy and computation cost. Setting V = 40 (default) results in 2.5% PSI resolution. This discretization allows MAJIQ’s implementation to maintain and visualize a full distribution over PSI and ∆PSI, exploit efficient matrix operations for the entire range of Ψ values and avoid costly sampling procedures. MAJIQ’s estimation of for LSV e between experiments t,t′ is based on a joint prior P. While many previous works implicitly assume independence (i.e., P∼ P) both MAJIQ and rMATS (Shen et al., 2014) use a prior biased towards similar Ψt, Ψt′ values, which helps overcome falsely reporting high ∆Ψ due to fluctuations in small read counts. However, unlike rMATS that uses a multivariate uniform prior for ∆Ψ, MAJIQ combines the P prior described above with a ∆Ψ prior: P, with P having the form of a mixture of beta distributions: After some experimentation and measuring the effect on LSV quantification (data not shown), we found the following settings worked well. We set K = 3 with one component set as a spike at ∆Ψ = 0, the second as a beta distribution of small perturbations around 0, and the third mixture component set to a flat uniform prior (α = β = 1). Given the above prior, the joint posterior distribution is given by: where D represents the set of estimated number of reads per junction in the m-th sample (see above) and P(Ψ is the posterior beta distribution given the observed reads. Similarly, when comparing two conditions T, T′ with replicates we have: with the last two terms decomposing elegantly by the chain rule for the conjugate beta prior. More information regarding MAJIQ’s usage and parameters can be found in the software's tutorial, available at majiq.biociphers.org.

VOILA

VOILA creates HTML5 based visualization of gene splice graphs, LSVs, PSI and dPSI estimates. It uses two types of input files: a binary file output from MAJIQ builder summarizing gene splice graphs, and another binary file from MAJIQ quantifier summarizing LSV PSI/dPSI quantifications. The HTML5 lists splice graphs and associated LSVs according to user defined filters. Distributions over PSI or dPSI are represented using violin plots and each splice graph and LSV is also linked to the UCSC genome browser to allow comparison to raw reads or other track information. Interactive filters allow users to select which types of LSVs to display while a table view allows users to sort and search LSVs. The VOILA splice graphs, LSVs cartoons and violin plots are shown in Figure 6,7 and their respective supplementary figures. The original VOILA plots used for these figures can be found at: majiq.biociphers.org. More information regarding VOILA usage and parameters can be found in the software's user guide, available at majiq.biociphers.org.

PSI reproducibility

PSI reproducibility by RNA-Seq from biological replicates was evaluated using the following procedure. First, MAJIQ Builder was executed to detect the union set of LSVs in a set of biological replicates of hippocampus and liver from Keane et al. (2011). To avoid redundancy and enable comparison to other methods only a single junction from binary LSVs were included in downstream analysis. Next, for each replicate pair the difference in LSV quantification for each LSV was computed as R(ΨMAJIQ) = E[Ψr1]-E[Ψr1]. LSVs that were only detected in one of the replicates were discarded. The same set of LSVs were fed into MISO using the MAJIQ Builder GFF3 output file and the same procedure was executed to compute R(ΨMISO). This procedure was repeated 6 times to compute the mean and standard errors for the empirical R(Ψ) PDF shown in Figure 2—figure supplement 1C. The empirical PDF and standard error for the difference in reproducibility ΔR = RMISO- RMAJIQ (Figure 2—figure supplement 1C inset graph) were computed by a similar procedure. PSI reproducibility by RT-PCR was evaluated using the following procedure. For the data from Zhang et al. (2014), we first selected LSVs that were estimated by MAJIQ to be differentially spliced with high confidence (P( ΔΨ >0.2) > 0.95)) when using three samples from cerebellum and liver. This allowed us to also assess dPSI reproducibility for a wide range of dPSI values (see below). Next, for each LSV the total number of reads starting at positions within all the LSV’s junctions in each replicate were summed together for each tissue. Then, the LSVs were binned by the average total read coverage in the two tissues. Bins were defined to be: 10–30, 30–40, 40–80, 80–200, and above 200 reads. From each such bin, a set of LSVs was randomly selected for RT-PCR validation. Each RT-PCR was executed in triplicates (see below). Finally, the average PSI by RT-PCR and the expected PSI by either MAJIQ or MISO were used to produce Figure 2, and Figure 2—figure supplement 1. MISO was executed with default parameters. For the stimulated and unstimulated T-Cell dataset, we collected a compendium of historical RT-PCR quantifications for previously annotated cassette exons. These experiments were executed by different Lynch lab members across several years and pre selected for specific studies regardless of dPSI or RNA-Seq coverage level. The vast majority of these cassette exons did not exhibit differential splicing between stimulated and unstimulated cells and some lacked triplicates. This set of previously annotated cassette exons was mapped to MAJIQ’s LVS and then quantified using RNA-Seq from Cole et al. (2015) (Figure 2, Figure 2—figure supplement 1, circle shaped points).

dPSI reproducibility

dPSI reproducibility by RNA-Seq from biological replicates was evaluated using the following procedure. First, the MAJIQ Builder was executed for all replicates of hippocampus and liver experiments from Keane et al. (2011), yielding the union of all LSVs in these experiments. Next, for each liver and hippocampus pair of experiments, all quantifiable LSVs were ranked according to their E[ΔΨ] and the set of N LSVs with significant splicing changes at high confidence was defined as LSVs for which P( ΔΨ >0.2) > 0.95. This threshold was selected to be conservative, but see Figure 2—figure supplement 2A for more relaxed thresholds. This process was then repeated in another pair of experiments and the relative rank of the original set of N LSVs was recorded. The reproducibility ratio of any ranked LSVs subset was defined as the fraction where n* is the subset of the first n ranked LSVs that were in the N best ranked LSVs by the replicate experiments. Similar to the IRD statistic used to assess reproducibility of Chip-Seq peak calling (Li et al., 2011), a perfect RR(n) graph follows the diagonal line. Unlike IRD though, the RR statistic is invariant to small or even complete perturbations in the relative rank of the top ranked LSVs. Intuitively, this means that the RR will remain the same as long as the same subset n* makes the best N cutoff. It is important to note that the RR value can vary greatly, affected by biological, experimental, and technical factors. Nonetheless, one can use the RR to assess reproducibility in specific settings, or compare dPSI reproducibility by different algorithms under the same experimental setup. An inherent challenge in comparing MAJIQ to other methods is that MAJIQ quantifies LSVs while other methods quantify the classical AS event types. One complication as a result of that is that while redundant LSVs are removed (see above) different LSVs may still partially overlap. A good example for that are cassette exons. In the LSV formulation a differentially included cassette exon may have two LSVs that support it, corresponding to different lines of experimental evidence (junction reads from the up and downstream exons) but other methods/tools will only count this exon as a single event. This in turn may bias both the reproducibility ratio (RR) and detection power (N) in favor of MAJIQ. In our experiments, when we ignored such possible overlap of LSVs the reproducibility ratio remained the same but the number of differentially spliced LSVs detected was significantly higher (RR=86%, N = 752, data not shown). In order to avoid such a bias in favor of MAJIQ we implemented a conservative approach where the ranked LSVs are filtered so that no LSV contained overlapping exons with another LSV. We note this is a conservative filter as there may be complex LSVs that involve multiple differentially spliced exons, or cases where the same exone involves different variations (e.g. skipping the exon but also alternative 3’ or 5’ splice sites). In such cases only a single LSV would pass that filter while the methods we compared to would still be able to retain separate AS events for those. dPSI reproducibility for MISO was evaluated by the following procedure. First, we followed MISO’s (Katz et al., 2010) guidelines for performing exon-centeric analysis (i.e. AS events) rather than whole transcripts analysis. For this, we used the set of alternative events for the mm10 mouse genome provided by MISO. We indexed the GFF3 file and ran MISO with default parameters on the same data pairs of experiments described above to compute expected dPSI (E[ΔΨMISO]). Finally, we ranked LSVs by decreasing expected dPSI and computed the reproducibility ratio (RR) as described above. As MISO does not supply a statistical criteria for selecting the number of events (N) from its ranked list, we used the number produced by rMATS. Changing N to the higher number of LSVs detected by MAJIQ degraded MISO’s performance (data not shown). dPSI reproducibility for rMATS was evaluated by the following procedure. We ran rMATS (Shen et al., 2014) with replicates (groups) and without them (pairs), using ENSEMBL annotation file in GTF format. rMATS estimates differential expression for each one of the classic alternative splicing events it identifies from the annotation file (exon skipping, 5 and 3 prime splice site donor/acceptor, mutually exclusive exons and intron retention). We used the default parameters except for the cutoff employed to compute the FDR associated to each AS event quantification, which was set to 0.2 (see Figure 2—figure supplement 2A for the impact in reproducibility of different cutoffs). Lastly, we extracted the RR for confident changing AS events identified by rMATS (FDR < 0.05, P( ΔΨ >0.2) as reported in the rMATS output file). dPSI reproducibility for the Naive Bootstrapping approach used in Xiong et al. (2015) for cassette exons was adopted for LSVs using the following procedure. First, we implemented the bootstrapping over junction positions described in Xiong et al. (2015), with the same beta prior to avoid zero read counts. These samples gave an empirical distribution over possible PSI values and these were subsequently used to estimate the expected PSI. Similar to MISO, the Naive Bootstrapping approach does not assume a joint prior so that the expected dPSI estimates are simply the difference in the expected PSI in each experiment. The resulting expected dPSI was then used to rank the LSVs, filter them for possible overlap of exons, and compute RR as described above. dPSI accuracy by RT-PCR was evaluated by the same procedure as that described above for PSI. ΔΨRT was then computed as the difference between the average of each triplicate set of experiments in cerebellum and liver or the difference between previously recorded measurements in the Lynch Lab for the stimulated vs. unstimulated T-Cells. dPSI reproducibility by RT-PCR was defined as cases for which ΔΨRT >20%. This definition allowed assessing false positives and false negatives (Figure 2—figure supplement 1B, Figure 2—figure supplement 2B).

Protein feature enrichment

In order to construct the LSV junctions and protein features (PF) table we first built the union set of LSVs detected from Zhang et al. (2014). We used ENSEMBL RESTful services [http://www.ncbi.nlm.nih.gov/pubmed/25236461] to retrieve PF along with their genomic coordinates associated with transcripts containing LSVs. Next, we annotated each LSV junction by PF that overlap its reference exon and its target/source exon, discarding junctions in non-coding regions. Because the overlap of a LSV junction region and a protein feature can be partial, we considered a PF to be associated with a junction when there was at least a 20% overlap. Lastly, we labeled as changing junctions those that had an estimated delta psi greater than 20% in any two tissues. We assessed relative enrichment of PFs using the following procedure when comparing groups of LSV junctions such as changing vs. unchanging, or binary vs. complex. For each PF we computed the p-value by Fisher’s Exact Test (FET) for its distribution between the two junction groups compared. To correct for multiple hypotheses testing while accounting for the high correlation between some PF we applied a permutation based testing procedure (Column M). Specifically, we shuffled the labels (e.g. changing, unchanging) but controlled for the LSV origin of each junction. Thus, junctions from the same LSV were randomly switched with junctions from an LSV with the same number of junctions. This procedure guarantees that the number of labeled junctions remains the same, helps control for correlation between PF of junctions in the same LSV and for the distribution of LSV types. We repeated this process 10000 times and then calculate an empirical corrected FET p-value. The results from this analysis are included in Supplementary file 1.

LSVs species analysis

RNA-Seq data for lizard and chicken was downloaded from Barbosa-Morais et al. (2012); opossum and chimp datasets were downloaded from Brawand et al. (2011). RNA-Seq for human was downloaded from Illumina’s Body Atlas 2.0 (NCBI GSE30611). Transcriptomes were downloaded from Ensembl for lizard (genome assembly AnoCar2.0), chicken (assembly Galgal4), opossum (assembly monDom5), chimp (assembly Pan_troglodytes-2.1.3), mouse (assembly GRCm38.p3) and human (assembly GRCh38.p2). For mouse and human RefSeq transcriptome annotations were used for comparison (Figure 3). The latest genome builds annotated in RefSeq were used, GRCm38/mm10 for mouse and GRCh37/hg19 for human.

Meta analysis of complex LSVs across datasets

In order to assess the prevalence and potential enrichment of complex LSVs across additional datasets beyond the 12 mouse tissues, we analyzed a number of additional datasets shown in Figure 5A and Figure 5—source data 1. All raw data was downloaded from SRA and mapped using STAR as described above. In a select number of older or low-coverage experiments, mapped reads from replicates were pooled together before analyzing with MAJIQ (see 'Notes on processing', Figure 5—source data 1). MAJIQ dPSI was run for each comparison (e.g., tissue pairs, pairwise developmental time points, control versus altered splice factor expression). LSVs were considered differentially spliced if E[ΔΨ] > 20%. For datasets with multiple conditions (e.g. 12 tissues, or multiple developmental timepoints), the union of differentially spliced LSVs and all detected LSVs between all pairwise comparisons was considered. The enrichment of complex LSVs in the differentially spliced group compared to their relative proportion among all detected LSVs in each dataset was evaluated using a binomial test, with a Bonferroni correction for the number of datasets used. All counts, SRA and GEO data accession numbers, and PubMed IDs for each study are detailed in Figure 5—source data 1. To assess the distributions of PSI and dPSI across all datasets in Figure 5B and Figure 5—figure supplement 1, we considered only the LSVs and junctions detected across the 12 mouse tissues and required exact matches to these junctions in the additional datasets in order to consider those PSI or dPSI values in the analysis. This conservative approach ensured we only monitored 'natural' LSVs and no LSVs that are unique to a specific cell line or KD experiment.

Analysis of splicing changes in Alzheimer’s disease in two cohorts

In order to validate splicing changes in AD identified for the complex LSVs examined in this study (CAMK2D, CAMK2G, and CLTA) we took all differentially spliced LSVs we identified from the 3 healthy and 3 AD brains (Bai et al., 2013) and looked for similar changes in an independent, larger cohort. We used data from the Mount Sinai Brain Bank (MSBB) RNA sequencing study (ID: syn3157743, accessed at https://www.synapse.org/#!Synapse:syn3157743) We focused on samples that came from healthy brains and definite AD brains, based on CERAD Neuropathology Criteria given, across the following brain regions: frontal pole (healthy: n=58, AD: n=62); superior temporal gyrus (healthy: n=37, AD: n=50); parahippocampal gyrus (healthy: n=33, AD: n=45). Because overall coverage was lower in these datasets compared to the original cohort, which affects the ability to detect intron retention (data not shown), we ran MAJIQ Builder on both datasets with a high threshold for IR detection (--min_intronic_cov 1000) in order to only compare exonic LSVs. Additionally, to account for heterogeneity in the data and to save computational time we considered PSI values for each patient separately, as opposed to running all possible pairwise dPSI comparisons. An LSV that was changing in the first cohort was considered validated if in the MSBB cohort the distribution of PSI values for the most changing junction was significantly different between healthy and AD individuals in at least one brain subregion (p<0.05, two-tailed rank sum test) with a difference in the median PSI of > 10% in the same direction as the original cohort. This lead to 199 LSVs in 145 genes changing in both cohorts. Finally, DAVID was used to find enriched GO terms among these genes with shared differentially spliced LSVs between the two cohorts using default parameters (Huang et al., 2008).

LSV conservation analysis

The conservation plots in Figure 5C were generated using the union of the changing LSV from the 66 pairwise tissue comparisons shown in Figure 4E. For each such LSV we extracted the phastCons60 conservation scores for vertebrates (Siepel et al., 2005) for the first 50 positions in each exon and the first 300 intronic positions proximal to each exon. It is not immediately clear which of the variable regions in a complex LSV (left hand plot for the single source LSVs, right hand plot for the single target LSVs in Figure 5C) should be included in such conservation analysis. Previous work focused on binary cases of cassette exons and compared constitutive exons to the regions around the alternative exon, which tend to be more conserved. Since the focus of this analysis was on conservation of possible regulation in LSV units we chose to apply the max function for each position in such variable LSV regions. To partially correct for the possible bias for high scores that the max operation may introduce we also applied it to the binary LSVs and to randomly selected sets of K constitutive exons, where the size K is sampled based on the distribution of number of exons in LSVs (Figure 4C). Overall we sampled 5000 such sets for the constitutive regions plot. Finally, the lines in Figure 5C were smoothed using a 5 bases sliding window.

RT-PCR validations

Total RNA was extracted from mouse tissues as described previously (Zhang et al., 2014). For each tissue three samples corresponding to RNA from circadian times 31, 41, and 53 were used for validation. For additional validations we used total RNA extracted from a clonal Jurkat T cell line (JSL1, described in detail previously [Lynch and Weiss, 2000]) cultured in RPMI medium supplemented with 5% heat-inactivated fetal bovine serum (unstimulated) or the same growth medium supplemented with the phorbol ester PMA (Sigma-Aldrich, St. Louis, MO) at a concentration of 20 ng/mL (stimulated). Stable identify of this clonal line is continuously monitored by assessing hallmark changes in splicing induced by PMA (Cole et al., 2015; Martinez et al., 2012; Shankarling et al., 2013). Low cycle reverse transcription-PCR (RT-PCR) was performed on 0.5 micrograms of RNA as described previously in detail (Rothrock et al., 2003) using sequence specific primers. Gels were quantified by densitometry with the use of a Typhoon PhosphorImager (Amersham Biosciences, UK). Primers and expected size of products for all events are given in Supplementary file 2. For cassette exon LSVs percent spliced in was calculated as the percent of isoforms including the alternative exon over the total inclusion and exclusion isoforms. For complex LSVs, each band present on the gel was quantified. The percent selected index (PSI) for each junction of an LSV was calculated as the isoform(s) including that junction over the total isoforms present. For example, for the Camk2g source LSV (Figure 1C) the percent usage of the red junction that goes from the reference source exon 14 to exon 15 corresponds to the sum of the bands corresponding to the 214 nt isoform that includes exon 15 alone and the 256 nt isoform that includes both exons 15 and 16.

Software availability

MAJIQ and VOILA are available for download at majiq.biociphers.org In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included. Thank you for submitting your work entitled "A new view of transcriptome complexity and regulation through the lens of local splicing variations" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Juan Valcárcel (Reviewing Editor) and James Manley as the Senior Editor. One of the other two reviewers, Roderic Guigo, has agreed to share his identity. The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission. Summary: The work described in the manuscript by Vaquero-Garcia et al. has four main components: 1) A method to identify and categorize Alternative Splicing (AS) events-to which the authors refer as to Local Splicing Variations (LSV). 2) A method (MAJIQ) to quantify these events using RNASeq data and a visualization component (VOILA). 3) The application of the method to produce a catalogue of LSVs in mouse. 4) The detailed investigation of some of the complex LSVs identified that affect protein domains in developmentally regulated genes and have roles in neurogenesis and other brain functions. The authors illustrate the usefulness of their technique by reporting novel tissue-specific transcript variants of Camk2d and a poison exon in Ptbp1, as well as the apparent enrichment of certain transcript variants in brain samples from Alzheimer disease patients. The major advances of these tools are 1) that they are not restricted to being able to analyze the standard types of splicing events (skipped exons, retained introns, alternative 5' splice sites, etc.) but can also identify and quantify much more complex local splicing variations, 2) can analyze both annotated and novel splicing events identified using RNA-seq data, and 3) can compare samples using replicates. There is indeed a need to understand the full complexity of transcript variants in most biological/pathological settings, and therefore sensitive methods for their detection and quantification, particularly using standard sequencing depth datasets, can be very valuable. The performance of the package, which combines various softwares into a single pipeline, appears to be excellent and all tools are available. Essential revisions: 1) The manuscript would benefit from presenting the work in framework of previous developments in the field, comparing with them, and emphasizing the relative merits of the current contribution. The authors seem to imply that theirs is the first method that attempts to systematically classify and categorize AS events, beyond merely, exon skipping, alternative usage of splicing sites or intron retention. But, in my opinion, this is incorrect. The authors appear to ignore a large body of previous literature that actually attempts to address exactly this problem. Worth mentioning in this regard is the work of Nagasaki et al. (2005, http://bioinformatics.oxfordjournals.org/content/22/10/1211.full) or Sammeth et al. (2008, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000147). In the latter work, in particular, analysis similar to those presented in Figure 3 in Vaquero-Garcia et al., regarding the frequency of complex LSVs depending on the annotation used and the species considered, are presented. Moreover the methods developed by Sammeth et al. have actually been used to categorize AS events in a number of publications (https://scholar.google.es/scholar?cites=8586199022962348855&as_sdt=2005&sciodt=0,5&hl=en). Also single source and single target LSVs are quantified in their recent work by Pervouchine et al. (2013, http://bioinformatics.oxfordjournals.org/content/29/2/273.long). The authors also claim that they obtain splicing information by combining the annotation and RNA-seq evidence (Figure 2A). It is not clear if they use only splice junctions, or also exons and transcript models. The latter is important for the definition of LSV. Indeed, if "the reference exon may include multiple 3' splice sites (ST-LSV) or 5' splice sites (SS-LSV)", as it is said in Materials and methods, it would be necessary to know novel exons as continuous units. If this is the case, a lot more has to be explained in Materials and methods regarding the procedure of exon discovery from RNA-seq. In summary, in contrast to the claims by Vaquero-Garcia et al., complex LSVs have received wide attention within the field. 2) Another potential caveat is whether the results reported will persuade the general reader of the genuine potential of the approach to get biological insight. Specifically: A) The authors argue that complex LSVs represent over 30% of all transcripts variants. The general reader may wonder whether the switches in transcript isoforms within this category generally represent significantly large differences between tissues or biological conditions as to have biological impact. In other words, is biologically-relevant LSV likely to be limited to exceptional cases, or are the majority of complex variants likely to be functionally relevant? It would be useful, in this regard, to analyze the range and frequency distribution of ΔPSI values associated with complex LSVs compared to more standard alternative splicing events. Analysis of the conservation level of the predictably more relevant events could also be helpful. B) The authors predict that their method "will advance the ability to relate tissue-specific splice variation to genetic variation, phenotype and disease". They illustrate this by the identification of transcript variants possibly related to Alzheimer Disease, but the results are essentially descriptive and, considering the number of samples tested and the absence of validation in different cohorts, of limited value to extract rigorous biomedical conclusions. The manuscript would greatly benefit from additional efforts to more broadly illustrate the utility of the tool, for example to show how the approach is advantageous for explaining the impact of genetic variation using publicly available data from the GTEx project (e.g. pilot samples), ENCODE, modENCODE, etc. [Editors' note: further revisions were requested prior to acceptance, as described below.] Thank you for resubmitting your work entitled "A new view of transcriptome complexity and regulation through the lens of local splicing variations" for further consideration at eLife. Your revised article has been favorably evaluated by James Manley (Senior editor) and a Reviewing editor. The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below: The authors have significantly expanded and improved their manuscript, which can now provide a generally useful tool for exploring transcriptome diversity in RNA-seq data and, on these basis, would be acceptable for publication in eLife. The authors have done a good job discussing previous efforts by other scientists to identify complex alternative splicing decisions (main point #1 of the referees' report) and to illustrate the range and frequency distribution of complex LSVs compared to more standard alternative splicing events (point #2A). It would be helpful if the authors could more clearly state what was the validation rate of their disease-associated LSV changes in a second cohort of Alzheimer Disease patients (point #2B) (unless I have missed it, they only provide the number of LSVs reproducibly differentially spliced – p20 – but not what is the fraction of predictions validated in the two datasets). Another point: it is not clearly stated in the text/Figure 5A legend that the data corresponds to LSVs consistently differentially spliced between brain samples from patients and controls in the two datasets (as stated in the rebuttal). 1) The manuscript would benefit from presenting the work in framework of previous developments in the field, comparing with them, and emphasizing the relative merits of the current contribution. The authors seem to imply that theirs is the first method that attempts to systematically classify and categorize AS events, beyond merely, exon skipping, alternative usage of splicing sites or intron retention. But, in my opinion, this is incorrect. The authors appear to ignore a large body of previous literature that actually attempts to address exactly this problem. Worth mentioning in this regard is the work of Nagasaki et al. (2005, ) or Sammeth et al. (2008, The original manuscript lacked a section describing related work, which was mistakenly removed before submission. We apologize for this mistake and have now re introduced an extended section for this in the revised Discussion. We also added text in the Methods section to answer the question regarding how MAJIQ combines known annotation and de-novo junction to annotate putative de-novo exonic regions. 2) Another potential caveat is whether the results reported will persuade the general reader of the genuine potential of the approach to get biological insight. Specifically: A) The authors argue that complex LSVs represent over 30% of all transcripts variants. The general reader may wonder whether the switches in transcript isoforms within this category generally represent significantly large differences between tissues or biological conditions as to have biological impact. In other words, is biologically-relevant LSV likely to be limited to exceptional cases, or are the majority of complex variants likely to be functionally relevant? It would be useful, in this regard, to analyze the range and frequency distribution of ΔPSI values associated with complex LSVs compared to more standard alternative splicing events. Analysis of the conservation level of the predictably more relevant events could also be helpful. The reviewers raise here an excellent point, which is also related to the concern raised below about disease relevance. To address the concern about significance of complex LSVs we followed the reviewers’ suggestion and created a new figure (Figure 5). Figure 5A is a meta analysis of over 30 datasets and 243 RNA-Seq experiments that shows the significant enrichment of complex LSVs in datasets that cover developmental stages, key splice factors, and human disease (AD) across diverse tissues and cell lines. Figure 5B shows the changes in LSV inclusion levels (dPSI) across these conditions, just as suggested by the reviewers. Finally, Figure 5C shows the suggested conservation analysis with a significant increase in intronic conservation for differentially spliced complex LSVs compared to their binary LSVs counterparts. B) The authors predict that their method "will advance the ability to relate tissue-specific splice variation to genetic variation, phenotype and disease". They illustrate this by the identification of transcript variants possibly related to Alzheimer Disease, but the results are essentially descriptive and, considering the number of samples tested and the absence of validation in different cohorts, of limited value to extract rigorous biomedical conclusions. The manuscript would greatly benefit from additional efforts to more broadly illustrate the utility of the tool, for example to show how the approach is advantageous for explaining the impact of genetic variation using publicly available data from the GTEx project (e.g. pilot samples), ENCODE, modENCODE, etc. To address the concern about the anecdotal nature of the disease related results in the original submission we performed the following: First, we followed the reviewers’ suggestion and validated these results on an independent data set (syn3157743 from the AMP-AD Target Discovery Consortium data portal) which is much larger: 157 samples from AS patient’s brains and 128 control samples, across three different brain sub regions. Second, we added a genome wide analysis of LSVs that are consistently differentially spliced between brain samples from patients and controls in the two datasets (included in Figure 5A described above). We very much agree this work is still just the tip of the iceberg and we believe much more can be done with the AD datasets and many others. Specifically, the suggested analysis of genetic variations with MAJIQ is a great direction to pursue in future work. However, we are very much aware that a careful analysis of this sort is beyond the scope of this paper. [Editors' note: further revisions were requested prior to acceptance, as described below.] […] The authors have done a good job discussing previous efforts by other scientists to identify complex alternative splicing decisions (main point #1 of the referees' report) and to illustrate the range and frequency distribution of complex LSVs compared to more standard alternative splicing events (point #2A). It would be helpful if the authors could more clearly state what was the validation rate of their disease-associated LSV changes in a second cohort of Alzheimer Disease patients (point #2B) (unless I have missed it, they only provide the number of LSVs reproducibly differentially spliced – p20 – but not what is the fraction of predictions validated in the two datasets). Another point: it is not clearly stated in the text/ We have worked hard to address the reviewers’ concerns and suggestions. In particular, to address the reviewers’ concern about the significance of complex LSVs we performed a whole new meta analysis of LSVs across 32 datasets covering diverse tissues, cell lines, developmental stages and splice factors knockdown. We added this to a conservation analysis as suggested by the reviewers, creating a new figure (Figure 5). In addition, to address the concern regarding the relevance of our findings to disease studies we added a genome wide analysis of differentially spliced LSVs in independent datasets of 285 samples from AD patients and controls, which validated our initial findings for the CAMK2 family and other genes. We also added a section detailing previous related work, which was erroneously omitted in the original submission, a mistake we apologize for. Finally, we made sure to add details in all sections indicated by the reviewers as unclear.

55 in total

1. A model system for activation-induced alternative splicing of CD45 pre-mRNA in T cells implicates protein kinase C and Ras.

Authors: K W Lynch; A Weiss
Journal: Mol Cell Biol Date: 2000-01 Impact factor: 4.272

2. Network motifs: simple building blocks of complex networks.

Authors: R Milo; S Shen-Orr; S Itzkovitz; N Kashtan; D Chklovskii; U Alon
Journal: Science Date: 2002-10-25 Impact factor: 47.728

3. Deciphering the splicing code.

Authors: Yoseph Barash; John A Calarco; Weijun Gao; Qun Pan; Xinchen Wang; Ofer Shai; Benjamin J Blencowe; Brendan J Frey
Journal: Nature Date: 2010-05-06 Impact factor: 49.962

4. The MicroRNA miR-124 promotes neuronal differentiation by triggering brain-specific alternative pre-mRNA splicing.

Authors: Eugene V Makeyev; Jiangwen Zhang; Monica A Carrasco; Tom Maniatis
Journal: Mol Cell Date: 2007-08-03 Impact factor: 17.970

5. rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data.

Authors: Shihao Shen; Juw Won Park; Zhi-xiang Lu; Lan Lin; Michael D Henry; Ying Nian Wu; Qing Zhou; Yi Xing
Journal: Proc Natl Acad Sci U S A Date: 2014-12-05 Impact factor: 11.205

6. Systematic discovery of regulated and conserved alternative exons in the mammalian brain reveals NMD modulating chromatin regulators.

Authors: Qinghong Yan; Sebastien M Weyn-Vanhentenryck; Jie Wu; Steven A Sloan; Ye Zhang; Kenian Chen; Jia Qian Wu; Ben A Barres; Chaolin Zhang
Journal: Proc Natl Acad Sci U S A Date: 2015-03-03 Impact factor: 11.205

Review 7. The multifunctional calcium/calmodulin-dependent protein kinase: from form to function.

Authors: A P Braun; H Schulman
Journal: Annu Rev Physiol Date: 1995 Impact factor: 19.318

8. Phosphorylation of microtubule-associated protein tau: identification of the site for Ca2(+)-calmodulin dependent kinase and relationship with tau phosphorylation in Alzheimer tangles.

Authors: B Steiner; E M Mandelkow; J Biernat; N Gustke; H E Meyer; B Schmidt; G Mieskes; H D Söling; D Drechsel; M W Kirschner; M Goedert; E Mandelkow
Journal: EMBO J Date: 1990-11 Impact factor: 11.598

9. Intron-centric estimation of alternative splicing from RNA-seq data.

Authors: Dmitri D Pervouchine; David G Knowles; Roderic Guigó
Journal: Bioinformatics Date: 2012-11-21 Impact factor: 6.937

10. Calcium signaling in neurodegeneration.

Authors: Philippe Marambaud; Ute Dreses-Werringloer; Valérie Vingtdeux
Journal: Mol Neurodegener Date: 2009-05-06 Impact factor: 14.195

130 in total

Review 1. Environmental influences on RNA processing: Biochemical, molecular and genetic regulators of cellular response.

Authors: Athma A Pai; Francesca Luca
Journal: Wiley Interdiscip Rev RNA Date: 2018-09-14 Impact factor: 9.957

Review 2. Escape From ALL-CARTaz: Leukemia Immunoediting in the Age of Chimeric Antigen Receptors.

Authors: Sisi Zheng; Mukta Asnani; Andrei Thomas-Tikhonenko
Journal: Cancer J Date: 2019 May/Jun Impact factor: 3.360

3. Poly(C)-Binding Protein Pcbp2 Enables Differentiation of Definitive Erythropoiesis by Directing Functional Splicing of the Runx1 Transcript.

Authors: Louis R Ghanem; Andrew Kromer; Ian M Silverman; Xinjun Ji; Matthew Gazzara; Nhu Nguyen; Gabrielle Aguilar; Massimo Martinelli; Yoseph Barash; Stephen A Liebhaber
Journal: Mol Cell Biol Date: 2018-07-30 Impact factor: 4.272

4. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells.

Authors: Ishaan Gupta; Paul G Collier; Bettina Haase; Ahmed Mahfouz; Anoushka Joglekar; Taylor Floyd; Frank Koopmans; Ben Barres; August B Smit; Steven A Sloan; Wenjie Luo; Olivier Fedrigo; M Elizabeth Ross; Hagen U Tilgner
Journal: Nat Biotechnol Date: 2018-10-15 Impact factor: 54.908

Review 5. Applications of machine learning in drug discovery and development.

Authors: Jessica Vamathevan; Dominic Clark; Paul Czodrowski; Ian Dunham; Edgardo Ferran; George Lee; Bin Li; Anant Madabhushi; Parantu Shah; Michaela Spitzer; Shanrong Zhao
Journal: Nat Rev Drug Discov Date: 2019-06 Impact factor: 84.694

6. Splicing factor DHX15 affects tp53 and mdm2 expression via alternate splicing and promoter usage.

Authors: John McElderry; Blake Carrington; Kevin Bishop; Erika Kim; Wuhong Pei; Zelin Chen; Ramanagouda Ramanagoudr-Bhojappa; Anupam Prakash; Shawn M Burgess; P Paul Liu; Raman Sood
Journal: Hum Mol Genet Date: 2019-12-15 Impact factor: 6.150

7. Phosphoproteomics reveals that glycogen synthase kinase-3 phosphorylates multiple splicing factors and is associated with alternative splicing.

Authors: Mansi Y Shinde; Simone Sidoli; Katarzyna Kulej; Michael J Mallory; Caleb M Radens; Amanda L Reicherter; Rebecca L Myers; Yoseph Barash; Kristen W Lynch; Benjamin A Garcia; Peter S Klein
Journal: J Biol Chem Date: 2017-09-15 Impact factor: 5.157

8. Global variation in gene expression and the value of diverse sampling.

Authors: Derek E Kelly; Matthew E B Hansen; Sarah A Tishkoff
Journal: Curr Opin Syst Biol Date: 2017-03-03

9. Systematic evaluation of differential splicing tools for RNA-seq studies.

Authors: Arfa Mehmood; Asta Laiho; Mikko S Venäläinen; Aidan J McGlinchey; Ning Wang; Laura L Elo
Journal: Brief Bioinform Date: 2020-12-01 Impact factor: 11.622

10. A combined RNA-seq and whole genome sequencing approach for identification of non-coding pathogenic variants in single families.

Authors: Revital Bronstein; Elizabeth E Capowski; Sudeep Mehrotra; Alex D Jansen; Daniel Navarro-Gomez; Mathew Maher; Emily Place; Riccardo Sangermano; Kinga M Bujakowska; David M Gamm; Eric A Pierce
Journal: Hum Mol Genet Date: 2020-04-15 Impact factor: 6.150