The Expanding Landscape of Alternative Splicing Variation in Human Populations.

Eddie Park¹, Zhicheng Pan², Zijun Zhang², Lan Lin¹, Yi Xing³.

Abstract

Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine.

Year: 2018 PMID: 29304370 PMCID: PMC5777382 DOI: 10.1016/j.ajhg.2017.11.002

Source DB: PubMed Journal: Am J Hum Genet ISSN: 0002-9297 Impact factor: 11.025

Main Text

Introduction

Pre-mRNA splicing is a conserved biological process in which introns within nascent RNA molecules are removed and exons are ligated to form mature mRNA products. Through alternative choices of exons and splice sites during splicing (a process known as alternative splicing), a single gene can produce multiple mRNA isoforms that dramatically diversify the transcriptome and the proteome. Although the human genome has only approximately 20,000 protein-coding genes, the number of unique mRNA isoforms generated from these genes can be more than ten times that number. Nearly all multi-exon human genes are alternatively spliced.5, 6 The basic patterns of alternative splicing include exon skipping, alternative 5′ and 3′ splice sites, mutually exclusive exons, intron retention, and alternative splicing coupled with alternative first or last exons (Figure 1A). Beyond these basic patterns involving binary choices of exons or splice sites during splicing, many complex alternative splicing patterns exist in the transcriptome (see Figure 1B for examples). In extreme cases, the combinatorial choices of multiple alternatively spliced regions can generate tens of thousands of mRNA isoforms from a single gene. The resulting mRNA isoforms can have distinct regulatory properties in the cell, such as localization, stability, and translational efficiency, and can be translated into stable protein isoforms with divergent structures and functions.9, 10 Therefore, alternative splicing provides a powerful mechanism for expanding the regulatory and functional repertoire of eukaryotic organisms.

Figure 1

A Primer on Alternative Splicing

(A and B) Basic (A) and complex (B) patterns of alternative splicing. Dark-blue boxes represent constitutively spliced exons. Red, light-blue, and green boxes represent alternatively spliced exons.

(C) Alternative splicing is regulated by an extensive protein-RNA interaction network involving cis elements within the pre-mRNA and trans-acting factors that bind to these cis elements. The most essential splicing signals within the pre-mRNA are the 5′ splice site (5′SS), 3′ splice site (3′SS), branch site (A), and polypyrimidine tract (Y(n)). The 5′ and 3′ splice sites have highly conserved GU and AG dinucleotides as the first and last two nucleotides of the intron, respectively. The U1 snRNP complex recognizes the 5′ splice site, and the U2 snRNP complex recognizes the branch site. The U2AF proteins recognize the 3′ splice site and polypyrimidine tract. Exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs), and intronic splicing silencers (ISSs) are pre-mRNA cis regulatory motifs that recruit various RNA-binding proteins (e.g., SR and hnRNP proteins) to regulate alternative splicing.

Alternative splicing is regulated in a cell-type- and developmental-stage-specific manner. This regulation is orchestrated through an extensive protein-RNA interaction network involving cis elements within the pre-mRNA and trans-acting factors that bind to these cis elements (Figure 1C). The most conserved cis splicing elements include the 5′ and 3′ splice sites that define the boundary of an intron with its upstream and downstream exon, respectively, as well as the branch site and polypyrimidine tract upstream of the 3′ splice site. These elements are recognized by the core splicing machinery (the spliceosome) and play an essential role in defining exon and intron identity.
In addition to these core elements, auxiliary cis elements in exons or flanking introns can act as splicing enhancer or silencer elements to promote or repress exon splicing via their interactions with trans-acting splicing regulators, in particular RNA-binding proteins (RBPs). For example, cell-type-specific splicing regulators, such as ESRP, CELF, MBNL, RBFOX, and PTB family members, control the alternative splicing profiles and cell identities of epithelial, muscle, and neuronal cells by interacting with their cognate cis elements within the pre-mRNA to produce cell-type-specific isoforms. Alternative splicing is frequently affected by human genetic variants and disease mutations. A large fraction of human disease mutations disrupt splice site signals or splicing enhancer or silencer elements within the pre-mRNA, leading to the production of aberrant mRNA and protein products. It has been estimated that such cis splicing mutations constitute 15%–60% of human disease mutations. Additionally, mutations disrupting trans-acting splicing regulators cause a wide spectrum of diseases by globally compromising the splicing of many downstream target genes. Through decades of genetic and medical research, the role of aberrant splicing as a primary cause of Mendelian diseases has been firmly established and extensively reviewed.15, 17 However, until recently, much less was known and appreciated about the extent of naturally occurring alternative splicing variation among human individuals and how alternative splicing affects phenotypic variability and disease susceptibility in human populations. Recent developments in genomic technologies and computational tools have enabled transcriptome-wide studies of alternative splicing at an unprecedented scale and resolution.5, 6 New data depict an expanding landscape of alternative splicing variation across human tissues and populations. 
Here, we describe technological advances that have markedly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We review population-scale transcriptome studies that have revealed alternative splicing to be a primary causal mechanism underlying genome-wide association study (GWAS) signals of complex traits and diseases. We highlight innovative experimental and computational approaches that enable the rapid discovery and characterization of genomic variants that alter splicing. Finally, we discuss the clinical applications of these findings as well as their implications for future genetic and medical research.

Technologies for High-Throughput Analysis of Alternative Splicing

The conventional molecular biology approach to the quantification of alternative splicing is reverse transcription polymerase chain reaction (RT-PCR). In the late 1990s, sequencing of expressed sequence tags (ESTs), which are fragments of full-length mRNAs, revealed widespread alternative splicing in eukaryotic organisms. The development of splicing-sensitive microarrays in the mid-2000s allowed researchers to examine global splicing regulatory programs across tissues, cellular states, and species. Notably, all three types of technologies have been used to discover the association between genotypes and alternative splicing patterns in human populations. However, these technologies have low throughput (RT-PCR and ESTs), have high noise (ESTs and splicing microarray), or are limited to known splicing events (RT-PCR and splicing microarray).19, 20 Powered by high-throughput second-generation DNA sequencers, the advent of RNA sequencing (RNA-seq) in the late 2000s transformed many aspects of biomedical research, including studies of transcriptome complexity and alternative splicing. Because of their massively parallel nature, state-of-the-art high-throughput sequencers are now able to generate billions of short sequence reads in a single run. Sequencing mRNAs with these sequencers allows the discovery of novel genes and mRNA isoforms, the estimation of gene expression levels, and the quantitation of alternative splicing events. Three landmark papers in 2008 demonstrated the use of RNA-seq for characterizing alternative splicing in mammalian tissues.5, 6, 24 Since then, RNA-seq has rapidly eclipsed microarray as the standard approach for transcriptome profiling. Currently, RNA-seq data for over 70,000 human samples have been deposited into public repositories, and the number continues to rise at a rapid pace. 
Although typical RNA-seq experiments analyze polyadenylated (polyA+) mRNAs from whole cells or bulk tissue, the RNA-seq workflow is versatile enough to allow diverse types of applications that can obtain transcriptome information at a more fine-grained level. For example, RNA-seq analysis of non-polyadenylated (polyA−) RNAs enables the discovery and quantitation of polyA− non-coding RNAs, including circular RNAs created by back-splicing events.27, 28 Isolation and sequencing of RNAs from distinct subcellular fractions have been used for characterizing the subcellular localization of mRNA isoforms as well as co-transcriptional splicing of nascent RNAs on chromatin.29, 30, 31 Single-cell RNA-seq has become an increasingly popular approach to studying the transcriptome, including alternative splicing, at the individual-cell level.32, 33 Finally, although Illumina sequencers generate only short sequence reads, specialized protocols for library preparation can be used for inferring full-length mRNA isoforms with the use of Illumina RNA-seq data. Tilgner et al. developed a “synthetic long read” RNA-seq approach for use with Illumina sequencers. The principle behind this method is to generate RNA-seq libraries from a given sample separated into many small pools. Each pool contains a small number of RNA molecules (approximately 1,000 or fewer), and the assumption is that for most genes, no more than one molecule per gene is present in each pool. Then, short reads from each pool can be assembled into full-length transcripts by de novo sequence assembly algorithms. Using this approach, the authors identified novel mRNA isoforms and determined that certain distant alternatively spliced exons tend to co-occur in full-length mRNA molecules, whereas others tend to be spliced in a mutually exclusive manner. A caveat to this approach is that it is limited by the same issues of de novo assembly with short reads, primarily mis-assemblies and repetitive sequences. 
Moreover, the assumption of one RNA molecule per gene in each pool might not hold true for highly expressed genes. Ultimately, the interest in sequencing full-length mRNA transcripts has led to a renaissance of long-read mRNA sequencing, now using third-generation DNA sequencers most notably from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies. For example, PacBio isoform sequencing (Iso-Seq) has successfully identified many novel transcripts and alternative splicing events in tissues and cell types with well-characterized transcriptomes,38, 39 whereas Nanopore RNA-seq has been used for determining exon connectivity and full-length mRNAs in complex alternatively spliced genes with thousands of distinct isoform products. The strengths of third-generation long-read RNA-seq are in their long read lengths, which allow the direct resolution of isoform structure and the interrogation of repetitive RNA sequences, whereas their main weaknesses are their higher error rates and lower throughput (Figure 2). For the purpose of analyzing alternative splicing, the higher error rates are tolerable because aligners can leverage the long read lengths to align reads to exons and splice junctions. However, the smaller read number due to the lower throughput is a major bottleneck for the accurate quantitation of isoform abundance. A hybrid approach of combining long, error-prone reads from third-generation sequencers with short, accurate reads from second-generation sequencers has been developed for correcting sequencing errors and obtaining isoform quantitation from long reads. From a historical perspective, the data of third-generation long-read RNA-seq resemble those of EST sequencing, and computational methods developed for EST data have proven useful for PacBio and Nanopore RNA-seq data.
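The one-molecule-per-gene-per-pool assumption behind the synthetic long-read approach can be checked with a short calculation. The sketch below is illustrative only: it assumes that the number of molecules of a gene landing in a pool is approximately Poisson distributed, and the function name, the pool size of 1,000, and the example gene fractions are hypothetical choices rather than values taken from Tilgner et al.

```python
import math


def prob_multiple_molecules(pool_size: int, gene_fraction: float) -> float:
    """Probability that one pool holds two or more molecules of a gene.

    Assumption (not from the source): the per-pool count is Poisson with
    mean pool_size * gene_fraction, where gene_fraction is the share of
    all RNA molecules in the sample that come from this gene.
    """
    lam = pool_size * gene_fraction  # expected molecules of this gene per pool
    # P(X >= 2) = 1 - P(X = 0) - P(X = 1) for a Poisson variable
    return 1.0 - math.exp(-lam) * (1.0 + lam)


if __name__ == "__main__":
    # Hypothetical gene fractions, from lowly to very highly expressed
    for fraction in (1e-5, 1e-4, 1e-3, 1e-2):
        p = prob_multiple_molecules(1000, fraction)
        print(f"gene fraction {fraction:g}: P(>=2 molecules in a pool) = {p:.4f}")
```

Under these assumptions the collision probability is negligible for lowly expressed genes and approaches 1 for genes that contribute around 1% of all molecules, which is consistent with the caveat about highly expressed genes.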

Figure 2

Strengths and Weaknesses of Short-Read and Long-Read RNA-Seq

(A) Schematic diagram of an alternatively spliced gene that generates two distinct mRNA isoforms. The first, middle, and last exons are constitutive exons. The second and fourth exons are alternative exons. The two alternative exons are co-spliced such that the long isoform contains all five exons and the short isoform contains only the first, middle, and last exons.

(B) Short-read RNA-seq generates many reads, enabling the accurate quantitation of individual alternative exons, but the long-range coupling between the two alternative exons is lost.

(C) Long-read RNA-seq captures the long-range coupling between alternative exons and identifies the correct full-length mRNA isoforms, but the limited number of reads reduces the precision of isoform quantitation.

Beyond sequencing, imaging is emerging as a powerful technology for transcriptome analysis with spatiotemporal resolution. Sequential fluorescence in situ hybridization (seqFISH) and multiplexed error-robust fluorescence in situ hybridization (MERFISH) are imaging-based methods for single-cell transcriptomics and can quantify hundreds of target transcripts at the single-molecule level with spatial resolution. These methods integrate single-molecule fluorescence in situ hybridization with a barcoding scheme to distinguish hundreds of transcripts simultaneously. Each target transcript has a predefined sequential fluorescent barcode, which is used for identifying the transcript via cycles of hybridization with different fluorescent probes. Currently, seqFISH and MERFISH have primarily been applied to gene-level quantification, but with customizable probes, these approaches are in principle applicable to isoform analysis.
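The barcoding idea described above can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions, not the published seqFISH or MERFISH pipeline: each transcript is given a hypothetical binary barcode (signal present or absent in each hybridization round), the codebook and gene names are invented for illustration, and an observed barcode is assigned to a transcript only if it lies within one mismatch of exactly one codeword.

```python
from typing import Dict, Optional

# Hypothetical 8-round codebook. Every pair of barcodes differs in at
# least 4 rounds, so any single-round error is still decoded uniquely.
CODEBOOK: Dict[str, str] = {
    "GENE_A": "11110000",
    "GENE_B": "00001111",
    "GENE_C": "11001100",
    "GENE_D": "00110011",
}


def hamming(a: str, b: str) -> int:
    """Number of hybridization rounds in which two barcodes disagree."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: str, max_errors: int = 1) -> Optional[str]:
    """Return the transcript whose barcode is within max_errors of the
    observed barcode, or None if no codeword or several codewords match."""
    matches = [name for name, code in CODEBOOK.items()
               if hamming(observed, code) <= max_errors]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for spot in ("11110000", "11110001", "11111100"):
        print(spot, "->", decode(spot))
```

In this toy codebook a spot with one erroneous round is still assigned to its transcript, whereas a spot with two erroneous rounds is rejected rather than misassigned.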

Quantifying Alternative Splicing by Using RNA-Seq Data

Because of the popularity of Illumina RNA-seq, many computational tools have been developed for estimating mRNA isoform expression and quantifying alternative splicing variation with the use of short-read RNA-seq data.44, 45 These tools fall into two broad categories according to their strategies for data analysis. The first category represents transcript-based tools that seek to estimate the abundances and relative proportions of full-length mRNA isoforms by using short-read RNA-seq data. This approach typically involves aligning short reads to a reference genome or transcriptome and then estimating the abundances of mRNA isoforms by using an expectation-maximization algorithm.46, 47 Recent innovations in pseudo-alignment algorithms have led to alignment-free RNA-seq transcript quantitation with significantly improved speed and computational efficiency.48, 49 Isoform proportions can then be inferred from the estimated abundances of all mRNA isoforms of a given gene. A drawback of the transcript-based approach is that inferring the abundance of full-length mRNA isoforms from short reads is non-trivial, and the results are sensitive to the choice of transcript annotations. Moreover, for genes with multiple alternatively spliced regions, it is not straightforward to attribute change in the abundance of mRNA isoforms to differential splicing regulation at specific exons or splice sites. The second category represents event-based tools that seek to directly quantify individual alternative splicing events by using RNA-seq data. In this approach, alternative splicing events are discovered from RNA-seq data, reads aligned to specific exons or splice junctions are counted, and appropriate statistical methods are used for quantifying alternative splicing and detecting differential splicing between distinct biological conditions. 
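To make the transcript-based strategy concrete, the following is a minimal sketch of an expectation-maximization loop that splits ambiguous reads among the isoforms they are compatible with. It is an illustration under simplifying assumptions, not the algorithm of any specific published tool: reads are summarized as hypothetical compatibility classes, and transcript length, fragment bias, and sequencing errors are ignored, so the estimates are shares of reads rather than length-normalized abundances.

```python
from typing import Dict, FrozenSet, List


def em_isoform_proportions(
    read_classes: Dict[FrozenSet[str], int],
    isoforms: List[str],
    n_iter: int = 200,
) -> Dict[str, float]:
    """Estimate the share of reads originating from each isoform.

    read_classes maps a set of compatible isoforms to the number of reads
    that align equally well to every isoform in that set.
    """
    if sum(read_classes.values()) == 0:
        raise ValueError("no reads supplied")
    theta = {t: 1.0 / len(isoforms) for t in isoforms}  # uniform start
    for _ in range(n_iter):
        expected = {t: 0.0 for t in isoforms}
        # E-step: split each read class in proportion to current estimates
        for compatible, count in read_classes.items():
            denom = sum(theta[t] for t in compatible)
            if denom == 0.0:
                continue
            for t in compatible:
                expected[t] += count * theta[t] / denom
        # M-step: renormalize the expected read counts into proportions
        assigned = sum(expected.values())
        theta = {t: expected[t] / assigned for t in isoforms}
    return theta


if __name__ == "__main__":
    # Hypothetical gene with an exon-inclusion ("long") and an
    # exon-skipping ("short") isoform
    reads = {
        frozenset({"long", "short"}): 600,  # reads in shared exons
        frozenset({"long"}): 300,           # inclusion-specific reads
        frozenset({"short"}): 100,          # skipping-specific reads
    }
    print(em_isoform_proportions(reads, ["long", "short"]))
```

The example also shows why the results depend on the annotation: the compatibility classes are defined entirely by which isoforms are listed for the gene.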
A widely used metric in event-based analyses is percent spliced in (PSI or ψ), which represents the percentage of a gene’s mRNA transcripts that include a specific exon or splice site. For a given alternative splicing event, the PSI value can be calculated from the counts of RNA-seq reads supporting specific exons or splice junctions.50, 51 Many popular computational tools for RNA-seq analysis of alternative splicing are event based (MISO, SpliceTrap, rMATS, and MAJIQ, to name a few). These tools differ in their definitions of alternative splicing events (basic versus complex), read-counting procedures, and statistical methods for quantifying and determining differential alternative splicing. Nonetheless, for the same set of alternative splicing events, these tools tend to produce highly concordant PSI estimates. Given that the PSI value represents a proportion estimated from read counts, the confidence interval of the PSI estimate is dependent on the overall RNA-seq read coverage for an event of interest, such that a higher coverage leads to a more reliable PSI estimate. This is a critical issue in RNA-seq analysis of alternative splicing, and studies have shown that modeling the confidence interval of PSI values on the basis of RNA-seq read counts improves downstream statistical inference.50, 51, 54 Interestingly, a hybrid approach leveraging full-length transcript quantitation for event-based analysis has been employed in a tool called SUPPA. This tool first runs alignment-free transcript quantitation software to estimate the abundance of mRNA isoforms and then converts these estimates to alternative splicing quantitation at the event level. With the use of pseudo-alignment algorithms,48, 49 this approach is fast and scalable to large datasets. However, it is restricted to pre-existing transcript annotations and cannot discover or quantify novel alternative splicing events. 
This issue is a limitation for analyzing genetic variation of alternative splicing, given that genomic variants can generate novel alternative splicing events in individual transcriptomes.55, 56
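The two routes to a PSI value discussed in this section can be sketched in a few lines. The example below is a simplified illustration, not the procedure of MISO, rMATS, SUPPA, or any other tool named above: it assumes that inclusion and skipping read counts are directly comparable (no effective-length normalization), uses a Wilson score interval as one simple way to show how the uncertainty of PSI shrinks with coverage, and uses hypothetical transcript names and abundances for the transcript-to-event conversion.

```python
import math
from typing import Dict, Iterable, Tuple


def psi_with_interval(
    inclusion_reads: int, skipping_reads: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """PSI point estimate with an approximate 95% Wilson score interval."""
    n = inclusion_reads + skipping_reads
    if n == 0:
        raise ValueError("no reads cover this event")
    p = inclusion_reads / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


def psi_from_transcript_abundances(
    abundance: Dict[str, float],
    inclusion_isoforms: Iterable[str],
    skipping_isoforms: Iterable[str],
) -> float:
    """Event-level PSI derived from transcript-level abundance estimates."""
    inc = sum(abundance[t] for t in inclusion_isoforms)
    skp = sum(abundance[t] for t in skipping_isoforms)
    if inc + skp == 0:
        raise ValueError("no expressed isoform defines this event")
    return inc / (inc + skp)


if __name__ == "__main__":
    # Same PSI of 0.8 at low and high coverage
    for inc, skp in ((8, 2), (800, 200)):
        psi, lo, hi = psi_with_interval(inc, skp)
        print(f"{inc + skp} reads: PSI={psi:.2f}, interval=({lo:.3f}, {hi:.3f})")
    # Hypothetical transcript abundances for one exon-skipping event
    tpm = {"T1": 30.0, "T2": 10.0, "T3": 60.0}
    print(psi_from_transcript_abundances(tpm, ["T1", "T2"], ["T3"]))
```

The second function can only use isoforms that appear in the abundance table, which mirrors the restriction to pre-existing annotations noted above.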

Computational Approaches for Discovering Genetic Associations of Alternative Splicing

With the continued increase in capacity and reduction in cost of high-throughput sequencers, generating RNA-seq datasets across many individuals in a population has become feasible (Figure 3A). Such population-scale RNA-seq datasets enable transcriptome-wide studies to associate genotypes with alternative splicing variation. Splicing quantitative trait locus (sQTL) analysis is a commonly used approach for discovering genetic variants associated with alternative splicing (Figure 3B).57, 58, 59 QTL analyses involve correlating genotypes with quantifiable phenotypes (traits). In sQTL analysis, the quantitative profiles of alternative splicing (e.g., PSI values) are treated as traits and tested for association with genotypes. Several computational methods have been developed for identifying sQTLs from population-scale genotype and RNA-seq data.57, 58, 59, 60, 61 Zhao et al. developed GLiMMPS, a computational method that identifies sQTLs at the event level by associating the PSI values of individual alternative splicing events with genotypes across the population. An important feature of GLiMMPS is that it uses a generalized linear mixed model to model the confidence interval of the PSI value in each individual as a function of RNA-seq coverage, which leads to improved accuracy over competing statistical models that treat the PSI value as a point estimate. Monlong et al. developed sQTLseekeR, a computational method that identifies sQTLs at the transcript level. sQTLseekeR treats the relative abundances of all alternatively spliced isoforms of a gene as a vector and uses a distance-based approach to test for association with genotypes. Because this method is applicable to any number of isoforms, it can detect sQTLs arising from both simple and complex alternative splicing events. Notably, the sQTL approach can be used to test for the association between any alternative splicing event and any SNP in cis or trans. 
cis-sQTL analyses could pinpoint genetic variants affecting cis splicing regulatory elements. On the other hand, trans-sQTL analyses can potentially identify hotspots where a SNP at a single genomic locus affects the alternative splicing of numerous genes across the genome. Such trans-sQTL hotspots have the potential to reveal known or novel regulators of alternative splicing.
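The core of an event-level sQTL test can be sketched as a regression of PSI on genotype. The code below is a deliberately simple stand-in, not GLiMMPS or sQTLseekeR: it treats each PSI value as a point estimate (the very simplification that GLiMMPS is described as improving upon), codes genotypes as hypothetical alternate-allele counts (0, 1, or 2), and obtains a p value by permuting PSI values across individuals. The data in the usage example are invented.

```python
import random
from typing import List, Sequence


def slope(genotypes: Sequence[float], psi: Sequence[float]) -> float:
    """Least-squares slope of PSI on genotype dosage (0, 1, or 2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(psi) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype does not vary across individuals")
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, psi))
    return sxy / sxx


def permutation_pvalue(
    genotypes: Sequence[float],
    psi: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation p value for the genotype-PSI slope."""
    rng = random.Random(seed)
    observed = abs(slope(genotypes, psi))
    shuffled: List[float] = list(psi)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-PSI relationship
        if abs(slope(genotypes, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


if __name__ == "__main__":
    # Hypothetical SNP and exon: 12 individuals, 4 per genotype class
    dosage = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    psi = [0.21, 0.18, 0.25, 0.20, 0.48, 0.52, 0.45, 0.55,
           0.79, 0.83, 0.76, 0.81]
    print("slope per allele:", round(slope(dosage, psi), 3))
    print("permutation p value:", permutation_pvalue(dosage, psi))
```

In a transcriptome-wide scan, a test of this kind would be repeated for every event and SNP pair of interest, so the resulting p values would need multiple-testing correction, as reflected in the FDR thresholds that population-scale studies report.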

Figure 3

Strategies for Discovering Genetic Associations of Alternative Splicing

(A) A population of individuals is genotyped, and their transcriptomes are subject to RNA-seq.

(B) Splicing quantitative trait locus (sQTL) analysis. For a given exon, the splicing level (PSI value) is measured for each individual on the basis of RNA-seq reads aligned to distinct mRNA isoforms. The PSI values are treated as quantitative traits and tested for association with genotypes across all individuals for the identification of significant sQTLs.

(C) Allele-specific alternative splicing (ASAS) analysis. Splicing levels (PSI values) are measured in an allele-specific manner for individuals who are heterozygous for a given SNP. For each individual, a PSI measurement can be obtained for each allele on the basis of allele-specific reads aligned to distinct mRNA isoforms. Reproducible allelic differences in PSI values across multiple heterozygous individuals provide evidence for significant ASAS events.

Allele-specific alternative splicing (ASAS) analysis is a complementary approach to sQTL analysis for discovering genetic variants associated with alternative splicing (Figure 3C). ASAS analysis aims to identify differential alternative splicing between mRNA transcripts expressed from two haplotypes of an individual. This approach involves using heterozygous SNPs present in mRNAs to assign RNA-seq reads to two alleles and then testing for differential splicing between the two alleles. Such an allele-specific strategy has been applied to different types of alternative RNA processing mechanisms, including alternative splicing.63, 64, 65 Compared with the sQTL approach, the ASAS approach is unique in that the two alleles are exposed to an identical cellular environment; thus, their splicing differences in the individual can be attributed to cis genetic effects.
However, for the ASAS approach to work, a heterozygous SNP must be expressed outside of the alternatively spliced region to enable allele-specific read assignment while being sufficiently close to the alternative splicing event to be detected on the same RNA-seq read with this event. As a result of this limitation, certain events might not be accessible with the ASAS approach using short-read RNA-seq data; however, recent work has explored the use of long-read RNA-seq for identifying ASAS events. In an interesting extension of the conventional ASAS approach applied to RNA-seq data of polyA+ mRNAs, Hsiao et al. integrated ASAS analysis with polyA+ and polyA− RNA-seq data for distinct subcellular compartments (cytosolic and nuclear). By examining the allelic ratio of RNA-seq reads from mature cytosolic polyA+ mRNAs or from nuclear polyA− RNAs representing spliced-out products, the authors were able to identify both exonic and intronic variants affecting alternative splicing.
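The basic ASAS comparison within one heterozygous individual can be sketched as a test on a 2x2 table of allele by isoform read counts. This is an illustrative simplification rather than the method of any study cited above: it uses Fisher's exact test on hypothetical read counts from a single individual, whereas the analyses described here look for allelic differences that are reproducible across multiple heterozygous individuals.

```python
from math import comb


def allele_psi(inclusion_reads: int, skipping_reads: int) -> float:
    """PSI computed from the reads assigned to a single allele."""
    return inclusion_reads / (inclusion_reads + skipping_reads)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are the two alleles; columns are inclusion and skipping reads.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x inclusion reads on allele 1
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical allele-specific read counts in one heterozygous individual
    ref_inc, ref_skp = 45, 5    # reads carrying the reference allele
    alt_inc, alt_skp = 15, 35   # reads carrying the alternate allele
    print("PSI (ref allele):", allele_psi(ref_inc, ref_skp))
    print("PSI (alt allele):", allele_psi(alt_inc, alt_skp))
    print("p value:", fisher_exact_two_sided(ref_inc, ref_skp, alt_inc, alt_skp))
```

Because both alleles are measured in the same cells, a significant difference in this table points to a cis effect, but only reads that cover both the heterozygous SNP and the splicing event can contribute counts.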

Widespread Variation and Phenotypic Association of Alternative Splicing in Human Populations

In the last few years, population-scale RNA-seq datasets have been generated for diverse tissues and cell types (Table 1). Many of the initial RNA-seq studies were performed with lymphoblastoid cell lines (LCLs).64, 68, 69, 71, 76 LCLs are individual-specific immortalized cell lines created through the infection of human B cells with Epstein-Barr virus. These cell lines have been extensively characterized by large-scale genotyping efforts, such as the HapMap and 1000 Genomes projects.78, 79 Therefore, they provide readily available materials for studying the association between genetic variants and gene regulation, including alternative splicing. In two pioneering studies, Pickrell et al. and Montgomery et al. performed RNA-seq of LCLs from African and European populations.68, 69 In addition to identifying QTLs affecting overall gene expression levels (expression QTLs or eQTLs), both studies discovered over 100 sQTLs. The largest LCL RNA-seq dataset was generated by the Geuvadis (Genetic European Variation in Health and Disease) Consortium, which performed RNA-seq on 462 LCL samples from five populations from the 1000 Genomes Project. A major limitation of LCLs, however, is that they represent a single, relatively homogeneous cell type, whereas transcriptome regulation is strongly tissue and cell-type specific. More recently, population-scale RNA-seq studies have been applied to different tissues.62, 70, 72, 73, 74, 75 The most comprehensive effort to date is the GTEx (Genotype-Tissue Expression) Consortium,80, 81 which has released raw RNA-seq data along with whole-genome genotype data for over 10,000 tissue samples from 53 tissue sites (GTEx release V7), and this dataset continues to expand. Furthermore, induced pluripotent stem cells (iPSCs) are being explored for RNA-seq-based QTL studies as an alternative to LCLs and tissues. 
Not only would human iPSCs be able to replace LCLs as a source of individual-specific, continuously expandable biological materials, but these cells can also be differentiated in vitro into many mature cell types, thus circumventing the bottleneck of availability and access in tissue-based RNA-seq studies.

Table 1

Population-Scale RNA-Seq Studies of Alternative Splicing Variation in Human Transcriptomes

Study	Tissue or Cell Type	Sample Size	Summary
Montgomery et al.⁶⁸	LCLs	60	one of the first two population-scale transcriptome genetics studies to use RNA-seq; identified 110 sQTL events in a European population at a 0.01 permutation threshold
Pickrell et al.⁶⁹	LCLs	69	one of the first two population-scale transcriptome genetics studies to use RNA-seq; identified 187 genes with significant sQTLs in an African population at a 10% FDR, and many of these altered splicing by affecting cis splicing regulatory elements
Lappalainen et al.⁶⁴	LCLs	462	the largest population-scale RNA-seq dataset on LCLs; was generated by the Geuvadis project and included data on four European populations and one African population; identified 639 genes with trQTLs, where the genotype is significantly associated with the ratio of individual transcript level to total gene expression; found that genetic variation of gene expression levels and transcript isoform structure is equally common but largely controlled by independent causal variants
Battle et al.⁶²	whole blood	922	whole blood from the Depression Genes and Networks cohort; identified 1,370 genes with significant sQTLs at a 5% FDR; a total of 159 sQTLs were in high LD with trait- and disease-associated GWAS SNPs; the large sample size also allowed the identification of candidate trans-sQTLs
Fadista et al.⁷⁰	pancreatic islets	89	identified 371 sQTLs, including sQTLs in known T2D-associated loci or in genes associated with beta cell function and glucose metabolism
Li et al.⁷¹	LCLs	17	RNA-seq study of a 17-individual, three-generation family; allowed the discovery of sQTLs controlled by rare variants; identified 261 sQTLs at a 50% FDR; found that sQTLs with large effects in the family were enriched with rare variants
GTEx Consortium⁷²	43 tissues	1,641	data from the pilot phase of the GTEx project: 1,641 samples from 43 tissues across 175 individuals; identified an average of ∼1,900 and ∼250 sQTL genes per tissue with Altrans⁵⁸ and sQTLseekeR,⁵⁹ respectively; most sQTL genes were not eQTL genes; significant sQTLs tended to be shared among tissues, whereas tissue-specific sQTLs represented only 7%–21% of sQTLs, depending on the tissue type
Chen et al.⁷³	monocytes, neutrophils, and T cells	197	CD14⁺ monocytes, CD16⁺ neutrophils, and naive CD4⁺ T cells from up to 197 individuals; quantified splicing by using both PSI event-based measurements and relative abundances of transcript isoforms; identified over 2,000 genes with sQTLs at a 5% FDR in each of the three cell types
Pala et al.⁷⁴	leukocytes	624	included a total of 624 individuals from Sardinia; first sQTL study to integrate whole-genome and RNA-seq data of multiple families to discover common and rare variants affecting splicing; identified 6,768 sQTLs
Takata et al.⁷⁵	brain (prefrontal cortex)	206	identified 1,595 sQTLs in 1,341 unique genes; significant sQTLs were enriched with disease-associated GWAS loci, particularly loci associated with schizophrenia

The following abbreviations are used: FDR, false discovery rate; T2D, type 2 diabetes; and trQTL, transcript ratio QTL.

Using these large-scale datasets, researchers have begun to define the landscape, genetic architecture, and phenotypic association of alternative splicing variation in human populations (Table 1). Despite the differences in tissue and cell type, sample size, and sequencing depth, as well as the computational methods used for discovering sQTLs, several consensuses have emerged. These studies demonstrate that inheritable genetic variation of alternative splicing is widespread across diverse human tissues and cell types. Although sQTL SNPs tend to be enriched at the essential 5′ and 3′ splice sites,57, 69, 72 many sQTLs can be attributed to SNPs located outside of the splice site regions. These SNPs can modify splicing enhancer or silencer elements as well as known RBP binding sites in exonic or intronic regions. The approach of coupling sQTL results to GWAS signals has identified a large number of sQTLs in high linkage disequilibrium (LD) with previously identified GWAS SNPs (Table 1), suggesting that SNPs affecting alternative splicing could be the causal variants underlying a substantial fraction of GWAS signals for complex traits and diseases. For example, an RNA-seq study of 206 human brain (prefrontal cortex) tissues reported significant enrichment of sQTLs among GWAS disease loci, particularly for GWAS SNPs associated with schizophrenia. Similarly, an RNA-seq study of 89 pancreatic islets identified sQTLs in known type-2-diabetes-associated loci.

One key question is whether sQTLs identified in these studies are the primary contributors to GWAS-associated traits and diseases or merely reflect the secondary effects of SNPs that affect phenotypes via other layers of gene regulation. To address this question, an elegant study by Li et al. integrated multiple datasets to analyze eight types of regulatory QTLs in a cohort of LCLs from an African population. The authors found that most sQTLs are independent of eQTLs, and sQTLs appear to have a comparable or even greater magnitude of effects on GWAS traits than eQTLs. These data suggest that splicing is a primary link between genetic variation and complex diseases, consistent with the prevalence of aberrant splicing as a primary cause of Mendelian diseases.15, 17

Two examples of sQTLs that correlate with GWAS signals are highlighted here. SP140 is a tissue-restricted gene with high expression in lymphoid cells, and its domain structure suggests a role in chromatin-mediated regulation of gene expression. Several GWASs identified SP140 SNPs that are significantly associated with chronic lymphocytic leukemia, multiple sclerosis, Crohn disease, and inflammatory bowel disease. However, the causal mechanism underlying these GWAS signals remained unknown. On the basis of sQTL analysis of RNA-seq data of LCLs from a European population, a significant sQTL signal was found for exon 7 of SP140, and the peak SNP was a C-to-T exonic SNP, rs28445040 (Figures 4A and 4B). Although this SNP does not alter the encoded protein sequence of SP140, minigene splicing reporter assays demonstrated its role in regulating the splicing level of SP140 exon 7, such that the T allele is associated with significantly reduced exon inclusion. Because the exon is 78 bp in length, skipping of this exon would remove an in-frame 26 amino acid peptide from the protein product without affecting the downstream reading frame. Strikingly, this SNP is in high LD with GWAS SNPs of all four diseases (Figure 4C), suggesting that this is the causal variant underlying the association between SP140 and these diseases. Furthermore, the association between this sQTL and multiple sclerosis was replicated in a recent case-control study.
In another example, several studies identified an sQTL in exon 10 of ERAP2,57, 93, 94 a gene encoding a protease that processes antigenic epitopes for MHC class I antigen presentation. An A-to-G intronic SNP (rs2248374) within the 5′ splice site of ERAP2 deactivates the canonical 5′ splice site and activates a downstream cryptic 5′ splice site. This change leads to the production of an aberrant transcript that contains a premature termination codon subject to nonsense-mediated mRNA decay. RNA-seq data of LCLs indicate a significant switch in splicing among different genotypes of rs2248374, along with a significant change in steady-state mRNA levels due to alternative-splicing-coupled mRNA decay (Figures 4D and 4E). The G allele is associated with lower levels of MHC class I molecules at the surface of B cells and is in LD with GWAS signals for several diseases, such as Crohn disease and inflammatory bowel disease (Figure 4F). These two examples are just the tip of the iceberg for many sQTLs identified across various studies, and they illustrate that sQTLs can influence complex traits and diseases by altering protein activity and function (SP140) or mRNA stability and steady-state mRNA levels (ERAP2). It is also worth noting that the causal variants for these two GWAS-associated sQTLs are silent exonic (SP140) or intronic (ERAP2) and would therefore be missed by many commonly used tools for variant annotation.
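The relationship between genotype and splicing level described in these two examples can be sketched in a few lines. The following is only an illustrative sketch, not the method used by any of the cited studies: the read counts are invented toy values, the function names (psi, dosage_psi_slope, residues_removed_by_skipping) are hypothetical, and a real sQTL analysis would add covariates, permutation testing, and FDR control. It computes PSI from inclusion and skipping junction reads, regresses PSI on allele dosage, and checks the 78 bp arithmetic for SP140 exon 7.

```python
from math import sqrt

def psi(inclusion_reads, skipping_reads):
    """Percent spliced in: fraction of junction reads that support exon inclusion."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        return None  # no coverage, so PSI is undefined
    return inclusion_reads / total

def dosage_psi_slope(dosages, psis):
    """Least-squares slope and Pearson r of PSI on allele dosage (0, 1, or 2 copies)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(psis) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in psis)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, psis))
    return sxy / sxx, sxy / sqrt(sxx * syy)

def residues_removed_by_skipping(exon_length_bp):
    """Amino acids lost if the exon is skipped in frame; None if skipping shifts the frame."""
    if exon_length_bp % 3 != 0:
        return None
    return exon_length_bp // 3

# Hypothetical toy data: (T-allele dosage, inclusion reads, skipping reads) per individual.
samples = [(0, 90, 10), (0, 85, 15), (1, 60, 40), (1, 55, 45), (2, 20, 80), (2, 30, 70)]
dosages = [dosage for dosage, _, _ in samples]
psis = [psi(inc, skip) for _, inc, skip in samples]
slope, r = dosage_psi_slope(dosages, psis)

print(f"PSI change per T allele: {slope:.3f} (Pearson r = {r:.3f})")
print("Residues removed by skipping a 78 bp exon:", residues_removed_by_skipping(78))
```

A negative slope in this toy setting corresponds to the pattern described for rs28445040, where each additional T allele is associated with lower exon inclusion.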

Figure 4

Two Examples of sQTLs Associated with GWAS Signals for Complex Diseases

(A–C) Alternative splicing of SP140 exon 7 is associated with chronic lymphocytic leukemia, Crohn disease, inflammatory bowel disease, and multiple sclerosis. The alternative splicing event is an exon-skipping event. The C allele is associated with a higher level of exon inclusion, whereas the T allele is associated with a higher level of exon skipping. (A) Boxplot showing the significant association between SNP rs28445040 and the splicing level (PSI value) of SP140 exon 7 within the Geuvadis CEU (Utah residents with ancestry from northern and western Europe) population. Each dot represents the PSI value from a particular individual, and the size of each dot is proportional to the RNA-seq read coverage for the alternative splicing event in that individual. (B) Sashimi plot indicating the average RNA-seq read density and splice junction counts for each genotype. Exons and introns are not drawn to scale, and the relative width of exons is increased for clarity. (C) LD plot showing multiple GWAS SNPs (green boxes) linked with the sQTL SNP (purple box).

(D–F) Alternative splicing of ERAP2 exon 10 is associated with Crohn disease, ulcerative colitis, inflammatory bowel disease, and birdshot chorioretinopathy. The alternative splicing event is an alternative 5′ splice site event. The A allele is associated with a higher level of the upstream canonical 5′ splice site, whereas the G allele is associated with a higher level of the downstream cryptic 5′ splice site. Usage of the downstream cryptic 5′ splice site introduces a premature stop codon and results in nonsense-mediated mRNA decay. (D) Boxplot showing the significant association between SNP rs2248374 and the splicing level (PSI value) of ERAP2 exon 10 (i.e., usage of the downstream cryptic 5′ splice site) within the Geuvadis CEU population. Each dot represents the PSI value from a particular individual, and the size of each dot is proportional to the RNA-seq read coverage for the alternative splicing event in that individual. (E) Sashimi plot indicating the average RNA-seq read density and splice junction counts for each genotype. Exons and introns are not drawn to scale, and the relative width of exons is increased for clarity. (F) LD plot showing multiple GWAS SNPs (green boxes) linked with the sQTL SNP (purple box).

RNA-seq data of 89 CEU individuals are from the Geuvadis project. Sashimi plots were drawn with rmats2sashimiplot (see Web Resources). LD plots were drawn with Haploview 4.2 and include CEU individuals from the 1000 Genomes Project (phase 3). For each boxplot, the top and bottom of the box represent the third and first quartiles, respectively. The band in the middle of the box represents the median. The whiskers of each boxplot extend to the most extreme data points within 1.5 times the interquartile range from each box.


Characterizing Causal Variants of Alternative Splicing via Massively Parallel Reporter Assays

Although RNA-seq can reveal associations between genetic variants and alternative splicing, identifying the causal variants underlying the detected associations remains a challenging task. In an sQTL analysis, multiple variants within a haplotype block can be significantly associated with alternative splicing, but we do not know which variant(s) causally affect(s) splicing regulation. A widely used molecular biology approach to the study of splicing regulation is the minigene splicing reporter assay. A minigene splicing reporter is constructed via the insertion of a piece of genomic DNA that contains the exon of interest and its flanking intronic sequences into a position where it is flanked either by exons from another gene (i.e., heterologous minigene reporter) or by the upstream and downstream constitutive exons from the same gene (Figure 5A). Site-directed mutagenesis within a minigene splicing reporter can be used for assessing the impact of specific genomic variants or splicing regulatory elements (Figure 5B). Coupled with high-throughput screens, minigene splicing reporters can be used for identifying splicing enhancer or silencer elements and discovering trans-acting factors or small-molecule compounds that regulate the splicing of specific exons.

Figure 5

Experimental and Computational Tools for Characterizing the Causal Impacts of Genomic Variants on Alternative Splicing

(A) Schematic diagram of a minigene splicing reporter. An exon of interest, along with its flanking intronic sequences, is inserted into a splicing reporter construct, where it is flanked by upstream and downstream exons containing a promoter and a polyA site. The splicing profile of the minigene splicing reporter can be determined by RT-PCR or RNA-seq.

(B) Use of minigene splicing reporters for characterizing the effects of disease-causing variants or exonic and intronic splicing regulatory elements on splicing.

(C) Minigene splicing reporters can be used in massively parallel reporter assays (MPRAs) for determining the consequences of many sequence variants on splicing in a high-throughput manner. A library of minigenes is transfected into a cell line, and splicing levels are measured for all variants simultaneously by RNA-seq.

(D) Deep learning framework for analyzing alternative splicing. Starting with input data, including the genome sequence and RNA-seq data, the framework extracts genomic and RNA features. These features include diverse types of quantitative or qualitative features, such as conservation score, sequence motifs, secondary structure, and epigenetic marks. A computational model is trained to predict splicing patterns and levels by using the extracted features. The predictions can be evaluated with experimental validation (e.g., by RNA-seq, RT-PCR, or minigene).

With recent advances in oligonucleotide synthesis technologies and high-throughput sequencing, massively parallel reporter assays (MPRAs) have become an increasingly popular approach to the study of gene regulation, including alternative splicing. MPRAs test the functional impacts of many sequence variants in parallel. 
These sequences are inserted into a reporter construct and transfected into cells or combined with cellular extracts for determining the functional impacts of sequence variants (Figure 5C). Two recent studies conducted MPRAs with minigene splicing reporters to determine the effects of cis sequence variants on splicing.99, 100 Rosenberg et al. tested over two million synthetic minigenes in a high-throughput fashion. Specifically, they created two separate libraries to study alternative 5′ or 3′ splice sites and analyzed the ability of random sequences to influence splice site selection. The authors split a single-gene sequence (Citrine, a derivative of YFP) into two exons as the backbone of the reporter and inserted introns with degenerate sequences between the two exons. For the alternative 5′ splice site library, each intron was designed to have two competing alternative 5′ splice sites, and two random 25 bp sequences were inserted into positions between the two competing 5′ splice sites or downstream of the distal 5′ splice site. The library for alternative 3′ splice site analysis was designed in the same manner. The resulting libraries were transfected into cells, and the splicing profiles of all sequences were measured in parallel by RNA-seq. Leveraging the abundant synthetic reporter data, the authors were able to use machine learning to model splicing patterns and predict the effects of human SNPs on splicing. Interestingly, the models learned from alternative 5′ and 3′ splice sites can also predict exon skipping in vivo. In another study, Soemedi et al. developed a massively parallel splicing assay (MaPSy) to interrogate the effects of 4,964 exonic disease-causing mutations on alternative splicing. The authors synthesized a 170 bp genomic sequence library for all mutant and wild-type exon pairs. 
Disease-mutation-containing exons that were less than or equal to 100 bp in length were selected and synthesized to include at least 55 bp of the upstream intron and at least 15 bp of the downstream intron. Two parallel assays were performed. The first assay tested the impact of the mutation on the exon's inclusion or skipping in vivo when the reporter was transfected into cells, and the second tested whether the mutation influenced the splicing of the upstream intron in vitro when the sequence was incubated with nuclear extracts. Even though they used distinct experimental systems, the two assays reached general agreement. Approximately 10% of the tested disease-causing mutations perturbed splicing in both assays. By contrast, only 3% of common SNPs perturbed splicing in both assays. This 10% is most likely a lower-bound estimate for the percentage of pathogenic exonic mutations that disrupt splicing, considering the cell-type-specific nature of splicing regulation and that only a single cell type (HEK293) was used for the in vivo assay. MPRAs provide a powerful tool for characterizing the causal genetic variants of alternative splicing. A major advantage of MPRAs is that these experiments generate a massive amount of data. As demonstrated by Rosenberg et al., these data-rich experiments can be coupled with computational modeling for learning important features of splicing regulation and predicting the impact of cis variants on splicing. Additionally, although both studies performed MPRA experiments in the HEK293 cell line, these reporters can be transfected into other cell lines for determining the splicing effects of cis variants in other cell types. Moreover, MPRAs can be coupled with sQTL analyses for identifying causal variants underlying sQTL signals, or they can be utilized in clinical exome or genome sequencing studies for identifying splicing-altering variants in disease-affected individuals. 
One inherent limitation of MPRAs is that the reporter system might not completely recapitulate the exact cellular environment that allows splicing to occur. For example, factors such as chromatin states, DNA methylation, and histone marks are known to influence alternative splicing. CRISPR-Cas9-based genome editing could address these issues and has been used in recent work for characterizing splicing regulatory elements in endogenous genes. MPRAs are also limited by the ability to generate libraries; thus, not all exons or variants are assessable by current systems. Future improvements in oligonucleotide synthesis technologies could address this limitation and allow a broader set of exons and deep intronic variants to be examined.
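As a rough illustration of how read counts from a splicing MPRA might be turned into per-variant effect estimates, the sketch below compares each mutant reporter with its matched wild-type reporter. It is a minimal sketch under stated assumptions and does not reproduce the statistical procedures of Rosenberg et al. or Soemedi et al.: the variant names, read counts, pseudocount, and the 0.2 threshold on the change in PSI are all hypothetical.

```python
from math import log2

def inclusion_fraction(included, skipped, pseudocount=0.5):
    """Smoothed fraction of spliced reads that include the test exon."""
    return (included + pseudocount) / (included + skipped + 2 * pseudocount)

def splicing_effect(wt_counts, mut_counts):
    """Return (delta PSI, log2 odds ratio) of mutant relative to wild type.

    Each counts argument is an (included_reads, skipped_reads) pair.
    """
    p_wt = inclusion_fraction(*wt_counts)
    p_mut = inclusion_fraction(*mut_counts)
    log_odds_ratio = log2((p_mut / (1 - p_mut)) / (p_wt / (1 - p_wt)))
    return p_mut - p_wt, log_odds_ratio

# Hypothetical library: variant id -> (wild-type counts, mutant counts).
library = {
    "var_A": ((900, 100), (300, 700)),  # strong loss of exon inclusion
    "var_B": ((800, 200), (790, 210)),  # essentially neutral
}

MIN_ABS_DELTA_PSI = 0.2  # arbitrary illustrative cutoff, not taken from the studies

effects = {name: splicing_effect(wt, mut) for name, (wt, mut) in library.items()}
flagged = [name for name, (delta, _) in effects.items() if abs(delta) >= MIN_ABS_DELTA_PSI]

for name, (delta, lor) in effects.items():
    print(f"{name}: delta PSI = {delta:+.3f}, log2 odds ratio = {lor:+.2f}")
print("Candidate splicing-altering variants:", flagged)
```

In practice, replicate experiments and a formal test on the counts would be needed before calling a variant splicing-altering; the fixed cutoff here only stands in for that step.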

Alternative Splicing Meets Machine Learning

There has been a long-standing interest in developing in silico methods of predicting alternative splicing. The basic scientific premise is that there exists a “splicing code,” a set of genomic and RNA features and associated rules that determine the splicing pattern of any primary transcript in a given cell type. Machine learning serves the general purpose of learning underlying patterns from data to allow pattern recognition, classification, and prediction. In computational biology, machine learning has been extensively employed in genomics, transcriptomics, proteomics, and other domains. For example, algorithms have been developed to predict regulatory elements such as promoters, enhancers, and splice sites. Shortly after the EST-based discovery of widespread alternative splicing, several studies applied machine learning methods to predict a binary classification of alternative versus constitutive exons.104, 105, 106, 107 Alternative exons have distinct sequence features such as exon and intron length, splice site strength, divisibility by three, sequence conservation within exonic and flanking intronic regions, and composition of oligonucleotides reflecting splicing regulatory elements. Machine learning methods can leverage these features to predict whether an exon undergoes alternative splicing.104, 105, 106, 107 In a landmark study, Barash et al. used quantitative splicing microarray data across 27 mouse tissues to predict tissue-specific patterns of alternative splicing. They grouped the 27 tissues into four broad categories and converted the PSI value of each exon for each tissue category into three probabilities representing an increase, a decrease, or no change in exon inclusion in that tissue category. Then, the authors collected 1,014 features representing RNA sequence motifs and transcript features. They applied a single-layer logistic Bayesian network that models how individual features cooperate or compete to influence splicing in each tissue type. 
Importantly, the resulting splicing code can reveal novel regulatory features and predict mutation-induced changes in splicing patterns. This work represents a breakthrough in the field because it was the first demonstration that in silico models can successfully predict tissue-regulated alternative splicing. After this work, Xiong et al. added hidden layers to the Bayesian network to construct a Bayesian neural network (BNN). These hidden layers helped the authors model non-linear relationships between features, leading to an improved prediction accuracy. Based on the BNN framework, the web tool AVISPA was constructed for splicing prediction and analysis and was trained with more data and an expanded feature set. Recently, deep learning, a state-of-the-art machine learning technology, has been applied to predicting alternative splicing111, 112, 113 (Figure 5D). Deep learning refers to methods that map raw input feature data to increasingly abstract feature representations, where higher layers contain more abstract representations. Compared with canonical machine learning methods, deep learning is capable of automatically learning complex functions without a need for handcrafted features or rules, and it scales well to large and high-dimensional datasets.114, 115 Deep learning has been successfully applied in a variety of fields, including image classification and speech recognition and more recently in computational biology. In two studies, Frey and colleagues used RNA-seq data from mouse and human tissues to construct deep learning models that predict the splicing levels of individual exons across different tissues and the effects of cis genetic variants on splicing. 
Unlike their previous work that treated tissue-specific splicing patterns as categorical data, these new methods attempted to predict the numerical PSI values for each exon in each tissue.111, 112 Evaluations using independent RNA-seq datasets showed good agreement (R² = 0.65) between predicted and empirical PSI values. The authors then applied the deep learning model to predict the effects of cis genetic variants on RNA splicing. Their predictions on clinical variants of selected exons matched well with data from minigene splicing reporters. Furthermore, they applied their model to genome sequencing data of people with autism spectrum disorder (ASD) and control individuals and predicted misregulated splicing in 19 candidate genes with ASD-related neuronal functions. This study demonstrates that deep-learning-based modeling of splicing provides a powerful tool for annotating clinical variants and elucidating the genetic determinants of complex diseases. In another interesting application, Huang et al. developed a method called BRIE, which learns prior information from RNA sequence features to augment splicing quantification by using single-cell RNA-seq data. With the rapid accumulation of RNA-seq data and RBP-RNA interaction maps in the public domain,25, 117 future work should take advantage of more comprehensive training data and feature space coupled with more advanced machine learning frameworks to improve in silico prediction of alternative splicing. As a step in this direction, Jha et al. recently developed a new deep learning framework to integrate additional RNA genomics data, such as CLIP-seq data of RBP-RNA interactions, and RNA-seq data after the knockdown or overexpression of RBPs. The integrative model generalizes well for RBP perturbation data and improves the accuracy of alternative splicing prediction. 
Another interesting direction for future work is to incorporate chromatin states, epigenetic marks, and 3D genome organization in a predictive model, given that splicing is a co-transcriptional process and these features influence splicing via a variety of molecular mechanisms. In addition to using machine learning techniques to directly predict splicing patterns and PSI values, other studies have adopted an alternative strategy of predicting splicing-altering genomic variants by using prior variant annotations as training data.118, 119, 120, 121 The basic idea is to collect variants known to affect splicing and/or cause human diseases along with common “splicing-neutral” variants that are likely to have no effect on splicing and then build classifiers to distinguish these two categories of variants. The potential shortcomings of these approaches are that the classification of positive versus negative training data might not be accurate and that the results might suffer from selection bias or overfitting. Nonetheless, these tools offer a complementary strategy for evaluating the potential effects of genomic variants on splicing. An interesting method called ExonImpact was recently developed to prioritize disease-associated splicing-altering variants on the basis of the predicted effects of alternative splicing at the protein level. The rationale behind this work is that not all aberrant splicing events are equally detrimental at the protein level, and pathogenic splicing mutations have distinct protein features that can be incorporated into the predictive model.
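The early feature-based approach of classifying alternative versus constitutive exons can be illustrated with a very small model. The sketch below fits a plain logistic regression by gradient descent; it is not the support vector machine, Bayesian network, or deep learning models discussed above. The three features, their values, and the labels are invented solely for illustration, and the direction of each feature in this toy data should not be read as a biological claim.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_alternative(weights, bias, x):
    """Predicted probability that an exon is alternatively spliced."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical toy exons: [length divisible by 3 (0/1),
#                          flanking intron conservation (0-1),
#                          scaled splice site strength (0-1)]
exons = [
    [1, 0.90, 0.40], [1, 0.80, 0.50], [0, 0.85, 0.30], [1, 0.70, 0.45],  # labeled alternative
    [0, 0.20, 0.90], [0, 0.30, 0.80], [1, 0.10, 0.85], [0, 0.25, 0.95],  # labeled constitutive
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(exons, labels)
predictions = [predict_alternative(weights, bias, x) >= 0.5 for x in exons]
accuracy = sum(int(p) == y for p, y in zip(predictions, labels)) / len(labels)
print("Training accuracy on toy data:", accuracy)
```

Training accuracy on eight invented exons says nothing about real performance; a genuine evaluation would hold out exons and, as the text notes for variant classifiers, guard against selection bias and overfitting.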

Alternative Splicing for Disease Diagnosis

Given the importance of splicing in disease pathogenesis and progression, several therapeutic strategies have been pursued for correcting splicing defects in disease. A notable success is the recent FDA approval of nusinersen, an antisense oligonucleotide drug for correcting splicing in spinal muscular atrophy. New data are emerging that alternative splicing might provide diagnostic biomarkers for disease status or outcome. An example of the predictive power of alternative splicing for disease prognosis was demonstrated in two recent studies showing that alternative splicing profiles can predict cancer patients’ survival time at a comparable and often better accuracy than gene expression levels.54, 123 One possible explanation for these observations is the intrinsic feature of alternative splicing data. Given that alternative splicing is quantified as the relative ratio of multiple isoforms from a single gene, alternative splicing data are self-normalized on a per-gene basis and can be viewed as having an “internal control” that could provide a more robust molecular signature than gene expression levels, especially for large clinical RNA-seq datasets that are prone to technical biases and confounding issues. Consistent with these observations, a new study reported that alternative-splicing-based classifiers generally outperform gene-expression-based classifiers for a wide range of biological classification problems. In a major advance with broad implications, Cummings et al. demonstrated the potential of RNA-seq and alternative splicing analysis for diagnosing rare diseases. The authors analyzed the muscle transcriptomes of 63 individuals with muscle disorders and compared their RNA-seq data with GTEx RNA-seq data of 184 control muscle samples. Of the 63 individuals with muscle disorders, 50 were genetically undiagnosed. 
Strikingly, through RNA-seq analysis, the authors obtained a genetic diagnosis for 35% of the previously undiagnosed individuals by identifying novel disease-associated aberrant splicing events in known disease-associated genes. In four individuals, a recurrent aberrant splicing event was discovered in COL6A1, in which a GC-to-GT genetic variant created a novel 5′ splice site, leading to the exonization of a 72 bp intronic segment that disrupted the COL6A1 protein product. This variant would not be easily identifiable by exome or genome sequencing alone, given that exome sequencing would miss this deep intronic variant, and genome sequencing would identify too many variants, making it difficult to determine their pathogenicity in the absence of RNA-seq information. Thus, this study offers an important proof of concept that alternative splicing analysis via the integration of RNA-seq with exome or genome sequencing improves disease diagnosis.
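The core idea of comparing an affected individual's transcriptome with a panel of controls can be sketched as a simple filter for splice junctions that are well supported in the individual but absent from controls. This is a simplified illustration, not the pipeline of Cummings et al.: the junction coordinates, read counts, and thresholds below are hypothetical, and a real analysis would also normalize read support and restrict attention to relevant disease-associated genes.

```python
def candidate_aberrant_junctions(patient, controls, min_reads=5, max_control_samples=0):
    """Return junctions supported in the patient but seen in few or no controls.

    patient:  dict mapping (chrom, donor, acceptor) -> read count
    controls: list of such dicts, one per control sample
    """
    hits = []
    for junction, reads in patient.items():
        if reads < min_reads:
            continue  # too little support to trust
        n_controls = sum(1 for sample in controls if sample.get(junction, 0) > 0)
        if n_controls <= max_control_samples:
            hits.append((junction, reads, n_controls))
    # Strongest read support first.
    return sorted(hits, key=lambda hit: -hit[1])

# Hypothetical junctions on a placeholder chromosome "chrN".
patient = {
    ("chrN", 1000, 2000): 40,  # annotated junction, also present in controls
    ("chrN", 1000, 1500): 12,  # novel junction into an intronic segment
    ("chrN", 3000, 4000): 3,   # below the read-support threshold
}
controls = [
    {("chrN", 1000, 2000): 55},
    {("chrN", 1000, 2000): 61, ("chrN", 3000, 4000): 1},
]

hits = candidate_aberrant_junctions(patient, controls)
for junction, reads, n_controls in hits:
    print(junction, "reads:", reads, "controls with junction:", n_controls)
```

A junction flagged this way is only a candidate; linking it to a genomic variant, as in the COL6A1 example, requires integrating the RNA-seq result with exome or genome sequencing data.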

Conclusions

The past decade since the advent of RNA-seq has seen tremendous growth in the amount of human transcriptome data. Advances in RNA-seq technologies and computational methods have transformed the study of alternative splicing in health and disease. Population-scale RNA-seq studies have discovered many naturally occurring genomic variants that modulate alternative splicing. Many of these variants are associated with GWAS signals, suggesting a ubiquitous contribution of alternative splicing to phenotypic variability and disease susceptibility in human populations. These genetically regulated, GWAS-associated mRNA isoforms are prime candidates for functional studies of alternative splicing. Future work using isoform-specific gain-of-function or loss-of-function assays should elucidate how genetic variation of alternative splicing affects gene functions and consequently cellular and organismal phenotypes. The prevalent role of alternative splicing in Mendelian and complex diseases suggests that evaluating the impact of genomic variants on splicing needs to be an integral part of clinical variant prioritization. Many computational tools and online resources exist for prioritizing and annotating variants discovered by exome or genome sequencing. Most tools are designed to predict the pathogenic effects of missense variants on protein products. However, there is overwhelming evidence that missense, nonsense, and silent variants within exons, as well as intronic variants, can disrupt splicing and cause disease. Currently, it is challenging to predict the pathogenic effects of splicing variants within exonic and intronic regions, except for variants affecting the conserved splice site signals, and they are thus ignored by many commonly used pipelines for variant assessment. 
Recent advances in experimental (e.g., MPRAs) and computational (e.g., deep learning) tools will allow researchers and clinicians to screen a large number of variants for their effects on splicing in a systematic and unbiased manner. Beyond SNPs, other variants such as indels or short tandem repeats can modify cis splicing regulatory elements and affect alternative splicing.125, 126 The genetic associations between these non-SNP variants and alternative splicing can also be discovered and characterized by the computational and experimental approaches described in this review. A comprehensive catalog of alternative splicing variation in human populations, along with the ability to discover and characterize splicing-altering variants in specific individuals, holds great value for improving disease diagnoses and ultimately patient care in the era of sequencing and precision medicine.

