Literature DB >> 23172860

Intron-centric estimation of alternative splicing from RNA-seq data.

Dmitri D Pervouchine¹, David G Knowles, Roderic Guigó.

Abstract

MOTIVATION: Novel technologies brought in unprecedented amounts of high-throughput sequencing data along with great challenges in their analysis and interpretation. The percent-spliced-in (PSI, ) metric estimates the incidence of single-exon-skipping events and can be computed directly by counting reads that align to known or predicted splice junctions. However, the majority of human splicing events are more complex than single-exon skipping.
RESULTS: In this short report, we present a framework that generalizes the metric to arbitrary classes of splicing events. We change the view from exon centric to intron centric and split the value of into two indices, and , measuring the rate of splicing at the 5' and 3' end of the intron, respectively. The advantage of having two separate indices is that they deconvolute two distinct elementary acts of the splicing reaction. The completeness of splicing index is decomposed in a similar way. This framework is implemented as bam2ssj, a BAM-file-processing pipeline for strand-specific counting of reads that align to splice junctions or overlap with splice sites. It can be used as a consistent protocol for quantifying splice junctions from RNA-seq data because no such standard procedure currently exists. AVAILABILITY: The C code of bam2ssj is open source and is available at https://github.com/pervouchine/bam2ssj CONTACT: dp@crg.eu

Entities: Chemical Species

Mesh：

Substances：
RNA Splice Sites

Year: 2012 PMID： 23172860 PMCID： PMC3546801 DOI： 10.1093/bioinformatics/bts678

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

One major challenge in the analysis of high-throughput RNA sequencing data is to disentangle relative abundances of alternatively spliced transcripts. Many existing quantification methods do so by using considerations of likelihood, parsimony and optimality to obtain a consolidated view of cDNA fragments that map to a given transcriptional unit (Katz ; Montgomery ; Trapnell ). The advantage of such integrative approaches is that they provide robust estimators for transcript abundance by reducing sampling errors, as they effectively consider samples of larger size. In contrast, because all the reads from the same transcriptional unit are combined into one master model, there is no guarantee that the inclusion or exclusion of a specific exon is estimated independently of co-occurring splicing events (Katz ; Pan ). The quantification of alternatively spliced isoforms based on the metric captures more accurately the local information related to splicing of each particular exon (Katz ). We follow Kakaradov in considering only the reads that align to splice junctions (Fig. 1) and ignoring the reads that align to exon bodies (position-specific read counts are not considered). is defined as where the factor of two in the denominator accounts for the fact that there are twice as many mappable positions for reads supporting exon inclusion as exon exclusion. Equation (1) defines an unbiased estimator for the fraction of mRNAs that represent the inclusion isoform under the assumption that splice-junction reads are distributed evenly. can also be derived from the expression values of whole isoforms, for instance, as the abundance of the inclusion isoform as the fraction of the total abundance. However, the non-uniform read coverage not only between but also within transcripts makes such estimates generally detrimental (Kakaradov ).

Fig. 1.

The percent-spliced-in (PSI, ) metric is defined as the number of reads supporting exon inclusion () as the fraction of the combined number of reads supporting inclusion and exclusion (). The exon of interest is shown in gray. Only reads that span to the adjacent exons (solid arcs) account for Equation (1) The metric can be generalized beyond the class of single-exon–skipping events by counting inclusion and exclusion reads regardless of exon adjacency (Fig. 1, dashed arcs). Although this definition helps to reduce the undercoverage bias by taking into account splice junctions that are not present in the reference annotation, it often assigns misleading values to metric, for instance, in the case of multiple-exon skipping, where the amount of support for exon exclusion does not reflect the true splicing rate of each individual intron.

2 APPROACH

In this work, we change the view from exon centric to intron centric. Each intron is defined uniquely by the combination of its 5′-splice site (, donor) and 3′-splice site (, acceptor). Denote by the number of reads aligning to the splice junction spanning from to (Fig. 2) and define where and run over all donor and acceptor sites, respectively, within the given genomic annotation set. Because could be and could be , both and are real numbers from to . The value of can be regarded as an estimator for the conditional probability of splicing from to , i.e. the fraction of transcripts in which the intron to is spliced, relative to the number of transcripts in which is used as a splice site. Similarly, is the relative frequency of -to- splicing with respect to the splicing events in which is used.

Fig. 2.

Left: the 5′-splicing index, , is the number of reads supporting the splicing event from to relative to the combined number of reads supporting splicing from to any acceptor site . Right: the 3′-splicing index, , is the number of reads supporting the splicing event from to relative to the combined number of reads supporting splicing from any donor site to . The intron of interest is drawn thick In the particular case of single-exon skipping (Fig. 1), the values of , and are related as follows. Denote the upstream and downstream introns of the highlighted exon by and , respectively. Let and . Then, , and , where and . Assuming uniform read coverage across the gene (), we get and, therefore, That is, in the particular case of single-exon skipping, the value of is equal to the average of and given that the read coverage is reasonably uniform. If a and differ significantly, the contribution of and to is given by the weight factors and . Similarly, the completeness of splicing index (Tilgner ) is split into two indices, and , where and denotes the number of genomic reads (reads mapped uniquely to the genomic sequence) overlapping the splice site . Note that depends only on and depends only on . The values of and are unbiased estimators for the absolute frequency of splice site usage, i.e. the proportion of transcripts in which (or ) is used as a splice site, among all transcripts containing the splice site (or ).

3 METHODS

To compute , , and for a given donor–acceptor pair, one needs to know five integers, , , , and , of which only the first one depends on both and , while the rest have a single argument. We developed bam2ssj, a pipeline for counting these five integers directly from BAM input. bam2ssj is implemented in C and depends on SAMtools (Li ). The input consists of (i) a sorted BAM file containing reads that align uniquely to the genome or to splice junctions and (ii) a sorted GTF file containing the coordinates of exon boundaries. Each time the CIGAR string (Li ) contains MNM, , the counter corresponding to the splice junction defined by N is incremented. One mapped read may span several splice junctions and increment several counters. If the CIGAR string does not contain the MNM pattern, the read is classified as genomic and increments for every splice site it overlaps. Position-specific counts (Kakaradov ) are implemented as a stand-alone utility that is not included in the current distribution. Importantly, bam2ssj counts reads that align to splice junctions in a strand-specific way, i.e. , , , and are reported for the correct (annotated) and incorrect (opposite to annotated) strand. We leave further processing of these counts by Equations (2)–(4) to the user.

4 RESULTS AND DISCUSSION

We validated bam2ssj by counting reads aligning to splice junctions in the whole-cell polyadenylated fraction of Cold Spring Harbor Long RNA-seq data (http://genome.ucsc.edu/ENCODE/). In total, 8 558 231 343 mapped reads were analyzed in 404 min (350 000 reads/sec). 1 184 553 724 reads align to splice junctions, of which 1% align to the opposite strand. 1 699 718 327 reads overlap annotated splice junctions, of which 5% map to the opposite strand. The values of coincide with those reported by ENCODE in 98.9% of cases (1 163 251 008 reads); all discrepancies were due to the ambiguity of CIGAR translation in the mapper’s output. Because RNA-seq data are increasingly processed into the compact BAM form, we propose that bam2ssj be used as a standard operating procedure for counting splice junction reads. Funding: Grants BIO2011-26205 and CSD2007-00050 Consolider, Ministerio de Educación y Ciencia (Spain). Conflict of Interest: none declared.

7 in total

1. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing.

Authors: Qun Pan; Ofer Shai; Leo J Lee; Brendan J Frey; Benjamin J Blencowe
Journal: Nat Genet Date: 2008-11-02 Impact factor: 38.330

2. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Authors: Cole Trapnell; Adam Roberts; Loyal Goff; Geo Pertea; Daehwan Kim; David R Kelley; Harold Pimentel; Steven L Salzberg; John L Rinn; Lior Pachter
Journal: Nat Protoc Date: 2012-03-01 Impact factor: 13.491

3. Analysis and design of RNA sequencing experiments for identifying isoform regulation.

Authors: Yarden Katz; Eric T Wang; Edoardo M Airoldi; Christopher B Burge
Journal: Nat Methods Date: 2010-11-07 Impact factor: 28.547

4. Transcriptome genetics using second generation sequencing in a Caucasian population.

Authors: Stephen B Montgomery; Micha Sammeth; Maria Gutierrez-Arcelus; Radoslaw P Lach; Catherine Ingle; James Nisbett; Roderic Guigo; Emmanouil T Dermitzakis
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

5. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

6. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs.

Authors: Hagen Tilgner; David G Knowles; Rory Johnson; Carrie A Davis; Sudipto Chakrabortty; Sarah Djebali; João Curado; Michael Snyder; Thomas R Gingeras; Roderic Guigó
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

7. Challenges in estimating percent inclusion of alternatively spliced junctions from RNA-seq data.

Authors: Boyko Kakaradov; Hui Yuan Xiong; Leo J Lee; Nebojsa Jojic; Brendan J Frey
Journal: BMC Bioinformatics Date: 2012-04-19 Impact factor: 3.169

7 in total

36 in total

1. Integrative classification of human coding and noncoding genes through RNA metabolism profiles.

Authors: Neelanjan Mukherjee; Lorenzo Calviello; Antje Hirsekorn; Stefano de Pretis; Mattia Pelizzola; Uwe Ohler
Journal: Nat Struct Mol Biol Date: 2016-11-21 Impact factor: 15.369

2. Integrative transcriptomic analysis suggests new autoregulatory splicing events coupled with nonsense-mediated mRNA decay.

Authors: Dmitri Pervouchine; Yaroslav Popov; Andy Berry; Beatrice Borsari; Adam Frankish; Roderic Guigó
Journal: Nucleic Acids Res Date: 2019-06-04 Impact factor: 16.971

3. An alternative splicing signature model for predicting hepatocellular carcinoma-specific survival.

Authors: Sheng Dong; Li-Jun Lu
Journal: J Gastrointest Oncol Date: 2020-10

4. Detection of aberrant gene expression events in RNA sequencing data.

Authors: Vicente A Yépez; Christian Mertes; Michaela F Müller; Daniela Klaproth-Andrade; Leonhard Wachutka; Laure Frésard; Mirjana Gusic; Ines F Scheller; Patricia F Goldberg; Holger Prokisch; Julien Gagneur
Journal: Nat Protoc Date: 2021-01-18 Impact factor: 13.491

5. Quantitative analysis of the splice variants expressed by the major hepatitis B virus genotypes.

Authors: Chun Shen Lim; Vitina Sozzi; Margaret Littlejohn; Lilly K W Yuen; Nadia Warner; Brigid Betz-Stablein; Fabio Luciani; Peter A Revill; Chris M Brown
Journal: Microb Genom Date: 2021-01

6. Comparison of RNA-seq and microarray-based models for clinical endpoint prediction.

Authors: Wenqian Zhang; Ying Yu; Falk Hertwig; Jean Thierry-Mieg; Wenwei Zhang; Danielle Thierry-Mieg; Jian Wang; Cesare Furlanello; Viswanath Devanarayan; Jie Cheng; Youping Deng; Barbara Hero; Huixiao Hong; Meiwen Jia; Li Li; Simon M Lin; Yuri Nikolsky; André Oberthuer; Tao Qing; Zhenqiang Su; Ruth Volland; Charles Wang; May D Wang; Junmei Ai; Davide Albanese; Shahab Asgharzadeh; Smadar Avigad; Wenjun Bao; Marina Bessarabova; Murray H Brilliant; Benedikt Brors; Marco Chierici; Tzu-Ming Chu; Jibin Zhang; Richard G Grundy; Min Max He; Scott Hebbring; Howard L Kaufman; Samir Lababidi; Lee J Lancashire; Yan Li; Xin X Lu; Heng Luo; Xiwen Ma; Baitang Ning; Rosa Noguera; Martin Peifer; John H Phan; Frederik Roels; Carolina Rosswog; Susan Shao; Jie Shen; Jessica Theissen; Gian Paolo Tonini; Jo Vandesompele; Po-Yen Wu; Wenzhong Xiao; Joshua Xu; Weihong Xu; Jiekun Xuan; Yong Yang; Zhan Ye; Zirui Dong; Ke K Zhang; Ye Yin; Chen Zhao; Yuanting Zheng; Russell D Wolfinger; Tieliu Shi; Linda H Malkas; Frank Berthold; Jun Wang; Weida Tong; Leming Shi; Zhiyu Peng; Matthias Fischer
Journal: Genome Biol Date: 2015-06-25 Impact factor: 13.583

7. A Quantitative Profiling Tool for Diverse Genomic Data Types Reveals Potential Associations between Chromatin and Pre-mRNA Processing.

Authors: Isaac Kremsky; Nicolás Bellora; Eduardo Eyras
Journal: PLoS One Date: 2015-07-24 Impact factor: 3.240

8. A semi-supervised approach uncovers thousands of intragenic enhancers differentially activated in human cells.

Authors: Juan González-Vallinas; Amadís Pagès; Babita Singh; Eduardo Eyras
Journal: BMC Genomics Date: 2015-07-14 Impact factor: 3.969

Review 9. How Machine Learning and Statistical Models Advance Molecular Diagnostics of Rare Disorders Via Analysis of RNA Sequencing Data.

Authors: Lea D Schlieben; Holger Prokisch; Vicente A Yépez
Journal: Front Mol Biosci Date: 2021-06-01

10. Full-length single-cell RNA-seq applied to a viral human cancer: applications to HPV expression and splicing analysis in HeLa S3 cells.

Authors: Liang Wu; Xiaolong Zhang; Zhikun Zhao; Ling Wang; Bo Li; Guibo Li; Michael Dean; Qichao Yu; Yanhui Wang; Xinxin Lin; Weijian Rao; Zhanlong Mei; Yang Li; Runze Jiang; Huan Yang; Fuqiang Li; Guoyun Xie; Liqin Xu; Kui Wu; Jie Zhang; Jianghao Chen; Ting Wang; Karsten Kristiansen; Xiuqing Zhang; Yingrui Li; Huanming Yang; Jian Wang; Yong Hou; Xun Xu
Journal: Gigascience Date: 2015-11-05 Impact factor: 6.524