Literature DB >> 25707295

IRcall and IRclassifier: two methods for flexible detection of intron retention events from RNA-Seq data.

Abstract

BACKGROUND: The emergence of next-generation RNA sequencing (RNA-Seq) provides tremendous opportunities for researchers to analyze alternative splicing on a genome-wide scale. However, accurate detection of intron retention (IR) events from RNA-Seq data has remained an unresolved challenge in next-generation sequencing (NGS) studies.
RESULTS: We propose two new methods: IRcall and IRclassifier to detect IR events from RNA-Seq data. Our methods combine together gene expression information, read coverage within an intron, and read counts (within introns, within flanking exons, supporting splice junctions, and overlapping with 5' splice site/ 3' splice site), employing ranking strategy and classifiers to detect IR events. We applied our approaches to one published RNA-Seq data on contrasting skip mutant and wild-type in Arabidopsis thaliana. Compared with three state-of-the-art methods, IRcall and IRclassifier could effectively filter out false positives, and predict more accurate IR events. AVAILABILITY: The data and codes of IRcall and IRclassifier are available at http://mlg.hit.edu.cn/ybai/IR/IRcallAndIRclass.html.

Entities: Chemical Gene Species

Mesh：

Substances：

Year: 2015 PMID： 25707295 PMCID： PMC4476074 DOI： 10.1186/1471-2164-16-S2-S9

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 3.969

Background

Alternative splicing of precursor messenger RNA (pre-mRNA) produces different mRNA isoforms from a single genic locus during gene expression, resulting in functional complexity and diversity of proteins for higher eukaryotic organisms [1-3]. Alternative splicing has four main patterns that include cassette exon skipping (ES), alternative 5' and 3' splice site (ASS), mutually exclusive exon splicing (MXE) and intron retention (IR) [4]. Among those four alternative splicing patterns, genome-wide intron retention (IR) detection is a popular research topic in biology. Traditional methods which analyze microarray data, have provided a rich source of information for IR event detection [5]. However, the hybridization-based technology employed by microarray is largely restricted to existing genome sequence knowledge, with a limited range of quantification [6]. With the development of next-generation sequencing technology, RNA-Seq, which extends the analysis of previous unidentified genes and splicing variants, is rapidly outperforming microarrays for genome-wide studies [6,7]. With various statistical and computational strategies, many recent studies have analyzed RNA-Seq data for IR event detection, including ExpressionPlot [8], MATS [9], Wang's Framework [10], IRFinder [11] and etc. ExpressionPlot [8] cuts off low-density introns and adopts Fisher's exact test or Chi-squared test to quantify IR events, using only read counts (within introns/ flanking exons). Another framework MATS [9] calculates the Bayesian posterior probabilities of IR to assess their difference between treatment and control conditions, using only read counts (supporting splice junctions/ within introns). Besides, an analysis pipeline based on de novo mapping using BLAT [12] is introduced by [10], which detects the complete retention of an intron in a transcript. However, BLAT employs some low quality reads for read alignment, which would bring in noises [13]. Recently, IRFinder [11] uses read counts within introns to estimate the difference of read coverage between treatment and control conditions by adopting Audic and Claverie Test [14]. However, as for read coverage calculation, besides uniquely mapped reads, IRFinder employs the reads mapped to multiple positions in the genome, which would bring in noises. Existing methods for IR event detection usually omit some significant features (i.e. gene expression information). They depend on only read counts (within an intron, within flanking exons, and supporting splice junction) to estimate IR events between treatment and control conditions [8-11]. They omit the gene expression information and read coverage within an intron or use some un-uniquely mapped reads and low quality reads, which would involve false positive data, reducing prediction precision. As illustrated in Figure 1, there exist three common false positive cases: (1) More reads are overlapping with 5' or 3' splice site, but less reads are locating within an intron; (2) Many reads are clustering within a specific intronic inside region; (3) The gene expression between treatment and control conditions is different, where the read counts (within introns, within flanking exons, supporting splice junctions) should have been similar if there is no difference on gene expression. As for these three false positive cases, MATS and ExpressionPlot falsely predict case (1) and (2) respectively, as they do not consider read distribution and coverage within introns; while IRFinder falsely predicts case (3) as it does not consider gene expression difference. Therefore, we are motivated to take in more features to design more precise prediction methods for IR event detection.

Figure 1

False positive IR events (Short black bars stand for reads with numbers indicating read counts). (a) More reads are overlapping with 5' or 3' splice site, but less reads are locating within an intron; (b) Many reads are clustering within a specific intronic inside region; (c) The gene expression between treatment and control conditions is different. In this paper, we propose two efficient and effective methods: Intron Retention call (IRcall) and Intron Retention classifier (IRclassifer) for of IR event detection from RNA-Seq data. Compared with existing IR event detection methods, IRcall and IRclassifier have four major contributions: First, besides traditional features for IR event detection, IRcall and IRclassifier employ new features, including read counts overlapping with 5' splice site/ 3' splice site, and gene expression between treatment and control conditions, which help to filter out false positives caused by different gene expression; Second, IRcall employs ranking strategy, with a new formula to calculate IR scores, which avoids bias and declines false positives; Third, IRclassifier employs machine learning techniques, based on J48 Decision Tree and Random Forests, for IR event detection. IRclassifier selects training examples by incorporating the predictions from three state-of-the-art approaches, and constructs 17 features between treatment and control conditions to represent introns in higher-dimensional spaces; Last but not least, experimenting the published RNA-Seq data of skip mutant and wild-type in Arabidopsis thaliana [10], IRcall and IRclassifier could identify both known and novel IR events, with high prediction precisions. Moreover, IRclassifier based on J48 Decision Tree and Random Forests could predict IR events with a precision of 98.5% and 99.0%, respectively.

Methods

To overcome limitations of existing approaches, our idea is to utilize more features, employing ranking strategy and machine learning methods for IR event detection.

Input data

The inputs of IRcall and IRclassifier include the alignment results of RNA-Seq data and the reference annotation. The raw RNA-Seq reads between treatment and control conditions are aligned to the reference genome sequences by TopHat2 [15], and the resulting alignments are stored and sorted in BAM files, defined by [16]. The reference annotation provides the locations of introns, exons and genes. As for the raw RNA-Seq read alignment, the parameters of TopHat2 are set as follows: min − anchor − length = 4, min − intron − length = 40, max − intron − length = 2000, min − segment − intron = 40, max − segment − intron = 2000, min − coverage − intron = 40, and max − coverage − intron = 2000.

Extract features

In each sample of RNA-seq data, we extract 7 features (Figure 2(a)), including read counts within an intron (N), read counts within flanking exons (N), read counts supporting splice junctions (N), read counts overlapping with 5' splice site (N5), read counts overlapping with 3' splice site (N3), read coverage within an intron (N) and gene expression information (N).

Figure 2

Features and their illustration. (a) 7 Features. Red(Dark red/ Purple) short bars stand for the unique mapped reads within the intron (within flanking exons, overlapping with 5'/ 3' splice site); while black short bars connected by blue dash lines stand for the splitted reads supporting splice junctions. (b) An intron with its features between treatment and control conditions. For N, N, N, N5and N3, we count the original reads within an intron, within flanking exons, supporting splice junctions, overlapping with 5' splice site and 3' splice site, respectively. As for N, we calculate the percentage of covered positions (with one read at least) within an intron. In addition, we employ Cuffdiff [17] to calculate gene expression RPKM values (RPKM: Reads per kilo base per million), denoted by N. As shown in Figure 2(b), we will get 7 features , , , , , and for treatment condition, and 7 features , , , , , and for control condition. Those 14 features form the feature space of an intron for IR event detection.

IRcall

The framework of IRcall is illustrated in Figure 3(a): the input data (bam files, and the reference annotation) are firstly analyzed to extract 14 features, representing each intron between treatment and control conditions; then low quality introns are removed according to the criteria in Table 1; after that, the IR score for each intron is calculated according to the features; finally, the introns are ranked by IR scores in descending order, with top-n% introns returned as IR events. Note that n% is the user-specified threshold; and an intron is retained for further process if it satisfies all criteria in Table 1. The criteria come from some domain knowledge and experiments [8,11].

Figure 3

Frameworks of IRcall and IRclassifier.

Table 1

Intron Removal Criteria.

TN5ss	>	3
TN3ss	>	3
TNcoverage	>	0:9
TNexon	>	1
TNexpression	>	10
TNintron	>	1
CNexpression	>	10
CNJunc	>	1
CNexon	>	1
TNexpressionCNexpression	<	log232orTNexpressionCNexpression>log223

Frameworks of IRcall and IRclassifier. Intron Removal Criteria. We propose a novel IR score formula to reflect the IR changes between treatment and control conditions, on three aspects: (a) ratio of read counts within an intron to flanking exons (NDIE); (b) ratio of read counts within an intron to read counts supporting splice junctions (NDIJ); (c) read coverage within an intron (NDIC). The IR score takes the weighted average of the three aspects, formulated as: In our experiments, we set equal weights to NDIE, NDIJ and NDIC, thus . The NDIE, NDIJ and NDIC are calculated as follows: First, in all treatment and control samples, IE, IJ , and IC of each intron are calculated to reflect the ratio of read counts within an intron to flanking exons, the ratio of read counts within an intron to read counts supporting splice junctions, and read coverage within an intron, by formula (2) (3) (4): Second, DIE, DIJ and DIC are calculated to reflect the changes in IE, IJ , and IC between treatment and control conditions (logarithm is taken to handle zero values in denominators), by formula (5) (6) (7). Note that DIE, DIJ and DIC should be larger than zero; otherwise, the intron instance is filtered out as noise. Finally, as the value scales of DIE, DIJ and DIC are different, we normalized them to the same scale of 0 to 1, by dividing their corresponding maximum value MAXDIE, MAXDIJ and MAXDIC, resulting in NDIE, NDIJ and NDIC, as shown in formula (8) (9) (10): As IRcall defines proper intron removal criteria (in Table 1) and IR score formula, it well considers read counts (within introns, within flanking exons, supporting splice junctions, and overlapping with 5' splice site/ 3' splice site), read coverage within an intron, and gene expression information, between treatment and control conditions. Hence, IRcall could effectively filter out false positives and improve prediction precision.

IRclassifier

IRclassifier is a framework drawing on machine learning ideas to facilitate better identification of IR events from RNA-Seq data, which contains three main steps (Figure 3(b)): first, the training set is constructed by incorporating the predictions of ExpressionPlot [8], MATS [9] and IRFinder [11], with 17 significant features to interpret each instance. The instances are saved in feature matrix , labeled with "yes" and "no", indicating positive and negative ones, respectively; second, the J48 Decision Tree (J48) classifier and Random Forests (RF) classifier are trained respectively, with 10-fold cross-validation to avoid over-fitting; finally, new RNA-Seq data are sent to the two classifiers respectively for IR event prediction. Our training data set are constructed based on the results of IR event identification, by MATS, ExpressionPlot and IRFinder. The instances hit by at least two of the three predicting algorithms are taken as "positive" ones, labeled by "yes", while those hit by none of the algorithms are taken as "negative" ones, labeled by "no". Thus, 741 positive instances and 3525 random negative instances are selected as the training set. The instance is interpreted by 17 features: 14 features from treatment and control samples (Figure 2) and 3 features DIE, DIJ , and DIC (defined in section Methods). As for classifiers, we employ J48 Decision Tree and Random Forests algorithms from Weka [18] to conduct two independent predictions. J48 classifier forms rules from pruned partial decision trees built by C4.5 heuristics [19], which minimizes the number of tree levels and tree nodes to maximize data generalization. During tree building, information gain is adopted for feature selection, such that feature of largest information gain is chosen for the next splitting. RF classifier [20] constructs independent decision tree classifiers by bagging and random feature selection, and aggregates the outputs of each decision tree in RF to produce one final prediction by a majority vote of the trees. Here, the RF classifier is constructed with 10 decision trees, each selecting 5 random features. To avoid overfitting, both J48 classifier and RF classifier employ 10-fold cross-validation.

Results and discussion

We use real RNA-seq data http://www.ncbi.nlm.nih.gov/sra?term=SRP008262 to detect IR events, which are comprise of skip mutant and wild-type in Arabidopsis thaliana [10], including two replicated skip mutant samples (skip1 and skip2) and two replicated wild-type samples (wt1 and wt2). As for the reference, we take Arabidopsis thaliana TAIR9 genome, with sequences and annotations ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome release/. The performance of IRcall and IRclassifier are studied against state-of-the-art IR prediction methods ExpressionPlot [8], MATS [9] and IRFinder [11].

Data normalization and experimental setup

As the numbers of mapped reads in the two skip mutant samples (skip1 and skip2) and two wild-type samples (wt1 and wt2) are different (28465249 reads in skip1, 20849546 reads in skip2, 44865242 reads in wt1, and 18343654 reads in wt2), we take the minimum read number 18343654 as the standard normalization scale (SNS). Then all read counts that construct features are divided by its sample's overall read number and multiplied by SNS for normalization. The sum of normalized read counts in skip1 and skip2 (wt1 and wt2) is taken as the read count for treatment (control) sample. Then all relevant features and scores are calculated according to the algorithms in Section Methods. As for evaluation reference, we get all positive IR event predictions (see Figure 4(a)) of ExpressionPlot [8], MATS [9], and IRFinder [11], with higher statistical significance (-value <0.05). In total, there are 10202 positive IR event predictions, with 29 from MATS, 8986 from ExpressionPlot, and 1928 from IRFinder.

Figure 4

IR Event Predictions. (a) Predictions of MATS, ExpressionPlot and IRFinder; (b) Predictions of IRcall, MATS, ExpressionPlot and IRFinder; (c) Predictions of IRclassifier (J48 Decision Tree), MATS, ExpressionPlot and IRFinder; (d) Predictions of IRclassifier (RF), MATS, ExpressionPlot and IRFinder.

IRcall performance

In the first group of experiments, IRcall is taken to detect differential IR events from RNA-Seq data on skip mutant and wild-type. For all of the 158580 introns in Arabidopsis thaliana TAIR9 reference annotation, we firstly filtered out the introns not satisfying the intron removal criteria of IRcall (see Table 1), resulting in 928 IR event candidates. Then the IR scores of those 928 IR event candidates are calculated, according to which the candidates are sorted in descending order. To evaluate the performance of IRcall, we compared the 928 IR event candidates of IRcall with the positive predictions of MATS, ExpressionPlot and IRFinder. As illustrated in Figure 4(b), the 928 IR event candidates of IRcall have 199 new predictions, and 5, 624, 383 overlaps with MATS, ExpressionPlot, and IRFinder, respectively. From the new predictions and each overlap, we randomly picked one case to display by IGV software [21] and [22], to study whether the introns are retained. As shown in Figure 6, our new prediction intron 1 of AT4G26080.1 (d) is retained between treatment and control conditions, in the similar way as those in overlaps with MATS (a), ExpressionPlot (b) and IRFinder (c). That is, our IRcall could predict known and novel IR event candidates.

Figure 6

Pre-mRNAs with predicted IR events (red arrows indicating retained introns).

However, those overlaps in Figure 4(b) only demonstrate the worst case of IRcall, as the IR score and ranking strategy have not been applied. To verify that the top-ranked IR event candidates tend to be true IR events, we studies the accuracy of IRcall top-n% predictions with the evaluation reference, that is, the 10202 positive predictions of MATS, ExpressionPlot and IRFinder. As shown in Figure 5, with the increase of returned IR events number (from top-10% to top-100%), the prediction accuracy decreases. That is to say, the top-ranked IRcall predictions tend to have higher accuracy (i.e. 98.91% for top-10% predictions), thus the IR score and ranking strategy proposed by IRcall are effective for IR event detection. Therefore, biologists are suggested taking top-ranked IR events for wet-lab verification.

Figure 5

Accuracy of the top-.

Accuracy of the top-. Pre-mRNAs with predicted IR events (red arrows indicating retained introns).

IRclassifier performance

As for IRclassifier, the instances hit by at least two of the three algorithms MATS, ExpressionPlot and IRFinder are taken as our positive training instances, labeled by "yes". As shown in Figure 4(a), the positive predictions of IRFinder and MATS have 10 overlaps, IRFinder and ExpressionPlot have 731 overlaps, while MATS and ExpressionPlot have none overlap. Thus, we get 741 instances for our positive training set. As for the negative training set, we randomly take 3525 introns hit by none of the above three algorithms, labeled by "no". 10-fold cross-validation is employed to avoid over-fitting and evaluate prediction accuracy for both J48 and RF classifiers. As shown in Table 2, the J48 classifier is of 99.1% true negative rate (specificity), 96.0% true positive rate (sensitivity) and 98.5% accuracy; while the RF classifier is of 99.7% true negative rate, 96.3% true positive rate and 99.0% accuracy.

Table 2

IRclassifier Performance.

Classifier	Acc	Sp	Sn	AUC
J48	98.5%	99.1%	96.0%	97.2%
RF	99.0%	99.7%	96.3%	99.8%

Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve.

IRclassifier Performance. Acc: Accuracy, Sp: Specificity, Sn: Sensitivity, AUC: Area under ROC curve. In addition, the predictions of IRclassifier are evaluated with the reference predictions of MATS, ExpressionPlot and IRFinder. As shown in Figure 4(C) and 4(D), the J48 classifier predicts 1086 new IR events, and 22 MATS, 1898 ExpressionPlot, 1226 IRFinder overlaps; while the RF classifier predicted 850 new IR events, and 18 MATS, 1751 ExpressionPlot, 1407 IRFinder overlaps. Among the new IRclassifier predictions, we randomly choose two cases displayed by IGV software in Figure 6: (e) intron 7 of AT5G61500.1 (predicted by J48 classifier) in skip mutant was retained compared with wild-type; and (f) AT3G45050 (predicted by J48 and RF classifiers) was different spliced between treatment and control conditions, where transcript 1 of AT3G45050 (AT3G45050.1) was expressed under treatment condition, but transcript 2, 3, and 4 of AT3G45050 (AT3G45050.2, AT3G45050.3 and AT3G45050.4) are expressed under control condition.

Feature ranking

To examine how each individual feature affects IR event prediction, we adopt the information gain for feature ranking. As each feature comes from both treatment and control samples, we take the value of treatment/control for information gain calculation in that IR events usually exist where the difference between treatment and control samples is significant. As shown in Table 3, features with higher information gains are relatively more relevant to IR event detection. The results show that the read counts within an intron, read counts overlapping with 5' and 3' splicing site are the most influential features for IR event detection.

Table 3

Feature ranking.

Features(treatment/control)	Information gain
read counts within an intron	0.4038
read counts overlapping with 3 splicing site	0.3948
read counts overlapping with 5 splicing site	0.3917
read coverage within an intron	0.2496
read counts supporting splice junction	0.2343
read counts within flanking exons	0.0713
gene expression RPKM values	0.0527

Feature ranking.

Conclusions

In this paper, we overcome the limitations of traditional statistical and computational methods for IR event identification from RNA-Seq data, by proposing two novel algorithms, IRcall and IRclassifier, to detect the IR events between treatment and control conditions. In our methods, new features are employed to filter out false positives effectively. IRcall takes 14 features and defines a novel formula IRscore for candidate IR event ranking. From experiments, 98.91% of IRcall's top-10% high ranked IR event candidates have been verified by other methods. As a machine learning method for IR event detection, IRclassifier employs 17 features, and gets 98.5% and 99.0% recognition precisions, with J48 Decision Tree classifier and Random Forests classifier, respectively. Therefore, we believe that our IRcall and IRclassifier are effective to detect IR events between treatment and control conditions from RNA-Seq data, delivering valuable information and tools for alternative splicing research. In future, instead of using TopHat2 for junction alignment, we will consider assessing the influence of different junction alignment methods on the performance of IRcall and IRclassifier. Besides, we plan to use IRcall and IRclassifier to detect IR events from paired-end sequencing data.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YB, SJ and YW conceived and designed the algorithm and experiments. YB implemented the algorithm and finished the experiments. All authors rote and revised the manuscript.

19 in total

Review 1. Alternative splicing: increasing diversity in the proteomic world.

Authors: B R Graveley
Journal: Trends Genet Date: 2001-02 Impact factor: 11.639

Review 2. Understanding alternative splicing: towards a cellular code.

Authors: Arianne J Matlin; Francis Clark; Christopher W J Smith
Journal: Nat Rev Mol Cell Biol Date: 2005-05 Impact factor: 94.444

3. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing.

Authors: Qun Pan; Ofer Shai; Leo J Lee; Brendan J Frey; Benjamin J Blencowe
Journal: Nat Genet Date: 2008-11-02 Impact factor: 38.330

4. Orchestrated intron retention regulates normal granulocyte differentiation.

Authors: Justin J-L Wong; William Ritchie; Olivia A Ebner; Matthias Selbach; Jason W H Wong; Yizhou Huang; Dadi Gao; Natalia Pinello; Maria Gonzalez; Kinsha Baidya; Annora Thoeng; Teh-Liane Khoo; Charles G Bailey; Jeff Holst; John E J Rasko
Journal: Cell Date: 2013-08-01 Impact factor: 41.582

5. The significance of digital gene expression profiles.

Authors: S Audic; J M Claverie
Journal: Genome Res Date: 1997-10 Impact factor: 9.043

6. Whole-genome microarray in Arabidopsis facilitates global analysis of retained introns.

Authors: Hadas Ner-Gaon; Robert Fluhr
Journal: DNA Res Date: 2006-09-16 Impact factor: 4.458

7. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

Review 8. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

9. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration.

Authors: Helga Thorvaldsdóttir; James T Robinson; Jill P Mesirov
Journal: Brief Bioinform Date: 2012-04-19 Impact factor: 11.622

10. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.

Authors: Daehwan Kim; Geo Pertea; Cole Trapnell; Harold Pimentel; Ryan Kelley; Steven L Salzberg
Journal: Genome Biol Date: 2013-04-25 Impact factor: 13.583

5 in total

1. Quantitative analysis of cryptic splicing associated with TDP-43 depletion.

Authors: Jack Humphrey; Warren Emmett; Pietro Fratta; Adrian M Isaacs; Vincent Plagnol
Journal: BMC Med Genomics Date: 2017-05-26 Impact factor: 3.063

2. iREAD: a tool for intron retention detection from RNA-seq data.

Authors: Hong-Dong Li; Cory C Funk; Nathan D Price
Journal: BMC Genomics Date: 2020-02-06 Impact factor: 3.969

3. New insights into minor splicing-a transcriptomic analysis of cells derived from TALS patients.

Authors: Anne-Louise Leutenegger; Sylvie Mazoyer; Patrick Edery; Vincent Lacroix; Audric Cologne; Clara Benoit-Pilven; Alicia Besson; Audrey Putoux; Amandine Campan-Fournier; Michael B Bober; Christine E M De Die-Smulders; Aimee D C Paulussen; Lucile Pinson; Annick Toutain; Chaim M Roifman
Journal: RNA Date: 2019-06-07 Impact factor: 4.942

4. Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be?

Authors: Shanrong Zhao; Li Xi; Baohong Zhang
Journal: PLoS One Date: 2015-11-11 Impact factor: 3.240

Review 5. Intron Retention as a Mode for RNA-Seq Data Analysis.

Authors: Jian-Tao Zheng; Cui-Xiang Lin; Zhao-Yu Fang; Hong-Dong Li
Journal: Front Genet Date: 2020-07-07 Impact factor: 4.599

5 in total