Rui Mao, Chun Liang, Yang Zhang, Xingan Hao, Jinyan Li.
Abstract
Intron retention, one of the most prevalent alternative splicing events in plants, leaves introns in mature mRNAs. However, the features that distinguish retained introns (RIs) from constitutively spliced introns (CSIs) are still poorly understood. This work proposes a computational pipeline to discover novel RIs from multiple next-generation RNA sequencing (RNA-Seq) datasets of Arabidopsis thaliana. Using this pipeline, we detected 3,472 novel RIs from 18 RNA-Seq datasets and reconfirmed 1,384 RIs that are currently annotated in the TAIR10 database. We also use the expression of intron-containing isoforms as a new feature in addition to the conventional features. Based on these features, machine learning methods distinguish RIs from CSIs well, especially when the expressional odds of retention (i.e., the expression ratio of the RI-containing isoforms relative to the isoforms without RIs for the same gene) reach or exceed 50/50. In this case, a Random Forest separates RIs from CSIs with an AUC (area under the receiver operating characteristic curve) of 0.95. The characteristics most closely related to RIs include low splice-site strength, high similarity with the flanking exon sequences, a low occurrence percentage of YTRAY near the acceptor site, the presence of putative intronic splicing silencers (ISSs, i.e., AG/GA-rich motifs) and intronic splicing enhancers (ISEs, i.e., TTTT-containing motifs), and enrichment of Serine/Arginine-Rich (SR) proteins and heterogeneous nuclear ribonucleoproteins (hnRNPs).
Keywords: constitutively spliced introns (CSIs); distinguishable features; high-throughput next-generation RNA sequencing (RNA-Seq); intronic splicing enhancers (ISEs); intronic splicing silencers (ISSs); random forest; retained introns (RIs)
Year: 2017 PMID: 29062321 PMCID: PMC5640774 DOI: 10.3389/fpls.2017.01728
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
Figure 1. The computational pipeline for the identification of RIs. Overview of the steps involved: read cleaning, read mapping, transcript reconstruction, RI extraction, and expression quantification at the isoform level. The specific procedures and the corresponding software used to identify reliable RIs are described for each class of RNA-Seq dataset.
Data sources of RNA-Seq.
| Class | Tissue or condition | Sequencing platform | Read length (bp) | Read type | Aligned mapping, Q ≥ 20 (%) |
| --- | --- | --- | --- | --- | --- |
| Sample1 | 10-d seedlings and flowers mixed in a 1:1 ratio | Illumina Genome Analyzer II | 76 | Paired | 98.79 |
| | | | | | 98.78 |
| Sample2 | Inflorescent meristem | AB SOLiD System 3.0 | 50 | Single | 99.95 |
| | | | | | 99.94 |
| Sample3 | Cold | Illumina Genome Analyzer II | 36 | Single | 97.03 |
| | | | | | 99.77 |
| | | | | | 99.65 |
| | | | | | 91.12 |
| Sample4 | Heat | Illumina Genome Analyzer II | 36 | Single | 92.83 |
| | | | | | 95.78 |
| | | | | | 99.86 |
| | | | | | 91.40 |
| Sample5 | Salt | Illumina Genome Analyzer II | 36 | Single | 88.27 |
| | | | | | 87.73 |
| | | | | | 85.61 |
| Sample6 | Drought | Illumina Genome Analyzer II | 36 | Single | 92.58 |
| | | | | | 92.86 |
| | | | | | 90.81 |
All RNA-Seq samples in the analysis are divided into six classes (denoted Sample1, Sample2, Sample3, Sample4, Sample5, and Sample6) based on tissue and condition. The percentage of aligned mapping is the percentage of aligned reads with a phred quality score (Q) of at least 20 (Q ≥ 20) after running GSNAP.
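For readers unfamiliar with the Q ≥ 20 cutoff, the sketch below converts between a phred-scaled score and the error probability it encodes, using the standard definition Q = −10·log10(P). The function names are illustrative, and the note above does not say whether Q refers to base quality or mapping quality; the conversion is the same in either case.

```python
import math


def phred_to_error_probability(q):
    """Error probability encoded by a phred-scaled score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-scaled score for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Q = 20 corresponds to an error probability of about 0.01 (1 in 100).
print(phred_to_error_probability(20))
print(error_probability_to_phred(0.01))
```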
Figure 2. Known RIs in TAIR10 and novel RIs in each of the six classes of RNA-Seq datasets. All RNA-Seq datasets analyzed are divided into six classes: Sample1, Sample2, Sample3, Sample4, Sample5, and Sample6. For each class, novel RIs (those absent from TAIR10 but newly detected by our method) are shown in light gray, and known RIs are shown in dark gray.
Figure 3. Six-set Venn diagrams of the RIs from the six classes. The sets of RIs found in Sample1 to Sample6 overlap extensively. (A) Known RIs unique to each of the six classes or to their combinations. (B) Novel RIs unique to each of the six classes or to their combinations. Both panels use the same colors: red-brown for Sample1, red for Sample2, blue for Sample3, yellow for Sample4, purple for Sample5, and green for Sample6. The numbers in the intersection areas give the overlap among the classes.
Intersections of known and novel RIs among six classes.
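| Known RIs (class combination) | Count | Novel RIs (class combination) | Count |
| --- | --- | --- | --- |
| RI-known123456 | 872 | RI-novel123456 | 96 |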
| RI-known12345 | 1 | RI-novel13456 | 14 |
| RI-known12356 | 2 | RI-novel1234 | 26 |
| RI-known13456 | 10 | RI-novel1256 | 14 |
| RI-known23456 | 2 | RI-novel1345 | 1 |
| RI-known1234 | 107 | RI-novel3456 | 866 |
| RI-known1235 | 1 | RI-novel134 | 2 |
| RI-known1256 | 38 | RI-novel345 | 37 |
| RI-known1345 | 1 | RI-novel346 | 2 |
| RI-known3456 | 163 | RI-novel356 | 13 |
| RI-known123 | 2 | RI-novel456 | 11 |
| RI-known124 | 1 | RI-novel12 | 1,577 |
| RI-known156 | 1 | RI-novel34 | 525 |
| RI-known234 | 3 | RI-novel46 | 3 |
| RI-known12 | 85 | RI-novel56 | 155 |
| RI-known34 | 62 | RI-novel1 | 52 |
| RI-known56 | 27 | RI-novel2 | 6 |
| RI-known1 | 1 | RI-novel3 | 25 |
| RI-known2 | 1 | RI-novel4 | 20 |
| RI-known3 | 1 | RI-novel5 | 20 |
| RI-known5 | 1 | RI-novel6 | 7 |
| RI-known6 | 2 | ||
The prefixes “RI-known” and “RI-novel” denote, respectively, known RIs in TAIR10 and novel RIs detected from RNA-Seq by our method. The digits in the suffix list the sample classes to whose combination the RIs are unique; for example, RI-known12 counts the known RIs found only in Sample1 and Sample2.
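Because each row counts RIs unique to one combination of classes, the rows are disjoint and can simply be added. The sketch below transcribes the counts from the table and checks them against the totals quoted elsewhere in this record (1,384 known RIs, 3,472 novel RIs, and 968 RIs present in all six classes). The dictionary layout and the helper `per_class_total` are illustrative assumptions, not part of the authors' pipeline.

```python
# Counts transcribed from the intersection table; keys are the class digits.
known = {
    "123456": 872, "12345": 1, "12356": 2, "13456": 10, "23456": 2,
    "1234": 107, "1235": 1, "1256": 38, "1345": 1, "3456": 163,
    "123": 2, "124": 1, "156": 1, "234": 3, "12": 85, "34": 62,
    "56": 27, "1": 1, "2": 1, "3": 1, "5": 1, "6": 2,
}
novel = {
    "123456": 96, "13456": 14, "1234": 26, "1256": 14, "1345": 1,
    "3456": 866, "134": 2, "345": 37, "346": 2, "356": 13, "456": 11,
    "12": 1577, "34": 525, "46": 3, "56": 155, "1": 52, "2": 6,
    "3": 25, "4": 20, "5": 20, "6": 7,
}


def per_class_total(counts, sample):
    """Number of RIs detected in one class, summed over all combinations."""
    return sum(n for combo, n in counts.items() if str(sample) in combo)


print(sum(known.values()))                  # 1384 known RIs
print(sum(novel.values()))                  # 3472 novel RIs
print(known["123456"] + novel["123456"])    # 968 RIs found in all six classes
print(per_class_total(novel, 2))            # novel RIs detected in Sample2
```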
The feature vector used to represent RIs and CSIs for classification.
| FeatureSet-1 | Length; AT and GC content; nucleotide occurrence probabilities of A, C, G and T; Segmental probabilities of four nucleotides correlation factors (θ |
| FeatureSet-2 | SFvalue, SFaccvalue; IDdonv, IDacceptv |
| FeatureSet-3 | |
| The Expression of Intron-containing Isoforms | |
| Class label | True (RIs); False (CSIs) |
The feature vector consists of the expression of intron-containing isoforms (denoted FPKM) and three types of conventional features (FeatureSet-1, FeatureSet-2, and FeatureSet-3). For detailed definitions of these conventional features, see Mao et al.
Figure 4. Performance of Random Forest on all datasets. Eight datasets are shown: RI-set1 (green line, triangle marks), RI-set2 (pink line with marks), RI-set3 (red line with marks), RI-set4 (light gray line, filled circle marks), RI-set5 (black line, cross marks), RI-set6 (yellow line, hollow circle marks), RI-set-all-expressed (blue line, rhombus marks), and RI-set-stage-expressed (red-brown line, square marks). They represent the RIs and the corresponding CSIs extracted from Sample1, Sample2, Sample3, Sample4, Sample5, and Sample6, those expressed in all six datasets (RI-set-all-expressed), and those co-occurring in the developmental tissues or under the stress conditions (RI-set-stage-expressed). For each dataset, Random Forest classification was run on nofpkm (all classification features except FPKM), on fpkm (the FPKM feature only), and on subsets of RIs (all classification features) selected by different criteria of expressional retention odds (all, 10/90, 20/80, 30/70, 40/60, and 50/50; Table S2). (A) Accuracy. (B) F-Measure. (C) AUC. The 50/50 expressional odds of retention consistently give the best overall performance across all eight datasets (on average, 0.909 for accuracy and F-Measure and 0.954 for AUC).
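The three measures reported in this figure can be illustrated with a minimal sketch. It assumes binary labels (True for RIs, False for CSIs), computes the F-Measure for the RI class only, and obtains the AUC as the probability that a randomly chosen RI scores higher than a randomly chosen CSI. The labels and scores are made-up values, and the original study may have averaged the measures differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of introns whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (RI) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three RIs (True) and three CSIs (False).
labels = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]      # hypothetical classifier scores
predicted = [s >= 0.5 for s in scores]        # hypothetical 0.5 cutoff

print(accuracy(labels, predicted))
print(f_measure(labels, predicted))
print(auc(labels, scores))
```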
Figure 5. Hierarchical clustering of expressional retention odds for all RIs in RI-set-all-expressed. The 968 RIs in RI-set-all-expressed are expressed in all six classes. Blue indicates expressional retention odds below 1.0, with deeper blue for smaller values. Red indicates odds above 1.0, with deeper red for larger values. White indicates odds equal to 1.0.
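The expressional retention odds are the expression ratio of the RI-containing isoforms to the isoforms without the RI for the same gene, so odds of 1.0 correspond to the 50/50 criterion. The sketch below shows one way to compute the ratio from isoform-level FPKM values. Summing FPKM over isoforms, the function names, and the example values are assumptions made for illustration, not the authors' implementation.

```python
def retention_odds(fpkm_with_ri, fpkm_without_ri):
    """Expression ratio of RI-containing isoforms to isoforms without the RI.

    Both arguments are lists of isoform FPKM values for the same gene
    (a hypothetical input layout).
    """
    retained = sum(fpkm_with_ri)
    spliced = sum(fpkm_without_ri)
    if retained == 0:
        return 0.0
    if spliced == 0:
        # Only RI-containing isoforms are expressed: the odds are unbounded.
        return float("inf")
    return retained / spliced


def passes_criterion(odds, retained_share, spliced_share):
    """True if the odds reach a criterion such as 10/90 or 50/50."""
    return odds >= retained_share / spliced_share


# Made-up FPKM values for one gene.
odds = retention_odds([6.0, 2.0], [4.0])
print(odds)                             # 2.0
print(passes_criterion(odds, 50, 50))   # True
```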
The GC content, length distribution quartiles of introns and the mean values of splice regulating factors in CSIs, RIs, and RIg50.
| CSIs | 32.66 | 4.8691 | 6.4489 | 18.453 | 18.362 | 155 | [20, 86, 99, 156, 155] |
| RIs | 35.91 | 4.1038 | 5.0915 | 18.221 | 18.056 | 140 | [15, 82, 96, 134, 213] |
| RIg50 | 41.15 | 1.8208 | 2.1530 | 17.788 | 18 | 260 | [18, 84, 119, 266, 522] |
The comparison focuses on differences among CSIs, RIs, and the RIs of RI-set-stage-expressed whose expressional retention odds exceed 1.0 (namely RIg50). CSIs, RIs, and RIg50 are defined in the same way in the following tables. SFvalue and SFaccvalue represent the strength of the splice sites. IDdonv and IDacceptv represent the similarity between the two sequences flanking each splice site.
The occurrence of RIs and RIg50 in CDS or UTR, according to TAIR10.
| RIs | 594 | 42 | 332 | 61.36% |
| RIg50 | 507 | 78 | 101 | 73.90% |
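The extracted table has lost its column headers. As a consistency check only, the sketch below shows that the percentage in the last column is close to the first count divided by the sum of the three counts in each row. Reading the columns this way is an assumption; the record does not state what each count represents.

```python
# Counts transcribed from the table; their column meanings are not stated.
rows = {
    "RIs": (594, 42, 332),
    "RIg50": (507, 78, 101),
}
reported = {"RIs": 61.36, "RIg50": 73.90}

for name, counts in rows.items():
    share = 100 * counts[0] / sum(counts)
    # Compare the recomputed share with the percentage given in the table.
    print(f"{name}: total={sum(counts)}, share={share:.2f}%, "
          f"reported={reported[name]:.2f}%")
```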
Figure 6. Distribution features of conserved sequence motifs at the branch point. All negative samples (CSIs, blue), positive samples (RIs, red), and positive samples of RI-set-stage-expressed with expressional retention odds above 1.0 (RIg50, navy blue) are compared. (A) The distribution of the distance from the acceptor site (3′ss) to the branch point A. (B) The weblogo of the branch point sequence motifs. The vertical scale indicates the nucleotide occurrence probabilities of A, T, C, and G. All branch point sequence motifs are strictly aligned at the branch point (A, position 0). (C) The features of the splice sites. SFvalue and SFaccvalue indicate the strength of the splice sites, and IDdonv and IDacceptv represent the similarity between the two sequences flanking the corresponding splice sites.
The occurrence of the conserved motif NNYTRAY in CSIs, RIs, and RIg50.
| CSIs | 1 | 811 | 1188 | [12, 20, 26, 33, 51] |
| RIs | 29 | 420 | 423 | [12, 19.5, 25, 35, 51] |
| RIg50 | 52 | 400 | 234 | [12, 21, 27, 36, 51] |
The distribution of the distance to the acceptor site (3′ss) lists the minimum, the 0.25, 0.5, and 0.75 quantiles, and the maximum distance between the branch point and the acceptor site.
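A search for the NNYTRAY motif can be sketched with a regular expression built from the IUPAC codes (N = any base, Y = C or T, R = A or G). The search window of 60 nt, the distance convention (offset from the branch point A to the last base of the intron), and the example sequence are assumptions chosen for illustration; they are not taken from the authors' pipeline.

```python
import re

# NNYTRAY with IUPAC codes expanded; the lookahead reports overlapping hits.
NNYTRAY = re.compile(r"(?=([ACGT]{2}[CT]T[AG]A[CT]))")


def branch_point_candidates(intron, window=60):
    """Return (motif, distance) pairs for motifs in the last `window` bases.

    `distance` is the offset from the branch point A (6th base of the
    motif) to the last base of the intron (hypothetical convention).
    """
    intron = intron.upper()
    start = max(0, len(intron) - window)
    hits = []
    for match in NNYTRAY.finditer(intron, start):
        a_index = match.start() + 5
        hits.append((match.group(1), len(intron) - 1 - a_index))
    return hits


# Made-up intron: donor GT..., one TTCTAAC motif, and acceptor ...AG.
example = "GTAAGT" + "CC" + "TTCTAAC" + "TTTTCTTTTGTTTTGCAG"
print(branch_point_candidates(example))   # [('TTCTAAC', 19)]
```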
The frequencies of A, T, C, and G at each position of the conserved motif NNYTRAY in CSIs, RIs, and RIg50.
| Group | Nucleotide | −5 | −4 | −3 | −1 | +1 |
| --- | --- | --- | --- | --- | --- | --- |
| CSIs | A | 0.227273 | 0.251684 | 0 | 0.476431 | 0 |
| | C | 0.175926 | 0.164983 | 0.367845 | 0 | 0.298822 |
| | G | 0.157407 | 0.175084 | 0 | 0.523569 | 0 |
| | T | 0.439394 | 0.408249 | 0.632155 | 0 | 0.701178 |
| RIs | A | 0.247863 | 0.226496 | 0 | 0.465812 | 0 |
| | C | 0.15812 | 0.209402 | 0.435897 | 0 | 0.337607 |
| | G | 0.239316 | 0.260684 | 0 | 0.534188 | 0 |
| | T | 0.354701 | 0.303419 | 0.564103 | 0 | 0.662393 |
| RIg50 | A | 0.247863 | 0.226496 | 0 | 0.465812 | 0 |
| | C | 0.15812 | 0.209402 | 0.435897 | 0 | 0.337607 |
| | G | 0.239316 | 0.260684 | 0 | 0.534188 | 0 |
| | T | 0.354701 | 0.303419 | 0.564103 | 0 | 0.662393 |
Positions −2 and 0 are constant (T and A, respectively) and are therefore not listed.
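Per-position frequencies of this kind can be obtained by counting nucleotides column by column over aligned motif instances. The sketch below is a minimal illustration with made-up NNYTRAY instances; the function name and the input format are assumptions.

```python
from collections import Counter


def position_frequencies(motifs):
    """Per-position frequencies of A, C, G, and T over aligned motifs.

    `motifs` is a non-empty list of equal-length strings, aligned so that
    the same index is the same motif position in every string.
    """
    table = []
    for i in range(len(motifs[0])):
        counts = Counter(m[i] for m in motifs)
        table.append({base: counts[base] / len(motifs) for base in "ACGT"})
    return table


# Made-up NNYTRAY instances; index 3 is always T and index 5 is always A.
instances = ["TTCTAAC", "ATTTGAT", "GACTAAT", "TTTTGAC"]
for index, freqs in enumerate(position_frequencies(instances)):
    print(index, freqs)
```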
The assessment indexes of typical representatives for the conventional FeatureSet-3 in RIg50.
| GGG-containing | −0.77 | 0.14 | |
| GGAG-containing | −0.92 | 0.21 | |
| − | |||
| TA/AT-rich(4–5bp) | 0.33 | 0.28 | |
α(x(k)) indicates the difference in x(k) (typical frequent motifs) between RIg50 and CSIs. S.
The bold “AG/GA-rich” and “TTTT-containing” motifs are putative intronic splicing silencers (ISSs) and intronic splicing enhancers (ISEs), respectively.
Figure 7. Significant KEGG pathway enrichment for the genes that contain RIs co-occurring in multiple Samples.