Rui Mao, Praveen Kumar Raj Kumar, Cheng Guo, Yang Zhang, Chun Liang.
Abstract
Alternative splicing is an important mode of pre-mRNA post-transcriptional modification. It allows many distinct mature mRNA transcripts to be created from a single gene by using different splice sites. In plants such as Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and on the regulation of their expression, while little systematic classification of RIs versus constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing the coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and finally extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs from CSIs more accurately than four other approaches. The optimal penalty parameter C and the RBF kernel parameter γ of the SVM were set using a particle swarm optimization algorithm (PSOSVM). Our classification achieved an F-Measure of 80.8% (random forest) and 77.4% (PSOSVM). Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Our study will facilitate a better understanding of the underlying mechanisms involved in intron retention.
Year: 2014 PMID: 25110928 PMCID: PMC4128822 DOI: 10.1371/journal.pone.0104049
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Feature extraction approaches for calculating the signal strength of splice sites and the similarity between an intron and its flanking exons.
A. The sequence extraction approach for calculating signal strength of splice sites; B. The sequence extraction approach for calculating increment of diversity (ID).
The parameter values or ranges of PSOSVM.
| Parameter | Value or Range |
| | 10 |
| S (the number of particles) | 100 |
| | 2 |
| | 1.49618 |
| | 1.49618 |
| | 0.7298 |
| C | (2^-8, 2^10) |
| γ (RBF kernel parameter) | (2^-8, 2^8) |
The rule-of-thumb settings of the three PSO coefficients (1.49618, 1.49618 and 0.7298) are cited from [74].
Figure 2. The pseudo-code of PSOSVM.
The details of Eq. 16 are illustrated in Materials and Methods.
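The parameter search behind PSOSVM can be illustrated with a minimal particle swarm sketch. It is not the authors' pseudo-code. It assumes that the values 1.49618 and 0.7298 from the parameter table act as the two acceleration coefficients and the inertia weight of a standard PSO update, that the search runs over the base-2 exponents of C and of the RBF kernel parameter, and that 10 iterations are used. The function `toy_error` is a hypothetical stand-in for the cross-validated SVM error that a real run would evaluate.

```python
import random


def pso_minimize(objective, bounds, n_particles=100, n_iter=10,
                 w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    """Standard global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_error(p):
    """Hypothetical objective: stands in for cross-validated SVM error."""
    log2_c, log2_gamma = p
    return (log2_c - 3.0) ** 2 + (log2_gamma + 2.0) ** 2


# Exponent ranges follow the table: C in (2^-8, 2^10), kernel parameter in (2^-8, 2^8).
best, err = pso_minimize(toy_error, bounds=[(-8, 10), (-8, 8)])
C, gamma = 2 ** best[0], 2 ** best[1]
print(f"C = {C:.3f}, gamma = {gamma:.3f}, toy error = {err:.4f}")
```

Searching over exponents rather than raw values gives small and large magnitudes equal weight, which is a common choice for SVM hyperparameters; whether the paper does the same is not stated in this record.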
Figure 3. Numbers of various RNA types annotated in the TAIR10 gene annotation for Arabidopsis.
Each horizontal bar (with the number) indicates the number for a given RNA type.
Distribution of RIs and CSIs in Arabidopsis.
| Intron category | RIs | CSIs |
| All RNAs | 2,811 | 113,098 |
| mRNAs | 2,762 | 110,304 |
| ChrC, ChrM | 0 | 42 |
| Chr1, Chr2, Chr3, Chr4, Chr5 | 2,762 | 110,262 |
| Redundant Cases | 229 | 0 |
"All RNAs" refers to the 8 types of RNAs described in Figure 3. Redundant cases can occur only in RIs; see Materials and Methods for a detailed description.
Average size, range and sample quantiles of RIs and CSIs.
| Intron category | Average size (bp) | Range [Min–Max] (bp) | Quantile 0.02 (bp) | Quantile 0.2 (bp) | Quantile 0.4 (bp) | Quantile 0.6 (bp) | Quantile 0.8 (bp) | Quantile 0.98 (bp) |
| RIs | 145 | [10–2,075] | 44 | 81 | 92 | 112 | 182 | 501 |
| CSIs | 160 | [8–10,234] | 70 | 83 | 92 | 110 | 195 | 631 |
Quantiles were computed with the quantile() function in R. For the given probabilities [0.02, 0.2, 0.4, 0.6, 0.8, 0.98], quantile() returns estimates of the corresponding distribution quantiles based on the sorted sample.
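As a rough illustration of what such a quantile call computes, the sketch below reimplements the linear-interpolation rule that R's quantile() uses by default (type 7). It assumes the default type was used, which this record does not confirm, and the intron lengths are made-up example values rather than data from the study.

```python
import math


def quantile_type7(values, probs):
    """Sample quantiles by linear interpolation between order statistics."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    result = []
    for p in probs:
        h = (n - 1) * p              # fractional 0-based index into xs
        lo = math.floor(h)
        hi = min(lo + 1, n - 1)      # guard against running past the end at p = 1
        result.append(xs[lo] + (h - lo) * (xs[hi] - xs[lo]))
    return result


# Hypothetical intron lengths in bp, for illustration only.
example_lengths = [70, 83, 92, 110, 195, 631, 88, 101]
probabilities = [0.02, 0.2, 0.4, 0.6, 0.8, 0.98]
print(quantile_type7(example_lengths, probabilities))
```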
Feature vectors of experimental dataset.
| Feature types | Feature vector |
| Basic features | Length; AT content; GC content; nucleotide occurrence probabilities of A, C, G and T |
| Frequent motif features | |
| Splice site and flanking sequence features | SFvalue, SFaccvalue; IDdonv, IDacceptv |
| Complete features | Combined features (… |
| Optimized features | Length, … |
| Class label | True (RIs); False (CSIs) |
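The basic features in the table are simple to compute from an intron sequence. The following is a minimal sketch, assuming that AT and GC content are fractions of the A/C/G/T bases in the intron and that ambiguous bases such as N are ignored; the function name, dictionary keys, and the toy sequence are illustrative and do not come from the paper.

```python
from collections import Counter


def basic_features(intron_seq):
    """Return length, AT/GC content and per-nucleotide occurrence probabilities."""
    seq = intron_seq.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    probs = {base: counts[base] / total for base in "ACGT"}
    return {
        "length": len(seq),
        "AT_content": probs["A"] + probs["T"],
        "GC_content": probs["G"] + probs["C"],
        "P(A)": probs["A"],
        "P(C)": probs["C"],
        "P(G)": probs["G"],
        "P(T)": probs["T"],
    }


# Hypothetical toy intron, for illustration only.
print(basic_features("GTAAGTTTCTCTGCAG"))
```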
Optimal parameters and performances of random forest and PSOSVM using five different feature sets.
| Algorithm | Feature set | Parameter (…) | Parameter (…) | Accuracy | F-Measure | AUC |
| Random forest | | 4 | 42 | 0.771 | 0.772 | 0.867 |
| | | 4 | 42 | 0.785 | 0.785 | 0.897 |
| | Combined | | | | | |
| | Complete | 7 | 42 | 0.782 | 0.782 | 0.898 |
| | Optimized | 5 | 42 | 0.788 | 0.788 | 0.891 |
Figure 4. The ROC curves of random forest versus PSOSVM.
The ROC curve of random forest is shown by the solid line and that of PSOSVM by the dashed line. The classification performance of the two methods is measured by AUC (the area under the ROC curve). Random forest outperforms PSOSVM (AUC of 0.900 versus 0.844).
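AUC can be read as the probability that a randomly chosen positive example (here, an RI) receives a higher classifier score than a randomly chosen negative one (a CSI). The sketch below computes it directly from that pairwise definition, counting ties as one half. The scores and labels are invented for illustration, and the quadratic loop is meant for clarity, not for datasets of the size used in the study.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores; True marks an RI, False a CSI.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
print(auc_from_scores(scores, labels))
```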
Figure 5. Performance of random forest and PSOSVM (F-Measure) on five different feature sets.
Classification performance is assessed with F-Measure. Each solid round dot represents the F-Measure of random forest and each triangle represents the F-Measure of PSOSVM for a given feature set. Compared with the other feature sets, our combined A+B+C feature set obtains the best classification performance with both classifiers.
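For reference, accuracy and F-Measure can both be derived from the counts of a binary confusion matrix. The sketch below uses the usual single-class definition (harmonic mean of precision and recall for the positive class). Whether the reported F-Measure is per-class or a class-weighted average is not stated in this record, so treat this as an assumption; the counts in the example are invented.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f_measure) from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# Hypothetical counts: 8 RIs found, 2 CSIs mislabeled as RIs, 2 RIs missed, 8 CSIs correct.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```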
The mean value and P value of SFvalue, SFaccvalue, IDdonv and IDacceptv.
| | SFvalue | SFaccvalue | IDdonv | IDacceptv |
| The mean value in RIs | 3.930 | 5.075 | 17.934 | 17.891 |
| The mean value in CSIs | 4.806 | 6.363 | 18.412 | 18.385 |
| P value (one-way ANOVA) | 2.2e-16 | 2.2e-16 | 6.488e-07 | 3.545e-07 |
P values were calculated by applying the F-test in a one-way ANOVA to the experimental dataset, which includes both RIs and CSIs. All four features differ significantly between the two classes (p<0.0001).
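The F statistic behind a one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes that statistic and its degrees of freedom for any number of groups. It stops short of the P value, which needs the F distribution from a statistics library, and the two small groups are invented values rather than feature data from the study.

```python
def one_way_anova_f(*groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical feature values for two classes (e.g. RIs and CSIs).
print(one_way_anova_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```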
Figure 6. The mean occurrences of B frequent motifs in RIs and CSIs.
On the left side of the histogram are ten frequent motifs that occur more often in RIs than in CSIs. On the right side of the histogram are nine frequent motifs that occur more often in CSIs than in RIs.