Ying Cui, Meng Cai, H Eugene Stanley.
Abstract
Alternative splicing (AS) is a major driver of proteome diversity in mammalian genomes and a widespread cause of human hereditary diseases. More than 95% of genes in the human genome are alternatively spliced, and the most common type of AS is the cassette exon. Recent discoveries have demonstrated that cassette exons play an important role in genetic diseases. To investigate how cassette exon events arise, we statistically analyze cassette exons and find that these events are strongly associated with exons that are shorter and that have a lower GC content, more termination codons, and weaker splice sites. We propose an improved random-forest-based hybrid method for distinguishing cassette exons from constitutive exons. Our method classifies cassette and constitutive exons with high accuracy and is verified to outperform previous approaches. We anticipate that this study will facilitate a better understanding of the mechanisms underlying cassette exons.
Year: 2017 PMID: 29349080 PMCID: PMC5734011 DOI: 10.1155/2017/7323508
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Results of sequence feature analysis.
| Feature | Mean value of cassette exons | Mean value of constitutive exons | P value |
|---|---|---|---|
| Length | 142.85 | 176.39 | <2.2 |
| GC content | 0.4975 | 0.5062 | 0.009886 |
| Termination codon | 0.0128 | 0.0117 | 0.0006005 |
| 5′ splice strength | −13.43874 | −12.1145 | 0.0005285 |
| 3′ splice strength | −19.91228 | −18.29848 | 0.0002665 |
Figure 1. Mean occurrence of mononucleotides, dinucleotides, and trinucleotides.
List of extracted features.
| Feature subset | Number of features |
|---|---|
| Length | 1 |
| Mononucleotide | 4 |
| Dinucleotide | 16 |
| Trinucleotide | 64 |
| Termination codon | 3 |
| GC content | 1 |
| Splice site strength | 2 |
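The feature subsets listed above sum to 91 values per exon (1 + 4 + 16 + 64 + 3 + 1 + 2). A minimal sketch of how such a vector could be assembled is given below. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the k-mer and termination codon features are assumed to be frequencies over overlapping windows, and the two splice site strengths are assumed to be precomputed by an external scoring tool that this excerpt does not name.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = ("TAA", "TAG", "TGA")


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def exon_features(seq, five_prime_score, three_prime_score):
    """Build a 91-value feature vector for one exon (hypothetical layout).

    five_prime_score and three_prime_score are assumed to be splice site
    strengths computed elsewhere; they are passed through unchanged.
    """
    seq = seq.upper()
    n = len(seq)
    features = [float(n)]                          # length (1)
    for k in (1, 2, 3):                            # 4 + 16 + 64 k-mer frequencies
        features.extend(kmer_frequencies(seq, k))
    for codon in STOP_CODONS:                      # termination codons (3)
        # Assumption: occurrences in all overlapping windows, per nucleotide.
        hits = sum(1 for i in range(n - 2) if seq[i:i + 3] == codon)
        features.append(hits / n if n else 0.0)
    gc = seq.count("G") + seq.count("C")
    features.append(gc / n if n else 0.0)          # GC content (1)
    features.extend([five_prime_score, three_prime_score])  # splice strength (2)
    return features
```

Calling `exon_features("ATGGCCTAAGGT", -13.4, -19.9)` returns a list of 91 values whose first entry is the exon length and whose last two entries are the supplied splice site scores.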
Classification results of SVM classifier with different kernels.
| Kernel | Parameters | | | | | TA (%) |
|---|---|---|---|---|---|---|
| Linear | 1 | | | 70.23 | 68.57 | 69.27 |
| RBF | 25 | 0.000012 | | | | |
| Poly | 14 | 0.015623 | 3 | 72.26 | 70.12 | 71.08 |
| Sigmoid | 357 | 0.000564 | 0.3 | 73.27 | 69.78 | 71.44 |
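For reference, the four kernels compared above have standard textbook definitions, sketched below in plain Python. The parameter names `gamma`, `coef0`, and `degree` follow common SVM library conventions and are assumptions here; which table column corresponds to which parameter cannot be recovered from this excerpt.

```python
import math


def dot(x, y):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def rbf_kernel(x, y, gamma):
    # exp(-gamma * squared Euclidean distance)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def poly_kernel(x, y, gamma, coef0=0.0, degree=3):
    # (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma, coef0=0.0):
    # tanh(gamma * <x, y> + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)
```

The linear kernel has no kernel-specific parameter, which is consistent with the linear row listing a single parameter value.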
Parameter optimization details in different classifiers.
| Classifier | Parameters | Step size in search | Search range | Optimal value |
|---|---|---|---|---|
| KNN | | 1 | 1 : 20 | 7 |
| SVM | | 1 | 1 : 500 | 25 |
| | | 0.000001 | 10⁻⁶ : 1 | 0.000012 |
| Random forest | | 50 | 50 : 2000 | 1000 |
| | | 1 | 1 : 91 | 91 |
| CForest | | 50 | 50 : 2000 | 1050 |
| | | 1 | 1 : 91 | 10 |
| XGBoost | | 0.1 | 0.1 : 1 | 0.5 |
| | | 1 | 1 : 10 | 4 |
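The step sizes and search ranges above describe an exhaustive grid search. A minimal sketch of that procedure follows, assuming each range is inclusive at both ends. The parameter names `n_trees` and `n_candidate_features` are hypothetical, since the parameter names did not survive in this excerpt, and the scoring function is a toy stand-in for cross-validated accuracy.

```python
from itertools import product


def grid(start, stop, step):
    """Inclusive arithmetic grid; steps are counted as integers to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(param_grids, score):
    """Return (best_params, best_score) over the Cartesian product of the grids."""
    names = list(param_grids)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grids[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score


# Grids matching the two random forest rows of the table (names are hypothetical).
rf_grids = {
    "n_trees": grid(50, 2000, 50),            # 40 candidate values
    "n_candidate_features": grid(1, 91, 1),   # 91 candidate values
}


def toy_score(p):
    # Toy objective for illustration only; a real search would train and
    # validate a classifier for each parameter combination.
    return -abs(p["n_trees"] - 1000) - abs(p["n_candidate_features"] - 91)


best, _ = grid_search(rf_grids, toy_score)
```

With these grids the search evaluates 40 × 91 = 3640 combinations, which shows why a coarse step of 50 is a practical choice for the larger range.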
Classification performance of different classifiers.
| Classifier | | | TA (%) | AUC |
|---|---|---|---|---|
| KNN | 67.03 | 75.57 | 70.45 | 0.7818 |
| SVM | 73.23 | 70.34 | 71.69 | 0.7865 |
| Random forest | 71.49 | 78.72 | 74.59 | 0.8411 |
| CForest | 95.90 | 97.62 | 96.69 | 0.9954 |
| XGBoost | 83.55 | 85.37 | 84.44 | 0.9270 |
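The TA and AUC columns can be computed from a classifier's predictions as sketched below. This assumes that TA denotes total accuracy (the percentage of exons classified correctly), which this excerpt does not define, and it computes AUC through the rank-based formulation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The function names and the label coding (1 for cassette, 0 for constitutive) are illustrative.

```python
def total_accuracy(y_true, y_pred):
    """Percentage of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, which is acceptable for a sketch; a sort-based rank computation would be the usual choice for large test sets.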
Figure 2. ROC curves of different classifiers.
Figure 3. Importance rank of features (top 15) in different classification models. (a) CForest, (b) random forest.
Figure 4. Performance comparison between existing methods and our method.