Ying Wang, Xiaoye Li, Bairui Tao.
Abstract
MicroRNAs (miRNAs) are non-coding RNAs of ~20-25 nucleotides that regulate gene expression at the post-transcriptional level. The accuracy of identifying the start site of the mature miRNA within a given pre-miRNA remains low. It is worth noting that mature miRNA prediction is a class-imbalanced problem, which also contributes to the unsatisfactory performance of existing methods. We improved the prediction accuracy of the classifier by using balanced datasets and present MatFind, which identifies 5' mature miRNA candidates from their pre-miRNAs using an ensemble of SVM classifiers built on the idea of AdaBoost. First, a balanced dataset was extracted using the K-nearest neighbor algorithm. Second, multiple SVM classifiers were trained sequentially on the balanced datasets using the represented features. Finally, all SVM classifiers were combined to form the ensemble classifier. Our results on an independent test dataset show that the proposed method is more effective than the same method without treatment of the class imbalance problem. Moreover, MatFind achieves much higher classification accuracy than three other approaches. The ensemble of SVM classifiers and the balanced datasets can address the class-imbalance problem and improve classifier performance for mature miRNA identification. MatFind is an accurate and fast method for 5' mature miRNA identification.
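The sequential train-then-combine idea mentioned in the abstract can be illustrated with textbook AdaBoost. The sketch below is not the authors' implementation: it swaps the SVM base learners for one-dimensional decision stumps so that it runs with the standard library alone, and the toy data `xs`/`ys` and all function names are hypothetical.

```python
import math

def train_stump(xs, ys, weights):
    """Find the threshold/polarity with the lowest weighted error.
    A decision stump stands in for the SVM base learner (assumption)."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best  # (weighted error, threshold, polarity)

def stump_predict(x, thr, polarity):
    return polarity if x >= thr else -polarity

def adaboost(xs, ys, rounds):
    """Train `rounds` base learners in order; each sees reweighted samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
        ensemble.append((alpha, thr, pol))
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Combine all base learners by a weighted vote."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

def training_accuracy(ensemble, xs, ys):
    return sum(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

print(training_accuracy(adaboost(xs, ys, 1), xs, ys))  # 0.7
print(training_accuracy(adaboost(xs, ys, 3), xs, ys))  # 1.0
```

On this toy set a single stump misclassifies three points, while the weighted vote of three stumps classifies all ten correctly, which is the motivation for combining sequentially trained classifiers.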
Year: 2016 PMID: 27181057 PMCID: PMC4867574 DOI: 10.1038/srep25941
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The sequence description for feature extraction (taking hsa-mir-19a as an example).
All features used by MatFind.
| Features | Sequence | Number | Explanation |
|---|---|---|---|
| pn1–25 | structured-sequence | 25 | Paired-nucleotide type at each position of the duplex |
| ss1–50 | sequence-structure | 50 | Combination of nucleotide and paired type at each position of the duplex |
| fl1–18 | sequence-structure | 18 | Combination of nucleotide and paired type at each position of the left 9 bp region of the duplex |
| fr1–3 | structured-sequence | 6 | Combination of nucleotide and paired type at each position of the right 3 bp region of the duplex |
| MFE1–5 | structured-sequence | 5 | Minimum free energy of the duplex; minimum free energy of the left 3 bp double-stranded sequence; minimum free energy of the left 5 bp double-stranded sequence; minimum free energy of the left 9 bp double-stranded sequence; minimum free energy of the right 3 bp double-stranded sequence |
| length | structured-sequence | 1 | Distance from the first position to the terminal loop |
| Num1–3 | structured-sequence | 3 | Number of “−” in the double strand from +2 bp to +5 bp of the duplex; number of “−” in the double strand from +3 bp to +8 bp of the duplex; number of “−” in the double strand from +9 bp to +12 bp of the duplex |
| fn | sequence | 1 | First nucleotide of the mature miRNA |
| pair | structured-sequence | 1 | Paired state of the first base pair of the duplex |
Figure 2
Figure 3. The computational procedure of the SVM classifier.
The training data composition of the 10 classifiers.
| Classifier | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Negative samples (N) | 1119 | 1230 | 1335 | 1507 | 1564 | 1611 | 1654 | 1699 | 1723 | 1739 |
| Incorrectly classified samples (N) | 248 | 146 | 118 | 88 | 72 | 64 | 72 | 38 | 36 | 40 |
| Training subset (N) | 2237 | 2348 | 2453 | 2625 | 2682 | 2729 | 2772 | 2817 | 2841 | 2857 |
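A quick arithmetic check on the table above: subtracting the negative-sample count from the training-subset size gives the same remainder for every classifier. Reading that remainder as the number of positive samples is an assumption, since the table does not label it; the variable names are illustrative.

```python
# Values copied from the table above, one entry per classifier (1 to 10).
negatives = [1119, 1230, 1335, 1507, 1564, 1611, 1654, 1699, 1723, 1739]
training = [2237, 2348, 2453, 2625, 2682, 2729, 2772, 2817, 2841, 2857]

# Remainder of each training subset once the negatives are removed
# (presumably the positive samples; this reading is an assumption).
implied_positives = [t - n for t, n in zip(training, negatives)]
print(set(implied_positives))  # {1118}

# Share of negatives in each training subset, to show how the balance drifts.
negative_share = [round(n / t, 3) for n, t in zip(negatives, training)]
print(negative_share)
```

The first subset is therefore almost exactly balanced (1119 negatives against 1118 other samples), and later subsets grow only on the negative side.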
The parameters of 10 classifiers.
| Classifier | ε | α | g | c | Acc (%) |
|---|---|---|---|---|---|
| classifier 1 | 0.013 | 2.1605 | 0.5 | 2 | 87.07 |
| classifier 2 | 0.1193 | 0.9996 | 0.5 | 8 | 91.81 |
| classifier 3 | 0.3533 | 0.3023 | 0.5 | 8 | 93.12 |
| classifier 4 | 0.3611 | 0.2853 | 0.5 | 2 | 94.51 |
| classifier 5 | 0.3513 | 0.3067 | 0.5 | 8 | 95.26 |
| classifier 6 | 0.3878 | 0.2283 | 0.5 | 2 | 95.63 |
| classifier 7 | 0.4169 | 0.1677 | 0.5 | 8 | 95.26 |
| classifier 8 | 0.1613 | 0.8241 | 0.5 | 2 | 96.84 |
| classifier 9 | 0.4542 | 0.0917 | 0.5 | 2 | 96.93 |
| classifier 10 | 0.4337 | 0.1334 | 0.5 | 8 | 96.74 |
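The first two numeric columns of the table above are consistent with the standard AdaBoost relation between a classifier's weighted error and its vote weight, weight = ½ ln((1 − error) / error). The check below recomputes the second column from the first; treating the first column as the weighted error is an inference from the numbers, not something the table states, and the names are illustrative.

```python
import math

# (first numeric column, second numeric column) for classifiers 1 to 10.
rows = [
    (0.013, 2.1605), (0.1193, 0.9996), (0.3533, 0.3023), (0.3611, 0.2853),
    (0.3513, 0.3067), (0.3878, 0.2283), (0.4169, 0.1677), (0.1613, 0.8241),
    (0.4542, 0.0917), (0.4337, 0.1334),
]

def adaboost_weight(error):
    """Standard AdaBoost vote weight for a base learner with this weighted error."""
    return 0.5 * math.log((1 - error) / error)

for index, (error, reported) in enumerate(rows, start=1):
    print(f"classifier {index}: computed {adaboost_weight(error):.4f}, reported {reported}")

# Largest disagreement between the recomputed and the reported weight.
max_gap = max(abs(adaboost_weight(error) - reported) for error, reported in rows)
print(f"largest gap: {max_gap:.4f}")
```

The recomputed values agree with the reported ones to within rounding of the tabulated inputs, with classifier 1 showing the largest gap because its first column is given to only three decimals.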
Figure 4. The accuracy rates of the multiple SVM classifiers over the training data.
The accuracy (Acc) of the multiple SVM classifiers over the training data for different position deviations.
| Classifier | 0 nt | 1 nt | 2 nt | 3 nt | 4 nt | 5 nt |
|---|---|---|---|---|---|---|
| classifier 1 | 0.8707 | 0.8986 | 0.9293 | 0.9516 | 0.9516 | 0.9647 |
| classifier 2 | 0.9181 | 0.9451 | 0.9563 | 0.9665 | 0.9665 | 0.9721 |
| classifier 3 | 0.9312 | 0.9386 | 0.9460 | 0.9609 | 0.9609 | 0.9684 |
| classifier 4 | 0.9451 | 0.9479 | 0.9544 | 0.9563 | 0.9563 | 0.9674 |
| classifier 5 | 0.9526 | 0.9544 | 0.9600 | 0.9637 | 0.9637 | 0.9684 |
| classifier 6 | 0.9563 | 0.9628 | 0.9647 | 0.9665 | 0.9665 | 0.9730 |
| classifier 7 | 0.9526 | 0.9535 | 0.9553 | 0.9581 | 0.9581 | 0.9665 |
| classifier 8 | 0.9684 | 0.9693 | 0.9702 | 0.9721 | 0.9721 | 0.9758 |
| classifier 9 | 0.9693 | 0.9712 | 0.9721 | 0.9721 | 0.9721 | 0.9740 |
| classifier 10 | 0.9674 | 0.9684 | 0.9684 | 0.9693 | 0.9693 | 0.9749 |
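One plausible reading of the columns above is that a prediction counts as correct at "k nt" when the predicted start position lies within k nucleotides of the annotated one, which would explain why accuracy never decreases from left to right. The sketch below computes accuracy under that assumed definition; the function name and the start positions are hypothetical and are not data from the paper.

```python
def accuracy_within(predicted, actual, tolerance):
    """Fraction of predictions whose start position is at most
    `tolerance` nucleotides away from the annotated start (assumed definition)."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)

# Hypothetical predicted and annotated start positions for five pre-miRNAs.
predicted_starts = [10, 12, 15, 20, 8]
actual_starts = [10, 13, 15, 17, 14]

for tolerance in range(6):
    print(f"{tolerance} nt: {accuracy_within(predicted_starts, actual_starts, tolerance):.2f}")
```

Because each tolerance includes all smaller deviations, the resulting series is non-decreasing by construction, as in the table.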
Figure 5. The accuracy rate of the balanced-data-based and imbalanced-data-based methods over the test dataset.
The first candidate’s accuracy rate of the balanced-data-based and imbalanced-data-based methods over the test dataset.
| Classifier | 0 nt | 1 nt | 2 nt | 3 nt | 4 nt | 5 nt | Sum |
|---|---|---|---|---|---|---|---|
| Mat_SVM(%) | 0.30 | 0.23 | 0.16 | 0.11 | 0.07 | 0.03 | 1 |
| MatFind(%) | 0.33 | 0.24 | 0.14 | 0.09 | 0.09 | 0.03 | 1 |
The top 5 candidates’ accuracy rate of the balanced-data-based and imbalanced-data-based methods over the test dataset.
| Classifier | 0 nt | 1 nt | 2 nt | 3 nt | 4 nt | 5 nt | Sum |
|---|---|---|---|---|---|---|---|
| Mat_SVM (%) | 0.59 | 0.32 | 0.02 | 0.04 | 0.03 | 0 | 1 |
| MatFind(%) | 0.70 | 0.17 | 0.07 | 0.03 | 0.03 | 0 | 1 |
Figure 6. The prediction results of miRdup, MatureBayes, MiRPara and MatFind.