Peng Jiang, Haonan Wu, Wenkai Wang, Wei Ma, Xiao Sun, Zuhong Lu.
Abstract
To distinguish real pre-miRNAs from other hairpin sequences with similar stem-loops (pseudo pre-miRNAs), a hybrid feature is used which consists of local contiguous structure-sequence composition, the minimum of free energy (MFE) of the secondary structure and the P-value of a randomization test. In addition, a novel machine-learning algorithm, random forest (RF), is introduced. The results suggest that our method predicts at 98.21% specificity and 95.09% sensitivity. When compared with the previous method, Triplet-SVM-classifier, our RF method was nearly 10% greater in total accuracy. Further analysis indicated that the improvement was due to both the combined features and the RF algorithm. The MiPred web server is available at http://www.bioinf.seu.edu.cn/miRNA/. Given a sequence, MiPred decides whether it is a pre-miRNA-like hairpin sequence or not. If the sequence is a pre-miRNA-like hairpin, the RF classifier will predict whether it is a real pre-miRNA or a pseudo one.
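The randomization test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so several points here are assumptions: the sequence is shuffled base by base (preserving composition), the P-value uses the common add-one Monte Carlo estimator, and `toy_energy` is a hypothetical stand-in for a real secondary-structure energy predictor, included only so the example runs without external tools.

```python
import random
from typing import Callable


def randomization_p_value(
    seq: str,
    energy_fn: Callable[[str], float],
    n_shuffles: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of shuffled sequences that score at least as stable as the original."""
    rng = random.Random(seed)
    observed = energy_fn(seq)
    at_least_as_stable = 0
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)  # assumption: mononucleotide shuffle, keeps base composition
        if energy_fn("".join(letters)) <= observed:
            at_least_as_stable += 1
    # Assumption: add-one estimator, so the estimate is never exactly zero.
    return (at_least_as_stable + 1) / (n_shuffles + 1)


def toy_energy(seq: str) -> float:
    """Hypothetical placeholder for an MFE predictor; NOT a folding model.

    It counts complementary pairs between each position and its mirror
    position and returns the negated count, so lower means "more stable".
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in pairs for i in range(n // 2)))


if __name__ == "__main__":
    print(randomization_p_value("GGGGAAAACCCC", toy_energy, n_shuffles=200, seed=1))
```

A small P-value indicates that the native sequence scores as more stable than most of its shuffled versions. In practice `energy_fn` would wrap a real folding program rather than the toy function above.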
Year: 2007 PMID: 17553836 PMCID: PMC1933124 DOI: 10.1093/nar/gkm368
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The performance of RF prediction modules based on various features. The prediction system was assessed by OOB estimation on data set 1.
| Features | Sp (%) | Se (%) | ACC (%) | MCC |
|---|---|---|---|---|
| A | 90.48 | 85.89 | 88.21 | 0.77 |
| A + B | 95.24 | 91.41 | 93.35 | 0.87 |
| A + C | 97.62 | 94.47 | 96.07 | 0.92 |
| A + B + C | 98.21 | 95.09 | 96.68 | 0.94 |
A: local contiguous triplet structure composition;
B: Minimum of free energy (MFE) of the secondary structure;
C: P-value.
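Feature set A can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each feature is taken to be the middle nucleotide of three adjacent positions combined with their paired/unpaired pattern, both `(` and `)` are collapsed to `(`, every interior position is counted, and counts are normalized to frequencies. The function name `triplet_features` is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict


def triplet_features(seq: str, structure: str) -> Dict[str, float]:
    """Frequencies of 'middle nucleotide + 3-position pairing pattern' (32 keys)."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    # Collapse ')' into '(' so only paired vs. unpaired status remains.
    paired = structure.replace(")", "(")
    # 4 nucleotides x 2^3 pairing patterns = 32 possible features.
    keys = [n + "".join(p) for n in "ACGU" for p in product("(.", repeat=3)]
    valid = set(keys)
    counts: Counter = Counter()
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1 : i + 2]
        if key in valid:  # skip ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}


if __name__ == "__main__":
    feats = triplet_features("GGGAAACCC", "(((...)))")
    print({k: round(v, 3) for k, v in feats.items() if v > 0})
```

Appending the MFE (B) and the P-value (C) to this 32-element vector would give the 34 features of the combined "A + B + C" model.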
Estimating and ranking the relative importance of the features
| Rank | Features | Mean decrease accuracy (%) |
|---|---|---|
| 1 | P-value | 15.80 |
| 2 | MFE | 5.48 |
| 3 | C... | 2.04 |
| 4 | U((( | 2.00 |
| 5 | A((( | 1.49 |
| 6 | A... | 0.83 |
| 7 | G... | 0.76 |
| 8 | U..( | 0.43 |
| 9 | G.(( | 0.34 |
| 10 | A(.. | 0.31 |
| 11 | C((. | 0.31 |
| 12 | G..( | 0.29 |
| 13 | G(.. | 0.29 |
| 14 | U((. | 0.27 |
| 15 | U... | 0.26 |
| 16 | U(.( | 0.24 |
| 17 | G((( | 0.23 |
| 18 | C((( | 0.20 |
| 19 | A..( | 0.20 |
| 20 | U(.. | 0.19 |
| 21 | C(.. | 0.14 |
| 22 | U.(. | 0.14 |
| 23 | G((. | 0.09 |
| 24 | C.(. | 0.09 |
| 25 | A(.( | 0.08 |
| 26 | C..( | 0.08 |
| 27 | C.(( | 0.07 |
| 28 | A.(. | 0.07 |
| 29 | G(.( | 0.06 |
| 30 | A.(( | 0.03 |
| 31 | G.(. | 0.02 |
| 32 | A((. | 0.00 |
| 33 | U.(( | 0.00 |
| 34 | C(.( | 0.00 |
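The idea behind a "mean decrease accuracy" ranking can be sketched with a simplified permutation test. This is an approximation under assumptions: a random forest normally computes this measure per tree on its out-of-bag samples, whereas the sketch below permutes one feature column at a time on a fixed evaluation set for an arbitrary classifier. The names `accuracy` and `mean_decrease_accuracy` are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict: Callable[[Sequence[float]], int],
                           X: Sequence[Sequence[float]], y: Sequence[int],
                           n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Average accuracy drop (in percentage points) after shuffling each feature."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [list(row[:j]) + [v] + list(row[j + 1:])
                        for row, v in zip(X, column)]
            total += baseline - accuracy(predict, permuted, y)
        drops.append(100.0 * total / n_repeats)
    return drops


if __name__ == "__main__":
    # Toy data: the label depends only on feature 0.
    X = [[i % 2, (i * 7) % 5] for i in range(20)]
    y = [i % 2 for i in range(20)]
    print(mean_decrease_accuracy(lambda row: int(row[0] > 0.5), X, y))
```

A feature the classifier ignores yields a drop of zero, which is how entries such as the 0.00 rows at the bottom of the table would arise.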
Comparison with the existing method and the competing method. All algorithms were trained on the same training data set and tested on the same test data set.
| Methods | Sp (%) | Se (%) | ACC (%) | MCC |
|---|---|---|---|---|
| RF | 93.21 | 89.35 | 91.29 | 0.826 |
| SVM | 90.94 | 87.83 | 89.39 | 0.788 |
| Triplet-SVM-classifier | 88.30 | 79.47 | 83.90 | 0.681 |
RF: An RF-based method with ‘P-value + MFE + local contiguous triplet structure composition’ features;
SVM: An SVM-based method with ‘P-value + MFE + local contiguous triplet structure composition’ features;
Triplet-SVM-classifier: An SVM-based method with ‘local contiguous triplet structure composition’ features.
Prediction accuracy on an independent data set test
| Species | MiPred accuracy | miR- accuracy |
|---|---|---|
| | 100% (2/2) | 0% (0/2) |
| | 100% (1/1) | 100% (1/1) |
| | 100% (18/18) | 33.3% (6/18) |
| | 100% (4/4) | 25% (1/4) |
| | 100% (16/16) | 68.75% (11/16) |
| Total | 100% (41/41) | 46.34% (19/41) |
Figure 1. Prediction results of a query RNA sequence.