Xiangzheng Fu, Wen Zhu, Lijun Cai, Bo Liao, Lihong Peng, Yifan Chen, Jialiang Yang.
Abstract
MicroRNAs (miRNAs) are a family of short non-coding RNAs that play critical roles as post-transcriptional regulators and are derived from longer transcripts called precursor miRNAs (pre-miRNAs). Experimental methods for identifying pre-miRNAs are expensive and time-consuming, which creates a need for computational alternatives. In recent years, the accuracy of computational pre-miRNA predictors has increased significantly, but several drawbacks remain. First, these methods usually consider only base frequencies or sequence information and ignore the information between bases. Second, feature extraction methods based on secondary structure usually consider only global characteristics and ignore the mutual influence of local structures. Third, methods that integrate high-dimensional feature information are computationally inefficient. In this study, we propose a novel mutual information-based feature representation algorithm for pre-miRNA sequences and secondary structures that captures both the interactions between sequence bases and the local features of the RNA secondary structure. In addition, the feature space is smaller than that of most popular methods, which makes our method computationally more efficient than its competitors. Finally, we used these features to train a support vector machine model to predict pre-miRNAs and compared the results with those of other popular predictors. Our method outperforms the others under both 5-fold cross-validation and the Jackknife test.
Keywords: feature representation algorithm; mutual information; pre-miRNAs identification; structure analysis; support vector machine
Year: 2019 PMID: 30858864 PMCID: PMC6397858 DOI: 10.3389/fgene.2019.00119
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
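The abstract's central idea, scoring the information shared between neighbouring bases rather than counting base frequencies alone, can be illustrated with a small sketch. The function below computes a generic mutual-information term, p(xy) · log2(p(xy) / (p(x) · p(y))), for each of the 16 adjacent base pairs (2-grams). The function name, the toy sequence, and this exact formula are assumptions made for illustration; the paper's own feature definition may differ.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGU"

def adjacent_base_mi(seq):
    """Return a dict mapping each 2-gram xy to p(xy) * log2(p(xy) / (p(x) * p(y))).

    Illustrative only: a generic mutual-information term per adjacent base
    pair, not necessarily the formula used by the authors.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    features = {}
    for x, y in product(BASES, repeat=2):
        p_xy = pair_counts[x + y] / (n - 1)
        if p_xy == 0:
            # An unseen 2-gram contributes nothing (limit of p * log p as p -> 0).
            features[x + y] = 0.0
            continue
        p_x = base_counts[x] / n
        p_y = base_counts[y] / n
        features[x + y] = p_xy * math.log2(p_xy / (p_x * p_y))
    return features

# Toy sequence, not a real pre-miRNA.
toy_features = adjacent_base_mi("ACACACAC")
```

A positive value means the two bases occur next to each other more often than their individual frequencies would predict; a negative value means less often.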
Figure 1. The overall framework of the proposed method for predicting pre-miRNAs.
Figure 2. The 2-gram and 3-gram feature representation.
Figure 3. The pre-miRNA secondary structure of miRNA hsa-mir-302f.
The performance of different features on benchmark dataset (Jackknife test evaluation).
| PSFMI | 67.99 | 69.04 | 68.52 | 0.370 |
| SSFMI | 80.21 | 88.34 | 84.27 | 0.688 |
| PSFMI+MFE | 78.60 | 84.93 | 81.76 | 0.637 |
| SSFMI+MFE |
The best values are shown in boldface.
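The column headers of the table above did not survive in this record. As a hedged sanity check, the sketch below assumes the first two numeric columns are sensitivity and specificity (in percent) measured on a class-balanced dataset, and derives accuracy and the Matthews correlation coefficient (MCC) from them. The function name and the balanced-class assumption are illustrative and are not stated in the source.

```python
import math

def acc_and_mcc_balanced(se_pct, sp_pct):
    """Accuracy (%) and MCC from sensitivity and specificity (%).

    Assumes equal numbers of positive and negative samples, so the
    confusion matrix can be written per class as fractions.
    """
    se, sp = se_pct / 100.0, sp_pct / 100.0
    tp, fn = se, 1.0 - se          # positives: hit / miss
    tn, fp = sp, 1.0 - sp          # negatives: correct reject / false alarm
    acc = 100.0 * (tp + tn) / 2.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# PSFMI row of the table above: 67.99 and 69.04.
acc, mcc = acc_and_mcc_balanced(67.99, 69.04)
print(f"ACC = {acc:.3f}, MCC = {mcc:.4f}")
```

Under these assumptions the derived values agree, to rounding, with the third and fourth columns of the PSFMI row (68.52 and 0.370), which supports reading those columns as accuracy and MCC.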
Figure 4. The AUROC comparison of four feature combinations through the Jackknife cross-validation.
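For readers unfamiliar with the AUROC reported here, a minimal sketch follows. It uses the rank interpretation of the area under the ROC curve: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are assumptions for illustration and are unrelated to the paper's data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels; scores: classifier outputs,
    where a higher score means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; sorting-based implementations are preferable for large datasets.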
Importance of the relatively specific features in the proposed features set.
| 1 | PSFMI_1 | 1.0000 |
| 2 | PSFMI_2 | 1.0000 |
| 3 | PSFMI_3 | 0.9981 |
| 4 | PSFMI_4 | 0.9975 |
| 5 | SSFMI_1 | 0.9963 |
| 6 | PSFMI_5 | 0.9963 |
| 7 | PSFMI_6 | 0.9933 |
| 8 | SSFMI_2 | 0.9890 |
| 9 | SSFMI_3 | 0.9772 |
| 10 | PSFMI_7 | 0.9750 |
| 11 | PSFMI_8 | 0.9739 |
| 12 | PSFMI_9 | 0.9722 |
| 13 | PSFMI_10 | 0.9717 |
| 14 | SSFMI_4 | 0.9680 |
| 15 | PSFMI_11 | 0.9625 |
| 16 | SSFMI_5 | 0.9608 |
| 17 | PSFMI_12 | 0.9423 |
| 18 | PSFMI_13 | 0.9143 |
| 19 | SSFMI_6 | 0.8940 |
| 20 | SSFMI_7 | 0.8936 |
| 21 | PSFMI_14 | 0.8916 |
| 22 | SSFMI_8 | 0.8909 |
| 23 | SSFMI_9 | 0.8870 |
| 24 | SSFMI_10 | 0.8787 |
| 25 | SSFMI_11 | 0.8624 |
| 26 | PSFMI_15 | 0.8429 |
| 27 | SSFMI_12 | 0.8387 |
| 28 | SSFMI_13 | 0.8364 |
| 29 | SSFMI_14 | 0.8282 |
| 30 | PSFMI_16 | 0.7897 |
| 31 | PSFMI_17 | 0.7859 |
| 32 | PSFMI_18 | 0.7851 |
| 33 | PSFMI_19 | 0.7386 |
| 34 | PSFMI_20 | 0.6681 |
| 35 | SSFMI_15 | 0.6008 |
| 36 | SSFMI_16 | 0.6008 |
| 37 | SSFMI_17 | 0.6008 |
| 38 | SSFMI_18 | 0.5995 |
| 39 | MFE | 0.4575 |
| 40 | PSFMI_21 | 0.3508 |
| 41 | PSFMI_22 | 0.3504 |
| 42 | PSFMI_23 | 0.3351 |
| 43 | PSFMI_24 | 0.3218 |
| 44 | SSFMI_19 | 0.2647 |
| 45 | SSFMI_20 | 0.1044 |
| 46 | PSFMI_25 | 0.0058 |
| 47 | PSFMI_26 | 0.0057 |
| 48 | PSFMI_27 | 0.0052 |
| 49 | PSFMI_28 | 0.0032 |
| 50 | PSFMI_29 | 0.0025 |
| 51 | PSFMI_30 | 0.0025 |
| 52 | PSFMI_31 | 0.0023 |
| 53 | PSFMI_32 | 0.0009 |
| 54 | PSFMI_33 | 0.0005 |
| 55 | PSFMI_34 | 0.0003 |
Comparison of performance of different kernel functions on the benchmark dataset S1 (Jackknife test evaluation).
| SVM (linear kernel) | 88.83 | 92.12 | 90.48 | 0.810 | 96.20 |
| SVM (polynomial kernel) | 84.86 | 85.86 | 85.36 | 0.707 | 93.04 |
| SVM (rbf kernel) | 88.59 | ||||
| SVM (sigmoid kernel) | 91.94 | 90.45 | 0.809 | 96.26 |
The best values are shown in boldface.
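The Jackknife test used in these tables is leave-one-out evaluation: each sample is held out once, a model is trained on the remaining samples, and the held-out sample is then predicted. The sketch below shows only that evaluation loop. It deliberately substitutes a nearest-centroid classifier for the SVM so that it runs with the standard library alone; the function names and the toy data are assumptions for illustration.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Placeholder classifier (NOT an SVM): predict the label whose
    training centroid is closest to x in squared Euclidean distance."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda lab: dist2(centroids[lab], x))

def jackknife_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy, well-separated two-class data (not taken from the paper).
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_acc = jackknife_accuracy(toy_x, toy_y)
```

Because the loop trains one model per sample, its cost grows with both dataset size and feature dimension, which is why a compact feature set matters for this kind of test.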
A brief introduction to the state-of-the-art predictors.
| Triplet-SVM | SVM | 32 | No parameter |
| miRNApre | SVM | 98 | No parameter |
| iMiRNA-SSF | SVM | 98 | No parameter |
| iMcRNA-PseSSC | SVM | 113 | |
| iMiRNA-PseDPC | SVM | 725 | |
| Our method | SVM | 55 | No parameter |
ACC's best parameter settings.
Results of the proposed method and state-of-the-art predictors on benchmark dataset S1 (Jackknife test evaluation).
| iMiRNA-SSF | 86.91 | 88.09 | 0.762 | 94.64 | |
| Triplet-SVM | 82.44 | 85.24 | 83.84 | 0.677 | 91.97 |
| miRNApre | 84.24 | 87.90 | 86.07 | 0.722 | 93.49 |
| iMcRNA-PseSSC | 84.55 | 86.41 | 85.48 | 0.710 | 93.22 |
| iMiRNA-PseDPC | 86.72 | 89.21 | 87.97 | 0.760 | 94.97 |
| Our method | 88.59 |
The best values are shown in boldface.
Results of the proposed method and state-of-the-art predictors on benchmark dataset S2 (Jackknife test evaluation).
| iMiRNA-SSF | 84.49 | 85.86 | 85.17 | 0.704 | 92.03 |
| Triplet-SVM | 82.07 | 84.12 | 83.10 | 0.662 | 90.86 |
| miRNApre | 84.80 | 86.79 | 85.79 | 0.716 | 92.81 |
| iMcRNA-PseSSC | 80.02 | 82.75 | 81.39 | 0.628 | 89.71 |
| iMiRNA-PseDPC | 87.16 | 87.41 | 0.748 | 94.79 | |
| Our method | 87.28 |
The best values are shown in boldface.
Comparing the proposed method with other state-of-the-art predictors on an independent dataset S3.
| miRNApre | 80.68 | 48.86 | 64.77 | 0.312 | 72.02 |
| iMiRNA-PseDPC | 47.73 | 65.91 | 0.342 | ||
| iMcRNA-PseSSC | 80.68 | 56.82 | 68.75 | 0.386 | 75.81 |
| Triplet-SVM | 80.68 | 46.59 | 63.64 | 0.290 | 68.92 |
| Our method | 76.14 | 75.54 |
The best values are shown in boldface.
Five-fold cross-validation prediction performance of the proposed method and four state-of-the-art predictors on the imbalanced benchmark datasets S4 and S5.
| iMcRNA-PseSSC | 0.9103 | 0.5707 | 0.6743 | 0.9333 | 0.7157 | 0.7628 |
| iMiRNA-PseDPC | 0.9333 | 0.6404 | 0.7317 | 0.9534 | 0.7708 | 0.8259 |
| Triplet-SVM | 0.8905 | 0.5207 | 0.6364 | 0.9357 | 0.7182 | 0.7806 |
| miRNApre | 0.7029 | 0.9454 | 0.7447 | 0.8140 | ||
| Our method | 0.9526 | 0.7694 |
The best values are shown in boldface.
Figure 5. The AUROC curves of our method on the imbalanced benchmark datasets S4 and S5 via 5-fold cross-validation.
Figure 6. The AUPR curves of our method on the imbalanced benchmark datasets S4 and S5 via 5-fold cross-validation.
Comparing the proposed method and state-of-the-art predictors on the benchmark dataset S6 (using the Jackknife test).
| miRNApre | 92.59 | 88.43 | 90.51 | 0.811 | 97.11 |
| Triplet-SVM | 89.35 | 88.43 | 88.89 | 0.778 | 95.85 |
| iMiRNA-PseDPC | 91.67 | 91.67 | 91.67 | 0.833 | 97.41 |
| iMcRNA-PseSSC | 89.81 | 87.96 | 88.89 | 0.778 | 95.55 |
| Our method |
The best values are shown in boldface.
False-positive pre-miRNAs predicted to be negative by our method.
| hsa-mir-566 | |
| hsa-mir-3607 | |
| hsa-mir-3656 | |
| hsa-mir-4417 | |
| hsa-mir-4459 | |
| hsa-mir-4792 | |
| hsa-mir-6723 | |
| hsa-mir-7641-1 |
The running time (in seconds) of different methods on benchmark dataset S6 using the Jackknife test, where C and γ denote the penalty coefficient of the SVM model and the parameter of the RBF kernel function, respectively.
| miRNApre | 0.25 | 16 | 133.159 | 90.51 | 97.11 |
| Triplet-SVM | 0.0156 | 2 | 21.689 | 88.89 | 95.85 |
| iMiRNA-PseDPC | 0.25 | 16 | 543.803 | 91.67 | 97.41 |
| iMcRNA-PseSSC | 0.0039 | 39 | 50.459 | 88.89 | 95.55 |
| Our method | 0.0156 | 16 | 19.712 | 92.59 | 98.07 |