| Literature DB >> 23012261 |
Supatcha Lertampaiporn, Chinae Thammarongtham, Chakarida Nukoolkit, Boonserm Kaewkamnerdpong, Marasri Ruengjitchatchawalya.
Abstract
An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was proposed, combining a set of heterogeneous algorithms, including support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF), and aggregating their predictions through a voting system. In addition, the classification performance of the proposed algorithm was improved using discriminative features, self-containment and its derivatives, which capture the unique structural robustness of pre-miRNAs and are applicable across different species. Applying two preprocessing methods, correlation-based feature selection (CFS) with a genetic algorithm (GA) search and a modified Synthetic Minority Oversampling Technique (SMOTE) bagging rebalancing method, further improved the performance of the ensemble. The overall prediction accuracy obtained via 10 runs of 5-fold cross-validation (CV) was 96.54%, with a sensitivity of 94.8% and a specificity of 98.3%, a better trade-off between sensitivity and specificity than that of other state-of-the-art methods. The ensemble model was applied to animal, plant and virus pre-miRNAs and achieved high accuracy (>93%). Exploiting the discriminative set of selected features also suggests that pre-miRNAs possess high intrinsic structural robustness compared with other stem-loops. Our heterogeneous ensemble method gave more reliable predictions than single classifiers. Our program is available at http://ncrna-pred.com/premiRNA.html.
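The voting step described in the abstract can be sketched as a hard majority vote over per-classifier predictions (a minimal illustration only; training of the SVM, kNN and RF base classifiers is omitted, and the function name is an assumption, not the authors' code):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate hard predictions from several classifiers.

    predictions: list of per-classifier label lists, all the same length,
    e.g. three classifiers' calls on the same pre-miRNA candidates
    (1 = real pre-miRNA, 0 = pseudo hairpin).
    Ties are broken in favour of the negative class.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf[i] for clf in predictions)
        # require a strict majority for the positive class
        voted.append(1 if votes[1] > len(predictions) / 2 else 0)
    return voted

# Example: SVM, kNN and RF predictions for four candidate hairpins
svm = [1, 0, 1, 1]
knn = [1, 1, 0, 1]
rf  = [0, 0, 1, 1]
print(majority_vote([svm, knn, rf]))  # -> [1, 0, 1, 1]
```

Requiring a strict majority for the positive class keeps the false positive rate low, which matters here because genome-wide scans present far more pseudo hairpins than real pre-miRNAs.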
Year: 2012 PMID: 23012261 PMCID: PMC3592496 DOI: 10.1093/nar/gks878
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
List of 125 features used in this work
| Feature groups | No. of features | Feature symbol |
|---|---|---|
| Sequence-based features | 19 | Len, %G+C, %A+U, %AA, %AC, %AG, %AU, %CA, %CC, %CG, %CU, %GA, %GC, %GG, %GU, %UA, %UC, %UG, %UU |
| Secondary structure features | 30 | MFE, efe, MFEI1, MFEI2, MFEI3, MFEI4, dG, dQ, dD, dF, Prob, zG, zQ, zD, zF, nefe, Freq, diff, dH, dH/L, dS, dS/L, Tm, Tm/L |
| Base pair features | 32 | dP, zP, div, tot_bp, stem, loop, A-U/L, G-U/L, G-C/L, %A–U/Stem, %G–C/Stem, %G–U/Stem |
| Triplet sequence structure | 32 | A(((, A((., A(.., A(.(, A.((, A.(., A..(, A..., C(((, C((., C(.., C(.(, C.((, C.(., C..(, C..., G(((, G((., G(.., G(.(, G.((, G.(., G..(, G..., U(((, U((., U(.., U(.(, U.((, U.(., U..(, U... |
| Structural robustness features (SC-derived features) | 12 | SC, SC × dP, SC/(1 − dP), SC × dP/(1 − dP), SC × MFE/Mean_dG, SC × zG, SC/tot_bp, SC/Len, SC/NonBP_A, SC/NonBP_C, SC/NonBP_G, SC/NonBP_U |
| Total | 125 | |
Our additional features are shown in bold.
Figure 1. (A) Overview of the proposed ensemble method: the training process is shown by dark thick arrows; the testing process is shown by white arrows. (B) Rebalancing of the class distribution: the imbalanced training data were processed to obtain four subsets of training data with a balanced distribution between positive and negative classes.
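The SMOTE step behind the rebalancing in Figure 1B synthesizes new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal numpy sketch of classic SMOTE follows (not the authors' modified-SMOTEBagging variant; function name and defaults are assumptions):

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic new minority samples by linear interpolation
    between a randomly chosen sample and one of its k nearest
    minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)
        # Euclidean distances from sample i to every minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Oversample a toy 2-feature minority class from 4 to 10 samples
pos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new = smote(pos, n_synthetic=6)
print(new.shape)  # (6, 2)
```

Because the synthetic points lie on segments between real minority samples, they enlarge the minority region without duplicating exact feature vectors, which is what lets the balanced ensemble in Figure 1B raise sensitivity without memorizing the positives.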
Predictive performance of each feature group by 5-fold CV
| Feature groups | No. of features | Sn | Sp | Gm |
|---|---|---|---|---|
| Sequence features | 19 | 45.0 | 96.3 | 65.83 |
| Structure features | 30 | 82.8 | 97.6 | 89.90 |
| Base pair features | 32 | 81.5 | 97.8 | 89.28 |
| Triplet sequence structure | 32 | 77.7 | 97.1 | 86.86 |
| SC-related features | 12 | 84.5 | 98.4 | 91.18 |
| All five feature groups | 125 | 84.0 | 98.3 | 90.86 |
| Feature SC | 1 | 76.6 | 96.9 | 86.15 |
| Feature SC × dP | 1 | 80.5 | 98.1 | 88.86 |
| Feature SC/(1 − dP) | 1 | 81.3 | 97.9 | 89.76 |
| Feature SC × dP/(1 − dP) | 1 | 82.2 | 98.2 | 89.84 |
| Feature SC × MFE/Mean_dG | 1 | 81.9 | 97.5 | 87.76 |
| Feature SC × zG | 1 | 78.9 | 98.8 | 89.36 |
| Feature SC/tot_bp | 1 | 0 | 100 | 0 |
| Feature SC/Len | 1 | 0 | 100 | 0 |
| Feature SC/NonBP_A | 1 | 68.5 | 98.4 | 82.09 |
| Feature SC/NonBP_C | 1 | 0 | 100 | 0 |
| Feature SC/NonBP_G | 1 | 0 | 100 | 0 |
| Feature SC/NonBP_U | 1 | 48.8 | 98.4 | 69.29 |
Sn = Sensitivity, Sp = Specificity and Gm = Geometric mean.
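The Sn, Sp and Gm columns follow directly from the confusion-matrix counts; a minimal sketch (function name is an assumption):

```python
import math

def sn_sp_gm(tp, fn, tn, fp):
    """Sensitivity, specificity and their geometric mean, in percent."""
    sn = 100.0 * tp / (tp + fn)   # true positive rate
    sp = 100.0 * tn / (tn + fp)   # true negative rate
    gm = math.sqrt(sn * sp)       # single score balancing both error types
    return sn, sp, gm

# e.g. 948 of 1000 positives and 983 of 1000 negatives classified correctly
sn, sp, gm = sn_sp_gm(948, 52, 983, 17)
print(round(sn, 1), round(sp, 1), round(gm, 2))  # 94.8 98.3 96.53
```

The geometric mean is a natural summary for imbalanced data: unlike plain accuracy, it collapses to 0 whenever a classifier ignores one class entirely, which is exactly what the Gm = 0 rows in the table above indicate.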
Figure 2.The SC-base pair composite features of Human_miRNA, Plant_miRNA, Other ncRNAs and Pseudo hairpins in our training dataset. (A) Original SC feature. (B) Feature SC × dP. (C) Feature SC/(1 − dP). (D) Feature SC × dP/(1 − dP).
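The four composite features in Figure 2 are simple arithmetic combinations of the self-containment score SC and the base-pairing propensity dP; a minimal sketch (the function name is an assumption):

```python
def sc_composites(sc, dp):
    """Compute the four SC/dP composite features shown in Figure 2.

    sc: self-containment score of the hairpin (0..1)
    dp: base-pairing propensity (0 <= dp < 1)
    Returns a dict keyed by the feature symbols used in the paper.
    """
    return {
        "SC": sc,
        "SCxdP": sc * dp,
        "SC/(1-dP)": sc / (1.0 - dp),
        "SCxdP/(1-dP)": sc * dp / (1.0 - dp),
    }

print(sc_composites(0.9, 0.5))
```

Multiplying or dividing by dP couples the robustness score to how extensively the hairpin is paired, which is what makes the composites more discriminative than SC alone in the table above.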
The average performance of different feature selection algorithms on our training data
| Feature subsets | No. of features | Sn (%) | Sp (%) | Gm (%) |
|---|---|---|---|---|
| All features (No FS) | 125 | 84.0 | 98.3 | 90.86 |
| microPred features (J–M) | 21 | 83.0 | 97.9 | 90.14 |
| FS1: ReliefF | 50 | 84.9 | 98.4 | 91.40 |
| FS2: InfoGain | 75 | 84.9 | 98.3 | 91.35 |
| FS3: CFS + GA | 20 | 84.9 | 98.6 | 91.49 |
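FS3 wraps a correlation-based merit score in a genetic-algorithm search over feature-subset bitmasks. The sketch below illustrates the idea with a textbook CFS merit and a tiny elitist GA; the merit formula is the standard CFS one, while the GA parameters and function names are assumptions, not the authors' configuration:

```python
import random

def cfs_merit(subset, class_corr, feat_corr):
    """CFS merit: rewards high feature-class correlation and penalizes
    redundancy among features. subset is a list of feature indices."""
    k = len(subset)
    if k == 0:
        return 0.0
    rcf = sum(class_corr[i] for i in subset) / k
    rff = (sum(feat_corr[i][j] for i in subset for j in subset if i != j)
           / (k * (k - 1)) if k > 1 else 0.0)
    return k * rcf / ((k + k * (k - 1) * rff) ** 0.5)

def ga_select(n_features, fitness, pop=20, gens=30, rng=None):
    """Tiny elitist GA over bitmask individuals with uniform crossover
    and one-bit mutation; returns the fittest mask found."""
    rng = rng or random.Random(0)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop // 2]            # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = [a[i] if rng.random() < 0.5 else b[i]
                     for i in range(n_features)]
            child[rng.randrange(n_features)] ^= 1   # one-bit mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

On the paper's 125 features the GA searches a space of 2^125 masks, which is why a stochastic search is used instead of exhaustive enumeration; FS3's 20-feature subset is one high-merit mask found this way.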
Comparison of the performance of different methods on training data using 20 selected features
| Algorithms | ACC | Sn | Sp | PPV | FPR | AUC |
|---|---|---|---|---|---|---|
| K-nearest neighbors (kNN) | 95.511 | 83.3 | 99.2 | | 0.8 | 0.966 |
| Support vector machine (SVM) | | 85.1 | 98.6 | 94.8 | 1.4 | |
| Artificial neural network (MLP) | 95.283 | 86.5 | 97.9 | 92.4 | 2.1 | 0.964 |
| Decision tree (J48) | 94.581 | 84.4 | 97.6 | 91.3 | 2.4 | 0.920 |
| RBF networks (RBFNets) | 94.352 | 86.4 | 96.7 | 88.7 | 3.3 | 0.968 |
| Rule based (RIPPER) | 94.809 | 84.0 | 98.0 | 92.6 | 2.0 | 0.923 |
| Naïve Bayes (NB) | 93.585 | 85.5 | 96.0 | 86.4 | 4.0 | 0.955 |
| Random forest (RF) | 95.283 | | 97.8 | 92.1 | 2.2 | 0.965 |
Sn = Sensitivity, Sp = Specificity, PPV = Positive predictive value, ACC = Accuracy, FPR = False positive rate and AUC = Area under ROC curve. The highest values are in bold.
The 10 × 5 fold CV generalization performance of balanced and imbalanced ensembles with selected features
| Algorithms | ACC | Sn | Sp | FPR | Gm | AUC |
|---|---|---|---|---|---|---|
| Vote 1 (Imbalanced, all features) | 95.48 | 84.1 | 98.7 | 1.3 | 91.2 | 0.973 |
| Vote 2 (Imbalanced, 20 selected features) | 95.81 | 85.1 | 99.1 | 0.9 | 91.4 | 0.976 |
| Vote 3 (Balanced, 20 selected features) | 96.54 | 94.8 | 98.3 | 1.7 | 96.5 | 0.996 |
ACC = Accuracy, Sn = Sensitivity, Sp = Specificity, FPR = False positive rate, Gm = Geometric mean and AUC = Area under the ROC curve.
Prediction performance of the automated classifier, yasMir, miPred, Triplet-SVM and our method, evaluated on the same testing data sets
| Test sets | Automated classifier | yasMiR | miPred | Triplet-SVM | Our method |
|---|---|---|---|---|---|
| TE-H | 94.30 | 93.77 | 93.50 | 87.96 | |
| IE-NH | 94.91 | 94.11 | 86.15 | 95.31 | |
| IE-NC | 77.71 | 82.95 | 68.68 | 78.37 | |
| IE-M | 96.77 | 87.09 | 0 | | |
TE-H (123 human pre-miRNA and 246 pseudo hairpins), IE-NH (1918 pre-miRNA across 40 non-human species and 3836 pseudo hairpins), IE-NC (12 387 functional ncRNAs) and IE-M (31 mRNAs). The values are percentages of correct prediction for each method on each data set. The highest values are in bold.
Comparison of our method with other methods on the ‘Common test’ testing data set of mirExplorer
| Method | Balance method | CV | SE | SP | Gm | ACC (%) on multiloops |
|---|---|---|---|---|---|---|
| Triplet-SVM (SVM) | – | – | 88.40 | 83.50 | 0.859 | N/A |
| MiPred (random forest) | – | – | 84.34 | 93.56 | 0.888 | N/A |
| microPred (SVM) | SMOTE | outer 5cv | 90.50 | 66.43 | 0.775 | 54.23 |
| mirExplorer (AdaBoost) | SMOTE + undersampling | outer 10cv | 94.32 | 97.11 | 0.957 | 92.68 |
| Our method (ensemble) | Modified-SMOTEBagging | 10x5cv | 95.11 | 97.91 | 0.965 | 97.25 |
SE, SP and ACC represent sensitivity, specificity and accuracy, respectively.
Sensitivity performance on plant species pre-miRNAs
| Species | No. of sequences | PlantMiRNAPred | Triplet-SVM | microPred | yasMir | Our method |
|---|---|---|---|---|---|---|
| ath | 180 | 92.22 | 76.06 | 89.44 | 97.78 | 99.44 |
| osa | 397 | 94.21 | 75.54 | 90.43 | 96.72 | 100 |
| ptc | 233 | 91.85 | 75.21 | 84.98 | 93.99 | 96.99 |
| ppt | 211 | 92.42 | 71.49 | 89.57 | 98.10 | 98.57 |
| mtr | 106 | 100 | 80.18 | 95.28 | 100 | 100 |
| sbi | 131 | 98.47 | 69.51 | 94.66 | 95.42 | 100 |
| zma | 97 | 97.94 | 66.97 | 93.81 | 97.94 | 96.90 |
| gma | 83 | 98.31 | 74.12 | 86.75 | 96.38 | 98.79 |
| updated aly | 191 | 97.91 | 70.98 | 91.62 | 100 | 100 |
| updated gma | 118 | 98.31 | 79.66 | 93.22 | 100 | 100 |
All methods were tested on the testing data set of PlantMiRNAPred (14).
Specificity of our ensemble when applied to the negative testing data, compared with yasMir (the 2nd best sensitivity from Table 8)
| Negative data | No. of sequences | Our method: correctly classified (%) | Our method: FPR (%) | yasMir: correctly classified (%) | yasMir: FPR (%) |
|---|---|---|---|---|---|
| Pseudo hairpin | 4494 | 93.74 | 6.26 | 86.91 | 13.09 |
| Shuffle | 21 470 | 88.35 | 11.65 | 83.69 | 16.31 |
| IE-NC (ncRNAs) | 12 387 | 83.22 | 16.78 | 82.95 | 17.05 |
| Average | 12 784 | 88.44 | 11.56 | 84.52 | 15.48 |
Correctly classified (%) is the percentage of sequences correctly classified as non-pre-miRNAs; FPR (%) is the false positive rate.