Bin Xue, David Lipps, Sree Devineni.
Abstract
MiRNAs are short non-coding RNAs of about 22 nucleotides that play critical roles in regulating gene expression. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs. Although very successful, these predictors have faced multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples, far fewer than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets is unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity, which may seriously limit their application. Moreover, as expectations of these computational tools continue to rise, improving prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset is improved by 7% to 93%. The meta-predictor is also shown to be less dependent on datasets and to have a better balance between sensitivity and specificity. This study is important in two respects. First, it shows that the combination of non-linear transformations and artificial neural networks improves on the prediction accuracy of the individual predictors. Second, a new miRNA predictor with significantly improved prediction accuracy is made available to the community for identifying novel miRNAs and the complete set of miRNAs. Source code is available at: https://github.com/xueLab/mirMeta.
Year: 2016 PMID: 28002428 PMCID: PMC5176297 DOI: 10.1371/journal.pone.0168392
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Infrastructure of the meta-predictor.
A query sequence is input into each individual predictor. The outputs of the individual predictors are preprocessed and then fed into an ANN to make a new prediction, which is the output of the meta-predictor. The meta-predictor is therefore composed of individual predictors, preprocessing modules, and an ANN. The parameters of the ANN are trained using datasets containing both positive and negative miRNA samples. Although five individual predictors are shown in the figure, a meta-predictor can be built from any two or more of the five. The total number of possible meta-predictors is 26. The final meta-predictor, mirMeta, contains all five individual predictors.
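The count of 26 possible meta-predictors can be checked with a short enumeration. The sketch below assumes that a meta-predictor needs at least two individual predictors; the function name `enumerate_meta_predictors` is illustrative and not taken from the mirMeta source code.

```python
from itertools import combinations

# Names of the five individual predictors shown in the figure.
PREDICTORS = ["MiPred", "MIReNA", "miRPara", "ProMiR", "TripletSVM"]

def enumerate_meta_predictors(predictors, min_size=2):
    """Return every subset of `predictors` with at least `min_size` members.

    Assumption: a meta-predictor combines two or more individual predictors.
    """
    subsets = []
    for size in range(min_size, len(predictors) + 1):
        subsets.extend(combinations(predictors, size))
    return subsets

subsets = enumerate_meta_predictors(PREDICTORS)

# Tally how many candidate meta-predictors exist for each subset size.
counts_by_size = {}
for subset in subsets:
    counts_by_size[len(subset)] = counts_by_size.get(len(subset), 0) + 1

print(len(subsets))     # 26
print(counts_by_size)   # {2: 10, 3: 10, 4: 5, 5: 1}
```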
Possible outputs of five individual predictors.
| | MiPred | MIReNA | miRPara | ProMiR | TripletSVM |
|---|---|---|---|---|---|
| Positive Prediction | 50 ~ 91 | Yes | 0.8 ~ 1 | 0.017 ~ 3240 | 1 |
| Negative Prediction | 0 | No | 0 | 10e-10 to ~10e-2, 0, NA | NA |
Fig 2. Non-linear transformations change the distribution of ProMiR prediction scores of all the samples in the D1679 dataset.
The upper panels show the distribution of raw ProMiR prediction scores for positive samples (a) and negative samples (b). The inset in (b) shows the distribution of scores for negative samples with the x-axis on a logarithmic scale. The middle panels show the distribution of prediction scores after the preprocess-I transformation for positive samples (I-a) and negative samples (I-b). The lower panels show the scores after the preprocess-II transformation for positive samples (II-a) and negative samples (II-b).
Prediction accuracies of five individual predictors in the D163 and D1679 datasets.
| Predictor | D163 SENS | D163 SPEC | D163 ACC | D163 MCC | D1679 SENS | D1679 SPEC | D1679 ACC | D1679 MCC |
|---|---|---|---|---|---|---|---|---|
| MiPred | 99.4% | 42.9% | 70.7% | 0.51 | 91.6% | 72.0% | 86.0% | 0.65 |
| MIReNA | 47.2% | 96.4% | 72.2% | 0.50 | 46.7% | 69.3% | 53.2% | 0.15 |
| MiRPara | 91.4% | 51.8% | 71.3% | 0.47 | 73.6% | 38.6% | 63.6% | 0.12 |
| ProMiR | 76.1% | 99.4% | 87.9% | 0.78 | 49.4% | 99.1% | 63.7% | 0.46 |
| Triplet-SVM | 91.4% | 51.8% | 71.3% | 0.47 | 84.8% | 49.7% | 74.8% | 0.36 |
Comparison of true predictions between every two individual predictors for all the 163 positive samples in the D163 dataset.
| | MiPred | MIReNA | MiRPara | ProMiR | Triplet-SVM |
|---|---|---|---|---|---|
| MiPred | (162/163) | 77/162 | 149/162 | 77/163 | 149/162 |
| MIReNA | --- | (77/163) | 65/161 | 60/141 | 71/155 |
| MiRPara | --- | --- | (149/163) | 115/158 | 137/161 |
| ProMiR | --- | --- | --- | (124/163) | 118/155 |
| TripSVM | --- | --- | --- | --- | (149/163) |
In each diagonal cell, the number before the slash is the number of true positive (TP) predictions of that predictor, and the number after the slash is the total number of samples. In each off-diagonal cell, the number before the slash is the number of TP predictions shared by the two predictors, and the number after the slash is the total number of non-redundant TP predictions of the two predictors.
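The two numbers in an off-diagonal cell correspond to the sizes of the intersection and the union of the two predictors' sets of correct predictions. A minimal sketch with hypothetical sample IDs (the helper `pairwise_overlap` is an illustrative name, not part of the paper):

```python
def pairwise_overlap(hits_a, hits_b):
    """Return (shared, non_redundant) counts for two sets of correct predictions.

    shared        = predictions made correctly by both predictors (intersection)
    non_redundant = predictions made correctly by at least one (union)
    """
    a, b = set(hits_a), set(hits_b)
    return len(a & b), len(a | b)

# Hypothetical sample IDs correctly predicted by two predictors.
predictor_a_hits = {"s1", "s2", "s3", "s4"}
predictor_b_hits = {"s3", "s4", "s5"}

shared, non_redundant = pairwise_overlap(predictor_a_hits, predictor_b_hits)
print(f"{shared}/{non_redundant}")  # 2/5

# By inclusion-exclusion, non_redundant = |A| + |B| - shared.
assert non_redundant == len(predictor_a_hits) + len(predictor_b_hits) - shared
```

The same identity can be used to sanity-check a cell: for MiPred and MIReNA, 162 + 77 - 77 = 162, which matches the reported 77/162.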
Comparison of true predictions between every two individual predictors for all the 168 negative samples in the D163 dataset.
| | MiPred | MIReNA | MiRPara | ProMiR | Triplet-SVM |
|---|---|---|---|---|---|
| MiPred | (72/168) | 70/164 | 45/141 | 72/168 | 68/158 |
| MIReNA | --- | (162/168) | 87/162 | 161/168 | 83/166 |
| MiRPara | --- | --- | (87/168) | 87/168 | 35/139 |
| ProMiR | --- | --- | --- | (167/168) | 86/168 |
| TripSVM | --- | --- | --- | --- | (154/168) |
In each diagonal cell, the number before the slash is the number of true negative (TN) predictions of that predictor, and the number after the slash is the total number of samples. In each off-diagonal cell, the number before the slash is the number of TN predictions shared by the two predictors, and the number after the slash is the total number of non-redundant TN predictions of the two predictors.
Comparison of true predictions between every two individual predictors for all the 1679 positive samples in the D1679 dataset.
| | MiPred | MIReNA | MiRPara | ProMiR | Triplet-SVM |
|---|---|---|---|---|---|
| MiPred | (1538/1679) | 748/1575 | 1166/1608 | 777/1591 | 1368/1594 |
| MIReNA | --- | (785/1679) | 553/1468 | 494/1121 | 743/1466 |
| MiRPara | --- | --- | (1236/1679) | 625/1441 | 1127/1533 |
| ProMiR | --- | --- | --- | (830/1679) | 782/1472 |
| TripSVM | --- | --- | --- | --- | (1424/1679) |
In each diagonal cell, the number before the slash is the number of true positive (TP) predictions of that predictor, and the number after the slash is the total number of samples. In each off-diagonal cell, the number before the slash is the number of TP predictions shared by the two predictors, and the number after the slash is the total number of non-redundant TP predictions of the two predictors.
Comparison of true predictions between every two individual predictors for all the 674 negative samples in the D1679 dataset.
| | MiPred | MIReNA | MiRPara | ProMiR | Triplet-SVM |
|---|---|---|---|---|---|
| MiPred | (485/674) | 288/664 | 120/625 | 482/668 | 171/536 |
| MIReNA | --- | (467/674) | 251/476 | 462/673 | 197/492 |
| MiRPara | --- | --- | (260/674) | 258/670 | 137/345 |
| ProMiR | --- | --- | --- | (668/674) | 221/669 |
| TripSVM | --- | --- | --- | --- | (222/674) |
In each diagonal cell, the number before the slash is the number of true negative (TN) predictions of that predictor, and the number after the slash is the total number of samples. In each off-diagonal cell, the number before the slash is the number of TN predictions shared by the two predictors, and the number after the slash is the total number of non-redundant TN predictions of the two predictors.
Fig 3. Prediction accuracies of meta-predictors made from various combinations of five individual predictors in the (A) D163 and (B) D1679 datasets. The inputs of the ANN in each meta-predictor were preprocessed using the preprocess-I transformation. The x-axis shows the number of individual predictors in the meta-predictor, and the y-axis shows the prediction accuracy (ACC) under three-fold cross validation (A) and five-fold cross validation (B). When x equals 1, the y-axis shows the prediction accuracies of the five individual predictors. The numbers of meta-predictors composed of 2, 3, 4, and 5 individual predictors are 10, 10, 5, and 1, respectively.
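As a reminder of how k-fold cross validation partitions a dataset, the sketch below splits sample indices into k folds and yields each train/test pair. It is a generic illustration under assumed choices (random shuffling with a fixed seed, round-robin fold assignment); the paper's actual fold assignment is not described here, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: samples are shuffled once, then dealt round-robin into folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# D163 contains 163 positive and 168 negative samples (331 in total).
splits = list(k_fold_indices(n_samples=331, k=3))
print([(len(train), len(test)) for train, test in splits])
# [(220, 111), (221, 110), (221, 110)]
```

Every sample appears in exactly one test fold, so each is predicted once by a model that did not see it during training.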
Performance of meta-predictors using preprocess-I transformation under multi-fold cross validation and in independent dataset.
| Predictor | Dataset | SENS | SPEC | ACC | MCC |
|---|---|---|---|---|---|
| Meta-I-4S | D163 | 93.3 ± 2.8% | 98.8 ± 1.6% | 96.1 ± 2.2% | 0.92 ± 0.04 |
| | D1679 | 88.1 ± 2.4% | 92.9 ± 0.1% | 89.5 ± 1.7% | 0.76 ± 0.03 |
| Meta-I-5L | D1679 | 79.4 ± 4.4% | 94.0 ± 1.3% | 86.6 ± 1.6% | 0.74 ± 0.03 |
| | D163 | 98.9 ± 0.4% | 93.2 ± 3.3% | 96.0 ± 1.4% | 0.92 ± 0.03 |
(*) Meta-I-4S is composed of four individual predictors: MiPred, MIReNA, MiRPara, and ProMiR. It was optimized on the D163 dataset using three-fold cross validation. Meta-I-5L is composed of five individual predictors: MiPred, MIReNA, MiRPara, ProMiR, and TripSVM. It was trained on the D1679 dataset using five-fold cross validation.
(**) The performance of each predictor on its independent dataset (D1679 for Meta-I-4S and D163 for Meta-I-5L) was averaged over the three or five prediction iterations that correspond to three- or five-fold cross validation. Errors are standard errors calculated over those iterations.
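A value reported as mean ± standard error over cross-validation iterations can be computed as sketched below. The per-fold accuracies are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the footnote does not specify the estimator.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of fold scores.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, std_err

# Hypothetical accuracies from three cross-validation iterations.
fold_accuracies = [0.95, 0.97, 0.96]
mean_acc, se_acc = mean_and_standard_error(fold_accuracies)
print(f"ACC = {mean_acc:.1%} ± {se_acc:.1%}")
```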
Performance of meta-predictors under multi-fold cross validation and in independent dataset under preprocess-II transformation strategy.
| Predictor | Dataset | SENS | SPEC | ACC | MCC |
|---|---|---|---|---|---|
| Meta-II-2S | D163 | 96.3 ± 1.2% | 100.0 ± 0.0% | 98.2 ± 0.6% | 0.96 ± 0.01 |
| | D1679 | 73.8 ± 4.5% | 94.6 ± 1.0% | 79.8 ± 2.9% | 0.62 ± 0.04 |
| Meta-II-5L | D1679 | 82.5 ± 1.9% | 95.0 ± 1.7% | 88.7 ± 0.6% | 0.78 ± 0.01 |
| | D163 | 99.4 ± 0.0% | 90.5 ± 0.6% | 94.9 ± 0.3% | 0.90 ± 0.01 |
(*) Meta-II-2S is composed of MiPred and ProMiR. It was optimized on the D163 dataset using three-fold cross validation. Meta-II-5L is composed of all five individual predictors: MiPred, MIReNA, MiRPara, ProMiR, and TripSVM. It was trained on the D1679 dataset using five-fold cross validation.
(**) The performance of each predictor on its independent dataset (D1679 for Meta-II-2S and D163 for Meta-II-5L) was averaged over the three or five prediction iterations that correspond to three- or five-fold cross validation. Errors are standard errors calculated over those iterations.
Performance of meta-predictors under multi-fold cross validation and in independent dataset under preprocess-III transformation strategy.
| Predictor | Dataset | SENS | SPEC | ACC | MCC |
|---|---|---|---|---|---|
| Meta-III-5S | D163 | 93.1 ± 2.1% | 100.0 ± 0.0% | 96.6 ± 1.0% | 0.93 ± 0.02 |
| | D1679 | 91.4 ± 0.1% | 83.7 ± 0.1% | 89.2 ± 0.1% | 0.74 ± 0.00 |
| Meta-III-5L | D1679 | 89.1 ± 3.2% | 97.7 ± 0.4% | 93.4 ± 1.7% | 0.87 ± 0.03 |
| | D163 | 99.4 ± 0.0% | 90.1 ± 0.1% | 94.7 ± 0.7% | 0.90 ± 0.01 |
(*) Meta-III-5S and Meta-III-5L are each composed of all five individual predictors: MiPred, MIReNA, MiRPara, ProMiR, and TripSVM. Meta-III-5S was optimized on the D163 dataset using three-fold cross validation, while Meta-III-5L was trained on the D1679 dataset using five-fold cross validation.
(**) The performance of each predictor on its independent dataset (D1679 for Meta-III-5S and D163 for Meta-III-5L) was averaged over the three or five prediction iterations that correspond to three- or five-fold cross validation. Errors are standard errors calculated over those iterations. Meta-III-5L has the best overall performance and is therefore used as the final meta-predictor, mirMeta.
Comparison between mirMeta and HeteroMirPred.
| Predictor | Dataset | SENS | SPEC | ACC | MCC |
|---|---|---|---|---|---|
| mirMeta | D1679 | 89.1 ± 3.2% | 97.7 ± 0.4% | 93.4 ± 1.7% | 0.87 ± 0.03 |
| | D163 | 99.4% | 90.1% | 94.7% | 0.90 |
| HeteroMirPred | D1679 | 78.1% | 52.5% | 69.6% | 0.29 |
| | D163 | 98.7% | 96.6% | 97.6% | 0.95 |
(a) The accuracies of mirMeta in D1679 were obtained from five-fold cross-validation. The values are the same as those in Table 9.
(b) The accuracies were calculated using the D163 dataset as an independent dataset.