Müşerref Duygu Saçar Demirci, Jan Baumbach, Jens Allmer.
Abstract
MicroRNAs are crucial for post-transcriptional gene regulation, and their dysregulation has been associated with diseases like cancer and, therefore, their analysis has become popular. The experimental discovery of miRNAs is cumbersome and, thus, many computational tools have been proposed. Here we assess 13 ab initio pre-miRNA detection approaches using all relevant, published, and novel data sets while judging algorithm performance based on ten intrinsic performance measures. We present an extensible framework, izMiR, which allows for the unbiased comparison of existing algorithms, adding new ones, and combining multiple approaches into ensemble methods. In an exhaustive attempt, we condense the results of millions of computations and show that no method is clearly superior; however, we provide a guideline for biomedical researchers to select a tool. Finally, we demonstrate that combining all of the methods into one ensemble approach, for the first time, allows reliable purely computational pre-miRNA detection in large eukaryotic genomes.

As the experimental discovery of microRNAs (miRNAs) is cumbersome, computational tools have been developed for the prediction of pre-miRNAs. Here the authors develop a framework to assess the performance of existing and novel pre-miRNA prediction tools and provide guidelines for selecting an appropriate approach for a given data set.
Year: 2017 PMID: 28839141 PMCID: PMC5571158 DOI: 10.1038/s41467-017-00403-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Available pre-miRNA detection tools
| Study | ML algorithm | Feature number | Positive data | Negative data | Sampling | Implementation | Number of citations (Google Scholar) |
|---|---|---|---|---|---|---|---|
| Xue | SVM | 32 | miRBase 5.0 | CODING data set (pseudo) | Random selection (approx. 1:1 positive:negative ratio) | * | 412 (34) |
| Jiang | RF, SVM | 34 | miRBase 8.2 | pseudo | Random sampling (approx. 1:1 positive:negative and 1:1.5 training:testing ratio) | * | 376 (48) |
| Ng | SVM | 29 | miRBase 8.2 | pseudo | Random selection without replacement (1:2 positive:negative ratio) | * | 203 (19) |
| Batuwita | SVM | 21 | miRBase 12 | pseudo & human other ncRNAs | Outer 5-fold cross-validation | + | 172 (16) |
| Xu | A novel ranking algorithm based on random walks & SVM | 35 | miRBase (September 1, 2007) | Random, non-overlapping 90-nt fragments from the human genome | Random selection (1:2 positive:negative ratio) | * | 80 (4) |
| Ding | SVM | 32 | Known miRNAs | UTRdb & ncRNA from Rfam 9.1 | Outer 3-fold cross-validation | − | 61 (11) |
| Chen | LibSVM | 99 | miRBase (2013) | pseudo & Zou | Leave-one-out | + | 31 (24) |
| Burgt | – | 18 | Non-plant miRNA hairpin sequences (miRBase version 9.0) | – | 10-fold cross-validation | * | 31 (4) |
| Gudys | NB, MLP, SVM, RF, APLSC | 28 | miRBase 17 | From genomes and mRNAs of ten animal and seven plant species as well as 29 viruses | Stratified 10-fold CV | + | 27 (5) |
| Ritchie | SVM | 36 | Murine miRBase v17 | Transcripts without evidence of processing by Dicer | – | − | 20 (5) |
| Bentwich | – | 26 | Hairpins from the human genome | 10000 hairpins found in non-coding regions | – | − | 20 (2) |
| Lopes | SVM, RF, G2DE | 13 | miRBase 19 | pseudo | Non-standard training and testing scheme | * | 16 (6) |
| Gao | SVM | 57 | miRBase v20 | Exonic regions of some available genomes and ncRNAs from Rfam | 1:1 positive:negative ratio | * | 11 (1) |
SVM support vector machine, NB Naïve Bayes, MLP Multi-Layered Perceptron, RF Random Forest, APLSC Asymmetric Partial Least Squares Classification, G2DE Generalized Gaussian Density Estimator, + implementation exists, − no implementation, * experienced problems with the implementation
Previously published studies performing ab initio pre-miRNA detection using machine learning (ML). Listed are the number of features effectively used, the training data employed, and whether an implementation is available.
The negative data set "pseudo" (see Online Methods) was generated by Xue[16] but downloaded from Ng[17]. The table is sorted by the number of citations in Google Scholar (note that the number of citations is related to the year of publication; the number of citations received in 2016 is therefore also provided in parentheses).
Data sets
| Dataset | Type | Size | Property | Source |
|---|---|---|---|---|
| hsa | Positive | 1881 | All human miRNAs in miRBase | |
| mirbase | Positive | 28596 | All miRNAs available in miRBase | |
| mmu | Positive | 1193 | All mouse miRNAs in miRBase | |
| mmu* | Positive | 380 | Mouse miRNAs in miRBase (RPM ≥ 100) | |
| mirgenedb | Positive | 1434 | All miRNAs available in MirGeneDB | |
| hsa+ | Positive | 523 | All human miRNAs available in MirGeneDB | |
| mmu+ | Positive | 395 | All mouse miRNAs available in MirGeneDB | |
| gga+ | Positive | 229 | All chicken miRNAs available in MirGeneDB | |
| dre+ | Positive | 287 | All zebrafish miRNAs available in MirGeneDB | |
| NegHsa | Negative | 68046 | Extracted from the genome and mRNAs of H. sapiens | |
| Zou | Negative | 14246 | Extracted from coding regions | |
| pseudo | Negative | 8492 | Popular, used in many studies; constructed from the protein-coding sequences (CDSs) of human RefSeq genes with no known alternative splice events | |
| Chen | Negative | 3054 | Excerpt of the combination of Zou and pseudo | |
| NotBestFold | Negative | 1881 | Created by not using the best fold proposed by RNAfold for human hairpins from miRBase | |
| Shuffled | Negative | 1423 | Created by shuffling the hsa data | |
| hsaFR | Positive | 5000 | Created by random number generation between minimum and maximum for all features (hsa) | |
| hsaBQ | Positive | 5000 | Created by random number generation between lower and upper quartile for all features (hsa) | |
| hsaAM | Positive | 5000 | Created by random number generation between the 40th and 60th percentiles for all features (hsa) | |
| pseudoFR | Negative | 5000 | Created by random number generation between minimum and maximum for all features (pseudo) | |
| pseudoBQ | Negative | 5000 | Created by random number generation between lower and upper quartile for all features (pseudo) | |
| pseudoAM | Negative | 5000 | Created by random number generation between the 40th and 60th percentiles for all features (pseudo) | |
List of positive and negative data sets used to create and evaluate pre-miRNA detection tools. The first 13 rows refer to previously available data sets, whereas the last 8 were created for this study.
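The FR/BQ/AM data sets above share one recipe: for every feature, draw uniform random values between two percentiles of the real feature distribution (minimum/maximum for FR, lower/upper quartile for BQ, 40th/60th percentile for AM). A minimal sketch with NumPy, assuming a numeric feature matrix with one row per hairpin; the function name `synthesize` and the toy matrix are illustrative, not taken from izMiR:

```python
import numpy as np

def synthesize(features: np.ndarray, n: int, lo_pct: float, hi_pct: float,
               seed: int = 0) -> np.ndarray:
    """Draw n synthetic examples, sampling each feature uniformly
    between its lo_pct and hi_pct percentiles (computed per column)."""
    rng = np.random.default_rng(seed)
    lo = np.percentile(features, lo_pct, axis=0)
    hi = np.percentile(features, hi_pct, axis=0)
    return rng.uniform(lo, hi, size=(n, features.shape[1]))

# Toy 2-feature matrix standing in for the real hsa feature table:
X = np.array([[0.1, 10.0], [0.4, 20.0], [0.9, 30.0]])
hsaFR = synthesize(X, 5000, 0, 100)    # full range (min..max)
hsaBQ = synthesize(X, 5000, 25, 75)    # between quartiles
hsaAM = synthesize(X, 5000, 40, 60)    # around the median
```

The same call with a negative feature matrix yields the pseudoFR/pseudoBQ/pseudoAM counterparts.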
Fig. 1 Classifier accuracy distribution. Box-whisker plots showing the accuracy distribution among selected studies for 1000-fold MCCV. The individual accuracy measures of the DT, NB, and SVM classifiers were merged to create this plot. Per-classifier results can be found in Supplementary Figs. 1–3
Model performance summary
Columns NegHsa through PseudoAM are negative data sets; columns hsa through dre+ are positive data sets.

| Model | NegHsa | Zou | Pseudo | NotBestFold | Shuffled | Chen | PseudoFR | PseudoBQ | PseudoAM | Neg rank | hsa | mmu | mmu* | mirbase | hsaFR | hsaBQ | hsaAM | mirgenedb | hsa+ | mmu+ | gga+ | dre+ | Pos rank | Total rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AverageDT | 82 | 56 | 93 | 31 | 93 | 77 | 30 | 100 | 100 | 99 | 97 | 83 | 95 | 91 | 91 | 100 | 100 | 97 | 98 | 96 | 98 | 96 | 52 | 151 |
| ConsensusNB | 89 | 52 | 86 | 24 | 96 | 77 | 53 | 100 | 100 | 97 | 86 | 82 | 93 | 89 | 100 | 100 | 100 | 96 | 96 | 93 | 98 | 97 | 84 | 181 |
| ConsensusDT | 74 | 44 | 90 | 20 | 88 | 72 | 16 | 100 | 100 | 155 | 99 | 87 | 96 | 93 | 97 | 100 | 100 | 98 | 99 | 97 | 100 | 97 | 31 | 186 |
| DingNB | 93 | 47 | 84 | 9 | 96 | 73 | 30 | 100 | 100 | 127 | 88 | 84 | 94 | 90 | 100 | 100 | 100 | 97 | 97 | 96 | 97 | 97 | 59 | 186 |
| AverageNB | 92 | 58 | 89 | 95 | 97 | 82 | 86 | 100 | 100 | 50 | 83 | 77 | 91 | 87 | 99 | 100 | 100 | 94 | 95 | 91 | 96 | 95 | 148 | 198 |
| NgDT | 74 | 64 | 89 | 13 | 91 | 77 | 31 | 100 | 100 | 118 | 89 | 80 | 93 | 88 | 85 | 100 | 100 | 96 | 96 | 94 | 98 | 97 | 100 | 218 |
| Consensus Model | 84 | 69 | 96 | 69 | 89 | 81 | 58 | 100 | 100 | 70 | 97 | 76 | 94 | 87 | 33 | 100 | 100 | 92 | 95 | 89 | 93 | 92 | 157 | 227 |
| BatuwitaNB | 90 | 53 | 83 | 11 | 97 | 76 | 45 | 100 | 100 | 114 | 86 | 79 | 92 | 87 | 98 | 100 | 100 | 96 | 96 | 93 | 97 | 97 | 120 | 234 |
| BentwichNB | 37 | 23 | 71 | 9 | 69 | 52 | 21 | 100 | 100 | 222 | 92 | 92 | 98 | 95 | 99 | 100 | 100 | 99 | 99 | 97 | 100 | 100 | 26 | 248 |
| NgNB | 74 | 42 | 81 | 9 | 87 | 63 | 36 | 100 | 100 | 187 | 86 | 83 | 95 | 91 | 99 | 100 | 100 | 96 | 97 | 94 | 98 | 98 | 63 | 250 |
| Mean | 80 | 56 | 86 | 51 | 89 | 75 | 52 | 98 | 97 | | 85 | 75 | 89 | 84 | 67 | 96 | 97 | 89 | 89 | 86 | 89 | 89 | | |
Top ten models from the two machine-learning algorithms generated for all 13 studies and 6 ensemble methods, and their performance with respect to the 21 data sets examined in this study. The table is sorted by total rank, which indicates the overall best performance considering all data sets equally. The minimum possible total rank is 21 and the maximum is 672. Values are the prediction correctness (positive/negative) for the different data sets (Table 2). The complete results, including heat-map-style highlighting, are provided in Supplementary Table 5. The calculated mean refers to the complete results in Supplementary Table 5
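A total rank is simply the sum of a model's per-data-set ranks: negative-set ranks add up to "Neg rank", positive-set ranks to "Pos rank", and their sum is the total. The stated bounds follow directly: rank 1 on each of 21 data sets gives 21, and last place on all 21 gives 672 (implying 32 ranked models). A short pandas sketch of this rank-sum aggregation, using three models and three data sets with accuracy values taken from the table:

```python
import pandas as pd

# Toy accuracy table: rows = models, columns = data sets.
acc = pd.DataFrame(
    {"hsa": [97, 86, 99], "pseudo": [93, 86, 90], "Zou": [56, 52, 44]},
    index=["AverageDT", "ConsensusNB", "ConsensusDT"],
)

# Rank each data set separately: the most accurate model gets rank 1
# (method="min" assigns tied models the same, best, rank).
ranks = acc.rank(axis=0, ascending=False, method="min")

# A model's total rank is the sum of its per-data-set ranks; with d data
# sets and m models the best possible total is d and the worst d * m.
total_rank = ranks.sum(axis=1).sort_values()
```

With these three data sets the toy ranking differs from the full table, which aggregates over all 21 data sets.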
Fig. 2 Generalization performance of izMiR. Line plot showing the true prediction rates (y axis) of hairpins from different organisms. mmu* stands for mouse hairpins from miRBase filtered for a minimum RPM of 100, mmu refers to all mouse hairpins without filtering, and mmu+ indicates mouse hairpins from MirGeneDB. Only organisms with a minimum of 200 hairpins in miRBase were selected for this plot. Results for all organisms in miRBase are available for download from our web page: http://jlab.iyte.edu.tr/izmir. The lines do not indicate a mathematical relationship and are only added to aid visual tracking
Fig. 3 Model training workflow. Filtered human miRNA hairpins from miRBase served as positive data and pseudo hairpins as negative data. Each data set is randomly sampled individually; 70% of the positive data and the same number of negative examples were used during 1000-fold Monte Carlo cross-validation (MCCV) [30]. The remaining 30% of the positive data and the same number of negative examples were used for testing the model. Finally, the best naïve Bayes and decision-tree models were stored in PMML format for prediction, whereas the SVM models could not be stored as PMML due to limitations of the available SVM implementation
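The MCCV scheme in the caption can be sketched as follows, assuming `pos` and `neg` are numeric feature matrices with one row per hairpin. A simple nearest-centroid rule stands in for the decision-tree/naïve Bayes/SVM classifiers actually trained by izMiR; function and variable names are illustrative:

```python
import numpy as np

def mccv(pos: np.ndarray, neg: np.ndarray, folds: int = 1000,
         train_frac: float = 0.7, seed: int = 0) -> np.ndarray:
    """Monte Carlo cross-validation: per fold, 70% of the positives plus
    an equally sized random negative sample form the training set; the
    held-out 30% of positives (with as many fresh negatives) form the
    test set. Returns one accuracy per fold."""
    rng = np.random.default_rng(seed)
    n_train = int(train_frac * len(pos))
    accs = []
    for _ in range(folds):
        p = rng.permutation(len(pos))
        n = rng.permutation(len(neg))
        p_tr, p_te = pos[p[:n_train]], pos[p[n_train:]]
        n_tr = neg[n[:n_train]]                      # balanced negatives
        n_te = neg[n[n_train:n_train + len(p_te)]]
        # Stand-in classifier: assign the class of the nearer centroid.
        c_pos, c_neg = p_tr.mean(axis=0), n_tr.mean(axis=0)
        X = np.vstack([p_te, n_te])
        y = np.r_[np.ones(len(p_te)), np.zeros(len(n_te))]
        pred = (np.linalg.norm(X - c_pos, axis=1)
                < np.linalg.norm(X - c_neg, axis=1))
        accs.append((pred == y).mean())
    return np.array(accs)
```

Because each fold redraws both the 70/30 positive split and the negative sample, the resulting accuracy distribution (as in Fig. 1) reflects sampling variability rather than a single fixed partition.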