Abstract
MicroRNAs (miRNAs) are posttranscriptional regulators of gene expression. While a single miRNA can target hundreds of messenger RNAs (mRNAs), an mRNA can be targeted by several different miRNAs, and a single miRNA may even have multiple binding sites within one mRNA sequence. Experimental investigation of miRNAs is therefore quite involved, and machine learning (ML) is frequently used to overcome such challenges. The key parts of an ML analysis depend largely on the quality of the input data and the capacity of the features describing the data. Previously, more than 1000 features were suggested for miRNAs. Here, it is shown that 36 features representing the RNA secondary structure and its dynamic 3D graphical representation yield accuracy values of up to 98%. In this study, a new approach for ML-based miRNA prediction is proposed: thousands of models are generated through classification of known human miRNAs and pseudohairpins with three classifiers: decision tree, naïve Bayes, and random forest. Although the method is based on human data, the best model correctly assigned 96% of nonhuman hairpins from MirGeneDB, suggesting that this approach might be useful for the analysis of miRNAs from other species.
Keywords: RNA structure; decision tree; machine learning; naïve Bayes; random forest; MicroRNA
Year: 2019 PMID: 31582883 PMCID: PMC6713912 DOI: 10.3906/biy-1904-59
Source DB: PubMed Journal: Turk J Biol ISSN: 1300-0152
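The abstract does not enumerate the 36 structure-derived features; as an illustrative sketch only, a hairpin's dot-bracket secondary structure can be mapped to a 3D walk (one unit step per base state) and reduced to simple geometric summaries. The step vectors and the three features below are assumptions for demonstration, not the feature set used in the paper.

```python
# Hypothetical sketch: encode a dot-bracket secondary structure as a 3D
# walk and derive a few geometric summary features. Step vectors and
# feature choices are illustrative assumptions, not the paper's features.
import math

STEPS = {
    "(": (1.0, 0.0, 0.0),  # paired base, opening bracket
    ")": (0.0, 1.0, 0.0),  # paired base, closing bracket
    ".": (0.0, 0.0, 1.0),  # unpaired base
}

def walk3d(dot_bracket):
    """Cumulative 3D coordinates of the structure walk."""
    x = y = z = 0.0
    coords = [(x, y, z)]
    for ch in dot_bracket:
        dx, dy, dz = STEPS[ch]
        x, y, z = x + dx, y + dy, z + dz
        coords.append((x, y, z))
    return coords

def features(dot_bracket):
    """Summary features of the walk: end-to-end distance,
    radius of gyration, and the fraction of paired bases."""
    coords = walk3d(dot_bracket)
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    return {
        "end_to_end": math.dist(coords[0], coords[-1]),
        "radius_gyration": math.sqrt(
            sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
                for p in coords) / n
        ),
        "paired_fraction": sum(c in "()" for c in dot_bracket) / len(dot_bracket),
    }

print(features("((((....))))"))
```

In practice the dot-bracket string would come from a folding tool such as RNAfold, which also reports the minimum free energy mentioned in Figure 1.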
Figure 1. Representation of the sequence and secondary structure of hsa-let-7a-1. mfe: minimum free energy.
Figure 2. The basic workflow of the analysis. Hairpin sequences were folded into their secondary structures, and 3D features were calculated for each hairpin based on the state of its bases (bonded or nonbonded). Learning datasets were used for classification analysis with 1000-fold Monte Carlo cross-validation, and the best models with the highest accuracy scores were applied to the test datasets for prediction.
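The classification step of this workflow can be sketched with scikit-learn: Monte Carlo cross-validation draws a fresh random train/test split on each repetition and scores the three classifiers named in the paper. The dataset below is synthetic stand-in data; the real analysis uses 36 structure-derived features of miRNA hairpins versus pseudohairpins and 1000 repetitions rather than the 50 used here for brevity.

```python
# Sketch of the Figure 2 classification step: Monte Carlo cross-validation
# (repeated random train/test splits) with decision tree, naive Bayes, and
# random forest. Data is synthetic, not the paper's hairpin dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 36-feature miRNA/pseudohairpin dataset.
X, y = make_classification(n_samples=500, n_features=36, random_state=0)

# Monte Carlo CV: each split is an independent random 70/30 partition.
mc_cv = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=mc_cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping the best-scoring model from the repetitions and applying it to a held-out test set mirrors the prediction stage described in the caption.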
Figure 3. Accuracies of the classifiers when the positive dataset is MirGeneDB human precursors. DT: decision tree, NB: naïve Bayes, RF: random forest.
Figure 4. Accuracies of the classifiers when the positive dataset is miRBase human precursors. DT: decision tree, NB: naïve Bayes, RF: random forest.
Table 1. Prediction results for other organisms' miRNA hairpins. All organisms from MirGeneDB except human were included. The MiRNA # column shows the total number of hairpins per species; the Prediction column shows the number of miRNA and negative predictions, respectively. The table is sorted alphabetically by species acronym.
| Species | Acronym | MiRNA # | Prediction |
|---|---|---|---|
| Anolis carolinensis | Aca | 261 | 244, 17 |
| Alligator mississippiensis | Ami | 272 | 259, 13 |
| Ascaris suum | Asu | 95 | 94, 1 |
| Branchiostoma floridae | Bfl | 90 | 88, 2 |
| Bos taurus | Bta | 433 | 418, 15 |
| Caenorhabditis elegans | Cel | 139 | 135, 4 |
| Canis familiaris | Cfa | 444 | 427, 17 |
| Crassostrea gigas | Cgi | 150 | 145, 5 |
| Columba livia | Cli | 246 | 237, 9 |
| Chrysemys picta bellii | Cpi | 290 | 278, 12 |
| Cavia porcellus | Cpo | 397 | 384, 13 |
| Capitella teleta | Cte | 102 | 96, 6 |
| Drosophila melanogaster | Dme | 152 | 133, 19 |
| Dasypus novemcinctus | Dno | 373 | 362, 11 |
| Daphnia pulex | Dpu | 79 | 67, 12 |
| Danio rerio | Dre | 385 | 369, 16 |
| Eisenia fetida | Efe | 192 | 177, 15 |
| Echinops telfairi | Ete | 339 | 328, 11 |
| Gallus gallus | Gga | 262 | 248, 14 |
| Ixodes sp. | Isc | 56 | 52, 4 |
| Lottia gigantea | Lgi | 80 | 79, 1 |
| Macaca mulatta | Mml | 498 | 488, 10 |
| Mus musculus | Mmu | 448 | 428, 20 |
| Oryctolagus cuniculus | Ocu | 366 | 361, 5 |
| Ptychodera flava | Pfl | 83 | 81, 2 |
| Patiria miniata | Pmi | 58 | 54, 4 |
| Rattus norvegicus | Rno | 413 | 394, 19 |
| Sarcophilus harrisii | Sha | 417 | 409, 8 |
| Saccoglossus kowalevskii | Sko | 83 | 80, 3 |
| Strongylocentrotus purpuratus | Spu | 57 | 51, 6 |
| Tribolium castaneum | Tca | 188 | 186, 2 |
| Xenopus tropicalis | Xtr | 253 | 241, 12 |
Table 2. Comparison of the model developed in this work with existing classifiers. FN: number of features used to build the classification model; ML: machine learning method; SE: sensitivity; SP: specificity; Acc: accuracy; SVM: support vector machine; NB: naïve Bayes; MLP: multilayered perceptron; RF: random forest; DT: decision tree.
| Method | FN | ML | SE | SP | Acc |
|---|---|---|---|---|---|
| Triplet-SVM (Xue et al., 2005) | 32 | SVM | | | 93.30 |
| MiPred (Jiang et al., 2007) | 34 | RF, SVM | 98.21 | 95.09 | 96.68 |
| MicroPred (Batuwita et al., 2009) | 21 | RF, SVM | 90.02 | 97.28 | |
| izMiR (Saçar Demirci et al., 2017) | ~900 | SVM, NB, DT | 91.98 | 91.98 | 91.25 |
| 3D model | 36 | RF, NB, DT | 98.87 | 98.87 | 98.58 |