R J Peace1, M Sheikh Hassani1, J R Green2.
Abstract
Methods for the de novo identification of microRNA (miRNA) have been developed using a range of sequence-based features. With the increasing availability of next generation sequencing (NGS) transcriptome data, there is a need for miRNA identification that integrates both NGS transcript expression-based patterns and advanced genomic sequence-based methods. While miRDeep2 does examine the predicted secondary structure of putative miRNA sequences, it does not leverage many of the sequence-based features used in state-of-the-art de novo methods. Meanwhile, other NGS-based methods, such as miRanalyzer, emphasize sequence-based features without leveraging advanced expression-based features that reflect miRNA biosynthesis. This represents an opportunity to combine the strengths of NGS-based analysis with recent advances in de novo sequence-based miRNA prediction. Here we develop a method, microRNA Prediction using Integrated Evidence (miPIE), which integrates both expression-based and sequence-based features to achieve significantly improved miRNA prediction performance. Feature selection identifies the 20 most discriminative features, 3 of which reflect strictly expression-based information. Evaluation using precision-recall curves, for six NGS data sets representing six diverse species, demonstrates substantial improvements in prediction performance compared to three methods: miRDeep2, miRanalyzer, and mirnovo. The individual contributions of expression-based and sequence-based features are also examined, and we demonstrate that their combination is more effective than either alone.
Year: 2019 PMID: 30733467 PMCID: PMC6367335 DOI: 10.1038/s41598-018-38107-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
NGS data sets examined in this article.
| Data set | Organism | Accession | # Reads |
|---|---|---|---|
| hsa | H. sapiens | GSM1820470 | 38210937 |
| mmu | M. musculus | GSM1528810 | 54947527 |
| dme | D. melanogaster | GSM1123781 | 18723989 |
| bta | B. taurus | GSE74879 | 43164654 |
| gga | G. gallus | GSM2095817 | 27937224 |
| eca | E. caballus | GSE100852 | 42178766 |
Figure 1. miPIE pipeline flowchart.
Number of samples in positive and negative classification data sets derived from each NGS experiment data set.
| Data set | # of positive genomic regions | # of negative genomic regions |
|---|---|---|
| | 167 | 609 |
| | 384 | 872 |
| | 110 | 97 |
| | 341 | 683 |
| | 193 | 104 |
| | 364 | 228 |
Performance of the integrated miPIE feature set (all) compared with classifiers trained in the same way using only sequence-based (Seq) or only expression-based (Exp) features.
| Data set | Re@Pr75 (all) | Re@Pr75 (Seq) | Re@Pr75 (Exp) | Re@Pr90 (all) | Re@Pr90 (Seq) | Re@Pr90 (Exp) |
|---|---|---|---|---|---|---|
| mmu | 0.977 | 0.958 | 0.958 | 0.948 | 0.904 | 0.885 |
| hsa | 0.915 | 0.915 | 0.836 | 0.879 | 0.849 | 0.545 |
| dme | 0.955 | 0.982 | 0.954 | 0.891 | 0.918 | 0.864 |
| bta | 0.924 | 0.891 | 0.891 | 0.928 | 0.912 | 0.969 |
| gga | 0.993 | 0.953 | 0.995 | 0.798 | 0.757 | 0.648 |
| eca | 0.985 | 0.953 | 0.995 | 0.953 | 0.929 | 0.887 |
| Average | 0.958 | 0.942 | 0.938 | 0.900 | 0.878 | 0.800 |
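
The Re@Pr75 and Re@Pr90 columns report the highest recall achievable while precision stays at or above 75% or 90%. A minimal sketch of that computation follows, assuming each candidate region carries a classifier confidence score and a binary label. The function name `recall_at_precision` and the toy scores are illustrative assumptions, not part of miPIE.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall reachable at precision >= min_precision.

    scores: classifier confidences (higher = more likely miRNA); hypothetical input.
    labels: 1 for a known positive region, 0 for a negative region.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")

    # Rank candidates from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Candidates sharing a score are accepted together at that threshold.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall


# Toy data (not from the paper): 8 candidates, 5 of them true miRNAs.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
toy_labels = [1, 1, 0, 1, 1, 0, 1, 0]

print(recall_at_precision(toy_scores, toy_labels, 0.75))  # 0.8
print(recall_at_precision(toy_scores, toy_labels, 0.90))  # 0.4
```

On the toy data, tightening the precision requirement from 75% to 90% halves the achievable recall, which is the kind of trade-off the two column groups expose.
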
Figure 2. Performance of miPIE, miRDeep2, and miRanalyzer across six data sets. miPIE performance is estimated through 10-fold cross-validation. miRanalyzer produced binary prediction values, so only a single precision level is represented. miPIE outperforms miRDeep2 and miRanalyzer on all six data sets, with the possible exception of miRanalyzer’s performance on mmu. In all plots, the y-axis represents precision while the x-axis is recall.
Summary of results comparing miPIE with the state-of-the-art miRDeep2 method on six NGS data sets.
| Data set | Re@Pr75 (miRDeep2) | Re@Pr75 (miPIE) | Re@Pr90 (miRDeep2) | Re@Pr90 (miPIE) | AUPRC (miRDeep2) | AUPRC (miPIE) |
|---|---|---|---|---|---|---|
| mmu | 0.930 | 0.977 (+5.1%) | 0.867 | 0.948 (+9.3%) | 0.813 | 0.898 (+10.5%) |
| hsa | 0.867 | 0.909 (+4.8%) | 0.598 | 0.873 (+46.0%) | 0.948 | 0.978 (+3.08%) |
| dme | 0.909 | 0.955 (+5.1%) | 0.873 | 0.891 (+2.1%) | 0.931 | 0.960 (+3.11%) |
| bta | 0.860 | 0.924 (+7.4%) | 0.604 | 0.798 (+32.1%) | 0.873 | 0.918 (+5.12%) |
| gga | 0.964 | 0.985 (+2.2%) | 0.902 | 0.927 (+2.8%) | 0.940 | 0.976 (+3.80%) |
| eca | 0.975 | 0.997 (+2.3%) | 0.897 | 0.940 (+4.8%) | 0.952 | 0.966 (+1.50%) |
| Average | 0.918 | 0.958 (+4.0%) | 0.790 | 0.896 (+16.0%) | 0.882 | 0.949 (+7.57%) |
miPIE outperforms miRDeep2 by 16% and 4%, at the 90% and 75% precision thresholds, respectively, and by 7.57% when performance is measured using area under the precision-recall curve.
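
The percentages in parentheses appear to be relative gains over miRDeep2, that is, (miPIE − miRDeep2) / miRDeep2. The sketch below recomputes them for the Re@Pr90 columns using the values printed in the table. Reading the reported 16% as the mean of the per-data-set gains is an inference from these numbers, not something the table states, and the helper name `relative_gain_percent` is illustrative.

```python
def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Re@Pr90 values copied from the table above: (miRDeep2, miPIE).
re_pr90 = {
    "mmu": (0.867, 0.948),
    "hsa": (0.598, 0.873),
    "dme": (0.873, 0.891),
    "bta": (0.604, 0.798),
    "gga": (0.902, 0.927),
    "eca": (0.897, 0.940),
}

gains = {name: relative_gain_percent(*pair) for name, pair in re_pr90.items()}
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
print(f"mean of per-data-set gains: {mean_gain:+.1f}%")
```

Under this reading, the per-data-set values round to the figures shown in the table, and their mean comes out close to the reported 16%.
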
Summary of results comparing miPIE with miRanalyzer using six NGS data sets.
| Data set | Precision level | miRanalyzer recall rate | miPIE recall rate |
|---|---|---|---|
| mmu | 0.982 | 0.866 | 0.841 (−2.9%) |
| hsa | 0.770 | 0.806 | 0.909 (+12.8%) |
| dme | 0.851 | 0.882 | 0.927 (+5.1%) |
| bta | 0.810 | 0.849 | 0.910 (+7.2%) |
| gga | 0.868 | 0.854 | 0.943 (+10.4%) |
| eca | 0.880 | 0.891 | 0.970 (+8.9%) |
| Average | 0.821 | 0.840 | 0.897 (+6.90%) |
When operating at miRanalyzer’s precision threshold, miPIE outperforms miRanalyzer by 6.9% on average (p = 0.046).
Summary of results comparing miPIE with the mirnovo method on six NGS data sets.
| Data set | Re@Pr75 (mirnovo) | Re@Pr75 (miPIE) | Re@Pr90 (mirnovo) | Re@Pr90 (miPIE) |
|---|---|---|---|---|
| mmu | 0.78 | 0.977 | 0.70 | 0.948 |
| hsa | 0.51 | 0.909 | 0.37 | 0.873 |
| dme | 0.20 | 0.955 | 0.12 | 0.891 |
| bta | 0.76 | 0.924 | 0.61 | 0.798 |
| gga | 0.76 | 0.985 | 0.36 | 0.927 |
| eca | 0.62 | 0.997 | 0.08 | 0.940 |
| Average | 0.61 | 0.96 | 0.37 | 0.90 |
Recall achievable at a precision of at least 90% (Re@Pr90) for six test data sets using our method trained on the following data sets: 10CV = 10-fold cross-validation within the test species data set; all = combination of five data sets, excluding the test set; mmu = mouse data set; hsa = human data set; dme = fruit fly data set; bta = cow data set; gga = chicken data set; eca = horse data set.
| Data set | Same-species (10CV) | Cross-species training data set | | | | | | |
|---|---|---|---|---|---|---|---|---|
| mmu | 0.948 | — | 0.878 | 0.857 | 0.927 | 0.820 | 0.929 | |
| hsa | 0.873 | 0.770 | 0.594 | — | 0.578 | 0.769 | 0.649 | |
| dme | 0.891 | 0.782 | 0.764 | 0.854 | — | 0.781 | 0.809 | |
| bta | 0.798 | 0.528 | 0.669 | 0.150 | — | 0.420 | 0.636 | |
| gga | 0.927 | 0.586 | 0.901 | 0.891 | 0.876 | — | 0.633 | |
| eca | 0.940 | 0.920 | 0.902 | 0.840 | 0.939 | 0.871 | — | |
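
The training configurations named in the caption (10CV, all, and each single other species) can be enumerated for a given test species as in the sketch below. This only illustrates the experimental layout described in the caption. The function name `training_configurations` and the dictionary layout are assumptions, and no classifier training is shown.

```python
SPECIES = ["mmu", "hsa", "dme", "bta", "gga", "eca"]


def training_configurations(test_species, species=SPECIES):
    """Map each configuration name to the data sets used for training.

    "10CV" trains and tests within the test species via cross-validation;
    "all" pools the five other species; each remaining key is one
    single-species cross-species training set.
    """
    if test_species not in species:
        raise ValueError(f"unknown species code: {test_species}")

    others = [s for s in species if s != test_species]
    configs = {"10CV": [test_species], "all": others}
    for s in others:
        configs[s] = [s]  # train on one other species only
    return configs


configs = training_configurations("hsa")
for name, train_sets in configs.items():
    print(f"test=hsa  config={name:<4}  train on: {', '.join(train_sets)}")
```

The test species never appears as its own cross-species training set, which corresponds to the dash in each row of the table.
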