Mickael Leclercq, Abdoulaye Banire Diallo, Mathieu Blanchette.
Abstract
MicroRNAs (miRNAs) are short RNA species derived from hairpin-forming miRNA precursors (pre-miRNA) and acting as key posttranscriptional regulators. Most computational tools labeled as miRNA predictors are in fact pre-miRNA predictors and provide no information about the putative miRNA location within the pre-miRNA. Sequence and structural features that determine the location of the miRNA, and the extent to which these properties vary from species to species, are poorly understood. We have developed miRdup, a computational predictor for the identification of the most likely miRNA location within a given pre-miRNA or the validation of a candidate miRNA. MiRdup is based on a random forest classifier trained with experimentally validated miRNAs from miRbase, with features that characterize the miRNA-miRNA* duplex. Because we observed that miRNAs have sequence and structural properties that differ between species, mostly in terms of duplex stability, we trained various clade-specific miRdup models and obtained increased accuracy. MiRdup self-trains on the most recent version of miRbase and is easy to use. Combined with existing pre-miRNA predictors, it will be valuable for both de novo mapping of miRNAs and filtering of large sets of candidate miRNAs obtained from transcriptome sequencing projects. MiRdup is open source under the GPLv3 and available at http://www.cs.mcgill.ca/~blanchem/mirdup/.
Year: 2013 PMID: 23748953 PMCID: PMC3753617 DOI: 10.1093/nar/gkt466
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Features used in miRdup
| Features | Number | Description |
|---|---|---|
| miRNA primary sequence | ||
| Single nucleotide frequency | 4 | Frequency of each nucleotide |
| Dinucleotide frequency | 16 | Frequency of each dinucleotide |
| GC content | 1 | Frequency of C or G |
| First/last nucleotide | 8 | Nucleotide type at the miRNA start and end |
| Length | 1 | miRNA length |
| miRNA–miRNA* duplex | ||
| Triplets | 32 | Frequency of each sequence/structure triplet |
| Bulges | 22 | Bulge(s) at positions −4 to +4 nt around the start and end of the miRNA. Bulge lengths and number of bulges in the miRNA. |
| Base pairing | 10 | Average number of base pairs in the duplex and in sliding windows of length 3, 5 and 7 nt. Presence and start position of perfectly base-paired stretches of 5, 10 and 20 nt. |
| Pairs type | 3 | Percentage of bases forming each type of canonical/wobble base pairs (C-G, A-U, G-U) in the duplex |
| Loop | 2 | Percentage of the miRNA overlapping the hairpin loop |
| MFE | 1 | MFE of the duplex |
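
As an illustration of the primary-sequence rows of the table above (4 + 16 + 1 + 8 + 1 = 30 features), the following is a minimal sketch and not miRdup's own code. The function name `primary_sequence_features`, the feature keys, the use of overlapping dinucleotide windows and the one-hot encoding of the first and last nucleotide are assumptions made for the example. The duplex features (triplets, bulges, base pairing, MFE) require a predicted secondary structure and are not covered here.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def primary_sequence_features(mirna: str) -> dict:
    """Return 30 primary-sequence features for one candidate miRNA.

    Hypothetical helper: names and encodings are illustrative assumptions.
    """
    seq = mirna.upper().replace("T", "U")  # accept DNA-style input
    if len(seq) < 2:
        raise ValueError("sequence must contain at least 2 nucleotides")
    n = len(seq)
    features = {}

    # 4 single-nucleotide frequencies
    for nt in NUCLEOTIDES:
        features[f"freq_{nt}"] = seq.count(nt) / n

    # 16 dinucleotide frequencies over the n - 1 overlapping windows
    windows = [seq[i:i + 2] for i in range(n - 1)]
    for a, b in product(NUCLEOTIDES, repeat=2):
        features[f"freq_{a}{b}"] = windows.count(a + b) / (n - 1)

    # 1 GC content: frequency of C or G
    features["gc_content"] = (seq.count("G") + seq.count("C")) / n

    # 8 indicators: nucleotide type at the miRNA start and end (one-hot, assumed)
    for nt in NUCLEOTIDES:
        features[f"first_is_{nt}"] = int(seq[0] == nt)
        features[f"last_is_{nt}"] = int(seq[-1] == nt)

    # 1 length
    features["length"] = n
    return features


if __name__ == "__main__":
    example = primary_sequence_features("UGAGGUAGUAGGUUGUAUAGUU")
    print(len(example), example["length"], round(example["gc_content"], 3))
```

The resulting dictionary can be flattened into one row of a feature matrix per candidate miRNA before being passed to a classifier.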
Figure 1. Structure of a pre-miRNA hairpin. The dashed box represents the duplex from which features are computed.
Attribute ranking scores evaluated on the full miRbase data set and on lineage-specific data sets with the Information Gain ranker
| Features (Total: 22) | miRbase rank score | Mammals rank score | Plants rank score | Arthropods rank score | Nematodes rank score | Fishes rank score |
|---|---|---|---|---|---|---|
| Average number of paired bases in 3 bp sliding window | 0.181 [2] | 0.165 [2] | 0.190 [5] | 0.220 [5] | ||
| Length of the longest bulges (% of miRNA length) | 0.185 [2] | 0.176 [3] | 0.203 [5] | 0.153 [4] | 0.190 [3] | 0.193 [3] |
| Length of the longest bulges (nt) | 0.183 [3] | 0.175 [4] | 0.197 [7] | 0.147 [6] | 0.196 [2] | 0.189 [2] |
| Average number of paired bases in 5 bp sliding window | 0.174 [5] | 0.171 [5] | 0.21 [4] | 0.163 [3] | 0.168 [6] | 0.201 [6] |
| 0.174 [4] | 0.151 [9] | |||||
| 0.165 [6] | 0.151 [8] | 0.213 [3] | 0.137 [7] | 0.182 [4] | 0.188 [4] | |
| Average number of paired bases in 7 bp sliding window | 0.159 [7] | 0.156 [7] | 0.2 [6] | 0.136 [8] | 0.146 [7] | 0.181 [7] |
| 0.147 [8] | 0.167 [6] | 0.107 [14] | 0.115 [9] | 0.145 [8] | 0.150 [8] | |
| 0.137 [9] | 0.112 [10] | 0.214 [2] | 0.162 [5] | 0.102 [12] | 0.196 [12] | |
| Percentage of GC base pairs in the duplex | 0.122 [10] | 0.09 [14] | 0.102 [15] | 0.060 [16] | 0.059 [17] | 0.068 [17] |
| Percentage of AU base pairs in the duplex | 0.118 [11] | 0.068 [18] | 0.083 [19] | 0.027 [22] | 0.058 [18] | 0.046 [18] |
| Triplet U | 0.117 [12] | 0.114 [9] | 0.124 [10] | 0.106 [10] | 0.107 [11] | 0.128 [11] |
| 0.112 [13] | 0.094 [13] | 0.155 [8] | 0.077 [14] | 0.144 [9] | 0.107 [9] | |
| Triplet A | 0.111 [14] | 0.099 [12] | 0.113 [11] | 0.085 [12] | 0.126 [10] | 0.114 [10] |
| miRNA included in loop (yes/no) | 0.107 [15] | 0.105 [11] | 0.076 [20] | 0.058 [17] | 0.088 [14] | 0.067 [14] |
| Triplet C | 0.082 [16] | 0.074 [17] | 0.09 [17] | 0.063 [15] | 0.091 [13] | 0.101 [13] |
| Percentage of GU base pairs in the duplex | 0.074 [17] | 0.076 [16] | 0.084 [18] | 0.034 [20] | 0.045 [19] | 0.055 [19] |
| Triplet G | 0.068 [18] | 0.08 [15] | 0.069 [21] | 0.082 [13] | 0.082 [15] | 0.110 [15] |
| Position of the first 5 nt bulge-free region | 0.066 [19] | 0.059 [19] | 0.098 [16] | 0.103 [11] | 0.076 [16] | 0.124 [16] |
| Triplet G | 0.059 [20] | 0.029 [22] | 0.058 [22] | 0.027 [21] | 0.022 [21] | 0.040 [21] |
| 0.058 [22] | 0.05 [21] | 0.112 [12] | 0.038 [19] | 0.037 [20] | 0.056 [20] | |
| 0.058 [21] | 0.051 [20] | 0.11 [13] | 0.049 [18] | 0.033 [22] | 0.063 [22] |
Scores are based on the information gain between the attribute and the class (67). Best score is bold. Features with substantially different scores (>0.05) in mammals versus plants are underlined. Full ranking values are in Supplementary Tables S1 for miRbase, mammals and plants.
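
The information gain score between an attribute and the class is the class entropy minus the class entropy conditional on the attribute, IG = H(class) − H(class | attribute). The sketch below shows this for an attribute that has already been discretized. How the ranker used in the paper discretizes continuous features is not stated here, so that step is left out, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """H(class) - H(class | attribute) for a discrete attribute.

    Assumes attribute_values are already discretized (e.g. binned);
    the binning strategy is outside the scope of this sketch.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Toy data: 1 = true miRNA, 0 = negative example
    labels = [1, 1, 0, 0]
    print(information_gain(["high", "high", "low", "low"], labels))
    print(information_gain(["high", "low", "high", "low"], labels))
```

An attribute that separates the two classes perfectly reaches the class entropy (1 bit for balanced classes), while an attribute that is independent of the class scores 0.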
Figure 5. Cumulative distribution of the minimum distance between the true and predicted miRNA or miRNA* starts (top) and ends (bottom), i.e. the proportion of cases where the prediction is within x bases of the true start/end position. Multilineage miRdup predictions are compared with MatureBayes (57), MiRalign (61), MaturePred (63) and PromiR1 (40) for all experimentally validated pre-miRNAs from miRbase, except for MaturePred, where our analysis was limited to only 2400 submitted miRNAs owing to web server constraints. For MatureBayes and Promir, a small number of queries were rejected by the web server and were thus excluded from the results. We only show distances of up to 10 nt, but in some rare cases, errors are substantially larger (up to 250 nt). Results for lineage-specific miRdup compared with MatureBayes for mammals, arthropods, nematodes, fish and plants are shown in Supplementary Figure S1.
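
The quantity plotted in Figure 5 can be sketched as follows: for each pre-miRNA, take the smallest absolute offset between the predicted boundary and the annotated miRNA or miRNA* boundary, then report the fraction of cases within x nt for x = 0 to 10. The function names and the toy distances are hypothetical; this is an illustration of the evaluation, not the authors' script.

```python
def min_boundary_distance(predicted, true_positions):
    """Smallest absolute offset between a predicted boundary and any
    annotated boundary (e.g. the miRNA and miRNA* start positions)."""
    return min(abs(predicted - t) for t in true_positions)


def cumulative_within(distances, max_distance=10):
    """Fraction of predictions within x nt of the truth, for x = 0..max_distance."""
    total = len(distances)
    return [
        sum(1 for d in distances if d <= x) / total
        for x in range(max_distance + 1)
    ]


if __name__ == "__main__":
    # Toy distances in nt; the last one lies beyond the plotted 10 nt range
    toy = [0, 1, 1, 3, 12]
    for x, fraction in enumerate(cumulative_within(toy)):
        print(x, fraction)
```

Because errors larger than the plotted range are still counted in the denominator, the curve need not reach 1 at x = 10.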
Results of various classifiers trained on all features of miRbase (all lineages) evaluated using 10-fold cross-validation
| Classifier | Correctly classified instances (out of 39 646) | Sensitivity | Specificity | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| Random forest with AdaBoost | 31 940 | 0.863 | 0.748 | 0.806 | 0.614 | 0.892 |
| C4.5 decision tree with AdaBoost | 31 317 | 0.809 | 0.771 | 0.79 | 0.58 | 0.875 |
| SVM with radial basis kernel | 25 878 | 0.344 | 0.962 | 0.653 | 0.385 | 0.653 |
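
For reference, the sensitivity, specificity, accuracy and MCC columns follow the standard definitions from a binary confusion matrix, sketched below with hypothetical counts (the per-class counts behind the table are not given here). AUC is omitted because it needs per-instance scores rather than counts.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal is empty
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formulas
    for name, value in classification_metrics(tp=90, fn=10, tn=80, fp=20).items():
        print(f"{name}: {value:.3f}")
```

The accuracy column can be checked directly from the table: 31 940 / 39 646 ≈ 0.806 for the random forest, 31 317 / 39 646 ≈ 0.790 for C4.5 and 25 878 / 39 646 ≈ 0.653 for the SVM.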
Prediction accuracy of lineage-specific miRdup predictors (random forest with AdaBoost, evaluated using 10-fold cross-validation)
| Classifier | Number of instances | Correctly classified instances | Sensitivity | Specificity | ACC | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Mammals | 13 918 | 11 415 | 0.868 | 0.772 | 0.82 | 0.642 | 0.897 |
| Plants | 9464 | 7734 | 0.866 | 0.768 | 0.817 | 0.636 | 0.904 |
| Nematodes | 2174 | 1789 | 0.882 | 0.764 | 0.823 | 0.649 | 0.898 |
| Arthropods | 5240 | 4071 | 0.833 | 0.721 | 0.777 | 0.557 | 0.857 |
| Fish | 1530 | 1323 | 0.905 | 0.824 | 0.864 | 0.731 | 0.918 |
Accuracy of lineage-specific and non-lineage-specific miRdup predictors (rows) for the prediction of miRNAs from each lineage (columns)
| Training set (rows) / Test set (columns) | miRbase | Nematodes | Arthropods | Fish | Mammals | Plants |
|---|---|---|---|---|---|---|
| miRbase | 0.818 | 0.852 | 0.808 | 0.790 | ||
| Nematodes | 0.74 | 0.823 | 0.755 | 0.82 | 0.768 | 0.618 |
| Arthropods | 0.768 | 0.812 | 0.777 | 0.806 | 0.765 | 0.712 |
| Fish | 0.716 | 0.808 | 0.72 | 0.741 | 0.606 | |
| Mammals | 0.793 | 0.766 | 0.846 | 0.655 | ||
| Plants | 0.700 | 0.662 | 0.644 | 0.681 | 0.645 |
The highest accuracy for each column is in bold. For cases where a predictor is applied to data from the lineage it is trained on, the numbers reported are obtained by 10-fold cross-validation.