Haibo Cui, Jingjing Zhai, Chuang Ma.
Abstract
MicroRNAs (miRNAs) are a class of short, non-coding RNAs that play regulatory roles in a wide variety of biological processes, such as plant growth and abiotic stress responses. Although several computational tools have been developed to identify primary miRNAs and precursor miRNAs (pre-miRNAs), very few can locate mature miRNAs within plant pre-miRNAs. This manuscript introduces miRLocator, a novel algorithm for predicting miRNAs that is based on machine learning techniques and on sequence and structural features extracted from miRNA:miRNA* duplexes. To address the class imbalance problem (few real miRNAs and a large number of pseudo miRNAs), the prediction models in miRLocator were optimized by considering critical (and often ignored) factors that can markedly affect the prediction accuracy of mature miRNAs, including the machine learning algorithm and the ratio between positive and negative training samples. Ten-fold cross-validation on 5854 experimentally validated miRNAs from 19 plant species showed that miRLocator performed better than the state-of-the-art miRNA predictor miRdup in locating mature miRNAs within plant pre-miRNAs. miRLocator will aid researchers interested in discovering miRNAs from model and non-model plant species.
Year: 2015 PMID: 26558614 PMCID: PMC4641693 DOI: 10.1371/journal.pone.0142753
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Training (A) and testing (B) procedures for ML-based miRNA prediction.
Fig 2. Experimentally validated miRNAs obtained from the miRBase database.
(A) Statistical results of pre-miRNAs carrying experimentally validated miRNAs on their 5' and/or 3' arms. (B) Anatomy of the miRNA duplex in the pre-miRNA hairpin. "miRNA duplex (5p)" and "miRNA duplex (3p)" represent the 5' and 3' strands of the miRNA duplex, respectively. "Loop", "Helix" and "Bulge" are three common structural elements in the secondary structure of pre-miRNAs.
Sequence and structural features used in miRLocator.
| Type | Feature Number | Feature Description |
|---|---|---|
| miRNALen | 1 | miRNA length |
| MNC | 4 | Frequency of four nucleotides |
| DNC | 16 | Frequency of 16 dinucleotides |
| NT(5p) | (5+1+5)*4*2 | Type of nucleotides surrounding the start and end of the miRNA duplex (5p) within the [-5,5] region. A, C, G and U are encoded as 0001, 0010, 0100, and 1000, respectively. |
| NT(3p) | (5+1+5)*4*2 | Type of nucleotides surrounding the start and end of the miRNA duplex (3p) within the [-5,5] region. |
| MFE | 1 | Minimum free energy (MFE) of the miRNA duplex |
| mlBulge | 1 | Maximal length of an miRNA stretch without a bulge in the miRNA duplex |
| bpNum | 1 | Number of base pairs in the miRNA duplex |
| dist2Loop | 1 | Distance of the duplex to the terminal loop |
| dist2Helix | 1 | Distance of the duplex to the start of the helix |
| numLoop | 1 | Number of loops in the miRNA duplex |
| numBulges | 1 | Number of bulges in the miRNA duplex |
| perfectBP | 3*2 | Presence and start position of perfect base pairs with 5, 10, and 20 nt |
| numBP_Win | 3 | Average number of base pairs obtained by scanning the miRNA duplex with a sliding window of length 4, 6, or 8 nt |
| Bulges | 7*2 | Presence of bulges in the area surrounding the start and end of the miRNA within the [-3,3] region |
| posEntropy | (5+1+5)*2 | Entropy values of the areas surrounding the start and end of the miRNA in the [-5,5] region |
| monoSSq | 4*3+1 | Number of different combinations of four nucleotides (A, C, G, and U) and three structure symbols ('(', ')', and '.'). One additional feature was added for undefined combinations. '(' and ')' represent paired nucleotides and '.' represents unpaired nucleotides in the secondary structure of the pre-miRNA. |
| diSS | 16*9+1 | Number of different combinations of 16 dinucleotides and nine di-structure symbols. One additional feature was added for undefined combinations. |
| triplets | 32+1 | Number of 32 combinations of mononucleotide and triplets of structure symbols. One additional feature was added for undefined combinations. |
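
The NT and monoSSq rows above can be made concrete with a minimal Python sketch. It uses the one-hot codes listed in the table (A, C, G, U as 0001, 0010, 0100, 1000). The function names, the 0-based coordinates, and the all-zero code for positions that fall outside the sequence or are not A/C/G/U are assumptions made for illustration; this is not miRLocator's actual implementation.

```python
# One-hot codes as given in the feature table: A, C, G, U -> 0001, 0010, 0100, 1000.
NT_CODES = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0), "G": (0, 1, 0, 0), "U": (1, 0, 0, 0)}
# Assumption: positions outside the sequence, or non-ACGU symbols, map to all zeros.
UNKNOWN = (0, 0, 0, 0)


def encode_window(seq, pos, flank=5):
    """One-hot encode the nucleotides in [pos - flank, pos + flank] (0-based pos)."""
    bits = []
    for i in range(pos - flank, pos + flank + 1):
        nt = seq[i] if 0 <= i < len(seq) else None
        bits.extend(NT_CODES.get(nt, UNKNOWN))
    return bits


def nt_features(seq, start, end, flank=5):
    """Concatenate the windows around the start and the end of one duplex strand."""
    return encode_window(seq, start, flank) + encode_window(seq, end, flank)


def mono_ss_counts(seq, structure):
    """Count nucleotide/structure-symbol pairs: 4 * 3 defined bins + 1 'undefined' bin."""
    counts = {nt + s: 0 for nt in "ACGU" for s in "()."}
    counts["undefined"] = 0
    for nt, s in zip(seq, structure):
        key = nt + s
        counts[key if key in counts else "undefined"] += 1
    return counts
```

With `flank=5`, `nt_features` returns (5+1+5)*4*2 = 88 values per strand, which matches the feature count given for NT(5p) and NT(3p), and `mono_ss_counts` returns 4*3+1 = 13 bins, which matches monoSSq.
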
AUC values of miRNA predictors constructed with different ML algorithms using different ratios between positive and negative samples (RPNSs).
| Ratio | RF | SVM | NB | kNN | DT |
|---|---|---|---|---|---|
| 1:1 | 0.917 | 0.897 | 0.724 | 0.853 | 0.759 |
| 1:5 | 0.930 | 0.888 | 0.725 | 0.861 | 0.752 |
| 1:10 | 0.938 | 0.878 | 0.723 | 0.851 | 0.738 |
| 1:50 | 0.938 | 0.876 | 0.727 | 0.848 | 0.728 |
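
To illustrate what varying the ratio means in practice, the sketch below assembles a training set at a chosen positive-to-negative ratio and computes AUC directly from classifier scores. The helper names and the sampling scheme (all positives kept, negatives drawn without replacement) are assumptions for illustration; the record does not describe how the authors sampled negatives.

```python
import random


def sample_at_ratio(positives, negatives, neg_per_pos, seed=0):
    """Keep all positives and draw neg_per_pos negatives per positive, without replacement."""
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    rng = random.Random(seed)
    return list(positives), rng.sample(list(negatives), n_neg)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in sample size,
    which is fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, `sample_at_ratio(pos, neg, 10)` yields a 1:10 training set when enough negatives are available, and `auc` can then be applied to the scores a model assigns to held-out positive and negative duplexes.
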
Fig 3. Performance of ML-based miRNA predictors in classifying real and pseudo miRNA duplexes.
(A) ROC curve displaying the performance of different ML-based miRNA predictors in the ten-fold cross-validation experiment. (B) Performance of ML-based miRNA predictors obtained with different numbers of features. (C) Distribution of the number of base pairs in the positive and negative sample sets. (D) Distribution of the average number of base pairs in a 4-nt sliding window in the positive and negative sample sets. (E) Frequency of bulges 1 nt upstream of the miRNA end in the positive and negative sample sets.
Feature importance scores evaluated with the RF-based Gini importance algorithm.
| Features | Description | 1:1 | 1:5 | 1:10 | 1:50 |
|---|---|---|---|---|---|
| bpNum | Number of base pairs in the duplex | 0.020[1] | 0.012[3] | 0.009[3] | 0.006[11] |
| numBP_Win4 | Average number of base pairs in a sliding window of 4 nt | 0.020[2] | 0.015[1] | 0.010[1] | 0.006[4] |
| numBP_Win6 | Average number of base pairs in a sliding window of 6 nt | 0.018[3] | 0.013[2] | 0.010[2] | 0.006[6] |
| BulgeAtEnd_u1 | Bulge located 1 nt upstream of the miRNA end | 0.016[4] | 0.008[6] | 0.006[14] | 0.003[163] |
| numBP_Win8 | Average number of base pairs in a sliding window of 8 nt | 0.014[5] | 0.010[5] | 0.008[5] | 0.005[16] |
| BulgeAtEnd_u2 | Bulge located 2 nt upstream of the miRNA end | 0.012[6] | 0.007[14] | 0.005[31] | 0.003[177] |
| BulgeAtEnd_u3 | Bulge located 3 nt upstream of the miRNA end | 0.011[7] | 0.006[19] | 0.005[40] | 0.003[111] |
| Persent_10mer | Presence and start position of perfect base pairs in a length of 10 nt | 0.011[8] | 0.008[7] | 0.006[13] | 0.003[103] |
| U) | Paired U nucleotide in the miRNA duplex | 0.010[9] | 0.006[17] | 0.005[25] | 0.003[113] |
| mlBulge | Maximal miRNA length without a bulge in the duplex | 0.009[10] | 0.007[12] | 0.007[10] | 0.005[26] |
The rank of each feature is given in brackets.
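
As background for the scores above, Gini importance in a random forest is built from the impurity decrease achieved each time a tree splits on a feature, accumulated over the nodes that use that feature and averaged across trees. The sketch below computes that decrease for a single split on binary class labels. The function names and the toy usage are hypothetical, and this is not the authors' code.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(feature, labels, threshold):
    """Impurity decrease from splitting the samples at feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples sent to each side.
    child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - child
```

A feature that separates real from pseudo duplexes perfectly at some threshold removes all impurity at that node, while a feature unrelated to the labels leaves it unchanged.
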
Fig 4. Cumulative frequency of correctly predicted start (A) and end (B) positions of miRNAs at different resolutions.
For a given resolution d (0 ≤ d ≤ 10 nt), the predicted start (end) of an miRNA was regarded as correct if it was located within d nt of the annotated start (end).
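
The resolution criterion in this caption can be expressed as a short Python sketch. The function name and the list-based inputs (one predicted and one annotated position per miRNA, in the same order) are assumptions for illustration.

```python
def cumulative_frequency(predicted, annotated, max_d=10):
    """Fraction of predictions within d nt of the annotation, for d = 0..max_d."""
    offsets = [abs(p - a) for p, a in zip(predicted, annotated)]
    return [sum(o <= d for o in offsets) / len(offsets) for d in range(max_d + 1)]
```

The returned list is non-decreasing in d. Its first entry is the fraction of positions predicted exactly, and calling the function separately on start and end positions gives the two curves described in the caption.
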
Fig 5. Effect of sample size on the prediction accuracy of miRLocator and miRdup.
Fig 6. Cumulative frequency of correctly predicted start and end positions of miRNAs at different resolutions.