| Literature DB >> 28105918 |
Yuri Bento Marques, Alcione de Paiva Oliveira, Ana Tereza Ribeiro Vasconcelos, Fabio Ribeiro Cerqueira.
Abstract
BACKGROUND: MicroRNAs (miRNAs) are key gene expression regulators in plants and animals and are therefore involved in several biological processes, making the study of these molecules one of the most relevant topics in molecular biology today. However, characterizing miRNAs in vivo is still a complex task. As a consequence, in silico methods have been developed to predict miRNA loci. A common ab initio strategy to find miRNAs in genomic data is to search for sequences that can fold into the typical hairpin structure of miRNA precursors (pre-miRNAs). The current ab initio approaches, however, have selectivity issues, i.e., they report a high number of false positives, which can lead to laborious and costly attempts at biological validation. This study presents an extension of the ab initio method miRNAFold that aims to improve selectivity through machine learning techniques, namely random forest combined with the SMOTE procedure, which copes with imbalanced datasets.
Entities:
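The SMOTE procedure mentioned above balances a training set by generating synthetic minority-class examples, interpolating between a minority sample and one of its k nearest minority neighbors. A minimal sketch of that idea in plain Python (not the authors' implementation; the toy feature vectors are invented for illustration):

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=42):
    """Generate synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbors within the minority class (Euclidean distance)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy 2-D minority class (hypothetical feature vectors)
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority, n_synthetic=4)
print(len(new_points))  # 4 synthetic samples
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays within its original region of feature space instead of merely duplicating examples.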
Keywords: Data mining; Machine learning; Pre-miRNA ab initio prediction; Random forest; Smote; microRNA
Mesh:
Substances:
Year: 2016 PMID: 28105918 PMCID: PMC5249014 DOI: 10.1186/s12859-016-1343-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 General view of the Mirnacle approach (adapted from Tempel and Tahi [14]). Given the input DNA sequence (a), a sliding window is used to extract subsequences whose length is close to the expected pre-miRNA length. For each subsequence, a triangular base pairing matrix (shown here for the sequence CAGAUUUACUAGUACGUAAUUUG) is constructed and analyzed in three stages. In the first stage (b and c), long exact stems (series of positive numbers in the diagonals) are sought and classified. Next, in the second stage (d and e), the diagonal of each positively classified exact stem is searched to form a non-exact stem (series of positive numbers interspersed with series of 0's), which also passes through a classification procedure. Finally, in the last stage (f and g), a complete hairpin is produced from each previously filtered non-exact stem, using the originally identified exact stem as the starting point for a further search in the matrix. In this search, other diagonals are tried so that secondary structures with asymmetrical internal loops are also considered. The resultant hairpins are then classified with a third ML model, and only those predicted as positive are given as the final output (h)
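The triangular matrix described in this caption can be read as follows: each cell holds the length of the run of consecutive complementary base pairs ending at that position, so a series of positive numbers along a diagonal marks an exact stem. A rough sketch of one plausible construction (the exact scoring, including G-U wobble pairs, is an assumption for illustration, not the published miRNAFold definition):

```python
# Watson-Crick pairs plus the G-U wobble pair common in RNA stems (assumption)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def pairing_matrix(seq):
    """M[i][j] = length of the run of consecutive complementary pairs
    (i, j), (i-1, j+1), ... ending at positions i < j; 0 breaks the run."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n - 1, i, -1):
            if (seq[i], seq[j]) in PAIRS:
                # cells (i, j) and (i-1, j+1) lie on the same anti-diagonal
                outer = M[i - 1][j + 1] if i > 0 and j + 1 < n else 0
                M[i][j] = outer + 1
    return M

seq = "CAGAUUUACUAGUACGUAAUUUG"  # example sequence from Fig. 1
M = pairing_matrix(seq)
# The longest exact stem is the largest value anywhere in the matrix
best = max(max(row) for row in M)
print(best)
```

Scanning each anti-diagonal for runs of positive values then yields the candidate exact stems that stage one classifies.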
Fig. 2 Illustration of the incremental approach performed in the base pairing matrix analysis. (a) A long exact stem (in blue) is identified (steps shown in Fig. 1 b/c). (b) The exact stem is then extended to a non-exact stem (Fig. 1 d/e, here in green and blue) that, in turn, is the basis for building a complete hairpin (procedure represented in Fig. 1 f/g)
Fig. 3 Training set construction. Each stage has its own training set for building its specific machine learning model. Real pre-miRNAs are used as positive examples, while the negative examples comprise snRNAs, snoRNAs, tRNAs, miscRNAs, and pseudo hairpins
Comparing different combinations of methods for imbalanced data and learning algorithms. A 10-fold cross-validation was performed in each case using TSHA1, TSHA2, and TSHA3 for the respective stages
| Method | 1st stage | | | 2nd stage | | | 3rd stage | | |
|---|---|---|---|---|---|---|---|---|---|
| | SN | SL | GM | SN | SL | GM | SN | SL | GM |
| LibSVM: | |||||||||
| Cost matrix | 4.7 | 64.7 | 17.4 | 3.8 | 81.8 | 17.7 | 4.2 | 100 | 20.6 |
| Sampling | 99.6 | 55.4 | 74.3 | 99.6 | 54.8 | 73.8 | 100 | 65 | 80.6 |
| SMOTE | 87.5 | 99.6 | 93.4 | 82.5 | 100 | 90.8 | 97.9 | 100 | 98.9 |
| SMO: | |||||||||
| Cost matrix | 80.9 | 17.9 | 38 | 85.6 | 40 | 58.1 | 97.9 | 77.5 | 87.1 |
| Sampling | 84.3 | 77.7 | 81 | 86 | 90.6 | 88.3 | 98.7 | 96.7 | 97.7 |
| SMOTE | 86.3 | 83.5 | 84.9 | 91.9 | 94.4 | 93.1 | 99.9 | 99.3 | 99.6 |
| MLP: | |||||||||
| Cost matrix | 74.1 | 16.7 | 35.2 | 79.2 | 49.7 | 62.7 | 98.7 | 3.9 | 19.6 |
| Sampling | 78.8 | 77.5 | 78.2 | 89.8 | 88.3 | 89.1 | 97 | 97.9 | 97.4 |
| SMOTE | 91.5 | 90.8 | 91.1 | 98 | 97.1 | 97.5 | 99.9 | 99.7 | 99.8 |
| RF: | |||||||||
| Cost matrix | 46.2 | 44 | 45.1 | 74.6 | 76.5 | 75.5 | 89 | 89 | 89 |
| Sampling | 84.3 | 78.7 | 81.4 | 87.7 | 87.7 | 87.7 | 97.9 | 97.1 | 97.5 |
| SMOTE | 98.2 | 96.3 | 97.2 | 99.1 | 98.4 | 98.7 | 99.9 | 99.4 | 99.6 |
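In these tables, SN is sensitivity, SL is selectivity, and GM is their geometric mean, sqrt(SN × SL), which rewards a balance between the two and collapses toward zero when either metric is poor. A quick check against a row of the table above:

```python
import math

def geometric_mean(sn, sl):
    """GM = sqrt(SN * SL): penalizes a large gap between sensitivity and selectivity."""
    return math.sqrt(sn * sl)

# LibSVM + sampling, 1st stage (values from the table above)
print(round(geometric_mean(99.6, 55.4), 1))  # → 74.3
```

The same formula reproduces the other GM columns to within rounding, which is why GM, rather than accuracy, is the headline metric on these heavily imbalanced datasets.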
Comparison of the selected classifiers combined with the SMOTE filter using TSHS1, TSHS2, and TSHS3 (A), as well as TSMM1, TSMM2, and TSMM3 (B), for the respective stages. Each ML algorithm was tested on each training set through a 10-fold cross-validation
| Classifier | 1st stage | | | 2nd stage | | | 3rd stage | | |
|---|---|---|---|---|---|---|---|---|---|
| | SN | SL | GM | SN | SL | GM | SN | SL | GM |
| (A) Results for TSHS1, TSHS2, and TSHS3: | |||||||||
| LibSVM | 87.5 | 99.5 | 93.3 | 84.7 | 100 | 92 | 98.1 | 100 | 99.1 |
| SMO | 86.7 | 84 | 85.3 | 92.9 | 94.5 | 93.7 | 99.9 | 99.4 | 99.6 |
| MLP | 92.3 | 88.4 | 90.3 | 97.9 | 97 | 97.5 | 99.9 | 99.7 | 99.8 |
| RF | 97.8 | 97.7 | 97.7 | 98.7 | 99 | 98.8 | 100 | 99.7 | 99.8 |
| (B) Results for TSMM1, TSMM2, and TSMM3: | |||||||||
| LibSVM | 89.4 | 98.8 | 94 | 100 | 60.5 | 77.8 | 97.6 | 100 | 98.8 |
| SMO | 85.6 | 81.6 | 83.5 | 92 | 92.5 | 92.2 | 99.2 | 99 | 99.1 |
| MLP | 88.9 | 86.8 | 87.8 | 95.2 | 94.9 | 95.1 | 99.9 | 99.5 | 99.7 |
| RF | 96.4 | 94.6 | 95.5 | 97.8 | 97.9 | 97.8 | 99.8 | 99.6 | 99.7 |
Comparison of the methods for pre-miRNA ab initio prediction using sequence HAD. For Mirnacle, five distinct combinations of discriminant probability thresholds for the three respective stages are shown in parentheses
| Method | Sensitivity | Selectivity | GM | Time (mm:ss) |
|---|---|---|---|---|
| Mirnacle (0.3,0.3,0.7) | 97 | 81.51 | 88.91 | 14:58 |
| Mirnacle (0.4,0.4,0.7) | 86 | 85.15 | 85.57 | 07:31 |
| Mirnacle (0.5,0.5,0.7) | 65 | 86.76 | 75.09 | 03:24 |
| Mirnacle (0.6,0.6,0.7) | 55 | 94.83 | 72.21 | 02:05 |
| Mirnacle (0.8,0.8,0.7) | 23 | 95.83 | 46.94 | 01:20 |
| TVM | 97 | 39.34 | 61.77 | - |
| miRNAFold | 97 | 19.17 | 43.12 | 00:0.84 |
| miRPara | 97 | 9.70 | 30.67 | 05:24 |
| CID-miRNA | 97 | 11.72 | 33.71 | 90:49 |
| VMir | 28 | 1.32 | 6.07 | 02:32 |
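The five Mirnacle rows show the usual effect of raising the per-stage discriminant probability thresholds: more candidates are filtered at each stage, so selectivity and run time improve while sensitivity drops. A minimal illustration of such threshold filtering (the candidate probabilities are invented for illustration, not the authors' data):

```python
def filter_by_threshold(candidates, threshold):
    """Keep only candidates whose predicted probability meets the threshold."""
    return [c for c in candidates if c[1] >= threshold]

# (candidate id, classifier probability of being a true pre-miRNA) -- hypothetical
candidates = [("h1", 0.95), ("h2", 0.62), ("h3", 0.41), ("h4", 0.35), ("h5", 0.78)]

for t in (0.3, 0.5, 0.8):
    kept = filter_by_threshold(candidates, t)
    print(t, len(kept))
```

A stricter threshold discards borderline candidates that would otherwise propagate to later, more expensive stages, which is why both selectivity and speed rise together in the table.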
Comparison of the methods for pre-miRNA ab initio prediction using sequences HSD (A) and MMD (B). The Mirnacle results use thresholds (0.3, 0.3, 0.7) for the three stages, respectively
| Method | Sensitivity | Selectivity | GM |
|---|---|---|---|
| (A) Results for the HSD sequence: | | | |
| Mirnacle | 100 | 29.52 | 54.33 |
| TVM | 100 | 1.43 | 11.95 |
| miRNAFold | 100 | 0.89 | 9.43 |
| miRPara | 98 | 0.93 | 9.54 |
| CID-miRNA | 38 | 0.69 | 5.12 |
| VMir | 100 | 0.56 | 7.48 |
| (B) Results for the MMD sequence: | | | |
| Mirnacle | 98.61 | 47.33 | 68.31 |
| TVM | 100 | 4.13 | 20.32 |
| miRNAFold | 98.59 | 7.71 | 27.57 |
| miRPara | 98.59 | 5.34 | 22.94 |
| CID-miRNA | 29.58 | 0.82 | 4.92 |
| VMir | 88.73 | 2.93 | 16.12 |