Adam Gudyś, Michał Wojciech Szcześniak, Marek Sikora, Izabela Makałowska.
Abstract
BACKGROUND: Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current machine learning methods have drawbacks, including a failure to address the class imbalance problem properly. This may lead to overlearning the majority class and/or an incorrect assessment of classification performance. Moreover, those tools are effective for only a narrow range of species, usually the model ones. This study aims to improve the performance of the miRNA classification procedure, extend its usability and reduce computational time.
Year: 2013 PMID: 23497112 PMCID: PMC3686668 DOI: 10.1186/1471-2105-14-83
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets characteristics
| #Positives | #Negatives | Imbalance |
| 1 406 | 81 228 | 57.8 |
| 231 | 28 359 | 122.8 |
| 7 053 | 218 154 | 30.9 |
| 2 172 | 114 929 | 52.9 |
| 237 | 839 | 3.5 |
| 691 | 9 248 | 13.4 |
Characteristics of the biological datasets used in the experiments. Imbalance is defined as the ratio of #Negatives to #Positives. We limited dataset imbalance to several tens for practical reasons, even though the proportions of miRNAs to non-miRNAs in genomes are more extreme. In the case of the virus dataset, the imbalance is exceptionally low because we wanted to know how the methods perform on moderately imbalanced problems. In addition, it is difficult to create a representative dataset for viruses, as their genomes differ significantly in size and most of them do not contain miRNAs.
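The imbalance values can be reproduced directly from the two count columns. The following is a minimal Python sketch using the counts listed in the table; the function name `imbalance_ratio` is an illustrative choice and not part of the original work.

```python
def imbalance_ratio(n_positives: int, n_negatives: int) -> float:
    """Return #Negatives / #Positives, the imbalance measure from the caption."""
    if n_positives <= 0:
        raise ValueError("n_positives must be a positive integer")
    return n_negatives / n_positives


# (#Positives, #Negatives) pairs copied from the table rows, in order.
counts = [
    (1406, 81228),
    (231, 28359),
    (7053, 218154),
    (2172, 114929),
    (237, 839),
    (691, 9248),
]

# Round to one decimal place, matching the precision used in the table.
ratios = [round(imbalance_ratio(p, n), 1) for p, n in counts]
print(ratios)
```

Rounded to one decimal place, the computed ratios agree with the imbalance column of the table.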
Relative gains in classification results
| Naïve Bayes | 1.11 | 0.00 | 1.11 |
| Perceptron | 7.70 | 0.26 | 7.76 |
| SVM | 10.11 | 1.89 | 10.29 |
| Random forest | 6.95 | 1.55 | 9.30 |
Relative percentage gains in G obtained by applying parameter and/or threshold selection to different classifiers, averaged over all datasets.
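As a sketch of how such figures can be derived, the snippet below assumes that G denotes the geometric mean of sensitivity and specificity (the rows of the detailed results table are numerically consistent with this reading) and that a relative gain is the percentage change with respect to a baseline G. Both function names are illustrative, and the gain formula is an assumption rather than a definition stated in this record.

```python
import math


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity (both in percent)."""
    return math.sqrt(sensitivity * specificity)


def relative_gain(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline` (assumed definition)."""
    return 100.0 * (improved - baseline) / baseline


# The two (sensitivity, specificity) pairs from the first N. Bayes row of the
# detailed results table. Which experimental setting each pair belongs to is
# not stated in this record, so they are only labelled "first" and "second".
g_first = g_mean(87.98, 96.33)
g_second = g_mean(91.97, 93.93)

print(round(g_first, 2), round(g_second, 2))
print(round(relative_gain(g_first, g_second), 2))
```

The two G values computed here match the third and sixth entries of that table row to two decimal places.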
Detailed classification results
| N. Bayes | 87.98 | 96.33 | 92.06 | 91.97 | 93.93 | 92.94 |
| Perceptron | 69.56 | 99.84 | 83.34 | | | |
| SVM | 69.56 | 99.85 | 83.34 | 92.53 | 95.69 | 94.10 |
| R. forest | 68.21 | 99.85 | 82.53 | 91.53 | 96.34 | 93.90 |
| APLSC | 94.88 | 92.14 | 93.50 | | | |
| SMOTE + SVM | 77.67 | 99.02 | 87.69 | | | |
| N. Bayes | 86.99 | 98.91 | 92.76 | 91.30 | 97.77 | 94.48 |
| Perceptron | 80.09 | 99.95 | 89.47 | 93.04 | 97.47 | 95.23 |
| SVM | 80.07 | 99.96 | 89.47 | 93.04 | 98.95 | 95.95 |
| R. forest | 83.55 | 99.94 | 91.38 | | | |
| APLSC | 96.09 | 90.42 | 93.21 | | | |
| SMOTE + SVM | 88.71 | 99.64 | 94.02 | | | |
| N. Bayes | 85.54 | 95.53 | 90.40 | 88.83 | 92.81 | 90.79 |
| Perceptron | 74.03 | 99.65 | 85.89 | 91.78 | 95.13 | 93.44 |
| SVM | 72.04 | 99.74 | 84.77 | 90.67 | 96.09 | 93.34 |
| R. forest | 72.52 | 99.72 | 85.04 | | | |
| APLSC | 91.93 | 91.13 | 91.53 | | | |
| SMOTE + SVM | 84.56 | 98.68 | 91.35 | | | |
| N. Bayes | 83.56 | 97.56 | 90.29 | 87.48 | 95.84 | 91.57 |
| Perceptron | 77.30 | 99.80 | 87.83 | 89.64 | 97.38 | 93.43 |
| SVM | 73.07 | 99.85 | 85.42 | 89.46 | 97.93 | 93.60 |
| R. forest | 78.41 | 99.81 | 88.47 | | | |
| APLSC | 92.77 | 89.39 | 91.07 | | | |
| SMOTE + SVM | 81.31 | 99.32 | 89.86 | | | |
| N. Bayes | 93.21 | 93.21 | 93.21 | 95.74 | 92.37 | 94.04 |
| Perceptron | 87.77 | 98.10 | 92.79 | 94.08 | 95.71 | 94.89 |
| SVM | 90.31 | 98.10 | 94.12 | | | |
| R. forest | 88.59 | 98.45 | 93.39 | 93.26 | 96.31 | 94.77 |
| APLSC | 96.61 | 92.97 | 94.77 | | | |
| SMOTE + SVM | 91.99 | 97.14 | 94.53 | | | |
| N. Bayes | 80.32 | 94.27 | 87.02 | 89.43 | 87.91 | 88.67 |
| Perceptron | 82.35 | 99.37 | 90.46 | 90.74 | 94.65 | 92.67 |
| SVM | 79.31 | 99.72 | 88.93 | 89.29 | 97.01 | 93.07 |
| R. forest | 75.83 | 99.66 | 86.94 | | | |
| APLSC | 91.45 | 90.96 | 91.21 | | | |
| SMOTE + SVM | 87.70 | 98.83 | 93.10 | | | |
Figure 1. Statistical significance diagram. Critical difference diagram for Nemenyi tests performed on the human, animal, arabidopsis, plant and virus datasets. Average ranks of the examined methods are presented. Bold lines indicate groups of classifiers that are not significantly different (their average ranks differ by less than the CD value).
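The CD value in such a diagram is conventionally computed as CD = q_alpha * sqrt(k(k+1) / (6N)) for k methods ranked over N datasets. The sketch below illustrates this standard formula. The significance level (0.05) and the number of compared methods (six, matching the classifiers listed in the detailed results table) are assumptions, as neither is stated in the caption; the q_alpha constants are the commonly tabulated critical values for the two-tailed Nemenyi test.

```python
import math

# Tabulated critical values q_alpha for the two-tailed Nemenyi test at
# alpha = 0.05, indexed by the number of compared methods k.
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def critical_difference(k: int, n_datasets: int) -> float:
    """Nemenyi critical difference for k methods ranked over n_datasets."""
    if k not in Q_ALPHA_005:
        raise ValueError("no tabulated q_alpha for this number of methods")
    if n_datasets <= 0:
        raise ValueError("n_datasets must be positive")
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def not_significantly_different(rank_a: float, rank_b: float, cd: float) -> bool:
    """Two methods fall in the same group if their average ranks differ by less than CD."""
    return abs(rank_a - rank_b) < cd


# Assumed setting: six methods, five datasets (as listed in the caption).
cd = critical_difference(k=6, n_datasets=5)
print(round(cd, 3))
```

With only five datasets the critical difference is large relative to the range of possible ranks, so only methods with widely separated average ranks would be reported as significantly different.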
Training times
| Naïve Bayes | 00:00:13 | 00:01:03 | 00:06:38 | 00:11:56 |
| Perceptron | 00:28:02 | 01:15:53 | 05:15:04 | 10:21:05 |
| SVM | 00:23:00 | 00:25:49 | 20:22:57 | 170:47:13 |
| Random forests | 00:17:27 | 00:59:15 | 07:58:10 | 23:07:23 |
| SMOTE + SVM | 01:26:00 | 04:05:17 | 252:02:10 | 281:11:12 |
| APLSC | 00:00:34 | 00:01:46 | 00:08:52 | 00:29:52 |
Classifier training times for selected datasets (medians over all cross-validation folds). Times are given in hh:mm:ss format.
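Because some entries exceed 24 hours (for example 170:47:13), they cannot be parsed as clock times with `datetime.strptime`, whose `%H` field is limited to 00-23. The following minimal sketch converts such durations to seconds and compares two entries from the last column of the table; the helper name `parse_hms` is illustrative.

```python
def parse_hms(text: str) -> int:
    """Convert an 'hh:mm:ss' duration to seconds; hours may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    if hours < 0 or not (0 <= minutes < 60) or not (0 <= seconds < 60):
        raise ValueError(f"invalid duration: {text!r}")
    return hours * 3600 + minutes * 60 + seconds


# Last-column training times for SVM and APLSC, copied from the table.
svm_seconds = parse_hms("170:47:13")
aplsc_seconds = parse_hms("00:29:52")

# Ratio of the two median training times.
ratio = svm_seconds / aplsc_seconds
print(svm_seconds, aplsc_seconds, round(ratio, 1))
```

Working in seconds makes it straightforward to compare methods across columns or to aggregate times over datasets.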
Feature selection results
| 95.31 | 97.18 | 96.24 | |
| 96.11 | 99.31 | 97.70 | |
| 94.92 | 96.60 | 95.76 | |
| 92.36 | 98.38 | 95.32 | |
| 96.18 | 95.95 | 96.06 | |
| 92.76 | 96.46 | 94.59 |
Classification results obtained by ROC-select + random forest combination for extended representation including seven new features. These are also the final results for HuntMi software.
Comparison with other tools: animal species
| 4 | 75.00 | 100.00 | |
| 16 | 87.50 | 93.75 | |
| 19 | 89.47 | 73.68 | |
| 175 | 85.14 | 93.14 | |
| 16 | - | 81.25 | |
| 139 | 64.03 | 94.96 | |
| 152 | 94.08 | 96.05 | |
| 54 | 83.33 | 94.44 | |
| 38 | 76.32 | 97.37 | |
| 23 | 82.61 | 91.30 | |
| 14 | 64.29 | 78.57 |
Classification sensitivity of microPred and HuntMi on animal miRNAs added in miRBase issues 18-19.
Comparison with other tools: plant species
| 68 | 80.88 | 91.18 | |
| 120 | 90.00 | 95.00 | |
| 302 | - | 88.41 | |
| 45 | 55.56 | 35.56 | |
| 206 | 88.83 | 99.51 | |
| 300 | - | 72.67 | |
| 163 | 84.66 | 93.25 | |
| 169 | 60.95 | 69.82 | |
| 89 | 89.89 | 97.75 | |
| 58 | 94.83 | 94.83 |
Classification sensitivity of PlantMiRNAPred and HuntMi on plant miRNAs added in miRBase issues 18-19. PlantMiRNAPred failed to process some Arabidopsis thaliana miRNAs; these sequences were nevertheless treated as properly identified.