Ivani de O N Lopes, Alexander Schliep, André P de L F de Carvalho.
Abstract
BACKGROUND: Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features of miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence features complicate this task. To accommodate the peculiarities of plant and animal miRNA systems, tools for the two systems have evolved differently. However, these tools are biased towards the species for which they were primarily developed and, consequently, their predictive performance on data sets from other species of the same kingdom might be lower. While these biases are intrinsic to the species, characterizing them can lead to computational approaches that diminish their negative effect on the accuracy of pre-miRNA predictive models. In this study we investigate how 45 predictive models, induced from data sets of 45 species distributed across eight subphyla/classes, perform when applied to a species different from the one used in their induction.
Year: 2016 PMID: 27233515 PMCID: PMC4884428 DOI: 10.1186/s12859-016-1036-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Feature set composition, dimension, literature reference
| Feature | FS1 | FS2 | FS3 | FS4 | FS5 | FS6 | FS7 | SELECT |
|---|---|---|---|---|---|---|---|---|
| Di-nucleotide frequencies | x | |||||||
| | x | x | x | |||||
| Maximal length of the amino acid string without stop codons | x | |||||||
| Percentage of low complexity regions | x | |||||||
| Triplets | x | x | ||||||
| Stacking triplets | x | |||||||
| Motifs | x | |||||||
| Minimum free energy of folding | x | |||||||
| Randfold | x | |||||||
| Normalized MFE | x | x | x | x | x | |||
| MFE index 1 | x | x | x | x | x | |||
| MFE index 2 | x | x | x | x | x | |||
| MFE index 3 | x | x | x | x | ||||
| MFE index 4 | x | x | x | |||||
| Normalized Ensemble Free Energy | x | x | x | x | ||||
| Normalized difference | x | x | x | x | ||||
| Frequency of the MFE structure | x | |||||||
| Normalized base-pairing propensity | x | x | ||||||
| Normalized Shannon entropy | x | x | x | x | x | |||
| Structural diversity | x | x | x | |||||
| Normalized base-pair distance | x | x | ||||||
| Average base pairs per stem (Avg_Bp_Stem) | x | x | x | |||||
| Normalized A-U pairs counts | x | x | x | |||||
| Normalized G-C pairs counts | x | x | x | x | ||||
| Normalized G-U pairs counts | x | x | x | x | ||||
| Content of A-U pairs per stem | x | x | x | |||||
| Content of G-C pairs per stem | x | x | x | |||||
| Content of G-U pairs per stem | x | x | x | x | ||||
| Cumulative size of internal loops | x | |||||||
| Structure entropy | x | x | x | x | ||||
| Normalized structure entropy | x | x | x | x | ||||
| Structure enthalpy | x | |||||||
| Normalized structure enthalpy | x | |||||||
| Melting energy of the structure | x | |||||||
| Normalized melting energy of the structure | x | |||||||
| Topological descriptor (dF) | x | x | x | x | x | |||
| Normalized variants ( | x | |||||||
| Normalized variants ( | x | x | x | |||||
| Normalized variants ( | x | |||||||
| Dimension | 48 | 21 | 7 | 32 | 1300 | 34 | 28 | 13 |
| Reference | [ | [ | [ | [ | [ | [ | [ | [ |
Definition of all 44 classification models compared in this work, according to feature sets and learning algorithms. Mij is the classifier induced with feature set i and algorithm j (i = 1,…,12; j = 1, 2, 3), and wij is the cross-validation accuracy of Mij. The ensembles combine the classes predicted by the individual Mij. Emv = ensemble by majority vote; Ewv = ensemble by weighted vote
| | 1. SVMs | 2. RF | 3. J48 |
|---|---|---|---|
| 1. FS1 | M11 | M12 | M13 |
| 2. FS2 | M21 | M22 | M23 |
| 3. FS6 | M31 | M32 | M33 |
| 4. FS7 | M41 | M42 | M43 |
| 5. FS3 | M51 | M52 | M53 |
| 6. FS4 | M61 | M62 | M63 |
| 7. FS5 | M71 | M72 | M73 |
| 8. SELECT | M81 | M82 | M83 |
| 9. Hyb37 | M91 | M92 | M93 |
| 10. Hyb | M101 | M102 | M103 |
| 11. Hyb 17 | M111 | M112 | M113 |
| 12. Ss1 | M121 | M122 | M123 |
| Emv8 | | | |
| Ewv8 | | | |
| Emv24 | | | |
| Ewv24 | | | |
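The difference between the two ensemble types can be illustrated with a minimal sketch. The formulas for Emv and Ewv did not survive in the table above, so the code below is an assumption rather than the authors' definition: it treats the majority vote as a plain count of predicted classes, and the weighted vote as a sum of per-classifier weights, with the cross-validation accuracies wij as the natural candidate for those weights. The function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def majority_vote(predictions):
    """Return the class predicted by the largest number of classifiers.

    Ties are broken in favour of the class that appears first in
    `predictions` (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = defaultdict(int)
    for label in predictions:
        counts[label] += 1
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def weighted_vote(predictions, weights):
    """Return the class with the largest summed weight.

    `weights[k]` is the weight of the classifier that produced
    `predictions[k]`; here it is assumed to be its cross-validation accuracy.
    """
    if not predictions or len(predictions) != len(weights):
        raise ValueError("predictions and weights must be non-empty and aligned")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    best = max(scores.values())
    for label in predictions:
        if scores[label] == best:
            return label


# Toy example: one accurate classifier disagrees with two weak ones.
toy_predictions = ["pos", "neg", "neg"]
toy_weights = [0.95, 0.40, 0.45]  # hypothetical cross-validation accuracies
emv = majority_vote(toy_predictions)
ewv = weighted_vote(toy_predictions, toy_weights)
```

In this toy case the two schemes disagree: the majority vote follows the two weak classifiers, while the weighted vote follows the single classifier with the higher accuracy.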
Fig. 1 Frequencies of species for which each classification model achieved accuracies in clusters C1-C5.
Centers of accuracy clusters from 24 classification models, per species. Range = Maximum - minimum
| Acronym for species | C1 | C2 | C3 | C4 | C5 | Range |
|---|---|---|---|---|---|---|
| bfl | 94 | 83 | - | - | - | 15.0 |
| cin | 83 | 79 | 75 | 68 | - | 19.0 |
| cbr | 93 | 85 | 79 | - | - | 17.0 |
| cel | 92 | 87 | 81 | 75 | - | 20.0 |
| aae | 95 | 90 | 80 | - | - | 18.0 |
| ame | 85 | 78 | 72 | - | - | 20.0 |
| api | 92 | 88 | 82 | 73 | - | 22.0 |
| bmo | 84 | 79 | 71 | 57 | - | 31.0 |
| dme | 91 | 78 | - | - | - | 22.0 |
| tca | 89 | 82 | 76 | - | - | 18.0 |
| aca | 93 | 86 | 80 | - | - | 16.0 |
| xtr | 97 | 87 | 82 | - | - | 18.0 |
| gga | 95 | 90 | 85 | 76 | 68 | 27.0 |
| cfa | 91 | 83 | 75 | - | - | 22.0 |
| eca | 93 | 86 | 77 | - | - | 20.0 |
| mdo | 87 | 79 | 71 | - | - | 21.0 |
| mml | 89 | 82 | 75 | - | - | 17.0 |
| ggo | 89 | 77 | 66 | - | - | 27.0 |
| hsa | 88 | 77 | - | - | - | 16.0 |
| ptr | 89 | 82 | 73 | - | - | 23.0 |
| oan | 88 | 83 | 77 | 70 | - | 23.0 |
| cgr | 92 | 88 | 84 | 78 | - | 16.0 |
| mmu | 85 | 79 | 72 | - | - | 17.0 |
| rno | 93 | 88 | 81 | - | - | 17.0 |
| bta | 84 | 80 | 75 | 68 | - | 18.0 |
| oar | 91 | 86 | 77 | - | - | 18.0 |
| ssc | 90 | 85 | 79 | 64 | - | 29.0 |
| dre | 93 | 86 | 80 | - | - | 17.0 |
| ola | 92 | 88 | 80 | 68 | - | 26.0 |
| ppt | 93 | 84 | 76 | - | - | 20.0 |
| aly | 95 | 88 | 81 | - | - | 17.0 |
| ath | 94 | 83 | - | - | - | 15.0 |
| mes | 98 | 91 | 85 | - | - | 14.0 |
| gma | 91 | 86 | 79 | - | - | 18.0 |
| mtr | 86 | 82 | 72 | - | - | 21.0 |
| lus | 97 | 84 | - | - | - | 18.0 |
| mdm | 98 | 85 | - | - | - | 15.0 |
| ppe | 95 | 87 | 80 | - | - | 18.0 |
| ptc | 94 | 83 | - | - | - | 16.0 |
| stu | 93 | 87 | 82 | - | - | 16.0 |
| vvi | 93 | 86 | 78 | - | - | 20.0 |
| bdi | 91 | 87 | 75 | - | - | 22.0 |
| osa | 87 | 77 | - | - | - | 16.0 |
| sbi | 96 | 89 | 81 | - | - | 20.0 |
| zma | 96 | 82 | - | - | - | 17.0 |
Centers of accuracy clusters obtained from classification models induced with examples from different species, per combination of feature set and learning algorithm. Range = Maximum - minimum
| Feature set | Algorithm | C1 | C2 | C3 | C4 | C5 | C6 | Range |
|---|---|---|---|---|---|---|---|---|
| FS1 | SVM | 95 | 88 | 78 | - | - | - | 21 |
| FS2 | SVM | 96 | 92 | 87 | 80 | - | - | 20 |
| FS3 | SVM | 95 | 90 | 85 | - | - | - | 15 |
| FS4 | SVM | 92 | 86 | 81 | 77 | - | - | 22 |
| FS5 | SVM | 94 | 90 | 86 | 80 | - | - | 20 |
| FS6 | SVM | 93 | 88 | 83 | - | - | - | 17 |
| FS7 | SVM | 95 | 88 | - | - | - | - | 16 |
| SELECT | SVM | 96 | 92 | 86 | 80 | - | - | 20 |
| FS1 | RF | 97 | 92 | 87 | 82 | 72 | - | 30 |
| FS2 | RF | 97 | 93 | 89 | 83 | - | - | 20 |
| FS3 | RF | 95 | 88 | 84 | - | - | - | 18 |
| FS4 | RF | 91 | 87 | 84 | 79 | - | - | 18 |
| FS5 | RF | 92 | 85 | 77 | - | - | - | 19 |
| FS6 | RF | 95 | 88 | - | - | - | - | 16 |
| FS7 | RF | 96 | 89 | - | - | - | - | 14 |
| SELECT | RF | 96 | 92 | 86 | 78 | - | - | 21 |
| FS1 | J48 | 98 | 91 | 85 | 75 | 67 | 57 | 41 |
| FS2 | J48 | 96 | 90 | 84 | 77 | - | - | 24 |
| FS3 | J48 | 97 | 92 | 87 | 81 | - | - | 21 |
| FS4 | J48 | 84 | 79 | 75 | 69 | - | - | 21 |
| FS5 | J48 | 83 | 78 | 75 | 71 | - | - | 17 |
| FS6 | J48 | 97 | 93 | 89 | 83 | 78 | 72 | 27 |
| FS7 | J48 | 96 | 91 | 87 | 81 | - | - | 21 |
| SELECT | J48 | 97 | 92 | 86 | 80 | 74 | - | 26 |
Fig. 2 Accuracy cluster membership (columns) for cross-species pre-miRNA classifiers. Green = C1; red = other; y-axis = model species; x-axis = test species; black frames enclose species from the same subphylum/class. Panels (a), (b), (c) and (d) show interactions between the learning algorithms SVMs and RF and the feature sets SELECT and FS6
Fig. 3 Accuracy cluster membership (rows) for cross-species pre-miRNA classifiers. Green = C1; red = other; y-axis = model species; x-axis = test species; black frames enclose species from the same subphylum/class. Panels (a), (b), (c) and (d) show interactions between the learning algorithms SVMs and RF and the feature sets SELECT and FS6
Fig. 4 Pairwise Pearson correlation coefficients of RFI across species. Correlation breaks: 0, 0.05, 0.6, 0.8, 0.95 and 1
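The quantity plotted in Fig. 4 can be sketched as follows: a Pearson correlation is computed for every pair of species from their RFI vectors, and each coefficient is then assigned to one of the intervals delimited by the breaks in the caption. This is only an illustrative sketch. The species acronyms are taken from the tables above, but the RFI values are invented, and the choice of half-open intervals (closed on the right only for the last one) is an assumption, since the caption does not state it.

```python
import math
from bisect import bisect_right

# Breaks listed in the Fig. 4 caption.
BREAKS = [0, 0.05, 0.6, 0.8, 0.95, 1]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    r = sxy / math.sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))  # guard against floating-point overshoot


def pairwise_correlations(rfi_by_species):
    """Map each unordered species pair to the correlation of their RFI vectors."""
    species = sorted(rfi_by_species)
    return {
        (a, b): pearson(rfi_by_species[a], rfi_by_species[b])
        for i, a in enumerate(species)
        for b in species[i + 1:]
    }


def correlation_bin(r, breaks=BREAKS):
    """Index k of the interval [breaks[k], breaks[k+1]) that contains r.

    The last interval is treated as closed on the right (assumption).
    Values outside the breaks, including negative correlations, raise.
    """
    if not breaks[0] <= r <= breaks[-1]:
        raise ValueError("value outside the range covered by the breaks")
    return min(bisect_right(breaks, r) - 1, len(breaks) - 2)


# Hypothetical RFI vectors (one value per feature) for three species.
toy_rfi = {
    "hsa": [100.0, 80.0, 20.0, 5.0],
    "mmu": [90.0, 85.0, 25.0, 10.0],
    "ath": [10.0, 30.0, 100.0, 70.0],
}
toy_correlations = pairwise_correlations(toy_rfi)
```

Because the breaks start at 0, this sketch rejects negative coefficients instead of guessing how the figure colours them.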
Fig. 5 Thirty most relevant features for pre-miRNA classification, according to RFI values, per species. RFI breaks: 0, 5, 20, 60, 85, 95 and 100
Fig. 6 Distribution of classification errors per species. Errors made exclusively by SVMs (e1), RF (e2), J48 (e3), SVMs and RF (e4), SVMs and J48 (e5), RF and J48 (e6), and SVMs, RF and J48 (e7). Panels (a), (b) and (c) illustrate that, for a fixed feature set, the number of instances simultaneously misclassified by two or three learning algorithms depends on the species
Fig. 7 Venn diagram of the classification errors of the learning algorithms, by feature set. Results were obtained from the classification of 27,000 instances = 45 (test species) × 10 (repetitions) × 60 (30+, 30-). Panels (a), (b), (c) and (d) illustrate that the number of instances simultaneously misclassified by two or three learning algorithms depends on the feature set
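The seven error categories of Fig. 6 and the Venn regions of Fig. 7 correspond to a partition of the misclassified instances by which algorithms got them wrong. A minimal sketch with Python sets is given below; the function name and the toy instance identifiers are hypothetical, and only the category labels e1 to e7 and the instance count come from the captions.

```python
def error_regions(svm_errors, rf_errors, j48_errors):
    """Partition misclassified instance ids into the seven Venn regions.

    Each argument is an iterable of ids misclassified by one algorithm.
    Every id in the union lands in exactly one region.
    """
    svm, rf, j48 = set(svm_errors), set(rf_errors), set(j48_errors)
    return {
        "e1": svm - rf - j48,    # SVMs only
        "e2": rf - svm - j48,    # RF only
        "e3": j48 - svm - rf,    # J48 only
        "e4": (svm & rf) - j48,  # SVMs and RF, not J48
        "e5": (svm & j48) - rf,  # SVMs and J48, not RF
        "e6": (rf & j48) - svm,  # RF and J48, not SVMs
        "e7": svm & rf & j48,    # all three algorithms
    }


# Instance count stated in the Fig. 7 caption.
TEST_SPECIES = 45
REPETITIONS = 10
INSTANCES_PER_TEST = 60  # 30 + 30
TOTAL_INSTANCES = TEST_SPECIES * REPETITIONS * INSTANCES_PER_TEST

# Toy example with invented instance ids.
toy_regions = error_regions(
    svm_errors={1, 2, 3, 4},
    rf_errors={3, 4, 5},
    j48_errors={4, 5, 6, 7},
)
toy_sizes = {name: len(ids) for name, ids in toy_regions.items()}
```

The region sizes add up to the number of instances misclassified by at least one algorithm, which is what makes the per-species and per-feature-set comparisons in the two figures possible.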
Fig. 8 Distribution of the accuracies of 44 classifiers within the accuracy clusters.