Ivani de O N Lopes, Alexander Schliep, André C P de L F de Carvalho.
Abstract
BACKGROUND: Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests.
Year: 2014 PMID: 24884650 PMCID: PMC4046174 DOI: 10.1186/1471-2105-15-124
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
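The abstract mentions the F-score as a measure of feature discrimination but does not give its formula here. The sketch below assumes the commonly used two-class definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances); the function name and the toy values are illustrative and are not taken from the paper.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one feature from its values in the positive and negative class.

    Assumed definition: between-class separation of the means divided by
    the sum of the within-class sample variances (n - 1 denominator).
    """
    overall = mean(list(pos) + list(neg))
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    return between / within  # raises ZeroDivisionError if both classes are constant

# Toy values for a single hypothetical feature (not data from the paper).
pre_mirna = [0.9, 1.0, 1.1, 1.0]
pseudo_hairpin = [0.4, 0.5, 0.6, 0.5]
score = f_score(pre_mirna, pseudo_hairpin)
print(round(score, 3))
```

A larger value suggests that the feature separates the two classes better when considered on its own; the score does not capture interactions between features.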
Features used in each feature set
| Feature | FS1 | FS2 | FS3 | FS4 | FS5 | FS6 | FS7 |
|---|---|---|---|---|---|---|---|
| Dinucleotide frequencies | x | | | | | | |
| | x | x | | | | | x |
| Maximal length of the amino acid string without stop codons | | | | | | | x |
| Low complexity regions detected in the sequence (%) | | | | | | | x |
| Triplets | | | | x | | x | |
| Stacking triplets | | | | | | | x |
| Motifs | | | | | x | | |
| Minimum free energy of folding | | | | | | x | |
| Randfold | | | | | | x | |
| Normalized MFE | x | x | x | | | | x |
| MFE index 1 | x | x | x | | | | x |
| MFE index 2 | x | x | x | | | | x |
| MFE index 3 | x | x | | | | | x |
| MFE index 4 | x | x | | | | | x |
| Normalized ensemble free energy | x | x | | | | | x |
| Normalized difference | x | x | | | | | x |
| Frequency of the MFE structure | x | | | | | | |
| Normalized base-pairing propensity | x | | x | | | | |
| Normalized Shannon entropy | x | x | x | | | | x |
| Structural diversity | x | x | | | | | x |
| Normalized base-pair distance | x | | x | | | | |
| Average base pairs per stem (Avg_Bp_Stem) | x | x | | | | | x |
| Average A-U pairs | x | x | | | | | x |
| Average G-C pairs | x | x | | | | | x |
| Average G-U pairs | x | x | | | | | x |
| Content of A-U pairs per stem | x | x | | | | | x |
| Content of G-C pairs per stem | x | x | | | | | x |
| Content of G-U pairs per stem | x | x | | | | | x |
| Cumulative size of internal loops | | | | | | | x |
| Structure entropy | x | x | | | | | x |
| Normalized structure entropy | x | x | | | | | x |
| Structure enthalpy | x | | | | | | |
| Normalized structure enthalpy | x | | | | | | |
| Melting energy of the structure | x | | | | | | |
| Normalized melting energy of the structure | x | | | | | | |
| Topological descriptor (dF) | x | x | x | | | | x |
| Normalized variants | x | | | | | | |
| Normalized variants | x | x | | | | | x |
| Normalized variants | x | | | | | | |
Detailed descriptions can be found in the corresponding references.
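As an illustration of two of the simpler sequence and sequence-structure features listed in the table, the sketch below computes dinucleotide frequencies and triplet frequencies. It assumes that a triplet combines the middle nucleotide with the paired or unpaired state of three adjacent positions in a dot-bracket structure, with both bracket directions counted as paired, which gives 4 × 8 = 32 patterns. The exact definitions used by each tool are in the corresponding references; the function names and the toy hairpin are hypothetical.

```python
from itertools import product

def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 overlapping dinucleotides."""
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(("".join(p) for p in product("ACGU", repeat=2)), 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs containing ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def triplet_frequencies(seq, structure):
    """Relative frequency of the 32 (middle nucleotide + 3-position state) patterns."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    paired = structure.replace(")", "(")  # assumption: "(" and ")" both mean paired
    keys = [n + "".join(s) for n in "ACGU" for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

# Toy hairpin, not a real pre-miRNA.
di = dinucleotide_frequencies("GGGAAACCC")
tri = triplet_frequencies("GGGAAACCC", "(((...)))")
print(di["GG"], round(tri["G((("], 3))
```

Both functions return fixed-length dictionaries, so the values can be concatenated in key order to form a feature vector for a classifier.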
Predicted performances of classifiers trained with 1,742 examples, presented as the mean and standard deviation (mean ± SD)
| ALG | Feature set | Acc | Se | Sp | Fm | Mcc |
|---|---|---|---|---|---|---|
| SVM | FS4 | E 85.6 ± 1.2 a | D 83.0 ± 1.9 a | D 88.4 ± 1.5 a | E 85.2 ± 1.3 a | E 71.4 ± 2.3 a |
| | FS5 | D 87.4 ± 0.9 a | C 84.3 ± 1.5 a | C 90.5 ± 1.4 a | D 86.9 ± 0.9 a | D 74.9 ± 1.7 a |
| | FS6 | C 89.8 ± 1.1 a | B 87.5 ± 1.5 a | C 93.0 ± 1.7 a | C 89.5 ± 1.1 a | C 79.8 ± 2.2 a |
| | FS3 | B 90.6 ± 0.8 a | B 88.0 ± 1.3 a | B 93.3 ± 1.3 a | B 90.4 ± 0.9 a | B 81.4 ± 1.7 a |
| | FS1 | | | | | |
| | FS2 | | | | | |
| | FS7 | | | | | |
| | SELECT | | | | | |
| RF | FS4 | E 84.8 ± 1.1 b | D 81.2 ± 1.8 b | C 88.3 ± 1.3 a | E 84.2 ± 1.2 b | E 69.8 ± 2.1 b |
| | FS5 | D 85.7 ± 0.7 b | D 81.2 ± 0.8 b | B 90.3 ± 1.4 a | D 85.1 ± 0.6 b | D 71.8 ± 1.5 b |
| | FS6 | C 88.7 ± 1.4 b | C 86.6 ± 1.5 b | A 89.8 ± 1.6 b | C 88.5 ± 1.4 b | C 77.4 ± 2.8 b |
| | FS3 | C 90.0 ± 1.0 b | C 86.9 ± 1.4 b | A 93.0 ± 1.1 a | C 89.6 ± 1.0 b | C 80.1 ± 1.9 b |
| | FS1 | A 91.5 ± 1.0 b | A 89.1 ± 1.1 a | A 93.9 ± 1.2 a | A 91.3 ± 1.0 b | A 83.1 ± 1.9 b |
| | FS2 | A 90.9 ± 1.0 b | B 88.1 ± 1.2 b | A 93.8 ± 1.3 b | A 90.7 ± 1.1 b | A 82.0 ± 2.1 b |
| | FS7 | A 91.1 ± 0.8 b | B 88.5 ± 1.3 b | A 93.7 ± 1.3 b | A 90.9 ± 1.0 b | A 82.3 ± 2.0 b |
| | SELECT | B 90.5 ± 0.9 b | C 87.4 ± 1.0 b | A 93.6 ± 1.4 b | B 90.2 ± 0.9 b | B 81.2 ± 1.9 b |
| G²DE | FS3 | 90.2 ± 0.9 | 87.4 ± 1.5 | 93.1 ± 0.9 | 89.9 ± 0.9 | 80.6 ± 1.8 |
Predicted accuracies (Acc), sensitivities (Se), specificities (Sp), F-measures (Fm) and Matthews correlation coefficients (Mcc) of classifiers trained with 1,742 examples, presented as the mean and standard deviation (mean ± SD). Capital letters in columns indicate the performance cluster of each feature set within each algorithm (ALG). Lowercase letters in columns indicate the cluster of each algorithm within each feature set. Bold numbers represent the highest performances, which were not significantly different according to the clustering criteria in [42].
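The five measures named in this note can all be derived from a confusion matrix. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and sensitivity; the record does not restate the formulas, so treating them as the standard ones is an assumption, and the counts are invented for illustration.

```python
from math import sqrt

def classification_metrics(tp, fn, tn, fp):
    """Return Acc, Se, Sp, Fm and Mcc (all in percent) from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall on pre-miRNAs)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    fm = 2 * precision * se / (precision + se)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {name: 100 * value for name, value in
            [("Acc", acc), ("Se", se), ("Sp", sp), ("Fm", fm), ("Mcc", mcc)]}

# Hypothetical test set with 100 positives and 100 negatives.
m = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print({k: round(v, 1) for k, v in m.items()})
```

When the positive and negative classes have the same size, as in this toy example, Acc equals the average of Se and Sp.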
Figure 1. Average feature importance estimated during the induction of RF ensembles. Features with importance lower than five were omitted. The average feature importance drops off quickly after the 10th feature, indicating that for each ensemble there are few distinguishing features.
Main characteristics of tools used as references in this work
| Triplet-SVM | SVM | 32 | noML | 163 | 168 | 5.0 | CDS | |
| | | | | noML | | | | |
| MiPred | RF | 34 | noML | 163 | 168 | 8.2 | CDS | |
| | | | | 50< | | | | |
| | | | | noML | | | | |
| microPred | SVM | 21 | RR | Not given clearly | Not given clearly | 12 | CDS, ncRNAs | |
| | | | | RR | | | | |
| G²DE | G²DE | 7 | RR | 460 | 460 | 12.0 | CDS | |
| | | | noML | | | | | |
| | | | | noML | | | | |
| Mirident | SVM | 1300 | | 484 | 484 | 11.0 | CDS | |
| | | | RR | | | | | |
| | | | noML | 50< | | | | |
| | | | | RR | | | | |
| | | | | noML | | | | |
| HuntMi | RF | 28 | ExpVal | Not given clearly | Not given clearly | 17.0 | CDS, mRNA, ncRNA | |
BP = number of base pairs on the stem, MFE = minimum free energy of the secondary structure, noML = no multiple loops, RR = removed redundancies, E-value ≤ 102 = expected value in BLASTN against miRBase, ExpVal = only experimentally validated precursors, and RF = random forest.
Predicted performance of tools used as references in our GEN test sets
| Tool | Acc | Se | Sp | Fm | Mcc |
|---|---|---|---|---|---|
| Triplet-SVM | 78.8 ± 1.3 | 64.7 ± 2.1 | 92.9 ± 1.3 | 75.3 ± 1.7 | 60.1 ± 2.5 |
| MiPred | 86.8 ± 0.9 | 76.8 ± 1.6 | 96.8 ± 0.9 | 85.3 ± 1.1 | 75.1 ± 1.7 |
| microPred | 69.9 ± 1.7 | 72.1 ± 1.7 | 67.6 ± 2.7 | 70.6 ± 1.5 | 39.8 ± 3.3 |
| G²DE | 90.6 ± 0.9 | 89.2 ± 1.2 | 93.3 ± 1.6 | 90.5 ± 0.9 | 81.4 ± 1.8 |
| Mirident | 85.5 ± 1.0 | 88.2 ± 1.1 | 82.9 ± 1.2 | 85.9 ± 1.0 | 71.2 ± 2.1 |
| HuntMi | 85.1 ± 2.1 | 98.7 ± 0.8 | 71.6 ± 4.2 | 86.9 ± 1.6 | 73.0 ± 3.5 |
Results are presented as the mean and the standard deviation (mean ± SD). Acc = accuracy; Se = sensitivity; Sp = specificity; Fm = F-measure; Mcc = Matthews correlation coefficient.
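A minimal sketch of how a "mean ± SD" entry could be produced from the scores obtained on repeated test sets. The record does not say whether the sample or the population standard deviation was used; the sample version (n − 1 denominator) is assumed here, and the accuracy values are invented.

```python
from statistics import mean, stdev

def summarize(values, digits=1):
    """Format repeated measurements as 'mean ± SD' (sample standard deviation)."""
    return f"{mean(values):.{digits}f} ± {stdev(values):.{digits}f}"

# Hypothetical accuracies (%) of one tool on five test sets.
acc_per_test_set = [90.1, 91.2, 89.8, 90.9, 90.5]
summary = summarize(acc_per_test_set)
print(summary)
```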
Figure 2. Predictive performance of classifiers throughout 125-quantile distribution of G+C content. The prediction of the secondary structure of G+C-rich sequences is more challenging. This figure shows that the classification of G+C-rich pre-miRNA sequences is also more complex. As the G+C content increased, the sensitivity dropped, except when SVM was trained with feature sets including %G+C-based features (FS1, FS2 and FS7).
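The kind of analysis summarized in this caption can be sketched as follows: compute the G+C content of each true pre-miRNA, sort the sequences by it, split them into equally sized bins, and measure the sensitivity within each bin. The number of bins, the function names and the toy data are assumptions made for illustration and do not reproduce the binning used in the figure.

```python
def gc_content(seq):
    """Percentage of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def sensitivity_by_gc_bin(sequences, predictions, n_bins=4):
    """Sensitivity per G+C-content bin.

    sequences: true pre-miRNA sequences only.
    predictions: booleans, True if the classifier called the sequence a pre-miRNA.
    Returns a list of (min %G+C, max %G+C, sensitivity) tuples, lowest bin first.
    """
    ranked = sorted(zip(map(gc_content, sequences), predictions), key=lambda p: p[0])
    n = len(ranked)
    result = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:  # more bins than sequences
            continue
        hits = sum(1 for _, predicted in chunk if predicted)
        result.append((chunk[0][0], chunk[-1][0], hits / len(chunk)))
    return result

# Toy positives with increasing G+C content; the last one is missed.
seqs = ["AAUU", "AUGC", "GGCA", "GGCC"]
preds = [True, True, True, False]
bins = sensitivity_by_gc_bin(seqs, preds, n_bins=2)
print(bins)
```

Only positive examples enter the computation, because sensitivity is defined on the true pre-miRNAs; the same binning could be applied to negatives to track specificity.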