Chih-Hung Hsieh, Darby Tien-Hao Chang, Cheng-Hao Hsueh, Chi-Yeh Wu, Yen-Jen Oyang.
Abstract
BACKGROUND: MicroRNAs (miRNAs) are short non-coding RNA molecules that play an important role in post-transcriptional regulation of gene expression. Many efforts have been made over the years to discover miRNA precursors (pre-miRNAs). Recently, ab initio approaches have attracted more attention because they do not depend on homology information and are more broadly applicable than comparative approaches. Kernel based classifiers such as the support vector machine (SVM) are extensively adopted in these ab initio approaches because of the prediction performance they achieve. In contrast, logic based classifiers such as decision trees, whose constructed models are interpretable, have attracted less attention.
Year: 2010 PMID: 20122227 PMCID: PMC3009525 DOI: 10.1186/1471-2105-11-S1-S52
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction accuracies achieved by SVM, RVKDE, G2DE, C4.5 and RIPPER. SVM, RVKDE, G2DE and G2DE-2 are kernel based classifiers; C4.5 and RIPPER are logic based classifiers.
| Feature set | SVM | RVKDE | G2DE | G2DE-2 | C4.5 | RIPPER |
|---|---|---|---|---|---|---|
| 1 | 80.17% | 77.59% | 80.39% | 77.80% | 76.72% | |
| 2 | 92.46% | 92.03% | 93.10% | 90.95% | 90.52% | |
| 3 | 91.60% | 91.16% | 91.60% | 91.16% | 91.38% | |
| 4 | 78.66% | 79.53% | 78.66% | 77.37% | 76.72% | |
| Average | 85.94% | 85.18% | 85.67% | 84.32% | 83.84% | |
| #kernels | 361 | 920 | 6 | 36 | 10 | 9 |
The best performance for each feature set is highlighted in bold font. G2DE-2 denotes the two-stage G2DE, which uses a first-stage G2DE to cluster samples and then a second-stage G2DE to classify each cluster. The #kernels row gives the average number of kernels; for the logic based classifiers, it gives the number of rules they deliver.
Evaluation measures employed in this study.
| Measure | Abbreviation | Equation |
|---|---|---|
| Sensitivity (recall) | %SE | TP/(TP+FN) |
| Specificity | %SP | TN/(TN+FP) |
| Accuracy | %ACC | (TP+TN)/(TP+TN+FP+FN) |
| F-measure | %Fm | 2TP/(2TP+FP+FN) |
| Matthews' correlation coefficient | %MCC | (TP × TN-FP × FN)/sqrt((TP+FP) × (TN+FN) × (TP+FN) × (TN+FP)) |
Definitions of the abbreviations used: TP is the number of real pre-miRNAs detected; FN is the number of real pre-miRNAs missed; TN is the number of pseudo hairpins correctly classified; and FP is the number of pseudo hairpins incorrectly classified as pre-miRNAs.
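As a rough illustration of how the five measures in the table relate to the four confusion-matrix counts, the following minimal sketch computes them directly from the listed equations. Here FP denotes pseudo hairpins misclassified as pre-miRNAs. The function name `evaluation_measures`, the example counts, and the choice to return 0 for MCC when its denominator is zero are assumptions made for this sketch, not details taken from the study.

```python
import math


def evaluation_measures(tp, fn, tn, fp):
    """Compute the measures from the table as fractions in [0, 1].

    tp: real pre-miRNAs detected        fn: real pre-miRNAs missed
    tn: pseudo hairpins correctly rejected
    fp: pseudo hairpins misclassified as pre-miRNAs
    """
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    fm = 2 * tp / (2 * tp + fp + fn)         # F-measure
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    # Assumption: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "Fm": fm, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
scores = evaluation_measures(tp=90, fn=10, tn=80, fp=20)
for name, value in scores.items():
    print(f"%{name}: {value:.2%}")
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the two classes differ in size.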
Comparison of G2DE and two existing pre-miRNA identification packages.
| Method | #kernels | %SE | %SP | %ACC | %Fm | %MCC |
|---|---|---|---|---|---|---|
| miPred | 280 | 88.80% | 96.55% | 92.67% | 92.38% | 85.60% |
| miR-KDE | 920 | 89.22% | 96.12% | 92.67% | 92.41% | 85.55% |
| G2DE | 6 | 87.07% | | 92.46% | 92.03% | 85.41% |
| G2DE-2 | 36 | | 96.55% | | | |
The best performance for each evaluation index is highlighted in bold font. G2DE-2 denotes the two-stage G2DE, which uses a first-stage G2DE to cluster samples and then a second-stage G2DE to classify each cluster.
Figure 1. Parameters of the three generalized Gaussian components generated by G2DE. This figure shows the three generalized Gaussian components of G2DE with the pre-miRNAs in the HU920 dataset and the second feature set. The correlation of interest is indicated with an arrow.
Figure 2. Parameters obtained by basic statistics. These parameters are obtained by calculating the mean, standard deviation and Pearson product-moment correlation coefficients with the pre-miRNAs of the HU920 dataset and the second feature set. The correlation of interest is indicated with an arrow.
Figure 3. Distribution of the HU920 dataset. The x-axis is the first feature of the second feature set, ratio of MFE to the number of stems; the y-axis is the fifth feature of the second feature set, adjusted Shannon entropy. Red ellipses represent the generalized Gaussian components shown in Figure 1; the black ellipse represents the Gaussian component shown in Figure 2. The red squares and green circles represent the pre-miRNAs and the pseudo hairpins, respectively. Values within the parentheses indicate the correlations between these two features in the corresponding Gaussian components.
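The basic statistics behind Figure 2 (mean, standard deviation and Pearson product-moment correlation) can be sketched in a few lines of plain Python. The function names, the sample values, and the use of the sample (n - 1) standard deviation are assumptions for illustration; the source does not state which variant was used, and the numbers below are not taken from the HU920 dataset.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Sample standard deviation (n - 1 denominator; an assumption)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two features."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values standing in for two features of a handful of hairpins.
feature_a = [-6.1, -5.4, -7.0, -4.8, -6.6]
feature_b = [0.82, 0.95, 0.70, 1.01, 0.77]

print(mean(feature_a), std(feature_a))
print(mean(feature_b), std(feature_b))
print(pearson(feature_a, feature_b))
```

A single mean, standard deviation and correlation per feature pair describes one Gaussian fitted to all samples, which is what the black ellipse in Figure 3 represents; the three red ellipses each carry their own correlation.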
Summary of the adopted feature sets.
| Feature | Description |
|---|---|
| Set 1 | |
| | Frequencies of 16 dinucleotide pairs |
| | Percentage of nitrogenous bases which are either G or C |
| Set 2 | |
| | Ratio of MFE to the number of stems |
| | Ratio of |
| | Adjusted base pairing propensity. |
| | Adjusted minimum free energy of folding. |
| | Adjusted Shannon entropy. |
| | Adjusted base pair distance. |
| | Compactness of the tree-graph representation of the sequence. |
| Set 3 | |
| | 5 normalized variants of |
| Set 4 | |
| | Hairpin length |
| | Loop length |
| | Consecutive base-pairs |
| | Ratio of loop length to hairpin length |
The table lists the features in their order within each feature set. For example, the fifth feature in the second feature set is dQ, the adjusted Shannon entropy.
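To make the first feature set concrete, the sketch below computes the 16 dinucleotide frequencies and the G+C content of a single sequence, giving a 17-dimensional vector. Several details are assumptions, since the table does not specify them: dinucleotides are counted over overlapping windows and normalized by the number of windows, values are returned as fractions rather than percentages, the alphabet is ACGU (with T mapped to U), and the feature order and the name `feature_set_1` are hypothetical.

```python
from itertools import product

BASES = "ACGU"
# The 16 dinucleotide pairs in a fixed (assumed) order: AA, AC, ..., UU.
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def feature_set_1(seq):
    """Return 16 dinucleotide frequencies followed by the G+C fraction."""
    seq = seq.upper().replace("T", "U")
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    n_pairs = len(seq) - 1                 # number of overlapping windows
    counts = {d: 0 for d in DINUCLEOTIDES}
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:                 # skip windows with other symbols
            counts[pair] += 1
    freqs = [counts[d] / n_pairs for d in DINUCLEOTIDES]
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return freqs + [gc]


# Arbitrary example sequence, not taken from any dataset.
vector = feature_set_1("GGCUAGCUAGGC")
print(len(vector))
print(dict(zip(DINUCLEOTIDES + ["GC"], vector)))
```

These features depend only on the sequence, whereas the other feature sets in the table describe the folded hairpin and would need a secondary structure prediction first.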