Malik Yousef, Segun Jung, Louise C Showe, Michael K Showe.
Abstract
BACKGROUND: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designating the negative class can be problematic; if done improperly, it can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species.
Year: 2008 PMID: 18226233 PMCID: PMC2248178 DOI: 10.1186/1748-7188-3-2
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
One-class results obtained from the secondary features plus sequence features.
| Method | Sen | Spe | MCC | Sen | Spe | MCC | Sen | Spe | MCC | Sen | Spe | MCC | Average MCC |
| OC-SVM | 0.73 | 0.93 | 0.67 | 0.80 | 0.93 | 0.74 | 0.72 | 0.99 | 0.74 | 0.69 | 0.91 | 0.62 | 0.70 |
| OC-Gaussian | 0.84 | 0.93 | 0.77 | 0.89 | 0.93 | 0.82 | 0.82 | 0.99 | 0.82 | 0.82 | 0.99 | 0.82 | 0.81 |
| OC-Kmeans | 0.79 | 0.93 | 0.73 | 0.85 | 0.92 | 0.77 | 0.89 | 0.92 | 0.81 | 0.89 | 0.80 | 0.69 | 0.75 |
| OC-PCA | 0.87 | 0.89 | 0.76 | 0.88 | 0.92 | 0.80 | 0.90 | 0.79 | 0.69 | 0.90 | 0.86 | 0.76 | 0.77 |
| OC-KNN | 0.90 | 0.86 | 0.76 | 0.90 | 0.92 | 0.82 | 0.90 | 0.96 | 0.86 | 0.90 | 0.93 | 0.83 | 0.82 |
| Two-Class | |||||||||||||
| Naïve Bayes | 0.89 | 0.93 | 0.82 (125) | 0.93 | 0.97 | 0.90 (200) | 0.99 | 0.92 | 0.92 (300) | 0.97 | 0.96 | 0.93 (4000) | 0.88 |
| SVM | 0.90 | 0.97 | 0.87 (200) | 0.95 | 0.98 | 0.93 (500) | 0.99 | 0.99 | 0.98 (300) | 0.98 | 0.95 | 0.93 (900) | 0.92 |
Sen = sensitivity, Spe = specificity, and MCC = Matthews Correlation Coefficient. Results are presented for four data sets: three genomes individually (C. elegans, Mouse, and Human) and All-miRNA, a mixture of miRNAs from multiple species. The number in parentheses is the number of negative examples that gave the highest MCC.
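For reference, with TP, FN, TN, and FP denoting true positives, false negatives, true negatives, and false positives, the three measures are Sen = TP/(TP+FN), Spe = TN/(TN+FP), and MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below recovers the MCC implied by a Sen/Spe pair. It assumes equal numbers of positive and negative test examples, which is an assumption of this example and is not stated in the table; the function name `mcc_from_rates` and the default class sizes are hypothetical.

```python
import math


def mcc_from_rates(sen: float, spe: float, n_pos: int = 1000, n_neg: int = 1000) -> float:
    """Return the MCC implied by a sensitivity/specificity pair.

    n_pos and n_neg are assumed class sizes (balanced by default);
    MCC depends on them, so a different class ratio gives a different value.
    """
    tp = sen * n_pos          # positives correctly accepted
    fn = n_pos - tp           # positives wrongly rejected
    tn = spe * n_neg          # negatives correctly rejected
    fp = n_neg - tn           # negatives wrongly accepted
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# First OC-SVM column group in the table above: Sen = 0.73, Spe = 0.93.
print(round(mcc_from_rates(0.73, 0.93), 2))
```

Under the balanced-class assumption this gives 0.67 for the first OC-SVM entry, in line with the tabulated MCC. Because MCC depends on the class ratio, the values marked with an optimal number of negative examples in the two-class rows need not follow the same relation.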
One-class results obtained from triplet-SVM and RNAmicro1.1 tools based on their specific features.
| | triplet-SVM | | | RNAmicro1.1 | | |
| Method | Sen | Spe | MCC | Sen | Spe | MCC |
| OC-SVM | 0.93 | 0.78 | 0.72 | 0.93 | 0.94 | 0.87 |
| OC-Gaussian | 0.90 | 0.88 | 0.78 | 0.90 | 0.96 | 0.87 |
| OC-Kmeans | 0.98 | 0.80 | 0.79 | 0.93 | 0.92 | 0.84 |
| OC-PCA | 0.97 | 0.79 | 0.77 | 0.90 | 0.96 | 0.86 |
| OC-KNN | 0.93 | 0.84 | 0.77 | 0.91 | 0.95 | 0.87 |
| Original study results | 0.93 | 0.88 | 0.81 | 0.84 | 0.99 | 0.84 |
The last row has the originally reported results.
Figure 1. Components of the one-class computational procedure.
Prediction of miRNAs in Epstein Barr Virus with the one-class methods.
| Train | Recent | |||||||
| Method | Sen | New | Sen | New | Sen | New | Sen | New |
| OC-SVM | 0.84 (27/32) | 236 | 0.72 (23/32) | 236 | 0.81 (26/32) | 279 | 0.94 (30/32) | 198 |
| OC-Gaussian | 0.88 (28/32) | 258 | 0.81 (26/32) | 233 | 0.81 (26/32) | 266 | 0.84 (27/32) | 275 |
| OC-Kmeans | 0.91 (29/32) | 284 | 0.97 (31/32) | 266 | 0.78 (25/32) | 269 | 0.97 (31/32) | 271 |
| OC-PCA | 0.97 (31/32) | 284 | 0.91 (29/32) | 255 | 0.91 (29/32) | 259 | 0.94 (30/32) | 283 |
| OC-KNN | 0.88 (28/32) | 272 | 0.84 (27/32) | 266 | 0.81 (26/32) | 283 | 0.91 (29/32) | 269 |
| naïve Bayes | 0.84 (27/32) | 165 | N/A | N/A | N/A | N/A | 0.94 (30/32) | 276 |
All-miRNA, Mouse, or Human served as training data sets. New = new miRNA predictions.
Figure 2. One-Class Gaussian classification scores. Distribution of OC-Gaussian classifier scores over the known miRNA class and over the new miRNA predictions from EBV genome sequences. All-miRNA is used for training.
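To make the idea of a one-class Gaussian score concrete, the following is a minimal sketch, not the authors' implementation. It fits a Gaussian density to positive examples only, scores a candidate by its log-density, and accepts it if the score is at least a threshold chosen from the training scores. The diagonal covariance, the `keep_fraction` threshold rule, the class name `OneClassGaussian`, and the synthetic three-dimensional features are all assumptions made for illustration.

```python
import math
import random
from statistics import fmean, pvariance


class OneClassGaussian:
    """One-class classifier trained on positive examples only.

    Assumption: features are modeled as independent Gaussians
    (diagonal covariance) to keep the sketch dependency-free.
    """

    def fit(self, X, keep_fraction=0.9):
        cols = list(zip(*X))
        self.mu = [fmean(c) for c in cols]
        # A small floor avoids division by zero for constant features.
        self.var = [max(pvariance(c), 1e-9) for c in cols]
        scores = sorted(self.score(x) for x in X)
        # Accept roughly the best-scoring keep_fraction of the training set.
        self.threshold = scores[int((1 - keep_fraction) * len(scores))]
        return self

    def score(self, x):
        # Log-density of x under the fitted diagonal Gaussian.
        return sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, self.mu, self.var)
        )

    def predict(self, x):
        # True means "looks like the positive class".
        return self.score(x) >= self.threshold


# Synthetic stand-in for positive-class feature vectors (hypothetical data).
random.seed(0)
positives = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
model = OneClassGaussian().fit(positives)

print(model.predict([0.0, 0.0, 0.0]))  # candidate near the training distribution
print(model.predict([8.0, 8.0, 8.0]))  # candidate far from it
```

Plotting a histogram of `model.score(x)` for the training examples next to the scores of new candidates would give a comparison of the kind the figure describes.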
Figure 3. Partition of a stem-loop into three parts (Foot, Mature, and Head), whose features are used to determine potential stem-loops.