Petra Stepanowsky, Eric Levy, Jihoon Kim, Xiaoqian Jiang, Lucila Ohno-Machado.
Abstract
MicroRNAs (miRNAs) are a class of short noncoding RNAs that regulate gene expression through base pairing with messenger RNAs. Due to the interest in studying miRNA dysregulation in disease and limits of validated miRNA references, identification of novel miRNAs is a critical task. The performance of different models to predict novel miRNAs varies with the features chosen as predictors. However, no study has systematically compared published feature sets. We constructed a comprehensive feature set using the minimum free energy of the secondary structure of precursor miRNAs, a set of nucleotide-structure triplets, and additional extracted sequence and structure characteristics. We then compared the predictive value of our comprehensive feature set to those from three previously published studies, using logistic regression and random forest classifiers. We found that classifiers containing as few as seven highly predictive features are able to predict novel precursor miRNAs as well as classifiers that use larger feature sets. In a real data set, our method correctly identified the holdout miRNAs relevant to renal cancer.
Keywords: classification; feature selection; microRNA prediction
Year: 2014 PMID: 25392687 PMCID: PMC4216048 DOI: 10.4137/CIN.S13877
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Comparison of features in different selection methods.
| METHOD | LEFT TRIPLET | MIDDLE TRIPLET | RIGHT TRIPLET | MFE SCORE | PERMUTATION ON MFE | NUMBER OF NTS IN STEM PARTS | NUMBER OF PAIRED NTS |
|---|---|---|---|---|---|---|---|
| Xue et al | | | | | | | |
| Jiang et al | | | | | | | |
| Zhao et al | | | | | | | |
| Our method | | | | | | | |
Abbreviations: MFE, Minimum Free Energy; NT, Nucleotide.
Figure 1. The secondary structure of a pre-miRNA can change if a single nucleotide differs. This figure illustrates (A) the pre-miRNA hsa-miR-19b-1 sequence and its secondary structure, (B) a one-nucleotide change (yellow) that modifies the loop part, and (C) one that modifies the stem arm.
Figure 2. An overview of the workflow to extract different nucleotide-structure triplets.
Notes: Only the nucleotides on the 5’ and 3’ stem arms are considered. The different triplets are counted and then normalized by the corresponding total number of triplets.
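The counting and normalization described in the notes can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `triplet_features`, the dot-bracket structure input, the single-hairpin rule used to locate the two stem arms, and the example sequence are all hypothetical. It also assumes that a triplet combines the paired or unpaired status of three adjacent positions with the nucleotide at the left, middle, or right position.

```python
from collections import Counter


def triplet_features(seq, struct, position="middle"):
    """Return normalized counts of nucleotide-structure triplets on the stem arms.

    seq      -- RNA sequence, e.g. "GCAUG..."
    struct   -- dot-bracket structure of the same length
    position -- which nucleotide of the window labels the triplet
    """
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have equal length")
    offset = {"left": 0, "middle": 1, "right": 2}[position]

    # Collapse both bracket types so only paired "(" vs unpaired "." remains.
    status = struct.replace(")", "(")

    # Assumption: a single hairpin. The 5' arm spans the first to the last "(",
    # the 3' arm spans the first to the last ")"; the terminal loop is skipped.
    arms = []
    for bracket in "()":
        if bracket in struct:
            arms.append((struct.index(bracket), struct.rindex(bracket)))

    counts = Counter()
    for start, end in arms:
        # Slide a window of three positions that stays inside the arm.
        for i in range(start, end - 1):
            counts[seq[i + offset] + status[i:i + 3]] += 1

    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


# Hypothetical toy hairpin, for illustration only.
features = triplet_features("GCAUGAAACAUGC", "((.((...)).))", position="middle")
```

Each key is a four-character string such as `"C((."`: the labeling nucleotide followed by the three structure symbols. The values sum to 1 because the counts are divided by the total number of triplets found on the two arms.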
Figure 3. Feature contribution to outcome prediction according to the RELIEF score.
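For readers unfamiliar with RELIEF-style scoring, the sketch below shows one simple variant. It is an assumption-laden illustration rather than the procedure used in the study: it implements the basic two-class Relief update with Manhattan distance, visits every sample instead of a random subset, and expects features already scaled to [0, 1]. The function name `relief_scores` and the toy data are hypothetical.

```python
def relief_scores(X, y):
    """Basic two-class Relief weights (one weight per feature).

    X -- list of feature vectors, each feature assumed scaled to [0, 1]
    y -- list of binary class labels
    Every class needs at least two samples so that a nearest hit exists.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance between two feature vectors.
        return sum(abs(u - v) for u, v in zip(a, b))

    for i in range(n):
        # Nearest neighbor with the same label (hit) and the other label (miss).
        hit = min((j for j in range(n) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(n) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            # Reward features that differ across classes, penalize those
            # that differ within a class.
            weights[f] += (abs(X[i][f] - X[miss][f])
                           - abs(X[i][f] - X[hit][f])) / n
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
scores = relief_scores(X, y)
```

A higher weight means the feature tends to differ between nearest neighbors of opposite classes while staying similar between nearest neighbors of the same class.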
Parameters used for RF for each feature set.
| METHOD | NUMBER OF VARIABLES RANDOMLY SAMPLED AS CANDIDATES |
|---|---|
| Xue et al | 6 |
| Jiang et al | 10 |
| Zhao et al | 10 |
| Our method | 5 |
Average performance of different feature sets and classifier models in 10-fold cross-validation.
| METHOD | AUC LR | 95% CI FOR LR | AUC RF | 95% CI FOR RF |
|---|---|---|---|---|
| Xue et al | 0.9499 | (0.9442, 0.9556) | 0.9217 | (0.9128, 0.9306) |
| Jiang et al | 0.9706 | (0.9658, 0.9755) | 0.9688 | (0.9627, 0.9748) |
| Zhao et al | 0.9752 | (0.9688, 0.9817) | 0.9679 | (0.9606, 0.9753) |
| Our method | 0.9759 | (0.9730, 0.9789) | 0.9716 | (0.9659, 0.9772) |
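As a rough guide to how a table like this can be produced, the sketch below computes an AUC from classifier scores and then a mean with a 95% confidence interval across 10 folds. The source does not state how its intervals were derived, so the Student t interval used here is an assumption, as are the function names `auc` and `mean_ci_10fold` and the fold values, which are made up for illustration.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 9 degrees of freedom (10 folds).
T_CRIT_9_DOF = 2.262


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_ci_10fold(fold_aucs):
    """Mean AUC over 10 folds with an assumed t-based 95% confidence interval."""
    if len(fold_aucs) != 10:
        raise ValueError("expected exactly 10 fold AUC values")
    mean = statistics.mean(fold_aucs)
    half_width = T_CRIT_9_DOF * statistics.stdev(fold_aucs) / math.sqrt(10)
    return mean, (mean - half_width, mean + half_width)


# Made-up fold AUCs, not values from the study.
folds = [0.971, 0.978, 0.975, 0.969, 0.980, 0.977, 0.974, 0.973, 0.979, 0.976]
mean_auc, (ci_low, ci_high) = mean_ci_10fold(folds)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a rank-based formula gives the same value faster on large data sets.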
Figure 4. Performance of two classifier models on the validation set: (A) LR and (B) RF.
Performance of different feature sets and classifier models on the external validation data set.
| METHOD | AUC LR | AUC RF |
|---|---|---|
| Xue et al | 0.9651 | 0.9304 |
| Jiang et al | 0.9805 | 0.9748 |
| Zhao et al | 0.9844 | 0.9745 |
| Our method | 0.9861 | 0.9786 |
Figure 6. AUC values versus number of features sequentially added based on RELIEF score for RF classifiers in a 10-fold cross-validation.
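The curve of AUC against the number of features added in RELIEF order can be outlined as follows. This is a sketch under assumptions: `evaluate` is a hypothetical callback that trains a classifier on the given feature names and returns its cross-validated AUC, and the feature names and scores in the example are invented.

```python
def auc_by_feature_count(relief_scores, evaluate):
    """Evaluate growing feature subsets in descending RELIEF-score order.

    relief_scores -- dict mapping feature name to its RELIEF score
    evaluate      -- hypothetical callable: list of feature names -> AUC
    Returns a list of (number_of_features, auc) pairs.
    """
    # Highest-scoring feature first.
    ranked = sorted(relief_scores, key=relief_scores.get, reverse=True)
    curve = []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]  # the k best-ranked features
        curve.append((k, evaluate(subset)))
    return curve


# Invented scores and a stand-in evaluator, for illustration only.
example_scores = {"mfe": 0.30, "paired_nts": 0.20, "stem_nts": 0.10}
curve = auc_by_feature_count(example_scores, lambda subset: 0.5)
```

In practice the evaluator would wrap a 10-fold cross-validation of the chosen classifier, and the point where the curve flattens suggests how few features suffice.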
Figure 5. Histogram of estimates (ie, prediction probabilities) for the positive samples from the renal cancer study.