Galina V Glazko, Boris L Zybailov, Igor B Rogozin.
Abstract
Among thousands of long non-coding RNAs (lncRNAs), only a small subset is functionally characterized, and the functional annotation of lncRNAs on the genomic scale remains inadequate. In this study we computationally characterized two functionally different parts of the human lncRNA transcriptome based on their ability to bind the polycomb repressive complex PRC2. This classification is possible because, although lncRNAs as a whole constitute a diverse set of sequences, the classes of PRC2-binding and PRC2 non-binding lncRNAs possess characteristic combinations of sequence-structure patterns and can therefore be separated within the feature space. Based on the specific combination of features, we built several machine-learning classifiers and identified the SVM-based classifier as the best performing. We further showed that the SVM-based classifier is able to generalize to independent data sets. We observed that this classifier, trained on human lncRNAs, can predict up to 59.4% of PRC2-binding lncRNAs in mice. This suggests that, despite the low degree of sequence conservation, many lncRNAs play functionally conserved biological roles.
Year: 2012 PMID: 23028655 PMCID: PMC3441527 DOI: 10.1371/journal.pone.0044878
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Consensus motifs enriched in PRC2-binding and PRC2 non-binding lncRNAs.
| Length | Number of motifs | Enriched in PRC2+ | Enriched in PRC2− |
| k = 4 | 15 | WWHH | SBSC |
| k = 5 | 47 | TYKWW | SSYCV |
| k = 6 | 196 | WWWRW | SKSCSM |
| k = 7 | 568 | WWWWWW | SYSMSMS |
| k = 8 | 1104 | WWWWWWWW | BHMRVMAD |
PRC2+: PRC2-binding lncRNAs.
PRC2−: PRC2 non-binding lncRNAs.
IUPAC nucleotide code: http://www.bioinformatics.org/sms/iupac.html.
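The consensus motifs in the table are written in IUPAC ambiguity codes, so each one stands for a family of concrete k-mers. As a minimal sketch of how such a motif could be scanned against a sequence, the Python below expands each code into a regular-expression character class and counts overlapping matches. The function names and the toy sequences are illustrative assumptions, not part of the original study, and the DNA alphabet (T rather than U) is assumed because the table uses it.

```python
import re

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet, as in the table).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def motif_to_regex(motif):
    """Translate an IUPAC consensus motif into a regex of character classes."""
    return "".join("[" + IUPAC[code] + "]" for code in motif.upper())

def count_matches(motif, sequence):
    """Count overlapping occurrences of the motif via a zero-width lookahead."""
    pattern = "(?=" + motif_to_regex(motif) + ")"
    return len(re.findall(pattern, sequence.upper()))

# Made-up toy sequences, used only to exercise the functions.
at_rich = "AATTCC"
gc_rich = "GCGC"

print(motif_to_regex("WWHH"))
print(count_matches("WWHH", at_rich), count_matches("SBSC", at_rich))
print(count_matches("WWHH", gc_rich), count_matches("SBSC", gc_rich))
```

Counting overlapping matches is a design choice here; whether the original feature extraction counted overlapping or non-overlapping occurrences is not stated in this record.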
Figure 1. Visualization of the classification performance for four classifiers and the set of features selected at the 0.05 significance level.
The observations along the X axis are ordered according to their true class labels. For each observation, red and green dots represent the estimated probabilities of belonging to class 0 and class 1, respectively. The dotted line separates observations from class 0 and class 1. As is evident from the plot, the estimated probability that an observation belongs to a specific class agrees with its class label.
Figure 2. ROC curves for four different classifiers and the set of features selected at the 0.05 significance level.
Classifier performance (0.05 significance level).
| Classifier | Specificity | Sensitivity | Misclassification | Accuracy |
| SVM linear | 0.787 | 0.745 | 0.234 | 0.766 |
| Shrinkage LDA | 0.771 | 0.682 | 0.274 | 0.726 |
| Random Forest | 0.704 | 0.682 | 0.307 | 0.693 |
| Logistic Regression | 0.688 | 0.631 | 0.341 | 0.659 |
Classifier performance (0.01 significance level).
| Classifier | Specificity | Sensitivity | Misclassification | Accuracy |
| SVM linear | 0.688 | 0.707 | 0.303 | 0.697 |
| Shrinkage LDA | 0.707 | 0.688 | 0.303 | 0.693 |
| Random Forest | 0.669 | 0.659 | 0.336 | 0.632 |
| Logistic Regression | 0.640 | 0.694 | 0.333 | 0.667 |
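For readers who want to relate the four columns of the performance tables to each other, the following sketch computes them from a binary confusion matrix using the standard definitions. The counts are hypothetical and chosen only for illustration; the record does not report the underlying confusion matrices, and the function name is an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification summaries from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    return {
        "specificity": tn / (tn + fp),      # true negative rate
        "sensitivity": tp / (tp + fn),      # true positive rate
        "misclassification": 1 - accuracy,  # fraction of wrong predictions
        "accuracy": accuracy,
    }

# Hypothetical counts, NOT taken from the study.
metrics = classification_metrics(tp=75, fn=25, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, misclassification and accuracy computed on the same predictions sum to one.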
SVM performance on independent data sets.
| LncRNA | PRC2-binding | PRC2 non-binding |
| HOTAIR 1–300 | 1 | 0 |
| HOTAIR 1–1500 | 1 | 0 |
| HOTAIR 1500–2146 | 0 | 1 |
| repA XIST (human, mouse) | 1 | 0 |
| 106 mouse PRC2-binding | 63 | 43 |
| mHOTAIR 1–500 | 1 | 0 |
| mHOTAIR 500–2006 | 0 | 1 |
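The 59.4% figure quoted in the abstract appears to correspond to the "106 mouse PRC2-binding" row of this table. A short check of that arithmetic, using only the two counts from that row (the variable names are illustrative):

```python
# Counts from the "106 mouse PRC2-binding" row of the table.
predicted_binding = 63
predicted_non_binding = 43

total = predicted_binding + predicted_non_binding
fraction = predicted_binding / total

print(total)                    # 106
print(f"{100 * fraction:.1f}%")  # 59.4%
```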
Figure 3. A fragment of mouse (mHOTAIR) and human (hHOTAIR) HOTAIR lncRNA alignment (positions 1–1120 in human lncRNA are shown).
Exon coordinates are from NC00012.
37]. We also considered more established techniques, such as Support Vector Machine (SVM) [36] and Random Forest (RF) [38], and, as the most classical approach, we employed Logistic Regression (LR) [39]. The SVM classifier is known to be sensitive to its parameters, and its performance decreases significantly without tuning. Here we used a linear kernel, so the only tuning parameter was the cost (C). We considered a grid of values for the cost parameter and employed nested cross-validation to select the value of the hyperparameter that gave the smallest misclassification rate (C = 0.1). All computations were performed in the Bioconductor package CMA ('Classification for MicroArrays') [57], implemented in the R language [55].
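The nested cross-validation described above can be sketched as follows. This is not the CMA package's API and not the authors' code: it is a self-contained Python illustration in which the inner loop picks the cost C from a grid and the outer loop estimates the misclassification rate on folds that played no part in that choice. The trainer is a crude stochastic subgradient stand-in for a real linear SVM solver, and the data, grid values, fold counts, and all function names are assumptions made for the example.

```python
import random

def train_linear_svm(X, y, C, epochs=50, lr=0.01):
    """Crude subgradient descent on 0.5*||w||^2 + C*hinge loss; labels are +1/-1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge term contributes
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # outside the margin: only the regularizer acts
                w = [wj - lr * wj for wj in w]
    return w, b

def error_rate(model, X, y):
    """Fraction of observations whose predicted sign differs from the label."""
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
    return sum(p != yi for p, yi in zip(preds, y)) / len(y)

def k_folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def fit_and_score(X, y, train, test, C):
    model = train_linear_svm([X[i] for i in train], [y[i] for i in train], C)
    return error_rate(model, [X[i] for i in test], [y[i] for i in test])

def cv_error(X, y, indices, C, k):
    """Mean held-out error of cost C over k folds of the given indices."""
    errors = []
    for fold in k_folds(indices, k):
        held = set(fold)
        train = [i for i in indices if i not in held]
        errors.append(fit_and_score(X, y, train, fold, C))
    return sum(errors) / len(errors)

def nested_cv(X, y, grid, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates error; inner loop selects C on training data only."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    outer_errors, chosen = [], []
    for fold in k_folds(indices, outer_k):
        held = set(fold)
        train = [i for i in indices if i not in held]
        best_C = min(grid, key=lambda C: cv_error(X, y, train, C, inner_k))
        outer_errors.append(fit_and_score(X, y, train, fold, best_C))
        chosen.append(best_C)
    return sum(outer_errors) / len(outer_errors), chosen

# Synthetic two-class toy data (NOT lncRNA features).
rng = random.Random(1)
X, y = [], []
for _ in range(60):
    label = rng.choice([-1, 1])
    X.append([label + rng.gauss(0, 1), label + rng.gauss(0, 1)])
    y.append(label)

grid = [0.01, 0.1, 1.0, 10.0]
err, chosen = nested_cv(X, y, grid)
print("estimated misclassification rate:", round(err, 3))
print("cost chosen in each outer fold:", chosen)
```

Because C is selected separately inside each outer fold, the outer error estimate is not biased by the tuning step; a single final value such as the reported C = 0.1 would typically come from repeating the grid search on the full training set.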