Bailing Zhou, Yuedong Yang, Jian Zhan, Xianghua Dou, Jihua Wang, Yaoqi Zhou.
Abstract
High-throughput techniques have uncovered hundreds and thousands of long non-coding RNAs (lncRNAs). Only a tiny fraction of them have functions that were experimentally validated by low-throughput methods (EVlncRNAs). What fraction of lncRNAs from high-throughput experiments (HTlncRNAs) is truly functional is an active subject of debate. Here, we developed the first method to distinguish EVlncRNAs from HTlncRNAs and mRNAs by using Support Vector Machines and found that EVlncRNAs can be well separated from HTlncRNAs and mRNAs, with a Matthews correlation coefficient of 0.6, a sensitivity of 64%, and a precision of 81% on the independent human test set. The most useful features for classification are related to sequence conservation at the RNA level (for separating EVlncRNAs from HTlncRNAs) and at the protein level (for separating them from mRNAs). The method is found to be robust, as the human-RNA-trained model is applicable to independent mouse RNAs with similar accuracy and, to a lesser extent, to plant RNAs. The method can recover newly discovered EVlncRNAs with high sensitivity. Its application to 2000 randomly selected human HTlncRNAs indicates that the majority of HTlncRNAs are probably non-functional but that a large portion (nearly 30%) are likely functional. In other words, there is an ample number of lncRNAs whose specific biological roles are yet to be discovered. The method developed here is expected to speed up discovery and reduce its cost by prioritizing potentially functional lncRNAs prior to experimental validation. EVlncRNA-pred is available as a web server at http://biophy.dzu.edu.cn/lncrnapred/index.html. All datasets used in this study can be obtained from the same website.
Keywords: Long non-coding RNAs; functional lncRNAs; low throughput experiments; prediction
Year: 2019 PMID: 31345106 PMCID: PMC6779387 DOI: 10.1080/15476286.2019.1644590
Source DB: PubMed Journal: RNA Biol ISSN: 1547-6286 Impact factor: 4.652
Performance of full-feature and sequence-only SVM models trained on human datasets.
| Species | Evaluation | MCC | AUC | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|---|---|
| Full-feature model | | | | | | | |
| Human | CV (a) | 0.513 ± 0.006 | 0.841 ± 0.006 | 0.791 ± 0.003 | 0.599 ± 0.011 | 0.887 ± 0.005 | 0.728 ± 0.005 |
| Human | Test (b) | 0.603 | 0.879 | 0.829 | 0.641 | 0.923 | 0.806 |
| Mouse | Test (c) | 0.512 | 0.859 | 0.765 | 0.777 | 0.759 | 0.617 |
| Sequence-only model | | | | | | | |
| Human | CV (a) | 0.471 | 0.841 | 0.774 | 0.551 | 0.887 | 0.709 |
| Human | Test (b) | 0.514 | 0.852 | 0.792 | 0.590 | 0.893 | 0.734 |
| Mouse | Test (c) | 0.481 | 0.846 | 0.745 | 0.777 | 0.729 | 0.589 |
(a) 10-fold cross-validation on the full training set. The mean and standard deviation are obtained from 100 random divisions of 10 folds in the training set. (b) Test results on the independent test set of the human RNAs. (c) Test results on the independent test set of the mouse RNAs.
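The metric columns in this table are linked through the confusion matrix, so a row can be sanity-checked from its own entries. The Python sketch below recomputes precision, accuracy, and MCC from the sensitivity and specificity reported for the full-feature model on the human test set. The function name `metrics_from_rates` is hypothetical, and the 1:2 ratio of positives (EVlncRNAs) to negatives is an assumption: it is not stated here and was chosen only because it reproduces the reported precision.

```python
import math


def metrics_from_rates(sensitivity, specificity, neg_per_pos):
    """Derive precision, accuracy, and MCC from two rates and a class ratio.

    neg_per_pos is the ASSUMED number of negatives per positive.
    Counts are normalized so that the positive class has size 1.
    """
    tp = sensitivity                       # true positives
    fn = 1.0 - sensitivity                 # false negatives
    tn = specificity * neg_per_pos         # true negatives
    fp = (1.0 - specificity) * neg_per_pos # false positives

    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (1.0 + neg_per_pos)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"precision": precision, "accuracy": accuracy, "mcc": mcc}


# Reported values for the full-feature model, human test set:
# sensitivity 0.641, specificity 0.923.
# The 1:2 positive-to-negative ratio is an assumption, not a reported figure.
human_test = metrics_from_rates(0.641, 0.923, neg_per_pos=2.0)
print({name: round(value, 3) for name, value in human_test.items()})
```

Under that assumed ratio, the derived values agree with the reported human test row (MCC 0.603, accuracy 0.829, precision 0.806) to within about 0.001.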
Figure 1. Receiver operating characteristic curves of the full-feature and sequence-only models trained on human RNAs.
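For readers who want to relate a ROC curve to the AUC values in the tables, the following Python sketch computes both from classifier scores. It is a generic illustration, not the authors' implementation: the function names and the example scores are hypothetical. AUC is computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


def roc_points(pos_scores, neg_scores):
    """(false positive rate, true positive rate) pairs from a threshold sweep."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


# Hypothetical scores, for illustration only.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
auc = roc_auc(positives, negatives)
curve = roc_points(positives, negatives)
print(round(auc, 3), curve[-1])
```

The pairwise definition costs O(n·m) comparisons, which is fine for small examples; rank-based formulas give the same value more efficiently on large sets.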
Performance of full-feature and sequence-only SVM models, both without DNA conservation scores, trained on human datasets.
| Dataset | MCC | AUC | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|---|
| Full-feature model without DNA conservation | | | | | | |
| Human, CV (a) | 0.482 | 0.845 | 0.781 | 0.514 | 0.915 | 0.753 |
| Human, Test (b) | 0.560 | 0.869 | 0.812 | 0.581 | 0.927 | 0.800 |
| Mouse, Test (c) | 0.518 | 0.844 | 0.779 | 0.723 | 0.807 | 0.652 |
| Plant, Test (d) | 0.383 | 0.801 | 0.744 | 0.417 | 0.908 | 0.694 |
| Sequence-only model without DNA conservation | | | | | | |
| Human, CV (a) | 0.448 | 0.833 | 0.763 | 0.562 | 0.864 | 0.675 |
| Human, Test (b) | 0.521 | 0.849 | 0.792 | 0.632 | 0.872 | 0.712 |
| Mouse, Test (c) | 0.414 | 0.818 | 0.719 | 0.711 | 0.723 | 0.562 |
| Plant, Test (d) | 0.221 | 0.725 | 0.689 | 0.283 | 0.892 | 0.567 |
(a) 10-fold cross-validation on the full training set. (b) Test results on the independent test set of the human RNAs. (c) Test results on the independent test set of the mouse RNAs. (d) Test results on the independent test set of the plant RNAs.
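The cross-validation rows rely on 10-fold splits of the training set, and the first table reports a mean and standard deviation over 100 random divisions into 10 folds. A minimal Python sketch of such a repeated protocol follows. The function names, the `evaluate` callback, and the toy scoring function are hypothetical, and averaging the folds within each repeat before taking the mean and standard deviation across repeats is an assumption about the aggregation, which is not specified here.

```python
import random
import statistics


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv(n_samples, evaluate, k=10, repeats=100, seed=0):
    """Mean and standard deviation of a score over repeated k-fold splits.

    evaluate(train_idx, test_idx) is a caller-supplied (hypothetical)
    function that trains on train_idx and returns a score on test_idx.
    """
    rng = random.Random(seed)
    repeat_scores = []
    for _ in range(repeats):
        fold_scores = []
        for held_out in k_fold_indices(n_samples, k, rng):
            held = set(held_out)
            train = [i for i in range(n_samples) if i not in held]
            fold_scores.append(evaluate(train, held_out))
        # Assumed aggregation: one averaged score per random division.
        repeat_scores.append(statistics.mean(fold_scores))
    return statistics.mean(repeat_scores), statistics.stdev(repeat_scores)


def toy_evaluate(train_idx, test_idx):
    """Placeholder score: fraction of held-out indices that are even."""
    return sum(1 for i in test_idx if i % 2 == 0) / len(test_idx)


mean_score, std_score = repeated_cv(50, toy_evaluate)
print(mean_score, std_score)
```

In practice `evaluate` would fit the SVM on the training indices and return a metric such as MCC on the held-out fold; the toy function only exercises the splitting logic.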
Figure 2. As in Figure 1, but for the model without DNA conservation features and tested on the plant RNAs.
Figure 3. The difference in area under the ROC curve (AUC) from using a single feature (open bar) or removing a single feature (filled bar). Here, GC denotes GC content; PUR: purine content; DNA: DNA conservation; Protein: protein conservation; RNA: RNA conservation; ASA: accessible surface area; pA+: polyA+ RNA-seq; pA-: polyA- RNA-seq; s: small RNA-seq; 36: H3K36me3 modification; and 4: H3K4me3 modification. The difference for removing a single feature (filled bar) is multiplied by 10 to facilitate comparison with the results of using a single feature (open bar).
Figure 4. Receiver operating characteristic curves on the human test set for EVlncRNA-pred and several methods that were trained only to separate expressed lncRNAs from mRNAs.