Yuan Liu1, Dasheng Chen1, Ran Su1, Wei Chen2,3, Leyi Wei4,5.
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification plays an important role in a range of biological processes. Characterizing its distribution in the transcriptome is fundamental to revealing the biological functions of 5hmC. Sequencing-based technologies allow high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, and expensive. Thus, there is an urgent need for more effective and efficient computational methods that can at least complement the high-throughput technologies. In this study, we developed iRNA5hmC, a computational predictive protocol that identifies RNA 5hmC sites using machine learning. In this predictor, we introduced a sequence-based feature algorithm consisting of two feature representations, (1) k-mer spectrum and (2) positional nucleotide binary vector, to capture the sequential characteristics of 5hmC sites. We then applied a two-stage feature space optimization strategy to improve the feature representation ability, and trained a predictive model using a support vector machine (SVM). Our feature analysis showed that feature optimization helps to capture the most discriminative features. Compared with well-known existing feature descriptors, our proposed representations separate true 5hmC sites from non-5hmC sites more accurately. To the best of our knowledge, iRNA5hmC is the first RNA 5hmC predictor that makes predictions from RNA primary sequences alone, without any prior experimental knowledge. Importantly, we have established an easy-to-use webserver, currently available at http://server.malab.cn/iRNA5hmC. We expect it to be a useful tool for the prediction of 5hmC sites.
Keywords: RNA 5-hydroxymethylcytosine modification; feature representation; machine learning; sequence analysis; web server
Year: 2020 PMID: 32296686 PMCID: PMC7137033 DOI: 10.3389/fbioe.2020.00227
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
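The two feature representations named in the abstract, k-mer spectrum and positional nucleotide binary vector, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of k, the frequency normalization, and the 4-bit one-hot encoding per position are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # assumed RNA alphabet


def kmer_spectrum(seq, k=2):
    """Normalized frequency of every k-mer over the ACGU alphabet (assumed normalization)."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases are skipped
            counts[window] += 1
    return [counts[m] / total if total > 0 else 0.0 for m in kmers]


def positional_binary(seq):
    """One-hot encode each position with 4 bits (assumed encoding)."""
    vec = []
    for base in seq:
        vec.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vec


def encode(seq, k=2):
    """Concatenate both representations into one feature vector."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    return kmer_spectrum(seq, k) + positional_binary(seq)


if __name__ == "__main__":
    features = encode("ACGUACGU", k=2)
    print(len(features))  # 16 k-mer frequencies + 4 bits per position
```

Under these assumptions the vector length is 4^k plus four times the sequence length, so every input sequence must have the same length before the vectors are passed to a classifier.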
FIGURE 1. Parameter and kernel optimization of the SVM. (A) Visualization of classifier parameter optimization based on grid search; (B) ROC curves of different kernels in SVM.
Five-fold cross validation results of different features and their combinations.
| A | 62.3 | 61.8 | 62.8 | 0.246 |
| B | 64.0 | 63.4 | 64.5 | 0.279 |
| C | 53.3 | 53.5 | 53.2 | 0.066 |
| A + B | 64.0 | 62.7 | 65.3 | 0.280 |
| A + C | 55.4 | 55.9 | 54.8 | 0.107 |
| B + C | 55.8 | 57.1 | 54.5 | 0.116 |
| A + B + C | 56.1 | 57.6 | 54.7 | 0.122 |
FIGURE 2. Feature analysis results. (A) ACC curve of the feature selection; (B,C) distribution of the samples (positive and negative) in feature space before and after feature optimization, respectively; (D) F-values of the top 20 most important features. The x-axis shows the specific features and the y-axis shows the F-value; b92 denotes the 92nd feature of the binary vector, b25 denotes the 25th feature, and so forth; (E) TSL (Two Sample Logos) visualization of the positives and negatives in the dataset used in this study.
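The F-value ranking shown in panel (D) can be sketched as follows. This record does not define the F-value, so the sketch assumes it is the one-way ANOVA F-statistic computed per feature between positive and negative samples; the function names `f_value` and `rank_features` are hypothetical.

```python
def f_value(pos, neg):
    """One-way ANOVA F-statistic for one feature measured in two classes (assumed definition)."""
    groups = [pos, neg]
    n_total = len(pos) + len(neg)
    grand_mean = (sum(pos) + sum(neg)) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-class and within-class sums of squares
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if within == 0:
        # Perfectly separated (or constant) feature
        return float("inf") if between > 0 else 0.0
    return (between / df_between) / (within / df_within)


def rank_features(X_pos, X_neg):
    """Return (feature_index, F) pairs sorted by decreasing F-value."""
    n_features = len(X_pos[0])
    scores = []
    for j in range(n_features):
        col_pos = [row[j] for row in X_pos]
        col_neg = [row[j] for row in X_neg]
        scores.append((j, f_value(col_pos, col_neg)))
    return sorted(scores, key=lambda t: t[1], reverse=True)


if __name__ == "__main__":
    X_pos = [[1, 0], [1, 1], [1, 0]]
    X_neg = [[0, 0], [0, 1], [0, 1]]
    print(rank_features(X_pos, X_neg))
```

A feature selection curve like the one in panel (A) would then come from retraining the classifier on the top-ranked features at increasing cutoffs, which this sketch does not cover.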
Five-fold cross validation results of the proposed feature set compared with other sequence-based feature descriptors.
| PCP | 63.97 | 68.73 | 59.21 | 0.2807 |
| MMI | 61.56 | 63.14 | 59.97 | 0.2312 |
| PseDNC | 62.84 | 61.33 | 64.35 | 0.2569 |
| PseEIIP | 64.27 | 69.64 | 58.91 | 0.2872 |
| Our feature set | 67.67 | 63.29 |
Comparative results of SVM and four well-known classifiers on the dataset used in this study.
| GBDT | 63.60 | 63.90 | 63.29 | 0.2719 |
| KNN | 58.46 | 56.95 | 59.97 | 0.1693 |
| NB | 63.37 | 63.00 | 63.75 | 0.2674 |
| RF | 60.27 | 62.08 | 58.46 | 0.2056 |
| SVM (this study) | 67.67 | 63.29 |
FIGURE 3. Performance of different classifiers evaluated with five-fold cross validation. (A) ROC curves of different classifiers. (B) PR curves of different classifiers.