Ting Fang, Zizheng Zhang, Rui Sun, Lin Zhu, Jingjing He, Bei Huang, Yi Xiong, Xiaolei Zhu.
Abstract
5-methylcytosine (m5C) is one of the most common and abundant post-transcriptional modifications (PTCMs) in RNA. Recent studies have shown that m5C plays important roles in biological functions such as RNA metabolism and cell fate decision. Because most experimental methods for determining m5C sites across the transcriptome are time-consuming and expensive, accurate computational methods for identifying m5C sites are urgently needed. A benchmark dataset is important for developing and evaluating such methods. In this work, we constructed four datasets that differ in data redundancy and class imbalance. From these datasets, we generated three kinds of features, namely KNFs (K-nucleotide frequencies), KSNPFs (K-spaced nucleotide pair frequencies), and pseDNC (pseudo-dinucleotide composition), and used a support vector machine (SVM) to build our models. Using the imbalanced, nonredundant dataset Met935, we studied the three kinds of features extensively and determined an optimal feature combination. With this combination, we built models on three different datasets and compared them with state-of-the-art models. According to the results of the stringent jackknife test, the models based on the three features 4NF, 1SNPF, and pseDNC are superior or comparable to other methods. To choose between the model trained on the imbalanced dataset Met935 and the model trained on the balanced dataset Met240, we further evaluated both on an independent test set, Test1157. The model based on Met240 achieved the highest recall (68.79%) and the highest Matthews correlation coefficient (MCC; 0.154), and it also outperformed other state-of-the-art methods on the independent test set according to MCC. We therefore selected the model based on Met240 as our final model and named it RNAm5CPred.
In addition, a web server for RNAm5CPred (http://zhulab.ahu.edu.cn/RNAm5CPred/) has been provided to facilitate experimental research.
Keywords: 5-methylcytosine site; nucleotide composition; post-transcriptional modification; prediction; support vector machine
Year: 2019 PMID: 31726390 PMCID: PMC6859278 DOI: 10.1016/j.omtn.2019.10.008
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
Prediction Results of KNF with Different K Values on Met935 over 10-Fold Cross-Validation
| KNF | Feature Dimension | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|---|
| 1NF | 4 | 20.55 | 87.15 | 20.09 | 78.11 | 0.076 |
| 2NF | 16 | 51.26 | 98.39 | 83.4 | 91.99 | 0.615 |
| 3NF | 64 | 66.14 | 97.59 | 81.19 | 93.32 | 0.696 |
| 4NF | 256 | 63.07 | 98.19 | 84.61 | 93.42 | 0.696 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
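The feature dimensions in the table above (4, 16, 64, 256) equal 4^K, the number of possible K-mers over the RNA alphabet. A minimal sketch of how K-nucleotide frequencies could be computed is given below. It assumes the usual definition (K-mer count divided by the number of sliding windows); the function name `knf` and the example sequence are illustrative and are not taken from the paper.

```python
from itertools import product

ALPHABET = "ACGU"

def knf(seq, k):
    """Return K-nucleotide frequencies as a list of length 4**k.

    Assumed definition: count of each K-mer divided by the number of
    sliding windows, len(seq) - k + 1.
    """
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing non-ACGU symbols are skipped
            counts[kmer] += 1
    return [counts[m] / windows for m in kmers]

# Made-up 41-nt sequence with a cytosine at the center (not real data).
example = "GCUA" * 5 + "C" + "AUGC" * 5
features_4nf = knf(example, 4)
print(len(features_4nf))  # dimension of the 4NF feature vector
```

For a sequence containing only A, C, G, and U, each frequency vector sums to 1.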
Prediction Results of 3NF and 4NF on Met935 over Jackknife Test
| KNF | Feature Dimension | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|---|
| 3NF | 64 | 65.35 | 97.52 | 80.58 | 93.16 | 0.688 |
| 4NF | 256 | 63.78 | 98.02 | 83.51 | 93.37 | 0.694 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
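The tables report both 10-fold cross-validation and the jackknife test. The jackknife test is commonly understood as leave-one-out cross-validation, that is, K-fold cross-validation with as many folds as samples. The sketch below generates index splits for both schemes under that assumption. The function names are hypothetical, the folds are contiguous (no shuffling or stratification), and no classifier is included.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    # The first n % k folds receive one extra sample so that sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def jackknife_splits(n):
    """Leave-one-out splits: every sample is the test set exactly once."""
    return k_fold_splits(n, n)

# Met935 has 935 samples, so the jackknife test trains 935 models.
print(sum(1 for _ in jackknife_splits(935)))
print([len(test) for _, test in k_fold_splits(935, 10)])
```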
Prediction Results of KSNPF with Different K Values on Met935 over 10-Fold Cross-Validation
| KSNPF | Dimension | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|---|
| 1SNPF | 16 | 20.79 | 97.80 | 59.75 | 87.34 | 0.300 |
| 2SNPF | 16 | 11.50 | 98.84 | 61.24 | 86.97 | 0.224 |
| 3SNPF | 16 | 19.37 | 96.89 | 49.51 | 86.36 | 0.248 |
| 4SNPF | 16 | 23.78 | 95.68 | 46.47 | 85.91 | 0.262 |
| 5SNPF | 16 | 21.26 | 92.44 | 30.59 | 82.77 | 0.160 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
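Every KSNPF variant in the table above has dimension 16 because only the two end nucleotides of each spaced pair are recorded. The sketch below assumes the usual definition: for a given K, count the pairs formed by positions i and i + K + 1 (K arbitrary nucleotides in between) and divide by the number of such pairs. The name `ksnpf` is illustrative and not taken from the paper.

```python
from itertools import product

ALPHABET = "ACGU"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, ..., UU

def ksnpf(seq, k):
    """Return the 16 K-spaced nucleotide pair frequencies of seq.

    Assumed definition: a pair is (seq[i], seq[i + k + 1]), so k nucleotides
    lie between the two members; counts are divided by the number of pairs.
    """
    gap = k + 1
    n_pairs = len(seq) - gap
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this k")
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + gap]
        if pair in counts:  # pairs containing non-ACGU symbols are skipped
            counts[pair] += 1
    return [counts[p] / n_pairs for p in PAIRS]

# In "ACGU" with k = 1 the only spaced pairs are A_G and C_U.
toy = ksnpf("ACGU", 1)
print(dict(zip(PAIRS, toy))["AG"], dict(zip(PAIRS, toy))["CU"])
```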
Prediction Performances of Different Feature Combinations on Met935 over Jackknife Test
| Feature | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|
| 1SNPF | 20.47 | 97.90 | 60.47 | 87.38 | 0.300 |
| PseDNC | 48.82 | 98.64 | 84.93 | 91.87 | 0.606 |
| 4NF | 63.78 | 98.02 | 83.51 | 93.37 | 0.694 |
| 1SNPF + pseDNC | 52.76 | 98.76 | 87.01 | 92.51 | 0.642 |
| 1SNPF + 4NF | 64.57 | 99.13 | 92.13 | 94.44 | 0.744 |
| pseDNC + 4NF | 62.20 | 98.27 | 84.95 | 93.37 | 0.692 |
| 1SNPF + pseDNC + 4NF | 62.99 | 99.50 | 95.24 | 94.55 | 0.749 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
Figure 1. The ROC Curves for Different Feature Combinations on Met935 over Jackknife Test
Comparison between M5C-PseDNC, iRNAm5C-PseDNC, M5C-HPCR, M5C-HPCS, and Our Model on Met240 Dataset over Jackknife Test
| Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|
| M5C-PseDNC | 85.00 | 95.83 | 90.42 | 0.810 | 0.950 |
| iRNAm5C-PseDNC | 81.70 | 95.00 | 88.33 | 0.774 | 0.934 |
| M5C-HPCS | 90.83 | 92.50 | 91.67 | 0.833 | 0.956 |
| M5C-HPCR | 90.83 | 95.00 | 92.92 | 0.859 | 0.962 |
| Our model | 90.83 | 94.17 | 92.50 | 0.850 | 0.957 |
Sen, sensitivity; Spe, specificity; Acc, accuracy.
Results excerpted from Zhang et al.
Comparison Between M5C-PseDNC, iRNAm5C-PseDNC, M5C-HPCR, M5C-HPCS, and Our Model on Met1900 Dataset over Jackknife Test
| Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|
| M5C-PseDNC | 84.21 | 94.88 | 92.21 | 0.792 | 0.960 |
| iRNAm5C-PseDNC | 69.89 | 99.86 | 92.37 | 0.794 | 0.963 |
| M5C-HPCS | 83.37 | 96.84 | 93.47 | 0.823 | 0.968 |
| M5C-HPCR | 88.42 | 97.33 | 95.11 | 0.868 | 0.977 |
| Our model | 91.58 | 99.51 | 97.53 | 0.934 | 0.991 |
Sen, sensitivity; Spe, specificity; Acc, accuracy.
Results excerpted from Zhang et al.
Prediction Results of M5C-HPCS and Our Model on Test96
| Predictor | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|
| M5C-HPCR | 100.00 | 62.65 | 29.55 | 67.71 | 0.430 |
| Our model | 84.62 | 100.00 | 100.00 | 97.92 | 0.909 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
Results obtained by using M5C-HPCS web server on Test96.
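The five reported metrics can all be derived from a confusion matrix. The sketch below uses the standard definitions and checks them against the Test96 table. The confusion counts are not reported in the source; they are inferred here from the Test96 class sizes (13 positives, 83 negatives) and the reported percentages, so they should be read as an assumption.

```python
from math import sqrt

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, precision, accuracy (as fractions) and MCC."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Acc": acc, "MCC": mcc}

# Inferred counts (assumption): 11 of 13 positives found, no false positives.
our_model = binary_metrics(tp=11, fn=2, tn=83, fp=0)
# Inferred counts (assumption): all 13 positives found, 31 of 83 negatives misclassified.
m5c_hpcr = binary_metrics(tp=13, fn=0, tn=52, fp=31)

for name, m in (("Our model", our_model), ("M5C-HPCR", m5c_hpcr)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under these inferred counts, the computed values round to the figures in the table, which illustrates how a model with perfect sensitivity can still have low precision and MCC when it produces many false positives.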
Figure 2. The ROC Curve Shows the Performances of Our Model and M5C-HPCR on Test96
Prediction Results of Different Models on Test1157
| Predictor | Sen (%) | Spe (%) | Pre (%) | Acc (%) | MCC |
|---|---|---|---|---|---|
| M5C-HPCR | 62.42 | 51.10 | 16.70 | 52.64 | 0.093 |
| iRNA-m5C | 43.95 | 49.20 | 11.96 | 48.49 | −0.047 |
| Model240 | 68.79 | 53.70 | 18.91 | 55.75 | 0.154 |
| Model935 | 10.83 | 93.00 | 19.54 | 81.85 | 0.050 |
Sen, sensitivity; Spe, specificity; Pre, precision; Acc, accuracy.
Results obtained by using M5C-HPCS web server on Test1157.
Results obtained by using iRNA-m5C web server on Test1157.
Model240 is our model based on Met240, and Model935 is our model based on Met935.
Figure 3. The Flowchart for Generating Dataset Met935
Figure 4. The Flowchart for Generating Dataset Test1157
The Information of the Six Datasets
| Dataset | Length (bp) | Positive Subset | Negative Subset | Total |
|---|---|---|---|---|
| Met240 | 41 | 120 | 120 | 240 |
| Met1900 | 41 | 475 | 1425 | 1900 |
| Met935 | 41 | 127 | 808 | 935 |
| Train839 | 41 | 114 | 725 | 839 |
| Test96 | 41 | 13 | 83 | 96 |
| Test1157 | 41 | 157 | 1000 | 1157 |
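The class sizes in the table above make the degree of imbalance easy to quantify. The short sketch below computes, for each dataset, the number of negatives per positive and the accuracy that a trivial classifier would reach by always predicting the majority class. The helper name `class_balance` is illustrative.

```python
# (positive, negative) counts copied from the dataset table.
DATASETS = {
    "Met240": (120, 120),
    "Met1900": (475, 1425),
    "Met935": (127, 808),
    "Train839": (114, 725),
    "Test96": (13, 83),
    "Test1157": (157, 1000),
}

def class_balance(pos, neg):
    """Return negatives per positive and the majority-class baseline accuracy."""
    total = pos + neg
    return {"neg_per_pos": neg / pos, "majority_acc": max(pos, neg) / total}

for name, (pos, neg) in DATASETS.items():
    stats = class_balance(pos, neg)
    print(f"{name}: {stats['neg_per_pos']:.2f} negatives per positive, "
          f"majority-class accuracy {100 * stats['majority_acc']:.2f}%")
```

For Test1157, always predicting the negative class would already give an accuracy of about 86.43%, which is higher than the Acc of every model in the Test1157 table. This is one reason MCC and recall are more informative than accuracy on the imbalanced test sets. The counts are also consistent with Train839 and Test96 together making up Met935 (114 + 13 = 127 positives, 725 + 83 = 808 negatives).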
List of Physicochemical Properties of Dinucleotides in RNA
| Dinucleotide | Free Energy | Hydrophilicity | Stacking Energy |
|---|---|---|---|
| GG | −3.260 | 0.170 | −11.100 |
| GA | −2.350 | 0.100 | −14.200 |
| GC | −3.420 | 0.260 | −16.900 |
| GU | −2.240 | 0.270 | −13.800 |
| AG | −2.080 | 0.080 | −14.000 |
| AA | −0.930 | 0.040 | −13.700 |
| AC | −2.240 | 0.140 | −13.800 |
| AU | −1.100 | 0.140 | −15.400 |
| CG | −2.360 | 0.350 | −15.600 |
| CA | −2.110 | 0.210 | −14.400 |
| CC | −3.260 | 0.490 | −11.100 |
| CU | −2.080 | 0.520 | −14.000 |
| UG | −2.110 | 0.340 | −14.400 |
| UA | −1.330 | 0.210 | −16.000 |
| UC | −2.350 | 0.480 | −14.200 |
| UU | −0.930 | 0.440 | −13.700 |
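The physicochemical properties above are the inputs to pseDNC. The source does not give the pseDNC formula, so the sketch below follows the commonly used pseudo-dinucleotide composition formulation: 16 normalized dinucleotide frequencies followed by λ sequence-order correlation terms computed from standardized property values. The values of `lam` and `w`, the use of the population standard deviation, and all function names are assumptions for illustration, not the settings used in the paper.

```python
from statistics import mean, pstdev

# (free energy, hydrophilicity, stacking energy) copied from the table above.
PROPS = {
    "GG": (-3.26, 0.17, -11.1), "GA": (-2.35, 0.10, -14.2),
    "GC": (-3.42, 0.26, -16.9), "GU": (-2.24, 0.27, -13.8),
    "AG": (-2.08, 0.08, -14.0), "AA": (-0.93, 0.04, -13.7),
    "AC": (-2.24, 0.14, -13.8), "AU": (-1.10, 0.14, -15.4),
    "CG": (-2.36, 0.35, -15.6), "CA": (-2.11, 0.21, -14.4),
    "CC": (-3.26, 0.49, -11.1), "CU": (-2.08, 0.52, -14.0),
    "UG": (-2.11, 0.34, -14.4), "UA": (-1.33, 0.21, -16.0),
    "UC": (-2.35, 0.48, -14.2), "UU": (-0.93, 0.44, -13.7),
}
ORDER = list(PROPS)  # dinucleotide order of the first 16 features
N_PROPS = 3

def standardize(props):
    """Z-score each property over the 16 dinucleotides (population std assumed)."""
    cols = [[v[j] for v in props.values()] for j in range(N_PROPS)]
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return {d: tuple((v[j] - mus[j]) / sds[j] for j in range(N_PROPS))
            for d, v in props.items()}

Z = standardize(PROPS)

def pse_dnc(seq, lam=2, w=0.5):
    """Return a (16 + lam)-dimensional pseDNC vector; lam and w are placeholders."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(dinucs)
    if lam >= n:
        raise ValueError("lam must be smaller than the number of dinucleotides")
    freqs = [dinucs.count(d) / n for d in ORDER]
    thetas = []
    for j in range(1, lam + 1):
        # Mean squared property difference between dinucleotides j steps apart.
        terms = [sum((a - b) ** 2 for a, b in zip(Z[dinucs[i]], Z[dinucs[i + j]])) / N_PROPS
                 for i in range(n - j)]
        thetas.append(sum(terms) / len(terms))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

# Made-up 41-nt sequence with a cytosine at the center (not real data).
vec = pse_dnc("GCUA" * 5 + "C" + "AUGC" * 5, lam=2, w=0.5)
print(len(vec))
```

In this formulation the whole vector sums to 1, and `w` controls how much weight the correlation terms receive relative to the plain dinucleotide frequencies.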