Lijun Dou, Xiaoling Li, Hui Ding, Lei Xu, Huaikun Xiang.
Abstract
5-Methylcytosine (m5C) is a well-known post-transcriptional modification that plays significant roles in biological processes such as RNA metabolism, tRNA recognition, and stress responses. Traditional high-throughput techniques for identifying m5C sites are usually time-consuming and expensive. In addition, the number of RNA sequences has grown explosively in the post-genomic era. Thus, machine-learning-based methods are urgently needed to predict RNA m5C modifications quickly and accurately. Here, we propose a novel support-vector-machine (SVM)-based tool, called iRNA-m5C_SVM, which combines multiple sequence features to identify m5C sites in Arabidopsis thaliana. Eight kinds of popular feature-extraction methods were first investigated systematically. Then, four well-performing features were incorporated to construct a comprehensive model: position-specific propensity (PSP) (PSNP, PSDP, and PSTP, associated with frequencies of nucleotides, dinucleotides, and trinucleotides, respectively), nucleotide composition (nucleic acid, di-nucleotide, and tri-nucleotide compositions; NAC, DNC, and TNC, respectively), electron-ion interaction pseudopotentials of trinucleotide (PseEIIPs), and general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-general). The model reached an accuracy of 73.06% under 10-fold cross-validation and 80.15% on independent tests, the best predictive performance in A. thaliana among existing models. The proposed model is therefore a promising alternative for further research on m5C modification sites in plants.
Keywords: 5-methylcytosine; PC-PseDNC-general; electron-ion interaction pseudopotentials of trinucleotide; nucleotide composition; position-specific propensity; support vector machine
Year: 2020 PMID: 32645685 PMCID: PMC7340967 DOI: 10.1016/j.omtn.2020.06.004
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
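The nucleotide-composition features named in the abstract (NAC, DNC, and TNC) can be illustrated with a minimal sketch. It assumes each feature is the normalized frequency of every possible k-mer for k = 1, 2, 3 over the alphabet A, C, G, U; the function names and the example sequence are hypothetical and are not taken from the paper.

```python
from itertools import product

ALPHABET = "ACGU"  # assumption: RNA alphabet; T is mapped to U below


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k k-mers in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        sub = seq[i:i + k]
        if sub in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[sub] += 1
    total = max(windows, 1)  # avoid division by zero for very short input
    return [counts[m] / total for m in kmers]


def nac_dnc_tnc(seq):
    """Concatenate mono-, di-, and tri-nucleotide compositions (4 + 16 + 64 values)."""
    seq = seq.upper().replace("T", "U")
    return (kmer_composition(seq, 1)
            + kmer_composition(seq, 2)
            + kmer_composition(seq, 3))


# Hypothetical example sequence, for illustration only.
features = nac_dnc_tnc("AUGGCUACGCUAGCUAGCAUC")
print(len(features))
```

Under these assumptions the vector has 4 + 16 + 64 = 84 entries, which agrees with the feature count of 84 listed for "Kmer (NAC + DNC + TNC)" in the combined-features table.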
Table 1. Eight Proposed Methods to Identify m5C Sites in RNA Sequences
| Method | Species | Feature Extraction/Selection | Classifiers |
|---|---|---|---|
| m5C-PseDNC | | PseDNC (3 properties) | SVM |
| iRNAm5C-PseDNC | | PseDNC (10 properties) | RF |
| M5C-HPCR | | HPCR | SVM |
| pM5CS-Comp-mRMR | | Kmer (k = 2, 3, and 4) / mRMR | SVM |
| RNAm5Cfinder | | BE | RF |
| PEA-m5C | | BE + Kmer + PseDNC | RF |
| iRNA-m5C | | Kmer + BE + NV + PseKNC | RF |
| RNAm5CPred | | Kmer + KSNPF + PseDNC | SVM |
Figure 1. The Flowchart of the Proposed Predictor for m5C Identification by Combining Multiple Sequence Features
Figure 2. Differences of Position-Specific Nucleotide Frequencies between Positive and Negative Samples by
Enriched nucleotides correspond to the condition while depleted to .
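Figure 2 compares position-specific nucleotide frequencies between positive and negative samples, and the PSNP feature named in the abstract rests on the same idea. The sketch below assumes PSNP scores each position by the difference between the frequency of the observed nucleotide in positive and in negative training sequences. This excerpt does not spell out the definition, so treat the formula, the function names, and the toy sequences as illustrative assumptions.

```python
ALPHABET = "ACGU"  # assumption: sequences contain only these four bases


def position_frequencies(seqs):
    """Per-position nucleotide frequencies for equal-length sequences."""
    length = len(seqs[0])
    freqs = []
    for pos in range(length):
        column = [s[pos] for s in seqs]
        freqs.append({b: column.count(b) / len(seqs) for b in ALPHABET})
    return freqs


def psnp_matrix(pos_seqs, neg_seqs):
    """Difference of positive and negative frequencies at every position."""
    fp = position_frequencies(pos_seqs)
    fn = position_frequencies(neg_seqs)
    return [{b: fp[i][b] - fn[i][b] for b in ALPHABET} for i in range(len(fp))]


def encode(seq, matrix):
    """Encode a sequence as one propensity value per position."""
    return [matrix[i][base] for i, base in enumerate(seq)]


# Toy training sequences, invented for illustration only.
positives = ["ACGUC", "ACGCC", "GCGUC"]
negatives = ["UUGCA", "AUGCA", "UAGCU"]
matrix = psnp_matrix(positives, negatives)
print(encode("ACGUC", matrix))
```

Positive values mark nucleotides that are more frequent at that position in positive samples, and negative values mark those more frequent in negative samples. The same scheme would extend to dinucleotides (PSDP) and trinucleotides (PSTP) by sliding a window of 2 or 3 bases.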
Table 2. Performance of Frequency-Associated Feature-Extraction Techniques Using the RF Classifier, with 10-fold CV on the Training Datasets (Left) and Independent Tests on the Testing Datasets (Right)
| Feature Subset | Training Acc (%) | Training MCC | Training Sn (%) | Training Sp (%) | Testing Acc (%) | Testing MCC | Testing Sn (%) | Testing Sp (%) |
|---|---|---|---|---|---|---|---|---|
| PSNP | 65.48 | 0.31 | 57.78 | 73.19 | 65.05 | 0.32 | 49.60 | 80.50 |
| PSDP | 65.07 | 0.31 | 56.78 | 73.36 | 67.52 | 0.36 | 57.54 | 77.50 |
| PSTP | 67.29 | 0.35 | 61.30 | 73.28 | 74.98 | 0.51 | 65.87 | 84.10 |
| NAC | 64.96 | 0.30 | 61.32 | 68.60 | 68.75 | 0.38 | 69.70 | 67.80 |
| DNC | 68.74 | 0.38 | 64.17 | 73.30 | 72.60 | 0.45 | 70.40 | 74.80 |
| TNC | 69.26 | 0.39 | 61.92 | 76.59 | 72.55 | 0.45 | 68.90 | 76.20 |
| ENAC | 69.11 | 0.38 | 64.53 | 73.68 | 71.90 | 0.44 | 71.90 | 71.90 |
| mM1GAP | 68.11 | 0.36 | 62.94 | 73.28 | 71.45 | 0.43 | 69.50 | 73.40 |
| mM2GAP | 68.80 | 0.38 | 63.32 | 74.29 | 77.20 | 0.55 | 80.60 | 73.80 |
| mM3GAP | 69.09 | 0.38 | 63.75 | 74.42 | 73.50 | 0.47 | 71.40 | 75.60 |
| mD1GAP | 67.57 | 0.36 | 60.33 | 74.82 | 72.15 | 0.44 | 68.80 | 75.50 |
| mD2GAP | 68.33 | 0.37 | 60.92 | 75.74 | 72.10 | 0.44 | 68.00 | 76.20 |
| mD3GAP | 68.38 | 0.37 | 60.41 | 76.35 | 72.70 | 0.46 | 68.60 | 76.80 |
| dM1GAP | 68.05 | 0.37 | 60.52 | 75.57 | 72.95 | 0.46 | 69.00 | 76.90 |
| dM2GAP | 68.39 | 0.37 | 60.37 | 76.40 | 72.10 | 0.44 | 68.10 | 76.10 |
| dM3GAP | 68.43 | 0.37 | 60.35 | 76.52 | 73.10 | 0.46 | 68.10 | 78.10 |
Acc, accuracy; MCC, Matthews correlation coefficient; Sn, sensitivity; Sp, specificity.
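The four metrics reported in these tables follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts passed in are hypothetical and do not come from the paper's datasets.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Acc, MCC, Sn, and Sp from confusion-matrix counts.

    Assumes at least one positive and one negative sample,
    so that Sn and Sp are defined.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=80, fn=20, tn=70, fp=30)
print({name: round(value, 4) for name, value in m.items()})
```

When the positive and negative classes have equal size, Acc reduces to (Sn + Sp) / 2. The PSNP testing row is consistent with this: (49.60 + 80.50) / 2 = 65.05, which suggests, although this excerpt does not state it, that the testing dataset is balanced.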
Table 3. Same as Table 2 but for the Other Five Feature-Representing Methods
| Feature Subset | Training Acc (%) | Training MCC | Training Sn (%) | Training Sp (%) | Testing Acc (%) | Testing MCC | Testing Sn (%) | Testing Sp (%) |
|---|---|---|---|---|---|---|---|---|
| EIIP | 66.65 | 0.34 | 59.27 | 74.02 | 70.85 | 0.42 | 68.40 | 73.30 |
| PseEIIP | 69.24 | 0.39 | 62.03 | 76.44 | 72.60 | 0.45 | 68.80 | 76.40 |
| PC-PseDNC | 68.63 | 0.37 | 63.47 | 73.79 | 72.65 | 0.45 | 70.00 | 75.30 |
| BE | 64.37 | 0.29 | 57.48 | 71.26 | 66.55 | 0.33 | 63.60 | 69.50 |
| NCP + ND | 66.67 | 0.34 | 60.92 | 72.41 | 70.25 | 0.41 | 69.30 | 71.20 |
Acc, accuracy.
Table 4. Performance of Combined Features over 10-fold CV on the Training Datasets and Independent Tests on the Testing Datasets
| Feature Combination | Fea_num | Training Acc (%) | Training MCC | Training Sn (%) | Training Sp (%) | Testing Acc (%) | Testing MCC | Testing Sn (%) | Testing Sp (%) |
|---|---|---|---|---|---|---|---|---|---|
| PSP (PSNP + PSDP + PSTP) | 120 | 67.39 | 0.35 | 60.88 | 73.89 | 73.30 | 0.48 | 63.30 | 83.30 |
| Kmer (NAC + DNC + TNC) | 84 | 69.13 | 0.39 | 63.41 | 74.85 | 73.85 | 0.48 | 71.80 | 75.90 |
| PSP + Kmer | 204 | 71.47 | 0.43 | 67.01 | 75.93 | 77.60 | 0.56 | 71.60 | 83.60 |
| PSP + Kmer + ENAC | 352 | 71.27 | 0.43 | 66.50 | 76.04 | 76.80 | 0.54 | 72.20 | 81.40 |
| PSP + Kmer + ENAC + MM2Gap | 384 | 71.72 | 0.44 | 67.86 | 75.59 | 78.15 | 0.56 | 74.10 | 82.20 |
| PseEIIP + PseDNC | 83 | 69.38 | 0.39 | 63.26 | 75.50 | 72.45 | 0.45 | 70.10 | 74.80 |
| PSP + Kmer + PseEIIP + PseDNC | 287 | 71.77 | 0.44 | 67.56 | 75.99 | 78.30 | 0.57 | 73.90 | 82.70 |
| PSP + Kmer + PseEIIP + PseDNC + MM2Gap | 319 | 71.73 | 0.44 | 67.86 | 75.60 | 78.18 | 0.57 | 74.10 | 82.25 |
| PSP + Kmer + PseEIIP + PC-PseDNC + ENAC | 435 | 72.06 | 0.44 | 68.05 | 76.08 | 76.75 | 0.54 | 73.40 | 80.10 |
| PSP + Kmer + PseEIIP + PC-PseDNC + ENAC + MM2Gap | 476 | 71.74 | 0.44 | 67.44 | 76.04 | 77.00 | 0.54 | 74.50 | 79.48 |
| All | 1,571 | 71.93 | 0.44 | 68.18 | 75.69 | 75.71 | 0.51 | 74.50 | 76.92 |
Acc, accuracy.
The “Fea_num” column indicates the number of combined features.
Performances with maximum accuracies.
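The training-dataset columns above come from 10-fold cross-validation. A minimal sketch of how such folds can be formed follows. It uses class-stratified folds, which is an assumption: this excerpt does not say whether the authors stratified their splits, and the function name and toy labels are hypothetical.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for k class-stratified folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy labels: 50 positive and 50 negative samples, for illustration only.
labels = [1] * 50 + [0] * 50
splits = list(stratified_kfold(labels, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample is used for testing exactly once. A model is trained on the other nine folds each time, and metrics such as Acc and MCC are then computed from the pooled or averaged predictions.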
Table 5. Comparison of Different Classifiers Using the Feature Combination “PSP + Kmer + PseEIIP + PC-PseDNC”
| Classifier | Training Acc (%) | Training MCC | Training Sn (%) | Training Sp (%) | Training AUC | Testing Acc (%) | Testing MCC | Testing Sn (%) | Testing Sp (%) | Testing AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 71.77 | 0.44 | 75.99 | 67.56 | 0.79 | 78.30 | 0.57 | 73.90 | 82.70 | 0.85 |
| SVM | 72.72 | 0.46 | 65.46 | 79.98 | 0.80 | 79.90 | 0.60 | 79.40 | 80.40 | 0.88 |
| AdaBoost | 71.19 | 0.42 | 68.33 | 74.04 | 0.78 | 80.45 | 0.61 | 77.10 | 83.80 | 0.88 |
| NB | 66.60 | 0.34 | 55.08 | 78.12 | 0.71 | 69.82 | 0.40 | 73.00 | 66.63 | 0.77 |
Acc, accuracy; AUC, area under the ROC curve.
Performances with maximum accuracies using the SVM algorithm.
Table 6. Comparison of the Constructed Model with Two Published Methods
| Method | Training Acc (%) | Training MCC | Training Sn (%) | Training Sp (%) | Training AUC | Testing Acc (%) | Testing MCC | Testing Sn (%) | Testing Sp (%) | Testing AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| PEA-m5C | 44.30 | −0.11 | 43.20 | 45.40 | ||||||
| iRNA-m5C | 70.70 | 0.42 | 65.70 | 75.70 | 0.77 | 74.00 | 0.48 | 72.40 | 75.60 | |
| This work | 73.06 | 0.47 | 66.42 | 79.70 | 0.80 | 80.15 | 0.60 | 79.40 | 80.90 | 0.88 |
Acc, accuracy.
Results of the PEA-m5C tool were excerpted from Lv et al. (i.e., obtained using independent data objectively).
Figure 3. Evaluated Performances
Left: ROC curves for best performing feature combinations based on the SVM method. Right: comparison of our results (green) and the iRNA-m5C predictor (orange).