Abstract
5-Hydroxymethylcytosine (5hmC), one of the most important RNA modifications, plays an important role in many biological processes. Accurately identifying RNA modification sites helps in understanding the function of RNA modification. In this work, we propose a computational method for identifying 5hmC-modified regions using machine learning algorithms. We applied a sequence feature embedding method based on the dna2vec algorithm to represent RNA sequences. The results showed that our model performs better than state-of-the-art methods. All datasets and source code used in this study are available at: https://github.com/liu-h-y/5hmC_model.
Keywords: 5-hydroxymethylcytosine; cross-validation; dna2vec; i5hmcVec; machine learning
Year: 2022 PMID: 35591855 PMCID: PMC9110757 DOI: 10.3389/fgene.2022.896925
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Sequence length distribution. The X-axis represents the length of sequences. The Y-axis represents the density of the distribution. (A) Histogram density for the distribution of the length of positive sequences. (B) Histogram density for the distribution of the length of negative sequences.
FIGURE 2. Flowchart of this study. Step 1: RNA sequences are segmented into overlapping k-mers, where k = 3, 4, 5, 6, 7, 8. Step 2: k-mer embeddings are trained by the dna2vec model on a corpus from dm3. Step 3: The k-mer embeddings are summed and concatenated to encode RNA sequences. Step 4: SVM is used as the classifier to distinguish positive from negative samples.
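Steps 1 and 3 of the flowchart can be illustrated with a minimal sketch. It assumes that the embeddings of all k-mers of one size are summed into a single vector and that the per-k sums are then concatenated, which is one reading of the caption. The function names, the embedding dimension, and `toy_embedding` (a pseudo-random stand-in for a trained dna2vec lookup table) are hypothetical and are not taken from the authors' code.

```python
import random


def kmers(seq, k):
    """Return the overlapping k-mers of seq (window slides by one base)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_embedding(kmer, dim):
    """Hypothetical stand-in for a trained dna2vec lookup.

    Returns a deterministic pseudo-random vector per k-mer so the sketch
    runs without a trained model.
    """
    rng = random.Random(kmer)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode(seq, ks=(3, 4, 5, 6, 7, 8), dim=4):
    """Sum the k-mer embeddings for each k, then concatenate the sums."""
    features = []
    for k in ks:
        total = [0.0] * dim
        for kmer in kmers(seq, k):
            vec = toy_embedding(kmer, dim)
            total = [t + v for t, v in zip(total, vec)]
        features.extend(total)
    return features


seq = "ACGTACGTAC"
print(kmers(seq, 3))
print(len(encode(seq)))  # 6 values of k times dim=4 -> 24 features
```

Because the sum runs over every k-mer in a sequence, the feature vector has the same length regardless of sequence length, which is what lets a fixed-input classifier such as an SVM handle windows of very different sizes.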
FIGURE 3. Performance of different kinds of features on SVM, CNN, and C4.5. Cyan, orange, gray, yellow, blue, and green represent the performance of 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, and 8-mer embedding features, respectively. Purple, pink, and red represent the performance of the 4, 5, 6-mer concatenated embeddings, the 6, 7, 8-mer concatenated embeddings, and the 3, 4, 5, 6, 7, 8-mer concatenated embeddings, respectively. (A,B) Performance of different kinds of features on SVM. The standard deviation of SVM on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.004), (0.001, 0.004), (0.001, 0.003), and (0.001, 0.003), respectively. (C,D) Performance of different kinds of features on CNN. The standard deviation of CNN on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.003, 0.026), (0.008, 0.071), (0.006, 0.049), (0.015, 0.056), (0.016, 0.055), (0.016, 0.059), (0.008, 0.044), (0.011, 0.058), and (0.008, 0.045), respectively. (E,F) Performance of different kinds of features on C4.5. The standard deviation of C4.5 on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.005, 0.545), (0.007, 0.531), (0.005, 0.049), (0.007, 0.685), (0.003, 0.440), (0.007, 0.489), (0.008, 0.630), (0.006, 0.567), and (0.005, 0.518), respectively.
FIGURE 4. Visualization of k-mer embeddings with PCA. Each dot represents a k-mer embedding vector. (A) 3-mer embedding. (B) 4-mer embedding.
FIGURE 5. Visualization of sequence features. The red dots represent the positive samples. The blue dots represent the negative samples. (A) Visualization of 2-dimensional t-SNE. (B) Visualization of 3-dimensional t-SNE.
Dataset distributions of i5hmcVec and WeakRM.
| Method | Positive | Negative | Window size |
|---|---|---|---|
| i5hmCVec | 2616 | 2616 | 209 nt∼8097 nt |
| WeakRM (training) | 1875 | 1875 | 210 nt∼8090 nt |
| WeakRM (validation) | 235 | 235 | 210 nt∼8090 nt |
| WeakRM (testing) | 234 | 234 | 210 nt∼8090 nt |
Positive samples are sequences that contain 5hmC sites.
Negative samples are sequences that do not contain 5hmC sites.
Performance of i5hmcVec and WeakRM on the dataset from WeakRM.
| Method | Acc | Sen | Spe | AUROC | AUPR | MCC |
|---|---|---|---|---|---|---|
| WeakRM | 0.790 | 0.617 | | 0.892 | 0.905 | 0.619 |
| i5hmCVec | | | 0.855 | | | |
Acc is short for accuracy.
Sen is short for sensitivity.
Spe is short for specificity.
AUROC means the area under the ROC curve.
AUPR means the area under the PR curve.
MCC is short for Matthews correlation coefficient.
Boldface indicates the best performance on each metric among methods.
FIGURE 6. ROC and PR curves of i5hmcVec and WeakRM on the dataset from WeakRM. (A) ROC curve. The X-axis is the false positive rate, and the Y-axis is the true positive rate. (B) PR curve. The X-axis is the recall, and the Y-axis is the precision.
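As a small illustration of how a ROC curve and its area (AUROC) are obtained from classifier scores, the sketch below sweeps a decision threshold from high to low, records (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The function names, labels, and scores are toy assumptions, not data from the study.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy labels and scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))  # 8 of 9 pos/neg pairs ranked correctly -> 0.8889
```

The same area equals the fraction of positive/negative pairs in which the positive sample receives the higher score, which is why AUROC does not depend on any single decision threshold.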
Performance of i5hmcVec and iRNA5hmC on the benchmark dataset from iRNA5hmC.
| Method | Acc | Sen | Spe | AUROC | AUPR | MCC |
|---|---|---|---|---|---|---|
| iRNA5hmC | | | 0.644 | | | |
| i5hmcVec | 0.642 ± 0.008 | 0.636 ± 0.010 | | 0.684 ± 0.007 | 0.676 ± 0.007 | 0.284 ± 0.016 |
Acc is short for accuracy.
Sen is short for sensitivity.
Spe is short for specificity.
AUROC means the area under the ROC curve.
AUPR means the area under the PR curve.
MCC is short for Matthews correlation coefficient.
Boldface indicates the best performance on each metric among different methods.
Performance of i5hmcVec on the benchmark dataset from iRNA5hmC was measured with 10 repetitions of 5-fold cross-validation. Results are expressed as the mean and standard deviation over the 10 repetitions.
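The repeated cross-validation protocol can be sketched as follows. This is a minimal illustration and rests on assumptions: each repetition reshuffles the data, the fold scores within a repetition are averaged, and the mean and standard deviation are taken over the repetition averages. The helper names, the toy 1-D data, and the threshold classifier inside `evaluate` are hypothetical stand-ins for the real features and SVM.

```python
import random
import statistics


def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(n, evaluate, repeats=10, k=5, seed=0):
    """Run `repeats` rounds of k-fold CV; return (mean, sd) over rounds."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(repeats):
        folds = kfold_indices(n, k, rng)
        fold_scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            fold_scores.append(evaluate(train_idx, test_idx))
        run_scores.append(statistics.mean(fold_scores))
    return statistics.mean(run_scores), statistics.stdev(run_scores)


# Toy 1-D data: two overlapping Gaussian classes (illustration only).
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(50)] + [data_rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50


def evaluate(train_idx, test_idx):
    """Hypothetical classifier: threshold at the midpoint of the class means."""
    mean0 = statistics.mean(xs[i] for i in train_idx if ys[i] == 0)
    mean1 = statistics.mean(xs[i] for i in train_idx if ys[i] == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((xs[i] > threshold) == bool(ys[i]) for i in test_idx)
    return correct / len(test_idx)


mean_acc, sd_acc = repeated_cv(len(xs), evaluate)
print(f"Acc = {mean_acc:.3f} ± {sd_acc:.3f}")
```

Reporting the spread over repetitions, rather than a single split, shows how much of a difference between methods could be due to the random partitioning alone.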
FIGURE 7. Performance of i5hmCVec on negative datasets of different sequence lengths in an independent test. The X-axis is the length of the sequences in the negative dataset. The Y-axis is the specificity (Spe) of i5hmCVec.