Lijun Dou, Xiaoling Li, Hui Ding, Lei Xu, Huaikun Xiang.
Abstract
Pseudouridine (Ψ) is the most abundant RNA modification and has been found in many kinds of RNAs, including snRNA, rRNA, tRNA, mRNA, and snoRNA. Thus, Ψ sites play a significant role in basic research and drug development. Although some experimental techniques have been developed to identify Ψ sites, they are expensive and time-consuming, especially in the post-genomic era with the explosive growth of known RNA sequences. Thus, highly accurate computational methods are urgently required to quickly detect Ψ sites on uncharacterized RNA sequences. Several predictors have been proposed using a variety of features, but their evaluated performances are still unsatisfactory. In this study, we first identified Ψ sites for H. sapiens, S. cerevisiae, and M. musculus using sequence features from the bi-profile Bayes (BPB) method with the random forest (RF) and support vector machine (SVM) algorithms, and evaluated the performances using 5-fold cross-validation and independent tests. The SVM-based accuracies were 3.55% and 5.09% lower than those of the iPseU-CNN predictor for the H_990 and S_628 datasets, respectively. Comparable results were obtained for the M_944 dataset and the independent H_200 dataset, with a 5.0% improvement for S_200. Then, three different kinds of features, namely basic Kmer, general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-General), and the nucleotide chemical property (NCP) and nucleotide density (ND) features from the iRNA-PseU method, were combined with BPB to assess their joint performance, with the effective features selected by the max-relevance-max-distance (MRMD) method. The best evaluated accuracies of the combined features for the S_628 and M_944 datasets were 70.54% and 72.46%, which were 2.39% and 0.65% higher than those of iPseU-CNN. For the S_200 dataset, the accuracy also improved by 8%, from 69% to 77%. However, there was no obvious improvement for H. sapiens, for which the accuracies were approximately 63.23% and 72.0% for the H_990 and H_200 datasets, respectively. Overall, neither the BPB features nor the combined features obviously improved Ψ identification. Although several feature extraction methods based on RNA sequence information have been applied to construct predictors in previous studies, the corresponding accuracies are generally in the range of 60%-70%. Thus, researchers need to reconsider whether the RNA sequence carries any feature that is informative for the Ψ modification prediction problem.
Keywords: bi-profile Bayes; max-relevance-max-distance method; pseudouridine site; random forest; support vector machine
Year: 2019 PMID: 31865116 PMCID: PMC6931122 DOI: 10.1016/j.omtn.2019.11.014
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
Results of the Proposed iRNA-PseU, PseUI, iPseU-CNN, and XG-PseU Predictors for Training Datasets H_990, S_628, and M_944 and Testing Datasets H_200 and S_200
| Predictors | Training Datasets | Acc (%) | MCC | Sn (%) | Sp (%) | Testing Datasets | Acc (%) | MCC | Sn (%) | Sp (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| iRNA-PseU | H_990 | 60.4 | 0.21 | 61.01 | 59.8 | H_200 | 65.00 | 0.30 | 60.00 | 70.00 |
| PseUI | H_990 | 64.24 | 0.28 | 64.85 | 63.64 | H_200 | 65.50 | 0.31 | 63.00 | 68.00 |
| iPseU-CNN | H_990 | 66.68 | 0.34 | 65.00 | 68.78 | H_200 | 69.00 | 0.40 | 77.72 | 60.81 |
| XG-PseU | H_990 | 65.44 | 0.31 | 63.64 | 67.24 | H_200 | 67.00 | 0.34 | 67.00 | 67.00 |
| iRNA-PseU | S_628 | 64.49 | 0.29 | 64.65 | 64.33 | S_200 | 73.00 | 0.46 | 81.00 | 65.00 |
| PseUI | S_628 | 66.56 | 0.33 | 62.1 | 71.02 | S_200 | 68.50 | 0.37 | 72.00 | 65.00 |
| iPseU-CNN | S_628 | 68.15 | 0.37 | 66.36 | 70.45 | S_200 | 73.50 | 0.47 | 68.76 | 77.82 |
| XG-PseU | S_628 | 68.15 | 0.37 | 66.84 | 69.45 | S_200 | 71.00 | 0.42 | 75.00 | 67.00 |
| iRNA-PseU | M_944 | 69.07 | 0.38 | 73.31 | 64.83 | | | | | |
| PseUI | M_944 | 70.44 | 0.41 | 74.58 | 66.31 | | | | | |
| iPseU-CNN | M_944 | 71.81 | 0.44 | 74.49 | 69.11 | | | | | |
| XG-PseU | M_944 | 72.03 | 0.45 | 76.48 | 67.57 | | | | | |
iRNA-PseU: the predictor developed by Chen et al.
PseUI: the predictor proposed by He et al.
iPseU-CNN: the predictor constructed by Tahir et al.
XG-PseU: the predictor constructed by Liu et al.
Comparison of Our Results based on the RF and SVM Methods Using the BPB Features with the iPseU-CNN Predictor
| Predictors | Training Datasets | Acc (%) | MCC | Sn (%) | Sp (%) | Testing Datasets | Acc (%) | MCC | Sn (%) | Sp (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| iPseU-CNN | H_990 | 66.68 | 0.34 | 65.00 | 68.78 | H_200 | 69.00 | 0.40 | 77.72 | 60.81 |
| RF | H_990 | 58.28 | 0.17 | 60.00 | 56.57 | H_200 | 59.00 | 0.18 | 61.00 | 57.00 |
| SVM | H_990 | 63.13 | 0.26 | 64.04 | 62.22 | H_200 | 74.00 | 0.48 | 78.00 | 70.00 |
| iPseU-CNN | S_628 | 68.15 | 0.37 | 66.36 | 70.45 | S_200 | 73.50 | 0.47 | 68.76 | 77.82 |
| RF | S_628 | 62.58 | 0.25 | 63.69 | 61.46 | S_200 | 74.00 | 0.48 | 70.00 | 78.00 |
| SVM | S_628 | 63.06 | 0.27 | 52.87 | 73.25 | S_200 | 73.00 | 0.49 | 60.00 | 86.00 |
| iPseU-CNN | M_944 | 71.81 | 0.44 | 74.49 | 69.11 | | | | | |
| RF | M_944 | 67.27 | 0.35 | 69.28 | 65.25 | | | | | |
| SVM | M_944 | 71.40 | 0.43 | 75.00 | 67.80 | | | | | |
iPseU-CNN: the predictor proposed by Tahir et al.
RF: the RF-based predictor using BPB features.
SVM: the SVM-based predictor using BPB features.
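As a rough illustration of the BPB features used by the RF and SVM predictors, the sketch below encodes a sequence by the per-position frequency of each of its nucleotides in the positive training set, followed by the same frequencies in the negative training set. This follows the general bi-profile Bayes idea and is not the authors' implementation; the function names and the toy 5-nt sequences are hypothetical.

```python
from collections import Counter

def position_profile(seqs, alphabet="ACGU"):
    """Per-position nucleotide frequencies over equal-length sequences."""
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        profile.append({nt: counts[nt] / len(seqs) for nt in alphabet})
    return profile

def bpb_encode(seq, pos_profile, neg_profile):
    """Concatenate positive-profile and negative-profile frequencies (length 2L)."""
    pos_part = [pos_profile[i][nt] for i, nt in enumerate(seq)]
    neg_part = [neg_profile[i][nt] for i, nt in enumerate(seq)]
    return pos_part + neg_part

# Toy training sets (hypothetical); the candidate U is the central base.
pos_train = ["AUUGC", "AUUCC", "GUUGA", "AAUGC"]
neg_train = ["CGUAU", "CCUAG", "GGUAU", "CGUUA"]

pos_profile = position_profile(pos_train)
neg_profile = position_profile(neg_train)
vec = bpb_encode("AUUGC", pos_profile, neg_profile)
print(vec)
```

In a cross-validation setting, the two profiles would normally be estimated from the training folds only, so that held-out sequences do not leak into their own features.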
Figure 2. This Histogram Shows the Results of the iPseU-CNN Predictor and the Constructed Model Based on the RF and SVM Classifiers Using the BPB Features
Results of Feature Selection for the H_990 Dataset Using the RF and SVM Methods
| Feature Subset | RF Acc (%) | RF MCC | RF Sn (%) | RF Sp (%) | SVM Acc (%) | SVM MCC | SVM Sn (%) | SVM Sp (%) |
|---|---|---|---|---|---|---|---|---|
| BPB | 58.28 | 0.17 | 60.00 | 56.57 | 63.13 | 0.26 | 64.04 | 62.22 |
| Kmer(2) | 55.76 | 0.12 | 53.13 | 58.38 | 60.00 | 0.23 | 41.82 | 78.18 |
| Kmer(3) | 58.79 | 0.18 | 58.59 | 58.99 | 59.70 | 0.20 | 53.94 | 65.45 |
| Kmer(4) | 58.59 | 0.17 | 59.39 | 57.78 | 57.27 | 0.15 | 56.57 | 57.98 |
| PC-PseDNC-General (6, 0.99) | 58.59 | 0.17 | 56.57 | 60.61 | 57.78 | 0.16 | 49.49 | 66.06 |
| NCP+ND | 56.87 | 0.14 | 57.37 | 56.36 | 60.34 | 0.21 | 60.40 | 60.28 |
| BPB+Kmer(3) | 60.40 | 0.21 | 60.61 | 60.20 | 63.23 | 0.27 | 61.01 | 65.45 |
| BPB+PC-PseDNC-General (6, 0.99) | 61.72 | 0.23 | 59.39 | 64.04 | 62.93 | 0.26 | 61.62 | 64.24 |
| BPB+NCP+ND | 61.11 | 0.22 | 62.83 | 59.39 | 61.11 | 0.22 | 58.79 | 63.43 |
| BPB+PC-PseDNC-General (6, 0.99) + Kmer(3) | 61.01 | 0.22 | 59.39 | 62.63 | 62.73 | 0.25 | 61.82 | 63.64 |
Performance with maximum accuracy.
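The Kmer(k) feature subsets in the table above are, in the usual formulation, normalized occurrence frequencies of all 4^k possible k-mers in a sequence. The following minimal sketch assumes that formulation; the function name `kmer_composition` and the toy sequence are hypothetical, and the exact normalization used in the study is not stated on this page.

```python
from itertools import product

def kmer_composition(seq, k, alphabet="ACGU"):
    """Return the normalized frequency of every possible k-mer (length 4**k)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        sub = seq[i:i + k]
        if sub in counts:  # skip windows with non-ACGU characters
            counts[sub] += 1
    return [counts[km] / windows for km in kmers]

# Toy example: "AUUGC" contains the 2-mers AU, UU, UG, and GC once each.
kvec = kmer_composition("AUUGC", 2)
print(len(kvec), kvec)
```

The vector grows as 4^k (16, 64, and 256 dimensions for k = 2, 3, and 4), which is one reason a feature selection step such as MRMD is applied to the combined features.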
Results of Feature Selection for the S_628 Dataset Using the RF and SVM Methods
| Feature Subset | RF Acc (%) | RF MCC | RF Sn (%) | RF Sp (%) | SVM Acc (%) | SVM MCC | SVM Sn (%) | SVM Sp (%) |
|---|---|---|---|---|---|---|---|---|
| BPB | 62.58 | 0.25 | 63.69 | 61.46 | 63.06 | 0.27 | 52.87 | 73.25 |
| Kmer (k = 2) | 58.12 | 0.16 | 58.28 | 57.96 | 61.78 | 0.24 | 64.33 | 59.24 |
| Kmer (k = 3) | 60.35 | 0.21 | 62.10 | 58.60 | 61.78 | 0.24 | 66.56 | 57.01 |
| Kmer (k = 4) | 59.71 | 0.19 | 62.74 | 56.69 | 64.97 | 0.30 | 67.52 | 62.42 |
| PC-PseDNC-General (2, 0.11) | 58.76 | 0.18 | 61.78 | 55.73 | 61.15 | 0.22 | 64.01 | 58.28 |
| NCP+ND | 60.83 | 0.22 | 62.74 | 58.92 | 60.99 | 0.22 | 57.01 | 64.97 |
| BPB+Kmer (k = 4) | 64.01 | 0.28 | 64.33 | 63.69 | 68.15 | 0.36 | 66.56 | 69.75 |
| BPB+PC-PseDNC-General (2, 0.11) | 62.90 | 0.26 | 63.38 | 62.42 | 66.08 | 0.33 | 57.64 | 74.52 |
| BPB+NCP+ND | 62.74 | 0.26 | 65.61 | 59.87 | 61.78 | 0.24 | 56.37 | 67.20 |
| BPB+PC-PseDNC-General (2, 0.11) + Kmer (k = 4) | 64.49 | 0.29 | 65.92 | 63.06 | 70.54 | 0.41 | 69.43 | 71.66 |
Performance with maximum accuracy.
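The NCP+ND rows combine a three-bit chemical-property code per nucleotide with a cumulative density value. The sketch below uses the encoding commonly associated with the iRNA-PseU method (ring structure, functional group, hydrogen-bond strength), which is an assumption about what was used here; the function name `ncp_nd_encode` and the toy sequence are hypothetical.

```python
# Assumed NCP code: (purine?, amino group?, weak hydrogen bonding?)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}

def ncp_nd_encode(seq):
    """Encode each position as its 3 NCP bits plus its nucleotide density."""
    seen = {}
    features = []
    for i, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        features.extend(NCP[nt])
        # Density: occurrences of this nucleotide in the prefix of length i,
        # divided by i.
        features.append(seen[nt] / i)
    return features

feats = ncp_nd_encode("AUUGC")
print(feats)
```

Each position contributes four values, so a sequence of length L yields a 4L-dimensional vector.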
Results of Feature Selection for the M_944 Dataset Using the RF and SVM Methods
| Feature Subset | RF Acc (%) | RF MCC | RF Sn (%) | RF Sp (%) | SVM Acc (%) | SVM MCC | SVM Sn (%) | SVM Sp (%) |
|---|---|---|---|---|---|---|---|---|
| BPB | 68.54 | 0.37 | 69.28 | 67.80 | 71.40 | 0.43 | 75.00 | 67.80 |
| Kmer(2) | 52.22 | 0.04 | 54.45 | 50.00 | 56.78 | 0.14 | 61.65 | 51.91 |
| Kmer(3) | 55.51 | 0.11 | 57.42 | 53.60 | 59.22 | 0.18 | 60.81 | 57.63 |
| Kmer(4) | 56.04 | 0.12 | 58.05 | 54.03 | 58.37 | 0.17 | 59.96 | 56.78 |
| PC-PseDNC-General (2, 0.1) | 53.07 | 0.06 | 56.14 | 50.00 | 57.84 | 0.16 | 64.41 | 51.27 |
| NCP+ND | 67.58 | 0.35 | 70.34 | 64.83 | 68.01 | 0.36 | 69.49 | 66.53 |
| BPB+Kmer(3) | 67.37 | 0.35 | 71.61 | 63.14 | 72.46 | 0.45 | 75.85 | 69.07 |
| BPB+PC-PseDNC-General (2, 0.1) | 67.58 | 0.35 | 70.97 | 64.19 | 71.40 | 0.43 | 73.52 | 69.28 |
| BPB+NCP+ND | 68.43 | 0.37 | 71.82 | 65.04 | 68.11 | 0.36 | 69.70 | 66.53 |
| BPB+PC-PseDNC-General (2, 0.11) + Kmer(3) | 68.33 | 0.37 | 72.67 | 63.98 | 71.72 | 0.44 | 75.00 | 68.43 |
Performance with maximum accuracy.
Figure 3. Comparisons of the Evaluated Performance of Predictors iPseU-CNN, XG-PseU, and the Constructed Model Using the Combined Features in This Work
Results of Feature Selection for H_990 and H_200 Datasets Using Several Kinds of Features from iLearn and BioSeq-Analysis 2.0
| Feature Subset | H_990 Acc (%) | H_990 MCC | H_990 Sn (%) | H_990 Sp (%) | H_200 Acc (%) | H_200 MCC | H_200 Sn (%) | H_200 Sp (%) |
|---|---|---|---|---|---|---|---|---|
| BE | 60.10 | 0.20 | 58.79 | 61.41 | 66.50 | 0.33 | 64.00 | 69.00 |
| Mismatch (3) | 60.81 | 0.22 | 57.37 | 64.24 | 59.50 | 0.19 | 58.00 | 61.00 |
| EIIP | 57.37 | 0.15 | 54.55 | 60.20 | 58.00 | 0.16 | 56.00 | 60.00 |
| PseEIIP | 58.99 | 0.18 | 54.75 | 63.23 | 58.00 | 0.16 | 55.00 | 61.00 |
| BPB+Kmer(3)+EIIP | 63.33 | 0.27 | 62.63 | 64.04 | 75.00 | 0.51 | 81.00 | 69.00 |
| BPB+Kmer(3)+PseEIIP | 63.13 | 0.26 | 61.01 | 65.25 | 70.50 | 0.43 | 82.00 | 59.00 |
| BPB+Kmer(3)+BE | 60.91 | 0.22 | 58.99 | 62.83 | 68.00 | 0.36 | 69.00 | 67.00 |
| BPB+Kmer(3)+mismatch(3) | 61.11 | 0.22 | 56.77 | 65.45 | 60.20 | 0.20 | 61.00 | 59.41 |
| BPB+Kmer(3)+EIIP+mismatch(3) | 61.21 | 0.23 | 56.97 | 65.45 | 60.20 | 0.20 | 61.00 | 59.41 |
All values in this row indicate performance with maximum accuracy.
Figure 1. Flowchart of Constructed Predictors for Ψ Identification Using the BPB Features