Is There Any Sequence Feature in the RNA Pseudouridine Modification Prediction Problem?

Lijun Dou¹, Xiaoling Li², Hui Ding³, Lei Xu⁴, Huaikun Xiang⁵.

Abstract

Pseudouridine (Ψ) is the most abundant RNA modification and has been found in many kinds of RNAs, including snRNA, rRNA, tRNA, mRNA, and snoRNA. Thus, Ψ sites play a significant role in basic research and drug development. Although some experimental techniques have been developed to identify Ψ sites, they are expensive and time consuming, especially in the post-genomic era with the explosive growth of known RNA sequences. Thus, highly accurate computational methods are urgently required to quickly detect Ψ sites on uncharacterized RNA sequences. Several predictors have been proposed using a variety of features, but their evaluated performances are still unsatisfactory. In this study, we first identified Ψ sites for H. sapiens, S. cerevisiae, and M. musculus using sequence features from the bi-profile Bayes (BPB) method with the random forest (RF) and support vector machine (SVM) algorithms, where the performances were evaluated using 5-fold cross-validation and independent tests. The SVM-based accuracies were 3.55% and 5.09% lower than those of the iPseU-CUU predictor for the H_990 and S_628 datasets, respectively. Almost the same-level results were obtained for M_944 and the independent H_200 dataset, with a 5.0% improvement for S_200. Then, three different kinds of features, including basic Kmer, general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-General), and nucleotide chemical property (NCP) with nucleotide density (ND) from the iRNA-PseU method, were combined with BPB to assess their combined performance, where the effective features were selected by the max-relevance-max-distance (MRMD) method. The best evaluated accuracies of the combined features for the S_628 and M_944 datasets were 70.54% and 72.46%, which were 2.39% and 0.65% higher than those of iPseU-CUU. For the S_200 dataset, the accuracy also improved by 8%, from 69% to 77%. However, there was no obvious improvement for H. sapiens, for which the accuracies were approximately 63.23% and 72.0% for the H_990 and H_200 datasets, respectively. Overall, Ψ identification using BPB features alone or the combined features was not obviously improved. Although several feature extraction methods based on RNA sequence information have been applied to construct predictors in previous studies, the corresponding accuracies are generally in the range of 60%-70%. Thus, researchers need to reconsider whether there is any sequence feature in the RNA Ψ modification prediction problem.

Keywords: bi-profile Bayes; max-relevance-max-distance method; pseudouridine site; random forest; support vector machine

Year: 2019 PMID: 31865116 PMCID: PMC6931122 DOI: 10.1016/j.omtn.2019.11.014

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

Pseudouridine (Ψ) is the most prevalent post-transcriptional modification, and it has been widely found in a series of biological and cellular processes. Recent studies have demonstrated that Ψ sites exist in many kinds of RNAs, such as small nuclear RNA (snRNA), rRNA, tRNA, mRNA, and small nucleolar RNA (snoRNA).3, 4, 5, 6, 7, 8, 9, 10, 11 Thus, the Ψ site plays a crucial role in biological research and drug development. More specifically, Ψ is an isomer of uridine produced by Ψ synthase (PUS), which detaches the uridine residue's base from its sugar, “rotates” it 180° along the N3-C6 axis, and reattaches the base's 5-carbon to the 1'-carbon of the sugar. Although several experimental methods based on high-throughput techniques have been developed to recognize Ψ modifications, they are both costly and time consuming.13, 14, 15, 16, 17 In addition, researchers are facing an explosive increase of RNA data in the post-genomic age.18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 Therefore, intelligent computational approaches are highly desirable to predict Ψ sites on RNA sequences.

To the best of our knowledge, six predictors have been reported to identify Ψ sites. Specifically, Panwar and Raghava first proposed the tRNAmod model to predict Ψ sites in tRNA. Li et al. then developed the PPUS method based on the support vector machine (SVM) to identify PUS-specific Ψ sites. Later, Chen et al. provided the iRNA-PseU predictor, and He et al. introduced the PseUI predictor, which are both based on the SVM classifier. In addition, Tahir et al. built the iPseU-CUU model based on the convolutional neural network (CNN). Most recently, Liu et al. proposed an eXtreme Gradient Boosting (xgboost)-based method (XG-PseU).

It should be noted that the same datasets, built by Chen et al., were used in three of these studies (iRNA-PseU, PseUI, and iPseU-CUU) to build the predictors, including the benchmark training datasets (H_990, S_628, and M_944) and the independent testing datasets (H_200 and S_200). Here, H, S, and M represent the RNA samples for H. sapiens, S. cerevisiae, and M. musculus, while 990, 628, 944, and 200 indicate the corresponding sample numbers in each dataset. We therefore use the same datasets in this article for convenient comparison. The performances of the four predictors (iRNA-PseU, PseUI, iPseU-CUU, and XG-PseU) are listed in Table 1, where the XG-PseU results for the independent datasets were obtained from the web server at http://www.bioml.cn. The jackknife test, 5-fold cross-validation, and 10-fold cross-validation are used for the iRNA-PseU, PseUI/iPseU-CUU, and XG-PseU models, respectively. Their overall performance has improved gradually. Taking H_990 as an example, the accuracy has improved by 6.28%, from 60.40% (iRNA-PseU) to 64.24% (PseUI) and to 66.68% (iPseU-CUU). However, these predictive accuracies are still unsatisfactory.

Table 1

Results of the Proposed iRNA-PseU, PseUI, iPseU-CUU, and XG-PseU Predictors for Training Datasets H_990, S_628, and M_944 and Testing Datasets H_200 and S_200

Predictors	Training Datasets	Acc (%)	MCC	Sn (%)	Sp (%)	Testing Datasets	Acc (%)	MCC	Sn (%)	Sp (%)
iRNA-PseUᵃ	H_990	60.40	0.21	61.01	59.80	H_200	65.00	0.30	60.00	70.00
PseUIᵇ	H_990	64.24	0.28	64.85	63.64	H_200	65.50	0.31	63.00	68.00
iPseU-CUUᶜ	H_990	66.68	0.34	65.00	68.78	H_200	69.00	0.40	77.72	60.81
XG-PseUᵈ	H_990	65.44	0.31	63.64	67.24	H_200	67.00	0.34	67.00	67.00
iRNA-PseUᵃ	S_628	64.49	0.29	64.65	64.33	S_200	73.00	0.46	81.00	65.00
PseUIᵇ	S_628	66.56	0.33	62.10	71.02	S_200	68.50	0.37	72.00	65.00
iPseU-CUUᶜ	S_628	68.15	0.37	66.36	70.45	S_200	73.50	0.47	68.76	77.82
XG-PseUᵈ	S_628	68.15	0.37	66.84	69.45	S_200	71.00	0.42	75.00	67.00
iRNA-PseUᵃ	M_944	69.07	0.38	73.31	64.83
PseUIᵇ	M_944	70.44	0.41	74.58	66.31
iPseU-CUUᶜ	M_944	71.81	0.44	74.49	69.11
XG-PseUᵈ	M_944	72.03	0.45	76.48	67.57

ᵃ The predictor developed by Chen et al.
ᵇ The predictor proposed by He et al.
ᶜ The predictor constructed by Tahir et al.
ᵈ The predictor constructed by Liu et al.

As a crucial step toward building a machine-learning-based predictor, feature extraction is particularly important. Several sequence representation methods have been used in previous works to obtain feature vectors. For example, tRNAmod applies a hybrid of the binary profile of patterns (BPP) and structural information. The PPUS model uses the nucleotides around Ψ as features. The iRNA-PseU method incorporates nucleotide chemical properties (NCP) and nucleotide density (ND). For PseUI, the effective features are selected by the sequential forward-feature-selection method from five different feature extraction techniques: nucleotide composition (NC), dinucleotide composition (DNC), pseudo-dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP), and position-specific dinucleotide propensity (PSDP). For the iPseU-CUU method, the features are obtained automatically by a CNN model, a deep learning approach that is widely used in bioinformatics.39, 40, 41, 42 Furthermore, two additional feature extraction techniques, n-gram and multivariate mutual information (MMI), were also tested with the SVM method, where they still gave a low accuracy (Acc). The newly reported XG-PseU predictor uses six feature extraction techniques, namely, NC, DNC, trinucleotide composition (TNC), nucleotide chemical property (NCP), ND, and one-hot encoding (one hot).

At the same time, machine-learning-based computational approaches show excellent performance in identifying many other types of RNA modifications, including N6-methyladenosine (m6A),43, 44, 45 5-methylcytosine (m5C),46, 47, 48, 49, 50, 51, 52, 53 N1-methyladenosine (m1A),54, 55, 56 and so forth. The related computational models have been summarized in a review, in which the recently reported overall accuracies are mostly above 90%. In particular, the SVM-based iRNA(m6A)-PseDNC model demonstrates an Acc of 91.24% in 10-fold cross-validation for m6A identification in S. cerevisiae. For the m5C site, the recently developed iRNA-m5C predictor based on the Random Forest (RF) algorithm shows a jackknife test Acc of up to 92.9% for H. sapiens. For m1A, the SVM-based iRNA-3typeA method obtains a jackknife validation Acc of 99.13% for H. sapiens and 98.73% for M. musculus. However, as mentioned earlier, the evaluated accuracies of Ψ site identification with different models are mostly only 60%–70%, leaving considerable room for improvement. We noticed that a predictor called “KELMPSP” reported a better performance, with accuracies for the H_990, S_628, M_944, H_200, and S_200 datasets of up to 74.55%, 85.53%, 79.45%, 72.5%, and 76.00%, respectively. In this method, the kernel extreme learning machine (KELM) algorithm is applied, and the final features are obtained by combining NCP, nucleotide concentrations, and position-specific mononucleotide, dinucleotide, and trinucleotide propensity characteristics. However, the related web server at http://39.105.77.161:8890/KELMPSP is no longer available.

In this paper, we first applied the bi-profile Bayes (BPB) method to extract RNA sequence features to identify Ψ sites. Two algorithms, RF and SVM, were used to construct the models, and the performances were evaluated by 5-fold cross-validation and independent tests. Then, we combined three different features with BPB to assess their combined performance: basic Kmer (Kmer), general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-General) generated from the web server Pse-in-One, and NCP with ND (NCP+ND). High-quality features were selected using the max-relevance-max-distance (MRMD) method to predict the Ψ sites.

Results and Discussion

Performance of the BPB Features

First, we extracted the RNA features using the BPB method for Ψ site prediction. The performances were evaluated by 5-fold cross-validation on the benchmark datasets H_990, S_628, and M_944 and by independent tests on H_200 and S_200. Table 2 compares our results using BPB features with the iPseU-CUU predictor, where RF and SVM indicate the results from the RF and SVM classifiers, respectively. The SVM generally performed better than the RF. Specifically, the accuracies of the SVM method were 4.85% and 4.13% higher than those of the RF method for the training datasets H_990 and M_944, respectively. For the independent test, the Acc and MCC increased by 15.0% and 0.3 for the H_200 dataset. However, for the S_628 dataset, the Acc increased only from 62.58% to 63.06%. Here, although the specificity (Sp) increased from 61.46% to 73.25%, the sensitivity (Sn) declined from 63.69% to 52.87%, which means that nearly half of the positive samples were incorrectly predicted as negative. Similar results can also be observed for S_200. From the comparison between RF and SVM, it can be concluded that the SVM algorithm is more effective than RF for Ψ prediction on RNA sequences of H. sapiens and M. musculus.

Table 2

Comparison of Our Results based on the RF and SVM Methods Using the BPB Features with the iPseU-CUU Predictor

Predictors	Training Datasets	Acc (%)	MCC	Sn (%)	Sp (%)	Testing Datasets	Acc (%)	MCC	Sn (%)	Sp (%)
iPseU-CNNᵃ	H_990	66.68	0.34	65.00	68.78	H_200	69.00	0.40	77.72	60.81
RFᵇ	H_990	58.28	0.17	60.00	56.57	H_200	59.00	0.18	61.00	57.00
SVMᶜ	H_990	63.13	0.26	64.04	62.22	H_200	74.00	0.48	78.00	70.00
iPseU-CNNᵃ	S_628	68.15	0.37	66.36	70.45	S_200	73.50	0.47	68.76	77.82
RFᵇ	S_628	62.58	0.25	63.69	61.46	S_200	74.00	0.48	70.00	78.00
SVMᶜ	S_628	63.06	0.27	52.87	73.25	S_200	73.00	0.49	60.00	86.00
iPseU-CNNᵃ	M_944	71.81	0.44	74.49	69.11
RFᵇ	M_944	67.27	0.35	69.28	65.25
SVMᶜ	M_944	71.40	0.43	75.00	67.80

ᵃ The predictor proposed by Tahir et al.
ᵇ The RF-based predictor using BPB features.
ᶜ The SVM-based predictor using BPB features.

Compared with the iPseU-CUU model, the SVM method showed accuracies reduced by 3.55% and 5.09% for the first two training datasets, H_990 and S_628. Almost the same results were found for the training dataset M_944 and the independent dataset H_200, where our results are only 0.5% lower than those of iPseU-CUU. Additionally, the SVM model performed better for S_200, where the Acc and MCC were improved by approximately 5.0% and 0.08, respectively. In general, the SVM algorithm appears to be a better choice than RF for Ψ modification prediction using BPB features alone, as can be seen in Figure 2. However, the overall performance of the SVM method here is unsatisfactory and is even lower than that of the latest predictor, iPseU-CUU, for the two datasets H_990 and S_628.

Figure 2

This Histogram Shows the Results of the iPseU-CUU Predictor and the Constructed Model Based on the RF and SVM Classifiers Using the BPB Features

Performance of the BPB Features Combining Other Features

To improve performance, three further kinds of features were investigated: Kmer, PC-PseDNC-General, and NCP+ND from the iRNA-PseU method. These features were also combined with BPB, where the MRMD method was applied to select the important features for the experiments. Table 3 lists the results of the different feature subsets for the H_990 dataset using the RF method (left) and the SVM method (right).

Table 3

Results of Feature Selection for the H_990 Dataset Using the RF and SVM Methods

Feature Subset	RF Acc (%)	RF MCC	RF Sn (%)	RF Sp (%)	SVM Acc (%)	SVM MCC	SVM Sn (%)	SVM Sp (%)
BPB	58.28	0.17	60.00	56.57	63.13	0.26	64.04	62.22
Kmer(2)	55.76	0.12	53.13	58.38	60.00	0.23	41.82	78.18
Kmer(3)	58.79	0.18	58.59	58.99	59.70	0.20	53.94	65.45
Kmer(4)	58.59	0.17	59.39	57.78	57.27	0.15	56.57	57.98
PC-PseDNC-General (6,0.99)	58.59	0.17	56.57	60.61	57.78	0.16	49.49	66.06
NCP+ND	56.87	0.14	57.37	56.36	60.34	0.21	60.40	60.28
BPB+Kmer(3)	60.40	0.21	60.61	60.20	63.23ᵃ	0.27ᵃ	61.01ᵃ	65.45ᵃ
BPB+PC-PseDNC-General (6,0.99)	61.72	0.23	59.39	64.04	62.93	0.26	61.62	64.24
BPB+NCP+ND	61.11	0.22	62.83	59.39	61.11	0.22	58.79	63.43
BPB+PC-PseDNC-General (6,0.99) + Kmer(3)	61.01	0.22	59.39	62.63	62.73	0.25	61.82	63.64

ᵃ Performance with maximum accuracy.

The first six rows give the performance of each single feature type: BPB, Kmer (k = 2, 3, 4), PC-PseDNC-General (λ = 6, w = 0.99), and NCP+ND. Among the Kmer settings, Kmer(3) gives the most consistent results across classifiers, with accuracies of 58.79% and 59.70% for the RF and SVM classifiers, respectively. For the PC-PseDNC-General method, several parameter settings were tested, and the best results were obtained with λ = 6 and w = 0.99; the corresponding SVM-based Acc (57.78%) is slightly lower than the RF-based Acc (58.59%). We also repeated the work of Chen et al. (NCP+ND) with 5-fold cross-validation, which gave an Acc of 60.34% compared with the reported jackknife result (60.40%). The accuracies of the single features are thus all lower than that of the latest iPseU-CUU predictor (66.68%), and among them the BPB features give the best Acc (63.13%) with the SVM method.

Further, we combined the Kmer, PC-PseDNC-General, and NCP+ND features with BPB, and the final features for model construction were selected using the MRMD method. Four combinations are listed in Table 3 for the H_990 dataset: BPB+Kmer(3), BPB+PC-PseDNC-General(6,0.99), BPB+NCP+ND, and BPB+PC-PseDNC-General(6,0.99)+Kmer(3). With the RF method, the combined results are generally 2%–3% better than the BPB-only result, and the best combination, BPB+PC-PseDNC-General(6,0.99), reaches a maximum Acc of 61.72%. However, there is no obvious improvement for the SVM-based method, and there is even a 2.02% decrease for the BPB+NCP+ND combination. The feature combination BPB+Kmer(3) showed the best performance with the SVM method, giving 63.23% Acc, 0.27 MCC, 61.01% Sn, and 65.45% Sp. Applying this model to the independent H_200 test, the obtained Acc, MCC, Sp, and Sn were 72.00%, 0.46, 82%, and 62%, respectively. Compared with the iPseU-CUU predictor, the Acc and MCC improved by 3% and 0.06.

Tables 4 and 5 list the corresponding results for the S_628 and M_944 datasets, respectively. For S_628, the feature combination BPB+PC-PseDNC-General(2,0.1)+Kmer(4) gave the best performance, where the Acc and MCC improved by 7.48% and 0.14, respectively, over the BPB features alone with the SVM method. Compared with the iPseU-CNN model, the evaluated Acc shows a 2.39% improvement. Finally, the combined model was tested using the independent dataset S_200, where the Acc, MCC, Sn, and Sp are 77.00%, 0.54, 75%, and 79%, respectively. The Acc and MCC are 3.5% and 0.07 higher than those of the iPseU-CUU model. For M_944, the best performance was given by the feature combination BPB+Kmer(3), for which the Acc was 72.46%, MCC was 0.45, Sn was 75.85%, and Sp was 69.07%. Compared with the Acc of the iPseU-CUU method, this is only a 0.65% improvement. Figure 3 shows an intuitive comparison of the evaluated performance of iPseU-CUU (orange bars), XG-PseU (green bars), and the SVM-based model constructed using the combined features in this work (blue bars).

Table 4

Results of Feature Selection for the S_628 Dataset Using the RF and SVM Methods

Feature Subset	RF Acc (%)	RF MCC	RF Sn (%)	RF Sp (%)	SVM Acc (%)	SVM MCC	SVM Sn (%)	SVM Sp (%)
BPB	62.58	0.25	63.69	61.46	63.06	0.27	52.87	73.25
Kmer (k = 2)	58.12	0.16	58.28	57.96	61.78	0.24	64.33	59.24
Kmer (k = 3)	60.35	0.21	62.10	58.60	61.78	0.24	66.56	57.01
Kmer (k = 4)	59.71	0.19	62.74	56.69	64.97	0.30	67.52	62.42
PC-PseDNC-General (2, 0.11)	58.76	0.18	61.78	55.73	61.15	0.22	64.01	58.28
NCP+ND	60.83	0.22	62.74	58.92	60.99	0.22	57.01	64.97
BPB+Kmer (k = 4)	64.01	0.28	64.33	63.69	68.15	0.36	66.56	69.75
BPB+PC-PseDNC-General (2, 0.11)	62.90	0.26	63.38	62.42	66.08	0.33	57.64	74.52
BPB+NCP+ND	62.74	0.26	65.61	59.87	61.78	0.24	56.37	67.20
BPB+PC-PseDNC-General (2, 0.11) + Kmer(4)	64.49	0.29	65.92	63.06	70.54ᵃ	0.41ᵃ	69.43ᵃ	71.66ᵃ

ᵃ Performance with maximum accuracy.

Table 5

Results of Feature Selection for the M_944 Dataset Using the RF and SVM Methods

Feature Subset	RF Acc (%)	RF MCC	RF Sn (%)	RF Sp (%)	SVM Acc (%)	SVM MCC	SVM Sn (%)	SVM Sp (%)
BPB	68.54	0.37	69.28	67.80	71.40	0.43	75.00	67.80
Kmer(2)	52.22	0.04	54.45	50.00	56.78	0.14	61.65	51.91
Kmer(3)	55.51	0.11	57.42	53.60	59.22	0.18	60.81	57.63
Kmer(4)	56.04	0.12	58.05	54.03	58.37	0.17	59.96	56.78
PC-PseDNC-General (2, 0.1)	53.07	0.06	56.14	50.00	57.84	0.16	64.41	51.27
NCP+ND	67.58	0.35	70.34	64.83	68.01	0.36	69.49	66.53
BPB+Kmer(3)	67.37	0.35	71.61	63.14	72.46ᵃ	0.45ᵃ	75.85ᵃ	69.07ᵃ
BPB+PC-PseDNC-General (2, 0.1)	67.58	0.35	70.97	64.19	71.40	0.43	73.52	69.28
BPB+NCP+ND	68.43	0.37	71.82	65.04	68.11	0.36	69.70	66.53
BPB+PC-PseDNC-General (2, 0.11) + Kmer(3)	68.33	0.37	72.67	63.98	71.72	0.44	75.00	68.43

ᵃ Performance with maximum accuracy.

Figure 3

Comparisons of the Evaluated Performance of Predictors iPseU-CUU, XG-PseU and the Constructed Model Using the Combined Features in This Work

Finally, we investigated several kinds of features from the two state-of-the-art tools iLearn and BioSeq-Analysis2.0 for H. sapiens, including Mismatch (k = 2, 3, 4), subsequence (k = 2, 3, 4), the enhanced nucleic acid composition (ENAC) with a sequence window of 5, electron-ion interaction pseudopotentials (EIIP), electron-ion interaction pseudopotentials of trinucleotide (PseEIIP), binary encoding (BE), dinucleotide-based auto covariance (DAC), dinucleotide-based cross covariance (DCC), and dinucleotide-based auto-cross covariance (DACC). The average Acc of the subsequence, ENAC, and autocorrelation features using the SVM algorithm is approximately 55%. The performances of the other features, alone and combined with the best-performing combination BPB+Kmer(3), are listed in Table 6. The feature combination BPB+Kmer(3)+EIIP gives accuracies of 63.33% and 75% on the H_990 and H_200 datasets, which are 0.1% and 3% higher, respectively, than those of our original feature combination BPB+Kmer(3).

Table 6

Results of Feature Selection for H_990 and H_200 Datasets Using Several Kinds of Features from iLearn and BioSeq-Analysis 2.0

Feature Subset	H_990 Acc (%)	H_990 MCC	H_990 Sn (%)	H_990 Sp (%)	H_200 Acc (%)	H_200 MCC	H_200 Sn (%)	H_200 Sp (%)
BE	60.10	0.20	58.79	61.41	66.50	0.33	64.00	69.00
Mismatch (3)	60.81	0.22	57.37	64.24	59.50	0.19	58.00	61.00
EIIP	57.37	0.15	54.55	60.20	58.00	0.16	56.00	60.00
PseEIIP	58.99	0.18	54.75	63.23	58.00	0.16	55.00	61.00
BPB+Kmer(3)+EIIPᵃ	63.33	0.27	62.63	64.04	75.00	0.51	81.00	69.00
BPB+Kmer(3)+PseEIIP	63.13	0.26	61.01	65.25	70.50	0.43	82.00	59.00
BPB+Kmer(3)+BE	60.91	0.22	58.99	62.83	68.00	0.36	69.00	67.00
BPB+Kmer(3)+mismatch(3)	61.11	0.22	56.77	65.45	60.20	0.20	61.00	59.41
BPB+Kmer(3)+EIIP+mismatch(3)	61.21	0.23	56.97	65.45	60.20	0.20	61.00	59.41

ᵃ All values in this row indicate performance with maximum accuracy.

Conclusions

Ψ identification plays an important role in academic research and drug development. In this study, we first extracted the RNA features using the BPB method67, 68, 69 for Ψ site prediction, which captures RNA sequence information from both positive and negative training samples. The evaluated accuracies using the SVM method are 3.55% and 5.09% lower than those of iPseU-CUU for the H_990 and S_628 datasets. Almost the same results were obtained for M_944 and H_200, and a 5.0% improvement was obtained for S_200. Then, we combined BPB with three kinds of features (Kmer, PC-PseDNC-General, and NCP+ND), where the useful features were further selected by the MRMD method. The final accuracies of the combined features using the SVM classifier were 70.54% and 72.46% for S_628 and M_944, respectively. The predicted Acc for the independent S_200 dataset also improved from 69.0% (BPB features alone) to 77.0% (combined features). It can be concluded that the combined features with the SVM classifier bring some improvement for S. cerevisiae and M. musculus. However, across this work and the six existing predictors, accuracies generally remain in the 60%–70% range, which needs to be improved further for practical use by biologists. Many kinds of feature extraction methods have been applied to encode RNA sequences for Ψ identification, including BPB, Kmer, PC-PseDNC-General, NCP, ND, Mismatch, subsequence, ENAC, EIIP, PseEIIP, BE, DAC, DCC, and DACC in this paper, as well as PSNP, PSDP, and so forth. In addition, many machine-learning-based computational methods70, 71, 72 for identifying other types of RNA modifications, including m6A, m5C, and m1A, have shown excellent performance (with an Acc of approximately 90%). Thus, researchers need to reconsider whether there is any sequence feature in the RNA Ψ modification prediction problem. Other kinds of methods may identify Ψ modification sites with better performance.

Materials and Methods

In this study, we use the datasets built by Chen et al. from RMBase, including the training datasets H_990, S_628, and M_944 and the independent testing datasets H_200 and S_200, for H. sapiens, S. cerevisiae, and M. musculus, respectively. We prepare the BPB features alone as well as combinations of BPB with three other kinds of features (Kmer, PC-PseDNC-General, and NCP+ND), selected using the MRMD method. Then, two classifiers, RF and SVM, are used separately for model construction. The schematic flowchart of this work is shown in Figure 1.

Figure 1

Flowchart of Constructed Predictors for Ψ Identification Using the BPB Features

Feature Extraction Methods

BPB

BPB is an effective feature extraction approach that has been successfully applied in bioinformatics with good performance.67, 68, 69, 74, 75, 76, 77 It captures sequence information from both positive and negative RNA samples. Considering the RNA sequence $S = s_1 s_2 \cdots s_n$, the associated BPB feature vector is written as

$$P = (p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_{2n})$$

where $p_1, \ldots, p_n$ and $p_{n+1}, \ldots, p_{2n}$ represent the frequency of the corresponding nucleotide at each position in the positive and negative datasets, respectively. Thus, the BPB features for model training reflect both the positive and the negative position-specific information.
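The BPB encoding described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names `position_frequencies` and `bpb_encode` and the toy sequences are assumptions introduced here. In a cross-validation setting, the two position profiles would normally be estimated from the training folds only, so that no information from the test samples leaks into the features.

```python
from collections import Counter

ALPHABET = "ACGU"


def position_frequencies(seqs):
    """Return, for each position, the frequency of every nucleotide
    across a set of equal-length sequences."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("all sequences must have the same length")
    profile = []
    for pos in range(n):
        counts = Counter(s[pos] for s in seqs)
        profile.append({nt: counts[nt] / len(seqs) for nt in ALPHABET})
    return profile


def bpb_encode(seq, pos_profile, neg_profile):
    """Encode one sequence of length n as a 2n-dimensional vector:
    n values looked up in the positive profile, then n values looked up
    in the negative profile."""
    pos_part = [pos_profile[i].get(nt, 0.0) for i, nt in enumerate(seq)]
    neg_part = [neg_profile[i].get(nt, 0.0) for i, nt in enumerate(seq)]
    return pos_part + neg_part


# Toy example (hypothetical sequences, not taken from the benchmark data).
positives = ["ACU", "AGU", "ACU"]
negatives = ["GCU", "GGA", "UCA"]
vector = bpb_encode("ACU",
                    position_frequencies(positives),
                    position_frequencies(negatives))
print(vector)
```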

Kmer

Kmer is a common way to represent RNA sequence information, where the feature vector is obtained from the frequencies of k neighboring nucleotides. The Kmer features are available at the web server Pse-in-One (http://bioinformatics.hitsz.edu.cn/Pse-in-One/RNA/Kmer/).
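As an illustrative sketch (not the Pse-in-One implementation), the Kmer representation can be computed as the relative frequency of every length-k substring. The function name `kmer_frequencies` and the normalization by the number of sliding windows are assumptions made here for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGU"):
    """Return the 4**k-dimensional vector of k-mer frequencies,
    ordered lexicographically over the given alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing other symbols
            counts[window] += 1
    return [counts[km] / n_windows for km in kmers]


# Kmer(3) yields 64 features per sequence; Kmer(2) yields 16.
features = kmer_frequencies("ACGUACGUACGUACGUACGUA", k=3)
print(len(features))
```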

PC-PseDNC-General

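Similarly, the PC-PseDNC-General features can also be obtained at Pse-in-One (http://bioinformatics.hitsz.edu.cn/Pse-in-One/RNA/PC-PseDNC-General/), where 22 alternative physicochemical properties are provided to generate the pseudo-dinucleotide composition. The corresponding RNA features can be written as:

$$\mathbf{D} = [d_1, d_2, \ldots, d_{16}, d_{16+1}, \ldots, d_{16+\lambda}]^{T}$$

with

$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16 \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 17 \le k \le 16 + \lambda \end{cases}$$

Here, $f_k$ indicates the normalized occurrence frequency of the 16 dinucleotides; $w$ is the weight factor; and $\theta_j$ is the $j$-tier correlation factor demonstrating the sequence-order correlations between all of the most contiguous dinucleotides along a given RNA sequence, where parameter $\lambda$ gives the highest counted rank (or tier). It can be further expressed as:

$$\theta_j = \frac{1}{L - j - 1} \sum_{i=1}^{L-j-1} C(D_i, D_{i+j}), \quad j = 1, 2, \ldots, \lambda; \; \lambda < L - 1$$

where $L$ is the sequence length and $C(D_i, D_{i+j})$ is called the correlation function, formulated as

$$C(D_i, D_{i+j}) = \frac{1}{u} \sum_{g=1}^{u} \left[ P_g(D_i) - P_g(D_{i+j}) \right]^2$$

Here, $u$ indicates the number of physicochemical properties investigated, and $P_g(D_i)$ and $P_g(D_{i+j})$ are the associated values of the $g$th property for the dinucleotides at positions $i$ and $i+j$, respectively.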
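The construction of a parallel-correlation pseudo-dinucleotide vector (16 dinucleotide frequencies followed by λ weighted correlation factors) can be sketched as below. This is a simplified illustration, not the Pse-in-One code: the function name `pc_psednc` is hypothetical, and `TOY_PROPERTY` is a made-up single property (1 if the dinucleotide starts with a purine, else 0) standing in for the 22 real physicochemical property tables, which are not reproduced here.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]

# Hypothetical toy property, used only to make the example runnable.
TOY_PROPERTY = {d: (1.0 if d[0] in "AG" else 0.0) for d in DINUCLEOTIDES}


def pc_psednc(seq, properties, lam, w):
    """Return a (16 + lam)-dimensional vector: normalized dinucleotide
    frequencies followed by weighted j-tier correlation factors.

    properties: list of dicts mapping each dinucleotide to a value.
    """
    length = len(seq)
    if lam >= length - 1:
        raise ValueError("lam must be smaller than len(seq) - 1")
    dinucs = [seq[i:i + 2] for i in range(length - 1)]

    # Normalized occurrence frequency of each of the 16 dinucleotides.
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCLEOTIDES]

    def correlation(a, b):
        # Mean squared difference of the property values of two dinucleotides.
        return sum((p[a] - p[b]) ** 2 for p in properties) / len(properties)

    thetas = []
    for j in range(1, lam + 1):
        terms = [correlation(dinucs[i], dinucs[i + j])
                 for i in range(length - 1 - j)]
        thetas.append(sum(terms) / len(terms))

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


vector = pc_psednc("ACGUACGUACGUACGUACGUA", [TOY_PROPERTY], lam=6, w=0.99)
print(len(vector))
```

In practice the property values would be standardized before use, and λ and w would be tuned, as was done for the settings (6, 0.99) and (2, 0.1) reported in the tables.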

NCP+ND

In the iRNA-PseU method, the feature vectors are obtained by incorporating three NCPs (ring structure, hydrogen bond, and functional group) and the accumulated occurrence frequency. The chemical properties are encoded as follows: the purines A and G, with two rings, are encoded as 1, and the pyrimidines C and U, with one ring, as 0; the weak hydrogen bonds formed between A and U when constructing secondary structures are encoded as 1, and the strong hydrogen bonds between C and G as 0; the amino-group nucleotides A and C are encoded as 1, and the keto-group nucleotides G and U as 0. The four nucleotides A, C, G, and U are then encoded as (1, 1, 1), (0, 0, 1), (1, 0, 0), and (0, 1, 0), respectively. In addition, the nucleotide density $d_i$ is defined as

$$d_i = \frac{1}{|N_i|} \sum_{j=1}^{i} f(n_j), \qquad f(n_j) = \begin{cases} 1, & n_j = n_i \\ 0, & \text{otherwise} \end{cases}$$

where $|N_i|$ is the length of the $i$th prefix string $\{n_1, n_2, \ldots, n_i\}$. Finally, an RNA sequence of length $l$ can be represented by a $4l$-dimensional vector according to the formulation of PseKNC.
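A minimal sketch of the NCP+ND encoding is given below, using the four nucleotide codes listed above and taking the density at position i as the fraction of the first i nucleotides that equal the nucleotide at position i. The function name `ncp_nd_encode` is an assumption made for this illustration.

```python
# Three-bit chemical property codes for each nucleotide, as listed in the text.
NCP_CODES = {
    "A": (1, 1, 1),
    "C": (0, 0, 1),
    "G": (1, 0, 0),
    "U": (0, 1, 0),
}


def ncp_nd_encode(seq):
    """Encode a sequence of length l as a 4*l-dimensional vector:
    three chemical property bits plus one density value per position."""
    features = []
    seen = {}
    for i, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        density = seen[nt] / i  # share of this nucleotide in the prefix
        features.extend(NCP_CODES[nt])
        features.append(density)
    return features


print(ncp_nd_encode("AUA"))
```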

Classifiers and Cross-Validation

RF

RF is a widely used algorithm in prediction problems that effectively combines an ensemble of tree-structured classifiers.79, 80, 81, 82, 83, 84, 85, 86, 87 It is usually applied to problems with a very large number of features. The classifier consists of hundreds of decision trees, and the final prediction is obtained by majority vote. In this article, we used the RF method implemented in the Weka data mining suite with the default parameters.

SVM

SVM is a successful machine learning algorithm based on statistical learning theory,89, 90, 91, 92, 93, 94, 95, 96 which has been widely applied in bioinformatics and computational biology.97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108 In this method, the original input data are transformed into a higher dimensional feature space (Hilbert space), and then the optimal separating hyperplane is determined. Here, the LIBSVM package v.3.21 was used to implement the SVM, where the radial basis function (RBF) kernel was chosen to obtain the best classification hyperplane. The related regularization parameter C and kernel width γ were determined through an optimization procedure using the default grid search approach.
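The grid search over C and γ can be sketched as an exhaustive scan of a log2-spaced grid. The ranges below are an assumption: they are the defaults of the `grid.py` tool shipped with LIBSVM (log2 C from -5 to 15 in steps of 2, log2 γ from 3 to -15 in steps of -2), which is one plausible reading of "default grid search" and is not confirmed by the text. The scoring callback and `toy_score` are hypothetical placeholders; in practice the score would be the cross-validated accuracy of an RBF-kernel SVM trained with the given pair.

```python
import math


def log2_grid(begin, end, step):
    """Return [2**begin, 2**(begin + step), ...] up to and including 2**end."""
    values = []
    exponent = begin
    while (step > 0 and exponent <= end) or (step < 0 and exponent >= end):
        values.append(2.0 ** exponent)
        exponent += step
    return values


# Assumed defaults of LIBSVM's grid.py (see lead-in).
C_GRID = log2_grid(-5, 15, 2)
GAMMA_GRID = log2_grid(3, -15, -2)


def grid_search(score):
    """Evaluate score(C, gamma) on every grid point and return the
    best (score, C, gamma) triple."""
    return max(((score(c, g), c, g) for c in C_GRID for g in GAMMA_GRID),
               key=lambda triple: triple[0])


def toy_score(c, g):
    """Hypothetical stand-in for cross-validated accuracy; it peaks at
    C = 2 and gamma = 0.125 purely for demonstration."""
    return -abs(math.log2(c) - 1) - abs(math.log2(g) + 3)


best_score, best_c, best_gamma = grid_search(toy_score)
print(best_c, best_gamma)
```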

5-Fold Cross-Validation

Although the jackknife test is effective and stable and has been applied in iRNA-PseU and PseUI, it is a very time-consuming process. On the other hand, the predictor iPseU-CUU uses 5-fold cross-validation to evaluate performance. Therefore, we chose 5-fold cross-validation on the benchmark datasets for a convenient comparison. Specifically, each benchmark dataset is divided into five equal subsets. Four subsets are used to train the model and the remaining one to test it. This process is repeated five times so that each subset is used once for testing. The final performance is the average over the five test runs.
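The 5-fold procedure described above can be sketched as follows. The helper names `k_fold_indices` and `cross_validate` and the callback `fit_and_score` are hypothetical; the sketch shuffles the samples with a fixed seed and does not stratify by class, which may differ from the procedure actually used.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Average the score returned by fit_and_score(train_idx, test_idx)
    over all k folds."""
    scores = [fit_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# For a 990-sample dataset, each fold tests on 198 samples and trains on 792.
for train_idx, test_idx in k_fold_indices(990):
    print(len(train_idx), len(test_idx))
```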

MRMD

Feature selection aims to select a subset of features by removing redundancy and keeping the most discriminative features.111, 112, 113, 114 MRMD is an effective feature selection method for reducing the dimensionality of feature vectors, where both the accuracy and the stability of feature ranking and prediction tasks are considered. As the related work of Xu et al. shows, performance improves when models are built on the features selected by MRMD. In this method, the features with the maximum relevance and distance are selected as the final feature subset for the experiments.
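The idea of ranking features by relevance plus distance can be illustrated with the simplified sketch below. It is not the MRMD tool: it assumes relevance is measured by the absolute Pearson correlation between a feature and the class label, distance by the mean Euclidean distance from a feature column to all other columns (rescaled to [0, 1]), and that the two terms are weighted equally. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mrmd_rank(columns, labels):
    """Rank feature columns by relevance to the labels plus mean
    distance to the other features; return indices, best first."""
    m = len(columns)
    relevance = [abs(pearson(col, labels)) for col in columns]
    distance = []
    for i, col in enumerate(columns):
        others = [euclidean(col, columns[j]) for j in range(m) if j != i]
        distance.append(sum(others) / len(others) if others else 0.0)
    max_distance = max(distance) or 1.0
    scores = [r + d / max_distance for r, d in zip(relevance, distance)]
    return sorted(range(m), key=lambda i: scores[i], reverse=True)


# Toy data: feature 0 equals the label, feature 1 is unrelated to it,
# feature 2 is a noisy near-copy of feature 0.
labels = [0, 0, 1, 1]
columns = [[0, 0, 1, 1], [1, 0, 1, 0], [0.1, 0.0, 0.9, 1.0]]
print(mrmd_rank(columns, labels))
```

A model would then be trained on the top-ranked features, with the cut-off chosen by cross-validated accuracy.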

Evaluation Parameters

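The performance of the constructed models is frequently evaluated using Sn, Sp, Acc, and the Matthews correlation coefficient (MCC), which are expressed as:116, 117, 118, 119, 120, 121

$$\begin{aligned} \mathrm{Sn} &= 1 - \frac{N^{+}_{-}}{N^{+}} \\ \mathrm{Sp} &= 1 - \frac{N^{-}_{+}}{N^{-}} \\ \mathrm{Acc} &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\ \mathrm{MCC} &= \frac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}} \end{aligned}$$

where $N^{+}$ and $N^{-}$ represent the total numbers of positive and negative RNA samples considered, and $N^{+}_{-}$ and $N^{-}_{+}$ indicate the numbers of positive and negative samples that are incorrectly predicted, respectively.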
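The four measures can be computed directly from the numbers of positive and negative samples and the numbers of misclassified samples in each class, as in the sketch below. The function name `evaluate` is hypothetical, and the worked example assumes that H_200 contains 100 positive and 100 negative samples, which is an assumption made here for illustration.

```python
import math


def evaluate(n_pos, n_neg, false_neg, false_pos):
    """Return (Sn, Sp, Acc, MCC) given the class sizes, the number of
    positives predicted as negative (false_neg), and the number of
    negatives predicted as positive (false_pos)."""
    sn = 1 - false_neg / n_pos
    sp = 1 - false_pos / n_neg
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    numerator = 1 - (false_neg / n_pos + false_pos / n_neg)
    denominator = math.sqrt(
        (1 + (false_pos - false_neg) / n_pos)
        * (1 + (false_neg - false_pos) / n_neg)
    )
    mcc = numerator / denominator if denominator else 0.0
    return sn, sp, acc, mcc


# Assumed balanced test set: 22 of 100 positives and 30 of 100 negatives
# misclassified. The rounded values are 0.78, 0.70, 0.74, and 0.48, which
# agree with the SVM row for H_200 in Table 2 under that assumption.
sn, sp, acc, mcc = evaluate(100, 100, false_neg=22, false_pos=30)
print(round(sn, 2), round(sp, 2), round(acc, 2), round(mcc, 2))
```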

Author Contributions

L.X. and H.X. conceived the idea and designed the overall research. L.D. constructed the predictors, evaluated the performance, and drafted the manuscript. X.L. and H.D. helped to revise the paper; both authors read, critically revised, and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.
