Zhibin Lv, Jun Zhang, Hui Ding, Quan Zou.
Abstract
One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and an incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing relative increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
Keywords: RNA; light gradient boosting; machine learning; pseudouridine sites; random forest
Year: 2020 PMID: 32175316 PMCID: PMC7054385 DOI: 10.3389/fbioe.2020.00134
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. A schematic diagram of RF-PseU. RNA sequences with or without pseudouridine sites were encoded via seven RNA coding technologies; following removal of redundant features by light gradient boosting machine feature selection, the random forest model was trained on smaller but more relevant feature vector spaces, and was evaluated through cross-validation and independent testing to obtain an optimized model for prediction.
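The feature selection step in this pipeline can be illustrated with a minimal sketch of incremental feature selection: features are ranked by importance, and the top-ranked prefix that scores best is kept. The function name, the toy importances, and the toy evaluator are all hypothetical; in a real run the importances would presumably come from a trained light gradient boosting model and the evaluator would return the cross-validated accuracy of a random forest trained on the candidate subset.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def incremental_feature_selection(
    importances: Dict[str, float],
    evaluate: Callable[[Sequence[str]], float],
    step: int = 1,
) -> Tuple[List[str], float]:
    """Return the top-ranked feature prefix with the best evaluation score."""
    # Rank features from most to least important.
    ranked = sorted(importances, key=importances.get, reverse=True)
    best_subset: List[str] = []
    best_score = float("-inf")
    # Grow the subset from the top of the ranking; keep the best-scoring prefix.
    for k in range(step, len(ranked) + 1, step):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score


# Toy stand-ins (hypothetical, for illustration only).
toy_importances = {"f1": 0.40, "f2": 0.30, "f3": 0.20, "f4": 0.05, "f5": 0.05}
informative = {"f1", "f2", "f3"}


def toy_evaluate(subset: Sequence[str]) -> float:
    # Reward informative features and penalise redundant ones.
    hits = sum(1 for name in subset if name in informative)
    noise = len(subset) - hits
    return 0.5 + 0.1 * hits - 0.02 * noise


selected, score = incremental_feature_selection(toy_importances, toy_evaluate)
print(selected, round(score, 2))
```

Evaluating only prefixes of the ranking keeps the search linear in the number of features, which is what makes this strategy practical for feature spaces with hundreds of dimensions.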
ACGU categories based on chemical properties.
| Nucleotides | Class | Chemical property |
|---|---|---|
| C, U | Pyrimidine | Ring structure |
| A, G | Purine | Ring structure |
| A, U | Weak | Hydrogen bond |
| C, G | Strong | Hydrogen bond |
| G, U | Keto | Functional group |
| A, C | Amino | Functional group |
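The three binary properties in this table are enough to give each nucleotide a unique three-bit code. The sketch below shows one way to do that; the bit convention (purine = 1, amino = 1, weak hydrogen bond = 1) and the function names are assumptions made for illustration, not necessarily the encoding used by RF-PseU.

```python
from typing import List, Tuple

# Membership sets taken from the table of chemical properties.
PURINES = {"A", "G"}       # ring structure: purine vs. pyrimidine
AMINO = {"A", "C"}         # functional group: amino vs. keto
WEAK_H_BOND = {"A", "U"}   # hydrogen bond: weak vs. strong
VALID_BASES = {"A", "C", "G", "U"}


def encode_nucleotide(base: str) -> Tuple[int, int, int]:
    """Map one RNA base to (ring, functional group, hydrogen bond) bits."""
    if base not in VALID_BASES:
        raise ValueError(f"not an RNA nucleotide: {base!r}")
    # Assumed convention: purine = 1, amino = 1, weak hydrogen bond = 1.
    return (int(base in PURINES), int(base in AMINO), int(base in WEAK_H_BOND))


def encode_sequence(sequence: str) -> List[int]:
    """Concatenate the three-bit codes of every base in the sequence."""
    encoded: List[int] = []
    for base in sequence.upper():
        encoded.extend(encode_nucleotide(base))
    return encoded


print(encode_nucleotide("A"))
print(encode_sequence("GAU"))
```

Under this convention the four bases map to four distinct codes, so the encoding loses no sequence information while exposing the shared chemistry between bases.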
FIGURE 2. (A) Accuracy of the random forest predictor as a function of feature dimension for the three species: (A1) H. sapiens; (A2) S. cerevisiae; (A3) M. musculus. The best independent testing accuracies for H. sapiens and S. cerevisiae were 75.0% with 257 features and 77.0% with 397 features, respectively, and the best 10-fold cross-validated accuracy for M. musculus was 74.8% with 161 features. (B) Receiver operating characteristic (ROC) curves and area under the ROC curve (auROC) for the three species under various conditions: (B1) H. sapiens; (B2) S. cerevisiae; (B3) M. musculus. A support vector machine (SVM) was used for comparison with the random forest (RF) model. "10-Fold model testing" and "leave-one-out (LOO) model testing" denote independent testing of the models with the best 10-fold and LOO cross-validation scores, respectively. During cross-validation (10-fold and LOO), the training datasets were divided into a training part and a validation part; that is, the standard machine learning evaluation scheme (training, validation, and testing) was used for model optimization. In the figure, the 10-fold and LOO cross-validation metrics are obtained from the validation part of the training datasets, whereas the independent testing metrics are obtained from the independent testing datasets.
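The split of the training data into a training part and a validation part can be sketched as follows. This is a minimal illustration with hypothetical function names; it uses contiguous folds without shuffling, whereas a real evaluation would normally shuffle or stratify the samples first. Leave-one-out is treated as the special case in which the number of folds equals the number of samples.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


def leave_one_out_indices(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Leave-one-out is k-fold with one sample per validation fold."""
    return k_fold_indices(n_samples, n_samples)


for training, validation in k_fold_indices(10, 5):
    print(validation)
```

Every sample appears in exactly one validation fold, so the cross-validation metrics are computed over the whole training dataset while the independent testing dataset is never touched.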
Cross-validation and independent testing scores of two different classifiers for three species.
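| Species | Classifier | CV Acc | CV MCC | CV Sn | CV Sp | CV auROC | Test Acc | Test MCC | Test Sn | Test Sp | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| H. sapiens | SVM | 62.0% | 0.240 | 61.4% | 62.6% | 0.656 | 64.0% | 0.280 | 66.0% | 62.0% | 0.679 |
| H. sapiens | RF | 64.3% | 0.287 | 66.1% | 62.6% | 0.700 | 75.0% | 0.501 | 78.0% | 72.0% | 0.800 |
| S. cerevisiae | SVM | 67.5% | 0.352 | 73.7% | 61.2% | 0.720 | 72.5% | 0.45 | 73.0% | 73.0% | 0.786 |
| S. cerevisiae | RF | 74.8% | 0.497 | 77.2% | 72.4% | 0.810 | 77.0% | 0.540 | 75.0% | 79.0% | 0.838 |
| M. musculus | SVM | 70.7% | 0.42 | 65.9% | 75.4% | 0.759 | / | / | / | / | / |
| M. musculus | RF | 74.8% | 0.50 | 73.1% | 76.5% | 0.796 | / | / | / | / | / |
CV, cross-validation; Test, independent testing; Acc, accuracy; MCC, Matthews correlation coefficient; Sn, sensitivity; Sp, specificity; auROC, area under the ROC curve; /, not available.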
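Assuming the first four values in each block of this table are accuracy, Matthews correlation coefficient (MCC), sensitivity, and specificity, all four can be derived from confusion-matrix counts. The sketch below uses a hypothetical balanced set of 1000 positive and 1000 negative samples, with counts chosen to reproduce a sensitivity of 66.1% and a specificity of 62.6%; the function name and the counts are illustrative assumptions, not the actual benchmark sizes.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity and MCC from counts."""
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a margin is empty; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"acc": accuracy, "sn": sensitivity, "sp": specificity, "mcc": mcc}


# Hypothetical counts: 1000 positives and 1000 negatives.
metrics = classification_metrics(tp=661, tn=626, fp=374, fn=339)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts the MCC rounds to 0.287, which agrees with the cross-validation value reported for the random forest in the first pair of rows and supports reading the second value of each block as the MCC.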
Comparison of cross-validation and independent testing scores of existing state-of-the-art pseudouridine site predictors and RF-PseU.
| Species | Predictor | CV Acc | CV MCC | CV Sn | CV Sp | CV auROC | Test Acc | Test MCC | Test Sn | Test Sp | Test auROC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| H. sapiens | iRNA-PseU(LOO) | 60.4% | 0.21 | 61.0% | 59.8% | 0.640 | 65.0% | 0.30 | 60.0% | 70.0% | / |
| H. sapiens | PseUI(LOO) | 64.2% | 0.28 | 64.9% | 63.6% | 0.68 | 65.5% | 0.31 | 63.0% | 68.0% | / |
| H. sapiens | iPseU-CNN(5F) | 66.7% | 0.34 | 65.0% | 68.8% | / | 69.0% | 0.40 | 77.7% | 60.8% | / |
| H. sapiens | XG-PseU(10F) | 66.1% | 0.32 | 63.5% | 68.7% | 0.700 | 67.5% | / | / | / | / |
| H. sapiens | RF-PseU(10F) | 64.3% | 0.29 | 66.1% | 62.6% | 0.700 | 75.0% | 0.50 | 78.0% | 72.0% | 0.800 |
| H. sapiens | RF-PseU(LOO) | 64.0% | 0.29 | 65.9% | 62.6% | 0.694 | 74.0% | 0.48 | 74.0% | 74.0% | 0.814 |
| S. cerevisiae | iRNA-PseU(LOO) | 64.5% | 0.29 | 64.7% | 64.3% | 0.81 | 60.0% | 0.20 | 63.0% | 57.0% | / |
| S. cerevisiae | PseUI(LOO) | 64.1% | 0.30 | 64.7% | 67.5% | 0.69 | 68.5% | 0.37 | 65.0% | 72.0% | / |
| S. cerevisiae | iPseU-CNN(5F) | 68.2% | 0.37 | 66.4% | 70.5% | / | 73.5% | 0.47 | 68.8% | 77.8% | / |
| S. cerevisiae | XG-PseU(10F) | 68.2% | 0.37 | 66.8% | 69.5% | 0.77 | 71.0% | / | / | / | / |
| S. cerevisiae | RF-PseU(10F) | 74.8% | 0.49 | 77.2% | 72.4% | 0.810 | 77.0% | 0.54 | 75.0% | 79.0% | 0.838 |
| S. cerevisiae | RF-PseU(LOO) | 75.8% | 0.52 | 78.2% | 73.4% | 0.819 | 74.5% | 0.49 | 70.0% | 79.0% | 0.823 |
| M. musculus | iRNA-PseU(LOO) | 69.1% | 0.38 | 73.3% | 64.8% | 0.75 | / | / | / | / | / |
| M. musculus | PseUI(LOO) | 70.4% | 0.41 | 79.9% | 70.3% | 0.71 | / | / | / | / | / |
| M. musculus | iPseU-CNN(5F) | 71.8% | 0.44 | 74.8% | 69.1% | / | / | / | / | / | / |
| M. musculus | XG-PseU(10F) | 72.0% | 0.45 | 76.5% | 67.6% | 0.74 | / | / | / | / | / |
| M. musculus | RF-PseU(10F) | 74.8% | 0.50 | 73.1% | 76.5% | 0.796 | / | / | / | / | / |
| M. musculus | RF-PseU(LOO) | 74.5% | 0.48 | 72.7% | 75.2% | 0.794 | / | / | / | / | / |
CV, cross-validation; Test, independent testing; Acc, accuracy; MCC, Matthews correlation coefficient; Sn, sensitivity; Sp, specificity; auROC, area under the ROC curve; LOO, leave-one-out cross-validation; 5F and 10F, 5-fold and 10-fold cross-validation; /, not available.
Comparison of average accuracies for state-of-the-art predictors.
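| Evaluation | RF-PseU(10F) | RF-PseU(LOO) | iRNA-PseU | PseUI | iPseU-CNN | XG-PseU |
|---|---|---|---|---|---|---|
| Cross-validation | 71.3% | 71.4% | 64.7% | 66.2% | 68.9% | 68.7% |
| Independent testing | 76.0% | 74.7% | 62.5% | 67.0% | 71.3% | 69.3% |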
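The increments of 3.63% and 4.77% quoted in the abstract are consistent with relative, rather than absolute, differences between these averages. The check below takes 71.4% and 74.7% (the RF-PseU leave-one-out averages quoted in the abstract, second column) and compares them with 68.9% and 71.3% (the highest values in the last four columns); the function and variable names are hypothetical.

```python
def relative_increment(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`, relative to `old`."""
    return (new - old) / old * 100.0


# Average accuracies (%) read from the table.
rf_cross_validation, rf_independent = 71.4, 74.7
best_other_cross_validation, best_other_independent = 68.9, 71.3

cv_gain = relative_increment(rf_cross_validation, best_other_cross_validation)
test_gain = relative_increment(rf_independent, best_other_independent)
print(round(cv_gain, 2), round(test_gain, 2))
```

In absolute terms the same gaps are 2.5 and 3.4 percentage points, so the distinction matters when comparing these figures with results reported elsewhere.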
FIGURE 3. A screenshot of the RF-PseU web server interface. The web server allows users to type or paste FASTA-format text into the text box and click the submit button; the results are displayed in the right-hand table.
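Input for such a form can be checked locally before submission. The sketch below parses FASTA-format text into named RNA sequences; the function name is hypothetical, and the choices to upper-case the input, convert T to U, and reject other characters are assumptions, since the validation rules of the web server are not described here.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: RNA sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            if header in records:
                raise ValueError(f"duplicate FASTA header: {header}")
            records[header] = ""
        else:
            if header is None:
                raise ValueError("sequence data found before the first header")
            # Assumed normalisation: upper-case and treat DNA-style T as U.
            sequence = line.upper().replace("T", "U")
            invalid = set(sequence) - set("ACGU")
            if invalid:
                raise ValueError(f"invalid characters: {sorted(invalid)}")
            records[header] += sequence  # multi-line sequences are joined
    return records


example = ">seq1\nACGU\nacgt\n>seq2\nGGUU\n"
print(parse_fasta(example))
```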