Yinbo Liu, Yingying Shen, Hong Wang, Yong Zhang, Xiaolei Zhu.
Abstract
As one of the most important post-transcriptional modifications of RNA, 5-cytosine-methylation (m5C) is reported to be closely related to many chemical reactions and biological functions in cells. Recently, several computational methods have been proposed for identifying m5C sites. However, their accuracy and efficiency are still not satisfactory. In this study, we propose a new method, m5Cpred-XS, for predicting m5C sites in H. sapiens, M. musculus, and A. thaliana. First, the SHAP method was used to select the optimal feature subset from seven different kinds of sequence-based features. Second, different machine learning algorithms were used to train the models. The results of five-fold cross-validation indicate that the model based on XGBoost achieved the highest prediction accuracy. Finally, our model was compared with other state-of-the-art models, and the comparison indicates that m5Cpred-XS is superior to the other methods. Moreover, we deployed the model on a web server that can be accessed at http://m5cpred-xs.zhulab.org.cn/, and m5Cpred-XS is expected to be a useful tool for studying m5C sites.
Keywords: 5-cytosine-methylation; XGBoost; feature selection; machine learning; SHAP
Year: 2022 PMID: 35432446 PMCID: PMC9005994 DOI: 10.3389/fgene.2022.853258
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The flowchart of m5Cpred-XS.
Training and test data sets of three species.
| Dataset | Length (bp) | Positive subset | Negative subset |
|---|---|---|---|
| H_train | 41 | 200 | 200 |
| H_test | 41 | 69 | 69 |
| M_train | 41 | 4,563 | 4,563 |
| M_test | 41 | 1,000 | 1,000 |
| A_train | 41 | 5,289 | 5,289 |
| A_test | 41 | 1,000 | 1,000 |
H, M, and A represent H. sapiens, M. musculus, and A. thaliana, respectively.
Chemical structure of each nucleotide.
| Chemical property | Class | Nucleotides |
|---|---|---|
| Ring Structure | Purine | A, G |
| | Pyrimidine | C, U |
| Functional Group | Amino | A, C |
| | Keto | G, U |
| Hydrogen Bond | Strong | C, G |
| | Weak | A, U |
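The three binary properties in the table can be turned into a 3-bit code per nucleotide. The sketch below is one way to do that, not necessarily the encoding used by m5Cpred-XS: the bit convention (1 = purine, 1 = amino, 1 = weak hydrogen bond), the function name `encode_ncp`, and the all-zero code for unknown bases are assumptions made for illustration.

```python
# Assumed bit convention (the table gives the classes, not the bit values):
#   x = 1 for purine,  0 for pyrimidine
#   y = 1 for amino,   0 for keto
#   z = 1 for weak hydrogen bonding, 0 for strong
NCP = {
    "A": (1, 1, 1),  # purine, amino, weak
    "C": (0, 1, 0),  # pyrimidine, amino, strong
    "G": (1, 0, 0),  # purine, keto, strong
    "U": (0, 0, 1),  # pyrimidine, keto, weak
}


def encode_ncp(seq):
    """Flatten an RNA sequence into a vector with three bits per nucleotide."""
    vec = []
    for nt in seq.upper().replace("T", "U"):
        # Unknown symbols (for example N) map to all zeros; this is an assumption.
        vec.extend(NCP.get(nt, (0, 0, 0)))
    return vec


print(encode_ncp("ACGU"))
```

Under this convention a 41-nucleotide sample yields a 123-dimensional vector.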
The optimal hyperparameters of XGBoost for three species.
| Species | learning_rate | max_depth | n_estimators |
|---|---|---|---|
| | 0.05 | 2 | 2,000 |
| | 0.02 | 6 | 2,600 |
| | 0.01 | 16 | 1,800 |
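As a rough illustration of how these settings could be passed to a classifier, the sketch below stores the three rows as keyword dictionaries in table order. `XGB_PARAMS` and `build_model` are hypothetical names; `build_model` assumes the third-party `xgboost` package and its scikit-learn style `XGBClassifier`, and any parameter not listed in the table is left at the library default, which may differ from what the authors used.

```python
# Hyperparameters copied from the table, one dict per row, in table order.
XGB_PARAMS = [
    {"learning_rate": 0.05, "max_depth": 2, "n_estimators": 2000},
    {"learning_rate": 0.02, "max_depth": 6, "n_estimators": 2600},
    {"learning_rate": 0.01, "max_depth": 16, "n_estimators": 1800},
]


def build_model(params):
    """Create an untrained classifier from one row of settings."""
    # Imported lazily so the dictionaries remain usable without xgboost installed.
    from xgboost import XGBClassifier
    return XGBClassifier(**params)
```

A call such as `build_model(XGB_PARAMS[0])` would return a model ready for `fit`.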
FIGURE 2. The cross-validation AUROC values of models based on the top n features selected by SHAP, mRMR, and F-score.
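Selecting the "top n features" by SHAP can be sketched as ranking features by their mean absolute SHAP value across samples. This is a common global-importance criterion and is assumed here; the source does not spell out the exact ranking rule. The function name, the feature names, and the toy values are all made up for illustration.

```python
def top_n_features(shap_values, feature_names, n):
    """Rank features by mean absolute SHAP value and return the top n names.

    shap_values holds one row per sample and one value per feature.
    """
    n_samples = len(shap_values)
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    ranked = sorted(range(len(feature_names)), key=lambda j: importance[j], reverse=True)
    return [feature_names[j] for j in ranked[:n]]


# Toy SHAP values for two samples and three hypothetical features.
toy = [
    [0.10, -0.40, 0.02],
    [-0.20, 0.30, 0.01],
]
print(top_n_features(toy, ["f_a", "f_b", "f_c"], 2))  # ['f_b', 'f_a']
```

Sweeping `n` and re-running cross-validation for each subset would give a curve of the kind the figure describes.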
The five-fold cross-validation results for models based on features selected by SHAP or the original 808 features.
| Species | Feature used | Pre (%) | Sp (%) | Sn (%) | Acc (%) | F1 | MCC | AUROC |
|---|---|---|---|---|---|---|---|---|
| | Features selected by SHAP | | 82.0 | | | | | |
| | 808 features | 78.9 | 78.5 | 80.5 | 79.5 | 0.797 | 0.590 | 0.873 |
| | Features selected by SHAP | | | 75.6 | | | | |
| | 808 features | 74.7 | 74.2 | | 75.1 | | 0.503 | 0.831 |
| | Features selected by SHAP | | 76.9 | | | | | |
| | 808 features | 73.6 | 75.9 | 67.3 | 71.6 | 0.703 | 0.434 | 0.779 |
The five-fold cross-validation performance of models built based on different classifiers with the features selected by SHAP.
| Species | Classifier | Pre (%) | Sp (%) | Sn (%) | Acc (%) | F1 | MCC | AUROC |
|---|---|---|---|---|---|---|---|---|
| | RF | 82.8 | | 84.5 | 83.5 | 0.837 | 0.670 | 0.911 |
| | SVM | 79.9 | 79.0 | 83.5 | 81.3 | 0.817 | 0.626 | 0.903 |
| | XGBoost | | 82.0 | | | | | |
| | RF | 70.7 | 69.2 | 74.4 | 71.8 | 0.725 | 0.437 | 0.795 |
| | SVM | 73.5 | 72.6 | 76.0 | 74.3 | 0.747 | 0.487 | 0.824 |
| | XGBoost | | | 75.6 | | | | |
| | RF | 75.1 | | 65.3 | 71.8 | 0.699 | 0.441 | 0.780 |
| | SVM | 74.2 | 78.2 | 62.9 | 70.5 | 0.681 | 0.416 | 0.768 |
| | XGBoost | | 76.9 | | | | | |
FIGURE 3. The ROC and precision-recall curves of five-fold cross-validation results based on three learning algorithms for the three species.
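The AUROC values reported alongside these curves can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that definition follows; the function name and the toy labels and scores are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U), ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

For data sets of the size listed in the data table, a sort-based implementation or a library routine would be a more practical choice.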
Comparison with other existing models on the independent test sets.
| Species | Model | Pre (%) | FOR (%) | Sp (%) | Sn (%) | Acc (%) | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| H. sapiens | RNAm5Cfinder | 76.5 | 41.3 | 88.4 | 37.7 | 63.1 | 0.505 | 0.303 | 0.635 |
| | iRNA-m5C | 43.9 | 55.5 | 46.4 | 42.1 | 44.2 | 0.429 | -0.116 | – |
| | iRNAm5C-PseDNC | 60.1 | | | 4.4 | 50.7 | 0.081 | 0.039 | – |
| | RNAm5CPred | 68.1 | 30.3 | 66.7 | 71.0 | 68.9 | 0.695 | 0.377 | 0.772 |
| | m5CPred-SVM | 78.8 | 23.6 | 79.7 | 75.4 | 77.5 | 0.770 | 0.551 | 0.858 |
| | Our method (Threshold = 0.5) | 80.6 | 21.1 | 81.2 | | 79.7 | | 0.594 | |
| | Our method (FPR | | 24.4 | 89.9 | 71.0 | | 0.784 | | |
| M. musculus | RNAm5Cfinder | 64.5 | 43.8 | 78.9 | 38.6 | 58.8 | 0.483 | 0.191 | 0.593 |
| | iRNA-m5C | | 49.9 | | 0.6 | 50.2 | 0.012 | 0.032 | – |
| | m5CPred-SVM | 73.0 | 30.0 | 74.9 | | 71.4 | 0.704 | 0.429 | 0.775 |
| | Staem5 | 69.7 | 30.3 | 77.8 | 66.1 | 71.9 | | 0.442 | 0.787 |
| | Our method (Threshold = 0.5) | 74.3 | 29.9 | 76.8 | 67.2 | 72.0 | 0.706 | 0.442 | |
| | Our method (FPR = 15%) | 79.9 | 32.3 | 85.0 | 59.5 | | 0.682 | | 0.790 |
| A. thaliana | iRNA-m5C | 73.5 | 26.7 | 75.6 | 72.4 | 74.1 | 0.729 | 0.481 | – |
| | PEA-m5C | 43.8 | 55.6 | 45.4 | 43.2 | 44.3 | 0.454 | -0.114 | – |
| | m5CPred-SVM | 76.0 | 24.4 | 76.1 | 75.5 | 75.8 | 0.757 | 0.516 | 0.836 |
| | Staem5 | 74.2 | 25.8 | 72.6 | 74.8 | 73.7 | 0.734 | 0.474 | 0.829 |
| | Our method (Threshold = 0.5) | | 23.6 | 77.4 | | 76.8 | | 0.535 | |
| | Our method (FPR = 20%) | 78.8 | 24.2 | | 74.4 | | 0.765 | | |
The settings in parentheses denote the decision thresholds used to call a positive prediction.
FOR denotes the false omission rate: FOR = FN/(FN + TN).
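The metrics in the comparison table, including FOR, all follow from the four confusion-matrix counts, and a fixed-FPR setting amounts to choosing the score cut-off from the negative samples. The sketch below illustrates both under stated assumptions: the function names are hypothetical, the standard textbook definitions of Pre, Sp, Sn, Acc, F1, and MCC are assumed, and the counts in the example are not given in the source. They were chosen to fit a test set of 69 positives and 69 negatives, and they happen to reproduce the Pre, FOR, Sp, Acc, and MCC values listed in the first "Our method (Threshold = 0.5)" row.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return the table's metrics as fractions (multiply by 100 for percent)."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Pre": tp / (tp + fp),
        "FOR": fn / (fn + tn),  # false omission rate, as defined in the footnote
        "Sp": tn / (tn + fp),
        "Sn": tp / (tp + fn),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn) / mcc_den,
    }


def threshold_at_fpr(negative_scores, target_fpr):
    """Smallest cut-off t such that calling a sample positive when its score is
    strictly greater than t keeps the FPR on these negatives <= target_fpr."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed above the cut-off (epsilon guards float error).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be called positive
    return ranked[allowed]


# Hypothetical counts for a test set of 69 positives and 69 negatives.
m = confusion_metrics(tp=54, fp=13, tn=56, fn=15)
print({k: round(v, 3) for k, v in m.items()})
```

Raising the cut-off with `threshold_at_fpr` trades sensitivity for specificity, which is the pattern visible between the "Threshold = 0.5" and fixed-FPR rows of the table.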
FIGURE 4. Top 20 features sorted by SHAP for the three species.
FIGURE 5. Distribution of the top 20 features among the seven types of features for the three species.
FIGURE 6. PCA plots for the original 808-dimensional features and the features selected by SHAP for the three species. Upper panel: the original 808-dimensional features; lower panel: the features selected by SHAP.
FIGURE 7. The heat map of the cross-species predictive AUROCs. The models (y-axis) were tested on the three independent test sets (x-axis).