Lu Zhang, Xinyi Qin, Min Liu, Guangzhong Liu, Yuxiao Ren.
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for better revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are overpriced and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify the m7G sites. In this study, we propose a novel method via incorporating BERT-based multilingual model in bioinformatics to represent the information of RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important features. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with tree-structured Parzen estimator (TPE) approach. By 10-fold cross-validation, the performance of BERT-m7G is measured with an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.Entities:
Mesh:
Substances:
Year: 2021 PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
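The abstract describes treating RNA sequences as natural sentences before feeding them to BERT. A common way to do this (the extract does not spell out the tokenization, so the overlapping k-mer scheme and the k = 3 default below are assumptions) is to split each sequence into overlapping k-mer "words":

```python
def rna_to_kmer_sentence(seq: str, k: int = 3) -> str:
    """Split an RNA sequence into overlapping k-mer 'words' so it can be
    handed to a BERT-style tokenizer like a natural-language sentence."""
    seq = seq.upper().replace("T", "U")  # normalize DNA-style input to RNA
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

print(rna_to_kmer_sentence("AUGGCA"))  # -> "AUG UGG GGC GCA"
```

Each resulting "sentence" can then be run through a pretrained BERT model, whose token embeddings form the fixed-length numerical matrix the abstract refers to.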
Figure 1. The flow chart of the BERT-m7G method.
The prediction results of different feature selection methods.
| Methods | SN (%) | SP (%) | ACC (%) | MCC | Optimal dimension |
|---|---|---|---|---|---|
| EN | | | | | |
| XGBoost | 91.50 | 92.18 | 91.84 | 0.8375 | 753 |
| LightGBM | 90.15 | 90.42 | 90.28 | 0.8069 | 2206 |
| Boruta | 89.20 | 90.15 | 89.68 | 0.7953 | 1090 |
| SVD | 87.72 | 87.85 | 87.78 | 0.7570 | 473 |
| PCA | 87.45 | 87.59 | 87.52 | 0.7522 | 473 |
| SE | 86.24 | 86.78 | 86.51 | 0.7312 | 473 |
| LLE | 83.13 | 82.73 | 82.93 | 0.6603 | 473 |
| All | 88.66 | 87.73 | 88.19 | 0.7651 | 31488 |
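The EN (elastic net) selection step in the table above can be sketched with scikit-learn: fit an elastic-net-penalized logistic regression on the BERT features and keep only the features whose coefficients survive the penalty. The synthetic data and the `l1_ratio` and `C` values below are illustrative placeholders, not the paper's settings:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # stand-in for the BERT feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels driven by two features

# Elastic net = mixed L1/L2 penalty; features with small coefficients are
# treated as redundant and dropped, the rest form the selected subset.
en = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, C=0.1, max_iter=5000)
selector = SelectFromModel(en).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape[1], "features kept out of", X.shape[1])
```

In the paper this shrinks the 31488-dimensional "All" representation down to the optimal dimension reported per method in the table.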
Figure 2. ROC curves of the eight feature selection methods: XGBoost, LightGBM, Boruta, SVD, PCA, SE, LLE, and EN.
Comparison of different feature extraction methods.
| Methods | SN (%) | SP (%) | ACC (%) | MCC |
|---|---|---|---|---|
| TNC | 87.32 | 83.81 | 85.56 | 0.7127 |
| KSNPFs | 86.37 | 83.54 | 84.95 | 0.7003 |
| NCP | 89.88 | 88.13 | 89.00 | 0.7814 |
| DNC | 87.18 | 82.32 | 84.75 | 0.6973 |
| ANF | 67.20 | 73.00 | 70.10 | 0.4038 |
| BERT (EN) | | | | |
Figure 3. ROC curves of different feature extraction methods.
Performance comparison of different classifiers.
| Classifier | SN (%) | SP (%) | ACC (%) | MCC |
|---|---|---|---|---|
| LightGBM | 89.07 | 89.47 | 89.27 | 0.7863 |
| SVM | 93.79 | 95.28 | 94.53 | 0.8912 |
| LR | 94.47 | 94.46 | 94.47 | 0.8902 |
| GBDT | 89.34 | 89.07 | 89.20 | 0.7848 |
| NB | 88.80 | 87.99 | 88.39 | 0.7688 |
| RF | 88.94 | 87.18 | 88.06 | 0.7623 |
| Stacking | | | | |
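The stacking row combines base learners like those in the table under a logistic-regression metaclassifier. A minimal scikit-learn sketch of this design, with `GradientBoostingClassifier` standing in for LightGBM so only scikit-learn is needed, and synthetic data in place of the m7G benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=0)

# Base learners mirror the table (LR, boosted trees, SVM); their out-of-fold
# predictions become the inputs of the LR metaclassifier.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("gbdt", GradientBoostingClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(Xtr, ytr)
print(f"held-out accuracy: {stack.score(Xte, yte):.3f}")
```

The appeal of stacking here is that the metaclassifier learns how much to trust each base learner, rather than averaging them with fixed weights.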
Hyperparameter optimization results of stacking ensemble classifier.
| Role | Classifier | Hyperparameter | Meaning | Search range | Optimal value |
|---|---|---|---|---|---|
| Base classifiers | LR | C1 | The reciprocal of the regularization coefficient | (1, 50) | 0.0181 |
| | LightGBM | learning_rate | Learning rate | (0.01, 1.0) | 0.2533 |
| | | max_depth | Maximum depth of the tree | (1, 50) | 12 |
| | | max_bin | Maximum number of bins into which feature values are bucketed | (10, 100) | 84 |
| | | boosting_type | Training method | gbdt; goss; dart | gbdt |
| | | num_leaves | Number of leaf nodes | (1, 50) | 10 |
| | | n_estimators | Number of iterations | (100, 600) | 255 |
| | SVM | C2 | Regularization constant that sets the penalty on estimation errors | (1, 50) | 1.1322 |
| | | Kernel | Kernel function used to map the raw feature space to a high-dimensional feature space | linear; sigmoid; poly; rbf | rbf |
| Metaclassifier | LR | C3 | The reciprocal of the regularization coefficient | (1, 50) | 35.5133 |
Result comparison with existing prediction methods.
| Methods | SN (%) | SP (%) | ACC (%) | MCC |
|---|---|---|---|---|
| iRNA-m7G (fusion)∗ | 89.1 | 90.7 | 89.9 | 0.798 |
| iRNA-m7G (PseDNC)∗ | 80.3 | 88.5 | 84.4 | 0.691 |
| iRNA-m7G (NPF)∗ | 89.1 | 90.7 | 89.9 | 0.798 |
| iRNA-m7G (SSC)∗ | 73.8 | 75.7 | 74.8 | 0.495 |
| m7GFinder∗ | 90.8 | 89.1 | 89.9 | 0.799 |
| m7G-IFL∗ | 92.4 | 92.6 | 92.5 | 0.850 |
| BERT-m7G | | | 95.48 | 0.9100 |
∗Results excerpted from [4].
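The SN, SP, ACC, and MCC columns used throughout these tables follow the standard confusion-matrix definitions, computed as below (the counts in the example are illustrative, not taken from the paper):

```python
import math

def confusion_metrics(tp: int, tn: int, fp: int, fn: int):
    """Sensitivity, specificity, accuracy, and Matthews correlation
    coefficient from the four confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sn, sp, acc, mcc

sn, sp, acc, mcc = confusion_metrics(tp=90, tn=92, fp=8, fn=10)
print(f"SN={sn:.2%} SP={sp:.2%} ACC={acc:.2%} MCC={mcc:.4f}")
```

MCC is the headline metric here because it stays informative even when the positive and negative site counts are unbalanced, unlike plain accuracy.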