Honglei Wang1,2,3, Hui Liu4,5, Tao Huang2, Gangshen Li1,2, Lin Zhang1,2, Yanjing Sun6,7.
Abstract
BACKGROUND: Recent research indicates that epi-transcriptome regulation through post-transcriptional RNA modifications is essential for many types of RNA. Accurate identification of RNA modification sites is vital for understanding their functions and regulatory mechanisms. However, traditional experimental methods for identifying RNA modification sites are complicated, time-consuming, and laborious. Machine learning approaches have been applied to extract and classify RNA sequence features computationally, and can complement experimental approaches more efficiently. Recently, convolutional neural networks (CNN) and long short-term memory (LSTM) networks have achieved success in modification site prediction owing to their powerful capacity for representation learning. However, a CNN can learn local responses from spatial data but cannot model sequential correlations, while an LSTM is specialized for sequential modeling and can access contextual information in both directions but lacks the spatial feature extraction of a CNN. For these reasons, there is strong motivation to construct a prediction framework combining natural language processing (NLP) and deep learning (DL).
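The complementary strengths described above suggest a hybrid pipeline in which convolutional filters extract local motif features and a bidirectional recurrence then models sequential context. Below is a minimal NumPy sketch of that idea, not the paper's actual DCB implementation; the layer sizes and the plain tanh recurrence (standing in for an LSTM) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # x: (seq_len, in_ch); w: (k, in_ch, out_ch); "valid" convolution along the sequence
    k = w.shape[0]
    steps = x.shape[0] - k + 1
    return np.stack([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                     for i in range(steps)])

def simple_rnn(x, wx, wh):
    # plain tanh recurrence standing in for an LSTM; returns all hidden states
    h = np.zeros(wh.shape[0])
    states = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh)
        states.append(h)
    return np.array(states)

# Toy sizes: 99 tokens (3-mers of a 101-nt window), 300-dim embeddings
n_tok, d_emb, d_conv, d_hid, k = 99, 300, 32, 16, 5
x = rng.normal(size=(n_tok, d_emb))

# CNN stage: local motif features with ReLU activation
feat = np.maximum(conv1d(x, 0.01 * rng.normal(size=(k, d_emb, d_conv))), 0.0)

# BiLSTM-like stage: forward and backward passes over the feature sequence
h_fwd = simple_rnn(feat, 0.1 * rng.normal(size=(d_conv, d_hid)),
                   0.1 * rng.normal(size=(d_hid, d_hid)))
h_bwd = simple_rnn(feat[::-1], 0.1 * rng.normal(size=(d_conv, d_hid)),
                   0.1 * rng.normal(size=(d_hid, d_hid)))[::-1]
h = np.concatenate([h_fwd, h_bwd], axis=1)   # bidirectional states, (95, 32)

# pooled sigmoid readout: one methylation probability per window
p = 1.0 / (1.0 + np.exp(-(h.mean(axis=0) @ (0.1 * rng.normal(size=2 * d_hid)))))
print(h.shape)
```

The convolution shortens the sequence (99 → 95 positions here), while the recurrence preserves length and adds left- and right-context to each position.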
Keywords: Deep learning; Natural language processing; Predictor; RNA modification site
Year: 2022 PMID: 35676633 PMCID: PMC9178860 DOI: 10.1186/s12859-022-04756-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Chemical structures of modifications. a m1A modification. b m6A modification
AUROC scores of RGloVe and GloVe under different sliding window sizes on the benchmark datasets

| Modification type | Encoding | Window size = 8 | Window size = 15 | Window size = 30 | Window size = 60 |
|---|---|---|---|---|---|
| m1A | RGloVe | 0.9283 | 0.9317 | 0.9315 | |
| | GloVe | 0.9282 | 0.9193 | 0.9305 | 0.9185 |
| m6A | RGloVe | 0.8414 | 0.8415 | 0.8407 | |
| | GloVe | 0.8399 | 0.8420 | 0.8414 | 0.8372 |

The bolded values represent the best results
Fig. 2 Performance of the different models through fivefold cross-validation. The models are CNNRGloVe, DCNNRGloVe, BiLSTMRGloVe, and DCBRGloVe. CNNRGloVe employs the CNN model in DeepPromise; DCBRGloVe denotes the self-built DCB model, comprising a DCNN stage and a BiLSTM stage; DCNNRGloVe denotes DCBRGloVe without the BiLSTM stage; BiLSTMRGloVe denotes DCBRGloVe without the DCNN stage
Evaluation results of the different models under fivefold cross-validation

| Modification type | Classifiers | AUROC | Acc (%) | Sn (%) | Sp (%) | MCC (%) | Pre (%) | F1 (%) | AUPRC | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| m1A | CNNRGloVe | 0.9248 | 94.06 | | 96.78 | 63.97 | 67.52 | 67.23 | 0.7147 | 127 |
| | DCNNRGloVe | 0.9305 | 94.22 | 58.01 | 97.84 | 61.97 | 72.88 | 64.60 | 0.7155 | 96 |
| | BiLSTMRGloVe | 0.9260 | 93.02 | 66.44 | 95.68 | 59.63 | 60.62 | 63.40 | 0.6980 | 2104 |
| | DCBRGloVe | | | 61.72 | | | | | | 1809 |
| m6A | CNNRGloVe | 0.8281 | 74.93 | 81.84 | 68.22 | 50.47 | 71.44 | 76.29 | 0.8009 | 5264 |
| | DCNNRGloVe | 0.8355 | 75.79 | 82.48 | 69.29 | 52.18 | 72.29 | 77.05 | 0.8103 | 18,732 |
| | BiLSTMRGloVe | 0.7885 | 71.42 | | 59.33 | 44.48 | 66.70 | 74.31 | 0.7564 | 131,340 |
| | DCBRGloVe | | | 79.30 | | | | | | 21,638 |

The bolded values represent the best results
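The column metrics in the tables above (Sn, Sp, Acc, MCC, Pre, F1) all derive from the confusion matrix. As a reference, a small helper computing them in the same percentage units as the tables, evaluated on hypothetical confusion counts:

```python
import numpy as np

def metrics(tp, fn, tn, fp):
    # confusion-matrix metrics used in the tables, reported as percentages
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    pre = tp / (tp + fp)                     # precision
    f1 = 2 * pre * sn / (pre + sn)           # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {k: round(100 * v, 2)
            for k, v in dict(Sn=sn, Sp=sp, Acc=acc, Pre=pre, F1=f1, MCC=mcc).items()}

# hypothetical confusion counts, for illustration only
res = metrics(tp=80, fn=20, tn=90, fp=10)
print(res)   # {'Sn': 80.0, 'Sp': 90.0, 'Acc': 85.0, 'Pre': 88.89, 'F1': 84.21, 'MCC': 70.35}
```

AUROC and AUPRC, by contrast, are threshold-free areas under the ROC and precision-recall curves, which is why they are reported as fractions rather than percentages.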
Fig. 3 Performance of the DCB model based on One-hot encoding, RNA word embedding, Word2vec, and RGloVe
Evaluation results of the DCB model based on One-hot encoding, RNA word embedding, Word2vec, and RGloVe

| Modification type | Classifiers | AUROC | Acc (%) | Sn (%) | Sp (%) | MCC (%) | Pre (%) | F1 (%) | AUPRC |
|---|---|---|---|---|---|---|---|---|---|
| m1A | DCBOne-hot | 0.9410 | 95.37 | 64.04 | 98.51 | 69.66 | 81.11 | 71.57 | 0.7812 |
| | DCBEmbedding | 0.9409 | 95.37 | | 98.33 | 70.0 | 79.79 | | 0.7715 |
| | DCBword2vec | 0.9316 | 95.29 | 61.4 | | 68.72 | | 70.35 | 0.7349 |
| | DCBRGloVe | | | 64.04 | 98.6 | | 82.02 | 71.92 | |
| m6A | DCBOne-hot | 0.8300 | 74.51 | 72.25 | | 49.06 | 73.87 | | 0.8080 |
| | DCBEmbedding | 0.8477 | | 83.30 | 69.79 | | 73.28 | 77.97 | 0.8272 |
| | DCBword2vec | 0.8317 | 75.10 | 79.60 | 70.62 | 50.43 | 72.95 | 76.13 | 0.8126 |
| | DCBRGloVe | | | 76.36 | 68.57 | 53.41 | 72.72 | | |

The bolded values represent the best results
Fig. 4 Performance of EMDLP and other methods on the independent test
Comparison of the EMDLP model with other methods on the independent test

| Modification type | Classifiers | AUROC | Acc (%) | Sn (%) | Sp (%) | MCC (%) | Pre (%) | F1 (%) | AUPRC |
|---|---|---|---|---|---|---|---|---|---|
| m1A | EDLm6APred | 0.9494 | 95.06 | 64.91 | 98.07 | 68.10 | 77.08 | 70.47 | 0.7773 |
| | DeepPromise | 0.9437 | 95.30 | 65.79 | 98.25 | 69.57 | 78.95 | 71.77 | 0.7893 |
| | DCBDeepPromise | 0.9529 | 95.61 | | 98.42 | | 81.05 | | 0.7809 |
| | EMDLP | | | 61.40 | | 70.69 | | 71.79 | |
| m6A | EDLm6APred | 0.8085 | 73.38 | 80.14 | 66.66 | 47.23 | 70.52 | 75.02 | 0.7905 |
| | DeepPromise | 0.8476 | | 82.15 | 45.00 | 54.43 | | 78.30 | 0.8258 |
| | DCBDeepPromise | 0.8501 | 76.76 | 81.89 | 44.95 | 53.81 | 74.19 | 77.85 | 0.8292 |
| | EMDLP | | 76.98 | | | | 73.44 | | |

The bolded values represent the best results
Fig. 5 Boxplots of eight metrics for comparative performance assessment of the four methods, based on the performance of 100 replications of each method. a m1A independent dataset. b m6A independent dataset
Statistical significance (p values) of the pairwise differences in performance of the four classifiers

| Modification type | Classifiers | EDLm6APred | DeepPromise | DCBDeepPromise | EMDLP |
|---|---|---|---|---|---|
| m1A | EDLm6APred | | | | |
| | DeepPromise | 6.80137E-27 | | | |
| | DCBDeepPromise | 2.14723E-11 | 5.22548E-34 | | |
| | EMDLP | 8.734E-20 | 4.51535E-37 | 0.01606677 | |
| m6A | EDLm6APred | | | | |
| | DeepPromise | 1.7731E-122 | | | |
| | DCBDeepPromise | 3.3248E-133 | 2.05181E-42 | | |
| | EMDLP | 8.6672E-142 | 6.72773E-87 | 3.06352E-20 | |
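The record does not state which paired test produced these p values; a common choice for comparing classifiers over repeated runs is a paired test on per-replication scores. A minimal sketch, assuming a paired t-test with a normal approximation (reasonable at n = 100) and synthetic AUROC scores:

```python
import numpy as np
from math import erf, sqrt

def paired_t_pvalue(a, b):
    # two-sided paired test on per-replication scores; the t distribution is
    # approximated by the standard normal, which is adequate for n = 100
    d = np.asarray(a) - np.asarray(b)
    t = d.mean() / (d.std(ddof=1) / sqrt(d.size))
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(t) / sqrt(2.0))))
    return t, p

rng = np.random.default_rng(42)
auc_a = 0.95 + 0.004 * rng.normal(size=100)            # synthetic AUROCs, model A
auc_b = auc_a - 0.003 + 0.004 * rng.normal(size=100)   # a slightly weaker model B
t, p = paired_t_pvalue(auc_a, auc_b)
print(t > 0, p < 0.01)
```

Pairing the scores replication-by-replication removes the shared run-to-run variance, which is what makes very small p values like those in the matrix attainable even for modest mean differences.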
Fig. 6 Screenshot of the EMDLP webserver. a Site input interface of EMDLP. b The prediction result returned by EMDLP
Statistics of the two RNA modification datasets
| Modification type | Dataset | Window size | Number of positive samples | Number of negative samples |
|---|---|---|---|---|
| m1A | m1A_BM | 101 | 593 | 5930 |
| m1A | m1A_IND | 101 | 114 | 1140 |
| m6A | m6A_BM | 1001 | 26,586 | 27,371 |
| m6A | m6A_IND | 1001 | 6879 | 6914 |
BM benchmark; IND independent
Input and output formats with three kinds of feature encoding
| Modification type | Encoding method | Input | Output |
|---|---|---|---|
| m1A | One-hot | 101 × 1 | 101 × 5 |
| | RNA word embedding | 99 × 3 | 99 × 300 |
| | RGloVe | 99 × 3 | 99 × 300 |
| m6A | One-hot | 1001 × 1 | 1001 × 5 |
| | RNA word embedding | 999 × 3 | 999 × 300 |
| | RGloVe | 999 × 3 | 999 × 300 |
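The shapes in the table above follow from the tokenization: one-hot encoding maps each of the 101 (or 1001) nucleotides to a 5-channel vector, while overlapping 3-mer tokenization of an L-nt window yields L − 2 tokens, each mapped to a 300-dimensional embedding. A small sketch of both steps; the 5-letter alphabet ACGUN (four ribonucleotides plus N) is an assumption:

```python
import numpy as np

ALPHABET = "ACGUN"   # assumed 5 one-hot channels: the 4 ribonucleotides plus N

def one_hot(seq):
    # map an L-nt sequence to an (L, 5) one-hot matrix
    m = np.zeros((len(seq), len(ALPHABET)))
    m[np.arange(len(seq)), [ALPHABET.index(b) for b in seq]] = 1.0
    return m

def kmers(seq, k=3):
    # overlapping k-mer tokens: a length-L sequence yields L - k + 1 tokens
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ACGU" * 25 + "A"          # toy 101-nt window around a candidate site
print(one_hot(seq).shape)         # (101, 5)
toks = kmers(seq)
print(len(toks), toks[:2])        # 99 ['ACG', 'CGU']
```

Each of the 99 (or 999) 3-mer tokens is then looked up in a 300-dimensional embedding table (RNA word embedding or RGloVe), giving the 99 × 300 and 999 × 300 inputs listed in the table.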
Fig. 7 Structure of our computational framework based on RGloVe, DCNN, and BiLSTM neural networks to predict m1A methylation sites
Fig. 8 Structure of the EMDLP predictor. The diagram depicts our method's architecture: three different DL classifiers predict the methylation sequences, and the final result is decided by soft voting
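Soft voting, as described in the caption, averages the class probabilities of the individual classifiers and thresholds the mean. A minimal sketch with hypothetical probabilities from three classifiers:

```python
import numpy as np

def soft_vote(prob_matrix, threshold=0.5):
    # prob_matrix: (n_models, n_sites) predicted methylation probabilities;
    # soft voting averages the probabilities, then thresholds the mean
    mean_p = prob_matrix.mean(axis=0)
    return mean_p, (mean_p >= threshold).astype(int)

# hypothetical probabilities from three classifiers for four candidate sites
probs = np.array([[0.9, 0.4, 0.2, 0.55],
                  [0.8, 0.6, 0.1, 0.45],
                  [0.7, 0.3, 0.3, 0.65]])
mean_p, labels = soft_vote(probs)
print(labels)   # [1 0 0 1]
```

Unlike hard (majority) voting on predicted labels, soft voting keeps each model's confidence, so a site that one model finds borderline can still be rescued by the others' stronger probabilities.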