Xin Cheng, Jun Wang, Qianyue Li, Taigang Liu.
Abstract
An important cause of cancer proliferation is a change in DNA methylation patterns, characterized by localized hypermethylation of the promoters of tumor-suppressor genes together with an overall decrease in the level of 5-methylcytosine (5mC). Identifying the 5mC sites in promoters is therefore a critical step towards understanding the diverse functions of DNA methylation in genetic diseases such as cancers and in aging. However, most wet-lab experimental techniques for detecting 5mC sites are time-consuming and laborious. In this study, we proposed a deep learning-based approach, called BiLSTM-5mC, for accurately identifying 5mC sites in genome-wide DNA promoters. First, we randomly divided the negative samples into 11 subsets of equal size, each of which was combined with an equal number of positive samples to form a balanced subset. Then, two types of feature vectors, one encoded by the one-hot method and the other by the nucleotide property and frequency (NPF) method, were fed into a bidirectional long short-term memory (BiLSTM) network followed by a fully connected layer to train 22 submodels. Finally, the outputs of these submodels were integrated by a majority vote strategy to predict 5mC sites. Our experimental results demonstrated that BiLSTM-5mC outperformed existing methods on the same independent dataset.
Keywords: 5-methylcytosine sites; bidirectional long short-term memory; majority vote; nucleotide property and frequency; one-hot encoding
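The balancing and voting steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `make_balanced_subsets` and `majority_vote` are hypothetical, the way positives are matched to each negative chunk is an assumption, and resolving ties towards the negative class is also an assumption (with 22 submodels an 11 to 11 tie is possible, and the abstract does not say how it is handled).

```python
import numpy as np

def make_balanced_subsets(pos_idx, neg_idx, n_subsets=11, seed=0):
    """Split shuffled negatives into n_subsets chunks and pair each chunk
    with an equally sized set of positives (assumed balancing scheme)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.asarray(pos_idx)
    chunks = np.array_split(rng.permutation(neg_idx), n_subsets)
    subsets = []
    for chunk in chunks:
        # Use the smaller class size so both classes have the same count.
        n = min(len(pos_idx), len(chunk))
        pos = rng.choice(pos_idx, size=n, replace=False)
        subsets.append((pos, chunk[:n]))
    return subsets

def majority_vote(binary_preds):
    """Combine 0/1 predictions of shape (n_models, n_samples).
    A sample is called positive only if more than half of the models
    vote positive; ties go to the negative class (an assumption)."""
    votes = np.asarray(binary_preds)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

if __name__ == "__main__":
    subsets = make_balanced_subsets(np.arange(5), np.arange(100, 155))
    print(len(subsets), [len(p) for p, _ in subsets][:3])
    print(majority_vote([[1, 0, 1], [1, 0, 0], [0, 1, 1]]))
```

Each of the 11 balanced subsets would be encoded twice (one-hot and NPF), giving the 22 submodels whose binary outputs are passed to `majority_vote`.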
Year: 2021 PMID: 34946497 PMCID: PMC8704614 DOI: 10.3390/molecules26247414
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The workflow of the proposed BiLSTM-5mC model for the prediction of 5mC sites.
Figure 2. Nucleotide composition preferences between positive and negative samples in the benchmark datasets. (a) Nucleotide distribution around 5mC sites in the training dataset. (b) Nucleotide distribution around 5mC sites in the testing dataset.
Figure 3. Performance comparison of different feature encoding methods for the prediction of 5mC sites. (a) Performance on the training dataset using 5-fold CV. (b) Performance on the independent testing dataset.
Figure 4. ROC curves of different predictors with different features. (a) ROC curve based on the one-hot features. (b) ROC curve based on the NPF features. (c) ROC curve based on the one-hot+NPF features. (d) ROC curve based on the one-hot&NPF features.
Summary of existing tools for 5mC site prediction in genome-wide DNA promoters.
| Method | Feature | Algorithm |
|---|---|---|
| iPromoter-5mC | One-hot | Deep neural network |
| 5mC_Pred | K-mers | XGBoost |
| BiLSTM-5mC (This study) | One-hot and NPF | BiLSTM |
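The two feature encodings listed for BiLSTM-5mC can be illustrated with a short sketch. The one-hot mapping is standard. The NPF mapping is an assumption: it uses the widely used chemical-property code (ring structure, functional group, hydrogen bonding) plus the running frequency of each nucleotide up to its position, which may differ in detail from the scheme used in the study. The function names are hypothetical.

```python
import numpy as np

# Standard one-hot code for DNA bases.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

# Assumed chemical-property code:
# (ring structure, functional group, hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

def one_hot_encode(seq):
    """Return an (L, 4) matrix; unknown bases become all zeros."""
    return np.array([ONE_HOT.get(b, (0, 0, 0, 0)) for b in seq.upper()],
                    dtype=np.float32)

def npf_encode(seq):
    """Return an (L, 4) matrix: three property bits plus the frequency
    of the current base among positions 1..i (assumed NPF definition)."""
    counts = {}
    rows = []
    for i, b in enumerate(seq.upper(), start=1):
        counts[b] = counts.get(b, 0) + 1
        rows.append(NCP.get(b, (0, 0, 0)) + (counts[b] / i,))
    return np.array(rows, dtype=np.float32)

if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    print(npf_encode("AAC"))
```

Both encoders return one row per nucleotide, so either matrix can be passed to a sequence model as a length-L series of 4-dimensional vectors.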
Performance comparison on the training dataset by using the 5-fold CV.
| Method | Sen | Spe | Acc | MCC | AUC |
|---|---|---|---|---|---|
| iPromoter-5mC | 0.8746 | 0.9039 | 0.9016 | 0.5743 | 0.9566 |
| 5mC_Pred | 0.8990 | 0.9200 | 0.9180 | 0.6260 | 0.9620 |
| BiLSTM-5mC | 0.8096 | 0.9404 | 0.9302 | 0.6235 | 0.9644 |
Performance comparison on the independent testing dataset.
| Method | Sen | Spe | Acc | MCC | AUC |
|---|---|---|---|---|---|
| iPromoter-5mC | 0.8777 | 0.9042 | 0.9022 | 0.5771 | 0.9570 |
| 5mC_Pred | 0.8950 | 0.9200 | 0.9180 | 0.6250 | 0.9620 |
| BiLSTM-5mC | 0.8661 | 0.9374 | 0.9303 | 0.6384 | 0.9635 |
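For readers unfamiliar with the column headers in the two comparison tables above, the sketch below computes sensitivity (Sen), specificity (Spe), accuracy (Acc), and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)                    # true positive rate
    spe = tn / (tn + fp)                    # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc}

if __name__ == "__main__":
    # Hypothetical counts for an imbalanced test set (10 positives, 90 negatives).
    print(binary_metrics(tp=8, fn=2, tn=85, fp=5))
```

Because the datasets are heavily imbalanced, Acc is dominated by the negative class, which is why MCC is reported alongside it.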
Summary of the experimental datasets.
| Dataset | Positive Sample | Negative Sample |
|---|---|---|
| Training dataset | 55,800 | 658,861 |
| Testing dataset | 13,950 | 164,715 |
| Total | 69,750 | 823,576 |
Figure 5. The structure of the BiLSTM-5mC framework.