Waleed Alam, Hilal Tayara, Kil To Chong.
Abstract
DNA is subject to epigenetic modification by the molecule N4-methylcytosine (4mC). N4-methylcytosine plays a crucial role in DNA repair and replication, protects host DNA from degradation, and regulates DNA expression. Although current experimental techniques can identify 4mC sites, they are expensive and laborious. Computational tools that can predict 4mC sites would therefore be very useful for understanding the biological mechanism of this vital type of DNA modification. Conventional machine-learning-based methods rely on hand-crafted features, whereas the proposed method saves time and computational cost by using learned features instead. In this study, we propose i4mC-Deep, an intelligent predictor based on a convolutional neural network (CNN) that predicts 4mC modification sites in DNA samples. The CNN automatically extracts important features from the input samples during training. Nucleotide chemical properties and nucleotide density, which together represent a DNA sequence, serve as the CNN input. The proposed method outperforms several state-of-the-art predictors. When i4mC-Deep was used to analyze G. subterraneus DNA, accuracy improved by 3.9% and MCC increased by 10.5% compared to a conventional predictor.
Keywords: CNN; DNA methylation; deep learning; regulate expression
Year: 2021 PMID: 34440291 PMCID: PMC8393747 DOI: 10.3390/genes12081117
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
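The abstract states that nucleotide chemical properties (NCP) and nucleotide density (ND) together form the CNN input. The following is a minimal sketch of such an encoding, not the authors' code. It assumes the NCP convention commonly used in the literature (ring structure, functional group, hydrogen-bond strength) and defines the density at position i as the number of occurrences of that nucleotide in the first i positions divided by i. The function name `encode_sequence` and the example sequence are hypothetical.

```python
# Assumed NCP mapping: (ring structure, functional group, hydrogen bonding).
# This is a widely used convention, not a detail taken from this record.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_sequence(seq):
    """Return one 4-value row per nucleotide: three NCP bits plus density."""
    counts = {}
    rows = []
    for i, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        # Density: occurrences of this base so far, divided by the prefix length.
        density = counts[base] / i
        rows.append([*NCP[base], density])
    return rows


# Hypothetical five-nucleotide example.
encoded = encode_sequence("ACGTA")
for base, row in zip("ACGTA", encoded):
    print(base, row)
```

Each sequence of length L becomes an L x 4 matrix under these assumptions, which is a shape a one-dimensional convolution can consume directly.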
Figure 1. Data flow and architecture of the proposed model.
Summary of the benchmark datasets for the six species.
| Species | Class | Sequences | Total |
|---|---|---|---|
| | Positive | 1554 | 3108 |
| | Negative | 1554 | |
| | Positive | 1769 | 3538 |
| | Negative | 1769 | |
| | Positive | 1978 | 3956 |
| | Negative | 1978 | |
| | Positive | 388 | 776 |
| | Negative | 388 | |
| | Positive | 906 | 1812 |
| | Negative | 906 | |
| | Positive | 569 | 1138 |
| | Negative | 569 | |
The ranges of the tuned hyper-parameters.
| Hyper-parameter | Range |
|---|---|
| Conv1D filters | [8, 16, 32] |
| Conv1D kernel size | [3, 5, 7] |
| Conv1D strides | [2, 3] |
| Dropout | [0.2, 0.3, 0.4, 0.5] |
| Dense layer units | [8, 16, 32] |
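To give a sense of the size of this search space, the sketch below enumerates every combination of the listed values. It assumes an exhaustive grid, which this record does not confirm; the tuning strategy actually used is not stated here. The dictionary keys and the helper `iter_configs` are hypothetical names chosen for illustration.

```python
from itertools import product

# Ranges copied from the hyper-parameter table; key names are illustrative.
search_space = {
    "conv1d_filters": [8, 16, 32],
    "conv1d_kernel_size": [3, 5, 7],
    "conv1d_strides": [2, 3],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "dense_units": [8, 16, 32],
}


def iter_configs(space):
    """Yield every combination of the listed values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 3 * 3 * 2 * 4 * 3 = 216 combinations
print(configs[0])
```

With 216 candidate configurations, an exhaustive search is feasible for a small CNN, though a random or staged search over the same ranges would also be a reasonable choice.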
Performance comparison between i4mC-Deep and existing computational tools for 4mC site prediction (ACC: accuracy; SN: sensitivity; SP: specificity; MCC: Matthews correlation coefficient).
| Datasets | Methods | ACC | SN | SP | MCC |
|---|---|---|---|---|---|
| | iDNA4mC | 0.786 | 0.797 | 0.775 | 0.572 |
| | 4mCPred | 0.826 | 0.825 | 0.826 | 0.652 |
| | 4mCPred-SVM | 0.815 | 0.824 | 0.807 | 0.631 |
| | 4mCCNN | 0.842 | 0.894 | 0.825 | 0.694 |
| | DeepTorrent | 0.858 | 0.810 | 0.906 | 0.719 |
| | SOMM4mC | 0.876 | 0.839 | | 0.743 |
| | i4mC-Deep | | | 0.898 | |
| | iDNA4mC | 0.812 | 0.833 | 0.791 | 0.625 |
| | 4mCPred | 0.822 | 0.824 | 0.821 | 0.646 |
| | 4mCPred-SVM | 0.830 | 0.838 | 0.822 | 0.661 |
| | 4mCCNN | 0.853 | 0.864 | 0.853 | 0.686 |
| | DeepTorrent | 0.861 | 0.834 | 0.889 | 0.724 |
| | SOMM4mC | 0.874 | 0.862 | 0.886 | 0.724 |
| | i4mC-Deep | | | | |
| | iDNA4mC | 0.760 | 0.757 | 0.762 | 0.519 |
| | 4mCPred | 0.768 | 0.755 | 0.780 | 0.536 |
| | 4mCPred-SVM | 0.787 | 0.778 | 0.796 | 0.573 |
| | 4mCCNN | 0.797 | 0.803 | 0.792 | 0.621 |
| | DeepTorrent | 0.803 | 0.703 | | 0.620 |
| | SOMM4mC | 0.836 | 0.800 | 0.872 | 0.647 |
| | i4mC-Deep | | | 0.861 | |
| | iDNA4mC | 0.799 | 0.820 | 0.778 | 0.598 |
| | 4mCPred | 0.826 | 0.819 | 0.832 | 0.655 |
| | 4mCPred-SVM | 0.833 | 0.858 | 0.807 | 0.666 |
| | 4mCCNN | 0.859 | 0.881 | 0.788 | 0.687 |
| | DeepTorrent | 0.873 | 0.891 | 0.855 | 0.747 |
| | SOMM4mC | 0.918 | 0.903 | | 0.853 |
| | i4mC-Deep | | | 0.922 | |
| | iDNA4mC | 0.815 | 0.822 | 0.808 | 0.630 |
| | 4mCPred | 0.828 | 0.818 | 0.837 | 0.662 |
| | 4mCPred-SVM | 0.837 | 0.840 | 0.834 | 0.674 |
| | 4mCCNN | 0.860 | 0.851 | 0.843 | 0.703 |
| | DeepTorrent | 0.880 | 0.813 | | 0.768 |
| | SOMM4mC | 0.876 | 0.864 | 0.888 | 0.728 |
| | i4mC-Deep | | | 0.926 | |
| | iDNA4mC | 0.831 | 0.824 | 0.838 | 0.663 |
| | 4mCPred | 0.830 | 0.850 | 0.810 | 0.668 |
| | 4mCPred-SVM | 0.860 | 0.863 | 0.858 | 0.721 |
| | 4mCCNN | 0.871 | 0.857 | 0.893 | 0.750 |
| | DeepTorrent | 0.894 | 0.831 | | 0.795 |
| | SOMM4mC | 0.903 | 0.895 | 0.911 | 0.772 |
| | i4mC-Deep | | | 0.938 | |
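The four columns in the comparison table are standard binary-classification metrics. As a reference for how they relate to a confusion matrix, here is a minimal sketch using the usual textbook definitions; the counts passed in the example are hypothetical and do not come from any dataset in this study.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, SP and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)  # sensitivity: recall on the positive class
    sp = tn / (tn + fp)  # specificity: recall on the negative class
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts for a balanced test set of 200 samples.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it can rank predictors differently from accuracy even on balanced datasets such as these.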
Figure 2. Performance comparison of the proposed tool and other existing state-of-the-art tools.
Figure 3. Receiver operating characteristic (ROC) curves on the test dataset for the ten folds, with their standard deviation, for the six species.
Figure 4. t-SNE visualization of the features learned by the proposed model on the G. subterraneus dataset. “0” marks the features of the negative samples and “1” marks the features of the positive samples.
Figure 5. Heatmap visualization of in silico mutation for G. subterraneus.
Figure 6. Effect of the mutations on the prediction probability in G. subterraneus.
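In silico mutation, as visualized in Figures 5 and 6, generally means substituting each position with each possible nucleotide and recording how the predicted probability changes. The sketch below illustrates that general idea only. The scorer `toy_predict` is a deliberately trivial stand-in (fraction of C in the sequence) and is not the i4mC-Deep model; `mutation_effects` and the four-nucleotide input are likewise hypothetical.

```python
NUCLEOTIDES = "ACGT"


def toy_predict(seq):
    """Stand-in scorer (NOT the paper's model): fraction of C in the sequence."""
    return seq.count("C") / len(seq)


def mutation_effects(seq, predict):
    """Map each nucleotide to a list of score changes, one per position."""
    baseline = predict(seq)
    effects = {nt: [] for nt in NUCLEOTIDES}
    for pos in range(len(seq)):
        for nt in NUCLEOTIDES:
            # Substitute a single position and compare against the baseline.
            mutated = seq[:pos] + nt + seq[pos + 1:]
            effects[nt].append(predict(mutated) - baseline)
    return effects


effects = mutation_effects("ACGT", toy_predict)
for nt in NUCLEOTIDES:
    print(nt, effects[nt])
```

The resulting 4 x L grid of score changes is the kind of data a mutation heatmap displays, with rows for the substituted nucleotide and columns for the sequence position.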
Figure 7. Web-server window where users can paste DNA sequences in FASTA format directly to predict 4mC sites.
Figure 8. Web-server window where users can upload DNA sequences as a FASTA file.
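Both web-server entry points take FASTA input. For readers preparing such input, here is a minimal sketch of reading FASTA text into header and sequence pairs. It reflects the general FASTA convention (a `>` header line followed by one or more sequence lines) and makes no claim about what the server itself validates or accepts; `parse_fasta` and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may wrap; join them and normalize to upper case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input.
example = """>seq1 hypothetical record
acgtac
GTCA
>seq2
TTGCA
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, seq)
```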