Hasan Zulfiqar, Qin-Lai Huang, Hao Lv, Zi-Jie Sun, Fu-Ying Dao, Hao Lin.
Abstract
4mC (N4-methylcytosine) is a DNA modification that helps regulate multiple biological processes, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The fused features were optimized using correlation and a gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. These optimized features were then fed into a 1D convolutional neural network (CNN) to distinguish 4mC sites from non-4mC sites in Geobacter pickeringii. On independent data, the proposed model achieved an accuracy of 0.868, which was 4.2 percentage points higher than that of the existing model.
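The two descriptors named in the abstract can be illustrated with a minimal sketch. It assumes 41-nt input sequences and k-mer sizes k = 1 to 6; neither value is stated in this record, but both are consistent with the feature counts reported in the results table (41 × 4 = 164 binary features, 4 + 16 + ... + 4096 = 5460 k-mer features). The function names and the sample sequence are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def binary_encode(seq):
    """One-hot encode each nucleotide as 4 bits (A=1000, C=0100, G=0010, T=0001)."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec


def kmer_composition(seq, k_values=(1, 2, 3, 4, 5, 6)):
    """Concatenate normalized k-mer frequencies for every k in k_values (assumed range)."""
    seq = seq.upper()
    vec = []
    for k in k_values:
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = len(seq) - k + 1
        for i in range(windows):
            word = seq[i:i + k]
            if word in counts:  # skip windows containing non-ACGT symbols
                counts[word] += 1
        vec.extend(counts[m] / windows if windows > 0 else 0.0 for m in kmers)
    return vec


def fused_features(seq):
    """Fusion here is plain concatenation of the two descriptors."""
    return binary_encode(seq) + kmer_composition(seq)


# Hypothetical 41-nt sample sequence, not taken from the study's data.
sample = "ACGT" * 10 + "C"
features = fused_features(sample)
print(len(binary_encode(sample)), len(kmer_composition(sample)), len(features))
# prints: 164 5460 5624
```

Under these assumptions the fused vector has 5624 entries, matching the total from which the 871 and then 50 features are selected.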
Keywords: algorithm; alteration; deep learning; features vector; genomics
Year: 2022 PMID: 35163174 PMCID: PMC8836036 DOI: 10.3390/ijms23031251
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Flowchart of the whole study.
Figure 2. (A,B) The IFS technique for recognizing 4mC sites. Initially, the 871 best features were selected from a total of 5624 by correlation measures (A). A further-optimized set of 50 features was then obtained from these 871 using GBDT with 10-fold CV; the Acc increased from 0.894 to 0.908 (B). (C) AUROC curve of Deep-4mCGP on 10-fold CV. (D) Nucleotide distribution around the modification site. (E) Performance comparison of Deep-4mCGP with 4mCCNN on 10-fold cross-validation. (F) AUROC of predictors on training and independent data.
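The rank-then-grow idea behind IFS can be sketched in a few lines. This is an illustration only: the study ranks features by correlation and GBDT and scores subsets with 10-fold CV, whereas the sketch below ranks by absolute Pearson correlation with the label and scores subsets with the resubstitution accuracy of a nearest-centroid classifier, both chosen as simple stand-ins. All function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(X, y):
    """Rank column indices by |Pearson r| with the label, strongest first."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def centroid_accuracy(X, y, cols):
    """Stand-in scorer: resubstitution accuracy of a nearest-centroid classifier."""
    centroids = {}
    for label in set(y):
        rows = [[row[j] for j in cols] for row, t in zip(X, y) if t == label]
        centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
    correct = 0
    for row, t in zip(X, y):
        point = [row[j] for j in cols]
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(point, centroids[lab])),
        )
        correct += pred == t
    return correct / len(y)


def incremental_feature_selection(X, y, ranking, scorer):
    """Add ranked features one at a time; keep the smallest best-scoring prefix."""
    best_score, best_k = -1.0, 0
    for k in range(1, len(ranking) + 1):
        score = scorer(X, y, ranking[:k])
        if score > best_score:
            best_score, best_k = score, k
    return ranking[:best_k], best_score


# Toy data: column 1 is informative, column 0 is noise, column 2 is constant.
X = [[0.5, 0.1, 1.0], [0.4, 0.2, 1.0], [0.5, 0.9, 1.0], [0.4, 0.8, 1.0]]
y = [0, 0, 1, 1]
ranking = rank_features(X, y)
selected, score = incremental_feature_selection(X, y, ranking, centroid_accuracy)
print(selected, score)
```

In a setting closer to the study, the scorer would be replaced by a cross-validated classifier, and the curve of score against subset size would correspond to the plots in panels (A) and (B).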
Performance of models based on single encodings and their fusion on training and independent data, using different classification algorithms. Bold is used to highlight the best results.
| Algorithm | FS | Method | Training Data | | | | | Independent Data | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM | 5460 | k-mer | 0.861 | 0.872 | 0.861 | 0.811 | 0.943 | 0.825 | 0.820 | 0.812 | 0.819 | 0.882 |
| | 164 | Binary | 0.834 | 0.828 | 0.837 | 0.838 | 0.875 | 0.801 | 0.804 | 0.798 | 0.801 | 0.872 |
| | 5624 | Fusion | 0.868 | 0.865 | 0.859 | 0.862 | 0.937 | 0.810 | 0.814 | 0.808 | 0.813 | 0.902 |
| | 871 | Fusion | 0.859 | 0.857 | 0.847 | 0.857 | 0.925 | 0.808 | 0.801 | 0.807 | 0.800 | 0.876 |
| | 50 | Fusion | 0.884 | 0.878 | 0.881 | 0.879 | 0.959 | 0.841 | 0.842 | 0.839 | 0.842 | 0.921 |
| RF | 5460 | k-mer | 0.831 | 0.862 | 0.758 | 0.664 | 0.936 | 0.809 | 0.838 | 0.761 | 0.648 | 0.909 |
| | 164 | Binary | 0.772 | 0.763 | 0.755 | 0.770 | 0.863 | 0.753 | 0.748 | 0.753 | 0.756 | 0.832 |
| | 5624 | Fusion | 0.844 | 0.847 | 0.839 | 0.845 | 0.891 | 0.795 | 0.788 | 0.783 | 0.794 | 0.887 |
| | 871 | Fusion | 0.847 | 0.849 | 0.851 | 0.846 | 0.897 | 0.801 | 0.800 | 0.800 | 0.798 | 0.878 |
| | 50 | Fusion | 0.866 | 0.858 | 0.861 | 0.854 | 0.915 | 0.812 | 0.808 | 0.814 | 0.812 | 0.898 |
| GBDT | 5460 | k-mer | 0.848 | 0.881 | 0.776 | 0.676 | 0.962 | 0.828 | 0.861 | 0.770 | 0.669 | 0.931 |
| | 164 | Binary | 0.827 | 0.821 | 0.823 | 0.827 | 0.895 | 0.782 | 0.778 | 0.779 | 0.781 | 0.862 |
| | 5624 | Fusion | 0.835 | 0.832 | 0.830 | 0.832 | 0.893 | 0.786 | 0.780 | 0.786 | 0.786 | 0.882 |
| | 871 | Fusion | 0.851 | 0.853 | 0.848 | 0.854 | 0.901 | 0.814 | 0.810 | 0.815 | 0.810 | 0.893 |
| | 50 | Fusion | 0.875 | 0.874 | 0.868 | 0.860 | 0.945 | 0.836 | 0.835 | 0.830 | 0.841 | 0.920 |
| CNN | 5460 | k-mer | 0.880 | 0.879 | 0.887 | 0.880 | 0.949 | 0.848 | 0.844 | 0.841 | 0.845 | 0.927 |
| | 164 | Binary | 0.868 | 0.836 | 0.834 | 0.832 | 0.928 | 0.798 | 0.802 | 0.807 | 0.790 | 0.881 |
| | 5624 | Fusion | 0.868 | 0.865 | 0.859 | 0.862 | 0.937 | 0.810 | 0.814 | 0.808 | 0.813 | 0.903 |
| | 871 | Fusion | 0.894 | 0.877 | 0.897 | 0.889 | 0.955 | 0.846 | 0.845 | 0.841 | 0.838 | 0.920 |
| | 50 | Fusion | | | | | | | | | | |
Performance comparison of Deep-4mCGP with 4mCCNN.
| Predictor | | | | | | | Reference |
|---|---|---|---|---|---|---|---|
| 4mCCNN | 10 (folds) | 0.871 | 0.857 | 0.893 | 0.750 | 0.921 | [ |
| Deep-4mCGP | 10 (folds) | 0.908 | 0.914 | 0.910 | 0.908 | 0.986 | Deep-4mCGP |
| 4mCCNN | Test (Ind) | 0.826 | 0.818 | 0.823 | 0.825 | 0.920 | [ |
| Deep-4mCGP | Test (Ind) | 0.868 | 0.876 | 0.773 | 0.859 | 0.961 | Deep-4mCGP |
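The headline comparison can be checked with simple arithmetic. The sketch below uses only the first metric column of the comparison table, which the abstract identifies as accuracy for the independent test (0.868). It reports both the absolute gap in percentage points and the relative gain, since the two are easy to confuse; the dictionary layout and function name are illustrative.

```python
# Values copied from the first metric column of the comparison table.
scores = {
    ("4mCCNN", "10-fold CV"): 0.871,
    ("Deep-4mCGP", "10-fold CV"): 0.908,
    ("4mCCNN", "independent"): 0.826,
    ("Deep-4mCGP", "independent"): 0.868,
}


def improvement(setting):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    base = scores[("4mCCNN", setting)]
    new = scores[("Deep-4mCGP", setting)]
    absolute_points = (new - base) * 100
    relative_percent = (new - base) / base * 100
    return round(absolute_points, 1), round(relative_percent, 1)


print(improvement("independent"))  # (4.2, 5.1)
print(improvement("10-fold CV"))   # (3.7, 4.2)
```

The 4.2 figure quoted in the abstract therefore corresponds to the absolute gap on independent data, not to a relative increase.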
Parameters employed for each classifier in the TensorFlow 2.1.0 implementation.
| Classifier | Parameters |
|---|---|
| RF | N-estimators = 100, Learning-rate = 0.001, Mean absolute error = 0.143, Mean square error = 0.220 |
| GBDT | N-estimators = 120, Learning-rate = 0.01, Mean absolute error = 0.117, Mean square error = 0.212 |
| LSTM | nn.LSTM(input_size = feature_size, hidden_size = 128) |
| CNN | nn.Conv1d(in_channels = feature_size, out_channels = 32, padding = valid, strides = 1, kernel_size = 2) |
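To make the convolution settings in the last row concrete, here is a dependency-free sketch of what valid padding, stride 1, and kernel size 2 imply for a single channel. It assumes, purely for illustration, that the 50 selected features are laid out along the length axis; the table's `in_channels = feature_size` suggests the authors may have arranged the tensor differently, so this shows the effect of the parameters, not the study's actual layer. Function names are hypothetical.

```python
def conv1d_output_length(length, kernel_size=2, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    if length < kernel_size:
        raise ValueError("input shorter than kernel")
    return (length - kernel_size) // stride + 1


def conv1d_valid(signal, kernel, stride=1):
    """Naive single-channel 1D cross-correlation with valid padding."""
    k = len(kernel)
    n_out = conv1d_output_length(len(signal), k, stride)
    return [
        sum(signal[i * stride + j] * kernel[j] for j in range(k))
        for i in range(n_out)
    ]


print(conv1d_output_length(50))             # 49
print(conv1d_valid([1, 2, 3, 4], [1, -1]))  # [-1, -1, -1]
```

With 32 output channels, a layer of this kind would produce 32 such sequences, one per learned kernel.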