Gaofeng Pan, Limin Jiang, Jijun Tang, Fei Guo.
Abstract
DNA methylation is an important biochemical process that is closely connected with many types of cancer. Research on DNA methylation can help us understand its regulation mechanism and epigenetic reprogramming, so recognizing methylation sites in DNA sequences has become very important. Over the past several decades, and particularly since high-throughput sequencing technology became widely used in research and industry, many computational methods, especially machine learning methods, have been developed. To accurately identify whether a nucleotide residue is methylated in a specific DNA sequence context, we propose a novel method that overcomes the shortcomings of previous methods for predicting methylation sites. We use k-gram, multivariate mutual information, discrete wavelet transform, and pseudo amino acid composition to extract features, and train a sparse Bayesian learning model to predict DNA methylation. Five criteria are used to evaluate the prediction results of our method: area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), accuracy (ACC), sensitivity (SN), and specificity (SP). On the benchmark dataset, our method reached 0.8632 on AUC, 0.8017 on ACC, 0.5558 on MCC, and 0.7268 on SN. Additionally, the best results on two scBS-seq profiled mouse embryonic stem cell datasets were 0.8896 and 0.9511 by AUC, respectively. Compared with other outstanding methods, our method achieved higher prediction accuracy; the improvement in AUC over the other methods was at least 0.0399. For the convenience of other researchers, our code has been uploaded to a file hosting service and can be downloaded from: https://figshare.com/s/0697b692d802861282d3.
Keywords: DNA methylation; PseAAC; Sparse Bayesian learning; discrete wavelet transform; feature selection; k-gram; multivariate mutual information; scBS-seq profiled mouse embryonic stem cells; support vector machine
Year: 2018 PMID: 29419752 PMCID: PMC5855733 DOI: 10.3390/ijms19020511
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The flow chart of our method. DWT: discrete wavelet transform; mESC: mouse embryonic stem cell; MMI: multivariate mutual information; PseAAC: pseudo amino acid composition; CD-HIT: Cluster Database at High Identity with Tolerance; SMOTE: Synthetic Minority Over-sampling Technique.
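The k-gram step named in the abstract can be illustrated with a minimal sketch. It assumes that k-gram features are the normalized frequencies of all length-k words over the alphabet A, C, G, T; the function name `kgram_features`, the default `k=2`, and the handling of ambiguous symbols are illustrative assumptions, not details taken from the paper.

```python
from itertools import product


def kgram_features(seq, k=2, alphabet="ACGT"):
    """Return the normalized frequency of every length-k word (assumed definition)."""
    # Enumerate all 4**k words in a fixed order so the feature vector is stable.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    # Avoid division by zero for sequences shorter than k.
    return [counts[w] / total if total else 0.0 for w in words]


features = kgram_features("ACGT", k=2)
```

With `k=2` the vector has 16 entries, one per dinucleotide; larger values of k grow the vector as 4^k.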
The original values of six physical structural properties.
| 2-Nucleotides | Twist | Tilt | Roll | Shift | Slide | Rise |
|---|---|---|---|---|---|---|
| | 0.026 | 0.038 | 0.020 | 1.69 | 2.26 | 7.65 |
| | 0.036 | 0.038 | 0.023 | 1.32 | 3.03 | 8.93 |
| | 0.031 | 0.037 | 0.019 | 1.46 | 2.03 | 7.08 |
| | 0.033 | 0.036 | 0.022 | 1.03 | 3.83 | 9.07 |
| | 0.016 | 0.025 | 0.017 | 1.07 | 1.78 | 6.38 |
| | 0.026 | 0.042 | 0.019 | 1.43 | 1.65 | 8.04 |
| | 0.014 | 0.026 | 0.016 | 1.08 | 2.00 | 6.23 |
| | 0.031 | 0.037 | 0.019 | 1.46 | 2.03 | 7.08 |
| | 0.025 | 0.038 | 0.020 | 1.32 | 1.93 | 8.56 |
| | 0.025 | 0.036 | 0.026 | 1.20 | 2.61 | 9.53 |
| | 0.026 | 0.042 | 0.019 | 1.43 | 1.65 | 8.04 |
| | 0.036 | 0.038 | 0.023 | 1.32 | 3.03 | 8.93 |
| | 0.017 | 0.018 | 0.016 | 0.72 | 1.20 | 6.23 |
| | 0.025 | 0.038 | 0.020 | 1.32 | 1.93 | 8.56 |
| | 0.016 | 0.025 | 0.017 | 1.07 | 1.78 | 6.38 |
| | 0.026 | 0.038 | 0.020 | 1.69 | 2.26 | 7.65 |
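The caption calls these the "original" values, which suggests they are rescaled before being used as features. The sketch below assumes z-score standardization per property, which is a common choice but is not confirmed by this record. The list `twist` is copied from the first numeric column of the table above (taken here to be Twist); the function name `standardize` is hypothetical.

```python
import math

# Values copied from the first numeric column of the table (assumed to be Twist).
twist = [0.026, 0.036, 0.031, 0.033, 0.016, 0.026, 0.014, 0.031,
         0.025, 0.025, 0.026, 0.036, 0.017, 0.025, 0.016, 0.026]


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the population std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


twist_z = standardize(twist)
```

After this step each property has zero mean and unit variance over the 16 dinucleotides, so properties on very different scales (for example Twist and Rise) contribute comparably.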
Figure 2. The discrete wavelet transform process.
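A multi-level discrete wavelet transform repeatedly splits a signal into a low-frequency approximation and a high-frequency detail part. The sketch below uses the Haar wavelet purely for simplicity; which wavelet and how many levels the authors used is not stated here, and the function names `haar_step` and `haar_decompose` are hypothetical. The input is assumed to be a numeric signal already derived from the DNA sequence.

```python
import math


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficients."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd lengths by repeating the last sample
    scale = 1 / math.sqrt(2)
    approx = [(values[i] + values[i + 1]) * scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) * scale for i in range(0, len(values), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Apply haar_step repeatedly to the approximation part."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


approx, details = haar_decompose([4, 2, 6, 8], levels=2)
```

Summary statistics of each level's coefficients (for example maximum, minimum, mean, and standard deviation) are one possible way to turn the decomposition into a fixed-length feature vector.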
The performance of our method using different features.
| FEATURE | AUC | ACC | MCC | SN | SP |
|---|---|---|---|---|---|
| k-gram | 0.7143 | 0.7312 | 0.3288 | 0.3532 | 0.9128 |
| MMI | 0.6750 | 0.7061 | 0.2430 | 0.2529 | 0.9237 |
| DWT | 0.8063 | 0.7593 | 0.4213 | 0.5057 | 0.8810 |
| PseAAC | 0.6997 | 0.7214 | 0.2936 | 0.2961 | |
| Combination | 0.8338 | 0.7725 | 0.4589 | 0.5540 | 0.8774 |
| Combination (FS) | 0.8377 | | | | |
The values were calculated using the testing results on the benchmark dataset. The classifier was a support vector machine (SVM), and the validation method was target-jackknife cross-validation. For Combination, the feature size was 612-D, including all of k-gram, MMI, DWT, and PseAAC. For Combination (FS), the feature size was 114-D, selected by feature selection in SVM. AUC: area under the receiver operating characteristic curve; ACC: accuracy; MCC: Matthews correlation coefficient; SN: sensitivity; SP: specificity; FS: feature selection. The bold digits are the greatest values in each column.
Figure 3. The receiver operating characteristic (ROC) curves of classifiers using different features.
Figure 4. The importance score of each feature.
Figure 5. The trend of accuracy over feature dimensions.
Figure 6. The trend of accuracy on DWT features.
Comparison of our method, MethCGI, Methylator, and iDNA-Methyl on the benchmark dataset.
| Predictor | ACC | MCC | SN | SP |
|---|---|---|---|---|
| iDNA-Methyl | 0.7749 | 0.5471 | 0.6125 | |
| Methylator | 0.7135 | 0.3327 | 0.5172 | 0.8078 |
| MethCGI | 0.7383 | 0.3748 | 0.4968 | 0.8542 |
| Our Method | 0.8377 |
The feature size was 114-D, selected by feature selection in SVM. Our method used an SVM classifier and target-jackknife cross-validation. The bold digits are the greatest values in each column.
Figure 7. The ROC of our method on different cells. (a) The ROC on 2i-cultured mESCs. (b) The ROC on serum-cultured mESCs.
Comparison of our method, DeepCpG, and RF_Zhang on scBS-seq profiled mESCs.
| | 2i-Cultured Cells | | | | | Serum-Cultured Cells | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | AUC | ACC | MCC | SN | SP | AUC | ACC | MCC | SN | SP |
| DeepCpG | 0.8497 | 0.7752 | 0.8351 | 0.9237 | 0.9198 | 0.7681 | 0.7907 | 0.9608 | ||
| RF_Zhang | 0.8134 | 0.7610 | 0.5452 | 0.6809 | 0.9234 | 0.9084 | 0.7634 | 0.7704 | ||
| Our Method | 0.5221 | 0.6323 | 0.9106 | 0.9034 | ||||||
For all three methods, the results for 2i-cultured cells are the best values over the 12 cells, and the results for serum-cultured cells are the best values over the 18 cells. Our method used a sparse Bayesian learning classifier and holdout validation. The bold digits are the greatest values in each column.
Running time of each feature extraction method.
| Feature Set | k-gram | MMI | DWT | PseAAC |
|---|---|---|---|---|
| Running Time (s/10 K sequences) | 19.25 | 280.17 | 287.31 | 82.13 |