Yiheng Wang1, Tong Liu1, Dong Xu2, Huidong Shi3, Chaoyang Zhang1, Yin-Yuan Mo4, Zheng Wang1.
Abstract
The hypo- or hyper-methylation of the human genome is one of the epigenetic features of leukemia. However, experimental approaches have only determined the methylation state of a small portion of the human genome. We developed deep learning based (stacked denoising autoencoders, or SdAs) software named "DeepMethyl" to predict the methylation state of DNA CpG dinucleotides using features inferred from three-dimensional genome topology (based on Hi-C) and DNA sequence patterns. We used the experimental data from immortalised myelogenous leukemia (K562) and healthy lymphoblastoid (GM12878) cell lines to train the learning models and assess prediction performance. We have tested various SdA architectures with different configurations of hidden layer(s) and amount of pre-training data and compared the performance of deep networks relative to support vector machines (SVMs). Using the methylation states of sequentially neighboring regions as one of the learning features, an SdA achieved a blind test accuracy of 89.7% for GM12878 and 88.6% for K562. When the methylation states of sequentially neighboring regions are unknown, the accuracies are 84.82% for GM12878 and 72.01% for K562. We also analyzed the contribution of genome topological features inferred from Hi-C. DeepMethyl can be accessed at http://dna.cs.usm.edu/deepmethyl/.
Year: 2016 PMID: 26797014 PMCID: PMC4726425 DOI: 10.1038/srep19598
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Distribution of DNA methylation levels on CpG sites for chromosomes 1, 2 and 3 for GM12878 and K562.
Performance of SdAs for GM12878 on chromosome 21 under different numbers of hidden layers and different numbers of hidden units using leave-one-out cross-validation.
| Number of hidden units and hidden layers | 200 | 200–200 | 500 | 500–500 | 500–500–500 |
|---|---|---|---|---|---|
| Accuracy | 0.889 | 0.891 | 0.896 | 0.935 | 0.935 |
Figure 2. Leave-one-out cross-validation performance of SVMs with different window sizes, chromosomes, and cell lines.
(a) Prediction accuracy, (b) specificity, (c) sensitivity, and (d) Matthews correlation coefficient.
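For reference, the four measures in panels (a) to (d) can be computed from confusion-matrix counts. The following is a minimal Python sketch; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the DeepMethyl software or its results.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; "positive" stands for the
    methylated class and "negative" for the un-methylated class.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # true-negative rate
    se = tp / (tp + fn) if (tp + fn) else 0.0  # true-positive rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sp": sp, "se": se, "mcc": mcc}


# Made-up counts, used only to exercise the function.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(example)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it is reported alongside specificity and sensitivity.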
Figure 3. ROC curves of leave-one-out cross-validation using SVMs with different window sizes for (a) GM12878 and (b) K562.
The best performance achieved from leave-one-out cross-validation using SVMs on chromosomes 1 and 21 for cell lines GM12878 and K562.
| Cell Line | Chromosome | α | Window Size | Acc | Sp | Se | MCC |
|---|---|---|---|---|---|---|---|
| GM12878 | CHR1 | 0.55 | 500 | 0.900 | 0.894 | 0.905 | 0.800 |
| GM12878 | CHR21 | 0.99 | 600 | 0.942 | 0.918 | 0.966 | 0.886 |
| K562 | CHR1 | 0.07 | 600 | 0.823 | 0.863 | 0.784 | 0.649 |
| K562 | CHR21 | 0.43 | 800 | 0.876 | 0.848 | 0.904 | 0.753 |
The threshold α was used to ensure that the methylated and un-methylated classes contain equal numbers of samples.
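The text does not say how α was obtained. One simple way to get a class-balancing cut-off is to take the median of the methylation levels, as in the sketch below. This is an assumption made for illustration only: the function names and the example levels are hypothetical.

```python
from statistics import median


def balancing_threshold(levels):
    """Return the median methylation level as a class-balancing threshold.

    Assumption: splitting at the median gives two classes of (nearly)
    equal size. The source only states that alpha balances the classes.
    """
    return median(levels)


def binarize(levels, alpha):
    """Label a site 1 (methylated) if its level exceeds alpha, else 0."""
    return [1 if level > alpha else 0 for level in levels]


# Made-up methylation levels in [0, 1].
levels = [0.02, 0.05, 0.10, 0.40, 0.60, 0.95, 0.97, 0.99]
alpha = balancing_threshold(levels)
labels = binarize(levels, alpha)
print(alpha, labels)
```

If many sites share the median value exactly, the two classes will not be perfectly balanced under this rule, so ties would need separate handling.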
Performance of leave-one-out cross-validation using SVMs and SdAs on chromosomes 1 and 21 for cell line GM12878 with a window size of 600 nt.
| Classifier | Cell Line | Chromosome | Acc | Number of Samples |
|---|---|---|---|---|
| SdA | GM12878 | CHR21 | 0.934 | 296 |
| SVM | GM12878 | CHR21 | 0.942 | 296 |
| SdA | GM12878 | CHR1 | 0.885 | 2616 |
| SVM | GM12878 | CHR1 | 0.839 | 2616 |
Figure 4. (a) Accuracy of blind test on chromosome 21 using SdAs and SVMs. (b) Number of samples in the test dataset with different window sizes in chromosome 21.
Figure 5. (a) Accuracy of blind test on chromosome X using SdAs and SVMs. (b) Number of samples in the test dataset with different window sizes in chromosome X.
Figure 6. (a) Performance of SdAs for the prediction of methylation for lncRNAs and CpG sites without region-specific limitation on chromosome 21. (b) Number of samples in the test dataset with different window sizes in chromosome 21.
Figure 7. (a) Performance of SdAs for the prediction of methylation for lncRNAs and CpG sites without region-specific limitation on chromosome X. (b) Number of samples in the test dataset with different window sizes in chromosome X.
The SVM’s 5-fold cross-validation accuracy and MCC scores on chromosome 1 using Hi-C based topological neighboring window-Bs and random window-Bs, for different Hi-C ranges.
| Hi-C range | Acc (Hi-C based) | Acc (random) | MCC (Hi-C based) | MCC (random) |
|---|---|---|---|---|
| 10K | 0.831 | 0.828 | 0.616 | 0.600 |
| 20K | 0.833 | 0.810 | 0.618 | 0.584 |
| 30K | 0.830 | 0.815 | 0.614 | 0.586 |
| 40K | 0.837 | 0.832 | 0.623 | 0.606 |
| 50K | 0.838 | 0.824 | 0.628 | 0.601 |
The 5-fold cross-validation accuracies of SdAs on chromosome 1 with different Hi-C ranges.
| Hi-C range | Hi-C_1L | Random_1L | Hi-C_2L | Random_2L | Hi-C_3L | Random_3L |
|---|---|---|---|---|---|---|
| 10K | 0.829 | 0.830 | 0.837 | 0.714 | 0.835 | 0.406 |
| 20K | 0.839 | 0.839 | 0.828 | 0.668 | 0.829 | 0.376 |
| 30K | 0.840 | 0.835 | 0.832 | 0.823 | 0.830 | 0.376 |
| 40K | 0.828 | 0.835 | 0.831 | 0.831 | 0.828 | 0.565 |
| 50K | 0.841 | 0.819 | 0.826 | 0.828 | 0.834 | 0.326 |
The SdA model was trained with 10 pre-training epochs (unsupervised learning, learning rate 0.01) and 100 fine-tuning epochs (supervised learning, learning rate 0.01). 1L, 2L and 3L denote the number of hidden layers; the corruption level of every layer was set to 0.1, and every layer has 100 hidden nodes. Features based on genome topological neighbors (window-Bs, labeled “Hi-C” in the table) and features based on randomly selected regions (random windows, labeled “Random” in the table) were compared to benchmark the impact of Hi-C based features.
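To make the pre-training step concrete, the sketch below trains a single denoising autoencoder layer with the hyperparameters quoted above (corruption level 0.1, learning rate 0.01, 10 epochs, 100 hidden nodes as defaults). It is not the authors' implementation. Tied weights, masking noise, sigmoid units, cross-entropy loss, per-sample SGD, the use of NumPy and the toy data are all assumptions, and the supervised fine-tuning stage is not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pretrain_denoising_autoencoder(X, n_hidden=100, corruption=0.1,
                                   learning_rate=0.01, epochs=10, seed=0):
    """Unsupervised pre-training of one tied-weight denoising autoencoder.

    X: array of shape (n_samples, n_features) with values in [0, 1].
    Returns (W, b_hidden, b_visible, mean reconstruction loss per epoch).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = X.shape
    bound = 4.0 * np.sqrt(6.0 / (n_visible + n_hidden))
    W = rng.uniform(-bound, bound, size=(n_visible, n_hidden))
    b_hidden = np.zeros(n_hidden)
    b_visible = np.zeros(n_visible)
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for i in rng.permutation(n_samples):
            x = X[i]
            # Masking noise: zero each input with probability `corruption`.
            x_tilde = x * (rng.random(n_visible) >= corruption)
            h = sigmoid(x_tilde @ W + b_hidden)   # encode corrupted input
            z = sigmoid(h @ W.T + b_visible)      # reconstruct clean input
            z_safe = np.clip(z, 1e-10, 1.0 - 1e-10)
            epoch_loss += -np.sum(x * np.log(z_safe)
                                  + (1.0 - x) * np.log(1.0 - z_safe))
            # Cross-entropy gradients w.r.t. the two pre-activations.
            d_out = z - x
            d_hid = (d_out @ W) * h * (1.0 - h)
            # Tied weights: encoder and decoder contributions are summed.
            W -= learning_rate * (np.outer(x_tilde, d_hid) + np.outer(d_out, h))
            b_hidden -= learning_rate * d_hid
            b_visible -= learning_rate * d_out
        losses.append(epoch_loss / n_samples)
    return W, b_hidden, b_visible, losses


# Toy binary data built from four random prototypes (illustration only).
demo_rng = np.random.default_rng(1)
prototypes = (demo_rng.random((4, 20)) > 0.5).astype(float)
X_demo = prototypes[demo_rng.integers(0, 4, size=200)]
W, b_h, b_v, losses = pretrain_denoising_autoencoder(X_demo, n_hidden=10)
print(losses[0], losses[-1])
```

In a stacked model, the hidden activations of one trained layer would serve as the input for pre-training the next layer before the whole network is fine-tuned with labels.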
The blind test accuracy and MCC scores for SdAs and SVMs on randomly combined training and testing samples from chromosomes 1 and 21 with a Hi-C range of 10K.
| Classifier | Features | SdA architecture | Acc | MCC |
|---|---|---|---|---|
| SdA | Hi-C based window-B | 109-100-2 | 0.871 | 0.666 |
| SdA | Random window-B | 109-100-2 | 0.810 | 0.612 |
| SdA | Hi-C based window-B | 109-100-100-2 | 0.867 | 0.659 |
| SdA | Random window-B | 109-100-100-2 | 0.631 | 0.058 |
| SVM | Hi-C based window-B | NA | 0.860 | 0.685 |
| SVM | Random window-B | NA | 0.858 | 0.725 |
The ratio of fine-tuning, validation, and testing samples for SdAs is 3:1:1. With two hidden layers and random window-B features, the MCC score of the SdA is 0.058. We found that its predictions are highly biased toward negative samples, which pushes the false-negative rate close to 1 and therefore yields a very low MCC score.
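The link between biased predictions and a low MCC can be seen from the definition. The derivation below is a generic worked equation, not a computation on the paper's data; the symbols P, N and k are introduced here for illustration.

```latex
\begin{align}
\mathrm{MCC}
  &= \frac{TP \cdot TN - FP \cdot FN}
          {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \\
  &= \frac{TP \cdot N - FP \cdot P}
          {\sqrt{k \, P \, N \, (P + N - k)}},
  \qquad P = TP + FN,\quad N = TN + FP,\quad k = TP + FP .
\end{align}
% If almost every sample is predicted negative, k is small.
% The numerator is bounded by k * max(P, N), while the denominator
% shrinks only like sqrt(k), so MCC -> 0 as k -> 0.
% Meanwhile Acc = (TP + TN) / (P + N) -> N / (P + N).
```

In other words, a classifier that labels nearly everything negative can still reach an accuracy close to the fraction of negative samples while its MCC stays near zero.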
Performance of SdAs and SVMs for predicting methylation level of CpG sites within lncRNA regions.
| Classifier | SdA architecture | Acc | MCC | Number of test samples |
|---|---|---|---|---|
| SdA | 109-100-2 | 0.796 | 0.5678 | 2138 (551 positive, 1587 negative) |
| SdA | 109-100-100-2 | 0.784 | 0.5617 | 2138 |
| SdA | 109-100-100-100-2 | 0.832 | 0.6427 | 2138 |
| SVM | NA | 0.837 | 0.6385 | 2138 |
The SdA architecture and SVM model used were those with the best test accuracy in 5-fold cross-validation on chromosome 1 (Table 5, Supplementary Tables S2–S6). The number of lncRNA test samples is 2,138 (551 positive and 1,587 negative).
Features used for machine learning algorithms and their descriptions.
| Feature name | Feature description | Used in benchmark: |
|---|---|---|
| Ra_A | Ratio of adenine in window-A | 1, 2 |
| Ra_B | Ratio of thymine in window-A | 1, 2 |
| Ra_C | Ratio of guanine in window-A | 1, 2 |
| Ra_D | Ratio of cytosine in window-A | 1, 2 |
| Pa_AAWGGR | Pattern frequency of AAWGGR in window-A | 1, 2 |
| Pa_TGRAAT | Pattern frequency of TGRAAT in window-A | 1, 2 |
| Pa_AAT | Pattern frequency of AAT in window-A | 1, 2 |
| Pa_ATGVAA | Pattern frequency of ATGVAA in window-A | 1, 2 |
| Pa_ACG | Pattern frequency of ACG in window-A | 1, 2 |
| Pa_GC | Pattern frequency of GC in window-A | 1, 2 |
| Pa_CG | Pattern frequency of CG in window-A | 1, 2 |
| Pa_TG | Pattern frequency of TG in window-A | 1, 2 |
| Pa_CCGC | Pattern frequency of CCGC in window-A | 2 |
| Pa_CCCC | Pattern frequency of CCCC in window-A | 2 |
| Pa_CGCC | Pattern frequency of CGCC in window-A | 2 |
| Pa_AAAG | Pattern frequency of AAAG in window-A | 2 |
| Pa_CTCC | Pattern frequency of CTCC in window-A | 2 |
| Ave_meth | Average methylation level in window-A | 1 |
| PseTNC | 74 pseudo tri-nucleotide composition features (see Methods for details) | 2 |
| Ave_meth_Hi_C | Average methylation level in window-Bs | 1, 2 |
| Ave_Ra_A_Hi_C | Average Ra_A in window-Bs | 1, 2 |
| Ave_Ra_B_Hi_C | Average Ra_B in window-Bs | 1, 2 |
| Ave_Ra_C_Hi_C | Average Ra_C in window-Bs | 1, 2 |
| Ave_Ra_D_Hi_C | Average Ra_D in window-Bs | 1, 2 |
| Ave_Pa_AAWGGR_Hi_C | Average Pa_AAWGGR in window-Bs | 1, 2 |
| Ave_Pa_TGRAAT_Hi_C | Average Pa_TGRAAT in window-Bs | 1, 2 |
| Ave_Pa_AAT_Hi_C | Average Pa_AAT in window-Bs | 1, 2 |
| Ave_Pa_ATGVAA_Hi_C | Average Pa_ATGVAA in window-Bs | 1, 2 |
| Ave_Pa_ACG_Hi_C | Average Pa_ACG in window-Bs | 1, 2 |
| Ave_Pa_CCGC_Hi_C | Average Pa_CCGC in window-Bs | 2 |
| Ave_Pa_CCCC_Hi_C | Average Pa_CCCC in window-Bs | 2 |
| Ave_Pa_CGCC_Hi_C | Average Pa_CGCC in window-Bs | 2 |
| Ave_Pa_AAAG_Hi_C | Average Pa_AAAG in window-Bs | 2 |
| Ave_Pa_CTCC_Hi_C | Average Pa_CTCC in window-Bs | 2 |
| Ave_Pa_GC_Hi_C | Average Pa_GC in window-Bs | 2 |
| Ave_Pa_CG_Hi_C | Average Pa_CG in window-Bs | 2 |
| Ave_Pa_TG_Hi_C | Average Pa_TG in window-Bs | 2 |
Features whose names contain “Hi_C” were computed over window-Bs, that is, the topological neighbors indicated by Hi-C experiments.
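The sequence features in the table can be sketched in a few lines of Python. The exact normalization used by DeepMethyl is not given here, so this sketch assumes that a "ratio" is a base count divided by window length and that a "pattern frequency" is the number of overlapping matches divided by the number of possible start positions. Patterns are read as IUPAC codes (W = A or T, R = A or G, V = A, C or G). All function names are hypothetical.

```python
import re

# IUPAC codes that occur in the patterns of the feature table.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "V": "[ACG]"}


def nucleotide_ratio(window, base):
    """Fraction of positions in `window` equal to `base` (e.g. Ra_A)."""
    window = window.upper()
    return window.count(base) / len(window) if window else 0.0


def pattern_frequency(window, pattern):
    """Overlapping matches of an IUPAC pattern per possible start position.

    Assumed definition; the source does not specify the normalization.
    """
    window = window.upper()
    n_positions = len(window) - len(pattern) + 1
    if n_positions <= 0:
        return 0.0
    # A lookahead makes the regex count overlapping occurrences.
    regex = re.compile("(?=" + "".join(IUPAC[c] for c in pattern) + ")")
    return len(regex.findall(window)) / n_positions


def average_over_windows(windows, feature):
    """Mean of a per-window feature over several window-Bs (Ave_*_Hi_C)."""
    return sum(feature(w) for w in windows) / len(windows) if windows else 0.0


# Made-up sequences, used only to exercise the functions.
window_a = "AATGGACG"
print(nucleotide_ratio(window_a, "A"))
print(pattern_frequency(window_a, "AAWGGR"))
print(average_over_windows(["AAAA", "CCCC"],
                           lambda w: nucleotide_ratio(w, "A")))
```

The same per-window functions serve both window-A features and the averaged window-B features, which keeps the two feature groups directly comparable.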