Clemens Wrzodek, Finja Büchel, Georg Hinselmann, Johannes Eichner, Florian Mittag, Andreas Zell.
Abstract
DNA methylation of CpG islands plays a crucial role in the regulation of gene expression. More than half of all human promoters contain CpG islands with a tissue-specific methylation pattern in differentiated cells. How DNA methyltransferases determine which regions should be methylated is still not completely understood. Many hypotheses about which genomic features are correlated with the epigenome have not yet been evaluated. Furthermore, many explorative approaches to measuring DNA methylation are limited to a subset of the genome and thus cannot be employed, e.g., for genome-wide biomarker prediction methods. In this study, we evaluated the correlation of genetic, epigenetic and hypothesis-driven features with DNA methylation of CpG islands. To this end, various binary classifiers were trained and evaluated by cross-validation on a dataset comprising DNA methylation data for 190 CpG islands in HEPG2, HEK293, fibroblasts and leukocytes. We achieved an accuracy of up to 91% with an MCC of 0.8 using ten-fold cross-validation and ten repetitions. With these models, we extended the existing dataset to the whole genome and thus predicted the methylation landscape for the given cell types. The method used for these predictions was also validated on another external whole-genome dataset. Our results reveal features correlated with DNA methylation and confirm or disprove various hypotheses about DNA methylation-related features. This study confirms correlations between DNA methylation and histone modifications, DNA structure, DNA sequence, genomic attributes and CpG island properties. Furthermore, the method has been validated on a genome-wide dataset from the ENCODE consortium. The developed software, as well as the predicted datasets and a web service to compare methylation states of CpG islands, are available at http://www.cogsys.cs.uni-tuebingen.de/software/dna-methylation/.
Year: 2012 PMID: 22558141 PMCID: PMC3340366 DOI: 10.1371/journal.pone.0035327
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Machine learning algorithm performance.
| Algorithm | HEPG2 | HEK293 | Leukocytes | Fibroblast |
| | 0.543±0.30 | 0.716±0.01 | 0.801±0.02 | 0.635±0.19 |
| | 0.382±0.24 | 0.743±0.17 | 0.825±0.14 | 0.564±0.18 |
| | 0.363±0.28 | 0.667±0.19 | 0.765±0.18 | 0.333±0.32 |
| | 0.383±0.27 | 0.654±0.19 | 0.683±0.21 | 0.407±0.28 |
| | 0.204±0.24 | 0.526±0.20 | 0.629±0.26 | 0.214±0.30 |
| | 0.128±0.33 | 0.381±0.36 | 0.393±0.36 | 0.312±0.40 |
| | 0.057±0.21 | 0.117±0.26 | 0.146±0.22 | 0.064±0.24 |
Performance comparison of different machine learning algorithms for the task of DNA methylation prediction. We measured the Matthews correlation coefficient (MCC) for every algorithm and every cell type using all features. The values shown in this table are averages over ten repetitions of ten-fold cross-validation.
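The MCC reported in these tables can be computed directly from confusion-matrix counts. The following Python sketch is illustrative only: the function names `mcc` and `summarize`, the example scores, and the reading of the ± values as standard deviations over repetitions are assumptions, not taken from the study's software.

```python
import math
import statistics


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def summarize(scores):
    """Mean and sample standard deviation of per-repetition scores."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Hypothetical per-repetition MCC values, for illustration only.
    mean, spread = summarize([0.78, 0.82, 0.80])
    print(f"{mean:.3f}±{spread:.2f}")
```

The MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it more informative than accuracy on imbalanced data.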
Single feature class performance.
| Feature name | HEPG2 ACC | HEPG2 MCC | HEK293 ACC | HEK293 MCC | Leukocytes ACC | Leukocytes MCC | Fibroblast ACC | Fibroblast MCC |
| All features | 0.85 | 0.54 | 0.87 | 0.72 | 0.91 | 0.80 | 0.87 | 0.64 |
| Histone modification data | 0.83 | 0.52 | 0.83 | 0.66 | 0.91 | 0.82 | 0.86 | 0.68 |
| DNA structure | 0.82 | 0.47 | 0.85 | 0.68 | 0.85 | 0.67 | 0.83 | 0.53 |
| Sequence - dinucleotides | 0.80 | 0.34 | 0.86 | 0.70 | 0.89 | 0.76 | 0.82 | 0.54 |
| CpG island-specific attributes | 0.79 | 0.41 | 0.86 | 0.69 | 0.87 | 0.71 | 0.81 | 0.50 |
| Sequence - tetranucleotides | 0.78 | 0.34 | 0.86 | 0.69 | 0.89 | 0.75 | 0.79 | 0.45 |
| Genomic attributes | 0.82 | 0.47 | 0.83 | 0.63 | 0.82 | 0.60 | 0.81 | 0.49 |
| Transcription factor binding sites | 0.81 | 0.41 | 0.80 | 0.58 | 0.85 | 0.65 | 0.82 | 0.48 |
| Closest CpGs | 0.82 | 0.43 | 0.78 | 0.52 | 0.81 | 0.56 | 0.82 | 0.45 |
| Distances to transcription start sites | 0.76 | 0.19 | 0.72 | 0.40 | 0.73 | 0.36 | 0.77 | 0.24 |
| Periodic CpG distances | 0.76 | 0.26 | 0.67 | 0.27 | 0.73 | 0.38 | 0.76 | 0.20 |
| Single nucleotide polymorphism (SNP) | 0.77 | 0.23 | 0.67 | 0.27 | 0.71 | 0.31 | 0.78 | 0.27 |
| Splicing sites | 0.80 | 0.35 | 0.65 | 0.19 | 0.72 | 0.34 | 0.77 | 0.15 |
| CpG flanking sequence | 0.79 | 0.32 | 0.68 | 0.29 | 0.65 | 0.13 | 0.78 | 0.28 |
| Evolutionary conservation (PhastCons) | 0.78 | 0.16 | 0.65 | 0.26 | 0.68 | 0.21 | 0.77 | 0.18 |
| Repeat, ALU-Y and DNA/DNA alignment features | 0.76 | 0.11 | 0.65 | 0.23 | 0.68 | 0.22 | 0.77 | 0.08 |
| Unmethylated instances [%] | 0.74 | | 0.60 | | 0.65 | | 0.74 | |
Comparison of predictive performances of single feature classes. All values are taken from SVM predictions with feature files that only contain features belonging to the given class. Each prediction is an average of a ten-fold cross-validation with ten repetitions. The table shows the accuracy (ACC) and Matthews correlation coefficient (MCC) for each cell type and each feature class and is sorted by average MCC. Please note that the underlying data is imbalanced (because CpG islands tend to be unmethylated) and the average accuracy when assigning all CpG islands the unmethylated state is 0.71.
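The note on class imbalance can be made concrete with a trivial baseline that labels every CpG island as unmethylated. The sketch below is a hedged illustration: the function name `majority_baseline` and the example counts are hypothetical and are not taken from the dataset.

```python
def majority_baseline(n_unmethylated, n_methylated):
    """Accuracy and MCC of a classifier that calls every island unmethylated."""
    total = n_unmethylated + n_methylated
    # Treat "methylated" as the positive class: it is never predicted.
    tp, fp = 0, 0
    tn, fn = n_unmethylated, n_methylated
    accuracy = (tp + tn) / total
    # The MCC numerator tp*tn - fp*fn is 0 and so is its denominator,
    # so the score is conventionally reported as 0.
    return accuracy, 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    acc, score = majority_baseline(140, 50)
    print(f"accuracy={acc:.2f}, MCC={score:.1f}")
```

Such a baseline reaches an accuracy equal to the fraction of unmethylated islands while its MCC stays at 0, which is why the accuracies in the table should be read alongside the MCC values.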
Figure 1. CpG island methylation predictions with individual feature classes reveal which features are correlated with the epigenome.
The figure shows the predictive performance of the feature classes, averaged across HEPG2, HEK293, leukocytes and fibroblasts. It reveals which features are correlated with DNA methylation and which are unlikely to be related to it. Each value is an average of a ten-fold cross-validation with ten repetitions. The figure shows the accuracy (ACC), Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUC) and is sorted by average MCC.
Figure 2. Predicted whole-genome methylation landscape for all four cell types.
This figure visualizes the methylation landscape in all four cell types, compared to the total number of CpG islands. Each bar represents the number of methylated instances per cell type as a percentage of the total number of CpG islands in the given chromosome. The largest number of methylated CpG islands is found in HEK293, whereas HEPG2 has an almost unmethylated genome. The few CpG islands in chromosome Y are hypermethylated in most cell types, compared to the other chromosomes.
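The bar heights described in this caption are per-chromosome percentages of methylated CpG islands. A minimal Python sketch of that aggregation is given below. The function name `percent_methylated`, the input layout of `(chromosome, is_methylated)` pairs, and the example data are assumptions made for illustration.

```python
from collections import Counter


def percent_methylated(predictions):
    """Percentage of methylated CpG islands per chromosome.

    predictions: iterable of (chromosome, is_methylated) pairs.
    """
    total = Counter()
    methylated = Counter()
    for chrom, is_meth in predictions:
        total[chrom] += 1
        if is_meth:
            methylated[chrom] += 1
    return {chrom: 100.0 * methylated[chrom] / total[chrom] for chrom in total}


if __name__ == "__main__":
    # Hypothetical predictions for one cell type, for illustration only.
    example = [("chr1", True), ("chr1", False), ("chrY", True)]
    print(percent_methylated(example))
```

Running the function once per cell type would yield one bar per chromosome and cell type.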
Validation on experimental data.
| Trained on | Total CGIs | CGIs in training set | CGIs in test set | ACC [%] | MCC |
| CHR21 | 17588 | 224 | 17364 | 90.01 | 0.43 |
| 10% | 17588 | 1758 | 15830 | 87.18 | 0.48 |
| 25% | 17588 | 4397 | 13191 | 91.68 | 0.56 |
| 50% | 17588 | 8794 | 8794 | 92.02 | 0.58 |
Validation of the proposed method (SVMs with RBF kernel, using all described features) on experimental data. The experimental dataset was divided into a training set and a test set. The training set was used for training only, and the test set was used exclusively for comparison with the prediction results and for calculating accuracy (ACC) and Matthews correlation coefficient (MCC). We performed this evaluation with four different training sets: all CpG islands (CGIs) from chromosome 21, and randomly picked 10%, 25% and 50% of the data.
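The random splits in this table can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `split_sizes` and `random_split`, the fixed seed, and the use of integer truncation for the training-set size are hypothetical, although truncation does reproduce the training and test counts listed in the table for 10%, 25% and 50% of 17588 CGIs.

```python
import random


def split_sizes(total, percent):
    """Training and test set sizes for a given training percentage."""
    n_train = total * percent // 100  # integer truncation (assumption)
    return n_train, total - n_train


def random_split(ids, percent, seed=0):
    """Randomly pick `percent` of the IDs for training; the rest is the test set."""
    n_train, _ = split_sizes(len(ids), percent)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train = set(rng.sample(ids, n_train))
    test = [i for i in ids if i not in train]
    return sorted(train), test


if __name__ == "__main__":
    for percent in (10, 25, 50):
        print(percent, split_sizes(17588, percent))
```

Keeping the test set disjoint from the training set, as `random_split` does, is what allows the reported ACC and MCC to be read as performance on unseen CpG islands.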
Comparison of different methylation prediction approaches.
| Year | Authors | Dataset | CC | Accuracy |
| 2006 | Fang | HEP pilot phase data | 0.42 | 81.48 |
| 2006 | Bock | HEP pilot phase data | 0.15 | 74.76 |
| 2006 | Bock | Human peripheral blood lymphocytes | 0.74 | 91.5 |
| | | Human peripheral blood lymphocytes | 0.87 | 95.76 |
| | | NAME21 (Leukocytes) | 0.80 | 91.13 |
The predictive results of our method, compared to other methods. The table shows that our method outperforms other previously published methods.