Hao Zheng, Hongwei Wu, Jinping Li, Shi-Wen Jiang.
Abstract
DNA methylation is a heritable chemical modification of cytosine, and represents one of the most important epigenetic events. Computational prediction of the DNA methylation status can be employed to speed up genome-wide methylation profiling, and to identify the key features that are correlated with various methylation patterns. Here, we develop CpGIMethPred, a set of support vector machine-based models that predict the methylation status of the CpG islands in the human genome under normal conditions. The features for prediction include those that have been previously demonstrated to be effective (CpG island specific attributes, DNA sequence composition patterns, DNA structure patterns, distribution patterns of conserved transcription factor binding sites and conserved elements, and histone methylation status) as well as those that have not been extensively explored but are likely to contribute additional information from a biological point of view (nucleosome positioning propensities, gene functions, and histone acetylation status). Statistical tests are performed to identify the features that are significantly correlated with the methylation status of the CpG islands, and principal component analysis is then performed to decorrelate the selected features. Data from the Human Epigenome Project (HEP) are used to train, validate and test the predictive models. Specifically, the models are trained and validated using the DNA methylation data obtained in CD4 lymphocytes, and are then tested for generalizability using the DNA methylation data obtained in the other 11 normal tissues and cell types.
Our experiments have shown that (1) an eight-dimensional feature space that is selected via the principal component analysis and that combines all categories of information is effective for predicting the CpG island methylation status, (2) by incorporating the information regarding nucleosome positioning, gene functions, and histone acetylation, the models can achieve higher specificity and accuracy than the existing models while maintaining a comparable sensitivity measure, (3) the histone modification (methylation and acetylation) information contributes significantly to the prediction, without which the performance of the models deteriorates, and (4) the predictive models generalize well to different tissues and cell types. The developed program CpGIMethPred is freely available at http://users.ece.gatech.edu/~hzheng7/CGIMetPred.zip.
Year: 2013 PMID: 23369266 PMCID: PMC3552668 DOI: 10.1186/1755-8794-6-S1-S13
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Number of methylated and unmethylated CpG islands in the twelve different tissue and cell types based on the DNA methylation profiles of HEP.
| Tissue/Cell type | Methylated | Unmethylated |
|---|---|---|
| CD4 | 101 | 368 |
| CD8 | 103 | 332 |
| sperm | 45 | 331 |
| liver | 105 | 334 |
| heart muscle | 96 | 372 |
| skeletal muscle | 91 | 371 |
| fetal skeletal muscle | 79 | 281 |
| fetal liver | 76 | 270 |
| placenta | 92 | 328 |
| dermal melanocytes | 107 | 326 |
| dermal fibroblasts | 92 | 358 |
| dermal keratinocytes | 91 | 374 |
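The counts above are imbalanced toward unmethylated islands, which affects how accuracy should be read. The following is a minimal sketch, not part of the original study, that computes the methylated fraction per tissue and the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the table; the name `majority_baseline` is illustrative.

```python
# (methylated, unmethylated) counts copied from the table above.
COUNTS = {
    "CD4": (101, 368),
    "CD8": (103, 332),
    "sperm": (45, 331),
    "liver": (105, 334),
    "heart muscle": (96, 372),
    "skeletal muscle": (91, 371),
    "fetal skeletal muscle": (79, 281),
    "fetal liver": (76, 270),
    "placenta": (92, 328),
    "dermal melanocytes": (107, 326),
    "dermal fibroblasts": (92, 358),
    "dermal keratinocytes": (91, 374),
}


def majority_baseline(methylated, unmethylated):
    """Return (methylated fraction, accuracy of always predicting the majority class)."""
    total = methylated + unmethylated
    return methylated / total, max(methylated, unmethylated) / total


for tissue, (m, u) in COUNTS.items():
    frac, acc = majority_baseline(m, u)
    print(f"{tissue:22s} methylated={frac:.3f} baseline_acc={acc:.3f}")
```

For CD4 this baseline is about 0.78, so accuracy values are easier to interpret alongside sensitivity and specificity.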
Figure 1. Workflow used to predict the methylation status of CpG islands in the human genome. The CpG island map is obtained by applying the traditional Gardiner-Garden sequence criteria to non-repetitive sequences of the human genome. The core steps of our model development are feature extraction, feature selection, and predictive modeling.
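As an illustration of the Gardiner-Garden sequence criteria mentioned in the caption, here is a minimal sketch that computes length, GC content, and the observed/expected CpG ratio for a sequence. The thresholds (200 bp, 0.5, 0.6) are the commonly cited values and are an assumption here; this section does not state the exact cutoffs or windowing the authors used. The function names are hypothetical.

```python
def cpg_island_attributes(seq):
    """Return (length, GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so str.count is exact
    gc_content = (c + g) / n if n else 0.0
    # Observed/expected CpG: (#CpG * length) / (#C * #G)
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_content, obs_exp


def meets_gardiner_garden(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check the three criteria; default thresholds are assumed, not taken from the paper."""
    n, gc_content, obs_exp = cpg_island_attributes(seq)
    return n >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    print(cpg_island_attributes("CG" * 100))
    print(meets_gardiner_garden("AT" * 100))
```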
Number of features in each category and information resource for the feature extraction.
| Category | # Features | Resource |
|---|---|---|
| CpG island specific attributes | 3 | Gardiner-Garden criteria [ |
| tetramer frequency | 256 | calculated by in-house code based on definition |
| tetramer z-score | 256 | calculated by in-house code based on formula (1)-(3) |
| conserved TFBS's | 230 | calculated by in-house code based on UCSC information [ |
| conserved elements | 2 | calculated by in-house code based on conserved elements [ |
| DNA 3-D conformation | 6 | calculated by in-house code based on formula [ |
| nucleosome positioning propensity | 4 | calculated by in-house code using nucleosome organization map [ |
| gene functions | 2 | calculated by in-house code for enrichment analysis |
| histone methylation | 46 | calculated by in-house code based on the data set from [ |
| histone acetylation | 36 | calculated by in-house code based on the data set from [ |
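The 256 tetramer frequency features in the table correspond to the 4^4 possible DNA 4-mers. A minimal sketch of such a feature vector follows, assuming a sliding window of step 1 on a single strand and skipping windows that contain ambiguous bases; the authors' in-house normalization and strand handling are not described in this section, and the z-score formulas (1)-(3) are not reproduced here, so they are not sketched.

```python
from itertools import product

# All 256 tetramers in a fixed lexicographic order (AAAA, AAAC, ..., TTTT).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(seq):
    """Return a 256-element list of relative tetramer frequencies for seq."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other non-ACGT symbols
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


if __name__ == "__main__":
    freqs = tetramer_frequencies("ACGTACGT")
    print(len(freqs), freqs[TETRAMERS.index("ACGT")])
```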
Number of principal components (PCs) required to retain a certain percentage (Pcnt) of the variance of the original feature space of the 342 features selected through statistical tests.
| Pcnt | 100% | 99.99% | 99.90% | 99.00% |
|---|---|---|---|---|
| # PCs | 342 | 10 | 6 | |

| Pcnt | 95.00% | 90.00% | 75.00% | 50.00% |
|---|---|---|---|---|
| # PCs | 5 | 4 | 3 | 2 |
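The relationship between retained variance and number of principal components in the table above can be sketched as follows. Given the eigenvalues of the feature covariance matrix, the function returns the smallest number of components whose cumulative variance reaches a target fraction. The eigenvalues in the example are hypothetical and are not the study's values.

```python
def components_needed(eigenvalues, fraction):
    """Smallest k such that the k largest eigenvalues explain >= fraction of total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return k
    return len(ordered)


if __name__ == "__main__":
    toy_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]  # hypothetical, for illustration only
    for target in (0.5, 0.75, 0.85, 1.0):
        print(target, components_needed(toy_eigenvalues, target))
```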
Figure 2. Contribution of the 342 features to the eight principal components. Each column corresponds to a principal component, and each row corresponds to an original feature dimension. All feature categories make substantial contributions to one or more principal components, suggesting that these categories of information, though correlated, are complementary to a certain extent for predicting CpG island methylation.
Performance of our classifier M1 on CD4 lymphocytes compared with the existing method.
| Method | SP | SE | ACC | CC |
|---|---|---|---|---|
| Our method (M1) | 0.9405 | 0.9257 | 0.9313 | 0.8302 |
| Fan et al.'s [ | 0.7400 | 0.9428 | 0.8994 | - |
Performance of the predictive models (M3 through M16), each with an individual or a combination of the newly added categories of features being excluded.
| Features | SP | SE | ACC | CC |
|---|---|---|---|---|
| | 0.9405 | 0.9257 | 0.9313 | 0.8302 |
| | 0.9012 | 0.8965 | 0.9046 | 0.7852 |
| | 0.9302 | 0.9265 | 0.9210 | 0.8038 |
| | 0.9270 | 0.9250 | 0.9205 | 0.8024 |
| | 0.8791 | 0.8903 | 0.8897 | 0.7632 |
| | 0.8698 | 0.8835 | 0.8826 | 0.7625 |
| | 0.9186 | 0.9116 | 0.9186 | 0.8012 |
| | 0.8685 | 0.8822 | 0.8786 | 0.7558 |
| | 0.9318 | 0.5932 | 0.8575 | 0.6404 |
| | 0.9670 | 0.2247 | 0.8001 | 0.3302 |
| | 0.9092 | 0.5670 | 0.8312 | 0.6124 |
| | 0.9078 | 0.5660 | 0.8296 | 0.6076 |
| | 0.9320 | 0.2279 | 0.7862 | 0.3236 |
| | 0.9266 | 0.2304 | 0.7641 | 0.3264 |
| | 0.8990 | 0.5519 | 0.8232 | 0.5924 |
| | 0.8972 | 0.2338 | 0.7352 | 0.3013 |
Specificity (SP), sensitivity (SE) and accuracy (ACC) are evaluated for binary classification, and correlation coefficient (CC) for regression models.
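The four measures named above can be computed as in the following minimal sketch. It assumes that "methylated" is the positive class (label 1) and that CC is the Pearson correlation between predicted and measured methylation levels; neither convention is stated in this section. The function names are illustrative.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Return SP, SE and ACC for 0/1 labels, treating 1 (methylated, assumed) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "SP": ratio(tn, tn + fp),       # true negative rate
        "SE": ratio(tp, tp + fn),       # true positive rate
        "ACC": ratio(tp + tn, len(pairs)),
    }


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(binary_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
    print(pearson_cc([1, 2, 3], [2, 4, 6]))
```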
Performance of the classifier model and the influence of newly added features on the data of 11 different tissues and cell types: with histone modification.
| Procedure | Tissue/Cell type | SP (with added) | SE (with added) | ACC (with added) | CC (with added) | SP (without added) | SE (without added) | ACC (without added) | CC (without added) |
|---|---|---|---|---|---|---|---|---|---|
| training/validation | CD4 | 0.9405 | 0.9257 | 0.9313 | 0.8302 | 0.8685 | 0.8822 | 0.8786 | 0.7558 |
| testing | CD8 | 0.9608 | 0.8932 | 0.9448 | 0.8286 | 0.8692 | 0.8534 | 0.8758 | 0.7476 |
| testing | liver | 0.9680 | 0.8762 | 0.9465 | 0.8292 | 0.8512 | 0.8468 | 0.8698 | 0.7398 |
| testing | heart muscle | 0.9462 | 0.9479 | 0.9466 | 0.8342 | 0.8678 | 0.8796 | 0.8724 | 0.7542 |
| testing | skeletal muscle | 0.9542 | 0.9451 | 0.9524 | 0.8411 | 0.8714 | 0.8923 | 0.8895 | 0.7612 |
| testing | embryonic skeletal muscle | 0.9395 | 0.9367 | 0.9389 | 0.8337 | 0.8676 | 0.8802 | 0.8774 | 0.7553 |
| testing | embryonic liver | 0.9259 | 0.9342 | 0.9277 | 0.8250 | 0.8490 | 0.8834 | 0.8683 | 0.7324 |
| testing | placenta | 0.9695 | 0.9130 | 0.9571 | 0.8412 | 0.8704 | 0.8742 | 0.8802 | 0.7597 |
| testing | dermal melanocytes | 0.9663 | 0.8785 | 0.9446 | 0.8401 | 0.8677 | 0.8792 | 0.8726 | 0.7498 |
| testing | dermal fibroblasts | 0.9525 | 0.9239 | 0.9467 | 0.8332 | 0.8625 | 0.8792 | 0.8656 | 0.7478 |
| testing | dermal keratinocytes | 0.9385 | 0.9341 | 0.9376 | 0.8310 | 0.8505 | 0.8690 | 0.8502 | 0.7371 |
| testing | sperm | 0.8459 | 0.9778 | 0.8617 | 0.7204 | 0.7115 | 0.8992 | 0.7508 | 0.6052 |
Performance of the classifier model and the influence of newly added features on the data of 11 different tissues and cell types: without histone modification.
| Procedure | Tissue/Cell type | SP (with added) | SE (with added) | ACC (with added) | CC (with added) | SP (without added) | SE (without added) | ACC (without added) | CC (without added) |
|---|---|---|---|---|---|---|---|---|---|
| training/validation | CD4 | 0.9670 | 0.2247 | 0.8001 | 0.3302 | 0.8972 | 0.2338 | 0.7352 | 0.3013 |
| testing | CD8 | 0.9722 | 0.2108 | 0.8104 | 0.3325 | 0.8978 | 0.2284 | 0.7350 | 0.3009 |
| testing | liver | 0.9678 | 0.2143 | 0.8122 | 0.3328 | 0.8965 | 0.2325 | 0.7298 | 0.3005 |
| testing | heart muscle | 0.9562 | 0.2386 | 0.8186 | 0.3402 | 0.8804 | 0.2468 | 0.7190 | 0.3001 |
| testing | skeletal muscle | 0.9594 | 0.2364 | 0.8306 | 0.3268 | 0.8874 | 0.2476 | 0.7268 | 0.3003 |
| testing | embryonic skeletal muscle | 0.9425 | 0.2298 | 0.8100 | 0.3228 | 0.8805 | 0.2406 | 0.7222 | 0.3002 |
| testing | embryonic liver | 0.9389 | 0.2306 | 0.8054 | 0.3217 | 0.8796 | 0.2512 | 0.7350 | 0.3015 |
| testing | placenta | 0.9655 | 0.2184 | 0.8276 | 0.3450 | 0.9004 | 0.2216 | 0.7398 | 0.3128 |
| testing | dermal melanocytes | 0.9700 | 0.2186 | 0.8156 | 0.3358 | 0.8986 | 0.2306 | 0.7354 | 0.3027 |
| testing | dermal fibroblasts | 0.9605 | 0.2200 | 0.8058 | 0.3286 | 0.8902 | 0.2276 | 0.7308 | 0.3016 |
| testing | dermal keratinocytes | 0.9425 | 0.2204 | 0.8095 | 0.3325 | 0.8854 | 0.2304 | 0.7304 | 0.3013 |
| testing | sperm | 0.8524 | 0.2365 | 0.7625 | 0.2678 | 0.7906 | 0.2408 | 0.6705 | 0.2317 |
The number of CpG islands that are differentially methylated between any two tissues, among the 321 CpG islands common to all 12 tissues.
| Tissue | CD4 | CD8 | DF | DK | DM | EL | ESM | HM | Liver | Placenta | SM | Sperm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD4 | 0 | 0 | 5 | 6 | 4 | 0 | 3 | 0 | 2 | 0 | 0 | 28 |
| CD8 | 0 | 0 | 7 | 7 | 6 | 0 | 5 | 2 | 3 | 1 | 0 | 32 |
| DF | 5 | 7 | 0 | 4 | 2 | 4 | 1 | 1 | 6 | 1 | 1 | 26 |
| DK | 6 | 7 | 4 | 0 | 6 | 5 | 4 | 2 | 7 | 2 | 2 | 28 |
| DM | 4 | 6 | 2 | 6 | 0 | 4 | 4 | 1 | 4 | 1 | 2 | 32 |
| EL | 0 | 0 | 4 | 5 | 4 | 0 | 3 | 0 | 2 | 0 | 0 | 24 |
| ESM | 3 | 5 | 1 | 4 | 4 | 3 | 0 | 1 | 4 | 1 | 0 | 24 |
| HM | 0 | 2 | 1 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 25 |
| Liver | 2 | 3 | 6 | 7 | 4 | 2 | 4 | 2 | 0 | 3 | 2 | 29 |
| Placenta | 0 | 1 | 1 | 2 | 1 | 0 | 1 | 0 | 3 | 0 | 0 | 22 |
| SM | 0 | 0 | 1 | 2 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 22 |
| Sperm | 28 | 32 | 26 | 28 | 32 | 24 | 24 | 25 | 29 | 22 | 22 | 0 |
DF: dermal fibroblasts, DK: dermal keratinocytes, DM: dermal melanocytes, EL: embryonic liver, ESM: embryonic skeletal muscle, HM: heart muscle, SM: skeletal muscle.
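A pairwise count matrix like the one above can be built as in the following minimal sketch. It assumes each tissue has a binary methylation call (1 = methylated, 0 = unmethylated) for the same ordered set of CpG islands, and that "differentially methylated" means the two calls disagree; the exact definition used in the study is not given in this section. The toy calls are hypothetical.

```python
from itertools import combinations


def differential_counts(status):
    """status maps tissue name -> list of 0/1 calls over the same ordered CpG islands.

    Returns a nested dict counts[a][b] = number of islands whose calls differ.
    """
    tissues = list(status)
    counts = {a: {b: 0 for b in tissues} for a in tissues}
    for a, b in combinations(tissues, 2):
        diff = sum(1 for x, y in zip(status[a], status[b]) if x != y)
        counts[a][b] = counts[b][a] = diff  # the matrix is symmetric
    return counts


if __name__ == "__main__":
    toy_status = {  # hypothetical calls for four islands
        "CD4": [1, 0, 0, 1],
        "CD8": [1, 0, 0, 1],
        "Sperm": [0, 0, 1, 1],
    }
    for tissue, row in differential_counts(toy_status).items():
        print(tissue, row)
```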
Figure 3. Correlation coefficients of the CpG island methylation levels across different tissues and cell types. The methylation status of CpG islands is highly correlated among the somatic and placenta cells. The methylation status of CpG islands in sperm differs markedly from that in the other tissues and cell types.