Christopher Previti, Oscar Harari, Igor Zwir, Coral del Val.
Abstract
BACKGROUND: The computational prediction of DNA methylation has become an important topic in recent years due to its role in the epigenetic control of normal and cancer-related processes. While previous prediction approaches focused merely on differences between methylated and unmethylated DNA sequences, recent experimental results have shown the presence of much more complex patterns of methylation across tissues and time in the human genome. These patterns are only partially described by a binary model of DNA methylation. In this work we propose a novel approach, based on profile analysis of tissue-specific methylation, that uncovers significant differences in the sequences of CpG islands (CGIs) that predispose them to a tissue-specific methylation pattern.
Year: 2009 PMID: 19383127 PMCID: PMC2683815 DOI: 10.1186/1471-2105-10-116
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the profile-based approach to the analysis of tissue-specific CGI methylation.
Unique CpGs and CGIs defined by the HEP data
| Tissue/Cell type | Abbreviation | Total # CpGs (%)* | # CGIs (%)** |
|---|---|---|---|
| CD4 T lymphocytes | CD4 | 31,219 (94.86) | 515 (99.61) |
| CD8 T lymphocytes | CD8 | 29,979 (91.09) | 503 (97.29) |
| Dermal fibroblasts | DF | 29,776 (90.48) | 504 (97.49) |
| Dermal keratinocytes | DK | 29,739 (90.36) | 508 (98.26) |
| Dermal melanocytes | DM | 29,809 (90.58) | 504 (97.49) |
| Fetal liver | FL | 24,343 (73.97) | 452 (87.43) |
| Fetal skeletal muscle | FSM | 24,250 (73.69) | 448 (86.65) |
| Heart muscle | HM | 31,268 (95.01) | 517 (100) |
| Liver | - | 31,456 (95.58) | 517 (100) |
| Placenta | - | 29,900 (90.85) | 505 (97.68) |
| Skeletal muscle | SM | 31,518 (95.77) | 513 (99.23) |
| Sperm | - | 23,621 (71.77) | 444 (85.88) |
| Supported by at least 2 tissues | | 32,910 | 517 |
* Number and percentage of CpGs measured in at least one of the twelve tissues.
** Number and percentage of CGIs where at least 70% of the CpGs were covered by the HEP data, in at least two tissues.
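The coverage rule in the second footnote can be sketched as a small filter. This is only an illustration under stated assumptions: the data layout (a mapping from tissue name to the CpG positions measured in that tissue), the function name `cgi_is_covered`, and the reading of the rule (a tissue supports a CGI when it covers at least 70% of its CpGs, and two such tissues are required) are hypothetical and not taken from the authors' pipeline.

```python
def cgi_is_covered(cpg_positions, measured_by_tissue,
                   min_fraction=0.70, min_tissues=2):
    """Return True if at least `min_tissues` tissues each cover
    at least `min_fraction` of the CpGs in one CGI.

    cpg_positions      -- iterable of CpG positions inside the CGI
    measured_by_tissue -- dict: tissue name -> iterable of measured positions
    (both structures are assumptions made for this sketch)
    """
    cpgs = set(cpg_positions)
    if not cpgs:
        return False  # a CGI without CpGs cannot be covered

    supporting = 0
    for measured in measured_by_tissue.values():
        covered = len(cpgs & set(measured))
        if covered / len(cpgs) >= min_fraction:
            supporting += 1
    return supporting >= min_tissues


# Hypothetical CGI with ten CpGs measured in three tissues.
example = {"CD4": range(8), "CD8": range(7), "Sperm": range(3)}
print(cgi_is_covered(range(10), example))
```
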
Validity indices used to estimate the optimum number of data clusters
| CGI methylation data | |||
|---|---|---|---|
| Hierarchical clusters | ICT | K-means clusters | ADSM |
| 35 | 2.25 | 5 | 0.605 |
| 38 | 2.2 | 6 | 0.509 |
| 41 | 2.15 | 7 | 0.497 |
| 44 | 2.1 | 8 | 0.490 |
| 48 | 2.05 | 9 | 0.472 |
| 50 | 2 | 10 | 0.485 |
| Hierarchical clusters | ICT | K-means clusters | ADSM |
| 1 | 2.7 | 5 | 0.184 |
| 9 | 2.55 | 6 | 0.177 |
| 9 | 2.5 | 7 | 0.169 |
| 16 | 2.45 | 8 | 0.145 |
| 24 | 2.4 | 9 | 0.134 |
| 31 | 2.35 | 10 | -0.348 |
Figure 2. Determining non-redundant CGI profiles. Elimination of redundant CGI profiles. Initially, 55 profiles (relations between CGI sequence attributes and methylation classes linked by the probability of intersection) were identified. We grouped all profiles recognizing the same observations using column/row hierarchical clustering and summarized each cluster by its most representative prototype (i.e., the most supported relation of each cluster). The validity index we used (see Methods) suggests a partition into 9 final profiles.
CGI profiles: Methylation values of each non-redundant CGI profile.
| Constitutively methylated | Unmethylated in sperm | Differentially methylated | Constitutively unmethylated | ||||||
|---|---|---|---|---|---|---|---|---|---|
| CD4 | 0.074 | 0.078 | 0.065 | 0.076 | 0.094 | 0.072 | |||
| CD8 | 0.905 | 0.089 | 0.078 | 0.071 | 0.104 | 0.136 | 0.092 | ||
| DF | 0.070 | 0.066 | 0.068 | 0.079 | 0.082 | 0.045 | |||
| DK | 0.077 | 0.061 | 0.063 | 0.072 | 0.074 | 0.086 | |||
| DM | 0.854 | 0.093 | 0.103 | 0.079 | 0.102 | 0.133 | 0.108 | ||
| FL | 0.762 | 0.104 | 0.117 | 0.105 | 0.138 | 0.158 | 0.157 | ||
| FSM | 0.793 | 0.097 | 0.136 | 0.127 | 0.125 | 0.118 | 0.052 | ||
| HM | 0.068 | 0.071 | 0.059 | 0.064 | 0.072 | 0.059 | |||
| Liver | 0.883 | 0.073 | 0.071 | 0.065 | 0.080 | 0.095 | 0.077 | ||
| Placenta | 0.846 | 0.085 | 0.075 | 0.066 | 0.096 | 0.098 | 0.123 | ||
| SM | 0.804 | 0.071 | 0.072 | 0.057 | 0.062 | 0.078 | 0.067 | ||
| Sperm | 0.102 | 0.086 | 0.105 | 0.125 | 0.117 | 0.112 | |||
| Average methylation ± Std | 0.084 ± 0.066 | 0.085 ± 0.068 | 0.078 ± 0.058 | 0.094 ± 0.074 | 0.105 ± 0.088 | 0.088 ± 0.065 | |||
Average and tissue-specific methylation values of the nine non-redundant CGI profiles. The methylation values mentioned during the comparison of the profiles are marked in bold and all differences between profiles are supported by a MWW p-value lower than 0.01.
CGI profiles: Attribute values of each non-redundant CGI profile.
| Constitutively methylated | Unmethylated in sperm | Differentially methylated | Constitutively unmethylated | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Attributes | |||||||||
| O/E ratio1 | 0.239 | 0.253 | |||||||
| CpG distance1 | |||||||||
| SD1 | |||||||||
| G+C content1 | |||||||||
| Repetitive content2 | 0.010 | 0.000 | 0.002 | 0.006 | 0.005 | 0.017 | 0.036 | 0.042 | |
| Repetitive elements2 | 0.004 | 0.000 | 0.004 | 0.012 | 0.004 | 0.037 | 0.066 | 0.078 | |
| 0.063 | 0.057 | 0.081 | 0.092 | ||||||
| 0.330 | 0.336 | 0.242 | 0.049 | 0.060 | 0.076 | 0.064 | 0.066 | 0.050 | |
| CG4 | 0.097 | 0.101 | |||||||
| GC4 | 0.203 | 0.215 | |||||||
| AA4 | 0.171 | 0.156 | 0.141 | ||||||
| TT4 | 0.181 | 0.239 | 0.154 | ||||||
| TA4 | 0.325 | 0.350 | 0.212 | 0.323 | 0.225 | 0.194 | 0.160 | 0.180 | 0.116 |
| AT4 | 0.247 | 0.305 | 0.211 | 0.161 | |||||
| CA4 | 0.407 | 0.320 | 0.360 | 0.340 | 0.358 | 0.208 | |||
| TG4 | 0.386 | 0.446 | 0.453 | 0.415 | 0.375 | 0.460 | |||
| AC4 | 0.465 | 0.351 | 0.399 | 0.477 | 0.361 | 0.379 | 0.340 | 0.305 | 0.193 |
| GT4 | 0.446 | 0.410 | 0.456 | 0.391 | 0.420 | 0.338 | 0.329 | 0.363 | 0.329 |
| AG4 | 0.467 | 0.447 | 0.360 | 0.498 | 0.413 | 0.465 | 0.448 | 0.401 | 0.360 |
| CT4 | 0.463 | 0.493 | 0.443 | 0.437 | 0.532 | 0.492 | 0.436 | 0.445 | 0.401 |
| CC4 | 0.303 | 0.323 | 0.277 | ||||||
| GG4 | 0.296 | 0.322 | 0.292 | ||||||
| GA4 | 0.381 | 0.359 | 0.306 | 0.429 | 0.336 | 0.376 | 0.362 | 0.298 | 0.270 |
| TC4 | 0.352 | 0.355 | 0.417 | 0.350 | 0.393 | 0.314 | 0.299 | 0.378 | 0.213 |
| Bending5 | 0.597 | 0.596 | |||||||
| Curvature5 | 0.506 | 0.547 | |||||||
| Stacking energy5 | |||||||||
| Turns5 | 0.923 | 0.920 | |||||||
| Degree of twist5 | 0.920 | 0.916 | 0.936 | 0.940 | 0.940 | 0.921 | 0.925 | 0.903 | |
| DNA cleavage5 | 0.192 | 0.187 | |||||||
| Bases per turn5 | 0.074 | 0.078 | 0.060 | 0.055 | 0.057 | 0.073 | 0.070 | 0.090 | |
| Twist constraint5 | 0.763 | 0.749 | |||||||
| Tilt constraint5 | 0.645 | 0.657 | 0.565 | 0.714 | 0.729 | 0.725 | 0.738 | 0.768 | 0.683 |
| Roll constraint5 | 0.484 | 0.504 | 0.467 | 0.453 | 0.522 | 0.572 | 0.547 | 0.471 | 0.652 |
| Shift constraint5 | 0.606 | 0.592 | 0.527 | 0.721 | 0.683 | 0.672 | 0.674 | 0.696 | 0.587 |
| Slide constraint5 | 0.611 | 0.606 | 0.677 | 0.678 | 0.652 | 0.645 | 0.578 | 0.469 | 0.662 |
| Rise constraint5 | 0.536 | 0.571 | 0.466 | 0.548 | 0.617 | 0.649 | 0.631 | 0.583 | 0.691 |
The attribute values in Table 4 are normalized between 0 and 1. This normalization is performed before the clustering process in order to prevent biased clusters caused by attributes with high absolute values. The significance at a p < 0.05 is relative to these normalized values. The non-normalized values are available in the supplementary information. The attribute values mentioned during the comparison of the profiles are marked in bold and all differences between profiles are supported by a MWW p-value lower than 0.01.
1CGI-specific attributes; 2Repetitive sequences; 3Evolutionary conservation; 4Dinucleotide content; 5Structural and physicochemical properties;
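The normalization described above can be illustrated with a short sketch. The text states only that attributes are scaled to the range 0 to 1 before clustering; min-max scaling is assumed here as one common way to achieve that, and the function name and example values are hypothetical.

```python
def min_max_normalize(columns):
    """Rescale every attribute (column) to the range [0, 1].

    columns -- dict: attribute name -> list of raw values
    Min-max scaling is an assumption of this sketch; the source only
    says that values are normalized between 0 and 1.
    """
    normalized = {}
    for name, values in columns.items():
        lowest, highest = min(values), max(values)
        span = highest - lowest
        # A constant attribute has no spread; map it to 0.0 to avoid
        # dividing by zero.
        normalized[name] = [
            0.0 if span == 0 else (value - lowest) / span
            for value in values
        ]
    return normalized


# Hypothetical raw values: without scaling, the attribute with the
# larger absolute values would dominate a distance-based clustering.
raw = {
    "G+C content": [0.50, 0.60, 0.70],
    "CpG distance": [10, 20, 40],
}
scaled = min_max_normalize(raw)
print(scaled["CpG distance"])
```

After scaling, both attributes span the same interval, so neither contributes disproportionately to the distances used by hierarchical or k-means clustering.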
Profile support
| Profile | # CGIs (%) | Methylation pattern |
|---|---|---|
| 1 | 84 (16.25) | Constitutively methylated |
| 2 | 44 (8.51) | Unmethylated in sperm |
| 3 | 36 (6.96) | Differentially methylated |
| 4 | 68 (13.15) | Constitutively unmethylated |
| 5 | 36 (6.96) | |
| 6 | 62 (11.99) | |
| 7 | 76 (14.70) | |
| 8 | 60 (11.60) | |
| 9 | 29 (5.61) | |
| Misclassified | 22 (4.26) | |
A CGI was defined as being misclassified if it lacked an experimental methylation value in sperm but was classified nonetheless as constitutively methylated or unmethylated solely in sperm. CGIs that were missing methylation data in more than half of tissues and cell types under analysis were also defined as misclassified and removed.
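The two misclassification criteria can be expressed as a simple rule. The sketch below is illustrative only: it assumes that each CGI is represented by a dictionary from tissue name to methylation value, with `None` marking a missing measurement, and the function and constant names are hypothetical.

```python
# Classes whose assignment depends on a measured sperm value.
SPERM_DEPENDENT_CLASSES = {
    "Constitutively methylated",
    "Unmethylated in sperm",
}


def is_misclassified(methylation, assigned_class):
    """Apply the two criteria described above to one CGI.

    methylation    -- dict: tissue name -> value, None if not measured
    assigned_class -- methylation pattern assigned to the CGI
    (the data layout is an assumption made for this sketch)
    """
    missing = [t for t, value in methylation.items() if value is None]

    # Criterion 2: data missing in more than half of the tissues.
    if len(missing) > len(methylation) / 2:
        return True

    # Criterion 1: no sperm value, yet the class depends on sperm.
    sperm_missing = methylation.get("Sperm") is None
    return sperm_missing and assigned_class in SPERM_DEPENDENT_CLASSES


# Hypothetical CGI without a sperm measurement.
cgi = {"CD4": 0.90, "Liver": 0.88, "Sperm": None}
print(is_misclassified(cgi, "Unmethylated in sperm"))
```
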
Figure 3. Linking clusters and feature selection of new methylation classes. Summarization and feature selection of CGI profiles. A) Identification of 9 CGI profiles by linking CGI sequence attribute clusters (lower left corner) and methylation clusters (upper left corner) by the probability of intersection (PI), which is calculated based on the hypergeometric distribution (blue color). The attributes were normalized within the colourmap intervals. Notably, the relations are built based on the PI (line color; dark blue: low p-value; light blue: high p-value), which differs substantially from the typical support-of-intersection measurement (line weight; thin: few; thick: many). For example, the fifth relation (5th column from left) is supported by just ~40 observations (thin line), but most of the CGI sequence attribute observations correspond to the 4th methylation class and only a few belong to other classes. This approach can generate cohesive relations even if they are not highly supported. The nine methylation profiles are summarized by similarity of their prototypes, constituting 4 final methylation classes (I-IV). These classes were used to label all CGI sequence attribute observations. B) Feature selection for each class based on the dataset labeled in A). This process was carried out locally using decision trees (Matlab), where the desired class (labeled red leaf) was distinguished from all of the others (unlabeled black leaf).
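To make the idea of a hypergeometric probability of intersection concrete, the sketch below computes the upper-tail probability of seeing at least the observed overlap between two clusters drawn from the same set of observations. It is a generic hypergeometric test and not necessarily the exact PI defined in the paper's methods; the function name, argument names, and example numbers are hypothetical.

```python
from math import comb


def intersection_p_value(n_total, size_a, size_b, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    n_total -- number of observations (e.g. CGIs) overall
    size_a  -- size of the sequence-attribute cluster
    size_b  -- size of the methylation cluster
    overlap -- observations shared by both clusters
    """
    max_overlap = min(size_a, size_b)
    # Number of ways to choose cluster B from all observations.
    all_draws = comb(n_total, size_b)
    # Sum the ways that k of B's members also fall inside cluster A.
    tail = sum(
        comb(size_a, k) * comb(n_total - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / all_draws


# Hypothetical numbers: a small attribute cluster whose members fall
# almost entirely into one methylation cluster.
p = intersection_p_value(n_total=517, size_a=40, size_b=60, overlap=30)
print(f"{p:.3e}")
```

A small relation can therefore receive a low p-value when its members concentrate in one methylation class, which is why a relation with few supporting observations can still be cohesive.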
Distribution of CGIs over the gene association classes
| Gene association class | Location of CGI | # CGI (%) |
|---|---|---|
| Pseudogene | Within 1.5 kb of a pseudogene | 62 (12.53) |
| TSS | Overlapping TSS | 146 (29.49) |
| Promoter | Overlapping extended promoter region, (1.5 kb upstream of the TSS to end of the 5'UTR) | 178 (35.96) |
| 3'UTR | Overlapping 3' UTR and may overlap CDS | 12 (2.42) |
| CDS | Overlapping protein coding region | 51 (10.30) |
| Intron | Lies entirely within an Intron (excluding 3' and 5' UTRs) | 29 (5.86) |
| NA | Outside of the gene environment | 17 (3.43) |
Coincidence between gene association classes and PBCs
| Gene association class | Constitutively unmethylated | Constitutively methylated | Unmethylated in sperm | Differentially methylated |
|---|---|---|---|---|
| TSS | 3 | 3 | 2 | |
| Promoter | 118 | 28 | 17 | 15 |
| 3'UTR | 1 | 1 | 4 | |
| CDS | 14 | 7 | 7 | |
| Intron | 12 | 8 | 0 | |
| Pseudogene | 33 | 11 | 4 | |
| NA | 7 | 5 | 2 | 3 |
| Sum (% of total) | 323 (65.25) | 84 (16.97) | 44 (8.89) | 44 (8.89) |
Only the p-values (PI) of significant intersections (< 0.01) are shown and marked in bold.
Re-classification using functional CGI profiles
| Methylation class | # CGI (%) | Significant gene associations | |
|---|---|---|---|
| Constitutively unmethylated | 323 (65.25) | Promoter/TSS | 1.00E-12 |
| Constitutively methylated | 84 (16.97) | CDS | 4.27E-08 |
| Unmethylated in sperm | 44 (8.89) | Pseudogenes | 3.5E-03 |
| Differentially methylated | 44 (8.89) | Introns | 4.01E-04 |
Distribution of conserved and not conserved CGIs over PBCs and gene association classes
| Conserved CGIs | ||||
|---|---|---|---|---|
| TSS | 75 (23.22) | 2 (2.38) | - | - |
| Promoter | 43 (13.31) | 24 (28.57) | 8 (18.18) | 7 (15.91) |
| 3'UTR | 1 (0.31) | 2 (2.38) | 1 (2.27) | 3 (6.82) |
| CDS | 12 (3.72) | 23 (27.38) | 7 (15.91) | 7 (15.91) |
| Intron | 6 (1.86) | 3 (3.57) | - | 2 (4.55) |
| Pseudogene | 18 (5.57) | 8 (9.52) | 7 (15.91) | 2 (4.55) |
| NA | 2 (0.62) | 1 (1.19) | 2 (4.55) | 3 (6.82) |
| Total # of conserved CGIs | 157 (48.61) | 63 (75.00) | 25 (56.82) | 24 (54.55) |
| TSS | 63 (19.50) | 1 (1.19) | 3 (6.82) | 2 (4.55) |
| Promoter | 75 (23.22) | 4 (4.76) | 9 (20.45) | 8 (18.18) |
| 3'UTR | - | 4 (4.76) | - | 1 (2.27) |
| CDS | 2 (0.62) | - | - | - |
| Intron | 6 (1.86) | 5 (5.95) | - | 7 (15.91) |
| Pseudogene | 15 (4.64) | 3 (3.57) | 7 (15.91) | 2 (4.55) |
| NA | 5 (1.55) | 4 (4.76) | - | - |
| Total # of not conserved CGIs | 166 (51.39) | 21 (25.00) | 19 (43.18) | 20 (45.45) |
| Total # CGIs per PBC | 323 | 84 | 44 | 44 |
*Absolute number and percentage of conserved and not conserved CGIs of each PBC in the gene-association classes.
Comparison of accuracy using binary methylation classification.
| Methylation classification | Dataset | Methods | Acc | CC |
|---|---|---|---|---|
| binary | HEP1 | | 84.90 | 0.657 |
| binary | HEP1 | | 75.80 | 0.462 |
| binary | HEP1 | | 90.08 | 0.743 |
| binary | | | 85.20 | 0.658 |
| binary | | | 78.60 | 0.524 |
| binary | | | 91.67 | 0.775 |
| four classes | HEP3 | | 89.39 | 0.707 |
* Validation was performed using 10 repetitions of 10-fold cross-validation.
** Validation was performed using 10-fold cross-validation.
1HEP CGI data (using our attributes and binary methylation classes)
2EpiGRAPH methylation data (using the default EpiGRAPH sequence attributes and binary methylation classes)
3HEP CGI data (using our attributes and four methylation classes)
The average accuracy (Acc) and correlation coefficient (CC) were used to measure fitness.