Michael Hackenberg, Guillermo Barturen, Pedro Carpena, Pedro L Luque-Escamilla, Christopher Previti, José L Oliver.
Abstract
BACKGROUND: Unmethylated stretches of CpG dinucleotides (CpG islands) are an outstanding property of mammal genomes. Conventionally, these regions are detected by sliding window approaches using %G + C, CpG observed/expected ratio and length thresholds as main parameters. Recently, clustering methods directly detect clusters of CpG dinucleotides as a statistical property of the genome sequence.
Year: 2010 PMID: 20500903 PMCID: PMC2887419 DOI: 10.1186/1471-2164-11-327
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
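The sliding-window criteria named in the abstract (%G + C, CpG observed/expected ratio and length) can be illustrated with a minimal sketch. The function names and the default thresholds below are assumptions for illustration only (the defaults follow the commonly cited Gardiner-Garden and Frommer values, which are not stated in this record), and the observed/expected formula is the usual one, not necessarily the exact variant used by each tested program.

```python
def window_stats(seq):
    """Return (gc_fraction, cpg_obs_exp) for one window of DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the true CpG count.
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpGs under independence: (#C * #G) / window length.
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def passes_thresholds(seq, min_len=200, min_gc=0.5, min_oe=0.6):
    """Check one window against illustrative (assumed) island thresholds."""
    gc_fraction, obs_exp = window_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_oe


if __name__ == "__main__":
    print(window_stats("CGCG"))
    print(passes_thresholds("CG" * 100))
    print(passes_thresholds("AT" * 100))
```

A full sliding-window detector would additionally move the window along the sequence and merge adjacent passing windows; that step is omitted here.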
Figure 1. Comparison of the distributions of the island length for both the . For all four properties, the SWA distributions are heavily biased towards their respective thresholds, whereas the CpGcluster distributions do not show this artifact.
Co-localization of CpG islands and the promoter region.
| Method | Number of predicted islands | Genome coverage (%) | Promoter overlap (R13): number of islands | Promoter overlap (R13): % of predicted islands |
|---|---|---|---|---|
| TJ | 37,323 | 1.43 | 14,034 | 37.60 |
| UCSC | 27,639 | 0.74 | 13,369 | 48.40 |
| CpGproD | 76,886 | 2.81 | 14,814 | 19.30 |
| relaxed set* | 198,702 | 1.90 | 30,660 | 15.43 |
| strict set** | 25,454 | 0.65 | 13,349 | 52.40 |
*p-value ≤ 1E-5; **p-value ≤ 1E-20
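As a rough illustration of how a promoter-overlap count like the one in this table could be obtained, the sketch below counts predicted islands that share at least one base with any promoter region. It assumes half-open `(start, end)` coordinates on a single chromosome and uses hypothetical function names; the record does not describe the authors' actual overlap procedure, and a genome-scale analysis would normally use a sorted sweep or an interval tree rather than this quadratic loop.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_summary(islands, promoters):
    """Return (number of islands overlapping any promoter, percentage of islands)."""
    n_hit = sum(any(overlaps(i, p) for p in promoters) for i in islands)
    pct = 100.0 * n_hit / len(islands) if islands else 0.0
    return n_hit, pct


if __name__ == "__main__":
    # Toy coordinates, not taken from the paper.
    islands = [(0, 10), (20, 30), (50, 60)]
    promoters = [(5, 8), (29, 40)]
    print(overlap_summary(islands, promoters))
```

The last column of the table agrees with this kind of ratio: for TJ, 14,034 of 37,323 predicted islands is about 37.60%.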
Figure 2. The length of . It can be seen that no linear correlation exists and that the relation between p-value and length is more complex, e.g. the p-value depends on both the island length and the island density.
Figure 3. Variation of the overlap fraction of predicted islands with RefSeq promoter regions. The different sets of predicted islands have been obtained by varying the CpGcluster p-value and the SWA window length.
Figure 4. Variation of the mean coverage by PhastCons in different predicted island-sets obtained by varying the .
Figure 5. Variation of the mean coverage by .
Correspondence between the number of predicted islands, log (p-value) and window length.
| No. of predicted islands | log (p-value) | Window length |
|---|---|---|
| 193,856 | 5.06509 | 200 |
| 139,013 | 6.1864 | 250 |
| 109,907 | 7.19943 | 300 |
| 69,477 | 9.82744 | 350 |
| 52,687 | 11.85626 | 400 |
| 42,392 | 13.73824 | 450 |
| 37,293 | 14.96788 | 500 |
| 33,691 | 15.95388 | 550 |
| 30,881 | 16.8824 | 600 |
| 28,162 | 18.18919 | 650 |
| 26,192 | 19.45203 | 700 |
Prediction of unmethylated regions (Bird's islands, N = 17,383).
| Method | Number of predicted islands | Number of islands overlapping a Bird's island | Number of Bird's islands 'touched' by the prediction | SN | PPV |
|---|---|---|---|---|---|
| TJ | 37,293 | 14,315 | 14,942 | 0.854 | 0.384 |
| UCSC | 27,639 | 13,858 | 14,256 | 0.816 | 0.501 |
| CpGproD | 76,886 | 14,250 | 15,346 | 0.875 | 0.185 |
| relaxed set* | 198,702 | 29,235 | 15,497 | 0.939 | 0.147 |
| strict set** | 25,454 | 14,809 | 12,623 | 0.757 | 0.582 |
*p-value ≤ 1E-5; **p-value ≤ 1E-20
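The PPV column of the table above is consistent with a simple ratio: the number of predicted islands overlapping a Bird's island divided by the total number of predicted islands. The sketch below recomputes it from the tabulated counts; treating PPV this way is an inference from the numbers, and the helper name `ppv` is hypothetical. SN is not recomputed, because this record does not state how it was derived.

```python
def ppv(n_overlapping_predictions, n_predictions):
    """Fraction of predicted islands that overlap a reference region."""
    return n_overlapping_predictions / n_predictions


# (number of predicted islands, number overlapping a Bird's island), from the table
ROWS = {
    "TJ": (37293, 14315),
    "UCSC": (27639, 13858),
    "CpGproD": (76886, 14250),
    "relaxed set": (198702, 29235),
    "strict set": (25454, 14809),
}

if __name__ == "__main__":
    for method, (n_pred, n_overlap) in ROWS.items():
        print(f"{method}: PPV = {ppv(n_overlap, n_pred):.3f}")
```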
Prediction of unmethylated regions (Weber's regions, N = 13,277).
| Method | Number of predicted islands | Number of islands overlapping a Weber's region | Number of Weber's regions 'touched' by the prediction | SN | PPV |
|---|---|---|---|---|---|
| TJ | 37,293 | 10,179 | 9,965 | 0.755 | 0.273 |
| UCSC | 27,639 | 9,788 | 9,552 | 0.724 | 0.354 |
| CpGproD | 76,886 | 10,320 | 10,257 | 0.774 | 0.134 |
| relaxed set* | 198,702 | 18,967 | 10,372 | 0.867 | 0.095 |
| strict set** | 25,454 | 9,633 | 8,378 | 0.663 | 0.378 |
*p-value ≤ 1E-5; **p-value ≤ 1E-20
Overlap of different CGIs with 3,465 domains bound by the polycomb repressive complex 2 (PRC2).
| Method | Number of predicted islands | Number of islands overlapping PRC2 domains | Number of PRC2 domains 'touched' by the prediction | SN | PPV |
|---|---|---|---|---|---|
| TJ | 37,293 | 3,523 | 3,033 | 0.891 | 0.094 |
| UCSC | 27,639 | 3,179 | 2,790 | 0.825 | 0.115 |
| CpGproD | 76,886 | 3,321 | 3,159 | 0.916 | 0.043 |
| relaxed set* | 198,702 | 9,097 | 3,097 | 0.961 | 0.046 |
| strict set** | 25,454 | 3,424 | 2,372 | 0.758 | 0.135 |
*p-value ≤ 1E-5; **p-value ≤ 1E-20
Figure 6. Distribution of the number of .
Co-localization of CpG islands and alternative promoters.
| Method | Numbers of overlapping islands | | |
|---|---|---|---|
| TJ | 13,759 | 8,868 (64.45%) | 4,891 (35.55%) |
| UCSC | 11,826 | 8,143 (68.86%) | 3,683 (31.14%) |
| CpGproD | 15,319 | 9,801 (63.98%) | 5,518 (36.02%) |
| relaxed set* | 15,095 | 12,034 (79.72%) | 3,061 (20.28%) |
| strict set** | 10,325 | 7,659 (74.18%) | 2,666 (25.82%) |
*p-value ≤ 1E-5; **p-value ≤ 1E-20
Figure 7. A bidirectional promoter region in human chromosome 22 which is overlapped by one TJ or UCSC island but by several . The two genes show very different expression profiles, and therefore it is very likely that the prediction of different islands for the different TSSs as done by CpGcluster is the better choice. The figure was obtained by using the UCSC Genome Browser [46].
Figure 8. A 317 bp long region of human chromosome 22 showing strong heterogeneity in methylation. CpGcluster predicts separate islands for each methylation domain, while TJ and all the remaining tested sliding-window approaches predict only one longer island overlapping the different methylation domains.
Figure 9. Distribution of the maximal methylation differences between . a) HEP methylation data; b) Lister's methylation data [33].
Overlap of CpG islets (N = 88,137) with different sets of promoters and evolutionarily conserved elements.
| Genome element | Number of overlapping CpG islets | Number of overlapping CpG islets exclusively predicted by CpGcluster |
|---|---|---|
| Promoters from RefSeq database | 9,826 (11.15%) | 1,218 (12.40%) |
| TSSs from DBTSS database | 1,868 (2.12%) | 398 (21.31%) |
| Promoter regions from DBTSS database | 6,510 (7.39%) | 4,869 (74.79%) |
| PhastCons | 17,613 (19.98%) | 8,219 (46.66%) |
Number of unmethylated and differentially methylated CpG 'islets'.
| Dataset | Methylation state* | Number of CpG islets | CpG 'islets' exclusively predicted by CpGcluster |
|---|---|---|---|
| HEP (12 tissues)** | Unmethylated | 126 | 1 |
| | Differentially methylated | 26 | 8 |
| Lister et al. 2009 (2 cell lines)*** | Unmethylated | 4,460 | 1,472 |
| | Differentially methylated | 373 | 295 |
*Unmethylated: average methylation ≤ 0.2; differentially methylated: average methylation ≤ 0.2 in at least one tissue and average methylation ≥ 0.8 in at least one other tissue.
**The methylation state of 246 CpG 'islets' from chromosomes 6, 20 and 22 was determined by using 3,168 individual CpG sites (HEP project). We only included CpGs which have been detected in at least 2 clones or in at least 6 different tissues.
***We used the sequence reads obtained by MethylC-Seq for two human cell lines [33], H1 human embryonic stem cells and IMR90 fetal lung fibroblasts, to get the average methylation level of single cytosines at both DNA strands for these two methylomes. Each island needs more than 50% of its CpGs covered. Only cytosines covered by at least 10 reads were counted.
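The footnote criteria above can be expressed as a short sketch. It rests on several assumptions that go beyond what the footnotes state: an islet's methylation level is taken as the mean of per-cytosine methylated/total read ratios, "unmethylated" is read as average methylation ≤ 0.2 in every sample, and the function names and the "other" label are hypothetical.

```python
def islet_methylation(sites, min_reads=10, min_covered_fraction=0.5):
    """Average methylation of one islet, or None if coverage is insufficient.

    sites: list of (methylated_reads, total_reads), one entry per CpG cytosine.
    Only cytosines with at least min_reads reads are counted, and more than
    min_covered_fraction of the islet's CpGs must be covered.
    """
    covered = [(m, t) for m, t in sites if t >= min_reads]
    if not sites or len(covered) / len(sites) <= min_covered_fraction:
        return None
    # Assumption: unweighted mean of per-site methylation ratios.
    return sum(m / t for m, t in covered) / len(covered)


def classify_islet(avg_by_sample, low=0.2, high=0.8):
    """Classify an islet from its average methylation per tissue or cell line."""
    values = list(avg_by_sample.values())
    if not values:
        return "undetermined"
    if any(v <= low for v in values) and any(v >= high for v in values):
        return "differentially methylated"
    if all(v <= low for v in values):
        return "unmethylated"
    return "other"


if __name__ == "__main__":
    # Toy read counts and levels, not taken from the paper.
    print(islet_methylation([(0, 10), (1, 20), (5, 5)]))
    print(classify_islet({"H1": 0.1, "IMR90": 0.9}))
```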