Michael Hackenberg, Christopher Previti, Pedro Luis Luque-Escamilla, Pedro Carpena, José Martínez-Aroza, José L Oliver.
Abstract
BACKGROUND: Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content.
Year: 2006 PMID: 17038168 PMCID: PMC1617122 DOI: 10.1186/1471-2105-7-446
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Benchmarking of CpGcluster
| Program | Sn ± SD | Sp ± SD | CC ± SD | Hit* [%] ± SD |
| | 0.545 ± 0.002 | 0.973 ± 0.002 | 0.725 ± 0.005 | 87.000 ± 0.540 |
| | 0.918 ± 0.003 | 0.657 ± 0.003 | 0.772 ± 0.006 | 94.675 ± 0.808 |
| | 0.832 ± 0.003 | 0.756 ± 0.007 | 0.789 ± 0.013 | 86.675 ± 1.528 |
| | 0.910 ± 0.002 | 0.667 ± 0.003 | 0.775 ± 0.006 | 94.650 ± 0.810 |
| | 0.819 ± 0.013 | 0.584 ± 0.004 | 0.685 ± 0.005 | 84.075 ± 1.191 |
| | 0.655 ± 0.003 | 0.976 ± 0.005 | 0.797 ± 0.009 | 95.475 ± 0.870 |
| | 0.866 ± 0.006 | 0.832 ± 0.009 | 0.846 ± 0.006 | 95.050 ± 0.643 |
*Hit: Percentage of true islands overlapping (by at least one nucleotide) with predicted islands.
Average accuracy values (Sn, Sp and CC) of CpGcluster and five other CGI finders over 10 test sequences, each containing 400 experimental CGIs. Default parameter values, as recommended in the corresponding publications, were used for each program.
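The accuracy measures in this table can be illustrated with a minimal sketch. It assumes the gene-finding convention in which Sn = TP/(TP+FN) and Sp = TP/(TP+FP) are counted per nucleotide and CC is the correlation coefficient of the resulting 2x2 table; the record above does not spell out the definitions, so this is an assumption. Intervals are half-open `(start, end)` pairs, and the function names are hypothetical.

```python
from math import sqrt


def to_mask(intervals, length):
    """Mark every position covered by a half-open interval [start, end)."""
    mask = [False] * length
    for start, end in intervals:
        for i in range(max(start, 0), min(end, length)):
            mask[i] = True
    return mask


def accuracy(true_islands, predicted_islands, length):
    """Return (Sn, Sp, CC) at the nucleotide level (assumed definitions)."""
    truth = to_mask(true_islands, length)
    pred = to_mask(predicted_islands, length)
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum(p and not t for t, p in zip(truth, pred))
    fn = sum(t and not p for t, p in zip(truth, pred))
    tn = length - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


def hit_percentage(true_islands, predicted_islands):
    """Percentage of true islands sharing at least one nucleotide with a prediction."""
    hits = sum(
        any(ts < pe and ps < te for ps, pe in predicted_islands)
        for ts, te in true_islands
    )
    return 100.0 * hits / len(true_islands)


if __name__ == "__main__":
    true_islands = [(10, 30), (60, 80)]
    predicted = [(20, 40)]
    print(accuracy(true_islands, predicted, 100))
    print(hit_percentage(true_islands, predicted))
```

Under this convention a program can score a high hit percentage with a low Sp, since touching an island by one nucleotide counts as a hit while every extra predicted base lowers Sp.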
Basic statistics of CpGcluster and CpGProD islands
| Statistic | CpGcluster, human | CpGProD, human | CpGcluster, mouse | CpGProD, mouse |
| Genome length (without N-runs, bp) | 2.85E+09 | 2.85E+09 | 2.51E+09 | 2.51E+09 |
| Total number of CpGs | 28,073,991 | 28,073,991 | 20,967,593 | 20,967,593 |
| CpG-dinucleotides in CpG-islands (%) | 4,489,575 (15.99) | 4,323,799 (15.40) | 2,708,986 (12.92) | 2,215,608 (10.57) |
| Number of islands predicted | 197,727 | 76,793 | 117,373 | 40,171 |
| *Island coverage (%) | 1.90 | 2.81 | 1.47 | 1.65 |
| Island length (bp): | | | | |
| Average | 273.5 ± 246.7 | 1043.8 ± 761.7 | 314.0 ± 293.8 | 1030.3 ± 560.0 |
| Minimum | 8 | 500 | 8 | 500 |
| Maximum | 7,774 | 42,276 | 5,618 | 9,288 |
| Average island GC-content (%) | 63.76 ± 7.51 | 54.58 ± 6.12 | 61.58 ± 10.03 | 54.62 ± 5.17 |
| Average CpG O/E ratio | 0.855 ± 0.265 | 0.636 ± 0.089 | 0.956 ± 0.428 | 0.652 ± 0.103 |
| Average CpG-density | 0.087 ± 0.041 | 0.047 ± 0.016 | 0.097 ± 0.084 | 0.048 ± 0.015 |
*Percentage of the genome covered by the CpG-islands.
Basic statistics of the CpG-islands predicted by CpGcluster and CpGProD in the human (hg17) and mouse genomes (mm7).
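The per-island quantities in this table (GC-content, CpG O/E ratio, CpG-density) can be computed from a sequence as in the following sketch. It assumes the commonly used definitions: O/E = (number of CpGs × length) / (number of C × number of G), and density = number of CpGs / length. The record does not state the formulas, so both are assumptions, and `island_stats` is a hypothetical name.

```python
def island_stats(seq):
    """Return (GC content in %, CpG density, CpG observed/expected ratio)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = 100.0 * (c + g) / n
    cpg_density = cpg / n
    # Expected CpG count under independence of C and G is c * g / n.
    oe_ratio = cpg * n / (c * g) if c and g else 0.0
    return gc_content, cpg_density, oe_ratio


if __name__ == "__main__":
    print(island_stats("ACGTCGCG"))
```

These definitions are at least consistent with the averages in the table: with about 64% GC split evenly between C and G, a CpG-density of 0.087 corresponds to an O/E ratio close to the reported 0.855.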
Overlap with PhastCons and MAGE genes
| Program | #CGI overlapping TSS of MAGE genes | Average length ± SD | % overlap with Alus | % overlap with PhastCons |
| | 2 | 271.0 ± 18.4 | 19.49 | 23.73 |
| | 3 | 1,314.3 ± 525.1 | 23.40 | 13.31 |
| | 3 | 800.0 ± 243.3 | 10.52 | 20.59 |
| | 3 | 1,093.0 ± 476.1 | 23.99 | 14.00 |
| | 2 | 730.5 ± 320.3 | 15.32 | 15.82 |
| | 8 | 258.3 ± 100.8 | 6.79 | 28.53 |
Predicted CpG islands in 10 MAGE genes (left part) and percentage of overlap between the CpG-islands predicted in the human chromosome 22 and both Alu retrotransposons and PhastCons (evolutionarily conserved elements).
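One way to obtain an overlap percentage like those in the right part of this table is sketched below. It assumes that "percentage of overlap" means the share of island bases covered by the other feature set (Alus or PhastCons elements); the record does not say whether overlap is counted per base or per island, so this is an assumption. Intervals are half-open `(start, end)` pairs on one chromosome, and the function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_percentage(islands, features):
    """Percentage of island bases that are also covered by any feature."""
    islands = merge(islands)
    features = merge(features)
    total = sum(end - start for start, end in islands)
    shared = 0
    j = 0
    for start, end in islands:
        # Skip features that end before this island starts.
        while j < len(features) and features[j][1] <= start:
            j += 1
        k = j
        # Add up the overlap with every feature that starts inside the island.
        while k < len(features) and features[k][0] < end:
            shared += min(end, features[k][1]) - max(start, features[k][0])
            k += 1
    return 100.0 * shared / total if total else 0.0


if __name__ == "__main__":
    islands = [(0, 100), (200, 300)]
    alus = [(50, 120), (250, 260), (290, 400)]
    print(overlap_percentage(islands, alus))
```

Merging both sets first prevents bases covered by two overlapping annotations from being counted twice.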
Location of CpGcluster islands
| Class* | # CpG islands | Length ± SD | CpG density ± SD | Obs/Exp ratio ± SD | %GC ± SD | P-value P25 | P-value median | P-value P75 | PhastCons overlap (%) |
| L1 | 6,775 | 672.3 ± 398.7 | 0.10 ± 0.02 | 0.89 ± 0.13 | 68.4 ± 5.7 | 1.43E-66 | 1.53E-40 | 8.43E-23 | 29.73 |
| L2 | 16,709 | 256.3 ± 213.1 | 0.09 ± 0.04 | 0.89 ± 0.23 | 64.6 ± 7.8 | 2.61E-14 | 5.42E-09 | 7.27E-07 | 21.56 |
| L3 | 29,386 | 230.5 ± 173.0 | 0.08 ± 0.04 | 0.84 ± 0.27 | 63.1 ± 7.1 | 1.17E-10 | 9.94E-08 | 1.86E-06 | 14.97 |
| L4 | 3,880 | 247.8 ± 212.8 | 0.09 ± 0.04 | 0.85 ± 0.23 | 65.8 ± 7.3 | 2.12E-12 | 3.43E-08 | 1.31E-06 | 28.04 |
| NG | 140,977 | 266.0 ± 238.2 | 0.09 ± 0.04 | 0.85 ± 0.27 | 63.5 ± 7.5 | 1.44E-12 | 2.43E-08 | 1.25E-06 | 14.06 |
| L1 | 8,090 | 745.7 ± 373.8 | 0.09 ± 0.02 | 0.83 ± 0.13 | 65.3 ± 5.2 | 2.50E-64 | 1.20E-40 | 7.20E-24 | 35.69 |
| L2 | 10,219 | 302.7 ± 257.4 | 0.09 ± 0.07 | 0.92 ± 0.38 | 62.0 ± 9.3 | 8.60E-15 | 9.15E-09 | 8.69E-07 | 38.80 |
| L3 | 18,734 | 230.6 ± 190.7 | 0.10 ± 0.09 | 0.97 ± 0.45 | 61.4 ± 10.4 | 9.91E-10 | 2.04E-07 | 2.29E-06 | 34.99 |
| L4 | 3,305 | 284.6 ± 232.6 | 0.08 ± 0.06 | 0.87 ± 0.35 | 61.3 ± 8.3 | 1.61E-11 | 7.17E-08 | 1.58E-06 | 49.87 |
| NG | 75,419 | 284.0 ± 257.6 | 0.10 ± 0.09 | 0.98 ± 0.45 | 61.1 ± 10.5 | 2.54E-12 | 2.92E-08 | 1.29E-06 | 20.45 |
*See text for a description of the different classes.
Co-location of CpG islands predicted by CpGcluster and genes in human (hg17) and mouse (mm5) genomes. The first five rows refer to human and the last five to mouse.
Location of CpGProD islands
| Class* | # CpG islands | Length ± SD | CpG density ± SD | Obs/Exp ratio ± SD | %GC ± SD | P-value P25 | P-value median | P-value P75 | PhastCons overlap (%) |
| L1 | 7,310 | 1,831.5 ± 875.8 | 0.06 ± 0.01 | 0.74 ± 0.09 | 59.3 ± 5.0 | 2.30E-78 | 4.20E-51 | 7.06E-33 | 21.01 |
| L2 | 3,542 | 933.0 ± 599.3 | 0.04 ± 0.01 | 0.62 ± 0.08 | 53.6 ± 5.8 | 1.40E-17 | 1.35E-09 | 3.91E-07 | 13.66 |
| L3 | 10,582 | 814.5 ± 436.2 | 0.04 ± 0.01 | 0.61 ± 0.07 | 53.4 ± 5.6 | 1.10E-13 | 3.68E-09 | 4.36E-07 | 10.66 |
| L4 | 1,218 | 1,097.2 ± 775.8 | 0.05 ± 0.02 | 0.63 ± 0.08 | 56.2 ± 6.8 | 6.79E-32 | 5.47E-12 | 1.35E-07 | 16.63 |
| NG | 54,141 | 988.3 ± 739.8 | 0.05 ± 0.02 | 0.63 ± 0.09 | 54.2 ± 6.1 | 2.30E-21 | 2.03E-10 | 1.33E-07 | 12.15 |
| L1 | 7,938 | 1,463.4 ± 576.0 | 0.06 ± 0.01 | 0.72 ± 0.11 | 58.1 ± 4.2 | 8.30E-67 | 3.86E-43 | 9.89E-28 | 27.82 |
| L2 | 1,764 | 1,007.7 ± 560.1 | 0.05 ± 0.01 | 0.64 ± 0.11 | 55.0 ± 5.2 | 3.88E-31 | 2.65E-16 | 3.32E-10 | 34.15 |
| L3 | 4,050 | 807.2 ± 374.0 | 0.04 ± 0.01 | 0.61 ± 0.08 | 53.1 ± 4.5 | 7.05E-17 | 3.52E-11 | 1.19E-08 | 26.29 |
| L4 | 716 | 997.4 ± 501.5 | 0.05 ± 0.01 | 0.64 ± 0.09 | 55.1 ± 5.1 | 9.21E-32 | 1.69E-16 | 4.21E-10 | 34.40 |
| NG | 24,644 | 924.4 ± 502.7 | 0.05 ± 0.01 | 0.64 ± 0.10 | 53.2 ± 5.0 | 4.28E-23 | 1.84E-13 | 1.88E-09 | 17.43 |
*See text for a description of the different classes.
Co-location of the CpG islands predicted by the program CpGProD and genes in human (hg17) and mouse (mm5) genomes. The first five rows refer to human and the last five to mouse.
Figure 1. Probability density function of distances between neighboring CpGs. Distribution of distances between neighboring CpG dinucleotides in human chromosome 1. The observed distribution is shown as symbols, while the random expectation corresponding to the geometric distribution [Eq. 1] is shown as a solid line. Note that, to a good approximation, the median separates over-represented distances from under-represented ones.
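The random expectation mentioned in this caption can be sketched as follows. Eq. 1 itself is not reproduced in this record, so the code assumes the standard geometric form P(d) = p(1 - p)^(d - 1) for d = 1, 2, ..., where p is the per-position probability of a CpG. Estimating p as the reciprocal of the mean distance is a further assumption, and the function names are hypothetical.

```python
from math import ceil, log


def geometric_pmf(d, p):
    """Probability that the next CpG lies exactly d positions away."""
    return p * (1.0 - p) ** (d - 1)


def geometric_mean(p):
    """Expected distance between neighboring CpGs under the random model."""
    return 1.0 / p


def geometric_median(p):
    """Smallest distance d at which the cumulative probability reaches 0.5."""
    return ceil(-log(2.0) / log(1.0 - p))


if __name__ == "__main__":
    p = 1 / 97.7  # reciprocal of the mean distance reported for human chromosome 1
    print(geometric_median(p))
    print([round(geometric_pmf(d, p), 5) for d in (1, 10, 100)])
```

With p = 1/97.7, `geometric_median` returns 68, well above the observed median of 40 listed for human chromosome 1 in the chromosome statistics table. That gap is what one would expect if short distances are over-represented, that is, if CpGs are clustered.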
Statistics of CpG distances and %G+C in human and mouse chromosomes
| Chromosome | Human median | Human mean | Human %G+C | Mouse median | Mouse mean | Mouse %G+C |
| 1 | 40 | 97.7 | 41.73 | 63 | 129.1 | 41.12 |
| 2 | 46 | 109.4 | 40.23 | 56 | 116.4 | 42.07 |
| 3 | 52 | 119.1 | 39.69 | 63 | 129.9 | 40.44 |
| 4 | 53 | 127.1 | 38.21 | 52 | 113.1 | 42.29 |
| 5 | 49 | 116.9 | 39.52 | 51 | 108.4 | 42.51 |
| 6 | 47 | 112.6 | 39.60 | 61 | 124.8 | 41.39 |
| 7 | 40 | 98.6 | 40.72 | 53 | 113.9 | 43.12 |
| 8 | 46 | 108.3 | 40.17 | 52 | 109.5 | 42.35 |
| 9 | 39 | 96.8 | 41.36 | 54 | 111.5 | 42.70 |
| 10 | 41 | 96.2 | 41.58 | 54 | 113.9 | 41.38 |
| 11 | 41 | 100.6 | 41.57 | 47 | 101.0 | 43.82 |
| 12 | 41 | 101.2 | 40.80 | 59 | 121.2 | 41.65 |
| 13 | 50 | 118.1 | 38.52 | 57 | 117.5 | 41.61 |
| 14 | 42 | 101.7 | 40.89 | 62 | 126.5 | 41.10 |
| 15 | 40 | 92.6 | 42.21 | 54 | 114.5 | 41.95 |
| 16 | 31 | 70.9 | 44.79 | 61 | 124.7 | 40.90 |
| 17 | 29 | 66.4 | 45.53 | 49 | 106.3 | 42.61 |
| 18 | 47 | 109.2 | 39.79 | 59 | 120.2 | 41.43 |
| 19 | 23 | 51.8 | 48.36 | 49 | 103.2 | 42.73 |
| 20 | 36 | 81.9 | 44.13 | – | – | – |
| 21 | 37 | 90.9 | 40.88 | – | – | – |
| 22 | 28 | 59.5 | 47.96 | – | – | – |
| X | 52 | 121.0 | 39.46 | 81 | 168.2 | 39.22 |
| Y | 49 | 119.9 | 39.85 | 67 | 155.1 | 39.19 |
Median and mean distances between neighboring CpGs, and average %G+C content, for human and mouse chromosomes.
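The median and mean distances in this table can be reproduced in principle with a sketch like the one below. It assumes the distance between neighboring CpGs is measured from the C of one CpG to the C of the next; the record does not state the convention, so the values could differ by a constant offset. The function names are hypothetical.

```python
from statistics import mean, median


def cpg_positions(seq):
    """Return the 0-based positions of the C in every CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_distance_stats(seq):
    """Return (median, mean) of distances between neighboring CpGs, or None."""
    positions = cpg_positions(seq)
    distances = [b - a for a, b in zip(positions, positions[1:])]
    if not distances:
        return None  # fewer than two CpGs, so no distance is defined
    return median(distances), mean(distances)


if __name__ == "__main__":
    print(cpg_positions("CGAACGCGTTTTCG"))
    print(cpg_distance_stats("CGAACGCGTTTTCG"))
```

For a whole chromosome, runs of N would need to be handled separately (for example by computing distances within each ungapped segment), which this sketch does not attempt.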
Figure 2. Probability density function of distances between neighboring CpGs (log-scale). The same as Figure 1, but with logarithmic axes; the over-representation of large distances is apparent.