Ning Yu, Xuan Guo, Alexander Zelikovsky, Yi Pan.
Abstract
BACKGROUND: As crucial markers for identifying biological elements and processes in mammalian genomes, CpG islands (CGIs) play important roles in DNA methylation, gene regulation, epigenetic inheritance, gene mutation, chromosome inactivation and nucleosome retention. The generally accepted criteria for a CGI are: (a) a G+C content ≥ 50%, (b) a ratio of observed to expected CpG content ≥ 0.6, and (c) a length greater than 200 nucleotides. Most existing computational methods for CpG island prediction are built on these rules. However, many experimentally verified CpG islands deviate from these artificial criteria: experiments indicate that in many cases the G+C content is < 50%, CpG obs/CpG exp varies, and the length of a CGI ranges from eight nucleotides to a few thousand nucleotides. This implies that CGI detection is not a purely statistical task and that some rules probably remain undiscovered.
Keywords: CpG box; CpG island; Energy distribution; Epigenetics; Gaussian model; Methylation
Year: 2017 PMID: 28589860 PMCID: PMC5461559 DOI: 10.1186/s12864-017-3731-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
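The three conventional criteria listed in the abstract can be illustrated with a short check. This is a minimal sketch, not the method of any of the tools compared here: the function names are hypothetical, and the observed/expected ratio is assumed to follow the commonly used form (number of CpG × sequence length) / (number of C × number of G), which the abstract itself does not spell out.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio, assumed here to be
    (#CpG * N) / (#C * #G) for a sequence of length N."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_classical_criteria(seq: str,
                             min_gc: float = 0.5,
                             min_ratio: float = 0.6,
                             min_len: int = 200) -> bool:
    """True if the sequence satisfies criteria (a), (b) and (c)."""
    return (len(seq) > min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_ratio)


if __name__ == "__main__":
    print(meets_classical_criteria("CG" * 150))  # long, CpG-rich toy sequence
    print(meets_classical_criteria("AT" * 150))  # long, but no G or C
```

A rule of this kind rejects any sequence shorter than the length threshold, which is exactly the class of short, experimentally verified islands the abstract points to.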
Fig. 1. Distribution of CpG-box length. Distribution curve of CpG-box length in the human genome
Fig. 2. Gaussian kernel estimation. Distance distribution of CGI candidates in human chromosome 21, shown as an example, with its Gaussian kernel density estimate (blue solid line)
Fig. 3. Discrete Gaussian filter. The upper chart shows the value at each location; the lower box is the discrete filter
Fig. 4. The main procedures of GaussianCpG
Table 1. Coverage rate of known human CGIs
| Chr# | Known | Predicted | Coverage |
|---|---|---|---|
| Chr 1 | 546 | 541 | 99.08% |
| Chr 2 | 430 | 426 | 99.07% |
| Chr 3 | 319 | 319 | 100% |
| Chr 4 | 272 | 272 | 100% |
| Chr 5 | 359 | 356 | 99.16% |
| Chr 6 | 293 | 292 | 99.66% |
| Chr 7 | 304 | 298 | 98.03% |
| Chr 8 | 254 | 253 | 99.61% |
| Chr 9 | 359 | 356 | 99.16% |
| Chr 10 | 311 | 311 | 100% |
| Chr 11 | 346 | 346 | 100% |
| Chr 12 | 363 | 360 | 99.17% |
| Chr 13 | 200 | 200 | 100% |
| Chr 14 | 206 | 205 | 99.51% |
| Chr 15 | 150 | 150 | 100% |
| Chr 17 | 383 | 380 | 99.22% |
| Chr 18 | 43 | 43 | 100% |
| Chr 19 | 315 | 314 | 99.68% |
| Chr 20 | 259 | 257 | 99.23% |
| Chr 21 | 133 | 131 | 98.50% |
| Chr 22 | 215 | 214 | 99.53% |
| Chr X | 253 | 250 | 98.81% |
| Chr Y | 5 | 5 | 100% |
Known CGIs: 6786; predicted: 6740; average coverage rate: 99.32%
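The coverage column can be reproduced with one line of arithmetic. The sketch below assumes that coverage is simply the number of predicted known CGIs divided by the number of known CGIs, which is consistent with the figures in the table; the function name is hypothetical.

```python
def coverage(known: int, predicted: int) -> float:
    """Coverage rate in percent: predicted known CGIs / known CGIs."""
    return 100.0 * predicted / known


# A few rows taken from the table above: (known, predicted)
rows = {"Chr 1": (546, 541), "Chr 7": (304, 298), "Chr 21": (133, 131)}

for chrom, (known, predicted) in rows.items():
    print(f"{chrom}: {coverage(known, predicted):.2f}%")

# Overall rate from the reported totals
print(f"Overall: {coverage(6786, 6740):.2f}%")
```

Under this assumption the overall figure of 99.32% is the pooled ratio of the two totals rather than a mean of the per-chromosome percentages.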
Table 2. Comparison in artificial data set
Panel a

| Method | I | II | III | IV | V |
|---|---|---|---|---|---|
| T | 6854696 | 6854696 | 6854696 | 6854696 | 6854696 |
| TP | 2101562 | 3603662 | 5489738 | 2531549 | 5036243 |
| FN | 4753134 | 3251034 | 1364958 | 4323147 | 1818453 |
| F | 5919255 | 5919255 | 5919255 | 5919255 | 5919255 |
| FP | 20437 | 220957 | 1085303 | 9319 | 46906 |
| TN | 5898818 | 5698298 | 4833952 | 5909936 | 5872349 |

Panel b

| Method | I | II | III | IV | V |
|---|---|---|---|---|---|
| Sn | 30.66% | 52.57% | | 36.93% | 73.47% |
| Sp | 99.65% | 96.27% | 81.66% | | 99.21% |
| Acc | 62.63% | 72.82% | 80.82% | 66.08% | |
| Mcc | 40.61% | 53.18% | 61.61% | 45.94% | |
| Ppv | 99.04% | 94.22% | 83.49% | | 99.08% |
| Pc | 30.57% | 50.93% | 69.14% | 36.88% | |
| F1 | 46.82% | 67.49% | 81.75% | 53.89% | |
I: CpGPlot, II: CpGReport, III: CpGProd, IV: CpGCluster, V: GaussianCpG
For Panel a: the unit of measurement is the nucleotide
True, T: the total length of known CpG islands
False, F: the total length of non-CpG-island sequence
True positive, TP: the length of known CGIs that were predicted
False positive, FP: the length of predicted CGIs not in known CGIs
False negative, FN: the length of known CGIs that were not predicted
True negative, TN: the length of non-CGI sequence predicted as non-CGI
For Panel b:
Sensitivity, Sn=TP/(TP+FN)
Specificity, Sp=TN/(TN+FP)
Accuracy, Acc=(TP+TN)/(TP+FP+FN+TN)
Mean correlation coefficient, Mcc=(TP×TN−FP×FN)/√((TP+FP)×(TP+FN)×(TN+FP)×(TN+FN))
Positive predictive value, Ppv=TP/(TP+FP)
Performance coefficient, Pc=TP/(TP+FN+FP)
F1 score, the harmonic mean of precision and sensitivity, F1=2×TP/(2×TP+FP+FN)
For Panels a and b: all software was run with default parameters
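The Panel b metrics defined above can be recomputed from the Panel a counts. The following is a minimal sketch, with a hypothetical function name, that applies the listed formulas directly; for Mcc it assumes the Matthews correlation coefficient form, (TP×TN−FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
from math import sqrt


def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the Panel b metrics (as fractions) from nucleotide counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),                    # sensitivity
        "Sp": tn / (tn + fp),                    # specificity
        "Acc": (tp + tn) / (tp + fp + fn + tn),  # accuracy
        # Mcc: assumed Matthews form, set to 0.0 if the denominator vanishes
        "Mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "Ppv": tp / (tp + fp),                   # positive predictive value
        "Pc": tp / (tp + fn + fp),               # performance coefficient
        "F1": 2 * tp / (2 * tp + fp + fn),       # F1 score
    }


if __name__ == "__main__":
    # Method I (CpGPlot) counts from Panel a of the artificial data set
    result = metrics(tp=2101562, fn=4753134, fp=20437, tn=5898818)
    for name, value in result.items():
        print(f"{name}: {100 * value:.2f}%")
```

With the method I counts from the artificial data set, this gives Sn ≈ 30.66% and Sp ≈ 99.65%, in agreement with Panel b.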
Table 3. Comparison in real data set
Panel a

| Method | I | II | III | IV | V |
|---|---|---|---|---|---|
| T | 348930 | 348930 | 348930 | 348930 | 348930 |
| TP | 255732 | 348546 | 333015 | 300315 | 292732 |
| FN | 93198 | 384 | 15915 | 48615 | 56198 |
| F | 46361053 | 46361053 | 46361053 | 46361053 | 46361053 |
| FP | 397423 | 1680731 | 1034353 | 583460 | 363493 |
| TN | 46124740 | 44417698 | 45331765 | 45923959 | 46075369 |

Panel b

| Method | I | II | III | IV | V |
|---|---|---|---|---|---|
| Sn | 73.29% | | 95.43% | 86.06% | 83.89% |
| Sp | 99.14% | 96.35% | 97.76% | 98.74% | |
| Acc | 98.95% | 96.38% | 97.75% | 98.65% | |
| Mcc | 53.11% | 40.65% | 47.61% | 53.60% | |
| Ppv | 39.15% | 17.17% | 24.35% | 33.98% | |
| Pc | 34.26% | 17.17% | 24.07% | 32.20% | |
| F1 | 51.03% | 29.31% | 38.80% | 48.72% | |
I: CpGPlot, II: CpGReport, III: CpGProd, IV: CpGCluster, V: GaussianCpG
For Panels a and b: the settings and metrics are the same as those in Table 2