Li-Yeh Chuang, Hsiu-Chen Huang, Ming-Cheng Lin, Cheng-Hong Yang.
Abstract
BACKGROUND: Regions with abundant GC nucleotides, a high CpG number, and a length greater than 200 bp in a genome are often referred to as CpG islands. These islands are usually located at the 5' end of genes. Recently, several algorithms for the prediction of CpG islands have been proposed.
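The verbal definition in the abstract can be turned into a quick check. The sketch below is illustrative only: the function names are hypothetical, and the numeric thresholds (GC fraction of at least 0.5 and an observed/expected CpG ratio of at least 0.6, together with the 200 bp length) are the commonly cited Gardiner-Garden and Frommer criteria, assumed here because the abstract gives no cutoffs other than length. It is not one of the prediction methods compared in this record.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, CpG observed/expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides; "CG" cannot overlap itself
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected CpG = (#CpG * N) / (#C * #G); defined as 0 if C or G is absent
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Apply the assumed length, GC-content and O/E thresholds to one window."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_oe
```

A window such as `"CG" * 100` passes all three tests, while an AT-only window of the same length fails on GC content.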
Year: 2011 PMID: 21738602 PMCID: PMC3125183 DOI: 10.1371/journal.pone.0021036
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of different methods for CpG island prediction.
| Performance | Methods |
| | CpGPlot | CpGcluster | CpGProD | CpGIS | PSO | CPSO |
| | without | with | without | with |
|
| SN (%) | 56.43 | 50.46 | 58.07 | 83.98 | 69.22 | 75.58 | 77.43 |
| SP (%) | 99.95 | 99.50 | 99.05 | 99.61 | 99.02 | 99.58 | 99.05 |
| ACC (%) | 98.09 | 97.78 | 97.69 | 98.39 | 98.28 | 97.99 | 98.43 |
| PC (%) | 56.42 | 49.92 | 52.36 | 69.59 | 63.77 | 62.27 | 70.34 |
| CC (%) | 74.38 | 69.41 | 68.83 | 81.25 | 77.66 | 75.71 | 81.80 |
|
| SN (%) | 47.19 | 67.15 | 68.51 | 85.12 | 54.47 | 59.63 | 77.80 |
| SP (%) | 99.72 | 99.63 | 99.30 | 99.96 | 99.88 | 99.50 | 99.61 |
| ACC (%) | 98.08 | 98.54 | 98.50 | 98.79 | 98.31 | 98.42 | 98.71 |
| PC (%) | 47.14 | 62.47 | 62.35 | 71.78 | 53.87 | 57.74 | 68.67 |
| CC (%) | 67.94 | 77.03 | 76.65 | 82.96 | 72.41 | 74.51 | 80.85 |
|
| SN (%) | 51.29 | 27.16 | 46.41 | 82.13 | 79.27 | 81.65 | 81.08 |
| SP (%) | 99.94 | 98.93 | 98.26 | 98.13 | 97.90 | 98.17 | 98.34 |
| ACC (%) | 96.90 | 95.32 | 95.60 | 97.24 | 96.93 | 96.87 | 97.08 |
| PC (%) | 51.24 | 26.92 | 40.10 | 65.36 | 62.10 | 62.33 | 63.80 |
| CC (%) | 70.38 | 49.96 | 56.80 | 77.63 | 75.03 | 75.28 | 76.41 |
|
| SN (%) | 22.80 | 57.32 | 29.79 | 74.05 | 60.20 | 64.80 | 70.53 |
| SP (%) | 99.74 | 99.56 | 98.83 | 99.27 | 99.23 | 99.22 | 99.13 |
| ACC (%) | 97.76 | 98.51 | 97.53 | 98.11 | 98.13 | 98.23 | 98.38 |
| PC (%) | 22.80 | 52.74 | 25.96 | 53.23 | 48.39 | 51.59 | 55.91 |
| CC (%) | 47.21 | 69.89 | 43.61 | 68.64 | 64.50 | 67.25 | 70.90 |
|
| SN (%) | 31.24 | 29.86 | 52.01 | | 56.92 | 63.58 | 70.54 |
| SP (%) | 99.46 | 98.72 | 97.62 | 98.40 | 98.13 | 98.34 | 98.23 |
| ACC (%) | | 96.90 | 97.00 | 96.83 | 96.87 | 96.86 | 97.32 |
| PC (%) | 31.24 | 26.19 | 38.94 | 47.05 | 40.12 | 42.74 | 49.22 |
| CC (%) | 55.17 | 43.81 | 54.68 | 63.29 | 55.65 | 58.36 | 64.72 |
|
| SN (%) | 27.11 | 44.89 | 54.18 | 76.68 | 68.97 | 72.79 | 72.52 |
| SP (%) | 99.47 | 99.45 | 98.93 | 99.27 | 98.99 | 99.18 | 98.90 |
| ACC (%) | 97.98 | 97.53 | 98.19 | 98.14 | 98.19 | 98.06 | 98.12 |
| PC (%) | 27.10 | 39.26 | 45.36 | 59.36 | 57.49 | 57.17 | 59.25 |
| CC (%) | 51.51 | 57.21 | 62.26 | 73.57 | 72.21 | 71.75 | 73.48 |
RL: Reinforcement Learning. SN: Sensitivity. SP: Specificity. ACC: Accuracy. PC: Performance coefficient. CC: Correlation coefficient. Underlined values represent the best results.
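The five measures in the legend can all be derived from nucleotide-level counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sketch below assumes the definitions commonly used in sequence-prediction benchmarks, with PC = TP / (TP + FN + FP) and CC taken as the Matthews correlation coefficient. The record does not spell these formulas out, so the formulas and the function name are assumptions rather than the authors' stated procedure.

```python
import math


def _ratio(num: float, den: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def prediction_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Return SN, SP, ACC, PC and CC as percentages (assumed standard definitions)."""
    sn = _ratio(tp, tp + fn)                      # sensitivity
    sp = _ratio(tn, tn + fp)                      # specificity
    acc = _ratio(tp + tn, tp + tn + fp + fn)      # accuracy
    pc = _ratio(tp, tp + fn + fp)                 # performance coefficient
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = _ratio(tp * tn - fn * fp, denom)         # Matthews correlation coefficient
    return {name: 100.0 * value
            for name, value in (("SN", sn), ("SP", sp), ("ACC", acc),
                                ("PC", pc), ("CC", cc))}
```

Because non-island positions vastly outnumber island positions, SP and ACC stay near 99% for every method in the table, which is why PC and CC separate the methods more clearly.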
Number of CpG islands located in gene regions identified with CPSORL.
| Chr. | Contig | GC% (Average) | CpG island length | CpG island number | Number of genes |
| | NT_113952.1 | 54.34 | 8,537 | 12 | 1(3) |
| | NT_113955.2 | 53.04 | 10,023 | 15 | 2(3) |
| | NT_113958.2 | 57.01 | 14,470 | 19 | 2(3) |
| | NT_113953.1 | 50.92 | 3,998 | 8 | 1(1) |
| | NT_113954.1 | 54.53 | 6,174 | 10 | 1(1) |
| | NT_028395.3 | 55.40 | 24,649 | 38 | 10(15) |
(*) The true number of genes in the contig is given in parentheses.
Comparison of the number of CpG islands identified in the human genome with different methods (NCBI.36).
| Chromosome length (bp) | 46,944,323 |
| Total length of CpG islands | 347,334 | 639,161 | 1,072,192 | 1,280,505 | 1,564,596 | 1,607,472 | 926,178 |
| Number of islands predicted | 973 | 2,703 | 1,091 | 3,704 | 2,648 | 2,813 | 850 |
| Island coverage (%) | 0.73 | 1.36 | 2.28 | 2.73 | 3.3 | 3.4 | 1.97 |
| Island length (bp) |
| Average | 357 | 237 | 983 | 346 | 591 | 571 | 1,089 |
| Minimum | 101 | 8 | 500 | 200 | 202 | 202 | 500 |
| Maximum | 3,047 | 3,028 | 6,732 | 1,948 | 4,020 | 4,035 | 4,035 |
| GC-content ± SD (%) | 62.17±0.07 | 65.49±0.07 | 54.49±0.06 | 57.98±0.04 | 53.73±0.05 | 53.72±0.05 | 55.60±0.05 |
| CpG island O/E ratio ± SD | 0.84±0.1 | 0.87±0.3 | 0.63±0.1 | 0.68±0.1 | 0.64±0.08 | 0.65±0.08 | 0.65±0.09 |
SD is the Standard Deviation.
Island coverage is the proportion (%) of the chromosome sequence covered by the islands each method predicts.
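The coverage and average-length rows follow directly from the totals in the same table. As a worked check, the sketch below recomputes both for the sixth data column (total island length 1,607,472 bp, 2,813 islands) against a chromosome length of 46,944,323 bp. The function name is hypothetical, and the formulas are inferred from the footnote and the tabulated values rather than quoted from the authors.

```python
def island_summary(total_island_bp: int, n_islands: int,
                   chromosome_bp: int) -> tuple[float, float]:
    """Return (island coverage in %, mean island length in bp)."""
    coverage_pct = 100.0 * total_island_bp / chromosome_bp
    mean_length_bp = total_island_bp / n_islands
    return coverage_pct, mean_length_bp


# Values taken from the sixth data column of the table above
coverage, mean_len = island_summary(1_607_472, 2_813, 46_944_323)
print(f"coverage = {coverage:.2f}%, mean island length = {mean_len:.0f} bp")
```

Rounded to the precision used in the table, this reproduces the listed coverage of 3.4% and average island length of 571 bp.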
Number of methylation sites identified with CPSORL in chromosomes 21 and 22 of the human genome (NCBI.36).
| | Chromosome 21 | Chromosome 22 |
| Chromosome length (bp) | 46,944,323 | 49,691,432 |
| Total length of CpG island (bp) | 1,607,472 | 2,907,983 |
| Number of methylation sites in entire genome | 841,554 | 1,120,517 |
| Number of methylation sites using CPSORL | 111,172 | 185,324 |
| Methylation density of CpG islands (%) | 6.91 | 6.37 |
Number of methylation sites identified with CPSORL in all chromosomes of the human genome (NCBI.36).
| Chr. | Length | Total length of CpG island | Number of all methylation sites | Number of predicted methylation sites | Methylation Density (%) |
| 1 | 247,249,719 | 9,819,708 | 5,006,940 | 523,354 | 5.33 |
| 2 | 242,951,149 | 7,822,751 | 5,023,026 | 431,279 | 5.51 |
| 3 | 199,501,827 | 5,561,406 | 3,965,121 | 310,656 | 5.58 |
| 4 | 191,273,063 | 5,331,470 | 3,577,143 | 275,413 | 4.95 |
| 5 | 180,857,866 | 5,780,736 | 3,563,532 | 318,252 | 5.51 |
| 6 | 170,899,992 | 5,858,975 | 3,465,347 | 318,445 | 5.44 |
| 7 | 158,821,424 | 6,784,935 | 3,450,658 | 392,566 | 5.79 |
| 8 | 146,274,826 | 4,841,004 | 3,015,121 | 267,302 | 5.52 |
| 9 | 140,273,252 | 5,384,493 | 2,574,014 | 282,008 | 5.23 |
| 10 | 135,374,737 | 5,245,458 | 3,013,632 | 292,186 | 5.57 |
| 11 | 134,452,384 | 5,228,058 | 2,872,470 | 282,971 | 5.41 |
| 12 | 132,349,534 | 5,512,364 | 2,957,221 | 195,079 | 3.54 |
| 13 | 114,142,980 | 3,049,962 | 1,946,147 | 180,554 | 5.92 |
| 14 | 106,368,585 | 3,536,154 | 1,935,241 | 191,968 | 5.43 |
| 15 | 100,338,915 | 3,676,992 | 1,858,038 | 186,212 | 5.06 |
| 16 | 88,827,254 | 5,414,278 | 2,222,494 | 320,771 | 5.92 |
| 17 | 78,774,742 | 6,551,708 | 2,306,666 | 252,464 | 3.85 |
| 18 | 76,117,153 | 2,528,076 | 1,605,879 | 180,108 | 7.12 |
| 19 | 63,811,651 | 7,604,015 | 1,939,151 | 461,782 | 6.07 |
| 20 | 62,435,964 | 3,106,557 | 1,551,541 | 180,108 | 5.80 |
| 21 | 46,944,323 | 1,607,472 | 841,554 | 111,172 | 6.91 |
| 22 | 49,691,432 | 2,907,983 | 1,120,517 | 185,324 | 6.37 |
| X | 154,913,754 | 4,831,155 | 2,279,012 | 190,792 | 3.95 |
| Y | 57,772,954 | 1,001,532 | 214,434 | 15,945 | 1.59 |
| Average | 128,350,812 | 4,957,802 | 2,596,037 | 264,446 | 5.33 |
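The methylation density column appears to be the number of predicted methylation sites divided by the total CpG island length, expressed as a percentage. That reading is an assumption inferred from the tabulated values, not a definition given in the record. The sketch below recomputes it for three rows; the function and variable names are hypothetical.

```python
def methylation_density(island_bp: int, predicted_sites: int) -> float:
    """Predicted methylation sites per 100 bp of CpG island sequence (%)."""
    return 100.0 * predicted_sites / island_bp


# (total CpG island length, predicted methylation sites, reported density %)
rows = {
    "1": (9_819_708, 523_354, 5.33),
    "21": (1_607_472, 111_172, 6.91),
    "22": (2_907_983, 185_324, 6.37),
}

for chrom, (island_bp, sites, reported) in rows.items():
    density = methylation_density(island_bp, sites)
    print(f"chr{chrom}: computed {density:.3f}%, reported {reported:.2f}%")
```

The same function can be run over every row to cross-check the reported densities against the other two columns.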
Comparison of the number of CpG islands identified in the entire human genome with different methods.
| Genome length | 2.86E+09 |
| Number of predicted islands | 198,702 | 37,729 | 208,536 | 54,483 |
| Island coverage (%) | 1.90 | 1.44 | 4.1 | 2.1 |
| Island length (bp) |
| Average ± SD | 273±246 | 1,090±717 | 572±469 | 1,100±541 |
| GC-content ± SD (%) | 63.78±7.50 | 60.61±5.06 | 53.90±5.25 | 56.26±6.45 |
| CpG island O/E ratio ± SD | 0.855±0.265 | 0.717±0.082 | 0.649±0.087 | 0.665±0.10 |
| TSSs | 21,741 (10.9%) | 15,106 (40.0%) | 25,477 (12.2%) | 22,057 (40.5%) |
| Promoter regions | 29,156 (14.7%) | 13,196 (35.0%) | 54,356 (26.1%) | 37,038 (67.8%) |
Figure 1. CPSO implementation diagram.