Benjamin Koester, Thomas J Rea, Alan R Templeton, Alexander S Szalay, Charles F Sing.
Abstract
In this paper, we use a statistical estimator developed in astrophysics to study the distribution and organization of features of the human genome. Using the human reference sequence, we quantify the global distribution of CpG islands (CGI) in each chromosome and demonstrate that the organization of the CGI across a chromosome is non-random, exhibits surprisingly long-range correlations (10 Mb) and varies significantly among chromosomes. These correlations of CGI summarize functional properties of the genome that are not captured when considering variation in any particular separate (and local) feature. The demonstration of the proposed methods to quantify the organization of CGI in the human genome forms the basis of future studies. The most illuminating of these will assess the potential impact on phenotypic variation of inter-individual variation in the organization of the functional features of the genome within and among chromosomes, and among individuals for particular chromosomes.
Year: 2012 PMID: 22253817 PMCID: PMC3256200 DOI: 10.1371/journal.pone.0029889
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Density plots for chromosomes 1, 8 and 19.
Densities are defined as the number of CGI per 1 Mb window. Note the especially high density at the 5′ telomere and the missing sequence at position ∼130 Mb. Inset shows the distribution of densities in the 249 1 Mb windows that comprise the chromosome. The distribution is skewed and demonstrates that simple estimators of the centroid and dispersion are insufficient. Plots for all chromosomes are given in Text S1.
Summary statistics for the CGI densities for each chromosome.
| Chromosome | Total Length (Mb) | Missing (Mb) | NCGI | Average Density per Mb assayed | Average density per Mb of Chr | Standard deviation of the distribution of density per Mb of Chr | Skewness of the distribution of density per Mb of Chr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 249.3 | 24.0 | 3430 | 15.2 | 15.1 | 16.0 | 4.3 |
| 2 | 243.2 | 5.0 | 2553 | 10.7 | 10.8 | 9.5 | 1.8 |
| 3 | 198.0 | 3.2 | 1814 | 9.3 | 9.3 | 8.1 | 2.0 |
| 4 | 191.2 | 3.5 | 1664 | 8.9 | 9.0 | 11.8 | 4.9 |
| 5 | 180.9 | 3.2 | 1884 | 10.6 | 10.6 | 13.8 | 4.0 |
| 6 | 171.1 | 3.7 | 1954 | 11.7 | 12.0 | 12.0 | 2.1 |
| 7 | 159.1 | 3.8 | 2256 | 14.5 | 14.7 | 19.2 | 3.3 |
| 8 | 146.4 | 3.5 | 1562 | 10.9 | 10.9 | 14.2 | 4.0 |
| 9 | 141.2 | 21.1 | 1814 | 15.1 | 14.7 | 15.3 | 3.2 |
| 10 | 135.5 | 4.2 | 1733 | 13.2 | 12.6 | 13.5 | 4.4 |
| 11 | 135.0 | 3.9 | 1776 | 13.5 | 13.9 | 14.8 | 2.8 |
| 12 | 133.9 | 3.4 | 1832 | 14.0 | 13.7 | 13.6 | 2.1 |
| 13 | 115.2 | 19.6 | 959 | 10.0 | 10.3 | 14.4 | 4.2 |
| 14 | 107.3 | 19.1 | 1180 | 13.4 | 13.2 | 13.0 | 2.3 |
| 15 | 102.5 | 20.8 | 1187 | 14.5 | 14.2 | 9.3 | 0.9 |
| 16 | 90.4 | 11.5 | 1894 | 24.0 | 23.5 | 28.5 | 2.3 |
| 17 | 81.2 | 3.4 | 2210 | 28.4 | 28.0 | 22.3 | 1.4 |
| 18 | 78.1 | 3.4 | 805 | 10.8 | 11.2 | 15.0 | 4.8 |
| 19 | 59.1 | 3.3 | 3147 | 56.4 | 55.9 | 40.6 | 1.6 |
| 20 | 63.0 | 3.5 | 1111 | 18.7 | 18.5 | 20.5 | 3.2 |
| 21 | 48.1 | 13.0 | 502 | 14.3 | 13.1 | 16.2 | 2.0 |
| 22 | 51.3 | 16.4 | 976 | 28.0 | 26.6 | 17.8 | 1.2 |
| X | 155.3 | 4.2 | 1541 | 10.2 | 10.5 | 12.6 | 3.7 |
| Y | 59.4 | 33.7 | 311 | 12.1 | 11.3 | 20.6 | 3.4 |
Column 1: Chromosome.
Column 2: Length in Mb (including missing sequence that was not assayed).
Column 3: Ambiguous or missing sequence in Mb not assayed.
Column 4: Number of CGI detected.
Column 5*: Density = NCGI/(Total Mb – missing Mb not assayed).
Column 6*: Mean number of CGI per Mb for entire chromosome.
Column 7: Standard deviation of number of CGI per Mb for entire chromosome.
Column 8: Skewness of number of CGI per Mb for entire chromosome.
*The density (column 5) is simply the total number of CGI divided by the chromosome length in Mb that has been assayed. This is formally not the same as the mean number of CGI per Mb of chromosome ignoring the missing Mb (column 6), which we compute by counting CGI in windows of 1 Mb and computing the mean, standard deviation and skewness of the resulting distribution.
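The difference between the two density definitions in this footnote can be illustrated with a minimal sketch. It assumes CGI coordinates are available as a plain array of base-pair positions. The function name `cgi_density_summaries`, its parameters, and the `keep_windows` mask (marking which 1 Mb windows count as assayed) are hypothetical and not taken from the paper, and the skewness is the simple moment ratio, which may differ from the estimator the authors used.

```python
import numpy as np

def cgi_density_summaries(cgi_pos_bp, chrom_len_bp, missing_bp,
                          keep_windows=None, window_bp=1_000_000):
    """Illustrative only: two ways to summarize CGI density on one chromosome."""
    pos = np.asarray(cgi_pos_bp, dtype=float)

    # Column 5 style: total CGI count divided by the assayed length in Mb.
    density_assayed = pos.size / ((chrom_len_bp - missing_bp) / 1e6)

    # Column 6-8 style: count CGI in consecutive windows (1 Mb by default),
    # then summarize the distribution of the per-window counts.
    n_windows = int(np.ceil(chrom_len_bp / window_bp))
    edges = np.arange(n_windows + 1) * window_bp
    counts, _ = np.histogram(pos, bins=edges)

    # Assumption: windows without assayed sequence are dropped via a mask.
    if keep_windows is not None:
        counts = counts[np.asarray(keep_windows, dtype=bool)]

    mean = counts.mean()
    centered = counts - mean
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "density_assayed": density_assayed,
        "window_mean": mean,
        "window_std": counts.std(ddof=1),
        "window_skew": skew,
    }

# Toy chromosome: 4 Mb long, 0.8 Mb unassayed, last window excluded.
toy = cgi_density_summaries(
    [100, 200, 1_500_000, 2_100_000, 2_200_000, 2_300_000],
    chrom_len_bp=4_000_000,
    missing_bp=800_000,
    keep_windows=[True, True, True, False],
)
```

In the toy case the ratio-based density is 6 CGI over 3.2 assayed Mb, while the window mean averages the counts 2, 1 and 3, so the two numbers differ even though they describe the same data.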
Figure 2. The Two Point Correlation Functions of CGI in Chromosomes 1, 8 and 19.
The vertical axis shows the value of the two-point correlation function (TPCF), ξ, estimated using the bootstrap mean (see Methods), with error bars shown. The expectation in the absence of clustering is ξ = 0 (equivalently, 1+ξ = 1). CGI identified using the Takai and Jones (2002) algorithm are shown in black, as are the best-fit power law models. Dotted lines show approximate 3σ confidence intervals derived from a Monte Carlo based on the bootstrap estimate of ξ and our estimate of its variance (see Methods). Also shown in red (green) are the TPCF for the CGI given by Irizarry et al [56] (Illingworth et al [40]) and the associated regression coefficients, also in red (green). Remaining chromosomes can be found in Text S2.
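As a rough illustration of how a two-point correlation function can be estimated for positions along a chromosome, the sketch below uses the Landy-Szalay form ξ = (DD − 2DR + RR)/RR, a common choice in astrophysics. This is an assumption: the text here does not state which estimator the authors used, and the bootstrap averaging mentioned in the caption is omitted. All function names and the toy data are hypothetical.

```python
import numpy as np

def auto_pair_fraction(x, edges):
    """Fraction of distinct pairs within x whose separation falls in each bin."""
    i, j = np.triu_indices(x.size, k=1)
    counts = np.histogram(np.abs(x[i] - x[j]), bins=edges)[0]
    return counts / (x.size * (x.size - 1) / 2)

def cross_pair_fraction(x, y, edges):
    """Fraction of (x, y) cross pairs whose separation falls in each bin."""
    sep = np.abs(x[:, None] - y[None, :]).ravel()
    return np.histogram(sep, bins=edges)[0] / (x.size * y.size)

def landy_szalay(data, randoms, edges):
    """Landy-Szalay estimate of xi in each separation bin (1D positions)."""
    dd = auto_pair_fraction(data, edges)
    rr = auto_pair_fraction(randoms, edges)
    dr = cross_pair_fraction(data, randoms, edges)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (dd - 2 * dr + rr) / rr

# Toy data on a unit-length "chromosome" (not real CGI coordinates).
rng = np.random.default_rng(0)
randoms = rng.uniform(0.0, 1.0, 800)        # unclustered comparison catalog
uniform_pts = rng.uniform(0.0, 1.0, 400)    # no clustering expected
centers = rng.uniform(0.05, 0.95, 20)       # 20 tight clusters of 20 points
clustered_pts = (centers[:, None] + rng.uniform(0.0, 0.005, (20, 20))).ravel()

edges = np.array([0.0, 0.01, 0.1, 0.3])     # separation bins
xi_uniform = landy_szalay(uniform_pts, randoms, edges)
xi_clustered = landy_szalay(clustered_pts, randoms, edges)
```

For the uniform toy points ξ should scatter around zero in every bin, while the clustered points should give a clearly positive ξ in the smallest-separation bin. For real chromosomes the random catalog would also need to avoid the unassayed regions, which this sketch does not handle.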
Figure 3. Summary of TPCF for all human chromosomes.
Dashed lines show high CGI density chromosomes; dash-dot lines represent (top to bottom) the Y and X chromosomes. Inter-chromosomal variation is clear, and in general the clustering of all chromosomes becomes consistent with a random distribution by ∼10 Mb. Large separations likewise produce significant clustering, due to the high density of CGI in telomeres. Typical error bars for a given 1+ξ are shown to the right. Individual profiles for each chromosome can be found in Figure 2 and the Supplemental Figures.
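The power law models mentioned in the Figure 2 caption can be sketched as a straight-line regression in log-log space. The example assumes a model of the form ξ(r) = A·r^(−γ) fitted by unweighted least squares. Whether the authors fit ξ or 1+ξ, and whether they weight by the error bars, is not stated here, so `fit_power_law` and its return values are hypothetical.

```python
import numpy as np

def fit_power_law(r, xi):
    """Fit xi = A * r**(-gamma) by linear regression of log10(xi) on log10(r)."""
    r = np.asarray(r, dtype=float)
    xi = np.asarray(xi, dtype=float)
    ok = (r > 0) & (xi > 0)          # logarithms need positive values
    slope, intercept = np.polyfit(np.log10(r[ok]), np.log10(xi[ok]), 1)
    return 10.0 ** intercept, -slope  # amplitude A, exponent gamma

# Synthetic check: exact power law with A = 2 and gamma = 0.5.
r_mb = np.array([0.01, 0.1, 1.0, 10.0])
amplitude, gamma = fit_power_law(r_mb, 2.0 * r_mb ** -0.5)
```

Bins with non-positive ξ are dropped before taking logarithms, which biases the fit at separations where the signal approaches zero. A weighted or nonlinear fit would be a more careful alternative.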