
CG dinucleotide clustering is a species-specific property of the genome.

Jacob L Glass¹, Reid F Thompson, Batbayar Khulan, Maria E Figueroa, Emmanuel N Olivier, Erin J Oakley, Gary Van Zant, Eric E Bouhassira, Ari Melnick, Aaron Golden, Melissa J Fazzari, John M Greally.

Abstract

Cytosines at cytosine-guanine (CG) dinucleotides are the near-exclusive target of DNA methyltransferases in mammalian genomes. Spontaneous deamination of methylcytosine to thymine makes methylated cytosines unusually susceptible to mutation and consequent depletion. The loci where CG dinucleotides remain relatively enriched, presumably due to their unmethylated status during the germ cell cycle, have been referred to as CpG islands. Currently, CpG islands are solely defined by base compositional criteria, allowing annotation of any sequenced genome. Using a novel bioinformatic approach, we show that CG clusters can be identified as an inherent property of genomic sequence without imposing a base compositional a priori assumption. We also show that the CG clusters co-localize in the human genome with hypomethylated loci and annotated transcription start sites to a greater extent than annotations produced by prior CpG island definitions. Moreover, this new approach allows CG clusters to be identified in a species-specific manner, revealing a degree of orthologous conservation that is not revealed by current base compositional approaches. Finally, our approach is able to identify methylating genomes (such as Takifugu rubripes) that lack CG clustering entirely, in which it is inappropriate to annotate CpG islands or CG clusters.

Year: 2007 PMID: 17932072 PMCID: PMC2175314 DOI: 10.1093/nar/gkm489

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Several observations have converged to focus attention on cytosine-guanine (CG) dinucleotide clusters in mammalian genomes. Digestion of genomic DNA with HpaII allowed the isolation of loci where these restriction sites cluster and are unmethylated in cis, defining a population of loci referred to as HpaII tiny fragments (1). Upon sequencing, these loci were found to be unusually rich in CG dinucleotides and (G+C) content when compared with other sequences in the 1985 Genbank database (2). The CpG island base compositional criteria now used for genomic annotation are derived from this original set of experiments. With the appreciation that methylcytosine is unusually susceptible to mutation through deamination to thymine (3), a logical conclusion was that the absence of methylation at cytosines in CpG islands protected them from mutational decay during evolution. The implicit assumption is that these loci are universally unmethylated in normal cells but can be the target of abnormal methylation in cancer, or in unusual epigenetic regulatory processes such as genomic imprinting or X chromosome inactivation (4,5). CpG islands have proven valuable in focusing the study of the widespread genomic changes that occur in these processes, and are commonly used in designing custom microarrays for that purpose. CpG islands have also been used as a foundation for bioinformatic analyses such as finding gene promoters (6) and identifying sequence features that distinguish imprinted genes (7). Despite its proven utility, there are problems with the original definition of CpG islands, including its lack of specificity. Using base compositional criteria alone, CpG island annotations identify over 350 000 sites in the human genome (Table 1), many of which are in repetitive sequences. 
Recognizing this problem, other groups have modified the original base compositional criteria (8) or the analytical approach used (9) in order to increase the stringency of CpG island identification, greatly reducing the number of repetitive sequences annotated while preserving most CpG islands located at promoters. However, the more common approach, used by genome browsers such as that at UCSC (http://genome.ucsc.edu/) (10), is simply to remove all repetitive sequence prior to annotating CpG islands.

Table 1.

		All	Sequence feature conserved at orthologous locus	Overlap with refSeq transcription start site
CpG islands (original definition^a)	Human^b	350 201	42 445 (12.1%)	17 209 (4.9%)
	Mouse^c	165 379	47 139 (28.5%)	11 822 (7.1%)
CpG islands (UCSC annotation^b)	Human	27 801	14 452 (52.0%)	14 121 (50.8%)
	Mouse	15 974	14 057 (88.0%)	9114 (57.1%)
CG clusters	Human	44 165	19 410 (44.0%)	16 822 (38.1%)
	Mouse	42 971	18 970 (44.2%)	11 859 (27.6%)
CG clusters (non-transposon)^d	Human	31 225	19 071 (61.1%)	16 690 (53.5%)
	Mouse	21 587	17 614 (81.6%)	11 677 (54.1%)

The total numbers of CpG islands and CG clusters in human and mouse, using unmasked sequence that contains repetitive elements, the UCSC CpG island track (generated from sequence in which transposons have been masked) and CG clusters excluding those with ⩽27 CGs derived from unique sequence (24 for mouse, the minimum number needed to define a CG cluster in each species). The exclusion of transposon-derived CG clusters creates an annotation that is comparable to the UCSC CpG island annotations. We show the numbers and percentages of total for each sequence feature in terms of conservation of that sequence feature at the orthologous locus in the other species. We also quantify the numbers and proportions overlapping refSeq gene transcription start sites in each genome. Comparison of annotation performance in unmasked sequence shows CpG islands to suffer from excessive non-specificity, while the performance for non-repetitive sequences shows comparable proportional but quantitatively greater identification of conserved CG-dense regions or refSeq promoters using the CG cluster annotation in both human and mouse.

^a Using the cpgi130 program (4) (http://cpgislands.usc.edu/) with parameters (G+C) ⩾0.50, O/E CpG ⩾0.60, window size ⩾200 bp.

^b Using hg17 assembly at the UCSC genome browser (http://genome.ucsc.edu/) (11).

^c Using mm7 assembly at the UCSC genome browser.

^d Removing CG clusters for which ⩽27 CGs are contributed by unique sequence (24 CGs for mouse).

CpG island annotations are meant to identify constitutively unmethylated sites in the genome. However, the traditional CpG island criteria mostly identify repetitive sequences, as we show in Table 1, and these repeats are generally highly methylated (11). Furthermore, when we (12) and others (13) performed high-throughput cytosine methylation studies, even the annotated unique sequence CpG islands were subject to methylation at non-imprinted autosomal loci from normal tissues.
It follows that the original base compositional criteria by themselves are not sufficient to predict methylation status. Rather than modify existing base compositional criteria further, we decided to focus on the single characteristic of CG dinucleotides in which we had confidence: that they cluster at certain loci. We sought to identify whether such loci form a distinctive population within the genome as a whole. This approach allowed us to develop a new means of defining what we call CG clusters, and for the first time allows a species-specific definition that reveals the pattern of preservation of CGs to be genome specific and more conserved at orthologous loci than previously recognized. As CpG islands have been used as fundamental predictors of functionally important sites such as promoters (6), and we show that the CG cluster annotation has a substantially better positive predictive value for annotated transcription start sites than do CpG islands, it is likely that prior bioinformatic studies based on using CpG islands will be greatly improved by the use of CG clusters instead. We also show that the potential utility of CG clusters extends beyond sequence analysis alone, with demonstration of epigenetic predictive capacity, identifying substantially more hypomethylated sites than CpG islands in human CD34+ and embryonic stem cells. Because CpG islands are used as a basis for microarray studies of methylation changes, particularly in cancer (14), the use of CG clusters is likely to improve the sensitivity of such studies.

MATERIALS AND METHODS

CG cluster generation

The CG cluster annotation was generated using a set of custom PERL, R (http://www.r-project.org/) and shell scripts (available at http://greallylab.aecom.yu.edu/cgClusters/). Initially, the locations of every CG dinucleotide in the human genome were extracted from raw genomic DNA sequences (human May 2004 assembly hg17, http://genome.ucsc.edu/). Using these positions, every overlapping sequence fragment containing a fixed number of CGs (n = 5, 10, …, 100) and having variable length was identified. For each number n of CGs, the frequency of each fragment length was recorded and the distribution of fragment lengths was examined using the R statistical package for the presence of a short, CG-dense population distinct from the longer fragments. The threshold θn for each CG number n (the maximum fragment length) was defined to be the location of the local minimum in the fragment length histogram, estimated by identifying zero values of the first derivative of a cubic spline fit. Plots of θn against the number of CGs (n) exhibited a nearly linear relationship. Mapping the CG-dense fragments (those no longer than θn) back to the genomic sequence produces an annotation track where each annotated locus is a conglomeration of one or more overlapping fragments of variable length. However, the exact number, length and location of the annotated regions vary with the number of CGs per fragment (n). As the basis for choosing the optimal track in an objective manner, we noted that the fragments tended to aggregate and overlap to a greater extent in genomic regions of higher CG density. Because these types of regions are the major source of the CG-dense subpopulation, we used the number of overlapping fragments at locus j as a parameter for evaluating the information content of an annotated locus. To normalize for the length dependence of this value, we divided it by the maximum fragment length θn.
To choose the track with maximal fragment overlap per locus, we compared genomic averages of this metric for different numbers of CGs per fragment (n). This allowed us to choose the species-specific optimal number of CGs per fragment for the final annotation. These annotations were then formatted for visualization in the UCSC genome browser and are available for download (human and mouse genomes) at http://greallylab.aecom.yu.edu/cgClusters/. Annotation track features including CpG islands and repetitive elements were examined using a local mirror of the UCSC genome browser MySQL database through the PERL DBI interface. The Takai and Jones (8) and Gardiner-Garden and Frommer (2) CpG island annotation tracks were generated using the cpgi130 program (8) (http://cpgislands.usc.edu/), and loaded into the database to facilitate analysis. The CG cluster annotation was also loaded into the database. Analysis of CpG island and CG cluster promoter prediction was performed using a highly restrictive set of criteria. Only refSeq genes were considered, and promoter prediction was defined as strict overlap of the transcription start site. Non-transposon CG clusters were defined by quantifying the number of CG dinucleotides derived from transposon and unique sequences, identifying those for which unique sequence contributed less than the minimum number of CGs required for a CG cluster in each species and removing them from consideration. For the comparisons of CpG islands and CG clusters at orthologous promoters in human and mouse at the 23 loci, we used the same approach as in the original analysis (15), scoring conservation when the promoter of the gene had any overlap with the sequence feature. For the corresponding genome-wide analysis of CpG island and CG cluster conservation, we defined orthologous annotations in human and mouse using the mouse net (netMm7) track from the UCSC Genome Browser (16).
Promoter hits were defined as strict overlap with transcription start sites of refSeq genes, while overlap of the annotation from one species with the annotation in the other species at orthologous sequences defined conservation of the CG-dense region.
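The fragment-building and mapping steps described above can be illustrated with a minimal Python sketch. It is not the authors' PERL/R pipeline: the function names are hypothetical, and fragment length is assumed here to run from the first base of the first CG to the last base of the n-th CG, which the text does not state explicitly. The maximum fragment length is passed in as a parameter rather than derived from a histogram.

```python
import re
from typing import List, Tuple


def cg_positions(seq: str) -> List[int]:
    """Return the 0-based start position of every CG dinucleotide in seq."""
    return [m.start() for m in re.finditer("CG", seq.upper())]


def fragment_spans(positions: List[int], n: int) -> List[Tuple[int, int]]:
    """Each run of n consecutive CGs defines one fragment (start, end), end exclusive.

    Assumption: the fragment includes both bases of the first and last CG.
    """
    return [(positions[i], positions[i + n - 1] + 2)
            for i in range(len(positions) - n + 1)]


def merge_dense_fragments(spans: List[Tuple[int, int]],
                          max_len: int) -> List[Tuple[int, int, int]]:
    """Keep fragments no longer than max_len and merge overlapping ones.

    Returns one (start, end, fragment_count) tuple per merged cluster;
    fragment_count is the number of dense fragments that built the cluster.
    """
    clusters: List[Tuple[int, int, int]] = []
    for start, end in spans:  # spans are already sorted by start
        if end - start > max_len:
            continue  # fragment is too long to count as CG-dense
        if clusters and start < clusters[-1][1]:
            prev_start, prev_end, count = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            clusters.append((start, end, 1))
    return clusters
```

For a real genome, `n` and `max_len` would be the species-specific values reported in the Results (for example 27 CGs and 531 bp for human); the toy values used when trying this out are arbitrary.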

Cytosine methylation analysis using the HELP (HpaII tiny fragment enrichment by ligation-mediated PCR) assay

Two normal human cell types were chosen for analysis, human embryonic stem cells and hematopoietic stem and progenitor cells. The H1 human embryonic stem cells (hESCs; NIH code WA01 from WiCell Research Institute, Madison, WI, USA) were cultured on P51R [hESC-derived MSCs (17)], plated at 75 000 cells per cm² or on Matrigel (BD Biosciences, San Diego) at 37°C, 5% O2 and 5% CO2. The hESC medium contained DMEM/Ham's F-12, 20% Knockout Serum Replacer (KSR), 2 mM L-glutamine, minimal essential medium nonessential amino acid solution (NEAA), 0.1 mM penicillin–streptomycin 1% (all from Gibco, Grand Island, NY, USA), 4 ng/ml basic fibroblast growth factor (or 100 ng/ml for cells on Matrigel, R&D Systems Inc., Minneapolis; or ProSpec-Tany TechnoGene, Rehovot, Israel) and 0.1 mM 1-thioglycerol (Sigma–Aldrich, St Louis). The culture medium was changed daily, and the cells were passaged once weekly. The hESCs were harvested using TrypLE™ EXPRESS (Gibco), washed and re-suspended in staining buffer [Dulbecco's phosphate-buffered saline (DPBS) + 5% KSR] at a concentration of 10⁷ cells/ml and stained with mouse anti-human SSEA-4 antibody (DSHB) or isotype control (eBioscience, San Diego). Secondary staining was performed using rat anti-mouse IgG (H+L) immunoglobulin conjugated to fluorescein isothiocyanate (eBioscience). Based on fluorescence, cells positive for SSEA-4 were sorted using a MoFlo Cell-Sorter (DakoCytomation, Glostrup, Denmark). Genomic DNA was extracted from 1.5 to 2.5 × 10⁶ cells using proteinase K digestion, phenol–chloroform extraction, dialysis against 0.2× SSC and concentration by surrounding the dialysis bag with PEG 20 000 to reduce water content by osmosis. CD34+ cells were selected from bone marrow samples of healthy adult donors using a Miltenyi (Auburn, CA, USA) LS immunoabsorption column.
Genomic DNA was extracted from 2 to 3 × 10⁶ cells following a standard phenol–chloroform protocol followed by an ethanol precipitation and re-suspension of the DNA pellet in 10 mM Tris pH 8.0. To identify hypomethylated loci, HELP analysis was performed (12) using a custom microarray representing HpaII-amplifiable sites at gene promoters (NimbleGen Systems). We used a categorical approach for the output of the assay, as our outcome of interest was defined in terms of methylated or hypomethylated loci. Methylated loci were identified by their inability to amplify from HpaII representations of genomic DNA (measured by the median microarray fluorescence intensities for the oligonucleotides representing each HpaII-amplifiable fragment, when median HpaII signal intensity was below the level of background signal, defined as 2.5 median absolute deviations above the median of random probe signal intensities), despite amplification in the corresponding MspI representation (signal intensity above the background calculated in the same way for the MspI channel). The hypomethylated loci represented the remaining subset that was amplified in both channels (HpaII and MspI signal intensities above the levels of background signals). Overlap of each methylated or hypomethylated locus with CpG islands and CG clusters was quantified using a set of custom PERL scripts, and the results were analyzed by SQL query following their entry into a MySQL database.
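The categorical calling rule described above can be written out as a short Python sketch. This is an illustration under stated assumptions, not the authors' analysis code: the function names and the "uninformative" label for loci that fail in the MspI channel are hypothetical, and a signal exactly equal to background is treated here as not amplified, a boundary case the text leaves open.

```python
from statistics import median


def background_threshold(random_probe_signals, k=2.5):
    """Median plus k median absolute deviations of random-probe intensities."""
    med = median(random_probe_signals)
    mad = median([abs(x - med) for x in random_probe_signals])
    return med + k * mad


def call_locus(hpaii_signal, mspi_signal, hpaii_bg, mspi_bg):
    """Classify one HpaII-amplifiable fragment from its median signal per channel."""
    if mspi_signal <= mspi_bg:
        # No amplification in the MspI reference channel: no call is made.
        return "uninformative"
    if hpaii_signal > hpaii_bg:
        return "hypomethylated"  # amplified in both channels
    return "methylated"          # MspI amplifies, HpaII does not
```

Each channel gets its own background threshold, computed from the random probes measured in that channel.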

RESULTS

We pursued the hypothesis that there is a subpopulation of sequences in the genome defined solely by their clustering of CG dinucleotides. This clustering is a result of the genome-wide decay of CG dinucleotide content, with preservation of CG density at certain regions. By measuring the distance spanned by a fixed number of CG dinucleotides for every such group genome-wide, we observed that there are two populations of loci with distinctive CG clustering densities (Figure 1a and b). Using the first local minimum in the distribution of spanned sequence fragment lengths as the boundary of the short, CG-dense population, we identified the maximum fragment length for each cluster corresponding to a fixed number of CGs. In analyzing these cutoffs, we defined a linear relationship between CG dinucleotide number and the associated maximum fragment length (Figure 1c).
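The boundary-finding step can be sketched in Python as follows. Note the substitution: the Methods describe a cubic spline fit whose first derivative is examined for zeros, whereas this sketch smooths the histogram with a simple moving average and looks for the first strict local minimum. The bin width, window size and function names are illustrative assumptions.

```python
from collections import Counter


def length_histogram(lengths, bin_width=10):
    """Count fragment lengths in fixed-width bins, returned as a list from bin 0."""
    counts = Counter(length // bin_width for length in lengths)
    return [counts.get(b, 0) for b in range(max(counts) + 1)]


def moving_average(values, half_window=2):
    """Smooth values with a centered window that is truncated at the edges."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def first_local_minimum(values):
    """Index of the first interior point strictly lower than both neighbors.

    Returns None when no such point exists, for example when the
    distribution has a single peak.
    """
    for i in range(1, len(values) - 1):
        if values[i - 1] > values[i] < values[i + 1]:
            return i
    return None
```

Multiplying the returned bin index by the bin width gives an approximate maximum fragment length for the chosen number of CGs. A result of `None` would correspond to a histogram without a separate short-fragment population.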

Figure 1.

The analytical technique used to define CG clusters. First, a fixed number of CG dinucleotides is chosen, as illustrated in the example using six CGs (a). The first CG is identified in a chromosome, then the sixth, allowing the number of nucleotides between them to be recorded. The second and seventh CGs are then identified and the distance recorded and so on until the data for the entire genome is collected. When this analysis is performed for the entire human genome using 30 CGs at a time, the resulting lengths can be represented as a frequency histogram as shown in panel (b). Two populations are apparent: the peak on the left with short fragment lengths for this number of CGs, and the remainder of the genome where CGs do not cluster. The maximum fragment length for the clustered CGs is shown as a vertical red line. When the analysis is repeated for 5, 10, 15, … 100 CGs genome-wide, different maximum fragment lengths for each number of CGs are derived, with a near-linear relationship between these variables as shown in panel (c). The arrowhead refers to the observation made for 30 CGs illustrated in panel (b).

The clear differentiation of CG-dense fragments from the rest of the genome provides a means of mathematically defining CG-dense regions and can therefore be used as a robust foundation for computational genomic annotation. Given a fixed number of CGs, the CG-dense fragments below the maximum fragment length could be identified and mapped back onto the genome. But, as Figures 2a and b show, each of the fixed number of CGs generates different annotations. Using fewer CGs and correspondingly smaller fragments, many small CG clusters are identified, whereas by using a greater number of CGs and correspondingly larger fragments, fewer clusters are identified, but each extends into large flanking regions of lower CG density.

Figure 2.

Creating a CG cluster annotation for the human genome. For a given number of CGs, significantly CG-dense fragments are defined as being shorter than the maximum fragment length. When these fragments are mapped back to the genome, some loci have multiple overlapping fragments, indicating that they are more likely to be CG-dense. These conglomerations define a genomic annotation track for each number of CGs used. Fewer CGs per fragment produce an annotation that is highly sensitive to local changes in CG density, defining a large number of small CG clusters, as shown in panel (a). On the other hand, a high number of CGs per fragment defines fewer CG clusters, but can extend far into flanking CG-poor regions (b). To find the intermediate optimum, we calculated the average number of fragments per CG cluster genome-wide. When recalculated relative to maximum fragment length, this measure of information content per CG cluster generated a peak at 27 CGs per fragment (c). This value is associated with a maximum fragment length value from the regression line in Figure 1c of 531 bp.

We were able to optimize the criteria when we recognized that at any individual CG-dense locus, a given number of CGs generates multiple overlapping fragments. More CG-dense clusters require a greater number of fragments to span all of the CGs they contain. Accordingly, the more overlapping fragments that represent a given locus, the more likely it is to be significantly CG-dense. For each number of CGs, we calculated the number of overlapping fragments per cluster. We obtained a representation of information content for each CG number by summing this total across all loci in the genome and dividing by maximum fragment length. We then determined the optimal number of CGs per fragment using the maximum value obtained (Figure 2c). For the human genome, this optimum corresponds to 27 or more CG dinucleotides in a sequence of no more than 531 bp in length.
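The selection of the optimal number of CGs per fragment can be sketched as below. The text describes the metric both as a genome-wide average and as a sum across loci, so this sketch assumes one reading: the mean number of overlapping fragments per cluster, divided by the maximum fragment length for that n. The function names and the toy input values are hypothetical and are not data from the paper.

```python
def information_content(fragment_counts, max_len):
    """Mean number of overlapping fragments per cluster, per bp of max fragment length."""
    if not fragment_counts:
        return 0.0
    return (sum(fragment_counts) / len(fragment_counts)) / max_len


def optimal_cg_number(tracks):
    """Pick the n whose annotation track has the highest information content.

    tracks maps n -> (list of fragment counts per cluster, max fragment length).
    Returns (best_n, scores_by_n).
    """
    scores = {n: information_content(counts, theta)
              for n, (counts, theta) in tracks.items()}
    return max(scores, key=scores.get), scores


# Toy input: made-up fragment counts and thresholds for three values of n.
toy_tracks = {
    5: ([1, 2, 1], 100),
    10: ([4, 6], 200),
    15: ([5, 5], 400),
}
best_n, toy_scores = optimal_cg_number(toy_tracks)
```

In a full analysis, the per-cluster fragment counts would come from mapping the dense fragments back to the genome for each n, and the maximum fragment lengths from the histogram analysis.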
This new means of identifying CG clusters is neither constrained by (G+C) content nor by the associated observed/expected CG dinucleotide ratio. In Figure 3, we show that the thresholds imposed by even the least stringent original base compositional criteria (2) cause many CG-dense loci in the genome to be missed. However, even though we are annotating the entire sequenced genome, including repetitive DNA, we identify only a small fraction of the ∼350 000 CpG islands predicted by these old criteria (2) (Table 1).
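For comparison, the base compositional measures that the CG cluster approach avoids can be computed with a few lines of Python. The thresholds are those listed in the Table 1 footnote ((G+C) ⩾0.50, O/E CpG ⩾0.60, length ⩾200 bp), and the O/E ratio uses the standard form (number of CGs × length) / (number of Cs × number of Gs). This is a sketch that tests a single sequence as a whole; it does not reproduce the sliding-window scanning of the cpgi130 program, and the function names are hypothetical.

```python
def base_composition(seq):
    """Return ((G+C) fraction, observed/expected CG ratio) for a sequence."""
    seq = seq.upper()
    length = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_content = (c + g) / length if length else 0.0
    oe_ratio = (cg * length) / (c * g) if c and g else 0.0
    return gc_content, oe_ratio


def meets_island_criteria(seq, min_gc=0.50, min_oe=0.60, min_len=200):
    """True when the whole sequence passes all three base compositional thresholds."""
    gc_content, oe_ratio = base_composition(seq)
    return len(seq) >= min_len and gc_content >= min_gc and oe_ratio >= min_oe


# A synthetic (A+T)-rich but CG-dense sequence: 40 CGs in 240 bp.
at_rich_cg_dense = "ATATCG" * 40
```

The synthetic sequence has a high O/E ratio yet fails the (G+C) threshold, which is the kind of locus the text describes as being missed by base compositional criteria.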

Figure 3.

The base compositional characteristics of CG clusters (black) are shown in terms of observed to expected CG dinucleotide densities (O/E CG ratio) on the x-axis with (G+C) content on the y-axis. The dashed lines illustrate the relatively non-stringent thresholds of the original CpG island definition (2). Any points to the left of the vertical threshold or below the horizontal threshold show how many CG-dense loci would fail to be identified using base compositional criteria alone. The arrowhead illustrates extremely (A+T)-rich, CG-dense alpha satellite DNA sequences.

We compared the functional significance of CpG islands and CG clusters in two ways: testing their relative frequency co-localizing with promoters and with hypomethylated loci. A major use of the CpG island annotation has been to predict the location of transcription start sites in the genome. Approximately 40% (18) to 50% (6) of human promoters have been found to co-localize with CpG islands, while promoters of housekeeping genes have been described to have a near-universal association with CpG islands (18). We cross-correlated our CG clusters and the CpG island locations annotated at the UCSC Genome Browser with transcription start sites of refSeq genes from the same database, finding CpG islands to overlap 57% of refSeq transcription start sites, 79% of a published list of housekeeping genes (19) and 38% of a published list of tissue-specific genes (20). In contrast, the proportion of refSeq transcription start sites associated with CG clusters is substantially higher (68%, an additional 11% or 2701 refSeq transcription start sites), with 45% of the tissue-specific genes and 91% of housekeeping genes co-localizing with these CG clusters (Figure 4a; Table 1).
As the UCSC CpG island annotation is of non-repetitive sequence and our CG cluster annotation was generated without this filter, we were concerned that the comparison was unfairly penalizing the UCSC CpG island annotation. We therefore tested the relative proportions of refSeq promoter overlaps for two other annotations: the 350 201 CpG islands that occur in the genome as a whole, and the 31 225 CG clusters that remain after excluding those defined mainly by transposable elements. In Figure 4b, we show that the performance of the CG cluster annotation is stronger for both unfiltered sequence (positive predictive values of 0.381 compared with 0.049) and non-transposon sequence (0.535 compared with 0.508) in identifying refSeq promoters. Similar patterns are found for the mouse genome (Table 1). Given that CpG islands have been used as a component of algorithms for predicting promoters in the genome (6), CG clusters should offer a more powerful resource for this and comparable purposes.
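The positive predictive values quoted above follow directly from the human rows of Table 1, as the fraction of annotated loci that overlap a refSeq transcription start site. The short Python check below uses only those published counts; the dictionary layout and function name are illustrative.

```python
# (total annotated loci, loci overlapping a refSeq transcription start site),
# taken from the human rows of Table 1.
TABLE_1_HUMAN = {
    "CpG islands (original definition)": (350_201, 17_209),
    "CpG islands (UCSC annotation)": (27_801, 14_121),
    "CG clusters": (44_165, 16_822),
    "CG clusters (non-transposon)": (31_225, 16_690),
}


def positive_predictive_value(total, overlapping_tss):
    """Fraction of annotated loci that overlap a transcription start site."""
    return overlapping_tss / total


for name, (total, hits) in TABLE_1_HUMAN.items():
    print(f"{name}: {positive_predictive_value(total, hits):.3f}")
```

Rounded to three decimals, the four values are 0.049, 0.508, 0.381 and 0.535, matching the figures given in the text.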

Figure 4.

(a) CG clusters (white bars) overlap more refSeq transcription start sites than the CpG island annotations of the UCSC genome browser (black bars). CG clusters overlap the presumed promoters of a substantially higher proportion of genes overall (left), including both housekeeping and tissue-specific genes. In panel (b) we add in (in black) the number of loci that do not overlap refSeq promoters (white) to demonstrate the ‘false positive’ rate for the categories shown in panel (a), as well as the 350 201 CpG islands found when all genomic sequence is tested without removing repetitive elements and the CG clusters that are not defined by transposable elements. The positive predictive value for CpG islands and CG clusters (bold) are also shown, quantifying the relative performance of each of these annotations.

We tested the relative ability of CG clusters to detect hypomethylated sites by performing the HELP assay (12) on human embryonic stem cells and CD34+ hematopoietic stem and progenitor cells. A microarray representing HpaII-amplifiable fragments located near transcription start sites in the human genome was used for two biological replicates of each cell type. While similar proportions of loci at CpG islands and at CG clusters demonstrated hypomethylation (Figure 5a), the absolute number of hypomethylated loci differed (Figure 5b), as the hypomethylated CpG islands represent a subset of the larger group of hypomethylated CG clusters. The CG clusters identify ∼50% more hypomethylated loci than do CpG islands. We conclude that the CG cluster annotation is not only identifying more transcription start sites, it is also defining loci with comparable epigenetic characteristics.

Figure 5.

The HELP assay (12) was used on a custom promoter microarray to test cytosine methylation patterns in two samples each of CD34+ hematopoietic stem and progenitor cells (CD34) and human embryonic stem cells (ES). In panel (a) it is apparent that similar proportions of sites are categorized as hypomethylated for CG clusters (C) and CpG islands (I). However, the absolute number of hypomethylated sites detected by the CG cluster annotation is markedly larger than that for the CpG island annotation [panel (b)]. We conclude that the CG cluster annotation is not only detecting larger numbers of transcription start sites (Figure 4), it extends the ability of the CpG island annotation to identify more hypomethylated sites in the genome.

We next addressed the question of why CpG islands are often not conserved between human and mouse (16,22). This is a puzzling issue if the CG-rich nature of the promoter is of functional importance, for example conferring the ubiquitous expression patterns that define housekeeping genes (18). It would be expected that such functional promoter characteristics would be conserved between species despite differences in overall CG dinucleotide content [observed/expected (O/E) CG ratios of 0.19 and 0.24 for mouse and human, respectively (22)]. It is therefore surprising that the total number of CpG islands in mouse is only ∼58% of the number annotated for the human genome (Table 1). When we performed the CG clustering analysis of the mouse genome, we found it also generates two populations with distinct CG density characteristics, but that the optimal CG cluster definition for the mouse genome is different from that of the human, corresponding to 24 or more CG dinucleotides in a sequence of no more than 585 bp in length (Figure 6). By comparison, human CG clusters consist of 27 CGs in no more than 531 bp.
When we calculated the total number of CG clusters for the mouse genome, it was strikingly similar to that for the human (42 971 and 44 165, respectively, Table 1). In addition, when we re-analyzed a sample of 23 loci originally published to demonstrate the failure of CpG island conservation between these species (15), we found that while only 18 conserve CpG islands, 22 out of 23 conserve CG clusters, the single exception in this limited sample being the alpha globin orthologs (HBA1/Hba-a1). We extended this study to test conservation of each annotation genome-wide. Of all of the 27 801 CpG islands annotated at the UCSC Genome Browser, 14 452 have orthologous sequences with CpG islands in the mouse genome, while there exist 19 410 sites of conserved CG clustering (Table 1). When studied using our genome-specific annotations, clustered CG dinucleotides are demonstrably much more conserved between species than previously appreciated.

Figure 6.

The mouse genome has different CG clustering characteristics than those of the human genome. The optimization curve characteristics for mouse are clearly different from those for human (a). The optimal mouse annotation contains fragments no longer than 585 nt with 24 or more CGs per fragment, fewer CGs in a longer stretch of DNA than for the human genome. In panel (b) it is again apparent that base composition criteria alone will fail to recognize a substantial proportion of CG-dense loci in this species.

We extended the CG clustering histogram analysis to eight more genomes, including other organisms that are known to methylate their genomes, those that do so only transiently (Drosophila melanogaster) (23), and those that do not methylate at all. The surprising result of these analyses is that the fugu (Tiger Blowfish, Takifugu rubripes) genome, which has been described to methylate its DNA (24), does not exhibit uniquely CG-dense regions. What may explain this difference is that the degree of decay of CG dinucleotide content in the fugu genome is less than that of most genomes in which unique CG-dense regions emerge (Figure 7). The zebrafish (Danio rerio) genome, on the other hand, does display uniquely CG-dense regions with only marginally greater CG dinucleotide decay (O/E CG 0.53 as opposed to 0.57 in fugu). The remaining major difference between these genomes is that of size, the fugu genome being substantially smaller than the other methylating genomes at only 365 Mb total (25), a variable already suggested to be related to the evolution of cytosine methylation (26). Our data demonstrate that while cytosine methylation appears to be necessary for CG decay, it is not sufficient to cause local preservation of clustered CG dinucleotides. Furthermore, we can conclude that any annotation of the fugu genome to indicate the presence of CpG islands or CG clusters is inappropriate.

Figure 7.

CG cluster analysis of 10 different species. These CG fragment length frequency plots were generated using 30 CGs per fragment for each species. Genomes containing CG clusters are defined by the distinct peak of short, uniquely CG-dense fragments. While the three non-methylating organisms on the left (Saccharomyces cerevisiae, Caenorhabditis elegans and D. melanogaster) show no uniquely CG-dense peak, it was surprising to find that fugu has similar characteristics despite the fact that it methylates its genome (24). Zebrafish, on the other hand, which also methylates its genome, has a distinct CG-dense peak, as do the other vertebrate genomes on the right. There is more CG decay in zebrafish than fugu (O/E CG ratios of 0.525 and 0.571, respectively), but this marginal difference does not appear sufficient to account for the emergence of CG-dense clusters in zebrafish. Methylation of the genome is not, therefore, always accompanied by the presence of CG-dense loci that avoid mutational decay. For a more detailed illustration of the CG cluster analysis of these genomes, see the Supplementary Data section.

DISCUSSION

The approach of testing for CG clustering reveals the loci at which CG dinucleotide decay occurred at a markedly lower rate than in the rest of the genome. Unlike CpG islands, CG clusters occur with a large range of (G+C) content and O/E CG ratios. This reveals the unusual CG density of loci such as alpha satellite DNA sequences, targets for the DNMT3B methyltransferase (27), which are very (A+T)-rich and are consequently not defined as distinctive using traditional CpG island base compositional criteria. While this is an example of the CpG island definition being excessively restrictive, the definition also suffers from the problem of identifying a large number of sites located within transposable elements. It is certainly possible that these retroelements have cis-regulatory effects (28), but their overall tendency to be methylated (11) diminishes the usefulness of the CpG island annotation as a mark of unmethylated DNA in the genome. Marking unmethylated DNA is a major use of the annotation in cancer epigenomics, especially in defining the CpG island methylator phenotype (CIMP), which requires that methylation of a given CpG island be distinctive in neoplastic cells (29). The approach used in the creation of CpG island browser tracks involves using sequences from which repetitive elements have previously been removed, increasing the likelihood of identifying constitutively unmethylated, presumably cis-regulatory sequences, but even this approach defines a number of sites that are methylated in normal cells (12,13). Our HELP data indicate that CG clusters perform better than CpG islands at identifying unmethylated sites in the genome. However, our data do not support CG clusters being universally unmethylated, as we find that satellite DNA and young retrotransposons contain CG clusters, and these sequences are normally methylated (11,27).
We propose that what distinguishes clustered from non-clustered CGs in the genome is the greater stability of associated epigenetic marks, such as hypomethylation at gene promoters or methylation of alpha satellite DNA. CpG islands have recently been described to be located at the bivalent domains of histone tail modifications in embryonic stem cells (30), reinforcing the rationale for using these loci as a means of identifying candidate cis-regulatory sites in the genome. CpG islands represent a foundation annotation of the genome on which other annotations are built, for example contributing to algorithms to identify gene promoters (31). However, we show (Figure 3) that the CG cluster annotation performs substantially better than the CpG island annotation in localizing to known promoters (as represented by refSeq transcription start sites), indicating that an improved foundation annotation like CG clusters may improve the performance of algorithms currently using CpG islands.

Because identical criteria have been used to define CpG islands in different species, in which CG clustering can have markedly different characteristics, CpG islands have been thought to be poorly conserved between species, especially with the focus on human/mouse comparisons (21). We show that a species-specific definition of CG clusters reveals an unexpected degree of conservation of this annotation between human and mouse. We anticipate that conserved CG clusters will represent a subset of loci of exceptional functional importance in the genome.

However, it is also clear (Figure 7) that it is inappropriate to annotate all methylating genomes for the presence of CG-dense regions. Fugu is annotated at genome browsers (UCSC and Ensembl) for the presence of CpG islands despite the fact that it does not have a distinctive population of loci maintaining CG content in an overall genomic context of CG decay. Zebrafish, on the other hand, with a similar degree of CG decay, manifests the two populations of CG content. Interestingly, the methylation of cytosines in the zebrafish genome includes a substantial proportion at non-CG dinucleotide sites (32), yet the selective preservation of CG content at a subset of loci occurs in this genome.
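One simple way to compare how well two annotations localize to transcription start sites is to compute the fraction of start sites that fall inside any annotated interval. The sketch below is only an illustration under that assumption; the measure actually used for Figure 3 is not described in this section, the function name `fraction_tss_covered` is hypothetical, and all coordinates are taken to lie on a single chromosome.

```python
from bisect import bisect_right
from typing import List, Tuple


def fraction_tss_covered(intervals: List[Tuple[int, int]],
                         tss: List[int]) -> float:
    """Fraction of TSS positions inside any half-open interval [start, end).

    Assumption: intervals and TSS positions share one chromosome
    and one coordinate system.
    """
    if not tss:
        return 0.0
    # Merge overlapping intervals so a binary search on starts is valid.
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [m[0] for m in merged]
    hits = 0
    for pos in tss:
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos < merged[i][1]:
            hits += 1
    return hits / len(tss)


if __name__ == "__main__":
    # Toy coordinates, not real annotations.
    clusters = [(100, 200), (150, 300), (500, 600)]
    start_sites = [120, 250, 300, 550, 50]
    print(fraction_tss_covered(clusters, start_sites))
```

Running the same function on a CpG island track and on a CG cluster track, against the same set of start sites, would give two directly comparable fractions; a fuller comparison would also account for the total length of genome each annotation covers.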

CONCLUSIONS

We show that CG clusters, when present in a genome, define themselves as a distinctive population of loci. This novel annotation performs better at identifying promoters and hypomethylated DNA than current CpG island definitions, and allows a species-specific definition of CG clusters that reveals a previously unsuspected degree of conservation of this sequence feature. The species specificity of what defines a CG cluster indicates that CG dinucleotides only need to be enriched within the context of their genome to be distinctive and presumably functional. We expect that the annotations of CG clusters will prove valuable to those studying the genome as well as the epigenome, so we have provided the human and mouse annotations as a resource for public use at http://greallylab.aecom.yu.edu/cgClusters/

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

URLS

The UCSC Genome Browser (http://genome.ucsc.edu/), the ENSEMBL Genome Browser (http://www.ensembl.org/), the Greally lab CG cluster webpage (http://greallylab.aecom.yu.edu/cgClusters/), the R project (http://www.r-project.org/) and the CpG island searcher website with the cpgi130 program (http://cpgislands.usc.edu/).

32 in total

Review 1. CpG-island methylation in aging and cancer.

Authors: J P Issa
Journal: Curr Top Microbiol Immunol Date: 2000 Impact factor: 4.291

2. DNA methylation in Drosophila melanogaster.

Authors: F Lyko; B H Ramsahoye; R Jaenisch
Journal: Nature Date: 2000-11-30 Impact factor: 49.962

3. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

4. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

5. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes.

Authors: Samuel Aparicio; Jarrod Chapman; Elia Stupka; Nik Putnam; Jer-Ming Chia; Paramvir Dehal; Alan Christoffels; Sam Rash; Shawn Hoon; Arian Smit; Maarten D Sollewijn Gelpke; Jared Roach; Tania Oh; Isaac Y Ho; Marie Wong; Chris Detter; Frans Verhoef; Paul Predki; Alice Tay; Susan Lucas; Paul Richardson; Sarah F Smith; Melody S Clark; Yvonne J K Edwards; Norman Doggett; Andrey Zharkikh; Sean V Tavtigian; Dmitry Pruss; Mary Barnstead; Cheryl Evans; Holly Baden; Justin Powell; Gustavo Glusman; Lee Rowen; Leroy Hood; Y H Tan; Greg Elgar; Trevor Hawkins; Byrappa Venkatesh; Daniel Rokhsar; Sydney Brenner
Journal: Science Date: 2002-07-25 Impact factor: 47.728

6. Short interspersed transposable elements (SINEs) are excluded from imprinted regions in the human genome.

Authors: John M Greally
Journal: Proc Natl Acad Sci U S A Date: 2001-12-26 Impact factor: 11.205

7. Computational identification of promoters and first exons in the human genome.

Authors: R V Davuluri; I Grosse; M Q Zhang
Journal: Nat Genet Date: 2001-12 Impact factor: 38.330

8. Satellite 2 methylation patterns in normal and ICF syndrome cells and association of hypomethylation with advanced replication.

Authors: K M Hassan; T Norwood; G Gimelli; S M Gartler; R S Hansen
Journal: Hum Genet Date: 2001-10 Impact factor: 4.132

9. Comprehensive analysis of CpG islands in human chromosomes 21 and 22.

Authors: Daiya Takai; Peter A Jones
Journal: Proc Natl Acad Sci U S A Date: 2002-03-12 Impact factor: 11.205

10. CpGcluster: a distance-based algorithm for CpG-island detection.

Authors: Michael Hackenberg; Christopher Previti; Pedro Luis Luque-Escamilla; Pedro Carpena; José Martínez-Aroza; José L Oliver
Journal: BMC Bioinformatics Date: 2006-10-12 Impact factor: 3.169
