Galen T Martin, Danelle K Seymour, Brandon S Gaut.
Abstract
Methylated CHH (mCHH) islands are peaks of CHH methylation that occur primarily upstream of genes. These regions are actively targeted by the methylation machinery, occur at boundaries between heterochromatin and euchromatin, and tend to be near highly expressed genes. Here we took an evolutionary perspective by studying upstream mCHH islands across a sample of eight grass species. Using a statistical approach to define mCHH islands as regions that differ from genome-wide background CHH methylation levels, we demonstrated that mCHH islands are common and associate with 39% of genes, on average. We hypothesized that islands should be more frequent in genomes of large size, because they have more heterochromatin and hence a greater need for defined boundaries. We found, however, that smaller genomes tended to have a higher proportion of genes associated with 5′ mCHH islands. Consistent with previous work suggesting that islands reflect silencing at the edges of transposable elements (TEs), genes with nearby TEs were more likely to have mCHH islands. However, the presence of mCHH islands was not a function solely of TEs, both because the underlying sequences of islands were often not homologous to TEs and because genic properties also predicted the presence of 5′ mCHH islands. These genic properties included length and gene-body methylation (gbM); in fact, in three of eight species, the absence of gbM was a stronger predictor of a 5′ mCHH island than TE proximity. In contrast, gene expression level was a positive but weak predictor of the presence of an island. Finally, we assessed whether mCHH islands were evolutionarily conserved by focusing on a set of 2,720 orthologs across the eight species. They were generally not conserved across evolutionary time. Overall, our data establish additional genic properties that are associated with mCHH islands and suggest that they are not just a consequence of the TE silencing machinery.
Keywords: DNA methylation; Poaceae; comparative analysis; epigenetics; mCHH islands; transposable elements
Year: 2021 PMID: 34146109 PMCID: PMC8374106 DOI: 10.1093/gbe/evab144
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Table 1. A List of Species Examined in This Study, with Their Genome Size, the Number of Genes Used in Analyses, and Information about CHH-Island Characteristics

| Species | Genome Size (Mb)ᵃ | No. of Genesᵇ | Genes with mCHH Island (%)ᶜ | Median Island mCHH (%)ᵈ | Orthologs with 5′ mCHH Island (%)ᵉ | Median Ortholog-Island mCHH (%)ᶠ |
|---|---|---|---|---|---|---|
|  | 355 | 34,257 | 55.16% | 31.49% | 58.20% | 32.76% |
|  | 5,428 | 35,200 | 28.22% | 34.19% | 41.34% | 34.64% |
|  | 489 | 41,806 | 71.85% | 41.80% | 76.61% | 49.19% |
|  | 2,075 | 30,946 | 17.27% | 29.03% | 18.51% | 28.57% |
|  | 513 | 34,170 | 29.80% | 38.89% | 34.31% | 41.00% |
|  | 734 | 33,972 | 54.01% | 46.05% | 59.91% | 49.06% |
|  | 4,817 | 33,612 | 22.94% | 34.78% | 28.29% | 35.71% |
|  | 2,655 | 37,534 | 30.85% | 53.85% | 38.73% | 53.80% |

ᵃ Genome sizes estimated by flow cytometry, primarily from the Kew C-values database (see Materials and Methods).
ᵇ Number of genes used in genome-wide summaries in figures 1 and 2, including only genes with near-gene BSseq coverage (see Materials and Methods).
ᶜ The percentage of genes associated with mCHH islands within the flanking 5′ or 3′ 2.0 kb.
ᵈ The median level of CHH methylation in islands within 2 kb of genes.
ᵉ The percentage of orthologs, of 2,720 total, associated with a 5′ mCHH island in each species.
ᶠ Median mCHH level for islands associated with orthologs.
Fig. 1. Near-gene methylation across Poaceae species. (A) Profiles of methylation across genes and their 2.0 kb 5′ and 3′ flanking regions. Weighted methylation levels are summarized in ten 200 bp windows upstream and downstream of genes, and in 20 equally sized windows within genes, whose size varies with gene length. These figures summarize across full genes, including exons and introns. Here we show three species that span the range of genome size (table 1), with the remaining species shown in supplementary figure S2, Supplementary Material online. TSS and TTS refer to the transcription start and termination sites. (B) Near-gene enrichment of mCHH increases with genome size. Near-gene mCHH enrichment represents the mean weighted mCHH level in 1 kb regions upstream of the TSS divided by the mean weighted mCHH level in an equal number of 1 kb regions randomly selected throughout the genome.
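The enrichment ratio defined for panel (B) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the `(methylated_reads, total_reads)` input layout are hypothetical, and the weighted methylation level is assumed to be methylated reads divided by total reads summed over the cytosines in a window.

```python
from statistics import mean


def weighted_methylation(sites):
    """Weighted methylation level of one window.

    `sites` is a list of (methylated_reads, total_reads) tuples, one per
    cytosine in the window (assumed layout). Returns None for a window
    without read coverage.
    """
    methylated = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return methylated / total if total > 0 else None


def near_gene_enrichment(upstream_windows, random_windows):
    """Mean weighted level upstream of TSSs divided by the mean weighted
    level in randomly placed windows; uncovered windows are skipped."""
    upstream = [w for w in map(weighted_methylation, upstream_windows) if w is not None]
    background = [w for w in map(weighted_methylation, random_windows) if w is not None]
    return mean(upstream) / mean(background)


# Toy data: two upstream 1 kb windows and two random 1 kb windows.
upstream_windows = [[(3, 10), (1, 10)], [(2, 10), (2, 10)]]
random_windows = [[(1, 10), (0, 10)], [(1, 20), (1, 20)]]
enrichment = near_gene_enrichment(upstream_windows, random_windows)
```

A value above 1 indicates that CHH methylation is higher upstream of genes than in the random genomic background.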
Fig. 2. Profiles of methylation across mCHH islands in each sequence context. The x axis provides the distance in base pairs (bp) from a detected island, which is centered at zero. The points on the graph represent weighted methylation levels in 100 bp windows. The islands were not at a fixed distance from genes, because they were determined by significance tests, but they were within the 2 kb 5′ flanking region of genes.
Fig. 4. Variable importance analysis of the logistic regression model, presenting the contribution of each variable to the model on an equivalent scale. Values <0 on the y axis denote a negative association between the predictor and the presence of a CHH island; values >0 are positive predictors.
Fig. 5. Conservation of mCHH islands across orthologs in grass species. (A) A heatmap of the enrichment of features over the random expectation of 1.0. Top half: enrichment of mCHH island conservation between pairs of species based on one-to-one orthologs. Bottom half: enrichment of gbM between pairs of species based on one-to-one orthologs. (B–E) Graphs of the relationship between mCHH island conservation and each genic predictor variable: exonic mCG level (B), expression (C), length (D), and TE distance (E). For each graph, the x axis denotes the number of orthologs, of eight total, with a 5′ mCHH island, and the y axis denotes the average value of the stated statistic in the ortholog across species.
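One way to read "enrichment over the random expectation of 1.0" in panel (A) is as an observed-to-expected ratio under independence. The sketch below assumes that interpretation; the caption does not give the exact formula, and the function name and boolean input layout are hypothetical.

```python
def conservation_enrichment(islands_a, islands_b):
    """Observed / expected frequency of orthologs with an island in both species.

    `islands_a` and `islands_b` are equal-length lists of booleans, aligned by
    one-to-one ortholog (assumed layout). Under independence the expected
    shared frequency is the product of the two per-species frequencies, so a
    ratio of 1.0 corresponds to the random expectation.
    """
    n = len(islands_a)
    if n == 0 or n != len(islands_b):
        raise ValueError("inputs must be non-empty and aligned by ortholog")
    freq_a = sum(islands_a) / n
    freq_b = sum(islands_b) / n
    expected = freq_a * freq_b
    if expected == 0:
        raise ValueError("at least one species has no islands")
    observed = sum(a and b for a, b in zip(islands_a, islands_b)) / n
    return observed / expected


# Toy examples with four orthologs.
fully_shared = conservation_enrichment([True, True, False, False],
                                       [True, True, False, False])
independent = conservation_enrichment([True, True, False, False],
                                      [True, False, True, False])
```

In the toy data, islands that always co-occur give a ratio of 2.0, while islands distributed independently give 1.0.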
Counts of TEs within 2 kb for a Common Set of TE Superfamilies across Species and Their Enrichment Status for mCHH Islands

| TE Superfamilyᵃ | Barley: #TEs <2 kbᵇ | Barley: Proportion with Islandᶜ | Barley: Enrichedᵈ | Rice: #TEs <2 kbᵇ | Rice: Proportion with Islandᶜ | Rice: Enrichedᵈ | Maize: #TEs <2 kbᵇ | Maize: Proportion with Islandᶜ | Maize: Enrichedᵈ |
|---|---|---|---|---|---|---|---|---|---|
| DHH | 69 | 0.174 | NS | 132 | 0.417 | Under | 5,235 | 0.283 | Under |
| DTA | 21 | 0.095 | NS | 517 | 0.768 | NS | 653 | 0.542 | Enriched |
| DTC | 3,480 | 0.280 | NS | 1,243 | 0.474 | Under | 184 | 0.429 | Enriched |
| DTH | 478 | 0.460 | Enriched | 40 | 0.700 | NS | 2,677 | 0.536 | Enriched |
| DTM | 654 | 0.378 | Enriched | 1,106 | 0.806 | Enriched | 122 | 0.623 | Enriched |
| DTT | 474 | 0.430 | Enriched | 2,143 | 0.898 | Enriched | 2,307 | 0.389 | Enriched |
| DTX | 332 | 0.497 | Enriched | 6,476 | 0.882 | Enriched | 299 | 0.408 | Enriched |
| RIX | 880 | 0.227 | Under | 551 | 0.611 | Under | 87 | 0.253 | NS |
| RLC | 4,896 | 0.239 | Under | 1,433 | 0.651 | Under | 3,148 | 0.280 | Under |
| RLG | 4,413 | 0.213 | Under | 2,230 | 0.580 | Under | 3,891 | 0.292 | Under |
| RLX | 11,356 | 0.315 | Enriched | 10,753 | 0.704 | Under | 2,632 | 0.224 | Under |
| RSX | 76 | 0.316 | NS | 796 | 0.932 | Enriched | 43 | 0.302 | NS |

ᵃ TE classification code as described by Wicker et al. (2007). DHH, Helitron; DTA, hAT; DTC, CACTA; DTH, PIF-Harbinger; DTM, Mutator; DTT, Tc1-Mariner; DTX, unknown DNA elements; RIX, unclassified LINE; RLC, Copia; RLG, Gypsy; RLX, unclassified LTR; RSX, unclassified SINE.
ᵇ The number of TEs within each class that are within 2 kb upstream of an annotated gene, based on counting only the closest TE to a gene.
ᶜ The proportion of genes that have both an mCHH island and a TE within 2 kb upstream.
ᵈ Based on a binomial test (FDR corrected, P < 0.05), classes of TEs were determined to be significantly enriched (Enriched) for CHH islands or under-enriched (Under), relative to the total proportion estimated across all TE superfamilies. NS, nonsignificant.
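The enrichment calls in the table can be illustrated with a standard-library sketch of an exact two-sided binomial test followed by Benjamini-Hochberg FDR correction. The source states only "binomial test (FDR corrected, P < 0.05)"; the choice of a two-sided exact test, the Benjamini-Hochberg procedure, all function names, and the toy counts are assumptions made for illustration.

```python
import math


def binom_logpmf(k, n, p):
    """Log-probability of k successes in n trials (requires 0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binom_logpmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_prob = binom_logpmf(i, n, p)
        if log_prob <= observed + 1e-7:  # small tolerance for rounding ties
            total += math.exp(log_prob)
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_superfamilies(counts, alpha=0.05):
    """counts maps superfamily -> (TEs with island, TEs within 2 kb)."""
    pooled = sum(k for k, _ in counts.values()) / sum(n for _, n in counts.values())
    names = list(counts)
    qvals = benjamini_hochberg([binom_test_two_sided(*counts[s], pooled) for s in names])
    calls = {}
    for name, q in zip(names, qvals):
        k, n = counts[name]
        calls[name] = "NS" if q >= alpha else ("Enriched" if k / n > pooled else "Under")
    return calls


# Hypothetical counts, not taken from the table.
calls = classify_superfamilies({"A": (90, 100), "B": (10, 100), "C": (50, 100)})
```

With these toy counts the pooled proportion is 0.5, so superfamily A is called Enriched, B is called Under, and C is NS.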
Fig. 6. mCHH islands in relation to TE presence. (A) The distribution of e-values after blasting sequences against an annotated TE database for Z. mays (left), O. sativa (middle), and H. vulgare (right). Each graph plots the results for 100 bp mCHH island DNA sequences and an equal number of randomly chosen 100 bp nonisland sequences for comparison. (B) The coefficients of variation for mCHH island distances from gene TSS (orange) and TE edges (green) for each of the different types of TEs analyzed (Wicker et al. 2007). The schematic above the graphs defines the distances measured. (C) A schematic that defines the use of the terms coincident and dissonant. Each term describes a comparison of orthologs between pairs of species, with a lineage-specific 5′ mCHH island in only one species. Coincidence is when there is a lineage-specific TE and island in the same species; dissonance is when the TE and island are in different species. The bar graph shows the frequency with which orthologs possess a lineage-specific mCHH island and the presence of TEs in neither lineage, both lineages, or a single lineage (coincidence and dissonance) in the three pairwise comparisons between maize, rice, and barley.
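The coefficient of variation compared in panel (B) is the standard deviation divided by the mean. The sketch below assumes the sample standard deviation and uses invented distances; the function name and the values are illustrative only.

```python
from statistics import mean, stdev


def coefficient_of_variation(distances):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(distances) / mean(distances)


# Hypothetical island distances in bp, not taken from the paper.
tss_distances = [100, 200, 300]    # island to gene TSS
te_edge_distances = [48, 50, 52]   # island to nearest TE edge

cv_tss = coefficient_of_variation(tss_distances)
cv_te = coefficient_of_variation(te_edge_distances)
```

Because the coefficient of variation is scale free, it allows the spread of island-to-TSS distances to be compared with the spread of island-to-TE-edge distances even when the typical distances differ.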
Fig. 3. mCHH islands relative to gene expression and length. (A) Profiles of near-gene methylation in genes separated into four quartiles of expression and into nonexpressed genes. The graphs illustrate that, for some species, genes in the higher quartiles tend to have higher 5′ flanking CHH methylation. (B) Expression levels of mCHH island genes and nonisland genes. Significance levels between the two categories are shown for each species, with NS = not significant. (C) Profiles of near-gene methylation in genes separated into four quartiles of gene length. (D) The length of island and nonisland genes. Significance levels between the two categories are shown for each species, with NS = not significant. These length measures were based on distances from the TSS to the TTS, but the results hold using the length of exons in the longest transcript (supplementary fig. S6, Supplementary Material online). For panels (A) and (C), the species were chosen because they represent a range of genome size, as in figure 1. The remaining species are shown in supplementary figures S4 and S5, Supplementary Material online. For (B) and (D), the box plots present the median, with the edges representing the upper and lower quartiles.