Literature DB >> 26689968

Identification of methylated deoxyadenosines in vertebrates reveals diversity in DNA modifications.

Charles R Bradshaw¹, George E Allen¹, Ana S H Costa², Magdalena J Koziol^1,3, Christian Frezza², John B Gurdon^1,3.

Abstract

Methylation of cytosine deoxynucleotides generates 5-methylcytosine (m(5)dC), a well-established epigenetic mark. However, in higher eukaryotes much less is known about modifications affecting other deoxynucleotides. Here, we report the detection of N(6)-methyldeoxyadenosine (m(6)dA) in vertebrate DNA, specifically in Xenopus laevis but also in other species including mouse and human. Our methylome analysis reveals that m(6)dA is widely distributed across the eukaryotic genome and is present in different cell types but is commonly depleted from gene exons. Thus, direct DNA modifications might be more widespread than previously thought.

Entities: Chemical

Mesh：

Substances：

Year: 2015 PMID： 26689968 PMCID： PMC4941928 DOI： 10.1038/nsmb.3145

Source DB: PubMed Journal: Nat Struct Mol Biol ISSN： 1545-9985 Impact factor: 15.369

More than 60 years ago, it was discovered that cytosine deoxynucleotides could be methylated in eukaryotic genomic DNA1. Since then, dC5m has been extensively studied, revealing it as a major genetically heritable regulatory modification for gene transcription2–4. Up to now, not much is known in higher eukaryotes about modifications affecting other deoxynucleotides. In contrast, RNA, a molecule that is built up from similar molecules as DNA, is known to have more than 60 modifications in eukaryotes, and when including different organisms, the number is greater than 110 (ref. 5). Due to the strong similarity between the DNA and RNA building blocks, we found it surprising that higher eukaryotic DNA is not known to be diverse. In order to determine if there are other direct DNA modifications, we used dA6m as an example to discover if the higher eukaryotic genome is more diverse than previously thought. Methylation of deoxyadenosines has been identified and is a well-described epigenetic feature in bacteria. In these prokaryotes, dA6m is known to regulate various biological pathways such as the restriction-modification system, replication, repair, transcription and transposition6–11. Two reports, using restriction enzyme digests, suggested that dA6m might exist in higher eukaryotes, but no direct evidence or global pattern has ever been reported12,13. Other initial analytical approaches to assess the presence of dA6m in higher eukaryotes were unsuccessful, possibly because these approaches were constrained by the detection limit of 0.1–0.01% of total deoxynucleotides14–16. Only very recently, it was reported that dA6m is present in the genome of the algae Chlamydomonas, in the insect Drosophila and in the nematode C. elegans17–19. In contrast to that work, we focused on higher eukaryotes instead. Here, we report the identification of dA6m in higher eukaryotes, using the same approach used by the other reports. However, ours was developed independently, before the recent publications concerning this modification were published. In order to determine the presence and distribution of dA6m in eukaryotic genomes, we used dot blots, ultra-high performance liquid chromatography-tandem mass spectrometry (UHPLC-MS/MS) and applied a dA6m enrichment approach. Using an antibody against dA6m (dA6m Ab), we carried out DNA immunoprecipitation (DIP) to enrich for genomic DNA fragments containing dA6m that allowed us to identify and describe dA6m genome-wide (Fig. 1a). Here, we identified dA6m not only in the genomes of the frog X. laevis, but also in all genomes we analyzed, such as the mouse M. musculus and human tissues. We showed that this mark is widely distributed across the genome, but is depleted in exonic regions, and appeared to have a preference for TAGG sites, and possibly contain AG as a core motif.

Figure 1

Identification of dA6m in the genome of higher eukaryotes.

a, Illustration of dA6m identification. dA6m was identified using dot blots, UHPLC-MS/MS and dA6m Ab DIP sequencing (DIP-Seq). b, Dot blot with dA6m Ab on DNA templates. c-d, Representative (1 out of 4) UHPLC-MS/MS extracted ion chromatogram (EIC) (left) and fragmentation spectrum (right) monitoring presence of dA6m in genomic DNA from X. laevis. * indicates parent ion, AU=arbitrary units, m/z=mass to charge ratio, n=4 biological replicates (tissues from different animals). e, Percentage of dA6m versus total deoxyadenosines in DNA from different samples following dA6m Ab DIP enrichment. Error bars, s.e.m. n=4 tissues from different animals or from independent bacterial cell cultures, **P≤0.005, *P≤0.05, two-sided t-test. f, Percentage of dA6m versus total deoxyadenosines in DNA from different samples without dA6m Ab DIP enrichment. Error bars, s.e.m., n=4 tissues from different animals or from independent bacterial cell cultures, **P≤0.005, *P≤0.05, two-sided t-test, but when one sample was zero the one-sided t-test was applied. g, dA6m dot-blot of DNA from different higher eukaryotic sources, n=3 technical replicates.

Results

To identify dA6m in higher eukaryotic genomes, we applied an antibody enrichment approach. First, we verified that an antibody reported to bind to methylated adenosines could in fact recognize the dA6m modification20. Dot blot experiments and DIP using synthetic oligonucleotides confirmed that this Ab indeed recognizes dA6m (Supplementary Fig. 1a-c). We then asked if the X. laevis sperm genome contains dA6m. To address this, we isolated DNA from different samples and removed all proteins and RNA. We performed dot blots with X. laevis sperm genomic DNA and stained with the dA6m Ab (Fig. 1b). Importantly, we detected a dA6m signal with the dA6m Ab on X. laevis sperm genomic DNA (Fig. 1b and Supplementary Fig. 1d-h). As controls, we used bacterial genomes from deoxyadenosine methylase (Dam) positive (Dam+) and negative (Dam–) bacteria. We detected dA6m not only in Dam+ bacteria, but also in Dam– bacteria (Fig. 1b). The dA6m signal in Dam– bacteria could be explained by the presence of the other deoxyadenosine methylase EcoKI, which maintains some level of dA6m in the genome even in the absence of Dam21,22.

Genomes of higher eukaryotes contain dA6m

To further confirm the results from the dot blot screen, genomic DNA was digested into its individual nucleosides and analyzed by UHPLC-MS/MS (Fig. 1a). As a positive UHPLC-MS/MS reference, we used a synthetic dA6m standard dilution series, as well as a water negative control. dA6m was identified in a given sample only when the retention time as well as its fragmentation pattern both matched the synthetic dA6m standard. Analogous to the dot blot results, dA6m was detected in both Dam+ and Dam– bacteria controls. As expected, the level of dA6m differed between these two bacteria. We encountered a lower level of dA6m in Dam– bacteria in comparison to Dam+ bacteria. Importantly, we did not detect dA6m in our processed negative control, but detected dA6m in the processed DNA isolated from eukaryotic tissues (Fig. 1c and Supplementary Fig. 2a-d). These results substantiate the dot blot approach and strongly support the presence of dA6m in the genome of a higher eukaryotic organism. We next tested if the dA6m Ab can in fact enrich for dA6m. We carried out dA6m Ab DIP on sheared X. laevis DNA. The DNA recovered from the dA6m Ab DIP was then further processed into its individual nucleosides and analyzed by UHPLC-MS/MS. The results validated that the dA6m Ab DIP strongly enriches for the low level of dA6m in higher eukaryotes, namely 14,152 times under the conditions applied (Fig. 1d-e and Supplementary Fig. 2a-c, e-g). To estimate the abundance of dA6m in the higher eukaryotic genome, we used the data obtained from the non-enriched dA6m Ab DIP samples. Our results show that dA6m is found 1 in 84 dA in Dam+ bacteria (1.19%), 1 in 4,215 dA in Dam– (0.02%) bacteria and only 1 in 1,172,141 dA (0.00009%) in higher eukaryotic samples (Fig. 1f). This corresponds to 27,238 dA6m in Dam+ bacteria, 542 dA6m in Dam– bacteria and 1,654 dA6m in one X. laevis genome, or 6,616 dA6m in one X. laevis tetraploid cell. To determine if dA6m is only a feature of X. laevis testes or if it is present in other higher eukaryotes, we extended our dot blot screen to search for the presence of dA6m in other organisms. Our results suggest that dA6m is not only present in various X. laevis tissues, but is also found in all higher eukaryotes we tested, such as in D. rerio, M. musculus and tissue culture cells derived from mouse and humans (Fig. 1g). We decided to focus our studies on X. laevis and used M. musculus to generalize our findings for higher eukaryotes.

Few genes are associated with dA6m

To study the location and distribution of dA6m containing regions across the genome, we generated high throughput sequencing libraries (Seq) from dA6m Ab DIP-enriched and input fractions (dA6m Ab DIP-Seq and input-Seq, respectively). We analyzed the genomes of X. laevis testes, fat and oviduct, and of M. musculus kidney by dA6m Ab DIP-Seq (Fig. 2a-d, Supplementary Table 1). For all X. laevis tissues, we processed 2 biological replicates that were obtained from different animals. In the case of M. musculus, we used 3 biological replicates that were also isolated from different animals. We compared the dA6m Ab DIP-Seq to the corresponding input-Seq controls in order to determine which regions in the genome were enriched by the dA6m Ab, hence, contained the dA6m mark. Based on our dA6m Ab DIP-Seq data, we identified in total 27,374 dA6m peaks in X. laevis testes, 20,160 in oviduct, 47,834 in fat, and 27,374 in M. musculus kidney (Fig. 2d, Supplementary Table 2). In dA6m Ab DIP-Seq experiments, dA6m peaks obtained from different cell types add up, aberrantly increasing the total abundance of dA6m. Therefore, such peak data should be used only to estimate the distribution of dA6m genome wide, rather than to determine the absolute levels of dA6m in the tissue. To determine if our sequencing data is of good quality for subsequent genome-wide analyses, we determined if it is consistent and reproducible.

Figure 2

Genome wide identification of dA6m marks in X. laevis fat, oviduct, testes and M. musculus kidney.

a, dA6m peak signal tracks. One gene region for each tissue type is shown. The y-axis represents the amount of immunoprecipitated DNA at each position normalized by the total number of reads. RefSeq gene annotations are shown. n=2 for each X. laevis tissue and n=3 for M. musculus, biological replicates from different animals, AU=arbitrary units. b, Heat map of Pearson correlation coefficient values from comparisons between X. laevis and M. musculus dA6m DIP-Seq and input-Seq samples. Correlation was calculated pairwise for all samples, excluding windows where both samples in the pair had zero depth. n=2 for each X. laevis tissue and n=3 for M. musculus, biological replicates from different animals, **P<1×10−16, Pearson Correlation test on mapped reads. c, Scatter plots comparing individual samples pairwise for overlapping peaks. The enrichment score ranging from 0–8 is plotted on both axes. The color in each plot reflects the correlation value (Pearson correlation coefficient), which is also shown in the top right corner of each plot. n=2 for each X. laevis tissue and n=3 for M. musculus, biological replicates from different animals, **P<1×10−16, two sided t-test on dA6m peaks. d, Number of unique and overlapping dA6m peaks identified in X. laevis and in M. musculus tissues. n=2 for each X. laevis tissue and n=3 for M. musculus, biological replicates from different animals, **P<1×10−16, χ2-test on dA6m peaks.

By comparing samples to each other using the Pearson Correlation Coefficient and scatter plots, we showed that biological replicates (tissues from different animals) as well as experimental conditions such as pulldown and input correlate with each other more than between different experimental conditions or between different biological sources, corroborating the robustness of our sequencing data (Fig. 2b-c). Further, we determined the number of identified dA6m peaks in individual replicates, and asked how many of these peaks overlap between biological replicates (Fig. 2d). First, the number of peaks identified in biological replicates is similar, which supports reproducibility (Fig. 2d). Second, the overlap between all biological replicates (tissues from different animals) is much higher than one would expect at random (χ2-test, **P-value < 1 × 10−16), strengthening the reproducibility of our approach. Next, we determined the overlap of dA6m peaks between different tissues (Fig. 2d). We took the overlapping peaks between replicates of X. laevis testes, oviduct and fat, and overlapped them between the different tissues. We found that some of them overlapped between all tissues, suggesting that some dA6m peaks are present at the same location in the genome, irrespective of tissue type. This is in particular true for X. laevis oviduct, where 85% of the dA6m peaks identified seem to also be present in at least one of other tissues analyzed, namely fat and testes (Fig. 2d). However, many dA6m peaks seem also to be tissue type specific. For example, 52% (7,207) of all dA6m peaks identified in fat are only found in X. laevis fat, while the remainder 48% (6,620) are also present in testes, in oviduct or in both (Fig. 2d). Overall, we conclude that some of the dA6m peaks are the same in different cell types, but many are different, indicating some degree of cell type specificity.

dA6m is predominantly excluded from coding regions

Despite the high number of dA6m peaks identified, only a small fraction of all genes have a dA6m peak. This was observed in all tissues and in samples from both X. laevis and M. musculus. Between 6.7% and 20.6% of all genes have a dA6m peak, while the rest of the peaks lie in non-genic regions (Supplementary Table 3). In X. laevis, the few genes that are found to be associated with dA6m are strongly linked to pathways such as nucleic acid binding, metabolic processes and transcription, as determined by gene ontology analysis. This was found across all tissues (Supplementary Table 4). In contrast, the genes that are associated with dA6m in M. musculus kidney are linked to different pathways, for example to ion channel activity, cell adhesion and ATP binding (Supplementary Table 4). The different pathways found in M. musculus could either indicate a tissue specific role of dA6m in kidney, or be due to the possibility that dA6m regulates different pathways in M. musculus than it does in X. laevis. When we analyzed the gene regions further, we observed that few dA6m peaks are located in exonic regions. Only 0.1–0.6% of all exons have dA6m peaks. In contrast, dA6m peaks are more frequent within introns. We found that 6.4–17.6% of all genes have dA6m peaks in introns (Supplementary Table 3). This lack of dA6m in exonic regions is in accordance with transcriptional start site (TSS) plots (Fig. 3a, Supplementary Figure 3a). The TSS plots showed a strong decrease of dA6m levels just after the TSS of genes. In addition, occasionally, a small increased abundance of dA6m upstream of TSS was detected, in comparison to the more downstream 3’ region. This TSS plot pattern, where a strong decrease of dA6m level is observed just after the TSS, was encountered in all X. laevis tissues analyzed and also in all M. musculus kidney samples (Fig. 3a, Supplementary Figure 3a). This suggests that the absence of dA6m in coding regions might be a general feature of dA6m in higher eukaryotes.

Figure 3

Genome wide distribution of dA6m in the vicinity of genes.

a, Distribution of dA6m peaks around TSS, from 20kb 5’ to 20kb 3’, identified in X. laevis fat, oviduct, testes and in M. musculus kidney. One biological replicate from one animal is shown in each graph. b, Density of dC5m versus unmethylated dC in distinct areas of the M. musculus testes genome. Only the regions shown in dark grey are statistically significant. One biological replicate from one animal is shown, **P<0.007, binomial test on dA6m peaks. c, Density of dA6m versus unmethylated dA in distinct areas of the genome of X. laevis testes. Only the regions shown in dark grey are statistically significant. One biological replicate from one animal is shown, *P<0.03, binomial test on dA6m peaks.

To obtain a better understanding of the dA6m distribution, we further analyzed the abundance of dA6m in the vicinity of genes. We divided regions that are in the vicinity of genes into different groups, for example those consisting of 1kb areas upstream and downstream of coding genes, and those that distinguish exons and introns. We next determined the ratio of methylated versus non-methylated deoxyadenosines in these regions, based on our DIP-Seq data. As a control, we also analyzed the dC5m distribution in the same way23. Our analysis revealed an enrichment of dC5m in exons (Fig. 3b), but in contrast to this we observed depletion of dA6m marks in exonic regions in all X. laevis testes replicates (Fig. 3c, Supplementary Fig. 3b). Further, this observation was confirmed in all X. laevis and M. musculus tissues analyzed, and in all replicates (Supplementary Fig. 3c-e). This suggests that depletion of dA6m in exonic regions is a distinct feature of this epigenetic modification. To further corroborate our findings, we carried out DIP-Seq on X. laevis testes with 2 other antibodies that are known to recognize dA6m. These are referred to as dA6m Ab* and dA6m Ab**. Importantly, dA6m signals identified were irrespective of the antibody used, excluding an antibody bias. As a control, we used the corresponding input, but also IgG for further validations (Fig. 4, Supplementary Fig. 4, Supplementary Table 1, Supplementary Table 5). Our analysis showed that irrespective of the dA6m recognizing antibody used, and independently of whether we compared our dA6m recognizing antibodies to input or IgG controls, the distribution of dA6m remained the same (Supplementary Fig. 4). In all cases we found that the level of dA6m decreased in exons, strengthening our previous dA6m DIP-Seq results.

Figure 4

Genome wide distribution of dA6m peaks in X. laevis testes samples as determined with three different dA6m recognizing antibodies in DIP-Seq and comparison to input-Seq or IgG-Seq controls.

a, dA6m peak signal tracks. Tracks obtained from dA6m Ab DIP-Seq, dA6m Ab* DIP-Seq, dA6m Ab** DIP-Seq and control input-Seq and IgG-Seq are shown. One gene region for each biological replicate is shown. The y-axis of each profile represents the amount of reads at each position normalized by the total number of reads in a given dataset. RefSeq gene annotations are shown. n=2, biological replicates from different animals, AU=arbitrary units. b, Heat map of Pearson correlation coefficient values from comparisons between X. laevis testes dA6m Ab DIP-Seq, dA6m Ab* DIP-Seq, dA6m Ab** DIP-Seq, IgG DIP-Seq and input-Seq samples. Correlation was calculated pairwise for all samples, excluding windows where both samples in the pair had zero depth. n=2, biological replicates from different animals, **P<1×10−16, Pearson Correlation test on mapped reads. c, Scatter plots comparing individual samples pairwise for overlapping peaks. The enrichment score ranging from 0–8 is plotted on both axes. The color in each plot reflects the correlation value (Pearson correlation coefficient), which is also shown in the top right corner of each plot. n=2, biological replicates from different animals, **P<1×10−16, two sided t-test on dA6m peaks.

Next, we asked if our dA6m peaks are conserved. For this purpose, PhyloP scores across 30 vertebrate species were compared to our dA6m M. musculus data24. This analysis was not possible for X. laevis, as PhyloP data is not available for this species. Using the Top 300 dA6m peak overlaps between the kidney M. musculus replicates, we found that the conservation score means of dA6m enriched regions (0.08) are smaller than and differ significantly (**P-value < 2.2 × 10−16) from the scores when the dA6m enriched regions were shifted by 10kb (0.12). This suggested that although there is some conservation, it was relatively weak. Bearing in mind that most of our peaks were excluded from coding regions, which were considered conserved, it is not too surprising that the dA6m peak regions showed weak conservation. This in fact confirmed our previous observations.

“AG” could be a putative consensus site for dA6m

We then wanted to identify putative dA6m consensus sequence motifs. To verify our approach, we first tested the abundance of any 4bp motifs in bacteria. We have carried out dA6m Ab DIP-Seq and input-Seq experiments on Dam+ and Dam– E. coli genomes and identified dA6m peaks (Supplementary Table 6). We then asked how abundant any 4bp motif is in these peaks. Out of the 256 possible combinations, we found enrichment for the GATC sequence in the dA6m peaks of Dam+ bacteria (Fig. 5a). Importantly, this GATC sequence is the known target recognition sequence of the Dam methylase25. In Dam– bacteria, this GATC motif, as expected, is no longer the most abundant motif encountered in dA6m peaks (Fig. 5a-b). We also applied the MEME prediction program to the bacterial dA6m peaks that were at least enriched by a magnitude of 2 (ref. 26). This analysis confirmed the GATC Dam motif in Dam+, but not in Dam– bacteria (Fig. 5c). This further confirmed the validity of our data and approach to predict the consensus sequence of dA6m. Next, we tried to identify potential dA6m consensus sequences for all our X. laevis and M. musculus samples. Using MEME, we obtained significant putative 8bp consensus motifs for all X. laevis tissues (**E-value < 1.2 × 10−8) (Fig. 5d). Also, forcing a shift of dA6m peaks by 5kb led to an inability of MEME to identify these sequences, showing that these putative motifs were not identified at random (Supplementary Fig. 5). Our analysis was not successful on M. musculus, as it seemed to be embedded in sequences that are repetitive. This made it impossible to reliably predict a motif in M. musculus, even when we tried to remove repeats from the input sequences for MEME analysis. As a result, we decided to focus on X. laevis. Interestingly, overlapping all of the tissues gives the putative dA6m 8bp motif TAGGAAGG (**E-value < 6.7 × 10−141) (Fig. 5d). This sequence was very similar to the ones identified in individual tissues, suggesting that this might be or contain a basic motif that is present in all tissue types. The sequences found by MEME in different tissues were variable enough to make us believe that the core motif was shorter. To determine if we can narrow down the 8bp putative motif to 4bp, we determined, as we have done with bacteria, how abundant any 4bp motif is in our peaks. However, out of the 256 possible combinations, not all are feasible as a potential motif, as the putative motif should at least in part overlap with the 8bp motif that was generated with MEME and has to contain a deoxyadenosine26. We calculated the frequency of all 256 possible 4bp in the peaks, and the noise, namely the frequency one would expect under those peaks when they were shifted (Supplementary Table 7). The ratio of those revealed that TAGG is likely to be a potential motif, as it is in top 4 most enriched sequences in all overlapped peaks. The other most enriched sequences are not applicable, as they are not found by MEME to be statistically significant (E-value > 0.05). Interestingly, we did not identify the bacterial motif GATC, also confirming we did not have bacterial contamination in our eukaryotic datasets. Although MEME and our 4 base predictions showed that the TAGG sequence is enriched under our peaks, we are hesitant to claim this is a consensus motif. However, all our results point towards part of it being a consensus. We therefore postulate that AG, with the major fraction being TAG, forms part of the motif. However, further elaborate experimental evidence is required to determine if TAG or AG is in fact a bona fide consensus motif in which dA6m is found.

Figure 5

dA6m motif identification in bacteria and X. laevis.

a, Abundance of 4bp motif in dA6m peaks. 256 potential 4bp motifs, each representing one column along x-axis, were ranked for their abundance in Dam+ and Dam– bacterial dA6m peaks. The Dam recognition motif GATC is shown in red. n=1 biological replicate from one bacterial culture for Dam+ and Dam- bacteria, Spearman rank correlation coefficient=0.94 between Dam+ and Dam– bacteria, Spearman’s **ρ=2.2×10−16. b, Ratio between Dam+ and Dam– bacteria enriched 4bp motifs. Only motifs that are at least 5% enriched are illustrated. c, Bacterial dA6m motif identified by MEME. In Dam+ bacteria, the motif GATC has been identified (73 out of 73 gene regions identified by MEME). n=1 biological replicate from one bacterial culture for Dam+ and Dam- bacteria, AU=arbitrary units. d, X. laevis dA6m motif identification by MEME. Overlaps between biological replicates from different animals were used for analysis. Same tissue n=2 biological replicates from different animals, different tissue overlaps n=6 biological replicates for different animals, **E-value<1.2×10−8, statistics by MEME, AU=arbitrary units.

Discussion

Epigenetic modifications can cause changes to the genome without altering the DNA sequence. These are known to occur on histones, RNA and DNA. Most of the epigenetic modifications studied to date are those of histone and RNA modifications. Both molecules can bind to specific DNA sequences and subsequently change the accessibility of that region, but do not directly modify the DNA itself. Up to date, only dC5m has been studied extensively in higher eukaryotes, which directly affects the DNA itself2–4. Although intermediate forms of dC5m, such as 5-formylcytosine and 5-carboxylcytosine have been discovered and are increasingly being studied, not much is known in higher eukaryotes about modifications affecting other deoxynucleotides27,28. We found it surprising that so little attention has been given to direct epigenetic modifications. In order to determine if there are in fact no other modifications, we used dA6m as an example and discovered that the higher eukaryotic genome is more diverse than previously thought. dA6m is a modification found in bacterial DNA and affects gene expression and virulence6,10. However, its presence in higher eukaryotes has been debated12,13. Its identification was likely constrained by the low abundance of this modification14–16. With technological advancements, the detection limits improved, allowing us to directly identify dA6m in the genome of higher eukaryotes. Very recently, other reports were published making observations similar to ours, using the same approach17–19. However, ours was developed independently, before the recent publications concerning this modification were published. We have discovered dA6m in higher eukaryotic organisms, while the recent publications reported the presence of dA6m in the genome of the algae Chlamydomonas, in the insect Drosophila and in the nematode C. elegans17–19. In agreement to previous work, we find that dA6m is a low abundant modification, even less abundant in higher eukaryotes than in other organisms. For example, in Drosophila, the frequency of the dA6m /dA ratio varies between 0.07–0.001%, in C. elegans between 0.01–0.4%, and in Chlamydomonas it is 0.4% (refs. 17–19). In the organisms that we investigated, dA6m was substantially less abundant, namely 0.00009%. The reason for the difference in abundance of dA6m among these organisms is unclear, and may be ascribed to inherent differences in genome organization and epigenetic regulators. The high abundance of dA6m in C. elegans might also be explained by the fact that these animals were fed with Dam– bacteria that still contained dA6m. Indeed, the genome of these bacteria still possess residual dA6m due to the presence of the other known deoxyadenosine methylase EcoKI, which we have also confirmed by UHPLC-MS/MS121,22. This might have interfered with the determination of total dA6m levels, as well as any functional tests performed in the presence of these bacteria. When comparing the genome wide distributions of dA6m in M. musculus and X. laevis to the other organisms, we encountered a pattern different from what we saw in the in higher eukaryotes. Indeed, we found that dA6m is absent from areas downstream of TSS and from exons in mouse and frog genomes. In C. elegans, no appreciable distinct pattern near genes is observed19. In contrast, in Drosophila and Chlamydomonas genomes, dA6m is enriched at or following TSS sites17,18. This is the opposite of what we found in mouse and frogs. This different pattern of dA6m suggests that this modification may have distinct roles across eukaryotes. However, any functions of dA6m in higher eukaryotes remain to be investigated. Key aspects of this investigation are the identification of epigenetic modifiers that deposit (methylase) and remove (demethylases) the modification, and of possible dA6m interacting proteins. The latter aspect is of particular importance since dA6m might serve as a DNA anchor for regulatory proteins to bind, which could then trigger various downstream pathways and regulate gene transcription. Also, the presence of dA6m could ultimately cause different chromatin landscapes, influence nucleosome positioning, or insulate different DNA regions from each other. Overall, our findings suggest that direct epigenetic modifications might be more widespread than previously thought in higher eukaryotes. RNA, a molecule that is built up from similar molecules as DNA, is known to have more than 60 modifications in eukaryotes, and when including different organisms, the number is greater than 110 (ref. 5). Due to the strong similarity between the DNA and RNA building blocks, we have shown that DNA is much more diverse than has been previously believed. Overall, we believe it is very unlikely that DNA is so simplistic while RNA is so diverse. Hence, we hypothesize that many of such ‘apegenetic’ (from Greek apeftheias, meaning direct) modifications exist. In future, this ‘apegenome’ remains to be discovered and its function further investigated.

Online Methods

Genomic DNA isolation

All X. laevis and M. musculus tissues were directly isolated from sacrificed vertebrates. This was done following all provisions and ethical regulations of the Animal (Scientific Procedures) Act 1986, while having licenses and approval from the Home Office and the Local Ethical Committee (AWERB). No statistical method was used to predetermine sample size. The experiments were not randomized and were not performed with blinding to the conditions of the experiments. After homogenization of the samples (Precellys 24) and addition of phenol chloroform, and the aqueous phase was precipitated with isopropanol and sodium acetate. After 2 washes with 70% EtOH, the DNA was digested with RNAse A for at least 16hrs at 37°C. The DNA was subsequently treated with Proteinase K and purified using the DNeasy Blood & Tissue Kit (Qiagen). To ensure removal of any RNAse, the DNA was again digested with RNAse A and Proteinase K, and repeatedly extracted with the DNeasy Blood & Tissue Kit. The concentration of the genomic DNA was measured using the Qubit double stranded High Sensitivity assay kit.

DNA immunoprecipitation (DIP)

DNA immunoprecipitation was prepared using the protocol from Dominissini et al., with the following variations30: After the isolation of genomic DNA, at least 20μg DNA was fractionated into 100–200bp fragments using the bioruptor (Diagenode). About 1μg or the fractionated DNA was put aside as an input control. The rest of the fragmented genomic DNA was resuspended in a 1ml final reaction volume containing 10mM Tris-HCl, 150mM NaCl, 0.1% (v/v) Igepal CA-630 and 1.5ug/ul BSA. To this, at least 2.5μg of the dA6m Ab (Synaptic Systems GmbH, m6A antibody, Cat. No. 202003) was added. Its function was validated as described in Figure 1 and Supplementary Figure 1. Alternatively, the following antibodies were used: dA6m Ab* (Synaptic Systems GmbH, m6A antibody, Cat. No. 202011), dA6m Ab** (Synaptic Systems GmbH, m6A antibody, Cat. No. 202111), IgG (Abcam, Cat. No. ab171870). Validation of these antibodies is provided on the manufacturer’s website, and has been supplemented by our findings in Figure 4. Please note, species specific validation of these antibodies is not required as the antibodies were only exposed to DNA, and their target is dA6m, which is identical between different species. After an overnight incubation at 4°C on a rotor, 100μl of prewashed protein A magnetic beads were added for 2 hrs at 4°C. Next, the supernatant was removed while Ab bound to the beads was retained using a magnet. After 5 washes with the washing buffer (10mM Tris-HCl, 150mM NaCl, 0.1% (v/v) Igepal CA-630), the DNA bound Ab fraction was eluted. For almost all applications, the DNA was eluted from the Ab and beads with 150μl of elution buffer that contained 10mM Tris-HCl, 150mM NaCl, 0.1% (v/v) Igepal CA-630, 6.7mM methylated adenosine triphophate). The mixture was then incubated for 1hr with continuous shaking at 37°C. The supernatant was removed, and then another 150μl of elution buffer was added to remove any unbound remaining fraction. Next, the supernatants were combined, precipitated, and used for subsequent analysis.

Dot blot

The desired amount of genomic DNA, in most cases 25ng per sample, was diluted in 100μl of 0.5mM EDTA (pH8.0), 7.4% formaldehyde and 6xSSC. Next, the samples were incubated for 30min at 60°C, and then kept on ice. A Nylon + membrane was soaked in distilled water, and then in 10xSSC. The membrane was transferred into a pre-cleaned dot blot filtration apparatus, and was placed on top of 3 Whatman Paper sheets. While the apparatus was under vacuum pressure, first, the membrane was rehydrated with 10xSSC, and then the DNA containing samples were applied into individual wells. After the samples were fully absorbed in the membrane, the wells were washed with 10xSSC. The apparatus was dismantled, and the membrane was then removed while the apparatus was still under vacuum. After drying in air for about 10min, the membrane was crosslinked at 302nm with UV and blocked for 1hr in 5% nonfat dry milk and 0.1% PBST (0.1% Tween-20 in 1xPBS, pH7.4). Subsequently, the antibody dA6m (Synaptic Systems GmbH) was diluted to 1:1000 in 0.1% PBST and incubated overnight at 4°C. Following 3 washes with 0.1% PBST, a fluorescent secondary antibody was applied for 30min at room temperature (Alexa Fluor Ab, Invitrogen). After further 3 washes with 0.1% PBST, the fluorescent signal was visualized and quantified. All samples that were processed by dot blots were done in triplicate (technical replicates), as well as biological replicates. Original images of blots used presented in the main figures can be found in Supplementary Data Set 1.

Dot blot competition experiments

The genomic DNA was applied and cross-linked to the membrane as described above. However, the dA6m Ab, used at the same dilution of 1:1000, was pre-incubated with different competitors and different competitor concentrations for 3hrs before being applied onto the sample dotted onto the membrane. Dot blots have been cut in order to expose the individual samples to different competitor concentrations, but all samples were processed in parallel. The subsequent steps are the same as for the normal dot blot procedure described above.

Image analysis

Antibody stained blots were visualized with the LI-COR Odyssey CLx. The images were acquired and quantified with the Image Studio Ver 4.0 software.

DNA oligos

Synthetic oligos were used on dot blots and pulldown studies. The sequence of the 25bp DNA oligo with no dA6m is 5’ AGTCGTTCATCTAGTTGCGGTGTAC 3’. The sequence of the 25bp DNA oligo with dA6m is 5’ AGTCGTTCATCT(dA6m) GTTGCGGTGTAC 3’. The sequence of the 110bp DNA oligo with no dA6m is 5’ TGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCC 3’. The sequence of the 110bp DNA oligo with dA6m is 5’ TGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCG (dA6m) GCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCC 3’.

Strain genotypes

The Dam– bacteria are a K12 strain, and in addition to lacking the Dam methylase, they are also deficient in deoxycytosine methylation (Dcm–). These strains were obtained from NEB (dam–/dcm– Competent E.coli), and have been authenticated by UHPLC-MS/MS and dot blots (Figure 1, Supplementary Figure 2). The genotype is provided on the manufacturer’s website. The Dam+ bacteria are a DH10B strain (Invitrogen). The genotype is provided on the manufacturer’s website, and the strain has been authenticated by UHPLC-MS/MS and dot blots (Figure 1, Supplementary Figure 2). X. laevis fat and oviduct samples were obtained from female adults. X. laevis testes were isolated from adult males. All X. laevis were pigmented, and purchased from eNASCO. M. musculus kidneys were obtained directly from adult wild type males, with the strain C57B6. All mouse cell lines used in the dot blot experiments come originally from the C57B6 strain, have been identified and tested for mycoplasma contamination by Q-PCR. The human cell line 293T has been obtained from ATCC (ATCC CRL-3216), has been identified and tested for mycoplasma contamination by Q-PCR.

Sample preparation for UHPLC-MS/MS analysis

AGenomic DNA to be analyzed by UHPLC-MS/MS was diluted in a volume of 250μl water. Such samples were then denatured by heating them at 100°C for 5min and immediately placing them on ice. 20μl of 20mM ZnSO4 and 10μl of the nuclease P1 (200units/ml in 30mM sodium acetate, pH5.3) were added, in order to digest any DNA strands into individual nucleotides. After an overnight incubation at 50°C, 180μl of water, and 1μl of bacterial alkaline phosphatase (BAP, 150U/ul) were added. After a 24hr incubation at 37°C, 1ul of the BAP was added again, and the samples were incubated at 65°C for another hour. Next, 30μl of 0.5M Tris-HCl (pH7.9) was added, and the phosphatase reaction was continued for another hour at 37°C. Next, 400μl of water was added, together with silicic acid that filled the 1.7ml Eppendorf tube to about 200μl. After incubation for 15min with occasional vortexing, the sample mixture was transferred onto 0.45μm cellulose acetate filters. After centrifugation the silicic acid granules were removed. The flow through was then analyzed by UHPLC-MS/MS.

LC-MS/MS analysis

Analysis of global levels of dA and dA6m was performed on a Q Exactive Orbitrap mass spectrometer coupled to a Dionex UltiMate 3000 Rapid Separation LC fitted with an Acquity UHPLC HSS T3 column (100 x 2.1 mm, 1.8 µm particle size). The mobile phase consisted of 0.1% aqueous formic acid (solvent A) and 0.1% formic acid in acetonitrile (solvent B) at a flow rate of 300 µl/min. Calibration curves were generated using serial dilutions of synthetic standards for deoxyadenosine (dA, Sigma) and N6-methyl-2’-deoxyadenosine (dA6m, Sigma). The mass spectrometer was set in a positive ion mode and operated in parallel reaction monitoring. Ions of masses 252.11 (dA) and 266.12 (dA6m) were fragmented and full scans were acquired for the base fragments 136.0618 and 150.0774 ± 5ppm (adenine and methyladenine, respectively). The EIC of the base fragment was used for quantification. Accurate mass of the corresponding base-fragment was extracted using the XCalibur Qual Browser and XCalibur Quan Browser software (Thermo Scientific), and used for quantification. Quantification was performed by comparison with the corresponding standard curve obtained from the pure nucleoside standards running with the same batch of samples. The level of dA6m present in the sample was expressed as a percentage of total adenosine content (methylated and non-methylated), calculated according to the following equation: (%) dA6m = 100 x dA6m /[dA+dA6m]. Differences in dA6m percent abundance were considered significant when P ≤ 0.05.

Illumina sequencing library preparation

High throughput sequencing libraries were prepared with different genomic DNA samples following the protocol described by Ford et al.29. All libraries were sequenced using Illumina HiSeq 2000 / 2500, single end, 50bp. At least two biological replicates (tissues from different animals) were performed for each experiment, except for bacterial control samples, where only one replicate was carried out. Each experiment consisted of one dA6m DIP and its corresponding input and occasionally also IgG sample (Supplementary Table 1).

Bioinformatics analysis

Genome alignment for frog, mouse and bacteria genomes: Fastq files were filtered for low quality reads ( Annotation and gene set enrichment: X. laevis and M. musculus sequences were annotated using InterProScan to provide both InterPro Domains and Panther ontology terms34,35. Descriptions for the remaining NCBI sourced sequences were downloaded from the NCBI. Gene set enrichment was obtained using Panther GO Slim terms (7.2) with topGO (Bioconductor package version 2.6.0. http://www.bioconductor.org/packages/release/bioc/html/topGo.html). Peak calling: PCR duplicates were removed from the aligned datasets and peaks were called using SICER, comparing dA6m pull-down over input or IgG (parameterisation: redundancy threshold = 1, window size = 200, fragment size = 200, effective genome fraction = 0.74 (0.89 X. laevis), gap size = 400, FDR = 0.05)36. X. laevis genome filtering: Due to the lower quality of this genome assembly, the following filtering steps were performed to increase the accuracy of both mapping and motif analysis. Repeat masker was used to remove any residual repeats (RepeatModeler Open-1.0. 2008-2010 ). Sequences of low quality (represented as lower case) were masked. Uninformative sequences were removed using DUST and homopolymers of more than 4 bases were removed37. Replicate overlaps: The overlaps between replicate peaks were detected using the R bioconductor library “Genomic Features->findOverlaps” (ref. 38). For statistical purposes, one set of peaks in each pairwise comparison was randomly redistributed (shuffled) around the genome. This was repeated 100 times and the mean number of intersected peaks was taken. These shuffled peaks were then compared to the non-shuffled pair resulting in the number of overlaps. This number was then compared to the number of overlaps generated from the original (both pairs un-shuffled) set of peaks using a χ2-square test. In all cases, the resulting P-value was less than 10−16. Motif analysis with MEME: Motifs were called using MEME on the sequences below the peaks26. For E.coli peaks with at least 2 times enrichment were analyzed, while for larger genomes, namely M. musculus and X. laevis, the top 300 enriched peaks were analyzed by MEME. For M. musculus the data was mapped to a filtered version of mm9 where known uninformative sequences were removed using Repeat Masker (RepeatModeler Open-1.0. 2008-2010 ) for repeats and DUST for regions of low complexity37. Homopolymers of more than 4 bases were also removed from the genome. For X. laevis the filtered genome as described above was used. MEME was run on the sequences of the peaks from overlapping replicates or tissues, and on the sequences of the peaks obtained from individual replicates. Statistically significant (E-values < 0.05) 8bp motifs were only found in the overlapping data between replicates and tissues, and only in X. laevis. MEME was run using the ZOOPS model generating different motifs26. Motifs were then called using shifted peaks with the same size and distribution to build a background probability of occurrence for the observed motif. 4bp motif analysis: The frequency of all possible four base pair combinations was calculated for sequences of regions under peaks and not under peaks. For all combinations, differences between these two frequencies were then ranked by the respective ratios. TSS plots: The distance to each TSS from the midpoint of all reads within 20 kb of that TSS was measured. The distances were pooled over all TSS locations and plotted in a histogram with 200 bp bins. Scatter plots: For all pairwise intersections of peaks, a scatterplot was generated where enrichment values were plotted for each pair of overlapping peaks. Pearson correlations were also calculated over all enrichment values and t-tests were performed, yielding values less than 10−16 in all cases. Conservation analysis: Conservation scores were downloaded from UCSC phyloP conservation tracks for mouse (mm9)24. For each base position under a peak the P score was taken and the overall score was calculated. For each set of peak calls, a conservation score distribution was generated by extracting the phyloP for each position in the peak ranges (n>50 in all cases, so Central Limit Theorem applies). This was compared between peaks shifted by 10kb and the original peaks and the original peaks were shown by a t-test to have a significantly higher level of conservation (P-value < 2.2 ×10−16 in all cases). Determination of dA6m peak distribution in gene regions: The midpoint of peak locations was classified into regions of the genome including exonic, intronic, intergenic and around the TSS and transcriptional termination sites. The data was normalized to abundance in genome. The peak counts in these regions were modeled by a Poisson distribution assuming, under the null hypothesis, that the incidence rate in each was equal to that of the whole genome average. The probability of the observed counts, given this distribution, was calculated for each region to ascertain whether peak rates were significantly different to the whole genomic background.

Statistical analysis (of dot blots, mass spec and Ab pulldown yield quantifications)

Statistical differences, P-values, were calculated using the two-sided t-test for paired samples. For calculations of P-values between samples in which one sample had only zero as values, the one-sided t-test was applied. All experiments were carried out with 3 technical and biological replicates, indicated by n. All P-values ≤ 0.05 are formatted as *P, while all P-values ≤ 0.005 are indicated as **P.

Venn diagrams

Venn diagrams were drawn with the help of eulerAPE v3 (ref. 39).

39 in total

1. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

2. The quantitative separation of purines, pyrimidines, and nucleosides by paper chromatography.

Authors: R D HOTCHKISS
Journal: J Biol Chem Date: 1948-08 Impact factor: 5.157

3. N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas.

Authors: Ye Fu; Guan-Zheng Luo; Kai Chen; Xin Deng; Miao Yu; Dali Han; Ziyang Hao; Jianzhao Liu; Xingyu Lu; Louis C Dore; Xiaocheng Weng; Quanjiang Ji; Laurens Mets; Chuan He
Journal: Cell Date: 2015-04-30 Impact factor: 41.582

4. N6-methyladenine DNA modification in Drosophila.

Authors: Guoqiang Zhang; Hua Huang; Di Liu; Ying Cheng; Xiaoling Liu; Wenxin Zhang; Ruichuan Yin; Dapeng Zhang; Peng Zhang; Jianzhao Liu; Chaoyi Li; Baodong Liu; Yuewan Luo; Yuanxiang Zhu; Ning Zhang; Shunmin He; Chuan He; Hailin Wang; Dahua Chen
Journal: Cell Date: 2015-04-30 Impact factor: 41.582

5. The discovery of 5-formylcytosine in embryonic stem cell DNA.

Authors: Toni Pfaffeneder; Benjamin Hackner; Matthias Truss; Martin Münzel; Markus Müller; Christian A Deiml; Christian Hagemeier; Thomas Carell
Journal: Angew Chem Int Ed Engl Date: 2011-06-30 Impact factor: 15.336

6. Rare bases in animal DNA.

Authors: B F Vanyushin; S G Tkacheva; A N Belozersky
Journal: Nature Date: 1970-03-07 Impact factor: 49.962

7. Characterization of antibodies specific for N6-methyladenosine and for 7-methylguanosine.

Authors: T W Munns; M K Liszewski; H F Sims
Journal: Biochemistry Date: 1977-05-17 Impact factor: 3.162

Review 8. The fundamental role of epigenetic events in cancer.

Authors: Peter A Jones; Stephen B Baylin
Journal: Nat Rev Genet Date: 2002-06 Impact factor: 53.242

9. Biomethylation of deoxyribonucleic acid in cultured human tumour cells (HeLa). Methylated bases other than 5-methylcytosine not detected.

Authors: P D Lawley; A R Crathorn; S A Shah; B A Smith
Journal: Biochem J Date: 1972-06 Impact factor: 3.857

10. The RNA Modification Database, RNAMDB: 2011 update.

Authors: William A Cantara; Pamela F Crain; Jef Rozenski; James A McCloskey; Kimberly A Harris; Xiaonong Zhang; Franck A P Vendeix; Daniele Fabris; Paul F Agris
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

79 in total

Review 1. Diverse and dynamic DNA modifications in brain and diseases.

Authors: Matthew J Armstrong; Yulin Jin; Emily G Allen; Peng Jin
Journal: Hum Mol Genet Date: 2019-11-21 Impact factor: 6.150

Review 2. Translational Perspective on Epigenetics in Cardiovascular Disease.

Authors: Pim van der Harst; Leon J de Windt; John C Chambers
Journal: J Am Coll Cardiol Date: 2017-08-01 Impact factor: 24.094

3. A reassessment of DNA-immunoprecipitation-based genomic profiling.

Authors: Antonio Lentini; Cathrine Lagerwall; Svante Vikingsson; Heidi K Mjoseng; Karolos Douvlataniotis; Hartmut Vogt; Henrik Green; Richard R Meehan; Mikael Benson; Colm E Nestor
Journal: Nat Methods Date: 2018-06-25 Impact factor: 28.547

4. Epigenetics: An elusive DNA base in mammals.

Authors: Gerd P Pfeifer
Journal: Nature Date: 2016-03-30 Impact factor: 49.962

5. Identification of a DNA N6-Adenine Methyltransferase Complex and Its Impact on Chromatin Organization.

Authors: Leslie Y Beh; Galia T Debelouchina; Derek M Clay; Robert E Thompson; Kelsi A Lindblad; Elizabeth R Hutton; John R Bracht; Robert P Sebra; Tom W Muir; Laura F Landweber
Journal: Cell Date: 2019-05-16 Impact factor: 41.582

Review 6. N6-Methyladenine: A Conserved and Dynamic DNA Mark.

Authors: Zach Klapholz O'Brown; Eric Lieberman Greer
Journal: Adv Exp Med Biol Date: 2016 Impact factor: 2.622

Review 7. N(6)-Methyladenine in eukaryotes.

Authors: Myles H Alderman; Andrew Z Xiao
Journal: Cell Mol Life Sci Date: 2019-05-29 Impact factor: 9.261

8. i6mA-DNCP: Computational Identification of DNA N⁶-Methyladenine Sites in the Rice Genome Using Optimized Dinucleotide-Based Features.

Authors: Liang Kong; Lichao Zhang
Journal: Genes (Basel) Date: 2019-10-20 Impact factor: 4.096

Review 9. Detecting and interpreting DNA methylation marks.

Authors: Ren Ren; John R Horton; Xing Zhang; Robert M Blumenthal; Xiaodong Cheng
Journal: Curr Opin Struct Biol Date: 2018-07-19 Impact factor: 6.809

10. Active N⁶-Methyladenine Demethylation by DMAD Regulates Gene Expression by Coordinating with Polycomb Protein in Neurons.

Authors: Bing Yao; Yujing Li; Zhiqin Wang; Li Chen; Mickael Poidevin; Can Zhang; Li Lin; Feng Wang; Han Bao; Bin Jiao; Junghwa Lim; Ying Cheng; Luoxiu Huang; Brittany Lynn Phillips; Tianlei Xu; Ranhui Duan; Kenneth H Moberg; Hao Wu; Peng Jin
Journal: Mol Cell Date: 2018-08-02 Impact factor: 17.970