Literature DB >> 25374580

Methylated DNA is over-represented in whole-genome bisulfite sequencing data.

Lexiang Ji¹, Takahiko Sasaki², Xiaoxiao Sun³, Ping Ma³, Zachary A Lewis², Robert J Schmitz⁴.

Abstract

The development of whole-genome bisulfite sequencing (WGBS) has resulted in a number of exciting discoveries about the role of DNA methylation leading to a plethora of novel testable hypotheses. Methods for constructing sodium bisulfite-converted and amplified libraries have recently advanced to the point that the bottleneck for experiments that use WGBS has shifted to data analysis and interpretation. Here we present empirical evidence for an over-representation of reads from methylated DNA in WGBS. This enrichment for methylated DNA is exacerbated by higher cycles of PCR and is influenced by the type of uracil-insensitive DNA polymerase used for amplifying the sequencing library. Future efforts to computationally correct for this enrichment bias will be essential to increasing the accuracy of determining methylation levels for individual cytosines. It is especially critical for studies that seek to accurately quantify DNA methylation levels in populations that may segregate for allelic DNA methylation states.

Entities: Chemical Disease Gene Mutation Species

Keywords: DNA methylation; Epigenomics; PCR bias; epigenetics; whole genome bisulfite sequencing

Year: 2014 PMID： 25374580 PMCID： PMC4204604 DOI： 10.3389/fgene.2014.00341

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

INTRODUCTION

The ability to produce genome-wide data sets has radically changed during the last decade. With the advent of high-throughput sequencing, billions of short read data can be generated within days. Currently, one of the primary uses of these short reads is to explore genetic variation within the human population with the goal of linking variants to diseases. Identifying these causal associations is the first major step to understanding how diseases arise, how they manifest and how they can be treated. Additionally, these short reads are used to study a wide range of model systems and an ever-growing list of non-model systems. Although much of the energy in the scientific community focuses on genetic variants there are a growing number of scientists that are fascinated by the epigenome. The NIH Roadmap Epigenomics Mapping Consortium defines the epigenome as a map of features distributed throughout and on top of the genome (Bernstein et al., 2010). These features can include RNAs such as mRNAs, long non-coding RNAs and small RNAs, DNA-protein interactions, chromatin accessibility, nucleosome positioning, covalent modifications to histone tails and DNA methylation, to name a few. Understandably, refinement of epigenomic techniques lagged behind development of genomic DNA library preparation due to their demand for use and due to the difficulty in execution. Part of the reason for this sudden increase and interest in epigenomics is due to the standardization of protocols and analysis pipelines used to produce and interpret data. No longer is data generation the rate-limiting step. Instead, the ability to accurately analyze and interpret these data sets has become the major bottleneck in the daily workflow. One of the most exciting technical advances in field of epigenomics was the development of methods to analyze DNA methylation at single-base resolution (Cokus et al., 2008; Lister et al., 2008). Generally referred to as whole-genome bisulfite sequencing (WGBS), these methods combine sodium bisulfite conversion (Hayatsu et al., 1970; Shapiro et al., 1973, 1974; Hayatsu, 1976; Frommer et al., 1992; Clark et al., 1994, 2006), the gold standard for determining DNA methylation states at individual cytosines, with high-throughput DNA sequencing. Using this technique, unmethylated cytosines are converted into uracils and then into thymines after PCR. The observation of cytosines in sequencing reads typically indicates that the cytosine was methylated, as methylcytosines are protected from conversion by treatment with sodium bisulfite. These methods are capable of testing approximately 90% of all cytosines in genomes studied to date. They also generate quantitative measurements of DNA methylation at each cytosine due to the multiple independent reads that align to a particular sequence. These methods were originally tested on Arabidopsis thaliana genomic DNA (Cokus et al., 2008; Lister et al., 2008) as this species has a relatively small genome size (∼150 Mb) compared to human genomes (∼3 Gb per haploid genome), which made testing and troubleshooting cost efficient. Since the first single base resolution DNA methylomes for A. thaliana surfaced, larger DNA methylomes have been sequenced such as humans (Lister et al., 2009), mice (Stadler et al., 2011), corn (Gent et al., 2013), and soybean (Schmitz et al., 2013a), but in general it is cost prohibitive to sequence large numbers of individual samples using this technique. As a result, reduced representation bisulfite sequencing (RRBS) was developed in which ∼1% of the genome is studied making it possible to investigate large numbers of individuals at the cost of surveying the entire genome (Meissner et al., 2005). Most recently, additional modifications to cytosines have been identified such as 5-hydroxymethylation, 5-formylcytosine, and 5-carboxylcytosine and methodologies have been developed to study the presence of some of these base modifications genome-wide (Booth et al., 2012, 2014; Yu et al., 2012). The initial genome-wide maps of DNA methylation were invaluable for defining previously unexpected epigenomic signatures in a variety of systems, such as partially methylated domains and non-CG methylation in animals (Lister et al., 2009) and CHH islands in maize (Gent et al., 2013). Since these initial maps, WGBS has been used to study epigenome reprogramming of induced pluripotent stem cells (Lister et al., 2011), genomic imprinting in both plants (Gehring et al., 2009; Hsieh et al., 2009; Zemach et al., 2010; Calarco et al., 2012) and animals (Xie et al., 2012; Kobayashi et al., 2013; Shirane et al., 2013), germ line epigenome reprogramming in plants (Slotkin et al., 2009; Calarco et al., 2012; Ibarra et al., 2012; Creasey et al., 2014) and animals (Seisenberger et al., 2012; Jiang et al., 2013) and also to study the impact of genetic variation on DNA methylation (Schmitz et al., 2013a,b; Orozco et al., 2014). Clearly, mapping and analyses of DNA methylomes is a relatively young field with great potential for exciting discoveries in the next decade. It is well established that there is a GC content bias in amplification of fragments of DNA used for high-throughput sequencing applications (Aird et al., 2011; Benjamini and Speed, 2012). As a result, methods were developed to avoid PCR amplification all together in the construction of DNA sequencing libraries (Kozarewa et al., 2009). However, this approach is not applicable to bisulfite-treated DNA as the uracils in the fragments would inhibit cluster formation on the instrument unless reagents could be customized to include a uracil-insensitive DNA polymerase. As a result, bisulfite-treated DNA is amplified for varying cycles of PCR to replace uracils in the DNA sequence and to amplify libraries as treatment with sodium bisulfite degrades DNA. The degradation of DNA by sodium bisulfite also creates limitations with regards to the types of samples that can be sequenced, as higher input amounts are often required for most protocols. Moreover, after bisulfite conversion, previously complementary strands are single stranded and will no longer anneal together. In fact, previous observations of bisulfite-conversion and PCR have uncovered strand biases that severely affects estimates of DNA methylation levels (Warnecke et al., 1997). Correction methods have been developed for targeted bisulfite PCR experiments, but these methods are not feasible for WGBS (Komori et al., 2011; Moskalev et al., 2011). Given that methylated DNA will retain higher GC content after bisulfite conversion and PCR compared to unmethylated DNA, over-representation of methylated DNA may occur in the construction of sequencing libraries. Although libraries can be constructed from lower input DNA samples such as those prepared from FFPE (formalin-fixed paraffin-embedded), sequence-capture or single cell approaches, there usually is a cost of increasing the total PCR cycles. This cost likely comes in the form of increasing any bias in read enrichment between methylated and unmethylated sequences at the same locus. In this study, we explore the presence of an amplification bias created in fragments that retain higher GC content after sodium-bisulfite treatment with unmethylated sequences and with genomic DNA sequences. A strong positive correlation between DNA methylation levels and normalized read counts are observed in the WGBS data. Investigation of three different commercially available uracil-insensitive enzymes reveal differences in amplification bias of highly methylated and high GC content DNA sequences.

MATERIALS AND METHODS

PLANT AND FUNGAL MATERIAL

The A. thaliana Col-0 accession was used for all. Plants were grown in 16 h day lengths and tissue was harvested from above ground rosette leaves 3 weeks after planting. DNA was isolated using the Qiagen (Valencia, CA, cat-69106, USA) DNeasy Plant Kit according to the manufacturer’s protocol. All Neurospora experiments were performed with the wild-type “Oak Ridge” strain (FGSC# 2489) obtained from the Fungal Genetics Stock Center (McCluskey et al., 2010). ∼1 × 106 conidia/ml were germinated in 50ml of Vogel’s Minimal Medium with 1.5% sucrose and grown for 5 h.

METHYLC-SEQUENCING LIBRARY PREPARATION

MethylC-seq libraries were prepared according to the following protocol (Urich et al., 2014), with the some modifications. Three different enzyme mixtures were used to amplify bisulfite-converted adapter ligated DNA: Kapa HiFi Uracil+ (cat-KK2802, Kapa Biosystems), Pfu Turbo Cx Hotstart DNA polymerase (Agilent Technologies, Santa Clara, CA, cat-600410, USA), and EpiMark (New England Biolabs., Ipswich, MA, cat-M0490S, USA). PCR cycle number varied from 4, 8, or 15 cycles as described in the main text.

PREPARATION AND SEQUENCING OF Neurospora DNA

Chromatin immunoprecipitation (ChIP) and DNA isolation were performed as described previously (Sasaki et al., 2014). For Illumina sequencing, libraries were prepared from 10 ng of ChIP input DNA using an Illumina TruSeq kit (Illumina cat-FC-121-2002). Libraries were prepared according to manufacturer instructions except for the following modifications. DNA adaptors were diluted 1:100 prior to ligation. PCR was performed for 4, 8, or 15 cycles, as indicated in the text. In some experiments, a PCR reaction mixture containing Kapa HiFi polymerase (Kapa Biosystems, cat-KK2502), 1X Kapa HiFi buffer, 200 nM dNTPs, 1X TruSeq primer cocktail, and 60 mM tetramethylammonium chloride (TMAC; Sigma-Aldrich, catalog # T3411) was used in place of the TruSeq PCR master mix [as described in (Oyola et al., 2012)]. Illumina sequencing was performed using an Illumina Next-Seq500 instrument at the University of Georgia genomics facility.

METHYLC-SEQ DATA ANALYSIS

FASTQ files were trimmed for adapters, preprocessed to remove low quality reads and aligned to the TAIR10 reference genome as previously described in (Schmitz et al., 2013b). Inefficiencies in the sodium bisulfite conversion reaction are calculated by measuring the fraction of methylated basecalls detected in the chloroplast genome (which is unmethylated). This non-conversion rate is used as the null hypothesis for a binomial test to determine if a cytosine is methylated and the resulting p-values are corrected for multiple testing using Benjamini–Hochberg with an FDR cut off of 5%.

CALCULATION OF METHYLATION LEVELS

Weighted methylation levels were performed as described in (Schultz et al., 2012).

CALCULATION OF CORRELATION COEFFICIENTS

Pearson correlation was calculated. In Figure , the outliers were identified using Bonferroni test (Cook and Weisberg, 1982) and then Pearson correlation was calculated after the outliers were removed.

CALCULATION OF NORMALIZED READ COVERAGE

For Figures normalized read coverage was calculated by the total number of reads in the DMR (Differentially Methylated Region) divided by the total number of reads sequenced from the library. This value was then multiplied by 105. For Figure , normalized read coverage was calculated as follows: genome sequencing reads from previously published A. thaliana Col-0 data (Becker et al., 2011) were used to determine a baseline level of mappability by the DNA that had not undergone sodium bisulfite treatment. In this case, all cytosines were converted to thymines in the FASTQ files and then the genome sequencing data set was used in the same analysis pipeline as the one used to map and determine methylation levels for sodium bisulfite-treated libraries. A. thaliana mutation accumulation line 1 replicate 2 was selected as bisulfite sequencing sample (Schmitz et al., 2011). In each 10 kb window, read numbers were counted and then divided by the number of mapped reads in the first tile to normalize them to the same scale. Normalized read coverage was then computed by subtracting the normalized number of reads present in the genomic DNA sample from the normalized number of reads in the bisulfite sequencing sample. 5% outliers were identified, as described in the Calculation of correlation coefficients section, and removed before computing linear regression (Cook and Weisberg, 1982). For Figure , the window size was set to 100 kb, the enrichment of mapped reads for each tile was computed using the same method as described for 1 G. For Figures , the methylation levels from Figure were used to set a fixed horizontal axis. Therefore, data points were only able to move in a vertical direction depending on normalized read coverage. This method was used because higher read coverage resulted in higher methylation levels, which made it challenging to accurately reveal and compare the trends in each figure. For Figure , to calculate normalized read coverage for Neurospora, the genome was divided into 1 kb windows using bedtools, and a gtf annotation file was constructed using custom scripts. Cuffdiff software (Trapnell et al., 2013) and the custom annotation file were used to calculate the enrichment ratios at each 1 kb window. %GC was determined using a custom script.

DATA ACCESS

The data generated for this study have been deposited in the Gene Expression Omnibus[1] and are accessible through accession number GSE58217. For WGBS data in Figures and data were downloaded from the National Center for Biotechnology Information (NCBI) – Sequence Read Archive (SRA)[2] and for genomic DNA sequence used in Figure data were downloaded from (). Methylated DNA is enriched in Examples of differentially methylated regions (DMRs) that contain identical DNA sequences. (D–F) Highly methylated regions contain an abundance of bisulfite sequencing reads compared to lowly methylated sequences. (G) Highly methylated DNA is positively correlated with bisulfite sequencing read counts. The names of the samples in (A–C) indicate lineage numbers as previously reported Shaw et al. (2000). Pink shaded regions indicate DMRs in (A–C) and the location of data used to produce (D–F). In (A–C), gold lines = methylated CGs, blue lines = methylated CHGs (whereH = A, C, or T) and pink lines = methylated CHHs. In (D–F), black dots = highly methylated alleles, red dots = lowly methylatedalleles. Data used to produce Figure were obtained from line1 replicate 2 from (Schmitz et al., 2011). Methylated DNA has a skewed GC content after sodium bisulfite conversion of DNA and this leads to an enrichment of methylated reads in bisulfite sequencing. (A) GC content beforebisulfite conversion, after bisulfite conversion, computed full conversion and the absolute difference in GC content between unconverted and bisulfite-treated DNA. (B) WGBS reads are positively correlated and enriched with highly methylated regions of the chromosome compared to alignment of sequencing reads from genomic DNA libraries. The black line and the corresponding y-axis on the left depict the total level of cytosine methylation in all combined contexts (CG, CHG, or CHH). The orange line and the corresponding y-axis on the right depict normalized read coverage. WGBS data used to produce this figure were obtained from line 1 replicate 2 from (Schmitz et al., 2011). Genomic DNA sequencing data used to produce this figure were obtained from sample 119 in (Becker et al., 2011).

RESULTS AND DISCUSSIONS

METHYLATED DNA IS OVER-REPRESENTED IN BISULFITE SEQUENCING DATA

Our previous studies that applied MethylC-seq (a WBGS method) on a population of A. thaliana MA lines (Shaw et al., 2000) revealed the existence of spontaneous epialleles (Schmitz et al., 2011; Figures ). These differentially methylated regions (DMRs) of the population have identical underlying DNA sequences as revealed by whole genome sequencing (Ossowski et al., 2010), but vary significantly in DNA methylation levels (Figures ). Further examination of the underlying data revealed that reads from highly methylated DMRs were over-represented compared to unmethylated regions after bisulfite conversion and DNA sequencing (Figures ). In total, eight different MA lines were sequenced in duplicate for a total of 16 samples. In every case, there was a positive correlation between the methylation level of the DMR and the normalized number of aligned sequenced reads (Figures ). Similarly, a chromosome-wide test revealed a strong positive correlation between DNA methylation levels and normalized (aligned) read counts from WGBS data (Figure ; R2 = 0.55).

GC CONTENT INFLUENCES ENRICHMENT OF HIGHLY METHYLATED DNA

Although the sequences within DMRs are identical, GC contents were significantly different after sodium bisulfite conversion. Therefore, the possibility exists that the underlying GC content likely influences the enrichment bias of methylated DNA. To further explore this question, the GC content throughout chromosome five was determined prior to and post sodium bisulfite conversion (Figure ). Post bisulfite treatment, the GC content is significantly reduced by approximately 50%, which is expected as most of the cytosines are unmethylated and are converted to uracils and ultimately thymines after PCR. The lowest percentage reduction in GC content is observed in the pericentromeric region of the chromosome (Figure ) where the highest density of methylation has been observed (Figure ; Cokus et al., 2008; Lister et al., 2008). To empirically test if sodium bisulfite-treated and amplified DNA is preferentially biased toward highly methylated reads, both WBGS and genomic DNA sequencing data from wild-type Col-0 were aligned using the same parameters. The read enrichment was calculated and plotted along chromosome five, which revealed an enrichment of WGBS reads compared to genomic DNA sequencing reads that paralleled DNA methylation levels along the chromosome (Figure ). Collectively, these data suggest that highly methylated sequences, which retain high GC content after bisulfite conversion and PCR amplification, are preferentially enriched in WGBS data compared to unmethylated sequences.

NUMBER OF PCR CYCLES INFLUENCES ENRICHMENT OF HIGHLY METHYLATED DNA

The introduction of altered GC content occurs during bisulfite treatment and subsequent PCR amplification. Bisulfite-converted DNA is rich with uracils, which impairs the ability of most DNA polymerases to efficiently amplify these fragments. The B-type DNA polymerase from Pyrococcus furiosus is able to polymerize uracil-rich DNA because it binds to deaminated bases (Horvath and Vertessy, 2010). This enzyme was engineered to polymerize uracil-rich DNA by mutating a single amino acid that blocks recognition of the deaminated bases (Horvath and Vertessy, 2010). These types of engineered uracil-tolerant enzymes are critical for single-base resolution studies of DNA methylation and as a result, a number of commercially available enzymes have been engineered to amplify uracil-containing templates. We performed a test on the Pfu Turbo Cx enzyme with biological replicates (Supplementary Table 1) using three different cycles (4, 8, and 15 cycles) of PCR (Figure ) to assess the degree to which a uracil-insensitive enzyme is prone to an amplification bias of methylated DNA. A comparison between 4, 8, and 15 PCR cycles revealed that 15 cycles results in an exacerbation in the preferential enrichment of methylated DNA (Figures ). Polymerase chain reaction (PCR) cycle number influences enrichment bias. (A) Diagram of the experimental design for testing the effect PCR cycle number on representation of methylated DNA. Normalized read counts versus methylation levels of 10 kb fragments from chromosome five for (B) 4 cycles, (C) 8 cycles, and (D) 15 cycles of PCR amplification using Pfu Turbo Cx. (E–G) Normalized read numbers for each 10 kb window were compared between different PCR cycle numbers and correlation scores were determined as indicated in the upper right hand of the plot. All R2 values reported reflect Pearson correlation coefficients.

COMPARATIVE ANALYSIS OF COMMERCIALLY AVAILABLE URACIL-INSENSITIVE ENZYMES

Based on the results from Figure , an experiment was designed to test two additional commonly used commercially available enzymes in biological replicates (Supplementary Table 1) for their ability to amplify bisulfite-converted DNA (Kapa HiFi Uracil + and EpiMark). Upon constructing the MethylC-seq libraries the concentration of the samples were determined using a Qubit. This revealed that the total library yield for the 4-cycle/8-cycle PCR was similar between the three enzymes tested, but for the 15-cycle PCR, the KAPA HiFi Uracil + enzyme yielded significantly more total library (Table ). Regardless, the total amount of library yielded in all experiments is more than adequate to sequence to sufficient depth. Next, the methylation levels were plotted against normalized read counts and as observed with the Pfu Turbo Cx enzyme in Figures and , the other two tested enzymes also preferentially enrich methylated DNA, although to a lesser extent (Figures and ). However, both of these enzymes are sensitive to highly methylated sequences as the normalized read numbers started to decrease with higher methylation levels compared to intermediate methylation levels (Figures and ). Interestingly, between the 4, 8, and 15 cycle experiments for the Kapa HiFi Uracil + enzyme, extremely high GC content regions did not result in an increase in amplification bias of DNA like it does with both the Pfu Turbo Cx and EpiMark enzymes (Figures ), which would indicate this enzyme is likely preferred for most WGBS applications. Interestingly, the EpiMark enzyme revealed that at 15 PCR cycles a strong bias was associated with increasing GC content, but not with increasing levels of DNA methylation compared to the 4- and 8-cycle reactions (Figures ). Therefore, both PCR cycles and enzymes selected for amplification can influence enrichment of highly methylated DNA in sequencing data. Sample name, cycle number and total library yield for each WGBS library are reported. Polymerase chain reaction amplification bias observed with Kapa HiFi Uracil +. Normalized read counts versus methylation levels for 10 kb windows from chromosome five for (A–C) Kapa HiFi Uracil + from 4-, 8-, and 15-cycles of PCR. (D–F) Normalized read numbers for each 10 kb window were compared between different PCR cycle numbers and correlation scores were determined as indicated in the upper right hand of the plot. All R2 values reported reflect Pearson correlation coefficients. Polymerase chain reaction amplification bias observed with NEB EpiMark. Normalized read counts versus methylation levels for 10 kb windows from chromosome five for (A–C) NEB EpiMark from 4-, 8-, and 15-cycles of PCR. (D–F) Normalized read numbers for each 10 kb window were compared between different PCR cycle numbers and correlation scores were determined as indicated in the upper right hand of the plot. All R2 values reported reflect Pearson correlation coefficients.

AT-RICH SEQUENCES ARE DEPLETED FROM DNA SEQUENCING LIBRARIES

ChIP-seq experiments carried out with the filamentous fungus Neurospora crassa also revealed a bias against AT-rich DNA sequences, demonstrating reduced coverage at AT-rich DNA is not restricted to uracil-tolerant polymerases. Neurospora heterochromatin domains have an average GC content of 30% compared to an average GC content of 51.5% for genes (Figure ). ChIP input DNA was used to prepare Illumina sequencing libraries using two different polymerases and PCR was performed for 4, 8, and 15 cycles. Libraries prepared using a TruSeq PCR master-mix revealed equal representation of GC-rich and AT-rich regions after 4 and 8 cycles of PCR, but a clear amplification bias was observed after 15 cycles of PCR (Figures ). It was reported previously that the Kapa HiFi polymerase, a distinct enzyme from the Kapa HiFi Uracil + polymerase used throughout the rest of the manuscript, efficiently amplified AT-rich DNA when supplemented with 60 mM TMAC (Oyola et al., 2012). However, our experiments with Neurospora DNA revealed a significant underrepresentation of AT-rich DNA after only 8 cycles of PCR using the KAPA HiFi DNA polymerase + TMAC (Figures ). Therefore, the DNA polymerase used for PCR amplification and the number of PCR cycles selected can affect amplification of genomic DNA that contains large disparities in GC content. Polymerase chain reaction amplification bias in Enrichment of tri-methyl H3 K9 (H3K9me3; Sasaki et al., 2014) and GC-content (%GC) are plotted to indicate the position of AT-rich heterochromatin domains. Data are shown for libraries prepared with Illumina PCR master mix (A) or Kapa HiFi polymerase + TMAC (B) and amplified for 4, 8, and 15 cycles, as indicated. The positions of genes are plotted beneath the coverage tracks. (C–D) Normalized read coverage and GC-content was calculated for 1 kb windows across the entire Neurospora genome for libraries prepared using different conditions. Log2 values obtained by comparing normalized read coverage after 4 and 8 cycles (top) or 4 and 15 cycles (bottom) are plotted on the y-axis. %GC is plotted on the x-axis. Results for TruSeq master mix and Kapa HiFi polymerase are shown in panels (C,D), respectively.

CONCLUSION

Whole-genome bisulfite sequencing is an excellent method for determining base resolution DNA methylomes and although in most genomes studied to date, it has been able to determine methylation levels for >90% of cytosines, it does have some limitations. In this study, we show that there exists a strong amplification bias of methylated DNA compared to unmethylated sequences and that this bias can be affected by PCR cycle number and the enzyme used for PCR amplification. Interestingly, this change in amplification bias is not observed with the Kapa HiFi Uracil + enzyme in comparison to the others tested. This observation suggests that the high PCR cycle number that is necessary to construct a sequenceable library for low input samples such as RRBS (Gu et al., 2011), single cells, sequence-capture or FFPE samples can be achieved without introducing major artifacts from the number of PCR cycles selected. Therefore, although an amplification bias will occur at these highly methylated sequences, in these types of libraries, it will not grow in strength to the point that it depletes lower methylated sequences from produced sequencing data. This effect is not true for all enzymes used in amplification of bisulfite converted DNA as the Pfu Turbo Cx enzyme resulted in a significant over-enrichment of reads associated with methylated DNA with increasing cycle number. Furthermore, even tests on genomic DNA from Neurospora revealed significant variation between different commercially available enzymes and selected PCR cycle number. To improve the quality and to reduce to bias associated with WGBS data it is recommended to start higher concentrations of genomic DNA when possible, limit the number of PCR cycles and optimize the polymerase used for amplification of the library. In general, it is difficult to precisely quantify the effect of this observed bias in WGBS data on percent methylation calculations. As we show here, bias depends on polymerase choice and PCR cycle number. We expect that other factors can also affect bias, not only the methylation level, which itself can be affected by biological and technical variation, but also factors like raw sequence read length, sequencing errors, efficiency of sodium bisulfite conversion, sequence structure itself, etc. In order to develop computational tools that correct for bias, additional experiments are needed to define how these factors contribute to the observed bias in WGBS data. We note that the bias introduced using current protocols is likely to influence DMR analysis. In particular, loci that are heterozygous for methylation state could be underrepresented by DMR finding algorithms. The sodium bisulfite conversion and PCR amplification steps are impossible to remove from WGBS library preparation and these are the steps where a bias is introduced. In fact, efforts to correct for amplification bias in targeted bisulfite sequencing of genomic regions have been successfully performed by using a calibration method that requires a parallel analysis of input samples with known methylation levels (Moskalev et al., 2011). Additionally, RainDance Technology has also proved useful for correction of amplification bias in targeted bisulfite PCR (Komori et al., 2011), but it is unknown if this correction will be observed on WGBS libraries. Regardless, both of these methods are not feasible alternatives for routine correction of the observed amplification bias in WGBS, as they are technically cumbersome compared to current methods and they are cost prohibitive. This bias will be most pronounced in the study of allele-specific methylation. The results of this study make it clear that PCR enzyme and the PCR amplification cycle used can influence determined methylation levels, which undoubtedly could affect results of previously published studies that were not aware of these biases. Fortunately, most experiments that use this approach can be corrected for this bias by using SNP data in the bisulfite-sequencing reads. Caution should be taken when calculating methylation levels at loci with heterozygous methylation states, especially when experiments study regions that do not have SNPs or do not take into consideration SNPs linked to bisulfite-sequencing reads. Caution should also be taken when using platforms that do not detect SNPs. For example, many microarray platforms have been designed to detect DNA methylation states and although PCR is not generally used to amplify bisulfite-converted DNA, the use of Whole Genome Amplification (WGA) methods will still introduce a bias based on the GC content of DNA sequences (Bredel et al., 2005; Arriola et al., 2007). Because WGBS is primarily used on species with published reference genomes, it may be possible to predict this amplification bias based on GC content and the range of potential methylation levels at these regions. Therefore, future efforts to computationally model and predict expected read coverage for regions will be advantageous for studies where SNP information is not available or present at specific regions of interest.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Table 1

Sample name, cycle number and total library yield for each WGBS library are reported.

Sample name	Cycle number	Total library yield (ng/ul) replicate 1	Total library yield (ng/ul) replicate 2	Correlation coefficient between biological replicates
Kapa HiFi U+	4	0.44	0.38	0.9167678
Pfu Turbo Cx	4	0.49	0.39	0.5800957
NEB EpiMark	4	0.59	0.25	0.8791214
Kapa HiFi U+	8	2.32	1.63	0.980213
Pfu Turbo Cx	8	1.45	1.03	0.8921665
NEB EpiMark	8	2.64	1.89	0.9589284
Kapa HiFi U+	15	80.8	62.4	0.9677855
Pfu Turbo Cx	15	40.2	36.1	0.9730408
NEB EpiMark	15	52.6	39.2	0.8796196

53 in total

1. DNA-binding factors shape the mouse methylome at distal regulatory regions.

Authors: Michael B Stadler; Rabih Murr; Lukas Burger; Robert Ivanek; Florian Lienert; Anne Schöler; Erik van Nimwegen; Christiane Wirbelauer; Edward J Oakeley; Dimos Gaidatzis; Vijay K Tiwari; Dirk Schübeler
Journal: Nature Date: 2011-12-14 Impact factor: 49.962

2. Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome.

Authors: Miao Yu; Gary C Hon; Keith E Szulwach; Chun-Xiao Song; Liang Zhang; Audrey Kim; Xuekun Li; Qing Dai; Yin Shen; Beomseok Park; Jung-Hyun Min; Peng Jin; Bing Ren; Chuan He
Journal: Cell Date: 2012-05-17 Impact factor: 41.582

3. Local DNA hypomethylation activates genes in rice endosperm.

Authors: Assaf Zemach; M Yvonne Kim; Pedro Silva; Jessica A Rodrigues; Bradley Dotson; Matthew D Brooks; Daniel Zilberman
Journal: Proc Natl Acad Sci U S A Date: 2010-10-11 Impact factor: 11.205

4. Application of microdroplet PCR for large-scale targeted bisulfite sequencing.

Authors: H Kiyomi Komori; Sarah A LaMere; Ali Torkamani; G Traver Hart; Steve Kotsopoulos; Jason Warner; Michael L Samuels; Jeff Olson; Steven R Head; Phillip Ordoukhanian; Pauline L Lee; Darren R Link; Daniel R Salomon
Journal: Genome Res Date: 2011-07-14 Impact factor: 9.043

5. Transgenerational epigenetic instability is a source of novel methylation variants.

Authors: Robert J Schmitz; Matthew D Schultz; Mathew G Lewsey; Ronan C O'Malley; Mark A Urich; Ondrej Libiger; Nicholas J Schork; Joseph R Ecker
Journal: Science Date: 2011-09-15 Impact factor: 47.728

6. DNA methylation: bisulphite modification and analysis.

Authors: Susan J Clark; Aaron Statham; Clare Stirzaker; Peter L Molloy; Marianne Frommer
Journal: Nat Protoc Date: 2006 Impact factor: 13.491

7. The NIH Roadmap Epigenomics Mapping Consortium.

Authors: Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

8. High sensitivity mapping of methylated cytosines.

Authors: S J Clark; J Harrison; C L Paul; M Frommer
Journal: Nucleic Acids Res Date: 1994-08-11 Impact factor: 16.971

9. Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation.

Authors: Edurne Arriola; Maryou B K Lambros; Chris Jones; Tim Dexter; Alan Mackay; David S P Tan; Narinder Tamber; Kerry Fenwick; Alan Ashworth; Mitch Dowsett; Jorge S Reis-Filho
Journal: Lab Invest Date: 2007-01 Impact factor: 5.662

10. Correction of PCR-bias in quantitative DNA methylation studies by means of cubic polynomial regression.

Authors: Evgeny A Moskalev; Mikhail G Zavgorodnij; Svetlana P Majorova; Ivan A Vorobjev; Pouria Jandaghi; Irina V Bure; Jörg D Hoheisel
Journal: Nucleic Acids Res Date: 2011-04-12 Impact factor: 16.971

29 in total

Review 1. Crop Epigenomics: Identifying, Unlocking, and Harnessing Cryptic Variation in Crop Genomes.

Authors: Lexiang Ji; Drexel A Neumann; Robert J Schmitz
Journal: Mol Plant Date: 2015-01-29 Impact factor: 13.164

2. Chromatin analyses of Zymoseptoria tritici: Methods for chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq).

Authors: Jessica L Soyer; Mareike Möller; Klaas Schotanus; Lanelle R Connolly; Jonathan M Galazka; Michael Freitag; Eva H Stukenbrock
Journal: Fungal Genet Biol Date: 2015-04-07 Impact factor: 3.495

3. Statistical challenges in analyzing methylation and long-range chromosomal interaction data.

Authors: Zhaohui Qin; Ben Li; Karen N Conneely; Hao Wu; Ming Hu; Deepak Ayyala; Yongseok Park; Victor X Jin; Fangyuan Zhang; Han Zhang; Li Li; Shili Lin
Journal: Stat Biosci Date: 2016-03-07

4. Analysis of Bisulfite Sequencing Data Using Bismark and DMRcaller to Identify Differentially Methylated Regions.

Authors: HueyTyng Lee
Journal: Methods Mol Biol Date: 2022

5. Single-molecule mitochondrial DNA sequencing shows no evidence of CpG methylation in human cells and tissues.

Authors: Iacopo Bicci; Claudia Calabrese; Zoe J Golder; Aurora Gomez-Duran; Patrick F Chinnery
Journal: Nucleic Acids Res Date: 2021-12-16 Impact factor: 16.971

6. Genome-wide redistribution of H3K27me3 is linked to genotoxic stress and defective growth.

Authors: Evelina Y Basenko; Takahiko Sasaki; Lexiang Ji; Cameron J Prybol; Rachel M Burckhardt; Robert J Schmitz; Zachary A Lewis
Journal: Proc Natl Acad Sci U S A Date: 2015-11-02 Impact factor: 11.205

Review 7. Analysis and Performance Assessment of the Whole Genome Bisulfite Sequencing Data Workflow: Currently Available Tools and a Practical Guide to Advance DNA Methylation Studies.

Authors: Ting Gong; Heather Borgard; Zao Zhang; Shaoqiu Chen; Zitong Gao; Youping Deng
Journal: Small Methods Date: 2022-01-22