Literature DB >> 29046582

Maximizing ecological and evolutionary insight in bisulfite sequencing data sets.

Amanda J Lea^1,2, Tauras P Vilgalys³, Paul A P Durst⁴, Jenny Tung^5,6,7,8.

Abstract

Genome-scale bisulfite sequencing approaches have opened the door to ecological and evolutionary studies of DNA methylation in many organisms. These approaches can be powerful. However, they introduce new methodological and statistical considerations, some of which are particularly relevant to non-model systems. Here, we highlight how these considerations influence a study's power to link methylation variation with a predictor variable of interest. Relative to current practice, we argue that sample sizes will need to increase to provide robust insights. We also provide recommendations for overcoming common challenges and an R Shiny app to aid in study design.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Sulfites
hydrogen sulfite

Year: 2017 PMID： 29046582 PMCID： PMC5656403 DOI： 10.1038/s41559-017-0229-0

Source DB: PubMed Journal: Nat Ecol Evol ISSN： 2397-334X Impact factor: 15.460

Introduction

DNA methylation – the covalent addition of methyl groups to cytosine bases – is a gene regulatory mechanism of well-established importance in development, disease, and the response to environmental conditions[1-5]. In addition, shifts in DNA methylation are thought to contribute to the speciation process and the evolution of trait differences between taxa[6-8], in support of the idea that gene regulation plays a key role in evolutionary change. Because of its contribution to phenotypic diversity, interest in DNA methylation from the ecology and evolutionary biology communities is high[4,5,9-16]. This interest has been further encouraged by the development of sodium bisulfite sequencing, a cost-effective approach that allows researchers to measure genome-wide DNA methylation levels at base-pair resolution in essentially any organism[17-19]. Approaches that rely on sodium bisulfite treatment of DNA followed by high-throughput sequencing produce what are collectively called “bisulfite sequencing (BS) data sets.” These data sets have properties (discussed in the following section) that differ in key ways from other common types of sequencing-based functional genomic data, such as RNA-seq data. Consequently, several statistical approaches have been developed that are specifically tailored to BS data sets[20-23] (Box 1). However, the development, application, and evaluation of these tools has primarily focused on biomedical questions or model systems, with an emphasis on case-control studies and experimental manipulations in a restricted set of species[24-27]. In contrast, ecologists and evolutionary biologists often study non-model organisms, environmental gradients that do not follow a case-control design, and natural populations characterized by complex kin or population structure. They are also typically more limited in their ability to sample pure cell types, and may be interested in effects that are smaller than those reported in the context of major perturbations like cancer or pathogen infection[28,29]. Notably, all of these properties can affect statistical power for differential methylation analysis (the identification of site or region-specific associations between DNA methylation levels and a predictor variable of interest), one of the most common uses of BS data. Our goal in this review is to outline methodological considerations for differential methylation analysis of BS data sets. We tailor our discussion specifically to concerns that commonly arise in ecological and evolutionary studies and that, except where noted, are generalizable across taxa. We first consider how high-throughput BS data are generated, and how this process leads to several idiosyncrasies that must be taken into account during analysis. Next, we identify four properties common to ecological and evolutionary data sets that can influence power: moderate effect sizes, kinship/population structure, taxonomic differences in DNA methylation patterns, and cell type heterogeneity. We analyze both simulated and published empirical data sets to demonstrate how these four features can affect the power and biological interpretation of differential methylation analysis. We also discuss the advantages and disadvantages of conducting differential methylation analyses on individual CpG sites versus larger genomic intervals. Finally, we provide recommendations for handling each issue, with the aim of facilitating robust, well-powered studies of DNA methylation’s role in ecological and evolutionary processes.

Properties of bisulfite sequencing data sets

High-throughput BS protocols, such as whole genome bisulfite sequencing (WGBS[19]) or reduced representation bisulfite sequencing (RRBS[17]), rely on the differential sensitivity of methylated versus unmethylated cytosines to the chemical sodium bisulfite (Figure 1). Specifically, treatment of DNA with sodium bisulfite converts unmethylated cytosines to uracil (replicated as thymine after PCR) but leaves methylated cytosines unchanged (in vertebrates, most DNA methylation occurs at cytosines in CG motifs, while, in other taxa, cytosines in CHG and CHH are also commonly methylated[13,30,31]). DNA methylation level estimates at a given site can thus be obtained via high-throughput sequencing of bisulfite converted DNA, by comparing the relative count of reads that contain a cytosine (C), which reflect an originally methylated DNA base, to the count of reads that contain a thymine (T), which reflect an originally unmethylated version of the same base. Current BS protocols require low amounts of DNA, avoid the use of species-specific arrays, and can be applied to organisms without a reference genome[32], making them an increasingly popular choice for ecologists and evolutionary biologists[33].

Figure 1

Overview of reduced representation bisulfite sequencing (RRBS; left) and whole genome bisulfite sequencing (WGBS; right)

(A) Steps to prepare an RRBS or WGBS library from genomic DNA. Black lollipops: methylated CpG sites; open lollipops: unmethylated CpG sites. Bases introduced during library preparation due to end repair or A-tailing are colored red; unmethylated cytosines that can be used to estimate conversion efficiency are underlined and marked with an asterisk. RRBS fragments start and end with the Msp1 digest sites (CCGG) flanking the initial piece of genomic DNA. (B) Read pileups after mapping RRBS and WGBS libraries to a reference (red asterisks=Msp1 digestion sites). Reads from RRBS libraries cover a small fraction of the genome. Further, because genomic DNA is fragmented with Msp1 and then size selected, all retained fragments should start and end with an Msp1 recognition site and be enriched for CpG sites. Sequencing reads that are shorter than the original fragment length will localize to the Msp1 recognition site associated with either the 5’ or 3’ end of the original fragment. (C) Bar charts compare the proportion of measured CpG sites that fall in gene bodies (between the TSS and the TES), promoters (2 kb upstream of the TSS), CpG islands, and regions far from genes (>100 kb from any annotated TSS or TES) in simulated RRBS and WGBS experiments given the same sequencing effort (20 million reads; read depths commonly used in WGBS studies typically exceed those of RRBS studies, however).

High-throughput BS data sets have a number of unique properties that influence both study design and data analysis. First, the raw data are binomially distributed count data, in which both the number of methylated reads (unconverted “C” bases) and the total read depth (number of methylated “C” bases plus unmethylated “T” bases) at each site contain useful information[34,35] (note that in real data sets, these count data are usually over-dispersed due to biological variability[20-22]). For example, a site where 5 of 10 reads are methylated and a site where 50 of 100 reads are methylated both have estimated methylation levels of 50%. However, confidence in the methylation level estimate is higher for the second site, where total read depth is much greater. Information about relative confidence can be leveraged by modeling the raw count data rather than transforming counts to proportions or percentages, and several software packages now implement beta-binomial or binomial mixed effects models that do so[20-22,36] (Box 1). These approaches provide a more powerful alternative to tests that assume continuously varying percentages or proportions (e.g., t-tests, Mann-Whitney U tests, linear models). They also control for count overdispersion, a known property of BS data that violates the assumptions of commonly used, but extremely false positive-prone[20,36], binomial models. Retaining read depth information during analysis relates to a second property of BS data: often, some samples have low read depth or missing data at a CpG site where other samples have high read depth (especially in RRBS data sets, where read coverage is affected by the sample-specific efficiency and specificity of the restriction enzyme digest: Figure 1, Supplementary Figure 1). Unlike RNA-seq data sets where read depth variation within a sample captures biological information (i.e., once normalized, lower read counts indicate lower expression levels), within-sample read depth variance in BS data sets is purely technical. Both read depth and effective sample size will thus vary across sites in the same data set, and will often do so systematically across different regions of the genome (particularly in RRBS data sets, due to variation in CpG density: Supplementary Figure 1). Finally, the efficacy of the bisulfite conversion step can vary across samples or groups of samples prepared together, creating global batch effects. Though conversion efficiency is typically high (>98% of unmethylated cytosines converted to thymine[37-39]), small differences in conversion efficiency can have significant effects on genome-wide estimates of DNA methylation levels (see Fig S2 for an example from[37]). In particular, samples with low conversion efficiencies will tend to have upwardly biased estimates of DNA methylation levels relative to samples with higher conversion efficiencies, because fewer unmethylated Cs were converted to Ts. Thus, sample-specific bisulfite conversion rates should be directly estimated and taken into account (e.g., as a model covariate) in downstream analyses. However, we do not recommend estimating site-specific conversion rates, as these estimates are highly dependent on sequencing depth (because conversion occurs prior to sequencing, any observed relationship between sequencing depth and bisulfite conversion rate only reflects estimation error; Supplementary Figure 2). Estimates of sample-specific conversion rates can be obtained using CpG sites in the constitutively unmethylated chloroplast genome in plants[13,19,40], an unmethylated DNA spike in (e.g., lambda phage DNA[37-39]), CHH and CHG sites (in species or cell types where CHH and CHG methylation is rare[41,42]), or the unmethylated cytosines added during RRBS library construction[43] (Figure 1A). Empirical comparisons in a baboon RRBS data set[36] suggests that spike-ins, CHH/CHG, and RRBS read end estimates roughly agree, but CHH/CHG estimates tend to be underestimated relative to the other methods and spike-ins seem to best capture a sample prep-related batch effect (Supplementary Figure 2).

Effect sizes in ecological and evolutionary studies

A primary determinant of power in differential methylation analysis is the distribution of true effect sizes. However, it is not obvious what the distributions of effect sizes for questions of ecological and evolutionary interest are likely to be. While effect size distributions and power analyses have been published for human disease case-control studies[24-26], comparable information is not readily available for most other settings. Small or moderate epigenetic changes may still impact gene expression levels and consequently be of interest[44,45]; however, they will require larger sample sizes to detect. To aid researchers in choosing appropriate sample sizes, we estimated effect sizes in BS data sets from plants, hymenopteran insects, and mammals that address a range of ecological and evolutionary questions, including: (i) developmental and demographic effects (eusocial insect caste differentiation[41]; age[37]); (ii) ecological effects (resource availability, including both large differences[46] and more modest ones[37]); (iii) genetic effects (cis-acting methylation quantitative trait loci[47]); and (iv) species differences[48,49] (Table 1). For comparison, we also include a data set contrasting cancer cells with normal tissue from the same donors[28], which produces some of the largest effect sizes for differential methylation observed to date.

Table 1

RRBS and WGBS data sets reanalyzed in this study

Species	Predictor ofinterest	Contrast	Method	Citation
Dog (Canis lupus familiaris) and wolf (Canis lupus)	Species differences	dog versus wolf	RRBS	[49]
Human (Homo sapiens)	Gestational famine during the Dutch Hunger Winter	famine-exposed versus same sex unexposed sibling	RRBS	[46]
Yellow baboon (Papio cynocephalus)	Age	continuous age values	RRBS	[37]
Yellow baboon	Resource base	wild-feeding versus human refuse-supplemented	RRBS	[37]
Clonal raider ant (Cerapachys biroi)	Caste	reproductive phase versus brood care phase	WGBS	[41]
Human	Cancer status	normal versus colorectal tumor samples (paired)	WGBS	[28]
Human, orangutan (Pongo abelii), gorilla (Gorilla gorilla), and chimpanzee (Pan troglodytes)	Species differences	human versus other great apes	WGBS	[48]
Mouseear cress (Arabidopsis thaliana)	Local genetic variation	nearby (putatively cis-acting) genotype	WGBS	[47]

We first reanalyzed each data set using a uniform analysis pipeline (Supplementary Materials) and estimated two measures of effect size: (i) the mean difference in methylation levels between groups of samples, for binary comparisons (Figure 2A) and (ii) the proportion of variance explained by the variable of interest (Supplementary Figure 3). This analysis provides an empirical picture of how effect size distributions vary across differential methylation analyses. For example, local genetic variants tend to have large effects on DNA methylation levels, while environmental effects are consistently more modest (Figure 2A; Supplementary Figure 3). To understand how these differences impact power, we simulated BS data sets across a range of typical effect sizes and estimated the sample size required to identify differentially methylated sites in each case. All simulations presented in the main text assume that 10% of the sites in each dataset are true positives, but results from parallel analyses with varying proportions of true positives are shown in Supplementary Figure 4.

Figure 2

Estimates of effect sizes and their impact on the power of differential methylation analysis

(A) The maximum percent difference in mean DNA methylation levels between two sample groups (y-axis), for selected percentiles of sites (x-axis, ranked from smallest to largest percent difference) in reanalyzed data sets (Table 1) with binary predictor variables. Mean differences are based on raw values, without correction for covariates. We show the largest percentiles here because those effects are most likely to be detected and of interest. (B) Power to detect differentially methylated sites at a 5% FDR in simulated RRBS datasets (sample size is plotted on a log scale). The magnitude of the effect of interest on DNA methylation levels (x-axis) is represented as the proportion of variance explained. (C) In a small ant data set (n=8), site-by-site analyses are underpowered to detect differential methylation between reproductive phase (blue dots) versus brood care phase (yellow dots) individuals, but PCA separates samples by caste (t-test for PC 3, which explains 21.7% of the overall variance: p = 0.022). In 1000 permutations, only 8.8% of permutations separate as cleanly on any of the first five PCs, suggesting that the original analysis was power-limited. Whiskers on boxplots represent the values for the third and first quartiles, plus or minus 1.5× the interquartile range, respectively. (D) The sample size and mean read depth combinations required to achieve 25% power (i.e., detect 25% true positives) in simulated RRBS datasets, for 3 different effect sizes. Increases in read depth do not affect power beyond ~20× coverage, and sample size or effect size increases always increase power more (Supplementary Figure 5).

Our simulations suggest that answering many ecological and evolutionary questions will require sample sizes that exceed those used in most current studies (Figure 2B; Supplementary Table 1). For example, to identify sites where the predictor variable explains 15% of the variance in DNA methylation levels (a mean difference between sample groups of 13–14% in our simulations) with 50% power requires an estimated 125 samples (250 samples for 80% power and 500 samples for 95% power). To accommodate the costs of larger sample sizes, we recommend choosing a reduced representation or capture-based approach rather than WGBS, and/or reducing per sample read depth. Indeed, consistent with results from a previous study[25], we find almost no benefit to power after sequencing beyond a moderate read depth (~15–20x); in contrast, adding samples always increases power (Figure 2D; Supplementary Figure 5). In all cases, we strongly recommend against pooling DNA samples from multiple individuals into a single library, as this approach reduces power by collapsing the number of biological replicates. Global analysis approaches that test for patterns in an entire data set, such as principal components analysis (PCA) or hierarchical clustering, may also be helpful in analyzing low powered data sets. These approaches are particularly useful when a predictor variable is associated with small changes in DNA methylation levels at any given locus, but such changes are common genome-wide. For example, in two published data sets (focused on the epigenetic effects of dominance rank in rhesus macaques and caste differences in clonal raider ants[38,41]), sample sizes were very small. The macaque study (n=3 high-ranking versus n=3 low-ranking animals) did not attempt site-by-site analysis, while the raider ant study (n=4 pools of reproductive phase ants versus n=4 pools of brood care phase ants) found no evidence for caste effects on DNA methylation using site-by-site paired t-tests. As shown in Figure 2B (see also Supplementary Figure 6), this result could have stemmed from low power. In support of this possibility, global analysis separates the sample groups of interest in both data sets. Specifically, the macaque study reported that hierarchical clustering distinguishes between high-ranking (n=3) and low-ranking (n=3) individuals, with increased separation when focusing on CpG sites near genes differentially expressed with rank[38]. Similarly, when we re-analyzed the clonal raider ant data set, we found that PCA separates reproductive and brood care individuals along principal component 3 (t-test for separation along PC 3: p=0.022; Figure 2C). Together, these results emphasize the potential utility of global analysis approaches in small studies.

Kinship and population structure

Ecological and evolutionary studies often focus on natural populations that contain related individuals or complex population structure. Accounting for these sources of variance is important because DNA methylation levels are often heritable[50]. In humans, where genetic effects on DNA methylation have been best studied, average estimated heritability levels are 18%-20% in whole blood[50]. As a result, more closely related individuals will tend to exhibit more similar DNA methylation patterns than unrelated individuals. Analyses that do not take genetic relationships into account can therefore produce spurious associations if the predictor of interest also covaries with kinship or ancestry. For example, samples are often collected along transects where climatic variables (e.g., temperature, altitude, rainfall) covary with genetic structure[47,51]. Genetic effects on DNA methylation could thus masquerade as climatic effects if genetic sources of variance are not also modeled. Fortunately, this problem is structurally parallel to problems that have already been addressed in genotype-phenotype association studies, phylogenetic comparative analyses, and research on other functional genomic traits. The most straightforward solution is to use mixed effects models, which can incorporate a matrix of pairwise kinship or shared ancestry estimates to account for genetic similarity[52-55] (Box 1). Specifically, this matrix is treated as the variance-covariance matrix for the heritable (genetic) component of a random effect variable (the environmental component is usually assumed to be independent across samples, so its variance-covariance is given by the identity matrix). The kinship matrix thus contributes to the predicted value of a heritable response variable, but does not affect the value of nonheritable response variables. Notably, while most approaches for controlling for relatedness implement linear mixed models that are only appropriate for continuous response variables[52-54], recently developed binomial mixed models can be used to achieve the same task using count data[36] (Box 1). These approaches avoid the need for transforming BS data from counts to proportions or ratios, thus preserving information about sequencing depth for each site-sample combination. Additionally, recent tools for calling SNP genotypes directly from BS reads (e.g., BisSNP[56] and BS-SNPer[57]) can help with constructing kinship/relatedness matrices, although not without error (Box 2).

Taxonomic differences in DNA methylation

Most research on DNA methylation to date has focused on humans and a handful of model systems. However, ecologists and evolutionary biologists study a wider range of species, and patterns of DNA methylation can vary dramatically among taxa[13,30]. Striking examples include the broad use of non-CpG (CHH and CHG) DNA methylation in plants relative to animals[4,13,30], increased capacity for transgenerational epigenetic inheritance in plants[15,58,59], and increased use of DNA methylation as a transposable element silencing mechanism in large eukaryotic genomes[13,30]. This variation means that patterns typical of one taxonomic group cannot necessarily be extrapolated to others (see[3,4,12,13,30,31,60] for recent comparative studies and reviews of taxon-specific patterns). Here, we focus on how differences in the distribution of CpG DNA methylation levels across the genome can impact power and analysis strategies. To provide some intuition about these differences, we synthesized data from published studies of flowering plants, hymenopteran insects, canids, humans, and non-human primates (Table 1). We estimated the mean and variance of methylation levels at each CpG site in each data set (Figure 3A–B; Supplementary Figure 7). Consistent with expectations, vertebrates show largely hypermethylated genomes, except in tumor samples where normal patterns are extensively perturbed[28]. In contrast, Arabidopsis genomes include many more hypomethylated sites, and the ant genome—typical of hymenopteran insects[12,60]—is almost completely unmethylated (Figure 3). Based on these observed values, we performed additional simulations (Supplementary Materials), with a particular focus on understanding how variance impacts power (because it is unlikely that a predictor variable of interest will significantly explain variation in DNA methylation levels at a locus where there is little variation to begin with). Importantly, the degree to which genomes are composed of relatively monomorphic (low variance) versus high variance sites systematically varies due to both taxon and sequencing strategy (Figure 3A–B, Supplementary Figure 7).

Figure 3

Properties of CpG methylation levels vary across data sets and influence power

For each (A) WGBS and (B) RRBS data set, we plotted the distribution of mean DNA methylation levels at each CpG site with a median coverage >10× across all samples in the study. Whiskers on boxplots represent the values for the third and first quartiles, plus or minus 1.5× the interquartile range, respectively. (C) Power to detect differentially methylated sites (at a 5% FDR) in simulated RRBS datasets. The proportion of simulated true positives (TP) detected is plotted on the y-axis. Power increases as a function of the simulated effect size (represented as the proportion of variance explained; x-axis) and the variance in DNA methylation levels (colors). For all simulations, mean DNA methylation levels were held constant. The levels of variance in DNA methylation levels explored here (0.035, 0.045, 0.055, and 0.095) represent common values observed in real bisulfite sequencing data sets (Supplementary Figure 7).

Our simulations suggest that, all else being equal, power to detect differential methylation in BS data is limited by variance. Specifically, for any given sample size with a fixed mean DNA methylation level, power increases as a function of the underlying variance in DNA methylation levels (Figure 3C). These results suggest that analyses of low variance genomes (e.g., those of hymenopteran insects) may require larger sample sizes to detect a given effect than analyses of more variable systems, such as plants or mammals. An alternative approach is to filter out low variance sites prior to data analysis, which reduces the multiple testing burden. Notably, such filtering will also affect the relative representation of sites in genes, promoters, CpG islands, and other functional compartments of the genome because some of these compartments are consistently more variable than others (Supplementary Figure 7). In the current literature, differences in the genome-wide distribution of DNA methylation levels across taxa have led to taxon-biased analysis approaches. For example, in hypomethylated insect genomes, several studies[41,61,62] have used a binomial test to classify sites into ‘unmethylated’ or ‘methylated’ categories (i.e., all sites that do not pass a given significance threshold are considered ‘unmethylated’). Our simulations (Supplementary Materials) suggest that this approach not only loses information about quantitative variation, but is also sensitive to technical aspects of the data, such as sequencing depth. For example, using a binomial test approach, a site with an observed methylation level of 15% would be considered ‘unmethylated’ at a read depth of 20x, but ‘methylated’ at a read depth of 26× (Supplementary Figure 8). This problem likely accounts for the report of high rates of ‘sample-specific DNA methylation’ (where a site is methylated in one sample, but unmethylated in all other samples) in one recent study[41]. Indeed, our re-analysis of the same data shows that 77% of putative sample-specific sites can be more parsimoniously explained by greater read depth in the “outlier” sample (Supplementary Figure 8). Such problems can be readily avoided by not binarizing DNA methylation levels, which are intrinsically continuous traits, and by using count-based models that account for variation in sequencing depth[20-22,36]. Discretizing DNA methylation levels into “genotype”-like data for population analyses, which has also been proposed[49,63,64], can suffer from the same technical biases. Fortunately, most analyses that can be applied to discretized data can also be run on observed DNA methylation levels without artificial categorization, including differential methylation analysis (Box 1) and variance partitioning within and between populations[65,66]. Researchers interested in epigenetic inheritance[67,68] have also analyzed DNA methylation levels as discrete states (e.g., ‘epialleles’[35,64,69]) in considering the evolutionary dynamics of epigenetic inheritance. However, because epialleles are thought to ‘mutate’ at a much faster rate than sequence variants and are not limited to discretized states, they are unlikely to behave like biallelic variants in classical population genetics[70]. There are ongoing efforts to modify classical population genetic models to take this hypervariability into account[63,70], and we believe that the development of approaches that directly model the continuous nature of DNA methylation data would be particularly valuable in this regard.

Cell type heterogeneity

Epigenetic patterns vary substantially across cell types, contributing to differences in gene expression and biological function among different tissues. Because most sampled tissues also contain multiple cell types, putatively differentially methylated sites could, in some cases, be more parsimoniously explained by variation in cell type proportions rather than a direct effect of the variable of interest on DNA methylation[71]. Controlling for cell type heterogeneity is therefore a major concern in differential methylation analysis[71], and a particular challenge for biologists working under field conditions or with non-model organisms where isolating purified cell types is not an option. Three broad approaches can be used to confront this challenge. First, cellular composition can be phenotyped for use as a downstream statistical control using microscopy, flow cytometry, or, for animal blood samples, Giemsa or Wright-Giemsa stained blood smears. The ability to leverage these strategies will vary across species and collection conditions. However, some are already commonly applied in field studies (especially blood smears[72-75]), suggesting these approaches are feasible in at least some cases. The resulting estimates, or a composite measure (e.g., the first several principal components of variation in cell type proportions) can be incorporated as covariates in downstream analysis. Second, if no measures of cell type heterogeneity are available for the samples of interest, another option is to use epigenomic profiles from sorted cells[76,77] (‘reference epigenomes’) to predict the composition of mixed samples (a process known as ‘deconvolution’[71,78]). Even if obtained from different individuals or populations (and likely even if obtained from a closely related species), this approach can provide reasonable control for cell type heterogeneity[79]. Reference epigenome-based deconvolution is an active area of research, and several software packages exist to execute it[79,80]. Data from reference epigenomes can also be used to test if sites that are differentially methylated with respect to the predictor of interest are also differentially methylated by cell type, which would suggest the two are confounded[8,37]. However, if the between-sample compositional differences that would be required to produce the observed levels of differential methylation are not biologically plausible, tissue heterogeneity is unlikely to completely explain observed differentially methylated sites[8]. Finally, if data from sorted cell populations are unavailable, researchers can apply methods that attempt to account for cell type heterogeneity without the need for reference information[81-83]. Both previously developed approaches, such as surrogate variable analysis (SVA: originally designed for differential gene expression analysis[82]), and approaches specifically developed with DNA methylation data in mind (e.g., FaST-LMM-EWASher[81] and RefFreeEWAS[83]), have been suggested for this purpose. In applying them, researchers will need to evaluate whether the assumptions of these methods are met in their data. Because they were designed for batch correction, these approaches tend to assume that the largest sources of variance in DNA methylation levels (e.g., the top PCs) are due to cell type heterogeneity rather than differential methylation associated with the predictor of interest. Under this assumption, the only true positive associations that are detectable will tend to be both rare and of large effect. However, some predictors (e.g., environmental or disease perturbations) may truly have widespread, but modestly sized effects. For example, an analysis of resource base effects in baboon whole blood identified an association with DNA methylation levels at 1014 sites, after ruling out tissue heterogeneity confounds based on blood smear counts and comparisons against purified cell populations[37]. In comparison, FaST-LMM-EWASher detected a single differentially methylated site in the same data set. Recent comparisons of reference-based and reference-free methods suggest that reference-based approaches are consistently better powered[79,80], and reference-free methods should only be used when sorted cell profiles are not available (SVA and RefFreeEWAS are recommended in these cases[79,80]). Such results suggest that researchers should consider investing in the generation of a small set of reference epigenomes, if possible for their system.

Site versus region-based analyses

DNA methylation levels are spatially correlated, such that CpG sites that are near each other (within a few hundred base pairs[84,85]) will tend to have more similar DNA methylation levels than those that are farther apart. In addition, regulatory regions such as promoters and CpG islands are characterized by a high density of CpG sites. Hence, spatially contiguous differentially methylated regions (DMRs) will generally be of greater functional interest than individual CpG sites. They also provide some reassurance that a signal of differential methylation is not a statistical or technical artifact. Many strategies for DMR identification have been reported in the literature. Although they differ in modeling approach (Supplementary Table 2), all focus on identifying consecutive differentially methylated sites or regions with a specific number or density of differentially methylated sites. Such approaches have several advantages[25,27]. In addition to their potential functional relevance, region-based analyses can borrow strength across spatially contiguous sites, and some specifically incorporate coverage information to place higher weight on deeply sequenced sites[21,23,86-88]. In principle, taking a region-based approach could also reduce the multiple hypothesis testing burden (there should be fewer regions in a genome than individual sites). However, we note that in a false discovery rate framework[89,90], DMR analyses will only be more powerful if they proportionally increase the number of true positives relative to null expectations. Current region-based methods also have some limitations. First, they are less flexible than site-specific analyses, and generally do not enable users to control for other covariates or test the effects of continuous predictor variables (Supplementary Table 2). In addition, we are not aware of any region-based approach that controls for population structure, a known source of confounding. Second, some of the most commonly used DMR methods have been developed with long stretches of contiguously measured CpG sites in mind (e.g., WGBS data[23,88,91,92]). For RRBS or capture data, where stretches of interrogated sites are patchier, adjusting default parameters for window size or number of CpG sites in a putative DMR will therefore be necessary. For example, the default settings in BSmooth, one of the most popular DMR finding algorithms, perform DMR identification over windows that contain at least 70 CpG sites. In human WGBS data, 70 CpG sites can be captured in a window of 2.94 kb, on average. However, the patchiness of typical RRBS data sets means that a mean region size of 34.5 kb is necessary to capture stretches of 70 CpG sites, well beyond the range expected for spatial autocorrelation of DNA methylation levels. Finally, strategies for DMR identification have focused most immediately on CpG DMRs. Identifying CHG or CHH DMRs is indeed possible[20,93-95], however, because the distribution, density, and variance of CHG and CHH sites differ from CpG sites, identifying non-CpG DMRs may also require careful adjustments to “off-the-shelf” settings. One possible strategy to overcome the relative sparsity of RRBS data compared with WGBS data is to first identify differentially methylated sites and then aggregate them into DMRs (as in [18,37]). A direct comparison between this approach and a “DMR-first” approach in simulated data indicates that they identify generally overlapping sets of DMRs, especially for longer stretches of sites (Supplementary Materials; Supplementary Figure 9). These results suggest a possible compromise between the modeling flexibility afforded by site-by-site analysis and prioritization of the most interesting candidate regions via DMR identification.

Conclusions and tools

Like most other genomic technologies, high-throughput BS approaches were first optimized in research contexts that afford a high degree of control (e.g., experimental case-control studies in model systems) and in systems that boast extensive genomic resources (e.g., humans). However, for ecologists and evolutionary biologists, these approaches often become most exciting when they can be extended to a much more diverse set of species and populations—even if these extensions come with complications. We believe that the biological insights to be gained from studies of DNA methylation in diverse taxa have substantial potential. However, maximizing the yield from these studies will require careful consideration of taxon-specific characteristics, the use of analysis methods appropriate to a data set’s structure, and realistic assessments of power. In particular, our results reveal that, with sample sizes that are currently applied by many ecologists and evolutionary biologists, differential methylation analyses will tend to be moderately or lowly powered. Such studies may still have the potential to reveal interesting and important biology, but researchers should be aware that they are likely to detect only the largest effect sizes (as is also true for other types of genomic analysis[96]). Notably, how tightly small effects on differential methylation are linked to differences in downstream phenotypes, such as gene expression, remains somewhat unclear. While this relationship can be investigated for a few loci using experimental manipulation of DNA methylation levels in reporter assays[97] or, more recently, CRISPR-dCas9 manipulation[98], genome-wide tests are still missing from the literature. Finally, to help quantify how sample size, effect size, population structure, and modeling approach affect BS data analysis, we have developed an R Shiny application to perform power analyses like those presented here. This app allows BS data to be simulated with user-specified properties, is coupled with a set of statistical analysis options to evaluate study power, and outputs the simulated count data for maximal flexibility. The app is freely available at www.tung-lab.org/protocols-and-software.html.

Summary of model properties

Method	Models the count-based nature of thedata	Models geneticcovariance	Programs thatimplement the method
Binomial regression	Yes*	No	R and many others
Beta-binomial regression	Yes	No	DSS[22], MOABS[21], RadMeth[20]
Linear mixed effects model	No	Yes	GEMMA[52], EMMA[53], EMMAX[99], FaST-LMM[55]
Binomial mixed effects model	Yes	Yes	MACAU[36]

Binomial regression is never recommended. Because bisulfite sequencing data are overdispersed relative to the assumptions of this model, binomial regression analyses tend to generate many false positives.

91 in total

1. Rapid establishment of genetic incompatibility through natural epigenetic variation.

Authors: Stéphanie Durand; Nicolas Bouché; Elsa Perez Strand; Olivier Loudet; Christine Camilleri
Journal: Curr Biol Date: 2012-01-26 Impact factor: 10.834

2. A comparison of association methods correcting for population stratification in case-control studies.

Authors: Chengqing Wu; Andrew DeWan; Josephine Hoh; Zuoheng Wang
Journal: Ann Hum Genet Date: 2011-01-31 Impact factor: 1.670

3. Purifying selection, drift, and reversible mutation with arbitrarily high mutation rates.

Authors: Brian Charlesworth; Kavita Jain
Journal: Genetics Date: 2014-09-16 Impact factor: 4.562

4. Transgenerational epigenetic instability is a source of novel methylation variants.

Authors: Robert J Schmitz; Matthew D Schultz; Mathew G Lewsey; Ronan C O'Malley; Mark A Urich; Ondrej Libiger; Nicholas J Schork; Joseph R Ecker
Journal: Science Date: 2011-09-15 Impact factor: 47.728

5. The concerted impact of domestication and transposon insertions on methylation patterns between dogs and grey wolves.

Authors: Ilana Janowitz Koch; Michelle M Clark; Michael J Thompson; Kerry A Deere-Machemer; Jun Wang; Lionel Duarte; Gitanjali E Gnanadesikan; Eskender L McCoy; Liudmilla Rubbi; Daniel R Stahler; Matteo Pellegrini; Elaine A Ostrander; Robert K Wayne; Janet S Sinsheimer; Bridgett M vonHoldt
Journal: Mol Ecol Date: 2016-01-18 Impact factor: 6.185

6. Resource base influences genome-wide DNA methylation levels in wild baboons (Papio cynocephalus).

Authors: Amanda J Lea; Jeanne Altmann; Susan C Alberts; Jenny Tung
Journal: Mol Ecol Date: 2016-01-18 Impact factor: 6.185

7. Increased methylation variation in epigenetic domains across cancer types.

Authors: Kasper Daniel Hansen; Winston Timp; Héctor Corrada Bravo; Sarven Sabunciyan; Benjamin Langmead; Oliver G McDonald; Bo Wen; Hao Wu; Yun Liu; Dinh Diep; Eirikur Briem; Kun Zhang; Rafael A Irizarry; Andrew P Feinberg
Journal: Nat Genet Date: 2011-06-26 Impact factor: 38.330

8. MOABS: model based analysis of bisulfite sequencing data.

Authors: Deqiang Sun; Yuanxin Xi; Benjamin Rodriguez; Hyun Jung Park; Pan Tong; Mira Meong; Margaret A Goodell; Wei Li
Journal: Genome Biol Date: 2014-02-24 Impact factor: 13.583

9. Repurposing the CRISPR-Cas9 system for targeted DNA methylation.

Authors: Aleksandar Vojta; Paula Dobrinić; Vanja Tadić; Luka Bočkor; Petra Korać; Boris Julg; Marija Klasić; Vlatka Zoldoš
Journal: Nucleic Acids Res Date: 2016-03-11 Impact factor: 16.971

10. Genetic and environmental exposures constrain epigenetic drift over the human life course.

Authors: Sonia Shah; Allan F McRae; Riccardo E Marioni; Sarah E Harris; Jude Gibson; Anjali K Henders; Paul Redmond; Simon R Cox; Alison Pattie; Janie Corley; Lee Murphy; Nicholas G Martin; Grant W Montgomery; John M Starr; Naomi R Wray; Ian J Deary; Peter M Visscher
Journal: Genome Res Date: 2014-09-23 Impact factor: 9.043

18 in total

1. Evolution of DNA Methylation in Papio Baboons.

Authors: Tauras P Vilgalys; Jeffrey Rogers; Clifford J Jolly; Sayan Mukherjee; Jenny Tung
Journal: Mol Biol Evol Date: 2019-03-01 Impact factor: 16.240

Review 2. Functional genomic insights into the environmental determinants of mammalian fitness.

Authors: Noah Snyder-Mackler; Amanda J Lea
Journal: Curr Opin Genet Dev Date: 2018-08-22 Impact factor: 5.578

3. Epigenetic considerations in aquaculture.

Authors: Mackenzie R Gavery; Steven B Roberts
Journal: PeerJ Date: 2017-12-07 Impact factor: 2.984

4. Temporal Dynamics of DNA Methylation Patterns in Response to Rearing Juvenile Steelhead (Oncorhynchus mykiss) in a Hatchery versus Simulated Stream Environment.

Authors: Mackenzie R Gavery; Krista M Nichols; Barry A Berejikian; Christopher P Tatara; Giles W Goetz; Jon T Dickey; Donald M Van Doornik; Penny Swanson
Journal: Genes (Basel) Date: 2019-05-09 Impact factor: 4.096

10. IMAGE: high-powered detection of genetic effects on DNA methylation using integrated methylation QTL mapping and allele-specific analysis.

Authors: Yue Fan; Tauras P Vilgalys; Shiquan Sun; Qinke Peng; Jenny Tung; Xiang Zhou
Journal: Genome Biol Date: 2019-10-24 Impact factor: 13.583