Literature DB >> 29333247

Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR.

Yunshun Chen^1,2, Bhupinder Pal^1,2, Jane E Visvader^1,2, Gordon K Smyth^1,3.

Abstract

Cytosine methylation is an important DNA epigenetic modification. In vertebrates, methylation occurs at CpG sites, which are dinucleotides where a cytosine is immediately followed by a guanine in the DNA sequence from 5' to 3'. When located in the promoter region of a gene, DNA methylation is often associated with transcriptional silencing of the gene. Aberrant DNA methylation is associated with the development of various diseases such as cancer. Bisulfite sequencing (BS-seq) is the current "gold-standard" technology for high-resolution profiling of DNA methylation. Reduced representation bisulfite sequencing (RRBS) is an efficient form of BS-seq that targets CpG-rich DNA regions in order to save sequencing costs. A typical bioinformatics aim is to identify CpGs that are differentially methylated (DM) between experimental conditions. This workflow demonstrates that differential methylation analysis of RRBS data can be conducted using software and methodology originally developed for RNA-seq data. The RNA-seq pipeline is adapted to methylation by adding extra columns to the design matrix to account for read coverage at each CpG, after which the RRBS and RNA-seq pipelines are almost identical. This approach is statistically natural and gives analysts access to a rich collection of analysis tools including generalized linear models, gene set testing and pathway analysis. The article presents a complete start to finish case study analysis of RRBS profiles of different cell populations from the mouse mammary gland using the Bioconductor package edgeR. We show that lineage-committed cells are typically hyper-methylated compared to progenitor cells and this is true on all the autosomes but not the sex chromosomes. We demonstrate a strong negative correlation between methylation of promoter regions and gene expression as measured by RNA-seq for the same cell types, showing that methylation is a regulatory mechanism involved in epithelial linear commitment.

Entities: Chemical Disease Gene Species

Keywords: BS-seq; Bioconductor; Methylation; differential methylation analysis

Year: 2017 PMID： 29333247 PMCID： PMC5747346 DOI： 10.12688/f1000research.13196.2

Source DB: PubMed Journal: F1000Res ISSN： 2046-1402

Introduction

Cytosine methylation is an important epigenetic DNA modification that is generally associated with transcriptional silencing [1]. In vertebrates, methylation occurs at CpG sites, which are dinucleotides where a cytosine (C) is immediately followed by a guanine (G) in the DNA sequence from 5’ to 3’. CpG dinucleotides are relatively uncommon on the human genome but occur more frequently in gene promoters and exons [2]. About 72% of human gene promoters are enriched for CpGs [2]. CpGs in gene promoters tend to cluster in CpG islands (CGIs), which are regions of a few hundred to a couple of thousand base pairs with very strong enrichment of CpGs [3]. The relationship of DNA methylation to transcription in vertebrates is complex [4]. Methylation of CGIs causes robust transcriptional repression and is required for long-term mono-allelic silencing including X inactivation and genomic imprinting [1]. Methylation of CpG-poor promoters is more weakly associated with gene expression, and the mechanisms by which this occurs are unclear [1]. Methylation of gene bodies on the other hand can be positively associated with gene expression [4, 5]. DNA methylation is relatively stable in that most CGIs do not change methylation state during normal cell development. Nevertheless DNA methylation is understood to play a regulatory role in differentiation and commitment in adult cell lineages [6, 7]. Aberrant methylation patterns are also associated with the development of diseases such as cancer [8, 9]. Bisulfite sequencing (BS-seq) is increasingly used to profile DNA methylation [10, 11]. Unmethylated cytosines (C) are converted to Uracils (U) by sodium bisulfite and then deaminated to thymines (T) during PCR amplification. Methylated Cs, on the other hand, remain intact after bisulfite treatment. This strategy produces whole genome bisulfite sequencing (WGBS) when combined with sequencing of the entire genome. WGBS is sometimes considered the “gold standard” for methylation profiling because it provides single-nucleotide resolution and whole-genome coverage [11]. However it requires large quantities of DNA and is expensive because of the amount of sequencing required. The fact that CpG islands constitute only a small percentage of the genome makes the WGBS approach inefficient in terms of information content per sequenced read. To improve efficiency and reduce costs, enrichment strategies have been developed and combined with BS-seq to target a specific fraction of the genome. A common targeted approach is reduced representation bisulfite sequencing (RRBS) that targets CpG-rich regions [11, 12]. Under the RRBS strategy, small fragments that compose only 1% of the genome are generated using MspI digestion, which means fewer reads need to be sequenced in total to provide reasonable coverage of the targeted regions. The RRBS approach can capture approximately 70% of gene promoters and 85% of CpG islands, while requiring only small quantities of input sample [13]. RRBS has great advantages in cost and efficiency, especially when CGI or gene-orientated results are required. RRBS is also applicable for single cell studies [14]. The first step of analyzing BS-seq data is to align short sequence reads to a reference genome. The number of C-to-T conversions are then counted for all the mapped reads. A number of software tools have been developed to facilitate read mapping and methylation calling, including Bismark [15], MethylCoder [16], BRAT [17], BSSeeker [18, 19] and BSMAP [20]. Most of these tools rely on existing short read aligners, such as Bowtie [21, 22]. Typical downstream DNA methylation studies often involve finding differentially methylated regions (DMRs) between different experimental conditions. A number of statistical methods and software packages have been developed for detecting DMRs using the BS-seq technology. methylkit [23] and RnBeads [24] implement Fisher’s Exact Test, which is a popular choice for two-group comparisons with no replicates. In the case of complex experimental designs, regression methods are widely used to model methylation levels or read counts. RnBeads offers a linear regression approach based on the moderated t-test and empirical Bayes method implemented in limma [25]. BSmooth [26] is another analysis pipeline that uses linear regression together with a local likelihood smoother. methylkit also has an option to apply logistic regression with overdispersion correction [23]. Some other methods have been developed based on beta-binomial distribution to achieve better variance modeling. For example, DSS fits a Bayesian hierarchical beta-binomial model to BS-seq data and uses Wald tests to detect DMRs [27]. Other software using beta-binomial model include BiSeq [28], MOABS [29] and RADMeth [30]. In this workflow, we demonstrate an edgeR pipeline for differential methylation analysis. edgeR is one of the most popular Bioconductor packages for assessing differential expression in RNA-seq data [31, 32]. It is based on the negative binomial (NB) distribution and it models the variation between biological replicates through the NB dispersion parameter. Unlike other approaches to methylation sequencing data, the analysis explained in this workflow keeps the counts for methylated and unmethylated reads as separate observations. edgeR linear models are used to fit the total read count (methylated plus unmethylated) at each genomic locus, in such a way that the proportion of methylated reads at each locus is modeled indirectly as an over-dispersed binomial-like distribution. This approach has a number of advantages. First, it allows the differential methylation analysis to be undertaken using existing edgeR pipelines developed originally for RNA-seq differential expression analyses. The edgeR generalized linear model (GLM) framework offers great flexibility for analysing complex experimental designs while still accounting for the biological variability [33]. Second, keeping methylated and unmethylated read count as separate data observations allows the inherent variability of the data to be modeled more directly and perhaps more realistically. Differential methylation is assessed by likelihood ratio tests so we do not need to assume that the log-fold-changes or other coefficient estimators are normally distributed. This article presents an analysis of an RRBS data set generated by the authors containing replicated RRBS profiles of basal and luminal cell populations from the mouse mammary epithelium. Our primary interest is in gene-orientated and pathway-orientated interpretations of the result. It is of particular importance to relate methylation changes to RNA-seq expression changes for the same genes. We show how to analyze differential methylation changes either for individual CpGs or for pre-specified genomic regions, such as chromosomes or genes, and we particularly focus on methylation changes by promoter regions. As with other articles in the Bioconductor Gateway series, our aim is to provide an example analysis with complete start to finish code. As with other Bioconductor workflow articles, we illustrate one analysis strategy in detail rather than comparing different pipelines. The analysis approach illustrated in this article can in principle be applied to any BS-seq data but is especially appropriate for RRBS data. The approach is designed for experiments that included biological replication but can be used without replication if the NB dispersion is preset. The results shown in this article were generated using Bioconductor Release 3.7. The next section gives an expository introduction to the edgeR approach to methylation data. The analysis of the mammary epithelial data starts afterwards.

Introducing the NB linear modeling approach to BS-seq data

A very small example

To introduce the edgeR linear modeling approach to BS-seq data, consider a genomic locus that has mA methylated and u unmethylated reads in condition A and m methylated and u unmethylated reads in condition B. Our approach is to model all four counts as NB distributed with the same dispersion but different means. Suppose the data is as given in Table 1. The counts can be entered into a matrix in R:

Table 1.

A very small example data set.

Sample	Condition	Methylated Count	Unmethylated Count
1	A	2	12
2	B	11	0

The experimental design of the data can be summarized by indicating which counts correspond to each sample and each condition: In the design matrix, the first two columns are indicators for samples 1 and 2 while the third and fourth columns indicate the methylated count columns for conditions A and B. If this were a complete data set, then it could be analyzed in edgeR as follows. First we fit a negative binomial generalized linear model to the counts: > The first two coefficients in the linear model are used to model the total number of reads (methylated or unmethylated) for samples 1 and 2, respectively. Coefficient 3 estimates the log ratio of methylated to unmethylated reads for condition A, a quantity that can also be viewed as the logit proportion of methylated reads in condition A. In mathematical terms, we have . (Note that edgeR adds 0.125 to each count when reporting the coefficients to avoid taking logarithms of zero.) Coefficient 4 estimates the log ratio of methylated to unmethylated reads for condition B, in mathematical terms . Then we conduct a likelihood ratio test to compare the methylation rate in condition B to that in condition A: The contrast vector indicates that we wish to compare the fourth to the third coefficient, ignoring the two sample coverage coefficients. The estimated value for the contrast is the difference in logit methylation rates, which is estimated as , a quantity that is shown as the log2-fold-change (logFC) column of the results table. The chisquare likelihood ratio statistic (LR) is 20.7 and the P-value for differential methylation in condition B vs condition A is P = 5.27 × 10 . Note that the specific parametrization used to account for the sample effects is not important. The test for differential methylation would be unchanged if we represented the two sample effects using an intercept model. Indeed, the first two columns of the design matrix could be any two columns spanning the sample effects: The dispersion parameter controls the degree of biological variability [33]. If we had set dispersion=0 in the above code, then the above analysis would be exactly equivalent to a logistic binomial regression, with the methylated counts as responses and the total counts as sizes, and with a likelihood ratio test for a difference in proportions between conditions A and B. Positive values for the dispersion produce over-dispersion relative to the binomial distribution. We have set the dispersion here equal to the value that is estimated below for the mammary epithelial data. In the above code, the two library sizes for each sample should be equal. Otherwise, the library size values are arbitrary and any settings would lead to the same P-value.

Relationship to beta-binomial modeling

It is interesting to compare this approach with beta-binomial modeling. It is well known that if m and u are independent Poisson random variables with means µ and µ , then the conditional distribution of m given m + u is binomial with success probability p = µ ( µ + µ ). If the Poisson means µ and µ themselves follow gamma distributions, then the marginal distributions of m and u are NB instead of Poisson. If the two NB distributions have different dispersions, and have expected values in inverse proportion to the dispersions, then the conditional distribution of m given m + u follows a beta-binomial distribution. The approach taken in this article is closely related to the beta-binomial approach but makes different and seemingly more natural assumptions about the NB distributions. We instead assume the two NB distributions to have the same dispersion but different means. The NB linear modeling approach allows the means and dispersions of the two NB distributions to be estimated separately, in concordance with the data instead of being artificially linked.

A small example with replicates

edgeR linear modeling takes advantage of replicate libraries for each condition. Now we augment the small example above to include two replicates for each experimental condition. There are now four samples and eight counts ( Table 2). The eight counts can be represented in R as follows:

Table 2.

A slightly less small example data set with n = 2 replicates for each condition.

Sample	Condition	Methylated Count	Unmethylated Count
1	A	2	12
2	A	4	20
3	B	11	0
4	B	15	3

We can represent the sample information associated with the counts as follows: The design matrix is a simple combination of two parts. First there is a matrix to model the sample coverages: Then there is a matrix to model the two treatment conditions. The full matrix is constructed from the two parts, with the condition design matrix mediating the methylation effects: Now we can fit the GLM and conduct the statistical test for differential methylation: The results for each condition have been averaged over the replicates. The change in methylation is now of smaller magnitude than for Table 1, as shown by the smaller logFC value, but the P-value is more significant because of the greater confidence that comes from the extra replicates.

Automatic construction of the design matrix

In this expository introduction we have demonstrated the construction of the design matrix from basic principles. In practice, edgeR provides a function modelMatrixMeth to automate the construction of the design matrix. The user can simply specify the treatment conditions associated with each DNA sample (as in the second column of Table 2): Then the design matrix is constructed by: Alternatively, a user can construct any appropriate design matrix at the sample level (with 4 rows), exactly as one would do for a RNA-seq differential expression analysis, then the function will expand the sample-level design matrix to the appropriate observation-level matrix with 8 rows: This gives exactly the same result as above.

Another way to make the design matrix

When there are just two treatment conditions to be compared, it can be slightly simpler to compute the log2-fold-change directly as a coefficient of the linear model, rather than forming it as a contrast. This can be done by including an intercept in the design matrix as follows: In this formulation, the second last coefficient still estimates the logit methylation level in condition A, but the last coefficient now estimates the log2-fold-change directly: Again, this gives the same results as the previous analysis.

Analogy with paired-samples expression analyses

The above design matrix might seem as if it is very special to methylation data, but in fact the same sort of design matrix would be used for any gene expression analysis when there are paired-samples and we wish to compare treatment effects between groups. To see this, consider the small RNA-seq experiment shown in Table 3. This experiment has four subjects, each of whom contributes both treated and untreated RNA samples. The subjects belong to two groups (A and B) and we wish to test whether the treatment effect differs between the groups.

Table 3.

A hypothetical paired-samples RNA-seq experiment.

Subject	Group	Treated	Untreated
1	A	2	12
2	A	4	20
3	B	11	0
4	B	15	3

Readers will note that Table 3 is the same as Table 2 except that the columns have been relabeled. Here, the four subjects are analogous to the four DNA samples in Table 2, the groups are analogous to conditions, and treatment is analogous to methylation. Exactly the same design matrix, as used above for the methylation experiment, would be appropriate here for the expression experiment. In edgeR, the only difference between the methylation and expression analyses would be in the normalization steps. The linear modeling and testing steps would be identical. This shows that BS-seq data has a structure that is already familiar from RNA-seq experiments with paired-samples. The theme of this article is that statistical analysis methods developed for RNA-seq can be beneficially applied to BS-seq data.

NB GLMs

In practice, the data will consist of many thousands of genomic loci and the counts matrix will have a row for each locus. The function glmFit fits a NB GLM to each row of read counts. If y is the ith read count for genomic locus g, then y is assumed to be NB distributed with expected value µ and dispersion φ . The expected values are modeled by the design matrix, where the x are the ith row of the design matrix, β are the regression coefficients and N are the library sizes. In the small example above with four samples, the design matrix had 6 columns, so p = 6. In general, if we have an experiment comparing k treatment conditions using n samples, then the count matrix will have 2 n columns and the design matrix will have p = n + k columns. The first n coefficients capture sample effects in read abundance, while the remaining k coefficients compare the methylation proportions between the conditions.

Dispersion estimation

In the small example above we did not estimate the dispersion but simply used a preset value. In practice, there will not only be replicates for each genomic locus but also many thousands of loci. The edgeR package estimates a dispersion for each locus from the replicate variability, using an empirical Bayes procedure to pool information across all the loci while estimating locus-specific dispersion estimates [33– 35]. The full data example on mouse mammary epithelial cells will illustrate this. The variance of the read counts is assumed to follow a quadratic mean-variance relationship where ϕ is the dispersion for locus g. The first term follows from sequencing variability and the second from biological variability [33]. The NB variance function captures the fact that larger counts have larger variances. The quadratic relationship captures the technical factors that affect variability, so that ϕ itself reflects only the biological characteristics of each locus.

RRBS profiling of mammary epithelial cells

Aim of the study

We now describe the main data set to be analyzed in this article. The epithelium of the mammary gland exists in a highly dynamic state, undergoing dramatic morphogenetic changes during puberty, pregnancy, lactation and regression [36]. Characterization of the lineage hierarchy of cells in the mammary epithelium is an important step toward understanding which cells are predisposed to oncogenesis. In this study, we profiled the methylation status of the two major functionally distinct epithelial compartments: basal and luminal cells. The basal cells were further divided into those showing high or low expression of the surface marker Itga5 as part of our investigation of heterogeneity within the basal compartment. We carried out global RRBS DNA methylation assays on two biological replicates of each of the three cell populations to determine whether the epigenetic machinery played a potential role in (i) differentiation of luminal cells from basal and (ii) any compartmentalization of the basal cells associated with Itga5. Of particular interest is to relate transcriptional changes to methylation changes for the same genes.

Sample preparation

Inguinal mammary glands (minus lymph node) were harvested from FVB/N mice. All animal experiments were conducted using mice bred at and maintained in our animal facility according to the Walter and Eliza Hall Institute of Medical Research Animal Ethics Committee guidelines. Epithelial cells were suspended and fluorescence-activated cell sorting (FACS) was used to isolate basal and luminal cell populations [37]. Genomic DNA (gDNA) was extracted from freshly sorted cells using the Qiagen DNeasy kit. Around 25ng gDNA input was subjected to DNA methylation analysis by BS-seq using the Ovation RRBS Methyl-seq kit from NuGEN. The process includes MspI digestion of gDNA, sequencing adapter ligation, end repair, bisulfite conversion, and PCR amplification to produce the final sequencing library. The Qiagen EpiTect Bisulfite kit was used for bisulfite-mediated conversion of unmethylated cytosines.

Experimental design

There are three groups of samples: luminal population, Itga5- basal population and Itga5+ basal population. Two biological replicates were collected for each group. This experimental design is summarized in the table below. The experiment has a simple one-way layout with three groups. The sequencing was carried out on the Illumina NextSeq 500 platform. About 30 million 75bp paired-end reads were generated for each sample.

Reading and annotating the methylation counts

Processing the BS-seq FASTQ files with Bismark

The first step of the analysis is to map the sequence reads from the FASTQ files to the mouse genome and to perform methylation calls. This is the only step of the analysis that cannot currently be done in R. Though many options are available, we used Bismark software ( https://www.bioinformatics.babraham.ac.uk/projects/bismark) to count the methylated and unmethylated reads at each genomic locus. Bismark was run using recommended default settings. We first ran trim_galore ( https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) to remove adapters and to trim poor quality reads. Then Bismark version v0.13.0 was used to align the reads to the mouse mm10 genome using Bowtie2 [22]. Finally, methylation calls were made using bismark_methylation_extractor. The Bismark output consists of one coverage file for each sample. Readers wishing to reproduce the analysis presented in this article can download the coverage files produced by Bismark from http://bioinf.wehi.edu.au/edgeR/F1000Research2017.

Reading in the data

The Bismark coverage files are just tab-delimited text files and so can be read into a dataframe in R using read.delim. Each of the files has the following format: The columns in the coverage file represent: V1: chromosome number; V2: start position of the CpG site; V3: end position of the CpG site; V4: methylation proportion; V5: number of methylated Cs; V6: number of unmethylated Cs. In practice, we only need to read in columns V1, V2, V5 and V6. Column V3 is the same as V2, and the methylation proportion V4 is just a function of V5 and V6. We now read in the Bismark coverage files for all the samples. The edgeR function readBismark2DGE reads all the files and collates the counts for all the sample into one data object: The edgeR package stores the counts and associated annotation in a simple list-based data object called a “DGEList”. The count matrix is There is a row for each CpG locus found in any of the files. There columns of methylated and unmethylated counts for each sample. The chromosomes and genomic loci are stored in the annotation data.frame: Summary information on each sample is stored in yall$samples:

Filtering unassembled chromosomes

It is convenient to remove genomic segments that have not been assembled into any of the recognized chromosomes: For this analysis, we also remove the Y chromosome and mitochondrial DNA: Then we can subset the DGEList data object to remove the rows corresponding to the unwanted chromosomes: The option keep.lib.sizes=FALSE causes the library sizes to be recomputed.

Sort into genomic order

Just for convenience, we sort the DGEList so that all loci are in genomic order, from chromosome 1 to chromosome X:

Gene annotation

We now annotate the CpG loci with the identity of the nearest gene. We search for the gene transcriptional start site (TSS) closest to each our CpGs: Here EntrezID, Symbol, Strand and Width are the Entrez Gene ID, symbol, strand and width of the nearest gene. Distance is the genomic distance from the CpG to the TSS. Positive values means the TSS is downstream of the CpG and negative values means the TSS is upstream.

Differential methylation analysis at CpG loci

Filtering to remove low counts

We now turn to statistical analysis of differential methylation. Our first analysis will be for individual CpG loci. CpG loci that have low coverage are removed prior to downstream analysis as they provide little information for assessing methylation levels. Filtering low-coverage CpGs also simplifies the subsequent analysis because of the reduction in data rows. We sum up the counts of methylated and unmethylated reads to get the total read coverage at each CpG site for each sample: The analysis needs to be restricted to CpG sites that have enough coverage for the methylation level to be measurable in a meaningful way at that site. As a conservative rule of thumb, we require a CpG site to have a total count (both methylated and unmethylated) of at least 8 in every sample before it is considered in the study. This filtering criterion could be relaxed somewhat in principle but the number of CpGs kept in the analysis is large enough for our purposes. The DGEList object is subsetted to retain only the non-filtered loci: Again, the option keep.lib.sizes=FALSE causes the library sizes to be recomputed after the filtering. We generally recommend this, although the effect on the downstream analysis is usually small.

Normalization

A key difference between BS-seq and other sequencing data is that the pair of libraries holding the methylated and unmethylated reads for a particular sample are treated as a unit. To ensure that the methylated and unmethylated reads for the same sample are treated on the same scale, we need to set the library sizes to be equal for each pair of libraries. We set the library sizes for each sample to be the average of the total read counts for the methylated and unmethylated libraries: Other normalization methods developed for RNA-seq data, such as TMM [38], are not required for BS-seq data.

Exploring differences between samples

In microarray methylation studies, a common measure of methylation level is the M-value, which is defined as M = log 2 {(Me + α) /(Un + α)} where Me and Un are the methylated and unmethylated intensities and α is some suitable offset to avoid taking logarithms of zero [39]. The M-value can be interpreted as the base2 logit transformation of the proportion of methylated signal at each locus. We compute the corresponding methylation summary from the methylated and unmethylated counts. We compute the corresponding methylation summary from the methylated and unmethylated counts. Here M contains the empirical logit methylation level for each CpG site in each sample. We have used a prior count of 2 to avoid logarithms of zero. The exact value of the prior count is unimportant, but a value of 2 is common in other contexts such as RNA-seq. Now we can generate a multi-dimensional scaling (MDS) plot to explore the overall differences between the methylation levels of the different samples ( Figure 1):

Figure 1.

MDS plots showing overall differences in methylation levels between the samples.

MDS plots showing overall differences in methylation levels between the samples.

Replicate samples from the same population cluster together. The first dimension separates the luminal cells from the basal cells. The second dimension separates the Itga+ basal cells at the bottom from the Itga- basal cells at the top. In this plot the distance between each pair of samples represents the average logit change between the samples for the top most differentially methylated CpG loci between that pair of samples. (We call this average the leading log-fold-change.) The two replicate samples from the luminal population (P6) are seen to be well separated from the four basal samples (populations P7 and P8), with a leading change of about 6 logit units.

Design matrix

One aim of this study is to identify differentially methylated (DM) loci between the different cell populations. In edgeR, this can be done by fitting linear models under a specified design matrix and testing for corresponding coefficients or contrasts. A basic sample-level design matrix can be made as follows: The we expand this to the full design matrix modeling the sample and methylation effects: The first six columns represent the sample coverage effects. The last three columns represent the methylation levels (in logit units) in the three cell populations. With the design matrix specified, we can now proceed to the standard edgeR pipeline and analyze the data in the same way as for RNA-seq data. Similar to the RNA-seq data, the variability between biological replicates has also been observed in bisulfite sequencing data. This variability can be captured by the NB dispersion parameter under the generalized linear model (GLM) framework in edgeR. The mean-dispersion relationship of BS-seq data has been studied in the past and no apparent mean-dispersion trend was observed [27]. This is also verified through our own practice. Therefore, we would not consider a mean-dependent dispersion trend as we normally would for RNA-seq data. A common dispersion estimate for all the loci, as well as an empirical Bayes moderated dispersion for each individual locus, can be obtained from the estimateDisp function in edgeR: This returns a DGEList object with additional components ( common.dispersion and tagwise.dispersion) added to hold the estimated dispersions. Here the estimation of trended dispersion has been turned off by setting trend="none". For this data, the estimated prior degrees of freedom (df) are infinite for all the loci, which implies all the CpG-wise dispersions are equal to the common dispersion. A BCV plot is often useful to visualize the dispersion estimates, but is not informative in this case.

Testing for differentially methylated CpG loci

We first fit NB GLMs for all the CpG loci using the glmFit function in edgeR. Then we can proceed to testing for differentially methylated CpG sites between different populations. One of the most interesting comparisons is between the basal (P7 and P8) and luminal (P6) populations. The contrast corresponding to any specified comparison can be constructed conveniently using the makeContrasts function: The actual testing is performed using likelihood ratio tests (LRT) in edgeR: The top set of most differentially methylated (DM) CpG sites can be viewed with topTags: Here positive log-fold-changes represent CpG sites that have higher methylation level in the luminal population compared to the two basal populations. The Benjamini-Hochberg multiple testing correction is applied to control the false discovery rate (FDR). The total number of DM CpG sites identified at an FDR of 5% can be shown with decideTests. There are in fact more than 50,000 differentially methylated CpGs in this comparison: The differential methylation results can be visualized with an MD plot (see Figure 2):

Figure 2.

MD plot showing the log-fold-change of the methylation level and average abundance of each CpG site.

Significantly hyper- and hypomethylated CpGs are highlighted in red and blue, respectively.

MD plot showing the log-fold-change of the methylation level and average abundance of each CpG site.

Significantly hyper- and hypomethylated CpGs are highlighted in red and blue, respectively. The logFC of the methylation level for each CpG site is plotted against the average abundance in log2-CPM. Significantly differentially methylated CpGs are highlighted.

Differential methylation by chromosome

We now explore overall methylation patterns by chromosome. To do this we make a list of index vectors, with each vector identifying the loci that belong to that chromosome: Then we can conduct “fry” gene set tests to evaluate overall changes in methylation in each chromosome: This shows that methylation increases overall in committed luminal cells for every chromosome except for Chromosome X. This is consistent with the expectation that methylation is associated with silencing of genes not needed in a committed lineage. The column heading NGenes gives the number of CpG loci in each chromosome. The results for Chromosome X are interesting. The mixed P-value shows that the X chromosome is actually enriched for DM CpGs, but the direction of change is not consistent with hyper and hypomethylated CpGs balancing each other.

Global methylation patterns around TSS

Now we explore methylation patterns to relative to gene TSSs. The coefficients of the fitted model fit record methylation levels at each CpG in each cell population. We now relate methylation in the P6 (luminal) population versus distance to the nearest TSS: Figure 3 shows that CpGs near a TSS tend to be unmethylated but CpGs elsewhere are predominately methylated. Cell populations P7 and P8 show the same patterns.

Figure 3.

Methylation level by distance to the nearest gene TSS.

Results are averaged over all genes. Negative distances are downstream of TSS.

Methylation level by distance to the nearest gene TSS.

Results are averaged over all genes. Negative distances are downstream of TSS. Now we examine the methylation changes stored in lrt and relate these to TSS position. Figure 4 shows that methylation tends to be added around the TSS in luminal vs basal cells. CpGs further from TSS tend to be unchanged.

Figure 4.

Average methylation change by genomic position relative to TSS.

Changes are averaged over all genes. Negative distances are downstream of TSS. Vertical axis shows base 2 logit differences.

Average methylation change by genomic position relative to TSS.

Changes are averaged over all genes. Negative distances are downstream of TSS. Vertical axis shows base 2 logit differences.

Differential methylation in gene promoters

Pre-defined gene promoters

The majority of CpGs are methylated in mammals. On the other hand, unmethylated CpGs tend to group into clusters of CpG islands, which are often enriched in gene promoters. CpG methylation in promoter regions is often associated with silencing of transcription and gene expression [3]. Therefore it is of great biological interest to examine the methylation level within the gene promoter regions. For simplicity, we define the promoter of a gene as the region from 2kb upstream to 1kb downstream of the transcription start site of that gene. We subset the CpGs to those contained in a promoter region:

Summarizing counts in promoter regions

One simple and effective way to conduct a gene-orientated analysis of the methylation changes is to collapse all the CpGs in each promotor into one locus, i.e., to compute the total number of methylated and unmethylated reads within each promoter. This strategy focuses on the aggregate methylation level within each promoter and allows us to test for an overall increase or decrease in methylation for each gene. First we compute the total counts for each gene promoter: The rowsum function here operated on the DGEList object yIP and produces a new DGEList object ypr with a row for each EntrezID. The integer matrix ypr$counts contains the total numbers of methylated and unmethylated CpGs observed within the promoter of each gene. Same as before, ypr$counts has 12 columns, two for each sample. The odd-numbered columns contain the numbers of methylated Cs, whereas the even-numbered columns contain the numbers of unmethylated Cs. The only difference is that each row of ypr$counts now represents a gene promoter instead of an individual CpG site. Filtering is performed in the same way as before. We sum up the read counts of both methylated and unmethylated Cs at each gene promoter within each sample. Since each row represents a 3,000-bps-wide promoter region that contains multiple CpG sites, we would expect less filtering than before. Same as before, we do not perform normalization but set the library sizes for each sample to be the average of the total read counts for the methylated and unmethylated libraries. Same as before, we measure the methylation levels of gene promoter regions using M-values. A prior count of 2 is added to the calculation to avoid undefined values and to reduce the variability of M-values for gene promoters with low counts. Then a MDS plot is produced to examine the overall differences between the methylation levels of the different samples. The resulting Figure 5 shows that the two replicate samples from the luminal population (P6) are well separated from the four replicate samples from the basal population (P7 and P8).

Figure 5.

MDS plot showing differences in methylation profiles at gene promoters.

Basal and luminal cell populations are well separated by the first dimension.

MDS plot showing differences in methylation profiles at gene promoters.

Basal and luminal cell populations are well separated by the first dimension. We estimate the NB dispersions using the estimateDisp function in edgeR. For the same reason, we do not consider a mean-dependent dispersion trend as we normally would for RNA-seq data. The dispersion estimates ( ϕ ) can be visualized with a BCV plot (see Figure 6):

Figure 6.

Scatterplot of the biological coefficient of variation (BCV) against the average abundance of each gene.

The plot shows the square-root estimates of the common and tagwise NB dispersions.

Scatterplot of the biological coefficient of variation (BCV) against the average abundance of each gene.

The plot shows the square-root estimates of the common and tagwise NB dispersions.

Testing for differential methylation in gene promoters

We first fit NB GLMs for all the gene promoters using glmFit. Then we can proceed to testing for differential methylation in gene promoter regions between different populations. Suppose the comparison of interest is the same as before. The same contrast can be used for the testing. The top set of most differentially methylated gene promoters can be viewed with topTags: Here positive log-fold-changes represent gene promoters that have higher methylation level in the luminal population compared to the basal population. The Benjamini-Hochberg multiple testing correction is applied to control the false discovery rate (FDR). The total number of DM gene promoters identified at an FDR of 5% can be shown with decideTests. There are in fact about 1,200 differentially methylated gene promoters in this comparison: For future reference we make a dataframe of all the DM genes: The differential methylation results can be visualized with an MD plot (see Figure 7):

Figure 7.

MD plot showing the log-fold-change of the methylation level and average abundance of CpG sites in each gene promoter.

Significantly hyper and hypomethylated gene promoters are highlighted in red and blue, respectively.

MD plot showing the log-fold-change of the methylation level and average abundance of CpG sites in each gene promoter.

Significantly hyper and hypomethylated gene promoters are highlighted in red and blue, respectively.

Correlate with RNA-seq profiles

RNA-seq profiles of mouse epithelium luminal and basal cells

To explore whether hypermethylation of promoter regions is associated with repressed gene expression, we relate the differential methylation results to differential expression results from RNA-seq for similar cell populations. The RNA-seq data used here is from a study of the epithelial cell lineage in the mouse mammary gland, in which the expression profiles were generated from basal stem-cell enriched cells and committed luminal cells in the mammary glands of virgin, pregnant and lactating mice [40]. The complete differential expression analysis of the data is described in Chen et al. [32]. The RNA-seq data is stored as a DGEList object y_rna and saved in a RData file rna.RData. The object y_rna contains the count matrix, sample information, gene annotation, design matrix and dispersion estimates of the RNA-seq data. The gene filtering, normalization and dispersion estimation were performed in the same way as described in Chen et al. [32]. The rna.RData file is available for download at http://bioinf.wehi.edu.au/edgeR/F1000Research2017. We load the RData file: We keep only the genes that are also included in our methylation analysis: We assess differential expression between the luminal and basal virgin samples:

Correlate methylation and expression

We can add the expression log-fold-changes to our methylation results: The negative correlation between methylation and expression is immediately apparent. Of the top 30 DM genes, eight have NA expression fold-changes because the genes were not expressed at a high enough level to be included in the analysis. Of the rest, all but Akap6 have methylation and expression logFCs of opposite signs. We can explore this correlation further by plotting the expression logFC vs the methylation logFC for all DM genes (see Figure 8):

Figure 8.

Scatter plot of the log-fold-changes of methylation levels in gene promoters (x-axis) vs the log fold-changes of gene expression (y-axis).

Scatter plot of the log-fold-changes of methylation levels in gene promoters (x-axis) vs the log fold-changes of gene expression (y-axis).

The plot shows results for the genes of which the promoters are significantly differentially methylated between basal and luminal. The red line shows the least squares line with zero intercept. A strong negative correlation is observed. The horizontal axis of the scatterplot shows the log-fold-change in methylation level for each gene promoter while the vertical axis shows the log-fold-change in gene expression. To assess the correlation, we fit a least squares regression line through the origin and compute the P-value: The negative association is highly significant ( P = 6 × 10 ). The last line of code adds the regression line to the plot ( Figure 8).

Gene set testing

The correlation in Figure 8 is convincing, but the above P-value computation assumes that genes are statistically independent of one another. We can perform a gene set test to get around this assumption. First we make a dataframe of the logFCs of the DM genes, keeping only the genes for which we have RNA-seq results: Then we test whether the DM genes are up or down-regulated using a “fry” gene set test. This analysis weights the DM genes by their methylation logFCs so that the test evaluates whether the expression changes are positively or negatively associated with the methylation changes: The result Down in the Direction column indicates negative correlation between the methylation and expression changes. The small PValue confirms a highly significant result. We can visualize the gene set test result with a barcode plot (see Figure 9):

Figure 9.

Barcode plot showing strong negative correlation between expression changes and methylation changes when luminal cells are compared to basal.

The plot is analogous to Figure 8 with the DM genes now displayed as gene sets and the methylation logFCs now displayed as gene weights.

Barcode plot showing strong negative correlation between expression changes and methylation changes when luminal cells are compared to basal.

The plot is analogous to Figure 8 with the DM genes now displayed as gene sets and the methylation logFCs now displayed as gene weights. In the barcode plot, genes are sorted left to right from most down-regulated to most up-regulated in luminal vs basal. The x-axis shows the expression log2-fold-change. The vertical red bars indicate genes hypermethylated in luminal and vertical blue bars indicate genes hypomethylated in luminal. The variable-height vertical bars show the methylation log-fold-changes. The red and blue worms measure relative enrichment, showing that hypermethylation is associated with decreased regulation and hypomethylation is associated with up-regulation. In other words, there is a strong negative association between methylation of promoter regions and expression of the corresponding genes.

Discussion

This article has presented a complete start to finish analysis of an RRBS dataset from our own practice. The analysis demonstrates how BS-seq data can be analyzed using software designed for RNA-seq, thus benefiting from a large reservoir of highly-developed RNA-seq methodology. At first sight, BS-seq data is fundamentally different to RNA-seq because of the need to focus on the proportion of methylated reads at each locus rather than on the absolute number of reads at each locus. The link is achieved by conducting the statistical inference conditional on the read coverage at each locus, and this in turn is achieved by including sample-specific locus effects in the linear model. We have concentrated on analysing methylation changes for pre-defined genomic regions. At the highest resolution, conducted differential methylation tests for each distinct CpG site. Our main analysis was gene-level, whereby we aggregated counts over a putative promoter region around the TSS for each gene. This gives a “big picture” analysis and facilities a correlation of DM results with differential expression results from RNA-seq for the same cell populations. We also illustrated the use of gene set tests to examine whether there were overall methylation changes at the whole chromosome level. The analysis presented here was designed for reduced representation BS-seq and was tuned to our own research interests. While the same analysis approach could be fruitfully applied to WGBS data, researchers with WGBS data may also want to discover DMRs de novo without the use of gene annotation or pre-specified genomic regions, something we haven’t explored in this article. The edgeR approach presented here could in principle be extended to discover DMRs in a de novo fashion using similar methods to those developed for ChIP-seq data [41– 43].

Packages used

This workflow depends on various packages from version 3.7 of the Bioconductor project, running on R version 3.5.0 or higher. For all the code to work as presented, edgeR 3.22.2 or later is required. A complete list of the packages used for this workflow is shown below:

Data and software availability

All software packages used in this workflows are publicly available as part of Bioconductor 3.7. The data and analysis code used in this workflow are available from http://bioinf.wehi.edu.au/edgeR/F1000Research2017. The data and analysis code as at time of publication of this article has been archived at http://doi.org/10.5281/zenodo.1052870 [44]. Chen et al. propose a novel use of negative binomial generalized linear models (GLMs), as implemented in the edgeR software, to test for differential methylation from bisulfite-sequencing data, particularly reduced representation bisulfite-sequencing (RRBS). By leveraging the existing edgeR software, a popular tool in the analysis of diffential gene expression analysis from RNA-seq data, this method is immediately able to handle complex experimental designs and integrates with downstream analysis tools such as gene set tests provided by the limma software. The paper is well-written and I was able to reproduce the authors’ analysis. It will be a useful workflow for people needing to develop an analysis of RRBS data. Like Simon Andrews, I initially struggled a little with some of the detail of the method itself. The method’s elegance and power, like all those based on (generalized) linear models, is driven by careful formulation of the design matrix and the choice of contrasts. Necessarily, the design matrix for analysing bisulfite-sequencing data is more complex than that used to analyse RNA-seq data from an identical experimental design. As I know the authors are well aware, getting the design matrix and contrasts correct is 95% of the battle for most people analysing data with edgeR and limma. I will explain my concerns below (many are the same as raised by Simon in his review). Main points p4: The initial example has no replicates. Since the method is designed for “any BS-seq data that includes some replication”, should this example include replicates? I appreciate the desire to keep the initial example simple (especially in light of my next comment). p4: I initially found the design matrix confusing. In fact, I had the same reaction/interpretation as Simon Andrews 4th comment: “In the small example description you say that A_MvsU estimates the log ratio for Sample1, but it wasn’t clear to me why this would apply to only Sample 1 since the factor has a 1 against the meth count for both samples 1 and 2”. I had to manually check a few quantities to convince myself, e.g., to rounding error, `coef(fit)[, "A_MvsU"]` is `logit((2 + prior.count) / (2 + prior.count + 12 + prior.count)`, where `prior.count = 0.125`. Because so much depends on constructing the appropriate design matrix, this description/section may warrant further explanation (e.g., comparing to some manually computed quantities). Like James MacDonald, although the code was clearly written, I was a little surprised that it didn’t use more consistent integration with existing Bioconductor packages and data structures. To add to his example, almost all the work in the section ‘Reading in the data’ can be achieved with `bsseq::read.bismark(fn)`, which will: read in an arbitrary number of Bismark `.cov.gz` files, appropriately combine samples with different sets of CpGs, and return a SummarizedExperiment-derived object (a BSseq instance) which could readily be used to construct the DGEList used in the analysis. In my experience, loading the data and combining different sets of loci is a step fraught with danger of hard-to-track-down errors, so it may be better to advise workflow users to use a fairly well-tested function. Full disclosure: I am the author of `bsseq::read.bismark()`. p14: The aggregation of CpGs to promoters may lead to surprising results. An (extreme) example: the first half of a promoter is methylated in one condition and unmethylated in the other, and vice versa for the second half of the promoter. In aggregate over the promoter the proportion of methylated CpGs may be similar in both conditions, yet this promoter is clearly differentially methylated. I think a note encouraging workflow users to think carefully about their hypothesis when doing this form of aggregation is warranted. Minor points ```R p1: “The most commonly used technology of studying DNA methylation is bisulfite sequencing (BS-seq)”. The Illumina 27k/450k/EPIC microarrays are the most commonly used ‘genome-wide’ assays for studying DNA methylation. However, (whole genome) BS-seq is arguably the gold standard genome-wide assay. p3: I think there’s some confusion about CpGs and CpG islands (CGI). Approximately 0.9% of dinucleotides in the human genome (hg19) are CpGs, and approximately 0.7% of the genome is a CGI (using UCSC CGIs, which is not the only definition but perhaps the standard); see code below: library(BSgenome.Hsapiens.UCSC.hg19) hg19_size <- sum(as.numeric(seqlengths(BSgenome.Hsapiens.UCSC.hg19)[ paste0("chr", c(1:22, "X", "Y"))])) # CpGs on chr1-22,chrX,chrY in hg19 n_CpGs <- Reduce(sum, bsapply(BSParams = new("BSParams", X = BSgenome.Hsapiens.UCSC.hg19, FUN = countPattern, exclude = c("M", "_")), pattern = "CG")) 100 * n_CpGs / hg19_size # CGIs in hg19 library(rtracklayer) my_session <- browserSession("UCSC") genome(my_session) <- "hg19" cgi <- track(ucscTableQuery(my_session, track = "cpgIslandExt")) sum(width(cgi)) / hg19_size ``` p3: Possible type, “with a large genome” p3: “WGBS is more suitable for studies where all CpG islands or promoters across the entire genome are of interest.” Might also add ‘distal regulatory elements’ and CG-poor regions (RRBS targets CG-rich regions of the genome). p3: BSmooth (implemented in bsseq) doesn’t use Empirical Bayes although it does use limma for linear regression p4: Missing a `library(edgeR)` in order for the code to work p4: There’s an extra parenthesis at the end of line 2 when constructing `dimnames(counts)` p4: The authors note that the method is “especially appropriate for RRBS data”. Is the main challenge for running on WGBS data that of computational resources? p5: Typo, “mythlyated” should be “methylated” Table 1: Condition should be ‘A’ or ‘B’ instead of ‘1’ or ‘2’ p6: Was Bowtie1 or Bowtie2 used as the Bismark backend for the mouse data? p8: The filtering step removes almost 90% of CpGs. Is this unavoidable, e.g., due to low sequencing coverage of these samples, or might the filtering be relaxed? Figure 1: Any thoughts for why the P8_6 sample is rather separated from the other Basal samples along dim2 of the MDS plot? Figure 2: What is the meaning of ‘average abundance of each CpG site’? Is ‘abundance’ interpretable as ‘sequencing depth’? p16: Possible typo, “Suppose the comparison of interest is the same as before” p22: In the DNA methylation literature, ‘up-methylated’ is typically called ‘hypermethylated’ and ‘down-methylated’ is typically called ‘hypomethylated’. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. Thank you for your thoughtful and positive comments on our article. We particularly wish to thank you for your comments on the introductory material, which we have rewritten to take account of all your suggestions. We have specified the version of Bowtie, adopted the “hyper” and “hypo” terminology and made all the other corrections. In this revision we have added a lot of new material to the expository section, including an example with replicates, while retaining the first example without replicates. The purpose of the first example is to introduce the key mathematical idea that underlies our article, which is the use of linear model terms to account for total read coverage at each CpG, and to relate this to the more familiar beta-binomial modeling approach. This idea is already manifest in the very small dataset without replicates, and so that is the most fundamental context in which to exposit the idea. The need for replicates relates to dispersion estimation, which is a separate issue. For pedagogic purposes it is best to isolate these concepts separately. We now demonstrate in the first example how the logFC is computed in complete arithmetic detail. We have switched the toy examples to use the group-mean parametrization instead of the treatment-parametrization so that the computation of logFC is explicit rather than implicit. See also our reply to Simon Andrews. We have introduced a new function modelMatrixMeth() to automate the construction of the design matrix. From a user point of view, the design matrix is now made exactly as it would be for an RNA-seq analysis, with the sample effects added automatically. We have also added a new section “Analogy with paired-samples expression analyses” to show that the entire linear modeling approach is equivalent to an (already well understood) type of RNA-seq experiment. The purpose of our article is to show that methylation data can be analyzed exactly like RNA-seq and we use the same packages and data structures as our earlier RNA-seq workflow article (Chen et al, 2017). Adding extra packages and data structures would complicate rather than help our workflow. We are aware of your read.bismark() function in the bbseq package. When run on our data, read.bismark() produces a number of alarming warnings suggesting that incorrect genome annotation has been used for some of the samples. These warnings are false alarms, but would worry users, so we would have to either suppress them or explain them to readers, were we to adopt read.bismark. We are also aware of the readBismark() function in the BiSeq package and the methRead() function in the methylKit package. It seems that every Bioconductor package that analyses BS-seq has its own read function. The functions read the methylation data into a binomial format, whereas the whole point of our article is to show that the methylated and unmethylated counts can be analysed as they are instead of doing a binomial analysis. All produce data structures that are incompatible with edgeR. We could wrangle the data formats back to a DGEList, but the end result would be code that is no shorter or simpler than our original plain base R code. In fact the code would be harder to follow because of the extra packages and data structures involved. It seems unfortunate that the Bioconductor methylation packages all create their own read functions from scratch in order to create their own specialist S4 objects. Each defines a complex S4 class, unique to that one package. If Bioconductor provided a fast, light-weight, base-R style function to read Bismark files, then all the packages could use that as building block instead of re-writing from scratch. In this revision, we have decided to create a new function readBismark2DGE() to read the Bismark files directly into a DGEList object, substantially simplifying our workflow for users. The new function is faster than any existing options and over 3 times as fast as read.bismark() on the workflow data. The new function creates simple list output with minimal processing and so serves the purpose of a light-weight base-R style function. You comment on aggregation over promoters and guidance about hypotheses. We have added more text to clarify that aggregation of counts over gene promoters tests for overall changes only, but we think that users would have expected this already and shouldn’t be surprised. The thought example you give is indeed extreme and does not match real that we have seen. Even if such a phenomenon could occur, our workflow includes a CpG-level analysis, so readers can easily check for CpGs changing in different directions within a promoter if that is of interest to them. We did this check ourselves and found that for 99% of genes whose promoters contain any DM CpGs, all the DM CpGs change in the same direction. In almost all the remaining cases, only one CpG changes in the “wrong” direction. The only exception is Mir1298 on Chromosome X that has two CpGs changing in the other direction to the majority. Even if there were opposing effects within a few gene promoters, it wouldn’t invalidate our aggregate analysis. It would simply mean that exceptional genes would have to analyzed and interpreted individually to understand the fully story, and we see no problem with that. We think there is a role for both aggregate analyses and high-resolution analyses of DNA methylation. However we think that aggregate analyses are under-used in the literature and should be encouraged more than at present. We think that genomic analysts sometimes try to accommodate too much complexity, leading to analyses that lack statistical power. This in turn leads to the danger of cherry picking, possibly over-interpreting results that might not be reproducible. Aggregate analyses ignore some of the complexity, but can lead to more reliable and interpretable “big picture” messages. In this article, our promoter region aggregated analysis leads to a clean and reliable correlation of methylation with RNA-seq results, a correlation that is more difficult to achieve by other means. Our remark that our method is especially appropriate for RRBS is prompted by the thought that researchers engaged in a WGBS study may have slightly different aims, and this goes back to your comment about hypotheses. In particular, it may be important in a WGBS analysis to discover DMRs in a de novo fashion, pooling results for consecutive genomic loci without the use of gene annotation. We have added remarks in the Introduction and the Discussion to clarify our aims and to briefly indicate possible extensions. There is no problem with computational resources – our pipeline is reasonably efficient. In this revision, we have relaxed the filtering threshold to 8 instead of 10 reads per CpG, although it makes no material difference to our analysis. While the filtering does indeed remove 85% of the CpGs, it keeps more than half the counts. The number of CpGs retained in the analysis is plenty for our purposes. In this revision, we form the MDS plot in a slightly better way and P8_6 is no longer an outlier. In Figure 2, the meaning of “average abundance” is explained by the axis label as log count per million, i.e., coverage for that CpG divided by sequencing depth. In effect, it is the coverage for each CpG relative to other CpGs. It is independent of sequencing depth. This is primarily a software pipeline article, showing how to use the Bioconductor edgeR package to analyze RRBS data, but to a certain extent is also a methods paper, as to my knowledge this is the first proposal for directly analyzing count data rather than converting to either ratios and using a beta-binomial, or to logits and using conventional linear modeling. This is an interesting idea, and should be explored further, but for this manuscript the main goal is to present the software pipeline. The authors progress through each step of the pipeline, clearly describing each step as well as providing code (and links to the underlying data), so readers can easily understand the process and get some hands on experience as well. The code is clearly written, and as straightforward as one could expect for a relatively complex analysis. However, I would prefer to see more consistent integration with other Bioconductor packages. In particular, when reading in the raw data, the authors use a clever trick to account for the fact that not all samples have reads for the same genomic positions. This step could just as easily be accomplished using the Bioconductor GenomicRanges package, which is intended for manipulating genomic data. In fact, the authors use GenomicRanges later in the pipeline to subset the methylation data to just gene promoter sites, so it would be more natural to start with a GRanges if you will need one later anyway. Otherwise this is a good article that clearly shows how one could use an innovative method to analyze RRBS data using the edgeR package. Typos: Under a small example section, (BvsA_MvsU) estimates the difference in logit proportions of mythylated I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. Thank you for your positive report on our article. You’re right, there is a novel methods proposal in our article, which we will seek to justify in more detail in a methodological article elsewhere, but the current article concentrates on the software pipeline. In this revision, we have introduced new functions readBismark2DGE() and nearestTSS() to simplify the workflow. These functions allow us to eliminate from the workflow any code chunks that involve substantial programming, making the workflow considerably easier for users. We felt the original code chunk to find CpGs in promoters in particular was too complex for the purposes of our article and assumed too much user knowledge. I have used GenomicRanges for other projects ( https://f1000research.com/articles/4-1080 ) but it doesn’t help with the current workflow. The file reading code would be similar in length and speed with or without GenomicRanges, so we tend to give preference to R base. In any case, we have eliminated that code entirely, as explained in our discussion with Peter Hickey. The new nearestTSS() function is more informative than the old code, as well as being fast. It has the added advantage of using the Bioconductor organism package, which has more up-to-date TSS annotation than the transcript package. Chen et al present an interesting re-application of the EdgeR analysis package to the analysis of bisulphite sequencing data. The method they propose would utilise the existing negative biomial models within EdgeR and would potentially provide the power which comes with the linear model framework to bisulphite data. The method described requires no changes to EdgeR itself, and merely describes a suitable formulation of design matrix to allow this to be applied to bisulphite data. The article is generally well written and the authors go to great lengths to break down and describe the method. They also provide a page from which all of the underlying data and code can be obtained and I was able to reproduce the results, and independently verify them in a parallel analysis. The main thing which I struggled with was some of the detail in the description of the method itself. There were some parts which I wasn't clear on, and some nomenclature which didn't help in understanding the explanation. I'll try to lay out my concerns below: 1) In the small example I completely understand that the authors wanted to keep this as simple as possible, but it might have helped to have had 2 samples per condition so that the full complexity of the method is visible. 2) There is a typo in the code for the small example so it doesn't run as is. The list function on line 2 has an extra bracket at the end. 3) The nomenclature in the small example is inconsistent. You have samples 1 and 2, but (in the table) also conditions 1 and 2, but in the code the conditions are A and B. If you had Samples 1,2,3,4 in conditions A and B this might help to alleviate some of the confusion. 4) In the small example description you say that A_MvsU estimates the log ratio for Sample1, but it wasn't clear to me why this would apply to only Sample 1 since the factor has a 1 against the meth count for both samples 1 and 2 In the expanded examples there were also some points on which I wasn't clear. 5) You calculate a single dispersion parameter for all data points and say that in contrast to RNA-Seq there is no global trend to follow. It wasn't clear to be exactly why this is since read count and methlyation level would all affect the dispersion - is it simply because these factors are explicitly accounted for in the linear model? 6) In the design matrix for the RRBS it wasn't clear why the first column was all 1s, whereas the rest obviously matched the condition from which they came. This also contrasted with the simple example where the structure wasn't like this. Is this because you were comparing both P7 and P8 to P6? 7) I think this is possibly the same thing as point 4, but you say that the Me column represents the methylation level in P6, but again this highlights the methylated values in all samples, so why only P6? For the final results obtained it would have been nice to show the general level of concordance with running the same analysis through one of the beta-distribution models to either show general agreement, or to generally explain any major differences. Minor points: In the introduction you say that "40% of mammalian genes and 70% of human genes have CpG islands enriched in their promoter regions". Enriched probably isn't the right word to use (or you need to say that CpGs are enriched rather than islands). The difference between 'humans' and 'mammals' is also somewhat contentious - non-human mammals certainly have weaker CpG islands which get missed by CpG island prediction tools, but for example in mouse Illingworth et al showed that if you use CpG binding protein ChIP that you can see about the same number of islands in both species. It's also not really fair to say that CpG methylation in promoters is "generally" associated with repression of transcription. There is a categorical expression level shift associated with the presence/absense of CpG islands, but you can make a Dnmt1 knockout which removes pretty much all methlyation from the genome and for the vast majority of genes their transcription is completely unaffected. P3 "with large genome" should be "with large genomes" P5 "mythylated" should be "methylated" I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. Thank you for your thoughtful comments on our article. We have rewritten the Introduction to take account of your comments and have fixed all typos. We have also expanded the expository section on the NB linear modeling approach to five pages instead of one. The purpose of the first toy example (Table 1) is to introduce the key mathematical idea that underlies our article, which is the use of linear model terms to account for read coverage at each CpG, and to relate this to the more familiar beta-binomial modeling approach. This idea is already manifest in the very small dataset without replicates, making the one-sample example the most fundamental and effective context in which to introduce the idea. We agree however that the condition labels in Table 1 should be A and B. Following your remarks, we have included a second example with 2 samples per condition (Table 2). Keep in mind though that the edgeR approach to methylation however is very powerful and flexible and its full complexity is still far from being visible in a single locus example like this. To see more of the features in play, one has to work through the full data example. We have however taken the opportunity to preview more ideas in the expanded linear modeling introduction, including dispersion estimation and the use of contrasts. You asked about the A_MvsU coefficient computation. This is a standard linear modeling result in R but one that does confuse a lot of people when they see it for the first time. You will see analogous results explained in the limma and edgeR User’s Guides. Anyway, we have rewritten the first example in a different way so that the coefficients are defined more explicitly and their calculation is explained in full arithmetic detail. The original implicit formulation of the logFC is now presented in the section “Another way to make the design matrix”. Consider the last two columns of the design matrix in that section, which are now called “Intercept” and “ConditionB”, and write β A and β BvsA for the corresponding two coefficients. The interpretation of these coefficients becomes clear by multiplying the coefficients by the design matrix. For samples corresponding to condition A, the methylation level is represented by β A alone because the last column of the design matrix is zero (rows 1 and 3). When condition B is applied, the methylation is modelled by β A + β BvsA (rows 5 and 7). So it is apparent that β A must represent the methylation status for condition A while β BvsA0 must be the difference in methylation between B and A. Users don’t need to follow this mathematics, we just present it here for completeness. You also asked about the first column of the design matrix. It doesn’t matter whether the first column is all 1s or not, as long as the leading columns of the design matrix span the sample effects. We have added a small example on pages 3-4 to demonstrate this. The first few columns of the design matrix model the sample coverages, and these columns are quite independent of the remaining columns which model the treatment conditions. We can compare the treatment conditions in any way we wish regardless of how the design matrix is parametrized. We have added some material to the introduction to try to explain this but, in any case, it isn’t something that a user needs to worry about. In this revision, we have introduced a new function modelMatrixMeth to create the design matrix automatically, so users now only need to focus on the treatment conditions, not on how to adjust for the sample effects. There is no reason why the dispersion should be a function of read count or of methylation level. The purpose of the NB statistical model is to capture the technical mean-variance trends so that the dispersion parameter can be interpreted independently of these things. We have added a sub-section on dispersion estimation to page 7 to try to make this clear. Regarding concordance with other methylation analysis software, we do not know of any beta-binomial modeling software that is able to conduct an equivalent analysis to that presented in our article. We have formed a contrast between luminal cells and the two basal populations, and no other software can do that as far as we know. This is an example of the extra flexibility provided by our approach. The DSS Bioconductor package is documented to be able to form general contrasts between treatment conditions, but this is not effective in practice because it does not distinguish hyper from hypo methylation for a contrast. Our experience is that edgeR gives better DM results than beta-binomial software but it would be unfair for us to claim that in our article without undertaking a full comparison study with rigorous benchmarking. Our plan is to publish such a comparison elsewhere.

43 in total

1. Profiling DNA methylome landscapes of mammalian cells with single-cell reduced-representation bisulfite sequencing.

Authors: Hongshan Guo; Ping Zhu; Fan Guo; Xianlong Li; Xinglong Wu; Xiaoying Fan; Lu Wen; Fuchou Tang
Journal: Nat Protoc Date: 2015-04-02 Impact factor: 13.491

Review 2. Keeping abreast of the mammary epithelial hierarchy and breast tumorigenesis.

Authors: Jane E Visvader
Journal: Genes Dev Date: 2009-11-15 Impact factor: 11.361

3. Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling.

Authors: Hongcang Gu; Zachary D Smith; Christoph Bock; Patrick Boyle; Andreas Gnirke; Alexander Meissner
Journal: Nat Protoc Date: 2011-03-18 Impact factor: 13.491

4. limma powers differential expression analyses for RNA-sequencing and microarray studies.

Authors: Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth
Journal: Nucleic Acids Res Date: 2015-01-20 Impact factor: 16.971

5. Comprehensive analysis of DNA methylation data with RnBeads.

Authors: Yassen Assenov; Fabian Müller; Pavlo Lutsik; Jörn Walter; Thomas Lengauer; Christoph Bock
Journal: Nat Methods Date: 2014-09-28 Impact factor: 28.547

6. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

7. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis.

Authors: Alexander Meissner; Andreas Gnirke; George W Bell; Bernard Ramsahoye; Eric S Lander; Rudolf Jaenisch
Journal: Nucleic Acids Res Date: 2005-10-13 Impact factor: 16.971

8. BRAT: bisulfite-treated reads analysis tool.

Authors: Elena Y Harris; Nadia Ponts; Aleksandr Levchuk; Karine Le Roch; Stefano Lonardi
Journal: Bioinformatics Date: 2009-12-22 Impact factor: 6.937

9. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions.

Authors: Kasper D Hansen; Benjamin Langmead; Rafael A Irizarry
Journal: Genome Biol Date: 2012-10-03 Impact factor: 13.583

10. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments.

Authors: Egor Dolzhenko; Andrew D Smith
Journal: BMC Bioinformatics Date: 2014-06-24 Impact factor: 3.169

32 in total

1. Smchd1 is a maternal effect gene required for genomic imprinting.

Authors: Iromi Wanigasuriya; Quentin Gouil; Sarah A Kinkel; Andrés Tapia Del Fierro; Tamara Beck; Ellise A Roper; Kelsey Breslin; Jessica Stringer; Karla Hutt; Heather J Lee; Andrew Keniry; Matthew E Ritchie; Marnie E Blewitt
Journal: Elife Date: 2020-11-13 Impact factor: 8.140

2. A Practical Guide to the Measurement and Analysis of DNA Methylation.

Authors: Benjamin D Singer
Journal: Am J Respir Cell Mol Biol Date: 2019-10 Impact factor: 6.914

3. BCurve: Bayesian Curve Credible Bands Approach for the Detection of Differentially Methylated Regions.

Authors: Chenggong Han; Jincheol Park; Shili Lin
Journal: Methods Mol Biol Date: 2022

4. Maternal undernutrition during periconceptional period affects whole-genome ovine muscle methylation in adult offspring.

Authors: Emanuele Capra; Paola Toschi; Marcello Del Corvo; Barbara Lazzari; Alessandra Stella; John Lewis Williams; Pasqualino Loi; Paolo Ajmone Marsan
Journal: J Anim Sci Date: 2022-08-01 Impact factor: 3.338

5. Decoupling Yeast Cell Division and Stress Defense Implicates mRNA Repression in Translational Reallocation during Stress.

Authors: Yi-Hsuan Ho; Evgenia Shishkova; James Hose; Joshua J Coon; Audrey P Gasch
Journal: Curr Biol Date: 2018-08-02 Impact factor: 10.834

6. Transcriptomic and epigenomic effects of insoluble particles on J774 macrophages.

Authors: T Emi; L M Rivera; V C Tripathi; N Yano; A Ragavendran; J Wallace; Alexey V Fedulov
Journal: Epigenetics Date: 2020-10-30 Impact factor: 4.528

7. Exercise during pregnancy mitigates negative effects of parental obesity on metabolic function in adult mouse offspring.

Authors: Rhianna C Laker; Ali Altıntaş; Travis S Lillard; Mei Zhang; Jessica J Connelly; Olivia L Sabik; Suna Onengut; Stephen S Rich; Charles R Farber; Romain Barrès; Zhen Yan
Journal: J Appl Physiol (1985) Date: 2020-12-17

8. Calling differentially methylated regions from whole genome bisulphite sequencing with DMRcate.

Authors: Timothy J Peters; Michael J Buckley; Yunshun Chen; Gordon K Smyth; Christopher C Goodnow; Susan J Clark
Journal: Nucleic Acids Res Date: 2021-11-08 Impact factor: 16.971

9. Characterizing the properties of bisulfite sequencing data: maximizing power and sensitivity to identify between-group differences in DNA methylation.

Authors: Jonathan Mill; Eilis Hannon; Dorothea Seiler Vellame; Isabel Castanho; Aisha Dahir
Journal: BMC Genomics Date: 2021-06-15 Impact factor: 3.969

10. Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR.

Authors: Yunshun Chen; Bhupinder Pal; Jane E Visvader; Gordon K Smyth
Journal: F1000Res Date: 2017-11-28