
MAGI: Methylation analysis using genome information.

Abstract

By incorporating annotation information into the analysis of next-generation sequencing DNA methylation data, we provide an improvement in performance over current testing procedures. Methylation analysis using genome information (MAGI) is applicable for both unreplicated and replicated data, and provides an effective analysis for studies with low sequencing depth. When compared with current tests, the annotation-informed tests provide an increase in statistical power and offer a significance-based interpretation of differential methylation.


Keywords: annotation informed; differential methylation; epigenetics; epigenomics; statistical bioinformatics; testing methylation

Substances：
Cytosine

Year: 2014 PMID： 24589664 PMCID： PMC4063829 DOI： 10.4161/epi.28322

Source DB: PubMed Journal: Epigenetics ISSN： 1559-2294 Impact factor: 4.528

Introduction

DNA methylation (herein, methylation) is an important, heritable epigenetic modification that is known to influence gene expression, X-inactivation, and cellular differentiation in higher eukaryotes. Next-generation sequencing (NGS) technologies, such as MethylC-seq, make it feasible to investigate methylomes at the cytosine level and provide unparalleled insight into the role and function of DNA methylation in a variety of organisms. Two recent studies investigated the cost, genome coverage, maximum resolution, and quality measures for a variety of NGS approaches as applied to DNA methylation. Although the conclusions from these studies guided our choice of technology for this application (i.e., MethylC-seq is typically considered the gold standard for methylation analysis), the proposed approach is applicable to a variety of NGS technologies. It is well accepted that methylation patterns can differ vastly between genomic regions (e.g., genes, promoters, intergenic regions). Changes in methylation between conditions give rise to epigenomic discoveries that are uniquely related to the organization and control of the genome. Interestingly, current epigenomic investigations do not incorporate genome organization into the actual quantitative analysis for testing differences between methylation profiles. In fact, annotation is typically consulted after the quantitative results are obtained, and only for the purpose of gaining genomic context. One of the most common approaches for testing differential methylation is a sliding window approach that compares, at the cytosine level, the observed methylation levels to known/annotated regions of methylation. This type of approach unfortunately leads to greater intraclass variability and less informative conclusions, simply because the windows are artifacts of the analysis and may in fact overlap multiple annotation regions simultaneously.
As an improvement, we utilize existing annotation information to enhance the performance of testing for differences in methylation. The proposed approach is particularly useful for unreplicated data, and while we focus on the benefit of incorporating annotation using Fisher’s Exact Test (FET) for unreplicated data, the extension to replicated data is straightforward (see Methods).

Results and Discussion

We present methylation analysis for MethylC-seq data using two approaches, collectively referred to as Methylation Analysis using Genome Information (MAGI), both of which rely on an annotated genome. MethylC-seq involves a bisulfite treatment step, which converts unmethylated cytosines to uracils (ultimately read as thymines), on each fragmented read prior to sequencing. As a result, the methylation level at each cytosine on the genome can be estimated by comparing the number of methylated and unmethylated cytosines on the sequenced reads. The first approach employs FET at the cytosine level using the number of (NGS) methylated reads, among a total number of reads, mapped to each cytosine in a known annotated region (Fig. 1). The false discovery rate (FDR) is controlled by applying multiple testing corrections to the cytosine level tests within each region. Results are summarized across each annotated region, and if the proportion of differentially methylated cytosines exceeds some ad hoc threshold (e.g., 10%), the region is declared differentially methylated. We refer to this standard approach as the MAGIC approach. MAGIC can be thought of as a special case of the sliding window approaches, where the non-overlapping genomic ‘windows’ represent regions of homogeneous methylation profiles; in this sense, MAGIC represents the most powerful, best-case sliding window scenario. Since each cytosine is tested individually, the region-level summary is a comparison of methylation patterns between treatment groups. While this exploration is centered on regions that define a gene, it is natural to extend the approach to other annotations (e.g., promoters, exons, intergenic regions). Since MAGIC compares the patterns of methylation between treatments, it is suitable for regions where a classification of “methylated” or “unmethylated” is not of interest (e.g., intergenic regions). In these cases, an appropriate partitioning of the region could be developed to further investigate long genomic regions if intraclass variability is of concern; if not, MAGIC could be easily adapted to incorporate the sliding window approach over these regions.

Figure 1. Representation of the data structure and testing framework for NGS differential methylation studies (forward strand shown). For each cytosine, the number of methylated reads (filled circles) and unmethylated reads (unfilled circles) are recorded. These values are also recorded in binary representation for each cytosine, where a filled triangle indicates that the proportion of methylated reads for the given cytosine has exceeded a predetermined threshold (e.g., 40%), and an unfilled triangle indicates that this proportion was not exceeded. Tests for differential methylation are performed for each individual cytosine using the read information with subsequent summarization over the region (MAGIC), or with a single region-level test (MAGIG) using the summarized read information.

The second approach summarizes the methylation status for each cytosine based on the observed proportions of methylation (see Methods), which allows for a marginalization over each annotated region that is then tested with a single FET (Fig. 1). FDR is controlled across the genome with multiple testing corrections applied to annotated regions. We call this data-adaptive approach MAGIG. By contrast to MAGIC, the MAGIG approach minimizes the number of statistical tests that are conducted and benefits from a significance-based interpretation of differential methylation. In addition, the region-level summary from MAGIG represents a comparison of overall methylation levels between treatment groups.

Data were simulated for both unreplicated and replicated scenarios. For each simulation the FDR is controlled at 5%. The statistical power is estimated as the average true positive rate over 1000 simulated data sets for each scenario (see Methods). For unreplicated experiments, power gains for MAGIG are most evident when either correlation in consecutive cytosine-to-cytosine methylation status decreases (Fig. 2; “Medium” or “Low”), or the differences in methylation level at each cytosine decrease (Fig. 2; “Small” or “Medium”). In these cases, the power of the MAGIG method can be upward of 40% greater than that of the MAGIC approach. These simulations also illustrate modest power increases when the average sequencing depth increases from 7 to 15 (representing the range of depths typically employed); when differences in cytosine methylation levels are large (Fig. 2; “Large”) there is little gain in statistical power from additional sequencing depth.

Figure 2. Simulation results in unreplicated settings. Panel columns represent the separation of binomial probabilities for read methylation (Online Methods). Panel rows represent the transition matrices used in the Hidden Markov Model (HMM) process used to generate cytosine methylation status (Online Methods). Increases in the statistical power of MAGIG over MAGIC are evident across the simulation settings. Observed false discovery rates (FDRs) are lower in MAGIG (2–19%) when compared with MAGIC (4–43%). However, both FDRs increase with greater separation of binomial probabilities and decrease with greater correlation between cytosines. Modest statistical power increases are observed when the average sequencing depth is increased from 7 to 15 with similar observed FDRs.

In a real data application we reanalyzed the unreplicated Arabidopsis methylcytosine data from Lister et al. (2008) that compared wild-type (Col-0) lines to methylation-deficient mutants (met1–3). We considered all cytosines (with at least one read) in both samples, including those that demonstrated no evidence of methylation in either sample. Gene start and stop locations define the annotated genomic regions and were based on the Columbia reference genome. We applied MAGIC and MAGIG to the MethylC-seq data for each gene, and assessed the differential methylation detection rate after FDR corrections for each method, as defined above. As MAGIC represents the optimal sliding window scenario due to methylation profiles varying substantially between genomic regions, and since only gene regions were investigated to facilitate interpretation of results, comparisons with the sliding window approach from Lister et al. (2008) have been omitted. Due to fewer tests and more information per test, the MAGIG approach provides many more statistically significant results than the MAGIC approach (Methods, Table 4). These results reinforce that met1–3 mutants have defective methylation maintenance when compared with the wild-type (Col-0) controls, and are consistent with the average rate of genic methylation in the wild-type and mutant lines and simulation power estimates (Fig. 2).

Table 4. Exploration and impact of low-coverage filtering on significance results from MAGIC and MAGIG for 33,759 analyzed gene regions

Filtering Level	% Filtered	MAGI_C	MAGI_G	Intersection
No Filtering	0	216	3146	181
5	40	612	2926	310
7	51	669	2132	275
10	67	651	1528	246

Arabidopsis data (Col-0 vs. met1–3) from Lister et al. (2008) were analyzed using both MAGIC and MAGIG with varying degrees of low-coverage filtering. Significance thresholds of 0.10 and 0.05 were employed as example thresholds for each method, respectively. The “Filtering Level” represents the threshold for which individual cytosines are removed from downstream analyses, while “% Filtered” indicates the percentage removed as a result of the filtering. Specifically, if the coverage of a given cytosine is below this threshold in either sample, the cytosine information is not used. As the filtering becomes more strict (i.e., higher filtering level), the number of significant subsets decreases using MAGIG, and increases using MAGIC. A balance between increased detection for MAGIC and decreased detection for MAGIG occurs when the filtering level is set to 5.

Single-cytosine analyses of methylation using NGS technologies encounter two primary challenges when summarizing to region-level results, namely dependence in methylation status between cytosines and the discrete nature of test statistics (and associated P values). Several methods have been introduced to combine P values over regions under dependence, including weighted approaches that could account for differences in sequencing depths, but each method assumes that the P values were generated from multivariate t (or Z) distributions. On the other hand, FETs produce discrete, non-uniform P values under the null distribution, and as such, the distributional assumptions for combining P values are violated. The approaches recently proposed by Hebestreit et al. (BiSeq) and Pedersen et al. (comb-p) both employ variations on Stouffer’s method to combine P values from bisulfite methylation data, but it is unclear whether the distributional assumptions are reasonable in either case.
Although BiSeq improves upon the assumptions from Stouffer’s method by first smoothing the methylation data over a genomic range, the models employed require larger data sets than are required by either MAGI approach. Further, both MAGI approaches can be applied to studies without replication. In order to more closely satisfy distributional assumptions, researchers often rely on increased average sequencing depth. Increasing average sequencing depth in unreplicated studies has distinct advantages when testing individual cytosines, but the benefits all but disappear when differential methylation is considered over specific annotated regions (Fig. 2). This is due in part to the use of nominal significance thresholds (e.g., α = 0.05). When dealing with low coverage at individual cytosines, the discreteness of the P value space covered by FET typically leads to a reduction in statistical power when compared with unconditional exact tests (e.g., Barnard's Test), and this loss is most pronounced when the significance threshold is fixed. Fortunately, these issues dissipate when the column marginals are large (as is the case for MAGIG testing), because the discreteness of FET is then less pronounced. Combining dependent, discrete P values over genomic regions is an approach that we are currently investigating since it has the potential to further improve statistical inference in differential methylation studies beyond the gains observed in MAGIG testing.

To explore the effects of low coverage of individual bases on detection of methylation differences, we filtered cytosines that had observed read depths of less than a specific low-count threshold (i.e., 5, 7, or 10) for either sample (Col-0 and met1–3). Overall, appreciable changes in methylation detection for each method were found, indicating that moderate low-count filtering (filtering level 5) is a reasonable approach to increase detection rate for the MAGIC approach and to distill the results from the MAGIG approach (Methods, Table 4). Excessive filtering (i.e., filtering levels 7 and 10) yields little benefit to MAGIC, however, and may be too extreme for MAGIG. The dramatic differences between the MAGIC and MAGIG results highlight the inferential distinctions between the two methods. Specifically, MAGIC may be better suited to exploring differences in methylation patterns, while MAGIG is more appropriate when testing for differences in methylation prevalence. In both cases, genomic context provides useful boundaries for region-level summaries.
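The discreteness issue discussed above can be made concrete with a small worked example. The depths used here are hypothetical and chosen only to keep the arithmetic short. Suppose a cytosine is covered by 3 reads in each sample. The most extreme outcome is 3 methylated reads in one sample and 0 in the other, and under fixed marginals its hypergeometric probability and two-sided FET P value are:

```latex
\[
P(\text{most extreme table})
  = \frac{\binom{3}{3}\binom{3}{0}}{\binom{6}{3}}
  = \frac{1}{20},
\qquad
p_{\text{two-sided}} = 2 \times \frac{1}{20} = 0.10 > 0.05 .
\]
```

At this depth no configuration of reads can reach significance at α = 0.05, which illustrates why a cytosine-level test loses power at low coverage while a region-level table with larger margins does not face the same floor.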

Methods

MethylC-seq data can be represented at the cytosine level as the cumulative number of methylated and unmethylated sequencing bases covering a specific cytosine. For both unreplicated data and replicated data the analysis can be performed in two ways: focus on cytosine level tests and summarize to the genomic region, or summarize the cytosine level information and test once over the whole region (Table 1).

Table 1. Representation of the data structure and testing framework for MAGI differential methylation studies

			Cytosine Index
	Rep.	1	c	C_g		Summary Information
Trt1	1	(m_{111g}, D_{111g})	(m_{11cg}, D_{11cg})	(m_{11Cg}, D_{11Cg})	→	(M_{11g} = \sum_{c=1}^{C_g} I(m_{11cg}/D_{11cg} \geq T_S), C_g)
Trt1	j	(m_{1j1g}, D_{1j1g})	(m_{1jcg}, D_{1jcg})	(m_{1jCg}, D_{1jCg})	→	(M_{1jg} = \sum_{c=1}^{C_g} I(m_{1jcg}/D_{1jcg} \geq T_S), C_g)
Trt1	J	(m_{1J1g}, D_{1J1g})	(m_{1Jcg}, D_{1Jcg})	(m_{1JCg}, D_{1JCg})	→	(M_{1Jg} = \sum_{c=1}^{C_g} I(m_{1Jcg}/D_{1Jcg} \geq T_S), C_g)
Trt2	1	(m_{211g}, D_{211g})	(m_{21cg}, D_{21cg})	(m_{21Cg}, D_{21Cg})	→	(M_{21g} = \sum_{c=1}^{C_g} I(m_{21cg}/D_{21cg} \geq T_S), C_g)
Trt2	j	(m_{2j1g}, D_{2j1g})	(m_{2jcg}, D_{2jcg})	(m_{2jCg}, D_{2jCg})	→	(M_{2jg} = \sum_{c=1}^{C_g} I(m_{2jcg}/D_{2jcg} \geq T_S), C_g)
Trt2	J	(m_{2J1g}, D_{2J1g})	(m_{2Jcg}, D_{2Jcg})	(m_{2JCg}, D_{2JCg})	→	(M_{2Jg} = \sum_{c=1}^{C_g} I(m_{2Jcg}/D_{2Jcg} \geq T_S), C_g)

For each treatment i, replicate j, cytosine c, and gene g, the number of methylated reads (m_{ijcg}) and the sequencing depth (total number of reads mapped to the cytosine, D_{ijcg}) are recorded. MAGIC tests for differential methylation at each cytosine using a Fisher’s Exact Test (no replicates) or a logistic regression (replicates); if the proportion of positive base-pair decisions exceeds a predefined threshold, the subset is declared differentially methylated. MAGIG first summarizes the read information for each cytosine within each treatment and replicate, and then performs tests on this summarized information. For treatment i and replicate j for gene g, M_{ijg} represents summary information on the number of cytosines for which m_{ijcg}/D_{ijcg} exceeds a predetermined threshold T_S. Given M_{ijg} and the subset length C_g, tests similar to those used for the base-pair level framework (MAGIC) can be employed.

Cytosine level analysis (MAGIC)

We employ Fisher's Exact Test (FET) when testing unreplicated data because it makes no asymptotic assumptions and generally performs similarly to either Wald's Test or the methods proposed by Audic and Claverie. Under the assumption of fixed marginals, the FET for cytosine level differential methylation (MAGIC) tests the hypotheses

\[ H_0: \theta_{cg} = 1 \quad \text{versus} \quad H_1: \theta_{cg} \neq 1, \]

where θ_{cg} = [π_{1cg}(1 − π_{2cg})] / [π_{2cg}(1 − π_{1cg})] and π_{icg} is the true methylation level for cytosine c in gene g for treatment i (Trti). If the estimated odds ratio,

\[ \hat{\theta}_{cg} = \frac{m_{11cg}\,(D_{21cg} - m_{21cg})}{m_{21cg}\,(D_{11cg} - m_{11cg})}, \]

where m and D are defined as in Table 1, differs significantly from 1, then cytosine c in gene g is differentially methylated. When the results from all cytosines in a given region are taken together, if the proportion of differentially methylated cytosines is above a pre-specified threshold (say, 0.10), the region is said to be differentially methylated. When biological replication is available, logistic regression can be applied with hypotheses analogous to the unreplicated case; the test for cytosine level differential methylation then concerns the treatment effect in the logistic model.
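The cytosine-level procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the two-sided P value sums the probabilities of all tables no more likely than the observed one, and the Benjamini-Hochberg adjustment is an assumption, since the text only says that multiple testing corrections are applied within each region.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values (assumed correction method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def magic_region(trt1, trt2, alpha=0.05, min_prop=0.10):
    """Region call from per-cytosine (methylated_reads, depth) pairs.

    trt1 and trt2 list the same cytosines in the same order. Returns the
    proportion of significant cytosines and whether it exceeds min_prop.
    """
    pvals = [
        fisher_exact_two_sided(m1, d1 - m1, m2, d2 - m2)
        for (m1, d1), (m2, d2) in zip(trt1, trt2)
    ]
    qvals = benjamini_hochberg(pvals)
    prop_sig = sum(q <= alpha for q in qvals) / len(qvals)
    return prop_sig, prop_sig > min_prop
```

For example, `magic_region([(9, 10), (8, 10), (1, 10), (5, 10)], [(0, 10), (1, 10), (1, 10), (5, 10)])` tests four cytosines at depth 10 and flags the region when more than 10% of them are significant after adjustment.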

Genome region level analysis (MAGIG)

The MAGIG approach summarizes methylation status for a genomic region by first inferring a binary representation of methylation status for each cytosine in a given gene, within (treatment) groups. If the proportion of methylated reads is above a given threshold τ (e.g., 0.40), the cytosine is considered methylated. The threshold τ can be set a priori, or determined empirically. Here, τ is defined as the mean of the two cluster centroids as determined through k-means (k = 2) clustering on the observed methylation proportions for each chromosome and strand. The gene level information for both groups is then summarized into a 2 × 2 table, where the rows represent methylated and unmethylated cytosine status and the columns represent treatment groups. The FET for this scenario tests

\[ H_0: \theta_{g} = 1 \quad \text{versus} \quad H_1: \theta_{g} \neq 1, \]

where θ_{g} = [π_{1g}(1 − π_{2g})] / [π_{2g}(1 − π_{1g})] and π_{ig} is the true methylation level for gene g for treatment i. If the estimated odds ratio,

\[ \hat{\theta}_{g} = \frac{M_{11g}\,(C_g - M_{21g})}{M_{21g}\,(C_g - M_{11g})}, \]

where M and C are defined as in Table 1, differs significantly from 1, gene g is differentially methylated. When biological replication is available, logistic regression can be applied with hypotheses analogous to the unreplicated case; the test for gene level differential methylation then concerns the treatment effect in the logistic model.
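A minimal sketch of the region-level procedure follows, again in standard-library Python with hypothetical function names. Two simplifications are assumptions rather than details from the text: the 2-means clustering is run on whatever list of proportions is passed in (the paper clusters per chromosome and strand), and a cytosine is counted as methylated when its proportion is at least τ, matching the indicator in Table 1. Every cytosine is assumed to have depth of at least one read.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def two_means_threshold(props, max_iter=100):
    """Mean of the two centroids from 1-D k-means with k = 2."""
    lo_c, hi_c = min(props), max(props)
    for _ in range(max_iter):
        mid = (lo_c + hi_c) / 2  # nearest-centroid boundary in one dimension
        low = [p for p in props if p <= mid]
        high = [p for p in props if p > mid]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo_c, hi_c):
            break
        lo_c, hi_c = new_lo, new_hi
    return (lo_c + hi_c) / 2


def magig_region(trt1, trt2, tau):
    """Single FET on counts of methylated cytosines in one region.

    trt1 and trt2 are lists of (methylated_reads, depth) per cytosine.
    """
    m1 = sum(m / d >= tau for m, d in trt1)  # M_{11g}
    m2 = sum(m / d >= tau for m, d in trt2)  # M_{21g}
    return fisher_exact_two_sided(m1, len(trt1) - m1, m2, len(trt2) - m2)


# Toy region: 10 cytosines per sample, 8 clearly methylated in treatment 1 only.
trt1 = [(8, 10)] * 8 + [(1, 10)] * 2
trt2 = [(1, 10)] * 10
tau = two_means_threshold([m / d for m, d in trt1 + trt2])
p_value = magig_region(trt1, trt2, tau)
```

Because the table margins here equal the number of cytosines in the region rather than the read depth at one site, the attainable P values are much finer than in the cytosine-level test.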

Simulations

Methylation status for each read was generated by first assigning, within each treatment group, a methylation status to each subset via a random Binomial (2, 0.5) process (akin to a coin-tossing process for each treatment group). Then, given a subset’s methylation status, subject-level base-pair-specific cytosine methylation status was simulated under a Hidden Markov Model (HMM) framework. The HMM approach was used to account for the correlated nature of methylation of cytosines within a subset. The transition probabilities are defined in Table 2. The transition matrices were chosen to span a variety of methylation patterns, ranging from relaxed to strict methylation status based on the subset's status. Finally, given a cytosine's methylation status, individual read status is simulated via a random Binomial (n, p) distribution, where n is the sequencing depth at the given cytosine, and p is the probability of a methylated read given the cytosine methylation status (see Table 3). This process was repeated 1000 times.

Table 2. Cytosine-specific methylation status transition matrices for methylated genes

(A) High	(B) Medium	(C) Low
	UM	M		UM	M		UM	M
UM	0.35	0.65	UM	0.50	0.50	UM	0.35	0.65
M	0.15	0.85	M	0.15	0.85	M	0.35	0.65

“M” and “UM” represent methylated and unmethylated status, respectively. Unmethylated gene transition matrices are formed similarly, with elements on each diagonal interchanged. Transition matrix (A) forms chains with longer homogeneous strings of methylated cytosines, while matrices (B) and (C) allow more unmethylated cytosines to be generated when the gene is methylated.

Table 3. Binomial probabilities for assigning methylated status (MR) to a read for unmethylated and methylated cytosines (UC and MC, respectively)

Setting	Separation	P(MR|UC)	P(MR|MC)
1	Large	0.10	0.80
2	Medium	0.15	0.70
3	Small	0.15	0.60

Setting 1 indicates a large separation of read probabilities, and settings 2 and 3 decrease this level of separation.
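The generative process described in this section can be sketched as follows for a single methylated gene. Several details are assumptions made only for illustration: rows of the Table 2 matrices are read as the current status and columns as the next status, the first cytosine's status is drawn with probability 0.5, and the depth is held fixed at every cytosine. Unmethylated genes and the gene-level coin toss are not covered, and the names are hypothetical.

```python
import random

# Table 2, matrix (A) "High", for a methylated gene.
# Assumed orientation: TRANSITION[current][next].
HIGH = {"UM": {"UM": 0.35, "M": 0.65}, "M": {"UM": 0.15, "M": 0.85}}

# Table 3, setting 1 ("Large"): probability of a methylated read
# given the cytosine's methylation status.
LARGE = {"UM": 0.10, "M": 0.80}


def simulate_region(n_cytosines, depth, transition, read_prob, rng):
    """Simulate (status, methylated_reads, depth) for consecutive cytosines."""
    # Assumed initial distribution: either status with probability 0.5.
    status = "M" if rng.random() < 0.5 else "UM"
    region = []
    for _ in range(n_cytosines):
        # Binomial(depth, p) draw as a sum of Bernoulli trials.
        methylated_reads = sum(
            rng.random() < read_prob[status] for _ in range(depth)
        )
        region.append((status, methylated_reads, depth))
        # Markov step to the next cytosine's hidden status.
        status = "M" if rng.random() < transition[status]["M"] else "UM"
    return region


rng = random.Random(2014)  # fixed seed so the sketch is reproducible
region = simulate_region(n_cytosines=50, depth=7, transition=HIGH,
                         read_prob=LARGE, rng=rng)
```

Repeating this for both treatment groups and for many regions, then applying a test to each simulated data set, gives the kind of power estimate reported in Figure 2.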

Results under biological replication

Replicated data (i.e., three samples) were simulated using settings similar to the unreplicated scenario and analyzed using a logistic regression. Statistical power was comparable across varying sequencing depths, indicating that increased depth in the presence of replication may give rise to diminishing returns on investment. MAGIG gained power when differences in cytosine methylation levels are large; interestingly, this effect was not seen in MAGIC. In the cases with higher correlation in consecutive cytosine-to-cytosine methylation status, the MAGIC and MAGIG approaches are comparable (Fig. 3).
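For orientation, one plausible form of the cytosine-level logistic model for replicated data is sketched below. This is an assumption offered for illustration and is not taken from the source; the coefficient names and the treatment indicator x_i are hypothetical, while m, D, and π follow the notation of Table 1 and the Methods.

```latex
\[
m_{ijcg} \sim \operatorname{Binomial}(D_{ijcg}, \pi_{icg}),
\qquad
\operatorname{logit}(\pi_{icg}) = \beta_{0cg} + \beta_{1cg}\, x_i,
\qquad
x_i = \mathbb{1}\{i = 2\},
\]
\[
H_0: \beta_{1cg} = 0 \quad \text{versus} \quad H_1: \beta_{1cg} \neq 0 .
\]
```

Under this parameterization exp(β_{1cg}) is the odds ratio of treatment 2 relative to treatment 1, so the null hypothesis corresponds to an odds ratio of 1, as in the unreplicated FET. The gene-level analogue would replace m and D with M and C_g.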

Figure 3. Simulation results in replicated settings. Panel columns represent the separation of binomial probabilities for read methylation (Methods). Panel rows represent the transition matrices used in the Hidden Markov Model (HMM) process used to generate cytosine methylation status (Methods). Increases in power of MAGIG were evident as the binomial probabilities of methylation increase in separation (i.e., from “Small” to “Large”). Very small power increases can be observed when increasing the average sequencing depth from 7 to 15. In general, replication may be sufficient to overcome the differences in sequencing depth, as well as the differences between MAGIC and MAGIG.

References

1. Genome-wide high-resolution mapping and functional analysis of DNA methylation in arabidopsis.

Authors: Xiaoyu Zhang; Junshi Yazaki; Ambika Sundaresan; Shawn Cokus; Simon W-L Chan; Huaming Chen; Ian R Henderson; Paul Shinn; Matteo Pellegrini; Steve E Jacobsen; Joseph R Ecker
Journal: Cell Date: 2006-08-31 Impact factor: 41.582

2. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values.

Authors: Brent S Pedersen; David A Schwartz; Ivana V Yang; Katerina J Kechris
Journal: Bioinformatics Date: 2012-09-05 Impact factor: 6.937

3. Analysing and interpreting DNA methylation data.

Authors: Christoph Bock
Journal: Nat Rev Genet Date: 2012-10 Impact factor: 53.242

4. A novel approach identifies new differentially methylated regions (DMRs) associated with imprinted genes.

Authors: Sanaa Choufani; Jonathan S Shapiro; Martha Susiarjo; Darci T Butcher; Daria Grafodatskaya; Youliang Lou; Jose C Ferreira; Dalila Pinto; Stephen W Scherer; Lisa G Shaffer; Philippe Coullin; Isabella Caniggia; Joseph Beyene; Rima Slim; Marisa S Bartolomei; Rosanna Weksberg
Journal: Genome Res Date: 2011-02-07 Impact factor: 9.043

5. Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription.

Authors: Daniel Zilberman; Mary Gehring; Robert K Tran; Tracy Ballinger; Steven Henikoff
Journal: Nat Genet Date: 2006-11-26 Impact factor: 38.330

6. Highly integrated single-base resolution maps of the epigenome in Arabidopsis.

Authors: Ryan Lister; Ronan C O'Malley; Julian Tonti-Filippini; Brian D Gregory; Charles C Berry; A Harvey Millar; Joseph R Ecker
Journal: Cell Date: 2008-05-02 Impact factor: 41.582

7. Human DNA methylomes at base resolution show widespread epigenomic differences.

Authors: Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker
Journal: Nature Date: 2009-10-14 Impact factor: 49.962

8. Quantitative comparison of genome-wide DNA methylation mapping technologies.

Authors: Christoph Bock; Eleni M Tomazou; Arie B Brinkman; Fabian Müller; Femke Simmer; Hongcang Gu; Natalie Jäger; Andreas Gnirke; Hendrik G Stunnenberg; Alexander Meissner
Journal: Nat Biotechnol Date: 2010-09-19 Impact factor: 54.908

9. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications.

Authors: R Alan Harris; Ting Wang; Cristian Coarfa; Raman P Nagarajan; Chibo Hong; Sara L Downey; Brett E Johnson; Shaun D Fouse; Allen Delaney; Yongjun Zhao; Adam Olshen; Tracy Ballinger; Xin Zhou; Kevin J Forsberg; Junchen Gu; Lorigail Echipare; Henriette O'Geen; Ryan Lister; Mattia Pelizzola; Yuanxin Xi; Charles B Epstein; Bradley E Bernstein; R David Hawkins; Bing Ren; Wen-Yu Chung; Hongcang Gu; Christoph Bock; Andreas Gnirke; Michael Q Zhang; David Haussler; Joseph R Ecker; Wei Li; Peggy J Farnham; Robert A Waterland; Alexander Meissner; Marco A Marra; Martin Hirst; Aleksandar Milosavljevic; Joseph F Costello
Journal: Nat Biotechnol Date: 2010-09-19 Impact factor: 54.908

10. The Arabidopsis Information Resource (TAIR): gene structure and function annotation.

Authors: David Swarbreck; Christopher Wilks; Philippe Lamesch; Tanya Z Berardini; Margarita Garcia-Hernandez; Hartmut Foerster; Donghui Li; Tom Meyer; Robert Muller; Larry Ploetz; Amie Radenbaugh; Shanker Singh; Vanessa Swing; Christophe Tissier; Peifen Zhang; Eva Huala
Journal: Nucleic Acids Res Date: 2007-11-05 Impact factor: 16.971
