Literature DB >> 25161227

Modeling DNA methylation dynamics with approaches from phylogenetics.

Abstract

MOTIVATION: Methylation of CpG dinucleotides is a prevalent epigenetic modification that is required for proper development in vertebrates. Genome-wide DNA methylation assays have become increasingly common, and this has enabled characterization of DNA methylation in distinct stages across differentiating cellular lineages. Changes in CpG methylation are essential to cellular differentiation; however, current methods for modeling methylation dynamics do not account for the dependency structure between precursor and dependent cell types.
RESULTS: We developed a continuous-time Markov chain approach, based on the observation that changes in methylation state over tissue differentiation can be modeled similarly to DNA nucleotide changes over evolutionary time. This model explicitly takes precursor to descendant relationships into account and enables inference of CpG methylation dynamics. To illustrate our method, we analyzed a high-resolution methylation map of the differentiation of mouse stem cells into several blood cell types. Our model can successfully infer unobserved CpG methylation states from observations at the same sites in related cell types (90% correct), and this approach more accurately reconstructs missing data than imputation based on neighboring CpGs (84% correct). Additionally, the single CpG resolution of our methylation dynamics estimates enabled us to show that DNA sequence context of CpG sites is informative about methylation dynamics across tissue differentiation. Finally, we identified genomic regions with clusters of highly dynamic CpGs and present a likely functional example. Our work establishes a framework for inference and modeling that is well suited to DNA methylation data, and our success suggests that other methods for analyzing DNA nucleotide substitutions will also translate to the modeling of epigenetic phenomena.
AVAILABILITY AND IMPLEMENTATION: Source code is available at www.kostkalab.net/software.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
DNA

Year: 2014 PMID： 25161227 PMCID： PMC4147898 DOI： 10.1093/bioinformatics/btu445

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNA methylation is a common epigenetic modification essential to organism development (Smith and Meissner, 2013). In vertebrates, DNA is most commonly methylated at the fifth carbon position on cytosine nucleotides (5mC) that are followed by a guanine, so-called CpG sites. A family of three DNA methyltransferase enzymes (DNMT1, DNMT3A, DNMT3B) is responsible for the establishment and maintenance of methylation state at the millions of CpG sites in most mammalian genomes (Smith and Meissner, 2013). Recently, the ability to perform genome-wide assays of the methylation state of individual CpGs has become a reality because of advances in microarray and DNA sequencing technology. Several approaches that vary in their accuracy, biases, coverage and cost are commonly used; see Laird (2010) for a detailed review of current methods. Systematic screening of DNA methylation across tissue differentiation and development has improved our knowledge of its role in these processes (Bock ; Xie ). The methylation profile of the mammalian genome is largely stable, but the methylation of specific genomic regions changes dynamically across development, and different cellular lineages have unique methylation profiles (Ziller ). Additionally, the DNA methylation state nearby a gene’s transcription start site (TSS) correlates with gene expression (Xie ), and the correct orchestration of methylation changes is essential for proper cellular differentiation. Aberrant methylation changes may lead to tumorigenesis and other diseases (Bergman and Cedar, 2013; Hansen ; Portela and Esteller, 2010; Tost, 2010). Studies assaying DNA methylation often focus on the comparison of two types of conditions, like tumor versus normal tissue (Nordlund ), or stem cells versus lineage-committed cells (Xie ). However, the natural process of cellular differentiation and development has an essentially tree-like topology, in which precursor cell types are connected to their descendants by edges, thereby forming a so-called lineage tree (Frumkin ). For example, Figure 1 depicts a lineage tree for blood cell differentiation, where DNA methylation has been assayed in cell types represented by nodes (Bock ). Independent pairwise comparisons cannot accommodate this structure.

Fig. 1.

CpG methylation dynamics can be modeled with an approach inspired by phylogenetic analysis of nucleotide substitutions. Left: A lineage tree of sampled blood cell types during hematopoietic differentiation with stem cells on top and terminally differentiated cells on the bottom (see Section 2.4 for details). The lineage tree takes the role played by the species tree in the phylogenetic context. Right: Examples of methylation patterns across differentiation (columns) for CpG sites at different genomic locations. Each row corresponds to a cell type in the linage tree on the left. The white block represents missing data. The discretized methylation states are analogous to DNA sequence data To address this issue, we introduce an approach to model methylation state changes between cell types that explicitly takes dependencies induced by the lineage tree into account. In this setup, modeling methylation changes over developmental time is in many ways reminiscent of describing DNA nucleotide changes over evolutionary time (Fig. 1). As a result, we adapt established continuous-time Markov models of sequence evolution to fit this task. In addition to accommodating cell lineage relations during development, our approach has the benefit that it works at single CpG dinucleotide resolution and does not require the spatial aggregation of methylation measurements across the genome. Finally, the analogy with models for DNA sequence evolution provides intuitive means to handle missing data, which are common in many DNA methylation datasets. During parameter estimation, missing data can be marginalized over, and the equivalent of joint ancestry reconstruction (Pupko ) allows efficient inference of the most likely methylation states for unobserved data in context of the lineage tree. As an illustration of our approach, we analyzed methylation data collected across the cell lineage tree in Figure 1 from Bock . On this dataset, our method enabled the accurate reconstruction of missing methylation states. The single CpG resolution of our analysis allowed us to discover that the identity of neighboring dinucleotides is strongly correlated with CpG methylation dynamics at many sites in the mouse genome. Finally, using our predictions of CpG methylation variability, we identified a cluster of highly dynamic CpG sites that show evidence of enhancer activity in blood cells.

2 METHODS

2.1 Modeling methylation changes across tissue differentiation

We model the dynamics of DNA methylation across cellular differentiation using an approach motivated by phylogenetic models. In the phylogenetic context, a continuous time Markov chain is used to quantify DNA sequence changes between species over a known species tree. Intuitively, we adapt this approach and replace the species tree with a cell lineage tree and the four-state alphabet of DNA with an alphabet based on methylation status. In our model, the cell lineage tree consists of nodes that correspond to cell types and edges indicate precursor–descendant relationships. For example, the lineage tree shown in Figure 1 traces the differentiation of adult hematopoietic stem cells (HSCs) through several intermediate states into different terminally differentiated blood cell types. To describe methylation patterns, we define three discrete methylation states , corresponding to unmethylated, partially methylated and methylated CpGs dinucleotides, respectively. To model transitions between CpG methylation states along edges of the cell lineage tree, we associate each node with a discrete random variable X (, assuming N nodes), which is dependent on its parents (i.e. its direct precursor cell types) in the lineage tree. As in DNA sequence evolution models, we use a continuous time Markov chain to describe this dependency structure. Specifically, if nodes i and j are connected in the lineage tree by an edge of length t, then the probability of the methylation state at node j being l conditional on node i being in methylation state k is given by for (Guttorp and Minin, 1995). Q is a 3 × 3 rate matrix (or generator), and expm denotes the matrix exponential. We assume a time-reversible Markov chain with equilibrium frequency π, which implies Q is fully parameterized by three non-negative rate parameters, , and π. (The number of expected transitions along an edge is , and therefore, we will enforce and report t in units of expected methylation state transitions.) In summary, our model is parameterized by ϑ, which consists of the topology of the lineage tree (which we assume is fixed, known and consists of N nodes and E edges), the branch lengths , the equilibrium frequencies and the rate parameters . The likelihood of an observed methylation pattern is then where is the parent of node i in the lineage tree. For the root node, we have , and assuming independence between methylation patterns at different CpG sites, we have for the likelihood of all observed patterns (assuming there are L CpG sites): In contrast to most applications dealing with DNA sequence changes, non-leaf nodes can be observed in our setting. We handle missing data by marginalization, i.e. summation over all possible configurations of unobserved nodes in the lineage tree, which can be done efficiently (linear in the number of tree nodes) via the elimination algorithm (Siepel and Haussler, 2005). Maximum likelihood parameter estimates are then obtained by maximizing Equation (1) over branch lengths, equilibrium frequencies and rate parameters. In summary, we have adapted a well-known class of models that is typically used in the context of DNA sequence evolution to model the dynamics of methylation changes during tissue differentiation.

2.2 Integrating rate heterogeneity

2.2.1 Modeling rate heterogeneity

The approach described so far models methylation dynamics using the same process at all CpG sites in the genome, and thereby assumes homogeneity of methylation dynamics. This assumption is not always reasonable. For instance, CpGs located in CpG islands have a propensity to be unmethylated (compared with other CpG sites), and a disposition to stay in that state (Jones, 2012). To address this issue, we incorporate rate heterogeneity into our model using a mixture modeling approach similar to phylogenetic models for DNA changes under heterogeneous substitution rates. First, we assume a certain fraction (β) of CpG sites to be invariant, i.e. they do not change their methylation state during tissue differentiation. For the fraction of variable CpG sites, we assume M different equiprobable rate categories such that the probability of methylation pattern is now where we have used the ‘rate category’ r0 to denote invariance and to denote the new parameter set. For the invariant term on right side above, if all methylation states in are k (for ) and zero otherwise. For the variable part, we have , where we use Equation (1), but with all branch lengths in ϑ scaled by the factor r. The scale factors are determined by a Gamma distribution with shape parameter α and scale parameter 1/α (setting the scale parameter to 1/α ensures that the Gamma distribution has a mean of one). Next, the probability density function of the Gamma distribution is discretized by splitting its domain into M equal-mass bins and setting r equal to the mean conditional on bin m. Thus, a single positive parameter α determines all M rates. The additional parameters to account for rate variation between CpG sites are the fraction of invariant sites β, the frequencies of invariant states and the shape parameter α of the Gamma distribution. Maximum likelihood estimates are again obtained considering CpG sites as independent. In summary, we use the Γ + I model (Gu ) to account for rate heterogeneity across different CpG sites.

2.2.2 Assigning CpG sites to rate categories

To assign CpG sites to rate categories we use an empirical Bayes approach (Galtier ). Let denote the maximum likelihood estimates for . We assign methylation pattern to rate category , where for m = 0 and for .

2.3 Reconstructing missing data

Our model of DNA methylation dynamics can reconstruct missing or unobserved methylation states in a cell type. Intuitively, for a given CpG site, nearby cell types in the lineage tree carry information about its likely methylation state. We quantify this relationship using joint maximum likelihood ancestry reconstruction (Pupko ). In essence, assume methylation pattern contains one or more missing values (i.e. unobserved methylation states). Further assume the empirical Bayes procedure discussed above assigns pattern to rate category m. Note that during this procedure missing values in had been ‘marginalized out’. Then, we assign the missing values in to the methylation state configuration that maximizes the likelihood . The algorithm of Pupko is linear in the number of tree nodes, enabling efficient reconstruction of missing methylation states. This reconstruction strategy shares methylation state information for a CpG site ‘vertically’ across the lineage tree and is complementary to approaches leveraging ‘horizontal’ correlations between different but nearby CpG sites across the genome.

2.4 Data sources and processing

Our algorithm requires two inputs: (i) the topology of the lineage tree and (ii) discrete methylation state data for the stages in the lineage tree at specific positions along a genome. Missing methylation states are allowed (see above). We analyzed DNA methylation maps from 13 cell populations from stages of a differentiation of adult mouse HSCs to different blood lineages (Bock ). The purified cell types were obtained at progressive levels of differentiation, starting with HSCs, followed by multipotent progenitor cells (MPP1 and MPP2) and progenitor cells of the lymphoid (CLP) and myeloid (CMP) lineages. For the lymphoid progenitors, further differentiated cells included T helper cells (CD4), T cells (CD8) and B cells (BCELL). For myeloid progenitor cells, the next stages were granulocyte-monocyte progenitors and megakaryocyte–erythroid progenitors (MEP); the former was followed by monocytes (MONO) and granulocytes (GRAN), whereas the latter was followed by erythrocytes (ERY). The relationships between cell types are summarized in the lineage tree in Figure 1. Bock generated a methylation map for each cell type using reduced representation bisulfite sequencing (RRBS). We downloaded counts of methylated and unmethylated reads at each sequenced CpG dinucleotide for the two replicates performed in each cell type from the Supplementary Materials Web site (http://info.medical-epigenomics.org/papers/broad_mirror/invivomethylation/) for Bock . Values for each CpG were averaged over the two replicates for each cell type (and strand where applicable). Then we discretized the methylation status into methylated (>0.8), partially methylated (between 0.1 and 0.8) and unmethylated (<0.1) categories based on the fraction of methylated reads for the site. Histograms of these values showed clear peaks at the ends of the spectrum and were similar between replicates. We defined CpG islands using the cpgIslandExt table for the mm9 build of the mouse gene from the UCSC genome browser (Kent ). We implemented our algorithms in the R language (R Core Team, 2014). To estimate the parameters of the model described in Section 2.2.1, we used the observed frequencies (excluding CpGs with missing values) for , and used a box-constaint enabled version of the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS-B) algorithm to obtain maximum likelihood estimates for all other parameters. Because we treat CpG positions independently (see above), the computational complexity of estimating rate categories and reconstructing missing data over a set of CpG sites scales linearly with the number of assayed CpGs. Optimizing the likelihood over all CpGs on Chromosome 1 in the Bock data (about 6% of the dataset) took 10 min and 15 s on a single core of an Intel(R) Xeon(R) X5690 CPU with 3.47 GHz clock speed.

3 RESULTS

We applied our methylation dynamics model to an RRBS dataset tracing the differentiation of adult HSCs (Bock ). This study queried a set of over 2 million CpG sites at different stages during blood lineage differentiation, and the assayed cell types with their relationships are summarized in the lineage tree in Figure 1. We fit a model with four rate categories (three variable and one invariant) to the discretized methylation status of CpG sites along each chromosome (Section 2). The maximum likelihood estimates of the model parameters qualitatively agree across chromosomes, and the resulting model is consistent with several previous findings. Invariant CpG sites are more prevalent than variable sites (Bock ; Ziller ); 61% of CpGs are invariant in our analysis, and 13, 12 and 14% fall into the slow, medium and fast rate categories, respectively. As expected, the equilibrium distribution for variable states favors methylated CpGs, i.e. for all chromosomes. Contrasting the dynamics at variable and invariant CpG sites, we find that invariant sites are most likely to remain unmethylated during differentiation (61%), whereas variable sites are most likely to be in methylated states (57%). The branch lengths obtained in our fitted models reflect the number of expected methylation state transitions between cell types, and we see the longest branch lengths between MEP and ERY cells and between HSC and MPP1 cells (t = 4.65 and t = 0.97 averaged over chromosomes, respectively). These numbers correspond to an expected fraction of CpG sites with observed methylation changes of 24% for MEP →ERY and of 17% and for HSC →MPP1, taking into account the prevalence of invariant sites and the different rate categories for variable sites. This is in qualitative agreement with previous results, and it provides a quantitative underpinning of the known ‘methylation divergences’ between these contexts. In the next three sections, we present examples of how our modeling approach enables analysis of methylation dynamics across cellular differentiations.

3.1 Reconstructing unobserved methylation states

Over the 2 079 144 CpG sites assayed in 13 cellular contexts in the blood differentiation dataset, 5 940 467 of 27 028 872 (22%) methylation states are missing because of a range of technical issues (Bock ). Given the prevalence of missing data, we assessed the ability of our model to reconstruct unobserved values using information from the observed methylation status of the same sites at other nodes in the lineage tree. Conceptually, this is akin to the problem of ancestral sequence reconstruction for DNA substitution models (Pupko ), but in the context of methylation, we also have observed data on internal nodes of the tree. For each cell type, we masked 10 000 CpG sites with measured methylation state, re-estimated model parameters on data missing the masked methylation states, reconstructed the masked values as described in the Section 2 and then compared the reconstructed methylation values with the actual values. The reconstructed methylation states are generally accurate (90% correct overall), with some differences in performance between different cell types (Fig. 2). As expected, the length of the edges connecting cell types is correlated with the accuracy of the reconstruction of missing values; the nodes with the longest incident edges in the lineage tree (HSC and ERY) are the most difficult to reconstruct.

Fig. 2.

Lineage tree-based reconstruction of methylation state is more accurate than nearest-neighbor-based reconstruction. We masked the methylation status for 10 000 CpG sites in each cell type and reconstructed these values using ‘vertical’ information from our lineage tree model and ‘horizontal’ information from neighboring CpG sites. The lineage tree approach proved significantly more accurate overall (90% versus 84%; , binomial test) and within every cell type except erythrocytes (P = 0.99), which have the longest branch length (p < 4E-17 for all other cell types) To compare our model’s reconstruction with a baseline method, we also reconstructed the methylation state of each masked CpG based on the methylation state of its nearest neighbor (in terms of genomic location) in the same cell type. This type of reconstruction assumes a ‘horizontal’ (i.e. location-wise) correlation between methylation states of neighboring CpG sites, whereas our approach can be viewed as assuming a ‘vertical’ (i.e. progenitor to descendant) correlation between the same CpG site in neighboring cell types. Overall, lineage tree-based reconstruction performs significantly better than location-based reconstruction (Fig. 2; 90% correct versus 84% correct; , binomial test). However, we note that more sophisticated ‘horizontal’ methods have achieved higher performance in some contexts; see Section 4. Stratifying reconstructed methylation states by our inferred rate categories revealed that, unsurprisingly, lineage tree-based reconstruction is hardest for ‘fast’ CpG sites, i.e. those in the fastest rate category according to the empirical Bayes procedure. Therefore, the lineage tree-based reconstruction approach not only performs better than a location-based method but also contributes valuable information about the confidence in the reconstruction result. We anticipate that combining these largely independent approaches could improve reconstruction further.

3.2 DNA sequence context is correlated with CpG methylation dynamics

Having established that our approach can successfully reconstruct unobserved methylation states, we now describe several analyses that use our model’s estimates of CpG methylation dynamics at single CpG resolution. Specifically, we show that outside CpG islands, and particularly for CpGs in promoters, the immediate DNA sequence content around CpG dinucleotides is correlated with the variability of their methylation state.

3.2.1 Local CpG sequence context correlates with methylation dynamics outside CpG islands

CpG islands are genomic regions with high CpG dinucleotide frequency, in which methylation status has been reported to influence gene expression (Illingworth and Bird, 2009; Jones, 2012). Here we study methylation dynamics by analyzing the rate category that our model assigns to each assayed CpG dinucleotide. As expected, we find that invariant CpG dinucleotides are strongly enriched in CpG islands, and that this enrichment falls off with increasing distance from the island (Fig. 3a). To obtain this aggregate view, we split the roughly 16 000 annotated CpG islands into an equal number of bins and discretized flanking genomic regions into equal-sized non-overlapping tiles. The averages for corresponding locations across CpG island loci are shown as points. The strong invariance of CpG islands is in agreement with the notion that CpG islands tend to retain their methylated state (Jones, 2012).

Fig. 3.

Methylation dynamics are influenced by DNA sequence context outside of CpG islands. (a) Invariant CpG sites (gray circles) are the most common category in our analysis, and they are strongly enriched in CpG islands. The other rate categories (red circles) are roughly equally likely in and around CpG islands. Each dot represents an average over evenly sized bins centered on the 16 024 CpG Islands. The red dashed line gives the sum of the variable categories. (b) The presence of adjacent CpG sites is strongly correlated with CpG methylation dynamics outside of CpG islands. The effect of DNA sequence context is much weaker within CpG islands. (c) Nearly all invariant CpG sites within CpG Islands are unmethylated. Outside of CpG islands, the adjacent CpG count for a site is strongly correlated with its methylation state Taking advantage of the single CpG resolution of our approach, we explored whether the enrichment of invariant CpG dinucleotides is exclusive to CpG islands. We hypothesized that local CpG content could be important, so we stratified each CpG dicnucleotide by (i) whether its two neighboring dinucleotides contain none, one or two CpGs and (ii) whether it is located inside a CpG island. For CpGs without neighboring CpG sites, those in CpG islands are strongly enriched for invariance over those outside of CpG islands (factor 1.67), but this enrichment decreases for CpGs with one or two neighboring CpGs (Fig. 3b). In other words, outside of CpG islands, local CpG sequence context is strongly correlated with the absence of methylation state changes across hematopoietic differentiation. Next, given the importance of invariant CpGs, we assessed whether there is a preference for methylated (or partially methylated) states compared with unmethylated states among invariant CpG sites. We find that invariant CpG sites in CpG islands are almost always unmethylated (96%), which is expected from previous analyses (Jones, 2012). However, invariant CpG sites outside of CpG islands are significantly more likely to be methylated (65% methylated and 28% unmethylated; , binomial test). We again hypothesized that adjacent sequence context could influence methylation at these sites. Figure 3c shows that invariant CpGs with neighboring CpG dinucleotides outside CpG islands are more unmethylated compared with their counterparts without neighboring CpG dinucleotides.

3.2.2 Low CpG content promoters are enriched for variable CpG sites

The previous subsection shows that local CpG context is associated with the dynamics of individual CpG sites in certain settings. Promoter CpGs are known to be functionally important and influenced by CpG density, so next we analyzed single CpG dynamics in promoters with respect to their overall CpG density. The methylation state of CpGs in gene promoters is associated with transcription levels (Jones, 2012). In several cell types, promoter methylation is negatively correlated with gene expression, and the effect is strongest in promoters with low CpG density (Xie ). The rate category assignments from our model enabled us to test whether promoter CpG content is also correlated with methylation dynamics across hematopoietic differentiation. Following Xie , we defined ‘promoters’ as regions 500 bp upstream and downstream of TSSs, and we analyzed CpG sites within this window for 19 244 mouse genes. We stratified promoters into low and high CpG density groups. As seen in human data, the CpG density distribution surrounding the mouse TSSs has two peaks; one at low (<0.034 CpG/bp) and one at high CpG density (≥0.034 CpG/bp). The low CpG content promoters have a significantly higher fraction of variable CpG sites compared with the high CpG promoters (Fig. 4; , chi-squared test). Nearly all (84%) of the high CpG promoter CpG sites were invariant across the differentiation, whereas only 55% of the low CpG sites were invariant. This pattern could be driven by the invariance of CpG Islands (Fig. 3) and their prevalence in high CpG content promoters, but the effect remained when CpG island sites were removed (76% versus 54%; ). These results are consistent with the previous observation that the correlation between methylation state and gene expression is strongest in low CpG promoters (Xie ).

Fig. 4.

Low CpG content promoters are enriched for CpGs with variable methylation state. We stratified mouse gene promoters into low and high CpG content groups and then compared the inferred dynamics of CpG sites in these groups. Low CpG content promoters were significantly less likely to be in the invariant rate category (; chi-squared test). This pattern remained when CpG islands were not considered

3.3 Identification of genomic regions with variable methylation state

The CpG site rate category predictions from our model enable us to identify genomic regions with frequent methylation state changes across hematopoietic differentiation. To do this, we discarded CpG dinucleotides in repeat masked regions (rmsk track for mm9 from UCSC genome browser) and then identified maximal subsets in the RRBS data, for which each site is no >20 bp from the nearest other CpG in the subset. We further filtered out short regions (<50 bp) and focused on regions without evidence for accelerated substitution rates (average phyloP score from UCSC genome browser >0); this approach leaves 61 980 regions with a high density of assayed CpGs that are between 50 and 758 bp long (mean: 95.1 bp). The density of fast (in terms of their annotated rate category) CpG sites in these regions ranges from ∼10 to 100%, with 1766 sites exceeding 50%. Figure 5 shows the longest of the 215 regions with all constituent CpGs in the fast rate category. This 129 bp region with 16 assayed CpG sites overlaps a CpG island, has strong evolutionary sequence conservation across placental mammals, and displays histone modifications correlated with transcriptional enhancer activity in various blood-related cell types (ENCODE Project Consortiumm; Bernstein ). These attributes suggest a gene regulatory role for this locus. This type of simple candidate approach based on examination of the extremes of the rate category distribution may shed light on regions whose methylation state influences transcriptional regulation during hematopoietic differentiation.

Fig. 5.

A block of highly variable CpG sites has evidence of gene regulatory enhancer activity in several blood cells. This region on Chromosome 4 (mm9.chr4:116 976 784–116 976 912) contains 16 CpG sites that our model places in the fast rate category within 129 bp. The region is located within a CpG island in an intron of the gene Rnf220, a ubiquitin ligase. The DNA sequence at this locus is strongly conserved across placental mammals; this suggests that it is likely functionally important. In addition, functional genomics data collected by the ENCODE project (ENCODE Project Consortiumm; Bernstein ) suggest that this locus is a regulatory enhancer in several blood cell types. It overlaps a DNaseI hypersensitive site in an erythroid progenitor (G1E), and it has peaks of the H3K4me1 enhancer-associated histone modification in B-cell lymphoma cells (CH12), erythroblasts, G1E cells and megakaryocytes

4 DISCUSSION

In this article, we adapt phylogenetic Markov models, which are prevalent in comparative genomics and statistical genetics, to accurately and efficiently analyze DNA methylation dynamics across lineage specification. As a proof of concept, we model RRBS methylation data collected from 13 related stages of blood cell development (Bock ). Using our model, we illustrate that (i) CpG site methylation status can be accurately reconstructed using data from related cell types at the same site, (ii) the single CpG site resolution of our methylation dynamics estimates enable the discovery of attributes, such as DNA sequence context, that correlate with CpG methylation dynamics and (iii) our models facilitate the identification of genomic regions with highly variable CpG methylation states that are likely functional. There are many additional methodologies that could be mapped from the rich reservoir of statistical genomics to the application of modeling methylation dynamics. It will be exciting to see which will prove most useful as genome-wide methylation data continue to be collected to elucidate tissue differentiation and development. For instance, our analyses confirm that methylation dynamics are different between CpG sites located in CpG islands and those elsewhere in the genome. Thus, one way to extend our current approach would be to use different parameterizations based on such ‘external’ annotations, as is commonly done when modeling coding versus non-coding sequence in comparative genomics. Further on, such models could also be integrated in a hidden Markov model (HMM) framework (as in phylo-HMMs (Siepel and Haussler, 2005)), which could lead to genome segmentations that account for methylation dynamics. An HMM framework has already proven useful in modeling the density of CpG sites across a genome and defining CpG islands (Hsieh ; Wu ). In Section 3.1, we demonstrated that progenitor–descendant relationships in the lineage tree can be used to accurately reconstruct the methylation status of CpG sites in different cellular contexts (average of 90% accuracy). These ‘vertical’ relationships enabled more accurate reconstruction on the RRBS dataset analyzed here than using nearest genomic neighbors to predict missing values. However, we note that there are many methods for reconstructing missing CpG methylation status using genomic information. These methods have largely focused on CpGs in CpG islands, but a recent approach (Zhang ) used a random forest classifier to accurately (91–94%) predict CpG methylation from a methylation array based on a suite of features including neighboring CpG methylation status and overlapping genomic elements. However, the CpG coverage provided by methylation arrays and RRBS assays is different (Laird, 2010), so it is difficult to directly compare the results of these methods on different datasets. Nonetheless, our approach is complementary to existing methods for reconstructing DNA methylation that use neighboring CpG sites along the genome. Context-dependent models that integrate such ‘vertical’ (i.e. progenitor to descendant) and ‘horizontal’ (i.e. distance on chromosome) relationships have the potential to further improve methods for reconstructing unobserved states and highlighting functionally relevant shifts in methylation. Most existing methods for analyzing DNA methylation dynamics are based on pairwise comparisons of cellular contexts (Xie ; Ziller ). For example, Ziller identify CpGs with large differences in their estimated methylation between pairs of tissues or cell lines, and then they cluster these to highlight differentially methylated regions. This type of pairwise comparison is appropriate for much of the methylation data currently available, but existing methods are challenging to generalize to analysis of more densely sampled sets of dependent cell types. To address this, our approach explicitly models the existence of statistical dependencies between methylation states from multiple related cell types. Thus, it enables multivariate analyses of methylation patterns in differentiating cell lineages (like those from Bock ), but it may not provide much improvement when analyzing essentially independent samples (like distantly related terminally differentiated cell types). In addition, our strategy is subject to the discretization of methylation status in a population of cells into discrete methylation states. The three states and the thresholds we use are supported by the distribution of methylation values in our data, but including direct modeling of counts of methylated versus unmethylated instances of a CpG site (Ziller ) into our approach is a promising future direction. DNA methylation is just one of several dynamic epigenetic biochemical modifications regulating precise spatiotemporal gene expression patterns that are essential for proper development. The approach we have demonstrated here provides an integrative, multivariate framework for modeling any epigenetic changes across multiple cell types and lineages. As phylogenetic models proved essential in the identification and interpretation of functional DNA sequence regions, we believe that lineage tree-aware Markov models of epigenetic dynamics can play a similar role in developing a deeper understanding of epigenetic phenomena and their roles in tissue differentiation and vertebrate development. Funding: JAC was supported by institutional funds from Vanderbilt University, DK was supported by institutional funds from the University of Pittsburgh School of Medicine. Conflict of interest: none declared.

20 in total

1. A fast algorithm for joint reconstruction of ancestral amino acid sequences.

Authors: T Pupko; I Pe'er; R Shamir; D Graur
Journal: Mol Biol Evol Date: 2000-06 Impact factor: 16.240

2. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

Review 3. Epigenetic modifications and human disease.

Authors: Anna Portela; Manel Esteller
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

4. Redefining CpG islands using hidden Markov models.

Authors: Hao Wu; Brian Caffo; Harris A Jaffee; Rafael A Irizarry; Andrew P Feinberg
Journal: Biostatistics Date: 2010-03-08 Impact factor: 5.899

Review 5. Principles and challenges of genomewide DNA methylation analysis.

Authors: Peter W Laird
Journal: Nat Rev Genet Date: 2010-03 Impact factor: 53.242

Review 6. DNA methylation dynamics in health and disease.

Authors: Yehudit Bergman; Howard Cedar
Journal: Nat Struct Mol Biol Date: 2013-03 Impact factor: 15.369

Review 7. DNA methylation: roles in mammalian development.

Authors: Zachary D Smith; Alexander Meissner
Journal: Nat Rev Genet Date: 2013-02-12 Impact factor: 53.242

8. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites.

Authors: X Gu; Y X Fu; W H Li
Journal: Mol Biol Evol Date: 1995-07 Impact factor: 16.240

9. Epigenomic analysis of multilineage differentiation of human embryonic stem cells.

Authors: Wei Xie; Matthew D Schultz; Ryan Lister; Zhonggang Hou; Nisha Rajagopal; Pradipta Ray; John W Whitaker; Shulan Tian; R David Hawkins; Danny Leung; Hongbo Yang; Tao Wang; Ah Young Lee; Scott A Swanson; Jiuchun Zhang; Yun Zhu; Audrey Kim; Joseph R Nery; Mark A Urich; Samantha Kuan; Chia-an Yen; Sarit Klugman; Pengzhi Yu; Kran Suknuntha; Nicholas E Propson; Huaming Chen; Lee E Edsall; Ulrich Wagner; Yan Li; Zhen Ye; Ashwinikumar Kulkarni; Zhenyu Xuan; Wen-Yu Chung; Neil C Chi; Jessica E Antosiewicz-Bourget; Igor Slukvin; Ron Stewart; Michael Q Zhang; Wei Wang; James A Thomson; Joseph R Ecker; Bing Ren
Journal: Cell Date: 2013-05-09 Impact factor: 41.582

10. Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements.

Authors: Weiwei Zhang; Tim D Spector; Panos Deloukas; Jordana T Bell; Barbara E Engelhardt
Journal: Genome Biol Date: 2015-01-24 Impact factor: 13.583

11 in total

1. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

Authors: Jason Ernst; Manolis Kellis
Journal: Nat Biotechnol Date: 2015-02-18 Impact factor: 54.908

Review 2. Epimutations Define a Fast-Ticking Molecular Clock in Plants.

Authors: Nan Yao; Robert J Schmitz; Frank Johannes
Journal: Trends Genet Date: 2021-05-17 Impact factor: 11.821

3. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies.

Authors: Ke Yuan; Thomas Sakoparnig; Florian Markowetz; Niko Beerenwinkel
Journal: Genome Biol Date: 2015-02-13 Impact factor: 13.583

4. Evolutionary expansion of DNA hypomethylation in the mammalian germline genome.

Authors: Jianghan Qu; Emily Hodges; Antoine Molaro; Pascal Gagneux; Matthew D Dean; Gregory J Hannon; Andrew D Smith
Journal: Genome Res Date: 2017-12-19 Impact factor: 9.043

5. Punctuated evolution of canonical genomic aberrations in uveal melanoma.

Authors: Matthew G Field; Michael A Durante; Hima Anbunathan; Louis Z Cai; Christina L Decatur; Anne M Bowcock; Stefan Kurtenbach; J William Harbour
Journal: Nat Commun Date: 2018-01-09 Impact factor: 14.919

6. Unveiling new interdependencies between significant DNA methylation sites, gene expression profiles and glioma patients survival.

Authors: Michal J Dabrowski; Michal Draminski; Klev Diamanti; Karolina Stepniak; Magdalena A Mozolewska; Paweł Teisseyre; Jacek Koronacki; Jan Komorowski; Bozena Kaminska; Bartosz Wojtas
Journal: Sci Rep Date: 2018-03-13 Impact factor: 4.379

7. Inferring cell differentiation processes based on phylogenetic analysis of genome-wide epigenetic information: hematopoiesis as a model case.

Authors: Kanako O Koyanagi
Journal: Genome Biol Evol Date: 2015-01-31 Impact factor: 3.416

8. Inferring changes in histone modification during cell differentiation by ancestral state estimation based on phylogenetic trees of cell types: Human hematopoiesis as a model case.

Authors: Kanako O Koyanagi
Journal: Gene X Date: 2019-05-31

9. Extrapolating histone marks across developmental stages, tissues, and species: an enhancer prediction case study.

Authors: John A Capra
Journal: BMC Genomics Date: 2015-02-21 Impact factor: 3.969

10. On Predicting lung cancer subtypes using 'omic' data from tumor and tumor-adjacent histologically-normal tissue.

Authors: Arturo López Pineda; Henry Ato Ogoe; Jeya Balaji Balasubramanian; Claudia Rangel Escareño; Shyam Visweswaran; James Gordon Herman; Vanathi Gopalakrishnan
Journal: BMC Cancer Date: 2016-03-04 Impact factor: 4.430