Literature DB >> 35078341

Cluster mean-field theory accurately predicts statistical properties of large-scale DNA methylation patterns.

Lyndsay Kerr¹, Duncan Sproul², Ramon Grima³.

Abstract

The accurate establishment and maintenance of DNA methylation patterns is vital for mammalian development and disruption to these processes causes human disease. Our understanding of DNA methylation mechanisms has been facilitated by mathematical modelling, particularly stochastic simulations. Megabase-scale variation in DNA methylation patterns is observed in development, cancer and ageing and the mechanisms generating these patterns are little understood. However, the computational cost of stochastic simulations prevents them from modelling such large genomic regions. Here, we test the utility of three different mean-field models to predict summary statistics associated with large-scale DNA methylation patterns. By comparison to stochastic simulations, we show that a cluster mean-field model accurately predicts the statistical properties of steady-state DNA methylation patterns, including the mean and variance of methylation levels calculated across a system of CpG sites, as well as the covariance and correlation of methylation levels between neighbouring sites. We also demonstrate that a cluster mean-field model can be used within an approximate Bayesian computation framework to accurately infer model parameters from data. As mean-field models can be solved numerically in a few seconds, our work demonstrates their utility for understanding the processes underpinning large-scale DNA methylation patterns.

Entities: Chemical

Keywords: DNA methylation; epigenetics; master equations; stochastic modelling

Mesh：

Year: 2022 PMID： 35078341 PMCID： PMC8790364 DOI： 10.1098/rsif.2021.0707

Source DB: PubMed Journal: J R Soc Interface ISSN： 1742-5662 Impact factor: 4.118

Introduction

DNA methylation is a repressive epigenetic mark [1] which is primarily found on the cytosines of CpG dinucleotides in mammals. Double-stranded CpG dyads can be unmethylated or methylated on both strands (u and m, respectively) or methylated on only one strand (hemimethylated, h). DNA methylation is largely erased from the genome during early mammalian development [2]. It is then re-established by the de novo DNA methyltransferases DNMT3A and DNMT3B [3] resulting in a landscape where 70–80% of CpGs are methylated in most human cells [4]. Regulatory elements such as promoters and enhancers often remain methylation free [1]. During DNA replication, the nascent strand is synthesized with unmethylated cytosines and methylation patterns are copied by the maintenance methyltransferase, DNMT1 [5]. Failure to maintain DNA methylation at a locus results in passive DNA demethylation. Methylation can also be removed actively through transient modification by ten eleven translocation (TET) enzymes and subsequent DNA repair [6]. Waves of demethylation and remethylation take place during early development and the generation of germline cells [2]. Changes in DNA methylation patterns also occur during development and cellular differentiation, resulting in cell type-specific methylation patterns [7]. The correct establishment of DNA methylation patterns is vital for normal development. Mutations in DNMTs cause Mendelian disorders in humans [8-10] and mice knockouts die before or shortly after birth [3,5]. Widespread alterations in DNA methylation patterns occur in cancer and ageing [11,12], but the significance of these changes is unclear. It has been observed that globally hypomethylated mice expressing a single hypomorphic DNMT1 allele develop cancer, suggesting that altered DNA methylation can cause cancer [13]. However, the mechanisms underpinning DNA methylation changes remain unclear preventing the robust delineation of their role in development and disease. Mathematical models are powerful tools for understanding complex biological processes, including DNA methylation. The importance of interactions between CpGs in maintaining DNA methylation patterns was first postulated through modelling [14]. Specifically, the authors modelled collaborative interactions where CpGs within a region of the genome can influence the state of other CpGs, e.g. through enzyme recruitment. Models including such collaborativity were subsequently found to explain experimental measurements of methylation maintenance in vitro and in vivo more closely than those that did not include it [15,16]. A recent study also suggests that collaborativity mediated by neighbour-guided error correction through DNMT1 is important for maintaining DNA methylation [17]. Deterministic models, non-spatial stochastic models and spatial stochastic models have all been used to describe DNA methylation [18]. Deterministic models are based on rate equations while stochastic models are based on Fokker–Planck equations or chemical master equations (CMEs). CMEs are ideal because they take into account the inherent discreteness of molecular fluctuations [19] which is well known to play an important role in cellular dynamics [20]. The CME of simple non-spatial stochastic models can be solved exactly in closed-form [21], but this is often not possible for spatial stochastic models. Rather in this case, stochastic simulations are used to model the individual reaction processes described by the CME. Various types of stochastic models have been used to describe collaborative methylation systems (e.g. [22-26]). To date, such mathematical models have been applied to understand methylation patterns on a kilobase scale. However, megabase-scale alterations to DNA methylation patterns occur in development, cancer and ageing [27]. These include the formation of megabase-sized partially methylated domains in cancer which have been linked to genome instability [27,28]. Existing models rely on simulations that are too computationally expensive to run for such large genomic regions. Here, we test the idea that large-scale steady-state methylation patterns can be modelled in a tractable manner using mean-field (MF) models. By comparison to synthetic data generated from stochastic simulations, we demonstrate that a type of cluster MF model can predict the statistical properties of large-scale methylation patterns. In §2, we introduce a nearest-neighbour collaborative model for DNA methylation and describe the process used to simulate data from this model. We describe the three MF models we test in §3. In §4, we compare the ability of each MF model to predict statistics associated with methylation patterns resulting from the simulations. We find that a type of cluster MF model provides excellent predictions and demonstrate that this model can be used within an approximate Bayesian computation (ABC) framework to infer parameters underpinning large-scale methylation systems. Finally, in §5, we discuss the implications of our findings.

Nearest-neighbour collaborative model

Model set-up

We consider the reaction system in figure 1, where some reactions are non-collaborative (involving only the ‘target’ CpG, whose methylation state changes during the reaction), while others are collaborative (involving both a target CpG and a ‘mediator’ CpG). The role of the mediator is to encourage the reaction to occur, e.g. via the recruitment of methylase or demethylase enzymes. This system, and reduced versions, have previously been used to examine small-scale methylation patterns [14,29,30]. Previously, collaborative and non-collaborative reactions have been referred to as positive-feedback and noisy reactions in a study modelling histone modifications [31].

Figure 1

System of reactions under consideration. Here u, h and m represent unmethylated, hemimethylated and methylated CpGs, respectively. Non-collaborative reactions involve only one CpG, while collaborative reactions involve two CpGs. For each collaborative reaction, the second reactant (the mediator, see text) recruits an enzyme that changes the methylation state of the first reactant (the target). For example, the reaction involves a hemimethylated CpG at one site recruiting a methylase enzyme which changes the state of a CpG at another site from unmethylated to hemimethylated. Reaction rates are k, i = {1, …, 12}. Note that only one CpG site changes methylation state during each reaction. We make the following assumptions: CpGs can only influence the methylation state of their nearest neighbours (see figure 2). While both experimental and modelling studies have demonstrated the importance of CpGs being influenced by surrounding CpGs [14,15], the extent and range of such influence are unknown. We therefore consider only interactions between nearest neighbours.

Figure 2

Collaborative interactions that can influence a target X under the nearest-neighbour collaborative model. Individual CpGs are represented by ‘lollipops’, with their colour indicating their methylation status (white: unmethylated; grey: hemimethylated; black: methylated). Collaborative methylation and demethylation reactions can only occur between neighbouring CpGs (potential influences on CpG X are indicated by arrows).

There are no direct transitions between the unmethylated and methylated states. We justify this with the observation that methylase and demethylase enzymes act on single DNA strands [6,32]. This implies that hemimethylation is a necessary transition state between unmethylated and fully methylated CpGs. The system has reached a steady state. Here, we assume that there are no long-term effects of DNA replication on methylation patterns. This assumption is supported by the observation that the DNA methylation patterns of cycling and arrested cells are similar [33]. Collaborative interactions that can influence a target X under the nearest-neighbour collaborative model. Individual CpGs are represented by ‘lollipops’, with their colour indicating their methylation status (white: unmethylated; grey: hemimethylated; black: methylated). Collaborative methylation and demethylation reactions can only occur between neighbouring CpGs (potential influences on CpG X are indicated by arrows). We do not impose an upper bound on the system size, , allowing large-scale methylation patterns to be considered. We also assume that the rates k, i = {1, 2, …, 12} are of the form given in table 1, where x measures the strength of collaborativity between CpGs (x < 1 indicates that non-collaborative reactions dominate, while x > 1 indicates that collaborative reactions dominate). The parameter y measures the strength of methylation versus demethylation (y < 1 corresponds to demethylation dominating and y > 1 corresponds to methylation dominating). The parameter a > 0 scales the reaction rates. While the values of x ≥ 0, y ≥ 0 determine which reactions are most probable (see table 1 legend), all reactions in figure 1 are possible for any {x, y}.

Table 1

Reaction rates for the model in figure 1. Here, a > 0, x ≥ 0, y ≥ 0 so that all reaction rates are non-negative. The values of x and y determine the probability associated with each type of reaction. For example, consider the case x < 1, y < 1. Then k3 = k4 = a is the largest reaction rate and so non-collaborative demethylation reactions are most probable, while k5 = k6 = k7 = k8 = axy is the smallest reaction rate and so collaborative methylation reactions are least probable.

	demethylation	methylation
non-collaborative	k₃ = k₄ = a	k₁ = k₂ = ay
collaborative	k₉ = k₁₀ = k₁₁ =k₁₂ = ax	k₅ = k₆ = k₇ =k₈ = axy

Simulations of nearest-neighbour collaborative system

In the simple case of two CpGs, the system can be in six possible states: mm, uu, hh, um, hm and uh. For such a small system, all possible transitions between states (via reactions in figure 1) can be identified and the evolution of the system can be described exactly by six mathematical equations, one for each state. However, for large systems, it is infeasible to identify all possible states and transitions between states meaning that equations describing the exact evolution of the system cannot be formulated. While stochastic simulations are computationally expensive, they are the ground truth of the nearest-neighbour collaborative system to which we compare our MF models and so here we describe the process underlying these simulations. We focus on the steady-state case so that u, h and m levels fluctuate around some fixed steady-state values. For a system of N CpGs, we simulate the nearest-neighbour collaborative system using Gillespie’s algorithm [34]—see figure 3 for an illustration of how this algorithm simulates a 4-CpG system. Essentially, in each step of Gillespie’s algorithm, a timepoint is chosen and a single reaction selected to occur. This reaction changes the methylation state of a single CpG, i.e. all other CpGs remain unchanged. Throughout the simulations, we monitor the proportions of u, h and m sites, declaring steady state to be attained when these fluctuate around fixed values, after which we sample the methylation pattern, containing N methylation states, at constant time intervals. For each parameter set, we sample at T = 106/N timepoints to obtain a dataset of 106 steady-state methylation states.

Figure 3

Stochastic simulations of the nearest-neighbour collaborative system. For simplicity, a 4-CpG system at 3 timepoints is considered here. We fix a, x and y and impose periodic boundary conditions so that the first and final CpG can interact. After all potential non-collaborative and collaborative reactions are identified, Gillespie’s algorithm chooses a timepoint and a single reaction to occur at this timepoint. This reaction changes the methylation state of a single CpG and the list of potential reactions is updated to account for this change. This process is repeated to generate dynamical behaviour. In the example shown, we start with the system on the left. All possible reactions are listed and at the first timepoint (chosen by Gillespie’s algorithm) a reaction is chosen to occur at target position . The methylation state at the target position is changed accordingly (all other CpGs remain unchanged) and the list of potential reactions is updated to account for this change (middle). At the next timepoint (chosen by Gillespie’s algorithm), a reaction is chosen to occur at target position, , with CpG acting as mediator. Again, the methylation state at the target position is changed accordingly (all other CpGs remain unchanged) and the list of potential reactions is updated (right). This process is repeated until the system reaches steady state.

Analysis of simulated data

To facilitate our analysis, we define a sequence u, t ∈ {1, …, T}, where We define h and m similarly; see figure 4. We also define z via The steady-state mean u, h and m levels, μ, μ, μ, are obtained via Throughout, the subscript ‘s’ denotes steady-state statistics.

Figure 4

Construction of the sequences u, h, m and z. For simplicity, only 10 CpGs are shown for a single timepoint, t. A vector u is created, where if CpG i is unmethylated at time t and otherwise. Vectors h and m are constructed similarly. Finally, a vector z is created via z = u + 2h + 3m. For each t, we also calculate the mean and variance over z (, respectively) and average over all t ∈ {1, …, T} to obtain an overall steady-state mean and variance, μ and σ2(z), which are given by and We then define sequences v and w by For each t ∈ {1, …, T}, the sequence v is identical to z, while w is a shifted version of z (i.e. for i = {1, …, N − 1} and ). For each i = {1, …, N}, comparing and provides information regarding the methylation state of two neighbouring CpGs at time t ∈ {1, …, T}. We calculate the covariance between v and w for each t, Covar(v, w), averaging over t ∈ {1, …, T} to obtain an overall steady-state covariance between neighbouring sites: where . Finally, the steady-state correlation between neighbouring sites, ρ(z) : = ρ(v, w), is calculated via

Mean-field models for DNA methylation maintenance

For large CpG systems, the CME describing the nearest-neighbour collaborative model cannot be easily solved and stochastic simulations are computationally expensive. In contrast, it is often the case that models simplified using the MF approximation can be computationally solved in a time-efficient manner and we aim to test whether they accurately approximate the nearest-neighbour collaborative model for DNA methylation. To this end, we construct three MF models (see §§3.1–3.3). These models consider an infinite system of CpGs and so, by design, their ability to accurately describe a genomic region increases with the size of the region. In these models, nearest-neighbour interactions are approximated by considering the mean state of the system. In the first model, nearest-neighbour interactions are entirely approximated by considering the probability that two states are adjacent (one-site MF model). The second model describes distinct pairs of CpGs (distinct pairs MF model). Interactions occurring within a pair are directly accounted for and other nearest-neighbour interactions are approximated by considering the probability that two paired states are adjacent. In the third model, we consider overlapping pairs of CpGs (overlapping pairs MF model). Interactions occurring within a pair are again directly accounted for, but now other nearest-neighbour interactions are approximated by considering the probability that two paired states overlap. The remainder of this section is devoted to mathematical descriptions of these models. Note that while the computational cost of simulations increases with sequence length, numerical solutions of the MF models discussed in §§3.1–3.3 are independent of sequence length. Hence these models describe arbitrarily large genomic regions.

One-site mean-field model

We define the proportion of sites in the u, h, m states to be the mean u, h, m levels, μ, μ, μ, respectively. Here we construct a one-site MF model, where changes in the system are influenced by μ, μ, μ, rather than nearest-neighbour interactions; see figure 5.

Figure 5

Schematic of the one-site MF model. CpGs are influenced by the mean of the system rather than nearest-neighbour interactions.

Schematic of the one-site MF model. CpGs are influenced by the mean of the system rather than nearest-neighbour interactions. Consider the reaction in figure 1. Since the h mediator is unchanged by the reaction, we can write this as an effective first-order reaction , where the h mediator is absorbed into the effective reaction rate by making it proportional to μ. The factor of two accounts for the h mediator being on either side of the u target. Similarly, can be written as . Thus the reactions in figure 1 can be written as a single effective first-order reaction , with r = k1 + 2k5μ + 2k6μ. We thus write the system in figure 1 as the effective first-order system where and μ + μ + μ = 1. The k7 and k11 terms have an additional factor of two since their associated reactions involve two h reactants, and either of these can change state during the reaction. Let L, L, L be the ‘level’ (proportion) of u, h, m at a single CpG, respectively. A CpG can only be in one state at any time and so the state of a CpG at position j can be described by a state vector of the form The probabilities associated with these three states sum to one and so we only need to consider the probabilities associated with (L, L, L) = (1, 0, 0), and (L, L, L) = (0, 1, 0). Using (3.1) we construct the CME describing the probability that a site is in the (1, 0, 0) or (0, 1, 0) state: We can also obtain moment equations (see appendix C of [35]) for the statistics of L, L, L. The mean values are μ = 〈L〉, μ = 〈L〉, μ = 〈L〉, where the angled brackets denote the expected value. Due to the conservation law L = 1 − L − L, we again need only consider equations for u and h. From equation (C 1) in [35] with the means are described by the equations Note that these equations are identical to (3.3) with P(1, 0, 0) = μ and P(0, 1, 0) = μ. Setting leads to implicit equations for the steady-state means, where a1 = a1(μ, μ), a2 = a2(μ, 1 − μ − μ), a3 = a3(μ, 1 − μ − μ), a4 = a4(μ, μ). Since equation (3.6) is independent of a, the means depend only on x and y. For fixed parameters, we can solve equation (3.6) numerically using the Mathematica function NSolve to obtain values for μ and μ. Using equation (C2) in [35] with (3.4) the second-moment equations are given by From equation (3.2), we expect LL = L, LL = L and LL =0 at any CpG. Solving equation (3.7) in steady state leads to Variances, σ2(L) and σ2(L), can then be obtained via along with the covariance Covar(L, L), which is given by The mean, variance and covariances associated with the m state can now be obtained using We calculate the steady-state mean and variance, μ and σ2(zs), associated with the variable z = L + 2L + 3L via and Note that the superscript t was only used in §2.2 to differentiate between samples at different timepoints. Here, we simply have a single z. Since no spatial information is obtained from the one-site MF model, the covariance and correlation between neighbouring sites cannot be extracted.

Distinct pairs mean-field model

We next construct a two-site MF model, where we consider ‘clusters’ of two adjacent CpGs. Such cluster MF models have been successfully used to study vehicular traffic and driven-diffusive gas models [36,37]. We define the mean level (proportion) of pairs in the six possible states, to be μ1, μ2, μ3, μ4, μ5, μ6, respectively. Here, μ4 is the proportion of pairs containing u and m, irrespective of order. Similarly, μ5 is the proportion of pairs containing h and m, and μ6 is the proportion of pairs containing u and h, irrespective of order. In the distinct pairs MF model (DPMF model), CpGs within a pair are allowed to interact directly, preserving some nearest-neighbour interactions. The influence of the nearest-neighbour CpGs flanking the pair is then approximated by considering the probabilities that an adjacent pair is in each of the six possible states; see figure 6. Here, each CpG belongs to only one pair and each pair of sites is a single reactant. As with the one-site model, we consider an effective first-order reaction system, given by where the effective rates are given by and . We describe the construction of in appendix A. Essentially, – are constructed by considering all possible ways that each reaction in (3.13) can occur via a reaction from figure 1 taking place. Such reactions can occur within the reactant pair or can take place between a site within the pair and a site from an adjacent pair. Terms that are associated with the k7 and k11 reaction rates and involve interactions between two hh, two hm or two uh pairs have an additional factor of two since either pair can change state during these reactions. While we can, in principle, calculate the distribution of pairs in (3.13) [38], we restrict our attention to obtaining moments of the system.

Figure 6

Schematic of the distinct pairs MF model. The two CpGs within a pair can interact directly with each other and the pair is also influenced by the mean state of pairs in the system. In the figure, the uh pair can change state due to interactions between the u and h within the pair, and due to the mean state of pairs in the system. Let L1, L2, L3, L4, L5, L6 be the level (proportion) of each of the paired states at a single pair of CpGs. Since a pair of CpGs can only be in one state at any time, the state of a single pair of CpGs can be described by one of six vectors S, j = 1, …, 6, where The u, h, m levels within a pair of CpGs, , are then given by Using (3.13), we construct the CME for the system along with the first and second moment equations for L, i = {1, …, 6}; see appendix C. The first moment equations describe μ = 〈L〉, the mean values of L for i = {1, …, 6}. For fixed parameters, solving these equations numerically in steady state gives the steady-state means, μ, i = {1, …, 6}. Note that these means are independent of a. From the second moment equations, we obtain the steady-state expected values of LL, 〈LL〉, for i, j ∈ {1, 2, …, 6}. As expected from equation (3.15), 〈LL〉 = 〈L〉 = μ, 〈LL〉 = 0 for all i ≠ j, i, j ∈ {1, 2, …, 6}. Variances and covariances are then obtained using and Once again, these are independent of the parameter a. We now have statistics for the paired states in (3.12). The steady-state means for the u, h, m levels in a pair are then given by The pair-to-pair variance in u level is given by and similarly for the variances associated with h and m, and . Covariances are given by and analogously for . Note that the statistics obtained so far relate to u, h, m levels within a pair of CpGs. The mean level of a state within a pair is the same as the mean level of the state at each site. However, this is not the case for higher moments. For example, are pair-to-pair variances, rather than site-to-site variances. We aim to obtain statistics relating to z = L + 2L + 3L, where L, L, L are the single-site u, h, m levels. We define , noting that contains information regarding pairs of CpGs. Essentially, where v and w are as in equation (2.5) and we again do not require the superscript t. The steady-state means of z, v, w and coincide and Also, vw = L2 + 2L6 + 3L4 + 4L3 + 6L5 + 9L1 and so the steady-state expected value of vw is From this, we calculate the steady-state covariance between neighbouring sites as The steady-state variance associated with can be calculated via, However, is the pair-to-pair variance. Using σ2(z) = σ2(v) = σ2(w), we obtain leading to the steady-state site-to-site variance The steady-state correlation between neighbouring pairs is obtained via To summarize, the statistics of primary interest from our calculations are: the means μ, μ, μ, μ, the variance σ2(z), the covariance Covar(z) and the correlation ρ(z).

Overlapping pairs mean-field model

Similarly to the DPMF model, there are also six possible states for a pair of CpGs in the overlappling pairs MF model (OPMF model); see (3.12). The OPMF model also incorporates direct interactions within a pair. However, each CpG now belongs to two pairs, one with its left-hand neighbour and one with its right-hand neighbour, leading to a system of overlapping pairs. The influence of CpGs flanking a pair is now approximated by considering the conditional probability that the pair overlaps with another pair of a certain state; see figure 7.

Figure 7

Schematic of the overlapping pairs MF model. A pair of CpGs interact directly with each other and the effect of CpGs flanking the pair is approximated by considering the conditional probabilities that a flanking site is in the u, h or m state, given the state of its neighbour within the pair. As before, we consider an effective first-order reaction system, given by where the effective rates are given by We detail the construction of in appendix B. An identical approach to that in §3.2 leads to the CME and first and second moment equations for the system, see appendix C, and from these we obtain means, variances and covariances associated with the paired states and with the pair-to-pair u, h, m levels. Again, these depend only on x and y. The statistics associated with z are obtained as for the DPMF model; see §3.2.

Model comparison and parameter inference

To test whether MF models are capable of modelling large-scale methylation patterns, we now compare model predictions to synthetic data generated using nearest-neighbour collaborative simulations. The statistical properties obtained from our models are independent of a and so we fix a = 0.2. Since demethylation dominates when y < 1 and methylation dominates when y > 1, we hypothesize that a sharp change in the behaviour of the system may occur at y = 1. To capture this potential transition for different collaborativity strengths, we consider x = {0.1, 1, 5, 50}, y = {0.1, 0.2, …, 2}. Since we approximate the nearest-neighbour collaborative system by MF models, which consider an infinite system of CpGs, finite-size effects cause discrepancies between simulations and MF model predictions as x increases. We counteract this by increasing the number of simulated sites and so simulate N = 200 CpGs when x = {0.1, 1, 5} and N = 500 CpGs when x = 50. For each parameter set, we simulate n = 10 replicate datasets using the Gillespie algorithm (see §2.2). In particular, in each simulation, we take steady-state samples at T = 106/200 = 5000 timepoints when x = {0.1, 1, 5} and at T = 106/500 = 2000 timepoints when x = 50. Each of the replicate datasets therefore contains 106 states, from which we calculate the statistics of interest (see §2.3). We then calculate the means and standard errors over the 10 datasets to obtain overall summary statistics.

Mean-field models capture steady-state methylation levels

We first compare the mean u, h, m levels (μ, μ, μ) from the MF models to those from the simulations (figure 8). Considering the simulated data first, we observe that u and m dominate when y < 1 or y > 1, respectively. h is an intermediate state between u and m and peaks at y = 1, where there is also a sharp transition between u- and m-dominant states.

Figure 8

The OPMF model accurately predicts the average behaviour of large-scale methylation patterns. The mean u, h and m levels are plotted against the methylation strength y for the different models and collaboration strengths, x (left: one-site MF model; middle: DPMF model; right: OPMF model). For each model, the predictions of the means are shown by solid lines (with different colours denoting predictions for different values of x). The points and error bars correspond to the ground-truth means obtained from n = 10 replicate simulations of the full nearest-neighbour model (the mean results obtained from the ten simulations are shown by points with error bars denoting the standard error in this estimate across the ten simulations). While all models capture the qualitative behaviour of the means as x and y are varied, we observe that predictions of methylation levels from the OPMF model are closest to those observed in the simulated data (figure 8). All three models predict the mean u and m levels reasonably well; however only the OPMF model accurately predicts the mean h level for large x and for y close to one. All predictions from the OPMF model are within the error of the simulated data and it successfully captures the transition observed at y = 1 for all x considered. Conversely, the predictions of the other models deviate from the simulations at the transition point when x is large. The one-site MF model deviates to the greatest extent for 87% of the parameter sets and, for each mean, the average percentage error across y is higher for the one-site MF model compared to the other models for all x except μ when x = 5 (see appendix D, table 3). This suggests that the one-site model has the worst predictive power and we exclude it from further analysis.

Table 3

Percentage errors associated with figures 8 and 9 for the MF models and values of x considered. For each x, the relative error was calculated for the three models at each y, where the relative error associated with a statistic γ is |γsim − γpred|/γsim, with γpred being the model prediction and γsim being the result obtained from the simulations. Overall percentage errors were obtained for each model by averaging the relative errors across y (for each x) and multiplying by 100. For all three models, percentage errors are shown for μ, μ and μ. For the paired MF models, percentage errors for μ, σ2(z), covar(z) and ρ(z) are also displayed.

	μ_us	μ_hs	μ_ms	μ_zs	σ²(z_s)	covar(z_s)	ρ(z_s)
x = 0.1
one-site MF	1.46%	3.16%	1.58%	n.a.	n.a.	n.a.	n.a.
DPMF	0.78%	0.77%	0.95%	0.17%	0.45%	25.70%	25.38%
OPMF	0.70%	0.38%	0.90%	0.17%	0.42%	24.32%	24.05%
x = 1
one-site MF	6.11%	6.60%	6.46%	n.a.	n.a.	n.a.	n.a.
DPMF	4.27%	1.91%	3.91%	0.73%	3.34%	7.58%	5.90%
OPMF	1.06%	0.812%	1.65%	0.23%	0.96%	5.22%	5.64%
x = 5
one-site MF	26.5%	6.03%	22.21%	n.a.	n.a.	n.a.	n.a.
DPMF	19.13%	6.99%	14.91%	1.93%	16.52%	26.54%	13.60%
OPMF	2.60%	2.05%	1.63%	0.29%	2.67%	4.37%	3.30%
x = 50
one-site MF	46.00%	21.84%	42.65%	n.a.	n.a.	n.a.	n.a.
DPMF	32.62%	19.80%	28.98%	1.78%	34.89%	44.27%	18.85%
OPMF	7.13%	6.30%	7.20%	0.71%	10.36%	12.84%	6.91%

The overlapping pairs mean-field model accurately predicts associations between neighbouring sites

To test whether MF models can predict associations between neighbouring CpGs, we consider z = {z1, z2, …, z}, where z = {1, 2, 3} if CpG i is in the u, h, m state, respectively. From z, we calculate the mean and variance associated with the methylation state, and the covariance and correlation in methylation state between neighbouring sites. In the simulated data (figure 9), we again observe a transition in these statistics when the methylation and demethylation strengths are equal (y = 1). Our results counterintuitively suggest that neighbouring sites are most correlated here (the peak ρ(z) occurs when y = 1).

Figure 9

The OPMF model accurately predicts associations between neighbouring CpGs. Predictions of the means (μ), variances (σ2(z)), covariances (covar(z)) and correlations (ρ(z)) are plotted against y for the DPMF model (left) and OPMF model (right). For each model, the predictions of the statistics are shown by solid lines (with different colours denoting predictions for different values of x). The points and error bars correspond to the ground-truth statistics obtained from n = 10 replicate simulations of the full nearest-neighbour model (the mean results obtained from the ten simulations are shown by points, with the error bars denoting the standard error of this estimate across the ten simulations). Note that since z is determined by x and y, the statistics plotted here are implicit functions of x and y. To gain insight into this observation, we examine the patterns that evolve in the stochastic simulations for x = 50 (figure 10). When y is small, large u clusters form and we intuitively expect neighbouring sites to be highly correlated. However, these large clusters are interspersed with infrequent, isolated occurrences of h and m which have low correlations with their neighbours. Moreover, ten m (or h) sites appearing in isolation will result in smaller u clusters than the ten sites appearing as a single cluster. The overall effect is to reduce the correlation when y is small. A similar rationale explains the low correlation when y is large. u and m cluster sizes are most similar when methylation and demethylation are equally strong, resulting in u and m sites correlating equally with their neighbours and the overall correlation peaking.

Figure 10

Size of unmethylated (u) clusters and methylated (m) clusters are most similar when y = 1. Visualizations of stochastic simulations over time for different values of y. For each y, the horizontal axis shows simulated methylation patterns for 100 CpGs when x = 50 from the full nearest-neighbour collaborative system (see §2.2). The vertical axis shows patterns obtained at successive time points after steady state has been reached where timepoints are taken at constant time intervals. We next consider predictions of these statistics from the MF models. Predictions from the OPMF model lie within the error observed in the simulated data for all parameters considered (figure 9). Conversely, predicted statistics from the DPMF model show large deviations from the corresponding simulated statistics when x is large and y is close to one demonstrating that it has lower predictive power. The OPMF model again shows superior performance over the DPMF model when the percentage errors are considered. For each z statistic in figure 9, the average percentage error across y is higher for the DPMF model compared to the OPMF model for all x except μ when x = 0.1, where the percentage errors for the two models are the same to 2 decimal places (see appendix D, table 3).

Overlapping pairs mean-field model can infer the parameters underpinning large-scale methylation patterns

While thorough inference exploration is not the objective of the current work, we carry out a short proof-of-concept study to demonstrate that the OPMF model could, in principle, be used to infer collaborativity and methylation strengths from data. We generate synthetic data for selected model parameters, see §2.2, and then infer these parameters back using the OPMF model. Methylomes are typically assayed by whole-genome bisulfite sequencing [39]. A variant of this, hairpin-bisulfite sequencing, can be used to assay both strands of each DNA molecule [40]. In both cases, the resulting data are composed of short reads. Each read assays few CpGs and we do not know if reads originate from the same cell or DNA molecule. Simulated datasets from previous sections do not provide a good reflection of bisulfite sequencing since all of the CpGs in the simulated system were sampled at the same timepoints, the equivalent of the CpGs originating from the same molecule. To obtain short-read data, we instead simulate data for N = 1000 CpGs. After steady state is reached, we sample the system at 10 000 different timepoints. This is equivalent to sampling 10 000 molecules in steady state at a single timepoint. For each CpG, we take the methylation state at 30 timepoints, randomly chosen from the original 10 000. The timepoints chosen for each CpG site are independent of those chosen for other CpGs. This emulates hairpin-bisulfite sequencing data with coverage of 30 reads per CpG. We combine the sample states for all CpGs into a single dataset, X, and consider z = {z1, z2, …, z1000}, where z = {1, 2, 3} if X corresponds to a u, h, m state, respectively. The mean and variance of z are calculated and used for inference. There are numerous well-established methods for conducting inference. For example, in cases where a likelihood function is available, inference can be conducted using maximum-likelihood estimation [41], which provides point estimates for model parameters. In our case, this likelihood is the solution of the CME for the full nearest-neighbour collaborative model and thus is not available to us. We therefore use the approximate Bayesian computation sequential Monte Carlo algorithm (ABC SMC; see [42] for a comprehensive review) which is an alternative likelihood-free inference approach that allows us to use the moments derived from the OPMF model. Using a Bayesian inference approach also has the advantage of providing us with a distribution of the estimate value from which we can calculate confidence intervals associated with our inferred parameter values [43-46]. We use uniform priors, U(0, 100) and U(0, 2), for x and y, respectively. We also define the distance, d, between the simulations and model prediction to be the sum of the absolute relative errors of the mean and variance, i.e. where μmodel, μdata are the means of z from the model and data, respectively, and are the variances associated with the model and data, respectively. To rapidly select appropriate tolerances, we calculate the true distances between the simulated data and model predictions at the parameter values of interest. As in previous sections, we examine x = {0.1, 1, 5, 50}. For each x, we infer for y = {0.3, 1, 1.7} using the GpABC Julia package [47]. Accepted x, y values from the final ABC SMC population make up the posterior distributions for x and y, with the means taken to be the inferred parameter values. 95% confidence intervals were calculated by removing the lowest 2.5% and highest 2.5% from the posteriors. We find that the inferred parameter values are always of the same order of magnitude as the true values (table 2), with the most successfully inferred parameters being inferred within 1% of the true values (figure 11a,b). There are only two cases where the true parameter values lie outwith the inferred 95% confidence intervals (e.g. figure 11c). However, we obtain wide posteriors for large x (e.g. figure 11d), indicating more uncertainty in the inference.

Table 2

Inferred parameters for the nearest-neighbour collaborative model using the OPMF model and the ABC SMC algorithm.

	y = 0.3	y = 1	y = 1.7
x = 0.1	(x, y) = (0.197, 0.329)	(x, y) = (0.066, 0.997)	(x, y) = (0.101, 1.701)
x = 1	(x, y) = (0.676, 0.251)	(x, y) = (1.011, 0.990)	(x, y) = (0.923, 1.730)
x = 5	(x, y) = (6.302, 0.338)	(x, y) = (5.287, 1.000)	(x, y) = (6.432, 1.613)
x = 50	(x, y) = (59.571, 0.329)	(x, y) = (50.963, 0.999)	(x, y) = (65.347, 1.599)

Figure 11

The OPMF model can be used to infer collaborativity and methylation strengths. Example posteriors from inference are shown, with (a–d) each corresponding to a different parameter set. True parameter values are denoted by dashed lines, inferred parameter values are shown as dots, with 95% confidence intervals shown as horizontal bars.

Inferred parameters for the nearest-neighbour collaborative model using the OPMF model and the ABC SMC algorithm. The OPMF model can be used to infer collaborativity and methylation strengths. Example posteriors from inference are shown, with (a–d) each corresponding to a different parameter set. True parameter values are denoted by dashed lines, inferred parameter values are shown as dots, with 95% confidence intervals shown as horizontal bars. The accuracy of inference is highly dependent on the model sensitivity to parameters, with ease of inference increasing as sensitivity to the parameters increases. To test the sensitivity of our model to model parameters, we calculate the relative sensitivity [48] of the OPMF model to the parameters x and y, for x = {0.1, 0.2, …, 99.9, 100} and y = {0.1, 0.2, …, 1.9, 2}. For all parameters considered, the x-sensitivity divided by the y-sensitivity is strictly less than one, indicating that the model shows more sensitivity to y than x. This means that y will be inferred more accurately than x.

Discussion

Genomic DNA methylation patterns vary between cell types, across differentiation and in disease. The mechanisms underpinning this variation remain unclear but can be better understood using mathematical models. Current approaches are limited by their inability to feasibly model large systems of CpGs and thus understand known large-scale features of methylomes. Here, we show that a cluster MF model, based around overlapping pairs of CpGs, can predict DNA methylation patterns generated under a nearest-neighbour collaborative model. This suggests that MF models are a valuable tool for understanding large-scale DNA methylation features. Previous studies have used mathematical modelling to gain insight into the mechanisms regulating the establishment and maintenance of DNA methylation patterns. In particular, the requirement of collaborativity between CpGs to maintain DNA methylation patterns was postulated through modelling [14] before being observed experimentally [15,17]. Previous models of DNA methylation rely on stochastic simulations. However, their computational expense limits their use to the study of promoter-scale DNA methylation (regions around 1 Kb in size). The CPU-time taken to run simulations here on 200–500 sites (corresponding to approx. 20–50 Kb) varies and is parameter-dependent. For example, simulations of ten replicates with 200 CpGs take under one minute when x = 0.1, y = 0.1, but over 20 min when x = 5, y = 0.9. For x = 50, y = 1, simulations of 10 replicates with 500 CpGs take approximately 6 h. There are an average of 10 000 CpGs per megabase in the human genome making these models infeasible for such large genomic regions due to computational expense. This is particularly true in the case of parameter inference where such simulations would need to be rerun many times. By contrast, the OPMF model can be applied to arbitrarily large systems of CpGs and solved numerically in seconds for any parameters to give accurate predictions for statistics of interest. Our model is based upon the same, or similar, reaction systems used in previous stochastic modelling studies [14,29,30], but can be used to study larger systems of CpGs than previously considered. This makes our MF model far better suited to understanding the mechanisms underpinning megabase-sized variations in DNA methylation observed in development, ageing and cancer, which occur at a scale three orders of magnitude larger than promoters [27]. To our knowledge, the largest system previously examined mathematically contained 105 CpGs [15]. However, here simulations were conducted for only a single model parameter set. Running large-scale simulations of this type for many parameter sets will result in computational bottlenecks, meaning that such simulations cannot be used for inference. A previous study has proposed a method, based on the generalized method of moments, for rapid inference using DNA methylation patterns [49]. However, the largest system tackled with this approach contains 10 CpGs. Here, we show that our OPMF model can, in principle, be used for accurate, time-efficient inference when modelling arbitrarily large genomic regions. ‘Divide and conquer’ strategies have also previously been used to reduce the computational expense of large-scale simulations [50,51]. Here, a large simulation is split into smaller batches and batch results are aggregated to give overall estimates. Such an approach could make simulations of the full nearest-neighbour collaborative model applicable to larger systems of CpGs by reducing the computational burden. However, we have observed finite-size effects in our stochastic simulations suggesting that splitting them into smaller batches would result in discrepancies between estimates obtained via this approach and those obtained via a single large-scale simulation. Our results show that the success of a MF model depends on the way in which it is constructed. Of the three MF models considered here we observe that the one-site MF model and OPMF model provide the worst and best approximations to the full nearest-neighbour model, respectively. In the full nearest-neighbour collaborative system, the probability that a CpG is in a particular state depends on the state of its nearest neighbours on the left and right. Since the one-site MF model contains no spatial information, it least reflects the full model. The DPMF and OPMF models both incorporate spatial information by allowing direct nearest-neighbour interactions to occur within pairs of CpGs. However, in the DPMF model, non-overlapping pairs are considered and the probability that a pair is in a certain state depends only on the mean state of pairs in the system, rather than on the states of its adjacent pairs. Conversely, in the OPMF model, overlapping pairs are considered and the state of a pair depends on the state of pairs it overlaps. Hence the OPMF model best reflects the full model. When used for inference, the OPMF model has a higher sensitivity to methylation strength (y) than collaborativity strength (x) explaining why the former is generally better inferred than the latter. However, some posteriors obtained in §4.3 are very wide and/or the true parameter values lie outside the inferred 95% CIs, indicating that there is scope for inference to be improved. Since our model shows impressive performance in forward prediction, discrepancies between true and inferred parameters are likely due to insufficient data or the inference technique used. ABC algorithms are sensitive to the summary statistics, distance and tolerance used in the acceptance criteria for parameters [52]. Moreover, regardless of the method used, inference will always be difficult for systems of CpGs that are either methylated or unmethylated. This is because a very large x would result in unmethylation for almost all y < 1 and methylation for almost all y > 1. Despite these challenges, our analysis demonstrates that the OPMF model can in principle be used for inference. The accuracy of inference may be improved in future studies by experimenting with different inference techniques and sample sizes. In addition, the summary statistics that can be reliably calculated from real data are dependent on the technology used to generate the data. Most DNA methylation data are currently derived from technologies that produce short reads containing between one and three CpGs. Statistics involving associations between consecutive CpGs from a single molecule cannot therefore be accurately estimated. This restricts inference based on such data to statistics such as the mean and standard deviation. Long-read technologies which can capture hundreds of CpGs on a single read have recently been applied to assay DNA methylation patterns [53]. Data obtained from such technologies could allow higher order statistics to be accurately estimated and potentially improve inference. Here, we assume that the processes governing the creation of methylation patterns in vivo are described well by our nearest-neighbour collaborative model. It is possible that collaborativity in vivo can occur between non-nearest-neighbours, something which is not explicitly accounted for in our nearest-neighbour collaborative model. Collaborative methylation interactions are likely determined by the properties of the DNA methylation machinery. DNMT1 and DNMT3B both methylate processively along DNA strands whereas DNMT3A methylates in a distributive manner but can form multimers along the DNA fibre [32]. However, the range and strengths across which these interactions occur are currently unclear so we focus on nearest-neighbour interactions. We note that our OPMF model does capture interactions beyond nearest neighbours because the mean state of pairs in the system influences the change in state of CpGs. Previous modelling studies have also considered different forms of collaborative interactions between CpGs in a system (figure 12a,b). In [14], collaboration between any CpGs in the system and nearest-neighbour collaborative methylation alongside distance-dependent collaborative demethylation were both demonstrated to produce stable CpG clusters that were either methylated or unmethylated. Stable clusters were also observed under a distance-dependent collaborative model, where collaborative demethylation dominates over short ranges and collaborative methylation dominates over long ranges [29].

Figure 12

Potential collaborative interactions that can influence a target, X, under the models in [14] and the model considered here. (a) A cluster of 80 CpGs is first considered in [14], where a CpG can collaborate with any other CpG in the system with equal probability. (b) A high-density cluster (of 80 CpGs) adjacent to a highly methylated low-density region (of 240 CpGs) is then considered in [14], where there is nearest-neighbour collaborative methylation (red arrows) and the probability of collaborative demethylation occurring due to interaction between two sites decays as the distance between them increases (blue arrows; decay in reaction probability shown by narrowing width of arrows). Note that collaborative demethylation is restricted to the 80-CpG cluster. (c) In this paper, collaborative reactions only occur between neighbouring CpGs and there is no upper bound on the system size, allowing large-scale patterns to be considered. Here, we have also assumed that the system of CpGs we model reaches a steady state. This means that we consider either non-dividing cells, or dividing cells which settle down to steady state between replication events such that DNA replication has no long-term effects on DNA methylation patterns. Whether or not a cell satisfies the latter case is dependent on the real magnitudes of k, i = {1, …, 12}, and the time between replication events. Experimental studies show that arrested cells have similar DNA methylation patterns to those that are cycling, supporting the assumption that DNA replication has no long-term effect on DNA methylation [33]. Furthermore, an analysis of DNA methylation patterns on newly synthesized DNA suggests that re-methylation occurs within 20 min of replication [54]. However, another analysis of DNA methylation following replication suggests re-methylation is often delayed [55]. At present, it is unclear whether this delay is sufficient to have an effect on methylation patterns during the following cell cycle. Our assumption that k, i = {1, …, 12} take the form in table 1 could be violated in reality. For example, DNMT1 shows a strong preference for h sites over u sites [32], meaning that methylation reactions with an h target may have higher reaction rates than those with a u target. Future work could relax rate assumptions to account for such factors. Preliminary investigations confirm that the OPMF model provides a good approximation to the nearest-neighbour collaborative system when k, i = {1, …, 12} are considered as twelve independent parameters (data not shown). However, the difficulty of parameter inference increases with the number of parameters, meaning that relaxing parameter assumptions will likely decrease inference quality. Nonetheless, our model suggests that it is the parameters x and y that determine steady-state methylation patterns, rather than individual reaction rates. This is supported by a study where modelling of experimental data suggested that the ratio between methylation and demethylation rates determines steady-state methylation levels at single CpGs [56]. Here, we demonstrate that MF models can accurately predict the behaviour of large CpG systems subjected to nearest-neighbour collaboration. Our study presents the first mathematical modelling approach that can be applied to arbitrarily large systems of CpGs. The future application of this approach will facilitate the delineation of the methylation dynamics that underpin the formation of large-scale methylation patterns in developmental and disease contexts.

47 in total

1. Approximate Bayesian computation in population genetics.

Authors: Mark A Beaumont; Wenyang Zhang; David J Balding
Journal: Genetics Date: 2002-12 Impact factor: 4.562

2. Solving the chemical master equation for monomolecular reaction systems analytically.

Authors: Tobias Jahnke; Wilhelm Huisinga
Journal: J Math Biol Date: 2006-09-05 Impact factor: 2.259

Review 3. Stochastic simulation of chemical kinetics.

Authors: Daniel T Gillespie
Journal: Annu Rev Phys Chem Date: 2007 Impact factor: 12.703

4. Discrete stochastic models for traffic flow.

Authors:
Journal: Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics Date: 1995-04

5. Chromosome instability and immunodeficiency syndrome caused by mutations in a DNA methyltransferase gene.

Authors: G L Xu; T H Bestor; D Bourc'his; C L Hsieh; N Tommerup; M Bugge; M Hulten; X Qu; J J Russo; E Viegas-Péquignot
Journal: Nature Date: 1999-11-11 Impact factor: 49.962

6. Targeted mutation of the DNA methyltransferase gene results in embryonic lethality.

Authors: E Li; T H Bestor; R Jaenisch
Journal: Cell Date: 1992-06-12 Impact factor: 41.582

7. DNA methylation in human epigenomes depends on local topology of CpG sites.

Authors: Cecilia Lövkvist; Ian B Dodd; Kim Sneppen; Jan O Haerter
Journal: Nucleic Acids Res Date: 2016-02-29 Impact factor: 16.971

8. A genome-scale map of DNA methylation turnover identifies site-specific dependencies of DNMT and TET activity.

Authors: Paul Adrian Ginno; Dimos Gaidatzis; Angelika Feldmann; Leslie Hoerner; Dilek Imanci; Lukas Burger; Frederic Zilbermann; Antoine H F M Peters; Frank Edenhofer; Sébastien A Smallwood; Arnaud R Krebs; Dirk Schübeler
Journal: Nat Commun Date: 2020-05-29 Impact factor: 14.919

9. Stochastic modeling reveals kinetic heterogeneity in post-replication DNA methylation.

Authors: Luis Busto-Moner; Julien Morival; Honglei Ren; Arjang Fahim; Zachary Reitz; Timothy L Downing; Elizabeth L Read
Journal: PLoS Comput Biol Date: 2020-04-10 Impact factor: 4.475

Review 10. Genomic insights into cancer-associated aberrant CpG island hypermethylation.

Authors: Duncan Sproul; Richard R Meehan
Journal: Brief Funct Genomics Date: 2013-01-21 Impact factor: 4.241