seqlm: an MDL based method for identifying differentially methylated regions in high density methylation array data.

Raivo Kolde¹, Kaspar Märtens², Kaie Lokk³, Sven Laur², Jaak Vilo².

Abstract

MOTIVATION: One of the main goals of large scale methylation studies is to detect differentially methylated loci. One approach is to analyze the data sitewise, i.e. to find differentially methylated positions (DMPs). However, it has been shown that methylation is regulated in longer genomic regions, so it is more desirable to identify differentially methylated regions (DMRs) instead of DMPs. New high coverage arrays, like Illumina's 450k platform, make this possible at a reasonable cost. A few tools exist for DMR identification from this type of data, but there is no standard approach.
RESULTS: We propose a novel method for DMR identification that detects the region boundaries according to the minimum description length (MDL) principle, essentially solving the problem of model selection. The significance of the regions is established using linear mixed models. Using both simulated and large publicly available methylation datasets, we compare seqlm performance to alternative approaches. We demonstrate that it is both more sensitive and more specific than competing methods. This is achieved with minimal parameter tuning and, surprisingly, the quickest running time of all the tried methods. Finally, we show that the regional differential methylation patterns identified on sparse array data are confirmed by higher resolution sequencing approaches.
AVAILABILITY AND IMPLEMENTATION: The methods have been implemented in the R package seqlm, which is available through Github: https://github.com/raivokolde/seqlm
CONTACT: rkolde@gmail.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27187204 PMCID: PMC5013909 DOI: 10.1093/bioinformatics/btw304

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

DNA methylation is an important cellular mechanism that is associated with processes like X-chromosome inactivation and genomic imprinting. It has also been related to several diseases such as diabetes, schizophrenia and cancer (Baylin and Jones, 2011; Mill et al.; Toperoff et al.). In recent years the role of methylation in various diseases has received considerable interest from the research community. This can be attributed largely to the development of high-density methylation microarrays, like the Illumina Infinium 450K, which have made it affordable to characterize genome-wide methylation patterns in large disease related cohorts.

The Illumina 450K microarray covers around 20 CpG sites per gene. Such resolution reveals a spatially correlated structure of DNA methylation: closely situated CpG sites often display almost identical methylation patterns. This feature was already seen in the early sequencing studies (Eckhardt et al.), and it has also been shown that methylation is regulated in longer regions (Lienert et al., 2011). While strong spatial correlation is a dominant feature in the data, common analysis methods do not take it into account. For example, differential methylation analysis is commonly performed in a sitewise manner (Marabita et al.; Wessely and Emes, 2012), thus ignoring correlations between probes.

To take spatial correlations into account when performing the analysis, it is natural to search for differentially methylated regions (DMRs) instead of sites. Statistically, this could improve the sensitivity of the analysis and make the results less redundant. Biologically, differential methylation supported by multiple independent probes is less likely to represent an experimental artefact. DMR centric analysis has been performed in multiple studies (Bell et al.; Lokk et al., 2014; Slieker et al.) and there are several tools available for it (Jaffe et al.; Pedersen et al.; Sofer et al.; Wang et al.). However, despite the numerous tools available, there is no standard approach for DMR identification.
One possibility is to use predefined regions that are based on genomic features such as gene parts or CpG islands. This has been implemented in the R package IMA (Wang et al.), but it has several shortcomings. Such an analysis often reveals a large number of regions that cover the same set of differentially methylated sites, while their rankings depend more on the concordance between the borders of true DMRs and the predefined regions than on the true extent of differential methylation.

Another, more general approach is to define the regions dynamically based on the data. One such method is Comb-p (Kechris et al., 2010; Pedersen et al.), which combines single site P-values by using sliding windows and taking into account the correlation between sites. This method operates on P-values, which makes it flexible and computationally efficient. However, the DMRs are then based on summary statistics, and this may lose some information compared to modeling the measurements directly. Also, the user must pick the minimum P-value required to start a region, and the resulting regions strongly depend on this parameter value.

Perhaps the most well-known method is bump hunting (Jaffe et al.), integrated into the R package minfi (Aryee et al.). Essentially, it performs site-level analysis on spatially smoothed data and then applies some rules to aggregate the sites into regions. Significance of these regions is assessed using permutations. The number and nature of bumphunter results depend strongly on the effect size cutoff and smoothing window size parameters, which are hard to interpret in biological terms and thus tricky to optimize. Moreover, due to smoothing the method is unable to detect single-site differential methylation (Jaffe et al.), making it less effective in sparsely covered regions.

Another tool, Aclust (Sofer et al.), defines the regions by gradually clustering consecutive sites together. The significance of the identified clusters is tested using a Generalized Estimating Equations (GEE) model. This approach relies on an even larger number of user-defined parameters, such as the correlation metric, agglomeration method, correlation threshold, etc.

In this article we present a novel method for identifying DMRs. Probes are grouped into regions based on the similarity of their differential methylation profiles by using the Minimum Description Length (MDL) principle. The significance of these regions is estimated using linear mixed models. Such an information theoretic background makes the model flexible in a variety of situations without the need for extensive parameter tuning. For validation, we show that our approach is effective in finding true DMRs while appropriately controlling the number of false positives on both simulated and real methylation datasets.

2 Approach

Biologically, a differentially methylated region is a rather intuitive concept: a collection of consecutive methylation sites in the DNA where the average methylation levels differ significantly between the tissues of interest. However, the intuition does not translate into a clear definition that could be used for DMR finding. For example, should we prioritize DMRs that cover the most CpG sites, span the largest genomic distance or exhibit the highest differential methylation? As there is no obvious biological criterion to optimize, there are many ad hoc methods for finding DMRs.

From the statistical point of view, however, it is clearer what we want to achieve. We know that the dataset contains redundant information in terms of spatially correlated probes. The goal is to find a smaller set of features by combining the correlated consecutive sites into regions, while preserving the underlying signal. The resulting smaller and more independent set of features can then be used for performing differential methylation analysis. In seqlm we implement this strategy as the following three stage procedure:

1. The genome is divided into initial segments, according to the distances between consecutive CpG probes.
2. These segments are subdivided into regions, based on the differential methylation patterns.
3. For each region, the statistical significance of differential methylation is assessed.

Stages 1 and 3 are relatively straightforward to carry out, while the main novelty of the method lies in stage 2. Next we introduce all the stages (illustrated in Fig. 1) one by one.

Fig. 1.

Method workflow. First, the genome is segmented based on the distance between consecutive probes. The boxplots show the dependence between the distance and the correlation of methylation patterns. Second, the resulting segments are divided further into regions with consistent methylation profiles. Finally, the differential methylation is tested using a linear mixed model (Color version of this figure is available at Bioinformatics online.)

2.1 Initial segmentation

As the second step in our analysis is computationally rather intensive, we do the initial partitioning using simpler rules. CpG sites on the genome cluster into tighter groups near promoters and other functional elements, and the arrays also concentrate on these regions. Thus, we do not lose much information if we identify denser regions just based on genomic location. We have chosen 1000 bp as an upper bound for the distance between two consecutive probes that belong to one region. The exact cut-off value was determined by exploring, in a large dataset (Lokk et al., 2014), the relation between the methylation correlation of consecutive sites and their genomic distance (Fig. 1). As expected, very close pairs of probes (less than 100 bp apart) are highly correlated. However, when the distance is more than 1000 bp, the preferential correlation effect seems to be diminished. Such initial segmentation creates many isolated sites and short regions, but also a substantial number of more populous segments in areas where the array has sufficient coverage. In the next step, these will be subdivided into regions according to their methylation patterns.
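The distance rule above can be sketched in a few lines. This is only an illustration, not the seqlm implementation: the function name `initial_segments` and its arguments are hypothetical, and it assumes the probe positions come from a single chromosome and are already sorted.

```python
def initial_segments(positions, max_gap=1000):
    """Group sorted probe positions (one chromosome) into initial segments.

    Two consecutive probes stay in the same segment when they are at most
    max_gap base pairs apart. Returns lists of probe indices.
    """
    if not positions:
        return []
    segments = [[0]]
    for k in range(1, len(positions)):
        if positions[k] - positions[k - 1] > max_gap:
            segments.append([k])      # gap too large: start a new segment
        else:
            segments[-1].append(k)    # close enough: extend the current one
    return segments


# Hypothetical probe coordinates with gaps of 50, 950, 1400, 100 and 2400 bp.
example = initial_segments([100, 150, 1100, 2500, 2600, 5000])
```

Isolated probes simply end up as segments of length one, which matches the remark that the initial segmentation creates many isolated sites.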

2.2 Refined methylation based segmentation

Given a continuous stretch of CpG probes, we want to divide it into regions with homogeneous methylation patterns with respect to our variable of interest, e.g. such that the extent of differential methylation is constant within each segment. Let $y_{ij}$ denote the methylation value for sample $i$ and site $j$, and let $x_i$ be the variable describing the samples. For two distinct groups, discrete values of $x_i$ are appropriate. If the difference between our groups of interest is constant within a segment, it is well described by a linear model:

$$y_{ij} = \mu_j + \beta x_i + \varepsilon_{ij}, \qquad (1)$$

where $\mu_j$ is the baseline methylation level of site $j$, $\beta$ is the average effect size within this segment and $\varepsilon_{ij}$ is the residual error. The same model can also handle continuous $x_i$; then we are looking for regions where the variable of interest affects methylation in a consistent manner.

The core of the segmentation algorithm is depicted in the middle portion of Figure 1. In a given stretch of DNA we try out all the possible segmentations of CpG probes (five of those are displayed in the figure). For each segment, we fit the linear model (1); the coefficients are shown for the first segmentation. The optimal segmentation should prefer longer regions to shorter ones, while ensuring that the segment-wise linear models provide a good fit. This goal can be viewed as a model selection problem and solved using the Minimum Description Length (MDL) principle.
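For reference, the least squares fit of the segment model has a simple closed form. The derivation below is a sketch under notation that is assumed here rather than taken from the article: $n$ samples, $m$ sites in the segment, $\bar{x}$ the mean of the $x_i$ and $\bar{y}_j$ the mean methylation of site $j$ over samples.

```latex
\begin{aligned}
(\hat{\mu}, \hat{\beta}) &= \arg\min_{\mu, \beta} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( y_{ij} - \mu_j - \beta x_i \right)^2, \\
\hat{\mu}_j &= \bar{y}_j - \hat{\beta}\, \bar{x}, \\
\hat{\beta} &= \frac{\sum_{j=1}^{m} \sum_{i=1}^{n} (x_i - \bar{x})(y_{ij} - \bar{y}_j)}{m \sum_{i=1}^{n} (x_i - \bar{x})^2}.
\end{aligned}
```

For two groups coded as $x_i \in \{0, 1\}$, $\hat{\beta}$ reduces to the difference between the group means at each site, averaged over the $m$ sites of the segment.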

2.2.1 Minimum description length principle

The MDL principle is a way to find an optimal trade-off between the complexity of a model and its accuracy (Rissanen, 1978). It states that from a collection of models $\mathcal{M}$, one should choose the model that gives the shortest description of the data. More formally, let $L(M)$ be the description length of a model $M$ and $L(D \mid M)$ the description length of data $D$ given the model $M$. Then one should choose

$$\hat{M} = \arg\min_{M \in \mathcal{M}} \left[ L(M) + L(D \mid M) \right].$$

The one-to-one correspondence between probability distributions and code lengths (Hansen and Yu, 2001) allows us to calculate $L(D \mid M)$ as the negative log-likelihood of data $D$ given model $M$. Thus, finding the optimal model is equivalent to maximizing a penalized log-likelihood function

$$\log P(D \mid M) - L(M), \qquad (2)$$

where the term $L(M)$ measures model complexity. The MDL principle has been successfully used in various contexts in computational biology, such as haplotype block detection (Koivisto et al., 2003) and motif discovery (Ritz et al.).

2.2.2 Finding the optimal segmentation

To apply the MDL principle to segmentation, we must first extend the linear regression model (1) for one segment to the entire collection of segments. Let the shorthand $[s, e]$ denote a segment of consecutive CpG sites, where $s$ and $e$ represent the start and end positions of the segment. Then a segmentation $S$ is a gapless collection of non-overlapping segments $[s_1, e_1], \ldots, [s_k, e_k]$. As a result, we can characterize methylation intensities in a fixed segmentation with a piecewise linear model

$$y_{ij} = \sum_{l=1}^{k} I(s_l \le j \le e_l)\, M_l(x_i, j) + \varepsilon_{ij}, \qquad (3)$$

where $I$ is the indicator function and $M_l(x_i, j) = \mu_j + \beta_l x_i$ represents the linear model (1) that is fitted to segment $l$. Given a segmentation $S$, fitting the model (3) to a region reduces to finding the least squares estimates for the linear models in each segment. Thus, it is straightforward to find the maximal penalized log-likelihood for each segmentation. See the supplementary material for the exact expressions of $L(M)$ and $L(D \mid M)$.

To minimize description length over the entire region, we can check all possible segmentations. As a result, we obtain a balance between the number of segments and the goodness of fit of the linear models. Fewer segments imply a smaller number of parameters in model (3) and thus decrease the term $L(M)$. As an increase in the number of segments provides better fitting models and decreases $L(D \mid M)$, there exists a balance point.

In practice, exhaustive search for the optimum can be avoided by first fitting the linear model (1) to every possible segment and then using dynamic programming to find the best segmentation. See the supplementary material for further details. The complexity of such an algorithm grows quadratically with the number of probes in the original region. In the case of the Illumina 450K array this is not a problem, since the regions created in the first step are short enough. For larger regions, one could potentially limit the complexity growth by introducing a restriction on the number of probes in one segment. Finally, the calculations can be trivially parallelized, making them feasible even for large datasets.
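The dynamic programming step can be illustrated with a minimal sketch. It does not use the exact description lengths from the supplementary material; instead it assumes Gaussian noise with a known variance `sigma2` and charges a fixed BIC-like `penalty` per segment (one extra effect size per segment, since the per-site baselines are present in every segmentation). The names `fit_segment` and `best_segmentation` are hypothetical, and the variable `x` must not be constant across samples.

```python
import math


def fit_segment(y, x, start, end):
    """Least squares fit of y[i][j] = mu_j + beta * x[i] on sites start..end-1.

    y[i][j] is the methylation of sample i at site j. Returns (beta, rss).
    """
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    site_means = []
    numerator = 0.0
    for j in range(start, end):
        y_mean = sum(y[i][j] for i in range(n)) / n
        site_means.append(y_mean)
        numerator += sum((x[i] - x_mean) * (y[i][j] - y_mean) for i in range(n))
    beta = numerator / ((end - start) * sxx)
    rss = 0.0
    for j, y_mean in zip(range(start, end), site_means):
        for i in range(n):
            rss += ((y[i][j] - y_mean) - beta * (x[i] - x_mean)) ** 2
    return beta, rss


def best_segmentation(y, x, sigma2=0.01, penalty=None):
    """Return the (start, end) segments minimizing the summed segment costs."""
    n, m = len(y), len(y[0])
    if penalty is None:
        penalty = 0.5 * math.log(n * m)  # assumed stand-in for L(M) per segment
    best = [0.0] + [math.inf] * m  # best[e]: minimal cost of sites 0..e-1
    back = [0] * (m + 1)           # back[e]: start of the last segment
    for e in range(1, m + 1):
        for s in range(e):
            _, rss = fit_segment(y, x, s, e)
            cost = best[s] + rss / (2.0 * sigma2) + penalty
            if cost < best[e]:
                best[e], back[e] = cost, s
    segments = []
    e = m
    while e > 0:  # walk the back-pointers from the last site to the first
        segments.append((back[e], e))
        e = back[e]
    return segments[::-1]


# Toy data: 4 samples, 6 sites; only the last three sites differ between groups.
x = [0, 0, 1, 1]
y = [[0.5] * 6,
     [0.5] * 6,
     [0.5, 0.5, 0.5, 0.8, 0.8, 0.8],
     [0.5, 0.5, 0.5, 0.8, 0.8, 0.8]]
segments = best_segmentation(y, x)
```

On this toy input the recursion splits the stretch at the point where the group difference appears. For clarity the sketch refits each candidate segment inside the double loop; caching the fits, as the text describes, would avoid repeated work.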

2.3 Assessing statistical significance

The previous step provides a collection of genomic regions and single sites. To identify which of these regions show condition specific methylation, we have to assess the extent of the differences in methylation. As we already fitted linear models to all of the regions in the previous step, we can use the same models to assess the significance of these DMRs. However, the model (1) does not take into account that nearby measurements from the same sample are highly correlated. Hence, the resulting significance is greatly inflated. To take these correlations into account, we must add a sample specific methylation baseline to the model (1):

$$y_{ij} = \mu_j + \beta x_i + b_i + \varepsilon_{ij}, \qquad (4)$$

where $b_i$ is a random effect for sample $i$. The resulting linear mixed model distinguishes sample- and condition-based effects. Interestingly, the extra term in the model (4) does not change the estimate of $\beta$ compared to (1), but adjusts the respective P-values appropriately. Finally, these P-values must be adjusted for multiple testing.

In general, using the same variable $x$ for finding the segmentation and for assessing the significance of regions can inflate significance. For example, selecting region boundaries based on the strength of differential methylation or removing 'less-promising' regions would immediately introduce bias. However, in our approach, the second stage maximizes the coherence of the regions rather than differential methylation, and we do not select regions before applying model (4). As a result, the method does not inflate the number of false positive findings (see also Table 1).
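The multiple testing adjustment mentioned above can be illustrated with the Benjamini-Hochberg procedure. This is an assumption for the sake of the example: the article only states that FDR corrected P-values are used, not which procedure produces them. The function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical region-level P-values.
region_pvals = [0.01, 0.04, 0.03, 0.5]
region_fdr = bh_adjust(region_pvals)
significant = [q < 0.05 for q in region_fdr]
```

Because regions rather than single sites are tested, the number of hypotheses entering the correction is smaller than in a sitewise analysis.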

Table 1.

Detected DMRs in the simulation study with 5000 generated DMRs, with an average effect size μ

	Bumphunter						Aclust
	TP		FP		missed		TP		FP		missed
μ	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites
0.0	0	0	0	0	0	0	0	0	87	255	0	0
0.025	0	0	0	0	5000	28697	1568	6385	352	1273	3464	18545
0.050	1	36	0	0	4999	28946	3241	15485	537	1980	1947	8903
0.075	5	174	0	0	4994	28958	4039	20952	581	2068	1213	5067
0.10	18	788	0	0	4978	28486	4423	24077	619	2158	810	3047
0.15	33	1401	0	0	4956	28079	4642	26681	646	2254	456	1656
0.20	63	2046	0	0	4930	26652	4715	26771	641	2224	356	1002

	Comb-p						seqlm
	TP		FP		missed		TP		FP		missed
μ	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites
0.0	0	0	71	555	0	0	0	0	0	0	0	0
0.025	979	6005	61	506	3967	21005	1004	4792	65	330	3941	22803
0.050	2353	15518	40	295	2538	11540	3014	15713	172	862	1988	11079
0.075	3237	20921	29	181	1632	6731	4060	21450	201	1027	1055	5892
0.10	3686	23897	28	169	1169	4475	4626	24649	232	1241	591	3514
0.15	4140	26599	23	158	689	2343	5082	27137	257	1373	252	1734
0.20	4360	27005	27	185	467	1388	5280	27457	263	1354	112	917

For each method and for each μ, the total number of detected regions and the corresponding number of sites is given (divided into true and false positives, 'TP' and 'FP'), together with the number of missed regions.

3 Results

We demonstrate the utility of the seqlm method in three parts. First, we study the statistical properties and performance of seqlm and other methods on simulated data. Second, we apply the methods to a large public methylation dataset covering 17 different tissues. In all cases we compare seqlm with bumphunter, Aclust, Comb-p and IMA. Finally, we validate several DMRs that were identified using seqlm on Illumina 450K chip data with Sanger sequencing.

3.1 Simulation study

To study the sensitivity and specificity of the DMR finding algorithms we need a dataset where we know the true differential methylation patterns. For that we permuted labels on a real dataset and introduced differential methylation by changing methylation levels inside specific regions (see Section 4 for further details). This allowed us to preserve much of the structure of the original data, but at the same time introduce controlled variation into it. Without introducing any differential methylation, this scheme was effective in generating data distributed according to the null hypothesis: the sitewise t-test P-values follow the expected uniform distribution.

To test the algorithms, we ran all DMR finding algorithms on the simulated data, using FDR 0.05 as a cutoff. After that we counted the number of true and false positive results. If a detected DMR overlapped with a region where we inserted differential methylation, we classified it as a true positive, otherwise as a false positive. The results can be seen in Table 1. Each row corresponds to a different μ value, which represents the average generated effect size between the two groups. The columns show the number of detected regions or the number of sites within them. Together with the true and false positives we also show the number of missed regions and sites.

The first row serves as a check on the statistical validity of the algorithms, as we do not expect to find any DMRs for μ = 0. Neither bumphunter nor seqlm finds any DMRs in this case. Overall, the number of regions bumphunter identifies is an order of magnitude smaller compared to other methods, even with very large effect sizes. On the other hand, Aclust identifies 87 significant regions that cover 255 sites for the first row. This indicates that the Generalized Estimating Equations (GEE) model used by Aclust, combined with the FDR correction, might be inappropriate for this type of data. Moreover, the proportion of false positive sites stays considerably higher than the expected 5% even if we introduce differential methylation. Interestingly, with Comb-p the number of false positives is not high overall, but there are several of them even if the dataset does not contain any signal. Thus, both Comb-p and Aclust can return too many false positives, especially when the signal in the data is weak. Even if their number of true positive sites is comparable, it is still consistently lower than for seqlm.

In terms of sensitivity, the performance of seqlm is comparable to Aclust and Comb-p. For higher effect sizes, it consistently finds a larger number of true positive sites, while keeping the number of false positives below 5% in most cases.

The comparison with the IMA package is more complicated, since IMA outputs overlapping results. For example, a significantly differentially methylated region can partially overlap with several CpG islands, exons, promoters and other functionally relevant regions. Thus, for the IMA results we calculated two sets of true and false positive values: for all results and for unique results. The results for IMA are given in Supplementary Table 1. In brief, the IMA package finds the same order of magnitude of unique sites and regions as Aclust and seqlm without exceeding the 5% threshold for false positives. The main problem with IMA is that on average each site is reported as a member of two regions, and in many cases a few differentially methylated sites can drive the significance of tens of regions. One could merge the overlapping significant regions as we did, but this would lose the interpretability of the results. To summarize, seqlm is both more sensitive and more specific than the four alternative algorithms.
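The overlap-based counting of true positives, false positives and missed regions can be sketched as follows. The representation of regions as closed `(start, end)` coordinate pairs on one chromosome and the function names are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify(detected, truth):
    """Split detected regions into TP and FP and list the missed true regions."""
    tp = [d for d in detected if any(overlaps(d, t) for t in truth)]
    fp = [d for d in detected if not any(overlaps(d, t) for t in truth)]
    missed = [t for t in truth if not any(overlaps(d, t) for d in detected)]
    return tp, fp, missed


# Hypothetical coordinates: two detections hit a simulated DMR, one does not.
detected = [(10, 20), (50, 60), (100, 110)]
truth = [(15, 30), (105, 120), (200, 210)]
tp, fp, missed = classify(detected, truth)
```

Counting per detected region means that a single simulated DMR split into two detections contributes two true positives, so the number of true positive regions is not bounded by the number of simulated DMRs.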

3.2 Comparisons on the 17 tissues data

As a more practical comparison, we also applied all DMR finding methods to a real dataset that describes methylation in 17 tissues. We searched for DMRs specific to single tissues and performed a sitewise t-test as a comparison. The numbers of significant regions and of sites covered by those regions are given in Table 2.

Table 2.

For each approach, the number of significant DMRs and the corresponding number of sites has been given

	Single site	Bumphunter		Aclust		Comb-p		seqlm
	# sites	# regions	# sites	# regions	# sites	# regions	# sites	# regions	# sites
Lymph node versus others	58	1	4	2543	6943	93	564	21	61
Gall bladder versus others	2875	5	99	6585	16054	722	3062	1359	2758
Gastric mucosa versus others	23359	2	64	16661	26494	2663	25248	5218	24717
Artery versus others	13639	30	369	16823	32626	2095	9418	9126	16775
Bone, joint-cartilage versus others	11553	36	513	17208	38107	2104	9042	6816	13232
Bladder versus others	19822	39	537	31413	68352	3785	11865	13529	22938
Adipose versus others	28379	32	407	34191	63725	4182	17910	15776	33196
Ischiatic nerve versus others	27163	20	275	37846	76119	4947	15895	16894	30337
Aorta versus others	40347	116	1081	40063	64206	6307	20809	27514	47593
Tonsils versus others	87950	12	283	67344	95188	6695	53474	33549	94807
Medulla oblongata versus others	98352	179	1887	97618	139954	14461	51007	62370	119795
Bone marrow versus others	173080	507	4217	153184	191147	33698	108590	100910	187950

The results are consistent with the simulation study. Bumphunter finds a much smaller number of DMRs than Aclust, Comb-p or seqlm. Compared to the single site analysis, seqlm consistently identifies 10–15% more differentially methylated sites. As the single site analysis can be considered a good baseline, we can see that the seqlm algorithm does not pick up spurious signals. Instead, it gives roughly the same set of sites grouped into fewer regions.

The behaviour of Aclust and Comb-p is less consistent. In some cases, the number of sites reported by Aclust is several times higher than for the single site t-test. On other occasions, the numbers are more comparable but always higher. Given the over-sensitivity of Aclust we showed on simulated data, such differences are rather suspicious and the results are likely to contain more than 5% false positives. Comb-p, in turn, seems to find many regions in comparisons where other methods do not find much, and fewer in other comparisons with stronger signals.

When considering the length of the identified regions we can also see slight differences. Supplementary Figure S1 details the length distribution of all regions with at least 2 sites in Table 2. Bumphunter tends to identify the longest regions and returns almost no regions with fewer than 3 sites, as the length of the region is also an important criterion in its region selection. For the other methods the distribution was more skewed towards short regions, with Aclust giving the shortest and seqlm and Comb-p giving slightly longer ones. In summary, seqlm seems to give a consistent improvement over single site analysis without returning a suspiciously high number of differentially methylated sites.

3.3 Validation of identified DMRs

While the 450K sites on Illumina methylation arrays represent a marked improvement over previous technologies, they still represent only 1.5% of the 29M CpG sites on the genome. Thus, it is not a priori clear that DMRs inferred from such low-resolution measurements reflect biological reality and are not technical artifacts. To address this concern, we have validated 14 of the DMRs identified in Table 2 using Sanger sequencing. Figure 2A shows an example of such a validation. We can see that there is good agreement between the results of the two approaches at the sites that are measured by both. More importantly, the differential methylation pattern identified at the array level can also be observed at the intermediate sites.

Fig. 2.

Validation of identified DMRs. Panel A shows an example from the tissue dataset together with a DMR detected by seqlm (large grey box). The top plot shows the data at array resolution; in the bottom plot a portion of the DMR has been zoomed in, and methylation levels of intermediate CpG sites obtained by Sanger sequencing are shown. Panel B shows the effect sizes that are estimated from array data and from Sanger sequencing data for all 14 regions. The 95% confidence intervals are shown for both sources of estimates (Color version of this figure is available at Bioinformatics online.)

Figure 2B summarizes the results for all 14 regions by showing the estimated effect sizes from both array and Sanger sequencing data. We can see that differential methylation of all the regions is confirmed at the higher resolution. Moreover, the effect sizes from array and Sanger measurements are well correlated. In 6 cases out of 14 the effect size, seen independently from Sanger measurements, is within the confidence intervals of the array based estimate. Altogether the data show that DMRs identified on array data adequately represent the underlying methylation patterns at single site resolution.

3.4 Implementation and performance

Our method has been implemented in an R package seqlm, freely available through Github: https://github.com/raivokolde/seqlm. The method is not specific to the Illumina 450k methylation array and can be used with many other arrays. For example, the code can easily be used for analyzing tiling array data. The performance of seqlm is sufficient to analyze any array based dataset on an ordinary laptop computer. Moreover, as the implementation supports multi-core computations, it is possible to speed up the analysis on dedicated computation servers. To compare running time with competitors, we ran seqlm together with bumphunter and Aclust on our simulated dataset. Surprisingly, seqlm was the fastest of the three (Supplementary Table S2). Using multiple cores it is possible to improve the performance of seqlm and bumphunter, but not Aclust.

4 Methods

4.1 Datasets

For demonstrative purposes, we have used a large publicly available methylation dataset on Illumina 450k platform (GSE50192). There, the methylation of 17 tissues from four autopsied humans has been measured.

4.2 Data simulation

To simulate data while retaining most of the characteristics of a real methylation dataset, we have permuted a subset of the 17 tissues data. To obtain data with relatively homogeneous patterns, we chose a subset consisting of a total of 16 samples: coronary artery, splenic artery, thoracic aorta and abdominal aorta. We selected a set of similar tissues to avoid encountering a rather strong signal in the permuted data by chance.

To start, we split the genome into smaller pieces with the following properties. First, the genomic distances between consecutive sites remain below 1000 bp within each piece. Second, the correlation between all consecutive sites within each piece is above a threshold of 0.1. The aim of these choices was to extract those locations from the data where the methylation patterns are reasonably similar. As a result, we obtain 250 000 pieces with length greater than one, with an average length of 3.8 sites.

For each piece, we assign the group labels randomly such that we have eight samples in one group and eight in the other. As a result, we get data where on average there is no group effect and the single site P-values closely follow the uniform distribution. Finally, we randomly choose N = 5000 pieces with length greater than 1 (with probability proportional to length) and change them into DMRs by increasing the values in one group by an average of μ. We varied μ over the values listed in Table 1 (0 to 0.20).
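A rough sketch of this simulation scheme is given below. Several details are assumptions, not taken from the article: the shift is a constant μ added to every site of the piece (the text only gives its average), the pieces are drawn without replacement using Efraimidis-Spirakis weighted sampling keys, and the function names are hypothetical.

```python
import random


def choose_pieces(lengths, n_dmr, rng):
    """Pick n_dmr distinct pieces longer than one site, weighted by length.

    Each eligible piece gets the key u ** (1 / length) with u uniform on
    [0, 1); the pieces with the largest keys are kept (weighted sampling
    without replacement).
    """
    eligible = [k for k, length in enumerate(lengths) if length > 1]
    keys = {k: rng.random() ** (1.0 / lengths[k]) for k in eligible}
    return sorted(eligible, key=lambda k: keys[k], reverse=True)[:n_dmr]


def permute_and_spike(piece, mu, rng):
    """Assign balanced random group labels and shift one group by mu.

    piece[i][j] is the value of sample i at site j. Returns (labels, values).
    """
    n = len(piece)
    labels = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(labels)  # random split into two groups of equal size
    spiked = [[value + mu * label for value in row]
              for row, label in zip(piece, labels)]
    return labels, spiked


rng = random.Random(0)
piece = [[0.4, 0.5, 0.6] for _ in range(16)]  # 16 samples, 3 sites
labels, spiked = permute_and_spike(piece, mu=0.1, rng=rng)
chosen = choose_pieces([1, 3, 5, 1, 2], n_dmr=2, rng=rng)
```

Calling `permute_and_spike` with `mu=0` gives the null data, where the labels carry no signal.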

4.3 Parameters for tested methods

When comparing seqlm to other methods that require several input parameters, we used either the default values or the ones recommended in the original publication. For all methods except bumphunter, we defined significant differential methylation as FDR corrected P-values below the threshold 0.05. As bumphunter does not divide the genome into regions but rather searches for DMRs, one cannot use FDR correction, so instead we used its reported familywise error rate with threshold 0.10, which was also used in Jaffe et al. For bumphunter the parameters used were: pickCutoffQ = 0.99, maxGap = 1000 and smooth = TRUE (as suggested by Jaffe et al.). For Aclust, we used the 'best' configuration of the parameters reported in Sofer et al., i.e. Spearman correlation, average distance clustering, distance cutoff 0.2, and 999-base-pair-merge. For Comb-p, we used a P-value threshold of 0.05 for candidate regions.

4.4 Sanger sequencing

For validation we selected 14 DMRs that were tissue specific, showed a large effect size and for which it was possible to design primers. Primers for PCR amplification of the bisulfite-treated DNA were designed using MethPrimer (Li and Dahiya, 2002) and are listed in Additional file 9. The 20 µL reaction mixes contained 80 mM Tris–HCl (pH 9.4–9.5), 20 mM (NH4)2SO4, 0.02% Tween-20 PCR buffer, 3 mM MgCl2, 1× Betaine, 0.25 mM dNTP mix, 2 U Smart-Taq Hot DNA polymerase (Naxo, Tartu, Estonia), 50 pmol forward primer, 50 pmol reverse primer and 20 ng bisulfite-treated genomic DNA. The PCR cycling conditions were: 15 min at 95 °C for enzyme activation, followed by 17 cycles of 30 s at 95 °C, 45 s at 62 °C and 120 s at 72 °C, with a final −0.5 °C/cycle step-down gradient over 21 cycles of 30 s at 95 °C, 30 s at 52 °C and 120 s at 72 °C. The sequencing results were analyzed with Mutation Surveyor software (Softgenetics, State College, PA, USA).

5 Discussion

The method as defined in this article opens up a number of future directions for development. The current model can also handle continuous variables in addition to two groups of data. Thus, it is possible to use seqlm to search for methylation quantitative trait loci (meQTLs). In such an analysis, even small improvements in statistical power can have huge consequences. The method can be generalized to handle raw sequencing data with methylation counts instead of aggregated methylation values. For that we must replace linear regression with logistic regression. The change only alters the model fitting inside a fixed region and not the core of the dynamic programming routine. The MDL framework underlying seqlm is a powerful way to identify genomic regions. By employing different statistical models it is possible to specify the properties of the desired regions. For example, one can include more sophisticated linear models to test more complex hypotheses, or use clustering methods instead to perform unsupervised region finding.

6 Conclusion

We presented a novel approach for DMR identification, described as a three-stage process. First, the data is divided into smaller segments based on the genomic distance between consecutive probes. Then, each of these segments is divided into regions with consistent differential methylation patterns. For this, all possible segmentations are considered and the optimal one is chosen according to the MDL principle. Finally, the significance of differential methylation in each region is assessed using linear mixed models. In our algorithm, the latter two steps are naturally related, as both the segmentation and the assessment of statistical significance are based on the β parameter. We showed that seqlm performs well on simulated data, being both more sensitive and more specific than the alternative methods. On a real dataset, the DMRs found by seqlm cover more sites than those found by other methods while controlling the Type I error rate. At the same time, the redundancy within the results is smaller, as close sites are reported together. Finally, we validated 14 DMRs using Sanger sequencing and showed good correlation between the array- and sequencing-based estimates of differential methylation.
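The first stage, splitting probes into segments by genomic distance, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the seqlm code: the function name `split_by_distance`, the `max_gap` threshold and the example coordinates are all hypothetical, and the probe positions are assumed to be sorted and to lie on a single chromosome.

```python
from typing import List, Sequence


def split_by_distance(positions: Sequence[int], max_gap: int) -> List[List[int]]:
    """Group probe indices into segments, starting a new segment whenever
    the distance to the previous probe exceeds max_gap (hypothetical threshold)."""
    segments: List[List[int]] = []
    for idx, pos in enumerate(positions):
        if segments and pos - positions[segments[-1][-1]] <= max_gap:
            segments[-1].append(idx)   # close enough: extend the current segment
        else:
            segments.append([idx])     # first probe or large gap: open a new segment
    return segments


# Made-up probe coordinates (base pairs) on one chromosome.
positions = [100, 150, 400, 5000, 5100, 20000]
segments = split_by_distance(positions, max_gap=1000)
print(segments)
```

Each resulting segment would then be passed independently to the second stage, where the region boundaries within it are chosen.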

19 in total

1. An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries.

Authors: M Koivisto; M Perola; T Varilo; W Hennah; J Ekelund; M Lukk; L Peltonen; E Ukkonen; H Mannila
Journal: Pac Symp Biocomput Date: 2003

2. Identification of genetic elements that autonomously determine DNA methylation states.

Authors: Florian Lienert; Christiane Wirbelauer; Indrani Som; Ann Dean; Fabio Mohn; Dirk Schübeler
Journal: Nat Genet Date: 2011-10-02 Impact factor: 38.330

3. Epigenomic profiling reveals DNA-methylation changes associated with major psychosis.

Authors: Jonathan Mill; Thomas Tang; Zachary Kaminsky; Tarang Khare; Simin Yazdanpanah; Luigi Bouchard; Peixin Jia; Abbas Assadzadeh; James Flanagan; Axel Schumacher; Sun-Chong Wang; Arturas Petronis
Journal: Am J Hum Genet Date: 2008-03 Impact factor: 11.025

4. Discovery of phosphorylation motif mixtures in phosphoproteomics data.

Authors: Anna Ritz; Gregory Shakhnarovich; Arthur R Salomon; Benjamin J Raphael
Journal: Bioinformatics Date: 2008-11-07 Impact factor: 6.937

5. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values.

Authors: Brent S Pedersen; David A Schwartz; Ivana V Yang; Katerina J Kechris
Journal: Bioinformatics Date: 2012-09-05 Impact factor: 6.937

6. MethPrimer: designing primers for methylation PCRs.

Authors: Long-Cheng Li; Rajvir Dahiya
Journal: Bioinformatics Date: 2002-11 Impact factor: 6.937

7. Identification of DNA methylation biomarkers from Infinium arrays.

Authors: Frank Wessely; Richard D Emes
Journal: Front Genet Date: 2012-08-25 Impact factor: 4.599

8. Epigenome-wide scans identify differentially methylated regions for age and age-related phenotypes in a healthy ageing population.

Authors: Jordana T Bell; Pei-Chien Tsai; Tsun-Po Yang; Ruth Pidsley; James Nisbet; Daniel Glass; Massimo Mangino; Guangju Zhai; Feng Zhang; Ana Valdes; So-Youn Shin; Emma L Dempster; Robin M Murray; Elin Grundberg; Asa K Hedman; Alexandra Nica; Kerrin S Small; Emmanouil T Dermitzakis; Mark I McCarthy; Jonathan Mill; Tim D Spector; Panos Deloukas
Journal: PLoS Genet Date: 2012-04-19 Impact factor: 5.917

9. DNA methylation profiling of human chromosomes 6, 20 and 22.

Authors: Florian Eckhardt; Joern Lewin; Rene Cortese; Vardhman K Rakyan; John Attwood; Matthias Burger; John Burton; Tony V Cox; Rob Davies; Thomas A Down; Carolina Haefliger; Roger Horton; Kevin Howe; David K Jackson; Jan Kunde; Christoph Koenig; Jennifer Liddle; David Niblett; Thomas Otto; Roger Pettett; Stefanie Seemann; Christian Thompson; Tony West; Jane Rogers; Alex Olek; Kurt Berlin; Stephan Beck
Journal: Nat Genet Date: 2006-10-29 Impact factor: 38.330

10. DNA methylome profiling of human tissues identifies global and tissue-specific methylation patterns.

Authors: Kaie Lokk; Vijayachitra Modhukur; Balaji Rajashekar; Kaspar Märtens; Reedik Mägi; Raivo Kolde; Marina Koltšina; Torbjörn K Nilsson; Jaak Vilo; Andres Salumets; Neeme Tõnisson
Journal: Genome Biol Date: 2014-04-01 Impact factor: 13.583
