Alla Slynko1, Axel Benner2. 1. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada. 2. Division of Biostatistics, German Cancer Research Center, Heidelberg, Germany.
Abstract
Hydroxymethylcytosine (5hmC) methylation is a well-known epigenetic mark that is involved in gene regulation and may impact genome stability. To investigate a possible role of 5hmC in cancer development and progression, one must be able to detect and quantify its level first. In this paper, we address the issue of 5hmC detection at a single base resolution, starting with consideration of the well-established 5hmC measure Δβ and, in particular, with an analysis of its properties, both analytically and empirically. Then we propose several alternative hydroxymethylation measures and compare their properties with those of Δβ. In the absence of a gold standard, the (pairwise) resemblance of those 5hmC measures to Δβ is characterized by means of a similarity analysis and relative accuracy analysis. All results are illustrated on matched healthy and cancer tissue data sets as derived by means of bisulfite (BS) and oxidative bisulfite converting (oxBS) procedures.
Hydroxymethylcytosine (5hmC) methylation is a well-known epigenetic mark that is involved in gene regulation and may impact genome stability. To investigate a possible role of 5hmC in cancer development and progression, one must be able to detect and quantify its level first. In this paper, we address the issue of 5hmC detection at a single base resolution, starting with consideration of the well-established 5hmC measure Δβ and, in particular, with an analysis of its properties, both analytically and empirically. Then we propose several alternative hydroxymethylation measures and compare their properties with those of Δβ. In the absence of a gold standard, the (pairwise) resemblance of those 5hmC measures to Δβ is characterized by means of a similarity analysis and relative accuracy analysis. All results are illustrated on matched healthy and cancer tissue data sets as derived by means of bisulfite (BS) and oxidative bisulfite converting (oxBS) procedures.
DNA methylation is known to play a crucial role in the development of diseases such as diabetes, schizophrenia, and some forms of cancer; for details see, e.g., [1-7] and references therein. In order to address the possible impact of DNA methylation on the various biological functions and processes, an entire strand of extensive biological, bioinformatical, and statistical analyses has been developed in the past years. Some of those analyses, most relevant for our setting, were discussed in [8-15]. A substantial part of the methods introduced in those analyses aims at quantifying the actual level of DNA methylation, in particular on a single nucleotide resolution in genomic DNA.At some point, this research indicated that the obtained DNA methylation level, sometimes referred to as “total DNA methylation” [16, 17], can be split, inter alia, into 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) components, with 5mC playing an important role in gene silencing and genome stability [18]. The second component, 5hmC methylation, was first discovered in 2009 as another form of cytosine modification [19-21]. Since then, its function as an intermediate in active DNA demethylation and an important epigenetic regulator of mammalian development which is strongly associated with genes and regulatory elements in the genome, as well as its role as a possible epigenetic mark impacting genome stability has come into the spotlight [16, 18, 22–38]. At that point, the questions concerning reliable identification and accurate quantification of 5hmC levels emerged.Until now, a number of techniques for the quantification of 5hmC levels have been established [16–18, 24, 28, 31, 39–41]. Two key techniques to be named here are the TET-assisted bisulfite (TAB) technique and the oxidative bisulfite (oxBS) technique. The TAB technique is based on the conversion of 5mC to 5hmC in mammalian DNA by means of TET emzymes [17, 31]. When using the oxBS technique, 5hmC methylation levels can be obtained by means of the paired bisulfite sequencing (BS) and oxidative bisulfite sequencing (oxBS) procedures [18]. In particular, since the BS procedure can only differentiate between methylated and unmethylated cytosine bases, and cannot discriminate between 5mC and 5hmC, the oxBS procedure must be applied, in order to determine the level of 5hmC at a considered nucleotide position. This procedure yields Cs only at 5mC sites while oxidating 5hmC to 5-formylcytosine (5fC) and later converting them to uracil. As a result, an amount of 5hmC at each particular nucleotide position can be determined as the difference between the oxBS (which identifies 5mC) and the BS (which identifies 5mC+5hmC) readouts. In the present paper, all obtained results are illustrated on paired BS and oxBS data.In order to quantify the 5hmC level in the context of the oxBS technique, and, in particular, to identify a given CpG site as being either hydroxymethylated or non-hydroxymethylated, the following quantity was introduced in [41]Here, M denotes the intensity of the methylated allele, U is the intensity of the unmethylated allele, β is the methylation level obtained from the BS method, and β is the methylation level derived by means of the oxBS method. As stated in [31, 41], the quantity Δβ computed for each single CpG and sample can be interpreted as a “measure of hydroxymethylation” and “a reflection of the 5hmC level at each particular probe location”. This measure can then be applied in the screening step so as to exclude from further analysis those CpGs that do not appear to be hydroxymethylated.In [42], the authors introduced a related quantity Δm, defined as a difference of the corresponding m-values [13], to be another measure for identification and quantification of the 5hmC levels. However, our discussion in S1 Appendix shows that in the context of the 5hmC identification both measures, Δβ and Δm, flag exactly the same cytosines as being substantially hydroxymethylated and thus can be used interchangeably.Due to its definition, Δβ in (1) can take values between -1 and 1, with negative values of Δβ representing “false differences in methylation score between paired BS-only and oxBS data sets” and being interpreted as a “background noise” [41].While applying Δβ for the identification of substantially hydroxymethylated cytosines, the issue of an appropriate Δβ
threshold arises; such threshold can be applied “to identify a probe-set of substantially hydroxymethylated cytosines”. In [41], the threshold for Δβ has been set to 0.3 or 30%. However, it is not evident, whether such threshold can be applied for any given data set or should be specified for each particular setting.This paper is organized as follows. First, we address the applicability of the 5hmC measure Δβ (in the following notation just Δβ) for detection of hydroxymethylated CpGs and then indicate several limitations of this measure by discussing its properties, both analytically and on data sets. Further, we propose several alternative hydroxymethylation measures which can also be applied for the 5hmC identification and compare their properties and resemblance with those of Δβ. Relative accuracy and resemblance of all three considered 5hmC measures are discussed numerically, under the assumption that no gold standard is available. All data analyses were performed on 38 matched samples, with cancer and healthy tissue available for each sample.
Discussion
On the applicability of Δβ for 5hmC detection
According to [8], for a given methylated and unmethylated intensities M and U, the methylation level of the particular probe can be described by the methylation proportionThus, the 5hmC measure Δβ in (1) is just the difference of two methylation proportions as derived from BS and oxBS treatment, respectively. This simple definition, while appearing to be plausible at first, nevertheless leads to a number of ambiguities as discussed below.The first ambiguity arising from (1) concerns the application of Δβ as a measure for the identification of hydroxymethylated CpGs, and, in particular, its adequate interpretation as such. Even if both components in the difference (1) do represent the respective methylation proportions for BS and oxBS data, these proportions are evidently calculated on two different bases: the proportion β represents the methylation proportion based on the global BS intensity M + U, whereas the proportion β represents the methylation proportion based on the global oxBS intensity M + U. Thus, a direct comparison of these two proportions is difficult to justify and, as a result, the interpretation of Δβ as “a reflection of the 5hmC level at each particular probe” suggested in [41] is not well founded.Further, while identifying hydroxymethylated CpGs in the context of the screening step, the outcomes of Δβ are interpreted as follows [41]: Positive values of Δβ are taken as an indicator for a substantial 5hmC level and “represent potential sites of 5hmC”, whereas small values of Δβ should indicate no or only nonsubstantial hydroxymethylation levels. Negative values of Δβ are considered as resulting from background noise; for the 5hmC measure Δm, the same view is shared in [42]. To analyze this interpretation, let us first refer to Fig 1. As the left-hand panel of that figure shows, all ten simulated data points s1, s2, …, s10 satisfy both conditions
simultaneously which intuitively should be interpreted as “no substantial 5hmC level observed”. Nevertheless, the condition Δβ > 0 holds for each of these ten data points as well; see the right-hand panel of Fig 1 for an illustration.
Fig 1
On the interpretation of Δβ as a 5hmC measure in case with Δβ > 0.
Negativity of the differences on the left-hand panel implies that none of the data points s1, s2, …s10 shows any substantial 5hmC level, but, due to Δβ > 0, all these points will nevertheless be flagged by Δβ as being hydroxymethylated.
On the interpretation of Δβ as a 5hmC measure in case with Δβ > 0.
Negativity of the differences on the left-hand panel implies that none of the data points s1, s2, …s10 shows any substantial 5hmC level, but, due to Δβ > 0, all these points will nevertheless be flagged by Δβ as being hydroxymethylated.Further, the left-hand panel of Fig 2 introduces another ten simulated data points s1, s2, …s10 that satisfy both
Fig 2
On the interpretation of Δβ as a 5hmC measure in case with Δβ < 0.
Due to the positivity of the differences on the left-hand panel, all ten data points s1, s2, …, s10 appear to exhibit a substantial level of 5hmC, whereas the right-hand panel shows negative Δβ values.
On the interpretation of Δβ as a 5hmC measure in case with Δβ < 0.
Due to the positivity of the differences on the left-hand panel, all ten data points s1, s2, …, s10 appear to exhibit a substantial level of 5hmC, whereas the right-hand panel shows negative Δβ values.At the same time, the condition Δβ < 0 holds for each of s1, s2, …s10 as well; see the right-hand panel of Fig 2 for an illustration. Thus, even though the data points s1, s2, …s10 in Fig 2 actually appear to exhibit a substantial 5hmC level due to their BS intensities exceeding their oxBS intensities, they will not be selected by the measure Δβ as being hydroxymethylated.One of the main advantages of the measure β, which has definitely contributed to its common application as a methylation measure, is its intuitive interpretation as an approximation of the percentage of methylation [13]; thereby β = 0 indicates unmethylated probes and β = 1 denotes fully methylated probes. Unfortunately, this interpretation does not carry over to the measure Δβ. Indeed, in (1) the condition Δβ = 0 solely implies
and it is unclear how this last equality should be interpreted in terms of the observed 5hmC level. In particular, Fig 3 demonstrates that we can obtain Δβ = 0 in cases with “no substantial 5hmC level observed”, i.e., in cases where the conditions
hold. Similar results can be derived in cases with “a substantial 5hmC level observed”, i.e., in cases with
Fig 3
On the interpretation of Δβ as a 5hmC measure in case with Δβ = 0.
Negativity of the differences on the left-hand panel implies that none of the data points should show any substantial 5hmC level.
On the interpretation of Δβ as a 5hmC measure in case with Δβ = 0.
Negativity of the differences on the left-hand panel implies that none of the data points should show any substantial 5hmC level.Altogether, our analyses of the conditions Δβ > 0, Δβ = 0, and Δβ < 0 show that their interpretations as indicators for substantial hydroxymethylation, no hydroxymethylation, and background noise may become problematic in certain situations.Another ambiguity arising from (1) is related to the choice of the number 100 in the denominators M + U + 100 and M + U + 100 of the expression (1) for Δβ. This choice seems to stem from the practical convention in the definition of β values [13], and just being transferred at the definition of Δβ [31, 41]. As a matter of fact, there is no strong reason why the correction term 100 in the denominator of (2) should not be replaced with any other value α > 0. In fact, such replacement would lead to the following more general definition of the methylation proportionWhile one can safely argue that the actual choice of the parameter α is not crucial for the interpretation of the methylation proportion β(α) itself [13], this choice may become critical when using the sign of the measure
as an indicator for hydroxymethylation in the screening step. In particular, under certain conditions, the sign of Δβ(α) can change from positive to negative or vice versa as α varies; see the left-hand panel of Fig 4 as well as Fig A in S1 Appendix for an illustration.
Fig 4
Sign change and convergence of the 5hmC measure Δβ(α).
The left-hand panel: Δβ(α) changing its sign from positive to negative (the dark blue curve, healthy tissue) and from negative to positive (the dark red dotted curve, cancer tissue) as α increases. The result refers to a given CpG (cg00050873) and sample (sample 7). The right-hand panel: Convergence of Δβ(α) for healthy tissue, a given sample (sample 7) and three CpGs (cg00050873, cg05480730, cg10698069).
Sign change and convergence of the 5hmC measure Δβ(α).
The left-hand panel: Δβ(α) changing its sign from positive to negative (the dark blue curve, healthy tissue) and from negative to positive (the dark red dotted curve, cancer tissue) as α increases. The result refers to a given CpG (cg00050873) and sample (sample 7). The right-hand panel: Convergence of Δβ(α) for healthy tissue, a given sample (sample 7) and three CpGs (cg00050873, cg05480730, cg10698069).Further, Fig 5 shows changes in the density of Δβ(α) as well as the percentage of CpGs (for each given sample) where Δβ(α) may change its sign as α increases.
Fig 5
Density of Δβ(α) and the percentage of CpGs, where Δβ(α) may change its sign for varying values of α.
The left-hand panel: Density of Δβ(α), for a given CpG (cg00213748) and α = 0, 100, 500 and 2000 (the red, the dark blue, the blue and the dark green curves, respectively). The right-hand panel: The percentage of CpGs, where Δβ(α) may change its sign for varying values of α. All results were computed on cancer tissue and across all 38 samples.
Density of Δβ(α) and the percentage of CpGs, where Δβ(α) may change its sign for varying values of α.
The left-hand panel: Density of Δβ(α), for a given CpG (cg00213748) and α = 0, 100, 500 and 2000 (the red, the dark blue, the blue and the dark green curves, respectively). The right-hand panel: The percentage of CpGs, where Δβ(α) may change its sign for varying values of α. All results were computed on cancer tissue and across all 38 samples.In view of such dependence of Δβ(α) on the choice of α, a question concerning the possible impact of this choice on the percentage of CpGs satisfying the condition Δβ(α) > 0 and thus identified as being hydroxymethylated at the end of the screening step arises.
Alternative 5hmC measures
One of the limitations of the 5hmC measure Δβ(α) we discussed in the previous section concerns its interpretation and robustness with respect to the choice of the correction term α. To overcome this limitation, we now introduce two alternative measures which can be used in the screening procedure while indicating CpGs with a substantial level of 5hmC; the basic properties of these measures are discussed in S2 Appendix.We start our analysis by considering the behavior of Δβ(α) (and also Δm(α) as discussed in S2 Appendix) for increasing values of α. As follows from (9), Δβ(α) vanishes as α increases; see the right-hand panel of Fig 4 for an illustration. This convergence result is also transferable to Δm(α), with the only difference that in case of Δβ(α) the limit will always be zero, independently of the CpGs, sample, and the tissue chosen, whereas in case of Δm(α) the limit depends on the CpG, sample, and tissue under consideration.The convergence results for Δβ(α) and Δm(α) imply that the percentage of CpGs satisfying the condition Δβ(α) > 0 and Δm(α) > 0 for a given sample, respectively, approaches a positive constant as α increases. Standard computations verify this limit value to be just the percentage of CpGs satisfying M > M for a given sample; see S2 Appendix for details.Inspired by the convergence results obtained for the measures Δβ(α) and Δm(α), we next propose
as the first alternative 5hmC measure that can be used for the detection of hydroxymethylated CpGs. Note that Δm∞ is well-defined for all CpGs satisfying M > 0 and M > 0 simultaneously.The main advantage of the measure Δm∞ in comparison to the measures Δβ(α) and Δm(α) is its complete independence of the correction term α; this fact makes Δm∞ more robust for application in the screening step. Furthermore, the sign of Δm∞ has a very intuitive interpretation. Indeed, we get Δm∞ > 0 if M > M holds, i.e., if the global methylated intensity M exceeds the “adjusted” methylated intensity M. In all other cases we will have Δm∞ ≤ 0; for instance, Δm∞ = 0 implies M = M, which can intuitively be interpreted as “no substantial 5hmC level observed”.In the context of our screening procedure, the most crucial question concerns a relation between the subsets of CpGs satisfying Δm(α) > 0 and Δm∞ > 0, respectively. To answer this question in a formal way, we divided the set of all CpGs with Δm(α) > 0 in several disjoint subsets, and showed that, for a given sample and increasing α, the union of these subsets converges to the subset of CpGs satisfying M > M; see S2 Appendix for more details.Due to its definition, Δm∞ does not take into account the unmethylated intensities U and U. This may become an issue even if the role of these intensities in the detection of hydroxymethylated CpGs has not been clarified yet. We address this issue by proposing another measure for selecting CpGs with a substantial level of hydroxymethylation, namely,In (11), M + U is the global intensity obtained from the BS procedure and M + U is the global intensity derived by means of the oxBS procedure.For CpGs with M + U exceeding M + U, i.e., for those CpGs which can be intuitively interpreted as exhibiting a substantial level of hydroxymethylation, the measure Δh must range between 0 and 1. In particular, the values of Δh close to zero correspond to M + U being approximately equal to M + U and thus the global 5hmC level being (almost) negligible. On the other hand, for Δh approximately equal to one we deduce that M + U must be substantially smaller than M + U and thus the global 5hmC level has to be high. Altogether, larger values of Δh correspond to larger proportions of the global 5hmC levels and we can interpret Δh as the proportion of 5hmC in the global methylation.Our intuition in the interpretation of the values of Δh is based on the assumption that a substantial 5hmC level is associated with a substantial decrease in the overall intensities M + U, with M + U > M + U for a given CpG site. Such interpretation is induced by the fact that, in contrast to the methylation process, a role of the unmethylated intensities U in the hydroxymethylation process is unclear. Thus, negative values of Δh are currently treated as a measurement error. Note that in (11) one has to assume that M + U is different from zero; in other words, all CpGs with M + U equal to zero have to be excluded from the analysis as exhibiting measurement error.In view of the screening procedure, we also analyzed whether positive values of the measure Δh lead to positivity of other 5hmC measures introduced above, and vice versa. As (11) implies, the inequality Δh > 0 holds forHowever, the latter inequality is not sufficient to make a statement about the sign of the measures Δβ(α) and Δm∞, so that additional assumptions are needed; see S2 Appendix for details.Altogether, our discussion indicates that the application of Δh for the detection of hydroxymethylated CpGs can be of advantage, since this 5hmC measure overcomes the limitation of both 5hmC measures considered earlier. In particular, this measure does not depend on the choice of the correction term α, has an intuitive interpretation of its outcomes in terms of the observed 5hmC level, and can be computed directly from measured array data.
Materials and methods
Numerical analyses of the resemblance of Δβ(α), Δm∞ and Δh
In the previous sections we considered three 5hmC measures, Δβ(α), Δm∞ and Δh, as possible tools for the classification of CpGs into hydroxymethylated and those which do not exhibit a substantial level of hydroxymethylation. To estimate a possible classification error, one would usually compare each of these 5hmC measures with a certain gold standard. However, no gold standard is available in our case, since even the actual meaning of the formulations “a substantial 5hmC level observed” or “no substantial 5hmC level observed” in terms of measured methylated and unmethylated intensities M and U is unclear so far. One of possible ways to evaluate the accuracy of Δβ(α), Δm∞ and Δh in the absence of a gold standard, as proposed in this section, is to describe this accuracy in terms of relative sensitivities and specificities of these measures with respect to each other. On the other hand, the resemblance of the considered 5hmC measures with respect to each other can also be addressed by means of a similarity analysis.Numerical analyses of the present section were motivated by the discussions presented in [24, 43–48].
Study cohort, 5hmC isolation, data preprocessing
All analyses were performed on 38 paired samples, with both (colorectal) cancer and normal tissue available for each sample. All 38 patients were enrolled in the ongoing population-based case-control study DACHS (Darmkrebs: Chancen der Verhütung durch Screening, http://dachs.dkfz.org/dachs/), extensively described in [49]. Data collecting and patient recruitment procedures as well as the processes of DNA isolation and methylation profiling using the Infinium HumanMethylation450 BeadChip array (Illumina) are similar to those described in [50]. All data are publicly available at https://zenodo.org/record/2639285#.XLYzNKZS_XE.All data analyses were performed using the computational environment R, V.3.5.2 (http://www.r-project.org/). Raw data signals from each of the BS- and oxBS- converted samples were preprocessed using the R/Bioconductor minfi-package [51]. In particular, the procedure preprocessRaw from that package was applied in order to convert the red/green channel for an Illumina methylation array into methylation signal.
Results
Prevalence of positive results
We applied all three considered 5hmC measures to both healthy and cancer tissue and computed the percentage of CpGs satisfying Δβ(100) > 0, Δm∞ > 0 and Δh > 0 for each given sample; Fig 6 illustrates the obtained results. Note that such prevalence of positive results is crucial in the screening procedure and represents the most intuitive approach for the comparison of any two 5hmC measures. Dependence of the prevalence of positive results of the measure Δβ(α) on the choice of α is discussed in S1 Appendix.
Fig 6
Sample-wise prevalence of positive results.
The dark blue dots correspond to the percentage of CpGs with Δβ(100) > 0, the dark red dots to Δm∞ > 0 and the grey dots to Δh > 0.
Sample-wise prevalence of positive results.
The dark blue dots correspond to the percentage of CpGs with Δβ(100) > 0, the dark red dots to Δm∞ > 0 and the grey dots to Δh > 0.Further, we adopted the statement in [24] on a reduction of 5hmC levels in cancer tissue to the prevalence of positive results, by expecting this prevalence to be higher in healthy tissue compared to cancer one. In a sample-wise analysis, this anticipation was indeed confirmed for the 5hmC measure Δβ(100), but not for the measures Δm∞ and Δh.The same analysis, performed CpG-wise, i.e., with prevalence of positive results computed for each single CpG across all 38 samples that provides the hydroxymethylation level for each given CpG, again showed a significant reduction in 5hmC levels as obtained on cancer tissue, in particular for the measures Δβ(100) and Δm∞. Contrary to our expectations, for the measure Δh, prevalence of positive results was significantly lower in healthy tissue compared to cancer one.Next, we compared prevalences of positive results of any two 5hmC measure on a given tissue, in order to investigate the conservativeness of these measures when screening for hydroxymethylated CpGs. This analysis, performed sample-wise, resulted in the 5hmC measure Δm∞ being less conservative than Δh on healthy tissue and less conservative than Δβ(100) on cancer tissue; see Fig 6 for an illustration. The same analysis, performed CpG-wise, determined Δm∞ as being the least conservative 5hmC measure, on both considered tissues. Further, on healthy tissue Δβ(100) appeared to be less conservative than Δh, whereas on cancer tissue, Δh was less conservative than Δβ(100). This result can be interpreted as an evidence of the tissue effect [41, 52].We also analyzed the joint prevalence of positive results defined as the percentage of CpGs with any two 5hmC measures being positive; such joint prevalence characterizes the agreement between any two 5hmC measures in the context of the screening step. Sample-wise analysis did not reveal any significant differences in these joint prevalences as calculated on healthy and cancer tissue. On the other hand, the joint prevalence of the measures Δβ(100) and Δm∞ appeared to exceed the joint prevalence of Δβ(100) and Δh significantly, on both considered tissues. The same result, again for both tissues, holds for the joint prevalences of the measures Δm∞ and Δh as well as of the measures Δβ(100) and Δh. Finally, on cancer tissue, the joint prevalence of the measures Δβ(100) and Δm∞ significantly exceeded the joint prevalence of the measures Δm∞ and Δh. In total, we conclude that, in a sample-wise analysis performed on cancer tissue, the 5hmC measures Δβ(100) and Δm∞ demonstrate the strongest agreement, followed by agreement between the measures Δm∞ and Δh.The same joint prevalence analysis, performed CpG-wise, revealed the joint prevalence of the measures Δβ(100) and Δm∞ on healthy tissue being significantly higher than the corresponding joint prevalence on cancer tissue; similar result is true for the joint prevalence of the measures Δβ(100) and Δh. As in case of a sample-wise analysis, the joint prevalence of the measures Δβ(100) and Δm∞ significantly exceeded the joint prevalence of Δβ(100) and Δh, both on healthy and cancer tissue; the same relation is true for the joint prevalences of the measures Δm∞ and Δh and of the measures Δβ(100) and Δh. On the other hand, in contrast to the results of the sample-wise analysis above, the joint prevalence of the measures Δβ(100) and Δm∞ is significantly lower than the joint prevalence of the measures Δm∞ and Δh, both on healthy and cancer tissue. Altogether, the CpG-wise analysis showed the highest agreement between the measures Δm∞ and Δh, followed by the agreement between the measures Δβ(100) and Δm∞; the 5hmC measures Δβ(100) and Δh demonstrated the lowest pairwise agreement, consistent with the results of the sample-wise analysis. Sample-wise joint prevalence of positive results is visualized in Fig 7. Joint agreement between all three 5hmC measures is illustrated in Fig 8; for more results see also Figs A and B in S3 Appendix.
Fig 7
Sample-wise joint prevalence of positive results.
Orange squares correspond to the values for Δβ(100) and Δh, dark green circles to the values for Δβ(100) and Δm∞ and brown squares to the values for Δh and Δm∞.
Fig 8
The number of substantially hydroxymethylated CpGs as identified by all three 5hmC measures.
The number of substantially hydroxymethylated CpGs as identified by all three 5hmC measures, on healthy (the left-hand panel) and cancer (the right-hand panels) tissues and across all 38 samples. A CpG site is considered to be substantially hydroxymethylated under a given 5hmC measure x, if at least 75% of all values of x computed for this CpG and across all 38 samples are positive.
Sample-wise joint prevalence of positive results.
Orange squares correspond to the values for Δβ(100) and Δh, dark green circles to the values for Δβ(100) and Δm∞ and brown squares to the values for Δh and Δm∞.
The number of substantially hydroxymethylated CpGs as identified by all three 5hmC measures.
The number of substantially hydroxymethylated CpGs as identified by all three 5hmC measures, on healthy (the left-hand panel) and cancer (the right-hand panels) tissues and across all 38 samples. A CpG site is considered to be substantially hydroxymethylated under a given 5hmC measure x, if at least 75% of all values of x computed for this CpG and across all 38 samples are positive.To summarize the results of our discussion above, we state that the 5hmC measure Δβ(100) demonstrates a higher agreement with Δm∞ than with Δh. Moreover, the agreement between the measures Δm∞ and Δh exceeds the agreement between Δβ(100) and Δh.
Similarity analyses
In order to address the resemblance of the proposed 5hmC measures without making any statement about their performance, similarity analyses can also be applied; the main tool of such analyses is a similarity coefficient. There is a variety of similarity coefficients proposed in literature. For an overview see, e.g., [53-56] and references therein.In order to quantify the pairwise similarity of the proposed 5hmC measures Δβ(α), Δm∞, and Δh in the context of the screening step, we first considered the similarity coefficient , also known as the simple matching coefficient [53, 54]. In particular, for a given CpG and two given 5hmC measures x1 and x2 we rewrite this similarity coefficient asHere n is the number of samples under consideration, I{ is the indicator function, with I{ = 1 for x > 0 and I{ = 0 otherwise, and is the value of the measure x(j = 1, 2) in the ith CpG. Clearly, the similarity coefficient in (13) ranges between 0 and 1, with 1 corresponding to complete similarity and 0 to complete dissimilarity between the considered two measures x1 and x2. Moreover, the similarity coefficient represents an extension of the prevalence of positive results introduced earlier, since it considers not only the CpG sites that were flagged as hydroxymethylated but also those CpG sites that were identified as non-hydroxymethylated by two considered 5hmC measures.While performing the similarity analysis for each given sample, we could not state any significant difference in the values of as computed on healthy and cancer tissue. Further, the 5hmC measures Δm∞ and Δh appears to be the most similar, whereas the measures Δh and Δβ(100) are the least similar, both on healthy and cancer tissue. Finally, the 5hmC measure Δβ(100) is less similar to Δh than to Δm∞, both on healthy and cancer tissue. All these results are visualized in Fig 9.
Fig 9
Pairwise similarity of the 5hmC measures Δβ(100), Δm∞ and Δh, in terms of the similarity coefficient .
Orange rectangles correspond to the values of , dark green dots to the values of and brown rectangles to the values of .
Pairwise similarity of the 5hmC measures Δβ(100), Δm∞ and Δh, in terms of the similarity coefficient .
Orange rectangles correspond to the values of , dark green dots to the values of and brown rectangles to the values of .To describe the distribution of for any two given 5hmC measures, we adapted the ideas of [56, 57] and calculated the expected value of this similarity coefficient. The results of those calculations are presented in the left-hand panel of Table 1. Due to that table, on cancer tissue the measures Δm∞ and Δh are again the most similar 5hmC measures; further, Δβ(100) and Δh are the least similar to each other, both on healthy and cancer tissue.
Table 1
Expected values of the similarity coefficients and .
E[S(.,.)]
E[SH(.,.)]
5hmC measures
healthy tissue
cancer tissue
healthy tissue
cancer tissue
Δβ(100), Δh
0.7946
0.7897
0.5892
0.5794
Δβ(100), Δm∞
0.8052
0.8016
0.6104
0.6032
Δm∞, Δh
0.8021
0.8060
0.6042
0.6120
Expected values of the similarity coefficients and as applied for the pairwise comparison of the 5hmC measures Δβ(100), Δm∞ and Δh.
Expected values of the similarity coefficients and as applied for the pairwise comparison of the 5hmC measures Δβ(100), Δm∞ and Δh.The similarity coefficient in (13) exhibits a number of advantages such as simple applicability and intuitive interpretation of the obtained values. However, there are also some issues related to this coefficient. One of these issues arises in situations with two 5hmC measures x1 and x2 characterized byFor such 5hmC measures, which should actually be considered as completely dissimilar in the context of 5hmC detection, there is still a real possibility to get a positive value of the coefficient as
which may indeed become misleading in the context of the screening step. This situation will even deteriorate forTo mitigate this issue, we consider the similarity coefficient of Hamann, defined asClearly, is just a transformation of the simple matching coefficient [53] that incorporates a correction for possible mismatches between the considered 5hmC measures x1 and x2. While ranging in the interval [−1, 1], can be interpreted as complete dissimilarity and as complete similarity between x1 and x2. Further, due to (13) and (16), for any two measures x1 and x2.As in case with , we calculated the expected value of for any two given 5hmC measures; the results are presented in the right-hand panel of Table 1. As expected, this table shows the similarity coefficient confirming the results obtained under , e.g., with the 5hmC measures Δm∞ and Δh being most similar to each other on cancer tissue.Altogether, we state that among three considered 5hmC measures Δβ(α), Δm∞ and Δh, the measures Δm∞ and Δh appear to be most similar to each other on cancer tissue, both in terms of and . Further, as in case of the prevalence of positive results analysis, the measure Δm∞ is more similar to Δβ(α) than the measure Δh is.
Relative accuracy analyses
A different aproach for addressing the pairwise resemblance of the proposed 5hmC measures is to consider their relative sensitivities SE, specificities SP and false discovery rates FDR. Here, for any two 5hmC measures x1 and x2, we set as the relative sensitivity of x1 with respect to x2, as the relative specificity of x1 with respect to x2 and
as the relative false discovery rate. The quantities SE and SP are also known as co-positivities and co-negativities, respectively [58, 59].We started our data analyses on relative accuracies by checking for a significant difference in relative sensitivities as computed on healthy and cancer tissue. In a sample-wise analysis, such difference was observed for the relative sensitivities SE(Δβ(100)|Δh) and SE(Δβ(100)|Δm∞), with the relative sensitivity on healthy tissue exceeding the corresponding relative sensitivity on cancer tissue. The same analysis, performed CpG-wise, showed all relative sensitivities differentiating significantly between healthy and cancer tissue.Further, in a sample-wise analysis, performed on healthy tissue, the 5hmC measure Δm∞ demonstrated a higher sensitivity with respect to Δh than Δh did with respect to Δm∞. This is consistent with our results on prevalence of positive results, with the measure Δm∞ being less conservative than Δh on healthy tissue. Further, there was a trend for a significant increase in the relative sensitivity SE(Δm∞|Δβ(100)) compared to the relative sensitivity SE(Δβ(100)|Δm∞) on cancer tissue. This is also related to our result on prevalence of positive results, with the measure Δm∞ being less conservative than Δβ(100) on cancer tissue.A CpG-wise analysis of relative sensitivities revealed, the 5hmC measure Δβ(100) showing a lower sensitivity with respect to the measure Δh than the other way around, on cancer tissue. This result changed to the opposite on healthy tissue. Further, the measure Δm∞ showed a higher sensitivity with respect to the measure Δβ(100) than Δβ(100) did with respect to Δm∞, both on healthy and cancer tissue. Analogous result was true for the measures Δm∞ and Δh, with SE(Δm∞|Δh) exceeding SE(Δh|Δm∞), both on healthy and cancer tissue.While analyzed sample-wise for its relative specificity, the measure Δβ(100) demonstrated a significantly lower specificity with respect to Δh on healthy tissue than on cancer tissue; similar result holds for relative specificity of the measure Δm∞ with respect to the measure Δh. The same analysis, performed CpG-wise, showed all relative specificities differentiating significantly between healthy and cancer tissue. Further, the 5hmC measure Δβ(100) demonstrated a higher specificity with respect to the measure Δm∞ than Δm∞ did with respect to Δβ(100), both on healthy and cancer tissue, with the difference being more substantial on cancer tissue. This is again in correspondence with the measure Δm∞ being less conservative than Δβ(100), in particular on cancer tissue.In a CpG-wise analysis, on healthy tissue the measure Δβ(100) demonstrated a significantly lower specificity with respect to the measure Δh than Δh did with respect to Δβ(100); this result changes to the opposite while considering the same relative specificities on cancer tissue. Further, the measure Δm∞ showed a lower specificity with respect to Δβ(100) than Δβ(100) did with respect to Δm∞, on both considered tissues; similar result is true for the measures Δm∞ and Δh.Due to its definition, the results on relative false discovery rates FDR can be immediately derived from the corresponding results on SE. For instance, one can show that the measure Δh has a higher false discovery rate with respect to Δβ(100) than Δm∞, both on healthy and cancer tissue.We also computed expected relative sensitivities, specificities and false discovery rates of each 5hmC measure with respect to two others; the results are presented in Tables 2–4 below.
Table 2
Expected relative sensitivities .
healthy tissue
cancer tissue
E[SEr]
Δm∞
Δβ(100)
Δh
Δm∞
Δβ(100)
Δh
Δm∞
1.0
0.7739
0.8624
1.0
0.7827
0.8383
Δβ(100)
0.7456
1.0
0.5938
0.7133
1.0
0.5549
Δh
0.7940
0.5613
1.0
0.8199
0.5906
1.0
Expected relative sensitivities , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.7456 corresponds to the expected relative sensitivity and the value 0.7739 to the expected relative sensitivity , both on healthy tissue. Since larger values in the table describe a higher relative sensitivity, our results in the table above indicate the measures Δh and Δm∞ demonstrating the highest resemblance. At the same time, the measures Δβ(100) and Δh appear to be least similar to each other.
Table 4
Expected relative false discovery rates .
healthy tissue
cancer tissue
E[FDRr]
Δm∞
Δβ(100)
Δh
Δm∞
Δβ(100)
Δh
Δm∞
0.0
0.2544
0.2060
0.0
0.2867
0.1801
Δβ(100)
0.2261
0.0
0.4387
0.2173
0.0
0.4094
Δh
0.1376
0.4062
0.0
0.1617
0.4451
0.0
Expected relative false discovery rate , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.2261 corresponds to the expected relative false discovery rate and the value 0.2544 to , both on healthy tissue. Altogether, the measure Δm∞ demonstrates a lower false discovery rate with respect to Δβ(100) than Δh.
Expected relative sensitivities , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.7456 corresponds to the expected relative sensitivity and the value 0.7739 to the expected relative sensitivity , both on healthy tissue. Since larger values in the table describe a higher relative sensitivity, our results in the table above indicate the measures Δh and Δm∞ demonstrating the highest resemblance. At the same time, the measures Δβ(100) and Δh appear to be least similar to each other.Expected relative specificities , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.5982 corresponds to the expected relative specificity and the value 0.5698 to the expected relative specificity , both on healthy tissue. As in Table 2, larger values correspond to a higher relative specificity, and thus Δβ(100) and Δm∞ demonstrate a higher resemblance than Δβ(100) and Δh.Expected relative false discovery rate , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.2261 corresponds to the expected relative false discovery rate and the value 0.2544 to , both on healthy tissue. Altogether, the measure Δm∞ demonstrates a lower false discovery rate with respect to Δβ(100) than Δh.Altogether, due to our relative accuracy analyses, the measure Δm∞ again demonstrates more resemblance with Δβ(100) than the measure Δh, both on healthy and cancer tissue.
Comparison of Δβ(α), Δh and Δm∞ to the oxBS-MLE and OxyBS procedures in the context of a screening step
When detecting CpGs with a substantial 5hmC level, one may compare the results provided by each of the considered three 5hmC measures Δβ(α), Δh and Δm∞ with those derived from the oxBS-MLE and OxyBS procedures introduced in [44, 47]. When applied in a screening step, both oxBS-MLE and OxyBS procedures will flag the same cytosines as being hydroxymethylated as the 5hmC measure Δβ(0) will do. This results follows immediately from the problem formulations and the derivation of the MLEs as suggested by both procedures; see S4 Appendix for details. Thus, the comparison of the 5hmC measures Δm∞ and Δh with the oxBS-MLE and OxyBS procedures in detection of hydroxymethylated cytosines can be traced back to the comparison of these measures with the measure Δβ(0).
Conclusion
Presently, the measure most commonly used for the detection of hydroxymethylated CpGs is the measure Δβ(α) and its derivatives as introduced in [31, 41, 42]. Well-established due to its easy computation and alleged intuitivity, this 5hmC measure nevertheless exhibits a number of limitations and has already been criticized due to its interpretation. This interpretation has meanwhile been questioned in [44], where the authors discussed the “naive” estimation of the 5hmC level via the difference of two β values as proposed in [31, 41] and introduced a model for describing the 5mC and 5hmC proportions by means of maximum likelihood estimation and beta-distributed random variables. Such modeling disallows negative proportions in particular; the corresponding model was also implemented in the R-package OxyBS [44].In this paper, we performed a detailed analysis of Δβ(α), both analytically and empirically, and discussed a number of limitations of Δβ(α) which could make its practical applicability for screening of hydroxymethylated CpGs questionable. These limitations concern in particular the interpretation of Δβ(α) and its robustness with respect to the choice of α.Further, we proposed two alternative 5hmC measures which can be applied in the screening step. The first of these 5hmC measures is the measure Δm∞. While intuitively interpretable and independent of the correction term α, this measure does not incorporates the unmethylated intensities U and U. Even though the role of these intensities in detection of the 5hmC levels has not been clarified yet, we took this fact into account and suggested the second alternative 5hmC measure, Δh. Due to its definition, this measure does not depend on the choice of α, has an intuitive interpretation in detecting hydroxymethylated CpGs, takes into account all intensities, and can be computed directly from the observed data.The main challenge to be handled in our analysis referred to a mutual comparison of the considered 5hmC measures in the absence of a gold standard, as no biological or biochemical criterion for a CpG to be considered as “hydroxymethylated”, e.g., in terms of methylated and non-methylated intensities M and U, is available so far. To overcome this challenge and to be able to address resemblance of the proposed 5hmC measures in the context of the screening step, we first analyzed the prevalences of positive results for each single 5hmC measure. Here, we first observed a decrease in this prevalence, while moving from healthy to cancer tissue, for the measures Δβ(α) and Δm∞. This result is also in accordance with the observation on a depletion of 5hmC levels in tumors compared to corresponding normal tissue as stated, e.g., in [24, 45, 60]. Moreover, the measure Δm∞ appears to be the measure with the largest prevalence of positive results, both on healthy and cancer tissue. In addition, data-based analysis of the joint prevalence of positive results revealed the strongest agreement between the measures Δm∞ and Δh, followed by the agreement between the measures Δβ(100) and Δm∞; the 5hmC measures Δβ(100) and Δh demonstrated the lowest pairwise agreement. In other words, a stronger resemblance between the measures Δβ(α) and Δm∞ than between the measures Δβ(α) and Δh was observed so far. This result was also confirmed in the context of a similarity analysis as performed for a pairwise comparison of the proposed 5hmC measures.In order to estimate relative accuracies of Δβ(100), Δm∞, and Δh with respect to each other, we also used relative sensitivity and specificity analyses. As a result of those analyses, the measure Δm∞ demonstrated a higher sensitivity and a lower specificity with respect to Δβ(100) than vice versa; the same result holds for the measures Δm∞ and Δh. Moreover, we observed that the measure Δh has a higher false discovery rate with respect to Δβ(100) than Δm∞. Altogether, we concluded, that, in the context of the screening step, the 5hmC measure Δm∞ exhibits more resemblance with the measure Δβ(α) than Δh does and thus this measure would be the first choice if looking for a possible substitute for Δβ(α) with another 5hmC measure in the screening procedure.Our numerical analyses are based on raw data, with no normalization method applied. There are a variety of reasons for this. First, some of our results (such as the convergence result for Δβ(α)) were derived analytically and thus do not depend on the data used for their illustration. Second, there is no consistent normalization method to be applied when quantifying the 5hmC levels [42, 44]. Third, a possible impact of a particular normalization method on the results of the 5hmC classification is currently not obvious to us and can in fact be considered as a topic of future research.Nevertheless, we did check our results on the data normalized by three different normalization methods, funNorm, SWAN and Illumina, as available in the R-package minfi [51]; for more details see S5 Appendix. As a consequence of such normalized data analyses, we do observe some differences to our results as obtained on raw data. However, there is no evidence that such differences have any biological meaning and are not just a product of the normalization method applied. For instance, in some cases we observe a reduction in the prevalence of positive results of a given 5hmC measure as calculated on normalized data compared to raw data. On the other hand, a reduction in the 5hmC levels on cancer tissue as observed in terms of the measure Δβ(100) is confirmed for all three normalized data sets as well. The same is true for the measure Δm∞ being less conservative than Δh on healthy tissue. Further, the measures Δm∞ and Δh are the ones that are most similar to each other (in terms of the similarity coefficient ) followed by the measures Δβ(100) and Δm∞, both on raw and normalized data; this result holds both for healthy and cancer tissue.There are also differences in results on detection of the hydroxymethylated CpGs provided by different normalization procedures. For instance, on cancer tissue, the measure Δβ(100) shows a significant reduction in the prevalence of positive results calculated on the Illumina data compared to the prevalence computed on the funNorm data. Further, both on healthy and cancer tissue, the measures Δβ(100) and Δm∞ demonstrate the strongest similarity (in terms of the similarity coefficient ) on the funNorm normalized data, followed by the SWAN normalized data; the similarity between Δβ(100) and Δm∞ on the Illumina normalized data is the lowest one.In the present paper we discussed the possible applicability of the considered 5hmC measures for detection of hydroxymethylated CpGs in the screening procedure. The immediate question arising in this context is the question about the applicability of these measures for the quantification of the observed 5hmC levels, similar to the applicability of β values used for quantification of the methylation levels. Even if the measure Δh appears to provide the most intuitive interpretation in contrast to the remaining two 5hmC measures, this question is still a topic of future research.
On the 5hmC measure Δβ(α).
Sign change of Δβ(α), sample-wise convergence of the CpG sets satisfying {Δβ(α) > 0} as α increases, the role of α in similarity analyses.(PDF)Click here for additional data file.
On the 5hmC measures Δm(α).
Relation between the measures Δm(α) and Δβ(α), the 5hmC measure Δm(α) as a function of α (monotonicity, convergence, sign change of Δm(α)), relation between the subsets {Δm(α) > 0} and {Δm∞ > 0}, relation between the subsets {Δh > 0}, {Δβ(α) > 0} and {Δm∞ > 0}.(PDF)Click here for additional data file.
On the resemblance of Δβ(α), Δm∞ and Δh: Numerical results.
Prevalence of positive results, joint prevalence of positive results, similarity analyses, relative accuracy analyses (relative sensitivity and specificity).(PDF)Click here for additional data file.
A comparison of the 5hmC measures Δβ(α), Δh and Δm∞ with the results of the oxBS-MLE and OxyBS procedures.
(PDF)Click here for additional data file.
A comparison of numerical analyses on raw and normalized data.
(PDF)Click here for additional data file.
Table 3
Expected relative specificities .
healthy tissue
cancer tissue
E[SPr]
Δm∞
Δβ(100)
Δh
Δm∞
Δβ(100)
Δh
Δm∞
1.0
0.5698
0.6844
1.0
0.5669
0.7112
Δβ(100)
0.5982
1.0
0.3473
0.6455
1.0
0.3834
Δh
0.7890
0.3788
1.0
0.7422
0.3529
1.0
Expected relative specificities , computed for any two 5hmC measures x1, x2 ∈ {Δβ(100), Δm∞, Δh}. The measure x1 is in rows and x2 is in columns; e.g., the value 0.5982 corresponds to the expected relative specificity and the value 0.5698 to the expected relative specificity , both on healthy tissue. As in Table 2, larger values correspond to a higher relative specificity, and thus Δβ(100) and Δm∞ demonstrate a higher resemblance than Δβ(100) and Δh.
Authors: Benjamin B Green; E Andres Houseman; Kevin C Johnson; Dylan J Guerin; David A Armstrong; Brock C Christensen; Carmen J Marsit Journal: FASEB J Date: 2016-04-26 Impact factor: 5.191
Authors: Michael Stevens; Jeffrey B Cheng; Daofeng Li; Mingchao Xie; Chibo Hong; Cécile L Maire; Keith L Ligon; Martin Hirst; Marco A Marra; Joseph F Costello; Ting Wang Journal: Genome Res Date: 2013-06-26 Impact factor: 9.043
Authors: Sarah F Field; Dario Beraldi; Martin Bachman; Sabrina K Stewart; Stephan Beck; Shankar Balasubramanian Journal: PLoS One Date: 2015-02-23 Impact factor: 3.240