Literature DB >> 25452765

Quantifying the impact of inter-site heterogeneity on the distribution of ChIP-seq data.

Jonathan Cairns¹, Andy G Lynch², Simon Tavaré².

Abstract

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) is a valuable tool for epigenetic studies. Analysis of the data arising from ChIP-seq experiments often requires implicit or explicit statistical modeling of the read counts. The simple Poisson model is attractive, but does not provide a good fit to observed ChIP-seq data. Researchers therefore often either extend to a more general model (e.g., the Negative Binomial), and/or exclude regions of the genome that do not conform to the model. Since many modeling strategies employed for ChIP-seq data reduce to fitting a mixture of Poisson distributions, we explore the problem of inferring the optimal mixing distribution. We apply the Constrained Newton Method (CNM), which suggests the Negative Binomial - Negative Binomial (NB-NB) mixture model as a candidate for modeling ChIP-seq data. We illustrate fitting the NB-NB model with an accelerated EM algorithm on four data sets from three species. Zero-inflated models have been suggested as an approach to improve model fit for ChIP-seq data. We show that the NB-NB mixture model requires no zero-inflation and suggest that in some cases the need for zero inflation is driven by the model's inability to cope with both artifactual large read counts and the frequently observed very low read counts. We see that the CNM-based approach is a useful diagnostic for the assessment of model fit and inference in ChIP-seq data and beyond. Use of the suggested NB-NB mixture model will be of value not only when calling peaks or otherwise modeling ChIP-seq data, but also when simulating data or constructing blacklists de novo.

Entities: CellLine Chemical Disease Gene Species

Keywords: ChIP-seq; Negative Binomial; high-throughput sequencing; mixture model; zero-inflation

Year: 2014 PMID： 25452765 PMCID： PMC4231950 DOI： 10.3389/fgene.2014.00399

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

1. Introduction

Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) is an experiment for the genome-wide location of events such as transcription factor binding sites and histone modifications (Park, 2009; Cairns et al., 2013). These events provide information on chromatin status, a topic of great interest in epigenetics. As with all sequencing assays, ChIP-seq can be represented as count data across the genome. Commonly, researchers need algorithms that find sites where the counts are larger than one would expect under a null noise model, a procedure known as “peak-calling.” Consider read counts x, where i indexes over genomic bins, for a single ChIP-seq sample. When modeling these counts as independent samples from some Poisson random variable X, a common problem is overdispersion—that is, the property that the observations have greater variance than allowed for by the model. In our case, the Poisson distribution requires that mean and variance are equal, an assumption that (even when ignoring between-sample variability) is violated by ChIP-seq data, impairing model fit (Spyrou et al., 2009). The choice of distribution for X is important when peak-calling; in particular, underestimating the variance should decrease specificity. Indeed, many have commented on the presence of false positives in peak-calling and how this quantity varies depending on the choice of peak-caller (Landt et al., 2012). A number of strategies exist to account for the extra variance. For example, we can allow the Poisson distribution to have site-specific means—that is, X ~ Pois(λ). This strategy requires some sort of smoothing criterion to make parameter estimation robust (Zhang et al., 2008a). It is also difficult to expand this model to account for between-sample heterogeneity. Many researchers use more general distributions—for example, the Negative Binomial (NB) model is used by Spyrou et al. (2009); Wu et al. (2010); Cairns et al. (2011); Song and Smith (2011) and others. The log-normal Poisson model has been used to model data from other high-throughput sequencing experiments, such as Serial Analysis of Gene Expression (SAGE) (Thygesen and Zwinderman, 2006). Other strategies involve regression. ZINBA (Rashid et al., 2011) uses an NB model whose mean is regressed against known covariates, though the dispersion is fixed. Such a strategy requires knowledge of all appropriate dependent variables to capture the full variability. An increasingly common approach is to use blacklists (Myers et al., 2011)—regions that have unusual mappability and previously have been seen to accumulate artifacts across many ChIP-seq experiments. Reads that fall in blacklisted regions are removed. Use of blacklists appears to improve peak-calling (Carroll et al., 2014), but has a number of downsides. Firstly, this strategy requires an organism-specific blacklist, and it is possible that the blacklist is non-exhaustive. Secondly, there is no reason to assume that enrichment cannot occur in these loci—a better alternative would be to model these regions appropriately, or perhaps separately. Thirdly, there may be sample-specific or copy-number specific events. For example, a transfected vector may artificially inflate copy number at a particular locus. We consider the general form of the above models, taking a completely unsupervised random-effects approach. It is reasonable to retain the Poisson element in our model, since it has been shown that counts from a single site in a single library, sequenced repeatedly, follow a Poisson distribution (Marioni et al., 2008). However, by expanding to a mixture model setting, where x is a sample from X ~ Pois(Λ) and the latent mixing variable Λ satisfies Λ ~ f(λ), we can use f(λ) to capture the genome-wide variation in our sample. Indeed, if Λ has gamma distribution, then X has NB distribution. However, there is no biological or statistical reason, other than convenience, to believe that Λ is best modeled with the gamma distribution. We assess the fit of the NB model to the data; that is, investigate the appropriateness of the gamma distribution as a mixing distribution. Next, we use CNM to make inferences about the distribution of Λ, and see that it suggests a mixture of two distributions. Thus, we consider the NB-NB distribution and show that it provides a better fit to the data. We also consider zero inflation, and show that the poor fit of the NB distribution can be mistaken for the presence of zero-inflation in the data.

1.1. Data

We use four different data sets, all of which are input samples (that is, untreated data). We focus on input samples because, in order to detect sites where counts differ from noise, it is important to model the noise appropriately. However, the methods used can also be applied to treated ChIP-seq data—see Supplementary Material. Samples were chosen to represent a variety of species. In each case, we considered only the first chromosome to avoid inter-chromosome normalization issues (Schmidt et al., 2010; Ross-Innes et al., 2012). We partition the first chromosome into 100 bp bins, avoiding conservatively-chosen regions that span the telomeres and centromeres, as obtained from the UCSC Genome Browser at http://genome.ucsc.edu/. We did not remove duplicates—however, we found that deleting duplicates did not substantially affect our results (see Supplementary Material). Code and data to reproduce the analysis are linked to in the Supplementary Material section.

2. Initial mixture models

2.1. Poisson

The Poisson model's Maximum Likelihood Estimate (MLE) occurs when its expected mean is equated to the observed mean: = .

2.2. Negative binomial

The MLE for the Negative Binomial (NB) distribution does not have closed form. However, we can fit the model using standard techniques, using the BFGS method implemented as fitdistr() in the MASS R package (Venables and Ripley, 2002). Though the MLE for the dispersion of the NB distribution is biased, the bias is of the order 1/n (Saha and Paul, 2005) and we have enough bins in our data sets for the bias to be negligible.

2.3. Log-normal poisson

Here, Λ is a log-normal random variable: log(Λ) ~ N(μ, σ2). The full log-likelihood involves a complicated integral: and the MLEs have no known closed form. Therefore, to estimate the parameters of this distribution, we take the following Bayesian approach: Assume a weak prior distribution for each of μ (the mean) and σ−2 (the precision). We use conjugate prior distributions: The prior distributions should have very little effect on the posterior, given that we have so many data. Use the Metropolis-Hastings algorithm to sample from each parameter's posterior distribution. Excluding a suitably-chosen burn-in period, take the mean of the posterior samples as an approximation to the maximum a posteriori (MAP) estimate. We perform the analysis in WinBUGS (Lunn et al., 2000).

2.4. Initial results

Figure 1 shows an example of the fit of the above distributions to sample A. We see that the empirical distribution has a large tail that the fitted distributions cannot account for, which in turn affects their ability to model bins with low counts.

Figure 1

Illustrating the best fits of the commonly used distributions to the count data for sample A. Note that none of these distributions can model adequately the heavy tail of large bin counts.

Illustrating the best fits of the commonly used distributions to the count data for sample A. Note that none of these distributions can model adequately the heavy tail of large bin counts. To explore this effect further, we investigate the properties of the underlying mixing distribution.

3. Mixture inference

As described in the introduction, suppose that we have multiple samples {x; i = 1, …, n} from a variable X distributed according to a Poisson mixture—that is, Our aim is to estimate the density f(λ) from our observations {x}. A formulation of the general problem, where X|Λ has arbitrary distribution, is given in Roueff and Rydén (2005). In the case of the Poisson distribution, it is known that (λ), the maximum likelihood estimator (MLE) of f(λ), consists of a finite number of point masses: that is, if we adopt the MLE (λ), then Λ is a discrete random variable, taking values from some vector with associated probabilities π, where and π both have length L. It is also known that L, the number of support points used in the MLE distribution, can never exceed the number of distinct values in our X samples (Laird, 1978). A number of methods have been proposed to infer the parameters L, , and π, including those of Tucker (1963); Boes (1966); Simar (1976); Laird (1978). Here, we use the Constrained Newton Method (CNM), described in Wang (2007). The CNM algorithm takes an initial value for (L, , π), then repeats the following steps until convergence: Update (L, ), as follows: Suppose that G(λ) is our current estimate for the mixing distribution, corresponding to the current value of (L, , π). Now, consider some new candidate support point θ*, and let H(λ) be the distribution that has a point mass at θ*—in other words, H is the Dirac delta function H(λ) = δ(λ − θ*). Consider the “gradient function,” the directional derivative which we can think of as the rate at which the likelihood increases, as we shift probability mass across to the new support point θ*. The value of θ* that maximizes d(θ*; G) is added to , and L is increased by one. If multiple values of θ* maximize d(θ*; G), then we add each of them to , increasing L by one for each value. Update by calculating the MLE for π given (, L). Wang (2007) show that this problem is equivalent to subject to , and where (here, ℓ refers to the likelihood for a single data point, x). This minimization problem is solved using the Constrained Newton (CN) method. If π = 0 for some i*, then delete π and θ and reduce L by one. Wang (2007) prove theoretical convergence and demonstrate that, in practice, convergence is dramatically faster than previous algorithms designed to find L, , and π. CNM reported that it did not converge for data sets C or D, and we check the output in the next section. In practice, we found that the value of L ranged from around 4 to 11 in our input samples.

3.1. Distribution recovery

First, we show that the CNM estimate (λ) is appropriate for our data, by demonstrating that the empirical distribution of X can be recovered from the CNM estimate according to the following procedure: Calculate the empirical probability density (x) directly, from the data. Calculate the mixing distribution MLE (λ) from the data according to the CNM algorithm. Estimate the density f(x) from the mixing distribution according to where f(x; θ) is the density of the Poisson distribution with mean θ. If the CNM algorithm retains the key features of the true distribution f(x), then we expect (x) to be “similar” to the empirical distribution (x). The results of this comparison are shown in Figure 2.

Figure 2

Distribution recovery in each data set. The empirical distribution obtained from the actual data (black line) appears to show a dramatic change in behavior around x = 9. While CNM reported failure to converge for samples C and D, it can be seen that sample C's distribution was adequately recovered.

3.2. Smoothing

The MLE (λ) is desirable in the sense that it is the “best fit” to our data. However, it does not make physical sense to use (λ) when modeling, which would make Λ a discrete variable. In reality, drawing from Λ represents a complicated chain of library-preparation procedures, leading to wide variation across hundreds of millions of sites across the genome. Therefore, it is much more logical to model Λ as a continuous random variable. As such, we adopt the following strategy to assess the fit of various distributions: Fit a continuous mixing distribution Θ(λ) to the data. Compare Θ(λ) with the discrete MLE (λ), obtained from CNM. If Θ(λ) is flexible enough to fit our data, then it should be “similar” to the MLE (λ). Thus, we need to compare a discrete distribution, (λ), with a continuous distribution, Θ(λ). Methodology for making such a comparison tends to be ad hoc, as the general question is ill-defined. A common strategy is to smooth out the data through “kernel smoothing”—that is, replacing each point mass in the discrete distribution with an appropriately-scaled kernel distribution, often the normal distribution. The problem with this approach is that the support of the mixing distribution's MLE contains only a handful of points, and the points become dramatically less dense at higher values. Thus, by using this approach, we end up with a distribution that is highly sensitive to the choice of length scale for the kernel. Instead, we take the following non-parametric approach. Assume that CNM's output [the discrete distribution with support = (θ1, …, θ) and associated probabilities π = (π1, …, π)] has been generated from a “true” distribution f(λ) through the following process: Take some partition = (P1 = 0, P2, P3, …, P, ∞) of the interval [0, ∞]. For each i ∈ {1, …, L}, replace the probability mass of f(λ) that lies in the interval [P, P) with an equivalent point mass of size π = ∫f(u)du placed at any point θ within that interval. Now, consider the true distribution's CDF, F(λ) = ∫λ0f(u)du. For i > 1, we have: For the case i = 1, we set Note that Equation 2 assumes that Λ is not zero-inflated. We can now place bounds on the value of F(θ). θ must lie in [P, P), by assumption. Therefore, F(θ) must lie in [F(P), F(P)), since F(λ) is a CDF and is therefore an increasing function of λ. Thus, for i > 1, by Equation (1) we have and for i = 1, we have We can use these upper and lower bounds to assess the fit of various candidate mixing distributions to the observed mixing distribution. Figure 3 shows an example of this procedure, as applied to sample A. We see that all of the candidate mixing distributions considered thus far violate the CNM bounds early on, indicating that these distributions cannot cope with large counts. This motivates the selection of a mixing distribution that can stay within the bounds.

Figure 3

Candidate mixing distributions, and their consistency with the CNM-derived mixing distribution. For clarity, we plot log(1 − CDF) where CDF is the Cumulative Density Function of Λ, and values are calculated only at CNM's λ support points. A mixing distribution that is consistent with CNM's predicted mixing distribution, as derived in Section 3.2, would have a line contained within the shaded region, with the black lines representing upper and lower bounds. Here, the lines do not stay within the bounds, indicating that all of the models deviate from CNM's prediction.

3.3. NB-NB mixture

The curves plotted in Figure 3 have insufficient curvature to accommodate the sharp turn present in the shaded region. This suggests that a mixture of two distributions is required—consistent with the abrupt change in behavior seen in Figures 1, 2. We consider the case where X is a mixture of two NB distributions, equivalent to modeling Λ as a mixture of two gamma distributions. In this case, we have a mixing parameter τ, and two separate NB parametrizations (μ1, r1) and (μ2, r2). Thus, We could find the MLE of the parameters with the EM algorithm (Dempster et al., 1977). However, since the EM algorithm can converge very slowly, we accelerated the process using SQUAREM (Varadhan and Roland, 2008). SQUAREM assumes that each step of the EM algorithm can be approximated using a particular quadratic form, allowing us to estimate the cumulation of a large number of EM updates in one go. Note that we did not consider mixtures of Poisson distributions, since these cases are attended to by the CNM algorithm, and we did not consider mixtures of log-normal Poisson distributions due to the computational complexity.

4. Results

4.1. Model fits

We considered the following distributions: We visually inspect the fits of the various distributions to the full data in Figure 4. To quantify the fit, we used Total Variation distance: . The results for this are given in Figure 5.

Figure 4

The fit of various distributions to the count data. We again see a striking change of behavior, occurring at around x = 9. The only distribution that can model this change is the NB-NB mixture.

Figure 5

Total variation. We see from the plot that the NB-NB mixture outperforms all of the other distributions, achieving the closest fit to the CNM estimate in the cases where it exists.

The fit of various distributions to the count data. We again see a striking change of behavior, occurring at around x = 9. The only distribution that can model this change is the NB-NB mixture. Total variation. We see from the plot that the NB-NB mixture outperforms all of the other distributions, achieving the closest fit to the CNM estimate in the cases where it exists. Then we examine the similarity between the mixing distributions used and the observed mixing distribution as calculated by CNM, in Figure 6.

Figure 6

Predicted mixing distributions, as per Figure . (Sample D is omitted since CNM did not recover its distribution.) The NB-NB mixture is the only model whose mixing distribution consistently fits between the CNM bounds. The NB-NB mixture is the only choice of distribution that can consistently model the higher counts.

4.2. Zero inflation

Zero-inflated models have been proposed to improve model fit in ChIP-seq data (Rashid et al., 2011; Diaz et al., 2012). A zero-inflated model is one such that, with some probability ν, we set X = 0. Otherwise, we draw X from a candidate distribution as before. Having accounted for some of the variation in our data with the NB-NB mixture model, we investigated whether or not zero-inflation can improve model fit further. To quantify this, we delete some proportion of zeros from the data, fit each model, then assess the fit with Total Variation as defined in Section 4.1. We also considered the case where we increase the number of zeros. Note that the log-normal Poisson fit was excluded due to computational complexity. The results are given in Figure 7. For sample A (dog) we see evidence of zero inflation, perhaps due to a less-established genome assembly. In samples B-D, the NB-NB models showed no evidence of zero-inflation. However, when using distributions that cannot account for the heavy tail in the high bin-counts, there is an erroneous indication that a zero-inflation component is necessary.

Figure 7

The effect of zero removal on the model fit. Each Subfigure represents an individual sample. A proportion of zeros are removed (or added), and after refitting each model, we reassess model fit with the Total Variation distance d, plotted as −log(d) on the Y-axis against zero proportion on the X-axis. For severely zero-inflated data, deleting zeros should improve model fit, so we expect to see a “peak” somewhere to the right of the black vertical line. The NB and Poisson models recommend the removal of most of the zero-valued data, a conclusion that seems biologically implausible. In contrast, the NB-NB mixture consistently recommends either no zero-removal, or a modest level of zero-inflation in the case of sample A. Note the dramatic change in the NB-NB mixture's fit in sample D as additional counts are added—this could be due to the accelerated EM algorithm encountering alternative local maxima.

5. Discussion

We found that, of the distributions tested, a mixture of two NB distributions best modeled counts in input data, even after removing problematic centromeric and telomeric regions. This result, and the observation that the empirical distribution changes behavior at high counts, reflected two apparent sources of counts in the data—one large population of counts, and another smaller population of higher counts. The regions with higher counts have higher variance than lower counts, and could be mistaken for peaks if using a naïve peak-calling method. Importantly, we saw that the heavy tail of large counts could cause inadequate models to suggest the need for a zero-inflation component. Our results suggest that researchers should check that large counts are not distorting model fit before assuming that a zero-inflation component is necessary. Several aspects of experimental design could influence the abundance of large counts. Any biases in mapping that cause reads to align preferentially to the same locus could cause high-count artifacts. Thus, we might be able to reduce the abundance of large counts by improving mappability (that is, reducing the number of ambiguously-mapped reads). For example, better genome assembly, or use of a better aligner, or using longer reads could all accomplish this task. Appropriate mixture modeling could apply when simulating ChIP-seq data—a simulation paradigm that fails to account for these different populations of counts cannot test peak-callers properly and overestimates their performance (Zhang et al., 2008b). A better understanding of the properties of the underlying mixture model allows us to simulate noise in ChIP-seq data sets more accurately. Mixture modeling is also of importance when peak-calling in ChIP-seq data, allowing us to model large counts without removing them. Models that regress on genomic covariates, such as copy number or GC content, can help us explain some of the variation in the data (Rashid et al., 2011; Robinson et al., 2012). However, our unsupervised approach can model the noise, even in the absence of appropriate covariates to regress over—for example, if we have samples that are from less well-elucidated species or that have abnormal copy number events (such as after chromothrypsis). We may also be able to extend the approach to make inferences about the variation that remains after regressing out known covariates. Additionally, we note that Rashid et al. (2011) assume constant dispersion due to general linear modeling restrictions—in contrast, a mixture model permits multiple dispersions. Alternatively, should we wish to adopt a blacklist strategy, we can construct a blacklist de novo by classifying bins into artifact and non-artifact regions (for example, by finding the probability that a bin belongs to each of the NB components in our mixture model). Again, this could be particularly useful for abnormal samples or species. An alternative approach to smoothing the MLE (λ) is to enforce continuity of (λ) during maximum likelihood estimation. For example, Liu et al. (2009) minimize a penalized log-likelihood: where α, the smoothing parameter, controls how smooth the output function is required to be. The methods we described have general applicability to other sequencing experiments based on genomic DNA, and not just ChIP-seq. As we increasingly see the emergence of large-scale epigenetic studies based on gDNA assays like ChIP-seq, it is important to choose models whose false discovery rates are robust. Properly accounting for the null variability in ChIP-seq data is vital to avoid false positives.

Funding

We acknowledge the support of the University of Cambridge, the Medical Research Council, Cancer Research UK, and Hutchison-Whampoa.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Sample	Species	Description	Read lengths	# Reads	Aligner
A	C. Familiaris	Normal liver tissue	45	480,843	MAQ
B	M. Musculus	Normal liver tissue	36	542,021	MAQ
C	H. Sapiens	Cancer cell line MCF7	44	633,931	BWA 0.5.5
D	H. Sapiens	Tumor sample BT82277	44	1,398,520	BWA 0.5.5

Count distribution, f(x)	Parameters	Mixing distribution, f(λ)
Poisson	1	Single point
Negative Binomial (NB)	2	Gamma
Log-normal Poisson	2	Log-normal
NB-NB mixture	5	Gamma-gamma mixture

19 in total

1. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter.

Authors: Krishna Saha; Sudhir Paul
Journal: Biometrics Date: 2005-03 Impact factor: 2.571

2. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.

Authors: John C Marioni; Christopher E Mason; Shrikant M Mane; Matthew Stephens; Yoav Gilad
Journal: Genome Res Date: 2008-06-11 Impact factor: 9.043

3. BayesPeak--an R package for analysing ChIP-seq data.

Authors: Jonathan Cairns; Christiana Spyrou; Rory Stark; Mike L Smith; Andy G Lynch; Simon Tavaré
Journal: Bioinformatics Date: 2011-01-17 Impact factor: 6.937

4. ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data.

Authors: Song Wu; Jianmin Wang; Wei Zhao; Stanley Pounds; Cheng Cheng
Journal: Theor Biol Med Model Date: 2010-06-03 Impact factor: 2.432

Review 5. ChIP-seq: advantages and challenges of a maturing technology.

Authors: Peter J Park
Journal: Nat Rev Genet Date: 2009-09-08 Impact factor: 53.242

6. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding.

Authors: Dominic Schmidt; Michael D Wilson; Benoit Ballester; Petra C Schwalie; Gordon D Brown; Aileen Marshall; Claudia Kutter; Stephen Watt; Celia P Martinez-Jimenez; Sarah Mackay; Iannis Talianidis; Paul Flicek; Duncan T Odom
Journal: Science Date: 2010-04-08 Impact factor: 47.728