MOTIVATION: Identification of genomic regions of interest in ChIP-seq data, commonly referred to as peak-calling, aims to find the locations of transcription factor binding sites, modified histones or nucleosomes. The BayesPeak algorithm was developed to model the data structure using Bayesian statistical techniques and was shown to be a reliable method, but did not have a full-genome implementation. RESULTS: In this note we present BayesPeak, an R package for genome-wide peak-calling that provides a flexible implementation of the BayesPeak algorithm and is compatible with downstream BioConductor packages. The BayesPeak package introduces a new method for summarizing posterior probability output, along with methods for handling overfitting and support for parallel processing. We briefly compare the package with other common peak-callers. AVAILABILITY: Available as part of BioConductor version 2.6. URL: http://bioconductor.org/packages/release/bioc/html/BayesPeak.html.
MOTIVATION: Identification of genomic regions of interest in ChIP-seq data, commonly referred to as peak-calling, aims to find the locations of transcription factor binding sites, modified histones or nucleosomes. The BayesPeak algorithm was developed to model the data structure using Bayesian statistical techniques and was shown to be a reliable method, but did not have a full-genome implementation. RESULTS: In this note we present BayesPeak, an R package for genome-wide peak-calling that provides a flexible implementation of the BayesPeak algorithm and is compatible with downstream BioConductor packages. The BayesPeak package introduces a new method for summarizing posterior probability output, along with methods for handling overfitting and support for parallel processing. We briefly compare the package with other common peak-callers. AVAILABILITY: Available as part of BioConductor version 2.6. URL: http://bioconductor.org/packages/release/bioc/html/BayesPeak.html.
Chromatin ImmunoPrecipitation (ChIP) experiments produce short DNA fragments, preferentially selected to identify the locations of protein binding sites, histone modifications or nucleosome positions. In the ChIP-seq protocol, as described in Robertson ), the 5′-end of one strand of each fragment is sequenced, obtaining a ‘read’, and then aligned to a reference genome. These aligned reads form ‘peaks’—localized regions of high read density—along the genome. Determining the locations and magnitudes of these peaks is an active area of research, and a number of tools exist for the so-called ‘peak-calling’, using a variety of methodologies.The algorithm described in Spyrou ) takes a Bayesian approach to modelling aligned reads from ChIP-seq data. Many peak-callers model read counts with the Poisson distribution, and thus do not allow for the overdispersion seen in practice. BayesPeak addresses this issue by using the negative binomial distribution.The method optionally allows for the inclusion of a control sample, which enables us to mitigate the effect of experimental artefacts where protein binding is absent. Additionally, the algorithm was shown (in Spyrou ) to call merged peaks that are narrower than regions identified by other callers and this, along with inference from posterior probabilities, can improve the efficiency of downstream processing, e.g. motif analysis.We present the R package BayesPeak, which uses a modified version of the algorithm in Spyrou ) with a flexible genome-wide implementation. As well as providing compatibility with common input formats and downstream analyses, the BayesPeak package adds additional methods for summarizing data, tools for handling overfitting and support for parallel processing.
2 METHODS
The BayesPeak package, written in R and C, forms part of the BioConductor release branch since version 2.6 (Gentleman ).The implementation of BayesPeak allows it to take advantage of parallel processing, improving its efficiency. We provide optional parallelization support using the multicore package (Urbanek, 2009) for Linux/Mac OS X.BayesPeak can analyse a human genome in under 12 h, when run in parallel on an 8-core 2.5 GHz machine. For a benchmark example in which the treatment and control .bed files totalled 33 million reads (2.3 GB of disk space), BayesPeak required not more than 3 GB of RAM (although larger .bed files require more RAM). R version 2.11.0 or later is required.The BayesPeak software package fits a hidden Markov model (HMM) to the data (aligned reads) as follows: the genome is divided into ‘jobs’, i.e. short regions on which the algorithm is run independently. By default, jobs are of length 6 Mb (numerical stability may be compromised in larger jobs), and each job region is expanded by 2 Kb in each direction (to allow peaks falling on the boundary between two jobs to be called).Within a job, the region is divided into small bins (each of length 100 bases, by default), and reads are aggregated by the bin in which they start and the strand on which they lie. A two-state HMM is fitted to these aggregate counts. The HMM's hidden states correspond to enrichment or unenrichment for sites of interest. A hidden state produces two negative binomial emissions, each corresponding to a bin count (one on each strand), with enriched states tending to emit larger values. The HMM is fitted through MCMC techniques that sample from the posterior distributions of the parameters. The analysis is performed a second time on the same job region, but with all bins offset by half their width (the ‘offset’ analysis) as illustrated in Figure 1. Further details can be found in Spyrou ).
Fig. 1.
A schematic representation of a hypothetical peak region, with bins labelled by genomic order. Supposing that each bin has an associated PP value above a threshold (by default 0.5), we would merge these six bins into a peak from 1100 to 1450, with an associated PP value as calculated in Section 2.1.
A schematic representation of a hypothetical peak region, with bins labelled by genomic order. Supposing that each bin has an associated PP value above a threshold (by default 0.5), we would merge these six bins into a peak from 1100 to 1450, with an associated PP value as calculated in Section 2.1.The output of each job is the posterior probability (PP) of each site being enriched. The data are summarized to form the final peaks as follows. All bins with PP values greater than a user-specified threshold (by default, 0.5) are collected. Where two adjacent jobs call the exact same bin, the maximal PP value is used—this is a rare occurrence under default settings.
2.1 Merging bins
Bins from both the ‘normal’ and ‘offset’ jobs that are adjacent or overlap are merged to form contiguous peak regions.The PP value of the peak can be calculated from the constituent bins by either naively taking the maximal value, or by using the ‘lower bound’ method defined as follows: assign the indices 1,…, n to the n bins within the peak, in order of genomic location (as in Fig. 1). Note that bins with adjacent indices overlap. Now let the PP value of bin i be π, and define q = 1 − π as the probability of no enrichment in bin i.Let S be the set of all subsequences of {1,…, n} such that I ∈ S ⇔ I contains no consecutive integers ⇔ the bins with indices in I do not overlap. Then, for each I ∈ S, a lower bound for the probability of enrichment in at least one of the original n bins is .The ‘best’ (highest) lower bound for the probability of peak enrichment is therefore the maximum of this quantity,We can find Q(n) by dynamic programming since, by conditioning on whether i ∈ I, we have Q(i) = min(Q(i − 1), qQ(i − 2)).The advantage of using this method over taking the maximum PP value is that it can give an appropriate score to sustained regions of only moderately large PP values, which will be undervalued when taking the maximum.We tested BayesPeak on the NRSF/REST ChIP-seq dataset from (Johnson ), in which a small subset of regions have been experimentally validated, and we compared the findings against other common peak callers.
3 RESULTS
We present the peak-caller comparison results in the Supplementary Material. BayesPeak demonstrated a competitive sensitivity and specificity on the genome-wide scale and showed a substantial overlap with other peak-callers. The over-fitting correction greatly improved the enrichment for true binding sites in BayesPeak's data, as did subsequent filtering by PP value.In its raw output, BayesPeak returns PP values for each bin and, for each job, the posterior mean of each estimated parameter (excluding half of the draws as burn-in). As of BayesPeak version 1.1.3, MCMC samples of several key parameters are also present, permitting convergence tests such as the Geweke diagnostic in the boa (Smith, 2007) or coda (Plummer ) packages.Since the summarized output is in RangedData format, this allows direct analysis of the peaks in any downstream package compatible with IRanges (Pages ), including those in BioConductor. For example, ChIPpeakAnno (Zhu ) can annotate the output.We have observed some phenomena that occur with lower quality data. For example, over-fitting can occur as follows: the model assumes that for each job there are both enriched and unenriched states. As such, when there are no peaks in a job or when the peaks are extremely weak, these two states are used to explain the natural variance present in the unenriched background. We identify over-fit jobs from their low λ1 values (where λ1 is the expected number of counts in an enriched bin), and from their PP values being spread out over [0, 1] rather than tending to be 0 or 1. BayesPeak supports the identification and removal of jobs that have encountered this effect (Supplementary Table S2 and Fig. S1).
4 DISCUSSION
BayesPeak provides a Bayesian analysis, with advantages including allowance for overdispersion in read counts and a competitive genome-wide specificity and sensitivity. By anticipating peak structure, BayesPeak does not call peaks based on sheer numbers of reads without appropriate read formation.Careful selection of job regions may improve the analysis. For example, we can use prior knowledge to partition jobs in a manner that avoids analysing the centromeres and telomeres, which usually contain no reads. This will prevent unnecessary computation, and may also improve results in the surrounding regions.There is scope for adapting the BayesPeak approach to other forms of peak-calling. For example, some histone mark data consist of regions of enrichment containing many peaks and, in BrDU-seq data, peaks are much broader than those in transcription factor data.
Authors: Lihua J Zhu; Claude Gazin; Nathan D Lawson; Hervé Pagès; Simon M Lin; David S Lapointe; Michael R Green Journal: BMC Bioinformatics Date: 2010-05-11 Impact factor: 3.169
Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583
Authors: Gordon Robertson; Martin Hirst; Matthew Bainbridge; Misha Bilenky; Yongjun Zhao; Thomas Zeng; Ghia Euskirchen; Bridget Bernier; Richard Varhol; Allen Delaney; Nina Thiessen; Obi L Griffith; Ann He; Marco Marra; Michael Snyder; Steven Jones Journal: Nat Methods Date: 2007-06-11 Impact factor: 28.547
Authors: Paul A Northcott; Catherine Lee; Thomas Zichner; Adrian M Stütz; Serap Erkek; Daisuke Kawauchi; David J H Shih; Volker Hovestadt; Marc Zapatka; Dominik Sturm; David T W Jones; Marcel Kool; Marc Remke; Florence M G Cavalli; Scott Zuyderduyn; Gary D Bader; Scott VandenBerg; Lourdes Adriana Esparza; Marina Ryzhova; Wei Wang; Andrea Wittmann; Sebastian Stark; Laura Sieber; Huriye Seker-Cin; Linda Linke; Fabian Kratochwil; Natalie Jäger; Ivo Buchhalter; Charles D Imbusch; Gideon Zipprich; Benjamin Raeder; Sabine Schmidt; Nicolle Diessl; Stephan Wolf; Stefan Wiemann; Benedikt Brors; Chris Lawerenz; Jürgen Eils; Hans-Jörg Warnatz; Thomas Risch; Marie-Laure Yaspo; Ursula D Weber; Cynthia C Bartholomae; Christof von Kalle; Eszter Turányi; Peter Hauser; Emma Sanden; Anna Darabi; Peter Siesjö; Jaroslav Sterba; Karel Zitterbart; David Sumerauer; Peter van Sluis; Rogier Versteeg; Richard Volckmann; Jan Koster; Martin U Schuhmann; Martin Ebinger; H Leighton Grimes; Giles W Robinson; Amar Gajjar; Martin Mynarek; Katja von Hoff; Stefan Rutkowski; Torsten Pietsch; Wolfram Scheurlen; Jörg Felsberg; Guido Reifenberger; Andreas E Kulozik; Andreas von Deimling; Olaf Witt; Roland Eils; Richard J Gilbertson; Andrey Korshunov; Michael D Taylor; Peter Lichter; Jan O Korbel; Robert J Wechsler-Reya; Stefan M Pfister Journal: Nature Date: 2014-06-22 Impact factor: 49.962
Authors: Julie A Law; Jiamu Du; Christopher J Hale; Suhua Feng; Krzysztof Krajewski; Ana Marie S Palanca; Brian D Strahl; Dinshaw J Patel; Steven E Jacobsen Journal: Nature Date: 2013-05-01 Impact factor: 49.962
Authors: On Sun Lau; Kelli A Davies; Jessica Chang; Jessika Adrian; Matthew H Rowe; Catherine E Ballenger; Dominique C Bergmann Journal: Science Date: 2014-09-04 Impact factor: 47.728
Authors: Kelly P Stanton; Jiaqi Jin; Roy R Lederman; Sherman M Weissman; Yuval Kluger Journal: Nucleic Acids Res Date: 2017-12-01 Impact factor: 16.971
Authors: Katherine L Smollett; Kimberley M Smith; Christina Kahramanoglou; Kristine B Arnvig; Roger S Buxton; Elaine O Davis Journal: J Biol Chem Date: 2012-04-23 Impact factor: 5.157
Authors: Xuehua Zhong; Christopher J Hale; Julie A Law; Lianna M Johnson; Suhua Feng; Andy Tu; Steven E Jacobsen Journal: Nat Struct Mol Biol Date: 2012-08-05 Impact factor: 15.369