
PeaKDEck: a kernel density estimator-based peak calling program for DNaseI-seq data.

Michael T McCarthy, Christopher A O'Callaghan.

Abstract

Hypersensitivity to DNaseI digestion is a hallmark of open chromatin, and DNaseI-seq allows the genome-wide identification of regions of open chromatin. Interpreting these data is challenging, largely because of inherent variation in signal-to-noise ratio between datasets. We have developed PeaKDEck, a peak calling program that distinguishes signal from noise by randomly sampling read densities and using kernel density estimation to generate a dataset-specific probability distribution of random background signal. PeaKDEck uses this probability distribution to select an appropriate read density threshold for peak calling in each dataset. We benchmark PeaKDEck using published ENCODE DNaseI-seq data and other peak calling programs, and demonstrate superior performance in low signal-to-noise ratio datasets.

Year: 2014 PMID: 24407222 PMCID: PMC3998130 DOI: 10.1093/bioinformatics/btt774

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNaseI hypersensitivity analysis can be used to map sites of open chromatin in genomic DNA (Wu, 1980). Hypersensitivity of DNA to digestion by DNaseI arises when nucleosomal histone proteins are displaced from chromatin, leaving a region of ‘naked’ nucleosome-free DNA that is accessible to the DNaseI enzyme (Owen-Hughes and Workman, 1996). Histone displacement and consequent DNaseI hypersensitivity characteristically occur at promoter and enhancer sites (Song et al.), allowing the sequence-specific binding of proteins, such as transcription factors, to the DNA. Recently, advances in high-throughput sequencing methods have been applied to DNaseI hypersensitivity testing (DNaseI-seq; Supplementary Information, Section S1; Boyle et al.; Hesselberth et al.). With DNaseI-seq, regulatory DNA fragments at accessible open chromatin sites are released by ‘two-hit’ digestion. These fragments are sequenced using high-throughput technology and mapped back to the reference genome. The comparison of DNaseI hypersensitivity patterns in different datasets can play an important role in the study of gene regulation (Sheffield et al.), for example, in response to a physiological stimulus. However, a major challenge in analyzing these data is the variation in signal-to-noise ratio (SNR) between datasets (Supplementary Table S1). While many potential sources of noise exist, a key variable affecting the SNR is the enzymatic activity of DNaseI, which is difficult to control between experiments. Variation in the level of DNaseI activity leads to different amounts of digestion (Supplementary Fig. S3), altering the composition of the population of short DNA fragments that are sequenced. There is no universal surrogate measurement that can be used to accurately quantify digestion at the library preparation stage, and no ideal control sample (discussed in Supplementary Information, Section S2).
For these reasons, distinguishing signal from noise in a manner that allows comparison between datasets is more challenging with DNaseI-seq than with other high-throughput sequencing approaches, such as ChIP-seq. Several peak-calling programs have been developed for use with high-throughput sequencing data. Most focus on ChIP-seq, where a clear input control is available, but some have also been used for DNaseI-seq data, including F-seq (Boyle et al.), MACS (Zhang et al.) and HOMER (Heinz et al.). While analyzing our own DNaseI-seq data, we found variable performance between peak callers, particularly at low SNRs, and the identification of a suitable peak threshold was challenging. This confounded the comparison between datasets with different SNRs (see Supplementary Information, Section S4 for SNR estimation). We have developed a peak-calling algorithm (PeaKDEck) that limits the effect of SNR on threshold setting, which is of particular value in datasets with lower SNRs. We have used Hotspot-identified (John et al.) DNaseI-seq sites from 125 cell types published by ENCODE (Thurman et al.) to compare the quality of peak calling by PeaKDEck with that by other peak-calling programs. PeaKDEck also includes additional tools for DNaseI-seq data analysis (see Supplementary Information, Section S9 for a description of these tools).

2 PEAK CALLING

The method of peak calling used by PeaKDEck is illustrated in Supplementary Figure S12. First, 50 000 sites are selected randomly from the genome, and overlapping sites are discarded (Supplementary Fig. S12A). Next, the signal strength is measured at the non-overlapping sites using sampling bins (Fig. 1A and Supplementary Fig. S12B). This is achieved by measuring the background read density in a large bin surrounding the site (‘background read density’; default—3000 bp), and then measuring the read density in a smaller focused bin at the same site (‘central read density’; default—300 bp). The corrected read density is calculated by subtracting the expected read density (given the background read density) from the central read density.
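The corrected read density described above can be illustrated with a minimal Python sketch. This is not PeaKDEck's own code: the function names, the half-open bin boundaries, and the choice to centre both bins on the site and to count central reads as part of the background bin are all assumptions made for illustration.

```python
from bisect import bisect_left

def count_reads(read_starts, lo, hi):
    """Count read start sites in the half-open interval [lo, hi).

    read_starts must be a sorted list of genomic positions.
    """
    return bisect_left(read_starts, hi) - bisect_left(read_starts, lo)

def corrected_read_density(read_starts, site, central_bin=300, background_bin=3000):
    """Central read count minus the count expected from the background bin.

    Assumption: both bins are centred on `site`, and the background bin
    includes the reads that fall in the central bin.
    """
    central = count_reads(read_starts, site - central_bin // 2, site + central_bin // 2)
    background = count_reads(read_starts, site - background_bin // 2, site + background_bin // 2)
    expected = background * central_bin / background_bin
    return central - expected

# Hypothetical data: 5 read starts in the 300 bp central bin, 10 in the 3000 bp bin
reads = sorted([3600, 4000, 4500, 4900, 4950, 5000, 5050, 5100, 5500, 6000])
score = corrected_read_density(reads, site=5000)
print(score)  # 5 - 10 * 300 / 3000 = 4.0
```

Keeping the read start sites in a sorted list lets each bin count be computed with two binary searches, which matters when the calculation is repeated at tens of thousands of sites.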

Fig. 1.

(A) PeaKDEck uses a sampling bin to measure signal at any given locus. PeaKDEck calculates the corrected read density by first counting the number of read start sites (green) within a central bin (e.g. five read start sites in a bin of size 300 bp). Next, the read density in a larger background bin is measured (e.g. 10 reads in a bin of size 3000 bp). Based on this background read density, the expected read density in a bin of central bin size is calculated (e.g. 10 reads in 3000 bp, giving an expected read density of 1 read in 300 bp) and subtracted from the central bin read density to give the corrected read density (4 in this example). (B) We calculated the percentage of unique sites identified by four different peak callers in each of 10 sample datasets, and color-coded each dataset based on the SNR from blue (low SNR) to red (high SNR). For datasets with low SNR, PeaKDEck had the lowest percentage of unique peaks out of the four peak callers.

Once this calculation has been repeated for all the randomly selected sites in the dataset, a probability distribution is generated to describe the distribution of these corrected read density scores (Supplementary Fig. S12C). Because the distribution of these scores is typically non-Gaussian, PeaKDEck uses kernel density estimation (Supplementary Information, Section S5) to calculate a probability distribution for the randomly selected corrected read density scores:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where $n$ is the number of sites sampled, $h$ is the bandwidth ($h = 1$), $x_i$ is the value of $x$ for the $i$th site and $K$ is a Gaussian kernel.

To identify a threshold for peak calling, PeaKDEck calculates the probability that a given corrected read density belongs to the background probability distribution for increasing values of corrected read density. The threshold is the corrected read density at which the calculated probability drops below a pre-determined level (default: P < 0.001).
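The thresholding step can be sketched in Python as follows. The text does not specify how the probability of belonging to the background distribution is computed, so this sketch assumes it is the upper-tail probability P(X >= x) under the Gaussian kernel density estimate with h = 1. The function names, the 0.1 step size, and the simulated background scores are hypothetical.

```python
import math
import random

def kde_pdf(x, samples, h=1.0):
    """Gaussian kernel density estimate at x: (1 / (n h)) * sum K((x - x_i) / h)."""
    n = len(samples)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return total / (n * h * math.sqrt(2.0 * math.pi))

def kde_upper_tail(x, samples, h=1.0):
    """P(X >= x) under the KDE; each Gaussian kernel contributes its own upper tail."""
    scale = h * math.sqrt(2.0)
    return sum(0.5 * math.erfc((x - xi) / scale) for xi in samples) / len(samples)

def find_threshold(samples, p_cutoff=0.001, h=1.0, step=0.1):
    """Smallest score on a grid (starting at the minimum sample) whose
    upper-tail probability is below p_cutoff."""
    start = min(samples)
    k = 0
    while kde_upper_tail(start + k * step, samples, h) >= p_cutoff:
        k += 1
    return start + k * step

# Simulated background scores stand in for the randomly sampled sites
random.seed(0)
background_scores = [random.gauss(0.0, 2.0) for _ in range(2000)]
threshold = find_threshold(background_scores)
print(f"corrected read density threshold: {threshold:.1f}")
```

Because the tail probability is built from the sampled scores themselves, the resulting threshold moves with the noise level of each dataset rather than being fixed in advance.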
The entire dataset is then scanned in overlapping sampling bins and the corrected read density determined across the genome (Supplementary Fig. S12D). Peaks are called where the corrected read density exceeds the threshold. Peaks can be scored by their maximum corrected read density or probability score.
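A minimal sketch of the genome scan, under stated assumptions: bins are stepped by 100 bp (the actual overlap is not given in the text), consecutive above-threshold bins are merged into one peak, and each peak is scored by its maximum corrected read density. The function names and the toy chromosome are hypothetical.

```python
from bisect import bisect_left

def count_reads(read_starts, lo, hi):
    """Count sorted read start sites in the half-open interval [lo, hi)."""
    return bisect_left(read_starts, hi) - bisect_left(read_starts, lo)

def corrected_density(read_starts, center, central_bin, background_bin):
    """Central bin count minus the count expected from the background bin."""
    central = count_reads(read_starts, center - central_bin // 2, center + central_bin // 2)
    background = count_reads(read_starts, center - background_bin // 2, center + background_bin // 2)
    return central - background * central_bin / background_bin

def call_peaks(read_starts, chrom_length, threshold,
               central_bin=300, background_bin=3000, step=100):
    """Scan overlapping bins; return (start, end, max_score) for each merged peak."""
    peaks = []
    current = None  # [start, end, max_score] of the peak being extended
    half = central_bin // 2
    for center in range(half, chrom_length - half + 1, step):
        score = corrected_density(read_starts, center, central_bin, background_bin)
        if score <= threshold:
            continue
        lo, hi = center - half, center + half
        if current is not None and lo <= current[1]:
            # Overlaps the previous above-threshold bin: extend the peak
            current[1] = hi
            current[2] = max(current[2], score)
        else:
            if current is not None:
                peaks.append(tuple(current))
            current = [lo, hi, score]
    if current is not None:
        peaks.append(tuple(current))
    return peaks

# Toy chromosome: one read every 500 bp plus a dense cluster near position 5000
reads = sorted(list(range(0, 20000, 500)) + list(range(5000, 5090, 3)))
peaks = call_peaks(reads, chrom_length=20000, threshold=4.0)
print(peaks)
```

In this toy example the only region that exceeds the threshold is the dense cluster, which is reported as a single merged peak.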

3 PERFORMANCE

We assessed the performance of PeaKDEck by calling peaks in 10 DNaseI-seq datasets from the NCBI Short Reads Archive (Supplementary Information, Section S3). To determine whether the sites identified by PeaKDEck as open chromatin were known open chromatin sites, we amalgamated 125 ENCODE DNaseI-seq datasets for different cell types, tagging each genomic locus with the number of cell types with open chromatin at that site (Supplementary Fig. S5). For each of the 125 datasets, we calculated the percentage of open chromatin sites unique to that dataset, the percentage of sites shared with one other cell type, and so on up to the percentage of sites shared across all 125 cell types. The mean percentage of unique peaks per dataset was 3.61 ± 4.13% (± standard deviation). For the peaks called by PeaKDEck in our 10 sample DNaseI-seq datasets, the mean percentage of unique peaks per dataset was 4.6 ± 1.6% (± standard deviation; see Supplementary Information, Sections S6 and S7 for details). This demonstrates that the overlap between open chromatin sites identified by PeaKDEck and known open chromatin sites is within the normal range of variation observed in the ENCODE data.

Because PeaKDEck adjusts signal measurement to account for local variation in read densities and extensively samples background signal in individual datasets, PeaKDEck performs well at setting thresholds in low SNR datasets. To demonstrate this, we called peaks in the 10 sample NCBI DNaseI-seq datasets with PeaKDEck, MACS, F-seq and HOMER (Supplementary Fig. S13) and quantified the number of unique sites (not occurring in the ENCODE-125 dataset) as a percentage of the total identified sites in each dataset, with each peak caller (Fig. 1B).
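The unique-site percentage can be sketched as an interval-overlap count. This sketch assumes a called peak is "unique" when it overlaps no interval in the reference set by even one base pair; the overlap criterion actually used is described in the Supplementary Information and may differ. The function name and the example intervals are hypothetical.

```python
from bisect import bisect_left

def percent_unique(peaks, reference):
    """Percentage of peaks that overlap no reference interval.

    peaks and reference are lists of (start, end) half-open intervals
    on a single chromosome.
    """
    if not peaks:
        return 0.0
    ref = sorted(reference)
    starts = [s for s, _ in ref]
    # Running maximum of end coordinates, so nested reference intervals are handled
    max_end = []
    running = float("-inf")
    for _, end in ref:
        running = max(running, end)
        max_end.append(running)
    unique = 0
    for start, end in peaks:
        k = bisect_left(starts, end)  # reference intervals starting before the peak ends
        if k == 0 or max_end[k - 1] <= start:
            unique += 1
    return 100.0 * unique / len(peaks)

# Hypothetical intervals: two of the four called peaks overlap the reference set
called = [(100, 200), (500, 600), (900, 1000), (1500, 1600)]
known = [(150, 250), (400, 450), (950, 1200)]
print(percent_unique(called, known))  # 50.0
```

For whole-genome data the same function would be applied per chromosome and the counts pooled before taking the percentage.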
In the dataset with the lowest SNR, 6.95% of the total peaks identified by PeaKDEck were unique, compared with 7.38, 9.64 and 31.4% of peaks identified by MACS, HOMER and F-seq, respectively, suggesting that PeaKDEck is more likely than other available peak callers to identify authentic open chromatin sites even at low SNRs.

Although PeaKDEck is designed for use in DNaseI-seq data analysis (using the read start site as the point of interest), it can be used for similar methods such as chromatin immunoprecipitation sequencing and FAIRE-seq, by applying a user-defined offset to calculated genomic positions. PeaKDEck is especially useful compared with other peak callers where SNR is low (see Supplementary Information, Section S8).

Funding: Medical Research Council grants (MRC G0901998, MRC G116/165) and National Institute for Health Research Oxford Comprehensive Biomedical Research Centre (BRC) Program.

Conflict of Interest: none declared.

REFERENCES

1. High-resolution mapping and characterization of open chromatin across the genome.

Authors: Alan P Boyle; Sean Davis; Hennady P Shulha; Paul Meltzer; Elliott H Margulies; Zhiping Weng; Terrence S Furey; Gregory E Crawford
Journal: Cell Date: 2008-01-25 Impact factor: 41.582

2. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity.

Authors: Lingyun Song; Zhancheng Zhang; Linda L Grasfeder; Alan P Boyle; Paul G Giresi; Bum-Kyu Lee; Nathan C Sheffield; Stefan Gräf; Mikael Huss; Damian Keefe; Zheng Liu; Darin London; Ryan M McDaniell; Yoichiro Shibata; Kimberly A Showers; Jeremy M Simon; Teresa Vales; Tianyuan Wang; Deborah Winter; Zhuzhu Zhang; Neil D Clarke; Ewan Birney; Vishwanath R Iyer; Gregory E Crawford; Jason D Lieb; Terrence S Furey
Journal: Genome Res Date: 2011-07-12 Impact factor: 9.043

3. Remodeling the chromatin structure of a nucleosome array by transcription factor-targeted trans-displacement of histones.

Authors: T Owen-Hughes; J L Workman
Journal: EMBO J Date: 1996-09-02 Impact factor: 11.598

4. The 5' ends of Drosophila heat shock genes in chromatin are hypersensitive to DNase I.

Authors: C Wu
Journal: Nature Date: 1980-08-28 Impact factor: 49.962

5. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.

Authors: Sven Heinz; Christopher Benner; Nathanael Spann; Eric Bertolino; Yin C Lin; Peter Laslo; Jason X Cheng; Cornelis Murre; Harinder Singh; Christopher K Glass
Journal: Mol Cell Date: 2010-05-28 Impact factor: 17.970

6. F-Seq: a feature density estimator for high-throughput sequence tags.

Authors: Alan P Boyle; Justin Guinney; Gregory E Crawford; Terrence S Furey
Journal: Bioinformatics Date: 2008-09-10 Impact factor: 6.937

7. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns.

Authors: Sam John; Peter J Sabo; Robert E Thurman; Myong-Hee Sung; Simon C Biddie; Thomas A Johnson; Gordon L Hager; John A Stamatoyannopoulos
Journal: Nat Genet Date: 2011-01-23 Impact factor: 38.330

8. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions.

Authors: Nathan C Sheffield; Robert E Thurman; Lingyun Song; Alexias Safi; John A Stamatoyannopoulos; Boris Lenhard; Gregory E Crawford; Terrence S Furey
Journal: Genome Res Date: 2013-03-12 Impact factor: 9.043

9. Model-based analysis of ChIP-Seq (MACS).

Authors: Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal: Genome Biol Date: 2008-09-17 Impact factor: 13.583

10. The accessible chromatin landscape of the human genome.

Authors: Robert E Thurman; Eric Rynes; Richard Humbert; Jeff Vierstra; Matthew T Maurano; Eric Haugen; Nathan C Sheffield; Andrew B Stergachis; Hao Wang; Benjamin Vernot; Kavita Garg; Sam John; Richard Sandstrom; Daniel Bates; Lisa Boatman; Theresa K Canfield; Morgan Diegel; Douglas Dunn; Abigail K Ebersol; Tristan Frum; Erika Giste; Audra K Johnson; Ericka M Johnson; Tanya Kutyavin; Bryan Lajoie; Bum-Kyu Lee; Kristen Lee; Darin London; Dimitra Lotakis; Shane Neph; Fidencio Neri; Eric D Nguyen; Hongzhu Qu; Alex P Reynolds; Vaughn Roach; Alexias Safi; Minerva E Sanchez; Amartya Sanyal; Anthony Shafer; Jeremy M Simon; Lingyun Song; Shinny Vong; Molly Weaver; Yongqi Yan; Zhancheng Zhang; Zhuzhu Zhang; Boris Lenhard; Muneesh Tewari; Michael O Dorschner; R Scott Hansen; Patrick A Navas; George Stamatoyannopoulos; Vishwanath R Iyer; Jason D Lieb; Shamil R Sunyaev; Joshua M Akey; Peter J Sabo; Rajinder Kaul; Terrence S Furey; Job Dekker; Gregory E Crawford; John A Stamatoyannopoulos
Journal: Nature Date: 2012-09-06 Impact factor: 49.962
