
False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions.

Joseph K Pickrell, Daniel J Gaffney, Yoav Gilad, Jonathan K Pritchard.

Abstract

MOTIVATION: Sequencing-based assays such as ChIP-seq, DNase-seq and MNase-seq have become important tools for genome annotation. In these assays, short sequence reads enriched for loci of interest are mapped to a reference genome to determine their origin. Here, we consider whether false positive peak calls can be caused by a particular type of error in the reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy.
RESULTS: Using sequencing data from the 1000 Genomes Project, we systematically scanned the human genome for regions of high sequencing depth. These regions are highly enriched for erroneously inferred transcription factor binding sites, positions of nucleosomes and regions of open chromatin. We suggest a simple masking procedure to remove these regions and reduce false positive calls.
AVAILABILITY: Files for masking out these regions are available at eqtl.uchicago.edu

Year:  2011        PMID: 21690102      PMCID: PMC3137225          DOI: 10.1093/bioinformatics/btr354

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


The combination of classical methods from molecular biology with high-throughput sequencing, as used in ChIP-seq (Johnson et al.), DNaseI-seq (Boyle et al.) and MNase-seq (Schones et al.), has dramatically increased the scale at which genomic sequences can be assayed for various properties of function. In each of these experiments, sequences of interest (for example, sites bound by a particular transcription factor) are enriched and then sequenced. The resulting sequences are then mapped back to a reference genome to determine their position of origin. It is well appreciated that characteristics of the reference genome influence this mapping step; for example, some sequences in the genome are present in multiple copies, leading to ambiguity when determining the origin of a sequencing read (Koehler et al.). A related issue that has received less attention in this context is the existence of sequences that appear to be present in a single copy in the available reference genome, but which, in reality, are present in multiple copies in all or some individuals. Such sequences could potentially cause artifactual peaks in sequencing-based assays (Hesselberth et al.; Zhang et al.), and can be identified as regions of high sequencing depth in genomic DNA (Bailey et al.; Vega et al.).

To screen for potentially problematic regions, we used data from the 1000 Genomes Project (1000 Genomes Project Consortium, 2010). Specifically, we downloaded the Illumina sequencing reads derived from low-coverage sequencing of 57 Nigerian individuals, mapped the reads to the human genome and then calculated the coverage at each base in the genome using only uniquely mapped reads. For full details on the data used, see the Supplementary Material. An example of a problematic genomic region is presented in Figure 1A. In the genome sequencing data, there are multiple clear peaks of reads, suggesting the presence of collapsed repeats in the reference genome [the reference genome used throughout is hg18; ~10% of these regions are no longer problematic in hg19 (Supplementary Fig. S1)]. We find that 0.1% of the genome has read depth at least twice the median, and 0.01% of the genome has read depth at least 15 times the median (Fig. 1B). We identified contiguous regions where the read depth exceeded thresholds corresponding to the top 0.1, 0.01, and 0.001% of the per-base read depths, merging regions which fall within 50 bases of each other. At the 0.1% threshold, there are 34 359 such high depth regions (HDRs), with a mean size of 188 bases.
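The region-calling step described above (threshold the per-base depth at a top percentile, then merge nearby regions) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `depth_threshold` and `call_hdrs`, the use of `>=` at the cutoff, the definition of the gap as the number of bases between two regions, and the restriction to a single contig held in memory are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def depth_threshold(depths: Sequence[int], top_fraction: float) -> int:
    """Depth cutoff such that roughly `top_fraction` of bases lie at or above it.

    Hypothetical helper; ties at the cutoff can push the selected share of
    bases slightly above `top_fraction`.
    """
    ranked = sorted(depths, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[k - 1]


def call_hdrs(
    depths: Sequence[int], threshold: int, merge_gap: int = 50
) -> List[Tuple[int, int]]:
    """Call high depth regions on one contig as half-open [start, end) intervals.

    A base is included when its depth is >= threshold (an assumption; the text
    only says the depth "exceeded" the threshold). Runs separated by at most
    `merge_gap` bases are merged, following the 50-base rule in the text.
    """
    # Pass 1: collect maximal runs of bases at or above the threshold.
    runs: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(depths):
        if depth >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(depths)))

    # Pass 2: merge runs whose gap to the previous run is small enough.
    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= merge_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


if __name__ == "__main__":
    # Toy coverage track: background depth 1 with three short high-depth runs.
    toy = [1] * 200
    for lo, hi in [(10, 15), (40, 45), (150, 155)]:
        for i in range(lo, hi):
            toy[i] = 30
    cutoff = depth_threshold(toy, 0.05)
    print(cutoff, call_hdrs(toy, cutoff))
```

In a genome-wide setting the same logic would be applied per chromosome, and the cutoff would be taken from the genome-wide distribution of per-base depth rather than from a single contig.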
Fig. 1. Sequences absent from the reference genome cause spurious peaks of sequencing reads. (A) An example of such a region. In each panel, we plot the density of uniquely mapped sequencing reads from three sources: the Illumina data from low coverage sequencing of Yoruba individuals from the 1000 Genomes Project (summed across all individuals), a study of DNaseI hypersensitivity (Pique-Regi et al.) and a study of MNase sensitivity (Schones et al.). In the first of these, copy number is expected to be approximately constant. In red are regions that we call as high depth regions at a threshold of 0.1%. (B) A long tail of very high read depth for sequences present once in the human reference. Using the coverage from the 1000 Genomes Project data, we plot the histogram of the coverage at each base (using 500 Mb of sequence). Marked are the positions corresponding to the top 0.1 and 0.01% of the distribution. (C) Collapsed repeats cause false peaks of sequencing reads in functional assays. For each experiment, we plot the fraction of the genome covered by the mark, as well as the fraction of the HDRs covered by the mark. For the ChIP-seq on transcription factors, we used the binding sites called by the ENCODE Project (ENCODE Project Consortium, 2007) using PeakSeq (Rozowsky et al.). For the ChIP-seq on histone modifications (Wang et al.), we split the genome into windows of 200 bases and called the most extreme 0.1% of windows as bound. Shown are selected experiments; for all experiments see Supplementary Figures S2 and S3.

We then asked whether HDRs were indeed causing false positive peaks of coverage in high-throughput molecular biology assays. To do this, we used data from DNase-seq (Pique-Regi et al.), MNase-seq (Schones et al.) and ChIP-seq from various transcription factors (ENCODE Project Consortium, 2007) and histone marks (Wang et al.) (Supplementary Material).
In the example in Figure 1A, the regions found to be copy number variable also show up as sensitive to DNaseI and show extremely high read depth in the MNase assay. Overall, of the top 0.1% of 200 base pair windows in the genome with the greatest read depth in the DNase-seq experiment, 3% overlap HDRs, and of the top 0.1% of 200 base pair windows in the MNase-seq experiment, 26% overlap HDRs. Additionally, across many ChIP-seq experiments on both transcription factors and histone modifications, we see enrichment of signal in HDRs (Fig. 1C, Supplementary Figs S2 and S3). The magnitude of enrichment of ChIP-seq peaks in HDRs depends on the choice of peak-calling algorithm; peaks called using PeakSeq (Rozowsky et al.) are dramatically enriched in HDRs (Fig. 1C), while peaks called using MACS (Zhang et al.) are not (Supplementary Fig. S4). This is likely attributable to the different ways in which the two algorithms use the control lanes (Supplementary Material).

We next sought to confirm that HDRs are due to collapsed repeats in the genome (as opposed to, for example, biases due to GC content or other sequence properties during library construction or Illumina sequencing). First, we examined the impact of the GC content of a region on sequencing coverage. As expected, there is a relationship between GC content and coverage, but this effect is too small to account for the dramatic peaks of coverage we see in the data (Supplementary Fig. S5). Next, we examined copy number data generated on separate individuals using the orthogonal technology of array CGH (Conrad et al.). The intensities of array probes falling in HDRs are dramatically higher than those of control probes, often approaching the limit of the dynamic range of the array (Supplementary Fig. S6). Finally, we examined the overlap of HDRs with annotated repeats. Of the most extreme outliers in the coverage distribution (at the 0.01% point in the distribution of coverage), 92% of the regions overlap annotated repeats. Of these repeats, 81% are satellite DNA and the remainder largely consist of L1 retrotransposons and Alu elements. We conclude that the majority of HDRs are indeed collapsed repeats, with the caveat that some fraction may be copy number variable across individuals (Vega et al.).

We suggest a simple masking procedure to remove false positive calls due to collapsed repeats. This can be done in two ways. First, we have generated BED files with the coordinates of regions we suggest masking out (available at eqtl.uchicago.edu). Files are available at five different cutoffs. Alternatively, we have made available a FASTA file with the sequences present in these regions. If this FASTA file is included in the reference genome when mapping, sequencing reads from these regions will no longer map uniquely to the genome and can be filtered out.

In summary, we have identified a set of genomic regions in humans which are likely to generate spurious peaks in any assay involving high-throughput sequencing, and have provided a resource for screening out these regions. Similar approaches will be feasible in other organisms as resequencing data from multiple individuals becomes available. Screening out these regions will be particularly useful in studies, like DNase-seq and MNase-seq, where there is no natural control experiment [apart from copy number quantification via whole genome sequencing (Kharchenko et al.), which remains impractical for species with large genomes], and will aid interpretation in ChIP-seq experiments where ‘control’ lanes contain biologically relevant signal (Rozowsky et al.; Vega et al.).
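As an illustration of the BED-based masking route, the following Python sketch drops peak calls that overlap any masked region and reports the fraction of features that overlap the mask, in the spirit of the overlap percentages quoted earlier. It is a sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, the mask is assumed to be plain three-column BED (chromosome, 0-based start, exclusive end), and the peaks are assumed to be on the same assembly as the mask (hg18 in this paper).

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Read the first three BED columns, skipping blank and header lines."""
    out: List[Interval] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        out.append((chrom, int(start), int(end)))
    return out


def build_index(mask: Iterable[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Group mask intervals by chromosome, sorted and merged so they are disjoint."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in mask:
        by_chrom[chrom].append((start, end))
    index: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, intervals in by_chrom.items():
        merged: List[Tuple[int, int]] = []
        for s, e in sorted(intervals):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        index[chrom] = merged
    return index


def overlaps_mask(index, chrom: str, start: int, end: int) -> bool:
    """True if [start, end) shares at least one base with a masked interval."""
    intervals = index.get(chrom, [])
    # Last masked interval that starts before `end`; since intervals are
    # disjoint and sorted, it is the only one that can reach past `start`.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start


def mask_peaks(peaks: Iterable[Interval], mask: Iterable[Interval]) -> List[Interval]:
    """Keep only the peaks that do not overlap any masked region."""
    index = build_index(mask)
    return [p for p in peaks if not overlaps_mask(index, *p)]


def fraction_overlapping(features: Iterable[Interval], mask: Iterable[Interval]) -> float:
    """Fraction of features (e.g. top-ranked 200-base windows) that overlap the mask."""
    index = build_index(mask)
    features = list(features)
    if not features:
        return 0.0
    return sum(overlaps_mask(index, *f) for f in features) / len(features)


if __name__ == "__main__":
    mask = parse_bed(["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t50"])
    peaks = [("chr1", 50, 100), ("chr1", 90, 101), ("chr1", 300, 400), ("chr2", 10, 20)]
    print(mask_peaks(peaks, mask))
    print(fraction_overlapping(peaks, mask))
```

Filtering after peak calling is the simpler of the two routes described above; the FASTA route instead changes which reads map uniquely, so it acts before any peak caller sees the data.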
References

1.  Recent segmental duplications in the human genome.

Authors:  Jeffrey A Bailey; Zhiping Gu; Royden A Clark; Knut Reinert; Rhea V Samonte; Stuart Schwartz; Mark D Adams; Eugene W Myers; Peter W Li; Evan E Eichler
Journal:  Science       Date:  2002-08-09       Impact factor: 47.728

2.  High-resolution mapping and characterization of open chromatin across the genome.

Authors:  Alan P Boyle; Sean Davis; Hennady P Shulha; Paul Meltzer; Elliott H Margulies; Zhiping Weng; Terrence S Furey; Gregory E Crawford
Journal:  Cell       Date:  2008-01-25       Impact factor: 41.582

3.  Combinatorial patterns of histone acetylations and methylations in the human genome.

Authors:  Zhibin Wang; Chongzhi Zang; Jeffrey A Rosenfeld; Dustin E Schones; Artem Barski; Suresh Cuddapah; Kairong Cui; Tae-Young Roh; Weiqun Peng; Michael Q Zhang; Keji Zhao
Journal:  Nat Genet       Date:  2008-06-15       Impact factor: 38.330

4.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors:  ENCODE Project Consortium; Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; et al.
Journal:  Nature       Date:  2007-06-14       Impact factor: 49.962

5.  PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls.

Authors:  Joel Rozowsky; Ghia Euskirchen; Raymond K Auerbach; Zhengdong D Zhang; Theodore Gibson; Robert Bjornson; Nicholas Carriero; Michael Snyder; Mark B Gerstein
Journal:  Nat Biotechnol       Date:  2009-01-04       Impact factor: 54.908

6.  Dynamic regulation of nucleosome positioning in the human genome.

Authors:  Dustin E Schones; Kairong Cui; Suresh Cuddapah; Tae-Young Roh; Artem Barski; Zhibin Wang; Gang Wei; Keji Zhao
Journal:  Cell       Date:  2008-03-07       Impact factor: 41.582

7.  Genome-wide mapping of in vivo protein-DNA interactions.

Authors:  David S Johnson; Ali Mortazavi; Richard M Myers; Barbara Wold
Journal:  Science       Date:  2007-05-31       Impact factor: 47.728

8.  Inherent signals in sequencing-based Chromatin-ImmunoPrecipitation control libraries.

Authors:  Vinsensius B Vega; Edwin Cheung; Nallasivam Palanisamy; Wing-Kin Sung
Journal:  PLoS One       Date:  2009-04-15       Impact factor: 3.240

9.  Global mapping of protein-DNA interactions in vivo by digital genomic footprinting.

Authors:  Jay R Hesselberth; Xiaoyu Chen; Zhihong Zhang; Peter J Sabo; Richard Sandstrom; Alex P Reynolds; Robert E Thurman; Shane Neph; Michael S Kuehn; William S Noble; Stanley Fields; John A Stamatoyannopoulos
Journal:  Nat Methods       Date:  2009-03-22       Impact factor: 28.547

10.  Model-based analysis of ChIP-Seq (MACS).

Authors:  Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal:  Genome Biol       Date:  2008-09-17       Impact factor: 13.583
