Andreas Heger1, Caleb Webber, Martin Goodson, Chris P Ponting, Gerton Lunter. 1. MRC CGAT Programme and Functional Genomics Unit, MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK. andreas.heger@dpag.ox.ac.uk
Abstract
MOTIVATION: A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, when interpreting ChIP-Seq or RNA-Seq data in functional terms. Because genome organization is complex, answering this question is non-trivial. SUMMARY: We present Genomic Association Test (GAT), a tool for estimating the significance of overlap between multiple sets of genomic intervals. GAT implements a null model that the two sets of intervals are placed independently of one another, but allows each set's density to depend on external variables, for example, isochore structure or chromosome identity. GAT estimates statistical significance based on simulation and controls for multiple tests using the false discovery rate. AVAILABILITY: GAT's source code, documentation and tutorials are available at http://code.google.com/p/genomic-association-tester.
A common question in genomic analysis is whether two sets of genomic intervals, for example, ChIP-seq peaks and gene annotation classes, overlap significantly more than expected by chance alone. Interval overlap is easy to compute, but the significance can be computed analytically only for trivial situations. Hence, significance is usually estimated by simulation under some null model. This model must account for genome organization; a model that assumes independent and uniform placement of both interval sets is almost always inappropriate when testing for association with gene annotations because gene density strongly correlates with G + C content, and datasets of interest often also show G + C biases.
Here, we introduce Genomic Association Test (GAT), a tool for computing the significance of overlap between multiple sets of genomic intervals. GAT permits the restriction of the analysis to parts of a genome relevant to the experiment and accounts for chromosomal and isochore biases. Additional genomic features can be controlled for by providing additional segmentation files.
GAT’s approach was developed originally to test for the association of non-coding transcripts with other genomic elements (Ponjavic ), but has since been applied to a variety of problems, including:
Conservation of non-coding transcription between human and mouse (Church );
Enrichment of histone marks and evolutionarily conserved genomic regions within non-coding transcripts (Marques and Ponting, 2009);
Functional prediction of non-coding transcripts via their neighboring genes (Marques and Ponting, 2009); and
Enrichment of ChIP-Seq binding events within signatures of open chromatin or disease-associated intervals (Ramagopalan ).
GAT's re-implementation delivers to the scientific community the extended functionality of the Ponjavic methods.
2 USAGE
GAT is controlled from the command line. It requires at least three bed-formatted files that delimit genomic intervals (tuples of chromosome, start and end). The principal output of GAT is a table listing significant overlaps.
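As an illustration of the input format only, the following minimal Python sketch reads bed-formatted lines into per-chromosome lists of (start, end) tuples. The function name `parse_bed` and the example data are hypothetical and are not part of GAT; the sketch assumes whitespace-separated fields and reads only the first three columns.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chromosome: [(start, end), ...]}, sorted by start."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header lines that BED files may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"empty or inverted interval: {line!r}")
        intervals[chrom].append((start, end))
    return {chrom: sorted(ivs) for chrom, ivs in intervals.items()}


# Hypothetical example input: one track line followed by three intervals.
example = [
    "track name=chipseq",
    "chr1\t100\t200",
    "chr1\t50\t80",
    "chrX\t10\t40",
]

segments = parse_bed(example)
print(segments)
```

BED coordinates are zero-based and half-open, so the length of an interval is simply end minus start.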
2.1 Input
Example: does a set of transcription factor binding site intervals from a ChIP-Seq experiment overlap more than expected by chance with a set of DNaseI-hypersensitive sites? To perform this analysis, GAT requires three files:
A bed-formatted file with the intervals from the ChIP-Seq experiment (Segments). Several experiments can be supplied as multiple files or as a single file with multiple tracks.
A bed-formatted file with DNaseI-hypersensitive sites (Annotations). These could be obtained directly from the UCSC Genome Browser (Rosenbloom et al., 2012). Several annotations from, for example, multiple cell lines can be supplied as multiple files or as a single file with multiple tracks.
A bed-formatted file with the workspace (Workspace). The workspace defines the sequence that is accessible for the simulation. The simplest workspace contains the full genome assembly. In this example, the analysis should be restricted to only repeat-free regions, as only these are reliably mappable by short read data and thus could contain ChIP-Seq intervals. Again, appropriate bed-formatted files are available from the UCSC Genome Browser.
By default, the randomization procedure accounts for differences among chromosomes; for example, the X chromosome contains many sequence features that are atypical of autosomes. In addition to chromosome identity, local genomic G + C content is another common confounding factor. For example, G + C content might cause experimental biases in sequencing and hybridization protocols, while it is also a correlate of gene density (Lander ). To correct for G + C content, an optional bed-formatted file with the isochore structure of the genome can be supplied. GAT will then normalize by isochore and by chromosome. Here, isochores are discretized; for example, the genome is partitioned into windows falling into eight bins of different regional G + C content.
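The discretization of G + C content into bins can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build any particular isochore file: the function name `gc_bins`, the fixed window size and the equal-width bin boundaries are all hypothetical choices.

```python
def gc_bins(sequence, window=10, n_bins=8):
    """Label consecutive windows of a sequence with a G + C content bin.

    Returns a list of (start, end, bin) tuples. Bins are equal-width
    slices of the G + C fraction in [0, 1] (an assumption of this sketch).
    """
    labels = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc_fraction = (chunk.count("G") + chunk.count("C")) / len(chunk)
        # A fraction of exactly 1.0 is placed in the top bin.
        bin_index = min(int(gc_fraction * n_bins), n_bins - 1)
        labels.append((start, start + len(chunk), bin_index))
    return labels


# Toy sequence: an AT-only window, a GC-only window and a mixed window.
windows = gc_bins("ATATATATAT" + "GCGCGCGCGC" + "ATGCATGCAT")
print(windows)
```

Each window, labeled with its bin, could then be written out as one line of a bed-formatted isochore file.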
2.2 Output
In the aforementioned example, GAT will compute the overlap of ChIP-Seq binding events and DNaseI-hypersensitive sites. GAT will also estimate whether the overlap is larger or smaller than expected by chance and will provide an empirical P-value of the statistical significance. If multiple ChIP-Seq experiments or multiple annotations have been submitted, GAT will compute the overlap for each combination of experiment and annotation and will estimate its significance. To control for multiple testing, GAT applies a false discovery rate (FDR) procedure, using either Storey's q-value (Storey and Tibshirani, 2003) or the Benjamini–Hochberg method (Benjamini and Hochberg, 1995).
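To make the multiple-testing step concrete, the following sketch applies the standard Benjamini–Hochberg step-up adjustment to a list of P-values, one per combination of experiment and annotation. It is a generic textbook implementation, not GAT's own code; the function name and the example P-values are hypothetical, and Storey's q-value is not shown.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values
    # never increase as the raw P-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical empirical P-values for five segment/annotation pairs.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
print(qvals)
```

Pairs whose adjusted value falls below the chosen FDR threshold would be reported as significant.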
3 IMPLEMENTATION
3.1 Overview
GAT is a Python script (http://python.org) requiring only common and freely available numerical and scientific libraries. The memory- and time-critical parts are implemented in Cython (http://cython.org). It requires two collections of genomic intervals: Segments (S) and Annotations (A). Each collection can contain one or more lists of genomic intervals (S_1, S_2, …, S_m; A_1, A_2, …, A_n). Intervals within a list of genomic intervals are required to be non-overlapping, and any overlapping intervals within S or A are merged prior to analysis. In addition, GAT requires a Workspace W describing the part of the genome accessible to the simulation. The analysis proceeds as follows. For each pair of interval lists S_x and A_y (x ∈ {1, …, m}, y ∈ {1, …, n}), GAT computes the overlap between the intervals in S_x and A_y within workspace W: observed = |S_x ∩ A_y ∩ W|. Here, |·| is the overlap operator and defaults to the number of nucleotides overlapping, but other operators (such as the number of segments) can be used. GAT subsequently creates randomly placed intervals in the genome with the same size distribution as S_x within the workspace W. See below for simulation details. The overlap between each simulated set and A_y is recorded. The average over all simulations represents the expected overlap. GAT reports the fold enrichment as the ratio of observed and expected overlap and associates an empirical P-value with it. GAT’s runtime and memory usage scale linearly with the number of simulations and the number and size of the genomic interval sets S, A and W.
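The quantities described above can be sketched in a few lines of Python for a single chromosome. The function names are hypothetical, the simulated overlaps are made-up numbers standing in for real simulation output, and the one-sided (r + 1)/(n + 1) estimator is a common convention assumed here, which may differ from the exact formula GAT uses.

```python
def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def nucleotides(intervals):
    """Default overlap operator: total number of nucleotides covered."""
    return sum(end - start for start, end in intervals)


def summarize(observed, simulated):
    """Return (expected overlap, fold enrichment, one-sided empirical P-value)."""
    expected = sum(simulated) / len(simulated)
    fold = observed / expected if expected > 0 else float("inf")
    # Number of simulations with an overlap at least as large as observed.
    at_least = sum(1 for x in simulated if x >= observed)
    pvalue = (at_least + 1) / (len(simulated) + 1)  # assumed estimator
    return expected, fold, pvalue


S = [(10, 50), (80, 120)]   # segments
A = [(40, 100)]             # annotations
W = [(0, 90)]               # workspace

observed = nucleotides(intersect(intersect(S, A), W))
simulated = [5, 10, 15, 10, 20]  # made-up overlaps from five simulations
expected, fold, pvalue = summarize(observed, simulated)
print(observed, expected, fold, pvalue)
```

Testing for depletion instead of enrichment would count the simulations with an overlap at most as large as the observed one.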
3.2 Sampling method
The sampling method creates a list R of randomly placed intervals from an interval list S within a workspace W. The sampling is done on a per-chromosome basis. For each chromosome c, randomly placed intervals are created by a two-step procedure:
1. Select an interval size from the empirical interval size distribution of S.
2. Select a position within the workspace W.
Sampled intervals are added to R until exactly the same number of nucleotides are in R as are in S. For reasons of performance, intervals are initially sampled without checking for overlap. Overlaps and overshoot are subsequently resolved in an iterative procedure once the sampled number of nucleotides approximates the target number.
The current sampling protocol is restricted to non-overlapping single segment intervals. Although amenable to many genomic features, it notably leaves discontinuous genomic segments, such as transcripts, untreated.
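The two-step procedure can be illustrated with a simplified sketch for one chromosome. It is not GAT's implementation: instead of sampling first and resolving overlaps and overshoot iteratively, this version rejects overlapping candidates as they are drawn, clips intervals at workspace boundaries and trims the last interval so that the nucleotide count matches exactly. The function name and the `max_attempts` guard are hypothetical.

```python
import bisect
import random


def sample_intervals(segments, workspace, rng, max_attempts=100_000):
    """Place non-overlapping random intervals covering as many nucleotides as `segments`."""
    sizes = [end - start for start, end in segments]
    ws_sizes = [end - start for start, end in workspace]
    remaining = sum(sizes)
    if remaining > sum(ws_sizes):
        raise ValueError("segments do not fit into the workspace")
    sampled = []  # kept sorted by start position
    attempts = 0
    while remaining > 0:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not place all intervals")
        # Step 1: draw a size from the empirical size distribution,
        # trimmed so that the nucleotide target is never overshot.
        size = min(rng.choice(sizes), remaining)
        # Step 2: draw a position; workspace intervals are weighted by length.
        ws_start, ws_end = rng.choices(workspace, weights=ws_sizes)[0]
        start = rng.randrange(ws_start, ws_end)
        end = min(start + size, ws_end)  # clip at the workspace boundary
        # Reject candidates that overlap an already sampled neighbor.
        k = bisect.bisect_left(sampled, (start, end))
        if k > 0 and sampled[k - 1][1] > start:
            continue
        if k < len(sampled) and sampled[k][0] < end:
            continue
        sampled.insert(k, (start, end))
        remaining -= end - start
    return sampled


segments = [(0, 30), (100, 150), (300, 320)]
workspace = [(0, 1000), (2000, 2500)]
randomized = sample_intervals(segments, workspace, random.Random(0))
print(randomized)
```

Clipping and trimming distort the size distribution slightly, which is one reason a production implementation might prefer to resolve overlaps and overshoot in a separate pass.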
3.3 Isochores
Isochores are defined within GAT as chromosomal segments within a workspace. For each isochore i, the workspace W is subdivided into a workspace W_i = W ∩ I_i. The sampling is performed separately for each W_i and samples combined at the end. Isochores are thus treated in an equivalent manner to chromosomes. Isochores can be defined by G + C content, but can reflect any segmentation of the genome, such as chromatin marks.
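The subdivision of the workspace by isochore can be sketched as below. The names `split_by_isochore`, `low_gc` and `high_gc` are hypothetical; the sketch assumes that each isochore class is given as a sorted, non-overlapping list of half-open intervals on one chromosome.

```python
def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def split_by_isochore(intervals, isochores):
    """Return {label: intervals restricted to that isochore class}."""
    return {label: intersect(intervals, iso) for label, iso in isochores.items()}


workspace = [(0, 100), (150, 300)]
segments = [(20, 60), (90, 100), (200, 250)]
isochores = {"low_gc": [(0, 120)], "high_gc": [(120, 300)]}

# Per-isochore workspaces: the W ∩ I of the text, one per class.
sub_workspaces = split_by_isochore(workspace, isochores)
# Segments split the same way give the nucleotide target for each class.
sub_segments = split_by_isochore(segments, isochores)
targets = {label: sum(e - s for s, e in ivs) for label, ivs in sub_segments.items()}
print(sub_workspaces, targets)
```

Sampling would then be run once per class, each time with that class's sub-workspace and segment sizes, and the resulting intervals concatenated.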
4 CONCLUSIONS
GAT provides critical functionality for genomic analyses. By using standard BED files, it may be used alongside major data resources, such as the UCSC Genome Browser and Galaxy (Giardine ). GAT can be used in a similar context to GREAT (McLean ) and other tools, but can address a more diverse range of questions because of its simulation approach that takes into account both segment and annotation size distributions.
Funding: UK Medical Research Council.
Conflict of Interest: none declared.
Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H 
Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki Journal: Nature Date: 2001-02-15 Impact factor: 49.962
Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043
Authors: Cory Y McLean; Dave Bristor; Michael Hiller; Shoa L Clarke; Bruce T Schaar; Craig B Lowe; Aaron M Wenger; Gill Bejerano Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908
Authors: Sreeram V Ramagopalan; Andreas Heger; Antonio J Berlanga; Narelle J Maugeri; Matthew R Lincoln; Amy Burrell; Lahiru Handunnetthi; Adam E Handel; Giulio Disanto; Sarah-Michelle Orton; Corey T Watson; Julia M Morahan; Gavin Giovannoni; Chris P Ponting; George C Ebers; Julian C Knight Journal: Genome Res Date: 2010-08-24 Impact factor: 9.043
Authors: Kate R Rosenbloom; Cricket A Sloan; Venkat S Malladi; Timothy R Dreszer; Katrina Learned; Vanessa M Kirkup; Matthew C Wong; Morgan Maddren; Ruihua Fang; Steven G Heitner; Brian T Lee; Galt P Barber; Rachel A Harte; Mark Diekhans; Jeffrey C Long; Steven P Wilder; Ann S Zweig; Donna Karolchik; Robert M Kuhn; David Haussler; W James Kent Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971
Authors: Deanna M Church; Leo Goodstadt; Ladeana W Hillier; Michael C Zody; Steve Goldstein; Xinwe She; Carol J Bult; Richa Agarwala; Joshua L Cherry; Michael DiCuccio; Wratko Hlavina; Yuri Kapustin; Peter Meric; Donna Maglott; Zoë Birtle; Ana C Marques; Tina Graves; Shiguo Zhou; Brian Teague; Konstantinos Potamousis; Christopher Churas; Michael Place; Jill Herschleb; Ron Runnheim; Daniel Forrest; James Amos-Landgraf; David C Schwartz; Ze Cheng; Kerstin Lindblad-Toh; Evan E Eichler; Chris P Ponting Journal: PLoS Biol Date: 2009-05-26 Impact factor: 8.029