GAT: a simulation framework for testing the association of genomic intervals.

Andreas Heger, Caleb Webber, Martin Goodson, Chris P Ponting, Gerton Lunter.

Abstract

MOTIVATION: A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, when interpreting ChIP-Seq or RNA-Seq data in functional terms. Because genome organization is complex, answering this question is non-trivial.
SUMMARY: We present Genomic Association Test (GAT), a tool for estimating the significance of overlap between multiple sets of genomic intervals. GAT implements a null model that the two sets of intervals are placed independently of one another, but allows each set's density to depend on external variables, for example, isochore structure or chromosome identity. GAT estimates statistical significance based on simulation and controls for multiple tests using the false discovery rate.
AVAILABILITY: GAT's source code, documentation and tutorials are available at http://code.google.com/p/genomic-association-tester.

Year: 2013 PMID: 23782611 PMCID: PMC3722528 DOI: 10.1093/bioinformatics/btt343

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A common question in genomic analysis is whether two sets of genomic intervals, for example, ChIP-Seq peaks and gene annotation classes, overlap significantly more than expected by chance alone. Interval overlap is easy to compute, but its significance can be computed analytically only for trivial situations. Hence, significance is usually estimated by simulation under some null model. This model must account for genome organization: a model that assumes independent and uniform placement of both interval sets is almost always inappropriate when testing for association with gene annotations, because gene density strongly correlates with G + C content and datasets of interest often also show G + C biases.

Here, we introduce Genomic Association Test (GAT), a tool for computing the significance of overlap between multiple sets of genomic intervals. GAT permits the restriction of the analysis to parts of a genome relevant to the experiment and accounts for chromosomal and isochore biases. Additional genomic features can be controlled for by providing additional segmentation files.

GAT's approach was developed originally to test for the association of non-coding transcripts with other genomic elements (Ponjavic et al., 2007), but has since been applied to a variety of problems, including: conservation of non-coding transcription between human and mouse (Church et al., 2009); enrichment of histone marks and evolutionarily conserved genomic regions within non-coding transcripts (Marques and Ponting, 2009); functional prediction of non-coding transcripts via their neighboring genes (Marques and Ponting, 2009); and enrichment of ChIP-Seq binding events within signatures of open chromatin or disease-associated intervals (Ramagopalan et al., 2010). GAT's re-implementation delivers to the scientific community the extended functionality of the methods of Ponjavic et al.

2 USAGE

GAT is controlled from the command line. It requires at least three bed-formatted files that delimit genomic intervals (tuples of chromosome, start and end). The principal output of GAT is a table listing significant overlaps.

2.1 Input

Example: does a set of transcription factor binding site intervals from a ChIP-Seq experiment overlap more than expected by chance with a set of DNaseI-hypersensitive sites? To perform this analysis, GAT requires three files:

1. A bed-formatted file with the intervals from the ChIP-Seq experiment (Segments S). Several experiments can be supplied as multiple files or as a single file with multiple tracks.
2. A bed-formatted file with DNaseI-hypersensitive sites (Annotations A). These could be obtained directly from the UCSC Genome Browser (Rosenbloom et al., 2012). Several annotations from, for example, multiple cell lines can be supplied as multiple files or as a single file with multiple tracks.
3. A bed-formatted file with the workspace (Workspace W). The workspace defines the sequence that is accessible for the simulation. The simplest workspace contains the full genome assembly. In this example, the analysis should be restricted to repeat-free regions, as only these are reliably mappable by short read data and thus could contain ChIP-Seq intervals. Again, appropriate bed-formatted files are available from the UCSC Genome Browser.

By default, the randomization procedure accounts for differences among chromosomes; for example, the X chromosome contains many sequence features that are atypical of autosomes. In addition to chromosome identity, local genomic G + C content is another common confounding factor. For example, G + C content might cause experimental biases in sequencing and hybridization protocols, while it is also a correlate of gene density (Lander et al., 2001). To correct for G + C content, an optional bed-formatted file with the isochore structure of the genome can be supplied. GAT will then normalize by isochore and by chromosome. Here, isochores are discretized: for example, the genome is partitioned into windows falling into eight bins of different regional G + C content.
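The discretization of G + C content into bins can be illustrated with a short Python sketch. It is not GAT's own code: the function names, the fixed window size, the cut points and the use of the fourth BED column for the bin label are all assumptions made for illustration.

```python
from bisect import bisect_right


def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else None


def isochore_bed_lines(chrom, seq, window, edges):
    """Yield BED lines (chrom, start, end, bin label) for fixed-size windows.

    `edges` are ascending G + C cut points (hypothetical values chosen by
    the user); k edges define k + 1 bins. Windows without any A/C/G/T
    base, such as assembly gaps, are skipped.
    """
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        frac = gc_fraction(seq[start:end])
        if frac is None:
            continue
        label = f"gc_bin{bisect_right(edges, frac)}"
        yield f"{chrom}\t{start}\t{end}\t{label}"


# Toy sequence: AT-rich, GC-rich, gap, mixed (three bins from two edges).
toy = "ATATATAT" + "GCGCGCGC" + "NNNNNNNN" + "ATGCATGC"
lines = list(isochore_bed_lines("chr1", toy, window=8, edges=[0.25, 0.75]))
```

Eight bins, as in the example above, would need seven cut points; real windows would be far larger than the eight bases used here.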

2.2 Output

In the aforementioned example, GAT will compute the overlap of ChIP-Seq binding events and DNaseI-hypersensitive sites. GAT will also estimate whether the overlap is larger or smaller than expected by chance and will provide an empirical P-value of the statistical significance. If multiple ChIP-Seq experiments or multiple annotations have been submitted, GAT will compute the overlap for each combination of experiment and annotation and will estimate its significance. Multiple testing is controlled with a false discovery rate (FDR) procedure, using either Storey's q-value (Storey and Tibshirani, 2003) or the Benjamini–Hochberg method (Benjamini and Hochberg, 1995).
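As a rough illustration of these two statistics, the sketch below computes a one-sided empirical P-value from simulated overlaps and applies the Benjamini–Hochberg adjustment. The function names are hypothetical, and the choice to floor the P-value at one over the number of simulations is an assumption, not a documented detail of GAT.

```python
def empirical_pvalue(observed, simulated):
    """One-sided empirical P-value for enrichment.

    Fraction of simulated overlaps at least as large as the observed one,
    floored at 1 / len(simulated) so the estimate is never zero
    (an assumption of this sketch).
    """
    hits = sum(1 for s in simulated if s >= observed)
    return max(hits, 1) / len(simulated)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = empirical_pvalue(10, [1, 2, 3, 12, 5, 6, 7, 8, 9, 11])
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For depletion rather than enrichment, the comparison in `empirical_pvalue` would be reversed.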

3 IMPLEMENTATION

3.1 Overview

GAT is a python script (http://python.org) requiring only common and freely available numerical and scientific libraries. The memory and time-critical parts are implemented in cython (http://cython.org). It requires two collections of genomic intervals: Segments ($S$) and Annotations ($A$). Each collection can contain one or more lists of genomic intervals ($S_1, S_2, \ldots, S_m$; $A_1, A_2, \ldots, A_n$). Intervals within a list of genomic intervals are required to be non-overlapping, and any overlapping intervals within $S$ or $A$ are merged prior to analysis. In addition, GAT requires a Workspace $W$ describing the part of the genome accessible to the simulation.

The analysis proceeds as follows. For each pair of interval lists $S_x$ and $A_y$ ($x \in \{1, \ldots, m\}$, $y \in \{1, \ldots, n\}$), GAT computes the overlap between the intervals in $S_x$ and $A_y$ within workspace $W$: $\mathrm{observed} = |S_x \cap A_y \cap W|$. Here, $|\cdot|$ is the overlap operator and defaults to the number of nucleotides overlapping, but other operators (such as the number of segments) can be used. GAT subsequently creates randomly placed intervals in the genome with the same size distribution as $S_x$ within the workspace $W$. See below for simulation details. The overlap between each simulated set and $A_y$ is recorded. The average over all simulations represents the expected overlap. GAT reports the fold enrichment as the ratio of observed and expected overlap and associates an empirical P-value with it.

GAT's runtime and memory usage scale linearly with the number of simulations and the number and size of the genomic interval sets $S$, $A$ and $W$.
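The nucleotide overlap and the fold enrichment described above can be sketched in plain Python for a single chromosome. This is an illustrative sketch, not GAT's cython implementation; intervals are assumed to be half-open (start, end) tuples, and all function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Intersect two sorted, non-overlapping interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def nucleotide_overlap(segments, annotations, workspace):
    """Number of nucleotides shared by segments, annotations and workspace."""
    s, a, w = merge(segments), merge(annotations), merge(workspace)
    return sum(end - start for start, end in intersect(intersect(s, a), w))


def fold_enrichment(observed, simulated):
    """Observed overlap divided by the mean simulated overlap."""
    expected = sum(simulated) / len(simulated)
    return observed / expected


observed = nucleotide_overlap(
    segments=[(10, 30), (25, 40), (100, 120)],
    annotations=[(20, 50), (110, 130)],
    workspace=[(0, 35), (105, 200)],
)
fold = fold_enrichment(observed, [10, 15, 5, 10])
```

The sketch does not guard against an expected overlap of zero, which a real implementation would need to handle.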

3.2 Sampling method

The sampling method creates a list R of randomly placed intervals from an interval list S within a workspace W. The sampling is done on a per-chromosome basis. For each chromosome c, randomly placed intervals are created by a two-step procedure:

1. Select an interval size from the empirical interval size distribution of S.
2. Select a position within the workspace W.

Sampled intervals are added to R until exactly the same number of nucleotides are in R as are in S. For reasons of performance, intervals are initially sampled without checking for overlap. Overlaps and overshoot are subsequently resolved in an iterative procedure once the sampled number of nucleotides approximates the target number.

The current sampling protocol is restricted to non-overlapping single-segment intervals. Although amenable to many genomic features, it notably leaves discontinuous genomic segments, such as transcripts, untreated.
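The two-step procedure can be sketched as follows for one chromosome. This is a deliberately simplified stand-in, not GAT's sampler: instead of sampling first and resolving overlaps and overshoot iteratively, it rejects occupied start positions and truncates each interval on the fly, which slightly distorts the size distribution. The function name and the truncation rule are assumptions of the sketch.

```python
import random


def sample_intervals(segments, workspace, rng):
    """Place random intervals covering as many nucleotides as `segments`.

    `segments` and `workspace` are lists of half-open (start, end) tuples
    with positive length; workspace intervals must not overlap.
    """
    lengths = [end - start for start, end in segments]
    target = sum(lengths)
    sizes = [end - start for start, end in workspace]
    if target > sum(sizes):
        raise ValueError("segments cover more nucleotides than the workspace")
    sampled, covered = [], 0
    while covered < target:
        # Step 1: draw a size from the empirical size distribution of S.
        length = rng.choice(lengths)
        # Step 2: draw a position uniformly over the workspace nucleotides.
        w_start, w_end = rng.choices(workspace, weights=sizes)[0]
        start = rng.randrange(w_start, w_end)
        if any(s <= start < e for s, e in sampled):
            continue  # position already occupied, draw again
        # Truncate at the workspace edge, at the next sampled interval and
        # at the number of nucleotides still missing (simplification).
        end = min(start + length, w_end, start + (target - covered))
        for s, _ in sampled:
            if start < s < end:
                end = s
        sampled.append((start, end))
        covered += end - start
    return sorted(sampled)


rng = random.Random(0)  # fixed seed so the example is reproducible
segments = [(0, 50), (100, 130), (200, 220)]
workspace = [(0, 500), (1000, 1200)]
randomized = sample_intervals(segments, workspace, rng)
```

Repeating the call many times and recording the overlap of each `randomized` set with the annotations yields the simulated overlaps from which the expected overlap is averaged.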

3.3 Isochores

Isochores are defined within GAT as chromosomal segments within a workspace. For each isochore $i$, the workspace $W$ is subdivided into a workspace $W_i = W \cap I_i$. The sampling is performed separately for each $W_i$ and the samples are combined at the end. Isochores are thus treated in an equivalent manner to chromosomes. Isochores can be defined by G + C content, but can reflect any segmentation of the genome, such as chromatin marks.
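A minimal sketch of this subdivision, for one chromosome, is given below. It only illustrates the intersection of the workspace with each isochore; the function names and the dictionary layout (isochore label mapped to a sorted list of half-open intervals) are assumptions, not GAT's internal data structures.

```python
def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def split_workspace(workspace, isochores):
    """Return one sub-workspace per isochore label (workspace within isochore)."""
    return {label: intersect(workspace, regions)
            for label, regions in isochores.items()}


workspace = [(0, 100), (150, 300)]
isochores = {
    "low_gc": [(0, 50), (200, 250)],
    "high_gc": [(50, 200), (250, 400)],
}
parts = split_workspace(workspace, isochores)
```

Each entry of `parts` would then be sampled independently, with the segments falling in that sub-workspace setting its nucleotide target, and the per-isochore samples concatenated afterwards.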

4 CONCLUSIONS

GAT provides critical functionality for genomic analyses. By using standard BED files, it may be used alongside major data resources, such as the UCSC Genome Browser and Galaxy (Giardine et al., 2005). GAT can be used in a similar context to GREAT (McLean et al., 2010) and other tools, but can address a more diverse range of questions because its simulation approach takes into account both segment and annotation size distributions.

Funding: UK Medical Research Council.

Conflict of Interest: none declared.

REFERENCES

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; et al. (International Human Genome Sequencing Consortium)
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

3. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

4. GREAT improves functional interpretation of cis-regulatory regions.

Authors: Cory Y McLean; Dave Bristor; Michael Hiller; Shoa L Clarke; Bruce T Schaar; Craig B Lowe; Aaron M Wenger; Gill Bejerano
Journal: Nat Biotechnol Date: 2010-05-02 Impact factor: 54.908

5. A ChIP-seq defined genome-wide map of vitamin D receptor binding: associations with disease and evolution.

Authors: Sreeram V Ramagopalan; Andreas Heger; Antonio J Berlanga; Narelle J Maugeri; Matthew R Lincoln; Amy Burrell; Lahiru Handunnetthi; Adam E Handel; Giulio Disanto; Sarah-Michelle Orton; Corey T Watson; Julia M Morahan; Gavin Giovannoni; Chris P Ponting; George C Ebers; Julian C Knight
Journal: Genome Res Date: 2010-08-24 Impact factor: 9.043

6. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs.

Authors: Jasmina Ponjavic; Chris P Ponting; Gerton Lunter
Journal: Genome Res Date: 2007-03-26 Impact factor: 9.043

7. ENCODE data in the UCSC Genome Browser: year 5 update.

Authors: Kate R Rosenbloom; Cricket A Sloan; Venkat S Malladi; Timothy R Dreszer; Katrina Learned; Vanessa M Kirkup; Matthew C Wong; Morgan Maddren; Ruihua Fang; Steven G Heitner; Brian T Lee; Galt P Barber; Rachel A Harte; Mark Diekhans; Jeffrey C Long; Steven P Wilder; Ann S Zweig; Donna Karolchik; Robert M Kuhn; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

8. Catalogues of mammalian long noncoding RNAs: modest conservation and incompleteness.

Authors: Ana C Marques; Chris P Ponting
Journal: Genome Biol Date: 2009-11-06 Impact factor: 13.583

9. Lineage-specific biology revealed by a finished genome assembly of the mouse.

Authors: Deanna M Church; Leo Goodstadt; Ladeana W Hillier; Michael C Zody; Steve Goldstein; Xinwe She; Carol J Bult; Richa Agarwala; Joshua L Cherry; Michael DiCuccio; Wratko Hlavina; Yuri Kapustin; Peter Meric; Donna Maglott; Zoë Birtle; Ana C Marques; Tina Graves; Shiguo Zhou; Brian Teague; Konstantinos Potamousis; Christopher Churas; Michael Place; Jill Herschleb; Ron Runnheim; Daniel Forrest; James Amos-Landgraf; David C Schwartz; Ze Cheng; Kerstin Lindblad-Toh; Evan E Eichler; Chris P Ponting
Journal: PLoS Biol Date: 2009-05-26 Impact factor: 8.029
