LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor.

Abstract

Genomic datasets are often interpreted in the context of large-scale reference databases. One approach is to identify significantly overlapping gene sets, which works well for gene-centric data. However, many types of high-throughput data are based on genomic regions. Locus Overlap Analysis (LOLA) provides easy and automatable enrichment analysis for genomic region sets, thus facilitating the interpretation of functional genomics and epigenomics data.
AVAILABILITY AND IMPLEMENTATION: R package available in Bioconductor and on the following website: http://lola.computational-epigenetics.org.

Year: 2015 PMID: 26508757 PMCID: PMC4743627 DOI: 10.1093/bioinformatics/btv612

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

Many types of biological data can be interpreted by comparing them to reference databases and searching for interesting patterns of enrichment and depletion. A particularly successful approach focuses on identifying significant overlap between gene sets. To this end, a gene set of interest is compared with a large compendium of existing gene sets with biological annotations, and the observed patterns of overlap are used for interpreting the new gene set. This type of analysis is exemplified by the popular GSEA tool (Subramanian ), and it relies on existing gene set annotation databases such as Gene Ontology, KEGG Pathways and MSigDB. Although gene set analysis has been pivotal for making connections between diverse types of genomic data, this method suffers from one major limitation: it requires gene-centric data.
This is becoming increasingly limiting as our understanding of gene regulation advances. Genes are no longer viewed as monolithic building blocks but as multifaceted elements with alternative splicing and alternative promoters, as well as various types of non-coding, antisense and regulatory transcripts. Furthermore, it has become evident that gene expression and chromatin organization are controlled by hundreds of thousands of enhancers and other functional elements, which are often difficult to map to gene symbols. The increasing emphasis on genomic region sets has been propelled by next-generation sequencing, a technology that produces data most naturally analyzed in the context of genomic regions, for example as peaks and segmentations. Driven by projects such as ENCODE (Encyclopedia of DNA Elements) and IHEC (International Human Epigenome Consortium), the research community has established large catalogs of regulatory elements and other genomic features across many cell types. Here, we present an R/Bioconductor package called LOLA (Locus Overlap Analysis) for enrichment analysis based on genomic regions.
LOLA builds upon analytical concepts that we developed and applied in previous work (Bock ; Farlik ; Tomazou ), and our software makes genomic region set analysis fast and easy for any species with an annotated reference genome. LOLA complements existing tools for gene set analysis (Khatri ), tools that convert gene sets into genomic loci such as GREAT (McLean ) and the ChIP-Seq Significance Tool (Auerbach ), and other related tools including GenometriCorr (Favorov ), Genomic HyperBrowser (Sandve ), EpiGRAPH (Bock ), genomation (Akalin ), i-CisTarget (Imrichova ), Genome Track Analyzer (Kravatsky ), ColoWeb (Kim ) and ReMap (Griffon ).
Key features of LOLA are its integration with R and Bioconductor; a command-line interface supporting automated data processing; compatibility with high-throughput pipelines as well as interactive scripting in R; fast runtime even for very large region lists and reference databases; a comprehensive core database of regulatory elements; and convenient support for users to create custom reference databases.
Each LOLA analysis is based on three components (Fig. 1A): (i) the query set, consisting of one or more lists of genomic regions to be tested for enrichment; (ii) a region universe, which is the background set of regions that could potentially have been included in the query set; and (iii) a reference database of genomic region sets that are to be tested for overlap with the query set. LOLA includes a core reference database assembled from public data, including, for example, the CODEX database (Sanchez-Castillo ) and cross-tissue annotation of DNase hypersensitivity (Sheffield ). Alternatively or in addition, users can create problem-specific custom region sets. To build a custom reference database, it is sufficient to collect text files with genomic coordinates (BED files) into a folder and to annotate them with descriptive names.
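As a rough illustration of the custom-database step, the following base R sketch writes two small BED files into a folder and adds an annotation table with descriptive names. The region coordinates, the set names, the collection name `myCollection`, the `regions` subfolder and the `index.txt` file name are all assumptions made for this example; the LOLA vignettes describe the layout the package actually expects.

```r
# Minimal sketch (base R only): assemble a folder of BED files as a custom database.
# ASSUMPTION: the <db>/<collection>/regions/*.bed layout and the index.txt name
# are illustrative; consult the LOLA documentation for the exact conventions.
dbDir         <- file.path(tempdir(), "customDB", "hg19")
collectionDir <- file.path(dbDir, "myCollection")
regionsDir    <- file.path(collectionDir, "regions")
dir.create(regionsDir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical region sets with the first three BED columns: chromosome, start, end.
regionSets <- list(
  enhancers_cellTypeA = data.frame(chr   = c("chr1", "chr2"),
                                   start = c(1000L, 5000L),
                                   end   = c(1500L, 5600L)),
  promoters_cellTypeB = data.frame(chr   = c("chr1", "chr3"),
                                   start = c(200L, 900L),
                                   end   = c(700L, 1300L))
)

# Write each region set as a tab-separated BED file without a header line.
for (setName in names(regionSets)) {
  write.table(regionSets[[setName]],
              file = file.path(regionsDir, paste0(setName, ".bed")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# Annotate the files with descriptive names (hypothetical descriptions).
index <- data.frame(
  filename    = paste0(names(regionSets), ".bed"),
  description = c("Enhancers in cell type A", "Promoters in cell type B")
)
write.table(index, file = file.path(collectionDir, "index.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

bedFiles <- list.files(regionsDir, pattern = "\\.bed$")
print(bedFiles)

# With LOLA installed, the folder could then be loaded (not run here):
# customDB <- LOLA::loadRegionDB(dbDir)
```

Keeping one file per region set means that adding a new reference set only requires dropping another BED file into the folder and extending the annotation table.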

Fig. 1.

LOLA workflow and results. (A) Query sets, universe set and reference database are loaded into R. (B) LOLA identifies overlaps, calculates enrichment and ranks the results. (C) Example of ranked LOLA enrichment results obtained by runLOLA()

A simple example

Here we analyze the 100 strongest EWS-FLI1 binding peaks from a previous study (Tomazou ) and assess their overlap with public data. The query set and the LOLA core database are available from the LOLA website.
    library("LOLA")                                 # load the LOLA package
    queryA = readBed("setA.bed")                    # query set
    activeDHS = readBed("activeDHS_universe.bed")   # region universe
    lolaDB = loadRegionDB("LOLACore/hg19")          # reference database
    result = runLOLA(queryA, activeDHS, lolaDB)
    result[1:3,]                                    # view top results
LOLA identifies all genomic regions from a query set that overlap with each region set in the reference database. This analysis is performed against a user-specified region universe, which is defined as the set of regions that could, in principle, have been included in the query set (e.g. subject to coverage constraints of the assay that was used to identify the query regions). By default, a single shared base pair is sufficient for regions to count as overlapping, but a stricter criterion can be chosen by the user.
Next, considering each region as independent, LOLA uses Fisher's exact test with false discovery rate correction to assess the significance of overlap in each pairwise comparison (Fig. 1B). The resulting rank score for each region set is then computed by assigning it the worst (max) rank among three measures: P-value, log odds ratio and number of overlapping regions. This ranking system emphasizes overlaps that do well on all three measures, and it tends to prioritize biologically relevant associations (Assenov ).
Results are returned as a data.table object (Fig. 1C), providing a powerful interface to sort, explore, visualize and further process the results. In our example, the top hits accurately identify Ewing sarcoma specific regulatory elements. LOLA implements several helper functions to explore and export the results. All functions are described on the LOLA website with vignettes illustrating the basic and advanced features.
In particular, a tutorial on manipulating the universe region set helps with configuring the most biologically relevant comparisons. Furthermore, the buildRestrictedUniverse() function automatically builds a universe based on query sets and can be used to test two region sets for differential enrichment against a reference database.
LOLA facilitates large-scale comparisons by using optimized code for storing region sets and running vector calculations with the data.table (Dowle ) and GenomicRanges packages (Lawrence ). It also uses database caching and multiple CPUs to speed up the analysis. These optimizations make LOLA analyses fast and memory-efficient, completing within a few minutes on a standard desktop computer.
Gene sets are sometimes regarded as a universal language connecting genes, diseases and drugs. We anticipate that sets of genomic regions can similarly connect diverse types of genome, epigenome and transcriptome data to identify relevant associations in large datasets, thereby leveraging the broad investment into large-scale functional genomics and epigenomics for biological discovery. Such analyses can now be done easily and efficiently using LOLA.
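The scoring scheme described in the example section (Fisher's exact test per reference set, false discovery rate correction, then the worst rank across three measures) can be sketched in base R as follows. This is a simplified illustration rather than LOLA's own code: the overlap counts are invented, the one-sided "greater" alternative and the Benjamini-Hochberg correction are assumptions chosen for the sketch, and the column names are hypothetical.

```r
# Minimal sketch of the enrichment scoring, using invented overlap counts.
nQuery    <- 100L     # regions in the query set (all contained in the universe)
nUniverse <- 10000L   # regions in the universe
counts <- data.frame(
  refSet       = c("setX", "setY", "setZ"),
  support      = c(40L, 10L, 25L),     # query regions overlapping the reference set
  nonQueryHits = c(200L, 500L, 300L),  # other universe regions overlapping it
  stringsAsFactors = FALSE
)

# Build one 2x2 contingency table per reference set and test for enrichment.
# ASSUMPTION: one-sided test (alternative = "greater").
testOverlap <- function(support, nonQueryHits) {
  queryMisses    <- nQuery - support
  nonQueryMisses <- nUniverse - nQuery - nonQueryHits
  # Columns: query / non-query regions; rows: overlapping / not overlapping.
  tab <- matrix(c(support, queryMisses, nonQueryHits, nonQueryMisses), nrow = 2)
  ft  <- fisher.test(tab, alternative = "greater")
  c(pValue = ft$p.value, oddsRatio = unname(ft$estimate))
}
testStats <- t(mapply(testOverlap, counts$support, counts$nonQueryHits))

counts$pValue       <- testStats[, "pValue"]
counts$logOddsRatio <- log(testStats[, "oddsRatio"])
# ASSUMPTION: Benjamini-Hochberg as the false discovery rate correction.
counts$qValue       <- p.adjust(counts$pValue, method = "BH")

# Rank each measure separately (1 = best), then keep the worst rank per set.
counts$maxRank <- pmax(
  rank(counts$pValue, ties.method = "min"),
  rank(-counts$logOddsRatio, ties.method = "min"),
  rank(-counts$support, ties.method = "min")
)
ranked <- counts[order(counts$maxRank), ]
print(ranked[, c("refSet", "support", "maxRank")])
```

Taking the maximum of the three ranks means a reference set only reaches the top of the list if it performs well on significance, effect size and absolute overlap at the same time.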

19 in total

1. Comprehensive analysis of DNA methylation data with RnBeads.

Authors: Yassen Assenov; Fabian Müller; Pavlo Lutsik; Jörn Walter; Thomas Lengauer; Christoph Bock
Journal: Nat Methods Date: 2014-09-28 Impact factor: 28.547

2. The Genomic HyperBrowser: an analysis web server for genome-scale data.

Authors: Geir K Sandve; Sveinung Gundersen; Morten Johansen; Ingrid K Glad; Krishanthi Gunathasan; Lars Holden; Marit Holden; Knut Liestøl; Ståle Nygård; Vegard Nygaard; Jonas Paulsen; Halfdan Rydbeck; Kai Trengereid; Trevor Clancy; Finn Drabløs; Egil Ferkingstad; Matús Kalas; Tonje Lien; Morten B Rye; Arnoldo Frigessi; Eivind Hovig
Journal: Nucleic Acids Res Date: 2013-04-30 Impact factor: 16.971

3. Software for computing and annotating genomic ranges.

Authors: Michael Lawrence; Wolfgang Huber; Hervé Pagès; Patrick Aboyoun; Marc Carlson; Robert Gentleman; Martin T Morgan; Vincent J Carey
Journal: PLoS Comput Biol Date: 2013-08-08 Impact factor: 4.475

4. CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities.

Authors: Manuel Sánchez-Castillo; David Ruau; Adam C Wilkinson; Felicia S L Ng; Rebecca Hannah; Evangelia Diamanti; Patrick Lombard; Nicola K Wilson; Berthold Gottgens
Journal: Nucleic Acids Res Date: 2014-09-30 Impact factor: 19.160

5. Epigenome mapping reveals distinct modes of gene regulation and widespread enhancer reprogramming by the oncogenic fusion protein EWS-FLI1.

Authors: Eleni M Tomazou; Nathan C Sheffield; Christian Schmidl; Michael Schuster; Andreas Schönegger; Paul Datlinger; Stefan Kubicek; Christoph Bock; Heinrich Kovar
Journal: Cell Rep Date: 2015-02-19 Impact factor: 9.423

6. Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics.

Authors: Matthias Farlik; Nathan C Sheffield; Angelo Nuzzo; Paul Datlinger; Andreas Schönegger; Johanna Klughammer; Christoph Bock
Journal: Cell Rep Date: 2015-02-26 Impact factor: 9.423

7. Relating genes to function: identifying enriched transcription factors using the ENCODE ChIP-Seq significance tool.

Authors: Raymond K Auerbach; Bin Chen; Atul J Butte
Journal: Bioinformatics Date: 2013-06-03 Impact factor: 6.937

8. Genome-wide study of correlations between genomic features and their relationship with the regulation of gene expression.

Authors: Yuri V Kravatsky; Vladimir R Chechetkin; Nikolai A Tchurikov; Galina I Kravatskaya
Journal: DNA Res Date: 2015-01-27 Impact factor: 4.458

9. Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape.

Authors: Aurélien Griffon; Quentin Barbier; Jordi Dalino; Jacques van Helden; Salvatore Spicuglia; Benoit Ballester
Journal: Nucleic Acids Res Date: 2014-12-03 Impact factor: 16.971

10. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions.

Authors: Nathan C Sheffield; Robert E Thurman; Lingyun Song; Alexias Safi; John A Stamatoyannopoulos; Boris Lenhard; Gregory E Crawford; Terrence S Furey
Journal: Genome Res Date: 2013-03-12 Impact factor: 9.043
