
WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis.

Daniel R Zerbino, Nathan Johnson, Thomas Juettemann, Steven P Wilder, Paul Flicek.

Abstract

MOTIVATION: Using high-throughput sequencing, researchers are now generating hundreds of whole-genome assays to measure various features such as transcription factor binding, histone marks, DNA methylation or RNA transcription. Displaying so much data generally leads to a confusing accumulation of plots. We describe here a multithreaded library that computes statistics on large numbers of datasets (Wiggle, BigWig, Bed, BigBed and BAM), generating statistical summaries within minutes with limited memory requirements, whether on the whole genome or on selected regions.
AVAILABILITY AND IMPLEMENTATION: The code is freely available under the Apache 2.0 license at www.github.com/Ensembl/Wiggletools.

Year:  2013        PMID: 24363377      PMCID: PMC3967112          DOI: 10.1093/bioinformatics/btt737

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

With the advent of high-throughput sequencing, research teams and consortia are generating large numbers of datasets that are projected onto the same reference genome (Adams et al.; Bernstein et al.; The ENCODE Project Consortium, 2012). In particular, epigenomic assays quantify many continuous variables across the genome, e.g. transcription factor binding, histone marks, DNA methylation, chromatin structure or RNA transcription. Although they differ in their protocols, all the above assays include a sequencing step that generates a huge number of sequencing reads. These reads, or tags, are then aligned against the human genome. This placement information is normally stored in the BAM file format (Li et al.). Because the BAM files are generally large and information rich, they are often summarized into BigWig files that describe a numerical variable such as read depth across the genome (Kent et al.). These BAM and BigWig files can then readily be displayed on most genome browsers (Flicek et al.; Meyer et al.).

In the current context, where researchers are testing many measurements across many samples, displaying all these data creates confusing graphics: either the plots are placed side-by-side and an observer is forced to continually shift their attention from one plot to another, or the plots are superimposed, blurring the information content. Instead, one could summarize all these datasets for each position in the genome. Similarly, one could display the difference between case and control datasets. Fundamentally, all of these datasets are simply vectors of numbers, and statistics, such as mean, variance, median, etc., can be generated from any such collection, producing a meaningful summary of the data. Common statistical tools such as R do not scale well to such large datasets, especially with respect to memory requirements. Therefore, we developed a tool that can perform rigorous statistical tests across the whole genome and detect regions of interest without practical memory constraints.
We drew inspiration from the popular BEDTools package (Quinlan and Hall, 2010), which computes overlaps and derived statistics between sets of regions. Converting numerical measurements into genomic regions (generally referred to as peak calling or segmentation, depending on the context) is a convenient and common approach to handling genome-wide data. However, it does imply an inevitable loss of information, as continuous variables are discretized and often binarized. Therefore, we wanted a tool that natively reads the numerical data contained in genomic files and computes statistics on it.

2 FEATURES AND METHODS

2.1 Composable iterators

WiggleTools is centered on the use of iterators. This approach ensures scalability and reduces memory requirements: instead of loading entire files in memory, an iterator simply stores local information, allowing a program to simultaneously process dozens, even hundreds of files. This simultaneous handling of multiple files is particularly useful to compute statistics such as medians, which require storing all possible values before evaluation. The only exceptions to the use of iterators are the input/output operations, which are run on separate threads that read/write, compress/decompress and parse/print data files independently. The basic iterators simply read the data from files, whether BAM, Wiggle, BigWig, Bed, BigBed or BedGraph. A range of iterators can be built on top of those. There are basic unary operators (multiplication by a constant scalar, absolute value, logarithm, exponential, exponentiation and filter), binary operators (sum, product, ratio and difference), statistics on sets (mean, median, standard deviation, variance, minimum, maximum) and statistics on pairs of sets (Welch’s t test, Mann–Whitney U). In turn, all these iterators can be combined or composed to create more complex operators. Iterators can either traverse the entire genome or a slice of the genome.
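The composition of iterators described above can be illustrated with a minimal Python sketch. This is not the WiggleTools C API: the function names `scale` and `combine`, the `(position, value)` track representation, and the choice to treat uncovered positions as 0 are all assumptions made for illustration.

```python
import heapq
import statistics
from itertools import groupby
from operator import itemgetter


def scale(track, factor):
    """Unary iterator: multiply every value of a track by a constant."""
    for position, value in track:
        yield position, value * factor


def combine(tracks, statistic, missing=0.0):
    """N-ary iterator: apply `statistic` across all tracks at each position.

    Each track must yield (position, value) pairs sorted by position, with
    at most one value per position. Positions absent from a track are
    filled with `missing` (an assumption of this sketch).
    """
    tracks = list(tracks)
    # heapq.merge is lazy: only the current item of each track is held.
    merged = heapq.merge(*tracks, key=itemgetter(0))
    for position, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        values.extend([missing] * (len(tracks) - len(values)))
        yield position, statistic(values)


# Three toy tracks on a single chromosome (hypothetical data).
track_a = [(1, 2.0), (2, 4.0)]
track_b = [(2, 6.0), (3, 8.0)]
track_c = [(1, 1.0), (2, 1.0), (3, 1.0)]

# Iterators compose: the median is taken over one scaled and two raw tracks.
median_track = list(combine([scale(track_a, 0.5), track_b, track_c],
                            statistics.median))
sum_track = list(combine([track_a, track_b, track_c], sum))
```

Because every stage is a generator, only the values at the current position are held in memory, which is the property that lets a median be computed across many files without loading any of them whole.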

2.2 Functionalities

The primary intent of the library is to compute statistics across a large number of datasets, so that the users need only display one curve on their genome browser instead of a multitude. For example, they can compress a collection of datasets into a median, as well as compare datasets (e.g. cases versus controls) and generate a track that denotes the differences between the two sets. In addition, the WiggleTools library can compute statistics across genomic positions for a single iterator (area under the curve, variance) or a pair of iterators (Pearson correlation). These statistics can be computed across the entire genome or on regions of interest. For example, it can compute the read coverage at known promoter regions. Similarly, WiggleTools can be used to compute a scaled summary profile of the data on a set of regions.

The WiggleTools library can be used as a C library but also as a standalone command-line tool. The user has complete access to the richness of the framework using a simple Polish Notation language. For example, to generate the sum of a collection of BigWig files and write the result into a new Wiggle file, the command would look like:

    wiggletools write sum.wig sum data/*.bw
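The statistics across genomic positions mentioned in this section (area under the curve, Pearson correlation) can each be accumulated in a single pass over an iterator. The following Python sketch shows one way to do so; the function names and the `(start, end, value)` interval layout are assumptions, and it does not reproduce the library's own implementation.

```python
import math


def area_under_curve(intervals):
    """Sum value * width over (start, end, value) intervals, end exclusive."""
    return sum((end - start) * value for start, end, value in intervals)


def pearson(pairs):
    """Pearson correlation of (x, y) pairs, accumulated in a single pass."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    covariance = sum_xy - sum_x * sum_y / n
    variance_x = sum_xx - sum_x * sum_x / n
    variance_y = sum_yy - sum_y * sum_y / n
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical coverage track: value 2 over 10 bp, then value 4 over 5 bp.
coverage = [(0, 10, 2.0), (10, 15, 4.0)]
auc = area_under_curve(coverage)

# Two toy signals sampled at the same positions.
r_same = pearson(zip([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
r_opposite = pearson(zip([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))
```

Only six running totals are stored, so memory use does not grow with the number of positions. The sums-of-products formula is kept for brevity; it can lose precision on very long tracks, where an incremental (Welford-style) update would be a safer choice.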

2.3 Performance

The WiggleTools library has been specifically designed to handle many files simultaneously, allowing complex statistics to be computed as directly as possible, with low memory requirements. The limiting factor of this approach is the I/O access to the files, meaning that it requires the input files to be on the local network of the computation CPUs. However, because of the efficient indexing of BigWig files, the output can be directly displayed on a remote server, such as a genome browser. It is trivial to accelerate computations by slicing the genome into regions and assigning each region to a different CPU. A wrapper script is available to do this automatically. However, one obstacle to this approach is merging the final files, as the tools provided in the original Kent library quickly become a performance bottleneck. Therefore, we developed modified functions that parallelize the computation of summary tables (which are crucial to accelerate display at large scales), which we contributed to the Kent library.

To evaluate the performance of our tool, we downloaded all the DNase I hypersensitivity wiggle tracks contained in the ENCODE January 2011 data freeze (The ENCODE Project Consortium, 2012) and computed the sum of all these signals through three pipelines. We first ran the WiggleTools library in parallel on 116 sections of the genome (up to 30-Mbp long), producing as many output BigWig files that were merged with our new bigWigCat utility. Second, we ran WiggleTools but merged the output files with the default bigWigMerge utility (Kent et al.). Finally, we used bigWigMerge to directly sum the 126 BigWig files. The bigWigMerge tool only creates flat files; therefore, a compression and indexing stage, performed by the wigToBigWig tool, must also be done. The results in Table 1 clearly show that the first pipeline, which took 1090 s to run, is ∼12 and ∼19 times faster than the other two approaches, respectively, while requiring a fraction of the memory.
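The genome-slicing strategy described in this section can be sketched in a few lines of Python. This is an illustration only, not the wrapper script distributed with WiggleTools: the chromosome names and lengths are made up, and `slice_width` is a hypothetical placeholder for the real per-region computation.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_genome(chrom_lengths, max_length=30_000_000):
    """Yield (chrom, start, end) slices of at most max_length bp, end exclusive."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, max_length):
            yield chrom, start, min(start + max_length, length)


def slice_width(region):
    """Hypothetical placeholder for the per-slice job (here: just its width)."""
    _chrom, start, end = region
    return end - start


# Made-up chromosome lengths, for illustration only.
chrom_lengths = {"chrA": 70_000_000, "chrB": 30_000_000}
slices = list(slice_genome(chrom_lengths))

# Each slice is independent, so slices can be handed to separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    widths = list(pool.map(slice_width, slices))
```

Threads are used here only to keep the sketch self-contained; in a setting like the benchmark, each slice would more plausibly be submitted as a separate process or cluster job, with the per-slice outputs concatenated afterwards.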
Table 1. Benchmarking CPU and memory requirements to compute the sum of 126 BigWig files (121 GB of data in total)

| Pipeline | Stage | CPUs | Time/CPU (s) | RAM/CPU (GB) |
|---|---|---|---|---|
| 1 | wiggletools | 116 | 351 mean, 739 maximum | 0.22 mean, 0.32 maximum |
| 1 | bigWigCat | 1 | 378 | 5.23 |
| 1 | Overall | 116 | 1090 | 5.23 |
| 2 | wiggletools | 116 | 351 mean, 739 maximum | 0.22 mean, 0.32 maximum |
| 2 | bigWigMerge | 1 | 3441 | 6.93 |
| 2 | wigToBigWig | 1 | 8887 | 68.85 |
| 2 | Overall | 116 | 13 067 | 68.85 |
| 3 | bigWigMerge | 1 | 11 036 | 43.73 |
| 3 | wigToBigWig | 1 | 9423 | 75.12 |
| 3 | Overall | 1 | 20 459 | 75.12 |

Note: Several pipelines are compared; hence some components appear multiple times.
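As a quick check of the speed-up figures quoted in the text, the ratios can be recomputed from the "Overall" rows of Table 1. The short Python snippet below does only that arithmetic; the variable names are arbitrary.

```python
# "Overall" rows of Table 1: wall time in seconds and peak RAM per CPU in GB.
overall_seconds = {1: 1090, 2: 13067, 3: 20459}
overall_ram_gb = {1: 5.23, 2: 68.85, 3: 75.12}

# Ratios of pipelines 2 and 3 relative to pipeline 1.
speedup = {p: overall_seconds[p] / overall_seconds[1] for p in (2, 3)}
ram_ratio = {p: overall_ram_gb[p] / overall_ram_gb[1] for p in (2, 3)}

for p in (2, 3):
    print(f"pipeline {p}: {speedup[p]:.1f}x slower, "
          f"{ram_ratio[p]:.1f}x more RAM per CPU than pipeline 1")
```

The time ratios come out at roughly 12 and 19, matching the figures in the text, and the peak memory per CPU of the first pipeline is more than ten times lower than that of the other two.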

REFERENCES

1.  The NIH Roadmap Epigenomics Mapping Consortium.

Authors:  Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal:  Nat Biotechnol       Date:  2010-10       Impact factor: 54.908

2.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

3.  BEDTools: a flexible suite of utilities for comparing genomic features.

Authors:  Aaron R Quinlan; Ira M Hall
Journal:  Bioinformatics       Date:  2010-01-28       Impact factor: 6.937

4.  BigWig and BigBed: enabling browsing of large distributed datasets.

Authors:  W J Kent; A S Zweig; G Barber; A S Hinrichs; D Karolchik
Journal:  Bioinformatics       Date:  2010-07-17       Impact factor: 6.937

5.  BLUEPRINT to decode the epigenetic signature written in blood.

Authors:  David Adams; Lucia Altucci; Stylianos E Antonarakis; Juan Ballesteros; Stephan Beck; Adrian Bird; Christoph Bock; Bernhard Boehm; Elias Campo; Andrea Caricasole; Fredrik Dahl; Emmanouil T Dermitzakis; Tariq Enver; Manel Esteller; Xavier Estivill; Anne Ferguson-Smith; Jude Fitzgibbon; Paul Flicek; Claudia Giehl; Thomas Graf; Frank Grosveld; Roderic Guigo; Ivo Gut; Kristian Helin; Jonas Jarvius; Ralf Küppers; Hans Lehrach; Thomas Lengauer; Åke Lernmark; David Leslie; Markus Loeffler; Elizabeth Macintyre; Antonello Mai; Joost H A Martens; Saverio Minucci; Willem H Ouwehand; Pier Giuseppe Pelicci; Hèléne Pendeville; Bo Porse; Vardhman Rakyan; Wolf Reik; Martin Schrappe; Dirk Schübeler; Martin Seifert; Reiner Siebert; David Simmons; Nicole Soranzo; Salvatore Spicuglia; Michael Stratton; Hendrik G Stunnenberg; Amos Tanay; David Torrents; Alfonso Valencia; Edo Vellenga; Martin Vingron; Jörn Walter; Spike Willcocks
Journal:  Nat Biotechnol       Date:  2012-03-07       Impact factor: 54.908

6.  An integrated encyclopedia of DNA elements in the human genome.

Authors:  The ENCODE Project Consortium
Journal:  Nature       Date:  2012-09-06       Impact factor: 49.962

7.  Ensembl 2013.

Authors:  Paul Flicek; Ikhlak Ahmed; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Carlos García-Girón; Leo Gordon; Thibaut Hourlier; Sarah Hunt; Thomas Juettemann; Andreas K Kähäri; Stephen Keenan; Monika Komorowska; Eugene Kulesha; Ian Longden; Thomas Maurel; William M McLaren; Matthieu Muffato; Rishi Nag; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Emily Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sheppard; Daniel Sobral; Kieron Taylor; Anja Thormann; Stephen Trevanion; Simon White; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Nathan Johnson; Rhoda Kinsella; Anne Parker; Giulietta Spudich; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal:  Nucleic Acids Res       Date:  2012-11-30       Impact factor: 16.971

8.  The UCSC Genome Browser database: extensions and updates 2013.

Authors:  Laurence R Meyer; Ann S Zweig; Angie S Hinrichs; Donna Karolchik; Robert M Kuhn; Matthew Wong; Cricket A Sloan; Kate R Rosenbloom; Greg Roe; Brooke Rhead; Brian J Raney; Andy Pohl; Venkat S Malladi; Chin H Li; Brian T Lee; Katrina Learned; Vanessa Kirkup; Fan Hsu; Steve Heitner; Rachel A Harte; Maximilian Haeussler; Luvina Guruvadoo; Mary Goldman; Belinda M Giardine; Pauline A Fujita; Timothy R Dreszer; Mark Diekhans; Melissa S Cline; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2012-11-15       Impact factor: 16.971
