
WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis.

Daniel R Zerbino, Nathan Johnson, Thomas Juettemann, Steven P Wilder, Paul Flicek.

Abstract

MOTIVATION: Using high-throughput sequencing, researchers are now generating hundreds of whole-genome assays to measure various features such as transcription factor binding, histone marks, DNA methylation or RNA transcription. Displaying so much data generally leads to a confusing accumulation of plots. We describe here a multithreaded library that computes statistics on large numbers of datasets (Wiggle, BigWig, Bed, BigBed and BAM), generating statistical summaries within minutes with limited memory requirements, whether on the whole genome or on selected regions.
AVAILABILITY AND IMPLEMENTATION: The code is freely available under Apache 2.0 license at www.github.com/Ensembl/Wiggletools

Year: 2013 PMID: 24363377 PMCID: PMC3967112 DOI: 10.1093/bioinformatics/btt737

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

With the advent of high-throughput sequencing, research teams and consortia are generating large numbers of datasets that are projected onto the same reference genome (Adams et al., 2012; Bernstein et al., 2010; The ENCODE Project Consortium, 2012). In particular, epigenomic assays quantify many continuous variables across the genome, e.g. transcription factor binding, histone marks, DNA methylation, chromatin structure or RNA transcription. Although they differ in their protocols, all the above assays include a sequencing step that generates a huge number of sequencing reads. These reads, or tags, are then aligned against the human genome. This placement information is normally stored in the BAM file format (Li et al., 2009). Because the BAM files are generally large and information rich, they are often summarized into BigWig files that describe a numerical variable such as read depth across the genome (Kent et al., 2010). These BAM and BigWig files can then readily be displayed on most genome browsers (Flicek et al., 2013; Meyer et al., 2013).
In the current context, where researchers are testing many measurements across many samples, displaying all these data creates confusing graphics: either the plots are placed side-by-side and an observer is forced to continually shift their attention from one plot to another, or the plots are superimposed, blurring the information content. Instead, one could summarize all these datasets for each position in the genome. Similarly, one could display the difference between case and control datasets. Fundamentally, all of these datasets are simply vectors of numbers, and statistics such as mean, variance or median can be generated from any such collection, producing a meaningful summary of the data. Common statistical tools such as R do not scale well to such large datasets, especially with respect to memory requirements. Therefore, we developed a tool that can perform rigorous statistical tests across the whole genome and detect regions of interest without practical memory constraints.
We drew inspiration from the popular BEDTools package (Quinlan and Hall, 2010), which computes overlaps and derived statistics between sets of regions. Converting numerical measurements into genomic regions (generally referred to as peak calling or segmentation, depending on the context) is a convenient and common approach to handling genome-wide data. However, it does imply an inevitable loss of information, as continuous variables are discretized and often binarized. Therefore, we wanted a tool that natively reads the numerical data contained in genomic files and computes statistics on it.

2 FEATURES AND METHODS

2.1 Composable iterators

WiggleTools is centered on the use of iterators. This approach ensures scalability and reduces memory requirements: instead of loading entire files in memory, an iterator simply stores local information, allowing a program to simultaneously process dozens, even hundreds of files. This simultaneous handling of multiple files is particularly useful to compute statistics such as medians, which require storing all possible values before evaluation. The only exceptions to the use of iterators are the input/output operations, which are run on separate threads that read/write, compress/decompress and parse/print data files independently. The basic iterators simply read the data from files, whether BAM, Wiggle, BigWig, Bed, BigBed or BedGraph. A range of iterators can be built on top of those. There are basic unary operators (multiplication by a constant scalar, absolute value, logarithm, exponential, exponentiation and filter), binary operators (sum, product, ratio and difference), statistics on sets (mean, median, standard deviation, variance, minimum, maximum) and statistics on pairs of sets (Welch’s t test, Mann–Whitney U). In turn, all these iterators can be combined or composed to create more complex operators. Iterators can either traverse the entire genome or a slice of the genome.
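As an illustration of the iterator design described above, the following Python sketch composes small generators that each hold only local state. It is not WiggleTools code (the library itself is provided as a C library): the names read_values, scale and reduce_by_position are hypothetical, the in-memory lists stand in for file readers, and treating a position that is absent from a dataset as 0 is an assumption made for this example.

```python
import heapq
from itertools import groupby
from statistics import median


def read_values(pairs):
    """Stand-in for a file reader: yield (position, value) pairs in position order."""
    yield from pairs


def scale(source, factor):
    """Unary operator: multiply every value of a stream by a constant."""
    for pos, value in source:
        yield pos, value * factor


def reduce_by_position(sources, func):
    """Set statistic: apply func to the values all streams report at each position.

    Streams are merged lazily, so only one pending item per stream is held.
    Each stream is assumed to report a position at most once.
    """
    sources = list(sources)
    merged = heapq.merge(*sources, key=lambda pair: pair[0])
    for pos, group in groupby(merged, key=lambda pair: pair[0]):
        values = [value for _, value in group]
        # Assumption: a stream with no entry at this position contributes 0.
        values.extend([0] * (len(sources) - len(values)))
        yield pos, func(values)


a = [(1, 2.0), (2, 4.0), (5, 1.0)]
b = [(2, 1.0), (5, 3.0)]
c = [(1, 6.0), (5, 5.0)]

# Iterators compose: a scaled stream is summed with two unscaled ones.
total = reduce_by_position([scale(read_values(a), 0.5), read_values(b), read_values(c)], sum)
print(list(total))  # [(1, 7.0), (2, 3.0), (5, 8.5)]

# The same reducer computes a per-position median across the three streams.
med = reduce_by_position([read_values(a), read_values(b), read_values(c)], median)
print(list(med))  # [(1, 2.0), (2, 1.0), (5, 3.0)]
```

Because every stage is a generator, adding another operator on top of an existing one costs no extra pass over the data, which is the property the text attributes to composable iterators.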

2.2 Functionalities

The primary intent of the library is to compute statistics across a large number of datasets, so that the users need only display one curve on their genome browser instead of a multitude. For example, they can compress a collection of datasets into a median, as well as compare datasets (e.g. cases versus controls) and generate a track that denotes the differences between the two sets. In addition, the WiggleTools library can compute statistics across genomic positions for a single iterator (area under the curve, variance) or a pair of iterators (Pearson correlation). These statistics can be computed across the entire genome or on regions of interest. For example, it can compute the read coverage at known promoter regions. Similarly, WiggleTools can be used to compute a scaled summary profile of the data on a set of regions.
The WiggleTools library can be used as a C library but also as a standalone command-line tool. The user has complete access to the richness of the framework using a simple Polish notation language. For example, to generate the sum of a collection of BigWig files and write the result into a new Wiggle file, the command would look like:
    wiggletools write sum.wig sum data/*.bw
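To make the idea of Polish (prefix) notation concrete, here is a minimal Python sketch of a prefix-expression evaluator in which each operator precedes its operands. The grammar is a toy one invented for this example: the operator names add, diff, abs and scale, their fixed arities, and the use of in-memory lists as tracks are all assumptions, and none of this reproduces the actual WiggleTools command language.

```python
def evaluate(tokens, tracks):
    """Evaluate a prefix expression, consuming tokens from the left.

    tokens: list of strings, modified in place.
    tracks: dict mapping a track name to a list of per-position values.
    """
    token = tokens.pop(0)
    if token == "scale":
        # scale <factor> <expression>
        factor = float(tokens.pop(0))
        return [factor * v for v in evaluate(tokens, tracks)]
    if token == "abs":
        # abs <expression>
        return [abs(v) for v in evaluate(tokens, tracks)]
    if token in ("add", "diff"):
        # add|diff <expression> <expression>
        left = evaluate(tokens, tracks)
        right = evaluate(tokens, tracks)
        if token == "add":
            return [x + y for x, y in zip(left, right)]
        return [x - y for x, y in zip(left, right)]
    # Anything else is treated as the name of an input track.
    return tracks[token]


def run(expression, tracks):
    """Parse a whitespace-separated prefix expression and evaluate it."""
    tokens = expression.split()
    result = evaluate(tokens, tracks)
    if tokens:
        raise ValueError(f"unexpected trailing tokens: {tokens}")
    return result


tracks = {"case": [2.0, 4.0, 6.0], "control": [0.0, 2.0, 2.0]}
print(run("scale 0.5 add case control", tracks))  # [1.0, 3.0, 4.0]
print(run("abs diff control case", tracks))       # [2.0, 2.0, 4.0]
```

Prefix notation needs no parentheses because every operator knows how many operands follow it, which is what allows nested operations to be written as a flat command line.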

2.3 Performance

The WiggleTools library has been specifically designed to handle many files simultaneously, allowing complex statistics to be computed as directly as possible, with low memory requirements. The limiting factor of this approach is the I/O access to the files, meaning that it requires the input files to be in the local network of the computation CPUs. However, because of the efficient indexing of BigWig files, the output can be directly displayed on a remote server, such as a genome browser. It is trivial to accelerate computations by slicing the genome into regions and assigning each region to a different CPU. A wrapper script is available to do this automatically. However, one obstacle to this approach is merging the final files, as the tools provided in the original Kent library quickly become a performance bottleneck. Therefore, we developed modified functions that parallelize the computation of summary tables (which are crucial to accelerate display at large scales), which we contributed to the Kent library.
To evaluate the performance of our tool, we downloaded all the DNAseI hypersensitivity wiggle tracks contained in the ENCODE January 2011 data freeze (The ENCODE Project Consortium, 2012) and computed the sum of all these signals through three pipelines. We first ran the WiggleTools library in parallel on 116 sections of the genome (up to 30-Mbp long), producing as many output BigWig files that were merged with our new bigWigCat utility. Second, we ran WiggleTools but merged the output files with the default bigWigMerge utility (Kent et al., 2010). Finally, we used bigWigMerge to directly sum the 126 BigWig files. The bigWigMerge tool only creates flat files; therefore, a compression and indexing stage, performed by the wigToBigWig tool, must also be done. The results in Table 1 clearly show that the first pipeline, which took 1090 s to run, is ∼12 and ∼19 times faster than the other two approaches, while requiring a fraction of the memory.
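The region-slicing strategy mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the wrapper script distributed with WiggleTools: the function names, the toy chromosome lengths and the placeholder worker are hypothetical, and a thread pool is used only to keep the example self-contained, whereas a real run would dispatch each region to a separate process or cluster job.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_genome(chrom_lengths, max_length):
    """Yield (chrom, start, end) half-open regions no longer than max_length."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, max_length):
            yield chrom, start, min(start + max_length, length)


def process_region(region):
    """Hypothetical placeholder for the per-region computation.

    A real worker would compute the statistic on this slice and write one
    output file; here it only reports the region and its length.
    """
    chrom, start, end = region
    return chrom, start, end, end - start


def run_in_parallel(chrom_lengths, max_length, workers=4):
    """Process every region concurrently and return results in region order."""
    regions = list(slice_genome(chrom_lengths, max_length))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so outputs can be concatenated directly.
        return list(pool.map(process_region, regions))


# Toy lengths chosen for readability; they are not real chromosome sizes.
toy_genome = {"chrA": 70, "chrB": 25}
for result in run_in_parallel(toy_genome, max_length=30):
    print(result)
```

Keeping the results in genomic order matters because the per-region outputs then only need to be concatenated, which is the step the text identifies as the remaining bottleneck.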

Table 1.

Benchmarking CPU and memory requirements to compute the sum of 126 BigWig files (121 GB of data in total)

Pipeline	Stage	CPUs	Time/CPU (s)	RAM/CPU (GB)
1	wiggletools	116	351 mean	0.22 mean
			739 maximum	0.32 maximum
	bigWigCat	1	378	5.23
	Overall	116	1090	5.23
2	wiggletools	116	351 mean	0.22 mean
			739 maximum	0.32 maximum
	bigWigMerge	1	3441	6.93
	wigToBigWig	1	8887	68.85
	Overall	116	13 067	68.85
3	bigWigMerge	1	11 036	43.73
	wigToBigWig	1	9423	75.12
	Overall	1	20 459	75.12

Note: Several pipelines are compared; hence some components appear multiple times.
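The speed-up factors quoted in the text can be checked directly from the Overall rows of Table 1. The short Python calculation below uses only the overall times and peak memory figures reported in the table; the variable names are chosen for this example.

```python
# Overall wall-clock time (s) and peak RAM per CPU (GB), copied from Table 1.
overall_seconds = {1: 1090, 2: 13067, 3: 20459}
peak_ram_gb = {1: 5.23, 2: 68.85, 3: 75.12}

for pipeline in (2, 3):
    # How many times longer this pipeline ran than pipeline 1.
    slowdown = overall_seconds[pipeline] / overall_seconds[1]
    # Pipeline 1 peak memory as a share of this pipeline's peak memory.
    ram_share = peak_ram_gb[1] / peak_ram_gb[pipeline]
    print(f"pipeline {pipeline}: {slowdown:.1f}x the time of pipeline 1; "
          f"pipeline 1 uses {ram_share:.1%} of its peak RAM")
```

The ratios come out at roughly 12 and 19, in line with the figures given in the text, and the peak memory of the first pipeline is below one tenth of that of either alternative.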

REFERENCES

1. The NIH Roadmap Epigenomics Mapping Consortium.

Authors: Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

2. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

3. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

4. BigWig and BigBed: enabling browsing of large distributed datasets.

Authors: W J Kent; A S Zweig; G Barber; A S Hinrichs; D Karolchik
Journal: Bioinformatics Date: 2010-07-17 Impact factor: 6.937

5. BLUEPRINT to decode the epigenetic signature written in blood.

Authors: David Adams; Lucia Altucci; Stylianos E Antonarakis; Juan Ballesteros; Stephan Beck; Adrian Bird; Christoph Bock; Bernhard Boehm; Elias Campo; Andrea Caricasole; Fredrik Dahl; Emmanouil T Dermitzakis; Tariq Enver; Manel Esteller; Xavier Estivill; Anne Ferguson-Smith; Jude Fitzgibbon; Paul Flicek; Claudia Giehl; Thomas Graf; Frank Grosveld; Roderic Guigo; Ivo Gut; Kristian Helin; Jonas Jarvius; Ralf Küppers; Hans Lehrach; Thomas Lengauer; Åke Lernmark; David Leslie; Markus Loeffler; Elizabeth Macintyre; Antonello Mai; Joost H A Martens; Saverio Minucci; Willem H Ouwehand; Pier Giuseppe Pelicci; Hèléne Pendeville; Bo Porse; Vardhman Rakyan; Wolf Reik; Martin Schrappe; Dirk Schübeler; Martin Seifert; Reiner Siebert; David Simmons; Nicole Soranzo; Salvatore Spicuglia; Michael Stratton; Hendrik G Stunnenberg; Amos Tanay; David Torrents; Alfonso Valencia; Edo Vellenga; Martin Vingron; Jörn Walter; Spike Willcocks
Journal: Nat Biotechnol Date: 2012-03-07 Impact factor: 54.908

6. An integrated encyclopedia of DNA elements in the human genome.

Authors: The ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

7. Ensembl 2013.

Authors: Paul Flicek; Ikhlak Ahmed; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Carlos García-Girón; Leo Gordon; Thibaut Hourlier; Sarah Hunt; Thomas Juettemann; Andreas K Kähäri; Stephen Keenan; Monika Komorowska; Eugene Kulesha; Ian Longden; Thomas Maurel; William M McLaren; Matthieu Muffato; Rishi Nag; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Emily Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sheppard; Daniel Sobral; Kieron Taylor; Anja Thormann; Stephen Trevanion; Simon White; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Nathan Johnson; Rhoda Kinsella; Anne Parker; Giulietta Spudich; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2012-11-30 Impact factor: 16.971

8. The UCSC Genome Browser database: extensions and updates 2013.

Authors: Laurence R Meyer; Ann S Zweig; Angie S Hinrichs; Donna Karolchik; Robert M Kuhn; Matthew Wong; Cricket A Sloan; Kate R Rosenbloom; Greg Roe; Brooke Rhead; Brian J Raney; Andy Pohl; Venkat S Malladi; Chin H Li; Brian T Lee; Katrina Learned; Vanessa Kirkup; Fan Hsu; Steve Heitner; Rachel A Harte; Maximilian Haeussler; Luvina Guruvadoo; Mary Goldman; Belinda M Giardine; Pauline A Fujita; Timothy R Dreszer; Mark Diekhans; Melissa S Cline; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2012-11-15 Impact factor: 16.971
