
bwtool: a tool for bigWig files.

Andy Pohl, Miguel Beato.

Abstract

BigWig files are a compressed, indexed, binary format for genome-wide signal data for calculations (e.g. GC percent) or experiments (e.g. ChIP-seq/RNA-seq read depth). bwtool is a tool designed to read bigWig files rapidly and efficiently, providing functionality for extracting data and summarizing it in several ways, globally or at specific regions. Additionally, the tool enables the conversion of the positions of signal data from one genome assembly to another, also known as 'lifting'. We believe bwtool can be useful for the analyst frequently working with bigWig data, which is becoming a standard format to represent functional signals along genomes. The article includes supplementary examples of running the software.
AVAILABILITY AND IMPLEMENTATION: The C source code is freely available under the GNU public license v3 at http://cromatina.crg.eu/bwtool.
© The Author 2014. Published by Oxford University Press.

Year:  2014        PMID: 24489365      PMCID: PMC4029031          DOI: 10.1093/bioinformatics/btu056

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

For many laboratories, it has become an everyday task to generate or to analyze genome-wide data such as ChIP-seq read depth. To facilitate visualization of these data with tools such as the UCSC Genome Browser (Kent et al.) or ENSEMBL (Flicek et al.), or for further processing, it is common to use the wiggle (WIG) file format. This format has a few disadvantages, principally that the files can become large, particularly when care is not taken to store the data at a minimally necessary decimal precision. Another disadvantage is that wiggles exist in three different forms, the choice of which depends on the sparseness of the data. Programs that expect WIG data do not always allow all three formats interchangeably. The bigWig format (Kent et al.) was created as a means for the UCSC Genome Browser to access real-valued signal data remotely hosted on HTTP/FTP servers worldwide. The format is binary, compressed and indexed, and it allows random access to directly query a subset of the larger dataset. In general, programs designed to read bigWig files should treat remote URLs of bigWigs the same as they would treat files local to that computer. BigWig uses an indexing strategy similar to other binary/indexed formats such as bigBed (Kent et al.), binary SAM (BAM) (Li et al.) and tabix-based formats (Li, 2011), but unlike BAM or tabix-based formats, bigWig is specific to numerical data. WIG and BAM are both common data formats and are used by many applications, e.g. Model-based analysis of ChIP-Seq (MACS) (Zhang et al.) and Mixture of Isoforms (MISO) (Katz et al.), respectively, but, to date, there are not many applications that accommodate bigWig data. We have created command-line software under the UNIX operating system called bwtool in a similar spirit to bedtools (Quinlan and Hall, 2010) or samtools (Li et al.) that offers the possibility to carry out a number of diverse operations on bigWigs in a convenient way.
Until now, the common procedure to access the data within bigWig files has been to use the tools available from UCSC: bigWigToWig, bigWigSummary, bigWigAverageOverBed, bigWigMerge, bigWigCorrelate or bigWigInfo. These offer some basic usability for bigWigs. bigWigInfo provides instant information about a bigWig file and is useful for glancing at the overall mean and standard deviation as well as seeing how many bases are covered by the signal. bigWigToWig is indispensable, as it is occasionally necessary to convert a bigWig into the original WIG to use legacy software. Beyond those two, bwtool provides additional features and flexibility not found in other software.
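The three WIG forms mentioned above are not named in the text. Assuming they are the fixedStep, variableStep and bedGraph-style layouts described in the UCSC format documentation, the following Python sketch shows one way a program could accept all three interchangeably by normalizing them to 0-based, half-open intervals. The function name wig_to_intervals is hypothetical and is not part of bwtool or the UCSC tools.

```python
def wig_to_intervals(lines):
    """Yield (chrom, start, end, value) tuples in 0-based, half-open coordinates.

    Hypothetical helper: accepts fixedStep, variableStep and bedGraph-style
    lines. fixedStep/variableStep positions are 1-based; bedGraph is 0-based.
    """
    mode = None          # current declaration type, if any
    chrom = None
    span = 1
    step = 1
    pos = 0              # next 1-based position for fixedStep data
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: key=value parameters follow the keyword.
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                pos = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if len(fields) == 4:
            # bedGraph-style line: chrom, start, end, value (already 0-based).
            yield (fields[0], int(fields[1]), int(fields[2]), float(fields[3]))
        elif mode == "fixedStep" and len(fields) == 1:
            yield (chrom, pos - 1, pos - 1 + span, float(fields[0]))
            pos += step
        elif mode == "variableStep" and len(fields) == 2:
            start = int(fields[0])
            yield (chrom, start - 1, start - 1 + span, float(fields[1]))
        else:
            raise ValueError("unrecognized WIG line: %r" % line)


fixed = ["fixedStep chrom=chr1 start=11 step=5 span=2", "1.5", "2.5"]
variable = ["variableStep chrom=chr1 span=2", "11 1.5", "16 2.5"]
bedgraph = ["chr1\t10\t12\t1.5", "chr1\t15\t17\t2.5"]
# All three inputs describe the same two intervals.
same = (list(wig_to_intervals(fixed))
        == list(wig_to_intervals(variable))
        == list(wig_to_intervals(bedgraph)))
```

Normalizing to one interval representation up front keeps downstream code independent of which form the input file happened to use.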

2 DESCRIPTION

The bwtool program is designed to rapidly collect summary statistics and perform common wiggle manipulations. Options common to many of its utilities include the ability to specify the decimal precision, to fill missing bases with a given value or to provide a BED file specifying the regions of the bigWig to read. The program is a collection of utilities that provide the following features:
Aggregate data by averaging it over a series of given intervals with respect to central bases. This common aggregation procedure is used to produce plots showing enrichment, but it tends to be problematic, particularly when centering on genomic features without a known strand or directionality (Kundaje et al.). For this reason, simple k-means functionality is built in to group regions with similar profiles. Figure 1 demonstrates the aggregate program on data collected from the ENCODE project (Raney et al.).
Fig. 1. Example of aggregated plots of different histone modification ChIP sequence read-depth signals from MCF7 cells from ENCODE aligned at each of the 20 330 protein-coding gene transcription start sites in GENCODE release v17 (Harrow et al.). See Supplement for instructions on how to reproduce this plot. The raw signals in this example are not normalized, so specific values cannot be compared between signals; however, the morphological differences in averaged profiles are nevertheless useful in characterizing the patterns of each histone mark.
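The aggregation behind a plot such as Figure 1 can be sketched in a few lines of Python. This illustrates the general idea only and is not bwtool's implementation: the function name aggregate, the in-memory signal dictionary and the optional strand flip are all assumptions made for the example.

```python
def aggregate(signal, centers, flank):
    """Average a per-base signal over windows centered on given positions.

    signal:  dict mapping chromosome name to a list of per-base values
             (None marks a base with no data). Hypothetical in-memory stand-in
             for data read from a bigWig.
    centers: list of (chrom, position, strand) with 0-based positions.
             Windows on the '-' strand are reversed before averaging, so that
             upstream and downstream line up across features.
    flank:   number of bases taken on each side of the center.
    Returns a list of 2 * flank + 1 means (None where no window had data).
    """
    width = 2 * flank + 1
    sums = [0.0] * width
    counts = [0] * width
    for chrom, center, strand in centers:
        values = signal[chrom]
        for offset in range(-flank, flank + 1):
            pos = center + offset
            if pos < 0 or pos >= len(values) or values[pos] is None:
                continue  # off the chromosome or missing data
            idx = offset + flank
            if strand == "-":
                idx = width - 1 - idx
            sums[idx] += values[pos]
            counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


signal = {"chr1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]}
profile = aggregate(signal, [("chr1", 2, "+"), ("chr1", 7, "-")], flank=1)
```

Dropping the strand flip in this sketch averages a rising window with a falling one, which is one way the problem with unoriented features noted above can show up.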

‘Lift’, i.e. project data from one genome assembly to another using a ‘liftOver chain’ file, available from the UCSC Genome Browser utilities page (Kuhn et al.). Lifting data often results in a small percentage of data loss, so care must be taken to ensure that the only lifted data analyzed is that which is within regions lifting correctly. Options are available to catalog all of the problematic regions involved.
Quickly find regions in the bigWig exhibiting local minima/maxima, or above/below specified thresholds.
Extract equally sized intervals of data as a matrix or as a sliding window at adjustable steps and sizes. Again, clustering is available as an option when extracting data as a matrix. A random matrix of data can also be produced, with the ability to exclude specific regions in the genome.
Extract unequally sized intervals with the extract utility.
Extract data from multiple bigWigs with the paste utility. This outputs tab-delimited data from a set of bigWigs, one base per line. Pasting bigWigs together makes it possible to perform many complex calculations with small auxiliary scripts. In this way, bwtool can easily extend the functionality of bigWigMerge and bigWigCorrelate from UCSC.
Discretize the real-valued signal into letters, using the Symbolic Aggregate Approximation (SAX) algorithm (Shieh and Keogh, 2009).
Remove data based on thresholds and specific regions if desired. Conversely, regions missing data in a bigWig can be replaced with a constant using the fill utility.
Summarize data at specific regions. This functionality is similar to the combined programs of bigWigSummary and bigWigAverageOverBed, with the addition of median and optional quantile information in the output.
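As an illustration of the kind of small auxiliary script that pasted output makes possible, the Python sketch below computes a Pearson correlation between two signals. The column layout is an assumption made for the example (chromosome, start, end, then one value per bigWig, with "NA" for missing data) and should be checked against the actual output of the paste utility. The function names parse_pasted and pearson are hypothetical.

```python
import math


def parse_pasted(lines):
    """Return rows of float values from pasted, tab-delimited lines.

    Assumed layout (not confirmed by the article): chrom, start, end,
    value_1, value_2, ... Lines with any missing or non-numeric value
    are skipped.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # need at least two value columns
        try:
            values = [float(v) for v in fields[3:]]
        except ValueError:
            continue  # e.g. "NA"
        if any(math.isnan(v) for v in values):
            continue
        rows.append(values)
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)


pasted = [
    "chr1\t0\t1\t1.0\t2.0",
    "chr1\t1\t2\t2.0\t4.0",
    "chr1\t2\t3\tNA\t5.0",   # skipped: first signal has no data here
    "chr1\t3\t4\t3.0\t6.0",
]
rows = parse_pasted(pasted)
r = pearson([row[0] for row in rows], [row[1] for row in rows])
```

In practice the lines would come from a file or a pipe rather than a list, and any other per-base calculation can replace the correlation.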

3 USAGE AND AVAILABILITY

bwtool is command-line software for UNIX, a common platform for bioinformatics researchers to conduct analysis. Running the bwtool command without additional parameters displays a description of the various utilities and some general options. Combined with a utility name, bwtool will display specific information about how to perform an operation using that utility. A detailed guide has been created on bwtool's web page (http://cromatina.crg.eu/bwtool) to provide thorough examples of using the program. bwtool is written in C. The source code for the program is available on its GitHub web page. Distributed (with permission) with bwtool is the basic C library from Jim Kent that is needed for routines specific to bigWig data, as well as other algorithmic code. Jim Kent and the University of California hold the copyright to this specific library, but the remaining code is covered by the GNU Public License v3. bwtool makes use of GNU autotools to simplify the installation process to the standard ‘./configure’, ‘make’, ‘make install’ procedure most UNIX users will be familiar with. To verify the accuracy of the software, tests may be run with ‘make check’. bwtool does not require additional libraries that are not typically found in common UNIX environments, but if the GNU Scientific Library is installed, it will make use of that for the random utility.

REFERENCES

1.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

2.  Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2011-01-05       Impact factor: 6.937

3.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

4.  BEDTools: a flexible suite of utilities for comparing genomic features.

Authors:  Aaron R Quinlan; Ira M Hall
Journal:  Bioinformatics       Date:  2010-01-28       Impact factor: 6.937

5.  BigWig and BigBed: enabling browsing of large distributed datasets.

Authors:  W J Kent; A S Zweig; G Barber; A S Hinrichs; D Karolchik
Journal:  Bioinformatics       Date:  2010-07-17       Impact factor: 6.937

6.  GENCODE: the reference human genome annotation for The ENCODE Project.

Authors:  Jennifer Harrow; Adam Frankish; Jose M Gonzalez; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen L Aken; Daniel Barrell; Amonida Zadissa; Stephen Searle; If Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles Steward; Rachel Harte; Michael Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael Tress; Jose Manuel Rodriguez; Iakes Ezkurdia; Jeltje van Baren; Michael Brent; David Haussler; Manolis Kellis; Alfonso Valencia; Alexandre Reymond; Mark Gerstein; Roderic Guigó; Tim J Hubbard
Journal:  Genome Res       Date:  2012-09       Impact factor: 9.043

7.  Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements.

Authors:  Anshul Kundaje; Sofia Kyriazopoulou-Panagiotopoulou; Max Libbrecht; Cheryl L Smith; Debasish Raha; Elliott E Winters; Steven M Johnson; Michael Snyder; Serafim Batzoglou; Arend Sidow
Journal:  Genome Res       Date:  2012-09       Impact factor: 9.043

8.  ENCODE whole-genome data in the UCSC genome browser (2011 update).

Authors:  Brian J Raney; Melissa S Cline; Kate R Rosenbloom; Timothy R Dreszer; Katrina Learned; Galt P Barber; Laurence R Meyer; Cricket A Sloan; Venkat S Malladi; Krishna M Roskin; Bernard B Suh; Angie S Hinrichs; Hiram Clawson; Ann S Zweig; Vanessa Kirkup; Pauline A Fujita; Brooke Rhead; Kayla E Smith; Andy Pohl; Robert M Kuhn; Donna Karolchik; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2010-10-30       Impact factor: 16.971

9.  The UCSC genome browser and associated tools.

Authors:  Robert M Kuhn; David Haussler; W James Kent
Journal:  Brief Bioinform       Date:  2012-08-20       Impact factor: 11.622

10.  Model-based analysis of ChIP-Seq (MACS).

Authors:  Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal:  Genome Biol       Date:  2008-09-17       Impact factor: 13.583
