
smallWig: parallel compression of RNA-seq WIG files.

Zhiying Wang, Tsachy Weissman, Olgica Milenkovic.

Abstract

MOTIVATION: The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.
CONTRIBUTIONS: We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.
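As a rough illustration of what "empirical entropy" means as a compression benchmark, the sketch below computes the zeroth-order empirical entropy of a toy coverage track before and after a run-length transform. The run-length transform, the function names and the sample values are assumptions chosen for illustration; the abstract does not specify which transform domains smallWig actually uses.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def run_lengths(values):
    """Collapse consecutive repeats into (value, run_length) pairs.

    Illustrative transform only (an assumption, not taken from the abstract).
    """
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


# Hypothetical per-base coverage values, as might appear in a WIG track.
coverage = [0, 0, 0, 0, 3, 3, 3, 5, 5, 0, 0, 0]
runs = run_lengths(coverage)

# Lower bound (in bits) for a coder that treats symbols as i.i.d.
bits_raw = len(coverage) * empirical_entropy(coverage)
bits_runs = len(runs) * empirical_entropy(runs)
print(f"raw: {bits_raw:.1f} bits, run-length domain: {bits_runs:.1f} bits")
```

Comparing the two totals shows how the choice of transform domain changes the entropy bound that an entropy coder can approach.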
RESULTS: We tested different variants of the smallWig compression algorithm on a number of integer- and real-valued (floating point) RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55%, respectively, represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times.
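The trade-off behind the random query mode can be sketched with a toy block-based scheme: values are compressed in fixed-size blocks so that a range query decompresses only the blocks it overlaps, at the cost of some per-block overhead. This is a minimal sketch under stated assumptions, not the smallWig implementation: `zlib` stands in for the entropy coder, and `BLOCK_SIZE`, `compress_blocks` and `query` are hypothetical names.

```python
import struct
import zlib

BLOCK_SIZE = 4  # hypothetical; a real tool would use much larger blocks


def compress_blocks(values, block_size=BLOCK_SIZE):
    """Compress integer values in independent fixed-size blocks."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)  # 32-bit little-endian
        blocks.append(zlib.compress(raw))  # zlib is a stand-in coder
    return blocks


def query(blocks, lo, hi, block_size=BLOCK_SIZE):
    """Return values[lo:hi], decompressing only the overlapping blocks."""
    out = []
    first, last = lo // block_size, (hi - 1) // block_size
    for b in range(first, last + 1):
        raw = zlib.decompress(blocks[b])
        chunk = struct.unpack(f"<{len(raw) // 4}i", raw)
        base = b * block_size  # index of the block's first value
        out.extend(chunk[max(lo - base, 0):min(hi - base, len(chunk))])
    return out


values = list(range(10))
blocks = compress_blocks(values)
print(query(blocks, 3, 9))
```

Because each block is coded independently, smaller blocks give faster queries but compress less well, which is one plausible reading of the parameter-dependent overhead reported above.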
AVAILABILITY AND IMPLEMENTATION: The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.
CONTACT: zhiyingw@stanford.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

Year:  2015        PMID: 26424856      PMCID: PMC6078172          DOI: 10.1093/bioinformatics/btv561

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

  4 in total

1.  ChIPWig: a random access-enabling lossless and lossy compression method for ChIP-seq data.

Authors:  Vida Ravanmehr; Minji Kim; Zhiying Wang; Olgica Milenkovic
Journal:  Bioinformatics       Date:  2018-03-15       Impact factor: 6.937

2.  Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools.

Authors:  Hao Hou; Brent Pedersen; Aaron Quinlan
Journal:  Nat Comput Sci       Date:  2021-06-21

Review 3.  Single-cell Transcriptome Study as Big Data.

Authors:  Pingjian Yu; Wei Lin
Journal:  Genomics Proteomics Bioinformatics       Date:  2016-02-11       Impact factor: 7.691

4.  Productive visualization of high-throughput sequencing data using the SeqCode open portable platform.

Authors:  Enrique Blanco; Mar González-Ramírez; Luciano Di Croce
Journal:  Sci Rep       Date:  2021-10-01       Impact factor: 4.379
