
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.

Anthony J Cox, Markus J Bauer, Tobias Jakobi, Giovanna Rosone.

Abstract

MOTIVATION: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.
RESULTS: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realized without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gb of sequence to fit into only 8.2 GB of space (trimming a small proportion of low-quality bases from the reads improves the compression still further). This is >4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.
AVAILABILITY: Code to construct the BWT and SAP-array on large genomic datasets is part of the BEETL library, available as a github repository at https://github.com/BEETL/BEETL.
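
ILLUSTRATION: The sketch below is not taken from BEETL and is not the authors' algorithm. It is a naive, quadratic-time toy that builds the BWT of a small read collection and counts its runs, to show why the order of the reads can affect compressibility. The function names (bwt_collection, count_runs, bits_per_base), the example reads, and the use of plain lexicographic sorting as the reordering are all assumptions made for illustration; the paper's own reordering and 'implicit sorting' strategy are not reproduced here. The last function checks the abstract's arithmetic, assuming "Gb" means 10^9 bases and "GB" means 10^9 bytes.

```python
from itertools import groupby


def bwt_collection(reads):
    """Naive BWT of a read collection (toy sizes only).

    Each read gets its own end marker; marker i sorts before marker j
    when i < j, and every marker sorts before A, C, G and T.
    All markers are printed as '$' in the output.
    """
    suffixes = []
    for i, read in enumerate(reads):
        for start in range(len(read) + 1):
            # Sort key: the suffix characters, then this read's end marker.
            key = [(1, c) for c in read[start:]] + [(0, i)]
            # BWT symbol: the character preceding the suffix, or the
            # end marker when the suffix is the whole read.
            prev = read[start - 1] if start > 0 else "$"
            suffixes.append((key, prev))
    suffixes.sort(key=lambda item: item[0])
    return "".join(prev for _, prev in suffixes)


def count_runs(text):
    """Number of maximal runs of identical symbols in text."""
    return sum(1 for _ in groupby(text))


def bits_per_base(compressed_bytes, bases):
    """Compressed size expressed in bits per sequenced base."""
    return compressed_bytes * 8 / bases


if __name__ == "__main__":
    reads = ["ACGT", "TCGT", "ACGT", "TCGT"]  # hypothetical toy reads
    for label, collection in (("as given", reads), ("sorted", sorted(reads))):
        bwt = bwt_collection(collection)
        print(label, bwt, count_runs(bwt))
    # Figures from the abstract: 8.2 GB for 135.3 Gb of sequence.
    print(round(bits_per_base(8.2e9, 135.3e9), 3))
```

In this toy case the sorted collection yields fewer runs in the BWT, which is the property a run-length or second-stage compressor can exploit. Under the decimal-unit assumption, 8.2 GB for 135.3 Gb works out to roughly 0.485 bits per base, consistent with the abstract's "under 0.5 bits per base".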

Year:  2012        PMID: 22556365     DOI: 10.1093/bioinformatics/bts173

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


