Literature DB >> 23969136

Genome compression: a novel approach for large collections.

Sebastian Deorowicz1, Agnieszka Danek, Szymon Grabowski.   

Abstract

MOTIVATION: Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms.
RESULTS: We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ∼400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions. AVAILABILITY: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license. Supplementary data: Supplementary data are available at Bioinformatics online.

Entities:  

Mesh:

Year:  2013        PMID: 23969136     DOI: 10.1093/bioinformatics/btt460

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  14 in total

1.  smallWig: parallel compression of RNA-seq WIG files.

Authors:  Zhiying Wang; Tsachy Weissman; Olgica Milenkovic
Journal:  Bioinformatics       Date:  2015-09-30       Impact factor: 6.937

2.  ERGC: an efficient referential genome compression algorithm.

Authors:  Subrata Saha; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2015-07-02       Impact factor: 6.937

3.  Compression and fast retrieval of SNP data.

Authors:  Francesco Sambo; Barbara Di Camillo; Gianna Toffolo; Claudio Cobelli
Journal:  Bioinformatics       Date:  2014-07-26       Impact factor: 6.937

4.  NRGC: a novel referential genome compression algorithm.

Authors:  Subrata Saha; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2016-08-02       Impact factor: 6.937

5.  GTRAC: fast retrieval from compressed collections of genomic variants.

Authors:  Kedar Tatwawadi; Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Bioinformatics       Date:  2016-09-01       Impact factor: 6.937

6.  iDoComp: a compression scheme for assembled genomes.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2014-10-24       Impact factor: 6.937

7.  GDC 2: Compression of large collections of genomes.

Authors:  Sebastian Deorowicz; Agnieszka Danek; Marcin Niemiec
Journal:  Sci Rep       Date:  2015-06-25       Impact factor: 4.379

8.  Sequence Factorization with Multiple References.

Authors:  Sebastian Wandelt; Ulf Leser
Journal:  PLoS One       Date:  2015-09-30       Impact factor: 3.240

9.  Data compression for sequencing data.

Authors:  Sebastian Deorowicz; Szymon Grabowski
Journal:  Algorithms Mol Biol       Date:  2013-11-18       Impact factor: 1.405

10.  Indexes of large genome collections on a PC.

Authors:  Agnieszka Danek; Sebastian Deorowicz; Szymon Grabowski
Journal:  PLoS One       Date:  2014-10-07       Impact factor: 3.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.