
FRESCO: Referential compression of highly similar sequences.

Sebastian Wandelt, Ulf Leser.   

Abstract

In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios, while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios well beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
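
The general idea of referential compression described in the abstract, storing only the differences between an input and a known reference, can be illustrated with a minimal Python sketch. This is not FRESCO's algorithm or file format: the k-mer index, the greedy longest-match strategy, the factor representation, and the names `build_index`, `compress`, and `decompress` are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple, Union

# A factor is either a (position, length) match into the reference
# or a literal string that could not be matched. (Hypothetical format.)
Factor = Union[Tuple[int, int], str]


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(reference: str, target: str, k: int = 4) -> List[Factor]:
    """Greedily encode target as reference matches plus literals."""
    index = build_index(reference, k)
    factors: List[Factor] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[i:i + k], []):
            length = k
            while (pos + length < len(reference)
                   and i + length < len(target)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            # Flush pending literal characters before the match.
            if literal:
                factors.append("".join(literal))
                literal = []
            factors.append((best_pos, best_len))
            i += best_len
        else:
            # No match of at least k characters: keep the raw character.
            literal.append(target[i])
            i += 1
    if literal:
        factors.append("".join(literal))
    return factors


def decompress(reference: str, factors: List[Factor]) -> str:
    """Rebuild the target from the reference and the factor list."""
    parts: List[str] = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

Under these assumptions, `decompress(reference, compress(reference, target))` returns `target` for any pair of strings, and a target that closely resembles the reference reduces to a handful of (position, length) pairs with only short literals in between. A production scheme would additionally need a compact binary encoding of the factors and an index that scales to whole genomes, neither of which this sketch attempts.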

Year:  2013        PMID: 24524158     DOI: 10.1109/tcbb.2013.122

Source DB:  PubMed          Journal:  IEEE/ACM Trans Comput Biol Bioinform        ISSN: 1545-5963            Impact factor:   3.710


  17 in total

1.  Comment on: 'ERGC: an efficient referential genome compression algorithm'.

Authors:  Sebastian Deorowicz; Szymon Grabowski; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-11-28       Impact factor: 6.937

2.  iDoComp: a compression scheme for assembled genomes.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2014-10-24       Impact factor: 6.937

3.  Efficient DNA sequence compression with neural networks.

Authors:  Milton Silva; Diogo Pratas; Armando J Pinho
Journal:  Gigascience       Date:  2020-11-11       Impact factor: 6.524

4.  GDC 2: Compression of large collections of genomes.

Authors:  Sebastian Deorowicz; Agnieszka Danek; Marcin Niemiec
Journal:  Sci Rep       Date:  2015-06-25       Impact factor: 4.379

5.  On-Demand Indexing for Referential Compression of DNA Sequences.

Authors:  Fernando Alves; Vinicius Cogo; Sebastian Wandelt; Ulf Leser; Alysson Bessani
Journal:  PLoS One       Date:  2015-07-06       Impact factor: 3.240

6.  Sequence Factorization with Multiple References.

Authors:  Sebastian Wandelt; Ulf Leser
Journal:  PLoS One       Date:  2015-09-30       Impact factor: 3.240

7.  MAFCO: a compression tool for MAF files.

Authors:  Luís M O Matos; António J R Neves; Diogo Pratas; Armando J Pinho
Journal:  PLoS One       Date:  2015-03-27       Impact factor: 3.240

8.  A new algorithm for "the LCS problem" with application in compressing genome resequencing data.

Authors:  Richard Beal; Tazin Afrin; Aliya Farheen; Donald Adjeroh
Journal:  BMC Genomics       Date:  2016-08-18       Impact factor: 3.969

9.  Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

Authors:  Kelvin V Kredens; Juliano V Martins; Osmar B Dordal; Mauri Ferrandin; Roberto H Herai; Edson E Scalabrin; Bráulio C Ávila
Journal:  PLoS One       Date:  2020-05-26       Impact factor: 3.240

10.  Indexes of large genome collections on a PC.

Authors:  Agnieszka Danek; Sebastian Deorowicz; Szymon Grabowski
Journal:  PLoS One       Date:  2014-10-07       Impact factor: 3.240
