The human genome contracts again.

Dmitri S Pavlichin, Tsachy Weissman, Golan Yona.

Abstract

The number of human genomes that have been completely sequenced for different individuals has increased rapidly in recent years. Storing complete genomes and transferring them between computers so that applications and analysis tools can be run on them will soon become a major hurdle, hindering the analysis phase. Therefore, there is a growing need to compress these data efficiently. Here, we describe a technique to compress human genomes based on entropy coding, using a reference genome and known single nucleotide polymorphisms (SNPs). Furthermore, we explore several intrinsic features of genomes and information in other genomic databases to further improve the compression attained. Using these methods, we compress James Watson's genome to 2.5 megabytes (MB), improving on recent work by 37%. Similar compression is obtained for most genomes available from the 1000 Genomes Project. Our biologically inspired techniques promise even greater gains for genomes of lower organisms and for human genomes as more genomic data become available.

Availability: Code is available at sourceforge.net/projects/genomezip/
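The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea behind reference-based compression, not the authors' genomezip implementation. It assumes substitutions only (no insertions or deletions), stores a target genome as the positions where it differs from a reference, delta-encodes those positions, and estimates how many bits an ideal entropy coder would need for the resulting gaps. All function names and the toy sequences are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def diff_against_reference(reference, target):
    """Return (position, base) pairs where target differs from reference.

    Assumption: substitutions only, so both sequences have equal length.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def position_gaps(positions):
    """Delta-encode sorted positions as distances from the previous position."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def empirical_entropy_bits(symbols):
    """Total bits needed by an ideal entropy coder under the empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())


def reconstruct(reference, variants):
    """Rebuild the target sequence from the reference and the variant list."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)


# Toy data (hypothetical): two substitutions relative to the reference.
reference = "ACGTACGTACGTACGTACGT"
target = "ACGTACCTACGTACGAACGT"

variants = diff_against_reference(reference, target)
gaps = position_gaps([pos for pos, _ in variants])
gap_bits = empirical_entropy_bits(gaps)
```

Only the variant list needs to be stored, since `reconstruct(reference, variants)` recovers the target exactly. A catalogue of known SNPs could shrink this further by replacing each (position, base) pair that matches a catalogued variant with a short index, which is the kind of side information the abstract refers to.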

Year:  2013        PMID: 23793748     DOI: 10.1093/bioinformatics/btt362

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

