
Data structures and compression algorithms for genomic sequence data.

Marty C Brandon, Douglas C Wallace, Pierre Baldi.

Abstract

MOTIVATION: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.
RESULTS: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes, such as Golomb and Elias codes, to variable codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variable human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
AVAILABILITY: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http://mammag.web.uci.edu/bin/view/Mitowiki/ProjectDNACompression.
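
The reference-based idea described in the results can be illustrated with a minimal Python sketch. It is not the authors' implementation: it handles substitutions only (no insertions or deletions), stores each variant position as a gap from the previous variant (relative coordinates), writes each gap with an Elias gamma code, and spends a fixed 2 bits per substituted base. The function names, the 2-bit base table, and the toy sequences are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical fixed 2-bit code for the substituted base (an assumption of this sketch).
BASE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def elias_gamma(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def substitutions(reference: str, sample: str) -> List[Tuple[int, str]]:
    """List (1-based position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i + 1, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def encode(reference: str, sample: str) -> str:
    """Encode sample as gamma-coded gaps plus 2-bit bases, relative to reference."""
    bits = []
    previous = 0
    for position, base in substitutions(reference, sample):
        bits.append(elias_gamma(position - previous))  # relative coordinate, always >= 1
        bits.append(BASE_BITS[base])
        previous = position
    return "".join(bits)


def decode(reference: str, bits: str) -> str:
    """Rebuild the sample from the reference and the encoded bit string."""
    bases = list(reference)
    i = 0
    position = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the leading zeros of the gamma code
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        position += gap
        bases[position - 1] = BITS_BASE[bits[i:i + 2]]
        i += 2
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # toy sequences, not real mitochondrial data
    sample = "ACGTTCGTAA"
    encoded = encode(reference, sample)
    print(encoded, len(encoded), "bits")
    print(decode(reference, encoded) == sample)
```

In this toy case the two substitutions cost 14 bits, compared with 20 bits for storing all ten bases at 2 bits each; the saving grows as variants become sparse relative to sequence length. Swapping the gamma code for a Golomb or Huffman code would change only `elias_gamma` and the matching decoding step.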

Year:  2009        PMID: 19447783      PMCID: PMC2705231          DOI: 10.1093/bioinformatics/btp319

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


