
Efficient storage of high throughput DNA sequencing data using reference-based compression.

Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney.

Abstract

Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. The method thus provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
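
The core idea in the abstract, storing only the differences between an aligned read and the reference, can be illustrated with a minimal sketch. This is not the authors' encoding: it assumes the read is already aligned without gaps, handles substitutions only, and ignores quality scores and unaligned reads. The record layout and the names `encode_read` and `decode_read` are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout (illustration only, not the paper's format):
# (start position on the reference, read length, [(offset in read, base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Encoded:
    """Encode an ungapped, already-aligned read as its mismatches against the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, base)
             for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return pos, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference window and applying the mismatches."""
    pos, length, diffs = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "TACGAACG"  # aligned at position 3 with a single mismatch
encoded = encode_read(reference, read, 3)
restored = decode_read(reference, encoded)
```

In this toy layout a read that matches the reference exactly costs only a position and a length, whatever its length, which gives some intuition for why the stored size per base can fall as reads get longer.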


Year:  2011        PMID: 21245279      PMCID: PMC3083090          DOI: 10.1101/gr.114819.110

Source DB:  PubMed          Journal:  Genome Res        ISSN: 1088-9051            Impact factor:   9.043


References (14 in total)

1.  Biological sequence compression algorithms.

Authors:  T Matsumoto; K Sadakane; H Imai
Journal:  Genome Inform Ser Workshop Genome Inform       Date:  2000

2.  DNACompress: fast and effective DNA sequence compression.

Authors:  Xin Chen; Ming Li; Bin Ma; John Tromp
Journal:  Bioinformatics       Date:  2002-12       Impact factor: 6.937

3.  The ENCODE (ENCyclopedia Of DNA Elements) Project.

Authors:  ENCODE Project Consortium
Journal:  Science       Date:  2004-10-22       Impact factor: 47.728

4.  Genome sequencing in microfabricated high-density picolitre reactors.

Authors:  Marcel Margulies; Michael Egholm; William E Altman; Said Attiya; Joel S Bader; Lisa A Bemben; Jan Berka; Michael S Braverman; Yi-Ju Chen; Zhoutao Chen; Scott B Dewell; Lei Du; Joseph M Fierro; Xavier V Gomes; Brian C Godwin; Wen He; Scott Helgesen; Chun Heen Ho; Chun He Ho; Gerard P Irzyk; Szilveszter C Jando; Maria L I Alenquer; Thomas P Jarvie; Kshama B Jirage; Jong-Bum Kim; James R Knight; Janna R Lanza; John H Leamon; Steven M Lefkowitz; Ming Lei; Jing Li; Kenton L Lohman; Hong Lu; Vinod B Makhijani; Keith E McDade; Michael P McKenna; Eugene W Myers; Elizabeth Nickerson; John R Nobile; Ramona Plant; Bernard P Puc; Michael T Ronan; George T Roth; Gary J Sarkis; Jan Fredrik Simons; John W Simpson; Maithreyan Srinivasan; Karrie R Tartaro; Alexander Tomasz; Kari A Vogt; Greg A Volkmer; Shally H Wang; Yong Wang; Michael P Weiner; Pengguang Yu; Richard F Begley; Jonathan M Rothberg
Journal:  Nature       Date:  2005-07-31       Impact factor: 49.962

5.  Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors:  B Ewing; P Green
Journal:  Genome Res       Date:  1998-03       Impact factor: 9.043

6.  A map of human genome variation from population-scale sequencing.

Authors:  Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal:  Nature       Date:  2010-10-28       Impact factor: 49.962

7.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

8.  DNA sequencing with chain-terminating inhibitors.

Authors:  F Sanger; S Nicklen; A R Coulson
Journal:  Proc Natl Acad Sci U S A       Date:  1977-12       Impact factor: 11.205

9.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

10.  Archiving next generation sequencing data.

Authors:  Martin Shumway; Guy Cochrane; Hideaki Sugawara
Journal:  Nucleic Acids Res       Date:  2009-12-03       Impact factor: 16.971

Cited by (124 in total)

1.  Aligned genomic data compression via improved modeling.

Authors:  Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  J Bioinform Comput Biol       Date:  2014-12       Impact factor: 1.122

2.  Compressive genomics.

Authors:  Po-Ru Loh; Michael Baym; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2012-07-10       Impact factor: 54.908

3.  Using Genome Query Language to uncover genetic variation.

Authors:  Christos Kozanitis; Andrew Heiberg; George Varghese; Vineet Bafna
Journal:  Bioinformatics       Date:  2013-06-10       Impact factor: 6.937

4.  Compressive mapping for next-generation sequencing.

Authors:  Deniz Yorukoglu; Yun William Yu; Jian Peng; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2016-04       Impact factor: 54.908

5.  LFQC: a lossless compression algorithm for FASTQ files.

Authors:  Marius Nicolae; Sudipta Pathak; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2015-06-20       Impact factor: 6.937

6.  QVZ: lossy compression of quality values.

Authors:  Greg Malysa; Mikel Hernaez; Idoia Ochoa; Milind Rao; Karthik Ganesan; Tsachy Weissman
Journal:  Bioinformatics       Date:  2015-05-28       Impact factor: 6.937

7.  ERGC: an efficient referential genome compression algorithm.

Authors:  Subrata Saha; Sanguthevar Rajasekaran
Journal:  Bioinformatics       Date:  2015-07-02       Impact factor: 6.937

8.  Quality score compression improves genotyping accuracy.

Authors:  Y William Yu; Deniz Yorukoglu; Jian Peng; Bonnie Berger
Journal:  Nat Biotechnol       Date:  2015-03       Impact factor: 54.908

9.  Cram-JS: reference-based decompression in node and the browser.

Authors:  Robert Buels; Shihab Dider; Colin Diesh; James Robinson; Ian Holmes
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

10.  Existing and emerging technologies for tumor genomic profiling. (Review)

Authors:  Laura E MacConaill
Journal:  J Clin Oncol       Date:  2013-04-15       Impact factor: 44.544
