
Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

Y William Yu, Deniz Yorukoglu, Bonnie Berger.

Abstract

It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms other de novo quality score compression methods in both compression ratio and speed while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. AVAILABILITY: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
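The idea of deriving quality information from a k-mer density profile can be illustrated with a minimal Python sketch. It is not the RQS implementation: the function names (`build_profile`, `sparsify`), the exact-match rule, the count threshold `min_count`, and the constant `fill` value are all assumptions made for illustration. The sketch counts k-mers across a corpus of reads, keeps the frequent ones as a profile, and replaces the quality values of read positions covered by a profile k-mer with a single constant, leaving the remaining values untouched.

```python
from collections import Counter


def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_profile(corpus_reads, k, min_count):
    """Return the set of k-mers seen at least min_count times in the corpus.

    min_count is a hypothetical density threshold, not a value from the paper.
    """
    counts = Counter(kmer for read in corpus_reads for _, kmer in kmers(read, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def sparsify(seq, quals, profile, k, fill):
    """Replace quality values at positions covered by a profile k-mer with fill.

    Positions not covered by any profile k-mer keep their original quality.
    """
    covered = [False] * len(seq)
    for i, kmer in kmers(seq, k):
        if kmer in profile:
            for j in range(i, i + k):
                covered[j] = True
    return [fill if c else q for c, q in zip(covered, quals)]


# Toy corpus standing in for reads from other individuals (made-up data).
corpus = ["ACGTACGT", "ACGTACGA", "TTGTACGT"]
profile = build_profile(corpus, k=4, min_count=2)

# A read whose fifth base differs from the common sequence.
smoothed = sparsify("ACGTTCGT", [30, 31, 32, 33, 12, 34, 35, 36],
                    profile, k=4, fill=40)
print(smoothed)
```

In this toy run only the first four positions are covered by a frequent k-mer, so only their quality values collapse to the constant. Long runs of identical values are what make the sparsified quality string easier to compress.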


Keywords:  RQS; accuracy; compression; quality score; sparsification; variant calling

Year: 2014    PMID: 28825060    PMCID: PMC5558603    DOI: 10.1007/978-3-319-05269-4_31

Source DB: PubMed    Journal: Res Comput Mol Biol


