Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification.

Y William Yu, Deniz Yorukoglu, Bonnie Berger.

Abstract

It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets.

AVAILABILITY: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
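
The idea that quality scores are partly encoded in the k-mer landscape can be illustrated with a toy sketch. This is not the authors' RQS implementation, and the abstract does not specify the algorithm at this level of detail. The function names, the frequency threshold, the choice of k, and the default quality value of 40 are all assumptions made for illustration. The sketch counts k-mers over a small corpus, treats k-mers seen at least a threshold number of times as "frequent", and replaces the quality score at every read position covered by a frequent k-mer with a constant, keeping the original score elsewhere.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across a corpus of reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sparsify_qualities(read, quals, frequent, k, default_q=40):
    """Replace scores at positions covered by a frequent k-mer.

    Positions not covered by any frequent k-mer keep their original
    score. default_q is an assumed constant, not a value from the paper.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in frequent:
            for j in range(i, i + k):
                covered[j] = True
    return [default_q if c else q for c, q in zip(covered, quals)]


# Toy corpus standing in for a k-mer profile built from other individuals.
corpus = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
k = 4
counts = kmer_counts(corpus, k)
# Assumed threshold: a k-mer is "frequent" if seen at least 3 times.
frequent = {kmer for kmer, n in counts.items() if n >= 3}

read = "ACGTACGA"
quals = [30, 31, 32, 33, 34, 35, 36, 12]
sparse = sparsify_qualities(read, quals, frequent, k)
```

In this toy input only the last base, which falls solely inside the rare k-mer `ACGA`, keeps its original score. The long run of identical values that results is what makes a sparsified quality string easier to compress with a general-purpose compressor.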

Keywords: RQS; accuracy; compression; quality score; sparsification; variant calling

Year: 2014 PMID: 28825060 PMCID: PMC5558603 DOI: 10.1007/978-3-319-05269-4_31

Source DB: PubMed Journal: Res Comput Mol Biol
