Effect of lossy compression of quality scores on variant calling.

Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, Euan Ashley.

Abstract

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we analyze how accurate the output of the variant caller is, both for the original data and for data that have been lossily compressed. We show that lossy compression can significantly reduce storage requirements while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
© The Author 2016. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
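
The comparison described in the abstract (variant calls from the original data and from lossily compressed data, each scored against a gold standard) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `evaluate_calls` helper, the representation of a variant as a `(chromosome, position, ref, alt)` tuple, and the toy call sets are hypothetical and are not taken from the paper, whose actual pipeline and metrics may differ.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant representation: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float
    precision: float
    f_score: float


def evaluate_calls(called: Set[Variant], gold: Set[Variant]) -> CallMetrics:
    """Score a set of variant calls against a gold-standard set by exact match."""
    tp = len(called & gold)   # called and present in the gold standard
    fp = len(called - gold)   # called but absent from the gold standard
    fn = len(gold - called)   # in the gold standard but not called
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + precision
    f_score = 2 * sensitivity * precision / denom if denom else 0.0
    return CallMetrics(tp, fp, fn, sensitivity, precision, f_score)


# Toy data, invented for illustration only (not results from the paper).
gold = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 900, "T", "C"),
}
original_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 500, "A", "C"),
}
lossy_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 500, "A", "C"),
}

original_metrics = evaluate_calls(original_calls, gold)
lossy_metrics = evaluate_calls(lossy_calls, gold)

print("original:", original_metrics)
print("lossy:   ", lossy_metrics)
```

In practice the call sets would be parsed from the VCF files produced by the variant caller, and matching may need normalization of INDEL representations; the exact-match set comparison above is only the simplest possible stand-in.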

Keywords:  Genomic data; lossy compression; quality scores; variant calling

Year: 2017; PMID: 26966283; PMCID: PMC5862240; DOI: 10.1093/bib/bbw011

Source DB: PubMed; Journal: Brief Bioinform; ISSN: 1467-5463; Impact factor: 11.622


  14 in total

1.  SPRING: a next-generation compressor for FASTQ data.

Authors:  Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2019-08-01       Impact factor: 6.937

Review 2.  Towards precision medicine.

Authors:  Euan A Ashley
Journal:  Nat Rev Genet       Date:  2016-08-16       Impact factor: 53.242

3.  An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values.

Authors:  Claudio Alberti; Noah Daniels; Mikel Hernaez; Jan Voges; Rachel L Goldfeder; Ana A Hernandez-Lopez; Marco Mattavelli; Bonnie Berger
Journal:  Proc Data Compress Conf       Date:  2016-12-19

4.  CALQ: compression of quality values of aligned sequencing data.

Authors:  Jan Voges; Jörn Ostermann; Mikel Hernaez
Journal:  Bioinformatics       Date:  2018-05-15       Impact factor: 6.937

5.  A cluster-based approach to compression of Quality Scores.

Authors:  Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Proc Data Compress Conf       Date:  2016-12-19

6.  CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores.

Authors:  I Ochoa; A No; M Hernaez; T Weissman
Journal:  Proc Inf Theory Workshop       Date:  2016-10-27

7.  GeneComp, a new reference-based compressor for SAM files.

Authors:  Reggy Long; Mikel Hernaez; Idoia Ochoa; Tsachy Weissman
Journal:  Proc Data Compress Conf       Date:  2017-05-11

8.  Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Authors:  Idoia Ochoa; Mikel Hernaez; Rachel Goldfeder; Tsachy Weissman; Euan Ashley
Journal:  Proc Data Compress Conf       Date:  2016-12-19

9.  ACO: lossless quality score compression based on adaptive coding order.

Authors:  Yi Niu; Mingming Ma; Fu Li; Xianming Liu; Guangming Shi
Journal:  BMC Bioinformatics       Date:  2022-06-07       Impact factor: 3.307

10.  DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing.

Authors:  Byunghan Lee; Taesup Moon; Sungroh Yoon; Tsachy Weissman
Journal:  PLoS One       Date:  2017-07-27       Impact factor: 3.240
