GeneCodeq: quality score compression and improved genotyping using a Bayesian framework.

Daniel L Greenfield, Oliver Stegle, Alban Rrustemi.

Abstract

MOTIVATION: The exponential reduction in the cost of genome sequencing has resulted in rapid growth of genomic data. Most of the entropy of short-read data lies not in the sequence of read bases themselves but in their quality scores, the confidence measurements that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits, and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses.
RESULTS: We propose GeneCodeq, a Bayesian method inspired by coding theory that adjusts quality scores to improve their compressibility without adversely impacting genotyping accuracy. Our model leverages a corpus of k-mers to reduce the entropy of the quality scores and thereby improve the compressibility of these data (in FASTQ or SAM/BAM/CRAM files), resulting in compression ratios that significantly exceed those of other methods. Our approach can also be combined with existing lossy compression schemes to further reduce entropy, and it allows the user to specify a reference panel of expected sequence variations to improve model accuracy. In addition to an extensive empirical evaluation, we derive novel theoretical insights that explain the empirical performance and pitfalls of corpus-based quality score compression schemes in general. Finally, we show that, as a positive side effect of compression, the model can lead to improved genotyping accuracy.
AVAILABILITY AND IMPLEMENTATION: GeneCodeq is available at: github.com/genecodeq/eval
CONTACT: dan@petagene.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
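
The abstract's central quantity, the entropy of quality scores, can be illustrated with a small sketch. This is not GeneCodeq's Bayesian model: it only computes the zeroth-order empirical entropy of a quality string and shows that a simple lossy adjustment (fixed-width binning, swapped in here purely for illustration) lowers it. The Phred+33 encoding is assumed, and the quality string, `bin_size` and all function names are hypothetical.

```python
import math
from collections import Counter


def phred_to_error_prob(qual_char, offset=33):
    """Convert one Phred+33 quality character to a base-call error probability."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)


def empirical_entropy(symbols):
    """Zeroth-order Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def quantize(qual, bin_size=10, offset=33):
    """Illustrative lossy step (an assumption, not the paper's method):
    round each score down to the start of its bin."""
    return "".join(
        chr((ord(ch) - offset) // bin_size * bin_size + offset) for ch in qual
    )


# Hypothetical quality string for a single read (Phred+33).
quality = "IIIIHHGGFFEEDDCCBBAA@@??"

before = empirical_entropy(quality)
after = empirical_entropy(quantize(quality))
print(f"entropy before: {before:.3f} bits/score")
print(f"entropy after:  {after:.3f} bits/score")
```

Because binning maps many scores to one, the entropy can only stay the same or fall. The harder problem the abstract addresses is choosing adjustments that do not degrade downstream genotyping.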

Year: 2016    PMID: 27354700    DOI: 10.1093/bioinformatics/btw385

Source DB: PubMed    Journal: Bioinformatics    ISSN: 1367-4803    Impact factor: 6.937


  5 in total

1.  Cram-JS: reference-based decompression in node and the browser.

Authors:  Robert Buels; Shihab Dider; Colin Diesh; James Robinson; Ian Holmes
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

2.  MZPAQ: a FASTQ data compression tool.

Authors:  Achraf El Allali; Mariam Arshad
Journal:  Source Code Biol Med       Date:  2019-06-03

3.  IonCRAM: a reference-based compression tool for ion torrent sequence files.

Authors:  Moustafa Shokrof; Mohamed Abouelhoda
Journal:  BMC Bioinformatics       Date:  2020-09-09       Impact factor: 3.169

4.  Crumble: reference free lossy compression of sequence quality values.

Authors:  James K Bonfield; Shane A McCarthy; Richard Durbin
Journal:  Bioinformatics       Date:  2019-01-15       Impact factor: 6.937

5.  Better quality score compression through sequence-based quality smoothing.

Authors:  Yoshihiro Shibuya; Matteo Comin
Journal:  BMC Bioinformatics       Date:  2019-11-22       Impact factor: 3.169
