Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

Yuansheng Liu, Jinyan Li.

Abstract

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition, the Hamming-Shifting graph, to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines and aims to link all pairs of distinct reads that have a small Hamming distance, a small shifting offset, or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove that these edges are detected with very high probability. The resulting graph provides a full mutual reference among the reads, enabling a cascaded, code-minimized transfer of every child read for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved a 10-30% greater file size reduction than the best compression results of existing algorithms. As future work, the separation and connectivity degrees of these giant graphs could serve as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
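The ideas named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the edge-weight formula (mismatches plus a per-base shift penalty), the `max_shift` and `shift_cost` parameters, and the use of a single minimal k-mer per read (the paper uses multiple) are all assumptions made for illustration. The spanning forest is built with a plain Kruskal pass over a union-find structure.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def shifted_mismatches(a: str, b: str, shift: int) -> int:
    """Mismatches in the overlap when b starts `shift` bases after a."""
    return sum(x != y for x, y in zip(a[shift:], b))


def edge_weight(a: str, b: str, max_shift: int = 3, shift_cost: int = 1) -> int:
    """Hypothetical weight: mismatches plus a shift penalty, minimised
    over shifts 0..max_shift in both directions (an assumed formula)."""
    return min(
        shifted_mismatches(x, y, s) + shift_cost * s
        for s in range(max_shift + 1)
        for x, y in ((a, b), (b, a))
    )


def minimizer(read: str, k: int) -> str:
    """Lexicographically smallest k-mer of the read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def candidate_pairs(reads, k):
    """Pairs of read indices that share the same minimal k-mer.
    Only these pairs are compared, avoiding an all-against-all search."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        buckets[minimizer(read, k)].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm over (weight, u, v) edges on nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # keep the edge only if it joins two components
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


reads = ["ACGTACGT", "ACGTACGA", "CGTACGTT", "TTTTGGGG"]
pairs = candidate_pairs(reads, k=4)
edges = [(edge_weight(reads[u], reads[v]), u, v) for u, v in pairs]
forest = minimum_spanning_forest(len(reads), edges)
```

In this toy input, reads 0 and 1 differ by one base yet fall into different buckets, because the substitution changes the minimal k-mer. That missed pair shows why indexing with a single minimal k-mer is not enough and is consistent with the abstract's use of multiple minimal k-mers per read.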

Year:  2021        PMID: 34280186     DOI: 10.1371/journal.pcbi.1009229

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


  22 in total

1.  Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors:  Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal:  Genome Res       Date:  2011-01-18       Impact factor: 9.043

2.  Disk-based compression of data from genome sequencing.

Authors:  Szymon Grabowski; Sebastian Deorowicz; Łukasz Roguski
Journal:  Bioinformatics       Date:  2014-12-22       Impact factor: 6.937

3.  SPRING: a next-generation compressor for FASTQ data.

Authors:  Shubham Chandak; Kedar Tatwawadi; Idoia Ochoa; Mikel Hernaez; Tsachy Weissman
Journal:  Bioinformatics       Date:  2019-08-01       Impact factor: 6.937

4.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2016-03-19       Impact factor: 6.937

5.  Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers.

Authors:  Yuansheng Liu; Xiaocai Zhang; Quan Zou; Xiangxiang Zeng
Journal:  Bioinformatics       Date:  2021-07-12       Impact factor: 6.937

6.  Highly accurate fluorogenic DNA sequencing with information theory-based error correction.

Authors:  Zitian Chen; Wenxiong Zhou; Shuo Qiao; Li Kang; Haifeng Duan; X Sunney Xie; Yanyi Huang
Journal:  Nat Biotechnol       Date:  2017-11-06       Impact factor: 54.908

7.  FaStore: a space-saving solution for raw sequencing data.

Authors:  Lukasz Roguski; Idoia Ochoa; Mikel Hernaez; Sebastian Deorowicz
Journal:  Bioinformatics       Date:  2018-08-15       Impact factor: 6.937

8.  Improving the performance of minimizers and winnowing schemes.

Authors:  Guillaume Marçais; David Pellow; Daniel Bork; Yaron Orenstein; Ron Shamir; Carl Kingsford
Journal:  Bioinformatics       Date:  2017-07-15       Impact factor: 6.937

9.  Compressing DNA sequence databases with coil.

Authors:  W Timothy J White; Michael D Hendy
Journal:  BMC Bioinformatics       Date:  2008-05-20       Impact factor: 3.169

10.  Data-dependent bucketing improves reference-free compression of sequencing reads.

Authors:  Rob Patro; Carl Kingsford
Journal:  Bioinformatics       Date:  2015-04-24       Impact factor: 6.937

  1 in total

1.  SparkGC: Spark based genome compression for large collections of genomes.

Authors:  Haichang Yao; Guangyong Hu; Shangdong Liu; Houzhi Fang; Yimu Ji
Journal:  BMC Bioinformatics       Date:  2022-07-25       Impact factor: 3.307
