Allowing mutations in maximal matches boosts genome compression performance.

Yuansheng Liu, Limsoon Wong, Jinyan Li.

Abstract

MOTIVATION: A maximal match between two genomes is a contiguous, non-extendable substring common to both genomes. DNA bases frequently differ between the genomes of two individuals. When a mutation falls inside a maximal match, it breaks the match into shorter segments. Encoding these broken segments in reference-based genome compression costs much more than encoding a single maximal match that is allowed to contain mutations.
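To make the motivation concrete, the following minimal sketch shows how a single substitution splits one long match into two shorter exact segments. It assumes the two sequences are already aligned and differ only by substitutions; the function name `exact_runs` and the toy sequences are hypothetical and are not taken from memRGC.

```python
def exact_runs(ref, tgt):
    """Return maximal runs of identical bases as (start, length) pairs.

    Assumption for this sketch: ref and tgt have equal length and are
    aligned position by position (substitutions only, no indels).
    """
    if len(ref) != len(tgt):
        raise ValueError("sequences must have equal length")
    runs = []
    start = None  # start of the run currently being scanned, if any
    for i, (a, b) in enumerate(zip(ref, tgt)):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(ref) - start))
    return runs


ref = "ACGTACGTACGTACGT"
tgt = "ACGTACGAACGTACGT"  # one substitution at index 7 (T -> A)
print(exact_runs(ref, tgt))  # [(0, 7), (8, 8)]
```

With exact matching only, the encoder has to describe two separate segments. A match that tolerates the mutation could instead describe the whole region once and record the single differing base.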
RESULTS: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme. The method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC improves compression performance by an average of 27% in reference-based genome compression. MemRGC also outperforms the best state-of-the-art methods on all of the benchmark datasets, in some cases by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission.
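The idea of extending a seed match across mutations can be sketched as follows. This is an illustration under stated assumptions, not the memRGC implementation: it indexes every k-mer of the reference rather than using the coprime double-window sampling scheme, it handles substitutions only, and it absorbs a mismatch only when at least k matching bases follow it. All function names, the token layout and that acceptance rule are hypothetical.

```python
from collections import defaultdict


def build_index(ref, k):
    """Map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def extend_exact(ref, tgt, r, t):
    """Length of the exact match starting at ref[r] and tgt[t]."""
    n = 0
    while r + n < len(ref) and t + n < len(tgt) and ref[r + n] == tgt[t + n]:
        n += 1
    return n


def extend_with_mutations(ref, tgt, r, t, k):
    """Extend a match over single-base mismatches (assumed rule: a
    mismatch is absorbed only if at least k matching bases follow it).
    Returns (length, [(offset, target_base), ...])."""
    length = extend_exact(ref, tgt, r, t)
    mutations = []
    while r + length + 1 < len(ref) and t + length + 1 < len(tgt):
        after = extend_exact(ref, tgt, r + length + 1, t + length + 1)
        if after < k:
            break
        mutations.append((length, tgt[t + length]))
        length += 1 + after
    return length, mutations


def encode(ref, tgt, k=4):
    """Greedy encoding of tgt as literals and mutation-containing matches."""
    index = build_index(ref, k)
    tokens, literal, t = [], [], 0
    while t < len(tgt):
        best = None
        for r in index.get(tgt[t:t + k], ()):
            length, muts = extend_with_mutations(ref, tgt, r, t, k)
            if best is None or length > best[1]:
                best = (r, length, muts)
        if best is None:  # no seed here: emit one literal base
            literal.append(tgt[t])
            t += 1
            continue
        if literal:
            tokens.append(("literal", "".join(literal)))
            literal = []
        tokens.append(("match", *best))
        t += best[1]
    if literal:
        tokens.append(("literal", "".join(literal)))
    return tokens


def decode(ref, tokens):
    """Rebuild the target from the reference and the token list."""
    out = []
    for tok in tokens:
        if tok[0] == "literal":
            out.append(tok[1])
        else:
            _, r, length, muts = tok
            seg = list(ref[r:r + length])
            for off, base in muts:
                seg[off] = base
            out.append("".join(seg))
    return "".join(out)


ref = "ACGTTGCAAGGCTTAACCGA"
tgt = "ACGTTGCAATGCTTAACCGA"  # substitution at index 9 (G -> T)
tokens = encode(ref, tgt, k=4)
print(tokens)  # [('match', 0, 20, [(9, 'T')])]
assert decode(ref, tokens) == tgt
```

In this toy case the whole target is covered by one match token carrying one mutation record, whereas exact matching alone would need two match tokens.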
AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/memRGC.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Year: 2020   PMID: 33118018   DOI: 10.1093/bioinformatics/btaa572

Source DB: PubMed   Journal: Bioinformatics   ISSN: 1367-4803   Impact factor: 6.937


  3 in total

1.  AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Authors:  Milton Silva; Diogo Pratas; Armando J Pinho
Journal:  Entropy (Basel)       Date:  2021-04-26       Impact factor: 2.524

2.  MBGC: Multiple Bacteria Genome Compressor.

Authors:  Szymon Grabowski; Tomasz M Kowalski
Journal:  Gigascience       Date:  2022-01-27       Impact factor: 6.524

3.  SparkGC: Spark based genome compression for large collections of genomes.

Authors:  Haichang Yao; Guangyong Hu; Shangdong Liu; Houzhi Fang; Yimu Ji
Journal:  BMC Bioinformatics       Date:  2022-07-25       Impact factor: 3.307
