A FASTQ compressor based on integer-mapped k-mer indexing for biologists.

Yeting Zhang, Khyati Patel, Tony Endrawis, Autumn Bowers, Yazhou Sun.

Abstract

Next-generation sequencing (NGS) technologies have gained considerable popularity among biologists. For example, RNA-seq, which provides both genomic and functional information, has been widely used in recent functional and evolutionary studies, especially in non-model organisms. However, storing and transmitting these large data sets (primarily in FASTQ format) have become genuine challenges, especially for biologists with little informatics experience. Data compression is thus a necessity. KIC, a FASTQ compressor based on a new integer-mapped k-mer indexing method, was developed (available at http://www.ysunlab.org/kic.jsp). It offers a high compression ratio on sequence data, outstanding user-friendliness through graphical user interfaces, and proven reliability. When evaluated on multiple large RNA-seq data sets from both humans and plants, KIC achieved a compression ratio that exceeded those of all major generic compressors and was comparable to those of the latest dedicated compressors. KIC enables researchers with minimal informatics training to take advantage of the latest sequence compression technologies, easily manage large FASTQ data sets, and reduce storage and transmission costs.
Copyright © 2015 Elsevier B.V. All rights reserved.
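
The abstract names integer-mapped k-mer indexing but does not describe how KIC implements it. As a rough illustration of the general idea only, the sketch below packs each DNA k-mer into an integer using 2 bits per base and builds a position index keyed by those integers. The function names (kmer_to_int, int_to_kmer, index_kmers) and the choice to skip k-mers containing N are assumptions made for this example; they are not taken from KIC.

```python
# Hypothetical sketch of 2-bit k-mer-to-integer mapping.
# This is NOT KIC's actual algorithm, which the abstract does not specify.

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def kmer_to_int(kmer):
    """Pack a k-mer over A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def int_to_kmer(value, k):
    """Inverse of kmer_to_int for a known k-mer length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def index_kmers(read, k):
    """Map each k-mer integer to the list of start positions in the read.

    K-mers containing any symbol other than A/C/G/T (for example N)
    are skipped; this is an assumption of the sketch.
    """
    index = {}
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if any(base not in BASE_TO_BITS for base in kmer):
            continue
        index.setdefault(kmer_to_int(kmer), []).append(start)
    return index
```

With this mapping, a k-mer of length k fits in 2k bits, so repeated k-mers can be referenced by a compact integer key rather than by their text.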

Keywords:  Biologist-friendly NGS data compressor; Data compression utility; FASTQ compression; FASTQ sequence data compression; Integer-mapped k-mer indexing

Year:  2015        PMID: 26743127     DOI: 10.1016/j.gene.2015.12.053

Source DB:  PubMed          Journal:  Gene        ISSN: 0378-1119            Impact factor:   3.688


  5 in total

1.  Comparison of high-throughput sequencing data compression tools.

Authors:  Ibrahim Numanagić; James K Bonfield; Faraz Hach; Jan Voges; Jörn Ostermann; Claudio Alberti; Marco Mattavelli; S Cenk Sahinalp
Journal:  Nat Methods       Date:  2016-10-24       Impact factor: 28.547

2.  Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. (Review)

Authors:  Kirill Kryukov; Lihua Jin; So Nakagawa
Journal:  Patterns (N Y)       Date:  2022-07-07

3.  LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

Authors:  Zhi-An Huang; Zhenkun Wen; Qingjin Deng; Ying Chu; Yiwen Sun; Zexuan Zhu
Journal:  BMC Bioinformatics       Date:  2017-03-20       Impact factor: 3.169

4.  Optimal compressed representation of high throughput sequence data via light assembly.

Authors:  Antonio A Ginart; Joseph Hui; Kaiyuan Zhu; Ibrahim Numanagić; Thomas A Courtade; S Cenk Sahinalp; David N Tse
Journal:  Nat Commun       Date:  2018-02-08       Impact factor: 14.919

5.  Sequence Compression Benchmark (SCB) database-A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences.

Authors:  Kirill Kryukov; Mahoko Takahashi Ueda; So Nakagawa; Tadashi Imanishi
Journal:  Gigascience       Date:  2020-07-01       Impact factor: 6.524
