CoMSA: compression of protein multiple sequence alignment files.

Sebastian Deorowicz, Joanna Walczyszyn, Agnieszka Debudaj-Grabysz.

Abstract

Motivation: Bioinformatics databases are growing rapidly and have reached sizes that were hard to imagine a decade ago. Among the many bioinformatics processes that generate hundreds of GB of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, occupies 40-230 GB, depending on the variant. Storing and transferring such massive data has become a challenge.
Results: We propose a novel compression algorithm, CoMSA, designed specifically for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform to non-binary alphabets. CoMSA handles both FASTA and Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, such as gzip. Our experiments also yield an analysis of how protein family size influences the compression ratio.
Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa.
Supplementary material: Supplementary data are available at Bioinformatics online.
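
Illustration: the Results mention a positional Burrows-Wheeler transform (PBWT) generalized to non-binary alphabets. The Python sketch below shows one plausible reading of that idea and is not taken from the CoMSA source code. It assumes the alignment is given as equal-length strings (one per sequence), emits each column in the order induced by the preceding columns, and updates that order with a stable bucket sort over whatever symbols occur. The function names `pbwt_forward`, `pbwt_inverse` and `_stable_partition` are hypothetical, and any further stages of the actual compressor (for example entropy coding) are outside the scope of the abstract and of this sketch.

```python
def _stable_partition(order, symbols):
    """Stable bucket sort of row indices by their current-column symbol.

    Rows that share a symbol keep their relative order, which is what
    generalizes the binary PBWT update (zeros first, then ones) to an
    arbitrary alphabet.
    """
    buckets = {}
    for idx, sym in zip(order, symbols):
        buckets.setdefault(sym, []).append(idx)
    new_order = []
    for sym in sorted(buckets):
        new_order.extend(buckets[sym])
    return new_order


def pbwt_forward(rows):
    """Return the permuted columns of an alignment (illustrative only)."""
    if not rows:
        return []
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    order = list(range(len(rows)))  # row order used for the current column
    columns = []
    for k in range(n_cols):
        col = [rows[i][k] for i in order]
        columns.append("".join(col))
        order = _stable_partition(order, col)
    return columns


def pbwt_inverse(columns, n_rows):
    """Rebuild the original rows from the permuted columns."""
    order = list(range(n_rows))
    rows = [[] for _ in range(n_rows)]
    for col in columns:
        # The i-th symbol of a permuted column belongs to row order[i].
        for idx, sym in zip(order, col):
            rows[idx].append(sym)
        order = _stable_partition(order, col)
    return ["".join(r) for r in rows]


if __name__ == "__main__":
    toy = ["MK-LV", "MK-LV", "MR-LI", "MKALV", "MR-LI"]  # made-up alignment
    permuted = pbwt_forward(toy)
    print(permuted)
    assert pbwt_inverse(permuted, len(toy)) == toy
```

Because rows with similar preceding context end up adjacent, the permuted columns tend to contain longer runs of identical symbols than the raw columns, which is the property a downstream compressor would exploit. The inverse needs only the permuted columns and the row count, so no side information about the ordering has to be stored.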

Year:  2019        PMID: 30010777     DOI: 10.1093/bioinformatics/bty619

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  2 in total

1.  AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Authors:  Milton Silva; Diogo Pratas; Armando J Pinho
Journal:  Entropy (Basel)       Date:  2021-04-26       Impact factor: 2.524

2.  CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

Authors:  Md Ashiqur Rahman; Abdullah Aman Tutul; Sifat Muhammad Abdullah; Md Shamsuzzoha Bayzid
Journal:  PLoS One       Date:  2022-04-18       Impact factor: 3.752
