AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Milton Silva, Diogo Pratas, Armando J Pinho.

Abstract

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising two challenges of growing importance: efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, the literature contains relatively few compressors designed specifically for protein sequences. Moreover, these specialized compressors improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach, and uses individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also improves on the memory usage of AC, requiring about seven times less memory, and its memory requirements are not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence in the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
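
The similarity analysis described in the abstract rests on a general idea: a sequence compresses better when the compressor has already seen a related one. The following is a minimal sketch of that idea, not AC2's method. It assumes the normalized compression distance as the measure and Python's general-purpose zlib as a stand-in compressor; the abstract does not state which measure AC2 uses, and AC2's neural-network mixing and cache-hash models are not reproduced here. The names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share much of their content;
    values near 1 suggest they share little. zlib is an assumption made
    for illustration, standing in for a specialized protein compressor.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx = compressed_size(bx)
    cy = compressed_size(by)
    cxy = compressed_size(bx + by)  # cost of compressing both together
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy amino-acid strings, invented for illustration only.
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ" * 4
    close = query.replace("GDGTQ", "GDATQ")  # a few substitutions
    far = "WYWCHHPNNDDEEFFGGCCMMWWYYHHPPTTSSRRKKLLIIVVAAQQ" * 5
    print("query vs close:", round(ncd(query, close), 3))
    print("query vs far:  ", round(ncd(query, far), 3))
```

With a general-purpose compressor the distances are only rough; a compressor that models protein sequences more accurately would be expected to separate related and unrelated sequences more sharply.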

Keywords:  context mixing; coronavirus; lossless data compression; mixture of experts; neural networks; protein sequence compression

Year:  2021        PMID: 33925812     DOI: 10.3390/e23050530

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524

