
Application of compression-based distance measures to protein sequence classification: a methodological study.

András Kocsor, Attila Kertész-Farkas, László Kaján, Sándor Pongor.

Abstract

MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences.
RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, and low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
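
To illustrate the kind of measure described in the abstract, the Python sketch below computes a compression-based distance between two sequences and uses it for nearest neighbour classification. It is not the authors' implementation and rests on several assumptions: zlib (DEFLATE, an LZ77-based compressor) stands in for the Lempel-Ziv compressor, the distance is the widely used normalized compression distance, and the sequences, family labels and function names are made up for the example.

```python
import zlib


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (assumed stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_neighbour(query: str, labelled: list[tuple[str, str]]) -> str:
    """Return the label of the training sequence closest to the query under ncd."""
    closest = min(labelled, key=lambda item: ncd(query, item[0]))
    return closest[1]


# Made-up toy sequences, not real proteins; labels are hypothetical.
seq_a = "MKVLAAGIVGLLLAQPSFAADKPTEEQWRHNYCSDMTFGELKRVAPWHIQNDYSTMECGFKLRPAVWH"
seq_b = "GSHMTDQYERNWCPFIKLAVGTSQHDEMNRYWCKPFILAVGGTTSSQQHHDDEENNRRYYWWCCKKPP"
training = [(seq_a, "family_A"), (seq_b, "family_B")]

# Query: seq_a with two residues replaced, so it should sit closer to seq_a.
query = seq_a[:20] + "W" + seq_a[21:45] + "G" + seq_a[46:]

print("d(query, seq_a) =", round(ncd(query, seq_a), 3))
print("d(query, seq_b) =", round(ncd(query, seq_b), 3))
print("predicted label:", nearest_neighbour(query, training))
```

On sequences this short, the fixed overhead of the compressor makes up a large share of the compressed size, which is one way to see why such distances depend on sequence length and complexity, as the abstract notes.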

Year:  2005        PMID: 16317070     DOI: 10.1093/bioinformatics/bti806

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

