LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.

Gleb Filatov, Bruno Bauwens, Attila Kertész-Farkas.

Abstract

Motivation: Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis.
Results: Here, we present a new convolutional kernel function for protein sequences called the Lempel-Ziv-Welch (LZW)-Kernel. It is based on code words identified with the LZW universal text compressor. The LZW-Kernel is an alignment-free method: it is always symmetric and positive, always yields 1.0 for self-similarity, and can be used directly with Support Vector Machines (SVMs) in classification problems. By contrast, the normalized compression distance often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly suitable for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring in a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with Basic Local Alignment Search Tool (BLAST) scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram-based mismatch kernels, the hidden Markov model-based SAM and Fisher kernel, and the protein family-based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed: three times faster than BLAST and several orders of magnitude faster than SW or LAK in our tests.
Availability and implementation: LZW-Kernel is implemented as standalone C code. It is free, open-source software distributed under the GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel.
Supplementary information: Supplementary data are available at Bioinformatics Online.
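
To make the idea of a code-word-based similarity concrete, the following is a minimal sketch in Python. It is not the authors' C implementation and does not reproduce the exact kernel formula from the paper: the function names `lzw_code_words` and `code_word_similarity`, and the choice of a cosine-normalized set-overlap score, are assumptions made for illustration only. The sketch parses each sequence in one pass with an LZW-style dictionary build and then compares the two resulting dictionaries, which gives a symmetric, non-negative score equal to 1.0 for self-similarity.

```python
from math import sqrt


def lzw_code_words(seq):
    """Return the dictionary built by a one-pass LZW-style parse of seq.

    Illustrative helper (hypothetical name): the dictionary starts with the
    single symbols occurring in seq and grows by one phrase each time the
    longest known match can no longer be extended.
    """
    words = set(seq)            # initial dictionary: single symbols
    current = ""
    for ch in seq:
        candidate = current + ch
        if candidate in words:
            current = candidate  # keep extending the longest known match
        else:
            words.add(candidate)  # register a new code word
            current = ch          # restart matching from this symbol
    return words


def code_word_similarity(x, y):
    """Cosine-normalized overlap of two code-word sets (an assumed scoring)."""
    if not x or not y:
        raise ValueError("sequences must be non-empty")
    dx, dy = lzw_code_words(x), lzw_code_words(y)
    return len(dx & dy) / sqrt(len(dx) * len(dy))


if __name__ == "__main__":
    a, b = "ABABABA", "BABAAB"
    print(code_word_similarity(a, a))
    print(code_word_similarity(a, b))
```

Because the score is a normalized inner product of binary code-word indicator vectors, a matrix of such scores could in principle be passed to an SVM as a precomputed kernel; whether this matches the weighting used in the published LZW-Kernel should be checked against the paper and the repository.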

Year:  2018        PMID: 29741583     DOI: 10.1093/bioinformatics/bty349

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  3 in total

1.  Benchmarking of alignment-free sequence comparison methods.

Authors:  Andrzej Zielezinski; Hani Z Girgis; Guillaume Bernard; Chris-Andre Leimeister; Kujin Tang; Thomas Dencker; Anna Katharina Lau; Sophie Röhling; Jae Jin Choi; Michael S Waterman; Matteo Comin; Sung-Hou Kim; Susana Vinga; Jonas S Almeida; Cheong Xin Chan; Benjamin T James; Fengzhu Sun; Burkhard Morgenstern; Wojciech M Karlowski
Journal:  Genome Biol       Date:  2019-07-25       Impact factor: 13.583

2.  Caretta - A multiple protein structure alignment and feature extraction suite.

Authors:  Mehmet Akdel; Janani Durairaj; Dick de Ridder; Aalt D J van Dijk
Journal:  Comput Struct Biotechnol J       Date:  2020-04-06       Impact factor: 7.271

3.  A Review of Methods for Estimating Algorithmic Complexity: Options, Challenges, and New Directions. (Review)

Authors:  Hector Zenil
Journal:  Entropy (Basel)       Date:  2020-05-30       Impact factor: 2.524

