Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches.

Sebastian Horwege¹, Sebastian Lindner², Marcus Boden², Klas Hatje³, Martin Kollmar³, Chris-André Leimeister², Burkhard Morgenstern⁴.

Abstract

In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating, for each position in the first sequence, the length of the longest substring starting at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction. The two alignment-free programmes are accessible through a web interface at the Göttingen Bioinformatics Compute Server (GOBICS), at http://spaced.gobics.de and http://kmacs.gobics.de, and the source code can be downloaded.
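The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the '1'/'0' pattern encoding ('1' = match position, '0' = don't care) and the choice of Euclidean distance are assumptions made for illustration; the abstract only says that "various distance measures" can be used. The k-mismatch part is a brute-force search that returns the average substring length only, because the abstract does not state how this average is converted into a distance.

```python
from collections import Counter
from math import sqrt


def spaced_word_freqs(seq, pattern):
    """Relative frequencies of spaced words in seq.

    pattern is a string of '1' (match position) and '0' (don't care);
    this encoding is an assumption of the sketch.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    n_windows = len(seq) - len(pattern) + 1
    if n_windows <= 0:
        return {}
    # Each window contributes the characters found at the match positions.
    counts = Counter(
        "".join(seq[start + i] for i in keep) for start in range(n_windows)
    )
    return {word: n / n_windows for word, n in counts.items()}


def euclidean_distance(f1, f2):
    """Euclidean distance between two relative-frequency profiles."""
    words = set(f1) | set(f2)
    return sqrt(sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2 for w in words))


def longest_k_mismatch(s1, i, s2, k):
    """Length of the longest substring of s1 starting at position i that
    occurs somewhere in s2 with at most k mismatches (brute force)."""
    best = 0
    for j in range(len(s2)):
        mismatches = 0
        length = 0
        while i + length < len(s1) and j + length < len(s2):
            if s1[i + length] != s2[j + length]:
                mismatches += 1
                if mismatches > k:
                    break  # the (k+1)-th mismatch ends the extension
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(s1, s2, k):
    """Average of longest_k_mismatch over all positions of s1."""
    return sum(longest_k_mismatch(s1, i, s2, k) for i in range(len(s1))) / len(s1)


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACCTACGA"
    print(euclidean_distance(spaced_word_freqs(a, "101"), spaced_word_freqs(b, "101")))
    print(average_k_mismatch_length(a, b, k=1))
```

The brute-force search compares every pair of start positions, so it is only suitable for short sequences. A pairwise distance matrix would be obtained by applying either measure to every pair of input sequences.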

Year: 2014 PMID: 24829447 PMCID: PMC4086093 DOI: 10.1093/nar/gku398

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

26 in total

1. Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences.

Authors: Chris-Andre Leimeister; Jendrik Schellhorn; Svenja Dörrer; Michael Gerth; Christoph Bleidorn; Burkhard Morgenstern
Journal: Gigascience Date: 2019-03-01 Impact factor: 6.524

2. A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances.

Authors: Laurent Noé; Donald E K Martin
Journal: J Comput Biol Date: 2014-12 Impact factor: 1.479

3. Sequence Comparison Without Alignment: The SpaM Approaches.

4. Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination.

5. Interpreting alignment-free sequence comparison: what makes a score a good score?

6. Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage.

7. Estimating evolutionary distances between genomic sequences from spaced-word matches.

8. Fast alignment-free sequence comparison using spaced-word frequencies.

9. Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data.

10. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison.