Efficient tools for comparative substring analysis.

Alberto Apostolico, Olgert Denas, Andreas Dress.

Abstract

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve all k-mers for any k up to any preset maximum length K (including K = infinity). Remarkably, although there may be Θ(L^2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K). A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. Copyright 2010 Elsevier B.V. All rights reserved.
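To make the notion of a composition-based distance concrete, the following is a minimal Python sketch of a direct (naive) computation over all k-mers up to a preset maximum length. It is not the O(L) implementation the abstract describes, and the distance used here (a sum over k of Euclidean distances between relative k-mer frequency vectors) is an assumption chosen for illustration; the abstract does not state the exact formula. The names `kmer_counts` and `composition_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_distance(a, b, max_k):
    """Naive composition-based distance between sequences a and b.

    Assumed definition (illustrative only): for each k from 1 to max_k,
    take the Euclidean distance between the relative k-mer frequency
    vectors of a and b, then sum those distances over k.
    """
    total = 0.0
    for k in range(1, max_k + 1):
        ca, cb = kmer_counts(a, k), kmer_counts(b, k)
        na, nb = sum(ca.values()), sum(cb.values())
        if na == 0 or nb == 0:
            # One sequence is shorter than k: no k-mers left to compare.
            break
        squared = 0.0
        for word in ca.keys() | cb.keys():
            # Counter returns 0 for words absent from a sequence.
            squared += (ca[word] / na - cb[word] / nb) ** 2
        total += sqrt(squared)
    return total
```

This direct approach extracts and hashes on the order of K·L substrings per sequence, each up to K characters long, so its cost grows with K; that growth is what a linear-time method with a K-independent constant, as claimed in the abstract, avoids.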

Year:  2010        PMID: 20682467     DOI: 10.1016/j.jbiotec.2010.05.006

Source DB:  PubMed          Journal:  J Biotechnol        ISSN: 0168-1656            Impact factor:   3.307


  1 in total

1.  MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics.

Authors:  Cinzia Pizzi
Journal:  Algorithms Mol Biol       Date:  2016-04-21       Impact factor: 1.405
