
A geometric interpretation for local alignment-free sequence comparison.

Ehsan Behnam, Michael S Waterman, Andrew D Smith.

Abstract

Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.
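The quadratic baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of D2 as the number of k-word matches between two sequences, which equals the dot product of their k-word count vectors. The function names, the fixed window length `w`, and the word length `k` are illustrative choices, not taken from the paper, and this is the brute-force comparison, not the authors' randomized approximation algorithm.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """D2 score: dot product of the k-word count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[word] for word, n in ca.items())


def windows(seq, w):
    """All length-w substrings of seq, paired with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def max_bichromatic_d2(x, y, w, k):
    """Brute force: score every window of x against every window of y.

    Returns (best score, start in x, start in y). The number of D2
    evaluations is quadratic in the sequence lengths.
    """
    best = (-1, None, None)
    for (i, a), (j, b) in product(windows(x, w), windows(y, w)):
        score = d2(a, b, k)
        if score > best[0]:
            best = (score, i, j)
    return best
```

For example, `max_bichromatic_d2("TTTTACGTAC", "GGACGTACGG", 6, 2)` locates the shared segment `ACGTAC`. Each window corresponds to one point (its count vector) in a space with one dimension per k-word, and the two sequences supply the two "colors" of points. For vectors of equal norm, maximizing the dot product is equivalent to minimizing Euclidean distance, since ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v; the paper's actual reduction to BCP may differ in detail.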


Year:  2013        PMID: 23829649      PMCID: PMC3704055          DOI: 10.1089/cmb.2012.0280

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


