Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.

Tiee-Jian Wu, Ying-Hsueh Huang, Lung-An Li.

Abstract

MOTIVATION: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is threefold. First, we compare the performance of several word-based and alignment-based methods. Second, we give a general guideline for choosing the window size and determine the optimal word sizes for several word-based measures at different window sizes. Third, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences.
RESULTS: Our study shows that (1) for whole-sequence similarity/dissimilarity identification, the window size should be as large as possible, but probably not >3000, as restricted by CPU time in practice; (2) for each measure, the optimal word size increases with window size; (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis; (4) the estimate beta-hat of beta based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences; and (5) beta is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and therefore has potential in probe design for microarrays.
AVAILABILITY: The SK-LD algorithm, the estimate beta-hat and the simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
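
To make the idea of a word-based measure concrete, the following is a minimal sketch (not the authors' MATLAB implementation) of a symmetric Kullback-Leibler discrepancy between the k-word frequency profiles of two DNA sequences. It assumes one common symmetric form, the sum over all words w of (p_w - q_w) log(p_w / q_w), which equals KL(p||q) + KL(q||p); the paper's exact definition and normalization may differ. The function names `word_frequencies` and `sk_ld`, and the pseudocount used to avoid zero frequencies, are illustrative assumptions and do not come from the paper.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def word_frequencies(seq, k, pseudocount=1.0):
    """Smoothed relative frequencies of all 4**k overlapping k-words.

    The pseudocount is an assumption of this sketch: it keeps every
    frequency strictly positive so the logarithm below is defined.
    Words containing characters outside ACGT are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def sk_ld(seq_a, seq_b, k, pseudocount=1.0):
    """Symmetric Kullback-Leibler discrepancy between two k-word profiles.

    Computes sum_w (p_w - q_w) * log(p_w / q_w), i.e. KL(p||q) + KL(q||p).
    """
    p = word_frequencies(seq_a, k, pseudocount)
    q = word_frequencies(seq_b, k, pseudocount)
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)
```

With this sketch, `sk_ld(a, b, k)` is zero when the two sequences have identical word profiles and grows as their k-word compositions diverge. The word size `k` is the parameter whose optimal value, according to the abstract, increases with the window size.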

Year:  2005        PMID: 16144805     DOI: 10.1093/bioinformatics/bti658

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

