A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words.

T J Wu, J P Burke, D B Davison.

Abstract

A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
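
To make the comparison in the abstract concrete, the following is a minimal sketch of two of the three distances: plain Euclidean and standardized Euclidean distance between n-word frequency vectors. It is not the authors' implementation. The function names are hypothetical, relative frequencies are used (raw counts would work the same way), and the per-word variances are estimated empirically from a reference collection of sequences, which is an assumption; the abstract does not say how the variances are obtained. The full Mahalanobis distance would replace the per-word variances with the inverse of the full covariance matrix and is omitted here.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # DNA letters; words containing other symbols are ignored


def word_counts(seq, n):
    """Count overlapping n-words in seq, in a fixed lexicographic word order."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    return [counts[w] for w in words]


def word_frequencies(seq, n):
    """Relative n-word frequencies (an assumption; counts also work)."""
    counts = word_counts(seq, n)
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no valid n-words")
    return [c / total for c in counts]


def euclidean(f, g):
    """Plain Euclidean distance between two frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def word_variances(reference_seqs, n):
    """Sample variance of each n-word frequency over a reference set.

    Estimating variances this way is an illustrative assumption.
    """
    rows = [word_frequencies(s, n) for s in reference_seqs]
    m = len(rows)
    if m < 2:
        raise ValueError("need at least two reference sequences")
    columns = list(zip(*rows))
    means = [sum(col) / m for col in columns]
    return [sum((x - mu) ** 2 for x in col) / (m - 1)
            for col, mu in zip(columns, means)]


def standardized_euclidean(f, g, variances):
    """Euclidean distance with each squared difference divided by its variance.

    Words with zero variance are skipped (a simplification for this sketch).
    """
    return sqrt(sum((a - b) ** 2 / v
                    for a, b, v in zip(f, g, variances) if v > 0))
```

For example, with the hypothetical reference set `["AACC", "AAAA"]` and n = 1, `word_variances` returns 0.125 for A and C and 0 for G and T, and `standardized_euclidean` then weights differences in A and C frequency by 1/0.125.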

Year:  1997        PMID: 9423258

Source DB:  PubMed          Journal:  Biometrics        ISSN: 0006-341X            Impact factor:   2.571


  30 in total

1.  Similar cases retrieval from the database of laboratory test results.

Authors:  Zhenjun Yang; Yasushi Matsumura; Shigeki Kuwata; Hideo Kusuoka; Hiroshi Takeda
Journal:  J Med Syst       Date:  2003-06       Impact factor: 4.460

2.  Distributional regimes for the number of k-word matches between two random sequences.

Authors:  Ross A Lippert; Haiyan Huang; Michael S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2002-10-08       Impact factor: 11.205

3.  Metagenomic Classification Using an Abstraction Augmented Markov Model.

Authors:  Xiujun Sylvia Zhu; Monnie McGee
Journal:  J Comput Biol       Date:  2015-11-30       Impact factor: 1.479

4.  K-mer natural vector and its application to the phylogenetic analysis of genetic sequences.

Authors:  Jia Wen; Raymond H F Chan; Shek-Chung Yau; Rong L He; Stephen S T Yau
Journal:  Gene       Date:  2014-05-22       Impact factor: 3.688

5.  New powerful statistics for alignment-free sequence comparison under a pattern transfer model.

Authors:  Xuemei Liu; Lin Wan; Jing Li; Gesine Reinert; Michael S Waterman; Fengzhu Sun
Journal:  J Theor Biol       Date:  2011-06-25       Impact factor: 2.691

6.  d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

Authors:  J Burke; D Davison; W Hide
Journal:  Genome Res       Date:  1999-11       Impact factor: 9.043

7.  Fast algorithms for computing sequence distances by exhaustive substring composition.

Authors:  Alberto Apostolico; Olgert Denas
Journal:  Algorithms Mol Biol       Date:  2008-10-28       Impact factor: 1.405

8.  Using Mahalanobis distance to compare genomic signatures between bacterial plasmids and chromosomes.

Authors:  Haruo Suzuki; Masahiro Sota; Celeste J Brown; Eva M Top
Journal:  Nucleic Acids Res       Date:  2008-10-25       Impact factor: 16.971

9.  A novel alignment-free method for comparing transcription factor binding site motifs.

Authors:  Minli Xu; Zhengchang Su
Journal:  PLoS One       Date:  2010-01-20       Impact factor: 3.240

10.  A hybrid distance measure for clustering expressed sequence tags originating from the same gene family.

Authors:  Keng-Hoong Ng; Chin-Kuan Ho; Somnuk Phon-Amnuaisuk
Journal:  PLoS One       Date:  2012-10-11       Impact factor: 3.240
