Literature DB >> 11414568

Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition.

T J Wu1, Y C Hsieh, L A Li.   

Abstract

In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431 1439) characterized a family of word-based dissimilarity measures that defined distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback-Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy gives a better performance than Euclidean distance. Moreover, under a Markov chain model of order kQ for base composition, where kQ is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order kQ of base composition is generally recommended. However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology in comparing large datasets of DNA sequences.

Mesh:

Substances:

Year:  2001        PMID: 11414568     DOI: 10.1111/j.0006-341x.2001.00441.x

Source DB:  PubMed          Journal:  Biometrics        ISSN: 0006-341X            Impact factor:   2.571


  30 in total

1.  Distributional regimes for the number of k-word matches between two random sequences.

Authors:  Ross A Lippert; Haiyan Huang; Michael S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2002-10-08       Impact factor: 11.205

2.  Metagenomic Classification Using an Abstraction Augmented Markov Model.

Authors:  Xiujun Sylvia Zhu; Monnie McGee
Journal:  J Comput Biol       Date:  2015-11-30       Impact factor: 1.479

Review 3.  New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Authors:  Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S Waterman; Fengzhu Sun
Journal:  Brief Bioinform       Date:  2013-09-23       Impact factor: 11.622

Review 4.  Sequence analysis by iterated maps, a review.

Authors:  Jonas S Almeida
Journal:  Brief Bioinform       Date:  2013-10-25       Impact factor: 11.622

5.  K-mer natural vector and its application to the phylogenetic analysis of genetic sequences.

Authors:  Jia Wen; Raymond H F Chan; Shek-Chung Yau; Rong L He; Stephen S T Yau
Journal:  Gene       Date:  2014-05-22       Impact factor: 3.688

6.  Frequency analysis techniques for identification of viral genetic data.

Authors:  Vladimir Trifonov; Raul Rabadan
Journal:  MBio       Date:  2010-08-24       Impact factor: 7.867

7.  Estimation of Stratified Mark-Specific Proportional Hazards Models with Missing Marks.

Authors:  Yanqing Sun; Peter B Gilbert
Journal:  Scand Stat Theory Appl       Date:  2012-03       Impact factor: 1.396

8.  A novel hierarchical clustering algorithm for gene sequences.

Authors:  Dan Wei; Qingshan Jiang; Yanjie Wei; Shengrui Wang
Journal:  BMC Bioinformatics       Date:  2012-07-23       Impact factor: 3.169

9.  A novel alignment-free method for comparing transcription factor binding site motifs.

Authors:  Minli Xu; Zhengchang Su
Journal:  PLoS One       Date:  2010-01-20       Impact factor: 3.240

10.  A hybrid distance measure for clustering expressed sequence tags originating from the same gene family.

Authors:  Keng-Hoong Ng; Chin-Kuan Ho; Somnuk Phon-Amnuaisuk
Journal:  PLoS One       Date:  2012-10-11       Impact factor: 3.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.