A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words.

Abstract

A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
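The three distances compared in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of raw overlapping n-word counts, and the caller-supplied variances and covariance matrix are all assumptions made for illustration. The abstract does not say how the variances and covariances are estimated, so the sketch takes them as inputs and assumes the covariance matrix is invertible.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: DNA sequences with no ambiguity codes


def word_counts(seq, n):
    """Count every overlapping n-word in seq, in a fixed lexicographic word order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [counts[w] for w in words]


def euclidean(x, y):
    """Plain Euclidean distance between two count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardized_euclidean(x, y, variances):
    """Euclidean distance with each coordinate scaled by its variance.

    `variances` is supplied by the caller (hypothetical input).
    """
    return sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    z = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * z[k] for k in range(row + 1, size))
        z[row] = (aug[row][size] - tail) / aug[row][row]
    return z


def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt(d^T cov^{-1} d) with d = x - y.

    `cov` is a caller-supplied covariance matrix (hypothetical input),
    assumed symmetric positive definite.
    """
    d = [a - b for a, b in zip(x, y)]
    z = solve(cov, d)  # z = cov^{-1} d, without forming the inverse explicitly
    return sqrt(sum(di * zi for di, zi in zip(d, z)))


# Example: compare two short sequences on 1-words (single letters).
x = word_counts("ACGTACGT", 1)
y = word_counts("AAAACCGT", 1)
d_euclid = euclidean(x, y)
d_std = standardized_euclidean(x, y, [4, 1, 1, 1])
```

With a diagonal covariance matrix, `mahalanobis` reduces to `standardized_euclidean`, and with the identity matrix it reduces to `euclidean`. This mirrors the abstract's description of the standardized distance as correcting only for variances, while Mahalanobis distance also accounts for covariances.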

Substances: DNA

Year: 1997
PMID: 9423258
Source DB: PubMed
Journal: Biometrics
ISSN: 0006-341X
Impact factor: 2.571

Cited

30 in total

1. Similar cases retrieval from the database of laboratory test results.

2. Distributional regimes for the number of k-word matches between two random sequences.

3. Metagenomic Classification Using an Abstraction Augmented Markov Model.

4. K-mer natural vector and its application to the phylogenetic analysis of genetic sequences.

5. New powerful statistics for alignment-free sequence comparison under a pattern transfer model.

6. d2_cluster: a validated method for clustering EST and full-length cDNA sequences.

7. Fast algorithms for computing sequence distances by exhaustive substring composition.

8. Using Mahalanobis distance to compare genomic signatures between bacterial plasmids and chromosomes.

9. A novel alignment-free method for comparing transcription factor binding site motifs.

10. A hybrid distance measure for clustering expressed sequence tags originating from the same gene family.