WSE, a new sequence distance measure based on word frequencies.

Abstract

In this article, we present a new distance metric, the Weighted Sequence Entropy (WSE), based on the short word composition of biological sequences. As a revision of the classical relative entropy (RE), our metric (1) works equivalently to RE in the case of small k, and (2) avoids the degeneracy that arises when some word types are absent in one sequence but not in the other. Experiments on 25 viruses including SARS-CoVs show that our method and RE give exactly the same phylogenetic tree when word length k ≤ 3. When k > 3, our method still works and yields a convergent phylogenetic topology, whereas RE gives degenerate results.

Substances: DNA, Viral

Year: 2008 PMID: 18590747 PMCID: PMC7185439 DOI: 10.1016/j.mbs.2008.06.001

Source DB: PubMed Journal: Math Biosci ISSN: 0025-5564 Impact factor: 2.144

Introduction

The elucidation of the evolutionary history of different species is a major concern of biological science. Early approaches were mainly based on the alignment of a gene or protein sequence. On the one hand, however, “different genes may tell different stories”: unequal mutational rates of different genes or different pieces of genes, as well as lateral gene transfer, are widely detected, so a single gene or protein sequence usually does not carry enough information to determine its phylogenetic position. On the other hand, traditional alignment methods are computationally intensive and are not meaningful for whole-genome comparison, because each genome has its own genes and gene order. In short, there is an urgent need to develop new phylogenetic methods that utilize the ever-increasing genome data. In recent years, several alignment-free phylogenetic methods have been proposed [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Among the earliest attempts is the gene content method [1], [2], which uses the ratio between the number of genes two species share and the total number of genes. This method is very simple and can be applied to long genomes, but it fails when species are closely related to each other, e.g., the mt genome sequences of some placental mammals used by Cao et al. [3], because these genomes share exactly the same genes and gene order. Similar work along this line is the gene order method [4], which fails in the same case. In addition, some methods based on graphical representations of DNA sequences have been outlined [5], [6], [7], [8], [9], [10]; they usually translate a DNA sequence into a set of plots in 2D/3D space and use graphical invariants to characterize the sequence. These methods provide a simple way of viewing, sorting and comparing various gene structures. Another widely studied method is based on the short word composition of biological sequences.
Observing that the relative abundances of all dinucleotides are remarkably constant across a genome, Karlin et al. [11], [12], [13] proposed the genome signature to characterize that genome. The genome signature consists of the array of dinucleotide relative abundances $\rho_{xy} = f_{xy}/(f_x f_y)$ extended over all dinucleotides, where $f_x$ is the frequency of nucleotide x and $f_{xy}$ is the frequency of dinucleotide xy. Alternatively, the relative abundances of tri-, tetra- or even k-nucleotides can also be calculated. Besides phylogenetic inference, these signatures have been used in several applications, e.g., assessment of general relatedness of genomes [12], genetic sequence classification [16], clustering of genes from different genomes [17], and recognition of species-specific sequence blocks [18]. After counting the frequencies of all k-nucleotides (or k-mers, k-words, etc.) in a DNA or protein sequence, we can represent this sequence as a vector in the $4^k$- or $20^k$-dimensional Euclidean space. Since the frequency of a k-nucleotide measures the probability of its occurrence in the given sequence, it is intuitive to use the relative entropy between two frequency vectors as a measure of distance. Let $p_i^A$ and $p_i^B$ be the observed frequencies of the i-th k-nucleotide in sequences A and B, respectively. The relative entropy (or Kullback–Leibler distance) [19] is defined as

$$RE(A, B) = \sum_i p_i^A \log_2 \frac{p_i^A}{p_i^B}. \qquad (1)$$

The relative entropy is the inefficiency of assuming that the distribution is $p^B$ when the true distribution is $p^A$. It satisfies neither the symmetry condition nor the triangle inequality, so it is not strictly a distance measure. The relative entropy does, however, have many important and useful mathematical properties. For example, it is a convex function of the pair $(p^A, p^B)$, is always nonnegative, and vanishes if and only if $p_i^A = p_i^B$ for all possible i. Evidently, vector representations of biological sequences lose some information when k is small, e.g., only the nucleotide composition is known when $k = 1$.
More information (especially the sequence-order information) is captured if we consider higher-order words. Unfortunately, as the word order increases, some word types that occur in one sequence may be absent from the other. This leads to a degenerate result when Formula (1) is used. Meanwhile, it is widely observed that some “word spaces” are not saturated for an individual genome, especially for protein sequences. A survey of the 101,602 protein sequences in the SWISS-PROT database Rel. 40 (2000) showed that together these proteins occupy less than 26% of the possible 6-amino-acid strings. In particular, EcoliK occupies less than 2% and Mycge less than 0.3% of the 6-amino-acid strings [20]. Traditionally, in order to avoid the degeneracy, one can merge some k-words (i.e., words with frequencies up to a particular threshold) into one event. However, this is also accompanied by a loss of sequence information. In this paper, we give a new distance metric that overcomes the degeneracy. As a revision of the classical relative entropy, this metric works equivalently to the relative entropy in phylogenetic inference when k is small, and avoids the degeneracy when k is large. In the main text, we give the mathematical description and a simple implementation of this metric. As an application, we construct phylogenetic trees of 25 viruses including SARS-CoVs.

Methods

Frequencies of k-words

Given a DNA or protein sequence of length L, we count the occurrences of words of length k, allowing words to overlap. There are in total $4^k$ possible words for DNA sequences and $20^k$ possible words for protein sequences. The frequency of a word is obtained by dividing its number of occurrences by the total number of words in the given sequence, which is $L - k + 1$.
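The counting step can be sketched in a few lines of Python. This is only a minimal illustration: the function name `kword_frequencies` is ours rather than the paper's, and words that never occur are simply left out of the result (their frequency is 0).

```python
from collections import Counter


def kword_frequencies(seq, k):
    """Return {word: frequency} over all overlapping words of length k."""
    seq = seq.upper()
    n_words = len(seq) - k + 1  # number of overlapping windows, L - k + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(n_words))
    return {word: count / n_words for word, count in counts.items()}


# Toy input: the first ten bases of the Lemur exon quoted later in the text.
freqs = kword_frequencies("ATGACTTTGC", 2)
```

The frequencies of the observed words sum to 1, so the dictionary can be read as a sparse frequency vector over the full word space.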

Subtraction of random background

The above frequency reflects the results of both natural selection and random mutation [20]. In order to highlight the selective diversification of sequence composition, one should remove the random background from the simple frequencies by dividing each frequency by its expected value:

$$r_i = \frac{p_i}{p_i^0},$$

where $p_i$ is the observed frequency and $p_i^0$ the background frequency of word i. There are many approaches to estimating the background frequency of a word in a sequence. In the present paper, we use the simplest one, which is based on the Markov model of order 0 [11], [12], [13], [21]. For example, the trinucleotide ACT has the expected frequency $p^0(ACT) = p(A)\,p(C)\,p(T)$. More elaborate estimates, e.g., from higher-order Markov models, can be defined accordingly. Note that $r = (r_i)$ is not actually a probability distribution, since its components do not sum to 1. We normalize it by dividing by the sum: $r_i / \sum_j r_j$, for each word i of length k.
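A minimal Python sketch of this background correction follows, assuming the order-0 Markov background described above. The function names are ours, and estimating single-letter frequencies from the same sequence is an assumption of the sketch.

```python
from collections import Counter
from math import prod


def background_frequency(word, letter_freq):
    """Order-0 Markov expectation: product of the single-letter frequencies."""
    return prod(letter_freq[ch] for ch in word)


def normalized_relative_abundance(seq, k):
    """Divide each k-word frequency by its background, then rescale to sum to 1."""
    seq = seq.upper()
    n_words = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_words))
    # Single-letter frequencies estimated from the same sequence (assumption).
    letter_freq = {ch: c / len(seq) for ch, c in Counter(seq).items()}
    ratios = {
        word: (count / n_words) / background_frequency(word, letter_freq)
        for word, count in counts.items()
    }
    total = sum(ratios.values())
    return {word: ratio / total for word, ratio in ratios.items()}


profile = normalized_relative_abundance("ATGACTTTGC", 2)
```

Only observed words appear in the result; absent words keep a value of 0 after the division, so the normalization is unaffected by them.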

Weighted sequence entropy and distance matrix

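In order to overcome the degeneracy, we replace $p_i^B$ in Formula (1) by $(p_i^A + p_i^B)/2$, where $p_i^A$ and $p_i^B$ are the frequencies of the i-th word in sequences A and B; then

$$RE_1(A, B) = \sum_i p_i^A \log_2 \frac{2\,p_i^A}{p_i^A + p_i^B}. \qquad (2)$$

Our remedy is meaningful since $(p^A + p^B)/2$ is also a distribution. In fact, Formula (2) can be treated as the relative entropy between $p^A$ and the arithmetic average of the distributions $p^A$ and $p^B$. Here we adopt the convention $0 \log 0 = 0$. In order to ensure the symmetry condition, we take the sum

$$d_1(A, B) = RE_1(A, B) + RE_1(B, A).$$

Recall Shannon’s information entropy

$$H(A) = -\sum_i p_i^A \log_2 p_i^A,$$

and write

$$H\!\left(\tfrac{A+B}{2}\right) = -\sum_i \frac{p_i^A + p_i^B}{2} \log_2 \frac{p_i^A + p_i^B}{2}$$

as the entropy of the arithmetic average of the distributions $p^A$ and $p^B$. Then $d_1$ can be rewritten as

$$d_1(A, B) = 2\,H\!\left(\tfrac{A+B}{2}\right) - H(A) - H(B).$$

Alternatively, we can give more elaborate weights to $p^A$ and $p^B$, e.g., the sequence lengths $L_A$ and $L_B$, or their square roots $\sqrt{L_A}$ and $\sqrt{L_B}$ (the accuracy in the estimation of a probability in a multinomial distribution is proportional to the square root of the corresponding sequence length). This gives two further relative entropy functions, in which the average in Formula (2) is replaced by the correspondingly weighted average of $p^A$ and $p^B$. The corresponding distance measures, $d_2$ (length weights) and $d_3$ (square-root weights), are defined and calculated similarly. Since $d_1$, $d_2$ and $d_3$ are all linear combinations of Shannon entropies, we call them the Weighted Sequence Entropy (WSE). By definition, these distances avoid the degeneracy because they are bounded above; the upper bound of $d_1$ is 2.

To illustrate the virtues of WSE, we give the following example. The first exon of the β-globin gene of Lemur is S = ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG. There is only one dinucleotide TA in this sequence. If a point mutation changes the T in this only TA to C (denote the new sequence by $S'$), then the relative entropy between S and $S'$ is infinite, $RE(S, S') = +\infty$, which contradicts the biological meaning: the two sequences differ by a single base. Our methods give the values 0.0245821, 2.26155 and 0.23578 instead.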
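The Lemur example can be replayed with a short Python sketch. Two caveats apply. First, the sketch uses raw dinucleotide frequencies without background subtraction, so its numbers need not reproduce the values reported above. Second, the weighted form used here, $(w_A + w_B)\,H(\text{mix}) - w_A H(A) - w_B H(B)$, is our assumption about how the length-weighted variants are built. It reduces to the equal-weight distance for $w_A = w_B = 1$, and it agrees with the reported values differing by factors close to 92 and √92 for this pair of 92-nt sequences. All function names are ours.

```python
from collections import Counter
from math import inf, log2, sqrt


def word_freqs(seq, k=2):
    """Raw frequencies of overlapping k-words (no background correction)."""
    n = len(seq) - k + 1
    return {w: c / n for w, c in Counter(seq[i:i + k] for i in range(n)).items()}


def relative_entropy(p, q):
    """Classical RE in bits; infinite when a word of p is missing from q."""
    total = 0.0
    for word, pw in p.items():
        if pw == 0.0:
            continue
        qw = q.get(word, 0.0)
        if qw == 0.0:
            return inf
        total += pw * log2(pw / qw)
    return total


def entropy(p):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return -sum(x * log2(x) for x in p.values() if x > 0)


def wse(p, q, weight_a=1.0, weight_b=1.0):
    """Assumed weighted form: (wa + wb) * H(mix) - wa * H(p) - wb * H(q)."""
    w_sum = weight_a + weight_b
    mix = {
        w: (weight_a * p.get(w, 0.0) + weight_b * q.get(w, 0.0)) / w_sum
        for w in set(p) | set(q)
    }
    return w_sum * entropy(mix) - weight_a * entropy(p) - weight_b * entropy(q)


S = ("ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATG"
     "TAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG")
i = S.index("TA")                      # position of the only TA dinucleotide
S_mut = S[:i] + "C" + S[i + 1:]        # point mutation T -> C

p, q = word_freqs(S), word_freqs(S_mut)
re_value = relative_entropy(p, q)      # degenerate: TA is absent from S_mut
d1 = wse(p, q)                                  # equal weights
d2 = wse(p, q, len(S), len(S_mut))              # sequence-length weights
d3 = wse(p, q, sqrt(len(S)), sqrt(len(S_mut)))  # square-root weights
```

With equal sequence lengths the three variants differ only by a constant factor, so they rank sequence pairs identically; they can diverge only when lengths differ.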

Results

Phylogenetic trees of 25 viruses

The outbreak of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003 has had a tremendous impact on worldwide health care systems [22], [23]. Coronaviruses are members of a family of enveloped viruses that replicate in the cytoplasm of animal host cells. According to the type of host, previously isolated coronaviruses can be classified into three groups. Groups I and II contain mammalian viruses, whereas group III contains only avian viruses. After the genome sequencing of some SARS-CoVs, many efforts have been made to identify the phylogenetic position of SARS-CoVs in the coronavirus tree using molecular data. However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any previously isolated group and form a new group [24], [25]; a maximum likelihood tree built from a fragment of the spike protein preferred SARS-CoVs clustering with group II coronaviruses (murine hepatitis virus and rat coronavirus) [26]; while an information-based method, which makes use of whole genome sequences, indicated that SARS-CoVs are close to the group I coronaviruses rather than forming a new group [27]. In the present work, we select 25 complete virus genomes: 12 coronaviruses from the three typical groups, 12 SARS-CoV strains, and a torovirus, which serves as the outgroup for coronaviruses [28] (Table 1). Pairwise distance matrices for these 25 sequences are calculated using our distances (the three WSE variants) and the classical relative entropy. Note that RE does not satisfy the symmetry condition, so we use the sum of RE(A, B) and RE(B, A) instead. Phylogenetic trees are then built from these matrices using the UPGMA program in the PHYLIP package [29]. Finally, rooted phylogenetic trees are drawn with the TREEVIEW program [30], and some of them are shown in Fig. 1 and Fig. 2.
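The matrix-building step might look like the following Python sketch. It symmetrizes an asymmetric measure by summing both directions, as described above, and writes a square matrix in the layout PHYLIP's distance programs conventionally read: a taxon count, then one row per taxon with the name padded to ten characters. The toy profiles and all function names are illustrative assumptions, not the authors' code.

```python
from math import log2


def relative_entropy(p, q):
    """RE in bits; assumes every word of p also occurs in q."""
    return sum(pw * log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def symmetrized(distance):
    """Turn an asymmetric measure into d(a, b) + d(b, a)."""
    return lambda a, b: distance(a, b) + distance(b, a)


def pairwise_matrix(names, profiles, distance):
    """Full symmetric matrix with a zero diagonal."""
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(profiles[names[i]], profiles[names[j]])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def to_phylip_square(names, matrix):
    """Square distance matrix text: taxon count, then 10-character names."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(name[:10].ljust(10) + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-word profiles standing in for real k-word vectors.
profiles = {
    "seqA": {"A": 0.5, "C": 0.5},
    "seqB": {"A": 0.25, "C": 0.75},
    "seqC": {"A": 0.75, "C": 0.25},
}
names = list(profiles)
matrix = pairwise_matrix(names, profiles, symmetrized(relative_entropy))
text = to_phylip_square(names, matrix)
```

Names longer than ten characters are truncated here, so abbreviations such as those in Table 1 should be kept short and unique.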

Table 1

Coronaviruses and a torovirus used to construct the phylogenetic trees

No.	Accession	Abbreviation	Genome	Group	Length (nt)
1	NC_002654	HCoV-229E	Human coronavirus 229E	I	27317
2	NC_002306	TGEV	Transmissible gastroenteritis virus	I	28586
3	NC_003436	PEDV	Porcine epidemic diarrhea virus	I	28033
4	U00735	BCoVM	Bovine coronavirus strain Mebus	II	31032
5	AF391542	BCoVL	Bovine coronavirus isolate BCoV-LUN	II	31028
6	AF220295	BCoVQ	Bovine coronavirus strain Quebec	II	31100
7	NC_003045	BCoV	Bovine coronavirus	II	31028
8	AF208067	MHVM	Murine hepatitis virus strain ML-10	II	31233
9	AF201929	MHV2	Murine hepatitis virus strain 2	II	31276
10	AF208066	MHVP	Murine hepatitis virus strain Penn 97-1	II	31112
11	NC_001846	MHV	Murine hepatitis virus	II	31357
12	NC_001451	IBV	Avian infectious bronchitis virus	III	27608
13	AY278488	BJ01	SARS coronavirus BJ01	—	29725
14	AY278741	Urbani	SARS coronavirus Urbani	—	29727
15	AY278491	HKU-39849	SARS coronavirus HKU-39849	—	29742
16	AY278554	CUHK-W1	SARS coronavirus CUHK-W1	—	29736
17	AY282752	CUHK-Su10	SARS coronavirus CUHK-Su10	—	29736
18	AY283794	SIN2500	SARS coronavirus SIN2500	—	29711
19	AY283795	SIN2677	SARS coronavirus SIN2677	—	29705
20	AY283796	SIN2679	SARS coronavirus SIN2679	—	29711
21	AY283797	SIN2748	SARS coronavirus SIN2748	—	29706
22	AY283798	SIN2774	SARS coronavirus SIN2774	—	29711
23	AY291451	TW1	SARS coronavirus TW1	—	29729
24	NC_004718	TOR2	SARS coronavirus	—	29751
25	X52374	EToV	Equine torovirus	—	7920

Fig. 1

The common phylogenetic trees of 25 viruses constructed by the relative entropy and our WSE distances. For word orders k = 2 and 3, the corresponding trees are listed as (a) and (b). When k > 3, the relative entropy method fails as some words are absent in these genomes.

Fig. 2

Phylogenetic trees of 25 viruses for different values of k, with pairwise distances evaluated by our distance (for k increasing from 4 to 7, the respective trees are shown as (a), (b), (c) and (d)). In this case, the classical relative entropy method gives degenerate results.

Our three WSE distances perform similarly for all values of k (from 2 to 7). They give nearly the same phylogeny except for the lengths of some branches. This is as expected, since these genomes do not differ significantly in length (except for the torovirus). Trees constructed by RE and by our distances have the same topology at the word lengths k = 2 and 3 (these common topologies are shown in Fig. 1). But when k > 3, the relative entropy fails as some words are absent in these genomes (Table 2).

Table 2

Numbers of words that are absent in some species for different values of k

Species	BCoV	HCoV-229E	IBV	MHV	PEDV	SARS	TGEV
k=5	2	6	6	1	1	1	9
k=6	288	384	338	100	207	259	408
k=7	5607	6205	5861	4456	5215	5150	6049
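Counts like those in Table 2 can be obtained by comparing the set of observed words with the size of the full word space. A minimal Python sketch follows; the function name is ours, and skipping windows that contain ambiguous bases such as N is an assumption of the sketch.

```python
def absent_word_count(seq, k, alphabet="ACGT"):
    """Number of the len(alphabet)**k possible k-words that never occur in seq."""
    seq = seq.upper()
    letters = set(alphabet)
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # Ignore windows containing symbols outside the alphabet (e.g. N).
    observed = {word for word in observed if set(word) <= letters}
    return len(alphabet) ** k - len(observed)


missing = absent_word_count("AAAA", 2)  # only AA occurs out of 16 dinucleotides
```

Any nonzero count for one genome means the classical relative entropy to that genome can be infinite, as soon as the other genome contains one of the missing words.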

The phylogenetic result obtained for k = 2 deviates significantly from the commonly accepted one: this phylogeny prefers the murine hepatitis viruses as the outgroup (Fig. 1a). The word order k = 3 seems to make a remarkable improvement (Fig. 1b): the three typical groups of coronaviruses and the 12 SARS-CoVs all cluster accordingly. Two big branches are formed: one contains the group I coronaviruses and all SARS genomes, the other the torovirus and the group II and III coronaviruses. But this topology also does not support the outgroup status of the torovirus. With the definition of WSE, we can study higher-order words and check whether they are suitable for estimating phylogenies. WSE trees at word lengths k = 4 to 7 are shown in Fig. 2. As can be seen from Fig. 1 and Fig. 2, the topologies of the evolutionary trees converge as k increases, and become stable at k = 4, 5 and 6. These stable trees maintain the main topology of Fig. 1b, and also successfully support the outgroup status of the torovirus relative to all coronaviruses. When k > 6, the average frequency of words is too small for the present statistical analysis. In conclusion, we prefer the stable topology as the best phylogeny, in which all 12 SARS-CoV strains are grouped together and form a new fourth group, which is most closely related to the group I coronaviruses (TGEV in particular).

Comparisons with some classical methods

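Besides the relative entropy, there are a great many alignment-free comparison measures based on short word composition. In this part, we list some of them and compare them with our method through phylogenetic experiments on the same data set.

Euclidean distance [31], [32]:

$$ED(A, B) = \sqrt{\sum_i \left(p_i^A - p_i^B\right)^2},$$

where $p_i^A$ and $p_i^B$ are the frequencies of the ith word in sequences A and B, respectively.

Linear correlation coefficient [33]:

$$LCC(A, B) = \frac{\sum_i \left(p_i^A - \bar{p}^A\right)\left(p_i^B - \bar{p}^B\right)}{\sqrt{\sum_i \left(p_i^A - \bar{p}^A\right)^2}\,\sqrt{\sum_i \left(p_i^B - \bar{p}^B\right)^2}},$$

where $\bar{p}^A$ and $\bar{p}^B$ are the mean word frequencies in A and B.

Cosine function [20], [34]:

$$\cos(A, B) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}},$$

where $a_i = (p_i^A - p_i^{0,A})/p_i^{0,A}$, $b_i$ is defined likewise for sequence B, and $p_i^{0,A}$ is the expected frequency of word i in sequence A, estimated from the frequencies of appropriate shorter subwords under a Markov model of order $k - 2$.

Information-based similarity index [35]: a sum over all words of the absolute rank difference $|R_A(w_i) - R_B(w_i)|$, weighted by the normalized Shannon entropy terms of the word in the two sequences, where $H_A(w_i) = -p_i^A \log p_i^A$ is the Shannon entropy for word i in sequence A, and $R_A(w_i)$ and $R_B(w_i)$ represent the ranks of word i in sequences A and B.

We repeated our experiment using the above classical distances for different values of k, and some phylogenetic trees (at a fixed word length) are shown in Fig. 3, Fig. 4 and Fig. 5.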
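Three of these measures are simple enough to sketch in Python on dense frequency vectors. The helper `to_vector` (our name) expands a sparse `{word: frequency}` mapping over the full word space; this matters for the correlation coefficient, whose means depend on the number of dimensions. The cosine below is the plain vector cosine; the version described above is applied to background-corrected components rather than raw frequencies.

```python
from itertools import product
from math import sqrt


def to_vector(freqs, k, alphabet="ACGT"):
    """Dense vector over all len(alphabet)**k words, in lexicographic order."""
    return [freqs.get("".join(w), 0.0) for w in product(alphabet, repeat=k)]


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def linear_correlation(x, y):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Two toy dinucleotide profiles with no words in common.
x = to_vector({"AA": 0.5, "AC": 0.5}, 2)
y = to_vector({"TG": 0.5, "TT": 0.5}, 2)
ed, cs, lcc = euclidean(x, y), cosine(x, y), linear_correlation(x, y)
```

Unlike the classical relative entropy, all three stay finite when the two sequences share no words, which is why they can be evaluated at any word length.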

Fig. 3

Phylogenetic tree of 25 viruses constructed using Euclidean distance as the dissimilarity metric.

Fig. 4

Phylogenetic tree of 25 viruses constructed using as the dissimilarity metric.

Fig. 5

Phylogenetic tree of 25 viruses constructed using the linear correlation coefficient as the dissimilarity metric.

The phylogenetic tree obtained using the Euclidean distance as the dissimilarity metric is shown in Fig. 3. This tree largely agrees with our tree, i.e., the torovirus stays outside all coronaviruses, and SARS-CoVs are closer to the transmissible gastroenteritis virus (TGEV). But an obvious defect of this tree is that it fails to cluster the three group I coronaviruses: HCoV-229E, PEDV and TGEV. Fig. 4 shows the phylogenetic tree constructed by another of the classical measures listed above. This topology supports SARS-CoVs and group II as a closer pair, which is in accordance with the result of Liò and Goldman using a fragment of the spike protein [26]. However, this tree also fails to cluster the group I coronaviruses. In Fig. 5, we show the tree built from the linear correlation coefficient between pairwise frequency vectors. This tree perfectly clusters the species within each typical group, and supports the outgroup status of SARS-CoVs relative to the other coronaviruses. But it fails to identify the outgroup status of the torovirus relative to the coronaviruses.

Discussion

Phylogenetic analysis from k-word composition is a good alternative to the traditional alignment methods. It has relatively low computational complexity, and it does not suffer from genetic rearrangements and transposon activity, which are common mechanisms of genome evolution. In most cases, biological sequences are represented as occurrence or composition vectors in a high-dimensional Euclidean space, and the phylogenetic results depend heavily on the quality of this vector representation. In order to capture more sequence information and obtain a better vector representation, a suitable word length is of critical importance. According to information theory, this length reflects the balance between noise and information: some information may be lost if one uses words that are too short, while noise dominates when long words are considered. Moreover, the suitable word length is often species-specific. In short, a well-defined distance measure that can be applied at any word length is of great importance. Relative entropy is one of the classical approaches to measuring the dissimilarity between two composition vectors. However, as the word order increases, this approach gives degenerate results due to the absence of some word types. In the present paper, we propose three revisions of the classical relative entropy (RE). These modifications perform equally with RE in phylogenetic inference in the case of short words. But when long words are considered, our method still works and yields convergent phylogenetic topologies (in this case, RE gives degenerate results as some word types are absent). By constructing phylogenetic trees using WSE and some classical distance measures, we find that, as far as the set of 25 virus sequences is concerned, our distances are not inferior to those classical methods. According to our tree, all SARS-CoVs form a group, which is more closely related to the group I coronavirus TGEV.
This result is largely in accordance with the phylogeny given by Yang et al. [27], and is also supported by experimental evidence, which showed that antibodies specific to group I coronaviruses (transmissible gastroenteritis virus, TGEV) are able to recognize antigens in SARS-CoV infected cultured cells [37]. We hope our modification can serve as an alternative approach to avoid the degeneracy, and that it has the potential to study the information expressed by words of different orders (when RE-based approaches are used). However, as the sequences in our data set do not differ significantly in length, our three weighting methods give nearly the same phylogeny. According to the Central Limit Theorem, the accuracy in the estimation of a probability in a multinomial distribution is proportional to sqrt(length). So when using relatively larger data sets, especially when the sequences differ significantly in length, we prefer the revision weighted by the square roots of the sequence lengths, which should give relatively competitive phylogenetic results. In order to further validate our approach and evaluate the performance of the different WSE variants, we will pursue the following experimental and theoretical work. Experimentally, we will “generate” phylogenetic trees under different evolutionary models, and reconstruct the phylogenetic topologies from their corresponding external nodes (OTUs) using our distances and RE. The performance of the different WSE variants can then be evaluated by checking the consistency of these phylogenetic topologies. Theoretically, we hypothesize that the order relations between pairwise sequence distances calculated by RE and by WSE are in accordance, i.e., WSE(A, B) ≤ WSE(C, D) if and only if RE(A, B) ≤ RE(C, D), for any sequences A, B, C and D (a proof will be presented in a future paper). This property, on the one hand, can partially explain the equivalence of WSE and RE in phylogenetic inference in the case of short words, and on the other hand, can help to analyze the performance of the different weighting methods.

32 in total

1. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA.

Authors: A Campbell; J Mrázek; S Karlin
Journal: Proc Natl Acad Sci U S A Date: 1999-08-03 Impact factor: 11.205

2. Linguistic analysis of the human heartbeat using frequency and rank order statistics.

Authors: Albert C-C Yang; Shu-Shya Hseu; Huey-Wen Yien; Ary L Goldberger; C-K Peng
Journal: Phys Rev Lett Date: 2003-03-13 Impact factor: 9.161

3. A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency.

Authors: Takashi Abe; Shigehiko Kanaya; Makoto Kinouchi; Yuta Ichiba; Tokio Kozuki; Toshimichi Ikemura
Journal: Genome Inform Date: 2002

4. Compositional differences within and between eukaryotic genomes.

5. A major outbreak of severe acute respiratory syndrome in Hong Kong.

6. Simpler DNA sequence representations.

7. Toroviruses: replication, evolution and comparison with other members of the coronavirus-like superfamily.

8. Novel DNA sequence representations.

9. H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences.

10. Genes from nine genomes are separated into their organisms in the dinucleotide composition space.
