WSE, a new sequence distance measure based on word frequencies.

Jun Wang, Xiaoqi Zheng.

Abstract

In this article, we present a new distance metric, the Weighted Sequence Entropy (WSE), based on the short word composition of biological sequences. As a revision of the classical relative entropy (RE), our metric (1) works equivalently to RE for small k, and (2) avoids the degeneracy that arises when some word types are absent in one sequence but not in the other. Experiments on 25 viruses including SARS-CoVs show that our method and RE give exactly the same phylogenetic tree when the word length k ≤ 3. When k > 3, our method still works and yields a convergent phylogenetic topology, whereas RE gives degenerate results.

Year:  2008        PMID: 18590747      PMCID: PMC7185439          DOI: 10.1016/j.mbs.2008.06.001

Source DB:  PubMed          Journal:  Math Biosci        ISSN: 0025-5564            Impact factor:   2.144


Introduction

The elucidation of the evolutionary history of different species is a major concern of biological science. Early approaches were mainly based on the alignment of a gene or protein sequence. On one hand, “different genes may tell different stories”: unequal mutational rates of different genes or different pieces of genes, as well as lateral gene transfer, are widely detected, so a single gene or protein sequence usually does not carry enough information to determine its phylogenetic position. On the other hand, traditional alignment methods are computationally intensive and ill-suited to whole genome comparison because each genome has its own genes and gene order. In short, there is an urgent need to develop new phylogenetic methods that utilize the ever-increasing genome data. In recent years, several alignment-free phylogenetic methods have been proposed [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Among the earliest attempts is the gene content method [1], [2], which utilizes the ratio between the number of genes two species share and the total number of genes. This method is very simple and can be applied to long genomes, but it fails when species are closely related to each other, e.g., the mt genome sequences of some placental mammals used by Cao et al. [3], because these genomes share exactly the same genes and gene order. Similar work along this line is the gene order method [4], which also fails in the above case. In addition, some methods based on graphical representations of DNA sequences have been outlined [5], [6], [7], [8], [9], [10]; these usually translate a DNA sequence into a set of plots in 2D/3D space and use some graphical invariants to characterize the sequence. These methods provide a simple way of viewing, sorting and comparing various gene structures. Another widely studied method is based on the short word composition of biological sequences.
Observing that the relative abundances of all dinucleotides are remarkably constant across a genome, Karlin et al. [11], [12], [13] proposed the genome signature to characterize the genome. The genome signature consists of the array of dinucleotide relative abundances $\rho_{xy} = f_{xy}/(f_x f_y)$ extended over all dinucleotides, where $f_x$ is the frequency of nucleotide x and $f_{xy}$ is the frequency of dinucleotide xy. Alternatively, the relative abundances of tri-, tetra- or even k-nucleotides can also be calculated. Besides phylogenetic inference, these signatures have been used in several applications, e.g., assessment of general relatedness of genomes [12], genetic sequence classification [16], clustering of genes from different genomes [17], and recognition of species-specific sequence blocks [18]. After counting the frequencies of all k-nucleotides (or k-mers, k-words, etc.) in a DNA or protein sequence, we can represent this sequence as a vector in the $4^k$- or $20^k$-dimensional Euclidean space. Since the frequency of a k-nucleotide measures the probability of its occurrence in the given sequence, it is intuitive to use the relative entropy between two frequency vectors as a measure of distance. Let $p_i^A$ and $p_i^B$ be the observed frequencies of the i-th k-nucleotide in sequences A and B, respectively. The relative entropy (or Kullback–Leibler distance) [19] is defined as

$$\mathrm{RE}(A,B) = \sum_i p_i^A \log_2 \frac{p_i^A}{p_i^B}. \qquad (1)$$

The relative entropy is the inefficiency of assuming that the distribution is $p^B$ when the true distribution is $p^A$. It does not satisfy the symmetry condition or the triangle inequality, so it is not strictly a distance measure. The relative entropy, however, satisfies many important and useful mathematical properties. For example, it is a convex function of $p^A$, always nonnegative, and vanishes if and only if $p_i^A = p_i^B$ for all possible i. Vector representations of biological sequences evidently lose some information when k is small, e.g., only the nucleotide composition is known when $k = 1$.
More information (especially the sequence-order information) is captured if we consider higher-order words. Unfortunately, as the word order increases, some word types present in one sequence may be absent from the other. This leads to a degenerate result when Formula (1) is used. Meanwhile, it is widely observed that some “word spaces” are not saturated for an individual genome, especially for protein sequences. A survey of the 101,602 protein sequences in the SWISS-PROT database Rel. 40 (2000) showed that all these proteins together occupy less than 26% of the possible 6-amino-acid strings. In particular, EcoliK occupies less than 2% and Mycge less than 0.3% of the 6-amino-acid strings [20]. Traditionally, in order to avoid the degeneracy, one can unite some k-words (i.e., words with frequencies up to a particular threshold) into one event. However, this is also accompanied by a loss of sequence information. In this paper, we give a new distance metric to overcome the degeneracy. As a revision of the classical relative entropy, this metric works equivalently to the relative entropy in phylogenetic inference when k is small, and avoids the degeneracy when k is large. In the main text, we give the mathematical description and a simple implementation of this metric. As an application, we construct phylogenetic trees of 25 viruses including SARS-CoVs.
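The degeneracy described above can be checked directly by counting which k-words never occur in a sequence. The following is a minimal Python sketch, not the authors' code; the function names and the assumption that sequences contain only letters of the given alphabet are illustrative.

```python
def present_words(seq, k):
    """Set of distinct k-words occurring in seq (overlapping windows)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def absent_word_count(seq, k, alphabet="ACGT"):
    """Number of the len(alphabet)**k possible k-words that never occur in seq.

    Assumes seq contains only letters from `alphabet`.
    """
    return len(alphabet) ** k - len(present_words(seq, k))


def same_support(seq_a, seq_b, k):
    """True when both sequences contain exactly the same set of k-words.

    Only in that case is the relative entropy finite in both directions.
    """
    return present_words(seq_a, k) == present_words(seq_b, k)
```

For a genome of fixed length, the absent-word count grows quickly with k because the number of possible words grows as a power of k while the number of windows stays roughly constant.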

Methods

Frequencies of k-words

Given a DNA or protein sequence of length L, we count the occurrences of words of length k. Partial overlap of words is allowed. There are in total $4^k$ possible words for DNA sequences and $20^k$ words for protein sequences. The frequency of a word is obtained by dividing its number of occurrences by the total number of words in the given sequence, $L - k + 1$.
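As an illustration of the counting step, here is a short Python sketch using a sliding window of width k. The function name `kword_frequencies` is hypothetical, and words that never occur are simply left out of the result (their frequency is zero).

```python
from collections import Counter


def kword_frequencies(seq, k):
    """Relative frequencies of overlapping k-words in seq.

    A sequence of length L contains L - k + 1 overlapping windows, so each
    count is divided by that total. Words that never occur are omitted.
    """
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the word length k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: count / total for word, count in counts.items()}


# Example: "AAAC" has the three 2-words AA, AA and AC.
freqs = kword_frequencies("AAAC", 2)
```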

Subtraction of random background

The above frequency reflects both the results of natural selection and random mutations [20]. In order to highlight the selective diversification of sequence composition, one should subtract the background from the simple frequency results by dividing by the corresponding expected value:

$$r(w) = \frac{f(w)}{f^0(w)},$$

where $f(w)$ is the observed frequency of word $w$ and $f^0(w)$ is its background frequency. There are many approaches to estimating the background frequency of a word in a sequence. In the present paper, we use the simplest one, which is based on the Markov model of order 0 [11], [12], [13], [21]. For example, the trinucleotide ACT has the expected frequency $f^0(\mathrm{ACT}) = f(\mathrm{A})\,f(\mathrm{C})\,f(\mathrm{T})$. More elaborate estimations, e.g., models of higher-order Markov processes, can be defined accordingly. Note that $r$ is actually not a probability distribution, since the sum of its components is not 1. We normalize it here by dividing by the sum: $p_i = r(w_i)/\sum_j r(w_j)$, for each word $w_i$ of length k.
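The background correction can be sketched in Python as follows, assuming the order-0 Markov model mentioned above: the expected frequency of a word is the product of the frequencies of its letters. The function name is hypothetical and this is an illustration rather than the authors' implementation.

```python
from collections import Counter
from math import prod


def background_corrected(seq, k):
    """Normalized ratios of observed to expected k-word frequencies.

    The expected frequency of a word is the product of the single-letter
    frequencies of its letters (Markov model of order 0). The ratios are
    rescaled to sum to 1 so that they form a probability distribution.
    """
    total = len(seq) - k + 1
    word_freq = {w: c / total
                 for w, c in Counter(seq[i:i + k] for i in range(total)).items()}
    letter_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    ratio = {w: f / prod(letter_freq[b] for b in w) for w, f in word_freq.items()}
    norm = sum(ratio.values())
    return {w: r / norm for w, r in ratio.items()}


# Example: in "ACGT" every letter has frequency 1/4 and each 2-word occurs once.
corrected = background_corrected("ACGT", 2)
```

Every word that occurs in the sequence consists of letters that occur in it, so the expected frequency in the denominator is never zero.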

Weighted sequence entropy and distance matrix

In order to overcome the degeneracy, we replace $p_i^B$ in Formula (1) by $(p_i^A + p_i^B)/2$, where $p_i^A$ and $p_i^B$ are the frequencies of the i-th word in sequences A and B; then

$$\mathrm{RE}\!\left(A, \tfrac{A+B}{2}\right) = \sum_i p_i^A \log_2 \frac{2\,p_i^A}{p_i^A + p_i^B}. \qquad (2)$$

Our remedy is meaningful since $(p^A + p^B)/2$ is also a distribution. Actually, Formula (2) can be treated as the relative entropy between $p^A$ and the arithmetical average of the distributions $p^A$ and $p^B$. Terms with $p_i^A = 0$ contribute zero, following the usual convention $0 \log 0 = 0$. In order to ensure the symmetry condition, we take the sum

$$d(A,B) = \mathrm{RE}\!\left(A, \tfrac{A+B}{2}\right) + \mathrm{RE}\!\left(B, \tfrac{A+B}{2}\right).$$

Recall Shannon’s information entropy

$$H(A) = -\sum_i p_i^A \log_2 p_i^A$$

and write

$$H\!\left(\tfrac{A+B}{2}\right) = -\sum_i \frac{p_i^A + p_i^B}{2} \log_2 \frac{p_i^A + p_i^B}{2}$$

as the entropy of the arithmetical average of the distributions $p^A$ and $p^B$. Then $d(A,B)$ can be rewritten as

$$d(A,B) = 2H\!\left(\tfrac{A+B}{2}\right) - H(A) - H(B).$$

Alternatively, we can give more elaborate weights to $p^A$ and $p^B$, e.g., the sequence lengths $L_A$ and $L_B$, or their square roots $\sqrt{L_A}$ and $\sqrt{L_B}$ (the accuracy in estimation of probability in a multinomial distribution is proportional to the square root of the corresponding sequence length). This gives another two relative entropy functions, and their corresponding distance measures can be defined and calculated similarly. Since all three distances are linear combinations of Shannon entropies, we call them the Weighted Sequence Entropy (WSE). According to their definitions, these distances avoid the degeneracy because each is bounded above (the equally weighted distance by 2). To illustrate the virtues of WSE, we give an example as follows. The first exon of the β-globin gene of Lemur is S = ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG. There is only one dinucleotide TA in this sequence. If a point mutation changes the T in this TA to C (denote the new sequence by $S'$), then the relative entropy between S and $S'$ is $+\infty$, which contradicts the biological meaning, since the two sequences differ at a single site. In contrast, our three distances give the values 0.0245821, 2.26155 and 0.23578.
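The β-globin example can be reproduced in outline with the Python sketch below. It implements Formula (1) and the equally weighted WSE with base-2 logarithms. As a simplifying assumption it uses plain dinucleotide frequencies without the background correction, so the printed WSE value is not expected to match the figure quoted above. All function names are hypothetical.

```python
from collections import Counter
from math import inf, log2


def kword_frequencies(seq, k):
    """Relative frequencies of the overlapping k-words that occur in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def relative_entropy(p, q):
    """Formula (1) in bits; infinite when a word of p is missing from q."""
    total = 0.0
    for word, pw in p.items():
        qw = q.get(word, 0.0)
        if qw == 0.0:
            return inf
        total += pw * log2(pw / qw)
    return total


def wse_equal_weights(p, q):
    """Sum of the relative entropies of p and q to their arithmetic average."""
    average = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    return relative_entropy(p, average) + relative_entropy(q, average)


S = ("ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGG"
     "CAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG")
site = S.find("TA")                        # position of the only TA
S_MUT = S[:site] + "C" + S[site + 1:]      # point mutation T -> C
P, Q = kword_frequencies(S, 2), kword_frequencies(S_MUT, 2)

print(relative_entropy(P, Q))   # degenerate: TA is absent from the mutant
print(wse_equal_weights(P, Q))  # finite and small
```

Because the average distribution is positive wherever either input is positive, the logarithm in `wse_equal_weights` never sees a zero denominator, which is the point of the revision.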

Results

Phylogenetic trees of 25 viruses

The outbreak of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003 had a tremendous impact on worldwide health care systems [22], [23]. Coronaviruses are members of a family of enveloped viruses that replicate in the cytoplasm of animal host cells. According to the type of host, previously isolated coronaviruses can be classified into three groups. Groups I and II contain mammalian viruses, whereas group III contains only avian viruses. After genome sequencing of some SARS-CoVs, many efforts have been made to identify the phylogenetic position of SARS-CoVs in the coronavirus tree using molecular data. However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any previously isolated group and form a new group [24], [25]; a maximum likelihood tree built from a fragment of the spike protein preferred SARS-CoVs clustering with group II coronaviruses (murine hepatitis virus and rat coronavirus) [26]; while an information-based method, which makes use of the whole genome sequences, indicated that SARS-CoVs are close to the group I coronaviruses rather than forming a new group [27]. In the present work, we select 25 complete virus genomes: 12 coronaviruses from the three typical groups, 12 SARS-CoV strains, and a torovirus, which serves as the outgroup for coronaviruses [28] (Table 1). Pairwise distance matrices for these 25 sequences are calculated using our three WSE distances (with equal, length and square-root-of-length weights) and the classical relative entropy. Since RE does not satisfy the symmetry condition, we use the sum of RE(A,B) and RE(B,A) instead. Then, phylogenetic trees are built from these matrices using the UPGMA program in the PHYLIP package [29]. Finally, rooted phylogenetic trees are drawn with the TREEVIEW program [30], and some of them are shown in Figs. 1 and 2.
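To make the tree-building step concrete, the following Python sketch shows the UPGMA clustering idea on a small symmetric distance matrix. It is an illustration only and does not reproduce PHYLIP: it returns the topology as nested tuples without branch lengths, and the function name and input layout are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping (a, b) to the distance between leaves a and b,
           with one entry per unordered pair.
    """
    size = {name: 1 for name in names}
    d = {}
    for (a, b), value in dist.items():
        d[(a, b)] = d[(b, a)] = value
    active = list(names)
    while len(active) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(active, 2), key=lambda pair: d[pair])
        merged = (a, b)
        active = [c for c in active if c not in (a, b)]
        for c in active:
            # Size-weighted average keeps every leaf pair equally weighted.
            value = (size[a] * d[(a, c)] + size[b] * d[(b, c)]) / (size[a] + size[b])
            d[(merged, c)] = d[(c, merged)] = value
        size[merged] = size[a] + size[b]
        active.append(merged)
    return active[0]


# Toy example: A and B are close, C is distant from both.
tree = upgma(["A", "B", "C"], {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "C"): 4.0})
```

In practice the `dist` entries would be the pairwise WSE values, or the symmetrized relative entropy RE(A,B) + RE(B,A), for the 25 genomes.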
Table 1

Coronaviruses and a torovirus used to construct phylogenetic tree

No. | Accession | Abbreviation | Genome | Group | Length (nt)
1 | NC_002654 | HCoV-229E | Human coronavirus 229E | I | 27317
2 | NC_002306 | TGEV | Transmissible gastroenteritis virus | I | 28586
3 | NC_003436 | PEDV | Porcine epidemic diarrhea virus | I | 28033
4 | U00735 | BCoVM | Bovine coronavirus strain Mebus | II | 31032
5 | AF391542 | BCoVL | Bovine coronavirus isolate BCoV-LUN | II | 31028
6 | AF220295 | BCoVQ | Bovine coronavirus strain Quebec | II | 31100
7 | NC_003045 | BCoV | Bovine coronavirus | II | 31028
8 | AF208067 | MHVM | Murine hepatitis virus strain ML-10 | II | 31233
9 | AF201929 | MHV2 | Murine hepatitis virus strain 2 | II | 31276
10 | AF208066 | MHVP | Murine hepatitis virus strain Penn 97-1 | II | 31112
11 | NC_001846 | MHV | Murine hepatitis virus | II | 31357
12 | NC_001451 | IBV | Avian infectious bronchitis virus | III | 27608
13 | AY278488 | BJ01 | SARS coronavirus BJ01 | - | 29725
14 | AY278741 | Urbani | SARS coronavirus Urbani | - | 29727
15 | AY278491 | HKU-39849 | SARS coronavirus HKU-39849 | - | 29742
16 | AY278554 | CUHK-W1 | SARS coronavirus CUHK-W1 | - | 29736
17 | AY282752 | CUHK-Su10 | SARS coronavirus CUHK-Su10 | - | 29736
18 | AY283794 | SIN2500 | SARS coronavirus SIN2500 | - | 29711
19 | AY283795 | SIN2677 | SARS coronavirus SIN2677 | - | 29705
20 | AY283796 | SIN2679 | SARS coronavirus SIN2679 | - | 29711
21 | AY283797 | SIN2748 | SARS coronavirus SIN2748 | - | 29706
22 | AY283798 | SIN2774 | SARS coronavirus SIN2774 | - | 29711
23 | AY291451 | TW1 | SARS coronavirus TW1 | - | 29729
24 | NC_004718 | TOR2 | SARS coronavirus | - | 29751
25 | X52374 | EToV | Equine torovirus | - | 7920
Fig. 1

The common phylogenetic trees of 25 viruses constructed by the relative entropy and our distances. For word orders k = 2 and 3, the corresponding trees are listed as (a) and (b). When k > 3, the relative entropy method fails as some words are absent in these genomes.

Fig. 2

Phylogenetic trees of 25 viruses for different values of k with pairwise distances evaluated by our distance (for k increasing from 4 to 7, the respective trees are shown as (a), (b), (c) and (d)). In this case, the classical relative entropy method gives degenerate results.

Our three WSE distances perform similarly for all values of k (from 2 to 7). They give nearly the same phylogeny except for the lengths of some branches. This is as expected, since these genomes have no significant difference in length (except for the torovirus). Trees constructed by RE and our distances have the same topology at word lengths k = 2 and 3 (these common topologies are shown in Fig. 1). But when k > 3, the relative entropy fails as some words are absent in these genomes (Table 2).
Table 2

Numbers of words that are absent in some species for different values of k

Species | BCoV | HCoV-229E | IBV | MHV | PEDV | SARS | TGEV
k=5 | 2 | 6 | 6 | 1 | 1 | 1 | 9
k=6 | 288 | 384 | 338 | 100 | 207 | 259 | 408
k=7 | 5607 | 6205 | 5861 | 4456 | 5215 | 5150 | 6049
The phylogenetic result obtained for k = 2 deviates significantly from the commonly accepted one: this phylogeny prefers the murine hepatitis viruses as the outgroup (Fig. 1a). The word order k = 3 seems to make a remarkable improvement (Fig. 1b): the three typical groups of coronaviruses and the 12 SARS-CoVs all cluster accordingly. Two big branches are formed: one contains the group I coronaviruses and all SARS genomes, the other contains the torovirus and the group II and III coronaviruses. But this topology also does not support the outgroup status of the torovirus. With the definition of WSE, we can study higher-order words and check whether they are suitable for estimating phylogenies. WSE trees at word lengths k = 4 to 7 are shown in Fig. 2. As can be seen from Figs. 1 and 2, the topologies of the evolutionary trees converge as k increases, and become stable at k = 4, 5 and 6. These stable trees maintain the main topology of Fig. 1b, and also successfully support the outgroup status of the torovirus relative to all coronaviruses. When k ≥ 7, the average frequency of words is too small for the present statistical analysis. In conclusion, the phylogeny we prefer is the one in which all 12 SARS-CoV strains are grouped together and form a new fourth group, which is distinctly related to the group I coronaviruses (TGEV, explicitly).

Comparisons with some classical methods

Besides the relative entropy, there are a great many alignment-free comparisons based on short word composition. In this part, we list some of them and compare them with our method through phylogenetic experiments on the same data set.
Euclidean distance [31], [32], computed between the word frequency vectors, where $p_i^A$ and $p_i^B$ are the frequencies of the ith word in sequences A and B, respectively.
Linear correlation coefficient [33], computed between the same frequency vectors.
Cosine function [20], [34], computed between vectors of background-corrected word frequencies, where the expected frequency of word i in sequence A is estimated from the frequencies of appropriate shorter subwords under a Markov model of order $k-2$.
Information-based similarity index [35], which is built from the Shannon entropy for word i in each sequence and the ranks of word i in sequences A and B.
We repeated our experiment using the above classical distances for different values of k, and some of the resulting phylogenetic trees (at a single word length) are shown in Figs. 3, 4 and 5.
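For orientation, three of the measures listed above can be sketched in Python in their standard textbook forms. These are not necessarily the exact variants used in the cited papers, and the helper `to_vector` and all function names are hypothetical.

```python
from math import sqrt


def to_vector(freqs, words):
    """Frequency vector over a fixed word list; absent words count as 0."""
    return [freqs.get(w, 0.0) for w in words]


def euclidean_distance(x, y):
    """Square root of the summed squared componentwise differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def linear_correlation(x, y):
    """Pearson correlation coefficient between the two vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Two toy frequency tables with no words in common.
words = ["AA", "AC", "CA", "CC"]
x = to_vector({"AA": 0.5, "AC": 0.5}, words)
y = to_vector({"CA": 0.5, "CC": 0.5}, words)
```

Unlike the relative entropy, all three stay finite when a word is absent from one sequence, but cosine and correlation are similarities and would need to be converted into dissimilarities before tree building.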
Fig. 3

Phylogenetic tree of 25 viruses constructed using Euclidean distance as the dissimilarity metric.

Fig. 4

Phylogenetic tree of 25 viruses constructed using as the dissimilarity metric.

Fig. 5

Phylogenetic tree of 25 viruses constructed using the linear correlation coefficient as the dissimilarity metric.

The phylogenetic tree obtained using Euclidean distance as the dissimilarity metric is shown in Fig. 3. This tree is largely identical to our tree, i.e., the torovirus stays outside of all coronaviruses, and the SARS-CoVs are closer to the transmissible gastroenteritis virus (TGEV). But an obvious defect of this tree is that it fails to cluster the three group I coronaviruses: HCoV-229E, PEDV and TGEV. Fig. 4 shows the phylogenetic tree constructed by another of the measures listed above. This topology supports SARS-CoVs and group II as a closer pair, which is in accordance with the result of Liò and Goldman using a fragment of the spike protein [26]. However, this tree also fails to cluster the group I coronaviruses. In Fig. 5, we show the tree built by the linear correlation coefficient between pairwise frequency vectors. This tree perfectly clusters species within each typical group, and supports the outgroup status of SARS-CoVs relative to other coronaviruses. But it fails to identify the outgroup status of the torovirus relative to coronaviruses.

Discussion

Phylogenetic analysis from k-word composition is a good alternative to the traditional alignment methods. It has relatively low computational complexity, and does not suffer from genetic rearrangements and transposon activity, which are common modes of genome evolution. In most cases, biological sequences are represented as occurrence or composition vectors in a high-dimensional Euclidean space, and phylogenetic results are highly dependent on the quality of this vector representation. In order to capture more sequence information and obtain a better vector representation, a suitable word length is of critical importance. According to information theory, this length reflects the balance between noise and information: some information may be lost if one uses words that are too short, while noise will dominate when long words are considered. Moreover, the suitable word length is often species-specific. In short, a well-defined distance measure that can be applied to any word length is of great importance. Relative entropy is one of the classical approaches to measuring the dissimilarity between two composition vectors. However, as the word order increases, this approach gives degenerate results due to the absence of some word types. In the present paper, we propose three revisions of the classical relative entropy (RE). These modifications perform equally with RE in phylogenetic inference in the case of short words. But when long words are considered, our method still works and yields convergent phylogenetic topologies (in such cases, RE gives degenerate results as some word types are absent). By constructing phylogenetic trees using WSE and some classical distance measures, we find that, as far as the set of 25 virus sequences is concerned, our distances are not inferior to those classical methods. According to our tree, all SARS-CoVs form a group, which is more closely related to the group I coronavirus TGEV.
This result is mainly in accordance with the phylogeny given by Yang et al. [27], and is also supported by experimental evidence showing that antibodies specific to group I coronaviruses (transmissible gastroenteritis virus, TGEV) are able to recognize antigens in SARS-CoV infected cultured cells [37]. We hope our modification can serve as an alternative approach to avoid the degeneracy, and that it has the potential to help study the information expressed by different orders of words (when RE-based approaches are used). However, as the sequences in our data set have no significant difference in length, our three weighting methods give nearly the same phylogeny. According to the Central Limit Theorem, the accuracy in estimation of probability in a multinomial distribution is proportional to sqrt(length). So when using relatively larger data sets, especially when sequences differ significantly in length, we prefer the square-root-weighted revision as giving relatively competitive phylogenetic results. In order to further validate our approach and evaluate the performance of the different WSE distances, we plan the following experimental and theoretical work. Experimentally, we will “generate” phylogenetic trees under different evolutionary models, and reconstruct the phylogenetic topologies from their corresponding external nodes (OTUs) using our distances and RE. The performance of the different WSE distances can then be evaluated by checking the consistency of these phylogenetic topologies. Theoretically, we hypothesize that the order relations between pairwise sequence distances calculated by RE and WSE are in accordance, i.e., WSE(A,B) < WSE(C,D) if and only if RE(A,B) < RE(C,D), for any sequences A, B, C and D (a proof will be presented in a future paper). This phenomenon, on one hand, can partially explain the equivalence of WSE and RE in phylogenetic inference in the case of short words, and on the other hand, can help to analyze the performance of the different weighting methods.
  32 in total

1.  Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA.

Authors:  A Campbell; J Mrázek; S Karlin
Journal:  Proc Natl Acad Sci U S A       Date:  1999-08-03       Impact factor: 11.205

2.  Linguistic analysis of the human heartbeat using frequency and rank order statistics.

Authors:  Albert C-C Yang; Shu-Shya Hseu; Huey-Wen Yien; Ary L Goldberger; C-K Peng
Journal:  Phys Rev Lett       Date:  2003-03-13       Impact factor: 9.161

3.  A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency.

Authors:  Takashi Abe; Shigehiko Kanaya; Makoto Kinouchi; Yuta Ichiba; Tokio Kozuki; Toshimichi Ikemura
Journal:  Genome Inform       Date:  2002

4.  Compositional differences within and between eukaryotic genomes.

Authors:  S Karlin; J Mrázek
Journal:  Proc Natl Acad Sci U S A       Date:  1997-09-16       Impact factor: 11.205

5.  A major outbreak of severe acute respiratory syndrome in Hong Kong.

Authors:  Nelson Lee; David Hui; Alan Wu; Paul Chan; Peter Cameron; Gavin M Joynt; Anil Ahuja; Man Yee Yung; C B Leung; K F To; S F Lui; C C Szeto; Sydney Chung; Joseph J Y Sung
Journal:  N Engl J Med       Date:  2003-04-07       Impact factor: 91.245

6.  Simpler DNA sequence representations.

Authors:  M A Gates
Journal:  Nature       Date:  1985 Jul 18-24       Impact factor: 49.962

Review 7.  Toroviruses: replication, evolution and comparison with other members of the coronavirus-like superfamily.

Authors:  E J Snijder; M C Horzinek
Journal:  J Gen Virol       Date:  1993-11       Impact factor: 3.891

8.  Novel DNA sequence representations.

Authors:  E Hamori
Journal:  Nature       Date:  1985 Apr 18-24       Impact factor: 49.962

9.  H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences.

Authors:  E Hamori; J Ruskin
Journal:  J Biol Chem       Date:  1983-01-25       Impact factor: 5.157

10.  Genes from nine genomes are separated into their organisms in the dinucleotide composition space.

Authors:  H Nakashima; M Ota; K Nishikawa; T Ooi
Journal:  DNA Res       Date:  1998-10-30       Impact factor: 4.458
